Under LJ_GC64, RID_DISPATCH is removed from the pool of available general purpose
registers, and instead retains its role as a pointer to the dispatch table
throughout JIT code. This guarantees that members of the global_State and
the jit_State can always be encoded in modrm. If the memory allocator is
kind, it also allows for various KGC and KPTR values to be encoded as
32-bit offsets from RID_DISPATCH. Likewise, when SSE instructions want to
use a KNUM as a memory operand, it often transpires that the address of
the KNUM's 64-bit payload can be expressed as a 32-bit offset from
RID_DISPATCH.
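As a rough illustration of what "can be expressed as a 32-bit offset" means, the following sketch tests whether the signed distance between a base address and a target fits in a disp32. The name fits_disp32 is hypothetical and is not taken from the LuaJIT sources; it only models the arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper (not a LuaJIT function): returns 1 if `target` can be
** addressed as [base + disp32], i.e. if the signed distance from `base`
** to `target` fits in a 32-bit displacement. */
static int fits_disp32(uint64_t base, uint64_t target)
{
  /* The subtraction wraps modulo 2^64; the cast reinterprets the result
  ** as a signed distance (two's complement assumed). */
  int64_t ofs = (int64_t)(target - base);
  return ofs >= INT32_MIN && ofs <= INT32_MAX;
}
```

With the dispatch table address as `base`, a constant is encodable relative to RID_DISPATCH only when this test passes; whether it passes depends on where the allocator happened to place the constant.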
In some cases the recording logic has been tweaked to encode constants
as relative to RID_DISPATCH instead of as absolute addresses. This is done
via calls to lj_ir_ggfload.
LJ_GC64 also introduces a new pseudo-register: RID_RIP. If the memory
allocator isn't kind enough to put things within a 32-bit range of the
dispatch table, it is sometimes kind enough to instead put things within a
32-bit range of the mcode pointer. Furthermore, for constants which we
want (or need) to be loaded via memory operands, the constant's payload can be
copied to the low part of an mcode region, at which point it is guaranteed
to be representable as a RIP-relative operand. Fused loads can result in
an mrm referencing RID_RIP. In such cases, the fusing is only valid for
the next emitted instruction - though as a special case, one asm_guardcc call is
permitted between the fusing and the instruction into which the fusion
result is inserted.
TValue detagging is notable under LJ_GC64. The basic code pattern is:
  mov r64, [addr]
  ror r64, 47
  cmp r16, itype
  jnz ->exit
  shr r64, 17
If BMI2 is available, mov/ror are fused into a single rorx. If BMI2 isn't
available, and a type test isn't required, ror47 becomes shl17 (and the
cmp/jnz are dropped). The type test is interesting as it only considers 16
bits of tag, despite the TValues in question nominally consisting of 47
bits of pointer and 17 bits of tag. The 16 considered bits are sufficient
to verify that the TValue is a NaN (11 bits), is a QNaN (1 bit), and has
the correct itype (4 bits). The one unconsidered bit is the sign bit of
the NaN. LuaJIT operates under the assumption that all NaNs in the system
are either canonical NaNs (as generated by the FPU) or are NaN-packed
TValues. In both cases, the sign bit of the NaN is set, and therefore does
not need to be verified during detagging. The cmp instruction encodes the
itype as an imm8, thus avoiding the length-changing prefix (LCP) stall that
an imm16 would incur. False LCP stalls are still an issue, and could be
trivially worked around by sometimes inserting an extra nop instruction, but
this could break loop realignment: the realigned code might be one byte
larger or one byte smaller, whereas loop realignment operates under the
assumption that a sequence of emitted instructions always occupies the
same number of bytes, regardless of where it is emitted [1].
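The detagging pattern above can be modelled in portable C to show why a 16-bit compare suffices and why the sign bit goes unchecked. This is only a sketch of the bit manipulation: the names ror64, tag_value and detag are hypothetical, and the itype passed in is assumed to be a small negative integer that sign-extends to the 17-bit tag.

```c
#include <stdint.h>

/* Rotate right; n must be in 1..63. */
static uint64_t ror64(uint64_t x, unsigned n)
{
  return (x >> n) | (x << (64 - n));
}

/* Hypothetical: build a tagged value from a small negative itype and a
** 47-bit payload (17 bits of tag above 47 bits of pointer). */
static uint64_t tag_value(int itype, uint64_t payload47)
{
  return ((uint64_t)(int64_t)itype << 47) | (payload47 & ((1ULL << 47) - 1));
}

/* Models: ror r64, 47 / cmp r16, itype / jnz ->exit / shr r64, 17.
** Returns 1 and stores the payload on a tag match, 0 otherwise. */
static int detag(uint64_t tv, int itype, uint64_t *payload)
{
  uint64_t r = ror64(tv, 47);        /* tag in bits 16..0, payload in 63..17 */
  if ((uint16_t)r != (uint16_t)itype) /* compares bits 62..47 of tv only */
    return 0;                         /* the sign bit (now bit 16) is ignored */
  *payload = r >> 17;                 /* drop the tag, keep the 47-bit payload */
  return 1;
}
```

In this model, flipping bit 63 of a correctly tagged value does not change the outcome of detag, which is exactly the unconsidered bit described above, while an ordinary double such as 1.0 fails the 16-bit comparison.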
[1] This assumption also makes RIP-relative operands even more
slippery. A priori, the realigned code might be able to reach things it
previously couldn't, or conversely not reach things it previously could.
To prevent this from happening, checki32/mcpofs is paired with
checki32/mctopofs: if a given address is reachable with a 32-bit
displacement from both of these points, then it'll also be reachable with
a 32-bit displacement from a realigned mcp.
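The paired check can be sketched as follows. Since the displacement to a fixed target changes monotonically with the base address, a target that is within 32-bit range of both ends of an interval is within range of every address inside it. The names in_i32_range and reachable_from_both are hypothetical, and the sketch ignores the detail that a real RIP-relative displacement is measured from the end of the instruction.

```c
#include <stdint.h>

/* Hypothetical helper: does a signed distance fit in 32 bits? */
static int in_i32_range(int64_t ofs)
{
  return ofs >= INT32_MIN && ofs <= INT32_MAX;
}

/* Hypothetical model of the paired check: `lo` and `hi` stand in for the
** two ends of the mcode range that a realigned mcp can fall between.
** Returns 1 only if `target` is within disp32 range of both ends. */
static int reachable_from_both(uint64_t lo, uint64_t hi, uint64_t target)
{
  return in_i32_range((int64_t)(target - lo)) &&
         in_i32_range((int64_t)(target - hi));
}
```

A target that passes for only one end is rejected, because moving the code by even one byte could push it out of range.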
The interesting changes here revolve around slots marked as TREF_FRAME /
TREF_CONT. Under !LJ_FR2, said slots contain two 32-bit values, and the
TRef for the slot primarily relates to the low 32 bits. In a snapshot, the
main SnapEntry relates to the low 32 bits, and the framelink from the
snapshot is used to restore the high 32 bits. Under LJ_FR2, TREF_FRAME /
TREF_CONT slots contain a single 64-bit value. The TRef relates to all 64
bits, the SnapEntry is used to restore all 64 bits, and no framelinks are
required to restore the slot. Restoration is done via IR_KNUM constants,
as the 64-bit values in question can be happily interpreted as denormal
numbers. These constants are created lazily: the slots in question get set
to just TREF_FRAME / TREF_CONT initially, and then if required for a
snapshot, the ref part of the TRef is changed from zero to the index of a
KNUM. Slot 1 is always zero, as although it is technically a frame link,
it never needs to be changed or saved or restored.
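To see why such 64-bit values can be carried as number constants, the sketch below reinterprets a 64-bit integer as a double and checks that it is denormal. It assumes the value is nonzero and small enough that the exponent field (bits 62..52) is zero, as is the case for a 47-bit address; the function names are hypothetical.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a 64-bit integer as a double, and back. */
static double u64_as_double(uint64_t u)
{
  double d;
  memcpy(&d, &u, sizeof(d));
  return d;
}

static uint64_t double_as_u64(double d)
{
  uint64_t u;
  memcpy(&u, &d, sizeof(u));
  return u;
}

/* A positive double smaller than DBL_MIN (the smallest normal number)
** is denormal. Assumes denormals are not flushed to zero. */
static int is_positive_denormal(uint64_t bits)
{
  double d = u64_as_double(bits);
  return d > 0.0 && d < DBL_MIN;
}
```

The round trip through double preserves every bit, so storing the value as a KNUM and later writing it back to the slot loses nothing.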
Though the framelink part of a snapshot isn't required for slot
restoration under LJ_FR2, it is still used for restoring PC. As such,
every snapshot has exactly two framelink entries, which are used to store
a 64-bit value.
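Storing one 64-bit value across two 32-bit entries might look like the sketch below. The entry type is modelled as a plain uint32_t, and both the low-half-first ordering and the function names are assumptions made for illustration, not a description of the actual snapshot layout.

```c
#include <stdint.h>

/* Hypothetical: split a 64-bit value across two 32-bit entries.
** Low half first is an assumption of this sketch. */
static void store_u64_pair(uint32_t out[2], uint64_t v)
{
  out[0] = (uint32_t)v;          /* low 32 bits */
  out[1] = (uint32_t)(v >> 32);  /* high 32 bits */
}

/* Hypothetical: reassemble the 64-bit value from the two entries. */
static uint64_t load_u64_pair(const uint32_t in[2])
{
  return (uint64_t)in[0] | ((uint64_t)in[1] << 32);
}
```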
Manipulations of J->maxslot are more interesting under LJ_FR2. For
example, the BC_MOV of a method call can introduce a three-slot gap under
LJ_FR2, whereas it could only introduce a one-slot gap under !LJ_FR2.
Other instructions can now introduce a one-slot gap where previously they
wouldn't ever introduce a gap.
Use a mix of linear probing and pseudo-random probing.
Workaround for 1GB MAP_32BIT limit on Linux/x64. Now 2GB with !LJ_GC64.
Enforce 128TB LJ_GC64 limit for > 47 bit memory layouts (ARM64).