We finished the last installment of this series by mentioning that the compiler cannot copy, load or store a vector. Today we will address this.
Spill code and copies
LLVM, like many other compiler infrastructures, uses the concept of virtual registers. Virtual registers are abstractions over the physical registers available in an architecture (note to computer architects: physical registers in compiler parlance means architectural registers, not renaming registers).
During code generation in Machine IR (MIR), LLVM can combine the concept of virtual register with that of Static Single Assignment (SSA). This is very convenient for optimisations and transformations that happen on MIR.
The number of virtual registers is unbounded. However, the number of physical registers available in a register-based architecture is finite. A process called register allocation assigns a physical register to every virtual register. Due to the finiteness of the physical registers, register allocation may run out of available physical registers. In these cases the compiler has to emit spill code, in which a physical register already in use is temporarily preserved somewhere else, typically on the stack of the program. The details are a bit more complex than this, but for now we only need to know that saving a register to memory is typically called a spill and restoring it afterwards a reload.
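As a rough intuition (this is a toy model, not LLVM code), whenever more values are simultaneously live than there are physical registers, the surplus has to live in memory:

```cpp
#include <algorithm>
#include <cassert>

// Toy model, not LLVM code: if `liveValues` values are live at the same
// program point but the target only offers `physRegs` physical registers,
// the register allocator must spill the excess values to stack slots
// (and reload each one before its next use).
int spillsNeeded(int liveValues, int physRegs) {
  return std::max(0, liveValues - physRegs);
}
```

For instance, 20 simultaneously live single precision values on a machine with 16 s registers force at least 4 spills. Real allocators are much more sophisticated (they weigh use frequency, loop depth and so on), but the finiteness argument is the same.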
There is another case where LLVM will emit spill code: calling conventions can require that registers be preserved around function calls or upon entry and exit of a function. While these saves and restores are not considered spill code in the traditional sense, LLVM handles the two cases in the same way.
Before register allocation can run, however, SSA form must be removed. The fundamental property of SSA is that every virtual register in the program is defined only once, although it can be used many times. However, we quickly run into the issue that the value of a virtual register may depend on control flow. Consider a program that uses a variable whose value may have conditionally changed in an if statement. SSA represents those converging values due to control flow using phi instructions. Phi instructions are not executable, they are an abstraction, and must be removed. Without going into much detail, when removing phi instructions the compiler introduces copies to implement the semantics of the phi. It may not be obvious, but it is precisely this insertion of copies that destroys the SSA property of the code: the same virtual register will now be defined by different instructions.
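To make those copies concrete, here is a C++ analogy of phi elimination (illustrative only; the variable names are made up). The first function mimics SSA, where each temporary is defined once and the phi selects between them; the second shows the effect of replacing the phi with a copy at the end of each predecessor, so the same variable ends up defined in two places.

```cpp
#include <cassert>

// SSA-like shape: every temporary is defined exactly once; the "phi"
// at the join point selects the definition of the branch that executed.
int select_ssa(int c, int a, int b) {
  if (c > 0) {
    int t1 = a + b;  // %t1 defined in the then-block
    return t1;       // phi incoming value from the then-block
  }
  int t2 = a * b;    // %t2 defined in the else-block
  return t2;         // phi incoming value from the else-block
}

// After phi elimination: a copy into the shared location `r` is inserted
// at the end of each predecessor block, so `r` is defined twice and the
// code is no longer in SSA form.
int select_copies(int c, int a, int b) {
  int r;
  if (c > 0)
    r = a + b;  // copy %t1 -> r
  else
    r = a * b;  // copy %t2 -> r
  return r;     // the former phi just reads r
}
```

Both functions compute the same value; only the second needs the copies the compiler must know how to emit.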
Load and store
There are a couple of functions, storeRegToStackSlot and loadRegFromStackSlot, responsible for emitting the machine instructions used to spill and reload a register, respectively. One issue here is that only one instruction is allowed, due to expectations in the spilling code used by the register allocator.
Ideally we would like to use vstmdia and vstmsia for double and floating point stores respectively, and similarly vldmdia and vldmsia for loads. However, these instructions expect a contiguous range of registers, while we allow representing vectors over non-contiguous registers that wrap around in the same vector bank (say a pair formed by d7 followed by d4). On top of that, LLVM comes with the limitation mentioned above: we must use a single instruction for the spill and for the reload.
We can address this, again, using pseudo-instructions that we will later lower in the best way possible. First we will define two pseudo-instructions, one for the store and one for the load. Now we can extend storeRegToStackSlot and loadRegFromStackSlot so they emit these instructions when we have to store or load a register of our vector register classes, such as SPRx4.
As you can see, we basically encapsulate the register being stored or loaded along with the abstraction of the stack slot, called a frame index in LLVM. Finally we make sure the memory operands are there: these are extra bits of information that LLVM can use to reason about memory accesses.
Those pseudo-instructions will need some lowering. Again, this happens during pseudo-instruction expansion.
Our strategy will be fairly simple: if we can use the multiple memory access instructions (vstmdia, vstmsia and their load counterparts), we will use them. If we cannot, we will emit a bunch of simple loads and stores.
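That strategy can be sketched for a pair of d registers as follows (an illustrative helper; the real lowering builds MachineInstrs, not assembly text):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative lowering of the spill of a d-register pair (d<first>,
// d<second>) to a stack slot addressed by `base`. A consecutive pair can
// be stored with a single vstmia; otherwise we fall back to one vstr per
// element, with the second element 8 bytes into the slot.
std::vector<std::string> lowerPairStore(int first, int second,
                                        const std::string &base) {
  std::vector<std::string> insts;
  if (second == first + 1) {
    insts.push_back("vstmia " + base + ", {d" + std::to_string(first) +
                    "-d" + std::to_string(second) + "}");
  } else {
    insts.push_back("vstr d" + std::to_string(first) + ", [" + base + "]");
    insts.push_back("vstr d" + std::to_string(second) + ", [" + base +
                    ", #8]");
  }
  return insts;
}
```

With this sketch, lowerPairStore(4, 5, "r0") yields the single instruction vstmia r0, {d4-d5}, while a wrapped pair such as (d7, d4) needs two vstr instructions.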
Note that for doubles this is enough, because those vectors only have two elements: the registers are either consecutive or they are not. For vectors of floats, though, there are more cases: a) all four registers may be consecutive (e.g. s8, s9, s10 and s11), b) three are consecutive and one is not, c) two are consecutive and the other two are consecutive as well, but not with respect to the first two. For simplicity we will ignore the partially consecutive cases b) and c), but they are not too conceptually difficult to implement.
We will see the implementation for double vectors; float vectors would be implemented in a similar way. We first check if the registers are consecutive. If they are, then we can use vstmdia. If they are not consecutive, then we need to emit explicit loads and stores using vldrd and vstrd. Some complexity comes in the form of computing the right memory operand for the new loads and stores, as we need to adjust the offset and size of the data being accessed. This is encapsulated in a convenient helper called EmitStore. One important thing to note is that vstrd instructions expect an offset operand in units of 4 bytes, hence the .addImm(Offset / 4). In the memory operand, though, the offset is specified in bytes.
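The unit mismatch is easy to get wrong, so it helps to see both numbers side by side. A hypothetical helper mirroring the adjustment described above:

```cpp
#include <cassert>

// Hypothetical helper mirroring the adjustment above: the immediate
// operand of vstrd/vldrd counts 4-byte units, while the machine memory
// operand records the offset in plain bytes.
struct StoreOperands {
  int imm;        // immediate for the instruction: byte offset / 4
  int memOffset;  // offset recorded in the memory operand: bytes
};

StoreOperands makeStoreOperands(int byteOffset) {
  return {byteOffset / 4, byteOffset};
}
```

The second double of a pair sits at byte offset 8, so the instruction gets immediate 2 while its memory operand still says 8.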
Let’s see this in action. In the last installment we said that doing a call while vectors were live caused the compiler to crash because it does not know how to spill and reload the vectors.
An example of LLVM IR that triggers this behaviour is this one:
Here we can see how the compiler explicitly spills and reloads the vectors around the call to foo. Note that the spills have as source the same pair, because the compiler does not need another pair at that point. However, the reloads do have to use different pairs, otherwise the addition (fadd) would not be possible.
It is a bit difficult to observe the use of vldrd for the non-consecutive case. One thing we can do is manually modify the Machine IR so that a vector uses a non-contiguous pair, wrapping around to d4 instead of continuing at d5. To get the Machine IR, we must stop before finishing the code generation pipeline. A location that seems convenient is right before the arm-ldst-opt pass, which is responsible for combining several memory accesses into multiple memory access instructions.
Now we modify the
test.mir file like this
The diff is a bit noisy: occurrences of the physical register $d8_d9x2 and of the physical register $d4_d5x2 (and of their component registers) are replaced with pair registers whose components are non-contiguous.
Now we can restart the code generation pipeline using the modified Machine IR file, right before arm-ldst-opt (note that llc can start right before a given pass using its -start-before option).
Now the output looks like this:
One observation we could make here is that, ideally, we should prefer not to pick these non-consecutive registers: both the spills and the reloads that involve them are costlier in number of instructions. This is something we could improve at a later point.
We mentioned above that the compiler may have to introduce copies when removing SSA form. But the backend does not know how to copy our vectors yet.
Fortunately the ARM backend is very well implemented, so adding this is relatively easy to do in copyPhysReg. The code already knows how to copy subregisters, so we only have to specify the instruction we want to use, vmovs, and how many subregisters are involved, starting from the first subregister.
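Conceptually the expansion is just a loop over subregisters. A sketch in plain C++ (the real copyPhysReg emits MachineInstrs and also has to worry about overlapping source and destination ranges):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative expansion of a copy between two vectors of s registers:
// one vmov.f32 (VMOVS) per subregister, starting from the first one.
std::vector<std::string> copySVector(int dstFirst, int srcFirst, int count) {
  std::vector<std::string> insts;
  for (int i = 0; i < count; ++i)
    insts.push_back("vmov.f32 s" + std::to_string(dstFirst + i) + ", s" +
                    std::to_string(srcFirst + i));
  return insts;
}
```

Copying a four-element float vector starting at s8 into one starting at s4 would expand to vmov.f32 s4, s8 through vmov.f32 s7, s11.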
In order to force copies to be emitted, we need to make sure our LLVM IR code has control flow that mandates this.
The following LLVM IR causes this:
Basically we check if the first parameter is larger than 4. If it is, we add the two vectors (block1). Otherwise we multiply them (block2). This is the value we store in the pointer in the final block (block3). The value, however, flows in from the two blocks, hence a phi instruction appears to merge the two incoming values. The merged value %p is the one stored.
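For reference, here is a scalar C++ rendering of what the example computes (the real example operates on two-element double vectors; the parameter names here are assumptions):

```cpp
#include <cassert>

// Scalar analogue of the LLVM IR example: add if the first parameter is
// larger than 4, multiply otherwise, and store the merged result.
void compute(int d, double a, double b, double *p) {
  double r;
  if (d > 4)
    r = a + b;  // block1: the fadd
  else
    r = a * b;  // block2: the fmul
  *p = r;       // block3: the value merged by the phi is stored
}
```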
We can look at the Machine IR right before the pass postrapseudos, which is used to expand the generic pseudo instructions used by the register allocator (ra). Those pseudos include a generic COPY instruction.
 1  body: |
 2    bb.0.block3:
 3      liveins: $r0, $r1, $r2, $r3, $d8, $d9
 4
 5      $sp = frame-setup VSTMDDB_UPD $sp, 14 /* CC::al */, $noreg, killed $d8, killed $d9
 6      frame-setup CFI_INSTRUCTION def_cfa_offset 16
 7      frame-setup CFI_INSTRUCTION offset $d9, -8
 8      frame-setup CFI_INSTRUCTION offset $d8, -16
 9      renamable $d4 = VLDRD renamable $r2, 0, 14 /* CC::al */, $noreg, implicit-def $d4_d5x2 :: (load 8 from %ir.pb)
10      renamable $d5 = VLDRD killed renamable $r2, 2, 14 /* CC::al */, $noreg, implicit killed $d4_d5x2, implicit-def $d4_d5x2 :: (load 8 from %ir.pb + 8)
11      renamable $d8 = VLDRD renamable $r1, 0, 14 /* CC::al */, $noreg, implicit-def $d8_d9x2 :: (load 8 from %ir.pa)
12      renamable $d9 = VLDRD killed renamable $r1, 2, 14 /* CC::al */, $noreg, implicit killed $d8_d9x2, implicit-def $d8_d9x2 :: (load 8 from %ir.pa + 8)
13      dead renamable $r1, dead renamable $r2 = VFPSETLEN 1, implicit-def $fpscr
14      renamable $d6_d7x2 = VMULDx2 renamable $d8_d9x2, renamable $d4_d5x2, 14 /* CC::al */, $noreg, implicit $fpscr
15      renamable $d4_d5x2 = VADDDx2 killed renamable $d8_d9x2, killed renamable $d4_d5x2, 14 /* CC::al */, $noreg, implicit $fpscr
16      CMPri killed renamable $r0, 4, 14 /* CC::al */, $noreg, implicit-def $cpsr
17      Bcc %bb.2, 11 /* CC::lt */, killed $cpsr
18
19    bb.1.select.false:
20      liveins: $r3, $d6_d7x2
21
22      renamable $d4_d5x2 = COPY killed renamable $d6_d7x2
23
24    bb.2.select.end:
25      liveins: $r3, $d4_d5x2
26
27      VSTRD renamable $d4, renamable $r3, 0, 14 /* CC::al */, $noreg :: (store 8 into %ir.pc)
28      VSTRD renamable $d5, killed renamable $r3, 2, 14 /* CC::al */, $noreg, implicit killed $d4_d5x2 :: (store 8 into %ir.pc + 8)
29      dead renamable $r0, dead renamable $r1 = VFPSETLEN 0, implicit-def $fpscr
30      $sp = frame-destroy VLDMDIA_UPD $sp, 14 /* CC::al */, $noreg, def $d8, def $d9
31      BX_RET 14 /* CC::al */, $noreg
If you look closely at the code, you can see the compiler has changed it in a way that it speculatively executes both the fadd and the fmul (look for the VMULDx2 and VADDDx2 instructions in lines 14 and 15) in the entry block (bb.0.block3, but the relevant bit is bb.0). Each operation uses a different register ($d4_d5x2 and $d6_d7x2) and one of them ($d4_d5x2) is the one that ends up being stored in the final block (bb.2.select.end, lines 27 and 28). Then the comparison happens and, if it turns out false (block bb.1.select.false, lines 19 to 22), we copy the contents of $d6_d7x2 (line 22), which contains the multiplication result, into $d4_d5x2.
If we look at the generated assembly, we see the copies we explicitly introduced.
And this is all for this installment. In the next one we will make enough changes to the compiler so we can convince the existing loop vectorizer of LLVM that it can vectorize some simple codes in VFPv2.