Fun with vectors in the Raspberry Pi 1 - Part 7
We finished the last installment of this series by mentioning that the compiler cannot yet copy, load or store a vector. Today we will address this.
Spill code and copies
LLVM, like many other compiler infrastructures, uses the concept of virtual registers. Virtual registers are abstractions over the physical registers available in an architecture (note to architects: physical registers in compiler parlance mean architectural registers, not renaming registers).
During code generation in Machine IR (MIR), LLVM can combine the concept of virtual register with that of Static Single Assignment (SSA). This is very convenient for optimisations and transformations that happen on MIR.
The number of virtual registers is unbounded. However, the number of physical registers available in a register-based architecture is finite. A process called register allocation assigns a physical register to every virtual register. Due to the finiteness of the physical registers, register allocation may run out of available physical registers. In these cases the compiler has to emit spill code, in which a physical register already in use is temporarily preserved somewhere else in memory, typically on the stack of the program. The details are a bit more complex than this, but we only need to know that saving a register to that memory is typically called a spill and restoring it afterwards is called a reload.
There is another case where LLVM will emit spill code: calling conventions can require that registers be preserved around function calls or upon entry/exit of a function. While these are not considered spill code in the traditional sense, LLVM handles both cases with the same mechanism.
Before register allocation can run, however, SSA form must be removed. The fundamental property of SSA is that every virtual register in the program is defined only once, though it can be used many times. We quickly run into the issue that the value of a virtual register may depend on control flow: consider a program that uses a variable whose value may have conditionally changed in an if statement. SSA represents those values converging due to control flow using phi instructions. Phi instructions are not executable, they are an abstraction, and they must be removed at some point. Without going into much detail, when removing phi instructions the compiler introduces copies to implement the semantics of the phi instruction. It may not be obvious, but it is precisely this insertion of copies that destroys the SSA property of the code: the same virtual register will now be defined by different instructions.
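As a tiny illustration (a hand-written sketch, not output from this series' compiler), this is how a value that is 1 or 2 depending on a condition looks in LLVM IR:

```llvm
define i32 @select_value(i1 %cond) {
entry:
  br i1 %cond, label %then, label %else

then:                                   ; on this path the value is 1
  br label %end

else:                                   ; on this path the value is 2
  br label %end

end:
  ; The phi merges the two incoming values. Removing it means placing
  ; a copy of the right value in each predecessor block.
  %x = phi i32 [ 1, %then ], [ 2, %else ]
  ret i32 %x
}
```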
Load and store
There are a couple of functions, storeRegToStackSlot and loadRegFromStackSlot, in llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp, responsible for emitting the machine instruction used to spill and reload, respectively. One issue here is that only one instruction is allowed, due to expectations in the spilling functions used by the register allocator.
Ideally we would like to use vstmdia / vstmsia for double and single-precision float stores respectively, and similarly vldmdia / vldmsia for loads. However, these instructions expect a contiguous range of registers, while we allow representing vectors over non-contiguous registers that wrap around within the same vector bank (say d7 and d4). Recall also the limitation mentioned above: we must use a single instruction for the spill and the reload.
We can address this, again, using pseudo-instructions that we will later lower in the best way possible. First we will define two pseudo-instructions, VFPSPILL and VFPRELOAD, for each vector type.
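A minimal sketch of what these definitions could look like, following the style of the existing VSTMQIA / VLDMQIA spill pseudos of the ARM backend (the operand choices here are assumptions, and only the double-vector case is shown):

```tablegen
// In ARMInstrVFP.td. DPRx2 is the register class of our 2 x double
// vectors; the SPRx4 (4 x float) definitions would be analogous.
let mayStore = 1, hasSideEffects = 0 in
def VFPSPILL : PseudoVFPLdStM<(outs), (ins DPRx2:$src, GPR:$Rn),
                              IIC_fpStore_m, "", []>;

let mayLoad = 1, hasSideEffects = 0 in
def VFPRELOAD : PseudoVFPLdStM<(outs DPRx2:$dst), (ins GPR:$Rn),
                               IIC_fpLoad_m, "", []>;
```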
Now we can extend storeRegToStackSlot and loadRegFromStackSlot so they emit these instructions when we have to store/load a register of the DPRx2 or SPRx4 register classes.
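A sketch of the store side (simplified; the reload side is symmetric and uses VFPRELOAD in loadRegFromStackSlot):

```cpp
// In ARMBaseInstrInfo::storeRegToStackSlot.
if (ARM::DPRx2RegClass.hasSubClassEq(RC)) {
  MachineFunction &MF = *MBB.getParent();
  MachineFrameInfo &MFI = MF.getFrameInfo();
  DebugLoc DL;
  if (I != MBB.end())
    DL = I->getDebugLoc();
  // Memory operand describing the stack slot being written.
  MachineMemOperand *MMO = MF.getMachineMemOperand(
      MachinePointerInfo::getFixedStack(MF, FI), MachineMemOperand::MOStore,
      MFI.getObjectSize(FI), MFI.getObjectAlign(FI));
  BuildMI(MBB, I, DL, get(ARM::VFPSPILL))
      .addReg(SrcReg, getKillRegState(isKill))
      .addFrameIndex(FI)
      .add(predOps(ARMCC::AL))
      .addMemOperand(MMO);
  return;
}
```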
As you can see, we basically encapsulate the register being spilled along with the abstraction of the stack slot, called a frame index in LLVM. Finally we make sure the memory operands are there: these are extra bits of information that LLVM can use to reason about memory accesses.
Lowering
Those pseudo-instructions will need some lowering. Again, this happens in ARMExpandPseudoInsts.cpp.
Our strategy will be fairly simple: if we can use the multiple memory access instructions vldmdia / vldmsia / vstmdia / vstmsia, we will use them. If we cannot, we will emit a bunch of simple loads and stores.
Note that for doubles this is enough, because those vectors only have two elements. For vectors of floats, though, there are more cases: a) all four registers are consecutive (e.g. s8, s9, s10, s11); b) three are consecutive (e.g. s13, s14, s15, s8); c) two are consecutive and the other two are also consecutive, but not with respect to the first two (e.g. s14, s15, s8, s9). For simplicity we will ignore cases b) and c), but they are not too difficult to implement conceptually.
We will look at the implementation for double vectors; float vectors would be handled in a similar way. We first check whether the registers are consecutive. If they are, then we can use vldmdia or vstmdia.
If they are not consecutive, then we need to emit explicit loads and stores, vldrd and vstrd. Some complexity comes in the form of computing the right memory operand for the new loads and stores, as we need to adjust the offset and the size of the data being accessed. This is encapsulated in convenient lambdas, EmitLoad and EmitStore. One important thing to note is that the vldrd and vstrd instructions expect an offset operand in units of 4 bytes, hence .addImm(Offset / 4). In the memory operand, though, the offset is specified in bytes.
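A condensed sketch of the store expansion (helper and operand details here are assumptions; the real code has more bookkeeping, and the load expansion mirrors it):

```cpp
// Expanding VFPSPILL in ARMExpandPseudo::ExpandMI (ARMExpandPseudoInsts.cpp).
case ARM::VFPSPILL: {
  Register Src = MI.getOperand(0).getReg();
  Register D0 = TRI->getSubReg(Src, ARM::dsub_0);
  Register D1 = TRI->getSubReg(Src, ARM::dsub_1);
  MachineMemOperand *MMO = *MI.memoperands_begin();

  if (TRI->getEncodingValue(D1) == TRI->getEncodingValue(D0) + 1) {
    // Consecutive pair: a single multiple memory access store.
    BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(ARM::VSTMDIA))
        .add(MI.getOperand(1)) // base
        .add(predOps(ARMCC::AL))
        .addReg(D0)
        .addReg(D1)
        .addMemOperand(MMO);
  } else {
    // Non-consecutive pair: one vstrd per element. VSTRD takes its
    // offset in units of 4 bytes; the memory operand uses bytes.
    auto EmitStore = [&](Register Reg, unsigned Offset) {
      MachineFunction &MF = *MBB.getParent();
      MachineMemOperand *ElemMMO =
          MF.getMachineMemOperand(MMO, Offset, /*Size=*/8);
      BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(ARM::VSTRD))
          .addReg(Reg)
          .add(MI.getOperand(1)) // base
          .addImm(Offset / 4)
          .add(predOps(ARMCC::AL))
          .addMemOperand(ElemMMO);
    };
    EmitStore(D0, /*Offset=*/0);
    EmitStore(D1, /*Offset=*/8);
  }
  MI.eraseFromParent();
  return true;
}
```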
Results
Let’s see this in action. In the last installment we said that doing a call while vectors were live caused the compiler to crash because it did not know how to spill and reload the vectors.
An example of LLVM IR that triggers this behaviour looks like this (a minimal sketch; the exact names are illustrative):
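```llvm
declare void @foo()

define void @test(<2 x double>* %pa, <2 x double>* %pb, <2 x double>* %pc) {
  %a = load <2 x double>, <2 x double>* %pa
  %b = load <2 x double>, <2 x double>* %pb
  ; %a and %b are live across the call, so they must be spilled
  ; before it and reloaded after it.
  call void @foo()
  %c = fadd <2 x double> %a, %b
  store <2 x double> %c, <2 x double>* %pc
  ret void
}
```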
Compiling it now works, and in the output we can see how the compiler explicitly spills and reloads the vectors around the call to foo. Note that the spills have as source the same pair d4, d5, because the compiler does not need another pair at that point. However, the reloads do have to use different pairs, otherwise the addition (vadd.f64) would not be possible.
It is a bit difficult to observe the use of vstrd and vldrd for the non-consecutive case. One thing we can do is manually modify the Machine IR so it uses d7 and d4 instead of d4, d5. To get the Machine IR, we must stop before finishing the code generation pipeline. A convenient location is right before the arm-ldst-opt pass, which is responsible for optimising several memory accesses into multiple memory accesses.
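For instance (the exact target flags may differ from the ones used in this series):

```
$ llc -mtriple=armv6-unknown-linux-gnueabihf -stop-before=arm-ldst-opt \
      test.ll -o test.mir
```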
Now we modify the test.mir file. The change is a bit noisy: occurrences of the physical register $d6_d7x2 are replaced with the physical register $d8_d9x2, and occurrences of the physical register $d4_d5x2 are replaced with the physical register $d7_d4x2. This last register is non-contiguous.
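Something like the following does the trick:

```
$ sed -i -e 's/\$d6_d7x2/$d8_d9x2/g' -e 's/\$d4_d5x2/$d7_d4x2/g' test.mir
```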
Now we can restart the code generation pipeline using the modified Machine IR file, right before arm-ldst-opt (note the -start-before flag).
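Again, the flags here are illustrative:

```
$ llc -mtriple=armv6-unknown-linux-gnueabihf -start-before=arm-ldst-opt \
      test.mir -o test.s
```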
Now the spill and reload of the non-contiguous pair show up in the output as individual vstr and vldr instructions, rather than a single multiple memory access instruction.
One observation that we could make here is that, ideally, we should prefer not to pick these non-consecutive registers: both the spills and the reloads that involve them are costlier in number of instructions. This is something we could influence in ARMRegisterInfo.td.
Copies
We mentioned above that the compiler may have to introduce copies when removing SSA form. But the backend does not know how to copy our vectors yet.
Fortunately the ARM backend is very well implemented, so adding this is relatively easy to do in ARMBaseInstrInfo.cpp. The code already knows how to copy subregisters, so we only specify the instruction we want to use, vmovd or vmovs, and how many subregisters are involved starting from the first subregister.
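A sketch of the relevant addition to ARMBaseInstrInfo::copyPhysReg (variable names follow the existing code in that function, which already contains a generic loop that copies register tuples subregister by subregister):

```cpp
// Opcode used to copy one subregister, index of the first subregister,
// and how many subregisters make up the vector.
unsigned Opc = 0, BeginIdx = 0, SubRegs = 0;
if (ARM::DPRx2RegClass.contains(DestReg, SrcReg)) {
  Opc = ARM::VMOVD;   // each double is copied with vmov.f64
  BeginIdx = ARM::dsub_0;
  SubRegs = 2;
} else if (ARM::SPRx4RegClass.contains(DestReg, SrcReg)) {
  Opc = ARM::VMOVS;   // each float is copied with vmov.f32
  BeginIdx = ARM::ssub_0;
  SubRegs = 4;
}
// The generic code that follows in copyPhysReg emits SubRegs copies
// using Opc, choosing an order that never clobbers a source
// subregister before it has been read.
```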
Results
In order to force copies to be emitted, we need to make sure our LLVM IR code has control flow that mandates them.
The following LLVM IR causes this (again a sketch, with names matching the description below):
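```llvm
define void @test(i32 %dis, <2 x double>* %pa, <2 x double>* %pb,
                  <2 x double>* %pc) {
entry:
  %a = load <2 x double>, <2 x double>* %pa
  %b = load <2 x double>, <2 x double>* %pb
  %cmp = icmp sgt i32 %dis, 4
  br i1 %cmp, label %block1, label %block2

block1:                                 ; %dis > 4: add the vectors
  %add = fadd <2 x double> %a, %b
  br label %block3

block2:                                 ; otherwise: multiply them
  %mul = fmul <2 x double> %a, %b
  br label %block3

block3:                                 ; merge the two incoming values
  %p = phi <2 x double> [ %add, %block1 ], [ %mul, %block2 ]
  store <2 x double> %p, <2 x double>* %pc
  ret void
}
```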
Basically we check if the first parameter %dis is larger than 4. If it is, we add vectors %a and %b (block1). Otherwise we multiply them (block2). The result is the value we store in the pointer %pc (block3). The value, however, flows in from two different blocks, hence a phi instruction appears to merge the two incoming values. The merged value %p is the one stored.
We can look at the Machine IR right before the postrapseudos pass, which is used to expand the generic pseudo-instructions used by the register allocator (ra). Those pseudos include the generic COPY instruction.
```
 1  body:             |
 2    bb.0.block3:
 3      liveins: $r0, $r1, $r2, $r3, $d8, $d9
 4
 5      $sp = frame-setup VSTMDDB_UPD $sp, 14 /* CC::al */, $noreg, killed $d8, killed $d9
 6      frame-setup CFI_INSTRUCTION def_cfa_offset 16
 7      frame-setup CFI_INSTRUCTION offset $d9, -8
 8      frame-setup CFI_INSTRUCTION offset $d8, -16
 9      renamable $d4 = VLDRD renamable $r2, 0, 14 /* CC::al */, $noreg, implicit-def $d4_d5x2 :: (load 8 from %ir.pb)
10      renamable $d5 = VLDRD killed renamable $r2, 2, 14 /* CC::al */, $noreg, implicit killed $d4_d5x2, implicit-def $d4_d5x2 :: (load 8 from %ir.pb + 8)
11      renamable $d8 = VLDRD renamable $r1, 0, 14 /* CC::al */, $noreg, implicit-def $d8_d9x2 :: (load 8 from %ir.pa)
12      renamable $d9 = VLDRD killed renamable $r1, 2, 14 /* CC::al */, $noreg, implicit killed $d8_d9x2, implicit-def $d8_d9x2 :: (load 8 from %ir.pa + 8)
13      dead renamable $r1, dead renamable $r2 = VFPSETLEN 1, implicit-def $fpscr
14      renamable $d6_d7x2 = VMULDx2 renamable $d8_d9x2, renamable $d4_d5x2, 14 /* CC::al */, $noreg, implicit $fpscr
15      renamable $d4_d5x2 = VADDDx2 killed renamable $d8_d9x2, killed renamable $d4_d5x2, 14 /* CC::al */, $noreg, implicit $fpscr
16      CMPri killed renamable $r0, 4, 14 /* CC::al */, $noreg, implicit-def $cpsr
17      Bcc %bb.2, 11 /* CC::lt */, killed $cpsr
18
19    bb.1.select.false:
20      liveins: $r3, $d6_d7x2
21
22      renamable $d4_d5x2 = COPY killed renamable $d6_d7x2
23
24    bb.2.select.end:
25      liveins: $r3, $d4_d5x2
26
27      VSTRD renamable $d4, renamable $r3, 0, 14 /* CC::al */, $noreg :: (store 8 into %ir.pc)
28      VSTRD renamable $d5, killed renamable $r3, 2, 14 /* CC::al */, $noreg, implicit killed $d4_d5x2 :: (store 8 into %ir.pc + 8)
29      dead renamable $r0, dead renamable $r1 = VFPSETLEN 0, implicit-def $fpscr
30      $sp = frame-destroy VLDMDIA_UPD $sp, 14 /* CC::al */, $noreg, def $d8, def $d9
31      BX_RET 14 /* CC::al */, $noreg
```
If you look closely at the code, you can see the compiler has changed it in a way that speculatively executes both the fmul and the fadd (the VMULDx2 and VADDDx2 instructions in lines 14 and 15) in the entry block (shown as bb.0.block3, but the relevant bit is bb.0). Each operation uses a different register ($d6_d7x2 and $d4_d5x2), and one of them ($d4_d5x2) is the one that ends up being stored in the final block (bb.2.select.end, lines 27 and 28).
Then the comparison happens and, if it turns out false (block bb.1.select.false, lines 19 to 22), we copy the contents of $d6_d7x2, which contains the multiplication result, into $d4_d5x2 (line 22).
If we look at the generated assembly, we can see the copies we explicitly introduced, emitted as vmov.f64 instructions.
And this is all for this installment. In the next one we will make enough changes to the compiler to convince the existing loop vectorizer of LLVM that it can vectorize some simple code in VFPv2.