Fun with vectors in the Raspberry Pi 1 - Part 3
In the last chapter we modelled the storage in the form of pairs and quadruples of registers that we will use for vectors of double and single precision, respectively.
But before we can do anything, we need to deal with fpscr.
The way to machine instructions
LLVM is famously known for its intermediate representation (IR), called LLVM IR. Compilers, however, and LLVM is no exception, cannot do their job with just a single intermediate representation. The reason is that different steps in the compilation pipeline have different requirements, and no single IR can cater to all of them at the same time.
SelectionDAG
So, during code generation LLVM goes through two intermediate representations. LLVM IR is lowered, one basic block at a time, into a graph representation called SelectionDAG. It is called SelectionDAG because its goal is to do one of the main tasks of any backend: instruction selection.
Instruction selection can be understood as taking a directed acyclic graph (DAG) or a tree (which is a restricted form of DAG) and “tiling” it. That is, we group several connected nodes and replace them with one or more new nodes. Those new nodes correspond to actual instructions of the machine we target.
In the context of LLVM, instruction selection goes from what is called the input DAG, built using the information from LLVM IR, to what is called the output DAG, where the nodes are instructions. One important difference between the two is that the input DAG operates at the level of machine types (a finite set of types, many of which have equivalent LLVM IR types) while the output DAG operates at the level of register classes.
Instruction selection completes when the output DAG is linearised, that is, when a schedule is determined for it. At that point the nodes of the output DAG are converted into the second intermediate representation used by LLVM in code generation.
Machine IR
This second intermediate representation is called Machine IR or MIR (unrelated to Rust's MIR). MIR is a more conventional representation where each machine function is a graph of machine basic blocks and each (machine) basic block is a sequence of machine instructions. One way to understand machine instructions is to look at them as containers of operands along with an operation code (or opcode). Machine instructions can have different kinds of operands, but commonly they are registers (of some specific register class) or immediates (i.e. a constant like 42).
Operands of machine instructions can be explicit or implicit. Explicit operands are those encoded as part of the instruction, and they can be inputs or outputs; they are the common ones and what we intuitively understand by operands. Implicit operands are inputs or outputs of the instruction that are not explicitly encoded in it.
For instance, in an architecture like Arm, conditional branches use (read) the cpsr register (later renamed apsr in Armv7-A) that has been defined (written) earlier, usually by a comparison. That register is an implicit operand of those instructions because it is not encoded in the instruction itself. Instead, instructions only encode things like the operands being compared or, for the branch instruction, the target of the branch (where we jump to), the branching condition (when we have to jump), etc.
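As an illustration, a compare-and-branch pair in Arm assembly might look like this (the label is hypothetical):

```asm
cmp r0, r1   @ explicit operands: r0, r1; implicitly writes cpsr
beq done     @ explicit operand: the branch target; implicitly reads cpsr
```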
Machine instructions need to state which operands they use and define. If they fail to do so, later scheduling passes that operate on MIR may reorder instructions and break the semantics of the represented code.
Virtual registers
Most compilation infrastructures use the concept of virtual register. During compilation the compiler assumes that there is an infinite number of registers. Those registers are called virtual and belong to a register class. This largely simplifies code generation, particularly because LLVM favours using Static Single Assignment (SSA). Under SSA, virtual registers are defined only once (i.e. assigned a value) and can be used many times. This constrained form is very beneficial for analysis because it removes the ambiguity of knowing what the last update to a register was.
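LLVM IR itself is SSA-based, so it illustrates the idea well: in a snippet like the following, each value is defined exactly once.

```llvm
define i32 @square_plus_one(i32 %x) {
  ; %t is defined exactly once here...
  %t = mul i32 %x, %x
  ; ...and may be used any number of times afterwards.
  %r = add i32 %t, 1
  ret i32 %r
}
```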
The opposite of a virtual register is a physical register (in compilers, physical registers correspond to what computer architecture calls architectural registers). Physical registers are not subject to the SSA regime: they can be redefined many times. This makes analysing them a bit harder.
Virtual registers do not exist in CPUs, so a process called register allocation assigns physical registers in place of virtual registers. This is an effective approach because the life spans of virtual registers are often very short, which means that, in general, not many physical registers are needed at the same time. When there are not enough physical registers, however, register allocation uses memory to temporarily store a value held in a register and retrieve it later. The store operation is commonly known as a spill and the later retrieval is known as a reload.
Machine instructions in MIR can have both virtual register operands and physical register operands. After register allocation, no virtual register remains in the machine function. This is largely true, but some very late code generation steps may be easier to implement using virtual registers; LLVM provides a simplified mechanism to assign those virtual registers to physical registers without involving a full register allocation process.
fpscr is a physical register, so unfortunately we will not be able to benefit from the SSA advantages that virtual registers enjoy.
Approach
The approach I chose is similar to, if simpler than, the one used to support vectors in the RISC-V backend of LLVM. By using pseudo instructions we define vector operations and handle them as needed.
For instance, an instruction like vadd.f64 is represented in the ARM backend of LLVM by a machine instruction with opcode VADDD. VADDD is currently used for scalar operations. Our idea is to introduce a pseudo instruction called VADDDx2. These pseudo instructions will use the register class DPRx2 that we defined already. Similarly for vadd.f32: the instruction is called VADDS and we will define a pseudo instruction VADDSx4 that uses the SPRx4 register class. We will repeat this for the 12 instructions (in their two variants, for double and single precision) that honour the len field of fpscr.
By using pseudo instructions we can make SelectionDAG select them. For instance, an LLVM IR fadd of two <2 x double> values will be selected using the VADDDx2 instruction.
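A minimal sketch of such IR (the function name is illustrative):

```llvm
define <2 x double> @vec_add(<2 x double> %a, <2 x double> %b) {
  ; A fadd of two <2 x double> values: a candidate for VADDDx2.
  %c = fadd <2 x double> %a, %b
  ret <2 x double> %c
}
```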
A pseudo instruction is no different from a regular instruction for the purpose of code generation. The difference from actual instructions is that the target does not really have such an instruction, so at some point we will need to expand it into real instructions. This happens at a later stage, after register allocation, and will be part of a later installment in this series.
Tracking fpscr
Currently the ARM backend assumes that fpscr does not have to be tracked for scalar operations. Technically they do depend on the rounding mode in fpscr, but this part is not very well defined in LLVM (only recently have constrained floating point operations been introduced).
However, if we plan to use VADDD and VADDDx2 in the same code, we need to make sure the len field of fpscr has the right value for each instruction. The easiest way to do that is to set len to the required value right before every such instruction.

In order to change len we will add a pseudo instruction called VFPSETLEN.
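A sketch of what the definition looks like; the attributes follow the description below, but the exact original listing may differ slightly:

```tablegen
let Defs = [FPSCR], hasNoSchedulingInfo = 1, mayLoad = 0, mayStore = 0,
    hasSideEffects = 0 in
def VFPSETLEN : PseudoInst<(outs GPR:$scratch1, GPRnopc:$scratch2),
                           (ins imm0_7:$len), IIC_fpSTAT, []>,
                Requires<[HasVFP2]>;
```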
This tablegen definition of a pseudo instruction has a lot of information.

First, the instruction has one input called $len. This input is an immediate that ranges from 0 to 7 (3 bits). This is defined by the operand kind specifier imm0_7, which is conveniently already defined in the ARM backend.

Second, the instruction has two register outputs, $scratch1 and $scratch2. We will need those later on, when we expand this pseudo instruction into the actual instructions that set the len field of fpscr. Note the different register classes GPR and GPRnopc. The latter does not allow the pc register (r15 in AArch32). We do this because the instructions used in the expansion also have this restriction on the way they use $scratch2. These two output registers will initially be virtual registers, and register allocation will later assign two physical registers to them. This ensures we can expand the pseudo instruction safely.

Finally, IIC_fpSTAT is an instruction itinerary used for scheduling, which we can ignore for now. We also specify that this is only for VFPv2, which we do using Requires<[HasVFP2]>.
Tablegen allows pushing attributes into definitions using the let syntax above the definition. This is useful when we want a number of definitions to share the same attribute values, as it avoids having to specify them in each definition.
A first attribute is Defs = [FPSCR], which means this instruction defines (writes) the register fpscr. When instruction selection finally creates the machine instructions (of MIR) it will add an implicit defining operand for fpscr.
The other attributes are used to state that this instruction does not have scheduling info, does not load memory, does not store memory and does not have any further unmodelled side-effects.
It should be possible now to mark VADDD and VADDDx2 with Uses = [FPSCR], so that after instruction selection they have an implicit use of the register fpscr. However, if we did that, the instructions would be reading an fpscr that nobody wrote, which is a bit odd. We will do something slightly different.
How to configure fpscr
It is possible to mark instructions (including pseudo instructions) with the attribute usesCustomInserter. When doing this, an extra callback is invoked for each such instruction, right before instruction selection finishes. This allows the compiler to tweak or change the result of instruction selection.
The idea is that for every instruction like VADDD, VADDDx2, VADDSx4, …, the custom inserter will prepend an appropriate VFPSETLEN and will mark the instruction as implicitly using fpscr.
We mentioned that there are 12 instructions in VFPv2 that honour the vector length. Because LLVM splits those between single precision and double precision, there are 24 machine instructions, and because we are adding the x2 and x4 forms, we end up with 48 cases. This is perhaps too many to handle individually, so one option is to create a table that gives us information about each instruction. There is a tablegen backend in LLVM to create generic searchable tables.
Generic table
First we need to specify what the elements of that table will be.
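A sketch of the element class, with field names matching the description below (the exact original definition may differ):

```tablegen
class VFPPseudo {
  // The instruction this row describes; used as the lookup key.
  Instruction Pseudo = !cast<Instruction>(NAME);
  // The scalar instruction this pseudo is based on.
  Instruction BaseInstr;
  // Encoded as in the len field of fpscr (vector length minus one).
  bits<3> VectorLength;
}
```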
Any tablegen definition of this class (including those obtained through inheritance) will be part of the table. Each element of the table contains 3 fields. The first field is the Pseudo instruction and will be used for lookups in the table. Its name is a bit of a misnomer, especially for the existing scalar instructions such as VADDD and VADDS, but I think we can live with this for now.
The second field, BaseInstr, links each pseudo instruction with its base one. That is, VADDDx2 and VADDSx4 will be linked to VADDD and VADDS, respectively.
A final field, VectorLength, is encoded in the same way as the len field of fpscr.
Now that we know what every element of the table will contain, we can define the table itself.
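A sketch of that table definition (the CppTypeName and PrimaryKeyName values are illustrative; the generic searchable-table support lives in llvm/TableGen/SearchableTable.td):

```tablegen
def VFPPseudosTable : GenericTable {
  let FilterClass = "VFPPseudo";
  let CppTypeName = "VFPPseudoInfo";
  let Fields = ["Pseudo", "BaseInstr", "VectorLength"];
  let PrimaryKey = ["Pseudo"];
  let PrimaryKeyName = "getVFPPseudoInfo";
  let PrimaryKeyEarlyOut = true;
}
```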
The first field, FilterClass, is the class whose definitions will make up the table. This is the VFPPseudo we defined above. Field CppTypeName will be used as the name of the class generated in C++. This class will have a number of fields, which we specify in the Fields attribute. We specify that Pseudo will be used as the primary key, that is, used to look up elements in the table (the term primary key comes from databases). Field PrimaryKeyName is the name of the method we will use on the table to retrieve an element. This function will receive the values of the PrimaryKey for the lookup: in this case, the opcode of an instruction. Finally, the field PrimaryKeyEarlyOut enables a small optimisation when doing lookups in large spaces like the set of instructions.
OK. How do we use this? As mentioned earlier, any class that inherits from this one will automatically be part of the table. So we will first label all the scalar operations. Before that, let's create a convenience subclass for them.
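A sketch of such a convenience subclass, following the description below:

```tablegen
class VFPScalarPseudo : VFPPseudo {
  // The base instruction of a scalar operation is the operation itself.
  let BaseInstr = !cast<Instruction>(NAME);
  // 0b000 in the len field encodes a vector length of 1 (i.e. scalar).
  let VectorLength = 0b000;
}
```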
This subclass hardcodes some values for scalar operations: the VectorLength will always be 0b000, which encodes a length of 1 in the len field, and the base instruction will be the instruction itself. NAME is a special identifier which evaluates to the string of the name of the current tablegen definition. The cast is needed to convert that string into the Instruction definition identifier.
With this we can now annotate the existing operations. I will show only VADDD and VADDS here, but this can be extended to the other ones. Note that we need to make sure usesCustomInserter is set to 1, so the hook is invoked for these instructions.
In order to make the generated table available to C++ we still need to make some changes in the files llvm/lib/Target/ARM/Utils/ARMBaseInfo.cpp and llvm/lib/Target/ARM/Utils/ARMBaseInfo.h. This makes sure the tablegen-generated C++ is included appropriately.
The pseudo instructions
OK, so now we have annotated our existing scalar instructions with a VectorLength encoding a length of 1. Now is a good moment to introduce our pseudo instructions. Again, I'll show only VADDDx2 and VADDSx4.
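As a sketch (the itineraries and the exact instruction format are assumptions; the register classes and the pattern follow the discussion below):

```tablegen
let usesCustomInserter = 1 in {
  def VADDDx2 : PseudoInst<(outs DPRx2:$Dd), (ins DPRx2:$Dn, DPRx2:$Dm),
                           IIC_fpALU64,
                           [(set DPRx2:$Dd, (fadd DPRx2:$Dn, (v2f64 DPRx2:$Dm)))]>,
                VFPPseudo {
    let BaseInstr = VADDD;
    let VectorLength = 0b001; // len encoding for a vector length of 2
  }
  def VADDSx4 : PseudoInst<(outs SPRx4:$Sd), (ins SPRx4:$Sn, SPRx4:$Sm),
                           IIC_fpALU32,
                           [(set SPRx4:$Sd, (fadd SPRx4:$Sn, SPRx4:$Sm))]>,
                VFPPseudo {
    let BaseInstr = VADDS;
    let VectorLength = 0b011; // len encoding for a vector length of 4
  }
}
```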
If you look closely you will see that these are almost identical to their scalar counterparts. That is intentional: we want these machine instructions to be structurally identical to their scalar counterparts. The only differences are in the register classes (DPR is now DPRx2 and SPR is now SPRx4).
These definitions, like the scalar ones, include what is called a pattern for instruction selection. For instance, the one for VADDDx2 includes this pattern.
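A sketch of that pattern, using the operand names described below:

```tablegen
[(set DPRx2:$Dd, (fadd DPRx2:$Dn, (v2f64 DPRx2:$Dm)))]
```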
This basically says that we can match an input selection DAG node with opcode fadd that receives two input registers, $Dn and $Dm, and set a register $Dd with the current instruction. DPRx2 is the register class used for those registers. v2f64 is a type and it must be representable by DPRx2; if you recall, we arranged for this in the previous installment when we created the register class.
Tablegen can in general infer the types of the operands of a pattern, so specifying v2f64 in the pattern fragment (v2f64 DPRx2:$Dm) is redundant. However, tablegen cannot infer the types if the involved register class allows more than one machine type. In the ARM backend the DPR register class contains more than one machine type, hence the pattern needs to specify at least one type (the rest of the types can be inferred from the properties of the fadd SelectionDAG node).
If you check the definition of VADDS above you will see that its pattern does not include a type (in contrast to the one for VADDD): the reason is that the register class SPR only includes the machine type f32.
I decided to leave at least one type for clarity as a balanced alternative to cluttering everything with types.
Custom inserter
Now we are in a situation in which we can finally use the custom inserter to insert the VFPSETLEN instruction.
We can do that in the file llvm/lib/Target/ARM/ARMISelLowering.cpp. The hook called per instruction is EmitInstrWithCustomInserter. In it we will check whether the MachineInstr is included in the table of VFPPseudos we defined above.
We will query the VFPPseudos table. If an instruction appears there it means we need to make sure fpscr has the len field correctly set; this is what the VFPSETLEN instruction is for.
We basically create two virtual registers of the appropriate classes for VFPSETLEN and then build the machine instruction corresponding to VFPSETLEN itself. Note how we mark the scratch registers as dead: nobody will use their values after this instruction.
We finally insert an implicit operand into the current instruction (not into VFPSETLEN but into VADDD, VADDS, VADDDx2, VADDSx4, etc.) in order to signify that this instruction now uses fpscr.
At this point we are still not able to lower vector operations because we are still missing a few bits, which we will see in the next installments. However, we can already observe the impact of our changes on scalar code.
The following simple IR function can be used to observe the effect of the custom inserter.
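The function is the following (it can also be seen embedded in the MIR output below):

```llvm
define void @test_scalar(double* %pa, double* %pb, double* %pc) #0 {
  %a = load double, double* %pa, align 8
  %b = load double, double* %pb, align 8
  %c = fadd double %a, %b
  store double %c, double* %pc, align 8
  ret void
}

attributes #0 = { "target-features"="+vfp2" }
```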
We can use llc, the code generator of LLVM, to inspect the machine instructions. We have to stop right after instruction selection (isel) finishes, because we have not yet implemented how VFPSETLEN is lowered, and going further would cause llc to crash.
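An invocation along these lines should do (the exact name of the pass to stop after may vary between LLVM versions):

```
llc -mtriple=armv6kz-unknown-linux-gnu -stop-after=finalize-isel scalar.ll -o scalar.mir
```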
```mir
--- |
  ; ModuleID = 'scalar.ll'
  source_filename = "scalar.ll"
  target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
  target triple = "armv6kz-unknown-linux-gnu"

  define void @test_scalar(double* %pa, double* %pb, double* %pc) #0 {
    %a = load double, double* %pa, align 8
    %b = load double, double* %pb, align 8
    %c = fadd double %a, %b
    store double %c, double* %pc, align 8
    ret void
  }

  attributes #0 = { "target-features"="+vfp2" }
...
---
name: test_scalar
alignment: 4
tracksRegLiveness: true
registers:
  - { id: 0, class: gpr }
  - { id: 1, class: gpr }
  - { id: 2, class: gpr }
  - { id: 3, class: dpr }
  - { id: 4, class: dpr }
  - { id: 5, class: dpr }
  - { id: 6, class: gpr }
  - { id: 7, class: gprnopc }
liveins:
  - { reg: '$r0', virtual-reg: '%0' }
  - { reg: '$r1', virtual-reg: '%1' }
  - { reg: '$r2', virtual-reg: '%2' }
frameInfo:
  maxAlignment: 1
  maxCallFrameSize: 0
machineFunctionInfo: {}
body: |
  bb.0 (%ir-block.0):
    liveins: $r0, $r1, $r2

    %2:gpr = COPY $r2
    %1:gpr = COPY $r1
    %0:gpr = COPY $r0
    %3:dpr = VLDRD %1, 0, 14 /* CC::al */, $noreg :: (load 8 from %ir.pb)
    %4:dpr = VLDRD %0, 0, 14 /* CC::al */, $noreg :: (load 8 from %ir.pa)
    dead %6:gpr, dead %7:gprnopc = VFPSETLEN 0, implicit-def $fpscr
    %5:dpr = VADDD killed %4, killed %3, 14 /* CC::al */, $noreg, implicit $fpscr
    VSTRD killed %5, %2, 0, 14 /* CC::al */, $noreg :: (store 8 into %ir.pc)
    BX_RET 14 /* CC::al */, $noreg
...
```
In the output above, the newly introduced VFPSETLEN appears right before the VADDD. It implicitly defines fpscr (physical registers are prefixed with a $ sign while virtual registers use %), and the VADDD instruction then implicitly uses it.
This is all still very basic but it is a necessary step for correct code generation for VFPv2.
In the next installment we will continue adding the missing bits in the instruction selection and lowering of operations for vector types.