Fun with vectors in the Raspberry Pi 1 - Part 4
In the last chapter we devised a way to tame the issue with fpscr
. Today
we are going to complete the code generation bits that we are still missing
so we can start emitting code.
Expand VFPSETLEN
Given that last week we finished with a VFPSETLEN
instruction being emitted
I guess it makes sense we expand this first.
The easiest way to achieve this is extending the file
llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp
which contains the
implementation of a pass, running after register allocation, intended to expand
pseudo instructions.
If you check the first installment of this series, you will see that the len
field of fpscr
is a field of 3 bits located in bits 18, 17 and 16. Setting
the length mostly means reading the fpscr
, which we can do using the
instruction vmrs
, change the bits and write them back into fpscr
using the
vmsr
instruction.
The complex part of this process is changing the bits. We need to ensure the
len
field has the bits we want. We can do this in general masking what we
obtained from the vmrs
instruction with ~(0x7 << 16)
. This will clear the
three bits of the len
field and then we can do a bitwise or to set
the precise operation we want. All this is what explained why we needed
two scratch output registers as outputs of VFPSETLEN
.
The code may look long but it should be relatively straightforward to follow.
There are two unobvious elements due to the AArch32 Arm ISA itself. Most
instructions can be predicated, but we want them to be always executed so we
set it to always using .add(predOps(ARMCC::AL))
. Many Arm instructions may
optionally update the cpsr
, because we do not want to do this we state
that using .add(consCodeOps())
.
Other than that, the code above is basically constructing the instruction and
finally removing the pseudo instruction. Opcode ARM::MVNi
represents a mvn
instruction that uses an immediate as an input, ARM::ADDrr
is an add
instruction with two registers as inputs. ARM::ORRrr
is an orr
(a bitwise
or) with two registers as inputs.
One very simple optimization we can do happens when the vector length is 1
(encoded as len
). In that case we can use the bic
instruction whose task
is precisely clearing consecutive bits in a register. So we can improve the
emitted instructions like below.
There is an instruction bfi
in a special variant of Armv6, called Armv6T2,
and as of Armv7-A similar to bic
to insert arbitrary values into bitfields.
Unfortunately the core of the Raspberry Pi 1 does not implement such
instruction, hence the dance with mov
and orr
.
Now we can see what happens with our last example of a scalar addition.
Above we can see the bic
instruction in action because here len
is set
to zero.
We can manually modify the file scalar.mir
we generated at the end of the
previous installment so it does VFPSETLEN 1
instead of VFPSETLEN 0
. We
get this:
So the lowering of VFPSETLEN
is done.
Next is teaching SelectionDAG to actually try to select operations
involving v2f64
and v4f32
.
Instruction selection
Even if in the previous installment we added patterns that instruction
selection can use to select VADDDx2
and VADDSx4
, those will not be
selected. The reason is that instruction selection believes that v2f64
and
v4f32
are illegal types. In LLVM parlance, this means that operations
with these types need to be softened so they only use legal types. Legal
types and operations are, intuitively, those that are supported more
or less straightforwardly by the target.
We can specify this in file llvm/lib/Target/ARM/ARMISelLowering.cpp
. For now
we will focus only on v2f64
but v2f32
is the same. We are only going to
make float addition legal (fadd
/ ISD::FADD
), but the same applies to all
other operations appearing in the patterns we addded.
First we associate the machine type v2f64
with the register class
DPRx2RegClass
. This is a class generated by one of the tablegen backends
based on the register classes we defined in ARMRegisterInfo.td
. We do this
with a call to addRegisterClass
.
Then we say that ISD::FADD
is a legal operation for this target. This means
this operation can be directly selected. Because we have a pattern for it,
we know it will select the VADDDx4
instruction.
One side-effect of linking the v2f64
to the DRPx2
register class is that
now SelectionDAG expects us to be able to lower loads and stores of v2f64
.
Unfortunately the field len
is not used by the load and store instructions of
VFPv2. Because of this we need a way to express a load of v2f64
in a set of
instructions of f64
. Before we can do that we need to let SelectionDAG know
that we will manually handle the lowering of a load and a store of type
v2f64
.
Lowering load/store
Before we can proceed any further we will need to implement the lowering
of loads and stores of v2f64
. When an operation is marked as Custom
,
the function LowerOperation
is invoked for it. We will create two functions
LowerShortVectorLoad
and LowerShortVectorStore
that we will use for
those types.
I am going to show only the v2f64
case, the v4f32
is similar (just a bit
longer to write). First the vector load.
This is a bit of code to explain but basically we obtain the specific information
of a load node and we keep it in the LD
variable. We use this to discriminate
the machine type we are handling.
To create a load we need a bunch of information, but basically we will assume
this is an unindexed load (this means there is no pre/postincrement or
pre/postdecrement involved). The kind of extension load, ExtType, is for
non-extending/zero-extending/sign-extending loads. We do not support it for now
(hence the
assert`).
The chain, Ch
, is a special operand that can be consumed as input or be
generated as part of the results of a SelectionDAG node and it is used for
control dependencies that cannot be represented via conventional data-flow
dependences (e.g. a load that follows a store will usually be linked to the
store via a chain).
Ptr
is the value that contains the address. Offset
will be left to
undefined. There is a bunch of flags we want to propagate from the memory
operands of this SelectionDAG node so (MMOFlags
) alongwith alias analysis
information (AAInfo
).
With all this information we can build a first load, that loads the first
element of the vector. Note the machine type is f64
here.
The second part of the vector is found at an offsets of 8 bytes of the original
Ptr
address. MPI
contains the abstract pointer information and we will state
that it is the same as the original pointer information plus 8 bytes. Similarly
we overwrite Ptr
with computing an offset of 8
bytes after it. Now we
can build a load for the second element of the vector. Again the type loaded
is f64
.
Now we have to build a vector value in the input SelectionDAG. We will take
the two values we just loaded and insert them to the proper subregisters
that we defined in the second installment in ARMRegisterInfo.td
. Our
vector value is Vec
.
Finally because this is a load, we need to return the vector value we just
loaded (Vec
) and a chain (NewCh
). The chain is such that represents the two
loads in parallel (i.e. any operation chained to this value will happen after
both loads but either load can be scheduled at any order). The node that
represents this parallel join of other chains is called ISD::TokenFactor
.
A similar process happens with stores.
In contrast to the load, we need to extract the different subregisters from
the vector value and then we can store them using regular f64
stores. Also note
that stores do not return any value but a chain. So our lowered store must
return a chain as well. Like we did with loads, ISD::TokenFactor
is used
to state that the two chained operations can happen in parallel.
MC layer
We are missing a final change so we can emit our instructions: we still have to replace the pseudo instructions into actual instructions. From a code generation point of view our vector pseudo instructions are fine. What it is not fine is that we do not know how to encode them as instructions.
The MC layer is the part of LLVM devoted to assembly and disassembly of instructions. All the instructions defined by tablegen have conceptual mirrors called MC Instructions and they are used when encoding (assembling) and decoding (disassembling) instructions.
If you remember from the last installment, we linked the pseudo instructions to the real instruction. Now it is the moment to use this.
Once machine instructions have reached their final stage, they are handed to
a process that creates MC instructions for each one. These MC instructions
can then be streamed as assembly output (like what llc
prints as an output)
or to generate an object file (such an ELF file).
So what we will do is change this lowering from machine instructions to MC instructions, so our pseudo instructions get encoded like normal VFPv2 instructions.
This goes in two steps: first we need to use the real opcode of each
pseudo instruction. Second we need to make sure we use the regular register
that would be encoded. All the changes are in file
llvm/lib/Target/ARM/ARMMCInstLower.cpp
.
First let’s add a small diversion in llvm::LowerARMMachineInstrToMCInst
.
If the function lowerVFPMachineInstrToMCInst
returns true we do not have
to do anything else here. That function looks like this
We query in our VFPPseudos
table if this is a VFP instruction that demands a
length larger than 1.
If this is the case, we set the opcode of the MC instruction being created
(OutMI
) to the opcode of the base instruction. E.g. in this step we go from
VADDDx2
to VADDD
. Now we proceed as usual lowering the operands.
When lowering the operands we need to be careful not to emit operands of
the register classes DPRx2
or SPRx4
. If this is the case, we use the first
subregister instead. To do that we change lowerOperand.
First test
With all this we can do a first experiment with the following LLVM IR.
Hey, not bad.
OK. We still need to think about preserving the len
field upon returning
the function (i.e. len
should be 0b00 again) but this is a start.
Under the hood
I think this is a good moment to take a look under the hood in the different steps. Let’s take a look at the whole instruction selection.
This testcase has a single basic block, so only a single SelectionDAG will be built for it. This is the initial SelectionDAG
Initial selection DAG: %bb.0 'test_vector:'
SelectionDAG has 15 nodes:
t0: ch = EntryToken
t7: i32 = Constant<0>
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t9: v2f64,ch = load<(load 16 from %ir.pa, align 8)> t0, t2, undef:i32
t4: i32,ch = CopyFromReg t0, Register:i32 %1
t10: v2f64,ch = load<(load 16 from %ir.pb, align 8)> t0, t4, undef:i32
t12: ch = TokenFactor t9:1, t10:1
t11: v2f64 = fadd t9, t10
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t13: ch = store<(store 16 into %ir.pc, align 8)> t12, t11, t6, undef:i32
t14: ch = ARMISD::RET_FLAG t13
Note how node t11
adds two vectors, loaded in t9
and t10
respectively.
The SelectionDAG now undergoes a first optimisation step, which makes nothing
because it is so simple. Next a type legalisation step happens but all the types
mentioned are legal (i32
, v2f64
).
Next is legalisation of operations. Now the loads and the stores get custom lowered as we wanted.
Legalized selection DAG: %bb.0 'test_vector:'
SelectionDAG has 33 nodes:
t0: ch = EntryToken
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t4: i32,ch = CopyFromReg t0, Register:i32 %1
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t35: ch = TokenFactor t32:1, t34:1
t27: ch = TokenFactor t24:1, t26:1
t12: ch = TokenFactor t35, t27
t36: v2f64 = INSERT_SUBREG undef:v2f64, t32, TargetConstant:i32<9>
t37: v2f64 = INSERT_SUBREG t36, t34, TargetConstant:i32<10>
t29: v2f64 = INSERT_SUBREG undef:v2f64, t24, TargetConstant:i32<9>
t30: v2f64 = INSERT_SUBREG t29, t26, TargetConstant:i32<10>
t11: v2f64 = fadd t37, t30
t24: f64,ch = load<(load 8 from %ir.pb)> t0, t4, undef:i32
t25: i32 = add nuw t4, Constant:i32<8>
t26: f64,ch = load<(load 8 from %ir.pb + 8)> t0, t25, undef:i32
t32: f64,ch = load<(load 8 from %ir.pa)> t0, t2, undef:i32
t33: i32 = add nuw t2, Constant:i32<8>
t34: f64,ch = load<(load 8 from %ir.pa + 8)> t0, t33, undef:i32
t16: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<9>
t19: ch = store<(store 8 into %ir.pc)> t12, t16, t6, undef:i32
t18: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<10>
t21: i32 = add nuw t6, Constant:i32<8>
t22: ch = store<(store 8 into %ir.pc + 8)> t12, t18, t21, undef:i32
t23: ch = TokenFactor t19, t22
t14: ch = ARMISD::RET_FLAG t23
Let’s take a look at t9
in the initial SelectionDAG. It has been expanded into
two stores t32
and t34
(note how we annotate the proper 8 bytes offsets in
t34
). Then t32
and t34
are inserted as subregisters in t36
and t37
.
The last one will make up the value of the original type v2f64
we had in
t9
. The original chain of t9
(denoted by t9:1
because it is the second
result of the operation) is now represented by t35
which is the parallel join
of the chains of the already mentioned loads t32
and t34
(i.e. t32:1
and
t34:1
).
The SelectionDAG undergoes another optimisation pass which only simplifies the
token factors (check t39
).
Optimized legalized selection DAG: %bb.0 'test_vector:'
SelectionDAG has 31 nodes:
t0: ch = EntryToken
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t4: i32,ch = CopyFromReg t0, Register:i32 %1
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t36: v2f64 = INSERT_SUBREG undef:v2f64, t32, TargetConstant:i32<9>
t37: v2f64 = INSERT_SUBREG t36, t34, TargetConstant:i32<10>
t29: v2f64 = INSERT_SUBREG undef:v2f64, t24, TargetConstant:i32<9>
t30: v2f64 = INSERT_SUBREG t29, t26, TargetConstant:i32<10>
t11: v2f64 = fadd t37, t30
t24: f64,ch = load<(load 8 from %ir.pb)> t0, t4, undef:i32
t25: i32 = add nuw t4, Constant:i32<8>
t26: f64,ch = load<(load 8 from %ir.pb + 8)> t0, t25, undef:i32
t32: f64,ch = load<(load 8 from %ir.pa)> t0, t2, undef:i32
t33: i32 = add nuw t2, Constant:i32<8>
t34: f64,ch = load<(load 8 from %ir.pa + 8)> t0, t33, undef:i32
t39: ch = TokenFactor t32:1, t34:1, t24:1, t26:1
t16: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<9>
t43: ch = store<(store 8 into %ir.pc)> t39, t16, t6, undef:i32
t18: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<10>
t21: i32 = add nuw t6, Constant:i32<8>
t40: ch = store<(store 8 into %ir.pc + 8)> t39, t18, t21, undef:i32
t42: ch = TokenFactor t43, t40
t14: ch = ARMISD::RET_FLAG t42
At this point the input SelectionDAG can be “instruction selected”. This gives us the output SelectionDAG.
===== Instruction selection ends:
Selected selection DAG: %bb.0 'test_vector:'
SelectionDAG has 30 nodes:
t0: ch = EntryToken
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t4: i32,ch = CopyFromReg t0, Register:i32 %1
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t32: f64,ch = VLDRD<Mem:(load 8 from %ir.pa)> t2, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 $noreg, t0
t24: f64,ch = VLDRD<Mem:(load 8 from %ir.pb)> t4, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 $noreg, t0
t34: f64,ch = VLDRD<Mem:(load 8 from %ir.pa + 8)> t2, TargetConstant:i32<2>, TargetConstant:i32<14>, Register:i32 $noreg, t0
t26: f64,ch = VLDRD<Mem:(load 8 from %ir.pb + 8)> t4, TargetConstant:i32<2>, TargetConstant:i32<14>, Register:i32 $noreg, t0
t39: ch = TokenFactor t32:1, t34:1, t24:1, t26:1
t36: v2f64 = INSERT_SUBREG IMPLICIT_DEF:v2f64, t32, TargetConstant:i32<9>
t37: v2f64 = INSERT_SUBREG t36, t34, TargetConstant:i32<10>
t29: v2f64 = INSERT_SUBREG IMPLICIT_DEF:v2f64, t24, TargetConstant:i32<9>
t30: v2f64 = INSERT_SUBREG t29, t26, TargetConstant:i32<10>
t11: v2f64 = VADDDx2 t37, t30, TargetConstant:i32<14>, Register:i32 $noreg
t16: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<9>
t43: ch = VSTRD<Mem:(store 8 into %ir.pc)> t16, t6, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 $noreg, t39
t18: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<10>
t40: ch = VSTRD<Mem:(store 8 into %ir.pc + 8)> t18, t6, TargetConstant:i32<2>, TargetConstant:i32<14>, Register:i32 $noreg, t39
t42: ch = TokenFactor t43, t40
t14: ch = BX_RET TargetConstant:i32<14>, Register:i32 $noreg, t42
Note how the fadd
now is VADDDx2
as expected. The other operations are also
opcodes of the ARM backend. Now this output SelectionDAG is linearised
(scheduled) to obtain a machine basic block. There is only one in this function,
but all the basic blocks would be put together for the machine function.
# *** IR Dump Before Finalize ISel and expand pseudo-instructions (finalize-isel) ***:
# Machine code for function test_vector: IsSSA, TracksLiveness
Function Live Ins: $r0 in %0, $r1 in %1, $r2 in %2
bb.0 (%ir-block.0):
liveins: $r0, $r1, $r2
%2:gpr = COPY $r2
%1:gpr = COPY $r1
%0:gpr = COPY $r0
%3:dpr = VLDRD %1:gpr, 0, 14, $noreg :: (load 8 from %ir.pb)
%4:dpr = VLDRD %1:gpr, 2, 14, $noreg :: (load 8 from %ir.pb + 8)
%5:dpr = VLDRD %0:gpr, 0, 14, $noreg :: (load 8 from %ir.pa)
%6:dpr = VLDRD %0:gpr, 2, 14, $noreg :: (load 8 from %ir.pa + 8)
%8:dprx2 = IMPLICIT_DEF
%7:dprx2 = INSERT_SUBREG %8:dprx2(tied-def 0), killed %3:dpr, %subreg.dsub_len2_0
%10:dprx2 = IMPLICIT_DEF
%9:dprx2 = INSERT_SUBREG %10:dprx2(tied-def 0), killed %5:dpr, %subreg.dsub_len2_0
%11:dprx2 = INSERT_SUBREG %7:dprx2(tied-def 0), killed %4:dpr, %subreg.dsub_len2_1
%12:dprx2 = INSERT_SUBREG %9:dprx2(tied-def 0), killed %6:dpr, %subreg.dsub_len2_1
%13:dprx2 = VADDDx2 killed %12:dprx2, killed %11:dprx2, 14, $noreg
%14:dpr = COPY %13.dsub_len2_1:dprx2
VSTRD killed %14:dpr, %2:gpr, 2, 14, $noreg :: (store 8 into %ir.pc + 8)
%15:dpr = COPY %13.dsub_len2_0:dprx2
VSTRD killed %15:dpr, %2:gpr, 0, 14, $noreg :: (store 8 into %ir.pc)
BX_RET 14, $noreg
And then the custom inserter runs and adds VFPSETLEN
where due. In this case
right before VADDDx2
.
# *** IR Dump After Finalize ISel and expand pseudo-instructions (finalize-isel) ***:
# Machine code for function test_vector: IsSSA, TracksLiveness
Function Live Ins: $r0 in %0, $r1 in %1, $r2 in %2
bb.0 (%ir-block.0):
liveins: $r0, $r1, $r2
%2:gpr = COPY $r2
%1:gpr = COPY $r1
%0:gpr = COPY $r0
%3:dpr = VLDRD %1:gpr, 0, 14, $noreg :: (load 8 from %ir.pb)
%4:dpr = VLDRD %1:gpr, 2, 14, $noreg :: (load 8 from %ir.pb + 8)
%5:dpr = VLDRD %0:gpr, 0, 14, $noreg :: (load 8 from %ir.pa)
%6:dpr = VLDRD %0:gpr, 2, 14, $noreg :: (load 8 from %ir.pa + 8)
%8:dprx2 = IMPLICIT_DEF
%7:dprx2 = INSERT_SUBREG %8:dprx2(tied-def 0), killed %3:dpr, %subreg.dsub_len2_0
%10:dprx2 = IMPLICIT_DEF
%9:dprx2 = INSERT_SUBREG %10:dprx2(tied-def 0), killed %5:dpr, %subreg.dsub_len2_0
%11:dprx2 = INSERT_SUBREG %7:dprx2(tied-def 0), killed %4:dpr, %subreg.dsub_len2_1
%12:dprx2 = INSERT_SUBREG %9:dprx2(tied-def 0), killed %6:dpr, %subreg.dsub_len2_1
dead %16:gpr, dead %17:gprnopc = VFPSETLEN 1, implicit-def $fpscr
%13:dprx2 = VADDDx2 killed %12:dprx2, killed %11:dprx2, 14, $noreg, implicit $fpscr
%14:dpr = COPY %13.dsub_len2_1:dprx2
VSTRD killed %14:dpr, %2:gpr, 2, 14, $noreg :: (store 8 into %ir.pc + 8)
%15:dpr = COPY %13.dsub_len2_0:dprx2
VSTRD killed %15:dpr, %2:gpr, 0, 14, $noreg :: (store 8 into %ir.pc)
BX_RET 14, $noreg
You may be wondering how is it that the 4 loads and the 2 stores have become 2 load multiples and 1 store multiple. This is due a later pass in the ARM backend that knows how to optimise those consecutive loads.
# *** IR Dump After ARM load / store optimization pass (arm-ldst-opt) ***:
# Machine code for function test_vector: NoPHIs, TracksLiveness, NoVRegs, TiedOpsRewritten
Frame Objects:
fi#0: size=4, align=4, at location [SP-4]
fi#1: size=4, align=4, at location [SP-8]
Function Live Ins: $r0, $r1, $r2
bb.0 (%ir-block.0):
liveins: $r0, $r1, $r2, $r4, $lr
$sp = frame-setup STMDB_UPD $sp(tied-def 0), 14, $noreg, killed $r4, killed $lr
frame-setup CFI_INSTRUCTION def_cfa_offset 8
frame-setup CFI_INSTRUCTION offset $lr, -4
frame-setup CFI_INSTRUCTION offset $r4, -8
VLDMDIA killed $r1, 14, $noreg, def $d4, def $d5, implicit-def $d4_d5x2 :: (load 8 from %ir.pb), (load 8 from %ir.pb + 8)
VLDMDIA killed $r0, 14, $noreg, def $d6, def $d7, implicit-def $d6_d7x2 :: (load 8 from %ir.pa), (load 8 from %ir.pa + 8)
dead renamable $r0, dead renamable $r1 = VFPSETLEN 1, implicit-def $fpscr
renamable $d4_d5x2 = VADDDx2 killed renamable $d6_d7x2, killed renamable $d4_d5x2, 14, $noreg, implicit $fpscr
VSTMDIA killed $r2, 14, $noreg, $d4, $d5 :: (store 8 into %ir.pc), (store 8 into %ir.pc + 8)
$r4 = VMRS 14, $noreg, implicit $fpscr
$r4 = BICri $r4, 458752, 14, $noreg, $noreg
VMSR $r4, 14, $noreg, implicit-def $fpscr
$sp = frame-destroy LDMIA_RET $sp(tied-def 0), 14, $noreg, def $r4, def $pc
Final observation
What happens if our IR does two adds?
Well, we get the following output
As you can see we are setting the field len
in fpscr
twice. But once should
be enough: fpscr
is already encoding vector length 2 when we’re about to
execute the second vadd.f64
.
In a next installment we will see how to improve this and emit only the required
VFPSETLEN
instructions.