Fun with vectors in the Raspberry Pi 1 - Part 6
There is an issue we have mentioned several times in earlier installments: the value of the vector length at function boundaries, that is, when entering or leaving a function. We will address this question today.
Calling convention
The Arm Procedure Call Standard specifies how parameters are passed in function calls. The convention also specifies other details, like the state of registers upon entering a function. One of the details it specifies is the value of len.
The length bits (16-18) must be 0b100 when using M-profile Vector Extension, 0b000 when using VFP vector mode and otherwise preserved across a public interface.
So, in order to interface correctly with other functions we need to make sure the len field is set to 0 when calling a function. We will achieve this using the following approach:
- a VFPSETLEN that sets len to 0 will be emitted prior to a function call
- a VFPSETLEN that sets len to 0 will be emitted before returning from a function
We will do this in SelectionDAG. The optimisation we implemented in the last chapter should be able to remove all the redundant cases.
Changes in SelectionDAG
In order to implement this in SelectionDAG, the easiest approach is to create a new target-specific SelectionDAG node.
We do that by first declaring a new enumerator of the NodeType enum, in ARMISelLowering.h. We will call it VFPSETLENZERO and its purpose will be exclusively setting len to 0.
Now we can define the tablegen node itself. This is done in ARMInstrVFP.td.
This definition in tablegen defines a new record named arm_vfpsetlenzero of type SDNode. This class needs a few parameters: the enumerator we declared above in ARMISelLowering.h, a prototype of the node and a list of attributes. The prototype of the node specifies what operands a node receives and what values it returns. In our case arm_vfpsetlenzero will not receive any parameter nor return anything, so we can use the predefined prototype for this case, called SDTNone. The only attribute we have is SDNPHasChain, which means the node has a chain.
A chain is one of the three dependence kinds that SelectionDAG nodes can
represent: data flow (called normal values, this is operands and results of a
node modelling some operation), control flow (chain, used for things like
memory accesses or other dependences that are unrelated to data but to
operation ordering), and scheduling-dependences (called glue, used for things
like CPU flags). Dependences are important when the output SelectionDAG is linearised into MachineInstrs because they determine a valid order. When a SelectionDAG node has a chain, it has an input and an output chain.
We still need to make one final change in ARMISelLowering.cpp so we can print the name of the node (used for debugging).
Lowering
We have to change two locations in ARMISelLowering.cpp:
- ARMTargetLowering::LowerCall, which deals with lowering function calls. We will add a VFPSETLENZERO right before the lowering of a function call.
- ARMTargetLowering::LowerReturn, which deals with lowering a function return. We will add a VFPSETLENZERO very early in the return node.
In both places the code is the same.
We create an arm_vfpsetlenzero in C++ using its enumerator ARMISD::VFPSETLENZERO. It only returns a chain, which has type MVT::Other, and receives an input Chain. The new node then replaces the previous chain, so later nodes will use it as their input chain.
Initial DAG
Now we can do a first experiment and see what the SelectionDAG looks like. Let’s consider the following LLVM IR.
This will crash because we are still missing a few bits, but we can look at the initial SelectionDAG.
Initial selection DAG: %bb.0 'test_vec:'
SelectionDAG has 33 nodes:
t0: ch = EntryToken
t7: i32 = GlobalAddress<void (i32, i32)* @foo> 0
t10: ch = ARMISD::VFPSETLENZERO t0
t12: ch,glue = callseq_start t10, TargetConstant:i32<0>, TargetConstant:i32<0>
t14: i32,ch = CopyFromReg t12, Register:i32 $sp
t16: ch,glue = CopyToReg t12, Register:i32 $r0, Constant:i32<1>
t18: ch,glue = CopyToReg t16, Register:i32 $r1, Constant:i32<3>, t16:1
t21: ch,glue = ARMISD::CALL t18, TargetGlobalAddress:i32<void (i32, i32)* @foo> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t18:1
t23: ch,glue = callseq_end t21, TargetConstant:i32<0>, TargetConstant:i32<-1>, t21:1
t24: i32 = Constant<0>
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t26: v2f64,ch = load<(load 16 from %ir.pa, align 8)> t23, t2, undef:i32
t4: i32,ch = CopyFromReg t0, Register:i32 %1
t27: v2f64,ch = load<(load 16 from %ir.pb, align 8)> t23, t4, undef:i32
t29: ch = TokenFactor t26:1, t27:1
t28: v2f64 = fadd t26, t27
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t30: ch = store<(store 16 into %ir.pc, align 8)> t29, t28, t6, undef:i32
t31: ch = ARMISD::VFPSETLENZERO t30
t32: ch = ARMISD::RET_FLAG t31
If you check lines 5 and 21 of the dump (nodes t10 and t31) you will see the new node. Each one receives an input chain: t0 and t30 respectively.
t0 is the initial chain of the basic block and ARMISD::VFPSETLENZERO has an output chain called t10, which is the input chain of callseq_start, a node used to signal the beginning of a function call. We basically set len to zero right before starting the function call sequence.
Similarly, t30 is the input chain for the ARMISD::VFPSETLENZERO that we emit right before returning. The return in ARM is represented using the node ARMISD::RET_FLAG. The input chain of that node is exactly t31, which is the output chain of this second ARMISD::VFPSETLENZERO.
Selection
As I mentioned, the initial test above crashes. At this stage, LLVM does not know how to select this input SelectionDAG node ARMISD::VFPSETLENZERO into an output SelectionDAG node. So we have to tell LLVM how to do that.
The easiest way is to add a pattern. A suitable place is ARMInstrVFP.td.
However, there is a minor issue. When the output SelectionDAG has been scheduled, the creation of machine instructions (done by InstrEmitter) will set the implicit Defs to dead (meaning that nobody uses the value set there). This means that this pattern will generate a MachineInstr whose implicit definition is marked as dead.
This confuses later passes in the LLVM pipeline and causes wrong code generation. There are reasons why SelectionDAG does this. In fact, there are a number of situations in which InstrEmitter will not mark implicit definitions as dead, but this is not one of them. Luckily we can do a final fixup of an instruction after it has been emitted.
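To do that we first need to change the definition of VFPSETLEN so that it requests a post-instruction-selection hook (in LLVM this is controlled by the hasPostISelHook flag of an instruction definition).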
Now InstrEmitter will call a function called AdjustInstrPostInstrSelection after it has created the machine instruction. Let’s handle the instruction there and make sure the implicit operand is never dead.
With this fixup, the implicit definition of the machine instruction is no longer marked as dead.
Results
Now we can see what the output of our test.ll above is, with and without optimisation. The first VFPSETLEN can be removed.
If we move the call right before the return, this time the final VFPSETLEN can be removed.
However, if we move the call to some other position the backend crashes. The reason is that the compiler wants to preserve the value of the vector registers that are live across the call. To do this it needs to store the vector register onto the stack, but it does not know how to do that.
In the next installment we will teach the compiler to spill, reload and copy vector registers.