There is an issue we have mentioned several times in earlier installments: the value of the vector length at function boundaries, that is, when entering or leaving a function. We will address this question today.
The Arm Procedure Call Standard specifies how parameters are passed in
function calls. The convention also specifies other details, like the state
of registers upon entering a function. One of the details it specifies is
the value of the len field:
The length bits (16-18) must be 0b100 when using M-profile Vector Extension, 0b000 when using VFP vector mode and otherwise preserved across a public interface.
So, in order to interface correctly with other functions we need to make sure
the len field is set to 0 when calling a function. We will achieve this
using the following approach:
an instruction that sets len to 0 will be emitted prior to a function call
an instruction that sets len to 0 will be emitted before returning from a function
We will do this in SelectionDAG. The optimisation we implemented in the last chapter should be able to remove all the redundant cases.
Changes in SelectionDAG
In order to implement this in SelectionDAG, the easiest approach is to create a new target-specific SelectionDAG node.
We do that by first declaring a new enumerator of the NodeType enum, in
ARMISelLowering.h. We will call it VFPSETLENZERO and its purpose will be
exclusively setting len to 0.
Now we can define the tablegen node itself. This is done in
This definition in tablegen defines a new record named arm_vfpsetlenzero of
class SDNode. This class needs a few parameters: the enumerator we declared
in ARMISelLowering.h, a prototype of the node and a list of attributes. The
prototype of the node allows specifying what operands a node receives and what
values it returns. In our case arm_vfpsetlenzero will not receive any parameter
nor return anything, so we can use the predefined prototype for this case,
called SDTNone. The only attribute we have is SDNPHasChain, which means the
node has a chain.
A chain is one of the three dependence kinds that SelectionDAG nodes can
represent: data flow (called normal values, that is, the operands and results
of a node modelling some operation), control flow (chain, used for things like
memory accesses or other dependences that are not about data but about
operation ordering), and scheduling dependences (called glue, used for things
like CPU flags). Dependences are important when the output SelectionDAG is
turned into MachineInstrs because they determine a valid order. When a
SelectionDAG node has a chain, it has an input and an output chain.
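To make the idea of a chain more concrete, here is a minimal sketch in plain C++ that does not use LLVM at all. The names ToyNode, chainOrder and exampleChain are made up for this illustration. Each toy node only records which node provides its input chain, and following those links backwards recovers the order the chain imposes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a SelectionDAG node. It models only the chain,
// not the normal values or the glue.
struct ToyNode {
  std::string Name; // e.g. "EntryToken", "store"
  int ChainIn;      // index of the node providing the input chain, -1 if none
};

// Walk the chain backwards from Last and return the node names in the
// order that the chain forces on them (earliest first).
std::vector<std::string> chainOrder(const std::vector<ToyNode> &Nodes,
                                    int Last) {
  assert(Last >= 0 && Last < static_cast<int>(Nodes.size()));
  std::vector<std::string> Order;
  for (int I = Last; I != -1; I = Nodes[I].ChainIn)
    Order.insert(Order.begin(), Nodes[I].Name);
  return Order;
}

// A small chain: EntryToken -> load -> store -> return.
std::vector<ToyNode> exampleChain() {
  return {{"EntryToken", -1}, {"load", 0}, {"store", 1}, {"return", 2}};
}
```

Calling chainOrder(exampleChain(), 3) yields the names from EntryToken to return in that order. A real SelectionDAG is richer than this (a node can merge several chains with a TokenFactor, for instance), so take this only as an illustration of the ordering role of the chain.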
We still need to make one final change in
ARMISelLowering.cpp so we can print
the name of the node (used for debugging).
We have to change two locations:
ARMTargetLowering::LowerCall, which deals with function calls. We will add a
VFPSETLENZERO right before the lowering of a function call.
ARMTargetLowering::LowerReturn, which deals with lowering a function return. We will add a
VFPSETLENZERO very early in the lowering of the return.
In both places the code is the same.
We create an arm_vfpsetlenzero in C++ using its enumerator
ARMISD::VFPSETLENZERO. It only returns a chain, which has type MVT::Other,
and receives an input Chain. The new node replaces the previous chain and
will be used as the chain of later nodes.
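The way the new node is threaded into the chain can be sketched with a toy model. This is not the LLVM code: ToyDAG, its getNode member, lowerCall and lowerReturn are hypothetical names that only mimic the shape of the real lowering, and the opcodes are plain strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical miniature DAG: every node has an opcode name and one
// chain input. The index of a node doubles as its output chain.
struct ToyDAG {
  struct Node {
    std::string Opcode;
    int ChainIn;
  };
  std::vector<Node> Nodes;

  // Create a node chained after Chain and return its output chain.
  int getNode(const std::string &Opcode, int Chain) {
    assert(Chain < static_cast<int>(Nodes.size()));
    Nodes.push_back({Opcode, Chain});
    return static_cast<int>(Nodes.size()) - 1;
  }
};

// Toy version of lowering a call: set len to 0 first, then emit the call
// sequence using the chain returned by the new node.
int lowerCall(ToyDAG &DAG, int Chain) {
  Chain = DAG.getNode("VFPSETLENZERO", Chain);
  Chain = DAG.getNode("callseq_start", Chain);
  Chain = DAG.getNode("CALL", Chain);
  return DAG.getNode("callseq_end", Chain);
}

// Toy version of lowering a return: set len to 0 right before returning.
int lowerReturn(ToyDAG &DAG, int Chain) {
  Chain = DAG.getNode("VFPSETLENZERO", Chain);
  return DAG.getNode("RET_FLAG", Chain);
}
```

Starting from an entry node created with DAG.getNode("EntryToken", -1) and calling lowerCall followed by lowerReturn produces seven nodes, with a VFPSETLENZERO as the chain input of both callseq_start and RET_FLAG. The important detail is that Chain is overwritten with the result of the new node, so everything created afterwards is ordered after it.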
Now we can do a first experiment and see what the SelectionDAG looks like. Let’s consider the following LLVM IR.
This will crash because we are still missing a few bits, but we can look at the initial SelectionDAG.
 1  Initial selection DAG: %bb.0 'test_vec:'
 2  SelectionDAG has 33 nodes:
 3    t0: ch = EntryToken
 4    t7: i32 = GlobalAddress<void (i32, i32)* @foo> 0
 5    t10: ch = ARMISD::VFPSETLENZERO t0
 6    t12: ch,glue = callseq_start t10, TargetConstant:i32<0>, TargetConstant:i32<0>
 7    t14: i32,ch = CopyFromReg t12, Register:i32 $sp
 8    t16: ch,glue = CopyToReg t12, Register:i32 $r0, Constant:i32<1>
 9    t18: ch,glue = CopyToReg t16, Register:i32 $r1, Constant:i32<3>, t16:1
10    t21: ch,glue = ARMISD::CALL t18, TargetGlobalAddress:i32<void (i32, i32)* @foo> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t18:1
11    t23: ch,glue = callseq_end t21, TargetConstant:i32<0>, TargetConstant:i32<-1>, t21:1
12    t24: i32 = Constant<0>
13    t2: i32,ch = CopyFromReg t0, Register:i32 %0
14    t26: v2f64,ch = load<(load 16 from %ir.pa, align 8)> t23, t2, undef:i32
15    t4: i32,ch = CopyFromReg t0, Register:i32 %1
16    t27: v2f64,ch = load<(load 16 from %ir.pb, align 8)> t23, t4, undef:i32
17    t29: ch = TokenFactor t26:1, t27:1
18    t28: v2f64 = fadd t26, t27
19    t6: i32,ch = CopyFromReg t0, Register:i32 %2
20    t30: ch = store<(store 16 into %ir.pc, align 8)> t29, t28, t6, undef:i32
21    t31: ch = ARMISD::VFPSETLENZERO t30
22    t32: ch = ARMISD::RET_FLAG t31
If you check lines 5 and 21 you will see the new node. You will see each one
receives a chain:
t0 is the initial chain of the basic block.
The first ARMISD::VFPSETLENZERO has an output chain called t10, which is the
input chain of callseq_start, a node used to signal the beginning of a
function call. We basically set len to zero right before starting the
function call.
t30 is the input chain for the ARMISD::VFPSETLENZERO that we emit right
before returning. The return in ARM is represented using the node
ARMISD::RET_FLAG. The input chain of that node is exactly t31, which is the
output chain of this second ARMISD::VFPSETLENZERO.
As I mentioned, the initial test above crashes. At this stage, LLVM does not
know how to select this input SelectionDAG node
ARMISD::VFPSETLENZERO into an
output SelectionDAG node. So we have to tell LLVM how to do that.
The easiest way is to add a pattern. A suitable place is
However, there is a minor issue. When the output SelectionDAG has been
scheduled, the creation of machine instructions (done by InstrEmitter) will
mark the implicit definitions (Defs) as dead (meaning that nobody uses the
value set there).
This means that this pattern will generate a
MachineInstr like this
This confuses later passes in the LLVM pipeline and causes wrong code generation. There are reasons why SelectionDAG does this. In fact, there are a number of situations in which InstrEmitter will not mark implicit definitions as dead, but this is not one of them. Luckily we can do a final fixup of an instruction after it has been emitted.
To do that we first need to change the definition of
Now InstrEmitter will call a function called
after it has created the machine instruction. Let’s handle the
instruction there and make sure the implicit operand is never dead.
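The idea of the fixup can be sketched outside LLVM with a toy model. Everything here (ToyOperand, ToyInstr, markImplicitDefsDead, adjustAfterEmission, emit, and the register name "fpscr") is hypothetical and only mirrors the behaviour described above. In upstream LLVM, a hook of this kind is TargetLowering::AdjustInstrPostInstrSelection, which is requested by setting hasPostISelHook on the instruction definition; the sketch does not call either.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy operand: a register plus the two flags we care about.
struct ToyOperand {
  std::string Reg;
  bool IsImplicitDef;
  bool IsDead;
};

// Toy machine instruction: an opcode name and its operands.
struct ToyInstr {
  std::string Opcode;
  std::vector<ToyOperand> Operands;
};

// Mimics the behaviour described in the text: after emission, implicit
// definitions end up flagged as dead.
void markImplicitDefsDead(ToyInstr &MI) {
  for (ToyOperand &Op : MI.Operands)
    if (Op.IsImplicitDef)
      Op.IsDead = true;
}

// Post-emission fixup: for our instruction only, clear the dead flag of
// every implicit definition. Other instructions are left untouched.
void adjustAfterEmission(ToyInstr &MI) {
  if (MI.Opcode != "VFPSETLEN")
    return;
  for (ToyOperand &Op : MI.Operands)
    if (Op.IsImplicitDef)
      Op.IsDead = false;
}

// Emit an instruction with one implicit definition (assumed register
// name "fpscr") and run both steps in the order the emitter would.
ToyInstr emit(const std::string &Opcode) {
  ToyInstr MI;
  MI.Opcode = Opcode;
  MI.Operands.push_back({"fpscr", /*IsImplicitDef=*/true, /*IsDead=*/false});
  markImplicitDefsDead(MI);
  adjustAfterEmission(MI);
  return MI;
}
```

With this model, emit("VFPSETLEN") returns an instruction whose implicit definition is not dead, while any other opcode keeps the dead flag. Restricting the fixup to a single opcode keeps the default behaviour for every other instruction.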
This will make the above machine instruction look like this.
Now we can see the output of our test.ll above with and without
optimisation. The first VFPSETLEN can be removed.
If we move the call right before the return, this time the final VFPSETLEN
can be removed.
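To see why one of the two instructions becomes redundant in each case, here is a toy forward scan over a straight-line list of operations. This is not the optimisation pass from the previous installment, only a sketch under simplifying assumptions: operations are plain strings, "SETLEN0" sets len to 0, "SETLEN" sets it to a non-zero value, and both the function entry and the return from a "CALL" leave len at 0 as the calling convention requires. The function name removeRedundantSetLenZero is made up.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Drop every "SETLEN0" that is reached when len is already known to be 0.
std::vector<std::string>
removeRedundantSetLenZero(const std::vector<std::string> &Ops) {
  std::vector<std::string> Out;
  bool LenIsZero = true; // the convention guarantees len == 0 on entry
  for (const std::string &Op : Ops) {
    if (Op == "SETLEN0") {
      if (LenIsZero)
        continue; // nothing to do, len is already 0
      LenIsZero = true;
    } else if (Op == "SETLEN") {
      LenIsZero = false; // a vector operation needs a non-zero len
    } else if (Op == "CALL") {
      LenIsZero = true; // the callee returns with len set to 0
    }
    Out.push_back(Op);
  }
  assert(Out.size() <= Ops.size()); // we only ever remove operations
  return Out;
}
```

For the sequence SETLEN0, CALL, SETLEN, VECOP, SETLEN0, RET (call first) the scan drops the first SETLEN0. For SETLEN, VECOP, SETLEN0, CALL, SETLEN0, RET (call right before the return) it drops the last one instead, which matches the two situations described above.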
However, if we move the call to some other position the backend crashes. The reason is that the compiler wants to preserve the value of the vector registers that are live across the call. To do this it needs to store the vector register onto the stack, but it does not know how to do that.
In the next installment we will teach the compiler to spill, reload and copy vector registers.