There is an issue we have mentioned several times in earlier installments: the value of the vector length at function boundaries, that is, when entering or leaving a function. We will address this question today.

Calling convention

The Arm Procedure Call Standard specifies how parameters are passed in function calls. It also specifies other details, such as the state of registers upon entering a function. One of those details is the value of len.

The length bits (16-18) must be 0b100 when using M-profile Vector Extension, 0b000 when using VFP vector mode and otherwise preserved across a public interface.

So, in order to interface correctly with other functions, we need to make sure the len field is set to 0 whenever we call a function or return from one. We will achieve this using the following approach:

a VFPSETLEN that sets len to 0 will be emitted prior to a function call
a VFPSETLEN that sets len to 0 will be emitted before returning from a function
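To make "setting len to 0" concrete at the bit level, here is a small C++ sketch. It assumes only the bit positions quoted above (bits 16-18 of FPSCR); the helper names getLenField and setLenField are made up for illustration and are not part of LLVM.

```cpp
#include <cassert>
#include <cstdint>

// Position and width of the len field in FPSCR (bits 16-18).
constexpr unsigned LenShift = 16;
constexpr std::uint32_t LenMask = 0x7u << LenShift; // 0x70000

// The mask is the same constant used by the bic instructions in the
// generated assembly shown later in this post.
static_assert(LenMask == 458752, "mask covers bits 16-18");

// Returns the value of the len field stored in an FPSCR image.
constexpr std::uint32_t getLenField(std::uint32_t Fpscr) {
  return (Fpscr & LenMask) >> LenShift;
}

// Returns a copy of Fpscr with the len field replaced by Len. Every other
// bit is preserved, as in a read-modify-write of the register.
inline std::uint32_t setLenField(std::uint32_t Fpscr, std::uint32_t Len) {
  assert(Len < 8 && "len is a 3-bit field");
  return (Fpscr & ~LenMask) | (Len << LenShift);
}
```

Calling setLenField(Fpscr, 0) clears the three bits and leaves the rest of the register alone, which is what the calling convention asks for at a function boundary.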

We will do this in SelectionDAG. The optimisation we implemented in the last chapter should be able to remove all the redundant cases.

Changes in SelectionDAG

In order to implement this in SelectionDAG, the easiest approach is to create a new target-specific SelectionDAG node.

We do that by first declaring a new enumerator in the NodeType enum, in ARMISelLowering.h. We will call it VFPSETLENZERO and its only purpose will be setting len to 0.

llvm/lib/Target/ARM/ARMISelLowering.h

@@ -312,6 +312,9 @@
     CSNEG, // Conditional select negate.
     CSINC, // Conditional select increment.
 
+    // VFP2
+    VFPSETLENZERO,
+
     // Vector load N-element structure to all lanes:
     VLD1DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,
     VLD2DUP,

Now we can define the tablegen node itself. This is done in ARMInstrVFP.td.

llvm/lib/Target/ARM/ARMInstrVFP.td

@@ -32,6 +32,9 @@
 def arm_vmovhr : SDNode<"ARMISD::VMOVhr", SDT_VMOVhr>;
 def arm_vmovrh : SDNode<"ARMISD::VMOVrh", SDT_VMOVrh>;
 
+def arm_vfpsetlenzero : SDNode<"ARMISD::VFPSETLENZERO", SDTNone,
+                               [SDNPHasChain]>;
+
 //===----------------------------------------------------------------------===//
 // Pseudos VFP database.
 //

This tablegen definition creates a new record named arm_vfpsetlenzero of type SDNode. This class needs a few parameters: the enumerator we declared above in ARMISelLowering.h, a prototype of the node and a list of attributes. The prototype of the node specifies which operands the node receives and which values it returns. In our case arm_vfpsetlenzero will neither receive any operand nor return anything, so we can use the predefined prototype for this case, called SDTNone. The only attribute we have is SDNPHasChain, which means the node has a chain.

A chain is one of the three dependence kinds that SelectionDAG nodes can represent: data flow (called normal values, that is, the operands and results of a node modelling some operation), control flow (chain, used for things like memory accesses or other dependences that are not about data but about the ordering of operations), and scheduling dependences (called glue, used for things like CPU flags). Dependences are important when the output SelectionDAG is linearised into MachineInstrs because they determine a valid order. When a SelectionDAG node has a chain, it has both an input and an output chain.
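The following is a toy C++ sketch of why such dependences determine a valid order. It is not LLVM's data structure or scheduler: ToyNode, linearise and exampleChain are hypothetical names, and data and chain operands are both modelled as plain "must come first" edges.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A toy node: a name plus the indices of the nodes it depends on.
struct ToyNode {
  std::string Name;
  std::vector<std::size_t> Operands;
};

// Depth-first post-order walk: emit all operands before the node itself.
static void visit(const std::vector<ToyNode> &Nodes, std::size_t N,
                  std::vector<bool> &Emitted,
                  std::vector<std::string> &Order) {
  if (Emitted[N])
    return;
  Emitted[N] = true;
  for (std::size_t Op : Nodes[N].Operands)
    visit(Nodes, Op, Emitted, Order);
  Order.push_back(Nodes[N].Name);
}

// Returns one valid linear order of everything reachable from Root.
std::vector<std::string> linearise(const std::vector<ToyNode> &Nodes,
                                   std::size_t Root) {
  assert(Root < Nodes.size() && "root must be a valid node index");
  std::vector<bool> Emitted(Nodes.size(), false);
  std::vector<std::string> Order;
  visit(Nodes, Root, Emitted, Order);
  return Order;
}

// Three nodes linked only by chains, deliberately stored out of order:
// call -> VFPSETLENZERO -> EntryToken.
std::vector<ToyNode> exampleChain() {
  std::vector<ToyNode> Nodes;
  Nodes.push_back(ToyNode{"call", {2}});
  Nodes.push_back(ToyNode{"EntryToken", {}});
  Nodes.push_back(ToyNode{"VFPSETLENZERO", {1}});
  return Nodes;
}
```

Even though the nodes exchange no data, linearise(exampleChain(), 0) places VFPSETLENZERO before the call because the chain edge demands it. That is the property we rely on when we thread the chain through our new node.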

We still need to make one final change in ARMISelLowering.cpp so we can print the name of the node (used for debugging).

llvm/lib/Target/ARM/ARMISelLowering.cpp

@@ -1840,6 +1840,7 @@ const char *ARMTargetLowering::getTargetNodeName(unsigned Opcode) const {
     MAKE_CASE(ARMISD::CSINC)
     MAKE_CASE(ARMISD::MEMCPYLOOP)
     MAKE_CASE(ARMISD::MEMSETLOOP)
+    MAKE_CASE(ARMISD::VFPSETLENZERO)
 #undef MAKE_CASE
   }
   return nullptr;

Lowering

We have to change two locations in ARMISelLowering.cpp:

ARMTargetLowering::LowerCall, which deals with lowering function calls. We will add a VFPSETLENZERO right before the lowering of a function call.
ARMTargetLowering::LowerReturn, which deals with lowering a function return. We will add a VFPSETLENZERO very early in the lowering of the return.

In both places the code is the same.

llvm/lib/Target/ARM/ARMISelLowering.cpp

@@ -2386,6 +2387,10 @@ ARMTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
       AFI->setArgRegsSaveSize(-SPDiff);
   }
 
+  if (Subtarget->hasVFP2Base()) {
+    Chain = DAG.getNode(ARMISD::VFPSETLENZERO, dl, MVT::Other, Chain);
+  }
+
   if (isSibCall) {
     // For sibling tail calls, memory operands are available in our caller's stack.
     NumBytes = 0;
@@ -3116,6 +3121,10 @@ ARMTargetLowering::LowerReturn(SDValue Chain, CallingConv::ID CallConv,
     DAG.getContext()->diagnose(Diag);
   }
 
+  if (Subtarget->hasVFP2Base()) {
+    Chain = DAG.getNode(ARMISD::VFPSETLENZERO, dl, MVT::Other, Chain);
+  }
+
   // Copy the result values into the output registers.
   for (unsigned i = 0, realRVLocIdx = 0;
        i != RVLocs.size();

We create an arm_vfpsetlenzero in C++ using its enumerator ARMISD::VFPSETLENZERO. It returns only a chain, which has type MVT::Other, and receives the current Chain as input. The new node then replaces the previous chain, so later nodes will be chained after it.

Initial DAG

Now we can do a first experiment and see what the SelectionDAG looks like. Let’s consider the following LLVM IR.

test.ll

declare void @foo(i32 %a, i32 %b)

define void @test_vec(<2 x double> *%pa, <2 x double> *%pb, <2 x double> *%pc) {
  call void @foo(i32 1, i32 3)
  %a = load <2 x double>, <2 x double>* %pa
  %b = load <2 x double>, <2 x double>* %pb
  %c = fadd <2 x double> %a, %b
  store <2 x double> %c, <2 x double> *%pc
  ret void
}

$ llc -mtriple armv6kz-unknown-linux-gnu -mattr=+vfp2 -o - test.ll \
      -debug-only=isel

This will crash because we are still missing a few bits, but we can look at the initial SelectionDAG.

Initial selection DAG: %bb.0 'test_vec:'
SelectionDAG has 33 nodes:
  t0: ch = EntryToken
  t7: i32 = GlobalAddress<void (i32, i32)* @foo> 0
    t10: ch = ARMISD::VFPSETLENZERO t0
  t12: ch,glue = callseq_start t10, TargetConstant:i32<0>, TargetConstant:i32<0>
  t14: i32,ch = CopyFromReg t12, Register:i32 $sp
  t16: ch,glue = CopyToReg t12, Register:i32 $r0, Constant:i32<1>
  t18: ch,glue = CopyToReg t16, Register:i32 $r1, Constant:i32<3>, t16:1
  t21: ch,glue = ARMISD::CALL t18, TargetGlobalAddress:i32<void (i32, i32)* @foo> 0, Register:i32 $r0, Register:i32 $r1, RegisterMask:Untyped, t18:1
  t23: ch,glue = callseq_end t21, TargetConstant:i32<0>, TargetConstant:i32<-1>, t21:1
  t24: i32 = Constant<0>
    t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t26: v2f64,ch = load<(load 16 from %ir.pa, align 8)> t23, t2, undef:i32
    t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t27: v2f64,ch = load<(load 16 from %ir.pb, align 8)> t23, t4, undef:i32
        t29: ch = TokenFactor t26:1, t27:1
        t28: v2f64 = fadd t26, t27
        t6: i32,ch = CopyFromReg t0, Register:i32 %2
      t30: ch = store<(store 16 into %ir.pc, align 8)> t29, t28, t6, undef:i32
    t31: ch = ARMISD::VFPSETLENZERO t30
  t32: ch = ARMISD::RET_FLAG t31

If you check lines 5 and 21 of the dump (nodes t10 and t31) you will see the new node. They receive the input chains t0 and t30, respectively.

t0 is the initial chain of the basic block. The first ARMISD::VFPSETLENZERO has an output chain, t10, which is the input chain of callseq_start, a node used to signal the beginning of a function call. In other words, we set len to zero right before starting the function call sequence.

Similarly, t30 is the input chain of the ARMISD::VFPSETLENZERO that we emit right before returning. A return in ARM is represented by the node ARMISD::RET_FLAG, whose input chain is t31, the output chain of this second ARMISD::VFPSETLENZERO.

Selection

As I mentioned, the initial test above crashes. At this stage, LLVM does not know how to select this input SelectionDAG node ARMISD::VFPSETLENZERO into an output SelectionDAG node. So we have to tell LLVM how to do that.

The easiest way is to add a pattern. A suitable place is ARMInstrVFP.td.

def : Pat<(arm_vfpsetlenzero), (VFPSETLEN 0)>;

However, there is a minor issue. When the output SelectionDAG has been scheduled, the creation of machine instructions (done by InstrEmitter) will mark the implicit Defs as dead (meaning that nobody uses the value set there). This means that this pattern will generate a MachineInstr like this:

  %20:gpr, %21:gprnopc = VFPSETLEN 0, implicit-def dead $fpscr

This confuses later passes in the LLVM pipeline and causes wrong code generation. There are reasons why SelectionDAG does this. In fact, there are a number of situations in which InstrEmitter will not mark implicit definitions as dead, but this is not one of them. Luckily we can do a final fixup of an instruction after it has been emitted.

To do that we first need to change the definition of VFPSETLEN.

llvm/lib/Target/ARM/ARMInstrVFP.td

@@ -2928,7 +2927,8 @@ let Defs = [FPSCR],
     hasNoSchedulingInfo = 1,
     mayLoad = 0,
     mayStore = 0,
-    hasSideEffects = 0 in
+    hasSideEffects = 0,
+    hasPostISelHook = 1 in
 def VFPSETLEN : PseudoInst<(outs GPR:$scratch1, GPRnopc:$scratch2),
                            (ins imm0_7:$len),
                            IIC_fpSTAT, []>,

Now InstrEmitter will call a function called AdjustInstrPostInstrSelection after it has created the machine instruction. Let’s handle the instruction there and make sure the implicit operand is never dead.

llvm/lib/Target/ARM/ARMISelLowering.cpp

@@ -12030,6 +12029,14 @@ void ARMTargetLowering::AdjustInstrPostInstrSelection(MachineInstr &MI,
     return;
   }
 
+  if (MI.getOpcode() == ARM::VFPSETLEN) {
+      // fpscr is never dead.
+      MachineOperand &MO = MI.getOperand(3);
+      assert(MO.isImplicit() && "This is not an implicit operand");
+      MO.setIsDead(false);
+      return;
+  }
+
   const MCInstrDesc *MCID = &MI.getDesc();
   // Adjust potentially 's' setting instructions after isel, i.e. ADC, SBC, RSB,
   // RSC. Coming out of isel, they have an implicit CPSR def, but the optional

This will make the above machine instruction look like this.

  %20:gpr, %21:gprnopc = VFPSETLEN 0, implicit-def $fpscr
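To illustrate the general hazard of a stray dead flag, here is a toy C++ model. It does not use the LLVM classes, and it does not claim to reproduce the specific pass that misbehaves, which the text above does not name. ToyOperand, ToyInstr, looksRemovable and keepImplicitDefAlive are hypothetical names, and treating the scratch results as dead is a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for a machine operand and a machine instruction.
struct ToyOperand {
  bool IsDef;
  bool IsImplicit;
  bool IsDead;
};

struct ToyInstr {
  std::vector<ToyOperand> Operands;
  bool HasSideEffects;
};

// Simplified removal test: an instruction without side effects whose
// definitions are all dead produces nothing that anybody observes.
bool looksRemovable(const ToyInstr &MI) {
  if (MI.HasSideEffects)
    return false;
  for (const ToyOperand &MO : MI.Operands)
    if (MO.IsDef && !MO.IsDead)
      return false;
  return true;
}

// Mirrors the fixup: clear the dead flag of the implicit definition at Idx.
void keepImplicitDefAlive(ToyInstr &MI, std::size_t Idx) {
  assert(Idx < MI.Operands.size() && MI.Operands[Idx].IsImplicit &&
         "expected an implicit operand");
  MI.Operands[Idx].IsDead = false;
}

// Same operand layout as VFPSETLEN: two scratch results, one immediate and
// the implicit definition of fpscr at index 3.
ToyInstr exampleVfpSetLen() {
  ToyInstr MI;
  MI.HasSideEffects = false;
  MI.Operands = {
      {true, false, true},   // scratch1: assumed unused in this toy
      {true, false, true},   // scratch2: assumed unused in this toy
      {false, false, false}, // the immediate len value, not a definition
      {true, true, true},    // fpscr: implicit def, flagged dead
  };
  return MI;
}
```

In this model the instruction looks removable until keepImplicitDefAlive(MI, 3) is applied, after which the live definition of fpscr keeps it in place.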

Results

Now we can compare the output for test.ll above with and without the optimisation. The first VFPSETLEN can be removed.

$ diff -U1000 -u <(llc -mtriple armv6kz-unknown-linux-gnu -mattr=+vfp2 \
                       -arm-optimize-vfp2-disable -o - test.ll)        \
                 <(llc -mtriple armv6kz-unknown-linux-gnu -mattr=+vfp2 \
                       -o - test.ll)

 @ %bb.0:
 	push	{r4, r5, r6, lr}
 	mov	r5, r1
-	vmrs	r1, fpscr
 	mov	r6, r0
 	mov	r0, #1
-	mov	r4, r2
-	bic	r1, r1, #458752
-	vmsr	fpscr, r1
 	mov	r1, #3
+	mov	r4, r2
 	bl	foo
 	vldmia	r5, {d4, d5}
 	mov	r0, #65536
 	vldmia	r6, {d6, d7}
 	vmrs	r1, fpscr
 	bic	r1, r1, #458752
 	orr	r1, r1, r0
 	vmsr	fpscr, r1
 	vadd.f64	d4, d6, d4
 	vstmia	r4, {d4, d5}
 	vmrs	r1, fpscr
 	bic	r1, r1, #458752
 	vmsr	fpscr, r1
 	pop	{r4, r5, r6, pc}

If we move the call right before the return, this time the final VFPSETLEN can be removed.

test-2.ll

declare void @foo(i32 %a, i32 %b)

define void @test_vec(<2 x double> *%pa, <2 x double> *%pb, <2 x double> *%pc) {
  %a = load <2 x double>, <2 x double>* %pa
  %b = load <2 x double>, <2 x double>* %pb
  %c = fadd <2 x double> %a, %b
  store <2 x double> %c, <2 x double> *%pc
  call void @foo(i32 1, i32 3)
  ret void
}

$ diff -U1000 -u <(llc -mtriple armv6kz-unknown-linux-gnu -mattr=+vfp2 \
                       -arm-optimize-vfp2-disable -o - test-2.ll)        \
                 <(llc -mtriple armv6kz-unknown-linux-gnu -mattr=+vfp2 \
                       -o - test-2.ll)

 test_vec:
 	.fnstart
 @ %bb.0:
 	push	{r11, lr}
 	vldmia	r0, {d6, d7}
 	mov	r0, #65536
 	vldmia	r1, {d4, d5}
 	vmrs	r1, fpscr
 	bic	r1, r1, #458752
 	orr	r1, r1, r0
 	mov	r0, #1
 	vmsr	fpscr, r1
 	vadd.f64	d4, d6, d4
 	vstmia	r2, {d4, d5}
 	vmrs	r1, fpscr
 	bic	r1, r1, #458752
 	vmsr	fpscr, r1
 	mov	r1, #3
 	bl	foo
-	vmrs	r1, fpscr
-	bic	r1, r1, #458752
-	vmsr	fpscr, r1
 	pop	{r11, pc}
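Both diffs are consistent with a simple rule: a VFPSETLEN is redundant when len is already known to hold the value being written. The C++ sketch below shows that rule on a straight-line sequence. It is not the pass from the previous chapter, only a rough model under two assumptions taken from the calling convention: len is 0 on function entry, and len is 0 again after a call returns. ToyOp, OpKind and removeRedundantSetLen are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Toy operations. SetLen carries the value written to len, Call stands for
// a call to a function that follows the calling convention.
enum class OpKind { SetLen, Call, Other };

struct ToyOp {
  OpKind Kind;
  unsigned Len; // only meaningful for SetLen
};

// Drops every SetLen that writes the value len is already known to hold.
std::vector<ToyOp> removeRedundantSetLen(const std::vector<ToyOp> &Ops) {
  std::vector<ToyOp> Result;
  unsigned KnownLen = 0; // assumption: the caller entered us with len == 0
  for (const ToyOp &Op : Ops) {
    if (Op.Kind == OpKind::SetLen) {
      assert(Op.Len < 8 && "len is a 3-bit field");
      if (Op.Len == KnownLen)
        continue; // redundant: len already has this value
      KnownLen = Op.Len;
    } else if (Op.Kind == OpKind::Call) {
      KnownLen = 0; // assumption: the callee returns with len == 0
    }
    Result.push_back(Op);
  }
  return Result;
}
```

With the shape of test.ll (set 0, call, set 1, vector add, set 0) only the first SetLen is dropped. With the shape of test-2.ll (set 1, vector add, set 0, call, set 0) only the last one is, matching the two diffs above.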

However, if we move the call to some other position, the backend crashes. The reason is that the compiler wants to preserve the value of the vector registers that are live across the call. To do this it needs to store those vector registers on the stack, but it does not know how to do that.

In the next installment we will teach the compiler to spill, reload and copy vector registers.