Fun with vectors in the Raspberry Pi 1 - Part 7
We finished the last installment of this series by mentioning that the compiler cannot yet copy, load or store a vector. Today we will address this.
Spill code and copies
LLVM, like many other compiler infrastructures, uses the concept of virtual registers. Virtual registers are abstractions over the physical registers available in an architecture (note to architects: physical registers in compiler parlance mean architectural registers, not renaming registers).
During code generation in Machine IR (MIR), LLVM can combine the concept of virtual register with that of Static Single Assignment (SSA). This is very convenient for optimisations and transformations that happen on MIR.
The number of virtual registers is unbounded. However, the number of physical registers available in a register-based architecture is finite. A process called register allocation assigns a physical register to every virtual register. Due to the finiteness of the physical registers, register allocation may run out of available physical registers. In these cases the compiler has to emit spill code, in which a physical register already in use is temporarily preserved somewhere else in memory, typically on the stack of the program. The details are a bit more complex than this, but we only need to know that saving a register to that memory is typically called a spill and restoring it afterwards is called a reload.
There is another case where LLVM will emit spill code: calling conventions can require that registers be preserved around function calls or upon entry/exit of a function. While these are not considered spill code in the traditional sense, LLVM handles both cases with the same mechanism.
Before register allocation can run, however, SSA form must be removed. The fundamental property of SSA is that every virtual register in the program is defined only once, though it can be used many times. We quickly run into the issue that the value of a virtual register may depend on control flow: consider a program that uses a variable whose value may have conditionally changed in an if statement. SSA represents those values converging due to control flow using phi instructions. Phi instructions are not executable, they are an abstraction, and they must be removed at some point. Without going into much detail, when removing phi instructions the compiler introduces copies to implement the semantics of the phi instruction. It may not be obvious, but it is precisely this insertion of copies that destroys the SSA property of the code: the same virtual register will now be defined by different instructions.
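As a tiny illustration (a hand-written sketch, not output from this series' compiler), this is how a value that is 1 or 2 depending on a condition looks in LLVM IR:

```llvm
define i32 @select_value(i1 %cond) {
entry:
  br i1 %cond, label %then, label %else

then:                                   ; on this path the value is 1
  br label %end

else:                                   ; on this path the value is 2
  br label %end

end:
  ; The phi merges the two incoming values. Removing it means placing
  ; a copy of the right value in each predecessor block.
  %x = phi i32 [ 1, %then ], [ 2, %else ]
  ret i32 %x
}
```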
Load and store
There are a couple of functions, storeRegToStackSlot and loadRegFromStackSlot, in llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp, responsible for emitting the machine instruction used to spill and reload, respectively. One issue here is that only one instruction is allowed, due to expectations in the spilling functions used by the register allocator.
Ideally we would like to use vstmdia / vstmsia for double and single-precision float stores respectively, and similarly vldmdia / vldmsia for loads. However, these instructions expect a contiguous range of registers, while we allow representing vectors over non-contiguous registers that wrap around within the same vector bank (say d7 and d4). Recall also the limitation mentioned above: we must use a single instruction for the spill and the reload.
We can address this, again, using pseudo-instructions that we will later lower in the best way possible. First we will define two pseudo-instructions, VFPSPILL and VFPRELOAD, for each vector type.
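A minimal sketch of what these definitions could look like, following the style of the existing VSTMQIA / VLDMQIA spill pseudos of the ARM backend (the operand choices here are assumptions, and only the double-vector case is shown):

```tablegen
// In ARMInstrVFP.td. DPRx2 is the register class of our 2 x double
// vectors; the SPRx4 (4 x float) definitions would be analogous.
let mayStore = 1, hasSideEffects = 0 in
def VFPSPILL : PseudoVFPLdStM<(outs), (ins DPRx2:$src, GPR:$Rn),
                              IIC_fpStore_m, "", []>;

let mayLoad = 1, hasSideEffects = 0 in
def VFPRELOAD : PseudoVFPLdStM<(outs DPRx2:$dst), (ins GPR:$Rn),
                               IIC_fpLoad_m, "", []>;
```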
Now we can extend storeRegToStackSlot and loadRegFromStackSlot so they emit these instructions when we have to store/load a register of the DPRx2 or SPRx4 register classes.
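A sketch of the store side (simplified; the reload side is symmetric and uses VFPRELOAD in loadRegFromStackSlot):

```cpp
// In ARMBaseInstrInfo::storeRegToStackSlot.
if (ARM::DPRx2RegClass.hasSubClassEq(RC)) {
  MachineFunction &MF = *MBB.getParent();
  MachineFrameInfo &MFI = MF.getFrameInfo();
  DebugLoc DL;
  if (I != MBB.end())
    DL = I->getDebugLoc();
  // Memory operand describing the stack slot being written.
  MachineMemOperand *MMO = MF.getMachineMemOperand(
      MachinePointerInfo::getFixedStack(MF, FI), MachineMemOperand::MOStore,
      MFI.getObjectSize(FI), MFI.getObjectAlign(FI));
  BuildMI(MBB, I, DL, get(ARM::VFPSPILL))
      .addReg(SrcReg, getKillRegState(isKill))
      .addFrameIndex(FI)
      .add(predOps(ARMCC::AL))
      .addMemOperand(MMO);
  return;
}
```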
As you can see, we basically encapsulate the register being spilled along with the abstraction of the stack slot, called a frame index in LLVM. Finally we make sure the memory operands are there: these are extra bits of information that LLVM can use to reason about memory accesses.
Lowering
Those pseudo-instructions will need some lowering. Again, this happens in ARMExpandPseudoInsts.cpp.
Our strategy will be fairly simple: if we can use the multiple memory access instructions vldmdia / vldmsia / vstmdia / vstmsia, we will use them. If we cannot, we will emit a bunch of simple loads and stores.
Note that for doubles this is enough, because those vectors only have two elements. For vectors of floats, though, there are more cases: a) all four registers are consecutive (e.g. s8, s9, s10, s11); b) three are consecutive (e.g. s13, s14, s15, s8); c) two are consecutive and the other two are also consecutive, but not with respect to the first two (e.g. s14, s15, s8, s9). For simplicity we will ignore cases b) and c), but they are not too difficult to implement conceptually.
We will look at the implementation for double vectors; float vectors would be handled in a similar way. We first check whether the registers are consecutive. If they are, then we can use vldmdia or vstmdia.
If they are not consecutive, then we need to emit explicit loads and stores, vldrd and vstrd. Some complexity comes in the form of computing the right memory operand for the new loads and stores, as we need to adjust the offset and the size of the data being accessed. This is encapsulated in convenient lambdas, EmitLoad and EmitStore. One important thing to note is that the vldrd and vstrd instructions expect an offset operand in units of 4 bytes, hence .addImm(Offset / 4). In the memory operand, though, the offset is specified in bytes.
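A condensed sketch of the store expansion (helper and operand details here are assumptions; the real code has more bookkeeping, and the load expansion mirrors it):

```cpp
// Expanding VFPSPILL in ARMExpandPseudo::ExpandMI (ARMExpandPseudoInsts.cpp).
case ARM::VFPSPILL: {
  Register Src = MI.getOperand(0).getReg();
  Register D0 = TRI->getSubReg(Src, ARM::dsub_0);
  Register D1 = TRI->getSubReg(Src, ARM::dsub_1);
  MachineMemOperand *MMO = *MI.memoperands_begin();

  if (TRI->getEncodingValue(D1) == TRI->getEncodingValue(D0) + 1) {
    // Consecutive pair: a single multiple memory access store.
    BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(ARM::VSTMDIA))
        .add(MI.getOperand(1)) // base
        .add(predOps(ARMCC::AL))
        .addReg(D0)
        .addReg(D1)
        .addMemOperand(MMO);
  } else {
    // Non-consecutive pair: one vstrd per element. VSTRD takes its
    // offset in units of 4 bytes; the memory operand uses bytes.
    auto EmitStore = [&](Register Reg, unsigned Offset) {
      MachineFunction &MF = *MBB.getParent();
      MachineMemOperand *ElemMMO =
          MF.getMachineMemOperand(MMO, Offset, /*Size=*/8);
      BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(ARM::VSTRD))
          .addReg(Reg)
          .add(MI.getOperand(1)) // base
          .addImm(Offset / 4)
          .add(predOps(ARMCC::AL))
          .addMemOperand(ElemMMO);
    };
    EmitStore(D0, /*Offset=*/0);
    EmitStore(D1, /*Offset=*/8);
  }
  MI.eraseFromParent();
  return true;
}
```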
Results
Let’s see this in action. In the last installment we said that doing a call while vectors were live caused the compiler to crash because it did not know how to spill and reload the vectors.
An example of LLVM IR that triggers this behaviour looks like this (a minimal sketch; the exact names are illustrative):
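```llvm
declare void @foo()

define void @test(<2 x double>* %pa, <2 x double>* %pb, <2 x double>* %pc) {
  %a = load <2 x double>, <2 x double>* %pa
  %b = load <2 x double>, <2 x double>* %pb
  ; %a and %b are live across the call, so they must be spilled
  ; before it and reloaded after it.
  call void @foo()
  %c = fadd <2 x double> %a, %b
  store <2 x double> %c, <2 x double>* %pc
  ret void
}
```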
Compiling it now works, and in the output we can see how the compiler explicitly spills and reloads the vectors around the call to foo. Note that the spills have as source the same pair d4, d5, because the compiler does not need another pair at that point. However, the reloads do have to use different pairs, otherwise the addition (vadd.f64) would not be possible.
It is a bit difficult to observe the use of vstrd and vldrd for the non-consecutive case. One thing we can do is manually modify the Machine IR so it uses d7 and d4 instead of d4, d5. To get the Machine IR, we must stop before finishing the code generation pipeline. A convenient location is right before the arm-ldst-opt pass, which is responsible for optimising several memory accesses into multiple memory accesses.
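For instance (the exact target flags may differ from the ones used in this series):

```
$ llc -mtriple=armv6-unknown-linux-gnueabihf -stop-before=arm-ldst-opt \
      test.ll -o test.mir
```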
Now we modify the test.mir file. The change is a bit noisy: occurrences of the physical register $d6_d7x2 are replaced with the physical register $d8_d9x2, and occurrences of the physical register $d4_d5x2 are replaced with the physical register $d7_d4x2. This last register is non-contiguous.
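Something like the following does the trick:

```
$ sed -i -e 's/\$d6_d7x2/$d8_d9x2/g' -e 's/\$d4_d5x2/$d7_d4x2/g' test.mir
```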
Now we can restart the code generation pipeline using the modified Machine IR file, right before arm-ldst-opt (note the -start-before flag).
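Again, the flags here are illustrative:

```
$ llc -mtriple=armv6-unknown-linux-gnueabihf -start-before=arm-ldst-opt \
      test.mir -o test.s
```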
Now the spill and reload of the non-contiguous pair show up in the output as individual vstr and vldr instructions, rather than a single multiple memory access instruction.
One observation that we could make here is that, ideally, we should prefer not to pick these non-consecutive registers: both the spills and the reloads that involve them are costlier in number of instructions. This is something we could influence in ARMRegisterInfo.td.
Copies
We mentioned above that the compiler may have to introduce copies when removing SSA form. But the backend does not know how to copy our vectors yet.
Fortunately the ARM backend is very well implemented, so adding this is relatively easy to do in ARMBaseInstrInfo.cpp. The code already knows how to copy subregisters, so we only specify the instruction we want to use, vmovd or vmovs, and how many subregisters are involved starting from the first subregister.
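A sketch of the relevant addition to ARMBaseInstrInfo::copyPhysReg (variable names follow the existing code in that function, which already contains a generic loop that copies register tuples subregister by subregister):

```cpp
// Opcode used to copy one subregister, index of the first subregister,
// and how many subregisters make up the vector.
unsigned Opc = 0, BeginIdx = 0, SubRegs = 0;
if (ARM::DPRx2RegClass.contains(DestReg, SrcReg)) {
  Opc = ARM::VMOVD;   // each double is copied with vmov.f64
  BeginIdx = ARM::dsub_0;
  SubRegs = 2;
} else if (ARM::SPRx4RegClass.contains(DestReg, SrcReg)) {
  Opc = ARM::VMOVS;   // each float is copied with vmov.f32
  BeginIdx = ARM::ssub_0;
  SubRegs = 4;
}
// The generic code that follows in copyPhysReg emits SubRegs copies
// using Opc, choosing an order that never clobbers a source
// subregister before it has been read.
```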
Results
In order to force copies to be emitted, we need to make sure our LLVM IR code has control flow that mandates them.
The following LLVM IR causes this (again a sketch, with names matching the description below):
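```llvm
define void @test(i32 %dis, <2 x double>* %pa, <2 x double>* %pb,
                  <2 x double>* %pc) {
entry:
  %a = load <2 x double>, <2 x double>* %pa
  %b = load <2 x double>, <2 x double>* %pb
  %cmp = icmp sgt i32 %dis, 4
  br i1 %cmp, label %block1, label %block2

block1:                                 ; %dis > 4: add the vectors
  %add = fadd <2 x double> %a, %b
  br label %block3

block2:                                 ; otherwise: multiply them
  %mul = fmul <2 x double> %a, %b
  br label %block3

block3:                                 ; merge the two incoming values
  %p = phi <2 x double> [ %add, %block1 ], [ %mul, %block2 ]
  store <2 x double> %p, <2 x double>* %pc
  ret void
}
```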
Basically we check if the first parameter %dis is larger than 4. If it is, we add vectors %a and %b (block1). Otherwise we multiply them (block2). The result is the value we store in the pointer %pc (block3). The value, however, flows in from two different blocks, hence a phi instruction appears to merge the two incoming values. The merged value %p is the one stored.
We can look at the Machine IR right before the postrapseudos pass, which is used to expand the generic pseudo-instructions used by the register allocator (ra). Those pseudos include the generic COPY instruction.
```
 1  body:             |
 2    bb.0.block3:
 3      liveins: $r0, $r1, $r2, $r3, $d8, $d9
 4
 5      $sp = frame-setup VSTMDDB_UPD $sp, 14 /* CC::al */, $noreg, killed $d8, killed $d9
 6      frame-setup CFI_INSTRUCTION def_cfa_offset 16
 7      frame-setup CFI_INSTRUCTION offset $d9, -8
 8      frame-setup CFI_INSTRUCTION offset $d8, -16
 9      renamable $d4 = VLDRD renamable $r2, 0, 14 /* CC::al */, $noreg, implicit-def $d4_d5x2 :: (load 8 from %ir.pb)
10      renamable $d5 = VLDRD killed renamable $r2, 2, 14 /* CC::al */, $noreg, implicit killed $d4_d5x2, implicit-def $d4_d5x2 :: (load 8 from %ir.pb + 8)
11      renamable $d8 = VLDRD renamable $r1, 0, 14 /* CC::al */, $noreg, implicit-def $d8_d9x2 :: (load 8 from %ir.pa)
12      renamable $d9 = VLDRD killed renamable $r1, 2, 14 /* CC::al */, $noreg, implicit killed $d8_d9x2, implicit-def $d8_d9x2 :: (load 8 from %ir.pa + 8)
13      dead renamable $r1, dead renamable $r2 = VFPSETLEN 1, implicit-def $fpscr
14      renamable $d6_d7x2 = VMULDx2 renamable $d8_d9x2, renamable $d4_d5x2, 14 /* CC::al */, $noreg, implicit $fpscr
15      renamable $d4_d5x2 = VADDDx2 killed renamable $d8_d9x2, killed renamable $d4_d5x2, 14 /* CC::al */, $noreg, implicit $fpscr
16      CMPri killed renamable $r0, 4, 14 /* CC::al */, $noreg, implicit-def $cpsr
17      Bcc %bb.2, 11 /* CC::lt */, killed $cpsr
18
19    bb.1.select.false:
20      liveins: $r3, $d6_d7x2
21
22      renamable $d4_d5x2 = COPY killed renamable $d6_d7x2
23
24    bb.2.select.end:
25      liveins: $r3, $d4_d5x2
26
27      VSTRD renamable $d4, renamable $r3, 0, 14 /* CC::al */, $noreg :: (store 8 into %ir.pc)
28      VSTRD renamable $d5, killed renamable $r3, 2, 14 /* CC::al */, $noreg, implicit killed $d4_d5x2 :: (store 8 into %ir.pc + 8)
29      dead renamable $r0, dead renamable $r1 = VFPSETLEN 0, implicit-def $fpscr
30      $sp = frame-destroy VLDMDIA_UPD $sp, 14 /* CC::al */, $noreg, def $d8, def $d9
31      BX_RET 14 /* CC::al */, $noreg
```
If you look closely at the code, you can see the compiler has changed it in a way that speculatively executes both the fmul and the fadd (the VMULDx2 and VADDDx2 instructions in lines 14 and 15) in the entry block (shown as bb.0.block3, but the relevant bit is bb.0). Each operation uses a different register ($d6_d7x2 and $d4_d5x2), and one of them ($d4_d5x2) is the one that ends up being stored in the final block (bb.2.select.end, lines 27 and 28).
Then the comparison happens and, if it turns out false (block bb.1.select.false, lines 19 to 22), we copy the contents of $d6_d7x2, which contains the multiplication result, into $d4_d5x2 (line 22).
If we look at the generated assembly, we can see the copies we explicitly introduced, emitted as vmov.f64 instructions.
And this is all for this installment. In the next one we will make enough changes to the compiler to convince the existing loop vectorizer of LLVM that it can vectorize some simple code in VFPv2.