In the last installment I mentioned we could start looking at enabling
the vectoriser in the compiler. However, when I did that, I realised some
benchmarks were giving weird results. I had made a mistake with copies,
so let’s fix that first.
Source of the problem: copies and vector length
In the first installment of this series we discussed instructions that
use the vector length. Unfortunately I missed an important one: vmov, used
to copy vector registers. Contrary to what I had incorrectly assumed, this
instruction does use the vector length, so we do not have to copy every single
element of the vector one by one. Instead, we need to make sure the vector
length is correctly set before the copy.
The error I saw was that the compiler intended a copy between two scalar
floating-point registers, but because the vector length was not 1, we were
overwriting more registers than expected (the registers involved happened to
be in banks other than the first one). That was an interesting bug to find!
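To make the failure mode concrete, here is a tiny model (plain C++ for illustration, not compiler code) of how a short-vector vmov behaves on single-precision registers. The bank size of eight and the wrap-within-bank behaviour follow the VFP short-vector rules; the register file is just an array, and double-precision banks are left out for brevity.

```cpp
#include <array>
#include <cassert>

using RegFile = std::array<float, 32>;  // s0..s31

// Simplified vmov: when the destination is outside bank 0, the copy moves
// `len` consecutive elements, wrapping within the 8-register bank, instead
// of a single register.
void vmov(RegFile &regs, int dst, int src, int len) {
  const int bank = 8;
  if (dst / bank == 0) len = 1;  // destinations in the first bank are scalar
  for (int i = 0; i < len; ++i) {
    int d = (dst / bank) * bank + (dst % bank + i) % bank;  // wrap in bank
    int s = (src / bank) * bank + (src % bank + i) % bank;
    regs[d] = regs[s];
  }
}
```

With len set to 4, a copy that was meant to move a single scalar between banked registers silently clobbers three neighbours, which is exactly the kind of corruption the benchmarks were hitting.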
However, this creates a complication: if we need to change the vector length
to implement a vector copy between registers, we are modifying state that the
rest of the code relies on. This means that, to be safe, we need to preserve
the current vector length, set the one the copy requires, do the copy, and
restore the original vector length.
My initial implementation did this, but it turns out the resulting code is
not very smart. We may already be at the right vector length when emitting
the copy, in which case we do not have to do the whole preserve-copy-restore
dance. How can we do this in a way that still lets us remove unnecessary
changes to the vector length?
New design
Currently we only have a single VFPSETLEN pseudo instruction that we are able
to optimise before register allocation.
Copies are introduced as part of the process that leads to register
allocation, when phi instructions are removed. If we want to optimise changes
to the vector length due to copies, we need to run our pass after register
allocation.
In fact, it would be a good idea to also run it before register allocation, to
reduce the register pressure implicitly created by the redundant instructions.
For simplicity we will only run it after register allocation.
So, when emitting copies, we want to do the preserve-copy-restore dance in a
way that can be optimised later.
So I did the following changes:
VFPSETLEN is now VFPSETLENi. The i designates it receives an immediate for the vector length.
A new VFPSETLENr pseudo instruction that sets the length as preserved in a GPR register (the r means register).
A new VFPGETLENr pseudo instruction that returns the vector length in a GPR.
For simplicity, VFPSETLENr and VFPGETLENr will overwrite the whole fpscr
register, but the only change we will ever make is to the vector length, so
this should be reasonable. Technically we are touching more bits than the ones
we claim to, but ARMv6 does not offer an efficient way to update only those
bits, hence the coarser approach.
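For reference, the vector length lives in the LEN field of fpscr, bits 16 to 18, encoded as the length minus one. The sketch below (helper names are mine, not from the post) shows the field arithmetic that an exact read-modify-write of only those bits would involve; the pseudo instructions above deliberately skip this and clobber the whole register.

```cpp
#include <cassert>
#include <cstdint>

// fpscr LEN field: bits 18:16, holding (vector length - 1).
constexpr uint32_t kLenShift = 16;
constexpr uint32_t kLenMask = 0x7u << kLenShift;

// Read-modify-write of just the LEN field, leaving other fpscr bits alone.
uint32_t setLength(uint32_t fpscr, int len) {
  return (fpscr & ~kLenMask) | (uint32_t(len - 1) << kLenShift);
}

int getLength(uint32_t fpscr) {
  return int((fpscr & kLenMask) >> kLenShift) + 1;
}
```

Doing this in real code would take a vmrs, a bit-twiddling sequence on the GPR, and a vmsr for every update, which is why clobbering the whole register is the pragmatic choice here.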
When emitting a copy, we will first emit a VFPGETLENr to keep the current
vector length in a GPR (say rn). Then we will emit a VFPSETLENi to set the
right length for the floating-point data type being copied (1 for scalars, 2
or 4 for vectors). Then the vmov itself. Finally, we will emit a VFPSETLENr
using rn to restore everything as it was before the copy.
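The emitted order can be sketched as follows; this is a string-based stand-in for illustration (the real code builds MachineInstrs), with the function name and signature being my own invention.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the copy lowering described above, emitting the
// preserve-copy-restore sequence as plain text.
std::vector<std::string> emitFPCopy(const std::string &dst,
                                    const std::string &src, int len,
                                    const std::string &scratch) {
  return {
      "VFPGETLENr " + scratch,              // save current length in a GPR
      "VFPSETLENi " + std::to_string(len),  // 1 scalar, 2 or 4 vector
      "vmov " + dst + ", " + src,           // the copy itself
      "VFPSETLENr " + scratch,              // restore the saved length
  };
}
```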
Because those are pseudo instructions, we need to expand them at some point:
VFPGETLENr will be expanded to a regular vmrs instruction, and VFPSETLENr to
its counterpart vmsr. VFPSETLENi is expanded as usual.
Using pseudo instructions that have trivial expansions is totally intentional:
we are only using these instructions for copies so we can optimise them.
Copies
Let’s overhaul copies to be correct.
There is a bit of a complication here that we will not fully address. In
order to preserve the vector length (via fpscr) we need extra registers
around. Because the copy expansion happens after register allocation, we
cannot rely on virtual registers. So we need to resort to a class called
RegScavenger (the register scavenger), which will try to find free registers
for us.
However, the register scavenger may not be able to find free registers. In
that case we would have to pick victim registers, manually spill them onto
the stack, use them, and finally reload them. We do not do this here, but a
more complete implementation would have to, or it risks being unable to
compile code with high register pressure.
Another important detail is that not all copies of floating-point values
need a change to the vector length. If the destination register of the copy
is in the first bank, the operation is always scalar. This helps with the
cases where we are copying scalars (vectors will never be register allocated
in the first bank).
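That check is cheap; a minimal helper along these lines (the name and signature are illustrative, assuming s0-s7 and d0-d3 form the first bank) would suffice:

```cpp
#include <cassert>

// A destination in the first bank (s0-s7 for singles, d0-d3 for doubles)
// always executes as a scalar operation, so such copies never need the
// preserve-copy-restore sequence.
bool isAlwaysScalarDest(bool isDouble, unsigned regNum) {
  return isDouble ? regNum < 4 : regNum < 8;
}
```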
First let’s remove the wrong code.
Now let’s extend the part that originally deals with scalar copies of
floating-point values (the SPR and DPR register classes). We will add cases
for SPRx4 and DPRx4, and also note the vector length required for each of
the copies.
To help us here, we use a new class, RAIISetVLEN, which, given the registers
being copied and the required VectorLength, will emit the proper VFPGETLENr,
VFPSETLENi sequence. We use a RAII pattern because we may have to restore the
vector length using VFPSETLENr once the copy is done. If this class tells us
fpscr is being preserved, we add an explicit use of fpscr to the vmov
instruction.
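The idea can be sketched as a small guard class; this is an assumed simplification of RAIISetVLEN (a string emitter stands in for real MachineInstr construction, and scavenging the scratch GPR is left out):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// RAII guard: the constructor saves and sets the vector length only when
// the copy actually needs it; the destructor restores the saved value.
class RAIISetVLEN {
  std::vector<std::string> &out;
  std::string scratch;
  bool preserved = false;

public:
  RAIISetVLEN(std::vector<std::string> &out, int len, bool scalarDest,
              std::string scratchReg)
      : out(out), scratch(std::move(scratchReg)) {
    if (scalarDest) return;  // first-bank destination: always scalar
    out.push_back("VFPGETLENr " + scratch);
    out.push_back("VFPSETLENi " + std::to_string(len));
    preserved = true;
  }
  // Caller uses this to add the explicit fpscr use to the vmov.
  bool isPreserving() const { return preserved; }
  ~RAIISetVLEN() {
    if (preserved) out.push_back("VFPSETLENr " + scratch);
  }
};
```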
Note that there are conditions under which the register scavenger would fail
to find a register. A more complete implementation must handle those cases.
Optimising
As we mentioned above, we want the optimisation pass to run after register
allocation.
To optimise these new instructions, we have to extend our existing pass. When
a VFPGETLENr is seen, we remember which register is keeping the current
length. If that register is preserved until a later VFPSETLENr, we know we
are restoring the same length as before. We can use this knowledge when
analysing a single basic block, so we do not lose track of the length value
across a VFPSETLENr, and then exploit it again when removing redundant
instructions.
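A toy version of that basic-block cleanup might look as follows. This is an assumed simplification of the real pass: the Inst struct stands in for MachineInstr, clobbers of the scratch GPR between save and restore are ignored, and dead VFPGETLENr instructions left behind are not cleaned up.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Inst {
  std::string op;
  std::string arg;
};

// Walk one basic block, tracking the known vector length and which GPR
// saved it, then drop VFPSETLENi/VFPSETLENr instructions that would not
// change the length.
std::vector<Inst> removeRedundantLenChanges(const std::vector<Inst> &bb) {
  std::vector<Inst> out;
  int curLen = 0;                       // 0 means unknown on block entry
  std::map<std::string, int> savedLen;  // GPR -> length saved into it
  for (const Inst &I : bb) {
    if (I.op == "VFPGETLENr") {
      savedLen[I.arg] = curLen;         // this GPR now holds curLen
      out.push_back(I);
    } else if (I.op == "VFPSETLENi") {
      int len = std::stoi(I.arg);
      if (len == curLen) continue;      // already the right length: drop
      curLen = len;
      out.push_back(I);
    } else if (I.op == "VFPSETLENr") {
      auto it = savedLen.find(I.arg);
      if (it != savedLen.end() && it->second != 0 && it->second == curLen)
        continue;                       // restores the length already set: drop
      curLen = (it != savedLen.end()) ? it->second : 0;
      out.push_back(I);
    } else {
      out.push_back(I);                 // anything else passes through
    }
  }
  return out;
}
```

On a block that already set the length to 1 before the copy, both the inner VFPSETLENi and the restoring VFPSETLENr fall away, which mirrors the improvement shown in the results below.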
Results
With all the changes above in place, we can revisit the example of copies
from the last installment and see the code we emit now.
This is really good as the branch is now gone.
We have to look at the MIR to understand what happened. First let’s see the MIR
right before the optimisation pass.
The optimisation pass identifies the unnecessary change so we can reuse the existing
vector length. Now line 22 is the only instruction required to do the copy.
An existing later pass of the ARM backend identifies this pattern and adds a
predicate to the vmov (which becomes vmovge).
A bit of reflection on the current approach: we have gone through register
allocation with instructions that we later removed. This means that those
instructions may have increased the register pressure on the rest of the
code. Unfortunately, the expansion currently happens after register
allocation (though there are a couple of target-specific hooks that might be
worth looking into), so there is not much we can do for now.
Now I think we can move on to enabling vectorisation :)