In the last chapter we devised a way to tame the issue with fpscr. Today we are going to complete the code generation bits that we are still missing so we can start emitting code.

Expand VFPSETLEN

Given that last week we finished with a VFPSETLEN instruction being emitted I guess it makes sense we expand this first.

The easiest way to achieve this is extending the file llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp which contains the implementation of a pass, running after register allocation, intended to expand pseudo instructions.

If you check the first installment of this series, you will see that the len field of fpscr is a field of 3 bits located in bits 18, 17 and 16. Setting the length mostly means reading the fpscr, which we can do using the instruction vmrs, change the bits and write them back into fpscr using the vmsr instruction.

The complex part of this process is changing the bits. We need to ensure the len field has the bits we want. We can do this in general masking what we obtained from the vmrs instruction with ~(0x7 << 16). This will clear the three bits of the len field and then we can do a bitwise or to set the precise operation we want. All this is what explained why we needed two scratch output registers as outputs of VFPSETLEN.

bool ARMExpandPseudo::ExpandMI(MachineBasicBlock &MBB,
                               MachineBasicBlock::iterator MBBI,
                               MachineBasicBlock::iterator &NextMBBI) {
    // ...
    case ARM::VFPSETLEN: {
      Register Scratch1 = MI.getOperand(0).getReg();
      Register Scratch2 = MI.getOperand(1).getReg();
      DebugLoc dl = MI.getDebugLoc();

      // Scratch2 ← FPSCR
      BuildMI(MBB, MBBI, dl, TII->get(ARM::VMRS), Scratch2)
          .add(predOps(ARMCC::AL));
      // Scratch1 ← ~(0x7 << 16)
      BuildMI(MBB, MBBI, dl, TII->get(ARM::MVNi), Scratch1)
          .addImm(0x7 << 16)
          .add(predOps(ARMCC::AL))
          .add(condCodeOp());
      // Scratch2 ← and Scratch2, Scratch1
      BuildMI(MBB, MBBI, dl, TII->get(ARM::ANDrr), Scratch2)
          .addUse(Scratch2)
          .addUse(Scratch1)
          .add(predOps(ARMCC::AL))
          .add(condCodeOp());
      // Scratch1 ← (Length << 16)
      BuildMI(MBB, MBBI, dl, TII->get(ARM::MOVi), Scratch1)
          .addImm(MI.getOperand(2).getImm() << 16)
          .add(predOps(ARMCC::AL))
          .add(condCodeOp());
      // Scratch2 ← or Scratch2, Scratch1
      BuildMI(MBB, MBBI, dl, TII->get(ARM::ORRrr), Scratch2)
          .addUse(Scratch2)
          .addUse(Scratch1)
          .add(predOps(ARMCC::AL))
          .add(condCodeOp());
      // FPSCR ← Scratch2
      BuildMI(MBB, MBBI, dl, TII->get(ARM::VMSR))
          .addUse(Scratch2)
          .add(predOps(ARMCC::AL));

      MI.eraseFromParent();
      return true;
    }
  }
}

The code may look long but it should be relatively straightforward to follow. There are two unobvious elements due to the AArch32 Arm ISA itself. Most instructions can be predicated, but we want them to be always executed so we set it to always using .add(predOps(ARMCC::AL)). Many Arm instructions may optionally update the cpsr, because we do not want to do this we state that using .add(consCodeOps()).

Other than that, the code above is basically constructing the instruction and finally removing the pseudo instruction. Opcode ARM::MVNi represents a mvn instruction that uses an immediate as an input, ARM::ADDrr is an add instruction with two registers as inputs. ARM::ORRrr is an orr (a bitwise or) with two registers as inputs.

One very simple optimization we can do happens when the vector length is 1 (encoded as len). In that case we can use the bic instruction whose task is precisely clearing consecutive bits in a register. So we can improve the emitted instructions like below.

bool ARMExpandPseudo::ExpandMI(MachineBasicBlock &MBB,
                               MachineBasicBlock::iterator MBBI,
                               MachineBasicBlock::iterator &NextMBBI) {
    // ...
    case ARM::VFPSETLEN: {
      Register Scratch1 = MI.getOperand(0).getReg();
      Register Scratch2 = MI.getOperand(1).getReg();
      DebugLoc dl = MI.getDebugLoc();

      // Scratch2 ← FPSCR
      BuildMI(MBB, MBBI, dl, TII->get(ARM::VMRS), Scratch2)
          .add(predOps(ARMCC::AL));
      // Scratch2 ← clear bits 16, 17, 18
      BuildMI(MBB, MBBI, dl, TII->get(ARM::BICri), Scratch2)
          .addUse(Scratch2)
          .addImm(0x7 << 16)
          .add(predOps(ARMCC::AL))
          .add(condCodeOp());
      if (MI.getOperand(2).getImm()) {
        // Scratch1 ← (Length << 16)
        BuildMI(MBB, MBBI, dl, TII->get(ARM::MOVi), Scratch1)
            .addImm(MI.getOperand(2).getImm() << 16)
            .add(predOps(ARMCC::AL))
            .add(condCodeOp());
        // Scratch2 ← or Scratch2, Scratch1
        BuildMI(MBB, MBBI, dl, TII->get(ARM::ORRrr), Scratch2)
            .addUse(Scratch2)
            .addUse(Scratch1)
            .add(predOps(ARMCC::AL))
            .add(condCodeOp());
      }
      // FPSCR ← Scratch2
      BuildMI(MBB, MBBI, dl, TII->get(ARM::VMSR))
          .addUse(Scratch2)
          .add(predOps(ARMCC::AL));

      MI.eraseFromParent();
      return true;
    }
  }
}

There is an instruction bfi in a special variant of Armv6, called Armv6T2, and as of Armv7-A similar to bic to insert arbitrary values into bitfields. Unfortunately the core of the Raspberry Pi 1 does not implement such instruction, hence the dance with mov and orr.

Now we can see what happens with our last example of a scalar addition.

scalar.ll

define void @test_scalar(double *%pa, double *%pb, double *%pc) { 
  %a = load double, double* %pa
  %b = load double, double* %pb
  %c = fadd double %a, %b
  store double %c, double *%pc
  ret void
}

$ clang -O2 -S -o - --target=armv6kz-unknown-linux-gnu -mfloat-abi=hard scalar.ll

test_scalar:
	vldr	d0, [r1]
	vldr	d1, [r0]
	vmrs	r1, fpscr
	bic	r1, r1, #458752
	vmsr	fpscr, r1
	vadd.f64	d0, d1, d0
	vstr	d0, [r2]
	bx	lr

Above we can see the bic instruction in action because here len is set to zero.

We can manually modify the file scalar.mir we generated at the end of the previous installment so it does VFPSETLEN 1 instead of VFPSETLEN 0. We get this:

$ llc -mtriple=armv6kz-unknown-linux-gnu -mattr=+vfp2 \
      -start-after=finalize-isel scalar.mir -o -

test_scalar:
	.fnstart
@ %bb.0:
	vldr	d1, [r0]
	mov	r0, #65536
	vldr	d0, [r1]
	vmrs	r1, fpscr
	bic	r1, r1, #458752
	orr	r1, r1, r0
	vmsr	fpscr, r1
	vadd.f64	d0, d1, d0
	vstr	d0, [r2]
	bx	lr

So the lowering of VFPSETLEN is done.

Next is teaching SelectionDAG to actually try to select operations involving v2f64 and v4f32.

Instruction selection

Even if in the previous installment we added patterns that instruction selection can use to select VADDDx2 and VADDSx4, those will not be selected. The reason is that instruction selection believes that v2f64 and v4f32 are illegal types. In LLVM parlance, this means that operations with these types need to be softened so they only use legal types. Legal types and operations are, intuitively, those that are supported more or less straightforwardly by the target.

We can specify this in file llvm/lib/Target/ARM/ARMISelLowering.cpp. For now we will focus only on v2f64 but v2f32 is the same. We are only going to make float addition legal (fadd / ISD::FADD), but the same applies to all other operations appearing in the patterns we addded.

diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 4b63f3f8b3b6..f70e0435241d 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -746,6 +746,31 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM,
       setAllExpand(MVT::f32);
     if (!Subtarget->hasFP64())
       setAllExpand(MVT::f64);
+    if (Subtarget->hasVFP2Base()) {
+      addRegisterClass(MVT::v2f64, &ARM::DPRx2RegClass);
+      setOperationAction(ISD::LOAD, MVT::v2f64, Custom);
+      setOperationAction(ISD::STORE, MVT::v2f64, Custom);
+
+      setOperationAction(ISD::FADD, MVT::v2f64, Legal);
+    }
   }
 
   if (Subtarget->hasFullFP16()) {

First we associate the machine type v2f64 with the register class DPRx2RegClass. This is a class generated by one of the tablegen backends based on the register classes we defined in ARMRegisterInfo.td. We do this with a call to addRegisterClass.

Then we say that ISD::FADD is a legal operation for this target. This means this operation can be directly selected. Because we have a pattern for it, we know it will select the VADDDx4 instruction.

One side-effect of linking the v2f64 to the DRPx2 register class is that now SelectionDAG expects us to be able to lower loads and stores of v2f64. Unfortunately the field len is not used by the load and store instructions of VFPv2. Because of this we need a way to express a load of v2f64 in a set of instructions of f64. Before we can do that we need to let SelectionDAG know that we will manually handle the lowering of a load and a store of type v2f64.

Lowering load/store

Before we can proceed any further we will need to implement the lowering of loads and stores of v2f64. When an operation is marked as Custom, the function LowerOperation is invoked for it. We will create two functions LowerShortVectorLoad and LowerShortVectorStore that we will use for those types.

diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 4b63f3f8b3b6..f70e0435241d 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -10044,8 +10239,16 @@ SDValue ARMTargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   case ISD::SSUBSAT:
     return LowerSADDSUBSAT(Op, DAG, Subtarget);
   case ISD::LOAD:
+    if (Subtarget->hasVFP2Base() &&
+        (Op.getValueType() == MVT::v2f64 || Op.getValueType() == MVT::v4f32))
+      return LowerShortVectorLoad(Op, DAG);
     return LowerPredicateLoad(Op, DAG);
   case ISD::STORE:
+    if (Subtarget->hasVFP2Base()) {
+      EVT VT = Op.getOperand(1).getValueType();
+      if (VT == MVT::v2f64 || VT == MVT::v4f32)
+        return LowerShortVectorStore(Op, DAG);
+    }
     return LowerSTORE(Op, DAG, Subtarget);
   case ISD::MLOAD:
     return LowerMLOAD(Op, DAG);

I am going to show only the v2f64 case, the v4f32 is similar (just a bit longer to write). First the vector load.

static SDValue LowerShortVectorLoad(SDValue Op, SelectionDAG &DAG) {
  LoadSDNode *LD = cast<LoadSDNode>(Op.getNode());
  EVT MemVT = LD->getMemoryVT();
  switch (MemVT.getSimpleVT().SimpleTy) {
  default:
    llvm_unreachable("Unexpected type");
  case MVT::v2f64: {
    // This is fundamentally the same that SplitVecRes_LOAD does but assembling
    // a vector at the end.
    SDLoc dl(LD);
    ISD::LoadExtType ExtType = LD->getExtensionType();
    assert(ExtType == ISD::NON_EXTLOAD);
    SDValue Ch = LD->getChain();
    SDValue Ptr = LD->getBasePtr();
    SDValue Offset = DAG.getUNDEF(Ptr.getValueType());
    MachineMemOperand::Flags MMOFlags = LD->getMemOperand()->getFlags();
    AAMDNodes AAInfo = LD->getAAInfo();

    SDValue First = DAG.getLoad(ISD::UNINDEXED, ExtType, MVT::f64, dl, Ch, Ptr,
                                Offset, LD->getPointerInfo(), MVT::f64,
                                LD->getOriginalAlign(), MMOFlags, AAInfo);

    MachinePointerInfo MPI = LD->getPointerInfo().getWithOffset(8);
    Ptr = DAG.getObjectPtrOffset(dl, Ptr, TypeSize::Fixed(8));

    SDValue Second =
        DAG.getLoad(ISD::UNINDEXED, ExtType, MVT::f64, dl, Ch, Ptr, Offset, MPI,
                    MVT::f64, LD->getOriginalAlign(), MMOFlags, AAInfo);

    // Remember that the loads are parallel.
    SDValue NewCh = DAG.getNode(ISD::TokenFactor, dl, MVT::Other,
                                First.getValue(1), Second.getValue(1));

    SDValue Vec = DAG.getUNDEF(MVT::v2f64);
    Vec =
        DAG.getTargetInsertSubreg(ARM::dsub_len2_0, dl, MVT::v2f64, Vec, First);
    Vec = DAG.getTargetInsertSubreg(ARM::dsub_len2_1, dl, MVT::v2f64, Vec,
                                    Second);

    return DAG.getMergeValues({Vec, NewCh}, dl);
    break;
  }
  case MVT::v4f32: {
    // ...
  }
  }
}

This is a bit of code to explain but basically we obtain the specific information of a load node and we keep it in the LD variable. We use this to discriminate the machine type we are handling.

To create a load we need a bunch of information, but basically we will assume this is an unindexed load (this means there is no pre/postincrement or pre/postdecrement involved). The kind of extension load, ExtType, is for non-extending/zero-extending/sign-extending loads. We do not support it for now (hence theassert`).

The chain, Ch, is a special operand that can be consumed as input or be generated as part of the results of a SelectionDAG node and it is used for control dependencies that cannot be represented via conventional data-flow dependences (e.g. a load that follows a store will usually be linked to the store via a chain).

Ptr is the value that contains the address. Offset will be left to undefined. There is a bunch of flags we want to propagate from the memory operands of this SelectionDAG node so (MMOFlags) alongwith alias analysis information (AAInfo).

With all this information we can build a first load, that loads the first element of the vector. Note the machine type is f64 here.

The second part of the vector is found at an offsets of 8 bytes of the original Ptr address. MPI contains the abstract pointer information and we will state that it is the same as the original pointer information plus 8 bytes. Similarly we overwrite Ptr with computing an offset of 8 bytes after it. Now we can build a load for the second element of the vector. Again the type loaded is f64.

Now we have to build a vector value in the input SelectionDAG. We will take the two values we just loaded and insert them to the proper subregisters that we defined in the second installment in ARMRegisterInfo.td. Our vector value is Vec.

Finally because this is a load, we need to return the vector value we just loaded (Vec) and a chain (NewCh). The chain is such that represents the two loads in parallel (i.e. any operation chained to this value will happen after both loads but either load can be scheduled at any order). The node that represents this parallel join of other chains is called ISD::TokenFactor.

A similar process happens with stores.

static SDValue LowerShortVectorStore(SDValue Op, SelectionDAG &DAG) {
  StoreSDNode *ST = cast<StoreSDNode>(Op.getNode());
  SDLoc dl(Op);
  assert(ST->isUnindexed() && "Stores should be unindexed at this point.");
  EVT MemoryVT = ST->getMemoryVT();
  switch (MemoryVT.getSimpleVT().SimpleTy) {
  default:
    llvm_unreachable("Unexpected type");
  case MVT::v2f64: {
    // This is basically SplitVecOp_STORE but sourcing it from
    // a vector.
    assert(!ST->isTruncatingStore() && "Truncating stores not supported");
    SDValue Ch = ST->getChain();
    SDValue Ptr = ST->getBasePtr();
    Align Alignment = ST->getOriginalAlign();
    MachineMemOperand::Flags MMOFlags = ST->getMemOperand()->getFlags();
    AAMDNodes AAInfo = ST->getAAInfo();
    SDValue First = DAG.getTargetExtractSubreg(ARM::dsub_len2_0, dl, MVT::f64,
                                               ST->getValue());
    SDValue Second = DAG.getTargetExtractSubreg(ARM::dsub_len2_1, dl, MVT::f64,
                                                ST->getValue());

    First = DAG.getStore(Ch, dl, First, Ptr, ST->getPointerInfo(), Alignment,
                         MMOFlags, AAInfo);

    MachinePointerInfo MPI = ST->getPointerInfo().getWithOffset(8);
    Ptr = DAG.getObjectPtrOffset(dl, Ptr, TypeSize::Fixed(8));

    Second =
        DAG.getStore(Ch, dl, Second, Ptr, MPI, Alignment, MMOFlags, AAInfo);

    return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, First, Second);
  }
  case MVT::v4f32: {
    // ...
  }
  }
}

In contrast to the load, we need to extract the different subregisters from the vector value and then we can store them using regular f64 stores. Also note that stores do not return any value but a chain. So our lowered store must return a chain as well. Like we did with loads, ISD::TokenFactor is used to state that the two chained operations can happen in parallel.

MC layer

We are missing a final change so we can emit our instructions: we still have to replace the pseudo instructions into actual instructions. From a code generation point of view our vector pseudo instructions are fine. What it is not fine is that we do not know how to encode them as instructions.

The MC layer is the part of LLVM devoted to assembly and disassembly of instructions. All the instructions defined by tablegen have conceptual mirrors called MC Instructions and they are used when encoding (assembling) and decoding (disassembling) instructions.

If you remember from the last installment, we linked the pseudo instructions to the real instruction. Now it is the moment to use this.

Once machine instructions have reached their final stage, they are handed to a process that creates MC instructions for each one. These MC instructions can then be streamed as assembly output (like what llc prints as an output) or to generate an object file (such an ELF file).

So what we will do is change this lowering from machine instructions to MC instructions, so our pseudo instructions get encoded like normal VFPv2 instructions.

This goes in two steps: first we need to use the real opcode of each pseudo instruction. Second we need to make sure we use the regular register that would be encoded. All the changes are in file llvm/lib/Target/ARM/ARMMCInstLower.cpp.

First let’s add a small diversion in llvm::LowerARMMachineInstrToMCInst.

 void llvm::LowerARMMachineInstrToMCInst(const MachineInstr *MI, MCInst &OutMI,
                                         ARMAsmPrinter &AP) {
+  if (lowerVFPMachineInstrToMCInst(MI, OutMI, AP))
+    return;
+
   OutMI.setOpcode(MI->getOpcode());
 
   // In the MC layer, we keep modified immediates in their encoded form

If the function lowerVFPMachineInstrToMCInst returns true we do not have to do anything else here. That function looks like this

static bool lowerVFPMachineInstrToMCInst(const MachineInstr *MI, MCInst &OutMI,
                                         ARMAsmPrinter &AP) {
  const VFPPseudos::PseudoInfo *VFPInfo =
      VFPPseudos::getPseudoInfo(MI->getOpcode());
  if (!VFPInfo || !VFPInfo->VectorLength)
    return false;

  assert(MI->getOpcode() != VFPInfo->BaseInst &&
         "Opcodes should be different at this point");
  OutMI.setOpcode(VFPInfo->BaseInst);

  for (const MachineOperand &MO : MI->operands()) {
    MCOperand MCOp;
    if (AP.lowerOperand(MO, MCOp)) {
      OutMI.addOperand(MCOp);
    }
  }

  return true;
}

We query in our VFPPseudos table if this is a VFP instruction that demands a length larger than 1.

If this is the case, we set the opcode of the MC instruction being created (OutMI) to the opcode of the base instruction. E.g. in this step we go from VADDDx2 to VADDD. Now we proceed as usual lowering the operands.

When lowering the operands we need to be careful not to emit operands of the register classes DPRx2 or SPRx4. If this is the case, we use the first subregister instead. To do that we change lowerOperand.

 bool ARMAsmPrinter::lowerOperand(const MachineOperand &MO,
                                  MCOperand &MCOp) {
+  const MachineInstr *MI = MO.getParent();
+  assert(MI && "Operand expected to belong to a machine instruction");
+  const MachineBasicBlock *MBB = MI->getParent();
+  assert(MBB && "MI expected to be in a basic block");
+  const MachineFunction *MF = MBB->getParent();
+  assert(MF && "MBB expected to be in a machine function");
+  const TargetRegisterInfo *TRI =
+      MF->getSubtarget<ARMSubtarget>().getRegisterInfo();
+
   switch (MO.getType()) {
   default: llvm_unreachable("unknown operand type");
-  case MachineOperand::MO_Register:
+  case MachineOperand::MO_Register: {
     // Ignore all implicit register operands.
     if (MO.isImplicit())
       return false;
     assert(!MO.getSubReg() && "Subregs should be eliminated!");
-    MCOp = MCOperand::createReg(MO.getReg());
-    break;
+    unsigned Reg = MO.getReg();
+    // Replace the tuple register with the one used in the encoding.
+    if (ARM::DPRx2RegClass.contains(Reg)) {
+      Reg = TRI->getSubReg(Reg, ARM::dsub_len2_0);
+      assert(Reg && "Subregister does not exist");
+    } else if (ARM::SPRx4RegClass.contains(Reg)) {
+      Reg = TRI->getSubReg(Reg, ARM::ssub_len4_0);
+      assert(Reg && "Subregister does not exist");
+    }
+    MCOp = MCOperand::createReg(Reg);
+  } break;
   case MachineOperand::MO_Immediate:
     MCOp = MCOperand::createImm(MO.getImm());
     break;
@@ -120,8 +139,32 @@ bool ARMAsmPrinter::lowerOperand(const MachineOperand &MO,
   return true;
 }

First test

With all this we can do a first experiment with the following LLVM IR.

vector.ll

define void @test_vector(<2 x double> *%pa, <2 x double> *%pb, <2 x double> *%pc) { 
  %a = load <2 x double>, <2 x double>* %pa
  %b = load <2 x double>, <2 x double>* %pb
  %c = fadd <2 x double> %a, %b
  store <2 x double> %c, <2 x double> *%pc
  ret void
}

$ llc -mtriple=armv6kz-unknown-linux-gnu -mattr=+vfp2 -o - vector.ll

test_vector:
	.fnstart
@ %bb.0:
	vldmia	r0, {d6, d7}
	mov	r0, #65536
	vldmia	r1, {d4, d5}
	vmrs	r1, fpscr
	bic	r1, r1, #458752
	orr	r1, r1, r0
	vmsr	fpscr, r1
	vadd.f64	d4, d6, d4
	vstmia	r2, {d4, d5}
	bx	lr

Hey, not bad.

OK. We still need to think about preserving the len field upon returning the function (i.e. len should be 0b00 again) but this is a start.

Under the hood

I think this is a good moment to take a look under the hood in the different steps. Let’s take a look at the whole instruction selection.

llc -mtriple=armv6kz-unknown-linux-gnu -mattr=+vfp2 \
    -o /dev/null vector.ll -debug-only=isel -print-before-all -print-after-all

This testcase has a single basic block, so only a single SelectionDAG will be built for it. This is the initial SelectionDAG

Initial selection DAG: %bb.0 'test_vector:'
SelectionDAG has 15 nodes:
  t0: ch = EntryToken
  t7: i32 = Constant<0>
    t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t9: v2f64,ch = load<(load 16 from %ir.pa, align 8)> t0, t2, undef:i32
    t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t10: v2f64,ch = load<(load 16 from %ir.pb, align 8)> t0, t4, undef:i32
      t12: ch = TokenFactor t9:1, t10:1
      t11: v2f64 = fadd t9, t10
      t6: i32,ch = CopyFromReg t0, Register:i32 %2
    t13: ch = store<(store 16 into %ir.pc, align 8)> t12, t11, t6, undef:i32
  t14: ch = ARMISD::RET_FLAG t13

Note how node t11 adds two vectors, loaded in t9 and t10 respectively. The SelectionDAG now undergoes a first optimisation step, which makes nothing because it is so simple. Next a type legalisation step happens but all the types mentioned are legal (i32, v2f64).

Next is legalisation of operations. Now the loads and the stores get custom lowered as we wanted.

Legalized selection DAG: %bb.0 'test_vector:'
SelectionDAG has 33 nodes:
  t0: ch = EntryToken
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t6: i32,ch = CopyFromReg t0, Register:i32 %2
    t35: ch = TokenFactor t32:1, t34:1
    t27: ch = TokenFactor t24:1, t26:1
  t12: ch = TokenFactor t35, t27
      t36: v2f64 = INSERT_SUBREG undef:v2f64, t32, TargetConstant:i32<9>
    t37: v2f64 = INSERT_SUBREG t36, t34, TargetConstant:i32<10>
      t29: v2f64 = INSERT_SUBREG undef:v2f64, t24, TargetConstant:i32<9>
    t30: v2f64 = INSERT_SUBREG t29, t26, TargetConstant:i32<10>
  t11: v2f64 = fadd t37, t30
  t24: f64,ch = load<(load 8 from %ir.pb)> t0, t4, undef:i32
    t25: i32 = add nuw t4, Constant:i32<8>
  t26: f64,ch = load<(load 8 from %ir.pb + 8)> t0, t25, undef:i32
  t32: f64,ch = load<(load 8 from %ir.pa)> t0, t2, undef:i32
    t33: i32 = add nuw t2, Constant:i32<8>
  t34: f64,ch = load<(load 8 from %ir.pa + 8)> t0, t33, undef:i32
        t16: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<9>
      t19: ch = store<(store 8 into %ir.pc)> t12, t16, t6, undef:i32
        t18: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<10>
        t21: i32 = add nuw t6, Constant:i32<8>
      t22: ch = store<(store 8 into %ir.pc + 8)> t12, t18, t21, undef:i32
    t23: ch = TokenFactor t19, t22
  t14: ch = ARMISD::RET_FLAG t23

Let’s take a look at t9 in the initial SelectionDAG. It has been expanded into two stores t32 and t34 (note how we annotate the proper 8 bytes offsets in t34). Then t32 and t34 are inserted as subregisters in t36 and t37. The last one will make up the value of the original type v2f64 we had in t9. The original chain of t9 (denoted by t9:1 because it is the second result of the operation) is now represented by t35 which is the parallel join of the chains of the already mentioned loads t32 and t34 (i.e. t32:1 and t34:1).

The SelectionDAG undergoes another optimisation pass which only simplifies the token factors (check t39).

Optimized legalized selection DAG: %bb.0 'test_vector:'
SelectionDAG has 31 nodes:
  t0: ch = EntryToken
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t6: i32,ch = CopyFromReg t0, Register:i32 %2
      t36: v2f64 = INSERT_SUBREG undef:v2f64, t32, TargetConstant:i32<9>
    t37: v2f64 = INSERT_SUBREG t36, t34, TargetConstant:i32<10>
      t29: v2f64 = INSERT_SUBREG undef:v2f64, t24, TargetConstant:i32<9>
    t30: v2f64 = INSERT_SUBREG t29, t26, TargetConstant:i32<10>
  t11: v2f64 = fadd t37, t30
  t24: f64,ch = load<(load 8 from %ir.pb)> t0, t4, undef:i32
    t25: i32 = add nuw t4, Constant:i32<8>
  t26: f64,ch = load<(load 8 from %ir.pb + 8)> t0, t25, undef:i32
  t32: f64,ch = load<(load 8 from %ir.pa)> t0, t2, undef:i32
    t33: i32 = add nuw t2, Constant:i32<8>
  t34: f64,ch = load<(load 8 from %ir.pa + 8)> t0, t33, undef:i32
  t39: ch = TokenFactor t32:1, t34:1, t24:1, t26:1
        t16: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<9>
      t43: ch = store<(store 8 into %ir.pc)> t39, t16, t6, undef:i32
        t18: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<10>
        t21: i32 = add nuw t6, Constant:i32<8>
      t40: ch = store<(store 8 into %ir.pc + 8)> t39, t18, t21, undef:i32
    t42: ch = TokenFactor t43, t40
  t14: ch = ARMISD::RET_FLAG t42

At this point the input SelectionDAG can be “instruction selected”. This gives us the output SelectionDAG.

===== Instruction selection ends:
Selected selection DAG: %bb.0 'test_vector:'
SelectionDAG has 30 nodes:
  t0: ch = EntryToken
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t6: i32,ch = CopyFromReg t0, Register:i32 %2
  t32: f64,ch = VLDRD<Mem:(load 8 from %ir.pa)> t2, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 $noreg, t0
  t24: f64,ch = VLDRD<Mem:(load 8 from %ir.pb)> t4, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 $noreg, t0
  t34: f64,ch = VLDRD<Mem:(load 8 from %ir.pa + 8)> t2, TargetConstant:i32<2>, TargetConstant:i32<14>, Register:i32 $noreg, t0
  t26: f64,ch = VLDRD<Mem:(load 8 from %ir.pb + 8)> t4, TargetConstant:i32<2>, TargetConstant:i32<14>, Register:i32 $noreg, t0
  t39: ch = TokenFactor t32:1, t34:1, t24:1, t26:1
      t36: v2f64 = INSERT_SUBREG IMPLICIT_DEF:v2f64, t32, TargetConstant:i32<9>
    t37: v2f64 = INSERT_SUBREG t36, t34, TargetConstant:i32<10>
      t29: v2f64 = INSERT_SUBREG IMPLICIT_DEF:v2f64, t24, TargetConstant:i32<9>
    t30: v2f64 = INSERT_SUBREG t29, t26, TargetConstant:i32<10>
  t11: v2f64 = VADDDx2 t37, t30, TargetConstant:i32<14>, Register:i32 $noreg
        t16: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<9>
      t43: ch = VSTRD<Mem:(store 8 into %ir.pc)> t16, t6, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 $noreg, t39
        t18: f64 = EXTRACT_SUBREG t11, TargetConstant:i32<10>
      t40: ch = VSTRD<Mem:(store 8 into %ir.pc + 8)> t18, t6, TargetConstant:i32<2>, TargetConstant:i32<14>, Register:i32 $noreg, t39
    t42: ch = TokenFactor t43, t40
  t14: ch = BX_RET TargetConstant:i32<14>, Register:i32 $noreg, t42

Note how the fadd now is VADDDx2 as expected. The other operations are also opcodes of the ARM backend. Now this output SelectionDAG is linearised (scheduled) to obtain a machine basic block. There is only one in this function, but all the basic blocks would be put together for the machine function.

# *** IR Dump Before Finalize ISel and expand pseudo-instructions (finalize-isel) ***:
# Machine code for function test_vector: IsSSA, TracksLiveness
Function Live Ins: $r0 in %0, $r1 in %1, $r2 in %2

bb.0 (%ir-block.0):
  liveins: $r0, $r1, $r2
  %2:gpr = COPY $r2
  %1:gpr = COPY $r1
  %0:gpr = COPY $r0
  %3:dpr = VLDRD %1:gpr, 0, 14, $noreg :: (load 8 from %ir.pb)
  %4:dpr = VLDRD %1:gpr, 2, 14, $noreg :: (load 8 from %ir.pb + 8)
  %5:dpr = VLDRD %0:gpr, 0, 14, $noreg :: (load 8 from %ir.pa)
  %6:dpr = VLDRD %0:gpr, 2, 14, $noreg :: (load 8 from %ir.pa + 8)
  %8:dprx2 = IMPLICIT_DEF
  %7:dprx2 = INSERT_SUBREG %8:dprx2(tied-def 0), killed %3:dpr, %subreg.dsub_len2_0
  %10:dprx2 = IMPLICIT_DEF
  %9:dprx2 = INSERT_SUBREG %10:dprx2(tied-def 0), killed %5:dpr, %subreg.dsub_len2_0
  %11:dprx2 = INSERT_SUBREG %7:dprx2(tied-def 0), killed %4:dpr, %subreg.dsub_len2_1
  %12:dprx2 = INSERT_SUBREG %9:dprx2(tied-def 0), killed %6:dpr, %subreg.dsub_len2_1
  %13:dprx2 = VADDDx2 killed %12:dprx2, killed %11:dprx2, 14, $noreg
  %14:dpr = COPY %13.dsub_len2_1:dprx2
  VSTRD killed %14:dpr, %2:gpr, 2, 14, $noreg :: (store 8 into %ir.pc + 8)
  %15:dpr = COPY %13.dsub_len2_0:dprx2
  VSTRD killed %15:dpr, %2:gpr, 0, 14, $noreg :: (store 8 into %ir.pc)
  BX_RET 14, $noreg

And then the custom inserter runs and adds VFPSETLEN where due. In this case right before VADDDx2.

# *** IR Dump After Finalize ISel and expand pseudo-instructions (finalize-isel) ***:
# Machine code for function test_vector: IsSSA, TracksLiveness
Function Live Ins: $r0 in %0, $r1 in %1, $r2 in %2

bb.0 (%ir-block.0):
  liveins: $r0, $r1, $r2
  %2:gpr = COPY $r2
  %1:gpr = COPY $r1
  %0:gpr = COPY $r0
  %3:dpr = VLDRD %1:gpr, 0, 14, $noreg :: (load 8 from %ir.pb)
  %4:dpr = VLDRD %1:gpr, 2, 14, $noreg :: (load 8 from %ir.pb + 8)
  %5:dpr = VLDRD %0:gpr, 0, 14, $noreg :: (load 8 from %ir.pa)
  %6:dpr = VLDRD %0:gpr, 2, 14, $noreg :: (load 8 from %ir.pa + 8)
  %8:dprx2 = IMPLICIT_DEF
  %7:dprx2 = INSERT_SUBREG %8:dprx2(tied-def 0), killed %3:dpr, %subreg.dsub_len2_0
  %10:dprx2 = IMPLICIT_DEF
  %9:dprx2 = INSERT_SUBREG %10:dprx2(tied-def 0), killed %5:dpr, %subreg.dsub_len2_0
  %11:dprx2 = INSERT_SUBREG %7:dprx2(tied-def 0), killed %4:dpr, %subreg.dsub_len2_1
  %12:dprx2 = INSERT_SUBREG %9:dprx2(tied-def 0), killed %6:dpr, %subreg.dsub_len2_1
  dead %16:gpr, dead %17:gprnopc = VFPSETLEN 1, implicit-def $fpscr
  %13:dprx2 = VADDDx2 killed %12:dprx2, killed %11:dprx2, 14, $noreg, implicit $fpscr
  %14:dpr = COPY %13.dsub_len2_1:dprx2
  VSTRD killed %14:dpr, %2:gpr, 2, 14, $noreg :: (store 8 into %ir.pc + 8)
  %15:dpr = COPY %13.dsub_len2_0:dprx2
  VSTRD killed %15:dpr, %2:gpr, 0, 14, $noreg :: (store 8 into %ir.pc)
  BX_RET 14, $noreg

You may be wondering how is it that the 4 loads and the 2 stores have become 2 load multiples and 1 store multiple. This is due a later pass in the ARM backend that knows how to optimise those consecutive loads.

# *** IR Dump After ARM load / store optimization pass (arm-ldst-opt) ***:
# Machine code for function test_vector: NoPHIs, TracksLiveness, NoVRegs, TiedOpsRewritten
Frame Objects:
  fi#0: size=4, align=4, at location [SP-4]
  fi#1: size=4, align=4, at location [SP-8]
Function Live Ins: $r0, $r1, $r2

bb.0 (%ir-block.0):
  liveins: $r0, $r1, $r2, $r4, $lr
  $sp = frame-setup STMDB_UPD $sp(tied-def 0), 14, $noreg, killed $r4, killed $lr
  frame-setup CFI_INSTRUCTION def_cfa_offset 8
  frame-setup CFI_INSTRUCTION offset $lr, -4
  frame-setup CFI_INSTRUCTION offset $r4, -8
  VLDMDIA killed $r1, 14, $noreg, def $d4, def $d5, implicit-def $d4_d5x2 :: (load 8 from %ir.pb), (load 8 from %ir.pb + 8)
  VLDMDIA killed $r0, 14, $noreg, def $d6, def $d7, implicit-def $d6_d7x2 :: (load 8 from %ir.pa), (load 8 from %ir.pa + 8)
  dead renamable $r0, dead renamable $r1 = VFPSETLEN 1, implicit-def $fpscr
  renamable $d4_d5x2 = VADDDx2 killed renamable $d6_d7x2, killed renamable $d4_d5x2, 14, $noreg, implicit $fpscr
  VSTMDIA killed $r2, 14, $noreg, $d4, $d5 :: (store 8 into %ir.pc), (store 8 into %ir.pc + 8)
  $r4 = VMRS 14, $noreg, implicit $fpscr
  $r4 = BICri $r4, 458752, 14, $noreg, $noreg
  VMSR $r4, 14, $noreg, implicit-def $fpscr
  $sp = frame-destroy LDMIA_RET $sp(tied-def 0), 14, $noreg, def $r4, def $pc

Final observation

What happens if our IR does two adds?

vector2.ll

define void @test_vector2(<2 x double> *%pa, <2 x double> *%pb,
                          <2 x double> *%pc, <2 x double> *%pd) { 
  %a = load <2 x double>, <2 x double>* %pa
  %b = load <2 x double>, <2 x double>* %pb
  %c = load <2 x double>, <2 x double>* %pc
  %t1 = fadd <2 x double> %a, %b
  %t2 = fadd <2 x double> %t1, %c
  store <2 x double> %t2, <2 x double> *%pd
  ret void
}

Well, we get the following output

test_vector2:
	.fnstart
@ %bb.0:
	vldmia	r0, {d6, d7}
	mov	r0, #65536
	vldmia	r1, {d4, d5}
	vmrs	r1, fpscr
	bic	r1, r1, #458752
	orr	r1, r1, r0
	mov	r0, #65536
	vmsr	fpscr, r1
	vadd.f64	d4, d6, d4
	vldmia	r2, {d6, d7}
	vmrs	r1, fpscr
	bic	r1, r1, #458752
	orr	r1, r1, r0
	vmsr	fpscr, r1
	vadd.f64	d4, d4, d6
	vstmia	r3, {d4, d5}
	bx	lr

As you can see we are setting the field len in fpscr twice. But once should be enough: fpscr is already encoding vector length 2 when we’re about to execute the second vadd.f64.

In a next installment we will see how to improve this and emit only the required VFPSETLEN instructions.