In the last chapter we modelled the storage in form of pairs and quadruples of registers that we will use for vectors of double and single precision, respectively.

But before we can do anything we need to deal with fpscr.

The way to machine instructions

LLVM is famously known for its intermediate representation (IR) called LLVM IR. Compilers, however, and LLVM is not an exception, cannot do their job with just a single intermediate representation. The reason is that different steps in the compilation pipeline have different requirements. No IR can cater to all of them at the same time.

SelectionDAG

So, during code generation LLVM goes through two intermediate representations. LLVM IR is lowered, one basic block at a time, into a graph representation called SelectionDAG. It is called SelectionDAG because its goal is to do one of the main tasks of any backend: instruction selection.

Instruction selection can be understood as taking a directed acyclic graph (DAG) or a tree (which is a restricted form of DAG) and “tiling” it. This is, we will group several connected nodes and replace them with one or more nodes. Those new nodes correspond to actual instructions of the machine we target.

In the context of LLVM, instruction selection goes from what it is called the input DAG, built using the information from LLVM IR, to what is called the output DAG, where the nodes are instructions. One important difference between the input DAG and the output DAG is that the input DAG operates at the level of machine types (a finite set of types many of which have equivalent LLVM IR types) while the output DAG operates at the level of register classes.

Instruction selection completes when the output DAG is linearized, this is, a schedule is determined for it. Now the nodes of the output DAG are converted into the second intermediate representation used by LLVM in code generation.

Machine IR

This second intermediate representation is called Machine IR or MIR (this is unrelated to Rust MIR). MIR is a more conventional representation where each machine function is a graph of machine basic blocks and each (machine) basic block is a sequence of machine instructions. One way to understand machine instructions is looking at them as containers of operands along with an operation code (or opcode). Machine instructions can have different kinds of operands but commonly they are registers (of some specific register class) or immediates (i.e. a constant like 42).

Operands of machine instructions can be explicit or implicit. Explicit operands are those that are encoded as part of the instruction and they can be inputs or outputs. Explicit operands are the common ones and what we intuitively understand for operands. Implicit operands are those that are inputs or outputs of the instruction but are not explicitly encoded in the instruction.

For instance, in an architecture like Arm, conditional branches use (read) the cpsr register (later renamed into apsr in Armv7-A) that has been defined (written) earlier usually in a comparison. That register is an implicit operand in those instructions because it is not encoded in the instruction itself. Instead, instructions only encode things like the operands being compared or, for the branch instruction, the target of the branch (where we jump to) and the branching condition (when whe have to jump), etc.

Machine instructions need to know what operands they are using and defining. If they fail to do so, later scheduling passes that operate in the MIR may reorder instructions and break the semantics of the represented code.

Virtual registers

Most compilation infrastructures use the concept of virtual register. During compilation the compiler assumes that there is an infinite number of registers. Those registers are called virtual and belong to a register class. This largely simplifies code generation particularly because LLVM favours using Static Single Assignment (SSA). Under SSA, virtual registers are only defined once (i.e. assigned a value) and can be used many times. This constrained form is very beneficial for analysis because removes the ambiguity of knowing what was the last update to a register.

The opposite of a virtual register is a physical register (in compilers physical registers correspond to what in computer architecture are known by architectural registers). Physical registers are not subject to the regime of SSA: they can be redefined many times. This makes analysing them a bit harder.

Virtual registers do not exist in CPUs. So a process called register allocation assigns physical registers in place of virtual registers. This is an effective approach because the life spans of virtual registers is often very short. This means that, in general, not many physical registers are needed at the same time. However if this happens, register allocation uses memory so it can temporarily store a value held in a register to retrieve it later. This store operation is commonly known as a spill and its later retrieval it is known as a reload.

Machine instructions in MIR can have either virtual register operands and physical register operands. After register allocation, no virtual register remains in the machine function. This is largely true but some very late code generation steps may be easier to implement using virtual registers. LLVM provides a simplified mechanism to assign those virtual registers to physical registers without involving a full register allocation process.

fpscr is a physical register so unfortunately we will not be able to benefit the SSA advantages that virtual registers enjoy.

Approach

The approach I chose is similar, if simpler, to the one used in the RISC-V backend of LLVM to support vectors. By using pseudo instructions we define vector operations and handle them as needed.

For instance, an instruction like vadd.f64, is represented in the ARM backend of LLVM with a machine instruction with opcode VADDD. VADDD is currently used for scalar operations. Our idea is to introduce a pseudo instruction called VADDDx2. Those pseudo instructions will use the register class DPRx2 that we defined already. Similarly for vadd.f32, the instruction is called VADDS and we will define the pseudo instruction VADDSx4 that uses the SPRx4 register class. We will repeat this operation for all the 12 instructions (in their two variants for double and single precision) that honour the len field of fpscr.

By using pseudo instructions we can make SelectionDAG to select those pseudo instructions. For instance a LLVM IR like

%vc = fadd <2 x double> %va, %vb

will be selected using the VADDDx2 instruction.

A pseudo instruction is no different to an instruction for the purpose of code generation. The difference to actual instructions, is that the target does not have such instruction. So at some point we will need to expand it into real instructions. This happens at a later stage, after register allocation and will be part of a later installment in this series.

Tracking fpscr

Currently the ARM backend assumes the fpscr does not have to be tracked for scalar operations. Technically they do depend on the rounding mode in fpscr but this part is not very well defined in LLVM (only recently constrained floating point operations have been introduced).

However, if we plan to use VADDD and VADDDx2 in the same code we need to make sure the len field of the fpscr has the right value for each instruction. The easiest way to do that is to set the len to the required value right before every instruction.

In order to change the len we will add a pseudo instruction called VFPSETLEN.

let Defs = [FPSCR],
    hasNoSchedulingInfo = 1,
    mayLoad = 0,
    mayStore = 0,
    hasSideEffects = 0 in
def VFPSETLEN : PseudoInst<(outs GPR:$scratch1, GPRnopc:$scratch2),
                           (ins imm0_7:$len),
                           IIC_fpSTAT, []>,
                           Requires<[HasVFP2]>;

This tablegen definition of a pseudo instruction has a lot of information.

First, this instruction has one input called $len. This input is an immediate that ranges from 0 to 7 (3 bits). This is defined by the operand kind specifier imm0_7 which is conveniently already defined in the ARM backend.

Second, this instruction has two register outputs $scratch1 and $scratch2. We will need those later on when we expand this pseudo instruction into the actual instructions that set the len field of the fpscr. Note the different register classes GPR and GPRnopc. The latter does not allow to use the pc register (r15 in AArch32). We do this because the instructions used in the expansion also have this restriction in the way they use $scratch2. These two output registers will be initially virtual registers and then register allocation will assign two physical registers for them. This ensures we can expand the pseudo instruction safely.

Finally, IIC_fpSTAT is an instruction itinerary and we can ignore it for now: it is used for scheduling. Then we specify this is only for VFPv2 which we do using Requires<[HasVFP2]>.

Tablegen allows to push attributes into definitions using the let syntax, above the definition. This is useful when we want a number of definitions to have the same values in the attributes by avoiding having to specify it in each definition.

A first attribute is Defs = [FPSCR] that means this instruction defines (writes) the register fpscr. When instruction selection finally creates the machine instructions (of MIR) it will add an implicit defining operand for fpscr.

The other attributes are used to state that this instruction does not have scheduling info, does not load memory, does not store memory and does not have any further unmodelled side-effects.

It should be possible now to mark VADDD and VADDDx2 as Uses = [FPSCR], so after instruction selection they have an implicit use of the register fpscr. However if we do so, then the instructions are reading a fpscr that nobody wrote and this is a bit odd. We will do something slightly different.

How to configure fpscr

It is possible to mark instructions (including pseudo instructions) with the attribute hasCustomInserter. When doing this an extra callback is invoked for each instruction, right before finishing instruction selection. This allows the compiler to tweak or change the instruction selection.

The idea is that for every instruction like VADDD, VADDDx2, VADDSx4, … the custom inserter will prepend the instruction with an appropriate VFPSETLEN and will mark the instruction to implicitly use fpscr.

We mentioned there are 12 instructions in VFPv2 that honour the vector length. Because in LLVM those are split between the single precision and the double precision there are 24 machine instructions. Because we are adding the x2 and x4 forms, we end with 48 cases. This is perhaps too many cases to handle individually, so one option we have is to create a table that gives us information for each instruction. There is a tablegen backend in LLVM to create generic searchable tables.

Generic table

Firt we need to specify what we will have as elements of that table.

class VFPPseudo {
  // Note: Pseudo will be the same as BaseInstr for VectorLength == 000b
  Instruction Pseudo = !cast<Instruction>(NAME); // Used as a key.
  Instruction BaseInstr;
  bits<3> VectorLength; // Encoded with offset +1 (i.e. 000b = 1, 111b = 8)
}

Any tablegen definition of this class (including those due to to inheritance) will be part of the table. Each element of the table will contain 3 fields. The first field is the Pseudo instruction and will be used for lookups in the table. This field is a bit of a misnomer, specially for the existing scalar instructions such as VADDD and VADDS but I think we can live with this for now.

The second field BaseInstr links each pseudo instruction with its base one. This is VADDDx2 and VADDSx4 will both be linked to VADDD and VADDS respectively.

A final field VectorLength encoded in the same way as the len field of fpscr.

Now that we know what every element of the table will contain, we can define the table itself.

def VFPPseudosTable : GenericTable {
  let FilterClass = "VFPPseudo";
  let CppTypeName = "PseudoInfo";
  let Fields = [ "Pseudo", "BaseInstr", "VectorLength" ];
  let PrimaryKey = [ "Pseudo" ];
  let PrimaryKeyName = "getPseudoInfo";
  let PrimaryKeyEarlyOut = true;
}

The first field, FilterClass, is the class whose definitions will make up the table. This is the VFPPseudo we defined above. Field CppTypeName will be used as the class generated in C++. This class will have a number of fields that we specify in the Fields attributes. We specify that Pseudo will be used as primary key, this is, used to look up elements in the table (the term primary key comes from databases). Field PrimaryKeyName is the name of the method we will use on the table to retrieve an element. This function will receive the values of the PrimaryKey for the lookup: in this case the opcode of an instruction. Finaly the field PrimaryKeyEarlyOut is a small optimisation when doing lookups in large spaces like the instructions.

OK. How do we use this? As mentioned earlier, any class that inherits from this one will automatically be part of the table. So we will first label all the scalar operations. Before, that, let’s create a convenience subclass for them.

class VFPPseudoScalar : VFPPseudo {
  let BaseInstr = !cast<Instruction>(NAME);
  let VectorLength = 0b000;
}

This subclass hardcodes some values for scalar operations: the VectorLength will always be 0b000, which encodes a value of 1 in the len field. The base instruction will be itself. NAME is a special identifier which evaluates to the string of the name of the current tablegen definition. The cast is needed to convert that string to the Instruction definition identifier.

With this we can now annotate the existing operations. I will show only here VADDD and VADDS but this can be extended to the other ones.

 //===----------------------------------------------------------------------===//
 // FP Binary Operations.
 //
 
+let usesCustomInserter = 1 in
 let TwoOperandAliasConstraint = "$Dn = $Dd" in
 def VADDD  : ADbI<0b11100, 0b11, 0, 0,
                   (outs DPR:$Dd), (ins DPR:$Dn, DPR:$Dm),
                   IIC_fpALU64, "vadd", ".f64\t$Dd, $Dn, $Dm",
                   [(set DPR:$Dd, (fadd DPR:$Dn, (f64 DPR:$Dm)))]>,
-             Sched<[WriteFPALU64]>;
+             Sched<[WriteFPALU64]>,
+             VFPPseudoScalar;
 
+let usesCustomInserter = 1 in
 let TwoOperandAliasConstraint = "$Sn = $Sd" in
 def VADDS  : ASbIn<0b11100, 0b11, 0, 0,
                    (outs SPR:$Sd), (ins SPR:$Sn, SPR:$Sm),
                    IIC_fpALU32, "vadd", ".f32\t$Sd, $Sn, $Sm",
                    [(set SPR:$Sd, (fadd SPR:$Sn, SPR:$Sm))]>,
-             Sched<[WriteFPALU32]> {
+             Sched<[WriteFPALU32]>,
+             VFPPseudoScalar {
   // Some single precision VFP instructions may be executed on both NEON and
   // VFP pipelines on A8.
   let D = VFPNeonA8Domain;
 }

Note that we need to make sure usesCustomInserter is set to 1 so the hook is invoked for these instructions.

In order to make the generated table available to C++ we still need to make some changes in files llvm/lib/Target/ARM/Utils/ARMBaseInfo.cpp and llvm/lib/Target/ARM/Utils/ARMBaseInfo.h. This will make sure the tablegen-generated C++ is included appropriately.

diff --git a/llvm/lib/Target/ARM/Utils/ARMBaseInfo.cpp b/llvm/lib/Target/ARM/Utils/ARMBaseInfo.cpp
index 3356d56481e5..4ced4d57109a 100644
--- a/llvm/lib/Target/ARM/Utils/ARMBaseInfo.cpp
+++ b/llvm/lib/Target/ARM/Utils/ARMBaseInfo.cpp
@@ -74,4 +74,10 @@ namespace ARMBankedReg {
 #define GET_BANKEDREG_IMPL
 #include "ARMGenSystemRegister.inc"
 } // end namespce ARMSysReg
+
+// VFP Pseudo Instructions
+namespace VFPPseudos {
+#define GET_VFPPseudosTable_IMPL
+#include "ARMGenSystemRegister.inc"
+} // namespace VFPPseudos
 } // end namespace llvm
diff --git a/llvm/lib/Target/ARM/Utils/ARMBaseInfo.h b/llvm/lib/Target/ARM/Utils/ARMBaseInfo.h
index 80b7276adb4e..3b0346d790aa 100644
--- a/llvm/lib/Target/ARM/Utils/ARMBaseInfo.h
+++ b/llvm/lib/Target/ARM/Utils/ARMBaseInfo.h
@@ -232,6 +232,19 @@ namespace ARMBankedReg {
   #include "ARMGenSystemRegister.inc"
 } // end namespace ARMBankedReg
 
+// VFP Pseudo Instructions
+namespace VFPPseudos {
+using namespace ARM;
+struct PseudoInfo {
+  unsigned Pseudo;
+  unsigned BaseInst;
+  // Encoded with +1 offset (i.e. 000 means 1)
+  unsigned VectorLength;
+};
+#define GET_VFPPseudosTable_DECL
+#include "ARMGenSystemRegister.inc"
+} // namespace VFPPseudos
+
 } // end namespace llvm
 
 #endif // LLVM_LIB_TARGET_ARM_UTILS_ARMBASEINFO_H

The pseudo instructions

Ok so now we have annotated our existing scalar instructions to have VectorLength equals to 1. Now it is a good moment to introduce our pseudo instructions. Again I’ll show only VADDDx2 and VADDSx4.

let usesCustomInserter = 1 in {

// VectorLength=2
let VectorLength = 0b001 in {

let BaseInstr = VADDD in
def VADDDx2  : PseudoInst<(outs DPRx2:$Dd), (ins DPRx2:$Dn, DPRx2:$Dm, pred:$p),
                  IIC_fpALU64,
                  [(set DPRx2:$Dd, (fadd DPRx2:$Dn, (v2f64 DPRx2:$Dm)))]>,
             Sched<[WriteFPALU64]>,
             Requires<[HasVFP2]>,
             VFPPseudo;

} // VectorLength

// VectorLength=4
let VectorLength = 0b011 in {

let BaseInstr = VABSS in
def VABSSx4  : PseudoInst<(outs SPRx4:$Dd), (ins SPRx4:$Dm, pred:$p),
                  IIC_fpUNA32,
                  [(set SPRx4:$Dd, (fabs (v4f32 SPRx4:$Dm)))]>,
             Sched<[WriteFPALU32]>,
             Requires<[HasVFP2]>,
             VFPPseudo;

} // VectorLength

} // usesCustomInserter

If you look closely you will see that these are almost identical to their scalar counterparts. That is intentional because we want these machine instructions be structurally identical as their scalar counterpart ones. The only differences are in the register classes (DPR is now DPRx2 and SPR is now SPRx4).

These definitions, like the scalar ones, include what is called a pattern for instruction selection. For instance, the one for VADDDx2 includes this pattern.

   [(set DPRx2:$Dd, (fadd DPRx2:$Dn, (v2f64 DPRx2:$Dm)))]>,

This basically says that we can match an input selection DAG with opcode fadd that receives two input registers $Dn and $Dm to set a register $Dd with the current instruction. DPRx2 is here the register class used for those registers. v2f64 is a type and it must be representable by DPRx2, but if you recall we did this in the previous installment when we created the register class.

Tablegen can in general infer the types of the operands of the patterns. So specifying v2f64 in the pattern fragment (v2f64 DPRx2:$Dm) is redundant.

However tablegen cannot infer the types if the involved register class allows more than one machine type. In the ARM backend, the DPR register class contains more than one machine type, hence the pattern needs to at least specify one type (the rest of types can be inferred because the properties of the fadd SelectionDAG node).

If you check the above definition of VADDS you will see that its pattern does not include a type (in contrast to the one for VADDD): the reason is that the register class SPR only includes the machine type f32.

I decided to leave at least one type for clarity as a balanced alternative to cluttering everything with types.

Custom inserter

Now we are in a situation in which we can finally use the custom inserter to insert the VFPSETLEN instruction.

We can do that in the file llvm/lib/Target/ARMISelLowering.cpp. The hook called per instruction is EmitInstrWithCustomInserter. We will check if the MachineInstr is included in our table of VFPPseudos we have defined above.

MachineBasicBlock *
ARMTargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
                                               MachineBasicBlock *BB) const {
  if (const VFPPseudos::PseudoInfo *VFPInfo =
          VFPPseudos::getPseudoInfo(MI.getOpcode())) {
    if (Subtarget->hasVFP2Base()) {
      // Don't do anything if we are not generating code for VFP2.
      setFPCSRLength(*BB, MI, VFPInfo);
      return BB;
    }
  }
  // rest of the code
}

We will query the table VFPPseudos. If an instruction appears there it means we need to make sure the fpscr has the len field correctly set. This is what the VFPSETLEN instruction was for.

void ARMTargetLowering::setFPCSRLength(
    MachineBasicBlock &BB, MachineInstr &MI,
    const VFPPseudos::PseudoInfo *VFPInfo) const {
  MachineFunction *MF = BB.getParent();
  MachineRegisterInfo &MRI = MF->getRegInfo();
  const TargetInstrInfo *TII = Subtarget->getInstrInfo();
  DebugLoc dl = MI.getDebugLoc();

  Register Scratch1 = MRI.createVirtualRegister(&ARM::GPRRegClass);
  Register Scratch2 = MRI.createVirtualRegister(&ARM::GPRnopcRegClass);

  BuildMI(BB, MI, dl, TII->get(ARM::VFPSETLEN))
      .addDef(Scratch1, RegState::Dead)
      .addDef(Scratch2, RegState::Dead)
      .addImm(VFPInfo->VectorLength);

  // Now mark this instruction that it uses FPCSR implicitly.
  MI.addOperand(MachineOperand::CreateReg(ARM::FPSCR, /* isDef */ false,
                                          /* isImp */ true));
}

We basically create two virtual registers of the appropriate class for VFPSETLEN and then we build a machine instruction corresponding to the VFPSETLEN itself. Note how we mark the scratch registers as dead: nobody will use their values after this instruction.

We finally insert an implicit operand to the current instruction (not to VFPSETLEN but VADDD, VADDS, VADDDx2, VADDSx4, etc) in order to signify that this instruction now uses fpscr.

At this point we are still not able to lower vector operations because we are still missing a few bits which we will see in next installments. However we can see the impact of our changes in scalar code.

The following simple IR function can be used to observe the effect of the custom inserter.

define void @test_scalar(double *%pa, double *%pb, double *%pc) {
  %a = load double, double* %pa
  %b = load double, double* %pb
  %c = fadd double %a, %b
  store double %c, double *%pc
  ret void
}

We can use llc, the code generator of LLVM, to inspect the machine instructions. We have to stop right after instruction selection (isel) finalises because we have still not implemented how VFPSETLEN is to be lowered and that would cause llc to crash.

$ llc -mtriple=armv6kz-unknown-linux-gnu -mattr=+vfp2 \
      -stop-after=finalize-isel -simplify-mir scalar.ll -o scalar.mir
scalar.mir
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
--- |
  ; ModuleID = 'scalar.ll'
  source_filename = "scalar.ll"
  target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
  target triple = "armv6kz-unknown-linux-gnu"
  
  define void @test_scalar(double* %pa, double* %pb, double* %pc) #0 {
    %a = load double, double* %pa, align 8
    %b = load double, double* %pb, align 8
    %c = fadd double %a, %b
    store double %c, double* %pc, align 8
    ret void
  }
  
  attributes #0 = { "target-features"="+vfp2" }

...
---
name:            test_scalar
alignment:       4
tracksRegLiveness: true
registers:
  - { id: 0, class: gpr }
  - { id: 1, class: gpr }
  - { id: 2, class: gpr }
  - { id: 3, class: dpr }
  - { id: 4, class: dpr }
  - { id: 5, class: dpr }
  - { id: 6, class: gpr }
  - { id: 7, class: gprnopc }
liveins:
  - { reg: '$r0', virtual-reg: '%0' }
  - { reg: '$r1', virtual-reg: '%1' }
  - { reg: '$r2', virtual-reg: '%2' }
frameInfo:
  maxAlignment:    1
  maxCallFrameSize: 0
machineFunctionInfo: {}
body:             |
  bb.0 (%ir-block.0):
    liveins: $r0, $r1, $r2
  
    %2:gpr = COPY $r2
    %1:gpr = COPY $r1
    %0:gpr = COPY $r0
    %3:dpr = VLDRD %1, 0, 14 /* CC::al */, $noreg :: (load 8 from %ir.pb)
    %4:dpr = VLDRD %0, 0, 14 /* CC::al */, $noreg :: (load 8 from %ir.pa)
    dead %6:gpr, dead %7:gprnopc = VFPSETLEN 0, implicit-def $fpscr
    %5:dpr = VADDD killed %4, killed %3, 14 /* CC::al */, $noreg, implicit $fpscr
    VSTRD killed %5, %2, 0, 14 /* CC::al */, $noreg :: (store 8 into %ir.pc)
    BX_RET 14 /* CC::al */, $noreg

...

In line 48 the newly introduced VFPSETLEN appears. It implicitly defines fpscr (physical registers are prepended with a $ sign while virtual registers use %). And then the VADDD instruction in line 49 implicitly uses it.

This is all still very basic but it is a necessary step for correct code generation for VFPv2.

In the next installment we will continue adding the missing bits in the instruction selection and lowering of operations for vector types.