In the last installment we completed all the code generation step. However the whole process is still a bit suboptimal. Today we are going to see how we can improve this.

Discussion

From a modelling point of view, our biggest problem is that now all the vector floating point operations use an extra operand: the len field of fpscr. So we need to make sure the right value of len is set. Perhaps the most annoying fact here is that len is in practice like a global variable.

Our approach is a very simple one: every instruction will ensure len is correctly set before executing. There is an important upside to this approach: it is simple and it provides code that is trivially correct very early in the pipeline. This last property is important because it sets what we could call a correctness baseline within the code generation process. The downside is that we need to remove many redundant cases, so the quality of the code will directly depend on how good we are at removing them. Having a baseline is actually beneficial because it allows us to tell if there are functional differences once we have removed the redundant assignments to len.

However this is not the only approach possible. Another option is to delay as much as possible the updates to len and insert them when needed based on some analysis. This approach potentially can be faster, because we do not add instructions just to later remove them. The downside is that we will not enjoy a reference baseline that we can use. This means this is not an optimisation anymore, instead this is a non-optional step. The amount of analysis required for both approaches is similar. So, if we do not care about the correctness baseline, then this approach would be a better choice.

Strategy to remove redundant VFPSETLEN

We are going to use a relatively simple data flow algorithm to approximate the value of len through the program, very similar to a simple constant propagation. Our goal is to know if at a given point, what is the value of len. For the purpose of the analysis, only the value at the entry and the value at the exit of a basic block are the ones we care about.

The value of len can be modelled under three different circumstances:

  • we do not know anything, yet, about the value of len
  • we do know its exact value
  • we do not know the value because at this point of the program len could have two or more different values.

The first case happens at the beginning of the algorithm. For each basic block, its incoming len value is unknown. The only exception is the entry block of the function where the AAPCS (the Arm calling convention) guarantees that the vector length is 1 (i.e. len is 0).

So far we have not addressed the case where we do function calls, so our compiler is generating code that does not ensure len is 1 at the beginning or exit of a function. We will address this in a later chapter.

The second case happens, for instance, right after a VFPSETLEN instruction. After that instruction we know exactly the value of len (it is represented as an immediate of VFPSETLEN). So, given a basic block that contains VFPSETLEN, its last VFPSETLEN determines the outgoing len.

The third case happens, for instance, in the basic block that follows an if construct. Suppose the then block sets len to one value and the else block sets len to another different value. In this case we do not know which branch will be executed so len is in practice variable here.

Now, it should be possible, using an iterative algorithm, to propagate this information through the basic blocks.

Implementation

We are going to represent the length with the following convenience class.

struct Length {
  // Value is encoded with offset +1. 0b000 is length 1, 0b111 is length 8.
private:
  enum { Variable = 0b1000, Uninit = 0b1111 };
  uint8_t Value = Uninit;

public:
  Length() {}

  unsigned getValue() const {
    assert(hasValue() && "No value held");
    return Value;
  }
  void setValue(uint8_t V) {
    assert(0b000 <= V && V <= 0b111 && "Invalid value");
    Value = V;
  }
  bool hasValue() const {
      return !isVariable() && !isUninitialized();
  }

  void setVariable() { Value = Variable; }
  bool isVariable() const { return Value == Variable; }

  bool isUninitialized() const { return Value == Uninit; }

  bool operator==(const Length &L) const { 
      return this->Value == L.Value;
  }

  bool operator!=(const Length &L) const { 
      return !(*this == L);
  }
};

An instance of class Length can represent a length (from 0b000 to 0b111) and two extra values: uninitialised (which means unknown but as in no information is available, first case above) and variable (which also means unknown but as in conflicting information is available, third case above). The initial value of an object of class Length is uninitialised.

For each basic block we will want to know the length at the beginning of the basic block and at the end. Because the last instruction that changes len is relevant for the outgoing len, we will have a pointer to that instruction (if it exists).

struct BlockData {
  // The incoming and outgoing lengths of this block.
  Length InLen;
  Length OutLen;

  const MachineInstr* LastChange = nullptr;
};

In order to link this information to each basic block we will use a vector that we will index using the basic block number (an identifier that LLVM gives to each basic block).

std::vector<BlockData> BlockInfo;

Initialisation

A very basic step will be computing the initial information for each basic block.

void ARMOptimizeVFP2Len::computeLocalBlockInfo(const MachineBasicBlock &MBB) {
  BlockData &LI = BlockInfo[MBB.getNumber()];

  if (MBB.isEntryBlock())
    LI.InLen.setValue(0);

  LI.OutLen = LI.InLen;
  for (auto &MI : MBB) {
    if (MI.getOpcode() == ARM::VFPSETLEN) {
      LI.OutLen.setValue(MI.getOperand(2).getImm());
      LI.LastChange = &MI;
      continue;
    }

    // Handle calls first.
    if (MI.isCall()) {
      // On exit, functions restore vector length == 1.
      LI.OutLen.setValue(0);
      LI.LastChange = &MI;
      continue;
    }

    // If the FPSCR is modified outside of our control, assume
    // that it is variable.
    if (MI.modifiesRegister(ARM::FPSCR) || MI.isInlineAsm()) {
      LI.OutLen.setVariable();
      LI.LastChange = &MI;
      continue;
    }

  }
}

As mentioned above, we know the entry block will have a vector length of 1. In absence of any instruction that changes the fpscr, the initial value of len will be the same as the final value of len, hence LI.OutLen = LI.InLen;.

Now, for each instruction of the basic block (in sequence order), we analyze it. If the instruction is a VFPSETLEN it is very easy to extract the value of the length from its immediate operand.

Function calls have to preserve the length in the fpscr, so after a call the length is always 1.

We need to deal with instructions that might modify the fpscr or inline assembly. We conservatively assume they could set len to any value. One detail here, function calls might modify fpscr but not the len field we care, so we check them first.

In all those cases we update LastChange to the instruction that made the change. If no instruction in a basic block may change len then LastChange will remain as nullptr.

Propagation

Now we can propagate this information through all the basic blocks. The algoritm will iterate until no more lengths need to be propagated. This is out of scope of this post but the theoretical underpinnings (semilattice and join operation) guarantee that this should happen.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
void ARMOptimizeVFP2Len::computeBlocksInfo() {
  // Compute the initial information for the entry block.
  std::queue<const MachineBasicBlock*> WorkList;
  for (auto &MBB : *MF) {
    computeLocalBlockInfo(MBB);
    WorkList.push(&MBB);
    BlockInfo[MBB.getNumber()].InQueue = true;
  }

  while (!WorkList.empty()) {
    const MachineBasicBlock &MBB = *WorkList.front();
    WorkList.pop();
    BlockData &LI = BlockInfo[MBB.getNumber()];
    LI.InQueue = false;
    computeIncomingLen(MBB, WorkList);
  }
}

To implement this we will use a worklist. The worklist is a queue that will contain the basic blocks pending to propagate. Initially all the basic blocks should be propagated their predecessors so all of them should be in the queue. Given that we need to initialise all the basic blocks, we can initialise them and add them to the worklist queue at the same time (lines 4 to 9).

The InQueue attribute will be used to know if a basic block is in the queue or not. It will avoid infinite recursion in case of loops but may also avoid propagating too many times the same basic blocks. The InQueue field is in BlockData, which now looks like this.

struct BlockData {
  // The incoming and outgoing lengths of this block.
  Length InLen;
  Length OutLen;

  const MachineInstr* LastChange = nullptr;
  bool InQueue = false;
};

Now we can proceed with propagating the values of the length through the different basic blocks (lines 11 to 17).

The function computeIncomingLen is responsible to merge all the incoming len values from predecessors of a given block. If the merge results in a change of the output of the basic block, all the successors need to be updated as well.

But first let’s see how we merge two lengths.

static Length mergeLengths(Length Current, Length New) {
  if (Current.isVariable())
    return Current;
  if (New.isVariable())
    return New;

  if (Current.isUninitialized())
    return New;
  if (New.isUninitialized())
    return Current;

  assert(Current.hasValue() && New.hasValue());
  if (Current.getValue() == New.getValue())
    return Current;

  Length Ret;
  Ret.setVariable();
  return Ret;
}

This function must be commutative (due to the underpinning theory of this operation). If either of the merged lengths is variable, the result will be variable. This is because, once the length is variable, no other length can change that fact when merging them.

The opposite case with uninitialised. If either is uninitialised it means it has no information. So we just return the other length which, if not uninitialised, will always provide more information.

Finally if both have a known length, we check if it is the same. If they are the same just return that known length. If they are not of the same length, then we have a variable case.

Now we are ready to propagate the length through basic blocks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
void ARMOptimizeVFP2Len::computeIncomingLen(
    const MachineBasicBlock &MBB,
    std::queue<const MachineBasicBlock *> &WorkList) {
  BlockData &LI = BlockInfo[MBB.getNumber()];

  if (!LI.InLen.isVariable()) {
    for (MachineBasicBlock *P : MBB.predecessors()) {
      Length PredOutLen = BlockInfo[P->getNumber()].OutLen;
      LI.InLen = mergeLengths(LI.InLen, PredOutLen);
    }
  }

  // If nothing changes the length in this basic block, propagate, the incoming
  // length is also the outgoing length.
  bool Changed = false;
  if (!LI.LastChange) {
    Length PrevOutLen = LI.OutLen;
    LI.OutLen = LI.InLen;
    Changed = LI.OutLen != PrevOutLen;
  }

  if (Changed) {
    // If the output has changed, propagate the changes to the successors.
    for (MachineBasicBlock *S : MBB.successors()) {
      if (!BlockInfo[S->getNumber()].InQueue) {
        // Don't add again those that are in the queue already.
        WorkList.push(S);
        BlockInfo[S->getNumber()].InQueue = true;
      }
    }
  }
}

If the input length of a basic block is not variable (line 6, this is an optimisation to avoid unnecessary work because the algorithm would be correct anyway) then we merge the current incoming length with the outgoing length of all the predecessor basic blocks (lines 7 to 10). Note that initially, the incoming length of a block will be uninitialised (except for the entry block but the entry block has no predecessors!).

Once we have updated the value of the incoming length we may have to update the value of the outgoing length. This is only relevant if there is no instruction that changes len (i.e. len is propagated unmodified through the block). If this is the case, we update the outgoing length. If the propagation resulted in a new value (lines 16 to 20) for the outgoing length then we need to update the successors of this basic block (lines 22 to 31). To do that we queue the successors in the worklist. Note that successors already in the worklist are not considered.

Removal of redundant instructions

With all this information now we can proceed to remove the redundant VFPSETLEN instructions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
bool ARMOptimizeVFP2Len::removeRedundantVPFSETLEN(MachineBasicBlock &MBB) {
  Length CurrentLength = BlockInfo[MBB.getNumber()].InLen;

  bool Changed = false;
  MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end();
  while (MBBI != E) {
    MachineBasicBlock::iterator NMBBI = std::next(MBBI);
    MachineInstr &MI = *MBBI;

    if (MI.getOpcode() == ARM::VFPSETLEN) {
      unsigned Length = MI.getOperand(2).getImm();
      if (CurrentLength.hasValue() && CurrentLength.getValue() == Length) {
        LLVM_DEBUG(dbgs() << "Removing redundant: " << MI);
        // We can remove this one.
        MI.removeFromParent();
        Changed = true;
      }
      CurrentLength.setValue(Length);
    } else if (MI.isCall()) {
      CurrentLength.setValue(0);
    } else if (MI.modifiesRegister(ARM::FPSCR) || MI.isInlineAsm()) {
      CurrentLength.setVariable();
    }

    MBBI = NMBBI;
  }

  return Changed;
}

For each basic block we get its initial length, which we computed above, and keep it in CurrentLength (line 2). Now we go through each instruction that changes len and we update again CurrentLength (lines 6 to 26). If we find that a VFPSETLEN would set the same length as CurrentLength we can just remove it (lines 12 to 17).

Note that the iteration through the instructions using iterators is a bit unusual. The reason is that we may remove an element while iterating, and when this happens its iterator becomes invalid, so it would not be possible to get to advance the iterator. To avoid this why we first compute the iterator to the next instruction (line 7) and we use it to advance the loop (line 25).

Note that there is some amount of replication: the function removeRedundantVPFSETLEN and computeLocalBlockInfo must track the changes of len in the same way. Failing to do so will lead to errors. It should be possible to keep them aligned using a visitor-like pattern.

Entry point

This optimisation is run as a compiler pass of the ARM backend. There is some amount of boilerplate required to do that. This pass is a MachineFunctionPass, so it is run once per function in the program. I will skip most of the details (you can find them in the LLVM documentation) of the boilerplate required except for the entry point of the pass itself.

bool ARMOptimizeVFP2Len::runOnMachineFunction(MachineFunction &mf) {
  const ARMSubtarget &ST = static_cast<const ARMSubtarget &>(mf.getSubtarget());
  if (!ST.hasVFP2Base())
    return false;

  if (DoNotOptimizeVFP2Len)
    return false;

  if (skipFunction(mf.getFunction()))
    return false;

  MF = &mf;

  BlockInfo.clear();
  BlockInfo.resize(MF->getNumBlockIDs());
  computeBlocksInfo();

  bool Changed = false;
  for (auto &MBB : *MF) {
    Changed |= removeRedundantVPFSETLEN(MBB);
  }
  return Changed;
}

This function is invoked once per function. There is a number of cases where this function returns a false value meaning nothing was changed: if we do not have VFPv2 available, if we have explicitly requested not to optimise it or if the function explicitly requests no optimisations (via the LLVM function attribute optnone).

We can explicitly disable this optimisation pass using a command line flag. Those are declared like this.

static cl::opt<bool>
    DoNotOptimizeVFP2Len("arm-optimize-vfp2-disable", cl::Hidden,
                         cl::desc("Do not optimize vfp2 length changes"),
                         cl::init(false));

You can find details on how these flags work in the LLVM documentation.

The rest of the function is pretty straightforward. We cache the current MachineFunction in the field MF of the class. Then we initialize the array of the information of the basic blocks. We now invoke computeBlocksInfo and then we remove the redundant instructions.

Result

Now we can apply to the final LLVM IR example of last week.

define void @test_vector2(<2 x double> *%pa, <2 x double> *%pb,
                          <2 x double> *%pc, <2 x double> *%pd) { 
  %a = load <2 x double>, <2 x double>* %pa
  %b = load <2 x double>, <2 x double>* %pb
  %c = load <2 x double>, <2 x double>* %pc
  %t1 = fadd <2 x double> %a, %b
  %t2 = fadd <2 x double> %t1, %c
  store <2 x double> %t2, <2 x double> *%pd
  ret void
}
test_vector2:
	.fnstart
@ %bb.0:
	vldmia	r0, {d6, d7}
	vldmia	r1, {d4, d5}
	vmrs	r1, fpscr
	mov	r0, #65536
	bic	r1, r1, #458752
	orr	r1, r1, r0
	vmsr	fpscr, r1
	vadd.f64	d4, d6, d4
	vldmia	r2, {d6, d7}
	vadd.f64	d4, d4, d6
	vstmia	r3, {d4, d5}
	bx	lr

Yay, we set fpscr just once.

Let’s try a more complex example that shows that the propagation works as expected.

test.c
typedef __attribute__((vector_size(16))) double v2f64;

void test(v2f64 *vdc, v2f64 *vda, v2f64 *vdb, v2f64 *vdd, double *dc,
          double *da, double *db, int x) {
  if (x > 10) {
    *dc = *da + *db;
    *vdc = *vda + *vdb;
  } else {
    *vdc = *vda * *vdb;
  }

  *vdc = *vdc / *vdd;
}

Using the flag -mllvm -arm-optimize-vfp2-disable we can disable the optimisation and observe its effects.

$ diff -U1000 -u  <(clang -O2 -S -o - --target=armv6kz-unknown-linux-gnu       \
                          -mfloat-abi=hard                                     \
                          -mllvm -arm-optimize-vfp2-disable test.c )           \
                  <(clang -O2 -S -o - --target=armv6kz-unknown-linux-gnu       \
                          -mfloat-abi=hard test.c )
 test:
 	.fnstart
 @ %bb.0:                                @ %entry
-	push	{r4, r5, r6, lr}
-	ldr	r12, [sp, #28]
+	push	{r4, lr}
+	ldr	r12, [sp, #20]
 	cmp	r12, #11
 	blt	.LBB0_2
 @ %bb.1:                                @ %if.then
-	ldr	r12, [sp, #24]
-	ldr	lr, [sp, #20]
-	ldr	r6, [sp, #16]
+	ldr	r12, [sp, #16]
+	ldr	lr, [sp, #12]
+	ldr	r4, [sp, #8]
 	vldr	d0, [r12]
 	vldr	d1, [lr]
-	vmrs	r5, fpscr
-	bic	r5, r5, #458752
-	vmsr	fpscr, r5
 	vadd.f64	d0, d1, d0
-	vstr	d0, [r6]
+	vstr	d0, [r4]
 	vldmia	r1, {d6, d7}
 	vldmia	r2, {d4, d5}
 	vmrs	r2, fpscr
 	mov	r1, #65536
 	bic	r2, r2, #458752
 	orr	r2, r2, r1
 	vmsr	fpscr, r2
 	vadd.f64	d4, d6, d4
 	b	.LBB0_3
 .LBB0_2:                                @ %if.else
 	vldmia	r1, {d6, d7}
 	vldmia	r2, {d4, d5}
 	vmrs	r2, fpscr
 	mov	r1, #65536
 	bic	r2, r2, #458752
 	orr	r2, r2, r1
 	vmsr	fpscr, r2
 	vmul.f64	d4, d6, d4
 .LBB0_3:                                @ %if.end
 	vstmia	r0, {d4, d5}
 	vldmia	r3, {d6, d7}
-	vmrs	r2, fpscr
-	mov	r1, #65536
-	bic	r2, r2, #458752
-	orr	r2, r2, r1
-	vmsr	fpscr, r2
 	vdiv.f64	d4, d4, d6
 	vstmia	r0, {d4, d5}
-	pop	{r4, r5, r6, pc}
+	pop	{r4, pc}

In this test we can remove 2 of the three changes to len. A first one sets len to 0 (i.e. vector length equals to 1) to execute the scalar operation *dc = *da + *db;. But we know that upon entry len will be 0, so no need to change that. After the if statement, we set len to 1 (i.e. vector length equals to two) to execute *vdc = *vdc / *vdd;, but both branches of the if will have already set len to 1, so no need to do that again.

Current limitation

There is still a suboptimal case left: if we use vector operations inside a loop they will set len at every iteration. This is redundant if all the vector operations inside the loop use the same length. We could set the length before the loop, just once. We will address this issue in a later installment.

In the next episode we will fix the issue related to the value of len when entering and returning from a function.