Fun with vectors in the Raspberry Pi 1 - Part 5
In the last installment we completed all the code generation step. However the whole process is still a bit suboptimal. Today we are going to see how we can improve this.
Discussion
From a modelling point of view, our biggest problem is that now all the vector
floating point operations use an extra operand: the len
field of fpscr
. So
we need to make sure the right value of len
is set. Perhaps the most annoying
fact here is that len
is in practice like a global variable.
Our approach is a very simple one: every instruction will ensure len
is
correctly set before executing. There is an important upside to this approach:
it is simple and it provides code that is trivially correct very early in the
pipeline. This last property is important because it sets what we could call a
correctness baseline within the code generation process. The downside is that
we need to remove many redundant cases, so the quality of the code will
directly depend on how good we are at removing them. Having a baseline is
actually beneficial because it allows us to tell if there are functional
differences once we have removed the redundant assignments to len
.
However this is not the only approach possible. Another option is to delay as
much as possible the updates to len
and insert them when needed based on some
analysis. This approach potentially can be faster, because we do not add
instructions just to later remove them. The downside is that we will not enjoy
a reference baseline that we can use. This means this is not an optimisation
anymore, instead this is a non-optional step. The amount of analysis required
for both approaches is similar. So, if we do not care about the correctness
baseline, then this approach would be a better choice.
Strategy to remove redundant VFPSETLEN
We are going to use a relatively simple data flow algorithm to approximate the
value of len
through the program, very similar to a simple constant
propagation. Our goal is to know if at a given point, what is the value of
len
. For the purpose of the analysis, only the value at the entry and the value
at the exit of a basic block are the ones we care about.
The value of len
can be modelled under three different circumstances:
- we do not know anything, yet, about the value of
len
- we do know its exact value
- we do not know the value because at this point of the program
len
could have two or more different values.
The first case happens at the beginning of the algorithm. For each basic block,
its incoming len
value is unknown. The only exception is the entry
block of the function where the AAPCS (the Arm calling convention) guarantees
that the vector length is 1 (i.e. len
is 0).
So far we have not addressed the case where we do function calls, so our
compiler is generating code that does not ensure len
is 1 at the beginning or
exit of a function. We will address this in a later chapter.
The second case happens, for instance, right after a VFPSETLEN
instruction.
After that instruction we know exactly the value of len
(it is represented as
an immediate of VFPSETLEN
). So, given a basic block that contains
VFPSETLEN
, its last VFPSETLEN
determines the outgoing len
.
The third case happens, for instance, in the basic block that follows an if
construct. Suppose the then block sets len
to one value and the else block
sets len
to another different value. In this case we do not know which branch
will be executed so len
is in practice variable here.
Now, it should be possible, using an iterative algorithm, to propagate this information through the basic blocks.
Implementation
We are going to represent the length with the following convenience class.
An instance of class Length
can represent a length (from 0b000
to 0b111
)
and two extra values: uninitialised (which means unknown but as in no
information is available, first case above) and variable (which also means
unknown but as in conflicting information is available, third case above). The
initial value of an object of class Length
is uninitialised.
For each basic block we will want to know the length at the beginning of the
basic block and at the end. Because the last instruction that changes len
is relevant for the outgoing len
, we will have a pointer to that instruction
(if it exists).
In order to link this information to each basic block we will use a vector that we will index using the basic block number (an identifier that LLVM gives to each basic block).
Initialisation
A very basic step will be computing the initial information for each basic block.
As mentioned above, we know the entry block will have a vector length of 1. In
absence of any instruction that changes the fpscr
, the initial value of len
will be the same as the final value of len
, hence LI.OutLen = LI.InLen;
.
Now, for each instruction of the basic block (in sequence order), we analyze it.
If the instruction is a VFPSETLEN
it is very easy to extract the value of the
length from its immediate operand.
Function calls have to preserve the length in the fpscr
, so after
a call the length is always 1.
We need to deal with instructions that might modify the fpscr
or inline
assembly. We conservatively assume they could set len
to any value. One
detail here, function calls might modify fpscr
but not the len
field we
care, so we check them first.
In all those cases we update LastChange
to the instruction that made the
change. If no instruction in a basic block may change len
then LastChange
will remain as nullptr
.
Propagation
Now we can propagate this information through all the basic blocks. The algoritm will iterate until no more lengths need to be propagated. This is out of scope of this post but the theoretical underpinnings (semilattice and join operation) guarantee that this should happen.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
void ARMOptimizeVFP2Len::computeBlocksInfo() {
// Compute the initial information for the entry block.
std::queue<const MachineBasicBlock*> WorkList;
for (auto &MBB : *MF) {
computeLocalBlockInfo(MBB);
WorkList.push(&MBB);
BlockInfo[MBB.getNumber()].InQueue = true;
}
while (!WorkList.empty()) {
const MachineBasicBlock &MBB = *WorkList.front();
WorkList.pop();
BlockData &LI = BlockInfo[MBB.getNumber()];
LI.InQueue = false;
computeIncomingLen(MBB, WorkList);
}
}
To implement this we will use a worklist. The worklist is a queue that will contain the basic blocks pending to propagate. Initially all the basic blocks should be propagated their predecessors so all of them should be in the queue. Given that we need to initialise all the basic blocks, we can initialise them and add them to the worklist queue at the same time (lines 4 to 9).
The InQueue
attribute will be used to know if a basic block is in the queue
or not. It will avoid infinite recursion in case of loops but may also avoid
propagating too many times the same basic blocks. The InQueue
field is in
BlockData
, which now looks like this.
Now we can proceed with propagating the values of the length through the different basic blocks (lines 11 to 17).
The function computeIncomingLen
is responsible to merge all the incoming
len
values from predecessors of a given block. If the merge results in a
change of the output of the basic block, all the successors need to be updated
as well.
But first let’s see how we merge two lengths.
This function must be commutative (due to the underpinning theory of this operation). If either of the merged lengths is variable, the result will be variable. This is because, once the length is variable, no other length can change that fact when merging them.
The opposite case with uninitialised. If either is uninitialised it means it has no information. So we just return the other length which, if not uninitialised, will always provide more information.
Finally if both have a known length, we check if it is the same. If they are the same just return that known length. If they are not of the same length, then we have a variable case.
Now we are ready to propagate the length through basic blocks.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
void ARMOptimizeVFP2Len::computeIncomingLen(
const MachineBasicBlock &MBB,
std::queue<const MachineBasicBlock *> &WorkList) {
BlockData &LI = BlockInfo[MBB.getNumber()];
if (!LI.InLen.isVariable()) {
for (MachineBasicBlock *P : MBB.predecessors()) {
Length PredOutLen = BlockInfo[P->getNumber()].OutLen;
LI.InLen = mergeLengths(LI.InLen, PredOutLen);
}
}
// If nothing changes the length in this basic block, propagate, the incoming
// length is also the outgoing length.
bool Changed = false;
if (!LI.LastChange) {
Length PrevOutLen = LI.OutLen;
LI.OutLen = LI.InLen;
Changed = LI.OutLen != PrevOutLen;
}
if (Changed) {
// If the output has changed, propagate the changes to the successors.
for (MachineBasicBlock *S : MBB.successors()) {
if (!BlockInfo[S->getNumber()].InQueue) {
// Don't add again those that are in the queue already.
WorkList.push(S);
BlockInfo[S->getNumber()].InQueue = true;
}
}
}
}
If the input length of a basic block is not variable (line 6, this is an optimisation to avoid unnecessary work because the algorithm would be correct anyway) then we merge the current incoming length with the outgoing length of all the predecessor basic blocks (lines 7 to 10). Note that initially, the incoming length of a block will be uninitialised (except for the entry block but the entry block has no predecessors!).
Once we have updated the value of the incoming length we may have to update the
value of the outgoing length. This is only relevant if there is no instruction
that changes len
(i.e. len
is propagated unmodified through the block). If
this is the case, we update the outgoing length. If the propagation resulted in
a new value (lines 16 to 20) for the outgoing length then we need to update the
successors of this basic block (lines 22 to 31). To do that we queue the
successors in the worklist. Note that successors already in the worklist are
not considered.
Removal of redundant instructions
With all this information now we can proceed to remove the redundant VFPSETLEN
instructions.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
bool ARMOptimizeVFP2Len::removeRedundantVPFSETLEN(MachineBasicBlock &MBB) {
Length CurrentLength = BlockInfo[MBB.getNumber()].InLen;
bool Changed = false;
MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end();
while (MBBI != E) {
MachineBasicBlock::iterator NMBBI = std::next(MBBI);
MachineInstr &MI = *MBBI;
if (MI.getOpcode() == ARM::VFPSETLEN) {
unsigned Length = MI.getOperand(2).getImm();
if (CurrentLength.hasValue() && CurrentLength.getValue() == Length) {
LLVM_DEBUG(dbgs() << "Removing redundant: " << MI);
// We can remove this one.
MI.removeFromParent();
Changed = true;
}
CurrentLength.setValue(Length);
} else if (MI.isCall()) {
CurrentLength.setValue(0);
} else if (MI.modifiesRegister(ARM::FPSCR) || MI.isInlineAsm()) {
CurrentLength.setVariable();
}
MBBI = NMBBI;
}
return Changed;
}
For each basic block we get its initial length, which we computed above, and
keep it in CurrentLength
(line 2). Now we go through each instruction that changes
len
and we update again CurrentLength
(lines 6 to 26). If we find that a VFPSETLEN
would set the same length as CurrentLength
we can just remove it (lines 12 to 17).
Note that the iteration through the instructions using iterators is a bit unusual. The reason is that we may remove an element while iterating, and when this happens its iterator becomes invalid, so it would not be possible to get to advance the iterator. To avoid this why we first compute the iterator to the next instruction (line 7) and we use it to advance the loop (line 25).
Note that there is some amount of replication: the function
removeRedundantVPFSETLEN
and computeLocalBlockInfo
must track the changes
of len
in the same way. Failing to do so will lead to errors. It should be
possible to keep them aligned using a visitor-like pattern.
Entry point
This optimisation is run as a compiler pass of the ARM backend. There is some
amount of boilerplate required to do that. This pass is a
MachineFunctionPass
, so it is run once per function in the program. I will
skip most of the details (you can find them in the LLVM
documentation)
of the boilerplate required except for the entry point of the pass itself.
This function is invoked once per function. There is a number of cases where
this function returns a false value meaning nothing was changed: if we do not
have VFPv2 available, if we have explicitly requested not to optimise it or if
the function explicitly requests no optimisations (via the LLVM function
attribute optnone
).
We can explicitly disable this optimisation pass using a command line flag. Those are declared like this.
You can find details on how these flags work in the LLVM documentation.
The rest of the function is pretty straightforward. We cache the current
MachineFunction
in the field MF
of the class. Then we initialize the array
of the information of the basic blocks. We now invoke computeBlocksInfo
and
then we remove the redundant instructions.
Result
Now we can apply to the final LLVM IR example of last week.
Yay, we set fpscr
just once.
Let’s try a more complex example that shows that the propagation works as expected.
Using the flag -mllvm -arm-optimize-vfp2-disable
we can disable the
optimisation and observe its effects.
In this test we can remove 2 of the three changes to len
. A first one sets
len
to 0 (i.e. vector length equals to 1) to execute the scalar operation
*dc = *da + *db;
. But we know that upon entry len
will be 0, so no need
to change that. After the if
statement, we set len
to 1 (i.e. vector length
equals to two) to execute *vdc = *vdc / *vdd;
, but both branches of the
if
will have already set len
to 1, so no need to do that again.
Current limitation
There is still a suboptimal case left: if we use vector operations inside a loop
they will set len
at every iteration. This is redundant if all the vector operations
inside the loop use the same length. We could set the length before the loop, just
once. We will address this issue in a later installment.
In the next episode we will fix the issue related to the value of len
when
entering and returning from a function.