In the last installment I mentioned we could start looking at enabling
the vectoriser in the compiler. However, when I did that, I realised some
benchmarks were giving weird results. I had made a mistake with copies,
so let’s fix that first.
Source of the problem: copies and vector length
In the first installment of this series we discussed instructions that
use the vector length. Unfortunately I missed an important one: vmov, used
to copy vector registers. Contrary to what I had incorrectly assumed, this
instruction does use the vector length, so we do not have to copy every single
element of the vector one by one. Instead, we need to make sure the vector
length is correctly set before the copy.
The error I saw was that the compiler intended a copy between two scalar
floating-point registers, but because the vector length was not 1, we were
overwriting more registers than expected (the registers involved happened to
be in banks other than the first one). That was an interesting bug to find!
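To make the failure mode concrete, here is a tiny model (plain C++ for illustration, not compiler code) of how a short-vector vmov behaves on single-precision registers. The bank size of eight and the wrap-within-bank behaviour follow the VFP short-vector rules; the register file is just an array, and double-precision banks are left out for brevity.

```cpp
#include <array>
#include <cassert>

using RegFile = std::array<float, 32>;  // s0..s31

// Simplified vmov: when the destination is outside bank 0, the copy moves
// `len` consecutive elements, wrapping within the 8-register bank, instead
// of a single register.
void vmov(RegFile &regs, int dst, int src, int len) {
  const int bank = 8;
  if (dst / bank == 0) len = 1;  // destinations in the first bank are scalar
  for (int i = 0; i < len; ++i) {
    int d = (dst / bank) * bank + (dst % bank + i) % bank;  // wrap in bank
    int s = (src / bank) * bank + (src % bank + i) % bank;
    regs[d] = regs[s];
  }
}
```

With len set to 4, a copy that was meant to move a single scalar between banked registers silently clobbers three neighbours, which is exactly the kind of corruption the benchmarks were hitting.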
However, this creates a complication: if we need to change the vector length
to implement a vector copy between registers, we are modifying state that the
rest of the code relies on. This means that, to be safe, we need to preserve
the current vector length, set the one the copy requires, do the copy, and
restore the original vector length.
My initial implementation did this, but it turns out the resulting code is
not very smart. We may already be at the right vector length when emitting
the copy, in which case we do not have to do the whole preserve-copy-restore
dance. How can we do this in a way that still lets us remove unnecessary
changes to the vector length?
New design
Currently we only have a single VFPSETLEN pseudo instruction that we are able
to optimise before register allocation.
Copies are introduced as part of the process that leads to register
allocation, when phi instructions are removed. If we want to optimise changes
to the vector length due to copies, we need to run our pass after register
allocation.
In fact, it would be a good idea to also run it before register allocation, to
reduce the register pressure implicitly created by the redundant instructions.
For simplicity we will only run it after register allocation.
So, when emitting copies, we want to do the preserve-copy-restore dance in a
way that can be optimised later.
So I did the following changes:
VFPSETLEN is now VFPSETLENi. The i designates it receives an immediate for the vector length.
A new VFPSETLENr pseudo instruction that sets the length as preserved in a GPR register (the r means register).
A new VFPGETLENr pseudo instruction that returns the vector length in a GPR.
For simplicity, VFPSETLENr and VFPGETLENr will overwrite the whole fpscr
register, but the only change we will ever make is to the vector length, so
this should be reasonable. Technically we are touching more bits than the ones
we claim to, but ARMv6 does not offer an efficient way to update only those
bits, hence the coarser approach.
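For reference, the vector length lives in the LEN field of fpscr, bits 16 to 18, encoded as the length minus one. The sketch below (helper names are mine, not from the post) shows the field arithmetic that an exact read-modify-write of only those bits would involve; the pseudo instructions above deliberately skip this and clobber the whole register.

```cpp
#include <cassert>
#include <cstdint>

// fpscr LEN field: bits 18:16, holding (vector length - 1).
constexpr uint32_t kLenShift = 16;
constexpr uint32_t kLenMask = 0x7u << kLenShift;

// Read-modify-write of just the LEN field, leaving other fpscr bits alone.
uint32_t setLength(uint32_t fpscr, int len) {
  return (fpscr & ~kLenMask) | (uint32_t(len - 1) << kLenShift);
}

int getLength(uint32_t fpscr) {
  return int((fpscr & kLenMask) >> kLenShift) + 1;
}
```

Doing this in real code would take a vmrs, a bit-twiddling sequence on the GPR, and a vmsr for every update, which is why clobbering the whole register is the pragmatic choice here.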
When emitting a copy, we will first emit a VFPGETLENr to keep the current
vector length in a GPR (say rn). Then we will emit a VFPSETLENi to set the
right length for the floating-point data type being copied (1 for scalars, 2
or 4 for vectors). Then the vmov itself. Finally, we will emit a VFPSETLENr
using rn to restore everything as it was before the copy.
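The emitted order can be sketched as follows; this is a string-based stand-in for illustration (the real code builds MachineInstrs), with the function name and signature being my own invention.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the copy lowering described above, emitting the
// preserve-copy-restore sequence as plain text.
std::vector<std::string> emitFPCopy(const std::string &dst,
                                    const std::string &src, int len,
                                    const std::string &scratch) {
  return {
      "VFPGETLENr " + scratch,              // save current length in a GPR
      "VFPSETLENi " + std::to_string(len),  // 1 scalar, 2 or 4 vector
      "vmov " + dst + ", " + src,           // the copy itself
      "VFPSETLENr " + scratch,              // restore the saved length
  };
}
```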
Because those are pseudo instructions, we need to expand them at some point:
VFPGETLENr will be expanded to a regular vmrs instruction, and VFPSETLENr to
its counterpart vmsr. VFPSETLENi is expanded as usual.
Using pseudo instructions that have trivial expansions is totally intentional:
we are only using these instructions for copies so we can optimise them.
Copies
Let’s overhaul copies to be correct.
There is a bit of a complication here that we will not fully address. In
order to preserve the vector length (via fpscr) we need extra registers
around. Because the copy expansion happens after register allocation, we
cannot rely on virtual registers. So we need to resort to a class called
RegScavenger (the register scavenger), which will try to find free registers
for us.
However, the register scavenger may not be able to find free registers. In
that case we would have to pick victim registers, manually spill them onto
the stack, use them, and finally reload them. We do not do this here, but a
more complete implementation would have to, or it risks being unable to
compile code with high register pressure.
Another important detail is that not all copies of floating-point values
need a change to the vector length. If the destination register of the copy
is in the first bank, the operation is always scalar. This helps with the
cases where we are copying scalars (vectors will never be register allocated
in the first bank).
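That check is cheap; a minimal helper along these lines (the name and signature are illustrative, assuming s0-s7 and d0-d3 form the first bank) would suffice:

```cpp
#include <cassert>

// A destination in the first bank (s0-s7 for singles, d0-d3 for doubles)
// always executes as a scalar operation, so such copies never need the
// preserve-copy-restore sequence.
bool isAlwaysScalarDest(bool isDouble, unsigned regNum) {
  return isDouble ? regNum < 4 : regNum < 8;
}
```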
First let’s remove the wrong code.
Now let’s extend the part that originally deals with scalar copies of
floating-point values (the SPR and DPR register classes). We will add cases
for SPRx4 and DPRx4, and also note the vector length required for each of
the copies.
To help us here, we use a new class, RAIISetVLEN, which, given the registers
being copied and the required VectorLength, will emit the proper VFPGETLENr,
VFPSETLENi sequence. We use a RAII pattern because we may have to restore the
vector length using VFPSETLENr once the copy is done. If this class tells us
fpscr is being preserved, we add an explicit use of fpscr to the vmov
instruction.
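The idea can be sketched as a small guard class; this is an assumed simplification of RAIISetVLEN (a string emitter stands in for real MachineInstr construction, and scavenging the scratch GPR is left out):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// RAII guard: the constructor saves and sets the vector length only when
// the copy actually needs it; the destructor restores the saved value.
class RAIISetVLEN {
  std::vector<std::string> &out;
  std::string scratch;
  bool preserved = false;

public:
  RAIISetVLEN(std::vector<std::string> &out, int len, bool scalarDest,
              std::string scratchReg)
      : out(out), scratch(std::move(scratchReg)) {
    if (scalarDest) return;  // first-bank destination: always scalar
    out.push_back("VFPGETLENr " + scratch);
    out.push_back("VFPSETLENi " + std::to_string(len));
    preserved = true;
  }
  // Caller uses this to add the explicit fpscr use to the vmov.
  bool isPreserving() const { return preserved; }
  ~RAIISetVLEN() {
    if (preserved) out.push_back("VFPSETLENr " + scratch);
  }
};
```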
Note that there are conditions under which the register scavenger would fail
to find a register. A more complete implementation must handle those cases.
Optimising
As we mentioned above, we want the optimisation pass to run after register
allocation.
To optimise these new instructions, we have to extend our existing pass. When
a VFPGETLENr is seen, we remember which register is keeping the current
length. If that register is preserved until a later VFPSETLENr, we know we
are restoring the same length as before. We can use this knowledge when
analysing a single basic block, so we do not lose track of the length value
across a VFPSETLENr, and then exploit it again when removing redundant
instructions.
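A toy version of that basic-block cleanup might look as follows. This is an assumed simplification of the real pass: the Inst struct stands in for MachineInstr, clobbers of the scratch GPR between save and restore are ignored, and dead VFPGETLENr instructions left behind are not cleaned up.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Inst {
  std::string op;
  std::string arg;
};

// Walk one basic block, tracking the known vector length and which GPR
// saved it, then drop VFPSETLENi/VFPSETLENr instructions that would not
// change the length.
std::vector<Inst> removeRedundantLenChanges(const std::vector<Inst> &bb) {
  std::vector<Inst> out;
  int curLen = 0;                       // 0 means unknown on block entry
  std::map<std::string, int> savedLen;  // GPR -> length saved into it
  for (const Inst &I : bb) {
    if (I.op == "VFPGETLENr") {
      savedLen[I.arg] = curLen;         // this GPR now holds curLen
      out.push_back(I);
    } else if (I.op == "VFPSETLENi") {
      int len = std::stoi(I.arg);
      if (len == curLen) continue;      // already the right length: drop
      curLen = len;
      out.push_back(I);
    } else if (I.op == "VFPSETLENr") {
      auto it = savedLen.find(I.arg);
      if (it != savedLen.end() && it->second != 0 && it->second == curLen)
        continue;                       // restores the length already set: drop
      curLen = (it != savedLen.end()) ? it->second : 0;
      out.push_back(I);
    } else {
      out.push_back(I);                 // anything else passes through
    }
  }
  return out;
}
```

On a block that already set the length to 1 before the copy, both the inner VFPSETLENi and the restoring VFPSETLENr fall away, which mirrors the improvement shown in the results below.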
Results
With all the changes above in place, we can revisit the example of copies
from the last installment and see the code we emit now.
This is really good as the branch is now gone.
We have to look at the MIR to understand what happened. First let’s see the MIR
right before the optimisation pass.
The optimisation pass identifies the unnecessary change so we can reuse the existing
vector length. Now line 22 is the only instruction required to do the copy.
An existing later pass of the ARM backend identifies this pattern and adds a
predicate to the vmov (which becomes vmovge).
A bit of reflection on the current approach: we have gone through register
allocation with instructions that we later removed. This means that those
instructions may have increased the register pressure on the rest of the
code. Unfortunately, the expansion currently happens after register
allocation (though there are a couple of target-specific hooks that might be
worth looking into), so there is not much we can do for now.
Now I think we can move on to enabling vectorisation :)