Fun with vectors in the Raspberry Pi 1 - Part 8
In the last installment I mentioned we could start looking at enabling the vectoriser in the compiler. However when I did that I realised some benchmarks were giving weird results. I had made a mistake with copies, so let’s remediate this.
Source of the problem: copies and vector length
In the first installment of this series we discussed the instructions that use the vector length. Unfortunately I missed an important one: vmov, used to copy vector registers. Contrary to what I incorrectly assumed, this instruction does use the vector length, so we do not have to copy every single element of the vector. Instead, we need to make sure the vector length is correctly set.
The error I saw is that the compiler intended to copy two scalar floating point registers but because the vector length was different to 1, we were overwriting more registers than expected (the registers happened to be in banks other than the first one). That was an interesting bug to find!
However this creates a bit of a complication here: if we need to change the vector length to implement a vector copy between registers, we are actually modifying its value. This means that to be safe we need to preserve the current vector length, set it to the correct one, do the copy and restore the original vector length.
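To see why the length matters, here is a minimal sketch of a toy model of the single-precision register file, written in plain C++ and independent of LLVM. ToyVFP, its members and copyWithLength are hypothetical names introduced only for illustration. The model assumes a stride of 1 and four banks of eight registers, and it does not try to describe every corner of the hardware.

```cpp
#include <array>
#include <cassert>

// Toy model of s0..s31 arranged in 4 banks of 8 registers (assumption:
// stride 1). Only meant to illustrate the bug described in the text.
struct ToyVFP {
  std::array<float, 32> S{};
  unsigned Len = 1; // vector length in elements, 1 means scalar

  static bool inFirstBank(unsigned R) { return R < 8; }

  // Element I of the short vector starting at R, wrapping inside R's bank.
  static unsigned element(unsigned R, unsigned I) {
    return (R / 8) * 8 + (R % 8 + I) % 8;
  }

  // Models "vmov.f32 sDst, sSrc".
  void vmov(unsigned Dst, unsigned Src) {
    assert(Dst < 32 && Src < 32);
    // A destination in the first bank always makes the operation scalar.
    unsigned N = inFirstBank(Dst) ? 1 : Len;
    for (unsigned I = 0; I < N; ++I) {
      // A source in the first bank is treated as a scalar.
      unsigned From = inFirstBank(Src) ? Src : element(Src, I);
      S[element(Dst, I)] = S[From];
    }
  }

  // The preserve, set, copy, restore sequence described above.
  void copyWithLength(unsigned Dst, unsigned Src, unsigned CopyLen) {
    unsigned Saved = Len; // preserve
    Len = CopyLen;        // set
    vmov(Dst, Src);       // copy
    Len = Saved;          // restore
  }
};
```

In this model, a vmov from s16 to s8 with the length left at 4 also overwrites s9, s10 and s11, while copyWithLength with a length of 1 only touches s8 and leaves the length at 4 afterwards.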
My initial implementation did this, but it turns out the generated code is not very smart. We may already be in the right vector length when emitting the copy, in which case we do not have to do the whole preserve-copy-restore dance.
How to do this in a way that still allows us the option to remove unnecessary changes to the set vector length?
New design
Currently we only have a single VFPSETLEN pseudo instruction that we are able to optimise before register allocation.
Copies are introduced as part of the process that leads to register allocation, when phi instructions are removed. If we want to optimise changes to the vector length due to copies, we need to run our pass after register allocation.
In fact, it would be a good idea to also run it before register allocation, to reduce the register pressure implicitly created by the redundant instructions. For simplicity we will only run it after register allocation.
So, when emitting copies we will want to know that we are doing the preserve-copy-restore dance in a way that it can be optimised.
So I made the following changes:
- VFPSETLEN is now VFPSETLENi. The i designates that it receives an immediate for the vector length.
- A new VFPSETLENr pseudo instruction that sets the length as preserved in a GPR (the r means register).
- A new VFPGETLENr pseudo instruction that returns the vector length in a GPR.
@@ -2927,7 +2927,8 @@ let Uses = [FPSCR] in {
}
}
-// Set LEN field in FPSCR
+// Set LEN field in FPSCR using an immediate. This is the main way used
+// to change the vector length.
let Defs = [FPSCR],
hasNoSchedulingInfo = 1,
mayLoad = 0,
@@ -2935,11 +2936,38 @@ let Defs = [FPSCR],
hasSideEffects = 0,
hasPostISelHook = 1,
Size = 20 in
-def VFPSETLEN : PseudoInst<(outs GPR:$scratch1, GPRnopc:$scratch2),
+def VFPSETLENi : PseudoInst<(outs GPR:$scratch1, GPRnopc:$scratch2),
(ins imm0_7:$len),
IIC_fpSTAT, []>,
Requires<[HasVFP2]>;
+// This is a semantic version of VMSR which is used to signal
+// a change only to the vector length field. This is not used
+// during ISel.
+let Defs = [FPSCR],
+ hasNoSchedulingInfo = 1,
+ mayLoad = 0,
+ mayStore = 0,
+ hasSideEffects = 0,
+ Size = 4 in
+def VFPSETLENr : PseudoInst<(outs),
+ (ins GPRnopc:$len),
+ IIC_fpSTAT, []>,
+ Requires<[HasVFP2]>;
+
+// This is a semantic version of VMRS which is used to signal
+// we read the vector length. This is not used during ISel.
+let Uses = [FPSCR],
+ hasNoSchedulingInfo = 1,
+ mayLoad = 0,
+ mayStore = 0,
+ hasSideEffects = 0,
+ Size = 4 in
+def VFPGETLENr : PseudoInst<(outs GPRnopc:$len),
+ (ins),
+ IIC_fpSTAT, []>,
+ Requires<[HasVFP2]>;
+
// Spill and reload helpers.
let AM = AddrMode4 in {
let hasNoSchedulingInfo = 1,
@@ -3242,4 +3270,4 @@ def : Pat<(fsub_mlx (fmul_su SPRx4:$a, SPRx4:$b), SPRx4:$dstin),
Requires<[HasVFP2,DontUseNEONForFP,UseFPVMLx]>;
// Set length pattern.
-def : Pat<(arm_vfpsetlenzero), (VFPSETLEN 0)>;
+def : Pat<(arm_vfpsetlenzero), (VFPSETLENi 0)>;
For simplicity VFPSETLENr will overwrite the whole fpscr register with the value that VFPGETLENr read from it, but the only change we will ever make is to the vector length, so this should be reasonable. Technically we are changing more bits than the ones we claim to, but ARMv6 is not rich enough to do this in an efficient way, hence the coarser approach.
When emitting a copy we will first emit a VFPGETLENr to keep the current vector length in a GPR (say rn). Then we will emit a VFPSETLENi to set the right length of the floating data type (1 for scalars, 2 or 4 for vectors). Then the vmov. Finally we will emit a VFPSETLENr using rn to restore everything as it was before the copy.
Because those are pseudo instructions, we need to expand them at some point: VFPGETLENr will be expanded to a regular vmrs instruction, VFPSETLENr will be expanded to the dual vmsr. VFPSETLENi is expanded as usual.
@@ -2991,7 +2991,7 @@ bool ARMExpandPseudo::ExpandMI(MachineBasicBlock &MBB,
MI.eraseFromParent();
return true;
}
- case ARM::VFPSETLEN: {
+ case ARM::VFPSETLENi: {
Register Scratch1 = MI.getOperand(0).getReg();
Register Scratch2 = MI.getOperand(1).getReg();
DebugLoc dl = MI.getDebugLoc();
@@ -3026,6 +3026,27 @@ bool ARMExpandPseudo::ExpandMI(MachineBasicBlock &MBB,
MI.eraseFromParent();
return true;
}
+ case ARM::VFPSETLENr: {
+ // This is a semantic version of VMSR intended only
+ // when changing the vector length.
+ DebugLoc dl = MI.getDebugLoc();
+ Register Length = MI.getOperand(0).getReg();
+ BuildMI(MBB, MBBI, dl, TII->get(ARM::VMSR))
+ .addUse(Length)
+ .add(predOps(ARMCC::AL));
+ MI.eraseFromParent();
+ return true;
+ }
+ case ARM::VFPGETLENr: {
+ // This is a semantic version of VMRS intended only
+ // when reading the vector length.
+ DebugLoc dl = MI.getDebugLoc();
+ Register Length = MI.getOperand(0).getReg();
+ BuildMI(MBB, MBBI, dl, TII->get(ARM::VMRS), Length)
+ .add(predOps(ARMCC::AL));
+ MI.eraseFromParent();
+ return true;
+ }
case ARM::VFPSPILLDx2: {
Register Src = MI.getOperand(0).getReg();
const MachineOperand &Addr = MI.getOperand(1);
Using pseudo instructions that have trivial expansions is totally intentional: we are only using these instructions for copies so we can optimise them.
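As a reference for what VFPSETLENi does to fpscr, compared with the plain vmsr of VFPSETLENr, the bit manipulation can be sketched in a few lines of C++. The function and constant names are hypothetical. The field position (bits 16 to 18, storing the vector length minus one) matches the bic and orr immediates that show up in the assembly output later in this post.

```cpp
#include <cassert>
#include <cstdint>

// LEN lives in bits 16..18 of fpscr and stores (vector length - 1).
constexpr std::uint32_t LenShift = 16;
constexpr std::uint32_t LenMask = 0x7u << LenShift;
static_assert(LenMask == 458752u, "same value as the bic immediate");

// What happens between the vmrs and the vmsr of VFPSETLENi: clear the
// LEN field and or-in the new encoded length, leaving other bits alone.
inline std::uint32_t withLenField(std::uint32_t Fpscr,
                                  std::uint32_t EncodedLen) {
  assert(EncodedLen <= 7 && "the operand is an immediate in 0..7");
  return (Fpscr & ~LenMask) | (EncodedLen << LenShift);
}

// Extract the encoded LEN field from a value read with vmrs.
inline std::uint32_t lenField(std::uint32_t Fpscr) {
  return (Fpscr & LenMask) >> LenShift;
}
```

VFPSETLENr skips the masking entirely and writes back whatever the GPR holds, which is why it is only meant to be used with a value previously obtained with VFPGETLENr.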
Copies
Let’s overhaul copies to be correct.
There is a bit of a complication that we will not fully address here. In order to preserve the vector length (via the fpscr) we need extra registers around. Because the copy expansion happens after register allocation, we cannot enjoy virtual registers here. So we need to resort to a class called RegScavenger (the register scavenger), which will try to find free registers for us.
However, the register scavenger may not be able to find free registers. In that case we would have to pick victim registers, spill them manually onto the stack, use them and finally reload them. We do not do this here, but a more complete implementation will have to, or it risks being unable to compile code with high register pressure.
Another important detail is that not all the copies of floating-point values need changing the set vector length. If the destination register of the copy is found in the first bank, the operation is always scalar. This helps with cases where we are copying scalars (vectors will never be register allocated in the first bank).
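The rule above can be written down as a small predicate. This is only a sketch over a hypothetical, simplified description of a copy between single-precision registers (ToyCopy and mustPreserveLength are names made up for this example); the real code works on LLVM register classes instead.

```cpp
#include <cassert>

// Hypothetical description of a copy: Dst is the index of the first
// destination register (s0..s31) and Elements is how many consecutive
// registers the copied value occupies.
struct ToyCopy {
  unsigned Dst;
  unsigned Elements; // 1 for a scalar, 4 for a vector of four floats
};

// Registers s0..s7 form the first bank.
inline bool inFirstBank(unsigned S) { return S < 8; }

// True when the copy has to go through the preserve-copy-restore dance.
inline bool mustPreserveLength(const ToyCopy &C) {
  assert(C.Dst < 32 && C.Elements >= 1);
  if (C.Elements > 1)
    return true;              // vector copies always set the length
  return !inFirstBank(C.Dst); // scalar copies only outside the first bank
}
```

A scalar copy into s0 needs nothing extra, while a scalar copy into s8 or any vector copy must make sure the length is right first.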
First let’s remove the wrong code.
@@ -979,16 +1068,6 @@ void ARMBaseInstrInfo::copyPhysReg(MachineBasicBlock &MBB,
Opc = ARM::VMOVS;
BeginIdx = ARM::ssub_0;
SubRegs = 2;
- } else if (ARM::DPRx2RegClass.contains(DestReg, SrcReg) &&
- Subtarget.hasVFP2Base()) {
- Opc = ARM::VMOVD;
- BeginIdx = ARM::dsub_len2_0;
- SubRegs = 2;
- } else if (ARM::SPRx4RegClass.contains(DestReg, SrcReg) &&
- Subtarget.hasVFP2Base()) {
- Opc = ARM::VMOVS;
- BeginIdx = ARM::ssub_len4_0;
- SubRegs = 4;
} else if (SrcReg == ARM::CPSR) {
copyFromCPSR(MBB, I, DestReg, KillSrc, Subtarget);
return;
Now let’s extend the part that originally deals with scalar copies of floating point values (SPR and DPR register classes). We will add cases for SPRx4 and DPRx2. We will also note the vector length required for each of the copies.
+ unsigned VectorLength = 0;
unsigned Opc = 0;
if (SPRDest && SPRSrc)
Opc = ARM::VMOVS;
@@ -913,10 +991,19 @@ void ARMBaseInstrInfo::copyPhysReg(MachineBasicBlock &MBB,
Opc = ARM::VMOVSR;
else if (ARM::DPRRegClass.contains(DestReg, SrcReg) && Subtarget.hasFP64())
Opc = ARM::VMOVD;
- else if (ARM::QPRRegClass.contains(DestReg, SrcReg))
+ else if (ARM::DPRx2RegClass.contains(DestReg, SrcReg) &&
+ Subtarget.hasVFP2Base()) {
+ Opc = ARM::VMOVD;
+ VectorLength = 1;
+ } else if (ARM::SPRx4RegClass.contains(DestReg, SrcReg) &&
+ Subtarget.hasVFP2Base()) {
+ Opc = ARM::VMOVS;
+ VectorLength = 3;
+ } else if (ARM::QPRRegClass.contains(DestReg, SrcReg))
Opc = Subtarget.hasNEON() ? ARM::VORRq : ARM::MVE_VORR;
if (Opc) {
+ RAIISetVLEN SetVLEN(Subtarget, this, MBB, I, DestReg, SrcReg, VectorLength);
MachineInstrBuilder MIB = BuildMI(MBB, I, DL, get(Opc), DestReg);
MIB.addReg(SrcReg, getKillRegState(KillSrc));
if (Opc == ARM::VORRq || Opc == ARM::MVE_VORR)
@@ -925,6 +1012,8 @@ void ARMBaseInstrInfo::copyPhysReg(MachineBasicBlock &MBB,
addUnpredicatedMveVpredROp(MIB, DestReg);
else
MIB.add(predOps(ARMCC::AL));
+ if (SetVLEN.isPreserved() && (Opc == ARM::VMOVD || Opc == ARM::VMOVS))
+ MIB.addReg(ARM::FPSCR, RegState::Implicit);
return;
}
To help us here, we use a new class RAIISetVLEN which, given the registers being copied and the required VectorLength, will emit the proper VFPGETLENr, VFPSETLENi sequence. We use a RAII pattern because we may have to restore the vector length using VFPSETLENr. If this class tells us fpscr is being preserved, we add an implicit use of fpscr to the vmov instruction.
class RAIISetVLEN {
bool MustPreserveFPSCR = false;
MachineBasicBlock &MBB;
MachineBasicBlock::iterator I;
const ARMBaseInstrInfo *TII;
Register SaveFPCSR;
public:
RAIISetVLEN(const ARMSubtarget &Subtarget, const ARMBaseInstrInfo *TII,
MachineBasicBlock &MBB, MachineBasicBlock::iterator I,
MCRegister DestReg, MCRegister SrcReg, unsigned VectorLength)
: MBB(MBB), I(I), TII(TII) {
if (!Subtarget.hasVFP2Base())
return;
auto isFPReg = [&](MCRegister R) {
return ARM::SPRRegClass.contains(R) || ARM::DPRRegClass.contains(R);
};
auto inVectorBank = [&](MCRegister R) {
return (ARM::D4 <= R && R <= ARM::D15) ||
(ARM::S8 <= R && R <= ARM::S31);
};
auto isVectorReg = [&](MCRegister R) {
return ARM::SPRx4RegClass.contains(R) || ARM::DPRx2RegClass.contains(R);
};
// If this is a scalar register copy, and the destination register happens
// to be in a register bank other than the first, we must preserve fpscr.
if (isFPReg(DestReg) && isFPReg(SrcReg) && inVectorBank(DestReg))
MustPreserveFPSCR = true;
// Vector copies always must preserve fpscr.
if (isVectorReg(DestReg) || isVectorReg(SrcReg))
MustPreserveFPSCR = true;
if (!MustPreserveFPSCR)
return;
// Save FPSCR and set the vector length required by the copy.
RegScavenger RS;
RS.enterBasicBlock(MBB);
RS.forward(I);
SaveFPCSR = RS.FindUnusedReg(&ARM::GPRnopcRegClass);
if (SaveFPCSR == ARM::NoRegister)
report_fatal_error(
"When emitting a floating point register copy, failed "
"to find a free register");
RS.setRegUsed(SaveFPCSR);
Register Scratch1 = RS.FindUnusedReg(&ARM::GPRRegClass);
if (Scratch1 == ARM::NoRegister)
report_fatal_error(
"When emitting a floating point register copy, failed "
"to find a free register");
RS.setRegUsed(Scratch1);
Register Scratch2 = RS.FindUnusedReg(&ARM::GPRnopcRegClass);
if (Scratch2 == ARM::NoRegister)
report_fatal_error(
"When emitting a floating point register copy, failed "
"to find a free register");
RS.setRegUsed(Scratch2);
BuildMI(MBB, I, I->getDebugLoc(), TII->get(ARM::VFPGETLENr), SaveFPCSR);
BuildMI(MBB, I, I->getDebugLoc(), TII->get(ARM::VFPSETLENi))
.addDef(Scratch1, RegState::Dead)
.addDef(Scratch2, RegState::Dead)
.addImm(VectorLength);
}
~RAIISetVLEN() {
if (!MustPreserveFPSCR)
return;
// Restore FPSCR
BuildMI(MBB, I, I->getDebugLoc(), TII->get(ARM::VFPSETLENr))
.addUse(SaveFPCSR, RegState::Kill);
}
bool isPreserved() const { return MustPreserveFPSCR; }
};
Note the cases where the register scavenger fails to find a register: for now we simply give up with a fatal error. A more complete implementation must handle those cases.
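The RAII shape of the class is easier to see in isolation. Below is a minimal sketch where a basic block is just a list of strings. ToyBlock, ToyLengthGuard and emitCopy are hypothetical names for this example and there is no register scavenging at all (the saved register is always called rn).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a basic block: just a list of mnemonics.
using ToyBlock = std::vector<std::string>;

// Emits the save and the set on construction and the restore on
// destruction, but only when the length has to be preserved.
class ToyLengthGuard {
  ToyBlock &Block;
  bool Preserved;

public:
  ToyLengthGuard(ToyBlock &B, bool MustPreserve, unsigned EncodedLen)
      : Block(B), Preserved(MustPreserve) {
    assert(EncodedLen <= 7);
    if (!Preserved)
      return;
    Block.push_back("VFPGETLENr rn");
    Block.push_back("VFPSETLENi " + std::to_string(EncodedLen));
  }
  ~ToyLengthGuard() {
    if (Preserved)
      Block.push_back("VFPSETLENr rn");
  }
  ToyLengthGuard(const ToyLengthGuard &) = delete;
  ToyLengthGuard &operator=(const ToyLengthGuard &) = delete;

  bool isPreserved() const { return Preserved; }
};

// Emit a copy; the guard brackets it with the length bookkeeping.
inline void emitCopy(ToyBlock &Block, bool MustPreserve,
                     unsigned EncodedLen) {
  ToyLengthGuard Guard(Block, MustPreserve, EncodedLen);
  Block.push_back("VMOVD");
} // The destructor of Guard runs here, after the copy was emitted.
```

The benefit of this shape is that the code emitting the copy keeps a single exit path to worry about: whatever happens after the guard is created, the restore is emitted when it goes out of scope.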
Optimising
As we mentioned above, we want the optimisation pass to run after register allocation.
@@ -513,13 +513,13 @@ void ARMPassConfig::addPreRegAlloc() {
if (!DisableA15SDOptimization)
addPass(createA15SDOptimizerPass());
-
- addPass(createARMOptimizeVFP2Len());
}
}
void ARMPassConfig::addPreSched2() {
if (getOptLevel() != CodeGenOpt::None) {
+ addPass(createARMOptimizeVFP2Len());
+
if (EnableARMLoadStoreOpt)
addPass(createARMLoadStoreOptimizationPass());
To optimise these new instructions, we have to extend our existing pass. When a VFPGETLENr is seen, we will remember what register is keeping the current length. If the register is preserved until a further VFPSETLENr we know we are restoring the same length as before.
We can use this knowledge when analysing a single basic block, so we do not lose track of the length value during a VFPSETLENr.
@@ -81,6 +81,12 @@ struct BlockData {
bool InQueue = false;
};
+struct KeptVFPInfo {
+ Register Reg = ARM::NoRegister;
+ Length Len;
+ MachineInstr *MI = nullptr;
+};
+
class ARMOptimizeVFP2Len : public MachineFunctionPass {
private:
std::vector<BlockData> BlockInfo;
@@ -124,27 +124,43 @@ void ARMOptimizeVFP2Len::computeLocalBlockInfo(const MachineBasicBlock &MBB) {
if (MBB.isEntryBlock())
LI.InLen.setValue(0);
+ KeptVFPInfo KeptVFP;
+
LI.OutLen = LI.InLen;
for (auto &MI : MBB) {
- if (MI.getOpcode() == ARM::VFPSETLEN) {
+ if (MI.getOpcode() == ARM::VFPSETLENi) {
LI.OutLen.setValue(MI.getOperand(2).getImm());
LI.LastChange = &MI;
- continue;
- }
-
- // If the FPSCR is modified outside of our control, assume
- // that it is variable.
- if (MI.modifiesRegister(ARM::FPSCR) || MI.isInlineAsm()) {
+ } else if (MI.getOpcode() == ARM::VFPGETLENr) {
+ // This instruction does not change the vector length, but we will note
+ // the register being written because these are usually paired with
+ // VFPSETLENr.
+ KeptVFP.Reg = MI.getOperand(0).getReg();
+ KeptVFP.Len = LI.OutLen;
+ } else if (MI.getOpcode() == ARM::VFPSETLENr) {
+ // If we are restoring a previously kept vector length, we can also
+ // restore the known vector length.
+ if (KeptVFP.Reg == MI.getOperand(0).getReg()) {
+ LI.OutLen = KeptVFP.Len;
+ } else {
+ LI.OutLen.setVariable();
+ }
+ LI.LastChange = &MI;
+ } else if (MI.modifiesRegister(ARM::FPSCR) || MI.isInlineAsm()) {
+ // If the FPSCR is modified outside of our control, assume
+ // that it is variable.
LI.OutLen.setVariable();
LI.LastChange = &MI;
- continue;
- }
-
- if (MI.isCall()) {
+ } else if (MI.isCall()) {
// On exit, functions restore vector length == 1.
LI.OutLen.setValue(0);
LI.LastChange = &MI;
- continue;
+ }
+
+ if (MI.getOpcode() != ARM::VFPGETLENr && KeptVFP.Reg != ARM::NoRegister &&
+ MI.modifiesRegister(KeptVFP.Reg)) {
+ // If this instruction modifies the GPR holding a vector length, reset it.
+ KeptVFP.Reg = ARM::NoRegister;
}
}
}
And then use this in the removal as well.
@@ -225,25 +241,61 @@ bool ARMOptimizeVFP2Len::removeRedundantVPFSETLEN(MachineBasicBlock &MBB) {
bool Changed = false;
MachineBasicBlock::iterator MBBI = MBB.begin(), E = MBB.end();
+
+ KeptVFPInfo KeptVFP;
+
while (MBBI != E) {
MachineBasicBlock::iterator NMBBI = std::next(MBBI);
MachineInstr &MI = *MBBI;
- if (MI.getOpcode() == ARM::VFPSETLEN) {
+ bool Remove = false;
+
+ if (MI.getOpcode() == ARM::VFPSETLENi) {
unsigned Length = MI.getOperand(2).getImm();
if (CurrentLength.hasValue() && CurrentLength.getValue() == Length) {
LLVM_DEBUG(dbgs() << "Removing redundant: " << MI);
// We can remove this one.
- MI.removeFromParent();
- Changed = true;
+ Remove = true;
}
CurrentLength.setValue(Length);
+ } else if (MI.getOpcode() == ARM::VFPSETLENr) {
+ if (KeptVFP.Reg == MI.getOperand(0).getReg()) {
+ if (CurrentLength.hasValue() && KeptVFP.Len.hasValue() &&
+ CurrentLength == KeptVFP.Len) {
+ // This is restoring to the same length we kept in an earlier
+ // VFPGETLENr, so we can remove this instruction.
+ Remove = true;
+ // If the register is killed here, also remove its last generator,
+ // which we know is a VFPGETLENr.
+ assert(KeptVFP.MI && KeptVFP.MI->getOpcode() == ARM::VFPGETLENr);
+ if (MI.getOperand(0).isKill()) {
+ KeptVFP.MI->removeFromParent();
+ }
+ }
+ CurrentLength = KeptVFP.Len;
+ } else {
+ CurrentLength.setVariable();
+ }
+ } else if (MI.getOpcode() == ARM::VFPGETLENr) {
+ KeptVFP.Reg = MI.getOperand(0).getReg();
+ KeptVFP.Len = CurrentLength;
+ KeptVFP.MI = &MI;
} else if (MI.modifiesRegister(ARM::FPSCR) || MI.isInlineAsm()) {
CurrentLength.setVariable();
} else if (MI.isCall()) {
CurrentLength.setValue(0);
}
+ if (Remove) {
+ MI.removeFromParent();
+ Changed = true;
+ } else if (MI.getOpcode() != ARM::VFPGETLENr &&
+ KeptVFP.Reg != ARM::NoRegister &&
+ MI.modifiesRegister(KeptVFP.Reg)) {
+ // If this instruction modifies the GPR holding a vector length, reset it.
+ KeptVFP.Reg = ARM::NoRegister;
+ }
+
MBBI = NMBBI;
}
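The removal logic can be tried outside LLVM with a toy instruction stream. This is a simplified sketch, not the pass itself: Op, ToyInst, Unknown and removeRedundant are hypothetical names, lengths are the encoded values used by VFPSETLENi, and it only handles a single block given the length on entry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { SetLenImm, GetLen, SetLenReg, ClobberReg, Other };

struct ToyInst {
  Op Kind;
  int Value; // immediate for SetLenImm, register number for the rest
  bool Kill; // SetLenReg only: this is the last use of the register
};

constexpr int Unknown = -1; // a length or register we know nothing about

// Returns the instructions that survive, given the length on block entry.
inline std::vector<ToyInst> removeRedundant(const std::vector<ToyInst> &In,
                                            int Current) {
  std::vector<ToyInst> Out;
  int KeptReg = Unknown;   // register holding a saved length
  int KeptLen = Unknown;   // length that was current when it was saved
  std::size_t KeptIdx = 0; // position in Out of the saving GetLen
  for (const ToyInst &I : In) {
    assert(I.Value >= 0);
    switch (I.Kind) {
    case Op::SetLenImm:
      if (Current == I.Value)
        continue; // already at this length: drop it
      Current = I.Value;
      break;
    case Op::GetLen:
      KeptReg = I.Value;
      KeptLen = Current;
      KeptIdx = Out.size();
      break;
    case Op::SetLenReg:
      if (KeptReg != I.Value) {
        Current = Unknown; // restoring a value we did not track
        break;
      }
      if (Current != Unknown && Current == KeptLen) {
        if (I.Kill) { // the save has no other users: drop it too
          Out.erase(Out.begin() + static_cast<std::ptrdiff_t>(KeptIdx));
          KeptReg = Unknown;
        }
        continue; // drop the restore itself
      }
      Current = KeptLen;
      break;
    case Op::ClobberReg:
      if (I.Value == KeptReg)
        KeptReg = Unknown;
      break;
    case Op::Other:
      break;
    }
    Out.push_back(I);
  }
  return Out;
}
```

Feeding it the sequence GetLen, SetLenImm 1, Other, SetLenReg (kill) with an entry length of 1 leaves only the Other instruction; with an entry length of 0, or an unknown one, the four instructions survive.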
Results
With all the changes above in place, we can revisit the example of copies from the last chapter and see the code we emit now.
define void @test_vec(i32 %dis,
<2 x double> *%pa,
<2 x double> *%pb,
<2 x double> *%pc) {
%a = load <2 x double>, <2 x double>* %pa
%b = load <2 x double>, <2 x double>* %pb
%m = icmp slt i32 %dis, 4
br i1 %m, label %block1, label %block2
block1:
%x = fadd <2 x double> %a, %b
br label %block3
block2:
%y = fmul <2 x double> %a, %b
br label %block3
block3:
%p = phi <2 x double> [%x, %block1], [%y, %block2]
store <2 x double> %p, <2 x double> *%pc
ret void
}
$ llc -mtriple armv6kz-unknown-linux-gnu -mattr=+vfp2 -o - t_doubles_phi.ll
test_vec:
.fnstart
@ %bb.0: @ %block3
vpush {d8, d9}
vldmia r1, {d8, d9}
mov r1, #65536
cmp r0, #4
vldmia r2, {d4, d5}
vmrs r2, fpscr
bic r2, r2, #458752
orr r2, r2, r1
vmsr fpscr, r2
vmul.f64 d6, d8, d4
vadd.f64 d4, d8, d4
vmovge.f64 d4, d6
vstmia r3, {d4, d5}
vmrs r1, fpscr
bic r1, r1, #458752
vmsr fpscr, r1
vpop {d8, d9}
bx lr
This is really good as the branch is now gone.
We have to look at the MIR to understand what happened. First let’s see the MIR right before the optimisation pass.
$ llc -mtriple armv6kz-unknown-linux-gnu -mattr=+vfp2 -o - t_doubles_phi.ll \
-print-before=arm-optimize-vfp2-len
 1  bb.0.block3:
 2  successors: %bb.2(0x40000000), %bb.1(0x40000000); %bb.2(50.00%), %bb.1(50.00%)
 3  liveins: $r0, $r1, $r2, $r3, $d8, $d9
 4  $sp = frame-setup VSTMDDB_UPD $sp(tied-def 0), 14, $noreg, killed $d8, killed $d9
 5  frame-setup CFI_INSTRUCTION def_cfa_offset 16
 6  frame-setup CFI_INSTRUCTION offset $d9, -8
 7  frame-setup CFI_INSTRUCTION offset $d8, -16
 8  renamable $d4 = VLDRD renamable $r2, 0, 14, $noreg, implicit-def $d4_d5x2 :: (load 8 from %ir.pb)
 9  renamable $d5 = VLDRD killed renamable $r2, 2, 14, $noreg, implicit killed $d4_d5x2, implicit-def $d4_d5x2 :: (load 8 from %ir.pb + 8)
10  renamable $d8 = VLDRD renamable $r1, 0, 14, $noreg, implicit-def $d8_d9x2 :: (load 8 from %ir.pa)
11  renamable $d9 = VLDRD killed renamable $r1, 2, 14, $noreg, implicit killed $d8_d9x2, implicit-def $d8_d9x2 :: (load 8 from %ir.pa + 8)
12  dead renamable $r1, dead renamable $r2 = VFPSETLENi 1, implicit-def $fpscr
13  renamable $d6_d7x2 = VMULDx2 renamable $d8_d9x2, renamable $d4_d5x2, 14, $noreg, implicit $fpscr
14  renamable $d4_d5x2 = VADDDx2 killed renamable $d8_d9x2, killed renamable $d4_d5x2, 14, $noreg, implicit $fpscr
15  CMPri killed renamable $r0, 4, 14, $noreg, implicit-def $cpsr
16  Bcc %bb.2, 11, killed $cpsr
17
18  bb.1.select.false:
19  ; predecessors: %bb.0
20  successors: %bb.2(0x80000000); %bb.2(100.00%)
21  liveins: $r3, $d6_d7x2
22  $r0 = VFPGETLENr implicit $fpscr
23  dead $r1, dead $r2 = VFPSETLENi 1, implicit-def $fpscr
24  $d4_d5x2 = VMOVD killed $d6_d7x2, 14, $noreg, implicit $fpscr
25  VFPSETLENr killed $r0, implicit-def $fpscr
26
27  bb.2.select.end:
28  ; predecessors: %bb.0, %bb.1
29  liveins: $r3, $d4_d5x2
30  VSTRD renamable $d4, renamable $r3, 0, 14, $noreg :: (store 8 into %ir.pc)
31  VSTRD renamable $d5, killed renamable $r3, 2, 14, $noreg, implicit killed $d4_d5x2 :: (store 8 into %ir.pc + 8)
32  dead renamable $r0, dead renamable $r1 = VFPSETLENi 0, implicit-def $fpscr
33  $sp = frame-destroy VLDMDIA_UPD $sp(tied-def 0), 14, $noreg, def $d8, def $d9
34  BX_RET 14, $noreg
Before the optimisation, our MIR looks like this. In lines 22 to 25 above (block bb.1), we see the whole dance of VFPGETLENr, VFPSETLENi and VFPSETLENr.
Note in line 12 above we have already set the vector length to 2 (like we do in line 23), so we should be able to reuse this.
Let’s now look at the MIR after the optimisation pass.
$ llc -mtriple armv6kz-unknown-linux-gnu -mattr=+vfp2 -o - t_doubles_phi.ll \
-print-after=arm-optimize-vfp2-len
 1  bb.0.block3:
 2  successors: %bb.2(0x40000000), %bb.1(0x40000000); %bb.2(50.00%), %bb.1(50.00%)
 3  liveins: $r0, $r1, $r2, $r3, $d8, $d9
 4  $sp = frame-setup VSTMDDB_UPD $sp(tied-def 0), 14, $noreg, killed $d8, killed $d9
 5  frame-setup CFI_INSTRUCTION def_cfa_offset 16
 6  frame-setup CFI_INSTRUCTION offset $d9, -8
 7  frame-setup CFI_INSTRUCTION offset $d8, -16
 8  renamable $d4 = VLDRD renamable $r2, 0, 14, $noreg, implicit-def $d4_d5x2 :: (load 8 from %ir.pb)
 9  renamable $d5 = VLDRD killed renamable $r2, 2, 14, $noreg, implicit killed $d4_d5x2, implicit-def $d4_d5x2 :: (load 8 from %ir.pb + 8)
10  renamable $d8 = VLDRD renamable $r1, 0, 14, $noreg, implicit-def $d8_d9x2 :: (load 8 from %ir.pa)
11  renamable $d9 = VLDRD killed renamable $r1, 2, 14, $noreg, implicit killed $d8_d9x2, implicit-def $d8_d9x2 :: (load 8 from %ir.pa + 8)
12  dead renamable $r1, dead renamable $r2 = VFPSETLENi 1, implicit-def $fpscr
13  renamable $d6_d7x2 = VMULDx2 renamable $d8_d9x2, renamable $d4_d5x2, 14, $noreg, implicit $fpscr
14  renamable $d4_d5x2 = VADDDx2 killed renamable $d8_d9x2, killed renamable $d4_d5x2, 14, $noreg, implicit $fpscr
15  CMPri killed renamable $r0, 4, 14, $noreg, implicit-def $cpsr
16  Bcc %bb.2, 11, killed $cpsr
17
18  bb.1.select.false:
19  ; predecessors: %bb.0
20  successors: %bb.2(0x80000000); %bb.2(100.00%)
21  liveins: $r3, $d6_d7x2
22  $d4_d5x2 = VMOVD killed $d6_d7x2, 14, $noreg, implicit $fpscr
23
24  bb.2.select.end:
25  ; predecessors: %bb.0, %bb.1
26  liveins: $r3, $d4_d5x2
27  VSTRD renamable $d4, renamable $r3, 0, 14, $noreg :: (store 8 into %ir.pc)
28  VSTRD renamable $d5, killed renamable $r3, 2, 14, $noreg, implicit killed $d4_d5x2 :: (store 8 into %ir.pc + 8)
29  dead renamable $r0, dead renamable $r1 = VFPSETLENi 0, implicit-def $fpscr
30  $sp = frame-destroy VLDMDIA_UPD $sp(tied-def 0), 14, $noreg, def $d8, def $d9
31  BX_RET 14, $noreg
The optimisation pass identifies the unnecessary change so we can reuse the existing vector length. Now line 22 is the only instruction required to do the copy.
A later existing pass of the ARM backend identifies this pattern and adds the predicate to vmov (which becomes vmovge).
A bit of reflection on the current approach. We have gone through register allocation with instructions that we have later removed. This means that those instructions may have increased the register pressure on the rest of the code. Unfortunately the expansion currently happens after register allocation (though there are a couple of target-specific hooks that might be worth looking into) so there is not much we can do for now.
Now I think we can move onto enabling vectorization :)