Fun with vectors in the Raspberry Pi 1 - Part 2
In the previous installment we briefly discussed how to generate code using the vector feature of the CPU of the Raspberry Pi 1.
Let’s start hacking LLVM.
Registers
One way to understand registers in LLVM is as a set of storage resources that we can group into register classes. Those register classes can then be used as register operands of instructions.
The register information for the ARM backend (the one used for 32-bit Arm CPUs, currently known as the AArch32 execution state of the Arm architecture) is found in llvm/lib/Target/ARM/ARMRegisterInfo.td.
This is a tablegen file. Tablegen is a domain specific language to generate records called definitions. Each definition is an instance of a class, and classes define the attributes that a definition will contain. A tablegen file is then processed by one or more backends, commonly to generate C++ code. This tablegen-generated C++ code is compiled along with the rest of the C++ code that makes up LLVM. This way it is relatively quick to update parts of the compiler without having to express them directly in C++.
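To make the idea of records and backends a bit more concrete, here is a rough Python analogy. It is not tablegen and it does not use any LLVM API: RegisterDef, DEFS and emit_cpp_enum are hypothetical names invented for this sketch, and the real backends emit considerably more than a bare enum.

```python
from dataclasses import dataclass


@dataclass
class RegisterDef:
    """Hypothetical stand-in for a tablegen class: it fixes the attributes."""
    name: str         # name of the definition, e.g. "S0"
    asm_name: str     # name used in assembly, e.g. "s0"
    hw_encoding: int  # encoding of the register in instructions


# Each element plays the role of a tablegen definition (a "def").
DEFS = [RegisterDef(f"S{i}", f"s{i}", i) for i in range(4)]


def emit_cpp_enum(defs):
    """Toy "backend": walk the records and return C++ source text."""
    lines = ["enum {"]
    lines += [f"  {d.name}," for d in defs]
    lines.append("};")
    return "\n".join(lines)


print(emit_cpp_enum(DEFS))
```

The point of the analogy is only the flow: definitions are plain data, and a separate program turns that data into C++ that is compiled with the rest of the compiler.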
Registers in LLVM are specified using definitions of class Register. Most backends have to specialise this class, so the Arm backend uses a class called ARMFReg for floating point registers.
class ARMFReg<bits<16> Enc, string n> : Register<n> {
  let HWEncoding = Enc;
  let Namespace = "ARM";
}
The single precision floating point registers (s<n>) are defined like this.
def S0 : ARMFReg< 0, "s0">; def S1 : ARMFReg< 1, "s1">;
def S2 : ARMFReg< 2, "s2">; def S3 : ARMFReg< 3, "s3">;
def S4 : ARMFReg< 4, "s4">; def S5 : ARMFReg< 5, "s5">;
def S6 : ARMFReg< 6, "s6">; def S7 : ARMFReg< 7, "s7">;
def S8 : ARMFReg< 8, "s8">; def S9 : ARMFReg< 9, "s9">;
def S10 : ARMFReg<10, "s10">; def S11 : ARMFReg<11, "s11">;
def S12 : ARMFReg<12, "s12">; def S13 : ARMFReg<13, "s13">;
def S14 : ARMFReg<14, "s14">; def S15 : ARMFReg<15, "s15">;
def S16 : ARMFReg<16, "s16">; def S17 : ARMFReg<17, "s17">;
def S18 : ARMFReg<18, "s18">; def S19 : ARMFReg<19, "s19">;
def S20 : ARMFReg<20, "s20">; def S21 : ARMFReg<21, "s21">;
def S22 : ARMFReg<22, "s22">; def S23 : ARMFReg<23, "s23">;
def S24 : ARMFReg<24, "s24">; def S25 : ARMFReg<25, "s25">;
def S26 : ARMFReg<26, "s26">; def S27 : ARMFReg<27, "s27">;
def S28 : ARMFReg<28, "s28">; def S29 : ARMFReg<29, "s29">;
def S30 : ARMFReg<30, "s30">; def S31 : ARMFReg<31, "s31">;
The double precision registers (d<n>) are defined as registers that each include two single precision registers. This is achieved by first declaring what is called a subregister index.
def ssub_0 : SubRegIndex<32>;
def ssub_1 : SubRegIndex<32, 32>;
Now the registers can be defined by telling LLVM that they have two subregister indices and then linking each subregister index to the corresponding single precision registers: d<n> is made of s<2n> and s<2n+1>.
// Aliases of the F* registers used to hold 64-bit fp values (doubles)
let SubRegIndices = [ssub_0, ssub_1] in {
def D0 : ARMReg< 0, "d0", [S0, S1]>, DwarfRegNum<[256]>;
def D1 : ARMReg< 1, "d1", [S2, S3]>, DwarfRegNum<[257]>;
def D2 : ARMReg< 2, "d2", [S4, S5]>, DwarfRegNum<[258]>;
def D3 : ARMReg< 3, "d3", [S6, S7]>, DwarfRegNum<[259]>;
def D4 : ARMReg< 4, "d4", [S8, S9]>, DwarfRegNum<[260]>;
def D5 : ARMReg< 5, "d5", [S10, S11]>, DwarfRegNum<[261]>;
def D6 : ARMReg< 6, "d6", [S12, S13]>, DwarfRegNum<[262]>;
def D7 : ARMReg< 7, "d7", [S14, S15]>, DwarfRegNum<[263]>;
def D8 : ARMReg< 8, "d8", [S16, S17]>, DwarfRegNum<[264]>;
def D9 : ARMReg< 9, "d9", [S18, S19]>, DwarfRegNum<[265]>;
def D10 : ARMReg<10, "d10", [S20, S21]>, DwarfRegNum<[266]>;
def D11 : ARMReg<11, "d11", [S22, S23]>, DwarfRegNum<[267]>;
def D12 : ARMReg<12, "d12", [S24, S25]>, DwarfRegNum<[268]>;
def D13 : ARMReg<13, "d13", [S26, S27]>, DwarfRegNum<[269]>;
def D14 : ARMReg<14, "d14", [S28, S29]>, DwarfRegNum<[270]>;
def D15 : ARMReg<15, "d15", [S30, S31]>, DwarfRegNum<[271]>;
}
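The overlap encoded by the listing above can be summarised with a small Python model. This is only an illustration of the mapping, assuming the 16 double precision registers shown above; d_subregs is a hypothetical helper, not something that exists in LLVM.

```python
def d_subregs(n):
    """Return the names of the two s registers overlapped by d<n> (0 <= n <= 15)."""
    if not 0 <= n <= 15:
        raise ValueError("only d0 to d15 overlap single precision registers")
    # ssub_0 is the low half (bits 0-31), ssub_1 the high half (bits 32-63).
    return (f"s{2 * n}", f"s{2 * n + 1}")


for n in (0, 7, 15):
    print(f"d{n}", "->", d_subregs(n))
```

Here the two halves always sit at fixed offsets (0 and 32 bits), which is why ssub_0 and ssub_1 can carry an explicit offset.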
OK, so we can use a similar strategy for our vector registers. Let’s first define a couple of new subregister indices. For now let’s focus on double precision.
def dsub_len2_0: SubRegIndex<64, -1>;
def dsub_len2_1: SubRegIndex<64, -1>;
The first argument to SubRegIndex is the size of the subregister in bits. Because we are defining vectors of double precision, this will be 64 bits. The second argument represents the offset, in bits, within the register. In contrast to d<n> registers, which always include two consecutive registers, VFP vectors may include non-consecutive registers due to the wraparound within a vector bank (recall (d7, d4)). So we specify -1, which tells LLVM that the subregister does not sit at a fixed offset: it is not a physical subregister but a logical one.
Now we can use tablegen looping features to define the pairs of registers.
// Double precision pairs
defset list<Register> DPRx2Regs = {
  foreach base = [4, 8, 12] in {
    foreach offset = [0, 1, 2, 3] in {
      defvar m = !add(base, offset);
      defvar mnext = !add(base, !and(!add(offset, 1), 0x3));
      let SubRegIndices = [dsub_len2_0, dsub_len2_1] in {
        def "D" # m # "_D" # mnext # "x2" :
          VFPRegistersWithSubregs<
            !cast<Register>("D" # m),
            "d" # m # "x2",
            [!cast<Register>("D" # m), !cast<Register>("D" # mnext)],
            ["d" # m # "x2"]>;
      }
    }
  }
}
This is a bit difficult to read. base represents the d<n> that begins a vector bank: d4, d8 and d12. offset represents the position of a register within its bank (each bank holds four d<n> registers). The two nested loops generate one definition per iteration. Because of the defset directive enclosing everything, those definitions will also be collected in a list called DPRx2Regs.
So we first compute base + offset and we name this m. Then we compute mnext as the logical next register, making sure we wrap around within the bank (we achieve this using !and(..., 0x3), which computes the value modulo 4).
Now that we have m and mnext we can define the pair itself. The definition will be named D<m>_D<mnext>x2 (e.g. D4_D5x2, D5_D6x2, D6_D7x2, D7_D4x2, D8_D9x2, …). This name is arbitrary but should be a valid C++ identifier because one of the tablegen backends will define enumerators for those registers.
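If the loops are hard to follow, the same enumeration can be written as a few lines of Python. This is just a model of the arithmetic, not part of the LLVM change, and dpr_x2_names is a hypothetical name used only here.

```python
def dpr_x2_names():
    """Model the two foreach loops: one pair per register in banks d4, d8, d12."""
    names = []
    for base in (4, 8, 12):
        for offset in range(4):
            m = base + offset
            # Wrap around inside the bank; "& 0x3" is the same as "% 4".
            mnext = base + ((offset + 1) & 0x3)
            names.append(f"D{m}_D{mnext}x2")
    return names


names = dpr_x2_names()
print(len(names))
print(names[:4])
```

Each bank contributes four pairs, so twelve definitions are generated in total, and the last pair of every bank wraps back to the first register of that bank (for instance D7_D4x2).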
In order to generate the register we use a specialised class called VFPRegistersWithSubregs, which is just a convenience for this task.
class VFPRegistersWithSubregs<Register EncReg, string n, list<Register> subregs,
                              list<string> alt = []>
    : RegisterWithSubRegs<n, subregs> {
  let HWEncoding = EncReg.HWEncoding;
  let AltNames = alt;
  let Namespace = "ARM";
}
If you check above how we use this class, the first argument is the encoding register. We will always use the first register of the group for the encoding (however, you will see that eventually we won’t be using this). We are naming those registers d<m>x2 in the assembly. We will not use those names, and in fact we should forbid them in the assembler that LLVM will generate for the ARM backend, but for simplicity we will ignore this. Finally, note how we link the current definition to d<m> and d<mnext> as its subregisters.
Now we have the registers defined. Those are the resources. Those resources can be used in instructions via register classes, which are the sets of registers usable in instructions. Due to the way we have designed the registers, all of them will be usable in a register class for vectors of doubles. We can simply use the list DPRx2Regs that we built using defset above.
def DPRx2 : RegisterClass<"ARM", [v2f64], 64, (add DPRx2Regs)>;
The second operand is the list of machine types that we can represent with the registers of this class. In this case v2f64 is equivalent to <2 x double> in LLVM IR. Machine types are a fixed set of types that backends can use (i.e. LLVM IR has types that machine types do not represent) and are loosely associated with the physical types of CPUs. The third operand is the alignment, in bits, used when loading or storing a register from memory. Due to the way we are going to load them, they can be aligned to 8 bytes (64 bits).
And that’s it. We can do the same for single precision. This time the subregister size is 32 bits and each register will contain 4 subregisters. The type of the registers will be v4f32.
def ssub_len4_0: SubRegIndex<32, -1>;
def ssub_len4_1: SubRegIndex<32, -1>;
def ssub_len4_2: SubRegIndex<32, -1>;
def ssub_len4_3: SubRegIndex<32, -1>;
// Single precision quads
defset list<Register> SPRx4Regs = {
  foreach base = [8, 16, 24] in {
    foreach offset = [0, 1, 2, 3, 4, 5, 6, 7] in {
      defvar m = !add(base, offset);
      defvar mnext1 = !add(base, !and(!add(offset, 1), 0x7));
      defvar mnext2 = !add(base, !and(!add(offset, 2), 0x7));
      defvar mnext3 = !add(base, !and(!add(offset, 3), 0x7));
      let SubRegIndices = [ssub_len4_0, ssub_len4_1, ssub_len4_2, ssub_len4_3]
      in {
        def "S" # m # "_S" # mnext1 # "_S" # mnext2 # "_S" # mnext3 # "x4" :
          VFPRegistersWithSubregs<
            !cast<Register>("S" # m),
            "s" # m # "x4",
            [!cast<Register>("S" # m),
             !cast<Register>("S" # mnext1),
             !cast<Register>("S" # mnext2),
             !cast<Register>("S" # mnext3)],
            ["s" # m # "x4"]>;
      }
    }
  }
}
def SPRx4 : RegisterClass<"ARM", [v4f32], 32, (add SPRx4Regs)>;
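As a sanity check of the single precision case, here is a Python model of the same loops. Again this is only a sketch of the arithmetic; spr_x4_groups and group_name are hypothetical helpers that do not exist in LLVM.

```python
def spr_x4_groups():
    """Model the loops above: banks start at s8, s16 and s24, eight registers each."""
    groups = []
    for base in (8, 16, 24):
        for offset in range(8):
            # Four consecutive registers, wrapping inside the bank (mod 8).
            groups.append([base + ((offset + k) & 0x7) for k in range(4)])
    return groups


def group_name(regs):
    """Build the definition name, e.g. S8_S9_S10_S11x4."""
    return "_".join(f"S{r}" for r in regs) + "x4"


groups = spr_x4_groups()
print(len(groups))
print(group_name(groups[0]))
print(group_name(groups[6]))
```

There are three banks of eight starting positions, so 24 quads are generated. Groups that start in the last three positions of a bank wrap around: the one starting at s14 is made of s14, s15, s8 and s9.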
In the next chapter we will talk about the changes we have to make to be able to track fpscr so we can change the len field with confidence.