Fun with vectors in the Raspberry Pi 1 - Part 2
In the previous installment we discussed how to generate code that uses the vector feature of the CPU in the Raspberry Pi 1.
Let’s start hacking LLVM.
Registers
One way to understand registers in LLVM is as a set of storage resources that we can group into register classes. Those register classes can then be used as the register operands of instructions.
The register information for the ARM backend (the one used for 32-bit Arm CPUs, currently known as the AArch32 execution state of the Arm architecture) is found in llvm/lib/Target/ARM/ARMRegisterInfo.td.
This is a tablegen file. Tablegen is a domain specific language for generating records called definitions. Each definition is an instance of a class, and classes define the attributes that a definition will contain. A tablegen file is then processed by one or more backends, usually to generate C++ code. This tablegen-generated C++ code is compiled along with the rest of the C++ code that makes up LLVM. This way it is relatively quick to update parts of the compiler without having to express them directly in C++.
Registers in LLVM are specified using definitions of class Register. Most backends have to specialise this class, so the Arm backend uses a class called ARMFReg for floating point registers.
The single precision floating point registers (s<n>) are each defined using this class.
The double precision registers (d<n>) are defined as registers that each include two single precision registers. This is achieved by first declaring what is called a subregister index.
Now the registers can be defined by telling LLVM that they have two subregister indices and then linking each subregister index to the corresponding single precision register: s<2n> and s<2n+1> for d<n>.
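The overlap between the two register files can be sketched in a few lines of Python. This is only a model of the relationship, not TableGen and not code taken from LLVM; the helper name d_subregs is made up for illustration, and it assumes the 16 double precision registers (d0 to d15) of the VFP unit in the Raspberry Pi 1.

```python
from typing import Tuple


def d_subregs(n: int) -> Tuple[str, str]:
    """Return the two single precision registers that d<n> overlaps."""
    if not 0 <= n < 16:
        raise ValueError("this model only covers d0 to d15")
    # d<n> is made of s<2n> (first subregister) and s<2n+1> (second one).
    return (f"s{2 * n}", f"s{2 * n + 1}")


if __name__ == "__main__":
    for n in (0, 1, 7, 15):
        print(f"d{n} ->", d_subregs(n))
```

Here each subregister index corresponds to one position of the returned tuple, and the two registers of every pair are always consecutive.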
OK, so we can use a similar strategy for our vector registers. Let’s first define a couple of new subregister indices. For now let’s focus on double precision.
The first argument to SubRegIndex is the size of the subregister, in bits. Because we are defining vectors of double precision, this will be 64 bits. The second operand represents the offset within the register. In contrast to d<n> registers, which do include two consecutive registers, VFP vectors may include non-consecutive registers due to the wraparound within a vector bank (recall (d7, d4)). So we specify -1 to represent that this is not a physical subregister but a logical one.
Now we can use tablegen looping features to define the pairs of registers.
The resulting definition is a bit difficult to read. base represents the d<n> that begins a vector bank: d4, d8 and d12. offset ranges over the positions within each bank. As the two nested loops run, they generate definitions. Because of the defset directive enclosing everything, those definitions will also be referenced in a list called DPRx2Regs.
So we first compute base + offset and name it m. Then we compute mnext as the logical next register, making sure we wrap around within the bank (we achieve this using !and(..., 0x3), as we have to compute mod 4).
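The arithmetic is easier to follow outside TableGen. The Python sketch below mirrors the two loops and the wraparound; it is a model written for this explanation, not the TableGen code itself. The function name pair_indices and the exact form of the mnext expression (adding the wrapped offset back to base) are assumptions.

```python
BANK_BASES = (4, 8, 12)  # first d<n> of each vector bank
BANK_SIZE = 4            # double precision registers per bank


def pair_indices(base: int, offset: int):
    """Return (m, mnext) for the pair that starts at base + offset."""
    m = base + offset
    # Masking with 0x3 is the same as taking the value mod 4, which works
    # because the bank size is a power of two.
    mnext = base + ((offset + 1) & 0x3)
    return m, mnext


pairs = [pair_indices(b, o) for b in BANK_BASES for o in range(BANK_SIZE)]

if __name__ == "__main__":
    for m, mnext in pairs:
        print(f"(d{m}, d{mnext})")
```

The fourth pair of each bank is the one that wraps around, for instance (d7, d4) in the first bank.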
Now that we have m and mnext we can define the pair itself. The definition will be named D<m>_D<mnext>x2 (e.g. D4_D5x2, D5_D6x2, D6_D7x2, D7_D4x2, D8_D9x2, …). This name is arbitrary but should be a valid C++ identifier, because one of the tablegen backends will define enumerators for those registers.
In order to generate the register we use a specialised class called VFPRegistersWithSubregs, which is just a convenience for this task.
In our use of this class, the first argument is the encoding register. We will always use the first register of the group for the encoding (however you will see that eventually we won’t be using this). We are naming those registers d<n>x2 in the assembly. We will not use these names, and in fact we should forbid them in the assembler that LLVM will generate for the ARM backend, but for simplicity we will ignore this. Finally, see how we link the current definition to both d<m> and d<mnext>.
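To make the shape of the generated definitions concrete, here is a hedged Python model of the records that the loops produce. It is not TableGen and does not reproduce the real VFPRegistersWithSubregs class: the PairRegister dataclass, its field names and build_pairs are invented for illustration, and the assembly name is assumed to be built from the first register of the pair.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PairRegister:
    def_name: str             # name of the definition, a valid C++ identifier
    asm_name: str             # name that would be used in assembly
    encoding_reg: str         # register whose encoding is reused
    subregs: Tuple[str, str]  # the two d registers that make up the pair


def build_pairs() -> List[PairRegister]:
    regs = []
    for base in (4, 8, 12):          # start of each vector bank
        for offset in range(4):      # position within the bank
            m = base + offset
            mnext = base + ((offset + 1) & 0x3)  # wrap around within the bank
            regs.append(PairRegister(
                def_name=f"D{m}_D{mnext}x2",
                asm_name=f"d{m}x2",
                encoding_reg=f"D{m}",
                subregs=(f"D{m}", f"D{mnext}"),
            ))
    return regs


# Plays the role of the list that defset collects.
DPRx2Regs = build_pairs()

if __name__ == "__main__":
    for reg in DPRx2Regs:
        print(reg)
```

Collecting the records in a single list mirrors what defset does: every definition created inside the loops is also appended to DPRx2Regs.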
Now we have the registers defined. Those are the resources. They can be used in instructions via register classes, which are the sets of registers that instructions may use. Due to the way we have designed the registers, all of them will be usable in a register class for vectors of doubles. We can simply use the list DPRx2Regs that we built using defset above.
The second operand of the register class definition is the list of machine types that we can represent with this register class. In this case v2f64 is equivalent to <2 x double> in LLVM IR. Machine types are a fixed set of types that backends can use (i.e. LLVM IR has types that machine types do not represent) and they roughly correspond to the types that CPUs physically support. The third operand is the alignment, in bits, used when loading or storing a register from memory. Due to the way we are going to load them, they can be aligned to 8 bytes (64 bits).
And that’s it. We can do the same for single precision. This time the sizes are 32 bits and each register will contain 4 subregisters. The type of the registers will be v4f32.
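The single precision case follows the same pattern with different numbers. The Python sketch below is a model under stated assumptions rather than the actual definitions: it assumes the VFP single precision vector banks start at s8, s16 and s24, that each bank holds eight registers (so the wraparound is mod 8), and that consecutive elements are one register apart. The function name sp_vector_groups is made up.

```python
def sp_vector_groups(length: int = 4):
    """Return the register numbers of every group of `length` s registers."""
    groups = []
    for base in (8, 16, 24):      # assumed start of each vector bank
        for offset in range(8):   # eight single precision registers per bank
            # Each element wraps around inside its own bank (mod 8).
            members = tuple(base + ((offset + i) & 0x7) for i in range(length))
            groups.append(members)
    return groups


groups = sp_vector_groups()

if __name__ == "__main__":
    for members in groups:
        print(", ".join(f"s{n}" for n in members))
```

Under these assumptions there are 24 groups of four registers, and groups that start near the end of a bank wrap back to its beginning, as (s13, s14, s15, s8) does.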
In the next chapter we will talk about the changes we have to make to be able to track fpscr, so we can change its len field with confidence.