Long ago, we saw that the Raspberry Pi 1 has vector computation capabilities. However, to the best of my knowledge, no compiler has attempted to exploit this vector capability in general.

I think we are going to have some fun in trying to fix this.

# A bit of a reminder

The Raspberry Pi 1 comes with a BCM2835 SoC that includes an ARM1176JZF-S core. This core implements the ARMv6 architecture and supports floating-point operations via an extension of ARMv6 that was commercially called VFP. VFP went through several versions; the one included in the ARM1176JZF-S core is VFPv2.

VFP provides 32 single-precision registers (`s0` to `s31`), which can be grouped in pairs to form 16 double-precision registers (`d0` to `d15`).

However, the registers are grouped in banks of 8 `s` registers (equivalently, 4 `d` registers): a first scalar bank (`s0`–`s7`, i.e. `d0`–`d3`) and three vector banks (`s8`–`s15`, `s16`–`s23` and `s24`–`s31`, i.e. `d4`–`d7`, `d8`–`d11` and `d12`–`d15`).

The `fpscr` register has a field called `len` that encodes the vector length.
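
For reference, `fpscr` is read and written with the `vmrs` and `vmsr` instructions. A minimal sketch of setting the vector length to 2 (the `len` field occupies bits 16 to 18 of `fpscr` and holds the length minus one):

```asm
vmrs r0, fpscr              @ read fpscr into a core register
bic  r0, r0, #0x00070000    @ clear the len field (bits 18:16)
orr  r0, r0, #0x00010000    @ len = 0b001, i.e. vector length 2
vmsr fpscr, r0              @ write it back
```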

By default the vector length is 1 (encoded as `0b000` in the `len` field of `fpscr`), so all operations on all registers are regular scalar floating-point operations, regardless of the bank the registers are in.

However, if we set the vector length to a value larger than 1 (typically 2 or 4, though any value from 2 to 8 is possible), the behaviour of a few instructions changes. If the operands only involve registers in the scalar bank, they continue to compute scalar values.

If the operands involve registers in the vector banks, the computation will be extended to the next `len` registers, wrapping around inside the bank if needed.

There is also a mixed mode: when one source register is in the scalar bank and the other one is in a vector bank, the value from the scalar bank is reused `len` times, once as an operand of each of the vector operations.
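
To make this concrete, here is how the same instruction behaves depending on the banks of its operands, assuming the vector length has been set to 4 (the register choices are only for illustration):

```asm
@ fpscr.len set to 4 (encoded as 0b011)
vadd.f32 s0, s1, s2      @ all in the scalar bank: a single scalar add
vadd.f32 s8, s16, s24    @ all in vector banks: four adds,
                         @ s8←s16+s24, s9←s17+s25, s10←s18+s26, s11←s19+s27
vadd.f32 s14, s22, s30   @ wraps around inside each bank:
                         @ (s14,s15,s8,s9) ← (s22,s23,s16,s17) + (s30,s31,s24,s25)
vadd.f32 s8, s16, s2     @ mixed: s2 is reused as the second operand of all four adds
```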

Finally, note that not all the instructions honour `len`; only the following do:

| Instruction | Meaning | Notes |
|-------------|---------|-------|
| `vadd.<ty> a, b, c` | `a ← b + c` | |
| `vsub.<ty> a, b, c` | `a ← b - c` | |
| `vmul.<ty> a, b, c` | `a ← b * c` | |
| `vnmul.<ty> a, b, c` | `a ← -(b * c)` | |
| `vdiv.<ty> a, b, c` | `a ← b / c` | |
| `vmla.<ty> a, b, c` | `a ← a + (b * c)` | Rounds differently from `vmul` followed by `vadd` |
| `vmls.<ty> a, b, c` | `a ← a - (b * c)` | Ditto |
| `vnmla.<ty> a, b, c` | `a ← -a - (b * c)` | Ditto |
| `vnmls.<ty> a, b, c` | `a ← -a + (b * c)` | Ditto |
| `vneg.<ty> a, b` | `a ← -b` | |
| `vabs.<ty> a, b` | `a ← \|b\|` | |
| `vsqrt.<ty> a, b` | `a ← √b` | |

# Modelling in LLVM

Before we can even try to generate code, we need to think a bit about how to model vector computation in the LLVM infrastructure.

## Types

LLVM IR supports vector types of the form `<k x ty>`, where `k` is a constant and `ty` is the base type of the vector. In our case `ty` is going to be either `double` (called `f64` in the LLVM code generator) or `float` (called `f32`).

Technically speaking we could support a range of vector types, especially for `float`, from 2 to 8, as in `<2 x float>`, `<3 x float>`, etc. However, non-power-of-2 vector lengths are a bit unusual; in fact the LLVM vectorizer currently only considers powers of two. To keep things simple, we will consider only `<2 x double>` and `<4 x float>`.
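
For concreteness, operations on these two types look like the following in IR (a throwaway snippet; the value names are made up):

```llvm
; the only vector types we will support
%vd = fadd <2 x double> %a, %b    ; will live in a pair of d registers
%vf = fmul <4 x float> %p, %q     ; will live in four consecutive s registers
```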

Technically speaking, `<4 x double>` is something we could consider too. However, if you look at the register banks above you will see that such a vector would occupy a whole vector bank, so in practice we can only represent 3 of those vectors at a time. A few algorithms can live with this, but it will often be too limiting.

## Registers

The next step is to decide what storage we are going to use to represent those vectors.

If you look closely at the table above and given the behaviour of VFP, any consecutive pair of `d<n>` registers within the same vector bank can be used for `<2 x double>`. This means that `(d4, d5)`, `(d5, d6)`, `(d6, d7)` and `(d7, d4)` are the four options available in the first vector bank. Note that the pair `(d5, d6)` may feel a bit suboptimal, but we can still use `(d7, d4)` if needed. LLVM knows when two groups of registers overlap, so in principle we can offer all the options and let LLVM generate correct code.

We will see later that some other parts of VFP will lead us to prefer pairs whose first component is an even-numbered `d<n>` register. It is also possible to tell LLVM to prefer a specific allocation order for registers.

A similar rationale applies to `<4 x float>`, but this time it will be groups of 4 consecutive `s<n>` registers (again including the wrap-around within the bank). For instance `(s8, s9, s10, s11)`, but also `(s14, s15, s8, s9)`. Similarly, groups whose first `s<n>` register number is a multiple of four will be preferable.
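
Inside the backend, this enumeration of groups maps naturally onto LLVM's `RegisterTuples` TableGen machinery. A hypothetical sketch for the pairs of the first vector bank (not the actual ARM backend definitions):

```tablegen
// Hypothetical: the four d-register pairs of the first vector bank,
// with the second element wrapping around inside the bank.
def VecDPairs : RegisterTuples<[dsub_0, dsub_1],
                               [(add D4, D5, D6, D7),    // first element
                                (add D5, D6, D7, D4)]>;  // second element
```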

Effectively, not counting overlaps, VFPv2 provides us with the following resources:

| Type | Number of registers |
|------|---------------------|
| `<2 x double>` | 6 |
| `<4 x float>` | 12 |

## Operations

LLVM IR does not have explicit operations for all the instructions above, but it will be relatively easy to pattern-match them when possible. For instance, if we are happy with floating-point contraction (i.e. we don’t care about all the roundings that IEEE 754 might require), a `vmla.<ty>` instruction can be used when the program wants to compute `a + (b * c)`.
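
For instance, with contraction allowed (expressed below with the `contract` fast-math flag), a pattern like this one could be selected into a single `vmla.f64`:

```llvm
%mul = fmul contract double %b, %c
%acc = fadd contract double %a, %mul    ; candidate for vmla.f64
```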

The contentious part, however, is that we need to make sure the field `len` of the `fpscr` register is correctly set.

The following LLVM IR adds two vectors of type `<2 x double>`. The operands of the operation are represented by the values `%va` and `%vb`, and the result by `%vc`.
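
```llvm
%vc = fadd <2 x double> %va, %vb
```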

Note, however, that VFP also supports scalar operations, and LLVM already lowers scalar IR to those instructions. So a piece of IR like the following one
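
```llvm
%c = fadd double %a, %b
```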

may be lowered into the following instruction
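
```asm
vadd.f64 d0, d1, d2    @ d0 ← d1 + d2; register choice is illustrative
```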

This is correct because the state of the program is assumed to be such that the `len` field is always 0 (i.e. the vector length is always 1).

However, because we want to use vectors, we are about to break this assumption. This means that our program will need to ensure that the `len` field has the right value before executing the corresponding instruction: scalar or vector.

We will see that maintaining this invariant when generating code is going to give us some headaches. This sort of design genuinely complicates compilers: the compiler now needs to introduce computation that sets the CPU state (in our case the `len` field) as needed, while at the same time doing so as efficiently as possible (i.e. avoiding resetting the CPU state more often than necessary).

VFP will help us a bit here because not all the operations care about the `len` field, so we will have a bit of leeway.

# Goal

Our first goal is to add enough functionality to LLVM so that the following hello world of vectorization:
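
```c
/* Element-wise addition of two arrays: the classic candidate loop.
   The exact function is illustrative; any element-wise loop over
   doubles would do. */
void vec_add(double *c, const double *a, const double *b, int n) {
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
```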

is emitted like this:
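
```asm
@ A sketch of the kind of code we are after; register allocation and
@ scheduling are illustrative, and we assume n > 0, a multiple of 2,
@ with c, a, b, n in r0-r3 per the AAPCS.
        vmrs    r12, fpscr
        orr     r12, r12, #0x00010000   @ len = 2 (field encodes length - 1,
                                        @ assumed 0 on entry: the usual invariant)
        vmsr    fpscr, r12
.loop:
        vldmia  r1!, {d4-d5}            @ load two doubles from a
        vldmia  r2!, {d6-d7}            @ load two doubles from b
        vadd.f64 d4, d4, d6             @ with len = 2 this also does d5 ← d5 + d7
        vstmia  r0!, {d4-d5}            @ store two doubles to c
        subs    r3, r3, #2
        bgt     .loop
        bic     r12, r12, #0x00070000   @ restore len = 1 for scalar code
        vmsr    fpscr, r12
```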

## Is any of this useful?

According to Wikipedia:

> The VFP architecture was intended to support execution of short “vector mode” instructions but these operated on each vector element sequentially and thus did not offer the performance of true single instruction, multiple data (SIMD) vector parallelism. This vector mode was therefore removed shortly after its introduction, to be replaced with the much more powerful Advanced SIMD, also known as Neon.

I don’t expect any performance improvement from using vector operations on the Raspberry Pi 1. In fact, the only metric we may plausibly improve is code size. It is going to be fun anyway.

And that should be it for today. The rest of the chapters will dig deeper into how we can hammer LLVM so it emits vector code for the Raspberry Pi 1.