In chapter 13 we saw the basic elements of VFPv2, the floating point subarchitecture of ARMv6. In this chapter we will implement a floating point matrix multiply using VFPv2.

Disclaimer: I advise you against using the code in this chapter in commercial-grade projects unless you fully review it for both correctness and precision.

## Matrix multiply

Given two vectors v and w of rank r, where v = <v0, v1, ..., vr-1> and w = <w0, w1, ..., wr-1>, we define the dot product of v and w as the scalar v · w = v0×w0 + v1×w1 + ... + vr-1×wr-1.

We can multiply a matrix `A` of `n` rows and `m` columns (`n` x `m`) by a matrix `B` of `m` rows and `p` columns (`m` x `p`). The result is a matrix of `n` rows and `p` columns. Matrix multiplication may seem complicated but actually it is not. Every element in the result matrix is just the dot product (defined in the paragraph above) of the corresponding row of the matrix `A` and the corresponding column of the matrix `B` (this is why there must be as many columns in `A` as there are rows in `B`). A straightforward implementation of the matrix multiplication in C is as follows.

In order to simplify the example, we will assume that both matrices A and B are square matrices of size `n x n`. This simplifies the algorithm a bit.
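The original C listing is not reproduced here, but a minimal sketch of the straightforward algorithm for the square case (assuming `n` = 4; the function name is illustrative) could be:

```c
enum { N = 4 };

/* Naive matrix multiply: every element of C is the dot product of a
   row of A and a column of B. */
void matmul(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0f;                   /* initialize the accumulator */
            for (int k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j]; /* accumulate the dot product */
            }
        }
    }
}
```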

Matrix multiplication is an important operation used in many areas. For instance, in computer graphics it is usually performed on 3x3 and 4x4 matrices representing 3D geometry. So we will try to make a reasonably fast version of it (we do not aim at making the best one, though).

A first improvement we want to make to this algorithm is making the loops perfectly nested, for technical reasons beyond the scope of this chapter. So we will move the initialization of `C[i][j]` to 0 out of the loop nest.
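A sketch of the same algorithm with the initialization hoisted out (zeroing `C` beforehand with `memset` is an assumption on how the zeroing is done):

```c
#include <string.h>

enum { N = 4 };

void matmul_nested(float A[N][N], float B[N][N], float C[N][N]) {
    memset(C, 0, sizeof(float) * N * N);      /* initialize C before the loops */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j]; /* the only statement in the nest */
}
```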

After this change, the interesting part of our algorithm, line 13, is inside a perfect nest of loops of depth 3.

### Accessing a matrix

It is relatively straightforward to access an array of just one dimension, like in `a[i]`. Just get `i`, multiply it by the size in bytes of each element of the array and then add the address of `a` (the base address of the array). So, the address of `a[i]` is just `a + ELEMENTSIZE*i`.

Things get a bit more complicated when our array has more than one dimension, like a matrix or a cube. Given an access like `a[i][j][k]` we have to compute which element is denoted by `[i][j][k]`. This depends on whether the language uses row-major or column-major order. We assume row-major order here (like the C language). So `[i][j][k]` must denote `k + j * NK + i * NK * NJ`, where `NK` and `NJ` are the number of elements in the last two dimensions. For instance, in a three-dimensional array of 3 x 4 x 5 elements, the element `[1][2][3]` is at position 3 + 2 * 5 + 1 * 5 * 4 = 33 (here `NK` = 5 and `NJ` = 4; note that `NI` = 3 but we do not need it at all). We assume that our language indexes arrays starting from 0 (like C). If the language allows a lower bound other than 0, we first have to subtract the lower bound to get the position.

We can compute the position in a slightly better way if we reorder the calculation. Instead of computing `k + j * NK + i * NK * NJ` we will do `k + NK * (j + NJ * i)`. This way the whole calculation is just a repeated set of steps of the form `x + Ni * y`, like in the example below.
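As an illustration, here is the 3 x 4 x 5 example above in C (the function name is made up):

```c
enum { NI = 3, NJ = 4, NK = 5 };

/* Position of a[i][j][k] in row-major order, computed as a repeated
   sequence of steps of the form x + Ni * y. */
int position(int i, int j, int k) {
    return k + NK * (j + NJ * i); /* same value as k + j*NK + i*NK*NJ */
}
```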

## Naive matrix multiply of 4x4 single-precision

As a first step, let's implement a naive matrix multiply that follows the C algorithm above to the letter.

That's a lot of code but it is not complicated. Unfortunately, computing the address of an array element takes a significant number of instructions. In our `naive_matmul_4x4` we have the three loops `i`, `j` and `k` of the C algorithm. We compute the address of `C[i][j]` in the loop `j` (there is no need to compute it every time in the loop `k`) in lines 52 to 63. The value contained in `C[i][j]` is then loaded into `s0`. In each iteration of loop `k` we load `A[i][k]` and `B[k][j]` into `s1` and `s2` respectively (lines 70 to 82). After the loop `k` ends, we can store `s0` back into the array position (kept in `r7`, line 90).
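The address computation that costs so many instructions can be pictured in C. For a 4x4 matrix of single-precision floats, the address of `C[i][j]` is the base address plus `4 * (4*i + j)` bytes (this sketch restates the arithmetic, not the actual assembly):

```c
/* Address of C[i][j]: base + ELEMENTSIZE * (i * 4 + j), with
   ELEMENTSIZE = 4 bytes for single-precision and rows of 4 elements. */
float *element_address(float *base, int i, int j) {
    return (float *)((char *)base + sizeof(float) * (i * 4 + j));
}
```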

In order to print the result matrix we have to pass 16 floating-point values to `printf`. Unfortunately, as stated in chapter 13, we have to convert them into double-precision before passing them. Note also that the first value, once converted, can be passed using registers `r2` and `r3`. All the remaining values must be passed on the stack, and do not forget that stack parameters must be passed in opposite order. This is why, once the first element of the C matrix has been loaded into `{r2,r3}` (lines 117 to 120), we advance `r4` 60 bytes, so it points to the last element of the matrix C. We load the single-precision value, convert it into double-precision, push it onto the stack and then move register `r4` backwards, to the previous element in the matrix (lines 128 to 137). Observe that we use `r6` as a marker of the stack, since we need to restore the stack once `printf` returns (lines 122 and 141). Of course we could avoid using `r6` and instead do `add sp, sp, #120`, since this is exactly the number of bytes we push onto the stack (15 double-precision values, each taking 8 bytes).
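This conversion mirrors what C itself mandates: a `float` passed through the `...` of a variadic function like `printf` undergoes default argument promotion to `double`. A small sketch (the helper name is made up):

```c
#include <stdio.h>

/* x is promoted to double at the call to snprintf, because it travels
   through the variadic part of the prototype -- the same conversion the
   assembly has to perform by hand. */
int format_float(char *buf, size_t size, float x) {
    return snprintf(buf, size, "%f", x);
}
```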

I have not chosen the values of the two matrices randomly. The second one is (approximately) the inverse of the first. This way we should get the identity matrix (a matrix of all zeros except for ones on the diagonal). Due to rounding issues the result matrix will not be exactly the identity, but it will be pretty close. Let's run the program.

## Vectorial approach

The algorithm we are trying to implement is fine, but it does not lend itself well to optimization. The problem lies in the way the loop `k` accesses the elements. The access `A[i][k]` is eligible for a multiple load, as `A[i][k]` and `A[i][k+1]` are contiguous in memory. This way we could avoid the loop `k` entirely and perform a single 4-element load of the whole row `A[i]`. The access `B[k][j]` does not allow that, since elements `B[k][j]` and `B[k+1][j]` have a full row in between. This is a strided access (the stride here is a full row of 4 elements, that is, 16 bytes). VFPv2 does not allow strided multiple loads, so we will have to load these elements one by one. Once we have all the elements of the loop `k` loaded, we can do a vector multiplication and a sum.

With this approach we can remove the loop `k` entirely, as we do the 4 operations at once. Note that we have to modify `fpscr` so that the field `len` is set to 4 (and restore it to 1 when leaving the function).
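The effect can be pictured in C like this (a scalar sketch of the data flow only; register banks, `len` and `fpscr` have no C equivalent):

```c
enum { N = 4 };

float row_by_column(float A[N][N], float B[N][N], int i, int j) {
    float a_row[N], b_col[N], prod[N];
    for (int k = 0; k < N; k++) a_row[k] = A[i][k]; /* one 4-element multiple load */
    for (int k = 0; k < N; k++) b_col[k] = B[k][j]; /* strided: loaded one by one */
    for (int k = 0; k < N; k++) prod[k] = a_row[k] * b_col[k]; /* vector multiply */
    return prod[0] + prod[1] + prod[2] + prod[3];   /* the final sums */
}
```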

### Fill the registers

In the previous version we were not exploiting all the registers of VFPv2. Each row takes 4 registers and so does each column, so we end up using only 8 registers, plus 4 for the result and one in bank 0 for the summation. We got rid of the loop `k` to process `C[i][j]` at once. What if we processed `C[i][j]` and `C[i][j+1]` at the same time? This way we can fill all 8 registers in each bank.

Note that because we now process `j` and `j + 1`, `r5` (`j`) is now increased by 2 at the end of the loop. This technique is usually known as loop unrolling, and it is always legal to do: we perform more than one iteration of the original loop in each iteration of the unrolled loop. The number of original iterations performed per unrolled iteration is the unroll factor. In this case, since the unroll factor (2) perfectly divides the number of iterations (4), we do not need an extra loop for the remainder iterations (the remainder loop runs at most one iteration less than the unroll factor).
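In C, the loop `j` unrolled by a factor of 2 would look roughly like this (a sketch; the real code keeps everything in VFPv2 registers):

```c
enum { N = 4 };

void matmul_unrolled(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += 2) {  /* two original iterations per pass */
            float s0 = 0.0f, s1 = 0.0f;
            for (int k = 0; k < N; k++) {
                s0 += A[i][k] * B[k][j];
                s1 += A[i][k] * B[k][j + 1];
            }
            C[i][j]     = s0;
            C[i][j + 1] = s1;
        }
    }
}
```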

As you can see, the accesses to `B[k][j]` and `B[k][j+1]` are starting to become tedious. Maybe we should change the matrix multiply algorithm a bit more.

## Reorder the accesses

Is there a way to mitigate the strided accesses to the matrix B? Yes, there is: we only have to permute the loop nest `i`, `j`, `k` into the loop nest `k`, `i`, `j`. Now you may be wondering if this is legal. Checking the legality of such transformations is beyond the scope of this chapter, so you will have to trust me here: this permutation is fine. What does this mean? Well, it means that our algorithm will now look like this.
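Since the permuted listing is not reproduced here, this is a C sketch of the loop nest `k`, `i`, `j` (with `C` zeroed beforehand):

```c
#include <string.h>

enum { N = 4 };

void matmul_kij(float A[N][N], float B[N][N], float C[N][N]) {
    memset(C, 0, sizeof(float) * N * N);
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)       /* j now walks a row of B and C */
                C[i][j] += A[i][k] * B[k][j];
}
```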

This may not seem very useful, but note that, since `k` is now in the outermost loop, it is easier to use vectorial instructions.

If you remember chapter 13, VFPv2 instructions have a mixed mode when the `Rsource2` register is in bank 0. This case is a perfect match: we can load `C[i][0..3]` and `B[k][0..3]` with multiple loads and then load `A[i][k]` into a register of bank 0. Then we can multiply `A[i][k] * B[k][0..3]` and add the result to `C[i][0..3]`. As a bonus, the number of instructions is much lower.

As an addition following a multiplication is a relatively common sequence, we can replace the multiply followed by the add with the single instruction `vmla` (multiply and add).

Now we can also unroll the loop `i`, again with an unrolling factor of 2. This would give us the best version.

## Comparing versions

Out of curiosity, I timed the different versions to see which one was fastest.

The benchmark consists of repeatedly calling the matrix multiplication function 2²¹ times (actually 2²¹ − 1 because of a typo, see the code) in order to magnify the differences. While the input should also be randomized for a better benchmark, this more or less models contexts where a matrix multiplication is performed many times (for instance, in graphics).

This is the skeleton of the benchmark.
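The actual skeleton is not reproduced here; a self-contained C stand-in (using the naive algorithm as the function under test, names illustrative) might be:

```c
#include <string.h>

enum { N = 4 };
enum { TIMES = (1 << 21) - 1 };  /* 2^21 - 1 calls, as mentioned above */

static void matmul(float A[N][N], float B[N][N], float C[N][N]) {
    memset(C, 0, sizeof(float) * N * N);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Repeatedly call the function under test to magnify timing differences. */
float run_benchmark(void) {
    float A[N][N] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
    float B[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    float C[N][N];
    for (int t = 0; t < TIMES; t++)
        matmul(A, B, C);
    return C[0][0];  /* keep the result observable */
}
```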

Here are the results. The one we named the best turned out to actually deserve that name.

| Version                        | Time (seconds) |
|--------------------------------|----------------|
| naive_matmul_4x4               | 6.41           |
| naive_vectorial_matmul_4x4     | 3.51           |
| naive_vectorial_matmul_2_4x4   | 2.87           |
| better_vectorial_matmul_4x4    | 2.59           |
| best_vectorial_matmul_4x4      | 1.51           |

That's all for today.