I think we have enough pieces of machinery working already that we can start with the most exciting part of this journey: autovectorisation!
If we look at the steps of computation required by an algorithm, we may find that often the precise order between some of the steps is not relevant. When this happens we say that those steps could run concurrently and the algorithm would still be correct. We can call concurrency the number of operations that can be executed concurrently. When the concurrency is related (or directly proportional) to the amount of data being processed by the algorithm, we say that the algorithm exposes data parallelism.
The following C program tells us to add the elements of two arrays, iterating from 0 to N-1, but nothing in it requires that order. For instance, we could iterate from N-1 down to 0 and the observable effect would be identical.
We can exploit data parallelism in many ways: we can distribute parts of the computation over different computers in a cluster, over different threads in a multicore system or, the approach we are interested in, over the different elements of a vector register of a CPU using SIMD instructions. The important assumption here is that all of the mentioned approaches can perform several computations simultaneously.
SIMD precisely represents this idea: Single Instruction Multiple Data. With a single CPU instruction we can process more than one element of data. We can obtain performance gains from using SIMD instructions if the CPU can execute them in a similar amount of time as their scalar counterparts. Whether that happens depends on the amount of resources the CPU devotes to them. The ARM1176JZF-S that powers the Raspberry Pi 1 does not devote extra resources, so vector instructions take proportionally longer and we will not improve performance by much. However, there are still some small gains here: each instruction executed comes with a (small) fixed cost, which we are now amortising over several elements.
Compilers may be able to identify, in some circumstances, that the source code is expressing a data-parallel computation. We will focus on loops, though it is possible to identify similar cases in straight-line code.
Once those cases are identified, the compiler may be able to implement the scalar computation using vector instructions. This process is called vectorisation.
Historically, automatic vectorisation has been a bit disappointing. Compilers must be very careful not to break the semantics of the program, and some programming languages, such as C and C++, require very complex analyses to determine the safety of the vectorisation. This process is also time consuming and it is not commonly enabled by default in mainstream compilers. So many interesting loops which potentially could be vectorised are not. Or worse, the programmer has to adapt the code so it ends up being vectorised.
Enable vectorisation in LLVM
Vectorisation is necessarily a target-specific transformation. LLVM IR is not platform neutral, but its genericity helps reuse code between architectures. Sometimes the LLVM IR optimisation passes need information from the backend to assess whether a transformation is profitable or not.
The loop vectoriser is not an exception to this, so before we can get it to vectorise simple codes, we need to teach the ARM backend about the new vector reality.
A first thing we need to do is to let LLVM know how many vector registers there are. We mentioned that in practice it is as if there were 6 of them.
A second thing we need to let LLVM know is the size of our vectors. Because we aim only for vectors that can hold either 4 floats or 2 doubles, and both cases amount to 128 bits, we will claim that size.
With all this we can try our loop above.
Uhm, this is not what we wanted, right? The reason is that, in general, the vectoriser will try not to make unsafe transformations. VFP instructions are not 100% compliant with IEEE 754, so the vectoriser will not use them by default.
We need to tell the compiler "it is OK, let's use instructions that are not 100% precise" by using -O2 -ffast-math or the shorter form -Ofast.
This is more interesting!
Note, however, that we have a gross inefficiency here: we are changing the vector length (setting it to 2) in every iteration of the loop. Later in this post we will evaluate whether it is worth hoisting it out of the loop.
A very simple benchmark
Let's use a simple benchmark that computes a vector addition of floats (similar to the code shown above in the post). It also has a mechanism to validate the results against the scalar version, and #pragma clang loop is used to explicitly disable vectorisation in the scalar loops that would otherwise be vectorised.
The benchmark can be adjusted in the number of times we run it and in the size of the arrays. This is useful for running it in different scenarios.
This benchmark has a low ratio of computation over memory accesses: we do one addition for every three memory accesses (two loads and one store), i.e. one floating-point operation per 12 bytes transferred. This means that the arithmetic intensity of this benchmark is small, so we may not be able to observe a lot of improvement with vector instructions.
We can study more favourable situations if we use smaller arrays. In this case, when we run the benchmark again, chances are that the vector will be in the cache already. While the arithmetic intensity hasn’t changed, in this situation the arithmetic computation has higher weight in the overall execution.
Let's look at two executions of the benchmark. The figure below shows the ratio of execution time of the vector loop with respect to the scalar loop (not vectorised at all). The plot on the left shows the results when the vectorised loop sets the vector length at each iteration, as emitted by the compiler. The plot on the right shows the results when the vector length change is hoisted out of the loop: that is, it is set only once before entering the vector body. I did this transformation by editing the assembly output.
In both cases an array of 4 floats does not perform very well because the loop has been vectorised using an interleave factor of 2, so each vector loop iteration wants to process 8 iterations of the original scalar loop. There is some overhead in setting the vector length back to 1 in the tail loop (the loop that processes the remaining elements that do not fill a vector), hence the bad performance. This is less noticeable on the right plot, where the vector length is set only once before entering the tail loop (not once per iteration, as happens on the left).
On the left plot we see that, up to 4096 float elements, the improvement over scalar is modest: around 2.5X the scalar code. I believe changing the vector length (which requires reading the fpscr) introduces some extra cost that limits the performance. On the right plot we see it improves up to ~3.4X. This means it is a good idea to hoist the vector length change out of the vector loop if possible. We will look into it in a later chapter.
Starting from 4096 elements, both plots show a similar performance degradation. The reason is that our working set no longer fits in the 16KB of L1 data cache: with arrays of 2048 float elements each array takes 8KB and, given that the benchmark uses two of them in the loop, our working set is 16KB. Beyond that we overflow the cache (the L2, afaict, is not used) and the performance drops to ~1.3X. The low arithmetic intensity of this benchmark means it quickly becomes a memory-bound problem rather than a CPU-bound one, and vectorisation is unlikely to help with memory-bound problems.
Too good to be true?
After thinking a bit more about the results, a doubt came to mind. In practice, this is as if we had unrolled the loop, only using hardware instructions. This is because the vector mode of the VFP does not bring any real performance benefit: the latency of the instructions scales with the vector length.
So I extended the benchmark to include an unrolled version of the vector addition, without vectorisation. I think the results speak for themselves.
So, as I already hypothesised in the first chapter of this series, the only benefit we may obtain from using these vector instructions is a code size improvement.
In the next chapter we will look into hoisting the vector length change out of loops that set it only once (which we expect to be the common case for vectorised loops).