I think we have enough pieces of machinery working already that we can start with the most exciting part of this journey: autovectorisation!

SIMD

Data parallelism

If we look at the steps of computation required by an algorithm, we may find that often the precise order between some of them is not relevant. When this happens we say that those steps could run concurrently and the algorithm would still be correct. We can call concurrency the number of operations that can be executed concurrently. When the concurrency is related (often directly proportional) to the amount of data being processed by the algorithm, we say that the algorithm exposes data parallelism.

The following C program tells us to add elements from two arrays going from 0 to N-1, but nothing in it requires that order. For instance, we could run from N-1 down to 0 and the observable effect would be identical.

simple_add.c
enum { N = 1024 };

float a[N];
float b[N];
float c[N];

void vector_sum(void) {
  for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
  }
}
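
Just to illustrate the point about the iteration order, this is the reversed version mentioned above, reusing the arrays declared in simple_add.c. It is only a sketch to show that the result does not depend on the order, not part of the benchmark.

/* Same computation, iterating backwards: no iteration depends on another,
   so the observable result is identical to the forward loop. */
void vector_sum_backwards(void) {
  for (int i = N - 1; i >= 0; i--) {
    c[i] = a[i] + b[i];
  }
}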

We can exploit data parallelism in many ways: we can distribute parts of the computation over different computers in a cluster, over different threads in a multicore or, the approach we are interested in here, over the different elements of a CPU vector using SIMD instructions. The important assumption is that all of these approaches can perform several computations simultaneously.

SIMD precisely represents this idea: Single Instruction, Multiple Data. With a single CPU instruction we can process more than one element of data. We can obtain performance gains from SIMD instructions if the CPU can execute them in a similar amount of time as their scalar counterparts. All of this depends on the resources the CPU devotes to them. The ARM1176JZF-S that powers the Raspberry Pi 1 does not devote extra resources: vector instructions take proportionally longer, so we should not expect large performance improvements. However, there are still some small gains to be had: each instruction executed comes with a (small) fixed cost, which we now pay once per vector rather than once per element.
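
As an illustration only (this is not how the code in this post is written), the vector extensions of Clang and GCC make the idea visible at the C level. The float4 typedef below is my own name for a 4-element vector, not anything the benchmark uses.

/* Hypothetical illustration using Clang/GCC vector extensions: float4 holds
   four floats, and the single + below expresses four float additions. */
typedef float float4 __attribute__((vector_size(16)));

float4 add4(float4 x, float4 y) {
  return x + y; /* one operation in the source, four additions in the data */
}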

Autovectorisation

Compilers may, under some circumstances, be able to identify that the source code is expressing a data-parallel computation. We will focus on loops, though it is possible to identify similar cases in straight-line code.

Once those cases are identified, the compiler may be able to implement the scalar computation using vector instructions. This process is called vectorisation.

Historically, automatic vectorisation has been a bit disappointing. Compilers must be very careful not to break the semantics of the program, and some programming languages, such as C and C++, require very complex analyses to determine whether vectorisation is safe. The process is also time consuming, so it is not always enabled by default in mainstream compilers. As a result, many interesting loops which could potentially be vectorised are not. Or worse, the programmer has to adapt the code so it ends up being vectorised.
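
A hypothetical example of such an adaptation: when the arrays are function parameters instead of globals, the compiler cannot assume they do not overlap, and the programmer may need to add restrict to convince it that vectorising is safe. The function below is illustrative, not part of the benchmark.

/* Without restrict the compiler must assume dst may alias src1 or src2, and
   will either skip vectorisation or emit runtime overlap checks. */
void vector_sum_ptr(float *restrict dst, const float *restrict src1,
                    const float *restrict src2, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = src1[i] + src2[i];
}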

Enable vectorisation in LLVM

Vectorisation is necessarily a target-specific transformation. LLVM IR is not platform neutral, but its genericity helps reuse code between architectures. Sometimes the LLVM IR optimisation passes need information from the backend to assess whether a transformation is profitable or not.

The loop vectoriser is not an exception to this, so before we can get it to vectorise simple codes, we need to teach the ARM backend about the new vector reality.

The first thing we need to do is let LLVM know how many vector registers there are. We mentioned earlier that, in practice, it is as if there were 6 of them.

llvm/lib/Target/ARM/ARMTargetTransformInfo.h
   unsigned getNumberOfRegisters(unsigned ClassID) const {
     bool Vector = (ClassID == 1);
     if (Vector) {
+      if (ST->hasVFP2Base())
+        return 6;
       if (ST->hasNEON())
         return 16;
       if (ST->hasMVEIntegerOps())
         return 8;
       return 0;
     }
 
     if (ST->isThumb1Only())
       return 8;
     return 13;
   }

The second thing we need to let LLVM know is the size of our vectors. Because we only aim for vectors that can hold either 4 floats or 2 doubles, and both cases amount to 128 bits, we will claim that size.

llvm/lib/Target/ARM/ARMTargetTransformInfo.h
   TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const {
     switch (K) {
     case TargetTransformInfo::RGK_Scalar:
       return TypeSize::getFixed(32);
     case TargetTransformInfo::RGK_FixedWidthVector:
+      if (ST->hasVFP2Base())
+        return TypeSize::getFixed(128);
       if (ST->hasNEON())
         return TypeSize::getFixed(128);
       if (ST->hasMVEIntegerOps())
         return TypeSize::getFixed(128);
       return TypeSize::getFixed(0);
     case TargetTransformInfo::RGK_ScalableVector:
       return TypeSize::getScalable(0);
     }
     llvm_unreachable("Unsupported register kind");
   }

With all this we can try our loop above.

$ clang --target=armv6-linux-gnueabihf -O2 -S -o- simple_add.c
vector_sum:
	.fnstart
@ %bb.0:
	.save	{r11, lr}
	push	{r11, lr}
	ldr	r12, .LCPI0_0
	ldr	lr, .LCPI0_1
	ldr	r3, .LCPI0_2
	mov	r0, #0
.LBB0_1:                                @ =>This Inner Loop Header: Depth=1
	add	r1, r12, r0
	add	r2, lr, r0
	vldr	s2, [r1]
	vldr	s0, [r2]
	add	r1, r3, r0
	add	r0, r0, #4
	cmp	r0, #4096
	vadd.f32	s0, s2, s0
	vstr	s0, [r1]
	bne	.LBB0_1
@ %bb.2:
	pop	{r11, pc}
	.p2align	2

Uhm, this is not what we wanted, right? The reason is that, in general, the vectoriser tries not to make unsafe transformations. VFP instructions are not 100% compliant with IEEE 754, so the vectoriser will not use them by default.

We need to tell the compiler “it is OK, let’s use instructions that are not 100% precise” by using -O2 -ffast-math, or the shorter form -Ofast.

$ clang --target=armv6-linux-gnueabihf -Ofast -S -o- simple_add.c
vector_sum:
	.fnstart
@ %bb.0:
	.save	{r4, lr}
	push	{r4, lr}
	.vsave	{d8, d9, d10, d11}
	vpush	{d8, d9, d10, d11}
	ldr	r12, .LCPI0_0
	ldr	lr, .LCPI0_1
	ldr	r3, .LCPI0_2
	mov	r0, #0
.LBB0_1:                                @ =>This Inner Loop Header: Depth=1
	add	r1, r12, r0
	add	r2, lr, r0
	vldmia	r1, {s16, s17, s18, s19, s20, s21, s22, s23}
	vldmia	r2, {s8, s9, s10, s11, s12, s13, s14, s15}
	vmrs	r2, fpscr
	mov	r1, #196608
	bic	r2, r2, #458752
	orr	r2, r2, r1
	vmsr	fpscr, r2
	vadd.f32	s12, s20, s12
	vadd.f32	s8, s16, s8
	add	r1, r3, r0
	add	r0, r0, #32
	add	r2, r1, #20
	cmp	r0, #4096
	vstmia	r2, {s13, s14, s15}
	vstmia	r1, {s8, s9, s10, s11, s12}
	bne	.LBB0_1
@ %bb.2:
	vmrs	r1, fpscr
	bic	r1, r1, #458752
	vmsr	fpscr, r1
	vpop	{d8, d9, d10, d11}
	pop	{r4, pc}
	.p2align	2

This is more interesting!

Note however that we have a gross inefficiency here: we are changing the vector length (setting it to 4) in every iteration of the loop. Later in this post we will evaluate whether it is worth hoisting it out of the loop.

A very simple benchmark

Let’s use this simple benchmark that computes a vector addition of floats (similar to the code shown above in the post). It also has a mechanism to validate that the scalar version and the vector version compute the same results. #pragma clang loop is used to explicitly disable vectorisation in the scalar loops that would otherwise be vectorised.
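
A sketch of what the scalar reference loop in the benchmark might look like; the names are illustrative and the actual benchmark source may differ, but the pragma is the standard Clang loop pragma.

/* Reference loop kept scalar on purpose so we can compare against the
   vectorised version: the pragma asks Clang not to vectorise or interleave. */
void vector_sum_scalar(float *dst, const float *src1, const float *src2, int n) {
#pragma clang loop vectorize(disable) interleave(disable)
  for (int i = 0; i < n; i++)
    dst[i] = src1[i] + src2[i];
}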

The benchmark can be adjusted in the number of repetitions and the size of the arrays, which is useful to run it in different scenarios.

This benchmark has a low ratio of computation to memory accesses: for each addition we do three memory accesses (two loads and one store), that is, one floating-point operation for every 12 bytes moved. This means that the arithmetic intensity of this benchmark is small, so we may not be able to observe a large improvement from vector instructions.

We can study more favourable situations if we use smaller arrays. In that case, when we run the benchmark again, chances are that the arrays will already be in the cache. While the arithmetic intensity has not changed, in this situation the arithmetic computation has a higher weight in the overall execution time.

Let’s look at two executions of the benchmark. The figure below shows the ratio of the execution time of the vector loop with respect to the scalar loop (not vectorised at all). The plot on the left shows the results when the vectorised loop sets the vector length at each iteration, as emitted by the compiler. The plot on the right shows the results when the vector length change is hoisted out of the loop: that is, it is set only once before entering the vector body. I did this transformation by editing the assembly output.

Plot with speedup of vector addition results
Very simple array addition benchmark. The plot on the left contains the results for the program as emitted by the compiler with vectorisation enabled: it sets the vector length at each iteration of the vectorised loop. The plot on the right contains the results for a manually modified assembly output that sets the vector length right before entering the vectorised loop. The benchmark runs 256 times and it was run 50 times for each array length (from 4 to 131072).

In both cases an array of 4 floats does not perform very well because the loop has been vectorised with an interleave factor of 2, so each vector loop iteration wants to process 8 iterations of the original scalar loop. There is some overhead in setting the vector length back to 1 for the tail loop (the loop that processes the remaining elements that do not fill a whole vector), hence the bad performance. This is less noticeable on the right plot, as there the vector length is only set once before entering the vectorised loop (not once per iteration, as happens on the left).

On the left plot we see that, up to 4096 float elements, the improvement over scalar is modest: around 2.5X the scalar code. I believe changing the vector length (which requires reading the fpscr) introduces some extra cost that limits the performance. On the right plot we see it improves up to ~3.4X. This means it is a good idea to hoist the vector length change out of the vector loop if possible. We will look into that in a later chapter.

Starting from 4096 elements, both plots show a similar performance degradation. The reason is that our working set no longer fits in the 16KB of L1 data cache. Note that with arrays of 2048 float elements each array takes 8KB; given that the benchmark uses two of them in the loop, the working set is 16KB. Beyond that we overflow the cache (the L2, as far as I can tell, is not used) and the speedup drops to ~1.3X. The low arithmetic intensity of this benchmark means it quickly becomes a memory-bound problem rather than a CPU-bound one, and vectorisation is unlikely to help with memory-bound problems.

Too good to be true?

After thinking a bit more about the results, a doubt came to mind. In practice, what we have done is like unrolling the loop, only encoded in hardware instructions. This is because the vector mode of the VFP does not bring any real performance benefit of its own: the latency of the instructions scales with the vector length.

So I extended the benchmark to include an unrolled version of the vector addition, without vectorisation. I think the results speak for themselves.
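
For reference, the unrolled variant I added is along these lines. This is a sketch: the exact unroll factor, names and tail handling in the benchmark may differ.

/* Manually unrolled by 4, still scalar: the pragma asks the loop vectoriser
   to leave it alone so we only measure the effect of unrolling. A complete
   version also needs a tail loop when n is not a multiple of 4. */
void vector_sum_unrolled(float *dst, const float *src1, const float *src2, int n) {
#pragma clang loop vectorize(disable) interleave(disable)
  for (int i = 0; i < n; i += 4) {
    dst[i + 0] = src1[i + 0] + src2[i + 0];
    dst[i + 1] = src1[i + 1] + src2[i + 1];
    dst[i + 2] = src1[i + 2] + src2[i + 2];
    dst[i + 3] = src1[i + 3] + src2[i + 3];
  }
}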

Plot with speedup of vector addition results including an unrolled version
Unrolling is almost as competitive in performance as vectorising in the naive way. We can get a bit of an edge if we hoist the vector length change, but this advantage quickly fades away.

So, as I already hypothesised in the first chapter of this series, the only benefit we may obtain from using vector instructions is an improvement in code size.

In the next chapter we will look into hoisting the vector length change out of loops that only need to set it once (which we expect to be the common case for vectorised loops).