Think In Geek

In geek we trust

# ARM assembler in Raspberry Pi – Chapter 25

In chapter 13 we saw VFPv2 and the fact that it allows vectorial operations on floating-point numbers. You may be wondering if a similar feature exists for integers. The answer is yes, although in a more limited way.

## SIMD

SIMD stands for single instruction, multiple data: one instruction performs the same operation on several operands at the same time. In chapters 13 and 14 we saw that, by changing the `len` field in the `fpscr` and using at least one operand in the vectorial banks, an instruction operates on `len` registers of the vectorial bank(s), effectively performing `len` floating-point operations. This way, a single instruction like `vadd.f32` can perform up to 8 floating-point additions. This strategy of speeding up computation is also called data parallelism.

### SIMD with integers

SIMD support for integers also exists in ARMv6 but it is more limited: the multiple data are the subwords (see chapter 21) of a general purpose register. This means that we can do 2 operations on the 2 half words of a general purpose register. Similarly, we can do up to 4 operations on the 4 bytes of a general purpose register.

## Motivating example

At this point you may be wondering what the purpose of this feature is and why it exists. Let’s assume we have two 16-bit PCM audio signals sampled at some frequency (e.g. 44.1 kHz, as in CD audio). This means that at the time of recording the “analog sound” of each channel is sampled many times per second and each sample, which represents the amplitude of the signal, is encoded as a 16-bit number.

An operation we may want to do is mixing the two signals into one signal (e.g. prior to playing that final signal through the speakers). A (slightly incorrect) way to do this is by averaging the two signals. The code below is a schema of what we want to do.

```
short int channel1[num_samples]; // in our environment a 'short int' is a half-word
short int channel2[num_samples];

short int channel_out[num_samples];
int i;
for (i = 0; i < num_samples; i++)
{
   channel_out[i] = (channel1[i] + channel2[i]) / 2;
}
```

Now imagine we want to implement this in ARMv6. With our current knowledge the code would look like this (I will omit in these examples the AAPCS function call convention).

```
naive_channel_mixing:
    /* r0 contains the base address of channel1 */
    /* r1 contains the base address of channel2 */
    /* r2 contains the base address of channel_out */
    /* r3 is the number of samples */
    /* r4 is the number of the current sample
       so it holds that 0 ≤ r4 < r3 */

    mov r4, #0              /* r4 ← 0 */
    b .Lcheck_loop          /* branch to check_loop */
  .Lloop:
    mov r5, r4, LSL #1      /* r5 ← r4 << 1 (this is r5 ← r4 * 2) */
                            /* a halfword takes two bytes, so multiply
                               the index by two. We do this here because
                               ldrsh does not allow an addressing mode
                               like [r0, r5, LSL #1] */
    ldrsh r6, [r0, r5]      /* r6 ← *{signed half}(r0 + r5) */
    ldrsh r7, [r1, r5]      /* r7 ← *{signed half}(r1 + r5) */
    add r8, r6, r7          /* r8 ← r6 + r7 */
    mov r8, r8, ASR #1      /* r8 ← r8 >> 1 (this is r8 ← r8 / 2) */
    strh r8, [r2, r5]       /* *{half}(r2 + r5) ← r8 */
    add r4, r4, #1          /* r4 ← r4 + 1 */
  .Lcheck_loop:
    cmp r4, r3              /* compute r4 - r3 and update cpsr */
    blt .Lloop              /* if r4 < r3 jump to the
                               beginning of the loop */
```

We could probably be happy with this code, but if you were in the business of designing processors for embedded devices you would pay close attention to the kind of code your customers run. And chances are that your portable MP3 player (or any gadget able to play music) is “ARM inside”. So this is code that is eligible for improvement from an architecture point of view.

## Parallel additions and subtractions

ARMv6 data parallel instructions allow us to add/subtract the corresponding half words or bytes. It provides them both for unsigned integers and signed integers.

• Halfwords
  • Signed: `sadd16`, `ssub16`
  • Unsigned: `uadd16`, `usub16`
• Bytes
  • Signed: `sadd8`, `ssub8`
  • Unsigned: `uadd8`, `usub8`
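To make the lane-wise behaviour concrete, here is a minimal sketch in C of what `uadd8` computes on a 32-bit register. The function name `model_uadd8` is hypothetical and only meant for illustration; the real instruction also updates the GE flags, which this model ignores.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the ARM toolchain): a portable C model
   of the result computed by uadd8. Each of the four byte lanes is added
   independently and wraps around modulo 256, so a carry out of one lane
   never reaches the next one. The GE flags are not modelled. */
uint32_t model_uadd8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;   /* byte of a in this lane */
        uint32_t y = (b >> (8 * lane)) & 0xFFu;   /* byte of b in this lane */
        uint32_t sum = (x + y) & 0xFFu;           /* keep only the low 8 bits */
        result |= sum << (8 * lane);
    }
    return result;
}
```

For instance, `model_uadd8(0x00FF00FF, 0x00010001)` yields `0`, whereas a plain 32-bit addition of the same values yields `0x01000100` because the carries cross the byte boundaries.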

It should not be hard to find obvious uses for these instructions. For instance, the following loop can benefit from the `uadd8` instruction.

```// unsigned char is an unsigned byte in our environment // a, b and c are arrays of N unsigned chars unsigned char a[N], b[N], c[N];   int i; for (i = 0; i < N; i++) { c[i] = a[i] + b[i]; }```

Let’s first write a naive approach to the above loop, which is similar to the one in the beginning of the post.

```
naive_byte_array_addition:
    /* r0 contains the base address of a */
    /* r1 contains the base address of b */
    /* r2 contains the base address of c */
    /* r3 is N */
    /* r4 is the number of the current item
       so it holds that 0 ≤ r4 < r3 */

    mov r4, #0             /* r4 ← 0 */
    b .Lcheck_loop0        /* branch to check_loop0 */

  .Lloop0:
    ldrb r5, [r0, r4]      /* r5 ← *{unsigned byte}(r0 + r4) */
    ldrb r6, [r1, r4]      /* r6 ← *{unsigned byte}(r1 + r4) */
    add r7, r5, r6         /* r7 ← r5 + r6 */
    strb r7, [r2, r4]      /* *{unsigned byte}(r2 + r4) ← r7 */
    add r4, r4, #1         /* r4 ← r4 + 1 */
  .Lcheck_loop0:
    cmp r4, r3             /* perform r4 - r3 and update cpsr */
    blt .Lloop0            /* if cpsr means that r4 < r3 jump to loop0 */
```

This loop again is fine but we can do better by using the instruction `uadd8`. Note that now we will be able to add 4 bytes at a time. This means that we will have to increment `r4` by 4.

```
simd_byte_array_addition_0:
    /* r0 contains the base address of a */
    /* r1 contains the base address of b */
    /* r2 contains the base address of c */
    /* r3 is N */
    /* r4 is the number of the current item
       so it holds that 0 ≤ r4 < r3 */

    mov r4, #0             /* r4 ← 0 */
    b .Lcheck_loop1        /* branch to check_loop1 */

  .Lloop1:
    ldr r5, [r0, r4]       /* r5 ← *(r0 + r4) */
    ldr r6, [r1, r4]       /* r6 ← *(r1 + r4) */
    uadd8 r7, r5, r6       /* r7[7:0] ← r5[7:0] + r6[7:0] */
                           /* r7[15:8] ← r5[15:8] + r6[15:8] */
                           /* r7[23:16] ← r5[23:16] + r6[23:16] */
                           /* r7[31:24] ← r5[31:24] + r6[31:24] */
                           /* rA[x:y] means bits x to y of the register rA */
    str r7, [r2, r4]       /* *(r2 + r4) ← r7 */
    add r4, r4, #4         /* r4 ← r4 + 4 */
  .Lcheck_loop1:
    cmp r4, r3             /* perform r4 - r3 and update cpsr */
    blt .Lloop1            /* if cpsr means that r4 < r3 jump to loop1 */
```

A subtlety of the above code is that it only works if `N` (kept in `r3`) is a multiple of 4. If it is not the case (and this includes when 0 < r3 < 4), then the loop will not process exactly `N` bytes. If we know that `N` is a multiple of 4, then nothing else must be done. But if it may not be a multiple of 4, we will need what is called an epilog loop for the remaining elements. Note that in our case, the epilog loop will have to do 0 (if `N` was a multiple of 4), 1, 2 or 3 iterations. We can implement it as a switch with 4 cases plus fall-through (see chapter 16) or, if we are concerned about code size, with a loop. We will use a loop.

We cannot, though, simply append an epilog loop to the above loop, because it is actually doing more work than we want. When `N` is not a multiple of four, the last iteration will add 1, 2 or 3 more bytes that do not belong to the original array. This is a recipe for disaster so we have to avoid it. We need to make sure that when we are in the loop, `r4` is such that `r4`, `r4 + 1`, `r4 + 2` and `r4 + 3` are valid elements of the array. This means that we should check that `r4 < N`, `r4 + 1 < N`, `r4 + 2 < N` and `r4 + 3 < N`. Since the last of these four implies the first three, it is enough to check that `r4 + 3 < N`.

Note that checking `r4 + 3 < N` would force us to compute `r4 + 3` at every iteration of the loop, but we do not have to. Checking `r4 + 3 < N` is equivalent to checking `r4 < N - 3`. `N - 3` does not depend on `r4` so it can be computed before the loop. Since `blt` performs a signed comparison, a negative `N - 3` (when `N` is less than 3) simply means the loop body never runs.

```
simd_byte_array_addition_2:
    /* r0 contains the base address of a */
    /* r1 contains the base address of b */
    /* r2 contains the base address of c */
    /* r3 is N */
    /* r4 is the number of the current item
       so it holds that 0 ≤ r4 < r3 */

    mov r4, #0             /* r4 ← 0 */
    sub r8, r3, #3         /* r8 ← r3 - 3
                              this is r8 ← N - 3 */
    b .Lcheck_loop2        /* branch to check_loop2 */

  .Lloop2:
    ldr r5, [r0, r4]       /* r5 ← *(r0 + r4) */
    ldr r6, [r1, r4]       /* r6 ← *(r1 + r4) */
    uadd8 r7, r5, r6       /* r7[7:0] ← r5[7:0] + r6[7:0] */
                           /* r7[15:8] ← r5[15:8] + r6[15:8] */
                           /* r7[23:16] ← r5[23:16] + r6[23:16] */
                           /* r7[31:24] ← r5[31:24] + r6[31:24] */
    str r7, [r2, r4]       /* *(r2 + r4) ← r7 */
    add r4, r4, #4         /* r4 ← r4 + 4 */
  .Lcheck_loop2:
    cmp r4, r8             /* perform r4 - r8 and update cpsr */
    blt .Lloop2            /* if cpsr means that r4 < r8 jump to loop2 */
                           /* i.e. if r4 < N - 3 jump to loop2 */
```

The `sub` instruction before the loop computes `r8`, which keeps `N - 3`. The `cmp r4, r8` in the loop check then uses it to decide whether another full group of 4 bytes remains.

The epilog loop follows.

```
    /* epilog loop */
    b .Lcheck_loop3        /* branch to check_loop3 */

  .Lloop3:
    ldrb r5, [r0, r4]      /* r5 ← *{unsigned byte}(r0 + r4) */
    ldrb r6, [r1, r4]      /* r6 ← *{unsigned byte}(r1 + r4) */
    add r7, r5, r6         /* r7 ← r5 + r6 */
    strb r7, [r2, r4]      /* *{unsigned byte}(r2 + r4) ← r7 */

    add r4, r4, #1         /* r4 ← r4 + 1 */
  .Lcheck_loop3:
    cmp r4, r3             /* perform r4 - r3 and update cpsr */
    blt .Lloop3            /* if cpsr means that r4 < r3 jump to loop3 */
```

The epilog loop is like the naive one, but it will only run 0, 1, 2 or 3 iterations. This means that for big enough values of `N`, in practice all iterations will use the data parallel instructions and only up to 3 will have to use the slower approach.
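The same structure (a main loop guarded by `N - 3` followed by an epilog loop) can be sketched in C. This is only an illustration under some assumptions: the function name `add_bytes` is made up, `n` is a signed `int` to mirror the signed comparison done by `blt`, and the four byte additions in the main loop stand in for the single `uadd8` of the assembler version.

```c
/* Hypothetical C rendering of the main loop plus epilog loop.
   'n' is signed on purpose: when n < 3, 'limit' is negative and the
   main loop does not run at all, as with the signed blt. */
void add_bytes(const unsigned char *a, const unsigned char *b,
               unsigned char *c, int n)
{
    int i = 0;
    int limit = n - 3;   /* computed once before the loop, like r8 */

    /* Main loop: i, i + 1, i + 2 and i + 3 are all valid indices here. */
    for (; i < limit; i += 4) {
        /* These four additions stand in for ldr, ldr, uadd8, str. */
        c[i]     = (unsigned char)(a[i]     + b[i]);
        c[i + 1] = (unsigned char)(a[i + 1] + b[i + 1]);
        c[i + 2] = (unsigned char)(a[i + 2] + b[i + 2]);
        c[i + 3] = (unsigned char)(a[i + 3] + b[i + 3]);
    }

    /* Epilog loop: runs 0, 1, 2 or 3 times. */
    for (; i < n; i++) {
        c[i] = (unsigned char)(a[i] + b[i]);
    }
}
```

With `n` equal to 10, for example, the main loop handles indices 0 to 7 in two iterations and the epilog handles indices 8 and 9, so nothing past the end of the arrays is read or written.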

## Halving instructions

The data parallel instructions also come in a form where the addition/subtraction is halved. This means that it is possible to compute averages of half words and bytes easily.

• Halfwords
  • Signed: `shadd16`, `shsub16`
  • Unsigned: `uhadd16`, `uhsub16`
• Bytes
  • Signed: `shadd8`, `shsub8`
  • Unsigned: `uhadd8`, `uhsub8`
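As a rough illustration of the halving forms, the following C sketch models the result of `shadd16`. The name `model_shadd16` is hypothetical. The sum of each pair of halfwords is computed with enough precision that it cannot overflow, and then halved with an arithmetic shift, which rounds towards negative infinity. Note that the C operator `/ 2` rounds towards zero instead, so the two differ by one for odd negative sums.

```c
#include <stdint.h>

/* Hypothetical helper: a portable C model of the result of shadd16.
   Each 16-bit lane is sign-extended, the two lanes are added without
   overflow, and the sum is halved rounding towards negative infinity
   (which is what an arithmetic shift right by one does). */
uint32_t model_shadd16(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 2; lane++) {
        uint32_t ua = (a >> (16 * lane)) & 0xFFFFu;
        uint32_t ub = (b >> (16 * lane)) & 0xFFFFu;
        /* Portable sign extension of a 16-bit value to 32 bits. */
        int32_t x = (int32_t)(ua ^ 0x8000u) - 0x8000;
        int32_t y = (int32_t)(ub ^ 0x8000u) - 0x8000;
        int32_t sum = x + y;                      /* in [-65536, 65534] */
        /* floor(sum / 2) without shifting a negative number:
           bias the sum so it is non-negative, halve, remove the bias. */
        int32_t half = (sum + 65536) / 2 - 32768;
        result |= ((uint32_t)half & 0xFFFFu) << (16 * lane);
    }
    return result;
}
```

For example, with the halfwords 100 and 300 in the low lane and -4 and -2 in the high lane, `model_shadd16(0xFFFC0064, 0xFFFE012C)` gives `0xFFFD00C8`, that is, 200 in the low lane and -3 in the high lane.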

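Thus, the motivating example of the beginning of the post can be implemented using the `shadd16` instruction. For simplicity, let's assume that `num_samples` is a multiple of 2 (now we are dealing with halfwords) so no epilog is necessary. Since `r4` counts samples and each sample takes two bytes, the loads and the store scale the index with `LSL #1`, which `ldr` and `str` (unlike `ldrsh`) allow.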

```
better_channel_mixing:
    /* r0 contains the base address of channel1 */
    /* r1 contains the base address of channel2 */
    /* r2 contains the base address of channel_out */
    /* r3 is the number of samples */
    /* r4 is the number of the current sample
       so it holds that 0 ≤ r4 < r3 */

    mov r4, #0                  /* r4 ← 0 */
    b .Lcheck_loop              /* branch to check_loop */
  .Lloop:
    ldr r6, [r0, r4, LSL #1]    /* r6 ← *(r0 + r4 * 2) */
    ldr r7, [r1, r4, LSL #1]    /* r7 ← *(r1 + r4 * 2) */
    shadd16 r8, r6, r7          /* r8[15:0] ← (r6[15:0] + r7[15:0]) >> 1 */
                                /* r8[31:16] ← (r6[31:16] + r7[31:16]) >> 1 */
    str r8, [r2, r4, LSL #1]    /* *(r2 + r4 * 2) ← r8 */
    add r4, r4, #2              /* r4 ← r4 + 2 (two samples per iteration) */
  .Lcheck_loop:
    cmp r4, r3                  /* compute r4 - r3 and update cpsr */
    blt .Lloop                  /* if r4 < r3 jump to the
                                   beginning of the loop */
```

## Saturating arithmetic

Let's go back to our motivating example. We averaged the two 16-bit channels to mix them but, in reality, mixing is achieved by just adding the two channels. In general this is OK because the signals are not correlated and the amplitude of a mixed sample can usually be encoded in 16 bits. Sometimes, though, the mixed sample may have an amplitude that falls outside the 16-bit range. In this case we want to clip the sample to the representable range. A sample with a too positive amplitude will be clipped to 2^15 - 1, a sample with a too negative amplitude will be clipped to -2^15.

Without hardware support, clipping can be implemented by checking for overflow after each addition. So, every addition should check that the resulting number is in the interval [-32768, 32767].

Let's write a function that adds two 32-bit integers and clips the result to the 16-bit range.

```
.data
max16bit: .word 32767

.text

clipped_add16bit:
    /* first operand is in r0 */
    /* second operand is in r1 */
    /* result is left in r0 */
    push {r4, lr}             /* keep registers */

    ldr r4, addr_of_max16bit  /* r4 ← &max16bit */
    ldr r4, [r4]              /* r4 ← *r4 */
                              /* now r4 == 32767 (i.e. 2^15 - 1) */

    add r0, r0, r1            /* r0 ← r0 + r1 */
    cmp r0, r4                /* perform r0 - r4 and update cpsr */
    movgt r0, r4              /* if r0 > r4 then r0 ← r4 */
    bgt end                   /* if r0 > r4 then branch to end */

    mvn r4, r4                /* r4 ← ~r4
                                 now r4 == -32768 (i.e. -2^15) */
    cmp r0, r4                /* perform r0 - r4 and update cpsr */
    movlt r0, r4              /* if r0 < r4 then r0 ← r4 */

  end:

    pop {r4, lr}              /* restore registers */
    bx lr                     /* return */
addr_of_max16bit: .word max16bit
```

As you can see, a seemingly simple addition that clips the result requires a bunch of instructions. As before, the code is correct but we can do much better thanks to the saturating arithmetic instructions of ARMv6.

• Halfwords
  • Signed: `qadd16`, `qsub16`
  • Unsigned: `uqadd16`, `uqsub16`
• Bytes
  • Signed: `qadd8`, `qsub8`
  • Unsigned: `uqadd8`, `uqsub8`
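To illustrate what saturation means for each lane, here is a small C sketch of the result of `qadd16`. The names `clip16` and `model_qadd16` are hypothetical and exist only for this example. Each pair of signed halfwords is added and the sum is clipped to the interval [-32768, 32767] instead of wrapping around.

```c
#include <stdint.h>

/* Hypothetical helper: clip a value to the signed 16-bit range. */
static int32_t clip16(int32_t v)
{
    if (v > 32767)
        return 32767;     /* too positive: 2^15 - 1 */
    if (v < -32768)
        return -32768;    /* too negative: -2^15 */
    return v;
}

/* Hypothetical helper: a portable C model of the result of qadd16.
   Each 16-bit lane is treated as a signed number and the two lanes
   are added with saturation instead of wrap-around. */
uint32_t model_qadd16(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 2; lane++) {
        uint32_t ua = (a >> (16 * lane)) & 0xFFFFu;
        uint32_t ub = (b >> (16 * lane)) & 0xFFFFu;
        /* Portable sign extension of a 16-bit value to 32 bits. */
        int32_t x = (int32_t)(ua ^ 0x8000u) - 0x8000;
        int32_t y = (int32_t)(ub ^ 0x8000u) - 0x8000;
        int32_t sum = clip16(x + y);
        result |= ((uint32_t)sum & 0xFFFFu) << (16 * lane);
    }
    return result;
}
```

For example, adding 30000 and 10000 in the low lane and -30000 and -10000 in the high lane, `model_qadd16(0x8AD07530, 0xD8F02710)` gives `0x80007FFF`: the low lane saturates to 32767 and the high lane to -32768.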

Now we can write a more realistic mixing of two channels.

```
more_realistic_channel_mixing:
    /* r0 contains the base address of channel1 */
    /* r1 contains the base address of channel2 */
    /* r2 contains the base address of channel_out */
    /* r3 is the number of samples */
    /* r4 is the number of the current sample
       so it holds that 0 ≤ r4 < r3 */

    mov r4, #0                  /* r4 ← 0 */
    b .Lcheck_loop              /* branch to check_loop */
  .Lloop:
    ldr r6, [r0, r4, LSL #1]    /* r6 ← *(r0 + r4 * 2) */
    ldr r7, [r1, r4, LSL #1]    /* r7 ← *(r1 + r4 * 2) */
    qadd16 r8, r6, r7           /* r8[15:0] ← saturated_sum_16(r6[15:0], r7[15:0]) */
                                /* r8[31:16] ← saturated_sum_16(r6[31:16], r7[31:16]) */
    str r8, [r2, r4, LSL #1]    /* *(r2 + r4 * 2) ← r8 */
    add r4, r4, #2              /* r4 ← r4 + 2 (two samples per iteration) */
  .Lcheck_loop:
    cmp r4, r3                  /* compute r4 - r3 and update cpsr */
    blt .Lloop                  /* if r4 < r3 jump to the
                                   beginning of the loop */
```

That's all for today.

### 8 thoughts on “ARM assembler in Raspberry Pi – Chapter 25”

• Very interesting topic! I do have trouble seeing in your clipping example why if r4 == 32767 the instruction mvn r4,r4 /* r4 <- ~r4 */ gives r4 == -32768 (the largest negative number) and not just -32767.

• rferrer says:

Hi William,

this happens because the `mvn` instruction performs a bitwise `not` on its operand prior to the move (the ~ symbol used in the comment is the C unary operator that does a bitwise `not` of its operand).

In two’s complement, the bitwise `not` of a number is the negative number minus one. This follows from the usual algorithm to change the sign of a number in two’s complement: you first do a bitwise `not` and then `add` 1.

Kind regards,

• Ryan Salvador says:

Hi Roger,

I’ve been following along the tutorials. I’m now on chapter 12 and I notice that there has really been a big improvement in my programming. I was just wondering if there’s such a thing as system calls in ARM like if I would like to read and write to files using assembly and if so, then is there any chance that we are going to see a tutorial on that in the future?

• Roger Ferrer Ibáñez says:

Hi Ryan,

yes, there is such a mechanism. Chapter 19 is about system calls of the operating system.

That said it is rather cumbersome to call them directly, so most of the time it is easier to use the functions offered by the C library (they act as wrappers to the system calls).

Kind regards,

• Filippo says:

Hi Roger, thanks for the tutorials.
In the first example (naive_channel_mixing) should it be:
mov r8, r8, ASR #1
instead of:
mov r8, r8, LSR #1?

• Roger Ferrer Ibáñez says:

Hi Filippo,

yes thanks for spotting this. We are operating with signed integers.

I have updated the post.

Kind regards,
Roger

• Veselin says:

Hi Roger,
Thank you very very much for this wonderful guide! It helped me a lot to understand the ARM asm!
I have one request – besides Rasp Pi there is a lot of other boards on the market these days – Odroid, Orange Pi etc. They use armv7, armv8 which includes NEON SIMD.
Please, could you provide any examples how to program and compile such code? There is only partial information on the net, as:
https://www.raspberrypi.org/forums/viewtopic.php?f=33&t=174848
https://people.xiph.org/~tterribe/daala/neon_tutorial.pdf
https://bugs.chromium.org/p/chromium/issues/detail?id=67954

Please! I’ll be very grateful!
Thank you in advance:
Veselin

• Roger Ferrer Ibáñez says:

Hi Veselin,

I cannot promise anything but I may explore this in another post.

Kind regards,
Roger
