ARM assembler in Raspberry Pi

In chapter 13 we saw VFPv2 and the fact that it allows vectorial operations on floating-point numbers. You may be wondering if such a similar feature exists for integers. The answer is yes although in a more limited way.

SIMD

SIMD stands for single instruction multiple data and means that an instruction can be used to perform the same operation on several operands at the same time. In chapter 13 and 14 we saw that by changing the len field in the fpscr and using at least one operand in the vectorial banks, then an instruction operated on len registers in the vectorial bank(s), effectively doing len times a floating point operation. This way, a single instruction like vadd.f32 could then be used to perform up to 8 floating point additions. This strategy of speeding up computation is also called data parallelism.

SIMD with integers

SIMD support for integers exists also in ARMv6 but it is more limited: the multiple data are the subwords (see chapter 21) of a general purpose register. This means that we can do 2 operations on the 2 half words of a general purpose register. Similarly, we can do and up to 4 operations on the 4 bytes of a general purpose register.

Motivating example

At this point you may be wondering what is the purpose of this feature and why it does exist. Let's assume we have two 16-bit PCM audio signals sampled at some frequency (i.e. 44.1kHz like in a CD Audio). This means that at the time of recording the "analog sound" of each channel is sampled many times per second and the sample, which represents the amplitude of the signal, is encoded using a 16-bit number.

An operation we may want to do is mixing the two signals in one signal (e.g. prior playing that final signal through the speakers). A (slightly incorrect) way to do this is by averaging the two signals. The code belows is a schema of what we want to do.

short int channel1[num_samples]; // in our environment a 'short int' is a half-word
short int channel2[num_samples];

short int channel_out[num_samples];
for (i = 0; i < num_samples; i++)
{
   channel_out[i] = (channel1[i] + channel2[i]) / 2;
}

Now imagine we want to implement this in ARMv6. With our current knowledge the code would look like this (I will omit in these examples the AAPCS function call convention).

naive_channel_mixing:
    /* r0 contains the base address of channel1 */
    /* r1 contains the base address of channel2 */
    /* r2 contains the base address of channel_out */
    /* r3 is the number of samples */
    /* r4 is the number of the current sample
          so it holds that 0 ≤ r4 < r3 */
    
    mov r4, #0              /* r4 ← 0 */
    b .Lcheck_loop          /* branch to check_loop */
    .Lloop:
      mov r5, r4, LSL #1    /* r5 ← r4 << 1 (this is r5 ← r4 * 2) */
                            /* a halfword takes two bytes, so multiply
                               the index by two. We do this here because
                               ldrsh does not allow an addressing mode
                               like [r0, r5, LSL #1] */
      ldrsh r6, [r0, r5]    /* r6 ← *{signed half}(r0 + r5) */
      ldrsh r7, [r1, r5]    /* r7 ← *{signed half}(r1 + r5) */
      add r8, r6, r7        /* r8 ← r6 + r7 */
      mov r8, r8, ASR #1    /* r8 ← r8 >> 1 (this is r8 ← r8 / 2)*/
      strh r8, [r2, r5]     /* *{half}(r2 + r5) ← r8 */
      add r4, r4, #1        /* r4 ← r4 + 1 */
    .Lcheck_loop: 
      cmp r4, r3            /* compute r4 - r3 and update cpsr */
      blt .Lloop            /* if r4 < r3 jump to the
                               beginning of the loop */

We could probably be happy with this code but if you were in the business of designing processors for embedded devices you would probably be sensitive to your customer codes. And chances are that your portable MP3 player (or any gadget able to play music) is "ARM inside". So this is a code that is eligible for improvement from an architecture point of view.

Parallel additions and subtractions

ARMv6 data parallel instructions allow us to add/subtract the corresponding half words or bytes. It provides them both for unsigned integers and signed integers.

Halfwords

Signed: sadd16, ssub16
Unsigned: uadd16, usub16

Bytes

Signed: sadd8, ssub8
Unsigned: uadd8, usub8

It should not be hard to find obvious uses for these instructions. For instance, the following loop can benefit from the uadd8 instruction.

// unsigned char is an unsigned byte in our environment
// a, b and c are arrays of N unsigned chars
unsigned char a[N], b[N], c[N];

int i;
for (i = 0; i < N; i++)
{
  c[i] = a[i] + b[i];
}

Let's first write a naive approach to the above loop, which is similar to the one in the beginning of the post.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
naive_byte_array_addition:
    /* r0 contains the base address of a */
    /* r1 contains the base address of b */
    /* r2 contains the base address of c */
    /* r3 is N */
    /* r4 is the number of the current item
          so it holds that 0 ≤ r4 < r3 */

    mov r4, #0             /* r4 ← 0 */
    b .Lcheck_loop0        /* branch to check_loop0 */

    .Lloop0:
      ldrb r5, [r0, r4]    /* r5 ← *{unsigned byte}(r0 + r4) */
      ldrb r6, [r1, r4]    /* r6 ← *{unsigned byte}(r1 + r4) */
      add r7, r5, r6       /* r7 ← r5 + r6 */
      strb r7, [r2, r4]    /* *{unsigned byte}(r2 + r4) ← r7 */
      add r4, r4, #1       /* r4 ← r4 + 1 */
    .Lcheck_loop0:
       cmp r4, r3          /* perform r4 - r3 and update cpsr */
       blt .Lloop0         /* if cpsr means that r4 < r3 jump to loop0 */

This loop again is fine but we can do better by using the instruction uadd8. Note that now we will be able to add 4 bytes at a time. This means that we will have to increment r4 by 4.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
simd_byte_array_addition_0:
    /* r0 contains the base address of a */
    /* r1 contains the base address of b */
    /* r2 contains the base address of c */
    /* r3 is N */
    /* r4 is the number of the current item
          so it holds that 0 ≤ r4 < r3 */

    mov r4, #0             /* r4 ← 0 */
    b .Lcheck_loop1        /* branch to check_loop1 */

    .Lloop1:
      ldr r5, [r0, r4]     /* r5 ← *(r0 + r4) */
      ldr r6, [r1, r4]     /* r6 ← *(r1 + r4) */
      sadd8 r7, r5, r6     /* r7[7:0] ← r5[7:0] + r6[7:0] */
                           /* r7[15:8] ← r5[15:8] + r6[15:8] */
                           /* r7[23:16] ← r5[23:16] + r6[23:16] */
                           /* r7[31:24] ← r5[31:24] + r6[31:24] */
                           /* rA[x:y] means bits x to y of the register rA */
      str r7, [r2, r4]     /* *(r2 + r4) ← r7 */
      add r4, r4, #4       /* r4 ← r4 + 4 */
    .Lcheck_loop1:
       cmp r4, r3          /* perform r4 - r3 and update cpsr */
       blt .Lloop1         /* if cpsr means that r4 < r3 jump to loop1 */

A subtlety of the above code is that it only works if N (kept in r3) is a multiple of 4. If it is not the case (and this includes when 0 ≤ r3 < 4), then the loop will do fewer iterations than expected. If we know that N is a multiple of 4, then nothing else must be done. But if it may be not a multiple of 4, we will need what is called an epilog loop, for the remaining cases. Note that in our case, the epilog loop will have to do 0 (if N was a multiple of 4), 1, 2 or 3 iterations. We can implement it as a switch with 4 cases plus fall-through (see chapter 16) or if we are concerned about code size, with a loop. We will use a loop.

We cannot, though, simply append an epilog loop to the above loop,because it is actually doing more work than we want. When N is not a multiple of four, the last iteration will add 1, 2 or 3 more bytes that do not belong to the original array. This is a recipe for a disaster so we have to avoid this. We need to make sure that when we are in the loop, r4 is such that r4, r4 + 1, r4 + 2 and r4 + 3 are valid elements of the array. This means that we should check that r4 < N, r4 + 1 < N, r4 + 2 < N and r4 + 3 < N. Since the last of these four implies the first three, it is enough to check that r4 + 3 < N.

Note that checking r4 + 3 < N would force us to compute r4 + 3 at every iteration in the loop, but we do not have to. Checking r4 + 3 < N is equivalent to check r4 < N - 3. N - 3 does not depend on r4 so it can be computed before the loop.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
simd_byte_array_addition_2:
    /* r0 contains the base address of a */
    /* r1 contains the base address of b */
    /* r2 contains the base address of c */
    /* r3 is N */
    /* r4 is the number of the current item
          so it holds that 0 ≤ r4 < r3 */

    mov r4, #0             /* r4 ← 0 */
    sub r8, r3, #3         /* r8 ← r3 - 3
                              this is r8 ← N - 3 */
    b .Lcheck_loop2        /* branch to check_loop2 */

    .Lloop2:
      ldr r5, [r0, r4]     /* r5 ← *(r0 + r4) */
      ldr r6, [r1, r4]     /* r6 ← *(r1 + r4) */
      sadd8 r7, r5, r6     /* r7[7:0] ← r5[7:0] + r6[7:0] */
                           /* r7[15:8] ← r5[15:8] + r6[15:8] */
                           /* r7[23:16] ← r5[23:16] + r6[23:16] */
                           /* r7[31:24] ← r5[31:24] + r6[31:24] */
      str r7, [r2, r4]     /* *(r2 + r4) ← r7 */
      add r4, r4, #4       /* r4 ← r4 + 4 */
    .Lcheck_loop2:
       cmp r4, r8          /* perform r4 - r8 and update cpsr */
       blt .Lloop2         /* if cpsr means that r4 < r8 jump to loop2 */
                           /* i.e. if r4 < N - 3 jump to loop2 */

In line 10 where we compute r8 which will keep N - 3, we use it in line 24 to check the loop iteration.

The epilog loop follows.

27
28
29
30
31
32
33
34
35
36
37
38
39
     /* epilog loop */
     b .Lcheck_loop3       /* branch to check_loop3 */

     .Lloop3:
        ldrb r5, [r0, r4]  /* r5 ← *{unsigned byte}(r0 + r4) */
        ldrb r6, [r1, r4]  /* r6 ← *{unsigned byte}(r1 + r4) */
        add r7, r5, r6     /* r7 ← r5 + r6 */
        strb r7, [r2, r4]  /* *{unsigned byte}(r2 + r4) ← r7 */

        add r4, r4, #1     /* r4 ← r4 + 1 */
     .Lcheck_loop3:
        cmp r4, r3         /* perform r4 - r3 and update cpsr */
        blt .Lloop3        /* if cpsr means that r4 < r3 jump to loop 3 */

The epilog loop is like the naive one, but it will only run 0, 1, 2 or 3 iterations. This means that for big enough values of N, in practice all iterations will use the data parallel instructions and only up to 3 will have to use the slower approach.

Halving instructions

The data parallel instructions also come in a form where the addition/subtraction is halved. This means that it is possible to compute averages of half words and bytes easily.

Halfwords

Signed: shadd16, shsub16
Unsigned: uhadd16, uhsub16

Bytes

Signed: shadd8, shsub8
Unsigned: uhadd8, uhsub8

Thus, the motivating example of the beginning of the post can be implemented using the shsub16 instruction. For simplicity, let's assume that num_samples is a multiple of 2 (now we are dealing with halfwords) so no epilog is necessary.

better_channel_mixing:
    /* r0 contains the base address of channel1 */
    /* r1 contains the base address of channel2 */
    /* r2 contains the base address of channel_out */
    /* r3 is the number of samples */
    /* r4 is the number of the current sample
          so it holds that 0 ≤ r4 < r3 */

    mov r4, #0              /* r4 ← 0 */
    b .Lcheck_loop          /* branch to check_loop */
    .Lloop:
      ldr r6, [r0, r4]      /* r6 ← *(r0 + r4) */
      ldr r7, [r1, r4]      /* r7 ← *(r1 + r4) */
      shadd16 r8, r6, r7    /* r8[15:0] ← (r6[15:0] + r7[15:0]) >> 1*/
                            /* r8[31:16] ← (r6[31:16] + r7[31:16]) >> 1*/
      str r8, [r2, r4]      /* *(r2 + r4) ← r8 */
      add r4, r4, #2        /* r4 ← r4 + 2 */
    .Lcheck_loop:
      cmp r4, r3            /* compute r4 - r3 and update cpsr */
      blt .Lloop            /* if r4 < r3 jump to the
                               beginning of the loop */

Saturating arithmetic

Let's go back to our motivating example. We averaged the two 16-bit channels to mix them but, in reality, mixing is achieved by just adding the two channels. In general this is OK because signals are not correlated and the amplitude of a mixed sample usually can be encoded in 16-bit. Sometimes, though, the mixed sample may have an amplitude that falls outside the 16-bit range. In this case we want to clip the sample within the representable range. A sample with a too positive amplitude will be clipped to 2¹⁵-1, a sample with a too negative amplitude will be clipped to -2¹⁵.

With lack of hardware support, clipping can be implemented by checking overflow after each addition. So, every addition should check that the resulting number is in the interval [-32768, 32767] Let's write a function that adds two 32-bit integers and clips them in the 16-bit range.

.data
max16bit: .word 32767

.text

clipped_add16bit:
    /* first operand is in r0 */
    /* second operand is in r0 */
    /* result is left in r0 */
    push {r4, lr}             /* keep registers */

    ldr r4, addr_of_max16bit  /* r4 ← &max16bit */
    ldr r4, [r4]              /* r4 ← *r4 */
                              /* now r4 == 32767 (i.e. 2^15 - 1) */

    add r0, r0, r1            /* r0 ← r0 + r1 */
    cmp r0, r4                /* perform r0 - r4 and update cpsr */
    movgt r0, r4              /* if r0 > r4 then r0 ← r4 */
    bgt end                   /* if r0 > r4 then branch to end */

    mvn r4, r4                /* r4 ← ~r4
                                 now r4 == -32768 (i.e. -2^15) */
    cmp r0, r4                /* perform r0 - r4 and update cpsr */
    movlt r0, r4              /* if r0 < r4 then r0 ← r4 */

    end:

    pop {r4, lr}              /* restore registers */
    bx lr                     /* return */
addr_of_max16bit: .word max16bit

As you can see, a seemingly simple addition that clips the result requires a bunch of instructions. As before, the code is correct but we can do much better thanks to the saturated arithmetics instructions of ARMv6.

Halfwords

Signed: qadd16, qsub16
Unsigned: uqadd16, uqsub16

Bytes

Signed: qadd8, qsub8
Unsigned: uqadd8, uqsub8

Now we can write a more realistic mixing of two channels.

more_realistic_channel_mixing:
    /* r0 contains the base address of channel1 */
    /* r1 contains the base address of channel2 */
    /* r2 contains the base address of channel_out */
    /* r3 is the number of samples */
    /* r4 is the number of the current sample
          so it holds that 0 ≤ r4 < r3 */

    mov r4, #0              /* r4 ← 0 */
    b .Lcheck_loop          /* branch to check_loop */
    .Lloop:
      ldr r6, [r0, r4]      /* r6 ← *(r0 + r4) */
      ldr r7, [r1, r4]      /* r7 ← *(r1 + r4) */
      qadd16 r8, r6, r7     /* r8[15:0] ← saturated_sum_16(r6[15:0], r7[15:0]) */
                            /* r8[31:16] ← saturated_sum_16(r6[31:16], r7[31:16]) */
      str r8, [r2, r4]      /* *(r2 + r4) ← r8 */
      add r4, r4, #2        /* r4 ← r4 + 2 */
    .Lcheck_loop:
      cmp r4, r3            /* compute r4 - r3 and update cpsr */
      blt .Lloop            /* if r4 < r3 jump to the
                               beginning of the loop */

That's all for today.