ARM assembler in Raspberry Pi – Chapter 25
In chapter 13 we saw VFPv2 and the fact that it allows vectorial operations on floating-point numbers. You may be wondering if such a similar feature exists for integers. The answer is yes although in a more limited way.
SIMD
SIMD stands for single instruction multiple data and means that an instruction can be used to perform the same operation on several operands at the same time. In chapter 13 and 14 we saw that by changing the len
field in the fpscr
and using at least one operand in the vectorial banks, then an instruction operated on len
registers in the vectorial bank(s), effectively doing len
times a floating point operation. This way, a single instruction like vadd.f32
could then be used to perform up to 8 floating point additions. This strategy of speeding up computation is also called data parallelism.
SIMD with integers
SIMD support for integers exists also in ARMv6 but it is more limited: the multiple data are the subwords (see chapter 21) of a general purpose register. This means that we can do 2 operations on the 2 half words of a general purpose register. Similarly, we can do and up to 4 operations on the 4 bytes of a general purpose register.
Motivating example
At this point you may be wondering what is the purpose of this feature and why it does exist. Let's assume we have two 16-bit PCM audio signals sampled at some frequency (i.e. 44.1kHz like in a CD Audio). This means that at the time of recording the "analog sound" of each channel is sampled many times per second and the sample, which represents the amplitude of the signal, is encoded using a 16-bit number.
An operation we may want to do is mixing the two signals in one signal (e.g. prior playing that final signal through the speakers). A (slightly incorrect) way to do this is by averaging the two signals. The code belows is a schema of what we want to do.
Now imagine we want to implement this in ARMv6. With our current knowledge the code would look like this (I will omit in these examples the AAPCS function call convention).
We could probably be happy with this code but if you were in the business of designing processors for embedded devices you would probably be sensitive to your customer codes. And chances are that your portable MP3 player (or any gadget able to play music) is "ARM inside". So this is a code that is eligible for improvement from an architecture point of view.
Parallel additions and subtractions
ARMv6 data parallel instructions allow us to add/subtract the corresponding half words or bytes. It provides them both for unsigned integers and signed integers.
- Halfwords
- Signed:
sadd16
,ssub16
- Unsigned:
uadd16
,usub16
- Bytes
- Signed:
sadd8
,ssub8
- Unsigned:
uadd8
,usub8
It should not be hard to find obvious uses for these instructions. For instance, the following loop can benefit from the uadd8
instruction.
Let's first write a naive approach to the above loop, which is similar to the one in the beginning of the post.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
naive_byte_array_addition:
/* r0 contains the base address of a */
/* r1 contains the base address of b */
/* r2 contains the base address of c */
/* r3 is N */
/* r4 is the number of the current item
so it holds that 0 ≤ r4 < r3 */
mov r4, #0 /* r4 ← 0 */
b .Lcheck_loop0 /* branch to check_loop0 */
.Lloop0:
ldrb r5, [r0, r4] /* r5 ← *{unsigned byte}(r0 + r4) */
ldrb r6, [r1, r4] /* r6 ← *{unsigned byte}(r1 + r4) */
add r7, r5, r6 /* r7 ← r5 + r6 */
strb r7, [r2, r4] /* *{unsigned byte}(r2 + r4) ← r7 */
add r4, r4, #1 /* r4 ← r4 + 1 */
.Lcheck_loop0:
cmp r4, r3 /* perform r4 - r3 and update cpsr */
blt .Lloop0 /* if cpsr means that r4 < r3 jump to loop0 */
This loop again is fine but we can do better by using the instruction uadd8
. Note that now we will be able to add 4 bytes at a time. This means that we will have to increment r4
by 4.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
simd_byte_array_addition_0:
/* r0 contains the base address of a */
/* r1 contains the base address of b */
/* r2 contains the base address of c */
/* r3 is N */
/* r4 is the number of the current item
so it holds that 0 ≤ r4 < r3 */
mov r4, #0 /* r4 ← 0 */
b .Lcheck_loop1 /* branch to check_loop1 */
.Lloop1:
ldr r5, [r0, r4] /* r5 ← *(r0 + r4) */
ldr r6, [r1, r4] /* r6 ← *(r1 + r4) */
sadd8 r7, r5, r6 /* r7[7:0] ← r5[7:0] + r6[7:0] */
/* r7[15:8] ← r5[15:8] + r6[15:8] */
/* r7[23:16] ← r5[23:16] + r6[23:16] */
/* r7[31:24] ← r5[31:24] + r6[31:24] */
/* rA[x:y] means bits x to y of the register rA */
str r7, [r2, r4] /* *(r2 + r4) ← r7 */
add r4, r4, #4 /* r4 ← r4 + 4 */
.Lcheck_loop1:
cmp r4, r3 /* perform r4 - r3 and update cpsr */
blt .Lloop1 /* if cpsr means that r4 < r3 jump to loop1 */
A subtlety of the above code is that it only works if N
(kept in r3
) is a multiple of 4. If it is not the case (and this includes when 0 ≤ r3 < 4), then the loop will do fewer iterations than expected. If we know that N
is a multiple of 4, then nothing else must be done. But if it may be not a multiple of 4, we will need what is called an epilog loop, for the remaining cases. Note that in our case, the epilog loop will have to do 0 (if N was a multiple of 4), 1, 2 or 3 iterations. We can implement it as a switch with 4 cases plus fall-through (see chapter 16) or if we are concerned about code size, with a loop. We will use a loop.
We cannot, though, simply append an epilog loop to the above loop,because it is actually doing more work than we want. When N is not a multiple of four, the last iteration will add 1, 2 or 3 more bytes that do not belong to the original array. This is a recipe for a disaster so we have to avoid this. We need to make sure that when we are in the loop, r4
is such that r4
, r4 + 1
, r4 + 2
and r4 + 3
are valid elements of the array. This means that we should check that r4 < N
, r4 + 1 < N
, r4 + 2 < N
and r4 + 3 < N
. Since the last of these four implies the first three, it is enough to check that r4 + 3 < N
.
Note that checking r4 + 3 < N
would force us to compute r4 + 3
at every iteration in the loop, but we do not have to. Checking r4 + 3 < N
is equivalent to check r4 < N - 3
. N - 3
does not depend on r4
so it can be computed before the loop.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
simd_byte_array_addition_2:
/* r0 contains the base address of a */
/* r1 contains the base address of b */
/* r2 contains the base address of c */
/* r3 is N */
/* r4 is the number of the current item
so it holds that 0 ≤ r4 < r3 */
mov r4, #0 /* r4 ← 0 */
sub r8, r3, #3 /* r8 ← r3 - 3
this is r8 ← N - 3 */
b .Lcheck_loop2 /* branch to check_loop2 */
.Lloop2:
ldr r5, [r0, r4] /* r5 ← *(r0 + r4) */
ldr r6, [r1, r4] /* r6 ← *(r1 + r4) */
sadd8 r7, r5, r6 /* r7[7:0] ← r5[7:0] + r6[7:0] */
/* r7[15:8] ← r5[15:8] + r6[15:8] */
/* r7[23:16] ← r5[23:16] + r6[23:16] */
/* r7[31:24] ← r5[31:24] + r6[31:24] */
str r7, [r2, r4] /* *(r2 + r4) ← r7 */
add r4, r4, #4 /* r4 ← r4 + 4 */
.Lcheck_loop2:
cmp r4, r8 /* perform r4 - r8 and update cpsr */
blt .Lloop2 /* if cpsr means that r4 < r8 jump to loop2 */
/* i.e. if r4 < N - 3 jump to loop2 */
In line 10 where we compute r8
which will keep N - 3
, we use it in line 24 to check the loop iteration.
The epilog loop follows.
27
28
29
30
31
32
33
34
35
36
37
38
39
/* epilog loop */
b .Lcheck_loop3 /* branch to check_loop3 */
.Lloop3:
ldrb r5, [r0, r4] /* r5 ← *{unsigned byte}(r0 + r4) */
ldrb r6, [r1, r4] /* r6 ← *{unsigned byte}(r1 + r4) */
add r7, r5, r6 /* r7 ← r5 + r6 */
strb r7, [r2, r4] /* *{unsigned byte}(r2 + r4) ← r7 */
add r4, r4, #1 /* r4 ← r4 + 1 */
.Lcheck_loop3:
cmp r4, r3 /* perform r4 - r3 and update cpsr */
blt .Lloop3 /* if cpsr means that r4 < r3 jump to loop 3 */
The epilog loop is like the naive one, but it will only run 0, 1, 2 or 3 iterations. This means that for big enough values of N
, in practice all iterations will use the data parallel instructions and only up to 3 will have to use the slower approach.
Halving instructions
The data parallel instructions also come in a form where the addition/subtraction is halved. This means that it is possible to compute averages of half words and bytes easily.
- Halfwords
- Signed:
shadd16
,shsub16
- Unsigned:
uhadd16
,uhsub16
- Bytes
- Signed:
shadd8
,shsub8
- Unsigned:
uhadd8
,uhsub8
Thus, the motivating example of the beginning of the post can be implemented using the shsub16
instruction. For simplicity, let's assume that num_samples
is a multiple of 2 (now we are dealing with halfwords) so no epilog is necessary.
Saturating arithmetic
Let's go back to our motivating example. We averaged the two 16-bit channels to mix them but, in reality, mixing is achieved by just adding the two channels. In general this is OK because signals are not correlated and the amplitude of a mixed sample usually can be encoded in 16-bit. Sometimes, though, the mixed sample may have an amplitude that falls outside the 16-bit range. In this case we want to clip the sample within the representable range. A sample with a too positive amplitude will be clipped to 215-1, a sample with a too negative amplitude will be clipped to -215.
With lack of hardware support, clipping can be implemented by checking overflow after each addition. So, every addition should check that the resulting number is in the interval [-32768, 32767] Let's write a function that adds two 32-bit integers and clips them in the 16-bit range.
As you can see, a seemingly simple addition that clips the result requires a bunch of instructions. As before, the code is correct but we can do much better thanks to the saturated arithmetics instructions of ARMv6.
- Halfwords
- Signed:
qadd16
,qsub16
- Unsigned:
uqadd16
,uqsub16
- Bytes
- Signed:
qadd8
,qsub8
- Unsigned:
uqadd8
,uqsub8
Now we can write a more realistic mixing of two channels.
That's all for today.