# ARM assembler in Raspberry Pi – Chapter 12

We saw in chapter 6 some simple schemes to implement usual structured programming constructs like if-then-else and loops. In this chapter we will revisit these constructs and exploit a feature of the ARM instruction set that we have not learnt yet.

## Playing with loops

The most generic form of loop is this one.

There are also two special forms, which are actually particular incarnations of the one shown above but are interesting as well.

Some languages, like Pascal, have constructs like this one.

but this is like a `do S while (!E)`

.

We can manipulate loops to get a form that may be more convenient. For instance.

The last manipulation is interesting, because we can avoid the `if-then`

if we directly go to the `while`

part.

In valid C, the above transformation would be written as follows.

Which looks much uglier than abusing a bit C syntax.

## The -s suffix

So far, when checking the condition of an `if`

or `while`

, we have evaluated the condition and then used the `cmp`

intruction to update `cpsr`

. The update of the `cpsr`

is mandatory for our conditional codes, no matter if we use branching or predication. But `cmp`

is not the only way to update `cpsr`

. In fact many instructions can update it.

By default an instruction does not update `cpsr`

unless we append the suffix `-s`

. So instead of the instruction `add`

or `sub`

we write `adds`

or `subs`

. The result of the instruction (what would be stored in the destination register) is used to update `cpsr`

.

How can we use this? Well, consider this simple loop counting backwards.

If we replace `sub`

by `subs`

then `cpsr`

will be updated with the result of the subtration. This means that the flags N, Z, C and V will be updated, so we can use a branch right after `subs`

. In our case we want to jump back to loop only if `i >= 0`

, this is when the result is non-negative. We can use `bpl`

to achieve this.

It is a bit tricky to get these things right (this is why we use compilers). For instance this similar, but not identical, loop would use `bne`

instead of `bpl`

. Here the condition is `ne`

(not equal). It would be nice to have an alias like `nz`

(not zero) but, unfortunately, this does not exist in ARM.

A rule of thumb where we may want to apply the use of the -s suffix is in codes in the following form.

where `@`

means any comparison respect 0 (equals, different, lower, etc.).

## Operating 64-bit numbers

As an example of using the suffix -s we will implement three 64-bit integer operations in ARM: addition, subtraction and multiplication. Remember that ARM is a 32-bit architecture, so everything is 32-bit minded. If we only use 32-bit numbers, this is not a problem, but if for some reason we need 64-bit numbers things get a bit more complicated. We will represent a 64-bit number as two 32-bit numbers, the lower and higher part. This way a 64-bit number n represented using two 32-bit parts, n_{lower} and n_{higher} will have the value n = 2^{32} × n_{higher} + n_{lower}

We will, obviously, need to kep the 32-bit somewhere. When keeping them in registers, we will use two consecutive registers (e.g. r1 and r2, that we will write it as `{r1,r2}`

) and we will keep the higher part in the higher numbered register. When keeping a 64-bit number in memory, we will store in two consecutive addresses the two parts, being the lower one in the lower address. The address will be 8-byte aligned.

### Addition

Adding two 64-bit numbers using 32-bit operands means adding first the lower part and then adding the higher parts but taking into account a possible carry from the lower part. With our current knowledge we could write something like this (assume the first number is in `{r2,r3}`

, the second in `{r4,r5}`

and the result will be in `{r0,r1}`

).

This would work. Fortunately ARM provides an instructions `adc`

which adds two numbers and the carry flag. So we could rewrite the above code with just two instructions.

### Subtraction

Subtracting two numbers is similar to adding them. In ARM when subtracting two numbers using `subs`

, if we need to borrow (because the second operand is larger than the first) then C will be disabled (C will be 0). If we do not need to borrow, C will be enabled (C will be 1). This is a bit surprising but consistent with the remainder of the architecture (check in chapter 5 conditions CS/HS and CC/LO). Similar to `adc`

there is a `sbc`

which performs a normal subtraction if C is 1. Otherwise it subtracts one more element. Again, this is consistent on how C works in the `subs`

instruction.

### Multiplication

Multiplying two 64-bit numbers is a tricky thing. When we multiply two N-bit numbers the result may need up to 2*N-bits. So when multiplying two 64-bit numbers we may need a 128-bit number. For the sake of simplicity we will assume that this does not happen and 64-bit will be enough. Our 64-bit numbers are two 32-bit integers, so a 64-bit x is actually x = 2^{32} × x_{1} + x_{0}, where x_{1} and x_{0} are two 32-bit numbers. Similarly another 64-bit number y would be y = 2^{32} × y_{1} + y_{0}. Multiplying x and y yields z where z = 2^{64} × x_{1} × y_{1} + 2^{32} × (x_{0} × y_{1} + x_{1} × y_{0}) + x_{0} × y_{0}. Well, now our problem is multiplying each x_{i} by y_{i}, but again we may need 64-bit to represent the value.

ARM provides a bunch of different instructions for multiplication. Today we will see just three of them. If we are multiplying 32-bits and we do not care about the result not fitting in a 32-bit number we can use `mul Rd, Rsource1, Rsource2`

. Unfortunately it does not set any flag in the `cpsr`

useful for detecting an overflow of the multiplication (i.e. when the result does not fit in the 32-bit range). This instruction is the fastest one of the three. If we do want the 64-bit resulting from the multiplication, we have two other instructions `smull`

and `umull`

. The former is used when we multiply to numbers in two's complement, the latter when we represent unsigned values. Their syntax is `{s,u}mull RdestLower, RdestHigher, Rsource1, Rsource2`

. The lower part of the 64-bit result is kept in the register `RdestLower`

and the higher part in he register `RdestHigher`

.

In this example we have to use `umull`

otherwise the 32-bit lower parts might end being interpreted as negative numbers, giving negative intermediate values. That said, we can now multiply x_{0} and y_{0}. Recall that we have the two 64-bit numbers in `r2,r3`

and `r4,r5`

pairs of registers. So first multiply `r2`

and `r4`

. Note the usage of `r0`

since this will be its final value. In contrast, register `r6`

will be used later.

Now let's multiply x_{0} by y_{1} and x_{1} by y_{0}. This is `r3`

by `r4`

and `r2`

by `r5`

. Note how we overwrite `r4`

and `r5`

in the second multiplication. This is fine since we will not need them anymore.

There is no need to make the multiplication of x_{1} by y_{1} because if it gives a nonzero value, it will always overflow a 64-bit number. This means that if both `r3`

and `r5`

were nonzero, the multiplication will never fit a 64-bit. This is a suficient condition, but not a necessary one. The number might overflow when adding the intermediate values that will result in `r1`

.

Let's package this code in a nice function in a program to see if it works. We will multiply numbers 12345678901 (this is 2×2^{32} + 3755744309) and 12345678 and print the result.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90

/* -- mult64.s */
.data
.align 4
message : .asciz "Multiplication of %lld by %lld is %lld\n"
.align 8
number_a_low: .word 3755744309
number_a_high: .word 2
.align 8
number_b_low: .word 12345678
number_b_high: .word 0
.text
/* Note: This is not the most efficient way to doa 64-bit multiplication.
This is for illustration purposes */
mult64:
/* The argument will be passed in r0, r1 and r2, r3 and returned in r0, r1 */
/* Keep the registers that we are going to write */
push {r4, r5, r6, r7, r8, lr}
/* For covenience, move {r0,r1} into {r4,r5} */
mov r4, r0 /* r0 ← r4 */
mov r5, r1 /* r5 ← r1 */
umull r0, r6, r2, r4 /* {r0,r6} ← r2 * r4 */
umull r7, r8, r3, r4 /* {r7,r8} ← r3 * r4 */
umull r4, r5, r2, r5 /* {r4,r5} ← r2 * r5 */
adds r2, r7, r4 /* r2 ← r7 + r4 and update cpsr */
adc r1, r2, r6 /* r1 ← r2 + r6 + C */
/* Restore registers */
pop {r4, r5, r6, r7, r8, lr}
bx lr /* Leave mult64 */
.global main
main:
push {r4, r5, r6, r7, r8, lr} /* Keep the registers we are going to modify */
/* r8 is not actually used here, but this way
the stack is already 8-byte aligned */
/* Load the numbers from memory */
/* {r4,r5} ← a */
ldr r4, addr_number_a_low /* r4 ← &a_low */
ldr r4, [r4] /* r4 ← *r4 */
ldr r5, addr_number_a_high /* r5 ← &a_high */
ldr r5, [r5] /* r5 ← *r5 */
/* {r6,r7} ← b */
ldr r6, addr_number_b_low /* r6 ← &b_low */
ldr r6, [r6] /* r6 ← *r6 */
ldr r7, addr_number_b_high /* r7 ← &b_high */
ldr r7, [r7] /* r7 ← *r7 */
/* Now prepare the call to mult64
/*
The first number is passed in
registers {r0,r1} and the second one in {r2,r3}
*/
mov r0, r4 /* r0 ← r4 */
mov r1, r5 /* r1 ← r5 */
mov r2, r6 /* r2 ← r6 */
mov r3, r7 /* r3 ← r7 */
bl mult64 /* call mult64 function */
/* The result of the multiplication is in r0,r1 */
/* Now prepare the call to printf */
/* We have to pass &message, {r4,r5}, {r6,r7} and {r0,r1} */
push {r1} /* Push r1 onto the stack. 4th (higher) parameter */
push {r0} /* Push r0 onto the stack. 4th (lower) parameter */
push {r7} /* Push r7 onto the stack. 3rd (higher) parameter */
push {r6} /* Push r6 onto the stack. 3rd (lower) parameter */
mov r3, r5 /* r3 ← r5. 2rd (higher) parameter */
mov r2, r4 /* r2 ← r4. 2nd (lower) parameter */
ldr r0, addr_of_message /* r0 ← &message 1st parameter */
bl printf /* Call printf */
add sp, sp, #16 /* sp ← sp + 16 */
/* Pop the two registers we pushed above */
mov r0, #0 /* r0 ← 0 */
pop {r4, r5, r6, r7, r8, lr} /* Restore the registers we kept */
bx lr /* Leave main */
addr_of_message : .word message
addr_number_a_low: .word number_a_low
addr_number_a_high: .word number_a_high
addr_number_b_low: .word number_b_low
addr_number_b_high: .word number_b_high

Observe first that we have the addresses of the lower and upper part of each number. Instead of this we could load them by just using an offset, as we saw in chapter 8. So, in lines 41 to 44 we could have done the following.

40
41
42
43

/* {r4,r5} ← a */
ldr r4, addr_number_a_low /* r4 ← &a_low */
ldr r5, [r4, +#4] /* r5 ← *(r4 + 4) */
ldr r4, [r4] /* r4 ← *r4 */

In the function `mult64`

we pass the first value (x) as `r0,r1`

and the second value (y) as `r2,r3`

. The result is stored in `r0,r1`

. We move the values to the appropiate registers for parameter passing in lines 57 to 61.

Printing the result is a bit complicated. 64-bits must be passed as pairs of consecutive registers where the lower part is in an even numbered register. Since we pass the address of the message
in `r0`

we cannot pass the first 64-bit integer in `r1`

. So we skip `r1`

and we use `r2`

and `r3`

for the first argument. But now we have run out of registers for parameter passing. When this happens, we have to use the stack for parameter passing.

Two rules have to be taken into account when passing data in the stack.

- You must ensure that the stack is aligned for the data you are going to pass (by adjusting the stack first). So, for 64-bit numbers, the stack must be 8-byte aligned. If you pass an 32-bit number and then a 64-bit number, you will have to skip 4 bytes before passing the 64-bit number. Do not forget to keep the stack always 8-byte aligned per the Procedure Call Standard for ARM Architecture (AAPCS) requirement.
- An argument with a lower position number in the call must have a lower address in the stack. So we have to pass the arguments in opposite order.

The second rule is what explains why we push first `r1`

and then `r0`

, when they are the registers containing the last 64-bit number (the result of the multiplication) we want to pass to `printf`

.

Note that in the example above, we cannot pass the parameters in the stack using `push {r0,r1,r6,r7}`

, which is equivalent to `push {r0}`

, `push {r1}`

, `push {r6}`

and `push {r7}`

, but not equivalent to the required order when passing the arguments on the stack.

If we run the program we should see something like.

That's all for today.