# ARM assembler in Raspberry Pi – Chapter 12

We saw in chapter 6 some simple schemes to implement usual structured programming constructs like if-then-else and loops. In this chapter we will revisit these constructs and exploit a feature of the ARM instruction set that we have not learnt yet.

## Playing with loops

The most generic form of loop is this one.

    while (E) S;

There are also two special forms, which are actually particular incarnations of the one shown above but are interesting as well.

    for (i = lower; i <= upper; i += step) S;

    do S while (E);

Some languages, like Pascal, have constructs like this one.

    repeat S until E;

but this is equivalent to `do S while (!E)`.

We can manipulate loops to get a form that may be more convenient. For instance:

    do S while (E);
    /* Can be rewritten as */
    S; while (E) S;

    while (E) S;
    /* Can be rewritten as */
    if (E) { do S while (E); }

The last manipulation is interesting, because we can avoid the `if-then` if we directly go to the `while` part.

    /* This is not valid C */
    goto check;
    do
      S
    check: while (E);

In valid C, the above transformation would be written as follows.

    goto check;
    loop:
      S;
    check:
      if (E) goto loop;

Which looks much uglier than abusing C syntax a bit.
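To make the transformation concrete, here is a small sketch in C. The functions `sum_while` and `sum_goto` are hypothetical names made up for this illustration: both add the numbers from 1 to `n`, the first with a plain `while` and the second with the check placed after the body and entered through a `goto`.

```c
#include <stdint.h>

/* Hypothetical helper: sums 1..n with a plain while loop. */
uint32_t sum_while(uint32_t n) {
    uint32_t i = 1, total = 0;
    while (i <= n) {
        total += i;
        i++;
    }
    return total;
}

/* Same loop, but the check sits after the body and we jump to it first. */
uint32_t sum_goto(uint32_t n) {
    uint32_t i = 1, total = 0;
    goto check;
loop:
    total += i;      /* S */
    i++;
check:
    if (i <= n) goto loop;   /* E */
    return total;
}
```

Both functions should return the same value for any `n`, including `n = 0`, where the body must not run at all.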

## The -s suffix

So far, when checking the condition of an `if` or `while`, we have evaluated the condition and then used the `cmp` instruction to update `cpsr`. The update of the `cpsr` is mandatory for our condition codes, no matter if we use branching or predication. But `cmp` is not the only way to update `cpsr`. In fact many instructions can update it.

By default an instruction does not update `cpsr` unless we append the suffix `-s`. So instead of the instruction `add` or `sub` we write `adds` or `subs`. The result of the instruction (what would be stored in the destination register) is used to update `cpsr`.

How can we use this? Well, consider this simple loop counting backwards.

    /* for (int i = 100 ; i >= 0; i--) */
    mov r1, #100
    loop:
      /* do something */
      sub r1, r1, #1      /* r1 ← r1 - 1 */
      cmp r1, #0          /* update cpsr with r1 - 0 */
      bge loop            /* branch if r1 >= 0 */

If we replace `sub` by `subs` then `cpsr` will be updated with the result of the subtraction. This means that the flags N, Z, C and V will be updated, so we can use a branch right after `subs`. In our case we want to jump back to loop only if `i >= 0`, that is, when the result is non-negative. We can use `bpl` to achieve this.

    /* for (int i = 100 ; i >= 0; i--) */
    mov r1, #100
    loop:
      /* do something */
      subs r1, r1, #1     /* r1 ← r1 - 1 and update cpsr with the final r1 */
      bpl loop            /* branch if the previous sub computed a non-negative number (N flag in cpsr is 0) */

It is a bit tricky to get these things right (this is why we use compilers). For instance this similar, but not identical, loop would use `bne` instead of `bpl`. Here the condition is `ne` (not equal). It would be nice to have an alias like `nz` (not zero) but, unfortunately, this does not exist in ARM.

    /* for (int i = 100 ; i > 0; i--). Note here i > 0, not i >= 0 as in the example above */
    mov r1, #100
    loop:
      /* do something */
      subs r1, r1, #1     /* r1 ← r1 - 1 and update cpsr with the final r1 */
      bne loop            /* branch if the previous sub computed a number that is not zero (Z flag in cpsr is 0) */

A rule of thumb for when to use the -s suffix is code of the following form.

    s = ...
    if (s @ 0)

where `@` means any comparison with respect to 0 (equal, different, lower, etc.).
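The difference between the `bpl` and the `bne` loops is easy to get wrong, so here is a rough C model of it. This is only a sketch: `struct flags`, `subs_model`, `count_with_bpl` and `count_with_bne` are names invented for the illustration, and only the N and Z flags are modelled.

```c
#include <stdint.h>

/* Hypothetical model of the two flags used in this section. */
struct flags { int n; int z; };

/* Models `subs rd, rn, #imm`: returns rn - imm and records N and Z. */
uint32_t subs_model(uint32_t rn, uint32_t imm, struct flags *f) {
    uint32_t result = rn - imm;
    f->n = (int)(result >> 31);   /* N is bit 31 of the result */
    f->z = (result == 0);         /* Z is set when the result is zero */
    return result;
}

/* Loop closed by `bpl`: the body runs for i = start, start-1, ..., 0. */
unsigned count_with_bpl(uint32_t start) {
    struct flags f;
    uint32_t r1 = start;
    unsigned iterations = 0;
    do {
        iterations++;                 /* do something */
        r1 = subs_model(r1, 1, &f);   /* subs r1, r1, #1 */
    } while (f.n == 0);               /* bpl loop */
    return iterations;
}

/* Loop closed by `bne`: the body runs for i = start, start-1, ..., 1. */
unsigned count_with_bne(uint32_t start) {
    struct flags f;
    uint32_t r1 = start;
    unsigned iterations = 0;
    do {
        iterations++;                 /* do something */
        r1 = subs_model(r1, 1, &f);   /* subs r1, r1, #1 */
    } while (f.z == 0);               /* bne loop */
    return iterations;
}
```

Starting from 100, the `bpl` version runs the body 101 times (100 down to 0) while the `bne` version runs it 100 times (100 down to 1).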

## Operating 64-bit numbers

As an example of using the suffix -s we will implement three 64-bit integer operations in ARM: addition, subtraction and multiplication. Remember that ARM is a 32-bit architecture, so everything is 32-bit minded. If we only use 32-bit numbers, this is not a problem, but if for some reason we need 64-bit numbers things get a bit more complicated. We will represent a 64-bit number as two 32-bit numbers, the lower and the higher part. This way a 64-bit number n represented using two 32-bit parts, n_{lower} and n_{higher}, will have the value n = 2^{32} × n_{higher} + n_{lower}.

We will, obviously, need to keep the two 32-bit parts somewhere. When keeping them in registers, we will use two consecutive registers (e.g. r1 and r2, which we will write as `{r1,r2}`) and we will keep the higher part in the higher numbered register. When keeping a 64-bit number in memory, we will store the two parts in two consecutive addresses, with the lower part at the lower address. The address will be 8-byte aligned.
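As a quick check of this representation, the following C sketch splits a 64-bit value into its two parts and joins them back. The helper names `lower_part`, `higher_part` and `join_parts` are assumptions made for this example only.

```c
#include <stdint.h>

/* n_lower: the 32 least significant bits of n. */
uint32_t lower_part(uint64_t n) {
    return (uint32_t)(n & 0xFFFFFFFFu);
}

/* n_higher: the 32 most significant bits of n. */
uint32_t higher_part(uint64_t n) {
    return (uint32_t)(n >> 32);
}

/* n = 2^32 * n_higher + n_lower */
uint64_t join_parts(uint32_t higher, uint32_t lower) {
    return ((uint64_t)higher << 32) | lower;
}
```

For instance, 12345678901 splits into a higher part of 2 and a lower part of 3755744309, the same two words used later in this chapter.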

### Addition

Adding two 64-bit numbers using 32-bit operands means adding first the lower parts and then adding the higher parts, taking into account a possible carry from the lower parts. With our current knowledge we could write something like this (assume the first number is in `{r2,r3}`, the second in `{r4,r5}` and the result will be in `{r0,r1}`).

    add r1, r3, r5      /* First we add the higher part */
                        /* r1 ← r3 + r5 */
    adds r0, r2, r4     /* Now we add the lower part and we update cpsr */
                        /* r0 ← r2 + r4 */
    addcs r1, r1, #1    /* If adding the lower part caused carry, add 1 to the higher part */
                        /* if C = 1 then r1 ← r1 + 1 */
                        /* Note that here the suffix -s is not applied, -cs means carry set */

This would work. Fortunately ARM provides an instruction, `adc`, which adds two numbers and the carry flag. So we could rewrite the above code with just two instructions.

    adds r0, r2, r4     /* First add the lower part and update cpsr */
                        /* r0 ← r2 + r4 */
    adc r1, r3, r5      /* Now add the higher part plus the carry from the lower one */
                        /* r1 ← r3 + r5 + C */
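The `adds`/`adc` pair can be modelled in C to convince ourselves that it really computes a 64-bit sum. This is a sketch under the register assignment used above; the function `add64_model` and the variable `c` standing for the C flag are hypothetical names.

```c
#include <stdint.h>

/* Models the two-instruction 64-bit addition using only 32-bit arithmetic. */
uint64_t add64_model(uint64_t a, uint64_t b) {
    uint32_t r2 = (uint32_t)a, r3 = (uint32_t)(a >> 32);   /* {r2,r3} ← a */
    uint32_t r4 = (uint32_t)b, r5 = (uint32_t)(b >> 32);   /* {r4,r5} ← b */

    uint32_t r0 = r2 + r4;        /* adds r0, r2, r4 */
    uint32_t c = (r0 < r2);       /* C = 1 if the 32-bit addition wrapped around */
    uint32_t r1 = r3 + r5 + c;    /* adc r1, r3, r5 */

    return ((uint64_t)r1 << 32) | r0;   /* {r0,r1} as a 64-bit value */
}
```

An unsigned 32-bit addition wrapped around exactly when the result is smaller than one of the operands, which is how the carry is recovered here.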

### Subtraction

Subtracting two numbers is similar to adding them. In ARM when subtracting two numbers using `subs`, if we need to borrow (because the second operand is larger than the first) then C will be cleared (C will be 0). If we do not need to borrow, C will be set (C will be 1). This is a bit surprising but consistent with the remainder of the architecture (check in chapter 5 conditions CS/HS and CC/LO). Similar to `adc` there is a `sbc` which performs a normal subtraction if C is 1. Otherwise it subtracts one more unit. Again, this is consistent with how C works in the `subs` instruction.

    subs r0, r2, r4     /* First subtract the lower part and update cpsr */
                        /* r0 ← r2 - r4 */
    sbc r1, r3, r5      /* Now subtract the higher part plus the NOT of the carry from the lower one */
                        /* r1 ← r3 - r5 - ~C */
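Since the inverted meaning of C is easy to trip over, here is the same kind of C model for `subs`/`sbc`. Again this is just a sketch; `sub64_model` and `c` are names invented for the example.

```c
#include <stdint.h>

/* Models the two-instruction 64-bit subtraction using only 32-bit arithmetic. */
uint64_t sub64_model(uint64_t a, uint64_t b) {
    uint32_t r2 = (uint32_t)a, r3 = (uint32_t)(a >> 32);   /* {r2,r3} ← a */
    uint32_t r4 = (uint32_t)b, r5 = (uint32_t)(b >> 32);   /* {r4,r5} ← b */

    uint32_t r0 = r2 - r4;              /* subs r0, r2, r4 */
    uint32_t c = (r2 >= r4);            /* C = 1 means that no borrow was needed */
    uint32_t r1 = r3 - r5 - (1u - c);   /* sbc r1, r3, r5 */

    return ((uint64_t)r1 << 32) | r0;   /* {r0,r1} as a 64-bit value */
}
```

When the lower subtraction needs to borrow, `c` is 0 and the higher part loses one extra unit, which is exactly what `sbc` does.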

### Multiplication

Multiplying two 64-bit numbers is a tricky thing. When we multiply two N-bit numbers the result may need up to 2×N bits. So when multiplying two 64-bit numbers we may need a 128-bit number. For the sake of simplicity we will assume that this does not happen and 64 bits will be enough. Our 64-bit numbers are two 32-bit integers, so a 64-bit x is actually x = 2^{32} × x_{1} + x_{0}, where x_{1} and x_{0} are two 32-bit numbers. Similarly another 64-bit number y would be y = 2^{32} × y_{1} + y_{0}. Multiplying x and y yields z where z = 2^{64} × x_{1} × y_{1} + 2^{32} × (x_{0} × y_{1} + x_{1} × y_{0}) + x_{0} × y_{0}. Well, now our problem is multiplying each x_{i} by each y_{j}, but again we may need 64 bits to represent each of these values.

ARM provides a bunch of different instructions for multiplication. Today we will see just three of them. If we are multiplying 32-bit numbers and we do not care about the result not fitting in a 32-bit number we can use `mul Rd, Rsource1, Rsource2`. Unfortunately it does not set any flag in the `cpsr` useful for detecting an overflow of the multiplication (i.e. when the result does not fit in the 32-bit range). This instruction is the fastest one of the three. If we do want the 64-bit result of the multiplication, we have two other instructions, `smull` and `umull`. The former is used when we multiply two numbers in two's complement, the latter when we represent unsigned values. Their syntax is `{s,u}mull RdestLower, RdestHigher, Rsource1, Rsource2`. The lower part of the 64-bit result is kept in the register `RdestLower` and the higher part in the register `RdestHigher`.

In this example we have to use `umull`, otherwise the 32-bit lower parts might end up being interpreted as negative numbers, giving negative intermediate values. That said, we can now multiply x_{0} and y_{0}. Recall that we have the two 64-bit numbers in the `r2,r3` and `r4,r5` pairs of registers. So first multiply `r2` and `r4`. Note the usage of `r0` since this will be its final value. In contrast, register `r6` will be used later.

    umull r0, r6, r2, r4

Now let's multiply x_{1} by y_{0} and x_{0} by y_{1}. This is `r3` by `r4` and `r2` by `r5`. Note how we overwrite `r4` and `r5` in the second multiplication. This is fine since we will not need them anymore.

    umull r7, r8, r3, r4
    umull r4, r5, r2, r5

There is no need to compute x_{1} × y_{1} because if it gives a nonzero value, the result will always overflow a 64-bit number. This means that if both `r3` and `r5` are nonzero, the multiplication will never fit in 64 bits. This is a sufficient condition, but not a necessary one: the number might also overflow when adding the intermediate values that make up `r1`. The higher part of the result is the sum of the lower parts of the two cross products (`r7` and `r4`) and the higher part of x_{0} × y_{0} (`r6`).

    adds r2, r7, r4
    adc r1, r2, r6
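Before packaging the instructions into a function, we can mirror them in C and compare against a native 64-bit multiplication. This is only a model under the assumption stated above, namely that the result fits in 64 bits; `mult64_model` and the local names are hypothetical.

```c
#include <stdint.h>

/* Models the umull/adds/adc sequence; x is in {r2,r3} and y in {r4,r5}. */
uint64_t mult64_model(uint64_t x, uint64_t y) {
    uint32_t x0 = (uint32_t)x, x1 = (uint32_t)(x >> 32);
    uint32_t y0 = (uint32_t)y, y1 = (uint32_t)(y >> 32);

    uint64_t x0y0 = (uint64_t)x0 * y0;   /* umull r0, r6, r2, r4 */
    uint64_t x1y0 = (uint64_t)x1 * y0;   /* umull r7, r8, r3, r4 */
    uint64_t x0y1 = (uint64_t)x0 * y1;   /* umull r4, r5, r2, r5 */

    uint32_t r0 = (uint32_t)x0y0;          /* lower part of x0 * y0 */
    uint32_t r6 = (uint32_t)(x0y0 >> 32);  /* higher part of x0 * y0 */
    uint32_t r7 = (uint32_t)x1y0;          /* lower part of x1 * y0 */
    uint32_t r4 = (uint32_t)x0y1;          /* lower part of x0 * y1 */

    uint32_t r2 = r7 + r4;               /* adds r2, r7, r4 */
    uint32_t c = (r2 < r7);              /* C flag left by adds */
    uint32_t r1 = r2 + r6 + c;           /* adc r1, r2, r6 */

    return ((uint64_t)r1 << 32) | r0;    /* {r0,r1} as a 64-bit value */
}
```

With the two numbers used in the program below, 12345678901 and 12345678, the model returns 152415776403139878, the same value a plain `uint64_t` multiplication gives.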

Let’s package this code in a nice function in a program to see if it works. We will multiply numbers 12345678901 (this is 2×2^{32} + 3755744309) and 12345678 and print the result.

    /* -- mult64.s */
    .data

    .align 4
    message: .asciz "Multiplication of %lld by %lld is %lld\n"

    .align 8
    number_a_low: .word 3755744309
    number_a_high: .word 2

    .align 8
    number_b_low: .word 12345678
    number_b_high: .word 0

    .text

    /* Note: This is not the most efficient way to do a 64-bit multiplication.
       This is for illustration purposes */
    mult64:
        /* The arguments will be passed in r0, r1 and r2, r3 and returned in r0, r1 */
        /* Keep the registers that we are going to write */
        push {r4, r5, r6, r7, r8, lr}
        /* For convenience, move {r0,r1} into {r4,r5} */
        mov r4, r0   /* r4 ← r0 */
        mov r5, r1   /* r5 ← r1 */

        umull r0, r6, r2, r4    /* {r0,r6} ← r2 * r4 */
        umull r7, r8, r3, r4    /* {r7,r8} ← r3 * r4 */
        umull r4, r5, r2, r5    /* {r4,r5} ← r2 * r5 */
        adds r2, r7, r4         /* r2 ← r7 + r4 and update cpsr */
        adc r1, r2, r6          /* r1 ← r2 + r6 + C */

        /* Restore registers */
        pop {r4, r5, r6, r7, r8, lr}
        bx lr                   /* Leave mult64 */

    .global main
    main:
        push {r4, r5, r6, r7, r8, lr}   /* Keep the registers we are going to modify */
                                        /* r8 is not actually used here, but this way
                                           the stack is already 8-byte aligned */
        /* Load the numbers from memory */
        /* {r4,r5} ← a */
        ldr r4, addr_number_a_low       /* r4 ← &a_low */
        ldr r4, [r4]                    /* r4 ← *r4 */
        ldr r5, addr_number_a_high      /* r5 ← &a_high */
        ldr r5, [r5]                    /* r5 ← *r5 */

        /* {r6,r7} ← b */
        ldr r6, addr_number_b_low       /* r6 ← &b_low */
        ldr r6, [r6]                    /* r6 ← *r6 */
        ldr r7, addr_number_b_high      /* r7 ← &b_high */
        ldr r7, [r7]                    /* r7 ← *r7 */

        /* Now prepare the call to mult64 */
        /* The first number is passed in registers {r0,r1}
           and the second one in {r2,r3} */
        mov r0, r4                      /* r0 ← r4 */
        mov r1, r5                      /* r1 ← r5 */
        mov r2, r6                      /* r2 ← r6 */
        mov r3, r7                      /* r3 ← r7 */

        bl mult64                       /* call mult64 function */
        /* The result of the multiplication is in r0,r1 */

        /* Now prepare the call to printf */
        /* We have to pass &message, {r4,r5}, {r6,r7} and {r0,r1} */
        push {r1}                       /* Push r1 onto the stack. 4th (higher) parameter */
        push {r0}                       /* Push r0 onto the stack. 4th (lower) parameter */
        push {r7}                       /* Push r7 onto the stack. 3rd (higher) parameter */
        push {r6}                       /* Push r6 onto the stack. 3rd (lower) parameter */
        mov r3, r5                      /* r3 ← r5. 2nd (higher) parameter */
        mov r2, r4                      /* r2 ← r4. 2nd (lower) parameter */
        ldr r0, addr_of_message         /* r0 ← &message 1st parameter */
        bl printf                       /* Call printf */
        add sp, sp, #16                 /* sp ← sp + 16 */
                                        /* Discard the four registers we pushed above */

        mov r0, #0                      /* r0 ← 0 */
        pop {r4, r5, r6, r7, r8, lr}    /* Restore the registers we kept */
        bx lr                           /* Leave main */

    addr_of_message: .word message
    addr_number_a_low: .word number_a_low
    addr_number_a_high: .word number_a_high
    addr_number_b_low: .word number_b_low
    addr_number_b_high: .word number_b_high

Observe first that we have the addresses of the lower and upper part of each number. Instead of this we could load them by just using an offset, as we saw in chapter 8. So, the four instructions in `main` that load `a` into `{r4,r5}` could have been written as follows.

    /* {r4,r5} ← a */
    ldr r4, addr_number_a_low       /* r4 ← &a_low */
    ldr r5, [r4, #4]                /* r5 ← *(r4 + 4) */
    ldr r4, [r4]                    /* r4 ← *r4 */

In the function `mult64` we pass the first value (x) as `r0,r1` and the second value (y) as `r2,r3`. The result is stored in `r0,r1`. We move the values to the appropriate registers for parameter passing in the four `mov` instructions right before `bl mult64`.

Printing the result is a bit complicated. 64-bit numbers must be passed as pairs of consecutive registers where the lower part is in an even-numbered register. Since we pass the address of the message in `r0` we cannot pass the first 64-bit integer in `r1`. So we skip `r1` and we use `r2` and `r3` for the first 64-bit argument. But now we have run out of registers for parameter passing. When this happens, we have to use the stack for parameter passing.

Two rules have to be taken into account when passing data in the stack.

- You must ensure that the stack is aligned for the data you are going to pass (by adjusting the stack first). So, for 64-bit numbers, the stack must be 8-byte aligned. If you pass a 32-bit number and then a 64-bit number, you will have to skip 4 bytes before passing the 64-bit number. Do not forget to keep the stack always 8-byte aligned, as required by the Procedure Call Standard for the ARM Architecture (AAPCS).
- An argument with a lower position number in the call must have a lower address in the stack. So we have to push the arguments in reverse order.

The second rule explains why we push first `r1` and then `r0`, even though they are the registers containing the last 64-bit number (the result of the multiplication) we want to pass to `printf`.

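Note that in the example above, we cannot pass the parameters in the stack using `push {r0,r1,r6,r7}`. A push of several registers always stores the lowest numbered register at the lowest address, so it is equivalent to `push {r7}`, `push {r6}`, `push {r1}` and `push {r0}`. This would leave `{r0,r1}` below `{r6,r7}` in the stack, which is not the required order when passing the arguments on the stack.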
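The ordering rule can be checked with a tiny simulation of the stack in C. Everything here is made up for the illustration: `struct sim_stack` is an array of words with an index playing the role of `sp`, `sim_push` models a single-register `push`, and `sim_read64` reads a 64-bit argument the way the callee would, lower part first.

```c
#include <stdint.h>

/* Hypothetical simulated stack: 16 words, sp is an index that decreases. */
struct sim_stack { uint32_t words[16]; int sp; };

/* Models `push {reg}`: decrement sp first, then store the word. */
void sim_push(struct sim_stack *s, uint32_t value) {
    s->sp -= 1;
    s->words[s->sp] = value;
}

/* Reads the 64-bit value found `offset` words above sp (lower part first). */
uint64_t sim_read64(const struct sim_stack *s, int offset) {
    uint32_t lower = s->words[s->sp + offset];
    uint32_t higher = s->words[s->sp + offset + 1];
    return ((uint64_t)higher << 32) | lower;
}

/* Pushes b and the result in the order used by main: r1, r0, r7, r6. */
void push_printf_args(struct sim_stack *s, uint64_t b, uint64_t result) {
    sim_push(s, (uint32_t)(result >> 32));  /* push {r1} */
    sim_push(s, (uint32_t)result);          /* push {r0} */
    sim_push(s, (uint32_t)(b >> 32));       /* push {r7} */
    sim_push(s, (uint32_t)b);               /* push {r6} */
}
```

After `push_printf_args`, the third argument of `printf` (b) is the 64-bit value at `sp` and the fourth one (the result) is the 64-bit value two words above it, which is the layout described by the second rule.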

If we run the program we should see something like.

    $ ./mult64_2
    Multiplication of 12345678901 by 12345678 is 152415776403139878

That’s all for today.


That’s an excellent tutorial! I read through all 12 chapters and learned a great deal. You have a talent for explaining things in an understandable way. Thank you for sharing this, and keep up the good work!

Me too.

I agree that you explain very well.

I have written some of the problems from projecteuler.net in ARM assembly language (http://www.fourtheye.org/cgi-bin/language.pl?language=asm).

Please use any of the code in your discussion if it helps.

Thanks a lot Bob!

ultimate nice piece of coding … i stun of this …

The 12 chapters of this tutorial were just what I was looking for to break into Raspberry Pi assembly language. I am working on a project where I want to read the GPIO pins repeatedly as fast as possible, accumulate a LOT of data (like say 512KB) on the stack, then write that data to a file on the Raspberry Pi’s SD card. Any thoughts or suggestions? Any problem or memory boundary issues for a stack that large? How would you initiate a file save of that block? Thanks for any comments you might have…

Hi Ricky,

thank you for the kind comments.

Regarding to your question and without more information, I think that big stacks should not be a problem as long as they are not unrealistically big (e.g. more than 1 GB).

Great tutorials by the way. I found a typo here:

    subs r0, r2, r4 /* First add the lower part and update cpsr */
                    /* r0 ← r2 - r4 */
    sbc r1, r3, r5  /* Now add the higher part plus the NOT of the carry from the lower one */
                    /* r1 ← r3 - r5 - ~C */

In the comments you say ADD instead of subtract.

Thanks again for this tutorial.

Fixed. Thanks!

Hi,

i think i found a better approach to multiplication:

    umul64:
        stmdb sp!, {r4, r5}
        mov r5, r1
        mov r4, r0
        umull r0, r1, r2, r4
        umlal r1, r5, r2, r5 // r5 overwritten but not needed anymore
        umlal r1, r5, r3, r4
        ldmia sp!, {r4, r5}
        bx lr

great tutorial btw.

Cheers!

Hi Marek,

yes, this is, I think, the shortest sequence.

It is slightly more complex to understand than the naive code I wrote in the post.

The following explanation is aimed at other readers that may be interested in your approach.

First recall that the multiplication of two 64 bit numbers A*B is equivalent to (2^32*A1 + A0) * (2^32*B1 + B0), where X1 and X0 are the 32 top bits and 32 lower bits of X. So the multiplication is actually 2^64*A1*B1 + 2^32*(A1*B0 + A0*B1) + A0*B0. Since we do not take into account overflow, we only have to compute A0*B0 and 2^32*(A1*B0 + A0*B1). The trick here is realising that A0*B0 yields a 64-bit result that we will have to propagate to the upper 32 bits of the result. The result, R, will then be R0 = Lower32(A0*B0) and R1 = Lower32(A1*B0) + Lower32(A0*B1) + Upper32(A0*B0). Here Lower32 and Upper32 mean the lower and upper 32 bits of the 64-bit value.

`umull r0, r1, r2, r4 /* {r0, r1} ← r2 * r4 */`

Recall that r0 will contain the lower 32-bits of the result and r1 will contain the upper 32-bit of the result at the end of the function. At this point r4 is A0 and r2 is B0. After the instruction, if r1 is nonzero means that r2*r4 did not fit in 32-bit. So r0 is Lower32(A0*B0) and r1 is Upper32(A0*B0).

    umlal r1, r5, r2, r5
    /* This instruction performs the following operation
       {lower, upper} ← r2 * r5
       r1 ← lower + r1
       r5 ← upper + r5 + C (C=1 if lower + r1 does not fit in 32 bit, C=0 otherwise) */

At this point r5 is A1 and r2 B0. We compute A1*B0 giving a 64 bit result {lower, upper}. Note that ‘upper’ (equivalent to Upper32(A1*B0)) is uninteresting because if it is nonzero it would mean that our multiplication does not fit in 64 bit, so the update of r5 is nonrelevant for our computation. r5 is just used here because the instruction umlal requires it, and we will discard r5 at the end of the function anyway.

The lower 32 bits of the multiplication are stored in ‘lower’ (equivalent to Lower32(A1*B0)). But remember that we had to add Upper32(A0*B0). And it happens that as said above r1 is now Upper32(A0*B0). This way at the end of this instruction r1 will be the Lower32(A1*B0) plus Upper32(A0*B0).

`umlal r1, r5, r3, r4`

Like the previous one, but this time we multiply A0*B1. Again r5 is uninteresting here. At the end of the instruction r1 will have the value of Lower32(A0*B1) plus the value it had: Lower32(A1*B0) plus Upper32(A0*B0). So after these three instructions r0 happens to be Lower32(A0*B0) and r1 happens to be Lower32(A0*B1) + Lower32(A1*B0) + Upper32(A0*B0), and this is what we wanted.

Errata: addcs instruction doesn’t exist. Use instead adcs

I’m sorry. It was my mistake add+cs exists, I have not read the comments

I loved this set of tutorials. Thanks for the great work!

Just a minor correction in the code transformation examples: the explanation on how to transform a “while (E) S;” has an initial “S” inside the if that shouldn’t be there. As it is now, the loop body will never run exactly once (it will run twice if the condition is initially true, without checking the condition between the first two body executions).

Hi Daniel,

you’re right. I’ve fixed that in the post. Thank you very much!

Kind regards,

Very very interesting articles. I learned a lot.

Is it possible to persuade you to write an article about I/O (eg. asynch ports) and timers and an article about interrupts (eg timer interrupts) ?

Kristian sander

Hi Kristian,

I’ll check that but given that the examples in the chapters are run on top of an operating system, probably these operations are privileged and only the OS can do them.

Kind regards,

hello how can i add and sub two numbers of 128 bits?

Hi,

similar to 64-bit but with more instructions. The complexity lies in that you will probably not want to use registers to represent the 128-bit numbers: that would cost you 4 registers per operand. So the easiest way is having an array of 4 words per operand and loading the words. Then you operate on them with add/adc and sub/sbc. Make sure you update the flags for the first three operations.

Below is an example of this.

%lld, not %11d

Ha! It took almost all day 🙂

Thanks for these series.

I started programming in Assembly LOOONG time ago (1985 IIRC). TMS 9900 on a TI 99/4A.

Then I went directly to a PC were I tried Pascal, C, PHP, Python, Ruby, Python again (I love Python), but lately I wanted to do some low level programming with my RPi just for fun.

I’m really enjoying it 🙂

Thanks!

“I’ll check that but given that the examples in the chapters are run on top of an operating system, probably these operations are privileged and only the OS can do them.”

Ah. How would you write code that has the system totally to itself and returns to the OS on completion (OR say resets the PI and reboots the OS)

Regards

Hi Richard,

this enters the domain of baremetal / OS programming. You can take a look at https://www.cl.cam.ac.uk/projects/raspberrypi/tutorials/os/ (reports itself as dated but may give you a starting point).

This is just one example, I’m pretty sure there are other courses / tutorials.

Kind regards,

Roger

Thank you very much, Roger. This is an excellent Tutorial

Thank you very much!

Hey,

If we push main function parameters to stack and call a function after that like printf, a new stack will allocated for that right? So can it access any other function’s stack?

Hi,

I’m not sure to understand what you mean by “a new stack will be allocated for that”.

There is a single stack, functions can grow it if they need it (by subtracting values from `sp`). But when a function returns, it must ensure `sp` is left as it was upon entering the function.

When we pass arguments through the stack, the allocation is done by the caller. The callee (`printf` in your case) then accesses the caller stack to find the arguments.

Kind regards,

Roger

Given that number_b_high = 0, wouldn’t the following have sufficed?

    umull r0, r6, r2, r4 @ {r0,r6} = x1y1
    umull r4, r5, r2, r5 @ {r4,r5} = x2y1
    add r1, r4, r6 @ r1 = r2 + r6

Thank you for the tutorial. It’s been a good read.

Hi,

yes but it could have not been, right? Maybe I should have made this clearer in the example.

Kind regards,