Think In Geek

In geek we trust

ARM assembler in Raspberry Pi – Chapter 12

We saw in chapter 6 some simple schemes to implement usual structured programming constructs like if-then-else and loops. In this chapter we will revisit these constructs and exploit a feature of the ARM instruction set that we have not learnt yet.

Playing with loops

The most generic form of loop is this one.

while (E)
  S;

There are also two special forms, which are actually particular incarnations of the one shown above but are interesting as well.

for (i = lower; i <= upper; i += step)
  S;

do
  S
while (E);

Some languages, like Pascal, have constructs like this one.

repeat
  S
until E;

but this is like a do S while (!E).

We can manipulate loops to get a form that may be more convenient. For instance.

   do
     S
   while (E);
/* Can be rewritten as */
   S;
   while (E)
     S;

   while (E)
     S;
/* Can be rewritten as */
   if (E)
      do
        S
      while (E);

The last manipulation is interesting, because we can avoid the if-then if we directly go to the while part.

/* This is not valid C */
goto check;
do
  S
check: while (E);

In valid C, the above transformation would be written as follows.

goto check;
loop:
  S;
check:
  if (E) goto loop;

This looks much uglier than bending C syntax a bit.
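To make the transformation concrete, here is a minimal sketch in C. The functions sum_while and sum_goto are hypothetical names chosen for this illustration: both add the integers from 1 to n, the first with an ordinary while loop and the second with the jump-to-the-check form, so the condition is tested in a single place at the bottom of the loop.

```c
/* Plain while loop: E is (i <= n), S is the two statements in the body. */
unsigned sum_while(unsigned n)
{
    unsigned i = 1, total = 0;
    while (i <= n) {
        total += i;
        i++;
    }
    return total;
}

/* Same loop after the transformation: jump straight to the check,
   which is the only place where the condition is evaluated. */
unsigned sum_goto(unsigned n)
{
    unsigned i = 1, total = 0;
    goto check;
loop:
    total += i;               /* S */
    i++;
check:
    if (i <= n) goto loop;    /* E */
    return total;
}
```

Both functions should return the same value for any n, including n = 0, where the body never runs.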

The -s suffix

So far, when checking the condition of an if or while, we have evaluated the condition and then used the cmp instruction to update cpsr. Updating cpsr is mandatory for our condition codes, whether we use branching or predication. But cmp is not the only way to update cpsr. In fact, many instructions can update it.

By default an instruction does not update cpsr unless we append the suffix -s. So instead of the instruction add or sub we write adds or subs. The result of the instruction (what would be stored in the destination register) is used to update cpsr.

How can we use this? Well, consider this simple loop counting backwards.

/* for (int i = 100 ; i >= 0; i--) */
mov r1, #100
loop:
  /* do something */
  sub r1, r1, #1      /* r1 ← r1 - 1 */
  cmp r1, #0          /* update cpsr with r1 - 0 */
  bge loop            /* branch if r1 >= 0 */

If we replace sub by subs then cpsr will be updated with the result of the subtraction. This means that the flags N, Z, C and V will be updated, so we can use a branch right after subs. In our case we want to jump back to loop only if i >= 0, that is, when the result is non-negative. We can use bpl to achieve this.

/* for (int i = 100 ; i >= 0; i--) */
mov r1, #100
loop:
  /* do something */
  subs r1, r1, #1      /* r1 ← r1 - 1  and update cpsr with the final r1 */
  bpl loop             /* branch if the previous subs computed a non-negative number (N flag in cpsr is 0) */

It is a bit tricky to get these things right (this is why we use compilers). For instance this similar, but not identical, loop would use bne instead of bpl. Here the condition is ne (not equal). It would be nice to have an alias like nz (not zero) but, unfortunately, this does not exist in ARM.

/* for (int i = 100 ; i > 0; i--). Note here i > 0, not i >= 0 as in the example above */
mov r1, #100
loop:
  /* do something */
  subs r1, r1, #1      /* r1 ← r1 - 1  and update cpsr with the final r1 */
  bne loop             /* branch if the previous subs computed a number that is not zero (Z flag in cpsr is 0) */

As a rule of thumb, the -s suffix is worth considering in code of the following form.

s = ...
if (s @ 0)

where @ means any comparison with respect to 0 (equal, different, lower, etc.).
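The difference between the bpl and bne loops can be checked with a small model in C. This is only a sketch: subs_model and the flags structure are hypothetical names, and only the N and Z flags are modelled. It counts how many times the loop body runs in each case.

```c
#include <stdint.h>

/* Hypothetical model of the N and Z flags that subs leaves in cpsr. */
struct flags { int n; int z; };

uint32_t subs_model(uint32_t a, uint32_t b, struct flags *f)
{
    uint32_t r = a - b;        /* wraps modulo 2^32, like a register */
    f->n = (r >> 31) & 1;      /* N: bit 31 of the result */
    f->z = (r == 0);           /* Z: the result is zero */
    return r;
}

/* for (i = 100; i >= 0; i--): loop again while N is 0 (bpl) */
int count_bpl(void)
{
    struct flags f;
    uint32_t r1 = 100;
    int iterations = 0;
    do {
        iterations++;                  /* do something */
        r1 = subs_model(r1, 1, &f);    /* subs r1, r1, #1 */
    } while (f.n == 0);                /* bpl loop */
    return iterations;
}

/* for (i = 100; i > 0; i--): loop again while Z is 0 (bne) */
int count_bne(void)
{
    struct flags f;
    uint32_t r1 = 100;
    int iterations = 0;
    do {
        iterations++;                  /* do something */
        r1 = subs_model(r1, 1, &f);    /* subs r1, r1, #1 */
    } while (f.z == 0);                /* bne loop */
    return iterations;
}
```

With these definitions count_bpl() runs the body 101 times (i from 100 down to 0) and count_bne() runs it 100 times (i from 100 down to 1).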

Operating 64-bit numbers

As an example of using the suffix -s we will implement three 64-bit integer operations in ARM: addition, subtraction and multiplication. Remember that ARM is a 32-bit architecture, so everything is 32-bit minded. If we only use 32-bit numbers, this is not a problem, but if for some reason we need 64-bit numbers things get a bit more complicated. We will represent a 64-bit number as two 32-bit numbers, the lower and higher part. This way a 64-bit number n represented using two 32-bit parts, n_lower and n_higher, will have the value n = 2^32 × n_higher + n_lower.

We will, obviously, need to keep the two 32-bit parts somewhere. When keeping them in registers, we will use two consecutive registers (e.g. r1 and r2, which we will write as {r1,r2}) and we will keep the higher part in the higher numbered register. When keeping a 64-bit number in memory, we will store the two parts in two consecutive addresses, with the lower part in the lower address. The address will be 8-byte aligned.


Adding two 64-bit numbers using 32-bit operands means adding first the lower part and then adding the higher parts but taking into account a possible carry from the lower part. With our current knowledge we could write something like this (assume the first number is in {r2,r3}, the second in {r4,r5} and the result will be in {r0,r1}).

add r1, r3, r5      /* First we add the higher part */
                    /* r1 ← r3 + r5 */
adds r0, r2, r4     /* Now we add the lower part and we update cpsr */
                    /* r0 ← r2 + r4 */
addcs r1, r1, #1    /* If adding the lower part caused carry, add 1 to the higher part */
                    /* if C = 1 then r1 ← r1 + 1 */
                    /* Note that here the suffix -s is not applied, -cs means carry set */

This would work. Fortunately, ARM provides an instruction, adc, which adds two numbers and the carry flag. So we could rewrite the above code with just two instructions.

adds r0, r2, r4     /* First add the lower part and update cpsr */
                    /* r0 ← r2 + r4 */
adc r1, r3, r5      /* Now add the higher part plus the carry from the lower one */
                    /* r1 ← r3 + r5 + C */
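The adds/adc pair can be modelled in C to check the carry propagation. This is a sketch under the assumption that uint32_t arithmetic wraps modulo 2^32 exactly as the registers do; add64_model is a hypothetical name.

```c
#include <stdint.h>

/* Adds two 64-bit numbers using only 32-bit additions, like adds + adc. */
uint64_t add64_model(uint64_t a, uint64_t b)
{
    uint32_t a_lower = (uint32_t)a, a_higher = (uint32_t)(a >> 32);
    uint32_t b_lower = (uint32_t)b, b_higher = (uint32_t)(b >> 32);

    uint32_t r_lower = a_lower + b_lower;              /* adds r0, r2, r4 */
    uint32_t carry = (r_lower < a_lower);              /* C is 1 if the sum wrapped around */
    uint32_t r_higher = a_higher + b_higher + carry;   /* adc r1, r3, r5 */

    return ((uint64_t)r_higher << 32) | r_lower;
}
```

For instance, adding 1 to 0xFFFFFFFF wraps the lower part to 0 and sets the carry, so the higher part becomes 1 and the result is 0x100000000.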


Subtracting two numbers is similar to adding them. In ARM, when subtracting two numbers using subs, if we need to borrow (because the second operand is larger than the first) then C will be cleared (C will be 0). If we do not need to borrow, C will be set (C will be 1). This is a bit surprising but consistent with the rest of the architecture (check in chapter 5 the conditions CS/HS and CC/LO). Similar to adc, there is an sbc instruction which performs a normal subtraction if C is 1. Otherwise it subtracts one more. Again, this is consistent with how C works in the subs instruction.

subs r0, r2, r4     /* First subtract the lower part and update cpsr */
                    /* r0 ← r2 - r4 */
sbc r1, r3, r5      /* Now subtract the higher part plus the NOT of the carry from the lower one */
                    /* r1 ← r3 - r5 - ~C */
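The inverted meaning of C in subtractions is easy to get wrong, so here is a C sketch of the subs/sbc pair. As before, sub64_model is a hypothetical name and the model only tracks the C flag.

```c
#include <stdint.h>

/* Subtracts two 64-bit numbers using only 32-bit subtractions, like subs + sbc. */
uint64_t sub64_model(uint64_t a, uint64_t b)
{
    uint32_t a_lower = (uint32_t)a, a_higher = (uint32_t)(a >> 32);
    uint32_t b_lower = (uint32_t)b, b_higher = (uint32_t)(b >> 32);

    uint32_t r_lower = a_lower - b_lower;                  /* subs r0, r2, r4 */
    uint32_t c = (a_lower >= b_lower);                     /* C is 1 when no borrow was needed */
    uint32_t r_higher = a_higher - b_higher - (1u - c);    /* sbc r1, r3, r5: subtracts NOT C */

    return ((uint64_t)r_higher << 32) | r_lower;
}
```

Subtracting 1 from 0x100000000 needs a borrow in the lower part, so C is 0 and the higher part loses one more unit, giving 0xFFFFFFFF.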


Multiplying two 64-bit numbers is a tricky thing. When we multiply two N-bit numbers the result may need up to 2*N bits. So when multiplying two 64-bit numbers we may need a 128-bit number. For the sake of simplicity we will assume that this does not happen and 64 bits will be enough. Our 64-bit numbers are two 32-bit integers, so a 64-bit x is actually x = 2^32 × x1 + x0, where x1 and x0 are two 32-bit numbers. Similarly another 64-bit number y would be y = 2^32 × y1 + y0. Multiplying x and y yields z where z = 2^64 × x1 × y1 + 2^32 × (x0 × y1 + x1 × y0) + x0 × y0. Well, now our problem is multiplying each x_i by each y_j, but again we may need 64 bits to represent each of those products.

ARM provides a bunch of different instructions for multiplication. Today we will see just three of them. If we are multiplying two 32-bit numbers and we do not care about the result not fitting in a 32-bit number, we can use mul Rd, Rsource1, Rsource2. Unfortunately it does not set any flag in the cpsr useful for detecting an overflow of the multiplication (i.e. when the result does not fit in the 32-bit range). This instruction is the fastest one of the three. If we do want the full 64-bit result of the multiplication, we have two other instructions: smull and umull. The former is used when we multiply two numbers in two’s complement, the latter when they represent unsigned values. Their syntax is {s,u}mull RdestLower, RdestHigher, Rsource1, Rsource2. The lower part of the 64-bit result is kept in the register RdestLower and the higher part in the register RdestHigher.
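The difference between the two long multiplications can be sketched in C by interpreting the same 32-bit patterns as unsigned or as two's complement values. The names umull_model and smull_model are hypothetical, and the conversion to int32_t assumes the usual wrap-around behaviour of mainstream compilers.

```c
#include <stdint.h>

/* umull: both operands are taken as unsigned 32-bit values. */
uint64_t umull_model(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* smull: the same bit patterns are taken as two's complement values.
   Converting an out-of-range value to int32_t is implementation-defined
   in C; this sketch assumes it wraps, as common compilers do. */
int64_t smull_model(uint32_t a, uint32_t b)
{
    return (int64_t)(int32_t)a * (int64_t)(int32_t)b;
}
```

With a = 0xFFFFFFFF and b = 2, the unsigned product is 0x1FFFFFFFE, while the signed product is -2 because 0xFFFFFFFF is read as -1.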

In this example we have to use umull; otherwise the 32-bit lower parts might end up being interpreted as negative numbers, giving negative intermediate values. That said, we can now multiply x0 and y0. Recall that we have the two 64-bit numbers in the register pairs {r2,r3} and {r4,r5}. So first multiply r2 and r4. Note the use of r0 as the lower destination, since this will be its final value. In contrast, register r6 will be used later.

umull r0, r6, r2, r4

Now let’s multiply x1 by y0 and x0 by y1. This is r3 by r4 and r2 by r5. Note how we overwrite r4 and r5 in the second multiplication. This is fine since we will not need them anymore.

umull r7, r8, r3, r4
umull r4, r5, r2, r5

There is no need to multiply x1 by y1 because if it gives a nonzero value, the result will always overflow a 64-bit number. This means that if both r3 and r5 are nonzero, the product will never fit in 64 bits. This is a sufficient condition for overflow, but not a necessary one: the number might also overflow when adding the intermediate values that make up r1.

adds r2, r7, r4
adc r1, r2, r6
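The whole sequence can be cross-checked against native 64-bit arithmetic with a C sketch. The function mult64_model is a hypothetical name, and each step is annotated with the instruction it mirrors. When the product fits in 64 bits the addition of the two middle terms cannot carry, so the final adc behaves like a plain add; the sketch relies on that and ignores the carry.

```c
#include <stdint.h>

/* Multiplies two 64-bit numbers from three 32x32 -> 64 partial products. */
uint64_t mult64_model(uint64_t x, uint64_t y)
{
    uint32_t x0 = (uint32_t)x, x1 = (uint32_t)(x >> 32);
    uint32_t y0 = (uint32_t)y, y1 = (uint32_t)(y >> 32);

    uint64_t p00 = (uint64_t)x0 * y0;    /* umull r0, r6, r2, r4 */
    uint64_t p10 = (uint64_t)x1 * y0;    /* umull r7, r8, r3, r4 */
    uint64_t p01 = (uint64_t)x0 * y1;    /* umull r4, r5, r2, r5 */

    uint32_t lower = (uint32_t)p00;                    /* r0 */
    uint32_t middle = (uint32_t)p10 + (uint32_t)p01;   /* adds r2, r7, r4 */
    uint32_t higher = middle + (uint32_t)(p00 >> 32);  /* adc r1, r2, r6, with C assumed 0 */

    return ((uint64_t)higher << 32) | lower;
}
```

The result should agree with the compiler's own 64-bit multiplication, which also keeps only the lower 64 bits of the product.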

Let’s package this code in a nice function in a program to see if it works. We will multiply numbers 12345678901 (this is 2×2^32 + 3755744309) and 12345678 and print the result.

/* -- mult64.s */
.data
.align 4
message : .asciz "Multiplication of %lld by %lld is %lld\n"
.align 8
number_a_low: .word 3755744309
number_a_high: .word 2
.align 8
number_b_low: .word 12345678
number_b_high: .word 0
.text
/* Note: This is not the most efficient way to do a 64-bit multiplication.
   This is for illustration purposes */
mult64:
   /* The arguments will be passed in {r0,r1} and {r2,r3} and the result returned in {r0,r1} */
   /* Keep the registers that we are going to write */
   push {r4, r5, r6, r7, r8, lr}
   /* For convenience, move {r0,r1} into {r4,r5} */
   mov r4, r0   /* r4 ← r0 */
   mov r5, r1   /* r5 ← r1 */
   umull r0, r6, r2, r4    /* {r0,r6} ← r2 * r4 */
   umull r7, r8, r3, r4    /* {r7,r8} ← r3 * r4 */
   umull r4, r5, r2, r5    /* {r4,r5} ← r2 * r5 */
   adds r2, r7, r4         /* r2 ← r7 + r4 and update cpsr */
   adc r1, r2, r6          /* r1 ← r2 + r6 + C */
   /* Restore registers */
   pop {r4, r5, r6, r7, r8, lr}
   bx lr                   /* Leave mult64 */
.global main
main:
    push {r4, r5, r6, r7, r8, lr}       /* Keep the registers we are going to modify */
                                        /* r8 is not actually used here, but this way 
                                           the stack is already 8-byte aligned */
    /* Load the numbers from memory */
    /* {r4,r5} ← a */
    ldr r4, addr_number_a_low       /* r4 ← &a_low */
    ldr r4, [r4]                    /* r4 ← *r4 */
    ldr r5, addr_number_a_high      /* r5 ← &a_high  */
    ldr r5, [r5]                    /* r5 ← *r5 */
    /* {r6,r7} ← b */
    ldr r6, addr_number_b_low       /* r6 ← &b_low  */
    ldr r6, [r6]                    /* r6 ← *r6 */
    ldr r7, addr_number_b_high      /* r7 ← &b_high  */
    ldr r7, [r7]                    /* r7 ← *r7 */
    /* Now prepare the call to mult64.
       The first number is passed in
       registers {r0,r1} and the second one in {r2,r3} */
    mov r0, r4                  /* r0 ← r4 */
    mov r1, r5                  /* r1 ← r5 */
    mov r2, r6                  /* r2 ← r6 */
    mov r3, r7                  /* r3 ← r7 */
    bl mult64                  /* call mult64 function */
    /* The result of the multiplication is in r0,r1 */
    /* Now prepare the call to printf */
    /* We have to pass &message, {r4,r5}, {r6,r7} and {r0,r1} */
    push {r1}                   /* Push r1 onto the stack. 4th (higher) parameter */
    push {r0}                   /* Push r0 onto the stack. 4th (lower) parameter */
    push {r7}                   /* Push r7 onto the stack. 3rd (higher) parameter */
    push {r6}                   /* Push r6 onto the stack. 3rd (lower) parameter */
    mov r3, r5                  /* r3 ← r5.                2nd (higher) parameter */
    mov r2, r4                  /* r2 ← r4.                2nd (lower) parameter */
    ldr r0, addr_of_message     /* r0 ← &message           1st parameter */
    bl printf                   /* Call printf */
    add sp, sp, #16             /* sp ← sp + 16 */
                                /* Discard the four registers we pushed above */
    mov r0, #0                  /* r0 ← 0 */
    pop {r4, r5, r6, r7, r8, lr}        /* Restore the registers we kept */
    bx lr                       /* Leave main */
addr_of_message : .word message
addr_number_a_low: .word number_a_low
addr_number_a_high: .word number_a_high
addr_number_b_low: .word number_b_low
addr_number_b_high: .word number_b_high

Observe first that we keep the addresses of both the lower and the upper part of each number. Instead, we could load the upper part using an offset from the address of the lower part, as we saw in chapter 8. So, where main loads a into {r4,r5}, we could have done the following.

    /* {r4,r5} ← a */
    ldr r4, addr_number_a_low       /* r4 ← &a_low */
    ldr r5, [r4, +#4]               /* r5 ← *(r4 + 4) */
    ldr r4, [r4]                    /* r4 ← *r4  */

In the function mult64 we pass the first value (x) in {r0,r1} and the second value (y) in {r2,r3}. The result is returned in {r0,r1}. In main, the four mov instructions just before bl mult64 move the values to the appropriate registers for parameter passing.

Printing the result is a bit complicated. 64-bit values must be passed in pairs of consecutive registers where the lower part is in an even-numbered register. Since we pass the address of the message in r0, we cannot pass the first 64-bit integer starting at r1. So we skip r1 and use r2 and r3 for the first integer. But now we have run out of registers for parameter passing. When this happens, we have to use the stack for parameter passing.

Two rules have to be taken into account when passing data in the stack.

  1. You must ensure that the stack is aligned for the data you are going to pass (by adjusting the stack first). So, for 64-bit numbers, the stack must be 8-byte aligned. If you pass a 32-bit number and then a 64-bit number, you will have to skip 4 bytes before passing the 64-bit number. Do not forget to keep the stack always 8-byte aligned, as required by the Procedure Call Standard for the ARM Architecture (AAPCS).
  2. An argument with a lower position number in the call must have a lower address in the stack. So we have to push the arguments in reverse order.

The second rule is what explains why we push first r1 and then r0, when they are the registers containing the last 64-bit number (the result of the multiplication) we want to pass to printf.

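Note that in the example above we cannot pass the parameters on the stack using push {r0,r1,r6,r7}. A push with a register list always stores the lowest-numbered register at the lowest address, so this single instruction is equivalent to push {r7}, push {r6}, push {r1} and push {r0}, which places the result below b in memory rather than in the order required for passing the arguments on the stack.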
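The effect of the push order on the memory layout can be illustrated with a toy model in C. Everything here is hypothetical (the array stack_words, the index sp and the helpers); it only mimics a stack that grows towards lower addresses, one 32-bit word at a time.

```c
#include <stdint.h>

/* Toy model of a descending stack of 32-bit words. */
uint32_t stack_words[8];
int sp = 8;                     /* one past the last word: the stack is empty */

void push_word(uint32_t value)
{
    sp--;                       /* the stack grows towards lower addresses */
    stack_words[sp] = value;
}

/* Push two 64-bit arguments so that `first` ends at the lower address,
   mirroring the four push instructions before the call to printf. */
void push_args(uint32_t first_lower, uint32_t first_higher,
               uint32_t second_lower, uint32_t second_higher)
{
    push_word(second_higher);   /* push {r1} */
    push_word(second_lower);    /* push {r0} */
    push_word(first_higher);    /* push {r7} */
    push_word(first_lower);     /* push {r6} */
}
```

After push_args the words read from stack_words[sp] upwards are first_lower, first_higher, second_lower and second_higher, which is the order the callee expects.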

If we run the program we should see something like this.

$ ./mult64_2
Multiplication of 12345678901 by 12345678 is 152415776403139878

That’s all for today.


17 thoughts on “ARM assembler in Raspberry Pi – Chapter 12”

  • Vibe says:

    That’s an excellent tutorial! I read through all 12 chapters and learned a great deal. You have a talent for explaining things in an understandable way. Thank you for sharing this, and keep up the good work!

  • I agree that you explain very well.

    I have written some of the problems from in ARM assembly language (

    Please use any of the code in your discussion if it helps.

  • ultimate nice piece of coding … i stun of this …

  • Ricky Roberson says:

    The 12 chapters of this tutorial were just what I was looking for to break into Raspberry Pi assembly language. I am working on a project where I want to read the GPIO pins repeatedly as fast as possible, accumulate a LOT of data (like say 512KB) on the stack, then write that data to a file on the Raspberry Pi’s SD card. Any thoughts or suggestions? Any problem or memory boundary issues for a stack that large? How would you initiate a file save of that block? Thanks for any comments you might have…

    • rferrer says:

      Hi Ricky,

      thank you for the kind comments.

      Regarding to your question and without more information, I think that big stacks should not be a problem as long as they are not unrealistically big (e.g. more than 1 GB).

  • Henryk Sarat says:

    Great tutorials by the way. I found a typo here:

    subs r0, r2, r4 /* First add the lower part and update cpsr */
    /* r0 ← r2 – r4 */
    sbc r1, r3, r5 /* Now add the higher part plus the NOT of the carry from the lower one */
    /* r1 ← r3 – r5 – ~C */

    In the comments you say ADD instead of subtract.

    Thanks again for this tutorial.

  • Marek Mularczyk says:


    i think i found a better approach to multiplication:

    stmdb sp!, {r4, r5}

    mov r5, r1
    mov r4, r0

    umull r0, r1, r2, r4
    umlal r1, r5, r2, r5 // r5 overwritten but not needed anymore
    umlal r1, r5, r3, r4

    ldmia sp!, {r4, r5}
    bx lr

    great tutorial btw.


    • rferrer says:

      Hi Marek,

      yes, this is, I think, the shortest sequence.

      It is slightly more complex to understand than the naive code I wrote in the post.

      The following explanation is aimed at other readers that may be interested in your approach.

      First recall that the multiplication of two 64 bit numbers A*B is equivalent to (2^32*A1 + A0) * (2^32*B1 + B0), where X1 and X0 are the 32 top bits and 32 lower bits of X. So the multiplication is actually 2^64*A1*B1 + 2^32*(A1*B0 + A0*B1) + A0*B0. Since we do not take into account overflow, we only have to compute A0*B0 and 2^32*(A1*B0 + A0*B1). The trick here is realising that A0*B0 yields a 64-bit result whose upper half we have to propagate to the upper 32 bits of the result. The result, R, will then be R0 = Lower32(A0*B0) and R1 = Lower32(A1*B0) + Lower32(A0*B1) + Upper32(A0*B0). Here Lower32 and Upper32 mean the lower and upper 32 bits of the 64-bit value.

      umull r0, r1, r2, r4 /* {r0, r1} ← r2 * r4 */

      Recall that r0 will contain the lower 32-bits of the result and r1 will contain the upper 32-bit of the result at the end of the function. At this point r4 is A0 and r2 is B0. After the instruction, if r1 is nonzero means that r2*r4 did not fit in 32-bit. So r0 is Lower32(A0*B0) and r1 is Upper32(A0*B0).

      umlal r1, r5, r2, r5 /* This instruction performs the following operation
      {lower, upper} ← r2 * r5
      r1 ← lower + r1
      r5 ← upper + r5 + C (C=1 if lower + r1 does not fit in 32 bit, C=0 otherwise) */

      At this point r5 is A1 and r2 is B0. We compute A1*B0 giving a 64 bit result {lower, upper}. Note that ‘upper’ (equivalent to Upper32(A1*B0)) is uninteresting because if it is nonzero it would mean that our multiplication does not fit in 64 bit, so the update of r5 is irrelevant for our computation. r5 is just used here because the instruction umlal requires it, and we will discard r5 at the end of the function anyway.

      The lower 32 bits of the multiplication are stored in ‘lower’ (equivalent to Lower32(A1*B0)). But remember that we had to add Upper32(A0*B0). And it happens that as said above r1 is now Upper32(A0*B0). This way at the end of this instruction r1 will be the Lower32(A1*B0) plus Upper32(A0*B0).

      umlal r1, r5, r3, r4

      Like the previous one, but this time we multiply A0*B1. Again r5 is uninteresting here. At the end of the instruction r1 will have the value of Lower32(A0*B1) plus the value it had: Lower32(A1*B0) plus Upper32(A0*B0). So after these three instructions r0 happens to be Lower32(A0*B0) and r1 happens to be Lower32(A0*B1) + Lower32(A1*B0) + Upper32(A0*B0), and this is what we wanted.

  • Errata: addcs instruction doesn’t exist. Use instead adcs

  • I’m sorry. It was my mistake add+cs exists, I have not read the comments

  • I loved this set of tutorials. Thanks for the great work!

    Just a minor correction in the code transformation examples: the explanation on how to transform a “while (E) S;” has an initial “S” inside the if that shouldn’t be there. As it is now, the loop body will never run exactly once (it will run twice if the condition is initially true, without checking the condition between the first two body executions).

  • Very very interesting articles. I learned a lot.

    Is it possible to persuade you to write an article about I/O (eg. asynch ports) and timers and an article about interrupts (eg timer interrupts) ?

    Kristian sander

    • Roger Ferrer Ibáñez says:

      Hi Kristian,

      I’ll check that but given that the examples in the chapters are run on top of an operating system, probably these operations are privileged and only the OS can do them.

      Kind regards,
