Think In Geek

In geek we trust

ARM assembler in Raspberry Pi – Chapter 5


Until now our small assembler programs execute one instruction after the other. If our ARM processor were only able to run this way it would be of limited use. It could not react to existing conditions which may require different sequences of instructions. This is the purpose of the branch instructions.

A special register

In chapter 2 we learnt that our Raspberry Pi ARM processor has 16 integer general purpose registers and we also said that some of them play special roles in our program. I deliberately ignored which registers were special as it was not relevant at that time.

But now it is relevant, at least for register r15. This register is so special that it also has another name: pc. You are unlikely to see it written as r15, since that is confusing (although correct from the point of view of the ARM architecture). From now on we will only use pc to name it.

What does pc stand for? pc means program counter. This name, the origins of which are in the dawn of computing, means little to nothing nowadays. In general the pc register (also called ip, instruction pointer, in other architectures like 386 or x86_64) contains the address of the next instruction to be executed.

When the ARM processor executes an instruction, two things may happen at the end of its execution. If the instruction does not modify pc (and most instructions do not), pc is just incremented by 4 (as if we did add pc, pc, #4). Why 4? Because in ARM, instructions are 32 bits wide, so there are 4 bytes between consecutive instructions. If the instruction modifies pc then the new value for pc is used.

Once the processor has fully executed an instruction, it uses the value in the pc as the address of the next instruction to execute. This way, an instruction that does not modify the pc will be followed by the next contiguous instruction in memory (since pc has been automatically increased by 4). This is called implicit sequencing of instructions: after one has run, usually the next one in memory runs. But if an instruction does modify the pc, for instance to a value other than pc + 4, then the next instruction to run can be a different instruction of the program. This process of changing the value of pc is called branching. In ARM this is done using branch instructions.

Unconditional branches

You can tell the processor to branch unconditionally by using the instruction b (for branch) and a label. Consider the following program.

/* -- branch01.s */
.global main
main:
    mov r0, #2 /* r0 ← 2 */
    b end      /* branch to 'end' */
    mov r0, #3 /* r0 ← 3 (never executed) */
end:
    bx lr

If you execute this program you will see that it returns an error code of 2.

$ ./branch01 ; echo $?
2

What happened is that instruction b end branched (modifying the pc) to the instruction at the label end, which is bx lr, the instruction we run at the end of our program. This way the instruction mov r0, #3 has not actually been run at all (the processor never reached that instruction).

At this point the unconditional branch instruction b may look a bit useless. That is not the case. In fact this instruction is essential in some contexts, in particular when combined with conditional branching. But before we can talk about conditional branching we need to talk about conditions.

Conditional branches

If our processor were only able to branch just because, it would not be very useful. It is much more useful to branch when some condition is met. So a processor should be able to evaluate some sort of conditions.

Before continuing, we need to unveil another register called cpsr (for Current Program Status Register). This register is a bit special and directly modifying it is out of the scope of this chapter. That said, it keeps some values that can be read and updated when executing an instruction. The values of that register include four condition code flags called N (negative), Z (zero), C (carry) and V (overflow). These four condition code flags are usually read by branch instructions. Arithmetic instructions and special testing and comparison instructions can update these condition codes too if requested.

The semantics of these four condition codes in instructions updating the cpsr are roughly the following

  • N will be enabled if the result of the instruction yields a negative number. Disabled otherwise.
  • Z will be enabled if the result of the instruction yields a zero value. Disabled if nonzero.
  • C will be enabled if the result of the instruction yields a value that requires a 33rd bit to be fully represented, for instance an addition that overflows the 32 bit range of unsigned integers. There is a special case for C and subtractions: a subtraction that does not borrow enables C and one that borrows disables it. Subtracting a smaller number from a larger one (or from an equal one) enables C, while subtracting a larger number from a smaller one disables it.
  • V will be enabled if the result of the instruction yields a value that cannot be represented in 32 bits two’s complement.
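As a minimal sketch of the rules above, the following program lets an addition update the flags and then returns them as the error code. It relies on a few things this chapter does not introduce, so treat them as assumptions of the example: the s suffix in adds is how the flag update is requested, mvn loads the bitwise NOT of its operand, mrs copies cpsr into a register (it only reads it), and the four flags are the top four bits of cpsr in the order N, Z, C, V. The file name flags01.s is made up for this example.

```asm
/* -- flags01.s (hypothetical file name) */
.global main
main:
    mov r1, #1           /* r1 ← 1 */
    mvn r2, #0           /* r2 ← 0xFFFFFFFF (bitwise NOT of 0) */
    adds r3, r1, r2      /* r3 ← r1 + r2; the s suffix requests updating N, Z, C and V */
    mrs r0, cpsr         /* r0 ← cpsr (read only, cpsr is not modified) */
    mov r0, r0, lsr #28  /* r0 ← r0 shifted right 28 bits, so N Z C V end up in bits 3 to 0 */
    bx lr
```

The addition 1 + 0xFFFFFFFF needs a 33rd bit and leaves 0 in the low 32 bits, so Z and C should be enabled while N and V stay disabled. Read as the binary number 0110, that is an error code of 6.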

So we have all the needed pieces to perform branches conditionally. But first, let’s start comparing two values. We use the instruction cmp for this purpose.

cmp r1, r2 /* updates cpsr doing "r1 - r2", but r1 and r2 are not modified */

This instruction subtracts the value in the second register from the value in the first register. What could happen in the snippet above? Some examples:

  • If r2 had a value (strictly) greater than r1 then N would be enabled because r1-r2 would yield a negative result.
  • If r1 and r2 had the same value, then Z would be enabled because r1-r2 would be zero.
  • If r1 was 1 and r2 was 0 then r1-r2 would not borrow, so in this case C would be enabled. If the values were swapped (r1 was 0 and r2 was 1) then C would be disabled because the subtraction does borrow.
  • If r1 was 2147483647 (the largest positive integer in 32 bit two’s complement) and r2 was -1 then r1-r2 would be 2147483648 but such number cannot be represented in 32 bit two’s complement, so V would be enabled to signal this.
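The last case can be checked with a short program. This is only a sketch: it assumes that mrs may be used to read cpsr into a register and that N, Z, C and V are the top four bits of cpsr in that order, neither of which is covered in this chapter. The mvn instruction (move the bitwise NOT of the operand) is used just to load the two constants, and the file name cmpflags01.s is made up.

```asm
/* -- cmpflags01.s (hypothetical file name) */
.global main
main:
    mvn r1, #0x80000000  /* r1 ← 0x7FFFFFFF, this is 2147483647 */
    mvn r2, #0           /* r2 ← 0xFFFFFFFF, this is -1 in two's complement */
    cmp r1, r2           /* update the flags with r1 - r2; r1 and r2 are not modified */
    mrs r0, cpsr         /* r0 ← cpsr (read only) */
    mov r0, r0, lsr #28  /* keep only N Z C V, now in bits 3 to 0 of r0 */
    bx lr
```

The 32 bit result of the subtraction is 0x80000000, which looks negative, so N is enabled along with V. Z is disabled, and C is disabled as well because, seen as unsigned numbers, 0x7FFFFFFF is smaller than 0xFFFFFFFF and the subtraction borrows. The error code should therefore be 9 (binary 1001).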

How can we use these flags to represent useful conditions for our programs?

  • EQ (equal) When Z is enabled (Z is 1)
  • NE (not equal). When Z is disabled. (Z is 0)
  • GE (greater than or equal, in two’s complement). When N and V are both enabled or both disabled (N is V)
  • LT (lower than, in two’s complement). This is the opposite of GE, so when exactly one of N and V is enabled (N is not V)
  • GT (greater than, in two’s complement). When Z is disabled and N and V are both enabled or both disabled (Z is 0 and N is V)
  • LE (lower than or equal, in two’s complement). This is the opposite of GT, so when Z is enabled or exactly one of N and V is enabled (Z is 1 or N is not V)
  • MI (minus/negative) When N is enabled (N is 1)
  • PL (plus/positive or zero) When N is disabled (N is 0)
  • VS (overflow set) When V is enabled (V is 1)
  • VC (overflow clear) When V is disabled (V is 0)
  • HI (higher) When C is enabled and Z is disabled (C is 1 and Z is 0)
  • LS (lower or same) When C is disabled or Z is enabled (C is 0 or Z is 1)
  • CS/HS (carry set/higher or same) When C is enabled (C is 1)
  • CC/LO (carry clear/lower) When C is disabled (C is 0)

These conditions can be combined with our b instruction to generate new instructions. For instance, beq will branch only if Z is 1. If the condition of a conditional branch is not met, the branch is ignored and the next instruction is run. It is the programmer’s task to make sure that the condition codes are properly set prior to a conditional branch.
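The list of conditions has two families: GE, LT, GT and LE treat the operands as two’s complement numbers, while HI, LS, HS and LO treat them as unsigned. A sketch of the difference, assuming the made-up file name signed01.s and using mvn (move the bitwise NOT of the operand) only to load the constant:

```asm
/* -- signed01.s (hypothetical file name) */
.global main
main:
    mvn r1, #0         /* r1 ← 0xFFFFFFFF: -1 if read as signed, 4294967295 if unsigned */
    mov r2, #1         /* r2 ← 1 */
    mov r0, #0         /* r0 ← 0 */
    cmp r1, r2         /* update the flags with r1 - r2 */
    bge skip_signed    /* signed: branch if r1 >= r2, not taken because -1 < 1 */
    add r0, r0, #1     /* r0 ← r0 + 1, records that r1 is lower when signed */
skip_signed:
    cmp r1, r2         /* set the flags again right before the next conditional branch */
    bls skip_unsigned  /* unsigned: branch if r1 <= r2, not taken because 4294967295 > 1 */
    add r0, r0, #2     /* r0 ← r0 + 2, records that r1 is higher when unsigned */
skip_unsigned:
    bx lr
```

Neither branch is taken, so both additions run and the error code should be 3: the same bit pattern is lower than 1 under the signed conditions and higher than 1 under the unsigned ones.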

/* -- compare01.s */
.global main
main:
    mov r1, #2       /* r1 ← 2 */
    mov r2, #2       /* r2 ← 2 */
    cmp r1, r2       /* update cpsr condition codes with the value of r1-r2 */
    beq case_equal   /* branch to case_equal only if Z = 1 */
case_different:
    mov r0, #2       /* r0 ← 2 */
    b end            /* branch to end */
case_equal:
    mov r0, #1       /* r0 ← 1 */
end:
    bx lr

If you run this program it will return an error code of 1 because both r1 and r2 have the same value. Now change the first instruction, mov r1, #2, to mov r1, #3 and the returned error code should be 2. Note that in case_different we do not want to run the case_equal instructions, so we have to branch to end (otherwise the error code would always be 1).
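The same two ingredients, a conditional branch and an unconditional one, are enough to build a loop. The following is a sketch rather than part of the original chapter: the file name loop01.s and the labels are made up, and it assumes that cmp also accepts a constant such as #10 as its second operand.

```asm
/* -- loop01.s (hypothetical file name) */
.global main
main:
    mov r0, #0       /* r0 ← 0, the running sum */
    mov r1, #1       /* r1 ← 1, the counter */
loop:
    cmp r1, #10      /* update the flags with r1 - 10 */
    bgt end          /* leave the loop once r1 > 10 */
    add r0, r0, r1   /* r0 ← r0 + r1 */
    add r1, r1, #1   /* r1 ← r1 + 1 */
    b loop           /* unconditionally go back to the comparison */
end:
    bx lr
```

The program adds the numbers from 1 to 10, so it should return an error code of 55. The unconditional b loop is what sends execution back to the comparison after every iteration, and the conditional bgt end is the only way out.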

That’s all for today.

33 thoughts on “ARM assembler in Raspberry Pi – Chapter 5”

  • Hi 🙂

    I thought that you can manipulate the cpsr directly, aren’t the commands MSR and MRS meant for that? (Source: Arm v6 reference manual:

    The text says that direct manipulation of cpsr is not possible.

    Thanks for the great series, I have got so much out of it. I started by studying the arm instructions at and after that your article series adds to that information nicely by showing how to use the commands in real life 🙂

    Keep up the good work 🙂

    • rferrer says:

      Hi Mikael,

      thanks for the comment! 🙂

      You are right. I’ll reword the text, just to make clear that while it can be modified it is out of the scope of this chapter.

      Kind regards,

  • Fernando says:

    Still reading… very nice tutorial, I must say. I think I’m really learning a lot, just by playing with your examples and reading the text.

  • Damien says:

    I tried modifying the program counter directly, but it doesn’t do what I expect:

    mov r0, #1
    add pc, pc, #4
    mov r0, #2
    mov r0, #3
    bx lr

    This gives me 1: it jumped to end directly, while I was expecting that the r0←3 would get executed (I shifted the normal behavior of pc by the length of one instruction only, not two).

    Another hypothesis I had was that since it’s modifying pc, it would not increment it, but I guess only the branching instructions have that behavior…

    Any insights?

    • rferrer says:

      Hi Damien,

      you are experiencing what, in my opinion, is a «quirk» in the ARM architecture (this is: the contract between the CPU designer and the software developer on how the CPU behaves).

      Ideally one would expect, when reading the pc register in an instruction, to have the address of the current instruction.

      Imagine that the instruction add pc, pc, #4 is in address 0x1000. You would expect, at the end of the instruction pc be 0x1004. As usual in ARM, since pc got modified in the instruction, you would not add 4 bytes to it (as in implicit sequencing) but directly jump to 0x1004. So the next instruction run would be the one at the address 0x1004.

      Well, this is where the ARM quirk comes into play. When you read the pc register in an instruction its value is the current instruction plus 8 bytes.

      For instance, the following code,

      mov r1, #0
      current: mov pc, pc
      plus4: add r1, r1, #1
      plus8: add r1, r1, #2
      plus12: add r1, r1, #3

      Here r1 will at the end have the value 5 (2+3) instead of 6 (1+2+3). Why? Because when the instruction mov pc, pc read pc, it did not get the address current but current + 8, which in the example is plus8. Since the instruction does modify pc, the ARM processor does not do pc ← pc + 4 before starting the next instruction but just keeps the pc as is. So by simply updating the pc to itself we were able to skip 1 instruction.

      This is what is happening to you: in the add pc, pc, #4 instruction the pc you read is the address of the instruction mov r0, #3. If you add 4 more bytes to it, you get the address of the bx lr.

      This quirk may be a bit annoying, just remember that when you directly read pc it will always be the current instruction plus 8.

      Is this a problem most of the time? No, if you use labels in your branches, the assembler internally fixes everything for you.

      I cannot explain the reason of this behaviour in ARM. I think this issue has historical roots in the earlier ARM designs where it probably happened that the pc was read at a point in the processor state where it had already been implicitly advanced by 8 bytes. This seems to be a very ARM specific thing (a similar sequence of code like the one above in other architectures would set r1 to 6).

      I hope this answers your question.

      • mirz says:

        AFAIK, this behaviour is due to pipelining, isn’t it?

        • rferrer says:

          I think so.

          My guess is that in earlier (and simpler) iterations of the ARM architecture in the alu stage when you read the pc you were reading the pc of the instruction-after-the-next-one (the value that the physical register had at that stage).

          Probably ARM had to preserve this architectural behaviour in later versions of the architecture, so the quirk remained. Note, though, that there is no technical reason that prevents the alu stage from having the address of the current instruction when reading the operands.

  • Damien says:

    Yup, thanks 🙂

  • Randy says:

    FYI, I think there is an error in this tutorial.

    For BGE, the branch condition is when N=V, not N=Z.

    I looked it up in the ARM documentation when I considered that N=Z should be impossible. Zero has a sign bit of zero, and any calculation that results in zero should not have the N bit set.

    • rferrer says:

      Hi Randy,

      you’re right. I made a typo in GE and then I propagated it to LT.

      I fixed the post. Thanks a lot.

      Kind regards,

  • blahyo says:

    It should be “vs” instead of “os(overflow set)” I believe.

  • Smasher says:

    I think there is a mistake in this chapter. According to ARM Documentation the condition not-equal is NE and not NEQ.

  • Smasher says:

    Hey Roger,

    I noticed this because “bneq end_of_loop” threw an error.

    I am glad to make a small contribution to your great tutorial, which I am going through with excitement and pleasure by learning asm 🙂

    I also hope the subject will stay understandable for me to the last chapter. May I contact you If some questions will remain open after that?


  • Matt Miller says:

    The word subtraction is spelled substraction in this post.

  • John Ganci says:

    In the “Unconditional branches” section, the code sample is named branch01.s. However, the sample execution that follows the code is

    $ ./compare01 ; echo $?

    Should be ./branch01 ; echo $?

  • Luis Ramirez says:

    Hello is my code, help! please:
    mov x2, x0
    mov x3, 2
    mov x1, 128
    sturh w10, [x0] // w10 = color Blue
    add x0, x0, 2
    sub x1, x1, 1
    cbnz x1, loop1
    –>>> here delay loop
    sub x3, x3, 1
    mov x0, x2 // here my problem..
    cbnz x3, loop0

    I want you to paint me a blue line first and then go back to the first value of X0 and repaint that same red line.
    How do I save that variable for later use?

    • Roger Ferrer Ibáñez says:

      Hi Luis,

      I’m not sure I understand, but usually if you need temporary storage and you’ve run out of registers the easy thing to do is to keep it in the stack.

      I talk about the stack in chapter 10. You may want to read up to that chapter at least.

      Kind regards,

  • Is there a typo here?

    If r1 was 2147483648 (the largest positive integer in 32 bit two’s complement) and r1 was -1

    !!! shouldn’t that be “and r2 was -1”? !!!

    then r1-r2 would be 2147483649 but such number cannot be represented in 32 bit two’s complement, so V would be enabled to signal this.

  • patrick says:

    Greetings Roger;

    I have followed the comment thread with interest and it seems to me that a number of mistakes in the original code
    advanced the readers learning. You might keep some code that at least fails assembly as an exercise.

    I am testing your code on a rpi3 with debian

    Thanks great tutorial.

    • Roger Ferrer Ibáñez says:

      Hi Patrick,

      indeed, I obviously try to minimise the mistakes but I may still have missed some. Feel free to let me know if you spot one and I’ll fix it ASAP.

      Kind regards,
