Think In Geek

In geek we trust

# ARM assembler in Raspberry Pi – Chapter 10

In chapter 9 we were introduced to functions and we saw that they have to follow a number of conventions in order to play nice with other functions. We also briefly mentioned the stack, as an area of memory owned solely by the function. In this chapter we will go in depth with the stack and why it is important for functions.

## Dynamic activation

One of the benefits of functions is being able to call them more than once. But that more than once hides a small trap. We are not restricting who will be able to call the function, so it might happen that it is the same function who calls itself. This happens when we use recursion.

A typical example of recursion is the factorial of a number n, usually written as n!. A factorial in C can be written as follows.

```int factorial(int n) { if (n == 0) return 1; else return n * factorial(n-1); }```

Note that there is only one function `factorial`, but it may be called several times. For instance: factorial(3) → factorial(2) → factorial(1) → factorial(0), where → means a «it calls». A function, thus, is dynamically activated each time is called. The span of a dynamic activation goes from the point where the function is called until it returns. At a given time, more than one function is dynamically activated. The whole dynamic activation set of functions includes the current function and the dynamic activation set of the function that called it (the current function).

Ok. We have a function that calls itself. No big deal, right? Well, this would not be a problem if it weren’t for the rules that a function must observe. Let’s quickly recall them.

• Only `r0`, `r1`, `r2` and `r3` can be freely modified.
• `lr` value at the entry of the function must be kept somewhere because we will need it to leave the function (to return to the caller).
• All other registers `r4` to `r11` and `sp` can be modified but they must be restored to their original values upon leaving the function.

In chapter 9 we used a global variable to keep `lr`. But if we attempted to use a global variable in our factorial(3) example, it would be overwritten at the next dynamic activation of factorial. We would only be able to return from factorial(0) to factorial(1). After that we would be stuck in factorial(1), as `lr` would always have the same value.

So it looks like we need some way to keep at least the value of `lr` per each dynamic activation. And not only `lr`, if we wanted to use registers from `r4` to `r11` we also need to keep somehow per each dynamic activation, a global variable would not be enough either. This is where the stack comes into play.

## The stack

In computing, a stack is a data structure (a way to organize data that provides some interesting properties). A stack typically has three operations: access the top of the stack, push onto the top, pop from the top. Dependening on the context you can only access the top of the stack, in our case we will be able to access more elements than just the top.

But, what is the stack? I already said in chaper 9 that the stack is a region of memory owned solely by the function. We can now reword this a bit better: the stack is a region of memory owned solely by the current dynamic activation. And how we control the stack? Well, in chapter 9 we said that the register `sp` stands for stack pointer. This register will contain the top of the stack. The region of memory owned by the dynamic activation is the extent of bytes contained between the current value of `sp` and the initial value that `sp` had at the beginning of the function. We will call that region the local memory of a function (more precisely, of a dynamic activation of it). We will put there whatever has to be saved at the beginning of a function and restored before leaving. We will also keep there the local variables of a function (dynamic activation).

Our function also has to adhere to some rules when handling the stack.

• The stack pointer (`sp`) is always 4 byte aligned. This is absolutely mandatory. However, due to the Procedure Call Standard for the ARM architecture (AAPCS), the stack pointer will have to be 8 byte aligned, otherwise funny things may happen when we call what the AAPCS calls as public interfaces (this is, code written by other people).
• The value of `sp` when leaving the function should be the same value it had upon entering the function.

The first rule is consistent with the alignment constraints of ARM, where most of times addresses must be 4 byte aligned. Due to AAPCS we will stick to the extra 8 byte alignment constraint. The second rule states that, no matter how large is our local memory, it will always disappear at the end of the function. This is important, because local variables of a dynamic activation need not have any storage after that dynamic activation ends.

It is a convention how the stack, and thus the local memory, has its size defined. The stack can grow upwards or downwards. If it grows upwards it means that we have to increase the value of the `sp` register in order to enlarge the local memory. If it grows downwards we have to do the opposite, the value of the `sp` register must be subtracted as many bytes as the size of the local storage. In Linux ARM, the stack grows downwards, towards zero (although it never should reach zero). Addresses of local variables have very large values in the 32 bit range. They are usually close to 232.

Another convention when using the stack concerns whether the `sp` register contains the address of the top of the stack or some bytes above. In Linux ARM the `sp` register directly points to the top of the stack: in the memory addressed by `sp` there is useful information.

Ok, we know the stack grows downwards and the top of the stack must always be in `sp`. So to enlarge the local memory it should be enough by decreasing `sp`. The local memory is then defined by the range of memory from the current `sp` value to the original value that `sp` had at the beginning of the function. One register we almost always have to keep is `lr`. Let’s see how can we keep in the stack.

```sub sp, sp, #8 /* sp ← sp - 8. This enlarges the stack by 8 bytes */ str lr, [sp] /* *sp ← lr */ ... // Code of the function ldr lr, [sp] /* lr ← *sp */ add sp, sp, #8 /* sp ← sp + 8. /* This reduces the stack by 8 bytes effectively restoring the stack pointer to its original value */ bx lr```

A well behaved function may modify sp but must ensure that at the end it has the same value it had when we entered the function. This is what we do here. We first subtract 8 bytes to sp and at the end we add back 8 bytes.

This sequence of instructions would do indeed. But maybe you remember chapter 8 and the indexing modes that you could use in load and store. Note that the first two instructions behave exactly like a preindexing. We first update `sp` and then we use `sp` as the address where we store `lr`. This is exactly a preindex! Likewise for the last two instructions. We first load `lr` using the current address of `sp` and then we decrease `sp`. This is exactly a postindex!

```str lr, [sp, #-8]! /* preindex: sp ← sp - 8; *sp ← lr */ ... // Code of the function ldr lr, [sp], #+8 /* postindex; lr ← *sp; sp ← sp + 8 */ bx lr```

Yes, these addressing modes were invented to support this sort of things. Using a single instruction is better in terms of code size. This may not seem relevant, but it is when we realize that the stack bookkeeping is required in almost every function we write!

## First approach

Let’s implement the factorial function above.

First we have to learn a new instruction to multiply two numbers: `mul Rdest, Rsource1, Rsource2`. Note that multiplying two 32 bit values may require up to 64 bits for the result. This instruction only computes the lower 32 bits. Because we are not going to use 64 bit values in this example, the maximum factorial we will be able to compute is 12! (13! is bigger than 232). We will not check that the entered number is lower than 13 to keep the example simple (I encourage you to add this check to the example, though). In versions of the ARM architecture prior to ARMv6 this instruction could not have `Rdest` the same as `Rsource1`. GNU assembler may print a warning if you don’t pass `-march=armv6`.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 /* -- factorial01.s */ .data   message1: .asciz "Type a number: " format: .asciz "%d" message2: .asciz "The factorial of %d is %d\n"   .text   factorial: str lr, [sp,#-4]! /* Push lr onto the top of the stack */ str r0, [sp,#-4]! /* Push r0 onto the top of the stack */ /* Note that after that, sp is 8 byte aligned */ cmp r0, #0 /* compare r0 and 0 */ bne is_nonzero /* if r0 != 0 then branch */ mov r0, #1 /* r0 ← 1. This is the return */ b end is_nonzero: /* Prepare the call to factorial(n-1) */ sub r0, r0, #1 /* r0 ← r0 - 1 */ bl factorial /* After the call r0 contains factorial(n-1) */ /* Load r0 (that we kept in th stack) into r1 */ ldr r1, [sp] /* r1 ← *sp */ mul r0, r0, r1 /* r0 ← r0 * r1 */   end: add sp, sp, #+4 /* Discard the r0 we kept in the stack */ ldr lr, [sp], #+4 /* Pop the top of the stack and put it in lr */ bx lr /* Leave factorial */   .global main main: str lr, [sp,#-4]! /* Push lr onto the top of the stack */ sub sp, sp, #4 /* Make room for one 4 byte integer in the stack */ /* In these 4 bytes we will keep the number */ /* entered by the user */ /* Note that after that the stack is 8-byte aligned */ ldr r0, address_of_message1 /* Set &message1 as the first parameter of printf */ bl printf /* Call printf */   ldr r0, address_of_format /* Set &format as the first parameter of scanf */ mov r1, sp /* Set the top of the stack as the second parameter */ /* of scanf */ bl scanf /* Call scanf */   ldr r0, [sp] /* Load the integer read by scanf into r0 */ /* So we set it as the first parameter of factorial */ bl factorial /* Call factorial */   mov r2, r0 /* Get the result of factorial and move it to r2 */ /* So we set it as the third parameter of printf */ ldr r1, [sp] /* Load the integer read by scanf into r1 */ /* So we set it as the second parameter of printf */ ldr r0, address_of_message2 /* Set &message2 as the first parameter of printf */ bl printf /* Call printf */     add sp, sp, #+4 /* Discard the integer read by scanf */ ldr lr, [sp], #+4 /* Pop the top of the stack and put it in lr */ bx lr /* Leave main */   address_of_message1: .word message1 address_of_message2: .word message2 address_of_format: .word format```

Most of the code is pretty straightforward. In both functions, `main` and `factorial`, we allocate 4 extra bytes on the top of the stack. In `factorial`, to keep the value of `r0`, because it will be overwritten during the recursive call (twice, as a first parameter and as the result of the recursive function call). In `main`, to keep the value entered by the user (if you recall chapter 9 we used a global variable here).

It is important to bear in mind that the stack, like a real stack, the last element stacked (pushed onto the top) will be the first one to be taken out the stack (popped from the top). We store `lr` and make room for a 4 bytes integer. Since this is a stack, the opposite order must be used to return the stack to its original state. We first discard the integer and then we restore the `lr`. Note that this happens as well when we reserve the stack storage for the integer using a `sub` and then we discard such storage doing the opposite operation `add`.

## Can we do it better?

Note that the number of instructions that we need to push and pop data to and from the stack grows linearly with respect to the number of data items. Since ARM was designed for embedded systems, ARM designers devised a way to reduce the number of instructions we need for the «bookkeeping» of the stack. These instructions are load multiple, `ldm`, and store multiple, `stm`.

These two instructions are rather powerful and allow in a single instruction perform a lot of things. Their syntax is shown as follows. Elements enclosed in curly braces `{` and `}` may be omitted from the syntax (the effect of the instruction will vary, though).

```ldm addressing-mode Rbase{!}, register-set
```

We will consider `addressing-mode` later. `Rbase` is the base address used to load to or store from the `register-set`. All 16 ARM registers may be specified in `register-set` (except `pc` in `stm`). A set of addresses is generated when executing these instructions. One address per register in the register-set. Then, each register, in ascending order, is paired with each of these addresses, also in ascending order. This way the lowest-numbered register gets the lowest memory address, and the highest-numbered register gets the highest memory address. Each pair register-address is then used to perform the memory operation: load or store. Specifying `!` means that `Rbase` will be updated. The updated value depends on `addressing-mode`.

Note that, if the registers are paired with addresses depending on their register number, it seems that they will always be loaded and stored in the same way. For instance a `register-set` containing `r4`, `r5` and `r6` will always store `r4` in the lowest address generated by the instruction and `r6` in the highest one. We can, though, specify what is considered the lowest address or the highest address. So, is `Rbase` actually the highest address or the lowest address of the multiple load/store? This is one of the two aspects that is controlled by `addressing-mode`. The second aspect relates to when the address of the memory operation changes between each memory operation.

If the value in `Rbase` is to be considered the the highest address it means that we should first decrease `Rbase` as many bytes as required by the number of registers in the `register-set` (this is 4 times the number of registers) to form the lowest address. Then we can load or store each register consecutively starting from that lowest address, always in ascending order of the register number. This addressing mode is called decreasing and is specified using a `d`. Conversely, if `Rbase` is to be considered the lowest address, then this is a bit easier as we can use its value as the lowest address already. We proceed as usual, loading or storing each register in ascending order of their register number. This addressing mode is called increasing and is specified using an `i`.

At each load or store, the address generated for the memory operation may be updated after or before the memory operation itself. We can specify this using `a` or `b`, respectively.

If we specify `!`, after the instruction, `Rbase` will have the highest address generated in the increasing mode and the lowest address generated in the decreasing mode. The final value of `Rbase` will include the final addition or subtraction if we use a mode that updates after (an `a` mode).

So we have four addressing modes, namely: `ia`, `ib`, `da` and `db`. These addressing modes are specified as suffixes of the `stm` and `ldm` instructions. So the full set of names is `stmia`, `stmib`, `stmda`, `stmdb`, `ldmia`, `ldmib`, `ldmda`, `ldmdb`. Now you may think that this is overly complicated, but we need not use all the eight modes. Only two of them are of interest to us now.

When we push something onto the stack we actually decrease the stack pointer (because in Linux the stack grows downwards). More precisely, we first decrease the stack pointer as many bytes as needed before doing the actual store on that just computed stack pointer. So the appropiate `addressing-mode` when pushing onto the stack is `stmdb`. Conversely when popping from the stack we will use `ldmia`: we increment the stack pointer after we have performed the load.

## Factorial again

Before illustrating these two instructions, we will first slightly rewrite our factorial.

If you go back to the code of our factorial, there is a moment, when computing `n * factorial(n-1)`, where the initial value of `r0` is required. The value of `n` was in `r0` at the beginning of the function, but `r0` can be freely modified by called functions. We chose, in the example above, to keep a copy of `r0` in the stack in line 12. Later, in line 24, we loaded it from the stack in `r1`, just before computing the multiplication.

In our second version of factorial, we will keep a copy of the initial value of `r0` into `r4`. But `r4` is a register the value of which must be restored upon leaving a function. So we will keep the value of `r4` at the entry of the function in the stack. At the end we will restore it back from the stack. This way we can use `r4` without breaking the rules of well-behaved functions.

```10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 factorial: str lr, [sp,#-4]! /* Push lr onto the top of the stack */ str r4, [sp,#-4]! /* Push r4 onto the top of the stack */ /* The stack is now 8 byte aligned */ mov r4, r0 /* Keep a copy of the initial value of r0 in r4 */     cmp r0, #0 /* compare r0 and 0 */ bne is_nonzero /* if r0 != 0 then branch */ mov r0, #1 /* r0 ← 1. This is the return */ b end is_nonzero: /* Prepare the call to factorial(n-1) */ sub r0, r0, #1 /* r0 ← r0 - 1 */ bl factorial /* After the call r0 contains factorial(n-1) */ /* Load initial value of r0 (that we kept in r4) into r1 */ mov r1, r4 /* r1 ← r4 */ mul r0, r0, r1 /* r0 ← r0 * r1 */   end: ldr r4, [sp], #+4 /* Pop the top of the stack and put it in r4 */ ldr lr, [sp], #+4 /* Pop the top of the stack and put it in lr */ bx lr /* Leave factorial */```

Note that the remainder of the program does not have to change. This is the cool thing of functions 🙂

Ok, now pay attention to these two sequences in our new factorial version above.

```11 12 str lr, [sp,#-4]! /* Push lr onto the top of the stack */ str r4, [sp,#-4]! /* Push r4 onto the top of the stack */```
```30 31 ldr r4, [sp], #+4 /* Pop the top of the stack and put it in r4 */ ldr lr, [sp], #+4 /* Pop the top of the stack and put it in lr */```

Now, let’s replace them with `stmdb` and `ldmia` as explained a few paragraphs ago.

```11 stmdb sp!, {r4, lr} /* Push r4 and lr onto the stack */```
```30 ldmia sp!, {r4, lr} /* Pop lr and r4 from the stack */```

Note that the order of the registers in the set of registers is not relevant, but the processor will handle them in ascending order, so we should write them in ascending order. GNU assembler will emit a warning otherwise. Since `lr` is actually `r14` it must go after `r4`. This means that our code is 100% equivalent to the previous one since `r4` will end in a lower address than `lr`: remember our stack grows toward lower addresses, thus `r4` which is in the top of the stack in `factorial` has the lowest address.

Remembering `stmdb sp!` and `ldmia sp!` may be a bit hard. Also, given that these two instructions will be relatively common when entering and leaving functions, GNU assembler provides two mnemonics `push` and `pop` for `stmdb sp!` and `ldmia sp!`, respectively. Note that these are not ARM instructions actually, just convenience names that are easier to remember.

```11 push {r4, lr}```
```30 pop {r4, lr}```

That’s all for today.

### 27 thoughts on “ARM assembler in Raspberry Pi – Chapter 10”

• very nice code … I follow it …

• Fernando says:

While disassembling C code, I found some “fp” register. Reading a bit, seems to be a pointer to the stack frame, and although it seems usefull, I can’t find any info on the AAPCS, so it looks like semi-standard?

Any insight on this subject will be appreciated.

• rferrer says:

The GCC compiler has used traditionally different names for registers r10, r11 and r12. They do not seem to be in any ARM document so I guess it is just some GCC own convention.

Register `r10` is also called `sl`. It is used for PIC code (Position Independent Code), a specific code shape required when generating shared libraries (usually libraries like libname.so). This is a rather advanced thing which would require several chapters to explain.

r11 is called `fp`. This register is used to keep the stack frame (also called the stack base or base pointer). What is the stack frame? The stack frame is just a register that keeps the value that `sp` got at the beginning. Since r11 must be restored, the previous value of r11 must be first kept in the stack, so r11 will not contain the address of the stack at the beginning of the function but actually contains `sp0 - 4` where `sp0` means the value of `sp` at the beginning of the function.

This register is only required to index local variables in the stack itself when the amount of stack is not a constant value (imagine the amount local data depends on a parameter of the function). Otherwise we should be able to index all the data through `sp` because we always should know how many bytes we have decreased sp.

This is an example where a stack-frame (like `fp`) is required.
``` int f(int n) {    /* The length of a[n] is only known at runtime */    int a[n];    int i, s;    ... } ```

If the array `a` was declared with a fixed amount of elements (like `int a[10]`) then from the top of the stack we could know where `a`, `i` and `s`. If a has a nonconstant size we could also compute the addresses of these variables but it would be much more complicated (in fact, it is complicated: imagine if there were more than one variable-length array). So we keep the address of sp (actually, sp – 4) and we use it to index these variable-length stuff (we would probably keep also the actual size of the arrays in another local variable).

Finally, r12 is also called `ip` and seems to be used by the linker when the called routine is so far (in bytes distance) that it requires an intermediate function.

• Fernando says:

First of all, thanks for your quick and detailed explanation. I’m really enjoying this post of yours, really really interesting and useful.

Next: I saw in this PPT from Intel (page 6) what it seems to be a “standard frame”, which contains all the registers. It does not fit with what you say (which, by the way, makes total sense for me), so:
– Shall I expect a frame to be like that, or it does not exist such concept as “standard frame”, which holds every register.

In the other hand, I was wondering: why GCC creates frames even if I do not have a variable-length stack? The following code generates a frame when compiling with GCC (both for ‘main’ and ‘f’):
void main(void){
int i, j;
i = 1;
j = f(i);
}

int f(int i){
return 2*i;
}

Also, that code generates some STR instructions, which I see unnecessary as there is plenty register-space for those simple operations (Also, GCC code is at least 2-3 times bigger than hand-crafted code ¿?).

• rferrer says:

I’ll first answer the GCC thing. GCC by default does not produce optimized code, so it emits very simple-minded code which is known to be correct but can be largely improved. This is why you are seing dumb code being emitted. You can request GCC to emit better code by enabling optimization. You do that by using the flag `-O`. GCC allows up to 3 optimization levels (higher levels are downgraded to `-O3` always) but uually `-O2` is a good default. You should see that the code is much better. GCC is known for not being great in the ARM architecture, but the code should definitely look much better to you.

Regarding to the slides you link, they cover a topic on exceptional situations. In these cases a recovery function is invoked by the processor and then it changes its mode to a special one. ARM processors have 7 modes for exceptions, depending on the nature of the exceptional situation being handled. In these modes some registers are banked, meaning that you have a different register than the one you usually have in normal mode. For instance, there is a mode called Fast Interrupt Request (FIQ) where registers `r8` to `r14` are actually `r8_fiq` to `r14_fiq`, modifying `r8_fiq` would not affect the `r8` of your program (modifying `r2` would, because it is not banked). In these modes, the processor sets the `lr` register to the instruction (or close to, depending on the exception) that was going to be run (or actually being run) when the exception happened. This way, once we have handled the exceptional situation, we can return back to the user code.

As far as I know, there is nothing in the architecture that forces you to use a frame pointer, even in these exceptional cases. So, maybe the slides just relate to a convention followed in that particular environment. That said, we know that in some situations the `fp` is really handy. Given its non-mandatory nature, the user (usually the compiler) may choose to use `fp` at its own discretion as long as its original value is preserved at the exit of the function.

• Fernando says:

Well, thanks again for your detailed explanation.

I knew that GCC usually generates non-optimized code, but well, now I know exactly how it looks like.

By the way, are you planning on continuing with this ARM RPI ASM series? I’d love to see something more about exception handling, MMU, timer, and in general, some more interaction with the existing hardware.

• rferrer says:

I’m committed to continue these series but not on the topics you mention since those topics mean entering the realm of operating systems.

Kind regards.

• Denz says:

Thanks for your the best articles!
I have question: how to return structure in assembler ( apcs-linux ) from routine? Please, show example.

Den

• rferrer says:

Hi,

there will be an example of returning a struct in Chapter 18 (still under preparation at the time of answering your comment).

Kind regards,

• Yeneros says:

Hi rferrer,

Excellent Tutorials! I am thoroughly enjoying them, Thank you so much for your great work

Just a note for anyone else doing this on the Raspberry Pi and encountering difficulty with the multiplication instruction as follows:

[email protected] ~/Assembly Tutorials/Lesson 10 \$ as -o factoral01.o factoral01.s
factoral01.s: Assembler messages:
factoral01.s:25: Rd and Rm should be different in mul

mul r0, r0, r1 /* r0 ← r0 * r1 */

Apparently since ARMv6 this now works but on earlier systems you get that compiler error which you can fix by simply reversing the order of the last two operands:

mul r0, r1, r0 /* r0 ← r1 * r0 */

Cheers.

• Roger Ferrer Ibáñez says:

Hi Yeneros,

yes, you are right that in earlier versions of the ARM architecture this was not allowed. You can also address this problem by passing the flag `-march=armv6` to the assembler invocation.

That said, your technique is easier and just requires swapping the operands. Thanks for the tip!

Kind regards,

• Linus says:

Hi Roger,

thank you for the easy to understand tutorials. I was wondering about one thing, is it required to move the value of r4 to r1 and then multiplying r0 and r1 in the factorial example program?
mov r1, r4 /* r1 ← r4 */
mul r0, r0, r1 /* r0 ← r0 * r1 */
Can’t you do the following or am I missing something?
mul r0, r0, r4 /* r0 ← r0 * r4 */

Thanks again,

Linus

• Roger Ferrer Ibáñez says:

Hi Linus,

yes, you can do this. Likely I left this unnecesary `mov` when I took the first version and modified to write the second one.

Thank you!

• tieugiaqb says:

thanks for your tutorials. it’s exellent series

• Roger Ferrer Ibáñez says:

Thanks!

• Andrew McConnell says:

Great course!
Just note:
.globl main should be .global main

• Roger Ferrer Ibáñez says:

Hi Andrew,

thanks for the heads up. While accepts both forms better I use `.global` as it is more readable. I have updated this post but I’m afraid I may have used `.globl` in other chapters.

Kind regards

• David says:

Hi Roger,

Awesome tutorial.

But I’m finding hard to understand the following concept:

factorial:
str lr, [sp,#-4]!
str r0, [sp,#-4]!
cmp r0, #0
bne is_nonzero
mov r0, #1
b end
is_nonzero:

sub r0, r0, #1
bl factorial

ldr r1, [sp]
mul r0, r0, r1
end:
ldr lr, [sp], #+4
bx lr

How is the program going to run the following lines:
ldr r1, [sp]
mul r0, r0, r1

Since after the instruction, bne is_nonzero, the instruction is branched into end function.

• Roger Ferrer Ibáñez says:

Hi David,

the first thing the function factorial does is comparing r0 with 0 (`cmp r0, #0`). This updates the `cpsr` register which is later used in the `bne` instruction.

`bne` means “branch if not equal”. In this case it means “branch only if r0 is different to 0”:

• When r0 is different to 0 the program branches to the destination in the branch, this is, it “jumps” over all the instructions that there are between the branch itself and the label. So when r0 is not zero the instructions `mov r0, #1` and `b end` are not executed at all.
• When `r0` is equals to 0 then the `bne` instruction is ignored (no branch happens), so it continues running the next instructiona as usual. So it executes `mov r0, #1` and `b end`.

Kind regards,
Roger

• David says:

Hi Roger,

But I was asking about the following instructions after the factorial function call :

ldr r1, [sp]
mul r0, r0, r1

When will these get executed? and also, is the multiplication in loop?

I was confused a little bit. Can you help me on this?

• Roger Ferrer Ibáñez says:

• Jason says:

Hi Roger,

I have the same question as David phrased.

Regarding these two lines of code in the “is_nonzero” function:

ldr r1, [sp]
mul r0, r0, r1

When will these two lines of code be executed? Should the multiplication be in the loop?

• Roger Ferrer Ibáñez says:

Hi Jason,

they are executed after the call to factorial above (the `bl` instruction) ends.

For instance, to compute `factorial(5)` we first call `factorial(4)` which will have left the correct result in `r0` and then we multiply it by 5.

Kind regards

• Jason says:

Hi Roger,

Thank you for replying to my message. From the code, I understand that the two instructions:

ldr r1, [sp]
mul r0, r0, r1

will only be executed after the ‘bl factorial’ instruction completes.

I’m still struggling to understand how the code works. Two major points:
1. After the “factorial:” label in line 10 of the code, there are two instructions to PUSH data onto the stack. If we are computing factorial (5), then we will PUSH data to the stack 10 times. However, there is only two POP instructions at the end of the code, before factorial completes on lines 59 and 60 of the code. Should there be another two POP instructions per recursion?

2. The MULT operation is only performed after factorial completes. Shouldn’t the MULT operation also be perform inside the factorial function as well?

Let’s take an example: Factorial (5). The code will call the factorial function five times and there will be ten PUSH instructions to be stack. However, before the factorial function completes, there are only two POP instructions at the end of the code. Something tells me that there are two POP instructions missing per recursion of the code?

In calculating factorial(5), if the MULT is only outside of the loop, then how is factorial(4) computed without using a MULT operation?

• Roger Ferrer Ibáñez says:

Hi Jason,

lines 59 and 60 belong to the `main` function. The `main` function is not recursive. `main` calls `factorial` and it is the latter the one calling itself, recursively. If a function pushes data onto the stack it has to pop that data before exiting. Because the function `factorial` spans from lines 10 to 30 all the stack handling will happen in these lines. Lines 59 and 60 belong to the `main` function and they concern to the stack handling that happened in lines 34 and 35 (again, of the main function). Unfortunately, labels do not help to demarcate functions in assembly so that might be confusing you.

As an example let’s trace `factorial(2)` (`factorial(5)` should be the same but with more recursive calls and a deeper stack, I recommend you to do the exercise). I’m showing each activation of factorial with an extra bullet nesting level.

• Our program starts in main (line 34), we read a number via `scanf` and we put the value in `r0` (line 47) and then we call `factorial` (line 48). This is calling `factorial(2)`. Assume here the stack is empty `[]`. Really it isn’t empty, but for what it matters to factorial it is. As I said above, each activation of a function (`main` vs `factorial`) handles its own stack every time they are called. Here activation meaning every time we do `bl` to the function.
• Now our program is in line 11. Recall `r0` is 2.
• We push `lr` and `r0` onto the stack. Our stack (from top to down) is now `[2, address of line 51]` these are `r0` and `lr` respectively (lines 11-12)
• Recall, `r0` is 2, we compare it with 0 (line 14).
• Then we branch if it is non-zero to `is_nonzero`, so we end at line 20. (line 15)
• Now we subtract 1 to `r0`, which leaves us `r0` with 1. Note: the top of the stack still has the “previous” `r0` that was 2. (line 20)
• Now we call `factorial` again but this time `r0` is 1, so this is `factorial(1)` (line 21)
• Now we go again to line 11. Note that now `r0` is 1.
• We push `lr` and `r0` onto the stack. Our stack (from top to down) is now `[1, address of line 24, 1, address of line 51]` (lines 11-12)
• Recall, `r0` is 1, we compare it with 0 (line 14).
• Then we branch if it is non-zero to `is_nonzero`, so we end at line 20. (line 15)
• Now we subtract 1 to `r0`, which leaves us `r0` with 0. Note: the top of the stack still has the “previous” `r0` that was 1.
• Now we call factorial again but this time `r0` is 0, so this is `factorial(0)` (line 21)
• Now we go again to line 11. Note that now `r0` is 0.
• We push `lr` and `r0` onto the stack. Our stack (from top to down) is now [0, address of line 24, 1, address of line 24, 2, address of line 51] (lines 11-12)
• Recall, `r0` is 0, we compare it with 0 (line 14).
• We do not branch because `r0` is zero, so we fall-through at line 16 (line 15)
• We set `r0` to be 1 (line 16)
• We (unconditionally) branch to end, to line 28 (line 17)
• We do `sp ← sp + 4`, this means that we have popped the top of our stack that now looks like `[address of line 24, 1, address of line 24, 2, address of line 51]` (line 28)
• We read the top of the stack (so lr is now address of line 24) and then we do sp ← sp + 4. Now our stack looks like `[1, address of line 24, 2, address of line 51]` (line 29)
• We execute `bx lr` which takes us to line 24. (line 30)
• Now we load the top of the stack into `r1`. So `r1` is now 1. Note that the stack is left alone, just read, not popped. So it is still `[1, address of line 24, 2, address of line 51]` (line 24)
• We do `r0 * r1`. Recall `r0` is 1 because we set it above in line 16. `r1` is 1 because we have just loaded the top of the stack which was 1. The result is 1 and it is stored in `r0`. (line 25)
• Note that now we fall-through to end, to line 28.
• We do `sp ← sp + 4`, this means that we have popped the top of our stack that now looks like `[address of line 24, 2, address of line 51]` (line 28)
• We read the top of the stack (so `lr` is now address of line 24) and then we do `sp ← sp + 4`. Now our stack looks like `[2, address of line 51]` (line 29)
• We execute `bx lr` which takes us to line 24. (line 30)
• Now we load the top of the stack into `r1`. So `r1` is now 2. Note that the stack is left alone, just read, not popped. So it is still `[2, address of line 51]` (line 24)
• We do `r0 * r1`. Recall `r0` is 1 because we set it above in line 25. `r1` is 2 because we have just loaded the top of the stack which was 2. The result is 2 and it is stored in `r0`. (line 25)
• Note that now we fall through to end, to line 28.
• We do `sp ← sp + 4`, this means that we have popped the top of our stack that now looks like `[address of line 51]` (line 28)
• We read the top of the stack (so `lr` is now address of line 24) and then we do `sp ← sp + 4`. Now our stack looks like `[]` (line 29)
• We execute `bx lr` which takes us to line 51. (line 30)
• And now we are back the main function in line 51.

Hope this time I’m clearer. Do not hesitate to ask.

Kind regards, Roger

• Jason says:

Thank you so much. This clarifies everything perfectly. Apologies for my confusion earlier.

• Roger Ferrer Ibáñez says:

Good to know, thanks for your patience!

This site uses Akismet to reduce spam. Learn how your comment data is processed.