ARM assembler in Raspberry Pi

Today we will continue with nested functions.

Sorting

We will first take a detour. The C function qsort can be used to sort any kind of array. Its C signature is the following.

void qsort(void *base,
     size_t nmemb,
     size_t size,
     int (*compar)(const void *, const void *));

qsort returns void, this is, it does not return anything because it performs the sort in place. This means that we will pass a (potentially unsorted) array called base of length nmemb to qsort. When qsort returns, the elements in this array will be sorted. If qsort were able to just sort a specific kind of arrays it would be rather limited. In order to be able to sort any array, qsort requires the size of each element in the array. Note that the array is passed by reference (otherwise the in place sorting would not be possible): void* is the C way to say «I accept an address to any kind of data».

We will come back later to the compar bit of qsort.

Print an array

Before we sort an array, we need a way to examine it. We will use for that a function print_array that prints an array of integers.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
/* print-array.s */

.data

/* declare an array of 10 integers called my_array */
.align 4
my_array: .word 82, 70, 93, 77, 91, 30, 42, 6, 92, 64

/* format strings for printf */
/* format string that prints an integer plus a space */
.align 4
integer_printf: .asciz "%d "
/* format string that simply prints a newline */
.align 4
newline_printf: .asciz "\n"

.text

print_array:
    /* r0 will be the address of the integer array */
    /* r1 will be the number of items in the array */
    push {r4, r5, r6, lr}  /* keep r4, r5, r6 and lr in the stack */

    mov r4, r0             /* r4 ← r0. keep the address of the array */
    mov r5, r1             /* r5 ← r1. keep the number of items */
    mov r6, #0             /* r6 ← 0.  current item to print */

    b .Lprint_array_check_loop /* go to the condition check of the loop */

    .Lprint_array_loop:
      /* prepare the call to printf */
      ldr r0, addr_of_integer_printf  /* r0 ← &integer_printf */
      /* load the item r6 in the array in address r4.
         elements are of size 4 bytes so we need to multiply r6 by 4 */
      ldr r1, [r4, +r6, LSL #2]       /* r1 ← *(r4 + r6 << 2)
                                         this is the same as
                                         r1 ← *(r4 + r6 * 4) */
      bl printf                       /* call printf */

      add r6, r6, #1                  /* r6 ← r6 + 1 */
    .Lprint_array_check_loop: 
      cmp r6, r5               /* perform r6 - r5 and update cpsr */
      bne .Lprint_array_loop   /* if cpsr states that r6 is not equal to r5
                                  branch to the body of the loop */

    /* prepare call to printf */
    ldr r0, addr_of_newline_printf /* r0 ← &newline_printf */
    bl printf
    
    pop {r4, r5, r6, lr}   /* restore r4, r5, r6 and lr from the stack */
    bx lr                  /* return */

addr_of_integer_printf: .word integer_printf
addr_of_newline_printf: .word newline_printf

.globl main
main:
    push {r4, lr}             /* keep r4 and lr in the stack */

    /* prepare call to print_array */
    ldr r0, addr_of_my_array  /* r0 ← &my_array */
    mov r1, #10               /* r1 ← 10
                                 our array is of length 10 */
    bl print_array            /* call print_array */

    mov r0, #0                /* r0 ← 0 set errorcode to 0 prior returning from main */
    pop {r4, lr}              /* restore r4 and lr in the stack */
    bx lr                     /* return */

addr_of_my_array: .word my_array

The code above is pretty straightforward and it does not feature anything that has not been seen in previous installments. Running it simply prints the current contents of the array.

$ ./print-array 
82 70 93 77 91 30 42 6 92 64

Comparison

Above, when we talked about qsort we skipped the compar parameter. What is compar? It is a an address to a function. The funky syntax for C tells us that this function, if it is ever called, will be passed two addreses (again, it does not care what they are, so they are void*) and returns an integer. The manual of qsort explains that this function has to return lower than zero, zero or greater than zero. If the object in the address of the first parameter of compar is lower than the object in the address of the second parameter, then it has to return lower than zero. If they are equal, it should return zero. If the first object is greater than the second, then it should return greater than zero.

If you wonder why the parameters of compar are actually const void* rather than void*, it is the C way of telling us that the data of the referenced objects cannot change during the comparison. This may sound obvious given that changing things is not the job of a comparison function. Passing them by reference would let us to change them. So this is reminder that we should not.

Since our array is an array of integers we will have to compare integers: let's write a function that, given two pointers to integers (i.e. addresses) behaves as stated above.

19
20
21
22
23
24
25
26
27
28
29
30
31
integer_comparison:
    /* r0 will be the address to the first integer */
    /* r1 will be the address to the second integer */
    ldr r0, [r0]    /* r0 ← *r0
                       load the integer pointed by r0 in r0 */
    ldr r1, [r1]    /* r1 ← *r1
                       load the integer pointed by r1 in r1 */

    cmp r0, r1      /* compute r0 - r1 and update cpsr */
    moveq r0, #0    /* if cpsr means that r0 == r1 then r0 ←  0 */
    movlt r0, #-1   /* if cpsr means that r0 <  r1 then r0 ← -1 */
    movgt r0, #1    /* if cpsr means that r0 >  r1 then r0 ←  1 */
    bx lr           /* return */

Function integer_comparison does not feature anything new either: it simply avoids branches by using predication as we saw in chapter 11.

Now we have the last missing bit to be able to call qsort. Here is a program that prints (only main is shown) the array twice, before sorting it and after sorting it.

66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
.globl main
main:
    push {r4, lr}             /* keep r4 and lr in the stack */

    /* prepare call to print_array */
    ldr r0, addr_of_my_array  /* r0 ← &my_array */
    mov r1, #10               /* r1 ← 10
                                 our array is of length 10 */
    bl print_array            /* call print_array */

    /* prepare call to qsort */
    /*
    void qsort(void *base,
         size_t nmemb,
         size_t size,
         int (*compar)(const void *, const void *));
    */
    ldr r0, addr_of_my_array  /* r0 ← &my_array
                                 base */
    mov r1, #10               /* r1 ← 10
                                 nmemb = number of members
                                 our array is 10 elements long */
    mov r2, #4                /* r2 ← 4
                                 size of each member is 4 bytes */
    ldr r3, addr_of_integer_comparison
                              /* r3 ← &integer_comparison
                                 compar */
    bl qsort                  /* call qsort */

    /* now print again to see if elements were sorted */
    /* prepare call to print_array */
    ldr r0, addr_of_my_array  /* r0 ← &my_array */
    mov r1, #10               /* r1 ← 10
                                 our array is of length 10 */
    bl print_array            /* call print_array */

    mov r0, #0                /* r0 ← 0 set errorcode to 0 prior returning from main */
    pop {r4, lr}              /* restore r4 and lr in the stack */
    bx lr                     /* return */

addr_of_my_array: .word my_array
addr_of_integer_comparison : .word integer_comparison

If we put everything together, we can verify that our array is effectively sorted after the call to the qsort function.

$ ./sort-array 
82 70 93 77 91 30 42 6 92 64 
6 30 42 64 70 77 82 91 92 93

What is going on?

C function qsort implements a sorting algorithm (the C Standard does not specify which one must be but it is usually a fine-tuned version of quicksort) which at some point will require to compare two elements. To do this, qsort calls the function compar.

Count how many comparisons happen

Now, we want to count how many comparisons (i.e., how many times integer_comparison is called) when sorting the array. We could change integer_comparison so it increments a global counter.

.data
global_counter: .word 0

.text
integer_comparison_count_global:
    /* r0 will be the address to the first integer */
    /* r1 will be the address to the second integer */
    push {r4, r5}   /* keep callee-saved registers */
    ldr r0, [r0]    /* r0 ← *r0
                       load the integer pointed by r0 in r0 */
    ldr r1, [r1]    /* r1 ← *r1
                       load the integer pointed by r1 in r1 */

    cmp r0, r1      /* compute r0 - r1 and update cpsr */
    moveq r0, #0    /* if cpsr means that r0 == r1 then r0 ←  0 */
    movlt r0, #-1   /* if cpsr means that r0 <  r1 then r0 ← -1 */
    movgt r0, #1    /* if cpsr means that r0 >  r1 then r0 ←  1 */

    ldr r4, addr_of_global_counter /* r4 ← &global_counter */
    ldr r5, [r4]    /* r5 ← *r4 */ 
    add r5, r5, #1  /* r5 ← r5 + 1 */
    str r5, [r4]    /* *r4 ← r5 */

    pop {r4, r5}    /* restore callee-saved registers */
    bx lr           /* return */
addr_of_global_counter: .word global_counter

But this post is about nested functions so we will use nested functions. Recall that nested functions can access local variables of their enclosing functions. So we will use a local variable of main as the counter and a nested function (of main) that performs the comparison and updates the counter.

In the last chapter we ended with a short discussion about nested functions. A downside of nested functions it that a pointer to a nested function requires two things: the address of the function and the lexical scope. If you check again the previous example where we call qsort, you will see that we do not mention anywhere the lexical scope. And there is a reason for that, it is not possible to pass it to qsort. In C, functions cannot be nested so a pointer to a function can just be the address of the function.

Trampoline

We will continue using the convention of the last chapter: r10 will have the lexical scope upon the entry of the function. But qsort, when calls integer_compare_count will not set it for us: we cannout count on r10 having a meaningful value when called from qsort. This means that qsort should actually call something that first sets r10 with the right value and then jumps to integer_compare_count. We will call this ancillary code (or pseudofunction) a trampoline. The technique used here is similar to the one used by GCC described in Lexical Closures for C++ (Thomas M. Breuel, USENIX C++ Conference Proceedings, October 17-21, 1988).

The trampoline is a small, always the same, sequence of instructions that behaves like a function and its only purpose is to set r10 and then make an indirect call to the nested function. Since the sequence of instructions is always the same, the instructions themselves look like a template.

174
175
176
177
178
179
180
181
182
183
.Laddr_trampoline_template : .word .Ltrampoline_template /* we will use this below */
.Ltrampoline_template:
    .Lfunction_called: .word 0x0
    .Llexical_scope: .word 0x0
    push {r4, r5, r10, lr}           /* keep callee-saved registers */
    ldr r4, .Lfunction_called        /* r4 ← function called */
    ldr r10, .Llexical_scope         /* r10 ← lexical scope */
    blx r4                           /* indirect call to r4 */
    pop {r4, r5, r10, lr}            /* restore callee-saved registers */
    bx lr                            /* return */

I used the word template because while the instructions are not going to change, there are two items in the trampoline, labeled function_called and lexical_scope, that will have to be appropriately set before using the trampoline.

It may be easier to understand if you consider the code above as if it were data: see it as an array of integers. The first two integers, function_called and lexical_scope, are still zero but will be set at some point. The remaining elements in the array are other integers (we do not care which ones) that happen to encode ARM instructions. The cool thing is that these instructions refer to the two first integers, so by changing them we are indirectly changing what the trampoline does. This trampoline takes 8 words, so 32 bytes.

Let's start with this example.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
/* trampoline-sort-arrays.s */

.data

/* declare an array of 10 integers called my_array */
.align 4
my_array: .word 82, 70, 93, 77, 91, 30, 42, 6, 92, 64

/* format strings for printf */
/* format string that prints an integer plus a space */
.align 4
integer_printf: .asciz "%d "
/* format string that simply prints a newline */
.align 4
newline_printf: .asciz "\n"
.align 4 /* format string for number of comparisons */
comparison_message: .asciz "Num comparisons: %d\n"

.text

The function print_array will be the same as above. Next is main.

54
55
56
57
58
59
60
61
62
63
64
.globl main
main:
    push {r4, r5, r6, fp, lr} /* keep callee saved registers */
    mov fp, sp                /* setup dynamic link */

    sub sp, sp, #4            /* counter will be in fp - 4 */
    /* note that now the stack is 8-byte aligned */

    /* set counter to zero */
    mov r4, #0        /* r4 ← 0 */
    str r4, [fp, #-4] /* counter ← r4 */

Nothing fancy here, we set the dynamic link, allocate space in the stack for the counter and set it to zero.

Now we make room for the trampoline in the stack. Recall that our trampoline takes 32 bytes.

66
67
68
69
    /* Make room for the trampoline */
    sub sp, sp, #32 /* sp ← sp - 32 */
    /* note that 32 is a multiple of 8, so the stack
       is still 8-byte aligned */

Now we will copy the trampoline template into the stack storage we just allocated. We do this with a loop that copies a word (4 bytes) at a time.

71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
    /* copy the trampoline into the stack */
    mov r4, #32                        /* r4 ← 32 */
    ldr r5, .Laddr_trampoline_template /* r4 ← &trampoline_template */
    mov r6, sp                         /* r6 ← sp */
    b .Lcopy_trampoline_loop_check     /* branch to copy_trampoline_loop_check */

    .Lcopy_trampoline_loop:
        ldr r7, [r5]     /* r7 ← *r5 */
        str r7, [r6]     /* *r6 ← r7 */
        add r5, r5, #4   /* r5 ← r5 + 4 */
        add r6, r6, #4   /* r6 ← r6 + 4 */
        sub r4, r4, #4   /* r4 ← r4 - 4 */
    .Lcopy_trampoline_loop_check:
        cmp r4, #0                  /* compute r4 - 0 and update cpsr */
        bgt .Lcopy_trampoline_loop  /* if cpsr means that r4 > 0
                                       then branch to copy_trampoline_loop */

In the loop above, r4 counts how many bytes remain to copy. r5 and r6 are pointers inside the (source) trampoline and the (destination) stack, respectively. Since we copy 4 bytes at a time, all three registers are updated by 4.

Now we have the trampoline copied in the stack. Recall, it is just an array of words, the two first of which must be updated. The first 4 bytes must be the address of function to be called, i.e. integer_comparison_count and the second 4 bytes must be the static link, i.e. fp.

88
89
90
91
92
93
94
95
96
    /* setup the trampoline */
    ldr r4, addr_of_integer_comparison_count
                       /* r4 ← &integer_comparison_count */
    str r4, [fp, #-36] /* *(fp - 36) ← r4 */
                       /* set the function_called in the trampoline
                          to be &integer_comparison_count */
    str fp, [fp, #-32]  /* *(fp - 32) ← fp */
                        /* set the lexical_scope in the trampoline
                           to be fp */

Recall that our trampoline takes 32 bytes but in the stack we also have the counter. This is the reason why the trampoline starts in fp - 36 (this is also the address of the first word of the trampoline, of course). The second word is then at fp - 32.

Now we proceed like in the sort example above: we print the array before sorting it and after sorting it. Before printing the sorted array, we will also print the number of comparisons that were performed.

103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
   /* prepare call to print_array */
    ldr r0, addr_of_my_array /* r0 ← &my_array */
    mov r1, #10              /* r1 ← 10
                                our array is of length 10 */
    bl print_array           /* call print_array */

    /* prepare call to qsort */
    /*
    void qsort(void *base,
         size_t nmemb,
         size_t size,
         int (*compar)(const void *, const void *));
    */
    ldr r0, addr_of_my_array /* r0 ← &my_array
                                base */
    mov r1, #10              /* r1 ← 10
                                nmemb = number of members
                                our array is 10 elements long */
    mov r2, #4               /* r2 ← 4
                                size of each member is 4 bytes */
    sub r3, fp, #28          /* r3 ← fp - 28 */
    bl qsort                 /* call qsort */

    /* prepare call to printf */
    ldr r1, [fp, #-4]                    /* r1 ← counter */
    ldr r0, addr_of_comparison_message   /* r0 ← &comparison_message */
    bl printf                            /* call printf */

    /* now print again the array to see if elements were sorted */
    /* prepare call to print_array */
    ldr r0, addr_of_my_array  /* r0 ← &my_array */
    mov r1, #10               /* r1 ← 10
                                 our array is of length 10 */
    bl print_array            /* call print_array */

Note that the argument compar passed to qsort (line 123) is not the address of the nested function but the trampoline. In fact, it is not the trampoline but its third word since, as we know, the two first words of the trampoline are the address of the nested function to call and the lexical scope (that we set earlier, lines 91 and 94).

Finally we return from main as usual.

139
140
141
142
143
144
145
146
    mov r0, #0                /* r0 ← 0 set errorcode to 0 prior returning from main */

    mov sp, fp
    pop {r4, r5, r6, fp, lr}      /* restore callee-saved registers */
    bx lr                     /* return */

addr_of_my_array: .word my_array
addr_of_comparison_message : .word comparison_message

The nested comparison function is next.

148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
    /* nested function integer comparison */
    addr_of_integer_comparison_count : .word integer_comparison_count
    integer_comparison_count:
        /* r0 will be the address to the first integer */
        /* r1 will be the address to the second integer */
        push {r4, r5, r10, fp, lr} /* keep callee-saved registers */
        mov fp, sp                 /* setup dynamic link */

        ldr r0, [r0]    /* r0 ← *r0
                           load the integer pointed by r0 in r0 */
        ldr r1, [r1]    /* r1 ← *r1
                           load the integer pointed by r1 in r1 */
     
        cmp r0, r1      /* compute r0 - r1 and update cpsr */
        moveq r0, #0    /* if cpsr means that r0 == r1 then r0 ←  0 */
        movlt r0, #-1   /* if cpsr means that r0 <  r1 then r0 ← -1 */
        movgt r0, #1    /* if cpsr means that r0 >  r1 then r0 ←  1 */

        ldr r4, [fp, #8]  /* r4 ← *(fp + 8)
                             get static link in the stack */
        ldr r5, [r4, #-4] /* r5 ← counter
                             get value of counter */
        add r5, r5, #1    /* r5 ← r5 + 1 */
        str r5, [r4, #-4] /* counter ← r5
                             update counter */

        mov sp, fp        /* restore stack */
        pop {r4, r5, r10, fp, lr} /* restore callee-saved registers */
        bx lr           /* return */

As you can see, the nested function expects r10 to be correctly set. This is what the trampoline does.

Harvard architecture

If you try to run the program as shown, it will probably work. But it will do it by chance. The reason is that we are featuring some simple form of self modifying code.

The Raspberry Pi processor features a modified Harvard architecture. This means that at some point there exists a distinction between instructions memory (.text) and the data memory (.data). Nowadays there are not many processors that feature a strict distinction between instruction and data memory (so at some point the program and the data are both in main memory, commonly called the RAM) but such differentiation is kept for caches.

A cache is a smaller and faster memory, that sits between the processor and the main memory. It is used to speed up memory accesses since most of the time such accesses happen close to other memory accesses (i.e. accessing elements of an array, different local variables in the stack or one instruction after the other in implicit sequencing) or close in time (i.e. accessing several times the same local variable or executing the same instruction when the code is in a loop).

Most modern processors feature distinguished caches for data (called the data cache) and instructions (called the instruction cache). The reason for such differentiation is that accessing to memory to execute instruction has a different pattern than accessing to memory to load/store data. It is beneficial to make such distinction but it comes at some price: when a program manipulates data that later will be executed as instructions (like we did with the trampoline, but also when the operating system loads a program in memory) the view of the two caches respect to the program state becomes incoherent: changes that we did in the data will have effect in the data cache but not in the instruction cache. Conversely, since the instruction cache will only get data from the main memory (and not from the data cache), we need to write back all the changes we did in the data cache to the main memory (this is called flushing the cache). We also have to make sure the instruction cache effectively gets the instructions from the memory, rather than reusing previously loaded instructions (which would be stale now), so we have to invalidate (or clear) the instruction cache.

In ARM the instructions that flush and invalidate caches are privileged operations (done through coprocessor instructions on the coprocessor 15 which manages the memory system of the CPU). This means that only the operating system can execute such instructions. As you see, user code may have to request a cache clear. Linux provides a cacheflush system call for this purpose. Recall that in chapter 19 we saw ho to make system calls.

According to the Linux kernel, register r0 must contain the address of the beginning of the region to be flushed and invalidated. r1 must contain the address of the first byte that will not be invalidated. r2 must be zero. The cacheflush service number, that has to be set in r7 is 0xf0002.

    push {r7}          /* keep r7 because we are going to modify it */
    mov r7, #0xf0000   /* r7 ← 0xf0000 */
    add r7, r7, #2     /* r7 ← r7 + 2. So r7 ← 0xf0002
                          We do this in two steps because
                          we cannot encode 0xf0002 in
                          the instruction */
    mov r0, sp         /* r0 ← sp */
    add r1, sp, #32    /* r1 ← sp + 32 */
    mov r2, #0         /* r2 ← 0 */
    swi 0              /* system call */
    pop  {r7}          /* restore r7 */

As an alternative we can call an internal function implemented in libgcc (the GCC low-level runtime library) called __clear_cache. This function will internally call the Linux service.

98
99
100
101
    /* prepare call to __clear_cache */
    mov r0, sp       /* r0 ← sp */
    add r1, sp, #32  /* r1 ← sp + 32 */
    bl __clear_cache /* call __clear_cache */

We will invalidate and flush the caches right after setting up the trampoline (lines 89 to 94).

Now it only remains to run our program.

$ ./trampoline-sort-array 
82 70 93 77 91 30 42 6 92 64 
Num comparisons: 22
6 30 42 64 70 77 82 91 92 93

You can see the full listing here.

Discussion

Given that nested functions require a lexical scope, they cannot be trivially passed as plain addresses to other functions. Today we have seen that by using a trampoline it is possible to pass them to functions that do not allow passing a lexical scope. The price is having to copy a template, the trampoline, having to set it up with the proper values. We also have to flush caches in order to avoid executing wrong code. It is complicated but doable.

Having to flush the cache is undesirable (although not required in all architectures) and may cause a severe degradation of performance. Performance-critical pieces of code typically would not want to do this.

A serious concern, though, with the trampoline approach relates to the fact that we need an executable stack. A modern operating system, like Linux, can mark regions of memory to be readable, writable or executable. A region of memory that is not executable may contain instructions but if we branch to that region the processor will signal a fault, and the operating system will likely kill our process. Being able to disable execution of specific memory regions is done for security purposes. Most of the time one does not have to execute instructions that are found in stack or .data section. Only .text makes sense in these cases to be executable.

If you check what we did above, we actually copied some code (which was in .text) into the stack and then, qsort branched to the stack. This is because our programs allow an executable stack. Executable stacks are linked to common program vulnerability exploits like buffer overflows.

As we've seen in this chapter and in the previous one, nested functions come with several downsides, so it is not surprising that several programming languages do not provide support for them.

That's all for today.