Today we will continue with nested functions.

## Sorting

We will first take a detour. The C function `qsort` can be used to sort any kind of array. Its C signature is the following.

`qsort` returns `void`, this is, it does not return anything because it performs the sort in place. This means that we will pass a (potentially unsorted) array called `base` of length `nmemb` to `qsort`. When `qsort` returns, the elements in this array will be sorted. If `qsort` were able to just sort a specific kind of arrays it would be rather limited. In order to be able to sort any array, `qsort` requires the `size` of each element in the array. Note that the array is passed by reference (otherwise the in place sorting would not be possible): `void*` is the C way to say «I accept an address to any kind of data».

We will come back later to the `compar` bit of `qsort`.

### Print an array

Before we sort an array, we need a way to examine it. We will use for that a function `print_array` that prints an array of integers.

The code above is pretty straightforward and it does not feature anything that has not been seen in previous installments. Running it simply prints the current contents of the array.

### Comparison

Above, when we talked about `qsort` we skipped the `compar` parameter. What is `compar`? It is a an address to a function. The funky syntax for C tells us that this function, if it is ever called, will be passed two addreses (again, it does not care what they are, so they are `void*`) and returns an integer. The manual of qsort explains that this function has to return lower than zero, zero or greater than zero. If the object in the address of the first parameter of `compar` is lower than the object in the address of the second parameter, then it has to return lower than zero. If they are equal, it should return zero. If the first object is greater than the second, then it should return greater than zero.

If you wonder why the parameters of `compar` are actually `const void*` rather than `void*`, it is the C way of telling us that the data of the referenced objects cannot change during the comparison. This may sound obvious given that changing things is not the job of a comparison function. Passing them by reference would let us to change them. So this is reminder that we should not.

Since our array is an array of integers we will have to compare integers: let's write a function that, given two pointers to integers (i.e. addresses) behaves as stated above.

Function `integer_comparison` does not feature anything new either: it simply avoids branches by using predication as we saw in chapter 11.

Now we have the last missing bit to be able to call `qsort`. Here is a program that prints (only `main` is shown) the array twice, before sorting it and after sorting it.

If we put everything together, we can verify that our array is effectively sorted after the call to the `qsort` function.

### What is going on?

C function `qsort` implements a sorting algorithm (the C Standard does not specify which one must be but it is usually a fine-tuned version of quicksort) which at some point will require to compare two elements. To do this, `qsort` calls the function `compar`.

## Count how many comparisons happen

Now, we want to count how many comparisons (i.e., how many times `integer_comparison` is called) when sorting the array. We could change `integer_comparison` so it increments a global counter.

But this post is about nested functions so we will use nested functions. Recall that nested functions can access local variables of their enclosing functions. So we will use a local variable of `main` as the counter and a nested function (of `main`) that performs the comparison and updates the counter.

In the last chapter we ended with a short discussion about nested functions. A downside of nested functions it that a pointer to a nested function requires two things: the address of the function and the lexical scope. If you check again the previous example where we call `qsort`, you will see that we do not mention anywhere the lexical scope. And there is a reason for that, it is not possible to pass it to `qsort`. In C, functions cannot be nested so a pointer to a function can just be the address of the function.

## Trampoline

We will continue using the convention of the last chapter: `r10` will have the lexical scope upon the entry of the function. But `qsort`, when calls `integer_compare_count` will not set it for us: we cannout count on `r10` having a meaningful value when called from `qsort`. This means that `qsort` should actually call something that first sets `r10` with the right value and then jumps to `integer_compare_count`. We will call this ancillary code (or pseudofunction) a trampoline. The technique used here is similar to the one used by GCC described in Lexical Closures for C++ (Thomas M. Breuel, USENIX C++ Conference Proceedings, October 17-21, 1988).

The trampoline is a small, always the same, sequence of instructions that behaves like a function and its only purpose is to set `r10` and then make an indirect call to the nested function. Since the sequence of instructions is always the same, the instructions themselves look like a template.

I used the word template because while the instructions are not going to change, there are two items in the trampoline, labeled `function_called` and `lexical_scope`, that will have to be appropriately set before using the trampoline.

It may be easier to understand if you consider the code above as if it were data: see it as an array of integers. The first two integers, `function_called` and `lexical_scope`, are still zero but will be set at some point. The remaining elements in the array are other integers (we do not care which ones) that happen to encode ARM instructions. The cool thing is that these instructions refer to the two first integers, so by changing them we are indirectly changing what the trampoline does. This trampoline takes 8 words, so 32 bytes.

The function `print_array` will be the same as above. Next is `main`.

Nothing fancy here, we set the dynamic link, allocate space in the stack for the counter and set it to zero.

Now we make room for the trampoline in the stack. Recall that our trampoline takes 32 bytes.

Now we will copy the trampoline template into the stack storage we just allocated. We do this with a loop that copies a word (4 bytes) at a time.

In the loop above, `r4` counts how many bytes remain to copy. `r5` and `r6` are pointers inside the (source) trampoline and the (destination) stack, respectively. Since we copy 4 bytes at a time, all three registers are updated by 4.

Now we have the trampoline copied in the stack. Recall, it is just an array of words, the two first of which must be updated. The first 4 bytes must be the address of function to be called, i.e. `integer_comparison_count` and the second 4 bytes must be the static link, i.e. `fp`.

Recall that our trampoline takes 32 bytes but in the stack we also have the counter. This is the reason why the trampoline starts in `fp - 36` (this is also the address of the first word of the trampoline, of course). The second word is then at `fp - 32`.

Now we proceed like in the sort example above: we print the array before sorting it and after sorting it. Before printing the sorted array, we will also print the number of comparisons that were performed.

Note that the argument `compar` passed to qsort (line 123) is not the address of the nested function but the trampoline. In fact, it is not the trampoline but its third word since, as we know, the two first words of the trampoline are the address of the nested function to call and the lexical scope (that we set earlier, lines 91 and 94).

Finally we return from main as usual.

The nested comparison function is next.

As you can see, the nested function expects `r10` to be correctly set. This is what the trampoline does.

## Harvard architecture

If you try to run the program as shown, it will probably work. But it will do it by chance. The reason is that we are featuring some simple form of self modifying code.

The Raspberry Pi processor features a modified Harvard architecture. This means that at some point there exists a distinction between instructions memory (`.text`) and the data memory (`.data`). Nowadays there are not many processors that feature a strict distinction between instruction and data memory (so at some point the program and the data are both in main memory, commonly called the RAM) but such differentiation is kept for caches.

A cache is a smaller and faster memory, that sits between the processor and the main memory. It is used to speed up memory accesses since most of the time such accesses happen close to other memory accesses (i.e. accessing elements of an array, different local variables in the stack or one instruction after the other in implicit sequencing) or close in time (i.e. accessing several times the same local variable or executing the same instruction when the code is in a loop).

Most modern processors feature distinguished caches for data (called the data cache) and instructions (called the instruction cache). The reason for such differentiation is that accessing to memory to execute instruction has a different pattern than accessing to memory to load/store data. It is beneficial to make such distinction but it comes at some price: when a program manipulates data that later will be executed as instructions (like we did with the trampoline, but also when the operating system loads a program in memory) the view of the two caches respect to the program state becomes incoherent: changes that we did in the data will have effect in the data cache but not in the instruction cache. Conversely, since the instruction cache will only get data from the main memory (and not from the data cache), we need to write back all the changes we did in the data cache to the main memory (this is called flushing the cache). We also have to make sure the instruction cache effectively gets the instructions from the memory, rather than reusing previously loaded instructions (which would be stale now), so we have to invalidate (or clear) the instruction cache.

In ARM the instructions that flush and invalidate caches are privileged operations (done through coprocessor instructions on the coprocessor 15 which manages the memory system of the CPU). This means that only the operating system can execute such instructions. As you see, user code may have to request a cache clear. Linux provides a `cacheflush` system call for this purpose. Recall that in chapter 19 we saw ho to make system calls.

According to the Linux kernel, register r0 must contain the address of the beginning of the region to be flushed and invalidated. `r1` must contain the address of the first byte that will not be invalidated. `r2` must be zero. The cacheflush service number, that has to be set in `r7` is `0xf0002`.

As an alternative we can call an internal function implemented in `libgcc` (the GCC low-level runtime library) called __clear_cache. This function will internally call the Linux service.

We will invalidate and flush the caches right after setting up the trampoline (lines 89 to 94).

Now it only remains to run our program.

You can see the full listing here.

## Discussion

Given that nested functions require a lexical scope, they cannot be trivially passed as plain addresses to other functions. Today we have seen that by using a trampoline it is possible to pass them to functions that do not allow passing a lexical scope. The price is having to copy a template, the trampoline, having to set it up with the proper values. We also have to flush caches in order to avoid executing wrong code. It is complicated but doable.

Having to flush the cache is undesirable (although not required in all architectures) and may cause a severe degradation of performance. Performance-critical pieces of code typically would not want to do this.

A serious concern, though, with the trampoline approach relates to the fact that we need an executable stack. A modern operating system, like Linux, can mark regions of memory to be readable, writable or executable. A region of memory that is not executable may contain instructions but if we branch to that region the processor will signal a fault, and the operating system will likely kill our process. Being able to disable execution of specific memory regions is done for security purposes. Most of the time one does not have to execute instructions that are found in stack or `.data` section. Only `.text` makes sense in these cases to be executable.

If you check what we did above, we actually copied some code (which was in `.text`) into the stack and then, `qsort` branched to the stack. This is because our programs allow an executable stack. Executable stacks are linked to common program vulnerability exploits like buffer overflows.

As we've seen in this chapter and in the previous one, nested functions come with several downsides, so it is not surprising that several programming languages do not provide support for them.

That's all for today.