ARM assembler in Raspberry Pi – Chapter 16
We saw in chapters 6 and 12 several control structures but we left out a usual one: the switch also known as select/case. In this chapter we will see how we can implement it in ARM assembler.
Switch control structure
A switch in C has the following structure.
In the example above, expression E
is evaluated and its value is used to determine the next statement executed. So if E
evaluates to V2
, S2
will be the next statement executed. If no case
matches, the whole switch
construct is ignored unless there is a default
case the statement of which is executed instead.
Note that, once the flow jumps to a statement, the execution continues from that point unless a break
statement is found. The break
statement switch
construct. Most of the time the programmer adds a break
to end each case. Otherwise fall-through cases happens. In the example above, if E
evaluates to V1
and there is no break in S1
, the program would continue running S2
and Sdefault
unless the program encounters a break
statement inside S2
or Sdefault
. Fall-through may look a bit weird and confusing but there are some cases where is useful.
That said, C is a particularly bad example to showcase this structure. The reason is that the exact language definition of a switch
in C is as follows.
S
can be anything but the flow will always jump to a case
or a default
inside S
, so if S
does not contain any of these statements, nothing happens.
So for a switch
to be useful we will need at least one case
or default
statement. If more than one is needed, then we can use a compound statement (a list of statements enclosed in side {
and }
as shown in the first example above.
Note also, that case
and default
statements are only valid inside the S
of a switch
but this does not mean that they have to be immediately nested inside them.
As you can see, the switch
statement in C is pretty liberal. Other languages, like Pascal or Fortran, have stricter syntaxes that do not allow fall-through nor loose positioning of case/default.
In this post, we will not care about these strange cases of switch
although we will allow fall-through.
Implementing switch
Probably you already have figured that a switch not involving fall-through in any of its cases is equivalent to a sequence of if-else blocks. The following switch
,
can be implemented as
In contrast to the usual if-else statement, there need not be a branch that goes after the if-statement once the if branch has been executed. This is, in the example above, it is optional to have a branch after code_for_case5
that goes to code_after
. If such branch is omitted, then a fall-through to code_for_case10
happens naturally. So the break
statement inside a switch
is simply that unconditional branch.
We can put all the checks at the beginning, as long as we preserve the order of the cases (so fall-through works if break
is omitted).
This approach is sensible if the number of cases is low. Here "low" is not very well defined, let's say 10 or less. What if we have lots of cases? A sequence of if-else checks will make as many comparisons as cases. If the values of the N cases are uniformly spread during the execution of the program, this means that in average we will have to do N/2 checks. If the values are not uniformly spread, then it is obvious that we should check the common values first and the rare last (sadly, most of the times we have no idea of their frequency).
There are a number of ways of reducing the cost of checking the cases: tables and binary search.
Jump tables
Imagine we have a switch
like this one
If we implement it the way shown above we will make in average (for an uniformly distributed set of values of x
) 50 comparisons. We can improve this if we simply use the value c
to index a table of addresses to the instructions of the case instructions.
Consider the program below where we use the value of argc
of a C program. In C, the main
function receives two parameters, argc
and argv
: argc
is just an integer, in the register r0
as usual; argv
is an address, in the register r1
as usual, to an array of the arguments passed in the command-line. There are as many elements in argv
as the value of argc
, at least one. We will not use argv
today, only argc
.
We are using just 3 cases plus the default one, but it would not be complex (yet cumbersome) to extend it to 100 cases.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
/* jumptable.s */
.data
.text
.globl main
main:
cmp r0, #1 /* r0 - 1 and update cpsr */
blt case_default /* branch to case_default if r0 < 1 */
cmp r0, #3 /* r0 - 3 and update cpsr */
bgt case_default /* branch to case_default if r0 > 3 */
sub r0, r0, #1 /* r0 ← r0 - 1. Required to index the table */
ldr r1, addr_of_jump_table /* r1 ← &jump_table */
ldr r1, [r1, +r0, LSL #2] /* r1 ← *(r1 + r0*4).
This is r1 ← jump_table[r0] */
mov pc, r1 /* pc ← r1
This will cause a branch to the
computed address */
case_1:
mov r0, #1 /* r0 ← 1 */
b after_switch /* break */
case_2:
mov r0, #2 /* r0 ← 2 */
b after_switch /* break */
case_3:
mov r0, #3 /* r0 ← 3 */
b after_switch /* break */
case_default:
mov r0, #42 /* r0 ← 42 */
b after_switch /* break (unnecessary) */
after_switch:
bx lr /* Return from main */
.align 4
jump_table:
.word case_1
.word case_2
.word case_3
.align 4
addr_of_jump_table: .word jump_table
As you can see in line 43 we define a jump table, whose elements are the addresses of the labels of each case (in order). In lines 14 to 16 we load the appropiate value from that table after we are sure that the value of argc is between 1 and 3, checked in lines 9 to 12. Finally, we load the address to pc
. This will effectively do a branch to the proper case.
If you run the program you will see diferent exit codes returned (remember that they are retuned through r0
in main
). The program only counts the arguments, if instead of "a b" you use "one two" it will return 3 as well. More than two arguments and it will return 42.
To safely use the jump table we have to make sure the value of the case lies in the bounds of the table. If m is the minimum case value and M is the maximum case value, our table will have M - m + 1 entries. In the example above m = 1 and M = 3 so we have 3 entries in the table. We have to make sure that the value used to index is m ≤ x ≤ M, otherwise we would be accessing incorrect memory locations. Also remember that to properly index the jump table we will have to subtract m to the case value.
Jump tables are great, once we have checked that the case value is in the proper range (these are two comparisons) then we do not have to compare anything else. So basically the cost of comparisons in this approach is constant (i.e. it does not grow if the number of cases grow).
There are two big downsides to this approach which prevent us from always using it. The first one happens when the difference between M and m is large, our jump table will be large. This enlarges the code size. We have, basically, traded time by space. Now our code size will add 4 bytes per case handled in a jump table. A table of 256 entries will take up to 1 Kbyte (1024 bytes) of memory in our executable program. To be fair, this is the amount of space taken by 256 instructions. So if code size is a concern for you (and usually in the embedded world is), this approach may not be suitable. The second big downside happens when there are "holes" in the cases. Imagine our cases are just 1, 3 and 100. The table will be 100 items long but only 1, 3 and 100 will have useful entries: all the remaining entries will have the address of the default case (or the address after the switch if the default case is omitted). In this case we are not just taking 400 bytes, we are wasting 388 bytes (97% of entries would be useless!). So if the number of cases is low and the values are scattered in a large range, jump tables are not a good choice.
Compute the case address
This strategy is a bit complicated and has more constraints than a jump table, so it is less general. If all the cases are ordered and they take the same amount of instructions, we can compute the address of the case without using a jump table. This is risky because we have to be careful when computing the address of the branch using the current value (otherwise we will jump to a wrong address and bad things will happen for sure).
If not all the cases take the same amount of instructions, we can compensate them to take as many instructions as the case with the biggest number of instructions. We can do that using the nop
instruction that does nothing but occupy space. If the variance of the number of instructions among cases is small we will just end adding some nop
s to a few cases. If the variance is large, we may end with code bloat, something we wanted to avoid when using this technique.
If there are holes, we can just branch them to the default case and fill the remaining instructions with nop
s. Again if the number of holes is large this is prone to code bloat as well.
In our example of the jump table, each case takes just two instructions. So we can get the address of the first case and use it as a base address to compute the branch.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
/* calcjump.s */
.data
.text
.globl main
main:
cmp r0, #1 /* r0 - 1 and update cpsr */
blt case_default /* branch to case_default if r0 < 1 */
cmp r0, #3 /* r0 - 3 and update cpsr */
bgt case_default /* branch to case_default if r0 > 3 */
sub r0, r0, #1 /* r0 ← r0 - 1. Required to index the table */
ldr r1, addr_of_case_1 /* r1 ← &case_1 */
add r1, r1, r0, LSL #3 /* r1 ← r1 + r0 * 8
Each instruction is 4 bytes
Each case takes 2 instructions
Thus, each case is 8 bytes (4 * 2)
*/
mov pc, r1 /* pc ← r1
This will cause a branch to the
computed address */
case_1:
mov r0, #1 /* r0 ← 1 */
b after_switch /* break */
case_2:
mov r0, #2 /* r0 ← 2 */
b after_switch /* break */
case_3:
mov r0, #3 /* r0 ← 3 */
b after_switch /* break */
case_default:
mov r0, #42 /* r0 ← 42 */
b after_switch /* break (unnecessary) */
after_switch:
bx lr /* Return from main */
.align 4
addr_of_case_1: .word case_1
Binary search
Consider again our example with 100 cases. A string of if-else will require in average 50 comparisons. Can we reduce the number of comparisons? Well, the answer is yes. Perform a binary search of the case.
A binary search will discard half of the case set each time. This will allow us to dramatically reduce the amount of comparisons. The following example implements the same code in the jump table but with cases 1 to 10.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
/* binsearch.s */
.data
.text
.globl main
main:
cmp r0, #1 /* r0 - 1 and update cpsr */
blt case_default /* if r0 < 1 then branch to case_default */
cmp r0, #10 /* r0 - 10 and update cpsr */
bgt case_default /* if r0 > 10 then branch to case default */
case_1_to_10:
cmp r0, #5 /* r0 - 5 and update cpsr */
beq case_5 /* if r0 == 5 branch to case_5 */
blt case_1_to_4 /* if r0 < 5 branch to case_1_to_4 */
bgt case_6_to_10 /* if r0 > 5 branch to case_6_to_4 */
case_1_to_4:
cmp r0, #2 /* r0 - 2 and update cpsr */
beq case_2 /* if r0 == 2 branch to case_2 */
blt case_1 /* if r0 < 2 branch to case_1
(case_1_to_1 does not make sense) */
bgt case_3_to_4 /* if r0 > 2 branch to case_3_to_4 */
case_3_to_4:
cmp r0, #3 /* r0 - 3 and update cpsr */
beq case_3 /* if r0 == 3 branch to case_3 */
b case_4 /* otherwise it must be r0 == 4,
branch to case_4 */
case_6_to_10:
cmp r0, #8 /* r0 - 8 and update cpsr */
beq case_8 /* if r0 == 8 branch to case_8 */
blt case_6_to_7 /* if r0 < 8 then branch to case_6_to_7 */
bgt case_9_to_10 /* if r0 > 8 then branch to case_9_to_10 */
case_6_to_7:
cmp r0, #6 /* r0 - 6 and update cpsr */
beq case_6 /* if r0 == 6 branch to case_6 */
b case_7 /* otherwise it must be r0 == 7,
branch to case 7 */
case_9_to_10:
cmp r0, #9 /* r0 - 9 and update cpsr */
beq case_9 /* if r0 == 9 branch to case_9 */
b case_10 /* otherwise it must be r0 == 10,
branch to case 10 */
case_1:
mov r0, #1
b after_switch
case_2:
mov r0, #2
b after_switch
.
. /* Cases from 3 to 9 omitted */
.
case_10:
mov r0, #10
b after_switch
case_default:
mov r0, #42 /* r0 ← 42 */
b after_switch /* break (unnecessary) */
after_switch:
bx lr /* Return from main */
This strategy is able to determine the case value in just only 3 comparisons (if we ignore the mandated two comparisons for range checking). What we do is we check the compare the case value with the middle one in the current range. This way we can discard half of the sets at every comparison step.
This strategy works well also for scattered case sets like [1, 2, 3, 24, 25, 26, 97, 98, 99, 300]. In this case the comparisons would be
which is 3 comparisons at most also.
Using this strategy the number of comparisons is log(N), where N is the number of elements in the case set. So for 10 elements, in the worst of the cases, we will have to do 3 comparisons, for 20 at most 4, for 40 at most 5, etc.
Let's retake the code bloat issue that arised with jump tables. If you check, every comparison requires 3 or 4 instructions, this is about 12 to 16 bytes per comparison. If we have a case set of 256 elements, the generated code will require 128 comparisons blocks in total. While the number of comparisons performed at runtime, 8 in the worst of the cases, we still need 128 case_x_to_y
comparison blocks to perform the binary search. If we pessimistically assume that all comparison blocks take 4 instructions this will be 4*128*4 = 2048 bytes in instructions. Compare that to a jump table of 256 positions, each position takes 4 bytes: 256 * 4 = 1024 bytes. So, binary search is not so competitive in terms of code size.
Binary search, thus, is useful for large scattered sets. Recall that if-else strings are not efficient for large sets of cases and jump tables waste space if the case range lacks lots of cases.
Hybrid approach
Is it possible to combine the two strategies? The answer is yes. We will use two tables: a case values table (sorted, usually in ascending order) and addresses for each case in another table, in the same order as the case values table.
We will make a binary search inside the case value set. When the value is found we will use the index of the match to calculate a jump. For the example below we will use the case set [1, 2, 3, 24, 25, 26, 97, 98, 99, 300].
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
/* hybrid.s */
.data
.text
.globl main
main:
push {r4, r5, r6, lr}
cmp r0, #1 /* r0 - 1 and update cpsr */
blt case_default /* if r0 < 1 then branch to case_default */
cmp r0, #300 /* r0 - 300 and update cpsr */
bgt case_default /* if r0 > 300 then branch to case default */
/* prepare the binary search.
r1 will hold the lower index
r2 will hold the upper index
r3 the base address of the case_value_table
*/
mov r1, #0
mov r2, #9
ldr r3, addr_case_value_table /* r3 ← &case_value_table */
b check_binary_search
binary_search:
add r4, r1, r2 /* r4 ← r1 + r2 */
mov r4, r4, ASR #1 /* r4 ← r4 / 2 */
ldr r5, [r3, +r4, LSL #2] /* r5 ← *(r3 + r4 * 4).
This is r5 ← case_value_table[r4] */
cmp r0, r5 /* r0 - r5 and update cpsr */
sublt r2, r4, #1 /* if r0 < r5 then r2 ← r4 - 1 */
addgt r1, r4, #1 /* if r0 > r5 then r1 ← r4 + 1 */
bne check_binary_search /* if r0 != r5 branch to binary_search */
/* if we reach here it means that r0 == r5 */
ldr r5, addr_case_addresses_table /* r5 ← &addr_case_value_table */
ldr r5, [r5, +r4, LSL #2] /* r5 ← *(r5 + r4*4)
This is r5 ← case_addresses_table[r4] */
mov pc, r5 /* branch to the proper case */
check_binary_search:
cmp r1, r2 /* r1 - r2 and update cpsr */
ble binary_search /* if r1 <= r2 branch to binary_search */
/* if we reach here it means the case value
was not found. branch to default case */
b case_default
case_1:
mov r0, #1
b after_switch
case_2:
mov r0, #2
b after_switch
case_3:
mov r0, #3
b after_switch
case_24:
mov r0, #24
b after_switch
case_25:
mov r0, #95
b after_switch
case_26:
mov r0, #96
b after_switch
case_97:
mov r0, #97
b after_switch
case_98:
mov r0, #98
b after_switch
case_99:
mov r0, #99
b after_switch
case_300:
mov r0, #300 /* The error code will be 44 */
b after_switch
case_default:
mov r0, #42 /* r0 ← 42 */
b after_switch /* break (unnecessary) */
after_switch:
pop {r4,r5,r6,lr}
bx lr /* Return from main */
case_value_table: .word 1, 2, 3, 24, 25, 26, 97, 98, 99, 300
addr_case_value_table: .word case_value_table
case_addresses_table:
.word case_1
.word case_2
.word case_3
.word case_24
.word case_25
.word case_26
.word case_97
.word case_98
.word case_99
.word case_300
addr_case_addresses_table: .word case_addresses_table
In lines 21 to 44 we implement the binary search. This implementation is an iterative binary search where r1
and r2
keep the lower and upper indexes of the table that is currently searched. We will leave the search if the lower index becomes larger than the upper, lines 42 to 44. When searching the range given by r1
and r2
, we will compute r4
as the middle index (r1+r2)/2
, lines 27 to 28. We will compare it to the current case value being searched, in r0
, line 31. If the value, r5
, in the case value table (which must be in ascending order) is lower than the current case value being searched, then we shrink the range from r1
to r4-1
, so we update r2
only if r0 < r5
, line 32. Conversely if r0 > r5
then we shrink the range from r4+1
to r2
, line 33. If the value of r5
matches, then we use the index r4
to load the case address and branch to it, lines 37 to 40. Note that if r0
is different to r5
, we have to omit this step so we branch to the check of the loop, line 34.
You can check that this works.
This approach has several interesting properties. It reduces the number of comparisons (we will make the same number of comparisons as in the binary search) and avoids the code bloat due to big jump tables (avoiding useless entries) and comparison blocks (by using a loop). As a drawback, this approach, requires two tables.
That's all for today.