ARM assembler in Raspberry Pi

We saw in chapters 6 and 12 several control structures but we left out a usual one: the switch also known as select/case. In this chapter we will see how we can implement it in ARM assembler.

Switch control structure

A switch in C has the following structure.

switch (E)
{
   case V1: S1;
   case V2: S2;
   default: Sdefault;
}

In the example above, expression E is evaluated and its value is used to determine the next statement executed. So if E evaluates to V2, S2 will be the next statement executed. If no case matches, the whole switch construct is ignored unless there is a default case the statement of which is executed instead.

Note that, once the flow jumps to a statement, the execution continues from that point unless a break statement is found. The break statement switch construct. Most of the time the programmer adds a break to end each case. Otherwise fall-through cases happens. In the example above, if E evaluates to V1 and there is no break in S1, the program would continue running S2 and Sdefault unless the program encounters a break statement inside S2 or Sdefault. Fall-through may look a bit weird and confusing but there are some cases where is useful.

That said, C is a particularly bad example to showcase this structure. The reason is that the exact language definition of a switch in C is as follows.

switch (E)
   S;

S can be anything but the flow will always jump to a case or a default inside S, so if S does not contain any of these statements, nothing happens.

switch (E)
  printf("This will never be printed\n");

So for a switch to be useful we will need at least one case or default statement. If more than one is needed, then we can use a compound statement (a list of statements enclosed in side { and } as shown in the first example above.

Note also, that case and default statements are only valid inside the S of a switch but this does not mean that they have to be immediately nested inside them.

switch (E)
{
  if (X) // Note that the check of the truth value of X will be never run!
  {
     default: printf ("Hi!\n");
  }
  else
  {
     case 10: printf ("Howdy stranger!\n");
  }
}

As you can see, the switch statement in C is pretty liberal. Other languages, like Pascal or Fortran, have stricter syntaxes that do not allow fall-through nor loose positioning of case/default.

{ Case statement in Pascal }
Case Number of  
 1 : WriteLn ('One');  
 2 : WriteLn ('Two');  
Else  
 WriteLn ('Other than one or two');  
End;

In this post, we will not care about these strange cases of switch although we will allow fall-through.

Implementing switch

Probably you already have figured that a switch not involving fall-through in any of its cases is equivalent to a sequence of if-else blocks. The following switch,

switch (x)
{
  case 5: code_for_case5; break;
  case 10: code_for_case10; break;
  default: code_for_default; break; 
  // break would not be required here as this is the last case
}

can be implemented as

if (x == 5)
  code_for_case5;
else if (x == 10)
  code_for_case10;
else /* default */
  code_for_default;

code_after;

In contrast to the usual if-else statement, there need not be a branch that goes after the if-statement once the if branch has been executed. This is, in the example above, it is optional to have a branch after code_for_case5 that goes to code_after. If such branch is omitted, then a fall-through to code_for_case10 happens naturally. So the break statement inside a switch is simply that unconditional branch.

  /* Here we evaluate x and keep it in r0 */
  case_5:             /* case 5 */
    cmp r0, #5        /* Compute r0 - 5 and update cpsr */
    bne case_10       /* if r0 != 5 branch to case_10 */
    code_for_case5
    b after_switch    /* break */

  case_10:            /* case 10 */
    cmp r0, #10       /* Compute r0 - 10 and update cpsr */
    bne case_default  /* If r0 != 10 branch to case_default */
    code_for_case10
    b after_switch    /* break */

  case_default:
    code_for_default 
    /* Note that if default is not the last case
       we need a branch to after_switch here */

  after_switch:

We can put all the checks at the beginning, as long as we preserve the order of the cases (so fall-through works if break is omitted).

  /* Here we evaluate x and keep it in r0 */
  cmp r0, #5        /* Compute r0 - 5 and update cpsr */
  beq case_5        /* if r0 == 5 branch to case_5 */
  cmp r0, #10       /* Compute r0 - 10 and update cpsr */
  beq case_10       /* if r0 == 10 branch to case_10 */
  b case_default    /* branch to default case
                       Note that there is no default case
                       we would branch to after_switch */

  case_5:             /* case 5 */
    code_for_case5
    b after_switch    /* break */

  case_10:            /* case 10 */
    code_for_case10
    b after_switch    /* break */

  case_default:
    code_for_default 
    /* Note that if default is not the last case
       we need a branch to after_switch here */

  after_switch:

This approach is sensible if the number of cases is low. Here "low" is not very well defined, let's say 10 or less. What if we have lots of cases? A sequence of if-else checks will make as many comparisons as cases. If the values of the N cases are uniformly spread during the execution of the program, this means that in average we will have to do N/2 checks. If the values are not uniformly spread, then it is obvious that we should check the common values first and the rare last (sadly, most of the times we have no idea of their frequency).

There are a number of ways of reducing the cost of checking the cases: tables and binary search.

Jump tables

Imagine we have a switch like this one

switch (x)
{
  case 1: do_something_1;
  case 2: do_something_2;
  ...
  case 100: do_something_100;
}

If we implement it the way shown above we will make in average (for an uniformly distributed set of values of x) 50 comparisons. We can improve this if we simply use the value c to index a table of addresses to the instructions of the case instructions.

Consider the program below where we use the value of argc of a C program. In C, the main function receives two parameters, argc and argv: argc is just an integer, in the register r0 as usual; argv is an address, in the register r1 as usual, to an array of the arguments passed in the command-line. There are as many elements in argv as the value of argc, at least one. We will not use argv today, only argc.

int main(int argc, char *argv[])
{
   int x;
   switch (argc)
   {
     case 1: x = 1; break;
     case 2: x = 2; break;
     case 3: x = 3; break;
     default: x = 42; break;
   }
   return x;
}

We are using just 3 cases plus the default one, but it would not be complex (yet cumbersome) to extend it to 100 cases.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
/* jumptable.s */
.data

.text

.globl main

main:
  cmp r0, #1                  /* r0 - 1 and update cpsr */
  blt case_default            /* branch to case_default if r0 < 1 */
  cmp r0, #3                  /* r0 - 3 and update cpsr */
  bgt case_default            /* branch to case_default if r0 > 3 */

  sub r0, r0, #1              /* r0 ← r0 - 1. Required to index the table */
  ldr r1, addr_of_jump_table  /* r1 ← &jump_table */
  ldr r1, [r1, +r0, LSL #2]   /* r1 ← *(r1 + r0*4).
                                 This is r1 ← jump_table[r0] */

  mov pc, r1                  /* pc ← r1
                                 This will cause a branch to the
                                 computed address */

  case_1:
   mov r0, #1                 /* r0 ← 1 */ 
   b after_switch             /* break */

  case_2:
   mov r0, #2                 /* r0 ← 2 */
   b after_switch             /* break */

  case_3:
   mov r0, #3                 /* r0 ← 3 */
   b after_switch             /* break */

  case_default:
   mov r0, #42                /* r0 ← 42 */
   b after_switch             /* break (unnecessary) */

  after_switch:

  bx lr                       /* Return from main */

.align 4
jump_table:
   .word case_1
   .word case_2
   .word case_3

.align 4
addr_of_jump_table: .word jump_table

As you can see in line 43 we define a jump table, whose elements are the addresses of the labels of each case (in order). In lines 14 to 16 we load the appropiate value from that table after we are sure that the value of argc is between 1 and 3, checked in lines 9 to 12. Finally, we load the address to pc. This will effectively do a branch to the proper case.

If you run the program you will see diferent exit codes returned (remember that they are retuned through r0 in main). The program only counts the arguments, if instead of "a b" you use "one two" it will return 3 as well. More than two arguments and it will return 42.

$ ./jumptable ; echo $?
1
$ ./jumptable a ; echo $?
2
$ ./jumptable a b  ; echo $?
3
$ ./jumptable a b c ; echo $?
42

To safely use the jump table we have to make sure the value of the case lies in the bounds of the table. If m is the minimum case value and M is the maximum case value, our table will have M - m + 1 entries. In the example above m = 1 and M = 3 so we have 3 entries in the table. We have to make sure that the value used to index is m ≤ x ≤ M, otherwise we would be accessing incorrect memory locations. Also remember that to properly index the jump table we will have to subtract m to the case value.

Jump tables are great, once we have checked that the case value is in the proper range (these are two comparisons) then we do not have to compare anything else. So basically the cost of comparisons in this approach is constant (i.e. it does not grow if the number of cases grow).

There are two big downsides to this approach which prevent us from always using it. The first one happens when the difference between M and m is large, our jump table will be large. This enlarges the code size. We have, basically, traded time by space. Now our code size will add 4 bytes per case handled in a jump table. A table of 256 entries will take up to 1 Kbyte (1024 bytes) of memory in our executable program. To be fair, this is the amount of space taken by 256 instructions. So if code size is a concern for you (and usually in the embedded world is), this approach may not be suitable. The second big downside happens when there are "holes" in the cases. Imagine our cases are just 1, 3 and 100. The table will be 100 items long but only 1, 3 and 100 will have useful entries: all the remaining entries will have the address of the default case (or the address after the switch if the default case is omitted). In this case we are not just taking 400 bytes, we are wasting 388 bytes (97% of entries would be useless!). So if the number of cases is low and the values are scattered in a large range, jump tables are not a good choice.

Compute the case address

This strategy is a bit complicated and has more constraints than a jump table, so it is less general. If all the cases are ordered and they take the same amount of instructions, we can compute the address of the case without using a jump table. This is risky because we have to be careful when computing the address of the branch using the current value (otherwise we will jump to a wrong address and bad things will happen for sure).

If not all the cases take the same amount of instructions, we can compensate them to take as many instructions as the case with the biggest number of instructions. We can do that using the nop instruction that does nothing but occupy space. If the variance of the number of instructions among cases is small we will just end adding some nops to a few cases. If the variance is large, we may end with code bloat, something we wanted to avoid when using this technique.

If there are holes, we can just branch them to the default case and fill the remaining instructions with nops. Again if the number of holes is large this is prone to code bloat as well.

In our example of the jump table, each case takes just two instructions. So we can get the address of the first case and use it as a base address to compute the branch.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
/* calcjump.s */
.data

.text

.globl main

main:
  cmp r0, #1                  /* r0 - 1 and update cpsr */
  blt case_default            /* branch to case_default if r0 < 1 */
  cmp r0, #3                  /* r0 - 3 and update cpsr */
  bgt case_default            /* branch to case_default if r0 > 3 */

  sub r0, r0, #1              /* r0 ← r0 - 1. Required to index the table */
  ldr r1, addr_of_case_1      /* r1 ← &case_1 */
  add r1, r1, r0, LSL #3      /* r1 ← r1 + r0 * 8
                                 Each instruction is 4 bytes
                                 Each case takes 2 instructions
                                 Thus, each case is 8 bytes (4 * 2)
                                 */

  mov pc, r1                  /* pc ← r1
                                 This will cause a branch to the
                                 computed address */

  case_1:
   mov r0, #1                 /* r0 ← 1 */ 
   b after_switch             /* break */
 
  case_2:
   mov r0, #2                 /* r0 ← 2 */
   b after_switch             /* break */

  case_3:
   mov r0, #3                 /* r0 ← 3 */
   b after_switch             /* break */

  case_default:
   mov r0, #42                /* r0 ← 42 */
   b after_switch             /* break (unnecessary) */  

  after_switch:

  bx lr                       /* Return from main */

.align 4
addr_of_case_1: .word case_1

Binary search

Consider again our example with 100 cases. A string of if-else will require in average 50 comparisons. Can we reduce the number of comparisons? Well, the answer is yes. Perform a binary search of the case.

A binary search will discard half of the case set each time. This will allow us to dramatically reduce the amount of comparisons. The following example implements the same code in the jump table but with cases 1 to 10.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
/* binsearch.s */
.data

.text

.globl main

main:

  cmp r0, #1              /* r0 - 1 and update cpsr */
  blt case_default        /* if r0 < 1 then branch to case_default */
  cmp r0, #10             /* r0 - 10 and update cpsr */
  bgt case_default        /* if r0 > 10 then branch to case default */

  case_1_to_10:
    cmp r0, #5            /* r0 - 5 and update cpsr */
    beq case_5            /* if r0 == 5 branch to case_5 */
    blt case_1_to_4       /* if r0 < 5 branch to case_1_to_4 */
    bgt case_6_to_10      /* if r0 > 5 branch to case_6_to_4 */

  case_1_to_4:
    cmp r0, #2            /* r0 - 2 and update cpsr */
    beq case_2            /* if r0 == 2 branch to case_2 */
    blt case_1            /* if r0 < 2 branch to case_1 
                             (case_1_to_1 does not make sense) */
    bgt case_3_to_4       /* if r0 > 2 branch to case_3_to_4 */

  case_3_to_4:
    cmp r0, #3            /* r0 - 3 and update cpsr */
    beq case_3            /* if r0 == 3 branch to case_3 */
    b case_4              /* otherwise it must be r0 == 4,
                             branch to case_4 */

  case_6_to_10:
    cmp r0, #8            /* r0 - 8 and update cpsr */
    beq case_8            /* if r0 == 8 branch to case_8 */
    blt case_6_to_7       /* if r0 < 8 then branch to case_6_to_7 */
    bgt case_9_to_10      /* if r0 > 8 then branch to case_9_to_10 */

  case_6_to_7:
    cmp r0, #6            /* r0 - 6 and update cpsr */
    beq case_6            /* if r0 == 6 branch to case_6 */
    b case_7              /* otherwise it must be r0 == 7,
                             branch to case 7 */

  case_9_to_10:
    cmp r0, #9            /* r0 - 9 and update cpsr */
    beq case_9            /* if r0 == 9 branch to case_9 */
    b case_10             /* otherwise it must be r0 == 10,
                             branch to case 10 */

  case_1:
     mov r0, #1
     b after_switch
  case_2:
     mov r0, #2
     b after_switch
  .
  .  /* Cases from 3 to 9 omitted */
  .
  case_10:
     mov r0, #10
     b after_switch

  case_default:
   mov r0, #42                /* r0 ← 42 */
   b after_switch             /* break (unnecessary) */

  after_switch:

  bx lr                       /* Return from main */

This strategy is able to determine the case value in just only 3 comparisons (if we ignore the mandated two comparisons for range checking). What we do is we check the compare the case value with the middle one in the current range. This way we can discard half of the sets at every comparison step.

This strategy works well also for scattered case sets like [1, 2, 3, 24, 25, 26, 97, 98, 99, 300]. In this case the comparisons would be

case_1_to_300:
  cmp r0, #25
  beq case_25
  blt case_1_to_24
  bgt case_26_to_300
case_1_to_24:
  cmp r0, #2
  beq case_2
  blt case_1
  bgt case_3_to_24
case_3_to_24:
  cmp r0, #3
  beq case_3
  b case_24
case_26_to_300:
  cmp r0, #98
  beq case_98
  blt case_26_to_97
  bgt case_99_to_300
case_26_to_97:
  cmp r0, #26
  beq case_26
  b case_97
case_99_to_300:
  cmp r0, #99
  beq case_99
  b case_300

which is 3 comparisons at most also.

Using this strategy the number of comparisons is log(N), where N is the number of elements in the case set. So for 10 elements, in the worst of the cases, we will have to do 3 comparisons, for 20 at most 4, for 40 at most 5, etc.

Let's retake the code bloat issue that arised with jump tables. If you check, every comparison requires 3 or 4 instructions, this is about 12 to 16 bytes per comparison. If we have a case set of 256 elements, the generated code will require 128 comparisons blocks in total. While the number of comparisons performed at runtime, 8 in the worst of the cases, we still need 128 case_x_to_y comparison blocks to perform the binary search. If we pessimistically assume that all comparison blocks take 4 instructions this will be 4*128*4 = 2048 bytes in instructions. Compare that to a jump table of 256 positions, each position takes 4 bytes: 256 * 4 = 1024 bytes. So, binary search is not so competitive in terms of code size.

Binary search, thus, is useful for large scattered sets. Recall that if-else strings are not efficient for large sets of cases and jump tables waste space if the case range lacks lots of cases.

Hybrid approach

Is it possible to combine the two strategies? The answer is yes. We will use two tables: a case values table (sorted, usually in ascending order) and addresses for each case in another table, in the same order as the case values table.

We will make a binary search inside the case value set. When the value is found we will use the index of the match to calculate a jump. For the example below we will use the case set [1, 2, 3, 24, 25, 26, 97, 98, 99, 300].

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
/* hybrid.s */
.data

.text

.globl main

main:
  push {r4, r5, r6, lr}

  cmp r0, #1                /* r0 - 1 and update cpsr */
  blt case_default          /* if r0 < 1 then branch to case_default */
  cmp r0, #300              /* r0 - 300 and update cpsr */
  bgt case_default          /* if r0 > 300 then branch to case default */

  /* prepare the binary search. 
     r1 will hold the lower index
     r2 will hold the upper index
     r3 the base address of the case_value_table
  */
  mov r1, #0
  mov r2, #9
  ldr r3, addr_case_value_table /* r3 ← &case_value_table */

  b check_binary_search
  binary_search:
    add r4, r1, r2          /* r4 ← r1 + r2 */
    mov r4, r4, ASR #1      /* r4 ← r4 / 2 */
    ldr r5, [r3, +r4, LSL #2]   /* r5 ← *(r3 + r4 * 4). 
                               This is r5 ← case_value_table[r4] */
    cmp r0, r5              /* r0 - r5 and update cpsr */
    sublt r2, r4, #1        /* if r0 < r5 then r2 ← r4 - 1 */
    addgt r1, r4, #1        /* if r0 > r5 then r1 ← r4 + 1 */
    bne check_binary_search /* if r0 != r5 branch to binary_search */

    /* if we reach here it means that r0 == r5 */
    ldr r5, addr_case_addresses_table /* r5 ← &addr_case_value_table */
    ldr r5, [r5, +r4, LSL #2]   /* r5 ← *(r5 + r4*4) 
                               This is r5 ← case_addresses_table[r4] */
    mov pc, r5              /* branch to the proper case */

  check_binary_search:
    cmp r1, r2              /* r1 - r2 and update cpsr */
    ble binary_search       /* if r1 <= r2 branch to binary_search */

  /* if we reach here it means the case value
     was not found. branch to default case */
  b case_default

    case_1:
     mov r0, #1
     b after_switch
  case_2:
     mov r0, #2
     b after_switch
  case_3:
     mov r0, #3
     b after_switch
  case_24:
     mov r0, #24
     b after_switch
  case_25:
     mov r0, #95
     b after_switch
  case_26:
     mov r0, #96
     b after_switch
  case_97:
     mov r0, #97
     b after_switch
  case_98:
     mov r0, #98
     b after_switch
  case_99:
     mov r0, #99
     b after_switch
  case_300:
     mov r0, #300    /* The error code will be 44 */
     b after_switch

  case_default:
   mov r0, #42       /* r0 ← 42 */
   b after_switch    /* break (unnecessary) */

  after_switch:

  pop {r4,r5,r6,lr}
  bx lr              /* Return from main */

case_value_table: .word 1, 2, 3, 24, 25, 26, 97, 98, 99, 300
addr_case_value_table: .word case_value_table

case_addresses_table:
    .word case_1
    .word case_2
    .word case_3
    .word case_24
    .word case_25
    .word case_26
    .word case_97
    .word case_98
    .word case_99
    .word case_300
addr_case_addresses_table: .word case_addresses_table

In lines 21 to 44 we implement the binary search. This implementation is an iterative binary search where r1 and r2 keep the lower and upper indexes of the table that is currently searched. We will leave the search if the lower index becomes larger than the upper, lines 42 to 44. When searching the range given by r1 and r2, we will compute r4 as the middle index (r1+r2)/2, lines 27 to 28. We will compare it to the current case value being searched, in r0, line 31. If the value, r5, in the case value table (which must be in ascending order) is lower than the current case value being searched, then we shrink the range from r1 to r4-1, so we update r2 only if r0 < r5, line 32. Conversely if r0 > r5 then we shrink the range from r4+1 to r2, line 33. If the value of r5 matches, then we use the index r4 to load the case address and branch to it, lines 37 to 40. Note that if r0 is different to r5, we have to omit this step so we branch to the check of the loop, line 34.

You can check that this works.

$ ./hybrid ; echo $?
1
$ ./hybrid 2 ; echo $?
2
$ ./hybrid 2 3 ; echo $?
3
$ ./hybrid 2 3 4 ; echo $?
42
$ ./hybrid 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ; echo $?
24

This approach has several interesting properties. It reduces the number of comparisons (we will make the same number of comparisons as in the binary search) and avoids the code bloat due to big jump tables (avoiding useless entries) and comparison blocks (by using a loop). As a drawback, this approach, requires two tables.

That's all for today.