ARM assembler in Raspberry Pi

It may be suprising, but the ARMv6 architecture does not provide an integer division instruction while it does have a floating point instruction in VFPv2. In this chapter we will see usual ways to workaround this limitation with different techniques that can be used in specific scenarios involving divisions.

What does integer division mean?

First we should clearly define what we understand by integer division. Given two integer numbers N (for numerator) and D (for denominator, different than zero) we define the integer division of N and D as the pair of integer numbers Q (for quotient) and R (for remainder) satisfying the following equality.

N = D × Q + R where 0 ≤ |R| < |D|

The equality implies that there are two solutions 0 < R < |D| and 0 < |-R| < |D|. For example, N=7 and D=3 has two solutions (Q=2, R=1) and (Q=3, R=-2). While both solutions may be useful, the former one is the preferred as it is closer to our natural notion of the remainder. But what if D is negative? For example N=7 and D=-3 has two solutions as well (Q=-2, R=1) and (Q=-3, R=-2). When negative numbers are involved the choice of the remainder is not intuitive but conventional. Many conventions can be used to choose one solution. We can always pick the solution with the positive remainder (this is called euclidean division), or the negative, or the solution where the sign of the remainder matches the numerator (or the denominator).

Most computers perform an integer division where the remainder has the same sign as the numerator. So for N=7 and D=3 the computed solution is (Q=2, R=1) and for N=7 and D=-3 the computed solution is (Q=-2, R=1). We will assume such integer division convention in the remainder (no pun intended) of this post.

Unsigned division

An unsigned integer division is an integer division involving two unsigned integers N and D. This has the consequence that Q and R will always be positive. A very naive (and slow) approach for unsigned division is as follows.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
unsigned_naive_div:
    /* r0 contains N */
    /* r1 contains D */
    mov r2, r1             /* r2 ← r0. We keep D in r2 */
    mov r1, r0             /* r1 ← r0. We keep N in r1 */

    mov r0, #0             /* r0 ← 0. Set Q = 0 initially */

    b .Lloop_check
    .Lloop:
       add r0, r0, #1      /* r0 ← r0 + 1. Q = Q + 1 */
       sub r1, r1, r2      /* r1 ← r1 - r2 */
    .Lloop_check:
       cmp r1, r2          /* compute r1 - r2 */
       bhs .Lloop            /* branch if r1 >= r2 (C=0 or Z=1) */

    /* r0 already contains Q */
    /* r1 already contains R */
    bx lr

This algorithm, while correct and easy to understand, is not very efficient (think on dividing a big N with a small D). Is there a way that we can compute the division in a fixed amount of time? The answer is yes, just adapt how you divide by hand but to binary numbers. We will compute a temporary remainder picking bits, from left to right, of the dividend. When the remainder is larger than the divisor, we will subtract the divisor from that remainder and set the appropiate bit in the quotient.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
unsigned_longdiv:
    /* r0 contains N */
    /* r1 contains D */
    /* r2 contains Q */
    /* r3 contains R */
    push {r4, lr}
    mov r2, #0                 /* r2 ← 0 */
    mov r3, #0                 /* r3 ← 0 */

    mov r4, #32                /* r4 ← 32 */
    b .Lloop_check1
    .Lloop1:
        movs r0, r0, LSL #1    /* r0 ← r0 << 1 updating cpsr (sets C if 31st bit of r0 was 1) */
        adc r3, r3, r3         /* r3 ← r3 + r3 + C. This is equivalent to r3 ← (r3 << 1) + C */

        cmp r3, r1             /* compute r3 - r1 and update cpsr */
        subhs r3, r3, r1       /* if r3 >= r1 (C=1) then r3 ← r3 - r1 */
        adc r2, r2, r2         /* r2 ← r2 + r2 + C. This is equivalent to r2 ← (r2 << 1) + C */
    .Lloop_check1:
        subs r4, r4, #1        /* r4 ← r4 - 1 */
        bpl .Lloop1            /* if r4 >= 0 (N=0) then branch to .Lloop1 */

    pop {r4, lr}
    bx lr

This approach is a bit more efficient since it repeats the loop a fixed number of times (always 32). For each bit of N starting from the most significant one (line 13), we push it to the right of the current value of R (line 14). We do this by adding R to itself, this is 2*R which is actually shifting to the right R 1 bit. Then we add the carry, that will be 1 if the most significant bit of N before the shift (line 13) was 1. Then we check if the current R is already bigger than D (line 16) If so we subtract N from R, R ← R - N (line 17) and then we push a 1 to the right of Q (line 18), again by adding Q to itself plus the carry set by the comparison (if R >= N then there is no borrow so C became 1 in cmp of line 16).

The shown code is fine but it can be improved in a number of ways. First, there is no need to check all the bits of a number (although this gives as an upper bound of the cost in the worst of the cases). Second, we should try hard to reduce the number of used registers. Here we are using 5 registers, is there a way we can use less registers? For this we will have to use a slightly different approach.

Given N and D, we will first shift D as many bits to the left as possible but always having N > D. So, for instance if we divide N=1010₍₂ by D=10₍₂ we would adjust D until it was D₀=1000₍₂ (this is shifting twice to the left). Now we start a similar process to the one above: if N_i ≥ D_i, we set 1 to the lowest bit of Q and then we compute a new N_i+1 ← N_i - D_i and a new D_i+1 ← D_i/2. If N_i < D_i then we simply compute a new D_i+1 ← D_i/2. We stop when the current D_i is smaller than the initial D (not D₀). Note that this condition is what makes dividing N=1010₍₂ by D=10₍₂ different that dividing N=1010₍₂ by D=1₍₂ although the D₀ of both cases is the same.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
better_unsigned_division :
    /* r0 contains N and Ni */
    /* r1 contains D */
    /* r2 contains Q */
    /* r3 will contain Di */

    mov r3, r1                   /* r3 ← r1 */
    cmp r3, r0, LSR #1           /* update cpsr with r3 - r0/2 */
    .Lloop2:
      movls r3, r3, LSL #1       /* if r3 <= 2*r0 (C=0 or Z=1) then r3 ← r3*2 */
      cmp r3, r0, LSR #1         /* update cpsr with r3 - (r0/2) */
      bls .Lloop2                /* branch to .Lloop2 if r3 <= 2*r0 (C=0 or Z=1) */

    mov r2, #0                   /* r2 ← 0 */

    .Lloop3:
      cmp r0, r3                 /* update cpsr with r0 - r3 */
      subhs r0, r0, r3           /* if r0 >= r3 (C=1) then r0 ← r0 - r3 */
      adc r2, r2, r2             /* r2 ← r2 + r2 + C.
                                    Note that if r0 >= r3 then C=1, C=0 otherwise */

      mov r3, r3, LSR #1         /* r3 ← r3/2 */
      cmp r3, r1                 /* update cpsr with r3 - r1 */
      bhs .Lloop3                /* if r3 >= r1 branch to .Lloop3 */
   
    bx lr

We can avoid the first loop where we shift until we exceed by counting the leading zeroes. By counting the leading zeroes of the dividend and the divisor we can straightforwardly compute how many bits we need to shift the divisor.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
clz_unsigned_division:
    clz  r3, r0                /* r3 ← CLZ(r0) Count leading zeroes of N */
    clz  r2, r1                /* r2 ← CLZ(r1) Count leading zeroes of D */
    sub  r3, r2, r3            /* r3 ← r2 - r3. 
                                 This is the difference of zeroes
                                 between D and N. 
                                 Note that N >= D implies CLZ(N) <= CLZ(D)*/
    add r3, r3, #1             /* Loop below needs an extra iteration count */

    mov r2, #0                 /* r2 ← 0 */
    b .Lloop_check4
    .Lloop4:
      cmp r0, r1, lsl r3       /* Compute r0 - (r1 << r3) and update cpsr */
      adc r2, r2, r2           /* r2 ← r2 + r2 + C.
                                  Note that if r0 >= (r1 << r3) then C=1, C=0 otherwise */
      subcs r0, r0, r1, lsl r3 /* r0 ← r0 - (r1 << r3) if C = 1 (this is, only if r0 >= (r1 << r3) ) */
    .Lloop_check4:
      subs r3, r3, #1          /* r3 ← r3 - 1 */
      bpl .Lloop4              /* if r3 >= 0 (N=0) then branch to .Lloop1 */

    mov r0, r2
    bx lr

Signed division

Signed division can be computed with an unsigned division but taking care of the signs. We can first compute |N|/|D| (this is, ignoring the signs of N and D), this will yield a quotient Q₊ and remainder R₊. If signs of N and D are different then Q = -Q₊. If N < 0, then R = -R₊, as we said at the beginning of the post.

Powers of two

An unsigned division by a power of two 2^N is as simple as doing a logical shift right of N bits. Conversely, a signed division by a power of two 2^N is as simple as doing an arithmetic shift right of N bits. We can use mov and the addressing modes LSR and ASR for this. This case is ideal because it is extremely fast.

Division by a constant integer

When we divide a number by a constant, we can use a multiplication by a magic number to compute the division. All the details and the theory of this technique is too long to write it here but you can find it in chapter 10 of Hacker's Delight. We can summarize it, though, into three values: a magic constant M, a shift S and an additional flag. The author set up a magic number calculator that computes these numbers.

It is not obvious how to properly use these magic numbers, so I crafted a small Python script which emits code for the signed and the unsigned case. Imagine you want to divide an unsigned number by 14. So let's ask our script.

$ ./magic.py 14 code_for_unsigned
u_divide_by_14:
   /* r0 contains the argument to be divided by 14 */
   ldr r1, .Lu_magic_number_14 /* r1 ← magic_number */
   umull r1, r2, r1, r0   /* r1 ← Lower32Bits(r1*r0). r2 ← Upper32Bits(r1*r0) */
   adds r2, r2, r0        /* r2 ← r2 + r0 updating cpsr */
   mov r2, r2, ROR #0     /* r2 ← (carry_flag << 31) | (r2 >> 1) */
   mov r0, r2, LSR #4     /* r0 ← r2 >> 4 */
   bx lr                  /* leave function */
   .align 4
   .Lu_magic_number_14: .word 0x24924925

Similarly we can ask for the signed version:

$ ./magic.py 14 code_for_signed
s_divide_by_14:
   /* r0 contains the argument to be divided by 14 */
   ldr r1, .Ls_magic_number_14 /* r1 ← magic_number */
   smull r1, r2, r1, r0   /* r1 ← Lower32Bits(r1*r0). r2 ← Upper32Bits(r1*r0) */
   add r2, r2, r0         /* r2 ← r2 + r0 */
   mov r2, r2, ASR #3     /* r2 ← r2 >> 3 */
   mov r1, r0, LSR #31    /* r1 ← r0 >> 31 */
   add r0, r2, r1         /* r0 ← r2 + r1 */
   bx lr                  /* leave function */
   .align 4
   .Ls_magic_number_14: .word 0x92492493

As an example I have used it to implement a small program that just divides the user input by 14.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
/* -- divideby14.s */

.data

.align 4
read_number: .word 0

.align 4
message1 : .asciz "Enter an integer to divide it by 14: "

.align 4
message2 : .asciz "Number %d (signed-)divided by 14 is %d\n"

.align 4
scan_format : .asciz "%d"

.text

/* This function has been generated using "magic.py 14 code_for_signed" */
s_divide_by_14:
   /* r0 contains the argument to be divided by 14 */
   ldr r1, .Ls_magic_number_14 /* r1 ← magic_number */
   smull r1, r2, r1, r0   /* r1 ← Lower32Bits(r1*r0). r2 ← Upper32Bits(r1*r0) */
   add r2, r2, r0         /* r2 ← r2 + r0 */
   mov r2, r2, ASR #3     /* r2 ← r2 >> 3 */
   mov r1, r0, LSR #31    /* r1 ← r0 >> 31 */
   add r0, r2, r1         /* r0 ← r2 + r1 */
   bx lr                  /* leave function */
   .align 4
   .Ls_magic_number_14: .word 0x92492493

.globl main

main:
    /* Call printf */
    push {r4, lr}
    ldr r0, addr_of_message1       /* r0 ← &message */
    bl printf

    /* Call scanf */
    ldr r0, addr_of_scan_format   /* r0 ← &scan_format */
    ldr r1, addr_of_read_number   /* r1 ← &read_number */
    bl scanf

    ldr r0, addr_of_read_number   /* r1 ← &read_number */
    ldr r0, [r0]                  /* r1 ← *r1 */

    bl s_divide_by_14
    mov r2, r0

    ldr r1, addr_of_read_number   /* r1 ← &read_number */
    ldr r1, [r1]                  /* r1 ← *r1 */

    ldr r0, addr_of_message2      /* r0 ← &message2 */
    bl printf                     /* Call printf, r1 and r2 already
                                     contain the desired values */
    pop {r4, lr}
    mov r0, #0
    bx lr

addr_of_message1: .word message1
addr_of_scan_format: .word scan_format
addr_of_message2: .word message2
addr_of_read_number: .word read_number

Using VFPv2

I would not recommend using this technique. I present it here for the sake of completeness. We simply convert our integers to floating point numbers, divide them as floating point numbers and convert the result back to an integer.

1
2
3
4
5
6
7
8
9
10
11
vfpv2_division:
    /* r0 contains N */
    /* r1 contains D */
    vmov s0, r0             /* s0 ← r0 (bit copy) */
    vmov s1, r1             /* s1 ← r1 (bit copy) */
    vcvt.f32.s32 s0, s0     /* s0 ← (float)s0 */
    vcvt.f32.s32 s1, s1     /* s1 ← (float)s1 */
    vdiv.f32 s0, s0, s1     /* s0 ← s0 / s1 */
    vcvt.s32.f32 s0, s0     /* s0 ← (int)s0 */
    vmov r0, s0             /* r0 ← s0 (bit copy). Now r0 is Q */
    bx lr

Comparing versions

After a comment below, I thought it would be interesting to benchmark the general division algorithm. The benchmark I used is the following:

.set MAX, 16384
main:
    push {r4, r5, r6, lr}

    mov r4, #1                         /* r4 ← 1 */

    b .Lcheck_loop_i                   /* branch to .Lcheck_loop_i */
    .Lloop_i:
       mov r5, r4                      /* r5 ← r4 */
       b .Lcheck_loop_j                /* branch to .Lcheck_loop_j */
       .Lloop_j:

         mov r0, r5                    /* r0 ← r5. This is N */
         mov r1, r4                    /* r1 ← r4. This is D */

         bl  <your unsigned division routine here>

         add r5, r5, #1
       .Lcheck_loop_j:
         cmp r5, #MAX                   /* compare r5 and MAX */
         bne .Lloop_j                  /* if r5 != 10 branch to .Lloop_j */
       add r4, r4, #1
    .Lcheck_loop_i:
      cmp r4, #MAX                     /* compare r4 and MAX */
      bne .Lloop_i                     /* if r4 != 10 branch to .Lloop_i */
       
    mov r0, #0
      
    pop {r4, r5, r6, lr}               
    bx lr

Basically it does something like this

for (i = 1; i < MAX; i++)
 for (j = i; j < MAX; j++)
   division_function(j, i);

Timings have been obtained using perf_3.2 stat --repeat=5 -e cpu-clock. In the table below, __aeabi_uidiv is the function in libgcc that gcc uses to implement an unsigned integer division.

Version	Time (seconds)
unsigned_longdiv	45,43
vfpv2_division	9,70
clz_unsigned_longdiv	8,48
__aeabi_uidiv	7,37
better_unsigned_longdiv	6,67

As you can see the performance of our unsigned long division is dismal. The reason it is that it always checks all the bits. The libgcc version is like our clz version but the loop has been fully unrolled and there is a computed branch, similar to a Duff's device. Unfortunately, I do not have a convincing explanation why better_unsigned_longdiv runs faster than the other versions, because the code, a priori, looks worse to me.

That's all for today.