Think In Geek

In geek we trust

# ARM assembler in Raspberry Pi – Chapter 15

It may be suprising, but the ARMv6 architecture does not provide an integer division instruction while it does have a floating point instruction in VFPv2. In this chapter we will see usual ways to workaround this limitation with different techniques that can be used in specific scenarios involving divisions.

## What does integer division mean?

First we should clearly define what we understand by integer division. Given two integer numbers N (for numerator) and D (for denominator, different than zero) we define the integer division of N and D as the pair of integer numbers Q (for quotient) and R (for remainder) satisfying the following equality.

`N = D × Q + R` where `0 ≤ |R| < |D|`

The equality implies that there are two solutions `0 < R < |D|` and `0 < |-R| < |D|`. For example, `N=7` and `D=3` has two solutions `(Q=2, R=1)` and `(Q=3, R=-2)`. While both solutions may be useful, the former one is the preferred as it is closer to our natural notion of the remainder. But what if `D` is negative? For example `N=7` and `D=-3` has two solutions as well `(Q=-2, R=1)` and `(Q=-3, R=-2)`. When negative numbers are involved the choice of the remainder is not intuitive but conventional. Many conventions can be used to choose one solution. We can always pick the solution with the positive remainder (this is called euclidean division), or the negative, or the solution where the sign of the remainder matches the numerator (or the denominator).

Most computers perform an integer division where the remainder has the same sign as the numerator. So for `N=7` and `D=3` the computed solution is `(Q=2, R=1)` and for `N=7` and `D=-3` the computed solution is `(Q=-2, R=1)`. We will assume such integer division convention in the remainder (no pun intended) of this post.

## Unsigned division

An unsigned integer division is an integer division involving two unsigned integers N and D. This has the consequence that Q and R will always be positive. A very naive (and slow) approach for unsigned division is as follows.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 unsigned_naive_div: /* r0 contains N */ /* r1 contains D */ mov r2, r1 /* r2 ← r0. We keep D in r2 */ mov r1, r0 /* r1 ← r0. We keep N in r1 */   mov r0, #0 /* r0 ← 0. Set Q = 0 initially */   b .Lloop_check .Lloop: add r0, r0, #1 /* r0 ← r0 + 1. Q = Q + 1 */ sub r1, r1, r2 /* r1 ← r1 - r2 */ .Lloop_check: cmp r1, r2 /* compute r1 - r2 */ bhs .Lloop /* branch if r1 >= r2 (C=0 or Z=1) */   /* r0 already contains Q */ /* r1 already contains R */ bx lr```

This algorithm, while correct and easy to understand, is not very efficient (think on dividing a big N with a small D). Is there a way that we can compute the division in a fixed amount of time? The answer is yes, just adapt how you divide by hand but to binary numbers. We will compute a temporary remainder picking bits, from left to right, of the dividend. When the remainder is larger than the divisor, we will subtract the divisor from that remainder and set the appropiate bit in the quotient.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 unsigned_longdiv: /* r0 contains N */ /* r1 contains D */ /* r2 contains Q */ /* r3 contains R */ push {r4, lr} mov r2, #0 /* r2 ← 0 */ mov r3, #0 /* r3 ← 0 */   mov r4, #32 /* r4 ← 32 */ b .Lloop_check1 .Lloop1: movs r0, r0, LSL #1 /* r0 ← r0 << 1 updating cpsr (sets C if 31st bit of r0 was 1) */ adc r3, r3, r3 /* r3 ← r3 + r3 + C. This is equivalent to r3 ← (r3 << 1) + C */   cmp r3, r1 /* compute r3 - r1 and update cpsr */ subhs r3, r3, r1 /* if r3 >= r1 (C=1) then r3 ← r3 - r1 */ adc r2, r2, r2 /* r2 ← r2 + r2 + C. This is equivalent to r2 ← (r2 << 1) + C */ .Lloop_check1: subs r4, r4, #1 /* r4 ← r4 - 1 */ bpl .Lloop1 /* if r4 >= 0 (N=0) then branch to .Lloop1 */   pop {r4, lr} bx lr```

This approach is a bit more efficient since it repeats the loop a fixed number of times (always 32). For each bit of N starting from the most significant one (line 13), we push it to the right of the current value of R (line 14). We do this by adding R to itself, this is 2*R which is actually shifting to the right R 1 bit. Then we add the carry, that will be 1 if the most significant bit of N before the shift (line 13) was 1. Then we check if the current R is already bigger than D (line 16) If so we subtract N from R, R ← R - N (line 17) and then we push a 1 to the right of Q (line 18), again by adding Q to itself plus the carry set by the comparison (if R >= N then there is no borrow so C became 1 in `cmp` of line 16).

The shown code is fine but it can be improved in a number of ways. First, there is no need to check all the bits of a number (although this gives as an upper bound of the cost in the worst of the cases). Second, we should try hard to reduce the number of used registers. Here we are using 5 registers, is there a way we can use less registers? For this we will have to use a slightly different approach.

Given N and D, we will first shift D as many bits to the left as possible but always having N > D. So, for instance if we divide N=1010(2 by D=10(2 we would adjust D until it was D0=1000(2 (this is shifting twice to the left). Now we start a similar process to the one above: if Ni ≥ Di, we set 1 to the lowest bit of Q and then we compute a new Ni+1 ← Ni - Di and a new Di+1 ← Di/2. If Ni < Di then we simply compute a new Di+1 ← Di/2. We stop when the current Di is smaller than the initial D (not D0). Note that this condition is what makes dividing N=1010(2 by D=10(2 different that dividing N=1010(2 by D=1(2 although the D0 of both cases is the same.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 better_unsigned_division : /* r0 contains N and Ni */ /* r1 contains D */ /* r2 contains Q */ /* r3 will contain Di */   mov r3, r1 /* r3 ← r1 */ cmp r3, r0, LSR #1 /* update cpsr with r3 - r0/2 */ .Lloop2: movls r3, r3, LSL #1 /* if r3 <= 2*r0 (C=0 or Z=1) then r3 ← r3*2 */ cmp r3, r0, LSR #1 /* update cpsr with r3 - (r0/2) */ bls .Lloop2 /* branch to .Lloop2 if r3 <= 2*r0 (C=0 or Z=1) */   mov r2, #0 /* r2 ← 0 */   .Lloop3: cmp r0, r3 /* update cpsr with r0 - r3 */ subhs r0, r0, r3 /* if r0 >= r3 (C=1) then r0 ← r0 - r3 */ adc r2, r2, r2 /* r2 ← r2 + r2 + C. Note that if r0 >= r3 then C=1, C=0 otherwise */   mov r3, r3, LSR #1 /* r3 ← r3/2 */ cmp r3, r1 /* update cpsr with r3 - r1 */ bhs .Lloop3 /* if r3 >= r1 branch to .Lloop3 */   bx lr```

We can avoid the first loop where we shift until we exceed by counting the leading zeroes. By counting the leading zeroes of the dividend and the divisor we can straightforwardly compute how many bits we need to shift the divisor.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 clz_unsigned_division: clz r3, r0 /* r3 ← CLZ(r0) Count leading zeroes of N */ clz r2, r1 /* r2 ← CLZ(r1) Count leading zeroes of D */ sub r3, r2, r3 /* r3 ← r2 - r3. This is the difference of zeroes between D and N. Note that N >= D implies CLZ(N) <= CLZ(D)*/ add r3, r3, #1 /* Loop below needs an extra iteration count */   mov r2, #0 /* r2 ← 0 */ b .Lloop_check4 .Lloop4: cmp r0, r1, lsl r3 /* Compute r0 - (r1 << r3) and update cpsr */ adc r2, r2, r2 /* r2 ← r2 + r2 + C. Note that if r0 >= (r1 << r3) then C=1, C=0 otherwise */ subcs r0, r0, r1, lsl r3 /* r0 ← r0 - (r1 << r3) if C = 1 (this is, only if r0 >= (r1 << r3) ) */ .Lloop_check4: subs r3, r3, #1 /* r3 ← r3 - 1 */ bpl .Lloop4 /* if r3 >= 0 (N=0) then branch to .Lloop1 */   mov r0, r2 bx lr```

## Signed division

Signed division can be computed with an unsigned division but taking care of the signs. We can first compute |N|/|D| (this is, ignoring the signs of `N` and `D`), this will yield a quotient Q+ and remainder R+. If signs of N and D are different then Q = -Q+. If N < 0, then R = -R+, as we said at the beginning of the post.

## Powers of two

An unsigned division by a power of two 2N is as simple as doing a logical shift right of N bits. Conversely, a signed division by a power of two 2N is as simple as doing an arithmetic shift right of N bits. We can use `mov` and the addressing modes `LSR` and `ASR` for this. This case is ideal because it is extremely fast.

## Division by a constant integer

When we divide a number by a constant, we can use a multiplication by a magic number to compute the division. All the details and the theory of this technique is too long to write it here but you can find it in chapter 10 of Hacker's Delight. We can summarize it, though, into three values: a magic constant M, a shift S and an additional flag. The author set up a magic number calculator that computes these numbers.

It is not obvious how to properly use these magic numbers, so I crafted a small Python script which emits code for the signed and the unsigned case. Imagine you want to divide an unsigned number by 14. So let's ask our script.

```\$ ./magic.py 14 code_for_unsigned u_divide_by_14: /* r0 contains the argument to be divided by 14 */ ldr r1, .Lu_magic_number_14 /* r1 ← magic_number */ umull r1, r2, r1, r0 /* r1 ← Lower32Bits(r1*r0). r2 ← Upper32Bits(r1*r0) */ adds r2, r2, r0 /* r2 ← r2 + r0 updating cpsr */ mov r2, r2, ROR #0 /* r2 ← (carry_flag << 31) | (r2 >> 1) */ mov r0, r2, LSR #4 /* r0 ← r2 >> 4 */ bx lr /* leave function */ .align 4 .Lu_magic_number_14: .word 0x24924925```

Similarly we can ask for the signed version:

```\$ ./magic.py 14 code_for_signed s_divide_by_14: /* r0 contains the argument to be divided by 14 */ ldr r1, .Ls_magic_number_14 /* r1 ← magic_number */ smull r1, r2, r1, r0 /* r1 ← Lower32Bits(r1*r0). r2 ← Upper32Bits(r1*r0) */ add r2, r2, r0 /* r2 ← r2 + r0 */ mov r2, r2, ASR #3 /* r2 ← r2 >> 3 */ mov r1, r0, LSR #31 /* r1 ← r0 >> 31 */ add r0, r2, r1 /* r0 ← r2 + r1 */ bx lr /* leave function */ .align 4 .Ls_magic_number_14: .word 0x92492493```

As an example I have used it to implement a small program that just divides the user input by 14.

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 /* -- divideby14.s */   .data   .align 4 read_number: .word 0   .align 4 message1 : .asciz "Enter an integer to divide it by 14: "   .align 4 message2 : .asciz "Number %d (signed-)divided by 14 is %d\n"   .align 4 scan_format : .asciz "%d"   .text   /* This function has been generated using "magic.py 14 code_for_signed" */ s_divide_by_14: /* r0 contains the argument to be divided by 14 */ ldr r1, .Ls_magic_number_14 /* r1 ← magic_number */ smull r1, r2, r1, r0 /* r1 ← Lower32Bits(r1*r0). r2 ← Upper32Bits(r1*r0) */ add r2, r2, r0 /* r2 ← r2 + r0 */ mov r2, r2, ASR #3 /* r2 ← r2 >> 3 */ mov r1, r0, LSR #31 /* r1 ← r0 >> 31 */ add r0, r2, r1 /* r0 ← r2 + r1 */ bx lr /* leave function */ .align 4 .Ls_magic_number_14: .word 0x92492493   .globl main   main: /* Call printf */ push {r4, lr} ldr r0, addr_of_message1 /* r0 ← &message */ bl printf   /* Call scanf */ ldr r0, addr_of_scan_format /* r0 ← &scan_format */ ldr r1, addr_of_read_number /* r1 ← &read_number */ bl scanf   ldr r0, addr_of_read_number /* r1 ← &read_number */ ldr r0, [r0] /* r1 ← *r1 */   bl s_divide_by_14 mov r2, r0   ldr r1, addr_of_read_number /* r1 ← &read_number */ ldr r1, [r1] /* r1 ← *r1 */   ldr r0, addr_of_message2 /* r0 ← &message2 */ bl printf /* Call printf, r1 and r2 already contain the desired values */ pop {r4, lr} mov r0, #0 bx lr   addr_of_message1: .word message1 addr_of_scan_format: .word scan_format addr_of_message2: .word message2 addr_of_read_number: .word read_number```

## Using VFPv2

I would not recommend using this technique. I present it here for the sake of completeness. We simply convert our integers to floating point numbers, divide them as floating point numbers and convert the result back to an integer.

```1 2 3 4 5 6 7 8 9 10 11 vfpv2_division: /* r0 contains N */ /* r1 contains D */ vmov s0, r0 /* s0 ← r0 (bit copy) */ vmov s1, r1 /* s1 ← r1 (bit copy) */ vcvt.f32.s32 s0, s0 /* s0 ← (float)s0 */ vcvt.f32.s32 s1, s1 /* s1 ← (float)s1 */ vdiv.f32 s0, s0, s1 /* s0 ← s0 / s1 */ vcvt.s32.f32 s0, s0 /* s0 ← (int)s0 */ vmov r0, s0 /* r0 ← s0 (bit copy). Now r0 is Q */ bx lr```

## Comparing versions

After a comment below, I thought it would be interesting to benchmark the general division algorithm. The benchmark I used is the following:

```.set MAX, 16384 main: push {r4, r5, r6, lr}   mov r4, #1 /* r4 ← 1 */   b .Lcheck_loop_i /* branch to .Lcheck_loop_i */ .Lloop_i: mov r5, r4 /* r5 ← r4 */ b .Lcheck_loop_j /* branch to .Lcheck_loop_j */ .Lloop_j:   mov r0, r5 /* r0 ← r5. This is N */ mov r1, r4 /* r1 ← r4. This is D */   bl <your unsigned division routine here>   add r5, r5, #1 .Lcheck_loop_j: cmp r5, #MAX /* compare r5 and MAX */ bne .Lloop_j /* if r5 != 10 branch to .Lloop_j */ add r4, r4, #1 .Lcheck_loop_i: cmp r4, #MAX /* compare r4 and MAX */ bne .Lloop_i /* if r4 != 10 branch to .Lloop_i */   mov r0, #0   pop {r4, r5, r6, lr} bx lr```

Basically it does something like this

```for (i = 1; i < MAX; i++) for (j = i; j < MAX; j++) division_function(j, i);```

Timings have been obtained using `perf_3.2 stat --repeat=5 -e cpu-clock`. In the table below, `__aeabi_uidiv` is the function in `libgcc` that `gcc` uses to implement an unsigned integer division.

Version Time (seconds)
unsigned_longdiv 45,43
vfpv2_division 9,70
clz_unsigned_longdiv 8,48
__aeabi_uidiv 7,37
better_unsigned_longdiv 6,67

As you can see the performance of our unsigned long division is dismal. The reason it is that it always checks all the bits. The libgcc version is like our clz version but the loop has been fully unrolled and there is a computed branch, similar to a Duff's device. Unfortunately, I do not have a convincing explanation why `better_unsigned_longdiv` runs faster than the other versions, because the code, a priori, looks worse to me.

That's all for today.

### 19 thoughts on “ARM assembler in Raspberry Pi – Chapter 15”

• Bob Wilkinson says:

as ever – thanks for this great article!

• Bob Wilkinson says:

in better_unsigned_division lines 8 and 11 viz.

cmp r3, r0, LSR #1 /* update cpsr with r3 – 2*r0 */

a logical shift right is division not multiplication? so the comment should be r3 – (r0/2)?

• rferrer says:

Hi Bob,

you are right.

I fixed it. Thank you very much.

Kind regards,

• Ian says:

Thanks for the incredible blog!

Just one question – you say you would not recommend using VFPv2. Although the code looks very straightforward, is this approach just too slow?

I am curious now to see what the C compiler emits when asked to do an integer divide.

• rferrer says:

Hi Ian,

my intuition tells me that it should be slower. That said I will add some benchmarks to verify this, because I could be wrong.

Respect to what the C compiler emits: if the divisor is constant it uses the magic number approach (or a shift right for the case of a power of two). If the divisor is non-constant then it simply makes a call to a runtime function. That function in gcc is implemented in libgcc and it is a fully unrolled version (written in assembler) of an algorithm similar to the ones shown above.

Kind regards,

• rferrer says:

Hi Ian,

I have updated the post with some benchmarks and another implementation.

The VFPv2 code is not faster than any integer counterpart. That said, it is not awfully slow, only 1.3X times compared to the libgcc version.

Kind regards,

• f says:

if i want to use __aeabi_uidiv function in asm, which library should i link ?

• rferrer says:

Just link with `libgcc`, but this should happen automatically if you link using the `gcc` driver.

Kind regards,

• Filippo says:

I think that in clz_unsigned_division you could change line 4 with subs, remove line 8 and swap line 17-18

• Roger Ferrer Ibáñez says:

Hi Filippo,

sounds about right I’ll try that and then update the benchmark results at the end.

Thanks a lot!

Regards

• theking says:

Where should i put the floating point code to make it work? Also instead of using 14 as the divisor what code should i do to ask the user to input it?

• Roger Ferrer Ibáñez says:

Hi,

not sure to understand your question. What do you mean by “where”? Adding it to the assembler file and calling the function should do.

If you are asking the divisor to the user you will have to use the general algorithm. The example with 14 presumes that the divisor is known to be 14. Asking it to the user would make it unknown.

Kind regards,
Roger

• vincent LeBoulou says:

Super blog with lots of interesting information but a python script to slice the magic numbers !! uh uh uh !!!!
here is an arm assembler program for rapsberry:
compil with as -o calculmagic.o calculmagic.s
link with ld -o calculmagic calculmagic.o -e main

Vincent

• Roger Ferrer Ibáñez says:

Thanks a lot for your kind words Vincent 🙂

• Muhammad says:

Hi i’m novice to assembly and i’m looking for a program to compute and display the GCD and LCM of two integers using r0 and r1 with call funtion bl and bx

here is a working GCD code that i got working but left with the LCM part:
.text
.global main

// Usage of the function: gcd(x,y), with the gcd returned in R0.
// Parameters: Register R0 must contain x, and register R1 must contain y.
gcd:
bal mod
cmp r0, #0
bne gcd
// end of gcd function

// Utility modulus function for you to use in your gcd function:
// Usage: mod(x,y), returns the modulus of x and y in register R0.
// Parameters: Register R0 must contain x, and register R1 must contain y.
mod: push {LR}
mov R3, R1
mov R1, #0
mov R2, #1
//Shift the denominator left until greater than numerator, then shift back
_shift_left:
//R3<<=1; //Denominator shift left
mov r3, r3, asl #1
//R2<R3)goto _shift_left;//Shift Left until Decrement/Division Greater than Numerator
cmp r0, r3 // compare r0 to r3
bgt _shift_left // branch if greater than to _shift_left label
//R3>>=1; //Shift Denominator right
mov r3, r3, asr #1
//R2>>=1; //Shift Division right
mov r2, r2, asr #1
//Loop and keep subtracting off the shifted Denominator
_subtract: //if(R0<R3)goto _done;
cmp r0, r3 // compare r0 to r3
blt _done // branch if less than to _done
//R1+=R2; //Increment division by the increment
//R0-=R3; //Subtract shifted Denominator with remainder of Numerator
subs r0, r3 //Shift right until denominator is less than numerator
_shift_right:
//if(R2==1) goto _subtract;
cmp r2, #1 // compare r2 to #1
beq _subtract // branch if equal to _subtract
//if(R3>=1; //Shift Increment left
mov r2, r2, asr #1
//R3>>=1; //Shift Denominator left
mov r3, r3, asr #1
//goto _shift_right; //Shift Denominator until less than Numerator
bal _shift_right
//goto _subtract; //Keep Looping until the division is complete
bal _subtract
_done: pop {PC}
// end of mod function

main:
push {LR}
ldr R0, =welcome
bl printf

ldr R0, =prompt_a
bl printf

ldr R0, =scan_pat
ldr R1, =a
bl scanf

ldr r0, =prompt_b
bl printf

ldr r0, =scan_pat
ldr r1, =b
bl scanf

ldr r0, =a
ldr r0, [r0]
ldr r1, =b
ldr r1, [r1]
bl gcd
ldr r1, =gcd_result
str r0, [r1]

ldr r0, =gcd_msg
ldr r1, =a
ldr r1, [r1]
ldr r2, =b
ldr r2, [r2]
ldr r3, =gcd_result
ldr r3, [r3]
bl printf

mov r0, #0
pop {PC}
// end of main function

.data
welcome: .asciz “Welcome to the GCD Program!\n”
prompt_a: .asciz “Enter a value for A: ”
prompt_b: .asciz “Enter a value for B: ”
gcd_msg: .asciz “gcd(%d,%d) = %d\n”
scan_pat: .asciz “%d”

a: .word 0
b: .word 0
gcd_result: .word 0

• Roger Ferrer Ibáñez says:

I’m not sure what is your question here.

Kind regards,

• G. Adam Stanislav says:

Perhaps this is new since you wrote this, but udiv and sdiv are the division opcodes for ARM unsigned and signed division.

• Roger Ferrer Ibáñez says:

AFAICT those were added in Armv7.

This tutorial was written for Raspberry Pi 1 which has an ARM1176JZF-S CPU. That CPU implements the Armv6 architecture which doesn’t have those instructions.

• G. Adam Stanislav says:

OK, thanks.

This site uses Akismet to reduce spam. Learn how your comment data is processed.