It may be suprising, but the ARMv6 architecture does not provide an integer division instruction while it does have a floating point instruction in VFPv2. In this chapter we will see usual ways to workaround this limitation with different techniques that can be used in specific scenarios involving divisions.

## What does integer division mean?

First we should clearly define what we understand by integer division. Given two integer numbers N (for numerator) and D (for denominator, different than zero) we define the integer division of N and D as the pair of integer numbers Q (for quotient) and R (for remainder) satisfying the following equality.

`N = D × Q + R` where `0 ≤ |R| < |D|`

The equality implies that there are two solutions `0 < R < |D|` and `0 < |-R| < |D|`. For example, `N=7` and `D=3` has two solutions `(Q=2, R=1)` and `(Q=3, R=-2)`. While both solutions may be useful, the former one is the preferred as it is closer to our natural notion of the remainder. But what if `D` is negative? For example `N=7` and `D=-3` has two solutions as well `(Q=-2, R=1)` and `(Q=-3, R=-2)`. When negative numbers are involved the choice of the remainder is not intuitive but conventional. Many conventions can be used to choose one solution. We can always pick the solution with the positive remainder (this is called euclidean division), or the negative, or the solution where the sign of the remainder matches the numerator (or the denominator).

Most computers perform an integer division where the remainder has the same sign as the numerator. So for `N=7` and `D=3` the computed solution is `(Q=2, R=1)` and for `N=7` and `D=-3` the computed solution is `(Q=-2, R=1)`. We will assume such integer division convention in the remainder (no pun intended) of this post.

## Unsigned division

An unsigned integer division is an integer division involving two unsigned integers N and D. This has the consequence that Q and R will always be positive. A very naive (and slow) approach for unsigned division is as follows.

This algorithm, while correct and easy to understand, is not very efficient (think on dividing a big N with a small D). Is there a way that we can compute the division in a fixed amount of time? The answer is yes, just adapt how you divide by hand but to binary numbers. We will compute a temporary remainder picking bits, from left to right, of the dividend. When the remainder is larger than the divisor, we will subtract the divisor from that remainder and set the appropiate bit in the quotient.

This approach is a bit more efficient since it repeats the loop a fixed number of times (always 32). For each bit of N starting from the most significant one (line 13), we push it to the right of the current value of R (line 14). We do this by adding R to itself, this is 2*R which is actually shifting to the right R 1 bit. Then we add the carry, that will be 1 if the most significant bit of N before the shift (line 13) was 1. Then we check if the current R is already bigger than D (line 16) If so we subtract N from R, R ← R - N (line 17) and then we push a 1 to the right of Q (line 18), again by adding Q to itself plus the carry set by the comparison (if R >= N then there is no borrow so C became 1 in `cmp` of line 16).

The shown code is fine but it can be improved in a number of ways. First, there is no need to check all the bits of a number (although this gives as an upper bound of the cost in the worst of the cases). Second, we should try hard to reduce the number of used registers. Here we are using 5 registers, is there a way we can use less registers? For this we will have to use a slightly different approach.

Given N and D, we will first shift D as many bits to the left as possible but always having N > D. So, for instance if we divide N=1010(2 by D=10(2 we would adjust D until it was D0=1000(2 (this is shifting twice to the left). Now we start a similar process to the one above: if Ni ≥ Di, we set 1 to the lowest bit of Q and then we compute a new Ni+1 ← Ni - Di and a new Di+1 ← Di/2. If Ni < Di then we simply compute a new Di+1 ← Di/2. We stop when the current Di is smaller than the initial D (not D0). Note that this condition is what makes dividing N=1010(2 by D=10(2 different that dividing N=1010(2 by D=1(2 although the D0 of both cases is the same.

We can avoid the first loop where we shift until we exceed by counting the leading zeroes. By counting the leading zeroes of the dividend and the divisor we can straightforwardly compute how many bits we need to shift the divisor.

## Signed division

Signed division can be computed with an unsigned division but taking care of the signs. We can first compute |N|/|D| (this is, ignoring the signs of `N` and `D`), this will yield a quotient Q+ and remainder R+. If signs of N and D are different then Q = -Q+. If N < 0, then R = -R+, as we said at the beginning of the post.

## Powers of two

An unsigned division by a power of two 2N is as simple as doing a logical shift right of N bits. Conversely, a signed division by a power of two 2N is as simple as doing an arithmetic shift right of N bits. We can use `mov` and the addressing modes `LSR` and `ASR` for this. This case is ideal because it is extremely fast.

## Division by a constant integer

When we divide a number by a constant, we can use a multiplication by a magic number to compute the division. All the details and the theory of this technique is too long to write it here but you can find it in chapter 10 of Hacker's Delight. We can summarize it, though, into three values: a magic constant M, a shift S and an additional flag. The author set up a magic number calculator that computes these numbers.

It is not obvious how to properly use these magic numbers, so I crafted a small Python script which emits code for the signed and the unsigned case. Imagine you want to divide an unsigned number by 14. So let's ask our script.

Similarly we can ask for the signed version:

As an example I have used it to implement a small program that just divides the user input by 14.

## Using VFPv2

I would not recommend using this technique. I present it here for the sake of completeness. We simply convert our integers to floating point numbers, divide them as floating point numbers and convert the result back to an integer.

## Comparing versions

After a comment below, I thought it would be interesting to benchmark the general division algorithm. The benchmark I used is the following:

Basically it does something like this

Timings have been obtained using `perf_3.2 stat --repeat=5 -e cpu-clock`. In the table below, `__aeabi_uidiv` is the function in `libgcc` that `gcc` uses to implement an unsigned integer division.

Version Time (seconds)
unsigned_longdiv 45,43
vfpv2_division 9,70
clz_unsigned_longdiv 8,48
__aeabi_uidiv 7,37
better_unsigned_longdiv 6,67

As you can see the performance of our unsigned long division is dismal. The reason it is that it always checks all the bits. The libgcc version is like our clz version but the loop has been fully unrolled and there is a computed branch, similar to a Duff's device. Unfortunately, I do not have a convincing explanation why `better_unsigned_longdiv` runs faster than the other versions, because the code, a priori, looks worse to me.

That's all for today.