Chapter 10. Integer Division By Constants

On many computers, division is very time consuming and is to be avoided when possible. A value of 20 or more elementary add times is not uncommon, and the execution time is usually the same large value even when the operands are small. This chapter gives some methods for avoiding the divide instruction when the divisor is a constant.

10–1 Signed Division by a Known Power of 2

Apparently, many people have made the mistake of assuming that a shift right signed of k positions divides a number by 2^k, using the usual truncating form of division [GLS2]. It’s a little more complicated than that. The code shown below computes q = n ÷ 2^k, for 1 ≤ k ≤ 31 [Hop].

Click here to view code image

shrsi t,n,k-1        Form the integer
shri  t,t,32-k       2**k – 1 if n < 0, else 0.
add   t,n,t          Add it to n,
shrsi q,t,k          and shift right (signed).

It is branch free. It simplifies to three instructions in the common case of division by 2 (k = 1). It does, however, rely on the machine’s being able to shift by a large amount in a short time. The case k = 31 does not make too much sense, because the number 2³¹ is not representable in the machine. Nevertheless, the code does produce the correct result in that case (which is q = –1 if n = –2³¹and q = 0 for all other n).

To divide by –2^k, the above code can be followed by a negate instruction. There does not seem to be any better way to do it.

The more straightforward code for dividing by 2^k is

Click here to view code image

        bge   n,label       Branch if n >= 0.
        addi  n,n,2**k-1    Add 2**k - 1 to n,
label  shrsi n,n,k         and shift right (signed).

This would be preferable on a machine with slow shifts and fast branches.

PowerPC has an unusual device for speeding up division by a power of 2 [GGS]. The shift right signed instructions set the machine’s carry bit if the number being shifted is negative and one or more 1-bits are shifted out. That machine also has an instruction for adding the carry bit to a register, denoted addze. This allows division by any (positive) power of 2 to be done in two instructions:

Click here to view code image

shrsi q,n,k
addze q,q

A single shrsi of k positions does a kind of signed division by 2^k that coincides with both modulus and floor division. This suggests that one of these might be preferable to truncating division for computers and HLL’s to use. That is, modulus and floor division mesh with shrsi better than does truncating division, permitting a compiler to translate the expression n / 2 to an shrsi. Furthermore, shrsi followed by neg (negate) does modulus division by –2^k, which is a hint that maybe modulus division is best. (This is mainly an aesthetic issue. It is of little practical significance, because division by a negative constant is no doubt extremely rare.)

10–2 Signed Remainder from Division by a Known Power of 2

If both the quotient and remainder of n ÷ 2^k are wanted, it is simplest to compute the remainder r from r = n – q * 2^k This requires only two instructions after computing the quotient q:

Click here to view code image

shli r,q,k
sub r,n,r

To compute only the remainder seems to require about four or five instructions. One way to compute it is to use the four-instruction sequence above for signed division by 2^k, followed by the two instructions shown immediately above to obtain the remainder. This results in two consecutive shift instructions that can be replaced by an and, giving a solution in five instructions (four if k = 1):

Click here to view code image

shrsi t,n,k-1      Form the integer
shri t,t,32-k      2**k - 1 if n < 0, else 0.
add t,n,t          Add it to n,
andi t,t,-2**k     clear rightmost k bits,
sub r,n,t          and subtract it from n.

Another method is based on

To use this, first compute , and then

r ← ((abs(n) & (2^k– 1)) ⊕ t) – t

(five instructions) or, for k = 1, since (– n) & 1 = n & 1,

r ← ((n & 1) ⊕ t) – t

(four instructions). This method is not very good for k > 1 if the machine does not have absolute value (computing the remainder would then require six instructions).

Still another method is based on

This leads to

(five instructions for k > 1, four for k = 1).

The above methods all work for 1 ≤ k ≤ 31.

Incidentally, if shift right signed is not available, the value that is 2^k – 1 for n < 0 and 0 for n ≥ 0 can be constructed from

which adds only one instruction.

10–3 Signed Division and Remainder by Non-Powers of 2

The basic trick is to multiply by a sort of reciprocal of the divisor d, approximately 2³²/d, and then to extract the leftmost 32 bits of the product. The details, however, are more complicated, particularly for certain divisors such as 7.

Let us first consider a few specific examples. These illustrate the code that will be generated by the general method. We denote registers as follows:

n – the input integer (numerator)
M – loaded with a “magic number”
t - a temporary register
q - will contain the quotient
r - will contain the remainder

Division by 3

Click here to view code image

li      M,0x55555556    Load magic number, (2**32+2)/3.
mulhs   q,M,n           q = floor(M*n/2**32).

shri    t,n,31          Add 1 to q if
add     q,q,t           n is negative.

muli    t,q,3           Compute remainder from
sub     r,n,t           r = n - q*3.

Proof. The multiply high signed operation (mulhs) cannot overflow, as the product of two 32-bit integers can always be represented in 64 bits and mulhs gives the high-order 32 bits of the 64-bit product. This is equivalent to dividing the 64-bit product by 2³² and taking the floor of the result, and this is true whether the product is positive or negative. Thus, for n ≥ 0 the above code computes

Now, n < 2³¹, because 2³¹ – 1 is the largest representable positive number. Hence, the “error” term 2n / (3 · 2³²) is less than 1/3 (and is nonnegative), so by Theorem D4 (page 183) we have q = ⌊n / 3⌋, which is the desired result (Equation (1) on page 182).

For n < 0, there is an addition of 1 to the quotient. Hence the code computes

where we have used Theorem D2. Hence

For –2³¹ ≤ n ≤ –1,

The error term is nonpositive and greater than –1 / 3, so by Theorem D4 q = ⌈n / 3⌉, which is the desired result (Equation (1) on page 182).

This establishes that the quotient is correct. That the remainder is correct follows easily from the fact that the remainder must satisfy

n = qd + r,

the multiplication by 3 cannot overflow (because –2³¹ / 3 ≤ q ≤ (2³¹ – 1) / 3), and the subtract cannot overflow because the result must be in the range –2 to +2.

The multiply immediate can be done with two add’s, or a shift and an add, if either gives an improvement in execution time.

On many present-day RISC computers, the quotient can be computed as shown above in nine or ten cycles, whereas the divide instruction might take 20 cycles or so.

Division by 5

For division by 5, we would like to use the same code as for division by 3, except with a multiplier of (2³² + 4) / 5. Unfortunately, the error term is then too large; the result is off by 1 for about 1/5 of the values of n ≥ 2³⁰ in magnitude. However, we can use a multiplier of (2³³ + 3) / 5 and add a shift right signed instruction. The code is

Click here to view code image

li    M,0x66666667  Load magic number, (2**33+3)/5.
mulhs q,M,n         q = floor(M*n/2**32).
shrsi q,q,1
shri  t,n,31        Add 1 to q if
add   q,q,t         n is negative.

muli  t,q,5         Compute remainder from
sub   r,n,t         r = n - q*5.

Proof. The mulhs produces the leftmost 32 bits of the 64-bit product, and then the code shifts this right by one position, signed (or “arithmetically”). This is equivalent to dividing the product by 2³³ and then taking the floor of the result. Thus, for n ≥ 0 the code computes

For 0 ≤ n < 2³¹, the error term 3n / 5 · 2³³ is nonnegative and less than 1/5, so by Theorem D4, q = ⌊ n / 5⌋.

For n < 0, the above code computes

The error term is nonpositive and greater than –1/5, so q = ⌈n / 5⌉.
That the remainder is correct follows as in the case of division by 3.
The multiply immediate can be done with a shift left of two and an add.

Division by 7

Dividing by 7 creates a new problem. Multipliers of (2³² + 3) / 7 and (2³³ + 6) / 7 give error terms that are too large. A multiplier of (2³⁴ + 5) / 7 would work, but it’s too large to represent in a 32-bit signed word. We can multiply by this large number by multiplying by (2³⁴ + 5) / 7 – 2³² (a negative number), and then correcting the product by inserting an add. The code is

Click here to view code image

li    M,0x92492493  Magic num, (2**34+5)/7 - 2**32.
mulhs q,M,n         q = floor(M*n/2**32).
add   q,q,n         q = floor(M*n/2**32) + n.
shrsi q,q,2         q = floor(q/4).
shri  t,n,31        Add 1 to q if
add   q,q,t         n is negative.

muli  t,q,7         Compute remainder from
sub   r,n,t         r = n - q*7.

Proof. It is important to note that the instruction “add q,q,n” above cannot overflow. This is because q and n have opposite signs, due to the multiplication by a negative number. Therefore, this “computer arithmetic” addition is the same as real number addition. Hence for n ≥ 0 the above code computes

where we have used the corollary of Theorem D3.

For 0 ≤ n ≤ 2³¹, the error term 5n/ 7 · 2³⁴ is nonnegative and less than 1/7, so q = ⌊ n / 7⌋.

For n < 0, the above code computes

The error term is nonpositive and greater than –1/7, so q = ⌈n / 7⌉

The multiply immediate can be done with a shift left of three and a subtract.

10–4 Signed Division by Divisors ≥ 2

At this point you may wonder if other divisors present other problems. We see in this section that they do not; the three examples given illustrate the only cases that arise (for d ≥ 2).

Some of the proofs are a bit complicated, so to be cautious, the work is done in terms of a general word size W.

Given a word size W ≥ 3 and a divisor d, 2 ≤ d ≤ 2^{W – 1} we wish to find the least integer m and integer p such that

with 0 ≤ m< 2^W and p ≥ W.

The reason we want the least integer m is that a smaller multiplier may give a smaller shift amount (possibly zero) or may yield code similar to the “divide by 5” example, rather than the “divide by 7” example. We must have m ≤ 2^W– 1 so the code has no more instructions than that of the “divide by 7” example (that is, we can handle a multiplier in the range 2^{W –1} to 2^W– 1 by means of the add that was inserted in the “divide by 7” example, but we would rather not deal with larger multipliers). We must have p ≥ W, because the generated code extracts the left half of the product mn, which is equivalent to shifting right W positions. Thus, the total right shift is W or more positions.

There is a distinction between the multiplier m and the “magic number,” denoted M. The magic number is the value used in the multiply instruction. It is given by

Because (1b) must hold for n = –d,⌊– md/ 2^p⌋ + 1 = –1, which implies

Let n_c be the largest (positive) value of n such that rem(n_c, d) = d – 1. n_c exists because one possibility is n_c = d – 1. It can be calculated from n_c = ⌊ 2^{W – 1} / d ⌋ d – 1 = 2^{W – 1} – rem(2^{W –1}, d) – 1. n_c is one of the highest d admissible values of n, so

and, clearly

Because (1a) must hold for n = n_c

Combining this with (2) gives

Because m is to be the least integer satisfying (4), it is the next integer greater than 2^p / d; that is,

Combining this with the right half of (4) and simplifying gives

The Algorithm

Thus, the algorithm to find the magic number M and the shift amount s from d is to first compute n_c, and then solve (6) for p by trying successively larger values. If p < W, set p = W (the theorem below shows that this value of p also satisfies (6)). When the smallest p ≥ W satisfying (6) is found, m is calculated from (5). This is the smallest possible value of m, because we found the smallest acceptable p, and from (4) clearly smaller values of p yield smaller values of m. Finally, s = p– W and M is simply a reinterpretation of m as a signed integer (which is how the mulhs instruction interprets it).

Forcing p to be at least W is justified by the following:

THEOREM DC1. If(6) is true for some value of p, then it is true for all larger values of p.

Proof. Suppose (6) is true for p = p₀. Multiplying (6) by 2 gives

2^{p₀ +1} > n_c(2 d – 2rem(2^p₀, d)).

From Theorem D5, rem(2^{p₀ +1}, d) ≥ 2rem(2^p₀, d) – d. Combining gives

2^{p₀0 +1} > n_c((2d – (rem(2^{p₀ + 1}, d) + d)), or
2^{p 0 +1} > n_c(d – rem(2^{p₀ +1}, d)).

Therefore, (6) is true for p = p₀ + 1, and hence for all larger values.

Thus, one could solve (6) by a binary search, although a simple linear search (starting with p = W) is probably preferable, because usually d is small, and small values of d give small values of p.

Proof That the Algorithm Is Feasible

We must show that (6) always has a solution and that 0 ≤ m < 2^W. (It is not necessary to show that p ≥ W, because that is forced.)

We show that (6) always has a solution by getting an upper bound on p. As a matter of general interest, we also derive a lower bound under the assumption that p is not forced to be at least W. To get these bounds on p, observe that for any positive integer x, there is a power of 2 greater than x and less than or equal to 2x. Hence, from (6),

n_c(d – rem(2^p, d)) < 2^p ≤ 2n_c((d – rem(2^p, d)).

Because 0 ≤ rem(2^p, d) ≤ d – 1,

From (3a) and (3b), n_c ≥ max(2^{W – 1} – d, d – 1). The lines f₁(d) = 2^{W –1} – d and f₂(d) = d – 1 cross at d = (2^{W –1} + 1) / 2. Hence n_c ≥ (2^{W –1} – 1) / 2. Because n_c is an integer, n_c ≥ 2^{W –2}. Because n_c, d ≤ 2^{W – 1} – 1, (7) becomes

2^{w –2} + 1 ≤ 2^p ≤ 2(2^{w – 1} – 1)²

The lower bound p = W – 1 can occur (e.g., for W = 32, d = 3), but in that case we set p = W.

If p is not forced to equal W, then from (4) and (7),

Using (3b) gives

Because n_c ≤ 2^{W –1} – 1 (3a),

2 ≤ m ≤ 2^w – 1.

If p is forced to equal W, then from (4),

Because 2 ≤ d ≤ 2^{W –1} – 1 and n_c ≥ 2^{W –2},

Hence in either case m is within limits for the code schema illustrated by the “divide by 7” example.

Proof That the Product Is Correct

We must show that if p and m are calculated from (6) and (5), then Equations (1a) and (1b) are satisfied.

Equation (5) and inequality (6) are easily seen to imply (4). (In the case that p is forced to be equal to W, (6) still holds, as shown by Theorem DC1.) In what follows, we consider separately the following five ranges of values of n:

From (4), because m is an integer,

Multiplying by n / 2^p, for n ≥ 0 this becomes

For 0 ≤ n ≤ n_c, 0 ≤ (2^p – 1) n / (2^pdn_c) < 1 / d, so by Theorem D4,

Hence (1a) is satisfied in this case (0 ≤ n ≤ n_c).

For n > n_c, n is limited to the range

because n ≥ n_c + d contradicts the choice of n_c as the largest value of n such that rem(n_c, d) = d – 1 (alternatively, from (3a), n ≥ n_c + d implies n ≥ 2^{W – 1}). From (4), for n ≥ 0,

By elementary algebra, this can be written

From (9), 1 ≤ n – n_c ≤ d – 1, so

Because n_c ≥ d – 1 (by (3b)) and (n_c + 1) / n_c has its maximum when n_c has its minimum,

In (10), the term (n_c + 1) / d is an integer. The term (n – n_c)(n_c + 1) / dn_c is less than or equal to 1. Therefore, (10) becomes

For all n in the range (9),⌊n/ d ⌋ = (n_c + 1) / d. Hence, (1a) is satisfied in this case (n_c + 1 ≤ n ≤ n_c + d – 1).

For n < 0, from (4) we have, because m is an integer,

Multiplying by n / 2^p, for n < 0 this becomes

Using Theorem D2 gives

Because n + 1 ≤ 0, the right inequality can be weakened, giving

For –n_c ≤ n ≤ –1,

Hence, by Theorem D4,

so that (1b) is satisfied in this case (–n_c ≤ n ≤ –1).

For n < –n_c, n is limited to the range

(From (3a), n < – n_c – d implies that n < –2^{W – 1}, which is impossible.) Performing elementary algebraic manipulation of the left comparand of (11) gives

For – n_c – d + 1 ≤ n ≤ – n_c – 1,

The ratio (n_c + 1) / n_c is a maximum when n_c is a minimum; that is, n_c = d – 1.

Therefore,

From (13), because (– n_c – 1) / d is an integer and the quantity added to it is between 0 and –1,

For n in the range – n_c – d + 1 ≤ n ≤ – n_c – 1,

Hence, ⌊mn/ 2^p⌋ + 1 = ⌈n / d⌉—that is, (1b) is satisfied.

The last case, n = – n_c – d, can occur only for certain values of d. From (3a), – n_c – d ≤ –2^{W – 1}, so if n takes on this value, we must have n = – n_c – d = –2^{W –1}, and hence n_c = 2^{W –1} – d. Therefore, rem(2^{W – 1}, d) = rem(n_c + d, d) = d – 1 (that is, d divides 2^{W –1} + 1).

For this case (n = – n_c – d), (6) has the solution p = W – 1 (the smallest possible value of p), because for p = W – 1,

Then from (5),

Therefore,

so that (1b) is satisfied.

This completes the proof that if m and p are calculated from (5) and (6), then Equations (1a) and (1b) hold for all admissible values of n.

10–5 Signed Division by Divisors ≤ –2

Because signed integer division satisfies n ÷ (–d) = –(n ÷ d), it is adequate to generate code for n ÷ |d| and follow it with an instruction to negate the quotient. (This does not give the correct result for d = –2^{W –1}, but for this and other negative powers of 2, you can use the code in Section 10–1, “Signed Division by a Known Power of 2,” on page 205, followed by a negating instruction.) It will not do to negate the dividend, because of the possibility that it is the maximum negative number.

It is possible to avoid the negating instruction. The scheme is to compute

Adding 1 if n > 0 is awkward (because one cannot simply use the sign bit of n), so the code will instead add 1 if q < 0. This is equivalent, because the multiplier m is negative (as will be seen).

The code to be generated is illustrated below for the case W = 32, d = –7.

Click here to view code image

li    M,0x6DB6DB6D  Magic num, -(2**34+5)/7 + 2**32.
mulhs q,M,n         q = floor(M*n/2**32).
sub   q,q,n         q = floor(M*n/2**32) - n.
shrsi q,q,2         q = floor(q/4).
shri  t,q,31        Add 1 to q if
add   q,q,t         q is negative (n is positive).

muli  t,q,-7        Compute remainder from
sub   r,n,t         r = n - q*(-7).

This code is the same as that for division by +7, except that it uses the negative of the multiplier for +7, and a sub rather than an add after the multiply, and the shri of 31 must use q rather than n, as discussed above. (The case of d = +7 could also use q here, but there would be less parallelism in the code.) The subtract will not overflow, because the operands have the same sign. This scheme, however, does not always work! Although the code above for W = 32, d = –7 is correct, the analogous alteration of the “divide by 3” code to produce code to divide by –3 does not give the correct result for W = 32, n = –2³¹.

Let us look at the situation more closely.

Given a word size W ≥ 3 and a divisor d, –2^{W –1} ≤ d ≤ –2, we wish to find the least (in absolute value) integer m and integer p such that

with –2^W ≤ m ≤ 0 and p ≥ W.

Proceeding similarly to the case of division by a positive divisor, let n_c be the most negative value of n such that n_c = kd + 1 for some integer k. n_c exists, because one possibility is n_c = d + 1. It can be calculated from n_c = ⌊ (– 2^{W –1} – 1) / d ⌋ d + 1 = – 2^{W – 1} + rem(2^{W –1} + 1, d). n_c is one of the least |d| admissible values of n, so

and, clearly

Because (14b) must hold for n = – d, and (14a) must hold for n = n_c, we obtain, analogous to (4),

Because m is to be the greatest integer satisfying (16), it is the next integer less than 2^p / d—that is,

Combining this with the left half of (16) and simplifying gives

The proof that the algorithm suggested by (17) and (18) is feasible, and that the product is correct, is similar to that for a positive divisor, and will not be repeated. A difficulty arises in trying to prove that – 2^W ≤ m ≤ 0. To prove this, consider separately the cases in which d is the negative of a power of 2, or some other number. For d = –2^k, it is easy to show that n_c = –2^{w – 1} + 1, p = W + k – 1, and m = – 2^{w –1} –1 (which is within range). For d not of the form –2^k, it is straightforward to alter the earlier proof.

**For Which Divisors Is m (–d) ≠ – m (d)?**

By m(d) we mean the multiplier corresponding to a divisor d. If m(–d) = –m(d), code for division by a negative divisor can be generated by calculating the multiplier for |d|, negating it, and then generating code similar to that of the “divide by –7” case illustrated above.

By comparing (18) with (6) and (17) with (5), it can be seen that if the value of n_c for –d is the negative of that for d, then m(–d) = –m(d). Hence, m(–d) ≠ m(d) can occur only when the value of n_c calculated for the negative divisor is the maximum negative number, –2^{W – 1}. Such divisors are the negatives of the factors of 2^{W –1} + 1. These numbers are fairly rare, as illustrated by the factorings below (obtained from Scratchpad).

2¹⁵ + 1 = 3 · 11 · 331

2³¹ + 1 = 3 · 715,827,883

2⁶³ + 1 = 3³ · 19 · 43 · 5419 · 77,158,673,929

For all these factors, m(–d) ≠ m(d). Proof sketch: For d > 0 we have n_c = 2^{w – 1} – d. Because rem(2^{w – 1}, d) = d – 1, (6) is satisfied by p = W – 1 and hence also by p = W. For d < 0, however, we have n_c = –2^{W – 1} and rem(2^{w – 1}, d) = |d| –1. Hence, (18) is not satisfied for p = W – 1 or for p = W, so p > W.

10–6 Incorporation into a Compiler

For a compiler to change division by a constant into a multiplication, it must compute the magic number M and the shift amount s, given a divisor d. The straightforward computation is to evaluate (6) or (18) for p = W, W +1, ... until it is satisfied. Then, m is calculated from (5) or (17). M is simply a reinterpretation of m as a signed integer, and s = p − W.

The scheme described below handles positive and negative d with only a little extra code, and it avoids doubleword arithmetic.

Recall that n_c is given by

Hence, |n_c| can be computed from

The remainder must be evaluated using unsigned division, because of the magnitude of the arguments. We have written rem(t, |d|) rather than the equivalent rem(t, d), to emphasize that the program must deal with two positive (and unsigned) arguments.

From (6) and (18), p can be calculated from

and then |m| can be calculated from (c.f. (5) and (17)):

Direct evaluation of rem(2^p, |d|) in (19) requires “long division” (dividing a 2W-bit dividend by a W-bit divisor, giving a W-bit quotient and remainder), and, in fact, it must be unsigned long division. There is a way to solve (19), and to do all the calculations, that avoids long division and can easily be implemented in a conventional HLL using only W-bit arithmetic. We do, however, need unsigned division and unsigned comparisons.

We can calculate rem(2^p, |d|) incrementally, by initializing two variables q and r to the quotient and remainder of 2^p divided by |d| with p = W – 1, and then updating q and r as p increases.

As the search progresses—that is, when p is incremented by 1—q and r are updated from (see Theorem D5(a))

Click here to view code image

q = 2*q;
r = 2*r;
if (r>= abs(d)) {
q = q + 1;
r = r - abs(d);}

The left half of inequality (4) and the right half of (16), together with the bounds proved for m, imply that q = ⌊ 2^p /|d|⌋ < 2^W, so q is representable as a W-bit unsigned integer. Also, 0 ≤ r < |d|, so r is representable as a W-bit signed or unsigned integer. (Caution: The intermediate result 2r can exceed 2^{W –1} – 1, so r should be unsigned and the comparison above should also be unsigned.)

Next, calculate δ = |d| – r. Both terms of the subtraction are representable as W-bit unsigned integers, and the result is also (1 ≤ δ ≤ |d|), so there is no difficulty here.

To avoid the long multiplication of (19), rewrite it as

The quantity 2^p / |n_c| is representable as a W-bit unsigned integer (similar to (7), from (19) it can be shown that 2^p ≤ 2|n_c| · |d| and, for d = –2^{W – 1}, n_c = –2^{w – 1} + 1 and p = 2W – 2, so that 2^p / |n_c| = 2^{2W – 2} / (2^{w – 1} − 1) < 2^W for W ≥ 3). Also, it is easily calculated incrementally (as p increases) in the same manner as for rem(2^p, |d|). The comparison should be unsigned, for the case 2^p / |n_c| ≥ 2^{W – 1} (which can occur, for large d).

To compute m, we need not evaluate (20) directly (which would require long division). Observe that

The loop closure test 2^p / |n_c| > δ is awkward to evaluate. The quantity 2^p/ |n_c| is available only in the form of a quotient q₁ and a remainder r₁. 2^p / |n_c| may or may not be an integer (it is an integer only for d = 2^{W – 2} + 1 and a few negative values of d). The test 2^p / |n_c| ≤ δ can be coded as

q₁ < δ | (q₁ = δ & r₁ = 0).

The complete procedure for computing M and s from d is shown in Figure 10–1, coded in C, for W = 32. There are a few places where overflow can occur, but the correct result is obtained if overflow is ignored.

To use the results of this program, the compiler should generate the li and mulhs instructions, generate the add if d > 0 and M < 0, or the sub if d < 0 and M > 0, and generate the shrsi if s > 0. Then, the shri and final add must be generated.

For W = 32, handling a negative divisor can be avoided by simply returning a precomputed result for d = 3 and d = 715,827,883, and using m(– d) = – m(d) for other negative divisors. However, that program would not be significantly shorter, if at all, than the one given in Figure 10–1.

Click here to view code image

struct ms {int M;          // Magic number
          int s;};          // and shift amount.

struct ms magic(int d) {    // Must have 2 <= d <= 2**31-1
                           // or   -2**31 <= d <= -2.
   int p;
   unsigned ad, anc, delta, q1, r1, q2, r2, t;
   const unsigned two31 = 0x80000000;      // 2**31.
   struct ms mag;

   ad = abs(d);
   t = two31 + ((unsigned)d >> 31);
   anc = t - 1 - t%ad;      // Absolute value of nc.
   p = 31;                 // Init. p.
   q1 = two31/anc;          // Init. q1 = 2**p/|nc|.
   r1 = two31 - q1*anc;     // Init. r1 = rem(2**p, |nc|).
   q2 = two31/ad;           // Init. q2 = 2**p/|d|.
   r2 = two31 - q2*ad;      // Init. r2 = rem(2**p, |d|).
   do {
      p = p + 1;
      q1 = 2*q1;            // Update q1 = 2**p/|nc|.
      r1 = 2*r1;           // Update r1 = rem(2**p, |nc|).
      if (r1 >= anc) {      // (Must be an unsigned
         q1 = q1 + 1;       // comparison here.)
         r1 = r1 - anc;}
      q2 = 2*q2;            // Update q2 = 2**p/|d|.
      r2 = 2*r2;            // Update r2 = rem(2**p, |d|).
      if (r2 >= ad) {       // (Must be an unsigned
         q2 = q2 + 1;       // comparison here.)
         r2 = r2 - ad;}
      delta = ad - r2;
   } while (q1 < delta || (q1 == delta && r1 == 0));

   mag.M = q2 + 1;
   if (d < 0) mag.M = -mag.M; // Magic number and
   mag.s = p - 32;             // shift amount to return.
   return mag;
}

FIGURE 10–1. Computing the magic number for signed division.

10–7 Miscellaneous Topics

THEOREM DC2. The least multiplier m is odd if p is not forced to equal W.

Proof. Assume that Equations (1a) and (1b) are satisfied with least (not forced) integer p, and m even. Then clearly m could be divided by 2 and p could be decreased by 1, and (1a) and (1b) would still be satisfied. This contradicts the assumption that p is minimal.

Uniqueness

The magic number for a given divisor is sometimes unique (e.g., for W = 32, d = 7), but often it is not. In fact, experimentation suggests that it is usually not unique. For example, for W = 32, d = 6, there are four magic numbers:

Nevertheless, there is the following uniqueness property:

THEOREM DC3. For a given divisor d, there is only one multiplier m having the minimal value of p, if p is not forced to equal W.

Proof. First consider the case d > 0. The difference between the upper and lower limits of inequality (4) is 2^p/ dn_c. We have already proved (7) that if p is minimal, then 2^p/dn_c ≤ 2. Therefore, there can be at most two values of m satisfying (4). Let m be the smaller of these values, given by (5); then m + 1 is the other.

Let p₀ be the least value of p for which m + 1 satisfies the right half of (4) (p₀ is not forced to equal W). Then

This simplifies to

2^p₀ > n_c (2 d – rem(2^p₀, d)).

Dividing by 2 gives

Because rem(2^p0, d) ≤ 2rem(2^{p0 – 1}, d) (by Theorem D5 on page 184),

2^{p₀ – 1} > n_c (d – rem(2^{p₀ –1}, d)),

contradicting the assumption that p₀ is minimal.

The proof for d < 0 is similar and will not be given.

The Divisors with the Best Programs

The program for d = 3, W = 32 is particularly short, because there is no add or shrsi after the mulhs. What other divisors have this short program?

We consider only positive divisors. We wish to find integers m and p that satisfy Equations (1a) and (1b), and for which p = W and 0 ≤ m < 2^{W –1}. Because any integers m and p that satisfy equations (1a) and (1b) must also satisfy (4), it suffices to find those divisors d for which (4) has a solution with p = W and 0 ≤ m < 2^{W –1}. All solutions of (4) with p = W are given by

Combining this with the right half of (4) and simplifying gives

The weakest restriction on rem(2^W, d) is with k = 1 and n_c at its minimal value of 2^{W –2}. Hence, we must have

rem(2^W, d) > d – 4;

that is, d divides 2^W+ 1, 2^W+ 2, or 2^W+ 3.

Now let us see which of these factors actually have optimal programs.

If d divides 2^W+ 1, then rem(2^W, d) = d – 1. Then a solution of (6) is p = W, because the inequality becomes

2^W > n_c (d – (d – 1)) = n_c,

which is obviously true, because n_c < 2^{W –1}. Then in the calculation of m we have

which is less than 2^{W –1} for d ≥ 3 (d ≠ 2 because d divides 2^W + 1). Hence, all the factors of 2^W+ 1 have optimal programs.

Similarly, if d divides 2^W + 2, then rem(2^W, d) = d – 2. Again, a solution of (6) is p = W, because the inequality becomes

2^W > n_c (d – (d – 2)) = 2n_c,

which is obviously true. Then in the calculation of m we have

which exceeds 2^{W –1} – 1 for d = 2, but which is less than or equal to 2^{W – 1} – 1 for W ≥ 3, d ≥ 3 (the case W = 3 and d = 3 does not occur, because 3 is not a factor of 2³ + 2 = 10). Hence all factors of 2^W + 2, except for 2 and the cofactor of 2, have optimal programs. (The cofactor of 2 is (2^W+ 2) / 2, which is not representable as a W-bit signed integer).

If d divides 2^W + 3, the following argument shows that d does not have an optimal program. Because rem(2^W, d) = d – 3, inequality (21) implies that we must have

for some k = 1, 2, 3, .... The weakest restriction is with k = 1, so we must have n_c < 2^W / 3.

From (3a), n_c ≥ 2^{W –1} – d, or d ≥ 2^{W – 1} – n_c. Hence, it is necessary that

Also, because 2, 3, and 4 do not divide 2^W + 3, the smallest possible factor of 2^W + 3 is 5. Therefore, the largest possible factor is (2^W + 3) / 5. Thus, if d divides 2^W + 3 and d has an optimal program, it is necessary that

Taking reciprocals of this with respect to 2^W + 3 shows that the cofactor of d, (2^W + 3) / d, has the limits

For W ≥ 5, this implies that the only possible cofactors are 5 and 6. For W < 5, it is easily verified that there are no factors of 2^W + 3. Because 6 cannot be a factor of 2^W + 3, the only possibility is 5. Therefore, the only possible factor of 2^W + 3 that might have an optimal program is (2^W + 3) / 5.

For d = (2^W + 3) / 5,

For W ≥ 4,

This exceeds (2^W / 3), so d = (2^W + 3) / 5 does not have an optimal program. Because for W < 4 there are no factors of 2^W + 3, we conclude that no factors of 2^W + 3 have optimal programs.

In summary, all the factors of 2^W + 1 and of 2^W + 2, except for 2 and (2^W + 2) / 2, have optimal programs, and no other numbers do. Furthermore, the above proof shows that algorithm magic (Figure 10–1 on page 223) always produces the optimal program when it exists.

Let us consider the specific cases W = 16, 32, and 64. The relevant factorizations are shown below.

The result for W = 16 is that there are 20 divisors that have optimal programs. The ones less than 100 are 3, 6, 9, 11, 18, 22, 33, 66, and 99.

For W = 32, there are six such divisors: 3, 6, 641, 6,700,417, 715,827,883, and 1,431,655,766.

For W = 64, there are 126 such divisors. The ones less than 100 are 3, 6, 9, 18, 19, 27, 38, 43, 54, 57, and 86.

10–8 Unsigned Division

Unsigned division by a power of 2 is, of course, implemented by a single shift right logical instruction, and remainder by and immediate.

It might seem that handling other divisors will be simple: Just use the results for signed division with d > 0, omitting the two instructions that add 1 if the quotient is negative. We will see, however, that some of the details are actually more complicated in the case of unsigned division.

Unsigned Division by 3

For a non-power of 2, let us first consider unsigned division by 3 on a 32-bit machine. Because the dividend n can now be as large as 2³² – 1, the multiplier (2³² + 2) / 3 is inadequate, because the error term 2 n / 3 · 2³² (see “divide by 3” example above) can exceed 1/3. However, the multiplier (2³³ + 1) / 3 is adequate. The code is

Click here to view code image

li    M,0xAAAAAAAB  Load magic number, (2**33+1)/3.
mulhu q,M,n         q = floor(M*n/2**32).
shri  q,q,1

muli  t,q,3         Compute remainder from
sub   r,n,t         r = n - q*3.

An instruction that gives the high-order 32 bits of a 64-bit unsigned product is required, which we show above as mulhu.

To see that the code is correct, observe that it computes

For 0 ≤ n < 2³², 0 ≤ n / (3 · 2³³) < 1 / 3, so by Theorem D4, q = ⌊ n/ 3⌋.

In computing the remainder, the multiply immediate can overflow if we regard the operands as signed integers, but it does not overflow if we regard them and the result as unsigned. Also, the subtract cannot overflow, because the result is in the range 0 to 2, so the remainder is correct.

Unsigned Division by 7

For unsigned division by 7 on a 32-bit machine, the multipliers (2³² + 3) / 7, (2³³ + 6) / 7, and (2³⁴ + 5) / 7 are all inadequate, because they give too large an error term. The multiplier (2³⁵ + 3) / 7 is acceptable, but it’s too large to represent in a 32-bit unsigned word. We can multiply by this large number by multiplying by (2³⁵ + 3) / 7 – 2³² and then correcting the product by inserting an add. The code is

Click here to view code image

li    M,0x24924925  Magic num, (2**35+3)/7 - 2**32.
mulhu q,M,n         q = floor(M*n/2**32).
add   q,q,n         Can overflow (sets carry).
shrxi q,q,3         Shift right with carry bit.

muli  t,q,7         Compute remainder from
sub   r,n,t         r = n - q*7.

Here we have a problem: The add can overflow. To allow for this, we have invented the new instruction shift right extended immediate (shrxi), which treats the carry from the add and the 32 bits of register q as a single 33-bit quantity, and shifts it right with 0-fill. On the Motorola 68000 family, this can be done with two instructions: rotate with extend right one position, followed by a logical right shift of three (roxr actually uses the X bit, but the add sets the X bit the same as the carry bit). On most machines, it will take more. For example, on PowerPC it takes three instructions: clear rightmost three bits of q, add carry to q, and rotate right three positions.

With shrxi implemented somehow, the code above computes

For 0 ≤ n < 2³², 0 ≤ 3 n /(7 · 2³⁵) < 1/7, so by Theorem D4, q = ⌊ n / 7 ⌋.

Granlund and Montgomery [GM] have a clever scheme for avoiding the shrxi instruction. It requires the same number of instructions as the above three-instruction sequence for shrxi, but it employs only elementary instructions that almost any machine would have, and it does not cause overflow at all. It uses the identity

Applying this to our problem, with q = ⌊ Mn / 2³² ⌋ where 0 ≤ M <2³², the subtraction will not overflow, because

so that, clearly, 0 ≤ n – q < 2³². Also, the addition will not overflow, because

and 0 ≤ n,q < 2³².

Using this idea gives the following code for unsigned division by 7:

Click here to view code image

li    M,0x24924925  Magic num, (2**35+3)/7 - 2**32.
mulhu q,M,n         q = floor(M*n/2**32).
sub   t,n,q         t = n - q.
shri  t,t,1         t = (n - q)/2.
add   t,t,q         t = (n - q)/2 + q = (n + q)/2.
shri  q,t,2         q = (n+Mn/2**32)/8 = floor(n/7).

muli  t,q,7         Compute remainder from
sub   r,n,t         r = n - q*7.

For this to work, the shift amount for the hypothetical shrxi instruction must be greater than 0. It can be shown that if d > 1 and the multiplier m ≥ 2³² (so that the shrxi instruction is needed), then the shift amount is greater than 0.

10–9 Unsigned Division by Divisors ≥ 1

Given a word size W ≥ 1 and a divisor d, 1 ≤ d < 2^W, we wish to find the least integer m and integer p such that

with 0 ≤ m < 2^W+1 and p ≥ W.

In the unsigned case, the magic number M is given by

Because (22) must hold for n = d, ⌊md/ 2^p⌋ = 1, or

As in the signed case, let n_c be the largest value of n such that rem(n_c, d) = d – 1. It can be calculated from n_c = ⌊ 2^W / d ⌋ d – 1 = 2^W – rem(2^W, d) – 1. Then

and

These imply that n_c ≥ 2^{W – 1}.

Because (22) must hold for n = n_c

Combining this with (23) gives

Because m is to be the least integer satisfying (25), it is the next integer greater than or equal to 2^p / d—that is,

Combining this with the right half of (25) and simplifying gives

The Algorithm (Unsigned)

Thus, the algorithm is to find by trial and error the least p ≥ W satisfying (27). Then, m is calculated from (26). This is the smallest possible value of m satisfying (22) with p ≥ W. As in the signed case, if (27) is true for some value of p, then it is true for all larger values of p. The proof is essentially the same as that of Theorem DC1, except Theorem D5(b) is used instead of Theorem D5(a).

Proof That the Algorithm Is Feasible (Unsigned)

We must show that (27) always has a solution and that 0 ≤ m < 2^{W +1}.

Because for any nonnegative integer x there is a power of 2 greater than x and less than or equal to 2 x + 1, from (27),

n_c(d – 1 – rem(2^p – 1, d)) < 2^p ≤ 2n_c(d – 1 – rem(2^p – 1, d)) + 1.

Because 0 ≤ rem(2^p– 1, d) ≤ d – 1,

Because n_c, d ≤ 2^W – 1, this becomes

1 ≤ 2^p ≤ 2(2^W – 1)(2^W – 2) + 1,

Thus, (27) always has a solution.

If p is not forced to equal W, then from (25) and (28),

If p is forced to equal W, then from (25),

Because 1 ≤ d ≤ 2^W– 1 and n_c ≥ 2^{W – 1},

In either case m is within limits for the code schema illustrated by the “unsigned divide by 7” example.

Proof That the Product Is Correct (Unsigned)

We must show that if p and m are calculated from (27) and (26), then (22) is satisfied.

Equation (26) and inequality (27) are easily seen to imply (25). Inequality (25) is nearly the same as (4), and the remainder of the proof is nearly identical to that for signed division with n ≥ 0.

10–10 Incorporation into a Compiler (Unsigned)

There is a difficulty in implementing an algorithm based on direct evaluation of the expressions used in this proof. Although p ≤ 2 W, which is proved above, the case p = 2 W can occur (e.g., for d = 2^W – 2 with W ≥ 4). When p = 2 W, it is difficult to calculate m, because the dividend in (26) does not fit in a 2W-bit word.

However, it can be implemented by the “incremental division and remainder” technique of algorithm magic. The algorithm is given in Figure 10–2 for W = 32. It passes back an indicator a, which tells whether or not to generate an add instruction. (In the case of signed division, the caller recognizes this by M and d having opposite signs.)

Some key points in understanding this algorithm are as follows:

• Unsigned overflow can occur at several places and should be ignored.

• n_c = 2^w – rem (2^w,d) – 1 = (2^W – 1) – rem(2^W – d, d).

• The quotient and remainder of dividing 2^p by n_c cannot be updated in the same way as is done in algorithm magic, because here the quantity 2*r1 can overflow. Hence, the algorithm has the test “if (r1 > = nc – r l),” whereas “if (2*rl >= nc)” would be more natural. A similar remark applies to computing the quotient and remainder of 2^P –1 divided by d.

• 0 ≤ δ ≤ d – 1, so δ is representable as a 32-bit unsigned integer.

Click here to view code image

struct mu {unsigned M;     // Magic number,
          int a;           // "add" indicator,
          int s;};         // and shift amount.

struct mu magicu(unsigned d) {
                            // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned nc, delta, q1, r1, q2, r2;
   struct mu magu;

   magu.a = 0;             // Initialize "add" indicator.
   nc = -1 - (-d)%d;       // Unsigned arithmetic here.
   p = 31;                 // Init. p.
   q1 = 0x80000000/nc;     // Init. q1 = 2**p/nc.
   r1 = 0x80000000 - q1*nc;// Init. r1 = rem(2**p, nc).
   q2 = 0x7FFFFFFF/d;      // Init. q2 = (2**p - 1)/d.
   r2 = 0x7FFFFFFF - q2*d; // Init. r2 = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (r1 >= nc - r1) {
         q1 = 2*q1 + 1;             // Update q1.
         r1 = 2*r1 - nc;}           // Update r1.
      else {
         q1 = 2*q1;
         r1 = 2*r1;}
      if (r2 + 1 >= d - r2) {
         if (q2 >= 0x7FFFFFFF) magu.a = 1;
         q2 = 2*q2 + 1;             // Update q2.
         r2 = 2*r2 + 1 - d;}        // Update r2.
      else {
         if (q2 >= 0x80000000) magu.a = 1;
         q2 = 2*q2;
         r2 = 2*r2 + 1;}
      delta = d - 1 - r2;
   } while (p < 64 &&
           (q1 < delta || (q1 == delta && r1 == 0)));

   magu.M = q2 + 1;        // Magic number
   magu.s = p - 32;        // and shift amount to return
   return magu;            // (magu.a was set above).
}

FIGURE 10–2. Computing the magic number for unsigned division.

• m = (2^p +d – 1 – rem(2^p – 1, d)) / d = ⌊ (2^p – 1) / d ⌋ +1 = q₂ + 1.

• The subtraction of 2^W when the magic number M exceeds 2^W – 1 is not explicit in the program; it occurs if the computation of q2 overflows.

• The “add” indicator, magu.a, cannot be set by a straightforward comparison of M to 2³², or of q2 to 2³² – 1, because of overflow. Instead, the program tests q2 before overflow can occur. If q2 ever gets as large as 2³² – 1, so that M will be greater than or equal to 2³², then magu.a is set equal to 1. If q2 stays below 2³² – 1, then magu.a is left at its initial value of 0.

• Inequality (27) is equivalent to 2^p/ n_c > δ.

• The loop test needs the condition p < 64, because without it, overflow of q1 would cause the program to loop too many times, giving incorrect results.

To use the results of this program, the compiler should generate the li and mulhu instructions and, if the “add” indicator a = 0, generate the shri of s (if s > 0), as illustrated by the example of “Unsigned Division by 3,” on page 227. If a = 1 and the machine has the shrxi instruction, the compiler should generate the add and shrxi of s as illustrated by the example of “Unsigned Division by 7,” on page 228. If a = 1 and the machine does not have the shrxi instruction, use the example on page 229: generate the sub, the shri of 1, the add, and finally the shri of s – 1 (if s – 1 > 0; s will not be 0 at this point except in the trivial case of division by 1, which we assume the compiler deletes).

10–11 Miscellaneous Topics (Unsigned)

THEOREM DC2U. The least multiplier m is odd if p is not forced to equal W.

THEOREM DC3U. For a given divisor d, there is only one multiplier m having the minimal value of p, if p is not forced to equal W.

The proofs of these theorems follow very closely the corresponding proofs for signed division.

The Divisors with the Best Programs (Unsigned)

For unsigned division, to find the divisors (if any) with optimal programs of two instructions to obtain the quotient (li, mulhu), we can do an analysis similar to that of the signed case (see “The Divisors with the Best Programs” on page 225). The result is that such divisors are the factors of 2^W or 2^W + 1, except for d = 1. For the common word sizes, this leaves very few nontrivial divisors that have optimal programs for unsigned division. For W = 16, there are none. For W = 32, there are only two: 641 and 6,700,417. For W = 64, again there are only two: 274,177 and 67,280,421,310,721.

The case d = 2^k, k = 1, 2, ..., deserves special mention. In this case, algorithm magicu produces p = W (forced), m = 2^{32 – k}. This is the minimal value of m, but it is not the minimal value of M. Better code results if p = W + k is used, if sufficient simplifications are done. Then, m = 2^W, M = 0, a = 1, and s = k. The generated code involves a multiplication by 0 and can be simplified to a single shift right k instruction. As a practical matter, divisors that are a power of 2 would probably be special-cased without using magicu. (This phenomenon does not occur for signed division, because for signed division m cannot be a power of 2. Proof: For d > 0, inequality (4) combined with (3b) implies that d – 1 < 2^p / m < d. Therefore, 2^p / m cannot be an integer. For d < 0, the result follows similarly from (16) combined with (15b).)

For unsigned division, the code for the case m ≥ 2^W is considerably worse than the code for the case m < 2^W if the machine does not have shrxi. It is of interest to have some idea of how often the large multipliers arise. For W = 32, among the integers less than or equal to 100, there are 31 “bad” divisors: 1, 7, 14, 19, 21, 27, 28, 31, 35, 37, 38, 39, 42, 45, 53, 54, 55, 56, 57, 62, 63, 70, 73, 74, 76, 78, 84, 90, 91, 95, and 97.

Using Signed in Place of Unsigned Multiply, and the Reverse

If your machine does not have mulhu, but it does have mulhs (or signed long multiplication), the trick given in “High-Order Product Signed from/to Unsigned,” on page 174, might make our method of doing unsigned division by a constant still useful.

That section gives a seven-instruction sequence for getting mulhu from mulhs. However, for this application it simplifies, because the magic number M is known. Thus, the compiler can test the most significant bit of the magic number, and generate code such as the following for the operation “mulhu q,M,n.” Here t denotes a temporary register.

Click here to view code image

           M₃₁ = 0           M₃₁ = 1
         mulhs q,M,n       mulhs q,M,n
         shrsi t,n,31      shrsi t,n,31
         and   t,t,M       and   t,t,M
         add   q,q,t       add   t,t,n
                           add   q,q,t

Accounting for the other instructions used with mulhu, this uses a total of six to eight instructions to obtain the quotient of unsigned division by a constant on a machine that does not have unsigned multiply.

This trick can be inverted, to get mulhs in terms of mulhu. The code is the same as that above, except the mulhs is changed to mulhu and the final add in each column is changed to sub.

A Simpler Algorithm (Unsigned)

Dropping the requirement that the magic number be minimal yields a simpler algorithm. In place of (27) we can use

and then use (26) to compute m, as before.

It should be clear that this algorithm is formally correct (that is, that the value of m computed does satisfy Equation (22)), because its only difference from the previous algorithm is that it computes a value of p that, for some values of d, is unnecessarily large. It can be proved that the value of m computed from (30) and (26) is less than 2^{W +1}. We omit the proof and simply give the algorithm (Figure 10–3).

Click here to view code image

struct mu {unsigned M;      // Magic number,
          int a;            // "add" indicator,
          int s;};          // and shift amount.

struct mu magicu2(unsigned d) {
                            // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned p32, q, r, delta;
   struct mu magu;
   magu.a = 0;              // Initialize "add" indicator.
   p = 31;                  // Initialize p.
   q = 0x7FFFFFFF/d;        // Initialize q = (2**p - 1)/d.
   r = 0x7FFFFFFF - q*d;    // Init. r = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (p == 32) p32 = 1;     // Set p32 = 2**(p-32).
      else p32 = 2*p32;
      if (r + 1 >= d - r) {
         if (q >= 0x7FFFFFFF) magu.a = 1;
         q = 2*q + 1;           // Update q.
         r = 2*r + 1 - d;       // Update r.
      }
      else {
         if (q >= 0x80000000) magu.a = 1;
         q = 2*q;
         r = 2*r + 1;
      }
      delta = d - 1 - r;
   } while (p < 64 && p32 < delta);
   magu.M = q + 1;         // Magic number and
   magu.s = p - 32;        // shift amount to return
   return magu;            // (magu.a was set above).
}

FIGURE 10–3. Simplified algorithm for computing the magic number, unsigned division.

Alverson [Alv] gives a much simpler algorithm, discussed in the next section, but it gives somewhat large values for m. The point of algorithm magicu2 is that it nearly always gives the minimal value for m when d ≤ 2^{W –1}. For W = 32, the smallest divisor for which magicu2 does not give the minimal multiplier is d = 102,807, for which magicu calculates m = 2,737,896,999 and magicu2 calculates m = 5,475,793,997.

There is an analog of magicu2 for signed division by positive divisors, but it does not work out very well for signed division by arbitrary divisors.

10–12 Applicability to Modulus and Floor Division

It might seem that turning modulus or floor division by a constant into multiplication would be simpler, in that the “add 1 if the dividend is negative” step could be omitted. This is not the case. The methods given above do not apply in any obvious way to modulus and floor division. Perhaps something could be worked out; it might involve altering the multiplier m slightly, depending upon the sign of the dividend.

10–13 Similar Methods

Rather than coding algorithm magic, we can provide a table that gives the magic numbers and shift amounts for a few small divisors. Divisors equal to the tabulated ones multiplied by a power of 2 are easily handled as follows:

1. Count the number of trailing 0’s in d, and let this be denoted by k.

2. Use as the lookup argument d / 2^k (shift right k).

3. Use the magic number found in the table.

4. Use the shift amount found in the table, increased by k.

Thus, if the table contains the divisors 3, 5, 25, and so on, divisors of 6, 10, 100, and so forth can be handled.

This procedure usually gives the smallest magic number, but not always. The smallest positive divisor for which it fails in this respect for W = 32 is d = 334,972, for which it computes m = 3,361,176,179 and s = 18. However, the minimal magic number for d = 334,972 is m = 840,294,045, with s = 16. The procedure also fails to give the minimal magic number for d = –6. In both these cases, output code quality is affected.

Alverson [Alv] is the first known to the author to state that the method described here works with complete accuracy for all divisors. Using our notation, his method for unsigned integer division by d is to set the shift amount p = W + ⌈ log₂ d ⌉, and the multiplier m = ⌈ 2^p / d ⌉ and then do the division by ⌋ n ÷ d = ⌊ mn / 2^p ⌋ (that is, multiply and shift right). He proves that the multiplier m is less than, 2^W+1 and that the method gets the exact quotient for all n expressible in W bits.

Alverson’s method is a simpler variation of ours in that it doesn’t require trial and error to determine p, and is therefore more suitable for building in hardware, which is his primary interest. His multiplier m is always greater than or equal to 2^W, and hence for the software application always gives the code illustrated by the “unsigned divide by 7” example (that is, always has the add and shrxi, or the alternative four instructions). Because most small divisors can be handled with a multiplier less than 2^W, it seems worthwhile to look for these cases.

For signed division, Alverson suggests finding the multiplier for |d| and a word length of W – 1 (then 2^{W –1} ≤ m < 2^W), multiplying the dividend by it, and negating the result if the operands have opposite signs. (The multiplier must be such that it gives the correct result when the dividend is 2^{W –1}, the absolute value of the maximum negative number.) It seems possible that this suggestion might give better code than what has been given here in the case that the multiplier m ≥ 2^W. Applying it to signed division by 7 gives the following code, where we have used the relation –x = +1 to avoid a branch:

Click here to view code image

abs   an,n
li    M,0x92492493  Magic number, (2**34+5)/7.
mulhu q,M,an        q = floor(M*an/2**32).
shri  q,q,2
shrsi t,n,31        These three instructions
xor   q,q,t         negate q if n is
sub   q,q,t         negative.

This is not quite as good as the code we gave for signed division by 7 (six versus seven instructions), but it would be useful on a machine that has abs and mulhu, but not mulhs.

The next section gives some representative magic numbers.

10–14 Sample Magic Numbers

TABLE10–1. SOME MAGIC NUMBERS FOR W = 32

TABLE 10–2. SOME MAGIC NUMBERS FOR W = 64

10–15 Simple Code in Python

Computing a magic number is greatly simplified if one is not limited to doing the calculations in the same word size as that of the environment in which the magic number will be used. For the unsigned case, for example, in Python it is straightforward to compute n_c and then evaluate Equations (27) and (26), as described in Section 10–9. Figure 10–4 shows such a function.

Click here to view code image

def magicgu(nmax, d):
   nc = (nmax//d)*d - 1
   nbits = int(log(nmax, 2)) + 1
   for p in range(0, 2*nbits + 1):
      if 2**p > nc*(d - 1 - (2**p - 1)%d):
         m = (2**p + d - 1 - (2**p - 1)%d)//d
         return (m, p)
   print "Can't find p, something is wrong."
   sys.exit(1)

FIGURE 10–4. Python code for computing the magic number for unsigned division.

The function is given the maximum value of the dividend nmax and the divisor d. It returns a pair of integers: the magic number m and a shift amount p. To divide a dividend x by d, one multiplies x by m and then shifts the (full length) product right p bits.

This program is more general than the others in this chapter in two ways: (1) one specifies the maximum value of the dividend (nmax), rather than the number of bits required for the dividend, and (2) the program can be used for arbitrarily large dividends and divisors (“bignums”). The advantage of specifying the maximum value of the dividend is that one sometimes gets a smaller magic number than would be obtained if the next power of two less 1 were used for the maximum value. For example, suppose the maximum value of the dividend is 90, and the divisor is 7. Then function magicgu returns (37, 8), meaning that the magic number is 37 (a 6-bit number) and the shift amount is 8. But if we asked for a magic number that can handle divisors up to 127, then the result is (147, 10), and 147 is an 8-bit number.

10–16 Exact Division by Constants

By “exact division,” we mean division in which it is known beforehand, somehow, that the remainder is 0. Although this situation is not common, it does arise, for example, when subtracting two pointers in the C language. In C, the result of p – q, where p and q are pointers, is well defined and portable only if p and q point to objects in the same array [H&S, sec. 7.6.2]. If the array element size is s, the object code for the difference p – q computes (p – q) / s.

The material in this section was motivated by [GM, sec. 9].

The method to be given applies to both signed and unsigned exact division, and is based on the following theorem.

THEOREM MI. If a and m are relatively prime integers, then there exists an integer ā, 1 ≤ ā < m, such that

aā ≡ 1 (mod m).

That is, ā is a multiplicative inverse of a, modulo m. There are several ways to prove this theorem; three proofs are given in [NZM, p. 52]. The proof below requires only a very basic familiarity with congruences.

Proof. We will prove something a little more general than the theorem. If a and m are relatively prime (therefore nonzero), then as x ranges over all m distinct values modulo m, ax takes on all m distinct values modulo m. For example, if a = 3 and m = 8, then as x ranges from 0 to 7, ax = 0, 3, 6, 9, 12, 15, 18, 21 or, reduced modulo 8, ax = 0, 3, 6, 1, 4, 7, 2, 5. Observe that all values from 0 to 7 are present in the last sequence.

To see this in general, assume that it is not true. Then there exist distinct integers that map to the same value when multiplied by a; that is, there exist x and y, with x y(mod m), such that

ax ≡ ay (mod m).

Then there exists an integer k such that

ax – ay = km, or

a (x – y) = km.

Because a has no factor in common with m, it must be that x – y is a multiple of m; that is,

x ≡ y (mod m).

This contradicts the hypothesis.

Now, because ax takes on all m distinct values modulo m, as x ranges over the m values, it must take on the value 1 for some x.

The proof shows that there is only one value (modulo m) of x such that ax ≡ 1 (mod m)—that is, the multiplicative inverse is unique, apart from additive multiples of m. It also shows that there is a unique (modulo m) integer x such that ax ≡ b (mod m), where b is any integer.

As an example, consider the case m = 16. Then , because 3 · 11 = 33 ≡ 1 (mod 16). We could just as well take , because 3 · (–5) = –15 ≡ 1 (mod 16). Similarly , because (–3) · 5 = –15 ≡ 1 (mod 16).

These observations are important because they show that the concepts apply to both signed and unsigned numbers. If we are working in the domain of unsigned integers on a 4-bit machine, we take . In the domain of signed integers, we take . But 11 and –5 have the same representation in two’s-complement (because they differ by 16), so the same computer word contents can serve in both domains as the multiplicative inverse.

The theorem applies directly to the problem of division (signed and unsigned) by an odd integer d on a W-bit computer. Because any odd integer is relatively prime to 2^W, the theorem says that if d is odd, there exists an integer (unique in the range 0 to 2^W– 1 or in the range –2^{W – 1} to 2^{W –1} – 1) such that

Hence, for any integer n that is a multiple of d,

In other words, n/d can be calculated by multiplying n by and retaining only the rightmost W bits of the product.

If the divisor d is even, let d = d_o · 2^k, where d_o is odd and k ≥ 1. Then, simply shift n right k positions (shifting out 0’s), and then multiply by (the shift could be done after the multiplication as well).

Below is the code for division of n by 7, where n is a multiple of 7. This code gives the correct result whether it is considered to be signed or unsigned division.

Click here to view code image

li M,0xB6DB6DB7 Mult. inverse, (5*2**32 + 1)/7.
mul q,M,n q = n/7.

Computing the Multiplicative Inverse by the Euclidean Algorithm

How can we compute the multiplicative inverse? The standard method is by means of the “extended Euclidean algorithm.” This is briefly discussed below as it applies to our problem, and the interested reader is referred to [NZM, p. 13] and to [Knu2, 4.5.2] for a more complete discussion.

Given an odd divisor d, we wish to solve for x

dx ≡ 1(mod/m),

where, in our application, m = 2^W and W is the word size of the machine. This will be accomplished if we can solve for integers x and y (positive, negative, or 0) the equation

dx + my = 1.

Toward this end, first make d positive by adding a sufficient number of multiples of m to it. (d and d + km have the same multiplicative inverse.) Second, write the following equations (in which d, m > 0):

If d = 1, we are done, because (ii) shows that x = 1. Otherwise, compute

Third, multiply Equation (ii) by q and subtract it from (i). This gives

d (– 1 – q) + m(1) = m – d – qd = rem(m – d, d).

This equation holds because we have simply multiplied one equation by a constant and subtracted it from another. If rem(m – d, d) = 1, we are done; this last equation is the solution and x = – 1 – q.

Repeat this process on the last two equations, obtaining a fourth, and continue until the right-hand side of the equation is 1. The multiplier of d, reduced modulo m, is then the desired inverse of d.

Incidentally, if m – d < d, so that the first quotient is 0, then the third row will be a copy of the first, so that the second quotient will be nonzero. Furthermore, most texts start with the first row being

d(0) + m(1) = m,

but in our application m = 2^W is not representable in the machine.

The process is best illustrated by an example: Let m = 256 and d = 7. Then the calculation proceeds as follows. To get the third row, note that q = ⌊249 / 7 ⌋ = 35.

Click here to view code image

7(-1) + 256( 1) = 249
7( 1) + 256( 0) = 7
7(-36) + 256( 1) = 4
7( 37) + 256(-1) = 3
7(-73) + 256( 2) = 1

Thus, the multiplicative inverse of 7, modulo 256, is –73 or, expressed in the range 0 to 255, is 183. Check: 7 · 183 = 1281 ≡ 1 (mod 256).

From the third row on, the integers in the right-hand column are all remainders of dividing the number above it into the number two rows above it, so they form a sequence of strictly decreasing nonnegative integers. Therefore, the sequence must end in 0 (as the above would if carried one more step). Furthermore, the value just before the 0 must be 1, for the following reason. Suppose the sequence ends in b followed by 0, with b ≠ 1. Then, the integer preceding the b must be a multiple of b, let’s say k₁b, for the next remainder to be 0. The integer preceding k₁ b must be of the form k₁k₂ b+ b, for the next remainder to be b. Continuing up the sequence, every number must be a multiple of b, including the first two (in the positions of the 249 and the 7 in the above example). This is impossible, because the first two integers are m – d and d, which are relatively prime.

This constitutes an informal proof that the above process terminates, with a value of 1 in the right-hand column, and hence it finds the multiplicative inverse of d.

To carry this out on a computer, first note that if d < 0, we should add 2^W to it. With two’s-complement arithmetic it is not necessary to actually do anything here; simply interpret d as an unsigned number, regardless of how the application interprets it.

The computation of q must use unsigned division.

Observe that the calculations can be done modulo m, because this does not change the right-hand column (these values are in the range 0 to m – 1 anyway). This is important, because it enables the calculations to be done in “single precision,” using the computer’s modulo-2^W unsigned arithmetic.

Most of the quantities in the table need not be represented. The column of multiples of 256 need not be represented, because in solving dx + my = 1, we do not need the value of y. There is no need to represent d in the first column. Reduced to its bare essentials, then, the calculation of the above example is carried out as follows:

Click here to view code image

255  249
  1    7
220    4
37    3
183    1

A C program for performing this computation is shown in Figure 10–5.

Click here to view code image

unsigned mulinv(unsigned d) {           // d must be odd.
   unsigned x1, v1, x2, v2, x3, v3, q;

   x1 = 0xFFFFFFFF;      v1 = -d;
   x2 = 1;              v2 = d;
   while (v2 > 1) {
      q = v1/v2;
      x3 = x1 - q*x2;   v3 = v1 - q*v2;
      x1 = x2;           v1 = v2;
      x2 = x3;           v2 = v3;
   }
   return x2;
}

FIGURE 10–5. Multiplicative inverse modulo 2³² by the Euclidean algorithm.

The reason the loop continuation condition is (v2 > 1), rather than the more natural (v2 != 1), is that if the latter condition were used, the loop would never terminate if the program were invoked with an even argument. It is best that programs not loop forever even if misused. (If the argument d is even, v2 never takes on the value 1, but it does become 0.)

What does the program compute if given an even argument? As written, it computes a number x such that dx ≡ 0 (mod 2³²), which is probably not useful. However, with the minor modification of changing the loop continuation condition to (v2 != 0) and returning x1 rather than x2, it computes a number x such that dx ≡ g (mod 2³²), where g is the greatest common divisor of d and 2³²—that is, the greatest power of 2 that divides d. The modified program still computes the multiplicative inverse of d for d odd, but it requires one more iteration than the unmodified program.

As for the number of iterations (divisions) required by the above program, for d odd and less than 20, it requires a maximum of 3 and an average of 1.7. For d in the neighborhood of 1000, it requires a maximum of 11 and an average of about 6.

Computing the Multiplicative Inverse by Newton’s Method

It is well known that, over the real numbers, 1 / d, for d ≠ 0, can be calculated to ever-increasing accuracy by iteratively evaluating

provided the initial estimate x₀ is sufficiently close to 1/d. The number of digits of accuracy approximately doubles with each iteration.

It is not so well known that this same formula can be used to find the multiplicative inverse modulo any power of 2!. For example, to find the multiplicative inverse of 3, modulo 256, start with x₀ = 1 (any odd number will do). Then,

The iteration has reached a fixed point modulo 256, so –85, or 171, is the multiplicative inverse of 3 (modulo 256). All calculations can be done modulo 256.

Why does this work? Because if x_n satisfies

dx_n ≡ 1 (mod m)

and if x_{n + 1} is defined by (31), then

dx_{n +1} ≡ 1 (mod m²).

To see this, let dx_n = 1 + km. Then

In our application, m is a power of 2, say 2^N. In this case, if

In a sense, if x_n is regarded as a sort of approximation to , then each iteration of (31) doubles the number of bits of “accuracy” of the approximation.

It happens that modulo 8, the multiplicative inverse of any (odd) number d is d itself. Thus, taking x₀ = d is a reasonable and simple initial guess at . Then, (31) will give values of x₁, x₂, ..., such that

Thus, four iterations suffice to find the multiplicative inverse modulo 2³² (if x ≡ 1 (mod 2⁴⁸), then x ≡ 1 (mod 2ⁿ) for n ≤ 48). This leads to the C program in Figure 10–6, in which all computations are done modulo 2³².

For about half the values of d, this program takes 4.5 iterations, or nine multiplications. For the other half (those for which the initial value of xn is “correct to 4 bits”—that is, d² ≡ 1 (mod 16)), it takes seven or fewer, usually seven, multiplications. Thus, it takes about eight multiplications on average.

Click here to view code image

unsigned mulinv(unsigned d) {       // d must be odd.
   unsigned xn, t;

   xn = d;
loop: t = d*xn;
      if (t == 1) return xn;
      xn = xn*(2 - t);
      goto loop;
}

FIGURE 10–6. Multiplicative inverse modulo 2³² by Newton’s method.

A variation is to simply execute the loop four times, regardless of d, perhaps “strung out” to eliminate the loop control (eight multiplications). Another variation is to somehow make the initial estimate x₀ “correct to 4 bits” (that is, find x₀ that satisfies dx₀ ≡ 1 (mod 16)). Then, only three loop iterations are required. Some ways to set the initial estimate are

Here, the multiplication by 2 is a left shift, and the computations are done modulo 2³² (ignoring overflow). Because the second formula uses a multiplication, it saves only one.

This concern about execution time is, of course, totally unimportant for the compiler application. For that application, the routine would be so seldom used that it should be coded for minimum space. But there may be applications in which it is desirable to compute the multiplicative inverse quickly.

The “Newton method” described here applies only when (1) the modulus is an integral power of some number a, and (2) the multiplicative inverse of d modulo a is known. It works particularly well for a = 2, because then the multiplicative inverse of any (odd) number d modulo 2 is known immediately—it is 1.

Sample Multiplicative Inverses

We conclude this section with a listing of some multiplicative inverses in Table 10–3.

TABLE 10–3. SAMPLE MULTIPLICATIVE INVERSES

You may notice that in several cases (d = 3, 5, 9, 11), the multiplicative inverse of d is the same as the magic number for unsigned division by d (see Section 10–14, “Sample Magic Numbers,” on page 238). This is more or less a coincidence. It happens that for these numbers, the magic number M is equal to the multiplier m, and these are of the form (2^p + 1) / d, with p ≥ 32. In this case, notice that

so that M ≡ (mod 2³²).

10–17 Test for Zero Remainder after Division by a Constant

The multiplicative inverse of a divisor d can be used to test for a zero remainder after division by d[GM].

Unsigned

First, consider unsigned division with the divisor d odd. Denote by the multiplicative inverse of d. Then, because , where W is the machine’s word size in bits, is also odd. Thus, is relatively prime to 2^W, and as shown in the proof of theorem MI in the preceding section, as n ranges over all 2^W distinct values modulo 2^W, takes on all 2^W distinct values modulo 2^W.

It was shown in the preceding section that if n is a multiple of d,

That is, for n = 0, d, 2d, ..., ⌊(2^W – 1) / d ⌋ d, ≡ 0, 1, 2, ..., ⌊(2^W – 1) / d ⌋(mod 2^W). Therefore, for n not a multiple of d, the value of , reduced modulo 2^W to the range 0 to 2^W – 1, must exceed ⌊(2^W – 1) / d ⌋.

This can be used to test for a zero remainder. For example, to test if an integer n is a multiple of 25, multiply n by and compare the rightmost W bits to ⌊(2^W– 1) / 25 ⌋. On our basic RISC:

Click here to view code image

li      M,0xC28F5C29  Load mult. inverse of 25.
mul    q,M,n          q = right half of M*n.
li      c,0x0A3D70A3  c = floor((2**32-1)/25).
cmpleu t,q,c          Compare q and c, and branch
bt     t,is_mult      if n is a multiple of 25.

To extend this to even divisors, let d = d_o · 2^k, where d_o is odd and k ≥ 1. Then, because an integer is divisible by d if and only if it is divisible by d_o and by 2^k, and because n and have the same number of trailing zeros ( is odd), the test that n is a multiple of d is

where the mod function is understood to reduce to the interval [0, 2^W –1].

Direct implementation of this requires two tests and conditional branches, but it can be reduced to one compare-branch quite efficiently if the machine has the rotate-shift instruction. This follows from the following theorem, in which denotes the computer word a rotated right k positions (0 ≤ k ≤ 32).

THEOREM ZRU. and x ends in k 0-bits if and only if

Proof. (Assume a 32-bit machine.) Suppose and x ends in k 0-bits. Then, because , But Therefore, If x does not end in k 0-bits, then does not begin with k 0-bits, whereas ⌊a /2^k⌋ does, so Lastly, if and x ends in k 0-bits, then the integer formed from the first 32 – k bits of x must exceed that formed from the first 32 - k bits of a, so that

Using this theorem, the test that n is a multiple of d, where n and d > 1 are unsigned integers and d = d_o · 2^k with d_o odd, is

Here we used ⌊⌊(2^W– 1) / d_o ⌋ / 2^k ⌋ = ⌊(2^W– 1) / (d_o · 2^k)⌋ = ⌊ (2^W– 1) / d⌋.

As an example, the following code tests an unsigned integer n to see if it is a multiple of 100:

Click here to view code image

li     M,0xC28F5C29  Load mult. inverse of 25.
mul    q,M,n         q = right half of M*n.
shrri  q,q,2         Rotate right two positions.
li     c,0x028F5C28  c = floor((2**32-1)/100).
cmpleu t,q,c         Compare q and c, and branch
bt     t,is_mult     if n is a multiple of 100.

Signed, Divisor ≥ 2

For signed division, it was shown in the preceding section that if n is a multiple of d and d is odd, then

Thus, for n = ⌈ –2^{W – 1} /d ⌉ · d, ...,–d,0, d, ...,⌊(2^{W – 1} – 1) / d ⌋ · d, we have ≡ ⌈– 2^{W – 1} / d,...,–1,0, 1, ...,⌊ (2^{W –1} – 1) / d ⌋ (mod 2^W). Furthermore, because d is relatively prime to 2^W, as n ranges over all 2^W distinct values modulo 2^W, takes on all 2^W distinct values modulo 2^W. Therefore, n is a multiple of d if and only if

where the mod function is understood to reduce to the interval [–2^{W – 1} 2^{W – 1} –1]

This can be simplified a little by observing that because d is odd and, as we are assuming, positive and not equal to 1, it does not divide 2^W–1. Therefore,

⌈ – 2^{W – 1} / d ⌉ = ⌈ (–2^{W – 1} + 1) d ⌉ = - ⌊(2^W–1 –1)/d⌋.

Thus, for signed numbers, the test that n is a multiple of d, where d = d_o · 2^k and d_o is odd, is

Set q = mod(, 2^W);

-⌊ (2^{W –1} – 1) / d_o ⌋ ≤ q ≤ ⌊ (2^{W – 1} – 1)/d_o ⌋ and q ends in k or more 0-bits.

On the surface, this would seem to require three tests and branches. However, as in the unsigned case, it can be reduced to one compare-branch by use of the following theorem:

THEOREM ZRS. If a ≥ 0, the following assertions are equivalent:

where a′ is a with its rightmost k bits set to 0 (that is, a ′ = a & –2^k).

Proof. (Assume a 32-bit machine). To see that (1) is equivalent to (2), clearly the assertion – a ≤ x ≤ a is equivalent to abs(x) ≤ a. Then, Theorem ZRU applies, because both sides of this inequality are nonnegative.

To see that (1) is equivalent to (3), note that assertion (1) is equivalent to itself with a replaced with a′. Then, by the theorem on bounds checking on page 68, this in turn is equivalent to

Because x + a′ ends in k 0-bits if and only if x does, Theorem ZRU applies, giving the result.

Using part (3) of this theorem, the test that n is a multiple of d, where n and d ≥ 2 are signed integers and d = d_o · 2^k with d_o odd, is

(a′ can be computed at compile time, because d is a constant.)

As an example, the following code tests a signed integer n to see if it is a multiple of 100. Notice that the constant ⌊2a′ / 2^k⌋ can always be derived from the constant a′ by a shift of k – 1 bits, saving an instruction or a load from memory to develop the comparand.

Click here to view code image

li     M,0xC28F5C29  Load mult. inverse of 25.
mul    q,M,n         q = right half of M*n.
li     c,0x051EB850  c = floor((2**31 – 1)/25) & –4.
add    q,q,c         Add c.
shrri  q,q,2         Rotate right two positions.
shri   c,c,1         Compute const. for comparison.
cmpleu t,q,c         Compare q and c, and
bt     t,is_mult     branch if n is a mult. of 100.

10–18 Methods Not Using Multiply High

In this section we consider some methods for dividing by constants that do not use the multiply high instruction, or a multiplication instruction that gives a double-word result. We show how to change division by a constant into a sequence of shift and add instructions, or shift, add, and multiply for more compact code.

Unsigned Division

For these methods, unsigned division is simpler than signed division, so we deal with unsigned division first. One method is to use the techniques given that use the multiply high instruction, but use the code shown in Figure 8–2 on page 174 to do the multiply high operation. Figure 10–7 shows how this works out for the case of (unsigned) division by 3. This is a combination of the code on page 228 and Figure 8–2 with “int” changed to “unsigned.” The code is 15 instructions, including four multiplications. The multiplications are by large constants and would take quite a few instructions if converted to shift’s and add’s. Very similar code can be devised for the signed case. This method is not particularly good and won’t be discussed further.

Another method [GLS1] is to compute in advance the reciprocal of the divisor, and multiply the dividend by that with a series of shift right and add instructions. This gives an approximation to the quotient. It is merely an approximation, because the reciprocal of the divisor (which we assume is not an exact power of two) is not expressed exactly in 32 bits, and also because each shift right discards bits of the dividend. Next, the remainder with respect to the approximate quotient is computed, and that is divided by the divisor to form a correction, which is added to the approximate quotient, giving the exact quotient. The remainder is generally small compared to the divisor (a few multiples thereof), so there is often a simple way to compute the correction without using a divide instruction.

To illustrate this method, consider dividing by 3, that is, computing ⌊n / 3 ⌋ where 0 ≤ n < 2³². The reciprocal of 3, in binary, is approximately

0.0101 0101 0101 0101 0101 0101 0101 0101.

To compute the approximate product of that and n, we could use

Click here to view code image

unsigned divu3(unsigned n) {
   unsigned n0, n1, w0, w1, w2, t, q;

   n0 = n & 0xFFFF;
   n1 = n >> 16;
   w0 = n0*0xAAAB;
   t = n1*0xAAAB + (w0 >> 16);
   w1 = t & 0xFFFF;
   w2 = t >> 16;
   w1 = n0*0xAAAA + w1;
   q = n1*0xAAAA + w2 + (w1 >> 16);
   return q >> 1;
}

FIGURE 10–7. Unsigned divide by 3 using simulated multiply high unsigned.

(29 instructions; the last 1 in the reciprocal is ignored because it would add the term which is obviously 0). However, the simple repeating pattern of 1’s and 0’s in the reciprocal permits a method that is both faster (nine instructions) and more accurate:

To compare these methods for their accuracy, consider the bits that are shifted out by each term of (32), if n is all 1-bits. The first term shifts out two 1-bits, the next four 1-bits, and so on. Each of these contributes an error of almost 1 in the least significant bit. Since there are 16 terms (counting the term we ignored), the shifts contribute an error of almost 16. There is an additional error due to the fact that the reciprocal is truncated to 32 bits; it turns out that the maximum total error is 16.

For procedure (1), each right shift also contributes an error of almost 1 in the least significant bit. But there are only five shift operations. They contribute an error of almost 5, and there is a further error due to the fact that the reciprocal is truncated to 32 bits; it turns out that the maximum total error is 5.

After computing the estimated quotient q, the remainder r is computed from

r ← n – q * 3.

The remainder cannot be negative, because q is never larger than the exact quotient. We need to know how large r can be to devise the simplest possible method for computing In general, for a divisor d and an estimated quotient q too low by k, the remainder will range from k*d to k*d + d – 1. (The upper limit is conservative; it may not actually be attained.) Thus, using (1), for which q is too low by at most 5, we expect the remainder to be at most 5*3 + 2 = 17. Experimentation reveals that it is actually at most 15. Thus, for the correction we must compute (exactly)

Since r is small compared to the largest value that a register can hold, this can be approximated by multiplying r by some approximation to 1/3 of the form a/b where b is a power of 2. This is easy to compute, because the division is simply a shift. The value of a/ b must be slightly larger than 1/3, so that after shifting the result will agree with truncated division. A sequence of such approximations is:

1/2, 2/4, 3/8, 6/16, 11/32, 22/64, 43/128, 86/256, 171/512, 342/1024, ....

Usually, the smaller fractions in the sequence are easier to compute, so we choose the smallest one that works; in the case at hand this is 11/32. Therefore, the final, exact, quotient is given by

The solution involves two multiplications by small numbers (3 and 11); these can be changed to shift’s and add’s.

Figure 10–8 shows the entire solution in C. As shown, it consists of 14 instructions, including two multiplications. If the multiplications are changed to shift’s and add’s, it amounts to 18 elementary instructions. However, if it is desired to avoid the multiplications, then either alternative return statement shown gives a solution in 17 elementary instructions. Alternative 2 has just a little instruction-level parallelism, but in truth this method generally has very little of that.

A more accurate estimate of the quotient can be obtained by changing the first executable line to

Click here to view code image

q = (n >> 1) + (n >> 3);

(which makes q too large by a factor of 2, but it has one more bit of accuracy), and then inserting just before the assignment to r,

Click here to view code image

q = q >> 1;

With this variation, the remainder is at most 9. However, there does not seem to be any better code for calculating with r limited to 9 than there is for r limited to 15 (four elementary instructions in either case). Thus, using the idea would cost one instruction. This possibility is mentioned because it does give a code improvement for most divisors.

Click here to view code image

unsigned divu3(unsigned n) {
   unsigned q, r;

   q = (n >> 2) + (n >> 4);     // q = n*0.0101 (approx).
   q = q + (q >> 4);            // q = n*0.01010101.
   q = q + (q >> 8);
   q = q + (q >> 16);
   r = n - q*3;                 // 0 <= r <= 15.
   return q + (11*r >> 5);      // Returning q + r/3.
// return q + (5*(r + 1) >> 4);         // Alternative 1.
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2.
}

FIGURE 10–8. Unsigned divide by 3.

Figure 10–9 shows two variations of this method for dividing by 5. The reciprocal of 5, in binary, is

0.0011 0011 0011 0011 0011 0011 0011 0011.

As in the case of division by 3, the simple repeating pattern of 1’s and 0’s allows a fairly efficient and accurate computing of the quotient estimate. The estimate of the quotient computed by the code on the left can be off by at most 5, and it turns out that the remainder is at most 25. The code on the right retains two additional bits of accuracy in computing the quotient estimate, which is off by at most 2. The remainder in this case is at most 10. The smaller maximum remainder permits approximating 1/5 by 7/32 rather than 13/64, which gives a slightly more efficient program if the multiplications are done by shift’s and add’s. The instruction counts are, for the code on the left: 14 instructions including two multiplications, or 18 elementary instructions; for the code on the right: 15 instructions including two multiplications, or 17 elementary instructions. The alternative code in the return statement is useful only if your machine has comparison predicate instructions. It doesn’t reduce the instruction count, but merely has a little instruction-level parallelism.

For division by 6, the divide-by-3 code can be used, followed by a shift right of 1. However, the extra instruction can be saved by doing the computation directly, using the binary approximation

4/6 ≈ 0.1010 1010 1010 1010 1010 1010 1010 1010.

FIGURE 10–9. Unsigned divide by 5.

The code is shown in Figure 10–10. The version on the left multiplies by an approximation to 1/6 and then corrects with a multiplication by 11/64. The version on the right takes advantage of the fact that by multiplying by an approximation to 4/6, the quotient estimate is off by only 1 at most. This permits simpler code for the correction; it simply adds 1 to q if r ≥ 6. The code in the second return statement is appropriate if the machine has the comparison predicate instructions. Function divu6b is 15 instructions, including one multiply, as shown, or 17 elementary instructions if the multiplication by 6 is changed to shift’s and add’s.

FIGURE 10–10. Unsigned divide by 6.

For larger divisors, usually it seems to be best to use an approximation to 1/ d that is shifted left so that its most significant bit is 1. It seems that the quotient is then off by at most 1 usually (possibly always, this writer does not know), which permits efficient code for the correction step. Figure 10–11 shows code for dividing by 7 and 9, using the binary approximations

If the multiplications by 7 and 9 are expanded into shift’s and add’s, these functions take 16 and 15 elementary instructions, respectively.

FIGURE 10–11. Unsigned divide by 7 and 9.

Figures 10–12 and 10–13 show code for dividing by 10, 11, 12, and 13. These are based on the binary approximations:

If the multiplications are expanded into shift’s and add’s, these functions take 17, 20, 17, and 20 elementary instructions, respectively.

FIGURE 10–12. Unsigned divide by 10 and 11.

FIGURE 10–13. Unsigned divide by 12 and 13.

The case of dividing by 13 is instructive because it shows how you must look for repeating strings in the binary expansion of the reciprocal of the divisor. The first assignment sets q equal to n*0.1001. The second assignment to q adds n*0.00001001 and n*0.000001001. At this point, q is (approximately) equal to n*0.100111011. The third assignment to q adds in repetitions of this pattern. It sometimes helps to use subtraction, as in the case of divu9 above. However, you must use care with subtraction, because it may cause the quotient estimate to be too large, in which case the remainder is negative and the method breaks down. It is quite complicated to get optimal code, and we don’t have a general cookbook method that you can put in a compiler to handle any divisor.

The examples above are able to economize on instructions, because the reciprocals have simple repeating patterns, and because the multiplication in the computation of the remainder r is by a small constant, which can be done with only a few shift’s and add’s. One might wonder how successful this method is for larger divisors. To roughly assess this, Figures 10–14 and 10–15 show code for dividing by 100 and by 1000 (decimal). The relevant reciprocals are

If the multiplications are expanded into shift’s and add’s, these functions take 25 and 23 elementary instructions, respectively.

Click here to view code image

unsigned divu100(unsigned n) {
   unsigned q, r;

   q = (n >> 1) + (n >> 3) + (n >> 6) - (n >> 10) +
       (n >> 12) + (n >> 13) - (n >> 16);
   q = q + (q >> 20);
   q = q >> 6;
   r = n - q*100;
   return q + ((r + 28) >> 7);
// return q + (r > 99);
}

FIGURE 10–14. Unsigned divide by 100.

Click here to view code image

unsigned divu1000(unsigned n) {
   unsigned q, r, t;

   t = (n >> 7) + (n >> 8) + (n >> 12);
   q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14);
   q = q >> 9;
   r = n - q*1000;
   return q + ((r + 24) >> 10);
// return q + (r > 999);
}

FIGURE 10–15. Unsigned divide by 1000.

In the case of dividing by 1000, the least significant eight bits of the reciprocal estimate are nearly ignored. The code of Figure 10–15 replaces the binary 1001 0111 with 0100 0000, and still the quotient estimate is within one of the true quotient. Thus, it appears that although large divisors might have very little repetition in the binary representation of the reciprocal estimate, at least some bits can be ignored, which helps hold down the number of shift’s and add’s required to compute the quotient estimate.

This section has shown, in a somewhat imprecise way, how unsigned division by a constant can be reduced to a sequence of, typically, about 20 elementary instructions. It is nontrivial to get an algorithm that generates these code sequences that is suitable for incorporation into a compiler, because of three difficulties in getting optimal code.

1. It is necessary to search the reciprocal estimate bit string for repeating patterns.

2. Negative terms (as in divu10 and divu100) can be used sometimes, but the error analysis required to determine just when they can be used is difficult.

3. Sometimes some of the least significant bits of the reciprocal estimate can be ignored (how many?).

Another difficulty for some target machines is that there are many variations on the code examples given that have more instructions, but that would execute faster on a machine with multiple shift and add units.

The code of Figures 10–7 through 10–15 has been tested for all 2³² values of the dividends.

Signed Division

The methods given above can be made to apply to signed division. The right shift instructions in computing the quotient estimate become signed right shift instructions, which compute floor division by powers of 2. Thus, the quotient estimate is too low (algebraically), so the remainder is nonnegative, as in the unsigned case.

The code most naturally computes the floor division result, so we need a correction to make it compute the conventional truncated-toward-0 result. This can be done with three computational instructions by adding d – 1 to the dividend if the dividend is negative. For example, if the divisor is 6, the code begins with (the shift here is a signed shift)

Click here to view code image

n = n + (n >> 31 & 5);

Other than this, the code is very similar to that of the unsigned case. The number of elementary operations required is usually three more than in the corresponding unsigned division function. Several examples are given in Figures 10–16 through 10–22. All have been exhaustively tested.

Click here to view code image

int divs3(int n) {
   int q, r;

   n = n + (n>>31 & 2);         // Add 2 if n < 0.
   q = (n >> 2) + (n >> 4);     // q = n*0.0101 (approx).
   q = q + (q >> 4);            // q = n*0.01010101.
   q = q + (q >> 8);
   q = q + (q >> 16);
   r = n - q*3;                 // 0 <= r <= 14.
   return q + (11*r >> 5);      // Returning q + r/3.
// return q + (5*(r + 1) >> 4);         // Alternative 1.
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2.
}

FIGURE 10–16. Signed divide by 3.

FIGURE 10–17. Signed divide by 5 and 6.

FIGURE 10–18. Signed divide by 7 and 9.

FIGURE 10–19. Signed divide by 10 and 11.

FIGURE 10–20. Signed divide by 12 and 13.

FIGURE 10–21. Signed divide by 100.

Click here to view code image

int divs1000(int n) {
   int q, r, t;

   n = n + (n >> 31 & 999);
   t = (n >> 7) + (n >> 8) + (n >> 12);
   q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14) +
       (n >> 26) + (t >> 21);
   q = q >> 9;
   r = n - q*1000;
   return q + ((r + 24) >> 10);
// return q + (r > 999);
}

FIGURE 10-22. Signed divide by 1000.

10-19 Remainder by Summing Digits

This section addresses the problem of computing the remainder of division by a constant without computing the quotient. The methods of this section apply only to divisors of the form 2^k ± 1, for k an integer greater than or equal to 2, and in most cases the code resorts to a table lookup (an indexed load instruction) after a fairly short calculation.

We will make frequent use of the following elementary property of congruences:

THEOREM C. If a ≡ b (mod m) and c ≡ d (mod m), then

The unsigned case is simpler and is dealt with first.

Unsigned Remainder

For a divisor of 3, multiplying the trivial congruence 1 ≡ 1 (mod 3) repeatedly by the congruence 2 ≡ –1 (mod 3), we conclude by Theorem C that

Therefore, a number n written in binary as ...b₃ b₂ b₁ b₀ satisfies

n = ... + b₃ · 2³ + b₂ · 2² + b₁ · 2 + b₀ ≡ ...– b₃ + b₂ – b₁ + b₀ (mod 3),

which is derived by using Theorem C repeatedly. Thus, we can alternately add and subtract the bits in the binary representation of the number to obtain a smaller number that has the same remainder upon division by 3. If the sum is negative, you must add a multiple of 3 to make it nonnegative. The process can then be repeated until the result is in the range 0 to 2.

The same trick works for finding the remainder after dividing a decimal number by 11.

Thus, if the machine has the population count instruction, a function that computes the remainder modulo 3 of an unsigned number n might begin with

Click here to view code image

n = pop(n & 0x55555555) - pop(n & 0xAAAAAAAA);

This can be simplified by using the following surprising identity discovered by Paolo Bonzini [Bonz]:

Since the references to 32 (the word size) cancel out, the result holds for any word size. Another way to prove (2) is to observe that it holds for x = 0, and if a 0-bit in x is changed to a 1 where m is 1, then both sides of (2) decrease by 1, and if a 0-bit of x is changed to a 1 where m is 0, then both sides of (2) increase by 1.

Applying (2) to the line of C code above gives

Click here to view code image

n = pop(n ^ 0xAAAAAAAA) - 16;

We want to apply this transformation again, until n is in the range 0 to 2, if possible. It is best to avoid producing a negative value of n, because the sign bit would not be treated properly on the next round. A negative value can be avoided by adding a sufficiently large multiple of 3 to n. Bonzini’s code, shown in Figure 10–23, increases the constant by 39. This is larger than necessary to make n nonnegative, but it causes n to range from –3 to 2 (rather than –3 to 3) after the second round of reduction. This simplifies the code on the return statement, which is adding 3 if n is negative. The function executes in 11 instructions, counting two to load the large constant.

Figure 10–24 shows a variation that executes in four instructions, plus a simple table lookup operation (e.g., an indexed load byte instruction).

Click here to view code image

int remu3(unsigned n) {
   n = pop(n ^ 0xAAAAAAAA) + 23;       // Now 23 <= n <= 55.
   n = pop(n ^ 0x2A) - 3;              // Now -3 <= n <= 2.
   return n + (((int)n >> 31) & 3);
}

FIGURE 10–23. Unsigned remainder modulo 3, using population count.

Click here to view code image

int remu3(unsigned n) {

   static char table[33] = {2, 0,1,2, 0,1,2, 0,1,2,
          0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
          0,1,2, 0,1};

   n = pop(n ^ 0xAAAAAAAA);
   return table[n];
}

FIGURE 10–24. Unsigned remainder modulo 3, using population count and a table lookup.

To avoid the population count instruction, notice that because 4 ≡ 1 (mod 3), 4^k ≡ 1 (mod 3). A binary number can be viewed as a base 4 number by taking its bits in pairs and interpreting the bits 00 to 11 as a base 4 digit ranging from 0 to 3. The pairs of bits can be summed using the code of Figure 5–2 on page 82, omitting the first executable line (overflow does not occur in the additions). The final sum ranges from 0 to 48, and a table lookup can be used to reduce this to the range 0 to 2. The resulting function is 16 elementary instructions, plus an indexed load.

There is a similar, but slightly better, way. As a first step, n can be reduced to a smaller number that is in the same congruence class modulo 3 with

Click here to view code image

n = (n >> 16) + (n & 0xFFFF);

This splits the number into two 16-bit portions, which are added together. The contribution modulo 3 of the left 16 bits of n is not altered by shifting them right 16 positions, because the shifted number, multiplied by 2¹⁶, is the original number, and 2¹⁶ ≡ 1 (mod 3). More generally, 2^k ≡ 1 (mod 3) if k is even. This is used repeatedly (five times) in the code shown in Figure 10–25. This code is 19 instructions. The instruction count can be reduced by cutting off the digit summing earlier and using an in-memory table lookup, as illustrated in Figure 10–26 (nine instructions, plus an indexed load). The instruction count can be reduced to six (plus an indexed load) by using a table of size 0x2FE = 766 bytes.

To compute the unsigned remainder modulo 5, the code of Figure 10–27 uses the relations 16^k ≡ 1 (mod 5) and 4 ≡ –1 (mod 5). It is 21 elementary instructions, assuming the multiplication by 3 is expanded into a shift and add.

Click here to view code image

int remu3(unsigned n) {
   n = (n >> 16) + (n & 0xFFFF);       // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);       // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);       // Max 0x3D.
   n = (n >>  2) + (n & 0x0003);       // Max 0x11.
   n = (n >>  2) + (n & 0x0003);       // Max 0x6.
   return (0x0924 >> (n << 1)) & 3;
}

FIGURE 10–25. Unsigned remainder modulo 3, digit summing and an in-register lookup.

Click here to view code image

int remu3(unsigned n) {
   static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2,  0,1};

   n = (n >> 16) + (n & 0xFFFF);    // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);    // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);    // Max 0x3D.
   return table[n];
}

FIGURE 10–26. Unsigned remainder modulo 3, digit summing and an in-memory lookup.

Click here to view code image

int remu5(unsigned n) {
   n = (n >> 16) + (n & 0xFFFF);            // Max 0x1FFFE.
   n = (n >>  8) + (n & 0x00FF);            // Max 0x2FD.
   n = (n >>  4) + (n & 0x000F);            // Max 0x3D.
   n = (n>>4) - ((n>>2) & 3) + (n & 3);     // -3 to 6.
   return (01043210432 >> 3*(n + 3)) & 7;   // Octal const.
}

FIGURE 10–27. Unsigned remainder modulo 5, digit summing method.

The instruction count can be reduced by using a table, similar to what is done in Figure 10–26. In fact, the code is identical, except the table is:

Click here to view code image

static char table[62] = {0,1,2,3,4, 0,1,2,3,4,
   0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
   0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
   0,1,2,3,4, 0,1,2,3,4, 0,1};

For the unsigned remainder modulo 7, the code of Figure 10–28 uses the relation 8^k ≡ 1 (mod 7) (nine elementary instructions, plus an indexed load).

As a final example, the code of Figure 10–29 computes the remainder of unsigned division by 9. It is based on the relation 8 ≡ –1 (mod 9). As shown, it is nine elementary instructions, plus an indexed load. The elementary instruction count can be reduced to six by using a table of size 831 (decimal).

Click here to view code image

int remu7(unsigned n) {

   static char table[75] = {0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
             0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
             0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
             0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4};

   n = (n >> 15) + (n & 0x7FFF);        // Max 0x27FFE.
   n = (n >>  9) + (n & 0x001FF);       // Max 0x33D.
   n = (n >>  6) + (n & 0x0003F);       // Max 0x4A.
   return table[n];
}

FIGURE 10–28. Unsigned remainder modulo 7, digit summing method.

Click here to view code image

int remu9(unsigned n) {

   int r;
   static char table[75] = {0,1,2,3,4,5,6,7,8,
         0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
         0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
         0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
         0,1,2,3,4,5,6,7,8, 0,1,2};

   r = (n & 0x7FFF) - (n >> 15);    // FFFE0001 to 7FFF.
   r = (r & 0x01FF) - (r >>  9);    // FFFFFFC1 to 2FF.
   r = (r & 0x003F) + (r >>  6);    // 0 to 4A.
   return table[r];
}

FIGURE 10–29. Unsigned remainder modulo 9, digit summing method.

Signed Remainder

The digit summing method can be adapted to compute the remainder resulting from signed division. There seems to be no better way than to add a few steps to correct the result of the method as applied to unsigned division. Two corrections are necessary: (1) correct for a different interpretation of the sign bit, and (2) add or subtract a multiple of the divisor d to get the result in the range 0 to – (d – 1).

For division by 3, the unsigned remainder code interprets the sign bit of the dividend n as contributing 2 to the remainder (because 2³¹ mod 3 = 2). For the remainder of signed division, the sign bit contributes only 1 (because (–2³¹) mod 3 = 1). Therefore, we can use the code for an unsigned remainder and correct its result by subtracting 1. Then, the result must be put in the range 0 to –2. That is, the result of the unsigned remainder code must be mapped as follows:

(0, 1, 2) ⇒ (–1, 0, 1) ⇒ (–1, 0, –2).

This adjustment can be done fairly efficiently by subtracting 1 from the unsigned remainder if it is 0 or 1, and subtracting 4 if it is 2 (when the dividend is negative). The code must not alter the dividend n, because it is needed in this last step.

This procedure can easily be applied to any of the functions given for the unsigned remainder modulo 3. For example, applying it to Figure 10–26 on page 265 gives the function shown in Figure 10–30. It is 13 elementary instructions, plus an indexed load. The instruction count can be reduced by using a larger table.

Click here to view code image

int rems3(int n) {
   unsigned r;
   static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
       0,1,2, 0,1,2, 0,1};

   r = n;
   r = (r >> 16) + (r & 0xFFFF);    // Max 0x1FFFE.
   r = (r >>  8) + (r & 0x00FF);    // Max 0x2FD.
   r = (r >>  4) + (r & 0x000F);    // Max 0x3D.
   r = table[r];
   return r - (((unsigned)n >> 31) << (r & 2));
}

FIGURE 10–30. Signed remainder modulo 3, digit summing method.

Figures 10–31 to 10–33 show similar code for computing the signed remainder of division by 5, 7, and 9. All the functions consist of 15 elementary operations, plus an indexed load. They use signed right shifts, and the final adjustment consists of subtracting the modulus if the dividend is negative and the remainder is nonzero. The number of instructions can be reduced by using larger tables.

Click here to view code image

int rems5(int n) {
   int r;
   static char table[62] = {2,3,4, 0,1,2,3,4, 0,1,2,3,4,
             0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
             0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
             0,1,2,3,4, 0,1,2,3};

   r = (n >> 16) + (n & 0xFFFF);  // FFFF8000 to 17FFE.
   r = (r >>  8) + (r & 0x00FF);  // FFFFFF80 to 27D.
   r = (r >>  4) + (r & 0x000F);  // -8 to 53 (decimal).
   r = table[r + 8];
   return r - (((int)(n & -r) >> 31) & 5);
}

FIGURE 10–31. Signed remainder modulo 5, digit summing method.

Click here to view code image

int rems7(int n) {
   int r;
   static char table[75] =   {5,6, 0,1,2,3,4,5,6,
     0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
     0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
     0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2};

   r = (n >> 15) + (n & 0x7FFF);   // FFFF0000 to 17FFE.
   r = (r >>  9) + (r & 0x001FF); // FFFFFF80 to 2BD.
   r = (r >>  6) + (r & 0x0003F); // -2 to 72 (decimal).
   r = table[r + 2];
   return r - (((int)(n & -r) >> 31) & 7);
}

FIGURE 10–32. Signed remainder modulo 7, digit summing method.

Click here to view code image

int rems9(int n) {
   int r;
   static char table[75] = {7,8, 0,1,2,3,4,5,6,7,8,
              0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
              0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
              0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
              0,1,2,3,4,5,6,7,8, 0};

   r = (n & 0x7FFF) - (n >> 15);   // FFFF7001 to 17FFF.
   r = (r & 0x01FF) - (r >>  9);   // FFFFFF41 to 0x27F.
   r = (r & 0x003F) + (r >>  6);   // -2 to 72 (decimal).
   r = table[r + 2];
   return r - (((int)(n & -r) >> 31) & 9);
}

FIGURE 10–33. Signed remainder modulo 9, digit summing method.

10–20 Remainder by Multiplication and Shifting Right

The method described in this section applies, in principle, to all integer divisors greater than 2, but as a practical matter only to fairly small divisors and to divisors of the form 2^k – 1. As in the preceding section, in most cases the code resorts to a table lookup after a fairly short calculation.

Unsigned Remainder

This section uses the mathematical (not computer algebra) notation a mod b, where a and b are integers and b > 0, to denote the integer x, 0 ≤ x < b, that satisfies x ≡ a (mod b).

To compute n mod 3, observe that

Proof: Let n = 3 k + δ, where δ and k are integers and 0 ≤ δ ≤ 2. Then

Clearly, the value of the last expression is 0, 1, or 2 for δ = 0, 1, or 2 respectively. This allows changing the problem of computing the remainder modulo 3 to one of computing the remainder modulo 4, which is of course much easier on a binary computer.

Relations like (3) do not hold for all moduli, but similar relations do hold if the modulus is of the form 2^k – 1, for k an integer greater than 1. For example, it is easy to show that

For numbers not of the form 2^k – 1, there is no such simple relation, but there is a certain uniqueness property that can be used to compute the remainder for other divisors. For example, if the divisor is 10 (decimal), consider the expression

Let n = 10 k + δ where 0 ≤ δ ≤ 9. Then

For δ = 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, the last expression takes on the values 0, 1, 3, 4, 6, 8, 9, 11, 12, and 14 respectively. The latter numbers are all distinct. Therefore, if we can find a reasonably easy way to compute (4), we can translate 0 to 0, 1 to 1, 3 to 2, 4 to 3, and so on, to obtain the remainder of division by 10. This will generally require a translation table of size equal to the next power of 2 greater than the divisor, so the method is practical only for fairly small divisors (and for divisors of the form 2^k – 1, for which table lookup is not required).

The code to be shown was derived by using a little of the above theory and a lot of trial and error.

Consider the remainder of unsigned division by 3. Following (3), we wish to compute the rightmost two bits of the integer part of 4n/ 3. This can be done approximately by multiplying by ⌊2³² / 3⌋ and then dividing by 2³⁰ using a shift right instruction. When the multiplication by ⌊2³² / 3⌋ is done (using the multiply instruction that gives the low-order 32 bits of the product), high-order bits will be lost. But that doesn’t matter, and, in fact, it’s helpful, because we want the result modulo 4. Therefore, because ⌊2³² / 3⌋ = 0x55555555, a possible plan is to compute

Experiment indicates that this works for n in the range 0 to 2³⁰ + 2. It almost works, I should say; if n is nonzero and a multiple of 3, it gives the result 3. Therefore, it must be followed by a translation step that translates (0, 1, 2, 3) to (0, 1, 2, 0) respectively.

To extend the range of applicability, the multiplication must be done more accurately. Two more bits of accuracy suffice (that is, multiplying by 0x55555555.4). The following calculation, followed by the translation step, works for all n representable as an unsigned 32-bit integer:

It is, of course, possible to give a formal proof of this, but the algebra is quite lengthy and error prone.

The translation step can be done in three or four instructions on most machines, but there is a way to avoid it at a cost of two instructions. The above expression for computing r estimates low. If you estimate slightly high, the result is always 0, 1, or 2. This gives the C function shown in Figure 10–34 (eight instructions, including a multiply).

Click here to view code image

int remu3(unsigned n) {
return (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
}

FIGURE 10–34. Unsigned remainder modulo 3, multiplication method.

The multiplication can be expanded, giving the 13-instruction function shown in Figure 10–35 that uses only shift’s and add’s.

Click here to view code image

int remu3(unsigned n) {
   unsigned r;

   r = n + (n << 2);
   r = r + (r << 4);
   r = r + (r << 8);
   r = r + (r << 16);
   r = r + (n >> 1);
   r = r - (n >> 3);
   return r >> 30;
}

FIGURE 10–35. Unsigned remainder modulo 3, multiplication (expanded) method.

The remainder of unsigned division by 5 can be computed very similarly to the remainder of division by 3. Let n = 5 k + r with 0 ≤ r ≤ 4. Then (8 / 5)n mod 8 = (8 / 5)(5 k+ r) mod 8 = (8 / 5)r mod 8. For r = 0, 1, 2, 3, and 4, this takes on the values 0, 1, 3, 4, and 6 respectively. Since ⌊2³² / 5⌋ = 0x33333333, this leads to the function shown in Figure 10–36 (11 instructions, including a multiply). The last step (code on the return statement) is mapping (0, 1, 3, 4, 6, 7) to (0, 1, 2, 3, 4, 0) respectively, using an in-register method rather than an indexed load from memory. By also mapping 2 to 2 and 5 to 4, the precision required in the multiplication by 2³² / 5 is reduced to using just the term n >> 3 to approximate the missing part of the multiplier (hexadecimal 0.333...). If the “accuracy” term n >> 3 is omitted, the code still works for n ranging from 0 to 0x60000004.

Click here to view code image

int remu5(unsigned n) {
n = (0x33333333*n + (n >> 3)) >> 29;
return (0x04432210 >> (n << 2)) & 7;
}

FIGURE 10–36. Unsigned remainder modulo 5, multiplication method.

The code for computing the unsigned remainder modulo 7 is similar, but the mapping step is simpler; it is necessary only to convert 7 to 0. One way to code it is shown in Figure 10–37 (11 instructions, including a multiply). If the accuracy term n >> 4 is omitted, the code still works for n up to 0x40000006. With both accuracy terms omitted, it works for n up to 0x08000006.

Click here to view code image

int remu7(unsigned n) {
n = (0x24924924*n + (n >> 1) + (n >> 4)) >> 29;
return n & ((int)(n - 7) >> 31);
}

FIGURE 10–37. Unsigned remainder modulo 7, multiplication method.

Code for computing the unsigned remainder modulo 9 is shown in Figure 10–38. It is six instructions, including a multiply, plus an indexed load. If the accuracy term n >> 1 is omitted and the multiplier is changed to 0x1C71C71D, the function works for n up to 0x1999999E.

Click here to view code image

int remu9(unsigned n) {
   static char table[16] = {0, 1, 1, 2, 2, 3, 3, 4,
                            5, 5, 6, 6, 7, 7, 8, 8};

   n = (0x1C71C71C*n + (n >> 1)) >> 28;
   return table[n];
}

FIGURE 10–38. Unsigned remainder modulo 9, multiplication method.

Figure 10–39 shows a way to compute the unsigned remainder modulo 10. It is eight instructions, including a multiply, plus an indexed load instruction. If the accuracy term n >> 3 is omitted, the code works for n up to 0x40000004. If both accuracy terms are omitted, it works for n up to 0x0AAAAAAD.

Click here to view code image

int remu10(unsigned n) {
   static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
                            5, 6, 7, 7, 8, 8, 9, 0};

   n = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
   return table[n];
}

FIGURE 10–39. Unsigned remainder modulo 10, multiplication method.

As a final example, consider the computation of the remainder modulo 63. This function is used by the population count program at the top of page 84. Joe Keane [Keane] has come up with the rather mysterious code shown in Figure 10–40. It is 12 elementary instructions on the basic RISC.

Click here to view code image

int remu63(unsigned n) {
   unsigned t;

   t = (((n >> 12) + n) >> 10) + (n << 2);
   t = ((t >> 6) + t + 3) & 0xFF;
   return (t - (t >> 6)) >> 2;
}

FIGURE 10–40. Unsigned remainder modulo 63, Keane’s method.

The “multiply and shift right” method leads to the code shown in Figure 10–41. This is 11 instructions on the basic RISC, one being a multiply. This would not be as fast as Keane’s method, unless the machine has a very fast multiply and the load of the constant 0x04104104 can move out of a loop.

Click here to view code image

int remu63(unsigned n) {
n = (0x04104104*n + (n >> 4) + (n >> 10)) >> 26;
return n & ((n - 63) >> 6); // Change 63 to 0.
}

FIGURE 10–41. Unsigned remainder modulo 63, multiplication method.

On some machines, an improvement can result from expanding the multiplication into shifts and adds as follows (15 elementary instructions for the whole function):

Click here to view code image

r = (n << 2) + (n << 8);       // r = 0x104*n.
r = r + (r << 12);             // r = 0x104104*n.
r = r + (n << 26);             // r = 0x04104104*n.

Signed Remainder

As in the case of the digit summing method, the “multiply and shift right” method can be adapted to compute the remainder resulting from signed division. Again, there seems to be no better way than to add a few steps to correct the result of the method as applied to unsigned division. For example, the code shown in Figure 10–42 is derived from Figure 10–34 on page 270 (12 instructions, including a multiply).

Click here to view code image

int rems3(int n) {
   unsigned r;

   r = n;
   r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
   return r - (((unsigned)n >> 31) << (r & 2));
}

FIGURE 10–42. Signed remainder modulo 3, multiplication method.

Some plausible ways to compute the remainder of signed division by 5, 7, 9, and 10 are shown in Figures 10–43 to 10–46. The code for a divisor of 7 uses quite a few extra instructions (19 in all, including a multiply); it might be preferable to use a table similar to that shown for the cases in which the divisor is 5, 9, or 10. In the latter cases, the table used for unsigned division is doubled in size, with the sign bit of the divisor factored in to index the table. Entries shown as u are unused.

Click here to view code image

int rems5(int n) {
   unsigned r;
   static signed char table[16] = {0, 1, 2, 2, 3, u, 4, 0,
                                   u, 0,-4, u,-3,-2,-2,-1};
   r = n;
   r = ((0x33333333*r) + (r >> 3)) >> 29;
   return table[r + (((unsigned)n >> 31) << 3)];
}

FIGURE 10–43. Signed remainder modulo 5, multiplication method.

Click here to view code image

int rems7(int n) {
   unsigned r;

   r = n - (((unsigned)n >> 31) << 2);  // Fix for sign.
   r = ((0x24924924*r) + (r >> 1) + (r >> 4)) >> 29;
   r = r & ((int)(r - 7) >> 31);        // Change 7 to 0.
   return r - (((int)(n&-r) >> 31) & 7);// Fix n<0 case.
}

FIGURE 10–44. Signed remainder modulo 7, multiplication method.

Click here to view code image

int rems9(int n) {
   unsigned r;
   static signed char table[32] = {0, 1, 1, 2, u, 3, u, 4,
                                   5, 5, 6, 6, 7, u, 8, u,
                                  -4, u,-3, u,-2,-1,-1, 0,
                                   u,-8, u,-7,-6,-6,-5,-5};
   r = n;
   r = (0x1C71C71C*r + (r >> 1)) >> 28;
   return table[r + (((unsigned)n >> 31) << 4)];
}

FIGURE 10–45. Signed remainder modulo 9, multiplication method.

Click here to view code image

int rems10(int n) {
   unsigned r;
   static signed char table[32] = {0, 1, u, 2, 3, u, 4, 5,
                                   5, 6, u, 7, 8, u, 9, u,
                                  -6,-5, u,-4,-3,-3,-2, u,
                                  -1, 0, u,-9, u,-8,-7, u};
   r = n;
   r = (0x19999999*r + (r >> 1) + (r >> 3)) >> 28;
   return table[r + (((unsigned)n >> 31) << 4)];
}

FIGURE 10–46. Signed remainder modulo 10, multiplication method.

10–21 Converting to Exact Division

Since the remainder can be computed without computing the quotient, the possibility arises of computing the quotient q = ⌊n/d⌋ by first computing the remainder, subtracting this from the dividend n, and then dividing the difference by the divisor d. This last division is an exact division, and it can be done by multiplying by the multiplicative inverse of d (see Section 10–16, “Exact Division by Constants,” on page 240). This method would be particularly attractive if both the quotient and remainder are wanted.

Let us try this for the case of unsigned division by 3. Computing the remainder by the multiplication method (Figure 10–34 on page 270) leads to the function shown in Figure 10–47.

Click here to view code image

unsigned divu3(unsigned n) {
   unsigned r;

   r = (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
   return (n - r)*0xAAAAAAAB;
}

FIGURE 10–47. Unsigned remainder and quotient with divisor = 3, using exact division.

This is 11 instructions, including two multiplications by large numbers. (The constant 0x55555555 can be generated by shifting the constant 0xAAAAAAAB right one position.) In contrast, the more straightforward method of computing the quotient q using (for example) the code of Figure 10–8 on page 254, requires 14 instructions, including two multiplications by small numbers, or 17 elementary operations if the multiplications are expanded into shift’s and add’s. If the remainder is also wanted, and it is computed from r = n - q*3, the more straightforward method requires 16 instructions, including three multiplications by small numbers, or 20 elementary instructions if the multiplications are expanded into shift’s and add’s.

The code of Figure 10–47 is not attractive if the multiplications are expanded into shift’s and add’s; the result is 24 elementary instructions. Thus, the exact division method might be a good one on a machine that does not have multiply high but does have a fast modulo 2³² multiply and slow divide, particularly if it can easily deal with the large constants.

For signed division by 3, the exact division method might be coded as shown in Figure 10–48. It is 15 instructions, including two multiplications by large constants.

Click here to view code image

int divs3(int n) {
   unsigned r;

   r = n;
   r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
   r = r - (((unsigned)n >> 31) << (r & 2));
   return (n - r)*0xAAAAAAAB;
}

FIGURE 10–48. Signed remainder and quotient with divisor = 3, using exact division.

As a final example, Figure 10–49 shows code for computing the quotient and remainder for unsigned division by 10. It is 12 instructions, including two multiplications by large constants, plus an indexed load instruction.

Click here to view code image

unsigned divu10(unsigned n) {
   unsigned r;
   static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
                            5, 6, 7, 7, 8, 8, 9, 0};

   r = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
   r = table[r];
   return ((n - r) >> 1)*0xCCCCCCCD;
}

FIGURE 10–49. Signed remainder and quotient with divisor = 10, using exact division.

10–22 A Timing Test

Many machines have a 32×32 ⇒ 64 multiply instruction, so one would expect that to divide by a constant such as 3, the code shown on page 228 would be fastest. If that multiply instruction is not present, but the machine has a fast 32×32 ⇒ 32 multiply instruction, then the exact division method might be a good one if the machine has a slow divide and a fast multiply. To test this conjecture, an assembly language program was constructed to compare four methods of dividing by 3. The results are shown in Table 10–4. The machine used was a 667 MHz Pentium III (ca. 2000), and one would expect similar results on many other machines.

TABLE 10–4. UNSIGNED DIVIDE BY 3 ON A PENTIUM III

The first row gives the time in cycles for just two instructions: an xorl to clear the left half of the 64-bit source register, and the divl instruction, which evidently takes 40 cycles. The second row also gives the time for just two instructions: multiply and shift right 1 (mull and shrl). The third row gives the time for a sequence of 21 elementary instructions. It is the code of Figure 10–8 on page 254 using alternative 2, and with the multiplication by 3 done with a single instruction (leal). Several move instructions are necessary because the machine is (basically) two-address. The last row gives the time for a sequence of 10 instructions: two multiplications (imull) and the rest elementary. The two imull instructions use 4-byte immediate fields for the large constants. (The signed multiply instruction imull is used rather than its unsigned counterpart mull, because they give the same result in the low-order 32 bits, and imull has more addressing modes available.)

The exact division method would be even more favorable compared to the second and third methods if both the quotient and remainder were wanted, because they would require additional code for the computation r ← n – q*3. (The divl instruction produces the remainder as well as the quotient.)

10–23 A Circuit for Dividing by 3

There is a simple circuit for dividing by 3 that is about as complex as an adder. It can be constructed very similarly to the elementary way one constructs an n-bit adder from n 1-bit “full adder” circuits. However, in the divider signals flow from most significant to least significant bit.

Consider dividing by 3 the way it is taught in grade school, but in binary. To produce each bit of the quotient, you divide 3 into the next bit, but the bit is preceded by a remainder of 0, 1, or 2 from the previous stage. The logic is shown in Table 10–5. Here the remainder is represented by two bits r_i and s_i, with r_i being the most significant bit. The remainder is never 3, so the last two rows of the table represent “don’t care” cases.

A circuit for 32-bit division by 3 is shown in Figure 10–50. The quotient is the word consisting of bits y₃₁ through y₀, and the remainder is 2r₀ + s₀.

Another way to implement the divide-by-3 operation in hardware is to use the multiplier to multiply the dividend by the reciprocal of 3 (binary 0.010101...), with appropriate rounding and scaling. This is the technique shown on pages 207 and 228.

TABLE 10–5. LOGIC FOR DIVIDING BY 3

FIGURE 10–50. Logic circuit for dividing by 3.

Exercises

1. Show that for unsigned division by an even number, the shrxi instruction (or equivalent code) can be avoided by first (a) turning off the low-order bit of the dividend (and operation) [CavWer] or (b) dividing the dividend by 2 (shift right 1 instruction) and then dividing by half the divisor.

2. Code a function in Python similar to that of Figure 10–4 on page 240, but for computing the magic number for signed division. Consider only positive divisors.

3. Show how you would use Newton’s method to calculate the multiplicative inverse of an integer d modulo 81. Show the calculations for d = 146.

I think that I shall never envision

An op unlovely as division.

An op whose answer must be guessed

And then, through multiply, assessed;

An op for which we dearly pay,

In cycles wasted every day.

Division code is often hairy;

Long division’s downright scary.

The proofs can overtax your brain,

The ceiling and floor may drive you insane.

Good code to divide takes a Knuthian hero,

But even God can’t divide by zero!