On many computers, division is very time consuming and is to be avoided when possible. A value of 20 or more elementary add times is not uncommon, and the execution time is usually the same large value even when the operands are small. This chapter gives some methods for avoiding the divide instruction when the divisor is a constant.
Apparently, many people have made the mistake of assuming that a shift right signed of k positions divides a number by 2k, using the usual truncating form of division [GLS2]. It’s a little more complicated than that. The code shown below computes q = n ÷ 2k, for 1 ≤ k ≤ 31 [Hop].
shrsi t,n,k-1 Form the integer
shri t,t,32-k 2**k – 1 if n < 0, else 0.
add t,n,t Add it to n,
shrsi q,t,k and shift right (signed).
It is branch free. It simplifies to three instructions in the common case of division by 2 (k = 1). It does, however, rely on the machine’s being able to shift by a large amount in a short time. The case k = 31 does not make too much sense, because the number 231 is not representable in the machine. Nevertheless, the code does produce the correct result in that case (which is q = –1 if n = –231and q = 0 for all other n).
To divide by –2k, the above code can be followed by a negate instruction. There does not seem to be any better way to do it.
The more straightforward code for dividing by 2k is
bge n,label Branch if n >= 0.
addi n,n,2**k-1 Add 2**k - 1 to n,
label shrsi n,n,k and shift right (signed).
This would be preferable on a machine with slow shifts and fast branches.
PowerPC has an unusual device for speeding up division by a power of 2 [GGS]. The
shift right signed instructions set the machine’s carry bit if the number being shifted is negative
and one or more 1-bits are shifted out. That machine also has an instruction for adding
the carry bit to a register, denoted addze
. This allows division by any (positive) power of 2 to be done in two instructions:
shrsi q,n,k
addze q,q
A single shrsi
of k positions does a kind of signed division by 2k that coincides with both modulus and floor division. This suggests that one of these
might be preferable to truncating division for computers and HLL’s to use. That is,
modulus and floor division mesh with shrsi
better than does truncating division, permitting a compiler to translate the expression
n / 2 to an shrsi
. Furthermore, shrsi
followed by neg
(negate) does modulus division by –2k, which is a hint that maybe modulus division is best. (This is mainly an aesthetic
issue. It is of little practical significance, because division by a negative constant
is no doubt extremely rare.)
If both the quotient and remainder of n ÷ 2k are wanted, it is simplest to compute the remainder r from r = n – q * 2k This requires only two instructions after computing the quotient q:
shli r,q,k
sub r,n,r
To compute only the remainder seems to require about four or five instructions. One way to compute it is to use the four-instruction sequence above for signed division by 2k, followed by the two instructions shown immediately above to obtain the remainder. This results in two consecutive shift instructions that can be replaced by an and, giving a solution in five instructions (four if k = 1):
shrsi t,n,k-1 Form the integer
shri t,t,32-k 2**k - 1 if n < 0, else 0.
add t,n,t Add it to n,
andi t,t,-2**k clear rightmost k bits,
sub r,n,t and subtract it from n.
Another method is based on
To use this, first compute , and then
r ← ((abs(n) & (2k– 1)) ⊕ t) – t
(five instructions) or, for k = 1, since (– n) & 1 = n & 1,
(four instructions). This method is not very good for k > 1 if the machine does not have absolute value (computing the remainder would then require six instructions).
Still another method is based on
This leads to
(five instructions for k > 1, four for k = 1).
The above methods all work for 1 ≤ k ≤ 31.
Incidentally, if shift right signed is not available, the value that is 2k – 1 for n < 0 and 0 for n ≥ 0 can be constructed from
which adds only one instruction.
The basic trick is to multiply by a sort of reciprocal of the divisor d, approximately 232/d, and then to extract the leftmost 32 bits of the product. The details, however, are more complicated, particularly for certain divisors such as 7.
Let us first consider a few specific examples. These illustrate the code that will be generated by the general method. We denote registers as follows:
n
– the input integer (numerator)M
– loaded with a “magic number”t
- a temporary registerq
- will contain the quotientr
- will contain the remainder
li M,0x55555556 Load magic number, (2**32+2)/3.
mulhs q,M,n q = floor(M*n/2**32).
shri t,n,31 Add 1 to q if
add q,q,t n is negative.
muli t,q,3 Compute remainder from
sub r,n,t r = n - q*3.
Proof. The multiply high signed operation (mulhs
) cannot overflow, as the product of two 32-bit integers can always be represented
in 64 bits and mulhs
gives the high-order 32 bits of the 64-bit product. This is equivalent to dividing
the 64-bit product by 232 and taking the floor of the result, and this is true whether the product is positive
or negative. Thus, for n ≥ 0 the above code computes
Now, n < 231, because 231 – 1 is the largest representable positive number. Hence, the “error” term 2n / (3 · 232) is less than 1/3 (and is nonnegative), so by Theorem D4 (page 183) we have q = ⌊n / 3⌋, which is the desired result (Equation (1) on page 182).
For n < 0, there is an addition of 1 to the quotient. Hence the code computes
where we have used Theorem D2. Hence
For –231 ≤ n ≤ –1,
The error term is nonpositive and greater than –1 / 3, so by Theorem D4 q = ⌈n / 3⌉, which is the desired result (Equation (1) on page 182).
This establishes that the quotient is correct. That the remainder is correct follows easily from the fact that the remainder must satisfy
n = qd + r,
the multiplication by 3 cannot overflow (because –231 / 3 ≤ q ≤ (231 – 1) / 3), and the subtract cannot overflow because the result must be in the range –2 to +2.
The multiply immediate can be done with two add’s, or a shift and an add, if either gives an improvement in execution time.
On many present-day RISC computers, the quotient can be computed as shown above in nine or ten cycles, whereas the divide instruction might take 20 cycles or so.
For division by 5, we would like to use the same code as for division by 3, except with a multiplier of (232 + 4) / 5. Unfortunately, the error term is then too large; the result is off by 1 for about 1/5 of the values of n ≥ 230 in magnitude. However, we can use a multiplier of (233 + 3) / 5 and add a shift right signed instruction. The code is
li M,0x66666667 Load magic number, (2**33+3)/5.
mulhs q,M,n q = floor(M*n/2**32).
shrsi q,q,1
shri t,n,31 Add 1 to q if
add q,q,t n is negative.
muli t,q,5 Compute remainder from
sub r,n,t r = n - q*5.
Proof. The mulhs
produces the leftmost 32 bits of the 64-bit product, and then the code shifts this
right by one position, signed (or “arithmetically”). This is equivalent to dividing
the product by 233 and then taking the floor of the result. Thus, for n ≥ 0 the code computes
For 0 ≤ n < 231, the error term 3n / 5 · 233 is nonnegative and less than 1/5, so by Theorem D4, q = ⌊ n / 5⌋.
For n < 0, the above code computes
The error term is nonpositive and greater than –1/5, so q = ⌈n / 5⌉.
That the remainder is correct follows as in the case of division by 3.
The multiply immediate can be done with a shift left of two and an add.
Dividing by 7 creates a new problem. Multipliers of (232 + 3) / 7 and (233 + 6) / 7 give error terms that are too large. A multiplier of (234 + 5) / 7 would work, but it’s too large to represent in a 32-bit signed word. We
can multiply by this large number by multiplying by (234 + 5) / 7 – 232 (a negative number), and then correcting the product by inserting an add
. The code is
li M,0x92492493 Magic num, (2**34+5)/7 - 2**32.
mulhs q,M,n q = floor(M*n/2**32).
add q,q,n q = floor(M*n/2**32) + n.
shrsi q,q,2 q = floor(q/4).
shri t,n,31 Add 1 to q if
add q,q,t n is negative.
muli t,q,7 Compute remainder from
sub r,n,t r = n - q*7.
Proof. It is important to note that the instruction “add q,q,n”
above cannot overflow. This is because q and n have opposite signs, due to the multiplication by a negative number. Therefore, this
“computer arithmetic” addition is the same as real number addition. Hence for n ≥ 0 the above code computes
where we have used the corollary of Theorem D3.
For 0 ≤ n ≤ 231, the error term 5n/ 7 · 234 is nonnegative and less than 1/7, so q = ⌊ n / 7⌋.
For n < 0, the above code computes
The error term is nonpositive and greater than –1/7, so q = ⌈n / 7⌉
The multiply immediate can be done with a shift left of three and a subtract.
At this point you may wonder if other divisors present other problems. We see in this section that they do not; the three examples given illustrate the only cases that arise (for d ≥ 2).
Some of the proofs are a bit complicated, so to be cautious, the work is done in terms of a general word size W.
Given a word size W ≥ 3 and a divisor d, 2 ≤ d ≤ 2W – 1 we wish to find the least integer m and integer p such that
with 0 ≤ m< 2W and p ≥ W.
The reason we want the least integer m is that a smaller multiplier may give a smaller shift amount (possibly zero) or may
yield code similar to the “divide by 5” example, rather than the “divide by 7” example.
We must have m ≤ 2W– 1 so the code has no more instructions than that of the “divide by 7” example (that
is, we can handle a multiplier in the range 2W –1 to 2W– 1 by means of the add
that was inserted in the “divide by 7” example, but we would rather not deal with
larger multipliers). We must have p ≥ W, because the generated code extracts the left half of the product mn, which is equivalent to shifting right W positions. Thus, the total right shift is W or more positions.
There is a distinction between the multiplier m and the “magic number,” denoted M. The magic number is the value used in the multiply instruction. It is given by
Because (1b) must hold for n = –d,⌊– md/ 2p⌋ + 1 = –1, which implies
Let nc be the largest (positive) value of n such that rem(nc, d) = d – 1. nc exists because one possibility is nc = d – 1. It can be calculated from nc = ⌊ 2W – 1 / d ⌋ d – 1 = 2W – 1 – rem(2W –1, d) – 1. nc is one of the highest d admissible values of n, so
and, clearly
Because (1a) must hold for n = nc
or
Because m is to be the least integer satisfying (4), it is the next integer greater than 2p / d; that is,
Combining this with the right half of (4) and simplifying gives
Thus, the algorithm to find the magic number M and the shift amount s from d is to first compute nc, and then solve (6) for p by trying successively larger values. If p < W, set p = W (the theorem below shows that this value of p also satisfies (6)). When the smallest p ≥ W satisfying (6) is found, m is calculated from (5). This is the smallest possible value of m, because we found the smallest acceptable p, and from (4) clearly smaller values of p yield smaller values of m. Finally, s = p– W and M is simply a reinterpretation of m as a signed integer (which is how the mulhs
instruction interprets it).
Forcing p to be at least W is justified by the following:
THEOREM DC1. If(6) is true for some value of p, then it is true for all larger values of p.
Proof. Suppose (6) is true for p = p0. Multiplying (6) by 2 gives
2p0 +1 > nc(2 d – 2rem(2p0, d)).
From Theorem D5, rem(2p0 +1, d) ≥ 2rem(2p0, d) – d. Combining gives
2p00 +1 > nc((2d – (rem(2p0 + 1, d) + d)), or
2p 0 +1 > nc(d – rem(2p0 +1, d)).
Therefore, (6) is true for p = p0 + 1, and hence for all larger values.
Thus, one could solve (6) by a binary search, although a simple linear search (starting with p = W) is probably preferable, because usually d is small, and small values of d give small values of p.
We must show that (6) always has a solution and that 0 ≤ m < 2W. (It is not necessary to show that p ≥ W, because that is forced.)
We show that (6) always has a solution by getting an upper bound on p. As a matter of general interest, we also derive a lower bound under the assumption that p is not forced to be at least W. To get these bounds on p, observe that for any positive integer x, there is a power of 2 greater than x and less than or equal to 2x. Hence, from (6),
nc(d – rem(2p, d)) < 2p ≤ 2nc((d – rem(2p, d)).
Because 0 ≤ rem(2p, d) ≤ d – 1,
From (3a) and (3b), nc ≥ max(2W – 1 – d, d – 1). The lines f1(d) = 2W –1 – d and f2(d) = d – 1 cross at d = (2W –1 + 1) / 2. Hence nc ≥ (2W –1 – 1) / 2. Because nc is an integer, nc ≥ 2W –2. Because nc, d ≤ 2W – 1 – 1, (7) becomes
2w –2 + 1 ≤ 2p ≤ 2(2w – 1 – 1)2
or
The lower bound p = W – 1 can occur (e.g., for W = 32, d = 3), but in that case we set p = W.
If p is not forced to equal W, then from (4) and (7),
Using (3b) gives
Because nc ≤ 2W –1 – 1 (3a),
2 ≤ m ≤ 2w – 1.
If p is forced to equal W, then from (4),
Because 2 ≤ d ≤ 2W –1 – 1 and nc ≥ 2W –2,
Hence in either case m is within limits for the code schema illustrated by the “divide by 7” example.
We must show that if p and m are calculated from (6) and (5), then Equations (1a) and (1b) are satisfied.
Equation (5) and inequality (6) are easily seen to imply (4). (In the case that p is forced to be equal to W, (6) still holds, as shown by Theorem DC1.) In what follows, we consider separately the following five ranges of values of n:
From (4), because m is an integer,
Multiplying by n / 2p, for n ≥ 0 this becomes
For 0 ≤ n ≤ nc, 0 ≤ (2p – 1) n / (2pdnc) < 1 / d, so by Theorem D4,
Hence (1a) is satisfied in this case (0 ≤ n ≤ nc).
For n > nc, n is limited to the range
because n ≥ nc + d contradicts the choice of nc as the largest value of n such that rem(nc, d) = d – 1 (alternatively, from (3a), n ≥ nc + d implies n ≥ 2W – 1). From (4), for n ≥ 0,
By elementary algebra, this can be written
From (9), 1 ≤ n – nc ≤ d – 1, so
Because nc ≥ d – 1 (by (3b)) and (nc + 1) / nc has its maximum when nc has its minimum,
In (10), the term (nc + 1) / d is an integer. The term (n – nc)(nc + 1) / dnc is less than or equal to 1. Therefore, (10) becomes
For all n in the range (9),⌊n/ d ⌋ = (nc + 1) / d. Hence, (1a) is satisfied in this case (nc + 1 ≤ n ≤ nc + d – 1).
For n < 0, from (4) we have, because m is an integer,
Multiplying by n / 2p, for n < 0 this becomes
or
Using Theorem D2 gives
Because n + 1 ≤ 0, the right inequality can be weakened, giving
For –nc ≤ n ≤ –1,
Hence, by Theorem D4,
so that (1b) is satisfied in this case (–nc ≤ n ≤ –1).
For n < –nc, n is limited to the range
(From (3a), n < – nc – d implies that n < –2W – 1, which is impossible.) Performing elementary algebraic manipulation of the left comparand of (11) gives
For – nc – d + 1 ≤ n ≤ – nc – 1,
The ratio (nc + 1) / nc is a maximum when nc is a minimum; that is, nc = d – 1.
Therefore,
From (13), because (– nc – 1) / d is an integer and the quantity added to it is between 0 and –1,
For n in the range – nc – d + 1 ≤ n ≤ – nc – 1,
Hence, ⌊mn/ 2p⌋ + 1 = ⌈n / d⌉—that is, (1b) is satisfied.
The last case, n = – nc – d, can occur only for certain values of d. From (3a), – nc – d ≤ –2W – 1, so if n takes on this value, we must have n = – nc – d = –2W –1, and hence nc = 2W –1 – d. Therefore, rem(2W – 1, d) = rem(nc + d, d) = d – 1 (that is, d divides 2W –1 + 1).
For this case (n = – nc – d), (6) has the solution p = W – 1 (the smallest possible value of p), because for p = W – 1,
Then from (5),
so that (1b) is satisfied.
This completes the proof that if m and p are calculated from (5) and (6), then Equations (1a) and (1b) hold for all admissible values of n.
Because signed integer division satisfies n ÷ (–d) = –(n ÷ d), it is adequate to generate code for n ÷ |d| and follow it with an instruction to negate the quotient. (This does not give the correct result for d = –2W –1, but for this and other negative powers of 2, you can use the code in Section 10–1, “Signed Division by a Known Power of 2,” on page 205, followed by a negating instruction.) It will not do to negate the dividend, because of the possibility that it is the maximum negative number.
It is possible to avoid the negating instruction. The scheme is to compute
Adding 1 if n > 0 is awkward (because one cannot simply use the sign bit of n), so the code will instead add 1 if q < 0. This is equivalent, because the multiplier m is negative (as will be seen).
The code to be generated is illustrated below for the case W = 32, d = –7.
li M,0x6DB6DB6D Magic num, -(2**34+5)/7 + 2**32.
mulhs q,M,n q = floor(M*n/2**32).
sub q,q,n q = floor(M*n/2**32) - n.
shrsi q,q,2 q = floor(q/4).
shri t,q,31 Add 1 to q if
add q,q,t q is negative (n is positive).
muli t,q,-7 Compute remainder from
sub r,n,t r = n - q*(-7).
This code is the same as that for division by +7, except that it uses the negative
of the multiplier for +7, and a sub
rather than an add
after the multiply, and the shri
of 31 must use q rather than n, as discussed above. (The case of d = +7 could also use q here, but there would be less parallelism in the code.) The subtract will not overflow, because the operands have the same sign. This scheme, however,
does not always work! Although the code above for W = 32, d = –7 is correct, the analogous alteration of the “divide by 3” code to produce code
to divide by –3 does not give the correct result for W = 32, n = –231.
Let us look at the situation more closely.
Given a word size W ≥ 3 and a divisor d, –2W –1 ≤ d ≤ –2, we wish to find the least (in absolute value) integer m and integer p such that
with –2W ≤ m ≤ 0 and p ≥ W.
Proceeding similarly to the case of division by a positive divisor, let nc be the most negative value of n such that nc = kd + 1 for some integer k. nc exists, because one possibility is nc = d + 1. It can be calculated from nc = ⌊ (– 2W –1 – 1) / d ⌋ d + 1 = – 2W – 1 + rem(2W –1 + 1, d). nc is one of the least |d| admissible values of n, so
and, clearly
Because (14b) must hold for n = – d, and (14a) must hold for n = nc, we obtain, analogous to (4),
Because m is to be the greatest integer satisfying (16), it is the next integer less than 2p / d—that is,
Combining this with the left half of (16) and simplifying gives
The proof that the algorithm suggested by (17) and (18) is feasible, and that the product is correct, is similar to that for a positive divisor, and will not be repeated. A difficulty arises in trying to prove that – 2W ≤ m ≤ 0. To prove this, consider separately the cases in which d is the negative of a power of 2, or some other number. For d = –2k, it is easy to show that nc = –2w – 1 + 1, p = W + k – 1, and m = – 2w –1 –1 (which is within range). For d not of the form –2k, it is straightforward to alter the earlier proof.
By m(d) we mean the multiplier corresponding to a divisor d. If m(–d) = –m(d), code for division by a negative divisor can be generated by calculating the multiplier for |d|, negating it, and then generating code similar to that of the “divide by –7” case illustrated above.
By comparing (18) with (6) and (17) with (5), it can be seen that if the value of nc for –d is the negative of that for d, then m(–d) = –m(d). Hence, m(–d) ≠ m(d) can occur only when the value of nc calculated for the negative divisor is the maximum negative number, –2W – 1. Such divisors are the negatives of the factors of 2W –1 + 1. These numbers are fairly rare, as illustrated by the factorings below (obtained from Scratchpad).
215 + 1 = 3 · 11 · 331
231 + 1 = 3 · 715,827,883
263 + 1 = 33 · 19 · 43 · 5419 · 77,158,673,929
For all these factors, m(–d) ≠ m(d). Proof sketch: For d > 0 we have nc = 2w – 1 – d. Because rem(2w – 1, d) = d – 1, (6) is satisfied by p = W – 1 and hence also by p = W. For d < 0, however, we have nc = –2W – 1 and rem(2w – 1, d) = |d| –1. Hence, (18) is not satisfied for p = W – 1 or for p = W, so p > W.
For a compiler to change division by a constant into a multiplication, it must compute the magic number M and the shift amount s, given a divisor d. The straightforward computation is to evaluate (6) or (18) for p = W, W +1, ... until it is satisfied. Then, m is calculated from (5) or (17). M is simply a reinterpretation of m as a signed integer, and s = p − W.
The scheme described below handles positive and negative d with only a little extra code, and it avoids doubleword arithmetic.
Hence, |nc| can be computed from
The remainder must be evaluated using unsigned division, because of the magnitude of the arguments. We have written rem(t, |d|) rather than the equivalent rem(t, d), to emphasize that the program must deal with two positive (and unsigned) arguments.
From (6) and (18), p can be calculated from
and then |m| can be calculated from (c.f. (5) and (17)):
Direct evaluation of rem(2p, |d|) in (19) requires “long division” (dividing a 2W-bit dividend by a W-bit divisor, giving a W-bit quotient and remainder), and, in fact, it must be unsigned long division. There is a way to solve (19), and to do all the calculations, that avoids long division and can easily be implemented in a conventional HLL using only W-bit arithmetic. We do, however, need unsigned division and unsigned comparisons.
We can calculate rem(2p, |d|) incrementally, by initializing two variables q and r to the quotient and remainder of 2p divided by |d| with p = W – 1, and then updating q and r as p increases.
As the search progresses—that is, when p is incremented by 1—q and r are updated from (see Theorem D5(a))
q = 2*q;
r = 2*r;
if (r>= abs(d)) {
q = q + 1;
r = r - abs(d);}
The left half of inequality (4) and the right half of (16), together with the bounds proved for m, imply that q = ⌊ 2p /|d|⌋ < 2W, so q is representable as a W-bit unsigned integer. Also, 0 ≤ r < |d|, so r is representable as a W-bit signed or unsigned integer. (Caution: The intermediate result 2r can exceed 2W –1 – 1, so r should be unsigned and the comparison above should also be unsigned.)
Next, calculate δ = |d| – r. Both terms of the subtraction are representable as W-bit unsigned integers, and the result is also (1 ≤ δ ≤ |d|), so there is no difficulty here.
To avoid the long multiplication of (19), rewrite it as
The quantity 2p / |nc| is representable as a W-bit unsigned integer (similar to (7), from (19) it can be shown that 2p ≤ 2|nc| · |d| and, for d = –2W – 1, nc = –2w – 1 + 1 and p = 2W – 2, so that 2p / |nc| = 22W – 2 / (2w – 1 − 1) < 2W for W ≥ 3). Also, it is easily calculated incrementally (as p increases) in the same manner as for rem(2p, |d|). The comparison should be unsigned, for the case 2p / |nc| ≥ 2W – 1 (which can occur, for large d).
To compute m, we need not evaluate (20) directly (which would require long division). Observe that
The loop closure test 2p / |nc| > δ is awkward to evaluate. The quantity 2p/ |nc| is available only in the form of a quotient q1 and a remainder r1. 2p / |nc| may or may not be an integer (it is an integer only for d = 2W – 2 + 1 and a few negative values of d). The test 2p / |nc| ≤ δ can be coded as
q1 < δ | (q1 = δ & r1 = 0).
The complete procedure for computing M
and s
from d
is shown in Figure 10–1, coded in C, for W = 32. There are a few places where overflow can occur, but the correct result is
obtained if overflow is ignored.
To use the results of this program, the compiler should generate the li
and mulhs
instructions, generate the add
if d
> 0 and M
< 0, or the sub
if d
< 0 and M
> 0, and generate the shrsi
if s
> 0. Then, the shri
and final add
must be generated.
For W = 32, handling a negative divisor can be avoided by simply returning a precomputed result for d = 3 and d = 715,827,883, and using m(– d) = – m(d) for other negative divisors. However, that program would not be significantly shorter, if at all, than the one given in Figure 10–1.
struct ms {int M; // Magic number
int s;}; // and shift amount.
struct ms magic(int d) { // Must have 2 <= d <= 2**31-1
// or -2**31 <= d <= -2.
int p;
unsigned ad, anc, delta, q1, r1, q2, r2, t;
const unsigned two31 = 0x80000000; // 2**31.
struct ms mag;
ad = abs(d);
t = two31 + ((unsigned)d >> 31);
anc = t - 1 - t%ad; // Absolute value of nc.
p = 31; // Init. p.
q1 = two31/anc; // Init. q1 = 2**p/|nc|.
r1 = two31 - q1*anc; // Init. r1 = rem(2**p, |nc|).
q2 = two31/ad; // Init. q2 = 2**p/|d|.
r2 = two31 - q2*ad; // Init. r2 = rem(2**p, |d|).
do {
p = p + 1;
q1 = 2*q1; // Update q1 = 2**p/|nc|.
r1 = 2*r1; // Update r1 = rem(2**p, |nc|).
if (r1 >= anc) { // (Must be an unsigned
q1 = q1 + 1; // comparison here.)
r1 = r1 - anc;}
q2 = 2*q2; // Update q2 = 2**p/|d|.
r2 = 2*r2; // Update r2 = rem(2**p, |d|).
if (r2 >= ad) { // (Must be an unsigned
q2 = q2 + 1; // comparison here.)
r2 = r2 - ad;}
delta = ad - r2;
} while (q1 < delta || (q1 == delta && r1 == 0));
mag.M = q2 + 1;
if (d < 0) mag.M = -mag.M; // Magic number and
mag.s = p - 32; // shift amount to return.
return mag;
}
FIGURE 10–1. Computing the magic number for signed division.
THEOREM DC2. The least multiplier m is odd if p is not forced to equal W.
Proof. Assume that Equations (1a) and (1b) are satisfied with least (not forced) integer p, and m even. Then clearly m could be divided by 2 and p could be decreased by 1, and (1a) and (1b) would still be satisfied. This contradicts the assumption that p is minimal.
The magic number for a given divisor is sometimes unique (e.g., for W = 32, d = 7), but often it is not. In fact, experimentation suggests that it is usually not unique. For example, for W = 32, d = 6, there are four magic numbers:
Nevertheless, there is the following uniqueness property:
THEOREM DC3. For a given divisor d, there is only one multiplier m having the minimal value of p, if p is not forced to equal W.
Proof. First consider the case d > 0. The difference between the upper and lower limits of inequality (4) is 2p/ dnc. We have already proved (7) that if p is minimal, then 2p/dnc ≤ 2. Therefore, there can be at most two values of m satisfying (4). Let m be the smaller of these values, given by (5); then m + 1 is the other.
Let p0 be the least value of p for which m + 1 satisfies the right half of (4) (p0 is not forced to equal W). Then
This simplifies to
2p0 > nc (2 d – rem(2p0, d)).
Dividing by 2 gives
Because rem(2p0, d) ≤ 2rem(2p0 – 1, d) (by Theorem D5 on page 184),
2p0 – 1 > nc (d – rem(2p0 –1, d)),
contradicting the assumption that p0 is minimal.
The proof for d < 0 is similar and will not be given.
The program for d = 3, W = 32 is particularly short, because there is no add
or shrsi
after the mulhs
. What other divisors have this short program?
We consider only positive divisors. We wish to find integers m and p that satisfy Equations (1a) and (1b), and for which p = W and 0 ≤ m < 2W –1. Because any integers m and p that satisfy equations (1a) and (1b) must also satisfy (4), it suffices to find those divisors d for which (4) has a solution with p = W and 0 ≤ m < 2W –1. All solutions of (4) with p = W are given by
Combining this with the right half of (4) and simplifying gives
The weakest restriction on rem(2W, d) is with k = 1 and nc at its minimal value of 2W –2. Hence, we must have
rem(2W, d) > d – 4;
that is, d divides 2W+ 1, 2W+ 2, or 2W+ 3.
Now let us see which of these factors actually have optimal programs.
If d divides 2W+ 1, then rem(2W, d) = d – 1. Then a solution of (6) is p = W, because the inequality becomes
2W > nc (d – (d – 1)) = nc,
which is obviously true, because nc < 2W –1. Then in the calculation of m we have
which is less than 2W –1 for d ≥ 3 (d ≠ 2 because d divides 2W + 1). Hence, all the factors of 2W+ 1 have optimal programs.
Similarly, if d divides 2W + 2, then rem(2W, d) = d – 2. Again, a solution of (6) is p = W, because the inequality becomes
2W > nc (d – (d – 2)) = 2nc,
which is obviously true. Then in the calculation of m we have
which exceeds 2W –1 – 1 for d = 2, but which is less than or equal to 2W – 1 – 1 for W ≥ 3, d ≥ 3 (the case W = 3 and d = 3 does not occur, because 3 is not a factor of 23 + 2 = 10). Hence all factors of 2W + 2, except for 2 and the cofactor of 2, have optimal programs. (The cofactor of 2 is (2W+ 2) / 2, which is not representable as a W-bit signed integer).
If d divides 2W + 3, the following argument shows that d does not have an optimal program. Because rem(2W, d) = d – 3, inequality (21) implies that we must have
for some k = 1, 2, 3, .... The weakest restriction is with k = 1, so we must have nc < 2W / 3.
From (3a), nc ≥ 2W –1 – d, or d ≥ 2W – 1 – nc. Hence, it is necessary that
Also, because 2, 3, and 4 do not divide 2W + 3, the smallest possible factor of 2W + 3 is 5. Therefore, the largest possible factor is (2W + 3) / 5. Thus, if d divides 2W + 3 and d has an optimal program, it is necessary that
Taking reciprocals of this with respect to 2W + 3 shows that the cofactor of d, (2W + 3) / d, has the limits
For W ≥ 5, this implies that the only possible cofactors are 5 and 6. For W < 5, it is easily verified that there are no factors of 2W + 3. Because 6 cannot be a factor of 2W + 3, the only possibility is 5. Therefore, the only possible factor of 2W + 3 that might have an optimal program is (2W + 3) / 5.
For d = (2W + 3) / 5,
For W ≥ 4,
This exceeds (2W / 3), so d = (2W + 3) / 5 does not have an optimal program. Because for W < 4 there are no factors of 2W + 3, we conclude that no factors of 2W + 3 have optimal programs.
In summary, all the factors of 2W + 1 and of 2W + 2, except for 2 and (2W + 2) / 2, have optimal programs, and no other numbers do. Furthermore, the above proof shows that algorithm magic (Figure 10–1 on page 223) always produces the optimal program when it exists.
Let us consider the specific cases W = 16, 32, and 64. The relevant factorizations are shown below.
The result for W = 16 is that there are 20 divisors that have optimal programs. The ones less than 100 are 3, 6, 9, 11, 18, 22, 33, 66, and 99.
For W = 32, there are six such divisors: 3, 6, 641, 6,700,417, 715,827,883, and 1,431,655,766.
For W = 64, there are 126 such divisors. The ones less than 100 are 3, 6, 9, 18, 19, 27, 38, 43, 54, 57, and 86.
Unsigned division by a power of 2 is, of course, implemented by a single shift right logical instruction, and remainder by and immediate.
It might seem that handling other divisors will be simple: Just use the results for signed division with d > 0, omitting the two instructions that add 1 if the quotient is negative. We will see, however, that some of the details are actually more complicated in the case of unsigned division.
For a non-power of 2, let us first consider unsigned division by 3 on a 32-bit machine. Because the dividend n can now be as large as 232 – 1, the multiplier (232 + 2) / 3 is inadequate, because the error term 2 n / 3 · 232 (see “divide by 3” example above) can exceed 1/3. However, the multiplier (233 + 1) / 3 is adequate. The code is
li M,0xAAAAAAAB Load magic number, (2**33+1)/3.
mulhu q,M,n q = floor(M*n/2**32).
shri q,q,1
muli t,q,3 Compute remainder from
sub r,n,t r = n - q*3.
An instruction that gives the high-order 32 bits of a 64-bit unsigned product is required,
which we show above as mulhu
.
To see that the code is correct, observe that it computes
For 0 ≤ n < 232, 0 ≤ n / (3 · 233) < 1 / 3, so by Theorem D4, q = ⌊ n/ 3⌋.
In computing the remainder, the multiply immediate can overflow if we regard the operands as signed integers, but it does not overflow if we regard them and the result as unsigned. Also, the subtract cannot overflow, because the result is in the range 0 to 2, so the remainder is correct.
For unsigned division by 7 on a 32-bit machine, the multipliers (232 + 3) / 7, (233 + 6) / 7, and (234 + 5) / 7 are all inadequate, because they give too large an error term. The multiplier
(235 + 3) / 7 is acceptable, but it’s too large to represent in a 32-bit unsigned word.
We can multiply by this large number by multiplying by (235 + 3) / 7 – 232 and then correcting the product by inserting an add
. The code is
li M,0x24924925 Magic num, (2**35+3)/7 - 2**32.
mulhu q,M,n q = floor(M*n/2**32).
add q,q,n Can overflow (sets carry).
shrxi q,q,3 Shift right with carry bit.
muli t,q,7 Compute remainder from
sub r,n,t r = n - q*7.
Here we have a problem: The add
can overflow. To allow for this, we have invented the new instruction shift right extended immediate (shrxi
), which treats the carry from the add
and the 32 bits of register q as a single 33-bit quantity, and shifts it right with 0-fill. On the Motorola 68000
family, this can be done with two instructions: rotate with extend right one position, followed by a logical right shift of three (roxr
actually uses the X bit, but the add
sets the X bit the same as the carry bit). On most machines, it will take more. For
example, on PowerPC it takes three instructions: clear rightmost three bits of q, add carry to q, and rotate right three positions.
With shrxi
implemented somehow, the code above computes
For 0 ≤ n < 232, 0 ≤ 3 n /(7 · 235) < 1/7, so by Theorem D4, q = ⌊ n / 7 ⌋.
Granlund and Montgomery [GM] have a clever scheme for avoiding the shrxi
instruction. It requires the same number of instructions as the above three-instruction
sequence for shrxi,
but it employs only elementary instructions that almost any machine would have, and
it does not cause overflow at all. It uses the identity
Applying this to our problem, with q = ⌊ Mn / 232 ⌋ where 0 ≤ M <232, the subtraction will not overflow, because
so that, clearly, 0 ≤ n – q < 232. Also, the addition will not overflow, because
and 0 ≤ n,q < 232.
Using this idea gives the following code for unsigned division by 7:
li M,0x24924925 Magic num, (2**35+3)/7 - 2**32.
mulhu q,M,n q = floor(M*n/2**32).
sub t,n,q t = n - q.
shri t,t,1 t = (n - q)/2.
add t,t,q t = (n - q)/2 + q = (n + q)/2.
shri q,t,2 q = (n+Mn/2**32)/8 = floor(n/7).
muli t,q,7 Compute remainder from
sub r,n,t r = n - q*7.
For this to work, the shift amount for the hypothetical shrxi
instruction must be greater than 0. It can be shown that if d > 1 and the multiplier m ≥ 232 (so that the shrxi
instruction is needed), then the shift amount is greater than 0.
Given a word size W ≥ 1 and a divisor d, 1 ≤ d < 2W, we wish to find the least integer m and integer p such that
with 0 ≤ m < 2W+1 and p ≥ W.
In the unsigned case, the magic number M is given by
Because (22) must hold for n = d, ⌊md/ 2p⌋ = 1, or
As in the signed case, let nc be the largest value of n such that rem(nc, d) = d – 1. It can be calculated from nc = ⌊ 2W / d ⌋ d – 1 = 2W – rem(2W, d) – 1. Then
and
These imply that nc ≥ 2W – 1.
Because (22) must hold for n = nc
or
Combining this with (23) gives
Because m is to be the least integer satisfying (25), it is the next integer greater than or equal to 2p / d—that is,
Combining this with the right half of (25) and simplifying gives
Thus, the algorithm is to find by trial and error the least p ≥ W satisfying (27). Then, m is calculated from (26). This is the smallest possible value of m satisfying (22) with p ≥ W. As in the signed case, if (27) is true for some value of p, then it is true for all larger values of p. The proof is essentially the same as that of Theorem DC1, except Theorem D5(b) is used instead of Theorem D5(a).
We must show that (27) always has a solution and that 0 ≤ m < 2W +1.
Because for any nonnegative integer x there is a power of 2 greater than x and less than or equal to 2 x + 1, from (27),
nc(d – 1 – rem(2p – 1, d)) < 2p ≤ 2nc(d – 1 – rem(2p – 1, d)) + 1.
Because 0 ≤ rem(2p– 1, d) ≤ d – 1,
Because nc, d ≤ 2W – 1, this becomes
1 ≤ 2p ≤ 2(2W – 1)(2W – 2) + 1,
or
Thus, (27) always has a solution.
If p is not forced to equal W, then from (25) and (28),
If p is forced to equal W, then from (25),
Because 1 ≤ d ≤ 2W– 1 and nc ≥ 2W – 1,
In either case m is within limits for the code schema illustrated by the “unsigned divide by 7” example.
We must show that if p and m are calculated from (27) and (26), then (22) is satisfied.
Equation (26) and inequality (27) are easily seen to imply (25). Inequality (25) is nearly the same as (4), and the remainder of the proof is nearly identical to that for signed division with n ≥ 0.
There is a difficulty in implementing an algorithm based on direct evaluation of the expressions used in this proof. Although p ≤ 2 W, which is proved above, the case p = 2 W can occur (e.g., for d = 2W – 2 with W ≥ 4). When p = 2 W, it is difficult to calculate m, because the dividend in (26) does not fit in a 2W-bit word.
However, it can be implemented by the “incremental division and remainder” technique
of algorithm magic. The algorithm is given in Figure 10–2 for W = 32. It passes back an indicator a
, which tells whether or not to generate an add
instruction. (In the case of signed division, the caller recognizes this by M
and d
having opposite signs.)
Some key points in understanding this algorithm are as follows:
• Unsigned overflow can occur at several places and should be ignored.
• nc = 2w – rem (2w,d) – 1 = (2W – 1) – rem(2W – d, d).
• The quotient and remainder of dividing 2p by nc cannot be updated in the same way as is done in algorithm magic, because here the quantity 2*r1
can overflow. Hence, the algorithm has the test “if (r
1 > = nc – r l)
,” whereas “if (2*rl >= nc
)” would be more natural. A similar remark applies to computing the quotient and remainder
of 2P –1 divided by d.
• 0 ≤ δ ≤ d – 1, so δ is representable as a 32-bit unsigned integer.
struct mu {unsigned M; // Magic number,
int a; // "add" indicator,
int s;}; // and shift amount.
struct mu magicu(unsigned d) {
// Must have 1 <= d <= 2**32-1.
int p;
unsigned nc, delta, q1, r1, q2, r2;
struct mu magu;
magu.a = 0; // Initialize "add" indicator.
nc = -1 - (-d)%d; // Unsigned arithmetic here.
p = 31; // Init. p.
q1 = 0x80000000/nc; // Init. q1 = 2**p/nc.
r1 = 0x80000000 - q1*nc;// Init. r1 = rem(2**p, nc).
q2 = 0x7FFFFFFF/d; // Init. q2 = (2**p - 1)/d.
r2 = 0x7FFFFFFF - q2*d; // Init. r2 = rem(2**p - 1, d).
do {
p = p + 1;
if (r1 >= nc - r1) {
q1 = 2*q1 + 1; // Update q1.
r1 = 2*r1 - nc;} // Update r1.
else {
q1 = 2*q1;
r1 = 2*r1;}
if (r2 + 1 >= d - r2) {
if (q2 >= 0x7FFFFFFF) magu.a = 1;
q2 = 2*q2 + 1; // Update q2.
r2 = 2*r2 + 1 - d;} // Update r2.
else {
if (q2 >= 0x80000000) magu.a = 1;
q2 = 2*q2;
r2 = 2*r2 + 1;}
delta = d - 1 - r2;
} while (p < 64 &&
(q1 < delta || (q1 == delta && r1 == 0)));
magu.M = q2 + 1; // Magic number
magu.s = p - 32; // and shift amount to return
return magu; // (magu.a was set above).
}
FIGURE 10–2. Computing the magic number for unsigned division.
• m = (2p +d – 1 – rem(2p – 1, d)) / d = ⌊ (2p – 1) / d ⌋ +1 = q2 + 1.
• The subtraction of 2W when the magic number M
exceeds 2W – 1 is not explicit in the program; it occurs if the computation of q2
overflows.
• The “add” indicator, magu.a
, cannot be set by a straightforward comparison of M
to 232, or of q2
to 232 – 1, because of overflow. Instead, the program tests q2
before overflow can occur. If q2
ever gets as large as 232 – 1, so that M
will be greater than or equal to 232, then magu.a
is set equal to 1. If q2
stays below 232 – 1, then magu.a
is left at its initial value of 0.
• Inequality (27) is equivalent to 2p/ nc > δ.
• The loop test needs the condition p < 64,
because without it, overflow of q1
would cause the program to loop too many times, giving incorrect results.
To use the results of this program, the compiler should generate the li
and mulhu
instructions and, if the “add” indicator a
= 0, generate the shri
of s
(if s
> 0), as illustrated by the example of “Unsigned Division by 3,” on page 227. If a
= 1 and the machine has the shrxi
instruction, the compiler should generate the add
and shrxi
of s
as illustrated by the example of “Unsigned Division by 7,” on page 228. If a
= 1 and the machine does not have the shrxi
instruction, use the example on page 229: generate the sub
, the shri
of 1, the add
, and finally the shri
of s
– 1 (if s
– 1 > 0; s
will not be 0 at this point except in the trivial case of division by 1, which we
assume the compiler deletes).
THEOREM DC2U. The least multiplier m is odd if p is not forced to equal W.
THEOREM DC3U. For a given divisor d, there is only one multiplier m having the minimal value of p, if p is not forced to equal W.
The proofs of these theorems follow very closely the corresponding proofs for signed division.
For unsigned division, to find the divisors (if any) with optimal programs of two
instructions to obtain the quotient (li
, mulhu
), we can do an analysis similar to that of the signed case (see “The Divisors with the Best Programs” on page 225). The result is that such divisors are the factors of 2W or 2W + 1, except for d = 1. For the common word sizes, this leaves very few nontrivial divisors that have
optimal programs for unsigned division. For W = 16, there are none. For W = 32, there are only two: 641 and 6,700,417. For W = 64, again there are only two: 274,177 and 67,280,421,310,721.
The case d = 2k, k = 1, 2, ..., deserves special mention. In this case, algorithm magicu produces p = W (forced), m = 232 – k. This is the minimal value of m, but it is not the minimal value of M. Better code results if p = W + k is used, if sufficient simplifications are done. Then, m = 2W, M = 0, a = 1, and s = k. The generated code involves a multiplication by 0 and can be simplified to a single shift right k instruction. As a practical matter, divisors that are a power of 2 would probably be special-cased without using magicu. (This phenomenon does not occur for signed division, because for signed division m cannot be a power of 2. Proof: For d > 0, inequality (4) combined with (3b) implies that d – 1 < 2p / m < d. Therefore, 2p / m cannot be an integer. For d < 0, the result follows similarly from (16) combined with (15b).)
For unsigned division, the code for the case m ≥ 2W is considerably worse than the code for the case m < 2W if the machine does not have shrxi
. It is of interest to have some idea of how often the large multipliers arise. For
W = 32, among the integers less than or equal to 100, there are 31 “bad” divisors:
1, 7, 14, 19, 21, 27, 28, 31, 35, 37, 38, 39, 42, 45, 53, 54, 55, 56, 57, 62, 63,
70, 73, 74, 76, 78, 84, 90, 91, 95, and 97.
If your machine does not have mulhu
, but it does have mulhs
(or signed long multiplication), the trick given in “High-Order Product Signed from/to
Unsigned,” on page 174, might make our method of doing unsigned division by a constant still useful.
That section gives a seven-instruction sequence for getting mulhu
from mulhs
. However, for this application it simplifies, because the magic number M is known. Thus, the compiler can test the most significant bit of the magic number,
and generate code such as the following for the operation “mulhu q,M,n
.” Here t
denotes a temporary register.
M31 = 0 M31 = 1
mulhs q,M,n mulhs q,M,n
shrsi t,n,31 shrsi t,n,31
and t,t,M and t,t,M
add q,q,t add t,t,n
add q,q,t
Accounting for the other instructions used with mulhu
, this uses a total of six to eight instructions to obtain the quotient of unsigned
division by a constant on a machine that does not have unsigned multiply.
This trick can be inverted, to get mulhs
in terms of mulhu
. The code is the same as that above, except the mulhs
is changed to mulhu
and the final add
in each column is changed to sub
.
Dropping the requirement that the magic number be minimal yields a simpler algorithm. In place of (27) we can use
and then use (26) to compute m, as before.
It should be clear that this algorithm is formally correct (that is, that the value of m computed does satisfy Equation (22)), because its only difference from the previous algorithm is that it computes a value of p that, for some values of d, is unnecessarily large. It can be proved that the value of m computed from (30) and (26) is less than 2W +1. We omit the proof and simply give the algorithm (Figure 10–3).
struct mu {unsigned M; // Magic number,
int a; // "add" indicator,
int s;}; // and shift amount.
struct mu magicu2(unsigned d) {
// Must have 1 <= d <= 2**32-1.
int p;
unsigned p32, q, r, delta;
struct mu magu;
magu.a = 0; // Initialize "add" indicator.
p = 31; // Initialize p.
q = 0x7FFFFFFF/d; // Initialize q = (2**p - 1)/d.
r = 0x7FFFFFFF - q*d; // Init. r = rem(2**p - 1, d).
do {
p = p + 1;
if (p == 32) p32 = 1; // Set p32 = 2**(p-32).
else p32 = 2*p32;
if (r + 1 >= d - r) {
if (q >= 0x7FFFFFFF) magu.a = 1;
q = 2*q + 1; // Update q.
r = 2*r + 1 - d; // Update r.
}
else {
if (q >= 0x80000000) magu.a = 1;
q = 2*q;
r = 2*r + 1;
}
delta = d - 1 - r;
} while (p < 64 && p32 < delta);
magu.M = q + 1; // Magic number and
magu.s = p - 32; // shift amount to return
return magu; // (magu.a was set above).
}
FIGURE 10–3. Simplified algorithm for computing the magic number, unsigned division.
Alverson [Alv] gives a much simpler algorithm, discussed in the next section, but it gives somewhat large values for m. The point of algorithm magicu2 is that it nearly always gives the minimal value for m when d ≤ 2W –1. For W = 32, the smallest divisor for which magicu2 does not give the minimal multiplier is d = 102,807, for which magicu calculates m = 2,737,896,999 and magicu2 calculates m = 5,475,793,997.
There is an analog of magicu2 for signed division by positive divisors, but it does not work out very well for signed division by arbitrary divisors.
It might seem that turning modulus or floor division by a constant into multiplication would be simpler, in that the “add 1 if the dividend is negative” step could be omitted. This is not the case. The methods given above do not apply in any obvious way to modulus and floor division. Perhaps something could be worked out; it might involve altering the multiplier m slightly, depending upon the sign of the dividend.
Rather than coding algorithm magic, we can provide a table that gives the magic numbers and shift amounts for a few small divisors. Divisors equal to the tabulated ones multiplied by a power of 2 are easily handled as follows:
1. Count the number of trailing 0’s in d, and let this be denoted by k.
2. Use as the lookup argument d / 2k (shift right k).
3. Use the magic number found in the table.
4. Use the shift amount found in the table, increased by k.
Thus, if the table contains the divisors 3, 5, 25, and so on, divisors of 6, 10, 100, and so forth can be handled.
This procedure usually gives the smallest magic number, but not always. The smallest positive divisor for which it fails in this respect for W = 32 is d = 334,972, for which it computes m = 3,361,176,179 and s = 18. However, the minimal magic number for d = 334,972 is m = 840,294,045, with s = 16. The procedure also fails to give the minimal magic number for d = –6. In both these cases, output code quality is affected.
Alverson [Alv] is the first known to the author to state that the method described here works with complete accuracy for all divisors. Using our notation, his method for unsigned integer division by d is to set the shift amount p = W + ⌈ log2 d ⌉, and the multiplier m = ⌈ 2p / d ⌉ and then do the division by ⌋ n ÷ d = ⌊ mn / 2p ⌋ (that is, multiply and shift right). He proves that the multiplier m is less than, 2W+1 and that the method gets the exact quotient for all n expressible in W bits.
Alverson’s method is a simpler variation of ours in that it doesn’t require trial
and error to determine p, and is therefore more suitable for building in hardware, which is his primary interest.
His multiplier m is always greater than or equal to 2W, and hence for the software application always gives the code illustrated by the
“unsigned divide by 7” example (that is, always has the add
and shrxi
, or the alternative four instructions). Because most small divisors can be handled
with a multiplier less than 2W, it seems worthwhile to look for these cases.
For signed division, Alverson suggests finding the multiplier for |d| and a word length of W – 1 (then 2W –1 ≤ m < 2W), multiplying the dividend by it, and negating the result if the operands have opposite
signs. (The multiplier must be such that it gives the correct result when the dividend
is 2W –1, the absolute value of the maximum negative number.) It seems possible that this
suggestion might give better code than what has been given here in the case that the
multiplier m ≥ 2W. Applying it to signed division by 7 gives the following code, where we have used
the relation –x = +1 to avoid a branch:
abs an,n
li M,0x92492493 Magic number, (2**34+5)/7.
mulhu q,M,an q = floor(M*an/2**32).
shri q,q,2
shrsi t,n,31 These three instructions
xor q,q,t negate q if n is
sub q,q,t negative.
This is not quite as good as the code we gave for signed division by 7 (six versus
seven instructions), but it would be useful on a machine that has abs
and mulhu
, but not mulhs
.
The next section gives some representative magic numbers.
TABLE10–1. SOME MAGIC NUMBERS FOR W = 32
TABLE 10–2. SOME MAGIC NUMBERS FOR W = 64
Computing a magic number is greatly simplified if one is not limited to doing the calculations in the same word size as that of the environment in which the magic number will be used. For the unsigned case, for example, in Python it is straightforward to compute nc and then evaluate Equations (27) and (26), as described in Section 10–9. Figure 10–4 shows such a function.
def magicgu(nmax, d):
nc = (nmax//d)*d - 1
nbits = int(log(nmax, 2)) + 1
for p in range(0, 2*nbits + 1):
if 2**p > nc*(d - 1 - (2**p - 1)%d):
m = (2**p + d - 1 - (2**p - 1)%d)//d
return (m, p)
print "Can't find p, something is wrong."
sys.exit(1)
FIGURE 10–4. Python code for computing the magic number for unsigned division.
The function is given the maximum value of the dividend nmax
and the divisor d
. It returns a pair of integers: the magic number m
and a shift amount p
. To divide a dividend x
by d
, one multiplies x
by m
and then shifts the (full length) product right p
bits.
This program is more general than the others in this chapter in two ways: (1) one
specifies the maximum value of the dividend (nmax
), rather than the number of bits required for the dividend, and (2) the program can
be used for arbitrarily large dividends and divisors (“bignums”). The advantage of
specifying the maximum value of the dividend is that one sometimes gets a smaller
magic number than would be obtained if the next power of two less 1 were used for
the maximum value. For example, suppose the maximum value of the dividend is 90, and
the divisor is 7. Then function magicgu
returns (37, 8), meaning that the magic number is 37 (a 6-bit number) and the shift
amount is 8. But if we asked for a magic number that can handle divisors up to 127,
then the result is (147, 10), and 147 is an 8-bit number.
By “exact division,” we mean division in which it is known beforehand, somehow, that the remainder is 0. Although this situation is not common, it does arise, for example, when subtracting two pointers in the C language. In C, the result of p – q, where p and q are pointers, is well defined and portable only if p and q point to objects in the same array [H&S, sec. 7.6.2]. If the array element size is s, the object code for the difference p – q computes (p – q) / s.
The material in this section was motivated by [GM, sec. 9].
The method to be given applies to both signed and unsigned exact division, and is based on the following theorem.
THEOREM MI. If a and m are relatively prime integers, then there exists an integer ā, 1 ≤ ā < m, such that
aā ≡ 1 (mod m).
That is, ā is a multiplicative inverse of a, modulo m. There are several ways to prove this theorem; three proofs are given in [NZM, p. 52]. The proof below requires only a very basic familiarity with congruences.
Proof. We will prove something a little more general than the theorem. If a and m are relatively prime (therefore nonzero), then as x ranges over all m distinct values modulo m, ax takes on all m distinct values modulo m. For example, if a = 3 and m = 8, then as x ranges from 0 to 7, ax = 0, 3, 6, 9, 12, 15, 18, 21 or, reduced modulo 8, ax = 0, 3, 6, 1, 4, 7, 2, 5. Observe that all values from 0 to 7 are present in the last sequence.
To see this in general, assume that it is not true. Then there exist distinct integers
that map to the same value when multiplied by a; that is, there exist x and y, with x y(mod m), such that
ax ≡ ay (mod m).
Then there exists an integer k such that
ax – ay = km, or
a (x – y) = km.
Because a has no factor in common with m, it must be that x – y is a multiple of m; that is,
x ≡ y (mod m).
This contradicts the hypothesis.
Now, because ax takes on all m distinct values modulo m, as x ranges over the m values, it must take on the value 1 for some x.
The proof shows that there is only one value (modulo m) of x such that ax ≡ 1 (mod m)—that is, the multiplicative inverse is unique, apart from additive multiples of m. It also shows that there is a unique (modulo m) integer x such that ax ≡ b (mod m), where b is any integer.
As an example, consider the case m = 16. Then , because 3 · 11 = 33 ≡ 1 (mod 16). We could just as well take
, because 3 · (–5) = –15 ≡ 1 (mod 16). Similarly
, because (–3) · 5 = –15 ≡ 1 (mod 16).
These observations are important because they show that the concepts apply to both
signed and unsigned numbers. If we are working in the domain of unsigned integers
on a 4-bit machine, we take . In the domain of signed integers, we take
. But 11 and –5 have the same representation in two’s-complement (because they differ
by 16), so the same computer word contents can serve in both domains as the multiplicative
inverse.
The theorem applies directly to the problem of division (signed and unsigned) by an
odd integer d on a W-bit computer. Because any odd integer is relatively prime to 2W, the theorem says that if d is odd, there exists an integer (unique in the range 0 to 2W– 1 or in the range –2W – 1 to 2W –1 – 1) such that
Hence, for any integer n that is a multiple of d,
In other words, n/d can be calculated by multiplying n by and retaining only the rightmost W bits of the product.
If the divisor d is even, let d = do · 2k, where do is odd and k ≥ 1. Then, simply shift n right k positions (shifting out 0’s), and then multiply by (the shift could be done after the multiplication as well).
Below is the code for division of n by 7, where n is a multiple of 7. This code gives the correct result whether it is considered to be signed or unsigned division.
li M,0xB6DB6DB7 Mult. inverse, (5*2**32 + 1)/7.
mul q,M,n q = n/7.
How can we compute the multiplicative inverse? The standard method is by means of the “extended Euclidean algorithm.” This is briefly discussed below as it applies to our problem, and the interested reader is referred to [NZM, p. 13] and to [Knu2, 4.5.2] for a more complete discussion.
Given an odd divisor d, we wish to solve for x
dx ≡ 1(mod/m),
where, in our application, m = 2W and W is the word size of the machine. This will be accomplished if we can solve for integers x and y (positive, negative, or 0) the equation
dx + my = 1.
Toward this end, first make d positive by adding a sufficient number of multiples of m to it. (d and d + km have the same multiplicative inverse.) Second, write the following equations (in which d, m > 0):
If d = 1, we are done, because (ii) shows that x = 1. Otherwise, compute
Third, multiply Equation (ii) by q and subtract it from (i). This gives
d (– 1 – q) + m(1) = m – d – qd = rem(m – d, d).
This equation holds because we have simply multiplied one equation by a constant and subtracted it from another. If rem(m – d, d) = 1, we are done; this last equation is the solution and x = – 1 – q.
Repeat this process on the last two equations, obtaining a fourth, and continue until the right-hand side of the equation is 1. The multiplier of d, reduced modulo m, is then the desired inverse of d.
Incidentally, if m – d < d, so that the first quotient is 0, then the third row will be a copy of the first, so that the second quotient will be nonzero. Furthermore, most texts start with the first row being
d(0) + m(1) = m,
but in our application m = 2W is not representable in the machine.
The process is best illustrated by an example: Let m = 256 and d = 7. Then the calculation proceeds as follows. To get the third row, note that q = ⌊249 / 7 ⌋ = 35.
7(-1) + 256( 1) = 249
7( 1) + 256( 0) = 7
7(-36) + 256( 1) = 4
7( 37) + 256(-1) = 3
7(-73) + 256( 2) = 1
Thus, the multiplicative inverse of 7, modulo 256, is –73 or, expressed in the range 0 to 255, is 183. Check: 7 · 183 = 1281 ≡ 1 (mod 256).
From the third row on, the integers in the right-hand column are all remainders of dividing the number above it into the number two rows above it, so they form a sequence of strictly decreasing nonnegative integers. Therefore, the sequence must end in 0 (as the above would if carried one more step). Furthermore, the value just before the 0 must be 1, for the following reason. Suppose the sequence ends in b followed by 0, with b ≠ 1. Then, the integer preceding the b must be a multiple of b, let’s say k1b, for the next remainder to be 0. The integer preceding k1 b must be of the form k1k2 b+ b, for the next remainder to be b. Continuing up the sequence, every number must be a multiple of b, including the first two (in the positions of the 249 and the 7 in the above example). This is impossible, because the first two integers are m – d and d, which are relatively prime.
This constitutes an informal proof that the above process terminates, with a value of 1 in the right-hand column, and hence it finds the multiplicative inverse of d.
To carry this out on a computer, first note that if d < 0, we should add 2W to it. With two’s-complement arithmetic it is not necessary to actually do anything here; simply interpret d as an unsigned number, regardless of how the application interprets it.
The computation of q must use unsigned division.
Observe that the calculations can be done modulo m, because this does not change the right-hand column (these values are in the range 0 to m – 1 anyway). This is important, because it enables the calculations to be done in “single precision,” using the computer’s modulo-2W unsigned arithmetic.
Most of the quantities in the table need not be represented. The column of multiples of 256 need not be represented, because in solving dx + my = 1, we do not need the value of y. There is no need to represent d in the first column. Reduced to its bare essentials, then, the calculation of the above example is carried out as follows:
255 249
1 7
220 4
37 3
183 1
A C program for performing this computation is shown in Figure 10–5.
unsigned mulinv(unsigned d) { // d must be odd.
unsigned x1, v1, x2, v2, x3, v3, q;
x1 = 0xFFFFFFFF; v1 = -d;
x2 = 1; v2 = d;
while (v2 > 1) {
q = v1/v2;
x3 = x1 - q*x2; v3 = v1 - q*v2;
x1 = x2; v1 = v2;
x2 = x3; v2 = v3;
}
return x2;
}
FIGURE 10–5. Multiplicative inverse modulo 232 by the Euclidean algorithm.
The reason the loop continuation condition is (v2 > 1),
rather than the more natural (v2 != 1),
is that if the latter condition were used, the loop would never terminate if the
program were invoked with an even argument. It is best that programs not loop forever
even if misused. (If the argument d
is even, v2
never takes on the value 1, but it does become 0.)
What does the program compute if given an even argument? As written, it computes a
number x such that dx ≡ 0 (mod 232), which is probably not useful. However, with the minor modification of changing
the loop continuation condition to (v2 != 0)
and returning x1
rather than x2,
it computes a number x such that dx ≡ g (mod 232), where g is the greatest common divisor of d and 232—that is, the greatest power of 2 that divides d. The modified program still computes the multiplicative inverse of d for d odd, but it requires one more iteration than the unmodified program.
As for the number of iterations (divisions) required by the above program, for d odd and less than 20, it requires a maximum of 3 and an average of 1.7. For d in the neighborhood of 1000, it requires a maximum of 11 and an average of about 6.
It is well known that, over the real numbers, 1 / d, for d ≠ 0, can be calculated to ever-increasing accuracy by iteratively evaluating
provided the initial estimate x0 is sufficiently close to 1/d. The number of digits of accuracy approximately doubles with each iteration.
It is not so well known that this same formula can be used to find the multiplicative inverse modulo any power of 2!. For example, to find the multiplicative inverse of 3, modulo 256, start with x0 = 1 (any odd number will do). Then,
The iteration has reached a fixed point modulo 256, so –85, or 171, is the multiplicative inverse of 3 (modulo 256). All calculations can be done modulo 256.
Why does this work? Because if xn satisfies
dxn ≡ 1 (mod m)
and if xn + 1 is defined by (31), then
dxn +1 ≡ 1 (mod m2).
To see this, let dxn = 1 + km. Then
In our application, m is a power of 2, say 2N. In this case, if
In a sense, if xn is regarded as a sort of approximation to , then each iteration of (31) doubles the number of bits of “accuracy” of the approximation.
It happens that modulo 8, the multiplicative inverse of any (odd) number d is d itself. Thus, taking x0 = d is a reasonable and simple initial guess at . Then, (31) will give values of x1, x2, ..., such that
Thus, four iterations suffice to find the multiplicative inverse modulo 232 (if x ≡ 1 (mod 248), then x ≡ 1 (mod 2n) for n ≤ 48). This leads to the C program in Figure 10–6, in which all computations are done modulo 232.
For about half the values of d, this program takes 4.5 iterations, or nine multiplications. For the other half (those
for which the initial value of xn
is “correct to 4 bits”—that is, d2 ≡ 1 (mod 16)), it takes seven or fewer, usually seven, multiplications. Thus, it
takes about eight multiplications on average.
unsigned mulinv(unsigned d) { // d must be odd.
unsigned xn, t;
xn = d;
loop: t = d*xn;
if (t == 1) return xn;
xn = xn*(2 - t);
goto loop;
}
FIGURE 10–6. Multiplicative inverse modulo 232 by Newton’s method.
A variation is to simply execute the loop four times, regardless of d, perhaps “strung out” to eliminate the loop control (eight multiplications). Another variation is to somehow make the initial estimate x0 “correct to 4 bits” (that is, find x0 that satisfies dx0 ≡ 1 (mod 16)). Then, only three loop iterations are required. Some ways to set the initial estimate are
Here, the multiplication by 2 is a left shift, and the computations are done modulo 232 (ignoring overflow). Because the second formula uses a multiplication, it saves only one.
This concern about execution time is, of course, totally unimportant for the compiler application. For that application, the routine would be so seldom used that it should be coded for minimum space. But there may be applications in which it is desirable to compute the multiplicative inverse quickly.
The “Newton method” described here applies only when (1) the modulus is an integral power of some number a, and (2) the multiplicative inverse of d modulo a is known. It works particularly well for a = 2, because then the multiplicative inverse of any (odd) number d modulo 2 is known immediately—it is 1.
We conclude this section with a listing of some multiplicative inverses in Table 10–3.
TABLE 10–3. SAMPLE MULTIPLICATIVE INVERSES
You may notice that in several cases (d = 3, 5, 9, 11), the multiplicative inverse of d is the same as the magic number for unsigned division by d (see Section 10–14, “Sample Magic Numbers,” on page 238). This is more or less a coincidence. It happens that for these numbers, the magic number M is equal to the multiplier m, and these are of the form (2p + 1) / d, with p ≥ 32. In this case, notice that
so that M ≡ (mod 232).
The multiplicative inverse of a divisor d can be used to test for a zero remainder after division by d[GM].
First, consider unsigned division with the divisor d odd. Denote by the multiplicative inverse of d. Then, because
, where W is the machine’s word size in bits,
is also odd. Thus,
is relatively prime to 2W, and as shown in the proof of theorem MI in the preceding section, as n ranges over all 2W distinct values modulo 2W,
takes on all 2W distinct values modulo 2W.
It was shown in the preceding section that if n is a multiple of d,
That is, for n = 0, d, 2d, ..., ⌊(2W – 1) / d ⌋ d, ≡ 0, 1, 2, ..., ⌊(2W – 1) / d ⌋(mod 2W). Therefore, for n not a multiple of d, the value of
, reduced modulo 2W to the range 0 to 2W – 1, must exceed ⌊(2W – 1) / d ⌋.
This can be used to test for a zero remainder. For example, to test if an integer
n is a multiple of 25, multiply n by and compare the rightmost W bits to ⌊(2W– 1) / 25 ⌋. On our basic RISC:
li M,0xC28F5C29 Load mult. inverse of 25.
mul q,M,n q = right half of M*n.
li c,0x0A3D70A3 c = floor((2**32-1)/25).
cmpleu t,q,c Compare q and c, and branch
bt t,is_mult if n is a multiple of 25.
To extend this to even divisors, let d = do · 2k, where do is odd and k ≥ 1. Then, because an integer is divisible by d if and only if it is divisible by do and by 2k, and because n and have the same number of trailing zeros (
is odd), the test that n is a multiple of d is
where the mod function is understood to reduce to the interval [0, 2W –1].
Direct implementation of this requires two tests and conditional branches, but it
can be reduced to one compare-branch quite efficiently if the machine has the rotate-shift instruction. This follows from the following theorem, in which denotes the computer word a rotated right k positions (0 ≤ k ≤ 32).
THEOREM ZRU. and x ends in k 0-bits if and only if
Proof. (Assume a 32-bit machine.) Suppose and x ends in k 0-bits. Then, because
,
But
Therefore,
If x does not end in k 0-bits, then
does not begin with k 0-bits, whereas ⌊a /2k⌋ does, so
Lastly, if
and x ends in k 0-bits, then the integer formed from the first 32 – k bits of x must exceed that formed from the first 32 - k bits of a, so that
Using this theorem, the test that n is a multiple of d, where n and d > 1 are unsigned integers and d = do · 2k with do odd, is
Here we used ⌊⌊(2W– 1) / do ⌋ / 2k ⌋ = ⌊(2W– 1) / (do · 2k)⌋ = ⌊ (2W– 1) / d⌋.
As an example, the following code tests an unsigned integer n to see if it is a multiple of 100:
li M,0xC28F5C29 Load mult. inverse of 25.
mul q,M,n q = right half of M*n.
shrri q,q,2 Rotate right two positions.
li c,0x028F5C28 c = floor((2**32-1)/100).
cmpleu t,q,c Compare q and c, and branch
bt t,is_mult if n is a multiple of 100.
For signed division, it was shown in the preceding section that if n is a multiple of d and d is odd, then
Thus, for n = ⌈ –2W – 1 /d ⌉ · d, ...,–d,0, d, ...,⌊(2W – 1 – 1) / d ⌋ · d, we have ≡ ⌈– 2W – 1 / d,...,–1,0, 1, ...,⌊ (2W –1 – 1) / d ⌋ (mod 2W). Furthermore, because d is relatively prime to 2W, as n ranges over all 2W distinct values modulo 2W,
takes on all 2W distinct values modulo 2W. Therefore, n is a multiple of d if and only if
where the mod function is understood to reduce to the interval [–2W – 1 2W – 1 –1]
This can be simplified a little by observing that because d is odd and, as we are assuming, positive and not equal to 1, it does not divide 2W–1. Therefore,
⌈ – 2W – 1 / d ⌉ = ⌈ (–2W – 1 + 1) d ⌉ = - ⌊(2W–1 –1)/d⌋.
Thus, for signed numbers, the test that n is a multiple of d, where d = do · 2k and do is odd, is
Set q = mod(, 2W);
-⌊ (2W –1 – 1) / do ⌋ ≤ q ≤ ⌊ (2W – 1 – 1)/do ⌋ and q ends in k or more 0-bits.
On the surface, this would seem to require three tests and branches. However, as in the unsigned case, it can be reduced to one compare-branch by use of the following theorem:
THEOREM ZRS. If a ≥ 0, the following assertions are equivalent:
where a′ is a with its rightmost k bits set to 0 (that is, a ′ = a & –2k).
Proof. (Assume a 32-bit machine). To see that (1) is equivalent to (2), clearly the assertion – a ≤ x ≤ a is equivalent to abs(x) ≤ a. Then, Theorem ZRU applies, because both sides of this inequality are nonnegative.
To see that (1) is equivalent to (3), note that assertion (1) is equivalent to itself with a replaced with a′. Then, by the theorem on bounds checking on page 68, this in turn is equivalent to
Because x + a′ ends in k 0-bits if and only if x does, Theorem ZRU applies, giving the result.
Using part (3) of this theorem, the test that n is a multiple of d, where n and d ≥ 2 are signed integers and d = do · 2k with do odd, is
(a′ can be computed at compile time, because d is a constant.)
As an example, the following code tests a signed integer n to see if it is a multiple of 100. Notice that the constant ⌊2a′ / 2k⌋ can always be derived from the constant a′ by a shift of k – 1 bits, saving an instruction or a load from memory to develop the comparand.
li M,0xC28F5C29 Load mult. inverse of 25.
mul q,M,n q = right half of M*n.
li c,0x051EB850 c = floor((2**31 – 1)/25) & –4.
add q,q,c Add c.
shrri q,q,2 Rotate right two positions.
shri c,c,1 Compute const. for comparison.
cmpleu t,q,c Compare q and c, and
bt t,is_mult branch if n is a mult. of 100.
In this section we consider some methods for dividing by constants that do not use the multiply high instruction, or a multiplication instruction that gives a double-word result. We show how to change division by a constant into a sequence of shift and add instructions, or shift, add, and multiply for more compact code.
For these methods, unsigned division is simpler than signed division, so we deal with
unsigned division first. One method is to use the techniques given that use the multiply high instruction, but use the code shown in Figure 8–2 on page 174 to do the multiply high operation. Figure 10–7 shows how this works out for the case of (unsigned) division by 3. This is a combination
of the code on page 228 and Figure 8–2 with “int
” changed to “unsigned
.” The code is 15 instructions, including four multiplications. The multiplications
are by large constants and would take quite a few instructions if converted to shift’s and add’s. Very similar code can be devised for the signed case. This method is not particularly
good and won’t be discussed further.
Another method [GLS1] is to compute in advance the reciprocal of the divisor, and multiply the dividend by that with a series of shift right and add instructions. This gives an approximation to the quotient. It is merely an approximation, because the reciprocal of the divisor (which we assume is not an exact power of two) is not expressed exactly in 32 bits, and also because each shift right discards bits of the dividend. Next, the remainder with respect to the approximate quotient is computed, and that is divided by the divisor to form a correction, which is added to the approximate quotient, giving the exact quotient. The remainder is generally small compared to the divisor (a few multiples thereof), so there is often a simple way to compute the correction without using a divide instruction.
To illustrate this method, consider dividing by 3, that is, computing ⌊n / 3 ⌋ where 0 ≤ n < 232. The reciprocal of 3, in binary, is approximately
0.0101 0101 0101 0101 0101 0101 0101 0101.
To compute the approximate product of that and n, we could use
unsigned divu3(unsigned n) {
unsigned n0, n1, w0, w1, w2, t, q;
n0 = n & 0xFFFF;
n1 = n >> 16;
w0 = n0*0xAAAB;
t = n1*0xAAAB + (w0 >> 16);
w1 = t & 0xFFFF;
w2 = t >> 16;
w1 = n0*0xAAAA + w1;
q = n1*0xAAAA + w2 + (w1 >> 16);
return q >> 1;
}
FIGURE 10–7. Unsigned divide by 3 using simulated multiply high unsigned.
(29 instructions; the last 1 in the reciprocal is ignored because it would add the
term which is obviously 0). However, the simple repeating pattern of 1’s and 0’s in the reciprocal permits a method that is both faster (nine instructions)
and more accurate:
To compare these methods for their accuracy, consider the bits that are shifted out by each term of (32), if n is all 1-bits. The first term shifts out two 1-bits, the next four 1-bits, and so on. Each of these contributes an error of almost 1 in the least significant bit. Since there are 16 terms (counting the term we ignored), the shifts contribute an error of almost 16. There is an additional error due to the fact that the reciprocal is truncated to 32 bits; it turns out that the maximum total error is 16.
For procedure (1), each right shift also contributes an error of almost 1 in the least significant bit. But there are only five shift operations. They contribute an error of almost 5, and there is a further error due to the fact that the reciprocal is truncated to 32 bits; it turns out that the maximum total error is 5.
After computing the estimated quotient q, the remainder r is computed from
r ← n – q * 3.
The remainder cannot be negative, because q is never larger than the exact quotient. We need to know how large r can be to devise the simplest possible method for computing In general, for a divisor d and an estimated quotient q too low by k, the remainder will range from k*d to k*d + d – 1. (The upper limit is conservative; it may not actually be attained.) Thus, using
(1), for which q is too low by at most 5, we expect the remainder to be at most 5*3 + 2 = 17. Experimentation
reveals that it is actually at most 15. Thus, for the correction we must compute (exactly)
Since r is small compared to the largest value that a register can hold, this can be approximated by multiplying r by some approximation to 1/3 of the form a/b where b is a power of 2. This is easy to compute, because the division is simply a shift. The value of a/ b must be slightly larger than 1/3, so that after shifting the result will agree with truncated division. A sequence of such approximations is:
1/2, 2/4, 3/8, 6/16, 11/32, 22/64, 43/128, 86/256, 171/512, 342/1024, ....
Usually, the smaller fractions in the sequence are easier to compute, so we choose the smallest one that works; in the case at hand this is 11/32. Therefore, the final, exact, quotient is given by
The solution involves two multiplications by small numbers (3 and 11); these can be changed to shift’s and add’s.
Figure 10–8 shows the entire solution in C. As shown, it consists of 14 instructions, including
two multiplications. If the multiplications are changed to shift’s and add’s, it amounts to 18 elementary instructions. However, if it is desired to avoid the
multiplications, then either alternative return
statement shown gives a solution in 17 elementary instructions. Alternative 2 has
just a little instruction-level parallelism, but in truth this method generally has
very little of that.
A more accurate estimate of the quotient can be obtained by changing the first executable line to
q = (n >> 1) + (n >> 3);
(which makes q
too large by a factor of 2, but it has one more bit of accuracy), and then inserting
just before the assignment to r
,
q = q >> 1;
With this variation, the remainder is at most 9. However, there does not seem to be
any better code for calculating with r limited to 9 than there is for r limited to 15 (four elementary instructions in either case). Thus, using the idea
would cost one instruction. This possibility is mentioned because it does give a code improvement for most divisors.
unsigned divu3(unsigned n) {
unsigned q, r;
q = (n >> 2) + (n >> 4); // q = n*0.0101 (approx).
q = q + (q >> 4); // q = n*0.01010101.
q = q + (q >> 8);
q = q + (q >> 16);
r = n - q*3; // 0 <= r <= 15.
return q + (11*r >> 5); // Returning q + r/3.
// return q + (5*(r + 1) >> 4); // Alternative 1.
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2.
}
FIGURE 10–8. Unsigned divide by 3.
Figure 10–9 shows two variations of this method for dividing by 5. The reciprocal of 5, in binary, is
0.0011 0011 0011 0011 0011 0011 0011 0011.
As in the case of division by 3, the simple repeating pattern of 1’s and 0’s allows
a fairly efficient and accurate computing of the quotient estimate. The estimate of
the quotient computed by the code on the left can be off by at most 5, and it turns
out that the remainder is at most 25. The code on the right retains two additional
bits of accuracy in computing the quotient estimate, which is off by at most 2. The
remainder in this case is at most 10. The smaller maximum remainder permits approximating
1/5 by 7/32 rather than 13/64, which gives a slightly more efficient program if the
multiplications are done by shift’s and add’s. The instruction counts are, for the code on the left: 14 instructions including
two multiplications, or 18 elementary instructions; for the code on the right: 15
instructions including two multiplications, or 17 elementary instructions. The alternative
code in the return
statement is useful only if your machine has comparison predicate instructions. It
doesn’t reduce the instruction count, but merely has a little instruction-level parallelism.
For division by 6, the divide-by-3 code can be used, followed by a shift right of 1. However, the extra instruction can be saved by doing the computation directly, using the binary approximation
4/6 ≈ 0.1010 1010 1010 1010 1010 1010 1010 1010.
FIGURE 10–9. Unsigned divide by 5.
The code is shown in Figure 10–10. The version on the left multiplies by an approximation to 1/6 and then corrects
with a multiplication by 11/64. The version on the right takes advantage of the fact
that by multiplying by an approximation to 4/6, the quotient estimate is off by only
1 at most. This permits simpler code for the correction; it simply adds 1 to q
if r
≥ 6. The code in the second return
statement is appropriate if the machine has the comparison predicate instructions.
Function divu6b
is 15 instructions, including one multiply, as shown, or 17 elementary instructions if the multiplication by 6 is changed to
shift’s and add’s.
FIGURE 10–10. Unsigned divide by 6.
For larger divisors, usually it seems to be best to use an approximation to 1/ d that is shifted left so that its most significant bit is 1. It seems that the quotient is then off by at most 1 usually (possibly always, this writer does not know), which permits efficient code for the correction step. Figure 10–11 shows code for dividing by 7 and 9, using the binary approximations
If the multiplications by 7 and 9 are expanded into shift’s and add’s, these functions take 16 and 15 elementary instructions, respectively.
FIGURE 10–11. Unsigned divide by 7 and 9.
Figures 10–12 and 10–13 show code for dividing by 10, 11, 12, and 13. These are based on the binary approximations:
If the multiplications are expanded into shift’s and add’s, these functions take 17, 20, 17, and 20 elementary instructions, respectively.
FIGURE 10–12. Unsigned divide by 10 and 11.
FIGURE 10–13. Unsigned divide by 12 and 13.
The case of dividing by 13 is instructive because it shows how you must look for repeating
strings in the binary expansion of the reciprocal of the divisor. The first assignment
sets q
equal to n
*0.1001. The second assignment to q
adds n
*0.00001001 and n
*0.000001001. At this point, q
is (approximately) equal to n
*0.100111011. The third assignment to q
adds in repetitions of this pattern. It sometimes helps to use subtraction, as in the case of divu9
above. However, you must use care with subtraction, because it may cause the quotient
estimate to be too large, in which case the remainder is negative and the method breaks
down. It is quite complicated to get optimal code, and we don’t have a general cookbook
method that you can put in a compiler to handle any divisor.
The examples above are able to economize on instructions, because the reciprocals
have simple repeating patterns, and because the multiplication in the computation
of the remainder r
is by a small constant, which can be done with only a few shift’s and add’s. One might wonder how successful this method is for larger divisors. To roughly
assess this, Figures 10–14 and 10–15 show code for dividing by 100 and by 1000 (decimal). The relevant reciprocals are
If the multiplications are expanded into shift’s and add’s, these functions take 25 and 23 elementary instructions, respectively.
unsigned divu100(unsigned n) {
unsigned q, r;
q = (n >> 1) + (n >> 3) + (n >> 6) - (n >> 10) +
(n >> 12) + (n >> 13) - (n >> 16);
q = q + (q >> 20);
q = q >> 6;
r = n - q*100;
return q + ((r + 28) >> 7);
// return q + (r > 99);
}
FIGURE 10–14. Unsigned divide by 100.
unsigned divu1000(unsigned n) {
unsigned q, r, t;
t = (n >> 7) + (n >> 8) + (n >> 12);
q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14);
q = q >> 9;
r = n - q*1000;
return q + ((r + 24) >> 10);
// return q + (r > 999);
}
FIGURE 10–15. Unsigned divide by 1000.
In the case of dividing by 1000, the least significant eight bits of the reciprocal estimate are nearly ignored. The code of Figure 10–15 replaces the binary 1001 0111 with 0100 0000, and still the quotient estimate is within one of the true quotient. Thus, it appears that although large divisors might have very little repetition in the binary representation of the reciprocal estimate, at least some bits can be ignored, which helps hold down the number of shift’s and add’s required to compute the quotient estimate.
This section has shown, in a somewhat imprecise way, how unsigned division by a constant can be reduced to a sequence of, typically, about 20 elementary instructions. It is nontrivial to get an algorithm that generates these code sequences that is suitable for incorporation into a compiler, because of three difficulties in getting optimal code.
1. It is necessary to search the reciprocal estimate bit string for repeating patterns.
2. Negative terms (as in divu10
and divu100)
can be used sometimes, but the error analysis required to determine just when they
can be used is difficult.
3. Sometimes some of the least significant bits of the reciprocal estimate can be ignored (how many?).
Another difficulty for some target machines is that there are many variations on the code examples given that have more instructions, but that would execute faster on a machine with multiple shift and add units.
The code of Figures 10–7 through 10–15 has been tested for all 232 values of the dividends.
The methods given above can be made to apply to signed division. The right shift instructions in computing the quotient estimate become signed right shift instructions, which compute floor division by powers of 2. Thus, the quotient estimate is too low (algebraically), so the remainder is nonnegative, as in the unsigned case.
The code most naturally computes the floor division result, so we need a correction to make it compute the conventional truncated-toward-0 result. This can be done with three computational instructions by adding d – 1 to the dividend if the dividend is negative. For example, if the divisor is 6, the code begins with (the shift here is a signed shift)
n = n + (n >> 31 & 5);
Other than this, the code is very similar to that of the unsigned case. The number of elementary operations required is usually three more than in the corresponding unsigned division function. Several examples are given in Figures 10–16 through 10–22. All have been exhaustively tested.
int divs3(int n) {
int q, r;
n = n + (n>>31 & 2); // Add 2 if n < 0.
q = (n >> 2) + (n >> 4); // q = n*0.0101 (approx).
q = q + (q >> 4); // q = n*0.01010101.
q = q + (q >> 8);
q = q + (q >> 16);
r = n - q*3; // 0 <= r <= 14.
return q + (11*r >> 5); // Returning q + r/3.
// return q + (5*(r + 1) >> 4); // Alternative 1.
// return q + ((r + 5 + (r << 2)) >> 4);// Alternative 2.
}
FIGURE 10–16. Signed divide by 3.
FIGURE 10–17. Signed divide by 5 and 6.
FIGURE 10–18. Signed divide by 7 and 9.
FIGURE 10–19. Signed divide by 10 and 11.
FIGURE 10–20. Signed divide by 12 and 13.
FIGURE 10–21. Signed divide by 100.
int divs1000(int n) {
int q, r, t;
n = n + (n >> 31 & 999);
t = (n >> 7) + (n >> 8) + (n >> 12);
q = (n >> 1) + t + (n >> 15) + (t >> 11) + (t >> 14) +
(n >> 26) + (t >> 21);
q = q >> 9;
r = n - q*1000;
return q + ((r + 24) >> 10);
// return q + (r > 999);
}
FIGURE 10-22. Signed divide by 1000.
This section addresses the problem of computing the remainder of division by a constant without computing the quotient. The methods of this section apply only to divisors of the form 2k ± 1, for k an integer greater than or equal to 2, and in most cases the code resorts to a table lookup (an indexed load instruction) after a fairly short calculation.
We will make frequent use of the following elementary property of congruences:
THEOREM C. If a ≡ b (mod m) and c ≡ d (mod m), then
The unsigned case is simpler and is dealt with first.
For a divisor of 3, multiplying the trivial congruence 1 ≡ 1 (mod 3) repeatedly by the congruence 2 ≡ –1 (mod 3), we conclude by Theorem C that
Therefore, a number n written in binary as ...b3 b2 b1 b0 satisfies
n = ... + b3 · 23 + b2 · 22 + b1 · 2 + b0 ≡ ...– b3 + b2 – b1 + b0 (mod 3),
which is derived by using Theorem C repeatedly. Thus, we can alternately add and subtract the bits in the binary representation of the number to obtain a smaller number that has the same remainder upon division by 3. If the sum is negative, you must add a multiple of 3 to make it nonnegative. The process can then be repeated until the result is in the range 0 to 2.
The same trick works for finding the remainder after dividing a decimal number by 11.
Thus, if the machine has the population count instruction, a function that computes the remainder modulo 3 of an unsigned number
n
might begin with
n = pop(n & 0x55555555) - pop(n & 0xAAAAAAAA);
This can be simplified by using the following surprising identity discovered by Paolo Bonzini [Bonz]:
Since the references to 32 (the word size) cancel out, the result holds for any word size. Another way to prove (2) is to observe that it holds for x = 0, and if a 0-bit in x is changed to a 1 where m is 1, then both sides of (2) decrease by 1, and if a 0-bit of x is changed to a 1 where m is 0, then both sides of (2) increase by 1.
Applying (2) to the line of C code above gives
n = pop(n ^ 0xAAAAAAAA) - 16;
We want to apply this transformation again, until n
is in the range 0 to 2, if possible. It is best to avoid producing a negative value
of n
, because the sign bit would not be treated properly on the next round. A negative
value can be avoided by adding a sufficiently large multiple of 3 to n
. Bonzini’s code, shown in Figure 10–23, increases the constant by 39. This is larger than necessary to make n
nonnegative, but it causes n
to range from –3 to 2 (rather than –3 to 3) after the second round of reduction.
This simplifies the code on the return
statement, which is adding 3 if n
is negative. The function executes in 11 instructions, counting two to load the large
constant.
Figure 10–24 shows a variation that executes in four instructions, plus a simple table lookup operation (e.g., an indexed load byte instruction).
int remu3(unsigned n) {
n = pop(n ^ 0xAAAAAAAA) + 23; // Now 23 <= n <= 55.
n = pop(n ^ 0x2A) - 3; // Now -3 <= n <= 2.
return n + (((int)n >> 31) & 3);
}
FIGURE 10–23. Unsigned remainder modulo 3, using population count.
int remu3(unsigned n) {
static char table[33] = {2, 0,1,2, 0,1,2, 0,1,2,
0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
0,1,2, 0,1};
n = pop(n ^ 0xAAAAAAAA);
return table[n];
}
FIGURE 10–24. Unsigned remainder modulo 3, using population count and a table lookup.
To avoid the population count instruction, notice that because 4 ≡ 1 (mod 3), 4k ≡ 1 (mod 3). A binary number can be viewed as a base 4 number by taking its bits in pairs and interpreting the bits 00 to 11 as a base 4 digit ranging from 0 to 3. The pairs of bits can be summed using the code of Figure 5–2 on page 82, omitting the first executable line (overflow does not occur in the additions). The final sum ranges from 0 to 48, and a table lookup can be used to reduce this to the range 0 to 2. The resulting function is 16 elementary instructions, plus an indexed load.
There is a similar, but slightly better, way. As a first step, n
can be reduced to a smaller number that is in the same congruence class modulo 3
with
n = (n >> 16) + (n & 0xFFFF);
This splits the number into two 16-bit portions, which are added together. The contribution
modulo 3 of the left 16 bits of n
is not altered by shifting them right 16 positions, because the shifted number, multiplied
by 216, is the original number, and 216 ≡ 1 (mod 3). More generally, 2k ≡ 1 (mod 3) if k is even. This is used repeatedly (five times) in the code shown in Figure 10–25. This code is 19 instructions. The instruction count can be reduced by cutting off
the digit summing earlier and using an in-memory table lookup, as illustrated in Figure 10–26 (nine instructions, plus an indexed load). The instruction count can be reduced to six (plus an indexed load) by using a table of size 0x2FE = 766 bytes.
To compute the unsigned remainder modulo 5, the code of Figure 10–27 uses the relations 16k ≡ 1 (mod 5) and 4 ≡ –1 (mod 5). It is 21 elementary instructions, assuming the multiplication by 3 is expanded into a shift and add.
int remu3(unsigned n) {
n = (n >> 16) + (n & 0xFFFF); // Max 0x1FFFE.
n = (n >> 8) + (n & 0x00FF); // Max 0x2FD.
n = (n >> 4) + (n & 0x000F); // Max 0x3D.
n = (n >> 2) + (n & 0x0003); // Max 0x11.
n = (n >> 2) + (n & 0x0003); // Max 0x6.
return (0x0924 >> (n << 1)) & 3;
}
FIGURE 10–25. Unsigned remainder modulo 3, digit summing and an in-register lookup.
int remu3(unsigned n) {
static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
0,1,2, 0,1,2, 0,1};
n = (n >> 16) + (n & 0xFFFF); // Max 0x1FFFE.
n = (n >> 8) + (n & 0x00FF); // Max 0x2FD.
n = (n >> 4) + (n & 0x000F); // Max 0x3D.
return table[n];
}
FIGURE 10–26. Unsigned remainder modulo 3, digit summing and an in-memory lookup.
int remu5(unsigned n) {
n = (n >> 16) + (n & 0xFFFF); // Max 0x1FFFE.
n = (n >> 8) + (n & 0x00FF); // Max 0x2FD.
n = (n >> 4) + (n & 0x000F); // Max 0x3D.
n = (n>>4) - ((n>>2) & 3) + (n & 3); // -3 to 6.
return (01043210432 >> 3*(n + 3)) & 7; // Octal const.
}
FIGURE 10–27. Unsigned remainder modulo 5, digit summing method.
The instruction count can be reduced by using a table, similar to what is done in Figure 10–26. In fact, the code is identical, except the table is:
static char table[62] = {0,1,2,3,4, 0,1,2,3,4,
0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
0,1,2,3,4, 0,1,2,3,4, 0,1};
For the unsigned remainder modulo 7, the code of Figure 10–28 uses the relation 8k ≡ 1 (mod 7) (nine elementary instructions, plus an indexed load).
As a final example, the code of Figure 10–29 computes the remainder of unsigned division by 9. It is based on the relation 8 ≡ –1 (mod 9). As shown, it is nine elementary instructions, plus an indexed load. The elementary instruction count can be reduced to six by using a table of size 831 (decimal).
int remu7(unsigned n) {
static char table[75] = {0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4};
n = (n >> 15) + (n & 0x7FFF); // Max 0x27FFE.
n = (n >> 9) + (n & 0x001FF); // Max 0x33D.
n = (n >> 6) + (n & 0x0003F); // Max 0x4A.
return table[n];
}
FIGURE 10–28. Unsigned remainder modulo 7, digit summing method.
int remu9(unsigned n) {
int r;
static char table[75] = {0,1,2,3,4,5,6,7,8,
0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
0,1,2,3,4,5,6,7,8, 0,1,2};
r = (n & 0x7FFF) - (n >> 15); // FFFE0001 to 7FFF.
r = (r & 0x01FF) - (r >> 9); // FFFFFFC1 to 2FF.
r = (r & 0x003F) + (r >> 6); // 0 to 4A.
return table[r];
}
FIGURE 10–29. Unsigned remainder modulo 9, digit summing method.
The digit summing method can be adapted to compute the remainder resulting from signed division. There seems to be no better way than to add a few steps to correct the result of the method as applied to unsigned division. Two corrections are necessary: (1) correct for a different interpretation of the sign bit, and (2) add or subtract a multiple of the divisor d to get the result in the range 0 to – (d – 1).
For division by 3, the unsigned remainder code interprets the sign bit of the dividend n as contributing 2 to the remainder (because 231 mod 3 = 2). For the remainder of signed division, the sign bit contributes only 1 (because (–231) mod 3 = 1). Therefore, we can use the code for an unsigned remainder and correct its result by subtracting 1. Then, the result must be put in the range 0 to –2. That is, the result of the unsigned remainder code must be mapped as follows:
(0, 1, 2) ⇒ (–1, 0, 1) ⇒ (–1, 0, –2).
This adjustment can be done fairly efficiently by subtracting 1 from the unsigned
remainder if it is 0 or 1, and subtracting 4 if it is 2 (when the dividend is negative).
The code must not alter the dividend n
, because it is needed in this last step.
This procedure can easily be applied to any of the functions given for the unsigned remainder modulo 3. For example, applying it to Figure 10–26 on page 265 gives the function shown in Figure 10–30. It is 13 elementary instructions, plus an indexed load. The instruction count can be reduced by using a larger table.
int rems3(int n) {
unsigned r;
static char table[62] = {0,1,2, 0,1,2, 0,1,2, 0,1,2,
0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2,
0,1,2, 0,1,2, 0,1};
r = n;
r = (r >> 16) + (r & 0xFFFF); // Max 0x1FFFE.
r = (r >> 8) + (r & 0x00FF); // Max 0x2FD.
r = (r >> 4) + (r & 0x000F); // Max 0x3D.
r = table[r];
return r - (((unsigned)n >> 31) << (r & 2));
}
FIGURE 10–30. Signed remainder modulo 3, digit summing method.
Figures 10–31 to 10–33 show similar code for computing the signed remainder of division by 5, 7, and 9. All the functions consist of 15 elementary operations, plus an indexed load. They use signed right shifts, and the final adjustment consists of subtracting the modulus if the dividend is negative and the remainder is nonzero. The number of instructions can be reduced by using larger tables.
int rems5(int n) {
int r;
static char table[62] = {2,3,4, 0,1,2,3,4, 0,1,2,3,4,
0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4,
0,1,2,3,4, 0,1,2,3};
r = (n >> 16) + (n & 0xFFFF); // FFFF8000 to 17FFE.
r = (r >> 8) + (r & 0x00FF); // FFFFFF80 to 27D.
r = (r >> 4) + (r & 0x000F); // -8 to 53 (decimal).
r = table[r + 8];
return r - (((int)(n & -r) >> 31) & 5);
}
FIGURE 10–31. Signed remainder modulo 5, digit summing method.
int rems7(int n) {
int r;
static char table[75] = {5,6, 0,1,2,3,4,5,6,
0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6,
0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2,3,4,5,6, 0,1,2};
r = (n >> 15) + (n & 0x7FFF); // FFFF0000 to 17FFE.
r = (r >> 9) + (r & 0x001FF); // FFFFFF80 to 2BD.
r = (r >> 6) + (r & 0x0003F); // -2 to 72 (decimal).
r = table[r + 2];
return r - (((int)(n & -r) >> 31) & 7);
}
FIGURE 10–32. Signed remainder modulo 7, digit summing method.
int rems9(int n) {
int r;
static char table[75] = {7,8, 0,1,2,3,4,5,6,7,8,
0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
0,1,2,3,4,5,6,7,8, 0,1,2,3,4,5,6,7,8,
0,1,2,3,4,5,6,7,8, 0};
r = (n & 0x7FFF) - (n >> 15); // FFFF7001 to 17FFF.
r = (r & 0x01FF) - (r >> 9); // FFFFFF41 to 0x27F.
r = (r & 0x003F) + (r >> 6); // -2 to 72 (decimal).
r = table[r + 2];
return r - (((int)(n & -r) >> 31) & 9);
}
FIGURE 10–33. Signed remainder modulo 9, digit summing method.
The method described in this section applies, in principle, to all integer divisors greater than 2, but as a practical matter only to fairly small divisors and to divisors of the form 2k – 1. As in the preceding section, in most cases the code resorts to a table lookup after a fairly short calculation.
This section uses the mathematical (not computer algebra) notation a mod b, where a and b are integers and b > 0, to denote the integer x, 0 ≤ x < b, that satisfies x ≡ a (mod b).
To compute n mod 3, observe that
Proof: Let n = 3 k + δ, where δ and k are integers and 0 ≤ δ ≤ 2. Then
Clearly, the value of the last expression is 0, 1, or 2 for δ = 0, 1, or 2 respectively. This allows changing the problem of computing the remainder modulo 3 to one of computing the remainder modulo 4, which is of course much easier on a binary computer.
Relations like (3) do not hold for all moduli, but similar relations do hold if the modulus is of the form 2k – 1, for k an integer greater than 1. For example, it is easy to show that
For numbers not of the form 2k – 1, there is no such simple relation, but there is a certain uniqueness property that can be used to compute the remainder for other divisors. For example, if the divisor is 10 (decimal), consider the expression
Let n = 10 k + δ where 0 ≤ δ ≤ 9. Then
For δ = 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, the last expression takes on the values 0, 1, 3, 4, 6, 8, 9, 11, 12, and 14 respectively. The latter numbers are all distinct. Therefore, if we can find a reasonably easy way to compute (4), we can translate 0 to 0, 1 to 1, 3 to 2, 4 to 3, and so on, to obtain the remainder of division by 10. This will generally require a translation table of size equal to the next power of 2 greater than the divisor, so the method is practical only for fairly small divisors (and for divisors of the form 2k – 1, for which table lookup is not required).
The code to be shown was derived by using a little of the above theory and a lot of trial and error.
Consider the remainder of unsigned division by 3. Following (3), we wish to compute the rightmost two bits of the integer part of 4n/ 3. This can be done approximately by multiplying by ⌊232 / 3⌋ and then dividing by 230 using a shift right instruction. When the multiplication by ⌊232 / 3⌋ is done (using the multiply instruction that gives the low-order 32 bits of the product), high-order bits will be lost. But that doesn’t matter, and, in fact, it’s helpful, because we want the result modulo 4. Therefore, because ⌊232 / 3⌋ = 0x55555555, a possible plan is to compute
Experiment indicates that this works for n in the range 0 to 230 + 2. It almost works, I should say; if n is nonzero and a multiple of 3, it gives the result 3. Therefore, it must be followed by a translation step that translates (0, 1, 2, 3) to (0, 1, 2, 0) respectively.
To extend the range of applicability, the multiplication must be done more accurately. Two more bits of accuracy suffice (that is, multiplying by 0x55555555.4). The following calculation, followed by the translation step, works for all n representable as an unsigned 32-bit integer:
It is, of course, possible to give a formal proof of this, but the algebra is quite lengthy and error prone.
The translation step can be done in three or four instructions on most machines, but there is a way to avoid it at a cost of two instructions. The above expression for computing r estimates low. If you estimate slightly high, the result is always 0, 1, or 2. This gives the C function shown in Figure 10–34 (eight instructions, including a multiply).
int remu3(unsigned n) {
return (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
}
FIGURE 10–34. Unsigned remainder modulo 3, multiplication method.
The multiplication can be expanded, giving the 13-instruction function shown in Figure 10–35 that uses only shift’s and add’s.
int remu3(unsigned n) {
unsigned r;
r = n + (n << 2);
r = r + (r << 4);
r = r + (r << 8);
r = r + (r << 16);
r = r + (n >> 1);
r = r - (n >> 3);
return r >> 30;
}
FIGURE 10–35. Unsigned remainder modulo 3, multiplication (expanded) method.
The remainder of unsigned division by 5 can be computed very similarly to the remainder
of division by 3. Let n = 5 k + r with 0 ≤ r ≤ 4. Then (8 / 5)n mod 8 = (8 / 5)(5 k+ r) mod 8 = (8 / 5)r mod 8. For r = 0, 1, 2, 3, and 4, this takes on the values 0, 1, 3, 4, and 6 respectively. Since
⌊232 / 5⌋ = 0x33333333, this leads to the function shown in Figure 10–36 (11 instructions, including a multiply). The last step (code on the return
statement) is mapping (0, 1, 3, 4, 6, 7) to (0, 1, 2, 3, 4, 0) respectively, using
an in-register method rather than an indexed load from memory. By also mapping 2 to 2 and 5 to 4, the precision required in the multiplication
by 232 / 5 is reduced to using just the term n >> 3
to approximate the missing part of the multiplier (hexadecimal 0.333...). If the
“accuracy” term n >> 3
is omitted, the code still works for n
ranging from 0 to 0x60000004.
int remu5(unsigned n) {
n = (0x33333333*n + (n >> 3)) >> 29;
return (0x04432210 >> (n << 2)) & 7;
}
FIGURE 10–36. Unsigned remainder modulo 5, multiplication method.
The code for computing the unsigned remainder modulo 7 is similar, but the mapping
step is simpler; it is necessary only to convert 7 to 0. One way to code it is shown
in Figure 10–37 (11 instructions, including a multiply). If the accuracy term n >> 4
is omitted, the code still works for n
up to 0x40000006. With both accuracy terms omitted, it works for n
up to 0x08000006.
int remu7(unsigned n) {
n = (0x24924924*n + (n >> 1) + (n >> 4)) >> 29;
return n & ((int)(n - 7) >> 31);
}
FIGURE 10–37. Unsigned remainder modulo 7, multiplication method.
Code for computing the unsigned remainder modulo 9 is shown in Figure 10–38. It is six instructions, including a multiply, plus an indexed load. If the accuracy term n >> 1
is omitted and the multiplier is changed to 0x1C71C71D, the function works for n
up to 0x1999999E.
int remu9(unsigned n) {
static char table[16] = {0, 1, 1, 2, 2, 3, 3, 4,
5, 5, 6, 6, 7, 7, 8, 8};
n = (0x1C71C71C*n + (n >> 1)) >> 28;
return table[n];
}
FIGURE 10–38. Unsigned remainder modulo 9, multiplication method.
Figure 10–39 shows a way to compute the unsigned remainder modulo 10. It is eight instructions,
including a multiply, plus an indexed load instruction. If the accuracy term n >> 3
is omitted, the code works for n
up to 0x40000004. If both accuracy terms are omitted, it works for n
up to 0x0AAAAAAD.
int remu10(unsigned n) {
static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
5, 6, 7, 7, 8, 8, 9, 0};
n = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
return table[n];
}
FIGURE 10–39. Unsigned remainder modulo 10, multiplication method.
As a final example, consider the computation of the remainder modulo 63. This function is used by the population count program at the top of page 84. Joe Keane [Keane] has come up with the rather mysterious code shown in Figure 10–40. It is 12 elementary instructions on the basic RISC.
int remu63(unsigned n) {
unsigned t;
t = (((n >> 12) + n) >> 10) + (n << 2);
t = ((t >> 6) + t + 3) & 0xFF;
return (t - (t >> 6)) >> 2;
}
FIGURE 10–40. Unsigned remainder modulo 63, Keane’s method.
The “multiply and shift right” method leads to the code shown in Figure 10–41. This is 11 instructions on the basic RISC, one being a multiply. This would not be as fast as Keane’s method, unless the machine has a very fast multiply and the load of the constant 0x04104104 can move out of a loop.
int remu63(unsigned n) {
n = (0x04104104*n + (n >> 4) + (n >> 10)) >> 26;
return n & ((n - 63) >> 6); // Change 63 to 0.
}
FIGURE 10–41. Unsigned remainder modulo 63, multiplication method.
On some machines, an improvement can result from expanding the multiplication into shifts and adds as follows (15 elementary instructions for the whole function):
r = (n << 2) + (n << 8); // r = 0x104*n.
r = r + (r << 12); // r = 0x104104*n.
r = r + (n << 26); // r = 0x04104104*n.
As in the case of the digit summing method, the “multiply and shift right” method can be adapted to compute the remainder resulting from signed division. Again, there seems to be no better way than to add a few steps to correct the result of the method as applied to unsigned division. For example, the code shown in Figure 10–42 is derived from Figure 10–34 on page 270 (12 instructions, including a multiply).
int rems3(int n) {
unsigned r;
r = n;
r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
return r - (((unsigned)n >> 31) << (r & 2));
}
FIGURE 10–42. Signed remainder modulo 3, multiplication method.
Some plausible ways to compute the remainder of signed division by 5, 7, 9, and 10
are shown in Figures 10–43 to 10–46. The code for a divisor of 7 uses quite a few extra instructions (19 in all, including
a multiply); it might be preferable to use a table similar to that shown for the cases in which
the divisor is 5, 9, or 10. In the latter cases, the table used for unsigned division
is doubled in size, with the sign bit of the divisor factored in to index the table.
Entries shown as u
are unused.
int rems5(int n) {
unsigned r;
static signed char table[16] = {0, 1, 2, 2, 3, u, 4, 0,
u, 0,-4, u,-3,-2,-2,-1};
r = n;
r = ((0x33333333*r) + (r >> 3)) >> 29;
return table[r + (((unsigned)n >> 31) << 3)];
}
FIGURE 10–43. Signed remainder modulo 5, multiplication method.
int rems7(int n) {
unsigned r;
r = n - (((unsigned)n >> 31) << 2); // Fix for sign.
r = ((0x24924924*r) + (r >> 1) + (r >> 4)) >> 29;
r = r & ((int)(r - 7) >> 31); // Change 7 to 0.
return r - (((int)(n&-r) >> 31) & 7);// Fix n<0 case.
}
FIGURE 10–44. Signed remainder modulo 7, multiplication method.
int rems9(int n) {
unsigned r;
static signed char table[32] = {0, 1, 1, 2, u, 3, u, 4,
5, 5, 6, 6, 7, u, 8, u,
-4, u,-3, u,-2,-1,-1, 0,
u,-8, u,-7,-6,-6,-5,-5};
r = n;
r = (0x1C71C71C*r + (r >> 1)) >> 28;
return table[r + (((unsigned)n >> 31) << 4)];
}
FIGURE 10–45. Signed remainder modulo 9, multiplication method.
int rems10(int n) {
unsigned r;
static signed char table[32] = {0, 1, u, 2, 3, u, 4, 5,
5, 6, u, 7, 8, u, 9, u,
-6,-5, u,-4,-3,-3,-2, u,
-1, 0, u,-9, u,-8,-7, u};
r = n;
r = (0x19999999*r + (r >> 1) + (r >> 3)) >> 28;
return table[r + (((unsigned)n >> 31) << 4)];
}
FIGURE 10–46. Signed remainder modulo 10, multiplication method.
Since the remainder can be computed without computing the quotient, the possibility arises of computing the quotient q = ⌊n/d⌋ by first computing the remainder, subtracting this from the dividend n, and then dividing the difference by the divisor d. This last division is an exact division, and it can be done by multiplying by the multiplicative inverse of d (see Section 10–16, “Exact Division by Constants,” on page 240). This method would be particularly attractive if both the quotient and remainder are wanted.
Let us try this for the case of unsigned division by 3. Computing the remainder by the multiplication method (Figure 10–34 on page 270) leads to the function shown in Figure 10–47.
unsigned divu3(unsigned n) {
unsigned r;
r = (0x55555555*n + (n >> 1) - (n >> 3)) >> 30;
return (n - r)*0xAAAAAAAB;
}
FIGURE 10–47. Unsigned remainder and quotient with divisor = 3, using exact division.
This is 11 instructions, including two multiplications by large numbers. (The constant
0x55555555
can be generated by shifting the constant 0xAAAAAAAB
right one position.) In contrast, the more straightforward method of computing the
quotient q
using (for example) the code of Figure 10–8 on page 254, requires 14 instructions, including two multiplications by small numbers, or 17
elementary operations if the multiplications are expanded into shift’s and add’s. If the remainder is also wanted, and it is computed from r = n - q*3,
the more straightforward method requires 16 instructions, including three multiplications
by small numbers, or 20 elementary instructions if the multiplications are expanded
into shift’s and add’s.
The code of Figure 10–47 is not attractive if the multiplications are expanded into shift’s and add’s; the result is 24 elementary instructions. Thus, the exact division method might be a good one on a machine that does not have multiply high but does have a fast modulo 232 multiply and slow divide, particularly if it can easily deal with the large constants.
For signed division by 3, the exact division method might be coded as shown in Figure 10–48. It is 15 instructions, including two multiplications by large constants.
int divs3(int n) {
unsigned r;
r = n;
r = (0x55555555*r + (r >> 1) - (r >> 3)) >> 30;
r = r - (((unsigned)n >> 31) << (r & 2));
return (n - r)*0xAAAAAAAB;
}
FIGURE 10–48. Signed remainder and quotient with divisor = 3, using exact division.
As a final example, Figure 10–49 shows code for computing the quotient and remainder for unsigned division by 10. It is 12 instructions, including two multiplications by large constants, plus an indexed load instruction.
unsigned divu10(unsigned n) {
unsigned r;
static char table[16] = {0, 1, 2, 2, 3, 3, 4, 5,
5, 6, 7, 7, 8, 8, 9, 0};
r = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
r = table[r];
return ((n - r) >> 1)*0xCCCCCCCD;
}
FIGURE 10–49. Signed remainder and quotient with divisor = 10, using exact division.
Many machines have a 32×32 ⇒ 64 multiply instruction, so one would expect that to divide by a constant such as 3, the code shown on page 228 would be fastest. If that multiply instruction is not present, but the machine has a fast 32×32 ⇒ 32 multiply instruction, then the exact division method might be a good one if the machine has a slow divide and a fast multiply. To test this conjecture, an assembly language program was constructed to compare four methods of dividing by 3. The results are shown in Table 10–4. The machine used was a 667 MHz Pentium III (ca. 2000), and one would expect similar results on many other machines.
TABLE 10–4. UNSIGNED DIVIDE BY 3 ON A PENTIUM III
The first row gives the time in cycles for just two instructions: an xorl
to clear the left half of the 64-bit source register, and the divl
instruction, which evidently takes 40 cycles. The second row also gives the time
for just two instructions: multiply and shift right 1 (mull
and shrl
). The third row gives the time for a sequence of 21 elementary instructions. It is
the code of Figure 10–8 on page 254 using alternative 2, and with the multiplication by 3 done with a single instruction
(leal
). Several move instructions are necessary because the machine is (basically) two-address. The last
row gives the time for a sequence of 10 instructions: two multiplications (imull
) and the rest elementary. The two imull
instructions use 4-byte immediate fields for the large constants. (The signed multiply instruction imull
is used rather than its unsigned counterpart mull
, because they give the same result in the low-order 32 bits, and imull
has more addressing modes available.)
The exact division method would be even more favorable compared to the second and
third methods if both the quotient and remainder were wanted, because they would require
additional code for the computation r ← n – q*3. (The divl
instruction produces the remainder as well as the quotient.)
There is a simple circuit for dividing by 3 that is about as complex as an adder. It can be constructed very similarly to the elementary way one constructs an n-bit adder from n 1-bit “full adder” circuits. However, in the divider signals flow from most significant to least significant bit.
Consider dividing by 3 the way it is taught in grade school, but in binary. To produce each bit of the quotient, you divide 3 into the next bit, but the bit is preceded by a remainder of 0, 1, or 2 from the previous stage. The logic is shown in Table 10–5. Here the remainder is represented by two bits ri and si, with ri being the most significant bit. The remainder is never 3, so the last two rows of the table represent “don’t care” cases.
A circuit for 32-bit division by 3 is shown in Figure 10–50. The quotient is the word consisting of bits y31 through y0, and the remainder is 2r0 + s0.
Another way to implement the divide-by-3 operation in hardware is to use the multiplier to multiply the dividend by the reciprocal of 3 (binary 0.010101...), with appropriate rounding and scaling. This is the technique shown on pages 207 and 228.
TABLE 10–5. LOGIC FOR DIVIDING BY 3
FIGURE 10–50. Logic circuit for dividing by 3.
1. Show that for unsigned division by an even number, the shrxi
instruction (or equivalent code) can be avoided by first (a) turning off the low-order
bit of the dividend (and operation) [CavWer] or (b) dividing the dividend by 2 (shift right 1 instruction) and then dividing by half the divisor.
2. Code a function in Python similar to that of Figure 10–4 on page 240, but for computing the magic number for signed division. Consider only positive divisors.
3. Show how you would use Newton’s method to calculate the multiplicative inverse of an integer d modulo 81. Show the calculations for d = 146.
I think that I shall never envision
An op unlovely as division.
An op whose answer must be guessed
And then, through multiply, assessed;
An op for which we dearly pay,
In cycles wasted every day.
Division code is often hairy;
Long division’s downright scary.
The proofs can overtax your brain,
The ceiling and floor may drive you insane.
Good code to divide takes a Knuthian hero,
But even God can’t divide by zero!