The 80x86 microprocessors execute sequences of instructions at blinding speed. Therefore, you'll rarely encounter a slow program that doesn't contain any loops. Because loops are the primary source of performance problems within a program, they are the place to look when attempting to speed up your software. While a treatise on how to write efficient programs is beyond the scope of this chapter, there are some things you should be aware of when designing loops in your programs. They're all aimed at removing unnecessary instructions from your loops in order to reduce the time it takes to execute a single iteration of the loop.
Consider the following flow graphs for the three types of loops presented earlier:
repeat..until loop:

     Initialization code
     Loop body
     Test for termination (branch back to the loop body if not done)
     Code following the loop

while loop:

     Initialization code
     Loop termination test
     Loop body
     Jump back to the test
     Code following the loop

forever..endfor loop:

     Initialization code
     Loop body, part one
     Loop termination test
     Loop body, part two
     Jump back to loop body, part one
     Code following the loop
As you can see, the repeat..until loop is the simplest of the bunch. This is reflected in the assembly language implementation of these loops. Consider the following repeat..until and while loops that are semantically identical:
// Example involving a WHILE loop:

     mov( edi, esi );
     sub( 20, esi );
     while( esi <= edi ) do

          << stmts >>
          inc( esi );

     endwhile;

// Conversion of the code above into pure assembly language:

     mov( edi, esi );
     sub( 20, esi );
whlLbl:
     cmp( esi, edi );
     jnle EndOfWhile;

     << stmts >>
     inc( esi );
     jmp whlLbl;

EndOfWhile:

// Example involving a REPEAT..UNTIL loop:

     mov( edi, esi );
     sub( 20, esi );
     repeat

          << stmts >>
          inc( esi );

     until( esi > edi );

// Conversion of the REPEAT..UNTIL loop into pure assembly:

     mov( edi, esi );
     sub( 20, esi );
rptLabel:
     << stmts >>
     inc( esi );
     cmp( esi, edi );
     jng rptLabel;
As you can see by carefully studying the conversion to pure assembly language, testing for the termination condition at the end of the loop allowed us to remove a jmp instruction from the loop. This can be significant if this loop is nested inside other loops. In the preceding example there wasn't a problem with executing the body at least once. Given the initialization of ESI to EDI-20, you can easily see that the body executes exactly 21 times (once for each value of ESI from EDI-20 through EDI). This suggests that the conversion to a repeat..until loop is trivial and always possible. Unfortunately, it's not always quite this easy. Consider the following HLA code:
while( esi <= edi ) do

     << stmts >>
     inc( esi );

endwhile;
In this particular example, we haven't the slightest idea what ESI contains upon entry into the loop. Therefore, we cannot assume that the loop body will execute at least once, so we must test for loop termination before executing the body of the loop. Even so, the test can still be placed at the end of the loop with the inclusion of a single jmp instruction:
     jmp WhlTest;
TopOfLoop:
     << stmts >>
     inc( esi );
WhlTest:
     cmp( esi, edi );
     jle TopOfLoop;
Although the code is as long as the original while loop conversion, the jmp instruction executes only once rather than on each repetition of the loop. Note that this slight gain in efficiency is obtained via a slight loss in readability. The second code sequence above is closer to spaghetti code than the original implementation. Such is often the price of a small performance gain. Therefore, you should carefully analyze your code to ensure that the performance boost is worth the loss of clarity. More often than not, assembly language programmers sacrifice clarity for dubious gains in performance, producing impossible-to-understand programs.
Note, by the way, that HLA translates its high-level while statement into a sequence of instructions that test the loop termination condition at the bottom of the loop, using exactly the technique this section describes.
Because of the nature of the flags on the 80x86, loops that repeat from some number down to (or up to) 0 are more efficient than loops that execute from 0 to some other value. Compare the following HLA for loop and the code it generates:
for( mov( 1, j ); j <= 8; inc( j ) ) do

     << stmts >>

endfor;

// Conversion to pure assembly (as well as using a REPEAT..UNTIL form):

     mov( 1, j );
ForLp:
     << stmts >>
     inc( j );
     cmp( j, 8 );
     jng ForLp;
Now consider another loop that also has eight iterations but runs its loop-control variable from 8 down to 1 rather than 1 up to 8:
     mov( 8, j );
LoopLbl:
     << stmts >>
     dec( j );
     jnz LoopLbl;
Note that by running the loop from 8 down to 1 we saved a comparison on each repetition of the loop.
Unfortunately, you cannot force all loops to run backward. However, with a little effort and some coercion you should be able to write many for loops so that they operate backward. Saving the execution time of the cmp instruction on each iteration of the loop may result in faster code.
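For instance, a loop that sums the elements of an array doesn't care in what order it visits them, making it a natural candidate for running backward. Here's a minimal sketch, assuming a dword array A and a dword accumulator sum (hypothetical names, not from the examples above):

// Sum the eight dword elements of A, running EBX from 8 down to 1.

     mov( 0, sum );
     mov( 8, ebx );
SumLp:
     mov( A[ ebx*4 - 4 ], eax );    // Fetch element EBX-1.
     add( eax, sum );
     dec( ebx );
     jnz SumLp;

Because addition doesn't care about the order of its operands, visiting the elements in reverse produces the same total, and the dec/jnz pair replaces the inc/cmp/branch sequence that the forward version would require.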
The preceding examples worked out well because the loops ran from 8 down to 1, terminating when the loop-control variable became 0. What happens if you need to execute the loop when the loop-control variable goes to 0? For example, suppose that the loop above needed to range from 7 down to 0. As long as the upper bound is positive, you can substitute the jns instruction in place of the jnz instruction in the earlier code:
     mov( 7, j );
LoopLbl:
     << stmts >>
     dec( j );
     jns LoopLbl;
This loop will repeat eight times, with j taking on the values 7..0. When it decrements 0 to −1, it sets the sign flag and the loop terminates.
Keep in mind that some values may look positive but are actually negative. If the loop-control variable is a byte, then values in the range 128..255 are negative in the two's complement system. Therefore, initializing the loop-control variable with any 8-bit value in the range 129..255 (or, of course, 0) terminates the loop after a single execution. This can get you into trouble if you're not careful.
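As a hypothetical illustration (the byte variable b is invented for this sketch), consider a counter initialized to 200, a value that looks positive but is $C8, with its HO bit set:

static
     b : byte := 200;    // Looks positive, but $C8 is -56 as a signed byte.

     ...

LoopLbl:
     << stmts >>
     dec( b );           // 200 -> 199 ($C7); bit 7 is still set,
     jns LoopLbl;        // so SF = 1 and the loop exits after a
                         // single iteration.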
A loop-invariant computation is a calculation that appears within a loop and always yields the same result. You needn't perform such computations inside the loop; you can compute them outside the loop and reference the resulting value inside it. The following HLA code demonstrates an invariant computation:
for( mov( 0, eax ); eax < n; inc( eax ) ) do

     mov( eax, edx );
     add( j, edx );
     sub( 2, edx );
     add( edx, k );

endfor;
Because j never changes throughout the execution of this loop, the subexpression j-2 can be computed outside the loop:
mov( j, ecx );
sub( 2, ecx );
for( mov( 0, eax ); eax < n; inc( eax ) ) do

     mov( eax, edx );
     add( ecx, edx );
     add( edx, k );

endfor;
Although we've eliminated a single instruction by computing the subexpression j-2 outside the loop, there is still an invariant component to this calculation. Note that this invariant component executes n times in the loop; this means that we can translate the previous code to the following:
mov( j, ecx );
sub( 2, ecx );
intmul( n, ecx );    // Compute n*(j-2) and add this into k outside
add( ecx, k );       // the loop.

for( mov( 0, eax ); eax < n; inc( eax ) ) do

     add( eax, k );

endfor;
As you can see, we've shrunk the loop body from four instructions down to one. Of course, if you're really interested in improving the efficiency of this particular loop, you can compute the result without using a loop at all (there is a formula that corresponds to the iterative calculation above). Still, this simple example demonstrates elimination of loop-invariant calculations from a loop.
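In case you're curious, here is a sketch of that loop-free computation. The remaining loop merely adds 0 + 1 + ... + (n-1) into k, and that sum is n*(n-1)/2, so the whole fragment reduces to k := k + n*(j-2) + n*(n-1)/2. The code below assumes, as the example seems to, that n is a non-negative 32-bit value:

// Loop-free equivalent: k := k + n*(j-2) + n*(n-1)/2.

mov( j, ecx );
sub( 2, ecx );
intmul( n, ecx );    // ECX := n*(j-2).
add( ecx, k );

mov( n, eax );
dec( eax );
intmul( n, eax );    // EAX := n*(n-1).
shr( 1, eax );       // EAX := n*(n-1)/2 (the product is always even).
add( eax, k );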
For small loops, that is, those whose body is only a few statements, the overhead required to process a loop may constitute a significant percentage of the total processing time. For example, look at the following Pascal code and its associated 80x86 assembly language code:
for i := 3 downto 0 do A[i] := 0;

// Corresponding 80x86/HLA code:

     mov( 3, i );
LoopLbl:
     mov( i, ebx );
     mov( 0, A[ ebx*4 ] );
     dec( i );
     jns LoopLbl;
Four instructions execute on each repetition of the loop. Only one instruction is doing the desired operation (moving a 0 into an element of A). The remaining three instructions control the loop. Therefore, it takes 16 instructions to do the operation logically required by 4.
While there are many improvements we could make to this loop based on the information presented thus far, consider carefully exactly what it is that this loop is doing: it's storing four 0s into A[0] through A[3]. A more efficient approach is to use four mov instructions to accomplish the same task. For example, if A is an array of double words, then the following code initializes A much faster than the code above:
mov( 0, A[0] );
mov( 0, A[4] );
mov( 0, A[8] );
mov( 0, A[12] );
Although this is a simple example, it shows the benefit of loop unraveling (also known as loop unrolling). If this simple loop appeared buried inside a set of nested loops, the 4:1 instruction reduction could possibly double the performance of that section of your program.
Of course, you cannot unravel all loops. Loops that execute a variable number of times are difficult to unravel because there is rarely a way to determine (at assembly time) the number of loop iterations. Therefore, unraveling a loop is a process best applied to loops that execute a known number of times (and the number of times is known at assembly time).
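That said, if you're determined to unroll a loop whose iteration count is available only at run time, you can process the elements in fixed-size groups and mop up the remainder with an ordinary loop. The following hypothetical sketch (the labels and names are invented) clears ECX double words of an array A this way; the extra bookkeeping it requires is a good part of why the technique is rarely worthwhile:

     mov( 0, ebx );        // Byte offset into A.
     mov( ecx, edx );
     shr( 2, edx );        // EDX := number of four-element groups.
     jz DoLeftovers;       // Fewer than four elements in all?
Unrolled4:
     mov( 0, A[ ebx ] );
     mov( 0, A[ ebx+4 ] );
     mov( 0, A[ ebx+8 ] );
     mov( 0, A[ ebx+12 ] );
     add( 16, ebx );
     dec( edx );
     jnz Unrolled4;
DoLeftovers:
     and( 3, ecx );        // ECX := element count mod 4.
     jz Done;
Leftovers:
     mov( 0, A[ ebx ] );
     add( 4, ebx );
     dec( ecx );
     jnz Leftovers;
Done: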
Even if you repeat a loop some fixed number of iterations, it may not be a good candidate for loop unraveling. Loop unraveling produces impressive performance improvements when the number of instructions controlling the loop (and handling other overhead operations) represents a significant percentage of the total number of instructions in the loop. Had the previous loop contained 36 instructions in the body (exclusive of the 4 overhead instructions), then the performance improvement would be, at best, only about 10 percent (eliminating 4 of 40 instructions per iteration), compared with the 300–400 percent it now enjoys. The costs of unraveling a loop, that is, all the extra code that must be inserted into your program, quickly reach a point of diminishing returns as the body of the loop grows larger or as the number of iterations increases. Furthermore, entering that code into your program can become quite a chore. For these reasons, loop unraveling is a technique best applied to small loops.
Note that the superscalar 80x86 chips (Pentium and later) have branch-prediction hardware and use other techniques to improve performance. Loop unrolling on such systems may actually slow down the code because these processors are optimized to execute short loops.
Consider the following loop:
for i := 0 to 255 do csetVar[i] := {};
Here the program is initializing each element of an array of character sets to the empty set. The straightforward code to achieve this is the following:
     mov( 0, i );
FLp:
     // Compute the index into the array (note that each element
     // of a CSET array contains 16 bytes).

     mov( i, ebx );
     shl( 4, ebx );

     // Set this element to the empty set (all 0 bits).

     mov( 0, csetVar[ ebx ] );
     mov( 0, csetVar[ ebx+4 ] );
     mov( 0, csetVar[ ebx+8 ] );
     mov( 0, csetVar[ ebx+12 ] );

     inc( i );
     cmp( i, 256 );
     jb FLp;
Although unraveling this code will still produce a performance improvement, it will take 1,024 instructions to accomplish this task, too many for all but the most time-critical applications. However, you can reduce the execution time of the body of the loop using induction variables. An induction variable is one whose value depends entirely on the value of some other variable. In the example above, the index into the array csetVar tracks the loop-control variable (it's always equal to the value of the loop-control variable times 16). Because i doesn't appear anywhere else in the loop, there is no sense in performing the computations on i. Why not operate directly on the array index value? The following code demonstrates this technique:
     mov( 0, ebx );
FLp:
     mov( 0, csetVar[ ebx ] );
     mov( 0, csetVar[ ebx+4 ] );
     mov( 0, csetVar[ ebx+8 ] );
     mov( 0, csetVar[ ebx+12 ] );

     add( 16, ebx );
     cmp( ebx, 256*16 );
     jb FLp;
The induction that takes place in this example occurs when the code increments the loop-control variable (moved into EBX for efficiency reasons) by 16 on each iteration of the loop rather than by 1. Scaling the loop-control variable by 16, and scaling the final loop-termination constant to match, allows the code to eliminate the multiplication of the loop-control variable by 16 on each iteration of the loop (that is, it allows us to remove the shl instruction from the previous code). Further, because this code no longer refers to the original loop-control variable (i), the code can maintain the loop-control variable strictly in the EBX register.
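Incidentally, nothing prevents you from combining this induction-variable version with the count-down trick described earlier in this section. Running EBX from the offset of the last element down to 0 lets the sign flag set by the sub instruction replace the cmp entirely; a sketch:

     mov( 256*16 - 16, ebx );    // Offset of the last element.
FLp:
     mov( 0, csetVar[ ebx ] );
     mov( 0, csetVar[ ebx+4 ] );
     mov( 0, csetVar[ ebx+8 ] );
     mov( 0, csetVar[ ebx+12 ] );

     sub( 16, ebx );             // SF is set when EBX drops below 0,
     jns FLp;                    // which happens only after the
                                 // EBX = 0 iteration, so all 256
                                 // elements are cleared.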