Of course, we can easily accomplish the converse of merging two bit streams; that is, we can extract and distribute bits in a bit string among multiple destinations. The following code takes the 32-bit value in EAX and distributes alternate bits among the BX and DX registers:
mov( 16, cl ); // Count the loop iterations. ExtractLp: shr( 1, eax ); // Extract even bits to (e)bx. rcr( 1, ebx ); shr( 1, eax ); // Extract odd bits to (e)dx. rcr( 1, edx ); dec( cl ); // Repeat 16 times. jnz ExtractLp; shr( 16, ebx ); // Need to move the results from the H.O. shr( 16, edx ); // bytes of ebx/edx to the L.O. bytes.
This sequence executes 99 instructions. This isn't terrible, but we can probably do a little better by using an algorithm that extracts bits in parallel. Employing the technique we used to reverse bits in a register, we can come up with the following algorithm that relocates all the even bits to the L.O. word of EAX and all the odd bits to the H.O. word of EAX.
// Swap bits at positions (1,2), (5,6), (9,10), (13,14), (17,18), // (21,22), (25,26), and (29, 30). mov( eax, edx ); and( $9999_9999, eax ); // Mask out the bits we'll keep for now. mov( edx, ecx ); shr( 1, edx ); // Move 1st bits in tuple above to the and( $2222_2222, ecx ); // correct position and mask out the and( $2222_2222, edx ); // unneeded bits. shl( 1, ecx ); // Move 2nd bits in tuples above. or( edx, ecx ); // Merge all the bits back together. or( ecx, eax ); // Swap bit pairs at positions ((2,3), (4,5)), // ((10,11), (12,13)), etc. mov( eax, edx ); and( $c3c3_c3c3, eax ); // The bits we'll leave alone. mov( edx, ecx ); shr( 2, edx ); and( $0c0c_0c0c, ecx ); and( $0c0c_0c0c, edx ); shl( 2, ecx ); or( edx, ecx ); or( ecx, eax ); // Swap nibbles at nibble positions (1,2), (5,6), (9,10), etc. mov( eax, edx ); and( $f00f_f00f, eax ); mov( edx, ecx ); shr(4, edx ); and( $0f0f_0f0f, ecx ); and( $0f0f_0f0f, ecx ); shl( 4, ecx ); or( edx, ecx ); or( ecx, eax ); // Swap bits at positions 1 and 2. ror( 8, eax ); xchg( al, ah ); rol( 8, eax );
This sequence requires 30 instructions. At first blush it looks like a winner because the original loop executes 64 instructions. However, this code isn't quite as good as it looks. After all, if we're willing to write this much code, why not unroll the loop above 16 times? That sequence requires only 64 instructions. So the complexity of the previous algorithm may not gain much on instruction count. As to which sequence is faster, well, you'll have to time them to figure this out. However, the shrd
instructions are not particularly fast on all processors and neither are the instructions in the other sequence. This example appears here not to show you a better algorithm but rather to demonstrate that writing really tricky code doesn't always provide a big performance boost.
Extracting other bit combinations is left as an exercise for the reader.