because the division won't yield the leading groups of all-zero bits: it works from right to left (least significant end first) and stops once the value reaches zero
it could be the msb or the 11th msb, but when the first 1, 2 or 3 11-bit groups of the number are all zeroes the division doesn't see them: you hit zero early and end up with 23, 22 or 21 11-bit values instead of 24
just try modifying the implementation to skip the check and assume the 33rd byte is all zero, then see what happens when the first X 11-bit fields are also zero
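a rough sketch of that experiment in Go, assuming a 33-byte input (264 bits, so 24 groups of 11 bits) and using math/big for the division. the function names are made up for illustration and this is not taken from any particular implementation. groupsUntilZero divides until the value is zero, groupsFixed always does 24 steps:

```go
package main

import (
	"fmt"
	"math/big"
)

// groupCount is 264 bits / 11 bits per group (33 bytes of input).
const groupCount = 24

// groupsUntilZero (hypothetical name) divides by 2048 until the value
// reaches zero, so leading all-zero groups are never emitted.
func groupsUntilZero(data []byte) []uint16 {
	n := new(big.Int).SetBytes(data)
	radix := big.NewInt(2048)
	rem := new(big.Int)
	var out []uint16
	for n.Sign() > 0 {
		n.DivMod(n, radix, rem)
		// Remainders come out least significant first, so prepend.
		out = append([]uint16{uint16(rem.Uint64())}, out...)
	}
	return out
}

// groupsFixed (hypothetical name) always performs 24 divisions, so
// leading zero groups show up as explicit zero entries.
func groupsFixed(data []byte) []uint16 {
	n := new(big.Int).SetBytes(data)
	radix := big.NewInt(2048)
	rem := new(big.Int)
	out := make([]uint16, groupCount)
	for i := groupCount - 1; i >= 0; i-- {
		n.DivMod(n, radix, rem)
		out[i] = uint16(rem.Uint64())
	}
	return out
}

func main() {
	// The first 22 bits are zero, so the first two 11-bit groups are zero.
	data := make([]byte, 33)
	data[2] = 0x03
	fmt.Println(len(groupsUntilZero(data)), len(groupsFixed(data))) // prints: 22 24
}
```

with the first two groups zero the loop-until-zero version returns only 22 values, which is the case the check has to handle.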
the way you are proposing forces you to copy 32-byte segments to another position on each operation, whereas the division lets you do it as right shifts across 5 64-bit words, copying the overflow into the 11-bit field array
that's 24 right shifts on 5 64-bit words versus 24 copy-and-shift operations, plus you need a lookup table to decide the shift at each point, because 11-bit and 8-bit boundaries only coincide every 88 bits
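a minimal sketch of the shift version in Go, assuming the 33 bytes are loaded big-endian and right-aligned into 5 uint64 words. the names shiftOut11 and split are made up for illustration:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shiftOut11 (hypothetical name) shifts a 320-bit value right by 11 bits
// in place, with words[0] as the most significant word, and returns the
// 11 bits that fell off the low end.
func shiftOut11(words *[5]uint64) uint16 {
	var carry uint64
	for i := range words {
		next := words[i] & 0x7FF             // low 11 bits move into the next word
		words[i] = words[i]>>11 | carry<<53 // carry lands in the top 11 bits
		carry = next
	}
	return uint16(carry)
}

// split (hypothetical name) turns 33 bytes into 24 groups of 11 bits,
// most significant group first, using 24 shifts.
func split(data [33]byte) [24]uint16 {
	var w [5]uint64
	w[0] = uint64(data[0]) // top byte sits alone in the top word
	for i := 1; i < 5; i++ {
		w[i] = binary.BigEndian.Uint64(data[1+8*(i-1):])
	}
	var out [24]uint16
	for i := 23; i >= 0; i-- {
		out[i] = shiftOut11(&w)
	}
	return out
}

func main() {
	var data [33]byte
	data[2] = 0x03
	data[31], data[32] = 0x07, 0xFF
	g := split(data)
	fmt.Println(g[0], g[1], g[2], g[23]) // prints: 0 0 1536 2047
}
```

there is no lookup table here because every step is the same 11-bit shift, and leading zero groups come out as plain zeroes.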
11 and 8 is a nasty combination, and on a 64-bit processor it gets optimized anyway if you just use division and multiplication by powers of 2
i used to think the same as you, but go look it up: integer division and multiplication by a power of two is optimized into bitshifts, and bitshifts are funneled down in a second decision step to use the multiply/divide circuit
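for unsigned integers the two forms give identical results, which is what allows a compiler to swap one for the other. a tiny Go check (function names are made up for illustration):

```go
package main

import "fmt"

// divMod2048 uses plain division and remainder by a power of two.
func divMod2048(x uint64) (q, r uint64) {
	return x / 2048, x % 2048
}

// shiftMask11 is the equivalent shift and mask for unsigned values.
func shiftMask11(x uint64) (q, r uint64) {
	return x >> 11, x & 0x7FF
}

func main() {
	q1, r1 := divMod2048(5000)
	q2, r2 := shiftMask11(5000)
	fmt.Println(q1, r1, q2, r2) // prints: 2 904 2 904
}
```

this only holds as written for unsigned types. signed division rounds toward zero, so negative values need an extra adjustment.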
oh yeah, this is probably a big part of the reason why btcec is so much slower than libsecp256k1 too: the secp256k1 library does it via 64-bit word multiplication and division operations, catching the overflow (the high half of the result) in a second register
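to show what catching the overflow of a 64-bit multiply looks like, here is a small Go sketch using math/bits. it is only an illustration of the idea, not code from libsecp256k1 or btcec, and mulAddWord is a made-up name:

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulAddWord (hypothetical name) computes a*b + c as a 128-bit result.
// hi holds the part that overflows a single 64-bit word.
func mulAddWord(a, b, c uint64) (hi, lo uint64) {
	hi, lo = bits.Mul64(a, b) // full 64x64 -> 128-bit product
	lo, carry := bits.Add64(lo, c, 0)
	hi += carry // cannot overflow: a*b + c always fits in 128 bits
	return hi, lo
}

func main() {
	hi, lo := mulAddWord(1<<63, 4, 7)
	fmt.Println(hi, lo) // prints: 2 7
}
```

chaining this across the words of a big number is the usual way to do multi-word multiplication on a 64-bit machine.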