CRC works on polynomial division. In this case, with coefficients in "GF2". GF2 is a "field" with 2 elements. 0,1. It also has two operators, "+" is xor, and "*" is and. These operators fill the normal field requirements (see wikipedia). As a result, it is common to describe them by writing "+" instead of "xor", and "*" instead of "and".
The CRC polynomial defines an expression, eg: x**8 = x**4 + x**3 + x**2 + 1. However, x**9 = x*x**8 = x**5 + x**4 + x**3 + x. Indeed, with this rule you can reduce x**N as a polynomial of at most degree 7. The HW implementation will store all 8 1b coefficients from x**7 downto x**0. (ignore the value for x. it isn't important. The expression will never be evaluated -- the coefficients are all that matter)
The CRC circuit works by shifting data in serially (for the first concept). a 1 can be shifted into the x**0 position. on the next input, it will be shifted to the x**1 position and so on. Once it gets to the x**8 position, the rule for x**8 = x**4 + x**3 + x**2 + 1. Thus a 1 gets added (xor'd) into the x**4 place, (and the x**3, x**2, and x**0 place).
Shifting by 2 bits per cycle is easy as well -- one value will end up in the x**8 place and will be reduced with the above rule. There will also be a value in the x**9 place, which also will be reduced back down. the output is thus the shifted previous state plus the sum of the reduced x**8 term and the reduced x**9 term. (again, for GF2 "sum" means "repeated xor")
And because the effect of x**N is known for all N, a CRC can be generated of any size.
In this case, the value calculated included the additional N shifts as normal in CRC calculations. eg, an input of 0x0001 will result in a CRC of 0x1D, not 0x01.