For example, suppose I generate the first 32 pixels of each of the first 32 lines instead of the first 1024 pixels of the first line. There is now a better chance that groups of pixels read from the input will actually be used in the output.
For the 90 deg rotation, generating 32 output lines with the basic approach means reading 1024*32 times, and possibly incurring an equal number of row operations. With the block approach, it means reading 1024*4 times, and possibly incurring 1024 row operations. That is the biggest difference between the two.
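To make those counts concrete, here is a rough sketch that simulates both traversal orders. It rests on assumptions that are mine, not measured behavior: a 1024-pixel-wide image, reads that return an aligned burst of 8 pixels, every input line sitting in its own DRAM row, and only the most recent burst being kept. The names `rotate90_cost`, `AccessCost`, `IMAGE_WIDTH` and `BURST_PIXELS` are made up for the illustration.

```c
#include <assert.h>

/* Assumed geometry: 1024-pixel lines, 8 pixels per burst read. */
enum { IMAGE_WIDTH = 1024, BURST_PIXELS = 8 };

typedef struct {
    long reads;    /* burst reads issued */
    long row_ops;  /* times the accessed input line (assumed DRAM row) changed */
} AccessCost;

/*
 * Count reads and row changes needed to produce the first `out_lines`
 * output lines of a 90 deg rotation, visiting the output in tiles of
 * block_w x block_h pixels. Inside a tile the walk follows the input
 * line, so consecutive accesses stay in the same row when possible.
 * A tile of IMAGE_WIDTH x 1 is the basic line-by-line approach.
 */
AccessCost rotate90_cost(int block_w, int block_h, int out_lines)
{
    AccessCost cost = {0, 0};
    int last_row = -1;    /* input line of the previous access */
    int last_burst = -1;  /* burst index of the previous access */

    assert(block_w > 0 && block_h > 0);

    for (int by = 0; by < out_lines; by += block_h) {
        for (int bx = 0; bx < IMAGE_WIDTH; bx += block_w) {
            for (int ox = bx; ox < bx + block_w && ox < IMAGE_WIDTH; ox++) {
                for (int oy = by; oy < by + block_h && oy < out_lines; oy++) {
                    /* 90 deg rotation: output column selects the input
                     * line, output line selects the input column. */
                    int in_row = IMAGE_WIDTH - 1 - ox;
                    int burst = oy / BURST_PIXELS;

                    if (in_row != last_row) {
                        cost.row_ops++;   /* new row: also needs a new read */
                        cost.reads++;
                    } else if (burst != last_burst) {
                        cost.reads++;     /* same row, next burst */
                    }
                    last_row = in_row;
                    last_burst = burst;
                }
            }
        }
    }
    return cost;
}
```

Under these assumptions, `rotate90_cost(IMAGE_WIDTH, 1, 32)` gives 32768 reads and 32768 row changes (the 1024*32 case), while `rotate90_cost(32, 32, 32)` gives 4096 reads and 1024 row changes (the 1024*4 and 1024 figures).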
The size of the block can be chosen based on how efficient the 45 deg rotation should be. The average ratio of used pixels to read pixels increases as the block size increases.
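One way to get a feel for that trend is to count it for a single block. The sketch below is only an estimate under my own assumptions: nearest-neighbour sampling, aligned 8-pixel bursts, one square output block anchored at the origin, and a fixed-point value for cos(45 deg). The function name `used_to_read_ratio_45deg` and the constants are hypothetical.

```c
#include <string.h>

enum {
    MAX_BLOCK = 128,           /* largest block size this sketch supports */
    GRID_DIM = 2 * MAX_BLOCK,  /* big enough for the rotated footprint */
    BURST_LEN = 8,             /* assumed pixels per burst read */
    COS45_Q16 = 46341          /* cos(45 deg) = sin(45 deg) in 16.16 fixed point */
};

/*
 * For one block_size x block_size output block of a 45 deg rotation,
 * return (distinct input pixels used) / (pixels fetched by bursts).
 * Returns -1.0 if block_size is out of range.
 */
double used_to_read_ratio_45deg(int block_size)
{
    static unsigned char pixel_used[GRID_DIM][GRID_DIM];
    static unsigned char burst_read[GRID_DIM][GRID_DIM / BURST_LEN];
    long used = 0, bursts = 0;

    if (block_size < 1 || block_size > MAX_BLOCK)
        return -1.0;

    memset(pixel_used, 0, sizeof pixel_used);
    memset(burst_read, 0, sizeof burst_read);

    for (int oy = 0; oy < block_size; oy++) {
        for (int ox = 0; ox < block_size; ox++) {
            /* Rotate (ox, oy) by 45 deg and round to the nearest pixel.
             * The x coordinate is shifted by MAX_BLOCK so it stays
             * non-negative before the right shift. */
            long ix = ((long)COS45_Q16 * (ox - oy)
                       + ((long)MAX_BLOCK << 16) + 32768) >> 16;
            long iy = ((long)COS45_Q16 * (ox + oy) + 32768) >> 16;

            if (!pixel_used[iy][ix]) {
                pixel_used[iy][ix] = 1;
                used++;
            }
            if (!burst_read[iy][ix / BURST_LEN]) {
                burst_read[iy][ix / BURST_LEN] = 1;
                bursts++;
            }
        }
    }
    return (double)used / (double)(bursts * BURST_LEN);
}
```

Calling it for block sizes such as 8, 32 and 128 should show the ratio climbing with the block size, because the partly used bursts sit along the edge of the rotated block and the edge grows more slowly than the area.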
That said, the rate is fairly low compared to DDR3. The row behavior might still be significant, though interleaving banks could help in that case.