A priority encoder is an applicable circuit that can work for your application. You can scan blocks of 32 pixels at a time to get the location of the next 1. You can use the bit trick "x <= x & ~(-x & x)" to unset the lsb.
There are different ways to generate a priority encoder. One version sets the encoded lsb to |((-x & x) & 0xAAAAAAAA), the next msb to |((-x & x) & 0xCCCCCCCC), then next to |((-x & x) & 0xF0F0F0F0), next |((-x & x) & 0xFF00FF00), finally |((-x & x) & 0xFFFF0000). This should run at decently high rates on modern fpgas, and could be pipelined if needed. The final circuit used is just a comparison to 0 to determine if the next 32b should be processed.
Using this logic, it is possible to skip 32 (or more) 0's per cycle and it takes at least one cycle per 1 encountered.
You can also have similar logic to skip blocks of 32 even faster. eg, check 8 32b blocks to see if any are non-zero. the result goes to the same encoder structure as above. This can also be pipelined if needed. This allows skipping of multiple 32b blocks per cycle. However, this optimization might not be justified.
( (-x & x) generates a vector where only the lsb is set. eg x = 0x12345678. ~x = 0xEDCBA987. -x = 0xEDCBA988. (-x & x) = 0x00000008. only the lsb remains.)
As mentioned, if you can process pixels as they come in, that would be easier.
(edit, I also shouldn't have to mention that the 256*256 bit image should be in a BRAM and not a large data bus. This works well as you can have 32b outputs from a BRAM. This size requires ~8 BRAM on modern devices. At the same time, you did try to synthesize the clearly troll design so I'll mention this just in case.)