It looks like a fairly simple algorithm, though it isn't well suited to high performance.
You would have the state RAM, a key shift register, a few adders, a state machine, an output FIFO, and optionally an input data FIFO. A simple design could produce one byte roughly every 6 cycles. One byte per 3 cycles is probably possible without too much extra effort.
The state machine would start with a state that initializes the state RAM to the 0-255 counting pattern. A "get key" state may be needed next. The following states would run the key scheduling algorithm, which will probably complete an iteration at a reasonable rate, if a bit slower. The loop would probably be read_state, triple_add, write_state_i_read_state_j, write_state_j. Additional states can be added for the init and loop termination if desired for easier debugging. Finally, the remaining states would be the PRNG loop. The logic is similar to the above: read_state_i, accumulate_j, read_j_and_write_i, add_i_j_write_i, read_k. There might be an idle state that checks whether the output FIFO is full.
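As a sanity check for the hardware, a small software model of the two loops described above can help. This is only a sketch in Python: the function names are made up for illustration, the comments map each step to the state names above as I read them, and the key shift register is modeled with a rotating deque.

```python
from collections import deque


def rc4_key_schedule(key: bytes) -> list:
    """Model of the init state plus the key scheduling loop."""
    state = list(range(256))      # init: 0-255 counting pattern
    key_sr = deque(key)           # stand-in for the key shift register
    j = 0
    for i in range(256):
        s_i = state[i]                        # read_state
        j = (j + s_i + key_sr[0]) & 0xFF      # triple_add (8-bit wrap)
        key_sr.rotate(-1)                     # shift the key by one byte
        state[i] = state[j]                   # write_state_i_read_state_j
        state[j] = s_i                        # write_state_j
    return state


def rc4_keystream(state: list, count: int) -> bytes:
    """Model of the PRNG loop; returns `count` keystream bytes."""
    s = list(state)               # work on a copy of the scheduled state
    i = j = 0
    out = bytearray()
    for _ in range(count):
        i = (i + 1) & 0xFF
        s_i = s[i]                            # read_state_i
        j = (j + s_i) & 0xFF                  # accumulate_j
        s_j = s[j]
        s[i] = s_j                            # read_j_and_write_i
        s[j] = s_i                            # second half of the swap
        out.append(s[(s_i + s_j) & 0xFF])     # add i/j values, read_k
    return bytes(out)


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream (same call encrypts and decrypts)."""
    stream = rc4_keystream(rc4_key_schedule(key), len(data))
    return bytes(d ^ k for d, k in zip(data, stream))
```

Comparing the hardware output FIFO contents against rc4_keystream for a few keys is a quick way to catch an off-by-one in the state sequencing.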
The input FIFO can be used to buffer data while waiting on the PRNG, which only produces a byte once every N cycles.
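To get a rough feel for how deep the input FIFO needs to be, a toy cycle-by-cycle model is enough. The sketch below assumes a burst of input bytes arriving one per cycle and the PRNG consuming one byte every N cycles; both the traffic pattern and the function name are assumptions for illustration, not part of any particular design.

```python
def peak_fifo_depth(burst_len: int, cycles_per_byte: int) -> int:
    """Peak input FIFO occupancy for a burst arriving one byte per cycle."""
    depth = 0
    peak = 0
    # Run long enough for the FIFO to drain completely after the burst.
    for cycle in range(burst_len * cycles_per_byte + 1):
        if cycle < burst_len:
            depth += 1            # one input byte arrives during the burst
        peak = max(peak, depth)
        # The PRNG delivers a keystream byte on the last cycle of each
        # N-cycle period, which lets one input byte leave the FIFO.
        if depth > 0 and cycle % cycles_per_byte == cycles_per_byte - 1:
            depth -= 1
    return peak
```

With a 6-cycle PRNG the FIFO has to hold almost the whole burst, so the faster 3-cycle variant mostly buys a smaller FIFO for a given burst size.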
There is also a version at https://opencores.org/websvn,listing?repname=rc4-prbs&path=/rc4-prbs/trunk/#path_rc4-prbs_trunk_ that uses a large register-based RAM and a lot of big muxes. It uses around 2k slices according to the author. It might not be synthesizable with all tools due to the mod operator (which is not needed, as the count wraps predictably).
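The point about the mod operator can be checked in software: for 8-bit values, keeping only the low 8 bits of a sum gives the same result as mod 256, which is what an 8-bit register does on overflow anyway. A quick sketch (the helper names are hypothetical):

```python
def add_with_mod(a: int, b: int, c: int = 0) -> int:
    """Sum reduced with an explicit mod operator."""
    return (a + b + c) % 256


def add_with_wrap(a: int, b: int, c: int = 0) -> int:
    """Sum truncated to 8 bits, as an 8-bit register would store it."""
    return (a + b + c) & 0xFF


# Exhaustive check over all pairs of 8-bit operands.
results_match = all(
    add_with_mod(a, b) == add_with_wrap(a, b)
    for a in range(256)
    for b in range(256)
)
```

So sizing the counters and the j accumulator to exactly 8 bits should make the explicit mod unnecessary.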