I suspect the tools are being misunderstood somewhere. Morell's changes also didn't follow all of my advice.
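This is an expanding design: the number of valid output cycles is greater than the number of valid input cycles. As a result, the difference between cycle delay and sample delay matters. Too much of the logic is conditioned only on (valid_in = '1'), so it cannot advance on the cycles where an output is due but no new input arrives.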
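To make the cycle-versus-sample point concrete, here is a rough cycle-by-cycle model in Python. It is only a sketch under assumptions that are not from the thread: a bit-serial interface, 4 data bits in and 7 coded bits out per block, and made-up parity equations. The names simulate, encode and gate_on_valid_in are hypothetical. It compares an output stage that advances on every clock with one that only advances when valid_in is high.

```python
def encode(bits):
    """Hypothetical rate-4/7 block encoder: 4 data bits followed by 3 parity bits."""
    d3, d2, d1, d0 = bits
    return [d3, d2, d1, d0, d3 ^ d2 ^ d1, d3 ^ d2 ^ d0, d3 ^ d1 ^ d0]


def simulate(cycles, gate_on_valid_in=False):
    """cycles: one (valid_in, bit) pair per clock.

    Returns one (valid_out, bit) pair per clock.
    """
    collected = []  # data bits received for the current block
    pending = []    # coded bits still waiting to be sent
    out = []
    for valid_in, bit in cycles:
        if valid_in:
            collected.append(bit)
            if len(collected) == 4:
                pending.extend(encode(collected))
                collected = []
        # The output side must be able to advance on clocks with no new input.
        may_send = bool(valid_in) if gate_on_valid_in else True
        if pending and may_send:
            out.append((1, pending.pop(0)))
        else:
            out.append((0, 0))
    return out


# Four valid input samples followed by six idle clocks.
stream = [(1, 1), (1, 0), (1, 1), (1, 1)] + [(0, 0)] * 6
good = simulate(stream)
bad = simulate(stream, gate_on_valid_in=True)
print(sum(v for v, _ in good), "valid outputs when the output runs every clock")
print(sum(v for v, _ in bad), "valid outputs when gated on valid_in")
```

In this model the free-running version emits all 7 coded bits for the 4 inputs, while the gated version stalls as soon as valid_in drops. The RTL equivalent would be an output counter or small state machine that is not qualified by valid_in.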
If this works, it is possible the simulator is using a previously compiled version of the RTL. It is also possible the valid_in/valid_out concepts were not understood. Either way, you should specify how you expect the interfaces into and out of the modules to work.
(With all of the info given, the 4,7 block codes would be lookup tables in modern FPGAs. The fancy encoders would go away, since the FPGA primitives allow the basic approach to work.)
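As an illustration of the lookup-table approach, here is a small Python model. It assumes a systematic Hamming-style code with 4 data bits and 3 parity bits; the parity equations and bit order are my own choice for the example and may not match the code used in this thread. The encoder is a 16-entry table and the decoder is a 128-entry table, both small enough to map onto LUTs or a tiny ROM.

```python
def encode_bits(d):
    """Compute one 7-bit codeword for a 4-bit value d (assumed parity equations)."""
    d3, d2, d1, d0 = (d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1
    p2 = d3 ^ d2 ^ d1
    p1 = d3 ^ d2 ^ d0
    p0 = d3 ^ d1 ^ d0
    # Assumed bit order: d3 d2 d1 d0 p2 p1 p0
    return (d << 3) | (p2 << 2) | (p1 << 1) | p0


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Encoder: 16 entries, indexed directly by the 4 data bits.
ENCODE_LUT = [encode_bits(d) for d in range(16)]

# Decoder: 128 entries, indexed by the received 7-bit word; each entry holds
# the data value whose codeword is nearest to that word.
DECODE_LUT = [
    min(range(16), key=lambda d: hamming_distance(ENCODE_LUT[d], r))
    for r in range(128)
]

codeword = ENCODE_LUT[0b1011]
corrupted = codeword ^ (1 << 5)  # flip one bit
print(format(codeword, "07b"), format(corrupted, "07b"), DECODE_LUT[corrupted])
```

In VHDL the same tables would typically be constant arrays indexed by the input vector, which synthesis can turn into LUTs directly.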