Thanks guys, rca provided a pretty good high-level view, ads-ee gave a good alternative solution, and yanshangzhao added the remaining pieces...
To sum up in general, I think:
- Shift register pros: less area, better timing
- FIFO pros: less power (especially when the register depth gets huge), since not all registers are updated every clock cycle
Also, the FIFO's higher area cost can be reduced if we code it so that it leaves out unnecessary functions (for example, separate read and write pointers, or full and empty flags).
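
To make the power point a bit more concrete, here is a rough behavioural model in Python (not RTL). The class names and the `writes` counter are made up for illustration, and it assumes the FIFO is only used as a fixed delay line, so one shared pointer is enough and there are no full/empty flags. It counts data-register loads per clock, which is only a crude stand-in for real switching power.

```python
class ShiftRegisterDelay:
    """Fixed delay line where every stage loads from its neighbour each clock."""

    def __init__(self, depth):
        self.regs = [0] * depth
        self.writes = 0  # data-register loads so far (hypothetical metric)

    def clock(self, din):
        dout = self.regs[-1]                 # last stage is the output
        self.regs = [din] + self.regs[:-1]   # all stages shift by one
        self.writes += len(self.regs)        # every register is loaded
        return dout


class PointerDelay:
    """Stripped-down FIFO: one pointer, no full/empty flags (assumed fixed delay)."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.ptr = 0
        self.writes = 0  # data-register loads so far (pointer register not counted)

    def clock(self, din):
        dout = self.mem[self.ptr]            # oldest entry sits at the pointer
        self.mem[self.ptr] = din             # overwrite it with the newest entry
        self.ptr = (self.ptr + 1) % len(self.mem)
        self.writes += 1                     # only one data register is loaded
        return dout


def compare(depth, samples):
    """Run both models on the same input and return outputs plus load counts."""
    sr, fifo = ShiftRegisterDelay(depth), PointerDelay(depth)
    out_sr = [sr.clock(s) for s in samples]
    out_fifo = [fifo.clock(s) for s in samples]
    return out_sr, out_fifo, sr.writes, fifo.writes
```

Both models give the same delayed output, but the shift register loads `depth` registers per clock while the pointer version loads one data register (plus the pointer itself). The read mux and pointer logic are where the extra area and timing cost of the FIFO come from.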