The answer lies outside the data structure discussed so far.
Whether the data packets are aligned to 64-bit boundaries (by using fill bytes) or can start anywhere within a word, there must be some kind of framing in the data stream. Robustness requires that the receiver can resync to a packet start even after data loss or corruption, so just counting data length isn't an option.
Framing can be done either with unique patterns in the data stream, which for binary data means escape sequences or an encoding with additional bits, or by signaling start of frame on an additional channel, e.g. extra bits in the FIFO output.
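To illustrate the in-band variant, here is a minimal Python sketch of escape-sequence framing, loosely modeled on HDLC-style byte stuffing. The constants `FLAG`, `ESC` and `XOR_MASK` and the function names are assumptions chosen for the example, not something taken from the design under discussion. It only shows how a receiver can regain sync at the next delimiter after losing data.

```python
FLAG = 0x7E      # frame delimiter (assumed value for this sketch)
ESC = 0x7D       # escape byte (assumed value)
XOR_MASK = 0x20  # applied to escaped bytes so FLAG/ESC never appear in a payload


def encode_frame(payload: bytes) -> bytes:
    """Wrap a payload in FLAG bytes, escaping FLAG/ESC inside it."""
    out = bytearray([FLAG])
    for b in payload:
        if b in (FLAG, ESC):
            out += bytes([ESC, b ^ XOR_MASK])
        else:
            out.append(b)
    out.append(FLAG)
    return bytes(out)


def decode_stream(stream: bytes) -> list:
    """Recover frames from a byte stream that may begin mid-frame."""
    frames = []
    current = None   # None: not synchronized, waiting for the first FLAG
    escaped = False
    for b in stream:
        if b == FLAG:
            if current:                  # non-empty frame just ended
                frames.append(bytes(current))
            current = bytearray()        # every FLAG is a sync point
            escaped = False
        elif current is None:
            continue                     # discard bytes seen before sync
        elif escaped:
            current.append(b ^ XOR_MASK)
            escaped = False
        elif b == ESC:
            escaped = True
        else:
            current.append(b)
    return frames


# The receiver joins the stream at an arbitrary point (two stray bytes first).
stream = b"\x01\x02" + encode_frame(b"\x7e\x10") + encode_frame(b"AB")
frames = decode_stream(stream)
```

The stray leading bytes are dropped and both payloads come back intact. A real link would additionally need a checksum per frame, because a frame whose tail was lost would otherwise be delivered truncated.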
- - - Updated - - -
As said, a framing signal, e.g. SOP (start of packet), is required to decode the stream. If you review popular streaming interface specifications, you'll see that they always provide some information of this kind.
That means SOP must be a multi-bit signal, able to mark any byte in the 64-bit word (or none). Eight byte positions plus "no start" give nine states, so at least 4 bits are required.
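The bit count and one possible sideband encoding can be sketched in Python as follows. The layout (bit 3 as a "start present" flag, bits 2:0 as the byte offset) and all names are assumptions for illustration. The sketch further assumes at most one packet start per 64-bit word and back-to-back packets, so that a new start also ends the previous packet.

```python
BYTES_PER_WORD = 8  # 64-bit FIFO word


def sop_bits_required(bytes_per_word: int = BYTES_PER_WORD) -> int:
    """Bits needed to encode one start position per byte plus 'no start'."""
    states = bytes_per_word + 1
    return (states - 1).bit_length()


def encode_sop(offset):
    """Hypothetical 4-bit code: 0 = no start, 0b1xxx = start at byte xxx."""
    if offset is None:
        return 0
    if not 0 <= offset < BYTES_PER_WORD:
        raise ValueError("offset must be 0..7")
    return 0b1000 | offset


def decode_sop(code):
    """Return the start byte offset, or None if no packet starts here."""
    return (code & 0b0111) if code & 0b1000 else None


def split_packets(words):
    """words: iterable of (8-byte data, sop code) pairs from the FIFO.

    Returns (complete packets, bytes of the packet still in progress).
    """
    packets = []
    current = None  # None until the first SOP is seen (not yet in sync)
    for data, code in words:
        start = decode_sop(code)
        if start is None:
            if current is not None:
                current += data
        else:
            if current is not None:
                current += data[:start]   # tail of the previous packet
                packets.append(bytes(current))
            current = bytearray(data[start:])
    return packets, bytes(current or b"")


words = [
    (b"xxxAAAAA", encode_sop(3)),     # packet A starts at byte 3
    (b"AAAAAAAA", encode_sop(None)),  # continuation, no start
    (b"AABBBBBB", encode_sop(2)),     # packet B starts at byte 2
]
packets, pending = split_packets(words)
```

Because the receiver ignores everything before the first SOP, it can join or rejoin the stream at any word and still lock onto the next packet boundary.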
Doesn't seem related to packet framing, unless you are misunderstanding the point.