Normally, the input would be provided over multiple cycles if the input is large. Likewise, the output would be provided over several cycles. However, for reasonable widths, you could have the output be a tuple: Nmax-bits for data, log2(Nmax)-bits for "number of bits valid", and possibly a 1b "this data is valid" indicator. Reasonable is based on device size and design requirements. You can also adopt this scheme for multi-cycle data, possibly adding either a "start of data" and/or "end of data" indicator. In that case, a 32b bus could transfer 100 bits as:
{!valid, !start, !end, X bits valid, X data}
{valid, start, !end, X bits valid, data[31:0]}
{valid, !start, !end, X bits valid, data[63:32]}
{valid, !start, !end, X bits valid, data[95:64]}
{valid, !start, end, 4 bits valid, data[99:96] with 28 leading/trailing bits}
{!valid, !start, !end, X bits valid, X data}
There are several variations on this generic protocol. You should document exactly what assumptions can be made. eg, if start/last can be set when valid isn't, if the number of valid bits is only used when "end" is set (as in my example). If start+end can occur at the same time, if the final output is padded with trailing random bits (or 0's) or with leading random bits (or 0's). You can then use the same style of interface throughout your design.
In general, you should try to look at having your input be provided over N-cycles, with part or all of your algorithm completing within N-cycles, and with the output being transferred within N-cycles. When an algorithm is more complex, you may need to break it into a series of steps that each complete within N-cycles. Again, this is for full bandwidth. If you don't require full bandwidth, you can buffer data and perform processing at a slower rate.