Pipelining generally works well for systems with little or no feedback, or when the feedback has certain properties.
Addition and multiplication are common operations that are somewhat complex. Multiplexing can also be a difficult operation because of routing issues. If we assume that a multiplication and an addition take 2ns each, then the operation:
y = a*c + b*d would take 4ns: a*c and b*d are computed in parallel (2ns), then the sum is computed (2ns). That limits the clock to 250MHz.
However, if 500MHz is required (a 2ns clock period), then pipelining can be done. In that case, registers are placed after a*c and b*d. This gives:
yac <= a*c
ybd <= b*d
y <= yac + ybd
If a,b,c,d = 1,1,1,1 on cycle 1, 1,2,3,4 on cycle 2, and 1,3,5,7 on cycle 3:
1,1,1,1 -- yac <= 1, ybd <= 1, y <= 0 (yac and ybd are assumed to be equal to zero at the start of the cycle)
1,2,3,4 -- yac <= 3, ybd <= 8, y <= 2 (using yac = ybd = 1 from the start of the cycle)
1,3,5,7 -- yac <= 5, ybd <= 21, y <= 11 (using yac = 3, ybd = 8)
0,0,0,0 -- yac <= 0, ybd <= 0, y <= 26 (using yac = 5, ybd = 21)
Thus the design computes 1 new output per cycle, but has a latency of 1 additional cycle before each output is available. This is similar to an assembly line that can produce 1000 items per hour, but requires an hour before any items have completed all stages of production.
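To make the register behaviour concrete, here is a minimal cycle-by-cycle sketch in Python. It is only a software model of the three assignments above, not HDL. The function name simulate_pipeline and the assumption that all registers reset to zero are mine.

```python
def simulate_pipeline(inputs):
    """Model of: yac <= a*c; ybd <= b*d; y <= yac + ybd (one step per clock)."""
    yac = ybd = y = 0  # assumption: all registers reset to zero
    trace = []
    for a, b, c, d in inputs:
        # The right-hand side is evaluated with the values held at the start
        # of the cycle, then all registers update together (the clock edge).
        yac, ybd, y = a * c, b * d, yac + ybd
        trace.append((yac, ybd, y))
    return trace


inputs = [(1, 1, 1, 1), (1, 2, 3, 4), (1, 3, 5, 7), (0, 0, 0, 0)]
for cycle, (yac, ybd, y) in enumerate(simulate_pipeline(inputs), start=1):
    print(f"cycle {cycle}: yac={yac} ybd={ybd} y={y}")
```

Each value of y is the sum of the products registered on the previous cycle, so the extra all-zero input is only there to flush the last result out of the pipeline.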
The problem y[n] = a*y[n-1] + b*x[n] is much more difficult. For example, with y initially 0, the response for x[n] = 1, 2, 3 is b, a*b+2b, a^2*b+2a*b+3b.
If ya <= a*y, yb <= b*x[n], y <= ya+yb, then the registered values at the end of each cycle are:
cycle 1, ya = 0, yb = b, y = 0 (waiting for ya, yb)
cycle 2, ya = 0, yb = 2b. y = b
cycle 3, ya = a*b, yb = 3b, y = 2b
cycle 4, ya = 2*a*b, yb = 0, y = a*b + 3b
Notice that the values after pipelining no longer line up! This is because the value "y" is delayed by an additional cycle, but is needed on the next cycle. This type of problem can sometimes be solved; for example, if a = 1 the problem can be solved (think about the logic for each bit of the addition).
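The mismatch is easy to reproduce in software. The sketch below compares the intended recurrence with the naively pipelined version. The function names direct and pipelined, the zero reset values, and the coefficients a = 3, b = 5 are arbitrary choices for illustration.

```python
def direct(a, b, xs):
    """Reference: y[n] = a*y[n-1] + b*x[n], with y starting at 0."""
    y, out = 0, []
    for x in xs:
        y = a * y + b * x
        out.append(y)
    return out


def pipelined(a, b, xs):
    """Naive pipelining: ya <= a*y; yb <= b*x; y <= ya + yb."""
    ya = yb = y = 0  # assumption: registers reset to zero
    out = []
    for x in xs:
        # All three registers update together from start-of-cycle values.
        ya, yb, y = a * y, b * x, ya + yb
        out.append(y)
    return out


a, b = 3, 5        # arbitrary example coefficients
xs = [1, 2, 3, 0]  # the input sequence from the text, plus one flush cycle
print("direct   :", direct(a, b, xs))
print("pipelined:", pipelined(a, b, xs))
```

The pipelined outputs are not just a delayed copy of the direct ones: the feedback term arrives one cycle late, so the filter being computed is a different one.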
Likewise, individual operations can be pipelined. A 32b addition could be broken into two 16b additions, with the carry out of the lower half registered and used as the carry in of the upper half on the next cycle.
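As a rough illustration of the split adder, here is a Python model of one possible two-stage arrangement. The name pipelined_add32, the register names, and the choice to delay the upper halves by one cycle are my assumptions. A real design would be written in HDL and might be arranged differently.

```python
MASK16 = 0xFFFF


def pipelined_add32(pairs):
    """Two-stage 32b adder model: low 16b in stage 1, high 16b in stage 2."""
    # Stage-1 registers: low sum, its carry out, and the delayed high halves.
    lo = carry = a_hi = b_hi = 0  # assumption: registers reset to zero
    results = []
    for a, b in pairs:
        # Stage 2: add the high halves registered last cycle, with carry in.
        hi = (a_hi + b_hi + carry) & MASK16
        results.append((hi << 16) | lo)
        # Stage 1: add the low halves of the new inputs and register the
        # high halves so they line up with the carry on the next cycle.
        s = (a & MASK16) + (b & MASK16)
        lo, carry = s & MASK16, s >> 16
        a_hi, b_hi = (a >> 16) & MASK16, (b >> 16) & MASK16
    return results


pairs = [(0x0001FFFF, 0x00000001), (0xFFFFFFFF, 0x00000001), (0, 0)]
for value in pipelined_add32(pairs):
    print(f"0x{value:08X}")
```

In this model results[n] holds the sum of pairs[n-1], wrapped to 32 bits, so the first entry is just the reset value and the final (0, 0) pair flushes the last sum. Each stage only has to ripple a carry through 16 bits rather than 32.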