In your case, I think you've just broken it.
The benefit of breaking the multiply operation over multiple clock cycles is that the logic is split across those cycles. E.g., if the multiplier needed 32 levels of logic before, splitting it into two stages means only about 16 are needed per cycle, so the clock rate can be increased. Sometimes pipelining is needed even with 1 level of logic if routing delays are large. As long as there isn't tight feedback, everything is fine.
E.g., an FIR filter is just y = a*x1 + b*x2 + c*x3 + d*x4. There is no feedback in the structure at all, so it can be pipelined to achieve higher clock rates at the cost of some extra latency.
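To illustrate why the FIR case is safe, here is a rough cycle-by-cycle model in Python (not HDL). The function names and the choice to model the pipeline registers as pure output latency are my own assumptions for the sketch. It only shows that the pipelined version produces the same samples, just later.

```python
def fir_direct(x, coeffs):
    """Unpipelined FIR: y[n] = sum(coeffs[k] * x[n-k]), one result per cycle."""
    taps = [0] * len(coeffs)
    out = []
    for sample in x:
        taps = [sample] + taps[:-1]          # shift the newest sample into the tap line
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out


def fir_pipelined(x, coeffs, stages=2):
    """Same FIR, with `stages` pipeline registers modelled as output latency."""
    pipe = [0] * stages                      # registers reset to zero
    out = []
    for y in fir_direct(x, coeffs):
        out.append(pipe[-1])                 # value leaving the last register
        pipe = [y] + pipe[:-1]               # new result enters the first register
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    coeffs = [1, 2, 3, 4]                    # stand-ins for a, b, c, d
    print(fir_direct(x, coeffs))
    print(fir_pipelined(x, coeffs, stages=2))
```

The second list is the first one delayed by two cycles. Nothing downstream of the filter feeds back into it, so the delay does not change what is computed.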
On the other hand, an IIR filter such as y1 = a*y0 + b*x1 immediately runs into problems. If pipeline stages are added inside the feedback loop, they change the system: this system needs its output on the next cycle. In this case, C-slow can sometimes be used. This is where the multiply is pipelined and the input is commutated between different independent sources, so each source's result comes back around the loop just in time for that source's next sample.
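Here is a rough Python model of the IIR case, again a behavioural sketch rather than HDL. The function names, the zero reset state, and the round-robin interleaving are my own assumptions. It models only the delay around the feedback loop: one extra register turns y[n] = a*y[n-1] + b*x[n] into y[n] = a*y[n-2] + b*x[n], and feeding two independent streams alternately gives each stream the original filter back.

```python
def iir_reference(x, a, b):
    """Original system: y[n] = a*y[n-1] + b*x[n]."""
    y_prev = 0
    out = []
    for sample in x:
        y_prev = a * y_prev + b * sample
        out.append(y_prev)
    return out


def iir_pipelined(x, a, b, extra_stages=1):
    """Loop with extra registers: y[n] = a*y[n-1-extra_stages] + b*x[n]."""
    delay = [0] * (1 + extra_stages)         # delay[-1] is the oldest stored output
    out = []
    for sample in x:
        y = a * delay[-1] + b * sample
        delay = [y] + delay[:-1]
        out.append(y)
    return out


def c_slow(streams, a, b):
    """Interleave C independent streams through a loop with C-1 extra registers."""
    c = len(streams)
    interleaved = [s for group in zip(*streams) for s in group]
    mixed = iir_pipelined(interleaved, a, b, extra_stages=c - 1)
    return [mixed[i::c] for i in range(c)]   # de-interleave the outputs


if __name__ == "__main__":
    xa = [1.0, 0.0, 0.0, 2.0]
    xb = [0.0, 3.0, 1.0, 0.0]
    print(iir_reference(xa, 0.5, 1.0))       # what stream A should produce
    print(iir_pipelined(xa, 0.5, 1.0))       # single stream, pipelined loop: wrong
    print(c_slow([xa, xb], 0.5, 1.0))        # two streams: each one is correct
```

Note that this only helps if you actually have C independent streams to feed it. A single stream still sees the modified system.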