Hi,
I think there is a big difference between doing an 8 x 8 bit integer multiplication and a 32 x 32 bit floating-point multiplication.
One is the low end, the other is the high end (or at least a good middle range).
One has a dynamic range of 1:256, the other about 1:3 x 10^76
(assuming IEEE 754 single precision, largest versus smallest normalized value).
So these are two totally different things.
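
To put numbers on the two dynamic ranges, here is a small sketch. It assumes the 32-bit format is IEEE 754 single precision (8 exponent bits, 23 fraction bits) and compares only normalized values; the variable names are just for illustration.

```python
import math

# Unsigned 8-bit integer: values 0..255, i.e. 256 distinct levels.
int8_levels = 2 ** 8

# Assumption: IEEE 754 single precision (8 exponent bits, 23 fraction bits).
f32_max = (2 - 2 ** -23) * 2.0 ** 127   # largest finite value, about 3.4e38
f32_min_normal = 2.0 ** -126            # smallest normalized value, about 1.18e-38

# Ratio of largest to smallest normalized value.
f32_ratio = f32_max / f32_min_normal

print("8-bit integer range : 1:%d" % int8_levels)
print("32-bit float range  : 1:%.3e (about 10^%.1f)" % (f32_ratio, math.log10(f32_ratio)))
```

If denormalized numbers are counted as well, the float range gets even larger, so the gap to the 8-bit case only grows.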
*********************************
If I understand you correctly, you don't need "fast" code, but code with high "throughput".
(Pipelined code is not fast from input to output, because each result still takes several clock cycles, but it has high throughput: maybe one calculation per clock cycle.)
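
A rough way to see the difference between latency and throughput is to count clock cycles. The sketch below assumes a hypothetical multiplier with 4 pipeline stages that accepts one new input per clock; the function names and the stage count are made up for illustration, not taken from any real design.

```python
def sequential_cycles(n_ops, latency):
    """Cycles when each operation must finish before the next one starts."""
    return n_ops * latency


def pipelined_cycles(n_ops, stages):
    """Cycles when a new operation enters the pipeline on every clock.

    The first result appears after `stages` cycles (the latency),
    then one further result appears per cycle.
    """
    if n_ops == 0:
        return 0
    return stages + n_ops - 1


# Assumed example values: 4-stage multiplier, 1000 multiplications.
STAGES = 4
N_OPS = 1000

print("latency of one result :", pipelined_cycles(1, STAGES), "cycles")
print("non-pipelined total   :", sequential_cycles(N_OPS, STAGES), "cycles")
print("pipelined total       :", pipelined_cycles(N_OPS, STAGES), "cycles")
```

A single result is no faster with the pipeline, but for a long stream of inputs the total approaches one result per clock cycle.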
Klaus