When and why should we use pipelining in FPGA?

Status
Not open for further replies.

shaiko

Hello people,
I read an interesting post a few days ago...

Basically, it suggests that when a logic operation is time-consuming, we should use pipelining.
For example, say we want to multiply four 32-bit-wide vectors (A, B, C, D) and assign the result to vector Z.

We shouldn't do Z <= A*B*C*D;

but rather:

PROCESS(Clk)
BEGIN
  if(rising_edge(Clk)) then
    --Implement the pipeline stages using a for loop and case statement.
    --'i' is the stage number here.
    --The multiplication is done in 3 stages here.
    for i in 0 to 2 loop
      case i is
        when 0 => temp1 <= A*B;
        when 1 => temp2 <= B*C;
        when 2 => Z <= C*D;
        when others => null;
      end case;
    end loop;
  end if;
END PROCESS;



What I don't understand:
If we consider the first example (Z <= A*B*C*D): after some delay (the "settling time"), the correct value of Z will arrive. It will be glitchy, but only for a very short time!
After the settling time the product Z will be stable!

So, why should we pipeline?
 

Pipelining is used to boost speed. Think about the physical realization of your original equation: multiplier "W" gives the product A*B, followed by multiplier "X", which gives W*C, followed by multiplier "Y", which gives X*D. You have to wait for the combined delay of W+X+Y until your output is valid. This is your limiting parameter. (You could also multiply A*B and C*D in parallel and then multiply those two products.)

Now suppose you multiply A*B=W and latch that (1 pipeline stage), and at the same time multiply C*D=X and latch that. On the next clock cycle you multiply W*X. You've delayed your output by one clock, but now your limiting factor is only a single multiplier delay instead of 3. So you sacrifice latency for speed.

Does this make sense?
 
You don't need to pipeline if you can hold the same values in the source registers and you can tell the downstream logic when the valid data is ready. You just need to set a multicycle path constraint for timing analysis.

But if the source flops get updated every cycle, you cannot hold the same value there and you need to pipeline.
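
A minimal sketch of the hold-the-operands approach described above, not a definitive implementation. The entity name, the `start`/`done` handshake and the `WAIT_CYCLES` generic are all hypothetical; the real number of cycles has to come from your own timing analysis, and the path from the operands to `z` still needs a matching multicycle constraint in your tool (for example `set_multicycle_path` in SDC-style constraint files).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Unpipelined 4-operand product, sampled only after the operands have
-- been held stable for WAIT_CYCLES clocks. All names are hypothetical.
entity mult4_multicycle is
  generic (
    WAIT_CYCLES : positive := 3  -- assumed settling time in clock cycles
  );
  port (
    clk        : in  std_logic;
    start      : in  std_logic;              -- pulse when new operands are applied
    a, b, c, d : in  unsigned(31 downto 0);  -- must stay stable until done = '1'
    z          : out unsigned(127 downto 0);
    done       : out std_logic               -- one-cycle pulse: z is valid
  );
end entity;

architecture rtl of mult4_multicycle is
  signal count : natural range 0 to WAIT_CYCLES := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      if start = '1' then
        count <= WAIT_CYCLES;        -- begin waiting for the logic to settle
      elsif count > 1 then
        count <= count - 1;
      elsif count = 1 then
        count <= 0;
        z     <= (a * b) * (c * d);  -- long combinational path, captured once
        done  <= '1';
      end if;
    end if;
  end process;
end architecture;
```

The trade-off is the one stated above: this only works if new operands arrive no more often than every `WAIT_CYCLES` clocks.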
 
Hello Barry,

This explains the concept of parallelization, not pipelining.

You say:
"suppose you multiply A*B=W and latch that (1 pipeline stage), and at the same time multiply C*D=X and latch that. On the next clock cycle you multiply W*X"

Parallelization isn't the issue. What I don't understand is why we should register the temporary results (your W and X).
Why can't we use a purely combinatorial circuit:

result <= (A*B)*(C*D);

without registering the products of (A*B) and (C*D)...

---------- Post added at 20:48 ---------- Previous post was at 20:41 ----------

Thanks lostinxlation!
This explains everything.

So, can we say that pipelining makes sense only when the operands "die" too quickly?
 

"So, can we say that pipelining makes sense only when the operands "die" too quickly?"
The benefit of pipelining is that you can pump in new data every cycle, so that you get a throughput of one result per cycle under ideal conditions. If new data arrives only every N cycles or more (N being the number of cycles the multiplication takes to execute), there is not much benefit to using a pipeline from a performance point of view.
In practice, however, I'd prefer using a pipeline because it makes timing analysis simpler.
 
"what I don't understand is why to register the temporary results (your W and X)?
Why can't we use a pure combinatorial circuit of:

result <= (A*B)*(C*D);

without registering the products of (A*B) and (C*D)..."



Let's say your multiplier has a propagation delay of 100 ns. With your combinatorial circuit you can only present new data every 200 ns (assuming the parallel implementation, i.e. two multiplier delays in series). If you pipeline it, you can present new data every 100 ns.
 
The code in the original post has nothing to do with pipelining. (And it does not produce the intended result Z <= A*B*C*D.)

A simple pipelining example (ignoring the correct result length) can be found below. It delivers the result after two clock cycles.
Code:
if rising_edge(Clk) then
  temp1 <= A*B;
  temp2 <= C*D;
  Z <= temp1*temp2;
end if;
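
For completeness, here is a sketch of the same two-stage pipeline as a full design unit with the result lengths worked out. It assumes unsigned operands and `ieee.numeric_std`, where the product of two `unsigned` values is as wide as the sum of the operand widths (32+32 = 64 bits, then 64+64 = 128 bits). The entity name `mult4_pipe` is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Two-stage pipelined product Z = A*B*C*D (hypothetical entity name).
entity mult4_pipe is
  port (
    clk        : in  std_logic;
    a, b, c, d : in  unsigned(31 downto 0);
    z          : out unsigned(127 downto 0)
  );
end entity;

architecture rtl of mult4_pipe is
  -- Stage 1 registers: each holds a 32x32 -> 64-bit partial product.
  signal temp1, temp2 : unsigned(63 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      temp1 <= a * b;          -- stage 1
      temp2 <= c * d;          -- stage 1, in parallel
      z     <= temp1 * temp2;  -- stage 2: uses last cycle's partial products
    end if;
  end process;
end architecture;
```

New operands can be applied on every clock; each result appears two clock edges after its operands were sampled. A 64x64 multiply in one cycle may still be the critical path on a real device, so further stages may be needed.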
 

FvM,

I know how pipelining is done...
What didn't make sense to me was the actual need to use registers if the operands (A, B, C, D) are driven for a long enough time.
Sure enough, lostinxlation explained that you don't need pipelining in that case.

From all of the replies I conclude that:
Pipelining is essential only if the operand update rate is faster than the operation speed.
It gives no functional benefit if A, B, C, D are driven for longer than it takes the combinatorial operation to complete.
 

"I know how pipelining is done..."
O.K. At first sight, the original post didn't give the impression that you were aware of synchronous design basics.

"What didn't make sense to me was the actual need to use registers if the operands (A, B, C, D) are driven for a long enough time."
Synchronous FPGA design means registers clocked by a system clock, with some logic processing in between. Except for special "multi cycle" cases, the logic propagation delay is expected to be less than one clock cycle. This makes it easy to check the design timing and to guarantee correct operation without relying on assumptions about the data flow.

"Driven for a long enough time" implies that you need to keep exact track of the data flow and to know how many clock cycles are "long enough" for specific operations. A logic path that needs two clock cycles to complete can only be updated every second cycle. Pipelining cuts the path by inserting a register level. Reducing the clock rate is of course another option.
 
Thanks FvM,

You explained it very clearly.

BTW:
Suppose,
I want to write a synthesizable VHDL function for a four-number multiply (A*B*C*D).
But what if I want it to be pipelined?
I know that functions in VHDL can describe only combinatorial logic, so is there a way around it?

Is there a reusable mechanism in VHDL (like functions) that allows mixed logic (synchronous and combinatorial) under one "roof"?
 

"Retiming" is the lazy man's pipelining. It is sometimes called "register balancing" as well. Retiming is a synthesis or placement optimization that tries to pipeline a design automatically. For example, you might write: "Z_pre <= A*B*C*D; Z <= Z_pre". The retiming system will hopefully determine that the extra register stage should be moved backwards into the logic. Ideally it takes its inputs as "here's an operation, and a latency, now build my system".

Retiming has some advantages and disadvantages. It can do a better job of taking routing delays into consideration. It can also do a few tricks that would make for confusing code if done in HDL. But the algorithm is superlinear in run time, so large FPGA designs can end up taking several additional hours to route. In general, retiming doesn't work better than manually pipelining a design. Overall, I've had mostly bad luck with retiming, though the algorithms may get better over time. Retiming also tends to rename nets in strange ways, which can matter with Xilinx designs.
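
A sketch of the "operation plus latency" coding style that retiming relies on, written out as a full design unit. The entity name and the `EXTRA_REGS` generic are hypothetical. Whether the registers are actually moved into the multiplier logic depends entirely on the tool and on whether retiming/register balancing is enabled, so treat this as an illustration of the input to retiming, not a guarantee of the result.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Combinational product followed by a chain of plain registers that a
-- retiming-capable tool may redistribute. Names are hypothetical.
entity mult4_retime is
  generic (
    EXTRA_REGS : positive := 2  -- latency in clock cycles
  );
  port (
    clk        : in  std_logic;
    a, b, c, d : in  unsigned(31 downto 0);
    z          : out unsigned(127 downto 0)
  );
end entity;

architecture rtl of mult4_retime is
  type pipe_t is array (0 to EXTRA_REGS - 1) of unsigned(127 downto 0);
  signal pipe : pipe_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- "Z_pre": the whole operation in front of the first register.
      pipe(0) <= (a * b) * (c * d);
      -- Remaining registers only add latency; no reset or enable is used
      -- here, which is assumed to make them easier for a tool to move.
      for i in 1 to EXTRA_REGS - 1 loop
        pipe(i) <= pipe(i - 1);
      end loop;
    end if;
  end process;

  z <= pipe(EXTRA_REGS - 1);
end architecture;
```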
 

The clear VHDL method to implement the pipelined multiplication is to write a component. Unfortunately, you can't instantiate it inside a process, so you have to place component instantiations in the concurrent code, where they interact with the sequential (behavioral) code. As far as I'm aware, there's no better way to do these things in present HDLs.
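
A sketch of what that can look like, assuming `ieee.numeric_std` and unsigned operands. The entity names `mult_reg` and `mult4_top` and the `z_nonzero` flag are hypothetical, and direct entity instantiation (`entity work.mult_reg`) is used here as a shorthand for declaring and instantiating a component.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Reusable building block: a multiplier with a registered output.
entity mult_reg is
  generic (W : positive := 32);
  port (
    clk  : in  std_logic;
    x, y : in  unsigned(W - 1 downto 0);
    p    : out unsigned(2 * W - 1 downto 0)
  );
end entity;

architecture rtl of mult_reg is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      p <= x * y;  -- one pipeline stage
    end if;
  end process;
end architecture;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult4_top is
  port (
    clk        : in  std_logic;
    a, b, c, d : in  unsigned(31 downto 0);
    z          : out unsigned(127 downto 0);
    z_nonzero  : out std_logic
  );
end entity;

architecture rtl of mult4_top is
  signal ab, cd : unsigned(63 downto 0);
  signal z_i    : unsigned(127 downto 0);
begin
  -- Instantiations live in the concurrent region, not inside a process.
  u_ab : entity work.mult_reg
    generic map (W => 32) port map (clk => clk, x => a, y => b, p => ab);
  u_cd : entity work.mult_reg
    generic map (W => 32) port map (clk => clk, x => c, y => d, p => cd);
  u_z  : entity work.mult_reg
    generic map (W => 64) port map (clk => clk, x => ab, y => cd, p => z_i);

  z <= z_i;

  -- Sequential (behavioral) code interacts with the instances via signals.
  process (clk)
  begin
    if rising_edge(clk) then
      if z_i /= 0 then
        z_nonzero <= '1';
      else
        z_nonzero <= '0';
      end if;
    end if;
  end process;
end architecture;
```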
 
Here is a simple way of implementing the pipeline. If your FPGA has dedicated multipliers, you might want to use those.

process(clk)
begin
AtimesB<=a*b;
CtimesD<=c*d;
product<=AtimesB*CtimesD;
end process;
 
process(clk)
begin
AtimesB<=a*b;
CtimesD<=c*d;
product<=AtimesB*CtimesD;
end process;

how about this synthesisable code instead:

Code:
process(clk)
begin
  if rising_edge(clk) then
    AtimesB<=a*b;
    CtimesD<=c*d;
    product<=AtimesB*CtimesD;
  end if;
end process;
 

Right, I left out the rising_edge statement (or "if clk='1' and clk'event")
 

I want to use the pipelining concept in my project (a floating-point multiplier).

I have completed all units, such as the exponent calculation, mantissa multiplier and normalizer, without a clock, using concurrent statements and structural modelling (component instantiation).
I want to put pipeline stages between the exponent calculation and mantissa multiplier on one side and the normalizer on the other, and between the normalizer and the final output.
The figure is:


The dotted line shows a pipeline stage.

Please give me some idea or example of pipelining.

Can I use a parallel-in parallel-out shift register (D flip-flops) between stages?
 

Probably.
But you usually build pipelining into your modules. And floating point units will need a LOT of pipelining inside the modules themselves, otherwise they will run very very slowly.
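
A minimal sketch of a register stage placed between two structural blocks, as asked about above. Everything here is hypothetical: the entity names, the signal names and the widths (48 bits for a 24x24 mantissa product, 9 bits for an exponent sum) are assumptions for a single-precision multiplier, not taken from the design in question.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Generic parallel-in parallel-out register (one pipeline stage).
entity pipe_reg is
  generic (WIDTH : positive := 8);
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(WIDTH - 1 downto 0);
    q   : out std_logic_vector(WIDTH - 1 downto 0)
  );
end entity;

architecture rtl of pipe_reg is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      q <= d;
    end if;
  end process;
end architecture;

library ieee;
use ieee.std_logic_1164.all;

-- Stage boundary between (exponent calc + mantissa multiplier) and the
-- normalizer. Widths are assumed, see the note above.
entity stage_boundary is
  port (
    clk         : in  std_logic;
    mant_prod   : in  std_logic_vector(47 downto 0);  -- from mantissa multiplier
    exp_sum     : in  std_logic_vector(8 downto 0);   -- from exponent calculation
    mant_prod_r : out std_logic_vector(47 downto 0);  -- to normalizer
    exp_sum_r   : out std_logic_vector(8 downto 0)    -- to normalizer
  );
end entity;

architecture rtl of stage_boundary is
begin
  u_mant : entity work.pipe_reg
    generic map (WIDTH => 48)
    port map (clk => clk, d => mant_prod, q => mant_prod_r);

  u_exp : entity work.pipe_reg
    generic map (WIDTH => 9)
    port map (clk => clk, d => exp_sum, q => exp_sum_r);
end architecture;
```

The point to watch is that every signal crossing the dotted line (mantissa, exponent, sign, flags) gets delayed by the same number of stages, otherwise values from different operations get mixed.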
 

Can pipelining be done using only the clk with an if statement inside a process statement in the different components, i.e. without using D flip-flops?
 

You can create them however you want, as long as you follow the correct synchronous template:

Code:
d_ff : process(clk, reset)
begin
  if reset = '1' then
    --async reset
  elsif rising_edge(clk) then
    --sync stuff
  end if;
end process;

Or if you're really crazy you can instantiate the primitives (but why would you want to?)
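
As an illustration, here is the template above filled in as one pipeline stage. The entity name, the 32-bit width and the `valid_in`/`valid_out` flag are hypothetical additions; the flag is just one common way to tell downstream logic which cycles carry real data.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- One pipeline stage written with a clocked process (no explicit
-- flip-flop components). Names and widths are hypothetical.
entity stage_reg is
  port (
    clk       : in  std_logic;
    reset     : in  std_logic;
    valid_in  : in  std_logic;
    data_in   : in  std_logic_vector(31 downto 0);
    valid_out : out std_logic;
    data_out  : out std_logic_vector(31 downto 0)
  );
end entity;

architecture rtl of stage_reg is
begin
  d_ff : process (clk, reset)
  begin
    if reset = '1' then
      -- async reset
      valid_out <= '0';
      data_out  <= (others => '0');
    elsif rising_edge(clk) then
      -- sync stuff: the valid flag travels with its data
      valid_out <= valid_in;
      data_out  <= data_in;
    end if;
  end process;
end architecture;
```

The synthesis tool infers the flip-flops from the clocked process, so no separate D flip-flop component is needed.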
 

I have completed all modules using structural modelling. Now I want to pipeline them. Please tell me how to pipeline the components.
Please give me an example.
 
