What are best practices for optimizing pipeline throughput for FPGA implementations?

mrflibble · Apr 28, 2011

How does one, for example, make the best use of retiming and/or C-slow to get the most out of a given pipeline?

With retiming, some modules get better results by putting the shift registers on the inputs (forward register balancing), while other modules do better with shift registers on the output (backward register balancing).

For now I use the following method:

1. Code the HDL (in Verilog).
2. Create timing constraints for the specific module.
3. Synthesize, map, place & route (using ISE 13.1).
4. Look at the post place & route timings for the module-to-be-improved, and at the maximum number of logic levels.
5. Take this number of logic levels, and make an educated guess for the number of flip-flops to insert.
6. Insert flip-flops, enable register balancing, and hope for the best.

As it stands, this method is hit & miss. Sometimes it gets pretty good results, sometimes it's crap. So, what is a good way to improve the success ratio of such retiming?

Are there any tools that can aid in this? Also, links, papers and book recommendations would be much appreciated.

---------- Post added at 06:01 ---------- Previous post was at 04:45 ----------

Something related to this...

Retiming can take a long time, so is there some way to store and reuse intermediate results?

For example, the following module responds fairly well to insertion of a 1-deep shift register on the inputs. With register balancing enabled, the tool can then move these registers forward to improve timing.

Code:

`default_nettype none
`timescale 1ns / 1ps

(* REGISTER_BALANCING = "YES" *)
module edge_detector_16bit  #(
    parameter          RETIME = 0     // 0 = no retiming , 1 = 1-bit shift register on inputs for retiming
    ) (
    input  wire        clk,
    input  wire        prev,
    input  wire [15:0] data,
    output reg   [3:0] index    = 0,
    output reg         detect   = 0,  // 0 = no edge detected           ; 1 = edge detected
    output reg         polarity = 0   // 0 = 1-to-0 transition detected ; 1 = 0-to-1 transition detected
    );

//
// Check for illegal parameter values.
//
initial begin
    if ((RETIME < 0) || (RETIME > 1)) begin
        $display("DRC ERROR: Illegal value %d for RETIME parameter. Should be 0 or 1.", RETIME);
        $finish;
    end
end


//
// Retiming
//
wire [16:0] sel_data;

generate
  if (RETIME==1) begin
    // Retiming
    reg  [16:0] retimed_data; // shift register, no initialization value
    assign      sel_data     = retimed_data;

    always @(posedge clk) begin
        retimed_data <= {data[15:0],prev};
    end
  end else begin
    // No retiming
    assign sel_data = ({data[15:0],prev});
  end
endgenerate


//
// Edge detect
//
always @(posedge clk) begin
    casez (sel_data)
      // decode with priority from right to left (low to high bit)
      17'b???????????????10: begin index <=  0; detect <= 1'b1; polarity <= 1'b1; end
      17'b??????????????100: begin index <=  1; detect <= 1'b1; polarity <= 1'b1; end
      17'b?????????????1000: begin index <=  2; detect <= 1'b1; polarity <= 1'b1; end
      17'b????????????10000: begin index <=  3; detect <= 1'b1; polarity <= 1'b1; end
      17'b???????????100000: begin index <=  4; detect <= 1'b1; polarity <= 1'b1; end
      17'b??????????1000000: begin index <=  5; detect <= 1'b1; polarity <= 1'b1; end
      17'b?????????10000000: begin index <=  6; detect <= 1'b1; polarity <= 1'b1; end
      17'b????????100000000: begin index <=  7; detect <= 1'b1; polarity <= 1'b1; end
      17'b???????1000000000: begin index <=  8; detect <= 1'b1; polarity <= 1'b1; end
      17'b??????10000000000: begin index <=  9; detect <= 1'b1; polarity <= 1'b1; end
      17'b?????100000000000: begin index <= 10; detect <= 1'b1; polarity <= 1'b1; end
      17'b????1000000000000: begin index <= 11; detect <= 1'b1; polarity <= 1'b1; end
      17'b???10000000000000: begin index <= 12; detect <= 1'b1; polarity <= 1'b1; end
      17'b??100000000000000: begin index <= 13; detect <= 1'b1; polarity <= 1'b1; end
      17'b?1000000000000000: begin index <= 14; detect <= 1'b1; polarity <= 1'b1; end
      17'b10000000000000000: begin index <= 15; detect <= 1'b1; polarity <= 1'b1; end
      17'b00000000000000000: begin index <= 0 ; detect <= 1'b0; polarity <= 1'b1; end

      17'b???????????????01: begin index <=  0; detect <= 1'b1; polarity <= 1'b0; end
      17'b??????????????011: begin index <=  1; detect <= 1'b1; polarity <= 1'b0; end
      17'b?????????????0111: begin index <=  2; detect <= 1'b1; polarity <= 1'b0; end
      17'b????????????01111: begin index <=  3; detect <= 1'b1; polarity <= 1'b0; end
      17'b???????????011111: begin index <=  4; detect <= 1'b1; polarity <= 1'b0; end
      17'b??????????0111111: begin index <=  5; detect <= 1'b1; polarity <= 1'b0; end
      17'b?????????01111111: begin index <=  6; detect <= 1'b1; polarity <= 1'b0; end
      17'b????????011111111: begin index <=  7; detect <= 1'b1; polarity <= 1'b0; end
      17'b???????0111111111: begin index <=  8; detect <= 1'b1; polarity <= 1'b0; end
      17'b??????01111111111: begin index <=  9; detect <= 1'b1; polarity <= 1'b0; end
      17'b?????011111111111: begin index <= 10; detect <= 1'b1; polarity <= 1'b0; end
      17'b????0111111111111: begin index <= 11; detect <= 1'b1; polarity <= 1'b0; end
      17'b???01111111111111: begin index <= 12; detect <= 1'b1; polarity <= 1'b0; end
      17'b??011111111111111: begin index <= 13; detect <= 1'b1; polarity <= 1'b0; end
      17'b?0111111111111111: begin index <= 14; detect <= 1'b1; polarity <= 1'b0; end
      17'b01111111111111111: begin index <= 15; detect <= 1'b1; polarity <= 1'b0; end
      17'b11111111111111111: begin index <= 0 ; detect <= 1'b0; polarity <= 1'b0; end

      default              : begin index <= 0 ; detect <= 1'b0; polarity <= 1'b0; end
    endcase
end

endmodule // edge_detector_16bit
`default_nettype wire

I tried something similar, but with a wider input. This meant a lot more case items and more logic levels, so I inserted a deeper shift register on the inputs. But no matter what depth (from 1 to 8) I chose for the shift register, the results were always crap. So I guess there is a limit to how many logic levels XST is able to optimize...

Other than hand optimizing or partitioning (which is what I did now), are there tools that help analyze this sort of thing?

permute · Apr 28, 2011

This is a somewhat good method. The big thing to remember is to keep feedback paths as long (in registers) as possible, especially for control. This is because as the allowed latency shrinks, it becomes much more difficult to pipeline a system.

I think a lot of people will manually pipeline some of the logic. E.g., in your case above you could break that into multiple stages by hand.
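As a rough illustration of "break that into multiple stages by hand", here is a minimal sketch of a two-stage version of the edge detector above. The module name, the internal signal names and the 4-bit grouping are my own assumptions, not something from the original posts, and it has not been run through XST or timed. Stage 1 priority-encodes each 4-bit group on its own; stage 2 picks the lowest group that saw a transition. Latency is 2 clocks, and unlike the original, polarity is only meaningful while detect is 1.

```verilog
`default_nettype none

// Hypothetical two-stage variant of edge_detector_16bit (2 clocks latency).
// Assumption: polarity is a don't-care whenever detect is 0.
module edge_detector_16bit_2stage (
    input  wire        clk,
    input  wire        prev,
    input  wire [15:0] data,
    output reg   [3:0] index    = 0,
    output reg         detect   = 0,
    output reg         polarity = 0
    );

// Bit i is set when data[i] differs from its lower neighbour
// (the lower neighbour of data[0] is prev).
wire [15:0] edges = data ^ {data[14:0], prev};

// Stage 1 results: one hit flag, one 2-bit index and one polarity per group.
reg [3:0] grp_hit = 0;
reg [7:0] grp_idx = 0;
reg [3:0] grp_pol = 0;

genvar g;
generate
  for (g = 0; g < 4; g = g + 1) begin : grp
    wire [3:0] e = edges[4*g +: 4];
    wire [3:0] d = data [4*g +: 4];

    always @(posedge clk) begin
        grp_hit[g] <= |e;
        casez (e)  // lowest set bit wins; polarity is the new bit value
          4'b???1: begin grp_idx[2*g +: 2] <= 2'd0; grp_pol[g] <= d[0]; end
          4'b??10: begin grp_idx[2*g +: 2] <= 2'd1; grp_pol[g] <= d[1]; end
          4'b?100: begin grp_idx[2*g +: 2] <= 2'd2; grp_pol[g] <= d[2]; end
          4'b1000: begin grp_idx[2*g +: 2] <= 2'd3; grp_pol[g] <= d[3]; end
          default: begin grp_idx[2*g +: 2] <= 2'd0; grp_pol[g] <= 1'b0; end
        endcase
    end
  end
endgenerate

// Stage 2: select the lowest group that reported a transition.
always @(posedge clk) begin
    detect <= |grp_hit;
    casez (grp_hit)
      4'b???1: begin index <= {2'd0, grp_idx[1:0]}; polarity <= grp_pol[0]; end
      4'b??10: begin index <= {2'd1, grp_idx[3:2]}; polarity <= grp_pol[1]; end
      4'b?100: begin index <= {2'd2, grp_idx[5:4]}; polarity <= grp_pol[2]; end
      4'b1000: begin index <= {2'd3, grp_idx[7:6]}; polarity <= grp_pol[3]; end
      default: begin index <= 4'd0;                 polarity <= 1'b0;       end
    endcase
end

endmodule // edge_detector_16bit_2stage
`default_nettype wire
```

Each stage only looks at a handful of bits, so the same split should extend to wider inputs by adding groups or a third stage. Whether it actually meets a given clock target still has to be checked after place & route.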

Adding registers often causes synthesis to infer SRLs. These make it harder to meet timing, so you might set the synthesis attribute to turn SRLs off for the pipeline registers. Likewise, XST can make mistakes. Once it thinks it can meet its goals, it will stop trying to improve speed and focus on area. This is because area optimizations usually use less routing resources, freeing them up for other logic.

ISE has SmartGuide and has had support in the past for partitions, though the former wasn't stable in the past, and I think the latter has been moved to PlanAhead.

Another method is to look for failing nets, and then add constraints to the synthesizer to hint that they should be improved. SmartXplorer is another good tool because you can see if a design fails due to logic design, or due to placement. Bad logic will fail in most cases, while good logic will sometimes fail if it becomes route-starved. The result is that some nets fail on most builds, while others fail occasionally.

mrflibble · Apr 29, 2011

Thanks for your reply!

> This is a somewhat good method. The big thing to remember is to keep feedback paths as long (in registers) as possible, especially for control. This is because as the allowed latency shrinks, it becomes much more difficult to pipeline a system.

Understood. Luckily in this particular case I can tolerate a fair amount of latency, as it is mostly open loop.

> I think a lot of people will manually pipeline some of the logic. E.g., in your case above you could break that into multiple stages by hand.

Yeah, the above can be broken into parts. But the thing is ... functionally it will (hopefully) be the same as the one-part solution. It would be nice if I could write the high-level HDL for it, and have the tool assist in breaking it down so that it fits nicely into LUTs.

Right now I am better off breaking it up myself to make sure it fits nicely into LUT6s... Which is all well and good, but 1) this takes time and 2) in a few years, when people decide to move to a LUT8 architecture or whatever, it will suddenly be a whole lot less optimal.

> Adding registers often causes synthesis to infer SRLs. These make it harder to meet timing, so you might set the synthesis attribute to turn SRLs off for the pipeline registers. Likewise, XST can make mistakes. Once it thinks it can meet its goals, it will stop trying to improve speed and focus on area. This is because area optimizations usually use less routing resources, freeing them up for other logic.

With regard to that, I notice I made a mistake in the above module. I intended to prevent SRLs, but that's not what I did.

I've already run into this issue once before. Xilinx's advice is to add a (* KEEP = "TRUE" *) attribute to the registers to prevent SRLs. However, I am uncertain what KEEP will do in combination with REGISTER_BALANCING enabled. Any experience with that?

> ISE has SmartGuide and has had support in the past for partitions, though the former wasn't stable in the past, and I think the latter has been moved to PlanAhead.

Is SmartGuide the thing where you can use the result from previous place & route runs? I did notice that before but never used it. I get the impression that Xilinx is moving more and more stuff towards PlanAhead. As far as I am concerned that is a good thing, if only they hurried up.

Quite a lot of things that I use FPGA Editor for, I still cannot do with PlanAhead.

> Another method is to look for failing nets, and then add constraints to the synthesizer to hint that they should be improved.

How do I do that? I can find the failing nets just fine, but what synthesizer hints should I give XST?

> SmartXplorer is another good tool because you can see if a design fails due to logic design, or due to placement. Bad logic will fail in most cases, while good logic will sometimes fail if it becomes route-starved. The result is that some nets fail on most builds, while others fail occasionally.

Not 100% sure I follow what you mean here. Do you mean... Take the design, let SmartXplorer try a whole bunch of design strategies. If they all fail ==> bad logic. If some strategies fail and some succeed ==> route starvation?

On the subject of route starvation, a few weeks ago I ran into something that had me puzzled for a while ... but while typing it I realized it was a bit too long and off topic. So if you're interested in FUN routing problems, see this other thread.

(FUN after application of tragedy + time equals humor)

permute · Apr 29, 2011

What exactly are the timing constraints? You generally don't need to get logic down to just 1 level until you get pretty aggressive with the clock rate. It is possible you have a high amount of fanout on the inputs or outputs. There is an old recommendation to "register the inputs and outputs of each module". This came about mainly for other reasons, but it generally gives nets a full cycle for routing from one area to another, and allows a module to be fairly self-contained.

Another good method is to run everything as slow as is allowed, e.g., make sure the non-critical logic is placed in a slow clock domain. For a high speed system, you might do some high speed logic in one clock domain, then transfer it into a faster clock domain for higher clock rate but simpler processing (e.g., packetization).

SmartXplorer can be set up to iterate over cost tables. The biggest thing the cost tables influence is general placement. E.g., that packet buffer could be placed anywhere on the IC. It will probably be placed near the GTP tiles that eventually get its data, but it might make more sense to place it closer to the DSP logic that generates the data for the packets. The cost tables will affect the placement of these things. Some builds are heavily influenced by the location of the more sparse resources -- the DSPs and RAMs. Logic that just can't work will always fail. But sometimes you'll get some otherwise good nets that fail just because the placement was poor (usually resulting in congestion).

XST at one time had an XCF file. IIRC there were options to tell the synthesis tool that some nets needed to run faster than the clock period implied. It was a way to get the synthesizer to change the output of things like register duplication and retiming.

I find PlanAhead funny, mainly because they already had something called ISE, the "Integrated Software Environment". Eventually I could see ISE and PlanAhead merging, probably before they add SystemVerilog or VHDL-2008 support...

IIRC, Xilinx also allows you to specify SHREG_EXTRACT or such. KEEP is useful if you also need to have a KEEP in the UCF, as it allows you to find the net more easily. It is annoying because MAP will do some logic optimizations and doesn't have access to the synthesis attributes (from what I can tell).
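For reference, a minimal sketch of where such an attribute would go on pipeline registers. The module and register names here are made up for illustration. SHREG_EXTRACT is the XST shift register extraction constraint, but the exact spelling and scope are worth checking against the XST user guide for the ISE version in use, and I can't say how it interacts with REGISTER_BALANCING.

```verilog
`default_nettype none

// Hypothetical 2-stage pipeline delay whose registers should stay as
// individual flip-flops instead of being packed into an SRL.
module no_srl_pipe (
    input  wire        clk,
    input  wire [16:0] d,
    output wire [16:0] q
    );

// Assumption: the attribute is applied per register declaration.
(* SHREG_EXTRACT = "NO" *) reg [16:0] stage1;
(* SHREG_EXTRACT = "NO" *) reg [16:0] stage2;

always @(posedge clk) begin
    stage1 <= d;
    stage2 <= stage1;
end

assign q = stage2;

endmodule // no_srl_pipe
`default_nettype wire
```

The synthesis report should show whether any shift registers were still inferred for these signals.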

mrflibble · Apr 29, 2011

The design has a 400 MHz clock for the fastest parts, and a 200 MHz clock. This is on a speed grade -2 Spartan-6. And the clock is a bit negotiable. If it really cannot be done, then a design compromise would be 350 MHz for the fast clock.

I get a 192-bit result on the 400 MHz clock, and have to encode this to an 8-bit value. The 192 bits are spread out along the length of a carry chain, so it has to be done in several stages, not least because of the distance between the LSB and MSB on the carry chain.

I try to run the 1st and 2nd pipeline stage at the full 400 MHz clock because otherwise it uses way too much area. Area being somewhat important since I need to have multiple instances of this carry chain + encoder. After that it is doable to split it, and do 2 datapaths running at 200 MHz.

As for the fanout, I do try to keep an eye on that. For the 400 MHz part the fanout for the driving flip-flops is usually on the order of 3, and the amount of delay the PlanAhead timing analyzer displays for that particular amount of fanout looks reasonable. (Well, apart from 2 paths that get routed in a suboptimal fashion.)

With regards to registering both the inputs and outputs, I've read the recommendations. Registering all the outputs IMO is a must, and registering all the inputs is a matter of "it depends".

If a big issue is reusability then by all means encapsulate it. Make sure the boundaries of the module-to-be-reduced are well defined. So not only register the outputs but also the inputs.

However for sub-modules I only register the outputs. If for no other reason than that if I were to register the inputs as well, this will add 1 extra cycle of latency to every single sub-module. Even where it is not really needed. Which would make things a bit slow I would think. Or maybe I am wrong in that assumption... What would be helpful however is some general way to take any module, and put a wrapper around it with configurable depth shift registers for both the input and output registers, to make retiming easier.

Currently the only way I can think of to automate this is use an external program to parse the module declaration, find the inputs + outputs, and generate the wrapper module. That is easy enough to do with a quick perl script, but hopefully there are faster/cleaner ways to do that...
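For a single module, the hand-written version of such a wrapper might look like the sketch below. The wrapper name and the IN_DEPTH parameter are hypothetical; it instantiates the edge_detector_16bit module posted earlier with RETIME = 0 and puts a configurable number of register stages in front of it. One assumption to flag: the registers can only be balanced into the core if synthesis is allowed to optimize across the module boundary.

```verilog
`default_nettype none

// Hypothetical retiming wrapper: IN_DEPTH register stages on the inputs
// of edge_detector_16bit, left for register balancing to push forward.
(* REGISTER_BALANCING = "YES" *)
module edge_detector_16bit_wrap #(
    parameter IN_DEPTH = 2   // number of input register stages, must be >= 1
    ) (
    input  wire        clk,
    input  wire        prev,
    input  wire [15:0] data,
    output wire  [3:0] index,
    output wire        detect,
    output wire        polarity
    );

localparam W = 17;  // width of {data, prev}

// Flat shift register, newest sample in the low W bits.
// No initialization value, same as the retiming register in the original.
reg [W*IN_DEPTH-1:0] in_sr;

always @(posedge clk) begin
    // The concatenation is W bits wider than in_sr; the oldest sample
    // (the upper W bits) is deliberately truncated away.
    in_sr <= {in_sr, data, prev};
end

// Oldest sample, i.e. the inputs delayed by IN_DEPTH clocks.
wire [W-1:0] delayed = in_sr[W*IN_DEPTH-1 -: W];

edge_detector_16bit #(
    .RETIME   (0)
    ) u_core (
    .clk      (clk),
    .prev     (delayed[0]),
    .data     (delayed[16:1]),
    .index    (index),
    .detect   (detect),
    .polarity (polarity)
    );

endmodule // edge_detector_16bit_wrap
`default_nettype wire
```

A generated wrapper would follow the same pattern, with the port list and W filled in by the script.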

> SmartXplorer can be set up to iterate over cost tables. The biggest thing the cost tables influence is general placement. E.g., that packet buffer could be placed anywhere on the IC. It will probably be placed near the GTP tiles that eventually get its data, but it might make more sense to place it closer to the DSP logic that generates the data for the packets. The cost tables will affect the placement of these things. Some builds are heavily influenced by the location of the more sparse resources -- the DSPs and RAMs. Logic that just can't work will always fail. But sometimes you'll get some otherwise good nets that fail just because the placement was poor (usually resulting in congestion).

Thanks! Good to know that. I have used SmartXplorer both to try the different strategies and to iterate over multiple cost table values, but I didn't really know what these cost tables were doing... Now that you mention this, I can see that some parts of the design would be better off with different cost table values. Is it possible to break up a design into parts, and use different mapping methods / cost table values for each? Or is that part of the "design partitioning" in PlanAhead that I really should be trying?

> XST at one time had an XCF file. IIRC there were options to tell the synthesis tool that some nets needed to run faster than the clock period implied. It was a way to get the synthesizer to change the output of things like register duplication and retiming.

I understand what you mean, but I wouldn't know what options would be needed for that...

With regard to names and logic optimization, yeah that can get pretty annoying. I have Keep design hierarchy ON, if for no other reason than to be able to find nets in a post place & route simulation. If I don't use keep hierarchy it becomes a mess real fast. The drawback of course being that it cannot optimize across module boundaries...
