Clock enable routing delay in Spartan-6

mrflibble · Mar 9, 2011

Short version: How do you handle a clock enable that has to travel to the CE pins of flip-flops that are all over the place on the FPGA die?

I have a design that uses clock enables, and for the life of me I cannot get it to work. That is to say, I can get it to work, but not without ugly workarounds.

The situation is pretty simple. I have several modules where the flip-flops only have to clock in data on every other clock cycle. That is, I use a clock enable to relax the timing constraints on these modules. So I have, let's say, 20 modules spread over a large area of the die, and they use:
- the same clock of 375 MHz.
- the same clock enable signal.

For the clock signal we have low-skew global clock nets to distribute the clock. Now as I understand it, a Spartan-6 does not have dedicated nets to distribute clock enables. So how do you distribute the clock enable to all these flip-flops that are spread over a large area? Since the clock frequency is 375 MHz (a period of about 2.67 ns), that only allows for so much routing delay...

I have read about BUFGCEs, but as far as I can tell these would use up an extra global clock net for every "gated" clock. Is this correct? If so, then using BUFGCEs becomes prohibitive real fast, because I need more than one distinct clock enable signal.

Ideally the clock enables get generated (by the tool) locally, since there are plenty of spare flip-flops in the various slices. Problem is, how do I specify this in a clean and maintainable way?

I can for example add an extra register to every module. Sort of a "bring your own clock enable signal" thing. But that to me would seem a bit of a kludge, and it is not guaranteed to work all the time either...

Soooo, how do you handle a clock enable that has to travel to the CE pins of flip-flops that are all over the place on the FPGA die?

permute · Mar 9, 2011

These are some of the difficulties with these useful methods. Some of the choices might be influenced by the people you work with. Both single-input clock enables and passing valids are common interfaces.

It is often difficult to trick the tools into doing what you want. Ideally, you want to have some pipelined clock enable for your scheme. The issue is that the tools can often see that certain clock enables are the same, and apply logic reductions that end up adding to the problem. You might find that you need to specify register duplication, max fanout, and possibly KEEPs for certain parts of the design.
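A minimal sketch of the "pipelined, duplicated clock enable" idea, for illustration only. The module name `ce_fanout` and the `COPIES` parameter are made up here, and the `keep` and `equivalent_register_removal` attributes are written the way I understand XST's Verilog attribute syntax; treat them as assumptions and check them against the documentation for your tool version.

```verilog
// Hypothetical helper: one registered copy of the clock enable per region.
// Each consumer group is then wired to its own bit of ce_out.
module ce_fanout #(
    parameter COPIES = 4                       // number of local CE copies (illustrative)
    ) (
    input                   clk,               // IN:  system clock
    input                   ce_in,             // IN:  clock enable from the top level
    // Assumed XST attributes: ask the tool not to merge the identical registers
    (* keep = "true", equivalent_register_removal = "no" *)
    output reg [COPIES-1:0] ce_out             // OUT: one registered CE copy per region
    );

initial ce_out = {COPIES{1'b0}};

// Every copy registers the same value, so the copies stay aligned with each other.
always @(posedge clk) begin
    ce_out <= {COPIES{ce_in}};
end

endmodule // ce_fanout
```

Note that the extra register delays the enable by one cycle. For an enable that alternates every cycle this swaps the even and odd phases, so every consumer should use one of the registered copies rather than a mix of `ce_in` and `ce_out`.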

If you have a lot of nets that use the same clock enable, you can use the BUFGCEs, or possibly use the global net for routing. This makes sense because it allows fast routing resources to be given to other nets. Just remember that a gated clock and a clock enable are different: the FFs prioritize reset over CE, so if rst and !ce are both asserted at a FF, the FF will reset. With a gated clock, if rst is asserted but no clock edge occurs, the FF will not be reset. (This assumes synchronous resets like those in the DSP slices; async resets will still work.)
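The reset-versus-CE priority can be written out as a small sketch. The module and port names are illustrative only.

```verilog
// Flip-flop with synchronous reset and clock enable, reset having priority.
module ff_sync_rst_ce (
    input      clk,   // IN:  free-running clock
    input      rst,   // IN:  synchronous reset
    input      ce,    // IN:  clock enable
    input      d,     // IN:  data
    output reg q      // OUT: registered data
    );

initial q = 1'b0;

always @(posedge clk) begin
    if (rst) begin
        q <= 1'b0;    // reset wins, even in a cycle where ce is low
    end else if (ce) begin
        q <= d;       // data is only captured when ce is high
    end
end

endmodule // ff_sync_rst_ce
```

If the same flip-flop were instead fed from a gated clock, it would see no rising edge at all while the gate is closed, so a synchronous `rst` pulse arriving during that time would simply be missed.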

mrflibble · Mar 9, 2011

Hi permute, thanks for your reply!

permute said: These are some of the difficulties with these useful methods. Some of the choices might be influenced by the people you work with. Both single-input clock enables and passing valids are common interfaces.

Well, since Verilog + FPGA tinkering is a hobby thing for me, the choices are mainly influenced by whatever I can find on the internet and in books. When you say that passing valids is a common interface, what do you mean by that?

permute said: It is often difficult to trick the tools into doing what you want. Ideally, you want to have some pipelined clock enable for your scheme. The issue is that the tools can often see that certain clock enables are the same, and apply logic reductions that end up adding to the problem. You might find that you need to specify register duplication, max fanout, and possibly KEEPs for certain parts of the design.

No kidding! Sometimes getting the tools to Do-What-I-Want takes up most of my time. No doubt that is due to me still being a relative newbie, and not sitting next to someone more experienced whom I can ask. But still...

As for register duplication and max fanout .. I did try precisely that, and I got results that were a solid 8.6 score on the WTF scale. It did duplicate registers. And it put them at very interesting locations, which did not really help in meeting timing constraints. A question related to that ... I have "Keep Hierarchy" enabled. Could it be that the freedom the tool has to do register duplication is being negatively impacted by the Keep Hierarchy?

permute said: If you have a lot of nets that use the same clock enable, you can use the BUFGCEs, or possibly use the global net for routing. This makes sense because it allows fast routing resources to be given to other nets. Just remember that a gated clock and a clock enable are different: the FFs prioritize reset over CE, so if rst and !ce are both asserted at a FF, the FF will reset. With a gated clock, if rst is asserted but no clock edge occurs, the FF will not be reset. (This assumes synchronous resets like those in the DSP slices; async resets will still work.)

Yes, I did consider the BUFGCEs. But each output of a BUFGCE uses an entire global clock net to do its thing, right? Some modules run at full speed (375 MHz), some slower part A runs on the even clock cycles and some slower part B runs on the odd clock cycles. If I understand it right, then this would use 3 global clock nets when using the BUFGCE approach, as opposed to only 1 global clock net when I use some other way to distribute the clock enables. Correct?

In this design the global clock nets are at a premium, because I really really need them for other functionality.

Anyways, thank you for your advice! It's really helpful.

permute · Mar 10, 2011

E.g., each module accepts din and din_valid and provides dout and dout_valid. The exact names may vary based on who writes the module; Xilinx, for example, likes using "ready" and "new_data" in a lot of their cores. There are several interface strategies, and really, defining and understanding the interfaces in a design can go a long way toward solving some issues. For example, with the interface above, the valid input might propagate through the pipeline alongside the data, or the design might use "valid" as a clock enable. In the first case the input data can stop; in the second case it cannot (any data still in the pipeline will not be flushed). Sometimes, moving to interfaces that process blocks of data at a time can be very advantageous, as the control logic can run very slowly: it only needs to make decisions every (e.g.) 1024 cycles.
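To make the two "valid" styles concrete, here is a small sketch with both side by side. The module name, the width parameter `W`, and the two-stage depth are illustrative assumptions, not taken from any particular core.

```verilog
// Two-stage pipeline, written twice: once with valid travelling alongside
// the data, once with valid used as a clock enable.
module valid_styles #(
    parameter W = 8                  // data width (illustrative)
    ) (
    input              clk,
    input      [W-1:0] din,
    input              din_valid,
    output reg [W-1:0] dout_a,       // style A: data
    output reg         dout_a_valid, // style A: valid delayed with the data
    output reg [W-1:0] dout_b        // style B: data, only advances when enabled
    );

reg [W-1:0] stage_a;
reg         stage_a_valid;
reg [W-1:0] stage_b;

initial begin
    stage_a_valid = 1'b0;
    dout_a_valid  = 1'b0;
end

// Style A: the pipeline advances on every clock. If din_valid goes low,
// whatever is already in flight still drains out, flagged by dout_a_valid.
always @(posedge clk) begin
    stage_a       <= din;
    stage_a_valid <= din_valid;
    dout_a        <= stage_a;
    dout_a_valid  <= stage_a_valid;
end

// Style B: din_valid acts as the clock enable for the whole pipeline.
// If din_valid stays low, stage_b keeps its value and never reaches dout_b.
always @(posedge clk) begin
    if (din_valid) begin
        stage_b <= din;
        dout_b  <= stage_b;
    end
end

endmodule // valid_styles
```

Style B has no output valid here; a real design would also need some way of knowing when the pipeline has filled.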

Can you give some examples of nets that are failing?

For the BUFGs, you can also use the 1x clock plus both edges of a divided clock, as long as you have logic to start processing on the correct edge (e.g., the first divided clock edge after reset will be either rising or falling, and the logic should be set up to handle either case). This might have other issues in your design.

mrflibble · Mar 14, 2011

permute said: E.g., each module accepts din and din_valid and provides dout and dout_valid. The exact names may vary based on who writes the module; Xilinx, for example, likes using "ready" and "new_data" in a lot of their cores.

Ah okay, I understand what you mean now.

permute said: There are several interface strategies, and really, defining and understanding the interfaces in a design can go a long way toward solving some issues. For example, with the interface above, the valid input might propagate through the pipeline alongside the data, or the design might use "valid" as a clock enable. In the first case the input data can stop; in the second case it cannot (any data still in the pipeline will not be flushed). Sometimes, moving to interfaces that process blocks of data at a time can be very advantageous, as the control logic can run very slowly: it only needs to make decisions every (e.g.) 1024 cycles.

In this particular pipeline I have a continuous datastream, where the pipeline output is either used or discarded, depending on some trigger conditions. The discard rate (~10%) is low enough and the pipeline depth high enough that this is the best I can think of.

permute said: Can you give some examples of nets that are failing?

See below for what I am using now. This very same code fails if I use only the top-level clock enable. Think "USE_LOCAL_CE=0" on all levels in the code below.

permute said: For the BUFGs, you can also use the 1x clock plus both edges of a divided clock, as long as you have logic to start processing on the correct edge (e.g., the first divided clock edge after reset will be either rising or falling, and the logic should be set up to handle either case). This might have other issues in your design.

This would still need an extra clock, right? Which admittedly is better than the previous 2 extra clocks...

I still need the full 375 MHz clock for the fast part, and then I would need a slow clock at 187.5 MHz. I would then use the slow clock posedge for the even clock cycles and the slow clock negedge for the odd clock cycles. But it is an idea worth considering for this particular design. Two extra global clocks was a no-go. One extra might just be doable.
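A rough sketch of that posedge/negedge split, just to show its shape. The names are illustrative, and `clk_div` is assumed to be a 50% duty-cycle 187.5 MHz clock that is phase-aligned to the 375 MHz clock and sits on its own global net.

```verilog
// "Even" registers on the rising edge and "odd" registers on the falling
// edge of a single divided clock.
module dual_phase_regs #(
    parameter W = 8                // data width (illustrative)
    ) (
    input              clk_div,    // IN:  divided clock (assumed 187.5 MHz, 50% duty cycle)
    input      [W-1:0] d_even,     // IN:  data for the even-cycle logic
    input      [W-1:0] d_odd,      // IN:  data for the odd-cycle logic
    output reg [W-1:0] q_even,     // OUT: updated on rising edges of clk_div
    output reg [W-1:0] q_odd       // OUT: updated on falling edges of clk_div
    );

always @(posedge clk_div) begin
    q_even <= d_even;
end

always @(negedge clk_div) begin
    q_odd <= d_odd;
end

endmodule // dual_phase_regs
```

Which fast-clock cycle ends up being "even" depends on how the divider starts up after reset, which is the caveat permute raised, so the surrounding logic has to cope with either alignment.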

Of course this does not solve the general problem of how the hell to properly use clock enables.

Because suppose that some part of the design is a perfect match for multi-cycle operation over three cycles? And we want full throughput, so that means three copies of the circuit running in parallel. With 3 there is no such luck of using both edges, so it would mean 1 full-speed clock, and then 3 slow clocks, each with a 120 degree phase shift.

On the other hand, with clock enables I could use an SRL that is preloaded with 100, 010 or 001 for the respective phases. These SRLs could then be instantiated as several local copies to keep the routing delay to the CE pins low. Again, conceptually simple. Doing this by hand would be a pain to maintain; doing this with the tools is ARRRRGH with my current understanding of the tools.
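The preloaded three-phase enable could look roughly like this. It is only a sketch: the module name `ce_ring3` and the `PHASE` parameter are made up for illustration, and whether the tools map the ring onto an SRL or onto three plain flip-flops is up to synthesis.

```verilog
// One-hot ring of three bits that rotates every clock; ce is high one
// cycle out of every three.
module ce_ring3 #(
    parameter [2:0] PHASE = 3'b100   // preload: 3'b100, 3'b010 or 3'b001
    ) (
    input  clk,                      // IN:  system clock
    output ce                        // OUT: clock enable for this phase
    );

// A preload with no bit set (3'b000) would never assert ce.
reg [2:0] ring = PHASE;

// Rotate right by one position on every clock.
always @(posedge clk) begin
    ring <= {ring[0], ring[2:1]};
end

assign ce = ring[0];

endmodule // ce_ring3
```

With the three preloads, the enables are high on different cycles of each group of three: 3'b001 on the first, 3'b010 on the second, and 3'b100 on the third. One instance per region (all with the same `PHASE` for a given phase) keeps each CE net short.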

Currently I am just providing the affected modules with the ability to generate their own local clock enables. So like this, using the USE_LOCAL_CE parameter:

Code:

module count_192_ones_ce  #(
    parameter          USE_LOCAL_CE  = 0,  // Generate a local flip-flop for clock enable?   0=NO,   1=YES
                       LOCAL_CE_INIT = 0   // Start local clock enable at EVEN or ODD cycle? 0=EVEN, 1=ODD
    ) (
    input              clk,                // IN:  system clock
    input              ce,                 // IN:  clock enable for multi-cycle path operation
    input      [191:0] ones_in,            // IN:  taps after synchronization
    output reg   [7:0] count               // OUT: number of ones in the "taps" input. Latency 10 (5 deep, 2 cycles each)
    );

// Check parameter values + initialize output registers
initial begin
    if ((USE_LOCAL_CE < 0) || (USE_LOCAL_CE > 1)) begin
        $display("DRC ERROR: Illegal value %d for USE_LOCAL_CE parameter. Should be 0 or 1.", USE_LOCAL_CE);
        $finish;
    end
    if ((LOCAL_CE_INIT < 0) || (LOCAL_CE_INIT > 1)) begin
        $display("DRC ERROR: Illegal value %d for LOCAL_CE_INIT parameter. Should be 0 or 1.", LOCAL_CE_INIT);
        $finish;
    end else begin
        count = 0;
    end
end


reg local_ce = (LOCAL_CE_INIT);

always @(posedge clk) begin
    local_ce <= (~local_ce);
end


wire active_ce = (USE_LOCAL_CE == 1) ? (local_ce) : (ce); // unused one (either "ce" or "local_ce") will be optimized away

wire [95:0] part1 = ones_in[95:0];
wire [95:0] part2 = ones_in[191:96];

wire [6:0] count_part1;
wire [6:0] count_part2;



count_96_ones_ce  #(
    .USE_LOCAL_CE  (1),
    .LOCAL_CE_INIT (LOCAL_CE_INIT)
    ) count_96_ones_part1 (
    .clk           (clk), 
    .ce            (active_ce), 
    .ones_in       (part1), 
    .count         (count_part1)
    );

count_96_ones_ce  #(
    .USE_LOCAL_CE  (1),
    .LOCAL_CE_INIT (LOCAL_CE_INIT)
    ) count_96_ones_part2 (
    .clk           (clk), 
    .ce            (active_ce), 
    .ones_in       (part2),
    .count         (count_part2)
    );


always @(posedge clk) begin
    if (active_ce) begin
        count <= (count_part1) + (count_part2);
    end
end

endmodule // count_192_ones_ce

And then further down the tree the same structure.

Code:

module count_96_ones_ce  #(
    parameter USE_LOCAL_CE  = 0,  // Generate a local flip-flop for clock enable?   0=NO,   1=YES
              LOCAL_CE_INIT = 0   // Start local clock enable at EVEN or ODD cycle? 0=EVEN, 1=ODD
    ) (
    input             clk,        // IN:  system clock
    input             ce,         // IN:  clock enable for multi-cycle path operation
    input      [95:0] ones_in,    // IN:  taps after synchronization
    output reg  [6:0] count       // OUT: number of ones in the "taps" input. Latency 10 (5 deep, 2 cycles each)
    );

// Check parameter values.
initial begin
    if ((USE_LOCAL_CE < 0) || (USE_LOCAL_CE > 1)) begin
        $display("DRC ERROR: Illegal value %d for USE_LOCAL_CE parameter. Should be 0 or 1.", USE_LOCAL_CE);
        $finish;
    end
    if ((LOCAL_CE_INIT < 0) || (LOCAL_CE_INIT > 1)) begin
        $display("DRC ERROR: Illegal value %d for LOCAL_CE_INIT parameter. Should be 0 or 1.", LOCAL_CE_INIT);
        $finish;
    end else begin
        count = 0;
    end
end


reg local_ce = (LOCAL_CE_INIT);

always @(posedge clk) begin
    local_ce <= (~local_ce);
end


wire active_ce = (USE_LOCAL_CE == 1) ? (local_ce) : (ce); // unused one (either "ce" or "local_ce") will be optimized away

wire [47:0] part1 = ones_in[47:0];
wire [47:0] part2 = ones_in[95:48];

wire   [5:0] count_part1;
wire   [5:0] count_part2;



count_48_ones_ce  #(
    .USE_LOCAL_CE  (0),
    .LOCAL_CE_INIT (LOCAL_CE_INIT)
    ) count_48_ones_part1 (
    .clk     (clk), 
    .ce      (active_ce), 
    .ones_in (part1), 
    .count   (count_part1)
    );

count_48_ones_ce  #(
    .USE_LOCAL_CE  (0),
    .LOCAL_CE_INIT (LOCAL_CE_INIT)
    ) count_48_ones_part2 (
    .clk     (clk), 
    .ce      (active_ce), 
    .ones_in (part2), 
    .count   (count_part2)
    );


always @(posedge clk) begin
    if (active_ce) begin
        count <= (count_part1) + (count_part2);
    end
end

endmodule // count_96_ones_ce

So at every level I can decide either to keep the clock enable from the upper-level module, or to create a local clock enable and propagate that. This does work, but I would prefer something a bit neater. Since I'm pretty new to this, this is the best I could come up with that actually works. So any better approaches are welcome.
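One possible tidy-up, offered only as a sketch: pull the select logic that is repeated in each counter module into one small helper, and use a generate block so that the toggle flip-flop is only created when it is actually wanted. The module name `local_ce_sel` is hypothetical; the parameters mirror the ones used above.

```verilog
// Either passes the incoming clock enable through, or generates a local
// one that toggles every clock cycle.
module local_ce_sel #(
    parameter USE_LOCAL_CE  = 0,     // Generate a local flip-flop for clock enable?   0=NO,   1=YES
    parameter LOCAL_CE_INIT = 0      // Start local clock enable at EVEN or ODD cycle? 0=EVEN, 1=ODD
    ) (
    input  clk,                      // IN:  system clock
    input  ce,                       // IN:  clock enable from the upper level
    output active_ce                 // OUT: clock enable to use inside the module
    );

generate
    if (USE_LOCAL_CE == 1) begin : g_local
        // Local enable: the upper-level ce is not used in this branch.
        reg local_ce = (LOCAL_CE_INIT != 0);
        always @(posedge clk) begin
            local_ce <= ~local_ce;
        end
        assign active_ce = local_ce;
    end else begin : g_passthrough
        // No local flip-flop is created; ce is passed straight through.
        assign active_ce = ce;
    end
endgenerate

endmodule // local_ce_sel
```

Each counter module would then instantiate this once and use `active_ce` exactly as before. It does not change the timing behaviour; it only removes the duplicated boilerplate.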
