Negative Slack / Report Analysis

player80 · Mar 9, 2019

------------------------------------------------
Path Begin : dcfx/fifo/d2_gray_enc_rd_addr_i_i11/Q
Path End : dcfx/fifo/w_ptr_diff_i__i13/D
Source Clock : clk_in
Destination Clock : clk_in
Logic Level : 27
Delay Ratio : 68.6% (route), 31.4% (logic)
Setup Constraint : 16666 ps
Path Slack : -17103 ps (Failed)

-------------------------------------------------

Can someone help me to understand this report?

The path slack is -17.103 ns and the setup constraint is 16.666 ns. Does that mean a slack of -15 ns would be okay?

The entire design works; I have tested it over days. Initially the slack was 64 ns (that seriously failed); after rewriting the FIFO with a different strategy it went down to the value above.

vGoodtimes · Mar 9, 2019

No, you don't want negative slack at all. Although just having negative slack in static analysis doesn't mean the design will always fail. It means that the design could fail to match simulation and could fail in unexpected ways.

std_match · Mar 9, 2019

This is a very large error. You must fix it with a design change. The fact that the design works in the lab means nothing. It can fail randomly at any time.

From the error message and the number of logic levels, we can conclude that this is the Gray encoding of a counter with many bits, probably for a clock domain crossing.
You can fix it with pipelining, or simply by sampling the counter every n (2 or greater) clock cycles and updating the register with the Gray value at the same time.
There is no need to transfer FIFO read/write pointers to another clock domain every clock cycle.

player80 · Mar 9, 2019

thanks.

How is this done in the real world out there?

I have updated the design a bit but I cannot get the slack better than -12.
Are such things just ignored in FPGA designs if they are tested fine in the lab for some time?

For example, the tool complains about every asynchronous reset. I don't even have a permanent clock on all the input modules, so I have to use an asynchronous reset approach, no?

player80 · Mar 9, 2019

I have added a false_path to the reset network.

std_match · Mar 9, 2019

Some timing errors can be accepted, but they must be analyzed to be safe.
Your problem with the gray-coded counter is not such an error. It can never be accepted by lab testing.
It must be corrected by a design change.
The whole point with the gray-coding is to transfer the counter value to another clock domain so that only one bit at a time can change. This is violated with your timing error.
It should be easy to correct. Please show the source code.
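
To make the "only one bit at a time" property concrete, here is a minimal Python sketch (a numeric model, not the poster's HDL). The 14-bit width is an assumption taken from the report, which shows bit 13 of w_ptr_diff_i; the helper names are made up for illustration.

```python
def bin_to_gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


WIDTH = 14          # assumed pointer width (hypothetical, see lead-in)
COUNT = 1 << WIDTH  # a power-of-two counter wraps from COUNT-1 back to 0

# Bits that flip on each increment of the Gray-coded counter, wrap included.
gray_steps = [
    bits_changed(bin_to_gray(i), bin_to_gray((i + 1) % COUNT))
    for i in range(COUNT)
]

# The same increments in plain binary can flip many bits at once.
binary_steps = [bits_changed(i, (i + 1) % COUNT) for i in range(COUNT)]

print("max bits flipped per step, Gray:  ", max(gray_steps))
print("max bits flipped per step, binary:", max(binary_steps))
```

The guarantee only holds at the output of the register that stores the Gray value. If the path into the sampling register fails setup, several bits can be caught mid-transition, which is the violation described above.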

vGoodtimes · Mar 9, 2019

Also, if this is a test "sandbox" design, you need to disable io-buffer insertion. The timing constraint is low and the logic doesn't sound like something that would be an issue. With io-buffer insertion, the synthesis tools might be placing some inputs on actual physical io pins that are not located near each other. That might explain the 60% route delays.

player80 · Mar 10, 2019

Thanks for all the valuable feedback!

The problem seems to be the FIFO size calculation. The FIFO size is not a power of 2 (it's 61440 bits), which is why I'm doing the weird calculation (as I would do it in software).
I have changed it to nonblocking assignments and the problem is still there, so I'm analysing the output in the netlist now...

Code:

conv_bin_rd_addr_i_1 <= conv_bin_rd_addr_i;
bin_enc_wr_addr_i_1 <= bin_enc_wr_addr_i;
					 
if (bin_enc_wr_addr_i_1 > conv_bin_rd_addr_i_1)
    w_ptr_diff_i <= bin_enc_wr_addr_i_1 - conv_bin_rd_addr_i_1;
else if (conv_bin_rd_addr_i_1 > bin_enc_wr_addr_i_1)
    w_ptr_diff_i <= ((FIFO_DEPTH - conv_bin_rd_addr_i_1) + bin_enc_wr_addr_i_1);
else
    w_ptr_diff_i <= 14'b0;

Code:

Path Begin                                   : dcfx/fifo/conv_bin_rd_addr_i_1_i0/Q
Path End                                     : dcfx/fifo/w_ptr_diff_i__i13/D
Source Clock                                 : clk_in
Destination Clock                            : clk_in
Logic Level                                  : 33
Delay Ratio                                  : 62.3% (route), 37.7% (logic)
Setup Constraint                             : 16666 ps 
Path Slack                                   : -14969 ps  (Failed)


  Destination Clock Arrival Time (clk_in:R#2): 16666
+ Destination Clock Source Latency           : 0
- Destination Clock Uncertainty              : 0
+ Destination Clock Path Delay               : 6020
- Setup Time                                 : 199
---------------------------------------------  --------
End-of-path required time( ps )              : 22487


  Source Clock Arrival Time (clk_in:R#1)     : 0
+ Source Clock Source Latency                : 0
+ Source Clock Path Delay                    : 6020
+ Data Path Delay                            : 31437
---------------------------------------------  --------
End-of-path arrival time( ps )               : 37457

I also kind of understand why it is working for me: the input clock runs at 60 MHz, but data only comes in on every 16th cycle. Because of the Gray code the problem is not visible in practice, since there is more than enough time to complete the whole operation.
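
For anyone reading the report above for the first time, the slack is simply the required time minus the arrival time. The short Python sketch below redoes that arithmetic with the numbers from the report; the variable names are my own. It gives -14970 ps, while the report prints -14969 ps, presumably because the tool works from unrounded values.

```python
# Values copied from the timing report above, all in picoseconds.
clock_period = 16666      # Destination Clock Arrival Time (second rising edge)
clock_path_delay = 6020   # identical on the source and destination clock paths
setup_time = 199
data_path_delay = 31437

# End-of-path required time: capture edge + clock path delay - setup time.
required_time = clock_period + clock_path_delay - setup_time

# End-of-path arrival time: launch edge (0) + clock path delay + data path delay.
arrival_time = 0 + clock_path_delay + data_path_delay

slack = required_time - arrival_time

# Shortest clock period at which this path would just meet setup,
# given that the two clock path delays cancel each other.
min_period = data_path_delay + setup_time
max_freq_mhz = 1e6 / min_period

print("required:", required_time, "ps")
print("arrival: ", arrival_time, "ps")
print("slack:   ", slack, "ps")
print("path could run at about", round(max_freq_mhz, 1), "MHz")
```

So a slack of -15 ns is not "within" the 16.666 ns constraint. It means the data arrives roughly 15 ns after the latest moment at which the destination register could still capture it reliably.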

player80 · Mar 10, 2019

I have so much time between FIFO commits that the Gray code is not really used.

Since the buffer is not a power of 2, as mentioned, a regular Gray code doesn't work.

With 7680 elements the Gray code counter should range over 256-7935 in order not to violate the idea of the Gray code.
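
The 256-7935 range can be checked numerically. The Python sketch below assumes a 13-bit binary-reflected Gray code and centres the 7680 counts inside the 8192 available codes; the helper names are hypothetical. It compares the wrap of the centred range with the wrap of a naive 0-7679 counter.

```python
def bin_to_gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


WIDTH = 13       # smallest width with 2**WIDTH >= 7680
ELEMENTS = 7680
FIRST = ((1 << WIDTH) - ELEMENTS) // 2   # skip the same number of codes at each end
LAST = (1 << WIDTH) - 1 - FIRST


def next_count(n: int) -> int:
    """Counter that runs FIRST..LAST and then wraps back to FIRST."""
    return FIRST if n == LAST else n + 1


# Worst case over every step of the centred counter, including the wrap.
centred_worst = max(
    bits_changed(bin_to_gray(n), bin_to_gray(next_count(n)))
    for n in range(FIRST, LAST + 1)
)

# Wrap of a naive counter that simply runs 0..ELEMENTS-1.
naive_wrap = bits_changed(bin_to_gray(ELEMENTS - 1), bin_to_gray(0))

print("range:", FIRST, "to", LAST)
print("centred counter, worst step:", centred_worst, "bit(s)")
print("naive counter, wrap step:   ", naive_wrap, "bit(s)")
```

The centred range works because the reflected Gray code is mirror symmetric: the codes for n and 2**WIDTH - 1 - n differ only in the top bit.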

std_match · Mar 10, 2019

The timing problem in post #8 seems to have nothing to do with the gray coding.
It seems that one compare (similar to a subtraction), one subtraction and one addition are done in one clock cycle.

I am not a verilog person, but try this simple change, which will pipeline the subtraction:

Code:

conv_bin_rd_addr_i_1 <= conv_bin_rd_addr_i;
bin_enc_wr_addr_i_1 <= bin_enc_wr_addr_i;
fifo_depth_minus_rd_addr_i_1 <= FIFO_DEPTH - conv_bin_rd_addr_i;
					 
if (bin_enc_wr_addr_i_1 > conv_bin_rd_addr_i_1)
    w_ptr_diff_i <= bin_enc_wr_addr_i_1 - conv_bin_rd_addr_i_1;
else if (conv_bin_rd_addr_i_1 > bin_enc_wr_addr_i_1)
    w_ptr_diff_i <= fifo_depth_minus_rd_addr_i_1 + bin_enc_wr_addr_i_1;
else
    w_ptr_diff_i <= 14'b0;

I think you should show the complete source code for the FIFO, since I see some risks here. We don't know if the code you showed executes in the write clock domain or the read clock domain.
The code above can fail if conv_bin_rd_addr_i isn't synchronized properly from another clock domain.

vGoodtimes · Mar 10, 2019

Are the tools inferring block ram for this? 61440 bit is basically 2 BRAM -- there doesn't seem to be any savings by having a custom size. The custom size seems only to complicate things while possibly forcing the tools into a suboptimal non-bram implementation. (once I had something like a 4097 element ram which the tools forced into a register implementation...)

It also appears to have the same read/write clocks, so I'm not sure why this is done with gray code other than the file being somewhat generic. In that case, you could just keep track of the fifo-size without gray code, or external to the fifo if you have some need to get a specific number of elements before processing.

Also, looking at the logic, it shouldn't have been that bad from just what is shown. every term is basically "bin_enc_wr_addr_i_1 - conv_bin_rd_addr_i_1"

Code:

conv_bin_rd_addr_i_1 <= conv_bin_rd_addr_i;
bin_enc_wr_addr_i_1 <= bin_enc_wr_addr_i;

common_subtraction <= bin_enc_wr_addr_i_1 - conv_bin_rd_addr_i_1;
					 
if (!common_subtraction[msb]) // this does >= here as there is no need to have an extra else at the end
    w_ptr_diff_i <= common_subtraction;
else // this also basically used common_subtraction in the original version.
    w_ptr_diff_i <= FIFO_DEPTH + common_subtraction;

And I have a hard time believing a 3 input 16 bit adder/subtractor being so close to failing 60MHz by itself.
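
As a sanity check that the sign-bit version computes the same fill level as the original three-branch code, here is a small Python model. It is only a sketch: the depth of 7680 and the 13-bit pointers are assumptions, the function names are made up, and the model ignores the extra register stage that common_subtraction adds in the snippet above.

```python
FIFO_DEPTH = 7680           # assumed depth (the thread mentions 7680 elements)
PTR_BITS = 13               # assumed pointer width, enough for 0..FIFO_DEPTH-1
DIFF_BITS = PTR_BITS + 1    # one extra bit so the difference keeps its sign
DIFF_MASK = (1 << DIFF_BITS) - 1


def fill_three_branch(wr: int, rd: int) -> int:
    """Original version: two magnitude comparisons select the expression."""
    if wr > rd:
        return wr - rd
    elif rd > wr:
        return (FIFO_DEPTH - rd) + wr
    else:
        return 0


def fill_sign_bit(wr: int, rd: int) -> int:
    """One subtraction; the MSB of the result decides whether to add FIFO_DEPTH."""
    diff = (wr - rd) & DIFF_MASK               # wraps like a DIFF_BITS-wide register
    negative = (diff >> (DIFF_BITS - 1)) & 1   # two's-complement sign bit
    if not negative:
        return diff
    return (FIFO_DEPTH + diff) & DIFF_MASK


# Compare both versions on a spread of pointer values plus the corner cases.
samples = sorted(set(range(0, FIFO_DEPTH, 37)) | {0, 1, FIFO_DEPTH - 2, FIFO_DEPTH - 1})
mismatches = [
    (w, r)
    for w in samples
    for r in samples
    if fill_three_branch(w, r) != fill_sign_bit(w, r)
]

print("pairs checked:", len(samples) ** 2)
print("mismatches:   ", len(mismatches))
```

Note the extra bit on the difference: with pointers and difference of the same width, the MSB would no longer be a reliable sign.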

player80 · Mar 10, 2019

I'm talented at getting things wrong... OK, I have implemented this one (without pipelining it). The FIFO is at 5.172 ns slack now; compared with the initial 66 ns that's a huge improvement.
The previous approach with pipelining ended up at 8 ns slack, but at the same place, since the problem is obviously the greater-than/less-than comparator. I believed in some magic that would make the comparison fast enough, which is unfortunately not available in that FPGA.

The good news for now: the worst slack is now elsewhere.
First I'm going to remove all greater-than/less-than comparators.

thanks again for the great hint to subtract the value and check for the negative MSB.

vGoodtimes · Mar 10, 2019

player80 said:
thanks again for the great hint to subtract the value and check for the negative MSB.

That's not exactly what I intended. I would expect my coding choice to be marginally better at best. I think you are trying to solve a symptom of a different problem. Possibly a problem related to preemptive optimizations. Assuming the FPGA has BRAM, that is more important than math tricks. The odd size fifo really sounds like an optimization that isn't needed and probably makes things worse.

player80 · Mar 11, 2019

I'm just doing some generic cleanup first, replacing all the variables with signals which at the moment is 90% done.

I think I got something wrong with the fifo before, it's 30x4k. the total of it is a non-power of 2, I'm putting all of them together.
I tried to optimise it during Sunday but I'm still having some issues it seems.

The slack depends on the level of optimisation now and ranges between 3-8ns.

Is it really that difficult to make an FPGA design work properly?

ads-ee · Mar 11, 2019

player80 said:
Is it really that difficult to make an FPGA design work properly?

No it's not difficult. The problem is you keep trying to OPTIMIZE stuff that shouldn't be optimized.
Making custom counters with weird rollovers for a FIFO to optimize the size is not going to optimize anything; instead it will make the design larger and slower.

The address (depth) of a FIFO should be a power of 2. The width is whatever it is and you just use enough block RAMs to get that width without having to multiplex RAMs due to having RAMs that don't match the address depth.

e.g. a 4kx30 FIFO in Xilinx would be implemented as 4096x36 or four 4096x9 BRAMs, as opposed to a multiplexed implementation of four 1024x36 BRAMs.

If that wasn't a typo and you really have a 30 deep FIFO with 4096 bits then you are probably having problems because you don't have a RAM and it is being implemented in FFs.
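
The 4kx30 example above can be put into numbers with a few lines of Python. This is only a sketch: the aspect-ratio list is my assumption of how one 36 Kb Xilinx block RAM can be configured, and bram_usage is a made-up helper, not a vendor function.

```python
from math import ceil

# Assumed (depth, width) configurations of one 36 Kb block RAM.
ASPECT_RATIOS = [(32768, 1), (16384, 2), (8192, 4), (4096, 9), (2048, 18), (1024, 36)]


def bram_usage(depth: int, width: int, prim_depth: int, prim_width: int):
    """Return (number of primitives, number of primitives stacked in depth)."""
    depth_banks = ceil(depth / prim_depth)    # more than 1 means an output mux is needed
    width_slices = ceil(width / prim_width)   # primitives placed side by side
    return depth_banks * width_slices, depth_banks


for prim_depth, prim_width in ASPECT_RATIOS:
    count, banks = bram_usage(4096, 30, prim_depth, prim_width)
    note = f"{banks}:1 output mux" if banks > 1 else "no depth mux"
    print(f"{prim_depth}x{prim_width}: {count} BRAMs, {note}")
```

Under these assumptions both 4096x9 and 1024x36 need four primitives, but only the 4096x9 arrangement avoids the address decode and output multiplexer in the read path.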

player80 · Mar 11, 2019

ads-ee said:
No it's not difficult.

... for me it is unfortunately.

ads-ee said:
e.g. a 4kx30 FIFO in Xilinx would be implemented as 4096x36 or four 4096x9 BRAMs, as opposed to a multiplexed implementation of four 1024x36 BRAMs.

If that wasn't a typo and you really have a 30 deep FIFO with 4096 bits then you are probably having problems because you don't have a RAM and it is being implemented in FFs.

The FIFO is 16 bits wide. OK, I have limited it to 65536 bits (done, the slack went down to 4 ns).

After removing some single-port RAM logic it went down to -1.745 ns (best result so far).
However, I still have no idea how to remove this one:

Code:

Path Begin                                   : dcfx/fifo/gray_bin_conv_wr_addr_u/o_gray_cnt_out_i0/Q
Path End                                     : dcfx/fifo/d1_gray_enc_wr_addr_i_i0/D
Source Clock                                 : clk_in
Destination Clock                            : spi_clk
Logic Level                                  : 2
Delay Ratio                                  : 48.5% (route), 51.5% (logic)
Setup Constraint                             : 2083 ps 
Path Slack                                   : -1745 ps  (Failed)


  Destination Clock Arrival Time (spi_clk:R#8): 218750
+ Destination Clock Source Latency           : 0
- Destination Clock Uncertainty              : 0
+ Destination Clock Path Delay               : 6020
- Setup Time                                 : 199
---------------------------------------------  --------
End-of-path required time( ps )              : 224571


  Source Clock Arrival Time (clk_in:R#14)    : 216666
+ Source Clock Source Latency                : 0
+ Source Clock Path Delay                    : 6020
+ Data Path Delay                            : 3630
---------------------------------------------  --------
End-of-path arrival time( ps )               : 226316

ads-ee · Mar 11, 2019

This is a clock domain crossing. If it is correctly designed as a Gray-coded transfer of the address, the crossing should only be constrained with a max delay between the source and destination registers. Your clock domains should be placed in asynchronous clock groups, or have false paths set between them, so that they stop producing timing paths that are impossible to meet.

When the tool analyzes asynchronous clocks as if they were related, it searches over many cycles for the worst-case launch/capture edge pair, and that path will inevitably ALWAYS fail timing on both setup and hold. That is why the clock arrival times in your report are 218.750 ns for spi_clk (the destination register) and 216.666 ns for clk_in (the source register).

This issue is caused by having an improperly constrained design.
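
The 2083 ps setup constraint in the report above can be reproduced with a short Python sketch. The assumptions are mine: clk_in is 60 MHz as stated earlier in the thread, spi_clk is taken to be 32 MHz (inferred from its eighth rising edge at 218750 ps), and both clocks are treated as having a rising edge at time 0.

```python
from fractions import Fraction

CLK_IN_PERIOD = Fraction(1_000_000, 60)    # ps, 60 MHz launch clock
SPI_CLK_PERIOD = Fraction(1_000_000, 32)   # ps, assumed 32 MHz capture clock

EDGES = 32  # rising edges examined per clock, more than one common period

# Search every launch/capture pair for the smallest positive gap.
# Tuples compare on the gap first, so ties resolve to the earliest edges.
setup_window, launch_idx, capture_idx = min(
    (m * SPI_CLK_PERIOD - k * CLK_IN_PERIOD, k, m)
    for k in range(EDGES)
    for m in range(EDGES)
    if m * SPI_CLK_PERIOD > k * CLK_IN_PERIOD
)

print("launch edge  R#%d at %.0f ps" % (launch_idx + 1, float(launch_idx * CLK_IN_PERIOD)))
print("capture edge R#%d at %.0f ps" % (capture_idx + 1, float(capture_idx * SPI_CLK_PERIOD)))
print("tightest setup window: %.1f ps" % float(setup_window))

# Slack with the delays from the report (ps): required time minus arrival time.
required_time = 218750 + 6020 - 199
arrival_time = 216666 + 6020 + 3630
print("slack:", required_time - arrival_time, "ps")
```

With only about 2 ns between those two edges and a 3.6 ns data path, no placement can close this path, which is why it has to be handled by constraints rather than by optimization.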

player80 · Mar 12, 2019

Thanks for all the hints.
Finally it seems like the fifo is okay

NBR Summary
-----------
Number of unrouted connections : 0 (0.00%)
Number of connections with timing violations : 0 (0.00%)
Estimated worst slack<setup> : 0.454ns
Timing score<setup> : 0

player80 · Mar 15, 2019

I'm doing some more experiments with floor planning at the moment but the results are pretty bad.

Is there any common way which is supposed to work to put an entity/architecture instance into a defined area of an FPGA?

I noticed how inconsistent the synthesis tools are. I'm playing around with grouping multiple instances now and the results are extremely mixed: the slack varies between +0.2 ns and -1.5 ns, depending on how the synthesis got seeded I guess.

If I define 2 instances of the FIFO, sometimes one SLICE from the other side of the FPGA is used for calculating the empty/full state of the buffer; the route delay is then 80% of the path, and the logic level is usually 2 or 3.

std_match · Mar 15, 2019

I think you should show us the complete FIFO source code. We can only guess why you have these problems. We don't know how many clock domains you have, or which clock domain executes the small code snippet you have shown.
It is not clear if you really need the "diff" value. You don't need it to create empty/full flags.
If you are using Xilinx, you will get the fastest FIFO by using the block RAM primitive and tell it to be a FIFO. The block RAMs have integrated FIFO logic.
