High utilization and timing violations

UltraGreen · Sep 20, 2016

Hello All,

I am working on prototyping a design on an UltraScale device.
The LUT utilization is around 98% with timing-optimized synthesis and 90% with area-optimized synthesis.
It also takes a very long time to implement the design.

There are timing violations as well (hold -1.8 ns and setup -3.4 ns with a 20 MHz clock). I tried partitioning a few modules, but at this utilization the partitioning fails and the fitter is unable to fit the rest of the design.
Also, the critical path is in the IP. How should I deal with this?
Any suggestion is highly appreciated. (My apologies for not providing enough numbers and results. Please help theoretically.)

Thanks.

P.S. Block ram utilization is very low.

TrickyDicky · Sep 20, 2016

Do you have any RAMs that are being inferred as logic? They take a long time to synthesise and use a lot of logic. Is there anything else that could be put in RAM?

If the answer is no, then the only answer is a bigger device, if you can't reduce the logic.

FvM · Sep 20, 2016

A reference to the previous thread should be given as background information: https://www.edaboard.com/threads/359199/

UltraGreen said:
hold -1.8 ns as well as setup -3.4 ns with 20 MHz clk

Actually a single clock domain design?

Path delays of about 50 ns suggest your design has nothing in common with state-of-the-art synchronous logic, even allowing for a certain amount of routing delay. It looks like combinational paths of a hundred or more LUTs.
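
As a rough check of the arithmetic behind that estimate, here is a minimal sketch. It assumes the 20 MHz clock and the -3.4 ns setup slack from post #1 belong to the same path, ignores clock skew and uncertainty, and uses a purely illustrative 0.5 ns per LUT level (logic plus local routing). That last figure is an assumption, not a datasheet number.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def path_delay_ns(freq_mhz: float, setup_slack_ns: float) -> float:
    """Approximate data path delay implied by a setup slack.

    Simplification: slack = period - path delay, so skew, clock
    uncertainty and the register setup time are all ignored.
    """
    return clock_period_ns(freq_mhz) - setup_slack_ns


def estimated_logic_levels(delay_ns: float, ns_per_level: float) -> int:
    """Very rough logic depth, given an assumed delay per LUT level."""
    return round(delay_ns / ns_per_level)


# Numbers from post #1; NS_PER_LEVEL is an illustrative assumption.
NS_PER_LEVEL = 0.5
period = clock_period_ns(20.0)
delay = path_delay_ns(20.0, -3.4)
levels = estimated_logic_levels(delay, NS_PER_LEVEL)

print(f"period      : {period:.1f} ns")
print(f"path delay  : {delay:.1f} ns")
print(f"logic levels: roughly {levels}")
```

With a different assumed delay per level the depth estimate scales accordingly, but the order of magnitude stays far above what a pipelined FPGA design would normally show.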

UltraGreen said:
Also the critical path is in the IP.

Hard to believe. Wrong parameters, wrong clock connections?

UltraGreen · Sep 20, 2016

The IP runs on a 100 MHz clock. It's the AXI bridge PCIe3 IP.
Yes, unfortunately there is a huge combinational delay and too many logic levels, and a few modules are completely combinatorial.

I believe that the clock and reset routing might have gone wrong. Can you please suggest a way to check the routing in Vivado, and also ways to optimize it? I did route the global clock and reset through BUFGs.

@Tricky, yes, there is room for inferring more RAMs in the design, as many LUTs are used as RAM. One of the reasons is that the default threshold for implementing block RAM is 5 and many RAMs are below this threshold.

FvM · Sep 20, 2016

UltraGreen said:
The IP runs at 100 MHz clock. Its the axi bridge PCIE3 IP

This statement and the info in post #1 don't match. You now reveal that it's apparently a multi-domain design (or the "20 MHz" stated in post #1 is just erroneous). Saying the critical path is "in the IP" may be a misinterpretation. I would primarily look at the signals going in and out of the AXI bridge: do they use appropriate pipelining, so that they relax rather than tighten the timing inside the IP block?

In the worst case it may turn out that the high resource utilization alone blocks timing closure of the IP, without any particular design fault.

UltraGreen · Sep 20, 2016

Thanks FvM,

I am not sure whether it is an actual CDC, because:
1. The signals from the design unit (@ 20 MHz) go to the AXI clock converter IP, which converts them from 20 MHz to 100 MHz before they reach the AXI PCIe3 bridge IP.
2. The same happens in the other direction: signals coming from the AXI PCIe3 bridge IP (@ 100 MHz) are converted to 20 MHz by the AXI clock converter.

It's like:
Design unit (@ 20 MHz) ----> Axi_clk_converter IP ----> Axi_pci_bridge (@ 100 MHz)
Axi_pci_bridge (@ 100 MHz) ----> Axi_clk_converter IP ----> Design unit (@ 20 MHz)

ads-ee · Sep 20, 2016

An UltraScale part with AXI4 and PCIe should have no problem placing and routing, even in the slowest part in the family. If you are having timing problems in that IP, it's not because of the IP; it's because of the poor design implementation (huge combinational circuits) elsewhere.

What do the synthesis results show for utilization of primitives? Can you post the final synthesis primitive results (assuming you can't divulge any information about the design)? If you have a large number of SRLs or distributed RAMs, or a small FF-to-LUT ratio, that could easily cause problems at high utilization. Do you know how many control sets are in the design? That affects packing during placement and can adversely affect timing closure.

- - - Updated - - -

To give effective help we would need significantly more information, such as the structure and architecture of the design: the resource utilization report per hierarchical level, along with a block diagram of the design showing the major blocks and the memory usage of each block. There's more information that would be useful, but that would be the minimum to start.

ads-ee · Sep 20, 2016

UltraGreen said:
P.S. Block ram utilization is very low.

This suddenly popped out for me...

Use the block RAMs as huge LUTs to replace large chunks of your combinational logic. It might be a pain to convert the combinational logic, but it will reduce the size of your design significantly.
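
To illustrate the idea, here is a minimal sketch of tabulating a combinational function into ROM contents. Everything named here is hypothetical: `comb_logic` is just a stand-in for a deep logic cone, and the 10-bit input and 8-bit output widths are arbitrary. The hex dump is one word per line, which is a layout that memory-initialization readers such as Verilog's `$readmemh` can typically accept.

```python
INPUT_BITS = 10   # assumed address width of the lookup (2**10 entries)
OUTPUT_BITS = 8   # assumed data width of each entry


def comb_logic(x: int) -> int:
    """Hypothetical stand-in for a deep combinational cone."""
    a = x & 0x1F          # low 5 input bits
    b = (x >> 5) & 0x1F   # high 5 input bits
    return ((a * b) ^ (a + b)) & ((1 << OUTPUT_BITS) - 1)


def build_rom(func, input_bits: int) -> list:
    """Evaluate func for every input value; the input becomes the address."""
    return [func(addr) for addr in range(1 << input_bits)]


def rom_to_hex_lines(rom: list, output_bits: int) -> list:
    """Format each word as fixed-width uppercase hex, one word per line."""
    digits = (output_bits + 3) // 4
    return [f"{word:0{digits}X}" for word in rom]


rom = build_rom(comb_logic, INPUT_BITS)

# The table lookup must agree with the logic for every possible input.
assert all(rom[x] == comb_logic(x) for x in range(1 << INPUT_BITS))

hex_lines = rom_to_hex_lines(rom, OUTPUT_BITS)
print(f"{len(rom)} entries of {OUTPUT_BITS} bits")
```

The table has 2**n entries for n input bits, so this only pays off for cones with a modest number of inputs. A registered block RAM output also adds a cycle of latency, which a fixed ASIC core may or may not tolerate.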

UltraGreen · Sep 21, 2016

Thanks ads-ee,

Attaching the utilization summary.

TrickyDicky · Sep 21, 2016

What is in qt_tc_inst? And specifically, what is g1_SH_iter?

UltraGreen · Sep 21, 2016

TrickyDicky said:
What is in qt_tc_inst? And specifically, what is g1_SH_iter?

That's a memory processing unit and it is instantiated 4 times. It is mostly combinatorial, because it was designed for an ASIC. Unfortunately I cannot modify the core design.

ads-ee · Sep 21, 2016

UltraGreen said:
That's a memory processing unit and it is instantiated 4 times. It is mostly combinatorial, because it was designed for an ASIC. Unfortunately I cannot modify the core design.

I see. This is an ASIC emulation design. Usually you would use one of those specialty boards and the tools that split the design over multiple parts so that it fits into the much smaller FPGA devices.

ASIC code often will not fit well in an FPGA, due to large combinational logic cones and the requirement that the source code can't be changed.

pbernardi · Sep 21, 2016

As you cannot change the ASIC code, you cannot change what is probably the root of the behavior you are complaining about.

But you can try to act on the symptoms rather than the cause. One idea: try to find unused paths and relax, or even completely ignore, their timing requirements.

For example, external reset timing can usually be completely ignored. As you seem to use huge combinational logic, maybe there are some critical paths that do not need a timing constraint.

Another thing to try would be to manually fix some main blocks (DSPs, block RAMs) at positions in the FPGA that give good routing results. Unfortunately, this is quite empirical and may take a lot of time as well.

ads-ee · Sep 21, 2016

There is also the option of just reducing the clock frequency you are targeting (since it's an emulation). I still think the problem is more that the design is not partitioned across multiple FPGAs, which is what most ASIC emulation tools would push you to do from the start.
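
As a back-of-the-envelope sketch of how far the clock would have to drop, assuming the -3.4 ns setup slack at 20 MHz from post #1 is the worst path and that the path delay stays the same when the constraint is relaxed (in practice the tools re-optimize, so this is only a first guess):

```python
def max_clock_mhz(target_mhz: float, worst_setup_slack_ns: float) -> float:
    """Highest clock frequency at which the worst setup path would just pass.

    Simplification: the required period is the current period minus the
    (negative) slack, and the path delay is assumed not to change.
    """
    period_ns = 1000.0 / target_mhz
    required_period_ns = period_ns - worst_setup_slack_ns
    return 1000.0 / required_period_ns


# Numbers from post #1: 20 MHz target, -3.4 ns worst setup slack.
fmax = max_clock_mhz(20.0, -3.4)
print(f"estimated setup-limited clock: {fmax:.1f} MHz")
```

Note that this only speaks to the setup violation. Hold slack on paths launched and captured by the same clock edge typically does not depend on the clock period, so the -1.8 ns hold violation would need to be looked at separately.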
