Real life throughput of FPGA DSP blocks

shaiko · Jun 24, 2017

Hello,

Both Xilinx and Altera market their FPGA DSP blocks at exceptionally high MHz values (>500MHz).

Even if the DSP block's silicon can be clocked that high - how is it possible to feed it data at these speeds? What's the highest value you've been able to achieve?

The architecture that gave me the best result is the following:

Use 2 clock domains -
1. Clock domain X - or the input clock domain.
2. Clock domain Y - the DSP clock domain
Clock domain Y is faster than clock domain X.

Drive data from clock domain X into the wide side of an asynchronous width conversion FIFO.
Read narrow data from clock domain Y and feed it to the DSP block.
Process the data and write it back to a similar FIFO that does the opposite (narrow to wide) conversion...
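
As a rough illustration of the rate balance this architecture relies on, here is a minimal Python sketch. It assumes the FIFO must carry the same number of bits per second on both sides, so the narrow-side clock scales with the width ratio. The function name and the 96-bit / 12-bit / 70 MHz figures are hypothetical and chosen only for the example.

```python
def narrow_side_clock_mhz(wide_bits: int, narrow_bits: int, wide_clock_mhz: float) -> float:
    """Clock needed on the narrow side so the FIFO neither fills nor drains over time.

    Assumes one word is written every wide-side cycle and one word is read
    every narrow-side cycle (no idle cycles on either side).
    """
    if wide_bits % narrow_bits != 0:
        raise ValueError("wide width must be a whole multiple of the narrow width")
    ratio = wide_bits // narrow_bits  # narrow words produced per wide word
    return wide_clock_mhz * ratio


# Hypothetical numbers: 96-bit words written in domain X at 70 MHz,
# read out as 12-bit words in domain Y.
y_clock = narrow_side_clock_mhz(96, 12, 70)
print(f"Clock domain Y must run at {y_clock} MHz")
```

The point of the sketch is that every bit of width reduction has to be paid for in clock rate on the Y side, which is exactly where the FIFO read logic lives.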

But even with this architecture, the clock domain Y side of the FIFO logic fails at frequencies much lower than the marketed frequency of the DSP block.

What do you think?
Would you do things differently?
What numbers were you able to achieve?

TrickyDicky · Jun 24, 2017

Not exactly an easy question to answer, as every design will have different goals, burst lengths, etc.
I've seen DSPs running at 368 MHz in a Stratix IV. But most of the design was using a 1-in-3 clock enable (so lots of multicycle paths).

It's not usually the clock speed that's important, but system bandwidth. If you're running 10 Gb Ethernet then you can do that with a 200 MHz clock and a 50+ bit bus.
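
The bandwidth arithmetic behind that example can be checked with a short Python sketch. It assumes one bus word is transferred on every clock cycle and ignores any protocol or encoding overhead; the function name is made up for illustration.

```python
def min_bus_width_bits(throughput_bps: int, clock_hz: int) -> int:
    """Smallest bus width (in bits) that sustains the throughput at the given clock.

    Assumes one word per clock cycle and no protocol overhead.
    """
    # Integer ceiling division, avoids floating-point rounding.
    return -(-throughput_bps // clock_hz)


# 10 Gb/s carried on a 200 MHz clock
print(min_bus_width_bits(10_000_000_000, 200_000_000))  # 50
```

Any overhead on the link pushes the width above 50 bits, which is presumably why the figure is quoted as "50+".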

shaiko · Jun 24, 2017

TrickyDicky said:
It's not usually the clock speed that's important, but system bandwidth. If you're running 10 Gb Ethernet then you can do that with a 200 MHz clock and a 50+ bit bus.

This is true.
But my goal isn't just a system that works...it should also be as cost effective as possible (a smaller cheaper device).
This means taking the DSP "overclocking" capability to the limit with the goal of using fewer DSP units.

And from my experience, when you do so, the point of failure will always be the circuits that take care of clock domain crossing and width conversion (to feed the DSPs with narrow high-speed data).

For example:
Xilinx claims that their DSP48E2 is capable of 600MHz.
I say that for this to be possible, the general-purpose FPGA fabric that implements the fast-side clock domain crossing and width conversion will also have to be clocked at 600MHz. And this is impossible!

vGoodtimes · Jun 24, 2017

What are you trying to do more specifically?

What are the widths, how many DSP slices, how are arguments routed? You need to have a deeper pipeline at the higher rates.

Make use of the dedicated routing and aligned-pitch as much as possible.

shaiko · Jun 24, 2017

I'm designing a filter that does convolution of a 7x7x12b pixel image patch with different coefficients (each coefficient is 16 bits wide). The image format is 1024x1024.
There are 64 coefficient sets that have to work in parallel, i.e.:

Convolution of the 7x7x12b pixel input with 7x7 coefficient set #0
Convolution of the 7x7x12b pixel input with 7x7 coefficient set #1
.
.
.
Convolution of the 7x7x12b pixel input with 7x7 coefficient set #63

This means 7*7*64*(1024^2) = 3,288,334,336 MACs per frame.
The input data rate is 40MHz (a 12-bit-wide pixel clocked at 40MHz).
I want to run the MACs at the highest possible speed to make do with fewer DSP blocks.
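
To make the trade-off between DSP clock and DSP count concrete, here is a back-of-the-envelope Python sketch built from the figures above. It assumes a continuous 40 MHz pixel stream with one full 7x7x64 result per input pixel, and an idealised DSP block that completes one MAC every clock with no idle cycles. The candidate clock rates in the loop are arbitrary illustration values.

```python
KERNEL_TAPS = 7 * 7           # MACs per output pixel per coefficient set
COEFF_SETS = 64               # coefficient sets evaluated in parallel
FRAME_PIXELS = 1024 * 1024    # pixels per frame
PIXEL_RATE_HZ = 40_000_000    # input pixel clock

macs_per_frame = KERNEL_TAPS * COEFF_SETS * FRAME_PIXELS
macs_per_second = KERNEL_TAPS * COEFF_SETS * PIXEL_RATE_HZ


def dsp_blocks_needed(dsp_clock_hz: int) -> int:
    """Lower bound on DSP blocks: one MAC per block per clock, 100% utilisation."""
    return -(-macs_per_second // dsp_clock_hz)  # integer ceiling division


print(f"MACs per frame:  {macs_per_frame:,}")
print(f"MACs per second: {macs_per_second:,}")
for clock_mhz in (200, 300, 400, 560, 600):  # illustrative clock rates only
    print(f"{clock_mhz} MHz -> {dsp_blocks_needed(clock_mhz * 1_000_000)} DSP blocks")
```

These are lower bounds under the stated assumptions; pipeline fill, blanking intervals, and any cycles where a DSP block sits idle all push the real count higher.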

vGoodtimes · Jun 25, 2017

How are you organizing this? My thought is using groups of 7 DSP barrel processors. That removes the input packing needs.

In any case, the goal is to keep the logic simple and direct for the highest rate stuff.

More specifically, all inputs come from the same row. The only vertical routing is dedicated routing. The input to the DSP slice is held for 64 cycles. The coefficients use SRLs and a register stage near the DSP48. The controls do the same. Output address generation is also simple. There might be some vertical routing to the RAMs. The input FIFOs are read at a slow rate, with SRLs and registers providing the correct delays. The final accumulator can be pipelined in fabric. The BRAM to DSP ratio is low, so there should be resources for double buffering. Ideally, the rest of the design will not attempt to route unrelated logic in these areas. You may need to LOC/RLOC some of the BRAM and DSP resources.

shaiko · Jun 25, 2017

And with all this effort - do you see the general purpose FPGA fabric clocking at frequencies anywhere near the marketed frequency of the DSP silicon ?

vGoodtimes · Jun 25, 2017

My main concern is the lack of logic left over after you do this operation. My proposed design uses over 90% of the DSP and BRAM of a Kintex7 70T @ 560MHz. It also requires a lot of attention to routing and placement details. I don't know the input source, but I'm assuming it isn't app-friendly. Specifically, it arrives a line at a time, which requires some buffer space. I think the hardest part will be the routing to/from this. My proposed design also truncates some intermediate terms to 18b. The clock domain crossing input FIFOs need to be fabric based due to the amount of buffer needed. You might also need to use gated clocks to avoid control set restrictions and control set routing.

To see if this is viable, you would need to finish the entire design before the prototype is available. And any last minute changes mean going to a device with significantly more resources. This is best done if you can write the barrel7 processor, LOC the BRAMs and DSPs, and then try to get the most regular layout for the logic. You are almost dropping down to schematic layout -- every HDL decision is based on knowing you have 1 level of logic with fast routing. Hopefully you can get something that you can replicate or RLOC/DIRT. You might end up dumping a working HF LOC/DIRT in the end so other parts never break the core timing. You'll be using several features and methods that are only used when nothing else works.

The proposed design has:
1.) fabric based clock-domain-crossing to 560/8 to a second buffer for the 560 clock domain
2.) SRL based splay so DSPs get inputs in the correct pipeline delay.
3.) DSP chain taking inputs from registered LUTs, SRLs, or DMEM.
4.) buffer with routing logic.
5.) pipelined accumulate (add in this case).
6.) registered control mux with count to seven logic and six detect to predict seven.
7.) various counters for the buffer read/write addresses.
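
Item 6 can be illustrated with a small cycle-level Python model. This is only one reading of "six detect to predict seven": the comparison against 6 is done one cycle early and registered, so the flag that marks count == 7 comes straight out of a flip-flop with no comparator in front of whatever it drives. The 1..7 counting range and all names here are assumptions, not details from the proposed design.

```python
def simulate(cycles: int) -> list[tuple[int, bool]]:
    """Return (count, at_seven) register values for each clock cycle."""
    count = 1          # register: current count, runs 1..7 (assumed range)
    at_seven = False   # register: asserted during the cycle where count == 7
    trace = []
    for _ in range(cycles):
        trace.append((count, at_seven))
        # Next-state logic, computed from the current register values.
        next_at_seven = (count == 6)                # detect six, one cycle early
        next_count = 1 if at_seven else count + 1   # wrap using the registered flag
        # Clock edge: both registers update together.
        count, at_seven = next_count, next_at_seven
    return trace


for count, at_seven in simulate(9):
    print(count, at_seven)
```

In this model the registered flag always equals `count == 7`, yet nothing downstream ever has to decode the count in the same cycle it is used, which is the kind of one-level-of-logic structure a very fast clock domain needs.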

Also, you might be able to get the compare-tree integrated into the engine, maybe. I also assume a gated clock and some manual duplication of logic. I think you can get something, but you'll be close to 100% of device utilization.

If you can go bigger than K7 70T, you should be fine. (A7 100T -3 technically could work too as it has similar specs.)

pbernardi · Jun 26, 2017

The datasheet could help you pursue the theoretical maximum frequency in your case.

Look at the switching characteristics in the datasheet. You will find not only the maximum DSP frequency, but also the switching characteristics of the other blocks.

For example, for Artix 7, speed grade -2, I can see the maximum DSP frequency is 550.66 MHz. BRAM has a maximum frequency of 460.83 MHz, so if you need to use BRAM, this is your limit. But you may try to use SerDes, for example, to overcome this limit without using BRAM. Normal logic seems to support this frequency if it is optimized to a state-of-the-art level (max. CLB frequency is 1286 MHz).
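
The reasoning here is that a clock domain can run no faster than its slowest block. A minimal Python sketch using only the figures quoted above (the dictionary keys and function name are just labels for this example):

```python
# Maximum frequencies in MHz as quoted in this post (Artix 7, speed grade -2).
block_fmax_mhz = {"DSP": 550.66, "BRAM": 460.83, "CLB": 1286.0}


def limiting_block(fmax_mhz: dict[str, float]) -> tuple[str, float]:
    """Return the block type with the lowest Fmax and that frequency."""
    name = min(fmax_mhz, key=fmax_mhz.get)
    return name, fmax_mhz[name]


print(limiting_block(block_fmax_mhz))  # BRAM is the bottleneck

# Keeping BRAM out of the fast clock domain moves the limit to the DSP itself.
without_bram = {k: v for k, v in block_fmax_mhz.items() if k != "BRAM"}
print(limiting_block(without_bram))
```

These are datasheet ceilings for isolated blocks; routing delay between blocks is not captured, so the achievable figure in a full design will be lower.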

I would suggest breaking the big problem into smaller problems. You can make several small blocks (DSP itself, logic, RAM) and try to reach the maximum datasheet frequency for each block. The next step is connecting the blocks and checking the resulting frequency. If you reach a good result for 1 DSP, then try to go to n DSPs.

TrickyDicky · Jun 26, 2017

shaiko said:
And with all this effort - do you see the general purpose FPGA fabric clocking at frequencies anywhere near the marketed frequency of the DSP silicon ?

Even if you contact Xilinx/Altera, they are likely to tell you this is all marketing, and unrealistic in any real design. Even if you could clock the DSPs at this speed, getting data to it at these speeds is going to be very hard or impossible.
In my experience, once you get to 75% of the marketed frequency, you're going to start hitting timing issues somewhere in the design.

I say, do you want to be clever, or make money for the company? Usually, you'll do better at work if you go for the money and just ship something.

vGoodtimes · Jun 27, 2017

Actually, this might really be viable. I completely overlooked the SerDes concept. The 7-DSP chain barrel processor takes inputs from SRLs with a fabric register. The input to the system is clocked at 80MHz. The output can go to parallel paths running at 280MHz or maybe even lower. The BRAM would run at 280MHz either way. This means the DSP48 and the CLBs next to the DSP48 are clocked at 560MHz and everything else is slow.

The barrel processor itself is moderately complex, but not really that bad.
