
Welcome to EDAboard.com

Welcome to our site! EDAboard.com is an international Electronics Discussion Forum focused on EDA software, circuits, schematics, books, theory, papers, asic, pld, 8051, DSP, Network, RF, Analog Design, PCB, Service Manuals... and a whole lot more!

[SOLVED] Generation of desired size ANN for FPGA

Status
Not open for further replies.
Are there some tricks in PlanAhead, or should I manually edit the design in FPGA Editor after PAR? I use only ISE at this stage, and the graph is drawn from the synthesis report.

Any increase in design speed will come from the coding rather than from PlanAhead, and if you have to poke around after PAR, something is very wrong.
It sounds like the design isn't coming from a usual VHDL design flow but rather from directly porting the MATLAB code, which usually leads to a poorer FPGA design.

Thanks a lot for the criticism directed at me :)
I achieve 388 MHz after PAR in ISE 14.7. Basically, I connect a few BRAMs, distributed RAMs (DRAMs), and SRLs to a shared DSP.
Xilinx claims that an Fmax in the range of 450-500 MHz is possible for various filters. I have reached it only for a single DSP macro. If I connect some other registered stuff to it, the Fmax drops below 400 MHz. Has anybody succeeded in reaching 450 MHz with 5-10 DSPs joined in a chain or through RAM? Or am I asking too much?
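For a sense of scale, the short Python calculation below converts the Fmax figures from this post into clock periods. It only does the 1/f arithmetic (the helper name `period_ns` is hypothetical) and says nothing about which part of the critical path the missing fraction of a nanosecond would have to come from.

```python
# Clock-period budget for the Fmax figures quoted above.
# Period in ns = 1000 / (frequency in MHz).

def period_ns(fmax_mhz: float) -> float:
    """Return the clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / fmax_mhz

ACHIEVED_MHZ = 388.0          # after PAR in ISE 14.7, per the post
TARGETS_MHZ = [450.0, 500.0]  # range Xilinx quotes for DSP48-based filters

if __name__ == "__main__":
    for target in TARGETS_MHZ:
        # Path delay that must be removed to move from 388 MHz to the target.
        cut = period_ns(ACHIEVED_MHZ) - period_ns(target)
        print(f"{target:.0f} MHz: period {period_ns(target):.3f} ns, "
              f"remove {cut:.3f} ns from the critical path")
```

Going from 388 MHz to 450 MHz means shrinking the worst path from about 2.577 ns to 2.222 ns, i.e. finding roughly 0.355 ns, which is usually a routing question rather than a logic one.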

Have you looked at the schematic view of the synthesized design? Are you sure you've enabled all the pipeline registers? Have you looked at the critical paths to see what logic is involved (probably something outside the DSP48 blocks)? Without seeing the code and the implementation results, and correlating the two, it is hard for anyone here to help further.

400+ MHz is going to be hard to achieve. Usually the bottleneck is routing into or out of the RAM or DSP, so check where the logic is placed. Is there a single register between two RAMs/DSPs? Increase this to two, so that the path can easily route to a register, then another register, then into the DSP, rather than making PAR have the two DSPs fight over a single external pipeline register.

How spaced out are the DSPs? Could you bunch them together with some placement regions? Have you tried a seed sweep?
Why do you need the algorithm to be so fast? Could you not just instantiate the core multiple times for parallel processing?

Thanks. I want to achieve maximum performance from each instantiated DSP. For example, right now the 5th-order lattice-ladder structure uses one DSP at 100% utilization (320 MHz after PAR), with a latency of 20 clocks for 16 multiplications, 5 subtractions, and 10 additions (first one C-AB, then four P-AB, five C+AB, one AB, and finally five P+AB). When the order of the lattice-ladder is reduced to 2, the utilization of the single DSP decreases, because no data are ready to be processed. Instantiating a second DSP will not accelerate processing either, because the awaited data are still somewhere in the first DSP's pipeline registers. But this is not an issue: data from a second channel can be interleaved when two 2nd-order structures are used.
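The operation counts above can be checked by tallying the listed DSP48 patterns, and the utilization argument can be illustrated with a toy model. The Python sketch below is not the poster's design: `SCHEDULE_5TH_ORDER`, `count_ops`, and `utilization` are hypothetical names, and the model assumes the worst case in which every operation of a channel depends on the previous one, with a fixed pipeline latency and one issue per clock.

```python
from collections import Counter

# Hypothetical encoding of the pattern sequence quoted above for the
# 5th-order lattice-ladder: (DSP48 pattern, number of repetitions).
SCHEDULE_5TH_ORDER = [("C-AB", 1), ("P-AB", 4), ("C+AB", 5), ("AB", 1), ("P+AB", 5)]

def count_ops(schedule):
    """Tally multiplications, subtractions and additions in a pattern list."""
    ops = Counter()
    for pattern, repeat in schedule:
        ops["mul"] += repeat              # every pattern contains one A*B product
        if "-" in pattern:
            ops["sub"] += repeat
        elif "+" in pattern:
            ops["add"] += repeat
    return ops

def utilization(ops_per_channel, channels, latency):
    """Toy model of one pipelined DSP (one issue per clock, fixed latency).
    Each channel is a chain of dependent operations, so a channel may issue
    again only when its previous result has left the pipeline."""
    ready = [0] * channels                # earliest issue cycle per channel
    remaining = [ops_per_channel] * channels
    cycle = issued = 0
    while any(remaining):
        for ch in range(channels):        # first ready channel takes the slot
            if remaining[ch] and ready[ch] <= cycle:
                ready[ch] = cycle + latency
                remaining[ch] -= 1
                issued += 1
                break
        cycle += 1
    return issued / max(ready)            # ops per clock until the last result

if __name__ == "__main__":
    print(count_ops(SCHEDULE_5TH_ORDER))
    for channels in (1, 2, 4):
        print(channels, "channel(s):", round(utilization(8, channels, latency=4), 3))
```

The tally gives 16 multiplications, 5 subtractions, and 10 additions, matching the figures in the post. In the toy model, one dependent chain keeps a 4-stage pipeline busy only 25% of the time, while four interleaved channels reach 32/35 (about 0.91); the remaining gap is the pipeline draining after the last issue.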
I have compared hand-written VHDL with the code generated by Vivado HLS. After PAR, the Vivado HLS model reached an Fmax of ~200 MHz. While testing this tool on various filters, I never got higher than ~250 MHz after PAR with HLS.

The problem begins when implementing the adaptation circuits. There are 14 different structures (possibilities) for implementing the same learning circuits, and they need at least three times more arithmetic resources than the lattice-ladder. I think I'll go crazy checking all possible implementations to figure out the optimal one. To automate this process, I began writing a program that reads simple arithmetic equations (which effectively describe the data-flow graph) from a text file and then creates an adjacency list of the graph. After that, it segments the whole graph into possible subgraphs that describe DSP patterns (e.g. P-(A+D)B, C+AB, ...). At this point I have a list of patterns and know which pattern can be connected to which. Now I need to write a pattern scheduler that feeds the data from one pattern to the next at the right moment. As a result, the scheduling gives the latency, performance, and number of DSPs for each structure.
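The flow described above (equations, adjacency list, pattern cover, schedule) can be sketched in a few dozen lines of Python. This is not the poster's program: the one-operation-per-line input format, the greedy rule that merges a multiply into its single add/sub consumer, the pattern labels, and the single-DSP list scheduler with a fixed, assumed pipeline latency are all illustrative assumptions, and all function names are hypothetical. P-cascade and pre-adder patterns such as P-(A+D)B are not matched.

```python
import re
from collections import defaultdict

# Hypothetical input format: one binary operation per line, "dst = x op y",
# op in {*, +, -}. Operands that are never assigned are primary inputs.
LINE_RE = re.compile(r"^\s*(\w+)\s*=\s*(\w+)\s*([*+-])\s*(\w+)\s*$")

def parse(text):
    """Return {dst: (op, lhs, rhs)} in the order the equations appear."""
    nodes = {}
    for line in text.strip().splitlines():
        m = LINE_RE.match(line)
        if not m:
            raise ValueError(f"cannot parse: {line!r}")
        dst, lhs, op, rhs = m.groups()
        nodes[dst] = (op, lhs, rhs)
    return nodes

def adjacency(nodes):
    """Adjacency list of the data-flow graph: producer -> consumers."""
    succ = defaultdict(list)
    for dst, (_, lhs, rhs) in nodes.items():
        for src in (lhs, rhs):
            if src in nodes:                  # primary inputs have no node
                succ[src].append(dst)
    return succ

def cover_patterns(nodes, succ):
    """Greedy cover: a multiply whose only consumer is an add/sub is merged
    into that consumer, so the pair becomes one DSP operation."""
    owner, label = {}, {}                     # node -> DSP op, DSP op -> pattern
    for dst, (op, lhs, rhs) in nodes.items():
        if op == "*":
            continue                          # decided by its consumer
        owner[dst], label[dst] = dst, ("ADD" if op == "+" else "SUB")
        for src in (lhs, rhs):
            if src in nodes and nodes[src][0] == "*" and len(succ[src]) == 1:
                owner[src] = dst
                label[dst] = f"C{op}AB" if src == rhs else f"AB{op}C"
                break                         # only one product per operation
    for dst, (op, _, _) in nodes.items():
        if op == "*" and dst not in owner:
            owner[dst], label[dst] = dst, "AB"  # stand-alone multiply
    return owner, label

def dsp_dependencies(nodes, owner):
    """Edges between DSP operations; edges inside one operation disappear."""
    ops = [n for n in nodes if owner[n] == n]   # equation order = priority
    deps = {op: set() for op in ops}
    for node, (_, lhs, rhs) in nodes.items():
        for src in (lhs, rhs):
            if src in nodes and owner[src] != owner[node]:
                deps[owner[node]].add(owner[src])
    return ops, deps

def schedule(ops, deps, latency):
    """List-schedule DSP operations on ONE pipelined DSP (one issue per clock).
    An operation issues once all its producers' results have left the pipeline.
    Returns (issue cycle per operation, cycle when the last result is ready)."""
    issue, pending, cycle = {}, list(ops), 0
    while pending:
        if cycle > latency * (len(ops) + 1):
            raise ValueError("no progress: is the data-flow graph cyclic?")
        for op in pending:
            if all(p in issue and issue[p] + latency <= cycle for p in deps[op]):
                issue[op] = cycle
                pending.remove(op)
                break
        cycle += 1
    return issue, max(t + latency for t in issue.values())

EXAMPLE = """
m1 = x * k1
s1 = c - m1
m2 = s1 * k2
s2 = s1 + m2
m3 = s2 * v1
y = d + m3
"""

if __name__ == "__main__":
    nodes = parse(EXAMPLE)
    owner, label = cover_patterns(nodes, adjacency(nodes))
    ops, deps = dsp_dependencies(nodes, owner)
    issue, finish = schedule(ops, deps, latency=4)   # latency is an assumption
    for op in ops:
        print(f"{op}: {label[op]} issued at cycle {issue[op]}")
    print(f"{len(ops)} DSP operations, last result ready at cycle {finish}")
```

On the example, the three multiply/add pairs collapse into three DSP operations (C-AB, C+AB, C+AB) that form a dependent chain, so with the assumed latency of 4 they issue at cycles 0, 4, and 8 and the last result is ready at cycle 12. Those idle slots are exactly what interleaving a second channel would fill; a real tool would also need to match the P-cascade and pre-adder patterns and compare alternative covers rather than taking the first greedy one.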

Am I reinventing the wheel? I suspect that System Generator does the same things I've tried to describe above. Maybe you have more experience with XSG: how efficient is the generated VHDL, and what Fmax does it reach? Is it similar to the code generated by Vivado HLS, or better? The HDL Coder I tried is complete crap compared to the performance of hand-coded HDL.
I can't check XSG right now. The XSG in ISE 14.7 is not compatible with my MATLAB R2014a, so I will download another (13) version.

System Generator and HDL Coder are meant for inexperienced designers and for people who want to put a system together quickly, so they need to cover a lot of cases. For greater performance you have to hand-code, and that will be the case for a long while yet.