
Approaches for calculating histograms on FPGAs

Status
Not open for further replies.

shaiko

Hello,

I was looking for information regarding histogram calculation using FPGAs and came across this interesting article:

**broken link removed**
(look at the timing diagram on page 2)
This seems like a robust and easy approach.

However, after doing some more reading on the subject - I found this document:
https://www.xilinx.com/support/documentation/white_papers/wp335.pdf
(page 2 - Read-Modify-Write, One Operation Per Clock)

The second approach looks much simpler - and I think it can be used to achieve the same thing the first one does.

So why bother with multiple clock domains and phase shifting??
Am I missing something here?
 

hi shaiko

you can't avoid crossing clock domains in video applications.
video is usually a 27 MHz synchronous clock interface,
and your DSP runs at a clock rate of a few hundred MHz.
so in most cases you will need synchronizers, FIFOs, and all the rest...
(the exception is a plain application such as a video i/f board where the only clock is 27 MHz)

basically what they show is that you can use the dual-port BRAM
to generate a normal histogram in the video clock domain.
you can read this data later, but the DSP will still need to know
when to start reading the histogram. sometimes the DSP will want to read
a histogram that is not available yet, so you will need synchronisation and flags.

you will probably need additional bookkeeping to identify which video frame
a specific histogram belongs to, so you will also need to generate this metadata.
then when you propagate your histogram to the DSP, it will know that it belongs to a specific video frame that is probably stored in external DDR memory.

arui
 
you can't avoid crossing clock domains in video applications
I didn't imply that you can or should.

I'm trying to compare the approach in the first article to the pipelined single cycle read modify write approach suggested by Xilinx:
A synchronous RAM cannot perform read-modify-write operations in a single clock
cycle, but the dual-port, synchronous block RAM in all Xilinx® FPGAs can pipeline
the write operation and achieve a throughput of one read-modify-write operation per
clock cycle. To do so, the designer uses Port A as the read port, uses Port B as the write
port, and uses one common clock for both ports. The read address is routed to Port A.
A copy of the read address is delayed by one clock and routed to Port B. The data from
Port A is modified and used as the data input to Port B.
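
To make the quoted scheme concrete, here is a rough cycle-level sketch in Python. It is a software model, not HDL, and the function name `pipelined_rmw_histogram` and its structure are my own assumptions for illustration; they are not taken from the white paper. Each loop iteration stands for one clock: Port A reads the current address, and Port B writes back the incremented count at the address from the previous clock.

```python
def pipelined_rmw_histogram(samples, num_bins):
    """Cycle-level model: Port A reads, Port B writes back one clock later."""
    mem = [0] * num_bins
    pending = None  # (address, count read on Port A) registered in the previous cycle

    for addr in list(samples) + [None]:  # one extra cycle flushes the last write
        # Port A: the synchronous read sees the memory as it was before
        # this cycle's write takes effect.
        read_data = mem[addr] if addr is not None else None

        # Port B: write the modified data at the delayed (previous) address.
        if pending is not None:
            write_addr, old_count = pending
            mem[write_addr] = old_count + 1

        pending = (addr, read_data) if addr is not None else None
    return mem


hist = pipelined_rmw_histogram([1, 2, 1, 3], num_bins=4)
print(hist)
```

The input here deliberately has no two identical values back to back, so every read sees an up-to-date count.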

Both solutions work - yet the one proposed by Xilinx looks simpler to me. It doesn't require a PLL, probably can be implemented with less logic and requires less routing effort.

So, what are the benefits of the first one?
 

The second approach looks much simpler - and I think it can be used to achieve the same thing the first one does.
The simple difference is that the first design uses one RAM port for the DSP access, so it isn't available for pipelining the read-modify-write operation. With only one free RAM port, you'll need to double the memory clock.
 
The first article is talking about Virtex and Spartan-II devices, which are 10 years old or more by now. Resources and clock speeds were limited, so clever solutions were needed to save BRAMs and logic. That isn't such a problem now.
 
The first article is talking about Virtex and Spartan-II devices, which are 10 years old or more by now. Resources and clock speeds were limited, so clever solutions were needed to save BRAMs and logic. That isn't such a problem now.

This was my first thought - and perhaps it was indeed the motivation of the designer.
But as FvM noted, this approach utilizes only a single port. If we use a DPR, the second port will be free to read the histogram data.

However, with Xilinx's read-modify-write we occupy both ports for the job.




The simple difference is that the first design uses one RAM port for the DSP access, so it isn't available for pipelining the read-modify-write operation. With only one free RAM port, you'll need to double the memory clock.

So with the first approach
 

You can use the BRAM as mentioned. The difficult part is incrementing the same index on two consecutive cycles. There isn't a single-cycle read-modify-write, so the updated count will not be read on the second cycle. Note that a single port RAM does have a single cycle write-first mode. In the dual-port case it becomes a read-write collision.
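
A small Python sketch of that hazard, under the assumption that the read on one port sees the old contents when the other port writes the same address in the same cycle. The function name `naive_pipelined_histogram` is hypothetical and this is only a behavioural model, not vendor code.

```python
from collections import Counter


def naive_pipelined_histogram(samples, num_bins):
    """Pipelined read-modify-write with no protection against repeated addresses."""
    mem = [0] * num_bins
    in_flight = None  # (address, count read one cycle ago) not yet written back

    for addr in list(samples) + [None]:  # extra cycle drains the last write
        # The read happens before this cycle's write lands, so a repeated
        # address gets the stale count.
        count = mem[addr] if addr is not None else None
        if in_flight is not None:
            mem[in_flight[0]] = in_flight[1] + 1
        in_flight = (addr, count) if addr is not None else None
    return mem


samples = [5, 5, 5, 2]
model = naive_pipelined_histogram(samples, num_bins=8)
truth = Counter(samples)
print("bin 5: model =", model[5], "true =", truth[5])
```

In this model bin 5 ends at 2 rather than 3: the second sample reads the count before the first increment has been written, so one increment is lost.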

There are at least three solutions to this problem. The first is overclocking the BRAM to get more operations per slow-clock cycle. The second is caching the last output and not re-reading when the next input is for the bram entry that hasn't been written yet. The third is detecting the problem upstream and modifying the logic to have an increment by two function.
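
As an illustration of the second solution, here is a hedged Python sketch of the caching (bypass) idea: if the incoming address equals the one whose write is still in flight, the model reuses the in-flight value instead of the stale RAM contents. The name `forwarding_histogram` and the exact structure are assumptions of mine, not something from either document.

```python
from collections import Counter


def forwarding_histogram(samples, num_bins):
    """Pipelined read-modify-write with a one-entry bypass for repeated addresses."""
    mem = [0] * num_bins
    in_flight = None  # (address, new count) that Port B writes during this cycle

    for addr in list(samples) + [None]:  # extra cycle drains the last write
        if addr is None:
            count = None
        elif in_flight is not None and in_flight[0] == addr:
            count = in_flight[1]  # bypass: take the value not yet in the RAM
        else:
            count = mem[addr]     # normal Port A read

        if in_flight is not None:
            mem[in_flight[0]] = in_flight[1]  # Port B write of the previous result

        in_flight = (addr, count + 1) if addr is not None else None
    return mem


samples = [5, 5, 5, 2, 2, 7]
result = forwarding_histogram(samples, num_bins=8)
expected = [Counter(samples)[b] for b in range(8)]
print(result == expected)
```

In hardware this would correspond to an address comparator and a mux in front of the incrementer; the model only checks that the counts come out right.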

The first solution requires more clocking resources, but keeps the logic simple. The second solution is a more general solution to this type of problem, and can reduce power consumption. The third method is specialized for this case, and should be able to allow highest performance.
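
One possible reading of the third solution, sketched in Python. The assumption here is mine: when a sample matches the address read in the previous cycle, no new read is issued and the pending write adds 2 instead of 1. The name `increment_by_two_histogram` is hypothetical, and a real design may detect the repeat differently.

```python
from collections import Counter


def increment_by_two_histogram(samples, num_bins):
    """Pipelined read-modify-write that merges a back-to-back repeat into a +2."""
    mem = [0] * num_bins
    pending = None  # (address, count read on Port A) from the previous cycle

    for addr in list(samples) + [None]:  # extra cycle drains the last write
        # Upstream check: does this sample repeat the address read last cycle?
        merge = pending is not None and addr == pending[0]

        # Port A read is skipped when the sample is merged into the pending write.
        issue_read = addr is not None and not merge
        count = mem[addr] if issue_read else None

        if pending is not None:
            mem[pending[0]] = pending[1] + (2 if merge else 1)

        pending = (addr, count) if issue_read else None
    return mem


samples = [5, 5, 5, 2, 2, 7]
result = increment_by_two_histogram(samples, num_bins=8)
expected = [Counter(samples)[b] for b in range(8)]
print(result == expected)
```

A run of three identical values is handled as a +2 followed by a normal +1, because by the third sample the merged write has already reached the RAM.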

For the overclocking, the shifted clock is only used to determine the read/write cycle on the 2x clock. This can be determined using other methods if there are timing issues with this method.
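
A minimal Python sketch of the 2x clock idea, assuming one read and one write on the same RAM port per sample. The `write_phase` flag stands in for whatever marks the read or write half of the slow clock (the shifted clock, or some other method); it and the function name `double_rate_histogram` are illustrative assumptions.

```python
def double_rate_histogram(samples, num_bins):
    """Two fast-clock RAM cycles per sample: read first, then write back."""
    mem = [0] * num_bins
    for addr in samples:  # one iteration = one slow (pixel) clock
        count = 0
        for write_phase in (False, True):  # two fast-clock cycles per sample
            if not write_phase:
                count = mem[addr]      # fast cycle 0: read the current count
            else:
                mem[addr] = count + 1  # fast cycle 1: write it back incremented
    return mem


result = double_rate_histogram([5, 5, 5, 2], num_bins=8)
print(result[5], result[2])
```

Because the write for one sample completes before the read for the next sample starts, repeated values need no special handling in this scheme.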
 
The difficult part is incrementing the same index on two consecutive cycles.
So,
You're saying that without modifications Xilinx's approach won't work?
 

That is not what I am saying. I am explaining why there is a problem in the first place. Then I explain three possible solutions to the problem; the first is the 2x clock method that Xilinx used in the second paper.
 