Efficiently comparing 64bit values

Status
Not open for further replies.

dpaul

Advanced Member level 5
Joined
Jan 16, 2008
Messages
1,799
Helped
317
Reputation
635
Reaction score
342
Trophy points
1,373
Location
Germany
Activity points
13,072
Need ideas...

I have multi-input channels (lets assume 8 channels for now) each carrying 256 bits of AXI-Stream data. Now each input (the tdata bus) has a 64 bit unique value given by the position, tdata_i(95 downto 32).
My requirement is to compare each of the 64 bits values from all the 8 input data streams, determine which input data has the least 64 bit value and then save that data input into a FIFO first. Then the tdata with the next bigger 64 bit value will go into the second location of the FIFO and so on.
I will have corresponding tvalid_i and tlast_i signals for all the 8 channels.
I have to keep in mind that data might be continuously coming in from one or all of the channels or it might be completely random.

Target device is xc7a200t and clock is 125MHz.

How to do the comparison in the most efficient way?

Regards.
 

To compare numbers you simply use the less than operator (i.e. if a < b then...). If that approach meets your timing requirements, then you're done. Don't make it harder than it needs to be. If performance is not met then break the comparison up into a comparison of two 32 bit numbers where the result of the 32 bit compares get stored in a flip flop and on the next clock cycle you perform the rest of the comparison. If that doesn't work, break it into four 8 bit compares that get stored in the flip flop.

Kevin Jennings
 
Reactions: dpaul

    dpaul

    Points: 2
    Helpful Answer Positive Rating
Don't make it harder than it needs to be.
I appreciate your 'KISS' approach. Will try it as a first go, still not into the implementation stage.
 

Basically what you want is a sort algorithm. From the research I did many years ago this isn't something that is quick and efficient in hardware nor is it quick and efficient in software. Given you want to do this in real time with incoming data likely means a large pipelined tree structure with compares between each stage to "bubble sort" between each stage. First guess is probably 8 pipeline stages with compares between each adjacent pair of values with each stage alternating the adjacent pairs, i.e. abcdefgh: 1st ab cd ef gh, 2nd bc de fg, 3rd ab cd ef gh, ...
 


This is a "merge sort", not a bubble sort. Merge sorts are very fast on FPGAs, but take quite a few resources, but allow for a nice pipelined approach.
 
Reactions: dpaul

    dpaul

    Points: 2
    Helpful Answer Positive Rating
Yes it takes 8 stages to do this...


- - - Updated - - -

This is a "merge sort", not a bubble sort. Merge sorts are very fast on FPGAs, but take quite a few resources, but allow for a nice pipelined approach.
Okay thanks for pointing that out, as I said it's been a long long time since I last looked at this stuff. You are right it is a merge sort not a bubble sort and as I show it only takes 8 stages (very expensive but doable).

I would be particularly careful with how you deal with streams that are inactive or run out of data. That is something that needs clarification. Perhaps an extra bit MSB that is enabled when there is valid data so it will always sort to one end or the other and you only write the valid words into your sorted word FIFO.
 
Reactions: dpaul

    dpaul

    Points: 2
    Helpful Answer Positive Rating
First, the problem can reduce to a sort. However the problem here is find_min. There is no reason to find the 2nd lowest, 3rd lowest, etc...

The entire circuit has the comparators and the selection. 125MHz is probably a bit high. There are a few options:
1.) the 7 compare + 7 2:1 muxes. tvalid can be used either as part of the 2:1 muxes or as a 65th bit.
2.) 7 compares and an 8:1 mux.
3.) 4 + 6 compares and 4 2:1 muxes and a 4:1 mux.
4.) a few others.

The long path in the design is that the results of a compare are used to select the arguments for another compare. However, 6 compares can be done on 4 inputs and the 6b result used to generate the select bits.


The compares for the first methods are:
(a,b), (c,d), (e,f), (g,h),
(min(a,b),min(c,d)), (min(e,f), min(g,h)),
(min(a,b,e,f), min(c,d,g,h)).

for the 6 compare method, they are (a,b), (a,c), (a,d), (b,c), (b,d), (c,d).
if A is lowest, then 000???
if B is lowest, then 1??00?
if C is lowest, then ?1?1?0
if D is lowest, then ??1?11
where 0 means the left side is lower and 1 means the left side is greater or equal. In anycase, it comes down to 2 LUT6's.
 
Reactions: dpaul

    dpaul

    Points: 2
    Helpful Answer Positive Rating
Thank you for all the responses.
If I implement a pipeline approach for comparison, then I must allow for adequate data buffering in each channel and this would be somewhat costly....to be kept in mind!
The resources for just data buffering could be - 256bits x no. of pipeline-stages x no. of channels
 
Last edited:


Based on the below quote from your original set of requirements...
vGoodtimes neglected to account for the stuff in red (i.e. they want the sorted values not the minimum value)

If the sort is done using a pipeline then the input buffers don't have to be larger unless there is a feedback mechanism required that uses the sorted field to do something with the incoming channel data, which wasn't specified as part of the requirements.

Also if you are stuffing all these values into a FIFO then you will be restricted by the time sliced time of writing each of the sorted values into the FIFO one at a time. Is that what you are referring to as the reason for data buffering? If you need to do this FIFO write operation without the 8 TDMA operation then you'll need an 8-port RAM, which you could implement with distributed RAM to save block RAMs for your input buffers.
 

ah, ok. i did misunderstand the problem. I'd go with something like the bitonic sorting network or even-odd merge sort network.

I would do this only for the index order+valids and do the data muxing at the end. I'm guessing you want the data on the next cycle to be written to the next addresses of the fifo with no padding. Either way, doing the muxing at the end makes it easy to do either.

this could be 8 cycles of latency, that sounds like a reasonable estimate.


if the data actually is 256b, then BRAM get expensive. 64b writes mean 4 BRAM per input or 32 BRAM.

If the data is written over 8 cycles, then the find_min becomes a viable circuit or it makes sense to have a non-pipelined sort that also takes 8 cycles.
 

Status
Not open for further replies.
Cookies are required to use this site. You must accept them to continue using the site. Learn more…