Continue to Site

Welcome to EDAboard.com

Welcome to our site! EDAboard.com is an international Electronics Discussion Forum focused on EDA software, circuits, schematics, books, theory, papers, asic, pld, 8051, DSP, Network, RF, Analog Design, PCB, Service Manuals... and a whole lot more! To participate you need to register. Registration is free. Click here to register now.

VHDL Comparison Tree

Status
Not open for further replies.

shaiko

Advanced Member level 5
Joined
Aug 20, 2011
Messages
2,644
Helped
303
Reputation
608
Reaction score
297
Trophy points
1,363
Activity points
18,302
Hello,

I have an input array of 128 unsigned vectors - 0 to 127.
I need to compare all cells and return the POSITION of the cell with the highest result.
To achieve higher throughput - I used a pipelined tree topology as follows:

Code:
    comparison_tree : process ( IN_CLOCK , IN_RESET_ASYNCHRONOUS ) is
    begin
        if IN_RESET_ASYNCHRONOUS = '1' then
            stage_0_vote <= ( others => 0 ) ;    
            stage_1_vote <= ( others => 0 ) ;    
            stage_2_vote <= ( others => 0 ) ;    
            stage_3_vote <= ( others => 0 ) ;    
            stage_4_vote <= ( others => 0 ) ;    
            stage_5_vote <= ( others => 0 ) ;    
            stage_6_vote <= ( others => 0 ) ;    
            stage_7_vote <= 0 ;           
        elsif rising_edge ( IN_CLOCK ) then
            if IN_RESET_SYNCHRONOUS = '1' then
                stage_0_vote <= ( others => 0 ) ;    
                stage_1_vote <= ( others => 0 ) ;    
                stage_2_vote <= ( others => 0 ) ;    
                stage_3_vote <= ( others => 0 ) ;    
                stage_4_vote <= ( others => 0 ) ;    
                stage_5_vote <= ( others => 0 ) ;    
                stage_6_vote <= ( others => 0 ) ;    
                stage_7_vote <= 0 ;        
            else    
            
                for index in 0 to 63
                loop
                    if input ( 2 * index ) > input ( 2 * index + 1 ) then
                        stage_0_vote ( index ) <= 2 * index ;
                    else
                        stage_0_vote ( index ) <= ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;
                
                for index in 0 to 31
                loop
                    if stage_0_vote ( 2 * index ) > stage_0_vote ( 2 * index + 1 ) then
                        stage_1_vote ( index ) <= stage_0_vote ( 2 * index ) ;
                    else
                        stage_1_vote ( index ) <= stage_0_vote ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;                

                for index in 0 to 15
                loop
                    if stage_1_vote ( 2 * index ) > stage_1_vote ( 2 * index + 1 ) then
                        stage_2_vote ( index ) <= stage_1_vote ( 2 * index ) ;
                    else
                        stage_2_vote ( index ) <= stage_1_vote ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;        

                for index in 0 to 7
                loop
                    if stage_2_vote ( 2 * index ) > stage_2_vote ( 2 * index + 1 ) then
                        stage_3_vote ( index ) <= stage_2_vote ( 2 * index ) ;
                    else
                        stage_3_vote ( index ) <= stage_2_vote ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;   
                
                
                for index in 0 to 3
                loop
                    if stage_3_vote ( 2 * index ) > stage_3_vote ( 2 * index + 1 ) then
                        stage_4_vote ( index ) <= stage_3_vote ( 2 * index ) ;
                    else
                        stage_4_vote ( index ) <= stage_3_vote ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;        

                for index in 0 to 1
                loop
                    if stage_4_vote ( 2 * index ) > stage_4_vote ( 2 * index + 1 ) then
                        stage_5_vote ( index ) <= stage_4_vote ( 2 * index ) ;
                    else
                        stage_5_vote ( index ) <= stage_4_vote ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;  

                for index in 0 to 1
                loop
                    if stage_5_vote ( 2 * index ) > stage_5_vote ( 2 * index + 1 ) then
                        stage_6_vote ( index ) <= stage_5_vote ( 2 * index ) ;
                    else
                        stage_6_vote ( index ) <= stage_5_vote ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;  
                
                if stage_6_vote ( 0 ) > stage_6_vote ( 1 ) then
                    stage_7_vote <= stage_6_vote ( 0 ) ;
                else
                    stage_7_vote <= stage_6_vote ( 1 ) ;
                end if ;                   
     
            end if ;
        end if ;
    end process comparison_tree ;

Do you see a way how this can be written more generically for any even input size ?
 

with a tree structure it's complicated, because the number of stages depends on the size.

why don't you perform the comparison in time instead of in space? write a simple fsm that loops through the entire array, compares, stores the current highest value, advances, rinse repeat. it is much more flexible but at the cost of latency.
 
  • Like
Reactions: shaiko

    shaiko

    Points: 2
    Helpful Answer Positive Rating
Interesting workaround - but it will also consume much more registers without (my guess) a significant performance gain.
 

What's the dataflow? A full array loaded every clock cycle (512 bit parallel interface), one array element updated every clock cycle, anything in between?
 

What's the dataflow?
128 words - every clock of which I have to find the location of the highest.

- - - Updated - - -

The "steps" topology ThisIsNotSam suggested would require a number of registers that's equivalent to the sum of an Arithmetic Sequence from 1 to 128 - I.E: 1+2+...+128 registers = 8256
 

Code:
-- This is more generic.  It relies on the synthesis tools doing the right thing.
-- I haven't synthesized this yet.  You should compare it to other styles.
-- Historically, tools have treated sequentially numbered names as a bus.  This
-- is not the case for "stages".  Likewise using variables in the loop for indexing
-- probably synthesizes fine, but maybe not.  You can always ensure it works by
-- creating constant integer_arrays for them.  

-- I will assume a power of two input.  normalize to this if needed.
signal stages : element_array(2*N-1-1 downto 0);
constant kStages : integer := ceil_log2(N); -- find some version of this common function

function stage(x : element_array) return element_array is
  variable x_nml : element_array(x'length-1 downto 0) := x;
  variable ret   : element_array(x'length/2 -1 downto 0);
begin
  for i in ret'range loop
    if x_nml(2*i+1) > x_nml(2*i) then
      ret(i) := x_nml(2*i+1);
    else
      ret(i) := x_nml(2*i);
    end if;
  end loop;
  return ret;
end function;


process (clk, rst) is
  variable idx_low : integer := 0;
  variable idx_size : integer := N;
begin
  if rst = '1' then
    stages := (others => kElementZero);
  else if rising_edge(clk) then
    idx_low := 0;
    idx_size := N;
    for i in 1 to kStages loop
      stages(idx_low + idx_size/2 -1 downto idx_low) <= stage(stages(idx_low+idx_size-1 downto idx_low));
      idx_low := idx_low + idx_size/2;
      idx_size := idx_size/2;
    end loop;
  end if;
end process;

This is one way to do this. It is similar to a software approach, although it is logically equivalent to the other style. It might not synthesize optimally, or at all. You might need to precompute more constants to ensure the tools know they are constant.

Another generic method would use if-generate/for-generate to recursively generate components. But that generates a lot of components and probably isn't as sim-friendly.

In terms of registers, you should compare the number of registers to the amount of logic. Registers are often treated as "free" in FPGAs. With the exception being large register-based rams.

- - - Updated - - -

I will also mention that neither the OP nor my code solve the problem in the first post. They both return the value of the highest element, not the position in the array.

- - - Updated - - -

oh, my code has bugs as well. In any case, this shows one way the code could be more generic. I might update it later as I'm interested to see what the results are.
 
  • Like
Reactions: shaiko

    shaiko

    Points: 2
    Helpful Answer Positive Rating
Another generic method would use if-generate/for-generate to recursively generate components. But that generates a lot of components and probably isn't as sim-friendly.
Can you show an example of what you mean ?
 

128 words - every clock of which I have to find the location of the highest.
What is the clock frequency?
Is it possible to use a higher frequency clock to drive a state machine?
 

No.
The data input clock is already high.
But what do you suggest ?
 

This one works well:
Code:
entity comparison_tree is

generic
(
    NUMBER_OF_WORDS	: positive  := 128 ; 
    WIDTH_WORD     	: positive  := 12 
) ;

port 
(
	IN_CLOCK			  	: in	std_logic 																				; -- Global clock.
	IN_DATA 				: in	generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS - 1 ) ( WIDTH_WORD - 1 downto 0 )   ; 
    IN_RESET_ASYNCHRONOUS	: in	std_logic 																				; -- Asynchronous Reset.
	IN_RESET_SYNCHRONOUS	: in	std_logic 																				; -- Synchronous Reset.	
                    
	OUT_SOURCE			  	: out 	unsigned ( positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) - 1 downto 0 )   
) ;

end entity comparison_tree ;





architecture rtl_comparison_tree of comparison_tree is 

signal stage_0_result : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 2 - 1 ) ( WIDTH_WORD - 1 downto 0 ) ;
signal stage_1_result : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 4 - 1 ) ( WIDTH_WORD - 1 downto 0 ) ;  
signal stage_2_result : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 8 - 1 ) ( WIDTH_WORD - 1 downto 0 ) ;
signal stage_3_result : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 16 - 1 ) ( WIDTH_WORD - 1 downto 0 ) ;    
signal stage_4_result : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 32 - 1 ) ( WIDTH_WORD - 1 downto 0 ) ;
signal stage_5_result : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 64 - 1 ) ( WIDTH_WORD - 1 downto 0 ) ;
signal stage_6_result : unsigned ( WIDTH_WORD - 1 downto 0 ) ;  
    
signal stage_0_source : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 2 - 1 ) ( positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) - 1 downto 0 ) ;
signal stage_1_source : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 4 - 1 ) ( positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) - 1 downto 0 ) ;  
signal stage_2_source : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 8 - 1 ) ( positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) - 1 downto 0 ) ;
signal stage_3_source : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 16 - 1 ) ( positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) - 1 downto 0 ) ;    
signal stage_4_source : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 32 - 1 ) ( positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) - 1 downto 0 ) ;
signal stage_5_source : generic_1d_unsigned_array ( 0 to NUMBER_OF_WORDS / 64 - 1 ) ( positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) - 1 downto 0 ) ;
signal stage_6_source : unsigned ( positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) - 1 downto 0 ) ;      
    
begin


    OUT_SOURCE <= stage_6_source ;


    comparison_tree : process ( IN_CLOCK , IN_RESET_ASYNCHRONOUS ) is
    begin
        if IN_RESET_ASYNCHRONOUS = '1' then
            stage_0_result <= ( others => ( others => '0' ) ) ;    
            stage_1_result <= ( others => ( others => '0' ) ) ;    
            stage_2_result <= ( others => ( others => '0' ) ) ;    
            stage_3_result <= ( others => ( others => '0' ) ) ;    
            stage_4_result <= ( others => ( others => '0' ) ) ;    
            stage_5_result <= ( others => ( others => '0' ) ) ;    
            stage_6_result <= ( others => '0' ) ;    
        elsif rising_edge ( IN_CLOCK ) then
            if IN_RESET_SYNCHRONOUS = '1' then
                stage_0_result <= ( others => ( others => '0' ) ) ;    
                stage_1_result <= ( others => ( others => '0' ) ) ;    
                stage_2_result <= ( others => ( others => '0' ) ) ;    
                stage_3_result <= ( others => ( others => '0' ) ) ;    
                stage_4_result <= ( others => ( others => '0' ) ) ;    
                stage_5_result <= ( others => ( others => '0' ) ) ;    
                stage_6_result <= ( others => '0' ) ;    
            else    
            
                for index in 0 to NUMBER_OF_WORDS / 2 - 1
                loop
                    if IN_DATA ( 2 * index ) > IN_DATA ( 2 * index + 1 ) then
                        stage_0_result ( index ) <= IN_DATA ( 2 * index ) ;
                        stage_0_source ( index ) <= to_unsigned ( 2 * index , positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) ) ;  
                    else
                        stage_0_result ( index ) <= IN_DATA ( 2 * index + 1 ) ;
                        stage_0_source ( index ) <= to_unsigned ( 2 * index + 1 , positive ( ceil ( log2 ( real ( NUMBER_OF_WORDS ) ) ) ) ) ;
                    end if ;    
                end loop ;
                
                for index in 0 to NUMBER_OF_WORDS / 4 - 1
                loop
                    if stage_0_result ( 2 * index ) > stage_0_result ( 2 * index + 1 ) then
                        stage_1_result ( index ) <= stage_0_result ( 2 * index ) ;
                        stage_1_source ( index ) <= stage_0_source ( 2 * index ) ;  
                    else
                        stage_1_result ( index ) <= stage_0_result ( 2 * index + 1 ) ;
                        stage_1_source ( index ) <= stage_0_source ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;                

                for index in 0 to NUMBER_OF_WORDS / 8 - 1
                loop
                    if stage_1_result ( 2 * index ) > stage_1_result ( 2 * index + 1 ) then
                        stage_2_result ( index ) <= stage_1_result ( 2 * index ) ;
                        stage_2_source ( index ) <= stage_1_source ( 2 * index ) ;  
                    else
                        stage_2_result ( index ) <= stage_1_result ( 2 * index + 1 ) ;
                        stage_2_source ( index ) <= stage_1_source ( 2 * index + 1 ) ;
                    end if ;   
                end loop ;        

                for index in 0 to NUMBER_OF_WORDS / 16 - 1
                loop
                    if stage_2_result ( 2 * index ) > stage_2_result ( 2 * index + 1 ) then
                        stage_3_result ( index ) <= stage_2_result ( 2 * index ) ;
                        stage_3_source ( index ) <= stage_2_source ( 2 * index ) ;  
                    else
                        stage_3_result ( index ) <= stage_2_result ( 2 * index + 1 ) ;
                        stage_3_source ( index ) <= stage_2_source ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;   
                
                
                for index in 0 to NUMBER_OF_WORDS / 32 - 1
                loop
                    if stage_3_result ( 2 * index ) > stage_3_result ( 2 * index + 1 ) then
                        stage_4_result ( index ) <= stage_3_result ( 2 * index ) ;
                        stage_4_source ( index ) <= stage_3_source ( 2 * index ) ;  
                    else
                        stage_4_result ( index ) <= stage_3_result ( 2 * index + 1 ) ;
                        stage_4_source ( index ) <= stage_3_source ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;        

                for index in 0 to NUMBER_OF_WORDS / 64 - 1
                loop
                    if stage_4_result ( 2 * index ) > stage_4_result ( 2 * index + 1 ) then
                        stage_5_result ( index ) <= stage_4_result ( 2 * index ) ;
                        stage_5_source ( index ) <= stage_4_source ( 2 * index ) ;  
                    else
                        stage_5_result ( index ) <= stage_4_result ( 2 * index + 1 ) ;
                        stage_5_source ( index ) <= stage_4_source ( 2 * index + 1 ) ;
                    end if ;    
                end loop ;  

                if stage_5_result ( 0 ) > stage_5_result ( 1 ) then
                    stage_6_result <= stage_5_result ( 0 ) ;
                    stage_6_source <= stage_5_source ( 0 ) ;  
                else
                    stage_6_result <= stage_5_result ( 1 ) ;
                    stage_6_source <= stage_5_source ( 1 ) ;
                end if ;                  
     
            end if ;
        end if ;
    end process comparison_tree ;  





end architecture rtl_comparison_tree ;
 

128 words - every clock of which I have to find the location of the highest.

- - - Updated - - -

The "steps" topology ThisIsNotSam suggested would require a number of registers that's equivalent to the sum of an Arithmetic Sequence from 1 to 128 - I.E: 1+2+...+128 registers = 8256

Yes it would, and that is pretty acceptable in most cases. 8k registers is chump change in the world of multi-million gate designs.
 

You can reduce the number of registers and the associated routing by building the address in reverse.

The first stage needs only 1 bit for location -- even or odd. The second stage passes this as well as 1 bit for upper two or lower two. The third stage adds the next msb. And this continues.

the log2 calculation should be a function as it is used several times in the file.
 
Yes it would, and that is pretty acceptable in most cases. 8k registers is chump change in the world of multi-million gate designs.

FYI this is an FPGA design not a multi-million gate ASIC design.

Even though 8k registers isn't a terribly large number of FFs to use for one module, it is a significant chunk of resources when you only have 400k in a medium sized FPGA (that's 2% of the total in one module).

Of course if this is done in a large Virtex/Stratix part or an Ultrascale part then yes, I would agree it is chump change.
 

FYI this is an FPGA design not a multi-million gate ASIC design.

Even though 8k registers isn't a terribly large number of FFs to use for one module, it is a significant chunk of resources when you only have 400k in a medium sized FPGA (that's 2% of the total in one module).

Of course if this is done in a large Virtex/Stratix part or an Ultrascale part then yes, I would agree it is chump change.

Sure, that is understandable. FFs aside, the solution I have proposed would generate only one comparator, whereas the tree structure would create many. Depending on resources, size of the problem, etc., it might pay off.
 
  • Like
Reactions: shaiko

    shaiko

    Points: 2
    Helpful Answer Positive Rating
Yes it would, and that is pretty acceptable in most cases. 8k registers is chump change in the world of multi-million gate designs.
When I said "register" I meant each register is actually a vector of DFFs.
So it's actually 8256 * 16 (result width) = 132096 DFFs just for the "result steps".
This will have to be coupled with a similar "steps" structure for the result source (similar in spirit to what I wrote in #10).
8256 * log2(128) = 57792

132096 + 57792 = 190K DFFs.
 

It is important to keep in mind FPGA issues. For example, if you have simple logic followed by registers then the registers are free. The registers are in the same CLBs. If there is no logic then registers are not free -- unused CLBs are used for just registers. If the registers are not needed, they lead to routing congestion.

As a result, 132096 registers is not meaningful if they are in the "basically free" category.
 

Yes...I know that. But even if we do as we should and look at the total number of LEs (instead of looking at registers and LUTs separately) the "steps" topology is orders of magnitude more resource hungry than a the "tree"
 

The results should describe the same logic and should infer the same logic. Unless you do it wrong and the tools don't do any optimization.

You should have:
16*64 + 1*64 registers for stage 1.
16*32 + 2*32 registers for stage 2.
16*16 + 3*16 for stage 3
16*8 + 4*8 for stage 4
16*4 + 5*4 for stage 5
16*2 + 6*2 for stage 6
7 for stage 7, assuming you only need the location and not the value.

This is a lot less than the 8k that you mention. If you need the 128 values to propagate, it probably makes sense to have them in a SRL. That would give 2k SRLs and a little over 2k registers.
 
vGoodtimes,

What you wrote is correct for a tree topology. The numbers I mentioned are for steps topology.
 

Status
Not open for further replies.

Similar threads

Part and Inventory Search

Welcome to EDABoard.com

Sponsor

Back
Top