advice needed on how to speed up vhdl code

qwerty_asdf

In my project I want to do this: read two arrays (say 32 elements of 64 bits each; in practice it will be thousands of elements) and, for each element of array A, compute the minimum squared difference between that element and all the elements of array B. My implementation so far is rather foolish, but it works. I want to speed it up and use fewer resources.

So far, I read element A1 (64 bits at once), then read B1...B32 in turn and compute the differences. Then I do the same for A2, and so on, like a double for loop in C.

Here are my ideas (and other people's ideas, though I do not know exactly how to start coding them).
- Read each element not in one cycle but in pieces of, say, 8 bits. My question is how this would help me. Would it reduce the memory I am using? If yes, how?
- A main problem I see is that I do not take advantage of concurrency: in every cycle I do only one computation (fetch element A1 and, say, B4, and compute the difference). How can I do better? My code seems too sequential to me (written the way a software programmer would write it). Would it be better, for example, to do more than one computation in the same cycle? If yes, which ones?

Please let me know your ideas, comments and useful links. I can post my code so far if you wish.
 

Post your code so far. But from the sound of it, you are trying to write it as if it were software, without understanding the underlying hardware. Have you compiled this and got it working on an FPGA, or just in simulation?
 

Just in simulation. OK, my code so far is this:
Code:
library IEEE;
use IEEE.std_logic_1164.all;
use STD.TEXTIO.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_arith.all;
use work.my_package.all;

entity landmark is
  generic
		(N :integer := 8;
		NA:integer:=3 );
		port ( clk:in std_logic;
		new_set: in std_logic;
		vin:in std_logic;
		rst:in std_logic;
		flag: in std_logic;
		din: in signed(N-1 downto 0);
		dout: out big_matrix(0 to N); 
		done: out std_logic
		);
end landmark;

architecture behavioral of landmark is

signal inp1,inp2: matrix1_t(0 to NA);
signal k:integer range 0 to 100:= 0;
signal l:integer range 0 to 100:= 0;
signal i:integer range 0 to 100:= 0;
signal j:integer range 0 to 100:= 0;
signal min: signed (3*N-1 downto 0);

function sum_of_square_dif( a1,b1: in signed(N-1 downto 0); previous_sum:in std_logic_vector(3*N-1 downto 0))return std_logic_vector is     
     variable temp_sum:std_logic_vector(3*N-1 downto 0):=(others=>'0');
     variable diff: signed(N-1 downto 0):=(others=>'0');
	 variable square_diff: std_logic_vector(2*N-1 downto 0):=(others=>'0');
begin
  temp_sum:=previous_sum;
	diff:=a1-b1;
	square_diff:=ext(diff*diff,2*N);
	temp_sum:=ext(temp_sum+square_diff,3*N);
  return temp_sum;
end sum_of_square_dif;

begin
  
in_read:  process (clk,rst)
	begin
	if (rst='1') then
		min<=signed(ext("0",3*N));
	elsif (clk'event and clk='0') then  --reading at negative edges.
	if (vin='1') then --vin enable signal
		if (new_set='0') then
			if (i<=NA) then
			  inp1(i)<=din;
		    i<=i+1;
		  end if;
		else 
			if (j<=NA) then
			   inp2(j)<=din;
			   j<=j+1;
			end if;
		end if;
		--if ((j>NA)and(i>NA)) then
			--flag<='1';
		--end if;
	end if;
end if;
end process in_read;
   
    
f_min:    process(clk)
		   variable temp_num: signed (3*N-1 downto 0);
    begin
      if (clk'event and clk='1') then
			 if (flag='1') then --finished reading
          if (k<=NA) then 
              temp_num:=signed(ext(sum_of_square_dif(inp1(k),inp2(l),"000000000000000000000000"),3*N));
              done<='0';
              if(l<NA) then
                if (l=0) then
                  min<=temp_num;
                elsif (temp_num< min) then 
				          min<= temp_num;
			          else
				          min<=min;
				        end if;
                l<=l+1;
              else --last element of each row
                if (temp_num< min) then
				          dout(k)<= temp_num;
			          else
				          dout(k)<=min;
				        end if;
                l<=0;
                k<=k+1;
                min<=signed(ext("0",3*N));
              end if;
			     else
			        done<='1';
        end if;
	   end if; --end of if flag='1'
   end if;--end of clk'event
end process f_min;
    
end behavioral;
 

OK, so there isn't anything really wrong with the code. It should synthesize and work OK on an FPGA, but I can see potential bugs and design flaws.

1. The counters i, j, k and l: you have declared them with range 0 to 100, and you never reset them. In simulation, when one of them reaches 100 and you try to add one, you will get an integer out-of-range error (integers do not roll over). On real hardware, they will be synthesised as 7-bit numbers, so they will actually run from 0 to 127 and roll over back to 0.

2. How big could dout be? You'll probably get away with the default of 8, but it is going to get bigger and bigger. Instead of having a big parallel output, why not put the results in a RAM? Yes, you will have to access them serially, but you will use far fewer resources. Looking at the way you have it, it wouldn't be a big deal to make this a RAM anyway, as only one output is written in any one clock cycle.

3. The sum_of_square_dif function does too much in a single clock cycle. I would recommend breaking it up and pipelining it. All you really need to do is put almost identical code inside a clocked process, and make sure you delay any addressing by the same number of clocks. This would allow you to increase the Fmax of the final circuit (assuming that's what you meant by "faster").

It's not the VHDL code you speed up, it's the underlying design.
I would have a look at the RTL Viewer when you compile it for an FPGA. It will show you the circuit it created; see if it matches what you drew before you wrote the code. (You did draw the circuit before you wrote the code, right? HDL stands for hardware description language, so if you don't know what the circuit is, how do you expect to describe it?)

And finally, stop using std_logic_unsigned and std_logic_arith. They are non-standard VHDL. I suggest switching to numeric_std (which is part of the VHDL standard). With your current code I don't think it will break anything.
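Points 1 and 2 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's design: the entity name result_store, its ports and the generics ADDR_W and DATA_W are all made up for the example. The write address is an unsigned counter with an explicit reset, so it wraps the same way in simulation and in hardware, and the results go into a small RAM that is read back one entry at a time.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; names and widths are illustrative assumptions.
entity result_store is
  generic (
    ADDR_W : positive := 5;   -- 2**ADDR_W result slots
    DATA_W : positive := 24   -- width of one minimum (3*N in the posted code)
  );
  port (
    clk     : in  std_logic;
    rst     : in  std_logic;                        -- synchronous reset
    wr_en   : in  std_logic;                        -- pulse once per finished result
    wr_data : in  unsigned(DATA_W-1 downto 0);
    rd_addr : in  unsigned(ADDR_W-1 downto 0);
    rd_data : out unsigned(DATA_W-1 downto 0)
  );
end entity;

architecture rtl of result_store is
  type ram_t is array (0 to 2**ADDR_W - 1) of unsigned(DATA_W-1 downto 0);
  signal ram     : ram_t;
  -- An unsigned counter wraps from all ones to zero in simulation and in
  -- hardware alike, unlike a ranged integer that errors out in simulation.
  signal wr_addr : unsigned(ADDR_W-1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        wr_addr <= (others => '0');
      elsif wr_en = '1' then
        ram(to_integer(wr_addr)) <= wr_data;   -- one write per clock at most
        wr_addr <= wr_addr + 1;
      end if;
      rd_data <= ram(to_integer(rd_addr));     -- registered (synchronous) read
    end if;
  end process;
end architecture;
```

Whether a synthesis tool maps this onto block RAM depends on the tool and the device, so it is worth checking the synthesis report.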
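As a hedged illustration of that last point, the posted sum_of_square_dif function might look like this with numeric_std only. The package name sq_diff_pkg is hypothetical, and N is fixed as a package constant here purely to keep the sketch self-contained. Two changes go beyond a straight translation and are choices of this sketch: the running sum is an unsigned instead of a std_logic_vector, and the difference is one bit wider than the operands so that extreme inputs cannot overflow.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;   -- replaces std_logic_arith and std_logic_unsigned

-- Hypothetical package; the name and the constant are illustrative.
package sq_diff_pkg is
  constant N : positive := 8;   -- element width, same as the posted default

  function sum_of_square_dif(
    a1, b1       : signed(N-1 downto 0);
    previous_sum : unsigned(3*N-1 downto 0)
  ) return unsigned;
end package;

package body sq_diff_pkg is
  function sum_of_square_dif(
    a1, b1       : signed(N-1 downto 0);
    previous_sum : unsigned(3*N-1 downto 0)
  ) return unsigned is
    variable diff        : signed(N downto 0);        -- N+1 bits: cannot overflow
    variable square_diff : unsigned(2*N+1 downto 0);  -- a square is never negative
  begin
    -- resize() sign-extends signed values and zero-extends unsigned ones,
    -- which covers what ext() was used for in the original code.
    diff        := resize(a1, N+1) - resize(b1, N+1);
    square_diff := unsigned(diff * diff);
    return previous_sum + resize(square_diff, 3*N);
  end function;
end package body;
```

A call would then read something like sum_of_square_dif(inp1(k), inp2(l), to_unsigned(0, 3*N)), which also removes the hard-coded 24-character zero literal.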
 
Thank you very much. I will work on these and come back.
 

OK, I am coming back to this. I solved issues 1 and 2, and now I need your help with issue 3 (breaking up the sum-of-squares function and pipelining it). Would it be appropriate to do something like this:
Code:
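process(clk)
begin
  if clk'event and clk='1' then
    compute_the_diff;  -- step 1: a - b
    square;            -- step 2: diff * diff
    sum;               -- step 3: add to the running sum
  end if;
end process;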

Or should I put them in different processes (so that the square step is independent of compute_diff, and so on)? And where should I place the delay you mentioned? Between the end of each phase and the start of the next one?
 

They'll be fine in the same process, as long as you assign the results to signals. This will give you a pipeline length of 3 clocks. No delays are needed (they are inherent in the code).
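A minimal sketch of what that answer describes, under assumptions: the entity sq_diff_pipe, its ports and the valid flags are invented for the example and do not come from the thread. Each step writes to a signal, so each step becomes one register stage, and a valid flag is delayed alongside the data in the same way any addressing would need to be.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; port and signal names are illustrative assumptions.
entity sq_diff_pipe is
  generic (N : positive := 8);
  port (
    clk       : in  std_logic;
    rst       : in  std_logic;                      -- synchronous, clears the sum
    valid_in  : in  std_logic;                      -- a and b are valid this cycle
    a, b      : in  signed(N-1 downto 0);
    sum       : out unsigned(3*N-1 downto 0);
    valid_out : out std_logic                       -- valid_in delayed by 3 clocks
  );
end entity;

architecture rtl of sq_diff_pipe is
  signal diff_r   : signed(N downto 0)       := (others => '0');
  signal square_r : unsigned(2*N+1 downto 0) := (others => '0');
  signal sum_r    : unsigned(3*N-1 downto 0) := (others => '0');
  signal valid_r  : std_logic_vector(1 to 3) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        valid_r <= (others => '0');
        sum_r   <= (others => '0');
      else
        -- Stage 1: difference, one bit wider than the inputs
        diff_r <= resize(a, N+1) - resize(b, N+1);
        -- Stage 2: square of the difference registered one clock earlier
        square_r <= unsigned(diff_r * diff_r);
        -- Stage 3: accumulate only squares that came from valid inputs
        if valid_r(2) = '1' then
          sum_r <= sum_r + resize(square_r, 3*N);
        end if;
        -- The valid flag moves through the pipeline with the data
        valid_r <= valid_in & valid_r(1 to 2);
      end if;
    end if;
  end process;

  sum       <= sum_r;
  valid_out <= valid_r(3);
end architecture;
```

Because every assignment reads the signal values from before the clock edge, the three statements describe three separate register stages even though they sit in one process. Variables assigned and read in the same clock would collapse back into a single combinational path.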
 