Help me with a code which adds 16 sad

kakarala · Sep 2, 2010

Hello,

I am trying to write synthesisable code to compute the SAD for two blocks. I wrote the following code and I am able to compute the absolute difference of each pixel in the two blocks. But when I add all 16 differences using the following statement:
sad <= tmp(0,0)+tmp(0,1)+tmp(0,2)+tmp(0,3)+tmp(1,0)+tmp(1,1)+tmp(1,2)+tmp(1,3)+tmp(2,0)+tmp(2,1)+tmp(2,2)+tmp(2,3)+tmp(3,0)+tmp(3,1)+tmp(3,2)+tmp(3,3);

it takes hours to synthesise the code. Is there any simple way to add all the differences?

I am using a package "images" that contains the images. I have attached the code.

Nixphe · Sep 3, 2010

Re: Help about the code

Hello,

If you want synthesisable code, IMO it's better to use std_logic_vectors than integers. If you don't, you should specify a range for your integers. You can hope the tools do this for you: an expensive toolset might, but a cheaper one might not and just assume a default range (typically 32 bits).

If you have a look at what the tools synthesized, does it seem reasonable in size? If not, the integers are my first suspect.

Are you sure you are talking about synthesis and not implementation/place & route/...? If the latter, maybe the architecture you are targeting is not suitable.

Adding 16 numbers together in one cycle is no problem in itself.
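
To make the point about ranges concrete, here is a minimal sketch of a 16-input sum with ranged integers. It assumes 8-bit pixels, so each difference fits in 0 to 255 and the sum in 0 to 16*255 = 4080. The package, entity and signal names (sad_types, sum16, diff) are made up for illustration and are not taken from the attached code:

```vhdl
-- Hypothetical types; 8-bit pixels are an assumption.
package sad_types is
  subtype diff_t is integer range 0 to 255;        -- one absolute difference
  type diff_array_t is array (0 to 15) of diff_t;  -- 4x4 block, flattened
  subtype sad_t is integer range 0 to 4080;        -- 16 * 255
end package sad_types;

library ieee;
use ieee.std_logic_1164.all;
use work.sad_types.all;

entity sum16 is
  port (
    clk  : in  std_logic;
    diff : in  diff_array_t;
    sad  : out sad_t
  );
end entity sum16;

architecture rtl of sum16 is
begin
  process (clk)
    variable acc : sad_t;
  begin
    if rising_edge(clk) then
      acc := 0;
      -- The loop unrolls into 16 additions; the ranges tell the tool
      -- that 12 bits are enough for the result.
      for k in diff'range loop
        acc := acc + diff(k);
      end loop;
      sad <= acc;  -- registered result
    end if;
  end process;
end architecture rtl;
```

With the ranges declared, the tool has no reason to build wide default-range adders, which is the first thing I would rule out.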

Kind regards

vipinlal · Sep 3, 2010

Re: Help about the code

You have been asking similar questions for a long time, and we told you not to use for loops without care.

Code:

for i in 0 to 3 loop
  for j in 0 to 3 loop
    tmp(i,j) <= abs(curr_image(((currblk_row*4+i)*64)+(currblk_column*4+j)) - ref_image(((refblk_row*4+i)*64)+(refblk_column*4+j)));
  end loop;
end loop;

I got these lines from sad.txt, and I think you should go for a simpler design. Doing so much on one clock edge is difficult and will take hours to synthesise.

TrickyDicky · Sep 3, 2010

Re: Help about the code

vipinlal said:
You have been asking similar questions for a long time, and we told you not to use for loops without care.

Code:

for i in 0 to 3 loop
  for j in 0 to 3 loop
    tmp(i,j) <= abs(curr_image(((currblk_row*4+i)*64)+(currblk_column*4+j)) - ref_image(((refblk_row*4+i)*64)+(refblk_column*4+j)));
  end loop;
end loop;

I got these lines from sad.txt, and I think you should go for a simpler design. Doing so much on one clock edge is difficult and will take hours to synthesise.

This line isn't the problem. This is 16 parallel adders.

The problem is the large adder underneath.

---------- Post added at 01:29 PM ---------- Previous post was at 01:26 PM ----------

I've just realised why it's taking so long.

Because you're doing so many lookups into your image in parallel, it cannot create a single memory for it. It has to create loads and loads of memories.

Please pipeline your design and ROM lookups properly.
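
As a rough sketch of what a properly registered ROM lookup can look like: one address in, one registered data word out per clock. The entity name image_rom, the 64 x 64 image size, the 8-bit pixel width and the all-zero contents are assumptions for illustration only; the real image data lives in the "images" package.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical single-port ROM; size and contents are placeholders.
entity image_rom is
  port (
    clk  : in  std_logic;
    addr : in  unsigned(11 downto 0);   -- 64 x 64 = 4096 pixels (assumed)
    data : out unsigned(7 downto 0)     -- 8-bit pixels (assumed)
  );
end entity image_rom;

architecture rtl of image_rom is
  type rom_t is array (0 to 4095) of unsigned(7 downto 0);
  -- Placeholder contents; a real design would initialise this from the image.
  constant ROM : rom_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- A single read per clock with a registered output is the pattern
      -- synthesis tools usually map onto a block RAM.
      data <= ROM(to_integer(addr));
    end if;
  end process;
end architecture rtl;
```

Sixteen reads of the same array in one clock cannot be served by one memory port, which is why the tool ends up replicating the image.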

Nixphe · Sep 4, 2010

Re: Help about the code

It seems to me that doing lots of operations between two clock edges is not really a problem, as long as your device has sufficient resources and you ask for a reasonable period. Depending on which toolset you use (as far as I can remember; it has been a while), synthesis may or may not take the imposed clock period and the chosen device into account.

Depending on the tools, it might be more interesting to use the tools to instantiate a memory with preloaded values, preferably in hard RAM blocks in your FPGA. As far as I know, there might be the following problems:
- (again) the device is on the small side. Choose another device and check if things go faster.
- the tools don't recognise your image as a real memory, but implement it in another, complicated way. You can check the manuals and rewrite your VHDL description to match the recommended coding style, so it will be recognised as a memory. You could try making the memory much smaller as a test.

permute · Sep 4, 2010

Re: Help about the code

The OP can also make use of the wide IO on FPGA BRAMs, e.g. reading two 32b words per BRAM. I suppose the synthesizer isn't picking this up.

(Also, the OP's design is probably flawed: the addition does not allow bit growth, and the inputs are all >= 0.)

The tools should be able to pick up the image as a ROM. XST will annoyingly pick up a 256b ROM as an 18kb BRAM instead of using the F7/F8 muxes in a slice to make a 256b ROM.
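
On the bit growth remark: adding two N-bit unsigned values needs N+1 bits, so summing 16 values of 8 bits needs 8 + 4 = 12 bits (16 * 255 = 4080, which is below 4096). A minimal sketch of one stage using numeric_std, with a made-up entity name and assumed 8-bit inputs:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical two-input adder stage that keeps the carry bit.
entity add_grow is
  port (
    a, b : in  unsigned(7 downto 0);
    s    : out unsigned(8 downto 0)   -- one extra bit for the carry
  );
end entity add_grow;

architecture rtl of add_grow is
begin
  -- numeric_std "+" returns a result as wide as its widest operand,
  -- so both operands are widened before the addition.
  s <= resize(a, 9) + resize(b, 9);
end architecture rtl;
```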

kakarala · Sep 7, 2010

Re: Help about the code

Can anyone tell me how to implement the same logic without for loops?
I am thinking of computing each absolute difference, storing each one in a new variable, then adding all the variables and storing the result in another variable.

kakarala · Sep 8, 2010

Re: Help about the code

Actually, I need help adding all 16 differences. If I remove that statement, it works fine. I need to add all 16 at once.

permute · Sep 8, 2010

Re: Help about the code

How many cycles can you allow per SAD? You currently have a 2-cycle pipeline, allowing a new output every cycle. You could move to a design that loads a 72b segment of 8 pixels from each row. This would allow a new output every 4 cycles. Other optimizations can be made to reduce the area at this point. For example, the 16-input adder can become a 4-input accumulator (5-input addition).

kakarala · Sep 8, 2010

Re: Help about the code

So what you want me to do is decrease the number of pixel differences computed in each clock cycle?

permute · Sep 8, 2010

Re: Help about the code

I want you to reduce the number of pixels accessed per clock. Right now, the large number of RAM lookups is causing the design size to grow to unacceptable levels. In the worst case, 1 pixel would be loaded per image, per cycle. This gives a small area. If this is an FPGA, there are probably resources to allow 2 pixels per image per cycle to be loaded. With a little work, this can be improved to 4 per image per cycle. After this point, additional BRAMs would need to be used.

As the number of pixels read per image per clock cycle increases, the performance can increase. But doing so also requires more hardware. Thus, if the performance is not needed, reading 1-4 pixels per image per cycle might provide a low area design with suitable performance.
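
A rough sketch of the 1-pixel-per-image-per-cycle case, under some assumptions: 8-bit pixels, a 4x4 block (16 differences), and the two pixels arriving on curr_pix and ref_pix during the 16 clocks after a one-cycle start pulse. The entity and port names are made up for illustration; address generation and the ROM reads are left out.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical serial SAD: one pixel pair per clock, 16 clocks per block.
entity sad_serial is
  port (
    clk      : in  std_logic;
    start    : in  std_logic;               -- one-cycle pulse before the first pixel pair
    curr_pix : in  unsigned(7 downto 0);
    ref_pix  : in  unsigned(7 downto 0);
    sad      : out unsigned(11 downto 0);   -- 16 * 255 = 4080 fits in 12 bits
    done     : out std_logic                -- high for one cycle when sad is valid
  );
end entity sad_serial;

architecture rtl of sad_serial is
  signal acc   : unsigned(11 downto 0) := (others => '0');
  signal count : integer range 0 to 15 := 0;
  signal busy  : std_logic := '0';
begin
  process (clk)
    variable diff : unsigned(7 downto 0);
  begin
    if rising_edge(clk) then
      done <= '0';
      if start = '1' then
        acc   <= (others => '0');           -- clear the running sum
        count <= 0;
        busy  <= '1';
      elsif busy = '1' then
        -- Absolute difference of two unsigned values
        if curr_pix >= ref_pix then
          diff := curr_pix - ref_pix;
        else
          diff := ref_pix - curr_pix;
        end if;
        acc <= acc + diff;
        if count = 15 then
          sad  <= acc + diff;               -- include the 16th difference
          done <= '1';
          busy <= '0';
        else
          count <= count + 1;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

This is the small-area end of the trade-off: one subtractor and one 12-bit adder, at the cost of 16 cycles per block. Reading 2 or 4 pixels per image per cycle would widen the datapath and shorten the loop accordingly.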

kakarala · Sep 8, 2010

Re: Help about the code

I changed the code as follows so that it computes one difference per clock and adds it on each clock. But it gives me the following error:
Advanced HDL Synthesis *
=========================================================================

INTERNAL_ERROR:Xst:cmain.c:3464:1.56 - Process will terminate. For technical support on this issue, please open a WebCase with this project attached at Xilinx: Support.

Process "Synthesize - XST" failed

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use work.images.all;

entity SAD1 is
  port ( clk            : in  std_logic;
         rst            : in  std_logic;
         currblk_row    : in  integer;
         currblk_column : in  integer;
         refblk_row     : in  integer;
         refblk_column  : in  integer;
         sad            : out integer range 0 to 4080);
end SAD1;

architecture Behavioral of SAD1 is
  signal tmp : one_array(0 to 15);
  signal x, count_i, count_j, i, j : integer := 0;
  signal tsum : integer;
begin
  sum_ad : process(clk, rst)
  begin
    if count_i = 4 then
      count_i <= 0;
      i <= i+1;
    end if;
    if count_j = 4 then
      j <= 0;
    end if;
    if i <= 3 and j <= 3 then
      tmp(4*i + j) <= abs(curr_image(((currblk_row*4+i)*64)+(currblk_column*4+j)) - ref_image(((refblk_row*4+i)*64)+(refblk_column*4+j)));
      j <= j+1;
      count_i <= count_i+1;
      count_j <= count_j+1;
    end if;
    if x /= 16 then
      tsum <= tsum + tmp(x);
      x <= x + 1;
    end if;
    end if;
    sad <= tsum;
  end process;
end Behavioral;

TrickyDicky · Sep 8, 2010

Re: Help about the code

You need to make the process synchronous. The problem you are getting is because you forgot to add the following code around yours, and it's trying to make some hideous asynchronous circuit. You need to add:

Code:

if rst = '1' then
  --do reset
elsif rising_edge(clk) then
  --do all the other stuff you wrote
end if;

around your code.

kakarala · Sep 8, 2010

Re: Help about the code

I do have that part of the code in Xilinx; I forgot to include it in the code above, sorry about that. It gives the same error.

TrickyDicky · Sep 9, 2010

Re: Help about the code

I think the problem is still with the assignment of tmp.

You are trying to do many things all in one clock cycle:
- generate 4x read addresses through a multiplier and adder
- read the same ROM 4 times

These should be pipelined appropriately.

permute · Sep 9, 2010

Re: Help about the code

I don't think the use of integer without a range helps either. But the above does 1 ROM access per image. The address generation is just a few shift/add operations.

The code should probably be written differently: the signal assignments (non-blocking, in Verilog terms) won't update the signals instantly, so I don't understand why there is a count_i, i, count_j, and j. tsum should also get assigned tmp in some cases, as opposed to always being tmp+tsum.
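
To illustrate the point about signal updates, here is a minimal sketch of a 4x4 index counter that needs only i and j. Every comparison inside the clocked process sees the value the signal had before the clock edge, so the wrap test is written against 3, not 4. The entity name idx4x4 and the synchronous reset are my own choices for the example, not something from the posted code:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical row/column counter for a 4x4 block.
entity idx4x4 is
  port (
    clk  : in  std_logic;
    rst  : in  std_logic;                   -- synchronous reset (assumed)
    i, j : out integer range 0 to 3         -- row and column index
  );
end entity idx4x4;

architecture rtl of idx4x4 is
  signal i_r, j_r : integer range 0 to 3 := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        i_r <= 0;
        j_r <= 0;
      elsif j_r = 3 then                    -- old value of j_r, last column
        j_r <= 0;
        if i_r = 3 then
          i_r <= 0;                         -- last row: wrap to the first block entry
        else
          i_r <= i_r + 1;
        end if;
      else
        j_r <= j_r + 1;
      end if;
    end if;
  end process;

  i <= i_r;
  j <= j_r;
end architecture rtl;
```

The same idea applies to the accumulator: on the first difference of a block it would load the value, and only on the following ones add to the running sum.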

Welcome to EDAboard.com
