Using read before write RAM for Histogram calculation

shaiko · Apr 13, 2016

Hello,

A common approach for FPGA Histogram calculations is to use a Dual port RAM in the following fashion:

However, it's always mentioned that the RAM must be configured as "read before write".
My question: is the "read before write" function necessary to prevent data corruption when the same pixel value arrives twice in consecutive cycles?

K-J · Apr 14, 2016

shaiko said:
My question: is the "read before write" function necessary to prevent data corruption when the same pixel value arrives twice in consecutive cycles?

No, 'read before write' is necessary because you're going to overwrite the address with new data so in order to do any calculations on the old data, you need to read it out before the address is overwritten.

One would only store data in memory in the first place if one is going to read it and do something with it. That 'something' can either be functional (i.e. compute something based on the data) or timing related (i.e. to delay one scan line so that one can use pixels from the current scan line and the previous scan line in some computation, such as averaging).

Kevin Jennings
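
For reference, here is a minimal behavioural sketch of what "read first" means on a two-port memory with a common clock. It is not the vendor memory core; the entity name ram_read_first and its port names are made up for illustration. Because mem is a signal, a read and a write that hit the same address on the same clock edge hand the old contents to the reader.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical behavioural model, for simulation and illustration only.
entity ram_read_first is
  port (
    clk   : in  std_logic;
    raddr : in  unsigned(3 downto 0);   -- read port address
    rdata : out unsigned(7 downto 0);   -- read data, one clock of latency
    we    : in  std_logic;              -- write enable
    waddr : in  unsigned(3 downto 0);   -- write port address
    wdata : in  unsigned(7 downto 0)    -- write data
  );
end entity ram_read_first;

architecture rtl of ram_read_first is
  type mem_t is array (0 to 15) of unsigned(7 downto 0);
  signal mem : mem_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- The read samples mem as it was before this edge, so a write to
      -- the same address on the same edge is not visible here yet.
      rdata <= mem(to_integer(raddr));
      if we = '1' then
        mem(to_integer(waddr)) <= wdata;
      end if;
    end if;
  end process;
end architecture rtl;
```

Whether a synthesis tool maps a description like this onto block RAM with the same collision behaviour depends on the tool and the device, so it is worth checking the vendor's memory documentation before relying on it.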

shaiko · Apr 14, 2016

Kevin,
Is my drawing correct?

Notes:
1. Both ports use the same clock.
2. The address to the write port is the address of the read port delayed by one cycle.

TrickyDicky · Apr 14, 2016

The approach in the OP will work - but what are you going to do with these histograms? Your idea gives no ability to read the results.

shaiko · Apr 14, 2016

The approach in the OP will work

But in order for it to work the memory must be configured as "read before write" - correct?

your idea gives no ability to read the results.

I wrote "always_read" and "always_write" just to make the illustration simpler. The real application will of course have more complicated logic controlling these ports that will allow reading out the results when the computation is complete.

K-J · Apr 14, 2016

shaiko said:
Is my drawing correct?

Sort of, but not quite. The +1 in the data path does not imply that you're delaying the data by a clock cycle, it just says you're adding one. This implies that the address and data will be one clock cycle out of sync. However, assuming this is an implementation in an FPGA which would have synchronous memory, the data delay would get implemented inside your memory block in which case the drawing is correct, but should note that there is a one clock cycle latency in that memory.

Kevin Jennings

shaiko · Apr 14, 2016

Sort of, but not quite. The +1 in the data path does not imply that you're delaying the data by a clock cycle, it just says you're adding one.

Well yes - that's what I meant:
din_port_write = dout_port_read + 1

This is how I see it step by step:
1. On clock edge #1 a value is driven into our system. This value is fed (without a delay register) to "Address A" as well as to "Address B" (delayed by one register).
3. Because of the intrinsic register in the RAM's output - Dout A (of the address driven in clock edge #1) is available only on edge #2. This data is incremented by 1 after a minor combinational delay and fed to Din B, at the address marked in red.
4. The internal content of the memory cell is changed only on edge #3.

Correct ?

K-J · Apr 14, 2016

shaiko said:
3. Because of the intrinsic register in the RAM's output - Dout A (of the address driven in clock edge #1) is available only on edge #2.

RAM does not have an intrinsic register that delays the output to the next clock edge. That's why I suggest the note stating that the figure is for memory that does have an output register.

Kevin Jennings

shaiko · Apr 14, 2016

I think that my system can work ONLY if the output of "Dout a" is unregistered.
What do you think?

K-J · Apr 14, 2016

shaiko said:
I think that my system can work ONLY if the output of "Dout a" is unregistered.
What do you think?

No, it won't work, unless you also get rid of the register leading to AddressB OR the memory is synchronous memory that has a one clock cycle latency.

If DoutA is unregistered but AddressB is registered (as shown in your diagram), here is what you will get at time t...
AddressA = Value(t)
DoutA = Mem(Value(t))
AddressB = Value(t-1)
DinB = Mem(Value(t))+1

Note that the value being written to the AddressB port is not what you intended. What should be written back is Mem(Value(t-1))+1

The clock cycle latency in the AddressB path must match the latency in the DinB path in order to get the functionality that you mentioned. The latency can be whatever you want it to be, but the address and data paths must have the same latency. Some solutions:

- Zero latency: Remove the address delay register and use unclocked memory (i.e. Read_Data <= Mem(Address)); Using current FPGAs, this would likely not be a preferable solution since it would mean the fitter would have to use the LUT memory to implement rather than block RAM so the memory size would be limited. Also, since this is implemented in zero clock cycles, the clock cycle performance of the overall design might be impacted since the path from address, through the memory, through the adder and back to the memory might be a critical path.
- One clock latency: As you've drawn your figure implemented but using synchronous memory (i.e. Read_Data <= Mem(Address) after rising_edge(Clock)). Using current FPGAs, this would likely be the preferred solution since the FPGA could use either block RAM or LUTs and registers as the fitter so chooses and the clock cycle performance of this subsystem would be higher than the zero latency approach. The drawback is the additional registers to implement the address delay.
- N clock latency: Have N registers in the address path, N-1 registers in the data path and use synchronous memory.

Kevin Jennings
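
The zero latency and one clock latency options above differ only in where the read happens. A rough sketch, assuming a small behavioural memory (the entity read_latency_demo and its port names are hypothetical): read_async follows the address within the same cycle, while read_sync shows the same data one clock later.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical demo entity contrasting the two read styles.
entity read_latency_demo is
  port (
    clk        : in  std_logic;
    we         : in  std_logic;
    address    : in  unsigned(3 downto 0);
    wdata      : in  unsigned(7 downto 0);
    read_async : out unsigned(7 downto 0);  -- zero latency read
    read_sync  : out unsigned(7 downto 0)   -- one clock latency read
  );
end entity read_latency_demo;

architecture rtl of read_latency_demo is
  type mem_t is array (0 to 15) of unsigned(7 downto 0);
  signal mem : mem_t := (others => (others => '0'));
begin
  -- Zero latency: Read_Data <= Mem(Address), no clock involved.
  -- This style typically ends up in LUT memory rather than block RAM.
  read_async <= mem(to_integer(address));

  process (clk)
  begin
    if rising_edge(clk) then
      -- One clock latency: the read data is registered on the clock edge.
      read_sync <= mem(to_integer(address));
      if we = '1' then
        mem(to_integer(address)) <= wdata;
      end if;
    end if;
  end process;
end architecture rtl;
```

With read_sync, the write address needs one matching register in its path; with read_async it needs none.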

ads-ee · Apr 14, 2016

You know drawing a timing diagram would have made this very easy to see (though I'm feeling too lazy to do it today). The drawing in post #1 only has a read before write requirement if the address persists for more than one clock cycle. As the write address is delayed it doesn't overlap the read address unless the read address persists for two or more clock cycles. As your figure suggests always read and always write, you could still end up with a read before write requirement as two consecutive inputs could be binned in the same location (resulting in two updates to the same address).

shaiko · Apr 14, 2016

Thanks for the elaborate explanation.
I followed:

- One clock latency: As you've drawn your figure implemented but using synchronous memory (i.e. Read_Data <= Mem(Address) after rising_edge(Clock)). Using current FPGAs, this would likely be the preferred solution since the FPGA could use either block RAM or LUTs and registers as the fitter so chooses and the clock cycle performance of this subsystem would be higher than the zero latency approach. The drawback is the additional registers to implement the address delay.

And came up with the following code (please note that in this test I'm using the same incoming value addra <= "0000" )

Code:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tb_ram is
end entity tb_ram;

architecture simulation_tb_ram of tb_ram is

component blk_mem_gen_0 is

port 
(
	clka 	: in std_logic ;
	wea 	: in std_logic_vector ( 0 downto 0 ) ;
	addra 	: in std_logic_vector ( 3 downto 0 ) ;
	dina 	: in std_logic_vector ( 3 downto 0 ) ;
	douta 	: out std_logic_vector ( 3 downto 0 ) ;
	clkb 	: in std_logic ;
	web 	: in std_logic_vector ( 0 downto 0 ) ;
	addrb 	: in std_logic_vector ( 3 downto 0 ) ;
	dinb 	: in std_logic_vector ( 3 downto 0 ) ;
	doutb 	: out std_logic_vector( 3 downto 0 )
) ;
end component blk_mem_gen_0 ;

signal clk 		: std_logic := '0' ;
signal wea 		: std_logic_vector ( 0 downto 0 ) ;
signal addra 	: std_logic_vector ( 3 downto 0 ) ;
signal dina 	: std_logic_vector ( 3 downto 0 ) ;
signal douta 	: std_logic_vector ( 3 downto 0 ) ;
signal web 		: std_logic_vector ( 0 downto 0 ) ;
signal addrb 	: std_logic_vector ( 3 downto 0 ) ;
signal dinb 	: std_logic_vector ( 3 downto 0 ) ;
signal doutb 	: std_logic_vector( 3 downto 0 ) ;

begin

	clk 	<= not clk after 5 ns ;
	wea 	<= "0" ;	
	addra   <= "0000" ;
	dina    <= ( others => '0' ) ;
	web 	<= "1" ;
	
	process ( clk ) is
	begin 
		if rising_edge ( clk ) then
			addrb <= addra ;
		end if ;
	end process ;
		
	dinb <= std_logic_vector ( unsigned ( douta ) + 1 ) ;

	waveform : blk_mem_gen_0 
	port map
	(
		clka	=>	clk ,	
		wea 	=> 	wea , 	
		addra 	=> 	addra ,
		dina 	=> 	dina , 
		douta 	=> 	douta , 
		clkb 	=> 	clk ,
		web 	=> 	web ,
		addrb 	=> 	addrb ,
		dinb 	=> 	dinb , 
		doutb 	=> 	doutb
	) ;

end architecture simulation_tb_ram ;

I also configured the memory as "read first".
And ran the simulation.
The histogram calculation is incorrect.

ads-ee · Apr 14, 2016

Not sure what you think is wrong...

The data being output on douta (the read side) is being incremented each double clock cycle as you've indicated. It would be a better test to alternate between at least two addresses to see if there are any effects from latency that aren't properly accounted for.

shaiko · Apr 14, 2016

The data being output on douta (the read side) is being incremented each double clock cycle.

But this means that we miss half of the histogram values...
If during N clock ticks - value X is driven into our system all the time, I want the corresponding memory address of X to be equal to N - not N/2.

ads-ee · Apr 14, 2016

shaiko said:
But this means that we miss half of the histogram values...

So it's supposed to be counting twice each two clock cycle?

shaiko · Apr 14, 2016

So it's supposed to be counting twice each two clock cycle?

Yes. It's a histogram - it's supposed to count the number of times a certain value was driven into the system...
If the number of clock ticks equals N - and value X is the only value that we had at the gates during these N clock ticks, then I want the cell of address X to equal N.
And it doesn't...
What am I doing wrong?

TrickyDicky · Apr 14, 2016

I have done histogramming before, but because the 2nd port was connected to a CPU, your method was not an option. The only option for the full pel rate histograms was to use 2x pixel clock on port A and do a read-modify-write operation. But this meant the CPU had easy access to the histograms.
For anything at 1/2 rate and below, you can just use the normal clock using the same operation.

What your method doesn't catch is the case of 2 consecutive values at the same level. The two reads occur before the write occurs. In this case you need to have a mux input to your adder:
if current addr = prev addr (first instance of) then +2, else +1.
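
A rough sketch of the 2x-clock read-modify-write idea mentioned above. Everything here is an assumption for illustration: the entity name hist_rmw_2x, the port names, the behavioural memory, a CPU port running on its own clock, and exactly one pixel arriving per two clk2x cycles, held stable during the read phase.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: port A does read then write at twice the pixel
-- rate, port B is left free for a CPU readout.
entity hist_rmw_2x is
  port (
    clk2x    : in  std_logic;              -- 2x pixel clock
    pixel    : in  unsigned(3 downto 0);   -- bin index, one per two clk2x cycles
    cpu_clk  : in  std_logic;
    cpu_addr : in  unsigned(3 downto 0);
    cpu_data : out unsigned(7 downto 0)
  );
end entity hist_rmw_2x;

architecture rtl of hist_rmw_2x is
  type mem_t is array (0 to 15) of unsigned(7 downto 0);
  signal mem    : mem_t := (others => (others => '0'));
  signal phase  : std_logic := '0';
  signal addr_q : unsigned(3 downto 0) := (others => '0');
  signal rd_q   : unsigned(7 downto 0) := (others => '0');
begin
  port_a : process (clk2x)
  begin
    if rising_edge(clk2x) then
      if phase = '0' then
        -- Read half: fetch the current count for this pixel value.
        addr_q <= pixel;
        rd_q   <= mem(to_integer(pixel));
      else
        -- Write half: store the incremented count at the same address.
        mem(to_integer(addr_q)) <= rd_q + 1;
      end if;
      phase <= not phase;
    end if;
  end process port_a;

  port_b : process (cpu_clk)
  begin
    if rising_edge(cpu_clk) then
      cpu_data <= mem(to_integer(cpu_addr));
    end if;
  end process port_b;
end architecture rtl;
```

Since each write completes before the next read starts, consecutive equal pixel values need no special handling in this arrangement.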

shaiko · Apr 14, 2016

I have done histogramming before, but because the 2nd port was connected to a CPU, your method was not an option. The only option for the full pel rate histograms was to use 2x pixel clock on port A and do a read-modify-write operation.

I implemented the above successfully when I needed the read port decoupled from the write - but as you mentioned the downside is higher clock rates and a PLL.

For anything at 1/2 rate and below, you can just use the normal clock using the same operation.

Not my case - my incoming data rate = FPGA frequency.

In this case you need to have a mux input to your adder

You mean - like this?

Code:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tb_ram is
end entity tb_ram;

architecture simulation_tb_ram of tb_ram is

component blk_mem_gen_0 is

port 
(
	clka 	: in std_logic ;
	wea 	: in std_logic_vector ( 0 downto 0 ) ;
	addra 	: in std_logic_vector ( 3 downto 0 ) ;
	dina 	: in std_logic_vector ( 3 downto 0 ) ;
	douta 	: out std_logic_vector ( 3 downto 0 ) ;
	clkb 	: in std_logic ;
	web 	: in std_logic_vector ( 0 downto 0 ) ;
	addrb 	: in std_logic_vector ( 3 downto 0 ) ;
	dinb 	: in std_logic_vector ( 3 downto 0 ) ;
	doutb 	: out std_logic_vector( 3 downto 0 )
) ;
end component blk_mem_gen_0 ;

signal clk 		: std_logic := '0' ;
signal wea 		: std_logic_vector ( 0 downto 0 ) ;
signal addra 	: std_logic_vector ( 3 downto 0 ) ;
signal dina 	: std_logic_vector ( 3 downto 0 ) ;
signal douta 	: std_logic_vector ( 3 downto 0 ) ;
signal web 		: std_logic_vector ( 0 downto 0 ) ;
signal addrb 	: std_logic_vector ( 3 downto 0 ) ;
signal dinb 	: std_logic_vector ( 3 downto 0 ) ;
signal doutb 	: std_logic_vector( 3 downto 0 ) ;

begin

	clk 	<= not clk after 5 ns ;
	wea 	<= "0" ;	
	addra   <= "0000" ;
	dina    <= ( others => '0' ) ;
	web 	<= "1" ;
	
	process ( clk ) is
	begin 
		if rising_edge ( clk ) then
			addrb <= addra ;
		end if ;
	end process ;
		
	dinb <= std_logic_vector ( unsigned ( douta ) + 2 ) when  addrb = addra else std_logic_vector ( unsigned ( douta ) + 1 ) ;

	waveform : blk_mem_gen_0 
	port map
	(
		clka	=>	clk ,	
		wea 	=> 	wea , 	
		addra 	=> 	addra ,
		dina 	=> 	dina , 
		douta 	=> 	douta , 
		clkb 	=> 	clk ,
		web 	=> 	web ,
		addrb 	=> 	addrb ,
		dinb 	=> 	dinb , 
		doutb 	=> 	doutb
	) ;

end architecture simulation_tb_ram ;

K-J · Apr 14, 2016

shaiko said:
The histogram calculation is incorrect.

The problem is because the compensation required to accommodate the memory latency has not been applied. When there is memory latency, the +1 operation is not correct for the case where Value(t) = Value(t+1) (i.e. the input is constant) since it takes a clock cycle to compute the updated value to write back into the memory. So you'll need to add logic to handle that particular case. Something of the form:

dinb <= std_logic_vector ( unsigned ( douta ) + 2 ) when (addra = addrb) else std_logic_vector ( unsigned ( douta ) + 1 ) ;

I don't know if the above is exactly correct, but it gets across the point that because of the memory latency, you need to adjust the computation to compensate for operating on stale data. The statement that you have for computing dinb is only correct if there is zero clock cycle memory latency.

Kevin Jennings
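
One way to check the compensation idea is a small self-checking simulation. The sketch below is not the code from the thread: it swaps the generated memory core for a behavioural read-first memory, and the names hist_core, tb_hist, valid, addrb_q and web_q are hypothetical. It also compares the write address with the previous write address rather than with addra, because the read is stale exactly when the previous sample hit the same bin. Comparing addrb with addra looks at the next sample instead, and under the read-first behaviour modelled here a run of exactly two equal samples would then end up counted once.

```vhdl
library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all;

-- Hypothetical histogram core with stale-data compensation.
entity hist_core is
  port (
    clk   : in  std_logic;
    valid : in  std_logic;               -- '1' when value carries a sample
    value : in  unsigned(3 downto 0);    -- bin index
    raddr : in  unsigned(3 downto 0);    -- debug readout address
    count : out unsigned(7 downto 0)     -- debug readout, simulation only
  );
end entity hist_core;

architecture rtl of hist_core is
  type mem_t is array (0 to 15) of unsigned(7 downto 0);
  signal mem            : mem_t := (others => (others => '0'));
  signal douta, dinb    : unsigned(7 downto 0) := (others => '0');
  signal addrb, addrb_q : unsigned(3 downto 0) := (others => '0');
  signal web, web_q     : std_logic := '0';
begin
  -- douta misses one increment when the previous write hit the same bin.
  dinb <= douta + 2 when (web_q = '1' and addrb_q = addrb) else douta + 1;

  process (clk)
  begin
    if rising_edge(clk) then
      douta   <= mem(to_integer(value));   -- read side, sees old contents
      addrb   <= value;                    -- write address, one clock later
      web     <= valid;
      addrb_q <= addrb;                    -- previous write address
      web_q   <= web;
      if web = '1' then
        mem(to_integer(addrb)) <= dinb;    -- write side
      end if;
    end if;
  end process;

  count <= mem(to_integer(raddr));
end architecture rtl;

library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all;

entity tb_hist is
end entity tb_hist;

architecture sim of tb_hist is
  signal clk   : std_logic := '0';
  signal valid : std_logic := '0';
  signal value : unsigned(3 downto 0) := (others => '0');
  signal raddr : unsigned(3 downto 0) := (others => '0');
  signal count : unsigned(7 downto 0);
  signal done  : boolean := false;
begin
  dut : entity work.hist_core
    port map (clk => clk, valid => valid, value => value, raddr => raddr, count => count);

  clock : process
  begin
    while not done loop
      clk <= '0'; wait for 5 ns;
      clk <= '1'; wait for 5 ns;
    end loop;
    wait;
  end process clock;

  stimulus : process
    type stim_t is array (natural range <>) of natural;
    constant samples : stim_t := (3, 3, 3, 5, 3, 5, 5, 7);
  begin
    wait until falling_edge(clk);
    for i in samples'range loop
      value <= to_unsigned(samples(i), 4);
      valid <= '1';
      wait until falling_edge(clk);
    end loop;
    valid <= '0';
    for i in 1 to 3 loop                   -- let the last write land
      wait until falling_edge(clk);
    end loop;
    raddr <= to_unsigned(3, 4); wait for 1 ns;
    assert count = 4 report "bin 3 should hold 4" severity failure;
    raddr <= to_unsigned(5, 4); wait for 1 ns;
    assert count = 3 report "bin 5 should hold 3" severity failure;
    raddr <= to_unsigned(7, 4); wait for 1 ns;
    assert count = 1 report "bin 7 should hold 1" severity failure;
    done <= true;
    wait;
  end process stimulus;
end architecture sim;
```

The stimulus mixes runs of one, two and three equal values, which is closer to ads-ee's suggestion of alternating addresses than a constant addra. Whether a real block RAM returns the old data on the read port in this collision case depends on the device and the configured write mode, so treat the model as an assumption to verify.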

std_match · Apr 15, 2016

I don't see a big problem with reading the data from a CPU or similar. Overclocking the RAM has already been mentioned. Another option is to connect the write port of a second RAM in parallel with the existing RAM. Both RAMs will have the same content and the read port of the second RAM is free to use for reading out the histogram data.
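
A minimal sketch of the second-RAM idea, assuming behavioural memories and a CPU read port on the same clock. The entity name hist_mirror and all port names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: two memories share one write port, so they
-- always hold the same contents, and each has its own read port.
entity hist_mirror is
  port (
    clk        : in  std_logic;
    we         : in  std_logic;
    waddr      : in  unsigned(3 downto 0);   -- shared write address
    wdata      : in  unsigned(7 downto 0);   -- shared write data
    hist_raddr : in  unsigned(3 downto 0);   -- read port for the increment logic
    hist_rdata : out unsigned(7 downto 0);
    cpu_raddr  : in  unsigned(3 downto 0);   -- read port for the CPU
    cpu_rdata  : out unsigned(7 downto 0)
  );
end entity hist_mirror;

architecture rtl of hist_mirror is
  type mem_t is array (0 to 15) of unsigned(7 downto 0);
  signal mem_hist : mem_t := (others => (others => '0'));
  signal mem_cpu  : mem_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      hist_rdata <= mem_hist(to_integer(hist_raddr));
      cpu_rdata  <= mem_cpu(to_integer(cpu_raddr));
      if we = '1' then
        -- The same write goes to both memories.
        mem_hist(to_integer(waddr)) <= wdata;
        mem_cpu(to_integer(waddr))  <= wdata;
      end if;
    end if;
  end process;
end architecture rtl;
```

The cost is a second memory of the same size; the gain is that the histogram logic and the readout never compete for a port.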
