Accessing records in VHDL

shaiko · Jul 31, 2015

128 (KJ: Did you mean 0 here?)

Yes, my mistake.

Code:

Next_Symbol <= Next_Symbol + MAX_SYMBOLS[COLOR="#FF0000"]-1[/COLOR]-empty;

I think that "-1" is incorrect.
For example:
At system start, Next_Symbol = 0.
Suppose we get a message with only 1 valid symbol ( I.E, empty = 15 ).
With the "-1" in place, Next_Symbol would remain 0 ( when it should increment to 1 ).
I think it should be simply:

Code:

Next_Symbol <= Next_Symbol + MAX_SYMBOLS - empty;

Am I right?

Next, I suppose you agree that what I wrote in #18 would also do the work.
If so, how would you compare it to what you wrote at #20?
What are the pros and cons of each approach ?

vGoodtimes · Jul 31, 2015

You should look into creating a smaller packer with 256 (ish) registers, where a variable amount of data is shifted in per cycle and then a 128b word is selected from it. That 128b can feed a 128b wide BRAM. (probably 2-4 primitive BRAM elements). This uses far fewer resources, and should use all of them efficiently. 12,800 registers + muxes is significant on many FPGAs. 256 registers + muxes + 4 BRAM is only significant on the smallest FPGAs.

This also solves the extra data problem in the streaming case where more data will eventually be available.

shaiko · Jul 31, 2015

You should look into creating a smaller packer with 256 (ish) registers, where a variable amount of data is shifted in per cycle and then a 128b word is selected from it.

Post an example please.

vGoodtimes · Aug 1, 2015

There are a few variations based on latency and timing. Here is one method that is easy to understand. Pipelining methods can be done if needed. I'll give the 2 bit version. There is a 2 bit input with 0 to 2 bits valid, a 2 bit output where either 0 or 2 bits will be valid, and a (2-1 = 1) bit register. There is also a log(2) = 1 bit counter for state. This gives two states -- 0 bit saved, or 1 bit saved.

with 0 bits saved, 1 input bit will be placed in the saved register. with 2 bit input, the output will be the input directly.

with 1 bits saved, 1 input bit will cause the output to be the new input bit and no bits will be in the saved register. 2 input bits will cause the same output, but will place the extra bit into the saved register.

The general case is:
with k bits saved and m input bits (with an input/output width of n):
if (k+m >= n), the k saved bits become the msbs of the output. The the lower k+m-n bits are moved into the msbs of the saved register, and k is set to k+m-n.
else, the m input bits are saved to k:k+m-1 in the saved register (assuming 0 to n-1 ordering), and k is set to k+m.

K-J · Aug 2, 2015

shaiko said:
Yes, my mistake.

Code:

Next_Symbol <= Next_Symbol + MAX_SYMBOLS[COLOR=#ff0000]-1[/COLOR]-empty;

I think that "-1" is incorrect.
For example:
At system start, Next_Symbol = 0.
Suppose we get a message with only 1 valid symbol ( I.E, empty = 15 ).
With the "-1" in place, Next_Symbol would remain 0 ( when it should increment to 1 ).
I think it should be simply:

Code:

Next_Symbol <= Next_Symbol + MAX_SYMBOLS - empty;

Am I right?

Actually, it should be

Code:

Next_Symbol <= (Next_Symbol - empty) mod MAX_SYMBOLS;

If MAX_SYMBOLS is not a power of 2, then you might want to implement it as you noted.

Next, I suppose you agree that what I wrote in #18 would also do the work.

I didn't simulate yours or mine, so I don't know.

Code:

If so, how would you compare it to what you wrote at #20? 
What are the pros and cons of each approach ?

I don't see any advantage to preventing the updating of symbols that have not already been received. Your code only updates the number of bytes based on 'current_valid_symbols', but I'm guessing that it is really a don't care situation. The downside with your approach then would be the data path timing should be longer since it will depend on the calculation of 'current_valid_symbols'.

As an example, the key question then is that does it really matter if bytes 2 and 3 in the bigger message get updated along with bytes 0 and 1 which are the only ones that are 'valid'. Presumably a partial message comes later that does update bytes 2 and 3 with valid data. Until the full 'bigger message' is fully assembled, it shouldn't matter what all of the bytes are. If it does matter, then you have a higher level issue that should be addressed.

Kevin Jennings

K-J · Aug 3, 2015

K-J said:
Actually, it should be

Code:

Next_Symbol <= (Next_Symbol - empty) mod MAX_SYMBOLS;

If MAX_SYMBOLS is not a power of 2, then you might want to implement it as you noted.

This part is not correct.

Kevin

shaiko · Aug 3, 2015

Bad news,
Compilation fails - with the following error:
https://www.altera.com/support/support-resources/knowledge-base/solutions/rd12122013_218.html

Code:

library ieee ;
use ieee.std_logic_1164.all ;
use ieee.numeric_std.all ;
use ieee.math_real.all ;
	
entity message_assembler is 
port						
( 		
IN_CLOCK : in std_logic ;	
IN_RESET : in std_logic ;	
IN_VALID : in std_logic ;
IN_EMPTY : in unsigned ( 3 downto 0 ) ;
IN_DATA : in std_logic_vector ( 127 downto 0 ) ; -- Global clock.

OUT_DATA : out std_logic_vector ( 12799 downto 0 ) 
) ;	 	 			           
end entity message_assembler ;

architecture synthesizable_message_assembler of message_assembler is

signal whole_message : std_logic_vector ( 12799 downto 0 ) ;
signal number_of_valid_symbols : unsigned ( positive ( ceil ( log2 ( real ( 16 + 1 ) ) ) ) - 1 downto 0 ) ; 
signal pointer : unsigned ( positive ( ceil ( log2 ( real ( 12799 + 1 ) ) ) ) - 1 downto 0 ) ; 

begin
	
OUT_DATA <= whole_message ;

number_of_valid_symbols <= 16 - resize ( IN_EMPTY , 5 ) ;
	
process ( IN_CLOCK , IN_RESET ) is
begin 
  if IN_RESET = '1' then
    whole_message <= ( others => '0' ) ;
  elsif rising_edge ( IN_CLOCK ) then
    if IN_VALID = '1' then
      whole_message ( ( to_integer ( pointer ) + to_integer ( number_of_valid_symbols ) * 8 - 1 ) downto to_integer ( number_of_valid_symbols ) ) <= IN_DATA ( to_integer ( number_of_valid_symbols ) * 8 - 1 downto 0 ) ;
      pointer <= pointer + to_unsigned ( to_integer ( number_of_valid_symbols ) * 8 , pointer ' length ) ;
    end if ;
  end if ;	
end process ;
	
end architecture synthesizable_message_assembler ;

TrickyDicky · Aug 3, 2015

As it explains in the message you posted - its a quartus bug. Not a VHDL problem.

shaiko · Aug 3, 2015

As it explains in the message you posted - its a quartus bug. Not a VHDL problem.

Yeah, and it persists in v14 as well.
The error message is shown pretty quickly, yet compilation is stuck at 10% forever.
Windows 7 "Task Manager" reveals that when compiling this code - the Quartus process uses ~7.5GB (!) of RAM and the PCs fan is working like a hairdryer - killing it took around 10 minutes...

Time to get my fire extinguisher - Kevin's code is next.

std_match · Aug 3, 2015

What is the speed requirement? It seems to be a suitable task for a state machine. I guess that a state machine can't handle the Avalon transfers directly, so put the 128-bit data and the empty value into a FIFO and let the state machine process it from there. This assumes that data comes in bursts, and that the average rate is not too high.

The state machine can implement an 8-bit-wide shift register with parallell load from the 128-bit FIFO. No muxes are needed.
The shift register can be 8*16 circular or 8*32 linear depending on if one Avalon cycle can generate data that should go into two different output words (wrapping) or not.
A 8*16 circular shift register will need byte-wise strobe signals for the parallel load.

With extra logic complexity it can be possible to use an 8*16 circular shift register even if wrapping is possible.
The 8*32 linear shift register should have the simplest logic regardless of the wrapping issue.

If you really want a single-cycle solution, check if single-bit assignments (without "to" or "downto") are also affected by the Quartus bug.
If not, you can try to slice the assignments as I mentioned here:
https://www.edaboard.com/threads/341735/#post1459151

shaiko · Aug 3, 2015

Thanks.

The shift register can be 8*16 circular or 8*32 linear

What do you mean by serial and linear shift registers in this case?
Please explain in code.

std_match · Aug 3, 2015

shaiko said:
What do you mean by serial and linear shift registers in this case?
Please explain in code.

Circular pseudo-code.
Parallel loading in one clock cycle:

Code:

shift_register_16_bytes(0 to 15); -- downto will also work
in_data_bits(127 downto 0);

-- The parallel load, only load the valid bytes. Always loading to the same bits means no muxex.
  shift_register_16_bytes(0) <= in_data_bits(7 downto 0); -- Always load the lowest byte, empty can not be 16 ?

  -- The following can be a loop, unrolled for clarity

  if in_empty < 15 then
    shift_register_16_bytes(1) <= in_data_bits(15 downto 8);
  end if;

  if in_empty < 14 then
    shift_register_16_bytes(2) <= in_data_bits(23 downto 16);
  end if;

  if in_empty < 13 then
    shift_register_16_bytes(3) <= in_data_bits(31 downto 24);
  end if;

  .....

  if in_empty < 1 then
    shift_register_16_bytes(15) <= in_data_bits(127 downto 120);
  end if;

Circular shift (16-empty) times, multi-cycle by a state machine. Count the number of bytes. One shift:

Code:

    for i in 0 to 14 loop
      shift_register_16_bytes(i) <= shift_register_16_bytes(i + 1); -- No muxes needed to shift one step
    end hdl loop
    shift_register_16_bytes(15) <= shift_register_16_bytes(0); -- This is the "circular" shift

-- When 16 bytes has been loaded and shifted, the shift_register_16_bytes array contains one complete output word (128 bits = 16 bytes).
-- No shifting is needed if all 16 bytes are received at once, that special case can be optimized for speed

Linear pseudo-code. Can handle wrapping input data.
Parallel loading in one clock cycle:

Code:

shift_register_32_bytes(0 to 31)  -- downto will also work. We load to (16 to 31) and get the output word in (0 to 15),
-- It will work even if one Avalon transfer wraps over two output words.
in_data_bits(127 downto 0);

-- Always load 128 bits. Always loading to the same bits means no muxex.
  for i in 0 to 15 loop
    shift_register_32_bytes(i+16) <= in_data_bits(i*8+7 downto i*8);
  end loop

Linear shift (16-empty) times, multi-cycle by a state machine. Count the number of bytes. One shift:

Code:

    for i in 0 to 30 loop
      shift_register_32_bytes(i) <= shift_register_32_bytes(i + 1); -- No muxes needed to shift one step
    end hdl loop
    -- The data in shift_register_32_bytes(0) is thrown away, but doesn't contain anything

-- When 16 bytes has been loaded and shifted, shift_register_32_bytes(0 to 15) contains one complete output word (128 bits = 16 bytes).

shaiko · Aug 5, 2015

This is how I designed it - looks pretty simple in the RTL viewer and synthesizes very quickly.

Code:

library ieee ;
use ieee.std_logic_1164.all ;
use ieee.numeric_std.all ;
use ieee.math_real.all ;
	
entity frame_assembler is 

generic 
(
   messages_per_frame : positive := 3 ;	
   bits_per_symbol : positive := 2 ;
   symbols_per_message : positive := 2 
) ;

port						
( 		
   IN_CLOCK : in std_logic ;	
   IN_RESET : in std_logic ;	
   IN_VALID : in std_logic ;
   IN_EMPTY : in unsigned ( positive ( ceil ( log2 ( real ( symbols_per_message + 1 ) ) ) ) - 1 downto 0 ) ; 
   IN_DATA : in std_logic_vector ( bits_per_symbol * symbols_per_message - 1 downto 0 ) ;
   OUT_DATA : out	std_logic_vector ( bits_per_symbol * symbols_per_message * messages_per_frame - 1 downto 0 ) 
) ;	 	 			           
end entity frame_assembler ;

architecture synthesizable_frame_assembler of frame_assembler is


constant bits_per_message : positive := bits_per_symbol * symbols_per_message ;
constant bits_per_frame : positive := bits_per_message * messages_per_frame ;

signal number_of_valid_symbols : unsigned ( positive ( ceil ( log2 ( real ( symbols_per_message + 1 + 1 ) ) ) ) - 1 downto 0 ) ; 
signal number_of_valid_bits : unsigned ( positive ( ceil ( log2 ( real ( IN_DATA ' length + 1 ) ) ) ) - 1 downto 0 ) ; 
signal current_bit : unsigned ( positive ( ceil ( log2 ( real ( OUT_DATA ' length + 1 ) ) ) ) - 1 downto 0 ) ; 
signal next_bit : unsigned ( positive ( ceil ( log2 ( real ( OUT_DATA ' length + 1 ) ) ) ) - 1 downto 0 ) ; 

signal long_data : std_logic_vector ( OUT_DATA ' range ) ; 

begin

  -- long_data is simply "IN_DATA" copied n times ( n = messages_per_frame	 ).
  generating_long_data : for index in 0 to messages_per_frame - 1
  generate 
  long_data ( ( bits_per_message * index + bits_per_message ) - 1 downto index * bits_per_message ) <= IN_DATA ;
  end generate generating_long_data ;
	
  number_of_valid_symbols <= symbols_per_message - resize ( IN_EMPTY , number_of_valid_symbols ' length ) ;
  number_of_valid_bits <= resize ( bits_per_symbol * number_of_valid_symbols , number_of_valid_bits ' length) ;
  next_bit <= current_bit + resize ( number_of_valid_bits , current_bit ' length ) ;
	
  process ( IN_CLOCK , IN_RESET ) is
  begin
    if IN_RESET = '1' then 
      current_bit <= ( others => '0' ) ;
      OUT_DATA <= ( others => '0' ) ;
    elsif rising_edge ( IN_CLOCK ) then
      if IN_VALID = '1' then
        current_bit <= next_bit ; 
        for index in 0 to ( bits_per_frame - 1 ) loop
          if index >= current_bit then -- Update only those bits that are left of the currenct pointer position ( don't touch the already updated bits ). 
            OUT_DATA ( index ) <= long_data ( index ) ;
         end if ;
       end loop ;
     end if ;	
    end if ;	
  end process ;
	
end architecture synthesizable_frame_assembler

;

std_match · Aug 5, 2015

I don't think your code is consistent with the requirements you have posted earlier.

Each symbol in OUT_DATA can only come from a certain symbol position in "IN_DATA". As "IN_EMPTY" is described earlier, each symbol in OUT_DATA could come from any symbol position in "IN_DATA".

The default values for the generics are much lower than the numbers you have mentioned earlier. I think this design will explode (= not "pretty simple") if you correct the bugs and use the earlier numbers (IN_DATA = 16 symbols * 8 bits).

What is "messages_per_frame"?

shaiko · Aug 5, 2015

The default values for the generics are much lower than the numbers you have mentioned earlier. I think this design will explode (= not "pretty simple") if you correct the bugs and use the earlier numbers (IN_DATA = 16 symbols * 8 bits).

I scaled it down on purpose to make the RTL view more readable.
Even with the nominal values in place the design still synthesizes successfully - around x5 slower though.
I should note however that the target device is by all means HUGE.

What is "messages_per_frame"?

That's the number of full IN_DATA messages in one frame.
For example: if a full "IN_DATA" message is 128 bits and the total frame size is 12800 bits then messages_per_frame = 12800 / 128 = 100.

Each symbol in OUT_DATA can only come from a certain symbol position in "IN_DATA". As "IN_EMPTY" is described earlier, each symbol in OUT_DATA could come from any symbol position in "IN_DATA".

You're correct - indeed a bug...thanks for pointing it out!
Can you suggest a code that fixes this issue while maintaining the loop approach ( no muxes ) ?

std_match · Aug 6, 2015

shaiko said:
std_match said:

What is "messages_per_frame"?

Click to expand...

That's the number of full IN_DATA messages in one frame.
For example: if a full "IN_DATA" message is 128 bits and the total frame size is 12800 bits then messages_per_frame = 12800 / 128 = 100.

std_match said:

Each symbol in OUT_DATA can only come from a certain symbol position in "IN_DATA". As "IN_EMPTY" is described earlier, each symbol in OUT_DATA could come from any symbol position in "IN_DATA".

Click to expand...

You're correct - indeed a bug...thanks for pointing it out!
Can you suggest a code that fixes this issue while maintaining the loop approach ( no muxes ) ?

If you want to do it in one clock cycle you must use muxes and that will get you into trouble if OUT_DATA is 1600 symbols = 12800 bits. Do you really want to deliver so many bits in parallel? What happens to the data in the next step?

If it is OK to do it in multiple clock cycles I recommend the "linear shift register" I proposed earlier. You will then get rid of all the muxes and all the strobe comparators. The shift register should have a length so that one IN_DATA + one OUT_DATA fits. You load all bits into the IN_DATA part and shift into the OUT_DATA part. If you can introduce wait states to the previous logic you may not need a FIFO. The shift register state machine should be possible to clock at a very high rate, several hundred MHz, depending on the FPGA family. It will take up to 16 such clock cycles to process one incoming 128-bit word.

K-J · Aug 6, 2015

shaiko said:
Can you suggest a code that fixes this issue while maintaining the loop approach ( no muxes ) ?

Yeah...the seven lines of code that I posted in #20 which are two nested for loops that computes 'My_Big_Message'

Kevin

vGoodtimes · Aug 6, 2015

K-J said:
Yeah...the seven lines of code that I posted in #20 which are two nested for loops that computes 'My_Big_Message'

Kevin

You have an if statement within the for loop that will synthesize into a mux.

shaiko · Aug 6, 2015

Yeah...the seven lines of code that I posted in #20 which are two nested for loops that computes 'My_Big_Message'

I thought about it - and it didn't make sense...

Code:

for [COLOR="#FF0000"]i[/COLOR] in 0 to (MSGS_PER_BIG_MSG-1) loop
   if (Next_Symbol >= [COLOR="#FF0000"]i[/COLOR]) then -- KJ:  Essentially creates the enable signal here
      for [COLOR="#FF0000"]i[/COLOR] in Data_In'range loop
         My_Big_Message(BITS_PER_SYMBOL * [COLOR="#FF0000"]i[/COLOR]) <= Data_In([COLOR="#FF0000"]i[/COLOR]);
      end loop;
   end if;
end loop;

Why do you use the same iteration index ( i ) in both loops ?
Is this a typo?

K-J · Aug 6, 2015

vGoodtimes said:
You have an if statement within the for loop that will synthesize into a mux.

No, the 'if' statement will synthesize into a clock enable, it's not muxing together any of the input bits, which is exactly what I said in the comment line with the posted code. The exact same thing will happen with the shift register approach.

Kevin Jennings

- - - Updated - - -

shaiko said:
Why do you use the same iteration index ( i ) in both loops ?
Is this a typo?

As I said in the post, the code has errors, but is to demonstrate the approach which is that you loop through all of the possible places where the input can end up (i.e. the 'messages per big message') and as long as the starting place for the next symbol is above that loop index, then you store the data. By not relying on the 'empty' signal to compute the data, it will result in simpler and potentially faster logic. The 'empty' signal is only used to update where the next symbol will be stored, which is a different logic path that terminates in the flip flops that compute the next symbol.

Code:

for j in Data_In'range loop
    My_Big_Message(BITS_PER_SYMBOL * i + j) <= Data_In(j);

Kevin Jennings

Accessing records in VHDL

Advanced Member level 5

Advanced Member level 4

Advanced Member level 5

Advanced Member level 4

Advanced Member level 2

Advanced Member level 2

Advanced Member level 5

Advanced Member level 7

Advanced Member level 5

Advanced Member level 4

Advanced Member level 5

Advanced Member level 4

Advanced Member level 5

Advanced Member level 4

Advanced Member level 5

Advanced Member level 4

Advanced Member level 2

Advanced Member level 4

Advanced Member level 5

Advanced Member level 2

Similar threads

Privacy & Transparency

Privacy & Transparency