Continue to Site

Welcome to EDAboard.com

Welcome to our site! EDAboard.com is an international Electronics Discussion Forum focused on EDA software, circuits, schematics, books, theory, papers, asic, pld, 8051, DSP, Network, RF, Analog Design, PCB, Service Manuals... and a whole lot more! To participate you need to register. Registration is free. Click here to register now.

[SOLVED] how to locate the addresses of 1's in std_logic_vector

Status
Not open for further replies.

rafimiet

Member level 5
Joined
May 21, 2015
Messages
94
Helped
1
Reputation
2
Reaction score
1
Trophy points
1,288
Activity points
2,364
I have a std_logic_vector signal, namely A of 2048 bits, 2047 downto 0, I want to locate the positions (addresses) of '1' bits in the vector. These addresses need to be stored into an array B in ascending order i.e. address 0 should contain the address of first 1 encountered as we go 0 to 2047 of signal A. But this whole process should take me as minimum clock cycles as possible.
 

One solution is to have a shift register that allows 4 (or more) bits to be examined per cycle. this allows 4 compares/writes per cycle, finishing the input in 512 cycles in all cases. 4 is chosen as each BRAM in modern FPGAs has 2 ports each and this probably uses 2 BRAM, so 4 writes per cycle.

If average case performance is needed, and the input is sparse, other methods that are more aggressive can be used.
 
One solution is to have a shift register that allows 4 (or more) bits to be examined per cycle. this allows 4 compares/writes per cycle, finishing the input in 512 cycles in all cases. 4 is chosen as each BRAM in modern FPGAs has 2 ports each and this probably uses 2 BRAM, so 4 writes per cycle.

If average case performance is needed, and the input is sparse, other methods that are more aggressive can be used.

While your solution probably works nicely at some very high clock rate, The OP originally asked for as few clocks as possible.
It would be quite easy just to do it in a single clock cycle. But with a very very slow clock (maybe 2MHz if you're lucky), and probably slower with much higher resources than your suggestion, but it didnt meet the OPs requirements.

Code:
process(clk)
  variable b_addr : natural;
begin
  if rising_edge(clk) then
    b_addr := 0;
    for i in 0 to 2047 loop
      if ip(i) = '1' then
        addresses(b_addr) <= i;
        b_addr := b_addr + 1;
      end if;
    end loop;
  end if;
end process;

Coded in a single clock, as the OP asked.
 
@TrickyDicky:
Every time you make a post about HW design not being the same as SW design, I will post a link to this thread.
 

TrickyDicky... It would be quite easy just to do it in a single clock cycle. But with a very very slow clock (maybe 2MHz if you're lucky), and probably slower with much higher resources than your suggestion, but it didnt meet the OPs requirements.
I tried to implement a small code as per your recommendations
Code:
----------------------------------------------------------------------------------
-- Company: 
-- Engineer: 
-- 
-- Create Date:    23:11:28 09/09/2017 
-- Design Name: 
-- Module Name:    clk_check - Behavioral 
-- Project Name: 
-- Target Devices: 
-- Tool versions: 
-- Description: 
--
-- Dependencies: 
--
-- Revision: 
-- Revision 0.01 - File Created
-- Additional Comments: 
--
----------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Uncomment the following library declaration if using
-- arithmetic functions with Signed or Unsigned values
use IEEE.NUMERIC_STD.ALL;

-- Uncomment the following library declaration if instantiating
-- any Xilinx primitives in this code.
--library UNISIM;
--use UNISIM.VComponents.all;

entity clk_check is
    Port ( a : in  STD_LOGIC_VECTOR (3 downto 0);
           clk : in  STD_LOGIC;
           y : out  STD_LOGIC_VECTOR (3 downto 0));
end clk_check;

architecture Behavioral of clk_check is
signal ip : STD_LOGIC_VECTOR (15 downto 0):= (OTHERS => '0');
TYPE ram_type IS ARRAY (15 DOWNTO 0) OF STD_LOGIC_VECTOR (3 downto 0);
SIGNAL addresses: ram_type;
begin
process(clk)
  variable b_addr : INTEGER RANGE 0 TO 15;
  variable ip_temp : STD_LOGIC_VECTOR(15 downto 0):= (OTHERS => '0');
begin
  if rising_edge(clk) then
	ip_temp := ip_temp(11 downto 0) & a;
	ip <= ip_temp;
    b_addr := 0;
    for i in 0 to 15 loop
      if ip(i) = '1' then
        addresses(b_addr) <= std_logic_vector(to_unsigned(i,4));
        b_addr := b_addr + 1;
      end if;
		y <= addresses(b_addr);
    end loop;
	 ip_temp := ip;
  end if;
end process;

end Behavioral;
When I try to synthesis, it slows down my system so much that I can not do anything. This has never been a case for last two years of my VHDL coding. Also the synthesis is never completed. Please let me know, where am I making the mistake?
 

TD's post was (hopefully) sarcasm. A single cycle design would be very large. Larger than any existing FPGA. Possibly your 16 element version is possible.

Do you have any assumptions or requirements or targets for resources, clock rates, latency, and area?

A sane design will take inputs as slow as required and generate outputs as slow as required and store data as little as possible. It might be possible to reduce the size of the input bus -- process fewer elements over more cycles. You might not need an actual array for the output if the outputs can be processed as they are generated. A complex algorithm might not matter if the outputs can't be consumed fast enough.
 
When I try to synthesis, it slows down my system so much that I can not do anything. This has never been a case for last two years of my VHDL coding. Also the synthesis is never completed. Please let me know, where am I making the mistake?

Yes. Its not a surprise - I suggested it would be very slow, use a lot of resources and be practically useless. I was pointing out the weakness in your original question - my answer met your design spec.
What are your constraints? what device are you using? what clocks are available? what have you attempted so far (other than this terrible design) and what issues need fixing?
Without proper specifications, it is all to easy to miss the real problem.
 

@TrickyDicky... What are your constraints? what device are you using? what clocks are available? what have you attempted so far (other than this terrible design) and what issues need fixing?
I have a vector A(256*256-1 downto 0) that gives me the address to a block of pixels in an image. Starting from LSB to the MSB, when I encounter a '1' bit, I have to move to the next state in the Finite State Machine to process that block of pixels. If I take one bit at a time, it means I have to use 256*256 clock cycles. Which can be even more clock cycles for bigger images (eg. 512*512). Also I know that initially, we will be having only a small number of bits = '1'. With more iterations, the number of '1' bits will increase. All I have to do is to locate the next '1' bit and skip '0' bits in between. Can we do that? Can we locate next '1' bits starting from LSB bit? This can save me a lot of clock cycles.
 

Why do you have to do it in so few clock cycles? where is the image coming from? is it a live video?
I dont really understand the need for such low latency. Usually, if it is a video, a 1 frame latency is usually acceptable. And given the pixel rate even at HD is 75Mhz, you can easilly have an on board clock of 3x the pixel rate to given you plety of cycles for image processing.

So, whats the source? whats the goal? why the need for such a rediculously small latency on such a large image? Why cant it be pipelined?
 
where is the image coming from? is it a live video?
The source is a still/video camera. It will add up to my work if we can process more than 30 frames a second.
whats the goal? why the need for such a rediculously small latency on such a large image?
After acquiring the image, I have to apply 5 levels of Discrete wavelet transform, then around levels for bit-plane coding and then as many levels of decoding and inverse transformation. So it is a full-fledged algorithm. Actually I myself was becoming a bit more cautious for reducing the latency. Because, as I have experienced, it is much easier to code something than debugging the code or trying to improve it once it is completed. I have got all the answers and options I have.
 

In that case, why are you trying to process the entire image in one go? Why not just process it at the pixel level as each pixel arrives? You should easily be able to do this.

Don't think about processing the whole image like a software program. Think of it in terms of pixel processing. Processing video in real time is possible all the way up to 4k images.
 

A priority encoder is an applicable circuit that can work for your application. You can scan blocks of 32 pixels at a time to get the location of the next 1. You can use the bit trick "x <= x & ~(-x & x)" to unset the lsb.

There are different ways to generate a priority encoder. One version sets the encoded lsb to |((-x & x) & 0xAAAAAAAA), the next msb to |((-x & x) & 0xCCCCCCCC), then next to |((-x & x) & 0xF0F0F0F0), next |((-x & x) & 0xFF00FF00), finally |((-x & x) & 0xFFFF0000). This should run at decently high rates on modern fpgas, and could be pipelined if needed. The final circuit used is just a comparison to 0 to determine if the next 32b should be processed.

Using this logic, it is possible to skip 32 (or more) 0's per cycle and it takes at least one cycle per 1 encountered.

You can also have similar logic to skip blocks of 32 even faster. eg, check 8 32b blocks to see if any are non-zero. the result goes to the same encoder structure as above. This can also be pipelined if needed. This allows skipping of multiple 32b blocks per cycle. However, this optimization might not be justified.

( (-x & x) generates a vector where only the lsb is set. eg x = 0x12345678. ~x = 0xEDCBA987. -x = 0xEDCBA988. (-x & x) = 0x00000008. only the lsb remains.)

As mentioned, if you can process pixels as they come in, that would be easier.


(edit, I also shouldn't have to mention that the 256*256 bit image should be in a BRAM and not a large data bus. This works well as you can have 32b outputs from a BRAM. This size requires ~8 BRAM on modern devices. At the same time, you did try to synthesize the clearly troll design so I'll mention this just in case.)
 

Status
Not open for further replies.

Part and Inventory Search

Welcome to EDABoard.com

Sponsor

Back
Top