Counting the position of ones

ganeshmirajkar · May 4, 2011

I want to design a combinatorial circuit for an fpga which will count the position of ones in a 16 bit vector.
eg1 if vector is 16'b0000_0000_0011_0101 then output should be
64'h0000_0000_0000_5420

eg2 if vector is 16'b0000_1010_0000_1010 then output should be
64'h0000_0000_0000_b931.

Please help.
Thanks in advance....

lucbra · May 4, 2011

just browse in this forum, you will find your answer

ganeshmirajkar · May 4, 2011

Hello lucbra.
Thanks for your reply..
I tried searching a lot and i did not found any such solution. Can you please direct me to any such thread.
Thanks

mrflibble · May 4, 2011

Just out of idle curiosity, given your somewhat incomplete description of the specs ... You have a 16 bit input, and I see a 64 bit output. Suppose you used the following input values:

input_a: 16'b1111_1111_1111_1111

input_b: 16'b0000_0000_0000_0000

input_c: 16'b0000_0000_0000_0001

What are the outputs of your circuit for these example inputs?

Also, what is the use case of this circuit? Because all it seems to be doing is for a given input bit, output either 0 or the output index...

ganeshmirajkar · May 4, 2011

for input_a output will be 64'hfedc_ba98_7654_3210

for input_b output will be 64'h0000_0000_0000_0000

For input_c output will be 64'h0000_0000_0000_0000 .. i will qualify the lsb nibble 0 by looking at the input
The whole idea is to get the indexes of ones. Now since the no of ones can max be 16 my output is of 64 bits. I have encoded the index in four bits since the index will go from 0 to 15.
Now in your case the output of input_b and input_c will be same.

mrflibble · May 4, 2011

I understand that the whole idea of getting the indexes of ones is that you then have the indexes of ones. But that still gives me no clue as to what the use of that would be. Getting the index of the first 1, I understand how that would be useful.

Oh wait, fatal assumption on my side... I was assuming until now that this is going to be synthesized.

Is this going to be synthesizable? Or is it just for simulation?

Also, let me rephrase the question... Is this circuit part of a bigger circuit? If yes, could you give a rough description of the bigger circuit.

And finally, if this is just for simulation you could use a function that iterates over all the input bits...

FvM · May 4, 2011

Now in your case the output of input_b and input_c will be same.

Sounds like the specification should be reconsidered?
As far as I remember, there have been examples of a leading one detector HDL design. The present problem is somewaht more complex, I'm not aware of a similar problem from a previous thread.

You're asking for a combinational circuit, so everything has to be calculated in parallel. The obvious way would be writing a behavioral description, using iteration loops to describe the parallelism. Then let your design compiler synthesize the circuit and wait for the resource usage.

And finally, if this is just for simulation you could use a function that iterates over all the input bits...

Why do you think that I won't work for synthesis? The resource usage will be possibly high, but you should get away with may be some 1000 logic cells. The method isn't different from e.g. designing a large parallel divider.

mrflibble · May 4, 2011

FvM said:
Why do you think that I won't work for synthesis? The resource usage will be possibly high, but you should get away with may be some 1000 logic cells. The method isn't different from e.g. designing a large parallel divider.

I don't think any such thing. But I was going over the "what use would such a circuit be?" question, and then realized that as a convenience function during simulation it could make sense. And because as you point out, if he wants to detect all the ones it would result in fairly high resource usage in actual hardware.

So before going any further on that, I thought it would be nice if there were less assumptions and more specifics.

Aside from that, it did make me thing of the "leading ones detector" thread from a while back. But that was for just detecting the first 1.

FvM · May 4, 2011

I tried a behavioral VHDL implementation, it takes 955 LEs of an Altera Cyclone III. The results are according to the example cases, including the ambiguity of input_b and input_c. tpd is about 19 ns.

Code:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.numeric_std.all;

entity ones is
port ( inp : in std_logic_vector(15 downto 0);
       outp : out std_logic_vector(15 downto 0)
         );
end ones;

architecture Behavioral of ones is

begin
process(inp)
  variable first: integer range 0 to 15;
  variable act: integer range 0 to 63;
  variable digit: unsigned (3 downto 0);
begin
   outp <= (others => '0');
   first:=0;
   act:=0;
   for i in 15 downto 0 loop
      first := i;
      exit when inp(i) = '1';
   end loop;  
   for i in 0 to 15 loop
      exit when i > first;
      if inp(i)='1' then
         digit := to_unsigned(i,4);
         for j in 0 to 3 loop
            outp(act*4+j) <= digit(j);
         end loop;   
         act := act + 1;
      end if;
   end loop;  
end process;
end Behavioral;

dcreddy1980 · May 4, 2011

The below code will give a part of the logic required which requires much less logic when comapared to the original representation of the o/p :

module index (a,b);
input [15:0] a;
output [63:0] b;
wire [63:0] b_tmp;
assign b_tmp[3:0] = {4{a[0]}} & 4'b0000;
assign b_tmp[7:4] = {4{a[1]}} & 4'b0001;
assign b_tmp[11:8] = {4{a[2]}} & 4'b0010;
assign b_tmp[15:12] = {4{a[3]}} & 4'b0011;
assign b_tmp[19:16] = {4{a[4]}} & 4'b0100;
assign b_tmp[23:20] = {4{a[5]}} & 4'b0101;
assign b_tmp[27:24] = {4{a[6]}} & 4'b0110;
assign b_tmp[31:28] = {4{a[7]}} & 4'b0111;
assign b_tmp[35:32] = {4{a[8]}} & 4'b1000;
assign b_tmp[39:36] = {4{a[9]}} & 4'b1001;
assign b_tmp[43:40] = {4{a[10]}} & 4'b1010;
assign b_tmp[47:44] = {4{a[11]}} & 4'b1011;
assign b_tmp[51:48] = {4{a[12]}} & 4'b1100;
assign b_tmp[55:52] = {4{a[13]}} & 4'b1101;
assign b_tmp[59:56] = {4{a[14]}} & 4'b1110;
assign b_tmp[63:60] = {4{a[15]}} & 4'b1111;
assign b = b_tmp;
endmodule

16'b0000_0000_0011_0101(input) and the result will be : 64'h0000_0000_0054_0200

We can use priority encoder which can be used in parallel for each 4-bits to truncate the hex representation of zeros so that final o/p will be : 64'h0000_0000_0000_5420

For example :

b[3:0] = 0;

b[7:4] = a[1] ? 'd1 : (a[2] ? 'd2 : (a[3] ? 'd3 : (a[4] ? 'd4 : ....and so on till a[15])));// dont check for a[0]

b[11:8] = a[2] ? 'd2 : (a[3] ? 'd3 : (a[4] ? 'd4 : (a[5] ? 'd5 : ....and so on till a[15])));// dont check for a[1]

BradtheRad · May 4, 2011

If you're asking how to do this with IC's...

Your maximum input is 1111-1111-1111-1111

Your maximum output is fedc-ba98-7654-3210

You'll start with a 16 bit shift register. Or two 8 bit.

Your output will require 16 counters, each containing 4 bit. (zero to F)

Load the shift register with your number consisting of ones and zeroes...

Say, this sounds like an assignment a professor would give his students...

mrflibble · May 4, 2011

BradtheRad said:
Say, this sounds like an assignment a professor would give his students...

Surely you don't think lazy people would come on this board to have other people do their homework assignments for them? *gasp*

---------- Post added at 20:39 ---------- Previous post was at 20:37 ----------

Besides, the usual "I need this ASAP! kthxbye" was missing this time around.

BradtheRad · May 4, 2011

mrflibble said:
Surely you don't think lazy people would come on this board to have other people do their homework assignments for them? *gasp*

---------- Post added at 20:39 ---------- Previous post was at 20:37 ----------

Besides, the usual "I need this ASAP! kthxbye" was missing this time around.

Even if it does happen at this board I'm sure we all want to help them rather than discourage them from seeking our immense wisdom. :smiley face:

ganeshmirajkar · May 5, 2011

mrflibble said:
I understand that the whole idea of getting the indexes of ones is that you then have the indexes of ones. But that still gives me no clue as to what the use of that would be. Getting the index of the first 1, I understand how that would be useful.

Oh wait, fatal assumption on my side... I was assuming until now that this is going to be synthesized.

Is this going to be synthesizable? Or is it just for simulation?

Also, let me rephrase the question... Is this circuit part of a bigger circuit? If yes, could you give a rough description of the bigger circuit.

And finally, if this is just for simulation you could use a function that iterates over all the input bits...

"I understand that the whole idea of getting the indexes of ones is that you then have the indexes of ones " -- I dont know what you mean by this.
"Is this going to be synthesized? Or is it just for simulation?" -- I said i want to design a combinatorial circuit. Why would i write a combinatorial circuit if i just want to simulate it. Of course it will be synthesized.!! I i just wanted to simulate it i would rather write a behavioral code.

Also to solve this problem you really don't need to know the out of box specifics but still to feed your curiosity...i would like to tell you that yes this is a part of a bigger circuits where there are 16 buffers at one end and 16 fifos at other end.There is a switcher logic which sits between these buffer and fifo's. Now the output which is 64 bits is actually a select line for each of the muxes that sit at the write port of fifos..... I hope you understand the bigger picture.....

---------- Post added at 06:34 ---------- Previous post was at 06:30 ----------

FvM said:
Sounds like the specification should be reconsidered?
As far as I remember, there have been examples of a leading one detector HDL design. The present problem is somewaht more complex, I'm not aware of a similar problem from a previous thread.

You're asking for a combinational circuit, so everything has to be calculated in parallel. The obvious way would be writing a behavioral description, using iteration loops to describe the parallelism. Then let your design compiler synthesize the circuit and wait for the resource usage.

Well that is a very straight solution but i dont like to depend on the synthesis tool to make my circuit. I would like to tell it that this is my circuit and you implement it. For loops will give you the answer but then as a fpga/asic engineer you would never know what the actual circuit is and why it is taking so many resources...

---------- Post added at 06:36 ---------- Previous post was at 06:34 ----------

FvM said:
Sounds like the specification should be reconsidered?

Please read by comment on the mrflibble post and if you understand the bigger picture then you will also realise that the specifications are correct.....

---------- Post added at 06:38 ---------- Previous post was at 06:36 ----------

Fvm i really appreciate your efforts. Thanks. But i dont want to use a for loop because then i will never know what circuit it will actually make..Also this circuit is expected to be runnig at 266 MHz!!!! so i dont want to take a chance....

---------- Post added at 06:50 ---------- Previous post was at 06:38 ----------

dcreddy1980 said:
The below code will give a part of the logic required which requires much less logic when comapared to the original representation of the o/p :

16'b0000_0000_0011_0101(input) and the result will be : 64'h0000_0000_0054_0200

We can use priority encoder which can be used in parallel for each 4-bits to truncate the hex representation of zeros so that final o/p will be : 64'h0000_0000_0000_5420

For example :

b[3:0] = 0;

b[7:4] = a[1] ? 'd1 : (a[2] ? 'd2 : (a[3] ? 'd3 : (a[4] ? 'd4 : ....and so on till a[15])));// dont check for a[0]

b[11:8] = a[2] ? 'd2 : (a[3] ? 'd3 : (a[4] ? 'd4 : (a[5] ? 'd5 : ....and so on till a[15])));// dont check for a[1]

I think you and Fvm are the only one who have managed to understand the specifications....I really appreciate your efforts .
Well the code you have written is really impressive. Well the module index you have written is very nice. Actually i had the same approach. I managed to get the same "b" output but with a little lengthy process. You masked it with binary 4 bit number. What i did was i masked it with one hot masks 0001,0010,0100,1000 and then these four outputs went to a one hot to binary encoder. So i actually took two stages but you managed to do it in a single stage..Thats why i liked it..
Now the problem in front of me is how to remove those extra invalid zeros. The assign statements you have written will do the work but then those are like 16 muxes placed one after another and that is going to hurt my timing which is 266 MHz.
Also to just let you know that b[3:0] will but always be 0. for eg. if input is 16'b0000_0000_0000_1110 then output will be 64'h0000_0000_0000_0321.
eg2 input is 16'b0000_0000_0000_1010. output is 64'h 0000_0000_0000_0031. for eg 2 The module which you have written will give output 64'h0000_0000_0000_3020. We now only need to remove those zeros and i am badly stuck at it....
Thanks for your help....

---------- Post added at 06:52 ---------- Previous post was at 06:50 ----------

BradtheRad said:
If you're asking how to do this with IC's...

Your maximum input is 1111-1111-1111-1111

Your maximum output is fedc-ba98-7654-3210

You'll start with a 16 bit shift register. Or two 8 bit.

Your output will require 16 counters, each containing 4 bit. (zero to F)

Load the shift register with your number consisting of ones and zeroes...

Say, this sounds like an assignment a professor would give his students...

Dear i need a combinational circuit. The whole idea is to calculate the output in a single clock of 266 MHz. You solution will take atleat 16 clock cycles...but unfortunately i dont have that much time. Thanks for you efforts.

---------- Post added at 07:02 ---------- Previous post was at 06:52 ----------

mrflibble said:
Surely you don't think lazy people would come on this board to have other people do their homework assignments for them? *gasp*

---------- Post added at 20:39 ---------- Previous post was at 20:37 ----------

Besides, the usual "I need this ASAP! kthxbye" was missing this time around.

I understand the frustration you and may be BradtheRad are facing because u did not understand the question/specification itself.... but i think you both have a habit of underestimating other people.....Just to let you know that this is not a homework it is a real life design problem which i am facing and so i came to this board because i thought people are helpful here; and yes some really are....but there are also people like you too who make a hurry in commenting without understanding the specs..... and trying to act as if you are the most intelligent people on the earth. Also i think you have a habit of approaching every problem as if you were going to do a new invention...But mind well this is not a invention i am trying to do here......
I just need some help in solving that crazy problem.... But i still appreciate your time and effort in understanding the specs and also for writing these useless comments....

---------- Post added at 07:03 ---------- Previous post was at 07:02 ----------

BradtheRad said:
Even if it does happen at this board I'm sure we all want to help them rather than discourage them from seeking our immense wisdom. :smiley face:

If you really had that approach of helping people you would never post this comment....But thanks for you sequential solution to a combinatorial problem....

TrickyDicky · May 5, 2011

Fvm i really appreciate your efforts. Thanks. But i dont want to use a for loop because then i will never know what circuit it will actually make..Also this circuit is expected to be runnig at 266 MHz!!!! so i dont want to take a chance....

Good luck with that.
1. Without a for loop, the code is going to look like shit.
2. With such huge amounts of serial logic, unless you pipeline it, you have 0 chance of making it run at 266MHz, even in the most expensive FPGAs.

FvM · May 5, 2011

But i dont want to use a for loop because then i will never know what circuit it will actually make..Also this circuit is expected to be runnig at 266 MHz!!!! so i dont want to take a chance....

You have originally required a combinational circuit. Either if you take advantage of HDL iteration constructs or design the logic manually, a large number of logic elements will be involved. And I complete agree with TrickyDicky, that the flattened design won't look straightforward. You can also take my reported tpd number as a hint, that it won't work at 266 MHz.

I realized in the meantime, that the initial calculation of the first one position doesn't serve a purpose and can be omitted, also the related exit condition in the second iteration. This way, the logic cell amount will be below 500, but tpd is still above 17 ns. A similar solution has been posted at Altera Forum, as you know.

As TrickyDicky explained, 266 MHz can only work in a pipelined design. This also implies, that the simple behavioral coding approach won't work, so you get a good chance to do what you want: design the logic manually.

lostinxlation · May 5, 2011

There is a switcher logic which sits between these buffer and fifo's. Now the output which is 64 bits is actually a select line for each of the muxes that sit at the write port of fifos..... I hope you understand the bigger picture.....

well, by seeing the bigger picture, the latency anyway gets screwed by FIFO and I don't see the reason why you don't wanna pipeline it. Or is there further bigger picture ?

Prioritizing a 16 bit signal, you cannot avoid a deep logic to deal with the priority no matter what kind of tricky logic you make. Either data or control get deep.

ganeshmirajkar · May 5, 2011

lostinxlation said:
well, by seeing the bigger picture, the latency anyway gets screwed by FIFO and I don't see the reason why you don't wanna pipeline it. Or is there further bigger picture ?

Prioritizing a 16 bit signal, you cannot avoid a deep logic to deal with the priority no matter what kind of tricky logic you make. Either data or control get deep.

Ok well i am sorry for the wrong info. Actually there are 16 data ports on left hand side not buffers. So now on each clock cycle there will be data (valid or invalid denoted by msb bit of data) on 16 ports and i need to put this data into the fifos on right hand side. The writing in fifo is done in round robin fashion. So if at 1 st clock i have valid data on suppose 2 4 and 5th ports then i need to take the 2nd,4th and 5th data and put it in fifos 0,1,2. now in next clock if the valid data is on suppose 3,5,8 ports then that data will go into 3,4,5th fifos.??? can you suggest how do i pipeline here......

lostinxlation · May 5, 2011

ganeshmirajkar said:
The writing in fifo is done in round robin fashion. So if at 1 st clock i have valid data on suppose 2 4 and 5th ports then i need to take the 2nd,4th and 5th data and put it in fifos 0,1,2. now in next clock if the valid data is on suppose 3,5,8 ports then that data will go into 3,4,5th fifos.??? can you suggest how do i pipeline here......

Buffers or ports... the same difference. By using FIFO, in the biggest picture, your application most probably doesn't care about latency.
Seems to me just adding a few staging cycles works. what's the difficulty to do pipelining ?

ganeshmirajkar · May 5, 2011

lostinxlation said:
Buffers or ports... the same difference. By using FIFO, in a big picture, your application most probably doesn't care about latency.
Seems to me just adding a few staging cycles works. what's the difficulty to do pipelining ?

By ports i mean i cannot stop data from coming. I have to process it.There is no buffer. So no imagine that i have 16 input ports. From all the ports i am getting data on each clock. On each clock i must see what all ports are carrying valid data and write it them to fifo's. Now to implement pipeline what i will do is i will write a module which will take a snap of all 16 ports and register all 16 ports. Now i will take at the max 16 cycles to generate these selects. and in 17th cycle i will write the fifos. but in these 17 cycles i will have 17 different datas comming on those ports. So i will need 17 such modules. So 17 such modules will work in parallel. 1st module will work on 1st snap of data. 2nd module will work on second clock snap of data and so on. But then i think this is a huge resource consuming idea. rather suppose if i manage to calculate these selects in one clock then i will need onle one clock to process the input and on the next clock i will be ready for the next snap shot of ports.....

---------- Post added at 10:22 ---------- Previous post was at 10:21 ----------

@lostinxlation can you suggest some idea on how to pipeline in this case....

Welcome to EDAboard.com

Counting the position of ones

Newbie level 6

Advanced Member level 2

Newbie level 6

Advanced Member level 5

Newbie level 6

Advanced Member level 5

Super Moderator

Advanced Member level 5

Super Moderator

Full Member level 5

Super Moderator

Advanced Member level 5

Super Moderator

Newbie level 6

Advanced Member level 7

Super Moderator

Advanced Member level 2

Newbie level 6

Advanced Member level 2

Newbie level 6

Similar threads

Part and Inventory Search

Welcome to EDABoard.com

Sponsor