Counting the position of ones

TrickyDicky · May 5, 2011

I think you missunderstand the idea of a pipeline. If you have a 17 stage pipeline, you can enter a new input on every clock cycle, so the input rate would be the same as the data rate, you just have to wait 17 clock cycles for the result, but you get one result every clock cycle.

lostinxlation · May 5, 2011

Now i will take at the max 16 cycles to generate these selects. and in 17th cycle i will write the fifos. but in these 17 cycles i will have 17 different datas comming on those ports.

You misunderstand what pipelining is.

A supermarket has 2 cashers, and the customers are waiting in a line for either of the cashers. Each casher takes care of 1 customer at a time, and in the meantime, the new customers join the line. Do you need more than 2 cashers to make it work in order ?

ganeshmirajkar · May 5, 2011

lostinxlation said:
You misunderstand what pipelining is.

A supermarket has 2 cashers, and the customers are waiting in a line for either of the cashers. Each casher takes care of 1 customer at a time, and in the meantime, the new customers join the line. Do you need more than 2 cashers to make it work in order ?

No. But in my case the customers are not ready to wait.....

lostinxlation · May 5, 2011

ganeshmirajkar said:
No. But in my case the customers are not ready to wait.....

Well. from what I read in your description, they can make a line and wait for their turn.

ganeshmirajkar · May 5, 2011

TrickyDicky said:
I think you missunderstand the idea of a pipeline. If you have a 17 stage pipeline, you can enter a new input on every clock cycle, so the input rate would be the same as the data rate, you just have to wait 17 clock cycles for the result, but you get one result every clock cycle.

You are right but then i cannot pipeline the data here because i need the data to be constant for next 16 cycles in order to parse it bit by bit. If i take the next data into same register then that will break my operation. In pipe lining you assume that there are different operations to be done on data and each stage changes data into another form and feeds it to next stage. So the input data is only needed by first stage because the next successive stages are not going to use input data rather they will be using data given by their previous stage. That's why pipe lining works there.
I my case the first operation of calculating the selects itself is taking 16 cycles. and that is why i cannot pipeline.

---------- Post added at 10:56 ---------- Previous post was at 10:54 ----------

lostinxlation said:
Well. from what I read in your description, they can make a line and wait for their turn.

Actually that is what i clarified in post no 20

TrickyDicky · May 5, 2011

Whats to stop you passing the origional data along a parrallel pipeline?

lostinxlation · May 5, 2011

ganeshmirajkar said:
because i need the data to be constant for next 16 cycles in order to parse it bit by bit. .

here is your misunderstanding of pipelining.

ganeshmirajkar · May 5, 2011

TrickyDicky said:
Whats to stop you passing the origional data along a parrallel pipeline?

The fact that i need data to be stable for 16 clocks.

TrickyDicky · May 5, 2011

It will be, along with the other 16 values in the pipeline.

Ever heard of a shift register?

ganeshmirajkar · May 5, 2011

lostinxlation said:
here is your misunderstanding of pipelining.

Actually that is not my misunderstanding... I am trying to say that thats why i cannot pipe line....

---------- Post added at 11:03 ---------- Previous post was at 10:58 ----------

TrickyDicky said:
It will be, along with the other 16 values in the pipeline.

Ever heard of a shift register?

ok... i think i will ask you a question and tell me how will you pipeline in that case.
Suppose i have a 4 bit input. I want to count the no of ones in this input. Now the input is changing on every clock cycle. and my output should give me the no of ones in input. Tell me how will you pipe line if u need 4 cycles to calculate the no of ones in input.

lostinxlation · May 5, 2011

Instead, how about you thinking by yourself ?
you have a 6 bit priority decoder and the timing is too tight and you need to make it with 2 cycle pipe stages. Make a logic with gates.

ganeshmirajkar · May 5, 2011

I think i am confised now....

mrflibble · May 5, 2011

"Is this going to be synthesized? Or is it just for simulation?" -- I said i want to design a combinatorial circuit. Why would i write a combinatorial circuit if i just want to simulate it.

And? People don't simulate combinatorial circuits where you live? The reason I asked for clarification is to not to have to make an assumption. Now you have clarified.

Also to solve this problem you really don't need to know the out of box specifics but still to feed your curiosity...i would like to tell you that yes this is a part of a bigger circuits where there are 16 buffers at one end and 16 fifos at other end.There is a switcher logic which sits between these buffer and fifo's. Now the output which is 64 bits is actually a select line for each of the muxes that sit at the write port of fifos..... I hope you understand the bigger picture.....

Well, how things with specifications usually work ... Either you give a 100% bulletproof spec (which is pretty difficult), or failing that you give a 80% effort + some context to help other people understand the immediate surroundings of the circuit. This will help these other people come up with better solutions to your problem.

I leave you to judge your initial specs yourself by the follow up posts by the other people in this thread...

Fvm i really appreciate your efforts. Thanks. But i dont want to use a for loop because then i will never know what circuit it will actually make..Also this circuit is expected to be runnig at 266 MHz!!!! so i dont want to take a chance....

Cool, 266 MHz. In your opinion, is the fact that the circuit will have to produce a valid output in 3.76 ns an unimportant detail?

Now the problem in front of me is how to remove those extra invalid zeros. The assign statements you have written will do the work but then those are like 16 muxes placed one after another and that is going to hurt my timing which is 266 MHz.

As far as I can tell the removal of these invalid zeros is the tricky part. Personally I would do 2 pipeline stages, and then reduce it in a tree like fashion. On a LUT6 architecture this would be a nice fit. I am sorry, I must have missed the architecture in the specs. What device are you targeting? If it has a LUT6 architecture, then you could do this reduction in 2 stages. Each stage being a maximum of 2 logic levels deep, which should not be a problem for a 266 MHz clock.

I understand the frustration you and may be BradtheRad are facing because u did not understand the question/specification itself.... but i think you both have a habit of underestimating other people.....

ROFL!!! You assume too much, good sir. Not meant in a "you are not important" way (because we are all treasured indivuals, each unique in their own special way), but really this is not important enough to me personally to get frustrated over.

That interaction between me and BradtheRad was related to the amount of people that come on this board, and their first posts being "do my homework for me plz plz ASAP!". Just browse around a little on the board, and I am sure you will see those kind of posts.

Just to let you know that this is not a homework it is a real life design problem which i am facing and so i came to this board because i thought people are helpful here; and yes some really are....but there are also people like you too who make a hurry in commenting without understanding the specs..... and trying to act as if you are the most intelligent people on the earth. Also i think you have a habit of approaching every problem as if you were going to do a new invention...But mind well this is not a invention i am trying to do here......
I just need some help in solving that crazy problem.... But i still appreciate your time and effort in understanding the specs and also for writing these useless comments....

Do you have any other amusing assumptions you would like to share? If nothing else, they manage to put a smile on my face.

You understand my personality so well. I feel understood, finally. *sniff*

That aside, this reads a bit like "you did not give me the answer I wanted, so therefore you must have not understood my awesomely complete specifications".

Well, with regards to your specifications ... my inferior intellect and subpar skill was just barely enough to discern that your specifications just might be considered incomplete.

But don't take my word for it, because potentially we have different communication styles. Read the posts by all the other people, including the ones that did expend the time and effort to make a good guess as to what your intended circuit was.

Because on the subject of specifications ... your initial specs are 100% fulfilled by FvM's code in post #9.

But as it turned out this was not good enough because it would not churn out the data within 1 clock cycle @ 266 MHz. Did you make any mention of a 266 MHz clock in your initial specifications? Can you now imagine that there are people that have seen incomplete specs before, that first ask for some clarifications before guessing at solutions that may turn out to be a no-go due to incomplete specifications?

Case in point, FvM wrote a nicely coded and readable solution to the initial spec. But it essentially gets discarded because it does not fulfill an as of then undisclosed constraint...

But i still appreciate your time and effort in understanding the specs and also for writing these useless comments....

I appreciate the time and effort you put into making me smile.

ganeshmirajkar · May 5, 2011

mrflibble said:
And? People don't simulate combinatorial circuits where you live? The reason I asked for clarification is to not to have to make an assumption. Now you have clarified.

Well, how things with specifications usually work ... Either you give a 100% bulletproof spec (which is pretty difficult), or failing that you give a 80% effort + some context to help other people understand the immediate surroundings of the circuit. This will help these other people come up with better solutions to your problem.

I leave you to judge your initial specs yourself by the follow up posts by the other people in this thread...

Cool, 266 MHz. In your opinion, is the fact that the circuit will have to produce a valid output in 3.76 ns an unimportant detail?

As far as I can tell the removal of these invalid zeros is the tricky part. Personally I would do 2 pipeline stages, and then reduce it in a tree like fashion. On a LUT6 architecture this would be a nice fit. I am sorry, I must have missed the architecture in the specs. What device are you targeting? If it has a LUT6 architecture, then you could do this reduction in 2 stages. Each stage being a maximum of 2 logic levels deep, which should not be a problem for a 266 MHz clock.

ROFL!!! You assume too much, good sir. Not meant in a "you are not important" way (because we are all treasured indivuals, each unique in their own special way), but really this is not important enough to me personally to get frustrated over. That interaction between me and BradtheRad was related to the amount of people that come on this board, and their first posts being "do my homework for me plz plz ASAP!". Just browse around a little on the board, and I am sure you will see those kind of posts.

Do you have any other amusing assumptions you would like to share? If nothing else, they manage to put a smile on my face. You understand my personality so well. I feel understood, finally. *sniff*

That aside, this reads a bit like "you did not give me the answer I wanted, so therefore you must have not understood my awesomely complete specifications".

Well, with regards to your specifications ... my inferior intellect and subpar skill was just barely enough to discern that your specifications just might be considered incomplete.

But don't take my word for it, because potentially we have different communication styles. Read the posts by all the other people, including the ones that did expend the time and effort to make a good guess as to what your intended circuit was.

Because on the subject of specifications ... your initial specs are 100% fulfilled by FvM's code in post #9.

But as it turned out this was not good enough because it would not churn out the data within 1 clock cycle @ 266 MHz. Did you make any mention of a 266 MHz clock in your initial specifications? Can you now imagine that there are people that have seen incomplete specs before, that first ask for some clarifications before guessing at solutions that may turn out to be a no-go due to incomplete specifications?

Case in point, FvM wrote a nicely coded and readable solution to the initial spec. But it essentially gets discarded because it does not fulfill an as of then undisclosed constraint...

I appreciate the time and effort you put into making me smile.

I think i am in the wrong forum full of egoistic people.... You what what ever you assume but frankly i dont have time to deal with your ego here....u r free to write what ever you want but that is just useless....

mrflibble · May 5, 2011

And now to be a little less sarcastic and a bit more working towards a solution .... The reason I found the "counting the position of ones" interesting is because I had to solve a similar problem recently. To give a bit of context ...

The input of the circuit was 192 bits, at a clock rate of 400 MHz in a spartan-6. The objective was to find the index of the first 1-to-0 transistion. Also, it had to keep track of the number of transitions under the assumption that only 0, 1, 2 or 3 transitions could occur due to the way the data was generated. A new batch of 192-bit data had to be handled every 2.5 ns, but the total latency requirement was negotiable. So what I finally ended up with was a 5 stage pipeline. Well, configurable between 4 and 6, but turns out 5 was just the right spot for the 2.5 ns timing constraint.

My point is, I started out with a beautiful, readable, easily understood expression pretty much like FwM's code (but less awesome of course). And if I put this nice code through the tools, no matter how many retiming registers I gave the tools to play with, the end result was always pretty bad. Nowhere near 400 MHz, and certainly not in an acceptable amount of pipeline stages.

So the main challenge was to look for ways to take the whole module and then to cut it into smaller modules, all with registered outputs. If I understand your circuit correctly and you still want to maintain that 266 MHz clock then you IMO have 2 options. Option 1: Reduce the clock (probably not an option). Option 2: introduce a couple of pipeline stages. Looks like 2 or 3 should do it. Going with a 3 stage pipeline would mean you could still churn out these 16 bit inputs at the intended 266 MHz. The pipelined result would then arrive at the fifo inputs with a delay of 3 of these 266 MHz clock cycles. If your design can tolerate this kind of latency then this seems to be the way to go.

---------- Post added at 13:06 ---------- Previous post was at 13:05 ----------

ganeshmirajkar said:
I think i am in the wrong forum full of egoistic people.... You what what ever you assume but frankly i dont have time to deal with your ego here....u r free to write what ever you want but that is just useless....

Pfffrt! At least reread your own replies after a couple of days and see if all the miscommunication was 100% other people's fault.

Honestly, you assume far too much about other people and then post about that. And it is strange that people respond to this? How quant.

Generally I try to help people. But to help people with a circuit you need to know some specifics. What you don't need to hear is "ooooh, you did not give me the answer I want, so therefore you must have all sorts of character flaws".

Should you decide to leave the board because it is full of egotistical people that just don't understand you ... take a look at the "Posts" + "Helped" count of all the people in this thread, including your own. Exactly.

dcreddy1980 · May 5, 2011

Here is the code which will work as per specification :

module index (a,b);

input [7:0] a;
output [31:0] b;

wire [3:0] sum;
wire [3:0] diff;
wire [31:0] b_tmp;
wire [63:0] b_tmp1;
wire [31:0] c;

assign c[3:0] = {4{a[0]}} & 4'b0000;
assign c[7:4] = {4{a[1]}} & 4'b0001;
assign c[11:8] = {4{a[2]}} & 4'b0010;
assign c[15:12] = {4{a[3]}} & 4'b0011;
assign c[19:16] = {4{a[4]}} & 4'b0100;
assign c[23:20] = {4{a[5]}} & 4'b0101;
assign c[27:24] = {4{a[6]}} & 4'b0110;
assign c[31:28] = {4{a[7]}} & 4'b0111;

max2min_8 max2min_8 (.a0_i(c[3:0]), .a1_i(c[7:4]), .a2_i(c[11:8]), .a3_i(c[15:12]), .a4_i(c[19:16]), .a5_i(c[23:20]), .a6_i(c[27:24]), .a7_i(c[31:28]), .max2min_8_o(b_tmp));

assign sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7];

assign diff = 4'b1000 - sum;

assign b_tmp1 = b_tmp << (diff*4);

assign b = diff > 'd0 ? b_tmp1[63:32] : b_tmp1[31:0];

endmodule

module max2min_8 (a0_i, a1_i, a2_i, a3_i, a4_i, a5_i, a6_i, a7_i, max2min_8_o);

input [3:0] a0_i,a1_i,a2_i,a3_i,a4_i,a5_i,a6_i,a7_i;
output [31:0] max2min_8_o;

wire [15:0] tmp_1;
wire [15:0] tmp_2;
wire [7:0] tmp_3;
wire [7:0] tmp_4;
wire [15:0] tmp_5;
wire [7:0] tmp_6;
wire [7:0] tmp_7;
wire [15:0] tmp_8;

max2min_4 max2min_4_1 (.a0_i(a0_i), .a1_i(a1_i), .a2_i(a2_i), .a3_i(a3_i), .max2min_4_o(tmp_1));
max2min_4 max2min_4_2 (.a0_i(a4_i), .a1_i(a5_i), .a2_i(a6_i), .a3_i(a7_i), .max2min_4_o(tmp_2));

max2min max2min_1 (.a0_i(tmp_1[15:12]), .a1_i(tmp_2[15:12]), .max2min_2_o(tmp_3));
max2min max2min_2 (.a0_i(tmp_1[3:0]), .a1_i(tmp_2[3:0]), .max2min_2_o(tmp_4));

max2min_4 max2min_4_3 (.a0_i(tmp_1[11:8]), .a1_i(tmp_1[7:4]), .a2_i(tmp_2[11:8]), .a3_i(tmp_2[7:4]), .max2min_4_o(tmp_5));

max2min max2min_3 (.a0_i(tmp_5[15:12]), .a1_i(tmp_3[3:0]), .max2min_2_o(tmp_6));
max2min max2min_4 (.a0_i(tmp_4[7:4]), .a1_i(tmp_5[3:0]), .max2min_2_o(tmp_7));

max2min_4 max2min_4_4 (.a0_i(tmp_6[3:0]), .a1_i(tmp_5[11:8]), .a2_i(tmp_5[7:4]), .a3_i(tmp_7[7:4]), .max2min_4_o(tmp_8));

assign max2min_8_o[31:28] = tmp_3[7:4];
assign max2min_8_o[27:24] = tmp_6[7:4];
assign max2min_8_o[23:20] = tmp_8[15:12];
assign max2min_8_o[19:16] = tmp_8[11:8];
assign max2min_8_o[15:12] = tmp_8[7:4];
assign max2min_8_o[11:8] = tmp_8[3:0];
assign max2min_8_o[7:4] = tmp_7[3:0];
assign max2min_8_o[3:0] = tmp_4[3:0];
endmodule

module max2min_4 (a0_i, a1_i, a2_i, a3_i, max2min_4_o);

input [3:0] a0_i,a1_i,a2_i,a3_i;
output [15:0] max2min_4_o;

wire [7:0] tmp_1;
wire [7:0] tmp_2;
wire [7:0] tmp_3;
wire [7:0] tmp_4;
wire [7:0] tmp_5;

max2min max2min_1 (.a0_i(a0_i), .a1_i(a1_i), .max2min_2_o(tmp_1));
max2min max2min_2 (.a0_i(a2_i), .a1_i(a3_i), .max2min_2_o(tmp_2));

max2min max2min_3 (.a0_i(tmp_1[7:4]), .a1_i(tmp_2[7:4]), .max2min_2_o(tmp_3));
max2min max2min_4 (.a0_i(tmp_1[3:0]), .a1_i(tmp_2[3:0]), .max2min_2_o(tmp_4));

max2min max2min_5 (.a0_i(tmp_3[3:0]), .a1_i(tmp_4[7:4]), .max2min_2_o(tmp_5));

assign max2min_4_o[15:12] = tmp_3[7:4];
assign max2min_4_o[11:8] = tmp_5[7:4];
assign max2min_4_o[7:4] = tmp_5[3:0];
assign max2min_4_o[3:0] = tmp_4[3:0];

endmodule

module max2min (a0_i, a1_i, max2min_2_o);

input [3:0] a0_i,a1_i;
output [7:0] max2min_2_o;

assign max2min_2_o[7:4] = a0_i > a1_i ? a0_i : a1_i;
assign max2min_2_o[3:0] = a0_i < a1_i ? a0_i : a1_i;

endmodule

The above code has been simulated and works fine, is synthesizable will consume only 94 LUTS when targeted to Virtex-6 device.

The only point is its done for a 8-bit input vector.

You can always make it for any bit width from the above method.

permute · May 5, 2011

something wrong with using "generate"? it would remove a lot of the repeated lines of code.

@ganeshmirajkar:
the pipelined version is essentially the same as the original version of a circuit, but with registers in between well-defined "stages". this is done mainly to improve clock rate, or when it simplifies the logic in some manner. in the above, each "stage" would take a delayed (registered) version of the input, and a delayed (registered) version of the previous stage's output, and generate a new partial output. after the 16 stages, the result would be computed. each stages can work on one input per cycle, and the overall circuit can accept one new input per cycle. The first output will be available after 16 cycles. It is similar to an assembly line -- it can produce a thousand things per hour, but it takes an hour before anything has completed every stage of production. Pipelining is used in many cases in systems with little to no recursion.

FvM · May 5, 2011

I think i am in the wrong forum full of egoistic people.... You what what ever you assume but frankly i dont have time to deal with your ego here....u r free to write what ever you want but that is just useless....

Sounds like a pretty strong hint to further stay away from your post!?
In my view, mrflibble simply mentioned some well-founded criticism about your way to present the problem. You should stand it.

Regarding pipelining. You can either go for a pipelined design or give it up. I see, that are no familiar to the concept and apparently still misunderstand it in part. Please consider, that other contributors, who are working with this stuff since many years also had some start-up difficulties. They most likely learned from text books and throrough exercises.

BradtheRad · May 5, 2011

Sorry to misunderstand your situation earlier.

All I know to recommend is a lookup table at this point.

You have 65536 possible inputs. A unique output for any given input. Prepare every answer ahead of time. No need for calculations. Can you retrieve in one clock cycle? Not sure. But it's the fastest way.

====

My previous suggestion would have been to count zeroes coming off the right end of a shift register. You referred to it taking 16 clock cycles so you probably know what I'm suggesting.

Enable the least significant counter until a one appears. Count the one as well.

Disable that counter. Load its contents into the next least significant counter.

Enable the new counter. Count zeroes, etc.

This suits your specification to use exactly as many counters as there are ones in your original input. Each counter will contain the position of the column where there was a one.

FvM · May 5, 2011

All I know to recommend is a lookup table at this point.

Basically a good suggestion. But would need 64kx64 respectively 512 kByte of ROM. You'll have serious difficulties to meet the 266 MHz requirement.

The other suggestion is one of many possible sequential solutions. It's promising low resource usage, but only moderate speed. Unlike a fully parallel combinational design, it can't be pipelined without multiplying the resource count. It's unsuitable for high speed.

Welcome to EDAboard.com

Counting the position of ones

Advanced Member level 7

Advanced Member level 2

Newbie level 6

Advanced Member level 2

Newbie level 6

Advanced Member level 7

Advanced Member level 2

Newbie level 6

Advanced Member level 7

Newbie level 6

Advanced Member level 2

Newbie level 6

Advanced Member level 5

Newbie level 6

Advanced Member level 5

Full Member level 5

Advanced Member level 3

Super Moderator

Super Moderator

Super Moderator

Similar threads

Part and Inventory Search

Welcome to EDABoard.com

Sponsor