Continue to Site

Welcome to EDAboard.com

Welcome to our site! EDAboard.com is an international Electronics Discussion Forum focused on EDA software, circuits, schematics, books, theory, papers, asic, pld, 8051, DSP, Network, RF, Analog Design, PCB, Service Manuals... and a whole lot more! To participate you need to register. Registration is free. Click here to register now.

[SOLVED] Need help on pipelining in square root using verilog

Status
Not open for further replies.

dipin

Full Member level 4
Joined
Jul 16, 2014
Messages
223
Helped
14
Reputation
28
Reaction score
14
Trophy points
18
Activity points
1,731
Hi,

i have designed a circuit in verilog to find the square root. for N bit width input it will give output in N/2+1 clock cycle.

i have checked xilinx ipcore for this. in xilinx ip core , for the first output it takes 5 clockcycle and after

that every clockcycle it gives output.(in mine after one input ,next one again take 5 clock cycle).

i think the solution for this is to pipeline my design . but my design is like shifting two position left followed by

substraction. so is there any way to do pipelining .did anybody have any documents or sample code please

share it with me .

thanks & regards
 

Clock1 you do the shifting. Clock2 for subtraction. Clock3 you will get the output.
Let all these 3 subunits work every clock. Feed an input every clock..
Then you will notice the outputs coming out every clock...
 
Last edited by a moderator:

hi
my design is like this ,for a 8 bit input 01100100 .For each clock cycle i will shift two position to left and substract from fixed number . so for one operand total it will take 5 clock cycle to complete. if i give input in next clock cycle.it will overwrite the operand and results in error
regards
 

What is the algorithm that you are using, any link ?
 

You unravel the algorithm, which means you have each stage of shift and subtracts to complete the calculation (i.e. you end up with pipelined parallel hardware).
 

Hi Dipin. For 2 shifting and 1 subtraction, I don't know why does it take 5 clock cycles. And anyways, u said the xilinx core takes 5, not this..
 

hi,
Hi Dipin. For 2 shifting and 1 subtraction, I don't know why does it take 5 clock cycles. And anyways, u said the xilinx core takes 5, not this..
**broken link removed**

please check figure2 and figure3 the link. i have done the design using verilog. but for me after one output again it take 5 clock to generate the next output. but in xilinx only had initial latency 5 and then every clock cycle it gives output .so i need to use pipelining in my design.

regards
 

Hi,
Code:
end else if(count>0) begin
            
                 temp_in_data   <= temp_in_data << 2;
             
                   
              
                    if(temp_in_data[2*N-1:2*Q]>= temp_sub_result[N-1:0]) begin
                      
                      
                       temp_in_data [2*N-1:2*Q+2]   <= temp_in_data[2*N-1:2*Q] - temp_sub_result[N-1:0];
                      
                       a_out_data     <=  a_out_data << 1;                    
                 
                       a_out_data[0]  <=  1'b1;
                      
                       count          <= count-1;
                     
                       temp_sub_result[2]      <=  1'b1;
                       
                       temp_sub_result[Q+2:3]  <=  a_out_data[Q-1:0];
                  
                 
                    end else begin
                 
                     
                      a_out_data  <=  a_out_data << 1;                    
                 
                      a_out_data[0]  <=  1'b0;
                      
                      count <= count-1;
                      
                      temp_sub_result[Q+2:3]  <= a_out_data[Q-1:0];
                      
                      temp_sub_result[2]    <=  1'b0;
                                       
                     end
                 
            end

this is the code i have written.
do i need to change the algoritham to make it pipeline . the input width are parametrized
so please give your suggetion.

thanks&regards
 

If your question is on how generally to go about pipelining, then ... good question. I would be interesting in reading material for that as well if anyone has any good suggestions. One thing that you can try is C-slowing and retiming. That basically means that you take your existing module, add a bunch (say 3) of flip-flops as on every input, and another bunch (3 more) of flip-flops on every output. Then you place timing constraints on the max allowed delay between two flip-flops. And then you tell your synthesizer it has your permission to fsck things up and retime that module. I've used that in the past, and the results are rather variable. If you want to know more about the technique, a search on "c-slow retiming" shows up pretty relevant hits. But to be honest, I have found that a proper piece of paper with a properly sharpened pencil combined with a good cup of coffee results in better pipelined designs than c-slow + retiming. One other things about retiming, it gobbles up cpu cycles while the tools figure out how best to shuffle those registers around.
 
  • Like
Reactions: dipin

    dipin

    Points: 2
    Helpful Answer Positive Rating
Hi Dipin. Your same code should work. Just make every element registered. Do synthesis for this design and check. You will find that the design consists of 5 flops with some combo logic between some flops. Then draw a timing diagram to explain yourself the working of this circuit. As I have said earlier feed an input every clock. After the 5th clock, you will notice an output every clock.
 
  • Like
Reactions: dipin

    dipin

    Points: 2
    Helpful Answer Positive Rating
Code:
end else if(count>0) begin
  temp_in_data   <= temp_in_data << 2;

Code like this means that temp_in_data is feed back to the input of the flip-flop and therefore can't be pipelined.

Pipelining isn't about just adding registers to a design it is about performing the necessary processing steps in a chain of registers. This improves performance but increases latency as the pipeline needs to be filled before you see the first valid data. Feedback loops used to share a resource (i.e. the shift above and the subtraction) have to be unraveled.

Instead a pipelined version would have each shift operation done separately, so new data can be continuously clocked into the first temp_in_data.
Code:
temp_in_data_1 <= temp_in_data << 2;
temp_in_data_2 << temp_in_data_1 << 2;
//...etc

Code:
temp_in_data [2*N-1:2*Q+2] <= temp_in_data[2*N-1:2*Q] - temp_sub_result[N-1:0];
is Q a constant? if it isn't you won't be able to synthesize this code. What is the relationship between N and Q? It seems to me you should be slicing the bus with +: instead of using this messy error prone method. (e.g. [2 +: 30] would be a slice of [31:2], 30-bits starting at index 2)
Different pipeline stages will use different temp_in_data_# pipelined values.

You should also combine the parts of the assignment instead of using two statements:
Code:
// instead of this:
temp_sub_result[2]         <= 1'b1;
temp_sub_result[Q+2:3]     <= a_out_data[Q-1:0];
// you can concatenate the RHS like this:
temp_sub_result[Q+2:2] < {a_out_data[Q-1:0, 1'b1};

Regards

- - - Updated - - -

After the 5th clock, you will notice an output every clock.
No it won't, the design is not pipelined, it's got feedback elements that need to have every iteration completed before they can get the first output. Hence their problem stated in post #3
if i give input in next clock cycle.it will overwrite the operand and results in error
 
  • Like
Reactions: dipin

    dipin

    Points: 2
    Helpful Answer Positive Rating
Hi ,
i have solved this one and thanks for your valuable help.
Code:
                    temp_in_data_1   <= temp_in_data << 2;
    temp_in_data_2   <= temp_in_data_1 << 2;
    temp_in_data_3   <= temp_in_data_2 << 2;
    temp_in_data_4   <= temp_in_data_3 << 2;
               
                   
              
  if(temp_in_data[2*N-1:2*Q]>= temp_sub_result[N-1:0]) begin
                      
                                                                                                 temp_in_data_1 [2*N-1:2*Q+2]   <= temp_in_data[2*N-1:2*Q] - temp_sub_result[N-1:0] ;
  
       a_out_data_1  <=  {a_out_data  ,1'b1 };                    
 
      count          <= count-1;
                        
      temp_sub_result_1[Q+2:2] <= {a_out_data[Q-1:0], 1'b1};
                       
                 
 end else begin
                 
                     
          temp_in_data_1 <= temp_in_data << 2;
                      
         a_out_data_1  <=  {a_out_data  ,1'b0 };                   
                      
           count <= count-1;
        
          temp_sub_result_1[Q+2:2] <= {a_out_data[Q-1:0],1'b0};
                         
                                       
   end
                     
                     
                     
                     
   if(temp_in_data_1[2*N-1:2*Q]>= temp_sub_result_1[N-1:0]) begin
                      
                      
        temp_in_data_2 [2*N-1:2*Q+2]   <= temp_in_data_1[2*N-1:2*Q] - temp_sub_result_1[N-1:0];
  
         a_out_data_2  <=  {a_out_data_1  ,1'b1 };                    
 
           count          <= count-1;
                        
        temp_sub_result_2[Q+2:2] <= {a_out_data_1[Q-1:0], 1'b1};
                       
                 
   end else begin
                      
          temp_in_data_2 <= temp_in_data_1 << 2;
                 
                     
          a_out_data_2  <=  {a_out_data_1  ,1'b0 };                   
                      
            count <= count-1;
        
           temp_sub_result_2[Q+2:2] <= {a_out_data_1[Q-1:0],1'b0};
                         
                                       
  end


here for the 8 bit input 4 if block are there and latency is 4 clockcycle.

but my problem is still it takes N/2 cycle (N input data width).
for 32 bit data it takes 16 clockcycle which is not good . so is there any way to decrese the latency.
in xilinx cordic ip core with no pipelining it is giving output in 2 clockcycle .
with optimal pipelining for 32 bit takes 10 clockcycle
so is there any way to decrease the latency???
please give your suggestion..

do i need to start a new thread??

thanks & regards
 

Well, the thing is that in each step, you need the value that was previous calculated, so 16 steps for 32 bits seems ok cause you take 2 bits in ech step. If you want to speed up more the process, i guess that you woul need to use a lot more resources. The first thing that i can imagine is doing some steps in advanced. So taking the Figure 2 in the paper, in the first step you should substract -1 and also doing the next step supposing that the result is positivite and also supposing that the result is negative. (i haven't done this for this exact problem right now, but i'll guess that's possible)


Pipiline has some limits and you reach it, so you basically need to put there more resources to speed up the process, or increase the clk frequency of this specific module.
 
  • Like
Reactions: dipin

    dipin

    Points: 2
    Helpful Answer Positive Rating
Try this code, it does two operations in one clock cycle and reduce in half the number of clocks but it uses more resources. The result should be when counter equals 8

Each if case tries two cases, the actual and the next one, and that's why it's also necessary to shift 4 bits instead of 2.

Code:
module main(
    input en,
    input clk
    );


localparam N = 32;

reg [N-1:0] input_data = 1927;
reg [N-1:0] _input_data = 0;
reg [N-1:0] rest = 0;
reg [N-1:0] result = 0;
reg  [N-1:0] counter = 0;

wire[N-1:0] rest_1;
wire[N-1:0] rest_2;
wire[N-1:0] result_1;
wire[N-1:0] result_2;

assign  rest_1 = ({rest,_input_data[N-1:N-2]} - {result,2'b01}),
        rest_2 = {rest,_input_data[N-1:N-2],_input_data[N-3:N-4]},
        result_1 = {result,1'b1,2'b01},
        result_2 = {result,1'b0,2'b01};

always @(posedge clk)
begin
    if (en)    
    begin
        _input_data <= input_data;    
        rest        <= 0;
        result      <= 0;
        counter     <= 0;
    end
    else
    begin
        counter = counter + 1;
        _input_data <= _input_data<<4;
        

        if (~rest_1[N-1] && {rest_1,_input_data[N-3:N-4]}  >= result_1)    // first positive, second positive
        begin
            rest    <= {rest_1,_input_data[N-3:N-4]}  - result_1;
            result  <= {result,1'b1,1'b1};
        end
        else if (~rest_1[N-1] && {rest_1,_input_data[N-3:N-4]}  < result_1)    // first positive, second negative
        begin
            rest    <= {rest_1,_input_data[N-3:N-4]};
            result  <= {result,1'b1,1'b0};
        end
        else if (rest_1[N-1] && rest_2  >= result_2)    // first negative, second positive
        begin
            rest    <= rest_2  - result_2;
            result  <= {result,1'b0,1'b1};
        end
        else if (rest_1[N-1] && rest_2  < result_2)    // First negative, second negative
        begin
            rest    <= rest_2;
            result  <= {result,1'b0,1'b0};
        end        
    end
end

endmodule
 
Last edited:
  • Like
Reactions: dipin

    dipin

    Points: 2
    Helpful Answer Positive Rating
Status
Not open for further replies.

Part and Inventory Search

Welcome to EDABoard.com

Sponsor

Back
Top