Continue to Site

Welcome to EDAboard.com

Welcome to our site! EDAboard.com is an international Electronics Discussion Forum focused on EDA software, circuits, schematics, books, theory, papers, asic, pld, 8051, DSP, Network, RF, Analog Design, PCB, Service Manuals... and a whole lot more! To participate you need to register. Registration is free. Click here to register now.

[SOLVED] Verilog Serial Filter Pipelining

Status
Not open for further replies.

keyboardcowboy

Member level 3
Member level 3
Joined
Mar 4, 2010
Messages
55
Helped
0
Reputation
0
Reaction score
0
Trophy points
1,286
Visit site
Activity points
1,715
I have this filter code which is not meeting the timing constraints, in order to fix it I have to pipeline it.

Code:
    case(cmd0)
      0: begin
           acc_d = q0*h0;
         end
      1: begin
           acc_d = q0*h0+acc;
         end
      2: begin
           {acc_d,roundit} = {acc,1'b0} >>> h0[6:0];
           acc_d = acc_d + ((roundit)?64'b1:64'b0);
         end
      3: begin
           dout_d = acc[31:0];
           acc_d = 0;
           _pushout_d=1;
         end
    endcase

The delay is being caused by the multiplication and addition operation in case 1. How can i pipeline this part??

Code:
    1: begin
       acc_d = q0*h0+acc;
 

I would appriciate to see the full picture:
- how is acc derived from acc_d
- net declarations
- how is the design clocked

Posting scarce snippets gives too much room for misunderstandings.
 

Code:
`timescale 1ns/10ps
// cmd codes
// 0 = first mult
// 1 = mult-accumulate
// 2 = shift right by h[6:0] and round
// 3 = send output and clear
module sfilt(input clk, input rst, input pushin, input [1:0] cmd,  
	input [31:0] q, input [31:0] h,
	output pushout, output [31:0] z);
reg signed [63:0] acc,acc_d;
integer q0,q0_d,h0,h0_d,dout,dout_d;
integer temp1,temp2,temp3,temp4;
reg push0;
reg _pushout,_pushout_d;
reg [1:0] cmd0;
reg roundit;

assign pushout = _pushout;
assign z = dout;

always @(*) begin
  q0_d = q0;
  h0_d = h0;
  acc_d = acc;
  dout_d = dout;
  _pushout_d=0;
  if(pushin) begin
    q0_d = q;
    h0_d = h;
  end
  if(push0) begin
    case(cmd0)
      0: begin
           acc_d = q0*h0;
         end
      1: begin
           acc_d = q0*h0+acc;
         end
      2: begin
           {acc_d,roundit} = {acc,1'b0} >>> h0[6:0];
           acc_d = acc_d + ((roundit)?64'b1:64'b0);
         end
      3: begin
           dout_d = acc[31:0];
           acc_d = 0;
           _pushout_d=1;
         end
    endcase
  end
end

always @(posedge(rst))
begin
    push0 <= #1 0;
    acc <= #1 0;
    q0 <= #1 0;
    h0 <= #1 0;
    dout <= #1 0;
    cmd0 <= #1 0;
    _pushout <= #1 0;
  end

always @(posedge(clk))
begin
    push0 <= #1 pushin;
    cmd0 <= #1 cmd;
    acc <= #1 acc_d;
  end

always @(posedge(clk))
begin
    q0 <= #1 q0_d;
    h0 <= #1 h0_d;
  end

always @(posedge(clk))
begin
	dout <= #1 dout_d;
    _pushout <= #1 _pushout_d;
  end  
  
endmodule

Clock is 250MHZ
 
Last edited:

All your multiplication and addition is pure combinational logic. Why don't you clock those always blocks? That should give you a better timing performance.
 

I would suggest in a simple way the pipeline could be done by taking each of the cmd as a stage of pipeline.

Stage 1Stage 2Stage 3Stage 4
multmult-accshiftoutput

Maybe 3rd and 4th stages can be combined. Once you start designing it'll be easier to optimize.

Another thing which I noticed in the code is you are driving signals in two different always blocks (one for rst and anothe for clk). I would suggest you drive the signals only in one always block. Otherwise these signals can be seen as having multiple drivers.

Code:
always@(posedge clk or posedge rst) begin
  if(rst) begin  
  [I]<drive reset values>[/I]
  end
  else begin
  [I]<drive non-reset values>[/I]
  end
end
 

All the additions and multiplications are already in always block
sharath666 meaned they should be in a clocked always block.

As a first point, the code must rewritten to be synthesizable, removing the various multiple driver errors by placing rst and clock actions in a single always block. You should also remove timing statements which are useless in Verilog for hardware synthesis and shouldn't be needed for simulation of synthesizable code (otherwise something is wrong with it).


Code Verilog - [expand]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
always @(posedge rst or posedge clk)
if (rst)
  begin
    push0 <= 0;
    acc <= 0;
    q0 <= 0;
    h0 <= 0;
    dout <= 0;
    cmd0 <= 0;
    pushout <= 0;
  end
else
  begin
    push0 <= pushin;
    cmd0 <= cmd;
    acc <= acc_d;
    q0 <= q0_d;
    h0 <= h0_d;
    dout <= dout_d;
    pushout <= _pushout_d;
  end



There's also unintentional latch inference for _pushout_d which should be cancelled. After these changes the code is synthesizable. Achieved speed with recent FPGA would be rather 100 to 150 than 250 MHz.

Separating the design in a combinational and registered always block makes the code hard to read. It's not obvious at first sight if it serves a purpose. To achieve 250 MHz, more pipelining, e.g. of the muxes is apparently required.

I believe the design bottlenecks can be better identified by sketching a circuit schematic.

- - - Updated - - -

harpv is showing a good way to make a fast pipelined design.
 

Status
Not open for further replies.

Part and Inventory Search

Welcome to EDABoard.com

Sponsor

Back
Top