[SOLVED] Verilog Serial Filter Pipelining

keyboardcowboy · Feb 14, 2015

I have this filter code which is not meeting the timing constraints, in order to fix it I have to pipeline it.

Code:

    case(cmd0)
      0: begin
           acc_d = q0*h0;
         end
      1: begin
           acc_d = q0*h0+acc;
         end
      2: begin
           {acc_d,roundit} = {acc,1'b0} >>> h0[6:0];
           acc_d = acc_d + ((roundit)?64'b1:64'b0);
         end
      3: begin
           dout_d = acc[31:0];
           acc_d = 0;
           _pushout_d=1;
         end
    endcase

The delay is being caused by the multiplication and addition operation in case 1. How can i pipeline this part??

Code:

    1: begin
       acc_d = q0*h0+acc;

FvM · Feb 14, 2015

I would appriciate to see the full picture:
- how is acc derived from acc_d
- net declarations
- how is the design clocked

Posting scarce snippets gives too much room for misunderstandings.

keyboardcowboy · Feb 14, 2015

Code:

`timescale 1ns/10ps
// cmd codes
// 0 = first mult
// 1 = mult-accumulate
// 2 = shift right by h[6:0] and round
// 3 = send output and clear
module sfilt(input clk, input rst, input pushin, input [1:0] cmd,  
	input [31:0] q, input [31:0] h,
	output pushout, output [31:0] z);
reg signed [63:0] acc,acc_d;
integer q0,q0_d,h0,h0_d,dout,dout_d;
integer temp1,temp2,temp3,temp4;
reg push0;
reg _pushout,_pushout_d;
reg [1:0] cmd0;
reg roundit;

assign pushout = _pushout;
assign z = dout;

always @(*) begin
  q0_d = q0;
  h0_d = h0;
  acc_d = acc;
  dout_d = dout;
  _pushout_d=0;
  if(pushin) begin
    q0_d = q;
    h0_d = h;
  end
  if(push0) begin
    case(cmd0)
      0: begin
           acc_d = q0*h0;
         end
      1: begin
           acc_d = q0*h0+acc;
         end
      2: begin
           {acc_d,roundit} = {acc,1'b0} >>> h0[6:0];
           acc_d = acc_d + ((roundit)?64'b1:64'b0);
         end
      3: begin
           dout_d = acc[31:0];
           acc_d = 0;
           _pushout_d=1;
         end
    endcase
  end
end

always @(posedge(rst))
begin
    push0 <= #1 0;
    acc <= #1 0;
    q0 <= #1 0;
    h0 <= #1 0;
    dout <= #1 0;
    cmd0 <= #1 0;
    _pushout <= #1 0;
  end

always @(posedge(clk))
begin
    push0 <= #1 pushin;
    cmd0 <= #1 cmd;
    acc <= #1 acc_d;
  end

always @(posedge(clk))
begin
    q0 <= #1 q0_d;
    h0 <= #1 h0_d;
  end

always @(posedge(clk))
begin
	dout <= #1 dout_d;
    _pushout <= #1 _pushout_d;
  end  
  
endmodule

Clock is 250MHZ

sharath666 · Feb 15, 2015

All your multiplication and addition is pure combinational logic. Why don't you clock those always blocks? That should give you a better timing performance.

keyboardcowboy · Feb 15, 2015

All the additions and multiplications are already in always block

harpv · Feb 15, 2015

I would suggest in a simple way the pipeline could be done by taking each of the cmd as a stage of pipeline.

Stage 1	Stage 2	Stage 3	Stage 4
mult	mult-acc	shift	output

Maybe 3rd and 4th stages can be combined. Once you start designing it'll be easier to optimize.

Another thing which I noticed in the code is you are driving signals in two different always blocks (one for rst and anothe for clk). I would suggest you drive the signals only in one always block. Otherwise these signals can be seen as having multiple drivers.

Code:

always@(posedge clk or posedge rst) begin
  if(rst) begin  
  [I]<drive reset values>[/I]
  end
  else begin
  [I]<drive non-reset values>[/I]
  end
end

FvM · Feb 15, 2015

All the additions and multiplications are already in always block

sharath666 meaned they should be in a clocked always block.

As a first point, the code must rewritten to be synthesizable, removing the various multiple driver errors by placing rst and clock actions in a single always block. You should also remove timing statements which are useless in Verilog for hardware synthesis and shouldn't be needed for simulation of synthesizable code (otherwise something is wrong with it).

Code Verilog - [expand]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
always @(posedge rst or posedge clk)
if (rst)
  begin
    push0 <= 0;
    acc <= 0;
    q0 <= 0;
    h0 <= 0;
    dout <= 0;
    cmd0 <= 0;
    pushout <= 0;
  end
else
  begin
    push0 <= pushin;
    cmd0 <= cmd;
    acc <= acc_d;
    q0 <= q0_d;
    h0 <= h0_d;
    dout <= dout_d;
    pushout <= _pushout_d;
  end

There's also unintentional latch inference for _pushout_d which should be cancelled. After these changes the code is synthesizable. Achieved speed with recent FPGA would be rather 100 to 150 than 250 MHz.

Separating the design in a combinational and registered always block makes the code hard to read. It's not obvious at first sight if it serves a purpose. To achieve 250 MHz, more pipelining, e.g. of the muxes is apparently required.

I believe the design bottlenecks can be better identified by sketching a circuit schematic.

- - - Updated - - -

harpv is showing a good way to make a fast pipelined design.

Welcome to EDAboard.com

[SOLVED] Verilog Serial Filter Pipelining

keyboardcowboy

Member level 3

FvM

Super Moderator

keyboardcowboy

Member level 3

sharath666

Advanced Member level 2

keyboardcowboy

Member level 3

harpv

Member level 4

FvM

Super Moderator

Similar threads

Part and Inventory Search

Welcome to EDABoard.com

Sponsor

Connect with us

Online statistics

Forum statistics