mrflibble
How does one make the best use of retiming and/or C-slow to get the most out of a given pipeline?
With retiming, some modules get better results by putting the shift registers on the inputs (forward register balancing), while other modules do better with shift registers on the output (backward register balancing).
For now I use the following method:
- write the HDL (in Verilog)
- create timing constraints for the specific module
- synthesize, map, place & route (using ISE 13.1)
- look at post place & route timings for the module-to-be-improved, and at the maximum number of logic levels.
- take this number of logic levels, and make an educated guess for the number of flip-flops to insert.
- insert flip-flops, enable register balancing, hope for the best
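The "insert flip-flops" step can be made a parameter instead of a hand edit, so that several depths can be tried against the same constraints. A minimal sketch, assuming a hypothetical helper module `retime_delay` (the name, ports and parameters are illustrative and not part of the design in this post). It delays a bus by DEPTH clocks and carries the same REGISTER_BALANCING attribute as the module shown further down. SHREG_EXTRACT is switched off on the assumption that XST would otherwise pack a chain of two or more registers into SRL primitives, which register balancing is not expected to move; both attributes are worth checking against the synthesis report for your ISE version.

```verilog
`default_nettype none
`timescale 1ns / 1ps

// Hypothetical helper, not from the original post: delays d by DEPTH clocks.
// The attribute values are assumptions about XST behaviour; verify them in
// the synthesis report before relying on them.
(* REGISTER_BALANCING = "YES", SHREG_EXTRACT = "NO" *)
module retime_delay #(
    parameter WIDTH = 17, // bus width
    parameter DEPTH = 2   // number of register stages, 0 = pass-through
) (
    input  wire             clk,
    input  wire [WIDTH-1:0] d,
    output wire [WIDTH-1:0] q
);

    generate
        if (DEPTH == 0) begin : g_bypass
            // No registers at all
            assign q = d;
        end else begin : g_delay
            // Flat shift register, no initialization value.
            // Newest sample sits in the low WIDTH bits, oldest in the top WIDTH bits.
            reg [DEPTH*WIDTH-1:0] pipe;

            always @(posedge clk) begin
                pipe <= (pipe << WIDTH) | d;
            end

            assign q = pipe[DEPTH*WIDTH-1 -: WIDTH];
        end
    endgenerate

endmodule // retime_delay

`default_nettype wire
```

Placed in front of the combinational logic this gives the tools registers to push forward; placed behind the logic it gives them registers to pull backward. Only DEPTH changes between runs, so the educated guess from the logic-level count becomes a single parameter to sweep.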
As it stands, this method is hit & miss. Sometimes it gets pretty good results, sometimes it's crap. So, what is a good way to improve the success ratio of such retiming?
Are there any tools that can aid in this? Also, links, papers and book recommendations would be much appreciated.
---------- Post added at 06:01 ---------- Previous post was at 04:45 ----------
Something related to this...
Retiming can take a long time, so is there some way to store and reuse intermediate results?
For example, the following module responds fairly well to the insertion of a 1-deep shift register on the inputs. With register balancing enabled, XST can then move these registers forward to improve timing.
Code:
`default_nettype none
`timescale 1ns / 1ps

(* REGISTER_BALANCING = "YES" *)
module edge_detector_16bit #(
    parameter RETIME = 0 // 0 = no retiming , 1 = 1-bit shift register on inputs for retiming
) (
    input  wire        clk,
    input  wire        prev,
    input  wire [15:0] data,
    output reg  [3:0]  index    = 0,
    output reg         detect   = 0, // 0 = no edge detected ; 1 = edge detected
    output reg         polarity = 0  // 0 = 1-to-0 transition detected ; 1 = 0-to-1 transition detected
);

    //
    // Check for illegal parameter values.
    //
    initial begin
        if ((RETIME < 0) || (RETIME > 1)) begin
            $display("DRC ERROR: Illegal value %d for RETIME parameter. Should be 0 or 1.", RETIME);
            $finish;
        end
    end

    //
    // Retiming
    //
    wire [16:0] sel_data;

    generate
        if (RETIME==1) begin
            // Retiming
            reg [16:0] retimed_data; // shift register, no initialization value
            assign sel_data = retimed_data;
            always @(posedge clk) begin
                retimed_data <= {data[15:0],prev};
            end
        end else begin
            // No retiming
            assign sel_data = ({data[15:0],prev});
        end
    endgenerate

    //
    // Edge detect
    //
    always @(posedge clk) begin
        casez (sel_data)
            // decode with priority from right to left (low to high bit)
            17'b???????????????10: begin index <= 0;  detect <= 1'b1; polarity <= 1'b1; end
            17'b??????????????100: begin index <= 1;  detect <= 1'b1; polarity <= 1'b1; end
            17'b?????????????1000: begin index <= 2;  detect <= 1'b1; polarity <= 1'b1; end
            17'b????????????10000: begin index <= 3;  detect <= 1'b1; polarity <= 1'b1; end
            17'b???????????100000: begin index <= 4;  detect <= 1'b1; polarity <= 1'b1; end
            17'b??????????1000000: begin index <= 5;  detect <= 1'b1; polarity <= 1'b1; end
            17'b?????????10000000: begin index <= 6;  detect <= 1'b1; polarity <= 1'b1; end
            17'b????????100000000: begin index <= 7;  detect <= 1'b1; polarity <= 1'b1; end
            17'b???????1000000000: begin index <= 8;  detect <= 1'b1; polarity <= 1'b1; end
            17'b??????10000000000: begin index <= 9;  detect <= 1'b1; polarity <= 1'b1; end
            17'b?????100000000000: begin index <= 10; detect <= 1'b1; polarity <= 1'b1; end
            17'b????1000000000000: begin index <= 11; detect <= 1'b1; polarity <= 1'b1; end
            17'b???10000000000000: begin index <= 12; detect <= 1'b1; polarity <= 1'b1; end
            17'b??100000000000000: begin index <= 13; detect <= 1'b1; polarity <= 1'b1; end
            17'b?1000000000000000: begin index <= 14; detect <= 1'b1; polarity <= 1'b1; end
            17'b10000000000000000: begin index <= 15; detect <= 1'b1; polarity <= 1'b1; end
            17'b00000000000000000: begin index <= 0;  detect <= 1'b0; polarity <= 1'b1; end
            17'b???????????????01: begin index <= 0;  detect <= 1'b1; polarity <= 1'b0; end
            17'b??????????????011: begin index <= 1;  detect <= 1'b1; polarity <= 1'b0; end
            17'b?????????????0111: begin index <= 2;  detect <= 1'b1; polarity <= 1'b0; end
            17'b????????????01111: begin index <= 3;  detect <= 1'b1; polarity <= 1'b0; end
            17'b???????????011111: begin index <= 4;  detect <= 1'b1; polarity <= 1'b0; end
            17'b??????????0111111: begin index <= 5;  detect <= 1'b1; polarity <= 1'b0; end
            17'b?????????01111111: begin index <= 6;  detect <= 1'b1; polarity <= 1'b0; end
            17'b????????011111111: begin index <= 7;  detect <= 1'b1; polarity <= 1'b0; end
            17'b???????0111111111: begin index <= 8;  detect <= 1'b1; polarity <= 1'b0; end
            17'b??????01111111111: begin index <= 9;  detect <= 1'b1; polarity <= 1'b0; end
            17'b?????011111111111: begin index <= 10; detect <= 1'b1; polarity <= 1'b0; end
            17'b????0111111111111: begin index <= 11; detect <= 1'b1; polarity <= 1'b0; end
            17'b???01111111111111: begin index <= 12; detect <= 1'b1; polarity <= 1'b0; end
            17'b??011111111111111: begin index <= 13; detect <= 1'b1; polarity <= 1'b0; end
            17'b?0111111111111111: begin index <= 14; detect <= 1'b1; polarity <= 1'b0; end
            17'b01111111111111111: begin index <= 15; detect <= 1'b1; polarity <= 1'b0; end
            17'b11111111111111111: begin index <= 0;  detect <= 1'b0; polarity <= 1'b0; end
            default              : begin index <= 0;  detect <= 1'b0; polarity <= 1'b0; end
        endcase
    end

endmodule // edge_detector_16bit

`default_nettype wire
I tried something similar, but with a wider input. This meant a lot more case items, and more logic levels. So I inserted a deeper shift register on the inputs. But no matter what depth (from 1 to 8) I chose for the shift register, the results were always crap. So I guess there is a limit to how many logic levels XST is able to optimize...
Other than hand optimizing or partitioning (which is what I did for now), are there tools that help analyze this sort of thing?
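For reference, a sketch of what hand-partitioning the 16-bit decode into two pipeline stages could look like. This is not the partitioning actually used (that code is not shown here); the module name `edge_detector_16bit_2stage` and all internal signal names are hypothetical. It relies on two observations about the case table above: the lowest transition is the lowest set bit of `data ^ {data[14:0], prev}`, and every reachable row sets polarity to the inverse of `prev`. Stage 1 finds the lowest transition inside each group of four bits, stage 2 picks the lowest group. The latency is two clocks, the same as the original with RETIME = 1; equivalence is intended but should be confirmed in simulation against the original module.

```verilog
`default_nettype none
`timescale 1ns / 1ps

// Illustrative two-stage variant; names are hypothetical, not from the post.
module edge_detector_16bit_2stage (
    input  wire        clk,
    input  wire        prev,
    input  wire [15:0] data,
    output reg  [3:0]  index    = 0,
    output reg         detect   = 0, // 0 = no edge detected ; 1 = edge detected
    output reg         polarity = 0  // 0 = 1-to-0 transition ; 1 = 0-to-1 transition
);

    // diff[i] = 1 marks a transition between data[i] and its lower neighbour
    // (data[i-1], or prev for i = 0).
    wire [15:0] diff = data ^ {data[14:0], prev};

    // Stage 1 registers, no initialization value
    reg [3:0] grp_hit; // group g holds at least one transition
    reg [7:0] grp_idx; // lowest transition inside each group, 2 bits per group
    reg       prev_q;

    // Lowest set bit of a 4-bit vector (0 when no bit is set)
    function [1:0] low_bit4(input [3:0] v);
        casez (v)
            4'b???1: low_bit4 = 2'd0;
            4'b??10: low_bit4 = 2'd1;
            4'b?100: low_bit4 = 2'd2;
            4'b1000: low_bit4 = 2'd3;
            default: low_bit4 = 2'd0;
        endcase
    endfunction

    //
    // Stage 1: decode each group of four bits independently
    //
    integer g;
    always @(posedge clk) begin
        for (g = 0; g < 4; g = g + 1) begin
            grp_hit[g]        <= |diff[4*g +: 4];
            grp_idx[2*g +: 2] <= low_bit4(diff[4*g +: 4]);
        end
        prev_q <= prev;
    end

    //
    // Stage 2: select the lowest group that saw a transition
    //
    always @(posedge clk) begin
        detect   <= |grp_hit;
        polarity <= ~prev_q; // matches every reachable row of the original case table
        casez (grp_hit)
            4'b???1: index <= {2'd0, grp_idx[1:0]};
            4'b??10: index <= {2'd1, grp_idx[3:2]};
            4'b?100: index <= {2'd2, grp_idx[5:4]};
            4'b1000: index <= {2'd3, grp_idx[7:6]};
            default: index <= 4'd0;
        endcase
    end

endmodule // edge_detector_16bit_2stage

`default_nettype wire
```

The point of the split is that each stage only ever looks at a 4-bit priority decode, so the logic depth per stage stays small and fixed regardless of what the retiming pass manages to do. A wider input would add groups, or a third stage, rather than more logic levels in one stage.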