# What is the highest clock frequency supported by block RAM in Spartan-3?

Status
Not open for further replies.

#### k2w2yut

##### Newbie level 4
Hello, I am new to FPGAs and Verilog.

My project is to build a simple 16-bit pipelined CPU based on the MIPS architecture.
I use separate instruction and data memories, each with an 8-bit address and a 16-bit data width.

Because these are two large memory arrays, I use block RAM: my memory module reads and writes synchronously, driven by a clock.

After implementing features such as data forwarding and a branch predictor, I tested the design in simulation and then ran it from the built-in clock generator on the Spartan-3 board. I got:

50 MHz for memory (instruction and data)
25 MHz for the CPU (by using a clock counter)

After that I wanted to find the highest clock frequency my design can run at, so I used a DCM:

Code:
wire clkFXa, clka, locked1;
DCM dcm1 (.CLKIN(clock), .RST(1'b0), .CLKFB(), .CLK0(), .CLKDV(), .CLKFX(clkFXa), .LOCKED(locked1));
defparam dcm1.CLK_FEEDBACK       = "NONE";
defparam dcm1.CLKFX_MULTIPLY     = 4;
defparam dcm1.CLKFX_DIVIDE       = 5;
defparam dcm1.CLKIN_PERIOD       = 20;
BUFG buf1 (.I(clkFXa), .O(clka));

That gave me 40 MHz (50 MHz x 4 / 5). I then improved my design, and the synthesis report now shows a shorter minimum period than before, but I still cannot raise the frequency of either the CPU or the memory.

It is stuck at 40 MHz for the CPU and 50 MHz for the memory.

#### Alexium

##### Full Member level 2
Spartan 3 BRAM works at 100 MHz with quite a margin (according to post-PAR report).

#### treqer

##### Full Member level 3
Clock frequency for Spartan-3A:
F_BRAM (block RAM clock frequency): 0 to 320 MHz at speed grade -5, 0 to 280 MHz at speed grade -4

Please write the full FPGA name!

#### k2w2yut

##### Newbie level 4
Thank you everyone for the answers, and sorry for the incomplete information.

I use:
Family: Spartan-3
Device: XC3S200
Package: FT256
Speed: -4

So the BRAM surely supports a higher frequency than the 50 MHz I am using. But why can't I raise the memory clock (for example, 40 MHz for the CPU and 60 MHz for the memory)?

My memory module contains only an always block that performs the write and the read.

Thank you, k2w2

#### mrflibble

http://www.xilinx.com/support/documentation/data_sheets/ds529.pdf

speedgrade -4: 280 MHz max for the bram.

So the blockram is not the limiting factor. It sounds like the rest of the design is the limiting factor.

As to why can you only go to 50 MHz ... look in the timing report and check what the slowest paths are.

#### k2w2yut

##### Newbie level 4

> http://www.xilinx.com/support/documentation/data_sheets/ds529.pdf
>
> speedgrade -4: 280 MHz max for the bram.
>
> So the blockram is not the limiting factor. It sounds like the rest of the design is the limiting factor.
>
> As to why can you only go to 50 MHz ... look in the timing report and check what the slowest paths are.

Thank you for your suggestion :grin:

I have been thinking about this too. I spent 3-4 days optimizing my design and expected it to do better. Here is the result.

This is from the Synthesis Report >> Timing Summary:

Code:
Timing Summary:
---------------

Minimum period: 22.679ns (Maximum Frequency: 44.094MHz)
Minimum input arrival time before clock: 12.525ns
Maximum output required time after clock: 7.709ns
Maximum combinational path delay: 11.560ns

//This is from the detail
Delay:               22.679ns (Levels of Logic = 21)
Source:            pcpuwm1/pcpu/regWriteDst_MEMWB_1 (FF)
Destination:       pcpuwm1/pcpu/ZF (FF)
Source Clock:      clock rising
Destination Clock: clock rising

The source and destination of the slowest path are both in the pcpu module, but my top-level implementation is:

Code:
//clock is 50MHz

wire                                  clkFXa, clka, locked1;
DCM dcm1 (.CLKIN(clock), .RST(1'b0), .CLKFB(), .CLK0(), .CLKDV(), .CLKFX(clkFXa), .LOCKED(locked1));
defparam dcm1.CLK_FEEDBACK       = "NONE";
defparam dcm1.CLKFX_MULTIPLY     = 4;
defparam dcm1.CLKFX_DIVIDE       = 5;
defparam dcm1.CLKIN_PERIOD       = 20;
BUFG buf1 (.I(clkFXa), .O(clka));

pcpuwm pcpuwm1 (.clock(clka), .clock_mem(clock), .reset(NBTN[0]), .start(NBTN[1]), .stall(NBTN[2]),
.sel(SW[4:0]), .y(outgr));

"clka" drives pcpu.
"clock" drives the memory.

So each module has its own clock frequency. As I understand it, I should be able to raise the frequency of clock_mem, because it is not affected by the CPU clock or by the slowest path in the pcpu module.

Code:
// this is my memory module code
always @(posedge clk) begin
    if (en) begin
        if (we)                  // write enable
            ram[addr] <= di;     // update ram with di (data input)
        do <= ram[addr];         // send data out via do (data output)
    end
end

Sorry for another newbie question :-(
Thank you, k2w2
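
For context, a complete module around the always block above might look like the sketch below. It assumes the 8-bit address and 16-bit data width mentioned in the first post; the module name `sync_ram`, the parameter names and the port list are hypothetical, since the poster's full module is not shown.

```verilog
// Hypothetical wrapper around the posted always block (names are assumptions).
module sync_ram #(
    parameter ADDR_W = 8,    // 8-bit address, as stated in the first post
    parameter DATA_W = 16    // 16-bit data width, as stated in the first post
) (
    input  wire              clk,
    input  wire              en,    // memory enable
    input  wire              we,    // write enable
    input  wire [ADDR_W-1:0] addr,
    input  wire [DATA_W-1:0] di,    // data input
    output reg  [DATA_W-1:0] do     // data output ("do" is a keyword in SystemVerilog, not in Verilog-2001)
);
    // 2^ADDR_W words of DATA_W bits each
    reg [DATA_W-1:0] ram [0:(1<<ADDR_W)-1];

    always @(posedge clk) begin
        if (en) begin
            if (we)
                ram[addr] <= di;
            // Non-blocking read: on a write, do receives the OLD contents
            // of ram[addr] (read-first behaviour).
            do <= ram[addr];
        end
    end
endmodule
```

The read is registered, so data appears on `do` one `clk` edge after the address is presented. That is the synchronous behaviour the first post describes.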

#### k2w2yut

##### Newbie level 4
I have already looked into it. The document does not state the actual maximum clock frequency of the BRAM, but in the Block RAM Timing table (Table 55, page 85) the delays look close to the Spartan-3A's, so it should work at a similar frequency. Am I right?

#### treqer

##### Full Member level 3
Spartan-3 is designed for building systems that run at 200 MHz. Calculating the maximum BRAM frequency as 1 / (1.37 ns + 1.37 ns) gives about 365 MHz, which is too large a figure ))))))) But at 200 MHz the memory should work.

Another matter is how the memory is laid out when several modules are combined into one big one.

#### permute

Try setting up a UCF file and then look at the post-PAR timing. It might show in more detail where things are failing. It also gives a more realistic measure of the design. The synthesis report makes assumptions about routing that might not be true. The results after PAR will generally be a bit lower because of routing issues.
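
A minimal UCF sketch of such a constraint is shown below. It assumes the 50 MHz board clock enters on the top-level port named `clock`, as in the DCM code earlier in the thread; the group name `tnm_clock` and the spec name `TS_clock` are made up for illustration.

```ucf
# Minimal sketch: constrain the 50 MHz input clock (20 ns period).
# "tnm_clock" and "TS_clock" are arbitrary names chosen for this example.
NET "clock" TNM_NET = "tnm_clock";
TIMESPEC "TS_clock" = PERIOD "tnm_clock" 20 ns HIGH 50%;
```

As far as I know, the ISE tools derive the periods of the DCM output clocks from a PERIOD constraint on the DCM input, so the CLKFX domains should not need separate entries. With a constraint in place, the post-PAR report should show slack against a requested period rather than only the "Autotimespec" estimates.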

#### k2w2yut

##### Newbie level 4
> Try setting up a UCF file and then look at the post-PAR timing. It might show in more detail where things are failing. It also gives a more realistic measure of the design. The synthesis report makes assumptions about routing that might not be true. The results after PAR will generally be a bit lower because of routing issues.

This is my latest result.
I use 40.3 MHz for PCPU and 50 MHz for memory.
Code:
wire clkFXa, clka, locked1;
DCM dcm1 (.CLKIN(clock), .RST(1'b0), .CLKFB(), .CLK0(), .CLKDV(), .CLKFX(clkFXa), .LOCKED(locked1));
defparam dcm1.CLK_FEEDBACK       = "NONE";
defparam dcm1.CLKFX_MULTIPLY     = 25;
defparam dcm1.CLKFX_DIVIDE       = 31;
BUFG buf1 (.I(clkFXa), .O(clka));

wire                                  clkFXb, clkb, locked2;
DCM dcm2 (.CLKIN(clock), .RST(1'b0), .CLKFB(), .CLK0(), .CLKDV(), .CLKFX(clkFXb), .LOCKED(locked2));
defparam dcm2.CLK_FEEDBACK       = "NONE";
defparam dcm2.CLKFX_MULTIPLY     = 25;
defparam dcm2.CLKFX_DIVIDE       = 25;
BUFG buf2 (.I(clkFXb), .O(clkb));

pcpuwm pcpuwm1 (.clock(clka), .clock_mem(clkb), .reset(NBTN[0]), .start(NBTN[1]), .stall(NBTN[2]),
.sel(SW[4:0]), .y(outgr));

and I got this synthesis report:
Code:
Timing Summary:
---------------

Minimum period: 20.064ns (Maximum Frequency: 49.842MHz)
Minimum input arrival time before clock: 14.360ns
Maximum output required time after clock: 7.709ns
Maximum combinational path delay: 11.492ns

and this is from the PAR report:
Code:
Release 9.2.04i par J.40

CADPC03::  Wed May 11 14:40:42 2011

par -w -intstyle ise -ol std -t 1 board_map.ncd board.ncd board.pcf

Constraints file: board.pcf.
"board" is an NCD, version 3.1, device xc3s200, package ft256, speed -4

Initializing temperature to 85.000 Celsius. (default - Range: 0.000 to 85.000 Celsius)
Initializing voltage to 1.140 Volts. (default - Range: 1.140 to 1.260 Volts)

INFO:Par:282 - No user timing constraints were detected or you have set the option to ignore timing constraints ("par
-x"). Place and Route will run in "Performance Evaluation Mode" to automatically improve the performance of all
internal clocks in this design. The PAR timing summary will list the performance achieved for each clock. Note: For
the fastest runtime, set the effort level to "std".  For best performance, set the effort level to "high". For a
balance between the fastest runtime and best performance, set the effort level to "med".

Device speed data version:  "PRODUCTION 1.39 2007-10-19".

Device Utilization Summary:

Number of BUFGMUXs                        2 out of 8      25%
Number of DCMs                            2 out of 4      50%
Number of External IOBs                  33 out of 173    19%
Number of LOCed IOBs                  33 out of 33    100%

Number of RAMB16s                         2 out of 12     16%
Number of Slices                        665 out of 1920   34%
Number of SLICEMs                      0 out of 960     0%

Overall effort level (-ol):   Standard
Placer effort level (-pl):    High
Placer cost table entry (-t): 1
Router effort level (-rl):    Standard

WARNING:Par:288 - The signal BTN<1>_IBUF has no load.  PAR will not attempt to route this signal.
WARNING:Par:288 - The signal BTN<2>_IBUF has no load.  PAR will not attempt to route this signal.
WARNING:Par:288 - The signal BTN<3>_IBUF has no load.  PAR will not attempt to route this signal.

Starting Placer

Phase 1.1
Phase 1.1 (Checksum:98ac4b) REAL time: 2 secs

Phase 2.7
Phase 2.7 (Checksum:1312cfe) REAL time: 2 secs

Phase 3.31
Phase 3.31 (Checksum:1c9c37d) REAL time: 2 secs

Phase 4.2
.....
..
Phase 4.2 (Checksum:26259fc) REAL time: 3 secs

Phase 5.8
..................................................
........
..................................................
.............
..........
.....
Phase 5.8 (Checksum:aa7b01) REAL time: 9 secs

Phase 6.5
Phase 6.5 (Checksum:39386fa) REAL time: 9 secs

Phase 7.18
Phase 7.18 (Checksum:42c1d79) REAL time: 16 secs

Phase 8.5
Phase 8.5 (Checksum:4c4b3f8) REAL time: 16 secs

REAL time consumed by placer: 16 secs
CPU  time consumed by placer: 16 secs
Writing design to file board.ncd

Total REAL time to Placer completion: 17 secs
Total CPU time to Placer completion: 17 secs

Starting Router

Phase 1: 4967 unrouted;       REAL time: 17 secs

Phase 2: 4706 unrouted;       REAL time: 17 secs

Phase 3: 2287 unrouted;       REAL time: 18 secs

Phase 4: 2287 unrouted; (1334)      REAL time: 18 secs

Phase 5: 2323 unrouted; (0)      REAL time: 19 secs

Phase 6: 0 unrouted; (5676)      REAL time: 28 secs

Phase 7: 0 unrouted; (5676)      REAL time: 29 secs

Updating file: board.ncd with current fully routed design.

Phase 8: 0 unrouted; (3437)      REAL time: 32 secs

Phase 9: 0 unrouted; (2872)      REAL time: 49 secs

Phase 10: 0 unrouted; (2872)      REAL time: 49 secs

Phase 11: 0 unrouted; (0)      REAL time: 50 secs

WARNING:Route:455 - CLK Net:clock_IBUFG may have excessive skew because
6 CLK pins and 0 NON_CLK pins failed to route using a CLK template.
WARNING:Route:455 - CLK Net:clock_counter<10> may have excessive skew because
0 CLK pins and 1 NON_CLK pins failed to route using a CLK template.

Total REAL time to Router completion: 50 secs
Total CPU time to Router completion: 50 secs

Partition Implementation Status
-------------------------------

No Partitions were found in this design.

-------------------------------

Generating "PAR" statistics.

**************************
Generating Clock Report
**************************

+---------------------+--------------+------+------+------------+-------------+
|        Clock Net    |   Resource   |Locked|Fanout|Net Skew(ns)|Max Delay(ns)|
+---------------------+--------------+------+------+------------+-------------+
|                clka |      BUFGMUX0| No   |  226 |  0.004     |  1.014      |
+---------------------+--------------+------+------+------------+-------------+
|                clkb |      BUFGMUX3| No   |    2 |  0.000     |  1.011      |
+---------------------+--------------+------+------+------------+-------------+
|         clock_IBUFG |         Local|      |    8 |  0.697     |  1.854      |
+---------------------+--------------+------+------+------------+-------------+
|   clock_counter<10> |         Local|      |   10 |  0.646     |  3.132      |
+---------------------+--------------+------+------+------------+-------------+

* Net Skew is the difference between the minimum and maximum routing
only delays for the net. Note this is different from Clock Skew which
is reported in TRCE timing report. Clock Skew is the difference between
the minimum and maximum path delays which includes logic delays.

The Delay Summary Report

The NUMBER OF SIGNALS NOT COMPLETELY ROUTED for this design is: 0

The AVERAGE CONNECTION DELAY for this design is:        1.487
The MAXIMUM PIN DELAY IS:                               4.911
The AVERAGE CONNECTION DELAY on the 10 WORST NETS is:   4.476

Listing Pin Delays by value: (nsec)

d < 1.00   < d < 2.00  < d < 3.00  < d < 4.00  < d < 5.00  d >= 5.00
---------   ---------   ---------   ---------   ---------   ---------
1596        2034        1124         245          37           0

Timing Score: 0

Asterisk (*) preceding a constraint indicates it was not met.
This may be due to a setup or hold violation.

------------------------------------------------------------------------------------------------------
Constraint                                |  Check  | Worst Case |  Best Case | Timing |   Timing
                                          |         |    Slack   | Achievable | Errors |    Score
------------------------------------------------------------------------------------------------------
Autotimespec constraint for clock net clo | SETUP   |         N/A|     4.215ns|     N/A|           0
ck_IBUFG                                  | HOLD    |     1.124ns|            |       0|           0
------------------------------------------------------------------------------------------------------
Autotimespec constraint for clock net clo | SETUP   |         N/A|    11.934ns|     N/A|           0
ck_counter<10>                            | HOLD    |     1.030ns|            |       0|           0
------------------------------------------------------------------------------------------------------
Autotimespec constraint for clock net clk | SETUP   |         N/A|    21.399ns|     N/A|           0
a                                         | HOLD    |     0.800ns|            |       0|           0
------------------------------------------------------------------------------------------------------

All constraints were met.
INFO:Timing:2761 - N/A entries in the Constraints list may indicate that the
constraint does not cover any paths or that it has no requested value.

All signals are completely routed.

WARNING:Par:283 - There are 3 loadless signals in this design. This design will cause Bitgen to issue DRC warnings.

Total REAL time to PAR completion: 52 secs
Total CPU time to PAR completion: 52 secs

Peak Memory Usage:  141 MB

Placement: Completed - No errors found.
Routing: Completed - No errors found.

Number of error messages: 0
Number of warning messages: 7
Number of info messages: 1

Writing design to file board.ncd

PAR done!

But this attempt did not work :-(

Thank you, k2w2

#### mrflibble

> Delay: 22.679ns (Levels of Logic = 21)
> Source: pcpuwm1/pcpu/regWriteDst_MEMWB_1 (FF)
> Destination: pcpuwm1/pcpu/ZF (FF)
> Source Clock: clock rising
> Destination Clock: clock rising

Well, 21 logic levels is a bit much. That is definitely going to put a limit on your speed.

If this happens to be a counter, then 21 logic levels isn't as bad as it sounds. But should this be all combinatorial without a dedicated carry chain in there (MUXCY/XORCY on Spartan-3), then 21 logic levels is going to be slooooow.

Also, in the timing report right after the bit I just quoted, there is also information about how the path delay is built up. This is also useful information. Could you include that next time around? It helps us understand roughly what is going on...

And as permute suggested, set up some basic sensible timing constraints for your design. Without them ISE might give you numbers that are easily misinterpreted.

Now, admittedly, with 21 logic levels your design is probably going to be slow, so you need to take a look at that as well...

---------- Post added at 11:34 ---------- Previous post was at 11:27 ----------

> So each module has its own clock frequency. As I understand it, I should be able to raise the frequency of clock_mem, because it is not affected by the CPU clock or by the slowest path in the pcpu module.

That sounds like a wrong assumption right there. That signal regWriteDst_MEMWB is related to the memory interface, right?

> Delay: 22.679ns (Levels of Logic = 21)
> Source: pcpuwm1/pcpu/regWriteDst_MEMWB_1 (FF)
> Destination: pcpuwm1/pcpu/ZF (FF)

#### k2w2yut

##### Newbie level 4
Code:
Timing constraint: Default period analysis for Clock 'clock'
Clock period: 22.679ns (frequency: 44.094MHz)
Total number of paths / destination ports: 3022404 / 1196
-------------------------------------------------------------------------
Delay:               22.679ns (Levels of Logic = 21)
Source:            pcpuwm1/pcpu/regWriteDst_MEMWB_1 (FF)
Destination:       pcpuwm1/pcpu/ZF (FF)
Source Clock:      clock rising
Destination Clock: clock rising

Data Path: pcpuwm1/pcpu/regWriteDst_MEMWB_1 to pcpuwm1/pcpu/ZF
                           Gate     Net
Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
----------------------------------------  ------------
--- RED section ---
FDC:C->Q             14   0.720   1.255  pcpuwm1/pcpu/regWriteDst_MEMWB_1 (pcpuwm1/pcpu/regWriteDst_MEMWB_1)
LUT4_D:I2->O         17   0.551   1.684  pcpuwm1/pcpu/fwdWB_Reg_Con<1>26 (pcpuwm1/pcpu/fwdWB_Reg_Con<1>26)
LUT2:I0->O           18   0.551   1.443  pcpuwm1/pcpu/fwdWB_Reg_Con<1>43 (pcpuwm1/pcpu/fwdWB_Reg_Con<1>)
LUT4_D:I3->O         11   0.551   1.170  pcpuwm1/pcpu/ALUIn1_or0001161_SW0 (N269)
LUT4:I3->O            1   0.551   0.869  pcpuwm1/pcpu/ALUIn1<2>11 (pcpuwm1/pcpu/ALUIn1<2>11)
LUT4:I2->O           12   0.551   1.313  pcpuwm1/pcpu/ALUIn1<2>39 (pcpuwm1/pcpu/ALUIn1<2>)
--- GREEN section ---
LUT4:I3->O            1   0.551   0.827  pcpuwm1/pcpu/result<10>151_SW0_SW0 (N447)
LUT4:I3->O            2   0.551   0.903  pcpuwm1/pcpu/result<10>151 (pcpuwm1/pcpu/result<10>)
LUT4:I3->O            1   0.551   0.996  pcpuwm1/pcpu/wZF17 (pcpuwm1/pcpu/wZF17)
LUT4:I1->O            1   0.551   0.000  pcpuwm1/pcpu/wZF99 (pcpuwm1/pcpu/wZF)
FDCE:D                    0.203          pcpuwm1/pcpu/ZF
----------------------------------------
Total                     22.679ns (10.176ns logic, 12.503ns route)
                                   (44.9% logic, 55.1% route)

Let me try to explain my data path:
The RED section is my data forwarding unit, which detects when to forward data from the W/B stage to the EX stage.
The Orange section is my ALU arithmetic operation.
The Green section checks and updates the Zero Flag register.

regWriteDst_MEMWB is the flip-flop that holds the target register (3-bit address) for the W/B stage. It is compared to decide whether the EX stage should use data from the ID stage or forwarded data from the W/B stage.

I agree that it is not completely unrelated to the memory, but it is generated at the start of the pcpu clock cycle and stays unchanged until the next clock cycle. That should be enough time for the posedge of the memory clock to "catch" the data and operate on it.

Starting from 40/50 MHz, I reduced the delay and then tried slightly higher frequencies such as 40.3/50 or 40/50.3, but neither works.

k2w2

#### mrflibble

Thank you, that was indeed the type of information I meant.

Well, I can tell you that without some changes you are not going to reach significantly higher speeds than what you get now. This is just one path (the worst one), so no doubt there are more like it.

I don't know your design, but usually with a cpu + memory interface if you already have it decoupled (since you have two different clocks for it) ... then you can pipeline parts of it.

For the particular path you posted, you have already identified 3 major parts that it can be divided into. So let's take that as an example:

> RED section is from my data forwarding unit that detect to send forward data from W/B stage to EX stage
> Orange section is from my ALU Arithmetic operation
> Green section is use to check and update Zero Flag register

So for each of these, register the output. Right after your data forwarding unit sends the data forward to the EX stage, clock this data into flip-flops.

Same for the output of your ALU operation. register the ALU output.

Ditto for your zero flag update.

Now instead of trying to do a lot of work within 1 cycle, you smear out the action over 3 clock cycles. So these 21 logic levels will get divided over these 3 stages. As a small bonus you will get some "free" routing as well, because judging by the percentage and amounts of routing delay things are pretty spread out. By adding these extra flip-flop stages you add some "halfway stations" as it were.

You will have to adjust the surrounding stuff accordingly of course to take this extra latency into account. But I am afraid there are no easy free solutions in this case. Either just accept the 21 logic levels (and the delays that go with that), or find some ways to break this up into stages and change the rest of the design to accommodate these extra stages.
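
The three-stage split described above could be sketched roughly as follows. This is only an illustration: the module name, the port names, the 16-bit width and the add-only ALU are assumptions, not taken from the poster's RTL.

```verilog
// Hypothetical sketch: forwarding mux, ALU and zero flag each get their own
// register stage instead of sharing one clock cycle.
module ex_pipeline_sketch (
    input  wire        clk,
    input  wire        rst,        // asynchronous reset, active high
    input  wire        fwd_sel,    // 1: use data forwarded from the W/B stage
    input  wire [15:0] id_data,    // operand coming from the ID stage
    input  wire [15:0] wb_data,    // data forwarded from the W/B stage
    input  wire [15:0] operand_b,  // second ALU operand
    output reg         zf          // zero flag
);
    reg [15:0] alu_in1_r;    // stage 1: registered forwarding-mux output
    reg [15:0] operand_b_r;  // operand_b delayed one clock to stay aligned
    reg [15:0] result_r;     // stage 2: registered ALU result

    always @(posedge clk or posedge rst) begin
        if (rst) begin
            alu_in1_r   <= 16'd0;
            operand_b_r <= 16'd0;
            result_r    <= 16'd0;
            zf          <= 1'b0;
        end else begin
            alu_in1_r   <= fwd_sel ? wb_data : id_data;  // stage 1: forwarding
            operand_b_r <= operand_b;
            result_r    <= alu_in1_r + operand_b_r;      // stage 2: ALU (add only here)
            zf          <= (result_r == 16'd0);          // stage 3: zero flag
        end
    end
endmodule
```

In this sketch `zf` becomes valid three clock edges after the operands are presented, which is the extra latency the surrounding hazard and forwarding logic would have to absorb.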

What you can do as a quick test (sort of a feasibility study) is this:

For all the inputs of this module, add shift registers to the inputs (say 4 deep). Then enable register balancing for this module. This will not cost you personally all that much effort (does take some extra time for ISE, but hey, go make some coffee). After it is done with the place & route, check the post PAR timings for this path and see if it is any better.

The design will NOT be functional at that moment but we don't care. That is not the intention of this action. The intention of this action is to find out what kind of timing improvements are doable with some cheap pipelining.
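
A rough sketch of such an input delay line is shown below, to be instantiated once per input bus of the module under test. The module and parameter names are hypothetical. The `shreg_extract` attribute is my assumption about how to stop XST from packing the chain into SRL16 shift registers, which as far as I know retiming cannot pull apart. Register balancing itself would still be enabled for the module that holds the logic, or globally in the synthesis options.

```verilog
// Feasibility probe only: delays an input bus by DEPTH clocks so that
// register balancing has spare flip-flops to push into the logic.
(* shreg_extract = "no" *)   // assumed XST attribute: keep plain flip-flops
module input_delay #(
    parameter WIDTH = 16,
    parameter DEPTH = 4      // must be >= 2 for the part-select below
) (
    input  wire             clk,
    input  wire [WIDTH-1:0] d,   // original module input
    output wire [WIDTH-1:0] q    // delayed copy that feeds the logic under test
);
    reg [WIDTH*DEPTH-1:0] sr;    // DEPTH words of WIDTH bits, newest at the bottom

    always @(posedge clk)
        sr <= {sr[WIDTH*(DEPTH-1)-1:0], d};   // shift up by one word, insert d

    assign q = sr[WIDTH*DEPTH-1 -: WIDTH];    // oldest word, DEPTH clocks late
endmodule
```

With the default parameters, a value on `d` appears on `q` four clock edges later. Only the post-PAR timing of the retimed path matters for this experiment, not the function.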

Personally I have had mixed results with retiming (register balancing). The absolute best results I get with actual thinking about the design. The tool based improvements with register balancing are ranging from good to total crap. 100% depending on the quality of input code I suppose.

You may want to google "xilinx REGISTER_BALANCING" for some info if you are not clear on this...

#### k2w2yut

##### Newbie level 4
Thank you, mrflibble

I will take some time to try this and will post the result ASAP ^^''

k2w2
