Can cache memory speed up the system?


TuAtAu

Hi all, I am designing a processor core for an FPGA using VHDL.

Currently I am pipelining it, and every instruction completes in 1 clock cycle (CC).

The core can run at clock speeds up to 100MHz.

Unfortunately, the fastest the flash I am planning to use for storing instructions/code can be read is 33ns per access, which corresponds to roughly a 30MHz clock.
Since my core (100MHz) wants to read an instruction from the flash (30MHz) every CC, the core has to keep waiting for the flash.


So I decided to implement a cache in my core, BUT the PROBLEM is:
the cache also needs 33ns to read from the memory, and then the core reads from the cache... it seems meaningless.
Or maybe I just don't know how to implement the cache so that it increases the overall system speed.

Can anyone give me an idea about how to implement a cache in this situation?
Can a cache help here? :sad:
 

A cache is used when you have re-used data sections.

E.g., a for loop: there might be 1000 instructions inside the loop, and the loop might run 10,000 times. With no cache, each instruction is loaded 10,000 times. With caching, the instructions are read once and (ideally) remain in cache memory (rough numbers below).

However, if the cache holds fewer than 1000 instructions, you will have cache misses and potentially lose all the advantages of the cache.

The same goes for data -- if there are commonly used variables, but too many to fit into registers, a cache can be used. Ideally, things that make it into the cache will have a high probability of being accessed multiple times before being evicted from the cache.
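
To put rough (and purely illustrative) numbers on that loop example, using the 33ns flash and 10ns core clock from this thread: 1000 instructions x 10,000 iterations = 10,000,000 fetches. Straight from flash that costs about 10,000,000 x 33ns ≈ 330ms in fetches alone; if all 1000 instructions fit in the cache after the first pass, it is roughly 1000 x 33ns + 10,000,000 x 10ns ≈ 100ms, so the flash stops being the bottleneck.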
 
If you're storing your program data in slow flash, then caching is a very good idea.

I can't quite follow this sentence: "the cache also needs 33ns to read from the memory, and then the core reads from the cache... it seems meaningless". Can you rephrase that?

Ideally, a cache should work transparently. The first time you access memory that isn't cached (a miss), a whole page (of some arbitrary length) will be retrieved and copied somewhere to the BRAM cache. If your requirements are modest, you could just cache a single page at a time. If you want to be fancy, you could cache a number of different non-contiguous pages.

For a working implementation, you could have a look at the aeMB2 soft processor, though the code is not at all well documented. There should be plenty of other references and books that deal with the topic, though. This page was one of the first Google hits I found.
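
To make the single-page idea concrete, here is a minimal VHDL sketch (my own illustration, not code from any of the cores mentioned; the 128-word page size, 16-bit word addresses, and the flash handshake signals are all assumptions):

Code:
-- Minimal single-page instruction cache (illustrative sketch only).
-- Assumptions: 32-bit instructions, a 128-word page, 16-bit word
-- addresses, and a flash controller that continuously reads the word
-- at flash_addr and pulses flash_done when flash_data is valid.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity page_cache is
  port (
    clk        : in  std_logic;
    reset      : in  std_logic;
    cpu_addr   : in  unsigned(15 downto 0);   -- word address from the core
    cpu_data   : out std_logic_vector(31 downto 0);
    cpu_ready  : out std_logic;                -- '0' = core must stall
    flash_addr : out unsigned(15 downto 0);
    flash_data : in  std_logic_vector(31 downto 0);
    flash_done : in  std_logic
  );
end entity;

architecture rtl of page_cache is
  type ram_t is array (0 to 127) of std_logic_vector(31 downto 0);
  signal page_ram  : ram_t;
  signal page_tag  : unsigned(8 downto 0);     -- upper 9 address bits
  signal tag_valid : std_logic := '0';
  signal fill_cnt  : unsigned(6 downto 0) := (others => '0');
  signal filling   : std_logic := '0';
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        tag_valid <= '0';
        filling   <= '0';
      elsif filling = '1' then
        if flash_done = '1' then               -- store one word per flash read
          page_ram(to_integer(fill_cnt)) <= flash_data;
          if fill_cnt = 127 then
            filling   <= '0';
            tag_valid <= '1';
          else
            fill_cnt <= fill_cnt + 1;
          end if;
        end if;
      elsif tag_valid = '0' or cpu_addr(15 downto 7) /= page_tag then
        page_tag  <= cpu_addr(15 downto 7);    -- miss: refill the whole page
        tag_valid <= '0';
        fill_cnt  <= (others => '0');
        filling   <= '1';
      end if;
    end if;
  end process;

  flash_addr <= page_tag & fill_cnt;
  -- note: this asynchronous read infers distributed RAM; a true BRAM
  -- would need a registered read port
  cpu_data  <= page_ram(to_integer(cpu_addr(6 downto 0)));
  cpu_ready <= '1' when tag_valid = '1' and filling = '0'
                    and cpu_addr(15 downto 7) = page_tag
               else '0';
end architecture;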
 
Thanks permute, you gave me the big picture rather than just the theoretical claim that "cache speeds up the system".

Hi joelby,
here are further details on "the cache also needs 33ns to read from the memory, and then the core reads from the cache... it seems meaningless":

The cache has to read from the slow flash memory, which takes 33ns (30MHz) per access. My processor core runs at 100MHz, which means it needs to fetch an instruction every 10ns.

Let's say FlashMEM -- 33ns --> cache -- 0.x ns --> processor. If we don't implement a cache, it's FlashMEM -- 33ns --> processor, which is faster, right? Since the cache also has to read from the flash before the data reaches the processor, why not skip the cache and read directly from flash to processor? (That is what I meant by "meaningless".)

Also, if I implement a cache, it will only be about 64~256 bytes, due to the limitations of my FPGA... So for the looping case permute described, it seems the cache would not be big enough even for a 100-instruction loop (4 bytes/instruction). Right?

So is a cache even necessary here?
And besides using a faster flash, is there any other method to make the system run at 100MHz? (30MHz is too slow ><)
 

FPGAs do have block RAMs. You should have 32kbit RAMs, which is 1024 32-bit values. Even if you need to use one as a manual cache -- where the code copies itself into a memory space in the block RAM -- you should be somewhat OK. Otherwise, you might need another block RAM to store the upper bits of the address of each cached element, along with a flag for whether it is valid (see the sketch below). As mentioned, you can also cache large "pages" at a time.

Also, keep in mind that the faster flash modes are burst-oriented. This means that a read request will be forced to return several values over multiple cycles. Caching will obviously help here, as the values will not need to be re-read.

Another way to improve performance is to use variable-length instructions. This allows a higher average number of instructions per read operation.
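
For the "second block RAM with the upper address bits plus a valid flag" idea, a direct-mapped tag check might look something like this in VHDL (again just a sketch with made-up widths: 256 one-word lines and 16-bit word addresses):

Code:
-- Direct-mapped tag store: the data BRAM holds the cached words, this
-- module holds (valid & tag) per line and reports hits. Widths are
-- illustrative assumptions: 256 one-word lines, 16-bit word addresses.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tag_check is
  port (
    clk      : in  std_logic;
    cpu_addr : in  unsigned(15 downto 0);
    wr_line  : in  std_logic;   -- pulse when a miss refill completes
    hit      : out std_logic
  );
end entity;

architecture rtl of tag_check is
  -- bit 8 is the valid flag, bits 7..0 are the address tag
  type tag_t is array (0 to 255) of std_logic_vector(8 downto 0);
  signal tag_ram  : tag_t := (others => (others => '0'));
  signal line_idx : integer range 0 to 255;
begin
  -- the low 8 address bits pick the line, the high 8 bits form the tag
  line_idx <= to_integer(cpu_addr(7 downto 0));

  process(clk)
  begin
    if rising_edge(clk) then
      if wr_line = '1' then
        -- this line now holds the word whose upper bits are cpu_addr(15:8)
        tag_ram(line_idx) <= '1' & std_logic_vector(cpu_addr(15 downto 8));
      end if;
    end if;
  end process;

  hit <= '1' when tag_ram(line_idx)(8) = '1'
              and tag_ram(line_idx)(7 downto 0)
                  = std_logic_vector(cpu_addr(15 downto 8))
         else '0';
end architecture;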
 
Let's say FlashMEM -- 33ns --> cache -- 0.x ns --> processor. If we don't implement a cache, it's FlashMEM -- 33ns --> processor, which is faster, right? Since the cache also has to read from the flash before the data reaches the processor, why not skip the cache and read directly from flash to processor? (That is what I meant by "meaningless".)

The idea of a cache is that a fetch will still be slow when you have a cache 'miss' and need to go out to flash, but if you get mainly cache 'hits', the average instruction execution time will be very much shorter.
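
As a rough worked example (my numbers: your 10ns core clock, an assumed 40ns miss penalty and a hypothetical 95% hit rate): average fetch time ≈ 0.95 x 10ns + 0.05 x (10ns + 40ns) = 12ns, i.e. close to full speed. Without the cache, every single fetch costs 33ns or more.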

Tuning your code so that critical sections fit into your cache is an art in itself. Cache design is also a bit of an art and can be done in many different ways depending on your needs and workload.

So is a cache even necessary here?

To meet your 100 MIPS requirement, with the information you've given us: absolutely, yes.

And besides using a faster flash, is there any other method to make the system run at 100MHz? (30MHz is too slow ><)

Use a larger FPGA that can fit your entire program into BRAM.

At initialisation time, copy the program into a fast, external static RAM.
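
A reset-time copier for that second option could be as simple as the following VHDL sketch (illustrative only: the 64K-word image size and the flash/SRAM handshake signals are my own assumptions):

Code:
-- Reset-time boot copier: stream the program image word by word from
-- flash into external SRAM, then let the core run. The 64K-word image
-- size and the handshake signals are illustrative assumptions.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity boot_copier is
  port (
    clk        : in  std_logic;
    reset      : in  std_logic;
    flash_data : in  std_logic_vector(31 downto 0);
    flash_done : in  std_logic;             -- the word at flash_addr is valid
    flash_addr : out unsigned(15 downto 0);
    sram_addr  : out unsigned(15 downto 0);
    sram_data  : out std_logic_vector(31 downto 0);
    sram_we    : out std_logic;
    core_run   : out std_logic              -- '1' once the copy has finished
  );
end entity;

architecture rtl of boot_copier is
  signal addr : unsigned(15 downto 0) := (others => '0');
  signal done : std_logic := '0';
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        addr <= (others => '0');
        done <= '0';
      elsif done = '0' and flash_done = '1' then
        if addr = 65535 then                -- last word copied
          done <= '1';
        else
          addr <= addr + 1;
        end if;
      end if;
    end if;
  end process;

  -- write each word into SRAM in the same cycle the flash presents it
  sram_we    <= '1' when reset = '0' and done = '0' and flash_done = '1'
                else '0';
  flash_addr <= addr;
  sram_addr  <= addr;
  sram_data  <= flash_data;
  core_run   <= done;
end architecture;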
 
RE: permute

"as mentioned, you can also cache large "pages" at a time."
Do you mean we need to stop the processor and let all the data be fetched from the FLASH into the cache first?

"faster flash modes are burst-oriented"
I plan to use this FLASH: **broken link removed**
I couldn't find any burst mode in the datasheet...

"variable length instructions"
I am designing an ARM-based RISC core, limited to the 16-bit Thumb and 32-bit ARM instruction sets.


RE: joelby
"At initialisation time, copy the program into a fast, external static RAM."
Do you mean I need another RAM module/chip, and the instructions/data get copied from the FLASH into that RAM first when I press the RESET button?
So it would be like a software program that always shows LOADING... before it starts?
 

Your processor will effectively stop what it's doing while it's waiting for the flash as things stand. A smart cache might even pre-load pages of memory that it thinks you'll need soon in the background, so that stalls are less common. Otherwise, fetches take effectively 0ns extra while the core is running from the cache, and the core pauses for 33ns+ whenever there's a cache miss.

Yes, if you copy your program into faster external memory there will be some initial loading time, though it should be pretty quick unless your flash is very slow and your program is very big.
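
For a feel of the numbers (my own assumptions: a 256 KB image read 32 bits at a time at 33ns per word), the copy is about 65,536 reads x 33ns ≈ 2.2ms, so the "LOADING..." pause would be invisible in practice.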
 
Unfortunately, I plan to run a Linux OS on my core, most probably Ubuntu? Or MicroBlaze? etc. Either way, it's a big program...
 

I don't think you would ever be able to run Linux on an FPGA that only had 256 bytes of BRAM. Which device are you planning to target?

You should have a good look at the **broken link removed** soft processor. It runs Linux at a reasonable speed and implements on-FPGA instruction and data caching and much more.
 
RE: Joelby

I'm using an XA Spartan-3E (XC3S1200E).

----------

For the cache part, about HIT and MISS... I still don't understand how it will speed up the system...

As I picture it:

Initially (with cache):
1st CC: the cache fetches the P.Counter 0000H instruction, taking 33ns; it's a HIT, so the core takes it from the cache in ~0ns (but the core waited 33ns).
There is nothing else in the cache, so the core has to wait while the cache loads the next instruction (waiting another 33ns).
2nd CC: the cache fetches the P.Counter 0001H instruction, then cache -> core.
...

Even on a hit, the core still has to wait for the cache to fetch from the FLASH...
And if we load pages into the cache, it still takes one 33ns flash cycle per word to fill the cache.

So isn't that the same as going directly from FLASH to core?
--------
Ignore that, I realize it would be faster when there is a loop back to a previous address...
Xp
 

Not quite. You can run your processor at a different speed to the program memory - if the processor is faster than the memory, you will need to introduce wait states.

So let's say your processor is running with a 10ns clock period and that fetching a page of, say, 128 instructions from flash and storing it in a cache takes 50ns.

Clock   Elapsed   State
0       0 ns      Fetch instruction 0 - detect cache miss - fetch from flash
1       10 ns     Wait state
2       20 ns     Wait state
3       30 ns     Wait state
4       40 ns     Wait state
5       50 ns     Execute instruction 0. Fetch instruction 1 - hit from cache
6       60 ns     Execute instruction 1. Fetch instruction 2 - hit from cache
7       70 ns     Execute instruction 2. Fetch instruction 3 - hit from cache
8       80 ns     Execute instruction 3, a branch to instruction 500. Fetch instruction 500 - detect cache miss - fetch from flash
9       90 ns     Wait state
10      100 ns    Wait state
11      110 ns    Wait state
12      120 ns    Wait state
13      130 ns    Execute instruction 500, etc.



So the processor always runs at a high clock speed, but stalls for a number of cycles while waiting for the cache to be filled. As the Wikipedia article states, you can improve performance by trying to predict when you'll need to load data into the cache, and do it in advance.

A simple example - if you're executing a long sequence of instructions, from 0 to 127, the processor might anticipate that you'll fetch 128 and 129 after that, so it might start loading the next page in advance. This is possible if your cache consists of multiple independent sections.
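
From the core's side, the wait-state behaviour in the table above falls out of a very small piece of logic: only advance the program counter when the cache reports a hit. A hedged VHDL sketch (port names are my own, and cache_hit is assumed to come from whatever cache sits between the core and the flash):

Code:
-- Fetch-stage stall logic: the program counter only advances on a
-- cache hit, which reproduces the wait-state pattern in the table
-- above.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fetch_stage is
  port (
    clk       : in  std_logic;
    reset     : in  std_logic;
    cache_hit : in  std_logic;              -- '0' inserts a wait state
    branch    : in  std_logic;
    branch_pc : in  unsigned(15 downto 0);
    pc        : out unsigned(15 downto 0)
  );
end entity;

architecture rtl of fetch_stage is
  signal pc_r : unsigned(15 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        pc_r <= (others => '0');
      elsif cache_hit = '1' then
        -- only move on when the instruction actually arrived;
        -- otherwise hold the same address (a wait state)
        if branch = '1' then
          pc_r <= branch_pc;
        else
          pc_r <= pc_r + 1;
        end if;
      end if;
    end if;
  end process;

  pc <= pc_r;
end architecture;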


By the way, the XC3S1200E has 504 Kbits of BRAM, which is enough for a modest cache (even the 80486 had at most 16 KByte (128 Kbit) of on-chip cache).

To run Linux, you will need to use some external memory. I don't think it'd be possible to get away with less than 8 MB of RAM unless perhaps you used ancient versions of the kernel and userland.
 

Thank you very much joelby, you explained it very clearly! I understand it now.
Really appreciated. I will try to implement it.
BTW, the examples you gave are all in Verilog, and I have difficulty with Verilog as I only use VHDL...
I will try to learn basic Verilog and study their implementation. Thanks again!
 
