How to implement a fast pipeline (problem with sync ram)

Status
Not open for further replies.

hdhzero

Although there are already some threads about how to implement a pipeline, I believe this
one is not a duplicate. Also, English is not my native language, so if something
is not clear, please ask me.

I am coding a VLIW processor (Harvard architecture) with a seven-stage pipeline and variable
instruction size. In theory I would implement branch prediction, but I
am close to giving up on it...

Here is the pipeline:
IF -> EXP -> ID -> SH -> EX -> MEM -> WB

EXP: this processor uses a variable instruction size, like MIPS16 and Thumb. In
this stage the instructions are expanded and the next pc is calculated.

SH: a stage dedicated to shift and rotate operations.

The other stages are similar to the ones used in MIPS etc.

I am having two major problems now: updating the program counter and branch prediction. The next pc
is calculated in the EXP stage. I wonder if there is a solution with good performance that would allow me
to calculate the next pc in the IF stage. Basically:

if (inst_size == 16) {          /* 16-bit instruction */
    pc = pc + 2;
} else if (inst_size == 32) {   /* 32-bit instruction */
    pc = pc + 4;
} ...
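
The size-based update above can be written as a small runnable model. This is only a sketch: the `inst_size_bits` helper and the rule that the top bit of the first halfword marks a 32-bit instruction are assumptions made for illustration, not this processor's real encoding, and only the two sizes shown above are covered.

```python
def inst_size_bits(first_halfword: int) -> int:
    """Hypothetical size decode: top bit set means a 32-bit instruction."""
    return 32 if first_halfword & 0x8000 else 16


def next_pc(pc: int, first_halfword: int) -> int:
    """Advance the byte-addressed pc by the size of the current instruction."""
    size_bytes = inst_size_bits(first_halfword) // 8
    return pc + size_bytes
```

Note that `next_pc` needs `first_halfword`, that is, the fetched instruction, before it can return anything.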

The problem is that I need the instruction itself to calculate the next pc:
pc -> addr cycle -> read cycle -> next_pc -> update pc

Clearly this cannot be done in one cycle while keeping a throughput of one instruction per cycle. My problems
would be solved if there were a fast RAM with async read. But apparently Xilinx only has block RAM with sync read =/
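
To make the throughput problem concrete, here is a toy cycle count, not a model of real block RAM timing. It assumes a sync read delivers data one cycle after the address is presented, an async read delivers it in the same cycle, and the pc can only advance once the instruction size is known. The function name and parameters are made up for this sketch.

```python
def fetch_cycles(program_sizes, sync_read=True):
    """Count cycles needed to fetch every instruction in order.

    program_sizes: instruction sizes in bytes (2 or 4), in program order.
    sync_read: True models a RAM whose data arrives the cycle after the
    address is presented; False models an asynchronous-read RAM.
    Returns (total_cycles, final_pc).
    """
    cycles = 0
    pc = 0
    for size in program_sizes:
        cycles += 1      # present pc to the memory (address cycle)
        if sync_read:
            cycles += 1  # wait for the registered read data
        pc += size       # only now is the size known, so pc can advance
    return cycles, pc
```

Under these assumptions, four instructions cost eight cycles with sync read and four with async read, which is the one-instruction-every-two-cycles behaviour described above.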

The second problem is branch prediction. I wonder if it is still worth implementing such a thing on hardware with such
high latency. Is it worth adding hardware that won't help much with the latency problem?

The processor uses the ZNCV flags and cmp + beq, bne, etc. instructions. Prediction would be performed in the ID
stage and the actual verification in the EX stage. But because of the sync read from RAM, this is what actually happens:

1 - The branch is taken
2 - The pc is updated with the jump address
3 - Wait one cycle for the read data to become available
4 - Now update the pipeline register. Do not update pc yet, we still need to know the instruction size
5 - Calculate next pc
6 - Update pc
7 - Wait one more cycle for the read data to be valid

As you can see, seven cycles... Is it still worth using branch prediction? We already have around 7 cycles of penalty... Maybe
it would be much better not to add more hardware and just use nops.
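
The seven-step sequence above can be tallied with a short sketch. It assumes every listed step costs exactly one clock cycle, which is how the steps are counted here; the step labels and the `branch_trace` name are just for illustration.

```python
BRANCH_STEPS = [
    "branch is taken",
    "pc updated with the jump address",
    "wait for read data (sync RAM)",
    "update pipeline register, size still unknown",
    "calculate next pc",
    "update pc",
    "wait for read data again (sync RAM)",
]


def branch_trace(steps=BRANCH_STEPS, cycles_per_step=1):
    """Return (cycle_number, step) pairs and the total cycle count."""
    trace = []
    cycle = 0
    for step in steps:
        cycle += cycles_per_step  # assumption: one cycle per listed step
        trace.append((cycle, step))
    return trace, cycle
```

With these assumptions the total is 7 cycles, and two of them (steps 3 and 7) are pure waits on the sync read.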

Again, this complication would not exist if there were a fast RAM with async read:

1 - In the same cycle, the instruction is fetched and the next pc is calculated.

I wonder how soft processors like NIOS II, Microblaze, LEON and others deal with the instruction memory.
I would be happy if I could code a processor to compete with these processors, but the only solution I can see to my problem
is to use two cycles per stage, and that is a huge decrease in performance =/

There's a VLIW processor named r-vex, and I read its i_mem file. The only problem is that it is a ROM, not a RAM =/

Could anyone help me, please? Insights, coding suggestions, etc.
 
