Help with 32bit CPU optimization

mtaciano · Jan 1, 2023

Hello,

I'm designing a 32-bit CPU as a recurring assignment for my university classes. Currently it can run a single program; the next step is to make it run multiple programs preemptively, with the help of an "operating system" that would oversee the other running programs.
The problem is that my current design seems very fragile: if I make any changes, it stops working. I don't know what's wrong since I'm still new to Verilog and FPGAs in general, but I would appreciate any help making my current design better.

I'm using Quartus Prime 21.1.1, my target board is the Altera DE2-115.

The main thing I could pinpoint with the help of Quartus is that the biggest components slowing down compilation were the 'Registradores.v' and 'ULA.v' files and the 'Mux2 mux2_mod2' component in 'CPU.v'.

The other problem is not related to compilation time but to CPU cycles. The method I'm using for halting the program, a 'halt' input in 'PC.v', does not work. My guess is that it's related to the CPU cycles and how always blocks work: I'm checking for halt not on the same clock it happens, so the PC increments like nothing happened.

Here's my CPU

The 'main' branch is the one that works; 'dev' is the one with the changes I'm trying to make.

Here are some images I think may be important:

The size of ULA.v:

This one is understandable since it implements multiplication and division.

The size of Registradores.v:

This one is absolutely huge. My guess is that it's because of the assigns, but I'm clueless.

The size of Mux2 mux2_mod2:

My guess about this one is that it's because it's receiving an input from the 'Registradores.v', and that's why it's so big.

BradtheRad · Jan 2, 2023

I guess you're finding that to devise your own OS is a colossal undertaking, especially to make it switch between several sub-programs, in addition to ordinary housekeeping routines.

* Display verbose messages onscreen: telling what your OS is doing, what program it's executing, whether it detects you pressing a key or clicking the mouse, etc.

* Compose each sub-program as numerous short functions. Create flag variables whose values can be altered by the OS or the sub-program. Examine these status flags when exiting a function. These tell the sub-program whether to attend to an interrupt, or to turn over control to the OS, etc.

* Just before execution jumps out of a sub-program in order to do something else, you need to store the point where it should jump back in later. Maybe you can store it as a label to a routine. Or if not a label then store a status flag in memory. Or store the current memory location (or else a pointer to a memory location).
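The last point, saving where to jump back in, can be sketched as a small Python model. This is only an illustration of the idea, not the CPU's actual mechanism: the `Context` class, the `quantum` parameter and the round-robin policy are all assumptions made up for the example, and "executing" an instruction is reduced to incrementing the PC.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Saved state of one program (hypothetical layout for illustration)."""
    name: str
    length: int        # number of instructions in the program
    pc: int = 0        # saved program counter
    done: bool = False # status flag the OS checks to see if the program finished


def run_round_robin(programs, quantum):
    """Run each program for up to `quantum` instructions, then switch.

    Returns the trace of (program name, pc) pairs in execution order.
    """
    trace = []
    while not all(p.done for p in programs):
        for p in programs:
            if p.done:
                continue
            pc = p.pc                   # restore the saved PC
            for _ in range(quantum):
                trace.append((p.name, pc))
                pc += 1                 # "execute" one instruction
                if pc == p.length:      # program ran its last instruction
                    p.done = True
                    break
            p.pc = pc                   # save the PC before switching away
    return trace


programs = [Context("A", length=3), Context("B", length=2)]
trace = run_round_robin(programs, quantum=2)
print(trace)
```

In hardware the same idea would need a place to hold the saved PC (a dedicated register or a memory location) and some way for the program to signal that it has finished, which is the role the `done` flag plays here.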

FvM · Jan 2, 2023

I don't understand the complaints about compilation time. The most recent dev branch (the only function in the GIT repository) finishes compilation in about 100 s.

I notice that the design doesn't even rudimentarily meet timing requirements. One problem is the double-dabble hex_to_dec conversion in out.v, but there may be more.

The gate-level netlist pictures in post #1 are more or less useless. You should rather review the RTL netlists.

In any case, a block diagram of the CPU that clearly indicates clocked and combinational modules should be your guide when checking any design details. Without it, we can hardly decide if there are basic design flaws.

I suggest checking the design in a simulator first. It's the best way to find out where it's possibly stuck and whether the intended relation of address and data is achieved.

Compiling the design in Quartus can nevertheless help to detect non-synthesizable design parts. You need to observe compiler messages like:

Info (276007): RAM logic "MemInst:mem_inst_mod|memoriaI" is uninferred due to asynchronous read logic

In this case, the ROM memory is ignored because $readmemb doesn't work for memory implemented in register cells. I don't understand at first sight why ROM inference doesn't work here. The ROM template seems to be met, but the problem may be caused by incorrectly driven input signals or unused output signals. Simulation would show.

--- Updated Jan 2, 2023 ---

The software question about the "OS" isn't directly related to CPU design, unless you are asking about specific CPU features supporting multitasking. I doubt, however, that preemptive multitasking features are a realistic task for a simple HDL CPU design exercise.

mtaciano · Jan 2, 2023

FvM said:
I notice that the design doesn't even rudimentarily meet timing requirements. One problem is the double-dabble hex_to_dec conversion in out.v, but there may be more.

Yes, if I don't use a clock divider the timing requirements are not met, and the setup slack becomes very negative (somewhere around -120). Currently my workaround is to use a clock divider, but my understanding is that it's not a good idea. How can I find the places that most affect the timing?
As for Out.v, I didn't know that this way of converting binary to BCD could create such latency, but I can't find a better method online. Wikipedia suggests a lookup table; how would that look? A big case statement with every possibility?
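For reference, here is a minimal Python model of the double-dabble (shift-and-add-3) algorithm and of what a lookup table for it would contain. It is a sketch under assumptions, not the code in Out.v: the function name, the 8-bit `width` and the 3-digit output are chosen only to keep the example small.

```python
def double_dabble(value, width=8, digits=3):
    """Convert an unsigned integer to packed BCD with shift-and-add-3."""
    assert 0 <= value < (1 << width) and value < 10 ** digits
    bcd = 0
    for i in range(width - 1, -1, -1):
        # Before each shift, add 3 to every BCD digit that is 5 or more.
        for d in range(digits):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        # Shift left by one and bring in the next input bit, MSB first.
        bcd = (bcd << 1) | ((value >> i) & 1)
    return bcd


# A lookup table is one precomputed entry per possible input value.
# It would normally be generated by a script like this, not typed by hand.
bcd_table = [double_dabble(v) for v in range(256)]

print(hex(double_dabble(255)))
print(hex(bcd_table[42]))
```

The outer loop runs once per input bit, so a purely combinational version chains `width` stages of add-3 logic one after another. For a 32-bit value that is 32 stages in series, which is a plausible reason for the long path. A table with one entry per input is practical for 8 bits (256 entries) but not for 32 bits, so running one loop iteration per clock cycle is a common alternative.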

FvM said:
I suggest checking the design in a simulator first. It's the best way to find out where it's possibly stuck and whether the intended relation of address and data is achieved.

What do you mean by "simulator"? Until recently I used waveform files, but they proved hard to use since they only work with really small time spans: I needed to see ~70 instructions, but the graph could show only ~5. Now I'm just running the design on the board, but then I lose all context of what's happening where.

FvM said:
In this case, the ROM memory is ignored because $readmemb doesn't work for memory implemented in register cells. I don't understand at first sight why ROM inference doesn't work here, the ROM template seems to be met, but the problem may be caused by not correctly driven input signals or unused output signals. Simulation would show.

This is a problem I have had since the beginning. If it can be caused by incorrect input signals, my guess would be that it's because the input address is 32 bits while the memory size is 2**11. That means the range of the input address is bigger than the actual address range, so maybe Quartus complains?

FvM said:
The software question about the "OS" isn't directly related to CPU design, unless you are asking about specific CPU features supporting multitasking. I doubt, however, that preemptive multitasking features are a realistic task for a simple HDL CPU design exercise.

The course that I'm taking requires that I use this CPU to run programs preemptively. My questions were more about how to fix my CPU, but I do have some questions about multitasking, like how to save the PC when you need to go back to the OS, and how the OS tells whether the program has finished. Those are broader questions, though, and not specific to my CPU.

FvM · Jan 3, 2023

mtaciano said:
By "simulator" what do you mean?

ModelSim, e.g. the free Altera Starter Edition.

pbernardi · Jan 4, 2023

Hello mtaciano,
Some comments:

The main thing I could pinpoint with the help of Quartus is that the biggest components slowing down compilation were the 'Registradores.v' and 'ULA.v' files and the 'Mux2 mux2_mod2' component in 'CPU.v'.

I suggest removing the division in the early implementations (just comment out the division operation in ULA.v). This should make things easier and faster for now. Division is a very costly operation and demands a lot of logic. Doing it as a one-cycle operation also slows down your processor *a lot*. I do not know the requirements of your university class, but if division is not required, you should consider avoiding it completely. A division can be done in software/macros using other operations (multiply, subtraction, rotate, etc.). If division really is required, you can test everything else first and leave the division as the last part of the design to be implemented and tested.

Surely the division is one of the parts that heavily affects the timing.
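As an illustration of division built from simpler operations, here is a minimal Python model of unsigned shift-and-subtract (restoring) division. It is a sketch, not what ULA.v does: the function name and the 32-bit `width` default are assumptions for the example.

```python
def divide_unsigned(dividend, divisor, width=32):
    """Unsigned restoring division using only shift, compare and subtract."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    assert 0 <= dividend < (1 << width)
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        # Bring the next dividend bit (MSB first) into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        # If the divisor fits, subtract it and set the new quotient bit.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


print(divide_unsigned(100, 7))
```

Each loop iteration needs only one shift, one comparison and one subtraction. The same loop could run as a software routine on the CPU, or as a multi-cycle hardware divider that performs one iteration per clock instead of resolving all 32 steps in a single cycle.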

The other problem is not related to compilation time but to CPU cycles. The method I'm using for halting the program, a 'halt' input in 'PC.v', does not work. My guess is that it's related to the CPU cycles and how always blocks work: I'm checking for halt not on the same clock it happens, so the PC increments like nothing happened.

You should use blocking assignments (=) inside always @(*). I do not know if this is the root of the problem you are facing, but it is good practice.
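The one-cycle-late symptom described in the quote can be reproduced in a small Python model of a clocked PC. This is an assumption about one possible cause, not a diagnosis of the actual PC.v: the model supposes that `halt` is decoded from the current instruction and, in the failing case, passes through a register before the PC sees it. All names here are hypothetical.

```python
def simulate_pc(halt_at_pc, cycles, halt_is_registered):
    """Return the PC after `cycles` clock edges.

    halt_is_registered=False: the PC sees halt in the same cycle it is decoded.
    halt_is_registered=True:  the PC sees halt one clock edge later.
    """
    pc = 0
    halt_reg = 0
    for _ in range(cycles):
        # Combinational decode of the instruction at the current PC.
        halt_now = 1 if pc == halt_at_pc else 0
        halt_seen = halt_reg if halt_is_registered else halt_now
        # Clock edge: both registers update from their pre-edge values,
        # the way non-blocking assignments behave in a clocked always block.
        next_pc = pc if halt_seen else pc + 1
        pc, halt_reg = next_pc, halt_now
    return pc


print(simulate_pc(halt_at_pc=3, cycles=10, halt_is_registered=False))
print(simulate_pc(halt_at_pc=3, cycles=10, halt_is_registered=True))
```

With the combinational halt the PC stops at the halting instruction. With the registered halt the PC has already moved on when the halt arrives, stalls for a single cycle, and then keeps counting, which would look like the PC incrementing as if nothing had happened. A simulator waveform of `halt` and the PC should show quickly whether this is what the real design does.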

As FvM said, simulation is a must-have.
