Measuring the execution time on FPGA

doost4 · Aug 26, 2017

Hello everybody
I'm trying to measure the exact execution time of my VHDL design. The problem is that I don't have a digital oscilloscope to see how many clock cycles are spent generating the final output. I added a counter to my design to count the clock cycles, so that by multiplying the cycle count by the FPGA clock period (20 ns in my case) I could get the execution time.
The counter code is like below:

Code:

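-- Free-running counter: increments on every rising clock edge.
process (clk)
begin
    if rising_edge(clk) then
        counter_b <= counter_b + 1;
    end if;
end process;

-- Register the counter value once more before it goes to the output.
process (clk)
begin
    if rising_edge(clk) then
        count_out <= counter_b;
    end if;
end process;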

But the problem is, when I read my counter's value in Chipscope, it shows an irrelevant value. I think there is a problem related to internal clock signals, because the other output registers of the design show the correct values in Chipscope.

Do you have any ideas about my problem? Or another way to measure the execution time on FPGA?

My device is Spartan6 lx9 series and I use Xilinx ISE DS 14.7.

Thanks

TrickyDicky · Aug 26, 2017

Is there anything in the design that could cause a change in execution time? Does it have some complex decision logic? If not, then why do you even need to measure it? You should be able to work it out directly from the code or with a simulation. What makes it important that you know the execution time? If it's just a pipeline, latency is pretty irrelevant; it's throughput that's important.

shaiko · Aug 27, 2017

Make sure you're using "clk" as the sampling clock of the ILA. Otherwise, you'll have clock domain crossing problems.

doost4 · Aug 27, 2017

TrickyDicky said:
Is there anything in the design that could cause a change in execution time? Does it have some complex decision logic? If not, then why do you even need to measure it? You should be able to work it out directly from the code or with a simulation. What makes it important that you know the execution time? If it's just a pipeline, latency is pretty irrelevant; it's throughput that's important.

Dear Tricky,

No. I don't have anything that changes my execution time, but the main purpose of my work is to measure the speed-up gained from running on FPGA because it consumes a lot of time running on CPU.

By simulation, I think the clock cycles may differ from execution on FPGA (or maybe I'm wrong with this). So I have to count the clock cycles during the execution on FPGA.

shaiko said:
Make sure you're using "clk" as the sampling clock of the ILA. Otherwise, you'll have clock domain crossing problems.

I have just one clock in my design, so I have no other choice for the ILA clock connection, but I noticed a problem. If you take a look at the image below, you will see that in frame 0 all of the outputs are ready. My outputs are the first 56 bits, and the other bits belong to my clock counter register. The outputs are correct and they don't change if I run the trigger, but every time I arm the trigger, the clock counter register values change. I don't know why it depends on the trigger settings. Even when I change the data depth, the register values change but the outputs don't.

Another question is if I want to see the process of generating the outputs (before frame 0), what should I do ?

TrickyDicky · Aug 27, 2017

doost4 said:
By simulation, I think the clock cycles may differ from execution on FPGA (or maybe I'm wrong with this). So I have to count the clock cycles during the execution on FPGA.

If it's a fully synchronous system, then you can determine the number of clock cycles for execution from the code. Simulation will confirm this and unless there's a basic flaw in your design, it will always match hardware.

But determining the latency for a single data completion probably won't tell you how much faster it is than a CPU version of the core. I assume there will be some way to get data into the FPGA, and I assume there is a CPU involved in sending and receiving that data, so there will be some randomness in the turnaround of data. Here, it would be better to run some large data set through both the FPGA and the CPU and either measure the time for the large data set through both or, even better, the bandwidth from each when running flat out. Doing this may help you identify bottlenecks in either and help you improve them.

doost4 · Aug 27, 2017

TrickyDicky said:
Here, it would be better to run some large data set through both the FPGA and CPU and either measure the time for the large data set through both or even better, the bandwidth from each when running flat out. Doing this may help you identify bottlenecks in either and help you improve them.

As I understand it, by large data you mean that the operation takes a few seconds to complete, so that I could measure the execution time manually with an external timer, right?

filip.amator · Aug 27, 2017

Doing a simulation of your design code and static timing analysis (taking into account your FPGA chip, its temperature, etc.) will give you the correct answer in 99% of cases.

shaiko · Aug 27, 2017

One of the most important advantages in having the design implemented in hardware (FPGA) is inherent determinism.
The cycle-to-cycle functionality is visible from the code and can be verified with simulation tools.

But the problem is, when I read my counter's value in Chipscope, it shows an irrelevant value.

How did you get to this conclusion?
What is your ILA trigger?

doost4 · Aug 28, 2017

shaiko said:
How did you get to this conclusion?

Because the value of my counter register shows that the execution time was about 8 seconds, and I'm sure that's impossible.

What is your ILA trigger?

Do you mean trigger ports? If so, trigger ports are my main outputs and the counter register that counts clock cycles.

- - - Updated - - -

filip.amator said:
Doing a simulation of your design code and static timing analysis (taking into account your FPGA chip, its temperature, etc.) will give you the correct answer in 99% of cases.

Do you mean that these results are available in the synthesis report? I can only see the maximum path delay and the clock period there, but I couldn't find the number of cycles.

TrickyDicky · Aug 28, 2017

It won't show you the number of clock cycles, as it doesn't know where the start and end points of any given algorithm are.

You can work out the latency from the code, it's not that hard. Just count the number of register stages in your code.
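To illustrate what counting register stages means, here is a minimal sketch. The entity `mac_pipeline` and all of its signal names are made up for this example and have nothing to do with the original poster's design. Every assignment inside the clocked process is one register stage, and a value needs one clock cycle to pass through each stage on its way from input to output.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: computes (a + b) * c through three clocked stages.
entity mac_pipeline is
    port (
        clk     : in  std_logic;
        a, b, c : in  unsigned(7 downto 0);
        result  : out unsigned(15 downto 0)
    );
end entity mac_pipeline;

architecture rtl of mac_pipeline is
    signal a_r, b_r, c_r : unsigned(7 downto 0) := (others => '0');
    signal sum_r         : unsigned(7 downto 0) := (others => '0');
    signal c_d           : unsigned(7 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- Stage 1: register the inputs.
            a_r <= a;
            b_r <= b;
            c_r <= c;
            -- Stage 2: add (8-bit, wraps on overflow) and delay c to keep it aligned.
            sum_r <= a_r + b_r;
            c_d   <= c_r;
            -- Stage 3: multiply and register the 16-bit product.
            result <= sum_r * c_d;
        end if;
    end process;
end architecture rtl;
```

The longest path from any input to `result` passes through three registers, so the latency is three clock cycles. With the 20 ns clock mentioned in the first post that would be 60 ns, regardless of temperature or routing, as long as the design meets timing.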

Also, while the report will show you fmax, you should already know what clock speed you will actually run the system at, based on your design architecture. The board will only have specific clocks available, and you should already know what your target bandwidth is, so you should have already worked out what clock you need.

KlausST · Aug 28, 2017

Hi,

did you implement a controller core on the FPGA? And run software on this core?

Klaus

doost4 · Aug 28, 2017

TrickyDicky said:
It won't show you the number of clock cycles, as it doesn't know where the start and end points of any given algorithm are.

Yeah, that's my main problem.

You can work out the latency from the code, it's not that hard. Just count the number of register stages in your code.

I'll work on it. Is there any tutorial or something that explains how to do that? I've never done such before.

KlausST said:
Hi,

did you implement a controller core on the FPGA? And run software on this core?

Klaus

Actually, it's a computational core that some simulation algorithms run on, and for now it's combinational.

TrickyDicky · Aug 28, 2017

Making it purely combinatorial will make your life difficult. Latency will vary with several factors (one being temperature) and it will be very slow. You should make it fully synchronous as then you can use the simulation to measure the latency (or count the register stages in the code)
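Measuring latency in simulation could look roughly like the testbench below. It is only a sketch: the `start` and `done` signals are assumed handshake names, and the five-stage shift register merely stands in for a real synchronous design so that the file is self-contained.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity latency_tb is
end entity latency_tb;

architecture sim of latency_tb is
    constant CLK_PERIOD : time := 20 ns;  -- 50 MHz, as in the first post
    signal clk        : std_logic := '0';
    signal start      : std_logic := '0';
    signal done       : std_logic := '0';
    signal delay_line : std_logic_vector(4 downto 0) := (others => '0');
    signal finished   : boolean := false;
begin
    -- Clock generator that stops once the measurement has been reported.
    clk_gen : process
    begin
        while not finished loop
            clk <= '0';
            wait for CLK_PERIOD / 2;
            clk <= '1';
            wait for CLK_PERIOD / 2;
        end loop;
        wait;
    end process;

    -- Stand-in for the real design: five register stages between start and done.
    dut : process (clk)
    begin
        if rising_edge(clk) then
            delay_line <= delay_line(3 downto 0) & start;
        end if;
    end process;
    done <= delay_line(4);

    -- Pulse start for one cycle, then count the clock periods until done rises.
    stim : process
        variable t_start : time;
        variable cycles  : natural;
    begin
        wait until rising_edge(clk);
        start   <= '1';          -- driven as if by a register clocked on this edge
        t_start := now;
        wait until rising_edge(clk);
        start   <= '0';
        wait until done = '1';
        cycles := (now - t_start) / CLK_PERIOD;
        report "Latency: " & integer'image(cycles) & " clock cycles";
        finished <= true;
        wait;
    end process;
end architecture sim;
```

Replacing the `dut` process with an instance of the actual design, and wiring its own start and completion signals to `start` and `done`, gives the cycle count without touching the hardware.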

doost4 · Aug 28, 2017

TrickyDicky said:
Making it purely combinatorial will make your life difficult. Latency will vary with several factors (one being temperature) and it will be very slow. You should make it fully synchronous as then you can use the simulation to measure the latency (or count the register stages in the code)

My design is not fully combinational. As I replied to Klaus, it is a computational core, but the software that runs on this core is a combinational circuit.

ThisIsNotSam · Aug 28, 2017

doost4 said:
My design is not fully combinational. As I replied to Klaus, it is a computational core, but the software that runs on this core is a combinational circuit.

my head is hurting. this makes no sense.

ads-ee · Aug 28, 2017

doost4 said:
My design is not fully combinational. As I replied to Klaus, it is a computational core, but the software that runs on this core is a combinational circuit.

ThisIsNotSam said:
my head is hurting. this makes no sense.

It also makes no sense to me.

doost4, do you really understand the difference between VHDL and software? VHDL doesn't execute, it is synthesized, then cells are placed, and finally the design is routed. This is nothing like software where a compiler builds some byte code of the program and then links it to libraries creating the executable software image.

So unless your combinational circuit is some sort of processor it's not going to do any "software that run on this core" type of operation. Also as a processor requires some sort of memory elements a combinational circuit won't function well as a processor.

If you need to measure time in an FPGA you are going about this all wrong. The simplest way to obtain empirical data on the latency of a design is to create an integrated ILA design and add logic in your design that generates a pulse on starting and completion of the algorithm. You use that to capture your free running counter using the capture data based on a compare value. Basically you make the ILA only capture data when the start or complete strobes are active. This will give you delta times between start-complete-start events.
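A rough sketch of the logic described above might look like this. The entity and port names (`timestamp_probe`, `start_pulse`, `done_pulse`, `capture_en`) are placeholders, and the sketch assumes the design already produces one-clock-wide start and completion strobes. How `capture_en` is hooked up as a capture qualifier depends on the options of the ILA core being used.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper: a free-running timestamp plus a capture qualifier.
entity timestamp_probe is
    port (
        clk         : in  std_logic;
        start_pulse : in  std_logic;  -- assumed: high for one clock when the algorithm starts
        done_pulse  : in  std_logic;  -- assumed: high for one clock when it completes
        timestamp   : out std_logic_vector(31 downto 0);  -- connect to the ILA data port
        capture_en  : out std_logic   -- connect to the ILA capture/storage qualifier
    );
end entity timestamp_probe;

architecture rtl of timestamp_probe is
    signal counter : unsigned(31 downto 0) := (others => '0');
begin
    -- The counter is never reset or stopped; it simply wraps around.
    process (clk)
    begin
        if rising_edge(clk) then
            counter <= counter + 1;
        end if;
    end process;

    timestamp  <= std_logic_vector(counter);
    -- Samples are only worth storing on the cycles where an event happens.
    capture_en <= start_pulse or done_pulse;
end architecture rtl;
```

With `start_pulse` and `done_pulse` also routed to the ILA data port, each stored sample is a timestamp tagged with the event that caused it, and the difference between consecutive samples is the elapsed cycle count. A 32-bit counter on a 20 ns clock wraps roughly every 86 seconds.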

doost4 · Aug 29, 2017

ads-ee said:
It also makes no sense to me.

doost4, do you really understand the difference between VHDL and software? VHDL doesn't execute, it is synthesized, then cells are placed, and finally the design is routed. This is nothing like software where a compiler builds some byte code of the program and then links it to libraries creating the executable software image.

So unless your combinational circuit is some sort of processor it's not going to do any "software that run on this core" type of operation. Also as a processor requires some sort of memory elements a combinational circuit won't function well as a processor.

That's what I was looking for!

Yes, I was wrong about the concepts of software and VHDL design, but I just wanted to explain what I am trying to do. Even so, your explanation was complete and helpful. Finally someone pointed exactly at my problem.

If you need to measure time in an FPGA you are going about this all wrong. The simplest way to obtain empirical data on the latency of a design is to create an integrated ILA design and add logic in your design that generates a pulse on starting and completion of the algorithm. You use that to capture your free running counter using the capture data based on a compare value. Basically you make the ILA only capture data when the start or complete strobes are active. This will give you delta times between start-complete-start events.

As I understand it, I should read the counter register twice: once when the start pulse is generated and once again when the complete pulse is generated. Then I calculate the difference between the two values, and this difference is the number of clock cycles, right? If so, shouldn't the value of the register at the start pulse be zero?

Thanks

ThisIsNotSam · Aug 29, 2017

you would use a reset signal to make sure the counter goes to zero at the start, then you release the reset and let the counter count. once the operation is done, you stop the counter and check the value it has stored. this can be done in a simulation model or directly on the FPGA.
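a minimal sketch of that reset-and-stop approach, assuming the design has one-clock-wide `start` and `done` strobes (both names, and the entity `cycle_counter`, are placeholders):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper: counts the clock cycles between start and done.
entity cycle_counter is
    port (
        clk    : in  std_logic;
        start  : in  std_logic;  -- assumed: one-clock pulse when the operation begins
        done   : in  std_logic;  -- assumed: one-clock pulse when the operation ends
        cycles : out std_logic_vector(31 downto 0)
    );
end entity cycle_counter;

architecture rtl of cycle_counter is
    signal count   : unsigned(31 downto 0) := (others => '0');
    signal running : std_logic := '0';
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if start = '1' then
                -- Clear the counter and begin counting.
                count   <= (others => '0');
                running <= '1';
            elsif running = '1' then
                count <= count + 1;
                -- Stop on the cycle where done is seen; count then holds its value.
                if done = '1' then
                    running <= '0';
                end if;
            end if;
        end if;
    end process;

    cycles <= std_logic_vector(count);
end architecture rtl;
```

after done, `cycles` holds the number of clock periods between the edge that sampled start and the edge that sampled done, and it stays there until the next start, so it can be read at any time.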

ads-ee · Aug 29, 2017

You don't need to reset, stop, or do anything to the counter except capture the counter value at both the start and stop times. The counter is left as free running.

Only if the start time is larger than the stop time is there any difference to the calculation, which requires that you deal with the rollover of the free running counter. e.g.
With a 3-bit counter, start = 7 and stop = 1: counting 7, 0, 1 is two steps away, or calculate stop + 8 - start = 1 + 8 - 7 = 2.

I've used this in the past for monitoring a system that was having issues with dropped packets and discovered a problem with the aggregate rate into our board caused by an upstream board that was violating the maximum egress rate for that board. Capturing 100,000 samples and post processing using Perl was the way I found the problem.

TrickyDicky · Aug 29, 2017

Stop - start will always give the correct answer, regardless of rollover, no need to add an extra base value, assuming you keep the result to the same number of bits as the two operands.

1(001) - 7(111) = 2 (010)
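This can be checked in simulation with a few lines. The sketch below uses `unsigned` from `ieee.numeric_std`, whose subtraction wraps modulo 2^N when both operands are N bits wide; the entity and constant names are only illustrative.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity rollover_tb is
end entity rollover_tb;

architecture sim of rollover_tb is
begin
    process
        -- 3-bit counter values taken from the example above.
        constant start_v : unsigned(2 downto 0) := to_unsigned(7, 3);
        constant stop_v  : unsigned(2 downto 0) := to_unsigned(1, 3);
        variable elapsed : unsigned(2 downto 0);
    begin
        -- The result keeps the operand width, so the borrow out of bit 2 is dropped.
        elapsed := stop_v - start_v;
        assert elapsed = 2
            report "wrap-around subtraction gave an unexpected result"
            severity failure;
        report "elapsed = " & integer'image(to_integer(elapsed));
        wait;
    end process;
end architecture sim;
```

The same holds for a 32-bit counter, provided the measured interval is shorter than one full wrap of the counter.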
