Welcome to EDAboard.com

STM32F769 speed problem

Status
Not open for further replies.

doraemon
Hi guys!

I'm rather new to ST microcontrollers. I have built a program using CubeMX, and the generated code works.
Basically it's a program that takes the input from 8, 16 or 32 MEMS microphones as explained in ST's document
"Audio Processing With An Open Development System". The CPU is an F769, supposed to run at 216 MHz.
I set it to 196.608 MHz, which is a multiple of 48 kHz (audio).
All the peripherals (DMA, timer, etc.) work as expected. Now I have a problem with execution time.
Although the clock is pretty high (around 196 MHz), copying a buffer with memcpy is incredibly slow.
I copy 3072 (or 6144) bytes, and it takes 830 µs, but with a clock at 196 MHz, I would expect it to be much faster.
I get the same result if I copy the bytes in a loop.
So I thought that there might be something I'm missing.
Some background info:
I'm sure the clock works as expected. When dividing it by 64, I get a microphone clock of 3.072 MHz, confirmed
by the scope. So the core seems to run as expected.

Does anybody have some experience with STM32? (Easyrider83 ??).

I also tried to run the program in run mode (i.e. not debug), but it doesn't change anything.

Thanks for any hint.

Dora.
 

I don't know your processor, but you should be able to look at your instruction table and calculate exactly how long your process should take by looking at your assembly-level code. This is perfectly deterministic. You don't say what language you are using. Perhaps there is some optimization that needs to be enabled.
 

Hello!

I don't know your processor

My STM32F769 (see title) is a standard STM32F769, I didn't make it myself.
I should have added that I'm using SW4STM32 (the free IDE, see openstm32).
For example, if I move 3072 bytes using memcpy, is it normal that it takes
830 µs? The processor clock is 196.608 MHz, which means that it takes about
53 clocks to move 1 byte.
I can't believe that the libraries delivered with GCC are so slow, so there might be
something I'm missing.
I'm using C language, in a GCC environment.
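
The 53-clock figure above can be checked with a few lines of C. This is only a sketch of the arithmetic; the function name `cycles_per_byte` is made up for illustration and is not part of any ST or GCC library.

```c
#include <stdint.h>

/* Core cycles spent per byte copied, from a measured duration.
 * seconds  : measured copy time (e.g. from the scope)
 * clock_hz : core clock frequency
 * bytes    : number of bytes copied */
double cycles_per_byte(double seconds, double clock_hz, uint32_t bytes)
{
    return (seconds * clock_hz) / (double)bytes;
}
```

With the numbers from this post, `cycles_per_byte(830e-6, 196.608e6, 3072)` comes out slightly above 53.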

Dora.
 

Yes, 53 cycles per byte sounds like a lot. Again, maybe there are some speed vs. size optimizations you can enable? Maybe the compiler is just not too bright, and you'll have to hand code the data move routine.
 

Sounds like you are copying from and to internal SRAM? If so, it should be much faster. One possible reason is that the CPU is almost blocked with interrupt activity, or that you didn't manage to set the CPU clock generation as intended.
 

Hello!

Sounds like you are copying from and to internal SRAM? If so, it should be much faster. One possible reason is that the CPU is almost blocked with interrupt activity, or that you didn't manage to set the CPU clock generation as intended.

Yes, it's entirely in internal sram.
The CPU gets interrupts every millisecond because I want to use ST's PDM to PCM library, and the
latter is designed to process chunks of 1 ms.
As I want 48 kHz output, I will have 48 samples at the end, and since the oversampling factor is
64, I have a microphone clock of 3.072 MHz (configured by CubeMX). I have set up a 24.576 MHz
crystal, and I generate the largest multiple, which is 196.608 MHz. Once divided by 64 by a timer,
I get a 3.072 MHz microphone clock that I can verify with the scope.

As I get the microphones in parallel on a GPIO, I get chunks of 3072 samples. Therefore I declare
a 6144-sample buffer, and I get an interrupt at half buffer and at full buffer.
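
The numbers in this post follow from the output rate and the oversampling factor. The sketch below only restates that arithmetic; all macro and function names are chosen for illustration.

```c
#include <stdint.h>

#define PCM_RATE_HZ   48000u      /* desired audio output rate         */
#define OVERSAMPLING  64u         /* PDM bits per PCM sample           */
#define CORE_CLOCK_HZ 196608000u  /* 8 x 24.576 MHz crystal            */
#define TIMER_PERIOD  63u         /* timer counts 0..63, divides by 64 */

/* Microphone (PDM) clock needed for the chosen output rate. */
uint32_t pdm_clock_hz(void) { return PCM_RATE_HZ * OVERSAMPLING; }

/* Clock actually produced by the timer from the core clock. */
uint32_t timer_output_hz(void) { return CORE_CLOCK_HZ / (TIMER_PERIOD + 1u); }

/* PDM samples captured per microphone in one 1 ms frame. */
uint32_t pdm_samples_per_ms(void) { return pdm_clock_hz() / 1000u; }

/* Double buffer: one half is filled while the other is processed. */
uint32_t dma_buffer_len(void) { return 2u * pdm_samples_per_ms(); }
```

Both clock functions give 3,072,000 Hz, each 1 ms frame holds 3072 samples, and the double buffer holds 6144.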

So most of the time, the CPU does nothing. And if I light a LED before the memcpy call and set it
off after, I can verify on the scope how long it takes. And it's very long. A possible approach will be
to use DMA to move bytes, but the problem will remain when rearranging (de-interleaving) the
pdm buffers.
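
To make the de-interleaving step concrete, here is a minimal sketch for the 8-microphone case. It assumes that each captured byte holds one PDM bit per microphone (bit m belongs to microphone m) and that each output stream is packed MSB first; neither assumption is confirmed in this thread, and the name `pdm_deinterleave` is made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_MICS 8u

/* in        : n_samples bytes read from the GPIO port
 * n_samples : must be a multiple of 8
 * out       : NUM_MICS * (n_samples / 8) bytes; the stream for
 *             microphone m starts at out[m * (n_samples / 8)] */
void pdm_deinterleave(const uint8_t *in, size_t n_samples, uint8_t *out)
{
    size_t bytes_per_mic = n_samples / 8u;

    memset(out, 0, NUM_MICS * bytes_per_mic);
    for (size_t i = 0; i < n_samples; i++) {
        uint8_t  sample   = in[i];
        size_t   byte_idx = i / 8u;
        unsigned shift    = 7u - (unsigned)(i % 8u);  /* MSB first */

        for (unsigned m = 0; m < NUM_MICS; m++) {
            uint8_t bit = (uint8_t)((sample >> m) & 1u);
            out[m * bytes_per_mic + byte_idx] |= (uint8_t)(bit << shift);
        }
    }
}
```

This bit-by-bit loop is written for clarity, not speed. It touches every input byte eight times, which is exactly the kind of work that a DMA copy alone does not remove.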

Any hint to speed up things would be helpful.

Thanks,

Dora.
 

Can you run a debugger to actually see what's happening when memcpy is invoked? It's going to have to create some kind of looping procedure. Is that procedure really running at 196 MHz?
 

Hello!

Thanks for your reply.
I have tried this just after clock init:

Code (C):
while (1) {
    HAL_GPIO_TogglePin(GPIOD, GPIO_PIN_12);
}



The period is 550 ns. Therefore each loop iteration (one toggle) takes 275 ns, which is about 3.6 MHz.
As the clock frequency is 196 MHz, this brings us to 54 clocks again. Now I don't know
how efficient HAL_GPIO_TogglePin is.
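
The 54-clock estimate can be reproduced as follows. This is only a sketch of the arithmetic, with a made-up function name; one output period on the scope corresponds to two toggles, so two loop iterations.

```c
/* Core cycles per loop iteration, given the period seen on the scope.
 * One full output period = two toggles = two loop iterations. */
double cycles_per_iteration(double period_s, double clock_hz)
{
    double iteration_s = period_s / 2.0;
    return iteration_s * clock_hz;
}
```

`cycles_per_iteration(550e-9, 196.608e6)` gives a little over 54 cycles per toggle.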

Is that procedure really running at 196MHZ?

Again, it's difficult to say, but the frequency I have on the microphones corresponds
to my setup: htim1.Init.Period = 63;
Dividing 196.608 MHz by 64 yields 3.072 MHz, exactly what I observe, and therefore
I have this good reason to believe that the processor works at 196 MHz.
Now as there are many different clocks out there, is there a different clock for the core
(I mean different to the tim1 clock)?
Looking at CubeMX clock configuration, I can see that APB2 timer clock (on which tim1
depends) is derived from HCLK which in my case is equal to SYSCLK (196.608 MHz).
I would guess that SYSCLK, HCLK are used for the general processing.

Thanks,

Dora.
 

Look, there are only three possible reasons for your problem:
1) the memcpy process is running at a slower clock speed than you think.
2) the memcpy process is executing more instructions than you think.
3) the memcpy process is being interrupted so that it doesn't complete in time.

You need to dig down deeper into the actual code. Again, a debugger would help.
 

Hello Barry!

In the end I set up a DMA transfer and move the data as full words. It's a lot faster.
Besides this, I discovered that in CubeMX there is a setup page for the processor itself.
By enabling cache, prefetch, etc., the speed has been multiplied by a factor of 3.3.
On top of that, the DMA runs in the background and improves things quite a lot.
I think I will do a few tutorials on the whole chain (CubeMX + OpenSTM32 IDE).

Thanks,

Dora.
 

Hi,
Unfortunately you have chosen the worst method of moving memory. Why?
1. DMA is much slower than a memory copy done by the core. There are several reasons: the limited speed of the buses accessible to the DMA, potential conflicts with the core (the core has priority over DMA), conflicts with other DMA transfers, etc.
2. If you need to know whether your transfer has finished, additional code (polling, interrupts) is required. This makes the logic of the application more complicated and can eventually be the source of errors that are very difficult to diagnose. This is one of the reasons why such code would not meet the requirements of many industrial standards. Additional code is required as well to prepare the transfer. It may not be a big issue if you use the bare registers, but it is still completely unnecessary code.
3. Not all SRAM is accessible via DMA. In STM32F7 you have two areas of core coupled memory (16 kB for code and 128 kB for data), and if you try to copy from or to those areas you will get a hardware exception (bus error).

To copy fast you need to:
1. Care about the alignment
2. Use 32bit words
3. For the fastest copy, the copy routine can be placed in the core coupled code memory, and your uC will work in the most optimal Harvard configuration (code & data will be accessed via separate buses)
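
Points 1 and 2 of the list above can be sketched as a plain word-wise copy. This is a minimal illustration, not a tuned routine; the names `copy_words` and `is_word_aligned` are made up, and placing the function in core coupled memory (point 3) would additionally need a linker section that is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns non-zero if p is aligned on a 4-byte boundary. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0u;
}

/* Copy n_words 32-bit words from src to dst.
 * Both pointers must be 4-byte aligned and the regions must not overlap. */
void copy_words(uint32_t *restrict dst, const uint32_t *restrict src,
                size_t n_words)
{
    for (size_t i = 0; i < n_words; i++) {
        dst[i] = src[i];
    }
}
```

For the 3072-byte buffer discussed in this thread, the call would copy 768 words. Declaring the buffers as `uint32_t` arrays is one simple way to get the required alignment.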
 

Hello!

Thanks for your reply.
I don't think you can, without knowing the rest of the program, conclude that DMA is the worst ever method.
I'm working without OS, so I schedule the tasks myself. That's a very predictable sound processing software,
so basically you get half and full buffers every millisecond, etc. Therefore, I can start my transfer, do something
while it's spinning, and get the data after this intermediate task. I'm aware (more or less) that DMA occupies
the bus at least partially and that it may slow other operation working with the same bus, but in the current
case, it worked pretty well.
And in the latest version, I found an even faster method: don't copy data. I used data copy in a first version because
it's easier to program, but the ultimate optimization consists in avoiding unnecessary copies.

Dora.
 

Hi Dora,

The thread is about copying (with no mention of "in the background") :). Of course in this case, as you probably have more than one buffer, it is better just to change the pointers. My post was about the general copy operation.
 

Hello!

Anyway thanks, I will also check the "core coupled code memory" to try to understand what I can do with it.
The ST processor gives a de-facto 2-buffer configuration for free (at least generated by CubeMX) , and as long as
you have enough time to process data of one of the buffers while the DMA fills the other from the microphones, then you're safe.
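
The no-copy, two-buffer scheme can be sketched as below. The two callback names are hypothetical stand-ins for whatever half-transfer and transfer-complete handlers the generated code provides; they are not actual HAL function names, and the buffer size assumes the 8-microphone case.

```c
#include <stddef.h>
#include <stdint.h>

#define HALF_LEN 3072u   /* one 1 ms chunk of PDM samples */

/* Circular DMA target: the DMA fills one half while the other is read. */
static uint8_t dma_buf[2u * HALF_LEN];

/* Points at the half that was just filled, or NULL if none is pending. */
static const uint8_t *volatile ready_half = NULL;

/* Hypothetical handler for the half-transfer interrupt. */
void on_half_complete(void) { ready_half = &dma_buf[0]; }

/* Hypothetical handler for the transfer-complete interrupt. */
void on_full_complete(void) { ready_half = &dma_buf[HALF_LEN]; }

/* Called from the main loop: returns the half to process, or NULL.
 * Only a pointer is handed over; no data is copied. On real hardware
 * the read and the clear should be protected against the interrupt. */
const uint8_t *take_ready_half(void)
{
    const uint8_t *p = ready_half;
    ready_half = NULL;
    return p;
}
```

The main loop calls `take_ready_half()` and, when it gets a non-NULL pointer, processes that half in place. This is safe only as long as the processing finishes before the DMA wraps around to the same half again, which is the condition stated in the post above.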

Dora.
 
