Welcome to EDAboard.com


[SOLVED] What is the best way to interface MCU (STM32F4 series) with FPGA board (Artix-7)

Status
Not open for further replies.

FlyingDutch

Hello,

I am looking for the best way (in terms of speed and simplicity) to connect an STM32 MCU to an FPGA board (Artix-7). The communication must be bidirectional. I am experimenting with a kind of coprocessor for the ARM Cortex-M4 core: the MCU sends data to the FPGA for processing and receives the processed data back when the FPGA is done. I tried to find similar projects on the internet and found these:

https://www.eetimes.com/document.asp?doc_id=1274649

https://community.st.com/s/question/0D50X00009XkfC2/stm32f4-to-fpga-via-fsmc

The approach from the first link seems best to me, but I am aware that many other kinds of communication are possible, like those mentioned in the second link: SPI (possibly with DMA), or maybe some fast serial transmission (other than SPI). The issue is that on both the MCU board and the FPGA board I have a limited number of I/O pins.

Maybe some more experienced colleagues could suggest something.

Regards
 

Parallel interfaces are often simpler to implement and enable random access, but at higher clock speeds (say >50 MHz) skew can become a problem.
Serial interfaces do not suffer from skew to the same degree (only one data lane), BUT they are slower (bytes/sec vs. clock MHz), and dramatically slower if random access is required.
You don't say what interfaces you have available on your STM32, nor the nature of your data (sequential vs. random access).
Bear in mind an FPGA can be designed to interface to practically anything so your MCU is the limiting device.
 
Hi,

I also recommend parallel interface for simple hardware and high speed.

I don't have much experience, but I'd try FSMC.

Klaus
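If the FSMC route is taken, the nice property is that the FPGA ends up memory-mapped: on the STM32F4, FSMC Bank 1 starts at 0x60000000, and plain reads/writes to that window become external bus cycles. A minimal sketch (the base pointer is a parameter precisely so the same helpers can first be exercised against an ordinary RAM buffer off-target):

```c
#include <stdint.h>

/* FSMC-style access sketch: once the FSMC is configured for an external
   16-bit SRAM-like device, the FPGA looks like plain memory. The base
   address is passed in, so off-target the helpers work on a RAM buffer. */
static void write_block(volatile uint16_t *base, const uint16_t *src, int n) {
    for (int i = 0; i < n; i++)
        base[i] = src[i];   /* each store is one 16-bit bus write */
}

static void read_block(volatile uint16_t *base, uint16_t *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = base[i];   /* each load is one 16-bit bus read */
}
```

On the target you would call e.g. `write_block((volatile uint16_t *)0x60000000u, data, 64);` after configuring the FSMC timings to match the FPGA's bus interface.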
 

I am working on a kind of vector floating-point coprocessor. I would like to use it for multiplying and adding matrices of floating-point numbers. Yes, I am aware that an FPU occupies many resources and reduces processing speed. I would like to use 16-bit floating-point numbers to reduce resource usage. See the link to the "Half-precision floating-point format":

https://en.wikipedia.org/wiki/Half-precision_floating-point_format

So the data sent to the FPGA will be batches (matrix contents); its nature would be rather sequential. The same goes for the data sent back from the FPGA to the MCU.
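For anyone trying this, the MCU side will need to pack IEEE-754 singles into the 16-bit half format before shipping them out. A minimal conversion sketch (round-toward-zero, normals only; subnormals are flushed to zero, which is a simplification, not what a full converter would do):

```c
#include <stdint.h>
#include <string.h>

/* Convert an IEEE-754 single to half precision (1 sign, 5 exponent,
   10 mantissa bits). Truncates the mantissa; tiny values flush to 0,
   overflow saturates to infinity. */
static uint16_t f32_to_f16(float f) {
    uint32_t b;
    memcpy(&b, &f, sizeof b);                         /* bit pattern of f */
    uint32_t sign = (b >> 16) & 0x8000u;
    int32_t  exp  = (int32_t)((b >> 23) & 0xFFu) - 127 + 15; /* rebias */
    uint32_t mant = (b >> 13) & 0x3FFu;               /* keep top 10 bits */

    if (exp <= 0)  return (uint16_t)sign;             /* underflow -> +/-0 */
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u); /* overflow -> inf */
    return (uint16_t)(sign | ((uint32_t)exp << 10) | mant);
}
```

E.g. 1.0f becomes 0x3C00 and -2.0f becomes 0xC000, matching the Wikipedia page linked above.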

Regards

BTW: I am also aware that there are SIMD coprocessors for ARM Cortex MCUs, but I am doing this for my own education :-?
 
Last edited:

Hi,

... you missed answering about the available interfaces.

You didn't clearly say ... but I guess the floating-point unit should be programmed into the FPGA. Is this correct?

In either case: you should give at least a clue about the data, the block (or frame) sizes, and the timing in both directions.

Klaus
 

Hello Klaus,

Regarding your first question: yes, you are right, the floating-point unit will be programmed into the FPGA. There will be many instances of 16-bit floating-point units (only two operations: multiply and add). They will be used in a module performing arithmetic on matrices (more precisely, on tensors of up to five dimensions). These tensors can be really huge, so I will send the data in portions sized for the "tensor engine" to process.

Regarding timing: I will be able to answer once my design is ready.

Regards
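To make the "portions" a bit more concrete, here is a tiny sketch of the chunking arithmetic (the 4x4 tile size is my assumption; ragged edges are zero-padded):

```c
/* Number of t x t tiles needed to cover a rows x cols matrix when the
   ragged edges are zero-padded (ceiling division in both dimensions). */
static int tiles_needed(int rows, int cols, int t) {
    return ((rows + t - 1) / t) * ((cols + t - 1) / t);
}
```

For example, an 8x8 matrix splits into 4 tiles of 4x4, while a 9x9 matrix needs 9 tiles because of the padded edges.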
 

Hi,

I assume you already know that your post again gives almost only vague information.
The only usable values are
* "16-bit floating-point units"
* "five dimensions"

"Many", "really huge", "portions" ... are not usable information.

*****

about timing I would be able to answer after my design will be ready.
Usually a development project starts with specifications, like timing ... and the amount of data ...

From my experience it is very useful to do this first.

Klaus
 

In this case, "really huge" means up to a few million elements. The first dimension of the tensor is usually the number of samples used to train the artificial neural network. Of course, the vector coprocessor will process the data in small chunks. I cannot give an exact data-portion size because I have not finished the implementation of the 16-bit floating-point unit (and I don't know its final size). I will be glad if I am able to multiply 64 half-precision floating-point numbers in one pass of the "vector unit"; if I manage more, even better. I am trying to build something similar to a "TensorFlow" GPU accelerator, but smaller and simpler.

Regards
 

Hello @KlausST,

now I can tell you more about the assumptions of my design. I found in Xilinx Vivado a free IP core with an FPU module implementation. It uses DSP slices and I suspect it is heavily optimized. What is very helpful for me is that it uses the AXI4 bus and can be parametrized to use half-precision floating-point numbers.

See the attached screenshots from Vivado:

FPU01.png
FPU02_.png
FPU03_.png

I made a simple project in Vivado, where I placed one instance of the half-precision FPU (operation: multiply), and then implemented the project on an Artix-7 FPGA.

After implementation, one such FPU occupies:
  • 82 LUTs
  • 1 DSP block
  • 161 FF

I would like to implement this project on the Artix-7 FPGA model XC7A100T-2FGG676I, which has 101,400 LEs and 240 DSP blocks. So I should manage to place at most 240 half-precision FPUs working in parallel (I will mainly be limited by the number of available DSP blocks). I will probably also implement a soft processor (most likely MicroBlaze) to easily control these FPU modules over the AXI bus. As ports for the top module (the interface to the STM32 MCU), I am going to implement two 16-bit-wide parallel buses working with pipelining (for bidirectional communication with the MCU).
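The per-FPU figures above make it easy to check which resource runs out first (assuming, as a simplification, that nothing else in the design competes for DSP blocks or LUTs):

```c
/* Which resource caps the number of parallel FPUs on the XC7A100T?
   Per-instance cost taken from the Vivado implementation run above:
   82 LUTs and 1 DSP block per half-precision multiplier. */
static int max_fpus(int avail_luts, int avail_dsps) {
    int by_lut = avail_luts / 82;   /* LUT-limited count */
    int by_dsp = avail_dsps / 1;    /* DSP-limited count */
    return (by_lut < by_dsp) ? by_lut : by_dsp;
}
```

With 101,400 LUTs and 240 DSP blocks, the LUT budget alone would allow over 1,200 instances, so the 240 DSP blocks are indeed the binding limit.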

As a proof of concept I would like to make a smaller version of the matrix multiplier (4 rows and 4 columns) on a smaller Artix-7 FPGA. Multiplication of matrices is very similar to tensor multiplication. The task of dividing big tensors into smaller data chunks for multiplication and addition on the FPGA will be done by a C program on the STM32 microcontroller. I haven't yet designed the data formats for communication between the MCU and FPGA, but I will do it soon.

If you have more questions about this project, please just ask ;)

Regards

BTW, for comparison: the "SIMD Neon" extension in the ARM architecture - see link:

https://developer.arm.com/architectures/instruction-sets/simd-isas/neon

allows (only in the ARMv8 architecture) 8 parallel operations on 16-bit floating-point numbers - see the citation:

8x16-bit*, 4x32-bit, 2x64-bit** floating-point operations


Hello,

A small example: let's assume we have two matrices with 2 rows and 2 columns each. In order to multiply these matrices we have to execute the following operations - see image:

Multiply01.JPG

As we can see, we first have to execute 8 multiplication operations (in parallel) and then 4 addition operations (in parallel). So, summing up, we need 12 half-precision floating-point operations. If we multiply two 4x4 matrices, we have to execute 64 multiply operations and 48 addition operations (if I am not mistaken).
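The operation counts above can be checked with a small reference model (plain C ints stand in for the half-precision values; the counters only verify how many multiplies and adds a 2x2 product takes):

```c
/* Reference 2x2 matrix product with operation counters: each output
   element needs 2 multiplies and 1 add, so 8 multiplies and 4 adds total. */
static int n_mul, n_add;

static void mat2x2_mul(const int a[2][2], const int b[2][2], int c[2][2]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            int p0 = a[i][0] * b[0][j]; n_mul++;
            int p1 = a[i][1] * b[1][j]; n_mul++;
            c[i][j] = p0 + p1;          n_add++;
        }
}
```

Running it once gives n_mul == 8 and n_add == 4, i.e. the 12 operations counted above; the same bookkeeping for 4x4 (4 multiplies and 3 adds per output element, 16 elements) gives the 64 multiplies and 48 adds.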
There are many small problems to solve in this design: latch registers will be needed on each FPU module, and the data flow must be managed. I am going to solve these either with a soft processor or with an FSM implemented on the FPGA (I haven't decided yet).

Regards
 
Last edited:
