  1. #1
    Full Member level 3
    Points: 1,189, Level: 7
    FlyingDutch's Avatar
    Join Date
    Dec 2017
    Location
    Bydgoszcz - Poland
    Posts
    154
    Helped
    21 / 21
    Points
    1,189
    Level
    7

    What is the best way to interface MCU (STM32F4 series) with FPGA board (Artix-7)

    Hello,

    I am looking for the best way (in terms of speed and simplicity) to connect an STM32 MCU to an FPGA board (Artix-7). The communication must be bidirectional. I am experimenting with a kind of coprocessor for the ARM Cortex-M4 core: the MCU sends data to the FPGA for processing and, once the FPGA has finished, receives the processed data back. I searched the internet for similar projects and found these:

    https://www.eetimes.com/document.asp?doc_id=1274649

    https://community.st.com/s/question/...-fpga-via-fsmc

    The approach from the first link seems best to me, but I am aware that there are many other possible ways to do this, such as those mentioned in the second link: SPI with DMA, or perhaps some fast serial link other than SPI. The issue is that I have a limited number of I/O pins on both the MCU board and the FPGA board.

    Maybe some of my more experienced colleagues could suggest something.

    Regards

  2. #2
    Member level 3
    Points: 804, Level: 6

    Join Date
    Jun 2017
    Location
    near the sea
    Posts
    60
    Helped
    13 / 13
    Points
    804
    Level
    6

    Parallel interfaces are often simpler to implement and allow random access, but at higher clock speeds (say >50 MHz) skew can become a problem.
    Serial interfaces suffer much less from skew (only one data lane) but are slower (bytes/s for a given clock in MHz), and dramatically slower if random access is required.
    You don't say which interfaces are available on your STM32, nor the nature of your data (sequential vs. random access).
    Bear in mind that an FPGA can be designed to interface to practically anything, so your MCU is the limiting device.
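
    To put rough numbers on the comparison above, here is a minimal sketch of peak transfer rates. It assumes an idealized parallel bus that moves one word per clock and a serial link that moves one bit per lane per clock. The function names and the 40 MHz / 16-bit figures are hypothetical, not taken from any datasheet, and protocol overhead (address phases, wait states, command bytes) is ignored.

```python
def parallel_throughput_bytes_per_s(clock_hz: float, bus_width_bits: int) -> float:
    """Peak rate of an idealized parallel bus: one full word per clock."""
    return clock_hz * bus_width_bits / 8


def serial_throughput_bytes_per_s(clock_hz: float, lanes: int = 1) -> float:
    """Peak rate of an idealized serial link: one bit per lane per clock."""
    return clock_hz * lanes / 8


# Hypothetical figures, chosen only for illustration.
CLOCK_HZ = 40e6
BUS_WIDTH_BITS = 16

parallel = parallel_throughput_bytes_per_s(CLOCK_HZ, BUS_WIDTH_BITS)
serial = serial_throughput_bytes_per_s(CLOCK_HZ)

print(f"parallel: {parallel / 1e6:.1f} MB/s")
print(f"serial:   {serial / 1e6:.1f} MB/s")
print(f"ratio:    {parallel / serial:.0f}x")
```

    At the same clock, the 16-bit bus moves 16 times as many bytes per second as a single serial lane. Skew decides how far the parallel clock can actually be pushed.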



  3. #3
    Super Moderator
    Points: 79,217, Level: 68
    Achievements:
    7 years registered
    Awards:
    Most Frequent Poster 3rd Helpful Member

    Join Date
    Apr 2014
    Posts
    16,080
    Helped
    3642 / 3642
    Points
    79,217
    Level
    68

    Hi,

    I also recommend a parallel interface for simple hardware and high speed.

    I don't have much experience, but I'd try FSMC.

    Klaus
    Please don't contact me via PM, because there is no time to respond to them. No friend requests. Thank you.


  4. #4
    FlyingDutch

    Quote originally posted by fourtytwo:
    You don't say which interfaces are available on your STM32, nor the nature of your data (sequential vs. random access).
    Bear in mind that an FPGA can be designed to interface to practically anything, so your MCU is the limiting device.

    I am working on a kind of vector floating-point coprocessor. I would like to use it for multiplying and adding matrices of floating-point numbers. Yes, I am aware that a floating-point unit consumes a lot of resources and reduces processing speed. I would like to use 16-bit floating-point numbers to reduce resource usage. See the link on the "Half-precision floating-point format":

    https://en.wikipedia.org/wiki/Half-p...g-point_format

    So the data sent to the FPGA will be batches of data (matrix contents). Its nature will be rather sequential. The same applies to the data sent back from the FPGA to the MCU.
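
    A batch of half-precision values can be sketched as a flat byte stream. The snippet below is only an illustration, using the `e` (IEEE 754 binary16) format of Python's standard `struct` module. The helper names, the row-major flattening and the little-endian byte order are assumptions, not a format defined in this thread.

```python
import struct


def pack_half_batch(values):
    """Encode floats as consecutive little-endian binary16 words (assumed order)."""
    return struct.pack(f"<{len(values)}e", *values)


def unpack_half_batch(payload):
    """Decode a byte stream of little-endian binary16 words back into floats."""
    count = len(payload) // 2  # two bytes per half-precision value
    return list(struct.unpack(f"<{count}e", payload))


# Hypothetical 2x2 matrix, flattened row by row before transfer.
matrix = [[1.0, 2.0], [0.5, -3.25]]
flat = [x for row in matrix for x in row]

payload = pack_half_batch(flat)
print(len(payload), "bytes:", payload.hex())

# These values are exactly representable in binary16, so they round-trip.
print(unpack_half_batch(payload))

# Most values are not: binary16 keeps only about three decimal digits.
print(unpack_half_batch(pack_half_batch([0.1]))[0])
```

    Each value takes two bytes, so a 16-bit-wide bus would carry one matrix element per transfer.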

    Regards

    BTW: I am also aware that there are SIMD coprocessors for ARM Cortex MCUs, but I am doing this for my own education.
    Last edited by FlyingDutch; 1st December 2019 at 13:07.



  5. #5
    KlausST

    Hi,

    ... you didn't answer the question about the available interfaces.

    You didn't say so clearly ... but I guess the floating-point unit is to be implemented in the FPGA. Is this correct?

    In either case, you should give at least a rough idea of the data, the block (or frame) sizes, and the timing in both directions.

    Klaus



  6. #6
    FlyingDutch

    Quote originally posted by KlausST:
    Hi,

    ... you didn't answer the question about the available interfaces.

    1) You didn't say so clearly ... but I guess the floating-point unit is to be implemented in the FPGA. Is this correct?

    2) In either case, you should give at least a rough idea of the data, the block (or frame) sizes, and the timing in both directions.

    Klaus

    Hello Klaus,

    Regarding 1): yes, you are right. There will be many instances of 16-bit floating-point units (with only two operations: multiplication and addition). They will be used in a module that performs arithmetic operations on matrices (more precisely, operations on tensors of up to five dimensions). These tensors can be really huge, so I will send the data in portions sized to fit the "tensor engine" for processing.

    Regarding 2): I will be able to say more about timing once my design is ready.

    Regards



  7. #7
    KlausST

    Hi,

    I assume you already know that your post again gives almost only vague information.
    The only usable values are:
    * "16 bit floating-point units"
    * "five dimensions"

    "many", "really huge", "portions" ... are useless information.

    *****

    Quote: "about timing I would be able to answer after my design will be ready."
    Usually a development project starts with specifications, such as timing ... and the amount of data ...

    In my experience it is very useful to do this first.

    Klaus



  8. #8
    FlyingDutch

    "Really huge" means in this case up to a few million. The first dimension of the tensor is usually the number of samples used for training an artificial neural network. Of course, the vector coprocessor will process the data in small chunks. I cannot give the exact size of a data portion because I have not finished implementing the 16-bit floating-point unit (and I don't know its final size). I will be glad if I am able to multiply 64 half-precision floating-point numbers in one pass of the "vector unit". If I manage more, I will be very glad. I am trying to build something similar to a "Tensor-flow" GPU accelerator, but smaller and simpler.

    Regards



  9. #9
    FlyingDutch

    Hello @KlausST,

    Now I can tell you more about the assumptions behind my design. In Xilinx Vivado I found a free IP core with an FPU implementation. This implementation uses DSP slices and I suspect it is heavily optimized. What is very helpful for me is that it uses the AXI-4 bus and can be configured for half-precision floating-point numbers.

    See the attached screenshots from Vivado:

    [Attached screenshots: FPU01.png, FPU02_.png, FPU03_.png]

    I made a simple project in Vivado in which I placed one instance of the half-precision FPU (operation: multiply) and then implemented the project on an Artix-7 FPGA.

    After implementation, one such FPU occupies:
    • 82 LUTs
    • 1 DSP block
    • 161 FF


    I would like to implement this project on an Artix-7 FPGA, model XC7A100T-2FGG676I, which has 101400 LEs and 240 DSP blocks. So I should be able to place at most 240 half-precision FPUs working in parallel (I will be limited mainly by the number of available DSP blocks). I will probably also implement a soft processor (most likely MicroBlaze) to make it easy to control these FPU modules over the AXI bus. As the ports of the top module (the interface to the STM32 MCU) I am going to implement two 16-bit-wide parallel buses with pipelining, for bidirectional communication with the MCU.
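
    The "limited by DSP blocks" estimate can be checked with a small resource-budget sketch. The per-FPU costs and the 240 DSP blocks come from the figures above. The LUT and flip-flop totals for the XC7A100T (63,400 and 126,800) are assumptions that should be verified against the Xilinx 7 Series overview datasheet. The sketch also ignores what the soft processor and the MCU interface would consume.

```python
# Cost of one half-precision multiplier, as reported after implementation.
FPU_COST = {"lut": 82, "dsp": 1, "ff": 161}

# Device budget: "dsp" is from the post; "lut" and "ff" are assumed values
# for the XC7A100T and should be verified against the datasheet.
DEVICE_BUDGET = {"lut": 63_400, "dsp": 240, "ff": 126_800}


def max_instances(cost, budget):
    """Return (instance count, limiting resource, per-resource limits)."""
    limits = {name: budget[name] // cost[name] for name in cost}
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck, limits


count, bottleneck, limits = max_instances(FPU_COST, DEVICE_BUDGET)
print("per-resource limits:", limits)
print(f"at most {count} FPUs, limited by '{bottleneck}'")
```

    Under these assumptions the DSP blocks run out long before the LUTs or flip-flops do, which matches the estimate of 240 parallel FPUs as an upper bound.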

    As a proof of concept I would like to build a smaller version of the matrix multiplier (4 rows and 4 columns) on a smaller Artix-7 FPGA. Matrix multiplication is very similar to tensor multiplication. Dividing big tensors into smaller data chunks for multiplication and addition on the FPGA will be done by a C program on the STM32 microcontroller. I haven't yet designed the data formats for communication between the MCU and the FPGA, but I will do it soon.
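
    The chunking step described above can be sketched as a blocked (tiled) matrix multiplication. This is an illustration in Python for brevity, although the plan is to do this in C on the STM32. All names are hypothetical, `matmul_tile` is a software stand-in for the 4x4 FPGA multiplier, and the sketch assumes square matrices whose size is a multiple of the tile edge.

```python
TILE = 4  # tile edge, matching the 4x4 proof-of-concept multiplier


def matmul_tile(a, b):
    """Software stand-in for the FPGA: multiply two TILE x TILE blocks."""
    return [[sum(a[i][k] * b[k][j] for k in range(TILE)) for j in range(TILE)]
            for i in range(TILE)]


def get_tile(m, row, col):
    """Cut the TILE x TILE block whose top-left corner is (row, col)."""
    return [r[col:col + TILE] for r in m[row:row + TILE]]


def tiled_matmul(a, b):
    """Multiply square matrices by sending one pair of tiles at a time."""
    n = len(a)  # assumed to be a multiple of TILE
    c = [[0.0] * n for _ in range(n)]
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            for k in range(0, n, TILE):
                # One round trip to the coprocessor per pair of tiles.
                part = matmul_tile(get_tile(a, i, k), get_tile(b, k, j))
                # The host accumulates the partial products.
                for r in range(TILE):
                    for s in range(TILE):
                        c[i + r][j + s] += part[r][s]
    return c
```

    In this arrangement the host only slices tiles and accumulates partial results, while every multiplication happens inside the tile multiplier.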

    If you have more questions about this project, please just ask ;)

    Regards

    BTW, for comparison, see the "SIMD Neon" extension of the ARM architecture:

    https://developer.arm.com/architectu...simd-isas/neon

    It allows (only in the ARMv8 architecture) 8 parallel operations on 16-bit floating-point numbers; see the quotation:

    8x16-bit*, 4x32-bit, 2x64-bit** floating-point operations
    - - - Updated - - -

    Hello,

    A small example: let's assume that we have two matrices with 2 rows and 2 columns each. To multiply these matrices we have to execute the following operations (see image):

    [Attached image: Multiply01.JPG]

    As we can see, we first have to execute 8 multiplications (in parallel) and then 4 additions (in parallel). So, in total, we need 12 half-precision floating-point operations. If we multiply two matrices of 4 rows and 4 columns each, we have to execute 64 multiplications and 48 additions (if I am not mistaken).
    There are many small problems to solve in this design. Latch registers will be needed on each FPU module, and the data flow has to be managed. I am going to solve these problems either with a soft processor or with an FSM implemented in the FPGA (I have not decided yet).
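
    The operation counts above follow from the schoolbook method: each of the n*n output elements needs n multiplications and n-1 additions. A minimal sketch to check the numbers (the function name is hypothetical):

```python
def matmul_op_counts(n):
    """Multiplications and additions for a schoolbook n x n matrix product."""
    multiplications = n ** 3       # n products for each of the n*n outputs
    additions = n * n * (n - 1)    # n-1 sums for each of the n*n outputs
    return multiplications, additions


for size in (2, 4):
    mults, adds = matmul_op_counts(size)
    print(f"{size}x{size}: {mults} multiplications, {adds} additions, "
          f"{mults + adds} operations in total")
```

    This gives 8 and 4 for the 2x2 case and confirms 64 and 48 for the 4x4 case.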

    Regards
    Last edited by FlyingDutch; 2nd December 2019 at 11:18.


