Hello @KlausST,
now I can tell you more about the assumptions behind my design. I found a free IP core with an FPU implementation in Xilinx Vivado. This implementation uses DSP slices and I suspect it is heavily optimized. What is very helpful for me is that it uses the AXI4 bus and can be parametrized for half-precision floating-point numbers.
See the attached screenshots from Vivado:
I made a simple project in Vivado, where I placed one instance of the "Half-precision floating-point" FPU (operation: multiply) and then implemented the project on an Artix-7 FPGA.
After implementation, such an FPU occupies:
- 82 LUTs
- 1 DSP block
- 161 FF
I would like to implement this project on an Artix-7 FPGA, model XC7A100T-2FGG676I, which has 101,440 logic cells and 240 DSP blocks. So I should manage to place at most 240 half-precision FPUs working in parallel (I will mainly be limited by the number of available DSP blocks). I will probably also implement a soft processor (most likely MicroBlaze) to easily control these FPU modules over the AXI bus. As ports of the top module (the interface to the STM32 MCU), I am going to implement two 16-bit-wide parallel buses working with pipelining (for bidirectional communication with the MCU).
As a proof of concept I would like to make a smaller version of the matrix multiplier (4 rows and 4 columns) on a smaller Artix-7 FPGA model. Matrix multiplication is very similar to tensor multiplication. The task of dividing big tensors into smaller data chunks for multiplication and addition on the FPGA will be done by a C program on the STM32 microcontroller. I haven't yet designed the data formats for communication between the MCU and the FPGA, but I will do that soon.
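That chunking step on the MCU side can be sketched as a blocked (tiled) matrix multiplication in C. The function names and the 4x4 tile size are my assumptions; the inner tile product here stands in for the work that would actually be offloaded to the FPGA:

```c
#include <stddef.h>

#define B 4  /* tile size matching the planned 4x4 FPGA multiplier */

/* Multiply one BxB tile pair and accumulate into a BxB tile of c.
   On the real system this routine would instead send the two input
   tiles to the FPGA over the parallel bus and read the result back. */
static void block_mac(const float *a, const float *b, float *c, size_t n,
                      size_t i0, size_t j0, size_t k0) {
    for (size_t i = 0; i < B; i++)
        for (size_t j = 0; j < B; j++)
            for (size_t k = 0; k < B; k++)
                c[(i0 + i) * n + (j0 + j)] +=
                    a[(i0 + i) * n + (k0 + k)] * b[(k0 + k) * n + (j0 + j)];
}

/* n x n matrix multiply (n divisible by B), processed tile by tile. */
void matmul_blocked(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n * n; i++) c[i] = 0.0f;
    for (size_t i0 = 0; i0 < n; i0 += B)
        for (size_t j0 = 0; j0 < n; j0 += B)
            for (size_t k0 = 0; k0 < n; k0 += B)
                block_mac(a, b, c, n, i0, j0, k0);
}
```

The same loop structure extends to tensor contractions once the tensors are flattened into matrix slices, which is why the 4x4 proof of concept is a reasonable first step.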
If you have more questions about this project, please just ask.
Regards
BTW, for comparison: the NEON SIMD extension in the ARM architecture - see link:
https://developer.arm.com/architectures/instruction-sets/simd-isas/neon
allows (only in the ARMv8 architecture) 8 parallel operations on 16-bit floating-point numbers - see citation:
8x16-bit*, 4x32-bit, 2x64-bit** floating-point operations
- - - Updated - - -
Hello,
a small example: let's assume that we have two matrices with 2 rows and 2 columns each. In order to multiply these matrices we have to execute the following operations - see image:
As we can see, we have to execute 8 multiplication operations first (in parallel) and then 4 addition operations (in parallel). So, summing up, we need 12 half-precision floating-point operations. If we are going to multiply two matrices (each 4 rows and 4 columns), we have to execute 64 multiplication operations and 48 addition operations (if I am not mistaken).
There are many small problems to solve in this design. There will be a need for latch registers on each FPU module, and the data flow will have to be handled. I am going to solve these problems either with a soft processor or with an FSM implemented in the FPGA (I haven't decided yet).
Regards