# Look up table for a given function

Status
Not open for further replies.

#### ctzof

##### Full Member level 3
Hello,

I am rather new to FPGA design and I am having some difficulties with a design. Here is the problem:

I want to build a look-up table to generate the outputs of a specific function. I have a temperature sensor and, depending on the measured temperature, I want to produce an offset which is going to be added to the output signal. The offset is straightforward and follows a specific curve. The temperature value is a 10-bit unsigned integer, which means there are 1024 possible offset values. Each entry in the look-up table is 10 bits wide, so I need 1024 x 10 bits = 10 kbit of storage. The thing is that I want to use the same look-up table many times in my project and I am rather limited in terms of resources. What I was thinking is to use fewer than 1024 points and somehow compute the output value when the input value falls somewhere in between (I am not really sure, but I think this is called linear interpolation). Is there a possible way to do that in an FPGA? Any ideas?
My computations are made with unsigned numbers.

If the offset is linear (which I'm assuming at this point) then just calculate all of them not just some of them. If the offset isn't a linear function then yes you use interpolation and that means you use math.

Thanks for the answer. Unfortunately it is not linear. It is similar to the function y = e^-x. Can you give me some guidance or a rough idea of how to do this?

Do what is described here. Just remember to scale after you've finished everything, you don't want to lose any precision until you've finished calculating everything. (i.e. don't throw any bits away until the end result)
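As a rough illustration of the "scale last" advice, here is a small integer-only Python model of one interpolation step. The segment endpoints (1000 and 1010) and the 4-bit fraction are made-up values, not taken from the thread; the point is only that shifting before multiplying discards bits that the final result needs.

```python
FRAC_BITS = 4  # assumption: the low 4 input bits give the position inside a segment


def interp_late_scaling(y0, y1, frac):
    """Multiply first, shift once at the very end (all bits kept until the last step)."""
    return y0 + (((y1 - y0) * frac) >> FRAC_BITS)


def interp_early_scaling(y0, y1, frac):
    """Shift the slope first, then multiply: the slope's low bits are already lost."""
    return y0 + (((y1 - y0) >> FRAC_BITS) * frac)


# Halfway (frac = 8 of 16) between two hypothetical table entries 1000 and 1010.
late = interp_late_scaling(1000, 1010, 8)    # 1005, the expected midpoint
early = interp_early_scaling(1000, 1010, 8)  # 1000, the slope was truncated to 0
```

In hardware terms this means keeping the full-width product of the multiplier and dropping the fractional bits only when the final sum is formed.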

For LUTs in resource constrained designs, you might want to either move away from the LUT, run the LUT at a higher clock rate, or time share the LUT.

For a Xilinx BRAM, you get 2 reads per cycle per BRAM. Thus a 10kBit LUT (BRAM18) with 10 reads will use 5 BRAM, and be duplicated 5 times. The BRAMs _can_ run up near 500-600MHz. If your normal design runs at 100MHz, this means you could provide 10 input addresses per 100MHz cycle and get 10 output values per 100MHz cycle. You would need a small amount of logic running at 500MHz, and would need to have appropriate pipelining considerations to ensure the high-speed logic works out.
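The arithmetic in this paragraph can be checked with a few lines of Python. The function name and parameters are hypothetical; the only inputs are the figures quoted above (2 read ports per BRAM, 10 reads per design cycle, and a BRAM clock 5 times faster than the design clock).

```python
import math


def brams_needed(reads_per_cycle, ports=2, clock_ratio=1):
    """Copies of the table needed to serve `reads_per_cycle` lookups per design clock.

    ports:       read ports per BRAM (2 for a dual-port block)
    clock_ratio: BRAM clock divided by design clock (1 = same clock)
    """
    return math.ceil(reads_per_cycle / (ports * clock_ratio))


same_clock = brams_needed(10)                  # 5 copies of the table
fast_clock = brams_needed(10, clock_ratio=5)   # 1 BRAM clocked at 5x the design clock
```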

If many things use the LUT, but only infrequently, you might look into some form of arbitration to provide access to a reasonable number of LUTs, but with variable latency. This is a similar idea -- serialize access to the LUT -- but it doesn't use a high speed clock. Routing and arbitration logic might become an issue.

(if you can have 512-1024 cycles of latency, you can also cycle through the entire LUT and broadcast the result.)
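The broadcast idea can be sketched as a behavioural Python model: a free-running counter sweeps the single ROM, and every consumer latches the data when the broadcast address equals its own request. The function name and the example table are assumptions for illustration only.

```python
def broadcast_lookup(table, requests):
    """Model of one shared ROM swept by an address counter.

    Each loop iteration stands for one clock cycle: the ROM outputs table[addr],
    and every consumer whose requested address matches latches that value.
    """
    results = [None] * len(requests)
    for addr, data in enumerate(table):      # one ROM read per clock cycle
        for i, req in enumerate(requests):   # in hardware: one comparator per consumer
            if req == addr:
                results[i] = data
    return results


# Hypothetical 1024-entry table and five consumers with different requests.
example_table = [1023 - a for a in range(1024)]
example_results = broadcast_lookup(example_table, [0, 512, 1023, 512, 7])
```

Every consumer gets its answer within one full sweep, so the worst-case latency equals the table length in clock cycles, independent of the number of consumers.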

--edit: Linear interpolation might help, but it is hard to say. You would need to shrink the LUT by at least a factor of two to make up for the two reads that each lookup then requires.

Hi, Thanks for the answer. Is there a different approach to my problem in your opinion rather than LUT?

'Yes' is the short answer. But if you want more detailed answers, you're going to have to define what you're doing. So far, you haven't adequately defined function, performance or constraints. Without that info, you're only going to get speculative responses.
- Function: Exactly what function are you trying to implement and over what input domain?
- Performance: How quickly do you need things? One per clock? Multiple clock cycles? Etc.
- Constraints: Is the FPGA or the FPGA family or maybe even just the supplier chosen? If so, which one? Are there resources that are likely limited because of other stuff that you have going on in your design? For example, maybe the rest of your design is pretty much locked in and you have only one spare LUT.

Kevin Jennings

So the function looks like the image above. For larger temperature values I want less offset. I have chosen my FPGA: it is a Microsemi ProASIC3/E A3P250. As for the frequency parameters, I am not sure at the moment. The main clock is going to be 10 MHz, so the operating frequency is not very high. The thing is that I need 5 of these LUTs, and some of them are 14x14 bits, which translates to 229 kbit, which is quite a lot of space I think.

How quickly do you need things? One per clock? Multiple clock cycles?

In most applications, temperature changes quite slowly, so I guess speed should not be a problem.

You've given a graph, but is this graph based on a relationship (e.g. a mathematical equation) or is it created based off of empirical data?

If it's measured data, you may have to use a piecewise-linear or curve-fit algorithm to reduce the required table size and calculate the intermediate values between table points. If it is derived from an equation, then just compute the offsets. Either way, it doesn't seem like you need extremely high-speed results, so time-sharing the resource is probably feasible.

And if you need performance you could always pipeline the algorithm(s) and stuff in 5 inputs (1/clock) and get 5 outputs after some amount of latency.

Designing an approximation function, e.g. table interpolation, starts with a specification of the ideal function and acceptable error amount. Having this, you can figure out how many linear segments are necessary.
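A minimal Python sketch of that design step follows, under loud assumptions: the `ideal` function is a hypothetical stand-in (the thread only says the curve is "similar to e^-x"), segments are equal-width powers of two, and the tolerance of 2 LSB is arbitrary. It searches for the smallest segment count whose worst-case interpolation error stays within the tolerance.

```python
import math

IN_BITS = 10  # 10-bit temperature code, as in the thread


def ideal(x):
    # Hypothetical stand-in for the real calibration curve (assumption, not from the thread).
    return round(1023 * math.exp(-x / 256))


def max_interp_error(seg_bits):
    """Worst-case |ideal - interpolated| over all inputs, using 2**seg_bits equal segments."""
    frac_bits = IN_BITS - seg_bits
    step = 1 << frac_bits
    # One extra entry so the last segment has a right-hand endpoint.
    table = [ideal(i * step) for i in range((1 << seg_bits) + 1)]
    worst = 0
    for x in range(1 << IN_BITS):
        idx, frac = x >> frac_bits, x & (step - 1)
        y = table[idx] + (((table[idx + 1] - table[idx]) * frac) >> frac_bits)
        worst = max(worst, abs(ideal(x) - y))
    return worst


def segments_needed(tolerance):
    """Smallest power-of-two segment count whose worst-case error is within tolerance."""
    for seg_bits in range(IN_BITS + 1):
        if max_interp_error(seg_bits) <= tolerance:
            return 1 << seg_bits
    raise ValueError("tolerance cannot be met")


n_segments = segments_needed(2)  # acceptable error of 2 LSB is an arbitrary example
```

With measured calibration data, `ideal` would be replaced by the full measured table, and the same search shows how far that table can be thinned out.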

You mentioned that the data is obtained from a temperature measurement, so we would expect a rather low data rate (e.g. < 1 kS/s). "use the same look up table many times" should be possible by sharing the function block in a sequential multiplex scheme.

If you're only receiving 10 bits from the ADC, it's probably best to use every value, so steer away from the interpolation method. As FVM mentioned, temperature data doesn't have to arrive at a high rate, so any FPGA should be able to handle the problem.

The best solution would be to implement an exponential function based on input value, which I don't know how to do but I'd like to know now.

Thanks for all the answers. As I said, I am rather new to Verilog and I don't have much experience with coding. Is there a reference on how I can share the LUT block many times in the design? Also, some LUTs in my design are 14x14 bit = 229 kbit and I have only 36 kbit of RAM in my FPGA, so I probably have to stick with interpolation.

The data in the curve are actually measured data or, more accurately, precomputed offset data points, so the curve doesn't follow any specific equation.

What exactly do you mean by share? Do you want to reuse the remaining address lines of the LUT block, or reuse the LUT block itself?
What is the length of your LUT, and what is the size of the data written into it?

Let me repeat what has been stated before...

To share a memory, you either need a multi-port memory (FPGAs support dual-port memories) or you share it virtually by time-division multiplexing, which splits the memory's bandwidth between the users.

Which way you go depends on how often you have to access the memory in a given amount of time.
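The time-division option can be sketched as a cycle-level Python model: a channel counter selects which input drives the single LUT on each clock, and the result is stored in that channel's own output register. The class name, the five channels and the example table are assumptions for illustration.

```python
class SharedLut:
    """One ROM read per clock cycle, channels served in round-robin order."""

    def __init__(self, table, n_channels):
        self.table = table
        self.outputs = [0] * n_channels  # one output register per channel
        self.sel = 0                     # channel counter, i.e. the multiplexer select

    def clock(self, addresses):
        """Advance one clock cycle; `addresses` holds every channel's current input."""
        self.outputs[self.sel] = self.table[addresses[self.sel]]
        self.sel = (self.sel + 1) % len(self.outputs)


# Five channels sharing one hypothetical 16-entry table.
lut = SharedLut([2 * a for a in range(16)], n_channels=5)
for _ in range(5):
    lut.clock([1, 2, 3, 4, 5])
shared_outputs = list(lut.outputs)
```

With a 10 MHz clock and 5 channels, each output register would be refreshed at 2 MHz, far above the rate at which a temperature reading changes.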

FYI, your real question isn't about not knowing how to code this in Verilog, it's not understanding how to architect a design to do what you want within the context of the resources available in an FPGA. To help you with that will require a detailed specification on the data rates and clock frequencies of the design along with quantity of LUTs required.

I understand what you are saying. The architecture and data rates are not yet specified, but the clock frequency is going to be low (6-10 MHz), so I may return later with exact specifications. The final FPGA has been chosen: it is a Microsemi ProASIC3 A3P250.
https://www.microsemi.com/products/fpga-soc/fpga/proasic3-e#product-tables

As I said in a previous post, what really concerns me is that some of the tables are 14x14 bit (229 kbit) and the available memory of this FPGA is 36 kbit, which means that not even a single table fits. That's why I want to use the interpolation approach.
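For reference, the storage figures can be reproduced with a short Python calculation. The helper names are hypothetical, and the 64-segment interpolated table is only an example size; whether 64 segments are accurate enough depends on the actual curve.

```python
def full_table_bits(in_bits, out_bits):
    """Storage for a direct look-up table with one entry per input code."""
    return (1 << in_bits) * out_bits


def interp_table_bits(seg_bits, out_bits):
    """Storage for an interpolation table: 2**seg_bits segments need one extra endpoint."""
    return ((1 << seg_bits) + 1) * out_bits


direct_14 = full_table_bits(14, 14)     # 229376 bits, the "229 kbit" figure
direct_10 = full_table_bits(10, 10)     # 10240 bits, the "10 kbit" figure
interp_14 = interp_table_bits(6, 14)    # 910 bits for an assumed 64-segment table
```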

The problem of sharing the non-linear function block between multiple channels is independent of using a direct look-up table or linear interpolation. There are of course several relations:

- using interpolation can reduce the table size by a large factor and allows separate instances for each data channel.
- table interpolation may need to access two successive entries to calculate the segment slope. This can be done either sequentially in two clock cycles or by using both ports of a dual-port ROM, or it can be avoided by keeping a separate slope table.
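The two read strategies from the list above can be compared in a small integer-only Python model. The 4/6 bit split of the 10-bit input and the example curve are assumptions; both variants return the same value, the difference is only whether the slope is computed from two reads or fetched from its own table.

```python
SEG_BITS, FRAC_BITS = 4, 6  # assumed split of a 10-bit input: 16 segments of 64 codes
FRAC_MASK = (1 << FRAC_BITS) - 1


def build_tables(f):
    """Sample f at the segment endpoints and precompute the per-segment differences."""
    step = 1 << FRAC_BITS
    base = [f(i * step) for i in range((1 << SEG_BITS) + 1)]
    slope = [base[i + 1] - base[i] for i in range(1 << SEG_BITS)]
    return base, slope


def lookup_two_reads(base, x):
    """Read both endpoints (two ROM ports, or two clock cycles) and form the slope on the fly."""
    idx, frac = x >> FRAC_BITS, x & FRAC_MASK
    y0, y1 = base[idx], base[idx + 1]
    return y0 + (((y1 - y0) * frac) >> FRAC_BITS)


def lookup_slope_table(base, slope, x):
    """Read one entry from each table: the base value and the precomputed slope."""
    idx, frac = x >> FRAC_BITS, x & FRAC_MASK
    return base[idx] + ((slope[idx] * frac) >> FRAC_BITS)


def example_curve(x):
    # Hypothetical decreasing curve, for illustration only.
    return 1023 - (x * x >> 11)


base_table, slope_table = build_tables(example_curve)
```

The separate slope table costs extra storage but needs only one read per table, which suits a single-port memory; the two-read variant stores less and relies on the dual-port ROM or a second clock cycle.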

A 14-bit variable for the temperature value means that you are working with a maximum resolution of 1/16,384 (0.006%), which is surely unreachable for practical meters, so you should optimize the use of the core's resources by proper scaling. Another point is that if you store the entire table in 16k words, the nonlinear shape means a lot of addresses would hold almost the same value. You should consider performing this task with an algebraic expression instead of a LUT.

I tend to disagree. Polynomial interpolation is an option if the function is explicitly defined this way, e.g. Pt100 or thermocouple linearisation. But the calculation is rather inconvenient with integer or fixed-point arithmetic. Piecewise linear interpolation, in contrast, is simple and straightforward, and it can be fitted much more easily to arbitrary calibration functions.

One of these tables (the 10x10 bit one) produces the temperature offset. The other tables (14x14 bit) are for different purposes.
