permute
roub: clocked processes each describe hardware. Functions can be used, but complex functions rarely infer an area/performance-efficient design. You seem to be intent on an example of x % y, where both x and y are user inputs. If this is written as a function, it might infer (for the 2-bit case):
output = x when ( y > x) else x - y when ( 2y > x) else x - 2y when (3y > x) else x - 3y.
(The number of terms grows exponentially with the bit width n.) For larger x and y, intermediate terms would be used, e.g. for the 3-bit case:
z1 = x when ( 4y > x) else x-4y;
z2 = z1 when (2y > z1) else z1-2y;
z3 = z2 when (y > z2) else z2-y;
output = z3;
The above has no memory elements -- thus it will be flattened to:
output = (( x when ( 4y > x) else x-4y) when (2y > ( x when ( 4y > x) else x-4y)) else (( x when ( 4y > x) else x-4y)-2y)) when (y > ((x when ( 4y > x) else x-4y) when (2y > (x when ( 4y > x) else x-4y)) else ((x when ( 4y > x) else x-4y)-2y))) else ((( x when ( 4y > x) else x-4y) when (2y > ( x when ( 4y > x) else x-4y)) else (( x when ( 4y > x) else x-4y)-2y))-y);
(I'm pretty sure the above isn't exactly valid VHDL.) The point is that when a function containing loops is used in a process, the synthesis tool can end up trying to optimize one large, complex expression.
The synthesis tool would likely choose to infer 3 subtractors (as in the z1/z2/z3 representation) and reuse the intermediate results.
For the 8-bit case, the synthesizer would likely infer 8 subtractors, with a longest path that must traverse through all 8. For 32 bits, it would infer 32 subtractors, and the longest path would traverse all 32. This is a very long path, and it limits the clock rate to fairly low values. The latency (in clock cycles) is minimized for this method, but it uses a lot of area and only allows a slow clock rate. This implementation is chosen because the code is written to allow only 1 clock cycle for the computation of x % y, and x % y requires a complex circuit to complete in 1 cycle.
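To make the chain of conditional subtracts concrete, here is a small software model in Python. It is only a behavioural sketch, not synthesizable code; the name `mod_chain` is made up for illustration, and it assumes unsigned n-bit inputs with y nonzero. Each loop iteration corresponds to one subtractor-plus-mux stage, so n bits means n stages back to back on the longest path.

```python
def mod_chain(x: int, y: int, n: int) -> int:
    """Model of x % y for n-bit unsigned inputs using n conditional subtracts.

    Hypothetical helper for illustration; assumes 0 <= x < 2**n and y >= 1.
    """
    if y == 0:
        raise ZeroDivisionError("y must be nonzero")
    z = x
    # Stage k compares against y * 2**k, largest shift first (4y, 2y, y for n = 3).
    for k in range(n - 1, -1, -1):
        d = y << k
        if z >= d:          # same choice as "z when (d > z) else z - d"
            z = z - d
    return z


if __name__ == "__main__":
    print(mod_chain(7, 3, 3))      # 3-bit case, matches the z1/z2/z3 chain
    print(mod_chain(200, 7, 8))    # 8-bit case, 8 stages
```

In hardware every one of those iterations is unrolled into its own subtractor, which is where the area and the long combinational path come from.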
If area is a concern, the above could be broken apart, possibly performing 1 subtraction stage per clock cycle. For the 32-bit case, the throughput is then limited to one valid output every 32 cycles. The longest path uses only 1 subtract, so the clock rate can be high. This method uses little area, but the throughput is limited. In this method, a state machine is used to determine when the output is valid, as well as to select the input to the subtractor.
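A rough Python model of that one-subtract-per-clock approach might look like the following. Again this is a sketch under assumptions, not HDL: `SerialMod`, `start`, and `clock` are hypothetical names, a simple down-counter stands in for the state machine, and y is assumed nonzero.

```python
class SerialMod:
    """Model of a single shared subtractor, one conditional subtract per clock.

    Hypothetical class for illustration; the counter acts as the state machine.
    """

    def __init__(self, n: int):
        self.n = n
        self.count = 0       # stages still to run; 0 means idle
        self.z = 0           # working remainder register
        self.y = 0
        self.valid = False   # high once the result in z is final

    def start(self, x: int, y: int) -> None:
        if y == 0:
            raise ZeroDivisionError("y must be nonzero")
        # Select the new input x instead of the fed-back remainder.
        self.z, self.y = x, y
        self.count = self.n
        self.valid = False

    def clock(self) -> None:
        if self.count == 0:
            return               # idle, nothing to do
        d = self.y << (self.count - 1)
        if self.z >= d:          # the one subtractor, reused every cycle
            self.z -= d
        self.count -= 1
        self.valid = self.count == 0


if __name__ == "__main__":
    m = SerialMod(8)
    m.start(200, 7)
    cycles = 0
    while not m.valid:
        m.clock()
        cycles += 1
    print(m.z, "after", cycles, "cycles")
```

The model takes n calls to `clock` per result, which mirrors the one-output-every-n-cycles throughput limit described above.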
If both clock rate and throughput are concerns, then the design can be pipelined. Where the area-efficient state-machine approach requires only 1 subtraction circuit, the fully pipelined version would use 32 subtraction circuits, with a register on the output of every subtraction. There is now a latency of 32 cycles before the first output is valid, after which a new valid output can appear every cycle. Again, the longest path is only 1 subtract, so the clock rate can be high. This method uses a lot of area, but provides high throughput.
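The pipelined variant can be modelled the same way. In this sketch (hypothetical names `PipelinedMod` and `clock`; y assumed nonzero; `None` used to represent an empty pipeline slot), each stage has its own subtractor and its own output register, so a new (x, y) pair can enter every clock and results come out n clocks later.

```python
from typing import List, Optional, Tuple


class PipelinedMod:
    """Model of n subtract stages, each followed by a register.

    Hypothetical class for illustration; None marks an empty pipeline slot.
    """

    def __init__(self, n: int):
        self.n = n
        # regs[i] is the register on the output of stage i: (z, y) or None.
        self.regs: List[Optional[Tuple[int, int]]] = [None] * n

    def clock(self, inp: Optional[Tuple[int, int]] = None) -> Optional[int]:
        """Advance one clock edge; inp is an (x, y) pair or None.

        Returns the remainder held in the last register after this edge,
        or None if that slot is empty.
        """
        # Stage 0 reads the new input; stage i reads register i - 1.
        stage_inputs = [inp] + self.regs[:-1]
        new_regs: List[Optional[Tuple[int, int]]] = [None] * self.n
        for i, item in enumerate(stage_inputs):
            if item is not None:
                z, y = item
                d = y << (self.n - 1 - i)   # stage 0 uses the largest shift
                if z >= d:
                    z -= d
                new_regs[i] = (z, y)
        self.regs = new_regs
        last = self.regs[-1]
        return None if last is None else last[0]


if __name__ == "__main__":
    p = PipelinedMod(8)
    pairs = [(200, 7), (255, 16), (13, 13), (5, 9)]
    for i in range(8 + len(pairs) - 1):
        print(i + 1, p.clock(pairs[i] if i < len(pairs) else None))
```

Only one subtract sits between any two registers, which is why the clock rate can stay high even though all n subtractors are present.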