int main(void)
{
    int array_1[] = {1, 2, 3, 4};
    int array_2[] = {5, 5, 5, 5};
    int i;
    int c_sum = 0;

    /* Sum of products: 1*5 + 2*5 + 3*5 + 4*5 = 50 */
    for (i = 0; i < 4; i++)
    {
        c_sum = c_sum + (array_1[i] * array_2[i]);
    }
    return 0;
}
`timescale 1ns / 1ps
//////////////////////////////////////////////////////////////////////////////////
// Company:
// Engineer:
//
// Create Date: 10:31:23 04/20/2011
// Design Name:
// Module Name: arrayz
// Project Name:
// Target Devices:
// Tool versions:
// Description:
//
// Dependencies:
//
// Revision:
// Revision 0.01 - File Created
// Additional Comments:
//
//////////////////////////////////////////////////////////////////////////////////
module array(clk,sum,reset);
  input clk,reset;
  output reg [7:0] sum;
  reg [7:0] memu1[3:0];
  reg [7:0] memu2[3:0];
  integer i;
  always@(posedge clk)
  begin
    if(reset==1'b1)
    begin
      // synchronous reset: load both arrays and clear the result
      memu1[0]<=1;
      memu1[1]<=2;
      memu1[2]<=3;
      memu1[3]<=4;
      memu2[0]<=5;
      memu2[1]<=5;
      memu2[2]<=5;
      memu2[3]<=5;
      sum=0;
    end
    else
    begin
      // blocking assignments so that each loop pass sees the updated sum
      sum=0;
      for(i=0;i<4;i=i+1)
        if(i<4)
        begin
          sum=sum+(memu1[i]*memu2[i]);
        end
    end
  end
endmodule
Although it's no problem to use it for simulation, I wouldn't suggest an iteration loop (similar to the C code) for the design, because in most cases it prevents synthesis of reasonable hardware. The most difficult thing when learning an HDL with a software programming background is understanding that iteration loops don't describe sequential actions in time and must be avoided in many situations.
Thanks for the sample code blooz. I have just got hold of a copy of Palnitkar and I am starting my learning process now. Two more quick queries for you:
- What Verilog compiler/simulator do you suggest? The Palnitkar book comes with one on the accompanying CD. Is that good?
- In the code sample you gave, the 'if(i<4)' is not required, right?
FvM - Yeah, I read similar comments about loops in hardware in other posts. So what is the way to achieve the same process? Do I have to declare the arrays outside the module, then send two elements at a time to a module which does the multiplication only, then send the result back? I don't know how to do that. Again, any advice or sample code will be really appreciated.
As FvM pointed out, better hardware is synthesized if the code is written without iteration.
The C-style code above can be modified to make it suitable for synthesis in a hardware style.
Because the parallel nature of the hardware must be taken into account, you can write it in a single step:
sum=(((memu1[0]*memu2[0])+ (memu1[1]*memu2[1]))+((memu1[2]*memu2[2])+(memu1[3]*memu2[3])));
OK, I understand the parallel nature now. But is there a limit to that 'parallelism'?
- Suppose I had two arrays of 2000 values each. Will that calculation depend on the size of the FPGA elements?
- Suppose I want to do another calculation on the sum reg after the 'sum of products' operation has finished. If calculations happen in parallel, how do I know the correct value of sum goes to the next line to be multiplied by 50 in the following code? Do I have to start another begin-end in between those two lines?
Code:
sum=(((memu1[0]*memu2[0])+ (memu1[1]*memu2[1]))+((memu1[2]*memu2[2])+(memu1[3]*memu2[3])));
another_val = sum*50;
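A side note on those two lines: when both are blocking assignments inside one always block, the second statement reads the value just assigned to sum, much as sequential C would. What deserves a check is the register width. The sketch below is only a C model of the arithmetic. The helper names sum_of_products and bits_needed are made up for illustration, and the width of another_val is never given in this thread, so the 8-bit case is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of products of two n-element arrays, computed at full width. */
unsigned sum_of_products(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    unsigned acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (unsigned)a[i] * b[i];
    }
    return acc;
}

/* Number of bits needed to hold an unsigned value (at least 1). */
unsigned bits_needed(unsigned v)
{
    unsigned n = 1;
    while (v >>= 1) {
        n++;
    }
    return n;
}
```

For the thread's data the sum is 50, which fits in 6 bits, so reg [7:0] sum is wide enough. The product 50*50 is 2500 and needs 12 bits; if another_val were also declared 8 bits wide (an assumption), it would keep only the low byte, 196.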
The majority of Verilog describes synchronous logic. The 2000-element inner product is presumably used to solve some problem. If you assume that the design must process data as fast as possible (in a low-latency sense), then you would use 2000 multipliers and a very large adder tree. But that's a lot of resources. You might instead decide that the hardware can take longer to do the calculation, doing 100 multiplications per cycle and completing in 20 cycles. Or perhaps the calculation can take a very long time to complete. In this case you might design a small circuit that performs 1/4 of a multiplication per cycle and takes 8000 cycles to process the data. For the latter cases, you would also have to describe some logic that buffers some of the 2000 data inputs so that they can be presented on later cycles.
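The three design points above follow from one division. As a rough sketch in C (the function name cycles_needed and the idea of writing the throughput as a fraction rate_num/rate_den are my own, not from any tool), ignoring pipeline latency and the adder tree:

```c
#include <assert.h>

/* Cycles to finish n_products multiplications when the datapath completes
 * rate_num / rate_den multiplications per clock cycle, rounded up.
 * Example: 1/4 of a multiplication per cycle is rate_num=1, rate_den=4. */
unsigned long cycles_needed(unsigned long n_products,
                            unsigned long rate_num,
                            unsigned long rate_den)
{
    assert(rate_num > 0 && rate_den > 0);
    return (n_products * rate_den + rate_num - 1) / rate_num;
}
```

With 2000 products this gives 1 cycle for 2000 multipliers, 20 cycles for 100 multipliers, and 8000 cycles for a circuit that does a quarter of a multiplication per cycle, matching the numbers in the post.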
Thanks for the verilog simulator links. You are a star!
About the simulator:
I would choose the Aldec Active-HDL Student Edition
because of its simple GUI, which suits a beginner.
module array(clk,sum,reset);
  input clk,reset;
  output reg [7:0] sum;
  reg [7:0] memu1[3:0];
  reg [7:0] memu2[3:0];
  always@(posedge clk or posedge reset) // providing an asynchronous reset
  begin
    if(reset==1'b1)
    begin
      memu1[0]<=1;
      memu1[1]<=2;
      memu1[2]<=3;
      memu1[3]<=4;
      memu2[0]<=5;
      memu2[1]<=5;
      memu2[2]<=5;
      memu2[3]<=5;
      sum<=0;
    end
    else
    begin
      // all four products and the adder tree are evaluated in one clock cycle
      sum<=(((memu1[0]*memu2[0])+
             (memu1[1]*memu2[1]))+
            ((memu1[2]*memu2[2])+
             (memu1[3]*memu2[3])));
    end
  end
endmodule
module array_new(clk,sum,reset,calcen);
  input clk,reset,calcen;
  output reg [7:0] sum=8'b0;
  reg [7:0] memu1[3:0];
  reg [7:0] memu2[3:0];
  always@(posedge clk or posedge reset) // providing an asynchronous reset
  begin
    if(reset==1'b1)
    begin
      memu1[0]<=1;
      memu1[1]<=2;
      memu1[2]<=3;
      memu1[3]<=4;
      memu2[0]<=5;
      memu2[1]<=5;
      memu2[2]<=5;
      memu2[3]<=5;
      sum<=0;
    end
    else if (calcen==1'b0)
      sum<=8'b0; // calculation disabled: hold the result at zero
    else
    begin
      // reset is low and calcen is high here
      sum<=(((memu1[0]*memu2[0])+
             (memu1[1]*memu2[1]))+
            ((memu1[2]*memu2[2])+
             (memu1[3]*memu2[3])));
    end
  end
endmodule
module sobel_mine( p0, p1, p2, p3, p5, p6, p7, p8, out);
input [7:0] p0,p1,p2,p3,p5,p6,p7,p8; // 8 input pixels of 8-bits
output [7:0] out; // 1 output pixel of 8-bits
// Internal wires
//11 bits because max value of gx and gy is 255*4 and last bit for sign
wire signed [10:0] gx,gy;
//Find the absolute value of gx and gy
wire signed [10:0] abs_gx,abs_gy;
//Max value is 255*8. here no sign bit needed.
wire [10:0] sum;
//------------------------//
//sobel mask for gradient in horizontal direction
assign gx=((p2-p0)+((p5-p3)<<1)+(p8-p6));
//sobel mask for gradient in vertical direction
assign gy=((p0-p6)+((p1-p7)<<1)+(p2-p8));
// Absolute value of gx
assign abs_gx = (gx[10]? ~gx+1 : gx);
// Absolute value of gy
assign abs_gy = (gy[10]? ~gy+1 : gy);
// Sum
assign sum = (abs_gx+abs_gy);
// Max value 255
assign out = (|sum[10:8])?8'hff : sum[7:0];
endmodule
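When simulating a module like sobel_mine, it can help to have a software reference to compare against. The following is a minimal sketch in C that is meant to mirror the arithmetic above (gradients, absolute values, sum, saturation at 255). The function name sobel_ref is hypothetical, and the model has not been checked against any particular simulator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>   /* abs */

/* Reference model of the Sobel arithmetic: same pixel naming as the module,
 * p4 (the centre pixel) is not used by either mask. */
uint8_t sobel_ref(uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3,
                  uint8_t p5, uint8_t p6, uint8_t p7, uint8_t p8)
{
    /* Gradients; each lies in [-1020, 1020], so 11 signed bits are enough. */
    int gx = (p2 - p0) + 2 * (p5 - p3) + (p8 - p6);
    int gy = (p0 - p6) + 2 * (p1 - p7) + (p2 - p8);

    int sum = abs(gx) + abs(gy);
    assert(sum <= 2040);          /* 255*8, fits in 11 unsigned bits */

    /* Saturate to the 8-bit output range. */
    return (sum > 255) ? 255 : (uint8_t)sum;
}
```

A testbench could write the eight input pixels and the module's out value to a file and compare each line against sobel_ref for the same inputs.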
Yes, definitely. You can exploit the parallel nature of the hardware. Suppose your array is 256 by 256 and the processing is applied to an 8 by 8 subset, so there are 1024 independent subsets {S0, S1, ..., S1023} that could be processed separately. Suppose you have 2 processing elements, P0 and P1, and they access the array in parallel based on a simple rule: P0 accesses the even subsets {S0, S2, ...} and P1 accesses the odd subsets {S1, S3, ...}.
If there are enough logic elements to do the trick, then instantiating one more copy is a good idea.
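The even/odd rule is a round-robin assignment, which also extends to more than two processing elements. A small C sketch of the bookkeeping follows; the names subset_count, owner and subsets_for are invented here for illustration and do not come from any library.

```c
#include <assert.h>

/* Number of independent tile x tile subsets in a size x size array. */
unsigned subset_count(unsigned size, unsigned tile)
{
    assert(tile > 0 && size % tile == 0);
    unsigned per_side = size / tile;
    return per_side * per_side;
}

/* Processing element that handles subset s when subsets are dealt out
 * round-robin: with 2 elements, P0 gets even s and P1 gets odd s. */
unsigned owner(unsigned s, unsigned n_pe)
{
    assert(n_pe > 0);
    return s % n_pe;
}

/* How many of n_subsets land on processing element pe. */
unsigned subsets_for(unsigned pe, unsigned n_subsets, unsigned n_pe)
{
    assert(n_pe > 0 && pe < n_pe);
    return n_subsets / n_pe + (pe < n_subsets % n_pe ? 1u : 0u);
}
```

For the 256 by 256 array with 8 by 8 subsets this gives 1024 subsets, 512 for each of P0 and P1, so a second copy of the processing element roughly halves the number of passes, provided both can reach the array in parallel.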