[SOLVED] Mean subtraction for Image processing in matlab

Jaffry · Aug 24, 2012

Dear all,

I am implementing an image processing algorithm on an FPGA.
I have already done a good part of it, but I realized I should confirm my approach with more experienced people.

I am getting a 14-bit signal from an ADC and I have to do mean subtraction, i.e.
take 500 samples of the 14-bit signal and compute their mean (while scanning a plate of X*Y area).

What I did for the mean subtraction is add each 14-bit value to a running sum on every clock cycle as I receive it, and after 500 samples
I divide the sum by 500 to get the mean of those samples. Since this is my first experience
implementing an image processing algorithm on an FPGA, I would like to ask whether this is right.
I am using a FIFO to store the samples from the ADC, and then I sum the values.

Since I am scanning a plate of X*Y area, I have (500*X*Y) samples in total and (X*Y) mean values.
What I am concerned about is whether I have done this right.

Any new ideas are also welcome.
Eager for your response, and thanks in advance.

mrflibble · Aug 24, 2012

Well, if you can change that number of 500 to 512 life becomes a lot easier. Take 512 of those 14 bit values, add them all up and divide by 512. The reason to take 512 is that divide by 512 is much easier. All you have to do is right shift 9 bits.

Translation: make a 23 bit accumulator with a 14-bit input.

1 - clear accumulator
2 - clock in those 512 values
3 - your average value is the 14 MSB (so accu_out[22:9])
4 - problem solved
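The steps above can be modeled in software before committing them to hardware. The following is a minimal Python sketch of the arithmetic only, not HDL; the names `mean_of_512`, `SAMPLES`, `SHIFT` and `ACC_BITS` are made up for illustration, and unsigned 14-bit samples are assumed.

```python
SAMPLES = 512    # number of samples per mean (a power of two)
SHIFT = 9        # log2(512): dividing by 512 is a right shift by 9
ACC_BITS = 23    # 14 input bits + 9 bits of growth

def mean_of_512(samples):
    """Model of a 23-bit accumulator fed with unsigned 14-bit samples."""
    assert len(samples) == SAMPLES
    acc = 0                                   # step 1: clear accumulator
    for s in samples:                         # step 2: clock in 512 values
        assert 0 <= s < (1 << 14)
        acc = (acc + s) & ((1 << ACC_BITS) - 1)  # wrap like a 23-bit register
    return acc >> SHIFT                       # step 3: keep bits [22:9]

print(mean_of_512([100] * 256 + [101] * 256))
```

The largest possible sum is 512 * 16383, which is below 2^23, so the 23-bit accumulator never overflows. The shift truncates, so a true mean of 100.5 comes out as 100.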

Jaffry · Aug 25, 2012

Hello mrflibble,

Thank you for your reply. That is indeed a great help.

I have to check whether I can change the algorithm to use 512 samples; if I can, it will be great.
Thank you again very much, since earlier I had used a divider from Core Generator.

I also tried your suggestion with signed numbers: I took 2, 4 and 8 samples and computed their average,
so I had fractional values as well. What do you think of your suggestion for signed numbers?

- - - Updated - - -

One more thing to discuss, if you can advise: since I am getting the data from an ADC, it is in 2's complement form, or simply put a signed number, so I think this technique should work well for that too.

mrflibble · Aug 25, 2012

Signed numbers will work just as well. Just make sure the 14-bit accu input and the 23-bit accu itself are both signed.
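As a rough illustration of the signed case, here is a Python model of the arithmetic. The helper names `to_signed` and `signed_mean_of_512` are hypothetical, and the raw ADC words are assumed to be 14-bit two's complement.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def signed_mean_of_512(raw_samples):
    """Signed 23-bit accumulator fed with 512 signed 14-bit samples."""
    assert len(raw_samples) == 512
    acc = 0
    for raw in raw_samples:
        acc += to_signed(raw, 14)   # sign-extend the input before adding
    acc = to_signed(acc, 23)        # the sum always fits in 23 signed bits
    return acc >> 9                 # arithmetic shift: rounds toward -infinity

print(signed_mean_of_512([0x3FFF] * 512))   # 0x3FFF is -1 in 14-bit two's complement
```

One detail worth knowing: an arithmetic right shift rounds toward minus infinity rather than toward zero, so a sum of -1 gives a mean of -1, not 0. Whether that matters depends on the application.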

Jaffry · Aug 26, 2012

Thank you mrflibble. Yes, I kept that in mind and simulated it as well. It works well.

Thank you again

mrflibble · Aug 26, 2012

Nice to hear you got it working.

Jaffry · Sep 4, 2012

By the way, mrflibble, is there a better or similar algorithm for other divisors? As the second part of the image processing, I have to implement spatial averaging with a 3x3 window. For that I will need three window sizes:

1) Size 4: Hence I need to divide by 4, this I can do by shifting 2 bits position.
2) Size 6: similarly I will sum up 6 elements and divide by 6, taking Spatial average
3) Size 9: similar for 9 elements.

Do you have any ideas for division by 6 or 9? I am using the divider core now and I am not very happy with the result, since I am skeptical whether it will generate exact results in real hardware.

Waiting for your reply

Jaffry

mrflibble · Sep 4, 2012

Binary division by primes is going to be annoying. ;-)

Divide by 6 has the same issue as divide by 9. And that issue being the division by 3.

Divide by 6 is multiplying by 1/6. Let's readily forget about the factor of 2 because that's totally easy. So the problem du jour is dividing by 3. Which is binary multiplication by 0.0101010101010101010101010101010....etc

Essentially that's a pipeline of "right shift 2, then add". And the length of the pipeline depends on precision. Basically 2 approaches. Either do this with cheapo LUT + FF resources. Then you can get it running pretty fast but you do get a deeeeep pipeline just to divide by 3. If you can handle the deep pipeline I'd do that.
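To make the "right shift 2, then add" idea concrete, here is a small Python model in which each loop iteration stands for one pipeline stage. The function name `div3_shift_add` and the parameter `stages` are invented for this sketch, and non-negative integer inputs are assumed.

```python
def div3_shift_add(n, stages):
    """Approximate n / 3 with `stages` repetitions of add-then-shift-right-by-2.

    Each stage computes y = (n + y) / 4, which converges to n / 3
    because n/3 = (n + n/3) / 4.
    """
    y = 0
    for _ in range(stages):
        y = (n + y) >> 2    # one pipeline stage: add, then shift right by 2
    return y

for stages in (1, 2, 4, 10):
    print(stages, div3_shift_add(3000, stages))
```

More stages give more precision, which is where the pipeline depth comes from. Because every stage truncates, the result can end up one below the exact quotient (3000 settles at 999 here); carrying a few extra fractional bits through the stages is one way to reduce that error.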

If you hate deep pipelines and have DSP slices available on your fpga you can use those to multiply by that 0.010101010101010101. Or to be more practical to divide by six you multiply by 0.0010101010101010101010101. You'll notice that's just the 1/3 (0.010101010101) shifted to the right by 1 bit for the extra divide by 2. And a similar approach to the divide by 9.

If you are doing image processing I guess you'll be wanting to do the DSP slice approach.
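The multiply-by-reciprocal approach can be checked in software first. The sketch below is a Python model, not a description of what any particular divider core generates; `reciprocal_constant` and `divide_by_constant` are hypothetical names. It rounds the reciprocal up and uses `input_bits + ceil(log2(divisor))` fractional bits, a standard choice that reproduces floor division for every non-negative input below `2**input_bits`.

```python
def reciprocal_constant(divisor, input_bits):
    """Return (multiplier, shift) so that (n * multiplier) >> shift == n // divisor
    for all 0 <= n < 2**input_bits."""
    shift = input_bits + (divisor - 1).bit_length()   # extra bits: ceil(log2(divisor))
    multiplier = -(-(1 << shift) // divisor)          # ceil(2**shift / divisor)
    return multiplier, shift

def divide_by_constant(n, divisor, input_bits):
    """One multiply and one shift, as a DSP slice followed by bit selection would do."""
    multiplier, shift = reciprocal_constant(divisor, input_bits)
    return (n * multiplier) >> shift

# Sum of six 14-bit pixels fits in 17 bits; sum of nine fits in 18 bits.
m6, s6 = reciprocal_constant(6, 17)
print(bin(m6), s6)
print(divide_by_constant(6 * 16383, 6, 17))
```

For the divide by 6, the multiplier's bit pattern is the 0.001010101... pattern from above, cut off at 20 fractional bits and rounded up in the last place.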

Oh yeah, one other thing ... depending on the weights for your individual pixels you can do divide and conquer approach. Say you have your 3x3 block like so:

Code:

123
456
789

Then you cleverly butcher that into 4 blocks:

Code:

I'm sure you get the idea. Like I said, that approach is only useful if you can spread the weights for the average around. Generally this means the weight for the center pixel 5 will have to be higher than the corners. So if you just take the boring case with all pixels the same weights this kind of thing is a no-go. But just something to keep in mind. If you can shuffle things around so you get chunks of power-of-two then things suddenly become easy again.
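One split that fits this description (an assumption, not necessarily the one intended above) is four overlapping 2x2 blocks that all share the center pixel. Summing the four blocks weights the corners by 1, the edges by 2 and the center by 4, for a total of 16, so the only division left is a shift by 4. A Python sketch of the arithmetic, with the hypothetical name `weighted_avg_3x3`:

```python
def weighted_avg_3x3(p):
    """Weighted 3x3 average built from four overlapping 2x2 block sums.

    p is a 3x3 list of non-negative integer pixels. The effective weights are
        1 2 1
        2 4 2
        1 2 1
    which sum to 16, so the division is a right shift by 4.
    """
    blocks = [
        p[0][0] + p[0][1] + p[1][0] + p[1][1],   # top-left block (pixels 1,2,4,5)
        p[0][1] + p[0][2] + p[1][1] + p[1][2],   # top-right block (2,3,5,6)
        p[1][0] + p[1][1] + p[2][0] + p[2][1],   # bottom-left block (4,5,7,8)
        p[1][1] + p[1][2] + p[2][1] + p[2][2],   # bottom-right block (5,6,8,9)
    ]
    return sum(blocks) >> 4

print(weighted_avg_3x3([[10, 20, 30], [40, 50, 60], [70, 80, 90]]))
```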

Oh yeah, and another approach is simply ooopsie forget to divide. Just do your normal math further down the pipeline. Then waaaaay at the end remember to divide by 6 or 9 or whatever the case is. Sometimes it's not necessary to do the annoying divide-by-prime right away. All you do is keep track of it and do it at a convenient point.

Why, you ask? Well, if you have multiple stages ... Stage A does a divide by 6. Then some clever stuff, then more clever stuff, and then stage D does a divide by 7, then more and blah blah. If you do the divides right away you get expensive logic twice. If however you do the ooopsie-I-forgot method, then you only have to divide by 42 somewhere near the end. As in multiply by the reciprocal (1/42).
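A toy Python comparison of the two orderings is sketched below. The function names `eager` and `deferred` are made up, and a plain sum stands in for the "clever stuff" between the stages, which is an assumption for illustration only. Deferring only works when the stages in between are linear, and the intermediate values then need more bits.

```python
def eager(samples):
    """Divide by 6 right away, process, then divide by 7."""
    stage_a = [s // 6 for s in samples]   # first divider (truncates each value)
    total = sum(stage_a)                  # stand-in for the processing in between
    return total // 7                     # second divider

def deferred(samples):
    """Skip both divides, then divide once by 6 * 7 = 42 at the end."""
    total = sum(samples)                  # same processing on unscaled values
    return total // 42                    # single divider

print(eager([600] * 7), deferred([600] * 7))
print(eager([5] * 42), deferred([5] * 42))
```

Besides needing only one divider, the deferred version truncates only once, so it loses less precision when the individual values are small.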

Hope that helps somewhat.

Oh yeah I forgot to add: for a divider core I would expect it to generate something similar in the case of a constant division factor. As in, if you generate a core that should divide by a constant of 6, I'd expect it to do precisely this: generate a DSP core that multiplies by 0.001010101010101...

But assumption is the maternal parent of all fsckups, so best check the synthesized results.

Jaffry · Sep 7, 2012

Thank you mrflibble for all your responses.

It has really been a learning experience with this forum, but since I am relatively new to image processing and to using FPGAs for it,
could you briefly explain what you mean by a few of the things you mentioned (at your convenience, of course)?
But before that: yes, I already did the division by 6 and 9 using the 1/6 and 1/9 method, as I found it elsewhere in this same forum.

Anyway,

I cannot understand what you mean by 'deep pipeline' in the following:

Essentially that's a pipeline of "right shift 2, then add". And the length of the pipeline depends on precision. Basically 2 approaches. Either do this with cheapo LUT + FF resources. Then you can get it running pretty fast but you do get a deeeeep pipeline just to divide by 3. If you can handle the deep pipeline I'd do that.

If you are doing image processing I guess you'll be wanting to do the DSP slice approach.

I could not understand a word of this.

I get your idea regarding the divide and conquer approach, but what I don't understand is what you mean by the 'weight' of the pixels here.

Pardon me for so many questions in one thread, but how do I find out the maximum frequency or the delay of a multiplier? I think it depends on the number of bits;
for example, multiplying by '111011' should have a smaller delay than multiplying by '1110111110101'.

I ask because increasing the number of bits in the multiplication increases the precision, but what about the timing requirement? Or is the multiplier I implement not affected at all by the number of bits (meaning the clk-to-output delay of the multiplier)?

I am using a Xilinx FPGA. Can you share any good reading resources?

Bests and Thanks,
Jaffry

TrickyDicky · Sep 7, 2012

The max clock speed depends on routing rather than data width (but larger data width will modify the routing). The key is delay time between registers. As for the multipliers, the onboard DSP slices will be 18 bit multipliers (2x 18 bit input, 36 bit output) so unless you go over these limits, the speed of the multiplier won't change.

Basically, if you follow good design practice (keep everything synchronous, don't have lots of combinatorial logic between registers) then you should easily be able to have a clock at between 100 and 200 MHz on most newer devices.

Jaffry · Sep 7, 2012

Thank you for your replies.

That was indeed a great help. I implemented a very simple circuit:

Code:

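// Registered 4x4 unsigned multiplier, used to compare slice and DSP48E mapping
module mult(
    input clk,
    input rst,              // synchronous reset, active high
    input [3:0] in1,
    input [3:0] in2,
    output reg [7:0] out1   // a 4-bit x 4-bit product needs at most 8 bits
    );

always @(posedge clk)
    if (rst) out1 <= 1;     // reset value is 1 in this test circuit
    else     out1 <= in1 * in2;
endmodule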

First I simply ran the flow through PAR and looked at the report.
It showed the following: minimum period 2.40 ns, 16 slices used, and no DSP48E slices.

But when I changed the synthesis option for using DSP slices to 'YES', it used zero (0) slices and only 1 DSP48E slice.
Hence the conclusion: 16 slices were saved, while the minimum clock period was the same in both cases.

Good learning
Thank you.
