Binary division by primes is going to be annoying. ;-)
Divide by 6 has the same issue as divide by 9, and that issue is the division by 3.
Divide by 6 is multiplying by 1/6. Let's readily forget about the factor of 2 because that's totally easy (it's just a right shift by one bit). So the problem du jour is dividing by 3, which is binary multiplication by 0.0101010101010101010101010101010... and so on.
Essentially that's a pipeline of "right shift 2, then add", and the length of the pipeline depends on the precision you need. Basically there are 2 approaches. The first is to do this with cheapo LUT + FF resources. Then you can get it running pretty fast, but you do get a deeeeep pipeline just to divide by 3. If you can handle the deep pipeline, I'd do that.
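To make the "right shift 2, then add" idea concrete, here is a small software model of it in Python (not HDL, just a bit-accurate sketch). The names `div3_shift_add` and `div3_corrected`, the 8 stages and the 4 guard bits are my own assumptions, picked for 16-bit unsigned inputs. Because every shift truncates, the raw result can come out one too low, so the second function adds a cheap remainder check at the end.

```python
def div3_shift_add(x, stages=8, guard_bits=4):
    """Approximate x // 3 using only right shifts by 2 and adds.

    Models x * 0.010101...b in Horner form. The stage count and guard
    bits are assumptions sized for 16-bit unsigned x.
    """
    xs = x << guard_bits           # keep a few fractional bits
    q = xs
    for _ in range(stages):        # each iteration is one pipeline stage
        q = xs + (q >> 2)          # right shift 2, then add
    return (q >> 2) >> guard_bits  # final shift, then drop the guard bits


def div3_corrected(x, stages=8, guard_bits=4):
    """Fix the possible off-by-one caused by truncation."""
    q = div3_shift_add(x, stages, guard_bits)
    remainder = x - 3 * q          # 3*q is just (q << 1) + q in hardware
    return q + 1 if remainder >= 3 else q
```

With these settings the uncorrected result is either exact or one below the true quotient for every 16-bit input. More guard bits and stages shrink the error but make the pipeline deeper, which is exactly the trade-off mentioned above.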
The second approach: if you hate deep pipelines and have DSP slices available on your FPGA, you can use those to multiply by that 0.010101010101010101. Or, to be more practical, to divide by six you multiply by 0.0010101010101010101010101. You'll notice that's just the 1/3 (0.010101010101) shifted to the right by 1 bit for the extra divide by 2. A similar approach works for the divide by 9 (1/9 is 0.000111000111... in binary).
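Here is a rough Python model of that multiply-by-reciprocal trick, again only a sketch. The helper names `reciprocal_constant` and `divide_by_constant` are made up for illustration, and the way I size the constant (input width plus the bit length of the divisor, rounded up by one) is one common choice that happens to give exact results for unsigned inputs, not necessarily what any particular tool does.

```python
def reciprocal_constant(divisor, width):
    """Return (mult, shift) so that (x * mult) >> shift == x // divisor
    for every unsigned x that fits in `width` bits."""
    shift = width + (divisor - 1).bit_length()
    mult = (1 << shift) // divisor + 1   # reciprocal, rounded up
    return mult, shift


def divide_by_constant(x, divisor, width=16):
    """One multiply and one shift, as a DSP slice would do it."""
    mult, shift = reciprocal_constant(divisor, width)
    return (x * mult) >> shift
```

For a divisor of 6 and 16-bit inputs this gives the constant 0x15556 with a shift of 19, which is the 0.0010101... pattern with the last bit rounded up. Whether the product fits in a single DSP multiplier depends on your device and your input width.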
If you are doing image processing I guess you'll be wanting to do the DSP slice approach.
Oh yeah, one other thing ... depending on the weights for your individual pixels you can do a divide and conquer approach. Say you have your 3x3 block like so:
Then you cleverly butcher that into 4 blocks:
I'm sure you get the idea. Like I said, that approach is only useful if you can spread the weights for the average around. Generally this means the weight for the center pixel 5 will have to be higher than the corners. So if you just take the boring case where all pixels have the same weight, this kind of thing is a no-go. But it's just something to keep in mind. If you can shuffle things around so you get chunks that are powers of two, then things suddenly become easy again.
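As a sketch of what that can look like, assume the four blocks are the overlapping 2x2 quadrants that all share the center pixel 5. That split is my guess at one workable arrangement, and the function name `weighted_average_3x3` is made up. Each quadrant is a sum of 4 pixels and there are 4 quadrants, so the total weight is 16 and the divide turns into a right shift by 4.

```python
def weighted_average_3x3(p):
    """p is a row-major list of 9 pixels; p[4] is the center pixel 5.

    Assumed split: four overlapping 2x2 quadrants sharing the center.
    """
    p1, p2, p3, p4, p5, p6, p7, p8, p9 = p
    top_left = p1 + p2 + p4 + p5
    top_right = p2 + p3 + p5 + p6
    bottom_left = p4 + p5 + p7 + p8
    bottom_right = p5 + p6 + p8 + p9
    # Total weight is 16, so the divide is a plain shift.
    return (top_left + top_right + bottom_left + bottom_right) >> 4
```

The effective weights come out as 1 2 1 / 2 4 2 / 1 2 1, so the center counts 4 times and each corner once. No divide by 3 anywhere, at the price of not being a flat average.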
Oh yeah, and another approach is simply ooopsie forget to divide. Just do your normal math further down the pipeline. Then waaaaay at the end remember to divide by 6 or 9 or whatever the case is. Sometimes it's not necessary to do the annoying divide-by-prime right away. All you do is keep track of it and do it at a convenient point.
Why, you ask? Well, say you have multiple stages. Stage A does a divide by 6. Then some clever stuff, then more clever stuff, and then stage D does a divide by 7, then more and blah blah. If you do the divides right away you get expensive logic twice. If however you do the ooopsie-I-forgot method, then you only have to divide by 42 somewhere near the end. As in multiply by the reciprocal (1/42). This only works as long as the stuff in between is linear, so the scale factor simply carries through, and your intermediate values need a few extra bits because they haven't been scaled down yet.
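A toy model of the difference, assuming 8-bit samples, a stage A that combines groups of 6 and a stage D that combines 7 of those results. The stage layout, the bit widths and the function names are all invented for illustration, and the "clever stuff" in between is left out entirely.

```python
def divide_by_constant(x, divisor, width):
    """Multiply by a rounded-up reciprocal, then shift (unsigned x)."""
    shift = width + (divisor - 1).bit_length()
    mult = (1 << shift) // divisor + 1
    return (x * mult) >> shift


def eager(samples):
    """Divide by 6 in stage A, then by 7 in stage D: two dividers."""
    group_avgs = [divide_by_constant(sum(samples[i:i + 6]), 6, 11)
                  for i in range(0, 42, 6)]
    return divide_by_constant(sum(group_avgs), 7, 11)


def deferred(samples):
    """Carry the raw sums along and divide by 42 once at the end."""
    group_sums = [sum(samples[i:i + 6]) for i in range(0, 42, 6)]
    # 42 * 255 < 2**14, so the final sum needs 14 bits.
    return divide_by_constant(sum(group_sums), 42, 14)
```

The deferred version needs only one constant multiplier and throws away fewer low bits, since it truncates once instead of twice. The cost is a wider datapath between the stages.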
Hope that helps somewhat.
Oh yeah, I forgot to add: for a divider core I would expect it to generate something similar in the case of a constant division factor. As in, if you generate a core that should divide by a constant of 6, I'd expect it to do precisely this: generate a DSP core that multiplies by 0.001010101010101...
But assumption is the maternal parent of all fsckups, so best check the synthesized results.