I guess it depends on how you implement your filter. The answer is different, for example, for a software implementation than for a hardware one.
For a software implementation, it will depend partly on the architecture of the CPU. For example, some CPUs have dedicated multiply-accumulators, and you will sometimes see the number of multiply-accumulate (MAC) operations used as a rough measure of implementation cost. Delays are very cheap in many software scenarios because the CPU usually has access to huge banks of DRAM; if you can tolerate the latency of DRAM accesses, then the delay line is just ordinary memory.
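To make the software cost model concrete, here is a minimal sketch of a direct-form FIR filter, where each output sample costs one multiply-accumulate per tap and the "delays" are just reads of past input samples from memory. The function name and the moving-average coefficients are my own illustration, not anything from a specific library.

```python
def fir_filter(x, coeffs):
    """Direct-form FIR filter: cost is len(coeffs) MACs per output sample."""
    n_taps = len(coeffs)
    y = []
    for n in range(len(x)):
        acc = 0.0
        # One multiply-accumulate per tap; the indexing into x IS the
        # delay line -- in software a delay is just a memory read.
        for k in range(n_taps):
            if n - k >= 0:
                acc += coeffs[k] * x[n - k]
        y.append(acc)
    return y

# 4-tap moving average: 4 MACs per output sample, 3 delays, all "free" memory.
h = [0.25, 0.25, 0.25, 0.25]
print(fir_filter([1.0, 1.0, 1.0, 1.0, 1.0], h))
# -> [0.25, 0.5, 0.75, 1.0, 1.0]
```

A DSP chip with a hardware MAC unit would execute the inner loop's multiply and add in a single instruction, which is why the MAC count is a reasonable first-order cost metric.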
For a hardware implementation (e.g. an ASIC or FPGA), it is more likely that you will have to dedicate specific hardware to each multiply, add and delay. If you require very high throughput (i.e. a sample rate higher than the clock rate), then you may even have to dedicate more than one physical block per logical operation: for example, if the sample rate is quadruple the clock rate, you will need 4 hardware blocks per logical operation. Conversely, if your throughput requirements are low (i.e. a sample rate significantly lower than the clock rate), then you may be able to reuse the same hardware block for several logical operations: for example, if the sample rate is half the clock rate, it should be possible to reuse each physical block twice, halving the resource requirement.
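The replication/sharing trade-off above reduces to a simple ratio. Here is a hypothetical back-of-the-envelope sketch (the function name and rates are my own invention) assuming each physical block completes one operation per clock cycle and every logical operation must finish once per sample:

```python
import math

def physical_blocks_needed(logical_ops, sample_rate, clock_rate):
    """Physical blocks required to sustain sample_rate with the given clock.

    Assumes one operation per block per clock cycle, and that each of the
    filter's logical_ops must complete once per output sample.
    """
    # Total operations that must finish in each clock period:
    ops_per_cycle = logical_ops * sample_rate / clock_rate
    return math.ceil(ops_per_cycle)

# Sample rate 4x the clock: 4 physical blocks per logical operation.
print(physical_blocks_needed(1, 400e6, 100e6))  # -> 4
# Sample rate half the clock: 2 logical operations share 1 physical block.
print(physical_blocks_needed(2, 50e6, 100e6))   # -> 1
```

In practice the achievable sharing also depends on routing, pipelining and scheduling overhead, so treat this as a lower bound on resources rather than an exact count.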