I've never used Altera FPGAs, but I think that device has 54 18-bit MACs that can run at 278 MHz. I don't know how easy or difficult it is to reach maximum speed in Altera parts, but I would try using 6 MACs running at 220 MHz, with each one processing 11 filter taps (66 taps in total).
In general, for good FPGA resource utilization, you should run the clock fast, use lots of pipelining, and do things sequentially to keep most of your logic busy on every clock cycle.
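To make the time-sharing idea concrete, here is a rough behavioral sketch in Python (not HDL) of 6 MACs each stepping through 11 taps. The function names, the contiguous tap-to-MAC assignment, and the test data are my own assumptions for illustration; a real design would also need pipelining and a coefficient ROM per MAC.

```python
# Hypothetical sizes taken from the suggestion above: 6 MACs x 11 taps = 66 taps.
NUM_MACS = 6
TAPS_PER_MAC = 11
NUM_TAPS = NUM_MACS * TAPS_PER_MAC


def direct_fir(coeffs, samples):
    """Reference FIR: y[n] = sum_k coeffs[k] * x[n-k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


def folded_fir(coeffs, samples):
    """Same filter, computed the way 6 time-shared MACs would do it.

    Each MAC owns a contiguous slice of 11 taps (an assumed mapping) and
    performs one multiply-accumulate per fast clock cycle, so one output
    sample takes 11 fast clock cycles.
    """
    assert len(coeffs) == NUM_TAPS
    history = [0] * NUM_TAPS              # delay line, history[k] = x[n-k]
    out = []
    fast_cycles = 0
    for x in samples:
        history = [x] + history[:-1]      # shift in the new sample
        accs = [0] * NUM_MACS
        for cycle in range(TAPS_PER_MAC):     # 11 fast clock cycles
            for mac in range(NUM_MACS):       # all 6 MACs work in parallel
                k = mac * TAPS_PER_MAC + cycle
                accs[mac] += coeffs[k] * history[k]
            fast_cycles += 1
        out.append(sum(accs))                 # final adder tree
    return out, fast_cycles


if __name__ == "__main__":
    coeffs = [(k % 7) - 3 for k in range(NUM_TAPS)]    # arbitrary integer test taps
    samples = [(3 * n) % 17 - 8 for n in range(100)]   # arbitrary test input
    folded, cycles = folded_fir(coeffs, samples)
    assert folded == direct_fir(coeffs, samples)
    print(cycles // len(samples), "fast clocks per sample")
```

If one input sample arrives every 11 fast clocks (my assumption), a 220 MHz clock corresponds to a 20 MHz sample rate, and every MAC does useful work on every clock cycle.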
I use MATLAB to create digital filter coefficients. It makes the job pretty easy. MATLAB can also spit out VHDL or Verilog code, but I prefer writing my own HDL.
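If MATLAB isn't available, the basic coefficient step can be sketched in plain Python. This is only a stand-in and not what MATLAB's filter design tools do: it uses a simple Hamming-windowed sinc, and the function names, the 0.1 cutoff, and the 18-bit signed format with 17 fractional bits are assumptions chosen to match 18-bit multipliers.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass; cutoff is in cycles per sample (0 to 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]       # normalize to unity DC gain


def quantize(taps, bits=18):
    """Round to signed fixed point with bits-1 fractional bits, saturating."""
    scale = 1 << (bits - 1)
    lo, hi = -scale, scale - 1
    return [max(lo, min(hi, round(h * scale))) for h in taps]


if __name__ == "__main__":
    q = quantize(lowpass_taps(66, 0.1))   # 66 taps, assumed cutoff
    print(len(q), "taps, largest magnitude", max(abs(v) for v in q))
```

The integer list is what would go into the coefficient memories; it is worth checking the frequency response again after quantization, since rounding changes the stopband slightly.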
I don't know the details of Stratix power management. In Xilinx FPGAs, static power is usually small compared to dynamic power.