I'm not sure how many lower-cost FPGAs can store 1 MB on-chip. I suggest using a commodity DDR chip for this type of application. Most vendors offer free IP, or even hard IP, for the memory controller. With modern ICs, you can get well over 1 MB in a single DDR2/3 IC. Of course, there are more costly FPGAs that have enough SRAM for 1 MSample.
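As a rough sanity check on capacity, here is a small Python sketch. The 1 Gbit density and the 16-bit sample width are my assumptions for illustration only, as is reading "1 MSample" as 2^20 samples; plug in the numbers from your own spec.

```python
def samples_that_fit(density_bits, sample_bits):
    """Number of whole samples a memory of the given density can hold."""
    return density_bits // sample_bits

# Assumed values, for illustration only (not from the original spec).
DDR_DENSITY_BITS = 1 * 1024**3   # hypothetical 1 Gbit DDR2/3 part
SAMPLE_BITS = 16                 # hypothetical sample width
REQUIRED_SAMPLES = 1024**2       # 1 MSample, taken here as 2^20 samples

capacity = samples_that_fit(DDR_DENSITY_BITS, SAMPLE_BITS)
headroom = capacity // REQUIRED_SAMPLES
print(f"{capacity} samples fit, {headroom}x the requirement")
```

Under those assumptions a single chip holds the required buffer many times over, which is the point of going off-chip.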
I suggest DDR over QDR or on-chip SRAM, as DDR2 and DDR3 both have on-die termination, which makes signal integrity easier. Further, if you use 16-bit IO, you can run the interface at 166-250 MHz (or whatever the minimum DLL lock rate is), and even use 2T timing, which keeps the interfacing simple. (If you do this on Xilinx, look into using SSTL class 1; the SSTL class 2 implementation uses more power.)
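To put numbers on the 16-bit interface, here is a quick back-of-the-envelope calculation in Python. It only computes the theoretical peak (bus width times two transfers per clock); the function name is mine, and sustained throughput will be lower once refresh, row activation, and bus turnaround are accounted for.

```python
def peak_bandwidth_bytes_per_s(bus_bits, clock_hz):
    """Theoretical peak of a DDR bus: two transfers per clock cycle."""
    bytes_per_transfer = bus_bits // 8
    return bytes_per_transfer * 2 * clock_hz

# The two ends of the clock range mentioned above, with 16-bit IO.
for mhz in (166, 250):
    bw = peak_bandwidth_bytes_per_s(16, mhz * 1_000_000)
    print(f"{mhz} MHz: {bw / 1e6:.0f} MB/s peak")
```

That works out to 664 MB/s at 166 MHz and 1000 MB/s at 250 MHz, so even the slow end of the range leaves a lot of margin for a simple, conservative controller.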
In the end, the DDR route might be smaller, as some FPGAs that have the required SRAM also need bigger packages.
This should allow you to use a lower-cost FPGA, commodity SDRAM, and whatever USB PHY you want, all while wildly exceeding the 1 MB spec.
For programming, you may end up using a low-cost micro that can program the FPGA over USB, or you might use the built-in feature of modern FPGAs that loads the configuration from a commodity EEPROM. It's your choice.