I feed my input signal through a 42-bit carry chain. In a Spartan-3, it takes about 120 ps for the input signal to propagate through each CLB (two MUXCY_L primitives), so this structure is basically a 5 ns delay line with 42 taps (42 x 120 ps = 5.04 ns). Next, I apply a 200 MHz clock to the FPGA. Inside each CLB, I connect the carry signal to a D-flop clocked at 200 MHz. The result: every 5 ns, the 42 D-flops take a snapshot of the input signal as it propagates along the delay line. The 42-bit output represents the input signal sampled at roughly 120 ps intervals, or about 8.4 gigasamples per second (42 samples per 5 ns clock period).
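To make the arithmetic explicit, here is a minimal Python sketch of the sizing calculation. The function name `delay_line_params` is made up for illustration, and the 120 ps figure is simply the per-tap delay quoted above, not a datasheet value.

```python
def delay_line_params(clock_hz, tap_delay_s):
    """Return (number of taps, effective sample rate in samples/s).

    The chain should span one clock period, so the tap count is the
    clock period divided by the per-tap delay, rounded to a whole tap.
    """
    period_s = 1.0 / clock_hz
    taps = round(period_s / tap_delay_s)
    # Each clock edge captures `taps` samples, one per tap.
    sample_rate = taps * clock_hz
    return taps, sample_rate


taps, rate = delay_line_params(200e6, 120e-12)
print(taps, rate / 1e9)  # tap count and rate in GS/s
```

With a 200 MHz clock and 120 ps per tap this gives the 42 taps and 8.4 GS/s figures used in this post.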
If your FPGA's carry chain delay is not 120 ps per tap, or if your clock is not 200 MHz, you need to adjust the number of taps so that the chain still spans one clock period. The delay also varies from device to device, so you will probably want to construct a mechanism that performs this adjustment automatically.
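One possible way to do the automatic adjustment, and this is only my assumption about how it could work rather than a tested design: build the chain longer than one clock period, feed it a reference square wave whose period equals the sample clock period, and count how many taps separate two successive edges of the same polarity in a snapshot. The helper name `taps_per_period` is hypothetical, and the sketch ignores the bubbles and metastable bits that a real snapshot will contain.

```python
def taps_per_period(snapshot):
    """Estimate how many taps span one period of the reference signal.

    snapshot: list of 0/1 values, one per tap, in chain order.
    Returns the distance between the first two 0-to-1 transitions,
    or None if the snapshot does not contain two of them.
    """
    edges = [i for i in range(1, len(snapshot))
             if snapshot[i - 1] == 0 and snapshot[i] == 1]
    if len(edges) < 2:
        return None
    return edges[1] - edges[0]


# Synthetic 100-tap snapshot of a square wave that repeats every 42 taps.
snapshot = [1 if (i + 5) % 42 < 21 else 0 for i in range(100)]
print(taps_per_period(snapshot))
```

In practice you would average this over many snapshots before trusting the number, since a single capture is noisy.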
CLB placement is critical: the carry chain must fit into a single column to achieve reasonably uniform 120 ps delays.
Of course, you also need logic that analyzes the 42-bit output to find whatever you are looking for in the input signal.
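As an illustration of what that analysis might look like, here is a small Python model that locates edges in one 42-bit snapshot. It assumes bit 0 comes from the tap nearest the input (the most recent sample) and bit 41 from the far end of the chain (the oldest sample); that bit ordering and the name `find_edges` are my assumptions, and in the FPGA this would be logic rather than software.

```python
TAPS = 42


def find_edges(word, taps=TAPS):
    """Return a list of (tap_index, kind) for each transition in a snapshot.

    Bit i is assumed to be roughly one tap delay older than bit i - 1,
    so a newer 1 next to an older 0 means the input rose.
    """
    bits = [(word >> i) & 1 for i in range(taps)]
    edges = []
    for i in range(1, taps):
        newer, older = bits[i - 1], bits[i]
        if newer != older:
            kind = "rising" if newer == 1 else "falling"
            edges.append((i, kind))
    return edges


# Input went high about three taps (roughly 360 ps) before the clock edge.
print(find_edges(0b111))
```

Multiplying the tap index by the per-tap delay gives the approximate time of the edge before the sampling clock edge. Real snapshots will need some filtering first, because isolated wrong bits near a transition show up as extra edges.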
This is stretching the limits of what can be done with an inexpensive FPGA. It's not a beginner FPGA project!