Be mindful of how many levels of logic your HDL will produce between registers. With a slow clock you can get away with several levels, but a fast clock may need to be limited to 2 or even 1 level per clock cycle to meet timing. Slow/medium/fast is, of course, relative to the technology you are using. Also be wary of how you access RAM and DSP blocks: they sit at fixed locations in the FPGA, so you will always incur a routing penalty to reach them. Minimise the logic levels on RAM/DSP inputs and outputs.
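To get a feel for how quickly the logic-level budget shrinks as the clock speeds up, here is a rough back-of-envelope sketch. The function name and every delay figure in it are made up for illustration and do not come from any datasheet; real numbers come from your device's timing model and the STA report.

```python
def max_logic_levels(clock_mhz, t_clk_to_q_ns, t_setup_ns, t_level_ns,
                     t_fixed_route_ns=0.0):
    """Estimate how many logic levels fit in one clock period.

    All delay arguments are hypothetical, illustrative values:
      t_clk_to_q_ns    - register clock-to-output delay
      t_setup_ns       - register setup time
      t_level_ns       - delay of one logic level plus its local routing
      t_fixed_route_ns - extra routing penalty, e.g. to reach a RAM/DSP block
    """
    period_ns = 1000.0 / clock_mhz
    budget_ns = period_ns - t_clk_to_q_ns - t_setup_ns - t_fixed_route_ns
    if budget_ns <= 0:
        return 0
    # Only whole logic levels count, so round down.
    return int(budget_ns // t_level_ns)


# Illustrative numbers only.
slow = max_logic_levels(100, 0.5, 0.3, 1.0)
fast = max_logic_levels(500, 0.5, 0.3, 1.0)
fast_to_ram = max_logic_levels(500, 0.5, 0.3, 1.0, t_fixed_route_ns=0.8)
print(slow, fast, fast_to_ram)
```

With these invented figures the slow clock leaves room for several levels, the fast clock for one, and the fast path into a RAM/DSP for none, which is why those paths usually want a register directly on the block's inputs and outputs.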
My general rules for fixing timing in an FPGA (and, AFAIK, these rules also apply reasonably well to ASIC):
1. Ensure all appropriate constraints are applied to the appropriate registers (multicycle, false paths, max delays, etc.).
2. If a path commonly fails, you probably need to change the HDL. Changing the HDL is almost always the easiest option. This may involve adding extra pipeline stages to give the fitter more options when placing registers, likely nearer to the RAMs/DSPs.
3. Set area constraints to try to force related logic close together.
4. For the last few paths (and you really don't want too many) that you cannot fix with 1, 2 or 3, overconstrain the path in the fitter so that you have a better chance of meeting timing in STA.
3 and 4 really should be reserved for when life is getting hard, as the seed sweeps and rebuilds needed to confirm your fix works take a lot of time to run. 1 and 2 are the quickest options.
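For rule 2, a quick way to size the HDL change is to estimate how many extra pipeline registers a failing path needs. This is only a sketch: the function name is hypothetical, and it assumes the logic can be split into roughly equal chunks, which real designs do not always allow.

```python
import math


def pipeline_registers_needed(total_levels, max_levels_per_stage):
    """Extra registers to insert so no stage exceeds max_levels_per_stage.

    Assumes (optimistically) that the logic can be divided evenly.
    """
    if max_levels_per_stage < 1:
        raise ValueError("max_levels_per_stage must be at least 1")
    stages = math.ceil(total_levels / max_levels_per_stage)
    # N stages need N - 1 registers between them; never go below zero.
    return max(stages - 1, 0)


# A path with 6 logic levels, on a clock that tolerates 2 levels per cycle.
print(pipeline_registers_needed(6, 2))
```

Each register added this way costs a cycle of latency, so the surrounding logic has to be able to tolerate that.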