For starters, it's almost never a hold problem, because
1) standard cell library designers generally aim for zero
hold time on their flops
2) clock-to-Q delay means data changes after the clock
edge by at least one FF delay (more if there is any useful
interstage logic, i.e. anything but a straight shift
register)
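
To put rough numbers on the two points above, here is a minimal sketch of a hold check in Python. The function name, the 0.3 ns clock-to-Q figure and the clock_skew term are my own illustrative assumptions, not values from any real library. Skew in particular is not mentioned above, but it is the usual way a zero-hold flop still ends up failing hold.

```python
def hold_slack(t_clk_q_min, t_logic_min, t_hold, clock_skew=0.0):
    """Hold slack in ns. Negative means a hold violation.

    Data launched by one flop must not reach the next flop until
    t_hold after that flop's clock edge. clock_skew is how much
    later the capturing flop sees the clock than the launching one.
    """
    earliest_data_change = t_clk_q_min + t_logic_min
    return earliest_data_change - (t_hold + clock_skew)


# Hypothetical numbers: zero-hold flop, straight shift register
# (no logic between flops), no skew.
shift_reg = hold_slack(t_clk_q_min=0.3, t_logic_min=0.0, t_hold=0.0)

# Same path, but the capturing flop's clock arrives 0.5 ns late.
skewed = hold_slack(t_clk_q_min=0.3, t_logic_min=0.0, t_hold=0.0,
                    clock_skew=0.5)

print("shift register hold slack positive:", shift_reg > 0)
print("skewed path hold slack positive:", skewed > 0)
```

With zero hold and no skew the clock-to-Q delay alone gives margin, which is why hold is rarely the culprit.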
Setup, however, can easily be eaten up by deep or poorly
chosen logic between FFs. I've had to fight this to the
point of hand-designing special gates and bare-clocked FFs,
and running Spectre on extracted layouts for timing closure,
and that was for "only" a 400MHz chip-wide clock (albeit in
0.5um, where nobody, including me, thought it could be
done). I even had to invent "clock pullback" circuits to
resynchronize some logic that was just too slow and
needed its own little "clock ghetto" to catch the late
data, which then had to be brought back into the main domain.
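
For a feel of how little room there is at 400MHz, here is a rough setup budget sketched in Python. Only the 400MHz clock comes from the story above. The clock-to-Q, setup and logic delay numbers are made-up placeholders, and the function name is hypothetical.

```python
def setup_slack(f_clk_mhz, t_clk_q_max, t_logic_max, t_setup,
                clock_skew=0.0):
    """Setup slack in ns. Negative means a setup violation.

    Data must be launched, pass through the slowest logic path and
    be stable t_setup before the next clock edge. clock_skew here is
    how much earlier the capturing flop sees the clock.
    """
    period_ns = 1000.0 / f_clk_mhz  # 400 MHz gives a 2.5 ns period
    return period_ns - (t_clk_q_max + t_logic_max + t_setup + clock_skew)


# Hypothetical flop: 0.5 ns clock-to-Q, 0.3 ns setup.
# That leaves about 1.7 ns for all the logic between flops.
shallow = setup_slack(400, t_clk_q_max=0.5, t_logic_max=1.5, t_setup=0.3)
deep = setup_slack(400, t_clk_q_max=0.5, t_logic_max=2.0, t_setup=0.3)

print("1.5 ns of logic meets setup:", shallow > 0)
print("2.0 ns of logic meets setup:", deep > 0)
```

A path that blows the budget like the second one is the kind that needs faster gates or its own delayed capture clock.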
Temperature is another tool to try. Setup issues that
come from logic stage delays between flops will show
up at high temp first. If the problem instead goes away
at high temp, you might then suspect a hold problem after
all.
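
The temperature test boils down to a small decision table. Here it is as a Python sketch. The function and its two flags are hypothetical names of mine, and "cold" just means whatever lower temperature you are comparing against. It assumes gate delays grow with temperature, as in the reasoning above.

```python
def suspect_from_temperature(fails_hot: bool, fails_cold: bool) -> str:
    """Guess the likely timing culprit from a hot/cold comparison.

    Assumes logic gets slower as temperature rises, so setup margin
    shrinks when hot and hold margin shrinks when cold.
    """
    if fails_hot and not fails_cold:
        return "setup"         # slow logic misses the capture edge
    if fails_cold and not fails_hot:
        return "hold"          # problem goes away when logic slows down
    if fails_hot and fails_cold:
        return "inconclusive"  # temperature alone does not separate them
    return "no failure seen"


print(suspect_from_temperature(fails_hot=True, fails_cold=False))
print(suspect_from_temperature(fails_hot=False, fails_cold=True))
```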
You might also return to simulations and apply heavier net
loading, skewed corners, etc. to see if the failure can be
provoked there.