Since there will certainly be things you cannot model, you
should use worst-case (as far as is sensible) timing models
until you can.
Are you at liberty to redesign the DFFs for your interests? In
my experience standard cell libraries are not made to push
performance; DFFs have internal clock-buffer stages that add
to CLK-Q delay (this is probably your main problem), and the
library may have made choices you don't like in the balance
between setup and hold requirements (you would prefer zero
setup and would eat some hold time, since you appear to have
plenty). Making a "bare clocked" DFF that receives aligned,
complementary CK,CKb will shave two inverter delays from
the DFF (perhaps more, if your clock-pair has robust drive).
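As a rough back-of-envelope of what that buys, here is a sketch with purely illustrative delay numbers (not from any real library) — the point is only that the two internal clock inverters drop out of the CLK-Q path when aligned CK/CKb are supplied externally:

```python
# Hypothetical timing budget for a conventional vs. "bare clocked" DFF.
# All delay numbers are illustrative assumptions, not library data.

INV_DELAY_PS = 15      # assumed delay of one internal clock inverter
LATCH_CLK_Q_PS = 45    # assumed latch CLK-to-Q once the clock edge arrives

def clk_q_conventional():
    # Standard-cell DFF: CK is inverted/buffered twice internally
    # before it reaches the transmission gates.
    return 2 * INV_DELAY_PS + LATCH_CLK_Q_PS

def clk_q_bare_clocked():
    # "Bare clocked" DFF: aligned CK and CKb arrive from the clock
    # tree, so the two internal inverter delays vanish from CLK-Q.
    return LATCH_CLK_Q_PS

print(clk_q_conventional())   # 75 ps
print(clk_q_bare_clocked())   # 45 ps: two inverter delays saved
```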
Downside is, you get to build a complementary clock tree.
TINV-based DFFs (which are the norm) have built-in
contention: in the moment of switching, the forward and
feedback chains oppose each other until the new logic state
wraps back around. This is usually dealt with by making the
feedback scrawny and the forward path stout. Depending on
which transition binds you up, you could benefit from either
rebalancing, or from going even further imbalanced.
You can play with "lag clocks" and "clock pullback" when
designing pipelined logic. For example, rather than figuring
a wide, complex carry term in one "bite", figure two
"precursor" terms that are correct one cycle earlier, and do
only the final combine ahead of the final clock; that
flattens the last stage so it can use faster gates and still
meet setup. This works well for predictable things like
counters; for random-walk state machines, not so much.
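The precursor idea above can be sketched in software. This is a hypothetical 16-bit example (widths and names invented for illustration): cycle N-1 computes per-half generate/propagate precursors, and cycle N does only the shallow final combine:

```python
# "Precursor" decomposition of a wide carry, split into two 8-bit
# halves. The precursors are correct one cycle early; the final
# combine is only two levels of AND-OR logic.

MASK8 = 0xFF

def half_precursors(a, b):
    """Generate (carry-out with cin=0) and propagate (carry-out
    with cin=1) for one 8-bit half — the two precursor terms."""
    g = ((a + b) >> 8) & 1        # carry out assuming cin = 0
    p = ((a + b + 1) >> 8) & 1    # carry out assuming cin = 1
    return g, p

def wide_carry(a16, b16, cin):
    a_lo, a_hi = a16 & MASK8, (a16 >> 8) & MASK8
    b_lo, b_hi = b16 & MASK8, (b16 >> 8) & MASK8
    # Cycle N-1: these four precursor bits get registered.
    g_lo, p_lo = half_precursors(a_lo, b_lo)
    g_hi, p_hi = half_precursors(a_hi, b_hi)
    # Cycle N: shallow final combine, fast gates, easy setup.
    c8 = g_lo | (p_lo & cin)      # carry into the high half
    c16 = g_hi | (p_hi & c8)      # carry out of the full word
    return c16

# Cross-check against plain addition:
for a, b, cin in [(0xFFFF, 0x0001, 0), (0x1234, 0x5678, 1)]:
    assert wide_carry(a, b, cin) == ((a + b + cin) >> 16) & 1
```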
You can have staggered (lagged / advanced) clock branches,
if you're willing to do the design work, and the further
work of getting the STA to comprehend it all.
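The bookkeeping the STA has to do for a lagged capture clock looks roughly like this — a minimal setup/hold slack sketch with illustrative numbers (picoseconds, all assumed), showing how lagging the capture edge buys setup margin while eating hold margin:

```python
# Minimal setup/hold slack model with a deliberately skewed
# capture clock. All numbers are illustrative assumptions.

T_CLK = 500      # clock period
CLK_Q = 75       # launch flop CLK-to-Q
T_SETUP = 30     # capture flop setup requirement
T_HOLD = 20      # capture flop hold requirement

def setup_slack(comb_delay, capture_skew):
    # Lagging the capture clock (positive skew) buys setup margin.
    return (T_CLK + capture_skew) - (CLK_Q + comb_delay + T_SETUP)

def hold_slack(comb_delay_min, capture_skew):
    # ...but the same lag eats hold margin on the same edge.
    return (CLK_Q + comb_delay_min) - (capture_skew + T_HOLD)

print(setup_slack(comb_delay=420, capture_skew=0))     # -25: fails setup
print(setup_slack(comb_delay=420, capture_skew=50))    # +25: lag rescues it
print(hold_slack(comb_delay_min=10, capture_skew=50))  # +15: hold still OK
```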