Please read Chapter 5, "Coding for FPGA Device Flow", of UG626, the Synthesis and Simulation Design Guide.
Its Pipelining section explains why additional registers improve performance.
In summary:
Before pipelining:
The clock speed is limited by:
• The clock-to-out time of the source flip-flop
• The logic delay through four levels of logic
• The routing associated with the four function generators
• The setup time of the destination register
After pipelining:
The clock speed is limited by:
• The clock-to-out time of the source flip-flop
• The logic delay through one level of logic
• One routing delay
• The setup time of the destination register
In this example, the system clock runs much faster after pipelining than before pipelining.
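To make the comparison concrete, here is a rough Python sketch of the arithmetic behind the two lists. The delay values and the helper names (`min_period_ns`, `fmax_mhz`) are illustrative assumptions, not figures from UG626 or any device data sheet; real numbers come from the timing report for your part and speed grade.

```python
# Rough timing model: the minimum clock period is the sum of the delays
# on the register-to-register path listed above.
# All delay values below are ASSUMED for illustration only.

def min_period_ns(t_clk_to_out, t_logic, t_route, levels, t_setup):
    """Clock-to-out + (logic + routing) per level of logic + setup, in ns."""
    return t_clk_to_out + levels * (t_logic + t_route) + t_setup

def fmax_mhz(period_ns):
    """Convert a minimum period in ns to a maximum frequency in MHz."""
    return 1000.0 / period_ns

T_CLK_TO_OUT = 0.5  # assumed source flip-flop clock-to-out time, ns
T_LOGIC = 0.4       # assumed delay of one function generator (LUT), ns
T_ROUTE = 0.8       # assumed routing delay per level of logic, ns
T_SETUP = 0.3       # assumed destination register setup time, ns

# Before pipelining: four levels of logic between the two registers.
before = min_period_ns(T_CLK_TO_OUT, T_LOGIC, T_ROUTE, 4, T_SETUP)
# After pipelining: one level of logic and one routing delay per stage.
after = min_period_ns(T_CLK_TO_OUT, T_LOGIC, T_ROUTE, 1, T_SETUP)

print(f"before: {before:.1f} ns, {fmax_mhz(before):.0f} MHz")
print(f"after:  {after:.1f} ns, {fmax_mhz(after):.0f} MHz")
```

With these assumed values the period drops from about 5.6 ns to about 2.0 ns. The clock-to-out and setup terms are paid once per stage either way, so the gain comes entirely from shortening the logic and routing between registers.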