If you have a very long timing path, you can split it into several pipeline stages by adding registers. But doing so costs extra clock cycles, and you may not get any increase in performance.
Now let's consider two examples:
1. Your inputs depend on your outputs. That means a valid output must be available before the next input can be processed. In this case one additional pipeline stage shortens your critical timing path, but because you cannot process an input before the previous output is valid, every input costs one additional clock cycle. For example, suppose you have 10 bytes of data and your module reads one byte per clock cycle, modifies it, and puts it on the output. Without the pipeline register you finish processing after 10 clock cycles; with it you need 20 clock cycles to process all the data.
2. Your inputs don't depend on your outputs. That means you can process the next input without waiting for a valid output. The first byte reaches the output after 2 clock cycles because of the one pipeline stage, but the module does not wait for that output and can accept new data on every clock cycle. Now you need only 11 clock cycles to process all the data. The one additional clock cycle is called latency.
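The cycle counts in the two examples can be checked with a small calculation. This is only a sketch in Python, not hardware code; the function name `total_cycles` and its parameters are made up for illustration, and it assumes one item enters per clock cycle and each added register adds exactly one cycle of delay.

```python
def total_cycles(n_items: int, pipe_regs: int, feedback: bool) -> int:
    """Clock cycles needed to push n_items through the module.

    pipe_regs: number of pipeline registers added (0 = no pipelining).
    feedback:  True if each input needs the previous output first.
    """
    if n_items <= 0:
        return 0
    depth = pipe_regs + 1  # cycles for one item to travel from input to output
    if feedback:
        # The next item cannot start until the previous one has left the pipeline.
        return n_items * depth
    # A new item enters every cycle; the added registers only add latency once.
    return n_items + pipe_regs


if __name__ == "__main__":
    for regs in (0, 1):
        # Columns: registers added, cycles with feedback, cycles without feedback
        print(regs, total_cycles(10, regs, True), total_cycles(10, regs, False))
```

For 10 bytes this prints `0 10 10` and `1 20 11`, matching the numbers above. Whether the extra stage pays off then depends on how much faster the clock can run: if, say, the split let you exactly double the clock frequency, the 20 cycles of case 1 would take the same wall-clock time as the original 10, while the 11 cycles of case 2 would take roughly half.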