* Is there a specific reason for such a large hold margin, around 400 ps?
If you reduce it, the problem becomes simpler to solve, or at least needs fewer buffers.
* Did you check whether clock tree latency is present (or modelled) for the input port? Usually the capturing register has the clock tree values but the port does not.
* Check the clock tree of your capture register: how large is it, and is that a reasonable number or a huge one?
* You can either add delay in the data path (since you say there is hardly one cell in it, adding a delay should not be an issue) or reduce the register clock tree. The latter would have an impact in other places, so it should be touched only as a last resort, when your hands are totally tied.
* Check whether your input delay numbers (as mentioned) are reasonable.
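
To see how the points above interact, here is a rough sketch of the hold check on an input-port-to-register path. It assumes the launch edge at the port has zero clock latency (the unmodelled case described above) and treats the hold margin as an extra requirement on top of the register's hold time. The function names and all numbers are hypothetical and only illustrate the arithmetic; your timing tool's report is the real reference.

```python
def hold_slack_ps(input_delay_min, data_path_delay,
                  capture_clock_latency, hold_time, hold_margin):
    """Hold slack in ps for a port-to-register path (simplified model).

    Assumes the launch clock at the input port has zero latency,
    so the capture clock tree delay is not compensated.
    """
    # Earliest time the data reaches the register's D pin.
    arrival = input_delay_min + data_path_delay
    # Data must stay stable until this time after the capture edge.
    required = capture_clock_latency + hold_time + hold_margin
    return arrival - required


def buffers_needed(slack_ps, buffer_delay_ps):
    """Number of delay buffers needed to bring a negative slack to >= 0."""
    if slack_ps >= 0:
        return 0
    # Integer ceiling division of the shortfall by the per-buffer delay.
    return -(slack_ps // buffer_delay_ps)


# Hypothetical numbers: 100 ps min input delay, 50 ps data path,
# 600 ps capture clock tree, 30 ps hold time, 80 ps per delay buffer.
for margin in (400, 100):
    slack = hold_slack_ps(100, 50, 600, 30, margin)
    print(margin, slack, buffers_needed(slack, 80))
# prints "400 -880 11" then "100 -580 8"
```

In this made-up case, cutting the margin from 400 ps to 100 ps saves three buffers, while the 600 ps capture clock tree remains the dominant term, which is why it is worth checking both the margin and the clock tree value before padding the data path.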