Quick update on this.
I reran CTS with only clock buffers, and no clock inverters, specified in set_clock_tree_references, since in the earlier run only the inverters were pruned and not the buffers. I expected a significant increase in the area of the CTS-introduced cells, but that wasn't the case. Surprisingly, both the worst-case global skew and the worst-case local skew are also smaller in the buffer-only tree.
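A buffer-only restriction of this kind would look roughly like the sketch below. The cell names are hypothetical placeholders rather than cells from the actual library, and the -references option assumes the IC Compiler form of the command.

```tcl
# Hypothetical cell names: substitute the clock buffers from your own library.
# Listing only buffers here keeps clock inverters out of the CTS reference list.
set_clock_tree_references -references {CLKBUFX2 CLKBUFX4 CLKBUFX8 CLKBUFX16}
```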
More importantly, the max capacitance violations are now gone, since the tool uses X16 clock buffers to drive the clock pins of the memory macros.
I have one question about this. In the earlier run, the clock inverters driving the macro clock pins did not drive any other cells. Now there are flip-flops in the same sink group as the memory macro. I understand that the tool accounts for the phase delay of the macro and balances skew against the other flops in the sink group, but I would like to know whether it is good design practice to have flops driven by the same clock buffer that drives a memory macro's clock pin.
Thanks,