I'd try a folded-cascode topology with gain boosting. If the output swing is not enough, drop the gain boosting and add a common-source output driver instead.
Thanks for the advice. I am afraid that a folded-cascode topology would not satisfy either the upper or the lower ICMR requirement (Vt,n = 0.7 V, Vt,p = -0.9 V).
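To make the ICMR concern concrete, here is a rough first-order check, assuming square-law devices and no body effect. Only the two threshold voltages come from the numbers above. The 3.3 V supply and the 0.2 V overdrives are placeholder assumptions, and the function name and parameters are made up for illustration, so substitute the real spec before drawing conclusions.

```python
def folded_cascode_icmr(vdd, vss, vt_in, vov_in, vov_tail, vov_fold, input_type):
    """First-order ICMR estimate for a folded-cascode input stage.

    vt_in    : threshold of the input pair (sign is ignored)
    vov_in   : overdrive of the input pair
    vov_tail : overdrive (Vdsat) of the tail current source
    vov_fold : overdrive of the current source at the folding node
    Returns (vin_min, vin_max) in volts.
    """
    vt = abs(vt_in)
    if input_type == "nmos":
        # Lower limit: the NMOS tail source must stay in saturation.
        vin_min = vss + vov_tail + vt + vov_in
        # Upper limit: the input device stays saturated while the folding
        # node sits one overdrive below VDD.
        vin_max = vdd - vov_fold + vt
    elif input_type == "pmos":
        # Upper limit: the PMOS tail source must stay in saturation.
        vin_max = vdd - vov_tail - vt - vov_in
        # Lower limit: the folding node sits one overdrive above VSS.
        vin_min = vss + vov_fold - vt
    else:
        raise ValueError("input_type must be 'nmos' or 'pmos'")
    return vin_min, vin_max


# Assumed values (NOT from the thread): 3.3 V single supply, 0.2 V overdrives.
VDD, VSS, VOV = 3.3, 0.0, 0.2

nmos_range = folded_cascode_icmr(VDD, VSS, 0.7, VOV, VOV, VOV, "nmos")
pmos_range = folded_cascode_icmr(VDD, VSS, -0.9, VOV, VOV, VOV, "pmos")

print("NMOS input pair: %.2f V to %.2f V" % nmos_range)
print("PMOS input pair: %.2f V to %.2f V" % pmos_range)
```

Under these assumptions, an NMOS input pair reaches past the upper rail but stops roughly Vt,n plus two overdrives above the lower rail, and a PMOS input pair does the mirror image. A single input pair therefore covers one end of the range but not both, which is the limitation I am worried about.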