You need to take advantage of external resources and activity, first knowing what these are, what they offer and what they preclude.
Clocked comparators can embed auto-zero in the latched phase and get to <100uV easily. Continuous-time comparators make it tougher to hide and fold in any offset cancellation, although you could (say) ping-pong two comparators, nulling one while the other is active and swapping them on output transitions.
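To make the ping-pong idea concrete, here is a rough behavioral sketch in Python. It is not a circuit and not anyone's actual implementation: the class names, the 50uV residual after nulling, and the per-step drift term are all assumptions made up for illustration. The idle comparator is re-nulled every step, and the two swap roles whenever the output changes state.

```python
from dataclasses import dataclass


@dataclass
class Comparator:
    """Hypothetical behavioral model of a comparator with input-referred offset."""
    raw_offset: float        # volts; illustrative value, not from any real part
    correction: float = 0.0  # trim stored by the last auto-zero

    def auto_zero(self, residual: float = 50e-6) -> None:
        # Assumed: nulling leaves a small residual rather than cancelling perfectly.
        self.correction = self.raw_offset - residual

    def effective_offset(self) -> float:
        return self.raw_offset - self.correction

    def compare(self, vin: float, vref: float) -> bool:
        return (vin - vref + self.effective_offset()) > 0.0


class PingPong:
    """One comparator drives the output while the other is being nulled."""

    def __init__(self, a: Comparator, b: Comparator) -> None:
        self.active, self.idle = a, b
        self.active.auto_zero()  # assumed: both are nulled once at start-up
        self.idle.auto_zero()
        self.last = None

    def step(self, vin: float, vref: float, drift: float = 0.0) -> bool:
        # Both offsets drift; only the idle unit gets re-nulled this step.
        for c in (self.active, self.idle):
            c.raw_offset += drift
        self.idle.auto_zero()
        out = self.active.compare(vin, vref)
        if self.last is not None and out != self.last:
            # Output transition: hand over to the freshly nulled comparator.
            self.active, self.idle = self.idle, self.active
        self.last = out
        return out


if __name__ == "__main__":
    pp = PingPong(Comparator(5e-3), Comparator(-3e-3))
    for vin in (1.0002, 0.9998, 1.0002):
        print(vin, pp.step(vin, vref=1.0, drift=1e-6))
```

In this toy model the active comparator accumulates drift until the next output transition, so a real design with a slow-moving input would probably also want a timeout-based swap.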
Expecting testability at a 1mV limit spec when embedded in a SoC is probably misguided.
Point is, basic CMOS analog design is only the smallest part of the picture. You have to fit into, and make best use of, your application environment.