The phase margin problems probably come from the fact that you are using a Miller op-amp for the error amp.
When you increase the PMOS, you increase the capacitive load (the pass device gate) seen by the op-amp. A bigger load cap pulls the op-amp's output pole down in frequency, so the amp needs more Miller compensation to keep the same phase margin. All you need to do to drive a bigger pass device is to increase the compensation.
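To put rough numbers on that, here is a minimal sketch using the textbook two-pole approximation for a Miller amp (unity-gain frequency gm1/Cc, second pole near gm2/Cload, zero ignored). The function name and all component values are made up for illustration and are not from your design.

```python
import math

def miller_phase_margin_deg(gm1, gm2, c_comp, c_load):
    """Approximate phase margin of a two-stage Miller amp.

    Two-pole model, zero ignored, assumes c_load is much larger than c_comp.
    gm1: input-stage transconductance (S), gm2: output-stage transconductance (S).
    """
    gbw = gm1 / c_comp   # unity-gain frequency in rad/s
    p2 = gm2 / c_load    # non-dominant (output) pole in rad/s
    return 90.0 - math.degrees(math.atan(gbw / p2))

# Hypothetical values: 100 uS input stage, 1 mS drive transistor.
pm_small_pass = miller_phase_margin_deg(100e-6, 1e-3, 5e-12, 20e-12)    # about 68 degrees
pm_big_pass = miller_phase_margin_deg(100e-6, 1e-3, 5e-12, 100e-12)     # about 27 degrees
pm_recompensated = miller_phase_margin_deg(100e-6, 1e-3, 25e-12, 100e-12)  # back to about 68

print(pm_small_pass, pm_big_pass, pm_recompensated)
```

In this simple model, a 5x bigger gate cap needs a 5x bigger comp cap to get back to the same margin, which is why the series resistor is worth adding.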
To do this efficiently, put a resistor in series with your compensation cap. The resistor moves the right-half-plane zero that comes from feedforward through the comp cap around the drive transistor; once the resistor is larger than 1/gm, that zero lands in the left half-plane and helps phase margin instead of hurting it. The resistor value should be about 3x 1/gm of your drive transistor (the output FET of the op-amp, not the pass device). I usually end up at 2k-10k, in series with the comp cap. Your comp cap can then be much smaller for the same phase margin.
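A quick way to see where the zero goes is the standard expression z = 1 / (Cc * (1/gm - Rz)). The sketch below just evaluates it; the helper name and the example values (1 mS drive transistor, 5 pF comp cap) are assumptions for illustration only.

```python
import math

def nulling_zero_hz(gm_drive, c_comp, r_z):
    """Zero frequency in Hz for a Miller cap with a series resistor.

    Positive result: right-half-plane zero (costs phase margin).
    Negative result: left-half-plane zero (adds phase margin).
    Infinite result: zero pushed out to infinity (r_z equals 1/gm_drive).
    """
    denom = c_comp * (1.0 / gm_drive - r_z)
    if denom == 0.0:
        return math.inf
    return 1.0 / (2.0 * math.pi * denom)

# Hypothetical drive transistor with gm = 1 mS, so 1/gm = 1k.
z_no_resistor = nulling_zero_hz(1e-3, 5e-12, 0.0)    # about +31.8 MHz, right half-plane
z_three_over_gm = nulling_zero_hz(1e-3, 5e-12, 3e3)  # about -15.9 MHz, left half-plane

print(z_no_resistor / 1e6, z_three_over_gm / 1e6)
```

With gm around 0.3 mS to 1.5 mS, 3x 1/gm works out to the 2k-10k range mentioned above.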
Another idea would be to use a folded cascode error amp instead of a Miller amp. Now your main pole is set by the load cap C and the Rout of the cascode amp, at 1/(Rout*C). Driving a bigger cap makes you MORE stable. That would be my choice for your project.
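Here is a rough sketch of why the bigger cap helps with a single-stage amp. It looks only at the error amp driving the gate cap: unity-gain frequency gm/C, with one internal (folding node) pole that does not move with the load. The LDO output pole is ignored, and the function name and all values are hypothetical.

```python
import math

def cascode_phase_margin_deg(gm_in, c_gate, p_internal):
    """Approximate phase margin of a single-stage (folded cascode) amp.

    gm_in: input transconductance (S), c_gate: load cap at the output (F),
    p_internal: non-dominant internal pole in rad/s, assumed fixed.
    """
    gbw = gm_in / c_gate  # unity-gain frequency in rad/s, drops as c_gate grows
    return 90.0 - math.degrees(math.atan(gbw / p_internal))

# Hypothetical amp: gm = 1 mS, internal pole at 1e8 rad/s.
pm_small_gate = cascode_phase_margin_deg(1e-3, 10e-12, 1e8)   # about 45 degrees
pm_big_gate = cascode_phase_margin_deg(1e-3, 100e-12, 1e8)    # about 84 degrees

print(pm_small_gate, pm_big_gate)
```

The trade is bandwidth: the same cap that buys the margin also lowers the unity-gain frequency by the same factor.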
DMOS is very powerful for its size, and therefore very efficient. In my opinion it is better for switching applications than for linear ones. A lateral DMOS might be good because you don't need to use the substrate, but most vertical DMOS I've seen use the substrate, which you don't want connected to the outside world, except to GND.
I think PMOS is a good choice for you. Gabriel Alfonso Rincon Mora wrote his thesis on LDOs, and I think you will get a lot of ideas from it. He also talks about body boosting, which for a PMOS means drawing a tiny amount of current out of the body. Vt is lowered and the output FET seems more powerful. I have only seen one successful implementation of this, so I urge you to stay away until you get the standard implementation working well. Drawing ANY current out of the body puts you at high risk for latchup, which is really bad when a 1A MOSFET latches up.
My suggestion is to increase the PMOS to 1.2mOhm at your lowest dropout spec. Then increase the compensation on the Miller amp until it is stable at cold temp with fast models. Or switch your error amp to a folded cascode, and the big cap (PMOS gate) should give you a pretty stable system.