A couple of other reasons:
PN forward voltage (once you're up out of the leakage /
recombination floor) is very well controlled and varies not
much. PN forward voltage is log-linear across a wide
range of current densities - key, because the whole idea
is to match devices operated at better than 10:1 different
current density and get a delta-V fairly independent of
the particular current.
MOSFETs are tougher to get good matching at different
current densities. You don't use width ratio because of
edge effects, and subthreshold slope will change a lot
if you have ionic contamination, radiation or just a bit
different gate oxidation cycle / cleans. Surface state
density is tough to measure so there's little process
control for it; "digital" technologies tend to not even
care about it, except as it impacts leakage.
Now delta-VT MOS references do have an interesting
play in the sub-bandgap reference niche (hard to get
a bandgap to work at all when Vdd <= 1.5V or so).
You could probably do some folding and get a delta-VT
reference, using low-threshold devices, to work below
a volt.
But if what you want is to run volume production at
minimum test-cost input and die size, that means a
design that runs tight to center with not much trim
range required (trim networks dominate old-timey
references', like LT1009, die size). And that in turn
wants the most stable, buried junction reference
element you can get (one un-influenced by oxide
qualities and their post-fab drift, e.g. hot carrier).