The two clocks from the PLL are in quadrature (orthogonal), i.e. one of them is 90 degrees out of phase with the other. So XOR-ing them gives a 2x clock w/ a good (not perfect) duty cycle. If we didn't do it this way, we would need a PLL w/ a 4x VCO and divide it down by 2 to get a 50-50 clock. But then the PLL divide ratio would be too high, as the external reference is too low.
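To illustrate why the duty cycle is good but not perfect, here is a rough Python sketch (not our actual design flow). It assumes ideal 50% duty input clocks and a zero-skew XOR gate; the function names and the one-sample-per-degree resolution are just made up for the illustration.

```python
def square_wave(i, n, phase_deg=0):
    """Ideal 50% duty clock, n samples per period, delayed by phase_deg degrees."""
    shift = n * phase_deg // 360
    return 1 if (i - shift) % n < n // 2 else 0


def xor_of_clocks(phase_deg, n=360):
    """XOR the 0-degree clock with a copy delayed by phase_deg over one 1x period.

    Returns (rising_edges_per_1x_period, duty_cycle) of the XOR output.
    """
    out = [square_wave(i, n) ^ square_wave(i, n, phase_deg) for i in range(n)]
    # out[i - 1] wraps to the last sample when i == 0, matching a periodic signal.
    rising = sum(1 for i in range(n) if out[i] == 1 and out[i - 1] == 0)
    duty = sum(out) / n
    return rising, duty


if __name__ == "__main__":
    print(xor_of_clocks(90))  # ideal quadrature
    print(xor_of_clocks(80))  # 10 degrees of phase error
```

With exactly 90 degrees the XOR output has two rising edges per 1x period and a 50% duty cycle. With 80 degrees it is still a 2x clock, but the duty cycle drops to about 44%, so under these assumptions any quadrature phase error shows up directly as duty-cycle error.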
So we decided to use this XOR-ing approach, although it's challenging because it requires a symmetrical gate and we have this STA issue. We care about the latency between the PLL and the XOR because one of the PLL outputs (1x) is also used, and the 2x and 1x clocks need to be synchronous.
We found a way, which is to define a 2x clock at the PLL output and propagate it through the XOR, just as you suggested. Thanks for the valuable input.