Another option is to insert lock-up latches between flops from different clock domains when you stitch them into one scan chain. The latch adds delay at the domain crossing, so the shifted data does not violate hold at the capturing flop. If the latency needs to be balanced for some functional reason, delay insertion is again required.
The first option (forcing balanced clock latency in the Clock Tree Synthesis tool) is better, of course.
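
To see why the lock-up latch helps, here is a rough sketch of the hold check at a scan-chain domain crossing. It is only an illustrative model, not output from any tool: the function names, the 50% duty cycle, and all delay values are assumptions. It assumes a negative-level latch clocked by the launch domain's clock, so data is released only at the falling edge of that clock.

```python
def hold_slack(launch_latency, capture_latency, clk_to_q, path_delay, hold_time):
    """Hold slack (ns) for a direct flop-to-flop scan connection.

    Data must arrive after the capture clock edge plus the hold time.
    A negative result means a hold violation.
    """
    arrival = launch_latency + clk_to_q + path_delay
    required = capture_latency + hold_time
    return arrival - required


def hold_slack_with_lockup(launch_latency, capture_latency, period,
                           latch_d_to_q, path_delay, hold_time):
    """Hold slack (ns) with a lock-up latch after the launch flop.

    Assumption: the latch is transparent while the launch clock is low,
    so new data leaves it half a period after the launch edge
    (50% duty cycle assumed).
    """
    arrival = launch_latency + period / 2 + latch_d_to_q + path_delay
    required = capture_latency + hold_time
    return arrival - required


# Hypothetical numbers: the capture clock arrives 0.8 ns later than the launch clock.
direct = hold_slack(launch_latency=1.0, capture_latency=1.8,
                    clk_to_q=0.2, path_delay=0.1, hold_time=0.05)
latched = hold_slack_with_lockup(launch_latency=1.0, capture_latency=1.8,
                                 period=10.0, latch_d_to_q=0.15,
                                 path_delay=0.1, hold_time=0.05)
print(f"direct: {direct:.2f} ns, with lock-up latch: {latched:.2f} ns")
```

In this model the direct connection fails hold because the capture clock is the later one, while the latch moves the data launch by half a cycle. Balancing the two latencies in CTS removes the skew itself, which is why it is the cleaner fix.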