If you try to imagine the design required to do it any other way, I think you will realize it is the way it is because it makes sense (i.e., it requires the least total silicon area, memory plus controller). Here is what I concluded.
For writes, the device would need to delay the DQS on each lane to center it in the DQ eye. But since the DQS is an intermittent clock (it only toggles during a burst), it's not possible to use a PLL for that delay; you would have to use DLL-type logic, which is very area intensive. By pushing the delay to the controller you only need one PLL/DLL delay block, which keeps the total system silicon area lowest.
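To put a rough number on that write delay, here is a small sketch. It assumes the ideal case where centering the strobe in the data eye means a 90-degree shift, i.e. a quarter of the clock period on a double-data-rate bus. The function name and the example clock rates are just my own illustration, not taken from any datasheet.

```python
def dqs_center_delay_ns(clock_mhz: float) -> float:
    """Nominal delay (ns) that moves an edge-aligned DQS to mid-bit.

    Assumption: ideal DDR bus, so one bit time is half a clock period
    and the middle of the bit is a quarter period after the edge.
    """
    period_ns = 1000.0 / clock_mhz
    return period_ns / 4.0


# Example clock rates chosen only for illustration.
for clock_mhz in (133.0, 200.0, 400.0):
    delay = dqs_center_delay_ns(clock_mhz)
    print(f"{clock_mhz:.0f} MHz clock: about {delay:.3f} ns of DQS delay")
```

A real delay block also has to track process, voltage, and temperature, which is part of why it costs area.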
For reads, I think the device could have created a delayed DQS without too much trouble, in which case the controller timing could be fixed. But I think fixed timing would be too difficult to make work at higher speeds, so it needs to be adjustable. Since that adjustability is required in the controller anyway, there is no point in also adding the delay to the device. Leaving that delay out of the device keeps the total system silicon area lower.
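As a sketch of what "adjustable" could look like on the controller side: one common approach is to sweep the read DQS delay across its taps, note which settings read a known pattern back correctly, and park the delay in the middle of the widest passing window. This is only my assumption about how a controller might do it. `read_ok` is a hypothetical callback standing in for "set this tap and check a test read", and the tap count is made up.

```python
from typing import Callable, Optional


def find_eye_center(read_ok: Callable[[int], bool], num_taps: int) -> Optional[int]:
    """Return the tap in the middle of the widest run of passing taps.

    read_ok(tap) is a hypothetical hook: it should apply the delay tap
    and report whether a known data pattern was read back correctly.
    Returns None if no tap passes.
    """
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for tap in range(num_taps):
        if read_ok(tap):
            if run_len == 0:
                run_start = tap  # a new passing window begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # window closed
    if best_len == 0:
        return None
    return best_start + (best_len - 1) // 2


def simulated_read_ok(tap: int) -> bool:
    # Pretend the data eye is open for taps 10 through 21 only.
    return 10 <= tap <= 21


print(find_eye_center(simulated_read_ok, num_taps=32))
```

With the simulated eye above, the sweep settles on a tap near the middle of the 10 to 21 window. The point is that this logic lives once in the controller rather than in every memory device.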
Since all my designs have been dedicated discrete SDRAM designs, there are probably more issues I haven't thought about that come up in designs using multiple DIMM modules with different-length lanes, busing, and termination.
Hope that answers your question.
Ray