I was referring to the case where 4 words are read simultaneously even though they are located in different cache lines. Frankly, I think it's a bit lame if it takes multiple cycles to obtain the data from the cache. That's not much different from 4 load instructions each fetching a word from the cache in the pipeline and storing the data into a 128-bit register, especially now that superscalar and out-of-order execution are common. It seems simpler to me to just introduce a load instruction that comes with a write mask for a 128-bit register.
I'm afraid it's very different from sequential load/store.
The next major Intel architecture will add Fused Multiply-Add (FMA) support. It will have two 256-bit FMA units, for a total of 32 single-precision floating-point operations per clock cycle (2 units × 8 elements × 2 operations). In contrast, indirectly addressing eight 32-bit elements takes 24 micro-operations, so the effective throughput of sequential load/store is 48 times lower! Of course not every parallel load/store requires arbitrary offsets, but even a few of these emulated gather/scatter sequences can clearly drag performance down very quickly.
It takes 24 micro-operations per gather or scatter because each offset also has to be extracted individually. Being able to send the entire vector of offsets to the load/store unit would bring it down to essentially 8 cycles.
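To make the cost concrete, here's a plain C sketch of what the emulated 8-element gather boils down to (the function name and types are my own, not from any real ISA or intrinsic):

```c
#include <stdint.h>

/* Hypothetical sketch of an emulated 8-wide gather using sequential
   loads.  Each iteration corresponds roughly to one extract-offset uop,
   one scalar load uop, and one insert-into-vector uop -- about
   3 uops x 8 elements = 24 uops, versus a single instruction if the
   load/store unit accepted the whole offset vector at once. */
static void gather8(float dst[8], const float *base, const int32_t idx[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = base[idx[i]];  /* one scalar load per element */
}
```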
But it can be a lot better still by taking advantage of data coherence, which can be very high for parallel workloads. For instance, the lookup tables used to implement transcendental functions can fit into one or two cache lines. Likewise, a large 3D matrix for finite-element simulations can be stored in blocks to improve the chances that neighboring elements fall within the same cache line.
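As an illustration of that blocked storage idea, here is a toy indexing function of my own devising (not from any particular simulation code) that packs a 3D grid into 2×2×2 blocks, so a cell and its immediate neighbors tend to land in the same cache line:

```c
#include <stddef.h>

/* Hypothetical sketch: map 3D coordinates (x, y, z) into a layout of
   2x2x2 blocks of floats (8 x 4 bytes = 32 contiguous bytes per block),
   so gathers of neighboring cells touch few cache lines.  This toy
   version assumes the grid dimensions nx and ny are multiples of 2. */
static size_t blocked_index(size_t x, size_t y, size_t z,
                            size_t nx, size_t ny)
{
    size_t bx = x >> 1, by = y >> 1, bz = z >> 1;  /* block coordinates */
    size_t ox = x & 1,  oy = y & 1,  oz = z & 1;   /* offset within block */
    size_t block = (bz * (ny >> 1) + by) * (nx >> 1) + bx;
    return block * 8 + (oz << 2) + (oy << 1) + ox; /* 8 cells per block */
}
```

With a plain row-major layout, the 8 cells of a 2×2×2 neighborhood span several rows and planes; here they occupy 8 consecutive slots.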
So by collecting elements from two cache lines per cycle, the throughput could be as high as one gather instruction every cycle. The average throughput will be lower, of course, but still much better than with sequential operations.
Note also that the list of applications that can benefit from this is endless. Every code loop with independent iterations can now be (automatically) vectorized. Gather/scatter is the parallel version of load/store; every other instruction already has a parallel version, so gather/scatter is the essential missing piece to enable a significant speedup for a wide range of applications.
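To show the scatter side of the pair, here's a sketch (again my own toy function, not a real intrinsic) of the kind of independent-iteration store loop that a scatter instruction would turn into one operation per 8 elements:

```c
#include <stdint.h>

/* Hypothetical sketch: the scatter counterpart of a gather -- each
   iteration is an independent indexed store, so the whole batch could
   map to a single scatter instruction.  Note that vectorizing this is
   only straightforwardly safe when the 8 indices are distinct;
   duplicate indices would need defined ordering semantics. */
static void scatter8(float *base, const int32_t idx[8], const float src[8])
{
    for (int i = 0; i < 8; i++)
        base[idx[i]] = src[i];  /* one scalar store per element */
}
```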
And I believe CISC processors like x86 work like RISC internally. All the complicated instructions are broken into multiple simple ones, and I don't see much benefit in instruction atomicity anymore unless the instruction in question is very unique and not easily replaceable by a set of other instructions.
Actually, the majority of x86 instructions translate into just one RISC-like micro-instruction. The vector ISA extensions in particular are relatively straightforward compared to the legacy x86 instructions. Something like a vector FMA instruction will be executed on a 256-bit wide execution unit, so sequential load/store of individual 32-bit elements is not really an option.
There might be other solutions, but a gather/scatter implementation at the load/store unit as I described would, as far as I'm aware, offer the best speedup/area compromise. I'm just hoping there's no design limitation that would make it infeasible. Your insight into this matter is much appreciated.