c0d1f1ed
Newbie level 4

Hi all,
I'm a computer engineer with a theoretical understanding of digital design, but most of the time I'm developing software. I believe gather/scatter instructions would be an incredibly valuable addition to an SIMD instruction set, but I wonder what's really feasible in hardware...
The architecture I'm analyzing already has a 128-bit vector load/store unit, so I'm curious whether this can realistically be turned into a 128-bit gather/scatter unit. The instruction would take a 64-bit base address and a vector of four 32-bit offsets from this base address, and load/store four 32-bit elements. If all the elements are stored within one cache line, I expect this instruction to take one cycle (throughput). If more cache lines are needed, they can be accessed in subsequent cycles, one line per cycle, until all elements are loaded/stored (so the worst case is 4 cycles). Note that this is not DMA gather/scatter, but parallel load/store between cache memory and registers.
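To make the intended semantics concrete, here is a rough software model of the gather (load) case. It is only a sketch under my own assumptions: a 64-byte cache line, 4-byte-aligned elements (so no single element straddles a line), and one cache line serviced per cycle. The names gather4_model and LINE_BYTES are made up for illustration, and memory is modelled as a plain byte array indexed by address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u  /* assumed cache line size */

/* Reference model of the proposed gather: loads four 32-bit elements
   from mem[base + offset[i]] into out[i].
   Returns the number of distinct cache lines touched, which equals the
   cycle count if one line is serviced per cycle (1 best case, 4 worst). */
static int gather4_model(const uint8_t *mem, uint64_t base,
                         const uint32_t offset[4], uint32_t out[4])
{
    uint64_t lines[4];
    int nlines = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t addr = base + offset[i];
        assert((addr & 3u) == 0);  /* assumption: aligned elements only */
        memcpy(&out[i], mem + addr, sizeof out[i]);

        /* Record the line number if no earlier element already used it. */
        uint64_t line = addr / LINE_BYTES;
        int seen = 0;
        for (int j = 0; j < nlines; j++)
            if (lines[j] == line)
                seen = 1;
        if (!seen)
            lines[nlines++] = line;
    }
    return nlines;
}
```

With offsets {0, 4, 8, 12} from a line-aligned base the model touches one line; with offsets {0, 64, 128, 192} it touches four, which is the worst case mentioned above.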
To my knowledge, this mainly requires the ability to check which elements are stored in the same cache line, compute four offsets into a cache line, and extract/insert four elements between a cache line and a register. The rest of the logic should already be largely in place, since the architecture allows unaligned loads/stores (i.e. they can straddle cache lines), and can keep multiple misses in flight.
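The per-cycle logic I have in mind could be sketched like this. Again this is just my own illustration, not an existing design: it assumes a 64-byte line and 4-byte-aligned elements, and gather_step, LINE_SHIFT and LINE_MASK are hypothetical names. Each call models one cycle: pick the lowest-numbered pending element, compare its line address against the other pending elements, and produce the byte offsets that would drive the extract/insert muxes.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SHIFT 6u                        /* assumed 64-byte cache line */
#define LINE_MASK  ((1u << LINE_SHIFT) - 1u)

/* One cycle of a hypothetical gather/scatter sequencer.
   addr[i]     : effective address of element i (base + offset[i])
   pending     : 4-bit mask of elements not yet serviced (nonzero)
   byte_off[i] : set to the offset within the line for each element
                 serviced this cycle; other entries are left untouched
   Returns the mask of elements serviced this cycle. */
static unsigned gather_step(const uint64_t addr[4], unsigned pending,
                            unsigned byte_off[4])
{
    assert(pending != 0 && pending < 16u);

    /* Priority encoder: lowest-numbered pending element picks the line. */
    int lead = 0;
    while (!(pending & (1u << lead)))
        lead++;
    uint64_t tag = addr[lead] >> LINE_SHIFT;

    /* Comparators: every pending element on the same line is serviced. */
    unsigned served = 0;
    for (int i = 0; i < 4; i++) {
        if ((pending & (1u << i)) && (addr[i] >> LINE_SHIFT) == tag) {
            served |= 1u << i;
            byte_off[i] = (unsigned)(addr[i] & LINE_MASK);
        }
    }
    return served;
}
```

The sequencer would repeat this with pending &= ~served until pending is zero, so the number of iterations is the number of distinct lines. In hardware the loop body corresponds to three tag comparators and four small offset extractions, plus a 4-way element mux per lane.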
Is this requirements analysis correct? If so, would it have any significant impact on area or timings?
Thanks for your time!
Nicolas
P.S.: Sorry if this is the wrong forum or even the wrong site to ask. Other suggestions to get my question answered are highly appreciated!