I was referring to the case where 4 words are read simultaneously even though they are located in different cache lines. Frankly, I think it's a bit lame if it takes multiple cycles to obtain the data from the cache. That's not much different from 4 load instructions each fetching a word from the cache in the pipeline and storing the data into a 128-bit register, especially now that superscalar and out-of-order execution are common. It seems simpler to me to just introduce a load instruction that comes with a write mask for a 128-bit register.
I'm afraid it's very different from sequential load/store.
The next major Intel architecture will add Fused Multiply-Add (FMA) support. It will have two 256-bit FMA units, for a total of 32 single-precision floating-point operations per clock cycle (2 units × 8 elements × 2 operations). In contrast, indirectly addressing eight 32-bit elements takes 24 micro-operations, so the effective throughput of sequential load/store is 48 times lower! Of course not every parallel load/store requires arbitrary offsets, but even a few of these emulated gather/scatter sequences can clearly drag performance down very quickly.
It takes 24 micro-operations per gather or scatter because each offset also has to be extracted individually. Being able to send the entire vector of offsets to the load/store unit would bring it down to essentially 8 cycles.
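To make the cost concrete, here's a plain C sketch of what the emulated 8-element gather boils down to (the function name and types are my own, not from any real ISA or intrinsic):

```c
#include <stdint.h>

/* Hypothetical sketch of an emulated 8-wide gather using sequential
   loads.  Each iteration corresponds roughly to one extract-offset uop,
   one scalar load uop, and one insert-into-vector uop -- about
   3 uops x 8 elements = 24 uops, versus a single instruction if the
   load/store unit accepted the whole offset vector at once. */
static void gather8(float dst[8], const float *base, const int32_t idx[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = base[idx[i]];  /* one scalar load per element */
}
```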
But it can be a lot better still by taking advantage of data coherence, which can be very high for parallel workloads. For instance, the lookup tables used to implement transcendental functions can fit into one or two cache lines. Likewise, a large 3D matrix for finite-element simulations can be stored in blocks to improve the chances that neighboring elements fall within the same cache line.
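As an illustration of that blocked storage idea, here is a toy indexing function of my own devising (not from any particular simulation code) that packs a 3D grid into 2×2×2 blocks, so a cell and its immediate neighbors tend to land in the same cache line:

```c
#include <stddef.h>

/* Hypothetical sketch: map 3D coordinates (x, y, z) into a layout of
   2x2x2 blocks of floats (8 x 4 bytes = 32 contiguous bytes per block),
   so gathers of neighboring cells touch few cache lines.  This toy
   version assumes the grid dimensions nx and ny are multiples of 2. */
static size_t blocked_index(size_t x, size_t y, size_t z,
                            size_t nx, size_t ny)
{
    size_t bx = x >> 1, by = y >> 1, bz = z >> 1;  /* block coordinates */
    size_t ox = x & 1,  oy = y & 1,  oz = z & 1;   /* offset within block */
    size_t block = (bz * (ny >> 1) + by) * (nx >> 1) + bx;
    return block * 8 + (oz << 2) + (oy << 1) + ox; /* 8 cells per block */
}
```

With a plain row-major layout, the 8 cells of a 2×2×2 neighborhood span several rows and planes; here they occupy 8 consecutive slots.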
So by collecting elements from two cache lines per cycle, the throughput could be as high as one gather instruction every cycle. The average throughput will be lower, of course, but still much better than with sequential operations.
Note also that the list of applications that can benefit from this is endless. Every code loop with independent iterations can now be (automatically) vectorized. Gather/scatter is the parallel version of load/store; every other instruction already has a parallel version, so gather/scatter is the essential missing piece to enable a significant speedup for a wide range of applications.
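To show the scatter side of the pair, here's a sketch (again my own toy function, not a real intrinsic) of the kind of independent-iteration store loop that a scatter instruction would turn into one operation per 8 elements:

```c
#include <stdint.h>

/* Hypothetical sketch: the scatter counterpart of a gather -- each
   iteration is an independent indexed store, so the whole batch could
   map to a single scatter instruction.  Note that vectorizing this is
   only straightforwardly safe when the 8 indices are distinct;
   duplicate indices would need defined ordering semantics. */
static void scatter8(float *base, const int32_t idx[8], const float src[8])
{
    for (int i = 0; i < 8; i++)
        base[idx[i]] = src[i];  /* one scalar store per element */
}
```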
And I believe CISC processors like x86 work like RISC internally. All the complicated instructions are broken into multiple simple ones, and I don't see much benefit in instruction atomicity anymore unless the instruction in question is very unique and not easily replaceable by a set of other instructions.
Actually, the majority of x86 instructions translate into just one RISC-like micro-instruction. The vector ISA extensions in particular are relatively straightforward compared to the legacy x86 instructions. Something like a vector FMA instruction will be executed on a 256-bit wide execution unit, so sequential load/store of individual 32-bit elements is not really an option.
There might be other solutions, but a gather/scatter implementation at the load/store unit as I described would, as far as I'm aware, offer the best speedup/area compromise. I'm just hoping there's no design limitation that would make it infeasible. Your insight into this matter is much appreciated.