How much slower is random access, really?

63 points

1/21/1970

4 days ago

by sestep

Comments

andersa

Note this is not true random access in the manner it occurs in most programs. By having a contiguous array of indices to look at, that array can be prefetched as it goes, and speculative execution will take care of loading many upcoming indices of the target array in parallel.

A more interesting example might be if each slot in the target array has the next index to go to in addition to the value, then you will introduce a dependency chain preventing this from happening.
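To make the difference concrete, here is a minimal Python sketch of the two access patterns. The helper names (`sattolo_cycle`, `sum_by_index_array`, `sum_by_chasing`) are made up for illustration, and Sattolo's algorithm is just one assumed way to build a next-index table that visits every slot exactly once. Python's interpreter overhead hides the hardware effect, so read this as a description of the data layout rather than a benchmark; you would need a compiled language to measure it.

```python
import random

def sattolo_cycle(n, seed=0):
    """Return a list `nxt` where following i -> nxt[i] walks one cycle over all n slots."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def sum_by_index_array(values, indices):
    """Independent loads: every index is known up front, so loads can overlap."""
    total = 0
    for i in indices:
        total += values[i]
    return total

def sum_by_chasing(values, nxt, start=0):
    """Dependent loads: the next index is only known after the current slot is read."""
    total = 0
    i = start
    for _ in range(len(values)):
        total += values[i]
        i = nxt[i]
    return total
```

Both functions touch every slot once in a random order and return the same sum; the only difference is whether the address of the next load depends on the result of the previous one.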

7 hours ago

wtallis

> A more interesting example might be if each slot in the target array has the next index to go to in addition to the value, then you will introduce a dependency chain preventing this from happening.

However, on some processors there's a data-dependent prefetcher that will notice the pointer-like value and start prefetching that address before the CPU requests it.

4 hours ago

hansvm

Fun fact, that's part of why parsing protobuf is so slow.

3 hours ago

delusional

> By having a contiguous array of indices to look at, that array can be prefetched as it goes

Does x86-64 actually do this data-dependent single-deref prefetch? Because in that case I have some design assumptions I need to reevaluate.

44 minutes ago

jiggawatts

This is why array random access and linked-list random access have wildly different performance characteristics.

Another thing I noticed is that the spike on the left hand side of his graphs is the overhead of file access.

Without this overhead, small array random access should have a lot better per-element cost.

6 hours ago

JonChesterfield

Random access is catastrophically slower because of the successive cache misses when the prefetcher fails to guess what you're doing.

One hint in the same article that random access is not cheap, in contrast with the conclusion, was noticing that the shuffle was unacceptably slow on large data sets.

Still, good to see performance measurements, especially where the curves look roughly like you'd hope them to.

31 minutes ago

Animats

If, of course, you have the CPU and its caches all to yourself.

37 minutes ago

porcoda

The RandomAccess (or GUPS) benchmark (see: https://ieeexplore.ieee.org/document/4100365) was looking at measuring machines on this kind of workload. In high performance computing this was important for graph calculations and was one of the things the Cray (formerly Tera) MTA machine was particularly good at. I suppose this benchmark wouldn’t be very widely known outside HPC circles.
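For anyone who hasn't seen it, a rough Python sketch of the kind of update loop this benchmark times is below. This is from memory and simplified: the shift-and-XOR random stream, the `POLY` constant, and the function names are my assumptions for illustration, not something taken from the linked paper, so check the reference implementation before relying on any detail.

```python
POLY = 0x7                 # assumed feedback polynomial for the random stream
MASK64 = (1 << 64) - 1     # keep values in 64 bits, as a C uint64_t would

def next_random(ran):
    """Advance a 64-bit shift-and-XOR pseudo-random stream by one step."""
    high_bit = ran >> 63
    ran = (ran << 1) & MASK64
    if high_bit:
        ran ^= POLY
    return ran

def random_access_updates(table, num_updates, seed=1):
    """XOR pseudo-random values into pseudo-random slots of a power-of-two table."""
    size = len(table)
    if size == 0 or size & (size - 1) != 0:
        raise ValueError("table length must be a power of two")
    ran = seed
    for _ in range(num_updates):
        ran = next_random(ran)
        table[ran & (size - 1)] ^= ran   # the random read-modify-write being measured
    return table

def gups(num_updates, seconds):
    """Giga-updates per second for a timed run."""
    return num_updates / seconds / 1e9
```

Because XOR is its own inverse, running the same update stream twice restores the original table, which gives a cheap correctness check.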

6 hours ago

jandrewrogers

I worked on the MTA architectures for years among several other HPC systems but I don’t remember this particular benchmark. I suspect it was replaced by the Graph500 benchmark. Graph500 measures something similar and was introduced only a few years after GUPS.

3 hours ago

porcoda

The HPCS benchmarks predated Graph500. They were talked about at SC for a few years in the early 2000s but mostly faded into the background. It’s hard to dig up the numbers for the MTA on RandomAccess, but the Eldorado paper from ‘05 by Feo and friends (https://dl.acm.org/doi/10.1145/1062261.1062268) mentions it and you can see the MTA beating the other popular architectures of the time in one of the tables.

2 hours ago

jandrewrogers

Feo was a major MTA stan and proponent, even years later. Honestly, it is probably my favorite computing architecture of all time despite the weaknesses of the implementation. It was extraordinarily efficient in some contexts. Few people could design properly optimized code for them though, which was an additional problem.

There were proofs of concept by 2010 that the latency-hiding mechanics could be implemented on CPUs in software, which while not as efficient had the advantage of cost and performance, which was a death knell for the MTA. A few attempts to revive that style of architecture have come and gone. It is very difficult to compete with the economics of mass-scale commodity silicon.

I hold out hope that a modern barrel processor will become available at some point but I’m not sanguine about it.

an hour ago

Adhyyan1252

Love this analysis! Was expecting random to be much slower. 4x is not bad at all

8 hours ago

Nevermark

There has to be some power hit for all those extra cache fills. No idea if it would be measurable.

an hour ago

o11c

Hm, no discussion of cache line size, page size, or the limits of cache associativity?

an hour ago

FpUser

I ran a different kind of experiment, evaluating the benefit of branch prediction on an AMD 9950X with a contiguous array of 1,000,000 elements. It calculated a sum, adding each element only if it was bigger than 125 (about 50% of 256). The difference between random and sorted data was 10x. I guess branch prediction plays a huge role as well.
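The shape of that experiment, as a minimal Python sketch. The function names and the timing helper are hypothetical, and the data is assumed to be random values in 0..255. CPython will not reproduce the 10x gap, since interpreter overhead dominates; the point is only to show which branch is data-dependent.

```python
import random
import time

def conditional_sum(data, threshold=125):
    """Sum only the elements strictly greater than `threshold`."""
    total = 0
    for x in data:
        if x > threshold:      # taken unpredictably on shuffled data, predictably on sorted
            total += x
    return total

def compare(n=1_000_000, seed=0):
    """Time the same sum over random-order and sorted copies of the same data."""
    rng = random.Random(seed)
    data = [rng.randrange(256) for _ in range(n)]
    ordered = sorted(data)

    start = time.perf_counter()
    conditional_sum(data)
    random_seconds = time.perf_counter() - start

    start = time.perf_counter()
    conditional_sum(ordered)
    sorted_seconds = time.perf_counter() - start
    return random_seconds, sorted_seconds
```

Both runs do the same arithmetic and get the same answer; only the order in which the branch outcomes arrive changes.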

4 hours ago

Andys

Thanks for sharing that.

Presumably if you'd split the elements into 16 shares (one for each CPU), summed with 16 threads, and then summed the lot at the end, then random would be faster than sorted?
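Something like the following Python sketch is what I have in mind. The names are hypothetical and the threshold of 125 is carried over from the experiment above. CPython's global interpreter lock means these threads won't actually run the sums in parallel, so this only shows the split-then-combine structure, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_shares(data, shares):
    """Split `data` into `shares` contiguous slices of near-equal length."""
    size, extra = divmod(len(data), shares)
    out = []
    start = 0
    for k in range(shares):
        end = start + size + (1 if k < extra else 0)
        out.append(data[start:end])
        start = end
    return out

def parallel_conditional_sum(data, shares=16, threshold=125):
    """Sum each share on its own worker, then sum the partial results."""
    def partial(chunk):
        return sum(x for x in chunk if x > threshold)

    with ThreadPoolExecutor(max_workers=shares) as pool:
        return sum(pool.map(partial, split_into_shares(data, shares)))
```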

4 hours ago

bee_rider

I don’t think random should be faster than contiguous access, if you parallelize both of them.

Although, it looks like that chip has a 1MB L2 cache for each core. If these are 4-byte ints, then I guess they won't all fit in one core's L2, but maybe they can all start out in their respective cores' L2 if it is parallelized (well, depends on how you set it up).

Maybe it will be closer. Contiguous should still win.
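The back-of-the-envelope arithmetic behind the L2 comment, as a small Python sketch. The 4-byte element size and the even 16-way split are assumptions from this thread, not measured facts, and the constant names are made up.

```python
L2_BYTES = 1 * 1024 * 1024   # 1MB L2 per core, as mentioned above
ELEMENT_BYTES = 4            # assumed element size
ELEMENTS = 1_000_000         # array length from the experiment
CORES = 16                   # one share per core

def working_set_bytes(elements, element_bytes=ELEMENT_BYTES):
    """Bytes needed to hold `elements` items."""
    return elements * element_bytes

def fits(nbytes, cache_bytes=L2_BYTES):
    """True if a working set of `nbytes` fits in a cache of `cache_bytes`."""
    return nbytes <= cache_bytes

whole = working_set_bytes(ELEMENTS)               # 4,000,000 bytes
per_core = working_set_bytes(ELEMENTS // CORES)   # 250,000 bytes
```

So the whole array is roughly four times one core's L2, while each of the 16 shares takes about a quarter of it.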

an hour ago

forrestthewoods

Here’s an older blog post of mine on roughly the same topic:

https://www.forrestthewoods.com/blog/memory-bandwidth-napkin...

I’m not sure I agree with the data presentation format. “time per element” doesn’t seem like the right metric.

7 hours ago

klank

What are your qualms with time per element? I liked it as a metric because it kept the total deviation of results to less than 32 across the entire result set.

Using something like the overall run length would produce such large variations that only the shape of the graph would be particularly useful (to me), and the values themselves much less so.

If I was showing a chart like this to "leadership" I'd show the overall run length, as I'd care more about them realizing the "real world" impact rather than the per unit impact. But this is written for engineers, so I'd expect it to also be focused on per unit impacts for a blog like this.

However, having said all that, I'd love to hear what your reservations are about using it as a metric.

5 hours ago

forrestthewoods

It’s not wrong per se. I’m just very wary of nano-scale benchmarks. And I think in general you should advertise “velocity” not “time per”.

Perhaps it’s a long time inspiration from this post: https://randomascii.wordpress.com/2018/02/04/what-we-talk-ab...

I also just don’t know what to do with “1 ns per element”. The scale of 1 to 4 ns per element is remarkably imprecise. Discussing 250 million to 1 billion elements per second feels like a much wider range, even if it’s mathematically identical.
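The conversion between the two framings is just a reciprocal. A tiny Python sketch, with hypothetical function names:

```python
def elements_per_second(ns_per_element):
    """Convert a 'time per element' figure (nanoseconds) into a throughput."""
    if ns_per_element <= 0:
        raise ValueError("ns_per_element must be positive")
    return 1e9 / ns_per_element

def ns_per_element(elements_per_sec):
    """Convert a throughput back into nanoseconds per element."""
    if elements_per_sec <= 0:
        raise ValueError("elements_per_sec must be positive")
    return 1e9 / elements_per_sec
```

With this, 1 ns per element is 1 billion elements per second and 4 ns per element is 250 million.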

Your graphs have a few odd spikes that weren’t deeply discussed. If it’s under 2ns per element who cares!

The logarithmic scale also made it really hard to interpret. Should have drawn clearer lines at L1/L2/L3/RAM limits.

On skim I don’t think there’s anything wrong. But as presented it’s a little hard for me as an engineer to extract lessons or use this information for good (or evil).

There shouldn’t be a Linux vs Mac issue. Ignoring mmap this should be HW.

I dunno. Those are all just surface level reactions.

31 minutes ago

alain94040

From your blog post:

> Random access from the cache is remarkably quick. It's comparable to sequential RAM performance

That's actually expected once you think about it, it's a natural consequence of prefetching.

4 hours ago

delusional

If that wasn't the case the machine would have to prefetch to register file. I don't know of any CPU that does that.

41 minutes ago

petermcneeley

What's most misleading is the data for the smaller sizes (1k).

3 hours ago
