Decreasing the number of memory accesses

90 points
a year ago
by g0xA52A2A

Comments


twawaaay

Honestly, for single-threaded processing, potentially the largest improvements come from:

* Exploiting cache locality (related data in same cache line, operations hit the data before it leaves cache, etc.)

* Exploiting cache prefetching (understanding how to lay out your data and operations so that CPU prefetches data from memory into cache BEFORE it is needed)

* Ensuring high density of data being accessed (make sure you only access what you need and the cache lines you access contain only data you need and not random other things that are not involved in the operation)

... at least in my own experience.

These can be so powerful that they may make a dumb algorithm perform like an absolute superstar.

A lot of developers overthink algorithms. They try to create something that is super efficient on paper, forgetting that the algorithm is not being executed on paper but by a real CPU. Sometimes it is better to sacrifice "paper efficiency" and just ensure cache locality, prefetching, and data density; the resulting speedup is so large it more than makes up for a little bit of inefficiency.
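To make the data-density point concrete, here is a minimal sketch (hypothetical particle type, nothing from the article): a hot/cold split so the hot loop only pulls in the bytes it actually uses.

  #include <vector>

  // "Fat" record: the update loop below only needs position and velocity,
  // but every cache line it pulls in would be mostly name/stats bytes.
  struct FatParticle {
      float x, y, z;
      float vx, vy, vz;
      char  name[64];
      int   stats[16];
  };

  // Hot/cold split: the hot loop streams through densely packed data,
  // so far fewer cache lines are needed for the same work.
  struct HotParticle {
      float x, y, z;
      float vx, vy, vz;
  };

  void update(std::vector<HotParticle>& hot, float dt) {
      for (HotParticle& p : hot) {   // sequential, prefetch-friendly access
          p.x += p.vx * dt;
          p.y += p.vy * dt;
          p.z += p.vz * dt;
      }
  }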

a year ago

hamilyon2

The article is about the same thing, essentially? Decreasing the number of memory accesses is the same thing as exploiting cache locality. Registers are just a more local, higher-speed level of cache. Especially given that on modern superscalar CPUs, registers are virtual anyway.

a year ago

twawaaay

> The article is about the same thing essentially? Decreasing number of memory accesses is the same thing as exploiting cache locality.

No, it is not. For example, you can have an application that is doing a lot of memory accesses, but the working set is controlled to fit the cache except for a couple of things that need to be fetched into it, and you make sure those things fit a predictable pattern that the CPU prefetcher can recognise and supply in advance of you needing them.

Now, having fewer memory accesses usually helps, but it is kind of a newbie way of looking at the problem. What if you have to make a lot of accesses because your application is memory heavy (say it is a database)? Does that mean there is nothing you can do to improve it? Of course there are things you can do to improve it -- but you need a bit more in-depth knowledge.
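As a rough illustration of that "predictable pattern" idea (a generic example, not from the thread): the same number of reads can be nearly free or painfully slow depending on whether the prefetcher can see where you are going next.

  #include <vector>

  struct Node { int value; Node* next; };

  // Prefetcher-friendly: addresses grow by a fixed stride each step, so
  // the next cache lines are already in flight before the loop needs them.
  long sum_array(const std::vector<int>& v) {
      long s = 0;
      for (int x : v) s += x;
      return s;
  }

  // Same number of accesses, but each `next` pointer can land anywhere,
  // so every step risks a full cache miss that the prefetcher cannot hide.
  long sum_list(const Node* head) {
      long s = 0;
      for (const Node* n = head; n != nullptr; n = n->next) s += n->value;
      return s;
  }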

a year ago

mackman

I think this article is about cache locality too, but temporal cache locality, which is just as important albeit orthogonal to what you are describing.

a year ago

[deleted]
a year ago

FartyMcFarter

> These can be so powerful that may make a dumb algorithm perform like an absolute superstar.

That's an over-generalization that is only valid for small input sizes. Consider:

- an O(N*log N) heap-sort implemented by a decent programmer in idiomatic high-level code.

- a dumb O(N^2) bubble-sort coded by an absolute assembly optimization wizard.

Your O(N^2) sort will probably be much faster for small input sizes, but it won't take a large input size for it to become dog-crap slow. You can improve constant factors all day, but that won't obviate the need for doing millions or billions of comparisons.

You can only take a dumb algorithm so far before it breaks down completely in practice.

This is not a theoretical thing either - in unoptimized codebases, I've often seen people finding low-hanging fruit by looking for inadvertent O(N^2) or even O(N^3) behaviour (usually it's not obvious nested loops, but looping hidden by indirection i.e. function calls) and replacing it with an O(N*log N) or O(N) alternative.
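A classic shape of that low-hanging fruit (a generic illustration, not taken from any particular codebase): a loop that looks linear but hides another linear scan behind a helper call.

  #include <string>
  #include <unordered_set>
  #include <vector>

  // Accidentally quadratic: the loop looks linear, but contains() hides
  // another linear scan behind a function call.
  bool contains(const std::vector<std::string>& v, const std::string& s) {
      for (const auto& e : v)
          if (e == s) return true;
      return false;
  }

  std::vector<std::string> dedup_quadratic(const std::vector<std::string>& in) {
      std::vector<std::string> out;
      for (const auto& s : in)
          if (!contains(out, s)) out.push_back(s);   // O(N^2) overall
      return out;
  }

  // Same result in O(N) expected time with a hash set.
  std::vector<std::string> dedup_linear(const std::vector<std::string>& in) {
      std::vector<std::string> out;
      std::unordered_set<std::string> seen;
      for (const auto& s : in)
          if (seen.insert(s).second) out.push_back(s);
      return out;
  }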

a year ago

astrobe_

> That's an over-generalization

I think most people here will interpret the word "can" in the sentence you're quoting as a caveat emptor; and I would add the same applies to TFA. Loop fusion can help, etc., but measure whether it actually does in your exact context.

> You can improve constant factors all day, but that won't obviate the need for doing millions or billions of comparisons.

> You can only take a dumb algorithm so far before it breaks down completely in practice.

If the dumb algorithm relies entirely on arithmetic while the better algorithm has conditional jumps, branch prediction misses can play against it for unexpectedly large input sizes.
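A tiny, made-up illustration of that point: the branchless form trades a data-dependent jump for pure arithmetic.

  #include <vector>

  // Branchy: on random data the branch is mispredicted about half the
  // time, and each miss costs on the order of 10-20 cycles.
  long sum_positive_branchy(const std::vector<int>& v) {
      long s = 0;
      for (int x : v)
          if (x > 0) s += x;
      return s;
  }

  // Branchless: pure arithmetic, no data-dependent jumps, so throughput
  // stays flat no matter how unpredictable the data is.
  long sum_positive_branchless(const std::vector<int>& v) {
      long s = 0;
      for (int x : v)
          s += x * (x > 0);   // (x > 0) evaluates to 0 or 1
      return s;
  }

(A compiler may well do this conversion itself, which is one more reason to measure rather than guess.)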

Micro-optimizations may fall short quickly, but big-O() approved algorithms don't always win either, in practice.

a year ago

pclmulqdq

People usually underestimate where that line is. In my past experience, I ended up benchmarking binary search vs linear search on a collection of small objects. The breakeven point on a modern CPU was at 10 megabytes of data. Although this is a particularly extreme example, most people assume that collections of a few thousand items are where you should switch to big-O-optimal algorithms, but the breakeven is usually several orders of magnitude higher.
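For the curious, a rough sketch of how such a measurement could be set up (this is not the parent's benchmark, and the 10 MB figure is theirs; a real benchmark needs warmup, repetition and more care):

  #include <algorithm>
  #include <chrono>
  #include <cstddef>
  #include <cstdio>
  #include <numeric>
  #include <vector>

  int main() {
      for (std::size_t n : {1u << 6, 1u << 10, 1u << 14, 1u << 18, 1u << 22}) {
          std::vector<int> data(n);
          std::iota(data.begin(), data.end(), 0);   // sorted 0..n-1

          long sink = 0;            // accumulate results so nothing is optimized away
          const int iterations = 1000;

          auto t0 = std::chrono::steady_clock::now();
          for (int i = 0; i < iterations; ++i) {
              int key = static_cast<int>((i * 2654435761u) % n);        // pseudo-random probe
              sink += *std::find(data.begin(), data.end(), key);        // linear search
          }
          auto t1 = std::chrono::steady_clock::now();
          for (int i = 0; i < iterations; ++i) {
              int key = static_cast<int>((i * 2654435761u) % n);
              sink += *std::lower_bound(data.begin(), data.end(), key); // binary search
          }
          auto t2 = std::chrono::steady_clock::now();

          auto us = [](auto d) {
              return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
          };
          std::printf("n=%zu linear=%lldus binary=%lldus (sink=%ld)\n",
                      n, (long long)us(t1 - t0), (long long)us(t2 - t1), sink);
      }
      return 0;
  }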

a year ago

AgentOrange1234

That’s very surprising to me. My intuition screams that surely a binary search is competitive with a linear search even at small sizes. I would love to see benchmarks and code for this if you have them handy.

a year ago

madsbuch

If your problem domain never expands to more than a couple of thousand elements that need to be ordered, but needs ordering really often (which is very often the case), then bubble sort is the way to go (especially if you have almost-sorted data, which makes bubble sort an ideal choice).

> That's an over-generalization

I think this is exactly the problem with these types of comparisons. As the original commenter also hinted: you cannot generalize about the best choice of algorithm.

a year ago

twawaaay

I am pretty used to people defending their blind trust in big-O notation.

As I mentioned, the idea is to be open to alternative ways of thinking.

Performance is incredibly difficult or impossible to generalise. I have many times worked on a project where people said confidently that it is not possible to speed something up, and then I sped it up by multiple orders of magnitude.

I worked for one of the largest banks in the world and we had a batch task on MongoDB that took 12 hours to complete. The previous team, before I joined, took it down to 4 hours but then said it was not possible to improve it any more. The consultants from MongoDB confirmed it. This was a huge problem because there was a delay of 4 hours between the end of the trading day and the report being finished, which was unacceptable.

I took it down to ~15 seconds. Without changing the data structure or the database or the application technology. The only significant thing I did was make the data suppliers stream the data to our app and the consumers stream the results during the day. So the data was being processed during the trading day and partial results were already being pushed to the client. The only thing that needed to be done at the end was to sync between the sources and the clients, and it took about 15 s to do it (and I could probably take it even further but it was not worth the effort).

So you see, it is all about being open minded.

a year ago

kragen

bubble sort is never as good as insertion sort, not even for almost-sorted data or for six elements

bubble sort is never the way to go

generally in the case you're describing, where you have a few thousand elements that need to stay sorted in the face of insertions, a b-tree would be a better choice, or failing that, merge in a batch of updates whenever the update file gets big enough

either of these can beat bubble sort or insertion sort in this situation by more than an order of magnitude

a year ago

cozzyd

I'm sure it's possible to conceive of a pathological hardware architecture where bubble sort is faster than insertion sort on small inputs.

a year ago

saagarjha

I mean sure, you can always come up with a processor that tries to detect insertion sort and makes it slow while it has hardware acceleration for bubble sort. That doesn't make it a very useful model, though.

a year ago

karmakaze

Also using big-O notation turns the "1/2" of the post title into a discarded constant.

a year ago

mrguyorama

My algorithms class discussed the importance of not forgetting the constants. We talked about an algorithm for recursively taking averages of a list of numbers (or whatever, it's not important) that was O(n^2) but only in the worst case, while there was a version that was guaranteed O(n log n) or something, but with a constant factor of 27. This meant the growth only mattered in extreme cases.

a year ago

karmakaze

That math doesn't seem to check out. Say with a base-2 log and a 27x constant-factor difference, for n = 1000 it would be 1000^2 = 1,000,000 vs 27 * 1000 * log 1000 ≈ 270,000. I wouldn't call n = 1000 an extreme case for most problems. Just because practical counterexamples exist, don't assume one applies to your problem unless it's demonstrably so, and then still worry about N being larger than originally designed for.

a year ago

londons_explore

I really want compilers to be able to make these changes.

It seems that today, very few compilers are willing/able to mess with data structures and rearrange them.

I'd like to see compilers that can at a minimum rearrange arrays of structures into structures of arrays depending on likely data access patterns of the different members of the structure.

I'd also like to see compilers 'factoring out' things like vtables or structure members when they identify that they are constants or never read/written.
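A hand-written version of the AoS-to-SoA transform being wished for might look like this (hypothetical field names; the point is what the compiler would have to rearrange):

  #include <vector>

  // Array of structures: a pass that only reads `health` still drags the
  // positions and velocities of every entity through the cache.
  struct EntityAoS {
      float x, y, z;
      float vx, vy, vz;
      int   health;
  };
  using EntitiesAoS = std::vector<EntityAoS>;

  // Structure of arrays: the same pass now touches one dense, contiguous
  // array and nothing else.
  struct EntitiesSoA {
      std::vector<float> x, y, z;
      std::vector<float> vx, vy, vz;
      std::vector<int>   health;
  };

  int count_alive(const EntitiesSoA& e) {
      int alive = 0;
      for (int h : e.health)   // ~4 bytes per entity instead of ~28
          alive += (h > 0);
      return alive;
  }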

a year ago

flohofwoe

Compilers really shouldn't do this under the hood; it adds more complexity to the compiler, which also makes it more likely that things break in unexpected ways.

Instead these should be explicit programming language features (like defining a regular struct but then slicing that struct into an SoA layout).

The rest should be handled by debugging and profiling tools which need to be much better integrated into the programming workflow than what is common today.

a year ago

[deleted]
a year ago

AgentOrange1234

Huh. But putting the complexity into the compiler once, instead of into every application where we would need to manually fuse loops, etc., seems like a huge win for net complexity and readability?

(I’m sympathetic to the concern of things breaking in unexpected ways.)

a year ago

flohofwoe

You can design the language to make such transforms more transparent without resorting to special 'compiler optimization magic'.

For instance Zig has a MultiArrayList container in its stdlib which slices its generic item type into one array per struct field. This isn't a special compiler feature, but a regular usage of Zig's comptime stuff:

https://github.com/ziglang/zig/blob/master/lib/std/multi_arr...

a year ago

AgentOrange1234

Yeah! And there is definitely value in certain domains for having predictable performance.

On the other hand, giving up that control and designing languages where more automatic optimizations are possible also seems really great. A lot of programming shouldn’t be concerned with low level perf anymore.

a year ago

hansvm

That's a really hard thing for a compiler to predict, though. I recently got a 2x performance boost just by switching to an array-of-struct-of-arrays representation (making the cache lines carry more data that actually matters), and the only reason it made a difference was the access pattern of a mildly non-trivial algorithm acting on that data (not hard per se, but definitely hard for a compiler to optimize through).
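For reference, a minimal sketch of what an array-of-struct-of-arrays layout can look like (made-up fields; the parent's actual data and algorithm are of course different):

  #include <cstddef>
  #include <vector>

  // Array of struct of arrays: fixed-width blocks keep each field
  // contiguous within a block (cache-line- and SIMD-friendly) while the
  // related fields of the same items still sit close together in memory.
  constexpr std::size_t kLane = 8;

  struct ParticleBlock {
      float x[kLane];
      float y[kLane];
      float vx[kLane];
      float vy[kLane];
  };

  void step(std::vector<ParticleBlock>& blocks, float dt) {
      for (ParticleBlock& b : blocks) {
          for (std::size_t i = 0; i < kLane; ++i) {   // trivially vectorizable
              b.x[i] += b.vx[i] * dt;
              b.y[i] += b.vy[i] * dt;
          }
      }
  }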

a year ago

bell-cot

There's a long history of compilers attempting such "smarter" optimizations - with (metaphorically) "...and it flew straight into the mountainside" results.

I don't blame the compiler guys. They have to deal with tortuously complex real-world code. Sight unseen. And that code may only work correctly because of some unwritten assumptions, side-effects, and undefined behaviors, which no good program would ever rely upon.

a year ago

RobotToaster

GCC already has flags for various "dangerous" optimisation options (ffast-math for instance), there's no reason you couldn't have similar options that are disabled by default.

a year ago

bell-cot

Very difficult to do + even more difficult to get right + seldom used ==> proposed feature is a low priority for the compiler team.

a year ago

jerf

Our intuition says that our code is full of opportunities like this and a "sufficiently smart compiler" ought to be able to work wonders with our code.

Our intuitions are wrong. In reality it takes a shockingly small problem to break these optimizations and make them do something in violation of the semantics of the language, and as a result, the compiler can't perform the optimization even if it is smart enough to try. Many, many, many compilers have tried. It is not a new idea, it is one that has been tried and failed over and over again.

This is one of the reasons why the functional programming crew advocate so hard for simpler semantics like pervasively immutable variables. The idea is that if the program's semantics promise less, the optimizations ought to be a lot easier. And at least at the surface level this is arguably true; Haskell does quite significant map (rather than "loop") fusion successfully. However, to date, the "sufficiently smart compiler" has not manifested to the extent that we'd like, the one that turns functional code into 2x - 3x faster code than imperative code. Generally the real-world smart compilers are straining just to keep up with C, as long as you write in a particular subset of the language.

I have quoted the term "sufficiently smart compiler" because if you want to dig further in, it is a good search term.

Many of our intuitions about our code are similarly broken. There was a lot of hope for exploiting "implicit parallelization" in our codebases; surely, our brains thought, there's just a ton of things in every code base that can be automatically turned into parallel code. That turned out to be false. There was a paper that took some representative code bases and extracted out the theoretical maximum parallelization that could even be done by any conceivable compiler optimization by analyzing the data dependencies themselves and it turned out to be shockingly small, thus effectively killing the idea. This story has been repeated a lot; another search term you can play with is "supercompiler", which amounts to "compilers are pretty good at optimizations in the small, surely our code bases are rife with opportunities to optimize at much higher levels across packages or the entire code base?" to which the answer to date has amounted to "No, not really."

In my opinion, the only hope for any of these ideas is someone sitting down and designing a language from the very get-go intended to work with these ideas. However, given the amount of experience necessary to do that and the subtlety of the task, I'm not sure that I'm not specifying a super-human degree of work by saying that.

a year ago

srcreigh

Once we solve the halting problem, we’ll easily reorganize arrays of structure into structures of arrays based on likely data access patterns.

a year ago

extrememacaroni

Please no, no extreme changes to what I or others have written. I want to be able to get at least a reasonable idea of the performance and behavior some code will have in a particular context by looking at the code itself.

Stop trying to be as lazy as possible in such a bad way. Making compilers make up for a lack of programming ability is a mistake.

a year ago

another2another

I feel the first example for ranges is a bit contrived; most people would end up with something more like:

  void get_ids_adult_users(const std::vector<user_t>& allUsers, std::vector<int>& adultUsers)
  {
      for (const user_t& user: allUsers) {
          if (user.age >= 18) {
              adultUsers.push_back(user.id);
          }
      }
  }

probably after calling reserve() on the adultUsers vector to avoid resizing.
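For what it's worth, the call site with that reserve() might look like this (just a sketch reusing the names above):

  std::vector<int> adultIds;
  adultIds.reserve(allUsers.size());       // upper bound: everyone might be an adult
  get_ids_adult_users(allUsers, adultIds);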
a year ago

pipo234

Nice explanation, doing a step-by-step refactoring in C++. So for memory-bound performance optimization, you indeed see that halving memory accesses doubles throughput in practice, more or less.

Curious if this kind of loop fusion is something you get for free in a purely functional language? That is, if you were to write a program (say in Haskell) for min, for max, and then that same code combined into a single pass, would you indeed empirically see double the speed?
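For readers who haven't seen the article, the C++ shape of that min/max fusion is roughly the following (a sketch with made-up function names, assuming a non-empty input):

  #include <algorithm>
  #include <utility>
  #include <vector>

  // Unfused: two passes, so every element is loaded from memory twice.
  std::pair<int, int> min_then_max(const std::vector<int>& v) {
      int mn = *std::min_element(v.begin(), v.end());
      int mx = *std::max_element(v.begin(), v.end());
      return {mn, mx};
  }

  // Fused: one pass, each element is loaded once; for a memory-bound loop
  // this roughly halves the traffic.
  std::pair<int, int> min_and_max(const std::vector<int>& v) {
      int mn = v.front(), mx = v.front();   // assumes v is non-empty
      for (int x : v) {
          mn = std::min(mn, x);
          mx = std::max(mx, x);
      }
      return {mn, mx};
  }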

a year ago

jamal-kumar

little to no malloc and taking full advantage of what you can get away with without it is one of those tricks every programmer should know

it's how we managed to get away with doing anything at all when our PCs had like 16-32 MEGABYTES of ram in them - or less, those are generous amounts

it's why those computers can still play video
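one common form of the trick, as a generic sketch (not specific to those old machines): hoist the allocation out of the hot path and reuse the buffer

  #include <vector>

  // allocating inside the hot loop: one malloc/free per frame, and the
  // buffer may land somewhere cold every time
  void per_frame_naive(int frames, int n) {
      for (int f = 0; f < frames; ++f) {
          std::vector<float> scratch(n);
          // ... fill and use scratch ...
      }
  }

  // allocate once, reuse: the hot path never touches the allocator and
  // the buffer stays warm in cache
  void per_frame_reuse(int frames, int n) {
      std::vector<float> scratch(n);
      for (int f = 0; f < frames; ++f) {
          // ... fill and use scratch, overwriting the previous contents ...
      }
  }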

a year ago

LoganDark

> it's why those computers can still play video

Meanwhile, new computers that are thousands or tens of thousands of times faster cannot play the same video.

a year ago

xjay

Other optimization awareness resources:

> This series of five manuals describes everything you need to know about optimizing code for x86 and x86-64 family microprocessors, including optimization advices for C++ and assembly language, details about the microarchitecture and instruction timings of most Intel, AMD and VIA processors, and details about different compilers and calling conventions.

> Operating systems covered: DOS, Windows, Linux, BSD, Mac OS X Intel based, 32 and 64 bits.

https://agner.org/optimize/#manuals

a year ago

cl0ckt0wer

Link isn't loading for me: https://archive.ph/pCYYf

a year ago

g0xA52A2A

a year ago

progx

Loop fusion is something that a compiler / interpreter should do. Compilers would need a touch of AI to detect such code. For developers, separated code is easier to read and to maintain.

a year ago

flohofwoe

I don't agree. There's already way too much compiler magic happening, and the process to nudge/hint the compiler towards doing the right thing is already too opaque and fragile. Optimizations need to be predictable, transparent and robust, also across compiler vendors and compiler versions.

What we need instead is "always on" profiling/debugging/inspection tools so that I can see directly and immediately the performance impact of my code changes while I'm typing the code (down to the compiler output as disassembly, annotated with performance warnings like predicted cache misses). Basically the same sort of evolution which gave us warning and error squiggles while typing, but for performance (think godbolt.org, but in realtime and with hot code reloading).

a year ago

mhh__

I think a reasonable solution is to just make the backend accessible from the frontend, i.e. you should be able to declare how aggressive you want the inliner to be as a first-class language idiom.

a year ago

dspillett

> For developers separated code is easier to read and to maintain.

Unless the inner loop is something long or complicated (in which case for reading & maintainability it should maybe be broken down into functions), I generally find the fused version easier for both.

For readability: it is immediately obvious that everything is working on exactly the same data, and the same range within that data, and that one section is not expected to have side effects that affect the running of the subsequent ones.

For maintainability: there is less chance that an edit which should affect all the sections is applied to one and missed in the others.

a year ago

mort96

Hmm, I hadn't thought about it before, but I bet ML could be really powerful when it comes to optimizing code. Existing LLMs (which aren't even trained on anything related to optimization!) can already do an incredible job optimizing in a lot of cases.

Get an ML model to take a stab at optimizing, have a solid system for validating that two pieces of code are equivalent according to the C abstract machine, throw away the ML-suggested optimizations which change the semantics of the code, keep the ones which keep semantics the same.

a year ago

mhh__

You can just let the AI pick from a possible set of optimizations, rather than having it suggest the optimizations itself. There are a lot of optimizations that compilers could easily recognize but can't do because they'd be violating the standard.

More realistically you can use an AI to generate vaguely realistic PGO data (profile guidance is where the real speed comes from). This has already been done at least by researchers.

a year ago

mort96

> you can use an AI to generate vaguely realistic PGO data (profile guidance is where the real speed comes from)

That's not a bad idea, but PGO is pretty limited. I'm thinking about things like rewriting a loop to use vector instructions; something which compilers are absolutely terrible at but ChatGPT is actually pretty good at.
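A small example of the kind of rewrite meant here (a generic sketch, not from the thread): a float reduction that compilers won't vectorize as written, but will once the dependency chain is broken by hand.

  #include <cstddef>
  #include <vector>

  // Straightforward float sum: strict IEEE ordering makes this one long
  // dependency chain, so compilers typically won't vectorize it
  // (unless something like -ffast-math allows reassociation).
  float sum_scalar(const std::vector<float>& v) {
      float s = 0.0f;
      for (float x : v) s += x;
      return s;
  }

  // Hand-reassociated: four independent accumulators break the chain,
  // which an auto-vectorizer can map onto SIMD adds. The result may
  // differ from the scalar version in the last bits.
  float sum_unrolled(const std::vector<float>& v) {
      float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      std::size_t i = 0;
      const std::size_t n = v.size();
      for (; i + 4 <= n; i += 4) {
          s0 += v[i];
          s1 += v[i + 1];
          s2 += v[i + 2];
          s3 += v[i + 3];
      }
      float s = (s0 + s1) + (s2 + s3);
      for (; i < n; ++i) s += v[i];   // leftover elements
      return s;
  }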

a year ago

mhh__

PGO is limited if you are chasing throughput on really tight programs that spend most of their time in a small number of loops (preferably with textbooks written about them), but it's really useful if you have a bunch of branchy code that mainly does one thing but has to cater for other things too (compilers come to mind).

Enabling PGO when building the D compiler made it almost 50% faster on some parts of the test suite (and fairly reasonable big files at that) - speedup elsewhere was not as dramatic but still a tidy win.

a year ago

mort96

50% is probably a pretty extreme case for PGO, but rewriting code to use vector instructions instead of scalar ones often gives a speedup between 2x and 4x. Both approaches are interesting is all I'm saying.

a year ago

JonChesterfield

> have a solid system for validating that two pieces of code are equivalent according to the C abstract machine

What ideas do you have for doing that in anything under exponential time?

a year ago

mort96

I don't have any, I haven't thought about it enough. However, the reason I think it might be interesting is that checking if something is a correct solution is often easier than coming up with a correct solution for a lot of classes of problems (assuming P!=NP and all that).

And maybe you can't make a polynomial time verifier which works in the general case, but maybe you can make a useful one which handles a lot of cases.

a year ago

c-cube

I'm pretty sure that program equivalence is undecidable, i.e. you literally can't write this validation pass at all. There's a C compiler (CompCert) that is formally verified to preserve the semantics of its input C code down to assembly, but it takes a lot of human-developed proofs in Coq to get there.

a year ago

bell-cot

> Loop fusion is also a compiler optimization technique, so in theory the compiler can do it automatically. But this doesn’t happen often and when it does happen, this optimization has a tendency to break easily.

a year ago

JonChesterfield

Compilers don't need AI to do loop fusion. They need a commercially meaningful benchmark to benefit from it or for someone to decide it's an interesting problem to implement.

a year ago

mhh__

Compilers have been doing this for a long time already.

a year ago

planede

I would have expected so, but gcc doesn't seem to do it for the article's example (didn't try other compilers):

https://godbolt.org/z/Paq8f5x3d

a year ago

adpcm

There exist some 10+ year old patches to do so, which were never accepted/merged.

https://github.com/wichtounet/gcc-loop-fusion/commits/master...

a year ago