Zen 5's 2-ahead branch predictor: how a 30 year old idea allows for new tricks

321 points
a month ago
by matt_d

Comments


gary_0

Here's a great explanation of branch prediction, starting from the earliest implementations: https://danluu.com/branch-prediction/

a month ago

skywal_l

Godbolt recently did an ELI5 on computerphile about everything CPU[0] and about branch prediction in particular[1].

[0]: https://www.youtube.com/watch?v=nhXevKMm3JI&list=PLzH6n4zXuc...

[1]: https://www.youtube.com/watch?v=nczJ58WvtYo&list=PLzH6n4zXuc...

a month ago

holoduke

That's a good read

a month ago

ksec

It will be interesting to see the SMT performance; I expect this will provide benefits and be further refined in future generations. With Zen 5c we get 192 cores or 384 vCPUs. We should be getting 256 cores with Zen 6c next year, which means a dual-socket 1U server could potentially have 512 cores and 1024 vCPUs.

Whatever web app scaling issues we had in 2014 could now fit into a single server, assuming we somehow manage to cool the thing. Even at 1 RPS per vCPU that's over 1,000 RPS, excluding cache hits. Even the HN front page doesn't hit the server at 1,000 page views per second.

a month ago

bayindirh

Serving web pages is cheap. You’ll probably hit network I/O limits before you saturate the cores.

I wonder about its HPC performance. I think cooling this won't be a big problem, but it might be a wet one, requiring DLC after a certain point.

a month ago

Dylan16807

> Serving web pages is cheap. You’ll probably hit network I/O limits before you saturate the cores.

It's hard to be network I/O bound when serving web pages. Netflix struggles to be network I/O bound when serving video, which is so much bigger and uses so much less processing.

Epyc started off with 32 cores on PCIe 3, and quickly moved to 64 cores on PCIe 4. When we hit 256 cores it's probably going to have PCIe 6, which means it's still the same I/O per core.

But those numbers are crazy overkill for web serving anyway. If you wanted to allocate about a gigabit per core, with 512 cores across two CPUs, using PCIe 5 to be conservative, you'd need a total of... 16 lanes, for a single 400 Gb/s card you can buy today.

(This is assuming you mean the I/O from the server to the network. If you're talking about I/O outside the server, then upgrade your switches. Using denser servers doesn't increase your network load, it lets you take the same load and send it to fewer racks. You're already handling the total data somewhere.)

a month ago

account42

Video being bigger also means it does not fit in lower levels of the cache hierarchy. For a normal website 99% of views should be served entirely from RAM.

a month ago

bayindirh

I mean, we already run NDR InfiniBand, sometimes with multiple cards per host, in our data center, so the numbers are not surprising or out of this world for me...

Terabit Ethernet is also easy: just 10 100-gigabit fibers. Again, not much either in port count or in physical space (port count is ample, fibers are thin).

What I meant is, in a conventional data center you'll probably hit your allocated bandwidth limits before you saturate a processor like this while serving web pages, unless you're sitting on a big exchange point or the backbone router is right in the next system hall. A single-socket Epyc system has 128 PCIe lanes and 12-channel memory. Complete overkill for such a job unless you serve millions, and only have a basement to put your servers in with some very fat network pipes.

a month ago

Dylan16807

I'm assuming a situation where you'd need more than one server and there actually are bottlenecks. If one server per datacenter is enough, with no bottlenecks, then great you're done.

If your actually-kept-busy servers get 20x faster and your allocated bandwidth doesn't, there's a point where your racks are almost empty and you have some issues to address with the datacenter owner.

a month ago

bayindirh

I wrote that comment as a datacenter tenant, not as an owner. Of course a proper datacenter providing colo services should have fat pipes to backbone carriers. What I meant is that this processor is overkill just for serving webpages. It makes sense as a virtualization host, an HPC worker node, a software-defined storage node, or any other use case that can flex its muscles. Serving HTML files is not one of them, though.

a month ago

touisteur

As with Zen 3 and Zen 4, it's been a pleasure to work with good vectorization perf, and it seems they've been doing AVX-512 'right', at least for the HPC I know.

Add to that the 12 channels of DDR per socket and the quite generous cache, and once you've swallowed the need to buy 24 sticks for 2 sockets, these things are beautiful beasts at what Intel has taught us is a very reasonable price (ask your OEM for prices; public prices on CPUs are nuts).

a month ago

JonChesterfield

Lots of water in data centers these days. Works fine. I think the fear of it lost out to its utility some years ago.

a month ago

jiggawatts

We're entering the era of "kilo cores" in the same way computing entered the era of kilobytes in the 1940s. If you consider a tightly-coupled rack of servers with GPUs to be one machine, then we're well into the hundreds of kilocores.

I found it entertaining having a debate with someone here on HN who just couldn't grok the concept that it's possible to serve something the size of Wikipedia from a single server. That's been easy for a while now, it just isn't done for practical reasons such as availability or cost-efficiency.

a month ago

JonChesterfield

Is that actually true, or only that you can serve it from one server behind a lot of servers named CDN?

a month ago

jiggawatts

It’s definitely true. It won’t be optimal or fast for end users, but it’s entirely doable.

One of the main benefits of a CDN is that it brings the data closer to the users, reducing network latency. Obviously a single server can’t be close to every user globally!

In practice one would always use a CDN for a site like Wikipedia, but the point is that it isn't necessary to handle the traffic.

A single two-socket EPYC server can put out something like 400 Gbps of traffic, which is far more than all but the biggest sites.

a month ago

fulafel

GPU terminology "cores" are more like SIMD lanes, they're not comparable.

a month ago

WithinReason

The opposite might be true: the better you are at utilising the CPU pipeline, the less slack you have to run a 2nd thread, so SMT's benefits might diminish.

a month ago

ColonelPhantom

Correct. However, Zen 5 seems to be the first major 'scaling up' that AMD has done since the original Zen, given that the core seems to be moving from dispatching 6 to 8 instructions per clock and from 4 to 6 ALUs. I would expect SMT to be excellent for low-IPC workloads to get the most out of that scaling.

a month ago

fulafel

SMT could use a lot more benchmarking investigations.

Intuitively, having more tasks working on the same problem at half speed should have a memory usage cost. Are apps commonly using more memory for no speed gain when using SMT?

In a lot of published benchmarks it seems most apps don't noticeably benefit in execution speed.

a month ago

crest

SMT also allows the cores to hide memory latency by executing the other thread(s). Taken to the extreme end this would give you a barrel CPU.

a month ago

fulafel

Yes, this is reflected in the execution speed of the workloads (vs being an additional thing).

Other things making up the potential for this to be more than a zero-sum game (half the speed each) include making better use of the ALU and branch units that would otherwise be waiting for dependencies. On the other side of the scale is how the threads' different working sets can thrash each other's caches.

a month ago

JonChesterfield

Push the hyperthread count up higher and you have enough stuff to do while waiting on memory that branch prediction doesn't matter as much. Could call it a GPU.

a month ago

nullc

I think unfortunately with just 2-way SMT it's pretty easy to get both threads stalled at once.

The POWER9 4-way SMT seems to be a fair bit more effective than what I've seen on x86.

a month ago

teaearlgraycold

Web app scaling issues are usually around database latency.

a month ago

jiggawatts

Only because the database is on the far end of a network hop, and then all too often its storage is remote from its compute as well. This is the most common scenario in enterprise or cloud deployments, where for good measure a couple of firewalls, load balancers, routers, and reverse proxies are added at every stage.

a month ago

CodesInChaos

In my experience, scaling is limited by database throughput, not latency.

a month ago

teaearlgraycold

Depends. Most apps that are slow have shitty app code with N+1 queries. If you can get past that stage and write good enough app code, then eventually you'll hit database throughput issues.

a month ago

codesnik

Well, DB latency in some naive, unoptimized app with multiple sequential queries directly affects the _app_'s throughput.

a month ago

jeltz

But database performance is also a lot about CPU and RAM speed plus concurrent use of those resources.

a month ago

andrepd

Man I sure hope that your server does more than 1 RPS :)

a month ago

ksec

I wish I were joking, but there are plenty of production instances doing less than 10 RPS per vCPU. (I don't think the "request" part was explained well, since it could be a page view but it could also be an API request. In this case it mostly refers to a page view.)

a month ago

IvanAchlaqullah

It's always interesting to see decades-old papers, sometimes published with little to no fanfare, suddenly become "state of the art" because hardware has become powerful enough.

For example, Z-buffers[1], used by 3D video games. When first published, it wasn't even the main topic of the paper, just a side note, because it required an expensive amount of memory to run.

Turns out megabytes are quite cheap a few decades later, and every realtime 3D renderer ended up using it.

[1] https://en.wikipedia.org/wiki/Z-buffering

a month ago

abainbridge

Another example is Low Density Parity Check Codes [1]. Discovered in 1962 by Robert Gallager but abandoned and forgotten about for decades due to being computationally impractical. It looks like there was a 38 year gap in the literature until rediscovered by David MacKay [2].

The first mainstream use was in 2003. It is now used in WiFi, Ethernet and 5G.

[1] https://en.wikipedia.org/wiki/Low-density_parity-check_code

[2] https://scholar.google.com/scholar?q=%22low+density+parity+c...

a month ago

bee_rider

I sometimes wonder if there’s an academic career hidden in there for an engineer: go to the library and read what the CS folks were publishing on physical papers, maybe there are some ideas that can actually be implemented now that weren’t practical back then.

a month ago

Terr_

In a series of books by David Brin [0] there is a galaxy-wide institution known as the library, and civilizations regularly mine its millions of years of data for suddenly-relevant-again techniques and technologies.

I remember one bit where a species had launched some tricky fleet-destroying weapon to surprise their enemies with esoteric physics, only to have it reversed against them, possibly because the Librarian that once helped their research-agent wasn't entirely unbiased.

[0] https://en.m.wikipedia.org/wiki/Uplift_Universe

a month ago

AceJohnny2

Also in Vinge's Deepness in the Sky, there aren't really "programmers" as we know them anymore, but "programmer-archeologists" that just search the archives for code components to reuse.

a month ago

abecedarius

I think that's unfair to the programmer-archaeologists: young Pham wanted to write things from scratch (and took advantage when he got a chance to), and other characters said he was a talented hacker, but as they also said, it was just way more productive most of the time to cobble together ancient code.

a month ago

The_Colonel

That's pretty much what any (decent) programmer does today as well. You first search your code base to see if the application already does something like it; if not, you check whether there is a published library. Where this starts to fail is the idea that connecting those already-written pieces together is easy.

a month ago

manojlds

..And then ask an LLM for code

a month ago

Terr_

Also: In the Destiny mythic sci-fi franchise, the human golden age ended with a mysterious apocalypse, leaving "cryptarchs" (crypto-archeologists) to try to rebuild from arcane fragments of encrypted data or unknown formats.

a month ago

CoastalCoder

Mind editing that to give a spoiler alert?

a month ago

Terr_

Don't worry, it's nowhere near the main plot or characters, just a small "meanwhile, elsewhere" vignette. Basically emphasizing the "why bother everything's already invented" mentality of most client-races, and how deep access and query-secrecy have big impacts.

a month ago

CoastalCoder

Hey, could someone clarify why all the downvotes?

I'd have thought asking for a spoiler alert would be pretty acceptable.

a month ago

samatman

> In a series of books by David Brin [0] there is

Is a spoiler alert.

Also:

> Please don't comment about the voting on comments. It never does any good, and it makes boring reading.

https://news.ycombinator.com/newsguidelines.html

a month ago

CoastalCoder

Thanks.

a month ago

kvemkon

Academic? Perhaps, applied:

"When Soviets Won the Cold War: Wading Through Pyotr Ufimstev's Work on Stealth" (26.03.2024)

https://news.ycombinator.com/item?id=39830671

> In the early 1970s, Lockheed Martin engineer Denys Overholser discovered the key to stealth technology hidden in a stack of translated Soviet technical papers. Disregarded by the Soviet academic elite, and unheard of in the United States, Pyotr Ufimstev had worked out calculations that would help win ...

Not sure of the details of this story, but in general, having enough people that you can grant some of them time just to search for something interesting doesn't seem unrealistic.

a month ago

findthewords

Yes, "read 10 year old papers as a source of ideas ripe for commercialization" IS common advice in universities.

a month ago

chrisbrandow

A post-doc in my chemistry lab had the saying, “two weeks in the lab will save you a day in the library”

a month ago

scns

Weeks of work can save hours of planning.

a month ago

[deleted]
a month ago

fragmede

The whole AI trend is in part due to things that are now possible on GPU supercomputers with gigabytes of RAM, backed by petabytes of data, running at the full speed of GPUs. Some of the algorithms date back to before the AI winter; it's just that we can now do the same thing with a ton more data, and faster.

a month ago

zaptrem

All of the main algorithms do (multi-layer perceptrons and stochastic gradient descent are from the 50s and 60s!). Basically the only thing that changed is we decided to multiply some of the outputs of the multi-layer perceptrons by each other and softmax it (attention) before passing them back into more layers. Almost all of the other stuff is just gravy to make it converge faster (and run faster on modern hardware).

a month ago

eru

Is there any indication that people had figured out that simpler activation functions like ReLU are worth bothering with?

a month ago

zaptrem

Oh, forgot about that one. Wikipedia says ReLU was used in NNs in 1969 but not widely until 2011. Idk if anyone has ever trained a transformer with sigmoid activations, but I don’t immediately see why it wouldn’t work?

a month ago

eru

I remember some experiments of using modern day training and data on some old style networks, eg with sigmoid activation.

That worked eventually and worked quite well, but it took way more compute and training data than anyone back in the olden days would have thought feasible.

The two main problems with sigmoid activation compared to ReLU are: (a) harder to compute (both the value itself and the gradient), and (b) vanishing gradients, especially in deeper networks.
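
To make the vanishing-gradient point concrete, here is a minimal C sketch (illustrative only, not from the thread): sigmoid's derivative never exceeds 0.25, so chaining 20 layer-gradients collapses toward zero, while ReLU's derivative stays at 1 on the positive side.

  /* toy illustration: product of per-layer gradients for sigmoid vs ReLU */
  #include <math.h>
  #include <stdio.h>

  static double sigmoid(double x)      { return 1.0 / (1.0 + exp(-x)); }
  static double sigmoid_grad(double x) { double s = sigmoid(x); return s * (1.0 - s); }
  static double relu_grad(double x)    { return x > 0.0 ? 1.0 : 0.0; }

  int main(void) {
      double g_sig = 1.0, g_relu = 1.0;
      for (int layer = 0; layer < 20; layer++) {
          g_sig  *= sigmoid_grad(0.0);  /* 0.25 per layer, the best case for sigmoid */
          g_relu *= relu_grad(1.0);     /* 1.0 per layer */
      }
      /* prints roughly 9.1e-13 for sigmoid vs 1 for ReLU */
      printf("after 20 layers: sigmoid %.3g, relu %.3g\n", g_sig, g_relu);
      return 0;
  }

(Compile with -lm.)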

a month ago

toast0

Heck. Look at 10 year old product launch PRs from big tech. Anything that Google launched 10 years ago and killed, but seems like a good idea today is probably also easier to do. And if you look 5-10 years before that, you can find the Yahoo launch PR where they did the same thing ;p

a month ago

dehrmann

I always like to point out Wordle for nailing the timing. Some version of it has been viable since 2000; mass smartphone adoption helped with how it spread, but it could have been viable in 2010. What it did was land at the right moment.

a month ago

pornel

Image and video compression are like that. Ideas for mainframes in the '80s are realtime algorithms now.

a month ago

eru

I guess they are _soft_ realtime algorithms now?

a month ago

taneq

'Realtime' isn't a specific thing, there's just 'fast enough'. Oldschool "render each frame during the scan line staying ahead of the electron beam" graphics was pretty hardcore though.

a month ago

eru

'Realtime' actually has multiple meanings.

At least one of them is very, very specific and is the one that Wikipedia uses in https://en.wikipedia.org/wiki/Real-time_computing

> Real-time computing (RTC) is the computer science term for hardware and software systems subject to a "real-time constraint", for example from event to system response.[1] Real-time programs must guarantee response within specified time constraints, often referred to as "deadlines".[2]

Strictly speaking, this definition doesn't say anything about how tight those deadlines are, as long as you can guarantee some deadlines.

There's also 'soft' real time where you try to hit your deadlines often-enough, but there's no guarantees and a missed deadline is not the end of the world. Games are good example of that, including the example of chasing the electron beam.

ABS brakes are an example of a 'hard' real time system: the deadlines aren't nearly as tight as for video game frames, but you really, really can't afford to miss them.

a month ago

eru

I wonder if current AI training's effort to hoover up all the training data it can find will accidentally give us most of the benefits of that?

A human can only read so much, so has to discriminate. But our machines are omnivorous readers.

a month ago

lispwitch

this seems to be how PEG parsing became popular during the last couple of decades, for example; see https://bford.info/pub/lang/peg/ (peg.pdf p11: "This work is inspired by and heavily based on Birman’s TS/TDPL and gTS/GTDPL systems [from the 1970s ...] Unfortunately it appears TDPL and GTDPL have not seen much practical use, perhaps in large measure because they were originally developed and presented as formal models [...]")

a month ago

[deleted]
a month ago

The_Colonel

> suddenly become "state of the art" because hardware has become powerful enough.

I'd rather say that we were capable of such a design for several decades, but only now has the current set of trade-offs made it attractive. Single-core performance was stifled in the last two decades by prioritizing horizontal scaling (more cores), so the complexity / die area of each individual core became a concern. I imagine that if this trend had not taken place for some reason, and CPU designers had primarily pursued single-core performance, we'd have seen this implemented much sooner.

Regarding the Z-buffer, I kind of get why this would appear as a side note; it's a simple concept. Perhaps an even better example is ray tracing: the concept is quite obvious even to people without a 3D graphics background, but it was just impractical (for real-time rendering) in terms of performance until recently. What I find interesting is that we haven't found a simpler approach that approximates true-to-life rendering and need to fall back on this old, sort of naive (and expensive) solution.

a month ago

pcwalton

Another example is Rust's borrow checker, which has roots in substructural type system papers from decades earlier. Many academics considered substructural type systems dead (killed by GC, more or less) until Rust resurrected the idea by combining it with some new ideas from C++ of the time.

a month ago

protomolecule

"with some new ideas from C++ of the time"

Could you elaborate on that?

a month ago

masklinn

I assume it's the deterministic, implicit, synchronous, single-use destructor.

Affine logic only says what you're able to do with a value, it doesn't say anything about what happens when that value goes out of scope.

a month ago

protomolecule

Hmm, that's possible

a month ago

pcwalton

Move semantics and rvalue references. It was clear that within C++11 there was a substructural type system struggling to get out.

a month ago

protomolecule

I was going to say that Rust's destructive move is quite different from C++'s move semantics, but official docs disagree: "C++: references, RAII, smart pointers, move semantics, monomorphization, memory model" [0]

[0] https://doc.rust-lang.org/reference/influences.html

a month ago

crest

Z-buffers don't just require about another frame buffer's worth of memory, but also lots of read and write bandwidth per pixel. This extra memory bandwidth requirement made them expensive to implement well. High-end implementations used dedicated RAM channels for them, but on lower-end hardware they "stole" a lot of memory bandwidth from a shared memory interface, e.g. some N64 games optimised by drawing a software-managed background/foreground with the Z-buffer disabled to avoid the cost of reading and updating the depth information.

a month ago

rasz

The power of the Z-buffer lies in letting you skip all of the expensive per-pixel computations. A secondary perk is avoidance of overdraw.

Early 3D implementations, especially gaming-related ones, didn't have any expensive per-pixel calculations. The cost of looking up and updating depth in the Z-buffer was higher than just rendering and storing the pixels. Clever programming to avoid overdraw (Doom sectors, Duke3D portals, Quake's sorted polygon spans) was still cheaper than the Z-buffer's read-compare-update.

Even the first real 3D accelerators (3dfx) treated the Z-buffer as an afterthought at the back of a fixed pipeline: every pixel of every triangle was brute-force rendered, textured, lit and blended just to be discarded by the Z-buffer at the very end. It's very likely the N64 acted in the same manner and enabling the Z-buffer didn't cut the cost of texturing and shading.

The Z-buffer started to shine with the introduction of pixel shaders, where per-pixel computations became expensive enough (~2000, GeForce/Radeon).
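
For readers unfamiliar with the cost being described, a minimal C sketch of the classic per-pixel read-compare-update (buffer sizes and names are made up for illustration):

  #include <stdint.h>

  #define W 320
  #define H 240

  static float    zbuf[W * H];   /* depth buffer */
  static uint32_t cbuf[W * H];   /* color buffer */

  /* clear depth to the far plane before each frame */
  void clear_depth(void) {
      for (int i = 0; i < W * H; i++) zbuf[i] = 1.0f;
  }

  /* one candidate pixel: read depth, compare, and on a pass write depth + color.
     On shared-memory hardware every one of these accesses steals bandwidth from
     the rest of the system. Returns 1 if the pixel survived the test. */
  int plot(int x, int y, float z, uint32_t color) {
      int i = y * W + x;
      if (z < zbuf[i]) {    /* read + compare */
          zbuf[i] = z;      /* update depth   */
          cbuf[i] = color;  /* update color   */
          return 1;
      }
      return 0;             /* rejected -- on early fixed pipelines, only after
                               the pixel was already textured, lit and blended */
  }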

a month ago

kevingadd

Some modern algorithmic smarts (like hierarchical Z) helped a lot to reduce the memory bandwidth demands, to be fair.

a month ago

BitPirate

The EEVDF scheduling algorithm is also a good example. Designed back in 1995 and it's the default process scheduler of Linux now.

a month ago

throwaway2037

I never heard about this before your post.

Wiki tells me: <<Earliest eligible virtual deadline first (EEVDF) is a dynamic priority proportional share scheduling algorithm for soft real-time systems.>>

Ref: https://en.wikipedia.org/wiki/Earliest_eligible_virtual_dead...

a month ago

twoodfin

On the software side, garbage collection was well explored by academia for more than two decades before the JVM brought it to wide commercial adoption ca. 1995.

a month ago

bhouston

Lisp and Smalltalk and BASIC (to name a few) were popular languages before the JVM that used GC.

a month ago

tasty_freeze

When I was in high school, I wrote a Z80 disassembler in BASIC and printed out the entire TRS-80 Model 3 ROM on a thick stack of fan folded greenline paper. I spent a lot of free time figuring out what was what and commenting the code by pen in the right margin.

The (string) garbage collector was amazingly frugal and correspondingly slow. Because the garbage collector typically ran only when memory was critically low it used almost no memory, just 4 bytes I think. (it was also possible to invoke it any time with the FRE(0) function call IIRC)

The routine would sweep all of the symbol table and expression stack looking for string pointers, and would remember the lowest (highest? I forget which) string base address which hadn't already been compacted. After each pass it would move that one string and fix the symbol table pointer (or expression stack string pointer), then do another pass. As a result, it was quadratic in the number of live strings -- so if you had a large number of small strings, it was possible for your program to lock up solidly for 30 seconds or more.
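
A rough C sketch of that one-string-per-pass compactor, assuming a hypothetical descriptor table (pointer + length) standing in for the BASIC symbol table and expression stack, with each descriptor owning its own string:

  #include <stddef.h>
  #include <string.h>

  typedef struct { char *ptr; size_t len; } StrDesc;

  /* compact live strings toward heap_top, moving exactly one string per sweep,
     which makes the whole collection quadratic in the number of live strings */
  void collect(StrDesc *descs, size_t n, char *heap_top) {
      char *compacted = heap_top;               /* everything at/above is done */
      for (;;) {
          StrDesc *best = NULL;
          for (size_t i = 0; i < n; i++)        /* full sweep of the "symbol table" */
              if (descs[i].ptr < compacted && (!best || descs[i].ptr > best->ptr))
                  best = &descs[i];             /* highest not-yet-compacted string */
          if (!best) break;
          compacted -= best->len;
          memmove(compacted, best->ptr, best->len);  /* move just this one string */
          best->ptr = compacted;                     /* fix its pointer, sweep again */
      }
  }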

a month ago

bhouston

Your description closely matches what is described here: https://en.wikipedia.org/wiki/Garbage_collection_(computer_s...

a month ago

fulafel

Not academia, GC (and lots of other PLT) was mature and productized then, just missed by the majority of the industry living the dark ages. In the 90s, enterprisey architecture astronaut cultures went Java/C++ and hackerish cultures went Perl/Python seasoned with C.

a month ago

fragmede

given how much C++ underpins the video game industry, I don't think "enterprisey architecture astronaut" is entirely fair to video game programmers

a month ago

fulafel

Sure, it's a gross generalization, games and embedded have an excuse. Though I don't think C++ was that big in the mid-90s on games side yet, it was more C.

a month ago

mrlonglong

Speculative predictors have been subjected to a number of attacks to weasel out private data. Given that so many of the common ISAs are vulnerable, are they taking steps to reduce the impact of such attacks?

a month ago

kmeisthax

The vulnerability is speculative execution, not branch prediction. The branch predictor is the thing you have to trick to force the processor to speculatively execute code in the victim program. Furthermore you also need a valid timing source to read out the results of the speculative execution.

As for how to stop that, short of boiling the ocean[0], you don't. Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.

Actual practical mitigations for speculative execution vulnerabilities are varied, but at a minimum you have to ensure process separation between a victim process holding secrets and any potential attackers that may have the opportunity to influence victim process execution. Intel was caught with their hands in their pants speculating across rings, which is why you could read kernel or hypervisor memory from userspace, but on not-poorly-designed CPUs the main victim you have to worry about is HTML iframes. Different origins aren't allowed to make HTTP requests to one another[1], but they can transclude[2] one another without permission. That traditionally loaded information from the origin into the attacker's process, which could be exfiltrated with timing attacks.

The web's solution to this was actually not to process-separate iframes, at least not initially, but to take away shared-memory multithreading entirely. If you deny the attacker a timing reference then it doesn't matter what they can make the victim speculatively execute. But to do this you have to take away multithreading because otherwise a thread can just repeatedly write known data in a loop to create a clock.

[0] https://hackaday.com/2013/08/02/the-mill-cpu-architecture/

[1] At least not without the target origin allowing it via CORS

[2] e.g. hotlink images or embed iframes

a month ago

topspin

> Speculative execution is so valuable for performance that a computer without it is completely unusable.

Jim Keller's view aligns with this and goes further. My interpretation of his thinking is that predictors and speculation are the only meaningful features of CPUs today. ISA doesn't matter anymore because the power of modern compilers makes high performance software highly portable and all CPUs end up bottlenecked on the quality and capacity of the predictors, regardless of the ISA. For example, the burden of x86 complexity no longer matters because it amounts to a "tax" small enough to be lost in the noise.

That's from a designer making high performance RISC-V CPUs.

a month ago

fanf2

This article has several paragraphs discussing how the decode width of x86 front ends is limited by the need to discover instruction boundaries, which in turn limits the useful issue width. Big ARM cores have much larger decode width than x86 cores, so they don’t need SMT to keep their execution units busy.

ISA doesn’t matter any more because all the CISCiest CISCs and the RISCiest RISCs have been discarded (except for RISC V…) so modern CPUs don’t have to cope with multiple memory addresses per instruction or indirect addressing.

a month ago

topspin

> limited by the need to discover instruction boundaries

Keller specifically addressed this. His view is that yes, the x86 instruction-width problem is real, but it's not an important performance bottleneck because decoders are now sophisticated enough that other bottlenecks (predictors) dominate the net performance. Yes, this means more power usage and die area, but only part of the market is sensitive to this: where other considerations dominate, the cost premium of x86 can be ignored.

He also believes that "RISC-V is the future of computing," (not surprising from the CEO of a RISC-V vendor,) so it isn't as if the legacy ISAs are somehow just fine. But ISA complexity isn't the key determinant in that future. The keys are the open ISA, commoditization of designs and high software portability.

This makes sense to me. x86, despite its inherent issues, has fought off alternatives before. But now, as per Keller, there is a Cambrian explosion of RISC-V designs appearing, and what emerges from that is going to prevail against any legacy ISA, not just x86. Ultimately, therefore, the x86 "tax" doesn't actually matter.

a month ago

kllrnohj

> Big ARM cores have much larger decode width than x86 cores

Not in general they don't. Apple's does, and Qualcomm's newest Snapdragon X Elite also does. But most big ARM cores don't have larger decode widths than x86 cores. The Cortex-A78 is a 4-wide decode, same as the majority of x86 CPUs on the market. ARM's latest & greatest Cortex X2 is only a 5-wide decode, it only just finally surpassed the original Zen design (4-wide).

Also Zen 5 bumps to an 8-wide decode, the same as what Apple & Qualcomm's best ARM chips can do.

a month ago

The_Colonel

This is what's so annoying about the hype around ARM, which even smart people fall for. Yes, Apple Silicon is good, but it's not because of the ARM ISA. I still keep hearing that ARM is RISC, which has been wrong since the 1990s.

a month ago

hajile

What about ARM64 is not RISC?

RISC was not just about the total number of instructions, but the complexity of those instructions. Its fundamental conceit is that every instruction should ideally take just one cycle to execute. Intel's iAPX 432 is one of the most CISCy designs out there and a great case that ISA does matter. It had instructions for stuff like data structures, OOP, and garbage collection. These things could take LOADS of cycles to execute.

In contrast, pretty much everything on ARM64 integer spec is going to execute in just a cycle or two.

a month ago

The_Colonel

Thumb-2, Neon, VFPv4-D16, VFPv4 are mandatory instruction extension sets for ARM64, but there are also optional, widely implemented extensions like AES and SVE. Ain't nothing "reduced" about the ARM CPUs' instruction sets anymore.

> In contrast, pretty much everything on ARM64 integer spec is going to execute in just a cycle or two.

The integer spec is the keyword here. ARM CPUs these days implement much more than the integer spec. "If I ignore everything else, it's RISC" is not a good argument IMHO.

a month ago

cesarb

> Thumb-2, [...] are mandatory instruction extension sets for ARM64

Isn't Thumb-2 for 32-bit ARM only? AFAIK, having 32-bit ARM is optional, you can have 64-bit only ARM nowadays.

(And 64-bit ARM no longer has what is IMO the least RISC instructions from 32-bit ARM: the "load multiple" and "store multiple" instructions, which had a built-in loop doing memory accesses to an arbitrary set of registers, were replaced by "load pair" and "store pair", which do a single memory access and work with a fixed number of registers. If you look at https://www.righto.com/2016/01/more-arm1-processor-reverse-e... you can see that the circuitry dedicated for these instructions took a lot of space in the original 32-bit ARM core.)

a month ago

hajile

ARM64 threw out the book and started over. Stuff like Thumb-2 or VFPvX is for the legacy ISA. There are no variable instruction lengths, and NEON is baked into the core of the ISA.

AES in ARM (and in x86, as this came after their attempt to be more RISC-like) was designed to be RISC. A CISC approach would have just ONE instruction that takes the data pointer and the number of rounds and then does everything in one go. In contrast, x86 and ARM use a couple of simple setup instructions, then an AES round instruction (1-2 cycles) that gets called repeatedly, then a couple more finalizing instructions.

If you look at the code on any machine, something like 95+% of it is integer code. Integers are fundamental and everything else is a performance optimization.

The float/vector pipeline is a different thing, but it still takes a RISC approach: 1-5 cycles for most things, a few are 6-7, and a couple like divide and some types of sqrt can take up to a dozen or so cycles (there's a reason they are avoided). This all happens in its own separate pipeline with its own scheduler. All the ideas of being superscalar and avoiding bubbles apply here too. Even the worst case is a far cry from CISC chips where instructions could take dozens or even hundreds of cycles.
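
To make the round-at-a-time shape concrete, here is a minimal x86 AES-NI sketch (ARMv8's AESE/AESMC instructions follow the same pattern). The round keys are dummy constants, not a real key schedule; build with -maes on x86-64:

  #include <wmmintrin.h>   /* AES-NI intrinsics */
  #include <stdio.h>

  int main(void) {
      __m128i block = _mm_set1_epi8(0x42);   /* 16-byte input block */
      __m128i rk[11];
      for (int i = 0; i < 11; i++)           /* placeholder round keys only */
          rk[i] = _mm_set1_epi32(i);

      __m128i s = _mm_xor_si128(block, rk[0]);      /* initial AddRoundKey */
      for (int r = 1; r < 10; r++)
          s = _mm_aesenc_si128(s, rk[r]);           /* one full AES round per call */
      s = _mm_aesenclast_si128(s, rk[10]);          /* final round, no MixColumns */

      unsigned char out[16];
      _mm_storeu_si128((__m128i *)out, s);
      for (int i = 0; i < 16; i++) printf("%02x", out[i]);
      printf("\n");
      return 0;
  }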

a month ago

ribit

While I understand the argument, it would also be good to see some empirical evidence. So far, all x86 builds need more power to reach the same performance level as ARM. Of course, Apple is still the outlier.

a month ago

scheme271

I think a few other things like memory models also matter and affect cpu architecture. E.g. the x86 total store order vs arm64's model. You potentially get to do a few nice optimizations on arm64 vs x86. I'm not sure how much of a difference that makes though.

a month ago

topspin

Memory models are indeed important. That's why Apple extended ARM with total store order capability. A CPU can efficiently implement more than one memory model.

a month ago

hajile

Look into Intel's iAPX and tell me that ISA doesn't matter. How would you go about making that pile of garbage into something fast?

Most CISC designs died. A large part of x86's survival is because it wasn't exceptionally CISC.

a month ago

topspin

> How would you go about making that pile of garbage into something fast?

Make a Rosetta analog and translate to an instruction set that is amenable to efficient prediction, wide dispatch, etc., replacing all the iAPX hardware object-oriented nonsense with conventional logic. If necessary, extend the CPU to accommodate whatever memory ordering behavior is needed for efficient execution, à la Apple Silicon ARM with x86 total store ordering.

a month ago

toast0

> Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.

Pentiums have branch prediction and speculative execution. You need to go back to the i486 if you don't want speculative execution. Most of the socket 5/7 processors from other makers also had branch predictors and speculative execution, but not the Centaur WinChip. The Cyrix 5x86 for socket 3 (486) had speculative execution, but it was disabled by default and is reported to be buggy (though it helps performance on published benchmarks).

a month ago

pezezin

According to Wikipedia, the original Pentium had branch prediction, but speculative execution was first implemented by the Pentium Pro: https://en.wikipedia.org/wiki/P6_(microarchitecture)#Feature...

a month ago

gpderetta

The wiki is wrong or at least misleading. Branch prediction is a form of speculative execution.

Even 486 (and possibly 386) had branch prediction (although a trivial one).

P6 was a huge deal because it added out of order execution.

Edit: I guess it is a matter of semantics: in the classical 5-stage in-order RISC, instructions after the speculated branch are fetched, decoded, etc., but they won't reach the execution stage before the speculation is resolved, so only the branch is technically "executed" speculatively at the fetch stage. So there is less state to unwind compared to a true OoO machine that can run ahead.

a month ago

to11mtm

According to this thing I found [0], the 486 has a predictor for the branch-not-taken case. Not sure exactly what that means, but it looks like it's mostly for instruction fetch/decode, based on other notes. Ironically, that sounds close to your 5-stage RISC description; extra ironic because the 486 is 5 stages too...

P5 has a more 'real' BP but also had the U/V Pipelining.

But yes as far as OoO goes, I think for x86 it was Nx586 first, then PPro (P6), then Cyrix 6x86 [1] and AMD K6 (courtesy of Nextgen tech) the next year.

[0] - https://people.computing.clemson.edu/~mark/330/colwell/case_...

[1] - Worth noting the Cyrix Coma bug, which was a way to express a flaw in the pipeline. https://en.wikipedia.org/wiki/Cyrix_coma_bug

a month ago

gpderetta

As far as I know the 486 predicted all branches as not taken, so it would fetch and decode instructions after a branch and throw them away when the speculation was proven wrong.

I guess calling this speculative execution is a stretch.

a month ago

gpderetta

The first pentium had branch prediction so of course it had speculative execution.

What it didn't have is out of order execution, which greatly increases the speculation window.

a month ago

akira2501

Interactions between speculative execution and virtual memory translation and caches are exploitable. It's not an inherent vulnerability in prediction.

a month ago

paulmd

Sure, as long as it's not observable in any way, it's not observable. But the problem is that speculation has repeatedly been found to be observable in unexpected ways from both brands.

AMD, for example, has observability issues in all Zen 1/2/3 processors that leak enough data to break KASLR, and they remain unpatched in all chips except for Milan (specifically Epyc only, not Ryzen). They didn't expect cache ways to be visible in that fashion, and it's observable that the model is incorrectly implemented.

the idea cannot fail, only be failed, because if it’s observable then obviously you just did it wrong.

a month ago

akira2501

The point was, you can engineer around problems that aren't _inherent_ in the design, but arise as an interaction between systems. You just change the rules of interaction between systems to mitigate the vulnerabilities.

So, to your original question, we're always going to have prediction, and we're going to have to solve any vulnerabilities that arise downstream. Which, thankfully, is always an available option.

a month ago

emn13

As a novice in this area, it's not clear to me after reading this what exactly the 2-ahead branch predictor is.

a month ago

sillywalk

It's from around 30 years ago; my guess is it's referring to this[0] paper from 1996. It's above my head, but it seems to help with branch prediction issues arising from both many instruction units and high clock speeds, which were sort of either/or in the '90s, but I think most modern processors are both.

Multiple-block ahead branch predictors

Abstract:

A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the current instruction block is not used for predicting the address of the next instruction block, but rather for predicting the block following the next instruction block. This approach overcomes the instruction fetch bottle-neck exhibited by wide-dispatch "brainiac" processors by enabling them to efficiently predict addresses of two instruction blocks in a single cycle. Furthermore, pipelining the branch prediction process can also be done by means of our predictor for "speed demon" processors to achieve higher clock rate or to improve the prediction accuracy by means of bigger prediction structures. Moreover, and unlike the previously-proposed multiple predictor schemes, multiple-block ahead branch predictors can use any of the branch prediction schemes to perform the very accurate predictions required to achieve high-performance on superscalar processors."

[0] https://dl.acm.org/doi/10.1145/237090.237169

EDIT: oops. Looks like eyegor posted the link earlier. Oh well, enjoy the abstract.

a month ago

cpldcpu

My understanding is that they do not predict the target of the next branch but of the one after next (2-ahead). This is probably much harder than next-branch prediction, but it allows initiating code fetch much earlier, to feed even deeper pipelines.

a month ago

layer8

Surely you must also predict the next branch to predict the one after. Otherwise you wouldn’t know which is the one after.

Given that, I still don’t understand how predicting the next two branches is different from predicting the next branch and then the next after that, i.e. two times the same thing.

a month ago

ithkuil

Interestingly you can build a branch predictor that predicts the second branch without predicting the first.

A branch predictor result is just a tuple of ("branch instruction address", "branch target address") that hints to the processor that when it encounters the branch instruction at "branch instruction address" in the future, it will likely branch to the branch target, so it makes sense to start fetching that address and filling the instruction pipeline with whatever steps are safe to perform before the jump is actually taken.

Now, commonly this branch happens to be at the end of the current basic block and I assume some branch predictors may also leverage this fact in order to encode only offsets from the current instruction pointer.

But there is no reason the branch location can't come after some other branches have been taken. As long as the CPU eventually gets to that branch location, the prediction will be useful. If the IP never reaches that location, it's as if the branch was never actually taken.
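
A toy C model of that tuple, just to make it concrete (the table size, hashing, and function names are made up; real BTBs use tags, ways, and compressed targets):

  #include <stddef.h>
  #include <stdint.h>

  #define BTB_ENTRIES 4096

  typedef struct {
      uint64_t branch_pc;   /* address of the branch instruction */
      uint64_t target;      /* address we expect it to branch to */
      int      valid;
  } BtbEntry;

  static BtbEntry btb[BTB_ENTRIES];

  static size_t btb_index(uint64_t pc) { return (size_t)(pc >> 2) % BTB_ENTRIES; }

  /* train: once the branch actually resolves, remember where it went */
  void btb_update(uint64_t branch_pc, uint64_t actual_target) {
      BtbEntry *e = &btb[btb_index(branch_pc)];
      e->branch_pc = branch_pc;
      e->target    = actual_target;
      e->valid     = 1;
  }

  /* predict: returns 1 and fills *target if we have a hint for this address */
  int btb_lookup(uint64_t pc, uint64_t *target) {
      const BtbEntry *e = &btb[btb_index(pc)];
      if (e->valid && e->branch_pc == pc) { *target = e->target; return 1; }
      return 0;
  }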

a month ago

toast0

> Given that, I still don’t understand how predicting the next two branches is different from predicting the next branch and then the next after that, i.e. two times the same thing.

I'm not involved in CPU design, I just read a lot, but...

I think you need to do something special to have a second prediction, because you have to track three windows of out of order execution:

Window 0: code you're definitely running but is still being completed.

Window 1: code from the branch you think will be taken

Window 2: code from the 2-ahead branch you think will be taken.

If you figure out that the window 1 branch isn't taken, you have to drop the whole pipeline (pipeline bubble). But if you figure out that window 1 is taken, then window 1 becomes window 0 and window 2 becomes window 1.

With a 1 ahead predictor, the pipeline stalls if you get to a conditional branch while speculating in window 1, because the processor can't manage three instruction windows.

IMHO, it sounds like if the core is doing SMT and both threads are active, each thread only gets 1-ahead prediction because the two windows are statically divided between the CPU threads. This may mean a) a significant boost for some loads when SMT is not in use, and b) SMT can branch predict in both threads in the same cycle, which I don't think was possible on AMD before (no idea for other vendors).

a month ago

gpderetta

With a typical OoO CPU you have reorder buffers in the hundreds of instructions, so you are not speculating two or three branches but potentially dozens ahead of non-speculative execution (and that's why prediction accuracy is so important).

So the question remains: what's the innovation here? I'm sure there is something, but it is not simply speculating two branches ahead.

I need to read further, but this might be an optimization in the fetch stage to avoid a bubble when fetching consecutive taken branches.

a month ago

Filligree

Building on the sibling comment:

  if (a) { ... }

  if (b) { return x; } else { return y; }

The two branches can be wholly independent, but predicting the second is still a two-ahead prediction.
a month ago

fulafel

> Surely you must also predict the next branch to predict the one after. Otherwise you wouldn’t know which is the one after.

I'd think if you are at PC N and there are branches at N+1 and N+2, predicting just the branch at N+2 is fine, because you predicted the N+1 branch previously, at PC N-1.

a month ago

flamedoge

I wonder what they did before this change. Branch predictor hardware may not have accounted for depth beyond a single conditional branch? But the pipeline was probably always kept filled, just unpredicted.

a month ago

emn13

Ah, that makes sense in the context of the article - thanks!

a month ago

hmry

You're not alone. Non-novice here, and same. The article spends a lot of time explaining the absolute basics of branch prediction, but then when it gets to 2-ahead it just skips over explaining it...

a month ago

eyegor

I think it just predicts 2 branches per cycle instead of 1. So it can evaluate the result of n+2 ahead of time instead of only n+1 (typical branch prediction). How this works without wrecking the L1 cache, I'm not sure. It seems like the lookahead past n+1 would make cache evictions much more likely, so maybe I'm missing something here.

> Zen 5 can look farther forward in the instruction stream beyond the 2nd taken branch and as a result Zen 5 can have 3 prediction windows where all 3 windows are useful in producing instructions for decoding.

The original paper is open access but I haven't read far into it: https://dl.acm.org/doi/10.1145/237090.237169

a month ago

phire

> It seems like the lookahead past n+1 would make cache evictions much more likely, so maybe I'm missing something here.

The frontend is already predicting dozens of branches ahead of what the backend can actually confirm. Looking ahead by one extra branch doesn't really hurt.

Also, modern TAGE branch predictors are scary accurate, well above 99% on most code (including unpredictable indirect jumps). Besides, the majority of branch prediction targets are already in the L1 cache, it only predicts them because it saw them recently.

The branch predictor in Apple's M1 actually takes advantage of the latter fact. It doesn't predict what address to fetch next, it predicts which cacheline in L1 holds the target. So you only actually get branch predictions for targets in L1.

a month ago

mjevans

That seems like a good idea against speculation based attacks too; predict within what is at hand, do not cause side effects.

a month ago

phire

I didn't even notice that advantage. I was just thinking about how it minimised each branch target entry to about 16 bits.

I suspect Apple is also using it as a way predictor. If the BTB points directly to the correct cacheline and each cacheline points to the next way, then the only time you need to do a search is on a branch mispredict.

a month ago

tigrou

A regular branch predictor tries to guess which way a branch (e.g. an if-else) will go before it's executed. That way, the CPU can fetch and decode instructions in advance.

Each side of the branch leads to the start of a new block of instructions. The last instruction of such a block is usually another branch.

In other words, a branch predictor guesses the address of the next block. A 2-ahead branch predictor does the same thing, but for the two subsequent blocks.

As stated in the paper: "information from the current instruction block is used for predicting the address of the block following the next instruction block".

Unlike a regular branch predictor, it can do that without having to wait for instructions in the next block to be decoded. That way, you can feed multiple instruction decoders at once.

This is particularly useful in modern CPUs where the instruction decoder has become a bottleneck. With only 1 decoder and 1 instruction decoded per cycle (at best), it cannot keep up with a wide core that can execute many instructions (e.g. 4-6) per cycle.
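
A toy software model of the "block N predicts block N+2" idea described above (the addresses, table size, and training interface are invented purely for illustration):

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  #define ENTRIES 1024

  typedef struct {
      uint64_t block;       /* start address of the current fetch block    */
      uint64_t two_ahead;   /* predicted start of the block after the next */
      int      valid;
  } TwoAheadEntry;

  static TwoAheadEntry table[ENTRIES];

  static size_t slot(uint64_t block) { return (size_t)(block >> 4) % ENTRIES; }

  /* training: once real control flow is known, remember block N -> block N+2 */
  void train(uint64_t block_n, uint64_t block_n_plus_2) {
      TwoAheadEntry *e = &table[slot(block_n)];
      e->block = block_n;
      e->two_ahead = block_n_plus_2;
      e->valid = 1;
  }

  /* prediction: from the current block alone, guess the block after next,
     so its fetch can start without waiting for the next block to be decoded */
  uint64_t predict(uint64_t block_n) {
      const TwoAheadEntry *e = &table[slot(block_n)];
      return (e->valid && e->block == block_n) ? e->two_ahead : 0;
  }

  int main(void) {
      train(0x1000, 0x1480);   /* previously saw 0x1000 -> ... -> 0x1480 */
      printf("%#llx\n", (unsigned long long)predict(0x1000));
      return 0;
  }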

a month ago

deadmutex

You can check out the seminal paper linked in the article. Or start by summarizing the paper with Gemini, Claude, ChatGPT, etc. to get a high level overview (and then confirm the answer by reading the paper).

a month ago

pyrolistical

Would we need more branch hints? https://github.com/ziglang/zig/issues/5177

Cold, warm, warmer, omit hot as it is the default? Sometimes you would set all branches to be cold except one

a month ago

Szpadel

That's probably a bad idea, but I would like to learn why:

Why, when we have a conditional branch, can't we just fetch and prepare instructions for both possible branches and then discard the incorrect one?

Is this that much harder, or are there other reasons that make it not worth it?

a month ago

phire

It's a sub-optimal strategy.

A modern TAGE branch predictor is correct well over 99% of the time. So those extra instructions for the other side of the branch are almost always discarded.

Worse, the frontend is fetching dozens of branches ahead of where the backend can actually confirm which direction to take. What are you going to do at the next branch, start decoding four possible paths? Then 8, 16, 32 possible paths? Remember, most of the time you are going to throw it all away.

If you actually have the hardware to fetch from multiple instruction streams in parallel (which Intel's Gracemont/Goldmont/Skymont and now AMD's Zen 5 do), the better strategy is to assume your branch predictor is correct 100% of the time: follow one side of the branch, then the one after it.

Intel's Skymont actually decodes the next 3 branch targets in parallel because it has three decoders, each 3-wide. Intel even introduces fake branches to break up large blocks of code, so that all three decoders are always active decoding different parts of the upcoming instruction stream. The three uop streams are later merged, allowing Skymont to maintain an effective decode bandwidth of 9 instructions per cycle.

If you executed both sides of the branch, you would only slightly reduce the branch mispredict delay in the rare case where the prediction was wrong. Instead, by following one side for the next two or three predictions, Intel and AMD can make multiple decoders do work in parallel. Intel is doing 9-wide with three simpler 3-wide decoders, and AMD can do 8-wide with two simpler 4-wide decoders.

a month ago

tigrou

Are you saying that branches can be used as a way of dividing code into several parts, so that several instruction decoders can be fed at the same time? Is it because the instruction decoder is actually one of the bottlenecks, not being able to deliver instructions fast enough to the hungry frontend? If branches can be used in this way, unconditional jumps (e.g. gotos) and function calls can probably be used for similar purposes.

a month ago

phire

Yep. It uses the branch predictor to slice up the instruction stream and feed it to multiple instruction decoders, because the variable-length nature of x86 instructions makes it hard to implement extremely wide instruction decoders. To reach a throughput of 4 instructions per cycle, the typical 4-wide x86 decoder actually needs 16 length decoders running in parallel, one for every byte position. The next pipeline stage then picks the 4 outputs that don't overlap before feeding them into the proper decoders.

Intel have actually implemented a 6-wide decoder for Golden Cove, but they must really be pushing the limits of propagation delays. So it's attractive to just use two or three smaller decoders in parallel.

> If branches can be used in this way, unconditional jumps (e.g. gotos) and function calls can probably be used for similar purposes.

It's a common misconception that the branch predictor only predicts conditional branches. It needs to predict anything that causes control flow to diverge. So unconditional branches, calls, returns, and maybe even syscalls all count as "branches" and get handled by the branch predictor.

(On most RISC instruction sets, these instructions are often labeled branches. Call is just "Branch And Link", which saves the return address in a register. To return, you just use the "Branch to Register" instruction.)

This is necessary because the frontend is pipelined. It takes a minimum of four cycles to fetch the instruction data from the instruction cache and decode it before the frontend can even know there is a call, return, or unconditional branch instruction. If we are decoding four instructions per cycle, we are already 16 instructions past the branch before we even know it's there, and it's common to have a branch every 3-4 instructions.

Basically, the frontend is blind and can't see any branches.

Predicting the direction of conditional branches is almost a secondary task for modern branch predictors. It's much more important to predict where the branches are and their destination, so that instruction fetch stage can request the correct data from the instruction cache and keep the pipeline fed.

a month ago

tigrou

Thanks for your reply. I learned a lot. For people wondering why you need a lot of decoders with x86, this answer explains it well (see the "Decode problem" part): https://qr.ae/p2IEt9

a month ago

swatcoder

Because it's rare for a branch result to be random. The compiler/runtime/CPU/etc. can often guess which result is more likely and correctly not do the extra work in the first place, so that's usually the better strategy, rather than spending silicon and heat on the wrong answer just in case.

I think a lot of people don't have an intuition for how accurate branch prediction can be, but if you look at your own code, you'll quickly realize "well, yeah, control flow is almost always going to go this way and we just have this branch so we can handle the exceptional case" -- compilers can often deduce this pretty well themselves now, CPUs/JITs/runtimes can develop some pretty impressive heuristics as well, and when all that fails you can often add explicit hints in your code that tell your compiler/etc. what you expect if they can't guess.
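
A minimal sketch of the kind of explicit hint mentioned above, using GCC/Clang's __builtin_expect (C++20 spells it [[likely]]/[[unlikely]]); the function and names here are made up for illustration:

  #include <stddef.h>
  #include <stdio.h>

  #define likely(x)   __builtin_expect(!!(x), 1)
  #define unlikely(x) __builtin_expect(!!(x), 0)

  /* hypothetical example: tell the compiler the error path is cold so it lays
     out the hot loop as the fall-through path */
  int checksum(const unsigned char *buf, size_t len) {
      if (unlikely(buf == NULL || len == 0)) {   /* exceptional case */
          fprintf(stderr, "bad input\n");
          return -1;
      }
      int sum = 0;
      for (size_t i = 0; i < len; i++)           /* the loop branch is ~always taken */
          sum += buf[i];
      return sum;
  }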

a month ago

branko_d

> it's rare for a branch result to be random

How rare, though?

QuickSort has fundamentally unpredictable branches, and it’s a pretty widely used algorithm. Binary search, B-trees also come to mind.

a month ago

kllrnohj

Binary searching is quite slow and should be used sparingly, but not necessarily because of branch misprediction -- it's because of memory stalls: you're almost always guaranteed to have a cache miss during the search. Similarly for B-trees, it's memory stalls that you're probably more focused on addressing, not branch mispredicts.

a month ago

emn13

This probably depends on the size of the area to be searched, and just how hot that region is. After all, if it's fairly small, there won't be any cache misses, and the data structure does use less memory than a typical hash table, which is itself an advantage.

a month ago

orf

If the size of the data is small, a linear search through a contiguous array is going to be far faster than anything more complex.

a month ago

emn13

Yep; though we'd have to test a few cases to figure out what the cutoffs are here, and if there's any middle ground left for a divide-and-conquer strategy.

It's also definitely going to depend on the cost of the hash function and comparison function - for something like strings, where those can be quite expensive, binary search probably has a better chance of applicability than for guid's say.

a month ago

kllrnohj

For things with non-trivial comparison functions you're all but certainly better off with those in something else like a hashmap. After all, the more expensive the compare, the more expensive the sort & reordering that a binary search requires. And then for trivial comparison objects, binary searching is still slower even at "huge" sizes like 10,000. It's really hard to find a good use of binary searching an array on modern CPUs.

a month ago

eru

Well, it depends on how expensive your comparison is compared to a cache miss.

a month ago

barrkel

B-trees are cache friendly, spending more time doing linear scans at each level to keep the tree shallow, and thus indirection less frequent. They're designed for high latency indirection - loading pages from spinning rust.

a month ago

saagarjha

Most code with branches is not “algorithms”. (Bear with me.) It's simple loops and function calls and vtable indirection. Those can be predicted with very high accuracy, and loops in particular by their very nature dominate execution. For every unpredictable branch in your quicksort there are several very predictable branches: bounds checks, the recursion base case, the function returns.

a month ago

eyegor

Also happens to be why quicksort loses to almost anything else on small arrays, even bubble sort

a month ago

gpderetta

Qsort can be implemented branchless by converting the branch into a data dependency.
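A minimal sketch of that trick, in the style of a branchless Lomuto partition (my own illustration, not any particular library's code): the comparison result is consumed as an index increment instead of steering an if, so the loop body contains no data-dependent conditional branch at all; the only branch left is the highly predictable loop-termination check.

    // Partition [begin, end) so that elements < pivot come first.
    // Returns a pointer to the start of the ">= pivot" region.
    int* partition_branchless(int* begin, int* end, int pivot) {
        int* i = begin;                   // boundary of the "< pivot" region
        for (int* j = begin; j != end; ++j) {
            int x = *j;
            bool smaller = x < pivot;     // 0 or 1, consumed as data below
            *j = *i;                      // unconditional exchange preserves the
            *i = x;                       //   invariant whether or not x < pivot
            i += smaller;                 // grow the left region only when x < pivot
        }
        return i;
    }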

a month ago

sjburt

In benchmarks, branch predictors guess correctly 90-95% of the time.

a month ago

jayd16

Loops are usually much more one way than the other.

a month ago

sapiogram

Disclaimer: I don't work in this field, just an enthusiast.

As far as I can tell, branch predictors have always been too good for it to be worth it. Modern CPUs have instruction reorder buffers that are hundreds of instructions deep, so even if only 8 of those instructions are conditional jumps, there are 256 different paths your program could take. If your branch predictor predicts all 8 correctly >50% of the time (it does), doing 256x the work to cover your ass is not worth it.

a month ago

ithkuil

That's called speculative execution and IIRC all modern CPUs are doing it.

It requires more silicon to hold more microarchitectural state and more execution units to fully exploit the technique, but superscalar CPUs already have those since they are essential to exploit instruction level parallelism in non-branchy code. The rest is "just" a lot of headaches to handle complicated stuff such as aliasing, interrupts, ... But hardware engineers are such wizards they can do these things too.

Turns out however that speculative execution opens up a possibility of abusing a cache timing side channel to extract information from data touched by branches of code that has been only speculatively executed but whose architectural side effects were not committed (i.e. not "really" executed).

Which includes code that had been explicitly not executed because of a conditional check (e.g. permissions, ...)

A familiar instance of such an attack is Spectre [1]

1: https://en.m.wikipedia.org/wiki/Spectre_(security_vulnerabil...

a month ago

MBCook

We reached 90% accuracy decades ago. Depending on workload modern chips can do way better.

So basically it’s just nowhere near worth it. Much better to use those chip resources for another thread or core.

a month ago

wmf

It's a huge waste of energy and in some cases it would even be slower because you'd execute more instructions overall. If the branch mispredict rate is around 1% it's simply not worth paying a penalty 99% of the time to get a gain 1% of the time. Maybe it would be worth doing on low-confidence branches.
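As a rough worked example with assumed numbers: if the mispredict penalty is around 15 cycles and the miss rate is 1%, the expected cost is only about 0.15 cycles per branch. Eagerly fetching and executing both sides instead spends the full wrong-path work (fetch, decode and execution slots plus their energy) on every single branch, which is far more than 0.15 cycles' worth unless the branch is genuinely close to unpredictable.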

a month ago

[deleted]
a month ago

kolbe

The vast, vast majority of instruction pipeline slots go unused. You can pack in non-dependent instructions essentially for free.

a month ago

gpderetta

Divergence. On an OoO pipeline fetch might be ahead of retire by hundreds of instructions. That's dozens of branches, and if each caused execution to speculatively fork, it wouldn't scale.

You could potentially do it only for predicted "unpredictable" branches. Now the tradeoff is wasted power and execution units for dead work, and so far the tradeoff has just not been worth it.

Some form of this has been experimented with. In the late '90s SPARC experimented with scouting threads, and as mentioned elsewhere in the thread, the efficient Intel cores (E-cores) can fetch (but not execute) across branches.

a month ago

0x000xca0xfe

- Side effects; how do you handle two different writes to memory?

- Double the execution units; very expensive for wide vector units

- Massive waste of energy as half the resources will always be wasted no matter what

- Bad scaling, i.e. four branches ahead would require 16x the resources

a month ago

isotypic

Handling two different writes to memory is not really a concern - existing speculative/out-of-order processors already solve this issue by completing instructions (performing their architectural side effects) in order. So even if two writes are made, one in each branch path, by the time a write is due to complete, the earlier branch has resolved; we know which write was actually meant, and the bad one can be discarded.

Doubling the execution units also isn't strictly needed - you can use the existing out-of-order core to send two sets of instructions through the same functional units. There will be more contention for the resources, possibly causing stalls, but you don't need to fully double everything.

Things similar to this idea are already done in processors - simultaneous multithreading, early branch resolution, and conditional instructions are all ideas with similar implementation difficulties. So the reason this specific idea is not done is more in line with your last two points than the first two.

a month ago

eigenform

Most branches are biased one way or the other. "Fetching down both paths" means not exploiting any information you might have gathered about a branch being biased - I think that would be equivalent to randomly predicting the branch (except it would cost more than a random predictor, because you're actually fetching both ways instead of just one).

a month ago

[deleted]
a month ago

dymk

Transistor count; now you have to duplicate all the decode and speculative execution circuitry for both possible branches

a month ago

immibis

No, the same circuits would execute them interleaved just like they execute multiple hardware threads now

a month ago

nomel

With the SMT core count having to be one less.

a month ago

immibis

No.

a month ago

nomel

Could you expand on that a bit? How could the "same circuit" be used for twice as many things, for free? If it's currently being used as an SMT core, how could it also support the second branch prediction, for both SMT cores?

I say the SMT core would be reduced by one since, rather than being used as an SMT core, it would be used as the second branch prediction thing.

a month ago

bsder

> why when we have a conditional branch we cannot just fetch and prepare instructions for both possible branches and then discard the incorrect one?

We actually do that. It's called a GPU. And it sucks for general code.

a month ago

hnpl

I'd love to see the performance data before judging whether it is a good idea. There's also no information on the branch misprediction penalty of this approach.

Anyway, I think the intuition of this approach is to aggressively fetch/decode instructions that might not already be in the L1 instruction cache / micro-op cache. This is important for x86 (and probably RISC-V) because both have variable instruction lengths, and just by looking at an instruction cache block, the core wouldn't know how to decode the instructions in it. Both ISAs (x86, RISC-V) require knowing the PC of at least one instruction to start decoding an instruction cache block. So, knowing where the application can jump to 2 blocks ahead helps the core fetch/decode further ahead than the current approach allows.

This approach is comparable to instruction prefetching, though instruction prefetching does not give the core information about the starting point.

(High-performance ARM cores probably won't suffer from the "finding the starting point" problem because every instruction is 32 bits long, so decoding can be done in parallel without knowing a starting point.)

This approach likely benefits front-end-heavy applications (applications with hot code blocks scattered everywhere in the binary, e.g., cloud workloads). I wonder if there's any performance benefit/hit for other types of applications.

a month ago

im3w1l

I still have no idea what a 2-ahead branch predictor is.

a month ago

adrian_b

A good place to start is the old research papers linked in the article.

In general, the older research papers assume that the reader knows much less about such subjects, because they were still much more niche knowledge at the time.

a month ago

im3w1l

Okay, so a "block" is a sequence of instructions that can be executed from top to bottom, i.e. only the last instruction may be a jump.

A 2-ahead branch predictor is a branch predictor that predicts not which block to run next after the current one, but rather, which block should be run after that one.
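A toy model of that idea in C++, purely for illustration (my own sketch of the concept, nothing like the actual hardware structures AMD uses): the prediction table maps each block's start address to the start of the block two ahead, so the fetch stage can issue a request for block N+2 while block N+1 is still being fetched and decoded.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct TwoAheadToy {
        // block start address -> predicted start address of the block two ahead
        std::unordered_map<std::uint64_t, std::uint64_t> table;

        // Train on an observed trace of basic-block start addresses.
        void train(const std::vector<std::uint64_t>& block_trace) {
            for (std::size_t i = 0; i + 2 < block_trace.size(); ++i)
                table[block_trace[i]] = block_trace[i + 2];
        }

        // Predict which block to fetch two ahead of the current one (0 = no prediction).
        std::uint64_t predict(std::uint64_t current_block) const {
            auto it = table.find(current_block);
            return it == table.end() ? 0 : it->second;
        }
    };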

a month ago

vegabook

Now all it needs is more memory bandwidth, because those two memory channels on the consumer AM5 socket are pathetic given the compute this will deliver, and especially in comparison with even the most basic ASi.

I moved to an M2 Max from a chunky Zen setup and it's a revelation how much the memory bandwidth improvement accelerates intensive data work. Also for heavy-ish multitasking the Zen setup's narrow memory pipe would often choke.

a month ago

AnthonyMouse

There are very few applications that are actually memory bandwidth bound but aren't more suited to a GPU than a CPU.

The reason people have been looking at Apple Silicon for LLMs in particular is that even though they are more suited to GPUs, they also require a lot of VRAM and NVIDIA charges an extortionate amount of money for GPUs with a lot of VRAM.

What AMD should really do if they want to steal NVIDIA's thunder is to sell consumer GPUs with 64-128GB of VRAM.

a month ago

The_Colonel

> What AMD should really do if they want to steal NVIDIA's thunder is to sell consumer GPUs with 64-128GB of VRAM.

For gaming it's overkill (esp. given the price); it would be useful only for LLM enthusiasts, which isn't that big of a market.

a month ago

AnthonyMouse

Then how come so many people are buying GPUs with that amount of VRAM for extraordinary amounts of money?

Nobody said it was for gaming. Though of course consumers could use it for both.

a month ago

kimixa

But AMD already offer GPUs with that amount of VRAM for extraordinary amounts of money, so there'll be no change?

And from what I hear they're selling every one they're producing, even to the level where there's a ~6 month waiting list. So even if they reduced the price, the only difference would be that the list would get longer and AMD would get less money. No more devices would actually make it into people's hands, since they're production limited.

a month ago

AnthonyMouse

The production limit isn't in the amount of VRAM. They could offer any of their existing consumer GPUs in versions with more VRAM with prices that add twice their own cost for the extra VRAM (but still far less than the datacenter versions) and thereby sell the same GPUs for higher margins.

This might even leave gamers with more GPUs, because right now there are people buying e.g. four 24GB consumer GPUs to run a large model, and they could instead buy one 96GB GPU and leave three more for gamers.

a month ago

kimixa

You're not really saying "They're not offering a product" so much as "They're not offering a product at a price I want", which is really a very different statement.

They're intentionally differentiated in the market and marked up due to the relatively higher R&D costs from targeting a much smaller market niche, with its own demands on development. I doubt you'll just be running games on them, after all.

If that is your niche, AMD is effectively saying they don't want your custom at that price. No point selling products that lose money, after all. AMD have done it before - arguably are doing it right now as their graphics BU lost money last year if you exclude APUs and consoles - and it went badly each time (despite what some people say about "Market Share!" on some internet forums).

a month ago

AnthonyMouse

That strategy doesn't win though. The enterprise is going to buy the enterprise GPU anyway because they're spending someone else's money and it's a tax write off and the enterprise GPUs are faster and have ECC memory in addition to having more VRAM. But giving something for hobbyists to play with is how you build an ecosystem. So not only do they get higher margins on their consumer GPUs, they get more code written for their architecture, which lets them sell more of the expensive GPUs.

Not doing that is making a mistake.

a month ago

cesarb

> because those two memory channels on the consumer AM5 socket

There are actually four memory channels in AM5, because DDR5 doubles the number of channels.

a month ago

pezezin

But each channel is half the width, so we are back at the same point.

a month ago

yogrish

When I began my career as a DSP optimization engineer, I worked on ZSP processors. I had a particular passion for branch prediction. By analyzing data, we would code and recode branch instructions to minimize mispredictions. I loved the superscalar architecture of ZSP, and its branch prediction feature was amazing, besides its multi-instruction execution per cycle.

a month ago

phkahler

>> Now when Zen 5 has two threads active, the decode clusters and the accompanying fetch pipes are statically partitioned.

This sounds like a big boost for hyper threading performance. My Zen1 gets about 25 percent faster due to HT. Has anyone tested the newer ones in this regard?

a month ago

paulmd

High SMT speedups aren’t a good thing, because those are pipeline bubbles - resources that can’t be saturated by the first thread.

The ideal speedup from SMT is 0% because you’d be getting full output for a single thread. Like if you got a 50% speedup from SMT, in an ideal world that means a single thread could run 50% faster than it is, but it’s being stalled by pipeline bubbles.

a month ago

Remnant44

For a fully compute-bound workload, you're certainly correct.

That's rare though. All it takes is a couple stalls waiting on memory where a second thread is able to make progress to make that "ideal speedup" certainly be nonzero.

a month ago

paulmd

Regardless though why would it potentially being higher in newer architectures be viewed as a good thing?

a month ago

Remnant44

Because most code is not running anywhere near saturation of the available resources, and the problem is only getting worse as cores get wider. I mean, look at the Zen5 block diagram - there are 6 ALUs and 4 AGUs on the integer side alone! That's almost two entire Zen1 cores worth of execution resources, which is pretty amazing. Very, very little real world code is going to be able to get anywhere near saturating that every cycle. SMT helps improve the utilization of the execution resources that have already been paid for in the core.

I'll give another example from my own experience. I write a lot of code in the computer graphics domain. Some of the more numeric-oriented routines are able to saturate the execution resources, and get approximately 0% speedup from SMT.

Importantly though, there are other routines that make heavy use of lookup tables. Even though the tables reside completely within L1 cache, there are some really long dependency chains where the 3-4 cycle wait for L1 compounds and causes really low utilization of the ALUs. Or at least, that's my theory. :) Regardless, in that code SMT provides about a 30% speedup "for free", which is quite impressive.

I was uncertain for a while whether SMT had a future, but I think for x86 in general it provides some pretty significant gains, for a design complexity that has already been 'paid' for.
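The lookup-table pattern described above tends to look something like the following toy C++ loop (my own sketch, not the poster's code). Each load's address depends on the previous load's result, so even with L1 hits the chain serializes on the L1 load-to-use latency of a few cycles, leaving most of the core's execution slots idle for an SMT sibling thread to use.

    #include <array>
    #include <cstdint>

    // Pointer-chase through a small, L1-resident table: a serial dependency
    // chain that is latency-bound rather than throughput-bound.
    std::uint32_t chase(const std::array<std::uint32_t, 4096>& table,
                        std::uint32_t idx, int steps) {
        for (int i = 0; i < steps; ++i)
            idx = table[idx & 4095];  // next index depends on the previous load
        return idx;
    }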

a month ago

adrian_b

With the continuous improvement of out-of-order execution, the SMT gains have been diminishing from Zen 1 to Zen 4.

However you are right that Zen 5, like also the Intel Lion Cove core, has a jump in the number of available execution resources and it is likely that out-of-order execution will not be enough to keep them busy.

This may lead to a higher SMT gain on Zen 5, perhaps on average around 30% (up from typically under 20% with Zen 3 or Zen 4), like in the Intel presentation where they compared a Lion Cove without SMT with a Lion Cove with SMT. In the context of a hybrid CPU, where the MT performance can be better provided by efficient cores than by SMT, Intel has chosen to omit SMT for better PPA (performance per power and area), but in the context of their future server CPUs with big cores, they will use cores with SMT (and with wider SIMD execution units, to increase energy efficiency).

a month ago

Dylan16807

> Regardless though why would it potentially being higher in newer architectures be viewed as a good thing?

Because SMT getting faster is a nearly free side-effect. We didn't add extra units to speed up SMT at the cost of single-thread speed. We added extra units to speed up the single thread, and they just happened to speed up SMT even more (at least for the purpose of this theoretical). That's better than speeding up SMT the same percent, or not speeding up SMT at all.

Imagine if I took a CPU and just made SMT slower, no other changes. That would be a bad thing even though it gets the speedup closer to 0%, right? And then if I undo that it's a good thing, right?

a month ago

paulmd

this doesn't seem to reflect the reality of the way hardware is actually being added to the cores, where Zen5 features a fourth ALU which is only useful to a single thread in the low-single-digits.

https://old.reddit.com/r/hardware/comments/1ee7o1d/the_amd_r...

this isn't adding more units to speed up the single thread with SMT being a nice "incidental" gain; this is actively targeting wide architectures that have lots of pipeline bubbles for SMT to fill.

and that's fine, but it's also a very different dynamic than, e.g., Apple silicon, where the focus is running super deep speculation and reordering on a single thread.

a month ago

Dylan16807

What do you think they should add instead of another ALU?

The returns are diminishing, but a single-digit bump is still pretty good when your goal is faster single threads.

Adding nothing to save space is obviously not the answer, because that leads to having more slower cores.

Also the topic of the post, the 2-ahead branch predictor, exists specifically to get more ALUs running in parallel!

> apple silicon, where the focus is running super deep speculation

And to make that speculation profitable, the M1 cores have 6 ALUs among other wideness.

a month ago

adrian_b

Yes, while in my older tests on Intel Skylake derivatives I have also obtained SMT speedups around 25%, on the newer Zen 3 (a 5900X) I have obtained at most a 20% speedup in the most SMT friendly task that I have ever encountered, i.e. the compilation of a big software project (the comparison being done between optimal parameters for SMT disabled vs. SMT enabled, which for a 5900X have been determined to be "make -j13" vs. "make -j24").

An example of a multithreaded benchmark that is not SMT friendly is the GeekBench 6 multithreaded test, where Zen 3 with SMT disabled (12 threads on a 5900X) is slightly faster than with SMT enabled (24 threads on a 5900X).

a month ago

ColonelPhantom

It's worth noting that compilation is a partially serial task (e.g. linking is often largely single-core). It's entirely possible that going from 4 to 8 threads is much more helpful than 12 to 24 threads, as a 24-thread system will have far more idle threads in comparison. (Of course this is assuming 4c8t Skylake, so a normal consumer i7. Skylake-X had more cores.)

a month ago

adrian_b

For big projects, as I have mentioned, the linking phase is at most a few percent of the compilation time.

I have CPUs with 48 threads and for big projects the compilation time decreases monotonically from 1 thread to 48 threads (and almost proportionally with the number of threads until 24 threads, then with a constant smaller slope until 48). The published benchmarks show that for big software projects the compilation times decrease monotonically until 512 threads on a 2-socket MB with 128-core CPUs.

So the compilation of a big software project, like Chromium, Firefox, LLVM, gcc, the Linux kernel, Libreoffice, etc. (all of which have many thousands of files that must be compiled) is one of the tasks that can use efficiently any number of threads that is currently available.

Moreover, there are now linkers that work partially concurrently. Tasks like the relocation of the object files and of the external symbol references can be done completely concurrently for all source files (after the start addresses are known for all object files and the corresponding object files are known for each symbol).

a month ago

ryukoposting

4 years after graduating college, my decision to dive into computer architecture classes has borne no fruit except my ability to loosely understand what writeups like this are talking about. But, I guess that's the point, isn't it? This is fascinating stuff, whether or not I need to know it.

a month ago

brcmthrowaway

Did the paper author get a cut from AMD?

a month ago

adrian_b

They have no reason to give him any cut.

On the contrary, he points out that one of their most important innovations is not that innovative, because it was analyzed in detail almost 30 years ago.

Before AMD, Intel had started to use such multiple decoders in their Atom cores, now rebranded as E-cores (efficient cores). Multiple decoders were first used in Tremont (2020), then in Gracemont (2021) and Crestmont (2023), and now in Skymont (2024).

It is hard to predict which of the decoders of Skymont (triple 3-instruction decoder) or of Zen 5 (double 4-instruction decoder) is better. AMD has made the choice of the double decoder because their core uses SMT, and a double decoder is easy to partition into two when both simultaneous threads are active.

a month ago

sholladay

Despite being aware of AMD’s Zen chips, for a moment I thought this was about a ZEN player. Good times!

https://en.wikipedia.org/wiki/Creative_Zen

a month ago