XLSTM: Extended Long Short-Term Memory

197 points
12 days ago
by mauricesvp

Comments


albertzeyer

It seems Sepp Hochreiter has been talking about this model since Oct 2023: https://github.com/huggingface/transformers/issues/27011

In the scaling law comparison, I wonder if it is reasonable to compare the number of parameters between Llama, Mamba, RWKV, and xLSTM? Isn't compute time more relevant? E.g. in the figure about scaling laws, replace the number of parameters with compute time.

Specifically, the sLSTM still has recurrence (memory mixing) in it, i.e. you cannot fully parallelize the computation. So scaling up a Transformer could still look better when you look at compute time.

It seems neither the code nor the model params are released. I wonder if that will follow.

12 days ago

korbip

Disclaimer: I'm a shared first author of this paper.

As a clarification: the training speed will be on par with FlashAttention-2 when fully optimized and only including the mLSTM. For decoding/inference, both are very close to Mamba, as xLSTM is a recurrent architecture. The sLSTM has memory mixing, that is, state-tracking capabilities, for problems that Transformers and State Space Models (and any other sequence-parallelizable architecture) fundamentally cannot solve.
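For readers wondering why recurrent decoding is cheap, here is a minimal sketch (my own illustration in PyTorch, not the authors' code) of a matrix-memory recurrence of the kind the paper describes for the mLSTM; the exponential gating details and the normalizer state are simplified away, and all names are made up:

```python
import torch

d = 64
C = torch.zeros(d, d)                     # fixed-size matrix cell state
for t in range(16):                       # decoding loop
    k, v, q = torch.randn(3, d)           # toy key / value / query
    f, i = torch.sigmoid(torch.randn(2))  # toy forget / input gates
    C = f * C + i * torch.outer(v, k)     # state update, O(d^2) per token
    h = C @ q                             # readout for this token
```

The point: the state never grows with context length, so per-token cost stays constant, unlike a Transformer's ever-growing KV cache.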

12 days ago

brookst

Congrats on the paper, very interesting.

Can you opine on how the model will fare on hardware that is optimized for transformers? There is so much investment in accelerating the transformer arch[1][2], will xLSTM / sLSTM benefit as well, or will the hardware optimizations give transformers enough of an advantage that it’s hard to compete on general purpose hardware?

1. https://www.etched.com/

2. https://www.embedded.com/ai-chip-features-hardware-support-f...

11 days ago

deepnet

Fascinating work, very promising.

Can you summarise how the model in your paper differs from this implementation of xLSTM?

https://github.com/huggingface/transformers/issues/27011

12 days ago

korbip

Thanks! I don't see any implementation there. In any case, we are planning a code release soon.

7 days ago

WithinReason

Can you expand on the "cannot solve fundamentally" part?

12 days ago

lucidrains

11 days ago

Der_Einzige

So does anything do proper state tracking? And don't point to the OP, since very often purportedly better new architectures end up being basically vaporware (like Mamba or RWKV, which still don't have good-quality pretrained models yet)

11 days ago

impossiblefork

How do you mean vaporware?

Surely whether a big model using a certain system exists is only a matter of the choices of those with sufficient resources to train it. That's a matter of their beliefs, not of actual model performance.

11 days ago

[deleted]
11 days ago

thomasahle

Transformers and SSMs can't do long computations that are inherently sequential.

Unless you give them chain of thought. In which case they do great.

11 days ago

albertzeyer

Congratulations on the paper. That's some very interesting work!

But you would want to include the sLSTM as well to get the best performance, right? How does the speed compare in that case, specifically when scaling up?

12 days ago

korbip

Thank you! I can say that it is not really a diminishing factor at the scales reported in the paper: xLSTM[7:1] (seven mLSTM blocks per sLSTM block) is pretty much on par with xLSTM[1:0] in speed. We show that the sLSTM is helpful on toy tasks, and it shows even better sequence-extrapolation performance, so yes.

12 days ago

goldemerald

Great work! I'd love to start using the language model variant of your work. Do you know when/if it will be open sourced? I'd start using it today if it were that soon.

11 days ago

hh1

When you talk about "c" or "scalar memory" in the paper, does that refer to a single unit in the vector usually referred to as c?

So in mLSTM, each unit of the vector c is now a matrix (so a 3d tensor)? And we refer to each matrix as a head?

I'm having a bit of trouble understanding this fundamental part.

11 days ago

korbip

You mainly got it right. Usually one has many scalar 'c' cells that talk to each other via memory mixing. For the sLSTM, you group them into heads, with cells talking only to cells within the same head. The reason we referred to scalar cells here is that they are the fundamental building block; many of them are usually combined, and vector notation is useful in that case.

For the matrix 'C' state, there are also heads/cells in the sense that you have multiple of them, but they don't talk to each other. So yes, you can view that as a 3D tensor. And here, the matrix is the fundamental building block / concept.
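To make the shapes concrete, here is a minimal sketch (names and dimensions are my own illustration, not from the paper's code):

```python
import torch

batch, heads, d_head = 2, 4, 64

# sLSTM: many scalar 'c' cells; memory mixing happens only among
# cells within the same head.
c_slstm = torch.zeros(batch, heads, d_head)

# mLSTM: each head's cell state is a d_head x d_head matrix 'C',
# and heads do not talk to each other -- a 3D tensor per batch item.
C_mlstm = torch.zeros(batch, heads, d_head, d_head)
```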

7 days ago

SpaceManNabs

> For decoding/inference both are very close to Mamba as xLSTM is a recurrent architecture

Can you explain this statement more if you have time? Are you saying the recurrent architecture of xLSTM enables fast inference on par with Mamba? Or does the xLSTM architecture slow it down so that its inference is as slow as Mamba's?

11 days ago

logicchains

To clarify, is the sLSTM strictly necessary (to achieve better accuracy than those other architectures), or is the mLSTM good enough? The xLSTM[1:0] model in the paper seemed to do quite well.

12 days ago

korbip

For language in general it seems fine. But there might be specific tasks where it is necessary indeed.

11 days ago

YetAnotherNick

Recurrence is less of an issue with really large model training than it is with medium-sized models. Medium-sized transformer models are generally not trained with sequence parallelism, but sequence parallelism is getting more common with transformer training. And sequence parallelism is the same for a transformer or a recurrent model.

For really large models, it is in fact easier to achieve peak flops because the computation required scales faster than the memory bandwidth required (square vs cube).
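One plausible reading of the "square vs cube" remark, sketched below under my own assumptions (the mapping isn't spelled out above): for square matrices, multiply cost grows cubically while the data moved grows quadratically, so arithmetic intensity, and hence achievable FLOPs, improves with size.

```python
# My interpretation, not the commenter's code: for an N x N matmul,
# FLOPs grow as N^3 while bytes moved grow as N^2, so FLOPs per byte
# rises roughly linearly with N and big matmuls can stay compute-bound.
def matmul_intensity(n: int, bytes_per_el: int = 2) -> float:
    flops = 2 * n**3                        # multiply-adds
    bytes_moved = 3 * n**2 * bytes_per_el   # read A and B, write C
    return flops / bytes_moved

for n in (1024, 8192):
    print(n, matmul_intensity(n))
```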

12 days ago

albertzeyer

With sequence parallelism, do you mean increasing the batch size, i.e. the number of sequences in a batch?

> Medium sized transformer models are generally not trained with sequence parallelism, but sequence parallelism is getting more common with transformer training

Is there some word missing? You mean it's more common for large-sized Transformers?

> computation required scales faster than memory bandwidth required (square vs cube)

That is an interesting thought. I'm trying to understand what exactly you mean. You mean, computation time is in O(N^2) where N is the sequence length, while required memory bandwidth is in O(N^3)? Why is that?

12 days ago

YetAnotherNick

No, it means dividing the sequence into multiple chunks and processing them one by one, very similar to recurrence. See [1]. Sequence parallelism is needed when the sequence can't fit in a single GPU. It is the hardest kind of parallelism, but it is required for longer sequences. Many models just train with a smaller sequence length for the majority of training and switch to sequence parallelism for the last few percent of training.

[1]: https://arxiv.org/pdf/2105.13120
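A toy sketch of that chunked processing (illustrative only; `step_fn`, `state`, and the chunking scheme are hypothetical, and real sequence parallelism shards the chunks across devices, as in the linked paper):

```python
def process_in_chunks(step_fn, tokens, chunk_size, state):
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        out, state = step_fn(chunk, state)  # state carries context
        outputs.extend(out)                 # across chunk boundaries
    return outputs, state
```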

12 days ago

logicchains

>Sequence parallelism is the hardest parallelism, but it is required for longer sequence

In terms of difficulty of implementation, it's arguably much easier than pipeline parallelism, which I'd argue is the hardest kind (at least to implement efficiently, without bubbles) and takes the most lines of code to implement. Sequence parallelism, especially in Jax, is almost trivial.

12 days ago

zozbot234

> Specifically, the sLSTM has still recurrence (memory mixing) in it, i.e. you cannot fully parallelize the computation.

If you mean that you cannot fully parallelize inference, this might be true but also not quite relevant since the computational demands of inference are low. And you can always "parallelize" training to some extent, just by training larger batches.

11 days ago

korbip

That was formulated a bit unclearly. It is not possible to parallelize over the sequence dimension during training the way it is for Transformers. In the batch dimension you can always do it.

11 days ago

KhoomeiK

For those who don't know, the senior author on this paper (Sepp Hochreiter) was the first author on the original paper with Schmidhuber introducing LSTMs in 1997.

12 days ago

ramraj07

At least in biology, the first author of a paper is more often than not just a pair of gifted hands who did the experiments and plotted the graphs. It doesn't always follow that they become good PIs later (though these papers get them their chances).

12 days ago

cdavid

In ML, the author list is generally ordered from largest contribution to smallest, with the heads of the lab last.

11 days ago

querez

In this specific case, it's fairly well known that Hochreiter was the major brain behind the original LSTM.

11 days ago

[deleted]
12 days ago

WithinReason

I like the color coded equations, I wish they would become a thing. We have syntax highlighting for programming languages, it's time we have it for math too.

12 days ago

imjonse

Math notation has different fonts, with similar goals as syntax highlighting. It also works well in black and white :)

12 days ago

aeonik

Obligatory link to BetterExplained color coded math equations.

https://betterexplained.com/articles/colorized-math-equation...

10 days ago

GistNoesis

Can someone explain the economics behind this ?

The claim is of something that will replace the transformer, a technology powering a good chunk of AI companies.

The paper's authors seem to be from either a public university or Sepp Hochreiter's private company/lab, nx-ai.com: https://www.nx-ai.com/en/xlstm

Where is the code? What is the license? How are they earning money? Why publish their secret recipe? Will they not be replicated? How will the rewards be commensurate with the value their algorithm brings? Who will get money from this new technology?

12 days ago

imjonse

Should all arxiv papers be backed by economic considerations or business plans?

12 days ago

AIsore

Nope, they should not. It is academia, after all. How would you even do that in, say, pure mathematics? Concretely, I would love to know what the business plan/economic consideration behind Gowers' 1998 proof of Szemerédi's theorem using higher-order Fourier analysis would even look like.

11 days ago

queuebert

Coming soon to an HFT firm near you ...

11 days ago

Der_Einzige

Yes they should. Academia and peer review are so corrupt, gamified, and low-quality that I'd literally trust capitalist parasites more than the current regime of "publish or perish" and citation cartels.

At least capitalists have something to fight over that’s worth fighting for (money). Academics will bitterly fight over the dumbest, least important shit. There’s a law about how the less something matters, the more political the fights over it will be.

11 days ago

AIsore

I am certainly not going to defend peer review and its inherent flaws. I am also not sure "capitalists" or the market are always as efficient as one might hope or think. But that aside, to my point above: if capitalists were to optimize "money", as you say, how would that fix publishing?

Firstly, how would they ascribe a monetary value to Gowers' 1998 paper and the few others that catapulted him to the Fields Medal? Are you saying these subjects do not matter because no one is bidding for these papers? I fear we would not have published Heisenberg's early papers or the discovery of penicillin if so. And over what horizon would "capitalists" optimize that monetary value (internal IRR)? Governments usually have to step in for long-term IRR projects (e.g. the internet protocol's development was famously funded by DARPA, which also kept "deep learning" alive during the downturns when no one believed in short-term returns). The UK water system and quite a few train services around the world bear witness to the fact that even in a "capitalist" society, some long-term common benefits are hard to fund under short-term IRR considerations that even pension funds consider reasonable.

Taking that observation to its perverse conclusion: if you believe in "capitalists", then you could argue that the current imperfect review system is a side effect of capitalist societies' long-term research funding plan (universities, research grants, tax breaks for endowments, student grants, ...). I just think knowledge sharing is not always compatible with financial interests, and the former, to me, is the public good that academia should attain. But you get no argument from me that peer review is broken. I struggle to think, though, of a better system, and doubt "money" is it, tbh.

11 days ago

jampekka

Or any?

12 days ago

refulgentis

Are you asking how academics make money from giving away knowledge in papers? It's complicated

11 days ago

GistNoesis

I don't understand at all what the monetary value of this algorithm should be.

The authors are positioning themselves as a company and not merely academics:

A video of Sepp Hochreiter from 6 months ago hyping xLSTM:

https://youtu.be/hwIt7ezy6t8?feature=shared&t=561 in which he states his intent to raise €300M to build a European alternative to OpenAI's GPT for niche domains, thanks to this new method that will allow training cheaper and better.

He also recently (2023) received €35,000 in prize money at the 5th annual German AI Award.

https://www.jku.at/en/festival-university/media/detail/news/...

Or is it just an academic tactic to get more funding? To extract more work from PhD students by making them think they are going to strike it big?

How are they intending to build a moat if they publish their papers? Will this technology be encumbered by patents/licenses?

11 days ago

AIsore

If you are asking that question, I guess you must have wondered about this for years, right? In fact nearly a decade? I mean, why would Google have bought DeepMind while letting them publish in peer-reviewed journals for years afterwards? Same for Meta (formerly Facebook)? I think there is a well-trodden path being followed here ... and I am surprised by your surprise.

11 days ago

GistNoesis

Acquisitions like DeepMind's are usually a way to hire talent. That can make sense when the technology is new and getting a few years of lead time on what is going to be a growing market may make some financial sense.

In this specific xLSTM case, the industry has matured, they are just one among many (Mamba, S3Ms, transformer variants, ...), and they have already been sitting on it for at least 6 months. I don't see what their play is.

Another case study that's probably interesting is the authors of the Adam paper, https://arxiv.org/abs/1412.6980 (awarded "2020: The Adam optimization paper is the world's #1 most cited scientific paper of the past five years"). That's probably a few (10? 100?) billions' worth of value created. You can find the authors' bios at http://dpkingma.com/ and https://jimmylba.github.io/

I think there is a huge problem with the capture and sharing of value in the whole deep-learning industry, and academia's naivety plays a role in it: generational-shift technologies are badly rewarded, and incremental-shift technologies aren't rewarded at all.

Powerful technologies in many hands, low rewards for their creators, while the value they generate keeps going to the same pockets. That's a recipe for disaster.

It will be fun to come back in a few years and see how it has unfolded.

10 days ago

AIsore

There is a lot to unpack, but let's start with your first point. If the acquisition of DeepMind was just a talent acquisition, why continue to let them publish? Your second point: how did you get the impression that this market is "mature"? And, going back to the first point, which market do you actually mean to have matured?

Regarding value creation/capturing/sharing and academic naivety: this industry is no different from any other, nor has basic economics changed. Deep learning is an amazingly powerful new technology that has the potential to change the world. But how you make products/services out of it which we all value and pay for, and which thus provide the basis of employment, is the usual risk/reward cycle ANY business has to subject itself to. More belief in the technology = more investors willing to fund businesses that have negative free cash flow for longer.

Yes, the competitive landscape seems stacked against new entrants, but that was no different when today's tech behemoths started. And yes, as with any industry, monopolies are not great, and, according to Kara Swisher, maybe tech at large is an unhealthy monopoly today.

10 days ago

NOCompromisER

Will this technology be encumbered by patents/licenses? I guess it is most likely already patented (or very close), and you will need a license. xLSTM is not open source.

11 days ago

brookst

I think you’re making a satirical point about how commercial R&D has far outstripped academia, but it’s not 100% clear.

11 days ago

smusamashah

Can someone ELI5 this? Reading the comments, it sounds like it's going to replace the transformers that LLMs are based on? Is it something exponentially better than current tech at scale?

11 days ago

probably_wrong

LSTMs are a recurrent architecture for neural networks, meaning that your output depends both on your current input and your previous output. This is similar to how language works, as the next word in your sentence must fit both the idea you're trying to convey (your input) and the words you've said up until now (your previous output).

LSTMs were very popular for a while (I think the first good version of Google Translate used them), but they had two critical downsides: their performance went down with longer outputs, and they were a bit annoying to parallelize, because computing the output for the 10th word required first computing the output of the previous 9 words, so there was no way to use 10 parallel computers. The first problem was solved with Attention, a scaffolding method that prevented degradation over longer sequences. Eventually someone realized that Attention was doing most of the heavy lifting, built an attention-only network that could be easily parallelized (the Transformer), and LSTMs lost the top place.
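A minimal sketch of that sequential dependency (standard PyTorch; my illustration, not from any of the papers discussed):

```python
import torch

cell = torch.nn.LSTMCell(input_size=32, hidden_size=64)
xs = torch.randn(10, 1, 32)        # 10 timesteps, batch of 1
h, c = torch.zeros(1, 64), torch.zeros(1, 64)
outputs = []
for x in xs:                       # inherently serial loop over time:
    h, c = cell(x, (h, c))         # step t cannot start until step
    outputs.append(h)              # t-1 has produced (h, c)
```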

Are xLSTMs better? On paper I'd say they could be - they seem to have a solid theory and good results. Will they dethrone Transformers? My guess is no, as it wouldn't be the first time that the "better" technology ends up losing against whatever is popular. Having said that, it is entirely possible that some inherently recurrent tasks like stock price prediction could get a boost from this technology and they may find their place.

11 days ago

[deleted]
12 days ago

jasonjmcghee

They reference "a GPT-3 model with 356M parameters"

So GPT-3 Medium (from the GPT-3 paper). It feels pretty disingenuous to list it that way, since no one means that model when they say "GPT-3"; they mean the 175B one.

I wasn't aware that that size of the model (356M) was ever released. What am I missing here?

I also think it's relatively well understood that (with our current methods) transformers have a tipping point in parameter count, and I don't know of any models smaller than ~3B that are useful (arguably ~7B).

Compare these benchmarks to, say, the RWKV 5/6 paper https://arxiv.org/abs/2404.05892

11 days ago

CuriouslyC

Phi-3 mini is surprisingly capable given its size. You can teach small transformers to do stuff well, you just can't have good general-purpose small models.

11 days ago

jasonjmcghee

Totally. But they aren't fine-tuning these, afaict; they're comparing general-purpose capabilities.

11 days ago

Der_Einzige

The point still stands: Phi-3 is an excellent model and shows that good models don't need that many parameters.

You should see the work on ReFT coming out of Manning's group, showing that you can instruction-fine-tune models by modifying something like 0.00001% of the parameters. Doing it this way significantly mitigates the risk of catastrophic forgetting.

11 days ago

elygre

I have no idea about what this is, so going off topic:

The name XLSTM reminds me of the time in the late eighties when my university professor got accepted to hold a presentation on WOM: write-only memory.

12 days ago

woadwarrior01

I think it's a fine name. The prefix ensures that people don't confuse it with vanilla LSTMs. Also, I'm fairly certain that they must've considered LSTM++ and LSTM-XL.

12 days ago

pquki4

I mean, if you look at it another way, XSLT is a real thing that gets used a lot, so I don't mind appending an M to it.

12 days ago

sigmoid10

Another week, another paper that thinks it can revive recurrent networks. Although this time the father of the LSTM is a co-author, so this paper should not come as a surprise. Sadly, the results seem to indicate that even by employing literally all the tricks of the trade, their architecture can't beat the throughput of FlashAttention (not by a long shot, but that is not surprising for recurrent designs) and, on top of that, it is even slower than Mamba, which offers similar accuracy at lower cost. So my money is on this being another DOA architecture, like all the others we've seen this year already.

12 days ago

l33tman

To put another perspective on this: lots of modern advancements in both ML/AI and especially computer graphics have come from ideas from the 70s-80s that were published, forgotten, and revived, because underlying dependencies change, like the hardware profile of the day. So just let the ideas flow; not every paper has to have an immediate payoff.

12 days ago

KeplerBoy

To be fair, Hochreiter seems pretty confident that this will be a success.

He stated in interviews "Wir werden das blöde GPT einfach wegkicken" (roughly: "We will simply kick silly GPT off the pitch"), and he just founded a company to secure funding. Interesting times.

Someone gathered most of the available information here: https://github.com/AI-Guru/xlstm-resources

12 days ago

imjonse

With all due respect for his academic accomplishments, confidence in this domain in the current climate is usually a signal towards potential investors; it can be backed by anything between solid work (as I hope this turns out to be) and a flashy slide deck combined with a questionable character.

12 days ago

KeplerBoy

Which is a legitimate stance.

Being a researcher at a public university in a country that doesn't exactly splurge on this kind of research, he has to get creative to get any meaningful amount of funding.

12 days ago

l33tman

To say the least. It's a bit unfortunate that there is about zero culture in the EU regarding moonshot projects compared to Silicon Valley. I've tried a couple of times to get money from government grants for (yet another...) foundational AI model, neuroscience-inspired, but the grants seem to go almost exclusively to well-developed industrial companies that now want some free money to "leverage" ChatGPT in their existing internal processes. And since this is still in the research phase, the more risk-averse VCs here are not touching stuff like it either.

So I guess what's left is making these grand proclamations that you are going to "knock the crown off OpenAI", etc. Though some sort of vision is good to have, for sure :)

12 days ago

karalala

Already seeing major flaws in the paper.

The benchmarking done in Table 1 is extremely questionable. Their table basically contradicts the results of multiple peer-reviewed papers, especially for RNNs, which report results much closer to baseline transformers (and conducted much larger experiments, btw).

On page 40 they mention that all models are trained with the same learning rate for comparability. That contradicts their own scaling-laws table, which uses different learning rates for different models.

And no, it is not a fair comparison to test all these different models with the same learning rate. The benchmarking results just look like they use hyperparameters tuned for their own model that happen not to work for the others.

12 days ago

[deleted]
11 days ago

bingbingbing777

You should publish a response paper and get them to retract their paper if it has major flaws.

12 days ago

karalala

It's xLSTM contradicting existing peer-reviewed papers lmao. Either xLSTM should fix their benchmarks or the existing peer-reviewed papers should retract.

RWKV-v6 > RWKV-v5 > RWKV-v4, not the other way round, obviously. HGRN 8 ppl worse than baseline transformers? That was a NeurIPS 2023 spotlight paper, btw.

12 days ago

AIsore

Are you saying this is obvious because people have published the exact same benchmarks, 100% comparable, in journals? If so, where are they? I have seen quite a few published benchmarks that could not quite be reproduced, tbh. So, again, what makes this "obvious" to you?

11 days ago

logicchains

I thought it was common knowledge that architecture comparisons in papers aren't worth the paper they're printed on; there are so many ways to deliberately or accidentally structure things to favour one architecture over the others. Ultimately the lmsys chatbot arena will be the final judge.

12 days ago

karalala

True, but they normally aren't this far off. HGRN claims to outperform the transformer for a 1B-parameter model trained on The Pile. HGRN performing 8 ppl worse suggests that it's useless.

12 days ago

AIsore

In my experience, many are far off, and most of the time the published tables of different papers are hard to compare. If you assert here that these results are flawed, I would like to see more substance (code, reproductions, ...). And for balance, for the same reason, it's hard to verify the accuracy of these results without further insight.

11 days ago

logicchains

So many papers play tricks with the learning rate schedule: https://arxiv.org/abs/2307.06440

11 days ago

rrr_oh_man

Could you explain for a dum-dum?

12 days ago

karalala

The results of xLSTM are promising but will need larger-scale experiments.

However, they completely messed up the benchmarking experiments for various RNN models, which in their own papers claim comparable or even better performance than the base transformer.

12 days ago

AIsore

These experiments seem pretty large already though, no? How are you so sure they messed up benchmarking? Is the code out already?

11 days ago

beAbU

I thought this was some extension or enhancement to XSLT.

12 days ago

cylemons

Same

11 days ago