LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale (2022)

135 points
10 months ago
by tosh

Comments


albertzeyer

See also the blog post. https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...

What I personally found most interesting about this work is the emergent sparse behavior that starts around 6B parameters but not before. That suggests the model is operating in a different mode at that size or larger.
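For anyone curious what that looks like mechanically, here is a rough numpy sketch of the mixed-precision matmul the paper builds around those outlier features: hidden-state columns whose magnitude crosses a threshold stay in full precision, everything else goes through an int8 matmul with vector-wise absmax scaling. The threshold and scaling details are simplified from the paper's description, so treat this as an illustration, not the reference implementation.

```python
import numpy as np

def int8_matmul_with_outliers(X, W, threshold=6.0):
    """Simplified sketch of an LLM.int8()-style mixed-precision matmul.

    X: (tokens, d_in) hidden states, W: (d_in, d_out) weights.
    Columns of X whose max |value| exceeds `threshold` are treated as
    outlier features and multiplied in full precision; the rest are
    quantized to int8 with absmax scaling and multiplied in int8.
    """
    outlier_cols = np.max(np.abs(X), axis=0) > threshold
    regular_cols = ~outlier_cols

    # Full-precision path for the (few) outlier feature dimensions.
    out_fp = X[:, outlier_cols] @ W[outlier_cols, :]

    # Int8 path: row-wise absmax scaling for X, column-wise for W.
    Xr, Wr = X[:, regular_cols], W[regular_cols, :]
    sx = np.max(np.abs(Xr), axis=1, keepdims=True) / 127.0 + 1e-12
    sw = np.max(np.abs(Wr), axis=0, keepdims=True) / 127.0 + 1e-12
    Xq = np.round(Xr / sx).astype(np.int8)
    Wq = np.round(Wr / sw).astype(np.int8)
    out_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw

    return out_fp + out_int8

# Toy check against the full-precision result.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)).astype(np.float32)
X[:, 3] *= 20.0                      # fake an "outlier feature" column
W = rng.normal(size=(64, 16)).astype(np.float32)
print(np.max(np.abs(int8_matmul_with_outliers(X, W) - X @ W)))
```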

10 months ago

heyitsguay

Sounds like the idea is that "outlier features" effectively give the model an inhibitory system for other features, and transformer models converge to consistent outlier channel selection per layer at around the 6.7B param mark?

It's interesting, biological neural networks are known to use inhibitory systems for signal gating and sharpening/sparsification, but it's typically handled by distinct neuron types that are strictly inhibitory.

I wonder if one could boost smaller transformer performance with an analogous construction: a small additional module dedicated to channel inhibition for the rest of the model.
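Purely as a thought experiment (this is not from the paper, and the module name and bottleneck size are made up), such an inhibition add-on could be as simple as a learned per-channel gate:

```python
import torch
import torch.nn as nn

class ChannelInhibition(nn.Module):
    """Hypothetical 'inhibitory' gate: a small bottleneck predicts a
    per-channel suppression factor in (0, 1) that multiplies the hidden
    states, letting the model damp or sparsify weak channels."""
    def __init__(self, d_model: int, bottleneck: int = 32):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(d_model, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, d_model),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h * torch.sigmoid(self.gate(h))  # gate near 0 inhibits a channel

h = torch.randn(2, 10, 512)            # (batch, seq, d_model)
print(ChannelInhibition(512)(h).shape)  # torch.Size([2, 10, 512])
```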

10 months ago

brucethemoose2

A continuation of that trend (that, for instance, a 65B model would quantize better than a 33B one) was a popular belief, but the k-quants dev recently found that LLaMA 7B-65B all quantize very similarly: https://github.com/ggerganov/llama.cpp/pull/1684

10 months ago

sroussey

That graph really puts it all into perspective too.

10 months ago

FL33TW00D

For a more digestible introduction, see the companion blog post: https://huggingface.co/blog/hf-bitsandbytes-integration
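The short version of that integration, assuming a recent transformers + bitsandbytes install and a GPU (the exact API may have shifted in newer versions, and the model name is just an example):

```python
# Loading a model with the LLM.int8() kernels via the transformers +
# bitsandbytes integration described in the blog post.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-1b7"  # any supported causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",    # spread layers across available GPUs/CPU
    load_in_8bit=True,    # store weights in int8, run matmuls with LLM.int8()
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```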

10 months ago

xrd

This is really interesting.

Can someone help me understand why this is important:

  This means that in BF16 we can retain the same dynamic range as FP32. But we lose 3 bits of precision with respect to FP16. Now there is absolutely no problem with huge numbers, but the precision is worse than FP16 here.
Why is it important with LLMs and transformers that you can keep track of very big numbers and very small numbers? It feels like you should be able to deal with only big numbers or only tiny numbers when running the models? Why does precision matter?

If I've used anything incorrectly here, please correct me.

10 months ago

Lerc

I can see areas where being able to adjust weights to a wide range of sensitivities is important.

Take the original transformer diagram for example. https://i0.wp.com/bdtechtalks.com/wp-content/uploads/2020/06...

See the arrows coming in from the sides of the "Add & Norm" boxes. These come from bypassing the preceding layer, which gives back-propagation an easier path for adjusting weights earlier in the model. The two inputs are added, so if the 'Add & Norm' were equally sensitive to both, a strong bypass signal would swamp the information from the path with more processing.

The bypass lets the easy early training signal travel back through the bypass channels, but as the network develops more nuance, the longer path provides additional accuracy. As that improves, the strength of the bypass signal can be reduced; it still remains part of the overall signal, but probably ends up far less significant than the longer path.

A larger range of weights would allow this combination of shouting and whispers.
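For reference, a minimal PyTorch rendering of that "Add & Norm" step (post-norm, as in the original diagram): the bypass (residual) signal and the sublayer output are simply summed before normalization, so their relative scale matters.

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """The 'Add & Norm' block from the diagram: the sublayer output is
    added to the bypass (residual) signal, then layer-normalized."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        return self.norm(x + sublayer(x))  # bypass + processed path share a scale

block = AddAndNorm(512)
x = torch.randn(2, 10, 512)
print(block(x, nn.Linear(512, 512)).shape)  # torch.Size([2, 10, 512])
```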

note: not an expert, grain of salt etc.

10 months ago

colejohnson66

I don't think they're claiming the precision loss is a bad thing; they're just pointing out that with BF16, in exchange for more "dynamic range" (an eight-bit exponent instead of five), you lose some precision. After all, BF16 is just FP32 with a truncated significand.
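A quick way to see both sides of that trade-off, if you have PyTorch handy (exact printed values may vary slightly):

```python
import torch

# Dynamic range: 70000 overflows FP16 (max ~65504) but fits in BF16,
# because BF16 keeps FP32's 8-bit exponent.
print(torch.tensor(70000.0, dtype=torch.float16))   # inf
print(torch.tensor(70000.0, dtype=torch.bfloat16))  # ~70144 (rounded, but finite)

# Precision: BF16 has only 8 significand bits (vs 11 for FP16),
# so small relative differences are rounded away.
print(torch.tensor(1.001, dtype=torch.float16))     # ~1.0010
print(torch.tensor(1.001, dtype=torch.bfloat16))    # 1.0000
```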

10 months ago

xrd

Yes, that much I do understand. I'm just confused why that is important when predicting the next token? What about the calculation requires that you keep track of really big numbers (and don't overflow) and requires that you have very precise numbers (and don't truncate the precision)?

10 months ago

xeonmc

Because AI is an alchemical art involving heuristically selecting arcane “activation functions” without critically thinking about precision distribution or symmetry, so they rely on band-aiding over it via altering number formats rather than properly matching the function to the appropriate numeric domain.

10 months ago

[deleted]
10 months ago

GaggiX

https://arxiv.org/abs/2306.03078

From the same author and more recent.

10 months ago

homarp

posted here https://news.ycombinator.com/item?id=36216126 but no traction

The paper is entitled "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression" and https://twitter.com/Tim_Dettmers/status/1666076553665744896 is a nice summary

"SpQR allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM on a single 24GB GPU fully lossless. SpQR works by isolating sensitive weights with higher precision and roughly doubles improvements from GPTQ"

Code here: https://github.com/Vahe1994/SpQR (https://news.ycombinator.com/item?id=36219128 but no traction )
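Very loosely, the "isolate sensitive weights at higher precision" idea looks like the sketch below. This is an illustration only: it uses plain magnitude as the sensitivity measure and a single absmax scale, whereas the actual SpQR algorithm uses group-wise quantization and an error-based sensitivity metric.

```python
import numpy as np

def quantize_with_outlier_weights(W, bits=4, outlier_frac=0.01):
    """Toy illustration (not the real SpQR algorithm): most weights are
    quantized to `bits` bits with absmax scaling, while a small fraction
    of large-magnitude weights is kept exactly, as a sparse
    high-precision correction."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    cutoff = np.quantile(np.abs(W), 1.0 - outlier_frac)
    outliers = np.abs(W) >= cutoff                  # "sensitive" weights

    scale = np.max(np.abs(np.where(outliers, 0.0, W))) / qmax + 1e-12
    Wq = np.clip(np.round(W / scale), -qmax, qmax)  # low-bit part
    W_hat = Wq * scale
    W_hat[outliers] = W[outliers]                   # sparse full-precision part
    return W_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
print(np.mean((quantize_with_outlier_weights(W) - W) ** 2))  # reconstruction error
```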

10 months ago

version_five

Is this what is used in e.g. ggml/llama.cpp, which is the most prominent quantized inference program I'm aware of? If not, would it result in a material difference vs whatever it's using now (which I thought was 4-8 bit float numbers)? Anybody know?

10 months ago

brucethemoose2

In the wild, people tend to use GPTQ quantization for pure GPU inference: https://github.com/PanQiWei/AutoGPTQ

And ggml's quant for CPU inference with some offload, which just got updated to a more GPTQ-like method days ago: https://github.com/ggerganov/llama.cpp/pull/1684

Some other runtimes like Apache TVM also have their own quant implementations: https://github.com/mlc-ai/mlc-llm

For training, 4-bit bitsandbytes is SOTA, as far as I know: https://huggingface.co/blog/4bit-transformers-bitsandbytes
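For completeness, loading a model in 4-bit through that integration looks roughly like this (parameter names as in the linked blog post; the model name is just an example, and the API may have changed in newer versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",                     # example; any supported causal LM
    device_map="auto",
    quantization_config=bnb_config,
)
```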

TBH I'm not sure why this November paper on 8 bit is being linked now. Few are running 8 bit model inference when they could fit a better 3-5 bit model in the same memory pool.

10 months ago

weinzierl

What does it cost to train a 6.7B transformer from scratch? (Not considering any data preparation, because that would be highly variable.) Is this realistically possible for mere mortals? How long until it becomes a national pastime?

10 months ago

make3

fine-tuning is cheap, pre-training is expensive & hard

10 months ago

weinzierl

Sure, but fine-tuning has limits.

10 months ago

make3

no, you never should pre-train your own LLM unless you have 100k$+ to spare. You should only fine-tune. There is no reason you can't just fine-tune with whatever data you have

10 months ago

weinzierl

I have a huge company internal dataset with domain specific knowledge. What you are saying is that I can just fine-tune an existing model with that data and be fine?

That was exactly our initial idea, but from everything I learnt while trying it, this is a dead-end approach. From my understanding, the consensus seems to be that fine-tuning works well to alter or restrict behaviour but very badly to teach additional knowledge. You can fine-tune a generic base model into a generic chatbot, but not into a domain expert.

That also seems to be the reason people still use vector databases for large domain-knowledge datasets. I'm aware that the vector-database approach has different pros and cons, but if fine-tuning on the whole content were possible, we would certainly use it in addition to that.

I'm not an expert, so I'd appreciate any comments, hints, pointers and corrections if I'm mistaken in my understanding.

And my original question still stands: $100k is not a lot for a company; surely it must be more than that?

10 months ago

make3

Pre-training and fine-tuning use the exact same method of next-token prediction; the difference is the quantity of data you have (and whether the model is already pre-trained). You need to train the model on 1 trillion tokens (https://platform.openai.com/tokenizer https://github.com/google/sentencepiece) anyway for it to develop reasoning capabilities, and it seems very unlikely that your data is anywhere near that much.

I'm highly skeptical that you have enough data to pretrain if you don't have enough data to fine tune.

Fine-tuning + vector search + prompting with as much relevant material as you can, on an LLM like PaLM 2 or GPT-4, is what I would do. Otherwise you can use Falcon 40B ofc.
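The vector-search half of that is simple enough to sketch. Below, `embed` is a placeholder for whatever embedding model or API you use; it only needs to map text to a fixed-size vector.

```python
import numpy as np

def retrieve(query, documents, embed, k=3):
    """Return the k documents most similar to the query by cosine similarity."""
    doc_vecs = np.stack([embed(d) for d in documents])
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12)
    return [documents[i] for i in np.argsort(-sims)[:k]]

def build_prompt(query, documents, embed):
    """Stuff the retrieved passages into the prompt ahead of the question."""
    context = "\n\n".join(retrieve(query, documents, embed))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```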

maybe I should charge for this ahah

9 months ago

weinzierl

The data is not the problem. I could train on any of the public datasets combined with my own data. And here comes my point:

The result I'd get by training on that combined dataset from scratch cannot be achieved more cheaply by taking a model that was already pre-trained on the huge generic dataset and doing whatever additional training on the large domain-specific dataset.

From what I understand:

If you fine-tune only on the domain-specific data, then by the time the model has trained enough to pick up that knowledge, it will already have forgotten most of its generic knowledge.

If you train on the combined dataset, it will take as many epochs for the domain knowledge to shine through as it took to train the original model, so it would cost the same as training from scratch.

You need enough tokens, but you also need to train the model on them the right amount. Too much or too little training is bad; the base models we have are trained right at the sweet spot, and they tolerate a little fine-tuning to adapt their behaviour, but not nearly enough to teach them new facts.

I'm not an expert, so I could be wrong. What makes me a bit confident is that I have not yet found a single project that reports having used the approach you suggest successfully.

9 months ago

make3

I'm an expert, mix your data with 50% random data. just do it.

9 months ago

villgax

Stop being cowards nvidia & enable FP8 emulated or otherwise on all arch

10 months ago