Binary vector search is better than FP32 vectors

137 points
a month ago
by gaocegege

Comments


refulgentis

Dear reader: your intuition is right: it's not "better" to reduce the information by a factor of 32.

The game the article plays is adding a KNN re-ranking pass to deal with the fact that binarization flattens similarity significantly.

TL;DR from the field:

- this is extremely helpful for doing a first pass over a _ton_ of documents in a resource constrained environment.

- it is extremely unhelpful unless you're retrieving ~10x the documents you want via the binary vectors, then re-ranking that candidate set with the FP32 vectors (a minimal sketch follows after this list).

- in general, it's unlikely you need the technique unless you're A) on the edge, i.e. on consumer devices from 3 years ago, or B) you have tens of millions of vectors on a server. All this stuff sounds really fancy, but when you implement it from scratch, you quickly learn "oh, it's 384 numbers I gotta multiply together"
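
To make the two-pass idea concrete, here is a minimal numpy sketch of it (not the article's code; the oversampling factor, cosine re-ranking, and array shapes are illustrative assumptions):

    import numpy as np

    def binarize(embs: np.ndarray) -> np.ndarray:
        # Keep only the sign of each dimension and pack 8 dimensions per byte.
        return np.packbits(embs > 0, axis=1)

    def hamming(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
        # Hamming distance between one packed query and every packed database vector.
        return np.unpackbits(query_bits ^ db_bits, axis=1).sum(axis=1)

    def search(query, db_fp32, db_bits, k=10, oversample=10):
        # First pass: cheap Hamming scan over the binary index, keep k * oversample candidates.
        q_bits = binarize(query[None, :])
        candidates = np.argsort(hamming(q_bits, db_bits))[: k * oversample]
        # Second pass: exact re-rank of the small candidate set with the FP32 vectors (cosine).
        cand = db_fp32[candidates]
        scores = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query) + 1e-9)
        return candidates[np.argsort(-scores)[:k]]

The binary index is 32x smaller, and only the k * oversample survivors ever touch the FP32 vectors.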

Source: I do embeddings, locally, to do retrieval for RAG. I "discovered" this about a year ago, and it deeply pains me to see anything that will misinform a lot of people.

Free bonus I haven't seen mentioned elsewhere yet*: you can take the average of the N embeddings forming a document to figure out whether you should look at the N embeddings individually. This does oversmooth, though; e.g. my original test document was the GPT-4 Sparks paper, and the variety of subjects mentioned and its length (100+ pages) meant the average was oversmoothed when searching for the particular example I wanted it to retrieve (the unicorn SVG)

* edited to clarify given reply. also my reaction to reading it, "that dude rocks!!", made me wanna go off a little bit: if you're not in AI/ML, don't be intimidated by it when it's wrapped in layers of obscure vocabulary. once you have the time to go do it, things you would have thought were "stupid" turn out to be just fine. and you find like-minded souls. It's exhilarating
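
A minimal sketch of the averaging gate described above, assuming cosine similarity and a per-document array of chunk embeddings (the threshold is an illustrative knob, not a recommendation):

    import numpy as np

    def should_descend(query: np.ndarray, chunk_embeddings: np.ndarray, threshold: float = 0.3) -> bool:
        # Compare the query against the mean of the document's chunk embeddings;
        # only score the individual chunks when the document-level average looks relevant.
        # (As noted above, long multi-topic documents can oversmooth this average.)
        doc_vec = chunk_embeddings.mean(axis=0)
        sim = doc_vec @ query / (np.linalg.norm(doc_vec) * np.linalg.norm(query) + 1e-9)
        return bool(sim >= threshold)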

a month ago

gwern

> Free bonus I haven't seen mentioned elsewhere yet: you can take the average of the N embeddings forming a document to figure out whether you should look at the N embeddings individually. This does oversmooth, though; e.g. my original test document was the GPT-4 Sparks paper, and the variety of subjects mentioned and its length (100+ pages) meant the average was oversmoothed when searching for the particular example I wanted it to retrieve (the unicorn SVG)

This is something I've been pondering for doing lookups across essays!

That is, I saw some people doing per-paragraph embeddings, so you could do something like mouse-select a paragraph in an essay, and then get a list of all other paragraphs site-wide which are 'similar'. However, it seems to me like this would run into trouble with small or generic paragraphs (because their embeddings will be noisy & extreme & there's a lot of them and you may wind up pulling up only sibling paragraphs, defeating the point)...

But what if you did it as a tree? Embed each paragraph, then average a few paragraph embeddings into a section embedding, then average the section embeddings, and so on up to the whole-essay level. Then do the lookups as usual. Any given paragraph might instead pull up a genuinely relevant section, or a range of paragraphs, or a whole essay, rather than some spurious paragraph.

(And you could constrain the lookup to be tree-like too: first find the most similar essay-level embedding, then see if any of that essay's section embeddings are more relevant, then check its paragraph embeddings, etc., and stop when the distance stops decreasing compared to the parent node. This would let you enforce diversity of results: e.g. take only 1 hit per essay, so you don't wind up showing the reader 10 hits which are all on the same page.)
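
Roughly what that constrained lookup could look like, as a sketch (the node structure and greedy stopping rule here are one reading of the proposal, not an existing library):

    import numpy as np
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        embedding: np.ndarray                          # average of the embeddings below this node
        children: list = field(default_factory=list)   # empty for leaf paragraphs
        label: str = ""                                # essay / section / paragraph identifier

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def descend(query: np.ndarray, root: Node) -> Node:
        # Walk down the document tree, moving to a child only while similarity keeps improving;
        # stop at the essay, section, or paragraph whose embedding best matches the query.
        best = root
        while best.children:
            child = max(best.children, key=lambda c: cosine(query, c.embedding))
            if cosine(query, child.embedding) <= cosine(query, best.embedding):
                break  # no child is closer than the current node, so return the coarser unit
            best = child
        return best

Building the tree is just averaging child embeddings upward, and returning whatever node the walk stops at naturally gives one hit per essay.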

Does that correspond to anything? https://en.wikipedia.org/wiki/Hierarchical_Navigable_Small_W... seems tree-like, but the trees used in it do not seem to correspond to the actual document structure like I'm proposing, and it is solely an optimization on pre-existing embeddings, which mostly ignore any document structure.

a month ago

sroussey

I’m doing some RAG work with document trees, and various ways to do embeddings. I’ll write when I have some data.

a month ago

gaocegege

Hi. Thanks for your comment! I wrote the post from the perspective of someone working with databases: in many cases there may not be that many vectors involved, and adding a reranking step lets you significantly reduce the amount of memory needed for the index.

(Author of the post here.)

a month ago

moralestapia

>I'll throw in a free bonus no one has mentioned yet, at least publicly: you can take the average of the N embeddings forming a document to figure out if you should look at the N embeddings individually.

Hey, I do this! :D

But I've seen it mentioned in several places, it wasn't completely my own idea.

a month ago

VorticonCmdr

This reminds me a lot of simhashing, e.g. https://matpalm.com/resemblance/simhash/
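
For readers who haven't seen it, simhash applied to vectors boils down to random-hyperplane hashing; a tiny sketch (the 64-bit signature and 384-d vectors are illustrative sizes):

    import numpy as np

    rng = np.random.default_rng(0)
    planes = rng.standard_normal((64, 384))  # one random hyperplane per signature bit

    def signature(vec: np.ndarray) -> np.ndarray:
        # Bit i records which side of hyperplane i the vector falls on.
        return (planes @ vec > 0).astype(np.uint8)

    a, b = rng.standard_normal(384), rng.standard_normal(384)
    # Hamming distance between signatures approximates the angle between the originals.
    print(int(np.sum(signature(a) != signature(b))))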

a month ago

dleeftink

Binning and averaging is a well-known technique I'd say, whether applied to PCA components or even raw distributional data sorted over some dimension.

A related technique that I've seen speed up retrieval tasks immensely is merging frequently co-occurring binned embeddings into new bins (similar to Byte Pair Encoding) and downweighting the resulting bins when the variance between their constituent dimensions is high.

This effectively yields 'bins of interest' where documents agree on some dimension(s) while filtering out documents that contain information that is general to most.

a month ago

nblgbg

I agree with the comment. Especially if you index these quantized vectors with HNSW, you will get even less precision. Retrieving a lot more candidates and then doing a kNN re-rank with the FP32 vectors is the way to go!

a month ago

sroussey

BTW: with the average-of-embeddings part… couldn't an LLM do that for a large context? Say the context is book-sized: instead of attending over each word, attend over each paragraph?

a month ago

refulgentis

#1. The biggest constraint for me is that I do embeddings locally. It's absolutely _bonkers_ that these "vector DBs" are storing a ton of private content. But it's _barely_ possible: it's about 30 ms to get an embedding for 250 words on an iPhone from 2 years ago. So once there's a ton of downloaded URLs to choose from (say, 100), it's a fantastic way to filter between "oh, these are from your query about quiche, and these are from your query about LA Lakers stats" and still get an answer in under 10 seconds.

#2. Real talk? I hate to assert something that sounds so opinionated, but for brevity's sake:

The long-context stuff is miserable: more hype, and a somewhat decent scratchpad, than genuinely useful. I have a feeling the next GPT will be much better; Claude 3 is a qualitative jump from both Gemini 1.5's "1 million" tokens and GPT's 128K. Seems to me the long-context stuff was a technical achievement layered on top of models trained on 4K-token materials.

Whenever I see people asserting long context will kick RAG's ass, I can tell they're either "arguing eventually, on a long enough timeline" or they don't have a practical understanding of how it's working. E.g. one of my final user stories was translating the entirety of Moby Dick into over-the-top Zoomer. You simply can't get it done with chunks larger than 5K tokens, with any model.

It's pretty darn expensive to just throw tokens at it in any context outside using the cheapest model for extraction. Like, the latest round of GPT-4 price cuts finally gets answers my anesthesiologist friend loves underneath $0.15, and that's just 4K input tokens, about 12 pages.

Random anecdote, re: the irony that Claude's qualitative leap in using long context causes novel issues:

It's kinda funny because it's actually caused a nasty problem for me: conversation flow used to look something like:

1. USER: question.

2. (programmatic, invisible) USER: Please generate 3 search queries to help me answer $question.

3. (invisible) AI: $3_queries.

4. (do all the humdrum RAG stuff to make #5).

5. (programmatic, invisible) USER: Here are N documents I looked up, use them to answer $question.

6. AI: $answer.

Claude is the first model that notices I don't inject $3_queries back into the question before $answer. So it thinks it still has to answer #2 before it answers #5, and $answer starts with 3 search queries.

a month ago

bigfudge

Why do 5 as an extension of the original chat? Just start a clean context. I actually wish this were something lmql added: the ability to start a clean context but still refer to previous completions by name. I have a hacky implementation of this, but it would be great if it were built into guidance or lmql.

a month ago

sgt101

>Free bonus I haven't seen mentioned elsewhere yet*

This is a good idea. Is it a little like a version of HNSW?

a month ago

heipei

To the people dismissing the idea of binarising vectors: fair criticism, but consider that you can also train a model with a loss function that approaches binary behaviour, i.e. so that the magnitude of each dimension plays an insignificant role and only its sign carries information. In that case you can use the binary vector for search and ranking.
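
One simple way to nudge training in that direction (a sketch of the general idea, not a specific published recipe): add a penalty pushing every dimension's magnitude toward a constant, so that sign(embedding) keeps nearly all of the information.

    import torch

    def sign_friendly_loss(task_loss: torch.Tensor, emb: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
        # Auxiliary penalty pulling each dimension's magnitude toward 1, so that after
        # training only the sign of each dimension matters for search and ranking.
        magnitude_penalty = ((emb.abs() - 1.0) ** 2).mean()
        return task_loss + lam * magnitude_penalty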

a month ago

mjburgess

in practice, learning_rate * grad(L) is mostly about the direction anyway.

a month ago

niederman

But the direction is not at all binary, it's a direction in a very high-dimensional weight space.

a month ago

bjourne

But the vectors discussed can represent 2^3072 directions. It's close enough.

a month ago

barefeg

This technique had a very recent resurgence via https://txt.cohere.com/int8-binary-embeddings/. Hugging Face also covered the technique here: https://huggingface.co/blog/embedding-quantization. It seems like a very good tradeoff compared to shorter embeddings, which require fine-tuning via the Matryoshka technique. On the other hand, Nils Reimers suggests that trivial quantization of the full-precision embeddings is not as good as using "compression-friendly" embeddings like Cohere's Embed V3. Does anyone know what the difference in precision is between trivial quantization and optimized embeddings?
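
For reference, the Hugging Face post linked above shows the post-hoc ("trivial") quantization path with sentence-transformers, roughly like this (model name taken from that post; treat the exact import path and precision options as assumptions if your version differs):

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.quantization import quantize_embeddings

    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")  # example model from the post
    embeddings = model.encode(["How do binary embeddings work?"])
    binary_embeddings = quantize_embeddings(embeddings, precision="binary")  # other precisions, e.g. "int8", are also supported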

a month ago

crucio

How well would this work for 762-dimensional vectors? At 3072, they're starting with such a high number of dimensions that the accuracy loss may not be representative of what others would see.

a month ago

esafak

You'd have to look at the precision-recall curves for your data set and make the trade-off. There are studies on this topic.

a month ago

mediaman

Generally, it seems that people start to see more problems when binarizing vectors of fewer than 1,000 dimensions.

a month ago

blublubdub

Seems to work pretty well; check this out: https://huggingface.co/blog/embedding-quantization

a month ago

dicey

This reminds me somewhat of the iSAX papers from ~2010 [0], which were focused on time series but used a pretty cool method to binarize/discretize the real-valued data and do search. I wonder how folks building things like FAISS or vector DBs incorporate ideas like this, or if the two worlds don't overlap very often.

[0]. https://www.cs.ucr.edu/~eamonn/iSAX_2.0.pdf

a month ago

kookamamie

> our experiments showed that the decrease in accuracy was not as big as expected

A decrease, nevertheless. This is always the issue with pruning, sparsification, etc.: to get SOTA, you do not want a decrease of any kind.

a month ago

gaocegege

In certain situations, we might have to make a trade-off between performance and cost. There are many scenarios where a high level of recall is not necessary.

a month ago

jn2clark

What is accuracy in this case? Is it meant to be recall, or some other evaluation metric?

a month ago

gaocegege

Yeah, it is recall.

a month ago

rurban

April 1 jokes already? Storing floats as bits, great idea!

a month ago

marginalia_nu

I think the technique is best understood as a special case of random projection.

It's well understood that there is a class of dimensionality-reduction techniques that preserve distances well (via the Johnson-Lindenstrauss lemma), owing to the counterintuitive fact that most high-dimensional vectors are nearly orthogonal.
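
A quick numerical illustration of that near-orthogonality, using the 3072-dimensional case discussed above:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 3072
    a = rng.standard_normal((1000, d))
    b = rng.standard_normal((1000, d))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    # Cosine similarities of random pairs concentrate tightly around 0 (std ~ 1/sqrt(d) ~ 0.018).
    print(round(float(cos.mean()), 4), round(float(cos.std()), 4))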

a month ago

riku_iki

Bit embeddings are different from float embeddings; they are trained separately.

a month ago