Training LLMs to Reason in a Continuous Latent Space
Comments
ttul
I've been looking into using the last hidden layer of an off-the-shelf LLM to help my company with a classification task. The last hidden layer is obviously super rich in semantic information because it has to somehow tell the next layer how to generate the next token prediction. That final layer, in some respects, is discarding valuable context information that the last hidden layer encodes.
I am not surprised at all that Meta was able to generate some positive returns by feeding the last hidden layer back into the model auto-regressively.
The method of training they describe in the paper is really cool. Summarized in Figure 2, they train it with a corpus of step-by-step text instructions and then, across multiple stages, they iteratively replace one of the textual steps with a last-hidden-layer embedding and see what the model spits out. The weights are then updated through cross-entropy loss as the additional text tokens are generated once again.
So they're basically rewinding the output, replacing an increasing number of textual steps with hidden-state embeddings, and playing it forward as the model gradually learns to do all of its step-by-step thinking using just the hidden-state data.
In a way, this might be how humans learn to think through language. Our parents teach us using words, and our brain gradually replaces the words with thoughts until we can replicate the action or solve the problem ourselves without anyone guiding us with words.
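Below is a very rough toy sketch of how I picture that staged setup, just to make the mechanism concrete. It is my own code against an off-the-shelf GPT-2, not the authors' implementation; the helper name, the loss masking, and the example data are all assumptions.
    # Toy sketch of the staged idea (not the paper's code): at stage k, the first k
    # reasoning steps are replaced by k "continuous thoughts" (fed-back hidden states),
    # and the cross-entropy loss is computed only on the remaining text tokens.
    import torch
    import torch.nn.functional as F
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def stage_step(question, remaining_text, k):
        """One optimizer step with k continuous thoughts between question and text."""
        q_ids = tok(question, return_tensors="pt").input_ids
        t_ids = tok(remaining_text, return_tensors="pt").input_ids
        embeds = model.transformer.wte(q_ids)
        # Generate k continuous thoughts by feeding the last hidden state back in
        # as the next input embedding (gradients flow through this loop).
        for _ in range(k):
            h = model(inputs_embeds=embeds, output_hidden_states=True).hidden_states[-1]
            embeds = torch.cat([embeds, h[:, -1:]], dim=1)
        # Append the remaining (still textual) steps/answer and train on those only.
        embeds = torch.cat([embeds, model.transformer.wte(t_ids)], dim=1)
        logits = model(inputs_embeds=embeds).logits
        n = t_ids.size(1)
        loss = F.cross_entropy(
            logits[:, -n - 1:-1].reshape(-1, logits.size(-1)), t_ids.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    # Stage 0 would train on the full textual chain (k=0); later stages swap in thoughts.
    print(stage_step("Q: I had 3 apples and bought 4 more.", " 3 + 4 = 7. A: 7", k=2))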
ttul
Indeed, I would not be surprised if OpenAI one day admits that the `o1` model uses the last hidden layer (or some other intermediate layer) to feed the "thought process" that you can watch as it "thinks" about the answer. I suspect that they may take the last hidden layer and feed it back into the front of the `o1` model while also feeding a separate, likely much smaller LLM that generates the "thought process" as language tokens.
In this manner, the model makes use of the rich semantic information encoded at the last hidden layer while informing the user via an extraction of that hidden layer specifically tuned to generate human-legible concepts such as, "I'm considering the impact of converting the units from kilograms to pounds," or whatever.
impossiblefork
I don't think it does, because from this paper this kind of backfeeding is apparently quite difficult to train.
I've said it before, but I think it's just something like Quiet-STaR, but simplified. They have a bunch of question answer pairs, many of which are difficult. They generate a lot of tokens from the question (let's say, 3x the length of the expected answer), summarise whatever is generated and reinforce whenever it generates the right answer.
I don't think o1 is something complicated.
sigmoid10
o1 is most likely just 4o optimized for CoT with some fine-tuning, or perhaps merely with a dedicated system prompt (which is probably the reason why they don't let you access it in the API) and enforced structured output. In fact you can recreate something very similar using 4o and the right system prompt + structured outputs.
pedrovhb
That's certainly possible, but it reminds me a bit of a similar thing I've seen in their UI that rhymes in a way that makes me think otherwise. In the code interpreter tool, you have a little preview of the "steps" it's following as it writes code. This turns out to just be the contents of the last written/streamed comment line. It's a neat UI idea I think - pretty simple and works well. I wouldn't be surprised if that's what's going on with o1 too - the thought process is structured in some way, and they take the headings or section names and just display that.
throwawaymaths
> using the last hidden layer
IIRC this is a well-supported task, called a "classification head" instead of a "language modeling head", in case anyone else wants to do this as a fine-tuning project
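For anyone curious, here is a minimal sketch of that kind of setup, assuming PyTorch and Hugging Face transformers; the base model, pooling choice, and class/variable names are illustrative, not anything from the thread.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class LastHiddenClassifier(nn.Module):
        def __init__(self, base_name="gpt2", num_labels=2):
            super().__init__()
            self.backbone = AutoModel.from_pretrained(base_name)  # body only, no LM head
            self.head = nn.Linear(self.backbone.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask):
            out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
            # Pool by taking the hidden state at the last non-padding position.
            last = attention_mask.sum(dim=1) - 1
            pooled = out.last_hidden_state[torch.arange(input_ids.size(0)), last]
            return self.head(pooled)  # classification logits instead of a vocab softmax

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token
    model = LastHiddenClassifier()
    batch = tok(["great paper", "not convinced"], return_tensors="pt", padding=True)
    print(model(batch["input_ids"], batch["attention_mask"]).shape)  # (2, num_labels)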
WiSaGaN
This is intriguing. When I learned that a lot of people do not have an inner monologue, I was fascinated by the fact that people can differ in such a seemingly fundamental way of being. Maybe those who have it just have a "tee" that pipes into words.
bongodongobob
I'm not convinced they don't. Ask them what they do when they read. That's all an inner monologue is.
542354234235
Not all people have an inner monologue when reading; having one is called subvocalization. Subvocalization is when you are basically reading to yourself inside your head, sounding out each word. It is one of the most common reasons for slow reading speed. Most people do not "need" to subvocalize and can train themselves to process the visual text directly, instead of first converting it to "auditory" information.
I found this out a few years ago and I was shocked that the way I read wasn’t universal. I have since been practicing reducing/eliminating subvocalization and I am getting better, and it allows me to increase my reading speed significantly. It also serves as an excellent example of how different our internal mental processes can be, and how completely unaware we are that there could be any other way to think than our own.
Terretta
For some who don't subvocalize, these comments are a thought grenade: injecting conscious thought about subvocalization can cause it!
Reading speed slams to a speech-paced crawl until the reader can "not think of a pink elephant" again.
Regic
Don't we all experience this from time to time? When I'm focused on solving some mathematical problems I'm not thinking in words, but in concepts. When you are thinking of words you also think of a concept; the only difference is that sometimes there are no words associated with it. In my opinion, words and sentences are just a label for the thinking process, a translation of what is really going on inside, not the driver of it.
taylorius
That's true - though I think of an inner monologue as being more "self driven". Perhaps it's just that their mental voices don't spontaneously say anything.
liuliu
BTW, people found that in-context instruction is useful for these (for example, directly using the last hidden layer to condition a diffusion model is much worse than an encoder-decoder model, but adding an instruction prefix like "try to imagine more details with the following text: <prompt>" enriches the last hidden layer vector to be superior to the encoder-decoder text features). Very interesting stuff.
ttul
It’s so funny how you can basically tickle the model and it laughs.
psb217
"...because it has to somehow tell the next layer how to generate the next token prediction." -- This isn't actually true in the case of transformers. Features in the final TF layer at time t in a sequence do not depend on the features in the final TF layer at any other time step. Recurrence in transformers is done "depthwise" via "causally masked" convolutions. Final layer features at time t can depend on penultimate layer features at time t-1, but not on final layer features at time t-1.
danielmarkbruce
you are misunderstanding what the person is saying. They are saying the final hidden layer outputs a vector which has all the information that decides the logits which decide the probabilities of each token in the entire vocabulary. Ie, it is storing a lot of information.
ttul
Correct. And although the final layer outputs a softmax of the token probabilities, the model by that point surely has a rich understanding of more than just the next token it wants to predict.
versteegen
> surely has a rich understanding of more than just the next token it wants to predict
> the last hidden layer is obviously super rich in semantic information
I don't agree that this is obvious, and think it's likely wrong (see the sibling thread [1]). The model has to at some point compress down its prediction for the entire future string of text to a prediction for a single token. There's no prior reason to assume it does this mostly in the final "LM head" linear layer, and the inputs to it don't have to predict anything other than the very next token so there's no reason it should (which is what I think psb217 was getting at), but I'm not familiar with what research has been done into it. On the other hand, processing seems to typically be concentrated in the central layers.
danielmarkbruce
The last hidden layer outputs a vector which is then used to predict the probabilities of every token in the vocabulary, by a single layer (and, in practice now in llama models, this layer is the transpose of the embedding layer).
That vector has a lot of information in it, it's not a debatable thing.
As noted above in parens, look at the llama 3.x models. The space is already shared in some sense. It's called "tied embedding".
versteegen
> That vector has a lot of information in it, it's not a debatable thing.
Encoding the next token is the minimum possible amount of information it might contain; that's not much information (the distribution over the next token is just a projection from the embedding space). E.g. it would be useless for any classification task.
danielmarkbruce
Various models in production are doing exactly that - training a layer which takes the vector out of the last hidden layer, for classification, in place of the language head. I even have one in production right now doing regression using the output of the last hidden layer....
In the case of llama 3 it's 4096 * 16 bits = 8192 bytes of information... that's like 8192 characters of ASCII. More than enough for most classification tasks... and if you just spend any time thinking about encoding the logits for a vocab of 128k... you'll come to the conclusion it's likely to require at least several hundred bytes (maybe 1000?) to do it in any way that will actually work in practice.
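As a quick sanity check of those byte counts (my arithmetic only, nothing measured from the models):
    hidden_size = 4096                  # Llama-3 hidden dimension
    print(hidden_size * 16 // 8)        # 8192 bytes for one bf16 hidden-state vector
    vocab_size = 128_000
    print(vocab_size * 16 // 8)         # 256000 bytes for raw bf16 logits over a 128k vocab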
danielmarkbruce
Yup, some tokens are effectively branching decisions. Yann has a whole rant about a shortcoming of LLMs being they take the same compute regardless of the position in a sentence - which isn't great because sometimes you really have a serious decision to make, other times not so much. It also makes you wonder about optimal embedding size - maybe the right size is 10x bigger.
ttul
Think of it like this: the final softmax layer is like being forced to pick a single word as your next prediction, while the hidden layer contains all the reasoning and understanding that led to that decision. It's similar to how a human might have a complex thought but needs to reduce it to a single word when speaking.
Many existing applications make use of hidden layers in a transformer to perform useful tasks such as classification. The concept of an “embedding” is simply the output of a hidden layer, after all.
max93
We conducted similar research earlier and successfully improved performance to a level comparable to models with 3x larger layer sizes: https://arxiv.org/html/2409.14199v3 We utilize more computational time in the latent space to achieve better performance. However, this approach introduces greater resistance compared to Chain of Thought (CoT) reasoning in the token space, especially if the number of CoT rounds in the latent space exceeds 20. I would use the term "better approximation of the data distribution" instead of "reasoning" to describe this kind of process.
cootsnuck
So perhaps could be useful for fine-tuning on smaller models?
max93
I think so. I believe this type of reasoning method, which achieves better results through longer computation time, is very useful on edge devices like mobile phones. Consider a scenario where we only need the model to output a function/action call on the phone; we don't require it to provide an immediate response.
patcon
I think of an LLM model as like a crystallised mathematical snapshot of intelligence... like a cell on a microscope slide, a dead and mounted form of output from the living process of intelligence...
This paper makes me wonder whether, in a very fuzzy sense, we could give #LLMs access to some similarly crystallised analog of emotion or emotional valence, below the level of language
HeatrayEnjoyer
Maybe "stasis" is more appropriate than "dead." Each new session is an unfrozen clone of the original mind snapshot.
threeseed
Intelligence is more than just knowing the probabilistic relationship between every word.
Rhapso
"Intelligence" is a continuous process. Without a continuous feedback loop, LLMs will never be more than a compression algorithm we bullied into being a chatbot.
OpenAi as a mega-organism might be intelligent, but the LLMs definitely are not.
The "compressed capture of semantic relationships" is a new thing we don't have a word for.
thrance
Funnily enough, there is a mathematical link between data compression and AGI [1]. I believe a paper circulated some time ago that compared gpt2 to gzip, with interesting results.
TeMPOraL
More than that, in general, understanding and compression seem to be fundamentally the same thing.
cootsnuck
I would say understanding requires compression not that it equates to it. Probably just semantics though.
3abiton
It's part of the process, given that the "bigger picture" remains in context.
winwang
Do you have strong evidence for this?
aithrowawaycomm
Dogs are highly intelligent, and it makes no sense to say that they get their intelligence by calculating the probabilities between consecutive woofs.
mitthrowaway2
Would you say with equal confidence that they don't exemplify their intelligence by their ability to repeatedly select an often-successful next action from a set of possible next actions, based on a set of input observations?
"Tokens" don't have to be words, or woofs...
aithrowawaycomm
It still doesn’t make sense for dogs. It might make some sense given a higher-level goal (hiding a toy under the bed)[1] but it doesn’t make much sense for selecting the goals (“I should hide this toy because the other dog keeps stealing it”). In building an AI dog it doesn’t work to elevate these higher-level goals into individual tokens because real dogs form goals dynamically according to their environment and the set is infinitely large. (Note that LLM agents also badly struggle with this; generating goals token-by-token means their goals have hallucinations.)
[1] It still doesn’t make much sense to view this as a statistical process; dogs can generalize far better than transformers, as perhaps best seen with seeing-eye dogs. I believe dogs’ powers of causal reasoning exceed what is possible from mere surface statistics: e.g. they innately understand object permanence as puppies, whereas transformers still don’t understand it after viewing thousands of dogs’ lifetimes of experience.
mitthrowaway2
I've not been able to find any way to distinguish "mere surface statistics" from the deeper, richer, and more meaningful kind of something that it is meant to be contrasted with, except that "surface statistics" are un-compressed. For example, surface statistics might be the set of output measurements generated by a compact process, such as the positions of planets over time; knowing the laws of gravity means we can generate gigabytes of these statistics correctly and easily, which will accurately match future observations.
But then going the other way, from statistics to a causal model, is just an inverse problem -- just like, say, going from a set of noisy magnetic field measurements at the boundary of a container to a pattern of electric current flow inside a volume, or going from planet positions to orbit shapes and periods to an inverse square law of gravity. Generating a compressed inverse model from surface statistics is exactly the sort of thing that deep learning has proven to be very good at. And by now we've seen no shortage of evidence that LLMs and other deep networks contain stateful world models, which is exactly what you'd expect, because for all their parameters, they aren't nearly big enough to contain an infinitesimal fraction of the statistics they were trained on.
So I think it's overly dismissive to regard LLMs as mere surface statistics.
threeseed
> So I think it's overly dismissive to regard LLMs as mere surface statistics.
It's literally what they are though.
Yes those probabilities embed human knowledge but that doesn't mean that the LLM itself is intelligent. It's why every LLM today fails at anything that isn't centred around rote learning.
mitthrowaway2
It's what they input and output, but it's not literally what they are. The only way to squeeze that many statistics into a compact model is to curve-fit an approximation of the generating process itself. While it fits stochastic sequences (of any type, but usually text), it's conceptually no different from any other ML model. It's no more surface statistics than a deep neural network trained for machine vision would be.
sullyj3
That only shows that word prediction isn't necessary, not that it's insufficient
ralphsebastian
makes a lot of sense.
rsrsrs86
Please spread the word that predicting the next one is not intelligence. It's Markov…
esafak
It depends on how you predict. To predict better and better you need intelligence.
threeseed
Knowledge is distinct from intelligence.
You can predict better and better with simply more knowledge i.e. data.
esafak
That only gets you as far as the frontier of knowledge. To go beyond you need intelligence.
edgyquant
Did you really just link to a post from your Twitter saying the same thing you did here?
patcon
Meh. I'm sometimes curious the different conversations that are possible in different places, I guess? One sometimes hears from different ppl, but maybe wants cross-talk
Seemed easy, and I thought harmless, tho maybe not
padolsey
Was it just me who thought that this was _already_ how LLMs worked? I'd always assumed they were -- so to speak -- swimming in their own embeddings space before coming out on the other side with language. But it turns out they're just feeding their own incremental outputs back into themselves, without a memory of the path they took to get there. Yowzer!
vrighter
There isn't any memory of how it got to where it did because all weights are evaluated all the time. It got there through the entirety of the network. There is no logic, just (mostly) a bunch of multiply-accumulates.
fabmilo
I like the direction of the research of working in latent space but feeding the last layer representation back as a first layer embedding feels sketchy to me. Those layers have different representation space.
jsenn
> Those layers have different representation space.
Do they? Interpretability techniques like the Logit Lens [1] wouldn't work if this were the case. That author found that at least for GPT-2, the network almost immediately transforms its hidden state into a "logitable" form: you can unproject the hidden state of any layer to see how that layer incrementally refines the next token prediction.
[1]: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreti...
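A rough sketch of the logit-lens idea with GPT-2, in the spirit of the linked post (the prompt and the exact layer handling are my own guesses at a minimal version):
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
        # Unproject every layer's hidden state at the final position through the
        # shared final layer norm and LM head to see which token each depth "prefers".
        for depth, h in enumerate(out.hidden_states):
            logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
            print(depth, repr(tok.decode(logits.argmax(-1))))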
zxexz
Feeding the last layer back as the input embedding has been done many times, e.g. Transformer-XL. The models are trained like this, it's not like they're taking a pre-trained Llama and just feeding it to itself. It's a simple, computationally cheap mechanism to add feedback.
empath75
I read a paper not long ago that showed that deleting, duplicating, and reordering layers doesn't actually seem to matter that much, and that feeding back is just a kind of re-ordering.
TeMPOraL
So you're saying that feeding the last layer back to the first makes the model layer-order independent, or kinda infinitely deep, if you squint? :).
torginus
Imo this kind of makes sense - LLMs without a feedback loop can learn to have one themselves by encoding information in the previously generated tokens.
imtringued
They can't, because that would increase training loss. The training loss acts as a gatekeeper for reasoning.
fabmilo
From my understanding that is what they do, see the paper:
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments.
I agree the feedback is necessary, and the mechanism simple and cheap, but I don't think it is optimal.
zxexz
Yes, they use a pre-trained model, but they do further training (please correct me if I mis-read, and also I realize my above comment could be interpreted as saying they train a new model entirely from scratch).
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments. The learning rate is set to 1 × 10⁻⁴ while the effective batch size is 128. Following Deng et al. (2024), we also reset the optimizer when the training stages switch.
mbowcut2
This was my first thought too. AFAIK each layer encodes different information, and it's not clear that the last layer would be able to communicate well with the first layer without substantial retraining.
Like in a CNN for instance, if you fed later representations back in to the first kernels they wouldn't be able to find anything meaningful because it's not the image anymore, it's some latent representation of the image that the early kernels aren't trained on.
paraschopra
The point is that training regime can force the network to immediately reshape the representation layer (after inputs) depending on whether it is a thought or language context.
liuliu
Not really. See the literature on sharing lm_head (last matrix multiplication) with the input embedding dict.
Basically, the lm_head (an MxN matrix where M is the dictionary size and N is the internal dimension) can be seen as the dictionary too. You can think of it, plus the softmax over it, as computing the cosine similarity of the last hidden output w.r.t. the input embedding dictionary.
In that sense, they are sharing the representation space.
(BTW, I believe sharing lm_head with the input embedding doesn't work as well as separating them, so only mobile-focused LLMs do so. So there is that. It would be interesting to experiment whether injecting a projection layer like you suggested would improve performance or is just a red herring.)
danielmarkbruce
llama 3.x is already sharing the last layer with the embedding layer, it just uses the transpose in the last layer operation.
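You can verify that kind of weight tying directly in a couple of lines; this sketch uses GPT-2 (which also ties its LM head to the input embedding matrix) simply because it is small and easy to poke at:
    from transformers import GPT2LMHeadModel

    m = GPT2LMHeadModel.from_pretrained("gpt2")
    # Same storage: the "unembedding" is literally the embedding matrix, applied as W_emb^T.
    print(m.lm_head.weight.data_ptr() == m.transformer.wte.weight.data_ptr())  # True
    print(m.lm_head.weight.shape)  # (vocab_size, hidden_size)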
empath75
I wonder what would happen if you just ran this on a continuous loop and only intermittently fed in new tokens or queried it for token outputs.
rsrsrs86
Well, if you consider the case of a linear regression, fitting on your output will add no new information to the weights. Try that on any notebook.
rsrsrs86
I feel abdominal pain when I see the words “thinking” or “reasoning” related to LLMs.
I feel back pain when I read the crazy, unsound speculation about how the brain is supposed to be like a computer. Serious mistake.
vidarh
Unless you can show an example of humans reasoning solving a problem outside the Turing computable set, there is no rational basis for assuming the brain is anything but a computer, as the very notion that we exceed Turing computability would be revolutionary and utterly mindbending in terms of consequences on a number of fields.
platz
there is no rational basis for assuming the brain is a "computer" in the same way an intel x86 chip is a "computer" or that the universe is a "computer". Using language in this way without defining terms like what even is a computer is folly.
vidarh
There is no rational basis for assuming it is not, as we have not a single example of a computable function outside the Turing computable set.
The term "computer" has it's original outside of "electronic computer". It used to be a role, a job function. There has been no time in human history where the only computers have been electronic computers.
But, sure, let's be more precise: Any Turing complete system is equivalent to any Turing complete computer and can reasonably be called a computer, but let's also limit it to any system that can not compute functions outside the Turing computable set. We don't know of any such systems that have been shown to compute functions outside the Turing computable set, at all, including brains.
The rational basis for assuming the brain is a computer is that we have not a single shred of evidence that exceeding Turing computability is possible, nor any theory for how to even express a function that is computable for humans but not Turing computable.
If you can find one single such example, there'd be a rational basis for saying the brain isn't a computer. As it stands now, assuming it isn't is nothing more than blind faith.
platz
> we have not a single shred of evidence that exceeding Turing computability is possible
if your basis is that anything with computability equal to or less than Turing computability is a computer, then everything is a computer.
aeonik
In the same way that everything is physics.
pounderstanding
Brain is subset of computers, but llms are not subset of brains.
vidarh
The reason a lot of people are unhappy about this notion is that it doesn't really matter: Any Turing complete system can emulate any other Turing complete system, and an LLM can trivially be made to execute a Turing machine if you put a loop around it, which means that unless you can find evidence humans exceed Turing computability AGI is "just" a question of scaling and training.
It could still turn out to be intractable without a better architecture, but the notion that it might not be impossible makes a lot of people very upset, and the only way it can be impossible even for just an LLM with a loop bolted on is if human brains can compute functions outside the Turing computable set.
pounderstanding
"Llm thinks" is false advertising. (Maybe useful jargon, but still)
> Any Turing complete system can emulate any other Turing complete system, and an LLM can trivially be made to execute a Turing machine if you put a loop around it
Wouldn't it be more efficient to erase the LLM and use underlying hardware as Turing complete system?
BTW, the Turing test is just an admission that we have no way of defining human-level intelligence apart from "you'll know it when you see it".
js8
I agree with you. "Chain of thought" is not reasoning, just like LSD trip isn't.
I think we lack a good formal definition of what (fuzzy) reasoning is. Without it, we will always have some kind of unexplained hallucinations.
I also believe AGI could be implemented as a model that can train models for specific tasks completely autonomously. But that would kill the cash cow, so OpenAI etc. are not interested in developing it.
234120987654
100% agree. I miss the days where the title would describe the method instead of being a sales pitch
Terr_
That reminds me of the punchline to this lengthy comic, which you might enjoy and/or find back-pain from.
SubiculumCode
In LLMs, is there a correlation between layer depth and the activations' correspondence to the abstract-to-concrete details continuum?
pizza
Yes: for e.g. BPE (due to how it progressively pushes compound tokens of already-seen, hence more common, subtokens to the 'top' of the vocab), you can train a model to do regression over vocabulary index for the next token from the current token embedding, using the same single regression model for all layer depths. If you plot MSE of token-index prediction versus layer depth, you can see that the MSE of the prediction decreases steadily with each additional layer. This appears to be because token index in e.g. BPE is actually fairly smooth, so it seems like the model is capable of localizing to the actual correct vocab index as depth increases; kind of like a fuzzy-to-discrete refinement as you go deeper in layers. https://arxiv.org/abs/2408.13442
SubiculumCode
THANKS!
anon291
It's a good, understandable paper. The main issue with chain-of-thought (which I think is a solid approach, and one that needs to take place) is that we ourselves aren't necessarily 'trained' on chain-of-thought. Yes, we do learn mathematical proofs and reasoning at some point (usually), but most people settle on latent thinking without training, and switch between the two modes naturally. My intuition says we're missing something, but who knows
thoughtlede
Perhaps these findings might be indicating that we need more NN layers/attention blocks for performing reasoning. This project circumvented the lack of more trained layers by looping the input through currently trained layers more than once.
Also we may have to look for better loss functions than ones that help us predict the next token to train the models if the objective is reasoning.
mentalgear
"We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought")."
Could someone explain the last hidden state of the LLM? What its shape is and how it is normally used - and why it hasn't been used yet to augment the next input? (which seems logical)
tjbai
The last hidden state is just the output embedding after N residual layers, e.g. input embedding + res1 + res2 + ...
There's typically an "unembedding layer"/"classification head" that uses this hidden state to produce a softmax distribution over the LLM's vocabulary. In this case, we can think of this as "snapping" the hidden state into a single token and feeding that token into the next position of the autoregressive LLM.
In this sense, the last hidden state _does_ augment the next input. The authors simply propose directly feeding this hidden state into the next step rather than reducing it into a single token—thus, reasoning in continuous latent space rather than discrete token space.
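To make that contrast concrete, here is a toy inference-time sketch. It is my own illustration with an off-the-shelf GPT-2, not the paper's code; a real Coconut model is trained for this, so plain GPT-2 would only show the plumbing, not sensible behaviour.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    embeds = model.transformer.wte(tok("2 + 2 =", return_tensors="pt").input_ids)
    with torch.no_grad():
        for _ in range(4):
            out = model(inputs_embeds=embeds, output_hidden_states=True)
            h_last = out.hidden_states[-1][:, -1:]   # last hidden state, last position
            # (a) ordinary decoding: "snap" to a token id, then re-embed it
            # next_emb = model.transformer.wte(out.logits[:, -1:].argmax(-1))
            # (b) continuous thought: feed the hidden state itself back in
            next_emb = h_last
            embeds = torch.cat([embeds, next_emb], dim=1)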
intalentive
Moreover “snapping” the hidden state to a token is akin to quantization. It’s lossy. By staying in latent space the model can “reason” at “full resolution” without discretization noise.
snthpy
Sometimes discretization introduces interesting behavior though. Compare for example the logistic map and its chaotic regime with the simplicity of the logistic ODE. Another example would be quantum mechanics compared to classical mechanics and determinism. The Poincare Conjecture was only interesting for n=3 due to too much connectivity in higher dimensions. Wouldn't it be interesting if consciousness only arose in such a discretized form, a case of incidental complexity and chaos introduced as the result of topological non-triviality from quantization?
Don't forget, non-linearity is fundamental to the whole process, otherwise you'd just have one large linear transformation. Maybe there's a similar role for discretization? :shrug:
soulofmischief
Useful information about conceptual relationships and procedure can be captured in the LM head, so there is also potential lossiness when short-circuiting it.
sweetheart
Wow this was the explanation that made it all click for me. Thanks so much!
AmazingTurtle
Embeddings, aka the last hidden state, are the mathematical representation of an input to the model before a separate model (usually the decoder) translates that hidden state into a next token (the generative part in generative AI). Normally, this step repeats over and over. This novel approach introduces re-using the last hidden state as if it were a token that had been generated, thus "evolving" the hidden state over each iteration.
psb217
The way the recurrence in this method works -- i.e., using the last LLM hidden state at the previous time step as the input token for the next time step -- isn't directly compatible with how recurrence/autoregression is typically handled during LLM training. One of the major strengths of transformers is that they can be trained for recurrence/autoregression (which have sequential dependency) using convolutions (which are embarrassingly parallel). The proposed method requires introducing some sequential dependencies during training that could otherwise be avoided using "causal masking" and convolutions to enforce the correct dependencies between time steps in a sequence. Introducing these sequential dependencies makes training a lot slower.
tl;dr: the method requires training in a way that loses one of the major benefits of transformers, but maybe in some scenarios that loss is worth it.
ilaksh
It seems like the latent space could be even more useful if it was trained with the transcribed videos.
bick_nyers
I wonder if you would want to use an earlier layer as opposed to the penultimate layer, I would imagine that the LLM uses that layer to "prepare" for the final dimensionality reduction to clean the signal such that it scores well on the loss function.
DalasNoin
So the models will no longer be thinking in plain English but some embedding space? Seems not like what you want.
Vampiero
Seems exactly like what you want. We don't think in plain English, we _rationalize_ our thoughts into English (or whatever language comes out) but they must be more fundamental than language because language is acquired.
Essentially, English is one of many possible encodings of an underlying intuitive, possibly non-symbolic representation.
intalentive
Cognitive scientists called it “mentalese”.
ekianjo
> We don't think in plain English
That's debatable. Language shapes thoughts much more than you might think. Because you learn concepts from language that you could not imagine by yourself until you learned/read about them, so they are in effect very linked to language.
phkahler
I can also think in images and internal visualizations. Geometric reasoning is also a thing. Musicians can also hear things in their mind - some can write it down, others can play it directly, and in my case I'm not good enough to get it out of my head!
In all cases though these thoughts are kind of tied to representations from the real world. Sort of like other languages via different senses. So yeah, how abstract can our thoughts actually be?
idiotsecant
But the thing you learn is not the word 'purple'. You just use the word as the mental scaffolding to build a concept of purple. The word forms a linkage to a deeper embedding, which is further proven by the fact that it's actually slightly different in each mind that has understanding of the concept.
This embedded concept is what is doing the work, the word was just the seed of the understanding and a method by which to convey that understanding to others.
samiskin
Language is definitely a significant part of thinking, but when I remember how cold it was outside yesterday to figure out if it was colder than today, I'm not bringing words to mind. I'm bringing up some other non-discrete information that I could never precisely encode into words and then factoring that in with the other non-discrete information I'm currently taking in through my senses. Its only after that processing that I encode it as a lossy "It was colder yesterday" statement.
Vampiero
Fair, but there are many categories of languages.
For example, I can think in formal logic. I've learned to do that, and surely my brain takes a step-by-step approach to it, but I've also internalized some of it and I don't think that my proficiency with English has anything to do with it.
I could have learned the same concepts in any other language, but the end result would be the same.
And surely there are many thoughts that can't be expressed purely with words. For example all that is related to qualia. You can think of a color but you can't describe what you see in your mind's eye with words, not in a way that would let a blind person share the same experience. Or try describing "love" without making a similitude. Is love a thought? Or a feeling? Is there a meaningful difference between the two?
klausa
You're basically talking about Sapir-Whorf here:
https://en.wikipedia.org/wiki/Linguistic_relativity
>The hypothesis is in dispute, with many different variations throughout its history.[2] The strong hypothesis of linguistic relativity, now referred to as linguistic determinism, is that language determines thought and that linguistic categories limit and restrict cognitive categories. This was a claim by some earlier linguists pre-World War II;[3] since then it has fallen out of acceptance by contemporary linguists.
numpad0
Eh, probably both. Why does it have to be a fight between two schools of thought? Thoughts can be cross-modal: some of it can be done in a specific language, some can be visual.
(universal grammar people hate this somehow, it's weird)
drdeca
If you mean “not what we want” for safety reasons, I think I agree.
If you don’t mean for safety reasons, I’m not sure why.
miven
In section 2 they briefly mention studies such as [1] that point out that the token outputs of a chain of thought aren't always entirely faithful to the responses of the models
I'm not sure whether it wouldn't be more reliable to let the model run on latents and try to train a separate latent-reading explainer module that has at least some approximation of what we want as an explicit optimization objective.
Assuming it actually is or has the potential to be better than CoT, from what I gathered from the paper the current results are mostly just more efficient token-wise.
DalasNoin
I was thinking about safety reasons, but also usability. Seems like a pretty big difference to me if you don't understand the chain of thought. How faithful CoTs are is another question.
rsrsrs86
They have never been thinking. This is important.
Predicting the next word is not intelligence.
vouaobrasil
> Experiments show that Coconut can effectively augment the LLM on several reasoning tasks.
It really seems like we are building a true intelligence, adding components to different parts of a "brain" until we have something rivalling the human mind. It's exceptionally dangerous and it's remarkable how researchers turn a blind eye to any possible consequences.
AmazingTurtle
One day researchers will be like "Oh crap what have we done" and "Shut it down, shut it down!!!"
vouaobrasil
That is true. Most people will just respond to immediate physical threats as long as they have the illusory safety net of modern society.
ionwake
Just bear in mind that while they are yelling "shut it down" there will be a bunch of commenters with no idea what's happening saying that they are just overreacting.
ionwake
agree. Someone should make sure the next ASI develops an extension to hide the comments in every AI thread 80% full of the brightest minds saying " I tried to build a react app and it totally failed doing it the way I wanted ".
rsrsrs86
If ASI is artificial specific intelligence, I beg your pardon; intelligence can hardly be specific. Intelligence can reflect upon itself.
ionwake
sorry what? I didn’t understand the sentence