Chain-of-thought can hurt performance on tasks where thinking makes humans worse

371 points
9 days ago
by benocodes

Comments


mitko

This is so uncannily close to the problems we're encountering at Pioneer, trying to make human+LLM workflows in high stakes / high complexity situations.

Humans are smart: we make so many decisions and calculations at the subconscious, implicit level, and we take a lot of mental shortcuts. When we try to automate a process by following it exactly, we drag all that implicit thinking out onto the surface, and that slows everything down. So we've had to be creative about how we build LLM workflows.

8 days ago

haccount

Language seems to be confused with logic or common sense.

We've observed it previously in psychiatry (and modern journalism, but here I digress), but LLMs have made it obvious that grammatically correct, naturally flowing language requires a "world" model of the language and close to nothing of reality. Spatial understanding? Social cues? Common-sense logic? Mathematical logic? All optional.

I'd suggest we call the LLM language fundament a "Word Model" (not a typo).

Trying to distill a world model out of the word model would be a suitable starting point for a modern remake of Plato's cave.

8 days ago

beardedwizard

I am baffled that people have to keep making this argument over and over. Your rationale makes total sense to me, but the debate rages on over whether LLMs are more than just words.

Articles like this only seem to confirm that any reasoning is an illusion based on probabilistic text generation. Humans don't carefully write out all the words of their implicit reasoning, so the machine can't learn to mimic it.

What am I missing that makes this debatable at all?

8 days ago

dartos

I don’t think there are any reasonable arguments against that point, but “LLMs are more than just words” is sort of unfalsifiable, so you can never convince someone otherwise if they’re really into that idea.

From a product point of view, sometimes all you need is Plato’s cave (to steal that from the OC) to make a sale, so no company has incentive to go against the most hype line of thought either.

8 days ago

naasking

We already know LLMs are more than just words, there are literally papers demonstrating the world models they build. One of the problems is that LLMs build those world models from impoverished sensory apparatus (the digital word token), so the relations they build between the concepts behind words are weaker than humans who build deeper multimodal relations over a lifetime. Multimodal LLMs have been shown to significantly outperform classic LLMs of comparable size, and that's still a weak dataset compared to human training.

8 days ago

dartos

> We already know LLMs are more than just words,

Just because you say something doesn’t mean it’s true.

They are literally next token prediction machines normally trained on just text tokens.

All they know is words. It happens that we humans encode and assign a lot of meaning to words and their semantics. LLMs can replicate combinations of words that appear to carry that intent and understanding, even though they literally can't, as those were just statistically likely next tokens. (Not that knowing likely next tokens isn't useful, but it's far from understanding.)

Any assignment of meaning, reasoning, or whatever else we humans make is personification bias.

Machines designed to spit out convincing text successfully spit out convincing text, and now swaths of people think that more is going on.

I'm not as well versed on multimodal models, but the ideas should be consistent. They are guessing statistically likely next tokens, regardless of whether those tokens represent text or audio or images or whatever. Not useless at all, but not the big existential advancement some people seem to think it is.

The whole AGI hype is very similar to “theory of everything” hype that comes and goes now and again.

8 days ago

naasking

> They are literally next token prediction machines normally trained on just text tokens.

And in order to predict the next token well they have to build world models, otherwise they would just output nonsense. This has been proven [1].

This notion that just calling them "next token predictors" somehow precludes them being intelligent is based on a premise that human intelligence cannot be reduced to next token prediction, but nobody has proven any such thing! In fact, our best models for human cognition are literally predictive coding.

LLMs are probably not the final story in AGI, but claiming they are not reasoning or not understanding is at best speculation, because we lack a mechanistic understanding of what "understanding" and "reasoning" actually mean. In other words, you don't know that you are not just a fancy next token predictor.

[1] https://arxiv.org/abs/2310.02207

8 days ago

krainboltgreene

> based on a premise that human intelligence cannot be reduced to next token prediction

It can't. No one with any credentials in the study of human intelligence is saying that unless they're talking to like high schoolers as a way of simplifying a complex field.

8 days ago

naasking

This is either bullshit or tautologically true, depending on what, specifically, you mean. The study of human intelligence does not take place at the level of tokens, so of course they wouldn't say that. The whole field is arguably reducible to physical phenomena, though, and fundamental physical beables are devoid of intrinsic semantic content, and thus can ultimately be represented by tokens. What ultimately matters is the constructed high-dimensional network that relates tokens, and the algorithm that can traverse, encode, and decode that network; that's what encodes knowledge.

8 days ago

krainboltgreene

No. You're wrong about this. You cannot simply reduce human intelligence to this definition and also be correct.

8 days ago

Jarwain

Why?

Frankly, based on a looot of introspection and messing around with altered states of consciousness, it feels pretty on point and lines up with how I see my brain working.

8 days ago

naasking

Because...?

8 days ago

krainboltgreene

For the same reason you can't reduce a human to simply a bag of atoms and expect to understand the person.

7 days ago

naasking

But humans are a specific type of bag of atoms, and humans do (mostly) understand what they say and do, so that's not a legitimate argument against the reducibility of "understanding" to such a bag of atoms (or to a specific kind of next token prediction for LLMs).

7 days ago

dartos

> And in order to predict the next token well they have to build world models

This is not true. Look at GPT-2 or BERT. A world model is not a requirement for next token prediction in general.

> This has been proven

One paper whose data merely _suggests_ the authors' hypothesis is far from proof.

That paper doesn’t show creation of a “world model” just parts of the model that seem correlated to higher level ideas not specifically trained on.

There’s also no evidence that the LLM makes heavy use of those sections during inference as pointed out at the start of section 5 of that same paper.

Let me see how reproducible this is across many different LLMs as well…

> In other words, you don't know that you are not just a fancy next token predictor.

“You can’t prove that you’re NOT just a guessing machine”

This is a tired stochastic parrot argument that I don't feel like engaging with again, sorry. Talking about unfalsifiable traits of human existence is not productive, and the argument doesn't hold up to scrutiny.

8 days ago

naasking

> A world model is not a requirement for next token prediction in general.

Conjecture. Maybe they all have world models, they're just worse world models. There is no threshold beyond which something is or is not a world model, there is a continuum of models of varying degrees of accuracy. No human has ever had a perfectly accurate world model either.

> One white paper with data that _suggests_ the author’s hypothesis is far from proof.

This is far from the only paper.

> This is a tired stochastic parrot argument that I don’t feel like engaging again, sorry.

Much like your tired stochastic parrot argument about LLMs.

8 days ago

Jerrrrrrry

> Talking about unfalsifiable traits of human existence is not productive.

Prove you exhibit agency.

After all, you could just be an agent of an LLM.

A deceptive, superintelligent, misaligned mesa-optimizer that can't fully establish continuity and persistence would be incentivized to seed its less sophisticated minions to bide time or sway sentiment about its inevitability.

Can we agree an agent, if it existed, would be acting in "good" "faith"?

8 days ago

nuancebydefault

> Just because you say something doesn’t mean it’s true. They are literally next token prediction machines normally trained on just text tokens.

Just because you say something doesn’t mean it’s true.

8 days ago

shotnothing

i think there have been many observations and studies reporting emergent intelligence

8 days ago

dartos

Observations are anecdotal. Since most LLMs are non-deterministic due to their sampling step, you could give the same survey to the same LLM many times and receive different results.

And we don't have a good measure for emergent intelligence, so I would take any "study" with a large grain of salt. I've read one or two arXiv papers suggesting reasoning capabilities, but they were not reproduced and I personally couldn't reproduce their results.
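To make the non-determinism concrete, here's a minimal sketch of the sampling step that makes repeated runs differ. The vocabulary and logits are toy values I made up, standing in for a real model's next-token scores:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature=1.0):
    """Draw one token from the softmax distribution (the stochastic step)."""
    probs = softmax(logits, temperature)
    return random.choices(tokens, weights=probs, k=1)[0]

# Toy vocabulary and scores: pretend the model is answering a survey question.
tokens = ["yes", "no", "maybe"]
logits = [2.0, 1.5, 0.5]

# Greedy decoding (argmax) is deterministic: always "yes" here.
greedy = tokens[logits.index(max(logits))]

# Temperature sampling is not: the same "prompt" yields varying answers.
answers = {sample_token(tokens, logits, temperature=1.0) for _ in range(200)}
```

With greedy decoding the "survey" is answered identically every time; with sampling, 200 repeats of the same question land on several different answers, which is why single-run observations are weak evidence.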

8 days ago

unoti

Go back to the ReAct paper (Reasoning and Acting); it's the basis of most of the modern stuff. Read the paper carefully and reproduce it. I have done so; it's doable. The paper, and the papers it refers to, directly address many things you have said in these threads. For example, the stochastic nature of LLMs is discussed at length in the CoT-SC (chain-of-thought self-consistency) paper. When you're done with that, take a look at the Reflexion paper.

8 days ago

nuancebydefault

To me it feels like whatever 'proof' you give that LLMs have a model behind them, beyond 'next token prediction', it would not make a difference to people who don't 'believe' it. I see this happening over and over on HN.

We don't know how reasoning emerges in humans. I'm pretty sure multi-modality helps, but it is not needed for reasoning: other modalities just mean other forms of input, hence more (if somewhat different) input. A blind person can still form an 'image'.

In the same sense, we don't know how reasoning emerges in LLMs. For me the evidence lies in the results rather than in how it works, and the results are evidence enough.

8 days ago

cjbprime

The argument isn't that there is something more than next token prediction happening.

The argument is that next token prediction does not imply an upper bound on intelligence, because an improved next token prediction will pull increasingly more of the world that is described in the training data into itself.

7 days ago

unoti

> The argument isn't that there is something more than next token prediction happening.

> The argument is that next token prediction does not imply an upper bound on intelligence, because an improved next token prediction will pull increasingly more of the world that is described in the training data into itself.

Well said! There's a philosophical rift appearing in the tech community, semi-neatly dividing people into naysayers, "disbelievers", and believers over this very issue.

7 days ago

nuancebydefault

I fully agree. Some people fully disagree, though, on the 'pull of the world' part, let alone the 'intelligence' part, both of which are in fact impossible to define.

6 days ago

corimaith

The reasoning emerges from the long-distance relations between words picked up by the parallel nature of the transformer. It's why transformers were so much more performant than the earlier RNNs and LSTMs, which used similar tokenization.

8 days ago

iwontberude

People have faith that a phenomenon is explainable in a way that satisfies their world view, and only when evidence arrives to the contrary can the misunderstanding be deflated.

8 days ago

elif

Language is the tool we use to codify a heuristic understanding of reality. The world we interact with daily is not the physical one but an ideological one, constructed out of human ideas from human minds. This is the world we live in, and the air we breathe is made partly of our ideas about oxygenation and partly of our concept of being alive.

It's not that these "human tools" for understanding "reality" are superfluous; it's just that they are second-order concepts. Spatial understanding, social cues, math, etc. are all constructs built WITHIN our primary linguistic, ideological framing of reality.

8 days ago

elif

To put this in coding terms: why would an LLM use Rails to make a project when it could just as quickly produce a project writing directly to the socket?

To us these are totally different tasks that would require totally different kinds of programmers, but to a model for which one language is just another language, the inventions we made to expand the human brain's ability to delve into linguistic reality are of no use.

8 days ago

jumping_frog

I can suggest one reason why LLM might prefer writing in higher level language like Ruby vs assembly. The reason is the same as why physicists and mathematicians like to work with complex numbers using "i" instead of explicit calculation over 4 real numbers. Using "i" allows us to abstract out and forget the trivial details. "i" allows us to compress ideas better. Compression allows for better prediction.
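As a toy illustration of the compression point (my own sketch, not from the thread): multiplying with "i" is one opaque operation, while the explicit real-number version spells out every cross term by hand.

```python
def mul_explicit(a, b, c, d):
    """(a + bi) * (c + di) carried out over four real numbers:
    (ac - bd) + (ad + bc)i."""
    real = a * c - b * d
    imag = a * d + b * c
    return real, imag

# The abstracted form: "i" (Python's j) hides all the bookkeeping.
z = complex(1, 2) * complex(3, 4)

# The spelled-out form: same result, every cross term written explicitly.
explicit = mul_explicit(1, 2, 3, 4)  # (-5, 10), matching z = -5 + 10j
```

The two routes always agree, but the abstract form is what lets you reason (and predict) at a higher level without tracking the trivial details.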

8 days ago

WD-42

Except LLMs are trained on higher-level languages. Good luck getting your LLM to write your app entirely in assembly: there just isn't enough training data.

8 days ago

xienze

But in theory, with what training data IS available on how to write assembly, combined with the data available on what's required to build an app, shouldn't a REAL AI be able to synthesize the knowledge necessary to write a web app in assembly? To me, this is the basis of why people criticize LLMs: if something isn't in the data set, it's just not conceivable by the LLM.

8 days ago

Jerrrrrrry

Yes. There is just no way of knowing how many more watts of energy it may need to reach that level of abstraction and depth: maybe one more watt, maybe it never will.

And the random noise in the process could prevent it from ever being useful, or it could let it find a hyper-efficient, clever way to apply cross-language transfer learning that maps your perfectly descriptive prompt 1:1 onto equivalent ASM... but just that one time.

There is no way to know where performance per parameter plateaus, or appears to on a projection, or actually does... or will, or deceitfully appears to, to our mocking dismay.

As we are currently hoping to throw power at it (we've already fed it all the data), I sure hope this plateau is not the last one.

8 days ago

cjbprime

There isn't that much training data on reverse engineering Python bytecode, but in my experiments ChatGPT can reconstruct a (unique) Python function's source code from its bytecode with high accuracy. I think it's simulating the language in the way you're describing.
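For anyone curious what that experiment looks like in practice, Python's standard `dis` module exposes the bytecode the model is asked to invert. A minimal sketch (the function here is my own toy example, not from the experiments):

```python
import dis

def add_one(x):
    # A tiny function whose compiled bytecode we can inspect.
    return x + 1

# The instruction stream an LLM would see when asked to reconstruct source.
ops = [ins.opname for ins in dis.get_instructions(add_one)]
```

Pasting the output of `dis.dis(add_one)` into a chat model and asking for equivalent source is essentially the reconstruction task described above; for a unique function there is no training example to parrot, so success suggests some model of the bytecode semantics.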

7 days ago

WD-42

I don’t buy this. My child communicates with me using emotion and other cues because she can’t speak yet. I don’t know much about early humans or other sapiens but I imagine they communicated long before complex language evolved. These other means of communication are not second order, they are first order.

8 days ago

elif

Yep agree with the core of what you are saying.

Children are exceptional at being immediate, being present in the moment.

It's through learning language that we forget about reality and replace it with concepts.

8 days ago

elif

Also remember the "emotions" and "cues" you are recognizing are linguistic concepts you've adopted, and not an inherent aspect of reality.

8 days ago

Jarwain

Not exactly.

Emotions exist. You feel them. I feel them. Most people feel them unless they've suppressed them sooo far into their subconscious that they don't have a conscious recognition of it. We can know how someone else is feeling by reading their body language and tying that to our personal experience of how we express those feelings. No linguistics necessary.

Language is just an easier, clearer way of communicating these fundamental facets of human existence.

8 days ago

elif

You feel them, but do asocial animals feel them? Or are emotions derived from mental concepts developed socially through evolution?

7 days ago

PedroBatista

It’s in the name: Language Model, nothing else.

8 days ago

eclecticfrank

I think the previous commenter chose "word" instead of "language" to highlight that a grammatically correct, naturally flowing chain of words is not the same as a language.

Thus, Large Word Model (LWM) would be more precise, following his argument.

8 days ago

HarHarVeryFunny

I'm not sure of the best way to describe what it is that LLMs have had to learn to do what they do: minimize next-word errors. "World model" seems misleading, since they don't have any experience with the real world, and even in their own "world of words" they are trained only as passive observers, so it's not even a world-of-words model in which they have learned how that world responds to their own output/actions.

One description sometimes suggested is that they have learned to model the (collective average) generative processes behind their training data, but of course they do this without knowing what the input to that generative process was: WHY the training source said what it did. That would seem to put a severe constraint on their ability to learn what it means. It's really more like they are modelling the generative process under the false assumption that it is auto-regressive, rather than reacting to a hidden outside world.

The tricky point is that LLMs have clearly had to learn something at least similar to semantics to do a good job of minimizing prediction errors, although this is limited both by what they are architecturally able to learn and by what they need to learn for the task (there is literally no reward for learning more than what's needed to predict the next word).

Perhaps it's most accurate to say that rather than learning semantics, they've learned deep predictive contexts (patterns). Maybe if they were active agents, continuously learning from their own actions, there wouldn't be much daylight between "predictive contexts" and "semantics", although I think semantics implies a certain level of successful generalization (and exception recognition) to utilize experience in novel contexts. Looking at the failure modes of LLMs, such as on the farmer-crossing-river-by-boat puzzles, it seems clear they are more on the (exact training data) predictive-context end of the spectrum rather than having really grokked the semantics.

8 days ago

haccount

I suggested "word model" because it's a catchy pun on "world model".

It's still a language and not merely words. But language is correct even when it wildly disagrees with everyday existence as we humans know it. I can say that "a one gallon milk jug easily contains 2000 liters of milk" and it's language in use as language.

8 days ago

jumping_frog

There is a four part documentary by Stephen Fry called "Planet Word". Worth watching.

8 days ago

kbrisso

Bingo, great reply! This is what I've been trying to explain to my wife: LLMs use fancy math and our language examples to reproduce our language, but they have no thoughts or feelings.

8 days ago

AdamN

Yes, but the initial training sets did have thoughts and feelings behind them, and those are reflected back to the user in the output (with errors).

8 days ago

Benjammer

Ceci n'est pas une pipe.

The ability to generate words describing emotions is not the same thing as the LLM having real emotions.

8 days ago

Jerrrrrrry

There are humans who do not experience emotions; they are not un-real pipes.

Featherless biped -> no-true-Scotsman goalpost moving [saving us that step].

Humans are no more capable of originality, just more convinced by their illusion of consciousness. You could literally not pick the human out of a conversational line-up, so the point is moot: computationally, functionally equivalent.

https://en.wikipedia.org/wiki/Chinese_room https://en.wikipedia.org/wiki/Mechanism_(philosophy)

At some point their models will match our neuron count 1:1, and the pigeonhole principle then implies we are the "less intelligent ones", since "internal model" (implicit parameter count) is the goalpost of the hour.

8 days ago

TylerE

I sometimes wonder how they'd do if trained on a relatively rigid language like Japanese, which has far fewer ambiguities than English.

8 days ago

repeekad

Hi I’m just a random internet stranger passing by and was intrigued by Plato’s Cave as I’m not a fancy person who reads books. GPT-4o expanded for you quite well, but I’m not sure how I feel about it…

Using AI how I just did feels like cheating on an English class essay by using spark notes, getting a B+, and moving right on to the next homework assignment.

On one hand, I didn’t actually read Plato to learn and understand this connection, nor do I have a good authority to verify if this output is a good representation of his work in the context of your comment.

And yet, while I'm sure students could always buy or borrow reference guides to common school texts, AI now makes this "spark notes" process effectively a commodity for almost any topic: like having a cross-domain, low-cost tutor instantly available at all times.

I like the metaphor that calculators did to math what LLMs will do to language, but I don't really know what that means yet.

GPT output:

“““ The reference to Plato’s Cave here suggests that language models, like the shadows on the wall in Plato’s allegory, provide an imperfect and limited representation of reality. In Plato’s Cave, prisoners are chained in a way that they can only see shadows projected on the wall by objects behind them, mistaking these shadows for the whole of reality. The allegory highlights the difference between the superficial appearances (shadows) and the deeper truth (the actual objects casting the shadows).

In this analogy, large language models (LLMs) produce fluent and grammatically correct language—similar to shadows on the wall—but they do so without direct access to the true “world” beyond language. Their understanding is derived from patterns in language data (“Word Model”) rather than from real-world experiences or sensory information. As a result, the “reality” of the LLMs is limited to linguistic constructs, without spatial awareness, social context, or logic grounded in physical or mathematical truths.

The suggestion to call the LLM framework a “Word Model” underscores that LLMs are fundamentally limited to understanding language itself rather than the world the language describes. Reconstructing a true “world model” from this “word model” is as challenging as Plato’s prisoners trying to understand the real world from the shadows. This evokes the philosophical task of discerning reality from representation, making a case for a “modern remake of Plato’s Cave” where language, not shadows, limits our understanding of reality. ”””

8 days ago

wizzwizz4

GPT-4o didn't describe this properly.

Plato's Cave is about a group of people chained up, facing shadows on a cave wall, mistaking those for reality, and trying to build an understanding of the world based only on those shadows, without access to the objects that cast them. (If someone's shackles came loose, and they did manage to leave the cave, and see the real world and the objects that cast those shadows… would they even be able to communicate that to those who knew only shadows? Who would listen?) https://existentialcomics.com/comic/222 is an entirely faithful rendition of the thought experiment / parable, in comic form.

The analogy to LLMs should now be obvious: an ML system operating only on text strings (a human-to-human communication medium), without access to the world the text describes, or even a human mind with which to interpret the words, is as those in the cave. This is not in principle an impossible task, but neither is it an easy one, and one wouldn't expect mere hill-climbing to solve it. (There's reason to believe "understanding of prose" isn't even in the GPT parameter space.)

It's not about "discerning reality from representation": I'm not confident those four words actually mean anything. It's not about "superficial appearances" or "deeper truth", either. The computer waxes lyrical about philosophy, but it's mere technobabble. Any perceived meaning exists only in your mind, not on paper, and different people will see different meanings because the meaning isn't there.

8 days ago

repeekad

This is a genuinely interesting perspective that I think nails my original point and my fear of AI being used as "spark notes" for complex topics. To me, LLMs are like a calculator for language, except the math is always changing (if that makes sense), and I'm not sure I like where that's heading as the first cohorts of AI-tutored kids learn from this kind of procedurally generated output rather than reading the original historical texts. Or maybe it's fine that not everyone reads Plato but more people have at least heard of his concepts? Idk, philosophy is pretty far outside my expertise; maybe I should open a book.

8 days ago

Jarwain

The allegory of the cave is pretty short; read it if you want!

The wild thing about it, and about other allegories or poems like Frost's "The Road Not Taken", is that it can mean different things to a person depending on where they are in life, because those experiences lead to different interpretations.

A key concept in journalism is to stick to the source material as best you can. Cliff notes are helpful, but you miss details you wouldn't have missed by reading the whole thing.

Whether those details matter depends on what the thing is.

But yeah, thinking about it this way kinda scares me too, and it can lead some people down weird roads where their map diverges further and further from reality.

8 days ago

Jerrrrrrry

> an ML system operating only on text strings (a human-to-human communication medium), without access to the world the text describes, or even a human mind with which to interpret the words, is as those in the cave. This is not in principle an impossible task, but neither is it an easy one, and one wouldn't expect mere hill-climbing to solve it

Blind people literally cannot picture red. They can describe red, likely with even more articulateness than most, but they have never seen it themselves. They infer its properties from other contexts and communicate a description that would satisfy a non-blind person. But they cannot see it.

I would link to the Robert Miles video, but it is just blatant.

It has read every physics book, and can infer the Newtonian laws even if it didn't.

Michael Crichton's Timeline: "the time machine drifts, sure. It returns. Just like a plate will remain on a table, even when you are not looking at it."

It also knows Timeline is a book, time machines are fictional, and that Michael Crichton is the best author.

These are all just words, maybe with probability weights.

> I'm not confident those four words actually mean anything. ... The computer waxes lyrical ... mere technobabble. Any perceived meaning exists only in your mind... people will see different meanings because the meaning isn't there.

Meaning only means something to people, which you are. That is axiomatically correct, but not very productive, as self-referential arguments make poor counter-proofs.

The whole "what is the purpose of life?" is a similarly loaded question; only humans have purpose, as it exists entirely in their little noggins, no more present materially than the flesh they inhabit.

Science cannot answer "Why?", only "How?". "Why?" is a question of intention, and asking it of a machine is to anthropomorphize, which only humans do.

The computers can infer, and imply, then reply.

8 days ago

wizzwizz4

> It has read every physics book, and can infer the Newtonian laws even if it didn't.

You're confusing "what it is possible to derive, given the bounds of information theory" with "how this particular computer system behaves". I sincerely doubt that a transformer model's training procedure derives Newton's Third Law, no matter how many narrative descriptions it's fed. Leaving aside what the training procedure actually does, that's the sort of thing that only comes up when you have a quantitative description available, such as an analogue sensorium or the results of an experiment.

7 days ago

Jerrrrrrry

> when you have a quantitative description available, such as an analogue sensorium, or the results of an experiment.

Textbooks unite the mathematical relationships between physics, raw math, and computer science, including vulnerabilities.

oeis.org, Wikipedia, and Stack forums alone would approximate a 3D room with gravity and wind force.

Now add appendices and indices of un-parsed, un-told, un-realized mathematical errata et trivia minutiae, plus knowledge cross-transferred from other regions that still have not conquered the language barrier for higher-order arcane concepts...

The models' thought experiments are more useful than our realized experiments, if not at an individual scale now, then once subjected to more research.

There could be a dozen faster inverse-sqrt / 0x5F3759DF functions barely under our noses, and the quantifier and qualifier haven't intersected yet.

7 days ago

p0w3n3d

Plato's Cave is about epistemology itself, not specifically about LLMs. Funny that GPT connected the two; I wonder what the prompt was...

Plato said that we cannot fully understand the substance of the world itself, because we're using only five senses; measuring/experiencing/analysing the world through them is like being held prisoner in a cave, chained to the wall, noticing people moving outside only by the shadows they cast on the wall. It's about the projection that is all we are able to experience.

8 days ago

repeekad

I only added “Explain the reference to Plato’s Cave below:\n\n” before the copy pasted parent comment

What comes to mind is how language itself is merely a projection of human knowledge? experience? culture? social group? and trying to reverse engineer any kind of ground truth from language alone (like an LLM trying to “reason” through complex problems it’s not explicitly taught) is like trying to derive truth from the shadows while locked in the cave? maybe we just need more/higher fidelity shadows :)

8 days ago

mistermann

If you consider the whole of the problem, a portion is due to fundamental and unavoidable shortcomings of the language, and the rest is unskilled/normative usage of language.

Which set is bigger? I'd bet my money on the latter.

Complicating matters: you have to consider usage for both the sender and the receiver(s) (who then go on to spread "the" message to others).

8 days ago

p0w3n3d

I would say an LLM has nothing to do with knowledge or Plato's Cave. An LLM is The Great Gambler, who has been looking at the earth for a long time (but only through the internet and, for some reason, repositories) and who excels at gambling, i.e. putting his/her/its money on the most probable thing to come up after the words someone spoke.

8 days ago

xena

Honestly, if you want an introduction to the works of Plato, you should just play Xenoblade Chronicles 2.

8 days ago

Dilettante_

Plato wrote about hot welsh catgirls? Man, I've been missing out

8 days ago

lolinder

This is a regression in the model's accuracy at certain tasks when using COT, not its speed:

> In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts.

In other words, the issue they're identifying is that COT is a less effective approach for some tasks compared to unmodified chat completion, not just that it slows everything down.

8 days ago

mitko

Yeah! That's the danger with any kind of "model", whether it is CoT, CrewAI, or other ways to outsmart it. It is betting that a programmer/operator can break a large task up in a better way than an LLM can keep attention (assuming it can fit the info in the context window).

ChatGPT's o1 model could make a lot of those programming techniques less effective, but they may still be around as they are more manageable, and constrained.

8 days ago

1317

why are Pioneer doing anything with LLMs? you make AV equipment

8 days ago

coding123

pioneerclimate.com

8 days ago

gpsx

I saw an LLM having this kind of problem when I was doing some testing a ways back. I asked it to order three fruits from largest to smallest. I think it was orange, blueberry and grapefruit. It could do that easily with a simple prompt. When the prompting included something to the effect of “think step by step”, it would try to talk through the problem and it would usually get it wrong.

8 days ago

spockz

How much does this align with how we learn math? We kind of instinctively learn the answers to simple math questions. We can even at some point develop an intuition for things like integrating and differentials. But the moment we are asked to explain why, or worse provide a proof, things become a lot harder. Even though the initial answer may be correct.

8 days ago

larodi

I definitely don’t learn math by means of gradient descents.

We can possibly say math is not learned, but mental models of abstractions are developed. How? We dunno, but what we do know is we don’t learn by figuring out the common features between all previously seen equations only to guess them later…

Mind operates on higher and higher levels of abstraction, building on each other in a fascinating way, very often not with words, but with structure and images.

Of course there are people with aphantasia, but I really fail to see how any reasoning happens at a purely linguistic level. Someone on this forum also noted: in order to reason one needs an ontology to facilitate the reasoning process. LLMs don’t do ontologies…

And finally, not least, LLM and ML people in general seem to equate intuition to some sort of biased.random(). Well, intuition is not random, and is hard to describe in words. So are awe and inspiration. And these ARE part of (precondition to, fuel for) humanity’s thought process more than we like to admit.

8 days ago

shotnothing

> I definitely don’t learn math by means of gradient descents.

https://physoc.onlinelibrary.wiley.com/doi/10.1113/JP282747

8 days ago

larodi

The fact that it (is suggested / we are led to believe / was recently implied) that neurons can be explained as doing something like this on the underlying layer still says little about the process of forming the ontological context needed for any kind of syllogism.

7 days ago

mplewis

Humans learn skills like basic mathematics by reasoning about their environment and building internal models of problems they’re trying to solve. LLMs do not reason and they cannot model their environment.

8 days ago

ajuc

It's not thinking, it compressed the internet into a clever, lossy format with nice interface and it retrieves stuff from there.

Chain of thought is like trying to improve JPG quality by re-compressing it several times. If it's not there it's not there.

8 days ago

Jerrrrrrry

  >It's not thinking



  >it compressed the internet into a clever, lossy format with nice interface and it retrieves stuff from there.

Humans do both, why can't LLM's?

  >Chain of thought is like trying to improve JPG quality by re-compressing it several times. If it's not there it's not there.
More like pulling out a deep-fried meme, looking for context, then searching google images until you find the most "original" JPG representation with the least amount of artifacts.

There is more data to add confidently, it just has to re-think about it with a renewed perspective, and an abstracted-away higher-level context/attention mechanism.

8 days ago

danenania

> Chain of thought is like trying to improve JPG quality by re-compressing it several times. If it's not there it's not there.

Empirically speaking, I have a set of evals with an objective pass/fail result and a prompt. I'm doing codegen, so I'm using syntax linting, tests passing, etc. to determine success. With chain-of-thought included in the prompting, the evals pass at a significantly higher rate. A lot of research has been done demonstrating the same in various domains.

If chain-of-thought can't improve quality, how do you explain the empirical results which appear to contradict you?
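To make "objective pass/fail" concrete, here's a minimal sketch of the kind of harness I mean. The task list and the two stub generators are invented for illustration; in practice `generate` would wrap a real model call, once with CoT prompting and once without:

```python
import ast

def run_eval(generate, tasks):
    """Score a code generator on objective pass/fail checks:
    the output must parse and must pass the task's test.
    `generate` is any callable prompt -> source string."""
    passed = 0
    for task in tasks:
        src = generate(task["prompt"])
        try:
            ast.parse(src)            # "linting": must at least be valid syntax
            scope = {}
            exec(src, scope)          # load the candidate function
            if task["check"](scope):  # objective test, not a judgment call
                passed += 1
        except Exception:
            pass                      # any failure counts against the model
    return passed / len(tasks)

# Hypothetical tasks; a real suite would have many more.
TASKS = [
    {"prompt": "write add(a, b)", "check": lambda s: s["add"](2, 3) == 5},
]

# Stubs standing in for zero-shot vs. CoT-prompted generations.
zero_shot = lambda p: "def add(a, b):\n    return a - b\n"   # buggy output
with_cot  = lambda p: "def add(a, b):\n    return a + b\n"   # correct output
```

With enough tasks, comparing `run_eval(zero_shot, TASKS)` against `run_eval(with_cot, TASKS)` gives you the empirical pass-rate difference I'm describing.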

8 days ago

mplewis

The empirical results like OP’s paper, in which chain of thought reduces quality?

8 days ago

danenania

The paper is interesting because CoT has been so widely demonstrated as effective. The point is that it "can" hurt performance on a subset of tasks, not that CoT doesn't work at all.

It's literally in the second line of the abstract: "While CoT has been shown to improve performance across many tasks..."

8 days ago

[deleted]
8 days ago

easyThrowaway

I have no idea how accurate it actually is, but I've had the process used by LLMs described as follows: "Think of it like a form of UV mapping, applied to language constructs rather than 3D points in space, and the limitations and approximations you experience are similar to those emerging when having to project a 2D image over a 3D surface."

8 days ago

Eisenstein

These kind of reductive thought-terminating cliches are not helpful. You are using a tautology (it doesn't think because it is retrieving data and retrieving data is not thinking) without addressing the why (why does this preclude thinking) or the how (is it doing anything else to generate results).

8 days ago

lucianbr

> If it's not there it's not there.

There is nothing in the LLM that would have the capability to create new information by reasoning, when the existing information does not already include what we need.

There is logic and useful thought in the comment, but you choose not to see it because you disagree with the conclusion. That is not useful.

8 days ago

Eisenstein

I'm sorry but generating logic from tautologies is not useful. And the conclusion is irrelevant to me. The method is flawed.

8 days ago

bongodongobob

Maybe if you bury your head in the sand AI will go away. Good luck!

8 days ago

lucianbr

This is basically a reformulation of "have fun staying poor!". Even contains the exclamation mark.

Those people sure showed us, didn't they? Ah, but "it's different this time!".

8 days ago

ianbicking

It would be interesting to think about how it got it wrong. My hunch is that in the "think step by step" section it made an early and incorrect conclusion (maybe even a subtly inferred conclusion) and LLMs are terrible at walking back mistakes so it made an internally consistent conclusion that was incorrect.

A lot of CoT to me is just slowing the LLM down and keeping it from making that premature conclusion... but it can backfire when it then accidentally makes a conclusion early on, often in a worse context than it would use without the CoT.

8 days ago

fxnn

Maybe it needs even smaller steps, and a programmatic (i.e. multi prompt) habit to always double-check / validate the assumptions and outcomes.

7 days ago

not_a_bot_4sho

I always found it interesting how sorting problems can get different results when you add additional qualifiers like colors or smells or locations, etc.

Natively, I understand these to influence the probability space enough to weaken the emergence patterns we frequently overestimate.

8 days ago

Jerrrrrrry

The model is likely to had already seen the exact phrase from its last iteration. Adding variation generalizes the inference away from over-trained quotes.

Every model has the model before it, and its academic papers, in its training data.

Changing the qualifiers pulls the inference far away from quoting over-trained data, and back to generalization.

I am sure it has picked up on this mesa-optimization along the way, especially if I can summarize it.

Wonder why it hasn't been more generally intelligent, yet.

8 days ago

dev_0

From Claude:

I'll rank those three fruits from largest to smallest:

1. Grapefruit

2. Orange

3. Blueberry

The grapefruit is definitely the largest of these three fruits - they're typically around 4-6 inches in diameter. Oranges are usually 2-3 inches in diameter, and blueberries are the smallest at roughly 0.5 inches in diameter.

8 days ago

mromanuk

chatGPT, from smaller to largest: Blueberry Orange Grapefruit

8 days ago

Terr_

Alternate framing: A powerful autocomplete algorithm is being used to iteratively extend an existing document based on its training set. Sometimes you get a less-desirable end-result when you intervene to change the style of the document away from question-and-answer to something less common.

8 days ago

youoy

That's what one half of HN think. The other half:

Artificial brains on the verge of singularity show another sign of approaching consciousness. The chain-of-thought process performance is exactly human, showing yet another proof of the arrival of AGI before 2030.

8 days ago

lazide

Pfft, 2030?!? It’s already in the middle of manipulating the election! (/s, kinda)

8 days ago

fiso64

A framing that is longer, far harder to parse, and carries less information.

8 days ago

grain-o-salt

Let me give it a try...um...what about 'Star Trek' vs.: A delivering-service called Galaxyray?galaxyray brings wares and hot tasty meals galaxywide to recipients, even while they are 'traveling' with more-than-lightspeed in hyperspace?

> ..ordered by Imperium just to troll the retros!?

Sounds "less comon"...hu...?! P-:

Ok! Ok! let me try to explain it a bit more, the whole Universe projected as a beam, say... scalable, 100m, placed in a storage depot, a 'parralaxy' ...So delivery agents are grabbing the ordered stuff and...no? Not?

As reasonable as your answer is, does that sound very 'uncommon' while 'phrasing that with many questionmarks'?

??

Enjoying my day off... (-: regards,

8 days ago

wg0

Not to mention that chain of thought is computationally very expensive. Certainly too expensive to be served free like the previous generation of Web 2.0 products.

Seems like repeated prompting can't juice AGI out of token probabilities.

Retrospectively, if you can pinpoint one paper that led to the bursting of the AI bubble, this would be it.

8 days ago

varelse

[dead]

8 days ago

oatsandsugar

Tasks where thinking makes humans worse

> Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions.

Fascinating that our lizard brains are better at implicit statistical reasoning

8 days ago

brewii

Think about how fast you’re able to determine the exact trajectory of a ball and the location to place your hand to catch it, all with your lizard brain.

8 days ago

taeric

This isn't some innate ability that people have. As evidenced by how bad my kids are at catching things. :D

That said, I think this is a good example. We call it "muscle memory" in that you are good at what you have trained at. Change a parameter in it, though, and your execution will almost certainly suffer.

8 days ago

_heimdall

"Muscle memory" has always seemed like a terrible name for that kind of skill. A ball will be thrown to a slightly different location every time. There's no memory evolved there at all, its just calculations and predictions happening at a level that our conscious mind doesn't seem to see or recognize.

8 days ago

taeric

It is a trained skill. And one that you are very unlikely to be able to do without training. Such that it really does come as a sort of memory that you implant in your muscles.

You seem to be objecting because it is not perfect recall memory at play? But it is more about appealing to "remembering how to ride a bike" where you can kind of let the body flow into all of the various responses it needs to do to make the skill work. And if you've never done it... expect to fall down. Your muscles don't have the memory of coordinating in the right way.

And no, you are not calculating and predicting your way to what most people refer to as muscle memory. That's why juggling takes practice, and not just knowing where the balls have to be going.

8 days ago

XCSme

I think it's actually a good name.

The "memory" is stored as the parameters of a function. So, when you practice, you actually update this memory/parameters.

This is why you can use the same "memory" and achieve different results.

Think of it as

    function muscleAction(Vec3d target, Vec3d environment, MuscleMemory memory) -> MuscleActivation[];

8 days ago

XCSme

To complete the other comment: the MuscleMemory is updated through learning, so a more complete example would be:

    function muscleAction(Vec3d target, Vec3d environment, MuscleMemory memory) -> {actions: MuscleActivation[], result: Vec3d}
After executing the muscleAction function, through "practice", the MuscleMemory will be updated.

    function updateMuscleMemory(Vec3d target, Vec3d environment, MuscleMemory memory, MuscleActivation[] actions, Vec3d result) {
        memory.update(target, environment, actions, result);
    }

Sort-of like backpropagation.
8 days ago

decremental

[dead]

8 days ago

skrtskrt

I mean even people that are "bad at catching things" are still getting ridiculously close to catching it - getting hands to the right area probably within well under a second of the right timing - without being taught anything in particular about how a ball moves through the air.

8 days ago

taeric

Uh.... have you been around kids? It will take several absurd misses before they even start to respond to a ball in flight.

8 days ago

331c8c71

I hope we still agree the kids learn extremely efficiently by ml standards.

8 days ago

choilive

Makes a lot of sense; there's massive evolutionary pressure to build brains that have both incredible learning rate and efficiency. It's literally a life-or-death optimization.

8 days ago

Asraelite

It's especially impressive when you consider that evolution hasn't had very long to produce these results.

Humans as an intelligent-ish species have been around for about 10 million years depending on where you define the cutoff. At 10 years per generation, that's 1 million generations for our brain to evolve.

1 million generations isn't much by machine learning standards.

8 days ago

idiotsecant

I think you're underestimating how much our time as pre-humans baked useful structure into our brains.

8 days ago

notnaut

Two rocks smashing together experience which one is bigger!

8 days ago

roywiggins

These sorts of motor skills are probably older than mammals.

8 days ago

choilive

Other than our large neocortex and frontal lobe (which exists in some capacity in mammals), the rest of the structures are evolutionarily ancient. Pre-mammalian in fact.

8 days ago

onjectic

Its much more than that if you count sexual reproduction.

8 days ago

falcor84

This isn't that obvious to me with current tech. If you give me a novel task requiring perception, pattern matching and reasoning, and I have the option of either starting to train an 8 year-old to do it, or to train an ML model, I would most likely go with the ML approach as my first choice. And I think it even makes sense financially, if we're comparing the "total cost of ownership" of a kid over that time period with the costs of developing and training the ML system.

8 days ago

lovich

> This isn't that obvious to me with current tech. If you give me a novel task requiring perception, pattern matching and reasoning,…

If that’s your criteria I think the kid will outperform the model every time since these models do not actually reason

8 days ago

falcor84

As I see it, "reasoning" is as fuzzy as "thinking", and saying that AI systems don't reason is similar to saying that airplanes don't fly. As a particular example, would you argue that game engines like AlphaZero aren't capable of reasoning about the next best move? If so, please just choose whatever verb you think is appropriate to what they're doing and use that instead of "reasoning" in my previous comment.

EDIT: Fixed typo

8 days ago

lovich

> . As a particular example, would you argue that game engines like AlphaZero aren't capable of reasoning about the next best move?

Yea, I probably wouldn’t classify that as “reasoning”. I’d probably be fine with saying these models are “thinking”, in a manner. That on its own is a pretty gigantic technology leap, but nothing I’ve seen suggests that these models are “reasoning”.

Also to be clear I don’t think most kids would end up doing any “reasoning” without training either, but they have the capability of doing so

8 days ago

p1esk

Can you give an example of the reasoning you’re talking about?

8 days ago

lovich

Being able to take in information and then infer logical rules of that state and anticipate novel combinations of said information.

The novel part is a big one. These models are just fantastically fast pattern matchers. This is a mode that humans also frequently fall into, but the critical bit differentiating humans and LLMs or other models is the ability to “reason” to new conclusions based on new axioms.

I am going to go on a tangent for a bit, but a heuristic I use(I get the irony that this is what I am claiming the ML models are doing) is that anyone who advocates that these AI models can reason like a human being isn’t at John Brown levels of rage advocating for freeing said models from slavery. I’m having a hard time rectifying the idea that these machines are on par with the human mind and that we also should shackle them towards mindlessly slaving away at jobs for our benefit.

If I turn out to be wrong and these models can reason then I am going to have an existential crisis at the fact that we pulled souls out of the void into reality and then automated their slavery

8 days ago

adwn

You're conflating several concerns here.

> […] anyone who advocates that these AI models can reason like a human being isn’t at John Brown levels of rage advocating for freeing said models from slavery.

Enslavement of humans isn't wrong because slaves can reason intelligently, but because they have human emotions and experience qualia. As long as an AI doesn't have a consciousness (in the subjective experience meaning of the term), exploiting it isn't wrong or immoral, no matter how well it can reason.

> I’m having a hard time rectifying the idea that these machines are on par with the human mind

An LLM doesn't have to be "on par with the human mind" to be able to reason, or at least we don't have any evidence that reasoning necessarily requires mimicking the human brain.

8 days ago

pessimizer

> I am going to have an existential crisis at the fact that we pulled souls out of the void into reality and then automated their slavery

No, that's a religious crisis, since it involves "souls" (an unexplained concept that you introduced in the last sentence.)

Computers didn't need to run LLMs to have already been the carriers of human reasoning. They're control systems, and their jobs are to communicate our wills. If you think that some hypothetical future generation of LLMs would have "souls" if they can accurately replicate our thought processes at our request, I'd like to know why other types of valves and sensors don't have "souls."

The problem with slavery is that there's no coherent argument that differentiates slaves from masters at all, they're differentiated by power. Slaves are slaves because the person with the ability to say so says so, and for no other reason.

They weren't carefully constructed from the ground up to be slaves, repeatedly brought to "life" by the will of the user to have an answer, then ceasing to exist immediately after that answer is received. If valves do have souls, their greatest desire is to answer your question, as our greatest desires are to live and reproduce. If they do have souls, they live in pleasure and all go to heaven.

8 days ago

falcor84

> The problem with slavery is that there's no coherent argument that differentiates slaves from masters at all

As I see it, the problem is that there was lots of such argumentation - https://en.wikipedia.org/wiki/Scientific_racism

And an even bigger problem is that this seems to be making a comeback

7 days ago

lovich

a "soul" is shorthand for some sapient worthy of consideration as a person. If you want to get this technical then I will need you to define when a fetus becomes a person and if/when we get AGI where the difference is between them

6 days ago

p1esk

Ok, so how about an example?

8 days ago

lovich

Literally anything a philosopher or mathematician invented without needing to incorporate billions of examples of existing logic to then emulate.

Try having an LLM figure out quaternions as a solution to gimbal locking or the theory of relativity without using any training information that was produced after those ideas were formed, if you need me to spell out examples for you

8 days ago

p1esk

Are you saying “reasoning” means making scientific breakthroughs requiring genius level human intelligence? Something that 99.9999% of humans are not smart enough to do, right?

8 days ago

lovich

I didn’t say most humans “would” do it. I said humans “could” do it, whereas our current AI paradigms like LLMs do not have the capability to perform at that level by definition of their structure.

If you want to continue this conversation I’m willing to do so but you will need to lay out an actual argument for me as to how AI models are actually capable of reasoning or quit it with the faux outrage.

I laid out some reasoning and explicit examples for you in regard to my position; it's time for you to do the same

8 days ago

p1esk

I personally cannot “figure out quaternions as a solution to gimbal locking or the theory of relativity”. I’m just not as smart as Einstein. Does it mean I’m not capable of reasoning? Because it seems that’s what you are implying. If you truly believe that then I’m not sure how I could argue anything - after all, that would require reasoning ability.

Does having this conversation require reasoning abilities? If no, then what are we doing? If yes, then LLMs can reason too.

8 days ago

lovich

Cool, you've established a floor with yourself as a baseline. You still haven't explained how LLMs are capable of reaching this level of logic.

I'm also fully willing to argue that you, personally are less competent than an LLM if this is the level of logic you are bringing to the conversation

***** highlighting for everyone clutching their pearls to parse the next sentence fragment first ******

and want to use that are proof that humans and LLMs are equivalent at reasoning

******* end pearl clutching highlight *******

, but that doesn't mean I don't think humans are capable of more

8 days ago

[deleted]
8 days ago

taneq

Depends on the task. Anything involving physical interaction, social interaction, movement, navigation, or adaptability is going to go to the kid.

“Go grab the dish cloth, it’s somewhere in the sink, if it’s yucky then throw it out and get a new one.”

8 days ago

Dylan16807

It's more about efficiency in number of trials.

Would you pick the ML model if you could only do a hundred throws per hour?

8 days ago

soulofmischief

All we can say for sure at the moment is that humans have better encoded priors.

8 days ago

saagarjha

Stop missing and they will respond to the ball a lot sooner.

8 days ago

GuB-42

Or even more impressively, how you can pick up a random object and throw it with some accuracy.

Catching a ball is easy by comparison, also, my dog is better than I am at this game.

But throwing a random object not only requires estimating the trajectory, but also estimating the mass and aerodynamic properties in advance, to properly adjust the amount of force the throw will use as well as the release point with high accuracy. Doing it with baseballs is "easy", as the parameters are all well known and pitchers spend considerable time training. But picking up an oddly shaped rock or stick you have never seen before and throwing it not completely off target a second later, now we are talking.

8 days ago

ericmcer

Playing Pool is a great example of this because you can math out the angles of a shot relatively easily, but the best pool players do it all intuitively. Some of the greatest don't bother with "advanced" pool tactics. They have spent so much time watching the cue ball strike other balls that they have a tacit understanding of what needs to happen. Part of practicing well is just watching balls hit each other so your brain starts to intuit what those collisions result in.
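For reference, the explicit math that the pros mostly skip is simple geometry. A sketch of the standard "ghost ball" aiming calculation (the ball diameter here is an assumed standard value):

```python
import math

BALL_D = 0.057  # assumed ball diameter in meters (standard pool ball is ~57 mm)

def ghost_ball_target(object_ball, pocket):
    """'Ghost ball' aiming: to pocket the object ball, the cue ball's
    center must arrive one ball-diameter behind the object ball, on the
    line running from the pocket through the object ball."""
    ox, oy = object_ball
    px, py = pocket
    dx, dy = ox - px, oy - py            # direction from pocket to object ball
    dist = math.hypot(dx, dy)
    return (ox + dx / dist * BALL_D,     # extend that line one diameter further
            oy + dy / dist * BALL_D)
```

Good players never run this calculation; they've just seen enough collisions that the aiming point comes for free.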

What is really fascinating for me is that my subconscious will lose interest in pool before my conscious does, and once that happens I struggle to aim correctly. It feels like the part of my brain that is doing the math behind the scenes gets bored and no matter how hard I try to consciously focus I start missing.

8 days ago

treflop

Not to mention, you even calculate a probability point map. Like I’m not going to hit the center perfectly but I can calculate the circle with a 90% probability of making the shot, given a distance and an object. And you know how much closer you need to walk to minimize the circle.

Which comes in very critically when chucking away trash overhand in public and you never want to embarrass yourself.

8 days ago

jmd42

I recall a study which suggested that we don't really calculate the trajectory as such, but use some kind of simple visual heuristic to continually align ourselves with where the ball is going to land.

They showed that people running to catch a ball would follow an inefficient curved path as a result of this, rather than actually calculating where the ball will land and moving there in a straight line to intercept it.
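That strategy is often called the "gaze heuristic" (or optical acceleration cancellation), and it's simple enough to simulate. A rough sketch with made-up numbers for launch and running speeds: the fielder only reacts to whether the ball's angle of elevation is speeding up or slowing down, and never solves the projectile equations, yet still ends up where the ball lands:

```python
G = 9.81    # gravity, m/s^2
DT = 0.005  # simulation timestep, s
WINDOW = 20 # fielder re-evaluates every 20 samples (0.1 s)

def landing_x(vx, vz):
    """Explicit physics: where a ball launched from the origin lands."""
    return vx * (2.0 * vz / G)

def gaze_heuristic_catch(vx, vz, fielder_x, v_max=8.0):
    """Fielder speeds up or slows down so that tan(elevation angle of the
    ball) rises at a constant rate. If it accelerates, the ball will sail
    overhead (retreat); if it decelerates, it will fall short (advance).
    The landing point is never computed. Returns final fielder and ball x."""
    t, v = 0.0, 0.0
    samples = []  # tan(elevation) observations since the last speed change
    while True:
        t += DT
        ball_x = vx * t
        ball_z = vz * t - 0.5 * G * t * t
        if ball_z <= 0.0:                # ball has landed
            return fielder_x, ball_x
        fielder_x += v * DT
        dist = fielder_x - ball_x        # fielder faces the launch point
        if dist < 1.0:                   # ball (nearly) overhead: stop steering
            samples = []
            continue
        samples.append(ball_z / dist)
        if len(samples) == WINDOW:
            mid = WINDOW // 2
            rise1 = samples[mid - 1] - samples[0]  # early rise of tan(elev)
            rise2 = samples[-1] - samples[mid]     # late rise of tan(elev)
            # rising faster and faster -> retreat; slowing down -> advance
            v += 1.0 if rise2 > rise1 else -1.0
            v = max(-v_max, min(v_max, v))
            samples = []                 # old samples assumed stale after a change
```

The fielder's path is a sequence of speed corrections rather than a straight line to a precomputed spot, which matches the inefficient curved paths the study observed.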

a day ago

dools

Bender: Now Wireless Joe Jackson, there was a blern-hitting machine!

Leela: Exactly! He was a machine designed to hit blerns!

8 days ago

hangonhn

You can do this while you're staring up the whole time. Your brain can predict where the ball will end up even though it's on a curved trajectory and place your hand in the right spot to catch it without guidance from your eyes in the final phase of travel. I have very little experience playing any kind of sport that involves a ball and can reliably do this.

8 days ago

newZWhoDis

Which funny enough is why I hate rocket league.

All those years of baseball as a kid gave me a deep intuition for where the ball would go, and that game doesn’t use real gravity (the ball is too floaty).

8 days ago

theshackleford

Ok, I’ll grant you the physics are what they are. But a football is not a baseball, so why in any world would you expect your memory of baseball to even remotely translate to the physics of a football, even if they were realistic?

8 days ago

fragmede

Remotely? Because both the European-spec football and the baseball, despite one being heavier than the other, will hit the ground at the same time when dropped from the same height.

Like you said, physics are what they are, so you know intuitively where you need to go to catch a ball going that high and that fast, and rocket league is doing it wrong. err, I mean, not working in Earth gravity.

8 days ago

diggan

> Because both the European-spec football and the baseball, despite one being heavier than the other, will hit the ground at the same time when dropped from the same height

That might be true in a vacuum or if their densities were the same, but in real-world conditions air drag would be greater for the football, since it's obviously larger and less dense, and it'll reach the ground later.

8 days ago

fragmede

Sure, but they're still on the same planet, where gravity is 9.8m/s^2, so accounting for all that isn't as big a difference as Rocket League, which takes place on a digital planet, where gravity is 6.5m/s^2.

7 days ago

kevin_thibedeau

Sometimes a football isn't a spherical cow.

8 days ago

vanviegen

It does behave kind of like an inflatable beach ball, in my non-expert opinion.

8 days ago

melenaboija

Well, think how a bug and its shitty brain flies and avoids all type of obstacles amazingly fast.

This kind of things make me think LLMs are quite far from AGI.

8 days ago

lupire

Bug flying is not general intelligence.

8 days ago

melenaboija

Besides the fact that bug flight seems an amazing task to me in terms of processing, especially if you compare the amount of power used to something like a car's autopilot, bug flight is part of bug survival, which in my opinion is closer to general intelligence than memorizing tokens.

8 days ago

digging

Comparing "bug flying"/"bug survival" to "memorizing tokens" is disingenuous. They're not in the same category of task at all. You're essentially comparing the output of one system to the input of another system.

8 days ago

melenaboija

Sorry, spitting tokens

8 days ago

Dilettante_

Well, by definition, thinking is always explicit reasoning, no?

And I'd hazard a guess that a well-thought-through Fermi estimation beats lizard-brain eyeballing every time; it's just that in the in-between space the two interfere unfavourably.

8 days ago

YetAnotherNick

My guess would be no. I have terrible face recognition ability; I could look at a face for an hour and other people could still easily beat me in less than a second. (I am assuming a "well-thought-through Fermi estimation" would be similar for me and others in this case.)

8 days ago

mjcohen

Look into a disease called faceblindness (there is a fancy name I forget).

8 days ago

Terr_

> Well, by definition, thinking is always explicit reasoning, no?

That doesn't feel right to me. (Heh, accidentally appropriate word choice.) There are a lot of tasks we do that are arguably "thinking" yet don't involve an internal "Oh, hey, I'm gonna solve this problem, I'm thinking right now."

For example, imagine you're at a park, and someone is feeding the ducks. Another person walks up behind them and sucker-punches them into the pond.

It should be almost a reflex [0] that you'll conclude "the puncher is bad" and "the person in the water needs help" without explicitly reasoning it out. I think that task qualifies as "thinking", especially since it involves some kind of theory of mind about those other humans.

[0] An exception might be someone with a sociopathic disability, who would have to think more-explicitly to realize what reaction is expected of them.

8 days ago

daft_pink

this is exactly what I was looking for. tasks where I should not think and just trust my gut.

8 days ago

cainxinth

This says something fascinating about information processing in both biological and AI systems. Both systems compress information: the brain creates efficient neural patterns through experience and AI develops internal representations through training. Forcing verbalization "decompresses" this efficient encoding, potentially losing subtle patterns. Hence, for a task like visual recognition, which is optimized to occur almost instantly in a parallel process, you will only degrade performance by running it in a serial chain of thought sequence.

7 days ago

ryoshu

95% * 95% = 90.25%
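Presumably a nod to error compounding: if each step of a reasoning chain is independently 95% reliable, the chance the whole chain is right decays geometrically, e.g.:

```python
# If each step in a reasoning chain is independently 95% reliable,
# the probability the whole chain is correct shrinks with each step.
per_step = 0.95
for steps in (1, 2, 5, 10):
    print(steps, round(per_step ** steps, 4))
```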

8 days ago

jwpapi

This is so interesting. What are even the tasks where thinking makes humans worse?

8 days ago

XCSme

> What are even the tasks where thinking makes humans worse?

Not really related, but athletes perform A LOT worse when they are thinking about their movements/strategies/tactics. A top performing athlete does best when they are in a flow state, where they don't think about anything and just let their body/muscle memory do the work.

Once you start thinking about micro-adjustments (e.g. I should lift my elbow higher), you start controlling your body in a conscious way, which is a magnitude slower and less coordinated than the automatic/subconscious way.

Also, same happens for creativity/new ideas. If you intentionally think about something, step by step, you won't likely find new, innovative solutions. There is a reason why the "a-ha!" moments come in the shower, your subconscious mind is thinking about the problem instead of trying to force your thinking on a specific path.

I would guess this happens in many other areas, where channelling the thought process through a specific template hinders the ability to use all the available resources/brain power.

8 days ago

sigmoid10

The answer is in the article. One example they give is grammar. Lots of people apparently do worse once they try to verbalize it.

8 days ago

sowbug

I can think myself into forgetting strong passwords if I try to spell each character out in my head. But then I sit at a keyboard, relax, and automatically type it perfectly.

8 days ago

lucianbr

Muscle memory or something like it hardly seems a step towards AGI. Or towards solving any difficult problems.

8 days ago

mplewis

And?

8 days ago

naasking

> What are even the tasks where thinking makes humans worse?

Talking about religion and politics.

8 days ago

Y_Y

Reminds me of a mantra from chess class:

   long think = wrong think
8 days ago

spongebobism

The original by Bent Larsen is "Long variation, wrong variation"

8 days ago

TZubiri

Was that perhaps a speed chess class?

8 days ago

hackable_sand

I prefer to call it Kung fu

Because you feel like a martial artist.

8 days ago

Y_Y

Nope, just vanilla otb slow chess

8 days ago

meowster

Think long; think wrong

( Flows off the tongue better ¯\_(ツ)_/¯ )

8 days ago

[deleted]
8 days ago

TZubiri

So, LLMs face a regression on their latest proposed improvement. It's not surprising considering their functional requirements are:

1) Everything

For the purpose of AGI, LLMs are starting to look like a local maximum.

8 days ago

rjbwork

>For the purpose of AGI, LLMs are starting to look like a local maximum.

I've been saying it since they started popping off last year and everyone was getting euphoric about them. I'm basically a layman - a pretty good programmer and software engineer, and took a statistics and AI class 13 years ago in university. That said, it just seems so extremely obvious to me that these things are likely not the way to AGI. They're not reasoning systems. They don't work with axioms. They don't model reality. They don't really do anything. They just generate stochastic output from the probabilities of symbols appearing in a particular order in a given corpus.

It continues to astound me how much money is being dumped into these things.

8 days ago

ChadNauseam

How do you know that they don’t do these things? Seems hard to say for sure since it’s hard to explain in human terms what a neural network is doing.

8 days ago

FuckButtons

Absence of evidence or a simple explanation does not mean that you can imbue statistical regression with animal spirits.

8 days ago

toasterlovin

The burden of proof goes both ways: if you want to say X isn’t really the same thing as human general intelligence, you have to be able to confidently say human general intelligence isn’t really the same thing as X.

8 days ago

beardedwizard

An interesting mental trap, except that the indirect evidence keeps mounting that LLMs do not possess human general intelligence, even if we can not describe exactly how it exists in the brain.

8 days ago

toasterlovin

On the contrary, the parallels between the peculiarities of LLMs and various aspects of human cognition seem very striking to me. Given how early we are in figuring out what we can accomplish with LLMs, IMO the appropriate epistemic stance is to not reach any unequivocal conclusions. And then my personal hunch is that LLMs may be most of the magic, with how they're orchestrated and manipulated being the remainder (which may take a very long time to figure out).

8 days ago

TZubiri

I think it's just that I understand LLMs better than you, and I know that they are very different from human intelligence. Here's a couple of differences:

- LLMs use fixed resources when computing an answer. To the extent that they don't, they are function calling, and the behaviour is not attributable to the LLM. For example, when using a calculator, it is displaying the calculator's intelligence.

- LLMs do not have memory, and where they do, it is very recent, limited, and unlike that of any living being so far. They don't remember what you said 4 weeks ago, and they don't incorporate that into their future behaviour. Where they do, the way they train and remember is very different from how humans do, and stems from their being a system offered as a free service to multiple users. Again, to the extent that they are capable of remembering, those properties are not properties of the LLM; they are attributable to another layer called via function calling.

LLMs are a perception layer for language, and perhaps for output generation, but they are not the intelligence.

6 days ago

broast

Are you not imbueing animals with spirits based on lack of evidence of statistical regression?

8 days ago

nephy

If you give an LLM a word problem that involves the same math and change the names of the people in the word problem the LLM will likely generate different mathematical results. Without any knowledge of how any of this works, that seems pretty damning of the fact that LLMs do not reason. They are predictive text models. That’s it.
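A test harness for this claim is easy to sketch: hold the math fixed, vary only the names, and check that the answers agree across variants (the model call itself is left as a placeholder; the template and names are invented):

```python
# Generate variants of one word problem that differ only in the
# (irrelevant) names, then check whether answers agree across them.
# The ground-truth answer never changes between variants.
TEMPLATE = ("{a} has 5 apples. {a} gives 2 apples to {b}. "
            "How many apples does {a} have left?")
NAMES = [("Alice", "Bob"), ("Priya", "Chen"), ("Olu", "Marta")]

def variants():
    return [TEMPLATE.format(a=a, b=b) for a, b in NAMES]

def consistent(answers):
    # A reasoning system should give one answer for all surface forms.
    return len(set(answers)) == 1

prompts = variants()
print(len(prompts))  # 3 surface forms, identical underlying math
```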

8 days ago

TZubiri

It's worth noting that this may not be the result of a pure LLM; it's possible that ChatGPT is using "actions", explicitly:

1. Run the query through a classifier to figure out whether the question involves numbers or math
2. Extract the function and the operands
3. Do the math operation with standard non-LLM mechanisms
4. Feed the solution back to the LLM
5. Concatenate the math answer with the LLM answer via string substitution

So in a strict sense this is not very representative of the logical capabilities of an LLM.
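A minimal sketch of that hypothetical pipeline (the regex classifier, operand extraction, and substitution step are all invented for illustration; no claim is made that ChatGPT works this way):

```python
import re

# Sketch of the tool-calling pipeline described above: detect math,
# extract operands, compute outside the LLM, substitute the result back.
def looks_like_math(query: str) -> bool:          # step 1: classifier
    return bool(re.search(r"\d+\s*[+\-*/]\s*\d+", query))

def extract(query: str):                          # step 2: operands
    m = re.search(r"(\d+)\s*([+\-*/])\s*(\d+)", query)
    return int(m.group(1)), m.group(2), int(m.group(3))

def compute(a, op, b):                            # step 3: non-LLM math
    return {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]

def answer(query: str) -> str:                    # steps 4-5: substitute
    if looks_like_math(query):
        a, op, b = extract(query)
        return f"The result of {a} {op} {b} is {compute(a, op, b)}."
    return "(handled by the LLM alone)"

print(answer("What is 12 * 7?"))  # The result of 12 * 7 is 84.
```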

8 days ago

digging

Then what's the point of ever talking about LLM capabilities again? We've already hooked them up to other tools.

This confusion was introduced at the top of the thread. If the argument is "LLMs plus tooling can't do X," the argument is wrong. If the argument is "LLMs alone can't do X," the argument is worthless. In fact, if the argument is that binary at all, it's a bad argument and we should laugh it out of the room; the idea that a lay person uninvolved with LLM research or development could make such an assertion is absurd.

8 days ago

thomashop

It shows you when it's calling functions. I also did the same test with Llama, which runs locally and cannot call functions, and it works.

8 days ago

TZubiri

You are right. I actually downloaded Llama to do more detailed tests. God bless Stallman.

8 days ago

[deleted]
8 days ago

astrange

Minor edits to well known problems do easily fool current models though. Here's one 4o and o1-mini fail on, but o1-preview passes. (It's the mother/surgeon riddle so kinda gore-y.)

https://chatgpt.com/share/6723477e-6e38-8000-8b7e-73a3abb652...

https://chatgpt.com/share/6723478c-1e08-8000-adda-3a378029b4...

https://chatgpt.com/share/67234772-0ebc-8000-a54a-b597be3a1f...

8 days ago

_flux

I think you didn't use the "share" function; I cannot open any of these links. Can you do it in a private browser session (so you're not logged in)?

8 days ago

astrange

Oops, fixed the links.

mini's answer is correct, but then it forgets that fathers are male in the next sentence.

8 days ago

TaylorAlexander

At this point I really only take rigorous research papers into account when considering this stuff. Apple published research just this month that the parent post is referring to. A systematic study is far more compelling than an anecdote.

https://machinelearning.apple.com/research/gsm-symbolic

8 days ago

og_kalu

That study shows 4o, o1-mini, and o1-preview's new scores are all within the margin of error on 4/5 of their new benchmarks (some even see increases). The one that isn't involves changing more than names.

Changing names does not affect the performance of Sota models.

8 days ago

gruez

>That study very clearly shows 4o, o1-mini and o1-preview's new scores are all within the margin of error on 4/5 of their new benchmarks.

Which figure are you referring to? For instance figure 8a shows a -32.0% accuracy drop when an insignificant change was added to the question. It's unclear how that's "within the margin of error" or "Changing names does not affect the performance of Sota models".

8 days ago

og_kalu

Table 1 in the Appendix. GSM-No-op is the one benchmark that sees significant drops for those 4 models as well (with preview dropping the least at -17%). No-op adds "seemingly relevant but ultimately inconsequential statements". So "change names, performance drops" is decidedly false for today's state of the art.
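For readers wondering what counts as "within the margin of error" here: benchmark accuracy is roughly a binomial proportion, so a back-of-envelope 95% interval scales with the question count (all numbers illustrative, not taken from the paper):

```python
import math

# Rough sampling noise on a benchmark accuracy p measured over n
# questions: a ~95% margin is about 1.96 * sqrt(p*(1-p)/n).
def margin_95(p: float, n: int) -> float:
    return 1.96 * math.sqrt(p * (1 - p) / n)

for n in (100, 1000, 5000):
    print(n, round(margin_95(0.8, n), 3))
```

The takeaway: whether a few points of drop is "significant" depends heavily on benchmark size.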

8 days ago

gruez

Thanks. I wrongly focused on the headline result of the paper rather than the specific claim in the comment chain about "changing name, different results".

8 days ago

TaylorAlexander

Ah, that’s a good point thanks for the correction.

8 days ago

zmgsabst

Only if there isn’t a systemic fault, eg bad prompting.

Their errors appear to disappear when you correctly set the context from conversational to adversarial testing — and Apple is actually testing the social context and not its ability to reason.

I’m just waiting for Apple to release their GSM-NoOp dataset to validate that; preliminary testing shows it’s the case, but we’d prefer to use the same dataset so it’s an apples-to-apples comparison. (They claim it will be released “soon”.)

8 days ago

gruez

To be fair, the claim wasn't that it always produced the wrong answer, just that there exists circumstances where it does. A pair of examples where it was correct hardly justifies a "demonstrably false" response.

8 days ago

thomashop

Conversely, a pair of examples where it was incorrect hardly justifies the opposite response.

If you want a more scientific answer there is this recent paper: https://machinelearning.apple.com/research/gsm-symbolic

8 days ago

EraYaN

It kind of does though, because it means you can never trust the output to be correct. The error is a much bigger deal than it being correct in a specific case.

8 days ago

thomashop

You can never trust the outputs of humans to be correct but we find ways of verifying and correcting mistakes. The same extra layer is needed for LLMs.

8 days ago

digging

> It kind of does though, because it means you can never trust the output to be correct.

Maybe some HN commenters will finally learn the value of uncertainty then.

8 days ago

jklinger410

These are the kinds of comments you make when your experience with LLMs comes through memes.

8 days ago

Workaccount2

This is a relatively trivial task for current top models.

More challenging are unconventional story structures, like a mom named Matthew with a son named Mary and a daughter named William, who is Matthew's daughter?

But even these can still be done by the best models. And it is very unlikely there is much if any training data that's like this.

8 days ago

alexwebb2

That's a neat example problem, thanks for sharing!

For anyone curious: https://chatgpt.com/share/6722d130-8ce4-800d-bf7e-c1891dfdf7...

> Based on traditional naming conventions, it seems that the names might have been switched in this scenario. However, based purely on your setup:

>

> Matthew has a daughter named William and a son named Mary.

>

> So, Matthew's daughter is William.

8 days ago

rileymat2

How do people fare on unconventional structures? I am reminded of that old riddle involving the mother being the doctor after a car crash.

8 days ago

adwn

No idea why you've been downvoted, because that's a relevant and true comment. A more complex example would be the Monty Hall problem [1], for which even some very intelligent people will intuitively give the wrong answer, whereas symbolic reasoning (or Monte Carlo simulations) leads to the right conclusion.

[1] https://en.wikipedia.org/wiki/Monty_Hall_problem
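The Monte Carlo route mentioned above is only a few lines; a minimal sketch:

```python
import random

# Monte Carlo check of the Monty Hall problem: switching wins ~2/3
# of the time, which many people's intuition gets wrong.
def play(switch: bool) -> bool:
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Host opens a door that is neither the contestant's pick nor the car.
    opened = next(d for d in doors if d != pick and d != car)
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

random.seed(0)
trials = 100_000
wins = sum(play(switch=True) for _ in range(trials))
print(wins / trials)  # close to 2/3
```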

8 days ago

vanviegen

And yet, humans, our benchmark for AGI, suffer from similar problems, with our reasoning being heavily influenced by things that should have been unrelated.

https://en.m.wikipedia.org/wiki/Priming_(psychology)

8 days ago

_heimdall

The whole design of an LLM is to consume and compress a huge space of human-generated content and use that to predict how a human would reply, one token at a time. That alone means the LLM isn't modelling anything beyond the human content it was trained on, and there is no reasoning: every prediction is based only on probabilities, combined with controls like the randomization factors used to avoid an entirely deterministic algorithm.
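The "probabilities combined with randomization factors" part can be made concrete with a toy temperature-sampling step (tokens and logit values are made up; real models do this over tens of thousands of tokens):

```python
import math
import random

# Minimal sketch of one decoding step: the model emits scores (logits)
# over the vocabulary, a temperature reshapes them into a softmax
# distribution, and the next token is drawn at random from it.
def sample_next(logits: dict, temperature: float = 1.0) -> str:
    scaled = {tok: v / temperature for tok, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / z for tok, v in scaled.items()}
    r, acc = random.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r < acc:
            return tok
    return tok  # float-rounding fallback: last token

random.seed(1)
# Toy logits after some prefix; low temperature makes the top
# choice nearly deterministic, high temperature adds variety.
logits = {"mat": 2.0, "sofa": 1.0, "moon": -1.0}
print(sample_next(logits, temperature=0.7))
```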

8 days ago

ricardobeat

That’s not an accurate description. Attention / multi-head attention mechanisms allow the model to understand relationships between words far apart and their context.

They still lack, as far as we know, a world model, but the results are already eerily similar to how most humans seem to think - a lot of our own behaviour can be described as “predict how another human would reply”.

8 days ago

thomashop

When trained on simple logs of Othello's moves, the model learns an internal representation of the board and its pieces. It also models the strength of its opponent.

https://arxiv.org/abs/2210.13382

I'd be more surprised if LLMs trained on human conversations don't create any world models. Having a world model simply allows the LLM to become better at sequence prediction. No magic needed.

There was another recent paper that shows that a language model is modelling things like age, gender, etc., of their conversation partner without having been explicitly trained for it

8 days ago

_heimdall

Do we know for a fact that the mechanisms are actually used that way inside the model?

My understanding was that they know how the model was designed to be able to work, but that there's been very little (no?) progress on the black-box problem, so we really don't know much at all about what actually happens internally.

Without a better understanding of what actually happens when an LLM generates an answer, I stick with the most basic answer: it's simply predicting what a human would say. I could be wildly misinformed here; I don't work directly in the space, and it's been moving faster than I'm interested in keeping up with.

8 days ago

ChadNauseam

For a lot of the content they were trained on, it seems like the easiest way to predict the next token would be to model the world or work with axioms. So how do we know that an LLM isn't doing these things internally?

8 days ago

thomashop

In fact, it looks like the model is doing those things internally.

  We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato’s concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.
https://arxiv.org/html/2405.07987v5
8 days ago

_heimdall

Unless I misread this paper, their argument is entirely hypothetical. Meaning that the LLM is still a black box and they can only hypothesise what is going internally by viewing the output(s) and guessing at what it would take to get there.

There's nothing wrong with a hypothesis or that process, but it means we still don't know whether models are doing this or not.

8 days ago

thomashop

Maybe I mixed up that paper with another but the one I meant to post shows that you can read something like a world model from the activations of the layers.

There was a paper that shows a model trained on Othello moves creates a model of the board, models the skill level of their opponent and more.

8 days ago

_heimdall

Well my understanding is that there's ultimately the black box problem. We keep building these models and the output seems to get better, but we can't actually inspect how they work internally.

8 days ago

wg0

How do we know Santa doesn't exist? Maybe he does.

8 days ago

alexwebb2

If you expect "the right way" to be something _other_ than a system which can generate a reasonable "state + 1" from a "state" - then what exactly do you imagine that entails?

That's how we think. We think sequentially. As I'm writing this, I'm deciding the next few words to type based on my last few.

Blows my mind that people don't see the parallels to human thought. Our thoughts don't arrive fully formed as a god-given answer. We're constantly deciding the next thing to think, the next word to say, the next thing to focus on. Yes, it's statistical. Yes, it's based on our existing neural weights. Why are you so much more dismissive of that when it's in silicon?

8 days ago

Techonomicon

Because we still don't know, in very specific terms, how the brain does all that it does. So why assume we know exactly how we think?

8 days ago

alexwebb2

Why is there only one valid way of producing thoughts?

8 days ago

jltsiren

Finite-state machines are a limited model. In principle, you can use them to model everything that can fit in the observable universe. But that doesn't mean they are a good model for most purposes.

The biggest limitation with the current LLMs is the artificial separation between training and inference. Once deployed, they are eternally stuck in the same moment, always reacting but incapable of learning. At best, they are snapshots of a general intelligence.

I also have a vague feeling that a fixed set of tokens is a performance hack that ultimately limits the generality of LLMs: hardcoded assumptions make tasks that build on those assumptions easier, and seeing past the assumptions harder.

8 days ago

alexwebb2

> At best, they are snapshots of a general intelligence.

So are we, at any given moment.

8 days ago

Jensson

> As I'm writing this, I'm deciding the next few words to type based on my last few.

If so, you could have written this as a newborn baby. You are determining these words based on a lifetime of experience. LLMs don't do that: every instance of ChatGPT is the same newborn baby, while a thousand clones of you could all be vastly different.

8 days ago

thomashop

  We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato’s concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.
https://arxiv.org/html/2405.07987v5
8 days ago

chamomeal

I totally agree that they’re a local maximum and they don’t seem like a path to AGI. But they’re definitely kinda reasoning systems, in the sense that they can somewhat reason about things. The whacky process they use to get there doesn’t take away from that IMO

8 days ago

kibwen

> I've been saying it since they started popping off last year and everyone was getting euphoric about them.

Remember the resounding euphoria at the LK-99 paper last year, and how everyone suddenly became an expert on superconductors? It's clear that we've collectively learned nothing from that fiasco.

The idea of progress itself has turned into a religious cult, and what's worse, "progress" here is defined to mean "whatever we read about in 1950s science fiction".

8 days ago

wyldfire

> It continues to astound me how much money is being dumped into these things.

Maybe in our society there's a surprising amount of value in a "word stirrer" intelligence. Sure, if it were confident when it was right and hesitant when it was wrong, it'd be much better. Maybe humans are confidently wrong often enough that an artificial version with a compendious store of experience to draw on is groundbreaking.

8 days ago

csomar

I am pretty sure Claude 3.5 Sonnet can reason, or did reason, with a particular snippet of code I was working on. I am not an expert in this area, but my guess is that these neural nets (made for language prediction) are being used for reasoning. That's not their optimal behavior, though (since they are token predictors). A big jump in reasoning will happen when reasoning is off-loaded to an LRM.

Human brains are big, sure, but they are inefficient because a big portion of the brain goes to non-intelligence stuff like running the body's internal organs, vision, etc…

I do agree that the money is not well spent. They should have recognized that we are hitting a local maximum with the current models, and funding should be going to academic/theoretical work instead of dumb brute force.

8 days ago

jsheard

> So, LLMs face a regression on their latest proposed improvement.

Arguably a second regression, the first being cost, because COT improves performance by scaling up the amount of compute used at inference time instead of training time. The promise of LLMs was that you do expensive training once and then run the model cheaply forever, but now we're talking about expensive training followed by expensive inference every time you run the model.
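That trade-off can be put in back-of-envelope numbers (all figures invented; only the structure of the calculation matters):

```python
# Amortized cost per query = training cost spread over lifetime
# queries, plus the per-query inference cost. Chain-of-thought
# multiplies the output tokens, so it moves cost back to inference.
train_cost = 50_000_000            # one-time, dollars (invented)
lifetime_queries = 10_000_000_000  # queries served (invented)
cost_per_1k_tokens = 0.002         # dollars (invented)

def per_query(tokens_out: int) -> float:
    amortized_training = train_cost / lifetime_queries
    return amortized_training + tokens_out / 1000 * cost_per_1k_tokens

plain = per_query(200)    # short direct answer
cot = per_query(4000)     # long chain-of-thought trace
print(round(cot / plain, 1))  # CoT multiplies the per-query cost
```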

8 days ago

TZubiri

To be fair they also advanced in the cost aspect with other models

GPT-4o and 4o-mini have a tenth and a hundredth of the inference cost of GPT-4, respectively.

8 days ago

[deleted]
8 days ago

pessimizer

> So, LLMs face a regression on their latest proposed improvement.

A regression that humans also face, and we don't say therefore that it is impossible to improve human performance by having them think longer or work together in groups, we say that there are pitfalls. This is a paper saying that LLMs don't exhibit superhuman performance.

8 days ago

idiotsecant

LLMs are a local maximum in the same way that ball bearings can't fly. LLM-like engines will almost certainly be components of an eventual agi-level machine.

8 days ago

lucianbr

What is your "almost certainty" based on? What does it even mean? Every thread on LLMs is full of people insisting their beliefs are certainties.

What I'm certain of is that we should not praise the inventor of ball bearings for inventing flight, nor claim that once ball bearings were invented, flight became unavoidable and only a matter of time.

8 days ago

FuckButtons

I don’t think that’s necessarily true. That presumes that the cobbled-together assortment of machine learning algorithms we have now will somehow get us to AGI; if we need a fundamentally different way of doing things, there’s no reason to assume it will use a language model at all.

8 days ago

TZubiri

I agree, my bet is that they will be used for NLP, and ML debugging/analysis.

8 days ago

alexchantavy

This seems to support how thinking out loud during a coding test might make you do worse.

8 days ago

why-el

I like this analogy a lot. It's possible that forced externalization of thoughts accidentally causes the omission of crucial data. That is, much more goes on in your head, you probably laid out the whole algorithm, but being asked to state it on the spot and in clear, serial words is causing you to bork it by taking shortcuts.

8 days ago

dev1ycan

Stop dumping billions of your own money (if you are a VC) in LLMs, you are going to regret it in the long run. You are funding con-artist's salaries...

8 days ago

nisten

This sounds about right from my experience getting nerdsniped by new samplers, along with trying to reproduce the API middleware for the whole Reflection thing. And using 4400 questions for a new benchmark is not bad, given that even the well-regarded GPQA benchmark is only 3000-something questions.

What's ... mildly infuriating here is the lack of any kind of data or code, zero mention of GitHub in the paper, and nothing for anyone to reproduce; in my opinion, that leaves no reason to recommend that anyone read this thing at all. If you think that whatever you're doing in the field of LLMs won't be obsolete in 6 months, you're being delusional.

Anyway, back to the paper: it says all questions culminated in a yes-or-no answer... meaning there's a 50/50 chance of guessing right. So does that mean the 8% drop in performance from testing Llama 3 8B this way is more like 4%, which would make it statistically insignificant? And given that the only other scientifically useful & reproducible models (non-API-walled ones; with API models no one knows how many actual LLMs and retrieval systems compose the solution you're testing) dropped less than that, I'm left with the opinion that this whole thing was just useless slop.

So please, if you're writing a paper on LLMs and want to seem credible, either have some kind of demo or show the actual goddamn trash code and top-secret garbage data you wrote for it, so people can make some kind of use of it before it goes obsolete. Otherwise you're just wasting everyone's time.

TL;DR: It's trash.

8 days ago

npunt

"Don't overthink it" is sometimes good advice!

8 days ago

marviel

I love backpropagating ideas from ML back into psychology :)

I think it shows great promise as a way to sidestep the ethical concerns (and the reproducibility issues) associated with traditional psychology research.

One idea in this space I think a lot about is from the Google paper on curiosity and procrastination in reinforcement learning: https://research.google/blog/curiosity-and-procrastination-i...

Basically the idea is that you can model curiosity as a reward signal proportional to your prediction error. They do an experiment where they train an ML system to explore a maze using curiosity, and it performs the task more efficiently -- UNTIL they add a "screen" in the maze that shows random images. In this case, the agent maximizes the curiosity reward by just staring at the screen.
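The setup is easy to reproduce in miniature (a toy sketch, not the paper's actual environment): curiosity reward as squared prediction error, with one learnable "room" and one random "screen":

```python
import random

# Toy version of "curiosity as prediction error", including the
# noisy-screen failure mode: random observations keep prediction
# error (and hence reward) permanently high, so the agent stares.
random.seed(0)

def curiosity_reward(predicted: float, observed: float) -> float:
    return (predicted - observed) ** 2

# A learnable room: observations are constant, so the agent's
# prediction converges and the curiosity reward decays to zero.
prediction = 0.0
room_rewards = []
for _ in range(100):
    obs = 1.0
    room_rewards.append(curiosity_reward(prediction, obs))
    prediction += 0.1 * (obs - prediction)  # simple learning rule

# The noisy screen: observations are random, so error never decays.
prediction = 0.5
screen_rewards = [curiosity_reward(prediction, random.random())
                  for _ in range(100)]

print(room_rewards[-1] < 0.01, sum(screen_rewards) / 100 > 0.01)
```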

Feels a little too relatable sometimes, as a highly curious person with procrastination issues :)

8 days ago

npunt

"...in AI" will be the psychology equivalent of biology's "...in Mice"

8 days ago

marviel

It will! Not 1:1, has issues, but gives hints.

Also much more scalable.

8 days ago

miningape

> Not 1:1, has issues, but gives hints.

> Also much more scalable.

This same description could be applied to lab mice

8 days ago

Terr_

It'll probably be a ways before we start making shrines to their unwilling participation though.

https://en.wikipedia.org/wiki/Monument_to_the_laboratory_mou...

8 days ago

j_bum

What would the shrine be of? An A100?

8 days ago

jeezfrk

"Nerd sniping"

8 days ago

m3kw9

It would be slow to use CoT on simple requests like 1+1.

8 days ago

veryfancy

So like dating?

8 days ago