DeepThought-8B: A small, capable reasoning model

134 points
13 days ago
by AnhTho_FR

Comments


tkgally

There's been a rush of releases of reasoning models in the past couple of weeks. This one looks interesting, too.

I found the following video from Sam Witteveen to be a useful introduction to a few of those models:

https://youtu.be/vN8jBxEKkVo

10 days ago

CGamesPlay

In what way did they "release" this? I can't find it on Hugging Face or Ollama, and they only seem to have a "try online" link in the article. "Self-sovereign intelligence", indeed.

10 days ago

wongarsu

They released it in the same sense that OpenAI released GPT-4. There is an online demo you can chat with, and a form to get in touch with sales for API access.

10 days ago

underlines

they didn't

10 days ago

tanakai24

Legally, you cannot name Llama 3 based models like that. You have to use "Llama" in the name:

https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct/blo...

10 days ago

alexvitkov

Too bad :)

Facebook trained the model on an Internet's worth of copyrighted material without any regard for licenses whatsoever - even if model weights are copyrightable, which is an open question, you're doing the exact same thing they did. Probably not a bulletproof legal defense though.

10 days ago

tourmalinetaco

At least Zuck had the decency to release model weights, unlike these worthless clowns.

9 days ago

littlestymaar

Can't wait until Meta sue them so we can have a judgment on whether or not models weights are subject to copyright.

10 days ago

euroderf

Model weights are (abstractly speaking) a very intensive, concentrated form of website scraping, yes ?

What does the (USA) law say about scraping ? Does "fair use" play a role ?

10 days ago

ranger_danger

Yes, and there have already been court cases ruling that AI training on copyrighted data is fair use, because it's technically no different from any other form of art: everything is based on seeing other ideas elsewhere, and there are no new ideas anymore.

10 days ago

jb_briant

Am I wrong to think that "reasoning model" is a misleading marketing term?

Isn't it a LLM with an algo wrapper?

10 days ago

viraptor

Whether you bake the behaviour in or wrap it in an external loop, you need to train/tune the expected behaviour. Generic models can do chain of thought if asked for, but will be worse than the specialised one.

10 days ago

benchmarkist

They're not baking anything in. Reasoning, as it is defined by AI marketing departments, is just beam search.

10 days ago

jb_briant

Could you educate me on what beam search is? Or link a good resource?

EDIT: https://www.width.ai/post/what-is-beam-search

So the wider the beam, the better the outcome?

Yep, no reasoning, just a marketing term to say "more accurate probabilities"

10 days ago

benchmarkist

Beam search just traverses different paths and assigns each path a probability of being correct. The paths with the higher probabilities are kept and the ones with lower probabilities are pruned until the search terminates with an "answer". The marketing department calls it "reasoning" and "test-time compute" because the average consumer does not care whether it's beam search or something else.

Your link seems to do a good job of explaining beam search but it's a classic algorithm in state space exploration so most books on search algorithms and discrete optimization will have a section about it.¹

1: https://books.google.com/books?id=QzGuHnDhvZIC&q=%22beam%20s...
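For the curious, here is a minimal sketch of the idea in Python. The expand() function is a hypothetical placeholder for whatever proposes continuations and their probabilities (in an LLM it would be the next-token distribution); this illustrates the algorithm, it is not code from any actual product.

  import math

  def beam_search(start, expand, beam_width=3, max_steps=10):
      # Generic beam search: at each step keep only the beam_width
      # highest-scoring partial paths and prune the rest.
      beams = [(0.0, [start])]  # (cumulative log-probability, path)
      for _ in range(max_steps):
          candidates = []
          for log_p, path in beams:
              # expand(path) yields (next_state, probability) continuations
              for nxt, p in expand(path):
                  candidates.append((log_p + math.log(p), path + [nxt]))
          if not candidates:
              break
          candidates.sort(key=lambda c: c[0], reverse=True)
          beams = candidates[:beam_width]  # prune to the top beam_width paths
      return max(beams, key=lambda c: c[0])[1]  # best-scoring path found

A wider beam keeps more of the search tree alive at each step, at proportionally higher compute cost.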

10 days ago

radarsat1

An algorithm that searches for the highest probability answer is not "reasoning"? "Search" has been a fundamental building block of GOFAI since the beginning. How do you define reasoning? Can you justify it being different from the last 70 years of thought on the topic?

10 days ago

jb_briant

Since you asked, I define reasoning as cambridge does:

Reasoning "the process of thinking about something in order to make a decision"

Thinking: "the activity of using your mind to consider something"

Mind: "the part of a person that makes it possible for him or her to think, feel emotions, and understand things"

I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"

I also think that the definition of Mind of Cambridge is incomplete and lacks the creativity part along with cognition and emotions. But it's a vastly different topic.

10 days ago

andai

>I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"

I recall reading about a theory of neurology where many thoughts (neural circuits) fired simultaneously and then one would "win" and the others got suppressed. The closest thing I can find right now is Global Workspace Theory.

10 days ago

Lerc

That seems a bit of an odd definition.

A: Doing it with B.

B: Using C to do it.

C: The part that does B

Without defining "your", "consider", "person", "thinking", "feel", and "understand", it could be anything.

There's more than enough leeway in those undefineds to subjectively choose whatever you want.

10 days ago

nuancebydefault

I would rather call it circular reasoning, a thing humans are very capable of.

10 days ago

dboreham

Enter the Chinese Room..

10 days ago

jb_briant

What do you mean please?

10 days ago

whitten

The Chinese Room is a thought experiment: a room that seems to contain a "Chinese speaker", but when given a text to 'understand' it actually just looks the text up in a huge collection of words inside until it finds a response, and then outputs that response as its reply.

10 days ago

amyfp214

>So the wider the beam, the better the outcome?

I looked into it, and this "beam search" is nothing but a bit of puffed-up nomenclature, not unlike the shock and awe of learning a language such as Java that introduces synonyms for common terms for no apparent reason, and not unlike the intimidating name of "Bonferroni multiple test correction", which is just a (1/n) division operation.

"Beam search" is breadth-first search. Instead of taking all the child nodes at a layer, it takes the top <n> according to some heuristic. But "top n" wasn't enough for whoever cooked up that trivial algorithm, so instead it's "beam width". It probably has more complexities in AI where that particular heuristic becomes more mathematical and complex, as heuristics tend to do.

9 days ago

benchmarkist

AI marketing departments are fond of anthropomorphic language but it's actually just regular beam search.

10 days ago

[deleted]
10 days ago

JTyQZSnP3cQGa8B

The same way they now call "open-source" a completely closed-source binary blob full of copyright infringement.

10 days ago

Kiro

"reasoning model" means nothing so I don't think it's misleading.

10 days ago

astrobe_

Reasoning means "inference" or "deduction" to me, or at least some process related to first order logic.

10 days ago

nyrikki

The known upper bound on transformers' on-the-fly computation abilities is a complexity class called DLOGTIME-uniform TC^0.

There is a lot to unpack there, but if you take FO as first-order logic closed under conjunction (∧), negation (¬), and universal quantification (∀), you will find that DLOGTIME-uniform TC^0 is equal to FO plus majority (FOM).

So be careful about that distinction.

To help break the above down:

DLOGTIME = constructible by a RAM or TM in logarithmic time.

uniform = a single procedure constructs the circuit for every input size (circuit families being the default convention).

TC^0 = constant-depth threshold circuits.

Even NP == SO-E, the second-order queries where the second-order quantifiers are only existentials (Fagin's theorem).

DLOGTIME-uniform TC^0 is a WAY smaller class than most people realize, but anything that is an algorithm or a program is basically logic, with P being FO + least fixed point (on ordered structures) or half a dozen other known characterizations.

Transformers can figure out syntax, but if you dig into that DLOGTIME part, you will see that semantic correctness isn't really an option... thus the need to leverage the pattern matching and retrieval of pre-training as much as possible.
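For reference, the standard descriptive-complexity correspondences being invoked here, stated informally (the NL and P results assume ordered structures):

  DLOGTIME-uniform TC^0 = FO + majority quantifiers (FOM)
  NP = existential second-order logic (Fagin's theorem)
  NL = FO + transitive closure
  P  = FO + least fixed point (Immerman–Vardi)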

10 days ago

andai

Thanks. If I'm reading this right, the limiting factor on the intelligence of current LLMs is not the network size, nor training data (size/quality) but rather the architecture? Do we know of a better one for complex computations / "reasoning"?

10 days ago

codetrotter

Given the name they gave it, someone with access should ask it for the “Answer to the Ultimate Question of Life, The Universe, and Everything”

If the answer is anything other than a simple "42", I will be thoroughly disappointed. (The answer has to be just "42", not a bunch of text about the Hitchhiker's Guide to the Galaxy and all that.)

10 days ago

vintermann

Deep Thought didn't answer right away either.

10 days ago

lowbloodsugar

“Right away”. lol.

10 days ago

asah

"what is the population of manhattan below central park"

ChatGPT-o1-preview: 647,000 (based on 2023 data, breaking it down by community board area): https://chatgpt.com/share/674b3f5b-29c4-8007-b1b6-5e0a4aeaf0... (this appears to be the most correct, judging from census data)

DeepThought-8B: 200,000 (based on 2020 census data)

Claude: 300,000-350,000

Gemini: 2.7M during peak times (strange definition of population!)

I followed up with DeepThought-8B: "what is the population of all of manhattan, and how does that square with only having 200,000 below CP" and it cut off its answer, but in the reasoning box it updated its guess to 400,000 by estimating as a fraction of land area.

10 days ago

igleria

I asked it "Describe how a device for transportation of living beings would be able to fly while looking like a sphere" and it just never returned an output

10 days ago

Timwi

I asked it to just count letters in a long word and it never returned an output (been waiting for 30 minutes now)

10 days ago

m3kw9

It isn’t pleased you ask it such questions

10 days ago

ConspiracyFact

Blaine is a pain

10 days ago

nyoomboom

The reasoning steps look reasonable and the interface is simple and beautiful, though DeepThought-8B fails to disambiguate the term "the ruliad" (the technical concept from Wolfram physics) from this company's name, Ruliad. Maybe that isn't in the training data, because it misunderstood the problem when asked "what is the simplest rule of the ruliad?" and went on to reason about the company's core principles. Cool release, waiting for the next update.

10 days ago

segalord

Xd, gotta love how your first question to test a model is about a "ruliad". It's not even in my iOS dictionary.

10 days ago

euroderf

I am very impressed. I asked chat.ruliad.co

  Beginning from the values for fundamental physical constants, is it possible to derive the laws of entropy ?
and then based on its response to that I asked it

  Based on this analysis, can you identify and describe where the dividing line is between (a) the region where (microscopic/atomic) processes are reversible, and (b) the region where macroscopic processes are irreversible ?
10 days ago

nyrikki

That will be very common in the corpus, thus pattern matching and retrieval more than 'reasoning'.

Not that that isn't useful, it's just expected behavior for an LLM these days.

I would be interested in the output, because it is almost certainly the more common but fundamentally incorrect intro-class half-truth, which is useful for the typical curriculum but not near the frontier of current understanding.

10 days ago

rkagerer

For those of us who can't try it out ourselves, could you post the results somewhere and share a link?

10 days ago

euroderf

It's not giving me a link, and its responses are multi-step and wordy. I can post some pullquotes...

Yes, it is possible to derive the laws of entropy from the fundamental physical constants. Entropy is a measure of the number of possible microscopic configurations of a system, and it is related to the fundamental physical constants through the Boltzmann constant (k_B) and the laws of thermodynamics.

The second law of thermodynamics states that the total entropy of an isolated system can only increase over time, or remain constant if the system is in equilibrium. This law can be derived from the fundamental physical constants and the microscopic behavior of particles in a system.

One way to derive the laws of entropy is by considering the statistical mechanics of a system. In statistical mechanics, entropy is related to the number of possible microscopic configurations (microstates) of a system, which can be described by the Boltzmann constant (k_B). The entropy (S) of a system is given by the Boltzmann formula:

S = k_B * ln(Ω)

where Ω is the number of possible microstates.

The fundamental physical constants, such as the speed of light (c), the gravitational constant (G), and the Planck constant (h), are related to the other constants through various equations. For example, the Boltzmann constant (k_B) can be expressed in terms of the Planck constant (h), the speed of light (c), and the gravitational constant (G):

k_B = (h * P / 2π) * (c^2 / G)

where P is the pressure.

By using these relationships and considering the microscopic behavior of particles in a system, it is possible to derive the laws of entropy from the fundamental physical constants. However, it is important to note that entropy is a macroscopic property, and its laws are not directly derived from the fundamental physical constants but rather from the microscopic behavior of particles and the laws of thermodynamics.

[Q: Based on this analysis, can you identify and describe where the dividing line is between (a) the region where (microscopic/atomic) processes are reversible, and (b) the region where macroscopic processes are irreversible ?]

The dividing line between the region where microscopic/atomic processes are reversible and the region where macroscopic processes are irreversible lies in the scale at which the processes occur and the interactions between particles.

At the microscopic level (a), processes are generally considered reversible because the interactions between individual particles can be easily reversed. For example, when two particles collide, they can exchange momentum and energy, and the process can be considered reversible. In this region, the system is in thermodynamic equilibrium, and the second law of thermodynamics does not apply.

However, as we move to the macroscopic level (b), the number of particles involved in the processes increases, and the interactions between them become more complex. In this region, the processes are considered irreversible because the energy and momentum exchanges between particles lead to an increase in entropy. This increase in entropy is a result of the second law of thermodynamics, which states that the total entropy of an isolated system can only increase over time or remain constant if the system is in equilibrium.

The dividing line between reversible and irreversible processes can be identified by examining the scale at which the system operates and the interactions between particles. At the microscopic level, processes are generally reversible, while at the macroscopic level, processes are irreversible due to the increase in entropy resulting from the interactions between a large number of particles.

-end-

10 days ago

chvid

Is the source code available for this? And who is behind the company?

10 days ago

alexvitkov

Please don't call model weights source code. Code is something you can (usually) read and understand, if anything weights are closer to a very obfuscated compiled binary - although even that can be analyzed by a good enough reverse engineer.

"Open Source" models are the latest in a long series of attempts to take advantage of FOSS's reputation without actually releasing something that adheres to the principles.

10 days ago

littlestymaar

To run a language model you need both the model weights and the source code of the implementation!

10 days ago

rkagerer

I would think truly open source means also including the initial training data. i.e. Everything needed to build[/tweak/customize] the thing from the ground up.

10 days ago

littlestymaar

I agree, but I'm responding to someone confusing source code and model weights…

9 days ago

exe34

llama.cpp is open source!

10 days ago

alexvitkov

Neat! But if we're sticking to the source code analogy, the matrix multiplier probably maps closer to a CPU or an interpreter - I wouldn't say every Java program is open source because you have an open-source JVM.

10 days ago

exe34

does that make every image closed source because you need a viewer to view them?

10 days ago

hn_acker

At least if you use the Free Software Foundation's definition of free software, one requirement is access to the unobfuscated source code. If every copy of a free-as-in-speech program binary's source code were deleted from existence by a god then the program would become proprietary.

I would say that every image using a proprietary file format is proprietary. If the creator of such an image intends for the image to be otherwise free as in speech (any receiver can inspect, modify, redistribute with and without modification for any purpose), then the image can be made free if the image creator converts the image to a free format.

41 minutes ago

alexvitkov

Yes, which is why if you're serious you will exclusively use the terminal to protect your computer from even the passing chance of being exposed to a so called "image file".

In all seriousness, compilation and model training are lossy processes and erase a lot of the context needed to understand the output (and with model training we don't fully understand it even with access to the training data). Images aren't necessarily derivative of anything, so the analogy breaks down here.

10 days ago

exe34

often images are not only lossily compressed, but they are a composite of many images/layers/image transformations. these are lost when the image is flattened and then distributed.

10 days ago

lowyek

I asked it 'find two primes whose sum is 123'.. it has been in deep thought for 5 minutes, just looping over seemingly repeated hallucinations of the right path. (btw, chatgpt immediately answers 61 and 62 lol.. so much for intelligence)

10 days ago

utdiscant

Here is what I got from o1:

To find two prime numbers that sum to 123, let's consider the properties of prime numbers and the nature of their sums:

1. Prime Numbers Greater Than 2 Are Odd: Except for 2, all prime numbers are odd.

2. Sum of Two Odd Numbers Is Even: Adding two odd numbers results in an even number.

3. Sum of an Even and an Odd Number Is Odd: Adding an even number (like 2) and an odd number results in an odd sum.

Given that 123 is an odd number, the only way two primes can sum to 123 is if one of them is 2 (the only even prime) and the other is an odd prime. Let's test this possibility:

• 2 + x = 123 ⇒ x = 121

However, 121 is not a prime number (since 121 = 11 × 11).

Next, consider the possibility of adding two odd primes:

• Odd Prime + Odd Prime = Even Number

Since adding two odd primes yields an even number, it’s impossible for two odd primes to sum to 123 (an odd number).

Conclusion: There are no two prime numbers that sum to 123.

Answer: There are no two prime numbers whose sum is 123; such primes do not exist.
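A quick brute-force check confirms the same conclusion (a throwaway Python sketch, not output from any of the models):

  def is_prime(n):
      if n < 2:
          return False
      return all(n % d for d in range(2, int(n ** 0.5) + 1))

  # Look for any unordered pair of primes with p + q = 123
  pairs = [(p, 123 - p) for p in range(2, 62) if is_prime(p) and is_prime(123 - p)]
  print(pairs)  # [] -- the only odd split is 2 + 121, and 121 = 11 * 11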

10 days ago

lowyek

I think maybe I am not using the right model; the reasoning looks so good here. Maybe the default model in free ChatGPT is not the one you used.

10 days ago

alwayslikethis

qwq went through two screenfuls of tokens to try all primes under 123 twice, but eventually gave the right answer.

10 days ago

kenjackson

Well explained!

10 days ago

meta_x_ai

[dead]

10 days ago

sams99

Qwen coder 32b with a JavaScript interpreter

Impressive answer for a model that can run on your own computer

https://discuss.samsaffron.com/discourse-ai/ai-bot/shared-ai...

10 days ago

lowyek

thanks for sharing. Your blog looks like an old forum board.

9 days ago

sans_souse

It looks nice, but my Chrome browser on Android has all sorts of trouble rendering the animated bits, so it ends up skipping frames throughout my navigating and clicking. Add to that, the model doesn't respond at all in my multiple attempts; it's a waste of time until it's remedied.

10 days ago

[deleted]
10 days ago

rkagerer

Is it possible to try it without logging in?

Can you log in with anything other than a Google account?

I was excited by the tagline "Self-Sovereign", but it appears this is not.

10 days ago

reissbaker

"Model A 13B", "Model B 20B" etc are pretty vapid claims. Which actual models? There are plenty of terrible high-param-count models from a year or two ago. The benchmark seems meaningless without saying what models are actually being compared against... And "13B" in particular is pretty sketchy: are they comparing it against Llama 2 13B? Even an untuned Llama 3.1 8B would destroy that in any benchmark.

Smells a little grifty to me...

10 days ago

sushidev

It’s just a web page. How to try the model?

10 days ago

Timwi

It's a bit buried, but there's a link further down that takes you to an interface similar to ChatGPT. However, I have not been able to get any output from it; it seems to just think forever. I must be giving it the wrong prompts...

10 days ago

wongarsu

A bit off-topic, but that comparison graph is a great example why you should buy your designer a cheap secondary screen. I was viewing it on my second monitor and had to lean in to make out the off-white bar for Model D on the light-grey background. Moved the window over to my main screen and it's clear as day, five nice shades of coffee on a light-gray background.

10 days ago

peterkos

That's a pretty egregious mistake for a designer to make -- and that's not even mentioning the lack of accessibility. WebAIM's contrast checker says it's a 1:1 contrast ratio!

If someone is releasing a model that claims to have a level of reasoning, one would hope that their training dataset was scrutinized and monitored for unintended bias (as any statistical dataset is susceptible to: see overfitting). But if the graph on the announcement page is literally unreadable to seemingly anyone but the creator... that's damning proof that there is little empathy in the process, no?
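For reference, the ratio WebAIM reports is the standard WCAG contrast formula; a small Python sketch of it follows. The two grays are made-up illustrative values, not colors sampled from the actual page.

  def srgb_to_linear(c):
      # Linearize an sRGB channel value in [0, 1] (WCAG 2.x definition)
      return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

  def relative_luminance(rgb):
      r, g, b = (srgb_to_linear(c / 255) for c in rgb)
      return 0.2126 * r + 0.7152 * g + 0.0722 * b

  def contrast_ratio(fg, bg):
      brighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
      return (brighter + 0.05) / (darker + 0.05)

  # Two nearly identical light grays land close to the 1:1 ratio described above
  print(round(contrast_ratio((245, 245, 245), (240, 240, 240)), 2))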

10 days ago

parhamn

> that's damning proof that there is little empathy in the process, no?

No.

10 days ago

hmottestad

I wouldn’t say it’s implied, but there’s a reason people put on nice clothes for an interview.

I’m looking at the graphs on my phone and I’m pretty sure that there are 5 graphs and 3 labels. And their 8B model doesn’t seem to be very good, looks like a 20B model beats it in every single benchmark.

10 days ago

pavlov

The body text is also quite hard to read because the font has a tall x-height and line spacing is very tight.

This makes paragraphs look very dense, almost like it was set in uppercase only, because the lowercase letters don’t create a varying flow between lines for the eye to follow.

The model may be good, but the web design doesn’t win any prizes.

10 days ago

AYBABTME

Even on a high quality screen, it's a bit much.

10 days ago

lolinder

Also, is it standard practice to obfuscate which models you're benchmarking against? They're just labeled Model A-D, with sizes but no additional information.

10 days ago

sigmoid10

Given the context, it appears they are not benchmarking against other models but comparing differently sized versions of the same model. The 8B one is just the one they decided to give a catchy name. The other ones are probably also just fine tuned Llama models. But without information on the total compute budget (i.e. nr. of trained tokens), this kind of plot is pretty useless anyways.

10 days ago

lolinder

That doesn't make any sense, because their 8B is listed as benchmarking above the 13B "model A".

10 days ago

sigmoid10

That's why it is very likely it has seen more tokens during training and why the plot is worthless.

8 days ago

xena

I have an iPhone 15 Pro Max and it took me five glances to see the last bar. Use black outlines or something idk

10 days ago

imachine1980_

I sadly don't feel this is a mistake. The transparent ones are the two that beat the model in one category or more; it feels more like a scam than an error. If not, please fix it.

10 days ago

[deleted]
10 days ago

andai

[flagged]

10 days ago

lowbloodsugar

Oh fuck me, it’s a fellow autistic. You’d think after all these years on the planet my brain would be able to immediately respond with “Fine. You?” but no, this is what my brain does before eventually replying.

10 days ago

alcover

DeepMasking

10 days ago

[deleted]
10 days ago

esteban78

[dead]

13 days ago

[deleted]
10 days ago

19911991

[flagged]

11 days ago

kgeist

Not bad, asked it to count Rs in "strawberry" and Ns in "international", it answered correctly, and it was fast.

10 days ago

viraptor

This is a bad test for reasoning. The model is either trained on the spelling of its tokens or specific words, or not. It's not a thing reasoning would help with in any way. (Ok, maybe accidentally if it tried different splits it's trained to spell, but that's not the reasoning part either)

10 days ago

wongarsu

It's still not quite perfect at letter counting, but it handles questions like "which is heavier? 2kg of feathers or 1kg of lead" really well. The reasoning steps seem to prevent it from jumping to pattern-matched answers.

10 days ago

jb_briant

I don't get why this question is relevant to evaluate reasoning capacity. GPT-4o (no reasoning in an anthropomorphic sense) answers correctly:

--- The reasoning lies in the concept of mass and weight. The weight of an object is determined by its mass, not its material.

1. Mass comparison:

2kg of feathers has a mass of 2 kilograms.

1kg of lead has a mass of 1 kilogram.

Since 2 kilograms is greater than 1 kilogram, the feathers are heavier.

2. Irrelevance of material:

The type of material (feathers or lead) does not change the mass measurement.

Lead is denser than feathers, so 1kg of lead takes up much less space than 2kg of feathers, but the weight is still based on the total mass.

Thus, 2kg of any substance, even something as light as feathers, is heavier than 1kg of a dense material like lead.

10 days ago

wongarsu

Large models have no issues with this question at all. Even llama-70B can handle it without issues, and that is a lot smaller than GPT-4o. But for small models this is a challenging question. llama-8B gets it confidently wrong 4 out of 5 times. gemma-2-9B gets it wrong pretty much every time. qwen-coder-7B can handle it, so it's not impossible. It's just uncommon for small models to reliably get this question right, which is why I find it noteworthy that this model does.

10 days ago

jb_briant

Yes, makes sense. I didn't take the model size into account, and now that you mention it, it makes a lot of sense.

10 days ago

dyl000

Mine was wrong

1 Problem Understanding

The task is to count the number of 'r's in the word 'strawberry'. This requires identifying each occurrence of the letter 'r' in the given word and keeping a count.

2 Data Gathering

The word to analyze is 'strawberry'. I need to examine each character in this word.

3 Hypothesis Formation

Based on a quick glance, I hypothesize that there are 2 'r's in 'strawberry', but I need to verify this systematically.

4 Analysis

I will go through each letter of 'strawberry' one by one: s-t-r-a-w-b-e-r-r-y

5 Calculation

Counting the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 9. Total count: 2

6 Verification

Double-checking: s(0)-t(0)-r(1)-a(1)-w(1)-b(1)-e(1)-r(2)-r(2)-y(2). Confirmed, there are 2 'r's.

7 Conclusion Drawing

The final answer is that there are 2 'r's in the word 'strawberry'.

8 Implementation

To present the answer, I will state: 'There are 2 r's in the word "strawberry".' This directly answers the question in a clear and concise manner.

10 days ago

earth2mars

I gave it a trick question. It went down the right path and then went bonkers!

how many r's are there in the word strawrrberrry

Thought for 173 seconds

1 problem understanding

The task is to count the number of 'R's in the word 'strawrrberrry'. This requires careful examination of each letter in the given word.

2 data gathering

Let's break down the word into individual letters: s, t, r, a, w, r, r, b, e, r, r, r, y

3 analysis

We need to count the occurrences of the letter 'R' in this sequence. Let's go through each letter and keep a count.

4 calculation

Counting 'R's: 1. First 'R' (t=1), 2. Second 'R' (t=2), 3. Third 'R' (t=3), 4. Fourth 'R' (t=4), 5. Fifth 'R' (t=5). Total count: 5

5 verification

Let's double-check our count by going through the word again: strawrrberrry. Confirmed, there are indeed 5 'R's.

6 error checking

Ensuring we haven't missed any 'R's or counted any extra. The count remains at 5, which appears to be correct.

7 conclusion drawing

Based on our careful count and verification, we can confidently state that there are 5 'R's in the word 'strawrrberrry'.

8 implementation

The final response should be a simple statement: 'There are 5 R's in the word strawrrberrry.'

There are 5 R's in the word strawrrberrry.

10 days ago

raffraffraff

I know that this is a typical test of an LLM's ability to reason, but I wonder how much time could be saved by teaching an LLM to recognise the type of problem that it's inherently bad at, bundling a Python interpreter, and asking it to write a short function to solve the problem. Are we not pushing a boulder uphill?
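For illustration, the kind of trivial helper a tool-using model could emit instead of counting over tokens itself. count_letter is a hypothetical name for this sketch, not something any of these products actually runs.

  def count_letter(word: str, letter: str) -> int:
      # Case-insensitive count of a single letter in a word
      return word.lower().count(letter.lower())

  print(count_letter("strawberry", "r"))     # 3
  print(count_letter("strawrrberrry", "r"))  # 6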

10 days ago

throwaway314155

> teaching an LLM how to recognise the type of problem that it's inherently bad at

Solving this is the actual hard part and is either adjacent to or even equivalent to solving the problem of LLM's hallucinating. ChatGPT already includes a Python interpreter tool which can be used if the context indicates its appropriate.

10 days ago

raffraffraff

I suppose my question is pointing to another, which is: can one make an LLM that doesn't hallucinate? Isn't that problem inherent to the way that LLMs work? Obviously we can try to clean the data so there isn't any nonsense fed into it, but that'll only get you so far with a probabilistic, stochastic system. As an LLM once told me "Some experts argue that hallucination is an innate limitation of LLMs, akin to confabulation in humans, where false memories are created without the intention to deceive". I'm not sure if I believe that though.

10 days ago

nyrikki

For LLMs, no. But the explanation is wrong too: it has nothing to do with 'false memories' and everything to do with how LLMs work.

Here is the paper.

https://arxiv.org/abs/2401.11817

RAG and fine tuning improve domain specificity and may reduce the problem to a level where you don't care, but it will always be there.

Clean data would help reduce the incidence, possibly to a more usable level, but it also doesn't remove the problem.

Considering next-token prediction as serial runs of a multi-tape TM, with the previous output as the input, can help.

Especially if you consider Microsoft's BitNet b1.58, which requires full precision for training but can reduce weights to just the sign components for inference.

Unfortunately, all the paths I have to explain this require graduate-level complexity theory and/or differential geometry. Or you relive the Brouwer–Hilbert controversy by taking the logic path.

10 days ago

[deleted]
10 days ago

magicalhippo

Does it still count correctly if you ask for variations like "strawerberry"?

10 days ago

wongarsu

I found the following transcript of the thought process to that question equally enlightening and funny:

Thought for 4 seconds

1 problem understanding

The task is to count the number of instances of the letter 'r' in the word 'strawerberry'. This requires identifying each occurrence of 'r' in the given word and keeping a count.

2 data gathering

The word to analyze is 'strawerberry'. I need to examine each character in this word.

3 hypothesis formation

Based on a quick glance, I hypothesize that there are 2 instances of 'r' in 'strawerberry'.

4 analysis

Let's break down the word into individual characters: s, t, r, a, w, e, r, b, e, r, r, y

5 calculation

Now, let's count the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 7, 3) Third 'r' at position 10. Total count: 3

6 verification

Double-checking the count: s(0), t(0), r(1), a(0), w(0), e(0), r(2), b(0), e(0), r(3), r(3), y(0). Confirmed: 3 instances of 'r'.

7 error checking

The initial hypothesis was incorrect. The actual count (3) differs from the initial guess (2). This highlights the importance of careful analysis rather than quick assumptions.

8 conclusion drawing

The letter 'r' appears 3 times in the word 'strawerberry'.

9 implementation

The final response should be the number 3, as this is the correct count of 'r' instances in 'strawerberry'.

10 days ago

Legend2440

I get:

>Alright, I need to figure out how many 'r's are in the word 'strawerberry'. Let's look at it carefully. I'll write it down: s-t-r-a-w-e-r-b-e-r-r-y. Now, I'll go through each letter one by one.

>Starting from the beginning: s, t, r – that's one 'r'. Then a, w, e, r – that's the second 'r'. Next, b, e, r – that's the third 'r'. And finally, another r before the y, making it the fourth 'r'. So, I count four 'r's in total.

>The word "strawberry" contains four 'r's.

10 days ago

optimalsolver

Wow, so close. I wonder what went wrong here?

10 days ago

brookst

Tokenization, still. LLMs don't see individual letters. It's like using Google Translate to ask a German speaker the same question and concluding that Germans can't count letters.
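You can see this for yourself with a tokenizer. A small sketch using the tiktoken library, assuming it is installed; the exact split depends on the encoding:

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models
  tokens = enc.encode("strawberry")
  # Prints a few multi-character chunks rather than individual letters,
  # which is why letter-level questions are awkward for the model
  print([enc.decode([t]) for t in tokens])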

10 days ago

optimalsolver

Does this mean there's no actual reasoning happening in these "reasoning" steps?

10 days ago

wongarsu

Well, the model is based on llama-8b, which is quite bad at reasoning. Reasoning (or things that look and quack like reasoning) is more the domain of 70B+ models, and some of the newer 7B models.

The model is doing well on many reasoning tasks; what they are doing is a massive step up from llama-8b. But it still makes some silly mistakes. I bet if you did the same finetuning procedure with qwen-7B or llama-70B as a starting point you would get quite a competent model.

10 days ago

[deleted]
10 days ago

relaxing

Of course there’s no actual reasoning.

10 days ago

dl9999

Is there something more than prediction going on? I'd like to understand why there would be studies like this: https://arxiv.org/pdf/2307.02477 if it's all just prediction. Are there some algorithms that are called "reasoning", that are somewhere between prediction and full-blown reasoning?

10 days ago

relaxing

My guess is the framing in that paper was chosen to be provocative. The authors surely understand deep neural networks neither reason nor recite as we understand it in humans.

In any case, the conclusion was decidedly against reasoning.

10 days ago