DeepThought-8B: A small, capable reasoning model
Comments
tkgally
There's been a rush of releases of reasoning models in the past couple of weeks. This one looks interesting, too.
I found the following video from Sam Witteveen to be a useful introduction to a few of those models:
https://youtu.be/vN8jBxEKkVo
CGamesPlay
In what way did they "release" this? I can't find it on Hugging Face or Ollama, and they only seem to have a "try online" link in the article. "Self-sovereign intelligence", indeed.
wongarsu
They released it in the same sense OpenAI released GPT-4. There is an online demo you can chat with, and a form to get in touch with sales for API access.
underlines
they didn't
tanakai24
Legally, you cannot name Llama 3-based models like that. You have to use "Llama" in the name:
https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct/blo...
alexvitkov
Too bad :)
Facebook trained the model on an Internet's worth of copyrighted material without any regard for licenses whatsoever - even if model weights are copyrightable, which is an open question, you're doing the exact same thing they did. Probably not a bulletproof legal defense though.
tourmalinetaco
At least Zuck had the decency to release model weights, unlike these worthless clowns.
littlestymaar
Can't wait until Meta sue them so we can have a judgment on whether or not models weights are subject to copyright.
euroderf
Model weights are (abstractly speaking) a very intensive, concentrated form of website scraping, yes?
What does the (USA) law say about scraping? Does "fair use" play a role?
ranger_danger
Yes, and there have already been court cases that ruled AI training on copyrighted data to be fair use, because it's technically no different from any other form of art: everything is based on seeing other ideas elsewhere, and there are no new ideas anymore.
jb_briant
Am I wrong to think that "reasoning model" is a misleading marketing term?
Isn't it an LLM with an algo wrapper?
viraptor
Whether you bake the behaviour in or wrap it in an external loop, you need to train/tune the expected behaviour. Generic models can do chain of thought if asked for, but will be worse than the specialised one.
benchmarkist
They're not baking anything in. Reasoning, as it is defined by AI marketing departments, is just beam search.
jb_briant
Could you educate me on what beam search is? Or link a good resource?
EDIT: https://www.width.ai/post/what-is-beam-search
So the wider the beam, the better the outcome?
Yep, no reasoning, just a marketing term to say "more accurate probabilities"
benchmarkist
Beam search just traverses different paths and assigns each path a probability of being correct. The paths with the higher probabilities are kept and the ones with lower probabilities are pruned until the search terminates with an "answer". The marketing department calls it "reasoning" and "test-time compute" because the average consumer does not care whether it's beam search or something else.
Your link seems to do a good job of explaining beam search but it's a classic algorithm in state space exploration so most books on search algorithms and discrete optimization will have a section about it.¹
1: https://books.google.com/books?id=QzGuHnDhvZIC&q=%22beam%20s...
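(For anyone who wants the mechanics rather than the marketing, here is a minimal beam-search sketch in Python. The expand() scoring function is a made-up stand-in so the sketch actually runs; in an LLM it would be one forward pass scoring candidate next tokens.)

    import math

    def expand(seq):
        # pretend the model always offers three continuations with fixed scores
        if len(seq) >= 4:
            return []                                   # stop after a few steps
        return [("a", math.log(0.5)), ("b", math.log(0.3)), ("c", math.log(0.2))]

    def beam_search(start, beam_width=2, max_steps=10):
        beams = [(0.0, [start])]                        # (cumulative log-prob, path)
        for _ in range(max_steps):
            candidates = [(score + logp, path + [tok])
                          for score, path in beams
                          for tok, logp in expand(path)]
            if not candidates:
                break
            # keep the beam_width highest-scoring paths, prune the rest
            beams = sorted(candidates, reverse=True)[:beam_width]
        return max(beams)                               # best-scoring surviving path

    print(beam_search("<start>"))

The wider the beam, the more paths survive each pruning step, which is where the "wider beam, better outcome (at higher cost)" intuition above comes from.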
radarsat1
An algorithm that searches for the highest probability answer is not "reasoning"? "Search" has been a fundamental building block of GOFAI since the beginning. How do you define reasoning? Can you justify it being different from the last 70 years of thought on the topic?
jb_briant
Since you asked, I define reasoning as Cambridge does:
Reasoning "the process of thinking about something in order to make a decision"
Thinking: "the activity of using your mind to consider something"
Mind: "the part of a person that makes it possible for him or her to think, feel emotions, and understand things"
I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"
I also think that Cambridge's definition of Mind is incomplete and lacks the creativity part, along with cognition and emotions. But that's a vastly different topic.
andai
>I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"
I recall reading about a theory of neurology where many thoughts (neural circuits) fired simultaneously and then one would "win" and the others got suppressed. The closest thing I can find right now is Global Workspace Theory.
Lerc
That seems a bit of an odd definition.
A: Doing it with B.
B: Using C to do it.
C: The part that does B
Without defining "Your", "Consider", "person", "thinking", "feel", and "understand", it could be anything.
There's more than enough leeway in those undefineds to subjectively choose whatever you want.
nuancebydefault
I would rather call it circular reasoning, a thing humans are very capable of.
dboreham
Enter the Chinese Room..
jb_briant
What do you mean please?
whitten
The Chinese Room is a hypothetical room that appears to contain a "Chinese speaker" but, when given a text to 'understand', actually just looks the text up in a huge collection of rules inside until it finds a response, and then just outputs that response as its reply.
amyfp214
>So the wider the beam, the better the outcome?
I looked into it, and this "beam search" is nothing but a bit of puffed-up nomenclature, not unlike the shock and awe of understanding a language such as Java that introduces synonyms for common terms for no apparent reason, not unlike the intimidating name of "Bonferroni multiple test correction", which is just a (1/n) division operation.
"Beam search" is breadth-first search. Instead of taking all the child nodes at a layer, it takes the top <n> according to some heuristic. But "top n" wasn't enough for whoever cooked up that trivial algorithm, so instead it's "beam width". It probably has more complexities in AI where that particular heuristic becomes more mathematical and complex, as heuristics tend to do.
benchmarkist
AI marketing departments are fond of anthropomorphic language but it's actually just regular beam search.
JTyQZSnP3cQGa8B
The same way they now call "open-source" a completely closed-source binary blob full of copyright infringement.
Kiro
"reasoning model" means nothing so I don't think it's misleading.
astrobe_
Reasoning means "inference" or "deduction" to me, or at least some process related to first order logic.
nyrikki
The known upper bound for transformers on the fly computation abilities is a complexity class called DLOGTIME-uniform TC^0.
There is a lot to unpack there, but if you take FO as being closed under conjunction (∧), negation (¬) and universal quantification (∀), you will find that DLOGTIME-uniform TC^0 is equal to FO + majority gates.
So be careful about that distinction.
To help break the above down:
DLOGTIME: constructible by a RAM or TM in logarithmic time.
uniform: a single construction procedure generates the circuit for every input size, whereas arbitrary circuit families (one per input size) are the default convention.
TC^0: constant-depth threshold circuits.
Even NP == SO-E, the second-order queries where the second-order quantifiers are only existentials.
DLOGTIME-uniform TC^0 is a WAY smaller class than most people realize, but anything that is an algorithm or a program basically is logic, with P being FO + least fixed point, or half a dozen other known mappings.
Transformers can figure out syntax, but if you dig into that DLOGTIME part, you will see that semantic correctness isn't really an option... thus the need to leverage the pattern matching and pattern finding of pre-training as much as possible.
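(For anyone trying to follow the alphabet soup, the standard descriptive-complexity correspondences, on ordered structures, are roughly the following. This is background from the literature, not something specific to the model under discussion:)

    FO                      =  DLOGTIME-uniform AC^0
    FO + majority           =  DLOGTIME-uniform TC^0
    FO + transitive closure =  NL
    FO + least fixed point  =  P
    existential SO (Fagin)  =  NP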
andai
Thanks. If I'm reading this right, the limiting factor on the intelligence of current LLMs is not the network size, nor training data (size/quality) but rather the architecture? Do we know of a better one for complex computations / "reasoning"?
codetrotter
Given the name they gave it, someone with access should ask it for the “Answer to the Ultimate Question of Life, The Universe, and Everything”
If the answer is anything other than a simple "42", I will be thoroughly disappointed. (The answer has to be just "42", not a bunch of text about the Hitchhiker's Guide to the Galaxy and all that.)
vintermann
Deep Thought didn't answer right away either.
lowbloodsugar
“Right away”. lol.
asah
"what is the population of manhattan below central park"
ChatGPT-o1-preview: 647,000 (based on 2023 data, breaking it down by community board area): https://chatgpt.com/share/674b3f5b-29c4-8007-b1b6-5e0a4aeaf0... (this appears to be the most correct, judging from census data)
DeepThought-8B: 200,000 (based on 2020 census data). Claude: 300-350,000. Gemini: 2.7M during peak times (strange definition of population!)
I followed up with DeepThought-8B: "what is the population of all of manhattan, and how does that square with only having 200,000 below CP" and it cut off its answer, but in the reasoning box it updated its guess to 400,000 by estimating as a fraction of land area.
igleria
I asked it "Describe how a device for transportation of living beings would be able to fly while looking like a sphere" and it just never returned an output
Timwi
I asked it to just count letters in a long word and it never returned an output (been waiting for 30 minutes now)
m3kw9
It isn’t pleased you ask it such questions
ConspiracyFact
Blaine is a pain
nyoomboom
The reasoning steps look reasonable and the interface is simple and beautiful, though DeepThought-8B fails to disambiguate "the ruliad", the technical concept from Wolfram's physics project, from the company's name, Ruliad. Maybe that isn't in the training data, because it misunderstood the problem when asked "what is the simplest rule of the ruliad?" and went on to reason about the company's core principles. Cool release, waiting for the next update.
segalord
Xd, gotta love how your first question to test a model is about a "ruliad". It's not even in my iOS dictionary.
euroderf
I am very impressed. I asked chat.ruliad.co:
"Beginning from the values for fundamental physical constants, is it possible to derive the laws of entropy?"
and then, based on its response to that, I asked it:
"Based on this analysis, can you identify and describe where the dividing line is between (a) the region where (microscopic/atomic) processes are reversible, and (b) the region where macroscopic processes are irreversible?"
nyrikki
That will be very common in the corpus, thus pattern matching and retrieval more than 'reasoning'.
Not that it isn't useful; it's just expected behavior for an LLM these days.
I would be interested in the output, because it is almost certainly the more common but fundamentally incorrect intro-class half-truth: useful for the typical curriculum, but not near the frontier of current understanding.
rkagerer
For those of us who can't try it out ourselves, could you post the results somewhere and share a link?
euroderf
It's not giving me a link, and its responses are multi-step and wordy. I can post some pullquotes...
Yes, it is possible to derive the laws of entropy from the fundamental physical constants. Entropy is a measure of the number of possible microscopic configurations of a system, and it is related to the fundamental physical constants through the Boltzmann constant (k_B) and the laws of thermodynamics.
The second law of thermodynamics states that the total entropy of an isolated system can only increase over time, or remain constant if the system is in equilibrium. This law can be derived from the fundamental physical constants and the microscopic behavior of particles in a system.
One way to derive the laws of entropy is by considering the statistical mechanics of a system. In statistical mechanics, entropy is related to the number of possible microscopic configurations (microstates) of a system, which can be described by the Boltzmann constant (k_B). The entropy (S) of a system is given by the Boltzmann formula:
S = k_B * ln(Ω)
where Ω is the number of possible microstates.
The fundamental physical constants, such as the speed of light (c), the gravitational constant (G), and the Planck constant (h), are related to the other constants through various equations. For example, the Boltzmann constant (k_B) can be expressed in terms of the Planck constant (h), the speed of light (c), and the gravitational constant (G):
k_B = (h * P / 2π) * (c^2 / G)
where P is the pressure.
By using these relationships and considering the microscopic behavior of particles in a system, it is possible to derive the laws of entropy from the fundamental physical constants. However, it is important to note that entropy is a macroscopic property, and its laws are not directly derived from the fundamental physical constants but rather from the microscopic behavior of particles and the laws of thermodynamics.
[Q: Based on this analysis, can you identify and describe where the dividing line is between (a) the region where (microscopic/atomic) processes are reversible, and (b) the region where macroscopic processes are irreversible ?]
The dividing line between the region where microscopic/atomic processes are reversible and the region where macroscopic processes are irreversible lies in the scale at which the processes occur and the interactions between particles.
At the microscopic level (a), processes are generally considered reversible because the interactions between individual particles can be easily reversed. For example, when two particles collide, they can exchange momentum and energy, and the process can be considered reversible. In this region, the system is in thermodynamic equilibrium, and the second law of thermodynamics does not apply.
However, as we move to the macroscopic level (b), the number of particles involved in the processes increases, and the interactions between them become more complex. In this region, the processes are considered irreversible because the energy and momentum exchanges between particles lead to an increase in entropy. This increase in entropy is a result of the second law of thermodynamics, which states that the total entropy of an isolated system can only increase over time or remain constant if the system is in equilibrium.
The dividing line between reversible and irreversible processes can be identified by examining the scale at which the system operates and the interactions between particles. At the microscopic level, processes are generally reversible, while at the macroscopic level, processes are irreversible due to the increase in entropy resulting from the interactions between a large number of particles.
-end-
chvid
Is the source code available for this? And who is behind the company?
alexvitkov
Please don't call model weights source code. Code is something you can (usually) read and understand, if anything weights are closer to a very obfuscated compiled binary - although even that can be analyzed by a good enough reverse engineer.
"Open Source" models are the latest in a long series of attempts to take advantage of FOSS's reputation without actually releasing something that adheres to the principles.
littlestymaar
To run a language model you need both the model weights and the source code of the implementation!
rkagerer
I would think truly open source means also including the initial training data. i.e. Everything needed to build[/tweak/customize] the thing from the ground up.
littlestymaar
I agree, but I'm responding to someone confusing source code and model weights…
exe34
llama.cpp is open source!
alexvitkov
Neat! But if we're sticking to the source code analogy, the matrix multiplier probably maps closer to a CPU or an interpreter - I wouldn't say every Java program is open source because you have an open-source JVM.
exe34
does that make every image closed source because you need a viewer to view them?
hn_acker
At least if you use the Free Software Foundation's definition of free software, one requirement is access to the unobfuscated source code. If every copy of a free-as-in-speech program binary's source code were deleted from existence by a god then the program would become proprietary.
I would say that every image using a proprietary file format is proprietary. If the creator of such an image intends for the image to be otherwise free as in speech (any receiver can inspect, modify, redistribute with and without modification for any purpose), then the image can be made free if the image creator converts the image to a free format.
alexvitkov
Yes, which is why if you're serious you will exclusively use the terminal to protect your computer from even the passing chance of being exposed to a so called "image file".
In all seriousness, compilation and model training are lossy processes and erase a lot of the context needed to understand the output (and with model training we don't fully understand it even with access to the training data). Images aren't necessarily derivative of anything, so the analogy breaks down here.
exe34
often images are not only lossily compressed, but they are a composite of many images/layers/image transformations. these are lost when the image is flattened and then distributed.
lowyek
I asked it 'find two primes whose sum is 123'.. it has been in deep thought for 5 minutes, just looping and looping over seemingly repeated hallucinations of the right path. (btw, ChatGPT immediately answers 61 and 62, lol.. so much for intelligence)
utdiscant
Here is what I got from o1:
To find two prime numbers that sum to 123, let's consider the properties of prime numbers and the nature of their sums:
1. Prime Numbers Greater Than 2 Are Odd: Except for 2, all prime numbers are odd.
2. Sum of Two Odd Numbers Is Even: Adding two odd numbers results in an even number.
3. Sum of an Even and an Odd Number Is Odd: Adding an even number (like 2) and an odd number results in an odd sum.
Given that 123 is an odd number, the only way two primes can sum to 123 is if one of them is 2 (the only even prime) and the other is an odd prime. Let's test this possibility:
• 2 + x = 123 ⇒ x = 121
However, 121 is not a prime number (since 121 = 11 × 11).
Next, consider the possibility of adding two odd primes:
• Odd Prime + Odd Prime = Even Number
Since adding two odd primes yields an even number, it’s impossible for two odd primes to sum to 123 (an odd number).
Conclusion: There are no two prime numbers that sum to 123.
Answer: There are no two prime numbers whose sum is 123; such primes do not exist.
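(For what it's worth, o1's conclusion is easy to verify mechanically. Here is a minimal brute-force check in Python, not something any of the models produced:)

    # Brute-force check: is there any pair of primes p, q with p + q = 123?
    def is_prime(n):
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n ** 0.5) + 1))

    pairs = [(p, 123 - p) for p in range(2, 123) if is_prime(p) and is_prime(123 - p)]
    print(pairs)  # [] -- empty: one addend would have to be 2, and 121 = 11 * 11 is not prime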
lowyek
I think maybe I am not using the right model; the reasoning looks so good here. Maybe the default model in free ChatGPT is not the one which you used.
alwayslikethis
QwQ went through two screenfuls of tokens to try all primes under 123 twice, but eventually gave the right answer.
kenjackson
Well explained!
meta_x_ai
[dead]
sams99
Qwen coder 32b with a JavaScript interpreter
Impressive answer for a model that can run on your own computer
https://discuss.samsaffron.com/discourse-ai/ai-bot/shared-ai...
lowyek
Thanks for sharing. Your blog looks like an old forum board.
sans_souse
It looks nice, but my Chrome browser on Android has all sorts of trouble rendering the animated bits, so it ends up skipping frames throughout my navigating and clicks. Add to that, the model doesn't respond at all in my multiple attempts; it's a waste of time until that's remedied.
rkagerer
Is it possible to try it without logging in?
Can you log in with anything other than a Google account?
I was excited by the tagline "Self-Sovereign", but it appears this is not.
reissbaker
"Model A 13B", "Model B 20B" etc are pretty vapid claims. Which actual models? There are plenty of terrible high-param-count models from a year or two ago. The benchmark seems meaningless without saying what models are actually being compared against... And "13B" in particular is pretty sketchy: are they comparing it against Llama 2 13B? Even an untuned Llama 3.1 8B would destroy that in any benchmark.
Smells a little grifty to me...
sushidev
It’s just a web page. How to try the model?
Timwi
It's a bit buried, but there's a link further down that takes you to an interface similar to ChatGPT. However, I have not been able to get any output from it; it seems to just think forever. I must be giving it the wrong prompts...
wongarsu
A bit off-topic, but that comparison graph is a great example of why you should buy your designer a cheap secondary screen. I was viewing it on my second monitor and had to lean in to make out the off-white bar for Model D on the light-grey background. Moved the window over to my main screen and it's clear as day: five nice shades of coffee on a light-gray background.
peterkos
That's a pretty egregious mistake for a designer to make -- and that's not even mentioning the lack of accessibility. WebAIM's contrast checker says it's a 1:1 contrast ratio!
If someone is releasing a model that claims to have a level of reasoning, one would hope that their training dataset was scrutinized and monitored for unintended bias (as any statistical dataset is susceptible to: see overfitting). But if the graph on the announcement page is literally unreadable to seemingly anyone but the creator... that's damning proof that there is little empathy in the process, no?
parhamn
> that's damning proof that there is little empathy in the process, no?
No.
hmottestad
I wouldn’t say it’s implied, but there’s a reason people put on nice clothes for an interview.
I’m looking at the graphs on my phone and I’m pretty sure that there are 5 graphs and 3 labels. And their 8B model doesn’t seem to be very good, looks like a 20B model beats it in every single benchmark.
pavlov
The body text is also quite hard to read because the font has a tall x-height and line spacing is very tight.
This makes paragraphs look very dense, almost like it was set in uppercase only, because the lowercase letters don’t create a varying flow between lines for the eye to follow.
The model may be good, but the web design doesn’t win any prizes.
AYBABTME
Even on a high quality screen, it's a bit much.
lolinder
Also, is it standard practice to obfuscate which models you're benchmarking against? They're just labeled Model A-D, with sizes but no additional information.
sigmoid10
Given the context, it appears they are not benchmarking against other models but comparing differently sized versions of the same model. The 8B one is just the one they decided to give a catchy name. The other ones are probably also just fine tuned Llama models. But without information on the total compute budget (i.e. nr. of trained tokens), this kind of plot is pretty useless anyways.
lolinder
That doesn't make any sense, because their 8B is listed as benchmarking above the 13B "model A".
sigmoid10
That's why it is very likely it has seen more tokens during training and why the plot is worthless.
xena
I have an iPhone 15 Pro Max and it took me five glances to see the last bar. Use black outlines or something idk
imachine1980_
I sadly don't feel this is a mistake: the transparent ones are the two that beat the model in one category or more. It feels more like a scam than an error; if not, please fix it.
andai
[flagged]
lowbloodsugar
Oh fuck me, it’s a fellow autistic. You’d think after all these years on the planet my brain would be able to immediately respond with “Fine. You?” but no, this is what my brain does before eventually replying.
alcover
DeepMasking
esteban78
[dead]
19911991
[flagged]
kgeist
Not bad, asked it to count Rs in "strawberry" and Ns in "international", it answered correctly, and it was fast.
viraptor
This is a bad test for reasoning. The model is either trained on the spelling of its tokens or specific words, or not. It's not a thing reasoning would help with in any way. (Ok, maybe accidentally if it tried different splits it's trained to spell, but that's not the reasoning part either)
wongarsu
It's still not quite perfect at letter counting, but it handles questions like "which is heavier? 2kg of feathers or 1kg of lead" really well. The reasoning steps seem to prevent it from jumping to pattern-matched answers
jb_briant
I don't get why this question is relevant to evaluate the reasoning capacity. GPT-4o (no reasoning in an anthropomorphic sense) answers correctly:
--- The reasoning lies in the concept of mass and weight. The weight of an object is determined by its mass, not its material.
1. Mass comparison:
2kg of feathers has a mass of 2 kilograms.
1kg of lead has a mass of 1 kilogram.
Since 2 kilograms is greater than 1 kilogram, the feathers are heavier.
2. Irrelevance of material:
The type of material (feathers or lead) does not change the mass measurement.
Lead is denser than feathers, so 1kg of lead takes up much less space than 2kg of feathers, but the weight is still based on the total mass.
Thus, 2kg of any substance, even something as light as feathers, is heavier than 1kg of a dense material like lead.
wongarsu
Large models have no issues with this question at all. Even llama-70B can handle it without issues, and that is a lot smaller than GPT-4o. But for small models this is a challenging question. llama-8B gets it confidently wrong 4 out of 5 times. gemma-2-9B gets it wrong pretty much every time. Qwen-coder-7B can handle it, so it's not impossible. It's just uncommon for small models to reliably get this question right, which is why I find it noteworthy that this model does.
jb_briant
Yes, makes sense. I didn't take into account the model size, and now that you mention it, it makes a lot of sense.
dyl000
Mine was wrong
1 Problem Understanding
The task is to count the number of 'r's in the word 'strawberry'. This requires identifying each occurrence of the letter 'r' in the given word and keeping a count.
2 Data Gathering
The word to analyze is 'strawberry'. I need to examine each character in this word.
3 Hypothesis Formation
Based on a quick glance, I hypothesize that there are 2 'r's in 'strawberry', but I need to verify this systematically.
4 Analysis
I will go through each letter of 'strawberry' one by one: s-t-r-a-w-b-e-r-r-y
5 Calculation
Counting the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 9. Total count: 2
6 Verification
Double-checking: s(0)-t(0)-r(1)-a(1)-w(1)-b(1)-e(1)-r(2)-r(2)-y(2). Confirmed, there are 2 'r's.
7 Conclusion Drawing
The final answer is that there are 2 'r's in the word 'strawberry'.
8 Implementation
To present the answer, I will state: 'There are 2 r's in the word "strawberry".' This directly answers the question in a clear and concise manner.
earth2mars
I gave it a trick question. It went down the right path and then went bonkers!
how many r's are there in the word strawrrberrry
Thought for 173 seconds
1 problem understanding
The task is to count the number of 'R's in the word 'strawrrberrry'. This requires careful examination of each letter in the given word.
2 data gathering
Let's break down the word into individual letters: s, t, r, a, w, r, r, b, e, r, r, r, y
3 analysis
We need to count the occurrences of the letter 'R' in this sequence. Let's go through each letter and keep a count.
4 calculation
Counting 'R's: 1. First 'R' (t=1), 2. Second 'R' (t=2), 3. Third 'R' (t=3), 4. Fourth 'R' (t=4), 5. Fifth 'R' (t=5). Total count: 5
5 verification
Let's double-check our count by going through the word again: strawrrberrry. Confirmed, there are indeed 5 'R's.
6 error checking
Ensuring we haven't missed any 'R's or counted any extra. The count remains at 5, which appears to be correct.
7 conclusion drawing
Based on our careful count and verification, we can confidently state that there are 5 'R's in the word 'strawrrberrry'.
8 implementation
The final response should be a simple statement: 'There are 5 R's in the word strawrrberrry.'
There are 5 R's in the word strawrrberrry.
raffraffraff
I know that this is a typical test of an LLM's ability to reason, but I wonder how much time could be saved by teaching an LLM to recognise the type of problem that it's inherently bad at, bundling a Python interpreter, and asking it to write a short function to solve the problem? Are we not pushing a boulder uphill?
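(The "short function" really is trivial; something like the following sketch, which operates on characters rather than tokens, is all a tool-using model would need to emit:)

    # Counting letters character-by-character sidesteps tokenization entirely:
    for word in ("strawberry", "strawrrberrry"):
        print(word, word.count("r"))   # strawberry 3, strawrrberrry 6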
throwaway314155
> teaching an LLM how to recognise the type of problem that it's inherently bad at
Solving this is the actual hard part and is either adjacent to or even equivalent to solving the problem of LLMs hallucinating. ChatGPT already includes a Python interpreter tool, which it can use when the context indicates it's appropriate.
raffraffraff
I suppose my question is pointing to another, which is: can one make an LLM that doesn't hallucinate? Isn't that problem inherent to the way that LLMs work? Obviously we can try to clean the data so there isn't any nonsense fed into it, but that'll only get you so far with a probabilistic, stochastic system. As an LLM once told me "Some experts argue that hallucination is an innate limitation of LLMs, akin to confabulation in humans, where false memories are created without the intention to deceive". I'm not sure if I believe that though.
nyrikki
For LLMs, no. But the explanation is wrong too: it has nothing to do with 'false memories' and everything to do with how LLMs work.
Here is the paper.
https://arxiv.org/abs/2401.11817
RAG and fine tuning improve domain specificity and may reduce the problem to a level where you don't care, but it will always be there.
Clean data would help reduce the incidents, possibly to a level that is more usable, but also don't remove the problem.
Considering next-token prediction as serial runs on multi-tape TMs, with the previous output as the input, can help.
Especially if you consider Microsoft's BitNet b1.58, which requires full precision for training but can reduce weights to just the sign components for inference.
Unfortunately all paths to explain this I have require graduate level complexity theory and/or diff geometry. Or you relive the Brouwer–Hilbert controversy by trying the logic path.
magicalhippo
Does it still count correctly if you ask for variations like "strawerberry"?
wongarsu
I found the following transcript of the thought process to that question equally enlightening and funny:
Thought for 4 seconds
1 problem understanding
The task is to count the number of instances of the letter 'r' in the word 'strawerberry'. This requires identifying each occurrence of 'r' in the given word and keeping a count.
2 data gathering
The word to analyze is 'strawerberry'. I need to examine each character in this word.
3 hypothesis formation
Based on a quick glance, I hypothesize that there are 2 instances of 'r' in 'strawerberry'.
4 analysis
Let's break down the word into individual characters: s, t, r, a, w, e, r, b, e, r, r, y
5 calculation
Now, let's count the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 7, 3) Third 'r' at position 10. Total count: 3
6 verification
Double-checking the count: s(0), t(0), r(1), a(0), w(0), e(0), r(2), b(0), e(0), r(3), r(3), y(0). Confirmed: 3 instances of 'r'.
7 error checking
The initial hypothesis was incorrect. The actual count (3) differs from the initial guess (2). This highlights the importance of careful analysis rather than quick assumptions.
8 conclusion drawing
The letter 'r' appears 3 times in the word 'strawerberry'.
9 implementation
The final response should be the number 3, as this is the correct count of 'r' instances in 'strawerberry'.
Legend2440
I get:
>Alright, I need to figure out how many 'r's are in the word 'strawerberry'. Let's look at it carefully. I'll write it down: s-t-r-a-w-e-r-b-e-r-r-y. Now, I'll go through each letter one by one.
>Starting from the beginning: s, t, r – that's one 'r'. Then a, w, e, r – that's the second 'r'. Next, b, e, r – that's the third 'r'. And finally, another r before the y, making it the fourth 'r'. So, I count four 'r's in total.
>The word "strawberry" contains four 'r's.
optimalsolver
Wow, so close. I wonder what went wrong here?
brookst
Tokenization, still. LLMs don't see individual letters. It's like using Google Translate to ask a German speaker the same question and concluding that Germans can't count letters.
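(A minimal illustration of "don't see individual letters", assuming the tiktoken library is available; the exact split depends on the tokenizer in use:)

    # The model receives subword token IDs, not characters:
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    print([enc.decode_single_token_bytes(t) for t in tokens])  # a few multi-character chunks, no per-letter view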
optimalsolver
Does this mean there's no actual reasoning happening in these "reasoning" steps?
wongarsu
Well, the model is based on llama-8b, which is quite bad at reasoning. Reasoning (or things that look and quack like reasoning) is more the domain of 70B+ models, and some of the newer 7B models.
The model is doing well on many reasoning tasks; what they are doing is a massive step up from llama-8b. But it still makes some silly mistakes. I bet if you did the same finetuning procedure with Qwen-7B or llama-70B as a starting point you would get a quite competent model.
relaxing
Of course there’s no actual reasoning.
dl9999
Is there something more than prediction going on? I'd like to understand why there would be studies like this: https://arxiv.org/pdf/2307.02477 if it's all just prediction. Are there some algorithms that are called "reasoning", that are somewhere between prediction and full-blown reasoning?
relaxing
My guess is the framing in that paper was chosen to be provocative. The authors surely understand deep neural networks neither reason nor recite as we understand it in humans.
In any case, the conclusion was decidedly against reasoning.