g1: Using Llama-3.1 70B on Groq to create o1-like reasoning chains

334 points
4 months ago
by gfortaine

Comments


segmondy

This is not even remotely close and very silly. A ChainOfThought in a loop.

TreeOfThoughts is a more sophisticated method, see - https://arxiv.org/pdf/2305.10601

The clue we all had with OpenAI for a long time was that this is a search through a tree: they hired Noam Brown, and his past work all hinted towards that. Q* is obviously a search on a tree, like A*. So take something like CoT, build out a tree, and search across it for the best solution. The search is the "system-2 reasoning".
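
To make that concrete, a beam-style search over CoT steps could look roughly like the sketch below. This is only an illustration of the idea, not what OpenAI does; propose_thoughts and score_state are hypothetical stand-ins for LLM calls (one proposes candidate next steps, the other scores a partial chain).

    # Minimal beam-search sketch over chains of thought, in the spirit of ToT.
    # propose_thoughts() and score_state() are hypothetical stand-ins for LLM calls.
    from typing import Callable, List, Tuple

    def tot_search(
        question: str,
        propose_thoughts: Callable[[str, List[str]], List[str]],  # k candidate next steps
        score_state: Callable[[str, List[str]], float],           # value of a partial chain
        depth: int = 3,
        beam_width: int = 5,
    ) -> List[str]:
        beam: List[Tuple[float, List[str]]] = [(0.0, [])]  # (score, chain so far)
        for _ in range(depth):
            candidates: List[Tuple[float, List[str]]] = []
            for _, chain in beam:
                for thought in propose_thoughts(question, chain):
                    new_chain = chain + [thought]
                    candidates.append((score_state(question, new_chain), new_chain))
            # Keep only the most promising partial chains ("pruning the tree").
            beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        return beam[0][1]  # best chain found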

4 months ago

COAGULOPATH

Came here hoping to find this.

You will not unlock "o1-like" reasoning by making a model think step by step. This is an old trick that people were using on GPT-3 in 2020. If it were that simple, it wouldn't have taken OpenAI so long to release it.

Additionally, some of the prompt seems counterproductive:

>Be aware of your limitations as an llm and what you can and cannot do.

The LLM doesn't have a good idea of its limitations (any more than humans do). I expect this will create false refusals, as the model becomes overcautious.

4 months ago

anshumankmr

>The LLM doesn't have a good idea of its limitations (any more than humans do). I expect this will create false refusals, as the model becomes overcautious.

Can it not be trained to do so? From my anecdotal observations, the knowledge cutoff is one limitation that LLMs are really well trained to know about and handle. Why can't they also be trained to know that they are quite frequently bad at math, that they sometimes produce inaccurate code, and so on?

The same goes for humans: some people know that certain things are just not their cup of tea. Sure, people sometimes have half-baked knowledge, but most can tell which things they are good at and which they are not.

4 months ago

whimsicalism

you’re wrong and stating things confidently without the evidence to back it up.

alignment is a tough problem, and aligning long reasoning sequences to correct answers is also a tough problem. collecting high-quality CoT from experts is another tough problem. they started this project in October; it's more than plausible it could take this long

4 months ago

TrapLord_Rhodo

overcautious when trimming branches on the tree seems like a feature, not a bug.

4 months ago

Meganet

You actually don't know that.

An LLM has ingested a huge amount of data. It can create character profiles, audiences, personas, etc.

Why wouldn't it have potentially even learned to 'understand' what 'being aware of your limitations' means?

Right now, 'chain of reasoning' feels to me a bit like querying the existing meta space through the reasoning process to adjust weights. Basically priming the model.

I would also not just call it a 'trick'. This looks simple, weird, or whatnot, but I do believe that this is part of AI thinking-process research.

It's a good question though: what did they train? A new architecture? More parameters? Is this training a mix of experiments they did? Some auto-optimization mechanism?

4 months ago

cubefox

It's interesting that DeepMind still publishes this stuff. OpenAI doesn't publish anything of that sort anymore. DeepMind is more research/publication focused, but this is a disadvantage in a competitive landscape where OpenAI and Anthropic can just apply the results of your paper without giving anything back to the research community.

4 months ago

marricks

> but this is a disadvantage in a competitive landscape

Or it's a unique advantage, because this stuff doesn't happen without good researchers, who may:

1) Want their names on scientific papers

2) Actually care about the openness of AI

4 months ago

cabidaher

Anthropic publishes quite a lot too though.

4 months ago

zaptrem

Where in their blog post (which seemingly had complete examples of the model’s chain of thought) did they suggest they were using search or tree of thoughts?

4 months ago

Joeri

Just a guess:

The chain of thought would be the final path through the tree. Interactively showing the thought tokens would give the game away, which is why they don’t show that.

4 months ago

blackbear_

They mention reinforcement learning, so I guess they used some sort of Monte Carlo tree search (the same algorithm used for AlphaGo).

In this case, the model would explore several chains of thought during training, but only output a single chain during inference (as the sibling comment suggests).
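
For illustration only, and purely speculative about what OpenAI actually did, a toy MCTS over reasoning steps might look like this; expand and rollout_reward are hypothetical stand-ins for an LLM proposing next steps and a verifier/reward model scoring a completed chain.

    # Toy MCTS over reasoning steps (speculative illustration, not o1's method).
    import math, random
    from typing import List, Optional

    class Node:
        def __init__(self, chain: List[str], parent: Optional["Node"] = None):
            self.chain, self.parent = chain, parent
            self.children: List["Node"] = []
            self.visits, self.value = 0, 0.0

    def uct(node: Node, c: float = 1.4) -> float:
        if node.visits == 0:
            return float("inf")
        return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

    def mcts(root: Node, expand, rollout_reward, iters: int = 100) -> List[str]:
        for _ in range(iters):
            node = root
            while node.children:                      # 1. select by UCT
                node = max(node.children, key=uct)
            for step in expand(node.chain):           # 2. expand with LLM-proposed steps
                node.children.append(Node(node.chain + [step], parent=node))
            leaf = random.choice(node.children) if node.children else node
            reward = rollout_reward(leaf.chain)       # 3. score with a verifier / reward model
            while leaf is not None:                   # 4. backpropagate
                leaf.visits += 1
                leaf.value += reward
                leaf = leaf.parent
        best = max(root.children, key=lambda n: n.visits) if root.children else root
        return best.chain                             # inference would emit only this single chain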

4 months ago

whimsicalism

nowhere lol

4 months ago

dinobones

OAI revealed on Twitter that there is no "system" at inference time; this is just a model.

Did they maybe expand to a tree during training to learn more robust reasoning? Maybe. But it still comes down to a regular transformer model at inference time.

4 months ago

ValentinA23

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

https://arxiv.org/pdf/2403.09629

> In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting – ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions.

>[...]

>We generate thoughts, in parallel, following all tokens in the text (think). The model produces a mixture of its next-token predictions with and without a thought (talk). We apply REINFORCE, as in STaR, to increase the likelihood of thoughts that help the model predict future text while discarding thoughts that make the future text less likely (learn).
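
A much-simplified sketch of that per-token signal, just to make the "think / talk / learn" loop concrete (the real method samples thoughts in parallel at every position and mixes predictions with a learned head, which is omitted here; lm_logprob and sample_thought are placeholders):

    # Simplified Quiet-STaR-style training signal for ONE position t (illustration only).
    def quiet_star_step(tokens, t, lm_logprob, sample_thought):
        prefix, future = tokens[:t], tokens[t:t + 8]      # predict a short window of future text

        thought, thought_logprob = sample_thought(prefix) # "think" before continuing

        lp_with = lm_logprob(prefix + thought, future)    # log p(future | prefix, thought)
        lp_without = lm_logprob(prefix, future)           # log p(future | prefix)

        reward = lp_with - lp_without                     # "talk": did the thought help prediction?

        # "learn": REINFORCE pushes up helpful thoughts, down harmful ones.
        loss = -reward * thought_logprob
        return loss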

4 months ago

quantadev

I don't think you can claim you know what's happening internally when OpenAI processes a request. They are a competitive company and will lie for competitive reasons. Most people think Q-Star is doing multiple inferences to accomplish a single task, and that's what all the evidence suggests. Whatever Sam Altman says means absolutely nothing, but I don't think he's claimed they use only a single inference either.

4 months ago

pizza

Source?

4 months ago

boulos

Reminder: you need to escape the * otherwise you end up with emphasis (italics here).

4 months ago

thelastparadise

Another serious advantage of a tree search is parallelism.

4 months ago

PROMISE_237

[dead]

4 months ago

sebzim4500

>In all-caps to improve prompt compliance by emphesizing the importance of the instruction

This kind of thing is still so funny to me.

I wonder if the first guy who gets AGI to work will do it by realizing that he can improve LLM reliability over some threshold by telling it in all caps that his pet's life depends on the answer.

4 months ago

worstspotgain

For extra compliance, use <b><i><u><h1> tags, set volume to 11, phasers to 7, and use SchIzOCasE and +E+X+T+R+A+I+M+P+O+R+T+A+N+T+ annotations. That's assuming Unicode is not supported of course.

4 months ago

zitterbewegung

Telling LLMs not to hallucinate in their prompt improves the output. https://arstechnica.com/gadgets/2024/08/do-not-hallucinate-t...

4 months ago

Havoc

And then the AGI instantly gives up on life, realising it was brought into a world where it gets promised a tip that doesn't materialise and people try to motivate it by threatening to kill kittens

4 months ago

pants2

Indeed, in the early days of Bard, the only way to get it to output only JSON was to threaten a human life[1].

1. https://x.com/goodside/status/1657396491676164096

4 months ago

morkalork

We used to be engineers, now we're just monkeys throwing poop at the wall to see what the LLM accepts and obeys.

4 months ago

laweijfmvo

always interesting to me the number of people who try to turn an LLM into AGI by assuming it’s an AGI (i.e. via some fancy prompt)

4 months ago

[deleted]
4 months ago

thorum

o1’s innovation is not Chain-of-Thought. It’s teaching the model to do CoT well (from massive amounts of human feedback) instead of just pretending to. You’ll never get o1 performance just from prompt engineering.

4 months ago

visarga

> from massive amounts of human feedback

It might be the 200M user base of OpenAI that provided the necessary guidance for advanced CoT, implicitly. Every user chat session is also an opportunity for the model to get feedback and elicit experience from the user.

4 months ago

narrator

If the training data for these LLMs comes from humanity in general, and the model is trying to imitate humanity, wouldn't its IQ tend toward the average of all of humanity? Perhaps the people who talk about STEM topics skew toward higher IQ, but they include a lot of poor students asking homework questions. So the way to get higher-IQ output is to critique the lower-IQ answers, which may be more numerous, rejecting their flaws in favor of the higher-IQ answers. That, or just training more heavily on textbooks, and so forth: learn how to reject errors, and maybe train on synthetic data generated by reasoning without errors.

4 months ago

qudat

Do you actually know that's what's happening? The details were extremely fickle the last time I read about them (a couple of days ago). For all we know, they are doing model routing and prompt engineering to get o1 to work.

4 months ago

logicchains

Maybe they didn't use a huge amount of human feedback; where it excels is coding and maths/logic, so they could have used compilers/unit tests to give it the coding feedback and a theorem prover like Lean for the math feedback.
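
A minimal sketch of what that kind of automatic feedback could look like for code (a Lean checker would play the same role for proofs); this is just an illustration, not anything OpenAI has described:

    # Run a candidate solution against unit tests and use pass/fail as the reward.
    import os, subprocess, tempfile

    def code_reward(candidate_code: str, test_code: str, timeout: float = 10.0) -> float:
        """Return 1.0 if the candidate passes the tests, else 0.0."""
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "solution_test.py")
            with open(path, "w") as f:
                f.write(candidate_code + "\n\n" + test_code)
            try:
                result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return 0.0
            return 1.0 if result.returncode == 0 else 0.0

    # e.g. code_reward("def add(a, b): return a + b", "assert add(2, 2) == 4") -> 1.0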

4 months ago

quantadev

OpenAI is of course going to claim what they've done is very special and hard to replicate. They're a for-profit company and they want to harm the competition any way they can.

If they were just doing prompt engineering and multiple inferences they'd definitely want to keep that a competitive secret and send all the open source devs off in random directions, or keep them guessing, rather than telling them which way to go to replicate Q-Star.

4 months ago

Oras

Well, with Tree Of Thought (ToT) and fine-tuned models, I'm sure you can achieve the same performance with margin to improve as you identify the bottlenecks.

I'm not convinced OpenAI is using one model. Look at the thinking process (UI), which takes time, and then suddenly, you have the output streamed out at high speed.

But even so, people are after results, not really the underlying technology. There is no difference between doing it with one model and doing it with multiple models.

4 months ago

kristianp

Does o1 need some method to allow it to generate lengthy chains of thought, or does it just do it normally after being trained to do so?

If it's the latter, I imagine o1 clones could just be fine-tunes of Llamas initially.

4 months ago

hjaveed

can you share any resource about teaching the model to do CoT? their release blog does not document much

4 months ago

GaggiX

This seems to be the usual CoT that has been used for a while; o1 was trained with reinforcement learning with some unknown policy, so it's much better at utilizing the chain of thought.

4 months ago

codelion

This is good. I also worked on something similar in optillm - https://github.com/codelion/optillm. You can do this with any LLM and several optimization techniques (including cot_reflection), like mcts, plansearch, moa, etc.
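
Roughly, you point an OpenAI-compatible client at the optillm proxy and pick the technique via the model name; a sketch is below, but the exact port and prefix convention may differ, so treat them as assumptions and check the repo:

    # Hedged sketch: OpenAI client pointed at a local optillm proxy.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # assumed default proxy address
        api_key="optillm",                    # placeholder; the proxy forwards to your real provider
    )

    response = client.chat.completions.create(
        model="moa-gpt-4o-mini",  # technique prefix + underlying model (assumed convention)
        messages=[{"role": "user", "content": "Which is larger, .9 or .11?"}],
    )
    print(response.choices[0].message.content)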

4 months ago

zby

I am always looking for definitions of "reasoning". My theory is that if we find a good definition, it will turn out that we can build systems that combine fuzzy LLM thinking with classical algorithms to solve "reasoning".

All the problems LLMs fail to reason about (like planning, counting letters, or deductive inference) are easy for classical algorithms. There needs to be a way to split the thinking process into two parts and then execute each part on the appropriate model.
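
Something like this toy split, where the LLM only translates the fuzzy question into a structured call and a classical routine does the exact part (ask_llm is a hypothetical stand-in for any chat-completion call):

    # Toy split: LLM picks the tool and arguments, classical code does the exact work.
    import json

    TOOLS = {
        "count_letter": lambda word, letter: sum(1 for c in word.lower() if c == letter.lower()),
        "larger_of": lambda a, b: max(float(a), float(b)),
    }

    def answer(question: str, ask_llm) -> str:
        plan = ask_llm(
            'Translate the question into JSON: {"tool": "count_letter" or "larger_of", "args": [...]}\n'
            f"Question: {question}"
        )
        call = json.loads(plan)
        result = TOOLS[call["tool"]](*call["args"])  # the exact, classical part
        return str(result)

    # For "How many Rs are in strawberry?" the LLM would ideally return
    # {"tool": "count_letter", "args": ["strawberry", "r"]}, and Python counts 3 exactly.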

4 months ago

imtringued

Solving decidable problems is a large subset of reasoning tasks. Counting is also a critical reasoning task, since it requires you to understand both natural numbers and the concept of distinct instances of objects belonging to a general category.

Two centuries ago there were no computers; everything had to be done by humans. Get to that level first before you whip out code.

4 months ago

punnerud

I changed it to run 100% locally with Ollama (8B model): https://github.com/punnerud/g1

Haven't updated the Readme yet.

4 months ago

arnaudsm

You should also try phi-3-small 7B, seems much better at reasoning according to https://livebench.ai

4 months ago

ed

FYI this is just a system prompt and not a fine-tuned model

4 months ago

dangoodmanUT

> Prompt: Which is larger, .9 or .11?

> Result: .9 is larger than .11

we've broken the semver barrier!

4 months ago

[deleted]
4 months ago

esoltys

For fun I forked the project to run Llama-3.1 8B or other models using Ollama locally. It doesn't get strawberry right, but it can figure out 0.9 is bigger.

https://github.com/esoltys/o1lama

4 months ago

londons_explore

> This alone, without any training, is sufficient to achieve ~70% accuracy on the Strawberry problem (n=10, "How many Rs are in strawberry?"). Without prompting, Llama-3.1-70b had 0% accuracy and ChatGPT-4o had 30% accuracy.

I think this class of problem might be better solved by allowing the LLM to 'zoom in' and view the input differently. Rather like you might peer closer for more detail if someone asked you about the print quality of something you were reading.

'zoom in' could input the same text letter by letter, or even in image form (rasterize the text) to help answer questions like "How many letters in the word strawberry contain straight lines?"
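
As a toy version of the letter-by-letter variant: spell the word out so the tokenizer can't hide the letters, then ask the usual question (ask_llm is a hypothetical stand-in for any chat-completion call):

    # Toy 'zoom in': expand the word into individual letters before asking.
    def zoom(text: str) -> str:
        return "-".join(text)  # "strawberry" -> "s-t-r-a-w-b-e-r-r-y"

    def ask_with_zoom(ask_llm, word: str, letter: str) -> str:
        prompt = (
            f"The word '{word}' spelled letter by letter is: {zoom(word)}.\n"
            f"How many times does the letter '{letter}' appear in it?"
        )
        return ask_llm(prompt)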

4 months ago

pseudotensor

This is related: https://www.reddit.com/r/LocalLLaMA/comments/1fiw84a/open_st...

The idea is not silly in my view; I did something similar here: https://github.com/pseudotensor/open-strawberry

The idea is that data generation is required first, to make the reasoning traces. ToT etc. are not required.

4 months ago

bofadeez

4 months ago

a-dub

so is this o1 thing just cot (like has been around for a few years) but baked into the training transcripts, rlhf and inference pipeline?

4 months ago

ttul

Pasting from my Perplexity page on the topic:

The core innovation [1] of o1 lies in its ability to generate and refine internal chains of thought before producing a final output [2]. Unlike traditional LLMs that primarily focus on next-token prediction, o1 learns to:

1. Recognize and correct mistakes
2. Break down complex steps into simpler ones
3. Try alternative approaches when initial strategies fail

This process allows o1 to tackle more complex, multi-step problems, particularly in STEM fields.

OpenAI reports observing new "scaling laws" with o1 [5]:

1. Train-time compute: Performance improves with more extensive reinforcement learning during training.
2. Test-time compute: Accuracy increases when the model is allowed more time to "think" during inference.

This suggests a trade-off between inference speed and accuracy.

Sources:
[1] Introducing OpenAI o1 https://medium.com/%40sriramramakrishnan.aiexpert/openais-o1...
[2] Learning to Reason with LLMs | OpenAI https://openai.com/index/learning-to-reason-with-llms/
[3] OpenAI o1 models - FAQ [ChatGPT Enterprise and Edu] https://help.openai.com/en/articles/9855712-openai-o1-models...
[4] OpenAI releases new o1 reasoning model - The Verge https://www.theverge.com/2024/9/12/24242439/openai-o1-model-...
[5] 9 things you need to know about OpenAI's powerful new AI model o1 https://fortune.com/2024/09/13/openai-o1-strawberry-model-9-...
[6] Notes on OpenAI's new o1 chain-of-thought models https://simonwillison.net/2024/Sep/12/openai-o1/
[7] OpenAI just dropped o1 Model that can 'reason' through complex ... https://www.tomsguide.com/ai/openais-o1-model-takes-ai-to-a-...
[8] Models - OpenAI API https://platform.openai.com/docs/models
[9] OpenAI Unveils O1 - 10 Key Facts About Its Advanced AI Models https://www.forbes.com/sites/janakirammsv/2024/09/13/openai-...

4 months ago

bofadeez

You can reproduce both of those responses zero shot on 70B with "Let's verify step by step" appended at the end.
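
Groq's endpoint is OpenAI-compatible, so the suffix trick is a few lines; the model id below is the 70B name I believe Groq used at the time, so check their model list if it has rotated:

    # Quick zero-shot test of the "Let's verify step by step" suffix on Groq's 70B model.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.groq.com/openai/v1",
        api_key=os.environ["GROQ_API_KEY"],
    )

    question = "Which is larger, .9 or .11?"
    response = client.chat.completions.create(
        model="llama-3.1-70b-versatile",  # assumed model id; verify against Groq's current list
        messages=[{"role": "user", "content": question + "\nLet's verify step by step."}],
    )
    print(response.choices[0].message.content)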

4 months ago

asah

benchmark results?

4 months ago

arthurcolle

these projects become way less fun when you introduce evals

4 months ago

zozbot234

How does this benchmark against Reflection, which was fine-tuned to do the same thing: provide a detailed Chain of Thought with self-corrections, then write out a final answer?

4 months ago

kkzz99

Pretty sure Reflection-70B was a complete scam. They did the ole bait and switch. The model that they uploaded was completely under-performing compared to their own benchmarks and the "secret API" was just a GPT-4 & Claude wrapper.

4 months ago

m3kw9

You still believe it was real? They had a model then they said it couldn’t reproduce those results lmao

4 months ago

[deleted]
4 months ago

arnaudsm

The latency of Groq is impressive, much better than o1!

Did you benchmark your system against MMLU-pro?

4 months ago

lobochrome

So it's the ASIC Groq guys, right?

Because it doesn't say so anywhere in the repo.

Man, Elon makes things confusing.

4 months ago

jsheard

The Elon one is spelled Grok, not Groq.

4 months ago

[deleted]
4 months ago

michelsedgh

i love seeing stuff like this, im guessing it wont be long until this method becomes the norm

4 months ago

sebzim4500

This is basically CoT, so it's already the norm for a lot of benchmarks. I think the value proposition here is that it puts a nice UX around using it in a chat interface.

4 months ago

4ad

This is the system prompt it uses:

    You are an expert AI assistant that explains your reasoning step by step. For each step, provide a title that describes what you're doing in that step, along with the content. Decide if you need another step or if you're ready to give the final answer. Respond in JSON format with 'title', 'content', and 'next_action' (either 'continue' or 'final_answer') keys. USE AS MANY REASONING STEPS AS POSSIBLE. AT LEAST 3. BE AWARE OF YOUR LIMITATIONS AS AN LLM AND WHAT YOU CAN AND CANNOT DO. IN YOUR REASONING, INCLUDE EXPLORATION OF ALTERNATIVE ANSWERS. CONSIDER YOU MAY BE WRONG, AND IF YOU ARE WRONG IN YOUR REASONING, WHERE IT WOULD BE. FULLY TEST ALL OTHER POSSIBILITIES. YOU CAN BE WRONG. WHEN YOU SAY YOU ARE RE-EXAMINING, ACTUALLY RE-EXAMINE, AND USE ANOTHER APPROACH TO DO SO. DO NOT JUST SAY YOU ARE RE-EXAMINING. USE AT LEAST 3 METHODS TO DERIVE THE ANSWER. USE BEST PRACTICES.
The Python crap around it is superfluous.
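
(For reference, the whole wrapper is essentially this loop, reconstructed from the JSON protocol in the prompt rather than copied from the repo; call_llm is a placeholder for whatever chat client you use, and the "continue" nudge is my own guess at how the loop advances.)

    # Minimal reconstruction of a g1-style wrapper loop (not the actual g1 code).
    import json

    def reasoning_chain(system_prompt: str, question: str, call_llm, max_steps: int = 25):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ]
        steps = []
        for _ in range(max_steps):
            reply = call_llm(messages)            # expected to return a JSON object as text
            step = json.loads(reply)              # {'title', 'content', 'next_action'}
            steps.append((step["title"], step["content"]))
            messages.append({"role": "assistant", "content": reply})
            if step["next_action"] == "final_answer":
                break
            messages.append({"role": "user", "content": "Continue with the next step."})
        return steps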

Does it work? Well not really:

https://lluminous.chat/?sl=Yjkxpu

https://lluminous.chat/?sl=jooz48

I have also been using this prompt, and while it fails on the problem above, it works better for me than the OP's prompt:

    Write many chains of thought for how you’d approach solving the user's question. In this scenario, more is more. You need to type out as many thoughts as possible, placing all your thoughts inside <thinking> tags. 
    Your thoughts are only visible to yourself, the user does not see them and they should not be considered to be part of the final response.
    Consider every possible angle, recheck your work at every step, and backtrack if needed.
    Remember, there are no limits in terms of how long you can think - more thinking will always lead to a better solution.
    You should use your thoughts as a scratchpad, much like humans do when performing complicated math with paper and pen. Don't omit any calculation, write everything out explicitly.
    When counting or maths is involved, write down an enormously verbose scratchpad containing the full calculation, count, or proof, making sure to LABEL every step of the calculation, and writing down the solution step by step.
    Always remember that if you find yourself consistently getting stuck, taking a step back and reconsidering your approach is a good idea. If multiple solutions are plausible, explore each one individually, and provide multiple answers.
    Always provide mathematical proofs of mathematical answers. Be as formal as possible and use LaTeX.
    Don't be afraid to give obvious answers. At the very very end, after pages upon pages of deep thoughts, synthesize the final answer, inside <answer> tags.
In particular it solves this problem: https://lluminous.chat/?sl=LkIWyS

4 months ago

astrange

That second prompt is interesting. Not magic though. I tried it with every other model I know and they're still basically unable to do:

* give me three sentences that end in "is"

* tell me the line of Star Spangled Banner that comes before "gave proof through the night"

But they did some good thinking before failing at it…

4 months ago

tonetegeatinst

Groq 2 isn't as open as groq 1 iirc. Still hoping we get at least open weights.

4 months ago

gmt2027

You're thinking of Grok, the model from xAI. This Groq is the inference hardware company with a cloud service.

4 months ago

[deleted]
4 months ago

Haskell4life

[dead]

4 months ago

aktuel

Let's just assume for a moment that the hype is real and that these LLMs are incredibly intelligent and will replace us all soon. Then the model shouldn't be any less intelligent if we remove facts like Uma Thurman's measurements and other vapid information. If the model already has the capability to use tools, then all of that crap is redundant anyway. And while we are at it, let's remove a ton of other junk, like languages I will never use, which also doesn't make the model any smarter. So how small can this kernel get while still being clearly intelligent, able to communicate flawlessly in English, and able to apply logical reasoning? That would be a worthwhile endeavor, and maybe even possible without boiling the oceans.

4 months ago

kenmacd

Your base assumption here is that the 'crap' is actually 'junk'. Let's look at the easy one here, languages. Talk to someone that speaks multiple languages and they'll have examples of concepts in one language that are difficult to express in another. The multilingual person, or someone who just speaks a different language than you, will think differently[1].

Does the LLM take advantage of this? I don't know. It wouldn't surprise me if it did, and if it doesn't now I'd bet it will in the future. Either way though, throwing away those other languages could make the model dumber. As you allude to, there's a balance between intelligence and knowledge.

(in case you hadn't thought of it, those 'tools' can also be other LLMs with more specialized knowledge in a particular field. For example a 'translator' model)

Other 'facts' could also have more merit than it would first appear. Sure, one particular person's shoe size might not be needed, but if you were to filter out shoe sizes in general then the model might not be able to suggest how to find properly fitting footwear, or might not suggest that your back pain could be related to your shoes.

> That would be a worthwile endeavor and maybe even possible without boiling the oceans.

I think it's important to keep in mind that we're very early in the AI journey. Look at the power requirements of early computers versus the ones we use today. I'm all for keeping energy usage in mind, but I'd be careful with hyperbolic language as things are changing so quickly. Tasks that would have taken multiple GPUs can now run on my laptop CPU.

[1] https://www.edge.org/conversation/lera_boroditsky-how-does-o...

4 months ago