Voyager: An Open-Ended Embodied Agent with LLMs
Though, I also don't individually track muscle fibers, and there are strong indications that a lot of my own behaviors are closer to API calls than direct control.
This is very cool despite the most important caveat:
“Note that we do not directly compare with prior methods that take Minecraft screen pixels as input and output low-level controls [54–56]. It would not be an apple-to-apple comparison, because we rely on the high-level Mineflayer API to control the agent. Our work’s focus is on pushing the limits of GPT-4 for lifelong embodied agent learning, rather than solving the 3D perception or sensorimotor control problems. VOYAGER is orthogonal and can be combined with gradient-based approaches like VPT as long as the controller provides a code API.”
I wonder how many prompts this uses in a minute.
Interestingly, they mod the server so that the game pauses while waiting for a response from GPT-4. That's a nice way to get around the delays.
This is kind of amazing given that obviously, GPT-4 never contained such tasks and data. I think it puts an end to the claim that "language models are only stochastic parrots and cannot do any reasoning". No, this is 100% a form of reasoning and furthermore, learning that is more similar to how humans learn (gradient-less).
I still don't understand it and it blows my mind - how such properties emerge just from compressing the task of next word prediction. (Yes, I know this is an oversimplification, but not a misleading one).
> GPT-4 never contained such tasks and data
No task, but we need to be clear that it did have the data. Remember that GPT4 was trained on a significant portion of the internet, which likely includes sites like Reddit and game fact websites. So there's a good chance GPT4 learned the tech tree and was trained on data about how to progress up that tree, including speed runner discussions. (also remember that as of March GPT4 is also trained on images, not just text)
What data it was trained on is very important and I'm not sure why we keep coming back to this issue. "GPT4 has no zero-shot data" should be as drilled into everyone's head as sayings like "correlation does not equate to causation" and "garbage in, garbage out". Maybe people do not know this data is on the internet? But I'm surprised if the average HN user thought that way.
This doesn't make the paper less valuable or meaningful. But it is more like watching a 10 year old who's read every chess book and played against computers beat (or do really well against) a skilled player, versus a 10 year old who's never heard of chess beating a skilled player. Both are still impressive; one just seems like magic, though, and should raise suspicion.
Looking at the paper, as I understand it they're using Mineflayer https://github.com/PrismarineJS/mineflayer and passing parts of the state of the game as JSON to the LLM that are used for code generation to complete tasks.
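To make the "state as JSON into a code generation prompt" idea concrete, here is a minimal sketch. It is not taken from the Voyager repository: the `build_prompt` helper, the field names in `state`, and the prompt wording are all assumptions made up for illustration.

```python
import json


def build_prompt(state: dict, task: str) -> str:
    """Serialize a game-state snapshot to JSON and embed it in a code-generation prompt."""
    # sort_keys makes the serialized state deterministic across runs.
    state_json = json.dumps(state, indent=2, sort_keys=True)
    return (
        "You control a Minecraft bot through the Mineflayer JavaScript API.\n"
        f"Current state:\n{state_json}\n"
        f"Task: {task}\n"
        "Reply with JavaScript code only."
    )


# Hypothetical snapshot: these field names are illustrative, not Voyager's schema.
state = {
    "biome": "plains",
    "health": 20,
    "inventory": {"oak_log": 3},
    "nearby_blocks": ["grass_block", "oak_log"],
}

prompt = build_prompt(state, "Craft a wooden pickaxe")
print(prompt)
```

The point is only that the model never sees pixels in this setup: it sees a text rendering of whatever the API exposes, and replies with code against that same API.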
> I still don't understand it and it blows my mind - how such properties emerge just from compressing the task of next word prediction.
The Mineflayer library is very popular, so all the relevant tasks are likely already extant in the training data.
> I think it puts an end to the claim that "language models are only stochastic parrots and cannot do any reasoning".
But then two sentences later:
> I still don't understand it and it blows my mind
I've said this before to others and it bears repeating because your line of thinking is dangerous (not sudden AI cataclysm): to feel so totally qualified to make such a statement armed with ignorance, not knowledge, is the cause of mass hysteria around LLMs.
What is happening can be understood without resorting to the sort of magical thinking that ascribes agency to these models.
> What is happening can be understood without resorting to the sort of magical thinking that ascribes agency to these models.
This is what has (as an ML researcher) made me hate conversations around ML/AI recently. Honestly, it's getting me burned out on an area of research I truly love and am passionate about. A lot of technical people are openly and confidently talking about magic, talking as if the model didn't have access to relevant information (the "zero-shot myth") and other such nonsense. It is one thing for a layman to say these things, but another to see them in the top comment on a website aimed at people with high tech literacy. And even worse to see it coming from my research peers. These models are impressive, and I don't want to diminish that (I shouldn't have to say this sentence but here we are), but we have to be clear that the models aren't magic either. We know a lot about how they work too. They aren't black boxes; they are opaque, and every day we reduce the opacity.
For clarity: here's an alternative explanation of the results that's even weaker than the paper's settings (it explains autogpt better). The LLM has a good memory. The LLM is told (or can infer through relevant information like keywords: "diamond axe") that it is in a Minecraft setting. It then looks up a compressed version of a player's guide that was part of its training data. It then uses that data to execute goals. This is still an impressive feat! But it is still in line with the stochastic parrot paradigm. I'm not sure why people think stochastic parrots aren't impressive. They are.
But right now ML/AI culture feels like Anime or weed culture. The people it attracts make you feel embarrassed to be associated with it.
> But it is still in line with the stochastic parrot paradigm.
What makes us different from 'stochastic parrots'? Or where creativity, which machines don't have by definition, begins and ends?
There are a bunch of philosophical questions here, but LLMs are more than just parrots. They develop multi-level pattern recognition, and they can solve multi-step problems that they have never seen before. Maybe they have seen each individual step, but not the whole combination. Selecting the right combination out of zillions is not exactly 'parroting'. It doesn't matter what we call it; it has extremely high potential in the real physical world. It looks like that is the near future.
We are witnessing the emergence of 'Verbose AI', IMHO, which is more than just NLP.
>LLM has a good memory. LLM is told (or can infer through relevant information like keywords: "diamond axe") that it is in a minecraft setting. It then looks up a compressed version of a player's guide that was part of its training data. It then uses that data to execute goals.
What about any of what you've just said screams parrot to you?
I mean here is how the man who coined the term describes it.
A "stochastic parrot", according to Bender, is an entity "for haphazardly stitching together sequences of linguistic forms … according to probabilistic information about how they combine, but without any reference to meaning."
So... what exactly from what you've just stated implies the above meaning?
> What about any of what you've just said screams parrot to you ?
>>LLM has a good memory.
Pretty much this.
> the man
The woman. Bender is a woman. In fact, 3 of 4 of the authors are women and the 4th has an unknown identity.
> according to probabilistic information about how they combine, but without any reference to meaning.
This is the part. I don't think the analogy of the parrot is particularly apt because we all know that the parrot doesn't understand calculus but is able to repeat formulas if you teach it. But we have to realize that there are real world human examples of stochastic parrots, and these are more akin to LLMs. If you don't know the phrase "Murray Gell-Mann Amnesia" let me introduce you to it. It is the concept that you can hear a speaker/writer talk about a subject matter you're familiar with, see them make many mistakes, and then, when they move to a subject matter you are not familiar with, you trust them. We can call this writer or speaker a stochastic parrot as well, since they are using words to sound convincing but they do not actually know the meaning behind the words. It is convincing because it matches the probabilistic information that a real expert may use. The difference is in understanding.
But this gets us to a topic at large that is still open: what does it mean to understand? We have no real answer to this. But a well agreed upon part of the definition is the ability to generalize: to take knowledge and apply it to new situations. This is why many ML researchers are looking at zero-shot tasks. But in the current paradigm this term has become very muddied and in many cases is being used incorrectly (you can see my rants about code generation with HumanEval or how training on LAION doesn't allow for zero shot COCO classification).
For this work specifically, we need to evaluate and think about understanding carefully. The critique I am giving is that people are acting as if this is similar "understanding" to how we may drop a 10 year old into Minecraft and that 10 year old can figure out how to play the game despite never hearing about the game before (though they may have played other games before, and Minecraft is also many kids' "intro game"). This is clearly not what is happening with GPT. GPT has processed a lot of information on the game before entering its environment. It has read guides on how to play and how to optimize gameplay, it has seen images of the environment (though this version doesn't use pixel information), and it has even read code for bots that will farm items. The prompts used in this work tell GPT to use Mineflayer. They also tell it things like the fact that mining iron ore gets you raw iron, and give several other strong hints about how to play the game. Chain of Thought (CoT) prompts also bring into doubt the understanding nature of an LLM, and really provide a strong case against understanding (since this is something an understanding creature would already consider). CoT is adding recurrent information into the bot and this causes statistical (Bayesian) updates. This is not dissimilar from allowing you to reroll a set of dice while also being able to load the dice. You can argue that CoT is part of the thought process for an entity that understands things, but we need to recognize that this is not inherent to how GPT does things. You may want to draw an analogy to teaching a child something, having them confidently spit out the wrong answer, and then saying "are you sure?", but we need to be careful when drawing these parallels and think about them with nuance. The nuance is critical here.
But I want to give you some more intuition into this understanding idea. We attribute understanding to many creatures and I'll select a subset that is more difficult to argue against: mammals and birds. While they don't understand everything at the level of humans, it is clear that there are certain tasks they understand, such as using tools, quickly adapting to novel environments, and much more. But there's a key clue here: we know that they can all simulate their environments. How? Because they dream. I can't help but think this is part of the inspiration for Philip K Dick naming his book that way, since the question we're getting at is part of its central theme. But as for GPT, it isn't embodied. It does not seem to be able to answer questions about itself and it has shown clear difficulties in simulating any environment. While it can make some hits, it makes more misses.
TLDR: see this prompt and ChatGPT's response: https://i.imgur.com/sK4pLw0.png
Fwiw: Bard answers similarly to ChatGPT: https://i.imgur.com/CmWsf9X.png https://i.imgur.com/QJXIBDl.png https://i.imgur.com/zSGjYss.png
Side Note: I'm often critical of Bender myself. I think she is far too harsh on LLMs and is promoting doomerism that isn't helpful. But this has nothing to do with the meaning of Stochastic Parrot. We should also recognize that the term has changed as it has entered the lexicon and adapted. Just like every other word/phrase in human language.
> TLDR: see this prompt and ChatGPT's response
And wow, that's GPT4.
I've had similar thoughts to yours. It feels like amazing intelligence one day, but the next it seems like an extremely good, but naive, pattern matcher.
I've experienced similar GPT-4 disappointments trying to teach it concepts not well represented in training data (it does badly) or making modifications to programs that go outside training data (e.g. make a tax calculator calculate long term capital gains tax correctly). It ends up doing much worse than a human.
To be clear, that's ChatGPT, not GPT4. GPT4 should be better, but it is still limited beta and I haven't bothered joining. Note that 3.5-turbo (the API) is worse
> They both weigh the same amount, which is 1 pound.
It is clearly a strong example of Murray Gell-Mann Amnesia when we can't trust it to tell us the difference between two simple things but we trust it to tell us complicated things.
It is also a clear example of how it is a stochastic parrot -- doesn't understand what it is saying -- as it even explains the reasoning and is not self consistent. We wouldn't expect an entity that can understand something to be wildly inconsistent in such a short period of time. Clearly the model is relying more on the statistics of the question (the pattern and frequency with which most of those words appear in that order) than on the actual content and meaning of those words.
Despite this, I still frequently use LLMs. I just scrutinize them and don't trust them. Utility and trust are different things and people seem to be forgetting this.
>> To be clear, that's ChatGPT, not GPT4. GPT4 should be better, but it is still limited beta and I haven't bothered joining.
Well, I can predict the next few token sequences you're about to get in response to your comment. "That's why you got that answer GPT4 is so much better" etc.
Regarding your earlier comment about burnout, you're not alone. I stayed on HN because I could have the occasional good discussion about AI. There were always conversations that quickly got saturated with low-knowledge comments, the inevitable effect of discussions about "intelligence", "understanding" and other things everybody has some experience with but for which there is no commonly accepted formal definition that can keep the discussion focused. That kind of comment used to be more or less constant in quantity and I could usually still find the informed users' corner. After ChatGPT went viral though, those kinds of comments have really exploded and most conversations have no more space for reasoned and knowledgeable exchange.
>> LLM has a good memory.
Btw, intuitively, neural nets are memories. That's why they need so much data and still can't generalise (but, well, they need all that data because they can't generalise). There's a paper arguing so with actual maths, by Pedro Domingos but a) it's a single paper, b) I haven't read it carefully and c) it's got an "Xs are Ys" type of title so I refuse to link it. With LLMs you can sort of see them working like random access memories when you have to tweak a prompt carefully to get a specific result (or like how you only get the right data from a relational database when you make the right query). I think, if we trained an LLM to generate prompts for an LLM, we'd find that the prompts that maximise the probability of a certain answer look nothing like the chatty, human-like prompts people compose when speaking to a chatbot, they'd even look random and incomprehensible to humans.
Well, it is good to know I'm not alone. These are strange times indeed. I often think one of the great filters of civilizations is overcoming a biological mechanism that designs brains to think simply (cheap compute; complexity is often unnecessary for survival objectives) and then advancing to a level of civilization where a significant number of the problems the civilization faces require going beyond first and second order approximations. (This happens once most challenges have been solved to first and second order approximations.) Unless one is able to rewire their consciousness I don't see how this wouldn't be an issue for any species, but maybe I'm thinking too narrowly or from too much of a bias.
> GPT4 should be better, but it is still limited beta and I haven't bothered joining.
Ah, my bad. GPT4 via Bing (Precise mode) gets it correct:
> A kilogram of feathers weighs more than a pound of bricks. A kilogram is a metric unit of mass and is equivalent to 2.20462 pounds. So, a kilogram is heavier than a pound.
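For anyone who wants the arithmetic behind that answer spelled out, here is a tiny sketch. The conversion factor is the one quoted in the response above; the `to_pounds` helper is just an illustrative name.

```python
LB_PER_KG = 2.20462  # conversion factor quoted in the answer above


def to_pounds(value: float, unit: str) -> float:
    """Convert a weight given in 'kg' or 'lb' to pounds."""
    if unit == "kg":
        return value * LB_PER_KG
    if unit == "lb":
        return value
    raise ValueError(f"unknown unit: {unit}")


feathers_lb = to_pounds(1, "kg")  # a kilogram of feathers
bricks_lb = to_pounds(1, "lb")    # a pound of bricks

print(feathers_lb > bricks_lb)  # True: the material is irrelevant, only the units matter
```

The trick in the question is that it pattern-matches the classic "pound of feathers vs pound of bricks" riddle while changing one unit, so the memorized answer ("they weigh the same") is wrong.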
While I do believe LLMs can perform some reasoning, I'm not sure this is the best example, as all the reasoning you would ever need for Minecraft is well contained in the data set used to train it. A lot has been written about Minecraft.
To me, it would be more convincing if they developed an entirely new game with somewhat novel and arbitrary rules and saw if the embodied agent could learn this game.
I read through the code and tried it out for 15 mins.
It's a hard-coded program that can do a text search for its own hard-coded, human-implemented functions. Apparently it can string those functions together, but doesn't do it correctly.
20 minutes of light reading through the repository pretty much dispels any notions that this is a self-learning system that can reason and think. It's the same Minecraft automation we have been seeing for a decade now, with a chatbot text search built in.
Semantics of how this works aside, take a moment to appreciate how easy it is to remap the variable “zombie” to “human” in a prompt without the model altering its behavior. It instantly makes you realize the immensity of the AI safety & alignment problem.
Just checked by talking to the free version of ChatGPT, and yes, the Mineflayer API docs are indeed in its training set. It can give me detailed instructions on how to build a Minecraft bot. And of course, it also knows the entire Minecraft tech tree very well.
So this isn't really open-ended work; it's just making it do something it is already trained on, by connecting it to an API whose docs it has learned.
However, the skills library that it writes with live feedback from runtime errors and that it retrieves with a vector DB is really interesting. In that sense it looks like a very interesting code generation application.
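A rough sketch of what "skill library plus retry on runtime errors" could look like, to make the idea concrete. This is not the Voyager implementation: the bag-of-words `embed` function stands in for a real embedding model and vector DB, and `SkillLibrary`, `refine`, and the stub generator and runner below are all hypothetical names invented for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SkillLibrary:
    """Stores (description, embedding, code) triples and retrieves by similarity."""

    def __init__(self):
        self.skills = []

    def add(self, description: str, code: str) -> None:
        self.skills.append((description, embed(description), code))

    def retrieve(self, query: str, k: int = 1):
        q = embed(query)
        ranked = sorted(self.skills, key=lambda s: cosine(q, s[1]), reverse=True)
        return [(desc, code) for desc, _, code in ranked[:k]]


def refine(generate, run, max_rounds: int = 3):
    """Call generate(feedback) until run(code) raises no error; return the code or None."""
    feedback = None
    for _ in range(max_rounds):
        code = generate(feedback)
        try:
            run(code)
            return code
        except Exception as exc:
            # The error text becomes the feedback for the next generation round.
            feedback = f"{type(exc).__name__}: {exc}"
    return None


# Retrieval demo with hypothetical skills (code bodies are placeholder strings).
library = SkillLibrary()
library.add("craft a wooden pickaxe", "craftWoodenPickaxe")
library.add("fight a zombie with a sword", "fightZombie")
best = library.retrieve("craft a stone pickaxe")[0][0]
print(best)  # craft a wooden pickaxe

# Feedback-loop demo with stubs in place of the LLM and the game.
attempts = []


def fake_generate(feedback):
    attempts.append(feedback)
    return "ok" if feedback else "broken"


def fake_run(code):
    if code != "ok":
        raise RuntimeError("no pickaxe in inventory")


result = refine(fake_generate, fake_run)
print(result)  # ok
```

Seen this way, the interesting part is the loop structure (retrieve similar code, generate, execute, feed the error back), not any claim about what the model does or doesn't understand.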
TLDR. An AI system with an IQ of 110-130, with some careful prompting can generate code to play Minecraft through an API.
>mines straight up and down
I’m not sure what point you are making. GPT-4 does tend to pass various IQ tests with scores in the range of 110 to 130, with outliers between 90 and 150.
It's a joke about playing the game "right." Mining straight up/down is a rather suboptimal strategy, as:
- mining straight up means you either seal your path behind you, or are limited how high up you can go
- mining straight down likely traps you in a pit
- mining straight down far enough can drop you straight into lava, as many Minecraft players learn early on
I'm not sure how the authors arrive at the idea that this agent is embodied or open-ended. It is sending API calls to minecraft, there's no "body" involved except as a symbolic concept in a game engine, and the fact that minecraft is a video game with a limited variety of behaviors (and the authors give the GPT an "overarching goal" of novelty) precludes open-endedness. To me this feels like an example of the ludic fallacy. Spitting out "bot.equip('sword')" requires a lot of non-LLM work to be done on the back end of that call to actually translate to game mechanics, and it doesn't indicate that the LLM understands anything about what it "really" means to equip a sword, or that it would be able to navigate a real-world environment with swords etc.