MTG Bench: Testing how well LLMs can play Magic

66 points

4 days ago

by CallumFerg

Comments

derac

I think running them against each other with a rules engine would be more interesting. Count up illegal moves and wins/unfinished games. I think llm grading is too unreliable.

3 days ago

josh_p

I know the author specifically did not use a rules engine in their simulation because of uncertainty on how it would affect it.

I do still wonder if adapting something like card forge for llm use would result in engaging gameplay with an llm.

https://github.com/Card-Forge/forge

3 days ago

CallumFerg

I actually considered using card forge when I started this. I mostly didn't end up using it because of how much more work it would have been.

But also, with a rules engine you have to manually go through every step and pass priority after every action.

I think it makes more sense to let an LLM play magic like a person would. On early turns it is acceptable to say "I play a land and pass" without going through every phase. And you can say "I tap all my land and play this card" without having to use a tool call and agent turn for every land tap.

Also card forge would not let you goldfish a deck. You must have opponents.

3 days ago

fc417fc802

Those things sound less like general problems with rules engines and more like deficiencies of card forge IMO.

3 days ago

veqq

MTG Arena uses the CLIPS rules engine (an s-expression-based expert system built on the Rete algorithm), which an acquaintance wrote a course for: https://ryjo.codes/tour-of-clips.html and even a declarative chat server in it: https://ryjo.codes/articles/a-simple-tcp-server-written-in-g...

    (defrule connection
      (connection ?id)
      =>
      (println "User " ?id " connected")
      (printout ?id "Welcome to the chatroom from CLIPS!" crlf)
      (do-for-all-facts ((?f connection)) (neq ?id (nth$ 1 ?f:implied))
          (printout (nth$ 1 ?f:implied) "User " ?id " connected" crlf)))
    
    (defrule say
      (connection ?id)
      ?f <- (message-buffered ?id)
      ?ff <- (message ?id ~/me ?message)
      =>
      (retract ?f ?ff)
      (printout ?id "You: " ?message crlf)
      (do-for-all-facts ((?f connection)) (neq ?id (nth$ 1 ?f:implied))
        (printout (nth$ 1 ?f:implied)
         ?id ": " ?message crlf)))

3 days ago

josh_p

I was about to ask why someone would reach for CLIPS over implementing their own rules engine in the language of the rest of the application (I did this once).

It’s answered on the same site. https://ryjo.codes/articles/forgoing-implicity-using-abstrac...

Thank you for sharing! A lot of good stuff here!

2 days ago

fc417fc802

> because of uncertainty on how it would affect it.

Have the LLM submit a proposed move and either advance the game state or reply "permission denied, try again". Probably also log the number of times it happens, since attempted violations seem like a valuable signal as well.
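For illustration, here is a minimal Python sketch of that propose/validate/retry loop. The `propose` and `is_legal` callables are hypothetical stand-ins for an LLM call and a rules-engine check; neither comes from the benchmark or from any real engine's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MoveResult:
    move: Optional[str]  # the accepted move, or None if every attempt was rejected
    violations: int      # how many proposals the engine rejected


def request_legal_move(
    propose: Callable[[str, Optional[str]], str],  # hypothetical LLM wrapper
    is_legal: Callable[[str, str], bool],          # hypothetical rules-engine check
    state: str,
    max_attempts: int = 3,
) -> MoveResult:
    """Ask for a move until the engine accepts one or the attempts run out."""
    feedback: Optional[str] = None
    violations = 0
    for _ in range(max_attempts):
        move = propose(state, feedback)
        if is_legal(state, move):
            return MoveResult(move, violations)
        # Rejected: count it and tell the model why on the next attempt.
        violations += 1
        feedback = f"permission denied, try again: {move!r} is not legal"
    return MoveResult(None, violations)
```

The `violations` count is the signal mentioned above: it can be logged per turn even when the model eventually finds a legal move.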

3 days ago

OsrsNeedsf2P

I love obscure benchmarks, and I feel like I can trust their results a lot more. After all, they (probably) weren't benchmaxxed. RuneBench[0] is another good example (how well LLMs can play Runescape).

[0] https://maxbittker.github.io/runebench/

3 days ago

OwenCR

Sadly this benchmark removes the part of MTG that is most interesting: the opponent(s). Without opponents you simply don't have a game. You just have a rules engine - quite boring!

I think I object more to the decks used in testing than the machines' decisions. I do have nitpicks, though: this hand is quite poor and should have been mulliganed: https://app.mtgautodeck.com/public/benchmarks/4bd9955b-ebe1-.... The poor runout reinforces this decision.

This project is cool though, props for making it!

3 days ago

CallumFerg

Admittedly, the mulligan phase system prompt is the weakest part of the project. I had to add heuristics to stop the LLMs from mulliganing down to just a few cards while looking for a perfect hand. The scoring for the benchmark is mostly based on whether the LLM could complete legal turns, not good turns.

https://github.com/CallumFerguson/mtg-auto-deck/blob/a877c08...

3 days ago

comex

Gotta walk before you can run.

3 days ago

jdmoreira

I have a version of this where I have the LLMs play the duel decks "Elves vs. Goblins" against each other using xMage as a rules engine.

Unfortunately, it gets really expensive to run, even with some optimizations for the context.

I can only afford to play them with the DeepSeek models. They make serious blunders sometimes. This is not an easy "harness" to build, and I don't have the time or disposable cash to work on it. I think a lot of work could still be done on improving it and testing better models.

It would make an amazing "arena" bench. There are plenty more duel decks that are well balanced against each other.

3 days ago

lavaman131

This is a really interesting benchmark, and a timely one, given that a lot of existing benchmarks don't do a good job. The mechanics and edge cases seem notoriously difficult to parse, perhaps even for human players. Have you also been plugging these into newer reasoning models to see how giving them thinking time improves their win rate against the baseline?

3 days ago

CallumFerg

Since the library tools are just an MCP server, I did some testing on ChatGPT and Claude, where I don't have to pay for API credits.

With maximum thinking and web search to look up magic rules, I didn't ever see it make a mistake. It is probably better at following the rules than the average magic player (but not better at making the most strategic moves).

The benchmark was mostly to find out which is the cheapest model, at the lowest reasoning effort, that would provide a good experience for the app. The answer turned out to be that, for now, there is no cost-effective way to run this app.

To provide a good experience, the simulations either need to be near instant, or you need to be able to run dozens or hundreds of simulations in parallel and do statistical analysis.
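As a rough illustration of why the number of simulations matters: if each simulation yields a yes/no outcome, a confidence interval on the observed rate shows how much a small batch can be trusted. The sketch below uses the standard Wilson score interval; the function name and the choice of this particular interval are my assumptions, not something taken from the app.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 5 successes in 10 runs the interval is far wider than with 50 in 100, which is the practical argument for running hundreds of simulations rather than a handful.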

3 days ago

lavaman131

Ahh I see, thanks for sharing more about how you experimented with this.

2 days ago

alasdair_

I wrote a rules engine in Rust, along with an MCTS-based reinforcement learning system to play decks against each other. It can handle aggro decks well enough, but complex combo decks like Amulet Titan are tough to get working without expert demos or reward hacking.

3 days ago

jmccaf

Awesome! Does this use https://mage-bench.com/ or is it a separate project? I ran 4 local models in a tournament recently with mage-bench on an RTX 5090; Qwen 3.6 27B won narrowly over Gemma 4.

3 days ago

CallumFerg

No, I was not aware of that project when I made this.

I'll have to look into that project, but I also have an RTX 5090 and did a lot of testing with Qwen3.6 27B and Gemma 4 31B. I was not able to get them to play legal turns consistently. I had to keep expanding the system prompt and adding rules for edge cases. By the end, the prompt was over 10k tokens, and while they mostly made legal turns, they did not make good turns. And all the heuristics in the prompt degraded the performance and increased the cost for frontier models.

3 days ago

dash2

You don't explain how scoring works, maybe it's obvious to MTG players? If you're using gpt 5.5, is there a possibility that it is biased in favour of models that think the way it does?

3 days ago

CallumFerg

The scoring is just based on a simple prompt, which is given the game state at the start and end of the turn, the log of tool calls, and the final turn summary. The prompt asks it to rate the quality of the simulation from 0 to 10 and to give a pass or fail for whether it is legal.

It is far from ideal, but from my testing, even underpowered small LLMs that could not complete a single legal turn were reasonably good at judging if a simulation was legal. The final judging was all done by gpt-5.5 (medium) which might have given the OpenAI models an advantage, but from all the simulations I looked at, it seemed pretty fair.
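To make the scoring concrete, here is a small sketch of how per-simulation verdicts (a 0 to 10 quality score plus a legal pass/fail) could be rolled up per model. The `Verdict` record and `summarize` function are hypothetical names for illustration; the project's actual data format may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Verdict:
    model: str    # name of the model that played the turn
    quality: int  # judge's quality score, 0 to 10
    legal: bool   # judge's pass/fail on legality


def summarize(verdicts: list[Verdict]) -> dict[str, dict[str, float]]:
    """Group verdicts by model and compute legal-turn rate and mean quality."""
    by_model: dict[str, list[Verdict]] = {}
    for v in verdicts:
        if not 0 <= v.quality <= 10:
            raise ValueError(f"quality out of range: {v.quality}")
        by_model.setdefault(v.model, []).append(v)
    return {
        model: {
            "runs": len(vs),
            "legal_rate": sum(v.legal for v in vs) / len(vs),
            "mean_quality": mean(v.quality for v in vs),
        }
        for model, vs in by_model.items()
    }
```

Keeping the legality rate separate from the quality score makes it easier to see when a model plays legal but poor turns.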

This benchmark ended up being more of a test of how well an LLM can call tools without contradicting itself or backtracking. Most of the failures were not because of breaking Magic rules, but because the model could not sequence the tool calls correctly.

For example: https://app.mtgautodeck.com/public/benchmarks/6349dda2-4069-...

and: https://app.mtgautodeck.com/public/benchmarks/dcc18bd8-339d-...

The failure mode seems to be that some models are overly trained to start tool calls, even when the model itself knows that it should not be calling the tool. Neither of those examples counted as an error because the judge prompt said it was illegal. In both, the model stopped the simulation itself, knowing that it had made a tool error.

The Opus 4.8 examples are especially weird because it will consistently make the same tool call error 2 or 3 times in a row, and it will put things like "placeholder" or "noop" for the tool call reason.

3 days ago

purple-leafy

Benchmarks like this are onto something. Next frontier of llm benchmarking

3 days ago

thurn

To clarify, the more accurate description would be "Testing how well LLMs can follow the rules of Magic", right? There is no actual evaluation of how "well" they are playing?

3 days ago

TZubiri

Looking forward to this metric being Goodhart lawed.

Like how the strawberry example was overtrained for, or how the pelican on a bike started being used in official release posts.

3 days ago

gravitronic

Magic is complicated. I looked at doing something like this but the open-ended nature where one specific card will completely change the rules or require a series of followup events or modifications to the rules engine at hand is just tremendous.

3 days ago

akoboldfrying

I was wondering how complicated it could really be, and it turns out that some people showed in 2019 that it's Turing-complete -- meaning that any conceivable computation can be simulated by a MTG game, indeed a game in which every move by every player is forced: https://arxiv.org/abs/1904.09828

IOW, it's as complicated as possible.

3 days ago

mckn1ght

Someone made a video based on the paper, if you want to see the cards being used and a little more explanation: https://www.youtube.com/watch?v=pdmODVYPDLA

3 days ago

[deleted]

2 days ago

8note

or that certain cards, when played together, make an infinite loop, and so cannot be played/insta-die

3 days ago

olmo23

There is also this Matt Parker video about MTG, in which he explores a specific three-card combination that produces an ungodly amount of creature tokens.

https://www.youtube.com/watch?v=x3dE-NJ1UDQ

3 days ago

fc417fc802

You misspelled insta-win. Infinite turn combos are the best.

3 days ago

rcxdude

Only if there's a player choice in the loop. If there's a mandatory infinite loop, the game ends in a draw.

2 days ago

danbrooks

Very cool. I’ve been daydreaming about whether LLMs can be used to reason through gaming decisions.

3 days ago

pilord314

They should randomize games of judge tower and see who wins:

https://mtg.fandom.com/wiki/Judge_Tower

3 days ago

vtbassmatt

Heads up, most of the community migrated off Fandom a little while ago. https://mtg.wiki/page/Judge_Tower

3 days ago