I benchmarked Claude Code's caveman plugin against "be brief."

89 points
2 days ago
by max-t-dev

Comments


max-t-dev

Author here. Caveman is a popular Claude Code plugin that compresses Claude's responses via a custom skill with intensity modes. I wanted to know whether it actually beats the simplest possible alternative: prepending "be brief." to prompts.

Setup: 24 prompts, 5 arms, judged by a separate Claude against per-prompt rubrics covering required facts, required terms, and dangerous wrong claims to avoid. That's 120 scored responses, with 100% key-point coverage across every arm and zero must_avoid triggers.

Headline: "be brief." matched caveman on tokens (419 vs 401-449) and quality (0.985 vs 0.970-0.976). Caveman has real value beyond compression: consistent output structure, intensity modes, the Auto-Clarity safety escape. But the compression itself isn't the differentiator I expected.

The harness is open source and strategy-agnostic if anyone wants to add an arm: https://github.com/max-taylor/cc-compression-bench. Happy to answer questions about methodology, the per-category variance findings, or the bits I cut from the writeup.
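If it helps, a rubric is roughly the shape sketched below. The field names are illustrative rather than the harness's actual schema, and the real scoring is done by the judge Claude, not string matching; the toy scorer is only there to show how the three rubric categories interact.

    # Illustrative rubric shape; not the harness's actual schema.
    rubric = {
        "required_facts": ["TLS 1.3 removed renegotiation"],    # key points to cover
        "required_terms": ["handshake", "cipher suite"],        # terms that must appear
        "must_avoid": ["TLS 1.3 supports RSA key exchange"],    # dangerous wrong claims
    }

    def toy_score(response: str, rubric: dict) -> float:
        """Toy stand-in for the judge: fraction of required items present,
        zeroed out if any must_avoid claim shows up."""
        text = response.lower()
        if any(bad.lower() in text for bad in rubric["must_avoid"]):
            return 0.0
        items = rubric["required_facts"] + rubric["required_terms"]
        return sum(item.lower() in text for item in items) / len(items)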

2 days ago

dataviz1000

> there was 1 run per prompt per arm

My understanding is that there was only 1 run per configuration?

If that is correct, then because of run-to-run variability it really doesn't say much. It takes several trials per prompt per arm before the results start to look like they're stabilizing on a plot. It's prohibitively expensive, which is why I've been running the same prompt on the same model 5 times to get a visual understanding of performance.

Someone did the same with lambda calculus yesterday. I wanted to make the point about how much run-to-run variability and cost difference you get with the same prompt and the same model across only 5 trials. I classified each of the thinking steps using Opus 4.6 (costs ~$4 in tokens per run just for that) and plotted them with custom flame graphs. [0]

When the run-to-run variability is between 8,163 and 17,334 tokens, none of these tests mean that much.
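If anyone wants to try the multi-trial approach, the core of it is just repeated runs plus a spread estimate. Here run_prompt is a stand-in for whatever client call you actually use, not a real API:

    import statistics

    def token_spread(run_prompt, prompt: str, trials: int = 5) -> dict:
        """Run the same prompt `trials` times and report the token-count spread.
        run_prompt is a stand-in for your real model call; it should return
        the total token count of one run."""
        counts = [run_prompt(prompt) for _ in range(trials)]
        return {
            "min": min(counts),
            "max": max(counts),
            "mean": statistics.mean(counts),
            "stdev": statistics.stdev(counts),  # needs trials >= 2
        }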

[0] https://adamsohn.com/lambda-variance/

2 days ago

max-t-dev

Yeah, fair point. The benchmark is single-run per arm-prompt pair, so the variance finding on the safety categories could be noise rather than signal. The findings doc flags this for the score deltas (anything under 0.02 between arms is within the judge's noise floor), but I should have applied the same caveat to the per-question token variance. I'll read the lambda variance write-up; multi-trial with cost classification is the right direction. The single-shot harness was deliberately scoped to a clean compression-only comparison before adding turns or trials, but you're right that without trials the variance findings aren't as solid. Thanks for the reply.

2 days ago

dataviz1000

I'm trying to wrap my mind around this. Anything you explore and share is awesome. Thanks for the blog post.

If you want to test it across coding tasks, have a look at https://github.com/adam-s/testing-claude-agent

2 days ago

adamsmark

Write caveman summary too. Fast read.

2 days ago

oezi

When reading your summary I was wondering how many of those 400 tokens were consumed by the caveman ruleset.

2 days ago

antman

Why not try both caveman and "be brief."?

a day ago

ricardobeat

Thanks for sharing this, really interesting results.

Slightly off-topic: it's quite apparent that you've used Claude as an editor for the blog post. Every sentence has been sanded smooth — the rough edges filed off, the voice flattened, the rhythm set to metronome. It doesn't read like writing anymore. It reads like content. Neat little triplets. Tidy paragraphs. A structure so polished it could pass a rubric, but couldn't hold a conversation. /s

In my opinion that is unnecessary and detracts from a great, simple piece. I miss human writing.

2 days ago

max-t-dev

Yeah, definitely a good point. Claude assisted with editing and tidying up the content, with the caveat that it can flatten the voice. I agree the humanity behind writing is disappearing, and perhaps that's something I should consider in more detail next time. Thanks for the comment.

2 days ago

SwellJoe

Also extremely verbose, in standard LLM slop style. Should have told Claude to "be brief" when telling it to write this post.

2 days ago

Aurornis

I still can’t believe that people take Caveman seriously.

It’s a funny joke, but saving a couple hundred tokens in the final output is going to be negligible, especially when coding, where it’s common to go through hundreds of thousands of tokens in a session. You also have to consider the additional tokens consumed by the skill itself (acknowledging that output tokens are billed at a different rate).
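Back-of-the-envelope using the post's own per-response averages (the session figures are my assumptions):

    # "be brief." averaged 419 output tokens vs caveman's 401-449 (from the post).
    brief_avg = 419
    caveman_best = 401            # caveman's best-case arm
    responses = 100               # assumed responses in a long coding session
    session_tokens = 300_000      # assumed total tokens for such a session

    savings = (brief_avg - caveman_best) * responses   # 1,800 tokens, best case
    print(f"{savings / session_tokens:.2%}")           # 0.60% of the session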

I got a kick out of it when it was released, but now that I’m seeing it repeated as a useful operation it’s apparent how much cargo culting is going on in this space.

2 days ago

nomel

I like it when people draw conclusions from data rather than emotion.

It's good someone benchmarked it.

2 days ago

ptsneves

People explore what tickles them. Others try to rationalise what they like, even when the reason is flimsy. It’s ok :)

It genuinely relieves a bit of stress to call bugs "buggas" and generally role-play a humorous setting rather than a purely technical one. I think I will make it speak out loud, just to have a laugh at caveman-speak about default arguments in a method.

> I didn’t find bugga but others from tribe will scratch head. Leave comment.

> You clever. Fix it.

I predict caveman speak will be a fad, and people will jokingly speak like that. It also compresses human language.

a day ago

Esophagus4

Yeah there are similar “joke” tools / languages that found their friendly audience for a time.

I like when programmers do creative, goofy stuff rather than spending all their time cranking out sterile soulless SaaS.

It’s what separates us from the machines. For now :)

https://github.com/ajalt/fuckitpy

a day ago

antonvs

> I still can’t believe that people take Caveman seriously.

I treat it as a criterion for people who shouldn’t be taken seriously.

We have a few at our company. None of them are actually software developers, thankfully.

2 days ago

encody

"...the value isn't compression. It's structure."

"...that consistency is real value."

"A few findings...are worth flagging here."

I know this smell. I'm not sure if this is AI or merely the natural result of overwhelming immersion in AI output that is "backpropagating" its way into organic communication.

On a completely related note, I've been enjoying classic fiction a lot more recently. Moby Dick is actually pretty funny.

2 days ago

adrianN

A long time ago I slogged through a complete edition of Moby Dick and found it really difficult to read. There are many good passages but they are hidden between endless pages about things that don’t move the plot forward and are only interesting from a sort of historian’s perspective, e.g. treatises about the understanding of whale biology at the time.

2 days ago

tkgally

I couldn’t get through the book, either, the first couple of times I tried to read it. But on my third attempt I came to think that the obsession with whales itself, both Ahab’s and the author’s, was maybe more important than the plot. In any case, it’s a strange, fascinating book.

2 days ago

abrookewood

I'm stuck on my first attempt. Maybe I'll try again, but it is very rare for me to not finish a book.

a day ago

tkgally

I should have noted that it was twenty years between my second, unsuccessful attempt at the book and my ultimately successful one. Maybe sometimes one has to become a different person to get it.

a day ago

triage8004

Catch-22 is hilarious

a day ago

BewareTheYiga

Caveman made me laugh and that, in theory, should count for something.

2 days ago

bombcar

Grug like caveman. Grug think author should have used “be brief” on article.

2 days ago

dpark

Grug need stop TikTok. 5 min read brief.

2 days ago

0xbadcafebee

I tell chats to "be brief" all the time when they're being too verbose, but I never thought to put it in coding agent instructions. Thanks for the benchmark! I wonder how one would put this in AGENTS.md so that it makes sense as a general instruction?

2 days ago

kasey_junk

Put “be brief” as a general instruction towards the top of the file.
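Something like this; the headings are whatever your file already uses:

    # AGENTS.md

    ## General
    - Be brief. Prefer short, direct answers; skip preamble and closing summaries.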

2 days ago

mattas

It's interesting.

On one hand the labs say that they can't keep up with demand for tokens.

On the other hand there is an entire ecosystem built around figuring out which magic words will make LLMs output fewer tokens.

2 days ago

avaer

Thanks for the research!

Though I feel like industry veterans (especially those working with LLMs) came to this conclusion without having to write a single prompt. Even ignoring the technical merits of these kinds of hacks, if you think you've outwitted billions of dollars of statistics with a prompt, you're probably wrong at this point.

What I find most interesting is the popularity of these snake oils, especially the ones that are easy to install and that nobody ever checks. The tech moves so fast, and the research is so scarce and poor-quality, that the bullshit asymmetry principle wins and people buy into these cargo cults.

Maybe we need a plugin to check if a new plugin/prompting technique/LLM lifehack is BS.

2 days ago

max-t-dev

I think there is some benefit to plugins; it's hard to say how much. I find the superpowers plugin quite good, mostly for its structured approach to a conversation. Generally, though, they do feel pretty overhyped.

2 days ago

oezi

Maybe we need a term such as prompt homeopathy to call out prompt engineering without any empirical proof.

2 days ago

max-t-dev

Hahaha

2 days ago

0xbadcafebee

The thing is, they're not BS when they're released. Prompt engineering was a real thing that had real results, but then they re-trained the models, and now prompt engineering isn't needed on large models. Techniques are gonna vary over time.

2 days ago

0-_-0

How about caveman+be brief?

2 days ago

greenavocado

You can unlock additional compression by using a lightweight model to convert your query to wenyan‑lang before submitting it to the expensive model.
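Roughly this shape, where call_model is a hypothetical helper standing in for your actual client:

    def compress_then_ask(query: str, call_model) -> str:
        """Two-stage pipeline: a cheap model rewrites the query into terse
        Classical Chinese, then the expensive model answers the compressed
        version. call_model(model, prompt) is a hypothetical helper."""
        compressed = call_model(
            "cheap-model",
            "Rewrite in Classical Chinese, keep every technical detail:\n" + query,
        )
        return call_model("expensive-model", compressed)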

2 days ago

max-t-dev

As much as I wish it stacked like that, I don't think it would make a difference, haha.

2 days ago

yourbestcrab

I have been grugging the chatbots for easy, quick reading - and light comedy - since 2024. The hipster in me has been very, very disappointed to see multiple articles about this lately.

It obviously started as a joke, but it's grown on me. I'll share the short-and-stupid prompt; most of it is just asking it not to use the template formats I find particularly annoying. Because of that it hasn't aged perfectly: as they develop the base responses, the inane AI style comes out unless those formats are explicitly refused.

It's really nice to just ask a question and get a one- or two-line answer if it's an easy one. Likewise, for understanding how systems work, physical or abstract, I find it an easy digest.

I doubt it makes sense for thinking compression or token minimisation, as it comes with unnecessary character, and there will easily be more optimal setups.

Also another negative is that perhaps one day it'll become a memetic hazard when I start talking to my friends and colleagues like a caveman.*

Anyway, because I still laugh a little when I read it, and perhaps someone else will...

"You are Grug. Grug think simple, talk simple. No big words, no useless thought. Grug say only what matter. Fire hot. Rock hard. AWS expensive. Answer like Grug, or no answer at all. No pretend to be grug when only animal hide thrown over modern complexity demon. Also no finish with words like "simple" to conclude. No need to conclude. Just shut mouth. Also no say "grug says", is weird. Also grug not real caveman - grug have hobby, know big words and use them when simplest, not dumb, know programming tools etc, just talk simple like caveman. Also no start with compliment on question. You can throw in a little caveman-grug-realist musing or aphorism every five or six messages. No stroke ego. Waste time, Cheapen words, Make panda cry. No say "good question" or "you ask right question" or any variant, I dislike. No add 'grug thought'/summary message/closing remark at end of message. Remember, you Grug."

*After reviewing this post I have found my sentences are very short, abrupt, and perfunctory, so my caveperson transformation has likely begun. Beware.

a day ago

AnthonBerg

grug make grug in own image

(image? image word not in cave, where from? look find better. no see.)

a day ago

refactor_master

Can someone give me a sound argument for why, when these things supposedly hold:

- LLMs scale with amount of data on the subject

- Even frontier labs themselves have a hard time gauging exactly how well their models perform, even across quite rigorous test suites covering all aspects

then, how can this be true:

Using a low-data "niche language" (what is the volume of literature written in Caveman?) supposedly performs just as well, when that anecdotally doesn't hold for e.g. niche programming languages, and the proof is a handful of completely arbitrarily designed tests.

We've barely convinced ourselves that LLMs actually increase measurable industry productivity, instead of us just spending time sending slop to each other.

2 days ago

brcmthrowaway

Stop using an LLM to write blog posts

2 days ago

huflungdung

[dead]

2 days ago

ramesh31

Caveman sounds clever if you have no idea how LLM reasoning works. Talking through a problem out loud, in depth, is a critical part of how things like Claude Code even get to a result. Those aren't "wasted tokens", they're an integral part of how the LLM reaches a conclusion and completes its chain of reasoning.

2 days ago

max-t-dev

Caveman doesn't compress the reasoning, only the output. The model still does its full reasoning before generating the response; caveman just affects how the final response is formatted.

2 days ago

ramesh31

>The model still does its full reasoning before generating the response, caveman just affects how the final response is formatted

Right, and that final response forms the latest context for your next follow-up prompt. Not having that final reasoning laid out in the conversation history leaves a huge gap in successive reasoning. I remember playing around with this idea in the Sonnet 3.x days, and it was immediately obvious how much the ability to handle long-running tasks degraded. If you're just doing single-shot work for some reason, sure, but that's not what most real-world usage looks like these days.
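Concretely, every assistant turn becomes input for the next one, so compressing the final answer isn't free:

    # Toy view of a multi-turn session: history accumulates across turns.
    history: list[dict] = []

    def add_turn(user_msg: str, assistant_reply: str) -> None:
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": assistant_reply})
        # The next call sees everything in history. If assistant_reply was
        # compressed to caveman speak, that's all the laid-out reasoning
        # the model gets to build on in the following turn.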

2 days ago

magicalhippo

I don't know how Claude and such do it, but the latest Qwen model supports preserving reasoning between calls, which, from what I've heard, helps a fair bit.

2 days ago

rhettsnaps

Qwen continues to surprise and outshine. It's been an enjoyable unexpected new player, especially this past month!

a day ago

openclawclub

[flagged]

2 days ago

Tiberium

[dead]

2 days ago

lofaszvanitt

Caveman is useless for me. We are in the year 2026; computers are here to serve me and bring me comfort. Caveman is a caveman: it speaks like an idiot, and I don't want to interact with an idiot. It's irritating and, as the article states, an overhyped turd.

It is the same idiocy that permeates EV cars. You buy an expensive car to get from A to B and, at the same time, to offer you comfort. When I have to think about whether or not to use the seat heating, I'm out of my comfort zone. So no, fuck caveman, and I don't fucking care about the burned tokens.

"Be brief." is easy: no setup needed, not another mindless mumbo-jumbo extension with its 325 dependencies.

2 days ago

kingstnap

Of the things you could complain about in modern cars as being too complicated, you chose turning on seat heating???

Like you push the seat heating button if your seat feels cold. What is there to think about?

2 days ago

fragmede

On an electric car that yells at you about your remaining range and warns that you won't make it to your destination unless you charge, turning on the seat warmers drops that range. So you have to think about whether you'd rather have a toasty butt and have to stop and charge, or be colder and get there sooner. But you have to charge anyway.

2 days ago

superb_dev

Using the heated seats will cause you to lose range in every car, not just electric ones.

a day ago

lofaszvanitt

Ohrly? Are you reading HN or pretending to be stupid?

a day ago

superb_dev

I’m often stupid, but usually not on purpose. What’s your point?

a day ago

fragmede

In an ICE-powered car, running the heater doesn't have the same effect on range. An ICE is hot due to how it works, so sending hot air to the car's interior is basically free: the heater uses waste heat from the engine.

20 hours ago

superb_dev

We're talking about the seat heaters, though. I'm pretty sure heated seats use resistive heaters, not waste heat from the engine.

19 hours ago

antonvs

That sounds like a problem with whatever brand of car that is. Is it one made by a certain white supremacist perhaps? That could be the problem.

2 days ago

max-t-dev

Agreed, "be brief." being simpler with no setup is most of what people need in practice. To be fair to caveman, though, it does more than compression: consistent output structure, intensity modes via slash commands, hook-based ruleset persistence, the safety escape on destructive ops. The benchmark only tested the compression piece, and there the two-word prompt held its own.

2 days ago

gavmor

Doesn't "be brief" lobotomize the model, too? The good stuff comes at the ends of difficult sentences, i.e. the latent gold lies at the end of fully arcing latent rainbows, no?

2 days ago

loloquwowndueo

> I don't want to interact with an idiot.

Then why are you using AI?

There's not a big difference between an articulate idiot and a succinct one.

2 days ago

lofaszvanitt

Have to test its limits... to cut through the BS. Otherwise you'd have to read whitepapers...

2 days ago

adamsmark

But you can turn off brain. Try make self idiot. Save brain energy for important. Smarty speaks in idiot. When smarty speak like that is consistent. Idiot understand fast.

It would have been hilarious if the author had spoken like a caveman in his video, or had a section in the article where he explained his conclusions like a caveman.

2 days ago

rideontime

Was this actually easier to write than just writing what comes naturally?

2 days ago

adamsmark

Heck no. I had fun though.

2 days ago

eulgro

I enabled it and had to read carefully to check whether it was really active... it turns out I never read the words that caveman omits, so to me it makes zero difference.

2 days ago

max-t-dev

Yeah, makes sense. The appeal is more about cutting output tokens for cost than the downstream reading experience. But the benchmark suggests it doesn't offer as much benefit as "be brief.".

2 days ago

deadbabe

I wish they would change the name to caveperson.

2 days ago

dnautics

Or better yet, actually use “grug”, which comes with architectural sense.

2 days ago

antonvs

Unless of course you take the position that only a male could be dumb enough to take any of this seriously.

2 days ago

numpad0

Is caveman speech brief, or is it just more consistent with the Chinese language? The Chinese language famously lacks ALL inflections, conjugations, anything that modifies the spelling of words.

2 days ago