EvanFlow – A TDD driven feedback loop for Claude Code

108 points
4 days ago
by evanklem2004

Comments


Deeds67

To be honest, the official superpowers/brainstorming skill already does TDD so well, I don't see that much of a need for this. TDD is definitely the way to go with agentic development.

3 days ago

synergy20

How? I saw superpowers/brainstorming but never saw TDD code produced

3 days ago

jghn

It’s supposed to do this, but I’ve found it doesn’t always do it

3 days ago

Deeds67

Just tell it to use TDD

3 days ago

Atotalnoob

There is another skill for tdd. You can activate it manually or tell the harness to

3 days ago

shruubi

Two questions

1) Do you not feel self-conscious or weird about calling this "EvanFlow"? Seems like a lot of people these days are naming their AI tools/skills/whatever after themselves which seems self-absorbed. Either that or they hope that if their thing takes off like OpenClaw did then they'll grab the fame that comes along with it.

2) Why does your TDD flow miss the refactor step of TDD?

4 days ago

toyg

I initially thought it was a pun on Pearl Jam's classic "Even Flow", then I read your comment and noticed the username... Sad.

3 days ago

mansilladev

I was really hoping this was something I could find on CPAN from the author username perlJam.

3 days ago

phyzix5761

Let the guy have something. Free and open source developers work tirelessly for free for years supporting software that billion dollar companies use to make huge profits.

We don't question when scientists name stuff after themselves so why question this? At least he gets some recognition for his work.

3 days ago

[deleted]
3 days ago

evanklem2004

1): you have things backwards, the EvanFlow is not something i came up with but rather something i discovered similar to the dao. i am named Evan after the EvanFlow not the other way around.

2): you're right and dmitry called this out below too. shipped a fix that puts REFACTOR per-cycle, instead of being a deferred "after all tests pass" step. the old step 4 was iterate-shaped not TDD-shaped.

3 days ago

collingreen

> 1): you have things backwards, the EvanFlow is not something i came up with but rather something i discovered similar to the dao. i am named Evan after the EvanFlow not the other way around.

What does this mean?

2 days ago

evanklem2004

sit under the lotus tree and it will come to you

2 days ago

wenc

I feel like 1 is a self-correcting problem. If this goes nowhere it will soon be forgotten.

I can think of one example that did go somewhere: Linux.

4 days ago

stingraycharles

ReiserFS is another one that comes to mind.

And djb (the djb) also wrote djbdns.

There are plenty of examples, usually when it coincides with someone’s first project.

3 days ago

anon_46135

TanStack was started by a guy named Tanner

Debian is a portmanteau of Debra (Ian's girlfriend) and Ian.

I don't mind it. It's just a name

3 days ago

globular-toast

Linus did not name it Linux himself: https://en.wikipedia.org/wiki/Linux#Naming

3 days ago

u_fucking_dork

He merely laundered it through a coworker.

3 days ago

cindyllm

[dead]

3 days ago

[deleted]
3 days ago

cornyhorse

Debian is an even better example

3 days ago

EvanKnowles

Feels like a bonus to me.

3 days ago

normie3000

Ref 1, he should have called it Daughter.

4 days ago

reitzensteinm

No Code, surely?

4 days ago

ButlerianJihad

"Evenflo is a hundred-year-old infant feeding brand." Probably named to market its baby bottles and accessories.

Everybody who grew up listening to Pearl Jam had seen or used an Evenflo pacifier, baby bottle, or car seat. That's one reason the song already sounded so familiar.

3 days ago

subscribed

Now go challenge Linus Torvalds :D

a day ago

infecto

1) Do you feel weird asking a question like this? What constructive benefit does it add to any dialogue?

Sometimes it’s helpful to ask oneself what’s the benefit of an answer. I cannot think of any for your question and the way you worded it is a bit cringe. People name things after themselves all the time. It does not matter in the slightest.

3 days ago

arathis

Jesus mate, talk about loaded questions.

“Who are you? How dare you create anything”

3 days ago

s20n

EvanFlow - thoughts arrive like butterflies?

4 days ago

sbseitz

Oh, he don't know, so he chases them away

4 days ago

jamesbfb

Oooohhhh

4 days ago

ge96

Seeeethinnggg tests failing not complete... again

3 days ago

__mharrison__

Someday soon he'll begin his life again

3 days ago

[deleted]
4 days ago

conception

If you’re just looking for the TDD part - https://github.com/nizos/tdd-guard - is the only project I’ve come across that actually enforces it with hooks and blocks edits rather than relying on a prompt that gets context rotted away.
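The gate it enforces is roughly: test files may always be edited, but implementation edits are blocked while the suite is green, since green means no failing test justifies new code. A minimal Python sketch of that decision (hypothetical helper names; tdd-guard's real hook protocol and validation are more involved):

```python
import subprocess

def suite_is_red(test_cmd=("pytest", "-q")):
    """Run the test suite; True if at least one test fails."""
    return subprocess.run(test_cmd, capture_output=True).returncode != 0

def gate_edit(path, is_test_file, suite_red):
    """Decide whether a proposed edit is allowed under TDD.

    Test edits are always allowed (that's how you reach RED);
    implementation edits are allowed only while the suite is red,
    i.e. while a failing test justifies them.
    """
    if is_test_file:
        return True, "test edit: always allowed"
    if suite_red:
        return True, "failing test exists: GREEN step permitted"
    return False, f"suite is green: write a failing test before editing {path}"
```

A real hook would read the proposed tool call from stdin and signal "block" back to the agent harness whenever `gate_edit` returns False.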

3 days ago

Nizoss

Creator of TDD Guard here, thanks for the mention!

TDD Guard was built when Claude Code was the only one to offer hooks. Plugins didn't exist and the models were weaker, so the validation context and instructions took more work to get right. This is why it ended up requiring test reporters for different languages.

I have started a new project that does the same TDD enforcement, also through hooks, but without reporters. It works with any test runner and is vendor-agnostic: it works with Claude Code, Codex, and GitHub Copilot. The validator also sees recent session history, which helps it handle cases like refactoring better.

The TDD instructions are still pretty basic compared to TDD Guard's, which have been dogfooded for a year. One thing I noticed while testing across agents is that some follow TDD a lot better than others; Codex struggled the most with the basic instructions.

Feedback welcome:

https://github.com/nizos/conduct

3 days ago

thisisfatih

The refactor-per-cycle fix lands in the right place. The harder problem shows up when EvanFlow forks into parallel coder/overseer mode: unit tests pass per agent, but the seams break at merge. Your note that "integration tests at touchpoints ARE the cohesion contract" is exactly right, but enforcement is what makes it stick. Each parallel branch needs its own failing test that can't be masked by another branch's green run. Worktree isolation handles this cleanly since each agent's environment is separate. Without that, vertical-slice TDD in parallel collapses to "tests pass somewhere."

On jtfrench's unanswered question about dumb zone evasion: context length is what drives the drift. Agents go off-track when a loop runs long enough that early design context falls out. Resetting at each RED-GREEN-REFACTOR boundary keeps cycles short enough to avoid it. The hard cap of 5 iterate rounds is the same instinct applied at the macro level.

We ran into the parallel integration seam problem building tonone, a 23-agent Claude Code plugin where each domain agent works in its own worktree and integration tests are the merge contract.

https://github.com/tonone-ai/tonone if curious.

3 days ago

kevinluddy39

The per-agent-green / merge-broken pattern is the diagonal failure mode of multi-agent systems. Unit testing each agent in isolation captures correctness within scope; what's invisible is the seam at handoff — argument schemas drifting between coder and overseer, response shapes that satisfy each agent's local validator but break the next's parser, error messages that get summarized into "no error" by the time they reach the orchestrator.
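That seam failure can be caught mechanically: record each tool's payload shape on first sight and flag any later turn where it changes. A toy sketch of the idea (my own illustration, not tool-call-grader's actual detector):

```python
def schema_signature(payload):
    """Map each top-level field of a payload to its type name."""
    return {k: type(v).__name__ for k, v in payload.items()}

def detect_schema_drift(trace):
    """Flag turns where a tool's payload schema differs from its
    first observed signature. Each call can look locally valid,
    but the drift is what breaks the next agent's parser."""
    seen, drifts = {}, []
    for turn, (tool, payload) in enumerate(trace):
        sig = schema_signature(payload)
        if tool in seen and sig != seen[tool]:
            drifts.append((turn, tool))
        seen.setdefault(tool, sig)
    return drifts
```

For a trace like `[("merge", {"id": 1}), ("merge", {"id": "1"})]`, the second call drifts from int to str and gets flagged even though both payloads parse fine in isolation.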

Built tool-call-grader to instrument exactly this. Session-level statistics across the tool-call trace plus six pathology detectors (silent failure, tool fixation, response bloat, schema drift, irrelevant response, cascading failure). On a hand-designed multi-agent benchmark, 7/7 scenarios passed, including specifically the case you're describing: per-agent results look fine, schema-drift fires at the seam.

The detector runs over the trace, not the output. Catches the failure several turns before it shows up as a "weird merge bug" the human has to debug. MIT licensed, npx-installable. Methodology in profile.

2 days ago

dpark

I’ve thought of going down the TDD model for LLMs as a way of providing constraints on their behavior. I would think that “vertical slice” TDD would encourage the LLM to start tailoring the tests to the implementation rather than establishing the invariants up front, though. I was considering “horizontal” TDD to force the agent to implement constraints before coding to them.
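The tailoring risk is easy to see in miniature: a vertical-slice test can simply assert whatever the fresh implementation happens to return, while an invariant-style test pins properties chosen before any code exists. A toy contrast (hypothetical `slugify` example):

```python
def slugify(title):
    """Toy implementation under test."""
    return "-".join(title.lower().split())

# Tailored test: locks in one observed output, which an agent can
# derive from the implementation it just wrote.
def test_tailored():
    assert slugify("Hello World") == "hello-world"

# Invariant tests: properties stated up front that any correct
# implementation must satisfy, independent of its internals.
def test_idempotent():
    s = slugify("Hello World")
    assert slugify(s) == s

def test_no_whitespace_or_uppercase():
    s = slugify("  Mixed CASE  title ")
    assert " " not in s and s == s.lower()
```

The horizontal approach amounts to writing only the second kind before any implementation exists.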

3 days ago

evanklem2004

yeah went back and forth on exactly this trade-off, you're right that vertical can produce tests tailored to the impl. horizontal forces invariants up front but the failure mode flips: you're tailoring tests to the architecture you imagined before any feedback from working code. so it's invariants-vs-behaviors, both have a tailoring failure mode just on different axes. compromise i landed on: vertical + an explicit anti-tailoring grill check at each cycle. definitely gonna tweak with more as i keep refining.

3 days ago

dpark

What if you don’t ask for code yet. Prompt only for tests with maybe a minimal interface context that tests can code against?

3 days ago

viktorianer

[dead]

2 days ago

alex1sa

[dead]

3 days ago

dmitry_dv

The refactor step is the silent casualty in AI-assisted TDD. Once the test is green, Claude optimizes for moving to the next test, not for cleaning up the impl that just passed. An "iterate-until-clean" pass at the end is a different thing: you're refactoring cold code, not refactoring with a freshly-written test as the safety net.

3 days ago

evanklem2004

mmm good point! just shipped a fix that puts RED → GREEN → REFACTOR per cycle with the fresh test as safety net just like beck intended. macro/cross-cycle refactor lives in iterate now as its own separate thing so the two don't conflate. thanks for the catch : )

3 days ago

pydry

When I first used agentic coding I was already doing strict TDD and I just tried using it for the refactor step.

It sucked so hard I thought the idea of agentic coding was just a joke. I've tried it periodically and it literally never stopped sucking.

I figure if it can't do that part it isn't worth using it for any part.

Ever since then whenever people tell me it's gotten better I've tried it out and nope, still sucks.

I still get gaslit about how well it works by people who just discovered TDD though, and watch it power through CRUD boilerplate getting impressed, blissfully unaware that boilerplate spew is an antipattern.

3 days ago

lukewrites

Curious, In the repo you mention

> Several rules come from 2025-2026 industry research on agentic coding failure modes

What are some of the papers you read?

3 days ago

esperent

With no disrespect intended because this is also how I would do it (but I wouldn't publish and name it after myself!) - they didn't read the research. They had the AI that actually created this do that for them.

3 days ago

evanklem2004

fair to call out but half true. i did send claude off to look up specific stats on failure modes (62% assertion correctness, etc), but the design decisions came from my own reading of anthropic's reports, the columbia daplab paper i cited, and a mix of matt pocock's lectures + my own anecdotal experience running this loop on real projects.

3 days ago

nghnam

superpowers/brainstorming is doing TDD as well.

3 days ago

esperent

> execute → tdd

How are these separate steps?

TDD is how you execute, not something you tack on afterwards.

3 days ago

evanklem2004

yeah that is a little confusing, tdd is actually a substep of execution. it was listed separately in the diagram because not every task uses TDD (config, generated types, etc. skip it), so the skill is invoked conditionally during execution rather than always. but the arrow notation made it look sequential when it's actually nested. updated the README diagram to show that. thanks for the nudge.

3 days ago

jtfrench

How does this handle “dumb zone” evasion while looping?

4 days ago

cratermoon

4 days ago

[deleted]
3 days ago

phibz

...thoughts arrive like butterflies / Oh, he don't know, so he chases them away / Oh, someday yet, he'll begin his life again / Life again, life again

3 days ago

evanklem2004

Built this as an opinionated Claude Code development flow based on evidence-based practices and what has been working for me while developing professional code.

EvanFlow is a single TDD-driven loop. Say "let's evanflow this" and it walks brainstorm → plan → execute → tdd → iterate → STOP. Real checkpoints at design and plan approval. Never auto-commits, never auto-stages, never proposes integration - every git op is your call.

The three things that actually changed how I work:

1. Vertical-slice TDD. One failing test → minimal impl → next test. Watch each test fail before writing the impl that passes it. (Sounds obvious. Almost no agent does it by default. ~62% of LLM-generated test assertions are wrong per HumanEval research, so the testing discipline matters even more than the impl discipline.)

2. Embedded grilling at decision points. Before locking a plan: what breaks if a user does X? What's the rollback? What's explicitly out of scope? Catches design flaws while they're still cheap.

3. Iterate-until-clean (hard cap of 5 rounds). Re-read the diff against dead code, naming, the deletion test, assertion correctness, and a Five Failure Modes pass (hallucinated actions, scope creep, cascading errors, context loss, tool misuse). For UI: screenshot via headless Chromium.
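The vertical-slice cycle in (1), with the per-cycle REFACTOR step discussed elsewhere in the thread, looks roughly like this in miniature (hypothetical example, not EvanFlow's actual prompts):

```python
# RED: write one failing test first and run it; it should fail
# for the right reason, since total() does not exist yet.
def test_total_ignores_refunded_orders():
    orders = [("paid", 10), ("refunded", 5), ("paid", 7)]
    assert total(orders) == 17

# GREEN: the minimal implementation that makes that test pass.
def total(orders):
    return sum(amount for status, amount in orders if status == "paid")

# REFACTOR: restructure immediately, while that fresh test is the
# safety net, instead of deferring cleanup to the end of the run.
def is_billable(status):
    return status == "paid"

def total(orders):  # same behavior, clearer shape
    return sum(amount for status, amount in orders if is_billable(status))
```

Only after the refactored version is green does the next failing test get written.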

For bigger plans with 3+ independent units sharing types, it forks into a parallel coder/overseer orchestration. Integration tests at touchpoints ARE the cohesion contract.

Three install paths: Claude Code plugin marketplace, npx skills add, manual copy. MIT.

4 days ago

girvo

Please don’t post AI generated comments :(

Just write it yourself. I promise it’s worth it

3 days ago

deaux

He's even being cheeky by intentionally replacing the em-dash by a regular dash, haha

3 days ago

girvo

It's quite well done really, but the cadence...

No x. No y. No z. Just abc.

Its like nails on a chalkboard...

3 days ago

evanklem2004

sometimes you gotta hit em with the ol' linkedin one two hehe

3 days ago

jimmypk

[dead]

3 days ago

tommy29tmar

[dead]

3 days ago

enesz

[dead]

3 days ago

youwangd

[dead]

3 days ago

jonahs197

[dead]

3 days ago

marsven_422

[dead]

3 days ago

here2learnstuff

[flagged]

3 days ago

fragmede

Linus started Linux when he was 21, an undergrad at the University of Helsinki. You're entirely welcome to use whatever filtering function for products you use, but it doesn't seem like solely using this particular product's creator's age as a disqualifier comes from a place of sound reasoning, to me.

3 days ago

avyjit

This is such a BS take. If you feel the product is immature or not great - that's valid criticism. This is not.

3 days ago

xaxfixho

i'm new around here, how do i *DOWN VOTE* stuff?

3 days ago

sdevonoes

TDD in 2026? Besides, TDD's main benefit is to come up with a decent architecture for your system… LLMs can already do that if instructed. I don't see the point of TDD.

3 days ago

myko

I've always been hesitant to prescribe TDD to _everything_ until agentic coding agents came along. TDD is a great way to keep them on track.

3 days ago