Private coding agent evals
You don't ship untested code. You ship untested agent changes every week.
A new model, an edited AGENTS.md, a higher reasoning level — each silently moves your code quality and your cost per task, and a passing test suite tells you neither. Stet replays real work from your repo, scores the change, and lets your coding agent improve its own setup against the evidence.
Tests pass — that's all you know.
In our testing, two models hit an identical 73% pass rate — then landed 2–5× apart on every quality dimension. METR found ~50% of test-passing AI PRs wouldn't be merged by maintainers. Green tests are not enough.
The decision stack
3 models × 4 instruction sets × 2 tool configs × 2 workflow modes × 3 reasoning levels × 2 harness versions = 288 configurations.
You're testing: 1
How the other 287 get decided
You can't test 288 by hand. Your agent can — if it has a measurement to optimize against.
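The arithmetic behind that 288 is easy to check by enumerating the grid. This is a minimal sketch: only the number of options per axis comes from the stack above, and every option name below is a hypothetical placeholder.

```python
from itertools import product

# Hypothetical option names. Only the count per axis (3, 4, 2, 2, 3, 2)
# is taken from the decision stack above.
axes = {
    "model": ["model-a", "model-b", "model-c"],
    "instructions": ["v1", "v2", "v3", "v4"],
    "tools": ["tools-a", "tools-b"],
    "workflow": ["mode-a", "mode-b"],
    "reasoning": ["low", "medium", "high"],
    "harness": ["h1", "h2"],
}

# One dict per configuration: the Cartesian product of all axes.
configs = [dict(zip(axes, combo)) for combo in product(*axes.values())]

print(len(configs))      # 288
print(len(configs) - 1)  # 287 configurations you are not testing
```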
Tell your coding agent to improve its own setup.
It proposes a change — a tighter AGENTS.md, a cheaper model, a lower reasoning level — and Stet scores each attempt on real work from your repo. You keep only what wins, and every run ends in a decision you can defend.
⚠ Partial implementations on 3 tasks
⚠ Validation still weak on 2 tasks
⚠ 2 tasks still fail code review for style
⚠ Consider follow-up for review quality
Your merged PRs are the tasks. Your tests are the judge. Your code is the baseline.
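The "keep only what wins" loop can be sketched in a few lines. This is an illustration under stated assumptions, not Stet's API: `Result`, `wins`, `improve`, and the `score` callback are hypothetical names, the numbers are made up, and the win rule (never lose quality, then be better or cheaper) is one possible policy.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable, Tuple

@dataclass(frozen=True)
class Result:
    quality: float  # higher is better
    cost: float     # cost per task, lower is better

def wins(candidate: Result, best: Result) -> bool:
    """Assumed policy: never lose quality, and be either better or cheaper."""
    if candidate.quality < best.quality:
        return False
    return candidate.quality > best.quality or candidate.cost < best.cost

def improve(
    baseline: Hashable,
    candidates: Iterable[Hashable],
    score: Callable[[Hashable], Result],
) -> Tuple[Hashable, Result]:
    """Score each candidate on the same tasks and keep only what wins."""
    best_cfg, best = baseline, score(baseline)
    for cfg in candidates:
        result = score(cfg)
        if wins(result, best):
            best_cfg, best = cfg, result
    return best_cfg, best

# Made-up scores standing in for a real replay of merged PRs.
fake_scores = {
    "baseline": Result(quality=0.70, cost=1.00),
    "tighter-agents-md": Result(quality=0.74, cost=1.00),
    "cheaper-model": Result(quality=0.74, cost=0.60),
    "lower-reasoning": Result(quality=0.65, cost=0.40),
}

best_cfg, best = improve(
    "baseline",
    ["tighter-agents-md", "cheaper-model", "lower-reasoning"],
    fake_scores.__getitem__,
)
print(best_cfg)  # cheaper-model
```

In this toy run the lower reasoning level is the cheapest option, but it drops quality, so it is rejected rather than kept.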
Recent Stet experiments
It's not just prompts. Model swaps, harness changes, skill edits, config flips — each one checked the same way.
Find where you're overpaying — and prove it.
When the cheaper setup — a smaller model, a lower reasoning level — matches the expensive one on a class of tasks, you're paying twice for the same result. Stet shows you which jobs need the premium setup and which don't, and hands you the receipt to defend the call.
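As a rough sketch of that comparison, the check below flags task classes where a cheaper setup matches the premium one within a tolerance. Everything here is hypothetical: the task classes, the scores, the costs in cents, and the 0.02 tolerance are illustrative values, not measurements.

```python
def overpaying(results: dict, tolerance: float = 0.02) -> dict:
    """Return {task_class: cents saved per task} where cheap matches premium."""
    flagged = {}
    for task_class, setups in results.items():
        cheap, premium = setups["cheap"], setups["premium"]
        matches = cheap["quality"] >= premium["quality"] - tolerance
        cheaper = cheap["cost_cents"] < premium["cost_cents"]
        if matches and cheaper:
            flagged[task_class] = premium["cost_cents"] - cheap["cost_cents"]
    return flagged

# Illustrative numbers only.
results = {
    "bugfix": {
        "cheap": {"quality": 0.80, "cost_cents": 40},
        "premium": {"quality": 0.81, "cost_cents": 120},
    },
    "refactor": {
        "cheap": {"quality": 0.55, "cost_cents": 45},
        "premium": {"quality": 0.78, "cost_cents": 130},
    },
}

print(overpaying(results))  # {'bugfix': 80}
```

Here the premium setup earns its price on refactors and is wasted on bug fixes, which is the kind of per-class split the receipt needs to show.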
What engineering leaders say
"No way to tell the board we're getting value out of agents."
"Outside of vanity metrics, I have nothing of value to show."
Model comparison
A public benchmark of AI coding agents on real open-source codebases — run the same eval privately on your repo.
Same test gate, n=50 across Go and Rust. What separates the models is craft and cost above that gate. This is a directional read; full calibration and caveats are in the post.
Read the full comparison →
The number is always moving.
Every model release, AGENTS.md edit, and reasoning-level change moves your quality/cost optimum. Stet is the loop — replay real work from your repo, score each candidate, keep the best, and re-run when the ground shifts.
Stet means 'let it stand.' Every change to your agent stack gets checked against real work from your repo — stet, or revert.
Run it on your repo