Evals are broken.
Benchmarks are contaminated.
Your team has no way to tell
if AI is helping — or hurting.
OpenAI declared SWE-bench Verified dead — contamination across all frontier models. The primary benchmark is broken. — OpenAI, Feb 2026
“Outside of vanity metrics, I have nothing of value to show.” — Principal engineer, 900-person company (source)
Model and harness updates change AI behavior. Config changes compound. Quality drifts, and the only way people notice is when the “vibes are off.”
Codex 5.2-high beats Codex 5.2-xhigh 67% of the time on real coding tasks. More thinking tokens isn’t always better. — Voratiq, 175 runs
It’s not just the model. It’s the harness, the skills, the rules, the tools, the workflow. Every layer is a variable. None of them are tested together.
Median 3 tools per engineer; 14.7% use 5+. 49.1% use different tools for different tasks. — Pragmatic Engineer Survey, 2026
Green checks, red quality.
AI generates more PRs. Each one still needs human review. Tests / CI are the gate, not the source of truth.
“One of our main challenges has been code reviews, as the quantity of code produced goes up, and quality used to go down, pre-Opus 4.5.” — Staff engineer, 30-person company (source)
~50% of test-passing SWE-bench PRs would not be merged by repo maintainers. — METR, March 2026
zod #4843 — Fix branded-primitive typing in error tree
Both pass your CI.
One produces code you’d ship. One produces code you’d rewrite.
Real patches from Stet evaluation runs · zod dataset
You don’t ship untested code changes.
Stop shipping unmeasured agent changes.
The teams that get this right tell their agent to test itself. Every model swap, skill change, or config update gets measured before rollout — on their own code, their own tests, their own standards.
We ran two models on 60 tasks from a real open-source repo. Same tasks. Same tests. Here’s what the data showed.
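A head-to-head comparison like this reduces to a simple calculation: score each model on the same tasks, then count how often one beats the other. A minimal sketch, with illustrative task names and scores (not real Stet data):

```python
# Hypothetical sketch: given per-task quality scores for two agent
# configurations run on the same tasks, compute the head-to-head win rate.
# Task names and score values below are made up for illustration.

def win_rate(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Fraction of shared tasks where config A scores strictly higher than B."""
    shared = scores_a.keys() & scores_b.keys()
    wins = sum(1 for task in shared if scores_a[task] > scores_b[task])
    return wins / len(shared)

scores_high = {"task-01": 0.9, "task-02": 0.7, "task-03": 0.8}
scores_xhigh = {"task-01": 0.6, "task-02": 0.8, "task-03": 0.5}

print(f"{win_rate(scores_high, scores_xhigh):.0%}")  # prints "67%"
```

The same counting, applied to real per-task results, is what produces a number like “67% of the time.”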
Quality degrades by default. Every update is a risk.
Every change is an experiment. Test it on your code.
Pass rate ≠ quality. The gap is where models diverge.
Stet replays your merged PRs, scores quality beyond pass/fail, and delivers comparison reports through your agent. Recurring runs and release gates follow.
We’re publishing our evaluation results openly.
See the data →