Your AI coding benchmark is hiding a 2x quality gap
The assumption
Coding-agent evals like SWE-bench and Terminal-Bench share one base assumption: agent quality comes down to a single primary metric, and that metric is test pass rate.
Claude Code with Opus 4.6 passes tests 73% of the time. Codex with GPT 5.4 passes 80%. Ship GPT 5.4. EZ, right?
How we measured
We ran 3 models against 87 tasks, drawn from 3 real open-source repos: Zod, graphql-go-tools, and sqlparser-rs.
Each task is a real PR or commit that was merged to the repo. The agent gets the repo prior to the merge, and instructions to do the task. The PR's own tests decide if the agent's change passes or fails.
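That per-task protocol can be sketched as a small function. The step names here are hypothetical (the post doesn't show its harness); the checkout, agent, and test steps are injected as callables so the sketch stays self-contained:

```python
from typing import Callable

def run_task(checkout_pre_merge: Callable[[], None],
             run_agent: Callable[[str], None],
             run_pr_tests: Callable[[], int],
             instructions: str) -> bool:
    """One eval task: restore the repo to its pre-merge state, let the
    agent attempt the task, then let the PR's own tests be the judge."""
    checkout_pre_merge()        # repo as it was just before the PR merged
    run_agent(instructions)     # agent edits the working tree
    return run_pr_tests() == 0  # exit code 0 from the PR's tests = pass
```

In a real harness the first callable would be a `git checkout` of the pre-merge commit and the last would shell out to the repo's test command; here they are stubs.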
Pass rate is the gate. But we also score three quality dimensions above it:
- Equivalence — how closely does the agent's patch match the real PR that was merged?
- Code review — would another model pass or fail the agent's patch in review?
- Footprint risk — how many unnecessary changes did the agent make?
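One way to sketch that "gate plus quality dimensions" structure — the field names and score ranges here are illustrative, not the post's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskScore:
    passed: bool                          # the gate: did the PR's tests pass?
    equivalence: Optional[float] = None   # 0..1 similarity to the human patch
    review_pass: Optional[bool] = None    # did a reviewer model approve it?
    footprint_risk: Optional[float] = None  # 0..1, higher = more unneeded churn

def score_task(passed: bool, equivalence: float,
               review_pass: bool, footprint_risk: float) -> TaskScore:
    # The quality dimensions sit above the gate: a patch that fails the
    # tests gets no equivalence / review / footprint scores at all.
    if not passed:
        return TaskScore(passed=False)
    return TaskScore(True, equivalence, review_pass, footprint_risk)
```

The design choice the post implies: pass rate filters, the other three dimensions rank what survives the filter.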
The dead heat
On 87 shared W2 tasks, the pass rates are almost identical:
- gpt-5.1-codex-mini: 77/87 (88.5%)
- gpt-5.3-codex: 78/87 (89.7%)
- gpt-5.4: 78/87 (89.7%)
That sounds like a tie, but it isn't.
Mini and 5.3 agree on 82/87 tasks: 75 both-pass, 7 both-fail, and 5 mixed. The pass-rate headline moves on only those five tasks.
So I looked at the 75 tasks where both agents pass the tests.
Same pass rate. Completely different code.
5.3 is 1.6x more likely than mini to match the human patch. 5.4 is best across the board — highest equivalence, best review pass rate, lowest footprint risk — and the cheapest at $1.34/task.
METR confirmed it
Different methodology, same conclusion.
METR had 4 active maintainers from scikit-learn, Sphinx, and pytest review 296 AI-generated PRs that passed the automated grader. ~50% would not be merged.
"We find that roughly half of test-passing SWE-bench Verified PRs written by mid-2024 to mid/late-2025 agents would not be merged into main by repo maintainers, even after adjusting for noise in maintainer merge decisions."
Others are seeing the same thing
Voratiq found the same pattern in their own workflow across 4,784 candidate patches: test-passing candidates were selected 1.8x more often, but top-reviewed candidates were selected 9.9x more often. Tests are a weak proxy for the code teams actually accept. — Voratiq, March 2026
Tests are the gate, not the source of truth
Pass rate is where models agree. The quality above the gate — equivalence, review, footprint, cost — is where they diverge.
If you're choosing agents by the one metric where they all look the same, you're not choosing.
If you want the full picture: /why
If this matches what you're seeing: ben@benr.build