Your AI coding benchmark is hiding a 2x quality gap
The assumption
Coding-agent evals like SWE-bench and Terminal-Bench share one base assumption: agent quality comes down to a single primary metric, and that metric is test pass rate.
Claude Code with Opus 4.6 passes tests 73% of the time. Codex with GPT 5.4 passes 80%. Ship GPT 5.4. EZ, right?
How we measured
We ran 3 models against 87 tasks, drawn from 3 real open-source repos: Zod, graphql-go-tools, and sqlparser-rs.
Each task is a real PR or commit that was merged to the repo. The agent gets the repo prior to the merge, and instructions to do the task. The PR's own tests decide if the agent's change passes or fails.
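That per-task protocol can be sketched as a small function. The step names here are hypothetical (the post doesn't show its harness); the checkout, agent, and test steps are injected as callables so the sketch stays self-contained:

```python
from typing import Callable

def run_task(checkout_pre_merge: Callable[[], None],
             run_agent: Callable[[str], None],
             run_pr_tests: Callable[[], int],
             instructions: str) -> bool:
    """One eval task: restore the repo to its pre-merge state, let the
    agent attempt the task, then let the PR's own tests be the judge."""
    checkout_pre_merge()        # repo as it was just before the PR merged
    run_agent(instructions)     # agent edits the working tree
    return run_pr_tests() == 0  # exit code 0 from the PR's tests = pass
```

In a real harness the first callable would be a `git checkout` of the pre-merge commit and the last would shell out to the repo's test command; here they are stubs.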
Pass rate is the gate. But we also score three quality dimensions above it:
- Equivalence — how closely does the agent's patch match the real PR that was merged?
- Code review — would another model pass or fail the agent's patch in review?
- Footprint risk — how many unnecessary changes did the agent make?
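One way to sketch that "gate plus quality dimensions" structure — the field names and score ranges here are illustrative, not the post's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskScore:
    passed: bool                          # the gate: did the PR's tests pass?
    equivalence: Optional[float] = None   # 0..1 similarity to the human patch
    review_pass: Optional[bool] = None    # did a reviewer model approve it?
    footprint_risk: Optional[float] = None  # 0..1, higher = more unneeded churn

def score_task(passed: bool, equivalence: float,
               review_pass: bool, footprint_risk: float) -> TaskScore:
    # The quality dimensions sit above the gate: a patch that fails the
    # tests gets no equivalence / review / footprint scores at all.
    if not passed:
        return TaskScore(passed=False)
    return TaskScore(True, equivalence, review_pass, footprint_risk)
```

The design choice the post implies: pass rate filters, the other three dimensions rank what survives the filter.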
The dead heat
On 87 shared W2 tasks, the pass rates are almost identical:
- gpt-5.1-codex-mini: 77/87 (88.5%)
- gpt-5.3-codex: 78/87 (89.7%)
- gpt-5.4: 78/87 (89.7%)
That sounds like a tie, but it isn't.
Mini and 5.3 agree on 82/87 tasks: 75 both-pass, 7 both-fail, and 5 mixed. The pass-rate headline moves on only those five tasks.
So I looked at the 75 tasks where both agents pass the tests.
Same pass rate. Completely different code.
5.3 is 1.6x more likely than mini to match the human patch. 5.4 is best across the board — highest equivalence, best review pass rate, lowest footprint risk — and the cheapest at $1.34/task.
METR confirmed it
Different methodology, same conclusion.
METR had 4 active maintainers from scikit-learn, Sphinx, and pytest review 296 AI-generated PRs that passed the automated grader. ~50% would not be merged.
"We find that roughly half of test-passing SWE-bench Verified PRs written by mid-2024 to mid/late-2025 agents would not be merged into main by repo maintainers, even after adjusting for noise in maintainer merge decisions."
Others are seeing the same thing
Voratiq found the same pattern in their own workflow across 4,784 candidate patches: test-passing candidates were selected 1.8x more often, but top-reviewed candidates were selected 9.9x more often. Tests are a weak proxy for the code teams actually accept. — Voratiq, March 2026
Tests are the gate, not the source of truth
Pass rate is where models agree. The quality above the gate — equivalence, review, footprint, cost — is where they diverge.
If you're choosing agents by the one metric where they all look the same, you're not choosing.
If you want the full picture: /why
If this matches what you're seeing: ben@benr.build