STET
Pinned W1→W2 comparison

Zod (TypeScript) · 28-task shared slice

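Zod (TypeScript) is tightly contested: GPT-5.4 edges out GPT-5.3 Codex on the test-based pass rate, 79% vs 75%. Quality separates them further: GPT-5.4 leads on equivalence to the intended fix at 45%, and its $0.67 API spend per task is about 4.6x lower than GPT-5.3 Codex's $3.06 and about 12x lower than the costliest model's (Claude Opus 4.7, $8.11).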

Updated Apr 17, 2026 · 463 runs · 3 repos · Judge: GPT-5.3 Codex
Model · 28 tasks per model · Week 3

Best per metric:
The test-based pass/fail bar: 78.6% (GPT-5.4)
Match the intended fix?: 45.5% (GPT-5.4 · all 39.3%)
Would a reviewer approve?: 33.3% (Claude Opus 4.6 · all 25.0%)
Surgical or over-edited?: 0.0% (Claude Opus 4.6 · lowest)
API spend per task: $0.67 (GPT-5.4)
| Model | Harness | The test-based pass/fail bar (n=28) | Change | Match the intended fix? | Would a reviewer approve? | Surgical or over-edited? | API spend per task |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.4 | codex cli | 78.6% | 0.0pp (all four rates) | 45.5% (all 39.3%) | 31.8% (all 25.0%) | 18.2% (all 28.6%) | $0.67 |
| GPT-5.3 Codex | codex cli | 75.0% | 0.0pp (all four rates) | 42.9% (all 35.7%) | 19.0% (all 14.3%) | 28.6% (all 32.1%) | $3.06 |
| GPT-5.1 Codex Mini | codex cli | 75.0% | 0.0pp (all four rates) | 19.0% (all 14.3%) | 9.5% (all 7.1%) | 47.6% (all 50.0%) | $1.33 |
| Claude Opus 4.6 | claude code | 42.9% | new | 41.7% (all 32.1%) | 33.3% (all 25.0%) | 0.0% (all 10.7%) | $6.65 |
| Claude Opus 4.7 | claude code | 42.9% | new | 41.7% (all 46.4%) | 25.0% (all 25.0%) | 0.0% (all 0.0%) | $8.11 |
| GPT-5.4 Mini | codex cli | 32.1% | new | 11.1% (all 14.3%) | 16.7% (all 18.2%) | 77.8% (all 60.7%) | $3.34 |
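
The headline's "12x less" claim can be sanity-checked against the per-task spend figures in the rows above. The following is a minimal Python sketch, not part of the leaderboard's own tooling: the `spend_per_task` dictionary and the `spend_ratios` helper are hypothetical names introduced for illustration, and the dollar values are copied from the rows above.

```python
# Per-task API spend in USD, copied from the per-model rows above.
spend_per_task = {
    "GPT-5.4": 0.67,
    "GPT-5.3 Codex": 3.06,
    "GPT-5.1 Codex Mini": 1.33,
    "Claude Opus 4.6": 6.65,
    "Claude Opus 4.7": 8.11,
    "GPT-5.4 Mini": 3.34,
}


def spend_ratios(spend, baseline):
    """Return how many times the baseline's per-task spend each other model costs.

    Hypothetical helper for illustration only.
    """
    base = spend[baseline]
    return {
        model: cost / base
        for model, cost in spend.items()
        if model != baseline
    }


ratios = spend_ratios(spend_per_task, "GPT-5.4")

# Print the models from cheapest to most expensive relative to GPT-5.4.
for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{model}: {ratio:.1f}x the per-task spend of GPT-5.4")
```

On these figures only Claude Opus 4.7 comes out at roughly 12x the spend of GPT-5.4; the gap to GPT-5.3 Codex is closer to 4.6x.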