Pinned W1→W2 comparison
Zod (TypeScript) · 28-task shared slice
“Zod (TypeScript) is tightly contested: GPT-5.4 edges out GPT-5.3 Codex on pass rate, 78.6% vs 75.0%, but quality separates them: GPT-5.4 leads equivalence at 45.5% (vs 42.9%) and review at 31.8% (vs 19.0%), and costs about 4.6x less per task ($0.67 vs $3.06).”
Updated Mar 07, 2026 · 261 runs · 3 repos · 3 canonical models · 3/3 repos comparable · denominators aligned
Pinned W1→W2 leaderboard: 3 repos, 3 canonical models, same denominator for testing, equivalence, and review. Pass rates are tied across 3 repos — but quality tells a different story: GPT-5.4 leads equivalence by 2x on average; GPT-5.4 costs 4x less.
Zod (TypeScript) publishes a pinned 28-task slice, so testing, equivalence, and review are compared like-for-like.
Lineage: Pinned
Shared Slice: 28 tasks
Models: 3 aligned
Quality Denom: Equiv + review complete
28 tasks · Week 2
Gate: 78.6% (GPT-5.4). Cumulative pass rate across all weeks; this is the gate, not the differentiator.
Equiv: 45.5% (GPT-5.4 · all 39.3%). Best cumulative equivalence rate among passing tasks across all weeks.
Review: 31.8% (GPT-5.4 · all 25.0%). Best cumulative code review pass rate among passing tasks across all weeks.
Cost/Task: $0.67 (GPT-5.4). Cheapest model's average API cost per task.
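The review card reports two percentages for the same model because it uses two denominators: passing tasks only, and all tasks in the slice. A minimal sketch of that arithmetic follows. The raw counts (22 passing tasks, 7 review passes) are not published on this page; they are back-computed from the percentages and should be read as assumptions, and `rate` is a hypothetical helper.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a percentage, rounded to one decimal."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100.0 * numerator / denominator, 1)


# Assumed counts, inferred from the published GPT-5.4 percentages.
TOTAL_TASKS = 28     # size of the pinned shared slice
PASSING_TASKS = 22   # assumption: 22 / 28 matches the 78.6% gate
REVIEW_PASSES = 7    # assumption: 7 / 28 matches the 25.0% all-tasks review rate

pass_rate = rate(PASSING_TASKS, TOTAL_TASKS)
review_among_passing = rate(REVIEW_PASSES, PASSING_TASKS)
review_all_tasks = rate(REVIEW_PASSES, TOTAL_TASKS)

print(f"pass {pass_rate}% | review {review_among_passing}% among passing, "
      f"{review_all_tasks}% of all tasks")
```

The among-passing figure is always at least as large as the all-tasks figure when only passing tasks can be counted in the numerator, which is why the two should not be compared across columns.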
Pinned Weekly Delta (W1 → W2 · same 28-task slice · pp)
GPT-5.4: Pass new
GPT-5.3 Codex: Pass -3.6pp · Equiv -2.6pp · Review -3.7pp · Footprint +1.3pp
GPT-5.1 Codex Mini: Pass -3.6pp · Equiv -3.7pp · Review +5.0pp · Footprint +2.2pp
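The deltas above are in percentage points (pp), which is a plain difference of two rates, not a relative change. A short sketch of the distinction, assuming a W1 pass rate of 78.6% for GPT-5.3 Codex; that W1 value is not shown on this page and is implied only by the W2 rate of 75.0% and the published -3.6pp. The function names are hypothetical.

```python
def pp_delta(previous_pct: float, current_pct: float) -> float:
    """Difference between two rates, in percentage points."""
    return round(current_pct - previous_pct, 1)


def relative_change_pct(previous_pct: float, current_pct: float) -> float:
    """Relative change of the rate itself, as a percentage of the old rate."""
    if previous_pct == 0:
        raise ValueError("previous rate must be non-zero")
    return round(100.0 * (current_pct - previous_pct) / previous_pct, 1)


def format_pp(delta: float) -> str:
    """Render a delta with an explicit sign, e.g. '-3.6pp'."""
    return f"{delta:+.1f}pp"


w1_pass = 78.6  # assumption: implied by W2 rate and the published delta
w2_pass = 75.0  # W2 pass rate for GPT-5.3 Codex from the rankings

delta = pp_delta(w1_pass, w2_pass)
relative = relative_change_pct(w1_pass, w2_pass)
print(format_pp(delta), f"({relative:+.1f}% relative)")
```

On a 28-task slice, one task is worth about 3.6pp, so a -3.6pp pass delta corresponds to a single task changing outcome.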
Current Rankings
Pass Rate (95% CI): binary pass rate with 95% bootstrap confidence interval.
Δ W1→W2: change in pass rate (percentage points) from W1 to W2.
Equiv: equivalence rate among passing tasks (all-tasks rate and W1→W2 delta shown in-row when available).
Review: code-review pass rate among passing tasks (all-tasks rate and W1→W2 delta shown in-row when available).
Footprint: footprint-risk flag rate among passing tasks (all-tasks rate and W1→W2 delta shown in-row when available).
Time: median task completion time across all tasks.
Cost/Task: average API cost per task.

| Model | Pass Rate (95% CI) | Δ W1→W2 | Equiv | Review | Footprint | Time | Cost/Task |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 78.6% (78.6%–78.6%, n=28) | n/a | 45.5% (all 39.3%) | 31.8% (all 25.0%) | 18.2% (all 28.6%) | n/a | $0.67 |
| GPT-5.3 Codex | 75.0% (75.0%–75.0%, n=28) | -3.6pp | 42.9% (all 35.7%) | 19.0% (all 14.3%) | 28.6% (all 32.1%) | n/a | $3.06 |
| GPT-5.1 Codex Mini | 75.0% (75.0%–75.0%, n=28) | -3.6pp | 19.0% (all 14.3%) | 9.5% (all 7.1%) | 47.6% (all 50.0%) | n/a | $1.33 |
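The pass-rate column is described as carrying a 95% bootstrap confidence interval. The page does not say how the resampling is done, so the following is only a generic percentile-bootstrap sketch over per-task pass/fail outcomes, not the leaderboard's actual method. The outcome vector (22 passes out of 28) is an assumption back-computed from the 78.6% rate, and `bootstrap_pass_rate_ci` is a hypothetical name.

```python
import random


def bootstrap_pass_rate_ci(outcomes, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval (in percent) for a binary pass rate."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    rates = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and recompute the pass rate.
        sample = rng.choices(outcomes, k=n)
        rates.append(100.0 * sum(sample) / n)
    rates.sort()
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    return rates[lo_idx], rates[hi_idx]


# Assumed per-task outcomes: 1 = pass, 0 = fail (22 of 28 passing).
outcomes = [1] * 22 + [0] * 6
point = 100.0 * sum(outcomes) / len(outcomes)
low, high = bootstrap_pass_rate_ci(outcomes)
print(f"{point:.1f}% ({low:.1f}%–{high:.1f}%, n={len(outcomes)})")
```

Resampling at the task level like this generally gives a non-zero-width interval for n=28, so the zero-width intervals in the table suggest the published figures resample something else (or nothing); that is a guess, not something the page confirms.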