STET
Pinned W1→W2 comparison

Zod (TypeScript) · 28-task shared slice

Zod (TypeScript) is tightly contested: GPT-5.4 edges out GPT-5.3 Codex on pass rate, 78.6% vs 75.0%. Quality and cost separate them further: GPT-5.4 leads equivalence at 45.5% (vs 42.9%) and costs roughly 4.6x less per task ($0.67 vs $3.06).

Updated Mar 07, 2026 · 261 runs · 3 repos · 3 canonical models · 3/3 repos comparable · denominators aligned

Pinned W1→W2 leaderboard: 3 repos, 3 canonical models, same denominator for testing, equivalence, and review. Pass rates are tied across the 3 repos, but quality tells a different story: GPT-5.4 leads equivalence by 2x on average and costs 4x less.

Pinned W1→W2
Zod (TypeScript) is publishing a pinned 28-task slice.
Testing, equivalence, and review compare like-for-like.
Lineage: Pinned
Shared Slice: 28 tasks
Models: 3 aligned
Quality Denom: Equiv + review complete
28 tasks · Week 2
Gate: 78.6% (GPT-5.4). Cumulative pass rate across all weeks; the gate, not the differentiator.
Equiv: 45.5% (GPT-5.4 · all 39.3%). Best cumulative equivalence rate among passing tasks across all weeks.
Review: 31.8% (GPT-5.4 · all 25.0%). Best cumulative code review pass rate among passing tasks across all weeks.
Cost/Task: $0.67 (GPT-5.4). Cheapest model's average API cost per task.
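
The difference between the "among passing tasks" figures and the "all" figures comes down to the denominator. The sketch below recovers the implied task counts for GPT-5.4, assuming each percentage is a plain ratio of whole tasks over the 28-task slice; the page does not publish raw counts, and the helper name `implied_count` is hypothetical.

```python
def implied_count(rate_pct: float, denom: int) -> int:
    """Nearest whole number of tasks that gives rate_pct over denom."""
    return round(rate_pct / 100 * denom)

TOTAL_TASKS = 28  # size of the pinned shared slice

# Assumption: every published rate is count / denominator, rounded to 0.1%.
passing = implied_count(78.6, TOTAL_TASKS)       # tasks that pass the gate
equiv_passing = implied_count(45.5, passing)     # equivalent, among passing
review_passing = implied_count(31.8, passing)    # review pass, among passing
review_all = implied_count(25.0, TOTAL_TASKS)    # review pass, over all tasks

print(passing, equiv_passing, review_passing, review_all)  # 22 10 7 7
```

Under that assumption, the two review figures describe the same 7 tasks, divided once by the 22 passing tasks (31.8%) and once by all 28 tasks (25.0%).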
Pinned Weekly Delta (W1 → W2 · same 28-task slice · pp)
GPT-5.4: Pass new
GPT-5.3 Codex: Pass -3.6pp · Equiv -2.6pp · Review -3.7pp · Footprint +1.3pp
GPT-5.1 Codex Mini: Pass -3.6pp · Equiv -3.7pp · Review +5.0pp · Footprint +2.2pp
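
A percentage-point delta on a fixed slice can be read back as a number of tasks. The sketch below does this for the pass-rate delta only, assuming the pass rate in both weeks is computed over the same 28 tasks; the helper name `pp_to_tasks` is hypothetical. It does not apply to the Equiv, Review, or Footprint deltas, which are measured among passing tasks and so can have a different denominator each week.

```python
def pp_to_tasks(delta_pp: float, n_tasks: int) -> float:
    """Convert a percentage-point change into a task count on a fixed slice."""
    return delta_pp / 100 * n_tasks

SLICE_SIZE = 28  # pinned W1 -> W2 slice

# A -3.6pp pass-rate change on 28 tasks is about one task.
tasks_changed = pp_to_tasks(-3.6, SLICE_SIZE)
print(round(tasks_changed, 2))
```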

Current Rankings

Columns:
Pass Rate (95% CI): Binary pass rate with 95% bootstrap confidence interval
Δ W1→W2: Change in pass rate (percentage points) from W1 to W2
Equiv: Equivalence rate among passing tasks (with all-tasks and W1→W2 delta shown in-row when available)
Review: Code-review pass rate among passing tasks (with all-tasks and W1→W2 delta shown in-row when available)
Footprint: Footprint-risk flag rate among passing tasks (with all-tasks and W1→W2 delta shown in-row when available)
Time: Median task completion time across all tasks
Cost/Task: Average API cost per task

Model               Pass Rate (95% CI)          Δ W1→W2  Equiv              Review             Footprint          Time  Cost/Task
GPT-5.4             78.6% [78.6%, 78.6%], n=28  n/a      45.5% (all 39.3%)  31.8% (all 25.0%)  18.2% (all 28.6%)  -     $0.67
GPT-5.3 Codex       75.0% [75.0%, 75.0%], n=28  -3.6pp   42.9% (all 35.7%)  19.0% (all 14.3%)  28.6% (all 32.1%)  -     $3.06
GPT-5.1 Codex Mini  75.0% [75.0%, 75.0%], n=28  -3.6pp   19.0% (all 14.3%)  9.5% (all 7.1%)    47.6% (all 50.0%)  -     $1.33
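
The pass-rate column is described as a binary pass rate with a 95% bootstrap confidence interval. The sketch below shows one common way to compute such an interval, the percentile bootstrap, using only the Python standard library. It is an illustration, not the leaderboard's actual procedure: the function name `bootstrap_ci`, the resample count, and the seed are assumptions, and the 22-of-28 outcome vector is inferred from the published 78.6% rather than taken from raw data.

```python
import random

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a binary pass rate."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's pass rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Assumed outcomes: 22 passes and 6 failures (78.6% of 28 tasks).
outcomes = [1] * 22 + [0] * 6
point = sum(outcomes) / len(outcomes)
lo, hi = bootstrap_ci(outcomes)
print(f"{point:.1%} [{lo:.1%}, {hi:.1%}]")
```

With only 28 tasks per model, an interval computed this way spans many percentage points, which is worth keeping in mind when reading a 78.6% vs 75.0% gap.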