Benchmark evidence
GPT-5.5 vs GPT-5.4 vs Opus 4.7
On May 1, 2026, Stet ran GPT-5.5, GPT-5.4, and Opus 4.7 on 56 real coding tasks drawn from Zod and graphql-go-tools. GPT-5.5 produced 28 clean passes, GPT-5.4 produced 11, and Opus 4.7 produced 10.
- Tasks: 56 real coding tasks
- Repos: Zod, graphql-go-tools
- Judge: GPT-5.4
- Date: 2026-05-01
- Harnesses: Claude Code, OpenAI Codex CLI
- Caveat: each model ran once per task with a single seed.
Results table
| Metric | Opus 4.7 | GPT-5.4 | GPT-5.5 |
|---|---|---|---|
| Tests pass | 33 / 56 | 31 / 56 | 38 / 56 |
| Equivalence | 19 / 56 | 35 / 56 | 40 / 56 |
| Clean pass | 10 / 56 | 11 / 56 | 28 / 56 |
| Footprint risk (lower is better) | 0.20 | 0.34 | 0.32 |
| Mean time per task | 11m18s | 8m24s | 6m56s |
| Cost per task | $3.43 | $2.39 | $2.86 |
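To make the table easier to compare, the raw counts can be turned into rates and a derived cost per clean pass. A minimal TypeScript sketch using only the numbers above; the `ModelResult` shape and the `costPerCleanPass` figure are illustrative constructs of this note, not metrics the benchmark itself reports:

```ts
// Per-model results copied from the table above (56 tasks total).
interface ModelResult {
  name: string;
  cleanPasses: number; // tasks the benchmark scored as clean passes
  costPerTask: number; // mean USD cost over all 56 tasks
}

const TASKS = 56;

const results: ModelResult[] = [
  { name: "Opus 4.7", cleanPasses: 10, costPerTask: 3.43 },
  { name: "GPT-5.4", cleanPasses: 11, costPerTask: 2.39 },
  { name: "GPT-5.5", cleanPasses: 28, costPerTask: 2.86 },
];

for (const r of results) {
  const cleanPassRate = (100 * r.cleanPasses) / TASKS;
  // Derived figure: total spend for the run divided by clean passes.
  const costPerCleanPass = (r.costPerTask * TASKS) / r.cleanPasses;
  console.log(
    `${r.name}: ${cleanPassRate.toFixed(1)}% clean pass rate, ` +
      `~$${costPerCleanPass.toFixed(2)} per clean pass`
  );
}
```

On these derived figures, GPT-5.5's higher per-task cost relative to GPT-5.4 ($2.86 vs $2.39) inverts once normalized by outcomes: roughly $5.72 per clean pass versus roughly $12.17, because half of GPT-5.5's runs (28 of 56) were clean.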
GPT-5.5 vs GPT-5.4
GPT-5.5 was more shippable: 28 clean passes versus 11 for GPT-5.4, alongside higher tests-pass (38 vs 31) and equivalence (40 vs 35) counts.
GPT-5.5 vs Opus 4.7
GPT-5.5 cleared review and integration work far more often: 28 clean passes versus 10. Opus 4.7 kept the lowest footprint risk of the three (0.20 versus GPT-5.5's 0.32) and can be attractive when narrow diffs matter.
GPT-5.4 vs Opus 4.7
GPT-5.4 was stronger on equivalence than Opus 4.7, 35 of 56 versus 19 of 56. Opus 4.7 had the lower footprint risk, 0.20 versus 0.34.
Source article: original Stet writeup
Methodology: scoring and validation details