GPT-5.5 vs GPT-5.4 vs Opus 4.7 Evidence

On May 1, 2026, Stet ran GPT-5.5, GPT-5.4, and Opus 4.7 on 56 real coding tasks from Zod and graphql-go-tools; GPT-5.5 produced 28 clean passes, GPT-5.4 produced 11, and Opus 4.7 produced 10.

Tasks: 56 real coding tasks
Repos: Zod, graphql-go-tools
Judge: GPT-5.4
Date: 2026-05-01
Harnesses: Claude Code, OpenAI Codex CLI
Caveat: Each model ran once per task with a single seed.
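
Because each model ran once per task with a single seed, every count in this report is a single binomial draw. A quick way to gauge the resulting uncertainty is a Wilson score interval; the TypeScript sketch below is a purely illustrative derivation on our part, not part of the Stet harness.

```ts
// Illustration only: 95% Wilson score interval for a pass count observed
// in a single run, to show how wide single-seed results can be.
function wilsonInterval(passes: number, total: number, z = 1.96): [number, number] {
  const p = passes / total;
  const denom = 1 + (z * z) / total;
  const center = (p + (z * z) / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + (z * z) / (4 * total * total))) / denom;
  return [center - half, center + half];
}

// GPT-5.5's 28 / 56 clean passes (50%) comes out to roughly 37% to 63%.
const [lo, hi] = wilsonInterval(28, 56);
console.log(`clean-pass 95% CI: ${(lo * 100).toFixed(1)}% to ${(hi * 100).toFixed(1)}%`);
```

By this rough measure, the GPT-5.5 versus GPT-5.4 clean-pass gap (28 versus 11 of 56) is wide enough that the two intervals do not overlap, while smaller gaps in the table below, such as tests pass, sit well within single-run noise.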

Results table

| Metric             | Opus 4.7 | GPT-5.4 | GPT-5.5 |
|--------------------|----------|---------|---------|
| Tests pass         | 33 / 56  | 31 / 56 | 38 / 56 |
| Equivalence        | 19 / 56  | 35 / 56 | 40 / 56 |
| Clean pass         | 10 / 56  | 11 / 56 | 28 / 56 |
| Footprint risk     | 0.20     | 0.34    | 0.32    |
| Mean time per task | 11m 18s  | 8m 24s  | 6m 56s  |
| Cost per task      | $3.43    | $2.39   | $2.86   |
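
To make the counts easier to compare, this sketch restates the results as data and derives percentage rates. The row shape and names are ours for illustration, not Stet's.

```ts
// Counts copied from the results table above; TOTAL is the 56-task suite.
type Row = { model: string; testsPass: number; equivalence: number; cleanPass: number };

const TOTAL = 56;
const rows: Row[] = [
  { model: "Opus 4.7", testsPass: 33, equivalence: 19, cleanPass: 10 },
  { model: "GPT-5.4", testsPass: 31, equivalence: 35, cleanPass: 11 },
  { model: "GPT-5.5", testsPass: 38, equivalence: 40, cleanPass: 28 },
];

const pct = (n: number) => `${((n / TOTAL) * 100).toFixed(1)}%`;
for (const r of rows) {
  console.log(
    `${r.model}: tests ${pct(r.testsPass)}, equivalence ${pct(r.equivalence)}, clean ${pct(r.cleanPass)}`
  );
}
```

On rates, GPT-5.5 is the only model to clean-pass half the tasks (50.0%), against 19.6% for GPT-5.4 and 17.9% for Opus 4.7.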

GPT-5.5 vs GPT-5.4

GPT-5.5 was more shippable: 28 clean passes versus 11 for GPT-5.4, and it led on both tests pass (38 versus 31 of 56) and equivalence (40 versus 35 of 56).

GPT-5.5 vs Opus 4.7

GPT-5.5 cleared review and integration work more often, with 28 clean passes to Opus 4.7's 10. Opus 4.7 kept the lowest footprint risk of the three (0.20 versus GPT-5.5's 0.32) and can be attractive where narrow diffs matter.

GPT-5.4 vs Opus 4.7

GPT-5.4 was stronger on equivalence than Opus 4.7 (35 of 56 versus 19 of 56), while Opus 4.7 passed tests slightly more often (33 versus 31 of 56) and had the lower footprint risk (0.20 versus 0.34).
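
One way to weigh these trade-offs is expected cost per clean pass: per-task cost divided by the clean-pass rate. This is a derived metric of ours, not something Stet reports, and it assumes per-task cost is roughly constant across tasks.

```ts
// Derived, illustrative metric: expected spend per shippable result.
const TOTAL = 56;
const models = [
  { name: "Opus 4.7", cleanPass: 10, costPerTask: 3.43 },
  { name: "GPT-5.4", cleanPass: 11, costPerTask: 2.39 },
  { name: "GPT-5.5", cleanPass: 28, costPerTask: 2.86 },
];

for (const m of models) {
  // Expected attempts per clean pass is TOTAL / cleanPass, so
  // cost per clean pass = costPerTask * TOTAL / cleanPass.
  const costPerCleanPass = (m.costPerTask * TOTAL) / m.cleanPass;
  console.log(`${m.name}: $${costPerCleanPass.toFixed(2)} per clean pass`);
}
```

By this measure, GPT-5.5's higher per-task cost relative to GPT-5.4 ($2.86 versus $2.39) is more than offset by its clean-pass rate: roughly $5.72 per clean pass versus $12.17 for GPT-5.4 and $19.21 for Opus 4.7.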

Source article: Source writeup

Methodology: scoring and validation details