Benchmark evidence
GPT-5.5 vs GPT-5.4 vs Opus 4.7
On May 1, 2026, Stet ran GPT-5.5, GPT-5.4, and Opus 4.7 on 56 real coding tasks drawn from Zod and graphql-go-tools. GPT-5.5 produced 28 clean passes, GPT-5.4 produced 11, and Opus 4.7 produced 10.
- Tasks: 56 real coding tasks
- Repos: Zod, graphql-go-tools
- Judge: GPT-5.4
- Date: 2026-05-01
- Harnesses: Claude Code, OpenAI Codex CLI
- Caveat: each model ran once per task with a single seed.
Results table
| Metric | Opus 4.7 | GPT-5.4 | GPT-5.5 |
|---|---|---|---|
| Tests pass | 33 / 56 | 31 / 56 | 38 / 56 |
| Equivalence | 19 / 56 | 35 / 56 | 40 / 56 |
| Clean pass | 10 / 56 | 11 / 56 | 28 / 56 |
| Footprint risk (lower is better) | 0.20 | 0.34 | 0.32 |
| Mean time per task | 11m18s | 8m24s | 6m56s |
| Cost per task | $3.43 | $2.39 | $2.86 |
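To make the table easier to compare, the raw counts can be turned into rates and a derived cost per clean pass. A minimal TypeScript sketch using only the numbers above; the `ModelResult` shape and the `costPerCleanPass` figure are illustrative constructs of this note, not metrics the benchmark itself reports:

```ts
// Per-model results copied from the table above (56 tasks total).
interface ModelResult {
  name: string;
  cleanPasses: number; // tasks the benchmark scored as clean passes
  costPerTask: number; // mean USD cost over all 56 tasks
}

const TASKS = 56;

const results: ModelResult[] = [
  { name: "Opus 4.7", cleanPasses: 10, costPerTask: 3.43 },
  { name: "GPT-5.4", cleanPasses: 11, costPerTask: 2.39 },
  { name: "GPT-5.5", cleanPasses: 28, costPerTask: 2.86 },
];

for (const r of results) {
  const cleanPassRate = (100 * r.cleanPasses) / TASKS;
  // Derived figure: total spend for the run divided by clean passes.
  const costPerCleanPass = (r.costPerTask * TASKS) / r.cleanPasses;
  console.log(
    `${r.name}: ${cleanPassRate.toFixed(1)}% clean pass rate, ` +
      `~$${costPerCleanPass.toFixed(2)} per clean pass`
  );
}
```

On these derived figures, GPT-5.5's higher per-task cost relative to GPT-5.4 ($2.86 vs $2.39) inverts once normalized by outcomes: roughly $5.72 per clean pass versus roughly $12.17, because half of GPT-5.5's runs (28 of 56) were clean.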
GPT-5.5 vs GPT-5.4
GPT-5.5 was more shippable: 28 clean passes versus 11 for GPT-5.4, alongside higher tests-pass (38 vs 31) and equivalence (40 vs 35) counts.
GPT-5.5 vs Opus 4.7
GPT-5.5 cleared review and integration work far more often: 28 clean passes versus 10. Opus 4.7 kept the lowest footprint risk of the three (0.20 versus GPT-5.5's 0.32) and can be attractive when narrow diffs matter.
GPT-5.4 vs Opus 4.7
GPT-5.4 was stronger on equivalence than Opus 4.7, 35 of 56 versus 19 of 56. Opus 4.7 had the lower footprint risk, 0.20 versus 0.34.
Source article: original Stet writeup
Methodology: scoring and validation details