Model comparisons

Stet compares coding models on real merged-code tasks, then scores the patches on more than test pass rate: equivalence to the human patch, code review, footprint risk, runtime, and cost.

56 tasks · Zod and graphql-go-tools

GPT-5.5 vs GPT-5.4 real coding benchmark

GPT-5.5 beat GPT-5.4 on tests, equivalence, review pass, and clean pass across 56 real coding tasks.

Winner: GPT-5.5
Tradeoff: GPT-5.4 was cheaper per task; GPT-5.5 produced far more clean passes.

Evidence table

56 tasks · Zod and graphql-go-tools

GPT-5.5 vs Claude Opus 4.7 coding-agent benchmark

GPT-5.5 was the better shipping default, while Opus 4.7 remained the lower-footprint option.

Winner: GPT-5.5
Tradeoff: Opus 4.7 wrote smaller patches; GPT-5.5 cleared review more often.

Evidence table

56 tasks · Zod and graphql-go-tools

GPT-5.4 vs Claude Opus 4.7 coding-agent benchmark

GPT-5.4 matched the human patch more often than Opus 4.7, but Opus kept patches smaller and lower-risk.

Winner: Depends on the constraint
Tradeoff: GPT-5.4 was stronger on equivalence; Opus 4.7 had lower footprint risk.

Evidence table

28 tasks · Zod

Claude Opus 4.7 vs Opus 4.6 Zod benchmark

Opus 4.7 did not pass more tests, but it produced tighter and more equivalent patches.

Winner: Opus 4.7 on the metrics beyond test pass rate
Tradeoff: All arms tied on test pass rate; Opus 4.7 improved footprint and equivalence.

Evidence table

GPT-5.4 vs Opus 4.7

GPT-5.4 was stronger on equivalence in the 56-task benchmark: 35 of 56 patches matched the human change, versus 19 of 56 for Opus 4.7. Opus had the lower footprint risk, 0.20 versus 0.34, which makes it the more conservative patch writer.

Metric	Opus 4.7	GPT-5.4
Tests pass	33 / 56	31 / 56
Equivalence	19 / 56	35 / 56
Clean pass	10 / 56	11 / 56
Footprint risk	0.20	0.34
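
As a quick check on the counts in the table above, the following sketch converts them to rates and measures the gap between the two models in percentage points. The dictionary layout and the helper names (`pass_rates`, `gap_in_points`) are illustrative assumptions, not part of Stet. Footprint risk is left out because it is a score, not a count out of 56.

```python
TOTAL_TASKS = 56

# Counts copied from the table above; the layout is an illustrative assumption.
COUNTS = {
    "Tests pass": {"Opus 4.7": 33, "GPT-5.4": 31},
    "Equivalence": {"Opus 4.7": 19, "GPT-5.4": 35},
    "Clean pass": {"Opus 4.7": 10, "GPT-5.4": 11},
}


def pass_rates(counts, total):
    """Convert per-model counts into fractions of the total task count."""
    return {
        metric: {model: n / total for model, n in by_model.items()}
        for metric, by_model in counts.items()
    }


def gap_in_points(rates, metric, model_a, model_b):
    """Return model_a's rate minus model_b's rate, in percentage points."""
    return (rates[metric][model_a] - rates[metric][model_b]) * 100


rates = pass_rates(COUNTS, TOTAL_TASKS)

for metric, by_model in rates.items():
    row = ", ".join(f"{model} {rate:.1%}" for model, rate in by_model.items())
    print(f"{metric}: {row}")

equivalence_gap = gap_in_points(rates, "Equivalence", "GPT-5.4", "Opus 4.7")
tests_gap = gap_in_points(rates, "Tests pass", "Opus 4.7", "GPT-5.4")
print(f"Equivalence gap: {equivalence_gap:.1f} points in favor of GPT-5.4")
print(f"Tests pass gap: {tests_gap:.1f} points in favor of Opus 4.7")
```

On these counts, the equivalence gap is about 28.6 percentage points (62.5% versus 33.9%), while the test pass gap is about 3.6 points in Opus 4.7's favor.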