STET

Model comparisons

Stet compares coding models on real merged-code tasks, then scores the patches beyond test pass rate: equivalence, code review, footprint risk, runtime, and cost.

56 tasks · Zod and graphql-go-tools

GPT-5.5 vs GPT-5.4 real coding benchmark

GPT-5.5 beat GPT-5.4 on tests, equivalence, review pass, and clean pass across 56 real coding tasks.

Winner: GPT-5.5
Tradeoff: GPT-5.4 was cheaper per task; GPT-5.5 produced far more clean passes.
Evidence table

GPT-5.4 vs Opus 4.7

GPT-5.4 was stronger on equivalence in the 56-task benchmark: 35 of 56 patches matched the human change, versus 19 of 56 for Opus 4.7. Opus had the lower footprint risk, 0.20 versus 0.34, which makes it the more conservative patch writer.

Metric           Opus 4.7   GPT-5.4
Tests pass       33 / 56    31 / 56
Equivalence      19 / 56    35 / 56
Clean pass       10 / 56    11 / 56
Footprint risk   0.20       0.34
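As a rough illustration of how per-model tallies like the table above reduce to comparable rates, here is a minimal sketch. The field names and the `Result` class are hypothetical, not Stet's actual API; only the counts come from the 56-task benchmark reported here.

```python
from dataclasses import dataclass

@dataclass
class Result:
    # Hypothetical container mirroring the published metrics.
    model: str
    tests_pass: int        # patches whose test suite passes
    equivalence: int       # patches judged equivalent to the human change
    clean_pass: int        # patches passing every check at once
    footprint_risk: float  # 0-1; higher means a larger, riskier diff
    total: int = 56        # tasks in the benchmark

    def rate(self, count: int) -> float:
        # Convert a raw count into a pass rate over all tasks.
        return count / self.total

opus = Result("Opus 4.7", tests_pass=33, equivalence=19,
              clean_pass=10, footprint_risk=0.20)
gpt = Result("GPT-5.4", tests_pass=31, equivalence=35,
             clean_pass=11, footprint_risk=0.34)

for r in (opus, gpt):
    print(f"{r.model}: tests {r.rate(r.tests_pass):.0%}, "
          f"equivalence {r.rate(r.equivalence):.0%}, "
          f"clean {r.rate(r.clean_pass):.0%}")
```

Expressed as rates, the split in the table is easy to see: Opus 4.7 leads narrowly on tests passed, while GPT-5.4 leads decisively on equivalence.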