Model comparisons
Stet compares coding models on real merged-code tasks, then scores the patches beyond test pass rate: equivalence, code review, footprint risk, runtime, and cost.
56 tasks · Zod and graphql-go-tools
GPT-5.5 beat GPT-5.4 on tests, equivalence, review pass, and clean pass across 56 real coding tasks.
- Winner
- GPT-5.5
- Tradeoff
- GPT-5.4 was cheaper per task; GPT-5.5 produced far more clean passes.
Evidence table
56 tasks · Zod and graphql-go-tools
GPT-5.5 was the better shipping default, while Opus 4.7 remained the lower-footprint option.
- Winner
- GPT-5.5
- Tradeoff
- Opus 4.7 wrote smaller patches; GPT-5.5 cleared review more often.
Evidence table
56 tasks · Zod and graphql-go-tools
GPT-5.4 matched the human patch more often than Opus 4.7, while Opus kept patches smaller and lower-risk.
- Winner
- Depends on the constraint
- Tradeoff
- GPT-5.4 was stronger on equivalence; Opus 4.7 had lower footprint risk.
Evidence table
28 tasks · Zod
Opus 4.7 did not pass more tests, but it produced tighter and more equivalent patches.
- Winner
- Opus 4.7, above the test-pass gate
- Tradeoff
- All arms tied on test pass rate; Opus 4.7 improved footprint and equivalence.
Evidence table