Model comparisons
Stet compares coding models on real merged-code tasks, then scores the patches on signals beyond test pass rate: equivalence, code review, footprint risk, runtime, and cost.
Public comparisons show how model and harness choices behaved on these repos. A private Stet run answers the same question with your PR history, tests, AGENTS.md, and review standards. Compare Claude Code vs Codex on your repo.
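One way to read the comparisons below is as a two-stage ranking: test pass rate acts as a gate, and the other signals separate arms that tie on it. The following is a minimal sketch of that reading, not Stet's actual scoring. The `ArmResult` fields, the `rank_arms` helper, the strict lexicographic ordering, and all numbers are assumptions made up for illustration, and runtime is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmResult:
    """Aggregate scores for one model/harness arm (hypothetical field names)."""
    name: str
    test_pass_rate: float    # fraction of tasks whose tests pass (higher is better)
    equivalence_rate: float  # fraction matching the merged human patch (higher is better)
    review_pass_rate: float  # fraction clearing code review (higher is better)
    footprint_risk: float    # patch footprint risk (lower is better)
    cost_per_task: float     # cost per task (lower is better)


def rank_arms(arms):
    """Sort best-first: test pass rate, then the remaining signals as tie-breakers."""
    return sorted(
        arms,
        key=lambda a: (
            -a.test_pass_rate,
            -a.equivalence_rate,
            -a.review_pass_rate,
            a.footprint_risk,
            a.cost_per_task,
        ),
    )


# Illustrative numbers only; these are not results from any Stet run.
arms = [
    ArmResult("arm-a", 0.75, 0.40, 0.50, 0.30, 1.00),
    ArmResult("arm-b", 0.75, 0.55, 0.50, 0.20, 1.50),
    ArmResult("arm-c", 0.50, 0.90, 0.90, 0.10, 0.50),
]

ranking = [a.name for a in rank_arms(arms)]
print(ranking)
```

Here `arm-a` and `arm-b` tie on tests, so equivalence decides between them, while `arm-c` ranks last despite its strong review scores. A real decision would probably weigh cost and footprint against quality rather than apply a strict ordering, which is why several cards below name a tradeoff alongside the winner.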
56 tasks · Zod and graphql-go-tools
GPT-5.5 beat GPT-5.4 on tests, equivalence, review pass, and clean pass across 56 real coding tasks.
- Winner: GPT-5.5
- Tradeoff: GPT-5.4 was cheaper per task; GPT-5.5 produced far more clean passes.
Evidence table
56 tasks · Zod and graphql-go-tools
GPT-5.5 was the better shipping default, while Opus 4.7 remained the lower-footprint option.
- Winner: GPT-5.5
- Tradeoff: Opus 4.7 wrote smaller patches; GPT-5.5 cleared review more often.
Evidence table
56 tasks · Zod and graphql-go-tools
GPT-5.4 matched the human patch more often than Opus 4.7, but Opus kept patches smaller and lower-risk.
- Winner: Depends on the constraint
- Tradeoff: GPT-5.4 was stronger on equivalence; Opus 4.7 had lower footprint risk.
Evidence table
28 tasks · Zod
Opus 4.7 did not pass more tests, but it produced tighter and more equivalent patches.
- Winner: Opus 4.7 above the gate
- Tradeoff: All arms tied on test pass rate; Opus 4.7 improved footprint and equivalence.
Evidence table
26 tasks · graphql-go-tools
GPT-5.5 Codex at four reasoning settings on 26 matched graphql-go-tools tasks: equivalence and review pass climbed sharply with reasoning, while tests were not monotonic.
- Winner: High as default, xhigh for complex work
- Tradeoff: Xhigh produced the best equivalence and review quality, but cost about 2.18x as much per task as high.
Evidence table
29 tasks · graphql-go-tools
Claude Opus 4.7 peaked at medium reasoning effort on 29 matched graphql-go-tools tasks; high, xhigh, and max were more expensive without improving the primary rollout metrics.
- Winner: Medium
- Tradeoff: Higher effort cost more without beating medium on test pass, equivalence, review pass, or aggregate craft.
Evidence table