STET

Model comparisons

Stet compares coding models on real merged-code tasks, then scores the patches on more than test pass rate: equivalence, code review, footprint risk, runtime, and cost.

Public comparisons show how model and harness choices behaved on these repos. A private Stet run answers the same question with your PR history, tests, AGENTS.md, and review standards. Compare Claude Code vs Codex on your repo.

56 tasks · Zod and graphql-go-tools

GPT-5.5 vs GPT-5.4 real coding benchmark

GPT-5.5 beat GPT-5.4 on tests, equivalence, review pass, and clean pass across 56 real coding tasks.

Winner: GPT-5.5
Tradeoff: GPT-5.4 was cheaper per task; GPT-5.5 produced far more clean passes.

26 tasks · graphql-go-tools

GPT-5.5 low vs medium vs high vs xhigh reasoning curve

GPT-5.5 Codex was run at four reasoning settings on 26 matched graphql-go-tools tasks. Equivalence and review pass climbed sharply with reasoning effort, while the test pass rate did not rise monotonically.

Winner: High as default, xhigh for complex work
Tradeoff: Xhigh produced the best equivalence and review quality, but cost about 2.18x as much per task as high.

GPT-5.4 vs Opus 4.7

GPT-5.4 was stronger on equivalence in the 56-task benchmark: 35 of 56 patches matched the human change, versus 19 of 56 for Opus 4.7. Opus had the lower footprint risk, 0.20 versus 0.34, which makes it the more conservative patch writer.

Metric         | Opus 4.7 | GPT-5.4
Tests pass     | 33 / 56  | 31 / 56
Equivalence    | 19 / 56  | 35 / 56
Clean pass     | 10 / 56  | 11 / 56
Footprint risk | 0.20     | 0.34
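
To compare the two models more directly, the counts in the table can be converted to percentages. The sketch below is illustrative only: the counts are copied from the table, while the dictionary layout and the `rate` helper are assumptions made for this example and are not part of Stet.

```python
# Counts copied from the comparison table; the structure is hypothetical.
TOTAL_TASKS = 56

results = {
    "Opus 4.7": {"tests_pass": 33, "equivalence": 19, "clean_pass": 10},
    "GPT-5.4": {"tests_pass": 31, "equivalence": 35, "clean_pass": 11},
}


def rate(count, total=TOTAL_TASKS):
    """Return count / total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Convert every raw count into a percentage of the 56 tasks.
rates = {
    model: {metric: rate(count) for metric, count in counts.items()}
    for model, counts in results.items()
}

for model, metrics in rates.items():
    print(model, metrics)
```

On these figures, GPT-5.4 matched the human change on 62.5% of tasks against about 33.9% for Opus 4.7, while the test pass and clean pass rates differ by only a few points.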