Analysis and notes from the Stet runs
Methodology notes, benchmark evidence, and model-comparison writeups from real AI coding-agent evals.
Topic
Model comparisons
Real-repo coding-agent comparisons across GPT-5.5, GPT-5.4, Claude Opus 4.7, and Opus 4.6, scored by tests, equivalence, code review, footprint, time, and cost.
Opus 4.8 vs Opus 4.7 vs GPT-5.5 vs Composer 2.5 - 50 Real PRs in Go and Rust
June 2, 2026
I graded four frontier coding models on 50 real merged PRs in Go and Rust, scoring not just whether tests pass but also craft, equivalence, and cost. Opus 4.8 led on craft in both languages.
I had Codex iterate on its own AGENTS.md 8 times and measured each version against real PRs. The best one still regressed on a clean holdout.
May 27, 2026
Codex optimized its own AGENTS.md against real Stet repo tasks. The best candidate improved the training slice, then regressed enough on a clean holdout that it was not safe to ship.
GPT-5.5 High Regression Check on GraphQL-go-tools
May 18, 2026
I compared a fresh GPT-5.5 Codex high rerun on 21 clean GraphQL-go-tools tasks with the May 5 GPT-5.5 high run. The rerun was directionally worse on tests, equivalence, and review pass count, but the evidence is mixed and does not show a broad quality collapse.
Opus 4.7 low vs medium vs high vs xhigh vs max: the reasoning curve on 29 real tasks from an open source repo
May 12, 2026
Claude Opus 4.7 reasoning-effort curve on 29 matched GraphQL-go-tools tasks: low, medium, high, xhigh, and max. Medium wins the behavioral metrics; more reasoning does not reliably buy better patches.
GPT-5.5 low vs medium vs high vs xhigh: the reasoning curve on 26 real tasks from an open source repo
May 7, 2026
An interactive GPT-5.5 Codex reasoning-effort curve on 26 matched GraphQL-go-tools tasks: low, medium, high, and xhigh.
GPT-5.5 vs GPT-5.4 vs Opus 4.7 on 56 real coding tasks from 2 open source repos
May 1, 2026
Opus 4.7 vs GPT-5.5 vs GPT-5.4 on 56 real coding tasks across two open-source repos. Opus writes smaller patches; GPT-5.5 writes patches that more often survive review.
Opus 4.7 vs Old Opus 4.6 vs New Opus 4.6
April 17, 2026
Three Opus snapshots, same 12/28 test pass rate. Above the gate, 4.7 is directionally better: more disciplined, not fundamentally smarter.
Your AGENTS.md is the highest-leverage code you're not testing
April 8, 2026
AGENTS.md is loaded into every turn. If you are not testing and monitoring changes to it, you are guessing at org scale.
Your AI coding benchmark is hiding a 2x quality gap
March 14, 2026
Three models, same pass rate. Under the hood, one matches the human patch 2x more often. Test pass rate is the gate, not the source of truth.