STET

Analysis and notes from the Stet runs

Methodology notes, benchmark evidence, and model-comparison writeups from real AI coding-agent evals.

Topic

Model comparisons

Real-repo coding-agent comparisons across GPT-5.5, GPT-5.4, Claude Opus 4.7, and Opus 4.6, scored by tests, equivalence, code review, footprint, time, and cost.

Opus 4.8 vs Opus 4.7 vs GPT-5.5 vs Composer 2.5 - 50 Real PRs in Go and Rust

June 2, 2026

I graded four frontier coding models on 50 real merged PRs in Go and Rust, scoring not just whether tests pass but also craft, equivalence, and cost. Opus 4.8 led on craft in both languages.

I had Codex iterate on its own AGENTS.md 8 times and measured each version against real PRs. The best one still regressed on a clean holdout.

May 27, 2026

Codex optimized its own AGENTS.md against real Stet repo tasks. The best candidate improved the training slice, then regressed enough on a clean holdout that it was not safe to ship.

GPT-5.5 High Regression Check on GraphQL-go-tools

May 18, 2026

A fresh GPT-5.5 Codex high rerun on 21 clean GraphQL-go-tools tasks, compared with the May 5 GPT-5.5 high run. The rerun was directionally worse on tests, equivalence, and review pass count, but the evidence is mixed and does not show a broad quality collapse.

Opus 4.7 low vs medium vs high vs xhigh vs max: the reasoning curve on 29 real tasks from an open source repo

May 12, 2026

Claude Opus 4.7 reasoning-effort curve on 29 matched GraphQL-go-tools tasks: low, medium, high, xhigh, and max. Medium wins the behavioral metrics; more reasoning does not reliably buy better patches.

GPT-5.5 low vs medium vs high vs xhigh: the reasoning curve on 26 real tasks from an open source repo

May 7, 2026

An interactive GPT-5.5 Codex reasoning-effort curve on 26 matched GraphQL-go-tools tasks: low, medium, high, and xhigh.

GPT-5.5 vs GPT-5.4 vs Opus 4.7 on 56 real coding tasks from 2 open source repos

May 1, 2026

Opus 4.7 vs GPT-5.5 vs GPT-5.4 on 56 real coding tasks across two open-source repos. Opus writes smaller patches; GPT-5.5 writes patches that more often survive review.

Opus 4.7 vs Old Opus 4.6 vs New Opus 4.6

April 17, 2026

Three Opus snapshots, same 12/28 test pass rate. Above the test-pass gate, 4.7 is directionally better: more disciplined, not fundamentally smarter.

Your AGENTS.md is the highest-leverage code you're not testing

April 8, 2026

AGENTS.md is loaded into every turn. If you are not testing and monitoring changes to it, you are guessing at org scale.

Your AI coding benchmark is hiding a 2x quality gap

March 14, 2026

Three models, same pass rate. Under the hood, one matches the human patch 2x more often. Test pass rate is the gate, not the source of truth.