Analysis and notes from the Stet runs

Methodology notes, benchmark evidence, and model-comparison writeups from real AI coding-agent evals.

Topic

Model comparisons

Real-repo coding-agent comparisons across GPT-5.5, GPT-5.4, Claude Opus 4.7, and Opus 4.6, scored by tests, equivalence, code review, footprint, time, and cost.

GPT-5.4 vs Opus 4.7

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo

GPT-5.5 low vs medium vs high vs xhigh: the reasoning curve on 26 real tasks from an open source repo

GPT-5.5 vs GPT-5.4 vs Opus 4.7 on 56 real coding tasks from 2 open source repos

Opus 4.7 vs Old Opus 4.6 vs New Opus 4.6

I Tested Six Ways to Save $$$ on 5.6 Sol. The only one that worked was using Terra

July 20, 2026

Six interventions meant to save tokens gave different answers depending on how you count. None cut the ten-task batch total twice. Caveman and the cheaper Terra model looked mildly cheaper per task, but within run-to-run noise.

Sonnet 5 vs Opus 4.8: how they behave and when to use each

July 8, 2026

On 24 real tasks, Sonnet scaled effort into more checking while Opus stayed flatter through high. The graders leaned Sonnet on clarity and Opus on diff minimality. Here is when I would use each.

GLM 5.2 on 50 real Go and Rust PRs: last on quality, and not the cheapest

June 19, 2026

GLM 5.2 vs Composer 2.5 and the premium field on 50 real merged PRs from graphql-go-tools (Go) and sqlparser-rs (Rust). GLM lands last on craft and equivalence in both repos, costs about twice as much as Composer, and writes more code than the human author while still missing the change. A routing guide for the new cheap model.

Composer 2.5 was 6.5x cheaper on 50 real Rust and Go PRs. Here's where it's actually safe.

June 18, 2026

Composer 2.5 vs Opus 4.8, GPT-5.5 and Opus 4.7 on 50 real merged PRs from sqlparser-rs (Rust) and graphql-go-tools (Go). Composer is 6.5-7x cheaper and ties the test gate, but finishes last on craft in both repos. A routing guide for where it's safe.

When Fable 5 Is Worth the Premium

June 14, 2026

Fable 5 vs Opus 4.8, GPT-5.5 and two more on 30 real GraphQL Go Tools and SQLParser Rust tasks. The useful read is taskwise W/D/L: which model won which metric, on which task, and why.

Opus 4.8 vs Opus 4.7 vs GPT-5.5 vs Composer 2.5 - 50 Real PRs in Go and Rust

June 2, 2026

I graded four frontier coding models on 50 real merged PRs in Go and Rust - not just whether tests pass, but craft, equivalence, and cost. Opus 4.8 led on craft in both.

I had Codex iterate on its own AGENTS.md 8 times and measured each version against real PRs. The best one still regressed on a clean holdout.

May 27, 2026

Codex optimized its own AGENTS.md against real Stet repo tasks. The best candidate improved the training slice, then regressed enough on a clean holdout that it was not safe to ship.

GPT-5.5 High Regression Check on GraphQL-go-tools

May 18, 2026

A fresh GPT-5.5 Codex high rerun on 21 clean GraphQL-go-tools tasks compared with the May 5 GPT-5.5 high run. The rerun was directionally worse on tests, equivalence, and review pass count, but the evidence is mixed and does not show a broad quality collapse.

Your AGENTS.md is the highest-leverage code you're not testing

Your AI coding benchmark is hiding a 2x quality gap