Opus 4.8 vs Opus 4.7 vs GPT-5.5 vs Composer 2.5 - 50 Real PRs in Go and Rust
Opus 4.8 is finally out - how good is it actually?
In this benchmark I compared Opus 4.8 against the rest of the frontier (GPT-5.5, Opus 4.7, Composer 2.5) on 50 real tasks from two open-source repos - graphql-go-tools and sqlparser-rs, Go and Rust respectively - representing complex backend software engineering work across a variety of tasks.
The important part is that these repos are arbitrary. I could have tested the models on my own repo, with my own tasks, to see how the frontier performs on domain-specific work. The goal here is to explore, with some granularity, how a benchmark like this is built and what it can actually tell you about model behavior.
The result
The king is back. On this n=50 slice, Opus 4.8 is the craft leader in both Go and Rust, and it dominates the two premium-reasoning arms - GPT-5.5 high and Opus 4.7 xhigh - on the cost-quality plane: equal-or-better craft while running cheaper and leaner. Its only loss is raw price - Composer 2.5 is ~6.5× cheaper on Rust and ~7× on Go, but materially weaker on craft.
Against GPT-5.5 it's a clean win: better craft and leaner everywhere, cheaper on Rust and on par on Go. Against Opus 4.7 xhigh it matches or beats its own predecessor at a lower reasoning tier and adds a clean reliability win. Against Composer it's the quality win and the price loss.
The binary test gate is near-saturated and not the axis that separates these models (pooled 47/44/44/42 of 50; see "Why the test gate isn't the answer" below). The separation lives in the craft band above the gate.
How strong is each claim? The craft win over Composer is decision-grade in both repos; over GPT-5.5 it's decision-grade on Rust but only directional on Go; and the exact ordering among the "premium" models is directional (n=25, one grader pass). "Decision-grade" vs "directional" is defined under "How to read the numbers" below.
Every frontier model clears the test gate, so tests can't separate them. The separation is in the craft band above the gate, where Opus 4.8 leads in both Go and Rust while running cheaper than both premium arms.
Figure: cost vs custom-score frontier (custom score on y, $/task on a log-scale x).
Custom score = 5% tests + 30% equivalence + 25% code review + 25% craft + 15% footprint, scaled 0-100. Opus 4.8 sits up-and-left of both premium arms - higher score at lower cost - while Composer anchors the cheap, lower-score corner. The composite is a directional read; the calibrated per-grader claims are in the calibration table below.
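To make the weighting concrete, here is a minimal Python sketch of a composite like this one. The weights are the ones stated above; the function name, the assumption that every component has already been normalized to 0-1 with higher meaning better (so footprint risk would need inverting first), and the sample inputs are all hypothetical, not Stet's actual implementation.

```python
# Weights taken from the formula in the post; they sum to 1.0.
WEIGHTS = {
    "tests": 0.05,
    "equivalence": 0.30,
    "code_review": 0.25,
    "craft": 0.25,
    "footprint": 0.15,
}


def custom_score(components: dict[str, float]) -> float:
    """Weighted composite on a 0-100 scale.

    Assumption: each component is already normalized to 0-1, higher = better.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {components[name]}")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical inputs, for illustration only (not measured values).
example = {
    "tests": 1.0,
    "equivalence": 0.5,
    "code_review": 0.8,
    "craft": 0.7,
    "footprint": 0.6,
}
print(round(custom_score(example), 2))
```

Because tests carry only 5% of the weight, two arms that both pass the gate can still land far apart on the composite, which is the point of scoring above the gate.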
Figure: gate and craft headroom (exact per-repo values; tests are the floor).
Figure: behavioral fingerprint - four shapes the gate collapses into one column. Panels: Opus 4.8 (the disciplined frontier), GPT-5.5 (the gate-passer), Composer 2.5 (the sprinter), Opus 4.7 xhigh (the over-thinker).
Axes are normalized 0-1 across these four arms (relative, not absolute) and repo-balanced, so the smallest lobe means worst-of-four, not zero. The gate sees one near-flat column; graded measurement sees four distinct shapes, which is what you actually choose on.
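A small Python sketch of what a relative 0-1 axis implies, using the craft means quoted in this post. The exact normalization is not specified here, so min-max scaling per repo followed by a plain average across repos is my assumption for "normalized" and "repo-balanced"; the function and variable names are hypothetical.

```python
def min_max(values: dict[str, float]) -> dict[str, float]:
    """Scale values so the worst arm maps to 0.0 and the best to 1.0."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All arms identical: no spread to show, so place them mid-axis.
        return {arm: 0.5 for arm in values}
    return {arm: (v - lo) / (hi - lo) for arm, v in values.items()}


# Craft means (0-4 scale) as quoted in the post's pairwise comparisons.
craft = {
    "rust": {"Opus 4.8": 3.28, "GPT-5.5": 2.94, "Opus 4.7 xhigh": 2.98, "Composer 2.5": 2.84},
    "go": {"Opus 4.8": 2.90, "GPT-5.5": 2.72, "Opus 4.7 xhigh": 2.63, "Composer 2.5": 2.48},
}

per_repo = {repo: min_max(scores) for repo, scores in craft.items()}

# Assumed "repo-balanced": each repo counts equally, regardless of task count.
balanced = {
    arm: sum(per_repo[repo][arm] for repo in per_repo) / len(per_repo)
    for arm in craft["rust"]
}

for arm, value in sorted(balanced.items(), key=lambda kv: -kv[1]):
    print(f"{arm:16s} {value:.2f}")
```

Under this scaling Composer 2.5 lands at 0.0 on the craft axis even though its raw craft means are 2.84 and 2.48, which is why a small lobe should be read as last-of-four rather than as a zero score.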
Table: claim calibration - decision-grade, directional, or no survivor.
The grey rows are the honesty mechanism: they stay visible, but they do not get promoted into clean winner claims.
Figure: cost-ratio evidence (ratio = Opus 4.8 / baseline; lower is leaner).
n=50, two repos, single seed, GPT-5.4 judge, per-repo replication.
Headline craft claims rest on BH-FDR calibration; the frontier custom score is a directional composite, not a calibrated ranking.
GPT-5.5's Go cost was re-priced to correct a cache artifact; the corrected ratio is 0.83×, inside the noise band.
Composer Rust test validity is owner-attested via waiver, not hermetic; craft and cost axes are gate-untouched.
Definitions
- Equivalence - same behavioral change as the human patch?
- Code review - would a reviewer accept it?
- Footprint risk - extra code touched vs the human patch.
- Craft/discipline - 8 graders: clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, diff minimality.
Why the test gate isn't the answer
Most public benchmarks answer a binary question: did the model satisfy the grading condition the task author set? That's useful for measuring model intelligence, but it's notably different from how real engineers use these models.
As a SWE in an enterprise codebase, I don't just care whether Opus 4.8 passes the tests. I want it to write idiomatic, maintainable code that doesn't introduce subtle bugs - high-quality diffs that my teammates would approve and merge. And the real decision in front of a team - "should I move from Opus 4.7 to 4.8 / from Claude to GPT-5.5 / try Composer to cut cost?" - is almost impossible to answer from public data alone. You need hands-on experience on your own code, or local benchmark data, to see how the models behave in reality.
On the binary test gate, the four models are nearly indistinguishable:
Binary pass-rate (tests pass or pass_with_warn, /25):
Pooled, that's Opus 4.8 47/50 · GPT-5.5 44/50 · Composer 44/50 · Opus 4.7 42/50 - a near-flat column. That flatness shows tests alone can't separate the field, so craft and cost have to carry the story.
Equivalence - did the patch make the same behavioral change the human's did, not just pass the tests - pulls the field apart where the gate can't:
Behavioral equivalence rate (judge-rated, out of 1.0):
On Rust the gate is saturated - three of the four arms sit at 24-25/25 - yet equivalence still spreads 20 points, from Opus 4.8's 0.92 down to Opus 4.7's 0.72. A passing test only tells you the patch works; equivalence asks whether it made the same change the human did, and that second question separates arms the first one can't. (Go equivalence is uniformly low and noisier - GPT-5.5 actually edges Opus there, 0.44 vs 0.40.)
How each task is graded
Each task is a real merged PR or commit from the source repo. The agent is dropped into a Docker container with a frozen repo snapshot, a prompt describing the task, and one attempt. Stet (the local eval tool I built; see the disclosure at the end) then applies the patch and runs the task's tests in an isolated container.
The result is then graded beyond test pass/fail on the metrics listed under Definitions: equivalence, code review, footprint risk, and craft/discipline.
One run per task, single seed. The judge is GPT-5.4, blinded to which model produced the patch, with manual spot-checks. There's no human calibration pass, so trust the direction of deltas over absolute scores.
The arms: Opus 4.8 (high, Claude Code) · Opus 4.7 (xhigh, Claude Code) · GPT-5.5 (high, Codex) · Composer 2.5 (Cursor). Dataset: 25 matched Go tasks (graphql-go-tools) + 25 matched Rust tasks (sqlparser-rs).
One integrity note. This corpus isn't network-sandboxed, so I audited for contamination. One Composer Rust result turned out to be a gold-leak - the agent fetched the merged PR - which I caught and swapped for a clean rerun; removing it only widened Opus's lead. A broader set of tasks, Composer and Opus alike, touched the network in ways I judged benign and kept.
As an aside, these evals double as an "autoresearch" optimization loop, not just a benchmark. I tell my agent something like "make AGENTS.md better for this repo"; it proposes an edit, runs Stet on historical tasks, figures out where the candidate did better or worse and why, and iterates to improve the numbers.
How to read the numbers
With n=25 per repo, no single grader is conclusive - the smallest craft gap one grader can reliably catch (~0.34–0.49 on the 0–4 scale) is bigger than most of the real gaps here. So the signal isn't any one grader, but agreement across graders. Think coin flips: one landing heads tells you nothing, but flip ten and get all heads and something's up. When 8–11 independent graders all lean the same way, a sign test on that consensus is significant even when no single grader is.
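The coin-flip intuition can be checked with an exact two-sided sign test. This is a minimal Python sketch, not the analysis code behind the post; the function name is hypothetical, and it takes the graders to be independent as the paragraph above does. Since all graders are scored by one judge model, their votes may be correlated, in which case these p-values are optimistic.

```python
from math import comb


def sign_test_p(n_agree: int, n_total: int) -> float:
    """Exact two-sided sign test.

    H0: each grader is equally likely to lean toward either arm (p = 0.5).
    """
    if not 0 <= n_agree <= n_total or n_total == 0:
        raise ValueError("need 0 <= n_agree <= n_total and n_total > 0")
    k = max(n_agree, n_total - n_agree)  # size of the majority side
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)


for agree, total in [(1, 1), (7, 8), (8, 8), (11, 11)]:
    print(f"{agree}/{total} graders agree: p = {sign_test_p(agree, total):.4f}")
```

One flip gives p = 1.0 (no information), 8 of 8 gives p = 2/256, and 11 of 11 gives p = 2/2048, which is how a unanimous lean can be significant when no single grader is.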
Concretely: in most pairings, 6–8 of the 8 craft graders sit below their own detection floor, and no single craft grader is individually decision-grade except on the largest gap (Opus vs Composer). The signal is in cross-grader sign consistency. So I tag a result decision-grade (DG) when it survives multiplicity correction (Benjamini-Hochberg FDR at q=0.05), and directional when it's consistent but doesn't clear that bar.
Here's how many of the 11 craft/review graders survive that correction for each pairing:
BH-FDR survivors:
0 of 11 graders survive for Opus 4.8 vs GPT-5.5 on Go. That's why I keep the Go craft edge labeled directional, not decision-grade.
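For readers who have not used it, the Benjamini-Hochberg step-up procedure behind the "survivor" counts can be sketched in a few lines of Python. This is a generic textbook version, not Stet's code; the function name and the eleven p-values are hypothetical and chosen only to show the mechanics.

```python
def bh_survivors(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up: True where a hypothesis survives at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under its own threshold.
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            survives[idx] = True
    return survives


# Hypothetical per-grader p-values for one pairing (11 graders).
p_vals = [0.001, 0.004, 0.012, 0.03, 0.04, 0.2, 0.3, 0.45, 0.6, 0.8, 0.9]
flags = bh_survivors(p_vals, q=0.05)
print(f"{sum(flags)} of {len(flags)} graders survive")
```

Note that graders with p = 0.03 and p = 0.04 would pass an uncorrected 0.05 cut but do not survive here, because their rank-scaled thresholds are stricter. That is the gap between "directional" and "decision-grade" as used in this post.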
Opus 4.8 vs GPT-5.5 high - premium, no trade-off
Against GPT-5.5, Opus 4.8 is the rare upgrade with no trade-off: better craft, leaner everywhere, and cheaper on Rust (Go cost lands ~par).
- Better code in both repos. Craft-mean leads on Rust (3.28 vs 2.94, decision-grade - 4 graders survive: coherence, diff-minimality, intentionality, simplicity) and on Go (2.90 vs 2.72, directional - 0 survive at q=0.05).
- Leaner everywhere, cheaper on Rust. Tokens are decision-grade wins in both repos (Rust 0.71×, Go 0.60×), with far less tool churn (Rust 65 tools/27 shell vs GPT's 88/59; Go 30/15 vs 91/64). On cost, Opus is decision-grade cheaper on Rust (0.81×); on Go the two land roughly par (0.83×, noise-band). Opus is modestly slower on the wall clock (1.17× Rust / 1.04× Go).
- Smaller blast radius, equivalence splits. Footprint risk is lower both ways (Go 0.224 vs 0.264, Rust 0.236 vs 0.291, directional). Equivalence splits: Opus wins Rust (0.92 vs 0.88) but GPT edges Go (0.44 vs 0.40, both low).
The trace - sqlparser-rs #1414. GPT bolted on a parallel-option enum, a public-API field-type change, and unrelated rustfmt churn across 60–96 shell commands - and still missed Azure SQL DW's CLUSTERED COLUMNSTORE INDEX ORDER. Opus made the targeted change. Code-review 100 vs 63.75. More grinding ≠ more complete.
GPT's win - graphql-go-tools #1128. GPT found a seam Opus missed: emit a StaticString in the response visitor and rewrite the goldens to prove no backend fetch. That made it equivalent where Opus was non-equivalent (code-review 88.75 vs 41.25). It cost ~2.6× more ($7.27 vs $2.75). This is part of why I keep Go directional.
One caveat. The two biggest Go "Opus > GPT" tasks are GPT completion failures, not craft losses: on #1308 GPT forgot to regenerate the federation golden (the suite fails), and on #1076 it edited the wrong Go module (root pkg/ instead of v2/, correctness 0) - and Opus's own #1076 fix was the messiest patch in the panel (simplicity 1.8). Another reason Go-vs-GPT stays directional.
Opus 4.8 vs Opus 4.7 xhigh - ≥ at a lower tier, plus a reliability win
This is the cleanest story: Opus 4.8 matches or beats its predecessor while running at a lower reasoning tier (4.8 high vs 4.7 xhigh).
- Equal craft in Rust, ahead in Go. Rust is a genuine tie (craft 3.28 vs 2.98, but 0 graders survive). Go is a real edge (2.90 vs 2.63, 2 survive: code-review-overall and simplicity). Honest note: 4.7 still tops the Rust code-review column (3.44 vs 3.32, a ~0.12 near-tie).
- Cheaper where it's measurable, a wash where it isn't. Go runs 0.66× cost / 0.50× tokens / 0.80× duration (decision-grade on all three). Rust is a statistical wash. Equivalence favors 4.8: Rust 0.92 vs 0.72, Go 0.40 vs 0.28.
- The reliability win: 4.8 just does the work. Opus 4.7 xhigh shipped 0-byte patches on 4 Rust tasks - it asked permission instead of implementing - so 4/25 vs 4.8's 0/25. I re-ran all four to rule out a transient fluke - 4.7 only shipped a patch once explicitly re-prompted to implement, so the raw first attempt is what is scored here.
The trace - "asks permission instead of doing the work." On #1538, 4.7's entire turn was "What would you like me to do with this PR?…" followed by end_turn (six steps, zero tool calls). On #1398 it investigated properly (24 steps), correctly diagnosed the exact fix - a new Dialect::require_interval_qualifier, overridden true for MySQL/ANSI/BigQuery - then asked "Want me to implement that, or just sketch the diff?" and ended at 0 bytes. Opus 4.8 read the identical prompt as a work order and shipped (both tasks pass).
4.7's win. 4.7 tops the Rust code-review column (3.44). But more reasoning didn't buy more restraint: on Go #859/#1230 it spent ~1.4–2.2× the output tokens (53k vs 24k on #1230) for the less disciplined patch - a hand-rolled FederationMetaData index layer where a smaller change sufficed (diff-minimality 1.4 vs 3.7 on #859), and #1230 came back non-equivalent where 4.8 matched the gold.
Opus 4.8 vs Composer 2.5 - the budget arm (quality win, price loss)
Composer is the cheap seat, and on quality it shows - but the price gap is real, and it's Composer's biggest win.
- Cleaner code in both languages, and not close enough to be luck. Craft-mean Rust 3.28 vs 2.84, Go 2.90 vs 2.48 - the strongest result in the post: BH-FDR survivors 10/11 in Go, 7/11 in Rust (Go simplicity dz +1.00, scope-discipline +0.93, instruction-adherence +0.77; Rust diff-minimality +0.91, intentionality +0.65). Opus also leads on equivalence and code-review.
- The catch is cost. Composer runs ~6.5× cheaper on Rust (geo-mean 6.47×) and ~7× cheaper on Go (geo-mean 7.15×) - cheaper on every one of the 25 Go tasks ($17.71 total vs Opus's $110.27). It's also the fastest arm in both repos (~254 s median on Rust vs Opus's ~489 s; ~433 s on Go, recovered from the raw per-task logs).
The trace - sqlparser-rs #1580. The task was a surgical AST edit. Composer checked a 21 MB compiled binary (rust_out) into the repo root, ballooning the patch to ~6.85 MB and tripping a "patch too large" guardrail - then widened the public Derived AST API beyond scope on top of it. Opus made the one-spot edit and stopped. Grader deltas (Opus → Composer): diff-minimality 2.4 → 0.6, intentionality 4.0 → 0.4, scope 2.6 → 1.2, code-review 93.75 (pass) → 73.75 (fail), both passing tests. Discipline is knowing which lines not to write.
Composer's win. Fastest and cheapest, and it ties the field at the gate - the natural fit for an "Opus plans, Composer executes" split. One honesty note: don't read "ties on tests" too broadly - Composer's Go tests failed on #1260 and #1380, so its honest Go figure is 19/25.
Cost & runtime
Put the cost ratios in one place. Each is a geo-mean of Opus 4.8 ÷ baseline, so a number below 1 means Opus is leaner.
Cost / tokens / duration ratios (below 1 = Opus leaner):
The frontier shape this draws: Composer sits in the cheap, low-craft corner; GPT-5.5 and Opus 4.7 xhigh are the high-cost premium arms; and Opus 4.8 is high-craft at lower cost than both premium arms - decision-grade cheaper than GPT-5.5 on Rust and on par on Go, cheaper than 4.7 in Go and a wash on Rust.
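The geo-mean ratios in this section can be reproduced from per-task numbers with a short Python sketch. The function name and the three-task cost lists are hypothetical; only the definition (geometric mean of per-task Opus 4.8 ÷ baseline) comes from the text above.

```python
import math


def geo_mean_ratio(opus: list[float], baseline: list[float]) -> float:
    """Geometric mean of per-task ratios opus[i] / baseline[i]."""
    if len(opus) != len(baseline) or not opus:
        raise ValueError("need two non-empty lists of equal length")
    if any(v <= 0 for v in opus + baseline):
        raise ValueError("costs must be positive")
    log_ratios = [math.log(o / b) for o, b in zip(opus, baseline)]
    return math.exp(sum(log_ratios) / len(log_ratios))


# Hypothetical per-task costs in dollars: ratios are 0.5, 1.0 and 2.0.
opus_costs = [2.0, 4.0, 1.0]
baseline_costs = [4.0, 4.0, 0.5]

ratio = geo_mean_ratio(opus_costs, baseline_costs)
arith = sum(o / b for o, b in zip(opus_costs, baseline_costs)) / len(opus_costs)
print(f"geo-mean ratio: {ratio:.2f}, arithmetic mean of ratios: {arith:.2f}")
```

The geometric mean treats "half the cost" and "double the cost" as cancelling out, where an arithmetic mean of ratios would tilt toward the expensive task. It also weights every task equally, which is why the ratio of Go totals against Composer ($110.27 / $17.71, about 6.2×) does not have to match the 7.15× geo-mean.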
Four models, four fingerprints
- Opus 4.8 - the disciplined frontier. Writes the smallest patch that does the job, then stops. Top craft mean and lowest footprint in both repos, and its traces read as intentional (the one-spot edit on #1580, the gold-matching shape on #1230). The price of that restraint is wall-clock: it runs slower than GPT-5.5 (1.17× Rust / 1.04× Go) and takes nearly twice Composer's median time on Rust (~489 s vs ~254 s), though it is faster than 4.7 xhigh on Go (0.80×).
- Opus 4.7 xhigh - the over-thinker. The largest reasoning budget of any arm, spent ranging wider rather than landing cleaner: more tokens for the less disciplined patch, a hand-rolled layer where a smaller change sufficed. Its signature failure is procedural - diagnose the fix, then ask permission instead of writing it.
- GPT-5.5 - the gate-passer. The surest arm at turning tests green on Rust (25/25), and it gets there by grinding, then over-models once it's green (a public-API change plus rustfmt churn on #1414).
- Composer 2.5 - the sprinter. Fastest and cheapest by a wide margin, and it ties the gate. The catch is everything above it: it can't stop adding code - a checked-in binary, a widened API - which is why it places last on craft in both languages. The executor in "Opus plans, Composer executes," not the author of record.
A pass-rate leaderboard collapses these four shapes into one flat column. The fingerprint is what you actually choose on - and only graded measurement produces it.
Vibe check
Numbers are only part of the story - model feel gives signal too. For background, I use GPT-5.5 and Opus 4.7 almost every day, for work and side projects.
After a weekend on Opus 4.8, the launch post's "modest but tangible improvement" is the phrasing I'd reach for too. I simply trust Opus 4.8 to do the right thing more. It feels more aligned with my intent, and more willing to question its own output - and I'm more willing to let it think longer without it getting lost (a prior report of mine found 4.7 prone to overthinking).
On the flip side, I've watched it get entangled in its own thoughts: it'll go down a rabbit hole, then announce that the prior 30 minutes of work were wrong. At least it knows it's wrong now...
Compared to GPT-5.5, Opus feels like it has more breadth - I'm more willing to use it to generate new ideas - but it still lacks the discipline GPT-5.5 shows.
What replicates, what the data can't call
Replicates / solid (BH q=0.05):
- Opus 4.8 > Composer on craft - decision-grade in both repos (10/11 Go, 7/11 Rust). The strongest result here.
- Opus 4.8 > GPT-5.5 on craft - decision-grade on Rust, directional on Go; leaner in both (DG tokens); cheaper on Rust (DG) and ~par on Go.
- Opus 4.8 ≥ Opus 4.7 - even on Rust, ahead on Go, at a lower reasoning tier, plus the reliability win (4/25 → 0).
- The binary gate cannot separate the field - pooled 47/44/44/42 of 50.
Directional / confounded:
- The exact ordering among the three premium models is not decision-grade.
A useful counterexample here: Opus 4.8 was the worst of the four on Rust #1472 (craft 2.34 vs 3.68–3.79, failed code-review, non-equivalent). It turned on both GenericDialect flags, contradicting its own doc-comment, while 4.7 did less and added the boundary test. Even the winner has off days.
Prior art
The strongest new private benchmarks share this one's real-work substrate, and they're worth looking at for comparison.
DeepSWE (Datacurve) is the closest cousin - same real-repo, multi-language idea, but it's still binary. 113 original tasks across 91 open-source repos in TS/Go/Python/JS/Rust. It ranks GPT-5.5 xhigh > Opus 4.8, reversing my finding.
CursorBench (Cursor) also claims a quality axis - but it's vendor-internal and correctness-led. It scores solution correctness, code quality, efficiency, and interaction behavior on tasks mined from real Cursor sessions. It ranks Opus 4.7 > GPT-5.5 > Opus 4.8 > Composer 2.5, all within ~1% of each other.
Differences like these come down to methodology, the models measured (both run the highest reasoning effort, where I run high / xhigh as labeled), grading, task mix, and n - they're measuring different things (gate intelligence vs craft above it).
Conclusion + recommendation
On this n=50 slice, Opus 4.8 high is a clear winner over Opus 4.7 xhigh - scoring better while costing less. It also, surprisingly, outperforms GPT-5.5 high, against my own prior assumptions and community sentiment. That could be a bad day for Codex (OpenAI is reportedly preparing GPT-5.5 Codex Spark and/or a 5.6), a blip, or genuine dominance by Opus - this one slice can't tell you which.
Composer wins when raw per-task price dominates and a measurable code-quality gap is acceptable - which may fit nicely into an "Opus plans, Composer executes" workflow.
Where output quality matters, Opus 4.8 (high) is the default to beat over both GPT-5.5 and Opus 4.7 xhigh: craft leader and leaner against both, replicated in Go and Rust, and cheaper too (vs 4.7 in Go; decision-grade cheaper than GPT-5.5 on Rust and ~par on Go). For me, that means integrating Opus 4.8 as a thought partner and trusted implementer - a welcome change after 4.7's recent underperformance. Welcome back to the team, Claude.
But your results may vary - which is exactly why teams should measure their own harness, on their own tasks, rather than copying global benchmark defaults.
FAQ
Is Opus 4.8 better than GPT-5.5 for coding? On this 50-task slice, yes on graded craft (decision-grade on Rust, directional on Go) and leaner in both (0.60–0.71× the tokens). On cost it's decision-grade cheaper on Rust (0.81×) and about par on Go (0.83×). At the binary test gate the picture is mixed: GPT-5.5 is perfect on Rust (25/25), but Opus 4.8 leads pooled (47 vs 44 of 50).
Is Opus 4.8 cheaper than Opus 4.7? Yes where it's measurable: on Go, 0.66× the cost and 0.50× the tokens (decision-grade). On Rust it's a wash.
Should I switch my team from Opus 4.7 to 4.8? Directionally yes: equal-or-better craft at a lower reasoning tier, plus a reliability win - 4.7 shipped 0-byte patches on 4 of 25 Rust tasks, 4.8 on 0 of 25.
Is Composer 2.5 good enough to replace Opus? Only if price dominates. It's ~6.5–7× cheaper, but with a decision-grade craft gap in both languages.
Does this mean Opus 4.8 is the best coding model? It's the craft leader on this two-repo slice by Stet's graders. Measure your own harness before you change a team default.
Disclosure + over to you
Disclosure: I am building Stet.sh, the local eval tool I used to run this. The product version is that you can ask your coding agent to improve its own setup - for example, make AGENTS.md better, or reduce token usage - and it uses Stet to test candidate changes against your repo's historical tasks. If your team is already using coding agents heavily and has a concrete decision in front of it - high vs xhigh, Codex vs Claude Code, an AGENTS.md update, or which tasks are safe to delegate - I am looking for a few teams to run repo-specific trials with. Stet runs entirely locally, using your own LLM subscriptions. See stet.sh/private, or reach out to me directly.
Two questions to close on: did GPT-5.5 just have a bad run here, or is Opus 4.8 genuinely ahead? And have you moved a team default from 4.7 to 4.8 (or to or from GPT-5.5) on evidence rather than vibes - and how did you measure it? Questions to think about as we evaluate whether 4.8 is worth the upgrade.
Related reading: the GPT-5.5 reasoning-curve report.