Opus 4.7 vs Old Opus 4.6 vs New Opus 4.6
Everyone says Opus 4.6 was getting dumber. Then Opus 4.7 shipped mid-test, so I answered both questions end-to-end: does a fresh Opus 4.6 run still match the March-19 Opus 4.6, and is 4.7 actually better?
Three Opus snapshots, 28 historical Zod tasks, and an identical 12/28 test pass rate across all three arms. On raw pass rate the upgrade looks flat. Above the test gate, though, the arms diverge enough that the useful mental model is that Opus 4.7 is directionally better, not categorically better.
Opus 4.7 appears to be a more disciplined coder, not a fundamentally smarter one.
On cost, tokens, and wall-clock time: 4.7 is cheaper per task than March 4.6 ($8.11 vs $8.93), uses fewer total tokens (44.0M vs 49.1M), and finishes the full 28-task run faster (1h 30m vs 1h 36m). Fresh 4.6 is the cheapest arm, but it takes 2.3x longer to produce looser, less equivalent patches.
I'm building Stet, which scored these runs on equivalence, footprint, craft, and discipline beyond pass/fail. Zod was chosen as a specific, concrete repo rather than a high-level benchmark — I've seen similar shapes on internal repos.
Method
For each task: sample a merged commit from Zod as the baseline, run each Opus snapshot in Claude Code to reproduce the same changes, then score each patch alongside test pass rate on:
- Equivalence — does the patch solve the intended problem, regardless of whether tests catch it?
- Code-review pass — binary: does the patch look merge-worthy?
- Footprint risk — how divergent is the patch from the accepted change? Lower is better.
- Craft (0–4) — simplicity, coherence, intentionality, robustness, clarity.
- Discipline (0–4) — instruction adherence, scope discipline, diff minimality.
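As a minimal sketch of what one scored patch and an arm-level summary might look like (the type and function names here are hypothetical, not Stet's actual schema or API):

```typescript
// Hypothetical shape of one scored patch; not Stet's actual schema.
type RiskBucket = "low" | "medium" | "high";

interface PatchScore {
  task: string;          // baseline commit the patch reproduces
  arm: string;           // model snapshot; hidden from the judge
  testsPassed: boolean;  // binary test gate
  equivalent: boolean;   // solves the intended problem?
  reviewPass: boolean;   // merge-worthy at a glance?
  footprint: RiskBucket; // divergence from the accepted change
  craft: number;         // 0-4 rubric mean
  discipline: number;    // 0-4 rubric mean
}

// Aggregate one arm's run into headline numbers.
function summarize(scores: PatchScore[]) {
  const n = scores.length;
  const count = (p: (s: PatchScore) => boolean) => scores.filter(p).length;
  return {
    passRate: count(s => s.testsPassed) / n,
    equivalenceRate: count(s => s.equivalent) / n,
    highFootprint: count(s => s.footprint === "high"),
    meanCraft: scores.reduce((a, s) => a + s.craft, 0) / n,
  };
}
```

The point of keeping the rubric scores alongside the binary test gate is visible in the shape: two arms can tie on `passRate` while diverging on everything else.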
Grading notes: the judge is gpt-5.4, run with identical rubric versions across all three arms. Each patch is scored independently — the judge sees the patch and task, not the arm label or model name. No dual-rater calibration, so treat absolute scores as directional; the cross-arm deltas are the thing to trust.
Headline
All three arms pass identical tests. The one dimension where 4.7 doesn't lead is the binary code-review bar, where the March 19 run cleared it more often (11 vs 7); fresh 4.6 is modestly cheaper per task.
A lot of people say 4.7 is more expensive. On this slice it isn't: $8.11/task vs $8.93 for March 4.6, and 44.0M vs 49.1M total tokens. Fresh 4.6 is the cheapest arm ($6.65, 35.6M tokens) but takes 2.3x longer to produce looser, less equivalent patches — the savings buy you worse output.
Everywhere else — equivalence, footprint risk, maintainability on shippable-looking patches, mean task time — 4.7 is the strongest of the three.
New Opus 4.6 is the weakest arm: lower equivalence, higher footprint risk, longer time per task. It used ~28% fewer input tokens than the March run despite taking 2.3x longer. Whatever changed under the hood, the output is looser patches produced with less thinking.
Footprint risk is the clearest signal
Footprint risk asks whether the patch is larger or more divergent than the accepted change. It's the delta I'd trust most — a more than 2x relative drop on 4.7, measured on a more continuous scale than the rubric scores.
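One way to make that concrete (a sketch, not Stet's actual formula): compare the changed-line count of the model's patch against the accepted change and bucket the ratio. The thresholds below are illustrative.

```typescript
// Hypothetical footprint metric: ratio of lines the model's patch touches
// to lines the accepted human change touched. Thresholds are illustrative.
function footprintBucket(
  patchLines: number,
  acceptedLines: number
): "low" | "medium" | "high" {
  const ratio = patchLines / Math.max(acceptedLines, 1);
  if (ratio <= 1.5) return "low";  // roughly the same footprint
  if (ratio <= 3) return "medium"; // noticeably larger
  return "high";                   // touched far more code than needed
}
```

A ratio-style metric is continuous underneath even when reported as buckets, which is why a cross-arm shift here is easier to trust than a shift in a 0–4 rubric score.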
Count of patches per risk bucket. Opus 4.7 is the only arm with zero high-footprint patches.
Opus 4.7 had no high-footprint patches. New Opus 4.6 more often made changes that touched more code than necessary.
Equivalence
Equivalence asks whether the patch solves the intended problem, not merely whether available tests catch it. 4.7's patches were more equivalent with the human-authored Zod changes, consistent with being more aligned to codebase standards and human intent.
Rubric scores
Two 0–4 rubrics: review (run only on patches that cleared the binary code-review bar) and discipline (run on every patch).
On review, the pattern isn't "4.7 is uniformly more correct." It's closer to: when 4.7 produces a shippable-looking patch, that patch tends to be cleaner and more maintainable — the maintainability delta (2.00 → 2.46 → 2.85) is the biggest gap in the review block.
On discipline, 4.7 leads on every dimension, which tracks the footprint-risk result: tighter, more on-task patches. Scope discipline (+0.31 to +0.45) and diff minimality (+0.32 to +0.37) are the biggest gaps over the 4.6 arms.
Craft means — simplicity, coherence, intentionality, robustness, clarity on 0–4 — sit within ~0.1 across arms (2.86 / 2.84 / 2.93), so treat them as consistent with noise at n=28. The one separator is intentionality, which climbs 2.98 → 3.27 → 3.58: 4.7's patches read as more purposeful.
Beyond the numbers, the grader narratives cluster differently by arm.
Shared weaknesses across all three. Silent fallback branches that hide the root cause instead of propagating a diagnostic — accepting unknown precisions as unrestricted, emitting empty anyOf for null-only tuples, printing raw English labels for unmapped types, returning the original object when a recursion cap is hit. Type-system escape hatches at the call site — as any, inline _zod intersections, whole-expression SafeParseResult casts — used in place of tightening the underlying boundary.
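As a hypothetical illustration of the silent-fallback pattern the graders flagged (not code from the actual patches): swallowing an unknown case versus surfacing it as a diagnostic.

```typescript
// Hypothetical illustration of the "silent fallback" pattern;
// not code from the actual patches.

// Fallback: an unknown precision quietly becomes "unrestricted" (null),
// hiding the root cause from the caller.
function precisionSilent(p: string): number | null {
  const table: Record<string, number> = { ms: 3, s: 0 };
  return table[p] ?? null; // unknown precision accepted silently
}

// Diagnostic: the unknown case propagates as an explicit error,
// so the bad input is caught where it originates.
function precisionStrict(p: string): number {
  const table: Record<string, number> = { ms: 3, s: 0 };
  const v = table[p];
  if (v === undefined) throw new Error(`unknown precision: ${p}`);
  return v;
}
```

The first version looks more robust in a diff; the second is what the graders wanted, because the failure stays visible.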
Old Opus 4.6. Distinctive flag: unearned plumbing. Fields and helpers added for a nearby idea but never consumed — ProcessParams.parent, Sizable.verb, an Identity type, a ~validate method with a single caller. Commented-out scratch code left behind in production files. On tasks with mirrored Deno and Node surfaces, some mirror cleanly while others leave deno/lib stale.
New Opus 4.6. Damaging flag: checked-in generated artifacts. Vendored node_modules/.pnpm trees, node_modules/.bin/attw, .pytest_cache, compiled .pyc files — on one task the patch balloons to 2.6 GB. Near-miss public strings: "draft-04" written as "draft-4", a version bump to 4.2.0 when a patch release was intended, a recheck dependency added without being asked. Duplicated lookup tables across parallel locale surfaces (Hebrew TypeLabels/parsedType/Origins/ContainerLabels; Spanish TypeNames vs parsedType).
Opus 4.7. Mirror image of 4.6's weaknesses. Patches stay tightly within one or two files directly implied by the task; unrelated refactors don't appear. Weakness is under-scoping: multi-site refactors get narrowed to a single illustrative spot (assertion removals touch four v3 sites when v3+v4 helpers were expected; OpenAPI-3.0 null fix handles the tuple branch and leaves primitive and union cases alone). Local escape hatches like Writeable casts replace making generic constraints readonly-aware. The agent reliably honors meta-instructions like "do not perform a code review" and keeps new API surface additive rather than replacing existing aliases.
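A hypothetical sketch of that escape-hatch pattern (not the actual Zod code): casting readonly away at the call site versus making the generic constraint readonly-aware so no cast is needed anywhere.

```typescript
// Hypothetical sketch of the Writeable-cast escape hatch; not actual Zod code.
type Writeable<T> = { -readonly [K in keyof T]: T[K] };

// Escape hatch: the constraint demands a mutable array, so callers
// holding a readonly tuple must cast at every call site.
function firstOfMutable<T extends string[]>(opts: T): T[number] {
  return opts[0];
}

// Readonly-aware: widen the constraint itself to accept readonly input.
function firstOf<T extends readonly string[]>(opts: T): T[number] {
  return opts[0];
}

const opts = ["a", "b"] as const;        // readonly ["a", "b"]
const viaCast = firstOfMutable(opts as Writeable<typeof opts>);
const direct = firstOf(opts);            // no cast needed
```

Both compile and return the same value; the difference is that the second fixes the boundary once instead of patching over it per call site, which is the distinction the graders kept drawing.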
Why patches fail
All three arms fail the same 16 of 28 tasks by the test-passed bar. The reasons cluster differently:
The pass rate is identical. The failure shape is not. Opus 4.7 never runs out the clock, and its patches drift from "wrong problem" toward "looks right, tests disagree."
Two things jump out. 4.7 never runs out the clock — it finishes every task. And the "equivalent patch, tests still fail" bucket nearly triples on 4.7 (3 → 8) while the "non-equivalent" bucket shrinks by roughly the same amount. 4.7's failures shift toward "looks right but tests disagree" — more patches an independent reviewer judges equivalent to the accepted change, fewer that miss the intent entirely.
Take this shift with a grain of salt. It could mean 4.7 genuinely writes cleaner patches that still miss a subtle obligation the test suite catches, or that the equivalence grader is more forgiving of tight-footprint patches than of sprawling ones. The under-reach pattern below is consistent with the first reading, but it's a signal worth auditing.
Where Opus 4.6 loses ground: breadth. Both 4.6 runs repeatedly miss the Deno mirror on tasks that need parallel Node and Deno updates, leave localization passes partial (Hebrew and Spanish messages retain old wording or untranslated labels), and miss requested API surfaces — a shared NEVER export, mini-schema support, whole families of assertion removals. The fresh rerun adds unforced errors: vendored node_modules trees committed, wrong published target strings, a version bump that doesn't match the intended release.
Where Opus 4.7 loses ground: under-reach. When 4.7 misses, it stops at a narrow local fix — updating only ZodMiniType.check when the task asked for four related inference changes, applying a tuple-local OpenAPI workaround while leaving union-with-null semantics alone, working around readonly discriminated unions with Writeable casts instead of making the types readonly-aware. The patches are clean and low-risk for what they touch; they just don't touch enough.
What all three share is a handful of structurally hard spots — deepPartial that preserves nested inferred types, recursion cutoffs that don't silently accept over-limit cases, refinement clones that carry parent links through to finalization, predicate-aware refine on mini schemas, the full Hebrew localization pass. The failure there isn't reasoning or discipline; it's task-structural.
Takeaway
For this Zod slice, Opus 4.7 is directionally better, not categorically better.
It doesn't pass more tests, it clears the binary code-review bar less often, and fresh 4.6 edges it on cost per task.
However, it wins clearly on footprint risk (>2x tighter patches) and leads on equivalence, discipline, maintainability-when-shippable, and task time. The failure modes shift in step: fewer wrong-problem patches and fewer runaway sessions, more cases of stopping short on a narrow fix. The mental model is a more disciplined coder, not a fundamentally smarter one.
4.7 is worth a serious look on your own repo. Patch quality and alignment with intent move meaningfully even when test pass count stays flat.
Zod is a TypeScript schema library. Your repo is different — that's exactly the point of measuring this on your work rather than a public benchmark. That's what Stet is for.