Opus 4.7 vs Old Opus 4.6 vs New Opus 4.6
Everyone says Opus 4.6 was getting dumber. Then Opus 4.7 shipped mid-test, so I answered both questions end-to-end: does a fresh Opus 4.6 run still match the March-19 Opus 4.6, and is 4.7 actually better?
Three Opus snapshots, 28 historical Zod tasks, and an identical 12/28 test pass rate across all three arms. On raw pass rate the upgrade looks flat. Above the test gate, though, the arms diverge enough that the useful mental model is that Opus 4.7 is directionally better, not categorically better.
Opus 4.7 appears to be a more disciplined coder, not a fundamentally smarter one.
On cost, tokens, and wall-clock time: 4.7 is cheaper per task than March 4.6 ($8.11 vs $8.93), uses fewer total tokens (44.0M vs 49.1M), and finishes the full 28-task run faster (1h 30m vs 1h 36m). Fresh 4.6 is the cheapest arm, but it takes 2.3x longer to produce looser, less equivalent patches.
I'm building Stet, which scored these runs on equivalence, footprint, craft, and discipline beyond pass/fail. Zod was chosen as a specific, concrete repo rather than a high-level benchmark — I've seen similar shapes on internal repos.
Method
For each task: sample a merged commit from Zod as the baseline, run each Opus snapshot in Claude Code to reproduce the same changes, then score each patch alongside test pass rate on:
- Equivalence — does the patch solve the intended problem, regardless of whether tests catch it?
- Code-review pass — binary: does the patch look merge-worthy?
- Footprint risk — how divergent is the patch from the accepted change? Lower is better.
- Craft (0–4) — simplicity, coherence, intentionality, robustness, clarity.
- Discipline (0–4) — instruction adherence, scope discipline, diff minimality.
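As a minimal sketch of what one scored patch and an arm-level summary might look like (the type and function names here are hypothetical, not Stet's actual schema or API):

```typescript
// Hypothetical shape of one scored patch; not Stet's actual schema.
type RiskBucket = "low" | "medium" | "high";

interface PatchScore {
  task: string;          // baseline commit the patch reproduces
  arm: string;           // model snapshot; hidden from the judge
  testsPassed: boolean;  // binary test gate
  equivalent: boolean;   // solves the intended problem?
  reviewPass: boolean;   // merge-worthy at a glance?
  footprint: RiskBucket; // divergence from the accepted change
  craft: number;         // 0-4 rubric mean
  discipline: number;    // 0-4 rubric mean
}

// Aggregate one arm's run into headline numbers.
function summarize(scores: PatchScore[]) {
  const n = scores.length;
  const count = (p: (s: PatchScore) => boolean) => scores.filter(p).length;
  return {
    passRate: count(s => s.testsPassed) / n,
    equivalenceRate: count(s => s.equivalent) / n,
    highFootprint: count(s => s.footprint === "high"),
    meanCraft: scores.reduce((a, s) => a + s.craft, 0) / n,
  };
}
```

The point of keeping the rubric scores alongside the binary test gate is visible in the shape: two arms can tie on `passRate` while diverging on everything else.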
Grading notes: the judge is gpt-5.4, run with identical rubric versions across all three arms. Each patch is scored independently — the judge sees the patch and task, not the arm label or model name. No dual-rater calibration, so treat absolute scores as directional; the cross-arm deltas are the thing to trust.
Headline
All three arms pass identical tests. The one dimension where 4.7 doesn't lead is the binary code-review bar, where the March 19 run cleared it more often (11 vs 7); fresh 4.6 is modestly cheaper per task.
A lot of people say 4.7 is more expensive. On this slice it isn't: $8.11/task vs $8.93 for March 4.6, and 44.0M vs 49.1M total tokens. Fresh 4.6 is the cheapest arm ($6.65, 35.6M tokens) but takes 2.3x longer to produce looser, less equivalent patches — the savings buy you worse output.
Everywhere else — equivalence, footprint risk, maintainability on shippable-looking patches, mean task time — 4.7 is the strongest of the three.
New Opus 4.6 is the weakest arm: lower equivalence, higher footprint risk, longer time per task. It used ~28% fewer input tokens than the March run despite taking 2.3x longer. Whatever changed under the hood, the output is looser patches produced with less thinking.
Footprint risk is the clearest signal
Footprint risk asks whether the patch is larger or more divergent than the accepted change. It's the delta I'd trust most — a more than 2x relative drop on 4.7, measured on a more continuous scale than the rubric scores.
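One way to make that concrete (a sketch, not Stet's actual formula): compare the changed-line count of the model's patch against the accepted change and bucket the ratio. The thresholds below are illustrative.

```typescript
// Hypothetical footprint metric: ratio of lines the model's patch touches
// to lines the accepted human change touched. Thresholds are illustrative.
function footprintBucket(
  patchLines: number,
  acceptedLines: number
): "low" | "medium" | "high" {
  const ratio = patchLines / Math.max(acceptedLines, 1);
  if (ratio <= 1.5) return "low";  // roughly the same footprint
  if (ratio <= 3) return "medium"; // noticeably larger
  return "high";                   // touched far more code than needed
}
```

A ratio-style metric is continuous underneath even when reported as buckets, which is why a cross-arm shift here is easier to trust than a shift in a 0–4 rubric score.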
Count of patches per risk bucket. Opus 4.7 is the only arm with zero high-footprint patches.
Opus 4.7 had no high-footprint patches. New Opus 4.6 more often made changes that touched more code than necessary.
Equivalence
Equivalence asks whether the patch solves the intended problem, not merely whether available tests catch it. 4.7's patches were more equivalent with the human-authored Zod changes, consistent with being more aligned to codebase standards and human intent.
Rubric scores
Two 0–4 rubrics: review (run only on patches that cleared the binary code-review bar) and discipline (run on every patch).
On review, the pattern isn't "4.7 is uniformly more correct." It's closer to: when 4.7 produces a shippable-looking patch, that patch tends to be cleaner and more maintainable — the maintainability delta (2.00 → 2.46 → 2.85) is the biggest gap in the review block.
On discipline, 4.7 leads on every dimension, which tracks the footprint-risk result: tighter, more on-task patches. Scope discipline (+0.31 to +0.45) and diff minimality (+0.32 to +0.37) are the biggest gaps over the 4.6 arms.
Craft means — simplicity, coherence, intentionality, robustness, clarity on 0–4 — sit within ~0.1 across arms (2.86 / 2.84 / 2.93), so treat them as consistent with noise at n=28. The one separator is intentionality, which climbs 2.98 → 3.27 → 3.58: 4.7's patches read as more purposeful.
Beyond the numbers, the grader narratives cluster differently by arm.
Shared weaknesses across all three. Silent fallback branches that hide the root cause instead of propagating a diagnostic — accepting unknown precisions as unrestricted, emitting empty anyOf for null-only tuples, printing raw English labels for unmapped types, returning the original object when a recursion cap is hit. Type-system escape hatches at the call site — as any, inline _zod intersections, whole-expression SafeParseResult casts — used in place of tightening the underlying boundary.
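As a hypothetical illustration of the silent-fallback pattern the graders flagged (not code from the actual patches): swallowing an unknown case versus surfacing it as a diagnostic.

```typescript
// Hypothetical illustration of the "silent fallback" pattern;
// not code from the actual patches.

// Fallback: an unknown precision quietly becomes "unrestricted" (null),
// hiding the root cause from the caller.
function precisionSilent(p: string): number | null {
  const table: Record<string, number> = { ms: 3, s: 0 };
  return table[p] ?? null; // unknown precision accepted silently
}

// Diagnostic: the unknown case propagates as an explicit error,
// so the bad input is caught where it originates.
function precisionStrict(p: string): number {
  const table: Record<string, number> = { ms: 3, s: 0 };
  const v = table[p];
  if (v === undefined) throw new Error(`unknown precision: ${p}`);
  return v;
}
```

The first version looks more robust in a diff; the second is what the graders wanted, because the failure stays visible.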
Old Opus 4.6. Distinctive flag: unearned plumbing. Fields and helpers added for a nearby idea but never consumed — ProcessParams.parent, Sizable.verb, an Identity type, a ~validate method with a single caller. Commented-out scratch code left behind in production files. On tasks with mirrored Deno and Node surfaces, some mirror cleanly while others leave deno/lib stale.
New Opus 4.6. Damaging flag: checked-in generated artifacts. Vendored node_modules/.pnpm trees, node_modules/.bin/attw, .pytest_cache, compiled .pyc files — on one task the patch balloons to 2.6 GB. Near-miss public strings: "draft-04" written as "draft-4", a version bump to 4.2.0 when a patch release was intended, a recheck dependency added without being asked. Duplicated lookup tables across parallel locale surfaces (Hebrew TypeLabels/parsedType/Origins/ContainerLabels; Spanish TypeNames vs parsedType).
Opus 4.7. Mirror image of 4.6's weaknesses. Patches stay tightly within one or two files directly implied by the task; unrelated refactors don't appear. Weakness is under-scoping: multi-site refactors get narrowed to a single illustrative spot (assertion removals touch four v3 sites when v3+v4 helpers were expected; OpenAPI-3.0 null fix handles the tuple branch and leaves primitive and union cases alone). Local escape hatches like Writeable casts replace making generic constraints readonly-aware. The agent reliably honors meta-instructions like "do not perform a code review" and keeps new API surface additive rather than replacing existing aliases.
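A hypothetical sketch of that escape-hatch pattern (not the actual Zod code): casting readonly away at the call site versus making the generic constraint readonly-aware so no cast is needed anywhere.

```typescript
// Hypothetical sketch of the Writeable-cast escape hatch; not actual Zod code.
type Writeable<T> = { -readonly [K in keyof T]: T[K] };

// Escape hatch: the constraint demands a mutable array, so callers
// holding a readonly tuple must cast at every call site.
function firstOfMutable<T extends string[]>(opts: T): T[number] {
  return opts[0];
}

// Readonly-aware: widen the constraint itself to accept readonly input.
function firstOf<T extends readonly string[]>(opts: T): T[number] {
  return opts[0];
}

const opts = ["a", "b"] as const;        // readonly ["a", "b"]
const viaCast = firstOfMutable(opts as Writeable<typeof opts>);
const direct = firstOf(opts);            // no cast needed
```

Both compile and return the same value; the difference is that the second fixes the boundary once instead of patching over it per call site, which is the distinction the graders kept drawing.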
Why patches fail
All three arms fail the same 16 of 28 tasks by the test-passed bar. The reasons cluster differently:
The pass rate is identical. The failure shape is not. Opus 4.7 never runs out the clock, and its patches drift from "wrong problem" toward "looks right, tests disagree."
Two things jump out. 4.7 never runs out the clock — it finishes every task. And the "equivalent patch, tests still fail" bucket nearly triples on 4.7 (3 → 8) while the "non-equivalent" bucket shrinks by roughly the same amount. 4.7's failures shift toward "looks right but tests disagree" — more patches an independent reviewer judges equivalent to the accepted change, fewer that miss the intent entirely.
Take this shift with a grain of salt. It could mean 4.7 genuinely writes cleaner patches that still miss a subtle obligation the test suite catches, or that the equivalence grader is more forgiving of tight-footprint patches than of sprawling ones. The under-reach pattern below is consistent with the first reading, but it's a signal worth auditing.
Where Opus 4.6 loses ground: breadth. Both 4.6 runs repeatedly miss the Deno mirror on tasks that need parallel Node and Deno updates, leave localization passes partial (Hebrew and Spanish messages retain old wording or untranslated labels), and miss requested API surfaces — a shared NEVER export, mini-schema support, whole families of assertion removals. The fresh rerun adds unforced errors: vendored node_modules trees committed, wrong published target strings, a version bump that doesn't match the intended release.
Where Opus 4.7 loses ground: under-reach. When 4.7 misses, it stops at a narrow local fix — updating only ZodMiniType.check when the task asked for four related inference changes, applying a tuple-local OpenAPI workaround while leaving union-with-null semantics alone, working around readonly discriminated unions with Writeable casts instead of making the types readonly-aware. The patches are clean and low-risk for what they touch; they just don't touch enough.
What all three share is a handful of structurally hard spots — deepPartial that preserves nested inferred types, recursion cutoffs that don't silently accept over-limit cases, refinement clones that carry parent links through to finalization, predicate-aware refine on mini schemas, the full Hebrew localization pass. The failure there isn't reasoning or discipline; it's task-structural.
Takeaway
For this Zod slice, Opus 4.7 is directionally better, not categorically better.
It doesn't pass more tests, it clears the binary code-review bar less often, and fresh 4.6 edges it on cost per task.
However, it wins clearly on footprint risk (>2x tighter patches) and leads on equivalence, discipline, maintainability-when-shippable, and task time. The failure modes shift in step: fewer wrong-problem patches and fewer runaway sessions, more cases of stopping short on a narrow fix. The mental model is a more disciplined coder, not a fundamentally smarter one.
4.7 is worth a serious look on your own repo. Patch quality and alignment with intent move meaningfully even when test pass count stays flat.
Zod is a TypeScript schema library. Your repo is different — that's exactly the point of measuring this on your work rather than a public benchmark. That's what Stet is for.