Opus 4.7 vs Opus 4.6 Evidence

Evidence table for the Stet benchmark comparing Opus 4.7 with two Opus 4.6 snapshots on 28 historical Zod tasks.

Stet ran Opus 4.7, a March 19 Opus 4.6 snapshot, and an April 16 Opus 4.6 rerun on 28 historical Zod tasks. All three passed the tests on 12 of 28 tasks (42.9%), while Opus 4.7 led on equivalence and footprint risk.

Tasks: 28 real coding tasks
Repos: Zod
Judge: GPT-5.4
Date: 2026-04-17
Harnesses: Claude Code
Caveat: Each arm ran once per task, so the results carry no estimate of run-to-run variance. All arms used identical rubric versions.

Results table

Metric	Opus 4.6 Mar 19	Opus 4.6 Apr 16	Opus 4.7 Apr 16
Pass rate	42.9%	42.9%	42.9%
Equivalence	39.3%	32.1%	46.4%
Code-review pass	39.3%	25.0%	25.0%
Footprint risk	0.210	0.221	0.090
Mean time per task	3m26s	7m58s	3m12s
Cost per task	$8.93	$6.65	$8.11
Total tokens	49.1M	35.6M	44.0M
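
The percentage rows can be read back as whole-task counts. The sketch below assumes that every percentage uses all 28 tasks as its denominator, which the page states for the pass rate (12 of 28) but does not confirm for the other rows. The `implied_count` helper is a hypothetical name introduced here for illustration.

```python
TASKS = 28  # task count stated in the page metadata

# Percentages copied from the results table, in column order:
# Opus 4.6 Mar 19, Opus 4.6 Apr 16, Opus 4.7 Apr 16
rates = {
    "Pass rate": (42.9, 42.9, 42.9),
    "Equivalence": (39.3, 32.1, 46.4),
    "Code-review pass": (39.3, 25.0, 25.0),
}


def implied_count(pct, total=TASKS):
    """Nearest whole number of tasks matching a rounded percentage.

    Assumption: the percentage was computed over `total` tasks.
    """
    return round(pct / 100 * total)


counts = {
    metric: tuple(implied_count(p) for p in row)
    for metric, row in rates.items()
}

for metric, row in counts.items():
    print(metric, row)
```

Under that assumption, a one-task difference moves any of these rates by about 3.6 percentage points, which is worth keeping in mind when comparing arms that each ran once.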

Opus 4.7 vs Opus 4.6

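Opus 4.7 did not pass more visible tests than either Opus 4.6 run (42.9% in all three arms), but it improved equivalence (46.4% vs 39.3% and 32.1%), footprint risk (0.090 vs 0.210 and 0.221), and mean task time (3m12s vs 3m26s and 7m58s). It also improved discipline, which has no separate row in the table above. Its code-review pass rate (25.0%) matched the April 16 Opus 4.6 rerun and trailed the March 19 snapshot (39.3%).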
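
To put the footprint-risk and timing gaps on a common scale, the sketch below computes the relative change of Opus 4.7 against each Opus 4.6 run. It assumes that lower footprint risk and lower mean time are better, as the summary implies. The helper names `to_seconds` and `relative_change` are hypothetical and are not part of the Stet tooling.

```python
import re


def to_seconds(text):
    """Parse a duration such as '3m26s' into whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def relative_change(new, old):
    """Signed fractional change from old to new (negative means a decrease)."""
    return (new - old) / old


# Values copied from the results table.
footprint = {"4.6 Mar 19": 0.210, "4.6 Apr 16": 0.221, "4.7 Apr 16": 0.090}
mean_time = {"4.6 Mar 19": "3m26s", "4.6 Apr 16": "7m58s", "4.7 Apr 16": "3m12s"}

for baseline in ("4.6 Mar 19", "4.6 Apr 16"):
    risk = relative_change(footprint["4.7 Apr 16"], footprint[baseline])
    time = relative_change(
        to_seconds(mean_time["4.7 Apr 16"]), to_seconds(mean_time[baseline])
    )
    print(f"vs {baseline}: footprint risk {risk:+.1%}, mean time {time:+.1%}")
```

Because each arm ran once per task, these ratios describe this single run and are not an estimate of the typical gap.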

Source article: Source writeup

Methodology: scoring and validation details