Benchmark evidence
Opus 4.7 vs Opus 4.6 Evidence
Evidence table for the Stet benchmark comparing Opus 4.7 with two Opus 4.6 snapshots on 28 historical Zod tasks.
Stet ran Opus 4.7, a March 19 Opus 4.6 snapshot, and an April 16 Opus 4.6 rerun on 28 historical Zod tasks; all three passed 12 of 28 tests, while Opus 4.7 led on equivalence and footprint risk.
- Tasks: 28 real coding tasks
- Repos: Zod
- Judge: GPT-5.4
- Date: 2026-04-17
- Harnesses: Claude Code
- Caveat: Each arm ran once per task with identical rubric versions.
Results table
| Metric | Opus 4.6 Mar 19 | Opus 4.6 Apr 16 | Opus 4.7 Apr 16 |
|---|---|---|---|
| Pass rate | 42.9% | 42.9% | 42.9% |
| Equivalence | 39.3% | 32.1% | 46.4% |
| Code-review pass | 39.3% | 25.0% | 25.0% |
| Footprint risk | 0.210 | 0.221 | 0.090 |
| Mean time per task | 3m26s | 7m58s | 3m12s |
| Cost per task | $8.93 | $6.65 | $8.11 |
| Total tokens | 49.1M | 35.6M | 44.0M |
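The pass rates above follow directly from the raw counts: every arm passed 12 of the 28 tasks. A minimal sketch of that arithmetic (the variable names are illustrative, not from the benchmark harness):

```python
# All three arms passed the same number of visible tests.
passed, total = 12, 28

# 12 / 28 = 0.42857..., which rounds to the 42.9% shown in the table.
pass_rate = 100 * passed / total
print(f"{pass_rate:.1f}%")  # prints "42.9%"
```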
Opus 4.7 vs Opus 4.6
Opus 4.7 did not pass more visible tests than either Opus 4.6 arm (all three sat at 42.9%), but it led on equivalence (46.4% vs. 39.3% and 32.1%), cut footprint risk by more than half (0.090 vs. 0.210 and 0.221), and posted the fastest mean task time (3m12s).
Source article: Source writeup
Methodology: scoring and validation details