STET

Benchmark evidence

Opus 4.7 vs Opus 4.6 Evidence

Evidence table for the Stet benchmark comparing Opus 4.7 with two Opus 4.6 snapshots on 28 historical Zod tasks.

Stet ran Opus 4.7, a March 19 Opus 4.6 snapshot, and an April 16 Opus 4.6 rerun on 28 historical Zod tasks. All three passed 12 of 28 tasks (42.9%), while Opus 4.7 had the highest equivalence score and the lowest footprint risk.
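As a quick sanity check, the 12-of-28 count can be converted to the 42.9% pass rate reported in the results table, and back again. This is a minimal sketch; the helper names `rate_from_count` and `count_from_rate` are hypothetical and are not part of Stet.

```python
TASKS = 28  # number of Zod tasks in the benchmark


def rate_from_count(count: int, total: int = TASKS) -> float:
    """Return a percentage rounded to one decimal, as shown in the table."""
    return round(100 * count / total, 1)


def count_from_rate(rate_percent: float, total: int = TASKS) -> int:
    """Recover the nearest whole-task count behind a reported percentage."""
    return round(rate_percent / 100 * total)


pass_rate = rate_from_count(12)
passed_tasks = count_from_rate(42.9)
print(pass_rate, passed_tasks)
```

Because the task count is small, each task is worth about 3.6 percentage points, so every percentage in the table maps to a whole number of tasks.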

Tasks: 28 real coding tasks
Repos: Zod
Judge: GPT-5.4
Date: 2026-04-17
Harnesses: Claude Code
Caveat: Each arm ran once per task with identical rubric versions.
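To get a feel for how much a single run per task over 28 tasks can pin down, one option is a Wilson score interval around the shared pass rate. This is only an illustrative sketch and not part of the Stet methodology: it assumes the 28 tasks behave like independent trials, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# 12 of 28 tasks passed in every arm (assumption: tasks are independent trials).
low, high = wilson_interval(12, 28)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 27% to 61%, which suggests that small differences between arms on a 28-task set should be read with caution.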

Results table

Metric               Opus 4.6 Mar 19   Opus 4.6 Apr 16   Opus 4.7 Apr 16
Pass rate            42.9%             42.9%             42.9%
Equivalence          39.3%             32.1%             46.4%
Code-review pass     39.3%             25.0%             25.0%
Footprint risk       0.210             0.221             0.090
Mean time per task   3m26s             7m58s             3m12s
Cost per task        $8.93             $6.65             $8.11
Total tokens         49.1M             35.6M             44.0M
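The per-task and total figures in the table can be related with simple arithmetic. The sketch below derives an approximate total cost and tokens per task for each arm. It assumes that "Cost per task" is a mean over all 28 tasks, which the table does not state explicitly; the dictionary keys and the `derive_totals` helper are hypothetical.

```python
TASKS = 28  # number of Zod tasks in the benchmark

# Per-arm figures copied from the results table.
runs = {
    "Opus 4.6 Mar 19": {"cost_per_task_usd": 8.93, "total_tokens_m": 49.1},
    "Opus 4.6 Apr 16": {"cost_per_task_usd": 6.65, "total_tokens_m": 35.6},
    "Opus 4.7 Apr 16": {"cost_per_task_usd": 8.11, "total_tokens_m": 44.0},
}


def derive_totals(run: dict, tasks: int = TASKS) -> dict:
    """Derive whole-run cost and per-task token usage (assumes per-task means)."""
    return {
        "total_cost_usd": round(run["cost_per_task_usd"] * tasks, 2),
        "tokens_per_task_m": round(run["total_tokens_m"] / tasks, 2),
    }


totals = {name: derive_totals(run) for name, run in runs.items()}
for name, derived in totals.items():
    print(f"{name}: ~${derived['total_cost_usd']:.2f} total, "
          f"~{derived['tokens_per_task_m']:.2f}M tokens per task")
```

These derived values are estimates: the published per-task costs are already rounded, so the totals can be off by a few cents per task.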

Opus 4.7 vs Opus 4.6

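Opus 4.7 did not pass more visible tests than either Opus 4.6 run, but it improved equivalence (46.4% vs 39.3% and 32.1%), footprint risk (0.090 vs 0.210 and 0.221), discipline, and mean task time (3m12s vs 3m26s and 7m58s).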

Source article: Source writeup

Methodology: scoring and validation details