Benchmark evidence
Claude Opus 4.7 reasoning curve evidence
On May 12, 2026, Stet ran Claude Opus 4.7 at low, medium, high, xhigh, and max reasoning effort on 29 matched graphql-go-tools tasks. Medium led the primary rollout metrics: 28/29 tests passing, 14/29 equivalence, 10/29 code-review pass, and a 2.759 aggregate craft/discipline mean.
- Tasks: 29 real coding tasks
- Repos: graphql-go-tools
- Judge: GPT-5.4
- Date: 2026-05-12
- Harnesses: Claude Code 2.1.126-2.1.138
- Caveat: Inspect-grade stitched curve; single seed per task; no human grader calibration on this set (see the uncertainty sketch just below).
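The single-seed caveat is worth quantifying. As a rough illustration (my addition, not part of Stet's methodology), a Wilson score interval around medium's headline 28/29 test pass rate shows how wide the plausible true rate remains with one rollout per task:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# Medium's 28/29 tests-pass result, one seed per task:
lo, hi = wilson_interval(28, 29)
print(f"95% CI for 28/29: [{lo:.2f}, {hi:.2f}]")  # -> [0.83, 0.99]
```

An interval that wide is why the arm-to-arm orderings below should be read as directional rather than definitive.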
Results table
| Metric | Low | Medium | High | Xhigh | Max |
|---|---|---|---|---|---|
| Tests pass | 23 / 29 | 28 / 29 | 26 / 29 | 25 / 29 | 27 / 29 |
| Equivalence | 10 / 29 | 14 / 29 | 12 / 29 | 11 / 29 | 13 / 29 |
| Code-review pass | 5 / 29 | 10 / 29 | 7 / 29 | 4 / 29 | 8 / 29 |
| Code-review rubric mean | 2.426 | 2.716 | 2.509 | 2.482 | 2.431 |
| Footprint risk mean | 0.155 | 0.189 | 0.206 | 0.238 | 0.227 |
| Craft/discipline mean | 2.598 | 2.759 | 2.670 | 2.669 | 2.690 |
| Cost per task (mean) | $2.50 | $3.15 | $5.01 | $6.51 | $8.84 |
| Mean agent duration | 383.8s | 450.7s | 716.4s | 803.8s | 996.9s |
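The cost multiples quoted in the comparisons below fall straight out of the table. A minimal sketch that recomputes them (values transcribed from above; the dict layout and arm keys are mine):

```python
# Per-task mean cost and tests-pass counts, transcribed from the results table.
cost = {"low": 2.50, "medium": 3.15, "high": 5.01, "xhigh": 6.51, "max": 8.84}
tests_pass = {"low": 23, "medium": 28, "high": 26, "xhigh": 25, "max": 27}

for arm in ("high", "max"):
    print(f"{arm}: {cost[arm] / cost['medium']:.2f}x medium cost")
# high: 1.59x medium cost
# max: 2.81x medium cost

delta = tests_pass["medium"] - tests_pass["low"]
print(f"low -> medium: +{delta} tasks passing tests (of 29)")
# low -> medium: +5 tasks passing tests (of 29)
```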
Low vs Medium
Medium was the clear quality step up over low: tests passing rose from 23/29 to 28/29, equivalence from 10/29 to 14/29, and code-review pass from 5/29 to 10/29.
Medium vs High
High cost 1.59x medium per task but trailed medium on tests, equivalence, code-review pass, code-review rubric mean, and aggregate craft/discipline.
High vs Xhigh
Xhigh was more expensive and broader-footprint than high while dropping to 25/29 tests, 11/29 equivalence, and 4/29 code-review pass.
Medium vs Max
Max cost 2.81x medium per task and still trailed medium on every primary quality metric, despite being the slowest arm at a 996.9s mean duration.
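One way to make the medium-vs-max value gap concrete is to normalize mean cost by test pass rate. This cost-per-test-passing-task figure is a derived view I am adding, not a metric the run reports:

```python
# Derived from the results table: expected spend per task whose tests pass,
# i.e. mean cost divided by the arm's tests-pass rate. Illustrative only.
cost = {"low": 2.50, "medium": 3.15, "high": 5.01, "xhigh": 6.51, "max": 8.84}
tests_pass = {"low": 23, "medium": 28, "high": 26, "xhigh": 25, "max": 27}

for arm, c in cost.items():
    per_pass = c / (tests_pass[arm] / 29)
    print(f"{arm}: ${per_pass:.2f} per test-passing task")
# low $3.15, medium $3.26, high $5.59, xhigh $7.55, max $9.49
```

On this view medium spends about $3.26 per test-passing task versus roughly $9.49 for max, with the standing caveat that a single seed per task makes all of these point estimates noisy.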
Source article: source writeup
Methodology: scoring and validation details