Benchmark evidence

Claude Opus 4.7 reasoning curve evidence

On May 12, 2026, Stet ran Claude Opus 4.7 at low, medium, high, xhigh, and max reasoning effort on 29 matched graphql-go-tools tasks. Medium led the primary rollout metrics: 28/29 tests pass, 14/29 equivalence, 10/29 code-review pass, and a 2.759 aggregate craft/discipline mean.

Tasks: 29 real coding tasks
Repos: graphql-go-tools
Judge: GPT-5.4
Date: 2026-05-12
Harnesses: Claude Code 2.1.126-2.1.138
Caveat: Inspect-grade stitched curve; single seed per task; no human grader calibration on this set.

Results table

| Metric | Low | Medium | High | Xhigh | Max |
|---|---|---|---|---|---|
| Tests pass | 23 / 29 | 28 / 29 | 26 / 29 | 25 / 29 | 27 / 29 |
| Equivalence | 10 / 29 | 14 / 29 | 12 / 29 | 11 / 29 | 13 / 29 |
| Code-review pass | 5 / 29 | 10 / 29 | 7 / 29 | 4 / 29 | 8 / 29 |
| Code-review rubric mean | 2.426 | 2.716 | 2.509 | 2.482 | 2.431 |
| Footprint risk mean | 0.155 | 0.189 | 0.206 | 0.238 | 0.227 |
| Craft / discipline avg | 2.598 | 2.759 | 2.670 | 2.669 | 2.690 |
| Cost per task (mean) | $2.50 | $3.15 | $5.01 | $6.51 | $8.84 |
| Mean agent duration | 383.8 s | 450.7 s | 716.4 s | 803.8 s | 996.9 s |
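
For readers who want to recompute the derived figures quoted in the pairwise comparisons below, here is a minimal Python sketch that re-encodes the table above. The names (`RESULTS`, `pass_rate`, `cost_multiple`) are illustrative, not part of the Stet harness; the numbers are copied directly from the table.

```python
# Minimal sketch: re-encode the results table and derive the figures
# cited in the pairwise comparisons below. RESULTS, pass_rate, and
# cost_multiple are illustrative names, not part of Stet itself.

ARMS = ["low", "medium", "high", "xhigh", "max"]
TASKS = 29

RESULTS = {
    # arm: (tests_pass, equivalence, code_review_pass, cost_per_task_usd)
    "low":    (23, 10,  5, 2.50),
    "medium": (28, 14, 10, 3.15),
    "high":   (26, 12,  7, 5.01),
    "xhigh":  (25, 11,  4, 6.51),
    "max":    (27, 13,  8, 8.84),
}

def pass_rate(arm: str) -> float:
    """Fraction of the 29 matched tasks whose tests pass under this arm."""
    return RESULTS[arm][0] / TASKS

def cost_multiple(arm: str, baseline: str = "medium") -> float:
    """Per-task cost of `arm` relative to `baseline`."""
    return RESULTS[arm][3] / RESULTS[baseline][3]

if __name__ == "__main__":
    for arm in ARMS:
        print(f"{arm:>6}: tests {pass_rate(arm):.0%}, "
              f"cost {cost_multiple(arm):.2f}x medium")
    # Expected: high ≈ 1.59x medium and max ≈ 2.81x medium,
    # matching the multiples quoted in the comparisons below.
```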

Low vs Medium

Medium was the quality step-up: tests pass rose from 23/29 to 28/29, equivalence from 10/29 to 14/29, and code-review pass from 5/29 to 10/29.

Medium vs High

High cost 1.59x as much per task as medium ($5.01 vs. $3.15) yet trailed it on tests pass, equivalence, code-review pass, code-review rubric mean, and aggregate craft/discipline.

High vs Xhigh

Xhigh was more expensive than high ($6.51 vs. $5.01 per task) and had a broader footprint (0.238 vs. 0.206 mean footprint risk) while dropping to 25/29 tests pass, 11/29 equivalence, and 4/29 code-review pass.

Medium vs Max

Max cost 2.81x as much per task as medium ($8.84 vs. $3.15) and remained below medium on every primary quality metric, despite being the busiest and slowest arm (996.9 s mean duration).
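
As a quick arithmetic check, both cost multiples quoted above follow directly from the per-task means in the results table:

$$
\frac{\$5.01}{\$3.15} \approx 1.59\times \ \text{(high vs. medium)}, \qquad
\frac{\$8.84}{\$3.15} \approx 2.81\times \ \text{(max vs. medium)}
$$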

Source article: Source writeup

Methodology: scoring and validation details