Benchmark evidence

Claude Opus 4.7 reasoning curve evidence

On May 12, 2026, Stet ran Claude Opus 4.7 at low, medium, high, xhigh, and max reasoning effort on 29 matched graphql-go-tools tasks. Medium led the primary rollout metrics: 28/29 tests pass, 14/29 equivalence, 10/29 code-review pass, and a 2.759 aggregate craft/discipline mean.

Tasks: 29 real coding tasks
Repos: graphql-go-tools
Judge: GPT-5.4
Date: 2026-05-12
Harnesses: Claude Code 2.1.126-2.1.138
Caveat: Inspect-grade stitched curve; single seed per task; no human grader calibration on this set.

Results table

| Metric | Low | Medium | High | Xhigh | Max |
|---|---|---|---|---|---|
| Tests pass | 23 / 29 | 28 / 29 | 26 / 29 | 25 / 29 | 27 / 29 |
| Equivalence | 10 / 29 | 14 / 29 | 12 / 29 | 11 / 29 | 13 / 29 |
| Code-review pass | 5 / 29 | 10 / 29 | 7 / 29 | 4 / 29 | 8 / 29 |
| Code-review rubric mean | 2.426 | 2.716 | 2.509 | 2.482 | 2.431 |
| Footprint risk mean | 0.155 | 0.189 | 0.206 | 0.238 | 0.227 |
| Craft / discipline avg | 2.598 | 2.759 | 2.670 | 2.669 | 2.690 |
| Cost per task (mean) | $2.50 | $3.15 | $5.01 | $6.51 | $8.84 |
| Mean agent duration | 383.8 s | 450.7 s | 716.4 s | 803.8 s | 996.9 s |
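
For readers who want to recompute the derived figures quoted in the pairwise comparisons below, here is a minimal Python sketch that re-encodes the table above. The names (`RESULTS`, `pass_rate`, `cost_multiple`) are illustrative, not part of the Stet harness; the numbers are copied directly from the table.

```python
# Minimal sketch: re-encode the results table and derive the figures
# cited in the pairwise comparisons below. RESULTS, pass_rate, and
# cost_multiple are illustrative names, not part of Stet itself.

ARMS = ["low", "medium", "high", "xhigh", "max"]
TASKS = 29

RESULTS = {
    # arm: (tests_pass, equivalence, code_review_pass, cost_per_task_usd)
    "low":    (23, 10,  5, 2.50),
    "medium": (28, 14, 10, 3.15),
    "high":   (26, 12,  7, 5.01),
    "xhigh":  (25, 11,  4, 6.51),
    "max":    (27, 13,  8, 8.84),
}

def pass_rate(arm: str) -> float:
    """Fraction of the 29 matched tasks whose tests pass under this arm."""
    return RESULTS[arm][0] / TASKS

def cost_multiple(arm: str, baseline: str = "medium") -> float:
    """Per-task cost of `arm` relative to `baseline`."""
    return RESULTS[arm][3] / RESULTS[baseline][3]

if __name__ == "__main__":
    for arm in ARMS:
        print(f"{arm:>6}: tests {pass_rate(arm):.0%}, "
              f"cost {cost_multiple(arm):.2f}x medium")
    # Expected: high ≈ 1.59x medium and max ≈ 2.81x medium,
    # matching the multiples quoted in the comparisons below.
```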

Low vs Medium

Medium was the quality step-up: tests pass rose from 23/29 to 28/29, equivalence from 10/29 to 14/29, and code-review pass from 5/29 to 10/29.

Medium vs High

High cost 1.59x as much per task as medium ($5.01 vs. $3.15) yet trailed it on tests pass, equivalence, code-review pass, code-review rubric mean, and aggregate craft/discipline.

High vs Xhigh

Xhigh was more expensive than high ($6.51 vs. $5.01 per task) and had a broader footprint (0.238 vs. 0.206 mean footprint risk) while dropping to 25/29 tests pass, 11/29 equivalence, and 4/29 code-review pass.

Medium vs Max

Max cost 2.81x as much per task as medium ($8.84 vs. $3.15) and remained below medium on every primary quality metric, despite being the busiest and slowest arm (996.9 s mean duration).
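
As a quick arithmetic check, both cost multiples quoted above follow directly from the per-task means in the results table:

$$
\frac{\$5.01}{\$3.15} \approx 1.59\times \ \text{(high vs. medium)}, \qquad
\frac{\$8.84}{\$3.15} \approx 2.81\times \ \text{(max vs. medium)}
$$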

Source article: Source writeup

Methodology: scoring and validation details