Benchmark evidence
GPT-5.5 reasoning curve evidence
On May 7, 2026, Stet ran GPT-5.5 Codex at low, medium, high, and xhigh reasoning effort on 26 matched graphql-go-tools tasks; equivalence rose from 4/26 (low) to 23/26 (xhigh) and code-review pass rose from 3/26 to 18/26, while test pass was not monotonic (low 21, medium 21, high 25, xhigh 24).
- Tasks: 26 real coding tasks
- Repos: graphql-go-tools
- Judge: GPT-5.4
- Date: 2026-05-07
- Harnesses: OpenAI Codex CLI 0.128.0
- Caveat: single seed per task; 26-task slice; no human grader calibration on this set.
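A minimal sketch of how a run matrix like this could be driven. The effort levels, task count, repo, and single-seed caveat come from the parameters above; `RunConfig`, `run_task`, and the model string are hypothetical stand-ins, not the harness's actual API:

```python
from dataclasses import dataclass

# Effort levels and task count are from the run parameters above;
# everything below them is an illustrative stand-in.
EFFORTS = ["low", "medium", "high", "xhigh"]
NUM_TASKS = 26

@dataclass
class RunConfig:
    model: str             # model under test (hypothetical identifier)
    reasoning_effort: str  # one of EFFORTS
    repo: str              # repository the task lives in
    task_id: int           # index into the matched task set

def run_task(cfg: RunConfig) -> dict:
    # Hypothetical: a real implementation would invoke the agent harness
    # once and collect tests / equivalence / review / cost for this task.
    return {"config": cfg, "tests_pass": None, "equivalent": None}

def run_matrix() -> list[dict]:
    # One seed per task per effort level, matching the caveat above.
    results = []
    for effort in EFFORTS:
        for task_id in range(NUM_TASKS):
            results.append(run_task(RunConfig(
                model="gpt-5.5-codex",
                reasoning_effort=effort,
                repo="graphql-go-tools",
                task_id=task_id,
            )))
    return results
```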
Results table
| Metric | Low | Medium | High | Xhigh |
|---|---|---|---|---|
| Tests pass | 21 / 26 | 21 / 26 | 25 / 26 | 24 / 26 |
| Equivalence | 4 / 26 | 11 / 26 | 18 / 26 | 23 / 26 |
| Code-review pass | 3 / 26 | 5 / 26 | 10 / 26 | 18 / 26 |
| Footprint risk | 0.200 | 0.268 | 0.314 | 0.365 |
| Craft / discipline avg | 2.311 | 2.604 | 2.736 | 3.071 |
| Cost per task (mean) | $2.65 | $3.13 | $4.49 | $9.77 |
| Mean agent duration | 286.9s | 411.0s | 579.0s | 753.3s |
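The deltas quoted in the pairwise comparisons below fall straight out of this table. A small script to reproduce them, with metric names and values copied from the table and nothing else assumed:

```python
# Results table above, keyed by reasoning effort.
results = {
    "low":    {"tests": 21, "equiv": 4,  "review": 3,  "cost": 2.65},
    "medium": {"tests": 21, "equiv": 11, "review": 5,  "cost": 3.13},
    "high":   {"tests": 25, "equiv": 18, "review": 10, "cost": 4.49},
    "xhigh":  {"tests": 24, "equiv": 23, "review": 18, "cost": 9.77},
}
efforts = ["low", "medium", "high", "xhigh"]

# Cost ratios quoted in the comparisons.
for a, b in zip(efforts, efforts[1:]):
    ratio = results[b]["cost"] / results[a]["cost"]
    print(f"{a} -> {b}: cost x{ratio:.2f}")
# medium -> high prints x1.43; high -> xhigh prints x2.18.

# Test pass is the one metric that is not monotonic in effort.
tests = [results[e]["tests"] for e in efforts]  # [21, 21, 25, 24]
print("tests monotonic:", all(x <= y for x, y in zip(tests, tests[1:])))  # False
```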
Low vs Medium
Test pass tied at 21/26, but medium lifted equivalence from 4/26 to 11/26 and craft/discipline from 2.311 to 2.604. Tests alone would have missed the reasoning-effort difference.
Medium vs High
High was the cleanest practical upgrade: tests 21 to 25, equivalence 11 to 18, review pass 5 to 10. Cost rose 1.43x ($3.13 to $4.49 per task), and integration details were correct more often.
High vs Xhigh
Xhigh bought equivalence (18 to 23) and review pass (10 to 18) at 2.18x cost, but tests slipped from 25 to 24 and footprint risk rose from 0.314 to 0.365. A quality mode, not a default.
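One way to read the trade-off, assuming "cost per task (mean)" is an arithmetic mean over the 26 tasks: the marginal dollars per additional equivalent solution jump roughly fivefold between the high step and the xhigh step.

```python
# Marginal cost per extra equivalent task, computed from the table values.
# Assumes "cost per task (mean)" is an arithmetic mean over 26 tasks.
N = 26
steps = [
    # (from, to, cost_from, cost_to, equiv_from, equiv_to)
    ("medium", "high",  3.13, 4.49, 11, 18),
    ("high",   "xhigh", 4.49, 9.77, 18, 23),
]
for a, b, cost_a, cost_b, eq_a, eq_b in steps:
    extra_dollars = (cost_b - cost_a) * N  # total extra spend for the run
    extra_wins = eq_b - eq_a               # additional equivalent tasks
    print(f"{a} -> {b}: ${extra_dollars / extra_wins:.2f} per extra equivalence")
# medium -> high: about $5.05; high -> xhigh: about $27.46.
```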
Source article: see the source writeup.
Methodology: see the scoring and validation details.