Benchmark evidence
GPT-5.5 reasoning curve evidence
On May 7, 2026, Stet ran GPT-5.5 Codex at low, medium, high, and xhigh reasoning effort on 26 matched graphql-go-tools tasks; equivalence rose from 4/26 (low) to 23/26 (xhigh) and code-review pass rose from 3/26 to 18/26, while test pass was not monotonic (low 21, medium 21, high 25, xhigh 24).
- Tasks: 26 real coding tasks
- Repos: graphql-go-tools
- Judge: GPT-5.4
- Date: 2026-05-07
- Harnesses: OpenAI Codex CLI 0.128.0
- Caveat: single seed per task; 26-task slice; no human grader calibration on this set.
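A minimal sketch of how a run matrix like this could be driven. The effort levels, task count, repo, and single-seed caveat come from the parameters above; `RunConfig`, `run_task`, and the model string are hypothetical stand-ins, not the harness's actual API:

```python
from dataclasses import dataclass

# Effort levels and task count are from the run parameters above;
# everything below them is an illustrative stand-in.
EFFORTS = ["low", "medium", "high", "xhigh"]
NUM_TASKS = 26

@dataclass
class RunConfig:
    model: str             # model under test (hypothetical identifier)
    reasoning_effort: str  # one of EFFORTS
    repo: str              # repository the task lives in
    task_id: int           # index into the matched task set

def run_task(cfg: RunConfig) -> dict:
    # Hypothetical: a real implementation would invoke the agent harness
    # once and collect tests / equivalence / review / cost for this task.
    return {"config": cfg, "tests_pass": None, "equivalent": None}

def run_matrix() -> list[dict]:
    # One seed per task per effort level, matching the caveat above.
    results = []
    for effort in EFFORTS:
        for task_id in range(NUM_TASKS):
            results.append(run_task(RunConfig(
                model="gpt-5.5-codex",
                reasoning_effort=effort,
                repo="graphql-go-tools",
                task_id=task_id,
            )))
    return results
```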
Results table
| Metric | Low | Medium | High | Xhigh |
|---|---|---|---|---|
| Tests pass | 21 / 26 | 21 / 26 | 25 / 26 | 24 / 26 |
| Equivalence | 4 / 26 | 11 / 26 | 18 / 26 | 23 / 26 |
| Code-review pass | 3 / 26 | 5 / 26 | 10 / 26 | 18 / 26 |
| Footprint risk | 0.200 | 0.268 | 0.314 | 0.365 |
| Craft / discipline avg | 2.311 | 2.604 | 2.736 | 3.071 |
| Cost per task (mean) | $2.65 | $3.13 | $4.49 | $9.77 |
| Mean agent duration | 286.9s | 411.0s | 579.0s | 753.3s |
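The deltas quoted in the pairwise comparisons below fall straight out of this table. A small script to reproduce them, with metric names and values copied from the table and nothing else assumed:

```python
# Results table above, keyed by reasoning effort.
results = {
    "low":    {"tests": 21, "equiv": 4,  "review": 3,  "cost": 2.65},
    "medium": {"tests": 21, "equiv": 11, "review": 5,  "cost": 3.13},
    "high":   {"tests": 25, "equiv": 18, "review": 10, "cost": 4.49},
    "xhigh":  {"tests": 24, "equiv": 23, "review": 18, "cost": 9.77},
}
efforts = ["low", "medium", "high", "xhigh"]

# Cost ratios quoted in the comparisons.
for a, b in zip(efforts, efforts[1:]):
    ratio = results[b]["cost"] / results[a]["cost"]
    print(f"{a} -> {b}: cost x{ratio:.2f}")
# medium -> high prints x1.43; high -> xhigh prints x2.18.

# Test pass is the one metric that is not monotonic in effort.
tests = [results[e]["tests"] for e in efforts]  # [21, 21, 25, 24]
print("tests monotonic:", all(x <= y for x, y in zip(tests, tests[1:])))  # False
```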
Low vs Medium
Test pass tied at 21/26, but medium lifted equivalence from 4/26 to 11/26 and craft/discipline from 2.311 to 2.604. Tests alone would have missed the reasoning-effort difference.
Medium vs High
High was the cleanest practical upgrade: tests 21 to 25, equivalence 11 to 18, review pass 5 to 10. Cost rose 1.43x ($3.13 to $4.49 per task), and integration details were correct more often.
High vs Xhigh
Xhigh bought equivalence (18 to 23) and review pass (10 to 18) at 2.18x cost, but tests slipped from 25 to 24 and footprint risk rose from 0.314 to 0.365. A quality mode, not a default.
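One way to read the trade-off, assuming "cost per task (mean)" is an arithmetic mean over the 26 tasks: the marginal dollars per additional equivalent solution jump roughly fivefold between the high step and the xhigh step.

```python
# Marginal cost per extra equivalent task, computed from the table values.
# Assumes "cost per task (mean)" is an arithmetic mean over 26 tasks.
N = 26
steps = [
    # (from, to, cost_from, cost_to, equiv_from, equiv_to)
    ("medium", "high",  3.13, 4.49, 11, 18),
    ("high",   "xhigh", 4.49, 9.77, 18, 23),
]
for a, b, cost_a, cost_b, eq_a, eq_b in steps:
    extra_dollars = (cost_b - cost_a) * N  # total extra spend for the run
    extra_wins = eq_b - eq_a               # additional equivalent tasks
    print(f"{a} -> {b}: ${extra_dollars / extra_wins:.2f} per extra equivalence")
# medium -> high: about $5.05; high -> xhigh: about $27.46.
```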
Source article: see the source writeup.
Methodology: see the scoring and validation details.