STET

Benchmark evidence

GPT-5.5 reasoning curve evidence

Evidence table for the May 7, 2026 Stet run of GPT-5.5 Codex at low, medium, high, and xhigh reasoning effort across 26 matched graphql-go-tools tasks.

On May 7, 2026, Stet ran GPT-5.5 Codex at low, medium, high, and xhigh reasoning effort on 26 matched graphql-go-tools tasks. Equivalence rose from 4/26 (low) to 23/26 (xhigh) and code-review pass rose from 3/26 to 18/26, while test pass was not monotonic (21 at low, 21 at medium, 25 at high, 24 at xhigh).

Tasks: 26 real coding tasks
Repos: graphql-go-tools
Judge: GPT-5.4
Date: 2026-05-07
Harnesses: OpenAI Codex CLI 0.128.0
Caveat: Single seed per task; 26-task slice; no human grader calibration on this set.

Results table

Metric                   Low       Medium    High      Xhigh
Tests pass               21 / 26   21 / 26   25 / 26   24 / 26
Equivalence               4 / 26   11 / 26   18 / 26   23 / 26
Code-review pass          3 / 26    5 / 26   10 / 26   18 / 26
Footprint risk           0.200     0.268     0.314     0.365
Craft / discipline avg   2.311     2.604     2.736     3.071
Cost per task (mean)     $2.65     $3.13     $4.49     $9.77
Mean agent duration      286.9 s   411.0 s   579.0 s   753.3 s
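The table above supports a quick structural check: equivalence and review pass climb with reasoning effort, while test pass does not. A minimal sketch of that check in Python (the dict layout and variable names are illustrative; only the numbers come from the table):

```python
# Per-tier results from the 2026-05-07 Stet run of GPT-5.5 Codex.
# Field names are my own shorthand; counts are out of 26 tasks.
RESULTS = {
    "low":    {"tests": 21, "equiv": 4,  "review": 3,  "cost": 2.65},
    "medium": {"tests": 21, "equiv": 11, "review": 5,  "cost": 3.13},
    "high":   {"tests": 25, "equiv": 18, "review": 10, "cost": 4.49},
    "xhigh":  {"tests": 24, "equiv": 23, "review": 18, "cost": 9.77},
}
TIERS = ("low", "medium", "high", "xhigh")

def is_monotonic(metric: str) -> bool:
    """True if the metric never decreases as reasoning effort rises."""
    vals = [RESULTS[t][metric] for t in TIERS]
    return vals == sorted(vals)

print(is_monotonic("equiv"))   # True: equivalence rises at every tier
print(is_monotonic("review"))  # True: review pass rises at every tier
print(is_monotonic("tests"))   # False: test pass dips from 25 to 24 at xhigh
```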

Low vs Medium

Tests tied at 21/26, but medium lifted equivalence from 4/26 to 11/26 and craft/discipline from 2.311 to 2.604. Tests alone would have missed the reasoning-effort difference.

Medium vs High

High was the cleanest practical upgrade: tests 21 to 25, equivalence 11 to 18, review pass 5 to 10. Cost rose 1.43x; integration details were correct more often.

High vs Xhigh

Xhigh bought equivalence (18 to 23) and review pass (10 to 18) at 2.18x the cost of high, but tests dropped from 25 to 24 and footprint risk widened (0.314 to 0.365). A quality mode, not a default.
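The cost multipliers cited in the two comparisons above (1.43x for medium to high, 2.18x for high to xhigh) follow directly from the mean per-task costs in the results table; a quick check (variable names are mine):

```python
# Mean cost per task in USD, from the results table.
cost = {"medium": 3.13, "high": 4.49, "xhigh": 9.77}

print(round(cost["high"] / cost["medium"], 2))  # 1.43: medium -> high
print(round(cost["xhigh"] / cost["high"], 2))   # 2.18: high -> xhigh
```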

Source article: Source writeup

Methodology: scoring and validation details