
GPT-5.5 low vs medium vs high vs xhigh: the reasoning curve on 26 real tasks from an open source repo

May 7, 2026

I ran GPT-5.5 Codex at all reasoning effort settings (low, medium, high, and xhigh) on the same 26 tasks from an open source repo (GraphQL-go-tools, in Go).

Low and medium tied on tests at 21/26, but medium was much better on semantic equivalence with the original human PR, and posted higher review quality. High looked like the practical sweet spot. Xhigh produced the best equivalence/review scores, but was much more expensive.

Reasoning effort seems to change the kind of patch Codex produces, not just the pass rate of the tests.

Low → medium: less heuristic/partial implementation, more repo/domain modeling.

Medium → high: the practical jump. More tasks become complete, integrated, and reviewable without xhigh-level cost.

High → xhigh: quality mode. Better on complex tasks, but expensive and slow.

One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work.

Data dump (explored in detail throughout the rest of the post):

For this post, “equivalent” means the patch matched the intent of the merged human PR; “code-review pass” means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.

GPT-5.5 Codex on GraphQL-go-tools: tests say low and medium tie, but equivalence climbs 15% -> 42% -> 69% -> 88%. High is the best cost-quality point; xhigh buys more review quality but costs 3.7x low and regresses on tests versus high.

Outcomes

Per-task drilldown - sorted by widest spread


Task profile: graphql-go-tools#891

Metric         | Low   | Medium | High  | Xhigh
tests pass     | pass  | fail   | pass  | pass
equivalence    | -0.95 | --     | 0.98  | 0.98
review pass    | fail  | fail   | pass  | pass
review rubric  | 0.75  | 2.00   | 4.00  | 4.00
all-3 pass     | 1/3   | 0/3    | 3/3   | 3/3
footprint risk | 0.116 | 0.280  | 0.247 | 0.247
mean duration  | 407s  | 430s   | 586s  | 450s
craft          | 1.57  | 2.13   | 3.73  | 3.65
discipline     | 1.18  | 2.38   | 3.55  | 3.52

Charts: cost and time; efficiency; quality (craft and discipline).

Inspect-grade stitched four-arm curve. Medium/high come from a completed but insufficient-evidence rules compare; low and xhigh are clean candidate-arm runs on the matched 26-task slice.

Xhigh aggregate cost/cache values are preserved from clean pre-regrade evidence because the post-regrade xhigh summary has null per-task task_cost and cache-token fields.

Xhigh code-review rubric means are recovered from task-level validation.json artifacts because the post-regrade xhigh summary preserved review pass/fail signals but dropped flattened code_review RubricScores.

Cost skew: Top two xhigh-cost tasks account for 39.6% of xhigh matched-slice cost. Mean cost is therefore worth reading next to median cost.

Reproduce from the source summaries listed in the raw JSON; regenerate the chart data with npm --prefix leaderboard exec tsx scripts/build-gpt55-graphql-reasoning-curve.mjs.

Why I Ran This

After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. On X/Reddit/HN I had seen speculation around which reasoning effort level is optimal for GPT-5.5 (with some claiming that low/medium is better than high/xhigh due to "overthinking", a known failure mode for 5.4 and 5.3-codex).

To separate vibes from reality, and figure out where the cost/performance sweet spot is for GPT-5.5, I ran this experiment.

This is not meant to be a universal benchmark result - I don’t have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-go-tools as the example repo.

Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding agents perform on real-world tasks.

Terminal-Bench leans toward esoteric coding tasks, SWE-bench Verified is contaminated (models have effectively already seen the answers in training), and SWE-bench Pro is useful but generic. That is not a knock on SWE-bench or Terminal-Bench. Standardized benchmarks are useful, but they mostly answer a binary task-outcome question.

The question I care about day to day is narrower and more annoying: did the agent make the same kind of change a human merged in my codebase, and would I want to own the patch afterward?

Experimental Setup

Each task is derived from a real merged PR or commit. The model gets a frozen repo snapshot, a prompt describing the change, and one attempt to produce a patch in a Docker container. Stet then applies the patch and runs the task's tests in an isolated container to check whether they pass or fail.
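
For a concrete sense of what that per-task step looks like mechanically, here is a minimal sketch of the apply-and-test loop, assuming a git checkout of the snapshot and a stock golang container image. It is illustrative only, not Stet's actual implementation.

```go
// Sketch of the per-task harness step: apply the candidate patch to a frozen
// snapshot, then run the task's tests in a throwaway container. The paths,
// image, and function name are assumptions for illustration.
package harness

import (
	"fmt"
	"os/exec"
)

func runTask(snapshotDir, patchFile, testPkg string) (bool, error) {
	// Apply the agent's patch to the frozen repo snapshot.
	apply := exec.Command("git", "-C", snapshotDir, "apply", patchFile)
	if out, err := apply.CombinedOutput(); err != nil {
		return false, fmt.Errorf("patch did not apply: %v\n%s", err, out)
	}

	// Run only the task's tests, isolated in a container.
	test := exec.Command("docker", "run", "--rm",
		"-v", snapshotDir+":/repo", "-w", "/repo",
		"golang:1.24", "go", "test", testPkg)
	if out, err := test.CombinedOutput(); err != nil {
		return false, fmt.Errorf("tests failed: %v\n%s", err, out)
	}
	return true, nil
}
```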

Then Stet grades the result beyond pass/fail (a rough sketch of the resulting per-task scorecard follows the list):

  • Equivalence: does the candidate patch accomplish the same behavioral change as the original human patch?
  • Code review: would a reviewer accept the patch, considering correctness, introduced-bug risk, maintainability, and edge cases?
  • Footprint risk: how much additional code did the agent touch when compared with the human patch?
  • Craft/discipline rubrics: attempt to capture non-correctness aspects of the code - basically, would a reviewer want to maintain it? The categories are clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, and diff minimality.
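
Putting those together, I think of each run as producing a per-task scorecard roughly like the struct below. The field names are illustrative, not Stet's real schema.

```go
package harness

// TaskScore is an illustrative per-task, per-effort record; the real Stet
// artifacts use different naming and carry more detail.
type TaskScore struct {
	Task          string  // e.g. "graphql-go-tools#1297"
	Effort        string  // "low", "medium", "high", or "xhigh"
	TestsPass     bool    // the task's tests pass after the patch is applied
	Equivalent    bool    // same behavioral change as the merged human PR
	ReviewPass    bool    // an AI reviewer would accept the patch
	ReviewRubric  float64 // 0-4 review rubric score
	FootprintRisk float64 // extra code touched relative to the human patch
	Craft         float64 // 0-4 maintainability/style aggregate
	Discipline    float64 // 0-4 scope and diff-discipline aggregate
	CostUSD       float64
	DurationSec   float64
}
```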

Each reasoning-effort arm ran once per task with a single seed. The LLM-as-a-judge model was GPT-5.4. Each patch was scored independently - the judge saw the patch and the task, and was blinded to the model/effort that produced the patch. I also manually inspected representative examples as sanity checks. There was no human calibration pass on this task set, so I would trust the direction of the deltas more than any single absolute score.

As an aside, I've also been using these evaluations as an "autoresearch" optimization loop, not just a benchmark. I tell my agent something like "make AGENTS.md better for this repo"; it proposes an edit, runs Stet on historical tasks, figures out where the candidate was better / worse and why, and iterates to improve the evaluation numbers.

Details:

  • Model: GPT-5.5
  • Harness: Codex 0.128.0
  • Dataset: 26 matched real GraphQL-go-tools tasks.
    • Yes, this is small - however, running even this used upwards of half of my weekly 20x quota
  • Main metrics:
    • test pass
    • semantic equivalence
    • code-review pass
    • footprint risk
    • craft/discipline custom graders
    • cost and runtime

Low To Medium: From Heuristics To Domain Modeling

Let's jump into the data!

Low and medium both pass tests on 21/26 tasks. If tests were the only metric, low and medium would look tied.

However, when we look at semantic equivalence, the jump from low to medium is 4/26 to 11/26. Similarly, code-review pass jumps from 3/26 to 5/26, and aggregate craft/discipline scores rise from 2.311 to 2.604.

In this slice, tests alone would have missed most of the reasoning-effort differences.

Coding-agent evals that only measure tests can flatten differences that matter to humans reviewing the patch.

Speaking as a professional software engineer: the code I want AI to merge into my team's codebase doesn't just pass tests. It is also clear, maintainable, at the correct level of abstraction, and consistent with the codebase's standards.

Example: PR #1297 asks the agent to validate nullable external @requires dependencies in GraphQL Federation. If a nullable required field comes back null with an error, dependent downstream fetches should not receive that tainted entity. (A simplified sketch of this rule follows the bullets.)

  • Task: model a subtle federation data-dependency rule, not just add a validation branch.
  • Lower-effort failure mode: low passed tests, but it was non-equivalent and review-failing because it used heuristic required-field/error matching and missed structured nullable @requires metadata.
  • Higher-effort change: medium became equivalent, passed review, tracked tainted objects, filtered downstream fetch inputs, and improved craft/discipline quality from 1.350 to 3.225.
  • Lesson: medium stops guessing and starts representing the actual federation behavior. High and xhigh stayed in the same quality band, so this is mainly a low-to-medium example.
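
For intuition, the behavior the medium patch started to capture looks roughly like the following. The types and field names are invented for illustration; the real graphql-go-tools implementation is considerably more involved.

```go
package federation

// entity is a simplified stand-in for an entity representation that a
// dependent downstream fetch would receive via @requires.
type entity struct {
	Fields map[string]any
	Errors []error
}

// filterTainted drops entities whose nullable @requires dependency came back
// null alongside an error, so downstream fetches never receive tainted input.
func filterTainted(entities []entity, requiredField string) []entity {
	clean := make([]entity, 0, len(entities))
	for _, e := range entities {
		value, present := e.Fields[requiredField]
		tainted := present && value == nil && len(e.Errors) > 0
		if !tainted {
			clean = append(clean, e)
		}
	}
	return clean
}
```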

High Looks Like The Practical Sweet Spot

High vs medium:

Metric                | Medium        | High          | Δ
Tests pass            | 21/26 (80.8%) | 25/26 (96.2%) | +15.4pp
Equivalent            | 11/26 (42.3%) | 18/26 (69.2%) | +26.9pp
Code-review pass      | 5/26 (19.2%)  | 10/26 (38.5%) | +19.2pp
Footprint risk (mean) | 0.268         | 0.314         | +0.046
Craft/discipline avg  | 2.604         | 2.736         | +0.132
Cost/task (mean)      | $3.13         | $4.49         | +$1.35 (1.43x)
Mean duration         | 411.0s        | 579.0s        | +168.0s

High was the cleanest practical upgrade. It improved the test pass rate as well as the semantic and review metrics, while cost rose meaningfully but not absurdly.

High appears to be the point where the extra tokens pay off in real gains: integration details are correct more often.

Let’s look at some examples:

PR #1209 asks the gRPC datasource to honor GraphQL aliases in response JSON, validate referenced protobuf message types up front, and update mapping coverage for union/interface mutation paths. (The core alias rule is sketched after the bullets.)

  • Task: carry alias/response-key semantics through planning, marshaling, and gRPC mapping coverage.
  • Lower-effort failure mode: low and medium both passed tests but stayed non-equivalent and review-failing. Medium handled much of alias serialization and missing-message validation, but missed the createUser mutation mapping update and overloaded JSONPath with response-key semantics.
  • Higher-effort change: high became the first strict pass. It introduced explicit response-key/alias handling, carried aliases through planning and JSON marshaling, and raised custom quality to 3.625.
  • Lesson: high did not just add more code. It got the integration obligation exactly right. Xhigh also passed, but did not improve the task-level read and was much slower in the regenerated summary (790.7s agent duration versus 314.0s for high).
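
The obligation high got right is roughly "the response key is the alias when present, otherwise the field name, everywhere the value is written." A simplified sketch with invented type names, not the actual planner or marshaling code:

```go
package grpcdatasource

// field is a simplified view of a planned GraphQL field.
type field struct {
	Name  string // schema field name, e.g. "user"
	Alias string // optional alias from the query, e.g. "primaryUser"
}

// responseKey is the key the value must appear under in the response JSON:
// the alias if one was given, otherwise the field name.
func (f field) responseKey() string {
	if f.Alias != "" {
		return f.Alias
	}
	return f.Name
}

// marshalObject writes resolved values under their response keys, so aliased
// fields land where the client expects them instead of under the raw name.
func marshalObject(fields []field, resolved map[string]any) map[string]any {
	out := make(map[string]any, len(fields))
	for _, f := range fields {
		out[f.responseKey()] = resolved[f.Name]
	}
	return out
}
```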

PR #1155 is a broad gRPC datasource hardening task: support repeated scalar fields, avoid null/invalid message panics, propagate gRPC status codes, allow disabling the datasource, and support dynamic clients. (A sketch of the status-code piece follows the bullets.)

  • Task: harden several production boundaries across gRPC datasource behavior.
  • Lower-effort failure mode: low and medium were test-green but non-equivalent. Medium improved robustness, but still serialized invalid repeated fields as empty arrays, missed aliased-root planning behavior, and had dynamic-client lifecycle risk.
  • Higher-effort change: high became equivalent and review-passing, with safer nil/invalid handling, status-code propagation, disabled-datasource behavior, and dynamic client-provider coverage.
  • Lesson: this is also a high-vs-xhigh reversal. Xhigh still passed tests, but became non-equivalent and review-failing because disabled datasource semantics and invalid-list behavior were wrong.
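
As one concrete slice of that hardening, status-code propagation mostly means not collapsing gRPC errors into a generic message. A simplified sketch using the standard grpc-go status package; the surrounding error type is invented:

```go
package grpcdatasource

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// graphqlError is a simplified stand-in for an error entry in a GraphQL
// response.
type graphqlError struct {
	Message    string
	Extensions map[string]any
}

// toGraphQLError converts a failed gRPC call into a GraphQL error while
// preserving the gRPC status code instead of flattening it into a generic
// "upstream error" message.
func toGraphQLError(err error) graphqlError {
	st, ok := status.FromError(err)
	if !ok || st == nil {
		st = status.New(codes.Unknown, err.Error())
	}
	return graphqlError{
		Message: st.Message(),
		Extensions: map[string]any{
			"grpcCode": st.Code().String(),
		},
	}
}
```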

Xhigh Is Better Quality, Not Obviously A Better Default

Xhigh vs high:

Metric                | High          | Xhigh         | Δ
Tests pass            | 25/26 (96.2%) | 24/26 (92.3%) | -3.8pp
Equivalent            | 18/26 (69.2%) | 23/26 (88.5%) | +19.2pp
Code-review pass      | 10/26 (38.5%) | 18/26 (69.2%) | +30.8pp
Footprint risk (mean) | 0.314         | 0.365         | +0.051
Craft/discipline avg  | 2.736         | 3.071         | +0.335
Cost/task (mean)      | $4.49         | $9.77         | +$5.29 (2.18x)
Mean duration         | 579.0s        | 753.3s        | +174.3s

Xhigh seems to buy semantic and review quality, but it is not a simple "turn the knob up and everything improves" story. It is expensive, and the test pass rate is not monotonic (24/26 at xhigh versus 25/26 at high).

It seems like xhigh produces code that is more aligned with human intent, covering more bases, and making more complete changes, at the cost of way more tokens. The review-rubric mean/median tells the same story: xhigh scored 3.365 mean / 3.500 median, versus high at 2.817 mean / 2.750 median. The median being above the mean matters: this was not just one or two great xhigh patches dragging up the average.

One caveat: xhigh looked more semantically complete, but it also tended to touch more code relative to the human patch, increasing the footprint risk. That is the interesting tension in this run: xhigh was much more likely to match the human PR semantically, but it was also more willing to expand the patch surface.

I checked whether that extra surface was mostly tests or production logic. Using a simple file-path split across the 26 matched tasks, xhigh added 13,144 lines total: 5,918 implementation lines and 7,226 test, fixture, or expected-output lines. Compared with high, xhigh added 2,631 more lines, and 2,436 of those extra added lines were in test/fixture/expected-output files. So the footprint increase is not just "the model wrote a huge pile of production code." A lot of it is xhigh building more verification and fixture coverage. Still, that is real review surface: someone has to read and maintain those tests, fixtures, and expected-output updates too.
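
The split itself is nothing fancy - essentially a path classifier over the diff, roughly the shape below (the exact patterns here are an approximation of what I used):

```go
package analysis

import "strings"

// isTestSurface reports whether a changed file counts as test, fixture, or
// expected-output surface rather than production code.
func isTestSurface(path string) bool {
	if strings.HasSuffix(path, "_test.go") {
		return true
	}
	for _, marker := range []string{"/testdata/", "/fixtures/", "/expected/"} {
		if strings.Contains(path, marker) {
			return true
		}
	}
	return false
}
```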

Some examples:

PR #1076 restructures subscription handling to avoid shared-mutex race conditions: per-subscription serialized writes, per-subscription heartbeat control, race detector coverage, and corrected WebSocket close semantics. (A sketch of the per-subscription writer shape follows the bullets.)

  • Task: remove concurrency risk from subscription delivery without breaking close/unsubscribe behavior.
  • Lower-effort failure mode: medium passed tests but was non-equivalent and review-failing. High became equivalent and instruction-adherent, but still failed review because the new worker queue could block the global subscription event loop, shutdown could hang behind a stuck worker, hung updates were unbounded, and client-level unsubscribe still skipped internal subscriptions.
  • Higher-effort change: xhigh was the first strict pass and raised custom quality to 3.475.
  • Lesson: this is the best example of xhigh as a quality mode. The extra spend bought review-risk cleanup in a concurrency-heavy task that simpler signals did not fully capture.
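
The concurrency shape being asked for here is roughly one writer goroutine per subscription, fed by a channel, so writes serialize without a shared mutex. A simplified sketch, not the graphql-go-tools implementation, and it deliberately sidesteps the shutdown and backpressure details the reviewer flagged:

```go
package subscription

// writer serializes all writes for one subscription so concurrent resolver
// updates never interleave on the underlying connection, and a slow client
// cannot stall a shared event loop.
type writer struct {
	updates chan []byte
	done    chan struct{}
}

// newWriter starts a dedicated goroutine that drains the queue and calls send
// for each update, in order.
func newWriter(send func([]byte) error, buffer int) *writer {
	w := &writer{
		updates: make(chan []byte, buffer),
		done:    make(chan struct{}),
	}
	go func() {
		defer close(w.done)
		for msg := range w.updates {
			if err := send(msg); err != nil {
				return // a broken connection ends this subscription's writer
			}
		}
	}()
	return w
}

// enqueue hands an update to the per-subscription writer without blocking the
// caller; a full buffer drops the update instead of stalling the event loop.
func (w *writer) enqueue(msg []byte) bool {
	select {
	case w.updates <- msg:
		return true
	default:
		return false
	}
}

// shutdown stops accepting updates and waits for in-flight writes to finish.
func (w *writer) shutdown() {
	close(w.updates)
	<-w.done
}
```

The non-blocking enqueue is the kind of detail the review flagged: a slow subscriber should degrade its own delivery rather than stall everyone else's.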

PR #1308 implements GraphQL @oneOf input objects: add the built-in directive, expose it through introspection, validate operation literals and runtime variables, and improve undefined-variable source locations. (The core validation rule is sketched after the bullets.)

  • Task: implement a cross-cutting GraphQL validation feature across schema, introspection, operation validation, and runtime variables.
  • Lower-effort failure mode: medium and high both passed tests but stayed non-equivalent and review-failing because they missed important @oneOf semantics around runtime variables, nullable variables, provided-null payloads, or introspection shape.
  • Higher-effort change: xhigh was the first strict pass, with robustness 3.7, instruction adherence 4.0, and custom quality 3.525.
  • Lesson: the difference is not superficial polish. Xhigh handled edge-case coverage across several parts of the system.
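
The heart of the @oneOf rule is small, even though wiring it through schema, introspection, literals, and runtime variables is not. A minimal sketch of the value-level check, ignoring source locations and variable plumbing:

```go
package validation

import "fmt"

// validateOneOf checks the core @oneOf rule for an input object value that
// has already been coerced to a map: exactly one field may be provided, and
// its value must not be null.
func validateOneOf(typeName string, value map[string]any) error {
	if len(value) != 1 {
		return fmt.Errorf("oneOf input %q must have exactly one field set, got %d", typeName, len(value))
	}
	for name, v := range value {
		if v == nil {
			return fmt.Errorf("oneOf input %q field %q must not be null", typeName, name)
		}
	}
	return nil
}
```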

PR #1240 asks the agent to consolidate GraphQL AST field-selection merging and inline-fragment selection merging into a single normalization walk.

  • Task: refactor duplicated normalization behavior without changing the executable merge semantics.
  • Lower-effort success: low and high were strict passes.
  • Higher-effort failure mode: xhigh remained equivalent at the semantic-grader level, but review failed because it still preserved prioritized subpasses, changed AbstractFieldNormalizer ordering, and left obsolete field-merge registration behind.
  • Lesson: higher reasoning can produce a more elaborate, plausible refactor while still missing the exact executable behavior the tests and reviewer care about.

Craft And Discipline

The custom graders show the same broad lift as the review rubric. Xhigh's all-custom score was 3.071 mean / 3.087 median, versus high at 2.736 mean / 2.688 median. Craft and discipline were both higher at the median too, which supports the read that xhigh generally improved patch quality rather than only producing a few standout examples.

Metric               | Low mean / median | Medium mean / median | High mean / median | Xhigh mean / median
Craft aggregate      | 2.327 / 2.338     | 2.618 / 2.525        | 2.781 / 2.787      | 3.126 / 3.100
Discipline aggregate | 2.295 / 2.325     | 2.590 / 2.588        | 2.691 / 2.688      | 3.015 / 3.013
All custom graders   | 2.311 / 2.338     | 2.604 / 2.550        | 2.736 / 2.688      | 3.071 / 3.087

Delta                 | Medium minus low | High minus medium | Xhigh minus high | Xhigh minus low
Craft average         | +0.291           | +0.162            | +0.345           | +0.799
Discipline average    | +0.295           | +0.101            | +0.324           | +0.720
All custom graders    | +0.293           | +0.132            | +0.335           | +0.760
Simplicity            | +0.069           | +0.038            | +0.315           | +0.423
Coherence             | +0.500           | +0.181            | +0.423           | +1.104
Intentionality        | +0.147           | +0.004            | +0.065           | +0.216
Robustness            | +0.450           | +0.427            | +0.577           | +1.454
Clarity               | +0.054           | +0.088            | +0.131           | +0.273
Instruction adherence | +0.531           | +0.519            | +0.381           | +1.431
Scope discipline      | +0.354           | -0.058            | +0.381           | +0.677
Diff minimality       | +0.242           | -0.146            | +0.404           | +0.500

From this, we can interpret:

  • Low had weak robustness and instruction adherence.
  • Medium fixed a meaningful amount of that without improving aggregate test pass.
  • High improved practical correctness and robustness.
  • Xhigh improved almost every dimension, including scope and diff discipline.

Cost And Runtime

Reasoning effort | Task cost mean | Task cost median | Agent duration mean | Agent duration median
Low              | $2.65          | $1.91            | 286.9s              | 294.6s
Medium           | $3.13          | $2.87            | 411.0s              | 371.8s
High             | $4.49          | $3.99            | 579.0s              | 572.9s
Xhigh            | $9.77          | $6.39            | 753.3s              | 732.7s

Cost is skewed at low and especially xhigh: xhigh is still more expensive at the median, but its mean is pulled up by a few expensive tasks. Runtime medians sit much closer to their means, so the time story is less skewed than the cost story - and xhigh is still clearly slower at the median.

  • High costs about 1.43x medium per task.
  • Xhigh costs about 2.18x high per task.
  • Xhigh cost is skewed by outliers, but its median task cost is still higher than high.

Limitations

I am not pretending that this is a statistically significant result, or that this result will carry over to your repo. That's ok!

As long as we're aware that this is just one run, at one point in time, on one repo, it's still useful for thinking about how to choose reasoning settings for our own work.

Specific limitations / methodology gaps:

  • Single seed per task.
  • 26 matched real GraphQL-go-tools tasks.
  • The LLM-as-judge was GPT-5.4; the judge saw the patch and the task, but not the model/effort label (so in principle it does not know which arm produced the patch).
  • No grader calibration on this task set.

Prior art

Voratiq's current real-work leaderboard points in the same direction, although the methodology is very different. On their board, GPT-5.5 xhigh is at 1994 vs GPT-5.5 high at 1807, a +187 point / +10.3% rating lift; cost is $4.23 vs $2.52 (+67.9%) and duration is 11.9m vs 7.8m (+52.6%). My Stet slice shows a larger high → xhigh lift on equivalence (+19.2pp, +27.8% relative) and code-review pass (+30.8pp, +80.0% relative), but a very similar lift on the craft/discipline aggregate (+12.2%).

However, Voratiq is a preference/selection-style leaderboard over ongoing work, while this is one 26-task repo slice across multiple reasoning levels, so these aren't directly comparable. But it does make the shape less surprising. Xhigh seems to buy reviewer-preferred / quality outcomes more than it buys a clean test-pass default.

Conclusion

The data supports using xhigh for ambiguous, cross-cutting, concurrency-heavy, or high-review-risk work. The practical recommendation is to use high as the default daily driver; use medium/lower settings where cost matters more and the task is routine or well-scoped.

Reasoning effort clearly matters, but the curve is not smooth or monotonic task-by-task. Aggregate quality generally improved as reasoning increased, while individual tasks still had reversals where high beat xhigh or a higher setting made a plausible but wrong implementation choice.

Specifically:

  • Medium starts modeling repo/domain semantics more reliably than low.
  • High looks like the best practical setting on this dataset.
  • Xhigh looks like a quality mode, not a default.

What I’ll do moving forward: continue using high as my daily driver, and for exploratory / complex work, use xhigh.

However, your results may vary. This is why teams should measure their own harnesses, on their own tasks, rather than copying global benchmark defaults.

Disclosure: I am building Stet.sh, the local eval tool I used to run this. The product version is that you can ask your coding agent to improve its own setup - for example, make AGENTS.md better - and it uses Stet to test candidate changes against historical repo tasks. If your team is already using coding agents heavily and has a concrete decision in front of it - high vs xhigh, Codex vs Claude Code, an AGENTS.md update, or which tasks are safe to delegate - I am looking for a few teams to run repo-specific trials with. Stet runs entirely locally, using your LLM subscriptions. You can also reach out to me directly.

Data is great, but I’m also interested in anecdotal experience. How have people here been finding the behavior of GPT-5.5 at various reasoning efforts? Which one is your default? And if you have changed team defaults based on evidence instead of vibes, I especially want to hear how you measured it.

