STET

GPT-5.5 High Regression Check on GraphQL-go-tools

May 18, 2026 · Updated May 19, 2026

TL;DR

There were reports of a GPT-5.5 regression, so I reran GPT-5.5 high on May 18 and compared it against a prior run from May 5 on the same 21 tasks from the GraphQL-go-tools open-source repo.

On that slice, GPT-5.5 high moved from:

  • 19/21 resolved to 18/21, a -1 test/gate movement.
  • 14/21 equivalent to 12/21, a -2 equivalence movement.
  • 8/21 code-review pass to 7/21, a -1 review-pass movement.

That is a real directional regression on tests, equivalence, and review pass count, but it is not a broad quality collapse. Code-review rubric mean improved, most craft/discipline averages improved, and cost per task was roughly flat. Footprint risk was roughly flat, with a small mean increase.

The strongest negative signal is qualitative: GPT-5.5 high still writes plausible, test-passing patches, but it shows recurring weakness on deep invariants where tests do not fully encode lifecycle, concurrency, or GraphQL validity requirements.

Scorecard

For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.

| Metric | 5/5 GPT-5.5 high | 5/18 GPT-5.5 high | Delta |
| --- | --- | --- | --- |
| Clean comparable tasks | 21 | 21 | flat |
| Tests / Harbor resolved | 19/21 | 18/21 | -1 |
| Equivalent to human PR | 14/21 | 12/21 | -2 |
| Code-review pass | 8/21 | 7/21 | -1 |
| Code-review fail | 11/21 | 12/21 | +1 |
| Code-review unsure | 2/21 | 2/21 | flat |
| Code-review rubric mean | 2.704 | 2.857 | +0.154 |
| Footprint risk (lower is better) | 0.319 | 0.328 | +0.009 |
| Craft avg | 2.745 | 2.814 | +0.069 |
| Discipline avg | 2.651 | 2.818 | +0.167 |
| All custom graders avg | 2.709 | 2.815 | +0.105 |
| Cost per task | $6.45 | $6.54 | +$0.09 |
| Mean agent duration | 558s | 611s | +53s |
| Input tokens | 126.9M | 153.2M | +26.4M |
| Output tokens | 328.3K | 360.4K | +32.1K |
| Cached input tokens | 113.1M | 142.2M | +29.1M |
| Uncached input tokens | 13.9M | 11.1M | -2.8M |

The outcome metrics are worse by one resolved row, two equivalent rows, and one review-pass row. The equivalence movement is a net number: four prior equivalent rows became non-equivalent, while two prior non-equivalent rows became equivalent. Rubric means improve; cost is slightly higher.
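The net equivalence figure can be reproduced from the row-level flips named in the case studies in this post. This is a minimal Python sketch; the variable names are illustrative and are not part of any tooling used for the run.

```python
# Row-level equivalence flips between the May 5 and May 18 runs,
# as named in the case studies. Variable names are illustrative.
lost_equivalence = ["pr-1076", "pr-1240", "pr-1262", "pr-1351"]
gained_equivalence = ["pr-1128", "pr-1260"]

prior_equivalent = 14  # May 5 run, out of 21 tasks

# Net movement is gains minus losses, not the count of rows that changed.
net_change = len(gained_equivalence) - len(lost_equivalence)
new_equivalent = prior_equivalent + net_change

print(f"net change: {net_change:+d}")           # net change: -2
print(f"equivalent now: {new_equivalent}/21")   # equivalent now: 12/21
```

Six rows changed equivalence status in total, so the -2 headline understates how much row-level churn there was between the two runs.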

That mix is why I'd describe this as a targeted reliability concern and not a blanket model-quality regression. I'll look at specific examples below.

Figure: regression scorecard (n = 21 clean tasks; judge: gpt-5.4), comparing the May 5 prior GPT-5.5 high run with the May 18 fresh GPT-5.5 high run across three panels: outcomes, quality (0-4 unless noted), and cost and effort (lower is better).

Fresh GPT-5.5 high lost one resolved task, two equivalent patches, and one review-pass row. Rubric means still moved up, so the signal reads as targeted reliability risk rather than a broad collapse.

Grader Detail

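| Grader | Prior GPT-5.5 high | New GPT-5.5 high | Delta |
| --- | --- | --- | --- |
| Simplicity | 2.852 | 2.823 | -0.029 |
| Coherence | 2.543 | 2.581 | +0.039 |
| Intentionality | 3.081 | 3.153 | +0.072 |
| Robustness | 2.447 | 2.557 | +0.110 |
| Clarity | 2.800 | 2.800 | flat |
| Instruction adherence | 2.728 | 2.671 | -0.057 |
| Scope discipline | 2.610 | 2.976 | +0.366 |
| Diff minimality | 2.615 | 2.700 | +0.085 |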

Most grader means move up, especially scope discipline and robustness. Simplicity and instruction adherence move down - the new run often looked cleaner and more disciplined, but it sometimes missed specific obligations or semantic boundaries.

Most small deltas are within expected variance. I wouldn't read too much into them individually.

Tool Behavior

The new run used more agent interaction overall: more total tool calls, shell calls, and patch calls, but fewer detected test-command invocations.

| Metric | Prior GPT-5.5 high | New GPT-5.5 high | Delta |
| --- | --- | --- | --- |
| Captured task traces | 21/21 | 21/21 | flat |
| Mean agent steps | 100.1 | 107.2 | +7.1 |
| Mean tool calls | 78.4 | 87.0 | +8.5 |
| Mean shell calls | 65.4 | 74.1 | +8.7 |
| Mean patch calls | 13.1 | 13.8 | +0.6 |
| Mean test-command calls | 9.7 | 9.0 | -0.6 |
| Max tool calls on one task | 125 | 197 | +72 |
| Max patch calls on one task | 33 | 50 | +17 |
| Total tool calls | 1,647 | 1,826 | +179 |
| Total shell calls | 1,374 | 1,557 | +183 |
| Total patch calls | 276 | 289 | +13 |
| Total test-command calls | 203 | 190 | -13 |

The new run shows more shell and tool activity overall, concentrated on a few high-effort rows, and slightly more patch calls once the contaminated row is removed. That lines up with the broader read: the new run often looked more disciplined, while still missing some specific semantic obligations the old run caught.
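The per-task means in the tool table follow from the totals divided by the 21 tasks. The sketch below recomputes them in Python; the helper name `per_task_means` is illustrative. It also shows why a delta can differ by 0.1 from subtracting the rounded means (87.0 - 78.4 would give 8.6, not 8.5): the numbers appear consistent with deltas being taken before rounding, though that is an inference from the table rather than something the tooling documents here.

```python
N_TASKS = 21

# (prior total, new total) call counts from the tool-behavior table.
totals = {
    "tool": (1647, 1826),
    "shell": (1374, 1557),
    "patch": (276, 289),
    "test-command": (203, 190),
}

def per_task_means(prior_total, new_total, n=N_TASKS):
    """Return (prior mean, new mean, delta), with the delta taken before rounding."""
    prior_mean = prior_total / n
    new_mean = new_total / n
    return prior_mean, new_mean, new_mean - prior_mean

for name, (prior, new) in totals.items():
    prior_mean, new_mean, delta = per_task_means(prior, new)
    print(f"{name:>12}: {prior_mean:.1f} -> {new_mean:.1f} ({delta:+.1f})")
```

Rounded to one decimal, each row reproduces the means and deltas reported in the table.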

Case Studies

Real Negative Semantic Flip: pr-1076

This is the clearest regression row. The task was to rework GraphQL subscription concurrency so response writes are serialized per subscription, heartbeat writes move out of shared global state, WebSocket close handling is safe, and race-detector coverage becomes the default CI path.

Prior GPT-5.5 high passed equivalence but still failed code review. The reviewer worried the worker queue could block the global subscription event loop and that client-level unsubscribe still skipped internal subscriptions.

The new run passed tests but became non-equivalent. The grader called out concrete missed obligations: race detector was not made the default path across all relevant modules, old event/update semaphores remained in the resolver, heartbeat interval clamping was computed but not used for the ticker, and synchronous subscription completion could be queued after the executor-draining loop stopped.

The new run built a plausible concurrency protocol, but it missed explicit CI/defaulting work and kept enough old synchronization machinery that the lifecycle story became fragile.

Broader Equivalence Losses: pr-1240, pr-1262, pr-1351

The common pattern across these three is that GPT-5.5 high still wrote reasonable code. The new run had enough plausible implementation surface to pass or partially pass, while missing narrower equivalence details the human PR handled.

Positive Counterexamples: pr-1128, pr-1260, pr-1268

The run is not uniformly worse - it trades off row-level outcomes in both directions. pr-1128 and pr-1260 moved from non-equivalent to equivalent. pr-1268 stayed equivalent and improved from review-unsure to review-pass.

Equivalent but Review-Uncertain: pr-859

This is a useful warning row. The task was to replace hot planner slice scans with map-backed lookup paths.

The prior run passed tests, equivalence, and review. The new run still passed tests and equivalence, but review moved to unsure: the grader saw the same broad performance intent, while calling out that the patch expanded beyond focused hot-path indexes into broader secondary-run pruning and cache state.

The new run appears behaviorally equivalent, but the implementation changes planner control flow beyond the minimal indexed-lookup fix, so review confidence drops.

Conclusion

On this small n=21 slice, GPT-5.5 high is directionally worse on tests, equivalence, and review pass count: -1 resolved, -2 equivalent, and -1 review pass. Review rubric improves, most custom-grader means improve, and cost per task is only slightly higher.

I'd trust the direction of the deltas more than any single absolute score. My read is that there may be evidence of a targeted reliability concern - especially around deep concurrency/lifecycle and GraphQL semantic-invariant tasks - but it is not evidence of a broad GPT-5.5 high quality collapse. And given the conflicting results from the custom graders, the model may simply be behaving differently, not necessarily worse.

I am not pretending this is a statistically significant result, or that it will carry over to your repo. That is ok! Even so, these results suggest that claims of a broad regression are likely overstated - we would need to look at specific use cases and examples to confirm or deny that.
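One rough way to put a number on "not statistically significant" is an exact sign test on the equivalence flips (four losses, two gains). It assumes each task is an independent paired comparison and that, under no real change, a flip is equally likely to go either way. This is a sketch for illustration, not part of the eval pipeline, and the function name is made up.

```python
from math import comb

def sign_test_two_sided(losses: int, gains: int) -> float:
    """Exact two-sided sign test on discordant pairs (fair-coin null)."""
    n = losses + gains
    k = min(losses, gains)
    # Probability of a split at least this lopsided in one direction...
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    # ...doubled for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)

# Equivalence flips between the May 5 and May 18 runs.
p_value = sign_test_two_sided(losses=4, gains=2)
print(f"p = {p_value:.4f}")  # p = 0.6875
```

With only six rows changing status, a 4-to-2 split is well within what chance alone would produce under these assumptions, which is why the direction is worth noting but the magnitude is not.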


Disclosure: I am building Stet.sh, the local eval tool I used to run this. The product version is that you can ask your coding agent to improve its own setup - for example, make AGENTS.md better - and it uses Stet to test candidate changes against historical repo tasks. If your team is already using coding agents heavily and has a concrete decision in front of you - high vs xhigh, Codex vs Claude Code, an AGENTS.md update, or which tasks are safe to delegate - I am looking for a few teams to run repo-specific trials with. Stet runs entirely locally, using your LLM subscriptions. Join the waitlist at stet.sh/private or reach out to me directly.

Source-arm comparison between the May 18 fresh GPT-5.5 high rerun and the May 5 GPT-5.5 high context on 21 clean comparable GraphQL-go-tools tasks, after excluding contaminated tasks including pr-828 for reference-patch exposure.

Methodology