GPT-5.5 High Regression Check on GraphQL-go-tools

May 18, 2026 · Updated May 19, 2026

TL;DR

There were reports of a GPT-5.5 regression, so I compared GPT-5.5 high against a prior run from May 5 on 21 tasks from the GraphQL-go-tools open-source repo.

On that slice, GPT-5.5 high moved from:

19/21 resolved to 18/21, a -1 test/gate movement.
14/21 equivalent to 12/21, a -2 equivalence movement.
8/21 code-review pass to 7/21, a -1 review-pass movement.

That is a real directional regression on tests, equivalence, and review pass count, but it is not a broad quality collapse. Code-review rubric mean improved, most craft/discipline averages improved, and cost per task was roughly flat. Footprint risk was roughly flat, with a small mean increase.

The strongest negative signal is qualitative: GPT-5.5 high still writes plausible, test-passing patches, but it shows recurring weakness on deep invariants where tests do not fully encode lifecycle, concurrency, or GraphQL validity requirements.

Scorecard

For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.

Metric	May 5 GPT-5.5 high	May 18 GPT-5.5 high	Delta
Clean comparable tasks	21	21	flat
Tests / Harbor resolved	19/21	18/21	-1
Equivalent to human PR	14/21	12/21	-2
Code-review pass	8/21	7/21	-1
Code-review fail	11/21	12/21	+1
Code-review unsure	2/21	2/21	flat
Code-review rubric mean	2.704	2.857	+0.154
Footprint risk, lower better	0.319	0.328	+0.009
Craft avg	2.745	2.814	+0.069
Discipline avg	2.651	2.818	+0.167
All custom graders avg	2.709	2.815	+0.105
Cost per task	$6.45	$6.54	+$0.09
Mean agent duration	558s	611s	+53s
Input tokens	126.9M	153.2M	+26.4M
Output tokens	328.3K	360.4K	+32.1K
Cached input tokens	113.1M	142.2M	+29.1M
Uncached input tokens	13.9M	11.1M	-2.8M
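The token rows can be sanity-checked with a few lines of arithmetic. This is a minimal sketch using only the numbers in the table; `cached_share` is a helper name introduced here for illustration. One possible reading, which is an assumption and not something the table states, is that the higher cached share helps explain why cost per task stayed roughly flat even though total input tokens rose.

```python
# Token totals in millions, copied from the scorecard table above.
runs = {
    "May 5": {"input": 126.9, "cached": 113.1, "uncached": 13.9},
    "May 18": {"input": 153.2, "cached": 142.2, "uncached": 11.1},
}


def cached_share(run):
    """Fraction of input tokens that were served from cache."""
    return run["cached"] / run["input"]


for label, run in runs.items():
    # cached + uncached should equal input, up to rounding in the table
    gap = abs(run["cached"] + run["uncached"] - run["input"])
    print(f"{label}: cached share {cached_share(run):.1%}, rounding gap {gap:.1f}M")
```

On these figures the cached share moves from about 89% to about 93%, while uncached input falls by 2.8M tokens.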

The outcome metrics are worse by one resolved row, two equivalent rows, and one review-pass row. The equivalence movement is a net number: four prior equivalent rows became non-equivalent, while two prior non-equivalent rows became equivalent. Rubric means improve; cost is slightly higher.

That mix is why I'd describe this as a targeted reliability concern and not a blanket model-quality regression. I'll look at specific examples below.
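The net equivalence number can be reproduced from row-level flips. The sketch below uses the row ids named in the case studies later in this post; `net_movement` is a hypothetical helper, not part of any tool mentioned here.

```python
# Equivalence flips between the May 5 and May 18 runs, by row id.
lost = ["pr-1076", "pr-1240", "pr-1262", "pr-1351"]  # equivalent -> non-equivalent
gained = ["pr-1128", "pr-1260"]                      # non-equivalent -> equivalent


def net_movement(prior_count, lost_rows, gained_rows):
    """New count after subtracting lost rows and adding gained rows."""
    return prior_count - len(lost_rows) + len(gained_rows)


prior_equivalent = 14
new_equivalent = net_movement(prior_equivalent, lost, gained)
print(new_equivalent, new_equivalent - prior_equivalent)  # 12 -2
```

Six of the 21 rows changed equivalence status, so the -2 headline hides more row-level churn than it suggests.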

Figure: regression scorecard, n = 21 clean tasks, judge: gpt-5.4. Panels compare May 5 (prior GPT-5.5 high) with May 18 (fresh GPT-5.5 high) on outcomes, quality (0-4 unless noted), and cost and effort (lower is better).

Fresh GPT-5.5 high lost one resolved task, two equivalent patches, and one review-pass row. Rubric means still moved up, so the signal reads as targeted reliability risk rather than a broad collapse.

Grader Detail

Grader	Prior GPT-5.5 high	New GPT-5.5 high	Delta
Simplicity	2.852	2.823	-0.029
Coherence	2.543	2.581	+0.039
Intentionality	3.081	3.153	+0.072
Robustness	2.447	2.557	+0.110
Clarity	2.800	2.800	flat
Instruction adherence	2.728	2.671	-0.057
Scope discipline	2.610	2.976	+0.366
Diff minimality	2.615	2.700	+0.085

Most grader means move up, especially scope discipline and robustness. Simplicity and instruction adherence move down: the new run often looked cleaner and more disciplined, but it sometimes missed specific obligations or semantic boundaries.

Most small deltas are within expected variance. I wouldn't read too much into them individually.

Tool Behavior

The new run used more agent interaction overall: more total tool calls, shell calls, and patch calls, but fewer detected test-command invocations.

Metric	Prior GPT-5.5 high	New GPT-5.5 high	Delta
Captured task traces	21/21	21/21	flat
Mean agent steps	100.1	107.2	+7.1
Mean tool calls	78.4	87.0	+8.5
Mean shell calls	65.4	74.1	+8.7
Mean patch calls	13.1	13.8	+0.6
Mean test-command calls	9.7	9.0	-0.6
Max tool calls on one task	125	197	+72
Max patch calls on one task	33	50	+17
Total tool calls	1,647	1,826	+179
Total shell calls	1,374	1,557	+183
Total patch calls	276	289	+13
Total test-command calls	203	190	-13

The new run shows more shell and tool activity overall, especially on a few high-effort rows, and slightly more patch calls after removing the contaminated row. That lines up with the broader read: the new run often looked more disciplined, while still missing some specific semantic obligations the old run caught.
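The per-task means in the table follow from the totals divided by the 21 tasks. A short check, using only numbers from the table (`per_task` is a name introduced here):

```python
TASKS = 21

# (prior total, new total) for each call type, from the table above.
totals = {
    "tool": (1647, 1826),
    "shell": (1374, 1557),
    "patch": (276, 289),
    "test-command": (203, 190),
}


def per_task(total, tasks=TASKS):
    """Mean calls per task."""
    return total / tasks


for name, (prior, new) in totals.items():
    delta = per_task(new) - per_task(prior)
    print(f"{name}: {per_task(prior):.1f} -> {per_task(new):.1f} ({delta:+.1f})")
```

The deltas are computed before rounding, which is why 78.4 to 87.0 shows as +8.5 and not +8.6.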

Case Studies

Real Negative Semantic Flip: `pr-1076`

This is the clearest regression row. The task was to rework GraphQL subscription concurrency so response writes are serialized per subscription, heartbeat writes move out of shared global state, WebSocket close handling is safe, and race-detector coverage becomes the default CI path.

Prior GPT-5.5 high passed equivalence but still failed code review. The reviewer worried the worker queue could block the global subscription event loop and that client-level unsubscribe still skipped internal subscriptions.

The new run passed tests but became non-equivalent. The grader called out concrete missed obligations: race detector was not made the default path across all relevant modules, old event/update semaphores remained in the resolver, heartbeat interval clamping was computed but not used for the ticker, and synchronous subscription completion could be queued after the executor-draining loop stopped.

The new run built a plausible concurrency protocol, but it missed explicit CI/defaulting work and kept enough old synchronization machinery that the lifecycle story became fragile.
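The heartbeat finding is a common bug shape: a value is validated or clamped, and then the unclamped original is what actually gets used. The sketch below shows that shape only. It is not the code from either run or from the merged PR, the repo itself is Go, and every name and the one-second floor are hypothetical.

```python
MIN_HEARTBEAT_SECONDS = 1.0  # hypothetical floor, chosen for illustration


def clamp_interval(requested, floor=MIN_HEARTBEAT_SECONDS):
    """Raise too-small heartbeat intervals to a safe minimum."""
    return max(requested, floor)


def ticker_interval_buggy(requested):
    clamped = clamp_interval(requested)  # computed...
    return requested                     # ...but the raw value reaches the ticker


def ticker_interval_fixed(requested):
    # The clamped value is the one the ticker is built from.
    return clamp_interval(requested)


print(ticker_interval_buggy(0.0), ticker_interval_fixed(0.0))  # 0.0 1.0
```

A test suite that only uses sensible intervals passes either version, which is how a gap like this can get through a test gate and still fail an equivalence review.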

Broader Equivalence Losses: `pr-1240`, `pr-1262`, `pr-1351`

The common pattern across these three is that GPT-5.5 high still wrote reasonable code. The new run had enough plausible implementation surface to pass or partially pass, while missing narrower equivalence details the human PR handled.

Positive Counterexamples: `pr-1128`, `pr-1260`, `pr-1268`

The run is not uniformly worse; row-level outcomes moved in both directions. pr-1128 and pr-1260 moved from non-equivalent to equivalent. pr-1268 stayed equivalent and improved from review-unsure to review-pass.

Equivalent but Review-Uncertain: `pr-859`

This is a useful warning row. The task was to replace hot planner slice scans with map-backed lookup paths.

The prior run passed tests, equivalence, and review. The new run still passed tests and equivalence, but review moved to unsure: the grader saw the same broad performance intent, while calling out that the patch expanded beyond focused hot-path indexes into broader secondary-run pruning and cache state.

The new run appears behaviorally equivalent, but the implementation changes planner control flow beyond the minimal indexed-lookup fix, so review confidence drops.
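For readers unfamiliar with the underlying technique, the focused version of this change looks roughly like the sketch below: build an index once, then replace each linear scan with a keyed lookup. The record shape and function names are hypothetical, and the sketch is in Python for brevity although the planner itself is Go.

```python
from collections import defaultdict

# Hypothetical planner records: (field_name, node_id)
nodes = [("user", 1), ("posts", 2), ("user", 3), ("comments", 4)]


def find_by_scan(records, field):
    """Linear scan: cost grows with the number of records on every lookup."""
    return [node_id for name, node_id in records if name == field]


def build_index(records):
    """One pass over the records to build a field -> node ids map."""
    index = defaultdict(list)
    for name, node_id in records:
        index[name].append(node_id)
    return index


def find_by_index(index, field):
    """Average constant-time lookup; .get avoids inserting missing keys."""
    return index.get(field, [])


index = build_index(nodes)
print(find_by_scan(nodes, "user") == find_by_index(index, "user"))  # True
```

The two lookups return the same results, so a change limited to this swap is easy to review. Reviewer confidence drops once the patch also alters which lookups happen and when, as the grader noted for the new run.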

Conclusion

On this small n=21 slice, GPT-5.5 high is directionally worse on tests, equivalence, and review pass count: -1 resolved, -2 equivalent, and -1 review pass. Review rubric improves, most custom-grader means improve, and cost per task is only slightly higher.

I'd trust the direction of the deltas more than any single absolute score. My read is that there may be evidence of a targeted reliability concern, especially around deep concurrency/lifecycle and GraphQL semantic-invariant tasks, but it is not evidence of a broad GPT-5.5 high quality collapse. And given the conflicting results from the custom graders, the model may simply be behaving differently, not necessarily worse.

I am not pretending this is a statistically significant result, or that it will carry over to your repo. That is ok! Even so, these results suggest that claims of a broad regression are likely overstated; we would need to look at specific use cases and examples to confirm or refute that.

Disclosure: I am building Stet.sh, the local eval tool I used to run this. The product version is that you can ask your coding agent to improve its own setup - for example, make AGENTS.md better - and it uses Stet to test candidate changes against historical repo tasks. Stet runs entirely locally, using your LLM subscriptions. You can now install it and run a first local trial from stet.sh/private, or reach out to me directly for paid pilots and CI setup.

FAQ

Did the May 18 GPT-5.5 high rerun show a broad coding-agent regression?

Not from this slice alone. On 21 clean GraphQL-go-tools tasks, the fresh run was worse on tests, equivalence, and review pass count, but review rubric mean and most custom-grader means improved. The evidence may point to a targeted reliability concern, but not to a broad quality collapse.

What changed between the old and fresh GPT-5.5 high runs?

The fresh run moved from 19/21 resolved to 18/21, from 14/21 equivalent to 12/21, and from 8/21 code-review pass to 7/21. The clearest negative row was pr-1076, where tests passed but equivalence regressed on concurrency and lifecycle obligations.

Is this a statistically significant benchmark?

No. It is a small source-arm comparison on 21 clean tasks from one open-source repo. It is useful as a concrete datapoint and failure analysis, not as proof of a global GPT-5.5 regression.
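One way to put a rough number on this is an exact sign test on the paired equivalence flips, four rows lost against two gained. This test is added here as an illustration and was not part of the original analysis; `sign_test_p_value` is a helper written for this sketch.

```python
from math import comb


def sign_test_p_value(a, b):
    """Two-sided exact binomial p-value for discordant pair counts a and b,
    under the null hypothesis that a flip is equally likely in either direction."""
    n = a + b
    k = min(a, b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 4 rows went equivalent -> non-equivalent, 2 went the other way.
p = sign_test_p_value(4, 2)
print(round(p, 4))  # 0.6875
```

A p-value near 0.69 means a 4-versus-2 split is entirely consistent with run-to-run noise, which matches the answer above: the data is useful for failure analysis, not for a significance claim.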