GLM 5.2 on 50 real Go and Rust PRs: last on quality, and not the cheapest
GLM 5.2 has been getting a lot of hype as a "frontier killer". There are two ways to evaluate that claim: how cheap is it, and is it good enough for most work?
On tasks from two open source repos, it is neither the cheapest option nor good enough. It costs roughly twice as much as Composer 2.5 and scores lower on both quality dimensions that matter.
Route GLM to first-draft work supervised by a human or a stronger model. Don't run it unattended against production-grade repositories.
The setup: fifty real merged pull requests across two repos — graphql-go-tools in Go and sqlparser-rs in Rust — replayed against frozen snapshots so nothing leaks, one attempt per task. Every patch is graded beyond pass/fail: a craft score (0–4, code quality and idiom) and equivalence (0–1, how closely the patch reproduces the merged human PR's actual behavior). Stet, the local eval harness I build, runs and grades them; a blinded gpt-5.4 judge scores independently of the runner. GLM ran at medium reasoning. Does GLM belong at this table, and if not, where does it actually belong?
n=50 slice: graphql-go-tools (Go, n=25) plus sqlparser-rs (Rust, n=25), blinded GPT-5.4 judge. GLM 5.2 (run at medium reasoning) lands last in the field on craft and equivalence in both repos — cheaper than the premium arms, but pricier than Composer and last on quality. Below: where it stands, how it behaves, and what it costs.
calibrated standing · GLM 5.2 medium
Last on craft and equivalence, in both repos
craft = 8-grader mean (0–4); equivalence = how closely the patch reproduces the merged human PR's behavior (0–1). GLM is decision-grade behind the whole premium field on both, in both repos — a gap big enough to survive the statistics at this sample size. Against the budget arm Composer 2.5 it is a noise-band peer (too close to call) on Go and decision-grade behind on Rust equivalence — while costing about twice as much on Rust.
graphql-go-tools (Go), n=25
| Arm | craft | equiv | $/task |
|---|---|---|---|
| Opus 4.8 high | 2.90 | 0.73 | $3.98 |
| GPT-5.5 high | 2.72 | 0.73 | $4.69 |
| Opus 4.7 xhigh | 2.63 | 0.68 | $5.93 |
| Composer 2.5 | 2.48 | 0.60 | $0.71 |
| GLM 5.2 medium | 2.38 | 0.47 | $1.40 |
sqlparser-rs (Rust), n=25
| Arm | craft | equiv | $/task |
|---|---|---|---|
| Opus 4.8 high | 3.28 | 0.98 | $3.02 |
| Opus 4.7 xhigh | 2.98 | 0.97 | $3.55 |
| GPT-5.5 high | 2.94 | 0.96 | $3.41 |
| Composer 2.5 | 2.84 | 0.95 | $0.53 |
| GLM 5.2 medium | 2.69 | 0.78 | $1.04 |
GLM ran at medium reasoning. Composer's Go cost is recovered from raw Cursor logs (directional); all other figures are from the calibrated per-repo panels.
how GLM works
Normal-sized patches, last on quality
Median agent patch by model (additions right, deletions left). Every arm writes more than the human PR here (Go +111 / −47, Rust +110 / −17), and GLM sits mid-field — it adds the most on Rust, is unremarkable on Go, and deletes little, like the Opus arms; GPT-5.5 and Composer churn more. So GLM's gap isn't a patch-size problem — it writes a normal-looking diff and still lands last on equivalence and craft.
[Figure: median patch size by model, one panel each for graphql-go-tools (Go) and sqlparser-rs (Rust); deleted lines extend left, added lines extend right.]
tokens, turns & patches by model
| Model | input/task | output/task | turns | patches |
|---|---|---|---|---|
| GLM 5.2 | 3.2M | 22k | 122 | 50/50 |
| Composer 2.5 | 2.1M | 16k | — | 50/50 |
| GPT-5.5 | 4.8M | 15k | 94 | 50/50 |
| Opus 4.8 | 3.0M | 32k | 113 | 50/50 |
| Opus 4.7 | 5.6M | 31k | 100 | 46/50 |
Medians per task; input/output tokens and patch counts pooled across both repos. Input is context (including cache reads); output is generated tokens. Turns are the sqlparser-rs median, where capture is complete across the claude-code and codex arms — GLM runs the most (122 vs Opus 4.8's 113, GPT-5.5's 94); Composer ran on Cursor, which batches its work under a few assistant turns, so its turn count isn't comparable. Patches = tasks with a non-empty diff (Opus 4.7's four misses are routing no-patches). GLM also grinds the longest by wall-clock — a ~16-minute median, the slowest arm; its worst Go task burned 14.1M tokens and $4.07 in a single 326-turn loop, almost all of it re-reading the same files.
glm 5.2 comparison
cost vs local score
Compare weighted Stet quality against rollout spend.
- GLM 5.2 medium: repo-balanced local score 64.6, cost per task $1.22
- Composer 2.5: repo-balanced local score 71.6, cost per task $0.62
- GPT-5.5 high: repo-balanced local score 75.6, cost per task $4.05
- Opus 4.8 high: repo-balanced local score 80.2, cost per task $3.50
- Opus 4.7 xhigh: repo-balanced local score 75.9, cost per task $4.74
Lower cost is better.
Higher local score is better.
Repo-mean pooled, both repos. GLM lands last on the blended local score (64.6), below Composer (71.6). It is cheaper than the premium arms but ~2x Composer in both repos, and it is the slowest arm by wall-clock (~16 min/task). The chart plots cost, time, or tokens (X axis) against local score, craft, code review, equivalence, or tests (Y axis). GLM ran at medium reasoning.
local score formula
A single repo-balanced display score: 5% tests + 30% equivalence + 25% code review + 25% craft + 15% footprint. GLM 5.2 lands last at 64.6, but only 7 points behind Composer, because footprint is the one component where GLM doesn't trail, and it props the blended number up. The calibrated per-axis read above is harsher. GLM ran at medium reasoning.
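To make the blend concrete, here is a minimal sketch of the weighted sum. The weights are the ones listed above. Everything else is an assumption for illustration: the function name `local_score`, the idea that each component arrives pre-normalized to a 0-100 scale, and the component values themselves, which are made up and are not the measured inputs for any arm.

```python
# Weights from the local score formula above; they sum to 1.0.
WEIGHTS = {
    "tests": 0.05,
    "equivalence": 0.30,
    "code_review": 0.25,
    "craft": 0.25,
    "footprint": 0.15,
}


def local_score(components, weights=WEIGHTS):
    """Weighted blend of component scores, each assumed to be on a 0-100 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical component values, NOT measured results: a model that is weak on
# equivalence and review but produces a compact footprint.
weak_but_compact = {
    "tests": 80, "equivalence": 50, "code_review": 40, "craft": 60, "footprint": 100,
}
# The same model with a middling footprint score instead.
same_with_average_footprint = dict(weak_but_compact, footprint=50)

print(round(local_score(weak_but_compact), 1))
print(round(local_score(same_with_average_footprint), 1))
```

With these made-up inputs the footprint component alone moves the blend by 7.5 points (15% of a 50-point difference). That is how a strong footprint can prop up a composite while the equivalence and review components stay weak.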
same task, two models
Where Composer shipped it, GLM shipped a partial
Two sqlparser-rs tasks Composer solved cleanly. GLM passed the same tests and came back non-equivalent — tests green, behavior incomplete. The same plausible-partial shape shows up across the slice.
| Task | Change | Composer 2.5 | GLM 5.2 |
|---|---|---|---|
| sqlparser-rs #1472 | Hive ! negation vs PostgreSQL ! factorial. Disambiguate the bang operator by dialect. | Equivalent patch, cleared review, craft 3.68. | Passed the same tests, non-equivalent: it enabled both bang forms for GenericDialect. |
| sqlparser-rs #1493 | JSON_TABLE FOR ORDINALITY with NESTED PATH. Add the ordinality column syntax. | Equivalent and review-clean, craft 3.53. | Passed tests, non-equivalent: “misses the core real-world ordinality syntax: FOR ORDINALITY requires a data type.” |
green tests, failed review
| Task | Cost | Change | GLM 5.2 result |
|---|---|---|---|
| graphql-go-tools #859 | $0.36 | planner-path optimization | Tests passed, non-equivalent: “plausible partial… misses the datasource metadata membership side.” |
| sqlparser-rs #1398 | $0.73 | interval qualifiers | Tests passed, non-equivalent: “fails to enforce required qualifiers; regression coverage removed without visible replacement.” |
It passes tests – so does everyone
GLM passed 38 of 50 tasks. That sounds usable. Composer 2.5 passed 44/50. GPT-5.5 also 44/50. Opus 4.8 hit 47/50.
The test gate has saturated as a discriminator: all four models pass between 76% and 94% of tasks. A model can write a structurally plausible patch, leave the core behavior incomplete, and still go green because, on this dataset, the test suites rarely cover every edge the human PR was fixing. GLM does exactly that, repeatedly, as the task contrasts below show.
Routing decisions built on pass/fail won't separate GLM from Opus 4.8. The axes that diverge are equivalence and craft.
Last on quality, in both repos
GLM trails the entire field on craft and equivalence in both repos. Against the premium arms that gap is decision-grade — large enough to survive the statistics at n=50. Against Composer the picture splits by repo. On Go the craft and equivalence gaps are −0.10 and −0.14: noise-band, too close to call. On Rust craft stays close but equivalence falls 0.17 short, which is decision-grade. On the Rust repo every other arm clears 0.95 equivalence; GLM sits at 0.78, the clear outlier.
GLM is last on quality. Composer is also cheap, and it's better.
How GLM actually works
The quality gap has a mechanism:
Turns and tokens
GLM grinds. It ran a median of roughly 135 agent turns per task on Go and 122 on Rust. Opus 4.8 ran a median of ~113 turns on Rust, GPT-5.5 ~94. GLM takes more turns than the premium models and produces worse output; its cost edge comes from cheap per-token pricing, not from efficiency. Median token spend ran ~3.9M per Go task and ~2.5M on Rust.
It writes more than the human, then misses the change
The patch shape is the tell. GLM writes roughly 1.8x the human's churn. On Go the median GLM patch added 222 lines and deleted 16 across 4 files; the human PR was +111 / −47. On Rust, GLM +284 / −12 across 6 files against the human's +110 / −17. GLM bolts new code alongside the existing path and rarely edits or deletes what's there. The premium arms write add-heavy patches too, though, so diff shape alone isn't the gap — the difference is in the content. The same-task contrasts below show GLM adding around the change instead of making it.
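The churn arithmetic can be sketched from the medians quoted above. The article does not say how the 1.8x summary was pooled, so this sketch only computes per-repo ratios and deletion shares from those medians and is not expected to reproduce that figure exactly. The helper names `churn`, `churn_ratio`, and `deletion_share` are hypothetical.

```python
# Median (added, deleted) line counts quoted in the paragraph above.
MEDIAN_PATCHES = {
    "Go": {"glm": (222, 16), "human": (111, 47)},
    "Rust": {"glm": (284, 12), "human": (110, 17)},
}


def churn(added, deleted):
    """Total lines touched: additions plus deletions."""
    return added + deleted


def churn_ratio(glm, human):
    """How many times the human's churn the GLM patch touches."""
    return churn(*glm) / churn(*human)


def deletion_share(added, deleted):
    """Fraction of the churn that is deletions."""
    return deleted / churn(added, deleted)


for repo, patches in MEDIAN_PATCHES.items():
    ratio = churn_ratio(patches["glm"], patches["human"])
    glm_del = deletion_share(*patches["glm"])
    human_del = deletion_share(*patches["human"])
    print(f"{repo}: churn ratio {ratio:.2f}x, "
          f"deletions {glm_del:.1%} of GLM churn vs {human_del:.1%} of human churn")
```

The deletion share is the more telling number here: on these medians GLM's deletions are a small fraction of its churn in both repos, while the human PRs delete proportionally more, which matches the bolt-on pattern the paragraph describes.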
Wildly variable
The variance is wide, and the diffs show why. Go #1308 (expose @oneOf input objects through introspection and validation) is the tail-risk case: GLM found the right surfaces — the @oneOf directive, the isOneOf introspection field, the validation rule — and the patch mostly works. But it ran its new check ahead of the existing one, so an undefined variable used in a one-of selection comes back with the wrong error, and it ground 326 turns and 14.1M tokens — only ~34K of them output, the rest re-reading the same files — to get there, at $4.07. Go #1034 (canonicalize variable names while preserving the originals for validation) is the over-write case: the human added a small variables_mapper layer; GLM wrote +522 / −6 — twice the human — pulling in a third-party JSON parser and re-serializing the variables byte-by-byte, then dropped the original-variable path anyway. More code, same miss, craft 1.94.
It can also land cleanly. On Rust #2174 (a derive_dialect! macro for custom dialects, +734 / −285 in the human's PR) GLM shipped an equivalent patch in roughly half the churn: it built the derive path but hand-wrote the dialect method table instead of generating it. It got Rust #1538 right too, but ground through 11.3M tokens and $3.21 to do it. Same model, same settings, wildly different outcomes, which is the routing problem.
Same-task contrasts with Composer
The sharpest window is running both models on identical tasks. sqlparser-rs #1472 makes ! dialect-specific: Hive reads !a as logical NOT, PostgreSQL reads a! as factorial, and a dialect that supports neither must reject both. The human added two opt-in predicates to the Dialect trait (supports_factorial_operator, supports_bang_not_operator, both defaulting to false) so each dialect declares what it allows and everything else rejects ! for free. GLM instead hard-coded dialect_of!(self is HiveDialect | GenericDialect) branches in the parser and let the permissive GenericDialect accept both forms. It passed the happy-path tests, and even added one asserting MsSql rejects !, but it's non-equivalent: it keys on dialect identity instead of capability, so it accepts syntax the human's design rejects. Composer shipped the equivalent, review-clean patch, craft 3.68.
sqlparser-rs #1493 (JSON_TABLE FOR ORDINALITY) is the same shape. Composer was equivalent and review-clean, craft 3.53. GLM was non-equivalent because it parses a data type before checking for FOR ORDINALITY, so the real, type-free n FOR ORDINALITY won't parse and only the non-standard n INT FOR ORDINALITY does: the right code in the wrong order.
Two more in the same vein. On Go #859 ($0.36) GLM made the planner's path lookups O(1) with three parallel caches but never touched the DataSourceMetadata membership scans the task also called out: half the optimization. On Rust #1398 ($0.73) it changed how the interval value parses but never added the error that rejects a missing unit, and it deleted the INTERVAL 1 + 1 DAY regression test rather than make it pass. The suite went green by dropping the coverage that would have failed.
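The identity-versus-capability distinction in the #1472 contrast can be shown with a small Python analog. This is not sqlparser-rs code and not either patch. Only the two predicate names come from the text above. The class names, the `accepts_by_*` helpers, the assumption that the generic dialect opts into neither form, and the dialect list on the factorial branch of the identity check are all hypothetical.

```python
class Dialect:
    """Base dialect: every capability is opt-in and defaults to False."""

    def supports_factorial_operator(self):
        return False

    def supports_bang_not_operator(self):
        return False


class HiveLike(Dialect):
    def supports_bang_not_operator(self):
        return True


class PostgresLike(Dialect):
    def supports_factorial_operator(self):
        return True


class GenericLike(Dialect):
    """Assumed here to opt into neither form."""


class MsSqlLike(Dialect):
    """Declares nothing, so it rejects both forms for free."""


def accepts_by_capability(dialect, form):
    """Capability check: ask the dialect what it declares."""
    if form == "prefix_not":
        return dialect.supports_bang_not_operator()
    if form == "postfix_factorial":
        return dialect.supports_factorial_operator()
    raise ValueError(f"unknown form: {form}")


def accepts_by_identity(dialect, form):
    """Identity check: a hard-coded list of dialect types per form (hypothetical lists)."""
    if form == "prefix_not":
        return isinstance(dialect, (HiveLike, GenericLike))
    if form == "postfix_factorial":
        return isinstance(dialect, (PostgresLike, GenericLike))
    raise ValueError(f"unknown form: {form}")


for dialect in (HiveLike(), PostgresLike(), GenericLike(), MsSqlLike()):
    for form in ("prefix_not", "postfix_factorial"):
        print(type(dialect).__name__, form,
              accepts_by_capability(dialect, form),
              accepts_by_identity(dialect, form))
```

Both checks agree on the dialects a happy-path test would exercise and both reject the MsSql-like case. They diverge only on the generic dialect, which is why a suite can go green while the behavior differs from the capability-based design.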
The pattern is a plausible partial: tests go green, behavior stays incomplete — which is exactly the failure a test-only gate cannot see.
Cheaper than frontier, pricier than Composer
GLM at $1.40/task on Go and $1.04 on Rust runs about 0.3x what the premium arms cost: Opus 4.8 $3.02–$3.98, GPT-5.5 $3.41–$4.69, Opus 4.7 $3.55–$5.93. Measured upward, the cost position is real.
Measured against the model directly below it, it isn't. Composer costs $0.71 on Go and $0.53 on Rust — GLM is roughly 2x Composer in both repos, for worse quality. The blended local score makes GLM look closer than it is: Opus 4.8 80.2, GPT-5.5 75.6, Composer 71.6, GLM 64.6. GLM sits only ~7 behind Composer because code footprint is the one component where it doesn't trail, and it props the composite up. The per-axis read is harsher and more accurate: decision-grade behind the entire premium field on craft and equivalence, decision-grade behind Composer on Rust equivalence.
The "cheap" story only holds if you ignore Composer, which sits directly above GLM on quality and directly below it on price.
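The two cost comparisons can be checked directly from the per-repo $/task tables earlier in the article. This is a minimal sketch of that arithmetic; the function name `cost_ratios` is hypothetical, and Composer's Go cost carries the same directional caveat noted under the tables.

```python
# $/task figures copied from the per-repo tables above.
COST_PER_TASK = {
    "Go": {
        "Opus 4.8 high": 3.98,
        "GPT-5.5 high": 4.69,
        "Opus 4.7 xhigh": 5.93,
        "Composer 2.5": 0.71,
        "GLM 5.2 medium": 1.40,
    },
    "Rust": {
        "Opus 4.8 high": 3.02,
        "GPT-5.5 high": 3.41,
        "Opus 4.7 xhigh": 3.55,
        "Composer 2.5": 0.53,
        "GLM 5.2 medium": 1.04,
    },
}


def cost_ratios(costs, subject="GLM 5.2 medium"):
    """Cost of `subject` as a multiple of every other arm's cost."""
    base = costs[subject]
    return {arm: base / cost for arm, cost in costs.items() if arm != subject}


for repo, costs in COST_PER_TASK.items():
    for arm, ratio in cost_ratios(costs).items():
        print(f"{repo}: GLM costs {ratio:.2f}x {arm}")
```

Measured against the premium arms the ratios fall between roughly 0.24x and 0.35x, and against Composer they come out just under 2x in both repos.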
On lived cost: running the same fifty-task eval consumed 100% of a week's usage on GLM's $60/week plan. The same tasks on Composer used about 30% of a month's usage on its $20/month plan — and GLM's tighter caps fragment a long run into pieces the way Composer's don't.
Route by behavior, not the test gate
If you already run Composer 2.5, GLM buys you nothing on these workloads — more turns, more cost, lower quality. The routing comparison isn't close.
GLM's honest slot is supervised first-draft generation: work where a human or frontier model is tightly guiding it. The deletion-aversion and the plausible-partial pattern are manageable with eyes on the output; they are not manageable at scale without them. Tighter turn and cost caps contain the flail cases like #1308's 326-turn, $4.07 grind, but caps limit the damage — they don't fix the incompleteness.
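A turn and cost cap of the kind described above could look like the following sketch. It is not Stet's implementation or any agent harness's API. The class names, the cap values (150 turns, $2.00), and the flat per-turn cost are all assumptions chosen for illustration; the flat cost is simply $4.07 spread over 326 turns, rounded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunBudget:
    max_turns: int
    max_cost_usd: float


@dataclass
class RunState:
    turns: int = 0
    cost_usd: float = 0.0


def stop_reason(state: RunState, budget: RunBudget) -> Optional[str]:
    """Return why the run must stop, or None if it may continue."""
    if state.turns >= budget.max_turns:
        return "turn cap reached"
    if state.cost_usd >= budget.max_cost_usd:
        return "cost cap reached"
    return None


def run_with_caps(turn_costs, budget):
    """Replay a sequence of per-turn costs, stopping as soon as a cap is hit."""
    state = RunState()
    for cost in turn_costs:
        reason = stop_reason(state, budget)
        if reason is not None:
            return state, reason
        state.turns += 1
        state.cost_usd += cost
    return state, stop_reason(state, budget)


# Hypothetical replay of a 326-turn loop at a flat, assumed $0.0125 per turn.
runaway = [0.0125] * 326
state, reason = run_with_caps(runaway, RunBudget(max_turns=150, max_cost_usd=2.00))
print(state.turns, round(state.cost_usd, 2), reason)
```

The cap bounds the spend, but the run that stops early still returns whatever partial patch exists at that point, so the review step is still needed.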
GLM is a draft generator with a reviewer attached, not a merge trigger.
What this is and isn't
Fifty tasks per model, two repos — graphql-go-tools in Go and sqlparser-rs in Rust — June 2026. Grader: blinded gpt-5.4, independent of the runner. Single seed, one attempt per task. Contamination audited clean on both repos. Costs are cache-aware. GLM ran at medium reasoning.
Per-repo replication is the confidence mechanism: when a finding holds in Go and in Rust, that's a stronger signal than one repo alone. Pooled numbers are directional, not definitive.
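As a sketch of what "holds in Go and in Rust" means in practice, the check below asks whether one arm ranks last on a metric in every repo, using the craft and equivalence values from the tables above. It only tests rank order. It does not test whether a gap is decision-grade, since the statistical threshold is not given here. The function names `last_place` and `replicates` are hypothetical.

```python
# Craft and equivalence values copied from the per-repo tables above.
SCORES = {
    "Go": {
        "Opus 4.8 high": {"craft": 2.90, "equiv": 0.73},
        "GPT-5.5 high": {"craft": 2.72, "equiv": 0.73},
        "Opus 4.7 xhigh": {"craft": 2.63, "equiv": 0.68},
        "Composer 2.5": {"craft": 2.48, "equiv": 0.60},
        "GLM 5.2 medium": {"craft": 2.38, "equiv": 0.47},
    },
    "Rust": {
        "Opus 4.8 high": {"craft": 3.28, "equiv": 0.98},
        "Opus 4.7 xhigh": {"craft": 2.98, "equiv": 0.97},
        "GPT-5.5 high": {"craft": 2.94, "equiv": 0.96},
        "Composer 2.5": {"craft": 2.84, "equiv": 0.95},
        "GLM 5.2 medium": {"craft": 2.69, "equiv": 0.78},
    },
}


def last_place(repo_scores, metric):
    """Arm with the lowest value on `metric` within one repo."""
    return min(repo_scores, key=lambda arm: repo_scores[arm][metric])


def replicates(scores, arm, metric):
    """True if `arm` is last on `metric` in every repo."""
    return all(last_place(repo_scores, metric) == arm for repo_scores in scores.values())


for metric in ("craft", "equiv"):
    print(metric, replicates(SCORES, "GLM 5.2 medium", metric))
```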
This is not a general benchmark — two repos, one domain cluster (parsers and query planning), one judge model, one agent scaffold. Other code domains, task distributions, and harness configurations may rank these models differently. I can't build a routing rule for your codebase out of these numbers.
Measure your own harness on your own code. GLM's behavior on sqlparser-rs says something specific about dialect-sensitive parser work; whether that carries to your service layer, your migrations, your infrastructure code is an open question. What would these standings look like on the three repos you actually maintain?
FAQ
Is GLM 5.2 good enough to replace a premium coding model?
On this 50-task slice, no. GLM finished last on craft and equivalence in both repos, decision-grade behind the entire premium field. Its honest slot is cheap, supervised first-draft work where a human reviews the patch, not unattended production work.
GLM 5.2 vs Composer 2.5 — which cheap model is better?
On this slice, Composer. They are a noise-band tie on Go craft and equivalence, but Composer is decision-grade ahead on Rust equivalence and costs about half as much in both repos ($0.53/task Rust, $0.71 Go vs GLM's $1.04 and $1.40). GLM is cheaper than the premium field but not the cheapest arm.
Is GLM 5.2 cheap to run?
Cheaper than the premium field (~$1.40/task on Go, $1.04 on Rust, about 0.3x Opus 4.8 and GPT-5.5) but roughly 2x Composer 2.5 in both repos. Token volume drives its cost: it runs more agent turns than the premium models and writes more code than the human PR.
How does GLM 5.2 fail when it fails?
It ships a plausible partial: a test-passing patch that bolts on new code but misses the change the PR actually made. It writes about 1.8x the human's churn while rarely deleting, and it cleared code review on only 5 of 50 tasks. On sqlparser-rs #1472 and #1493, where Composer shipped equivalent, review-clean patches, GLM passed the same tests but came back non-equivalent.
Where is GLM 5.2 safe to use?
Cheap, supervised first-draft work where a human reviews before merge and the spec is narrow. Not auto-merge-on-green production work, and not unattended batch runs — its five-hour and weekly usage caps fragment long runs and its plausible-partial patches need human eyes.