When Fable 5 Is Worth the Premium
Fable 5 is back. How good is it, and when is it worth the premium? What will we do when Fable is paywalled behind usage credits?
The top-level finding is that, on this n=30 slice, Fable led the local score at 81.7 and cost $5.29 per task. Clean Opus 4.8 landed at 72.7 and cost $3.56 per task.
However, that misses key nuance about where Fable 5 is successful, and when we can route to Opus 4.8 instead.
Reading the tasks one by one shows when it makes sense to use Fable. On graphql-go-tools#817, Fable was the right expensive route, even though GPT-5.5 won cost and Opus 4.8 had the smallest footprint. On graphql-go-tools#843, GPT-5.5 was the right route. On sqlparser-rs#1472, Composer was the right cheap draft. The average hides those reversals.
So the overall recommendation isn't simply "always use Fable" (most of us won't even be able to!). The finding is that we should route to Fable when the task has elevated scope or risk, and keep cheaper models in the loop when the task is narrow enough to review and iterate on.
I tested Fable 5, Opus 4.8, Opus 4.7, GPT-5.5, and Composer 2.5 on 30 real merged PRs from GraphQL Go Tools and SQLParser Rust. Each task gets a frozen repo snapshot in Docker, a prompt, and one attempt. Then Stet, the local eval harness I build, grades beyond pass/fail: did the patch make the same behavioral change as the human patch? Would a reviewer accept it? How much extra code did the agent touch? Is the diff clear, minimal, and well-structured?
This is NOT meant to be a statistically significant comparison - it is n=30 over 2 repos, with one reasoning effort per arm. Instead, I'm interested in showing how we can use methodology similar to popular evals (DeepSWE, FrontierCode, CursorBench) and draw directional signals on our own repos for a fraction of the cost.
Cleaned n=30 slice: GraphQL Go Tools (Go, n=9) plus SQLParser (Rust, n=21). Fable leads the compact local-score table, but the useful read is task by task: route to Fable when hidden semantic behavior or patch shape is the risk; default cheaper when the named examples do not show that risk.
Aggregate context: quality frontier. The chart compares weighted Stet quality against rollout spend.
- Fable 5 high: repo-balanced local score 81.7, cost per task $5.29
- Opus 4.8 high: repo-balanced local score 72.7, cost per task $3.56
- GPT-5.5 high: repo-balanced local score 73.2, cost per task $3.50
- Opus 4.7 xhigh: repo-balanced local score 71.0, cost per task $4.93
- Composer 2.5: repo-balanced local score 68.0, cost per task $0.51
Lower cost is better; higher local score is better. This chart is aggregate context only: its axes can be switched to compare cost, time, input/output tokens, and tool calls against local score, craft, equivalence, review, and tests.
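As a rough way to read the five aggregate rows above, the sketch below checks which arms sit on the cost/score frontier and what each extra local-score point costs when moving from Opus 4.8 to Fable. It uses only the two aggregate numbers per arm; the function names are illustrative and are not part of Stet. Craft, review, and footprint are not represented, so a dominated arm here can still be the better route on those reads.

```python
# (local score, cost per task in USD), copied from the aggregate rows above.
ARMS = {
    "Fable 5 high": (81.7, 5.29),
    "Opus 4.8 high": (72.7, 3.56),
    "GPT-5.5 high": (73.2, 3.50),
    "Opus 4.7 xhigh": (71.0, 4.93),
    "Composer 2.5": (68.0, 0.51),
}


def pareto_frontier(arms):
    """Return the arms that no other arm matches or beats on both score and cost."""
    frontier = []
    for name, (score, cost) in arms.items():
        dominated = any(
            other_score >= score
            and other_cost <= cost
            and (other_score > score or other_cost < cost)
            for other, (other_score, other_cost) in arms.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


def dollars_per_point(arms, premium, baseline):
    """Extra dollars per task paid for each extra local-score point."""
    premium_score, premium_cost = arms[premium]
    base_score, base_cost = arms[baseline]
    return (premium_cost - base_cost) / (premium_score - base_score)


print(pareto_frontier(ARMS))
# ['Fable 5 high', 'GPT-5.5 high', 'Composer 2.5']
print(round(dollars_per_point(ARMS, "Fable 5 high", "Opus 4.8 high"), 3))
# 0.192
```

On these two axes alone, Opus 4.8 sits just inside the frontier because GPT-5.5 scores 0.5 points higher for $0.06 less. That gap is small enough that the task-by-task reads, not this aggregate, should decide the default.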
local score formula
Tests are downweighted because they saturate early: every arm in this slice cleared 87%+. This is a compact display score, not a substitute for the task-by-task evidence below. In the interactive version, the weights can be adjusted and the ranking moves with them.
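The exact weights are not given in the text, so the following is only a minimal sketch of how a downweighted-tests score could work, assuming the score is a weighted mean over tests, equivalence, review, and craft on a 0-100 scale. The weights and the per-axis values for the two arms are made up for illustration; they are not results from this slice.

```python
def local_score(axes, weights):
    """Weighted mean of per-axis scores (all axes assumed to be on 0-100)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(axes[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weights: tests count least because they saturate early.
DOWNWEIGHTED_TESTS = {"tests": 0.1, "equivalence": 0.3, "review": 0.3, "craft": 0.3}
# Hypothetical alternative that leans almost entirely on the test gate.
TEST_HEAVY = {"tests": 0.7, "equivalence": 0.1, "review": 0.1, "craft": 0.1}

# Made-up arms: A passes more tests, B does better on equivalence and review.
arm_a = {"tests": 100.0, "equivalence": 70.0, "review": 60.0, "craft": 80.0}
arm_b = {"tests": 88.0, "equivalence": 75.0, "review": 65.0, "craft": 80.0}

for label, weights in [("downweighted", DOWNWEIGHTED_TESTS), ("test-heavy", TEST_HEAVY)]:
    score_a = local_score(arm_a, weights)
    score_b = local_score(arm_b, weights)
    leader = "A" if score_a > score_b else "B"
    print(f"{label}: A={score_a:.1f} B={score_b:.1f} leader={leader}")
```

With tests downweighted, arm B leads; with the test-heavy weights, arm A leads. That sensitivity is the reason to treat the compact score as a display value rather than a verdict.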
Model comparison: per-axis breakdown, raw numbers across all five arms. Quality rows: higher is better. Resource rows: lower is better.
Task example: graphql-go-tools#817, pub/sub argument-template validation. All three arms passed tests. Equivalence and review split them.
Excerpted lines from agent.patch (argument_templates validator):
@@ func (p *NatsEventManager) extractEventSubject @@
allMatches := argument_templates.ArgumentTemplateRegex.FindAllStringSubmatch(subject, -1)
validationResult, err := argument_templates.ValidateArgumentPath(p.visitor.Definition, matches[1], fieldDefinitionRef)
variablePlaceholder, err := p.addContextVariableByArgumentRef(argumentRef, validationResult.ArgumentPath)
if !isValidNatsSubject(argument_templates.ArgumentTemplateRegex.ReplaceAllLiteralString(subject, "a")) {
    return "", fmt.Errorf("subject %q is not a valid NATS subject", subject)
@@ func (p *NatsEventManager) addContextVariableByArgumentRef @@
variablePath := append([]string{string(variableName)}, argumentPath[1:]...)
renderer, err := resolve.NewPlainVariableRendererWithValidationFromTypeRef(..., variablePath...)
Taskwise Comparisons
Calibrated taskwise comparison: Fable wins, draws, and losses by metric.
Each rail is task-by-task. Craft rows are individual grader metrics, not an average; resource rows count lower cost or fewer tokens as the win.
Behavior
- Tests: saturated gate
- Equivalence: small behavioral edge
- Code review: review tied overall
Craft metrics (0.25-point draw band on each row)
- Clarity
- Simplicity
- Coherence
- Intentionality
- Robustness
- Instruction adherence
- Scope discipline
- Diff minimality
Footprint + resources
- Footprint: Opus usually writes smaller
- Cost: premium is not close
- Input tokens: Fable reads less more often
- Output tokens: modest Fable lean
Guard bands are intentionally conservative: binary ties stay draws, craft grader rows use a 0.25-point draw band on the 0-4 scale, footprint needs material patch-surface separation, and cost/token rows need a visible resource gap. Composer resource rails use the 21 SQLParser rows with comparable telemetry.
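The guard bands described above can be sketched as a small classifier. This is an illustration, not Stet's implementation: the function names are hypothetical, treating a gap of exactly 0.25 as a draw is an assumption, and the score pairs are made up. The footprint and cost/token thresholds are not specified in the text, so they are left out.

```python
from collections import Counter

CRAFT_DRAW_BAND = 0.25  # draw band on the 0-4 craft scale, as stated above


def compare_craft(fable, other, band=CRAFT_DRAW_BAND):
    """Classify one craft grader row from Fable's point of view."""
    diff = fable - other
    if abs(diff) <= band:  # assumption: a gap exactly at the band is a draw
        return "draw"
    return "win" if diff > 0 else "loss"


def compare_binary(fable_passed, other_passed):
    """Binary rows (tests, equivalence, review): ties stay draws."""
    if fable_passed == other_passed:
        return "draw"
    return "win" if fable_passed else "loss"


def tally(outcomes):
    """Count wins, draws, and losses across tasks for one metric."""
    counts = Counter(outcomes)
    return {key: counts.get(key, 0) for key in ("win", "draw", "loss")}


# Made-up (Fable, other arm) clarity scores for three hypothetical tasks.
clarity_pairs = [(3.7, 3.6), (3.5, 3.0), (2.75, 3.25)]
print(tally(compare_craft(f, o) for f, o in clarity_pairs))
# {'win': 1, 'draw': 1, 'loss': 1}
print(compare_binary(True, False))
# win
```

A 0.1-point gap counts as a draw under the band, which is what keeps grader noise from showing up as wins at low n.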
It helps to look at taskwise wins, draws, and losses per model: with n this low, they extract more signal than the averages do.
Against Opus 4.8, behavior mostly ties, craft is mixed by dimension, and Opus carries the cost and footprint case. Against GPT-5.5 and Composer, Fable leads more craft rows, but is much more expensive.
Some examples: On graphql-go-tools#817, Fable was the expensive route that paid off: all three arms passed tests, but Fable passed review while Opus 4.8 left edge holes and GPT-5.5 missed final NATS subject validation. On graphql-go-tools#843, GPT-5.5 was the right route: Fable was cheaper and still failed the source/serialization behavior. On sqlparser-rs#1472, Composer was a valid cheap draft because the invariant was narrow and reviewable. On sqlparser-rs#1727, that same cheap route failed because the PostgreSQL grammar boundary was subtle.
Five Trace Reads
(AI used for analysis here)
graphql-go-tools#817: tests tied; Fable won the boundary. This was GraphQL pub/sub argument-template validation. The hard part was not the happy path. It was knowing which GraphQL argument templates to reject, which nested input-object paths were valid, whether paths ended at scalar or enum leaves, and whether the final NATS subject was legal after placeholders were substituted.
Fable covered the full boundary: root field argument validation, nested input-object traversal, static and templated subject validation, multiple placeholders, and shared behavior with subscription filters. Opus 4.8 solved enough to pass tests and equivalence, but review found single-segment input-object templates, list-wrapped paths, and malformed template-like strings that could still slip through. GPT-5.5 had a stronger schema-path checker in some spots, but missed final NATS subject validation. That is why the axis winners split: GPT won cost, Opus won footprint, Fable won review and craft.
Tests said "fine." The reviewer said "nope."
graphql-go-tools#859: both passed; Fable found the extra hot path. This was a planner-performance task. Both Fable and Opus 4.8 passed tests, equivalence, and review. The split was not pass/fail. The split was whether the implementation found the whole shape of the performance issue.
Fable indexed datasource TypeFields membership, added-path lookup, and missing-path-by-parent lookup, while preserving fallback scans, first-occurrence behavior, remove semantics, nil handling, and duplicate merging. Opus 4.8 solved the core repeated-scan problem with a smaller patch and lower cost, but left a scan in the missing-path case and carried stale-index / nil-receiver review risks. If the task is a narrow local optimization, Opus is a good default. If the task is really lifecycle invariants plus performance, Fable is the route I would test first.
sqlparser-rs#1713: Fable won parser shape, but only with a dialect audit. FROM-first SQL is a parser/API shape task: add FROM t SELECT * and FROM t, round-trip display, expose AST shape, and avoid over-accepting dialects. Fable and Opus 4.8 both passed tests, equivalence, and review. Fable used fewer tokens and had the cleaner parser shape. Opus was cheaper and had cleaner trace provenance.
The caveat matters: patch audit found Fable appeared to opt GenericDialect into FROM-first support, so I would not ship that task on the aggregate read alone. The routing lesson is narrower: use Fable for parser/API shape when maintainability is the risk, then add the dialect-negative tests that force the unsupported boundary closed.
graphql-go-tools#843: GPT-5.5 was the right route. This was block-string JSON serialization. Fable was cheaper, had lower footprint, and used fewer input tokens. It still failed the task. It normalized already-stored string content rather than recovering the raw triple-quoted source and kept an escaping path that could produce invalid JSON for multiline strings.
GPT-5.5 noticed the representation-layer trap: the parser-stored content had already been trimmed, so the patch needed to recover raw source and then emit through JSON serialization with the right escaping behavior. More expensive was not better here. This is why the routing policy needs model escape hatches instead of a single premium default.
sqlparser-rs#2185: Fable won the task and still was not merge-safe. Oracle hierarchical queries look straightforward by test result: both arms passed tests. Fable passed equivalence, Opus 4.8 did not, and Fable used fewer input and output tokens. But both failed review.
Fable reused the Oracle select-item operator path for CONNECT_BY_ROOT and handled the headline START WITH / CONNECT BY / NOCYCLE variants better than Opus. The remaining bug was deeper parser control flow: CONNECT BY still sat too late relative to later clauses such as GROUP BY and HAVING, and the AST/display shape could not preserve original clause order. That row is important because it prevents the lazy version of the Fable story. A task winner can still be a hold.
sqlparser-rs#1472 and sqlparser-rs#1727: Composer's cheap-draft boundary. On sqlparser-rs#1472, the invariant was small and local: Hive accepts prefix logical NOT, PostgreSQL keeps postfix factorial, and dialects with neither capability reject both. Composer passed review for $0.22. Fable over-preserved compatibility by letting GenericDialect accept both meanings, which was exactly the wrong boundary.
On sqlparser-rs#1727, the story flipped. PostgreSQL ALTER TYPE enum support needed grammar-boundary judgment: which clauses attach to ADD VALUE, which targets are bare identifiers, and which adjacent ALTER TYPE forms stay out of scope. Composer passed tests for $0.26 but accepted PostgreSQL-invalid syntax. Fable cost $3.99 and passed equivalence and review. Composer can draft. Review decides whether the draft is safe.
That is where I would pay. Hidden semantic boundaries. Planner performance without semantic drift. Parser, public API, or AST shape. Cross-dialect containment. Broad refactors where a bad patch can pass CI and still create review debt.
When to route to cheaper models
Opus 4.8 remains my default route for most tasks. Against Fable, the quality read is close, and Opus cost $3.56/task instead of $5.29.
sqlparser-rs#1620 is a good example. SQLite tokenizer dialect opt-in. Opus 4.8 shipped an equivalent, review-passing patch with slightly higher craft, 3.73 vs Fable's 3.69, for $0.74 instead of $1.56.
GPT-5.5 is also a good choice, with a weaker craft/review read than Opus 4.8 despite a similar compact score. It cost $3.50/task and sometimes won outright. On graphql-go-tools#843, block-string JSON serialization, GPT-5.5 was the only arm to recover the raw source and emit spec-correct output. On graphql-go-tools#1232, fetch deduplication, it was the cheapest patch that cleared review.
Composer is the cheap draft route, not a peer to the frontier. It cost $0.51/task. Composer is attractive when the invariant is local and mechanically reviewable; it is dangerous when adjacent surface area tempts the model to generalize or over-abstract.
Opus 4.7 is the one I cannot route to from this slice. It is near Fable's cost at $4.93/task and directionally weaker. It neither saves enough money nor wins enough quality.
Routing policy
Based on this, the policy I would test from this slice is cheap-first routing with explicit escalation.
Start with Opus 4.8 high / xhigh for normal medium-risk engineering. Examples from this slice: commodity dialect adds, routine parser plumbing, ordinary API additions, work where the test suite and review rubric are likely to expose the important failure modes.
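A back-of-the-envelope check on when an Opus-first policy with escalation saves rollout spend is sketched below. It assumes an escalated task pays for both the Opus 4.8 run and the Fable run, that the aggregate per-task costs apply to escalated tasks (harder tasks probably cost more), and that reviewer time is free. None of these assumptions come from the measurements; the function names are illustrative.

```python
OPUS_48_COST = 3.56  # USD per task, from this slice
FABLE_COST = 5.29    # USD per task, from this slice


def cheap_first_cost(default_cost, premium_cost, escalation_rate):
    """Expected spend per task when every task starts on the default arm
    and a fraction `escalation_rate` is rerun on the premium arm."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return default_cost + escalation_rate * premium_cost


def break_even_rate(default_cost, premium_cost):
    """Escalation rate at which cheap-first costs the same as always-premium."""
    return (premium_cost - default_cost) / premium_cost


print(round(break_even_rate(OPUS_48_COST, FABLE_COST), 3))
# 0.327
print(round(cheap_first_cost(OPUS_48_COST, FABLE_COST, 0.20), 2))
# 4.62
```

Under these assumptions, starting on Opus 4.8 is cheaper than always using Fable as long as fewer than about a third of tasks escalate. Routing the risky shapes straight to Fable avoids paying twice on the tasks most likely to escalate.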
From this slice, I would route direct to Fable when the task has one of these shapes:
- invalid-input rejection or edge-case semantics;
- parser, planner, public API, or AST shape;
- broad or breaking refactor;
- high review-cost work where a subtly wrong patch would waste senior reviewer time;
- scope-discipline-sensitive work where overgeneralization is a risk.
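The shapes above could be encoded as a first-pass router along these lines. This is a sketch under assumptions: the flag names are hypothetical labels for the five bullets, someone still has to triage each task to set them, and the cheap-draft flag is an added assumption based on Composer being attractive when the invariant is local and mechanically reviewable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRisk:
    """Hypothetical risk flags, one per routing shape in the list above."""
    invalid_input_or_edge_semantics: bool = False
    parser_planner_api_or_ast_shape: bool = False
    broad_or_breaking_refactor: bool = False
    high_review_cost: bool = False
    scope_discipline_sensitive: bool = False
    # Added for the cheap-draft case: a local, mechanically reviewable invariant.
    narrow_reviewable_invariant: bool = False


def route(task: TaskRisk) -> str:
    """Pick the first arm to try; any escalation flag takes priority."""
    needs_premium = (
        task.invalid_input_or_edge_semantics
        or task.parser_planner_api_or_ast_shape
        or task.broad_or_breaking_refactor
        or task.high_review_cost
        or task.scope_discipline_sensitive
    )
    if needs_premium:
        return "Fable 5"
    if task.narrow_reviewable_invariant:
        return "Composer 2.5 draft + review"
    return "Opus 4.8"


print(route(TaskRisk()))
# Opus 4.8
print(route(TaskRisk(parser_planner_api_or_ast_shape=True)))
# Fable 5
print(route(TaskRisk(narrow_reviewable_invariant=True)))
# Composer 2.5 draft + review
```

Escalation flags win over the cheap-draft flag on purpose: sqlparser-rs#1727 looked like a small grammar change and still needed boundary judgment.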
Cost and process
Fable's premium isn't incoherent spend. It used far fewer tool calls, shell calls, and patch calls than any other arm, and moderately fewer input tokens.
Fable averaged 25.0 tool calls per task against Opus 4.8 at 70.3, GPT-5.5 at 84.0, and Composer at 96.0. Patch calls: 4.7 for Fable, 13.0 for Opus 4.8, 12.8 for GPT-5.5, 10.3 for Opus 4.7, and 20.0 for Composer. Patch-rewrite ratio: 0.49 for Fable versus 1.6-2.7 for the field.
The token split matters too. On the exact selected slice, Fable averaged about 2.98M input tokens and 32K output tokens per task. Opus 4.8 averaged 3.70M and 45.5K. GPT-5.5 averaged 4.55M and 15K. Opus 4.7 averaged 6.85M and 33K. Composer's token telemetry is partial, so I only use its token rows where they were captured.
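To make the process gap easier to compare, the sketch below expresses each arm's per-task averages as a multiple of Fable's. The figures are the ones quoted in the two paragraphs above; values the text does not state are left as None, and Composer is omitted because its token telemetry is partial. The dictionary keys are illustrative names.

```python
# Per-task averages from the text: tool calls, input tokens (millions),
# output tokens (thousands). None means the text does not give the value.
USAGE = {
    "Fable 5": {"tool_calls": 25.0, "input_m": 2.98, "output_k": 32.0},
    "Opus 4.8": {"tool_calls": 70.3, "input_m": 3.70, "output_k": 45.5},
    "GPT-5.5": {"tool_calls": 84.0, "input_m": 4.55, "output_k": 15.0},
    "Opus 4.7": {"tool_calls": None, "input_m": 6.85, "output_k": 33.0},
}


def relative_to(usage, baseline):
    """Express every other arm's metrics as a multiple of the baseline arm's."""
    base = usage[baseline]
    return {
        arm: {
            metric: None if value is None else round(value / base[metric], 2)
            for metric, value in metrics.items()
        }
        for arm, metrics in usage.items()
        if arm != baseline
    }


rel = relative_to(USAGE, "Fable 5")
for arm, ratios in rel.items():
    print(arm, ratios)
```

Read this way, Fable's advantage is concentrated in tool calls and input tokens. GPT-5.5 emits fewer output tokens than Fable, at roughly half the volume.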
Fable bought results with money, not by repeatedly drafting and re-patching. Whether that reflects better first-draft quality or less willingness to iterate on hard tasks is not clear from this data.
What this means
This was a model comparison. The same workflow can answer the question on your tasks: is this model, this AGENTS.md configuration, this routing policy change safe to roll out on my repo?
You can't get that from a public leaderboard or comparison on code that isn't yours. A model that passes your test suite is not the same as a model shipping the behavioral changes your engineers would.
The only way to know how Fable 5 truly performs on your repo is to measure it.
First scored result within the hour, on the Claude subscription you already have.
Join the waitlist
FAQ
Did Fable 5 beat Opus 4.8?
Not as a blanket routing rule. Fable led the compact local score, 81.7 vs clean Opus 4.8 at 72.7, but the taskwise evidence says when to pay: hidden semantic boundaries and planner/parser shape. Opus 4.8 cost $3.56/task vs Fable at $5.29.
When is Fable 5 worth the premium?
When the task hides above-test risk: invalid-input rejection, parser or planner architecture, public API or AST shape, cross-dialect precedence, broad refactors, or high review cost. For ordinary medium-risk work, route to Opus 4.8 first.
What was Fable 5's strongest result?
The clearest named examples were graphql-go-tools#817, where all compared arms passed tests but only Fable passed review, and graphql-go-tools#859, where Fable found the extra planner hot path while preserving fallback semantics.
Should every team switch to Fable 5?
No. This is a local two-repo slice, not a global ranking. The takeaway is routing: default to Opus 4.8, pay for Fable on above-test risk, and measure the policy on your own repos.