When Fable 5 Is Worth the Premium
Fable 5 launched last week. I wanted to know whether the premium over Opus 4.8 is real on the repos I already had instrumented. I tested Fable 5, Opus 4.8, Opus 4.7, GPT-5.5, and Composer 2.5 on 30 real tasks from two open-source repos (Go and Rust parser libraries). Fable led the headline score at 81.7. It was also the most expensive arm at $5.29/task.
Task-by-task, the model you reach for most of the time is Opus 4.8. Fable scored higher on most quality axes, but at n=30 on two parser repos, the lead doesn't survive statistical correction — not enough data to confidently generalize, especially since the task set underrepresents Fable's strengths on long-horizon work. Meanwhile you're paying about $1.73 extra per task.
Each task is a real merged PR. The agent gets a frozen repo snapshot in Docker, a prompt, and one attempt. Then Stet — the local eval harness I build — grades the result beyond pass/fail: did the patch make the same behavioral change as the human (equivalence)? Would a reviewer accept it (code review)? How much extra code did the agent touch (footprint)? Is the diff clear, minimal, and well-structured (craft)?
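The grading axes above can be pictured as one record per attempt. This is a minimal sketch, not Stet's actual schema: the class name, field names, units, and the `hidden_by_tests` helper are all hypothetical, chosen only to mirror the four questions in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedAttempt:
    """One agent attempt on one task. All field names are assumptions."""
    task_id: str
    tests_passed: bool          # did the repo's test suite pass?
    equivalent: bool            # same behavioral change as the merged human PR?
    review_passed: bool         # would a reviewer accept the patch?
    footprint_extra_lines: int  # extra code touched (unit is an assumption)
    craft: float                # clarity / minimality / structure (scale assumed)


def hidden_by_tests(attempts):
    """Attempts that pass the test suite but fail equivalence or review."""
    return [
        a for a in attempts
        if a.tests_passed and not (a.equivalent and a.review_passed)
    ]
```

Filtering on `hidden_by_tests` is the "beyond pass/fail" read: it surfaces exactly the patches a test-only gate would have waved through.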
Cleaned n=30 slice: GraphQL Go Tools (Go, n=9) plus SQLParser (Rust, n=21). Fable leads the headline score and has calibrated craft and review advantages over GPT-5.5 and Composer. Fable also scored higher than Opus 4.8 on most quality axes, but the lead doesn't survive correction at this sample size — and Opus 4.8 costs a third less.
Figure: Fable 5 comparison, quality frontier. Weighted Stet quality (local score) plotted against rollout spend (cost per task).
- Fable 5 high: repo-balanced local score 81.7, cost per task $5.29
- Opus 4.8 high: repo-balanced local score 72.7, cost per task $3.56
- GPT-5.5 high: repo-balanced local score 73.2, cost per task $3.50
- Opus 4.7 xhigh: repo-balanced local score 71.0, cost per task $4.93
- Composer 2.5: repo-balanced local score 68.0, cost per task $0.51
Lower cost is better.
Higher local score is better.
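The frontier in the chart can be recomputed from the five points listed above. This is a small sketch using only those two headline numbers per arm; the function name and the dominance rule (at least as good on both axes, strictly better on one) are my framing, not necessarily what the chart widget does.

```python
# (local score, cost per task in USD), copied from the list above
ARMS = {
    "Fable 5 high": (81.7, 5.29),
    "Opus 4.8 high": (72.7, 3.56),
    "GPT-5.5 high": (73.2, 3.50),
    "Opus 4.7 xhigh": (71.0, 4.93),
    "Composer 2.5": (68.0, 0.51),
}


def pareto_frontier(arms):
    """Return arms that no other arm dominates, sorted by cost."""
    frontier = []
    for name, (score, cost) in arms.items():
        dominated = any(
            o_score >= score and o_cost <= cost
            and (o_score > score or o_cost < cost)
            for other, (o_score, o_cost) in arms.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: arms[n][1])


print(pareto_frontier(ARMS))
# ['Composer 2.5', 'GPT-5.5 high', 'Fable 5 high']
```

On these two axes alone, Opus 4.8 sits just inside the frontier, 0.5 points and $0.06 behind GPT-5.5. That gap is a near-tie, so the composite by itself cannot separate the two.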
The interactive chart can also plot cost, time, tokens, and tools against local score, craft, equivalence, review, and tests.
local score formula
Tests are downweighted because they saturate early: every arm in this slice cleared 87%+. This is a compact display score, not a substitute for the calibrated pairwise statistics below. Reweighting the axes changes the ranking.
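A weighted display score of this kind can be sketched as a normalized weighted mean. The actual weights and axis scales are not given in this post, so everything below is an assumption: the axis names, the 0-100 scale, the weight values, and the example inputs are placeholders, not measured results.

```python
def local_score(axes, weights):
    """Weighted mean of axis scores that share a common scale.

    Weights are normalized, so they can be changed freely without
    rescaling the result.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(axes[name] * w for name, w in weights.items()) / total


# Hypothetical weights: tests get the smallest weight because they saturate.
WEIGHTS = {"tests": 1, "equivalence": 3, "review": 3, "craft": 3}

# Placeholder axis values on an assumed 0-100 scale, NOT data from this study.
example_axes = {"tests": 90.0, "equivalence": 70.0, "review": 60.0, "craft": 80.0}

print(local_score(example_axes, WEIGHTS))  # 72.0
```

Because the weights are normalized, raising the test weight pulls every arm toward the same saturated value and compresses the ranking. That is the reason to keep it small.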
Table: model comparison, per-axis breakdown with raw numbers across all five arms. Quality rows: higher is better. Resource rows: lower is better.
how confident are these comparisons?
statistical strength of each pairwise read, after correcting for multiple comparisons
vs Opus 4.8 high
local score 81.7 vs 72.7; Fable is higher on most quality axes (review ties, Opus edges it on footprint), but no lead survives correction at this sample size
Fable costs 1.49x as much ($5.29 vs $3.56) but uses fewer tokens, tools, and patch calls
vs GPT-5.5 high
craft mean +0.309 and code-review overall (CR-overall) +0.367 survive BH-FDR
GPT is cheaper and faster; Fable uses fewer tokens and tools
vs Opus 4.7 xhigh
tests +13.3pp, equivalence +20.0pp, CR-overall directional
cost roughly par; Fable has much lower token/tool/patch churn
vs Composer 2.5
craft, CR, and footprint rows survive BH-FDR for Fable
Composer is 10.3x cheaper where normalized telemetry is present
Decision-grade (DG) = survives Benjamini-Hochberg false discovery rate (BH-FDR) multiplicity correction at q < 0.05. Directional = consistent across graders but doesn't clear that bar. Inconclusive = lead visible but sample too small to confidently generalize.
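For readers who have not used it, the Benjamini-Hochberg step-up procedure behind the decision-grade label can be sketched in a few lines. This is the textbook procedure, not Stet's code, and the p-values in the usage line are placeholders rather than results from this study.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True if rejected at FDR level q.

    Step-up rule: sort p-values ascending, find the largest rank k with
    p_(k) <= (k / m) * q, then reject every hypothesis ranked 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep the largest qualifying rank

    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Placeholder p-values for eight hypothetical pairwise axis comparisons.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]))
```

With eight comparisons, a raw p-value of 0.039 would pass an uncorrected 0.05 threshold but fails here, because its rank-3 threshold is 3/8 × 0.05 ≈ 0.019. That is the sense in which a visible lead "doesn't survive correction".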
trace example
same task, three models, one test suite — three different outcomes
pr 1727: add PostgreSQL ALTER TYPE enum support. All three arms passed tests. Only two passed review.
agent.patch: ALTER TYPE parser (ALTER TYPE enum support). Rubric rows:
- parse ALTER TYPE ... ADD VALUE
- parse ALTER TYPE ... RENAME VALUE
- parse ALTER TYPE ... RENAME TO
- IF NOT EXISTS on ADD VALUE
- IF NOT EXISTS on RENAME VALUE (PostgreSQL forbids this combination)
When to use each model
Opus 4.8: the smart-money default
Fable scored higher on most quality axes — craft, equivalence, code review; Opus edged it on footprint — but none of those leads survived multiplicity correction at n=30. The sample is small, both repos are parser libraries with tightly scoped tasks, and Fable's strongest case — long-horizon feature work that crosses interface boundaries — is underrepresented here. The data points toward Fable being better; it just isn't enough data to confidently generalize. Opus 4.8 costs about $3.56/task against Fable's $5.29, and unlike Fable, it carries no regulatory restrictions.
Opus 4.8 and GPT-5.5 are effectively tied on composite score (72.7 vs 73.2) and now cost about the same. What separates them is reviewability: Opus 4.8 out-crafts GPT-5.5 (3.12 vs 2.93) and passes clean code review on more tasks (18/30 vs 14/30). When you're routing for a reviewer's trust, craft is the axis that carries.
pr 1620 (SQLite tokenizer dialect opt-in): Opus 4.8 shipped an equivalent, review-passing patch with slightly higher craft (3.73 vs 3.69) for $0.74, less than half Fable's $1.56. Same behavioral change, same review outcome, 53% cheaper. Fable's dollars bought nothing here.
The pattern holds across ordinary, well-specified work: dialect adds, routine plumbing, tokenizer adjustments. The premium over Opus 4.8 buys nothing you can measure.
When Fable's premium is worth it
Two shapes. Together they covered about three of these 30 tasks.
Shape one: review-cost insurance on above-test-risk work. The clearest case is pr 817 (GraphQL pub/sub argument-template validation). The hard part isn't the happy path — it's knowing which inputs to reject. Fable was the only arm whose code review passed clean. Opus 4.8 was cheaper ($5.81 vs $9.71) and hit tests and equivalence, but its review failed on exactly those hidden invalid-input holes. Tests said "fine." The reviewer said "nope."
Same shape on pr 2107 (preserve SQL comments): Fable was the only arm with a zero-risk review, while the $2.01 Opus 4.8 patch carried a latent reversed-range panic the suite never triggered.
Shape two: the broad or breaking refactor cheaper models can't land. pr 1293 was a wide, breaking GraphQL metadata reshaping — the kind of change that touches interface boundaries across the repo. Fable was the only arm whose patch passed the suite at all. That one task cost $23.07. Worth it when you genuinely need the change landed.
GPT-5.5: cheap and sometimes uniquely right
GPT-5.5 (about $3.50/task) sometimes wins outright. pr 843 (block-string JSON serialization): GPT-5.5 was the only arm to recover the raw source and emit spec-correct output. The other four arms, at four different price tags, shared one wrong answer. On pr 1232 (fetch-dedup bugfix), GPT-5.5 was the cheapest arm and the only one whose patch cleared review.
Fable's calibrated edge over GPT-5.5 is in craft and reviewability, and it is statistically significant at q < 0.05. GPT-5.5 is the faster, cheaper pick when the task is well-specified and you can accept a measurable craft risk.
Composer: fine draft, needs review
Composer runs at about $0.51/task, roughly 10x cheaper than Fable. The quality gap is real and statistically significant. The cautionary trace is pr 1727 (PostgreSQL ALTER TYPE): Composer's $0.26 patch passed tests but taught the parser to accept syntax PostgreSQL rejects. Both Fable ($3.99) and Opus 4.8 ($1.95) passed review on the same task.
But on commodity dialect adds, Composer ships in-spec for pennies. pr 1472 (dialect-aware ! operator): Composer's $0.22 patch passed review with perfect scope, while Fable's $3.95 attempt failed review by overgeneralizing. Composer is a fine cheap draft with review in the loop.
Opus 4.7: dominated
Rough cost-parity with Fable ($4.93 vs $5.29), directionally weaker on quality. It neither saves money nor wins on quality. I can't construct a routing rule that prefers it.
The routing summary
Even the winner isn't safe to merge unread. On pr 2185, Fable passed tests and equivalence, then failed review on Oracle CONNECT BY clause placement.
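The routing rules from the sections above can be written down as a small policy function. This is a simplified sketch: the `TaskTraits` flags are hypothetical labels that someone would have to assign before routing, the model strings are just names, and GPT-5.5 is left out because its wins here were task-specific rather than rule-shaped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTraits:
    """Hypothetical pre-routing labels; how to derive them is out of scope."""
    above_test_risk: bool = False  # e.g. which inputs to reject, public API shape
    broad_refactor: bool = False   # wide or breaking change across interfaces
    commodity_draft: bool = False  # routine in-spec work, cheap draft is acceptable


def route(traits: TaskTraits) -> str:
    """Pick a model name for a task. Every route still assumes human review."""
    if traits.above_test_risk or traits.broad_refactor:
        return "Fable 5"       # pay the premium only where it showed up
    if traits.commodity_draft:
        return "Composer 2.5"  # cheap draft, review in the loop
    return "Opus 4.8"          # default for ordinary, well-specified work


print(route(TaskTraits()))                      # Opus 4.8
print(route(TaskTraits(above_test_risk=True)))  # Fable 5
```

The function deliberately returns a model, not a merge decision. No route in this post removes the need to read the patch.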
Cost and process
Fable's premium isn't incoherent spend. It used far fewer tool calls, shell calls, and patch calls than any other arm.
Fable averaged about 25 tool calls per task against Opus 4.8 at 70, GPT-5.5 at 84, and Composer at 96. Patch calls: 4.7 for Fable against 10-20 for the rest. Patch-rewrite ratio (rewrites per original patch call): 0.49 for Fable versus 1.6-2.7 for the field. Fable bought results with money, not by repeatedly drafting and re-patching. Whether that reflects better first-draft quality or less willingness to iterate on hard tasks is not clear from this data.
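One way to compute a patch-rewrite ratio from a tool-call trace is sketched below. The classification rule is an assumption on my part for illustration: the first patch call to a file counts as original, and any later patch call to the same file counts as a rewrite. The harness may define it differently, and the function name and input shape are hypothetical.

```python
def patch_rewrite_ratio(patched_paths):
    """Rewrites per original patch call for one task.

    `patched_paths` is the ordered list of file paths targeted by each
    patch call. Assumed rule: first touch of a path is an original,
    any repeat touch of that path is a rewrite.
    """
    seen = set()
    originals = 0
    rewrites = 0
    for path in patched_paths:
        if path in seen:
            rewrites += 1
        else:
            seen.add(path)
            originals += 1
    if originals == 0:
        return 0.0  # no patch calls at all
    return rewrites / originals


# Illustrative trace with made-up paths: two originals, one rewrite.
print(patch_rewrite_ratio(["src/parser.rs", "src/ast.rs", "src/parser.rs"]))  # 0.5
```

Under this rule, a ratio below 1 means most files were patched once and left alone, while a ratio above 1 means the average file was patched, then re-patched more than once.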
What this means for platform teams
This was a model comparison. The same workflow answers the actual platform-team question: is this model, this AGENTS.md configuration, this routing policy change safe to roll out on my repo?
You can't get that from a public leaderboard. A model that passes your test suite is not the same as a model shipping the behavioral changes your engineers would. Those are different claims, and the difference is where the regressions you didn't catch hide.
Measure above the gate, on your own historical work. That's the harness.
First scored result within the hour, on the Claude subscription you already have.
How are you routing between models on your repos? I'm especially curious whether anyone has measured the Fable-vs-Opus tradeoff on Python or TypeScript work.
FAQ
Did Fable 5 beat Opus 4.8?
Fable scored higher on this slice, but at n=30 on parser libraries the lead doesn't survive statistical correction — not enough data to confidently generalize. Opus 4.8 costs about a third less and is the practical default.
When is Fable 5 worth the premium?
When the task hides above-test risk: which inputs to reject, cross-dialect precedence, public API shape. And on the rare broad refactor cheaper models cannot land. For ordinary work, route to Opus 4.8.
What was Fable 5's strongest result?
Statistically significant craft and code-review advantages over GPT-5.5 and Composer (q < 0.05 after multiplicity correction).
Should every team switch to Fable 5?
No. This is a local two-repo slice, not a global ranking. The takeaway is routing: default to Opus 4.8, pay for Fable on above-test risk, and measure the policy on your own repos.