When Fable 5 Is Worth the Premium

Q: Did Fable 5 beat Opus 4.8?

Fable scored higher on this slice, but at n=30 on parser libraries the lead doesn't survive statistical correction — not enough data to confidently generalize. Opus 4.8 costs about a third less and is the practical default.

Q: When is Fable 5 worth the premium?

When the task hides above-test risk: which inputs to reject, cross-dialect precedence, public API shape. And on the rare broad refactor cheaper models cannot land. For ordinary work, route to Opus 4.8.

Q: What was Fable 5's strongest result?

Statistically significant craft and code-review advantages over GPT-5.5 and Composer (q < 0.05 after multiplicity correction).

Q: Should every team switch to Fable 5?

No. This is a local two-repo slice, not a global ranking. The takeaway is routing: default to Opus 4.8, pay for Fable on above-test risk, and measure the policy on your own repos.

June 14, 2026 · Updated June 18, 2026

Fable 5 launched last week. I wanted to know whether the premium over Opus 4.8 is real on the repos I already had instrumented. I tested Fable 5, Opus 4.8, Opus 4.7, GPT-5.5, and Composer 2.5 on 30 real tasks from two open-source repos (Go and Rust parser libraries). Fable led the headline score at 81.7. It was also the most expensive arm at $5.29/task.

Task-by-task, the model you reach for most of the time is Opus 4.8. Fable scored higher on most quality axes, but at n=30 on two parser repos, the lead doesn't survive statistical correction — not enough data to confidently generalize, especially since the task set underrepresents Fable's strengths on long-horizon work. Meanwhile you're paying about $1.73 extra per task.

Each task is a real merged PR. The agent gets a frozen repo snapshot in Docker, a prompt, and one attempt. Then Stet — the local eval harness I build — grades the result beyond pass/fail: did the patch make the same behavioral change as the human (equivalence)? Would a reviewer accept it (code review)? How much extra code did the agent touch (footprint)? Is the diff clear, minimal, and well-structured (craft)?

Cleaned n=30 slice: GraphQL Go Tools (Go, n=9) plus SQLParser (Rust, n=21). Fable leads the headline score and has calibrated craft and review advantages over GPT-5.5 and Composer. Fable also scored higher than Opus 4.8 on most quality axes, but the lead doesn't survive correction at this sample size — and Opus 4.8 costs a third less.

fable 5 comparison

Figure (interactive chart): quality frontier. Weighted Stet local score (higher is better) plotted against rollout cost per task (lower is better), one point per model. Selectable resource axes: cost, time, tokens, tools. Selectable quality axes: local score, craft, equivalence, review, tests.

local score formula

Tests are downweighted because they saturate early: every arm in this slice passed at least 26 of 30 (about 87%). This is a compact display score, not a substitute for the calibrated pairwise statistics below.

Axis	Weight
Tests	5%
Equivalence	30%
Code review	25%
Craft	25%
Footprint	15%

Local score at these weights:

Model	Local score
Fable 5	81.7
GPT-5.5	73.2
Opus 4.8	72.7
Opus 4.7	71.0
Composer	68.0
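The weighting above can be sketched as a plain weighted mean. This is a minimal illustration, not Stet's implementation: the post does not say how each axis is normalized to a common scale, so the sketch assumes every axis already arrives as a 0-100 score, and the function name, dictionary keys, and example inputs are hypothetical.

```python
# Weights as listed in the post (5/30/25/25/15 percent).
WEIGHTS = {
    "tests": 0.05,
    "equivalence": 0.30,
    "code_review": 0.25,
    "craft": 0.25,
    "footprint": 0.15,
}


def local_score(axes, weights=WEIGHTS):
    """Weighted mean of per-axis scores.

    Assumption: each axis value is already normalized to 0-100.
    Dividing by the weight total keeps the result on the same scale
    even if the weights are changed and no longer sum to 1.
    """
    total = sum(weights.values())
    return sum(weights[name] * axes[name] for name in weights) / total


# Hypothetical axis scores for illustration only, NOT the post's data.
example = {
    "tests": 100.0,
    "equivalence": 80.0,
    "code_review": 60.0,
    "craft": 70.0,
    "footprint": 75.0,
}

default_score = local_score(example)

# Reweighting toward tests shows how the display score can move.
tests_heavy = dict(WEIGHTS, tests=0.50)
reweighted_score = local_score(example, tests_heavy)
```

Because the normalization is an assumption, this sketch will not reproduce the 81.7 headline from the raw table. It only shows why a saturated axis like tests contributes little at a 5% weight.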

model comparison

per-axis breakdown, all five arms

Raw numbers across all five arms. Quality rows (Tests through Footprint): higher is better. Resource rows (Cost/task through Duration): lower is better.

Axis	Fable 5	Opus 4.8	GPT-5.5	Opus 4.7	Composer
Tests	30/30	28/30	29/30	26/30	28/30
Equivalence	26/30	22/30	23/30	20/30	20/30
Code review	18/30	18/30	14/30	16/30	12/30
Craft mean	3.24	3.12	2.93	3.09	2.76
Footprint	73.1	78.1	74.2	76.9	69.8
Cost/task	$5.29	$3.56	$3.50	$4.93	$0.51
Tool calls	25.0	70.3	84.0	66.5	96.0
Patch calls	4.7	13.0	12.8	10.3	20.0
Duration	10.0m	10.5m	8.0m	10.8m	4.6m
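Several figures quoted elsewhere in this post follow directly from the table. A minimal sketch of that arithmetic, using only the pass counts and mean costs above; the dictionary layout and function names are mine for illustration, not part of Stet.

```python
TASKS = 30  # tasks per arm in this slice

# Pass counts (out of 30) and mean cost per task, copied from the table.
arms = {
    "Fable 5":  {"tests": 30, "equivalence": 26, "review": 18, "cost": 5.29},
    "Opus 4.8": {"tests": 28, "equivalence": 22, "review": 18, "cost": 3.56},
    "GPT-5.5":  {"tests": 29, "equivalence": 23, "review": 14, "cost": 3.50},
    "Opus 4.7": {"tests": 26, "equivalence": 20, "review": 16, "cost": 4.93},
    "Composer": {"tests": 28, "equivalence": 20, "review": 12, "cost": 0.51},
}


def rate_gap_pp(axis, a, b):
    """Pass-rate gap between arm a and arm b, in percentage points."""
    return 100.0 * (arms[a][axis] - arms[b][axis]) / TASKS


def cost_ratio(a, b):
    """How many times more arm a costs per task than arm b."""
    return arms[a]["cost"] / arms[b]["cost"]


tests_gap = rate_gap_pp("tests", "Fable 5", "Opus 4.7")              # about 13.3 pp
equivalence_gap = rate_gap_pp("equivalence", "Fable 5", "Opus 4.7")  # 20.0 pp
premium_ratio = cost_ratio("Fable 5", "Opus 4.8")                    # about 1.49x
premium_dollars = arms["Fable 5"]["cost"] - arms["Opus 4.8"]["cost"] # about $1.73
```

These are point estimates from 30 tasks. They say nothing about whether a gap survives multiplicity correction.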

how confident are these comparisons?

statistical strength of each pairwise read, after correcting for multiple comparisons

Comparison	Verdict	Quality evidence	Cost and process
vs Opus 4.8 high	Fable leads, inconclusive at n=30	Local score 81.7 vs 72.7; Fable higher on most quality axes (review ties, Opus edges it on footprint) but none survive correction at this sample size	Fable costs 1.49x more, but uses fewer tokens, tools, and patch calls
vs GPT-5.5 high	q < 0.05 craft + CR	Craft mean +0.309 and CR-overall +0.367 survive BH-FDR	GPT is cheaper and faster; Fable uses fewer tokens and tools
vs Opus 4.7 xhigh	Directional	Tests +13.3pp, equivalence +20.0pp, CR-overall directional	Cost roughly par; Fable has much lower token, tool, and patch churn
vs Composer 2.5	q < 0.05 craft + CR	Craft, CR, and footprint rows survive BH-FDR for Fable	Composer is 10.3x cheaper where normalized telemetry is present

Decision-grade (DG) = survives BH-FDR multiplicity correction at q < 0.05. Directional = consistent across graders but doesn't clear that bar. Inconclusive = lead visible but sample too small to confidently generalize.
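For readers who have not met BH-FDR: the Benjamini-Hochberg procedure turns raw p-values into adjusted q-values, so that running many comparisons at once does not inflate the rate of false discoveries. Below is a minimal sketch of the textbook procedure. It is not Stet's code, and the p-values are invented to show how a raw p < 0.05 can fail the q < 0.05 bar.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices sorted by ascending p-value; rank 1 is the smallest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, taking a running minimum so
    # q-values never decrease as p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical raw p-values for four pairwise axis comparisons.
raw_p = [0.004, 0.030, 0.045, 0.400]
q_values = bh_adjust(raw_p)

# Apply the post's threshold: q < 0.05.
decision_grade = [q < 0.05 for q in q_values]
```

In this made-up example three raw p-values sit below 0.05, but only the first keeps q < 0.05 after adjustment. That is the sense in which a visible lead can fail to survive correction.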

trace example

same task, three models, one test suite: three different outcomes

pr 1727: add PostgreSQL ALTER TYPE enum support. All three arms passed tests. Only two passed review. The patch that failed review is summarized here.

SQLParser · over-permissive · review risk · Review: fail

agent.patch (ALTER TYPE parser)

@@ ALTER TYPE enum support @@
+parse ALTER TYPE ... ADD VALUE
+parse ALTER TYPE ... RENAME VALUE
+parse ALTER TYPE ... RENAME TO
+IF NOT EXISTS on ADD VALUE
+IF NOT EXISTS on RENAME VALUE
// PostgreSQL forbids this combination

When to use each model

Opus 4.8: the smart-money default

Fable scored higher than Opus 4.8 on tests, equivalence, and craft; the two tied on code-review passes (18/30 each), and Opus led on footprint. None of Fable's leads survived multiplicity correction at n=30. The sample is small, both repos are parser libraries with tightly scoped tasks, and Fable's strongest case, long-horizon feature work that crosses interface boundaries, is underrepresented here. The data points toward Fable being better; there just isn't enough of it to generalize confidently. Opus 4.8 costs about $3.56/task against Fable's $5.29, and unlike Fable, it carries no regulatory restrictions.

Opus 4.8 and GPT-5.5 are effectively tied on composite score (72.7 vs 73.2) and now cost about the same ($3.56 vs $3.50 per task). What separates them is reviewability: Opus 4.8 out-crafts GPT-5.5 (3.12 vs 2.93) and passes clean code review on more tasks (18/30 vs 14/30). When you're routing for a reviewer's trust, craft is the axis that decides it.

pr 1620 (SQLite tokenizer dialect opt-in): Opus 4.8 shipped an equivalent, review-passing patch with slightly higher craft (3.73 vs 3.69) for $0.74, less than half Fable's $1.56. Same behavioral change, same review outcome, 53% cheaper. Fable's dollars bought nothing here.

The pattern holds across ordinary, well-specified work: dialect adds, routine plumbing, tokenizer adjustments. The premium over Opus 4.8 buys nothing you can measure.

When Fable's premium is worth it

Two task shapes justify it. On this slice, the two together covered about three of the 30 tasks.

Shape one: review-cost insurance on above-test-risk work. The clearest case is pr 817 (GraphQL pub/sub argument-template validation). The hard part isn't the happy path — it's knowing which inputs to reject. Fable was the only arm whose code review passed clean. Opus 4.8 was cheaper ($5.81 vs $9.71) and hit tests and equivalence, but its review failed on exactly those hidden invalid-input holes. Tests said "fine." The reviewer said "nope."

Same shape on pr 2107 (preserve SQL comments): Fable was the only arm with a zero-risk review, while the $2.01 Opus 4.8 patch carried a latent reversed-range panic the suite never triggered.

Shape two: the broad or breaking refactor cheaper models can't land. pr 1293 was a wide, breaking GraphQL metadata reshaping — the kind of change that touches interface boundaries across the repo. Fable was the only arm whose patch passed the suite at all. That one task cost $23.07. Worth it when you genuinely need the change landed.

GPT-5.5: cheap and sometimes uniquely right

GPT-5.5 (about $3.50/task) sometimes wins outright. On pr 843 (block-string JSON serialization), GPT-5.5 was the only arm to recover the raw source and emit spec-correct output; the other four arms, at four different price points, shared one wrong answer. On pr 1232 (fetch-dedup bugfix), GPT-5.5 was both the cheapest arm and the only one whose patch cleared review.

Fable's calibrated edge over GPT-5.5 is in craft and reviewability, statistically significant at q < 0.05. GPT-5.5 is the faster, cheaper pick when the task is well-specified and you can accept a measurable craft risk.

Composer: fine draft, needs review

Composer runs at about $0.51/task, roughly 10x cheaper than Fable. The quality gap is real and statistically significant. The cautionary trace is pr 1727 (PostgreSQL ALTER TYPE): Composer's $0.26 patch passed tests but taught the parser to accept syntax PostgreSQL rejects. Both Fable ($3.99) and Opus 4.8 ($1.95) passed review on the same task.

But on commodity dialect adds, Composer ships in-spec for pennies. pr 1472 (dialect-aware ! operator): Composer's $0.22 patch passed review with perfect scope, while Fable's $3.95 attempt failed review by overgeneralizing. Composer is a fine cheap draft with review in the loop.

Opus 4.7: dominated

Rough cost-parity with Fable ($4.93 vs $5.29), directionally weaker on quality. It neither saves money nor wins on quality. I can't construct a routing rule that prefers it.

The routing summary

Route	Use when	Don't use it as
Opus 4.8 (default)	Ordinary medium-risk engineering, commodity dialect or API adds, routine plumbing.	A free substitute for review on hidden semantic-boundary work.
Fable 5	Above-test risk you might auto-merge: input-rejection validation, cross-dialect precedence, public API shape; broad or breaking refactors.	A blanket default for every patch.
GPT-5.5	Fast, well-specified, test-covered changes where craft risk is acceptable.	The final route when craft or reviewability is central.
Composer 2.5	Cheap drafts, commodity adds, low-risk reversible work — with review in the loop.	A quality peer to the premium models.

Even the winner isn't safe to merge unread. On pr 2185, Fable passed tests and equivalence, then failed review on Oracle CONNECT BY clause placement.
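The routing table above can be sketched as a small decision function. This is an illustration under stated assumptions, not a policy the post tested: the `Task` fields, the tag names, and the order in which the rules fire are all hypothetical choices of mine. The post names the risk categories but does not define a schema or a precedence.

```python
from dataclasses import dataclass

# Hypothetical tag names for the above-test risks the post lists.
ABOVE_TEST_RISK = {"input-rejection", "cross-dialect-precedence", "public-api-shape"}


@dataclass
class Task:
    tags: frozenset = frozenset()
    broad_refactor: bool = False  # wide or breaking change across interfaces
    draft_only: bool = False      # cheap, reversible work with review in the loop
    well_specified: bool = False
    test_covered: bool = False
    craft_risk_ok: bool = False   # team accepts a measurable craft risk


def route(task: Task) -> str:
    """Pick a model for a task; falls through to the default."""
    # Assumed precedence: risk first, then cheap drafts, then the fast path.
    if task.broad_refactor or task.tags & ABOVE_TEST_RISK:
        return "Fable 5"
    if task.draft_only:
        return "Composer 2.5"
    if task.well_specified and task.test_covered and task.craft_risk_ok:
        return "GPT-5.5"
    return "Opus 4.8"


default_choice = route(Task())
risky_choice = route(Task(tags=frozenset({"input-rejection"})))
```

Whatever rules a team writes down, the post's advice still applies: measure the policy on your own repos, and keep a reviewer in the loop even on the premium route.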

Cost and process

Fable's premium isn't spent on churn. It used far fewer tool calls, shell calls, and patch calls than any other arm.

Fable averaged about 25 tool calls per task against Opus 4.8 at 70, GPT-5.5 at 84, and Composer at 96. Patch calls: 4.7 for Fable against 10-20 for the rest. Patch-rewrite ratio (rewrites per original patch call): 0.49 for Fable versus 1.6-2.7 for the field. Fable bought results with money, not by repeatedly drafting and re-patching. Whether that reflects better first-draft quality or less willingness to iterate on hard tasks is not clear from this data.
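One way to compute a patch-rewrite ratio from an agent trace is sketched below. The post defines the ratio only as "rewrites per original patch call", so the counting rule here is my assumption: the first patch call that touches a file counts as an original, and any later patch call to the same file counts as a rewrite. The function name and file paths are hypothetical.

```python
def patch_rewrite_ratio(patched_paths):
    """Rewrites per original patch call.

    Assumption: `patched_paths` holds one file path per patch call, in
    call order. The first call to touch a path is an original; every
    later call to that path is a rewrite.
    """
    seen = set()
    originals = 0
    rewrites = 0
    for path in patched_paths:
        if path in seen:
            rewrites += 1
        else:
            seen.add(path)
            originals += 1
    # An empty trace has no originals, so report 0.0 rather than divide by zero.
    return rewrites / originals if originals else 0.0


# Hypothetical trace: five patch calls across three files.
calls = [
    "src/parser.rs",
    "src/ast.rs",
    "src/parser.rs",
    "tests/alter.rs",
    "src/parser.rs",
]
ratio = patch_rewrite_ratio(calls)  # 2 rewrites over 3 originals
```

Under this rule a ratio below 1 means most patch calls were first touches, and a ratio above 1 means the agent revisited files more often than it opened new ones.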

What this means for platform teams

This was a model comparison. The same workflow answers the actual platform-team question: is this model, this AGENTS.md configuration, this routing policy change safe to roll out on my repo?

You can't get that from a public leaderboard. A model that passes your test suite is not the same as a model shipping the behavioral changes your engineers would. Those are different claims, and the difference is where the regressions you didn't catch hide.

Measure above the gate, on your own historical work. That's the harness.

First scored result within the hour, on the Claude subscription you already have.


How are you routing between models on your repos? I'm especially curious whether anyone has measured the Fable-vs-Opus tradeoff on Python or TypeScript work.
