Claude Fable 5 Benchmarks: 2026 Scorecard vs GPT & Gemini

RunFreeTools Team · Jun 10, 2026 · 9 min read

TL;DR — Claude Fable 5 is the new benchmark leader across coding, reasoning, and agentic work in 2026. It posts 95.0% on SWE-bench Verified, ~80% on the tougher SWE-bench Pro (vs 58.6% for GPT‑5.5), a frontier 1932 Elo on GDPval-AA, and takes #1 on FrontierCode — while beating Opus 4.8 at every effort level. This guide decodes every score, compares Fable 5 with GPT‑5.5, Gemini 3.1 Pro, and Opus 4.8, and explains the asterisks you should know before trusting any of it.


The headline scorecard

Anthropic released Claude Fable 5 on June 9, 2026 as the first public model in its frontier "Mythos" tier. The benchmark story is unusually clean: Fable 5 doesn't just edge ahead, it opens daylight on the hardest, longest tasks. Here are the numbers that anchor the launch.

| Benchmark | Fable 5 | What it measures |
| --- | --- | --- |
| SWE-bench Verified | 95.0% | Real GitHub issue resolution |
| SWE-bench Pro | ~80% | Contamination-resistant, harder engineering |
| GDPval-AA | 1932 Elo | Economically valuable knowledge work |
| FrontierCode (Cognition) | #1 | Production-grade coding under quality bars |
| GPQA Diamond | ~94.6% | Graduate-level science reasoning |
| Terminal-Bench | State-of-the-art | Agentic terminal coding |

The pattern Anthropic emphasizes — and that partner evaluations confirm — is that Fable 5's lead grows with task length and complexity. On a quick one-shot prompt it's excellent; on a multi-hour, multi-file, judgment-heavy job, it's in a different class.

SWE-bench Verified vs SWE-bench Pro: read the right number

The single most misread benchmark in AI is SWE-bench. There are two versions, and they tell very different stories.

  • SWE-bench Verified is a human-validated set of 500 real GitHub issues where annotators confirmed the fix is solvable and the tests are reliable. Fable 5 scores 95.0% here. But the whole frontier is now clustered near the top (Opus 4.x, Gemini 3.1 Pro, and others sit between the low 80s and the low 90s), and OpenAI has publicly flagged training-data contamination concerns across the board.
  • SWE-bench Pro is the newer, contamination-resistant successor built on harder, live repositories. This is where the spread reappears. Fable 5 posts ~80%; OpenAI's GPT‑5.5 manages 58.6%. A gap of more than 20 points on the more reliable benchmark is the real signal.

The takeaway: when two models both claim "~90% on SWE-bench," check whether it's Verified (saturated, contamination-prone) or Pro (discriminating). Fable 5's lead is widest exactly where measurement is most trustworthy.

Fable 5 vs GPT‑5.5 vs Gemini 3.1 Pro vs Opus 4.8

Here's the side-by-side that most readers came for. Scores are drawn from Anthropic's launch materials, partner evaluations, and public 2026 leaderboards such as LLM-Stats; treat cross-vendor numbers as directional, since test harnesses differ.

| Dimension | Fable 5 | Opus 4.8 | GPT‑5.5 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| SWE-bench Verified | 95.0% | 88.6% | ~84% | 80.6% |
| SWE-bench Pro | ~80% | 69.2% | 58.6% | ~62% |
| GDPval-AA (Elo) | 1932 | 1890 | not listed | not listed |
| GPQA Diamond | ~94.6% | 93.6% | ~92% | ~93% |
| Long-horizon autonomy | Exceptional | Strong | Strong | Strong |
| Input / output price (per 1M tokens) | $10 / $50 | $5 / $25 | varies | varies |

Two things stand out. First, Opus 4.8 is the value champion — it's roughly half the price and still beats GPT‑5.5 and Gemini 3.1 Pro on the hard coding benchmark. Second, the jump from Opus 4.8 to Fable 5 is real but specific: it shows up most on long, agentic, multi-step work, which is exactly where the extra cost can pay for itself in fewer turns.
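To put numbers on that comparison, the short sketch below recomputes Fable 5's margin over each competitor from the table's figures. It treats the approximate entries (~80%, ~84%, ~62%) as exact point values, which is an assumption, and the names `SCORES` and `leads` are illustrative only.

```python
# Scores copied from the comparison table; "~" values are treated as exact (assumption).
SCORES = {
    "SWE-bench Verified": {"Fable 5": 95.0, "Opus 4.8": 88.6, "GPT-5.5": 84.0, "Gemini 3.1 Pro": 80.6},
    "SWE-bench Pro": {"Fable 5": 80.0, "Opus 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 62.0},
}


def leads(benchmark, leader="Fable 5"):
    """Return the leader's margin, in percentage points, over every other model."""
    row = SCORES[benchmark]
    return {model: round(row[leader] - score, 1) for model, score in row.items() if model != leader}


for name in SCORES:
    print(name, leads(name))
```

On these figures every margin is wider on Pro than on Verified: the lead over Opus 4.8 grows from 6.4 to 10.8 points, and the lead over GPT‑5.5 from 11.0 to 21.4.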

Coding benchmarks, in depth

Software engineering is where Fable 5's reputation was made, and partner-run evals are arguably more meaningful than synthetic tests because they measure production-grade work:

  • FrontierCode (Cognition): highest among frontier models — and it leads even at medium effort, meaning you don't have to pay for maximum reasoning tokens to get top results.
  • CursorBench (Cursor): state-of-the-art; unlocked a class of long-horizon problems earlier models couldn't reach.
  • ViBench (Replit): the highest-performing model tested on end-to-end "vibe coding," nearly saturating their base cases while using fewer tokens.
  • Terminal-Bench: state-of-the-art agentic terminal coding — the multi-command, tool-using work that real engineering automation depends on.

The most quoted real-world data point: during early access, Stripe reported Fable 5 compressed months of engineering into days, running a codebase-wide migration inside a 50-million-line Ruby codebase in a single day. That's not a benchmark; it's the capability that benchmarks are trying to predict.

Reasoning, finance, and knowledge work

Coding gets headlines, but the analytical gains matter just as much for business users:

  • GPQA Diamond (graduate-level science): Fable 5 lands around 94.6%, narrowly ahead of Opus 4.8's 93.6%.
  • GDPval-AA (economically valuable knowledge work): a frontier 1932 Elo, up from Opus 4.8's 1890.
  • Hebbia Finance Benchmark (senior-level financial reasoning): the highest score of any model, with big gains in document reasoning, chart/table interpretation, and multi-step problem solving.
  • One analytics partner reported Fable 5 was the first model to break 90% on their core benchmark of complex, long-running analytical tasks — roughly a 10-point jump over Opus.
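For a sense of scale on the GDPval-AA figures, the sketch below converts an Elo gap into an expected head-to-head preference rate. It assumes the leaderboard follows the conventional logistic Elo curve with a 400-point scale; the article does not state the scoring details, so treat the result as a rough reading rather than the benchmark's own definition.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the conventional logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted above: Fable 5 at 1932, Opus 4.8 at 1890.
p = elo_expected_score(1932, 1890)
print(f"{p:.3f}")
```

Under that assumption, a 42-point gap corresponds to Fable 5 being preferred in roughly 56% of pairwise comparisons against Opus 4.8: a real edge, but not a blowout.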

Vision and long-context

Fable 5 is the new state-of-the-art for vision, and the demonstrations show range: it can rebuild a web app's source code from screenshots alone, and it beat Pokémon FireRed using a vision-only harness — something earlier Claude models couldn't do even with elaborate helper scaffolding.

On long-context memory, the Slay the Spire test is the tell: when given persistent file-based memory, Fable 5's improvement was 3× the improvement Opus 4.8 got from the same memory boost. The lesson for anyone building long-running agents: Fable 5 is far better at actually using accumulated context instead of drowning in it.

The asterisk: Fable 5 is not always Mythos 5

Here's the caveat almost no benchmark table mentions. Fable 5 and Mythos 5 are the same model, but Fable 5 ships with safety classifiers. When a query touches cybersecurity, biology/chemistry, or model distillation, Fable 5 silently hands off to Opus 4.8 — which means a public Fable 5 result on those specific task types can reflect Opus's ceiling, not the frontier model's.

Anthropic says fallback happens in under 5% of sessions, so for the overwhelming majority of coding, reasoning, and analysis work, the benchmark numbers above are what you'll actually get. But if your workload is cyber- or bio-adjacent, treat the headline scores as the Mythos 5 ceiling, not a guarantee.

How to read any 2026 benchmark

Benchmarks are useful and abused in equal measure. Three sanity checks before you trust a leaderboard:

  1. Verified vs Pro. Saturated, contamination-prone benchmarks (SWE-bench Verified) compress the field; harder successors (SWE-bench Pro) reveal the real gaps.
  2. Self-reported vs independent. Many launch numbers are vendor-run. Partner evals and third-party leaderboards add credibility; a 2026 Berkeley RDI study even showed several popular agent benchmarks could be gamed to near-perfect scores without solving the tasks.
  3. Your workload is the only benchmark that counts. Cross-vendor harnesses differ. The honest move is to run the model on your codebase, your documents, your tasks — which is exactly what the free evaluation window through June 22 is for.
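The third check can be sketched as a tiny private eval: run the model over your own tasks, count the passes, and attach an uncertainty band so a small test set is not over-read. Everything here is a hypothetical stand-in: `toy_solve` takes the place of a real model call, the tasks are toy arithmetic, and the Wilson score interval is one reasonable choice of interval, not something the article prescribes.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total == 0:
        raise ValueError("need at least one task")
    p = passes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def evaluate(solve, tasks):
    """Run `solve` on each (prompt, check) pair and count how many checks pass."""
    passes = sum(1 for prompt, check in tasks if check(solve(prompt)))
    return passes, len(tasks)


def toy_solve(prompt):
    """Hypothetical stand-in for a model call; deliberately wrong on every tenth task."""
    n = int(prompt.split()[1])
    return n + 1 if n % 10 else n


# Toy tasks: each pairs a prompt with a checker for the expected answer.
tasks = [(f"add {i} and 1", lambda out, i=i: out == i + 1) for i in range(40)]

passes, total = evaluate(toy_solve, tasks)
low, high = wilson_interval(passes, total)
print(f"{passes}/{total} passed, 95% interval {low:.2f} to {high:.2f}")
```

With only 40 tasks, a 90% pass rate carries an interval roughly 19 points wide, so a few points of difference between two models on a short private test set is not strong evidence either way.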

Token efficiency: the benchmark that decides cost

Raw capability isn't the only axis that matters — how many tokens a model burns to reach a result decides your bill, and this is a quiet Fable 5 strength.

On Cognition's FrontierCode, Fable 5 led even at medium effort, meaning you don't have to crank reasoning (and token spend) to maximum for top-tier output. A major spreadsheet partner found Fable 5 beat Opus 4.8 at every effort level while finishing 25–30% faster with fewer turns. A frontier-physics partner reported Fable 5 reached in 36 hours roughly where a competitor model landed after four days, using a third of the reasoning tokens.

The implication for cost is real: while Fable 5's sticker price is 2× Opus 4.8, on hard multi-turn jobs where Opus needs more turns, the delivered cost can be far closer than the per-token gap suggests. Benchmark the cost per completed task, not per token.
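A back-of-the-envelope version of that comparison is sketched below. The per-token prices come from the comparison table; the turn counts, tokens per turn, and success rates are invented purely for illustration, and dividing by the success rate assumes failed attempts are simply retried at the same cost.

```python
# USD per 1M tokens as (input, output), taken from the comparison table.
PRICES = {
    "Fable 5": (10.0, 50.0),
    "Opus 4.8": (5.0, 25.0),
}


def cost_per_completed_task(model, turns, in_tokens_per_turn, out_tokens_per_turn, success_rate):
    """Expected spend in USD per successfully completed task."""
    in_price, out_price = PRICES[model]
    per_attempt = turns * (in_tokens_per_turn * in_price + out_tokens_per_turn * out_price) / 1_000_000
    # Assumes failures are retried independently, so expected attempts = 1 / success_rate.
    return per_attempt / success_rate


# Hypothetical workload: every number below is made up for the example.
fable = cost_per_completed_task("Fable 5", turns=14, in_tokens_per_turn=20_000,
                                out_tokens_per_turn=2_000, success_rate=0.80)
opus = cost_per_completed_task("Opus 4.8", turns=20, in_tokens_per_turn=20_000,
                               out_tokens_per_turn=2_000, success_rate=0.70)
print(f"Fable 5: ${fable:.2f}  Opus 4.8: ${opus:.2f}  ratio: {fable / opus:.2f}x")
```

With these invented inputs the 2× per-token gap shrinks to about 1.2× per completed task. Swap in turn counts and success rates measured on your own workload to see where the break-even actually sits.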

The agentic benchmarks: working alone, for hours

The benchmarks that best predict product value are the autonomous ones — can the model hold a goal, build its own tools, and self-correct over a long run? Anthropic's demonstrations are striking:

  • Solar-system simulation: Fable 5 derived the planets' orbital motion from physics first principles, built a simulation, and used it to predict solar eclipses.
  • Factorio: it autonomously played the notoriously complex factory-building game, strategizing and building an automated factory on its own.
  • VibeCAD: Fable 5 designed a complete 3D-printable model in a browser CAD editor — one that it had also built, including a built-in AI copilot.
  • Slay the Spire (memory): with persistent file-based memory, Fable 5 improved 3× more than Opus 4.8 and reached the final act 3× more often.

None sit on a public leaderboard, but together they measure the trait that matters most for agents: sustained, self-directed competence over long horizons.

What the early-access partners reported

Partner evaluations carry more weight than synthetic benchmarks because they run on real production work. The consistent verdict:

  • Stripe: a codebase-wide migration in a 50M-line Ruby codebase in a day — work estimated at two-plus months by hand.
  • Cursor: state-of-the-art on their internal coding benchmark; unlocked long-horizon problems earlier models couldn't reach.
  • GitHub: complex, long-horizon coding with autonomy and reliability beyond their previous benchmarks.
  • Cognition (Devin): highest-scoring model on their frontier coding eval, praising long-horizon reasoning and out-of-the-box tool use.
  • Replit: highest-performing model on their end-to-end vibe-coding benchmark, nearly saturating base cases with fewer tokens.

The verdict

On the numbers, Claude Fable 5 is the most capable model you can use in 2026 — clearly so on the discriminating benchmarks (SWE-bench Pro, FrontierCode, GDPval-AA) and decisively so on long-horizon, agentic work. But the smartest read isn't "Fable 5 wins everything." It's this:

  • Use Opus 4.8 as your cost-effective default — at half the price it still beats GPT‑5.5 and Gemini 3.1 Pro on hard coding.
  • Escalate to Fable 5 when the task is long, complex, or high-value enough that fewer turns and higher reliability justify the 2× price.

The frontier moved. Just make sure you're reading the benchmark that actually measures it.

Frequently asked questions

Fable 5 scores 95.0% on SWE-bench Verified, around 80% on SWE-bench Pro, 1932 Elo on GDPval-AA, and ranks #1 on Cognition's FrontierCode, leading on the hardest, longest agentic tasks.

On the contamination-resistant SWE-bench Pro, Fable 5 scores about 80% versus GPT-5.5's 58.6%, a lead of more than 20 points on the more reliable coding benchmark.

Against Gemini 3.1 Pro, Fable 5 leads on SWE-bench Verified (95.0% vs 80.6%) and on SWE-bench Pro (about 80% vs about 62%), with its biggest advantages on long-horizon, multi-step work.

SWE-bench Verified is a human-validated 500-task set that is now saturated and contamination-prone; SWE-bench Pro is a harder, contamination-resistant successor where the real gaps between models reappear.

Fable 5's benchmark numbers are strong but partly self-reported. The hardest benchmarks, like SWE-bench Pro and FrontierCode, are the most trustworthy, but always validate on your own workloads: a 2026 Berkeley RDI study showed that several agent benchmarks can be gamed.

Fable 5 is worth the higher price for long, complex, high-value tasks, where it finishes in fewer turns with higher reliability. For routine work, Opus 4.8 is the cheaper smart default and still beats GPT-5.5 on hard coding.

Fable 5's classifiers hand cyber, bio and distillation queries to Opus 4.8, so on those task types the public model reflects Opus's ceiling — the frontier scores reflect the unrestricted Mythos 5.
