Methodology

How we build the leaderboard, where every number comes from, and exactly how the RunFree Score is computed. We publish this in full because transparency is the point: you should be able to check our work.

Where the data comes from

Catalog & pricing

Model list, context windows, capabilities and live API pricing come from the OpenRouter Models API, refreshed hourly. Prices are USD per 1M tokens; the sortable “blended” price is a 3:1 input:output weighting.
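
As a rough illustration of the blended price, the sketch below treats the 3:1 weighting as a weighted average of the input and output prices. The function name `blended_price` and the choice to divide by the sum of the weights are assumptions made for this example, not a description of the site's actual code; the prices used are illustrative only.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Blend input and output prices (USD per 1M tokens) into one sortable number.

    Assumption: "3:1 input:output weighting" means a weighted average in which
    the input price counts three times as much as the output price.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Illustrative prices only: $3.00 per 1M input tokens, $15.00 per 1M output tokens.
example = blended_price(3.0, 15.0)
print(f"${example:.2f} per 1M tokens (blended)")
```

With these example prices the blend is (3 × 3.00 + 1 × 15.00) / 4 = 6.00 USD per 1M tokens.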

Benchmark scores

Curated by hand from primary sources only — official model cards, vendor launch posts, arXiv papers and open leaderboards (LMArena, Epoch AI). Every score stores its source link and date.

The RunFree Score

A single, sortable 0-100 intelligence index. We compute it in three steps so that benchmarks of very different difficulty (a 30% on Humanity's Last Exam vs an 85% on SWE-Bench) become directly comparable:

  1. Normalize each benchmark across the field. For every benchmark, the lowest-scoring model becomes 0 and the highest-scoring becomes 100 (min-max), so each one measures a model's standing relative to its peers.
  2. Average within each category. A model's normalized scores are averaged inside each of the four categories below.
  3. Weight the categories. The category scores are combined with the weights shown, re-normalized over whichever categories the model actually has data for.

Category                 Weight
Reasoning & knowledge    30%
Coding & software        30%
Math                     20%
Agentic / tool use       20%

A model needs results in at least two of the four categories to receive a RunFree Score. Models below that bar are marked “insufficient data” rather than given a guessed number.
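
The three steps and the two-category minimum can be sketched as follows. This is a minimal illustration under stated assumptions, not the site's production code: the function names, the category keys, the model names and every raw score are hypothetical. The handling of a benchmark where all models tie (it is dropped) is also an assumption, since the methodology does not specify that case.

```python
from statistics import mean

# Category weights from the table above (keys are hypothetical short names).
CATEGORY_WEIGHTS = {"reasoning": 0.30, "coding": 0.30, "math": 0.20, "agentic": 0.20}
MIN_CATEGORIES = 2  # fewer than this -> "insufficient data"


def min_max_normalize(raw):
    """Step 1: map one benchmark's raw scores {model: score} onto 0..100."""
    if not raw:
        return {}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # Assumption: a benchmark where every model ties carries no ranking
        # information, so it is dropped rather than mapped to an arbitrary value.
        return {}
    return {model: 100.0 * (score - lo) / (hi - lo) for model, score in raw.items()}


def runfree_scores(results, benchmark_category):
    """results: {benchmark: {model: raw_score}}; returns {model: score or None}."""
    per_model = {}  # model -> category -> list of normalized scores
    for bench, raw in results.items():
        category = benchmark_category[bench]
        for model, norm in min_max_normalize(raw).items():
            per_model.setdefault(model, {}).setdefault(category, []).append(norm)

    scores = {}
    for model, cats in per_model.items():
        if len(cats) < MIN_CATEGORIES:
            scores[model] = None  # shown as "insufficient data"
            continue
        # Step 2: average within each category.
        cat_means = {c: mean(vals) for c, vals in cats.items()}
        # Step 3: weight the categories, re-normalized over those present.
        total_weight = sum(CATEGORY_WEIGHTS[c] for c in cat_means)
        weighted = sum(CATEGORY_WEIGHTS[c] * m for c, m in cat_means.items())
        scores[model] = weighted / total_weight
    return scores


# Made-up raw scores for four hypothetical models (not real results).
results = {
    "GPQA Diamond": {"model-a": 60.0, "model-b": 80.0, "model-c": 70.0, "model-d": 65.0},
    "SWE-Bench Verified": {"model-a": 50.0, "model-b": 70.0},
    "AIME 2025": {"model-b": 90.0, "model-c": 40.0},
}
benchmark_category = {
    "GPQA Diamond": "reasoning",
    "SWE-Bench Verified": "coding",
    "AIME 2025": "math",
}
scores = runfree_scores(results, benchmark_category)
for model, score in sorted(scores.items()):
    print(model, "insufficient data" if score is None else round(score, 1))
```

In this toy field, model-c has results in reasoning (normalized 50) and math (normalized 0) only, so its weights 30% and 20% are rescaled to sum to 1 and the score is (0.30 × 50 + 0.20 × 0) / 0.50 = 30. model-d has results in a single category and is reported as insufficient data. Because min-max only uses each benchmark's own lowest and highest values, the same normalization applies to a percent-type benchmark and to an Elo-type one.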

Benchmarks we track

We focus on the evaluations that still separate frontier models in 2026, and skip saturated ones (MMLU, HumanEval, GSM8K) except as history.

Benchmark              Type
GPQA Diamond           percent
Humanity's Last Exam   percent
MMLU-Pro               percent
SWE-Bench Verified     percent
LiveCodeBench          percent
AIME 2025              percent
MATH-500               percent
Terminal-Bench         percent
τ²-Bench               percent
SciCode                percent
LMArena Elo            elo

Freshness

Pricing and the catalog refresh hourly from OpenRouter. Benchmark scores carry the date they were published, shown as “as of” on each model. Last-updated stamps are read from real data, never faked.

What we don't do

No fabricated numbers — a missing score is “insufficient data,” not a guess. We don't scrape or redistribute Artificial Analysis or paid pricing feeds; every figure traces to a primary source you can open.

Spotted an error or a stale number? Tell us — corrections are welcome.