Methodology

How we build the leaderboard, where every number comes from, and exactly how the RunFree Score is computed. We publish this in full because transparency is the point: you should be able to check our work.

Where the data comes from

Catalog & pricing

Model list, context windows, capabilities and live API pricing come from the OpenRouter Models API, refreshed hourly. Prices are USD per 1M tokens; the sortable “blended” price is a 3:1 input:output weighting.
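The blended price can be sketched as a weighted average. This is a minimal illustration that assumes "3:1 input:output" means three parts input price to one part output price; the function name and the example prices are hypothetical, not taken from our codebase.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average of input and output prices, in USD per 1M tokens.

    Assumption: the 3:1 weighting is a plain weighted mean.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_per_m + output_weight * output_per_m) / total_weight


# Hypothetical prices: $2.00 input and $10.00 output per 1M tokens.
print(blended_price(2.00, 10.00))  # (3*2 + 1*10) / 4 = 4.0
```

Under this reading, a model that is cheap on input but expensive on output still sorts close to its input price, which matches workloads that read far more tokens than they write.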

Benchmark scores

Curated by hand from primary sources only — official model cards, vendor launch posts, arXiv papers and open leaderboards (LMArena, Epoch AI). Every score stores its source link and date.
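One way to picture a stored score is as a small record that carries its provenance with it. The sketch below is illustrative only: the class name, field names, and example values are assumptions, not our actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkScore:
    """Hypothetical record for one hand-curated benchmark result."""
    model: str
    benchmark: str
    value: float
    unit: str        # "percent" or "elo", as in the benchmark table
    source_url: str  # link to the primary source
    published: date  # date the score was published

    def as_of(self) -> str:
        """Label shown next to the score."""
        return f"as of {self.published.isoformat()}"


# Placeholder values, not real data.
score = BenchmarkScore(
    model="example-model",
    benchmark="GPQA Diamond",
    value=70.0,
    unit="percent",
    source_url="https://example.com/model-card",
    published=date(2026, 1, 15),
)
print(score.as_of())  # as of 2026-01-15
```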

The RunFree Score

A single, sortable intelligence index from 0 to 100. We compute it in three steps so that benchmarks of very different difficulty (a 30% on Humanity's Last Exam vs. an 85% on SWE-Bench Verified) become directly comparable:

1. Normalize each benchmark across the field. For every benchmark, the lowest model becomes 0 and the highest becomes 100 (min-max), so each one measures a model's standing relative to its peers.
2. Average within each category. A model's normalized scores are averaged inside each of the four categories below.
3. Weight the categories. The category scores are combined with the weights shown, re-normalized over whichever categories the model actually has data for.
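Steps 1 and 2 can be sketched as follows. This is a minimal illustration with made-up model names and scores; the function names are hypothetical. The text does not say what happens when every model ties on a benchmark, so the sketch simply raises an error in that case.

```python
from statistics import mean


def min_max_normalize(raw_scores: dict[str, float]) -> dict[str, float]:
    """Rescale one benchmark so the lowest model is 0 and the highest is 100."""
    lowest = min(raw_scores.values())
    highest = max(raw_scores.values())
    if highest == lowest:
        # Assumption: tie handling is not specified, so refuse to guess.
        raise ValueError("all models tie on this benchmark")
    span = highest - lowest
    return {model: 100.0 * (score - lowest) / span
            for model, score in raw_scores.items()}


def category_average(normalized_scores: list[float]) -> float:
    """Average a model's normalized scores within one category."""
    return mean(normalized_scores)


# Made-up raw percentages on a single benchmark.
normalized = min_max_normalize({"model-a": 10.0, "model-b": 20.0, "model-c": 30.0})
print(normalized)  # {'model-a': 0.0, 'model-b': 50.0, 'model-c': 100.0}

# A model with normalized scores of 50 and 100 on two benchmarks in one category.
print(category_average([50.0, 100.0]))  # 75.0
```

Because the scale is relative to the field, a model's normalized score can move when other models are added or removed, even if its own raw result is unchanged.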

Category	Weight
Reasoning & knowledge	30%
Coding & software	30%
Math	20%
Agentic / tool use	20%

A model needs results in at least two of the four categories to receive a RunFree Score. Models below that bar are marked “insufficient data” rather than given a guessed number.
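The weighting step and the two-category rule can be sketched like this. The weights come from the table above; the dictionary keys, function name, and example scores are hypothetical, and `None` stands in for “insufficient data”.

```python
from typing import Optional

# Category weights in percent, from the table above. Keys are illustrative names.
WEIGHTS = {
    "reasoning_knowledge": 30,
    "coding_software": 30,
    "math": 20,
    "agentic_tool_use": 20,
}
MIN_CATEGORIES = 2


def runfree_score(category_scores: dict[str, float]) -> Optional[float]:
    """Combine category averages, re-normalizing over the categories present.

    Returns None (insufficient data) when fewer than two categories have data.
    """
    present = {c: s for c, s in category_scores.items() if c in WEIGHTS}
    if len(present) < MIN_CATEGORIES:
        return None
    total_weight = sum(WEIGHTS[c] for c in present)
    return sum(WEIGHTS[c] * s for c, s in present.items()) / total_weight


# Made-up category averages for a model with data in two categories only.
print(runfree_score({"coding_software": 80.0, "math": 60.0}))  # (30*80 + 20*60) / 50 = 72.0

# A single category is not enough for a score.
print(runfree_score({"math": 90.0}))  # None
```

Dividing by the sum of the weights that are present is what "re-normalized" means here: a model with two categories is scored as if those two made up the whole index.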

Benchmarks we track

We focus on the evaluations that still separate frontier models in 2026, and skip saturated ones (MMLU, HumanEval, GSM8K) except as history.

Benchmark	Measures	Type
GPQA Diamond	reasoning	percent
Humanity's Last Exam	knowledge	percent
MMLU-Pro	knowledge	percent
SWE-Bench Verified	coding	percent
LiveCodeBench	coding	percent
AIME 2025	math	percent
MATH-500	math	percent
Terminal-Bench	agentic	percent
τ²-Bench	agentic	percent
SciCode	coding	percent
LMArena Elo	composite	elo

Freshness

Pricing and the catalog refresh hourly from OpenRouter. Benchmark scores carry the date they were published, shown as “as of” on each model. Last-updated stamps are read from real data, never faked.

What we don't do

No fabricated numbers — a missing score is “insufficient data,” not a guess. We don't scrape or redistribute Artificial Analysis or paid pricing feeds; every figure traces to a primary source you can open.

Spotted an error or a stale number? Tell us — corrections are welcome.