Methodology
How we build the leaderboard, where every number comes from, and exactly how the RunFree Score is computed. We publish this in full because transparency is the point: you should be able to check our work.
Where the data comes from
The RunFree Score
A single, sortable 0-100 intelligence index. We compute it in three steps so that benchmarks of very different difficulty (a 30% on Humanity's Last Exam vs. an 85% on SWE-Bench Verified) become directly comparable:
1. Normalize each benchmark across the field. For every benchmark, the lowest-scoring model becomes 0 and the highest-scoring becomes 100 (min-max), so each normalized score measures a model's standing relative to its peers.
2. Average within each category. A model's normalized scores are averaged inside each of the four categories below.
3. Weight the categories. The category scores are combined with the weights shown below, re-normalized over whichever categories the model actually has data for.
| Category | Weight |
|---|---|
| Reasoning & knowledge | 30% |
| Coding & software | 30% |
| Math | 20% |
| Agentic / tool use | 20% |
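The three steps and the weights above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function names, the category keys, the model names, and the raw numbers are all hypothetical, and the handling of a benchmark where every model ties is an assumption, since the steps above do not cover that case.

```python
from statistics import mean

# Category weights in percent, taken from the table above.
WEIGHTS = {"reasoning_knowledge": 30, "coding": 30, "math": 20, "agentic": 20}

def min_max_normalize(raw):
    """Step 1: rescale one benchmark so the lowest model is 0 and the highest is 100."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # Assumption: with no spread, give every model 0 rather than divide by zero.
        return {model: 0.0 for model in raw}
    return {model: 100.0 * (value - lo) / (hi - lo) for model, value in raw.items()}

def category_scores(model, normalized, benchmark_category):
    """Step 2: average a model's normalized scores inside each category it has data for."""
    by_category = {}
    for bench, scores in normalized.items():
        if model in scores:
            by_category.setdefault(benchmark_category[bench], []).append(scores[model])
    return {cat: mean(values) for cat, values in by_category.items()}

def weighted_score(cat_scores):
    """Step 3: weighted average, re-normalized over the categories present."""
    total_weight = sum(WEIGHTS[cat] for cat in cat_scores)
    return sum(WEIGHTS[cat] * score for cat, score in cat_scores.items()) / total_weight

# Illustrative raw results (made-up numbers); model_c has no math result.
RAW = {
    "GPQA Diamond": {"model_a": 80.0, "model_b": 60.0, "model_c": 70.0},
    "SWE-Bench Verified": {"model_a": 50.0, "model_b": 70.0, "model_c": 60.0},
    "AIME 2025": {"model_a": 90.0, "model_b": 70.0},
}
BENCHMARK_CATEGORY = {
    "GPQA Diamond": "reasoning_knowledge",
    "SWE-Bench Verified": "coding",
    "AIME 2025": "math",
}

normalized = {bench: min_max_normalize(raw) for bench, raw in RAW.items()}
scores = {
    model: weighted_score(category_scores(model, normalized, BENCHMARK_CATEGORY))
    for model in ("model_a", "model_b", "model_c")
}
print(scores)  # {'model_a': 62.5, 'model_b': 37.5, 'model_c': 50.0}
```

In this toy field, model_c is scored over reasoning and coding only, so its weights are re-normalized over 60 rather than 100 and the missing math result does not count as a zero.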
A model needs results in at least two of the four categories to receive a RunFree Score. Models below that bar are marked "insufficient data" rather than given a guessed number.
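The coverage rule can be expressed as a small check that runs before any score is reported. The sketch below is illustrative only: the category keys and the `coverage_status` helper are hypothetical names, and treating a `None` value as "no data" is an assumption.

```python
# The four scoring categories (hypothetical key names).
CATEGORIES = ("reasoning_knowledge", "coding", "math", "agentic")
MIN_CATEGORIES = 2  # from the rule above: at least two of the four

def coverage_status(cat_scores):
    """Return 'scored' if enough categories have data, else 'insufficient data'."""
    covered = [cat for cat in CATEGORIES if cat_scores.get(cat) is not None]
    return "scored" if len(covered) >= MIN_CATEGORIES else "insufficient data"

print(coverage_status({"math": 80.0}))                  # insufficient data
print(coverage_status({"math": 80.0, "coding": 55.0}))  # scored
```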
Benchmarks we track
We focus on the evaluations that still separate frontier models in 2026, and skip saturated ones (MMLU, HumanEval, GSM8K) except as history.
| Benchmark | Measures | Type |
|---|---|---|
| GPQA Diamond | reasoning | percent |
| Humanity's Last Exam | knowledge | percent |
| MMLU-Pro | knowledge | percent |
| SWE-Bench Verified | coding | percent |
| LiveCodeBench | coding | percent |
| AIME 2025 | math | percent |
| MATH-500 | math | percent |
| Terminal-Bench | agentic | percent |
| τ²-Bench | agentic | percent |
| SciCode | coding | percent |
| LMArena Elo | composite | elo |
Spotted an error or a stale number? Tell us — corrections are welcome.