AI benchmarks, explained
The tests behind the leaderboard. Each one probes a different skill: graduate science, real GitHub bug-fixing, competition math, terminal agents.
Reasoning / knowledge
Humanity's Last Exam
Around 2,500 expert-written, closed-ended questions across 100+ academic subjects. Among the hardest broad-knowledge exams in use; frontier models still score well below human experts.
MMLU-Pro
A harder, reasoning-heavy rebuild of MMLU covering 14 disciplines, with 10 answer choices instead of 4. The wider answer set lowers the random-guess baseline, which reduces saturation and sensitivity to prompt wording.
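The effect of the wider answer set on the chance baseline can be checked with a little arithmetic. The sketch below assumes uniform random guessing over single-answer questions; the function name `random_guess_accuracy` is hypothetical and not part of any benchmark's tooling.

```python
def random_guess_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on a single-answer question."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 1.0 / num_choices


# Assumption: every question offers the full set of options.
mmlu_floor = random_guess_accuracy(4)       # 4 options, as in MMLU
mmlu_pro_floor = random_guess_accuracy(10)  # 10 options, as in MMLU-Pro

print(f"MMLU chance baseline:     {mmlu_floor:.0%}")
print(f"MMLU-Pro chance baseline: {mmlu_pro_floor:.0%}")
```

Under that assumption, a score of 30% sits barely above chance with 4 options but well above it with 10, so the same headline number carries more signal.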
Coding
SWE-Bench Verified
500 human-validated real GitHub issues; the model must produce a patch that makes the repo's hidden tests pass. The standard agentic software-engineering benchmark.
LiveCodeBench
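Contamination-resistant competitive-programming problems collected continuously from LeetCode, AtCoder and Codeforces. Each problem carries a release date, so a model can be scored only on problems published after its training cutoff, which it cannot have seen in training data.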
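The date-based idea can be sketched as a simple filter. This is a minimal illustration, not LiveCodeBench's actual code: the `Problem` class, the `problems_after_cutoff` helper, and the sample entries and dates are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    title: str
    source: str
    released: date  # date the problem was first published


def problems_after_cutoff(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical sample data, for illustration only.
problems = [
    Problem("example-a", "LeetCode", date(2024, 3, 1)),
    Problem("example-b", "AtCoder", date(2025, 2, 15)),
    Problem("example-c", "Codeforces", date(2025, 6, 30)),
]

fresh = problems_after_cutoff(problems, training_cutoff=date(2024, 12, 31))
print([p.title for p in fresh])  # ['example-b', 'example-c']
```

Because the cutoff differs per model, the evaluation window shifts with it: a model with a later cutoff is scored on a smaller, newer subset.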
SciCode
Research-level scientific coding: implement methods from physics, biology, chemistry and materials science papers as runnable, test-verified code.
Math
AIME 2025
The 2025 American Invitational Mathematics Examination: hard competition problems, 15 per paper (AIME I and AIME II), each with an integer answer from 0 to 999. A leading discriminator of mathematical reasoning.
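Integer answers make grading a matter of exact match. The sketch below shows one way such scoring could work; treating the last integer in a response as the final answer is an assumption of this example, the helper names are hypothetical, and the responses and reference answers are invented rather than real AIME data.

```python
import re


def extract_final_integer(response: str):
    """Return the last integer appearing in the response, or None if there is none."""
    matches = re.findall(r"\d+", response)
    return int(matches[-1]) if matches else None


def score_integer_answers(responses, answers):
    """Fraction of responses whose extracted integer exactly matches the reference."""
    if len(responses) != len(answers) or not answers:
        raise ValueError("responses and answers must be non-empty and equal in length")
    correct = sum(extract_final_integer(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


# Invented examples: two correct final answers, one response with no number.
responses = ["... so the answer is 204", "I get 17, then 25", "no idea"]
answers = [204, 25, 113]
print(score_integer_answers(responses, answers))
```

Exact match leaves no room for partial credit, which is part of why small differences in reasoning quality show up clearly in the score.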
MATH-500
A 500-problem slice of the MATH dataset spanning algebra to number theory at five difficulty levels, used to evaluate step-by-step mathematical reasoning.
Agentic / tool use
Terminal-Bench
Real, end-to-end agentic tasks executed inside a live terminal and sandbox (build, debug, configure systems). Measures tool use and long-horizon execution.
τ²-Bench
Tool-Agent-User benchmark: the model handles realistic customer tasks (telecom, retail, airline) by calling tools and following policy across multi-turn conversations.