AI benchmarks, explained

The tests behind the leaderboard. Each one probes a different skill: graduate science, real GitHub bug-fixing, competition math, terminal agents.

Reasoning

GPQA Diamond

Graduate-level, Google-proof science questions (the hard 'Diamond' subset). PhD-level chemistry, biology and physics questions that domain experts answer correctly only about 65% of the time, written to resist web lookup.

Knowledge

Humanity's Last Exam

Around 2,500 expert-written, closed-ended questions across 100+ academic subjects. Among the hardest broad-knowledge exams in use; frontier models still score well below human experts.

MMLU-Pro

A harder, reasoning-heavy rebuild of MMLU with 10 answer choices instead of 4, cutting saturation and prompt sensitivity across 14 disciplines.
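Part of the effect of moving from 4 to 10 options is a lower random-guessing floor, which leaves more of the score range to real signal. The sketch below only illustrates that arithmetic; the function names are made up for this example, and the chance-corrected rescaling is a common convention, not something MMLU-Pro is stated to use.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy from uniform random guessing."""
    return 1.0 / num_choices


def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1.

    Illustrative convention only (an assumption, not part of the benchmark).
    """
    floor = chance_accuracy(num_choices)
    return (accuracy - floor) / (1.0 - floor)


for k in (4, 10):
    print(f"{k} choices: guessing floor = {chance_accuracy(k):.0%}")
# 4 choices: guessing floor = 25%
# 10 choices: guessing floor = 10%
```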

Coding

SWE-Bench Verified

500 human-validated real GitHub issues; the model must produce a patch that makes the repo's hidden tests pass. The standard agentic software-engineering benchmark.
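The pass/fail rule can be sketched roughly as follows. This is a simplified illustration, not the official evaluation harness: the argument names are modelled on the FAIL_TO_PASS and PASS_TO_PASS test lists in the SWE-bench dataset, while the helper functions and the test names in the usage line are hypothetical.

```python
def is_resolved(test_outcomes: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """True if every required test passed after the model's patch was applied.

    fail_to_pass: tests that failed before the fix and must now pass.
    pass_to_pass: tests that already passed and must not regress.
    A test missing from test_outcomes counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_outcomes.get(name, False) for name in required)


def resolved_rate(results: list[bool]) -> float:
    """Fraction of instances resolved; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical test names, for illustration only.
outcomes = {"test_parses_empty_input": True, "test_existing_behaviour": True}
print(is_resolved(outcomes, ["test_parses_empty_input"], ["test_existing_behaviour"]))  # True
```

Under this rule a patch that fixes the issue but breaks an existing test does not count as resolved.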

LiveCodeBench

Competitive-programming problems collected continuously from LeetCode, AtCoder and Codeforces and tagged with release dates. Scoring a model only on problems published after its training cutoff limits contamination from training data.
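The idea of keeping only problems newer than a model's training data can be sketched as a date filter. Everything below is hypothetical (the Problem class, the problem titles, the dates and the cutoff); it shows the filtering idea, not LiveCodeBench's actual tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    title: str
    released: date


def after_cutoff(problems: list[Problem], cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > cutoff]


# Made-up problems and an assumed training cutoff.
problems = [
    Problem("two-pointers-easy", date(2023, 5, 1)),
    Problem("interval-dp-hard", date(2024, 9, 15)),
]
fresh = after_cutoff(problems, cutoff=date(2024, 1, 1))
print([p.title for p in fresh])  # ['interval-dp-hard']
```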

SciCode

Research-level scientific coding: implement methods from physics, biology, chemistry and materials science papers as runnable, test-verified code.

Math

AIME 2025

The 2025 American Invitational Mathematics Examination: 15 hard competition problems per sitting, each with an integer answer from 0 to 999. A leading discriminator of mathematical reasoning.
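Because every AIME answer is an integer from 0 to 999, grading can be a plain exact match with no partial credit. A minimal sketch, assuming the model's final answer is the last number in its response; that extraction rule and the function names are assumptions for illustration, and real evaluation harnesses parse answers in their own ways.

```python
import re
from typing import Optional


def extract_final_integer(response: str) -> Optional[int]:
    """Return the last run of digits in the response as an int, or None."""
    matches = re.findall(r"\d+", response)
    return int(matches[-1]) if matches else None


def is_correct(response: str, reference: int) -> bool:
    """Exact match on the final integer; AIME answers lie in 0..999."""
    predicted = extract_final_integer(response)
    return predicted is not None and 0 <= predicted <= 999 and predicted == reference


def accuracy(responses: list[str], references: list[int]) -> float:
    """Fraction of responses whose final integer equals the reference."""
    correct = sum(is_correct(r, ref) for r, ref in zip(responses, references))
    return correct / len(references)


# Made-up responses and reference answers, for illustration only.
print(accuracy(["So the answer is 042.", "I get 7"], [42, 8]))  # 0.5
```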

MATH-500

A 500-problem slice of the MATH dataset spanning algebra to number theory at five difficulty levels, used to test step-by-step mathematical reasoning.

Agentic / tool use

Terminal-Bench

Real, end-to-end agentic tasks executed inside a live terminal and sandbox (build, debug, configure systems). Measures tool-use and long-horizon execution.

τ²-Bench

Tool-Agent-User benchmark: the model handles realistic customer tasks (telecom, retail, airline) by calling tools and following policy across multi-turn conversations.

Composite & human preference

LMArena Elo

Crowdsourced human-preference rating from blind, pairwise model battles on LMArena (formerly Chatbot Arena). A real-world signal of overall response quality.
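To show how pairwise battles turn into a rating, here is the textbook Elo update. It is a sketch of the general idea only: the K-factor of 32, the 400-point scale and the starting ratings are conventional assumptions, and the leaderboard's own statistical fitting may differ from this online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float,
           k: float = 32.0) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after one battle.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; the voter prefers model A.
print(update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

An upset win over a higher-rated model moves both ratings more than an expected win does, which is why the order and mix of opponents matter.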