GPQA Diamond

reasoning

Graduate-level, Google-proof science questions (the hard 'Diamond' subset). PhD-level chemistry, biology and physics questions that domain experts answer correctly roughly 65% of the time, designed to resist web lookup.

Official benchmark page

Model rankings on GPQA Diamond

#	Model	Score	As of	Source
1	Gemini 3.1 Pro Preview	94.3%	Feb 19, 2026	cite
2	Claude Opus 4.7	94.2%	May 28, 2026	cite
3	Claude Opus 4.8	93.6%	May 28, 2026	cite
4	Qwen3 Max Thinking	92.8%	Nov 12, 2025	cite
5	GLM 5.2	91.2%	Jun 13, 2026	cite
6	Gemini 3 Flash Preview	90.4%	Dec 17, 2025	cite
7	DeepSeek V4 Pro	90.1%	Apr 24, 2026	cite
8	Claude Sonnet 4.6	89.9%	Feb 17, 2026	cite
9	GPT-5 Pro	88.4%	Aug 7, 2025	cite
10	DeepSeek V4 Flash	88.1%	Apr 24, 2026	cite
11	Gemini 2.5 Pro	86.4%	Jun 27, 2025	cite
12	Kimi K2 Thinking	84.5%	Nov 6, 2025	cite
13	Llama 4 Maverick	69.8%	Apr 5, 2025	cite

Scores are self-reported or taken from primary evaluations, and each is linked to its source. Test conditions (tools, number of shots, prompt) vary between labs, so see the source for details.
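As a rough illustration of how the rankings table can be read programmatically, here is a minimal Python sketch. It assumes the rows are available as tab-separated text exactly as listed above. The helper names `parse_leaderboard` and `gaps_to_next` are hypothetical and not part of any official tooling. The sketch computes the gap in percentage points between adjacent ranks, which shows how close neighboring entries are.

```python
import csv
import io

# Rows copied from the rankings table above (tab-separated).
TABLE = """\
#\tModel\tScore\tAs of\tSource
1\tGemini 3.1 Pro Preview\t94.3%\tFeb 19, 2026\tcite
2\tClaude Opus 4.7\t94.2%\tMay 28, 2026\tcite
3\tClaude Opus 4.8\t93.6%\tMay 28, 2026\tcite
4\tQwen3 Max Thinking\t92.8%\tNov 12, 2025\tcite
5\tGLM 5.2\t91.2%\tJun 13, 2026\tcite
6\tGemini 3 Flash Preview\t90.4%\tDec 17, 2025\tcite
7\tDeepSeek V4 Pro\t90.1%\tApr 24, 2026\tcite
8\tClaude Sonnet 4.6\t89.9%\tFeb 17, 2026\tcite
9\tGPT-5 Pro\t88.4%\tAug 7, 2025\tcite
10\tDeepSeek V4 Flash\t88.1%\tApr 24, 2026\tcite
11\tGemini 2.5 Pro\t86.4%\tJun 27, 2025\tcite
12\tKimi K2 Thinking\t84.5%\tNov 6, 2025\tcite
13\tLlama 4 Maverick\t69.8%\tApr 5, 2025\tcite
"""


def parse_leaderboard(text):
    """Return a list of (rank, model, score) tuples from tab-separated text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (int(row["#"]), row["Model"], float(row["Score"].rstrip("%")))
        for row in reader
    ]


def gaps_to_next(rows):
    """Gap in percentage points between each entry and the one ranked below it."""
    return [
        (upper[1], lower[1], round(upper[2] - lower[2], 1))
        for upper, lower in zip(rows, rows[1:])
    ]


rows = parse_leaderboard(TABLE)
for upper, lower, gap in gaps_to_next(rows):
    print(f"{upper} -> {lower}: {gap:.1f} pts")
```

Since test conditions differ between labs, small gaps between adjacent entries are best read with caution rather than as a strict ordering.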
