GPQA Diamond

reasoning

The hard 'Diamond' subset of GPQA, a set of graduate-level, Google-proof science questions. It covers PhD-level chemistry, biology, and physics. Domain experts answer roughly 65% of the questions correctly, and the questions are designed to resist web lookup.

Model rankings on GPQA Diamond

#    Model                     Score    As of           Source
1    Gemini 3.1 Pro Preview    94.3%    Feb 19, 2026    cite
2    Claude Opus 4.7           94.2%    May 28, 2026    cite
3    Claude Opus 4.8           93.6%    May 28, 2026    cite
4    Qwen3 Max Thinking        92.8%    Nov 12, 2025    cite
5    GLM 5.2                   91.2%    Jun 13, 2026    cite
6    Gemini 3 Flash Preview    90.4%    Dec 17, 2025    cite
7    DeepSeek V4 Pro           90.1%    Apr 24, 2026    cite
8    Claude Sonnet 4.6         89.9%    Feb 17, 2026    cite
9    GPT-5 Pro                 88.4%    Aug 7, 2025     cite
10   DeepSeek V4 Flash         88.1%    Apr 24, 2026    cite
11   Gemini 2.5 Pro            86.4%    Jun 27, 2025    cite
12   Kimi K2 Thinking          84.5%    Nov 6, 2025     cite
13   Llama 4 Maverick          69.8%    Apr 5, 2025     cite

Scores are self-reported or taken from primary evaluations, and each is linked to its source. Test conditions (tools, shots, prompt) vary between labs, so scores are not strictly comparable; see each source for details.
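As a rough illustration of how to read the ranking table above against the expert figure quoted in the description, the sketch below computes each model's margin over that baseline. The constant `EXPERT_BASELINE` and the helper `margin_over_experts` are hypothetical names introduced here; the baseline is the page's "roughly 65%" treated as exactly 65.0, which is an assumption. Because test conditions differ between labs, the margins are indicative only.

```python
# Rows copied from the ranking table on this page: (model, score in percent).
SCORES = [
    ("Gemini 3.1 Pro Preview", 94.3),
    ("Claude Opus 4.7", 94.2),
    ("Claude Opus 4.8", 93.6),
    ("Qwen3 Max Thinking", 92.8),
    ("GLM 5.2", 91.2),
    ("Gemini 3 Flash Preview", 90.4),
    ("DeepSeek V4 Pro", 90.1),
    ("Claude Sonnet 4.6", 89.9),
    ("GPT-5 Pro", 88.4),
    ("DeepSeek V4 Flash", 88.1),
    ("Gemini 2.5 Pro", 86.4),
    ("Kimi K2 Thinking", 84.5),
    ("Llama 4 Maverick", 69.8),
]

# Assumption: the description's "roughly 65%" is treated as exactly 65.0.
EXPERT_BASELINE = 65.0


def margin_over_experts(scores, baseline=EXPERT_BASELINE):
    """Return (model, score, margin) tuples, highest score first.

    The margin is the score minus the baseline, in percentage points,
    rounded to one decimal place.
    """
    ranked = sorted(scores, key=lambda row: row[1], reverse=True)
    return [(model, score, round(score - baseline, 1)) for model, score in ranked]


if __name__ == "__main__":
    for rank, (model, score, margin) in enumerate(margin_over_experts(SCORES), start=1):
        print(f"{rank:>2}  {model:<24} {score:5.1f}%  {margin:+.1f} pts vs. experts")
```

The margins inherit every caveat of the underlying scores: a gap of a few tenths of a point between two models may reflect differences in tools, shots, or prompts instead of capability.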
