SWE-Bench Verified

coding

500 human-validated tasks drawn from real GitHub issues; for each one, the model must produce a patch that makes the repository's hidden tests pass. A widely used benchmark for agentic software engineering.
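As a rough illustration of how a score of this kind can be read, the sketch below computes the percentage of tasks whose patch made the hidden tests pass. The `InstanceResult` type, the `resolved_rate` function, and the task IDs are hypothetical names chosen for this example, and the data is made up; this is not the benchmark's official harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    instance_id: str  # hypothetical identifier for one benchmark task
    resolved: bool    # True if the patched repo passed the hidden tests


def resolved_rate(results: list[InstanceResult]) -> float:
    """Return the percentage of tasks whose patch made the hidden tests pass."""
    if not results:
        raise ValueError("no results to score")
    solved = sum(r.resolved for r in results)
    return 100.0 * solved / len(results)


# Illustrative data only: 400 of 500 made-up tasks are marked resolved.
results = [InstanceResult(f"task-{i}", i < 400) for i in range(500)]
score = resolved_rate(results)
print(f"{score:.1f}%")  # prints "80.0%"
```

Under this reading, each task is pass or fail, so a single run over 500 tasks moves the score in steps of 0.2 percentage points.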


Model rankings on SWE-Bench Verified

#	Model	Score	As of	Source
1	Claude Opus 4.8	88.6%	May 28, 2026	cite
2	Claude Opus 4.7	87.6%	May 28, 2026	cite
3	Gemini 3.1 Pro Preview	80.6%	Feb 19, 2026	cite
4	DeepSeek V4 Pro	80.6%	Apr 24, 2026	cite
5	Claude Sonnet 4.6	79.6%	Feb 17, 2026	cite
6	DeepSeek V4 Flash	79%	Apr 24, 2026	cite
7	Gemini 3 Flash Preview	78%	Dec 17, 2025	cite
8	Kimi K2 Thinking	71.3%	Nov 6, 2025	cite
9	Qwen3 Max	69.6%	Sep 23, 2025	cite
10	Gemini 2.5 Pro	59.6%	Jun 27, 2025	cite

Scores are self-reported or taken from primary evaluations, each linked to its source. Test conditions (tools, shots, prompt) vary between labs, so scores are not strictly comparable; see each source for details.
