SWE-Bench Verified

coding

A set of 500 human-validated issues drawn from real GitHub repositories. For each issue, the model must produce a patch that makes the repository's hidden tests pass. It is widely treated as the standard benchmark for agentic software engineering.
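As a rough illustration of how a percentage score of this kind could be derived, the sketch below assumes the score is simply the share of issues whose hidden tests pass after the model's patch is applied. The names `InstanceResult` and `resolve_rate` are hypothetical, the data is made up for the example, and the official harness may differ in its details.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    instance_id: str    # hypothetical identifier for one benchmark issue
    tests_passed: bool  # True if the patch made the hidden tests pass


def resolve_rate(results: list[InstanceResult]) -> float:
    """Return the percentage of instances whose hidden tests passed."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(r.tests_passed for r in results)
    return 100.0 * resolved / len(results)


# Illustrative data only: 400 of 500 issues resolved.
results = [InstanceResult(f"issue-{i}", i < 400) for i in range(500)]
print(f"{resolve_rate(results):.1f}%")  # prints "80.0%"
```

With 500 issues, each resolved issue moves the score by 0.2 percentage points, which is why the scores listed on this page fall on multiples of 0.2.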

Model rankings on SWE-Bench Verified

#    Model                     Score    As of           Source
1    Claude Opus 4.8           88.6%    May 28, 2026    cite
2    Claude Opus 4.7           87.6%    May 28, 2026    cite
3    Gemini 3.1 Pro Preview    80.6%    Feb 19, 2026    cite
4    DeepSeek V4 Pro           80.6%    Apr 24, 2026    cite
5    Claude Sonnet 4.6         79.6%    Feb 17, 2026    cite
6    DeepSeek V4 Flash         79%      Apr 24, 2026    cite
7    Gemini 3 Flash Preview    78%      Dec 17, 2025    cite
8    Kimi K2 Thinking          71.3%    Nov 6, 2025     cite
9    Qwen3 Max                 69.6%    Sep 23, 2025    cite
10   Gemini 2.5 Pro            59.6%    Jun 27, 2025    cite

Scores are self-reported or taken from primary evaluations, and each is linked to its source. Test conditions (tools, shots, prompt) vary between labs, so scores are not strictly comparable; see each source for details.
