SWE-Bench Verified
Category: coding
500 human-validated issues drawn from real GitHub repositories. For each issue, the model must produce a patch that makes the repository's hidden tests pass. The standard benchmark for agentic software engineering.
Official benchmark page

Model rankings on SWE-Bench Verified
| # | Model | Score | As of | Source |
|---|---|---|---|---|
| 1 | | 88.6% | May 28, 2026 | cite |
| 2 | | 87.6% | May 28, 2026 | cite |
| 3 | | 80.6% | Feb 19, 2026 | cite |
| 4 | | 80.6% | Apr 24, 2026 | cite |
| 5 | | 79.6% | Feb 17, 2026 | cite |
| 6 | | 79% | Apr 24, 2026 | cite |
| 7 | | 78% | Dec 17, 2025 | cite |
| 8 | | 71.3% | Nov 6, 2025 | cite |
| 9 | | 69.6% | Sep 23, 2025 | cite |
| 10 | | 59.6% | Jun 27, 2025 | cite |
Scores are self-reported or from primary evaluations, each linked to its source. Test conditions (tools, shots, prompt) vary between labs — see the source for details.
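
As a rough illustration of how a score in the table relates to per-issue outcomes, the sketch below computes a resolve rate from hidden-test results. It assumes an issue counts as resolved only when every hidden test passes after the model's patch is applied. The function names, issue ids, and data layout are hypothetical and are not taken from the official evaluation harness.

```python
from typing import Dict, List


def is_resolved(test_outcomes: List[bool]) -> bool:
    """Assumption: an issue is resolved only if at least one hidden test ran and all passed."""
    return len(test_outcomes) > 0 and all(test_outcomes)


def resolve_rate(results: Dict[str, List[bool]]) -> float:
    """Percentage of issues resolved, rounded to one decimal place."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(is_resolved(outcomes) for outcomes in results.values())
    return round(100.0 * resolved / len(results), 1)


# Hypothetical toy run: four issues, each with its hidden-test outcomes.
toy_results = {
    "issue-a": [True, True],   # all tests pass: resolved
    "issue-b": [True, False],  # one test fails: not resolved
    "issue-c": [True],         # resolved
    "issue-d": [],             # no tests ran (e.g. patch did not apply): not resolved
}
print(resolve_rate(toy_results))  # 50.0
```

On a single run over the full 500 issues, each issue is worth 0.2 percentage points, so 443 resolved issues would correspond to 88.6%.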