Terminal-Bench
agentic
Real, end-to-end agentic tasks executed inside a live terminal and sandbox (build, debug, configure systems). Measures tool use and long-horizon execution.
Official benchmark page
Model rankings on Terminal-Bench
| # | Model | Score | As of | Source |
|---|---|---|---|---|
| 1 |  | 81% | Jun 13, 2026 | cite |
| 2 |  | 78.2% | May 28, 2026 | cite |
| 3 |  | 74.6% | May 28, 2026 | cite |
| 4 |  | 68.5% | Feb 19, 2026 | cite |
| 5 |  | 67.9% | Apr 24, 2026 | cite |
| 6 |  | 66.1% | May 28, 2026 | cite |
| 7 |  | 59.1% | Feb 17, 2026 | cite |
| 8 |  | 57% | Jun 1, 2026 | cite |
| 9 |  | 56.9% | Apr 24, 2026 | cite |
| 10 |  | 47.1% | Nov 6, 2025 | cite |
Scores are self-reported or from primary evaluations, each linked to its source. Test conditions (tools, shots, prompt) vary between labs — see the source for details.