Terminal-Bench

agentic

Real, end-to-end agentic tasks executed inside a live terminal and sandbox (build, debug, configure systems). Measures tool-use and long-horizon execution.


Model rankings on Terminal-Bench

#    Model                    Score   As of          Source
1    GLM 5.2                  81%     Jun 13, 2026   cite
2    GPT-5.5                  78.2%   May 28, 2026   cite
3    Claude Opus 4.8          74.6%   May 28, 2026   cite
4    Gemini 3.1 Pro Preview   68.5%   Feb 19, 2026   cite
5    DeepSeek V4 Pro          67.9%   Apr 24, 2026   cite
6    Claude Opus 4.7          66.1%   May 28, 2026   cite
7    Claude Sonnet 4.6        59.1%   Feb 17, 2026   cite
8    MiniMax M2.7             57%     Jun 1, 2026    cite
9    DeepSeek V4 Flash        56.9%   Apr 24, 2026   cite
10   Kimi K2 Thinking         47.1%   Nov 6, 2025    cite

Scores are self-reported or taken from primary evaluations, and each is linked to its source. Test conditions (tools, shots, prompt) vary between labs; see each source for details.
