Terminal-Bench

agentic

Real, end-to-end agentic tasks executed inside a live, sandboxed terminal, such as building software, debugging, and configuring systems. Measures tool use and long-horizon execution.

Model rankings on Terminal-Bench

#	Model	Score	As of	Source
1	GLM 5.2	81%	Jun 13, 2026	cite
2	GPT-5.5	78.2%	May 28, 2026	cite
3	Claude Opus 4.8	74.6%	May 28, 2026	cite
4	Gemini 3.1 Pro Preview	68.5%	Feb 19, 2026	cite
5	DeepSeek V4 Pro	67.9%	Apr 24, 2026	cite
6	Claude Opus 4.7	66.1%	May 28, 2026	cite
7	Claude Sonnet 4.6	59.1%	Feb 17, 2026	cite
8	MiniMax M2.7	57%	Jun 1, 2026	cite
9	DeepSeek V4 Flash	56.9%	Apr 24, 2026	cite
10	Kimi K2 Thinking	47.1%	Nov 6, 2025	cite

Scores are self-reported or taken from primary evaluations, and each is linked to its source. Test conditions (tools, shots, prompt) vary between labs; see the source for details.
