τ²-Bench

agentic

Tool-Agent-User benchmark: the model handles realistic customer-service tasks (telecom, retail, airline) by calling tools and following policy across multi-turn conversations.

Model rankings on τ²-Bench

No verified scores for this benchmark yet. We only list results with a primary source.

Scores are self-reported or from primary evaluations, each linked to its source. Test conditions (tools, shots, prompt) vary between labs; see the source for details.
