τ²-Bench
agentic
Tool-Agent-User benchmark: the model handles realistic customer tasks (telecom, retail, airline) by calling tools and following policy across multi-turn conversations.
Official benchmark page
Model rankings on τ²-Bench
No verified scores for this benchmark yet. We only list results with a primary source.
Scores are self-reported or from primary evaluations, each linked to its source. Test conditions (tools, shots, prompt) vary between labs — see the source for details.