Arena Blog – 4 Jun 26
Agent Arena: Causal Evaluation of Agents in the Real World
Agents are increasingly doing real work, and the range of tasks they handle has expanded greatly as a result. We want an agent evaluation that scales along with both usage and capability.
Agent Arena: AI Model Agentic Performance Leaderboard
Dynamic ranking of models by how well they orchestrate tools for real-world agentic tasks, based on signals such as tool reliability, task completion, and steerability.