Arena Blog – 4 Jun 26
Agent Arena: Causal Evaluation of Agents in the Real World
Agents are increasingly doing real work. The resulting task distribution has greatly expanded. We desire an agent evaluation that scales along with usage and capability.
Agent Arena: AI Model Agentic Performance Leaderboard
Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.
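I used a problem set that has not been updated in a long time: llm-benchmark.github.io, Reasoning Models Evaluation. [At present GPT 5.5 XHIGH misses only about 2 problems, and GEMINI is close as well, but the problems here are not fine-grained enough to separate GEMINI 3.1 from GPT 5.5; GPT 5.5 is clearly stronger.]
I picked one of the simple problems:
1. Using QODER [not sure whether it serves the real model]: QWEN 3.7 fell into an endless thinking loop, and I shut it down after more than an hour.
2. On the official website: it answered correctly but took extremely long, over 20 minutes, which is unacceptably inefficient.
So for the first time I have started to doubt the authenticity of the leaderboard from the evaluation blogger nao, who claimed that gpt5.5 scores 80 and qwen reaches 78.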
https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro
https://help.openai.com/en/articles/10306912-sharing-feedback-evaluation-and-fine-tuning-data-and-api-inputs-and-outputs-with-openai
1M token group (250K for usage tiers 1-2): gpt-5.4-2026-03-05, gpt-5.2-2025-12-11, gpt-5.1-2025-11-13, gpt-5.1-codex, gpt-5-codex, gpt-5-2025-08-07, gpt-5-chat-latest, gpt-4.5-preview-2025-02-27 (deprecated and shut down as of 7/14/25), gpt-4.1-2025-04-14, gpt-4o-2024-05-13, gpt-4o-2024-08-06, gpt-4o-2024-11-20, o3-2025-04-16, o1-preview-2024-09-12, o1-2024-12-17
10M token group (2.5M for usage tiers 1-2): gpt-5.4-mini-2026-03-17, gpt-5.4-nano-2026-03-17, gpt-5.1-codex-mini, gpt-5-mini-2025-08-07, gpt-5-nano-2025-08-07, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14, gpt-4o-mini-2024-07-18, o4-mini-2025-04-16, o1-mini-2024-09-12, codex-mini-latest
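For anyone who wants to check a model against the two groups listed above, here is a minimal Python sketch of a lookup table. The names `GROUP_LIMITS`, `GROUP_MODELS`, and `token_limit` are hypothetical and are not part of any OpenAI API. The sketch also assumes that every usage tier other than 1 and 2 gets the larger figure, which the excerpt implies but does not state outright.

```python
# Illustrative lookup built only from the two groups quoted in the post.
# All names here are hypothetical; this is not an OpenAI API.

GROUP_LIMITS = {
    "1M": {"standard": 1_000_000, "tiers_1_2": 250_000},
    "10M": {"standard": 10_000_000, "tiers_1_2": 2_500_000},
}

GROUP_MODELS = {
    "1M": {
        "gpt-5.4-2026-03-05", "gpt-5.2-2025-12-11", "gpt-5.1-2025-11-13",
        "gpt-5.1-codex", "gpt-5-codex", "gpt-5-2025-08-07",
        "gpt-5-chat-latest",
        "gpt-4.5-preview-2025-02-27",  # listed as deprecated and shut down
        "gpt-4.1-2025-04-14", "gpt-4o-2024-05-13", "gpt-4o-2024-08-06",
        "gpt-4o-2024-11-20", "o3-2025-04-16", "o1-preview-2024-09-12",
        "o1-2024-12-17",
    },
    "10M": {
        "gpt-5.4-mini-2026-03-17", "gpt-5.4-nano-2026-03-17",
        "gpt-5.1-codex-mini", "gpt-5-mini-2025-08-07",
        "gpt-5-nano-2025-08-07", "gpt-4.1-mini-2025-04-14",
        "gpt-4.1-nano-2025-04-14", "gpt-4o-mini-2024-07-18",
        "o4-mini-2025-04-16", "o1-mini-2024-09-12", "codex-mini-latest",
    },
}


def token_limit(model: str, usage_tier: int) -> int:
    """Return the token figure of the group that lists `model`.

    Assumption: tiers 1 and 2 get the reduced figure, every other tier
    gets the full group figure.
    """
    for group, models in GROUP_MODELS.items():
        if model in models:
            key = "tiers_1_2" if usage_tier in (1, 2) else "standard"
            return GROUP_LIMITS[group][key]
    raise KeyError(f"{model!r} is not in either listed group")
```

For example, `token_limit("o4-mini-2025-04-16", 1)` falls in the 10M group at a reduced tier, while a model name missing from both lists raises `KeyError` rather than guessing a group.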