OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis Paper • 2604.15093 • Published 11 days ago • 27
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models Paper • 2604.10866 • Published 14 days ago • 63
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents Paper • 2604.06132 • Published 20 days ago • 117
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning Paper • 2603.17024 • Published Mar 17 • 109
SGI-Bench Collection Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows • 12 items • Updated Mar 25 • 33
meituan-longcat/LongCat-Flash-Chat Text Generation • 562B • Updated Sep 24, 2025 • 50.6k • 531
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows Paper • 2510.24411 • Published Oct 28, 2025 • 73
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution Paper • 2510.25726 • Published Oct 29, 2025 • 46