SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? Paper • 2410.03859 • Published Oct 4, 2024 • 1
LongCodeBench: Evaluating Coding LLMs at 1M Context Windows Paper • 2505.07897 • Published May 12, 2025
EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities Paper • 2409.16165 • Published Sep 24, 2024
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration Paper • 2412.15701 • Published Dec 20, 2024
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces Paper • 2601.11868 • Published Jan 17, 2026 • 37
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents Paper • 2602.22124 • Published Feb 25, 2026 • 2
SWE-chat: Coding Agent Interactions From Real Users in the Wild Paper • 2604.20779 • Published Apr 22, 2026 • 15
ProgramBench: Can Language Models Rebuild Programs From Scratch? Paper • 2605.03546 • Published 25 days ago • 3
SWE-Universe: Scale Real-World Verifiable Environments to Millions Paper • 2602.02361 • Published Feb 2, 2026 • 61