AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking • Paper • 2601.17645 • Published 5 days ago • 22
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces • Paper • 2601.11868 • Published 13 days ago • 32
mlx-community/XortronCriminalComputingConfig-mlx-8Bit • Text Generation • Updated Jun 19, 2025 • 20 • 3
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development • Paper • 2601.11077 • Published 14 days ago • 64