Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon Paper • 2502.07445 • Published Feb 11, 2025 • 11
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning Paper • 2502.04689 • Published Feb 7, 2025 • 8
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models Paper • 2502.03032 • Published Feb 5, 2025 • 60
Preference Leakage: A Contamination Problem in LLM-as-a-judge Paper • 2502.01534 • Published Feb 3, 2025 • 40
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models Paper • 2502.01639 • Published Feb 3, 2025 • 26
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency Paper • 2502.09621 • Published Feb 13, 2025 • 28
Logical Reasoning in Large Language Models: A Survey Paper • 2502.09100 • Published Feb 13, 2025 • 24
IHEval: Evaluating Language Models on Following the Instruction Hierarchy Paper • 2502.08745 • Published Feb 12, 2025 • 20
InductionBench: LLMs Fail in the Simplest Complexity Class Paper • 2502.15823 • Published Feb 20, 2025 • 7
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM Paper • 2503.04504 • Published Mar 6, 2025 • 5
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders Paper • 2503.03601 • Published Mar 5, 2025 • 232
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence Paper • 2503.05037 • Published Mar 6, 2025 • 4
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially? Paper • 2503.12349 • Published Mar 16, 2025 • 44
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published Mar 16, 2025 • 35
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity Paper • 2503.11557 • Published Mar 14, 2025 • 22
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era Paper • 2503.12329 • Published Mar 16, 2025 • 27
Where do Large Vision-Language Models Look at when Answering Questions? Paper • 2503.13891 • Published Mar 18, 2025 • 8
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders Paper • 2503.18878 • Published Mar 24, 2025 • 119
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning Paper • 2506.05523 • Published Jun 5, 2025 • 34
JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent Paper • 2506.17612 • Published Jun 21, 2025 • 64
Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test Paper • 2506.21551 • Published Jun 26, 2025 • 28
Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study Paper • 2506.19794 • Published Jun 24, 2025 • 8
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation Paper • 2506.21876 • Published Jun 27, 2025 • 28
Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models Paper • 2507.07484 • Published Jul 10, 2025 • 18
Hidden in plain sight: VLMs overlook their visual representations Paper • 2506.08008 • Published Jun 9, 2025 • 7
PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models Paper • 2507.13428 • Published Jul 17, 2025 • 16
MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models Paper • 2507.12806 • Published Jul 17, 2025 • 21
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding Paper • 2507.15028 • Published Jul 20, 2025 • 21
Pixels, Patterns, but No Poetry: To See The World like Humans Paper • 2507.16863 • Published Jul 21, 2025 • 69
AgroBench: Vision-Language Model Benchmark in Agriculture Paper • 2507.20519 • Published Jul 28, 2025 • 8
Are We on the Right Way for Assessing Document Retrieval-Augmented Generation? Paper • 2508.03644 • Published Aug 5, 2025 • 25
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts Paper • 2508.09848 • Published Aug 13, 2025 • 71
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks Paper • 2509.01396 • Published Sep 1, 2025 • 58
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs Paper • 2509.04013 • Published Sep 4, 2025 • 4
MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval Paper • 2510.09510 • Published Oct 10, 2025 • 8
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models Paper • 2510.16641 • Published Oct 18, 2025 • 5
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark Paper • 2510.26802 • Published Oct 30, 2025 • 34
Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks Paper • 2510.25760 • Published Oct 29, 2025 • 17
UniREditBench: A Unified Reasoning-based Image Editing Benchmark Paper • 2511.01295 • Published Nov 3, 2025 • 39
SO-Bench: A Structural Output Evaluation of Multimodal LLMs Paper • 2511.21750 • Published Nov 23, 2025 • 6
RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence Paper • 2512.02622 • Published Dec 2, 2025 • 10
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle Paper • 2512.04324 • Published Dec 3, 2025 • 155
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models Paper • 2602.02185 • Published Feb 2026 • 125
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces Paper • 2602.14337 • Published Feb 2026 • 10