SVBench: Evaluation of Video Generation Models on Social Reasoning
Abstract
Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos with a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel at surface-level plausibility, they systematically fail at intention recognition, belief reasoning, joint attention, and prosocial inference.
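To make the pipeline's shape concrete, here is a minimal orchestration sketch in Python. Everything in it is our own illustrative assumption rather than the authors' released code: the `Paradigm`/`Scenario` types, the injected callables, and especially the five judge-dimension labels, which we guess from the failure modes the abstract lists.

```python
# Hypothetical sketch of the four-stage, training-free agent pipeline
# described in the abstract. All names are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed labels for the "five interpretable dimensions of social
# reasoning"; the paper's actual dimension names may differ.
JUDGE_DIMENSIONS = [
    "intention_recognition",
    "belief_reasoning",
    "joint_attention",
    "prosocial_inference",
    "surface_plausibility",
]

@dataclass
class Paradigm:
    name: str            # e.g. a classic false-belief task
    mechanism: str = ""  # filled by stage (i): distilled reasoning mechanism

@dataclass
class Scenario:
    paradigm: str
    prompt: str          # video-ready text prompt for the model under test

def run_paradigm(
    paradigm: Paradigm,
    distill: Callable[[Paradigm], str],                  # stage (i)
    synthesize: Callable[[Paradigm], List[Scenario]],    # stage (ii)
    critique: Callable[[Scenario], bool],                # stage (iii): cue-based check
    generate_video: Callable[[str], str],                # text-to-video system under test
    judge: Callable[[str, Scenario], Dict[str, float]],  # stage (iv): VLM judge
) -> List[Dict[str, float]]:
    """Push one social-cognition paradigm through the full pipeline."""
    paradigm.mechanism = distill(paradigm)
    per_scenario_scores: List[Dict[str, float]] = []
    for scenario in synthesize(paradigm):
        # Drop prompts that leak the expected behavior or are trivially easy.
        if not critique(scenario):
            continue
        video_path = generate_video(scenario.prompt)
        # One score per dimension in JUDGE_DIMENSIONS.
        per_scenario_scores.append(judge(video_path, scenario))
    return per_scenario_scores
```

In this reading, each agent stage is a swappable callable, so the same harness could score any of the seven video generation systems by plugging in a different `generate_video`.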
Community
Currently, most work focuses on the physical plausibility of generated videos; we need more research examining whether the actions themselves are inherently reasonable.
Our project page is available at https://github.com/Gloria2tt/SVBench-Evaluation
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models (2025)
- Are Vision Language Models Cross-Cultural Theory of Mind Reasoners? (2025)
- RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence (2025)
- MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis (2025)
- TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models (2025)
- ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning (2025)
- T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation (2025)