Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models
Abstract
A new benchmark and dataset are introduced to evaluate and improve spatial reasoning capabilities in text-to-image models through information-dense prompts and fine-tuning.
Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail to handle complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks because of their short, information-sparse prompt designs. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, with two key contributions: (1) SpatialGenEval comprises 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains with 10 corresponding multiple-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond evaluation alone, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts that preserve information density while ensuring image consistency. Fine-tuning current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) yields consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic spatial relations, highlighting a data-centric paradigm for achieving spatial intelligence in T2I models.
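To make the benchmark protocol concrete, here is a minimal sketch of how a SpatialGenEval-style evaluation loop could look: generate an image from each information-dense prompt, then score it against the prompt's 10 multiple-choice questions with a multimodal judge. All names here (`SpatialItem`, `generate_image`, `vqa_answer`) are illustrative assumptions, not the paper's released API.

```python
# Hypothetical sketch of a SpatialGenEval-style evaluation loop.
# The data layout and helper names are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class MultiChoiceQA:
    question: str        # e.g. "Which object occludes the lamp?"
    choices: list[str]   # e.g. ["the sofa", "the bookshelf", ...]
    answer: int          # index of the correct choice


@dataclass
class SpatialItem:
    prompt: str                # long, information-dense scene description
    qas: list[MultiChoiceQA]   # one QA pair per spatial sub-domain (10 total)


def generate_image(model, prompt: str):
    """Placeholder for any T2I call, e.g. a diffusers pipeline."""
    return model(prompt).images[0]


def vqa_answer(image, question: str, choices: list[str]) -> int:
    """Placeholder: a VQA/MLLM judge picks the best-matching choice index."""
    raise NotImplementedError("plug in a multimodal judge here")


def evaluate(model, items: list[SpatialItem]) -> float:
    """Accuracy over all multiple-choice questions in the benchmark."""
    correct = total = 0
    for item in items:
        image = generate_image(model, item.prompt)
        for qa in item.qas:
            pred = vqa_answer(image, qa.question, qa.choices)
            correct += int(pred == qa.answer)
            total += 1
    return correct / total
```

Because each prompt carries 10 questions spanning distinct sub-domains, per-sub-domain accuracies can be reported alongside the overall score, which is what lets the benchmark isolate higher-order reasoning as the bottleneck.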
Community
A very interesting benchmark (ICLR2026) for T2I models!
Everything in Its Right Place
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions (2026)
- Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs (2026)
- M3-Verse: A "Spot the Difference" Challenge for Large Multimodal Models (2025)
- GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation (2025)
- Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing (2026)
- JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation (2025)
- LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations (2025)