If it’s useful, great.
I’ll also include how to use these benchmarks. Just in case.
Use a 3-phase plan:
- Cheap game-specific eval: GlitchBench + PhysGame.
- Mid-cost game QA eval: a subset of VideoGameQA-Bench.
- Optional long-video stress test: a small slice of LVBench / LongVideoBench / MLVU.
In all phases:
- Cache perception once per video.
- Compare plain video LMM baseline vs your timeline/KB system.
- Track both accuracy and compute (frames, tokens, time).
I’ll break this down clearly.
1. Background: what each benchmark actually gives you
1.1 VideoGameQA-Bench
- NeurIPS 2025 dataset & benchmarks track.
- Designed to test VLMs on real game QA workflows: visual unit tests, visual regression, needle-in-a-haystack, glitch detection, and bug report generation, for both images and videos. (asgaardlab.github.io)
- The official site describes nine tasks and 4 786 QA pairs over multiple games. (GitHub)
So this is your “does this help game QA at all?” benchmark.
1.2 PhysGame
- Benchmark for physical commonsense violations in gameplay videos.
- 880 videos, each with a glitch and a multiple-choice question about what physics rule was violated.
- Covers 4 physics domains (mechanics, kinematics, optics, materials) and 12 categories (e.g., gravity, velocity). (physgame.github.io)
This is your “does the model notice physics-breaking events?” test.
1.3 GlitchBench
- CVPR 2024 benchmark: 593 glitches curated from r/GamePhysics across 205 games. (arXiv)
- Mostly short clips or single frames from weird or broken situations.
- Evaluates whether LMMs can detect and interpret glitches, not just recognise objects. (GlitchBench)
This is your “generic glitch detection” test.
1.4 Long-video context benchmarks (for later)
You only need one of these as a long-video sanity check:
- LVBench
  - "Extreme long video" benchmark with long videos from TV, sports, surveillance, etc.
  - Multiple tasks aimed at long-term comprehension; average length is in the many-minutes range and some videos go much longer. (LVBench)
- LongVideoBench
  - 3 763 web videos with subtitles, up to about an hour each.
  - 6 678 human-annotated MCQ questions across 17 categories. (arXiv)
- MLVU
  - 1 730 videos, 3 102 QA pairs.
  - Video lengths from about 3 minutes to 2 hours across 9 tasks; covers multiple genres, including games. (arXiv)
You use one of these only after you’ve nailed the game-specific benchmarks.
2. High-level evaluation goals for Gameplay-Vision-LLM
Your system’s advertised strengths:
- Long-horizon gameplay understanding (10+ minute sessions).
- Timeline of key events instead of drowning the LLM in frames.
- Causal reasoning: “what sequence of events led to X?” not just “what is in this frame?”
So your evaluation plan should explicitly measure:
- Task performance
  - Accuracy on glitch and physics detection.
  - QA quality and bug-oriented answers.
- Temporal / causal value
  - Cases where the answer depends on events spread across time.
- Efficiency
  - Frames used per question.
  - Tokens and wall-clock time per question.
Every experiment should compare:
- A plain long-video baseline (video LMM without timeline/KB).
- Your timeline + retrieval system (light and full configs).
3. Phase 1 — Cheap, focused eval: GlitchBench + PhysGame
3.1 Why Phase 1 is “cheap”
Everything is short, so perception and LLM passes are cheap, especially with caching.
3.2 Model configurations to test
Define three configurations:
-
Baseline-Plain
- A strong video LMM (e.g. your Qwen3-VL) with uniform frame sampling, ASR/OCR if available.
- No external timeline/KB, no extra encoders.
-
GVP-Light (Gameplay-Vision-LLM light)
-
GVP-Full (optional in Phase 1)
- Full perception stack (SAM3 + VideoMAE etc.) if you want to see the upper bound.
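To keep the three presets comparable in code, a small config object helps. This is only a sketch; the field names (`encoders`, `use_timeline_kb`, `frame_budget`) are assumptions, not anything defined in your pipeline:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    """One model configuration for all benchmark runs (field names are illustrative)."""
    name: str
    encoders: tuple[str, ...]   # perception encoders run in the caching stage
    use_timeline_kb: bool       # True: feed the LLM a timeline/KB; False: raw sampled frames
    frame_budget: int           # max frames sampled per video

BASELINE_PLAIN = EvalConfig("Baseline-Plain", encoders=(), use_timeline_kb=False, frame_budget=64)
GVP_LIGHT = EvalConfig("GVP-Light", encoders=("clip",), use_timeline_kb=True, frame_budget=64)
GVP_FULL = EvalConfig("GVP-Full", encoders=("clip", "sam3", "videomae"), use_timeline_kb=True, frame_budget=64)
```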
3.3 Implementation strategy: cache then evaluate
For each dataset, use a two-stage process:
- Stage A (perception + timeline cache): run perception (encoders, ASR/OCR, event detection) once per video and persist the resulting timeline to disk.
- Stage B (question answering): answer every question for a video against its cached timeline, without re-running any encoders.
This is the standard trick used when evaluating on long-video benchmarks like MLVU and LongVideoBench, so you don't recompute encoders for every question. (arXiv)
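A minimal sketch of that two-stage loop, assuming hypothetical `run_perception` and `answer_question` callables from your pipeline; the cache is keyed by video filename so Stage B never re-runs Stage A:

```python
import json
from pathlib import Path

CACHE_DIR = Path("cache/timelines")

def get_timeline(video_path: str, run_perception) -> dict:
    """Stage A: run perception once per video and cache the timeline as JSON."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / (Path(video_path).stem + ".json")
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    timeline = run_perception(video_path)  # expensive: encoders, ASR/OCR, event detection
    cache_file.write_text(json.dumps(timeline))
    return timeline

def evaluate(samples, run_perception, answer_question) -> list:
    """Stage B: answer every question against the cached timeline only."""
    return [
        answer_question(get_timeline(s["video"], run_perception), s["question"])
        for s in samples
    ]
```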
3.4 How to handle each benchmark
GlitchBench
- Input: 593 glitches; some are frames, some are short clips; each has text describing a QA scenario and a target label. (arXiv)
- Task: detect whether there is a glitch and sometimes what kind (depending on the subset you use).
- Metrics: accuracy, maybe F1 for glitch vs no-glitch.
Plan:
- Build a small adapter that maps GlitchBench's data format onto your generic "video + question" interface (see the adapter sketch below).
- For each sample, ask the model a specific prompt like:
  - "Is there a glitch in this gameplay? Answer 'yes' or 'no' and briefly explain."
- For MCQ variants, ask the model to choose "A/B/C/…", then map the letter back to the label.
Compare:
- Baseline-Plain vs GVP-Light vs GVP-Full on accuracy and average inference time per sample.
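A possible shape for that adapter, assuming GlitchBench samples expose a media path and a binary glitch label; the field names (`media_path`, `is_glitch`) are placeholders to adjust to the actual release format:

```python
from dataclasses import dataclass

@dataclass
class VideoQASample:
    """Generic 'video + question' record that every benchmark adapter produces."""
    video: str      # path to the clip or single frame
    question: str   # fully formatted prompt
    answer: str     # gold label ("yes"/"no" or an option letter)

GLITCH_PROMPT = (
    "Is there a glitch in this gameplay? "
    "Answer 'yes' or 'no' and briefly explain."
)

def adapt_glitchbench(raw_samples) -> list[VideoQASample]:
    """Map GlitchBench records onto the generic video + question interface."""
    return [
        VideoQASample(
            video=s["media_path"],                     # placeholder field name
            question=GLITCH_PROMPT,
            answer="yes" if s["is_glitch"] else "no",  # placeholder field name
        )
        for s in raw_samples
    ]
```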
PhysGame
- Input: 880 gameplay clips, each with an MCQ about what physical commonsense is violated. (arXiv)
- Task: choose the correct option (e.g. which physics rule is broken).
- Metrics: MCQ accuracy; maybe per-domain (mechanics, kinematics, etc.).
Plan:
- Use the same two-stage process with caching.
- For each example, prompt like:
  - "Choose the correct option (A/B/C/D). Do not invent new options."
- Parse the first option letter in the model's output and map it back to the choices (see the parsing sketch below).
Compare again Baseline-Plain vs GVP-Light vs GVP-Full.
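A minimal parser for that step, assuming the model is prompted to answer with a single option letter; a stricter judge model or constrained decoding would also work:

```python
import re

def parse_choice(model_output: str, options=("A", "B", "C", "D")):
    """Return the first standalone option letter in the reply, or None if absent.

    Returning None lets you count unparseable answers as wrong (or re-prompt)
    instead of silently guessing.
    """
    match = re.search(r"\b([A-D])\b", model_output.upper())
    if match and match.group(1) in options:
        return match.group(1)
    return None

assert parse_choice("B. The character clips through the wall.") == "B"
assert parse_choice("The answer is (C)") == "C"
assert parse_choice("I am not sure.") is None
```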
At the end of Phase 1 you should know:
- Does your structured timeline/KB help on short clips for glitches and physics anomalies?
- How heavy is the full stack vs light stack in practice?
4. Phase 2 — Game QA eval: targeted subset of VideoGameQA-Bench
4.1 Why not run all of it immediately
VideoGameQA-Bench has nine tasks and thousands of QA pairs. (asgaardlab.github.io)
Running all tasks, all samples, all configs from day one will:
- burn compute,
- complicate analysis.
You only need 2–3 tasks to demonstrate your system’s strengths.
4.2 Pick the most relevant tasks
From the paper and site, VideoGameQA-Bench includes tasks like: (asgaardlab.github.io)
- visual unit test,
- visual regression,
- video needle-in-a-haystack,
- video glitch detection,
- bug-report style QA.
For Gameplay-Vision-LLM, the best match is:
- Video needle-in-a-haystack
  - Stresses long-range retrieval and your timeline.
- Video glitch detection
  - Overlaps with GlitchBench / PhysGame, but in this benchmark's style.
- Video bug report / freeform QA (if available as open-ended QA in the bench)
  - Tests multi-step and causal descriptions.
You can ignore image-only tasks at first.
4.3 Sampling strategy to keep runtime sane
For each chosen task:
- Start with a tiny debug set (e.g. 50 examples).
- Once everything runs, scale to a moderate subset (e.g. 500–1 000 QA pairs total, not the full 4 786). (GitHub)
Frame sampling:
- Set a frame budget per video, e.g. 64–128 frames.
- Use uniform sampling or simple motion-aware sampling for now (a uniform-sampling sketch follows this list).
- Lean on your timeline to decide which frames are actually fed into the LLM.
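The uniform option can be as simple as pure index math; a minimal sketch (`total_frames` would come from the video metadata):

```python
def uniform_frame_indices(total_frames: int, budget: int = 64) -> list[int]:
    """Pick at most `budget` frame indices spread evenly across the video."""
    if total_frames <= budget:
        return list(range(total_frames))
    window = total_frames / budget
    # take the centre of each of the `budget` equal-width windows
    return [int(window * i + window / 2) for i in range(budget)]

# e.g. a 10-minute clip at 30 fps has 18 000 frames -> 64 evenly spaced indices
indices = uniform_frame_indices(18_000, budget=64)
assert len(indices) == 64 and indices[-1] < 18_000
```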
4.4 Metrics and comparisons
For each task and for your chosen subset:
- Measure the benchmark's standard metrics (accuracy, F1, maybe BLEU for bug reports). (arXiv)
- For each model configuration (Baseline-Plain, GVP-Light, GVP-Full), record:
  - frames per video,
  - tokens per question,
  - average LLM latency.
Your key table for this phase will look like:
- rows = model configs.
- columns = metric per task + “frames used” + “tokens per QA”.
This shows both quality and efficiency.
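For concreteness, a skeleton of that table (all cells are placeholders; the task columns depend on which VideoGameQA-Bench tasks you pick):

| Config | NIAH acc. | Glitch det. acc. | Bug-report score | Frames / video | Tokens / QA | Latency / QA |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline-Plain | … | … | … | … | … | … |
| GVP-Light | … | … | … | … | … | … |
| GVP-Full | … | … | … | … | … | … |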
5. Phase 3 — One long-video stress test
After Phases 1 and 2:
- You already have a strong game-specific story.
- Now add one generic long-video benchmark to test temporal scaling.
5.1 Choose exactly one benchmark
Options:
- LVBench
  - Focused on extreme long videos from various genres. (LVBench)
- LongVideoBench
  - 3 763 videos, up to about an hour, with 6 678 MCQs across 17 categories. (arXiv)
- MLVU
  - 1 730 videos, 3 102 QA pairs; 3–120 minute lengths and 9 tasks. (arXiv)
A pragmatic choice is MLVU or LongVideoBench:
- Both are widely used.
- They already report multiple baselines you can compare against. (NeurIPS Proceedings)
5.2 Subsetting to avoid huge compute
For the chosen benchmark:
- Select one or two tasks that resemble QA over events (not specialized tasks like OCR).
- Limit yourself to 50–100 videos with lengths you can handle (e.g. 5–30 minutes).
- Use aggressive pre-processing:
  - cap total frames per video (e.g. 256 frames),
  - build a hierarchical timeline summary per video (sketched below),
  - store everything in the cache.
Do not attempt the full benchmark initially. You want a signal that your approach scales to hour-class content, not leaderboard domination.
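A hedged sketch of the hierarchical summary, assuming your perception stage produces timestamped event captions and that `summarize` is a call into your LLM (both are assumptions about your pipeline): per-segment summaries roll up into chapters and one video-level synopsis, so QA can start at the top and drill down.

```python
def hierarchical_timeline(events, summarize, segment_seconds=60, segments_per_chapter=10):
    """Build segment -> chapter -> synopsis summaries over timestamped events.

    `events` is a list of {'t': seconds, 'text': caption}; `summarize` is any
    text -> text summarizer (e.g. one LLM call).
    """
    # Level 1: bucket events into fixed-length segments and summarize each bucket.
    buckets = {}
    for e in events:
        buckets.setdefault(int(e["t"] // segment_seconds), []).append(e["text"])
    segments = [
        {"start": k * segment_seconds, "summary": summarize(" ".join(texts))}
        for k, texts in sorted(buckets.items())
    ]
    # Level 2: summarize blocks of segment summaries into chapters.
    chapters = [
        summarize(" ".join(s["summary"] for s in segments[i:i + segments_per_chapter]))
        for i in range(0, len(segments), segments_per_chapter)
    ]
    # Level 3: one global synopsis over the chapters.
    return {"synopsis": summarize(" ".join(chapters)), "chapters": chapters, "segments": segments}
```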
5.3 What you report
For this long-video experiment:
- The benchmark's metric (accuracy / score) for:
  - Baseline-Plain with the same frame budget,
  - GVP-Light,
  - GVP-Full (if feasible).
- Frames, tokens, and latency per QA.
This becomes the “our method scales to generic long video tasks as well” paragraph, supported by recognized LVU benchmarks. (arXiv)
6. Putting it all together: practical checklist
Here is a direct checklist you can follow:
- Decide presets and baselines
  - Baseline-Plain = Qwen3-VL (or similar) with naive sampling.
  - GVP-Light = your pipeline with one encoder + ASR/OCR, no SAM3/VideoMAE.
  - GVP-Full = full perception stack.
- Phase 1 (short clips, cheap)
  - Implement loaders and adapters for GlitchBench and PhysGame. (arXiv)
  - Build a perception + timeline cache per clip.
  - Run MCQ / classification evaluation for all clips.
  - Compare accuracy + latency across configs.
- Phase 2 (game QA)
  - Integrate VideoGameQA-Bench. (asgaardlab.github.io)
  - Pick 2–3 video tasks (needle-in-a-haystack, glitch detection, bug-like QA).
  - Evaluate on 500–1 000 QA pairs total with caching.
  - Log task metrics, frames, tokens, latency.
- Phase 3 (long-video stress)
  - Choose one: LVBench / LongVideoBench / MLVU. (LVBench)
  - Subset to 50–100 videos and 1–2 tasks.
  - Use frame caps and hierarchical summaries.
  - Run a small comparison (Baseline-Plain vs GVP-Light).
- Across all phases
  - Cache perception once per video, compare the plain baseline against your timeline/KB system, and track both accuracy and compute (frames, tokens, time).
You end up with:
- Phase 1: “We improve glitch/physics detection on GlitchBench and PhysGame.”
- Phase 2: “We help on realistic game QA tasks from VideoGameQA-Bench with fewer frames/tokens.”
- Phase 3: “Our approach scales to generic long-video QA on a recognized LVU benchmark.”
That is a practical, compute-aware plan that directly tests what Gameplay-Vision-LLM claims to do, without requiring you to run every long-video benchmark at full scale.