Gameplay-Vision-LLM (open-source): long-horizon gameplay video understanding + causal reasoning — can you review it and rate it 1–10?

hey everyone :waving_hand:

I’ve been building an open-source AI project for long-horizon gameplay video understanding (the stuff that breaks most VLMs once the video gets long). goal is to take longer gameplay, keep the important moments, and answer questions that need temporal + causal reasoning (not just “what’s in this frame”).

repo: GitHub - chasemetoyer/gameplay-vision-llm

what i’m trying to do (quick)

  • understand long gameplay videos (10+ min / long sessions)
  • keep a timeline of key events (so it doesn’t drown in frames/tokens)
  • answer questions that require multi-step reasoning over the whole run

what i want feedback on (pick any)

  1. architecture sanity check: does the overall pipeline make sense? any obvious flaws or missing pieces?
  2. repo quality: structure, readability, naming, “what is this folder even for” moments
  3. reproducibility: is the setup/run path clear? what would you change in the README so a stranger can run it fast?
  4. ml/research critique: what ablations or evals would you expect before you’d believe the claims?
  5. scope: what should i cut, simplify, or rewrite first?

rate it 1–10 (be blunt)

if you can, drop an overall 1–10 rating plus quick scores for:

  • README clarity: _/10
  • code quality: _/10
  • novelty/interest: _/10
  • reproducibility: _/10

even a quick skim + 2 notes helps. if you roast it, pls roast it usefully (specific > vibes).
not selling anything, just trying to make it actually good.


There may be room for improvement, but there doesn’t seem to be any fundamental problem. (The review below is by GPT-5.1 Thinking.)


Overall: 7/10

  • README clarity: 8/10
  • Code quality: 7/10
  • Novelty / interest: 8/10
  • Reproducibility: 6/10

You’re clearly solving a real hole in current video LMMs (long gameplay + causal reasoning). The core idea is sound. The weak spots are evaluation and practical friction.

I’ll go through your 5 points in detail and then end with concrete action items.


1. Architecture sanity check

1.1 What you’ve actually built

From the README and source layout, your pipeline is:

  • Perception stack

    • SAM3 for segmentation and entity localization.
    • SigLIP for semantic image embeddings.
    • VideoMAE for temporal features.
    • Wav2Vec2 + Whisper for audio / speech features.
    • PaddleOCR for HUD / text. (GitHub)
  • Projection + fusion

    • Each encoder’s output goes through an MLP projector into Qwen3-VL’s 4096-dim latent space, via MultiModalProjector and a ProjectorBank; a minimal sketch appears at the end of this subsection. (GitHub)

    • You then index events over time for hybrid retrieval:

      • time-based retrieval for @mm:ss queries,
      • semantic retrieval over the timeline using all-MiniLM-L6-v2. (GitHub)
  • Reasoning core

    • Qwen/Qwen3-VL-8B-Instruct with LoRA (q/k/v/o projections, r=16, α=32, bf16). (GitHub)
    • Conversation history manager (multi-turn, summarization, follow-up detection).
    • Feature cache for precomputed features keyed by content hash. (GitHub)
  • Future hooks (partially implemented / planned)

    • TriggerDetector for cascaded processing (only run heavy reasoning on “important” events).
    • TemporalContextManager with HiCo-style hierarchical compression to maintain a rolling long-horizon context.
    • KnowledgeBaseBuilder for entity-centric KB and explicit causal graphs. (GitHub)

As a systems design, this is coherent and aligns closely with modern “agent + memory” approaches like VideoAgent / VideoAgent2 and deep memory systems like VideoLucy, which show that structured memory beats simple long-context models on long video QA. (GitHub)
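To make the projector idea concrete, here is a minimal sketch of a per-encoder MLP projector mapping encoder tokens into a 4096-dim LLM space. The class name, hidden size, and example encoder dims are illustrative assumptions, not the repo’s actual MultiModalProjector / ProjectorBank API:

```python
import torch
import torch.nn as nn

class EncoderProjector(nn.Module):
    """Toy stand-in for a per-modality projector: encoder embedding -> LLM hidden size."""
    def __init__(self, encoder_dim: int, llm_dim: int = 4096, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, encoder_dim) -> (batch, num_tokens, llm_dim)
        return self.net(x)

# One projector per modality (dims here are just plausible examples).
projectors = nn.ModuleDict({
    "siglip": EncoderProjector(encoder_dim=1152),
    "videomae": EncoderProjector(encoder_dim=768),
})
siglip_tokens = torch.randn(1, 16, 1152)
llm_ready_tokens = projectors["siglip"](siglip_tokens)  # shape (1, 16, 4096)
```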

1.2 Strengths

  • Separation of concerns is good

    • perception/, audio/, temporal/, and fusion_indexing/ are separated from agent_core/. (GitHub)
    • ConversationHistory, FeatureCache, and ConfidenceScorer are clean, self-contained utilities with obvious responsibilities. (GitHub)
  • Hybrid retrieval is appropriate for your goal

    • Time-based + semantic retrieval against a timeline is exactly what you want for long gameplay, especially when questions reference specific moments or diffuse events; a tiny retrieval sketch follows this subsection. (GitHub)
    • This matches patterns in long-video agents that first retrieve temporal segments before deeper reasoning. (GitHub)
  • Multi-encoder design matches gameplay reality

    • Games are weird: HUD text, physics glitches, audio cues, UI events. A single vision encoder is usually not enough. Your combination (segmentation + vision embeddings + text + audio) is justified for complex QA scenarios. (GitHub)

Overall, the architecture conceptually makes sense for long-horizon gameplay QA.
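As a rough illustration of that hybrid routing (not the repo’s implementation), a query containing @mm:ss can go straight to time-based lookup, while everything else falls back to semantic search over the timeline with all-MiniLM-L6-v2:

```python
import re
from sentence_transformers import SentenceTransformer, util

# Toy timeline: each event has a timestamp in seconds and a text description.
timeline = [
    {"t": 12.0, "text": "player picks up health potion"},
    {"t": 95.5, "text": "boss phase 2 begins, arena floor collapses"},
    {"t": 310.0, "text": "player dies to fall damage after a dodge roll"},
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
event_embs = embedder.encode([e["text"] for e in timeline], convert_to_tensor=True)

def retrieve(query: str, k: int = 2):
    # Time-based route: queries like "what happened @5:10?"
    m = re.search(r"@(\d+):(\d{2})", query)
    if m:
        t = int(m.group(1)) * 60 + int(m.group(2))
        return sorted(timeline, key=lambda e: abs(e["t"] - t))[:k]
    # Semantic route: cosine similarity between the query and event descriptions.
    q_emb = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, event_embs)[0]
    top = scores.topk(min(k, len(timeline))).indices.tolist()
    return [timeline[i] for i in top]

print(retrieve("what happened @5:10?"))     # time route
print(retrieve("why did the player die?"))  # semantic route
```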

1.3 Obvious missing or over-complex pieces

These are design gaps, not implementation bugs.

  1. KB / causal graph is not “first-class” yet

    • README and future work emphasize entity-centric KB and causal link extraction, but the current implementation is mostly:

      • timeline index + semantic retrieval + generic Qwen reasoning. (GitHub)
    • For your core claim (“causal reasoning over long runs”), the KB should be the centerpiece, not a TODO.

  2. TemporalContextManager / HiCo is described but not concretely operational

    • You reference HiCo-style hierarchical token compression in future work, but the current system appears to rely mainly on:

      • timeline retrieval,
      • prompt-level summarization. (GitHub)
    • That is workable, but it underutilizes VideoMAE and temporal embeddings compared to recent long-video compression systems like VideoChat-Flash or ViLaMP, which explicitly design multi-level token hierarchies. (Hugging Face Forums)

  3. Perception stack may be overbuilt for the current reasoning layer

    • You are running SAM3, SigLIP, VideoMAE, Wav2Vec2, Whisper, and OCR, then injecting everything into Qwen. (GitHub)
    • Without ablations, you cannot tell if half of that stack is actually helping relative to a simpler “ASR + HUD OCR + one vision encoder” baseline.
  4. No explicit exploration vs cost policy

    • TriggerDetector is in future work, yet you are already paying the cost of heavy perception per frame. (GitHub)

    • Long sessions (hours) will be dominated by SAM3 and SigLIP. You need a clear strategy for:

      • which frames get full processing,
      • which regions get detailed features,
      • which intervals are summarized only coarsely.
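To make that cost question concrete, one possible gating policy (purely illustrative; the thresholds and function name are made up, and this is not the repo’s planned TriggerDetector) is to keep a sparse uniform stride plus any frame that changes a lot relative to the last frame sent for heavy processing:

```python
import numpy as np

def select_frames_for_heavy_processing(frames, diff_thresh=12.0, stride=8):
    """Cheap gating sketch: keep every `stride`-th frame, plus any frame whose mean
    absolute pixel difference from the last kept frame exceeds `diff_thresh`.
    `frames` is an iterable of HxWx3 uint8 arrays; the defaults are arbitrary."""
    keep, last = [], None
    for i, frame in enumerate(frames):
        gray = frame.astype(np.float32).mean(axis=-1)  # rough grayscale
        if last is None or i % stride == 0 or np.abs(gray - last).mean() > diff_thresh:
            keep.append(i)
            last = gray
    return keep

# Frames in `keep` would get SAM3/SigLIP/etc.; everything else only coarse summaries.
```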

Architecture verdict: solid high-level design, too many moving parts for the current level of evaluation. The shape is right; the pipeline is ahead of the evidence.


2. Repo quality: structure, readability, “what is this folder”

2.1 Structure

Top-level layout is clear and conventional:

  • src/ with subpackages:

    • agent_core/ – Qwen reasoning core, perception–reasoning loop.
    • perception/, audio/, temporal/, fusion_indexing/ – nicely separated domains.
  • scripts/ – explicit entry points for inference, feature extraction, training, LoRA, and projector demos.

  • data/raw_videos/, training/, outputs/.

  • docs/, tests/, outputs/ for weights and adapters. (GitHub)

There are almost no “what is this folder even for” moments. The structure is more organized than many research repos.

2.2 Code style and readability

From qwen_reasoning_core.py:

  • Heavy use of @dataclass, type hints, and docstrings.

  • Classes like ConversationHistory, FeatureCache, ConfidenceScorer are well documented, with clear examples in the docstrings. (GitHub)

  • Naming is descriptive, not cryptic:

    • timeline_indexer.py, knowledge_base_builder.py, siglip_semantic_encoder.py, etc. (GitHub)

This is above-average clarity for research code.

Areas you can tighten:

  1. Remove or explain A: xxx references in README / comments

    • The README includes references like [A: 147, A: 163, A: 418] and similar pointers that appear to be anchors from your article or internal notes. (GitHub)
    • For an external user, those mean nothing and feel like leftover research artifacts.
  2. Centralize configuration

    • Right now, flags for --use-sam, --fps, paths, etc., are spread across scripts. (GitHub)
    • A small config.py or YAML-based configuration would reduce duplication and make it easier to maintain multiple presets (full vs light, A100 vs 24GB consumer GPU).
  3. More examples / tests around KB and retrieval

    • You already have tests/, but from the README it is not obvious what is covered. (GitHub)

    • A few unit tests and mini examples for timeline indexing, KB building, and retrieval behavior over a small synthetic video log would help reviewers trust the core logic (a toy example follows this list).
  4. Docs for docs/ are not surfaced

    • README mentions docs/ but does not point to specific rendered docs or key design docs. (GitHub)
    • Either link to a main doc in that folder or briefly describe what’s inside.
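As an example of the kind of tiny test meant above, here is a self-contained pytest sketch against a toy in-memory timeline. The nearest_events helper is a hypothetical stand-in, not the API of timeline_indexer.py:

```python
# test_timeline_toy.py -- run with: pytest test_timeline_toy.py
def nearest_events(timeline, t_sec, k=1):
    """Hypothetical helper: return the k events closest in time to t_sec."""
    return sorted(timeline, key=lambda e: abs(e["t"] - t_sec))[:k]

def test_nearest_event_at_timestamp():
    timeline = [
        {"t": 10.0, "text": "checkpoint reached"},
        {"t": 65.0, "text": "boss fight starts"},
        {"t": 300.0, "text": "player dies"},
    ]
    assert nearest_events(timeline, 70.0)[0]["text"] == "boss fight starts"

def test_k_results_ordered_by_time_distance():
    timeline = [{"t": t, "text": f"event {t}"} for t in (0.0, 50.0, 100.0)]
    got = nearest_events(timeline, 60.0, k=2)
    assert [e["t"] for e in got] == [50.0, 100.0]
```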

Repo quality verdict: good layout and naming, minor documentation cruft, some configuration sprawl.


3. Reproducibility and setup / run path

3.1 What you provide now

The README offers:

  • Automated setup script setup_env.sh

    • Installs PyTorch 2.8, FlashAttention 2.8.3, core deps, PaddlePaddle GPU 3.2.0, PaddleOCR.
    • Notes that SAM3 requires a dev version of transformers>=5.0.0.dev0 from GitHub. (GitHub)
  • Tested environment

    • RunPod pytorch:2.8.0-py3.12-cuda12.8.0-ubuntu24.04, Python 3.12, CUDA 12.8.
    • GPU: H200 / A100, 40GB+ recommended.
    • Memory breakdown: Qwen3-VL (~16GB), SAM3 (~4GB), SigLIP (~2GB), VideoMAE (~1GB), audio/Whisper (~1GB). (GitHub)
  • Quick run commands

    • realtime_inference.py --video ... --use-sam --interactive
    • extract_features.py and training scripts with clear CLI examples. (GitHub)

This is better than many open-source research projects. The issue is who can realistically reproduce it.

3.2 Pain points

  1. Hardware barrier is high

    • 24GB VRAM minimum and 40–80GB recommended makes this inaccessible to most hobbyists and many researchers. (GitHub)
    • There is a “no SAM3” mode, but you don’t give a clear “this is the 16GB preset” or similar.
  2. transformers dev requirement is fragile

    • Depending on a dev branch of transformers (to support SAM3) is risky:

      • API changes,
      • incompatibilities with other libraries (vLLM, accelerate, etc.). (GitHub)
  3. PaddlePaddle GPU installation is non-trivial

    • You pin PaddlePaddle GPU 3.2.0 with a custom index URL, which is fine but brittle for environments outside Linux + CUDA 12.6. (GitHub)
  4. No “minimal config” story yet

    • You show --use-sam and “without SAM3 (faster)”, but not a full “minimal reproducible config” that:

      • uses only Qwen3-VL + ASR + OCR,
      • runs on e.g. a 24GB GPU,
      • skips VideoMAE and extra encoders for first-time users.

3.3 Concrete improvements for reproducibility

You can raise reproducibility from ~6/10 to 8/10 with a modest amount of work:

  1. Publish a Dockerfile or container image

    • You already have a specific RunPod image; write a Dockerfile equivalent and link it in README. (GitHub)
  2. Define two explicit presets

    • preset_full.yaml: all encoders enabled, A100/H100-class.

    • preset_light.yaml:

      • only SigLIP + Whisper + OCR,
      • no SAM3, no VideoMAE,
      • runs on 24GB.
    • Map CLI options to these configs so users just run --preset full or --preset light (an illustrative preset sketch follows this list).

  3. Bundle a tiny example dataset

    • 1–2 short gameplay clips (e.g. 30–60s) plus 3–5 example questions.

    • Provide a “smoke test” script that:

      • runs feature extraction,
      • runs inference,
      • prints Q&A and timeline snippets.
  4. Clearly scope “what you need to re-train”

    • You already list LoRA and projector training commands but do not state dataset scale or expected compute. (GitHub)

    • Add approximate:

      • number of Q&A pairs used,
      • training time on a given GPU,
      • any important training hyperparameters beyond what is already listed.
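Picking up item 2, one illustrative way to express the two presets (file names, keys, and values are assumptions, not the repo’s current schema):

```python
import yaml  # pip install pyyaml

# Hypothetical contents of preset_light.yaml / preset_full.yaml.
PRESETS = yaml.safe_load("""
light:   # ~24GB-class GPU
  model: Qwen/Qwen3-VL-8B-Instruct
  encoders: {siglip: true, whisper: true, ocr: true, sam3: false, videomae: false}
  sampling: {fps: 2, max_frames: 64}
  precision: bf16
full:    # A100/H100-class GPU
  model: Qwen/Qwen3-VL-8B-Instruct
  encoders: {siglip: true, whisper: true, ocr: true, sam3: true, videomae: true}
  sampling: {fps: 8, max_frames: 256}
  precision: bf16
""")

def load_preset(name: str) -> dict:
    # Intended to back a CLI flag like --preset light / --preset full.
    return PRESETS[name]
```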

Reproducibility verdict: good documentation, high hardware + dependency friction, missing “minimal path.”


4. ML / research critique: what I would expect before believing the claims

4.1 Current claims

You state that:

  • The project validates long-horizon reasoning over extended timelines.
  • Multimodal alignment works: SigLIP, VideoMAE, and Wav2Vec2 projected into Qwen’s space.
  • LoRA fine-tuning enables structured “why” answers, linking actions to subsequent states. (GitHub)

You also define training methods:

  • LoRA on Qwen3-VL targeting q/k/v/o projections (rank 16, α=32, LR 2e-4).
  • A “Generative Alignment Objective” for projectors: MSE pushing projected embedding norm toward √(hidden_dim) to ensure compatibility. (GitHub)

These are interesting ideas, but you do not yet show numbers or baseline comparisons.
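To pin down the norm-matching idea, here is one minimal reading of such an objective (my interpretation of the description above, not the repo’s exact loss): an MSE that pulls each projected token’s L2 norm toward √hidden_dim, the typical scale of a hidden vector with unit-variance components.

```python
import math
import torch
import torch.nn.functional as F

def norm_alignment_loss(projected: torch.Tensor) -> torch.Tensor:
    """projected: (batch, tokens, hidden_dim) output of a modality projector.
    Penalizes per-token L2 norms that deviate from sqrt(hidden_dim)."""
    hidden_dim = projected.shape[-1]
    target = math.sqrt(hidden_dim)
    norms = projected.norm(dim=-1)  # (batch, tokens)
    return F.mse_loss(norms, torch.full_like(norms, target))
```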

4.2 Baselines I would require

Given current 2025 work:

  • VideoGameQA-Bench evaluates VLMs on video game QA activities like visual unit tests, regression, needle-in-haystack, glitch detection, and bug-report generation. (arXiv)
  • PhysGame tests physical commonsense violations in gameplay video (880 glitch videos with MCQs). (arXiv)
  • GlitchBench focuses on glitch detection capabilities of multimodal models. (Semantic Scholar)

At minimum, I would expect:

  1. Baseline 1: plain video LMM

    • e.g. Qwen2.5-VL or VideoChat-Flash with uniform frame sampling and ASR, but no external timeline/KB, no extra encoders. (Hugging Face Forums)
  2. Baseline 2: caption-based agent

    • A simpler agent that uses a single encoder and text captions (VideoINSTA / VideoAgent style) with basic temporal indexing. (GitHub)
  3. Your system

    • With the full or light perception stack, using your timeline, KB, and projectors.

Metrics to report:

  • QA accuracy or MCQ accuracy on:

    • VideoGameQA-Bench video tasks.
    • PhysGame.
    • Optionally, GlitchBench. (arXiv)
  • Temporal localization accuracy (if tasks include timestamps).

  • Average token count per query and GPU inference time.

Only then does your “causal reasoning over long runs” claim become convincing.

4.3 Ablations I would expect

  1. Encoder ablations

    • Qwen3-VL + ASR/OCR only.
    • + SigLIP.
    • + SigLIP + VideoMAE.
    • + SAM3 (full stack).
    • Compare QA performance and cost.
  2. Projector objective ablations

    • Your Generative Alignment Objective (norm matching). (GitHub)

    • Versus:

      • plain linear projection with no training,
      • standard contrastive / cosine alignment,
      • or joint fine-tuning with LoRA.
  3. Timeline vs naive long-context

    • Run Qwen3-VL with long-context (many frames) and no explicit timeline, then compare:

      • accuracy,
      • whether the timeline+KB reduces token usage and improves answers.
  4. KB vs no KB

    • Use timeline-only vs timeline+entity-KB once KB is more complete.
    • Evaluate on “why” questions that involve state transitions (HP loss, buffs, phases).

Research verdict: promising ideas, but currently anecdotal. Needs baseline comparisons, encoder ablations, and benchmarks like VideoGameQA-Bench / PhysGame before the claims are convincing.


5. Scope: what to cut, simplify, or rewrite first

You have a lot of moving parts. To make this maintainable and publishable, I would narrow scope in this order.

5.1 Cut / de-emphasize

  1. Fancy tool-calling and boss wiki commands

    • /search, /wiki, /boss are nice, but they dilute the core story. (GitHub)
    • For now, these can be labeled “experimental” or moved behind a flag; they are not central to long-horizon gameplay reasoning.
  2. Multi-GPU parallelization and batch tuning

    • The Future Work section talks about multi-GPU pipeline parallelism, tuning SigLIP batch sizes, etc. (GitHub)
    • These are optimizations. Without solid evaluation, speeding up an unproven pipeline is premature.
  3. Low-priority research items

    • Game-specific adapters, interactive training, etc., are fine ideas but should come after core evaluation. (GitHub)

5.2 Simplify

  1. Perception tiers

    • Define three tiers explicitly:

      • Tier 0 “light”: Qwen3-VL + ASR + OCR + SigLIP only.
      • Tier 1 “standard”: add VideoMAE.
      • Tier 2 “full”: add SAM3 and heavy KB construction.
    • Document each tier’s hardware needs and target use case.

  2. Dependency footprint

    • Make SAM3 and PaddleOCR optional extras.
    • Suggest simpler alternatives (e.g., Tesseract OCR) for users who only want basic HUD text extraction.
  3. Configuration

    • Move env-specific details (CUDA versions, wheel URLs) into config or environment docs rather than burying them in README.

5.3 Rewrite / solidify

  1. Entity-KB + causal graph

    • Make knowledge_base_builder.py the centerpiece.

    • Define a stable schema for:

      • entity types,
      • state variables (HP, position, statuses),
      • relations (ATTACKS, COLLIDES_WITH, HEALS, TRIGGERS, etc.).
    • Ensure this KB is exported to disk and used explicitly in prompts (a schema sketch follows this list).

  2. Temporal context handling

    • Implement a concrete, tested TemporalContextManager:

      • segment-level summaries (N sec),
      • higher-level summaries (N minutes),
      • pointers into detailed events.
    • Use that instead of ad-hoc context concatenation.

  3. Evaluation harness

    • Add a benchmarks/ or experiments/ folder with scripts that:

      • run on VideoGameQA-Bench subsets,
      • produce tables comparing baselines. (arXiv)
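Picking up item 1, one way to freeze a KB schema (entity types, fields, and relation names here are illustrative assumptions) is plain dataclasses exported to JSON per run:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class EntityState:
    hp: float | None = None
    position: tuple[float, float] | None = None
    statuses: list[str] = field(default_factory=list)

@dataclass
class Entity:
    entity_id: str
    entity_type: str                      # e.g. "player", "boss", "projectile"
    states: dict[float, EntityState] = field(default_factory=dict)  # timestamp -> state

@dataclass
class CausalEdge:
    src: str                              # entity_id
    relation: str                         # "ATTACKS", "COLLIDES_WITH", "HEALS", "TRIGGERS"
    dst: str                              # entity_id
    t: float                              # timestamp in seconds
    evidence: str = ""                    # pointer back into the timeline

def export_kb(entities: list[Entity], edges: list[CausalEdge], path: str) -> None:
    with open(path, "w") as f:
        json.dump({"entities": [asdict(e) for e in entities],
                   "edges": [asdict(e) for e in edges]}, f, indent=2)
```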

Scope verdict: trim the side features, double down on KB + timeline, and ship a minimal but fully evaluated core.


6. Concrete action items

If you want maximum impact per unit of effort:

  1. Ratings (for clarity)

    • Overall: 7/10
    • README clarity: 8/10 (strong, but dense and with some confusing references).
    • Code quality: 7/10 (clean style, but needs more config centralization and tests around KB / retrieval).
    • Novelty / interest: 8/10 (gameplay + structured memory is a real niche; aligned with new game QA benchmarks). (arXiv)
    • Reproducibility: 6/10 (good docs, high hardware + dependency friction, no minimal config yet).
  2. Short-term (weeks)

    • Freeze a KB / timeline schema and export it as JSON/graph per run.
    • Add preset_light and preset_full, with clear VRAM requirements.
    • Add one smoke-test example (short clip + questions) that runs end-to-end.
    • Clean up README references ([A: 147] etc.).
  3. Medium-term (months)

    • Implement a real TemporalContextManager with hierarchical summaries.

    • Run a small but real evaluation against:

      • a strong long-video baseline (e.g., Qwen2.5-VL or VideoChat-Flash),
      • at least one game QA benchmark (VideoGameQA-Bench subset or a small custom set). (Hugging Face Forums)
  4. Longer-term

    • Add more robust KB-based causal reasoning (explicit action→effect linking).
    • Explore game-specific adapters and streaming mode once the core is validated.

Final bullet summary

  • Architecture is conceptually strong and matches modern agent+memory trends. The main gap is that KB and temporal context are not yet fully realized in code. (GitHub)
  • Repo structure and code style are above average: clear names, clear modules, solid docstrings, but some references and config details need cleanup. (GitHub)
  • Reproducibility is okay but hardware-heavy; adding a light preset and a Dockerfile would make a big difference. (GitHub)
  • From a research perspective, you need baseline comparisons and ablations, especially against long-video LMMs and benchmarks like VideoGameQA-Bench / PhysGame. (arXiv)
  • The most important scope move is to prioritize the gameplay KB + timeline and their evaluation over extra features like web tool-calling and multi-GPU optimization.

Thank you so much for taking the time to write this; it genuinely helped a lot. This is my first research project and open-source release, so I really appreciate the depth and clarity of your feedback. I agree with everything you pointed out, and I’m going to start working through the improvements you suggested.

The evaluation benchmarks you mentioned are especially valuable. I’ve been wanting to run proper evaluations, but I wasn’t sure which benchmarks were most appropriate, and the compute estimates for some long-video evaluations were showing 100+ hour runtimes. Having concrete directions like VideoGameQA-Bench, PhysGame, and GlitchBench gives me a much clearer roadmap for making the results meaningful.

I’m genuinely grateful for your detailed review. I’ll be diving into these ideas and figuring out how to implement the next round of improvements. Thanks again, it means a lot.


If it’s useful, great. :grinning_face: I’ll also include how to use these benchmarks, just in case.


Use a 3-phase plan:

  1. Cheap game-specific eval: GlitchBench + PhysGame.
  2. Mid-cost game QA eval: a subset of VideoGameQA-Bench.
  3. Optional long-video stress test: a small slice of LVBench / LongVideoBench / MLVU.

In all phases:

  • Cache perception once per video.
  • Compare plain video LMM baseline vs your timeline/KB system.
  • Track both accuracy and compute (frames, tokens, time).

I’ll break this down clearly.


1. Background: what each benchmark actually gives you

1.1 VideoGameQA-Bench

  • NeurIPS 2025 dataset & benchmarks track.
  • Designed to test VLMs on real game QA workflows: visual unit tests, visual regression, needle-in-a-haystack, glitch detection, and bug report generation, for both images and videos. (asgaardlab.github.io)
  • Roughly a few thousand questions in total: the official site lists nine tasks and 4,786 QA pairs over multiple games. (GitHub)

So this is your “does this help game QA at all?” benchmark.

1.2 PhysGame

  • Benchmark for physical commonsense violations in gameplay videos.
  • 880 videos, each with a glitch and a multiple-choice question about what physics rule was violated.
  • Covers 4 physics domains (mechanics, kinematics, optics, materials) and 12 categories (e.g., gravity, velocity). (physgame.github.io)

This is your “does the model notice physics-breaking events?” test.

1.3 GlitchBench

  • CVPR 2024 benchmark: 593 glitches curated from r/GamePhysics across 205 games. (arXiv)
  • Mostly short clips or single frames from weird or broken situations.
  • Evaluates whether LMMs can detect and interpret glitches, not just recognise objects. (GlitchBench)

This is your “generic glitch detection” test.

1.4 Long-video context benchmarks (for later)

You only need one of these as a long-video sanity check:

  • LVBench

    • “Extreme long video” benchmark with long videos from TV, sports, surveillance, etc.
    • Multiple tasks aimed at long-term comprehension; average length is in the many-minutes range and some videos go much longer. (LVBench)
  • LongVideoBench

    • 3,763 web videos with subtitles, up to about an hour each.
    • 6,678 human-annotated MCQ questions across 17 categories. (arXiv)
  • MLVU

    • 1,730 videos, 3,102 QA pairs.
    • Video lengths from about 3 minutes to 2 hours, 9 tasks; covers multiple genres, including games. (arXiv)

You use one of these only after you’ve nailed the game-specific benchmarks.


2. High-level evaluation goals for Gameplay-Vision-LLM

Your system’s advertised strengths:

  1. Long-horizon gameplay understanding (10+ minute sessions).
  2. Timeline of key events instead of drowning the LLM in frames.
  3. Causal reasoning: “what sequence of events led to X?” not just “what is in this frame?”

So your evaluation plan should explicitly measure:

  • Task performance

    • Accuracy on glitch and physics detection.
    • QA quality and bug-oriented answers.
  • Temporal / causal value

    • Cases where the answer depends on events spread across time.
  • Efficiency

    • Frames used per question.
    • Tokens and wallclock per question.

Every experiment should compare:

  • A plain long-video baseline (video LMM without timeline/KB).
  • Your timeline + retrieval system (light and full configs).

3. Phase 1 — Cheap, focused eval: GlitchBench + PhysGame

3.1 Why Phase 1 is “cheap”

Everything is short, so perception and LLM passes are cheap, especially with caching.

3.2 Model configurations to test

Define three configurations:

  1. Baseline-Plain

    • A strong video LMM (e.g. your Qwen3-VL) with uniform frame sampling, ASR/OCR if available.
    • No external timeline/KB, no extra encoders.
  2. GVP-Light (Gameplay-Vision-LLM light)

    • Your pipeline with:

      • one visual encoder (e.g. SigLIP or built-in vision),
      • ASR/OCR,
      • no SAM3, no VideoMAE.
    • Timeline index + simple KB if ready.

  3. GVP-Full (optional in Phase 1)

    • Full perception stack (SAM3 + VideoMAE etc.) if you want to see the upper bound.

3.3 Implementation strategy: cache then evaluate

For each dataset, use a two-stage process:

  1. Stage A: perception + timeline cache

    • For each clip:

      • decode frames at a modest FPS (e.g. 4–8 fps cap).

      • run perception (appropriate encoders for chosen preset).

      • build:

        • frame embeddings,
        • timeline events,
        • early KB if implemented.
    • Save outputs under a per-video ID (e.g. .npz for features + JSON for timeline/KB).

  2. Stage B: question answering

    • For each benchmark question:

      • load cached features.
      • retrieve relevant segments from the timeline.
      • call the LLM to answer.

This is the standard trick used when evaluating on long-video benchmarks like MLVU and LongVideoBench, so you don’t recompute encoders for every question. (arXiv)
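A minimal version of that cache-then-evaluate split might look like the following; the paths, keys, and the extract_features / build_timeline / answer_fn callables are placeholders for whatever your pipeline exposes:

```python
import hashlib
import json
from pathlib import Path
import numpy as np

CACHE = Path("eval_cache")

def video_key(video_path: str) -> str:
    # Content-hash the clip so re-encodes or renames don't silently reuse stale features.
    # (Fine for short benchmark clips; hash size+mtime instead for hour-long videos.)
    return hashlib.sha256(Path(video_path).read_bytes()).hexdigest()[:16]

def stage_a(video_path: str, extract_features, build_timeline) -> str:
    """Run perception once per video; persist features (.npz) and the timeline (JSON)."""
    key = video_key(video_path)
    out = CACHE / key
    out.mkdir(parents=True, exist_ok=True)
    feats = extract_features(video_path)           # e.g. {"frame_embs": np.ndarray, ...}
    np.savez_compressed(out / "features.npz", **feats)
    (out / "timeline.json").write_text(json.dumps(build_timeline(feats)))
    return key

def stage_b(key: str, question: str, answer_fn) -> str:
    """Answer one benchmark question from cached artifacts only (no re-running encoders)."""
    out = CACHE / key
    with np.load(out / "features.npz") as npz:
        feats = {k: npz[k] for k in npz.files}
    timeline = json.loads((out / "timeline.json").read_text())
    return answer_fn(question, feats, timeline)    # retrieval + LLM call happens in here
```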

3.4 How to handle each benchmark

GlitchBench

  • Input: 593 glitches; some are frames, some are short clips; each has text describing a QA scenario and a target label. (arXiv)
  • Task: detect whether there is a glitch and sometimes what kind (depending on the subset you use).
  • Metrics: accuracy, maybe F1 for glitch vs no-glitch.

Plan:

  • Build a small adapter that takes GlitchBench’s data format → your generic “video+question” interface.

  • For each sample, ask the model a specific prompt like:

    • “Is there a glitch in this gameplay? Answer ‘yes’ or ‘no’ and briefly explain.”
  • For MCQ variants, ask the model to choose “A/B/C/…”, then map to the label.

Compare:

  • Baseline-Plain vs GVP-Light vs GVP-Full on accuracy and average inference time per sample.

PhysGame

  • Input: 880 gameplay clips, each with an MCQ about what physical commonsense is violated. (arXiv)
  • Task: choose the correct option (e.g. which physics rule is broken).
  • Metrics: MCQ accuracy; maybe per-domain (mechanics, kinematics, etc.).

Plan:

  • Use the same two-stage process with caching.

  • For each example, prompt like:

    • “Choose the correct option (A/B/C/D). Do not invent new options.”
  • Parse the model’s first token or option letter and map it to the options (see the parsing sketch below).

Again, compare Baseline-Plain vs GVP-Light vs GVP-Full.
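For both benchmarks, the fiddly part is mapping free-form model output back to labels. A small deterministic parser (illustrative; adjust to however the benchmarks format their options) keeps scoring reproducible:

```python
import re

def parse_yes_no(answer: str) -> bool | None:
    """GlitchBench-style binary question: did the model say a glitch is present?"""
    m = re.search(r"\b(yes|no)\b", answer.strip().lower())
    return None if m is None else m.group(1) == "yes"

def parse_mcq_letter(answer: str, num_options: int = 4) -> str | None:
    """PhysGame-style MCQ: grab the first standalone option letter (A/B/C/D...)."""
    letters = "ABCDEFGH"[:num_options]
    m = re.search(rf"\b([{letters}])\b", answer.strip().upper())
    return m.group(1) if m else None

assert parse_yes_no("Yes, the character clips through the wall.") is True
assert parse_mcq_letter("The answer is (B) because gravity is violated.") == "B"
```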

At the end of Phase 1 you should know:

  • Does your structured timeline/KB help on short clips for glitches and physics anomalies?
  • How heavy is the full stack vs light stack in practice?

4. Phase 2 — Game QA eval: targeted subset of VideoGameQA-Bench

4.1 Why not run all of it immediately

VideoGameQA-Bench has nine tasks and thousands of QA pairs. (asgaardlab.github.io)

Running all tasks, all samples, all configs from day one will:

  • burn compute,
  • complicate analysis.

You only need 2–3 tasks to demonstrate your system’s strengths.

4.2 Pick the most relevant tasks

From the paper and site, VideoGameQA-Bench includes tasks like: (asgaardlab.github.io)

  • visual unit test,
  • visual regression,
  • video needle-in-a-haystack,
  • video glitch detection,
  • bug-report style QA.

For Gameplay-Vision-LLM, the best match is:

  1. Video needle-in-a-haystack

    • Stresses long-range retrieval and your timeline.
  2. Video glitch detection

    • Overlaps with GlitchBench / PhysGame, but in this benchmark’s style.
  3. Video bug report / freeform QA (if available as open-ended QA in the bench)

    • Tests multi-step and causal descriptions.

You can ignore image-only tasks at first.

4.3 Sampling strategy to keep runtime sane

For each chosen task:

  1. Start with a tiny debug set (e.g. 50 examples).
  2. Once everything runs, scale to a moderate subset (e.g. 500–1,000 QA pairs total, not the full 4,786). (GitHub)

Frame sampling:

  • Set a frame budget per video, e.g. 64–128 frames.
  • Use uniform sampling or simple motion-aware sampling for now.
  • Lean on your timeline to decide which frames are actually fed into the LLM.
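Uniform sampling under a frame budget is a one-liner; a sketch, assuming you already know the clip’s decoded frame count:

```python
import numpy as np

def uniform_frame_indices(num_frames: int, budget: int = 96) -> np.ndarray:
    """Pick at most `budget` evenly spaced frame indices across the whole clip."""
    if num_frames <= budget:
        return np.arange(num_frames)
    return np.linspace(0, num_frames - 1, num=budget).round().astype(int)

# e.g. a 10-minute clip decoded at 30 fps has 18,000 frames; keep 96 of them:
indices = uniform_frame_indices(18_000, budget=96)
```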

4.4 Metrics and comparisons

For each task and for your chosen subset:

  • Measure the benchmark’s standard metrics (accuracy, F1, maybe BLEU for bug reports). (arXiv)

  • For each model configuration (Baseline-Plain, GVP-Light, GVP-Full) record:

    • frames per video,
    • tokens per question,
    • average LLM latency.

Your key table for this phase will look like:

  • rows = model configs.
  • columns = metric per task + “frames used” + “tokens per QA”.

This shows both quality and efficiency.
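Aggregating per-question logs into that table is easy if every run appends one record per question; a pandas sketch with made-up example records:

```python
import pandas as pd

# One record per (config, task, question); values below are dummy placeholders.
records = [
    {"config": "Baseline-Plain", "task": "needle", "correct": 1, "frames": 128, "tokens": 5200, "latency_s": 9.1},
    {"config": "GVP-Light",      "task": "needle", "correct": 1, "frames": 64,  "tokens": 2100, "latency_s": 4.3},
    {"config": "GVP-Light",      "task": "glitch", "correct": 0, "frames": 64,  "tokens": 1900, "latency_s": 3.8},
]

table = (
    pd.DataFrame(records)
    .groupby(["config", "task"])
    .agg(accuracy=("correct", "mean"),
         frames=("frames", "mean"),
         tokens_per_qa=("tokens", "mean"),
         latency_s=("latency_s", "mean"))
)
print(table)  # rows = config x task, columns = quality + efficiency metrics
```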


5. Phase 3 — One long-video stress test

After Phases 1 and 2:

  • You already have a strong game-specific story.
  • Now add one generic long-video benchmark to test temporal scaling.

5.1 Choose exactly one benchmark

Options:

  • LVBench

    • Focused on extreme long videos with various genres. (LVBench)
  • LongVideoBench

    • 3,763 videos, up to about an hour, with 6,678 MCQs across 17 categories. (arXiv)
  • MLVU

    • 1,730 videos, 3,102 QA pairs; 3–120 minute lengths and 9 tasks. (arXiv)

A pragmatic choice is MLVU or LongVideoBench:

  • Both are widely used;
  • They already report multiple baselines you can compare against. (NeurIPS Proceedings)

5.2 Subsetting to avoid huge compute

For the chosen benchmark:

  1. Select one or two tasks that resemble QA over events (not specialized things like OCR).

  2. Limit to 50–100 videos with lengths you can handle (e.g. 5–30 minutes).

  3. Use aggressive pre-processing:

    • cap total frames per video (e.g. 256 frames),
    • build a hierarchical timeline summary per video,
    • store everything in cache.

Do not attempt the full benchmark initially. You want a signal that your approach scales to hour-class content, not leaderboard domination.

5.3 What you report

For this long-video experiment:

  • Benchmark’s metric (accuracy / score) for:

    • Baseline-Plain with the same frame budget,
    • GVP-Light,
    • GVP-Full (if feasible).
  • Frames, tokens, and latency per QA.

This becomes the “our method scales to generic long video tasks as well” paragraph, supported by recognized LVU benchmarks. (arXiv)


6. Putting it all together: practical checklist

Here is a direct checklist you can follow:

  1. Decide presets and baselines

    • Baseline-Plain = Qwen3-VL (or similar) with naive sampling.
    • GVP-Light = your pipeline with one encoder + ASR/OCR, no SAM3/VideoMAE.
    • GVP-Full = full perception stack.
  2. Phase 1 (short clips, cheap)

    • Implement loaders and adapters for GlitchBench and PhysGame. (arXiv)
    • Build a perception+timeline cache per clip.
    • Run MCQ / classification evaluation for all clips.
    • Compare accuracy + latency across configs.
  3. Phase 2 (game QA)

    • Integrate VideoGameQA-Bench. (asgaardlab.github.io)
    • Pick 2–3 video tasks (needle-in-a-haystack, glitch, bug-like QA).
    • Evaluate on 500–1,000 QA pairs total with caching.
    • Log task metrics, frames, tokens, latency.
  4. Phase 3 (long-video stress)

    • Choose one: LVBench / LongVideoBench / MLVU. (LVBench)
    • Subset to 50–100 videos and 1–2 tasks.
    • Use frame caps and hierarchical summaries.
    • Run a small comparison (Baseline-Plain vs GVP-Light).
  5. Across all phases

    • Always record:

      • benchmark metric,
      • number of frames per input,
      • tokens per QA,
      • average GPU time per QA.
    • Use the same prompts and decoding parameters across configs so comparisons are fair.

You end up with:

  • Phase 1: “We improve glitch/physics detection on GlitchBench and PhysGame.”
  • Phase 2: “We help on realistic game QA tasks from VideoGameQA-Bench with fewer frames/tokens.”
  • Phase 3: “Our approach scales to generic long-video QA on a recognized LVU benchmark.”

That is a practical, compute-aware plan that directly tests what Gameplay-Vision-LLM claims to do, without requiring you to run every long-video benchmark at full scale.

Thanks, implementing these as we speak!
