---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
- video-understanding
- temporal-localization
- qwen
---
# DisTime: Distribution-based Time Representation for Video Large Language Models

This repository contains the official implementation and checkpoints for the paper:
[**DisTime: Distribution-based Time Representation for Video Large Language Models**](https://huggingface.co/papers/2505.24329) (ICCV 2025).

For more details, including installation, training, and evaluation scripts, please refer to the official [GitHub repository](https://github.com/josephzpng/DisTime).

<div align="center">
<img src="https://github.com/josephzpng/DisTime/raw/main/images/network.png" width="600px"/>
</div>

## Abstract

Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at [this URL](https://github.com/josephzpng/DisTime).

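
As a rough illustration of how a probability distribution over time bins can yield a continuous timestamp, the sketch below takes the softmax of per-bin logits and returns the expected bin position, scaled by the video duration. This is not the DisTime decoder itself: the function names `softmax` and `expected_time`, the evenly spaced bins on [0, 1], and the expectation readout are all assumptions made for illustration. The paper and the GitHub repository describe the actual design.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def expected_time(logits, duration):
    """Hypothetical readout: map per-bin logits to a timestamp in seconds.

    Bins are assumed to be evenly spaced over the normalized range [0, 1].
    """
    probs = softmax(np.asarray(logits, dtype=np.float64))
    bin_positions = np.linspace(0.0, 1.0, num=probs.shape[-1])
    normalized_time = float((probs * bin_positions).sum())
    return normalized_time * duration


# Logits peaked on the middle bin of five, for a 60-second video.
start_logits = np.array([0.0, 0.0, 8.0, 0.0, 0.0])
print(expected_time(start_logits, duration=60.0))
```

Because the readout is an expectation, probability mass spread over neighbouring bins shifts the estimate smoothly instead of snapping it to a single discrete value.
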
## Dataset

The InternVid-TG dataset proposed in the paper is released at: [yingsen/internvid-tg](https://huggingface.co/datasets/yingsen/internvid-tg).

## Usage

You can load the model with the `transformers` library and use it for video understanding tasks. The repository ships custom modeling code, so `trust_remote_code=True` is passed when loading.

```python
import numpy as np
import torch
from decord import VideoReader, cpu
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

MODEL_ID = "UserJoseph/DisTime-1B"

# Load model, tokenizer, and processor
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

# Example video input
video_path = "./examples/video1.mp4"  # Replace with your video path
qs = "Describe this video in detail"

# Load video frames, sampling roughly one frame per second
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
step = max(1, round(fps))  # avoid a zero step for very low frame rates
frame_indices = np.arange(0, len(vr), step)
video_frames = vr.get_batch(frame_indices).asnumpy()  # (num_frames, H, W, 3)

# Prepare inputs
messages = [{"role": "user", "content": [{"type": "video", "video": video_frames}, {"type": "text", "text": qs}]}]
# Note: tokenize=False renders only the prompt text, and the tokenizer call below
# encodes that text alone. Refer to the official GitHub repository for the
# complete video preprocessing pipeline.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response (greedy decoding, so no temperature is set)
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=128,
        use_cache=True,
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```

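
The one-frame-per-second sampling used in the usage example can be isolated into a small helper that also returns the timestamp of each sampled frame, which is handy when relating predicted time spans back to frames. This is a minimal sketch: the helper name `sample_frame_indices` and the optional `max_frames` budget are hypothetical additions for illustration and are not part of the DisTime codebase.

```python
import numpy as np


def sample_frame_indices(num_frames, fps, max_frames=None):
    """Pick roughly one frame per second.

    Returns (indices, timestamps), where timestamps are in seconds.
    `max_frames` is a hypothetical frame budget for long videos.
    """
    step = max(1, round(fps))  # never step by zero on very low frame rates
    indices = np.arange(0, num_frames, step)
    if max_frames is not None and len(indices) > max_frames:
        # Thin the selection uniformly so it fits within the budget.
        keep = np.linspace(0, len(indices) - 1, num=max_frames).round().astype(int)
        indices = indices[keep]
    timestamps = indices / fps
    return indices, timestamps


# A 10-second clip at 30 frames per second.
indices, timestamps = sample_frame_indices(num_frames=300, fps=30.0)
print(indices.tolist())
print(timestamps.tolist())
```

With `decord`, `num_frames` would come from `len(vr)` and `fps` from `vr.get_avg_fps()`, and the returned indices can be passed to `vr.get_batch`.
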
## Citation

If you find this work useful, please cite the paper:

```bibtex
@article{zeng2025distime,
  title={DisTime: Distribution-based Time Representation for Video Large Language Models},
  author={Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
  journal={arXiv preprint arXiv:2505.24329},
  year={2025}
}
```

## Acknowledgement

DisTime builds on the codebases of the following projects: [InternVL](https://github.com/OpenGVLab/InternVL) and [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT). We are sincerely grateful for these open-source contributions, which have greatly facilitated our research on time representation for video large language models.