arXiv:2605.01506

Omni-Encoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder

Published on May 2

Abstract

AI-generated summary: Omni-Encoder presents a unified Transformer architecture that processes visual and audio signals at equal temporal resolution within a shared latent space, improving cross-modal interaction and fine-grained visual motion capture compared to traditional modality-specific approaches.

Recent advances in omni-modal large language models have enabled remarkable progress in joint vision-audio understanding. However, prevailing architectures rely on modality-specific encoders with a video-coarse, audio-dense design -- sampling visual frames at 1--2 fps while processing audio waveforms at 25 fps -- resulting in systems that perceive video frame by frame, modality by modality rather than holistically as humans do. Such a discrepancy leaves models with impoverished cross-modal interaction during encoding and an inability to capture fine-grained visual motion. To bridge this gap, we present Omni-Encoder, a unified Transformer backbone designed to co-embed visual and audio signals at a symmetrical 25 fps within a shared latent space. This architecture leverages three core innovations -- the Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting -- to effectively reconcile the dual challenges of modality disentanglement and computational efficiency. Experiments demonstrate that, compared to the modality-specific baseline Qwen2.5-Omni under the same input token budget to the LLM decoder, Omni-Encoder delivers substantial gains on visual continuous understanding tasks -- such as sign language recognition and fine-grained sports action analysis -- while maintaining competitive performance on established audio-visual benchmarks such as AVQA and Speaker Identification and Localization. These results suggest that unified omnivorous encoding offers a promising direction for building omni-modal models that more closely reflect the integrated nature of human perception.
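The abstract's central design point is co-embedding the two streams at a matched 25 fps, so that every timestep contributes temporally aligned visual and audio tokens to one shared sequence. The sketch below illustrates what such symmetric-rate interleaving with a shared per-step temporal index could look like; the interleave_av_tokens helper, the tensor shapes, and the video-then-audio layout per step are illustrative assumptions for this page, not the paper's released implementation, and the actual Omni-Encoder Token Template, Omni-RoPE, and Temporal Window Shifting mechanisms are specified only in the paper.

# Hypothetical sketch: merge two symmetric-rate (25 fps) token streams
# into one interleaved sequence for a shared Transformer backbone.
# Shapes and layout are illustrative, not the paper's code.
import torch

def interleave_av_tokens(video_tokens: torch.Tensor,
                         audio_tokens: torch.Tensor):
    """Interleave per-timestep visual and audio tokens.

    Both inputs are (batch, T, dim) with T = seconds * 25, i.e. one
    token per modality per 25 fps step. Returns a (batch, 2*T, dim)
    sequence ordered [v_0, a_0, v_1, a_1, ...] plus a temporal index
    per token, so both tokens of step t share the same time position.
    """
    assert video_tokens.shape == audio_tokens.shape
    batch, steps, dim = video_tokens.shape
    # Stack to (batch, T, 2, dim), then flatten so each step emits
    # its video token followed by its audio token.
    merged = torch.stack([video_tokens, audio_tokens], dim=2)
    merged = merged.reshape(batch, steps * 2, dim)
    # Shared temporal index: 0, 0, 1, 1, 2, 2, ...
    time_ids = torch.arange(steps).repeat_interleave(2)
    return merged, time_ids

# Toy usage: 2 seconds of input at a symmetric 25 fps per modality.
video = torch.randn(1, 50, 768)
audio = torch.randn(1, 50, 768)
tokens, time_ids = interleave_av_tokens(video, audio)
print(tokens.shape)   # torch.Size([1, 100, 768])
print(time_ids[:6])   # tensor([0, 0, 1, 1, 2, 2])

Giving both modalities at step t the same temporal index is the kind of role a shared temporal rotary embedding could play; the paper's Omni-RoPE presumably also encodes spatial and modality structure, but the abstract does not spell those details out.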

Get this paper in your agent:

hf papers read 2605.01506
Don't have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash
