arxiv:2512.21734

Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

Published on Dec 25
Submitted by Zihan Wang on Dec 30

Abstract

Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a "running ahead" mechanism that dynamically updates the reference frame's temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.

Community

Paper submitter

Framework overview (figure)

We propose Knot Forcing, a streaming framework for real-time portrait animation that enables high-fidelity, temporally consistent, and interactive video generation from dynamic inputs such as reference images and driving signals. Unlike diffusion-based models that are non-causal and latency-heavy, or autoregressive methods that suffer from error accumulation and motion discontinuities, our approach supports efficient frame-by-frame synthesis while maintaining long-term visual and temporal coherence on consumer-grade hardware.

Our method introduces three key innovations:

  • Chunk-wise causal generation with hybrid memory: We preserve global identity by caching the KV states of the reference image, while modeling local dynamics with sliding window attention for efficient temporal coherence (a minimal cache-layout sketch follows this list).
  • Temporal knot module: By overlapping adjacent video chunks and propagating spatio-temporal cues via image-to-video conditioning, we smooth transitions and reduce motion jitter at chunk boundaries (see the second sketch below).
  • Global context running ahead: During inference, we dynamically update the temporal coordinate of the reference frame to keep its semantic context ahead of the current generation step, enabling stable long-term rollout (also illustrated in the second sketch below).
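
Below is a minimal, self-contained sketch of the hybrid-memory idea from the first bullet. It is not the paper's code: the tensor shapes, the `WINDOW` size, and the use of `torch.nn.functional.scaled_dot_product_attention` are illustrative assumptions. It only shows how a fixed reference-image KV block can be concatenated with a sliding window of recent frame KV states at every rollout step.

```python
# Minimal sketch (assumptions, not the authors' implementation): chunk-wise
# causal attention that always attends to a cached reference-image KV block
# for identity, plus a sliding window of recent frame KV states for local
# temporal coherence.
import torch
import torch.nn.functional as F

D = 64        # per-head feature dimension (assumed)
WINDOW = 16   # number of recent frames kept in the sliding window (assumed)

# Reference image tokens: encoded once, their K/V cached for the whole rollout.
ref_k = torch.randn(1, 1, 8, D)   # (batch, heads, ref_tokens, dim)
ref_v = torch.randn(1, 1, 8, D)

win_k, win_v = [], []             # sliding window of per-frame K/V states

def generate_frame(step):
    """One rollout step: attend to [cached reference KV | windowed frame KV]."""
    q = torch.randn(1, 1, 4, D)       # current frame's query tokens (dummy)
    k_new = torch.randn(1, 1, 4, D)   # K/V produced for this frame (dummy)
    v_new = torch.randn(1, 1, 4, D)

    # Global identity memory (reference) + local temporal memory (window).
    k = torch.cat([ref_k] + win_k + [k_new], dim=2)
    v = torch.cat([ref_v] + win_v + [v_new], dim=2)
    out = F.scaled_dot_product_attention(q, k, v)

    # Slide the window: append this frame, evict anything beyond WINDOW frames.
    win_k.append(k_new)
    win_v.append(v_new)
    if len(win_k) > WINDOW:
        win_k.pop(0)
        win_v.pop(0)
    return out

for t in range(64):                   # "infinite" rollout, truncated for the demo
    frame_features = generate_frame(t)
```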

Together, these designs enable Knot Forcing to deliver real-time, high-quality portrait animation over infinite sequences with strong visual stability and responsiveness.
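
The second sketch covers the temporal knot and the running-ahead mechanism. The chunk size, overlap length, lead margin, and the `denoise_chunk` placeholder are all assumptions made for illustration; the point is the bookkeeping: each new chunk re-denoises the overlapped "knot" frames conditioned on the previous chunk's tail, and the reference frame's temporal coordinate is refreshed each chunk so it stays ahead of the rollout position, matching the paper's description of keeping its semantic context ahead of the current frame.

```python
# Minimal sketch (assumptions, not the paper's implementation) of chunk
# overlap ("temporal knot") and the running-ahead reference coordinate.
CHUNK = 8     # frames generated per chunk (assumed)
OVERLAP = 2   # trailing frames of the previous chunk reused as the knot (assumed)
LEAD = 4      # how far the reference frame's time index stays ahead (assumed)

def denoise_chunk(cond_frames, start_t, ref_time_index):
    """Placeholder for the autoregressive diffusion rollout of one chunk.

    cond_frames:    overlapped tail of the previous chunk, used as
                    image-to-video conditioning to smooth the boundary.
    start_t:        temporal coordinate of the first frame in this chunk.
    ref_time_index: temporal coordinate assigned to the reference frame;
                    keeping it > start_t + CHUNK is the "running ahead" idea.
    """
    return [f"frame@t={start_t + i}" for i in range(CHUNK)]

prev_tail = []   # last OVERLAP frames of the previous chunk (the knot)
timeline = []    # emitted frames
t = 0

for chunk_idx in range(4):                      # truncated "infinite" stream
    # Start OVERLAP frames early (except for the very first chunk) so the new
    # chunk re-denoises the knot frames conditioned on the previous tail.
    start_t = t - len(prev_tail)

    # Running ahead: refresh the reference frame's temporal coordinate every
    # chunk so its semantic context sits LEAD steps past the newest frame.
    ref_time_index = start_t + CHUNK + LEAD

    frames = denoise_chunk(prev_tail, start_t, ref_time_index)

    # Emit only the frames past the overlap region to avoid duplicates.
    timeline.extend(frames[len(prev_tail):])
    prev_tail = frames[-OVERLAP:]               # knot for the next chunk
    t = start_t + CHUNK

print(len(timeline), "frames, last:", timeline[-1])
```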

Project page: this url
