Astra: General Interactive World Model with Autoregressive Denoising
[arXiv] [Project Page] [GitHub]
Yixuan Zhu1, Jiaqi Feng1, Wenzhao Zheng1†, Yuan Gao2, Xin Tao2, Pengfei Wan2, Jie Zhou1, Jiwen Lu1
(† Project leader)
1Tsinghua University, 2Kuaishou Technology.
Introduction
TL;DR: Astra is an interactive world model that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.
Astra is an interactive, action-driven world model that predicts long-horizon future videos across diverse real-world scenarios. Built on an autoregressive diffusion transformer with temporal causal attention, Astra supports streaming prediction while preserving strong temporal coherence. Astra introduces noise-augmented history memory to stabilize long rollouts, an action-aware adapter for precise control signals, and a mixture of action experts to route heterogeneous action modalities. Through these key innovations, Astra delivers consistent, controllable, and high-fidelity video futures for applications such as autonomous driving, robot manipulation, and camera motion.
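To make the rollout scheme above concrete, here is a minimal, hypothetical sketch of the autoregressive denoising loop with a noise-augmented history and per-chunk action conditioning. The function names, tensor layouts, and the `model`/`scheduler` interfaces are illustrative assumptions, not the actual Astra or Wan API.

```python
import torch

@torch.no_grad()
def rollout(model, scheduler, first_frame_latent, actions, chunk_size=4, history_noise_std=0.05):
    # first_frame_latent: (1, C, H, W) latent of the conditioning image (hypothetical layout).
    history = [first_frame_latent]
    for action in actions:  # one control signal per generated chunk
        # Noise-augmented history memory: lightly perturb the cached context so the
        # model never conditions on overly clean frames during long rollouts.
        context = torch.cat(history, dim=0)
        context = context + history_noise_std * torch.randn_like(context)

        # Denoise a fresh chunk of latent frames, attending causally to the history
        # and to the current action (expert routing is assumed to happen inside `model`).
        chunk = torch.randn(chunk_size, *first_frame_latent.shape[1:])
        for t in scheduler.timesteps:
            eps = model(chunk, t, context=context, action=action)
            chunk = scheduler.step(eps, t, chunk).prev_sample

        history.append(chunk)  # extend the memory for the next chunk
    return torch.cat(history, dim=0)
```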
Gallery
Astra+Wan2.1
Updates
- [2025.11.17]: Released the project page.
- [2025.12.09]: Released the inference code and model checkpoint.
TODO List
- Release full inference pipelines for additional scenarios:
  - Autonomous driving
  - Robotic manipulation
  - Drone navigation / exploration
- Open-source training scripts:
  - Action-conditioned autoregressive denoising training
  - Multi-scenario joint training pipeline
- Release dataset preprocessing tools
- Provide a unified evaluation toolkit
Run Astra (Inference)
Astra is built upon Wan2.1-1.3B, a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below:
Inference
Step 1: Set up the environment
DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them with the following commands:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
Install DiffSynth-Studio via the Astra repository:
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
Step 2: Download the pretrained checkpoints
- Download the pre-trained Wan2.1 models
cd script
python download_wan2.1.py
- Download the pre-trained Astra checkpoint
Please download the Astra checkpoint from Hugging Face and place it in models/Astra/checkpoints.
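If you prefer a scripted download, the snippet below is one way to fetch the checkpoint with huggingface_hub. The repository id is a placeholder, not the confirmed Astra repo name; replace it with the actual one.

```python
# Hypothetical download helper; replace the repo_id placeholder with the real Astra repository.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/Astra",                 # placeholder, not a confirmed repo id
    local_dir="models/Astra/checkpoints",  # path expected by infer_demo.py (relative to the repo root)
)
```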
Step 3: Test the example image
python infer_demo.py \
--dit_path ../models/Astra/checkpoints/diffusion_pytorch_model.ckpt \
--wan_model_path ../models/Wan-AI/Wan2.1-T2V-1.3B \
--condition_image ../examples/condition_images/garden_1.png \
--cam_type 4 \
--prompt "A sunlit European street lined with historic buildings and vibrant greenery creates a warm, charming, and inviting atmosphere. The scene shows a picturesque open square paved with red bricks, surrounded by classic narrow townhouses featuring tall windows, gabled roofs, and dark-painted facades. On the right side, a lush arrangement of potted plants and blooming flowers adds rich color and texture to the foreground. A vintage-style streetlamp stands prominently near the center-right, contributing to the timeless character of the street. Mature trees frame the background, their leaves glowing in the warm afternoon sunlight. Bicycles are visible along the edges of the buildings, reinforcing the urban yet leisurely feel. The sky is bright blue with scattered clouds, and soft sun flares enter the frame from the left, enhancing the sceneβs inviting, peaceful mood." \
--output_path ../examples/output_videos/output_moe_framepack_sliding.mp4 \
Step 4: Test your own images
To test with your own custom images, prepare the target images and their corresponding text prompts. We recommend that input images be close to 832×480 (width × height), matching the resolution of the generated video; this helps produce better results, and a simple preprocessing sketch is given after the command below. For prompt generation, you can refer to the Prompt Extension section in Wan2.1 for guidance on crafting captions.
python infer_demo.py \
--dit_path path/to/your/dit_ckpt \
--wan_model_path path/to/your/Wan2.1-T2V-1.3B \
--condition_image path/to/your/image \
--cam_type your_cam_type \
--prompt your_prompt \
--output_path path/to/your/output_video
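If your image is far from 832×480, one simple way to preprocess it is to center-crop to the target aspect ratio and then resize. This is a minimal sketch using Pillow; the crop-then-resize policy and the file names are assumptions, adapt them to your content.

```python
# Center-crop the input to the 832:480 aspect ratio, then resize to the target resolution.
from PIL import Image

def prepare_condition_image(path, out_path, size=(832, 480)):
    img = Image.open(path).convert("RGB")
    target_ratio = size[0] / size[1]
    w, h = img.size
    if w / h > target_ratio:                 # too wide: crop the width
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                                    # too tall: crop the height
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    img.resize(size, Image.LANCZOS).save(out_path)

prepare_condition_image("my_image.jpg", "my_image_832x480.png")  # hypothetical file names
```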
We provide several preset camera types, as shown in the table below. You can also generate new trajectories for testing (see the sketch after the table).
| cam_type | Trajectory |
|---|---|
| 1 | Move Forward (Straight) |
| 2 | Rotate Left In Place |
| 3 | Rotate Right In Place |
| 4 | Move Forward + Rotate Left |
| 5 | Move Forward + Rotate Right |
| 6 | S-shaped Trajectory |
| 7 | Rotate Left → Rotate Right |
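As one illustration of building a new trajectory, the sketch below generates a forward-plus-left-turn path as a sequence of 4×4 camera-to-world matrices. The pose convention, frame count, and on-disk format that infer_demo.py actually expects are not specified here, so treat every detail in this snippet as an assumption to adapt.

```python
# Hypothetical trajectory generator: yaw left while moving forward, expressed as
# 4x4 camera-to-world matrices. The format Astra consumes is an assumption here.
import numpy as np

def forward_left_trajectory(num_frames=81, step=0.05, total_yaw_deg=30.0):
    poses = []
    position = np.zeros(3)
    for i in range(num_frames):
        yaw = np.deg2rad(total_yaw_deg * i / max(num_frames - 1, 1))
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[ c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])                 # rotation about the vertical axis
        pose = np.eye(4)
        pose[:3, :3] = R
        pose[:3, 3] = position.copy()
        poses.append(pose)
        position += R @ np.array([0.0, 0.0, step])   # step forward along the current heading
    return np.stack(poses)

np.save("custom_trajectory.npy", forward_left_trajectory())  # hypothetical output file
```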
Future Work
Looking ahead, we plan to further enhance Astra in several directions:
- Training with Wan2.2: Upgrade our model using the latest Wan2.2 framework to release a more powerful version with improved generation quality.
- 3D Spatial Consistency: Explore techniques to better preserve 3D consistency across frames for more coherent and realistic video generation.
- Long-Term Memory: Incorporate mechanisms for long-term memory, enabling the model to handle extended temporal dependencies and complex action sequences.
These directions aim to push Astra towards more robust and interactive video world modeling.
Awesome Related Works
Feel free to explore these outstanding related works, including but not limited to:
ReCamMaster: ReCamMaster re-captures in-the-wild videos with novel camera trajectories.
GCD: GCD synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
ReCapture: a method for generating new videos with novel camera trajectories from a single user-provided video.
Trajectory Attention: Trajectory Attention facilitates various tasks like camera motion control on images and videos, and video editing.
GS-DiT: GS-DiT provides 4D video control for a single monocular video.
Diffusion as Shader: a versatile video generation control model for various tasks.
TrajectoryCrafter: TrajectoryCrafter achieves high-fidelity novel views generation from casually captured monocular video.
GEN3C: a generative video model with precise Camera Control and temporal 3D Consistency.
Citation
Please leave us a star and cite our paper if you find our work helpful.
@misc{zhu2025astrageneralinteractiveworld,
      title={Astra: General Interactive World Model with Autoregressive Denoising},
      author={Yixuan Zhu and Jiaqi Feng and Wenzhao Zheng and Yuan Gao and Xin Tao and Pengfei Wan and Jie Zhou and Jiwen Lu},
      year={2025},
      eprint={2512.08931},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.08931},
}