DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
Tuan Duc Ngo1
Jiahui Huang2
Seoung Wug Oh2
Kevin Blackburn-Matzen2
Evangelos Kalogerakis1,3
Chuang Gan1
Joon-Young Lee2
1UMass Amherst 2Adobe Research 3TU Crete
CVPR 2026
DAGE delivers accurate, view-consistent 3D geometry and fine-grained, high-resolution depth maps while remaining efficient and scalable.
Overview
DAGE is a dual-stream transformer that disentangles global coherence from fine detail for geometry estimation from uncalibrated multi-view/video inputs.
- LR stream builds view-consistent representations and estimates cameras efficiently.
- HR stream preserves sharp boundaries and fine structures per-frame.
- Lightweight adapter fuses the two via cross-attention without disturbing the pretrained single-frame pathway.
- Scales resolution and clip length independently, supports inputs up to 2K, and achieves state-of-the-art on video geometry estimation and multi-view reconstruction.
Updates
- [TBD] Initial release with inference code and model checkpoint.
Quick Start
1. Clone & Install Dependencies
git clone https://github.com/ngoductuanlhp/DAGE.git
cd DAGE
bash scripts/instal_env.sh
conda activate dage
This creates a conda environment with Python 3.10, PyTorch 2.10.0 (CUDA 13.0), and all required dependencies.
2. Download Checkpoints
Download the model checkpoint and place it in the checkpoints/ directory:
mkdir -p checkpoints
# Download from Hugging Face (TBD)
gdown --fuzzy https://drive.google.com/file/d/1BsBJ7MTarlBP5RjCVfPQoQMsCxccBabF/view?usp=sharing -O ./checkpoints/
3. Run Inference
Run on the included demo data or your own video/image folder:
# Run with default settings on demo data
bash scripts/infer/infer_dage.sh
# Or run directly with custom arguments
# Default: LR at 252px, HR at 3600 tokens (~840x840 for square images)
python inference/infer_dage.py --checkpoint checkpoints/model.pt
# Higher LR resolution (better camera poses, more compute)
python inference/infer_dage.py --checkpoint checkpoints/model.pt --lr_max_size 518
# Higher HR resolution up to 2K (sharper pointmaps)
python inference/infer_dage.py --checkpoint checkpoints/model.pt --hr_max_size 1920
# Memory-efficient chunking for GPUs with <40GB VRAM (lower chunk_size if OOM)
python inference/infer_dage.py --checkpoint checkpoints/model.pt --hr_max_size 1920 --chunk_size 8
Arguments:
| Argument | Default | Description |
|---|---|---|
| `--checkpoint` | `checkpoints/model.pt` | Path to model checkpoint |
| `--output_dir` | `quali_results/dage` | Directory to save results |
| `--lr_max_size` | `252` | Max resolution for the LR stream |
| `--hr_max_size` | `None` | Max resolution for the HR stream (auto-computed from 3600 tokens if not set) |
| `--chunk_size` | `None` | Chunk size for HR stream (enables memory-efficient chunked inference) |
Input: Place videos (.mp4, .MOV) or image folders in assets/demo_data/.
Output: For each input, the script saves:
- `<name>_disp_colored.mp4`: colorized disparity video
- `<name>_depth_colored.mp4`: colorized depth video
- `<name>.npy`: dictionary with `pointmap`, `pointmap_global`, `pointmap_mask`, `rgb`, and `extrinsics`
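The saved `.npy` files hold a pickled Python dict, so they can be loaded back with `np.load(..., allow_pickle=True).item()`. A minimal sketch (the helper name and example path are illustrative, not part of the package):

```python
import numpy as np

def load_dage_output(path):
    """Load a DAGE .npy result file.

    The file stores a single pickled dict with keys `pointmap`,
    `pointmap_global`, `pointmap_mask`, `rgb`, and `extrinsics`.
    """
    # np.save wraps a dict in a 0-d object array; .item() unwraps it
    return np.load(path, allow_pickle=True).item()

# e.g. data = load_dage_output("quali_results/dage/<name>.npy")
# data["pointmap_global"]  # per-frame points in a shared world frame
```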
Detailed Usage
Model Input & Output
- Input: `torch.Tensor` of shape `(B, N, 3, H, W)` with pixel values in `[0, 1]`.
- Output: A `dict` with the following keys:
| Key | Shape | Description |
|---|---|---|
| `local_points` | `(B, N, H, W, 3)` | Per-view 3D point maps in local camera space |
| `conf` | `(B, N, H, W, 1)` | Confidence logits (apply `torch.sigmoid()` for probabilities) |
| `camera_poses` | `(B, N, 4, 4)` | Camera-to-world transformation matrices (OpenCV convention) |
| `metric_scale` | `(B, 1)` | Predicted metric scale factor |
| `global_points` | `(B, N, H, W, 3)` | 3D points in world space (after `infer()`) |
| `mask` | `(B, N, H, W)` | Binary confidence mask (after `infer()`) |
Example Code Snippet
```python
import torch
import torch.nn.functional as F
from einops import rearrange

from dage.models.dage import DAGE
from dage.utils.data_utils import read_video

# --- Setup ---
device = 'cuda'
model = DAGE.from_pretrained('checkpoints/model.pt').to(device).eval()

# --- Load Data ---
# read_video returns (frames, H, W, fps)
# Options: stride=N, max_frames=N, force_num_frames=N
video, H, W, fps = read_video('path/to/video.mp4', stride=10, max_frames=100)

# Prepare tensors (B, N, C, H, W), values in [0, 1]
lr_video = ...  # resize to LR resolution (multiples of 14)
hr_video = ...  # resize to HR resolution (multiples of 14)
lr_video = rearrange(torch.from_numpy(lr_video), 't h w c -> 1 t c h w').float().to(device) / 255.0
hr_video = rearrange(torch.from_numpy(hr_video), 't h w c -> 1 t c h w').float().to(device) / 255.0

# --- Inference ---
with torch.no_grad():
    output = model.infer(
        hr_video=hr_video,
        lr_video=lr_video,
        lr_max_size=252,
        chunk_size=None,  # optional, for memory efficiency
    )

# Access outputs
local_points = output['local_points']    # (N, H, W, 3)
global_points = output['global_points']  # (N, H, W, 3)
camera_poses = output['camera_poses']    # (N, 4, 4)
mask = output['mask']                    # (N, H, W)
```
Resolution Handling
Both streams require resolutions that are multiples of the patch size (14). The HR stream defaults to 3600 tokens total (e.g., 840x840 for square images, 630x1120 for 9:16), but can be overridden with --hr_max_size.
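The default HR resolution can be derived from the token budget: with a 14-px patch size, the token count is `(H/14) * (W/14)`, so the largest resolution under the budget follows from the aspect ratio. A sketch of that computation (the helper name is illustrative, not the package API):

```python
import math

def hr_resolution(height, width, max_tokens=3600, patch=14):
    """Largest (H, W), both multiples of `patch`, whose token count
    (H/patch) * (W/patch) stays within `max_tokens` at the given aspect ratio."""
    h_tokens = math.floor(math.sqrt(max_tokens * height / width))
    w_tokens = math.floor(h_tokens * width / height)
    return h_tokens * patch, w_tokens * patch

print(hr_resolution(1080, 1080))  # square input -> (840, 840)
print(hr_resolution(720, 1280))  # 16:9 input -> (630, 1120)
```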
Visualization
We use viser for interactive 3D point cloud visualization. The inference script saves .npy files that can be directly visualized.
Dynamic scenes: renders pointmaps sequentially with playback controls (timestep slider, play/pause, FPS control):
python visualization/vis_pointmaps.py --data_path quali_results/dage/<name>.npy
Static scenes: merges all frames into a single point cloud in a shared coordinate frame:
python visualization/vis_pointmaps_all.py --data_path quali_results/dage/<name>.npy
Both scripts launch a viser server (default port 7891) accessible via browser. Common options:
| Argument | Default | Description |
|---|---|---|
| `--downsample_ratio` | `1` | Spatial downsampling for faster rendering |
| `--point_size` | `0.002` / `0.01` | Point size in the viewer (per script) |
| `--scale_factor` | `1.0` | Scale the point cloud |
| `--sample_num` | all | Uniformly sample N frames |
| `--port` | `7891` | Viser server port |
Training
See docs/TRAINING.md for detailed instructions on data preparation, loss functions, and configuration.
Evaluation
See docs/EVALUATION.md for detailed instructions.
Project Structure
DAGE/
├── assets/
│   └── demo_data/                  # Demo videos for inference
├── configs/
│   └── model_config_dage.yaml      # Model architecture config
├── dage/                           # Main package
│   ├── models/
│   │   ├── dage.py                 # DAGE model
│   │   ├── dinov2/                 # DINOv2 backbone
│   │   ├── layers/                 # Transformer blocks, attention, camera head
│   │   └── moge/                   # MoGe encoder components
│   └── utils/                      # Geometry, visualization, data loading
├── evaluation/                     # Benchmark evaluation
├── inference/
│   └── infer_dage.py               # Main inference script
├── scripts/
│   ├── eval/                       # Evaluation bash scripts
│   ├── infer/                      # Inference bash scripts
│   └── instal_env.sh               # Environment setup
├── setup.py
├── third_party/                    # Code for related work (VGGT, Pi3, CUT3R, etc.)
└── training/
    ├── dataloaders/                # Video dataloaders & dataset configs
    ├── loss/                       # Loss functions
    ├── train_dage_stage{1,2,3}.py  # Three-stage training scripts
    └── training_configs/           # YAML configs for training
Acknowledgements
Our work builds upon several open-source projects.
Citation
If you find our work useful, please consider citing:
@inproceedings{ngo2026dage,
title={DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation},
author={Ngo, Tuan Duc and Huang, Jiahui and Oh, Seoung Wug and Blackburn-Matzen, Kevin and Kalogerakis, Evangelos and Gan, Chuang and Lee, Joon-Young},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}
License
TBD