
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

Tuan Duc Ngo1   Jiahui Huang2   Seoung Wug Oh2   Kevin Blackburn-Matzen2  
Evangelos Kalogerakis1,3   Chuang Gan1   Joon-Young Lee2

1UMass Amherst     2Adobe Research     3TU Crete

CVPR 2026

Paper | Project Page

DAGE delivers accurate, consistent 3D geometry and fine-grained, high-resolution depth maps while remaining efficient and scalable.

Overview

DAGE is a dual-stream transformer that disentangles global coherence from fine detail for geometry estimation from uncalibrated multi-view/video inputs.

  • LR stream builds view-consistent representations and estimates cameras efficiently.
  • HR stream preserves sharp boundaries and fine structures per-frame.
  • Lightweight adapter fuses the two via cross-attention without disturbing the pretrained single-frame pathway.
  • Scales resolution and clip length independently, supports inputs up to 2K, and achieves state-of-the-art on video geometry estimation and multi-view reconstruction.
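
The cross-attention fusion described above can be sketched as follows. This is a minimal illustration, not DAGE's released implementation: the module name, dimensions, and residual design are our assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Illustrative adapter: per-frame HR tokens attend to view-consistent
    LR tokens. Names and dimensions are hypothetical, not DAGE's actual code."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hr_tokens: torch.Tensor, lr_tokens: torch.Tensor) -> torch.Tensor:
        # Residual add keeps the pretrained single-frame (HR) pathway intact:
        # the adapter only injects LR context on top of it.
        q = self.norm(hr_tokens)
        fused, _ = self.attn(q, lr_tokens, lr_tokens)
        return hr_tokens + fused

hr = torch.randn(1, 3600, 256)  # HR stream tokens for one frame (3600-token budget)
lr = torch.randn(1, 324, 256)   # LR stream tokens (e.g., 252/14 = 18 -> 18*18 patches)
out = CrossAttentionAdapter()(hr, lr)
assert out.shape == hr.shape
```

Because the fusion is additive, the single-frame pathway's behavior is preserved when the attention output is small, which is one plausible way to avoid disturbing the pretrained weights.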

Updates

  • [TBD] Initial release with inference code and model checkpoint.

Quick Start

1. Clone & Install Dependencies

git clone https://github.com/ngoductuanlhp/DAGE.git
cd DAGE

bash scripts/instal_env.sh
conda activate dage

This creates a conda environment with Python 3.10, PyTorch 2.10.0 (CUDA 13.0), and all required dependencies.

2. Download Checkpoints

Download the model checkpoint and place it in the checkpoints/ directory:

mkdir -p checkpoints
# Download from Hugging Face (TBD)
gdown --fuzzy https://drive.google.com/file/d/1BsBJ7MTarlBP5RjCVfPQoQMsCxccBabF/view?usp=sharing -O ./checkpoints/

3. Run Inference

Run on the included demo data or your own video/image folder:

# Run with default settings on demo data
bash scripts/infer/infer_dage.sh

# Or run directly with custom arguments

# Default: LR at 252px, HR at 3600 tokens (~840x840 for square images)
python inference/infer_dage.py --checkpoint checkpoints/model.pt

# Higher LR resolution (better camera poses, more compute)
python inference/infer_dage.py --checkpoint checkpoints/model.pt --lr_max_size 518

# Higher HR resolution up to 2K (sharper pointmaps)
python inference/infer_dage.py --checkpoint checkpoints/model.pt --hr_max_size 1920

# Memory-efficient chunking for GPUs with <40GB VRAM (lower chunk_size if OOM)
python inference/infer_dage.py --checkpoint checkpoints/model.pt --hr_max_size 1920 --chunk_size 8

Arguments:

| Argument | Default | Description |
|---|---|---|
| --checkpoint | checkpoints/model.pt | Path to model checkpoint |
| --output_dir | quali_results/dage | Directory to save results |
| --lr_max_size | 252 | Max resolution for the LR stream |
| --hr_max_size | None | Max resolution for the HR stream (auto-computed from 3600 tokens if not set) |
| --chunk_size | None | Chunk size for the HR stream (enables memory-efficient chunked inference) |

Input: Place videos (.mp4, .MOV) or image folders in assets/demo_data/.

Output: For each input, the script saves:

  • <name>_disp_colored.mp4 – colorized disparity video
  • <name>_depth_colored.mp4 – colorized depth video
  • <name>.npy – dictionary with pointmap, pointmap_global, pointmap_mask, rgb, and extrinsics
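
The saved dictionary can be reloaded with NumPy. A minimal sketch, assuming the .npy file stores a pickled dict with the keys listed above (exact shapes depend on your input):

```python
import numpy as np

def load_dage_result(path: str) -> dict:
    """Load the pickled dict saved by the inference script.

    Expected keys (per the output list above): pointmap, pointmap_global,
    pointmap_mask, rgb, extrinsics.
    """
    return np.load(path, allow_pickle=True).item()

# Example usage (the file name here is illustrative):
# data = load_dage_result('quali_results/dage/my_clip.npy')
# pts = data['pointmap_global'][data['pointmap_mask'].astype(bool)]  # (M, 3) cloud
```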

Detailed Usage

Model Input & Output

  • Input: torch.Tensor of shape (B, N, 3, H, W) with pixel values in [0, 1].
  • Output: a dict with the following keys:

| Key | Shape | Description |
|---|---|---|
| local_points | (B, N, H, W, 3) | Per-view 3D point maps in local camera space |
| conf | (B, N, H, W, 1) | Confidence logits (apply torch.sigmoid() for probabilities) |
| camera_poses | (B, N, 4, 4) | Camera-to-world transformation matrices (OpenCV convention) |
| metric_scale | (B, 1) | Predicted metric scale factor |
| global_points | (B, N, H, W, 3) | 3D points in world space (after infer()) |
| mask | (B, N, H, W) | Binary confidence mask (after infer()) |
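
A short, self-contained sketch of post-processing these raw outputs, using dummy tensors in the documented shapes; the 0.5 confidence threshold is our assumption, not a documented default:

```python
import torch

# Dummy outputs with the documented shapes (B=1, N=2 views, 32x32 resolution).
conf = torch.randn(1, 2, 32, 32, 1)       # confidence logits
prob = torch.sigmoid(conf)                # logits -> probabilities in [0, 1]
mask = prob.squeeze(-1) > 0.5             # (B, N, H, W) boolean mask; 0.5 is an
                                          # assumed threshold, tune as needed

poses = torch.eye(4).repeat(1, 2, 1, 1)   # (B, N, 4, 4) camera-to-world (OpenCV)
R = poses[..., :3, :3]                    # rotation: camera axes in world coords
t = poses[..., :3, 3]                     # translation: camera center in world coords
```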

Example Code Snippet

import torch
from dage.models.dage import DAGE
from dage.utils.data_utils import read_video

# --- Setup ---
device = 'cuda'
model = DAGE.from_pretrained('checkpoints/model.pt').to(device).eval()

# --- Load Data ---
# read_video returns (frames, H, W, fps)
# Options: stride=N, max_frames=N, force_num_frames=N
video, H, W, fps = read_video('path/to/video.mp4', stride=10, max_frames=100)

# Prepare tensors (B, N, C, H, W), values in [0, 1]
from einops import rearrange
import torch.nn.functional as F

lr_video = ...  # resize to LR resolution (multiples of 14)
hr_video = ...  # resize to HR resolution (multiples of 14)

lr_video = rearrange(torch.from_numpy(lr_video), 't h w c -> 1 t c h w').float().to(device) / 255.0
hr_video = rearrange(torch.from_numpy(hr_video), 't h w c -> 1 t c h w').float().to(device) / 255.0

# --- Inference ---
with torch.no_grad():
    output = model.infer(
        hr_video=hr_video,
        lr_video=lr_video,
        lr_max_size=252,
        chunk_size=None,  # optional, for memory efficiency
    )

# Access outputs
local_points = output['local_points']   # (N, H, W, 3)
global_points = output['global_points'] # (N, H, W, 3)
camera_poses = output['camera_poses']   # (N, 4, 4)
mask = output['mask']                   # (N, H, W)
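
Since camera_poses are camera-to-world matrices, local_points can in principle be lifted into the shared world frame. A sketch with dummy tensors; infer() already returns global_points, so this is purely illustrative of how the two point maps relate:

```python
import torch

# Dummy data in the documented shapes: N views of HxW local pointmaps.
N, H, W = 2, 16, 16
local_points = torch.randn(N, H, W, 3)       # camera-space points
camera_poses = torch.eye(4).repeat(N, 1, 1)  # camera-to-world (identity here)

R = camera_poses[:, :3, :3]                  # (N, 3, 3) rotations
t = camera_poses[:, :3, 3]                   # (N, 3) camera centers

# world = R @ p_local + t, applied per view across the HxW grid
world = torch.einsum('nij,nhwj->nhwi', R, local_points) + t[:, None, None, :]
```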

Resolution Handling

Both streams require resolutions that are multiples of the patch size (14). The HR stream defaults to 3600 tokens total (e.g., 840x840 for square images, 630x1120 for 9:16), but can be overridden with --hr_max_size.
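
A hypothetical helper showing how such a token-budgeted, aspect-preserving resolution could be computed; the released inference code may round differently:

```python
import math

PATCH = 14  # ViT patch size

def fit_resolution(h: int, w: int, max_tokens: int = 3600) -> tuple[int, int]:
    """Scale (h, w) so that roughly max_tokens patches fit, with both sides
    rounded to multiples of 14. Illustrative only, not infer_dage.py's logic."""
    scale = math.sqrt(max_tokens * PATCH * PATCH / (h * w))
    H = max(PATCH, round(h * scale / PATCH) * PATCH)
    W = max(PATCH, round(w * scale / PATCH) * PATCH)
    return H, W

print(fit_resolution(1080, 1080))  # (840, 840):  60 * 60 = 3600 tokens
print(fit_resolution(1920, 1080))  # (1120, 630): 80 * 45 = 3600 tokens
```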

Visualization

We use viser for interactive 3D point cloud visualization. The inference script saves .npy files that can be directly visualized.

Dynamic scenes – renders pointmaps sequentially with playback controls (timestep slider, play/pause, FPS control):

python visualization/vis_pointmaps.py --data_path quali_results/dage/<name>.npy

Static scenes – merges all frames into a single point cloud in a shared coordinate frame:

python visualization/vis_pointmaps_all.py --data_path quali_results/dage/<name>.npy

Both scripts launch a viser server (default port 7891) accessible via browser. Common options:

| Argument | Default | Description |
|---|---|---|
| --downsample_ratio | 1 | Spatial downsampling for faster rendering |
| --point_size | 0.002 / 0.01 | Point size in the viewer |
| --scale_factor | 1.0 | Scale the point cloud |
| --sample_num | all | Uniformly sample N frames |
| --port | 7891 | Viser server port |

Training

See docs/TRAINING.md for detailed instructions on data preparation, loss functions, and configuration.

Evaluation

See docs/EVALUATION.md for detailed instructions.

Project Structure

DAGE/
├── assets/
│   └── demo_data/                  # Demo videos for inference
├── configs/
│   └── model_config_dage.yaml      # Model architecture config
├── dage/                           # Main package
│   ├── models/
│   │   ├── dage.py                 # DAGE model
│   │   ├── dinov2/                 # DINOv2 backbone
│   │   ├── layers/                 # Transformer blocks, attention, camera head
│   │   └── moge/                   # MoGe encoder components
│   └── utils/                      # Geometry, visualization, data loading
├── evaluation/                     # Benchmark evaluation
├── inference/
│   └── infer_dage.py               # Main inference script
├── scripts/
│   ├── eval/                       # Evaluation bash scripts
│   ├── infer/                      # Inference bash scripts
│   └── instal_env.sh               # Environment setup
├── setup.py
├── third_party/                    # Code for related work (VGGT, Pi3, Cut3r, etc.)
└── training/
    ├── dataloaders/                # Video dataloaders & dataset configs
    ├── loss/                       # Loss functions
    ├── train_dage_stage{1,2,3}.py  # Three-stage training scripts
    └── training_configs/           # YAML training configs

Acknowledgements

Our work builds upon several open-source projects.

Citation

If you find our work useful, please consider citing:

@inproceedings{ngo2026dage,
  title={DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation},
  author={Ngo, Tuan Duc and Huang, Jiahui and Oh, Seoung Wug and Blackburn-Matzen, Kevin and Kalogerakis, Evangelos and Gan, Chuang and Lee, Joon-Young},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

License

TBD
