- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23
Collections
Collections including paper arxiv:2504.17343
- WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens
  Paper • 2401.09985 • Published • 18
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
  Paper • 2401.09962 • Published • 9
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
  Paper • 2401.10404 • Published • 10
- ActAnywhere: Subject-Aware Video Background Generation
  Paper • 2401.10822 • Published • 13
- Motion2Language, unsupervised learning of synchronized semantic motion segmentation
  Paper • 2310.10594 • Published • 1
- StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
  Paper • 2507.07803 • Published
- Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation
  Paper • 2407.03809 • Published
- TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
  Paper • 2504.17343 • Published • 13
- Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
  Paper • 2504.02821 • Published • 9
- TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
  Paper • 2504.17343 • Published • 13
- ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting
  Paper • 2504.15921 • Published • 7
- Causal-Copilot: An Autonomous Causal Analysis Agent
  Paper • 2504.13263 • Published • 7