LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Paper • 2601.10129
Computer Vision
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs