Oct 11, 2023 · VeCLIP: Improving CLIP Training via Visual-enriched Captions
Oct 3, 2024 · Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
May 20, 2025 · Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
Jul 22, 2024 · SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Jul 18, 2024 · MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Mar 28, 2026 · Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
Sep 30, 2024 · MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Mar 21, 2025 · ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
Oct 3, 2025 · Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
Apr 8, 2026 · VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
Oct 3, 2024 · Contrastive Localized Language-Image Pre-Training
Mar 24, 2025 · SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
May 8, 2025 · StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
May 28, 2024 · Empowering Source-Free Domain Adaptation via MLLM-Guided Reliability-Based Curriculum Learning
Dec 10, 2024 · STIV: Scalable Text and Image Conditioned Video Generation
Feb 5, 2024 · MobilityGPT: Enhanced Human Mobility Modeling with a GPT Model