Oct 11, 2023 · VeCLIP: Improving CLIP Training via Visual-enriched Captions
Oct 3, 2024 · Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
May 20, 2025 · Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
Jul 22, 2024 · SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Jul 18, 2024 · MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Mar 28, 2026 · Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
Sep 30, 2024 · MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Mar 21, 2025 · ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
Oct 3, 2025 · Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
Apr 8, 2026 · VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
Oct 3, 2024 · Contrastive Localized Language-Image Pre-Training
Mar 24, 2025 · SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
May 8, 2025 · StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
May 28, 2024 · Empowering Source-Free Domain Adaptation via MLLM-Guided Reliability-Based Curriculum Learning
Dec 10, 2024 · STIV: Scalable Text and Image Conditioned Video Generation
Feb 5, 2024 · MobilityGPT: Enhanced Human Mobility Modeling with a GPT Model