Date / Name
Apr 23, 2026 / Context Unrolling in Omni Models
Mar 21, 2026 / ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework
Mar 10, 2026 / InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
Oct 30, 2025 / Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Jul 19, 2025 / Docopilot: Improving Multimodal Models for Document-Level Understanding
Jun 5, 2025 / MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
May 21, 2025 / UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning
Mar 13, 2025 / CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
Feb 13, 2025 / MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Oct 9, 2024 / Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology
May 29, 2024 / Enhancing Vision-Language Model with Unmasked Token Alignment
May 23, 2024 / TerDiT: Ternary Diffusion Models with Transformers
Feb 8, 2024 / SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Nov 13, 2023 / SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Jun 15, 2023 / Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Apr 28, 2023 / LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Mar 9, 2023 / Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
Dec 14, 2022 / ConQueR: Query Contrast Voxel-DETR for 3D Object Detection
Aug 6, 2022 / Frozen CLIP Models are Efficient Video Learners
Jun 27, 2022 / ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning