"au:"Hongsheng Li"" — arXiv2 Search

Showing 1–20 of 34 results

Apr 23, 2026  Context Unrolling in Omni Models
Mar 21, 2026  ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework
Mar 10, 2026  InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
Oct 30, 2025  Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Jul 19, 2025  Docopilot: Improving Multimodal Models for Document-Level Understanding
Jun 5, 2025  MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
May 21, 2025  UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning
Mar 13, 2025  CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
Feb 13, 2025  MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Oct 9, 2024  Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology
May 29, 2024  Enhancing Vision-Language Model with Unmasked Token Alignment
May 23, 2024  TerDiT: Ternary Diffusion Models with Transformers
Feb 8, 2024  SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Nov 13, 2023  SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Jun 15, 2023  Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Apr 28, 2023  LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Mar 9, 2023  Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
Dec 14, 2022  ConQueR: Query Contrast Voxel-DETR for 3D Object Detection
Aug 6, 2022  Frozen CLIP Models are Efficient Video Learners
Jun 27, 2022  ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning