Date          Name
Mar 26, 2026  Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
Mar 21, 2026  ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework
Mar 10, 2026  InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
Oct 27, 2025  EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
Oct 14, 2025  MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
Oct 13, 2025  Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Sep 29, 2025  Learning Goal-Oriented Vision-and-Language Navigation with Self-Improving Demonstrations at Scale
Sep 26, 2025  MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Aug 25, 2025  InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
May 30, 2025  Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Jan 14, 2025  Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
Jul 10, 2024  VEnhancer: Generative Space-Time Enhancement for Video Generation
Jun 26, 2024  EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation
Jun 12, 2024  OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Mar 26, 2024  InternLM2 Technical Report
Mar 22, 2024  InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
Mar 11, 2024  VideoMamba: State Space Model for Efficient Video Understanding
Feb 29, 2024  WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset
Feb 8, 2024   SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Nov 13, 2023  SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models