Date / Name
May 22, 2023 / VDT: General-purpose Video Diffusion Transformers via Mask Modeling
Oct 17, 2024 / Exploring the Design Space of Visual Context Representation in Video MLLMs
Apr 15, 2022 / COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Sep 23, 2022 / LGDN: Language-Guided Denoising Network for Video-Language Modeling
Mar 17, 2025 / Efficient Motion-Aware Video MLLM
Oct 11, 2024 / Baichuan-Omni Technical Report
Aug 27, 2019 / Mobile Video Action Recognition
Dec 10, 2019 / Learning Depth-Guided Convolutions for Monocular 3D Object Detection
Mar 24, 2021 / Learning Versatile Neural Architectures by Propagating Network Codes
Jun 14, 2021 / Pre-Trained Models: Past, Present and Future
Mar 11, 2021 / WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
Oct 27, 2021 / Towards artificial general intelligence via a multimodal foundation model
Feb 13, 2023 / UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
Jun 20, 2024 / Towards Event-oriented Long Video Understanding
Oct 21, 2024 / Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Jun 13, 2024 / Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
Feb 18, 2025 / Baichuan-M1: Pushing the Medical Capability of Large Language Models
Jan 26, 2025 / Baichuan-Omni-1.5 Technical Report
Jan 3, 2025 / Virgo: A Preliminary Exploration on Reproducing o1-like MLLM