Showing 1–20 of 34 results
Date / Name
Aug 17, 2022 / Multimodal foundation models are better simulators of the human brain
Apr 15, 2022 / COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Sep 23, 2022 / LGDN: Language-Guided Denoising Network for Video-Language Modeling
Mar 8, 2024 / DeepSeek-VL: Towards Real-World Vision-Language Understanding
May 29, 2023 / Speech and noise dual-stream spectrogram refine network with speech distortion loss for robust speech recognition
Mar 17, 2025 / Efficient Motion-Aware Video MLLM
Dec 16, 2025 / HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Nov 20, 2024 / Functional normalizing flow for statistical inverse problems of partial differential equations
Apr 10, 2025 / Kimi-VL Technical Report
Mar 24, 2021 / Learning Versatile Neural Architectures by Propagating Network Codes
Mar 11, 2021 / WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
Oct 27, 2021 / Towards artificial general intelligence via a multimodal foundation model
Jan 25, 2022 / Image Fragile Watermarking Algorithm Based on Deneighborhood Mapping
Nov 2, 2022 / Monolingual Recognizers Fusion for Code-switching Speech Recognition
Feb 13, 2023 / UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
Jun 20, 2024 / Towards Event-oriented Long Video Understanding
Aug 6, 2024 / Characterizing the current systems in the Martian ionosphere
Oct 21, 2024 / Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Jun 13, 2024 / Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
Jan 10, 2026 / BabyVision: Visual Reasoning Beyond Language