Jul 10, 2024  How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
May 21, 2025  STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Mar 20, 2022  Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation
Jul 10, 2024  EA-VTR: Event-Aware Video-Text Retrieval
Apr 8, 2026  Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation
Mar 31, 2022  CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation
Aug 10, 2021  Natural Language Processing with Commonsense Knowledge: A Survey
Sep 26, 2024  E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
Jun 12, 2024  Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey
Nov 22, 2024  mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
Dec 11, 2025  From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Dec 15, 2025  MMhops-R1: Multimodal Multi-hop Reasoning
Sep 22, 2025  UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Apr 7, 2026  Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
Oct 1, 2025  Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation
Feb 17, 2025  iMOVE: Instance-Motion-Aware Video Understanding
May 23, 2025  DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval