"au:"Zongyang Ma"" — arXiv2 Search

Showing 1–17 of 17 results

Jul 10, 2024  How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
May 21, 2025  STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Mar 20, 2022  Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation
Jul 10, 2024  EA-VTR: Event-Aware Video-Text Retrieval
Apr 8, 2026  Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation
Mar 31, 2022  CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation
Aug 10, 2021  Natural Language Processing with Commonsense Knowledge: A Survey
Sep 26, 2024  E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
Jun 12, 2024  Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey
Nov 22, 2024  mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
Dec 11, 2025  From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Dec 15, 2025  MMhops-R1: Multimodal Multi-hop Reasoning
Sep 22, 2025  UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Apr 7, 2026  Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
Oct 1, 2025  Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation
Feb 17, 2025  iMOVE: Instance-Motion-Aware Video Understanding
May 23, 2025  DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval