Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures

/ Authors

/ Abstract

Question Answering (Q&A) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented Q&A. In this survey, we review recent advancements in Q&A systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware Q&A systems that leverage multimedia data.

Journal: Proceedings of the 2nd ACM Workshop in AI-powered Question & Answering Systems

DOI: 10.1145/3746274.3760393