Showing 21–38 of 38 results
Date / Name
Nov 3, 2022 / Video Event Extraction via Tracking Visual States of Arguments
Jun 14, 2022 / Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities
Sep 27, 2021 / Joint Multimedia Event Extraction from Video and Article
Mar 23, 2021 / Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos
Jan 13, 2022 / CLIP-Event: Connecting Text and Images with Event Structures
Nov 20, 2023 / InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models
Oct 9, 2022 / Learning to Decompose Visual Features with Latent Textual Prompts
Dec 20, 2021 / MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
Mar 14, 2022 / All in One: Exploring Unified Video-Language Pre-training
Dec 1, 2021 / Object-aware Video-language Pre-training for Retrieval
Jun 16, 2024 / Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies
Dec 1, 2023 / Video Summarization: Towards Entity-Aware Captions
Dec 28, 2022 / TempCLR: Temporal Alignment Representation with Contrastive Learning
Mar 3, 2024 / SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos
May 22, 2022 / Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Mar 15, 2022 / Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Apr 18, 2024 / BLINK: Multimodal Large Language Models Can See but Not Perceive
Jul 7, 2025 / Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities