Date  Name
Jun 5, 2022  Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
Jan 28, 2021  VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
Jan 26, 2022  Learning To Recognize Procedural Activities with Distant Supervision
Oct 12, 2019  Context-Gated Convolution
Oct 24, 2019  Towards Train-Test Consistency for Semi-supervised Temporal Action Localization
Jan 6, 2023  In Defense of Structural Symbolic Representation for Video Event-Relation Prediction
Mar 4, 2019  Unsupervised Rank-Preserving Hashing for Large-Scale Image Retrieval
Jan 11, 2019  DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition
Dec 2, 2021  Video-Text Pre-training with Learned Regions
Dec 10, 2019  Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition
Oct 22, 2022  Weakly-Supervised Temporal Article Grounding
Sep 22, 2024  Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses
Jan 24, 2025  PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction
Mar 25, 2023  Supervised Masked Knowledge Distillation for Few-Shot Transformers
Feb 17, 2025  Progress of the TianQin project
Jan 24, 2025  ENTER: Event Based Interpretable Reasoning for VideoQA
May 27, 2023  Non-Sequential Graph Script Induction via Multimedia Grounding
Jun 19, 2024  Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Jan 10, 2024  Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Apr 7, 2023  Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering