Showing 1–20 of 38 results
Date / Name
Apr 8, 2022 / From PHY to QoE: A Parameterized Framework Design
Feb 15, 2020 / UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Nov 24, 2021 / NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
Mar 29, 2023 / TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
Jan 21, 2024 / Exploring Diffusion Time-steps for Unsupervised Representation Learning
Dec 20, 2023 / ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
May 2, 2020 / A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos
Jun 17, 2025 / PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
Sep 16, 2020 / Tag and Correct: Question aware Open Information Extraction with Two-stage Decoding
Apr 18, 2021 / CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Aug 5, 2021 / Hybrid Reasoning Network for Video-based Commonsense Captioning
Jun 14, 2023 / AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Nov 11, 2024 / Explore the Reasoning Capability of LLMs in the Chess Testbed
Feb 7, 2026 / Pull Requests as a Training Signal for Repo-Level Code Editing
May 24, 2018 / R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering
Dec 22, 2023 / Voila-A: Aligning Vision-Language Models with User's Gaze Attention
Jul 10, 2023 / KU-DMIS-MSRA at RadSum23: Pre-trained Vision-Language Model for Radiology Report Summarization
Oct 28, 2023 / EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images
Sep 22, 2022 / CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
Dec 19, 2022 / MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering