Showing 1–20 of 39 results
Date          Name
Jun 6, 2019   Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
Feb 9, 2024   ContPhy: Continuum Physical Concept Learning and Reasoning from Videos
Aug 2, 2024   Compositional Physical Reasoning of Objects and Events from Videos
Jan 25, 2020  Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video
Nov 8, 2023   GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs
Oct 10, 2023  TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
Oct 28, 2021  Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language
Sep 2, 2021   Deep Face Video Inpainting via UV Mapping
Jan 12, 2023  See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
Apr 6, 2023   Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention
Jun 7, 2023   ModuleFormer: Modularity Emerges from Mixture-of-Experts
Jul 29, 2024  FlexAttention for Efficient High-Resolution Vision-Language Models
Apr 22, 2025  Vidi: Large Multimodal Models for Video Understanding and Editing
Jul 24, 2023  3D-LLM: Injecting the 3D World into Large Language Models
Mar 9, 2023   Planning with Large Language Models for Code Generation
Oct 9, 2023   SALMON: Self-Alignment with Instructable Reward Models
Oct 11, 2023  Sparse Universal Transformer
Nov 6, 2023   CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
May 15, 2024  SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
Jan 30, 2024  Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble