Showing 1–20 of 25 results
Jan 11, 2017 - Attention-Based Multimodal Fusion for Video Description
Jun 27, 2023 - Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos
Oct 13, 2021 - Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning
Jun 21, 2018 - End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features
Sep 23, 2020 - Multi-Pass Transformer for Machine Translation
Aug 4, 2021 - Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers
Jun 22, 2017 - End-to-end Conversation Modeling Track in DSTC6
Oct 16, 2023 - Generation or Replication: Auscultating Audio Latent Diffusion Models
Sep 29, 2025 - SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Oct 30, 2023 - Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction
Nov 14, 2019 - The Eighth Dialog System Technology Challenge
Jul 8, 2020 - Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Apr 19, 2021 - Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers
Jan 11, 2019 - Dialog System Technology Challenge 7
Jan 3, 2020 - Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering
Feb 18, 2022 - (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
Jan 25, 2019 - Audio-Visual Scene-Aware Dialog
Feb 27, 2024 - NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization
Nov 21, 2025 - Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Jun 18, 2025 - Factorized RVQ-GAN For Disentangled Speech Tokenization