cs.MM — arXiv2

Mar 25, 2025 AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

Jan 3, 2025 Robust Self-Paced Hashing for Cross-Modal Retrieval with Noisy Labels

Dec 6, 2024 LinVT: Empower Your Image-level Large Language Model to Understand Videos

Nov 22, 2024 Health AI Developer Foundations

Oct 2, 2024 Harnessing the Latent Diffusion Model for Training-Free Image Style Transfer

Sep 20, 2024 ChemDFM-X: Towards Large Multimodal Model for Chemistry

Aug 3, 2024 SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses

Jul 31, 2024 Open-Vocabulary Audio-Visual Semantic Segmentation

Jul 3, 2024 MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

Jul 2, 2024 To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models

Jun 21, 2024 EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot

Jun 2, 2024 Once-for-All: Controllable Generative Image Compression with Dynamic Granularity Adaptation

May 29, 2024 Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval

May 24, 2024 Looking Backward: Streaming Video-to-Video Translation with Feature Banks

May 23, 2024 Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Apr 29, 2024 G-Refine: A General Quality Refiner for Text-to-Image Generation

Apr 21, 2024 Counterfactual Reasoning Using Predicted Latent Personality Dimensions for Optimizing Persuasion Outcome

Mar 26, 2024 Panonut360: A Head and Eye Tracking Dataset for Panoramic Video

Mar 18, 2024 QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation

Mar 1, 2024 An Experimental Study of Low-Latency Video Streaming over 5G