Date / Name
Dec 29, 2025 / MiMo-Audio: Audio Language Models are Few-Shot Learners
Dec 12, 2025 / Processing through encoding: Quantum circuit approaches for point-wise multiplication and convolution
Dec 7, 2025 / Multi-Accent Mandarin Dry-Vocal Singing Dataset: Benchmark for Singing Accent Recognition
Dec 7, 2025 / Singing Timbre Popularity Assessment Based on Multimodal Large Foundation Model
Nov 12, 2025 / Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation
Nov 9, 2025 / We Can Hear You with mmWave Radar! An End-to-End Eavesdropping System
Sep 15, 2025 / Fun-ASR Technical Report
Sep 9, 2025 / VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
Sep 5, 2025 / Layer-wise Analysis for Quality of Multilingual Synthesized Speech
Aug 5, 2025 / When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
Aug 1, 2025 / AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
Jul 23, 2025 / Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
Jul 17, 2025 / Voxtral
Jun 24, 2025 / Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation
Jun 1, 2025 / CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching
May 23, 2025 / Source Separation of Small Classical Ensembles: Challenges and Opportunities
May 20, 2025 / FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
May 19, 2025 / MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Apr 1, 2025 / A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
Mar 17, 2025 / Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment