eess.AS — arXiv

Sep 15, 2025 | Fun-ASR Technical Report

Sep 9, 2025 | VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

Sep 5, 2025 | Layer-wise Analysis for Quality of Multilingual Synthesized Speech

Aug 10, 2025 | Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models

Aug 5, 2025 | When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Aug 1, 2025 | AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

Aug 1, 2025 | Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition

Jul 23, 2025 | Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice

Jul 20, 2025 | DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis

Jul 17, 2025 | Voxtral

Jun 24, 2025 | Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Jun 23, 2025 | Infant Cry Emotion Recognition Using Improved ECAPA-TDNN with Multiscale Feature Fusion and Attention Enhancement

Jun 1, 2025 | CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

May 23, 2025 | Source Separation of Small Classical Ensembles: Challenges and Opportunities

May 20, 2025 | FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

May 19, 2025 | Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR

May 19, 2025 | MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Mar 17, 2025 | Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics

Feb 21, 2025 | Retrieval-Augmented Speech Recognition Approach for Domain Challenges

Nov 27, 2024 | Music2Fail: Transfer Music to Failed Recorder Style