cs.SD — arXiv2

Mar 17, 2025 Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics

Feb 21, 2025 Retrieval-Augmented Speech Recognition Approach for Domain Challenges

Nov 27, 2024 Music2Fail: Transfer Music to Failed Recorder Style

Nov 11, 2024 Building a Taiwanese Mandarin Spoken Language Model: A First Attempt

Nov 6, 2024 Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

Nov 1, 2024 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Oct 25, 2024 GPT-4o System Card

Oct 17, 2024 STCON System for the CHiME-8 Challenge

Oct 7, 2024 Demo of Zero-Shot Guitar Amplifier Modelling: Enhancing Modeling with Hyper Neural Networks

Sep 20, 2024 Unifying Global and Near-Context Biasing in a Single Trie Pass

Sep 20, 2024 Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper

Sep 16, 2024 StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Sep 13, 2024 DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset

Sep 8, 2024 The first Cadenza challenges: using machine learning competitions to improve music for listeners with a hearing loss

Jul 15, 2024 Towards zero-shot amplifier modeling: One-to-many amplifier modeling via tone embedding control

Jul 5, 2024 Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Jul 5, 2024 TokenVerse: Towards Unifying Speech and NLP Tasks via Transducer-based ASR

Jul 3, 2024 MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

Jun 30, 2024 Improving Real-Time Music Accompaniment Separation with MMDenseNet

Jun 25, 2024 Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet