cs.SD — arXiv2

Sep 28, 2023 Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

Sep 25, 2023 AutoPrep: An Automatic Preprocessing Framework for In-the-Wild Speech Data

Sep 19, 2023 MelodyGLM: Multi-task Pre-training for Symbolic Melody Generation

Sep 18, 2023 RECAP: Retrieval-Augmented Audio Captioning

Aug 14, 2023 The Sound Demixing Challenge 2023 – Cinematic Demixing Track

Aug 5, 2023 Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

Jul 18, 2023 SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs

Jun 27, 2023 3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement

Jun 18, 2023 MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Jun 13, 2023 StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

May 26, 2023 Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model

May 24, 2023 ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

May 19, 2023 Language-universal phonetic encoder for low-resource speech recognition

May 19, 2023 Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

May 14, 2023 REMAST: Real-time Emotion-based Music Arrangement with Soft Transition

Apr 20, 2023 Using Mobile Data and Deep Models to Assess Auditory Verbal Hallucinations

Mar 14, 2023 CAT: Causal Audio Transformer for Audio Classification

Feb 16, 2023 Personalized Audio Quality Preference Prediction

Jan 20, 2023 Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

Dec 29, 2022 StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models