eess.AS — arXiv2

Oct 27, 2022 Multimodal Transformer Distillation for Audio-Visual Synchronization

Oct 18, 2022 Simple and Effective Unsupervised Speech Translation

Oct 3, 2022 Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection

Aug 28, 2022 Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Aug 16, 2022 Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition

Jul 29, 2022 Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition

Jul 20, 2022 Diffsound: Discrete Diffusion Model for Text-to-sound Generation

Jun 7, 2022 LegoNN: Building Modular Encoder-Decoder Models

Jun 3, 2022 Constraining Gaussian processes for physics-informed acoustic emission mapping

May 30, 2022 StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

May 16, 2022 PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification

May 8, 2022 Silence is Sweeter Than Speech: Self-Supervised Model Using Silence to Store Speaker Information

May 6, 2022 Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

May 3, 2022 i-Code: An Integrative and Composable Multimodal Learning Framework

Apr 26, 2022 Reformulating Speaker Diarization as Community Detection With Emphasis On Topological Structure

Apr 25, 2022 Parallel Synthesis for Autoregressive Speech Generation

Apr 25, 2022 Graph Convolutional Network Based Semi-Supervised Learning on Multi-Speaker Meeting Data

Apr 22, 2022 Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

Apr 8, 2022 Transducer-based language embedding for spoken language identification

Apr 1, 2022 Universal Adaptor: Converting Mel-Spectrograms Between Different Configurations for Speech Synthesis