On the Use of Self-Supervised Representation Learning for Speaker Diarization and Separation

eess.AS

/ Authors

Séverin Baroudi, Hervé Bredin, Joseph Razik, Ricard Marxer

/ Abstract

Over the past few years, self-supervised speech models such as wav2vec 2.0 and WavLM have been shown to significantly improve performance on many downstream speech tasks, especially in low-resource settings. Despite this, evaluations on tasks such as Speaker Diarization and Speech Separation remain limited. This paper investigates the quality of recent self-supervised speech representations on these two speaker identity-related tasks and highlights gaps in the current literature that stem from limitations in existing benchmarks, particularly the lack of diversity in evaluation datasets and of variety in the downstream systems associated with both diarization and separation.