A novel multimodal dynamic fusion network for disfluency detection in spoken utterances

/ Authors

Sreyan Ghosh, Utkarsh Tyagi, Sonal Kumar, Manan Suri, R. Shah

/ Abstract

Disfluency, though originating from human spoken utterances, is primarily studied as a unimodal text-based Natural Language Processing (NLP) task. In this paper, we propose a novel multimodal architecture for disfluency detection from individual utterances, based on early fusion and self-attention-based multimodal interaction between the text and acoustic modalities. Our architecture uses a multimodal dynamic fusion network that adds minimal parameters over an existing text encoder commonly used in prior art, allowing it to exploit the prosodic and acoustic cues hidden in speech. Through experiments, we show that our proposed model achieves state-of-the-art results on the widely used English Switchboard for disfluency detection and outperforms prior unimodal and multimodal systems in the literature by a significant margin. In addition, we present a thorough qualitative analysis and show that, unlike text-only systems, which suffer from spurious correlations in the data, our system overcomes this problem through additional cues from speech signals. We make all our code publicly available on GitHub.

Journal: ArXiv

DOI: 10.48550/arXiv.2211.14700