Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction
Abstract
In this study, we present a novel approach to predicting the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Traditional methods for calculating STOI require clean reference speech, which limits their applicability in real-world conditions. To address this, numerous deep learning-based non-intrusive speech quality assessment models have garnered significant interest; many have achieved commendable performance, but there remains room for improvement. We propose a bottleneck transformer that combines convolution blocks for learning frame-level features with a multi-head self-attention (MHSA) layer that aggregates information across frames. These components enable the transformer to focus on the key aspects of the input. Despite having relatively few parameters, the proposed model achieves higher correlation and lower mean squared error in STOI prediction, in both seen and unseen scenarios, than the state-of-the-art model that uses self-supervised learning (SSL) and spectral features as input.
Published in: 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
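
The abstract gives no layer sizes or training details, so the following PyTorch sketch is only an illustrative assumption of how such a pipeline might be wired: convolution blocks extract frame-level features, an MHSA layer aggregates information across frames, and a sigmoid head constrains the output since STOI lies in [0, 1]. The class name `STOIPredictor` and all dimensions are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """2D convolution block for frame-level feature extraction (sizes are illustrative)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # downsample frequency, keep time resolution
        )

    def forward(self, x):
        return self.net(x)

class STOIPredictor(nn.Module):
    """Hypothetical non-intrusive STOI predictor: conv blocks -> MHSA bottleneck -> score."""
    def __init__(self, n_mels=80, d_model=128, n_heads=4):
        super().__init__()
        self.convs = nn.Sequential(ConvBlock(1, 32), ConvBlock(32, 64))
        freq_after = n_mels // 4               # two (2, 1) poolings halve frequency twice
        self.proj = nn.Linear(64 * freq_after, d_model)
        # Bottleneck-transformer-style step: MHSA over time frames in place of convolution.
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())  # STOI is in [0, 1]

    def forward(self, spec):                   # spec: (batch, n_mels, frames)
        x = self.convs(spec.unsqueeze(1))      # (batch, 64, n_mels/4, frames)
        x = x.permute(0, 3, 1, 2).flatten(2)   # (batch, frames, 64 * n_mels/4)
        x = self.proj(x)                       # frame-level features
        attn, _ = self.mhsa(x, x, x)           # MHSA aggregates information across frames
        x = self.norm(x + attn)
        return self.head(x.mean(dim=1)).squeeze(-1)  # utterance-level STOI estimate

model = STOIPredictor()
scores = model(torch.randn(2, 80, 300))        # two utterances, 300 spectrogram frames each
print(scores.shape)                            # torch.Size([2]), one score per utterance
```

The sigmoid head and mean-over-frames pooling are design assumptions here; the actual paper may pool differently or regress the score without an explicit range constraint.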