Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke
Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations on in-domain data, such as changes in speech rate or additive noise, can significantly degrade performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke
Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
Vishwanath Pratap Singh, Shashi Kumar, Ravi Shekhar Jha, Abhishek Pandey
The COVID-19 pandemic has resulted in more than 125 million infections and more than 2.7 million casualties. In this paper, we attempt to classify covid vs non-covid cough sounds using signal processing and deep learning methods. Air turbulence, the vibration of tissues, movement of fluid through airways, opening, and closure of glottis are some of the causes for the production of the acoustic sound signals during cough. Does the COVID-19 alter the acoustic characteristics of breath, cough, and speech sounds produced through the respiratory system? This is an open question waiting for answers. In this paper, we incorporated novel data augmentation methods for cough sound augmentation and multiple deep neural network architectures and methods along with handcrafted features. Our proposed system gives 14% absolute improvement in area under the curve (AUC). The proposed system is developed as part of Interspeech 2021 special sessions and challenges viz. diagnosing of COVID-19 using acoustics (DiCOVA). Our proposed method secured the 5th position on the leaderboard among 29 participants.
Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Iuliia Thorbecke, Esaú Villatoro-Tello, Sergio Burdisso, Petr Motlicek, Karthik Pandia, Aravind Ganapathiraju
In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Our code is publicly available: https://github.com/idiap/tokenverse-unifying-speech-nlp
Andres Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esau Villatoro-Tello, Petr Motlicek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esau Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Iuliia Thorbecke, Petr Motlicek, Manjunath K E, Aravind Ganapathiraju
Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.
Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi
Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance compared to CTC and the more recently proposed Consistency-Regularized CTC, though with a trade-off in ASR performance. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community. Our code is publicly available at: https://github.com/idiap/OTTC
Shashi Kumar, Srikanth Madikeri, Esaú Villatoro-Tello, Sergio Burdisso, Pradeep Rangappa, Andrés Carofilis, Petr Motlicek, Karthik Pandia, Shankar Venkatesan, Kadri Hacioğlu, Andreas Stolcke
Token-based multitasking frameworks like TokenVerse require all training utterances to have labels for all tasks, hindering their ability to leverage partially annotated datasets and scale effectively. We propose TokenVerse++, which introduces learnable vectors in the acoustic embedding space of the XLSR-Transducer ASR model for dynamic task activation. This core mechanism enables training with utterances labeled for only a subset of tasks, a key advantage over TokenVerse. We demonstrate this by successfully integrating a dataset with partial labels, specifically for ASR and an additional task, language identification, improving overall performance. TokenVerse++ achieves results on par with or exceeding TokenVerse across multiple tasks, establishing it as a more practical multitask alternative without sacrificing ASR performance.
Shashi Kumar, Shakti P. Rath, Abhishek Pandey
Automatic Speech Recognition (ASR) systems suffer considerably when source speech is corrupted with noise or room impulse responses (RIR). Typically, speech enhancement is applied in both mismatched and matched scenario training and testing. In matched setting, acoustic model (AM) is trained on dereverberated far-field features while in mismatched setting, AM is fixed. In recent past, mapping speech features from far-field to close-talk using denoising autoencoder (DA) has been explored. In this paper, we focus on matched scenario training and show that the proposed joint VAE based mapping achieves a significant improvement over DA. Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA based enhancement and 3.96% compared to AM trained directly on far-field filterbank features.
Shashi Kumar, Yacouba Kaloga, John Mitros, Petr Motlicek, Ina Kodrasi
Low-rank adaptation (LoRA) is a widely used method for parameter-efficient finetuning. However, existing LoRA variants lack mechanisms to explicitly disambiguate task-relevant information within the learned low-rank subspace, potentially limiting downstream performance. We propose Factorized Variational Autoencoder LoRA (FVAE-LoRA), which leverages a VAE to learn two distinct latent spaces. Our novel Evidence Lower Bound formulation explicitly promotes factorization between the latent spaces, dedicating one latent space to task-salient features and the other to residual information. Extensive experiments on text, audio, and image tasks demonstrate that FVAE-LoRA consistently outperforms standard LoRA. Moreover, spurious correlation evaluations confirm that FVAE-LoRA better isolates task-relevant signals, leading to improved robustness under distribution shifts. Our code is publicly available at: https://github.com/idiap/FVAE-LoRA
Saurabh Kumar, Shashi Ranjan Kumar, Abhinav Sinha
In this paper, we consider the tracking of arbitrary curvilinear geometric paths in three-dimensional output spaces of unmanned aerial vehicles (UAVs) without pre-specified timing requirements, commonly referred to as path-following problems, subjected to bounded inputs. Specifically, we propose a novel nonlinear path-following guidance law for a UAV that enables it to follow any smooth curvilinear path in three dimensions while accounting for the bounded control authority in the design. The proposed solution offers a general treatment of the path-following problem by removing the dependency on the path's geometry, which makes it applicable to paths with varying levels of complexity and smooth curvatures. Additionally, the proposed strategy draws inspiration from the pursuit guidance approach, which is known for its simplicity and ease of implementation. Theoretical analysis guarantees that the UAV converges to its desired path within a fixed time and remains on it irrespective of its initial configuration with respect to the path. Finally, the simulations demonstrate the merits and effectiveness of the proposed guidance strategy through a wide range of engagement scenarios, showcasing the UAV's ability to follow diverse curvilinear paths accurately.
Ashok Samrat R, Swati Singh, Shashi Ranjan Kumar
This paper presents a nonlinear guidance scheme designed to achieve precise interception of stationary targets at a pre-specified impact time. The proposed strategy essentially accounts for the constraints imposed by the interceptor's seeker field-of-view (FOV) and actuator limitations, which, if ignored, can degrade guidance performance. To address these challenges, the guidance law incorporates known actuator bounds directly into its design, thereby improving overall interceptor effectiveness. The proposed method utilizes an input-affine magnitude saturation model to effectively enforce these constraints. By appending this input saturation model to the interceptor's kinematic equations, a guidance law is derived that ensures interception at the desired impact time while accounting for the physical constraints of the sensor and actuator. The efficacy of the proposed strategies is demonstrated through comprehensive numerical simulations across various scenarios and is compared against an existing guidance strategy.
Sahaya Aarti Dennisselvan, Shashi Ranjan Kumar, Dwaipayan Mukherjee
This work proposes a cooperative strategy for a group of quadrotors interacting over ring digraphs with macro-vertices of size two. Consensus for a group of general double integrators has been initially investigated, and it has been proved that through a suitable choice of a single controller parameter, consensus and stability of the resulting networked dynamical system can be ensured. This further opens up the possibility of achieving a desired formation and to move a swarm of quadrotors, interacting over ring digraphs, at a desired flight velocity, using a single controller gain. An analysis of achievable velocities is performed. Examples have been provided to offer deeper insights into the obtained analytical results. Simulation studies clearly demonstrate that a desired formation is achieved, starting from arbitrary initial positions, while also ensuring convergence to a final desired flight velocity.
Ram Milan Kumar Verma, Shashi Ranjan Kumar, Hemendra Arya
This work addresses the problem of constrained motion control of the uncrewed surface vessels. The constraints are imposed on states/inputs of the vehicles due to the physical limitations, mission requirements, and safety considerations. We develop a nonlinear feedback controller utilizing log-type Barrier Lyapunov Functions to enforce static and dynamic motion constraints. The proposed scheme uniquely addresses asymmetric constraints on position and heading alongside symmetric constraints on surge, sway, and yaw rates. Additionally, a smooth input saturation model is incorporated in the design to guarantee stability even under actuator bounds, which, if unaccounted for, can lead to severe performance degradation and poor tracking. Rigorous Lyapunov stability analysis shows that the closed-loop system remains stable and that all state variables remain within their prescribed bounds at all times, provided the initial conditions also lie within those bounds. Numerical simulations demonstrate the effectiveness of the proposed strategies for surface vessels without violating the motion and actuator constraints.
Dwaipayan Mukherjee, Shashi Ranjan Kumar
This paper presents a finite-time heterogeneous cyclic pursuit scheme that ensures consensus among agents modelled as integrators. It is shown that for the proposed sliding mode control, even when the gains corresponding to each agent are non-identical, consensus results within a finite-time provided all the gains are positive. An algorithm is presented to compute the consensus value and consensus time for a given set of gains and initial states of the agents. The set of values where consensus can occur, by varying the gains, has been derived and a second algorithm aids in determining the gains that enable consensus at any point in the aforementioned set, within a given finite-time. As an application, the finite-time consensus in line-of-sight (LOS) rates, over a cycle digraph, for a group of interceptors is shown to be effective in ensuring co-operative collision-free interception of a target, for both kinematic and realistic models of the interceptors. Simulations vindicate the theoretical results.
Abhinav Sinha, Shashi Ranjan Kumar
We propose predefined-time consensus-based cooperative guidance laws for a swarm of interceptors to simultaneously capture a target capable of executing various kinds of motions. Unlike leader-follower cooperative guidance techniques, the swarm of interceptors has no leader and each interceptor executes its own guidance command in a distributive fashion, thus obviating the residency of the mission over a single interceptor. We first design cooperative guidance commands subject to target's mobility, followed by the same when the target is stationary, and also account for interceptors' changing topologies. We further show through rigorous analysis that the proposed cooperative guidance laws guarantee consensus in the interceptors' time-to-go values within a predefined-time. The proposed design allows a feasible time of consensus in time-to-go to be set arbitrarily at will during the design regardless of the interceptors' initial time-to-go values, thereby ensuring a simultaneous target interception. We also demonstrate the efficacy of the proposed design via simulations for interceptors connected over static and dynamic networks.
Lohitvel Gopikannan, Shashi Ranjan Kumar, Abhinav Sinha
This paper proposes a cooperative integrated estimation-guidance framework for simultaneous interception of a non-maneuvering target using a team of unmanned autonomous vehicles, assuming only a subset of vehicles are equipped with dedicated sensors to measure the target's states. Unlike earlier approaches that focus solely on either estimation or guidance design, the proposed framework unifies both within a cooperative architecture. To circumvent the limitation posed by heterogeneity in target observability, sensorless vehicles estimate the target's state by leveraging information exchanged with neighboring agents over a directed communication topology through a prescribed-time observer. The proposed approach employs true proportional navigation guidance (TPNG), which uses an exact time-to-go formulation and is applicable across a wide spectrum of target motions. Furthermore, prescribed-time observer and controller are employed to achieve convergence to true target's state and consensus in time-to-go within set predefined times, respectively. Simulations demonstrate the effectiveness of the proposed framework under various engagement scenarios.
Shivam Bajpai, Abhinav Sinha, Shashi Ranjan Kumar
This paper addresses a critical aerial defense challenge in contested airspace, involving three autonomous aerial vehicles -- a hostile drone (the pursuer), a high-value drone (the evader), and a protective drone (the defender). We present a cooperative guidance framework for the evader-defender team that guarantees interception of the pursuer before it can capture the evader, even under highly dynamic and uncertain engagement conditions. Unlike traditional heuristic, optimal control, or differential game-based methods, we approach the problem within a time-constrained guidance framework, leveraging true proportional navigation based approach that ensures robust and guaranteed solutions to the aerial defense problem. The proposed strategy is computationally lightweight, scalable to a large number of agent configurations, and does not require knowledge of the pursuer's strategy or control laws. From arbitrary initial geometries, our method guarantees that key engagement errors are driven to zero within a fixed time, leading to a successful mission. Extensive simulations across diverse and adversarial scenarios confirm the effectiveness of the proposed strategy and its relevance for real-time autonomous defense in contested airspace environments.
Abhinav Sinha, Dwaipayan Mukherjee, Shashi Ranjan Kumar
This work proposes a cooperative strategy that employs deviated pursuit guidance to simultaneously intercept a moving (but not manoeuvring) target. As opposed to many existing cooperative guidance strategies which use estimates of time-to-go, based on proportional-navigation guidance, the proposed strategy uses an exact expression for time-to-go to ensure simultaneous interception. The guidance design considers nonlinear engagement kinematics, allowing the proposed strategy to remain effective over a large operating regime. Unlike existing strategies on simultaneous interception that achieve interception at the average value of their initial time-to-go estimates, this work provides flexibility in the choice of impact time. By judiciously choosing the edge weights of the communication network, a weighted consensus in time-to-go can be achieved. It has been shown that by allowing an edge weight to be negative, consensus in time-to-go can even be achieved for an impact time that lies outside the convex hull of the set of initial time-to-go values of the individual interceptors. The bounds on such negative weights have been analysed for some special graphs, using Nyquist criterion. Simulations are provided to vindicate the efficacy of the proposed strategy.