S^6-DAMON: BRIDGING SELF-SUPERVISED SPEECH MODELS AND REAL-TIME SPEECH RECOGNITION

Anonymous

Abstract

There has been a growing demand for deep neural network (DNN) powered automatic speech recognition (ASR) on mobile platforms for real-time speech recognition. However, ubiquitous on-device ASR systems are still hindered by two bottlenecks: (1) the lack of large-scale transcribed speech data, especially for low-resource spoken languages, and (2) the large gap between DNNs' prohibitive complexity and mobile devices' limited resources. In parallel, speech models pretrained via self-supervised learning (SSL) have emerged to reduce the reliance on transcribed speech data; however, they further enlarge the efficiency gap because they often adopt large transformers to ensure expressive speech representations. Thus, it is highly desirable to trim down the complexity of speech SSL models to enable real-time on-device ASR. This is particularly challenging since only structured sparsity can favor hardware efficiency in commercial devices, under which the speech representation learned by SSL could easily be demolished. To this end, we develop a framework dubbed S^6-DAMON to pursue structured sparsity in speech SSL models via data-model co-compression. On the data side, leveraging both the duration of each phoneme and the pauses between the words/phonemes of human utterances, we propose a salient audio token detector, dubbed SALAD, to remove redundant input audio tokens. On the model side, we identify that the failure of the SOTA ASR pruning method under structured sparsity is caused by the sparsity discrepancy between finetuning and deployment and by its limited learnability of sparsity distributions; we then tackle it via a new ASR pruning pipeline dubbed SAFARI, which adopts three steps: sparsify, finetune, and adjust sparsity. Extensive experiments validate that S^6-DAMON can enable real-time ASR with limited transcribed speech data requirements while maintaining decent recognition performance. All source code will be released upon acceptance.

1. INTRODUCTION

Recent breakthroughs in deep neural networks (DNNs) have tremendously advanced the field of Automatic Speech Recognition (ASR), enabling record-breaking end-to-end ASR systems (Hannun et al., 2014; Chan et al., 2016; Zhang et al., 2020; Gulati et al., 2020). Considering that speech is one of the basic input modalities for users of intelligent mobile devices, there has been increasing interest in the development and deployment of on-device ASR systems. For example, intelligent assistants (Meta-AI, 2022; Vox, 2022) are highly desired in next-generation augmented reality and virtual reality (AR/VR) devices for enabling immersive AR/VR experiences. This has called for advanced speech technologies that deliver accurate and real-time ASR systems.

Two critical efficiency bottlenecks still remain for ubiquitous on-device ASR systems: (1) data efficiency: big data is often not practical for ASR since collecting transcriptions on a large scale is costly or not even possible, especially for low-resource spoken languages; and (2) model efficiency: the often limited on-device resources stand at odds with the complexity of deep ASR models, making it particularly challenging to satisfy real-time ASR requirements.

To promote the aforementioned data efficiency, recent advances in self-supervised learning (SSL) for speech representation (Chi et al., 2020; Baevski et al., 2020; 2022) have demonstrated empirical success and become the de facto paradigm for low-resource ASR, where SSL models pretrained on unlabeled audio data can be generalized to handle low-resource transcribed speech after being finetuned. However, this could further aggravate the model efficiency bottleneck as giant transformers (Vaswani

