S6-DAMON: BRIDGING SELF-SUPERVISED SPEECH MODELS AND REAL-TIME SPEECH RECOGNITION

Anonymous

Abstract

There has been a growing demand for deep neural network (DNN) powered automatic speech recognition (ASR) on mobile platforms for real-time speech recognition. However, ubiquitous on-device ASR systems are still hindered by two bottlenecks: (1) the lack of large-scale transcribed speech data, especially for low-resource spoken languages, and (2) the large gap between DNNs' prohibitive complexity and mobile devices' limited resources. In parallel, speech models pretrained via self-supervised learning (SSL) have emerged to reduce the reliance on transcribed speech data, which however further widens the efficiency gap because they often adopt large transformers to ensure expressive speech representations. Thus, it is highly desirable to trim down the complexity of speech SSL models to enable real-time on-device ASR. This is particularly challenging since only structured sparsity can favor hardware efficiency on commercial devices, under which the speech representation learned by SSL can easily be demolished. To this end, we develop a framework dubbed S6-DAMON to pursue structured sparsity in speech SSL models via data-model co-compression. On the data side, leveraging both the duration of each phoneme and the pauses between the words/phonemes of human utterances, we propose a salient audio token detector, dubbed SALAD, to remove redundant input audio tokens; on the model side, we identify that the failure of the SOTA ASR pruning method under structured sparsity is caused by the sparsity discrepancy between finetuning and deployment as well as its limited learnability of sparsity distributions, and we tackle both via a new ASR pruning pipeline dubbed SAFARI, which adopts a three-step pipeline: sparsify, finetune, and adjust sparsity. Extensive experiments validate that S6-DAMON can enable real-time ASR with limited transcribed speech data while maintaining decent recognition performance. All source code will be released upon acceptance.

1. INTRODUCTION

Recent breakthroughs in deep neural networks (DNNs) have tremendously advanced the field of Automatic Speech Recognition (ASR), enabling record-breaking end-to-end ASR systems (Hannun et al., 2014; Chan et al., 2016; Zhang et al., 2020; Gulati et al., 2020). Considering that speech is one of the basic input modalities for users of intelligent mobile devices, there has been increasing interest in the development and deployment of on-device ASR systems. For example, intelligent assistants (Meta-AI, 2022; Vox, 2022) are highly desired in next-generation augmented reality and virtual reality (AR/VR) devices for enabling immersive AR/VR experiences. This calls for advanced speech technologies that deliver accurate and real-time ASR systems. However, two critical efficiency bottlenecks still stand in the way of ubiquitous on-device ASR: (1) data efficiency: big data is often impractical for ASR since collecting transcriptions at scale is costly or even infeasible, especially for low-resource spoken languages, and (2) model efficiency: the often limited on-device resources stand at odds with the complexity of deep ASR models, making it particularly challenging to satisfy real-time ASR requirements. To promote the aforementioned data efficiency, recent advances in self-supervised learning (SSL) for speech representation (Chi et al., 2020; Baevski et al., 2020; 2022) have demonstrated empirical success and become the de-facto paradigm for low-resource ASR, where SSL models pretrained on unlabeled audio data can generalize to low-resource transcribed speech after finetuning. However, this could further aggravate the model efficiency bottleneck, as giant transformers (Vaswani et al., 2017) (e.g., >90M parameters) are often adopted in state-of-the-art (SOTA) speech SSL models to ensure effective representation learning during SSL, making them increasingly challenging to deploy on-device. Therefore, it is imperative to compress speech SSL models while maintaining their generalizable speech representation in order to deliver efficient ASR systems.

Despite the demanding need, it is non-trivial to narrow the gap between large speech SSL models and the constrained resources of mobile devices. First, under the SOTA pretrain-then-finetune paradigm, most useful features are learned during the SSL stage, and pretrained speech SSL models only slightly adapt their weights to encode task-specific information during finetuning; it is thus difficult to learn a sparsity distribution during finetuning while maintaining the fidelity of the speech representation given only low-resource downstream speech. Note that this is particularly challenging for ASR due to its more stringent low-resource settings, e.g., LibriSpeech-10m (Panayotov et al., 2015) for ASR contains only 48 sentences for training and development, whereas the CoLA dataset (Wang et al., 2018) for natural language processing (NLP) contains 9594 sentences.
Second, only structured sparsity can favor hardware efficiency on commercial mobile devices, yet during finetuning it destroys the speech representation learned during pretraining more severely than unstructured sparsity does; e.g., enforcing structured sparsity in the SOTA unstructured ASR pruning framework PARP (Lai et al., 2021) causes a >8% increase in word error rate (WER) at only 20% sparsity on wav2vec2-base/LibriSpeech-1h. Third, since ASR is a sequence-to-sequence task whose alignment between inputs and outputs is monotonic, the compression process must be more meticulous in preserving the information of useful audio frames than when compressing models for classification tasks.

Our Contributions. We develop a framework dubbed S6-DAMON, which is the first to pursue structured sparsity in speech SSL models under low-resource settings via data-model co-compression for enabling real-time on-device speech recognition. On the data side, S6-DAMON leverages the intrinsic redundancy in human speech. Owing to the duration of each phoneme and the pauses between the words/phonemes of human utterances, the sampled audio frames and the correspondingly extracted audio tokens, i.e., the inputs to the transformer, may (1) repeat previous tokens or (2) stand as blanks, contributing little to the final recognition (see the example in Fig. 1 (a)). We call both non-salient audio tokens (NATs), and we call the first-appearing tokens that are indispensable for ensuring monotonic recognition salient audio tokens (SATs). Properly removing NATs can lead to non-trivial savings in model efficiency while maintaining accuracy better than removing SATs would, e.g., NATs account for 50.6% of the total tokens on LibriSpeech test-clean based on token-wise annotations from a finetuned wav2vec2-base.
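The SAT/NAT distinction mirrors the collapsing rule of CTC-style frame-level predictions: a frame contributes nothing new to the decoded transcript if it emits the blank symbol or merely repeats the previous frame's token. A minimal sketch of this classification rule (illustrative only; the function name, blank symbol, and labels are our own, not part of S6-DAMON):

```python
# Classify frame-level token predictions as salient (SAT) or non-salient (NAT),
# following the CTC collapsing rule: blanks and immediate repeats are removable.

BLANK = "-"  # placeholder for the CTC blank symbol

def classify_tokens(frame_preds):
    """Return a parallel list marking each frame's token as 'SAT' or 'NAT'."""
    labels = []
    prev = None
    for tok in frame_preds:
        if tok == BLANK or tok == prev:
            labels.append("NAT")  # blank or repeated token: contributes nothing new
        else:
            labels.append("SAT")  # first appearance: needed for monotonic decoding
        prev = tok  # updating on blanks keeps repeats across a blank salient
    return labels
```

For instance, the frame sequence `F F - I I N N E E` (in the spirit of Fig. 1(a)) yields four SATs and five NATs, i.e., over half of the frames are candidates for removal.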
As only sentence-level transcriptions are annotated in ASR datasets and token-wise labels for classifying SATs/NATs are unavailable, we design a salient audio token detector called SALAD and train it in a semi-supervised manner based on pseudo token-wise labels annotated on untranscribed speech by finetuned speech SSL models. A high recall is enforced to ensure coverage of SATs, so that properly removing the NATs detected by SALAD at inference better maintains the fidelity of the speech representation. On the model side, we discover that the failures of the SOTA ASR pruning method PARP (Lai et al., 2021) under structured sparsity are caused by (1) the sparsity discrepancy between finetuning and deployment, i.e., PARP finds the
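The semi-supervised recipe above can be sketched as follows. This is our own illustrative reconstruction under stated assumptions, not the authors' implementation: we assume pseudo SAT/NAT labels are derived from a finetuned model's frame-level predictions via the CTC collapsing rule, and that high SAT recall is enforced by choosing the detector's decision threshold from held-out scores; the helper names `pseudo_labels` and `threshold_for_recall` are hypothetical.

```python
import math

def pseudo_labels(frame_preds, blank="-"):
    """Pseudo-label each frame: 1 (SAT) for first-appearing tokens,
    0 (NAT) for blanks and immediate repeats."""
    labels, prev = [], None
    for tok in frame_preds:
        labels.append(0 if tok == blank or tok == prev else 1)
        prev = tok
    return labels

def threshold_for_recall(scores, labels, target_recall=0.99):
    """Pick the highest detector threshold whose recall on pseudo-labeled SATs
    still meets the target, so that almost no salient token is dropped."""
    sat_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not sat_scores:
        return 0.0
    # Number of SATs that must score at or above the chosen threshold.
    keep = min(len(sat_scores), math.ceil(target_recall * len(sat_scores)))
    return sat_scores[keep - 1]
```

At inference, tokens whose detector score falls below the chosen threshold would be treated as NATs and removed before entering the transformer, trading a small, controlled loss of SAT coverage for input-length savings.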



Figure 1: (a) An example from LibriSpeech for illustrating two types of non-salient audio tokens; (b) The trade-offs between WER on LibriSpeech test-clean and FLOPs savings achieved by different ASR compression schemes on top of wav2vec2-base finetuned on LibriSpeech-100h.

