CONTINUOUS PSEUDO-LABELING FROM THE START

Abstract

Self-training (ST), or pseudo-labeling, has recently sparked significant interest in the automatic speech recognition (ASR) community because of its success in harnessing unlabeled data. Unlike prior semi-supervised learning approaches that relied on iteratively regenerating pseudo-labels (PLs) from a trained model and using them to train a new model, recent state-of-the-art methods perform 'continuous training', where PLs are generated with a very recent version of the model being trained. Nevertheless, these approaches still bootstrap ST with an initial supervised learning phase in which the model is trained on labeled data alone. We believe this risks over-fitting to the labeled dataset in low-resource settings, and that ST from the start of training should reduce over-fitting. In this paper we show how to do this by dynamically controlling the evolution of PLs during training in ASR. To the best of our knowledge, this is the first study to show the feasibility of generating PLs from the very start of training. We achieve this using two techniques that avoid the instabilities which lead to degenerate models that do not generalize. First, we control the evolution of PLs through a curriculum that uses the online changes in PLs to control the membership of the cache of PLs and improve generalization. Second, we find that sampling transcriptions from the predictive distribution, rather than only using the best transcription, stabilizes training further. With these techniques, our ST models match prior works without an external language model.

1. INTRODUCTION

The past few years have witnessed a growth in methods that leverage large amounts of unlabeled data in domains such as speech, vision, and language to produce state-of-the-art results, e.g., Baevski et al. (2020; 2021; 2022). Amongst the techniques that have made this possible are self-supervised learning (SSL) and self-training (ST) (Scudder, 1965; Lee, 2013). While SSL is typically used in unsupervised settings, ST is applied in supervised settings where labeled data can be extended with unlabeled data that is labeled using a prior model, a process known as pseudo-labeling (PL). These techniques reduce the burden of expensive labeling processes while successfully training data-hungry models, such as transformers, on large quantities of unlabeled data. Current state-of-the-art SSL methods in speech (Baevski et al., 2020; Hsu et al., 2021; Baevski et al., 2022; Chung et al., 2021) are typically trained in two phases: first, the models are pre-trained on thousands of hours of unlabeled speech, and then they are adapted by fine-tuning on the actual task of automatic speech recognition (ASR) using a smaller supervised set. However, because the pre-training (PT) phase is task-agnostic, self-supervision can under-perform on a specific downstream task (Talnikar et al., 2021; Dery et al., 2022). Further, SSL pre-training leads to a more complicated pipeline involving multiple phases. By contrast, ST algorithms also use unlabeled data but do not require phases of training with different objectives, which makes the training pipeline simpler. In this paper, we focus on recent ST algorithms that perform 'continuous training' of a single model.
In contrast to earlier ST training methods that iterate between generating PLs over the entire unlabeled dataset and training a model (teacher-student) (Synnaeve et al., 2020; Kahn et al., 2020a; Park et al., 2020), here pseudo-labels (PLs) are generated online with a very recent version of the model (Xu et al., 2020; Likhomanenko et al., 2021a; Manohar et al., 2021; Higuchi et al., 2021; 2022a;b), and training is faster and more resource-efficient. One of the main challenges for continuous ST is training stability (Likhomanenko et al., 2021a; Higuchi et al., 2021; 2022b; Cai et al., 2022). While these prior works use various techniques for stabilization, one common ingredient is that models are initially trained on labeled data for M steps. slimIPL (Likhomanenko et al., 2021a) showed robustness to M in some settings, but a well-established recipe does not seem to exist for the case of small labeled datasets (a.k.a. the low-resource setting). Indeed, we find that more pre-training steps, compared to what was shown previously in Likhomanenko et al. (2021a), can lead to worse results (see Table 1). We hypothesize that this is due to over-fitting to the labeled set early in training in low-resource settings, and in this paper we try to improve results by doing ST without any pre-training (i.e., M = 0). However, in our experiments, off-the-shelf slimIPL diverges early in training in low-resource settings, so we developed methods to address this problem, which we summarize here:
• We propose a new curriculum for controlling the PL distribution during training. The curriculum uses the Levenshtein distance between PLs at different time steps to control how PLs are updated, and how unsupervised examples are chosen for training.
• We show that sampling transcriptions from the output distribution instead of using the best transcription makes ST robust and stable, especially when no pre-training is performed.
For the first time, with these strategies we show that continuous PL can be done from the very start of training, matching prior works without an external language model.
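To make the sampling idea concrete, the sketch below contrasts the usual best (frame-wise argmax) CTC transcription with one sampled from the per-frame output distribution. The posteriors, tokens, and function names are invented for illustration; only the standard CTC collapse rule (merge repeats, drop blanks) is assumed.

```python
import random

BLANK = "_"  # CTC blank token

def collapse_ctc(frames):
    """Collapse repeated tokens, then drop blanks (standard CTC rule)."""
    out, prev = [], None
    for t in frames:
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return "".join(out)

def argmax_transcription(posteriors):
    """Best path: pick the most probable token at every frame."""
    return collapse_ctc([max(p, key=p.get) for p in posteriors])

def sampled_transcription(posteriors, rng):
    """Sample one token per frame from the frame-wise distribution instead."""
    frames = [rng.choices(list(p), weights=list(p.values()))[0] for p in posteriors]
    return collapse_ctc(frames)

# Toy per-frame posteriors over {BLANK, 'a', 'b'}
posteriors = [
    {BLANK: 0.1, "a": 0.8, "b": 0.1},
    {BLANK: 0.8, "a": 0.1, "b": 0.1},
    {BLANK: 0.1, "a": 0.1, "b": 0.8},
]
print(argmax_transcription(posteriors))                   # deterministic: "ab"
print(sampled_transcription(posteriors, random.Random(0)))  # varies with the seed
```

Sampling occasionally yields a different, lower-probability transcription, which injects the diversity into the PLs that the argmax path lacks.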

2. EXPERIMENTAL SETUP AND RELATED METHODS

Data All our experiments are performed using the LibriSpeech dataset (Panayotov et al., 2015). We use the regular train-clean-360 and train-other-500 subsets as unlabeled data, and consider either a subset of 10h randomly drawn from train-clean-100, or the full 100h set (train-clean-100), as labeled data. Comparisons with existing works are also provided using the 10h subset from Libri-Light (Kahn et al., 2020b)¹. In addition, we evaluate the final configuration of our methods on the Common Voice dataset (Ardila et al., 2020) for French, where we sample 10h and 100h from the train set to use as labeled data and the rest as unlabeled data (see Appendix A.3).

Acoustic model Following Likhomanenko et al. (2021a), models are trained with an English letter token set², the Connectionist Temporal Classification (CTC) loss (Graves et al., 2006), identical SpecAugment (Park et al., 2019) parameters, and the Adagrad optimizer (Duchi et al., 2011). The acoustic model is the same transformer architecture that was introduced in slimIPL, except that we encode positions with either absolute sinusoidal positional embeddings (Vaswani et al., 2017) or the recently proposed CAPE (Likhomanenko et al., 2021b) instead of relative positional embeddings (Shaw et al., 2018). This allows us to speed up training (by 2-3x) and decrease the memory footprint significantly. All models are trained on 8 GPUs for a maximum of 500k updates. We use either a static batch of 8 examples or a dynamic batch that packs ∼290s of audio per GPU.
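Dynamic batching of this kind can be sketched as a greedy packer that fills each per-GPU batch until an audio-length budget is hit. The greedy policy and all names below are illustrative assumptions, not the exact implementation used for the experiments.

```python
def pack_dynamic_batches(durations_s, budget_s=290.0):
    """Greedily pack utterance indices into batches whose total audio
    duration stays within `budget_s` seconds per GPU (sketch only)."""
    batches, current, total = [], [], 0.0
    for idx, dur in enumerate(durations_s):
        if current and total + dur > budget_s:
            batches.append(current)      # close this batch, start a new one
            current, total = [], 0.0
        current.append(idx)
        total += dur
    if current:
        batches.append(current)
    return batches

# Four 100s utterances fit two per 290s batch.
print(pack_dynamic_batches([100, 100, 100, 100]))  # [[0, 1], [2, 3]]
```

The static alternative simply takes a fixed number of examples (8 per GPU) regardless of their length, which wastes compute when utterance durations vary widely.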

Continuous pseudo-labeling (PL) in ASR Let L = {x_i, y_i} and U = {x_j} be the labeled and unlabeled datasets, respectively. We consider a semi-supervised PL approach where an acoustic model is trained on L while PLs for U are generated online with a very recent version of the model itself and used as training targets (Likhomanenko et al., 2021a).
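One update of such a scheme can be sketched as follows, in the spirit of slimIPL's dynamic cache: the model trains on a labeled batch with ground truth and on an unlabeled batch whose target is a cached PL, occasionally refreshing cached PLs with the current model. `ToyModel`, the loss values, and the refresh probability are placeholders, not the paper's implementation.

```python
import random

class ToyModel:
    """Stand-in for an acoustic model; a real system would compute a CTC
    loss and decode transcriptions (names here are placeholders)."""
    def loss(self, audio, text):
        return float(len(text))      # dummy scalar "loss"
    def transcribe(self, audio):
        return "pseudo label"        # dummy transcription of `audio`

def pl_training_step(model, labeled_batch, cache, unlabeled_pool, p_replace, rng):
    """One schematic continuous-PL update on L and U."""
    # Supervised term: ground-truth transcription from L.
    loss = model.loss(labeled_batch["audio"], labeled_batch["text"])
    # Unsupervised term: reuse a cached (audio, PL) pair if one exists,
    # otherwise label a fresh unlabeled batch with the current model.
    if cache:
        audio, pl = cache.pop(rng.randrange(len(cache)))
    else:
        audio = unlabeled_pool[rng.randrange(len(unlabeled_pool))]
        pl = model.transcribe(audio)   # no gradient through PL generation
    loss += model.loss(audio, pl)
    # With probability p_replace, refresh the PL with the current model
    # before returning the pair to the cache.
    if rng.random() < p_replace:
        pl = model.transcribe(audio)
    cache.append((audio, pl))
    return loss
```

Because PLs come from a slightly stale cache rather than being regenerated every step, the targets change smoothly as the model evolves, which is what the curriculum in this paper controls explicitly.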



¹ The Libri-Light 10h subset contains speakers drawn from the whole of LibriSpeech (from both clean and noisy subsets). To keep our experiments consistent, and also to assess domain transfer to the unlabeled noisy subsets, we reconstructed the 10h set from train-clean-100, sampling randomly from the speakers and retaining the original speakers from this subset.
² 26 letters augmented with the apostrophe and a word boundary token.
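A reconstruction of this kind can be sketched as drawing whole speakers at random until a duration budget is met. The tuple format, the greedy stop rule, and the toy data are illustrative assumptions, not the exact procedure used to build the released split.

```python
import random

def sample_subset(utterances, target_hours, rng):
    """Draw whole speakers at random until the accumulated audio reaches
    `target_hours`. `utterances` is a list of
    (speaker, utt_id, duration_seconds) tuples (sketch only)."""
    by_speaker = {}
    for spk, utt, dur in utterances:
        by_speaker.setdefault(spk, []).append((utt, dur))
    speakers = sorted(by_speaker)   # fix an order so shuffling is reproducible
    rng.shuffle(speakers)
    picked, total = [], 0.0
    for spk in speakers:
        if total >= target_hours * 3600:
            break
        for utt, dur in by_speaker[spk]:
            picked.append(utt)
            total += dur
    return picked, total / 3600.0
```

Keeping speakers whole avoids leaking a speaker's voice between the labeled subset and the remaining unlabeled pool.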




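A curriculum of this flavor can be sketched as follows: compare the PL the model produces now with the PL stored earlier, and let their normalized Levenshtein distance decide whether the stored PL is replaced. The threshold policy and function names are illustrative assumptions, not the paper's exact recipe.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ta != tb)))  # substitution
        prev = cur
    return prev[-1]

def maybe_refresh(cached_pl, new_pl, threshold):
    """Replace a cached PL only once the model's current output has
    drifted: refresh when the normalized word-level Levenshtein distance
    exceeds `threshold` (an illustrative policy)."""
    old_words, new_words = cached_pl.split(), new_pl.split()
    dist = levenshtein(old_words, new_words) / max(len(old_words), 1)
    return new_pl if dist > threshold else cached_pl
```

Measuring how much PLs move between time steps gives a scalar signal that can gate both cache updates and the choice of unsupervised examples.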

Table 1: Continuous ST (using slimIPL) with different numbers of pre-training steps (M) on the 10h labeled set: more pre-training can lead to worse results (we show word error rate, WER, on dev-clean).


