CONTINUOUS PSEUDO-LABELING FROM THE START

Abstract

Self-training (ST), or pseudo-labeling, has recently sparked significant interest in the automatic speech recognition (ASR) community because of its success in harnessing unlabeled data. Unlike prior semi-supervised learning approaches that relied on iteratively regenerating pseudo-labels (PLs) from a trained model and using them to train a new model, recent state-of-the-art methods perform 'continuous training', where PLs are generated using a very recent version of the model being trained. Nevertheless, these approaches still rely on bootstrapping ST with an initial supervised learning phase, where the model is trained on labeled data alone. We believe this has the potential to over-fit to the labeled dataset in low-resource settings, and that ST from the start of training should reduce over-fitting. In this paper we show how this can be done by dynamically controlling the evolution of PLs during training in ASR. To the best of our knowledge, this is the first study that shows the feasibility of generating PLs from the very start of training. We achieve this using two techniques that avoid the instabilities which lead to degenerate models that do not generalize. Firstly, we control the evolution of PLs through a curriculum that uses the online changes in PLs to control the membership of the cache of PLs and improve generalization. Secondly, we find that by sampling transcriptions from the predictive distribution, rather than only using the best transcription, we can stabilize training further. With these techniques, our ST models match prior works without an external language model.

1. INTRODUCTION

The past few years have witnessed a growth in methods that leverage large amounts of unlabeled data in domains such as speech, vision and language to produce state-of-the-art results, e.g. Baevski et al. (2020; 2022); Chen et al. (2020a); Caron et al. (2021); He et al. (2022); Cai et al. (2022); Brown et al. (2020); Ramesh et al. (2021). Amongst the techniques that have made this possible are self-supervised learning (SSL) and self-training (ST) (Scudder, 1965; Lee, 2013). While SSL is typically used in unsupervised settings, ST is applied in supervised settings where labeled data can be extended with unlabeled data that is labeled using a prior model, a process known as pseudo-labeling (PL). These techniques can reduce the burden of expensive labeling processes while successfully training data-hungry models, such as transformers, on large quantities of unlabeled data. Current state-of-the-art SSL methods in speech (Baevski et al., 2020; Hsu et al., 2021; Baevski et al., 2022; Chung et al., 2021) are typically trained in two phases. First, the models are pre-trained on thousands of hours of unlabeled speech, and then they are further adapted by fine-tuning on the actual task of automatic speech recognition (ASR) using a smaller supervised set. However, because the pre-training (PT) phase is task agnostic, self-supervision can under-perform on a specific downstream task (Talnikar et al., 2021; Dery et al., 2022). Further, SSL pre-training leads to a more complicated pipeline involving multiple phases. By contrast, ST algorithms also use unlabeled data but do not require phases of training with different objectives, which makes the training pipeline simpler. In this paper, we focus on recent ST algorithms that perform 'continuous training' of a single model.
In contrast to earlier ST training methods that iterate between generating PLs over the entire unlabeled dataset and training a model (teacher-student) (Synnaeve et al., 2020; Kahn et al., 2020a; Zhang et al., 2020; Park et al., 2020), here pseudo-labels (PLs) are generated online with a very recent version of the model (Xu et al., 2020; Likhomanenko et al., 2021a; Manohar et al., 2021; Higuchi et al., 2021; 2022a; b), and training is faster and more resource-efficient. One of the main challenges for continuous ST is training stability (Likhomanenko et al., 2021a; Higuchi et al., 2021; 2022b; Cai et al., 2022). While these prior works use various techniques for stabilization, one common ingredient is that models are initially trained on labeled data for M steps. slimIPL (Likhomanenko et al., 2021a) showed robustness to M in some settings, but a well-established recipe does not seem to exist for the case of small labeled datasets (a.k.a. the low-resource setting). Indeed, we find that more pre-training steps, compared to what was shown previously in Likhomanenko et al. (2021a), can lead to worse results (see Table 1). We hypothesize that this is due to over-fitting to the labeled set early in training in low-resource settings, and in this paper we try to improve results by doing ST without any pre-training (i.e. M = 0). However, in our experiments, off-the-shelf slimIPL diverges early in training in low-resource settings, so we developed methods to address this problem, which we summarize here:

• We control the evolution of PLs through a curriculum that uses their online changes to control the membership of the cache of PLs, improving generalization.

• We show that sampling transcriptions from the output distribution, instead of using only the best transcription, makes ST robust and stable, especially when no pre-training is performed.

For the first time, with these strategies, we show that continuous PL can be done from the very start of training, matching prior works without an external language model.

Table 1: Continuous ST (using slimIPL) with different numbers of pre-training steps (M) using a 10h labeled dataset reveals that more pre-training can lead to worse results (we show word error rate, WER, on dev-clean).

M    10k   20k   40k
WER  14.3  17.1  22.9

2. EXPERIMENTAL SETUP AND RELATED METHODS

Data  All our experiments are performed using the LibriSpeech dataset (Panayotov et al., 2015). We use the regular train-clean-360 and train-other-500 subsets as unlabeled data, and consider either a 10h subset randomly drawn from train-clean-100, or the full 100h set (train-clean-100), as labeled data. Comparisons with existing works are also provided using the 10h subset from Libri-Light (Kahn et al., 2020b)foot_0. In addition, we evaluate the final configuration of our methods on the Common Voice dataset (Ardila et al., 2020) for the French language, where we sample 10h and 100h from the train set to use as labeled data and use the rest as unlabeled data (see Appendix A.3).

Acoustic model  All models are trained with the Connectionist Temporal Classification (CTC) loss and the Adagrad optimizer (Duchi et al., 2011). The acoustic model is the same transformer architecture that was introduced in slimIPL, except that we encode positions with either absolute sinusoidal positional embeddings (Vaswani et al., 2017) or the recently proposed CAPE (Likhomanenko et al., 2021b), instead of relative positional embeddings (Shaw et al., 2018). This allows us to speed up training (by 2-3x) and decrease the memory footprint significantly. All models are trained on 8 GPUs for a maximum of 500k updates. We use either a static batch of 8 examples or a dynamic batch that packs ∼290s of audio per GPU.

Continuous pseudo-labeling (PL) in ASR  Let L = {x_i, y_i} and U = {x_j} be the labeled and unlabeled datasets, respectively. We consider a semi-supervised PL approach where an acoustic model A(x; θ) with model parameters θ is continuously trained on a combination of L and a pseudo-labeled set derived from U (see Algorithm 1). The model is trained by minimizing a loss

L(θ) = L_L(θ) + λ L_U(θ),     (1)

where λ ∈ R+ is a tunable hyper-parameter controlling the importance of unlabeled data. The loss for labeled data is defined as L_L(θ) = −E_{(x,y)∼L} log p_θ(y|x), where p_θ(y|x) is the conditional distribution defined by A(x; θ). The loss for unlabeled data is defined as L_U(θ) = −E_{x∼U} log p_θ(ŷ|x), where ŷ is the PL transcription for a data point, generated using the model being trained. Specifically,

ŷ = argmax_y log p_θ(y|x).     (2)

Algorithm 1: slimIPL algorithm and our proposed changes ([−] marks a deletion from slimIPL, [+] an addition)
Inputs: labeled data L = {x_i, y_i} and unlabeled data U = {x_j}, augmentation x̃ = augmentation(x), initialization θ_0, cache C = {}, learning rate η_k, losses L_L and L_U, parameters M, N_L, N_U, p_out and C
[−] PL function PL(x; θ) defined via Eq. (2)
[+] PL function PL(x; θ, τ) defined via sampling with temperature τ (see Section 4.2)
Result: acoustic model A(x; θ)
// Initial pre-training (PT) phase: train only on labeled samples
Train A on (x, y) ∈ L for M steps: θ_{k+1} = θ_k − η_k ∇L_L(A(x̃; θ_k), y), k = 1, ..., M
Decrease the dropout of model A(x; θ)
// Train on labeled data while filling the cache
for k = M+1, ..., M+C do
    For random x ∈ U generate ŷ = PL(x; θ_k, τ) (in inference mode) and C ← C ∪ {(x, ŷ)}
    θ_{k+1} = θ_k − η_k ∇L_L(A(x̃; θ_k), y), (x, y) ∈ L
    [+] τ = max(0, 1 − k/K)
// Continuous pseudo-labeling training with the cache
repeat
    if rand(0, 1) < N_L/(N_L + N_U) then
        Draw (x, y) ∈ L and θ_{k+1} = θ_k − η_k ∇L_L(A(x̃; θ_k), y)
    else
        Draw b = (x, ŷ) ∈ C and θ_{k+1} = θ_k − η_k ∇L_U(A(x̃; θ_k), ŷ)
        [+] ŷ′ = PL(x; θ_k, τ)  // compute the current model state PL
        [+] p_out = TER(ŷ, ŷ′) if k < K else p_out = 1  // compute the dynamic p_out
        if rand(0, 1) < p_out then
            For random x′ ∈ U generate ŷ′ = PL(x′; θ_k, τ) and C ← (C \ b) ∪ {(x′, ŷ′)}
        else
            [−] C ← (C \ b) ∪ {(x, ŷ)}  // same sample and PLs back into the cache
            [+] C ← (C \ b) ∪ {(x, ŷ′)}  // same sample but new PLs back into the cache
    k ← k + 1
    [+] τ = max(0, 1 − k/K)
until convergence or maximum iterations are reached
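As a toy illustration of Eqs. (1) and (2), the sketch below combines a supervised loss and a pseudo-label loss with weight λ. All names are ours: a frame-level cross-entropy stands in for the actual CTC loss, and the pseudo-label is the per-frame argmax of the model's output distribution (before CTC collapsing).

```python
import math

def log_softmax(logits):
    # numerically stable log-softmax over one frame's logits
    m = max(logits)
    z = math.log(sum(math.exp(l - m) for l in logits)) + m
    return [l - z for l in logits]

def nll(log_probs, targets):
    # frame-level negative log-likelihood, a stand-in for the CTC loss
    return -sum(lp[t] for lp, t in zip(log_probs, targets)) / len(targets)

def pseudo_label(log_probs):
    # Eq. (2): hard PLs via per-frame argmax
    return [max(range(len(lp)), key=lambda i: lp[i]) for lp in log_probs]

def semi_supervised_loss(logits_lab, y_lab, logits_unlab, lam=1.0):
    # Eq. (1): L(theta) = L_L(theta) + lambda * L_U(theta)
    lp_lab = [log_softmax(f) for f in logits_lab]
    lp_unlab = [log_softmax(f) for f in logits_unlab]
    y_hat = pseudo_label(lp_unlab)  # PLs from the current model state
    return nll(lp_lab, y_lab) + lam * nll(lp_unlab, y_hat)
```

Because the PL is the model's own argmax, the unsupervised term is minimized by sharpening the current prediction; this is exactly the self-reinforcement that, without a cache or sampling, can collapse training.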
Continuous PL keeps updating the pseudo-labels via Eq. (2) as the model trains. This procedure is prone to divergence, as without any constraint PLs can rapidly self-reinforce towards a trivial distribution.

Methods to stabilize training  Several approaches have been proposed to stabilize continuous PL. A pre-training (PT) phase on the supervised data only (optimizing the loss L_L(θ) for M updates) is always a key component. For example, in Chen et al. (2020b) PT is performed until full convergence. Another technique is the use of an exponential moving average (EMA) of the acoustic model to generate the pseudo-labels in Eq. (2) (Likhomanenko et al., 2021a; Manohar et al., 2021; Higuchi et al., 2021; 2022b; Zhang et al., 2022).

PL selection  Pseudo-label selection can help achieve better convergence by filtering out noisy PLs that slow down training. There have also been many efforts on curriculum selection of pseudo-labeled data, e.g. confidence filtering (Zhang et al., 2021) or assigning weights to pseudo-labeled data based on model uncertainty estimation (Huang et al., 2022). A recent work in ASR (Zhang et al., 2022) proposes curriculum filtering of PLs based on the Levenshtein distance between PLs generated for original and weakly augmented inputs. Later we will see that our idea is based solely on the PL evolution rather than on input augmentation.

Relation to consistency regularization  Popular consistency regularization methods (Sajjadi et al., 2016; Laine & Aila, 2016; Sohn et al., 2020; Berthelot et al., 2019) leverage the idea that a model should output the same distribution for an unlabeled example even after it has been augmented. In this paper we take inspiration from these works, but we focus on an orthogonal view: we consider distances between model outputs at different time steps.
Also, contrary to consistency regularization, we do not use this distance as an objective function to train a model, but as a data selection criterion.

Hyper-parameter selection  All hyper-parameter and model selections are performed using the dev-clean and dev-other sets. We report final token (TER) or word (WER) error rates on the test-clean and test-other sets. In all experiments, we only tune (C, p_out, M, λ) from the training procedure, while everything else is kept as in the slimIPL paper. By default we use C = 1000, λ = 1, M = 0. In most experiments we run 3 different random seeds and report the mean and standard deviation of each metric.
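The TER/WER metrics used throughout, and the Levenshtein distance underlying the PL selection ideas above, can be computed with a standard dynamic-programming edit distance; this is a generic sketch, not the authors' implementation.

```python
def levenshtein(a, b):
    # classic edit distance between two sequences (two DP rows kept in memory)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution (0 if equal)
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    # TER if ref/hyp are token sequences, WER if they are word sequences
    return levenshtein(ref, hyp) / max(1, len(ref))
```

For example, `error_rate("the cat sat".split(), "the cat sit".split())` gives a WER of 1/3.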

3. MOTIVATION

Existing continuous PL approaches rely on a two-step process: first pre-training (PT) on labeled data only, then continuing model training with both labeled and unlabeled data. While PT is known to be critical for the stability of continuous PL, in this work we are interested in finding ways to remove the PT phase, to simplify the whole procedure and possibly improve overall performance, both in terms of convergence speed and final WER.

PT improves the final WER  Initial experiments with slimIPL, Table 2, show that even with its simple cache strategy used to stabilize training, PT helps improve the final WER. This is not surprising, as without PT, PLs are of poor quality (>90% WER) at the beginning of training, since the model mostly produces random outputs. Careful tuning of the number of PT steps is however important, especially in low-resource supervised settings, as shown in Table 1.

Caching as a replacement for PT  Vanilla continuous PL is very similar to slimIPL with p_out = 1 (see Section 2). With the caching strategy, slimIPL picks unlabeled samples (and their associated PLs) from a cache when needed, and immediately replaces these examples with new unlabeled samples (and their new PLs). This ensures that PLs are always generated from a previous version of the trained model, while computing these PLs efficiently. While simple, we observe in Table 2 that this approach is enough to stabilize continuous PL, assuming a large enough cache.

When to update the PLs of the cached samples is critical  In slimIPL (Algorithm 1), each sample (x, ŷ) in the cache C at step k′ has a PL ŷ = PL(A(x; θ_k)) that was generated with the model θ_k at the step k < k′ when it was added to the cache. After using the sample (x, ŷ) for training, slimIPL adds it back into the cache with probability 1 − p_out, leaving its corresponding PL unchanged. We found, however, that updating PLs with the current model state, ŷ = PL(A(x; θ_k′)), improves the final WER.
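The cache mechanics just described can be simulated with a toy sketch contrasting the original put-back ('old': PL left unchanged) with the refresh we found helpful ('new': PL regenerated with the current model state). All names are ours; `pl_fn(x, step)` is a stand-in for PL generation with the model at a given step.

```python
import random

def cache_step(cache, unlabeled, pl_fn, step, p_out, strategy="new"):
    """One slimIPL-style cache interaction (toy sketch, not the paper's code).

    cache: list of (x, pl, created_at) tuples; pl_fn(x, step) mimics PL
    generation with the model state at `step`. Returns the pair trained on.
    """
    i = random.randrange(len(cache))
    x, pl, created = cache.pop(i)          # draw a sample and its cached PL
    if random.random() < p_out:
        x_new = random.choice(unlabeled)   # evict: admit a fresh sample
        cache.append((x_new, pl_fn(x_new, step), step))
    elif strategy == "old":
        cache.append((x, pl, created))     # put back, PL unchanged
    else:  # "new": put back with a PL from the current model state
        cache.append((x, pl_fn(x, step), step))
    return x, pl
```

Note that in both strategies the PL used for the gradient step is the cached (older) one; the strategies differ only in what goes back into the cache.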
See Table 3, which compares the original slimIPL strategy ('old') with the one where the PLs are updated when a sample has been selected from the cache ('new'). For that reason, in the following experiments we will be using ŷ = PL(A(x; θ_k′)) as the PL strategy when putting a sample back into the cache.

Controlling cache contents dynamically can improve WER  When the cache is updated less often (p_out < 1), we see in Table 2 that one may improve the WER, but then PT is essential to avoid any divergence (see also Appendix C). The above observations suggest that by dynamically controlling how the cache evolves, we can improve results in limited data settings. One possible way of doing this is a strategy that depends on the rate of evolution of PLs in the cache. In the next section we present such a method.

4. METHOD

4.1. CONTROLLING THE CACHE VIA PL EVOLUTION

Let us consider an example x ∈ U to be put into the cache at training step k, see Figure 1. Its PL is defined as ŷ = PL(A(x; θ_k)) = PL(x; k). At step k′ > k, this example (x, ŷ) is selected from the cache and the model is updated to θ_{k′+1} using the gradient of the loss. Unlike slimIPL, the probability of removing the example from the cache is no longer constant. Instead, p_out is dynamically computed at step k′ for the sample x selected from the cache as follows:

p_out(x; k′) = f[ρ(PL(x; k), PL(x; k′))],     (3)

Note that while we explained the method using a single example x from the unlabeled set, in practice we operate the algorithm at the batch level: the statistics are computed over a full batch of examples, which are all put back in the cache or removed together.
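Eq. (3) with ρ instantiated as the token error rate can be sketched as follows; the edit-distance helper is inlined so the snippet is self-contained, and the default f: x → x is the choice that works best in our later experiments. Names are ours.

```python
def _edit(a, b):
    # Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def dynamic_p_out(pl_cached, pl_current, f=lambda x: x):
    # Eq. (3): p_out(x; k') = f[rho(PL(x; k), PL(x; k'))], rho = token error rate
    ter = _edit(pl_cached, pl_current) / max(1, len(pl_cached))
    return min(1.0, f(ter))
```

With f: x → x, an example whose PL has not changed gets p_out = 0 (it stays in the cache), while an example whose PL changed completely gets p_out = 1 (it is evicted), matching the curriculum described above.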

4.2. ALIGNMENT SAMPLING

As discussed in Section 3, training instability shows up as the acoustic model distribution A(x; θ_k) collapsing to a degenerate distribution, e.g. empty transcriptions. While a cache and/or an exponential moving average model can stabilize training, they do not resolve the issue entirely, especially in the low data regime with no pre-training, and the model often collapses to a degenerate solution. Even our proposed method above (see Section 4.1) is susceptible to this collapse on the 10h dataset. In order to overcome the collapse issue and still make use of unlabeled data as early as possible, we propose to sample targets from the token distribution for every frame (Likhomanenko et al., 2022). We believe that sampling PLs around the most probable hard labels is an effective stabilization technique which works by adding appropriate noise to the targets: it is a way to enforce a lower bound on the entropy of the label distribution, which mitigates the collapse issuefoot_3. As the model is trained with CTC, every per-frame predicted distribution p_θ^t(w|x), for w in the token set W and time frame t, is considered to be independent. Thus, for every audio frame, we sample a token label w_t ∼ p_θ^t(w|x). A temperature τ is introduced to smooth the distribution obtained from the model. After the frame-level labels are sampled, they are transformed into the transcription by deduplicating consecutive repetitions of the same output token and removing the left-over auxiliary blank tokensfoot_4.

Sampling temperature schedule  As τ → ∞, the distribution over tokens p_θ^t(w|x, τ) approaches the uniform one and the PL sequence of tokens becomes purely random. On the other hand, as τ → 0, the distribution approaches the argmax, which is equivalent to the hard labels in slimIPL. We find that τ > 1 performs poorly. With τ = 1, a model avoids divergence at the beginning of training but ends up with worse final performance than with hard PLs (τ = 0): this happens mostly because of the larger amount of noise introduced by sampling (the quality of PLs is observed to be worse). Lower temperatures, e.g. τ = 0.1, give results indistinguishable from hard PLs (τ = 0). These observations suggest that decreasing the temperature as training proceeds can stabilize training at the beginning while benefiting from less noisy PLs in the end. We found that a simple linear schedule for τ from 1 to 0.1 works well. A summary of our proposed methods on top of slimIPL is given in Algorithm 1.
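The three ingredients of alignment sampling can be sketched as below: per-frame sampling from a temperature-scaled softmax, the CTC collapsing rule (deduplicate consecutive repeats, then drop blanks), and the linear τ schedule. This is a hedged sketch with our own names, using '#' as the blank token.

```python
import math, random

BLANK = "#"

def sample_frame(logits, tau, vocab):
    # sample one token from softmax(logits / tau); tau -> 0 recovers the argmax
    if tau <= 0:
        return vocab[max(range(len(logits)), key=lambda i: logits[i])]
    scaled = [l / tau for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    r = random.random() * sum(probs)
    acc = 0.0
    for tok, p in zip(vocab, probs):
        acc += p
        if r <= acc:
            return tok
    return vocab[-1]

def collapse(frames):
    # CTC rule: deduplicate consecutive repeats, then remove blank tokens
    out = []
    for tok in frames:
        if not out or tok != out[-1]:
            out.append(tok)
    return "".join(t for t in out if t != BLANK)

def tau_schedule(k, K, tau_max=1.0, tau_min=0.1):
    # linear decay of the sampling temperature from tau_max down to tau_min
    return max(tau_min, tau_max * (1 - k / K))
```

For instance, the sampled alignment 'cc###aatttt#' collapses to the transcription 'cat', as in the footnote example.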

5.1. DYNAMIC SELECTION FOR PSEUDO-LABELED SAMPLES

In Table 4 we show results from using only the method introduced in Section 4.1. We experiment with the token error rate (TER) distance computed between PLs over an entire batch, and the two functions f discussed above. For both the 100h and the 10h supervised data settings, the proposed dynamic selection decreases WER over the baseline with constant p_out. This behavior also holds when we switch from the dynamic strategy of Eq. (3) to a constant p_out = 1 after 130K steps of training. For the 10h labeled data setting, the improvement over the baseline is larger and reaches around 1% absolute. The function f: x → 1 − x performs worse than f: x → x, and hence we use the latter for subsequent experiments. Our analysis of the dynamic probabilities p_out from Table 4 shows: (i) TER[PL(x; k), PL(x; k′)] is close to 100% at the beginning of training (the model changes very fast) and quickly decreases (less than 10% after 30k steps); (ii) over training, different batches get different values of p_out, see Figure 2a; (iii) the proposed distance correlates with the oracle WER computed between PLs and ground-truth labels for x ∈ U, see Figure 2b. The latter demonstrates that our choice of dynamic selection encapsulates knowledge about the actual PL quality.

Table 4: WER on dev-clean and dev-other for different cache selection methods (p). We use either p_out = p, or a strategy where p_out = p for the first 130K steps, switching to p_out = 1 afterwards, as motivated in Section 3. Alignment sampling from Section 4.2 is not used.
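The 'dynamic for the first 130K steps, then constant 1' strategy evaluated above can be written as a tiny helper (a sketch with our own names; the switch point and the choice f: x → x follow the setting of Table 4):

```python
def p_out_at(step, ter, switch_step=130_000, f=lambda x: x):
    # dynamic p_out from PL evolution early in training, constant 1 afterwards
    return min(1.0, f(ter)) if step < switch_step else 1.0
```

After the switch, every cached batch is evicted as soon as it is used, recovering the vanilla slimIPL behavior with p_out = 1.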

5.2. ALIGNMENT SAMPLING

In Table 5 we compare results for models trained with hard PLs (τ = 0), models trained with alignment sampling and a constant τ > 0, and models trained with a linear schedule of τ from 1 to 0.1 (1 → 0.1), as described in Section 4.2. In this section we do not use the dynamic control of the cache introduced in Section 4.1. We highlight some observations. Firstly, alignment sampling with high τ reduces the number of diverged models (either τ = 1 or τ = 1 → 0.1). Secondly, a constant temperature throughout training does not provide the best results: τ = 0.1 is similar to the baseline, while τ = 1 is even worse; the difference is more pronounced for 10h of supervision with p_out = 0.1 → 1. Besides WER, we also report TER to highlight that sampling with τ = 1 leads to a notable TER degradation. However, the scheduled τ = 1 → 0.1 provides both stable training (no divergence is observed in our experiments) and similar or significantly better TER/WER (1.3%-2.7%) than the baseline. The best results are obtained with p_out = 0.1 → 1, showing the compatibility of sampling and the dynamic probability.

Table 5: TER and WER on dev-other for sampling PLs with different temperatures τ, including a linear schedule of τ, in the case of a constant p_out (left parts) or an alternated one (right parts), see Section 3. 'DV' denotes the number of diverged models over 3 runs with random seeds. PL evolution via the dynamic cache probability from Section 4.1 is not used.

5.3. COMBINING METHODS FOR BEST RESULTS

In this section we highlight the results that can be achieved by combining all the methods reported in Sections 4.1 and 4.2. In Table 6 we give a detailed comparison for both 10h and 100h of supervision. As we now have a training pipeline that is stable from the start (no PT), we also vary the ratio λ (see Eq. (1)), searching in the range [1, 5]. This raises the risk of training instability, but a larger proportion of unlabeled data may improve the model, according to Likhomanenko et al. (2021a). For 10 hours of supervised data, the models benefit substantially from the higher λ and become competitive with models trained with a PT phase, as well as with prior works (Baevski et al., 2020; Likhomanenko et al., 2021a). Note that combining sampling with the dynamic p_out based on PL evolution is necessary for stable training with λ > 1. To have a proper comparison with the aforementioned prior works, we increase the batch size and use dynamic batching for the best configuration. First, we confirm that both sampling and dynamically controlling the cache give stable training (see e.g. Appendix C, Table 13). Second, in Table 7foot_5, for the 10h/100h setup (λ = 5 / λ = 3), our models achieve similar or better results with no PT compared to the PT-based models (which are reproductions of slimIPL using the same settings that we use for our method), while matching the prior works. To ensure our methods are general enough, we probe the final configuration (found for LibriSpeech) on Common Voice French data. We use exactly the same models with sinusoidal positional embedding and the same hyper-parameters; the only thing we tune is the slimIPL parameter M. Results in Table 8 show that our methods work out of the box: without PT we are able to match the slimIPL baseline for 100h of supervision, while we improve upon slimIPL in the low-supervision setting of 10h, with an average relative WER reduction of 18%.
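In Algorithm 1 the balance between labeled and pseudo-labeled batches is set by drawing a labeled batch with probability N_L/(N_L + N_U). The toy sketch below (our own naming, not the authors' code) realizes a target unlabeled:labeled proportion λ in expectation, illustrating how a larger λ increases the share of pseudo-labeled updates:

```python
import random

def draw_source(lam, rng=random):
    """Pick the batch source for one training step (toy sketch).

    lam is the unlabeled:labeled proportion, i.e. N_U/N_L in Algorithm 1;
    drawing 'unlabeled' with probability lam/(1 + lam) yields, in expectation,
    lam pseudo-labeled batches per labeled batch.
    """
    return "unlabeled" if rng.random() < lam / (1 + lam) else "labeled"
```

With λ = 5, roughly five out of every six updates are taken on cached pseudo-labeled batches, which is why stability techniques become essential at high λ.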
Fragment of Tables 7 and 8 (fully supervised baselines; mean±std): lower bound, fully supervised 960h, CAPE: 2.6±0.1, 6.9±0.1, 2.7±0.1, 6.9±0.1; fully supervised 540h (Common Voice): 10.9±0.4, 12.3±0.3.

6. CONCLUSION

In this paper we show that we can perform continuous pseudo-labeling from the very start of training and obtain improved results in low-supervision settings. We achieve this by using alignment sampling and a dynamic cache selection strategy based on the evolution of the pseudo-labels during training. Being able to perform pseudo-labeling from the very start further simplifies training, avoiding complicated multi-step pipelines in favor of a simpler one. Our work also provides avenues for exploration into curriculum strategies for pseudo-labeling, and we hope to build upon the ideas and results presented in this paper. In the future we wish to explore the effectiveness of these methods in other ASR settings, such as sequence-to-sequence/transducer modelsfoot_6, out-of-domain unsupervised data, and neural models not based on transformers.

7. REPRODUCIBILITY STATEMENT

We report the detailed settings of our experiments, which are based on the previously open-sourced recipes of Likhomanenko et al. (2021a), throughout the paper and in Appendices A.2 and B. We aim to open-source the code of our method and experiments soon.

8. ETHICS

For this paper we used publicly available datasets. Our goal is to build models that work in low-supervision settings, and we hope this is a positive contribution towards under-represented data sources for ASR. While one can imagine ASR being used for negative purposes, it is our hope that the advantages generated by improving ASR for low-resource settings outweigh its possible negative uses.

The rest of the training remains the same. To reproduce wav2vec 2.0 (Baevski et al., 2020), we take the open-sourced Large model pre-trained on the full LibriSpeech and then perform fine-tuning on our 10h set and the 10h set from Libri-Light. For fine-tuning we use the open-sourced configurations for 10h. We fine-tune models on 24 GPUs, as specified in Baevski et al. (2020), for 3 different seeds.

C ABLATIONS: SAMPLING FOR LARGER BATCHES



Footnotes:
foot_0: The Libri-Light 10h subset contains only speakers drawn from the whole of LibriSpeech (from both clean and noisy subsets). To keep our experiments consistent, and also to assess domain transfer to the unlabeled noisy subsets, we reconstructed the 10h set from train-clean-100, sampling randomly from the speakers and retaining the original speakers of this subset.
foot_2: 26 letters augmented with the apostrophe and a word boundary token.
foot_3: With no regularization (cache and/or alignment sampling), the PL procedure often collapses to generating just blanks very quickly (Likhomanenko et al., 2021a): it is biased, has 100% WER, but has no variance. Alignment sampling avoids this by generating noisy targets that have variance.
foot_4: E.g. the alignment 'cc###aatttt#' will be transformed into 'cat', where # is a CTC blank token.
foot_5: As we use a different 10h split in this work, we also report results for the 10h set with 24 speakers from Libri-Light used in prior works. We found that training with no PT is more prone to instability for this set, while our method is able to stabilize it and obtain performance comparable with its baseline counterpart, which lags behind the prior works.
foot_6: The proposed dynamic control of the cache does not rely on anything specific to CTC. Alignment sampling should transfer to transducers directly, while for sequence-to-sequence models we would sample the transcription directly from the model.




To avoid the significant memory footprint of an EMA model, Likhomanenko et al. (2021a) introduced slimIPL, which uses a dynamic cache instead of the EMA to stabilize training. The cache maintains a set of unlabeled samples U_C (with fixed size |U_C| = C) and their associated PLs, generated by previous model states. After the pre-training phase, slimIPL minimizes the loss in Eq. (1) using the PLs stored in the cache.

Figure 1: Comparison between slimIPL (left) and our control of the cache using PL evolution (right). The constant p_out from slimIPL is now dynamic, computed based on the PL evolution.

where ρ is the Levenshtein edit-distance, and f is the function that encapsulates how the evolution of PLs should determine the rate at which examples are removed from the cache. With different choices of f we can consider different ways of actively controlling the cache (and hence the model training) using the evolution of the PLs. We consider the simple functions f: x → x and f: x → 1 − x. The first encourages the cache to maintain examples whose PLs are stable, which might lead to slower learning. The second maintains examples whose PLs are changing fast, which might lead to faster learning but less stable behavior.

Table 4 (body; WER mean±std over 3 seeds; column pairs are dev-clean / dev-other for 10h with p, 10h with p → 1, 100h with p, and 100h with p → 1; the first row label and its leading value were lost in extraction):
constant p:               —±0.6 / 25.4±0.4;  13.7±0.8 / 20.7±0.8;  4.5±0.1 / 10.6±0.3;  4.8±0.1 / 11.3±0.1
TER[PL(k), PL(k′)]:       14.7±0.5 / 24.6±0.3;  13.2±1.6 / 19.1±1.6;  4.6±0.1 / 10.5±0.2;  4.4±0.1 / 10.1±0.2
1 − TER[PL(k), PL(k′)]:   16.0±0.4 / 26.5±0.8;  17.8±1.2 / 30.4±2.3;  4.4±0.1 / 11.1±0.5;  4.5±0.0 / 10.5±0.5

(a) p_out = WER[PL(x; k), PL(x; k′)] per batch along the training. (b) Correlation between WER[PL(x; k), golden] and WER[PL(x; k), PL(x; k′)].

Figure 2: Analysis of our curriculum PL selection criteria. WER is given on a (0, 1) scale.

(a) Evolution of p_out for the different curriculum selection strategies. (b) Comparison between models trained with different p_out: constant 1 (blue) or 0.1 (orange), or scheduled 0.1 → 1 (green).

Figure 3: Analysis of the probability p out .

Table 2: Continuous PL with and without the pre-training (PT) phase for slimIPL. 'DV' stands for divergence.

Updating the cache less often may improve the WER (Table 2), but then PT is essential to avoid any divergence. In Likhomanenko et al. (2021a), the authors of slimIPL reported robustness (in terms of test WER) with respect to p_out. However, our experiments reported in Table 3 and Figure 3b in Appendix C reveal different learning dynamics for different values of p_out: our ablations with specific schedules on the probability p_out suggest that models without a PT phase benefit more from a low p_out at the beginning of training, which makes training easier initially by letting the model focus on the same examples. Later in training, the procedure might benefit from a high p_out, as seeing a wider range of examples may lead to more stability. While we observe significant changes in dynamics with 10h of supervision, with the larger labeled set (100h) the different strategies do not make such a large difference.



Table 6: Combination of our methods (Sections 4.1 and 4.2) for hard labels (left part) and for sampling (right part) with a linear schedule on the temperature. 'DV' denotes model divergence; 'old' denotes the use of PL(x; k), while 'new' denotes the use of PL(x; k′). We compare different p_out (all using 'new'): scheduled p_out = 0.1 → 1 (switching at 130K steps), ρ = TER, and scheduled ρ = TER → 1 (switching at 130K steps). The WER on dev-other is reported. All results are reported across 3 runs with different seeds.

Table 7: Comparison of our best models with prior works for 10h and 100h of supervision. Results are reported across 3 random seeds. For wav2vec 2.0 and slimIPL we report the prior-work results and our reproductions following the official open-sourced recipes. 'Posemb' denotes the type of positional embedding used. The 10h set from Libri-Light is marked with '*'.

Table 8: Comparison of fully supervised models, slimIPL, and our methods on Common Voice French. Results are reported across 6 random seeds. Sinusoidal positional embedding is used for all models.

Detailed hyper-parameters for the final experiments on LibriSpeech from Table 7, for the 10h* setting.



ACKNOWLEDGMENTS

We would like to thank Richard He Bai, Jagrit Digani, David Grangier, Loren Lugosch, Yizhe Zhang, and machine learning research teammates for helpful discussions and support throughout the work.

A DETAILS ON EXPERIMENTAL SETUP A.1 SPEAKERS IN LIBRISPEECH

There is no intersection between speakers in the different LibriSpeech train sets, nor in the validation/test sets: all speakers are unique and present in only one of the LibriSpeech sets. To prepare the 10h set, we randomly sampled audio per speaker to gather a total of 10h of audio.
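A hedged sketch of assembling such a fixed-duration subset by randomly drawing utterances across speakers (the data structures and function names are ours, not the authors' script):

```python
import random

def sample_subset(utterances, target_hours=10.0, seed=0):
    """Randomly draw utterances until roughly target_hours is collected.

    utterances: list of (speaker_id, utt_id, duration_seconds) tuples.
    A toy sketch of the per-speaker random sampling described above.
    """
    rng = random.Random(seed)
    pool = list(utterances)
    rng.shuffle(pool)                      # random order across all speakers
    budget = target_hours * 3600
    subset, total = [], 0.0
    for spk, utt, dur in pool:
        if total + dur <= budget:          # greedily fill the time budget
            subset.append((spk, utt, dur))
            total += dur
    return subset, total / 3600
```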

A.2 ACOUSTIC MODEL TRAINING

We keep the original 16kHz sampling rate and compute log-mel filterbanks with 80 coefficients for a 25ms sliding window, strided by 10ms, which are normalized to zero mean and unit variance per input sequence before being fed into the model.

Throughout the paper we consider transformer-based models with a convolutional frontend to perform the proper striding. The encoder is composed of a 1-D convolution with kernel size 7 and stride 3, followed by 36 transformer blocks (Vaswani et al., 2017). The self-attention dimension is 768 and the feed-forward network (FFN) dimension is 3072 (with 4 heads) in each transformer block. The output of the encoder is followed by a linear layer to the output classes. We use dropout after the convolution, dropout on the self-attention and on the FFN for all transformer layers, and layer drop (Fan et al., 2020), dropping entire layers at the FFN level. We get rid of relative positional embedding (Shaw et al., 2018) and use either the sinusoidal one (Vaswani et al., 2017) or the recently proposed CAPE embedding (Likhomanenko et al., 2021b) (only a global shift of 30s is used): this speeds up training by 2-3x and decreases memory usage.

For SpecAugment (Park et al., 2019) we follow the parameters from Likhomanenko et al. (2021a): two frequency masks with frequency mask parameter F = 30, ten time masks with maximum time-mask ratio p = 0.1 and time mask parameter T = 50; time warping is not used.

All models are trained with the CTC loss and the Adagrad optimizer, with a linear warmup period of 64k steps, a constant learning rate of 0.03, and step-wise (by 2) learning rate decay at the end of training. All models are trained on tf32 tensor cores of 8 Ampere A100 40GB GPUs for a maximum of 500k updates.

For the slimIPL parameters, we always use a cache size of 1k. Throughout the paper we vary the proportion λ (by default λ = 1 if not stated otherwise) as well as p_out. From our experiments we observe that it is important to activate SpecAugment later in training (e.g. after 5k training steps); otherwise the slimIPL baseline is even more prone to divergence.
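The SpecAugment settings above can be sketched as follows; a hedged, dependency-free version (our naming) that zeroes masked bins on a `spec[time][freq]` list-of-lists and omits time warping, as in the paper.

```python
import random

def spec_augment(spec, n_freq_masks=2, F=30, n_time_masks=10, T=50,
                 max_time_ratio=0.1, rng=random):
    """SpecAugment-style masking (sketch of the parameters described above).

    spec: spec[t][f] log-mel values. Masked bins are set to 0.0; the input
    is left untouched and a masked copy is returned.
    """
    n_t, n_f = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    for _ in range(n_freq_masks):
        f = rng.randrange(0, F + 1)                    # mask width in [0, F]
        f0 = rng.randrange(0, max(1, n_f - f + 1))     # mask start bin
        for t in range(n_t):
            for k in range(f0, min(n_f, f0 + f)):
                out[t][k] = 0.0
    max_t = int(max_time_ratio * n_t)                  # cap time-mask length
    for _ in range(n_time_masks):
        t_len = rng.randrange(0, min(T, max_t) + 1)
        t0 = rng.randrange(0, max(1, n_t - t_len + 1))
        for t in range(t0, min(n_t, t0 + t_len)):
            out[t] = [0.0] * n_f
    return out
```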

A.3 COMMON VOICE EXPERIMENTS

We use the Common Voice data release from 21 July 2021 (cv-corpus-7.0-2021-07-21, https://github.com/common-voice/cv-dataset/blob/main/datasets/cv-corpus-7.0-2021-07-21.json) for the French language. In total, there are 543 hours in the train set, 25.1h in the validation set and 25.8h in the test set. We randomly sample speakers from the train set and take all audio belonging to the same speaker to form a 100h train subset; we end up with 982 speakers and 102h. We further sample speakers from this 100h subset to form a 10h subset: it contains 171 speakers with 11.5h. These 10h and 100h subsets are used as labeled data, while the remaining 443h are used as unlabeled data. We normalize transcriptions by lower-casing, removing any punctuation tokens except the apostrophe, changing all diacritical marks to their corresponding English characters, and removing any other non-English characters. We then use the same token set as for LibriSpeech.

We use the same acoustic model as for the LibriSpeech experiments, with sinusoidal positional embedding, as all audios in Common Voice are very short (5.2s±1.5s). For fully supervised models we use dropout 0.5, 0.3 and 0.1 for the 10h, 100h and 540h sets, respectively. For slimIPL we change the dropout and layer drop from 0.5 to 0.1 for 10h and from 0.3 to 0.1 for 100h, while for our methods we use dropout and layer drop of 0.1 from the beginning of training. For slimIPL we tune only the parameter M for the 10h setting. The rest of the parameters are the same as in the original slimIPL work (Likhomanenko et al., 2021a): C is 1000 (100), the cache probability p_out is 0.1, the data proportion λ is 10 (3), and M is 40k (20k) for the 10h (100h) setting. All models are trained with dynamic batching, the same as for LibriSpeech. For our methods we use exactly the same parameters as for the LibriSpeech experiments with dynamic batching.
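The transcription normalization described above can be sketched with the standard library; a hedged version (our naming) that lower-cases, maps diacritics to their base English characters via Unicode NFD decomposition, and replaces any other disallowed character with a space before collapsing whitespace.

```python
import unicodedata

ALLOWED = set("abcdefghijklmnopqrstuvwxyz' ")

def normalize_transcription(text):
    # lower-case, strip diacritics to base English letters, and keep only
    # letters, the apostrophe and spaces (a sketch of the rules above)
    text = text.lower()
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = "".join(c if c in ALLOWED else " " for c in text)
    return " ".join(text.split())
```

Note the paper's exact handling of removed characters (deletion vs. replacement by a space) is not specified; this sketch replaces them with spaces and collapses runs of whitespace.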

B WAV2VEC AND SLIMIPL REPRODUCTION

To reproduce the slimIPL baselines in Table 7, we follow Likhomanenko et al. (2021a) and its published recipe. The only changes we make are the positional embedding, as discussed above, and the batch size.

