RESOURCE-EFFICIENT SELF-SUPERVISED LEARNING FOR SPEECH RECOGNITION

Abstract

Representation learning from sequential data using self-supervised learning (SSL) has proven to be a powerful technique, improving state-of-the-art (SOTA) results when fine-tuned for various downstream tasks, including Automatic Speech Recognition (ASR). So far, the success of SSL frameworks for speech-to-text modeling, e.g., Wav2Vec-2.0, has primarily come from masking intermediate features and then solving a contrastive task in an end-to-end manner. Although very successful, the overall training time (for example, days or weeks) and the demanding resource requirements for achieving SOTA performance remain a significant barrier to further improving ASR solutions with such approaches. In this work we show that non-contrastive learning, such as an extension of the Barlow-Twins methodology, improves convergence when applied to speech-to-text SSL modeling, while reducing training time. Our results show that pre-training the Wav2Vec-2.0 architecture with a non-contrastive SSL approach reduces GPU training hours by 2.23x compared to masking-based SSL approaches, while achieving a significant improvement (i.e., up to 6% relative WER decrease) in model performance on the ASR task. We further demonstrate that a combination of both masking-based contrastive and non-contrastive SSL improves ASR performance, e.g., up to 12% relative WER decrease, on all splits of the LibriSpeech evaluation dataset.

1. INTRODUCTION

Modern industry-scale speech recognition systems often require tens of thousands of hours of high-quality labeled speech data to achieve acceptable deployment performance (Baevski et al., 2020; Ramos et al., 2022). However, large-scale data collection remains an extremely time-consuming and costly procedure and does not scale as the number of languages to support increases. Furthermore, for a majority of the vast number of spoken languages, a large and high-quality training dataset is often unavailable (Babu et al., 2022). Thus, effective learning from primarily unlabeled data has been an important and long-standing research topic within the machine learning community, where the main emphasis is on learning good representations from unlabeled data and then fine-tuning with a limited amount of task-dependent labeled data. Recent progress in self-supervised learning (SSL) has been highly successful in utilizing unlabeled data and has demonstrated superior performance in the domains of computer vision (CV) (Chen et al., 2020; He et al., 2020; Chen & He, 2021), natural language processing (NLP) (Devlin et al., 2019; Lewis et al., 2019), and speech recognition (SR) (Liu et al., 2020a; Chung et al., 2021; Baevski et al., 2022; Schneider et al., 2019; Baevski et al., 2020). In particular, SSL-based approaches exploit the abundance of unlabeled data to learn underlying representations, using both contrastive and non-contrastive approaches (Jaiswal et al., 2020; Balestriero & LeCun, 2022). Notably, in the domain of ASR, masking-based contrastive methods have emerged as the leading SSL approach, yielding the current state-of-the-art solutions, e.g., Wav2Vec-2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021).
The success of these approaches is mainly due to the easy availability of large curated unlabeled open-source datasets (Kearns, 2014; Panayotov et al., 2015; Kahn et al., 2020; Wang et al., 2021; Ardila et al., 2020), the availability of industry-scale GPU infrastructures, and improvements in data training pipelines and the scaling (e.g., data-, pipeline-, and model-parallelism) of deep learning frameworks. However, the overall training time for achieving SOTA performance remains a significant barrier to further improving ASR solutions using contrastive SSL. In general, non-contrastive SSL is heavily underrepresented in audio research, with a few notable exceptions (Liu et al., 2022). This is primarily because SSL methods are more established in the domain of CV than in audio applications and often require novel extensions. Towards bridging this gap, we consider Barlow-Twins (Zbontar et al., 2021) as a representative example of non-contrastive SSL and expand its scope from vision to audio with the following extensions: (i) we incorporate a number of new loss functions via purposefully designed time-merging and time-unrolling, and (ii) we apply static (hyper-parameter optimization) and dynamic (stop-gradient) methodologies to balance the different scales of the individual losses. We further explore the effect of sequential use of non-contrastive and contrastive training and observe improved performance, i.e., a decreased word error rate (WER) compared to solely non-contrastive or contrastive training, which is in line with recent work on SSL-based speech representation learning for speaker verification rather than ASR (Zhang & Yu, 2022).
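To make the two reshaping strategies concrete: time-merging collapses the time axis of a frame-level embedding sequence so each utterance contributes a single vector to the loss, while time-unrolling treats every frame as an independent sample, enlarging the effective batch. The sketch below is illustrative only; the function names are ours, and it assumes mean-pooling for time-merging, which may differ from the exact formulation used in the experiments.

```python
import numpy as np

def time_merge(z: np.ndarray) -> np.ndarray:
    # (B, T, D) -> (B, D): pool over the time axis so each utterance
    # yields one embedding for an utterance-level SSL loss
    return z.mean(axis=1)

def time_unroll(z: np.ndarray) -> np.ndarray:
    # (B, T, D) -> (B*T, D): treat every frame as an independent
    # sample for a frame-level SSL loss
    return z.reshape(-1, z.shape[-1])
```

Either view produces a 2-D matrix of embeddings, which is the input shape that image-domain non-contrastive losses such as Barlow-Twins expect.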
A summary of our main findings regarding the benefits of our non-contrastive SSL approach for speech representation learning is as follows: (i) non-contrastive SSL ASR yields a 2.23x training speed-up compared with contrastive SSL ASR in our experiments, while simultaneously improving relative WER by up to 6% (c.f., first and second rows of Tables 1 and 6); (ii) non-contrastive methods require fewer GPUs and smaller batch sizes than contrastive methods, reducing memory requirements (c.f., Table 3); and (iii) the lowest WER is achieved by the sequentially combined approach ending with non-contrastive training (c.f., third rows of Tables 1 and 6).

2. APPROACH

The most common SSL methods considered in speech are masking-based contrastive learning and autoregressive prediction-based learning. In this work, we explore the potential of a non-contrastive SSL method for learning speech representations and its effectiveness on the downstream ASR task.

2.1. MOTIVATION

Recent work in the area of non-contrastive SSL (e.g., BYOL (Grill et al., 2020), SimSiam (Chen & He, 2021), Barlow-Twins (Zbontar et al., 2021), DINO (Caron et al., 2021)) has shown remarkable capacity to learn powerful representations from only positive pairs, i.e., two augmented views of the same data point. Unlike contrastive SSL approaches, which use negative pairs to prevent representational collapse, non-contrastive SSL approaches employ a dual pair of Siamese networks to process two augmented views of a data point and minimize their representational differences. In general, contrastive SSL methods, e.g., SimCLR (Chen et al., 2020) and MoCo (He et al., 2020), require large batch sizes to achieve good performance. On the contrary, non-contrastive SSL approaches are comparatively more efficient and easier to train, with smaller batches and reduced memory. As shown in (Zbontar et al., 2021), a non-contrastive SSL method such as Barlow-Twins can learn effectively with up to a 16x smaller batch size.
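As background for what follows, the Barlow-Twins objective of Zbontar et al. (2021) drives the cross-correlation matrix between the batch-standardized embeddings of the two augmented views toward the identity: an invariance term pushes the diagonal toward 1, and a redundancy-reduction term, weighted by a small coefficient, pushes off-diagonal entries toward 0. Below is a minimal NumPy sketch of that loss; it follows the structure of the original paper but is an illustrative implementation, not the exact code used in our experiments.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambd=5e-3, eps=1e-9):
    # z_a, z_b: (N, D) embeddings of two augmented views of the same batch
    n, d = z_a.shape
    # standardize each embedding dimension over the batch
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    c = z_a.T @ z_b / n                                   # (D, D) cross-correlation
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)             # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)   # redundancy reduction
    return on_diag + lambd * off_diag
```

Because the loss depends only on a D-by-D correlation matrix rather than on pairwise sample similarities, it does not need negative pairs, which is what allows the smaller batch sizes noted above.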

2.2. METHOD

In this subsection, we present an overview of our approach for learning speech embeddings with a first-of-its-kind non-contrastive SSL method designed for time-series speech modeling (c.f., Figure 1(b)) and its comparison with a standard non-contrastive SSL method for non-time-series data such as images (see Figure 1(a)). Similar to all non-contrastive SSL methods used in vision, our approach for learning speech embeddings has a dual pair of Siamese networks, referred to as the online (O) and target (T) networks. Only the online network is trained via gradient descent; the target network employs a momentum encoder (He et al., 2020) that slowly follows the online network in a delayed fashion through an exponential moving average (EMA). The outputs of the online and target networks are then encouraged to learn good representations via a self-supervised loss function. However, there are two key differences in our approach for learning speech embeddings compared to image embeddings, which can be categorized as modeling and learning differences. These differences are summarized below. First, instead of performing augmentations in the input space (c.f., Figure 1(a)), our solution operates in a latent space (see Figure 1(b)). Specifically, we apply augmentation not directly on the

