RESOURCE EFFICIENT SELF-SUPERVISED LEARNING FOR SPEECH RECOGNITION

Abstract

Representation learning from sequential data using self-supervised learning (SSL) has proven to be a powerful technique, improving state-of-the-art (SOTA) results when fine-tuned for various downstream tasks, including Automatic Speech Recognition (ASR). To date, the success of SSL frameworks for speech-to-text modeling, e.g., Wav2Vec-2.0, has relied primarily on masking intermediate features and then solving a contrastive task in an end-to-end manner. Although very successful, the overall training time (e.g., days or weeks) and demanding resource requirements for achieving SOTA performance remain a significant barrier to further improving ASR solutions with such approaches. In this work we show that non-contrastive learning, such as an extension of the Barlow Twins methodology, improves convergence when applied to speech-to-text SSL modeling while reducing training time. Our results show that pre-training the Wav2Vec-2.0 architecture with a non-contrastive SSL approach reduces GPU training hours by a factor of 2.23 compared to masking-based SSL approaches, while achieving a significant improvement (up to a 6% relative WER decrease) in model performance on the ASR task. We further demonstrate that combining masking-based contrastive and non-contrastive SSL improves ASR performance further, e.g., up to a 12% relative WER decrease, on all splits of the LibriSpeech evaluation dataset.

1. INTRODUCTION

Modern industry-scale speech recognition systems often require tens of thousands of hours of high-quality labeled speech data to achieve acceptable deployment performance (Baevski et al., 2020; Ramos et al., 2022). However, large-scale data collection remains an extremely time-consuming and costly procedure and does not scale as the number of languages to support increases. Furthermore, for the majority of the world's spoken languages, a large and high-quality training dataset is often unavailable (Babu et al., 2022). Thus, effective learning from primarily unlabeled data has been an important and long-standing research topic within the machine learning community, where the main emphasis is on learning good representations from unlabeled data and then fine-tuning with a limited amount of task-dependent labeled data. Recent progress in self-supervised learning (SSL) has been highly successful in utilizing unlabeled data, demonstrating superior performance in the domains of computer vision (CV) (Chen et al., 2020; He et al., 2020; Chen & He, 2021), natural language processing (NLP) (Devlin et al., 2019; Lewis et al., 2019), and speech recognition (SR) (Liu et al., 2020a; Chung et al., 2021; Baevski et al., 2022; Schneider et al., 2019; Baevski et al., 2020). In particular, SSL-based approaches exploit the abundance of unlabeled data to learn underlying representations, using both contrastive and non-contrastive objectives (Jaiswal et al., 2020; Balestriero & LeCun, 2022). In the ASR domain especially, masking-based contrastive methods have emerged as the leading SSL approach, yielding the current state-of-the-art solutions, e.g., Wav2Vec-2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021).
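As background, the masking-based contrastive objective used by Wav2Vec-2.0 (Baevski et al., 2020) can be sketched as follows. For a context representation $c_t$ at a masked time step, the model must identify the true quantized latent $q_t$ among a candidate set $Q_t$ containing $K$ distractors:

```latex
\mathcal{L}_{\mathrm{m}} = -\log
\frac{\exp\!\big(\mathrm{sim}(c_t, q_t)/\kappa\big)}
     {\sum_{\tilde{q} \in Q_t} \exp\!\big(\mathrm{sim}(c_t, \tilde{q})/\kappa\big)}
```

where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity and $\kappa$ is a temperature. The cost of sampling distractors and solving this discrimination task at every masked position is part of what the non-contrastive alternative discussed below avoids.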
The success of these approaches is due mainly to the easy availability of large curated unlabeled open-source datasets (Kearns, 2014; Panayotov et al., 2015; Kahn et al., 2020; Wang et al., 2021; Ardila et al., 2020), the availability of industry-scale GPU infrastructure, and improvements in training pipelines and in the scaling (e.g., data, pipeline, and model parallelism) of deep learning frameworks. However, the overall training time for achieving SOTA performance remains a significant barrier to further improving ASR solutions using contrastive SSL. In this paper we present a technique for decreasing the training time of SSL-based ASR systems by using a non-contrastive SSL method, rather than a contrastive one, for speech representation learning.
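For context, the Barlow Twins objective (Zbontar et al., 2021) that the abstract's non-contrastive approach extends drives the cross-correlation matrix $\mathcal{C}$, computed over the batch dimension between the embeddings of two distorted views of the same input, toward the identity:

```latex
\mathcal{L}_{\mathrm{BT}} =
\sum_{i} \big(1 - \mathcal{C}_{ii}\big)^{2}
\;+\; \lambda \sum_{i} \sum_{j \neq i} \mathcal{C}_{ij}^{2}
```

Here $\lambda$ trades off the invariance term (diagonal) against the redundancy-reduction term (off-diagonal). Unlike the contrastive objective, this loss requires no negative samples, which is one source of the training-time savings reported above.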

