ON THE ROBUSTNESS OF SELF-SUPERVISED REPRESENTATIONS FOR SPOKEN LANGUAGE MODELING

Abstract

Self-supervised representations have been extensively studied for discriminative and generative tasks. However, their robustness has received far less attention. This work focuses on self-supervised representations for spoken generative language models. First, we empirically demonstrate that current state-of-the-art speech representation models lack robustness to basic signal variations that do not alter the spoken information. To overcome this, we propose an effective and efficient method for learning robust self-supervised speech representations for generative spoken language modeling. The proposed approach applies a set of signal transformations to the speech signal and optimizes the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines on encoding metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-to-English and French-to-English translation, and empirically demonstrate the benefits of the proposed approach.

1. INTRODUCTION

Self-supervised speech models were shown to learn effective representations for various downstream tasks (Hsu et al., 2021; Chen et al., 2022; Baevski et al., 2020). These models were mainly evaluated on discriminative tasks, such as automatic speech recognition, speaker verification, and intent classification (Yang et al., 2021). Recently, Lakhotia et al. (2021) demonstrated that such self-supervised learning (SSL) representations can be used for Generative Spoken Language Modeling (GSLM), the task of learning the acoustic and linguistic characteristics of a language from raw audio. Under this framework, a discrete representation of the audio signal is first learned, a speech-language model is then trained on top of the obtained representation, and finally a neural vocoder converts the output tokens back to raw audio.

As the discrete speech representation often operates over tokens extracted every twenty milliseconds of audio, sequences can be long and contain repetitions, e.g., 10 11 11 11 21 32 32 32 21. Preliminary studies found that removing sequential repetitions of units improves performance, and this step is therefore applied universally (Lakhotia et al., 2021). For example, the pseudo-text 10 11 11 11 21 32 32 32 21 becomes 10 11 21 32 21; a minimal implementation of this step is sketched at the end of this section.

This framework was shown to be effective in modeling multiple levels of the speech utterance, namely prosody and content (Lakhotia et al., 2021; Kharitonov et al., 2021a; Borsos et al., 2022), as well as in speech coding (Polyak et al., 2021), speech emotion conversion (Kreuk et al., 2021), spoken dialogue (Nguyen et al., 2022), and speech-to-speech translation (Lee et al., 2021a; Popuri et al., 2022; Lee et al., 2021b).

An essential prerequisite for such an audio representation to be used in real-world conditions is robustness to various signal corruptions. Although the aforementioned audio representation models have shown effectiveness on many tasks, they were mainly evaluated on academic benchmarks. In this work, we evaluate current state-of-the-art self-supervised speech representation models on what are arguably the most basic signal variations, namely time-stretch, pitch-shift, additive noise, and reverberation (an illustrative implementation of these transformations is also sketched below). Our premise is that while these variations modify the signal, its underlying content remains the same, especially after the token repetition-removal process. Therefore, a robust representation should be affected by such variations only to a minimal extent. As a first step, we propose a set of metrics for evaluating a model's robustness. We then point to the lack of robustness of current models with respect to the aforementioned variations. Next, we design a simple and effective method for learning a robust discrete representation on top of any speech SSL model, and demonstrate how it greatly improves robustness. Finally, we empirically show that performance improves on several tasks for various SSL models. Specifically, we evaluate the newly proposed speech encoders on established encoding metrics, i.e., ABX, WUGGY, and BLIMP (Nguyen et al., 2020), together with a high-level downstream task in the form of speech-to-speech translation.
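As a concrete illustration of the repetition-removal step, the following minimal Python sketch (the function name and setup are ours, for illustration only) collapses consecutive duplicate units:

```python
from itertools import groupby

def remove_repetitions(units):
    """Collapse consecutive duplicate units, e.g.,
    [10, 11, 11, 11, 21, 32, 32, 32, 21] -> [10, 11, 21, 32, 21]."""
    return [unit for unit, _ in groupby(units)]

assert remove_repetitions([10, 11, 11, 11, 21, 32, 32, 32, 21]) == [10, 11, 21, 32, 21]
```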
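Similarly, the signal variations we evaluate can be reproduced with standard audio tooling. The sketch below is one possible implementation, assuming torchaudio built with SoX support; the specific parameter values (tempo factor, pitch shift in cents, reverberance, SNR) are illustrative, not the exact settings used in our experiments.

```python
import torch
import torchaudio

def time_stretch(wav, sr, factor=1.1):
    # SoX "tempo" changes speed without altering pitch.
    out, _ = torchaudio.sox_effects.apply_effects_tensor(
        wav, sr, [["tempo", str(factor)]])
    return out

def pitch_shift(wav, sr, cents=300):
    # SoX "pitch" shifts pitch (in cents) without altering duration.
    out, _ = torchaudio.sox_effects.apply_effects_tensor(
        wav, sr, [["pitch", str(cents)]])
    return out

def reverberate(wav, sr, reverberance=50):
    # SoX "reverb" simulates room reverberation.
    out, _ = torchaudio.sox_effects.apply_effects_tensor(
        wav, sr, [["reverb", str(reverberance)]])
    return out

def add_noise(wav, snr_db=10.0):
    # Additive white noise scaled to a target signal-to-noise ratio.
    noise = torch.randn_like(wav)
    signal_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wav + scale * noise
```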

2. BACKGROUND

The general Generative Spoken Language Modeling (GSLM) pipeline is comprised of three main modules: (i) Speech-to-unit, (ii) Unit language model, and (iii) Unit-to-speech, where each of these modules is trained separately. Speech resynthesis can be achieved by skipping the language model and directly feeding the quantized units into the unit-to-speech module (Polyak et al., 2021) (see Figure 1 for a visual description of the overall pipeline). In the following paragraphs, we give detailed background for each of the three components, including the standard evaluation methods.

Speech-to-unit module encodes the raw speech signal into a discrete representation. The common approach is to first encode the speech into a continuous representation and then quantize it to obtain a sequence of discrete units (Lakhotia et al., 2021; Polyak et al., 2021; Popuri et al., 2022; Lee et al., 2021a; Kharitonov et al., 2021a; Kreuk et al., 2021; Kharitonov et al., 2022; Nguyen et al., 2022; Borsos et al., 2022; Tjandra et al., 2019; 2020). Formally, denote the domain of audio samples by $\mathcal{X} \subset \mathbb{R}$. The representation of a raw signal is therefore a sequence of samples $x = (x_1, \ldots, x_T)$, where $x_t \in \mathcal{X}$ for all $1 \le t \le T$. Consider an encoder network, $f$, that takes as input the speech utterance and outputs a sequence of spectral representations sampled at a low frequency, $f(x) = (v_1, \ldots, v_{T'})$. Note that we do not assume anything about the structure of the encoder network $f$. Lakhotia et al. (2021) evaluated several speech encoders, namely Mel-spectrogram, Contrastive Predictive Coding (CPC) (Oord et al., 2018), wav2vec2 (Baevski et al., 2020), and HuBERT (Hsu et al., 2021). Since the representations learned by such models are usually continuous, a k-means algorithm is applied over the models' outputs to generate discrete units, denoted as $z = (z_1, \ldots, z_{T'})$. Each element $z_i$ in $z$ is a positive integer, $z_i \in \{1, \ldots, K\}$ for $1 \le i \le T'$, where $K$ is the number of discrete units. We denote the quantization model by $E$. A minimal sketch of this module is given below.

Unit language model is trained on the extracted discrete units, $z$. Such a language model learns a probability distribution over the learned unit sequences, which enables direct modeling of speech data without textual supervision.
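To make the speech-to-unit module concrete, the following sketch shows one possible instantiation using a pre-trained HuBERT model from torchaudio and a k-means quantizer. The choice of layer, $K = 100$ clusters, and the use of scikit-learn are illustrative assumptions, not the exact configuration discussed above.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

# f: a pre-trained speech encoder (here, HuBERT base).
bundle = torchaudio.pipelines.HUBERT_BASE
encoder = bundle.get_model().eval()

def encode(wav):
    """f(x) = (v_1, ..., v_{T'}): continuous frame-level features.

    `wav` is a (1, num_samples) tensor at the bundle's sample rate.
    """
    with torch.inference_mode():
        features, _ = encoder.extract_features(wav)
    # An intermediate transformer layer; the layer choice is a hyper-parameter.
    return features[6].squeeze(0)

# E: a k-means quantizer; it must be fitted on encoder features
# from a training corpus before use.
kmeans = KMeans(n_clusters=100)  # K = 100 discrete units (illustrative)
# kmeans.fit(training_features.numpy())

def speech_to_units(wav):
    """z = (z_1, ..., z_{T'}), with z_i in {1, ..., K}."""
    feats = encode(wav)
    return kmeans.predict(feats.numpy()) + 1  # shift to 1-based unit ids
```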
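A unit language model can then be trained on the deduplicated unit sequences exactly like a text language model over a vocabulary of $K$ tokens. Below is a deliberately small sketch (an LSTM rather than the large transformer LMs typically used in GSLM, purely to keep the example short):

```python
import torch
import torch.nn as nn

K = 100  # number of discrete units (vocabulary size)

class UnitLM(nn.Module):
    """Autoregressive model of p(z_t | z_1, ..., z_{t-1})."""
    def __init__(self, num_units=K + 1, dim=256):  # +1 reserves id 0 for padding
        super().__init__()
        self.embed = nn.Embedding(num_units, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, num_units)

    def forward(self, units):  # units: (batch, seq_len) integer tensor
        hidden, _ = self.lstm(self.embed(units))
        return self.proj(hidden)  # next-unit logits

model = UnitLM()
z = torch.randint(1, K + 1, (8, 50))  # a batch of (already deduplicated) unit sequences
logits = model(z[:, :-1])             # predict each next unit from its prefix
loss = nn.functional.cross_entropy(
    logits.reshape(-1, K + 1), z[:, 1:].reshape(-1))
loss.backward()
```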



Figure 1: Generative Spoken Language Modeling is composed of three components: (i) Speech-to-unit, (ii) Unit language model, and (iii) Unit-to-speech. Pre-trained ASR and language models are used to evaluate those components.

