ON THE ROBUSTNESS OF SELF-SUPERVISED REPRESENTATIONS FOR SPOKEN LANGUAGE MODELING

Abstract

Self-supervised representations have been extensively studied for discriminative and generative tasks. However, their robustness has received far less attention. This work focuses on self-supervised representations for generative spoken language modeling. First, we empirically demonstrate that current state-of-the-art speech representation models lack robustness to basic signal variations that do not alter the spoken content. To overcome this, we propose an effective and efficient method for learning robust self-supervised speech representations for generative spoken language modeling. The proposed approach applies a set of signal transformations to the speech signal and optimizes the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines on encoding metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English conversion, and empirically demonstrate the benefits of the proposed approach.

1. INTRODUCTION

Self-supervised speech models were shown to learn effective representations for various downstream tasks (Hsu et al., 2021; Chen et al., 2022; Baevski et al., 2020). These models were mainly evaluated on discriminative tasks, such as automatic speech recognition, speaker verification, intent classification, etc. (Yang et al., 2021). Recently, Lakhotia et al. (2021) demonstrated that such self-supervised learning (SSL) representations can be used for Generative Spoken Language Modeling (GSLM), the task of learning the acoustic and linguistic characteristics of a language from raw audio. First, a discrete representation of the audio signal is learned. Then, a speech-language model is trained on top of the obtained representation. Finally, a neural vocoder converts the output tokens back to raw audio. As the discrete speech representation typically operates over tokens extracted every twenty milliseconds of audio, sequences can be long and contain repetitions, e.g., 10 11 11 11 21 32 32 32 21. Preliminary studies found that removing sequential repetitions of units improves performance, and this step is therefore applied universally (Lakhotia et al., 2021). For example, the pseudo-text 10 11 11 11 21 32 32 32 21 becomes 10 11 21 32 21. This framework was shown to be effective in modeling multiple levels of the speech utterance, namely prosody and content (Lakhotia et al., 2021; Kharitonov et al., 2021a; Borsos et al., 2022), and has since been applied to speech coding (Polyak et al., 2021), speech emotion conversion (Kreuk et al., 2021), spoken dialogue (Nguyen et al., 2022), and speech-to-speech translation (Lee et al., 2021a; Popuri et al., 2022; Lee et al., 2021b). An essential prerequisite for such an audio representation to be used in real-world conditions is robustness to various signal corruptions. Although the aforementioned audio representation models have shown effectiveness in many tasks, they were mainly evaluated on academic benchmarks.
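The repetition-removal step described above amounts to collapsing runs of identical unit tokens. A minimal sketch (the function name is illustrative, not from the original work):

```python
from itertools import groupby

def dedup_units(units):
    """Collapse consecutive repeated unit tokens into a single token."""
    return [unit for unit, _ in groupby(units)]

# The running example from the text:
print(dedup_units([10, 11, 11, 11, 21, 32, 32, 32, 21]))
# [10, 11, 21, 32, 21]
```

Note that only *consecutive* repetitions are removed; the second occurrence of 21 survives because it is separated from the first by other units.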
In this work, we evaluate current state-of-the-art self-supervised speech representation models on what are arguably the most basic signal variations, namely time-stretch, pitch-shift, additive noise, and reverberation. Our premise is that while these variations modify the signal, its underlying content remains the same, especially after the token repetition-removal process. Therefore, a robust representation should be affected by such variations only to a minimal extent.
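One natural way to quantify "affected to a minimal extent" is to compare the de-duplicated unit sequences extracted from a clean signal and from its transformed counterpart, e.g., via a length-normalized edit distance. The sketch below is a hedged illustration of such a measure; the function names and the choice of normalization are ours, not necessarily those used in the paper:

```python
from itertools import groupby

def levenshtein(a, b):
    """Edit distance between two token sequences (iterative DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def unit_edit_distance(clean_units, augmented_units):
    """Normalized edit distance between de-duplicated unit sequences.
    0.0 means the transformation left the discrete representation intact."""
    dedup = lambda seq: [u for u, _ in groupby(seq)]
    a, b = dedup(clean_units), dedup(augmented_units)
    return levenshtein(a, b) / max(len(a), 1)

# Hypothetical unit sequences for the same utterance, clean vs. noisy:
clean = [10, 11, 11, 21, 32, 32, 21]
noisy = [10, 11, 21, 33, 21]
print(unit_edit_distance(clean, noisy))  # 0.2 (one substituted unit out of five)
```

Under this view, a perfectly robust encoder would map the clean and transformed signals to identical de-duplicated sequences, yielding a distance of zero.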

