PROSODYBERT: SELF-SUPERVISED PROSODY REPRESENTATION FOR STYLE-CONTROLLABLE TTS

Abstract

We propose ProsodyBERT, a self-supervised approach to learning prosody representations from raw audio. Unlike most previous work, which uses information bottlenecks to disentangle prosody features from lexical content and speaker information, we perform an offline clustering of speaker-normalized prosody-related features (energy, pitch, etc.) and use the cluster labels as targets for HuBERT-like masked unit prediction. A span boundary loss is also used to capture long-range prosodic information. We demonstrate the effectiveness of ProsodyBERT on a multi-speaker style-controllable text-to-speech (TTS) system, showing that the TTS system trained with ProsodyBERT features generates natural and expressive speech samples, surpassing FastSpeech 2 (which directly models pitch and energy) in subjective human evaluation. In addition, we achieve new state-of-the-art results on the IEMOCAP emotion recognition task by combining our prosody features with HuBERT features, showing that ProsodyBERT is complementary to popular pretrained speech self-supervised models.

1. INTRODUCTION

Human speech contains information beyond the associated word sequence. For example, the intonation, stress, rhythm, and tempo of speech carry important cues about speaking style, emotion, and intent. These factors are generally referred to as prosody. Prosodic modeling has been widely investigated in expressive text-to-speech (TTS) synthesis (Valle et al., 2020; Ren et al., 2021; Kenter et al., 2020; Ren et al., 2022) and voice conversion (VC) (Kreuk et al., 2021; Zhou et al., 2022), and has been shown to be important for generating natural and expressive synthesized speech. Prosody is also applied in spoken language understanding tasks, where it disambiguates and complements the information in the associated word sequence. Examples include parsing (Tran et al., 2018), punctuation prediction (Klejch et al., 2017; Cho et al., 2022), emotion recognition (Rao et al., 2013), and other paralinguistic recognition tasks.

Prosody is traditionally defined in terms of its function in communicating linguistic structure and paralinguistic information and/or in terms of its acoustic correlates, which include fundamental frequency (F0), energy, duration, and other measures associated with vocal effort. In this work, we focus on acoustic correlates, specifically F0 and energy, with duration implicitly encoded via the temporal dynamics. These features have limitations: F0 tracking algorithms are known to be unreliable in some contexts, producing pitch-halving and pitch-doubling errors; energy is sensitive to recording conditions; and F0 and duration depend on the speaker and segmental context. Further, F0, energy, and duration are highly interdependent but are often modeled independently, which can lead to unnatural prosody in TTS and limit their usefulness in speech understanding. For these reasons, researchers have been exploring automatic methods for learning alternative representations of prosody.
Automatically learned prosody representations have been proposed for speech synthesis using autoencoders that condition on text and speaker identity, which encourages residual information (assumed to be prosody) to be captured in an information bottleneck (Skerry-Ryan et al., 2018; Wang et al., 2018; Zhang et al., 2019; Qian et al., 2020). These approaches rely on high-quality speech transcripts, limiting the amount of data that can be used in training and the ability to learn a broadly generalizable representation.

Another paradigm of representation learning is self-supervised learning (SSL). SSL models are pretrained on large amounts of unlabeled examples and then finetuned on task-specific data. This paradigm has been particularly successful in natural language processing (Peters et al., 2018; Devlin et al., 2019). Recent speech SSL models such as wav2vec 2.0 (Baevski et al., 2020), HuBERT (Hsu et al., 2021), and WavLM (Chen et al., 2022) learn acoustic representations from untranscribed speech. Focusing on the phone level, they achieve good performance on speech recognition and understanding tasks, especially when only a small amount of task-specific data is available. SSL methods have also been explored for prosody learning in Weston et al. (2021), but that approach requires word time markings, which rely on having human transcripts.

To address the challenge of learning a prosody representation without word transcripts, we propose ProsodyBERT, a self-supervised learning method that disentangles prosody features from speech content and speaker information. Similar to HuBERT (Hsu et al., 2021), we pretrain an SSL model by masked unit prediction. The pseudo-labels are given by K-means clustering of speaker-normalized acoustic-prosodic attributes (pitch, energy, and related features), which encourages the model to focus on prosody learning.
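The pseudo-labeling step described above can be illustrated with a minimal sketch: per-speaker z-scoring of frame-level prosody features, followed by K-means clustering whose cluster indices serve as HuBERT-style prediction targets. The feature choice, cluster count, and normalization here are illustrative placeholders, not the paper's exact configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

def speaker_normalize(feats, speaker_ids):
    """Z-score each feature column per speaker, so that speaker-level
    statistics (e.g., habitual F0 range) are removed before clustering."""
    normed = np.empty_like(feats, dtype=np.float64)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mu = feats[mask].mean(axis=0)
        sigma = feats[mask].std(axis=0) + 1e-8
        normed[mask] = (feats[mask] - mu) / sigma
    return normed

# Toy frame-level prosody features: columns stand in for [log-F0, energy].
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 2))
speaker_ids = rng.integers(0, 4, size=1000)

normed = speaker_normalize(feats, speaker_ids)
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(normed)
pseudo_labels = kmeans.labels_  # targets for masked unit prediction
```

Because the clustering runs offline on speaker-normalized features, the masked-prediction targets carry prosodic rather than speaker or phonetic structure.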
In addition, inspired by SpanBERT (Joshi et al., 2020), we propose a span boundary loss to encourage the model to better represent long-range prosodic information. We also substantially compress the model size and reduce the feature dimensions to make the model easy to use. Like prior SSL models, ProsodyBERT is first trained on a large amount of raw speech audio and then adapted to target tasks. This design enables ProsodyBERT to learn a rich representation of prosody from massive amounts of untranscribed speech.

Our approach follows recent speech synthesis work that aims to disentangle prosody from lexical content and speaker identity. With this view, acoustic-prosodic features that can be speaker-dependent, such as F0 range, are accounted for in the speaker representation. Factoring the speech representation into different components can improve the model's ability to generalize across conditions and enable zero-shot speaker models for synthesis. While this factoring could be done in different ways (and include other factors), learning a prosody representation that carries minimal speaker information is also useful for privacy-sensitive speech processing. Using speaker verification experiments, we show that ProsodyBERT is effective at providing de-identified prosody features.

We demonstrate the effectiveness of pretrained ProsodyBERT features on text-to-speech (TTS) and emotion recognition. During training, we extract ProsodyBERT features from speech and use them as conditional inputs for the TTS decoder. A separate prosody predictor is trained to take text and style as inputs and generate prosody features. During inference, the TTS decoder takes the predicted prosody features as input. Experiments show that the TTS system trained with ProsodyBERT features generates natural and expressive speech, surpassing FastSpeech 2 (Ren et al., 2021) (trained with energy and F0) by a large margin in subjective human evaluation.
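The span boundary loss mentioned above can be sketched in the spirit of SpanBERT: each masked frame's cluster target is predicted from the encoder states at the two frames bordering the masked span, plus a learned position embedding for the frame's offset within the span. The module below is a hypothetical single-utterance PyTorch sketch; all dimensions and names are illustrative, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class SpanBoundaryLoss(nn.Module):
    """SpanBERT-style objective: reconstruct masked-span targets from
    span-boundary hidden states and a within-span position embedding."""
    def __init__(self, hidden, n_clusters, max_span=32):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span, hidden)
        self.proj = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.GELU(),
            nn.Linear(hidden, n_clusters),
        )
        self.ce = nn.CrossEntropyLoss()

    def forward(self, h, left, right, pos, targets):
        # h: (T, H) encoder outputs for one utterance
        # left/right: (N,) indices of the unmasked frames bordering each span
        # pos: (N,) offset of each masked frame within its span
        # targets: (N,) cluster ids of the masked frames
        x = torch.cat([h[left], h[right], self.pos_emb(pos)], dim=-1)
        return self.ce(self.proj(x), targets)

loss_fn = SpanBoundaryLoss(hidden=64, n_clusters=8)
h = torch.randn(100, 64)                      # toy encoder outputs
left = torch.tensor([9, 9])                   # span covers frames 10-11
right = torch.tensor([12, 12])
pos = torch.tensor([0, 1])
targets = torch.tensor([3, 5])
loss = loss_fn(h, left, right, pos, targets)
```

Because the prediction must flow through the boundary states, the encoder is pushed to summarize the prosodic trajectory of the whole span at its edges, which is one plausible way to capture longer-range structure than frame-local prediction alone.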
The expressiveness can be controlled by using speaking-style vectors (learned from multi-style data) in prosody prediction. For the emotion recognition task, we simply concatenate ProsodyBERT features with HuBERT features and use them as inputs to downstream models. We achieve a new state of the art on the IEMOCAP emotion recognition task, showing that ProsodyBERT features are complementary to HuBERT features.

In summary, the key contribution of this work is the development of a new, low-dimensional representation of prosody that can be learned from untranscribed speech. We demonstrate good speaker de-identification and the utility of the new features in both TTS and emotion recognition.
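The feature-concatenation setup for emotion recognition amounts to stacking the two frame-aligned feature streams along the channel dimension before the downstream classifier. A minimal sketch, where the dimensionalities (768 for HuBERT, 64 for ProsodyBERT, four emotion classes) and the mean-pool-plus-linear head are illustrative assumptions rather than the paper's exact downstream model:

```python
import torch

T = 200  # number of frames in a toy utterance
hubert_feats = torch.randn(T, 768)   # stand-in for content-oriented SSL features
prosody_feats = torch.randn(T, 64)   # stand-in for low-dimensional prosody features

# Concatenate the frame-aligned streams along the feature dimension.
x = torch.cat([hubert_feats, prosody_feats], dim=-1)  # (T, 832)

# Minimal downstream classifier: mean-pool over time, then a linear head
# over four emotion classes (as in common IEMOCAP setups).
pooled = x.mean(dim=0)
head = torch.nn.Linear(832, 4)
logits = head(pooled)
```

Since the prosody features are low-dimensional, the concatenation adds little overhead while giving the classifier an explicit prosodic view that content-oriented features may underrepresent.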

2. RELATED WORK

Modeling Prosody in TTS and VC Most prior work treats prosody feature learning as an auxiliary module for downstream generation tasks. Recent approaches include directly using signal-derived prosody features (F0, energy, etc.) (Wan et al., 2019; Valle et al., 2020; Ren et al., 2021; Kenter et al., 2020; Liu et al., 2021; Kharitonov et al., 2022), learning a latent style embedding (Wang et al., 2018; Zhang et al., 2019; Hsu et al., 2019; Sun et al., 2020), learning frame-level or phone-level representations (Du & Yu, 2021; Kreuk et al., 2021), and using reference audio for style (Choi et al., 2020; Yi et al., 2022). Most of these prosody representations rely on task-specific models.



Audio samples are available at: https://neurtts.github.io/prosodybert_demo/.

