PROSODYBERT: SELF-SUPERVISED PROSODY REPRESENTATION FOR STYLE-CONTROLLABLE TTS

Abstract

We propose ProsodyBERT, a self-supervised approach to learning prosody representations from raw audio. Unlike most previous work, which uses information bottlenecks to disentangle prosody features from lexical content and speaker information, we perform offline clustering of speaker-normalized prosody-related features (energy, pitch, etc.) and use the cluster labels as targets for HuBERT-like masked unit prediction. A span boundary loss is also used to capture long-range prosodic information. We demonstrate the effectiveness of ProsodyBERT on a multi-speaker style-controllable text-to-speech (TTS) system, showing that a TTS system trained with ProsodyBERT features generates natural and expressive speech samples, surpassing FastSpeech 2 (which directly models pitch and energy) in subjective human evaluation. In addition, we achieve new state-of-the-art results on the IEMOCAP emotion recognition task by combining our prosody features with HuBERT features, showing that ProsodyBERT is complementary to popular pretrained self-supervised speech models.¹

1. INTRODUCTION

Human speech contains information beyond the associated word sequence. For example, the intonation, stress, rhythm, and tempo of speech carry important cues associated with the speaking style, emotion, and intent. These factors are generally referred to as prosody. Prosodic modeling has been widely investigated in expressive text-to-speech (TTS) synthesis (Valle et al., 2020; Ren et al., 2021; Kenter et al., 2020; Ren et al., 2022) and voice conversion (VC) (Kreuk et al., 2021; Zhou et al., 2022), and has been shown to be important for generating natural and expressive synthesized speech. Prosody is also applied in spoken language understanding tasks by providing information that disambiguates and complements information in the associated word sequence. Examples include parsing (Tran et al., 2018), punctuation prediction (Klejch et al., 2017; Cho et al., 2022), emotion recognition (Rao et al., 2013), and other paralinguistic recognition tasks.

Prosody is traditionally defined in terms of its function in communicating linguistic structure and paralinguistic information and/or in terms of the associated acoustic correlates, which include fundamental frequency (F0), energy, duration, and other measures associated with vocal effort. In this work, we focus on acoustic correlates, specifically F0 and energy, with duration implicitly encoded via the temporal dynamics. These features have limitations. F0 tracking algorithms are known to be unreliable in some contexts, with pitch halving and doubling errors. Energy is sensitive to recording conditions. F0 and duration depend on speaker and segmental context. Further, F0, energy, and duration are highly inter-dependent, but are often modeled independently, which can lead to unnatural prosody in TTS and limit their usefulness in speech understanding. For these reasons, researchers have been exploring automatic methods for learning alternative representations of prosody.
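The target-generation pipeline described above (speaker-normalize frame-level prosody features, then cluster them offline so that cluster indices serve as discrete prediction targets) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it assumes frame-level log-F0 and energy have already been extracted by some pitch tracker, and the function names and the plain k-means routine are our own illustrative choices.

```python
import numpy as np

def speaker_normalize(feats, speaker_ids):
    """z-normalize each prosody feature (e.g. log-F0, energy) per speaker,
    removing speaker-dependent range and level differences."""
    out = np.empty_like(feats, dtype=float)
    for spk in np.unique(speaker_ids):
        m = speaker_ids == spk
        mu = feats[m].mean(axis=0)
        sd = feats[m].std(axis=0) + 1e-8  # avoid division by zero
        out[m] = (feats[m] - mu) / sd
    return out

def kmeans_labels(x, k=4, iters=50, seed=0):
    """Plain k-means over normalized frames; the resulting cluster indices
    play the role of discrete targets for masked unit prediction."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared Euclidean distance of every frame to every center
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels
```

In practice one would run the normalization per utterance collection for each speaker and fit the clustering once over the whole corpus, then freeze the cluster assignments as fixed pseudo-labels for pretraining.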
Automatically learned prosody representations have been proposed for speech synthesis using autoencoders that condition on text and speaker identity, which encourages residual information (assumed to be prosody) to be captured in an information bottleneck (Skerry-Ryan et al., 2018; Wang et al., 2018; Zhang et al., 2019; Qian et al., 2020). These approaches rely on having high-quality speech transcripts, limiting the amount of data that can be used in training and the ability to learn a broadly generalizable representation. Another paradigm of representation learning is self-supervised learning

¹ Audio samples are available at: https://neurtts.github.io/prosodybert_demo/

