IMPROVING ZERO-SHOT VOICE STYLE TRANSFER VIA DISENTANGLED REPRESENTATION LEARNING

Abstract

Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and speakers known in advance. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. We propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed method first encodes the speaker-related style and the voice content of each input voice into separate low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and the target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On the real-world VCTK dataset, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness, under both many-to-many and zero-shot setups.

1. INTRODUCTION

Style transfer, which automatically converts a data instance into a target style while preserving its content information, has attracted considerable attention in various machine learning domains, including computer vision (Gatys et al., 2016; Luan et al., 2017; Huang & Belongie, 2017), video processing (Huang et al., 2017; Chen et al., 2017), and natural language processing (Shen et al., 2017; Yang et al., 2018; Lample et al., 2019; Cheng et al., 2020b). In speech processing, style transfer has long been studied as voice conversion (VC) (Muda et al., 2010), which converts one speaker's utterance so that it sounds as if spoken by another speaker, while keeping the same semantic meaning. Voice style transfer (VST) has received long-term research interest, due to its potential applications in security (Sisman et al., 2018), medicine (Nakamura et al., 2006), entertainment (Villavicencio & Bonada, 2010), and education (Mohammadi & Kain, 2017), among others. Although widely investigated, VST remains challenging when applied to more general scenarios. Most traditional VST methods require parallel training data, i.e., paired voices from two speakers uttering the same sentences. This constraint limits the application of such models in the real world, where data are often not available in pairs. Among the few existing models that address non-parallel data (Hsu et al., 2016; Lee & Wu, 2006; Godoy et al., 2011), most cannot handle many-to-many transfer (Saito et al., 2018; Kaneko & Kameoka, 2018; Kameoka et al., 2018), which prevents them from converting multiple source voices to multiple target speaker styles. Even among the few non-parallel many-to-many transfer models, to the best of our knowledge, only two (Qian et al., 2019; Chou & Lee, 2019) allow zero-shot transfer, i.e., conversion from/to newly encountered speakers (unseen during training) without re-training the model.
The only two zero-shot VST models, AUTOVC (Qian et al., 2019) and AdaIN-VC (Chou & Lee, 2019), share a common weakness. Both construct encoder-decoder frameworks, which extract the style and the content information into style and content embeddings, and generate a voice sample by combining a style embedding and a content embedding through the decoder. By combining the source content embedding with the target style embedding, the models generate the transferred voice based only on source and target voice samples. AUTOVC (Qian et al., 2019) uses a GE2E (Wan et al., 2018) pre-trained style encoder to ensure rich speaker-related information in the style embeddings. However, AUTOVC has no regularizer to guarantee that the content encoder does not encode any style information. AdaIN-VC (Chou & Lee, 2019) applies instance normalization (Ulyanov et al., 2016) to the feature map of the content representations, which helps eliminate style information from the content embeddings. However, AdaIN-VC fails to prevent content information from being revealed in the style embeddings. Hence, neither method can guarantee that the style and content embeddings are disentangled, with no information leaking from one into the other. With information-theoretic guidance, we propose a disentangled-representation-learning method to enhance the encoder-decoder zero-shot VST framework, preserving both style and content information. We call the proposed method Information-theoretic Disentangled Embedding for Voice Conversion (IDE-VC). Our model successfully induces the style and content of voices into independent representation spaces by minimizing the mutual information between style and content embeddings. We also derive two new multi-group mutual information lower bounds, to further improve the representativeness of the latent embeddings.
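To make the shared encoder-decoder setup concrete, the following toy sketch shows how a converted voice is assembled from the source content embedding and the target style embedding. This is not the architecture of either model: the mean-pooling style encoder, de-meaning content encoder, and additive decoder are hypothetical stand-ins for the learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "encoders" and "decoder"; in AUTOVC/AdaIN-VC these are learned networks.
def style_encoder(mel):
    # One vector per utterance: mean-pool over time frames.
    return mel.mean(axis=0)

def content_encoder(mel):
    # One vector per frame, with per-utterance statistics removed
    # (a crude analogue of normalizing style away).
    return mel - mel.mean(axis=0)

def decoder(content, style):
    # Recombine frame-wise content with a single style vector.
    return content + style  # style broadcasts over all frames

# Two utterances as (time, n_mels) spectrogram-like arrays.
source = rng.normal(loc=0.0, scale=1.0, size=(120, 80))  # source speaker
target = rng.normal(loc=3.0, scale=1.0, size=(90, 80))   # target speaker

# Transfer: source content embedding + target style embedding -> decoder.
converted = decoder(content_encoder(source), style_encoder(target))
print(converted.shape)  # (120, 80): source timing with target speaker statistics
```

The converted output keeps the source utterance's frame structure while adopting the target's per-channel statistics; the weakness discussed above corresponds to either encoder leaking information that the other should carry.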
Experiments demonstrate that our method outperforms previous works under both many-to-many and zero-shot transfer setups on two objective metrics and two subjective metrics.

2. BACKGROUND

In information theory, mutual information (MI) is a crucial concept that measures the dependence between two random variables. Mathematically, the MI between two variables x and y is

I(x; y) := \mathbb{E}_{p(x,y)}\left[\log \frac{p(x,y)}{p(x)p(y)}\right], (1)

where p(x) and p(y) are the marginal distributions of x and y, and p(x, y) is the joint distribution. Recently, MI has attracted considerable interest in machine learning as a criterion to minimize or maximize the dependence between different parts of a model (Chen et al., 2016; Alemi et al., 2016; Hjelm et al., 2018; Veličković et al., 2018; Song et al., 2019). However, the calculation of exact MI values is challenging in practice, since the closed form of the joint distribution p(x, y) in equation (1) is generally unknown. To address this problem, several MI estimators have been proposed. For MI maximization tasks, Nguyen, Wainwright and Jordan (NWJ) (Nguyen et al., 2010) propose a lower bound, derived by representing (1) as an f-divergence (Moon & Hero, 2014), with a score function f(x, y):

I_{\text{NWJ}} := \mathbb{E}_{p(x,y)}[f(x, y)] - e^{-1}\,\mathbb{E}_{p(x)p(y)}[e^{f(x,y)}]. (2)

Another widely-used sample-based MI lower bound is InfoNCE (Oord et al., 2018), which is derived with Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010). With sample pairs \{(x_i, y_i)\}_{i=1}^{N} drawn from the joint distribution p(x, y), the InfoNCE lower bound is defined as

I_{\text{NCE}} := \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{f(x_i, y_i)}}{\frac{1}{N}\sum_{j=1}^{N} e^{f(x_i, y_j)}}\right]. (3)

For MI minimization tasks, Cheng et al. (2020a) proposed a contrastively learned upper bound that requires the conditional distribution p(x|y):

I(x; y) \le \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\left(\log p(x_i \mid y_i) - \frac{1}{N}\sum_{j=1}^{N} \log p(x_j \mid y_i)\right)\right], (4)

where the MI is bounded by the log-ratio of the conditional distribution p(x|y) between positive and negative sample pairs. In the following, we derive our information-theoretic disentangled representation learning framework for voice style transfer based on the MI estimators described above.
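The lower and upper bounds above can be checked numerically on a toy Gaussian pair, where the true MI is known in closed form. The fixed quadratic critic and the use of the exact Gaussian conditional are simplifications for illustration only; in practice the score function f is a trained network, and Cheng et al. (2020a) fit a variational approximation to p(x|y) rather than using it exactly. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated Gaussian pair: y = x + 0.5*eps, with x, eps ~ N(0, 1).
# True MI: 0.5 * log(1 + 1/0.25) = 0.5 * log(5) ≈ 0.80 nats.
N = 512
x = rng.normal(size=N)
y = x + 0.5 * rng.normal(size=N)

# Pairwise critic scores f(x_i, y_j); a fixed quadratic critic stands in
# for the trained score network used in practice.
scores = -(x[:, None] - y[None, :]) ** 2  # scores[i, j] = f(x_i, y_j)

# InfoNCE lower bound (eq. 3): log-softmax of each positive pair's score.
i_nce = np.mean(np.diag(scores) - np.log(np.mean(np.exp(scores), axis=1)))

# Contrastive upper bound (eq. 4), here with the exact Gaussian conditional
# p(x | y) = N(0.8 * y, 0.2); Cheng et al. (2020a) learn it variationally.
mu, var = 0.8 * y, 0.2
log_p = -0.5 * ((x[None, :] - mu[:, None]) ** 2 / var + np.log(2 * np.pi * var))
# log_p[i, j] = log p(x_j | y_i): diagonal = positive pairs, rest = negatives.
i_club = np.mean(np.diag(log_p) - np.mean(log_p, axis=1))

print(f"InfoNCE lower bound: {i_nce:.2f}, contrastive upper bound: {i_club:.2f}")
```

With a fixed critic both estimates are loose; in training, the score function and conditional model would be optimized jointly with the encoders so the bounds tighten around the true MI.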

3. PROPOSED MODEL

We assume access to N audio (voice) recordings from M speakers, where speaker u has N_u voice samples X_u = \{x_{ui}\}_{i=1}^{N_u}. The proposed approach encodes each voice input x \in X = \cup_{u=1}^{M} X_u into a speaker-related (style) embedding s = E_s(x) and a content-related embedding c = E_c(x),

