IMPROVING ZERO-SHOT VOICE STYLE TRANSFER VIA DISENTANGLED REPRESENTATION LEARNING

Abstract

Voice style transfer, also called voice conversion, seeks to modify one speaker's voice so that the speech sounds as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. We propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed method first encodes the speaker-related style and the voice content of each input voice into separate low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and the target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces become representative and (ideally) independent of each other. On the real-world VCTK dataset, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness, under both many-to-many and zero-shot setups.

1. INTRODUCTION

Style transfer, which automatically converts a data instance into a target style while preserving its content, has attracted considerable attention in various machine learning domains, including computer vision (Gatys et al., 2016; Luan et al., 2017; Huang & Belongie, 2017), video processing (Huang et al., 2017; Chen et al., 2017), and natural language processing (Shen et al., 2017; Yang et al., 2018; Lample et al., 2019; Cheng et al., 2020b). In speech processing, style transfer has long been studied as voice conversion (VC) (Muda et al., 2010), which converts one speaker's utterance so that it sounds as if spoken by another speaker while carrying the same semantic meaning. Voice style transfer (VST) has received long-term research interest due to its potential applications in security (Sisman et al., 2018), medicine (Nakamura et al., 2006), entertainment (Villavicencio & Bonada, 2010) and education (Mohammadi & Kain, 2017), among others. Although widely investigated, VST remains challenging in more general application scenarios. Most traditional VST methods require parallel training data, i.e., paired voices from two speakers uttering the same sentences. This constraint limits the application of such models in the real world, where data are often not available pair-wise. Among the few existing models that address non-parallel data (Hsu et al., 2016; Lee & Wu, 2006; Godoy et al., 2011), most cannot handle many-to-many transfer (Saito et al., 2018; Kaneko & Kameoka, 2018; Kameoka et al., 2018), which prevents them from converting multiple source voices to multiple target speaker styles. Even among the few non-parallel many-to-many transfer models, to the best of our knowledge, only two (Qian et al., 2019; Chou & Lee, 2019) allow zero-shot transfer, i.e., conversion from/to previously unseen speakers without re-training the model.
The two existing zero-shot VST models, AUTOVC (Qian et al., 2019) and AdaIN-VC (Chou & Lee, 2019), share a common weakness. Both construct encoder-decoder frameworks, which extract the style and content information into style and content embeddings, and generate a voice sample by combining a style embedding and a content embedding through the decoder. By combining the source content embedding with the target style embedding, the models generate the transferred voice based only on source and target voice samples. AUTOVC (Qian et al., 2019) uses a GE2E (Wan et al., 2018) pre-trained style encoder to ensure rich speaker-related information in the style embeddings. However, AUTOVC has no regularizer to guarantee that the content encoder does not encode any style information. AdaIN-VC (Chou & Lee, 2019) applies instance normalization (Ulyanov et al., 2016) to the feature map of the content representation, which helps eliminate style information from the content embeddings. However, AdaIN-VC fails to prevent content information from leaking into the style embeddings. Hence, neither method can ensure that the style and content embeddings are disentangled, with no information revealed from one to the other. With information-theoretic guidance, we propose a disentangled-representation-learning method that enhances the encoder-decoder zero-shot VST framework to preserve both style and content information. We call the proposed method Information-theoretic Disentangled Embedding for Voice Conversion (IDE-VC). Our model induces the style and content of voices into independent representation spaces by minimizing the mutual information between style and content embeddings. We also derive two new multi-group mutual information lower bounds, which further improve the representativeness of the latent embeddings.
Experiments demonstrate that our method outperforms previous works under both many-to-many and zero-shot transfer setups on two objective metrics and two subjective metrics.

2. BACKGROUND

In information theory, mutual information (MI) is a crucial concept that measures the dependence between two random variables. Mathematically, the MI between two variables x and y is

    I(x; y) := \mathbb{E}_{p(x,y)}\left[\log \frac{p(x,y)}{p(x)p(y)}\right],    (1)

where p(x) and p(y) are the marginal distributions of x and y, and p(x, y) is the joint distribution. Recently, MI has attracted considerable interest in machine learning as a criterion to minimize or maximize the dependence between different parts of a model (Chen et al., 2016; Alemi et al., 2016; Hjelm et al., 2018; Veličković et al., 2018; Song et al., 2019). However, calculating exact MI values is challenging in practice, since the closed form of the joint distribution p(x, y) in equation (1) is generally unknown. To address this, several MI estimators have been proposed. For MI maximization tasks, Nguyen, Wainwright and Jordan (NWJ) (Nguyen et al., 2010) propose a lower bound obtained by representing (1) as an f-divergence (Moon & Hero, 2014):

    I_{NWJ} := \mathbb{E}_{p(x,y)}[f(x, y)] - e^{-1}\,\mathbb{E}_{p(x)p(y)}[e^{f(x,y)}],    (2)

with a score function f(x, y). Another widely-used sample-based MI lower bound is InfoNCE (Oord et al., 2018), which is derived with Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010). With sample pairs \{(x_i, y_i)\}_{i=1}^N drawn from the joint distribution p(x, y), the InfoNCE lower bound is defined as

    I_{NCE} := \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^N \log \frac{e^{f(x_i, y_i)}}{\frac{1}{N}\sum_{j=1}^N e^{f(x_i, y_j)}}\right].    (3)

For MI minimization tasks, Cheng et al. (2020a) proposed a contrastively learned upper bound that requires the conditional distribution p(x|y):

    I(x; y) \le \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^N \left[\log p(x_i|y_i) - \frac{1}{N}\sum_{j=1}^N \log p(x_j|y_i)\right]\right],    (4)

where the MI is bounded by the log-ratio of the conditional distribution p(x|y) between positive and negative sample pairs. In the following, we derive our information-theoretic disentangled representation learning framework for voice style transfer based on the MI estimators described above.
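As a concrete illustration of equation (3), the following sketch estimates the InfoNCE lower bound from an N x N matrix of critic scores f(x_i, y_j). The bilinear-free, matrix-based formulation is an assumption for illustration, not the paper's implementation; a production version would use a numerically stable log-sum-exp (e.g., scipy.special.logsumexp).

```python
import numpy as np

def infonce_lower_bound(scores):
    """Estimate I_NCE from an N x N matrix of critic scores f(x_i, y_j).

    Diagonal entries score jointly drawn pairs (x_i, y_i); off-diagonal
    entries score negative pairs. Returns the sample InfoNCE lower bound.
    """
    n = scores.shape[0]
    # per-row: log( e^{f(x_i,y_i)} / ( (1/N) sum_j e^{f(x_i,y_j)} ) )
    row_lse = np.log(np.exp(scores).sum(axis=1))  # not overflow-safe; demo only
    per_sample = np.diag(scores) - row_lse + np.log(n)
    return per_sample.mean()
```

Note the estimate is capped at log N: with constant scores (independent variables) it returns 0, and with strongly diagonal scores it approaches log N, which is why large batch sizes are needed to estimate large MI values.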

3. PROPOSED MODEL

We assume access to N audio (voice) recordings from M speakers, where speaker u has N_u voice samples X_u = \{x_{ui}\}_{i=1}^{N_u}. The proposed approach encodes each voice input x \in X = \cup_{u=1}^M X_u into a speaker-related (style) embedding s = E_s(x) and a content-related embedding c = E_c(x), using respectively a style encoder E_s(\cdot) and a content encoder E_c(\cdot). To transfer a source voice x_{ui} of speaker u to the style of a voice x_{vj} of target speaker v, we combine the content embedding c_{ui} = E_c(x_{ui}) and the style embedding s_{vj} = E_s(x_{vj}) to generate the transferred voice \hat{x}_{u\to v,i} = D(s_{vj}, c_{ui}) with a decoder D(s, c). To implement this two-step transfer process, we introduce a novel mutual information (MI)-based learning objective that induces the style embedding s and content embedding c into independent representation spaces (i.e., ideally, s contains rich style information of x with no content information, and vice versa). In the following, we first describe our MI-based training objective in Section 3.1, and then discuss the practical estimation of the objective in Sections 3.2 and 3.3.
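The encode-combine-decode pipeline above can be sketched as follows. The encoders and decoder here are toy placeholders (temporal mean as "style", mean-removed frames as "content"), purely to make the data flow concrete; they are not the paper's networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder networks: any callables mapping a (frames x mels) spectrogram
# to a fixed-size style vector / frame-wise content sequence would do.
def style_encoder(x):          # E_s: utterance -> style vector s
    return x.mean(axis=0)      # toy stand-in: temporal average
def content_encoder(x):        # E_c: utterance -> content sequence c
    return x - x.mean(axis=0)  # toy stand-in: remove per-utterance mean
def decoder(s, c):             # D: (style, content) -> spectrogram
    return c + s               # toy stand-in: re-inject style

x_src = rng.normal(size=(100, 80))  # source utterance x_ui
x_tgt = rng.normal(size=(120, 80))  # target-speaker utterance x_vj

# Zero-shot transfer: source content + target style, no re-training needed.
x_transfer = decoder(style_encoder(x_tgt), content_encoder(x_src))
```

In this idealized toy, the transferred utterance keeps the source's content representation and inherits the target's style representation, which is exactly the property the MI objective is designed to enforce for the learned networks.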

3.1. MI-BASED DISENTANGLING OBJECTIVE

From an information-theoretic perspective, to learn a representative latent embedding pair (s, c), it is desirable to maximize the mutual information between (s, c) and the input x. Meanwhile, the style embedding s and the content embedding c should be independent, so that we can control the style transfer process with separate style and content attributes. Therefore, we minimize the mutual information I(s; c) to disentangle the style and content embedding spaces. Consequently, our overall disentangled-representation-learning objective seeks to minimize

    L = I(s; c) - I(x; s, c) = I(s; c) - I(x; c|s) - I(x; s).    (5)

As discussed in Locatello et al. (2019), without an inductive bias for supervision, the learned representation can be meaningless. To address this problem, we use the speaker identity u, a variable with values \{1, \dots, M\}, as supervision to learn a style embedding s representative of speaker-related attributes. Noting that the process from speaker u to his/her voice x_{ui} to the style embedding s_{ui} (i.e., u \to x \to s) is a Markov chain, we conclude I(s; x) \ge I(s; u) by the MI data-processing inequality (Cover & Thomas, 2012) (as stated in the Supplementary Material). Therefore, we replace I(x; s) in L with I(u; s) and minimize the upper bound

    \bar{L} = I(s; c) - I(x; c|s) - I(u; s) \ge I(s; c) - I(x; c|s) - I(x; s) = L.    (6)

In practice, calculating the MI is challenging, as we typically only have access to samples and lack the required distributions (Chen et al., 2016). To solve this problem, below we provide estimates for the objective terms I(s; c), I(x; c|s) and I(u; s).
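The decomposition and upper bound above follow from two standard identities, spelled out here for completeness: the MI chain rule and the data-processing inequality.

```latex
% Chain rule: joint MI splits into a marginal and a conditional term,
I(x; s, c) = I(x; s) + I(x; c \mid s),
% which yields the objective
\mathcal{L} = I(s; c) - I(x; s, c) = I(s; c) - I(x; c \mid s) - I(x; s).
% Since u \to x \to s is a Markov chain, data processing gives
I(x; s) \ge I(u; s)
\quad\Rightarrow\quad
\mathcal{L} \le \bar{\mathcal{L}} := I(s; c) - I(x; c \mid s) - I(u; s),
% so minimizing \bar{\mathcal{L}} minimizes an upper bound on \mathcal{L}.
```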

3.2. MI LOWER BOUND ESTIMATION

To maximize I(u; s), we derive the following multi-group MI lower bound (Theorem 3.1), based on the NWJ bound of Nguyen et al. (2010). The detailed proof is provided in the Supplementary Material.

Theorem 3.1. Let \mu_v^{(-ui)} = \frac{1}{N_v}\sum_{k=1}^{N_v} s_{vk} if v \neq u, and \mu_u^{(-ui)} = \frac{1}{N_u - 1}\sum_{j \neq i} s_{uj}. Then

    I(u; s) \ge \mathbb{E}\left[\frac{1}{N}\sum_{u=1}^M \sum_{i=1}^{N_u}\left(-\|s_{ui} - \mu_u^{(-ui)}\|^2 - \frac{e^{-1}}{N}\sum_{v=1}^M N_v \exp\{-\|s_{ui} - \mu_v^{(-ui)}\|^2\}\right)\right].    (7)

Here \mu_v^{(-ui)} = \mu_v is the mean of all style embeddings in group X_v, constituting the style centroid of speaker v; \mu_u^{(-ui)} is the mean of all style embeddings in group X_u except the data point x_{ui}, representing a leave-x_{ui}-out style centroid of speaker u. Intuitively, minimizing \|s_{ui} - \mu_u^{(-ui)}\| encourages the style embedding of voice x_{ui} to be close to the style centroid of speaker u, while maximizing \|s_{ui} - \mu_v^{(-ui)}\| enlarges the margin between s_{ui} and the other speakers' style centroids \mu_v. We denote the right-hand side of (7) as Î_1.

To maximize I(x; c|s), we derive the following conditional MI lower bound.

Theorem 3.2. Assume that, given s = s_u, samples \{(x_{ui}, c_{ui})\}_{i=1}^{N_u} are observed. With a variational distribution q_\phi(x|s, c), we have I(x; c|s) \ge \mathbb{E}[Î_2], where

    Î_2 = \frac{1}{N}\sum_{u=1}^M \sum_{i=1}^{N_u}\left[\log q_\phi(x_{ui}|c_{ui}, s_u) - \log \frac{1}{N_u}\sum_{j=1}^{N_u} q_\phi(x_{uj}|c_{ui}, s_u)\right].    (8)

3.3. MI UPPER BOUND ESTIMATION

To minimize I(s; c), we apply the upper bound (4), replacing the unknown conditional distribution p(s|c) with a variational approximation q_\theta(s|c). The sampled objective for minimizing I(s; c) becomes

    Î_3 := \frac{1}{N}\sum_{u=1}^M \sum_{i=1}^{N_u}\left[\log q_\theta(s_{ui}|c_{ui}) - \frac{1}{N}\sum_{v=1}^M \sum_{j=1}^{N_v} \log q_\theta(s_{ui}|c_{vj})\right].    (10)

When the weights of the encoders E_c, E_s are updated, the embedding spaces of s and c change, which changes the conditional distribution p(s|c). Therefore, the neural approximation q_\theta(s|c) must also be updated. Consequently, during training, the encoders E_c, E_s and the approximation q_\theta(s|c) are updated iteratively. In the Supplementary Material, we further show that with a good approximation q_\theta(s|c), Î_3 remains an MI upper bound.
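The style bound Î_1 of Theorem 3.1 can be sketched directly from its definition. The following is an illustrative numpy implementation over a batch of style embeddings with speaker labels; it is a sketch of the estimator's arithmetic, not the paper's training code.

```python
import numpy as np

def style_mi_lower_bound(s, speaker):
    """Sample estimate of the multi-group lower bound I_hat_1 on I(u; s).

    s: (N, d) style embeddings; speaker: (N,) integer speaker identities.
    Uses the leave-one-out centroid for the speaker of s_ui and full
    centroids for the other speakers, as in Theorem 3.1.
    """
    n = len(s)
    uniq = np.unique(speaker)
    sums = {v: s[speaker == v].sum(axis=0) for v in uniq}
    counts = {v: int((speaker == v).sum()) for v in uniq}
    total = 0.0
    for i in range(n):
        u = speaker[i]
        neg = 0.0
        for v in uniq:
            if v == u:  # leave-x_ui-out centroid of the speaker's own group
                mu_v = (sums[u] - s[i]) / (counts[u] - 1)
                total -= np.sum((s[i] - mu_v) ** 2)  # attraction term
            else:
                mu_v = sums[v] / counts[v]           # other speakers' centroids
            neg += counts[v] * np.exp(-np.sum((s[i] - mu_v) ** 2))
        total -= np.exp(-1.0) / n * neg              # NWJ partition-function term
    return total / n
```

As expected from the intuition above, embeddings tightly clustered by speaker yield a larger (less negative) estimate than embeddings mixed across speakers.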

3.4. ENCODER-DECODER FRAMEWORK

With the aforementioned MI estimates Î_1, Î_2, and Î_3, the final training loss of our method is

    L* = [Î_3 - Î_1 - Î_2] - \beta F(\theta),    (11)

where \beta is a positive weight balancing the two objective terms. The term Î_3 - Î_1 - Î_2 is minimized w.r.t. the parameters of the encoders E_c, E_s and the decoder D; the term F(\theta), the likelihood of q_\theta(s|c), is maximized w.r.t. the parameter \theta. In practice, the two terms are updated iteratively with gradient descent (fixing one while updating the other). The training and transfer processes of our model are shown in Figure 1. We name this MI-guided learning framework Information-theoretic Disentangled Embedding for Voice Conversion (IDE-VC).
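The alternating update schedule can be sketched abstractly as below. The two step functions are hypothetical stand-ins for the real SGD updates (one step on the transfer loss with q_\theta frozen, one maximum-likelihood step on q_\theta with the encoders frozen); the scalar toy instance only demonstrates that the alternation converges.

```python
# Schematic alternating optimization for the final training loss L*.
def alternate_train(transfer_step, likelihood_step, init_model, init_q, steps):
    model, q = init_model, init_q
    for _ in range(steps):
        model = transfer_step(model, q)  # min over encoders/decoder, q fixed
        q = likelihood_step(model, q)    # max likelihood F(theta), model fixed
    return model, q

# Toy instance: "model" takes a gradient step on (model - q)^2 with lr 0.25;
# "q" is refit exactly to the current model (exact maximum likelihood).
model, q = alternate_train(
    transfer_step=lambda m, q: m - 0.25 * 2.0 * (m - q),
    likelihood_step=lambda m, q: m,
    init_model=5.0, init_q=0.0, steps=20,
)
```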

4. RELATED WORK

Many-to-many Voice Conversion Traditional voice style transfer methods mainly focus on one-to-one and many-to-one conversion tasks, which can only transfer voices into one target speaking style. This constraint limits the applicability of these methods. Recently, several many-to-many voice conversion methods have been proposed to convert voices in broader application scenarios. StarGAN-VC (Kameoka et al., 2018) uses StarGAN (Choi et al., 2018) to enable many-to-many transfer, in which voices are fed into a single generator conditioned on the target speaker identity; a discriminator evaluates generation quality and transfer accuracy. Blow (Serrà et al., 2019) is a flow-based generative model (Kingma & Dhariwal, 2018) that maps voices from different speakers into a shared latent space via normalizing flows (Rezende & Mohamed, 2015); conversion is accomplished by transforming the latent representation back to the observation space with the target speaker's identifier. Two other many-to-many conversion models, AUTOVC (Qian et al., 2019) and AdaIN-VC (Chou & Lee, 2019), extend to zero-shot scenarios, i.e., conversion from/to a new speaker (unseen during training) based on only a few utterances. Both AUTOVC and AdaIN-VC construct an encoder-decoder framework that extracts the style and content of a speech sample into separate latent embeddings; when a voice from an unseen speaker arrives, its style and content embeddings can be extracted directly. However, as discussed in the Introduction, neither method has an explicit regularizer to reduce the correlation between style and content embeddings, which limits their performance.

Disentangled Representation Learning Disentangled representation learning (DRL) aims to encode data points into separate independent embedding subspaces, where different subspaces represent different data attributes. DRL methods can be classified into unsupervised and supervised approaches.
Under unsupervised setups, Burgess et al. (2018), Higgins et al. (2016) and Kim & Mnih (2018) use latent embeddings to reconstruct the original data while keeping the dimensions of the embeddings independent via correlation regularizers. This approach has been challenged by Locatello et al. (2019), in that each part of the learned embeddings may not map to a meaningful data attribute. In contrast, supervised DRL methods effectively learn meaningful disentangled embedding parts by adding different supervision to different embedding components. The correlation between the embedding parts must still be reduced, to prevent information leaking between them. Correlation-reducing methods mainly rely on adversarial training between embedding parts (Hjelm et al., 2018; Kim & Mnih, 2018) and mutual information minimization (Chen et al., 2018; Cheng et al., 2020b). By applying operations such as switching and combining, disentangled representations can improve empirical performance on downstream tasks, e.g., conditional generation (Burgess et al., 2018), domain adaptation (Gholami et al., 2020), and few-shot learning (Higgins et al., 2017).

80 channels for the last layer. The output of the post-network can be viewed as a residual signal. The final conversion signal is computed by directly adding the output of the initial decoder and the post-network. The reconstruction loss is applied to both the output of the initial decoder and the final conversion signal.

Approximation Network Architecture As described in Section 3.3, minimizing the mutual information between style and content embeddings requires an auxiliary variational approximation q_\theta(s|c). For implementation, we parameterize the variational distribution in the Gaussian family q_\theta(s|c) = N(\mu_\theta(c), \sigma_\theta^2(c) \cdot I), where the mean \mu_\theta(\cdot) and variance \sigma_\theta^2(\cdot) are two-layer fully-connected networks with tanh(\cdot) activations.
With the Gaussian parameterization, the likelihoods in objective Î3 can be calculated in closed form.
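The closed-form likelihood under the diagonal-Gaussian parameterization is just the standard Gaussian log-density. The sketch below assumes \mu_\theta(c) and \log\sigma_\theta^2(c) have already been produced by the (not shown) two-layer networks.

```python
import numpy as np

def gaussian_log_likelihood(s, mu, log_var):
    """Closed-form log q_theta(s|c) for a diagonal Gaussian N(mu(c), sigma^2(c) I).

    s, mu, log_var: arrays of shape (batch, d); returns per-sample log-density,
    as used both for the likelihood F(theta) and the ratios inside I_hat_3.
    """
    return -0.5 * np.sum(
        log_var + (s - mu) ** 2 / np.exp(log_var) + np.log(2.0 * np.pi),
        axis=-1,
    )
```

Parameterizing the variance through its logarithm keeps \sigma^2 positive without constrained optimization, which is the usual design choice for variational networks.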

5.3. STYLE TRANSFER PERFORMANCE

For the many-to-many VST task, we randomly select 10% of the sentences for validation and 10% for testing from the VCTK dataset, following the setting in Blow (Serrà et al., 2019). The rest of the data are used for training in a non-parallel scheme. For evaluation, we select voice pairs from the testing set, in which the two voices in each pair have the same content but come from different speakers. For each testing pair, we transfer one voice to the other voice's speaking style, and then compare the transferred voice with the other voice to evaluate model performance. We compare our model against four competitive baselines: Blow (Serrà et al., 2019), AUTOVC (Qian et al., 2019), AdaIN-VC (Chou & Lee, 2019) and StarGAN-VC (Kameoka et al., 2018). The detailed implementations of these four methods are provided in the Supplementary Material. Table 1 shows the subjective and objective evaluation for the many-to-many VST task. The two methods with encoder-decoder frameworks, AdaIN-VC and AUTOVC, achieve competitive results. However, our IDE-VC outperforms all baselines on all metrics, demonstrating that style-content disentanglement in the latent space improves the performance of the encoder-decoder framework. For the zero-shot VST task, we use the same train-validation split as in the many-to-many setup. The testing data are selected to guarantee that no test speaker has any utterance in the training set. We compare our model with the only two baselines, AUTOVC (Qian et al., 2019) and AdaIN-VC (Chou & Lee, 2019), that can handle voice transfer for unseen speakers, using the same implementations as in the many-to-many VST. The evaluation results for zero-shot VST are shown in Table 2; among the two baselines, AdaIN-VC performs better than AUTOVC overall. Our IDE-VC outperforms both baselines on all metrics.
All three tested models use encoder-decoder transfer frameworks; the superior performance of IDE-VC again indicates the benefit of explicitly disentangling the style and content embeddings. We also empirically evaluate the disentanglement by predicting speaker identity from only the content embeddings. A two-layer fully-connected network is trained on the testing set with a content embedding as input and the corresponding speaker identity as output. We compare our IDE-VC with AUTOVC and AdaIN-VC, which also output content embeddings. The classification results are shown in Table 3. Our IDE-VC reaches the lowest classification accuracy, indicating that the content embeddings learned by IDE-VC contain the least speaker-related information. Therefore, IDE-VC learns disentangled representations of higher quality than the other baselines.

The ablation study results are shown in Table 4. We compare our model with two variants trained with only part of the loss function in (11), while keeping all other training setups unchanged, including the model structure. When the model is trained without the style encoder loss term Î_1, a transferred voice is still generated, but with a large distance to the ground truth; the verification accuracy also drops significantly, since no speaker-related information is utilized. When the disentangling term Î_3 is removed, the model still reaches competitive performance, because the style encoder E_s and decoder D are well trained by Î_1 and Î_2. However, adding the term Î_3 disentangles the style and content spaces, improving transfer quality with higher verification accuracy and less distortion. The performance without the term Î_2 is not reported, because without the reconstruction loss the model cannot generate fluent speech.
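The speaker-probe evaluation described above can be sketched as follows. For brevity this uses a linear softmax probe trained by gradient descent rather than the paper's two-layer network; the `linear_probe_accuracy` helper and its hyperparameters are illustrative assumptions. Low probe accuracy on content embeddings indicates little leaked speaker information.

```python
import numpy as np

def linear_probe_accuracy(z, labels, steps=200, lr=0.5):
    """Train a softmax linear probe to predict speaker id from embeddings z.

    z: (n, d) embeddings; labels: (n,) integer speaker ids in [0, k).
    Returns training accuracy of the fitted probe.
    """
    n, d = z.shape
    k = int(labels.max()) + 1
    y = np.eye(k)[labels]                    # one-hot targets
    w = np.zeros((d, k))
    for _ in range(steps):
        logits = z @ w
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)    # softmax probabilities
        w -= lr / n * z.T @ (p - y)          # cross-entropy gradient step
    return float(((z @ w).argmax(axis=1) == labels).mean())
```

On embeddings well separated by speaker the probe reaches near-perfect accuracy, while on speaker-independent embeddings it stays near chance, which is the direction of the comparison reported in Table 3.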

6. CONCLUSIONS

We have improved the encoder-decoder voice style transfer framework via disentangled latent representation learning. To effectively induce the style and content information of speech into independent latent embedding spaces, we minimize a sample-based mutual information upper bound between the style and content embeddings. The disentanglement of the two embedding spaces supports transfer accuracy by preventing information from leaking between them. We have also derived two new multi-group mutual information lower bounds, which are maximized during training to enhance the representativeness of the latent embeddings. On the real-world VCTK dataset, our model outperforms previous works under both many-to-many and zero-shot voice style transfer setups. Our model can naturally be extended to other style transfer tasks on time-evolving sequences, e.g., video and music style transfer. Moreover, our general multi-group mutual information lower bounds have broader potential applications in other representation learning tasks.



https://github.com/resemble-ai/Resemblyzer
https://www.mturk.com/
For the Blow model, we use the official implementation available on GitHub (https://github.com/joansj/blow). We report the best result we could obtain after training for 100 epochs (11.75 GPU days on an Nvidia V100).



Figure 2: Left: t-SNE visualization for speaker embeddings. Right: t-SNE visualization for content embedding. The embeddings are extracted from the voice samples of 10 different speakers.

Table 1: Many-to-many VST evaluation results. For all metrics except Distance, higher is better.

Table 2: Zero-shot VST evaluation results. For all metrics except Distance, higher is better.



Table 4: Ablation study with different training losses. Performance is measured by objective metrics.

ACKNOWLEDGEMENTS

This research was supported in part by the DOE, NSF and ONR.

I o 4 < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " Z H / b 1 M 8 c 8 f R i 8 a 9e m j 4 W 2 H g h 8 n J O Q e 0 + U S W H R 9 7 + 9 j c 2 t 7 Z 3 d 0 l 5 5 v 3 J w e F Q 9 r r S t z g 3 j L a a l N t 2 I W i 6 F 4 i 0 U K H k 3 M 5 y m k e S d a H w 7 y z v P 3e m j 4 W 2 H g h 8 n J O Q e 0 + U S W H R 9 7 + 9 j c 2 t 7 Z 3 d 0 l 5 5 v 3 J w e F Q 9 r r S t z g 3 j L a a l N t 2 I W i 6 F 4 i 0 U K H k 3 M 5 y m k e S d a H w 7 y z v P 3< l a t e x i t s h a 1 _ b a s e 6 4 = " m q g h t q R z S p y 6 E Z U k m H n q c T q J 5 K 8 g e D Y t 8 W L w / 6 I / f d K u y g 5 6 i Z + g 5 K t A r N E b v 0 B G a I I o a 9 B l 9 Q V + T T 8 m 3 5 H v y Y 0 N N e l 3 O E 7 R l y c 8 / D s / 8 I A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " m q g h t q R z S p y 6 E Z U k m H n q c T q J 5 K 8 g e D Y t 8 W L w / 6 I / f d K u y g 5 6 i Z + g 5 K t A r N E b v 0 B G a I I o a 9 B l 9 Q V + T T 8 m 3 5 H v y Y 0 N N e l 3 O E 7 R l y c 8 / D s / 8 I A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " m q g h t q R z S p y 6 E Z U k m H n q c T q J 5 K 8 y X 9 X z s a m + A m 7 E d x e x v u g v P R s M i H x c e c H J A X 5 C V 5 R Q r y l o z J B 3 J G J o S S m n w l 3 5 I v y f f k x 3 a P k t 5 u o Y 7 J X i Q / / w D 6 s v Q c < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = "x T 8 r e 7 Z p Z P k h u H C F t + C G e A 0 e g n f A m w S J t I x l 6 d M 3 M 5 9 n x l M 2 U j j M 8 1 + 9 5 M 7 d e / c f 7 D 1 M H + 0 / f v L 0 4 H D / z J n W M j 5 m R h p 7 U Y L j U m g + R o G S X z S W g y o l P y 8 X J 5 3 / / I p b J 4 z + h K u G T x X M t Z g J B h g p S k v l q W r D p Y / 3 o J 8 P 8 7 V l t 0 G x B X 2 y t d P L w 9 5 v W9 m O R g w I k + P p V m f B r T 5 W b Y e 7 a k p z 7 f 3 w N V 2 K q p t U 3 p 3 g j 6 g a Y d 0 v 6 F V T g 0 a j / K u w V b N i X i N Y a 5 b h K H S S 4 W 8 7 B u v Y T V y Q 4 u Y 6 3 A Z n x 8 M i H x Y f c r J H n p M X 5 C U p y B s y I u / J K R k T R h r y h X w l 3 
5 L P y f f k x 2 a V k t 5 2 p 5 6 R H U t + / g E w c / q l < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = "x T 8 r e 7 Z p Z P k h u H C F t + C G e A 0 e g n f A m w S J t I x l 6 d M 3 M 5 9 n x l M 2 U j j M 8 1 + 9 5 M 7 d e / c f 7 D 1 M H + 0 / f v L 0 4 H D / z J n W M j 5 m R h p 7 U Y L j U m g + R o G S X z S W g y o l P y 8 X J 5 3 / / I p b J 4 z + h K u G T x X M t Z g J B h g p S k v l q W r D p Y / 3 o J 8 P 8 7 V l t 0 G x B X 2 y t d P L w 9 5 v W9 m O R g w I k + P p V m f B r T 5 W b Y e 7 a k p z 7 f 3 w N V 2 K q p t U 3 p 3 g j 6 g a Y d 0 v 6 F V T g 0 a j / K u w V b N i X i N Y a 5 b h K H S S 4 W 8 7 B u v Y T V y Q 4 u Y 6 3 A Z n x 8 M i H x Y f c r J H n p M X 5 C U p y B s y I u / J K R k T R h r y h X w l 3 5 L P y f f k x 2 a V k t 5 2 p 5 6 R H U t + / g E w c / q l < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " E nr / y h j 5 L + b z s a 6 t B N W J D i 6 j p c B y e j Y Z E P i / d 5 f / y m W 5 U 9 9 B Q 9 Q 8 9 R g V 6 h M X q H j t E E U d S g z + g L + p p 8 S r 4 l 3 5 M f W 2 r S 6 3 K e o B 1 L f v 4 B D Y / 8 H A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " m q g h t q R z S p y 6 E Z U k m H n q c T q J 5 K 8 l y c 8 / D s / 8 I A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " m q g h t q R z S p y 6 E Z U k m H n q c T q J 5 K 8l y c 8 / D s / 8 I A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " m q g h t q R z S p y 6 E Z U k m H n q c T q J 5 K 8l y c 8 / D s / 8 I A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " m q g h t q R z S p y 6 E Z U k m H n q c T q J 5 K 8l y c 8 / D s / 8 I A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " m q g h t q R z S p y 6 E Z U k m H n q c T q J 5 K 8l y c 8 / D s / 8 I A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " m q g h t q R z S p y 6 E Z U k m H n q c T q J 5 K 8r 6 3 A T n A / 6 W d r P 3 h 9 1 h y f t q u y h 5 + g F e o k y 9 A Y 
N 0 T t 0 h k a I o g p 9 R l / Q 1 + h T 9 C 3 6 H v 3 Y U q N O m / M M 7 V j 0 8 w 8 R c P w h < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " 9x Y t m U m l z v D v E c g h l N 8 P L q i Q K t H S v f K t m + L w E Y o x e + U P f S P q / 7 W g o Q z d h Q b L r 6 3 A T n A / 6 W d r P 3 h 9 1 h y f t q u y h 5 + g F e o k y 9 A Y N 0 T t 0 h k a I o g p 9 R l / Q 1 + h T 9 C 3 6 H v 3 Y U q N O m / M M 7 V j 0 8 w 8 R c P w h < / l a t e x i t > < l a t e x i t s h a x Y t m U m l z v D v E c g h l N 8 P L q i Q K t H S v f K t m + L w E Y o x e + U P f S P q / 7 W g o Q z d h Q b L r 6 3 A T n A / 6 W d r P 3 h 9 1 h y f t q u y h 5 + g F e o k y 9 A Y N 0 T t 0 h k a I o g p 9 R l / Q 1 + h T 9 C 3 6 H v 3 Y U q N O m / M M 7 V j 0 8 w 8 R c P w h < / l a t e x i t > Based on the criterion for s in equation ( 7), a well-learned style encoder E s pulls all style embeddings s ui from speaker u together. Suppose s u is representative of the style embeddings of set X u . If we parameterize the distribution q φ (x|s, c) ∝ exp(-x -D(s, c) 2 ) with decoder D(s, c), then based on Theorem 3.2, we can estimate the lower bound of I(x; c|s) with the following objective: When maximizing Î2 , for speaker u with his/her given voice style s u , we encourage the content embedding c ui to well reconstruct the original voice x ui , with small x ui -D(c ui , s u ) . Additionally, the distance x uj -D(c ui , s u ) is minimized, ensuring c ui does not contain information to reconstruct other voices x uj from speaker u. With Î2 , the correlation between x ui and c ui is amplified, which improves c ui in preserving the content information.
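To make the contrastive structure of this objective concrete, the sketch below evaluates it for a single speaker: each utterance's own reconstruction error is penalized, while errors in reconstructing the speaker's other utterances from the same content code are rewarded. This is a minimal NumPy illustration, not the training code; the `decode` function is a hypothetical stand-in for the decoder D(s, c).

```python
import numpy as np

def i2_lower_bound(X, C, s_u, decode):
    """Contrastive reconstruction objective for one speaker u.

    X      : (N, d) utterance features x_ui
    C      : (N, k) content embeddings c_ui
    s_u    : style embedding shared by the speaker's utterances
    decode : hypothetical stand-in for the decoder D(s, c)
    """
    N = X.shape[0]
    total = 0.0
    for i in range(N):
        recon = decode(s_u, C[i])                # D(s_u, c_ui)
        own_err = np.sum((X[i] - recon) ** 2)    # pull x_ui toward its reconstruction
        other_err = np.mean(np.sum((X - recon) ** 2, axis=1))  # push other x_uj away
        total += -own_err + other_err
    return total / N

# Toy check: if content codes carry full reconstruction information,
# the own-reconstruction term vanishes and the objective is non-negative.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
C = X.copy()                  # pretend c_ui perfectly encodes x_ui
s_u = np.zeros(2)
decode = lambda s, c: c       # hypothetical identity decoder
print(i2_lower_bound(X, C, s_u, decode) >= 0.0)  # -> True
```

In training, the gradient of this quantity flows through both the content encoder and the decoder, so the two terms jointly shape the content embedding space.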

3.3. MI UPPER BOUND ESTIMATION

The crucial part of our framework is disentangling the style and content embedding spaces, which requires (ideally) that the style embedding s exclude any content information and vice versa. Therefore, the mutual information between s and c should be minimized. To estimate I(s; c), we derive a sample-based MI upper bound in Theorem 3.3 based on (4). The upper bound in (9) requires the ground-truth conditional distribution p(s|c), whose closed form is unknown. Therefore, we use a probabilistic neural network q_θ(s|c) to approximate p(s|c) by maximizing the log-likelihood

$$\mathcal{F}(\theta) = \sum_{u=1}^{M}\sum_{i=1}^{N_u}\log q_\theta(s_{ui}\,|\,c_{ui}).$$

With the learned q_θ(s|c), the
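A sample-based upper bound of this kind can be sketched as follows. For illustration we assume a Gaussian q_θ(s|c) with a fixed variance and a given mean network μ(c) (in practice μ is trained by maximizing the log-likelihood F(θ) above); the estimator contrasts the likelihood of matched pairs (s_i, c_i) against mismatched pairs, in the style of CLUB-type estimators. The exact form of the bound in (9) may differ.

```python
import numpy as np

def mi_upper_bound(S, mu_C, sigma=1.0):
    """CLUB-style sample-based MI upper bound estimate.

    S     : (N, d) style embeddings s_ui
    mu_C  : (N, d) means mu(c_ui) predicted by q_theta from the content codes
    Assumes q_theta(s|c) = N(mu(c), sigma^2 I) -- an illustrative choice.
    """
    # log q(s_i | c_j) for all pairs, up to a constant shared by both terms
    sq = ((S[:, None, :] - mu_C[None, :, :]) ** 2).sum(-1)
    logq = -sq / (2.0 * sigma ** 2)
    positive = np.mean(np.diag(logq))   # E[log q(s_i | c_i)], matched pairs
    negative = np.mean(logq)            # expectation over all (i, j) pairs
    return positive - negative

rng = np.random.default_rng(1)
C = rng.normal(size=(64, 4))
S_dep = C + 0.1 * rng.normal(size=(64, 4))   # style correlated with content
S_ind = rng.normal(size=(64, 4))             # style independent of content
# With a well-fit mean network (here mu(c) = c), dependent embeddings
# yield a larger estimate than independent ones.
print(mi_upper_bound(S_dep, C) > mi_upper_bound(S_ind, C))  # -> True
```

During training, this estimate is minimized with respect to the two encoders, driving the style and content spaces toward independence.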

5. EXPERIMENTS

We evaluate our IDE-VC on a real-world voice dataset under both many-to-many and zero-shot VST setups. The selected dataset is the CSTR Voice Cloning Toolkit (VCTK) (Yamagishi et al., 2019), which includes 46 hours of audio from 109 speakers. Each speaker reads a different set of utterances, and the training voices are provided in a non-parallel manner. The audios are downsampled to 16 kHz. In the following, we first describe the evaluation metrics and the implementation details, and then analyze our model's performance relative to other baselines under the many-to-many and zero-shot VST settings.

5.1. EVALUATION METRICS

Objective Metrics We consider two objective metrics: speaker verification accuracy (Verification) and the Mel-Cepstral Distance (Distance) (Kubichek, 1993). The speaker verification accuracy measures whether the transferred voice belongs to the target speaker. For a fair comparison, we use a third-party pre-trained speaker encoder, Resemblyzer 1, to classify the speaker identity of the transferred voices. Specifically, style centroids for speakers are learned from ground-truth voice samples. For a transferred voice, we encode it with the pre-trained speaker encoder and take the speaker with the closest style centroid as the identity prediction. For the Distance, the vanilla Mel-Cepstral Distance (MCD) cannot handle the time-alignment issue described in Section 2. To make reasonable comparisons between the generation and the ground truth, we apply the Dynamic Time Warping (DTW) algorithm (Berndt & Clifford, 1994) to automatically align the time-evolving sequences before calculating MCD. This DTW-MCD distance measures the similarity of the transferred voice to the real voice from the target speaker. Since the calculation of DTW-MCD requires parallel data, we select voices with the same content from the VCTK dataset as testing pairs. We then transfer one voice in each pair and calculate DTW-MCD with the other voice as the reference.

Subjective Metrics Following Wester et al. (2016), we use the naturalness of the speech (Naturalness) and the similarity of the transferred speech to the target identity (Similarity) as subjective metrics. For Naturalness, annotators are asked to rate each transferred speech on a scale from 1 to 5. For Similarity, annotators are presented with two audios (the converted speech and the corresponding reference) and asked to rate the score from 1 to 4. For both scores, higher is better. Following the setting in Blow (Serrà et al., 2019), we report Similarity as a total percentage derived from the binary rating.
The evaluation of both subjective metrics is conducted on Amazon Mechanical Turk (MTurk) 2 . More details about evaluation metrics are provided in the Supplementary Material.
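The DTW-MCD computation can be sketched as follows. This minimal implementation assumes plain Euclidean frame distances, the standard three-step DTW recursion, the Kubichek constant 10·√2 / ln 10, and normalization by the longer sequence length; the step constraints and normalization used in our actual evaluation may differ.

```python
import numpy as np

def dtw_mcd(ref, gen):
    """Mel-cepstral distance (dB) after dynamic-time-warping alignment.

    ref, gen : (T, d) mel-cepstral sequences (frames x coefficients).
    """
    T1, T2 = len(ref), len(gen)
    # pairwise Euclidean distances between all frames
    d = np.sqrt(((ref[:, None, :] - gen[None, :, :]) ** 2).sum(-1))
    # classic DTW cost accumulation with (insert, delete, match) steps
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = d[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # MCD scaling constant from Kubichek (1993)
    k = 10.0 / np.log(10.0) * np.sqrt(2.0)
    # normalize by the longer sequence as a simple path-length proxy
    return k * acc[T1, T2] / max(T1, T2)

x = np.random.default_rng(2).normal(size=(20, 13))
print(dtw_mcd(x, x))  # identical sequences align perfectly -> 0.0
```

The DTW step makes the metric insensitive to timing differences between the transferred voice and the reference, so only spectral mismatch is penalized.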

5.2. IMPLEMENTATION DETAILS

Following AUTOVC (Qian et al., 2019), our model inputs are represented as mel-spectrograms. The number of mel-frequency bins is set to 80. When voices are generated, we adopt the WaveNet vocoder (Oord et al., 2016), pre-trained on the VCTK corpus, to invert the spectrogram signal back to a waveform. The spectrogram is first upsampled with deconvolutional layers to match the sampling rate, and then a standard 40-layer WaveNet is applied to generate the speech waveform. Our model is implemented in PyTorch and takes one GPU day on an Nvidia Xp to train.

Encoder Architecture The speaker encoder consists of a 2-layer long short-term memory (LSTM) network with cell size 768 and a fully-connected layer with output dimension 256. The speaker encoder is initialized with weights from a pre-trained GE2E (Wan et al., 2018) encoder. The input of the content encoder is the concatenation of the mel-spectrogram signal and the corresponding speaker embedding. The content encoder consists of three convolutional layers with 512 channels and a two-layer bidirectional LSTM with cell dimension 32. Following the setup in AUTOVC (Qian et al., 2019), the forward and backward outputs of the bidirectional LSTM are downsampled by 16.

Decoder Architecture Following AUTOVC (Qian et al., 2019), the initial decoder consists of a three-layer convolutional neural network (CNN) with 512 channels, three LSTM layers with cell dimension 1024, and another convolutional layer that projects the LSTM output to dimension 80. To enhance the quality of the spectrogram, following AUTOVC (Qian et al., 2019), we use a post-network consisting of five convolutional layers, with 512 channels for the first four layers, and

