CONTEXTSPEECH: EXPRESSIVE AND EFFICIENT TEXT-TO-SPEECH FOR PARAGRAPH READING

Abstract

Although Text-to-Speech (TTS) has made rapid progress in speech quality at the sentence level, it still faces many challenges in paragraph / long-form reading. Synthesizing a paragraph sentence by sentence and then concatenating the results causes inconsistency issues that hurt paragraph-level expressiveness, while directly modeling all the sentences in a paragraph incurs large computation / memory cost. In this paper, we develop a TTS system called ContextSpeech, which models the contextual information in a paragraph for coherence and expressiveness without largely increasing the computation or memory cost. On the one hand, we introduce a memory-cached recurrence mechanism that lets the current sentence see more history information on both the text and speech sides. On the other hand, we construct text-based semantic information in a hierarchical structure, which broadens the horizon and incorporates future information. Additionally, we use a linearized self-attention with compatible relative-position encoding to reduce the computation / memory cost. Experiments show that ContextSpeech significantly improves paragraph-level voice quality and prosody expressiveness in terms of both subjective and objective evaluation metrics. Furthermore, ContextSpeech achieves better model efficiency in both the training and inference stages.



These TTS models usually convert text to speech at the sentence level. In practice, however, many TTS scenarios synthesize audio at the paragraph level, such as news reading, audiobooks, audio content dubbing, or even dialogues composed of multiple interrelated sentences. Given the large variation of context in long-form content, speech concatenated sentence by sentence still shows an obvious gap to natural recordings of paragraph reading in perceptual evaluation. Meanwhile, the imbalanced distribution of TTS corpus data, with long-tail sentences such as extra-long or extra-short ones, makes it difficult for TTS systems to generate high-quality synthesized speech in such contexts. From our observation, sentence-level speech synthesis is limited as follows:

• Correlation between adjacent sentences. In paragraph reading, adjacent sentences naturally influence each other as semantic information flows between them. Sentence-level synthesis therefore lacks coherence between adjacent sentences and accurate expression.

• Efficiency and consistency on extra-long sentences. Synthesizing extra-long sentences usually leads to unstable results (e.g., bad alignment between text and speech) and high latency. Such sentences are generally cut into two or more segments and synthesized separately, which may cause inconsistent speech rate or prosody across the cut segments.

• Quality on extra-short sentences. As rare patterns in the corpus, sentences that are too short (e.g., consisting of one or two words) easily yield audio with poor quality (e.g., bad pronunciation or extremely slow speech rate).

Naturally, involving context-related information should help address the above issues and enable better paragraph-level text-to-speech. To be more specific, 1) for adjacent sentences, contextual information brings inter-sentence knowledge that guides the generation of the current sentence with a more appropriate speaking style.
2) For extra-long sentences, intra-sentence contextual information helps each position generate expressive and consistent audio. 3) For extra-short sentences, leveraging contextual information enlarges the perceptive scope to produce more stable results. Nevertheless, contextual information is difficult to define, and introducing external information into the model leads to high computation and memory cost. Thus, the challenges of this work are 1) how to construct and leverage effective contextual information; and 2) how to reduce the cost introduced by the external information.

(2018)-based embedding and pre-defined statistical information from contextual scripts. After that, we build a text-based contextual encoder to consume these features and then combine them with the phoneme embedding before the Conformer blocks. Such integration of the context information broadens the model's horizon from history to future and alleviates the one-to-many mapping issue in TTS Ren et al. (2020).

• To reduce the memory and computation cost, we use a linearized self-attention module to avoid the quadratic complexity of softmax self-attention. Meanwhile, we adopt a permute-based relative position encoding to fit the efficient self-attention under our memory-reuse framework.

We conduct experiments on a speech corpus of Chinese audiobooks.
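As a rough illustration of the complexity argument (not the paper's implementation), linearized self-attention replaces softmax(QKᵀ)V with a kernel feature map φ applied to queries and keys, so the key/value statistics can be summarized once in O(T·d·d_v) instead of forming the T×T attention matrix. The elu(x)+1 feature map below is a common choice from the linear-attention literature and is an assumption here, as are all names in the sketch:

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """Linearized self-attention sketch.

    q, k: (T, d) queries and keys; v: (T, d_v) values.
    Cost is O(T * d * d_v) rather than the O(T^2 * d) of softmax attention,
    because keys/values are summarized once instead of per query.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, a positive feature map
    q, k = phi(q), phi(k)
    kv = k.T @ v                                  # (d, d_v): global key-value summary
    z = q @ k.sum(axis=0, keepdims=True).T        # (T, 1): per-query normalizer
    return (q @ kv) / (z + eps)                   # (T, d_v)
```

Because φ is positive, each output row is a convex combination of the value rows, mirroring the averaging behavior of softmax attention while avoiding the quadratic attention matrix.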



Neural-network-based Text-to-Speech (TTS) has developed rapidly and brought significant breakthroughs. Models like Tacotron 1/2 Wang et al. (2017); Shen et al. (2018), Deep Voice 3 Ping et al. (2017), Transformer TTS Li et al. (2019) and FastSpeech 1/2 Ren et al. (2019; 2020), together with neural vocoders like WaveNet Oord et al. (2016), Parallel WaveNet Oord et al. (2018), WaveRNN Kalchbrenner et al. (2018) and MelGAN Kumar et al. (2019), can generate high-quality voices. More recently, the field has evolved to fully end-to-end models like VITS Kim et al. (2022) and NaturalSpeech Tan et al. (2022), which achieve high-fidelity, close-to-recording quality.

Based on this motivation, we propose the ContextSpeech model. The main contributions of this work are summarized as follows:

• We present a framework with cached hidden states to capture previous information, evolving from sentence-level to paragraph-level speech synthesis from the modeling perspective. We use a state-of-the-art sentence-level speech synthesis architecture, the Conformer Peng et al. (2021)-based TTS in Liu et al. (2021), as our baseline framework. This model captures local correlation well and produces more expressive speech than the Feed-Forward-Transformer-based FastSpeech. On top of that, following the segment-level recurrence mechanism proposed in Dai et al. (2019), the cached hidden state for each Conformer block in both the encoder and decoder transfers text and speech information from the previous segment to the current segment.
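A minimal sketch of what segment-level recurrence amounts to (names and shapes are illustrative assumptions, not the paper's code): each block keeps a cache of the previous segment's hidden states and prepends it to the current segment's states before attention, so the model sees history beyond the current sentence.

```python
import numpy as np

class CachedRecurrence:
    """Transformer-XL-style memory cache for one block (illustrative sketch).

    Hidden states of the previous segment are cached and prepended as extra
    attention context for the current segment; in a real model the cached
    states carry no gradient.
    """
    def __init__(self, mem_len):
        self.mem_len = mem_len   # how many past positions to keep
        self.memory = None       # cached hidden states, shape (<=mem_len, d)

    def extend_context(self, h):
        """Return [memory; h] as attention context, then refresh the cache."""
        if self.memory is None:
            context = h
        else:
            context = np.concatenate([self.memory, h], axis=0)
        # keep only the most recent mem_len states for the next segment
        self.memory = context[-self.mem_len:].copy()
        return context
```

With a fixed `mem_len`, each segment attends over at most `mem_len` extra positions, so per-segment cost stays bounded while information still flows across sentence boundaries.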

The results show that ContextSpeech generates more expressive and coherent paragraph audio than the baseline ConformerTTS model in terms of both objective and subjective evaluation. From our observation, it also noticeably alleviates the issues caused by extra-long and extra-short sentences. Ablation experiments demonstrate that both the proposed contextual model framework and the integration of the text-based contextual encoder are effective. Additionally, the efficiency optimization in ContextSpeech brings around 2x improvement in both memory tolerance and training speed, and the final model largely alleviates the efficiency issue of extra-long input compared with the baseline model.

Text-to-Speech (TTS) is a technique aimed at converting given text to speech automatically; its goals include naturalness, expressiveness, robustness and efficiency. Along with the flourishing of deep learning, conventional methods like concatenative synthesis Hunt & Black (1996) and statistical parametric synthesis Zen et al. (2009) have gradually been replaced by neural-network-based approaches Tan et al. (2021). Autoregressive acoustic models Wang et al. (2017); Shen et al. (2018); Ping et al. (2017); Li et al. (2019) and vocoders Oord et al. (2016); Kalchbrenner et al. (2018) largely improve the

