HOW DISTINGUISHABLE ARE VOCODER MODELS? ANALYZING VOCODER FINGERPRINTS FOR FAKE AUDIO

Abstract

In recent years, vocoders powered by deep neural networks (DNNs) have found much success in the task of generating raw waveforms from acoustic features, as the generated audio becomes increasingly realistic. This development, however, raises challenges, especially in the field of forensics, where attributing audio to real or generated sources is vital. To our knowledge, our investigation constitutes the first effort to establish the existence of vocoder fingerprints and to analyze them. In this paper, we present our findings in identifying the sources of generated audio waveforms. Our experiments on the multi-speaker LibriTTS dataset show that (1) vocoder models do leave model-specific fingerprints on the audio they generate, and (2) even minor differences in vocoder training can result in fingerprints distinct enough to distinguish the resulting models. We believe these differences are strong evidence that there exist vocoder-specific fingerprints that can be exploited for source identification purposes.

1. INTRODUCTION

Over the past decades, advancements in text-to-speech (TTS) and voice conversion (VC) technologies have permitted the artificial generation of speech audio. As the last step in TTS or VC pipelines, generating raw waveforms from acoustic features, vocoders are important components of these tasks. With the development of deep learning, deep neural networks (DNNs) have seen increasing usage in both TTS and VC. Recent advancements in these areas (Wang et al., 2017; Shen et al., 2018; Kalchbrenner et al., 2018; Li et al., 2019; Ren et al., 2022) have led to the capacity to generate increasingly realistic and natural audio. While such technological progress is certainly welcome, it also comes with challenges, especially when one considers the potential misuse and abuse of TTS and VC technologies, whether inadvertent or malicious. The risks and challenges associated with these technologies mostly arise in the following areas:

Audio forensics. As TTS technologies generate increasingly realistic audio, they also become increasingly prone to being misused, by their very nature, to mislead or even defraud victims who cannot judge the authenticity of audio. This possibility of misuse and abuse evidently gives rise to concerns about security and public order. It is therefore in the public interest to combat the risk of malicious usage by devising effective forensic strategies. While modern audio forensic techniques used in challenges such as ASVspoof (Wu et al., 2015; Wang et al., 2020; Yamagishi et al., 2021) and ADD (Yi et al., 2022) have achieved effective results in separating fake audio from real audio, they focus on binary classification between real and fake sets. There is nonetheless also an interest in going beyond the constraints of binary real/fake classification and actually pinpointing the source responsible for generating a given piece of fake audio.
Intellectual property protection. Mainstream TTS and VC systems such as Shen et al. (2018), Arik et al. (2018), and Ren et al. (2022) are built with much effort, as they require careful design and fine-tuning of models and algorithms, as well as extensive training with more data than can reasonably be collected by any one person. As such, it is in the interest of researchers and commercial software developers alike to ensure that they are properly recognized as holding the ownership and copyright of their intellectual property, the enforcement of which would not be feasible without the ability to detect unauthorized usage of a system. While there have been efforts to watermark a given DNN model (Zhang et al., 2018), these efforts require crafted inputs on models that have been specifically tweaked for watermarking during training, instead of relying on properties intrinsic to the vocoder model when performing waveform generation.

Owing to the reasons above, it is important to be able to identify the source of fake audio. Given the advancement of TTS and VC technologies, and judging by the results obtained by other scholars, where generated audio is often judged to be quite natural and realistic, we consider it infeasible to perform such scrutiny and source identification manually. Consequently, this calls for a way to automatically extract relevant information from audio samples and classify fake audio by its source of generation, with minimal human intervention. While the idea of source-dependent fingerprints has been explored for GAN-generated images (Yu et al., 2019; Marra et al., 2019), those efforts rely on image properties and are not directly applicable to waveforms generated by vocoders. While there has been some preliminary investigation of the concept of vocoder fingerprints (Yan et al., 2022), that study operates only at the architecture-specific level. We seek to establish the existence of fingerprints that allow for distinguishing models sharing the same architecture, separating audio generated by different vocoder models, and identifying their respective sources of generation. In this paper, we put forth and seek to answer the following questions:

1. Do vocoder models leave model-specific fingerprints on the audio they generate?
2. What is the condition for their existence? In other words, how similar can two vocoder model instances be before it becomes infeasible to distinguish between them?
3. How do we present these fingerprints for the purposes of fake audio source identification?

To answer these questions, we perform experiments to evaluate our hypothesis that a vocoder model leaves a model-specific fingerprint on the audio it generates, and we explore how these fingerprints may be visualized and exploited for the aforementioned purposes of audio forensics and intellectual property protection.
Our experiments confirm the existence of these fingerprints and show that they depend on the model architecture, training data, initialization seed, and training hyperparameters, as shown in Figure 1.
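To make the idea of exploiting fingerprints for source identification concrete, the sketch below builds an average log-magnitude-spectrum "fingerprint" per model and attributes a clip to the nearest centroid. Everything here is hypothetical: the synthetic clips, the model names (`vocoder_a`, `vocoder_b`), and the injected per-model spectral components are illustrative stand-ins, not the method or data of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
SR, N = 16000, 4096

def clip(extra_hz):
    """Synthetic stand-in for one vocoder's output: a common harmonic
    tone plus a model-specific spectral component and noise."""
    t = np.arange(N) / SR
    x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
    x += 0.3 * np.sin(2 * np.pi * extra_hz * t)  # per-model coloration
    return x + rng.normal(scale=0.05, size=N)

def fingerprint(clips):
    """Fingerprint = average log-magnitude spectrum over a set of clips."""
    return np.mean([np.log1p(np.abs(np.fft.rfft(c))) for c in clips], axis=0)

# "Enroll" one fingerprint (centroid) per hypothetical vocoder model.
models = {"vocoder_a": 1000.0, "vocoder_b": 3000.0}
centroids = {m: fingerprint([clip(hz) for _ in range(20)])
             for m, hz in models.items()}

def attribute(c):
    """Nearest-centroid source attribution for a single clip."""
    f = fingerprint([c])
    return min(centroids, key=lambda m: np.linalg.norm(f - centroids[m]))

print(attribute(clip(3000.0)))  # attributed to vocoder_b
```

The artifacts left by real vocoders are of course far subtler than the injected tones above; the example only illustrates the general recipe of extracting a model-dependent feature and matching against known sources.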

2. RELATED WORK

As our work deals with finding vocoder-specific features to aid in fake audio source identification, a task that falls within the purview of forensics, it behooves us to review some of the work done by other scholars in the areas of vocoders and audio forensics.
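For readers less familiar with vocoders, the following minimal sketch shows the task itself, recovering a raw waveform from an acoustic feature (here an STFT magnitude spectrogram), using the classical, non-neural Griffin-Lim algorithm. The STFT parameters and iteration count are illustrative choices, and scipy is assumed available; DNN vocoders replace this iterative procedure with a learned model.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=512, noverlap=384, seed=0):
    """Recover a waveform whose STFT magnitude matches `mag` by
    iteratively re-estimating phase (Griffin & Lim, 1984)."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    x = None
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(spec))  # keep phase, reimpose magnitude
    return x

# Round trip: waveform -> magnitude spectrogram -> reconstructed waveform.
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
_, _, spec = stft(wave, nperseg=512, noverlap=384)
recon = griffin_lim(np.abs(spec))
```

Because the reconstruction is driven only by the magnitude spectrogram, the result matches the original up to phase ambiguities, which is precisely the kind of model-dependent detail in which fingerprints can hide.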




Figure 1: A t-SNE projection of vocoder fingerprints belonging to models of different architectures, initialization seeds, and hyperparameters, obtained in our experiments. ("PWG" refers to the Parallel WaveGAN architecture (Yamamoto et al., 2020); "Full" refers to the clean subset of LibriTTS (see section 4.1); SS1 consists of 50,000 randomly selected samples from the "Full" set.)
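A projection in the spirit of Figure 1 can be sketched as follows. The fingerprint vectors here are synthetic Gaussian blobs, and the labels (`arch_A`, `PWG (seed 1)`, `PWG (seed 2)`) are hypothetical stand-ins for trained models; scikit-learn is assumed available.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical 64-dim fingerprint vectors: one Gaussian blob per model.
n_per_model, dim = 20, 64
centers = {"arch_A": 0.0, "PWG (seed 1)": 1.0, "PWG (seed 2)": 2.0}
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(n_per_model, dim))
               for c in centers.values()])
labels = [name for name in centers for _ in range(n_per_model)]

# Project the fingerprints to 2-D for plotting, as in Figure 1.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # (60, 2)
```

With real fingerprints, well-separated clusters in such a plot are what indicate that architecture, seed, and hyperparameter differences leave distinguishable traces.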

