DO YOU REMEMBER? OVERCOMING CATASTROPHIC FORGETTING FOR FAKE AUDIO DETECTION

Abstract

Current fake audio detection algorithms achieve promising performance on most datasets. However, their performance may degrade significantly when dealing with audio from a different dataset. Orthogonal weight modification, a method for overcoming catastrophic forgetting, does not account for the similarity of some audio across datasets, such as fake audio generated by the same algorithm and genuine audio. To overcome this limitation, we propose a continual learning algorithm for fake audio detection called Regularized Adaptive Weight Modification (RAWM). Specifically, when fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine to fake utterances. This adaptive modification direction ensures the network can detect fake audio on the new dataset while preserving knowledge learned from previous datasets, thus mitigating catastrophic forgetting. In addition, orthogonal weight modification on fake audio in the new dataset can skew the network's inference distribution on audio from previous datasets with similar acoustic characteristics, so we introduce a regularization constraint that forces the network to remember this distribution. We evaluate our approach across multiple datasets and obtain significant performance improvements in cross-dataset experiments.

1. INTRODUCTION

Fake audio detection has recently attracted increasing attention, driven by a series of challenges such as the ASVspoof challenge (Wu et al., 2015; Kinnunen et al., 2017; Todisco et al., 2019; Yamagishi et al., 2021) and the Audio Deep Synthesis Detection (ADD) challenge (Yi et al., 2022). In these competitions, deep neural networks have achieved great success. Large-scale pre-trained models have also gradually been applied to fake audio detection, achieving state-of-the-art results on several public datasets (Tak et al., 2022; Martín-Doñas & Álvarez, 2022; Lv et al., 2022; Wang & Yamagishi, 2021). Although fake audio detection achieves promising performance, it may degrade significantly when dealing with audio from another dataset. The diversity of audio poses a significant challenge to cross-dataset fake audio detection (Zhang et al., 2021b;a). Several approaches have been proposed to improve cross-dataset detection performance. Monteiro et al. (2020) proposed an ensemble learning method to improve the model's ability to detect unseen audio. Wang et al. (2020) designed a dual-adversarial domain-adaptive network to learn features that generalize across datasets. Both methods require some audio from the old dataset, but in many practical situations such data is almost impossible to obtain. For instance, once a company releases a pre-trained model to the public, it is infeasible for users to fine-tune it on the company's original data. Zhang et al. (2021b) proposed a data augmentation method to extract more robust features for cross-dataset detection, but it is only suitable for datasets with similar feature distributions. Ma et al. (2021) proposed the first continual learning method for fake audio detection, called Detecting Fake Without Forgetting (DFWF), inspired by Learning without Forgetting (LwF) (Li & Hoiem, 2017).
DFWF improves detection performance by fine-tuning on the new dataset and overcomes catastrophic forgetting by introducing regularization. Although the above methods are viable, they still have shortcomings: the first two require access to previous data, and DFWF degrades learning performance on new data. This paper, in contrast, aims to overcome catastrophic forgetting while positively influencing the acquisition of new knowledge, without any previous samples. For fake audio detection, we observe that most datasets are collected under clean conditions. On these datasets, genuine audio has a more similar feature distribution than fake audio; specifically, the variance of the feature distribution of genuine audio is smaller than that of fake audio (Ma et al., 2021). A few datasets, however, are collected under noisy conditions (Müller et al., 2022), which makes a great difference in the feature distributions of their genuine audio (Ma et al., 2022). Consequently, if we modify the model weights using the orthogonal weight modification (OWM) method (Zeng et al., 2019), which constrains weight updates to directions orthogonal to all previous inputs, most genuine audio cannot be trained efficiently. OWM assumes that any new data would damage learned knowledge because of its different feature distribution, but this assumption does not hold for fake audio detection: since genuine audio shares similar feature distributions across datasets, it is more efficient to train on it with a modification in the same direction as the previous inputs. To address these issues, we propose a continual learning approach named Regularized Adaptive Weight Modification (RAWM). Because genuine audio has a similar feature distribution across datasets, it is reasonable to modify model weights in the same direction as for the old data.
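To make the OWM mechanism concrete, the following is a minimal NumPy sketch of the recursive projector update described by Zeng et al. (2019): the projector starts at the identity and shrinks after each input, so that projected gradients stay approximately orthogonal to all previously seen inputs. Function names and the value of the decay constant `alpha` are our own illustrative choices, not the paper's notation.

```python
import numpy as np

def owm_update_projector(P, x, alpha=1e-3):
    """Recursively shrink projector P after observing input x.

    Starting from P = I, each input removes (approximately) its own
    direction from the projector, so P maps any vector into the subspace
    orthogonal to all inputs seen so far.
    """
    x = x.reshape(-1, 1)
    Px = P @ x
    return P - (Px @ Px.T) / (alpha + float(x.T @ Px))

def owm_project_gradient(P, grad):
    """Project a backprop gradient so the update does not disturb
    responses to previously seen inputs."""
    return P @ grad
```

For example, after updating with input `x`, the projected gradient of any loss has a near-zero component along `x`, which is precisely why, as argued above, genuine audio with a distribution similar to old data can no longer be trained efficiently under pure OWM.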
Specifically, if the proportion of fake audio is larger, the modification direction is closer to the orthogonal projector of the subspace spanned by all previous inputs; if the proportion of genuine audio is larger, the modification is closer to the previous input subspace. However, when the feature distributions of old and new genuine audio differ substantially, this mechanism alone is less effective. We address this issue by introducing a regularization constraint that forces the model to remember the old feature distribution without requiring prior knowledge. In addition, unlike experience-replay-based continual learning methods, RAWM does not require access to previous data, which makes it applicable in most situations. Finally, Figure 1a compares the optimization process of RAWM with those of stochastic gradient descent (SGD) and OWM.
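The two ideas above, adaptive mixing of the modification direction by the genuine/fake ratio, and a regularizer on the old model's inference distribution, can be sketched as follows. This is an illustrative reading of the mechanism, not the paper's exact formulation: the linear mixing scheme, the KL-divergence form of the regularizer, and all function names are our own assumptions.

```python
import numpy as np

def rawm_direction(P, grad, genuine_ratio):
    """Adaptively mix the OWM projector P with its complement.

    P            : projector onto the space orthogonal to past inputs
    grad         : backprop gradient for the current batch
    genuine_ratio: fraction of genuine utterances in the batch, in [0, 1]

    With mostly fake audio (genuine_ratio -> 0) the update is pushed
    toward the orthogonal subspace, as in OWM; with mostly genuine audio
    (genuine_ratio -> 1) it is pulled toward the span of previous inputs,
    exploiting the similarity of genuine speech across datasets.
    """
    Q = np.eye(P.shape[0]) - P        # projector onto the previous input subspace
    mix = (1.0 - genuine_ratio) * P + genuine_ratio * Q
    return mix @ grad

def distribution_regularizer(old_logits, new_logits):
    """Illustrative regularizer: KL divergence between the old and new
    models' softmax outputs, encouraging the fine-tuned network to keep
    its inference distribution on acoustically similar audio."""
    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(old_logits), softmax(new_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

When `P` is an exact projector, the two extremes recover OWM (`genuine_ratio = 0`) and a pure old-subspace update (`genuine_ratio = 1`), with mixed batches interpolating between them.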

Contributions:

We propose a regularized adaptive weight modification algorithm to overcome catastrophic forgetting in fake audio detection. Our method comprises two essential modules: adaptive weight modification (AWM) and regularization. AWM enables continual learning in the common case where genuine audio has a similar feature distribution across datasets, while the regularization eases the problem that genuine audio may have a different feature distribution in a few cases. Experimental results show that our proposed method outperforms several continual learning methods, including Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), LwF, OWM, and DFWF, in both acquiring new knowledge and overcoming forgetting. The code will be made publicly available.



Figure 1: Schematic of SGD, OWM, and RAWM. (a) With RAWM, the optimization process searches for configurations that perform well on both the old (blue area) and new (green area) datasets; a successfully optimized configuration θrawm stops inside the overlapping subspace. In contrast, the configuration θsgd obtained by SGD is optimized without considering forgetting, and the configuration θowm obtained by orthogonal weight modification cannot reach the overlapping region. (b) RAWM adaptively modifies the weight direction by introducing a projector orthogonal to the projector P proposed by OWM.

