DO YOU REMEMBER? OVERCOMING CATASTROPHIC FORGETTING FOR FAKE AUDIO DETECTION

Abstract

Current fake audio detection algorithms achieve promising performance on most datasets. However, their performance may degrade significantly when dealing with audio from a different dataset. Orthogonal weight modification, a method for overcoming catastrophic forgetting, does not account for the similarity of some audio across datasets, such as genuine audio or fake audio produced by the same algorithm. To overcome this limitation, we propose a continual learning algorithm for fake audio detection, called Regularized Adaptive Weight Modification (RAWM), to overcome catastrophic forgetting. Specifically, when fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine to fake utterances. This adaptive modification direction ensures that the network can detect fake audio in the new dataset while preserving its knowledge of previous datasets, thus mitigating catastrophic forgetting. In addition, purely orthogonal weight modification on fake audio in the new dataset would skew the network's inference distribution on audio in previous datasets with similar acoustic characteristics, so we introduce a regularization constraint that forces the network to remember this distribution. We evaluate our approach across multiple datasets and obtain a significant performance improvement in cross-dataset experiments.
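The mechanism sketched in the abstract can be illustrated with a minimal numerical example. The code below is not the authors' implementation: `owm_projector` follows the standard recursive-least-squares update used in orthogonal weight modification (OWM), while `rawm_step`, its name, signature, and the linear interpolation that mixes the raw gradient with its orthogonal projection according to the batch's genuine/fake ratio, are illustrative assumptions standing in for the adaptive direction described above (the regularization constraint is omitted).

```python
import numpy as np

def owm_projector(P, x, alpha=1e-3):
    """One recursive-least-squares update of an OWM projection matrix P.

    After the update, P @ x is close to zero, so gradients multiplied
    by P are (approximately) orthogonal to the past input x.
    """
    x = x.reshape(-1, 1)
    k = P @ x / (alpha + x.T @ P @ x)  # RLS gain vector
    return P - k @ (x.T @ P)

def rawm_step(W, grad, P, genuine_ratio, lr=0.5):
    """Hypothetical sketch of an adaptive update direction.

    Interpolates between the raw gradient (which fits the new data)
    and its orthogonal projection (which protects old knowledge),
    with a convex combination controlled by the fraction of genuine
    utterances in the batch. The interpolation rule is an assumption
    for illustration, not the paper's exact algorithm.
    """
    direction = (1.0 - genuine_ratio) * grad + genuine_ratio * (P @ grad)
    return W - lr * direction
```

With `genuine_ratio = 0` the step reduces to plain gradient descent on the new data; with `genuine_ratio = 1` the step is confined to the subspace orthogonal to the inputs stored in `P`, as in OWM.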

1. INTRODUCTION

Fake audio detection has attracted increasing attention since the organization of a series of challenges, such as the ASVspoof challenge (Wu et al., 2015; Kinnunen et al., 2017; Todisco et al., 2019; Yamagishi et al., 2021) and the Audio Deep Synthesis Detection (ADD) challenge (Yi et al., 2022). In these competitions, deep neural networks have achieved great success. Large-scale pre-trained models have gradually been applied to fake audio detection and have achieved state-of-the-art results on several public fake audio detection datasets (Tak et al., 2022; Martín-Doñas & Álvarez, 2022; Lv et al., 2022; Wang & Yamagishi, 2021). Although fake audio detection achieves promising performance, it may degrade significantly when dealing with audio from another dataset. The diversity of audio poses a significant challenge to fake audio detection across datasets (Zhang et al., 2021b;a). Some approaches have been proposed to improve detection performance across datasets. Monteiro et al. (2020) proposed an ensemble learning method to improve the model's ability to detect unseen audio. Wang et al. (2020) designed a dual-adversarial domain-adaptive network to learn features that generalize across different datasets. Both methods require some audio from the old dataset, but in some practical situations it is almost impossible to obtain. For instance, when a company releases a pre-trained model to the public, it is infeasible for the public to fine-tune it using data belonging to the original company. Zhang et al. (2021b) proposed a data augmentation method to extract more robust features for cross-dataset detection, which is suitable only for datasets with similar feature distributions. Ma et al. (2021) proposed the first continual learning method for fake audio detection, called Detecting Fake Without Forgetting (DFWF), inspired by Learning without Forgetting (LwF) (Li & Hoiem, 2017).
DFWF improves detection performance by fine-tuning on the new dataset and overcomes catastrophic forgetting by introducing regularization. Although the above methods are viable options, they still have shortcomings, such as the requirement for previous data in the first two methods and deteriorating

