DEFENDING AGAINST ADVERSARIAL AUDIO VIA DIFFUSION MODEL

Abstract

Deep learning models have been widely used in commercial acoustic systems in recent years. However, adversarial audio examples can cause abnormal behaviors in those acoustic systems while being hard for humans to perceive. Various methods, such as transformation-based defenses and adversarial training, have been proposed to protect acoustic systems from adversarial attacks, but they are less effective against adaptive attacks. Furthermore, directly applying methods from the image domain can lead to suboptimal results because of the unique properties of audio data. In this paper, we propose an adversarial purification-based defense pipeline, AudioPure, for acoustic systems via off-the-shelf diffusion models. Taking advantage of the strong generation ability of diffusion models, AudioPure first adds a small amount of noise to the adversarial audio and then runs the reverse sampling steps to purify the noisy audio and recover the clean audio. AudioPure is a plug-and-play method that can be applied to any pretrained classifier without fine-tuning or re-training. We conduct extensive experiments on the speech command recognition task to evaluate the robustness of AudioPure. Our method is effective against diverse adversarial attacks (e.g., those bounded by the L2 or L∞ norm). It outperforms existing methods under both strong adaptive white-box and black-box attacks bounded by the L2 or L∞ norm (up to +20% in robust accuracy). Besides, we also evaluate the certified robustness against perturbations bounded by the L2 norm via randomized smoothing, where our pipeline achieves a higher certified accuracy than the baselines. Code is available at

1. INTRODUCTION

Deep neural networks (DNNs) have demonstrated great success in different tasks in the audio domain, such as speech command recognition, keyword spotting, speaker identification, and automatic speech recognition. Acoustic systems built on DNNs (Amodei et al., 2016; Shen et al., 2019) are applied in safety-critical applications ranging from making phone calls to controlling household security systems. Although DNN-based models have exhibited significant performance improvements, extensive studies have shown that they are vulnerable to adversarial examples (Szegedy et al., 2014; Carlini & Wagner, 2018; Qin et al., 2019; Du et al., 2020; Abdullah et al., 2021; Chen et al., 2021a), where attackers add imperceptible and carefully crafted perturbations to the original audio to mislead the system into incorrect predictions. Thus, it becomes crucial to design DNN-based acoustic systems that are robust against adversarial examples. To address this, existing works (e.g., Rajaratnam & Alshemali, 2018; Yang et al., 2019) leverage the temporal dependency property of audio to defend against adversarial examples, applying time-domain and frequency-domain transformations to the adversarial examples to improve robustness. Although they alleviate the problem to some extent, they remain vulnerable to strong adaptive attacks where the attacker has full knowledge of the whole acoustic system (Tramer et al., 2020). Another way to enhance robustness against adversarial examples is adversarial training (Goodfellow et al., 2015; Madry et al., 2018), in which adversarial perturbations are added during the training stage. Although it is acknowledged as the most effective defense, the training process requires expensive computational resources, and the model remains vulnerable to other types of adversarial examples that are dissimilar to those used during training (Tramer & Boneh, 2019).
Adversarial purification (Yoon et al., 2021; Shi et al., 2021; Nie et al., 2022) is another family of defense methods that uses generative models to purify the adversarial perturbations of input examples before they are fed into neural networks. The key to such methods is an effective generative model for purification. Recently, diffusion models have been shown to be state-of-the-art models for image (Song & Ermon, 2019; Ho et al., 2020; Nichol & Dhariwal, 2021; Dhariwal & Nichol, 2021) and audio synthesis (Kong et al., 2021; Chen et al., 2021b), motivating the community to use them for purification. In particular, in the image domain, DiffPure (Nie et al., 2022) applies diffusion models as purifiers and obtains good performance in terms of both clean and robust accuracy on various image classification tasks. Since such methods do not require training the model with pre-defined adversarial examples, they generalize to diverse threats. Given the significant progress of diffusion models in the image domain, it motivates us to ask: is it possible to obtain similar success in the audio domain? Unlike images, audio signals have some unique properties. There are different choices of audio representations, including raw waveforms and various types of time-frequency representations (e.g., Mel spectrogram, MFCC). When designing an acoustic system, particular audio representations may be selected as the target features, and defenses that work well on some features may perform poorly on others. In addition, one may think of treating the 2-D time-frequency representations (i.e., spectrograms) as images, with the frequency axis as height and the time axis as width, and directly applying the successful DiffPure (Nie et al., 2022) from the image domain to the spectrogram.
Despite its simplicity, this approach has two major issues: i) the acoustic system can take audio with variable time duration as input, while the underlying diffusion model within DiffPure can only handle inputs with fixed width and height; ii) even if we apply it in a fixed-length, segment-wise manner, it still achieves suboptimal results, as we demonstrate in this work. These unique issues pose a new challenge for designing and evaluating defense systems in the audio domain. In this work, we aim to defend against diverse unseen adversarial examples without adversarial training. We propose a plug-and-play purification pipeline named AudioPure based on a pre-trained diffusion model, leveraging the unique properties of audio. Specifically, our pipeline consists of two main components: (1) a waveform-based diffusion model and (2) a classifier. It takes the audio waveform as input and leverages the diffusion model to purify the adversarial audio perturbations. Given an adversarial input in waveform format, AudioPure first adds a small amount of noise via the diffusion process to override the adversarial perturbations, and then uses the truncated reverse process to recover the clean sample. The recovered sample is fed into the classifier. We conduct extensive experiments to evaluate the robustness of our method on the task of speech command recognition. We carefully design adaptive attacks so that the attacker can accurately compute the full gradients, in order to evaluate the effectiveness of our method. In addition, we comprehensively evaluate the robustness of our method against different black-box attacks and the Expectation Over Transformation (EOT) attack. Our method shows better performance under both white-box and black-box attacks across diverse adversarial examples.
Moreover, we also evaluate the certified robustness of AudioPure via randomized smoothing, which offers a provable guarantee of model robustness against L2-bounded perturbations. We show that our method achieves better certified robustness than the baselines. Specifically, our method obtains a significant improvement (up to +20% in robust accuracy) compared to adversarial training, and over 5% higher certified robust accuracy than the baselines. To the best of our knowledge, we are the first to use diffusion models to enhance the security of acoustic systems and to investigate how the working domain of a defense affects adversarial robustness.

2. RELATED WORK

Adversarial attacks and defenses. Szegedy et al. (2014) introduce adversarial examples, which look similar to normal examples but fool neural networks into giving incorrect predictions. Usually, adversarial examples are constrained by an Lp norm to ensure imperceptibility. Recently, stronger attack methods have emerged (Madry et al., 2018; Carlini & Wagner, 2017; Andriushchenko et al., 2020; Croce & Hein, 2020; Xiao et al., 2018a;b; 2019; 2022b;a; Cao et al., 2019b;a; 2022a). In the audio domain, Carlini & Wagner (2018) introduce audio adversarial examples, and Qin et al. (2019) manage to make them more imperceptible. Black-box attacks (Du et al., 2020; Chen et al., 2021a) have also been developed, aiming to mislead end-to-end acoustic systems. To protect neural networks from adversarial attacks, different defense methods have been proposed. The most widely used is adversarial training (Madry et al., 2018), which deliberately uses adversarial examples as training data. The main problems of adversarial training are the accuracy drop on benign examples and the expensive computational cost; many improved versions aim to alleviate these problems (Wong et al., 2020; Shafahi et al., 2019; Zhang et al., 2019b;a; Sun et al., 2021; Cao et al., 2022b; Zhang et al., 2019c). Another line of work is adversarial purification (Yoon et al., 2021; Shi et al., 2021; Nie et al., 2022), which uses generative models to remove adversarial perturbations before classification. Both types of defenses are mainly developed for computer vision tasks and cannot be directly applied to the audio domain. In this paper, we explicitly design a defense pipeline according to the characteristics of audio data.

Speech processing. Many speech processing applications are vulnerable to adversarial attacks, including speech command recognition (Warden, 2018), keyword spotting (Chen et al., 2014; Li et al., 2019), speaker identification (Reynolds et al., 2000; Ravanelli & Bengio, 2018; Snyder et al., 2018), and speech recognition (Amodei et al., 2016; Shen et al., 2019; Ravanelli et al., 2019). In particular, speech command recognition is closely related to keyword spotting and can be viewed as speech recognition with a limited vocabulary. In this work, we choose speech command recognition as the testbed for the proposed AudioPure pipeline; the pipeline is also applicable to keyword spotting and speech recognition. A speech command recognition system consists of a feature extractor and a classifier. The feature extractor processes the raw audio waveform and outputs acoustic features, e.g. Mel spectrograms or Mel-frequency cepstral coefficients (MFCC). These features are fed into the classifier, which gives predictions. Given the 2-D spectrogram features, convolutional neural networks for images are readily applicable (Simonyan & Zisserman, 2015; He et al., 2016; Zagoruyko & Komodakis, 2016; Xie et al., 2017; Huang et al., 2017).
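As a concrete illustration of the front end described above, here is a minimal NumPy sketch of the first stage of such a feature extractor: framing and windowing the waveform, then taking FFT magnitudes. The frame length, hop size, and Hann window are illustrative choices; a real Mel front end would additionally apply a Mel-scale filterbank and log compression, which we only note in comments.

```python
import numpy as np

def stft_magnitude(wave, n_fft=512, hop=128):
    """Frame the waveform, apply a Hann window, and take FFT magnitudes.
    A full Mel front end would additionally map the frequency axis to the
    Mel scale with a triangular filterbank and apply a log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq bins, time frames)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)  # 1 s of toy audio at 16 kHz
spec = stft_magnitude(wave)
print(spec.shape)  # (n_fft // 2 + 1, n_frames)
```

The resulting 2-D array is exactly the kind of spectrogram feature that image-style convolutional classifiers consume.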

3.1. BACKGROUND OF DIFFUSION MODELS

A diffusion model normally consists of a forward diffusion process and a reverse sampling process. The forward diffusion process gradually adds Gaussian noise to the input data until the distribution of the noisy data converges to a standard Gaussian distribution. The reverse sampling process takes standard Gaussian noise as input and gradually denoises it to recover clean data. At present, diffusion models can be divided into two types: discrete-time diffusion models based on sequential sampling, such as SMLD (Song & Ermon, 2019), DDPM (Ho et al., 2020), and DDIM (Song et al., 2021a), and continuous-time diffusion models based on SDEs (Song et al., 2021c). Song et al. (2021c) also build the connection between these two types. Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) are among the most widely used diffusion models; many subsequently proposed diffusion models, including DiffWave for audio (Kong et al., 2021), are based on the DDPM formulation. In DDPM, both the diffusion and reverse processes are defined by Markov chains. For input data $x_0 \in \mathbb{R}^d$, we denote $x_0 \sim q(x_0)$ as the original data distribution, and $x_1, \ldots, x_N$ are intermediate latent variables with distributions $q(x_1 \mid x_0), \ldots, q(x_N \mid x_{N-1})$, where $N$ is the total number of steps. Given a pre-defined or learned variance schedule $\beta_1, \ldots, \beta_N$ (usually linearly increasing small constants), the forward transition probability $q(x_n \mid x_{n-1})$ can be formulated as
$$q(x_n \mid x_{n-1}) = \mathcal{N}\big(x_n; \sqrt{1-\beta_n}\, x_{n-1}, \beta_n I\big).$$
Based on the variance schedule $\{\beta_n\}$, a set of constants is defined as
$$\alpha_n = 1 - \beta_n, \qquad \bar{\alpha}_n = \prod_{i=1}^{n} \alpha_i, \qquad \tilde{\beta}_n = \begin{cases} \frac{1-\bar{\alpha}_{n-1}}{1-\bar{\alpha}_n}\,\beta_n, & n > 1 \\ \beta_1, & n = 1, \end{cases}$$
and using the reparameterization trick, we have
$$q(x_n \mid x_0) = \mathcal{N}\big(x_n; \sqrt{\bar{\alpha}_n}\, x_0, (1-\bar{\alpha}_n) I\big).$$
As $n$ grows to infinity, $q(x_n \mid x_0)$ converges to a standard Gaussian distribution.
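The forward process above can be sketched in a few lines of NumPy. The linear schedule endpoints mirror the values reported in Appendix A ($N = 200$, $\beta_1 = 0.0001$, $\beta_N = 0.02$); the toy waveform and the helper name `q_sample` are illustrative.

```python
import numpy as np

# Linear variance schedule; endpoints mirror Appendix A (N = 200,
# beta_1 = 0.0001, beta_N = 0.02). Arrays are 0-indexed: alpha_bars[n-1]
# holds abar_n.
N = 200
betas = np.linspace(1e-4, 0.02, N)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, n, rng):
    """Draw x_n ~ q(x_n | x_0) = N(sqrt(abar_n) x_0, (1 - abar_n) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[n - 1]) * x0 + np.sqrt(1.0 - alpha_bars[n - 1]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)          # toy 1 s waveform
x_noisy = q_sample(x0, n=150, rng=rng)
# abar_n shrinks toward 0, so q(x_n | x_0) approaches N(0, I) for large n.
print(alpha_bars[-1])
```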
Meanwhile, the reverse process is
$$x_{n-1} \sim p_\theta(x_{n-1} \mid x_n) = \mathcal{N}\big(x_{n-1}; \mu_\theta(x_n, n), \sigma_\theta^2(x_n, n) I\big),$$
where the mean $\mu_\theta(x_n, n)$ and the variance $\sigma_\theta^2(x_n, n)$ are parameterized by $\theta$. Ho et al. (2020) and Kong et al. (2021) use a neural network $\epsilon_\theta$ to define $\mu_\theta$, with $\sigma_\theta$ fixed to a constant:
$$\mu_\theta(x_n, n) = \frac{1}{\sqrt{\alpha_n}}\Big(x_n - \frac{\beta_n}{\sqrt{1-\bar{\alpha}_n}}\,\epsilon_\theta(x_n, n)\Big), \qquad \sigma_\theta(x_n, n) = \sqrt{\tilde{\beta}_n}.$$
We denote $x_n(x_0, \epsilon) = \sqrt{\bar{\alpha}_n}\, x_0 + \sqrt{1-\bar{\alpha}_n}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and the optimization objective is
$$\theta^\star = \arg\min_\theta \sum_{n=1}^{N} \lambda_n\, \mathbb{E}_{x_0, \epsilon}\, \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_n}\, x_0 + \sqrt{1-\bar{\alpha}_n}\,\epsilon, n\big) \big\|_2^2 \tag{6}$$
where $\lambda_n$ is a weighting coefficient (Ho et al., 2020). According to Song et al. (2021c), as $N \to \infty$, DDPM becomes VP-SDE, a continuous-time formulation of diffusion models. In particular, the forward SDE is
$$\mathrm{d}x = -\tfrac{1}{2}\beta(t)\,x\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}w, \tag{7}$$
where $t \in [0, 1]$, $\mathrm{d}t$ is an infinitesimal positive time step, $w$ is a standard Wiener process, and $\beta(t)$ is the continuous-time noise schedule. Similarly, the reverse SDE is
$$\mathrm{d}x = -\tfrac{1}{2}\beta(t)\big[x + 2\nabla_x \log p_t(x)\big]\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\bar{w}, \tag{8}$$
where $\mathrm{d}t$ is an infinitesimal negative time step and $\bar{w}$ is a reverse-time standard Wiener process.
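A single discrete reverse step under this parameterization can be sketched as follows. Here `eps_theta` is a placeholder for the trained noise-prediction network (it simply predicts zeros), and the schedule values are illustrative.

```python
import numpy as np

N = 200
betas = np.linspace(1e-4, 0.02, N)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
# beta_tilde_n = (1 - abar_{n-1}) / (1 - abar_n) * beta_n, beta_tilde_1 = beta_1.
beta_tildes = np.append(betas[0],
                        (1 - alpha_bars[:-1]) / (1 - alpha_bars[1:]) * betas[1:])

def reverse_step(x_n, n, eps_theta, rng):
    """One ancestral step x_{n-1} ~ N(mu_theta(x_n, n), beta_tilde_n I).
    Arrays are 0-indexed, so step n reads betas[n - 1] etc."""
    i = n - 1
    mu = (x_n - betas[i] / np.sqrt(1 - alpha_bars[i]) * eps_theta(x_n, n)) \
         / np.sqrt(alphas[i])
    if n == 1:
        return mu  # the final step is deterministic
    return mu + np.sqrt(beta_tildes[i]) * rng.standard_normal(x_n.shape)

eps_theta = lambda x, n: np.zeros_like(x)  # placeholder for the trained network
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
x_prev = reverse_step(x, n=3, eps_theta=eps_theta, rng=rng)
```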

3.2. AUDIOPURE: A PLUG-AND-PLAY DEFENSE FOR ACOUSTIC SYSTEMS

To standardize the formulation of the defense, as suggested by Nie et al. (2022), we use the continuous-time formulation defined by Eq. 7 and Eq. 8. Note that since the existing pretrained DiffWave models (Kong et al., 2021) are based on DDPM, we use their equivalent VP-SDE. If we solve the VP-SDE with the Euler-Maruyama method and step size $\Delta t = \frac{1}{N}$, sampling from the reverse-time SDE is equivalent to the reverse sampling of DDPM (detailed proofs can be found in Song et al. (2021c)). Under this prerequisite, we have $t = \frac{n}{N}$ with $n \in \{1, \ldots, N\}$, and define $\beta(\frac{n}{N}) := \beta_n$, $\bar{\alpha}(\frac{n}{N}) := \bar{\alpha}_n$, $\tilde{\beta}(\frac{n}{N}) := \tilde{\beta}_n$, and $x(\frac{n}{N}) := x_n$. Given an adversarial example $x_{\mathrm{adv}}$ as the input at $t = 0$, i.e. $x(0) = x_{\mathrm{adv}}$, we first run the forward SDE from $t = 0$ to $t^\star = \frac{n^\star}{N}$ by solving Eq. 7 (equivalent to running $n^\star$ DDPM steps), which yields
$$x(t^\star) = \sqrt{\bar{\alpha}(t^\star)}\, x_{\mathrm{adv}} + \sqrt{1 - \bar{\alpha}(t^\star)}\, z, \quad z \sim \mathcal{N}(0, I). \tag{9}$$
Next, we run the truncated reverse SDE from $t = t^\star$ to $t = 0$ by solving Eq. 8. Similar to Nie et al. (2022), we define an SDE solver $\texttt{sdeint}$ that uses the Euler-Maruyama method and sequentially takes six inputs: initial value, drift coefficient, diffusion coefficient, Wiener process, initial time, and end time. The reverse output $\hat{x}(0)$ at $t = 0$ can be formulated as
$$\hat{x}(0) = \texttt{sdeint}\big(x(t^\star), f_{\mathrm{rev}}, g_{\mathrm{rev}}, \bar{w}, t^\star, 0\big), \tag{10}$$
where the drift and diffusion coefficients are
$$f_{\mathrm{rev}}(x, t) := -\tfrac{1}{2}\beta(t)\big[x + 2 s_\theta(x, t)\big], \qquad g_{\mathrm{rev}}(t) := \sqrt{\tilde{\beta}(t)}. \tag{11}$$
Note that we use a diffusion coefficient different from Nie et al. (2022) for the purpose of cleaner output (see the detailed explanation in Section 3.3). Next, we use the discrete-time noise estimator $\epsilon_\theta(x_n, n)$ to compute the continuous-time score estimator $s_\theta(x, t)$. By defining $\bar{\epsilon}_\theta(x(t), t) := \epsilon_\theta(x(\frac{n}{N}), n) = \epsilon_\theta(x_n, n)$ with $t := \frac{n}{N}$, the score function in the reverse VP-SDE can be estimated as
$$s_\theta(x, t) = -\frac{\bar{\epsilon}_\theta(x, t)}{\sqrt{1 - \bar{\alpha}(t)}} \approx \nabla_x \log p_t(x).$$
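In discrete time, the whole purification reduces to a one-shot forward jump to step $n^\star$ followed by $n^\star$ ancestral reverse steps with variance $\tilde{\beta}_n$, per the equivalence above. A minimal NumPy sketch, with a stub standing in for the pretrained DiffWave network and illustrative schedule values:

```python
import numpy as np

N = 200
betas = np.linspace(1e-4, 0.02, N)          # schedule endpoints from Appendix A
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
beta_tildes = np.append(betas[0],
                        (1 - alpha_bars[:-1]) / (1 - alpha_bars[1:]) * betas[1:])

def purify(x_adv, n_star, eps_theta, rng):
    """AudioPure sketch: diffuse to step n_star in one shot, then run the
    truncated reverse chain back to step 0 (variance beta_tilde_n)."""
    abar = alpha_bars[n_star - 1]
    x = np.sqrt(abar) * x_adv + np.sqrt(1 - abar) * rng.standard_normal(x_adv.shape)
    for n in range(n_star, 0, -1):          # ancestral steps n_star, ..., 1
        i = n - 1                           # 0-indexed schedule position
        mu = (x - betas[i] / np.sqrt(1 - alpha_bars[i]) * eps_theta(x, n)) \
             / np.sqrt(alphas[i])
        x = mu if n == 1 else mu + np.sqrt(beta_tildes[i]) * rng.standard_normal(x.shape)
    return x

eps_theta = lambda x, n: np.zeros_like(x)   # stand-in for pretrained DiffWave
rng = np.random.default_rng(0)
x_adv = rng.standard_normal(16000)
x_pure = purify(x_adv, n_star=3, eps_theta=eps_theta, rng=rng)
```

With a small $n^\star$ only a little noise is injected, so the purified waveform stays close to the input; it is the (real) learned score network that removes the adversarial perturbation during the reverse steps.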
Accordingly, $\hat{x}(0)$, the purified output for the adversarial input $x(0) = x_{\mathrm{adv}}$, is fed into the later stages of the acoustic system to make predictions. The whole purification operation can be denoted as a function $\mathrm{Purifier}: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$:
$$\mathrm{Purifier}(x_{\mathrm{adv}}, n^\star) = \texttt{sdeint}\Big(\sqrt{\bar{\alpha}(\tfrac{n^\star}{N})}\, x_{\mathrm{adv}} + \sqrt{1 - \bar{\alpha}(\tfrac{n^\star}{N})}\, z,\; f_{\mathrm{rev}},\; g_{\mathrm{rev}},\; \bar{w},\; \tfrac{n^\star}{N},\; 0\Big).$$
Acoustic systems are usually built on features extracted from the raw audio. For example, a system can extract a Mel spectrogram as the features: 1) it first applies the short-time Fourier transform (STFT) to the time-domain waveform to obtain a linear-scale spectrogram, and 2) it then rescales the frequency bands to the Mel scale. We denote this process as $\mathrm{Wave2Mel}: \mathbb{R}^d \to \mathbb{R}^{m \times n}$, which is a differentiable function. The classifier $F: \mathbb{R}^{m \times n} \to \mathbb{R}^c$ (usually a convolutional network) then takes the Mel spectrogram as input and gives predictions. Since both the time-domain waveform and the time-frequency-domain spectrogram go through the pipeline, the purifier can be applied in either domain. If the purifier is applied in the time domain, the whole defended acoustic system $AS: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^c$ can be formulated as
$$AS(x_{\mathrm{adv}}, n^\star) = F\big(\mathrm{Wave2Mel}(\mathrm{Purifier}(x_{\mathrm{adv}}, n^\star))\big), \tag{14}$$
where the waveform Purifier is based on DiffWave. Meanwhile, if we want to purify the adversarial examples in the time-frequency domain, we can choose a diffusion model used for image synthesis and apply it to the output spectrogram of Wave2Mel. We denote this purifier as $\mathrm{Purifier}_{\mathrm{spec}}: \mathbb{R}^{m \times n} \times \mathbb{R} \to \mathbb{R}^{m \times n}$. In this scenario, the whole defended acoustic system is
$$AS(x_{\mathrm{adv}}, n^\star) = F\big(\mathrm{Purifier}_{\mathrm{spec}}(\mathrm{Wave2Mel}(x_{\mathrm{adv}}), n^\star)\big). \tag{15}$$
The architecture of the whole pipeline is illustrated in Figure 1. For purification of the time-frequency-domain spectrogram, we use an Improved DDPM (Nichol & Dhariwal, 2021) trained on the Mel spectrograms of audio data and denote it as DiffSpec.
We compare these two purifiers and discover that the purification in the time domain waveform is more effective to defend against adversarial audio. Detailed experimental results can be found in Sec. 4.2.

3.3. TOWARDS EVALUATING AUDIOPURE

Adaptive attack. For the forward diffusion process formulated as Eq. 9, the gradient of the output $x(t^\star)$ w.r.t. the input $x(0)$ is a constant. For the reverse process formulated as Eq. 10, the adjoint method (Li et al., 2020) is applied to compute the full gradients of the objective function $L$ w.r.t. $x(t^\star)$ without any out-of-memory issues, by solving another augmented SDE:
$$\begin{pmatrix} x(t^\star) \\ \frac{\partial L}{\partial x(t^\star)} \end{pmatrix} = \texttt{sdeint}\left( \begin{pmatrix} x(0) \\ \frac{\partial L}{\partial x(0)} \end{pmatrix},\; \begin{pmatrix} f_{\mathrm{rev}} \\ \frac{\partial f_{\mathrm{rev}}}{\partial x} z \end{pmatrix},\; \begin{pmatrix} g_{\mathrm{rev}} \mathbf{1} \\ \mathbf{0} \end{pmatrix},\; \begin{pmatrix} -w(1-t) \\ -w(1-t) \end{pmatrix},\; 0,\; t^\star \right)$$
where $\mathbf{1}$ and $\mathbf{0}$ represent the vectors of all ones and all zeros, respectively.

SDE modifications for clean output

We observe that directly applying the framework of Nie et al. (2022) to the audio domain causes performance degradation. That is, when converting the discrete-time reverse process of DiffWave (Kong et al., 2021) to its corresponding reverse VP-SDE in Eq. 8, the output audio still contains much noise, resulting in lower classification accuracy. We identify two influencing factors and solve this problem by modifying the SDE formulation. The first factor is the diffusion error due to the mismatch of the reverse variance between the discrete and continuous cases. Ho et al. (2020) observed that $\sigma_\theta^2 = \tilde{\beta}_n$ and $\sigma_\theta^2 = \beta_n$ give similar results experimentally in the image domain. However, we find this is not the case for audio modeling with diffusion models: for audio synthesis using DiffWave trained with $\sigma_\theta^2 = \tilde{\beta}_n$, if we switch the reverse variance schedule to $\sigma_\theta^2 = \beta_n$, the output audio becomes noisy. Thus, in Sec. 3.2 we define $\tilde{\beta}(\frac{n}{N}) := \tilde{\beta}_n$ and use the diffusion coefficient $g_{\mathrm{rev}} = \sqrt{\tilde{\beta}(t)}$ in Eq. 11 instead of $g_{\mathrm{rev}} = \sqrt{\beta(t)}$ to match the variance $\tilde{\beta}_n$ in DiffWave. The second factor is the inaccuracy of the continuous-time noise schedule $\beta(t) = \beta_0 + (\beta_N - \beta_0)t$ and $\bar{\alpha}(t) = e^{-\int_0^t \beta(s)\,\mathrm{d}s}$ used by Nie et al. (2022). The difference between $\beta(t) = \beta_0 + (\beta_N - \beta_0)t$ and $\beta_{Nt}$ is not negligible, especially when $N$ is not large (e.g. $N = 200$ for the pretrained DiffWave model we use). Besides, when $t$ is close to 0, $\bar{\alpha}(t) = e^{-\int_0^t \beta(s)\,\mathrm{d}s}$ is no longer a good approximation of $\bar{\alpha}_{Nt}$. Thus, we define the continuous-time noise schedule directly from the discrete schedule, namely $\beta(\frac{n}{N}) := \beta_n$ and $\bar{\alpha}(\frac{n}{N}) := \bar{\alpha}_n$, for better denoised output and more accurate gradient computation.
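The second factor can be checked numerically. The sketch below compares the discrete $\bar{\alpha}_n$ of a linear schedule against the continuous approximation $\bar{\alpha}(t) = e^{-\int_0^t \beta(s)\mathrm{d}s}$ (with $\beta_{\min} = N\beta_1$ and $\beta_{\max} = N\beta_N$, the usual VP-SDE convention — an assumption on our part); the schedule endpoints are the ones from Appendix A. At the small $n^\star$ values AudioPure uses, the implied forward-noise scales differ noticeably.

```python
import numpy as np

# Discrete schedule of the pretrained model (N = 200, linear betas).
N, beta1, betaN = 200, 1e-4, 0.02
betas = np.linspace(beta1, betaN, N)
alpha_bars = np.cumprod(1 - betas)

def abar_continuous(t):
    """abar(t) = exp(-int_0^t beta(s) ds) for the linear continuous schedule
    beta(t) = beta_min + (beta_max - beta_min) t, with beta_min = N*beta1 and
    beta_max = N*betaN (assumed VP-SDE convention)."""
    beta_min, beta_max = N * beta1, N * betaN
    return np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2))

# At small n* (e.g. the default n* = 3) the two implied forward-noise scales
# sqrt(1 - abar) already differ by roughly 10%, which is why the discrete
# values abar(n/N) := abar_n are used directly.
n_star = 3
std_discrete = np.sqrt(1 - alpha_bars[n_star - 1])
std_continuous = np.sqrt(1 - abar_continuous(n_star / N))
print(std_discrete, std_continuous)
```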

4. EXPERIMENTS

In this section, we first introduce the detailed experimental settings. Then we compare the performance of our method and other defenses under strong white-box adaptive attacks, where the attacker has full knowledge of the defense, as well as black-box attacks. To further show the robustness of our method, we also evaluate the certified accuracy via randomized smoothing (Cohen et al., 2019), which provides a provable guarantee of model robustness against L2-norm-bounded adversarial perturbations.

4.1. EXPERIMENTAL SETTINGS

Dataset. Our method is evaluated on the task of speech command recognition. We use the Speech Commands dataset (Warden, 2018), which consists of 85,511 training utterances, 10,102 validation utterances, and 4,890 test utterances. Following the setting of Kong et al. (2021), we choose the utterances corresponding to the digits 0-9 and denote this subset as SC09. Models. We use DiffWave (Kong et al., 2021) and DiffSpec (based on Improved DDPM (Nichol & Dhariwal, 2021)) as our defensive purifiers, which are representative diffusion models for the waveform domain and the spectral domain, respectively. We use the unconditional version of DiffWave with the officially provided pretrained checkpoints. Since Improved DDPM provides no pretrained checkpoint for audio, we train it from scratch on the Mel spectrograms of audio from SC09; the training details and hyperparameters are in Appendix A. For the classifier, we use ResNeXt-29-8-64 (Xie et al., 2017) for the spectrogram representation and M5 Net (Dai et al., 2017) for the waveform, except in the ablation studies. Attacks. For white-box attacks, we use PGD (Madry et al., 2018) with iteration counts from 10 to 100 under the L∞ and L2 norms. The attack budget is set to ϵ = 0.002 for the L∞-norm constraint (except in the ablation study) and ϵ = 0.253 for the L2-norm constraint. For black-box attacks, we apply a query-based attack, FAKEBOB (Chen et al., 2021a), and set the iteration steps to 200, the number of NES samples to 200, and the confidence coefficient κ = 0.5. Baselines. We compare our method with two types of baselines: (1) transformation-based defenses (Yang et al., 2019; Rajaratnam & Alshemali, 2018), including average smoothing (AS), median smoothing (MS), downsampling (DS), low-pass filter (LPF), and band-pass filter (BPF); and (2) adversarial-training-based defense (AdvTr) (Madry et al., 2018). For adversarial training, we follow the setting of Chen et al. (2022), using L∞ PGD-10 with ϵ = 0.002 and ratio = 0.5.
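For reference, the L∞ PGD attack used above follows standard sign-gradient ascent with projection onto the ϵ-ball. A minimal NumPy sketch on a toy quadratic loss (a real attack would backpropagate through the full defended system; the step-size rule is an illustrative convention, not the paper's exact setting):

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=0.002, steps=10, alpha=None):
    """L-inf PGD sketch: ascend the loss via the gradient sign, then project
    back into the eps-ball around the clean input after every step."""
    alpha = alpha if alpha is not None else 2.5 * eps / steps
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project to the eps-ball
    return x_adv

# Toy stand-in for the classifier loss: 0.5 * ||x||^2, whose gradient is x.
# A real attack differentiates through the whole (defended) acoustic system.
rng = np.random.default_rng(0)
x = 0.1 * rng.standard_normal(16000)
x_adv = pgd_linf(x, grad_fn=lambda v: v, eps=0.002, steps=10)
print(np.abs(x_adv - x).max())  # stays within the eps budget
```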

4.2. MAIN RESULTS

We evaluate AudioPure (n⋆ = 3 by default) under adaptive attacks, assuming the attacker obtains full knowledge of our defense. We use the adaptive attack algorithm described in the previous section so that the attacker can accurately compute the full gradients. The results are shown in Table 1. We find that the baseline transformation-based defenses (Yang et al., 2019; Rajaratnam & Alshemali, 2018) are largely broken under the adaptive attack, while AudioPure remains effective. We also evaluate the robustness under different EOT sample sizes; Figure 2b demonstrates the result, showing that AudioPure is effective across different EOT sizes. Attack budget ϵ. We evaluate the effectiveness of our method for different budgets ϵ ∈ {0.002, 0.004, 0.008, 0.016}. Since the diffusion step n⋆ is a hyperparameter of AudioPure, we conduct experiments over different n⋆. As shown in Table 2, if n⋆ is larger than 2, AudioPure shows strong effectiveness across different ϵ. When ϵ increases, a larger n⋆ is required to achieve the optimal robustness, since a larger adversarial perturbation requires more noise from the forward process of the diffusion model to override it, and correspondingly more reverse steps to recover purified audio. However, if n⋆ is too large, the forward process overrides the original audio information as well, so the recovered audio loses the original content, leading to a performance drop. Furthermore, we explore the limits of the diffusion model for purification in Appendix I. Architectures. Moreover, we apply AudioPure to different classifiers, including the spectrogram-based classifiers VGG-19-BN (Simonyan & Zisserman, 2015), ResNeXt-29-8-64 (Xie et al., 2017), WideResNet-28-10 (Zagoruyko & Komodakis, 2016), and DenseNet-BC-100-12 (Huang et al., 2017), and the waveform-based classifier M5 (Dai et al., 2017). Table 3 shows that our method is effective for various neural network classifiers.
Audio representations. Audio has different types of representations, including raw waveforms and time-frequency representations (e.g., Mel spectrograms). We conduct an ablation study on the effectiveness of diffusion models operating on different representations: DiffWave, a diffusion model for waveforms (Kong et al., 2021), and DiffSpec, a diffusion model for spectrograms based on the original image model (Nichol & Dhariwal, 2021). The results are shown in Table 4. We find that the waveform-based DiffWave is more effective than the spectrogram-based DiffSpec. We think the potential reason is that the short-time Fourier transform (STFT) is an operation of information compression: the spectrogram contains much less information than the raw audio waveform. This experiment shows that the working domain leads to significantly different results, and directly applying the method from the image domain can lead to suboptimal performance for audio. It also verifies the crucial design choice of AudioPure for adversarial robustness.

4.4. CERTIFIED ROBUSTNESS

In this section, we evaluate the certified robustness of AudioPure via randomized smoothing (Cohen et al., 2019). We draw N = 100,000 noise samples and select noise levels σ ∈ {0.5, 1.0} for certification. Note that we follow the same setting as Carlini et al. (2022) and use the one-shot denoising method. The detailed implementation can be found in Appendix C. We compare our results with randomized smoothing using the vanilla classifier and the Gaussian-augmented classifier, denoted RS-Vanilla and RS-Gaussian respectively. The results are shown in Table 5. We also provide the certified robustness under different L2 perturbation budgets with Gaussian noise σ ∈ {0.5, 1.0} in Figure A of Appendix C. We find that our method outperforms the baselines in certified accuracy, except for σ = 0.5 at radius 0. We also notice that the advantage of our method grows as the input noise gets larger. This may be because AudioPure can still recover the clean audio under a large L2 perturbation, whereas the Gaussian-augmented model may not even converge when trained with such large noise.
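For context, the certified radius in randomized smoothing comes from the Cohen et al. (2019) bound. A minimal sketch using only the Python standard library (the probability value in the example is illustrative, not taken from our experiments):

```python
from statistics import NormalDist

def certified_radius(p_a_lower, sigma):
    """Cohen et al. (2019): if the smoothed classifier's top class has
    probability at least p_a_lower > 1/2 under N(0, sigma^2 I) noise, the
    prediction is certifiably robust within L2 radius sigma * Phi^-1(p_a_lower)."""
    if p_a_lower <= 0.5:
        return 0.0  # abstain: no certificate
    return sigma * NormalDist().inv_cdf(p_a_lower)

# Illustration: sigma = 0.5 and a lower confidence bound of 0.9 on the
# top-class probability (estimated from the noise samples) certify an L2
# radius of about 0.64.
print(round(certified_radius(0.9, 0.5), 2))
```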

5. CONCLUSION

In this paper, we propose AudioPure, an adversarial purification-based defense pipeline for acoustic systems. To evaluate its effectiveness, we design an adaptive attack method and evaluate our defense under adaptive attacks, EOT attacks, and black-box attacks. Comprehensive experiments indicate that our defense is more effective than existing methods (including adversarial training) across diverse types of adversarial examples. We also show that AudioPure achieves better certified robustness via randomized smoothing than other baselines. Moreover, our defense is a universal plug-and-play method that works with classifiers of different architectures. Limitations. AudioPure introduces a diffusion model, which increases the time and computational cost. Thus, improving time and computational efficiency is important future work; for example, it would be interesting to investigate distillation techniques (Salimans & Ho, 2022) and fast sampling methods (Kong & Ping, 2021) to reduce the computational complexity introduced by diffusion models.

APPENDIX

A DETAILS ON TRAINING THE IMPROVED DDPM

We train an Improved DDPM using the official repository (https://github.com/openai/improved-diffusion). For the UNet model, we set image size = 32, num channels = 3, and num res blocks = 128. For the diffusion flags, we set N = 200, β1 = 0.0001, βN = 0.02, and use the linear variance schedule. For training, we set the learning rate to 1e-4 and the batch size to 230. The training loss converged after 80,000 training steps, and we use this checkpoint to build our purifier.

B ADDITIONAL EXPERIMENTS OF TRANSFER-BASED ATTACK

We additionally evaluate our method under a transfer-based attack, where we assume the attacker can only obtain the output logits of the acoustic system and has no knowledge of the deployed defense. We use model functional stealing to train a surrogate model. Specifically, we first feed input examples into the acoustic system consisting of DiffWave and a ResNeXt classifier and record the output logits. Then we use these logits as labels to train a new surrogate ResNeXt model with the same architecture as the classifier in the acoustic system. The results are shown in Table A. Stealing Acc. denotes the accuracy of the surrogate classifier when using the predictions of the defended acoustic system as ground truth. Transfer to Vanilla and Transfer to Defended report the results of attacking the surrogate classifier to generate adversarial examples and transferring them to the undefended vanilla classifier and the defended acoustic system, respectively.

C DETAILS ABOUT CERTIFIED ROBUSTNESS

Randomized smoothing (Cohen et al., 2019) provides a provable robustness guarantee in L2 norm by evaluating models under noise. Usually, the performance of the vanilla classifier degrades when fed Gaussian-perturbed inputs. To alleviate this problem, one can re-train a new network or fine-tune a pretrained network on Gaussian-augmented data; however, both take a lot of training time. Another way is to apply a denoiser before the vanilla classifier, named denoised smoothing (Salman et al., 2020). Since the reverse process of a diffusion model can be seen as a good denoiser, we can use a pretrained diffusion model as a plug-and-play method to make any model certifiably robust. For a given noise level σ, we can compute the corresponding diffusion step t⋆ that adds the same level of noise to the input examples.
The diffusion process can be reformulated as
$$x_n = \sqrt{\bar{\alpha}_n}\, x_0 + \sqrt{1-\bar{\alpha}_n}\, z = \sqrt{\bar{\alpha}_n}\Big(x_0 + \sqrt{\tfrac{1-\bar{\alpha}_n}{\bar{\alpha}_n}}\, z\Big), \quad z \sim \mathcal{N}(0, I),$$
while the noisy input $\hat{x}$ of randomized smoothing is
$$\hat{x} = x_0 + \sigma z, \quad z \sim \mathcal{N}(0, I). \tag{S18}$$
So we can obtain $n^\star$ s.t. $\sqrt{\frac{1-\bar{\alpha}_{n^\star}}{\bar{\alpha}_{n^\star}}} = \sigma$, and then multiply the input $\hat{x}$ by the rescaling coefficient $\sqrt{\bar{\alpha}_{n^\star}}$. According to Carlini et al. (2022), a single reverse step is able to recover an image with high accuracy for the classifier and largely saves computational time by directly recovering the data through
$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_{n^\star}}}\Big(x_{n^\star} - \sqrt{1-\bar{\alpha}_{n^\star}}\,\epsilon_\theta(x_{n^\star}, n^\star)\Big), \quad x_{n^\star} = \sqrt{\bar{\alpha}_{n^\star}}\,\hat{x}.$$
So we can just apply one-shot denoising instead of running the full reverse process. The results show that the certified robustness of our method is consistently better than the baselines except at small certified radii when σ = 0.50.
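The σ-to-n⋆ mapping and the one-shot denoiser can be sketched as follows; the linear schedule is the one from Appendix A, and `eps_theta` is a stand-in for the pretrained diffusion network.

```python
import numpy as np

N = 200
betas = np.linspace(1e-4, 0.02, N)   # linear schedule from Appendix A
alpha_bars = np.cumprod(1 - betas)

def step_for_sigma(sigma):
    """Smallest n* whose forward noise matches the smoothing noise sigma,
    i.e. (1 - abar_n) / abar_n >= sigma^2 (1-indexed step)."""
    ratios = (1 - alpha_bars) / alpha_bars
    return int(np.argmax(ratios >= sigma ** 2)) + 1

def one_shot_denoise(x_hat, sigma, eps_theta):
    """Rescale x_hat = x_0 + sigma*z into the diffusion frame, then recover
    x_0 in a single reverse step (Carlini et al., 2022)."""
    n = step_for_sigma(sigma)
    abar = alpha_bars[n - 1]
    x_n = np.sqrt(abar) * x_hat      # now distributed as q(x_n | x_0)
    return (x_n - np.sqrt(1 - abar) * eps_theta(x_n, n)) / np.sqrt(abar)

n_05, n_10 = step_for_sigma(0.5), step_for_sigma(1.0)
print(n_05, n_10)  # a larger sigma maps to a deeper diffusion step
```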

D THEORETICAL ANALYSIS ON THE PURIFICATION ABILITY

Theorem D.1. Assume that p(x) and q(x) are the data distributions of clean examples and adversarial examples, respectively. Let p_t and q_t denote the distributions of x(t) under the forward diffusion when x(0) ∼ p(x) and x(0) ∼ q(x), respectively. Then

∂D_KL(p_t ∥ q_t)/∂t ≤ 0, (S19)

where the equality is established only if p_t = q_t. This inequality indicates that as t increases from 0 to 1, the KL divergence between p_t and q_t monotonically decreases. In other words, as the number of diffusion steps n⋆ increases, more of the adversarial perturbation is removed. Considering that the original semantic information will also be removed if n⋆ is too large, which affects the clean accuracy, there should be a trade-off when we set n⋆ for the diffusion model purifier.

Proof: Following Nie et al. (2022); Song et al. (2021b), we first formulate the Fokker-Planck equation (Särkkä & Solin, 2019) of the forward SDE in Eq. 7 (where we define f(x, t) := −(1/2)β(t)x and g(t) := √(β(t))) as

∂p_t(x)/∂t = −∇_x · (f(x, t)p_t(x) − (1/2)g²(t)∇_x p_t(x))
           = −∇_x · (f(x, t)p_t(x) − (1/2)g²(t)p_t(x)∇_x log p_t(x))
           = ∇_x · (h_p(x, t)p_t(x)), (S20)

where h_p(x, t) := (1/2)g²(t)∇_x log p_t(x) − f(x, t), and analogously for q_t with h_q. Assume p_t and q_t are smooth and fast-decaying, i.e., for any i = 1, . . . , d,

lim_{x_i→∞} p_t(x) ∂/∂x_i log p_t(x) = 0,  lim_{x_i→∞} q_t(x) ∂/∂x_i log q_t(x) = 0, (S21)

where x_i is the i-th dimension of x ∈ R^d. Then, using ∫ ∂p_t(x)/∂t dx = 0 and substituting Eq. S20, we can reformulate the time derivative of the KL divergence as

∂D_KL(p_t ∥ q_t)/∂t
= ∂/∂t ∫ p_t(x) log(p_t(x)/q_t(x)) dx
= ∫ ∇_x · (h_p(x, t)p_t(x)) log(p_t(x)/q_t(x)) dx − ∫ (p_t(x)/q_t(x)) ∇_x · (h_q(x, t)q_t(x)) dx
= −∫ p_t(x) [h_p(x, t) − h_q(x, t)]^⊤ [∇_x log p_t(x) − ∇_x log q_t(x)] dx
= −(1/2)g²(t) ∫ p_t(x) ∥∇_x log p_t(x) − ∇_x log q_t(x)∥²₂ dx
= −(1/2)g²(t) D_F(p_t ∥ q_t), (S22)

where the third equality follows from integration by parts together with the decay condition in Eq. S21.

E EXPERIMENTS ON THE QUALCOMM KEYWORD SPEECH DATASET

In addition to the commonly used SC09, for a more comprehensive evaluation, we also conduct experiments on the Qualcomm Keyword Speech Dataset (Kim et al., 2019), denoted QKW in the following. QKW consists of 4270 utterances belonging to four classes, with variable durations from 0.48s to 1.92s. We split them into a training set (3770 utterances), a validation set (400 utterances), and a test set (100 utterances). To handle the variable-sized input, we train an Attention Recurrent Convolutional Network (Shan et al., 2018) and save the checkpoint with the highest accuracy on the validation set. We then fine-tune the DiffWave model on QKW for 50,000 steps, with lr = 2e−4 and a per-GPU batch size of 2 on 3 GPUs. The results under L∞ PGD10 with ϵ = 0.002 are shown in Table B. We observe that AudioPure still achieves non-trivial robustness and handles audio with variable durations well.

F FINE-TUNING ON ADVERSARIAL EXAMPLES

AudioPure takes advantage of pretrained diffusion models. A natural question is whether the purification performance improves when the diffusion model is fine-tuned on adversarial examples. We therefore further fine-tune the DiffWave model by augmenting with self-supervised perturbations (SSP) (Naseer et al., 2020). Specifically, we use the STFT (rescaled to the Mel scale) as our feature extractor and maximize the following objective to generate perturbed examples:

arg max_{x′} ∆(x, x′) = ∥STFT(x) − STFT(x′)∥_∞, s.t. ∥x − x′∥_∞ ≤ ϵ, (S23)

where x is the clean example and x′ is the perturbed example. We then use sign-gradient ascent to optimize the perturbed example:

x′_{t+1} = clip(x′_t + α · sign(∇_{x′}∆(x, x′_t)), x − ϵ, x + ϵ), (S24)

and fine-tune DiffWave by minimizing

L_tuning = L_audio + λ L_feat, (S25)
where L_audio = MSE(x, Purifier(x′, n⋆)), (S26)
L_feat = MSE(STFT(x), STFT(Purifier(x′, n⋆))). (S27)

We choose λ = 0.1 and use SGD to optimize L_tuning with a learning rate of 1e−5. The results are shown in Table C. Fine-tuning does not improve the performance of AudioPure (with n⋆ = 3) under L∞ PGD10 and PGD70 with ϵ = 0.002. These results further verify the effectiveness of using pretrained models directly.
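The SSP update in Eq. S24 amounts to projected sign-gradient ascent inside an L∞ ball. A minimal, framework-free sketch is below; `grad_fn` is a hypothetical callback standing in for the gradient of the STFT feature distance ∆, since computing it for real audio requires a differentiable STFT.

```python
import numpy as np

def ssp_perturb(x, grad_fn, eps=0.002, alpha=0.0004, steps=100, rng=None):
    """Sign-gradient ascent on a feature distance ∆(x, x'), with each iterate
    clipped back into the L∞ ball of radius eps around x (Eq. S24)."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Random start inside the allowed perturbation ball.
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        g = grad_fn(x, x_adv)                       # stands in for ∇_{x'} ∆(x, x')
        x_adv = x_adv + alpha * np.sign(g)          # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)    # L∞ projection
    return x_adv
```

The perturbed examples produced this way would then be fed to the purifier when computing L_audio and L_feat during fine-tuning.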

G COMPARISON WITH OTHER DENOISER-BASED DEFENSE

We compare AudioPure with DefenseGAN (Samangouei et al., 2018) and Joint Adversarial Fine-tuning (Joshi et al., 2022). DefenseGAN was originally designed to defend against adversarial images by finding the latent noise that generates the image most similar to the adversarial counterpart; we adapt it to the audio domain by choosing WaveGAN (Donahue et al., 2018) as the GAN in this pipeline. We train a WaveGAN on the SC09 dataset for 100 epochs, using the Adam optimizer with lr = 1e−3, β1 = 0.5, and β2 = 0.9. For Joint Adversarial Fine-tuning, we follow the setting of Joshi et al. (2022), using a Conv-TasNet (Luo & Mesgarani, 2019) as the denoiser. Like Joshi et al. (2022), we craft an offline adversarial SC09 dataset against the pretrained classifier using L∞ PGD-100 attacks with ϵ = 0.002 (denoted OffAdv-SC09). We then train a Conv-TasNet model on OffAdv-SC09 for 30 epochs to obtain the pretrained denoiser, and denote the defense using the pretrained Conv-TasNet as CTN Baseline. Based on the adversarial examples generated by attacking the whole acoustic system, we update only the Conv-TasNet denoiser while keeping the classifier frozen, and denote this method CTN Adv-Finetune-Joint-frozen. During adversarial tuning, we use the L∞ PGD10 attack with ϵ = 0.002. After tuning for 1000 steps with batch size 20, we compute the clean and robust accuracy (under L∞ PGD10 and PGD70 with ϵ = 0.002) on the same test set used in our paper. We report the results in Table D. We find that DefenseGAN based on WaveGAN does not work well in the audio domain, which shows the impact of domain differences on the final results and verifies the importance of our pipeline design. Moreover, the Conv-TasNet denoiser is less effective than diffusion models against adaptive attacks, even after fine-tuning.
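For reference, DefenseGAN's purification step searches the generator's latent space for the sample closest to the input. A minimal sketch with a generic generator `G` is given below; real implementations use autograd on the GAN, whereas this illustration uses finite-difference gradients so it stays self-contained, and the restart/step counts are illustrative.

```python
import numpy as np

def defense_gan_purify(x, G, z_dim, n_restarts=10, n_steps=200, lr=0.05, rng=None):
    """DefenseGAN-style purification: minimize ||G(z) - x||^2 over the latent z
    from several random restarts, then return the closest generated sample."""
    rng = np.random.default_rng(0) if rng is None else rng
    best_z, best_loss = None, np.inf
    for _ in range(n_restarts):
        z = rng.standard_normal(z_dim)
        for _ in range(n_steps):
            # Finite-difference estimate of the gradient of ||G(z) - x||^2.
            f0 = np.sum((G(z) - x) ** 2)
            grad = np.zeros_like(z)
            for i in range(z_dim):
                dz = np.zeros_like(z)
                dz[i] = 1e-4
                grad[i] = (np.sum((G(z + dz) - x) ** 2) - f0) / 1e-4
            z = z - lr * grad                      # gradient-descent step on z
        loss = np.sum((G(z) - x) ** 2)
        if loss < best_loss:
            best_z, best_loss = z, loss
    return G(best_z)
```

In the audio setting, `G` would be the pretrained WaveGAN generator and `x` a waveform; the multiple restarts mirror DefenseGAN's random initializations of z.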
L_reg = Σ_i L(x_i, y_i) + λ ∥∂f(x_i)/∂x_i∥_F,

where x_i ∈ R^d is the input data, y_i ∈ R^n is the label, L : R^d × R^n → R is the original loss function, and f : R^d → R^n is the neural network. By minimizing the Frobenius norm of the input-output Jacobian matrix, the adversarial robustness of the network is improved. For a more comprehensive study, we compare AudioPure with this regularization-based method under different values of λ. The results are shown in Table E, where we denote the regularization-based defense as Jacobian-Reg.

I EXPERIMENTS ON LARGER ATTACK BUDGETS

Besides the results for different ϵ in Table 2, we conduct additional experiments to explore the potential of the diffusion model for purification. We select ϵ ∈ {0.01, 0.02, 0.03, 0.04, 0.05} and set the diffusion steps to n⋆ = 5. The results are shown in Table F. Our method still achieves 42% accuracy at ϵ = 0.03, a budget that brings significant distortion to the audio, and it retains some ability to purify adversarial perturbations up to ϵ = 0.05. We also visualize the audio waveforms under attacks with different ϵ in Figure B, where significant noise is easy to observe.
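The Jacobian penalty ∥∂f(x)/∂x∥_F used by the regularization-based defense discussed above can be sketched as follows. This is an illustrative finite-difference version for a generic function `f`; practical implementations differentiate through the network with autograd, often using random projections to avoid forming the full Jacobian.

```python
import numpy as np

def jacobian_frobenius(f, x, eps=1e-5):
    """Estimate ||∂f(x)/∂x||_F by finite differences.
    f maps R^d -> R^n; we build the n x d Jacobian column by column."""
    d = x.size
    f0 = f(x)
    J = np.zeros((f0.size, d))
    for i in range(d):
        dx = np.zeros(d)
        dx[i] = eps
        J[:, i] = (f(x + dx) - f0) / eps
    return np.sqrt(np.sum(J ** 2))

def regularized_loss(loss, f, x, y, lam=0.01):
    """L_reg = L(f(x), y) + λ ||∂f(x)/∂x||_F, the penalty defined above."""
    return loss(f(x), y) + lam * jacobian_frobenius(f, x)
```

For a linear map f(x) = Wx the penalty reduces to λ∥W∥_F, which makes the effect of the regularizer easy to check.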



Figure 1: The architecture of the whole acoustic system protected by AudioPure (black line in the figure) and the adaptive attack (orange line in the figure). AudioPure first adds noise to the adversarial audio and then runs the reverse process to recover purified audio. Next, the purified audio is transformed into a spectrogram, which is fed into the classifier to get predictions. The attacker updates the adversarial audio based on the gradients backpropagated through the SDE. Without AudioPure, the adversarial audio is transformed into the spectrogram and fed into the classifier directly.

Figure 2: The performance of the baseline (no defense, denoted as None), adversarial training (denoted as AdvTr), and AudioPure under attacks with different iteration steps and EOT sizes. (a) shows the convergence of the attack, which is almost optimal when iterating over 70 steps. (b) shows that increasing the EOT size barely affects the robustness of our method.

Figure A shows the certified accuracy of AudioPure compared with RS-Gaussian and RS-Vanilla.
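For completeness, the certification step behind these curves can be sketched as below. To keep the sketch self-contained it lower-bounds the top-class probability with a Hoeffding bound; Cohen et al. (2019) use an exact Clopper-Pearson interval, so the radii here are slightly conservative.

```python
import numpy as np
from statistics import NormalDist

def certify(counts, sigma, alpha=0.001):
    """Certified L2 radius of a smoothed classifier (Cohen et al., 2019):
    R = σ Φ⁻¹(p_A), where p_A is a (1 - alpha) lower confidence bound on the
    top-class probability, estimated here from Monte-Carlo class counts
    with a Hoeffding bound. Returns (predicted class, certified radius)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    top = int(np.argmax(counts))
    p_lower = counts[top] / n - np.sqrt(np.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return top, 0.0                      # abstain: no certificate
    return top, sigma * NormalDist().inv_cdf(p_lower)
```

The certified accuracy at radius r is then the fraction of test inputs that are correctly classified with a certified radius larger than r, which is what Figure A plots for each σ.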

Figure A: Certified robustness (L2) with different input noise levels σ. A larger σ ensures better robustness under larger perturbations, but degrades performance on benign inputs.

The update in Eq. S24 is iterated for t = 1, . . . , T. We use T = 100, ϵ = 0.002, and α = 0.0004 when generating the SSP examples on which the pretrained DiffWave model is fine-tuned.

Figure B: Visualizations of the clean audio and adversarial audio with different attack budgets.


Performance against adaptive attacks across different methods.

The results of FAKEBOB are shown in Table 1, indicating that our method remains effective under the query-based black-box attack. The results of the transferability-based attack are in Appendix B and draw the same conclusion. These results further verify the effectiveness of our method. All results indicate that AudioPure works under diverse attacks with different types of constraints, while adversarial training has to apply different training strategies and re-train the model, making it less effective against unseen attacks than our method. We report the actual inference time in Appendix J and compare our method with more existing methods in the appendix.

The robust accuracy under PGD10 with different attack budgets ϵ when using different reverse steps n⋆. A larger ϵ requires a larger n⋆ to ensure better robustness.

Ablation studies across different model architectures. The robust accuracy is evaluated under L∞ PGD70. Our method is effective on various models with different architectures.

Ablation studies across different audio representations. We implement AudioPure using two different diffusion models as purifiers, DiffWave and DiffSpec, which respectively process representations in the time domain and the time-frequency domain.

Certified accuracy for different methods. For each noise level σ, we add the same level of noise to train the classifier and apply it to RS-Gaussian.

DiffWave consistently outperforms DiffSpec against both L2- and L∞-based adversarial examples. Moreover, although DiffSpec achieves higher clean accuracy than DiffWave, it reaches only 49% robust accuracy against L2-based adversarial examples, a significant 35% drop.

Transfer-based attack via model functional stealing. We train a surrogate model, using the outputs of the defended acoustic system as labels. Then adversarial examples are generated by attacking the surrogate model and transferred to the undefended vanilla classifier and the defended acoustic system.

In Eq. S22, D_F(p_t ∥ q_t) is the Fisher divergence. Since g²(t) = β(t) > 0 and the Fisher divergence satisfies D_F(p_t ∥ q_t) ≥ 0, with equality only if p_t = q_t, we obtain Eq. S19, where the equality is established only if p_t = q_t. □

We apply AudioPure to the Qualcomm Keyword Speech Dataset. The diffusion step n⋆ is set to 2.

We fine-tune the pretrained DiffWave model on adversarial examples generated by SSP. After fine-tuning, the performance is not improved.

We compare AudioPure with different denoiser-based defenses. DiffWave proves to be a more effective purifier.

H COMPARISON WITH THE REGULARIZATION-BASED DEFENSE

Gu & Rigazio (2014); Hoffman et al. (2019) introduce the input-output Jacobian matrix of the network as a regularization term in the optimization objective, formulated as

We compare AudioPure with the regularization-based defense, using different λ.

We explore the potential of DiffWave under larger attack budgets. The diffusion step n⋆ is set to 5.

Due to the introduction of diffusion models, AudioPure brings additional time cost during inference. As shown in Table G, we compute the time cost per audio clip, averaged over 100 examples; the duration of each example is around one second. We evaluate on an NVIDIA RTX 3090 GPU with an Intel® Core™ i9-10920X CPU @ 3.50GHz and 64 GB RAM.

The inference time cost when using different diffusion steps n⋆.
Diffusion Steps: n⋆ = 0, n⋆ = 1, n⋆ = 2, n⋆ = 3, n⋆ = 5, n⋆ = 7, n⋆ = 10

ACKNOWLEDGMENT

We thank Prof. Xiaolin Huang from Shanghai Jiao Tong University for the valuable discussions. Shutong Wu is partially supported by the National Natural Science Foundation of China (61977046), Shanghai Science and Technology Program (22511105600), and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

ETHICS STATEMENT

Our work proposes a defense pipeline for protecting acoustic systems from adversarial audio examples. In particular, our study focuses on speech command recognition, which is closely related to keyword spotting systems. Such systems are well known to be vulnerable to adversarial attacks. Our pipeline will enhance the security of such real-world acoustic systems and benefit society. The Speech Commands dataset used in our study was released by others and has been publicly available for years. The dataset contains various voices from anonymous speakers; to the best of our knowledge, it does not contain any privacy-related information about these speakers.

AVAILABILITY

https://github.com/cychomatica/AudioPure.

