WATCH WHAT YOU PRETRAIN FOR: TARGETED, TRANSFERABLE ADVERSARIAL EXAMPLES ON SELF-SUPERVISED SPEECH RECOGNITION MODELS

Abstract

A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition (ASR) system to output attacker-chosen text. To exploit ASR models in real-world, black-box settings, an adversary can leverage the transferability property: an adversarial sample produced for a proxy ASR can also fool a different remote ASR. However, recent work has shown that transferability against large ASR models is very difficult to achieve. In this work, we show that modern ASR architectures, specifically those based on Self-Supervised Learning (SSL), are in fact vulnerable to transferable adversarial attacks. We successfully demonstrate this phenomenon by evaluating state-of-the-art self-supervised ASR models such as Wav2Vec2, HuBERT, Data2Vec and WavLM. We show that with low-level additive noise achieving a 30dB Signal-Noise Ratio, we can achieve targeted transferability with up to 80% accuracy. Next, we 1) use an ablation study to show that Self-Supervised Learning is the main cause of this phenomenon, and 2) provide an explanation for it. Through this we show that modern ASR architectures are uniquely vulnerable to adversarial security threats.

1. INTRODUCTION

Adversarial audio algorithms are designed to force Automatic Speech Recognition (ASR) models to produce incorrect outputs. They do so by introducing small amounts of imperceptible, carefully crafted noise into benign audio samples. Specifically, targeted adversarial attacks (Carlini & Wagner, 2018; Qin et al., 2019) are designed to force ASR models to output any target sentence of the attacker's choice. However, these attacks have limited practical effectiveness, as they make unreasonable assumptions (e.g., white-box access to the model weights) that are unlikely to be satisfied in real-world settings. An attacker could hypothetically bypass this limitation by using the transferability property of adversarial samples: they generate adversarial samples on a white-box proxy model, then feed these to a different remote black-box model, as we illustrate in Figure 1a. Transferability has been successfully demonstrated in other machine learning domains, like computer vision (Papernot et al., 2016). Yet for ASR, recent work has shown that transferability is close to non-existent between large models (Abdullah et al., 2021b), even between identically trained models (i.e., same training hyper-parameters, including the random initialization seed). These findings were demonstrated on older ASR architectures, specifically LSTM-based DeepSpeech2 models trained with CTC loss. However, robustness properties sometimes vary considerably between ASR architectures (Lu et al., 2021; Olivier & Raj, 2022), and it is worth studying adversarial transferability on more recent families of models.

In this work, we evaluate the robustness of modern transformer-based ASR architectures. We show that many state-of-the-art ASR models are in fact vulnerable to transferred attacks. Specifically, our core finding can be formulated as follows: pretraining transformer-based ASR models with Self-Supervised Learning (SSL) makes them vulnerable to transferable adversarial attacks. SSL is an increasingly popular learning paradigm in ASR (Figure 1b), used to boost model performance by leveraging large amounts of unlabeled data. We demonstrate that it also harms robustness by making the following contributions:

• First, we show that most public SSL-pretrained ASR models are vulnerable to transferred attacks. We generate 85 adversarial samples on the proxy HuBERT and Wav2Vec2 models (Section 3) and show that these samples are effective against a wide panel of public transformer-based ASRs, including ASRs trained on different data than our proxies.

• Second, we show that SSL pretraining is the reason for this vulnerability to transferability. We do so using an ablation study on Wav2Vec2-type models.

• Third, we propose an explanation for this curious phenomenon. We argue that targeted ASR attacks need considerable feature overlap between proxy and private model to be transferable, and that SSL objectives encourage such feature overlap between different models.

Our results show that SSL, a line of work gathering attention in the ASR community that has pushed the state of the art on many benchmarks, is also a source of vulnerability. Formerly innocuous attacks relying on unreasonable assumptions are now effective against many modern models. As it is likely that SSL will be used to train ASR systems in production, our results pave the way for practical, targeted attacks in the real world.
By no means do these results imply that this line of work should be abandoned, but they emphasize the pressing need to focus on robustness alongside performance.

2.1. SSL PRETRAINING FOR ASR MODELS

We describe in this section the principles of SSL-pretrained ASR models, whose robustness to attacks we evaluate in this work. These models usually follow the neural architecture of Wav2Vec2 (Baevski et al., 2020): raw audio inputs are fed directly to a CNN; a Transformer encodes the CNN outputs into contextualized representations; and a final feed-forward network projects these representations onto a character output space. The model is fine-tuned with CTC loss (Graves et al., 2006).

A number of different models follow this architecture, including Wav2Vec2, HuBERT (Hsu et al., 2021), Data2Vec (Baevski et al., 2022), UniSpeech-SAT (Wang et al., 2021; Chen et al., 2021b) and WavLM (Chen et al., 2021a). These networks only have very minor architectural differences, to the point that standardized sizes are used for all of them: Base models have 12 transformer hidden layers and 90M parameters; Large models have 24 layers and 300M parameters; XLarge models have 48 layers for a total of 1B parameters.

While the networks are similar, the training pipelines of these models differ substantially. All models are pretrained on large amounts of unlabeled data, then fine-tuned for ASR on varying quantities of labeled data. The pretraining involves SSL objectives, such as quantization and contrastive learning (Wav2Vec2), offline clustering and masked prediction (HuBERT), or masked prediction of contextualized labels (Data2Vec). UniSpeech combines SSL and CTC pretraining with multitask learning. WavLM adds denoising objectives and scales to even greater amounts of unlabeled data.

SSL pretraining is helpful in many regards: it makes the same network easy to fine-tune for multiple downstream tasks with little labeled data, and it has improved state-of-the-art results on ASR benchmarks, especially in low-resource settings. As we demonstrate, it is also a source of vulnerabilities.
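As a concrete illustration of this shared pipeline, the following minimal sketch runs greedy CTC decoding with an SSL-pretrained checkpoint through the HuggingFace Transformers API. The checkpoint name and the assumption of a 16kHz mono waveform are illustrative, not the exact models attacked later.

import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Illustrative checkpoint: Wav2Vec2 Base, SSL-pretrained then fine-tuned for ASR with CTC.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def transcribe(waveform_16khz):
    # CNN feature encoder -> Transformer context network -> linear head over characters
    inputs = processor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits   # shape: (batch, frames, characters)
    ids = torch.argmax(logits, dim=-1)               # greedy CTC decoding
    return processor.batch_decode(ids)[0]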

2.2. ADVERSARIAL ATTACKS

Adversarial examples are inputs modified imperceptibly by an attacker to fool machine learning models (Szegedy et al., 2014; Goodfellow et al., 2014; Carlini & Wagner, 2016; Madry et al., 2018). While most works have focused on image classification, several have created or adapted attacks for other tasks such as ASR (Cisse et al., 2017; Carlini & Wagner, 2018; Qin et al., 2019). The attack we use is based on the Carlini & Wagner ASR attack (Carlini & Wagner, 2018), although slightly simplified. Given an input x, a target transcription y_t, and an ASR model f trained with loss L, our attack finds an additive perturbation δ optimizing the following objective:

$\min_{\delta} L(f(x + \delta), y_t) + c \, \|\delta\|_2^2 \quad \text{s.t.} \quad \|\delta\|_{\infty} < \epsilon \qquad (1)$

which we optimize using L∞ Projected Gradient Descent. While the CW attack typically uses a large initial ϵ, then gradually reduces it as it finds successful perturbations, we fix a single value of ϵ and optimize for a fixed number of iterations. We find that this scheme, closer to the PGD algorithm (Madry et al., 2018), greatly improves attack transferability. We do however keep the L2 regularization term $c\|\delta\|_2^2$ introduced in the CW attack. We also find that applying regularization such as dropout during attack optimization greatly helps to generate transferable perturbations. This effect is analyzed in more detail in Appendix D.3. Throughout the rest of the paper, we run all attack optimization steps using the default dropout, layer drop, etc. that the proxy model used during training (typically a dropout of 0.1).
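The following sketch shows this optimization loop under simple assumptions: ctc_loss_fn is a hypothetical helper returning the proxy's CTC loss for a (perturbed audio, target transcription) pair, and the proxy's dropout and LayerDrop are left active during the attack, as described above. It illustrates Equation 1; it is not the exact implementation used in our experiments.

import torch

def targeted_perturbation(ctc_loss_fn, x, target, eps=0.015, lr=5e-4, steps=1000, c=10.0):
    # Sketch of the simplified CW/PGD attack of Equation 1 (hypothetical helper names).
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        # CTC loss towards the attack target, plus the L2 penalty kept from the CW attack.
        loss = ctc_loss_fn(x + delta, target) + c * delta.pow(2).sum()
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()   # signed-gradient descent step
            delta.clamp_(-eps, eps)           # projection onto the L_inf ball of radius eps
        delta.grad.zero_()
    return delta.detach()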

3. TRANSFERABLE ATTACK ON STATE-OF-THE-ART ASR MODELS

In our core experiment, we fool multiple state-of-the-art SSL-pretrained ASR models with targeted, transferred adversarial attacks. We generate a small set of targeted audio adversarial examples using fixed proxy models, then transfer those same examples to a large number of models available in the HuggingFace Transformers library. Table 1 specifies how much unlabeled and labeled data these models were trained on. We provide the full experimental details in Appendix A.

3.1. GENERATING ADVERSARIAL EXAMPLES ON PROXIES

We describe our procedure for generating adversarial examples. To maximize the transferability success rate of our perturbations, we improve the base attack of Section 2.2 in several key ways:

• To limit attack overfitting on our proxy, we combine the losses of two proxy models: Wav2Vec2 and HuBERT (LARGE). Both models were pretrained on the entire LV60k dataset and fine-tuned on 960h of LibriSpeech. As these models have respectively a contrastive and a predictive objective, they are a representative sample of SSL-pretrained ASR models. The sum of their losses is used as the optimization objective in Equation 1.

• We use 10000 optimization steps, which is considerable (for comparison, Carlini & Wagner (2018) use 4000) and can lead to the adversarial noise overfitting the proxy models. To mitigate this effect we use a third model, the Data2Vec BASE network trained on LibriSpeech, as a stopping criterion for the attack: at each attack iteration, we feed our adversarial example to Data2Vec and keep track of the best-performing perturbation (in terms of WER). We return that best perturbation at the end of the attack. A sketch of this procedure is given below.

Because this procedure is computationally expensive, we only apply it to a subset A of 85 utterances of less than 7 seconds, sampled randomly from the LibriSpeech test-clean set. We select attack targets at random: we sample a completely disjoint subset B of utterances from the LibriSpeech test-other set, and to each utterance in A we assign as target the transcription of the sentence in B whose length is closest to its own. This ensures that a very long target isn't assigned to a very short utterance or vice versa.
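The sketch below summarizes this procedure. The helpers ctc_loss and transcribe, and the use of the jiwer package for WER, are illustrative assumptions standing in for the actual framework calls; only the overall logic (summed proxy losses, best-perturbation tracking on a third model, closest-length target assignment) reflects the procedure above.

import torch
import jiwer

# ctc_loss(model, audio, text) and transcribe(model, audio) are hypothetical helpers.

def assign_targets(utterances, candidates):
    # Map each utterance to the candidate target whose length is closest to its own transcription.
    return {utt: min(candidates, key=lambda t: abs(len(t) - len(ref)))
            for utt, ref in utterances.items()}

def attack_with_stopping(x, target, proxies, stop_model, steps=10000, eps=0.015, lr=5e-4, c=10.0):
    delta = torch.zeros_like(x, requires_grad=True)
    best_wer, best_delta = float("inf"), delta.detach().clone()
    for _ in range(steps):
        # Objective of Equation 1, summed over the two proxies (Wav2Vec2 and HuBERT Large).
        loss = sum(ctc_loss(p, x + delta, target) for p in proxies) + c * delta.pow(2).sum()
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()
            delta.clamp_(-eps, eps)
            # Stopping criterion: keep the perturbation that best fools the third model (Data2Vec).
            hyp = transcribe(stop_model, x + delta)
            w = jiwer.wer(target, hyp)        # WER against the attack target
            if w < best_wer:
                best_wer, best_delta = w, delta.detach().clone()
        delta.grad.zero_()
    return best_delta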

3.2. TRANSFERRING ADVERSARIAL EXAMPLES ON ASR MODELS

We evaluate all SSL-pretrained models mentioned in Section 2.1, along with several others for comparison: the massively multilingual speech recognizer or M-CTC (Lugosch et al., 2022) trained with pseudo-labeling, and models trained from scratch for ASR: the Speech-to-text model from Fairseq (Wang et al., 2020) and the CRDNN and Transformer from SpeechBrain (Ravanelli et al., 2021) .

3.3. METRICS

We evaluate the performance of ASR models with the Word-Error-Rate (WER) between the model output and the correct transcription.

When evaluating the success of adversarial examples, we can also use the WER, this time between the prediction and the attack target y_t: a low WER indicates a successful attack. We therefore define the word-level targeted attack success rate as

$T_{ASR} = \max(1 - \mathrm{WER}(f(x + \delta), y_t),\ 0) \qquad (2)$

It is also interesting to look at the results of the attack in terms of denial-of-service, i.e. the attack's ability to stop the model from predicting the correct transcription y. Here a high WER indicates a successful attack. We define the word-level untargeted attack success rate as

$U_{ASR} = \min(\mathrm{WER}(f(x + \delta), y),\ 1) \qquad (3)$

We can also compute these attack success rates at the character level, i.e. using the Character-Error-Rate (CER) instead of the Word-Error-Rate. Character-level metrics are interesting when using weaker attacks that affect the model, but not enough to reduce the targeted WER significantly. We use them in our ablation study in Section 4.

Finally, we control the amount of noise in our adversarial examples with the Signal-Noise Ratio (SNR), defined as

$\mathrm{SNR}(\delta, x) = 10 \log_{10}\!\left(\frac{\|x\|_2^2}{\|\delta\|_2^2}\right) \qquad (4)$

for an input x and a perturbation δ. When generating adversarial examples we adjust the L∞ bound ϵ (Equation 1) to achieve a target SNR.
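These metrics are straightforward to compute. The sketch below assumes the jiwer package for Word-Error-Rate and NumPy arrays for the waveforms; it simply mirrors the definitions above.

import numpy as np
import jiwer

def targeted_success(prediction, target_text):
    # T_ASR = max(1 - WER(prediction, target), 0): high when the attack target is predicted.
    return max(1.0 - jiwer.wer(target_text, prediction), 0.0)

def untargeted_success(prediction, reference_text):
    # U_ASR = min(WER(prediction, reference), 1): high when the correct transcription is lost.
    return min(jiwer.wer(reference_text, prediction), 1.0)

def snr_db(x, delta):
    # SNR(delta, x) = 10 log10(||x||_2^2 / ||delta||_2^2)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(delta ** 2))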

3.4. RESULTS

We report the results of our adversarial examples in Table 1 for ϵ = 0.015, corresponding to a Signal-Noise Ratio of 30dB on average. In Appendix D.1 we also report results for a larger ϵ value. On 12 out of 16 models, the attack achieves total denial-of-service: the untargeted success rate is 100%. Moreover, on the first 6 models (proxies aside), the targeted attack success rate ranges between 50% and 81%: the target is more than half correctly predicted. These results are in stark contradiction with past works on DeepSpeech2-like models, where even the slightest change in training leads to a total absence of targeted transferability between proxy and private model. Our private models differ from the proxies in depth, number of parameters and even training method, yet we observe substantial transferability. However, these 6 models have all been SSL-pretrained on LibriSpeech or Libri-Light, i.e. the same data distribution as our proxies.

The following five models were pretrained on different datasets. One was pretrained on a combination of Libri-Light, VoxPopuli and GigaSpeech; two on Libri-Light, CommonVoice, SwitchBoard and Fisher; and two on CommonVoice, either multilingual or English-only. The transferability success rate on these five models ranges from 18% to 67%, which is significant. Even the CommonVoice models, whose training data has no intersection with Libri-Light, are partially affected.

Although our inputs and attack targets are in English, we also apply them to a French-only CommonVoice Wav2Vec2. This model, which is incapable of decoding clean LibriSpeech data, is also unaffected by our targeted perturbations. It therefore seems that, while multilingual models are not robust to our examples, a minimal performance on the original language is required to observe transferability.


The final four models, for which the targeted transferability rate is null or close to null, are those that were not SSL-pretrained at all (including M-CTC, which was pretrained with pseudo-labeling). These four models also partially resist the untargeted attack.

4.1. INFLUENCE OF SELF-SUPERVISED LEARNING

In this section, we compare Wav2Vec2 models with varying amounts of pretraining data: 60k hours, 960h, or none at all. We use each model both as a proxy to generate adversarial noise and as a private model for evaluation against all other proxies. As Wav2Vec2 models trained from scratch are not publicly available, we train our own models with no pretraining, using the Wav2Vec2 fine-tuning configurations on 960h of labeled data available in Fairseq (Ott et al., 2019). These configurations are likely suboptimal, and our models achieve test-clean WERs of 9.1% (Large) and 11.3% (Base), much higher than the pretrained+fine-tuned Wav2Vec2 models. This performance discrepancy could affect the fairness of our comparison. We therefore add to our experiments Wav2Vec2 Base models fine-tuned on only 1h and 10h of labeled data, which achieve test-clean WERs of 24.5% and 11.1% respectively. This lets us observe the influence of SSL pretraining with model architecture and performance taken out of the equation.

Our attacks here are not as strong as in Section 3.1 and only have a limited effect on the targeted WER. We therefore evaluate results at the character level, which offers much finer granularity. For reference, the CER between two random pairs of sentences in LibriSpeech is 80-85% on average; attack success rates higher than 20% (i.e. CER < 80% with respect to the target) therefore indicate a partially successful attack.

We report these results in Table 2. Results in italics correspond to cases where the attacked model is the proxy itself or was fine-tuned from the same pretrained representation, and therefore do not correspond to a transferred attack. These results show unambiguously that SSL pretraining plays a huge role in the transferability of adversarial attacks. Adversarial examples generated on the pretrained Wav2Vec2 models fine-tuned on 960h are partially successful on all pretrained models (success rates in the 25-46% range). They are however ineffective on the ASR models trained from scratch (4-8%). Similarly, models trained from scratch are bad proxies for pretrained models (2-3%) and even for each other (19-22%).


It follows that SSL pretraining is a necessary condition for transferable adversarial examples, in both the proxy and the private model. We confirm this by plotting in Figure 3a the evolution of the target loss while generating one adversarial example. We display the loss for the proxy model (blue) and two private models: the loss of the pretrained private model (red) converges to a much lower value than that of the non-pretrained model (yellow). SSL pretraining is however not a sufficient condition for attack transferability, and other factors play a role as well. For instance, the Base models fine-tuned on just 10h and 1h are ineffective proxies: strong ASR models are likely better proxies than weaker ones.

4.2. INFLUENCE OF PRETRAINING DATA

As observed in Section 3, models that were (pre)trained on different data than the proxies can still be affected by transferred attacks. We analyse this effect in more detail in this section. We focus on five Wav2Vec2-Large models. One is pretrained and fine-tuned on LibriSpeech. One is pretrained on Libri-Light and fine-tuned on LibriSpeech. Two are pretrained on LV60k, CommonVoice, SwitchBoard and Fisher, and fine-tuned respectively on LibriSpeech and SwitchBoard. Finally, one is pretrained and fine-tuned on CommonVoice (English-only). As in the previous section, we evaluate every combination of proxy and private model and report the results in Table 3.

We observe that most pairs of proxy and private models lead to substantial partial transferability. The major exception is the CommonVoice-only model, which does not succeed as a proxy for other models (0-8% success rate). In contrast, it is vulnerable to attacks transferred from other models, including those that do not have CommonVoice in their training data. We also note that models pretrained on Libri-Light or more (60k+ hours) are better proxies, and more vulnerable to attacks, than the LibriSpeech-only and CommonVoice-only models. In other words, the vulnerability that we point out is worsened rather than mitigated by increasing amounts of available data.

5. A HYPOTHESIS FOR THE VULNERABILITY OF SSL-PRETRAINED MODELS

We have established a link between adversarial transferability and the SSL pretraining of ASR models. In this section we propose a hypothesis explaining that link. We first show in Section 5.1, with empirical justification, that attacks with a very precise target are much harder to transfer, everything else being equal, which explains why targeted ASR attacks are usually non-transferable. Then in Section 5.2 we suggest ways in which SSL alleviates these difficulties, thus recovering some transferability.

5.1. AT EQUAL WHITE-BOX SUCCESS, VERY TARGETED ATTACKS ARE HARDER TO TRANSFER

Targeted attacks on CIFAR10 force the model to predict one out of 10 different labels. Targeted attacks on ASR models force the model to transcribe one of all possible transcriptions: with sequences of just five English words, the number of possibilities is already $170000^5 \sim 10^{26}$. We can call such an attack "very targeted", by contrast with more "mildly targeted" attacks on CIFAR10. We hypothesize that the target precision, or "how targeted" the attack is, negatively affects its transferability success rate, explaining why targeted ASR attacks do not transfer easily.

To demonstrate this empirically, we can imagine an experiment where an attacker tries to run a very targeted attack on CIFAR10. We hypothesize that in such a case, the transferred attack success rate would drop even if the white-box attack success rate remains high. Conversely, if we designed a "mildly targeted" attack on ASR models, we would expect it to achieve a non-trivial transferability success rate. We designed experiments for both cases, which we summarize below. Complete experimental details and results are provided in Appendix B.

5.1.1. VERY TARGETED ATTACKS ON CIFAR10

We run an attack on a ResNet CIFAR10 model. We do not just enforce the model's most probable output (top-1 prediction) but its first k most probable outputs (top-k prediction). For example, with k = 3, given an image of an airplane, the attack objective could be to modify the image such that the most probable model output is "car", the second most probable is "bird" and the third is "frog". Our attack algorithm sets a "target distribution" of classes, then minimizes the KL divergence between the model's probabilistic outputs and that target, using Projected Gradient Descent. The success rate is evaluated by matching the top k predictions and the top k targets.

We compute the L∞ attack success rate (ϵ = 0.03) for both white-box and transferred attacks as a function of the "target precision" k. For k = 1, we measure a transferability success rate above 30%. However, as k increases, the transferability success rate drops close to 10%, which is the success threshold that a random model would achieve. In other words, transferability becomes null as k increases. Meanwhile, the white-box attack success rate remains above 95%. Therefore very targeted attacks on images do not transfer.

5.1.2. MILDLY TARGETED ATTACKS ON ASR

We train five small Conformer models on LibriSpeech. On each of them we generate targeted adversarial examples, whose objective is simply to prepend the word "But" to the original transcription. This makes for a much less targeted attack than is traditionally done with ASR. The attack success rate is evaluated simply by checking for the presence of the word "But" at the beginning of the prediction, and we restrict evaluation to inputs whose transcription does not already start with that word. For each model, we generate 100 adversarial examples and evaluate them on the 4 other models. We thus obtain 20 different transferability success rates. The average of these scores is 18%, with a standard deviation of 4.7%. Therefore mildly targeted attacks on ASR transfer substantially better than regular, very targeted attacks: equivalent experiments with very targeted ASR attacks are reported in Abdullah et al. (2021b), where the word-level transferability success rate is 0%.

5.2. VERY TARGETED TRANSFERABILITY REQUIRES IMPORTANT FEATURE OVERLAP

Why would very targeted attacks transfer less? As Ilyas et al. (2019) show, statistically meaningful patterns in the training data may be "robust" (i.e. resilient to small perturbations) or non-robust. By leveraging non-robust features, attackers can generate adversarial perturbations, and since these features can be learned by any model, these perturbations will transfer. The underlying assumption behind this framework is that all models learn the same features. In practice, two separate models do not learn identical features, due to randomness in training. But if they are "close enough", i.e. if the feature overlap between both models is large, then transferability will be observed.

It therefore makes sense that more targeted attacks would transfer less. The more precise and difficult the attack objective, the more features the attacker depends on to achieve it. This increases the amount of feature overlap needed between proxy and private model for the attack to transfer. In the case of targeted ASR attacks, the required overlap is considerable.

We hypothesize that SSL pretraining increases the feature overlap between ASR models. As verifying this empirically would pose important difficulties, we propose a high-level justification of that hypothesis. ASR training aims at learning a representation that enables speech transcription. A subset of all features is sufficient to achieve this objective: for instance, there are many redundancies between low-frequency and high-frequency features, and a human listener can easily transcribe speech where most frequencies have been filtered out. The set of features learned by ASR models is therefore underspecified: two models, even very similar ones trained identically, may learn representations with little overlap. Self-Supervised Learning, on the other hand, does not only learn features useful for transcription, but features needed for predicting the input itself: parts of the input are masked, then they (or their quantized or clustered form) are predicted using context. Arguably this much more ambitious objective requires the network to learn as many features as possible. In fact, the goal of such pretraining is to learn representations useful not just for ASR but for any downstream task, i.e. "exhaustive" representations. Intuitively, different models trained in that way share many more features than ASR models trained from scratch, leading to more transferable adversarial examples.

6. RELATED WORK

The transferability of adversarial attacks has been known for many years in image classification (Papernot et al., 2016). On ASR, it has so far been limited to simple attack objectives, like preventing WakeWord detection in Alexa (Li et al., 2019), or to signal processing-based attacks (Abdullah et al., 2021a; 2022b). When it comes to optimization-based attacks on large ASR models, transferability claims are usually limited and focus on untargeted attacks (Wu et al., 2022). In very specific cases there have been limited claims of targeted, transferable attacks, such as Yuan et al. (2018); however, this work does not focus on imperceptible attacks with small amounts of noise, but rather on attacks embedded in music. For standard targeted optimization attacks, Abdullah et al. (2021b) have shown that they display no transferability on DeepSpeech2 models, even when the proxy and the attacked model are trained with identical hyperparameters apart from the initial random seed.

Past ASR adversarial attacks usually focus on a handful of neural architectures, typically DeepSpeech2 (Amodei et al., 2016), sometimes Listen, Attend and Spell (Chan et al., 2016). Only recently have attacks been extended to multiple recent architectures for a fair comparison between models (Lu et al., 2021; Olivier & Raj, 2022; Wu et al., 2022). Most related to this work is Wu et al. (2022), which focuses on the vulnerability of SSL speech models. They however focus on attacking the base pretrained model with untargeted noise that remains effective on downstream tasks. We study targeted attacks, with a much deeper focus on transferability between different models. Olivier & Raj (2022) have hinted that Wav2Vec2 models are vulnerable to transferred attacks, but only report limited results on two models and do not investigate the cause of the phenomenon. We attribute it to SSL pretraining and back our claims empirically. Abdullah et al. (2022a) have identified factors that hinder transferability for ASR attacks, such as MFCC features, Recurrent Neural Networks, and large output sizes. Since Wav2Vec2 is a CNN-Transformer model with character outputs, it has a better prior than DeepSpeech2 for achieving transferable adversarial attacks. However, according to that paper, this should be far from sufficient to obtain transferable attacks: our results differ in the case of SSL-pretrained models.

7. CONCLUSION

We have shown that targeted ASR attacks are transferable between SSL-pretrained ASR models: direct access to their weights is no longer required to fool models into predicting outputs of the attacker's choice, and to an extent, knowledge of their training data is not required either. With that in mind, and given the existence of over-the-air attack algorithms, we expect attacks against ASR models to become a practical, realistic threat as soon as Wav2Vec2-type models are deployed in production. In that context, it is paramount to develop adversarial defense mechanisms for ASR models. Fortunately, such defenses already exist, but they come at the cost of a tradeoff in model performance, as we illustrate in Appendix E. Further research should be carried out into mitigating that tradeoff and into adapting to ASR the most effective defenses in image classification, such as adversarial training.

A.4 MODELS

A.4.1 TRAINING WAV2VEC2 MODELS FROM SCRATCH

We use Fairseq to train Base and Large Wav2Vec2 models from scratch. Unfortunately, no configuration or pretrained weights have been released for that purpose, so we resort to using the Wav2Vec2 fine-tuning configurations while simply skipping the pretraining step. Despite our attempts to tune training hyperparameters, we do not match the expected performance of a Wav2Vec2 model trained from scratch: Baevski et al. (2020) report a WER of 3.0% for a Large model, while we only get 9.1%.

A.4.2 GENERATING ADVERSARIAL EXAMPLES

Wav2Vec2, HuBERT and Data2Vec models are all supported directly in robust speech and are therefore those we use for generating adversarial examples. We use the HuggingFace backend of SpeechBrain for most pretrained models, and its Fairseq backend for a few (the Wav2Vec2-Base models fine-tuned on 10h and 1h, and the models trained from scratch). In both cases, the model's original tokenizer cannot be loaded into SpeechBrain directly. Therefore, we fine-tune the final projection layer of each model on 1h of LibriSpeech train-clean data. The Wav2Vec2 model pretrained and fine-tuned on CommonVoice is an original SpeechBrain model. Similarly, we fine-tune it on 1h of LibriSpeech data as a shift from the CommonVoice output space to the LibriSpeech one. As a result, all our models share the same character output space.

A.4.3 EVALUATING PRETRAINED MODELS

In section 3, we directly evaluate models from HuggingFace Transformers and SpeechBrain on our adversarial dataset, without modification.

B EXPERIMENTAL DETAILS AND RESULTS FOR SMALL-SCALE EXPERIMENTS

This section describes the experimental details used in section 5.

B.1 CIFAR10 EXPERIMENTS

We use a pretrained ResNet18 as the proxy and a pretrained ResNet50 as the private model. Our "very targeted attack" $PGD_k$ consists in applying the following steps for each input:

• Target selection. We sample uniformly an ordered subset of k classes out of 10 (e.g. with k = 3: (2, 5, 6)). We also sample a point uniformly on the unit k-simplex $\{(x_1, \dots, x_k) \in [0, 1]^k \mid \sum_i x_i = 1\}$, by sampling from an exponential distribution and normalizing (Onn & Weissman, 2011) (e.g. (0.17, 0.55, 0.28)). We combine the two to obtain a 10-dimensional vector with zero probability on all but the selected k classes (y = (0, 0.17, 0, 0, 0.55, 0.28, 0, 0, 0, 0)). This is our target.

• During the attack, we use Projected Gradient Descent (Madry et al., 2018) to minimize the KL divergence KL(f(x), y) between the softmax output and the target, within an L2 radius ϵ = 0.5. We use learning rate 0.1 for k * 1000 attack steps.

• We measure the attack success rate with the top-k match between f(x) and y: $acc = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[\operatorname{argsort}(f(x))_i = \operatorname{argsort}(y)_i]$, where argsort(y) returns the indices of the elements of y sorted in decreasing order. For instance f(x) = (0.1, 0.05, 0.05, 0.05, 0.35, 0.2, 0.05, 0.05, 0.05, 0.05) would get an accuracy of 0.666 against the target above, as the top 2 classes match y but not the third.

We evaluate attacks on 256 random images from the CIFAR10 dataset. For each value of k between 1 and 10 we repeat the experiment 3 times and average the attack success rates. Figure 2 plots the resulting success rates as a function of k. A sketch of the attack is given below.
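The sketch below illustrates these three steps (target sampling on the simplex, KL-minimizing PGD within an L2 ball, and top-k matching). Tensor shapes and the plain gradient step are simplifying assumptions; it is not the exact attack code.

import torch
import torch.nn.functional as F

def sample_target(k, num_classes=10):
    # Ordered subset of k classes plus a uniform point on the k-simplex
    # (exponential sampling followed by normalization).
    classes = torch.randperm(num_classes)[:k]
    weights = torch.distributions.Exponential(1.0).sample((k,))
    weights = weights / weights.sum()
    y = torch.zeros(num_classes)
    y[classes] = weights
    return y

def very_targeted_pgd(model, x, y, eps=0.5, lr=0.1, steps=1000):
    # x: a (1, 3, 32, 32) image batch; y: the 10-dimensional target distribution.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        log_probs = F.log_softmax(model(x + delta), dim=-1)
        loss = F.kl_div(log_probs, y.unsqueeze(0), reduction="batchmean")
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad          # gradient descent on the KL divergence
            norm = delta.norm(p=2)
            if norm > eps:                    # projection onto the L2 ball of radius eps
                delta.mul_(eps / norm)
        delta.grad.zero_()
    return (x + delta).detach()

def topk_match(probs, y, k):
    # Fraction of the k most probable classes predicted in the same order as in the target.
    pred_order = probs.argsort(descending=True)[:k]
    target_order = y.argsort(descending=True)[:k]
    return (pred_order == target_order).float().mean().item()

# Usage: probs = F.softmax(model(x_adv), dim=-1)[0]; success = topk_match(probs, y, k)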

B.2 MILDLY TARGETED ASR ATTACKS

We train 5 identical Conformer encoder models with 8 encoder layers, 4 attention heads, and hidden dimension 144. We train them with CTC loss for 30 epochs on the LibriSpeech train-clean-100 set, with different random seeds. We run an L2-PGD attack with SNR bound 30dB, in which we minimize the cross-entropy loss between the utterance and its transcription prepended with the word "But". The utterances we attack are the first 100 sentences in the LibriSpeech test-clean set, from which we remove 7 sentences already starting with the word "But". We generate adversarial examples using each of the 5 models as proxy, and evaluate these examples on all 5 models. We report the full results in Table 5.

D.1 INFLUENCE OF THE ATTACK RADIUS

In Table 7 we extend the results of Table 1 by comparing attack results for two different attack radii, ϵ = 0.015 and ϵ = 0.04, corresponding respectively to Signal-Noise Ratios of 30dB and 22dB. The former is identical to the setting of Table 1; the latter is substantially larger and corresponds to a more easily perceptible noise. Looking at the white-box attack results on the proxy models, the difference is drastic: with larger noise the targeted success rate jumps from 88% to 98%. The transferred attack results on SSL-pretrained models also increase overall, with success increases ranging from 0% (Wav2Vec2-Large) to 20% (Data2Vec-Large) and a median increase of 10%. Crucially however, the targeted success rate does not increase at all, and even decreases, for ASR models trained from scratch. This confirms that there is a structural difference between the robustness of ASR models with and without SSL pretraining, one that cannot be bridged simply by increasing the attack strength.

D.2 LANGUAGE MODELS

In Section 3 we report the results of our adversarial dataset on multiple Wav2Vec2-type models, enhanced with an N-gram language model whenever one is available. In Table 8 we evaluate the influence of that language model on attack results. We observe that the attack success rate systematically increases, by 8 to 17%, when adding a language model to the ASR model. This is understandable considering that our targets are sound English sentences: if a model tends to transcribe the target with mistakes, the language model can bridge that gap. To put it differently, the more prone an ASR model is to output sentences from a given distribution, the more vulnerable it is to attacks with targets sampled from that distribution. Language models are therefore more of a liability than a defense against attacks, and so, most likely, are many of the tricks applied to ASR models to improve their general performance.

D.3 EFFECT OF MODEL REGULARIZATION ON TRANSFERABILITY

As mentioned in Section 2.2, we use regularization tricks like dropout in all proxy models when optimizing the adversarial perturbation. In Figure 3b we plot the loss on the proxy and private models without that regularization, for comparison with Figure 3a. We observe that the loss degrades significantly on private models without regularization. On the other hand, the loss on the proxy converges much faster in Figure 3b: removing model regularization makes for better, faster white-box attacks, at the cost of all transferability. To the best of our knowledge, past works like Carlini & Wagner (2018) have not used regularization when generating adversarial examples, which explains why they report better white-box attacks than we do in terms of WER and SNR. However, as we have established above, applying regularization when attacking standard ASR models does not by itself lead to transferable adversarial examples: for that, SSL pretraining is also required.



It emerges from these results that some recent ASR models, specifically those pretrained with SSL, can be vulnerable to transferred attacks. These results diverge significantly from previous works like Abdullah et al. (2021b; 2022a), which showed no transferability between different models. Table 1 hints that SSL pretraining plays an important role in transferability, but does not prove it: to do so we would need to compare models of identical architecture and performance, pretrained and trained from scratch, both as proxy and target. This is what we do in the next section.

4. IDENTIFYING THE FACTORS THAT ENABLE ATTACK TRANSFERABILITY

In this section, we conduct a thorough ablation study and establish rigorously that SSL pretraining makes ASR models vulnerable to transferred attacks. We also measure the influence of several other factors on transferability. This ablation study requires the generation of many sets of adversarial examples, using varying models as proxies, which would be computationally difficult with the improved attack introduced in Section 3.1. Since we do not seek optimal performance, throughout this section we run the base attack of Section 2.2 with 1000 optimization steps.



Figure 1: Diagrams illustrating (a) the transferability of an adversarial attack between a proxy and a private model, and (b) the training procedure of SSL ASR models

Figure 2: Transferred attack success rate when varying the "target precision" on a CIFAR10 model. The more targeted the attack, the worse its transferability at an equal white-box success rate

Figure 3: Evolution over attack steps of the loss on one adversarial input for three models: the Wav2Vec2 Large proxy and two targets, respectively with and without SSL pretraining. We run attacks (a) with dropout in the proxy model, and (b) without dropout in the proxy model.





Character-level success rate of the attack with different proxies and models. Each row corresponds to a different proxy, each column to a different private model. The format is [pretraining-data fine-tuning-data]. All models follow the Wav2Vec2-Large architecture.

4.3. MODEL SIZE AND TRAINING HYPERPARAMETERS

We now extend our ablation study to models pretrained with different SSL paradigms. We report the results in Table 4. We observe that adversarial examples also transfer between models trained with different paradigms. Moreover, at equal pretraining data, not all models are equally good proxies: the HuBERT Large model (pretrained on 60kh) is the best proxy by a large margin.

Character-level success rate of the attack with different proxies and models. Each row corresponds to a different proxy, each column to a different private model. The format is [Model-type Model-size pretraining-data], where model types are Wav2Vec2 (W2V2), Data2Vec (D2V) and HuBERT (HB). Each model was fine-tuned on 960h of LibriSpeech training data.

Success rate of our mildly targeted attack, using each of the 5 conformer networks both as proxy and model. The attack is considered successful on an input if the prepended target word is the first word in the transcription.

Results of the transferred adversarial attack on different ASR models, with multiple Signal-Noise Ratios. The first three models correspond to the proxies used to generate the adversarial examples; on all other models, the inputs have been transferred directly. We report for each model how much unlabeled data was used for SSL pretraining and for ASR finetuning. We also report its Word-Error-Rate on the LibriSpeech test-clean set, and the targeted and untargeted word-level attack success rates (see Section 3.3).

Results of the transferred adversarial attack on different ASR models, with and without language models. We report for each model how much unlabeled data was used for SSL pretraining and for ASR finetuning. We also report its Word-Error-Rate on the LibriSpeech test-clean set, and the targeted word-level attack success rate (see section 3.3)

A EXPERIMENTAL DETAILS FOR LIBRISPEECH EXPERIMENTS

A.1 FRAMEWORKS

We compute adversarial examples using the robust speech framework (Olivier & Raj, 2022). This library uses SpeechBrain (Ravanelli et al., 2021) to load and train ASR models, and offers implementations of various adversarial attack algorithms. Models and attacks are implemented using PyTorch (Paszke et al., 2019). We use robust speech for evaluation on SpeechBrain-supported models. In Section 3 we export a HuggingFace Dataset (Lhoest et al., 2021), then evaluate models via the HuggingFace Transformers (Wolf et al., 2020) library. Finally, we use Fairseq (Ott et al., 2019) for training models from scratch. All of our robust speech and Fairseq configurations are released alongside this article.

A.2 ATTACK HYPERPARAMETERS

We use the Carlini & Wagner attack (see Section 2.2) implemented in robust speech, with the following hyperparameters:

• initial ϵ: 0.015 (and 0.04 in Appendix D.1)
• learning rate: 0.005
• number of decreasing ϵ values: 1
• regularization constant c: 10
• optimizer: SGD
• attack iterations: 10000 in Section 3.1, 1000 in Section 4

A.3 DATASET AND TARGETS

Our adversarial dataset in Section 3.1 consists of 85 sentences from the LibriSpeech test-clean set. To extract these sentences, we take the first 200 sentences in the manifest, then keep only those shorter than 7 seconds. In Section 4, we take the first 100 sentences and filter those shorter than 14 seconds.

As attack targets, we use actual LibriSpeech sentences sampled from the test-other set. Our candidate targets include:

• Let me see how can i begin

For each sentence we attack, we assign the candidate target with the closest length to the sentence's original transcription.

E DEFENDING AGAINST ADVERSARIAL EXAMPLES

Although we have shown that adversarial attacks can represent an important threat for private, SSL-based ASR models, it is possible to defend against them. Randomized smoothing (Cohen et al., 2019) is a popular adversarial defense that has been applied to ASR in the past (Olivier & Raj, 2021) and comes with some robustness guarantees. It consists in applying to the inputs, before feeding them to the model, amounts of random Gaussian noise that are significantly larger than potential adversarial perturbations in L2 norm. For reference, we try applying it to some of our models. We follow Olivier & Raj (2021) and enhance randomized smoothing with a-priori SNR estimation and ROVER voting (with 8 outputs) to boost performance. We use Gaussian deviation σ = 0.02. For evaluation, we simply check the effect of the adversarial examples generated in Section 3.1 on the smoothed models. A rigorous evaluation would require us to design adaptive attacks (Athalye et al., 2018; Tramer et al., 2020); since this paper does not focus on claiming robustness to attacks, we restrict ourselves to this simpler setting.

We report our results in Table 9 for the Wav2Vec2-Base, Wav2Vec2-Large and Data2Vec-Large models, pretrained and fine-tuned on 960h of LibriSpeech training data. We observe that randomized smoothing is sufficient to block the targeted attack completely (0% success rate) and to recover most of the original transcription (the untargeted success rate drops to 14-34% depending on the model). However, due to the addition of Gaussian noise on all inputs, the defense takes a toll on performance on clean data: the WER jumps by 4-10%. The standard deviation σ controls this tradeoff between robustness and performance; we chose the value of σ that minimizes the untargeted success rate. Unsurprisingly, randomized smoothing is a promising protection against transferred attacks, but it does leave room for improvement.

Table 9: Results of the transferred adversarial attack (generated in Section 3.1) on the Wav2Vec2-Base, Wav2Vec2-Large and Data2Vec-Large models. Each model was pretrained and fine-tuned on 960h of LibriSpeech training data. We report results on both the undefended version of each model and one defended with randomized smoothing at σ = 0.02. We report the WER of each model on the LibriSpeech test-clean set, and the word-level success rate of the attack (see Section 3.3).
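The following sketch captures the core of this defense under simplifying assumptions: transcribe stands for an arbitrary ASR call, and a simple majority vote over full transcriptions replaces the a-priori SNR estimation and word-level ROVER voting used in practice.

import torch
from collections import Counter

def smoothed_transcription(transcribe, x, sigma=0.02, n_votes=8):
    # Add Gaussian noise (larger in L2 norm than typical adversarial perturbations)
    # before decoding, then combine several noisy decodings by voting.
    hypotheses = []
    for _ in range(n_votes):
        noisy = x + sigma * torch.randn_like(x)
        hypotheses.append(transcribe(noisy))
    return Counter(hypotheses).most_common(1)[0][0]   # majority vote over transcriptions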

