ROBUSTNESS OF UNSUPERVISED REPRESENTATION LEARNING WITHOUT LABELS

Abstract

Unsupervised representation learning leverages large unlabeled datasets and is competitive with supervised learning. But non-robust encoders may affect downstream task robustness. Recently, robust representation encoders have become of interest. Still, all prior work evaluates robustness using a downstream classification task. Instead, we propose a family of unsupervised robustness measures, which are model- and task-agnostic and label-free. We benchmark state-of-the-art representation encoders and show that none dominates the rest. We offer unsupervised extensions to the FGSM and PGD attacks. When used in adversarial training, they improve most unsupervised robustness measures, including certified robustness. We validate our results against a linear probe and show that, for MOCOv2, adversarial training results in 3 times higher certified accuracy, a 2-fold decrease in impersonation attack success rate and considerable improvements in certified robustness.

1. INTRODUCTION

Unsupervised and self-supervised models extract useful representations without requiring labels. They can learn patterns in the data and are competitive with supervised models for image classification by leveraging large unlabeled datasets (He et al., 2020; Chen et al., 2020b;d;c; Zbontar et al., 2021; Chen & He, 2021). Representation encoders do not use task-specific labels and can be employed for various downstream tasks. Such reuse is attractive as large datasets can make them expensive to train. Therefore, applications are often built on top of public domain representation encoders. However, lack of robustness of the encoder can be propagated to the downstream task. Consider the impersonation attack threat model in Fig. 1. An attacker tries to fool a classifier that uses a representation encoder. The attacker has white-box access to the representation extractor (e.g. an open-source model) but they do not have access to the classification model that uses the representations. By optimizing the input to be similar to a benign input, but to have the representation of a different target input, the attacker can fool the classifier. Even if the classifier is private, one can attack the combined system if the public encoder conflates two different concepts onto similar representations. Hence, robustness against such conflation is necessary to perform downstream inference on robust features. We currently lack ways to evaluate robustness of representation encoders without specializing for a particular task. While prior work has proposed improving the robustness of self-supervised representation learning (Alayrac et al., 2019; Kim et al., 2020; Jiang et al., 2020; Ho & Vasconcelos, 2020; Chen et al., 2020a; Cemgil et al., 2020; Carmon et al., 2020; Gowal et al., 2020; Fan et al., 2021; Nguyen et al., 2022; Kim et al., 2022), all of these approaches require labeled datasets to evaluate the robustness of the resulting models.
Instead, we offer encoder robustness evaluation without labels. This is task-agnostic, in contrast to supervised assessment, as labels are (implicitly) associated with a specific task. Labels can also be incomplete, misleading or stereotyping (Stock & Cisse, 2018; Steed & Caliskan, 2021; Birhane & Prabhu, 2021), and can inadvertently impose biases on the robustness assessment. In this work, we propose measures that do not require labels, as well as methods for unsupervised adversarial training that result in more robust models. To the best of our knowledge, this is the first work on unsupervised robustness evaluation, and we make the following contributions to address this problem:
1. Novel representational robustness measures based on clean-adversarial representation divergences, requiring no labels or assumptions about underlying decision boundaries.
2. Unsupervised extensions of the FGSM and PGD attacks, generalized to a family of loss-based unsupervised attacks.
3. Evidence that even the most basic unsupervised adversarial attacks in the framework result in more robust models relative to both supervised and unsupervised measures.
4. Probabilistic guarantees on the unsupervised robustness measures based on center smoothing.

2. RELATED WORK

Adversarial robustness of supervised learning Deep neural networks can have high accuracy on clean samples while performing poorly under imperceptible perturbations, so-called adversarial examples (Szegedy et al., 2014; Biggio et al., 2013). Adversarial examples can be viewed as spurious correlations between labels and style (Zhang et al., 2022; Singla & Feizi, 2022) or as shortcut solutions (Robinson et al., 2021). Adversarial training, i.e. incorporating adversarial examples in the training process, is a simple and widely used strategy against adversarial attacks (Goodfellow et al., 2015; Madry et al., 2018; Shafahi et al., 2019; Bai et al., 2021).

Unsupervised representation learning Representation learning aims to extract useful features from data. Unsupervised approaches are frequently used to leverage large unlabeled datasets. Siamese networks map similar samples to similar representations (Bromley et al., 1993; Koch et al., 2015) but may collapse to a constant representation; Chen & He (2021) showed that a simple stop-gradient can prevent such collapse. Contrastive learning addresses representational collapse by introducing negative samples (Hadsell et al., 2006; Le-Khac et al., 2020) and can benefit from pretext tasks (Xie et al., 2021; Bachman et al., 2019; Tian et al., 2020; Oord et al., 2018; Ozair et al., 2020; McAllester & Stratos, 2020). Methods that do not need negative samples include VAEs (Kingma & Welling, 2014), generative models (Kingma et al., 2014; Goodfellow et al., 2014; Donahue & Simonyan, 2020), and bootstrapping methods such as BYOL (Grill et al., 2020).

Robustness of unsupervised representation learning

Most robustness work has focused on supervised tasks, but there has been recent interest in unsupervised training for representation encoders. Kim et al. (2020), Jiang et al. (2020), Gowal et al. (2020) and Kim et al. (2022) propose generating instance-wise attacks by maximizing a contrastive loss and using them for adversarial training. Fan et al. (2021) complement this with a high-frequency view. Ho & Vasconcelos (2020) suggest attacking batches instead of individual samples. KL-divergence can also be used as a loss (Alayrac et al., 2019) or as a regularizer (Nguyen et al., 2022). Alternatively, a classifier can be trained on a small labeled dataset with adversarial training applied to it (Carmon et al., 2020; Alayrac et al., 2019). For VAEs, Cemgil et al. (2020) generate attacks by maximizing the Wasserstein distance to the clean representations in representation space. Peychev et al. (2022) address robustness from the perspective of individual fairness: they certify that samples close along a feature direction are close in representation space. However, their approach is limited to invertible encoders. While these methods obtain robust unsupervised representation encoders, they all evaluate robustness on a single supervised classification task. To the best of our knowledge, no prior work has proposed measures for robustness evaluation without labels.

3. PROBLEM SETTING

Let f : X → R be a differentiable encoder from X = [0, 1]^n to a representation space R. We require f to be white-box: we can query both f(x) and its input gradient df(x)/dx. R is endowed with a divergence d(r, r'), a function satisfying non-negativity (d(r, r') ≥ 0 for all r, r' ∈ R) and identity of indiscernibles (d(r, r') = 0 ⇔ r = r'). This includes metrics on R as well as statistical distances, e.g. the KL-divergence. D is a dataset of iid samples from a distribution 𝒟 over X. We denote perturbations of x as x̃ = x + δ with ‖δ‖∞ ≤ ε, where ε is small enough that x̃ is indistinguishable from x. We consider it desirable that f maps x̃ close to x, i.e. that d(f(x̃), f(x)) is "small". We call this property unsupervised robustness. It is related to the smoothness of encoders (Bengio et al., 2013): intuitively, (Lipschitz-)continuous downstream tasks would translate such unsupervised robustness into supervised robustness.

We propose two risk measures for unsupervised robustness. First, the breakaway risk is the probability that the worst-case perturbation x̄ of x in the ball B(x, ε) is closer to the representation of a different sample x' than to that of x:

    P_{x,x'∼𝒟}[ d(f(x̄), f(x')) < d(f(x̄), f(x)) ],    x̄ ∈ arg sup_{x̃ ∈ B(x,ε)} d(f(x̃), f(x)).    (1)

Another indication of a lack of unsupervised robustness is if f(B(x, ε)) and f(B(x', ε)) overlap, as then there exist perturbations δ, δ' under which downstream tasks cannot distinguish between the two samples, i.e. f(x + δ) = f(x' + δ'). We call this the overlap risk and define it as:

    P_{x,x'∼𝒟}[ d(f(x), f(a(x', x))) < d(f(x), f(a(x, x'))) ],    a(o, t) ∈ arg inf_{x̃ ∈ B(o,ε)} d(f(t), f(x̃)),    (2)

where a(o, t) is the perturbation of the origin o whose representation is closest to that of the target t. The breakaway risk is based on the perturbation causing the largest divergence in R, while the overlap risk measures whether perturbations are well-separated from other instances (see Fig. 2). Labels are not required: the two risks are defined with respect to the local properties of the representation manifold of f under 𝒟, not relative to a task.
In fact, we neither explicitly nor implicitly consider decision boundaries or underlying classes, and we make no assumptions about the downstream task. What if x and x' are very similar? Perhaps we should not consider breakaway and overlap in such cases? We argue against this. First, the probability of very similar pairs would be low in a sufficiently diverse distribution 𝒟. Second, there is no clear notion of similarity on X without making assumptions about the downstream tasks. Finally, even if x is very similar to x', any perturbation x̃ ∈ B(x, ε) should still be most similar to x itself, as x and x̃ are defined to be visually indistinguishable. We call this the self-similarity assumption.

4. UNSUPERVISED ADVERSARIAL ATTACKS ON REPRESENTATION ENCODERS

It is not tractable to compute the supremum and infimum in Eqs. (1) and (2) exactly for a general f . Instead, we can approximate them via constrained optimization in the form of adversarial attacks. This section shows how to modify the FGSM and PGD supervised attacks for these objectives (Secs. 4.1 and 4.2), as well as how to generalize these methods to arbitrary loss functions (Sec. 4.3) . The adversarial examples can also be used for adversarial training (Sec. 4.4).

4.1. UNSUPERVISED FAST GRADIENT SIGN METHOD (U-FGSM) ATTACK

The Fast Gradient Sign Method (FGSM) is a popular one-step method for generating adversarial examples in the supervised setting (Goodfellow et al., 2015). Its untargeted mode perturbs the input x by taking a step of size α ∈ R_{>0} in the direction maximizing the classification loss L_cl relative to the true label y. In targeted mode, it minimizes the loss of classifying x as a target class t ≠ y:

    x̄ = clip(x + α sign(∇_x L_cl(f(x), y)))    (untargeted FGSM),
    x̄^{→t} = clip(x − α sign(∇_x L_cl(f(x), t)))    (targeted FGSM),

where clip(x) clips all values of x to be between 0 and 1.

Untargeted U-FGSM We can approximate the supremum in Eq. (1) by replacing L_cl with the representation divergence d, using a small perturbation η ∈ R^n to ensure a non-zero gradient:

    x̄ = clip(x + α sign(∇_x d(f(x), f(x + η)))).

Ho & Vasconcelos (2020) also propose an untargeted FGSM attack for the unsupervised setting; it requires batches rather than single images and uses a specific loss function, and hence is an instance of the L̄-FGSM attack in Sec. 4.3. The untargeted U-FGSM proposed here is independent of the loss used for training, making it more versatile.

Targeted U-FGSM We can also perturb x_i ∈ D so that its representation becomes close to f(x_j) for some x_j ∈ D, x_j ≠ x_i. A downstream model would then struggle to distinguish between the attacked input and the target x_j. This approximates the infimum in Eq. (2):

    x̄_i^{→j} = clip(x_i − α sign(∇_{x_i} d(f(x_i), f(x_j)))).
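The two U-FGSM variants above can be sketched in a few lines. The following is a minimal illustration, not the paper's experimental setup: it assumes a toy linear encoder f(x) = Wx with squared-ℓ2 divergence d, so the input gradient is analytic; W, α and η are illustrative choices.

```python
import numpy as np

# Toy linear encoder f(x) = W @ x with squared-L2 divergence d.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))

def f(x):
    return W @ x

def d(r, rp):
    return float(np.sum((r - rp) ** 2))

def grad_x(x, r_ref):
    # gradient of d(f(x), r_ref) = ||W x - r_ref||^2 w.r.t. x
    return 2.0 * W.T @ (f(x) - r_ref)

def u_fgsm_untargeted(x, alpha=0.05, eta_scale=1e-3):
    eta = eta_scale * rng.normal(size=x.shape)   # ensures a non-zero gradient
    g = grad_x(x, f(x + eta))                    # ascend d(f(.), f(x + eta))
    return np.clip(x + alpha * np.sign(g), 0.0, 1.0)

def u_fgsm_targeted(x_i, x_j, alpha=0.05):
    g = grad_x(x_i, f(x_j))                      # descend d(f(.), f(x_j))
    return np.clip(x_i - alpha * np.sign(g), 0.0, 1.0)

x = rng.uniform(0.2, 0.8, size=16)
x_j = rng.uniform(0.2, 0.8, size=16)
x_adv = u_fgsm_untargeted(x)       # approximates the supremum in Eq. (1)
x_imp = u_fgsm_targeted(x, x_j)    # approximates the infimum in Eq. (2)
```

With a real encoder, `grad_x` would be replaced by automatic differentiation through the model.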

4.2. UNSUPERVISED PROJECTED GRADIENT DESCENT (U-PGD) ATTACK

The PGD attack is the gold standard of supervised adversarial attacks (Madry et al., 2018). It iterates FGSM steps and projections Π_{x,ε} onto B(x, ε), the ℓ∞ ball of radius ε centered at x:

    x̄_0 = x̄^{→t}_0 = clip(x + U^n[−ε, ε])    (randomized initialization),
    x̄_{u+1} = clip(Π_{x,ε}[x̄_u + α sign(∇_{x̄_u} L_cl(f(x̄_u), y))])    (untargeted PGD),
    x̄^{→t}_{u+1} = clip(Π_{x,ε}[x̄^{→t}_u − α sign(∇_{x̄^{→t}_u} L_cl(f(x̄^{→t}_u), t))])    (targeted PGD).

We can construct the unsupervised PGD (U-PGD) attacks similarly to the U-FGSM attack:

    x̄_0 = x̄^{→j}_0 = clip(x + U^n[−ε, ε])    (randomized initialization),
    x̄_{u+1} = clip(Π_{x,ε}[x̄_u + α sign(∇_{x̄_u} d(f(x̄_u), f(x)))])    (untargeted U-PGD),
    x̄^{→j}_{u+1} = clip(Π_{x,ε}[x̄^{→j}_u − α sign(∇_{x̄^{→j}_u} d(f(x̄^{→j}_u), f(x_j)))])    (targeted U-PGD).

By replacing the randomized initialization with the η perturbation in the first iteration of the untargeted case, one obtains an unsupervised version of the BIM attack (Kurakin et al., 2017). The adversarial training methods proposed by Alayrac et al. (2019), Cemgil et al. (2020) and Nguyen et al. (2022) can be considered as using U-PGD attacks with specific divergence choices (see App. A.1).
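The iterate-and-project structure of U-PGD can be sketched as follows, again with a toy linear encoder and squared-ℓ2 divergence standing in for a real model (ε, α and the step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))                            # toy linear encoder
f = lambda x: W @ x
d = lambda r, rp: float(np.sum((r - rp) ** 2))
grad_x = lambda x, r_ref: 2.0 * W.T @ (f(x) - r_ref)    # for d = squared L2

def u_pgd(x, eps=0.05, alpha=0.01, steps=10, target=None):
    # randomized initialization inside B(x, eps)
    xu = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        if target is None:
            xu = xu + alpha * np.sign(grad_x(xu, f(x)))       # untargeted: ascend d
        else:
            xu = xu - alpha * np.sign(grad_x(xu, f(target)))  # targeted: descend d
        xu = np.clip(xu, x - eps, x + eps)                    # project onto B(x, eps)
        xu = np.clip(xu, 0.0, 1.0)                            # stay a valid image
    return xu

x = rng.uniform(0.2, 0.8, size=16)
x_j = rng.uniform(0.2, 0.8, size=16)
x_adv = u_pgd(x)                   # untargeted U-PGD
x_imp = u_pgd(x, target=x_j)       # targeted U-PGD towards f(x_j)
```

Dropping the randomized initialization (starting from x + η instead) yields the unsupervised BIM variant mentioned above.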

4.3. LOSS-BASED ATTACKS

In both their supervised and unsupervised variants, the FGSM and PGD attacks perturb the input to maximize or minimize the divergence d. By considering arbitrary differentiable loss functions instead, we can define a more general class of loss-based attacks.

Instance-wise loss-based attacks (L-FGSM, L-PGD) Given an instance x ∈ X and a loss function L : (X → R) × X → R, the L-FGSM attack takes a step in the direction maximizing L:

    x̄ = clip(x + α sign(∇_x L(f, x))).

Similarly, for a loss function L : (X → R) × X × X → R taking a representation encoder, a sample and the previous iterate of the attack, the loss-based PGD attack is defined as:

    x̄_0 = clip(x + U^n[−ε, ε]),    x̄_{u+1} = clip(Π_{x,ε}[x̄_u + α sign(∇_{x̄_u} L(f, x, x̄_u))]).

If we do not use random initialization for the attack, we get the L-BIM attack. The supervised and unsupervised FGSM and PGD attacks are special cases of the L-FGSM and L-PGD attacks. Furthermore, prior unsupervised adversarial training methods can also be represented as L-PGD attacks (Kim et al., 2020; Jiang et al., 2020). A full description is provided in App. A.1.

Batch-wise loss-based attacks (L̄-FGSM, L̄-PGD) Attacking whole batches instead of single inputs can account for interactions between the individual inputs in a batch. The above attacks can be naturally extended to work over batches by simultaneously attacking all inputs in X = [x_1, ..., x_N] with a more general loss function L̄ : (X → R) × X^N → R. The batch-wise loss-based FGSM attack L̄-FGSM is given in Eq. (3), with L̄-PGD and L̄-BIM defined similarly:

    X̄ = clip(X + α sign(∇_X L̄(f, X))).    (3)

Any instance-wise loss-based attack can be trivially represented as a batch-wise attack. Additionally, prior unsupervised adversarial training methods can also be represented as L̄-FGSM and L̄-PGD attacks (Ho & Vasconcelos, 2020; Fan et al., 2021; Jiang et al., 2020) (see App. A.2).
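The generalization from U-PGD to L-PGD amounts to parameterizing the attack by an arbitrary loss gradient. A minimal sketch, assuming the same toy linear encoder as stand-in (`loss_grad` is a hypothetical callable supplied by the user, not an API from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 16))     # toy linear encoder f(x) = W @ x
f = lambda x: W @ x

def l_pgd(x, loss_grad, eps=0.05, alpha=0.01, steps=10, random_init=True):
    """Generic instance-wise L-PGD: loss_grad(f, x, xu) returns the gradient
    of an arbitrary loss L(f, x, xu) w.r.t. the current iterate xu."""
    xu = x + (rng.uniform(-eps, eps, size=x.shape) if random_init else 0.0)
    xu = np.clip(xu, 0.0, 1.0)
    for _ in range(steps):
        xu = xu + alpha * np.sign(loss_grad(f, x, xu))    # ascend the loss
        xu = np.clip(np.clip(xu, x - eps, x + eps), 0.0, 1.0)
    return xu

# Untargeted U-PGD recovered as the special case L = d(f(xu), f(x)),
# with d the squared-L2 divergence:
div_grad = lambda f_, x, xu: 2.0 * W.T @ (f_(xu) - f_(x))
x = rng.uniform(0.2, 0.8, size=16)
x_adv = l_pgd(x, div_grad)
```

Setting `random_init=False` gives the L-BIM attack; swapping `div_grad` for the gradient of a contrastive loss recovers the prior attacks discussed in App. A.1.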

4.4. ADVERSARIAL TRAINING FOR UNSUPERVISED LEARNING

Adversarial training is a min-max problem: minimizing a loss relative to a worst-case perturbation that maximizes it (Goodfellow et al., 2015). As the worst-case perturbation cannot be computed exactly (similarly to Eqs. (1) and (2)), adversarial attacks are usually used to approximate it. Any of the aforementioned attacks can be used for the inner optimization of adversarial training. Prior works use divergence-based (Alayrac et al., 2019; Cemgil et al., 2020; Nguyen et al., 2022) and loss-based attacks (Kim et al., 2020; Jiang et al., 2020; Ho & Vasconcelos, 2020; Fan et al., 2021). These methods tend to depend on complex loss functions and may work only for certain models. Therefore, we propose using targeted or untargeted U-PGD, as well as L-PGD with the loss used for training. They are simple to implement and can be applied to any representation learning model.
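The min-max structure can be sketched for a toy linear encoder trained with analytic gradients. This is only an illustration of the loop, not the paper's training setup: the inner maximization is a one-step untargeted attack, and the outer objective combines the adversarial divergence with a stand-in decorrelation penalty that prevents representational collapse (a real implementation would instead use the model's own training loss, e.g. a contrastive objective).

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(size=(8, 16)) * 0.3                  # toy linear encoder f(x) = W @ x

def attack(x, W, eps=0.05):
    # one-step untargeted attack: maximize ||W(x_adv - x)||^2 within B(x, eps)
    g = 2.0 * W.T @ (W @ (x + 1e-3) - W @ x)        # gradient seed via a small shift
    return np.clip(np.clip(x + eps * np.sign(g), x - eps, x + eps), 0.0, 1.0)

lr, lam = 1e-3, 0.1
X = rng.uniform(0, 1, size=(32, 16))
for _ in range(20):                                 # outer minimization over W
    grad_W = np.zeros_like(W)
    for x in X:
        delta = attack(x, W) - x                    # inner maximization
        grad_W += 2.0 * np.outer(W @ delta, delta)  # d/dW of ||W delta||^2
    A = W @ W.T - np.eye(8)                         # anti-collapse penalty term
    grad_W = grad_W / len(X) + lam * 4.0 * A @ W    # d/dW of ||W W^T - I||_F^2
    W -= lr * grad_W

adv_div = float(np.mean([np.sum((W @ (attack(x, W) - x)) ** 2) for x in X]))
```

In practice the inner attack would be a multi-step U-PGD or L-PGD and the outer step an SGD update of the full encoder.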

5. ROBUSTNESS ASSESSMENT WITH NO LABELS

The success of a supervised attack is clear-cut: the predicted class either differs from that of the clean sample or it does not. In the unsupervised case, however, it is not clear when an adversarial attack results in a representation that is "too far" from the clean one. In this section, we propose using quantiles to quantify distances and discuss how to estimate the breakaway and overlap risks (Eqs. (1) and (2)).

Universal quantiles (UQ) for untargeted attacks For untargeted attacks, we propose measuring d(f(x̄), f(x)) relative to the distribution of divergences between representations of samples from 𝒟. In particular, we suggest reporting the quantile

    q = P_{x',x''∼𝒟}[ d(f(x'), f(x'')) ≤ d(f(x̄), f(x)) ].

This measure is independent of downstream tasks and depends only on the properties of the encoder and 𝒟. We can use it to compare different models, as it is agnostic to the different representation magnitudes models may have. In practice, the quantile values can be estimated from the dataset D.

Relative quantiles (RQ) for targeted attacks Nothing prevents universal quantiles from being applied to targeted attacks. However, considering that targeted attacks try to "impersonate" a particular target sample, we propose using relative quantiles to assess their success: the distance d(f(x̄_i^{→j}), f(x_j)) induced by the attack relative to d(f(x_i), f(x_j)), the original distance between the clean sample and the target. The relative quantile for a targeted attack x̄_i^{→j} is then the ratio d(f(x̄_i^{→j}), f(x_j)) / d(f(x_i), f(x_j)).

Quantiles are a good way to assess the success of individual attacks or to compare different models. However, they do not take into account the local properties of the representation manifold, i.e. that some regions of R might be more densely populated than others. The breakaway and overlap risk measures were defined for exactly this purpose. Hence, we propose estimating them.
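Estimating the universal quantile from a dataset reduces to comparing the attack-induced divergence against a sample of pairwise divergences. A minimal Monte Carlo sketch, with a toy linear encoder and a placeholder perturbation standing in for a real model and attack:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 16))                       # toy linear encoder
f = lambda x: W @ x
d = lambda r, rp: float(np.sum((r - rp) ** 2))     # squared-L2 divergence

def universal_quantile(d_attack, reps, n_pairs=2000):
    """Fraction of divergences between random representation pairs that do
    not exceed the clean-to-adversarial divergence d_attack."""
    n = len(reps)
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                                  # distinct pairs only
    pair_divs = [d(reps[a], reps[b]) for a, b in zip(i[keep], j[keep])]
    return float(np.mean([pd <= d_attack for pd in pair_divs]))

X = rng.uniform(0, 1, size=(200, 16))              # stand-in dataset D
reps = [f(x) for x in X]
x_adv = np.clip(X[0] + 0.05, 0.0, 1.0)             # placeholder "attack" on X[0]
q = universal_quantile(d(f(X[0]), f(x_adv)), reps)

# Relative quantile for a targeted attack x_imp on x_i towards x_j:
rq = lambda x_i, x_j, x_imp: d(f(x_imp), f(x_j)) / d(f(x_i), f(x_j))
```

Because q is a quantile over the encoder's own divergence distribution, it can be compared across models with very different representation scales.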
Estimating the breakaway risk While the supremum in Eq. (1) cannot be computed exactly, it can be approximated using the untargeted U-FGSM and U-PGD attacks. Therefore, we can compute a Monte Carlo estimate of Eq. (1) by sampling pairs (x, x') from the dataset D and performing an untargeted attack on x, for example with U-PGD.

Nearest neighbour accuracy As the breakaway risk can be very small for robust encoders, we also propose reporting the fraction of samples in D' ⊆ D whose untargeted attacks x̄ have their corresponding clean samples x as their nearest clean neighbour in D:

    (1 / |D'|) Σ_{x ∈ D'} 1[ ∄ x' ∈ D, x' ≠ x, s.t. d(f(x'), f(x̄)) < d(f(x), f(x̄)) ].    (4)

Estimating the overlap risk The infima in Eq. (2) can be estimated with an unsupervised targeted attack. Hence, an estimate of Eq. (2) can be computed by sampling pairs (x_i, x_j) from the dataset D and computing the targeted attacks x̄_i^{→j} and x̄_j^{→i}. The overlap risk estimate is then the fraction of pairs for which d(f(x_i), f(x̄_j^{→i})) < d(f(x_i), f(x̄_i^{→j})).

Adversarial margin Eq. (2) only accounts for whether overlap occurs, not for the magnitude of the violation. Therefore, we also propose reporting the margin between the two attacked representations, normalized by the divergence between the clean samples:

    [ d(f(x_i), f(x̄_j^{→i})) − d(f(x_i), f(x̄_i^{→j})) ] / d(f(x_i), f(x_j)),

for randomly selected pairs (x_i, x_j) from D. If overlap occurs, this ratio is negative, with more negative values indicating stronger violations. The overlap risk is therefore the probability of a negative adversarial margin.
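The nearest neighbour accuracy (Eq. (4)) and the adversarial margin can be sketched directly from their definitions. The snippet below uses a toy linear encoder and a weak random perturbation as a stand-in for a real untargeted attack; with a genuine U-PGD attack the structure is identical:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(8, 16))                      # toy linear encoder
f = lambda x: W @ x
d = lambda r, rp: float(np.sum((r - rp) ** 2))

def nn_accuracy(X, attack):
    """Fraction of samples whose attacked representation is still nearest
    to their own clean representation (Eq. 4)."""
    reps = [f(x) for x in X]
    hits = 0
    for i, x in enumerate(X):
        r_adv = f(attack(x))
        d_self = d(reps[i], r_adv)
        # a hit: no other clean sample is strictly closer to the attack
        if all(d(reps[j], r_adv) >= d_self for j in range(len(X)) if j != i):
            hits += 1
    return hits / len(X)

def adversarial_margin(x_i, x_j, adv_i_to_j, adv_j_to_i):
    # negative values indicate overlap of f(B(x_i, eps)) and f(B(x_j, eps))
    return (d(f(x_i), f(adv_j_to_i)) - d(f(x_i), f(adv_i_to_j))) / d(f(x_i), f(x_j))

# Stand-in "attack": a small random l_inf perturbation (illustrative only)
attack = lambda x: np.clip(x + 0.01 * np.sign(rng.normal(size=x.shape)), 0, 1)
X = rng.uniform(0, 1, size=(50, 16))
acc = nn_accuracy(X, attack)
```

The overlap risk estimate is then simply the fraction of sampled pairs with a negative `adversarial_margin`.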

Certified unsupervised robustness

The present work depends on gradient-based attacks, which can be fooled by gradient masking (Athalye et al., 2018; Uesato et al., 2018). Hence, we also assess the certified robustness of the encoder. Using center smoothing (Kumar & Goldstein, 2021), we can compute a probabilistic guarantee on the radius of the ℓ2-ball in R that contains at least half of the probability mass of f(x + N(0, σ²I)). The smaller this radius, the closer f maps similar inputs. Hence, this is a probabilistically certified alternative to assessing robustness via untargeted attacks. To compare certified radius values in R across models, we report them as universal quantiles.

6. EXPERIMENTS

We assess the robustness of state-of-the-art representation encoders against the unsupervised attacks and robustness measures from Secs. 4 and 5. We consider the ResNet50-based self-supervised models MOCOv2 (He et al., 2020; Chen et al., 2020d), MOCO with non-semantic negatives (Ge et al., 2021), PixPro (Xie et al., 2021), AMDIM (Bachman et al., 2019), SimCLRv2 (Chen et al., 2020c), and SimSiam (Chen & He, 2021). To compare self-supervised and supervised methods, we also evaluate the penultimate layer of ResNet50 (He et al., 2016) and a supervised adversarially trained ResNet50 (Salman et al., 2020). We assess the effect of using different unsupervised attacks by fine-tuning MOCOv2 with untargeted U-PGD, targeted U-PGD, and L-PGD using MOCOv2's contrastive loss, as proposed in Sec. 4.4. App. B reports the performance of unsupervised adversarial training on ResNet50 and the transformer-based MOCOv3 (Chen et al., 2021). The unsupervised evaluation uses the PASS dataset, as it does not contain people or identifiable information and has proper licensing (Asano et al., 2021). ImageNet (Russakovsky et al., 2015) is used for accuracy benchmarking and the adversarial fine-tuning of MOCO, so as to be consistent with how the model was trained. Asirra (Elson et al., 2007) is used for the impersonation attacks. We report median UQ and RQ for the ℓ2 divergence for the untargeted and targeted U-PGD attacks with ε = 0.05 and ε = 0.10. We also estimate the breakaway risk, nearest neighbour accuracy, overlap risk and adversarial margin, as well as certified unsupervised robustness, as described in Sec. 5. As customary, we measure the quality of the representations with the top-1 and top-5 accuracy of a linear probe. We also report the accuracy on samples without high-frequency components, as models might be overly reliant on the high-frequency features in the data (Wang et al., 2020).
Additionally, we assess the certified robustness via randomized smoothing (Cohen et al., 2019) and report the resulting Average Certified Radius (Zhai et al., 2020). In line with the impersonation threat model, we also evaluate to what extent attacking a representation encoder can fool a private downstream classifier. Pairs of cats and dogs from the Asirra dataset (Elson et al., 2007) are attacked with targeted U-PGD so that the representation of one is close to that of the other. We report the percentage of impersonations that successfully fool the linear probe.

7. RESULTS

In this section, we present the results of the experiments on ResNet50 and the ResNet50-based unsupervised encoders. Further results can be found in App. B.

Supervised adversarial training performs well on all unsupervised measures. We validate our unsupervised robustness measures by comparing the robustness at the penultimate layers of ResNet50 and a supervised adversarially trained version of it. The first two columns of Tab. 2 show that adversarially trained ResNet50 scores significantly better on all unsupervised measures. This demonstrates that our measures successfully detect models which we know to be robust in a supervised setting.

There is no "most robust" standard model. Amongst the standard unsupervised models, none dominates on all unsupervised robustness measures (see Tab. 2). AMDIM is least susceptible to targeted U-PGD attacks but has the worst untargeted U-PGD, breakaway risk and nearest neighbour accuracy. PixPro significantly outperforms the other models on untargeted attacks. AMDIM and PixPro also have the lowest overlap risk and largest median adversarial margin. At the same time, the model with the lowest breakaway risk and the highest average certified radius is SimCLR. While either AMDIM or PixPro scores best on most measures, they both have significantly higher breakaway risk and lower average certified radius than SimCLR. Therefore, no model is a clear choice for the "most robust model".

Unsupervised robustness measures reveal significant differences among standard models. The gap between the best and worst performing unsupervised models for the six measures based on targeted U-PGD attacks is between 19% and 27%. The gap reaches almost 81% for the untargeted case (PixPro vs AMDIM, 5 it.), demonstrating that standard models on both extremes do exist. AMDIM has 10.5 times higher breakaway risk than SimCLR while at the same time 2.7 times lower overlap risk than MOCOv2.
Observing values on both extremes of all unsupervised robustness measures testifies to their usefulness for differentiating between models. Additionally, AMDIM having both the highest breakaway risk and the lowest overlap risk indicates that unsupervised robustness is a multifaceted problem and that models should be evaluated against an array of measures.

The adversarially fine-tuned models are also more certifiably robust (Fig. 4). However, the added robustness comes at the price of reduced accuracy (7% to 10%, Tab. 1), as is typical for adversarial training (Zhang et al., 2019; Tsipras et al., 2019). This gap can likely be reduced by tuning the trade-off between the adversarial and standard objectives and by using separate batch normalization parameters for standard and adversarial samples (Kim et al., 2020; Ho & Vasconcelos, 2020). Adversarial training also halves the impersonation rate of a downstream classifier at 5 iterations relative to MOCOv2 (Tab. 4). At 50 iterations, the rate is similar to MOCOv2, but the attacked images of the adversarially fine-tuned models have more semantically meaningful distortions, which can be detected by a human auditor (see App. D for examples). These results are for only 10 iterations of fine-tuning of a standard encoder. Further impersonation rate reduction can likely be achieved with adversarial training for all 200 epochs.

Unsupervised adversarial training results in certifiably more robust classifiers. Fig. 3 shows that the randomized-smoothing linear probes of the adversarially trained models uniformly outperform MOCOv2. The difference is especially evident for large radii: 3 times higher certified accuracy for perturbations with a radius of 0.935. These results demonstrate that unsupervised adversarial training boosts downstream certified accuracy.

Adversarially trained models have better consistency between standard and low-pass accuracy.
The difference between standard and low-pass accuracy for the adversarially trained models is between 1.7% and 2.1%, compared to 2.2% to 9.7% for the standard models (Tab. 1). This could partly be due to the lower accuracy of the adversarially trained models. However, PixPro, AMDIM and SimSiam have similar accuracy but larger gaps, indicating that the lower accuracy cannot fully explain the smaller gap. This suggests that unsupervised adversarial training can help with learning robust low-frequency features and rejecting high-frequency non-semantic ones.

L-PGD is the overall most robust model, albeit with lower accuracy. The L-PGD-trained model dominates across all unsupervised robustness measures. These results support the findings of prior work on unsupervised adversarial training using batch loss optimization (Ho & Vasconcelos, 2020; Fan et al., 2021; Jiang et al., 2020). However, it also has the lowest supervised accuracy of the three models (Tab. 1), as expected from the accuracy-robustness trade-off. Still, the differences between the three models are small, and hence all three adversarial training methods can improve the robustness of unsupervised representation learning models.

8. DISCUSSION, LIMITATIONS AND CONCLUSION

Unsupervised task-independent adversarial training with simple extensions of classic adversarial attacks can improve the robustness of encoders used for multiple downstream tasks, which is especially valuable when such encoders are released publicly. That is why we will release the adversarially fine-tuned versions of MOCOv2 and MOCOv3, which can be used as more robust drop-in replacements for applications built on top of these two models. We showed how to assess the robustness of such encoders without resorting to labeled datasets or proxy tasks. However, there is no single "unsupervised robustness measure": models can have drastically different performance across the different metrics. Still, unsupervised robustness is a stronger requirement than classification robustness, as it requires not only the output but also an intermediate state of the model to be insensitive to small perturbations. Hence, we recommend that unsupervised assessment and adversarial training also be applied to supervised tasks.

We do not compare with the prior methods in Sec. 2, as different base models, datasets and objective functions hinder a fair comparison. Moreover, the methods we propose generalize the previous works, hence this paper strengthens their conclusions rather than claiming improvement over them. This work is not an exhaustive exploration of unsupervised attacks, robustness measures and defences. We adversarially fine-tuned only three models: two based on ResNet50 and one transformer-based (in App. B); assessing how these techniques work on other architectures requires further study. Many other areas warrant further investigation, such as non-gradient-based attacks, measures which better predict the robustness of downstream tasks, certified defences, and the accuracy-robustness trade-off for representation learning. Still, we believe that robustness evaluation of representation learning models is necessary for a comprehensive assessment of their performance and robustness.
This is especially important for encoders used for applications susceptible to impersonation attacks. Therefore, we recommend reporting unsupervised robustness measures together with standard and low-pass linear probe accuracy when proposing new unsupervised and supervised learning models. We hope that this paper illustrates the breadth of opportunities for robustness evaluation in representation space and inspires further work on it.

ETHICS STATEMENT

This work discusses adversarial vulnerabilities in unsupervised models and therefore exposes potential attack vectors for malicious actors. However, it also proposes defence strategies in the form of adversarial training which can alleviate the problem, as well as measures to assess how vulnerable representation learning models are. Therefore, we believe that it would empower the developers of safety-, reliability-and fairness-critical systems to develop safer models. Moreover, we hope that this work inspires further research into unsupervised robustness, which can contribute to more robust and secure machine learning systems.

REPRODUCIBILITY STATEMENT

The experiments in this paper were implemented using open-source software packages (Harris et al., 2020; Virtanen et al., 2020; McKinney, 2010; Paszke et al., 2019) , as well as the publicly available MOCOv2 and MOCOv3 models (He et al., 2020; Chen et al., 2020d; 2021) . We provide the code used for the adversarial training, as well as all the robustness evaluation implementations, together with documentation on their use. We also release the weights of the models and linear probes. The code reproducing all the experiments in this paper is provided as well. The details are available here.

A EXAMPLES OF LOSS-BASED ATTACKS

In this appendix we demonstrate how various supervised and unsupervised attacks can be represented as loss-based attacks. This generalizes the unifying view presented by Madry et al. (2018) by also incorporating unsupervised attacks. We illustrate the resulting hierarchy of attack generalizations in Fig. 5. In the following, π(x) is the true class of the instance x, π̄(x) is any other class, σ(x) is a function that returns a sample from D such that σ(x) ≠ x, and κ(x) provides a different view of x, e.g. a different augmentation.

A.1 EXAMPLES OF INSTANCE-WISE LOSS-BASED ATTACKS (L-FGSM, L-PGD, L-BIM)

• The supervised FGSM and PGD attacks can be considered special cases of the unsupervised L-FGSM and L-PGD, where f is a classifier rather than an encoder and a classification loss is used:
  - Untargeted FGSM attack: L-FGSM with L(f, x) = L_cl(f(x), π(x)),
  - Targeted FGSM attack: L-FGSM with L(f, x) = −L_cl(f(x), π̄(x)),
  - Untargeted PGD attack: L-PGD with L(f, x, x̄_u) = L_cl(f(x̄_u), π(x)),
  - Targeted PGD attack: L-PGD with L(f, x, x̄_u) = −L_cl(f(x̄_u), π̄(x)).
• The U-FGSM and U-PGD attacks can be represented as L-FGSM and L-PGD attacks where the loss is the divergence d:
  - Untargeted U-FGSM attack: L-FGSM with L(f, x) = d(f(x), f(x + η)),
  - Targeted U-FGSM attack: L-FGSM with L(f, x) = −d(f(x), f(σ(x))),
  - Untargeted U-PGD attack: L-PGD with L(f, x, x̄_u) = d(f(x̄_u), f(x)),
  - Targeted U-PGD attack: L-PGD with L(f, x, x̄_u) = −d(f(x̄_u), f(σ(x))).
• Untargeted U-PGD with the Kullback-Leibler divergence corresponds to the adversarial example generation process of UAT-OT (Alayrac et al., 2019), which is based on the Virtual Adversarial Training method for semi-supervised adversarial learning (Miyato et al., 2019). It also corresponds to the robustness regularizer proposed by Nguyen et al. (2022).
• Cemgil et al.
(2020) propose an unsupervised attack for Variational Auto-Encoders (VAEs) (Kingma & Welling, 2014) based on the Wasserstein distance. It can be represented as the untargeted U-PGD attack with the Wasserstein distance, or equivalently, as the L-PGD attack with the loss L(f, x, xu ) = W (N([f (x)] µ , I[f (x)] σ ), N[(f (x u )] µ , I[f (x u )] σ )) , where W is the Wasserstein distance, N is the normal distribution, I is the identity matrix and the subscripts µ and σ designate the respective outputs of the VAE encoder f . • The BYORL method by Gowal et al. ( 2020) is equivalent to the untargeted U-PGD attack when the divergence d is chosen to be the cosine similarity. • The instance-wise unsupervised adversarial attack proposed by Kim et al. ( 2020) is equivalent to L-PGD with the contrastive loss L(f, x, xu ) = -log exp f (x u ) ⊤ f (κ(x))/T exp (f (x u ) ⊤ f (κ(x))/T ) + exp (f (x u ) ⊤ f (σ(x))/T ) , where T is a temperature parameter. This loss encourages that the cosine similarity between the adversarial example and another view of the same sample is small relative to the cosine similarity between the adversarial example and another sample from D. • The concurrent work by Kim et al. (2022) proposing targeted adversarial self-supervised learning can also be considered to be an instance of the targeted L-PGD attack. • Using the NT-Xent loss with L-PGD results in the attack used for the Adversarial-to-Standard adversarial contrastive learning proposed by Jiang et al. (2020) . 
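The untargeted U-PGD attack above (L-PGD with L(f, x, x_u) = d(f(x_u), f(x))) can be sketched in PyTorch. This is a minimal sketch, assuming an ℓ∞ constraint, inputs in [0, 1], the ℓ2-induced divergence, and a small random start to avoid a zero gradient at initialization; the function and helper names are ours, not the paper's implementation:

```python
import torch

def untargeted_upgd(f, x, eps=0.05, alpha=0.001, steps=10):
    """Untargeted U-PGD sketch: signed-gradient ascent on the l2 divergence
    between the representations of the perturbed and the clean input,
    projected back onto an l_inf ball of radius eps around x."""
    with torch.no_grad():
        r_clean = f(x)
    # small random start so the divergence gradient is well-defined
    x_adv = (x + 1e-3 * torch.randn_like(x)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        # divergence d(f(x_u), f(x)) = ||f(x_u) - f(x)||_2, summed over the batch
        loss = torch.norm(f(x_adv) - r_clean, p=2, dim=-1).sum()
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project onto the eps-ball
            x_adv = x_adv.clamp(0, 1)                  # keep a valid image range
    return x_adv.detach()
```

The targeted variant follows by replacing r_clean with the target representation and descending rather than ascending on the divergence.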

A.2 EXAMPLES OF BATCH-WISE LOSS-BASED ATTACKS (L̄-FGSM, L̄-PGD, L̄-BIM)

• Any instance-wise L-FGSM attack with loss L is trivially a batch-wise L̄-FGSM attack by taking N = 1 and considering the loss function L̄ = L.
• Similarly, any L-PGD attack is trivially an L̄-PGD attack.
• The adversarial attack proposed by Ho & Vasconcelos (2020) is equivalent to L̄-FGSM with the contrastive loss

  L̄(f, X) = Σ_{i=1}^N −log [ exp(f(x_i)^⊤ f(κ(x_i))/T) / Σ_{j=1}^N exp(f(x_j)^⊤ f(κ(x_i))/T) ].

  Here the adversarial examples are selected to jointly maximize the contrastive loss.
• The adversarial training of AdvCL generates adversarial attacks by maximizing a multi-view contrastive loss computed over the adversarial example, two views of x and its high-frequency component HighPass(x) (Fan et al., 2021). It corresponds to L̄-PGD with the loss

  L̄(f, X, X_u) = (1/N) Σ_{i=1}^N L′(κ₁(x_i), κ₂(x_i), x_{i,u}, HighPass(x_i); f, X),

  L′(z_1, ..., z_m; f, X) = −Σ_{i=1}^m Σ_{j=1, j≠i}^m log [ exp(sim(f(z_i), f(z_j))/T) / Σ_{z_k ∈ X} Σ_{κ ∈ {κ₁, κ₂}} exp(sim(f(z_i), f(κ(z_k)))/T) ],

  with sim(·, ·) being the cosine similarity.
• Using the NT-Xent loss and the L̄-PGD attack on a pair of views of x is identical to the Adversarial-to-Adversarial and Dual Stream adversarial contrastive learning methods by Jiang et al. (2020).
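As an illustration, the batch contrastive loss above (in its Ho & Vasconcelos-style form, with the diagonal as positive pairs) can be sketched in PyTorch. The function name, the default temperature, and the use of normalized representations for the cosine similarity are our assumptions:

```python
import torch
import torch.nn.functional as F

def batch_contrastive_loss(reps, reps_aug, T=0.2):
    """Batch contrastive loss: for each augmented view kappa(x_i), the
    matching f(x_i) is the positive and all other base representations in
    the batch are negatives (cross-entropy over row-wise similarities with
    the diagonal as the target), summed over the batch."""
    z = F.normalize(reps, dim=1)          # f(x_j), N x C
    z_aug = F.normalize(reps_aug, dim=1)  # f(kappa(x_i)), N x C
    logits = z_aug @ z.t() / T            # logits[i, j] = f(x_j)^T f(kappa(x_i)) / T
    labels = torch.arange(z.size(0))      # diagonal entries are the positives
    return F.cross_entropy(logits, labels, reduction='sum')
```

A batch-wise attack then perturbs the inputs to jointly maximize this loss, exactly as in the L̄-FGSM and L̄-PGD templates.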

B EXTENDED RESULTS

This appendix presents more comprehensive experimental results in Tabs. 5 to 7 and Figs. 6 to 11. We consider ResNet50 (penultimate layer) (He et al., 2016), a supervised adversarially trained ResNet50 (ℓ∞, ϵ = 4/255, penultimate layer) (Salman et al., 2020), and the ResNet50-based self-supervised learning models MOCOv2 (200 epochs) (He et al., 2020; Chen et al., 2020d), MOCO with non-semantic negatives (+Patch, k=16384, α=3) (Ge et al., 2021), PixPro (400 epochs) (Xie et al., 2021), AMDIM (Medium) (Bachman et al., 2019), SimCLRv2 (depth 50, width 1x, without selective kernels) (Chen et al., 2020c), and SimSiam (100 epochs, batch size 256) (Chen & He, 2021).

In addition to the ResNet50-based models, we also present results from two models with transformer architectures (Vaswani et al., 2017). MAE uses masked autoencoders (He et al., 2022); we use its ViT-Large variant. MOCOv3 is a modification of MOCOv2 to work with a transformer backbone (ViT-Small) (Chen et al., 2021). We also apply the targeted and untargeted unsupervised adversarial fine-tuning techniques from the main text to the ResNet50 encoder, and the targeted, untargeted and loss-based unsupervised adversarial fine-tuning to MOCOv3, to compare how our findings from Sec. 7 transfer to supervised learning encoders and transformers. We use the exact same attack types and parameter values as for MOCOv2. The implementation details are in App. C.

For all models, we present top-1 and top-5 accuracy for both the standard and the lowpass settings in Tabs. 5 to 7. We report the results for the ℓ2- and ℓ∞-induced divergences in representation space, at iterations 5, 10, 30 and 50, for U-PGD attacks with both ϵ = 0.05 and ϵ = 0.10. The attacks are performed with d(x, x′) = ∥x − x′∥₂ and α = 0.001. These are reported as universal quantiles for the untargeted attacks and relative quantiles for the targeted attacks.
We also report the breakaway and overlap risks, as well as the nearest neighbor accuracy, median adversarial margin, average certified radius and impersonation rates, as in the main text. Figs. 6 to 11 show the certified robustness and accuracy of the models which were omitted from the main text.

Some of the models, including MOCOv2, are trained with a contrastive objective based on the cosine similarity. Using the Euclidean distance as the divergence in representation space could therefore be considered an unfair comparison, as the adversarially trained models are optimized for it while the standard models are optimized for the cosine similarity. Therefore, for targeted attacks, we also report the median cosine similarity between the representations of the adversarial examples and the representations of the target samples. Higher values mean that the attack is more successful and that the model is less robust. For the untargeted attacks, we report the median cosine similarity between the representations of the adversarial examples and the representations of the original samples. Hence, higher values mean that the attack is less successful and the model is more robust.

The results in Tabs. 5 to 7 show that adversarial training with the ℓ2-induced divergence also leads to improvements when measured by the cosine similarity: the divergence that the standard models are trained for but the adversarial ones are not. This evidence supports our claim that the improvements we see from unsupervised adversarial fine-tuning are not due to our choice of divergence.
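The universal-quantile reporting described above can be sketched as follows. This is a minimal sketch; the function name and the ≤ convention for ties are our assumptions, with the clean divergences taken from the empirical distribution described in App. C.5:

```python
import numpy as np

def universal_quantile(divergence, clean_divergences):
    """Position of an attack-induced representation divergence within the
    empirical distribution of divergences between clean sample pairs.
    Returns a value in [0, 1]; lower values mean the attack moved the
    representation less than typical clean pairs differ from each other."""
    clean = np.asarray(clean_divergences, dtype=float)
    return float(np.mean(clean <= divergence))
```

For example, an attack divergence that exceeds half of the clean pairwise divergences has a universal quantile of 0.5.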

B.1 SUPERVISED BACKBONE MODELS BASED ON RESNET50

In Tab. 6 and Figs. 8 and 9 we compare the penultimate layer of ResNet50, the penultimate layer of ResNet50 trained with supervised adversarial training, as well as ResNet50 (the same as the first model) fine-tuned using our targeted and untargeted U-PGD attacks. We did not consider L-PGD as ResNet does not have a corresponding loss in representation space due to its supervised nature. The technical details are in App. C.3.

MAE also has some of the best robustness against targeted U-PGD attacks and is competitive with PixPro in the untargeted case. It attains the lowest breakaway and overlap risks among all standard models, as well as the largest median adversarial margin, and is the most robust to impersonation attacks. In comparison, MOCOv3 scores rather poorly on most measures. Hence, the robustness of MAE cannot be attributed solely to the transformer backbone. This complements the observed variation in robustness among the ResNet50 models and provides evidence that it is likely the objective function, rather than the backbone architecture, that determines robustness.

The three adversarially trained MOCOv3 models incur a lower accuracy penalty than the corresponding MOCOv2 models: between 0.2% and 3.3% for the clean and between -0.9% and 0.8% for the lowpass accuracy for MOCOv3, compared with respectively 5.3%-10.2% and 2.8%-6.8% for MOCOv2. In fact, the adversarially trained MOCOv3 TAR model has a higher lowpass accuracy than the standard MOCOv3. Unsupervised adversarial training also leads to uniformly better robustness against targeted and untargeted U-PGD attacks, albeit with a smaller improvement than for MOCOv2. We similarly see large improvements in the overlap risk and median adversarial margin measures. The average certified radius and the certified accuracy (Fig. 10) are also significantly improved by adversarial training.

Fig. 11 shows that the adversarially trained models are also more certifiably robust than the baseline MOCOv3. However, the three adversarially trained models actually fare worse than the baseline MOCOv3 for breakaway risk and are less robust to impersonation attacks. The impersonation attacks are also less semantic in nature than the ones for the MOCOv2 adversarially trained models (Figs. 21 to 23 vs. Figs. 26 to 28). This could be due to the learning rate being too low, rather than due to transformer models being inherently more difficult to adversarially fine-tune in an unsupervised setting. The lower accuracy gap and the lower robustness further indicate that the adversarial training might not have been as "aggressive" for MOCOv3 as it was for MOCOv2. Still, while the improvements for MOCOv3 are not as drastic as for MOCOv2, unsupervised adversarial training does improve most robustness measures. The lower effectiveness of unsupervised adversarial training for MOCOv3, especially in its role as a defence against impersonation attacks, is an avenue for future work that should examine whether there are fundamental differences between unsupervised adversarial training of CNN and Transformer models.

C TRAINING AND EVALUATION DETAILS

This section provides further details on the unsupervised adversarial training and the evaluation metrics implementations.

C.1 UNSUPERVISED ADVERSARIAL TRAINING FOR MOCOV2

The three variants of the adversarially fine-tuned MOCOv2 are obtained by modifying the official MOCO source code. We perform fine-tuning by resuming the training procedure for an additional 10 epochs but with the modified training loop. The unsupervised adversarial examples are concatenated to the model's q-inputs and the k-inputs are correspondingly duplicated, as shown in List. 1. For L-PGD we use InfoNCE (Oord et al., 2018), the loss that MOCOv2 is trained with. All parameters, including the learning rate and its decay, are as used for the original training and as reported by He et al. (2020). We only reduced the batch size from 256 to 192 in order to be able to train on four GeForce RTX 2080 Ti GPUs.

Listing 1: Pseudocode of adversarial fine-tuning for MoCo (modified from He et al. (2020)).

C.2 UNSUPERVISED ADVERSARIAL TRAINING FOR MOCOV3

Similarly to MOCOv2, the three variants of the adversarially fine-tuned MOCOv3 are obtained by modifying the official MOCOv3 source code. We perform fine-tuning by resuming the training procedure for an additional 10 epochs but with the modified training loop. The unsupervised adversarial examples are added to the contrastive loss as shown in List. 2. All parameters are as used for the original training and as reported by Chen et al. (2021). We only increased the learning rate from 1.5 × 10^-4 to 1.5 × 10^-3 and reduced the batch size from 256 to 192 in order to be able to train on four GeForce RTX 2080 Ti GPUs.

Listing 2: Pseudocode of adversarial fine-tuning for MOCOv3.

C.3 UNSUPERVISED ADVERSARIAL TRAINING FOR RESNET

The two unsupervised adversarially fine-tuned ResNet50 models are obtained using an implementation similar to the one for MOCOv2 and MOCOv3. However, as the ResNet model is trained in a supervised setting, there is no natural contrastive loss to use. We therefore use the MSE loss on the representations. For the targeted case, we also ensure that the clean samples are still mapped to the original representations. The full procedure is outlined in List. 3.

Listing 3: Pseudocode of adversarial fine-tuning for ResNet.

C.4 LINEAR PROBES

As part of the evaluation we train three linear probes for each model:

• Standard linear probe: for computing the top-1 and top-5 accuracy on clean samples, as well as for the impersonation attack evaluation.
• Lowpass linear probe: for computing the top-1 and top-5 accuracy on samples with removed high-frequency components. We use the implementation of Wang et al. (2020) and keep only the Fourier components that are within a radius of 50 from the center of the Fourier-transformed image.
• Gaussian noise linear probe: trained on samples with added Gaussian noise for computing the certified accuracy, as randomized smoothing results in a more robust model when the base model is trained with aggressive Gaussian noise (Lecuyer et al., 2019). We therefore add Gaussian noise with σ = 0.25 to all inputs.

All linear probes are trained on the train set of the ImageNet Large Scale Visual Recognition Challenge 2012 (Russakovsky et al., 2015) and are evaluated on its test set. For training we use a modification of the MOCO linear probe evaluation code for 25 epochs. The starting learning rate is 30.0, with 10-fold reductions applied at epochs 15 and 20. For fairness of comparison, we use the same implementation to evaluate all models. Therefore, there might be differences between the accuracy values we report and the ones reported in the original publications of the respective models.
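The lowpass filtering used for the lowpass linear probe (keeping only Fourier components within a radius of 50 of the center of the shifted spectrum) can be sketched as follows. This is a minimal NumPy sketch, not the Wang et al. (2020) implementation; the channel-first layout and function name are our assumptions:

```python
import numpy as np

def lowpass(img, radius=50):
    """Remove high-frequency components from a channel-first image:
    keep only the Fourier coefficients within `radius` of the center
    of the fftshift-ed spectrum, applied per channel."""
    out = np.empty_like(img, dtype=float)
    h, w = img.shape[-2:]
    yy, xx = np.ogrid[:h, :w]
    # circular mask around the center of the shifted spectrum
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= radius ** 2
    for c in range(img.shape[0]):
        spec = np.fft.fftshift(np.fft.fft2(img[c]))
        out[c] = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
    return out
```

A radius large enough to cover the whole spectrum reproduces the input image, while smaller radii progressively blur it.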

C.5 COMPUTING THE DISTRIBUTION OF INTER-REPRESENTATIONAL DIVERGENCES

The distribution of ℓ2- and ℓ∞-induced divergences between the representations of clean samples of PASS (Asano et al., 2021) is needed for computing the universal and relative quantiles. Due to computational restrictions, we compute the representations of 10,000 samples and the divergences between all pairs of them in order to construct the empirical estimate of the distribution of inter-representational divergences. We observe that 10,000 samples are more than sufficient for the empirical estimate of the distribution to converge.
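The empirical estimate above can be sketched in PyTorch; this is a minimal sketch with a function name of our choosing, shown for the ℓ2-induced divergence:

```python
import torch

def divergence_distribution(reps):
    """Empirical distribution of l2-induced divergences between clean
    representations: all N*(N-1)/2 pairwise distances, which back the
    universal and relative quantile computations."""
    dists = torch.cdist(reps, reps, p=2)  # N x N distance matrix
    # keep only the upper triangle (each unordered pair once)
    iu = torch.triu_indices(reps.size(0), reps.size(0), offset=1)
    return dists[iu[0], iu[1]]
```

With 10,000 representations this yields roughly 50 million divergences, which in practice would be computed in chunks to bound memory.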

C.6 ADVERSARIAL ATTACKS

In Tabs. 2, 3 and 5 to 7 we report U-PGD attacks performed with d(x, x′) = ∥x − x′∥₂ and α = 0.001. We report median values over the same 1000 samples from PASS (Asano et al., 2021). The median universal quantile is reported for untargeted attacks and the median relative quantile is reported for targeted attacks, as explained in Sec. 5. In Tabs. 2 and 3 we report only the resulting ℓ2 quantiles, while in the extended results (Tabs. 5 to 7) we also show the ℓ∞ quantiles and cosine similarities.

C.7 BREAKAWAY RISK AND NEAREST NEIGHBOR ACCURACY

The breakaway risk and nearest neighbor accuracy are also computed by attacking the same 1000 samples from PASS (D′) and computing their divergences with all other samples from PASS (D). Our empirical estimate is then:

p̂_breakaway = 1/(|D′|(|D| − 1)) Σ_{x_i ∈ D′} Σ_{x_j ∈ D∖{x_i}} 1[d(f(x̃_i), f(x_j)) < d(f(x̃_i), f(x_i))],

where x̃_i is the untargeted U-PGD attack with d(x, x′) = ∥x − x′∥₂, ϵ = 0.05 and α = 0.001 for 25 iterations.

A key observation is that the perturbations necessary to fool the standard models visually appear as noise (Figs. 12 and 16 to 20). However, the perturbations applied to the three adversarially trained MOCOv2 models (Figs. 21 to 23), as well as to the supervised adversarially trained ResNet50 (Fig. 13), are more "semantic" in nature, and in some cases even resemble features of the target class. Still, this is not the case when comparing the impersonation attacks on MOCOv3 (Fig. 25) with the attacks on its adversarially trained versions (Figs. 26 to 28).
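The breakaway risk estimate described above can be sketched directly from the representations; a minimal sketch, with the quadratic double loop kept for clarity (a real implementation would vectorize it) and function and argument names of our choosing:

```python
import torch

def breakaway_risk(adv_reps, clean_reps_attacked, clean_reps_all):
    """Empirical breakaway risk: the fraction of (attacked sample, other
    clean sample) pairs for which the adversarial representation is closer
    to the other sample than to its own clean original."""
    events, total = 0, 0
    for r_adv, r_own in zip(adv_reps, clean_reps_attacked):
        d_own = torch.norm(r_adv - r_own)  # divergence to the clean original
        for r_other in clean_reps_all:
            if torch.equal(r_other, r_own):
                continue  # skip the attacked sample itself
            total += 1
            events += int(torch.norm(r_adv - r_other) < d_own)
    return events / total
```

An adversarial representation that "breaks away" toward another sample's representation counts as a risk event; a perfectly robust encoder would keep the attacked representation closest to its own original.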



Concurrent work by Wang & Liu (2022) proposed RVCL: a method to evaluate robustness without labels. However, it focuses on contrastive learning models, while the methods presented here work with arbitrary encoders.



Figure 1: Impersonation attack threat model. The attacker has access only to the encoder on which the classifier is built. By attacking the input to have a similar representation to a sample from the target class, the attacker can fool the classifier without requiring any access to it. Cats who successfully impersonate dogs under the MOCOv2 representation encoder and a linear probe classifier are shown.

Figure 2: Breakaway and overlap risks. Divergences that increase the corresponding risks are in red and those reducing them are in green.

Figure 3: Certified accuracy of randomized smoothed MOCOv2 and its adversarially trained variants on ImageNet. The adversarially trained models are uniformly more robust.

Figure 4: Certified robustness of MOCOv2 on PASS using center smoothing. The certified representation radius is represented as percentile of the distribution of clean representation distances.

Figure 5: The hierarchy of supervised and unsupervised attacks.

Figure 6: Certified accuracy of the standard models on ImageNet.

Figure 7: Certified robustness of the standard models on PASS using center smoothing. The distribution of certified radii in R is reported as percentile of the clean representation distances. Smaller values indicate higher unsupervised robustness.

# f_base, f_predictor, f_momentum: base, predictor and momentum networks
f_momentum.params = f_base.params  # initialize momentum encoder
for x in loader:  # load a minibatch x with N samples
    x_0 = aug(x)  # a randomly augmented version
    x_1 = aug(x)  # another randomly augmented version

    # perform the attack
    switch attack_type:
        case targeted:
            target_representation = roll(f_predictor(f_base(x_1)), shifts=1)
            x_adv = targeted_upgd(x_0, target_representation)
        case untargeted:
            x_adv = untargeted_upgd(x_0)
        case loss:
            x_adv = batch_loss_upgd(f_base, f_predictor, x_0, x_1)

    # update the momentum encoder
    f_momentum.params = f_momentum.params * m + f_base.params * (1 - m)

    # get the base representations
    q_0 = f_predictor.forward(f_base.forward(x_0))      # x_0 reps: NxC
    q_1 = f_predictor.forward(f_base.forward(x_1))      # x_1 reps: NxC
    q_adv = f_predictor.forward(f_base.forward(x_adv))  # attacked reps: NxC

    # get the momentum representations
    k_0 = f_momentum.forward(x_0)      # x_0 reps: NxC
    k_1 = f_momentum.forward(x_1)      # x_1 reps: NxC
    k_adv = f_momentum.forward(x_adv)  # attacked reps: NxC

    # compute the loss
    loss = contrastive_loss(q_0, k_1) + contrastive_loss(q_1, k_0) + \
           contrastive_loss(q_adv, k_1) + contrastive_loss(q_1, k_adv)

    # parameter update: base network
    loss.backward()
    update(f_base.params)

# f_base: base encoder network
# f_finetuned: the fine-tuned network
f_finetuned.params = f_base.params  # initialize the fine-tuned network
for x in loader:  # load a minibatch x with N samples
    x = aug(x)  # a randomly augmented batch
    clean_reps = f_base(x)

    # perform the attack and compute the loss
    switch attack_type:
        case targeted:
            target_representation = roll(clean_reps, shifts=1)
            x_adv = targeted_upgd(x, target_representation)
            # clean samples keep their original representations,
            # adversarial samples are pulled back to them as well
            loss = MSELoss(cat([clean_reps, clean_reps], axis=0),
                           f_finetuned(cat([x, x_adv], axis=0)))
        case untargeted:
            x_adv = untargeted_upgd(x)
            loss = MSELoss(clean_reps, f_finetuned(x_adv))

Figure12: Impersonation attack samples for ResNet50(He et al., 2016).

Figure 13: Impersonation attack samples for supervised adversarially trained ResNet50 (Salman et al., 2020).

Figure 17: Impersonation attack samples for PixPro (Xie et al., 2021).

Figure18: Impersonation attack samples for SimCLRv2(Chen et al., 2020c).

Figure 19: Impersonation attack samples for SimSiam (Chen & He, 2021).

Figure 20: Impersonation attack samples for MOCOv2(He et al., 2020; Chen et al., 2020d).

Figure 21: Impersonation attack samples for MOCOv2 TAR.

Figure 22: Impersonation attack samples for MOCOv2 UNTAR.

Figure 23: Impersonation attack samples for MOCOv2 LOSS.

Figure24: Impersonation attack samples for MAE(He et al., 2022).

Figure 25: Impersonation attack samples for MOCOv3 (Chen et al., 2021).

Figure 26: Impersonation attack samples for MOCOv3 TAR.

Figure 27: Impersonation attack samples for MOCOv3 UNTAR.

Figure 28: Impersonation attack samples for MOCOv3 LOSS.

Standard and lowpass ImageNet accuracy of linear probes of ResNet50-based encoders.

Robustness of ResNet50 and ResNet50-based unsupervised encoders on PASS (except the average certified radius measured on ImageNet). Arrows show if larger or smaller values are better.

Robustness of MOCOv2 and its adversarially fine-tuned versions measured on PASS and ImageNet.

Impersonation attack success rate on Assira of MOCOv2 and its label-free adversarially fine-tuned versions for different attack iterations.

Extended results for ResNet50-based self-supervised models and unsupervised adversarially fine-tuned MOCOv2. Arrows show if larger or smaller values are better. UQ and RQ respectively designate values reported in universal and relative quantiles.

Extended results for ResNet50, supervised adversarially trained ResNet50, and unsupervised adversarially fine-tuned ResNet50. Arrows show if larger or smaller values are better. UQ and RQ respectively designate values reported in universal and relative quantiles.

Extended results for transformer-based models and adversarially fine-tuned MOCOv3. Arrows show if larger or smaller values are better. UQ and RQ respectively designate values reported in universal and relative quantiles.


As mentioned in Sec. 7, the supervised adversarially trained ResNet50 has by far the best robustness scores across all measures. It also significantly surpasses our unsupervised adversarially fine-tuned models. However, this is not enough to conclude that supervised adversarial training works better than unsupervised adversarial training: the supervised adversarially trained ResNet50 has been trained with attacks for 150 epochs, while our unsupervised adversarially trained models were fine-tuned for only 10 epochs. Therefore, the gap in robustness could be due to the quantity of adversarial training rather than the quality of the technique.

Both the targeted and untargeted unsupervised adversarially fine-tuned ResNet50 models score better on all robustness measures than the standard ResNet50. The improvements are markedly larger for the untargeted fine-tuned model, though this comes at the price of lower accuracy and lower certified accuracy (Fig. 8). Furthermore, both models have very similar certified robustness, as shown in Fig. 9. Both the supervised adversarially trained model and the model unsupervised adversarially fine-tuned with untargeted attacks have lower certified accuracy than the standard model at lower radii. However, they compensate with higher certified accuracy at larger radii, resulting in an overall improvement in the average certified radius (Tab. 6).

B.2 TRANSFORMER BACKBONE MODELS: MAE AND MOCOV3

Among the standard models, MAE outperforms the rest on most measures. MAE has the highest accuracy of all models trained in the unsupervised regime, i.e. excluding the supervised ResNet50.

C.8 OVERLAP RISK AND MEDIAN ADVERSARIAL MARGIN

The overlap risk and median adversarial margin are computed over 1000 pairs of samples from PASS (D′). Each element of the pair is attacked to have a representation similar to that of the other element. Here, x̃_i^{→j} denotes the targeted U-PGD attack on x_i towards x_j with d(x, x′) = ∥x − x′∥₂, ϵ = 0.05 and α = 0.001 for 10 iterations.

C.9 CERTIFIED ACCURACY

We use the randomized smoothing implementation by Cohen et al. (2019). We evaluate the Gaussian noise linear probe (see App. C.4) over 200 samples from the ImageNet test set (Russakovsky et al., 2015). We use σ = 0.25, N₀ = 100, N = 100,000 and an error probability α = 0.001, as originally used by Cohen et al. (2019). Figs. 3 and 10 show the resulting certified accuracy for MOCOv2, MOCOv3, and their unsupervised adversarially trained versions. Fig. 6 shows the certified accuracy of the other models. These plots show the fraction of samples which are correctly classified and which certifiably have the same classification within a given ℓ2 radius of the input space.

Tabs. 5 to 7 also show the Average Certified Radius for all models. The Average Certified Radius was proposed by Zhai et al. (2020) as a way to summarize the certified accuracy vs. radius plots with a single number: it is the mean of the certified radii over the evaluation samples, with misclassified samples contributing a radius of 0.
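The Average Certified Radius summary can be sketched as follows; a minimal sketch after Zhai et al. (2020), taking the per-sample certified radii and correctness flags as given (e.g. from the Cohen et al. (2019) certification), with a function name of our choosing:

```python
import numpy as np

def average_certified_radius(radii, correct):
    """Average certified radius: mean certified radius over the
    evaluation samples, with misclassified or abstained samples
    counted as radius 0."""
    radii = np.asarray(radii, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return float(np.mean(np.where(correct, radii, 0.0)))
```

This collapses a full certified-accuracy-vs-radius curve into a single scalar, which is why it is reported alongside the curves rather than instead of them.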

C.10 CERTIFIED ROBUSTNESS

The certified robustness evaluation in Figs. 4 and 10 was done with the center smoothing implementation by Kumar & Goldstein (2021) . We evaluate the models over the same 200 samples from the ImageNet test set (Russakovsky et al., 2015) . We use σ = 0.25, N 0 = 10, 000, N = 100, 000 and error probabilities α 1 = 0.005, α 2 = 0.005, as originally proposed by Kumar & Goldstein (2021) .

C.11 IMPERSONATION ATTACKS

The impersonation attack evaluation is performed using targeted U-PGD attacks. We use the Assira dataset, which contains 25,000 images equally split between cats and dogs (Elson et al., 2007). When evaluating a model, we consider only the subset of images that the standard linear probe (see App. C.4) for the given model classifies correctly. Then, we construct pairs of an image of a cat and an image of a dog and perform two attacks: attacking the cat to have a representation as close as possible to that of the dog, and vice-versa. The attacked images are then classified with the clean linear probe. The success rates of cats impersonating dogs and of dogs impersonating cats are computed separately and then averaged to account for possible class-based differences. Note that the linear probe is not used for constructing the attack, i.e. we indeed fool it without accessing it. App. D shows examples of the impersonation attacks for all models.
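The direction-averaged success rate described above can be sketched as follows; a minimal sketch in which the function name and the label values are illustrative assumptions:

```python
def impersonation_success_rate(preds_cats_attacked, preds_dogs_attacked,
                               cat_label=0, dog_label=1):
    """Impersonation success rate: cats attacked to be classified as dogs
    and dogs attacked to be classified as cats, with the two per-direction
    rates averaged to account for class-based differences."""
    cats_rate = sum(p == dog_label for p in preds_cats_attacked) / len(preds_cats_attacked)
    dogs_rate = sum(p == cat_label for p in preds_dogs_attacked) / len(preds_dogs_attacked)
    return (cats_rate + dogs_rate) / 2
```

Averaging the two directions prevents one easily-fooled class from dominating the reported rate.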

D IMPERSONATION ATTACKS

This appendix showcases samples of the impersonation attacks on the models discussed in the paper. The first and third row in each sample are the original images of cats and dogs respectively. The second row is the result when each cat image is attacked to have a representation close to the representation of the corresponding dog image. The fourth row is the opposite: the dog image attacked to have a representation close to the representation of the cat image. The attack used was targeted U-PGD with d(r, r ′ ) = ∥r -r ′ ∥ 2 for 50 iterations with ϵ = 0.10 and α = 0.01. The samples shown differ from model to model as we restrict the evaluation to the samples that are correctly predicted by the given model, see App. C.11 for details. 

