ROBUSTNESS OF UNSUPERVISED REPRESENTATION LEARNING WITHOUT LABELS

Abstract

Unsupervised representation learning leverages large unlabeled datasets and is competitive with supervised learning. However, a non-robust encoder may compromise the robustness of downstream tasks. Recently, robust representation encoders have become of interest. Still, all prior work evaluates robustness using a downstream classification task. Instead, we propose a family of unsupervised robustness measures, which are model- and task-agnostic and label-free. We benchmark state-of-the-art representation encoders and show that none dominates the rest. We offer unsupervised extensions to the FGSM and PGD attacks. When used in adversarial training, they improve most unsupervised robustness measures, including certified robustness. We validate our results against a linear probe and show that, for MOCOv2, adversarial training results in 3 times higher certified accuracy, a 2-fold decrease in impersonation attack success rate and considerable improvements in certified robustness against adversarial perturbations.

1. INTRODUCTION

Unsupervised and self-supervised models extract useful representations without requiring labels. They can learn patterns in the data and, by leveraging large unlabeled datasets, are competitive with supervised models for image classification (He et al., 2020; Chen et al., 2020b;c;d; Zbontar et al., 2021; Chen & He, 2021). Representation encoders do not use task-specific labels and can be employed for various downstream tasks. Such reuse is attractive, as the large datasets involved can make encoders expensive to train. Therefore, applications are often built on top of public-domain representation encoders. However, a lack of robustness in the encoder can propagate to the downstream task.

Consider the impersonation attack threat model in Fig. 1. An attacker tries to fool a classifier that uses a representation encoder. The attacker has white-box access to the representation encoder (e.g. an open-source model) but no access to the classification model that consumes the representations. By optimizing the input to be similar to a benign input while having the representation of a different target input, the attacker can fool the classifier. Even if the classifier is private, one can attack the combined system if the public encoder conflates two different concepts onto similar representations. Hence, robustness against such conflation is necessary for performing downstream inference on robust features.

We currently lack ways to evaluate the robustness of representation encoders without specializing to a particular task. While prior work has proposed improving the robustness of self-supervised representation learning (Alayrac et al., 2019; Kim et al., 2020; Jiang et al., 2020; Ho & Vasconcelos, 2020; Chen et al., 2020a; Cemgil et al., 2020; Carmon et al., 2020; Gowal et al., 2020; Fan et al., 2021; Nguyen et al., 2022; Kim et al., 2022), all of these approaches require labeled datasets to evaluate the robustness of the resulting models.
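The impersonation attack above can be sketched with a toy example: projected gradient descent pulls the adversarial input's representation toward the target's, while a projection step keeps the input close to the benign sample. The linear encoder, dimensions, and step sizes below are hypothetical illustrations, not the actual attack setup or encoder used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a public representation encoder: a fixed linear map.
W = rng.normal(size=(8, 16))
encode = lambda x: W @ x

def impersonate(x_benign, x_target, eps=0.5, steps=200, lr=0.01):
    """Craft x_adv within an L-inf ball of radius eps around x_benign
    whose representation approaches that of x_target."""
    z_target = encode(x_target)
    x_adv = x_benign.copy()
    for _ in range(steps):
        # Gradient of 0.5 * ||W x - z_target||^2 with respect to x.
        grad = W.T @ (encode(x_adv) - z_target)
        x_adv = x_adv - lr * grad
        # Project back into the eps-ball around the benign input.
        x_adv = np.clip(x_adv, x_benign - eps, x_benign + eps)
    return x_adv

x_benign = rng.normal(size=16)   # e.g. the cat image
x_target = rng.normal(size=16)   # e.g. the dog image
x_adv = impersonate(x_benign, x_target)

# x_adv stays close to x_benign in input space, but its representation
# moves toward the target's, so a downstream classifier can be fooled.
print(np.linalg.norm(encode(x_adv) - encode(x_target)) <
      np.linalg.norm(encode(x_benign) - encode(x_target)))
```

Note that the attack never queries the private classifier: it only needs white-box gradients of the public encoder, which is exactly the threat model of Fig. 1.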
Instead, we offer encoder robustness evaluation without labels. This is task-agnostic, in contrast to supervised assessment, as labels are (implicitly) associated with a specific task. Labels can also be incomplete, misleading or stereotyping (Stock & Cisse, 2018; Steed & Caliskan, 2021; Birhane & Prabhu, 2021), and can inadvertently impose biases on the robustness assessment. In this work, we propose measures that do not require labels, as well as methods for unsupervised adversarial training that result in more robust models. To the best of our knowledge, this is the first work on unsupervised robustness evaluation, and we make the following contributions to address this problem:

1. A family of unsupervised robustness measures that are model- and task-agnostic and label-free.
2. Unsupervised extensions of the FGSM and PGD adversarial attacks.
3. Evidence that even the most basic unsupervised adversarial attacks in the framework result in more robust models relative to both supervised and unsupervised measures.
4. Probabilistic guarantees on the unsupervised robustness measures based on center smoothing.
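One natural label-free quantity in this spirit is the largest representation displacement an adversary can achieve within a small input ball: the smaller the displacement, the more robust the encoder at that point. The sketch below illustrates the idea with a hypothetical tanh encoder and a signed-gradient (FGSM-style) ascent; it is a simplified illustration, not the paper's exact measures.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
encode = lambda x: np.tanh(W @ x)   # toy nonlinear encoder

def representation_shift(x, eps=0.1, steps=50, lr=0.02):
    """Label-free robustness measure: largest representation displacement
    found by signed-gradient ascent within an L-inf ball of radius eps."""
    z0 = encode(x)
    # Random start: the gradient of the displacement is zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        z = encode(x + delta)
        # Gradient of 0.5 * ||z - z0||^2 through tanh(W(x + delta)).
        grad = W.T @ ((1.0 - z**2) * (z - z0))
        delta = np.clip(delta + lr * np.sign(grad), -eps, eps)
    return np.linalg.norm(encode(x + delta) - z0)

x = rng.normal(size=16)
shift = representation_shift(x)
print(shift)
```

No label appears anywhere in the computation, so the same measure can be applied to any encoder and compared across models and downstream tasks.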

2. RELATED WORK

Adversarial robustness of supervised learning. Deep neural networks can have high accuracy on clean samples while performing poorly under imperceptible perturbations (adversarial examples) (Szegedy et al., 2014; Biggio et al., 2013). Adversarial examples can be viewed as spurious correlations between labels and style (Zhang et al., 2022; Singla & Feizi, 2022) or as shortcut solutions (Robinson et al., 2021). Adversarial training, i.e. incorporating adversarial examples into the training process, is a simple and widely used defense against adversarial attacks (Goodfellow et al., 2015; Madry et al., 2018; Shafahi et al., 2019; Bai et al., 2021).

Unsupervised representation learning. Representation learning aims to extract useful features from data. Unsupervised approaches are frequently used to leverage large unlabeled datasets. Siamese networks map similar samples to similar representations (Bromley et al., 1993; Koch et al., 2015), but may collapse to a constant representation. However, Chen & He (2021) showed that a simple stop-grad can prevent such collapse. Contrastive learning addresses representational collapse by introducing negative samples (Hadsell et al., 2006; Le-Khac et al., 2020) and can benefit from pretext tasks (Xie et al., 2021; Bachman et al., 2019; Tian et al., 2020; Oord et al., 2018; Ozair et al., 2020; McAllester & Stratos, 2020). Methods that do not need negative samples include VAEs (Kingma & Welling, 2014), generative models (Kingma et al., 2014; Goodfellow et al., 2014; Donahue & Simonyan, 2020), and bootstrapping methods such as BYOL (Grill et al., 2020).
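The adversarial training loop described above can be illustrated on a toy logistic-regression model: each epoch, an FGSM step perturbs the inputs in the direction of the loss gradient, and the model is updated on the perturbed batch. The Gaussian data, epsilon, and learning rate are illustrative assumptions, not a setup from this paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: two Gaussian blobs, with a logistic-regression "network".
X = np.vstack([rng.normal(-1, 1, (100, 5)), rng.normal(1, 1, (100, 5))])
y = np.array([0] * 100 + [1] * 100)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def fgsm(w, X, y, eps):
    # Perturb each input in the direction that increases its logistic loss.
    g = (sigmoid(X @ w) - y)[:, None] * w   # per-sample d loss / d x
    return X + eps * np.sign(g)

def train(eps=0.0, epochs=100, lr=0.1):
    w = np.zeros(5)
    for _ in range(epochs):
        # Adversarial training: fit on FGSM-perturbed inputs when eps > 0.
        Xa = fgsm(w, X, y, eps) if eps > 0 else X
        grad = Xa.T @ (sigmoid(Xa @ w) - y) / len(y)   # d loss / d w
        w -= lr * grad
    return w

def robust_acc(w, eps):
    # Accuracy under an FGSM attack of budget eps.
    Xa = fgsm(w, X, y, eps)
    return np.mean((sigmoid(Xa @ w) > 0.5) == y)

w_std = train(eps=0.0)
w_adv = train(eps=0.3)
print(robust_acc(w_std, 0.3), robust_acc(w_adv, 0.3))
```

The same template carries over to the unsupervised setting discussed later, where the attack maximizes a representation-space loss instead of a labeled classification loss.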



Figure 1: Impersonation attack threat model. The attacker has access only to the encoder on which the classifier is built. By attacking the input to have a representation similar to a sample from the target class, the attacker can fool the classifier without requiring any access to it. Shown are cats that successfully impersonate dogs under the MOCOv2 representation encoder with a linear-probe classifier.

Robustness of unsupervised representation learning. Most robustness work has focused on supervised tasks, but there has been recent interest in unsupervised training of robust representation encoders. Kim et al. (2020), Jiang et al. (2020), Gowal et al. (2020) and Kim et al. (2022) propose generating instance-wise attacks by maximizing a contrastive loss and using them for adversarial training. Fan et al. (2021) complement this with a high-frequency view. Ho & Vasconcelos (2020) suggest attacking batches instead of individual samples. KL-divergence can also be used as a loss (Alayrac et al., 2019) or as a regularizer (Nguyen et al., 2022). Alternatively, a classifier can be trained on a small labeled dataset with adversarial training applied to it (Carmon et al., 2020; Alayrac et al., 2019). For VAEs, Cemgil et al. (2020) generate attacks by maximizing the Wasserstein distance to the clean representations in representation space. Peychev et al. (2022) address robustness from the perspective of individual fairness: they certify that samples close along a feature direction are close in representation space. However, their approach is limited to invertible encoders. While these methods obtain robust unsupervised representation encoders, they all evaluate robustness on a single downstream classification task.

