SYNBENCH: TASK-AGNOSTIC BENCHMARKING OF PRETRAINED REPRESENTATIONS USING SYNTHETIC DATA

Abstract

Recent success in fine-tuning large models, that are pretrained on broad data at scale, on downstream tasks has led to a significant paradigm shift in deep learning, from task-centric model design to task-agnostic representation learning and task-specific fine-tuning. As the representations of pretrained models are used as a foundation for different downstream tasks, this paper proposes a new task-agnostic framework, SynBench, to measure the quality of pretrained representations using synthetic data. To address the challenge of task-agnostic data-free evaluation, we design synthetic binary classification proxy tasks with class conditional Gaussian mixtures to probe and compare model's robustness-accuracy performance on input synthetic data and their representations. Since the synthetic tasks spare access to real-life data, SynBench offers a holistic evaluation and informs the model designers of the intrinsic robustness level of the model given a user-specified threshold accuracy. Moreover, the use of class conditional Gaussian mixture allows us to derive a theoretically optimal robustness-accuracy tradeoff, which serves as a reference when evaluating the tradeoff on representations. By comparing the ratio of area-under-curve between the raw data and their representations, SynBench offers a quantifiable score for robustness-accuracy performance benchmarking. Our framework applies to a wide range of pretrained models taking continuous data inputs and is independent of the downstream tasks and datasets. Evaluated with several pretrained vision transformer models, the experimental results show that our SynBench score well matches the actual linear probing performance of the pre-trained model when fine-tuned on downstream tasks. Moreover, our framework can be used to inform the design of robust linear probing on pretrained representations to mitigate the robustness-accuracy tradeoff in downstream tasks.

1. INTRODUCTION

In recent years, the use of large pretrained neural networks for efficient fine-tuning on downstream tasks has prevailed in many application domains such as vision, language, and speech. Instead of designing task-dependent neural network architectures for different downstream tasks, the current methodology focuses on the principle of task-agnostic pretraining and task-specific finetuning, which uses a neural network pretrained on a large-scale dataset (often in a self-supervised or unsupervised manner) to extract generic representations of the input data, which we call pretrained representations for simplicity. The pretrained representations are then used as a foundation (Bommasani et al., 2021) to solve downstream tasks by training a linear head (i.e., linear probing) on the data representations with the labels provided by a downstream dataset, or by simply employing zero-shot inference. Moreover, to handle multi-modal data, one can use a similar neural network architecture (e.g., transformer) for multi-modal data representation learning and alignment. Successful examples following this new machine learning paradigm include the GPT-3 language model (Brown et al., 2020) , the vision transformer (Arnab et al., 2021) , and the CLIP image-text model (Radford et al., 2021) , to name a few. As large pretrained models are shown to achieve state-of-the-art performance on a variety of downstream tasks with minimal fine-tuning, there is an intensified demand for using pretrained representations from a large model for efficient finetuning. When gauging the usefulness of a pretrained model, it is a convention to compare the accuracy on selected real-life tasks. 
However, this approach has two possible drawbacks: (1) if the underlying pretrained model has hidden risks, such as lacking robustness to adversarial examples, the standard accuracy cannot reveal the risk, as it does not correlate well (even worse, it sometimes correlates negatively) with adversarial robustness (Su et al., 2018). Therefore, under the trending practice of pretraining and fine-tuning, such a risk immediately carries over to all downstream tasks. (2) the implications suggested by any "better" results on specific datasets are subject to the datasets used for evaluation and could be inconclusive when the evaluation datasets change (e.g. ViT-L/16 reportedly performs worse than ViT-B/16 on 4 out of 27 linear probing tasks according to Radford et al. (2021), and is inferior to ViT-B/16 on finetuned medical tasks (Okolo et al., 2022; Tummala et al., 2022)). Consequently, an ideal pretrained model should entail both good accuracy and adversarial robustness, and the level of goodness should be measurable in a task-agnostic manner. To address this emerging challenge, we propose a novel framework named SynBench to evaluate the quality of pretrained representations, in terms of quantifying the tradeoff between standard accuracy and adversarial robustness to input perturbations. Specifically, SynBench uses synthetic data generated from a conditional Gaussian distribution to establish a reference characterizing the robustness-accuracy tradeoff based on the Bayes optimal linear classifiers. Then, SynBench obtains the representations of the same synthetic data from the pretrained model and compares them to the reference for performance benchmarking. Finally, we define the ratio of area-under-curves in robustness-accuracy characterization as a quantifiable metric of the quality of pretrained representations. The entire procedure of SynBench is illustrated in Figure 1. Our SynBench framework features the following key advantages. 1.
Soundness: We formalize the fundamental tradeoff in robustness and accuracy of the considered conditional Gaussian model and use this characterization as a reference to benchmark the quality of pretrained representations. 2. Task-independence: Since the pretraining of large models is independent of the downstream datasets and tasks (e.g., through self-supervised or unsupervised training on broad data at scale), the use of synthetic data in SynBench provides a task-agnostic approach to evaluating pretrained representations without the knowledge of downstream tasks and datasets. 3. Completeness and privacy: The flexibility of generating synthetic data (e.g., by adopting a different data sampling procedure) offers a good proxy towards a more comprehensive evaluation of pretrained representations when fine-tuned on different downstream datasets, especially in the scenario when the available datasets are not representative of the entire downstream datasets. Moreover, the use of synthetic data enables full control and simulation over data size and distribution, protects data privacy, and can facilitate model auditing and governance. We highlight our main contributions as follows. • We propose SynBench, a novel task-agnostic framework that uses synthetic data to evaluate the quality of pretrained representations. The evaluation process of SynBench is independent of the downstream datasets and tasks and it applies to any model taking continuous data inputs. • Evaluated with several pretrained vision transformers, our experimental results show that the metric provided by SynBench well matches the model performance in terms of adversarial robustness and standard accuracy when finetuned on several downstream datasets. For example, SynBench-Score suggests that the Imagenet21k pretrained network (ViT-B/16-in21k) improves with finetuning on Imagenet1k (ViT-B/16), echoing with the higher CIFAR10 and CIFAR10-c linear probing accuracy of ViT-B/16. 
• We show that SynBench can be used to inform the design and selection of the hyperparameters in robust linear probing to mitigate the robustness-accuracy tradeoff when fine-tuned on downstream datasets. For example, conducting ϵ-robust linear probing with ϵ selected by SynBench-Score gives ViT-B/16 0.6% increase in CIFAR10 accuracy and 1.3% increase in CIFAR10-c accuracy.

2. RELATED WORK

Pretrained models in vision. In the past few years, much focus in the machine learning community has shifted to training representation networks capable of extracting features for a variety of downstream tasks with minimal fine-tuning. Nowadays, many common vision tasks are achieved with the assistance of good backbones, e.g. classification (Yu et al., 2022; Wortsman et al., 2022; Foret et al., 2020; Xie et al., 2020; Dosovitskiy et al., 2020; Chen et al., 2020a), object detection (Redmon & Farhadi, 2017; Liu et al., 2016), segmentation (Chen et al., 2017; Xie et al., 2021), etc. Among the popular backbones, vision transformers (ViT) (Dosovitskiy et al., 2020) have attracted enormous interest. ViTs stem from Transformers (Vaswani et al., 2017) and split an image into patches, which are then treated as tokens as in the original Transformers. We will exemplify the use of SynBench using several pretrained ViTs.

Benchmarking pretrained models. Since pretrained models are used as a foundation for different downstream tasks, benchmarking them is central to transfer learning (Neyshabur et al., 2020; Pruksachatkun et al., 2020), and is also tightly related to model generalization (Qiao et al., 2020; Carlucci et al., 2019). To benchmark the performance of a pretrained model, it is a convention to apply the pretrained model to a number of popular tasks and conduct linear probing on the representations (Chen et al., 2020b; Dosovitskiy et al., 2020; Chen et al., 2020a; 2021). Besides linear probing, evaluation frameworks have been proposed based on mutual information (Bachman et al., 2019) and minimum description length (MDL) (Blier & Ollivier, 2018; Voita & Titov, 2020), which rely on the label information of the downstream tasks and are hence task-specific.
Moreover, recent work (Whitney et al., 2020) also discussed the sensitivity of validation accuracy (nonlinear probes) and MDL to evaluation dataset size, and proposed a variant of MDL and a sample-complexity-based quantifier that depends on the data distribution. Only recently have more fundamental questions been raised about pretrained models (Bommasani et al., 2021; Tran et al., 2022; Zhang & Ré, 2022). Lately, Bommasani et al. (2021) raised practical concerns about the homogenization incentivized by the scale of the pretraining. Although the homogenization might help in achieving competitive performance for some downstream tasks, the defects are also inherited by all of these downstream tasks. On that account, a more careful study of the fundamentals of pretrained models is of paramount importance.

Optimal representations. In the seminal work on deep representation theory, Achille & Soatto (2018) depicted the desired optimal representations in supervised learning as sufficient for the downstream task, invariant to the effect of nuisances, maximally disentangled, and having minimal mutual information between representations and inputs. Focusing more on generalization than compression, Dubois et al. (2020) gave the optimal representation based on V-information (Xu et al., 2019) and probed generalization in deep learning. More recently, Ruan et al. (2021) defined the optimal representations for domain generalization. Dubois et al. (2022) characterize the idealized representation properties for invariant self-supervised representation learning. Specifically, idealized representations should be well-distinguished by the desired family of probes for potential invariant tasks, have sufficiently large dimension, and be invariant to input augmentations.
SynBench differs from the above quantifiers as it does not need knowledge of any downstream data and has control over the evaluation set size, since we can draw an arbitrary number of synthetic data points. With the assumed synthetic data distribution, we can theoretically characterize a robustness-accuracy tradeoff that is independent of the downstream tasks. Therefore, SynBench provides a predefined standard of the tradeoff, which serves as the reference for representations induced by pretrained models. It should also be mentioned that the sim-to-real transfer paradigm has recently been leveraged to test the quality of real data, by projecting those data onto the space of a model trained on large-scale synthetic data generated from a set of pre-defined grammar rules (Marzoev et al., 2020). SynBench, though conceptually similar at a very high level, is different from that line of work, as the focus of this work is to quantify the accuracy-robustness tradeoff of pretrained representations using synthetic data from conditional distributions.

3. SYNBENCH: METHODOLOGY AND EVALUATION

Without the knowledge of the downstream tasks and data, we aim to develop a task-agnostic framework to evaluate some fundamental behaviors of the representation network. As robustness is a key desired property, we probe the network to check how well representation networks preserve the robustness in the original data. It is crucial to note that the probing method developed herein specifies the robustness-accuracy tradeoff in the pretrained representations and can be used for understanding (and possibly ranking) different pretrained networks. On the whole, we want to measure the idealized robustness-accuracy tradeoff using synthetic data. By propagating the Gaussian realizations through different representation networks, we can also measure the robustness-accuracy tradeoff for representations. We start this section by giving the synthetic data and the corresponding optimal linear classifier of interest.

3.1. SYNTHETIC DATA AND OPTIMAL LINEAR CLASSIFIER

We consider (imbalanced) binary classification problems with data pair $(x, y)$ generated from the mixture of two Gaussian distributions $P_{\mu_1,\mu_2,\Sigma}$, such that

$$x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma), \qquad x \mid y = -1 \sim \mathcal{N}(\mu_2, \Sigma), \tag{1}$$

or equivalently,

$$x - \frac{\mu_1+\mu_2}{2} \,\Big|\, y = 1 \sim \mathcal{N}(\tilde{\mu}, \Sigma), \qquad x - \frac{\mu_1+\mu_2}{2} \,\Big|\, y = -1 \sim \mathcal{N}(-\tilde{\mu}, \Sigma),$$

where $y \in C = \{+1, -1\}$, $P(y = +1) = \tau$, $P(y = -1) = 1-\tau$, and $\tilde{\mu} = \frac{\mu_1-\mu_2}{2}$. We focus on the class-balanced case ($\tau = \frac{1}{2}$) and defer the imbalanced case to Appendix D. When sampling from this idealized distribution, we eliminate the factor of data bias and can benchmark the robustness degradation in an ideal setting. Let $\|\cdot\|_p$ denote the $\ell_p$ norm of a vector for any $p \geq 1$. For a given classifier $f$ and input $x$ with $f(x) = y$, where $y$ is the predicted label, it is not rational for the classifier to respond differently to $x + \delta$ than to $x$ for a small perturbation level measured by $\|\delta\|_p$, i.e. inconsistent top-1 prediction (Szegedy et al., 2013; Goodfellow et al., 2014). Therefore, the level of (adversarial) robustness for a classifier can be measured by the minimum magnitude of perturbation that causes misclassification, i.e. $\|\Delta\|_p := \min_{\delta: f(x+\delta) \neq f(x)} \|\delta\|_p$. For a generic function $f$, solving the optimization problem exactly is hard (Katz et al., 2017; Sinha et al., 2018). Luckily, one can readily solve the optimization problem if $f$ is affine (Moosavi-Dezfooli et al., 2016). In the following, we will exploit this point and consider the linear classifier that minimizes the robust classification error. An ideal candidate classifier for the class conditional Gaussian (equation 1) is specified by the robust Bayes optimal classifier (Bhagoji et al., 2019; Dobriban et al., 2020). Specifically, it is stated that the optimal robust classifier (with a robust margin $\epsilon$) for data generated from equation 1 is a linear classifier. We derive the following result as a direct application of the fact. To simplify the exposition, we focus on the $\ell_2$ norm in the remainder of this paper. We refer the readers to Appendix C for general $\ell_p$-norm results. We use "bound" to denote the minimal perturbation of a sample. Theorem 1.
For any sample $x$, the optimal robust classifier $f_\epsilon$ for $P_{\mu_1,\mu_2,\Sigma}$ gives

(i) the bound (decision margin) $\|\Delta\|_2 = \frac{|(x - \frac{\mu_1+\mu_2}{2})^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_2}$,

(ii) the scaled bound $\|\bar{\Delta}\|_2 = \frac{|(x - \frac{\mu_1+\mu_2}{2})^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{|\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}$.

For a sample $x \sim P_{\mu_1,\mu_2,\Sigma}$, it further gives

(iii) the standard accuracy $a = \Phi\left(\frac{\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_\Sigma}\right)$,

(iv) the expected scaled bound $\mathbb{E}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y\right] = \frac{1}{\sqrt{2\pi}} \frac{1}{a\,\Phi^{-1}(a)} e^{-\frac{1}{2}(\Phi^{-1}(a))^2} + 1$,

where $z_\Sigma$ is the solution of the convex problem $\arg\min_{\|z\|_2 \leq \epsilon} (\tilde{\mu} - z)^T \Sigma^{-1} (\tilde{\mu} - z)$ and $\Phi$ denotes the CDF of the standard normal distribution.

We note that for samples drawn from $P_{\mu_1,\mu_2,\Sigma}$ with $\Sigma = \sigma^2 I_d$, all $\epsilon$-robust Bayes optimal classifiers overlap with each other. For a general covariance matrix $\Sigma$, the $\epsilon$ of an $\epsilon$-robust Bayes classifier specifies the desired size of margin and demonstrates the robustness-accuracy tradeoff. We give an illustrative 2D class conditional Gaussian example in Figure 2(a), where different $\epsilon$-robust Bayes classifiers give different overall margins at the cost of accuracy. Concretely, as $\epsilon$ increases, the robust Bayes optimal classifier rotates counterclockwise, leading to increased misclassifications, but also overall enlarged margins.
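As a rough numerical illustration of Theorem 1(iv), the sketch below evaluates the expected scaled bound as a function of the standard accuracy $a$, using only the Python standard library. The function name `expected_scaled_bound` is our own choice for illustration and is not taken from any SynBench code; the formula is only meaningful for $0.5 < a < 1$.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, provides cdf / inv_cdf / pdf


def expected_scaled_bound(a: float) -> float:
    """Theorem 1(iv): expected scaled bound of correctly classified samples.

    `a` is the standard accuracy of the (robust) Bayes optimal classifier.
    """
    if not 0.5 < a < 1.0:
        raise ValueError("accuracy must lie strictly between 0.5 and 1")
    q = _STD_NORMAL.inv_cdf(a)  # Phi^{-1}(a), positive because a > 0.5
    return math.exp(-0.5 * q * q) / (math.sqrt(2.0 * math.pi) * a * q) + 1.0


if __name__ == "__main__":
    # Tabulate the reference tradeoff on a few illustrative accuracies.
    for acc in (0.6, 0.7, 0.8, 0.9, 0.99):
        print(f"a = {acc:.2f}  expected scaled bound = {expected_scaled_bound(acc):.4f}")
```

The value decreases towards 1 as the accuracy approaches 1 and grows without bound as the accuracy approaches 0.5, which is the shape of the reference tradeoff curve.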

3.2. OBJECTIVE

For a given representation network parametrized by $\theta$, we are interested in evaluating the expected bounds on synthetic data and their representations, under a thresholding accuracy $a_t$, i.e.

$$\mathbb{E}_{\mu \sim P_\mu,\ \Sigma \sim P_\Sigma,\ x - \bar{\mu} \mid y \sim \mathcal{N}(y\mu, \Sigma)}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y,\ a > a_t\right]$$

for $\bar{\Delta} = \bar{\Delta}_x$ and $\bar{\Delta}_z$ respectively, where $P_\mu$ and $P_\Sigma$ characterize the probability density function of the synthetic data manifold of interest, and $\bar{\mu}$ is a translation vector allowing non-symmetric class conditional Gaussian. Here, without the prior of applications, we assume $\mu = s \cdot 1_d/\sqrt{d}$, where $s$ denotes a random variable that follows a uniform distribution and $1_d/\sqrt{d}$ is the normalized all-ones vector. For simplicity, we let $\Sigma = I_d$. Formally, we define $E_{\theta,\epsilon}(a_t)$ as

$$E_{\theta,\epsilon}(a_t) = \mathbb{E}_{s,x}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y,\ a(s,\epsilon) > a_t\right] = \frac{1}{n}\sum_i \mathbb{E}_x\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y\right] \mathbb{1}_{a(s_i,\epsilon) > a_t},$$

where $\mathbb{1}_{a(s_i,\epsilon) > a_t}$ is the indicator function specifying the $(s_i, \epsilon)$-dependent $a$ that surpasses the threshold accuracy $a_t$. We put the detailed derivation in Appendix A. In the following sections, we will illustrate how to calculate the inner expectation term $\mathbb{E}_x\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y\right]$ for both the raw data and representations.
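The sampling step described above can be sketched with NumPy as follows. This is a minimal illustration under the stated assumptions ($\mu = s \cdot 1_d/\sqrt{d}$, $\Sigma = I_d$, $\tau = 1/2$); the function name and the dimension `d=16` are hypothetical, and flattening or reshaping the samples to match a particular network input is left out.

```python
import numpy as np


def sample_class_conditional_gaussian(s, d, n, rng):
    """Draw n pairs (x, y) with x | y ~ N(y * s * 1_d / sqrt(d), I_d), balanced labels."""
    mu = s * np.ones(d) / np.sqrt(d)            # class mean direction, ||mu||_2 == s
    y = rng.choice(np.array([-1, 1]), size=n)   # tau = 1/2
    x = y[:, None] * mu + rng.standard_normal((n, d))  # identity covariance noise
    return x, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y = sample_class_conditional_gaussian(s=2.0, d=16, n=2048, rng=rng)
    print(x.shape, y.shape)
```

Passing the generator `rng` explicitly keeps the draw reproducible, so the same realizations can be fed to every network being compared.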

3.2.1. RAW DATA

For raw data synthesized from $P_{\mu_1,\mu_2,\Sigma}$ according to equation 1, the inner expectation term is given by Theorem 1(iv),

$$\mathbb{E}\left[\|\bar{\Delta}_x\|_2 \mid f_\epsilon(x) = y\right] = \frac{1}{\sqrt{2\pi}} \frac{1}{a\,\Phi^{-1}(a)} e^{-\frac{1}{2}(\Phi^{-1}(a))^2} + 1,$$

where $a$ denotes the standard accuracy. The subscript $x$ in the expected scaled bound $\mathbb{E}\left[\|\bar{\Delta}_x\|_2 \mid f_\epsilon(x) = y\right]$ indicates the raw data space, to distinguish it from the scaled bound to be derived for representations. We highlight that Theorem 1(iv) directly gives a robustness-accuracy tradeoff. We plot the expected scaled bound as a function of accuracy in Figure 2(b). This tradeoff holds true when the data follow the conditional Gaussian exactly. In the proposed SynBench framework, we treat this theoretically-derived robustness-accuracy tradeoff as the reference, enabling a fair comparison among representations induced by different pretrained models.
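To make the reference concrete, the sketch below averages the inner expectation over a grid of $s$ values with the accuracy indicator, for the raw-data case $\Sigma = I_d$. It relies on a simplification that we state as an assumption: with $\Sigma = I_d$ and $\epsilon < s$, the minimizer $z_\Sigma(\tilde{\mu})$ is parallel to $\tilde{\mu}$, so Theorem 1(iii) reduces to $a = \Phi(\|\tilde{\mu}\|_2) = \Phi(s)$. The grid `s_grid` and the function name are hypothetical; the source does not state how many values of $s$ are used.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def reference_expected_bound(a_t, s_values):
    """(1/n) * sum_i E_x[scaled bound | correct] * 1[a(s_i) > a_t], for Sigma = I_d."""
    total = 0.0
    for s in s_values:
        a = STD_NORMAL.cdf(s)  # assumed accuracy for Sigma = I_d and ||mu||_2 = s
        if a > a_t:
            # Theorem 1(iv) with Phi^{-1}(a) = s; pdf(s) = exp(-s^2 / 2) / sqrt(2 * pi)
            total += STD_NORMAL.pdf(s) / (a * s) + 1.0
    return total / len(s_values)


if __name__ == "__main__":
    s_grid = [0.1 * k for k in range(1, 51)]  # hypothetical grid covering [0.1, 5]
    for a_t in (0.7, 0.8, 0.9):
        print(a_t, reference_expected_bound(a_t, s_grid))
```

Raising the threshold drops more terms from the sum, so the returned value can only shrink as `a_t` grows.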

3.2.2. REPRESENTATIONS

Given a pretrained network, we gather the representations of the Gaussian realizations and quantify the desired bound induced by the robust Bayes optimal classifier in the representation space. When deriving the robust Bayes optimal classifier, we model the representations by a general conditional Gaussian $z \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$, $z \mid y = -1 \sim \mathcal{N}(\mu_2, \Sigma)$. By Theorem 1(ii), we consider the optimal robust classifier for the modeled conditional Gaussian in the representation space to calculate the scaled bound

$$\|\bar{\Delta}_z\|_2 = \frac{|(z - \frac{\mu_1+\mu_2}{2})^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{|\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}$$

for correctly-classified samples, and the inner expectation is estimated empirically. It is worthwhile to note that now the Bayes optimal classifier does not necessarily coincide with the robust Bayes optimal classifier, even when we synthesize the dataset with an identity covariance matrix in the input space.
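A minimal NumPy sketch of this empirical estimate is given below for the special case $\epsilon = 0$, where $z_\Sigma(\tilde{\mu}) = 0$ and no convex solver is needed. Several choices here are our assumptions and not taken from the source: the class means and a pooled covariance are estimated from the samples, a pseudo-inverse is used in case the covariance is singular, and the function name is hypothetical. For $\epsilon > 0$ one would additionally have to solve the constrained problem for $z_\Sigma(\tilde{\mu})$.

```python
import numpy as np


def scaled_bounds_eps0(z_pos, z_neg):
    """Scaled bounds of correctly classified representations, 0-robust Bayes classifier.

    z_pos, z_neg: arrays of shape (n_pos, k) and (n_neg, k) holding the
    representations of the y = +1 and y = -1 samples.
    Returns (scaled bounds of correct samples, empirical accuracy).
    """
    mu1, mu2 = z_pos.mean(axis=0), z_neg.mean(axis=0)
    center = 0.5 * (mu1 + mu2)
    mu_tilde = 0.5 * (mu1 - mu2)
    # Pooled covariance estimate shared by both classes (modeling assumption).
    centered = np.vstack([z_pos - mu1, z_neg - mu2])
    sigma = centered.T @ centered / (len(centered) - 2)
    w = np.linalg.pinv(sigma) @ mu_tilde  # eps = 0, so z_Sigma(mu_tilde) = 0
    # Signed scores y * (z - center)^T w; positive means correctly classified.
    scores = np.concatenate([(z_pos - center) @ w, -((z_neg - center) @ w)])
    correct = scores > 0
    return np.abs(scores[correct]) / abs(mu_tilde @ w), correct.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in "representations"; in practice these come from the pretrained network.
    z_pos = rng.standard_normal((2000, 3)) + 1.0
    z_neg = rng.standard_normal((2000, 3)) - 1.0
    bounds, acc = scaled_bounds_eps0(z_pos, z_neg)
    print(acc, bounds.mean())
```

The mean of the returned bounds is the empirical counterpart of the inner expectation in the representation space.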

3.3. ROBUSTNESS-ACCURACY QUANTIFICATION OF REPRESENTATIONS

Recall that we aim to calculate

$$E_{\theta,\epsilon}(a_t) = \frac{1}{n}\sum_i \mathbb{E}_{x \mid y \sim \mathcal{N}(y s_i \cdot 1_d/\sqrt{d},\, I_d)}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y\right] \mathbb{1}_{a(s_i,\epsilon) > a_t}$$

for both raw data and the representations (i.e. $\|\bar{\Delta}_x\|$ and $\|\bar{\Delta}_z\|$). We treat the expected bounds of the raw data under a threshold accuracy as the reference. Given a representation network, we compare the expected bounds of the representations rendered by the network with the reference. We take $s \sim \mathcal{U}\{0.1, 5\}$ under the guidance of Theorem 1(iii). Specifically, as Theorem 1(iii) gives an analytical expected accuracy for the class conditional Gaussian, we can obtain the desired range of $s$ from the desired range of accuracy. Since we are interested in having the reference as a class conditional Gaussian that yields accuracy from roughly 55% to almost 100%, we set the starting and ending $s$ by the fact that $\Phi(0.1) \approx 0.54$ and $\Phi(5) \approx 1.0$. We reiterate that with more accurate modelling of the data manifold of interest, SynBench can capture the pretrained representation performance more precisely. When the data is perfectly Gaussian (e.g. input synthetic data), we calculate $E_{\theta,\epsilon}(a_t)$ with the help of Section 3.2.1. We note that $\bar{\Delta}_x$ is independent of the pretrained network parameters $\theta$, and all the $\epsilon$-robust classifiers $f_\epsilon$ in the input space overlap with each other when $\Sigma = I_d$. We hereby denote the desired metric on the input synthetic data by $E(a_t)$, to distinguish it from that on the representations, $E_{\theta,\epsilon}(a_t)$. For representations, we calculate $E_{\theta,\epsilon}(a_t)$ with the help of Section 3.2.2 and the expectation is estimated empirically. We show an example of the probing results in Figure 3. To integrate over all the desired threshold accuracies, we use the area under the curve (AUC) and give the ratio to the reference by

$$\text{SynBench-Score}(\theta, \epsilon, a_t) = \frac{\int_{a_t}^{1} E_{\theta,\epsilon}(a)\,da}{\int_{a_t}^{1} E(a)\,da},$$

which corresponds to $\frac{\text{area B}}{\text{area A} + \text{area B}}$ in Figure 3. A larger value of SynBench-Score implies better probing performance on pretrained representations.
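The ratio of areas can be approximated numerically once both curves have been evaluated on a common grid of threshold accuracies. The sketch below uses the trapezoidal rule, which is our choice; the source does not specify the quadrature. The function names and the numbers in the usage example are hypothetical placeholders, not measured results.

```python
def trapezoid_auc(xs, ys):
    """Area under the piecewise-linear curve through the points (xs[i], ys[i])."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i]) for i in range(len(xs) - 1))


def synbench_score(thresholds, rep_curve, ref_curve):
    """Ratio of the area under E_{theta,eps}(a) to the area under the reference E(a).

    thresholds: increasing accuracies from a_t up to 1.
    rep_curve / ref_curve: the two curves evaluated at those thresholds.
    """
    ref_area = trapezoid_auc(thresholds, ref_curve)
    if ref_area <= 0:
        raise ValueError("reference area must be positive")
    return trapezoid_auc(thresholds, rep_curve) / ref_area


if __name__ == "__main__":
    # Placeholder curves for illustration only.
    thresholds = [0.7, 0.8, 0.9, 1.0]
    reference = [4.0, 3.0, 2.0, 0.0]
    representation = [2.0, 1.5, 1.0, 0.0]
    print(synbench_score(thresholds, representation, reference))
```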

4. EXPERIMENTAL RESULTS

In this experiment, we exemplify the use of SynBench given a pretrained representation network. In order to compare among network attributes, it is desirable to control the variates. In Table 1, we list several pretrained vision transformers (ViTs) (Dosovitskiy et al., 2020; Chen et al., 2021; Caron (Chen et al., 2020c) , and make comparisons to the best of our knowledge. We note that the performance of these models may also be affected by the scheduler, curriculum, and training episodes, which are not captured in the above table. To provide a comprehensive evaluation, we give SynBench-Score($\theta$, $\epsilon$, $a_t$) with $a_t$ ranging from 0.7 to 0.9, and $\epsilon$ from 0 to 0.8. Due to space limits, some $a_t$ results are deferred to the appendix. The runtime of SynBench depends on the number of outcomes of the discrete uniform distribution $\mathcal{U}\{0.1, 5\}$. For one $s \sim \mathcal{U}\{0.1, 5\}$, it takes 59 seconds to generate 2048 Gaussian samples, and 37 and 81 seconds to obtain the SynBench-Score for ViT-B/16 and ViT-L/16, respectively, on one GeForce RTX 2080 Super. SynBench-Score offers a quantifiable score for robustness-accuracy performance benchmarking and is intrinsically a task-agnostic evaluation that characterizes general behaviors of the pretrained representations without the knowledge and use of any downstream data. Apart from the task-agnostic metric SynBench-Score developed in this paper, we also report linear probing accuracy on CIFAR10/ImageNet and CIFAR10-c/ImageNet-c (Hendrycks & Dietterich, 2019) to validate the standard and transfer accuracy (using the probing layer trained on CIFAR10/ImageNet to probe CIFAR10-c/ImageNet-c). We note that evaluating pretrained representations on real-life tasks is sensitive to the choice of tasks and the results may be inconclusive. For example, CIFAR10/ImageNet suggest that ViT-L performs better than ViT-B, whereas KITTI/SST (Geiger et al., 2012; Socher et al., 2013) (Radford et al., 2021, imaging (Tummala et al., 2022, Table 2-3) suggest the opposite.
In contrast, because SynBench-Score is intrinsically a task-agnostic evaluation of the pretrained model, its result is independent of the choice of tasks.

Fine-tuned pretraining representation. When applying a pretrained representation network to the desired task, one can either train only a linear head on top of a fixed pretrained model, or fine-tune both the representation network and the linear head. Thus, in Table 2, we investigate how the fine-tuning process affects the representation networks. Specifically, both networks in Table 2 are pretrained on Imagenet 21k with supervision. After the pretraining, ViT-B/16 is further finetuned on Imagenet 1k. Interestingly, SynBench-Score shows that this finetuning is beneficial, as improvements are witnessed across all $\epsilon$, which matches well the empirical observation given by CIFAR10 and CIFAR10-c and prior results (Kumar et al., 2021).

Model size. In Table 3, we compare ViTs of different sizes. Specifically, we perform SynBench on ViT-Ti, ViT-B, and ViT-L with a patch size of 16. The model parameter $\theta$ is provided by the pretrained model. It is noticeable that ViT-B/16 is generally on par with ViT-L/16. When we set the threshold accuracy to higher values, ViT-L/16 starts to give slightly better evaluations, especially with larger $\epsilon$. One interesting observation is that for each model, SynBench-Score is not necessarily monotonic in $\epsilon$, which indicates that standard linear probing (i.e., $\epsilon = 0$) may not be the most effective way of probing pretrained representations in terms of robustness-accuracy performance; this is consistent with recent findings (Fan et al., 2021). See the "Robust linear probing" paragraph below for detailed analysis. We also observe that larger models exhibit better resilience (slower reduction in SynBench-Score) as $\epsilon$ increases.

Robust linear probing.
According to Table 3, 0.2-robust Bayes classifiers consistently give better scores compared to 0-robust (standard) Bayes classifiers with ViT-B/16 and ViT-L/16. This offers us a quick way of gauging the suitable downstream robust probing parameter for the given pretrained model. We conjecture that observing a 0.2-robust Bayes classifier to yield a better SynBench-Score than a 0-robust Bayes classifier may suggest that the pretrained network produces representations that have better overall performance with linear classifiers trained by 0.2-robust linear probing. We validate this by performing robust linear probing on representations rendered by ViTs for CIFAR10 classification. Results are shown in Table 6. For a given pretrained model, let $f$ and $g$ be the pretrained network and the linear probing layer, respectively. We solve the optimization problem $\min_g \max_{\|\delta\| \leq \epsilon} L(g(f(x+\delta)), y)$ using the PyTorch library Torchattacks and 10-step PGDL2 attacks (Madry et al., 2018) for adversarial training. From Table 6, we see that robust linear probing with $\epsilon = \arg\max_\epsilon \text{SynBench-Score}$ generally gives a decent robustness-accuracy tradeoff. For example, with robust linear probing, we obtain a 0.6% and 1.3% increase in CIFAR10 standard and CIFAR10-c transfer accuracy with ViT-B/16 (as in Table 9). A more complete table on $\epsilon$-robust linear probing results with different $\epsilon$ is given in the appendix.

5. DISCUSSION AND CONCLUSION

In this paper, we propose a new task-agnostic framework, SynBench, for benchmarking the robustness-accuracy performance of pretrained representations. SynBench is fundamentally task-independent and provides a quantifiable score that does not rely on any real-life data. SynBench exploits an idealized data distribution, the class conditional Gaussian mixture, to establish a theoretically-derived robustness-accuracy tradeoff, which serves as the reference for pretrained representations. Finally, a quantifiable score, SynBench-Score, is provided that compares the ratio of area-under-curve between the reference and the pretrained representations. We validate the usefulness of SynBench on several pretrained vision transformers in giving insightful comparisons on different model attributes (e.g. model size, fine-tuned pretraining representations, ViT patch size, linear probing). While we delved into the robustness-accuracy performance of pretrained representations of vision transformers, we envision the SynBench framework being further extended to other trustworthiness dimensions such as privacy, fairness, etc. Moreover, with the popularization of pretrained representations in various domains (e.g. vision, language, speech), we foresee SynBench being generalized to more domains and shedding light on task-agnostic benchmarking designs.

A OBJECTIVE

$$\begin{aligned}
E_{\theta,\epsilon}(a_t) &= \mathbb{E}_{s \sim \mathcal{U},\ x - \bar{\mu} \mid y \sim \mathcal{N}(y\mu, \Sigma)}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y,\ a > a_t,\ \mu = s \cdot 1_d/\sqrt{d},\ \Sigma = I_d\right] \\
&= \mathbb{E}_{s,x}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y,\ a(s,\epsilon) > a_t\right] \\
&= \sum_i \mathbb{E}_x\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y,\ a(s_i,\epsilon) > a_t\right] \mathbb{P}(s = s_i) \\
&= \frac{1}{n}\sum_i \mathbb{E}_x\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y,\ a(s_i,\epsilon) > a_t\right] \\
&= \frac{1}{n}\sum_i \mathbb{E}_x\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y\right] \mathbb{1}_{a(s_i,\epsilon) > a_t}.
\end{aligned}$$

B PROOFS

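Theorem 1. For any sample $x$, the optimal robust classifier $f_\epsilon$ for $P_{\mu_1,\mu_2,\Sigma}$ gives

(i) the bound (decision margin) $\|\Delta\|_2 = \frac{|(x - \frac{\mu_1+\mu_2}{2})^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_2}$,

(ii) the scaled bound $\|\bar{\Delta}\|_2 = \frac{|(x - \frac{\mu_1+\mu_2}{2})^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{|\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}$.

For a sample $x \sim P_{\mu_1,\mu_2,\Sigma}$, it further gives

(iii) the standard accuracy $a = \Phi\left(\frac{\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_\Sigma}\right)$,

(iv) the expected scaled bound $\mathbb{E}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y\right] = \frac{1}{\sqrt{2\pi}} \frac{1}{a\,\Phi^{-1}(a)} e^{-\frac{1}{2}(\Phi^{-1}(a))^2} + 1$,

where $z_\Sigma$ is the solution of the convex problem $\arg\min_{\|z\|_2 \leq \epsilon} (\tilde{\mu} - z)^T \Sigma^{-1} (\tilde{\mu} - z)$ and $\Phi$ denotes the CDF of the standard normal distribution.

Proof. (i) Following Bhagoji et al. (2019); Dan et al. (2020), the Bayes optimal robust classifier for the general non-symmetric conditional Gaussians $P_{\mu_1,\mu_2,\Sigma}$ specified in equation 1 is

$$f_\epsilon(x) = \operatorname{sign}\left\{\left(x - \frac{\mu_1+\mu_2}{2}\right)^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\right\}, \tag{4}$$

where $\operatorname{sign}(\cdot)$ is the typical sign function and $z_\Sigma$ is the solution of the convex problem $\arg\min_{\|z\|_2 \leq \epsilon} (\tilde{\mu} - z)^T \Sigma^{-1} (\tilde{\mu} - z)$. The corresponding decision boundary is at

$$\begin{aligned}
&\left((x + \delta) - \frac{\mu_1+\mu_2}{2}\right)^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu})) = 0 \\
\Longrightarrow\ &\Delta = \arg\min \|\delta\|_2 \quad \text{s.t.} \quad \delta^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu})) = -\left(x - \frac{\mu_1+\mu_2}{2}\right)^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu})) \\
\Longrightarrow\ &\|\Delta\|_2 = \frac{|(x - \frac{\mu_1+\mu_2}{2})^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_2}.
\end{aligned}$$

(ii) Since the bound $\|\Delta\|_2$ is subject to the positions of the two Gaussians, we scale the bound by the distance from the Gaussian centers to the classifier, $\frac{|\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_2}$, and obtain

$$\|\bar{\Delta}\|_2 = \frac{|(x - \frac{\mu_1+\mu_2}{2})^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_2} \cdot \frac{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_2}{|\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|} = \frac{|(x - \frac{\mu_1+\mu_2}{2})^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{|\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}.$$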
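(iii) For a sample $x \sim P_{\mu_1,\mu_2,\Sigma}$, considering the Bayes optimal robust classifier in equation 4, we can calculate the analytical standard accuracy by

$$\begin{aligned}
&\mathbb{P}(y = 1)\,\mathbb{P}[f_\epsilon(x) = 1 \mid y = 1] + \mathbb{P}(y = -1)\,\mathbb{P}[f_\epsilon(x) = -1 \mid y = -1] \\
&= \mathbb{P}[f_\epsilon(x) = 1 \mid y = 1] \\
&= \mathbb{P}\left[\left(x - \frac{\mu_1+\mu_2}{2}\right)^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu})) > 0 \,\Big|\, y = 1\right] \\
&= \mathbb{P}\left[(\tilde{\mu} + w)^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu})) > 0\right], \quad w \sim \mathcal{N}(0, \Sigma) \\
&= \mathbb{P}\left[w^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu})) > -\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\right], \quad w \sim \mathcal{N}(0, \Sigma) \\
&= \mathbb{P}\left[\frac{w^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_\Sigma} > -\frac{\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_\Sigma}\right], \quad \frac{w^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_\Sigma} \sim \mathcal{N}(0, 1) \\
&= \Phi\left(\frac{\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))}{\|\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))\|_\Sigma}\right).
\end{aligned}$$

(iv) For a sample $x \sim P_{\mu_1,\mu_2,\Sigma}$, let $a$ denote the accuracy, $t$ denote $x - \frac{\mu_1+\mu_2}{2}$, and $w$ denote $\Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))$. From (iii), we have that the standard accuracy of conditional Gaussian samples with the Bayes optimal (robust) classifier is $\Phi\left(\frac{\tilde{\mu}^T w}{\|w\|_\Sigma}\right)$, so $\frac{\tilde{\mu}^T w}{\|w\|_\Sigma} = \Phi^{-1}(a)$. Since for binary classification we only care about accuracy from 0.5 to 1, we should have $\tilde{\mu}^T w > 0$. Now consider the classifier in equation 4 and the corresponding scaled bound from (ii),

$$\|\bar{\Delta}\|_2 = \frac{|(x - \frac{\mu_1+\mu_2}{2})^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|}{|\tilde{\mu}^T \Sigma^{-1}(\tilde{\mu} - z_\Sigma(\tilde{\mu}))|} = \frac{|t^T w|}{|\tilde{\mu}^T w|} = \frac{|t^T w|}{\tilde{\mu}^T w}.$$

Since $t \mid y \sim \mathcal{N}(y\tilde{\mu}, \Sigma)$, we have $t^T w \mid y \sim \mathcal{N}(y\tilde{\mu}^T w,\ w^T \Sigma^T w)$. When we only want to get the expected scaled bound of the correctly-classified samples, we have that

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y\right] &= \frac{1}{\tilde{\mu}^T w}\,\mathbb{E}\left[|t^T w| \mid f_\epsilon(x) = y\right] \\
&= \frac{1}{2\tilde{\mu}^T w}\,\mathbb{E}\left[|t^T w| \mid f_\epsilon(x) = y = 1\right] + \frac{1}{2\tilde{\mu}^T w}\,\mathbb{E}\left[|t^T w| \mid f_\epsilon(x) = y = -1\right] \\
&= \frac{1}{2\tilde{\mu}^T w}\,\mathbb{E}\left[t^T w \mid y = 1,\ t^T w \geq 0\right] + \frac{1}{2\tilde{\mu}^T w}\,\mathbb{E}\left[-t^T w \mid y = -1,\ t^T w < 0\right].
\end{aligned}$$

Recall that $t^T w \mid y \sim \mathcal{N}(y\tilde{\mu}^T w,\ w^T \Sigma^T w)$. Then, by the mean of the truncated normal distribution, it is true that

$$\begin{aligned}
\mathbb{E}\left[t^T w \mid y = 1,\ t^T w \geq 0\right] &= \tilde{\mu}^T w + \sqrt{w^T \Sigma^T w}\ \frac{\phi\left(\frac{0 - \tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)}{1 - \Phi\left(\frac{0 - \tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)} \\
&= \tilde{\mu}^T w + \sqrt{w^T \Sigma^T w}\ \frac{\phi\left(-\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)}{1 - \Phi\left(-\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)} \\
&= \tilde{\mu}^T w + \sqrt{w^T \Sigma^T w}\ \frac{1}{\sqrt{2\pi}\,\Phi\left(\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)}\, e^{-\frac{1}{2}\left(\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)^2},
\end{aligned}$$

$$\begin{aligned}
\mathbb{E}\left[-t^T w \mid y = -1,\ t^T w < 0\right] &= -\mathbb{E}\left[t^T w \mid y = -1,\ t^T w < 0\right] \\
&= -\left(-\tilde{\mu}^T w - \sqrt{w^T \Sigma^T w}\ \frac{\phi\left(\frac{0 + \tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)}{\Phi\left(\frac{0 + \tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)}\right) \\
&= \tilde{\mu}^T w + \sqrt{w^T \Sigma^T w}\ \frac{1}{\sqrt{2\pi}\,\Phi\left(\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)}\, e^{-\frac{1}{2}\left(\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)^2}.
\end{aligned}$$

Therefore

$$\begin{aligned}
\mathbb{E}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y\right] &= \frac{1}{\tilde{\mu}^T w}\left(\tilde{\mu}^T w + \sqrt{w^T \Sigma^T w}\ \frac{1}{\sqrt{2\pi}\,\Phi\left(\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)}\, e^{-\frac{1}{2}\left(\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)^2}\right) \\
&= 1 + \frac{\sqrt{w^T \Sigma^T w}}{\tilde{\mu}^T w}\ \frac{1}{\sqrt{2\pi}\,\Phi\left(\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)}\, e^{-\frac{1}{2}\left(\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}\right)^2}.
\end{aligned}$$

By replacing $\frac{\tilde{\mu}^T w}{\sqrt{w^T \Sigma^T w}}$ by $\Phi^{-1}(a)$, we get

$$\mathbb{E}\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x) = y\right] = \frac{1}{\sqrt{2\pi}} \frac{1}{a\,\Phi^{-1}(a)} e^{-\frac{1}{2}(\Phi^{-1}(a))^2} + 1.$$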
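The closed form in (iv) can be sanity-checked with a small Monte Carlo simulation. The sketch below uses the one-dimensional reduction from the proof: after dividing $t^T w$ by $\sqrt{w^T \Sigma^T w}$, the score of a sample with label $y$ is distributed as $\mathcal{N}(y m, 1)$ with $m = \Phi^{-1}(a)$, and the scaled bound is the absolute score divided by $m$. The function names, sample size, and seed are our own choices for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def closed_form_scaled_bound(m):
    """Theorem 1(iv) written in terms of m = Phi^{-1}(a): 1 + phi(m) / (m * Phi(m))."""
    return 1.0 + STD_NORMAL.pdf(m) / (m * STD_NORMAL.cdf(m))


def monte_carlo_scaled_bound(m, n_samples, seed=0):
    """Simulate normalized scores u | y ~ N(y * m, 1) with balanced labels.

    Returns (mean scaled bound over correctly classified samples, accuracy).
    """
    rng = random.Random(seed)
    total, n_correct = 0.0, 0
    for _ in range(n_samples):
        y = rng.choice((1, -1))
        u = y * m + rng.gauss(0.0, 1.0)
        if y * u > 0:  # sign(u) == y, i.e. correctly classified
            total += abs(u) / m
            n_correct += 1
    return total / n_correct, n_correct / n_samples


if __name__ == "__main__":
    m = 1.0
    estimate, accuracy = monte_carlo_scaled_bound(m, n_samples=100_000)
    print(estimate, closed_form_scaled_bound(m), accuracy, STD_NORMAL.cdf(m))
```

The simulated mean and accuracy should agree with the closed form and with $\Phi(m)$ up to sampling noise.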

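C GENERAL $\ell_p$ RESULTS

We note that our results in Appendix B can be straightforwardly generalized to $\ell_p$. Given an $\ell_p$ adversarial budget $\epsilon$:

Theorem 2. For any sample $x$, the optimal robust classifier $f_\epsilon$ for $P_{\mu_1,\mu_2,\Sigma}$ gives

(i) the bound (decision margin) $\|\Delta\|_p = \dfrac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\right|}{\|\Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\|_q}$,

(ii) the scaled bound $\|\bar{\Delta}\|_p = \dfrac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\right|}{\left|\tilde{\mu}^\top \Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\right|}$.

For sample $x \sim P_{\mu_1,\mu_2,\Sigma}$, it further gives

(iii) the standard accuracy $a = \Phi\left(\dfrac{\tilde{\mu}^\top \Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))}{\|\Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\|_\Sigma}\right)$,

(iv) the expected scaled bound $E\left[\|\bar{\Delta}\|_p \mid f_\epsilon(x)=y\right] = \dfrac{1}{\sqrt{2\pi}}\, \dfrac{1}{a\,\Phi^{-1}(a)}\, e^{-\frac{1}{2}\left(\Phi^{-1}(a)\right)^2} + 1$,

where $z_\Sigma$ is the solution of the convex problem $\arg\min_{\|z\|_p \le \epsilon} (\tilde{\mu}-z)^\top \Sigma^{-1} (\tilde{\mu}-z)$ and $\Phi$ denotes the CDF of the standard normal distribution.

Proof. We follow the proof of Theorem 1 and consider the classifier in equation 4. By Hölder's inequality, we now have the corresponding lower bound and scaled lower bound as

$$
\|\Delta\|_p = \frac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\right|}{\|\Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\|_q},
$$

$$
\|\bar{\Delta}\|_p = \frac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\right|}{\|\Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\|_q} \cdot \frac{\|\Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\|_q}{\left|\tilde{\mu}^\top \Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\right|} = \frac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\right|}{\left|\tilde{\mu}^\top \Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))\right|},
$$

where $\frac{1}{p}+\frac{1}{q}=1$. The remainder of the proof then follows as in Theorem 1.

Remark. In general, when $\Sigma$ is singular, we can apply the economy-size (thin) decomposition with nonzero eigenvalues, $\Sigma = F \Lambda F^\top$. Then, with general non-symmetric conditional Gaussians

$$
x \mid y=1 \sim N(\mu_1, \Sigma), \qquad x \mid y=-1 \sim N(\mu_2, \Sigma),
$$

we apply a proper translation to obtain symmetric conditional Gaussians:

$$
F^\top x \mid y=1 \sim N(F^\top \mu_1, \Sigma), \qquad F^\top x \mid y=-1 \sim N(F^\top \mu_2, \Sigma),
$$

$$
F^\top x - F^\top \tfrac{\mu_1+\mu_2}{2} \,\Big|\, y=1 \sim N(\tilde{\mu}, \Sigma), \qquad F^\top x - F^\top \tfrac{\mu_1+\mu_2}{2} \,\Big|\, y=-1 \sim N(-\tilde{\mu}, \Sigma),
$$

where $\tilde{\mu} = F^\top \frac{\mu_1-\mu_2}{2}$.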
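To make the role of the dual exponent $q$ in Theorem 2(i)-(ii) concrete, here is a small sketch that computes the $\ell_p$ margin and the scaled margin of a point with respect to a linear classifier. The vector `normal` stands in for $\Sigma^{-1}(\tilde{\mu}-z_\Sigma(\tilde{\mu}))$ and is assumed to be given; solving the convex problem for $z_\Sigma$ is outside the scope of this sketch, and all function names are hypothetical.

```python
import math
from typing import Sequence


def dual_exponent(p: float) -> float:
    """Return q with 1/p + 1/q = 1 (q = inf when p = 1, q = 1 when p = inf)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if p == 1:
        return math.inf
    if math.isinf(p):
        return 1.0
    return p / (p - 1.0)


def lq_norm(v: Sequence[float], q: float) -> float:
    """The l_q norm of v, including the q = inf case."""
    if math.isinf(q):
        return max(abs(c) for c in v)
    return sum(abs(c) ** q for c in v) ** (1.0 / q)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def lp_margin(x, center, normal, p: float) -> float:
    """Bound (i): |(x - center)^T normal| / ||normal||_q."""
    shifted = [xi - ci for xi, ci in zip(x, center)]
    return abs(dot(shifted, normal)) / lq_norm(normal, dual_exponent(p))


def scaled_margin(x, center, normal, mu_tilde) -> float:
    """Scaled bound (ii): the q-norm cancels, so the value does not depend on p."""
    shifted = [xi - ci for xi, ci in zip(x, center)]
    return abs(dot(shifted, normal)) / abs(dot(mu_tilde, normal))


if __name__ == "__main__":
    x, center, normal = (3.0, 0.0), (0.0, 0.0), (1.0, 1.0)
    for p in (1, 2, math.inf):
        print(p, lp_margin(x, center, normal, p))
```

Because the $q$-norm appears in both the bound and the scaling factor, the scaled bound is the same for every $p$, which is why parts (iii) and (iv) carry over from Theorem 1 unchanged.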

D CLASS IMBALANCE

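Given an $\ell_2$ adversarial budget $\epsilon \le \|\mu\|_2$, consider the conditional Gaussian in equation 1 with $\Sigma = I_d$ (the $d \times d$ identity matrix) and a general class prior $\tau$. Then the following theorem holds.

Theorem 3. For any sample $x$, the optimal robust classifier $f_\epsilon$ for $P_{\mu_1,\mu_2,I_d}$ gives

(i) the bound (decision margin) $\|\Delta\|_2 = \dfrac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2}$,

(ii) the scaled bound $\|\bar{\Delta}\|_2 = \dfrac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|}{\left|\tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|}$.

For a sample $x \sim P_{\mu_1,\mu_2,I_d}$, it further gives

(iii) the standard accuracy $a = \tau\,\Phi\left(\dfrac{\tilde{\mu}^\top w - q/2}{\|w\|_2}\right) + (1-\tau)\,\Phi\left(\dfrac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)$,

(iv) the expected scaled bound

$$
E\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x)=y\right] = \frac{\tau}{\tilde{\mu}^\top w - q/2}\left(\tilde{\mu}^\top w - q/2 + \|w\|_2\, \frac{\phi\left(\frac{-\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}{\Phi\left(\frac{\tilde{\mu}^\top w - q/2}{\|w\|_2}\right)}\right) + \frac{1-\tau}{\tilde{\mu}^\top w - q/2}\left(\tilde{\mu}^\top w + q/2 + \|w\|_2\, \frac{\phi\left(\frac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}{\Phi\left(\frac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}\right),
$$

where $q = \ln\{(1-\tau)/\tau\}$, $w = \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)$, and $\phi$ and $\Phi$ denote the PDF and CDF of the standard normal distribution.

Proof. (i) Consider the Bayes optimal $\ell_2$ $\epsilon$-robust classifier (Dobriban et al., 2020, Theorem 4.1)

$$
f_\epsilon(x) = \operatorname{sign}\left\{\left(x-\tfrac{\mu_1+\mu_2}{2}\right)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right\},
$$

where $q = \ln\{(1-\tau)/\tau\}$. For any $x$,

$$
\|\Delta\|_2 = \frac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2}.
$$

(ii) Since the bound $\|\Delta\|_2$ is subject to the positions of the two Gaussians, we scale the bound by the distance from the Gaussian centers to the classifier, $\dfrac{\left|\tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2}$, and obtain

$$
\|\bar{\Delta}\|_2 = \frac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2} \cdot \frac{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2}{\left|\tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|} = \frac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|}{\left|\tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|}.
$$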
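The accuracy expression in Theorem 3(iii) is straightforward to evaluate numerically. The sketch below is one possible implementation under stated assumptions: the function name is hypothetical, the mean difference is passed as a plain list, and the budget is required to be strictly smaller than the norm of the mean difference so that $w$ is nonzero (the theorem allows equality, which this sketch excludes to avoid division by zero).

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # CDF of the standard normal distribution


def imbalanced_accuracy(mu_tilde, eps: float, tau: float) -> float:
    """Standard accuracy of Theorem 3(iii) for Sigma = I_d and class prior tau."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    norm = math.sqrt(sum(m * m for m in mu_tilde))
    if not 0.0 <= eps < norm:
        raise ValueError("this sketch requires 0 <= eps < ||mu_tilde||_2")
    q = math.log((1.0 - tau) / tau)        # prior-dependent offset q
    scale = 1.0 - eps / norm
    w = [m * scale for m in mu_tilde]      # w = mu_tilde * (1 - eps / ||mu_tilde||_2)
    w_norm = scale * norm                  # ||w||_2
    mw = sum(m * wi for m, wi in zip(mu_tilde, w))  # mu_tilde^T w
    return tau * PHI((mw - q / 2.0) / w_norm) + (1.0 - tau) * PHI((mw + q / 2.0) / w_norm)


if __name__ == "__main__":
    print(imbalanced_accuracy([1.0, 0.5], eps=0.3, tau=0.3))
```

With $\tau = 0.5$ the offset $q$ vanishes and, for $\epsilon = 0$, the expression reduces to $\Phi(\|\tilde{\mu}\|_2)$.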
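(iii) For sample $x \sim P_{\mu_1,\mu_2,I_d}$, consider the Bayes optimal robust classifier in equation 4. We can calculate the analytical standard accuracy by

$$
\begin{aligned}
&P(y=1)\,P[f_\epsilon(x)=1 \mid y=1] + P(y=-1)\,P[f_\epsilon(x)=-1 \mid y=-1] \\
&= \tau\,P[f_\epsilon(x)=1 \mid y=1] + (1-\tau)\,P[f_\epsilon(x)=-1 \mid y=-1] \\
&= \tau\,P\left[\left(x-\tfrac{\mu_1+\mu_2}{2}\right)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2 > 0 \,\middle|\, y=1\right] \\
&\quad + (1-\tau)\,P\left[\left(x-\tfrac{\mu_1+\mu_2}{2}\right)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2 < 0 \,\middle|\, y=-1\right] \\
&= \tau\,P\left[(\tilde{\mu}+w)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2 > 0\right] \\
&\quad + (1-\tau)\,P\left[(-\tilde{\mu}+w)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2 < 0\right], \quad w \sim N(0, I_d) \\
&= \tau\,P\left[w^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) > q/2 - \tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\right] \\
&\quad + (1-\tau)\,P\left[w^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) < q/2 + \tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\right], \quad w \sim N(0, I_d) \\
&= \tau\,P\left[\frac{w^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2} > \frac{q/2 - \tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2}\right] \\
&\quad + (1-\tau)\,P\left[\frac{w^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2} < \frac{q/2 + \tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2}\right], \quad \frac{w^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2} \sim N(0,1) \\
&= \tau\,\Phi\left(\frac{\tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2}\right) + (1-\tau)\,\Phi\left(\frac{\tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) + q/2}{\|\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)\|_2}\right).
\end{aligned}
$$

Letting $w$ denote $\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)$, we get the accuracy

$$
a = \tau\,\Phi\left(\frac{\tilde{\mu}^\top w - q/2}{\|w\|_2}\right) + (1-\tau)\,\Phi\left(\frac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right).
$$

(iv) For sample $x \sim P_{\mu_1,\mu_2,I_d}$, let $t$ denote $x-\frac{\mu_1+\mu_2}{2}$, and $w$ denote $\tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2)$. According to Theorem 3(iii), when $\tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2 > 0$, the accuracy would be higher than 0.5. Therefore we consider $\tilde{\mu}^\top w - q/2 > 0$. Now consider the classifier in equation 5 and the corresponding scaled bound from (ii),

$$
\|\bar{\Delta}\|_2 = \frac{\left|\left(x-\frac{\mu_1+\mu_2}{2}\right)^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|}{\left|\tilde{\mu}^\top \tilde{\mu}(1-\epsilon/\|\tilde{\mu}\|_2) - q/2\right|} = \frac{|t^\top w - q/2|}{|\tilde{\mu}^\top w - q/2|} = \frac{|t^\top w - q/2|}{\tilde{\mu}^\top w - q/2}.
$$

Since $t \mid y \sim N(y\tilde{\mu}, I_d)$, we have $t^\top w - q/2 \mid y \sim N(y\tilde{\mu}^\top w - q/2,\, w^\top w)$. When we only want to get the expected scaled bound of the correctly-classified samples, we have that

$$
\begin{aligned}
E\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x)=y\right]
&= \frac{1}{\tilde{\mu}^\top w - q/2}\, E\left[|t^\top w - q/2| \mid f_\epsilon(x)=y\right] \\
&= \frac{\tau}{\tilde{\mu}^\top w - q/2}\, E\left[|t^\top w - q/2| \mid f_\epsilon(x)=y=1\right] + \frac{1-\tau}{\tilde{\mu}^\top w - q/2}\, E\left[|t^\top w - q/2| \mid f_\epsilon(x)=y=-1\right] \\
&= \frac{\tau}{\tilde{\mu}^\top w - q/2}\, E\left[t^\top w - q/2 \mid y=1,\ t^\top w - q/2 \ge 0\right] + \frac{1-\tau}{\tilde{\mu}^\top w - q/2}\, E\left[-t^\top w + q/2 \mid y=-1,\ t^\top w - q/2 < 0\right].
\end{aligned}
$$

Recall that $t^\top w - q/2 \mid y \sim N(y\tilde{\mu}^\top w - q/2,\, w^\top w)$, then by the mean of truncated normal distribution, it is true that

$$
\begin{aligned}
E\left[t^\top w - q/2 \mid y=1,\ t^\top w - q/2 \ge 0\right]
&= \tilde{\mu}^\top w - q/2 + \|w\|_2\, \frac{\phi\left(\frac{0-\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}{1-\Phi\left(\frac{0-\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)} \\
&= \tilde{\mu}^\top w - q/2 + \|w\|_2\, \frac{\phi\left(\frac{-\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}{\Phi\left(\frac{\tilde{\mu}^\top w - q/2}{\|w\|_2}\right)},
\end{aligned}
$$

$$
\begin{aligned}
E\left[-t^\top w + q/2 \mid y=-1,\ t^\top w - q/2 < 0\right]
&= -E\left[t^\top w - q/2 \mid y=-1,\ t^\top w - q/2 < 0\right] \\
&= -\left(-\tilde{\mu}^\top w - q/2 - \|w\|_2\, \frac{\phi\left(\frac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}{\Phi\left(\frac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}\right) \\
&= \tilde{\mu}^\top w + q/2 + \|w\|_2\, \frac{\phi\left(\frac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}{\Phi\left(\frac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}.
\end{aligned}
$$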



https://github.com/rwightman/pytorch-image-models
https://github.com/google-research/simclr
https://github.com/Harry24k/adversarial-attacks-pytorch
https://github.com/willwhitney/reprieve



Figure 1: Overview of our SynBench framework. Step 1: generate class conditional Gaussian and form the inputs to the pretrained model; Step 2: gather rendered representations; Step 3: measure the expected robustness bound under a range of threshold accuracy for both input raw data and their representations according to equation 2 and obtain the expected bound-threshold accuracy plot; Step 4: calculate SynBench score by the relative area under curve of the representations (area B) to the input data (area A + area B) in the expected bound-threshold accuracy plot.

Figure 2: Illustration of robustness-accuracy tradeoff suggested by ϵ-robust Bayes optimal classifiers. Figure (a) depicts a class conditional 2D Gaussian case with decision boundaries drawn by ϵ-robust Bayes optimal classifiers of varying ϵ values. Figure (b) draws the theoretically characterized robustness-accuracy tradeoff given in Theorem 1(iv).

Figure 3: An example of the robustness-accuracy quantification of representations for ViT-B/16. (Left) The expected bound-threshold accuracy plot for the input raw data ($E(a_t)$) and representations ($E_{\theta,\epsilon}(a_t)$) with $\epsilon = 0 \sim 0.8$. (Right) The desired quantification SynBench-Score$(\theta, \epsilon, a_t) = \frac{\text{area B}}{\text{area A} + \text{area B}}$ (refer to equation 3) for $\epsilon = 0$ (top) and $\epsilon = 0.6$ (bottom).
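The area ratio described in the captions of Figures 1 and 3 can be sketched as follows. This is an illustrative approximation only: it assumes both expected-bound curves have been evaluated on the same grid of threshold accuracies, uses the trapezoidal rule for the areas, and does not reproduce any additional detail of equation 3. The function names and the example numbers are hypothetical.

```python
from typing import Sequence


def trapezoid_area(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Area under the piecewise-linear curve through (xs[i], ys[i])."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )


def synbench_score(
    thresholds: Sequence[float],
    bound_input: Sequence[float],
    bound_repr: Sequence[float],
) -> float:
    """Ratio area B / (area A + area B) on a shared threshold-accuracy grid.

    bound_input: expected bound of the raw input data (area A + area B).
    bound_repr:  expected bound of the representations (area B).
    """
    if not (len(thresholds) == len(bound_input) == len(bound_repr)):
        raise ValueError("all three sequences must have the same length")
    if len(thresholds) < 2:
        raise ValueError("at least two threshold accuracies are required")
    area_input = trapezoid_area(thresholds, bound_input)
    area_repr = trapezoid_area(thresholds, bound_repr)
    return area_repr / area_input


if __name__ == "__main__":
    grid = [0.7, 0.8, 0.9]            # hypothetical threshold accuracies
    raw = [2.0, 1.5, 1.2]             # hypothetical reference curve values
    rep = [1.0, 0.75, 0.6]            # hypothetical representation curve values
    print(synbench_score(grid, raw, rep))
```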

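$$
E\left[\|\bar{\Delta}\|_2 \mid f_\epsilon(x)=y\right] = \frac{\tau}{\tilde{\mu}^\top w - q/2}\left(\tilde{\mu}^\top w - q/2 + \|w\|_2\, \frac{\phi\left(\frac{-\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}{\Phi\left(\frac{\tilde{\mu}^\top w - q/2}{\|w\|_2}\right)}\right) + \frac{1-\tau}{\tilde{\mu}^\top w - q/2}\left(\tilde{\mu}^\top w + q/2 + \|w\|_2\, \frac{\phi\left(\frac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}{\Phi\left(\frac{\tilde{\mu}^\top w + q/2}{\|w\|_2}\right)}\right).
$$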

Model descriptions.   et al., 2021)  and ResNets 2

, Food-Seg103 (Wu et al., 2021, Table 8), X-ray images (Okolo et al., 2022, Table 4-8), magnetic resonance

Comparisons on the finetuning procedure in pretraining. The SynBench-Score of ViTs with or without finetuning pretrained representations, and the linear probing accuracy on CIFAR10/ImageNet and transfer accuracy on CIFAR10-c/ImageNet-c.

Comparisons on the model sizes. The SynBench-Score of ViTs of different sizes, and the linear probing accuracy on CIFAR10/ImageNet and transfer accuracy on CIFAR10-c/ImageNet-c.





Comparisons on the ViT patch size. The SynBench-Score of ViTs of different patch sizes, and the linear probing accuracy on CIFAR10 and transfer accuracy on CIFAR10-c.

CIFAR10 and CIFAR10-c accuracy changes using ϵ-robust linear probing with $\epsilon = \arg\max_\epsilon$ SynBench-Score.

ViT patch size. We also compare vision transformer patch sizes in Table 5. Specifically, we evaluate ViT-B with patch sizes 16 and 32. The SynBench-Scores show a trend consistent with the model performance on CIFAR10 and CIFAR10-c. From Table 4, we see that the SynBench-Score of ViT-S/16-DINO is on par with that of ViT-S/8-DINO, and ViT-B/16-DINO has a higher SynBench-Score than ViT-B/8-DINO. Although linear probing on CIFAR10 and ImageNet does not share this trend, the model with the bigger patch size (ViT-S/16-DINO) does perform better than the one with the smaller patch size (ViT-S/8-DINO) on the PASCAL VOC (Everingham et al., 2010) segmentation task (Caron et al., 2021, Figure 4, bottom table).

Baseline metrics evaluating the representation quality on the conditional Gaussian synthetic data with n = {2048, 4096, 8192, 16384, 32768}. For Val loss, MDL, SDL, and ϵSC, the smaller the better; for SynBench, the bigger the better. Note that the model ranking of SynBench is consistent across different values of n, while other methods will change their rankings.

Baseline metrics evaluating the representation quality on the conditional Gaussian synthetic data with n = 8192.

Table columns: Model, Arch., pretraining, fine-tuning, patch, # parameters (M).

F SYNTHETIC DATASET COMPLEXITY

SynBench can adjust the synthetic task complexity by statistically modeling the structure of the covariance matrix $P_\Sigma$. In our previous experiments, we considered an identity covariance matrix; here we assume a channel-wise band matrix covariance $\Sigma$ (the R, G, and B channel entries are externally independent, and internally Gaussian with a band matrix covariance). Essentially, the assembled $\Sigma$ is a block-diagonal matrix with each block being a band matrix of the size of the image. For ViT with ImageNet-size inputs, the 3 blocks are $224^2 \times 224^2$. We let the 3 band matrices be pentadiagonal matrices (only the main diagonal, the first two upper diagonals, and the first two lower diagonals are nonzero). From Table 10 to Table 13, we see that the trend suggested by SynBench is generally consistent with our findings in Section 4. Moreover, SynBench captures the complexity of the task well and gives lower SynBench-Scores when the task is more complex.
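The block-diagonal structure described above can be illustrated with a toy construction. The band values below (1.0 on the main diagonal, 0.2 and 0.1 on the first and second off-diagonals) are assumptions chosen only so that each block is strictly diagonally dominant and hence positive definite; the values used in the experiments are not stated in this section. The helper names are hypothetical, and a 16-pixel channel replaces the $224^2$-pixel channel to keep the dense matrix small.

```python
from typing import List

Matrix = List[List[float]]


def pentadiagonal(n: int, main: float = 1.0, first: float = 0.2, second: float = 0.1) -> Matrix:
    """Symmetric n x n band matrix with nonzeros only on diagonals -2..2."""
    band = {0: main, 1: first, 2: second}  # hypothetical band values
    return [[band.get(abs(i - j), 0.0) for j in range(n)] for i in range(n)]


def block_diagonal(blocks: List[Matrix]) -> Matrix:
    """Assemble square blocks along the diagonal; off-block entries are zero."""
    size = sum(len(block) for block in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for block in blocks:
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                out[offset + i][offset + j] = value
        offset += len(block)
    return out


# Toy example: three channels (R, G, B), each with 4 * 4 = 16 pixels.
pixels_per_channel = 4 * 4
channel_block = pentadiagonal(pixels_per_channel)
sigma = block_diagonal([channel_block, channel_block, channel_block])

if __name__ == "__main__":
    print(len(sigma), "x", len(sigma[0]))
```

At the full ImageNet input size a dense representation of each block would be impractically large, so a sparse or banded storage format would be the natural choice there.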

G OTHER BASELINES

For completeness, we report several baseline metrics for the synthetic conditional Gaussian classification task. We follow the implementation of Whitney et al. (2020)$^4$ and set ns (the training set size) to be the length of the synthetic dataset to compute canonical results. In Table 15, we report validation loss (val loss), minimum description length (MDL) (Voita & Titov, 2020), surplus description length (SDL), and ϵ-sample complexity (ϵ-SC) (Whitney et al., 2020). As a reference, obtaining the metrics takes 6807 seconds for ViT-B/16 and 7373 seconds for ViT-L/16 on one Tesla V100.

