DENOISING MASKED AUTOENCODERS HELP ROBUST CLASSIFICATION

Abstract

In this paper, we propose a new self-supervised method, called Denoising Masked AutoEncoders (DMAE), for learning certified robust classifiers of images. In DMAE, we corrupt each image by adding Gaussian noise to each pixel value and randomly masking several patches. A Transformer-based encoder-decoder model is then trained to reconstruct the original image from the corrupted one. In this learning paradigm, the encoder learns to capture semantics relevant to downstream tasks while being robust to Gaussian additive noise. We show that the pre-trained encoder can naturally be used as the base classifier in Gaussian smoothed models, where we can analytically compute the certified radius for any data point. Although the proposed method is simple, it yields significant performance improvements in downstream classification tasks. We show that the DMAE ViT-Base model, which uses only 1/10 of the parameters of the model developed in recent work (Carlini et al., 2022), achieves competitive or better certified accuracy in various settings. The DMAE ViT-Large model significantly surpasses all previous results, establishing a new state of the art on the ImageNet dataset. We further demonstrate that the pre-trained model transfers well to the CIFAR-10 dataset, suggesting its wide adaptability. Models and code are available at https://github.com/quanlin-wu/dmae.

1. INTRODUCTION

Deep neural networks have demonstrated remarkable performance in many real applications (He et al., 2016; Devlin et al., 2019; Silver et al., 2016). However, at the same time, several works observed that the learned models are vulnerable to adversarial attacks (Szegedy et al., 2013; Biggio et al., 2013). Taking image classification as an example, given an image x that is correctly classified to label y by a neural network, an adversary can find a small perturbation such that the perturbed image, though visually indistinguishable from the original one, is predicted into a wrong class with high confidence by the model. Such a problem raises significant challenges in practical scenarios. Given such a critical issue, researchers seek to learn classifiers that can provably resist adversarial attacks, which is usually referred to as certified defense. One of the seminal approaches in this direction is the Gaussian smoothed model. A Gaussian smoothed model g is defined as g(x) = E_η f(x + η), in which η ∼ N(0, σ²I) and f is an arbitrary classifier, e.g., a neural network. Intuitively, the smoothed classifier g can be viewed as an ensemble of the predictions of f on noise-corrupted images x + η. Cohen et al. (2019) derived how to analytically compute the certified radius of the smoothed classifier g, and follow-up works improved the training methods of the Gaussian smoothed model with labeled data (Salman et al., 2019; Zhai et al., 2021; Jeong & Shin, 2020; Horváth et al., 2022; Jeong et al., 2021). Inspired by the Masked AutoEncoder (MAE) (He et al., 2022), which learns latent representations by reconstructing missing pixels from masked images, we design a new self-supervised task called the Denoising Masked AutoEncoder (DMAE). Given an unlabeled image, we corrupt the image by adding Gaussian noise to each pixel value and randomly masking several patches.
The goal of the task is to train a model to reconstruct the clean image from the corrupted one. Similar to MAE, DMAE also intends to reconstruct the masked information; hence, it can capture relevant features of the image for downstream tasks. Furthermore, DMAE takes noisy patches as inputs and outputs denoised ones, making the learned features robust with respect to additive noise. We expect that the semantics and robustness of the representation can be learned simultaneously, enabling efficient utilization of the model parameters. Although the proposed DMAE method is simple, it yields significant performance improvements on downstream tasks. We pre-train DMAE ViT-Base and DMAE ViT-Large, use the encoder to initialize the Gaussian smoothed classifier, and fine-tune the parameters on ImageNet. We show that the DMAE ViT-Base model with 87M parameters, one-tenth as many as the model used in Carlini et al. (2022), achieves competitive or better certified accuracy in various settings. Furthermore, the DMAE ViT-Large model (304M) significantly surpasses the state-of-the-art results in all tasks, demonstrating that a single-stage model is enough to learn robust representations with a proper self-supervised task. We also demonstrate that the pre-trained model has good transferability to other datasets: we empirically show that a decent improvement can be obtained when applying it to the CIFAR-10 dataset. Model checkpoints will be released in the future.

2. RELATED WORK

Szegedy et al. (2013); Biggio et al. (2013) observed that standardly trained neural networks are vulnerable to adversarial attacks. Since then, many works have investigated how to improve the robustness of the trained model. One of the most successful methods is adversarial training, which adds adversarial examples to the training set to make the learned model robust against such attacks (Madry et al., 2018; Zhang et al., 2019).
However, as the generation process of adversarial examples is pre-defined during training, the learned models may be defeated by stronger attacks (Athalye et al., 2018) . Therefore, it is important to develop methods that can learn models with certified robustness guarantees. Previous works provide certified guarantees by bounding the certified radius layer by layer using convex relaxation methods (Wong & Kolter, 2018; Wong et al., 2018; Weng et al., 2018; Balunovic & Vechev, 2020; Zhang et al., 2021; 2022a; b) . However, such algorithms are usually computationally expensive, provide loose bounds, or have scaling issues in deep and large models. Randomized smoothing. Randomized smoothing is a scalable approach to obtaining certified robustness guarantees for any neural network. The key idea of randomized smoothing is to add Gaussian noise in the input and to transform any model into a Gaussian smoothed classifier. As the Lipschitz constant of the smoothed classifier is bounded with respect to the ℓ 2 norm, we can analytically compute a certified guarantee on small ℓ 2 perturbations (Cohen et al., 2019) . Follow-up works proposed different training strategies to maximize the certified radius, including ensemble approaches (Horváth et al., 2022) , model calibrations (Jeong et al., 2021) , adversarial training for smoothed models (Salman et al., 2019) and refined training objectives (Jeong & Shin, 2020; Zhai et al., 2021) . Yang et al. (2020) ; Blum et al. (2020) ; Kumar et al. (2020) extended the method to general ℓ p perturbations by using different shapes of noises. Self-supervised pre-training in vision. Learning the representation of images from unlabeled data is an increasingly popular direction in computer vision. Mainstream approaches can be roughly categorized into two classes. One class is the contrastive learning approach which maximizes agreement between differently augmented views of an image via a contrastive loss (Chen et al., 2020; He et al., 2020) . 
The other class is the generative learning approach, which randomly masks patches in an image and learns to generate the original one (Bao et al., 2021; He et al., 2022) . Several works utilized self-supervised pre-training to improve image denoising (Joshua Batson, 2019; Yaochen Xie, 2020), and recently there have been attempts to use pre-trained denoisers to achieve certified robustness. The most relevant works are Salman et al. (2020); Carlini et al. (2022) . Both works first leverage a pre-trained denoiser to purify the input, and then use a standard classifier to make predictions. We discuss these two works and ours in depth in Sec. 3.

3.1. NOTATIONS AND BASICS

Denote x ∈ ℝ^d as the input and y ∈ Y = {1, ..., C} as the corresponding label. Denote g : ℝ^d → Y as a classifier mapping x to y. For any x, assume that an adversary can perturb x by adding an adversarial noise. The goal of defense methods is to guarantee that the prediction g(x) does not change when the perturbation is small. Randomized smoothing (Li et al., 2018; Cohen et al., 2019) is a technique that provides provable defenses by constructing a smoothed classifier g of the form g(x) = argmax_{c∈Y} P_η[f(x + η) = c], where η ∼ N(0, σ²I_d). The function f is called the base classifier, which is usually parameterized by a neural network, and η is Gaussian noise with noise level σ. Intuitively, g(x) can be considered as an ensemble classifier that returns the majority vote of f when its input is sampled from the Gaussian distribution N(x, σ²I_d) centered at x. Cohen et al. (2019) theoretically provided the following certified robustness guarantee for the Gaussian smoothed classifier g.

Theorem 1 (Cohen et al., 2019). Given f and g defined as above, assume that g classifies x correctly, i.e., P_η[f(x + η) = y] ≥ max_{y′≠y} P_η[f(x + η) = y′]. Then for any x′ satisfying ∥x′ − x∥₂ ≤ R, we always have g(x′) = g(x), where

R = (σ/2) [Φ⁻¹(P_η[f(x + η) = y]) − Φ⁻¹(max_{y′≠y} P_η[f(x + η) = y′])],

and Φ is the cumulative distribution function of the standard Gaussian distribution.

The denoise-then-predict network structure. Even without knowing the label, one can still evaluate the robustness of a model by checking whether it gives consistent predictions when the input is perturbed. Therefore, unlabeled data can naturally be used to improve a model's robustness (Alayrac et al., 2019; Carmon et al., 2019; Najafi et al., 2019; Zhai et al., 2019). Recently, Salman et al. (2020) and Carlini et al. (2022) took first steps toward training Gaussian smoothed classifiers with the help of unlabeled data. Both of them use a denoise-then-predict pipeline.
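To make Theorem 1 concrete, the certified radius and the Monte Carlo majority vote behind g can be sketched in a few lines of NumPy/SciPy. This is an illustrative sketch, not the paper's code; the function names and the toy base classifier are ours:

```python
import numpy as np
from scipy.stats import norm

def certified_radius(p_top: float, p_runner_up: float, sigma: float) -> float:
    """l2 certified radius from Theorem 1:
    R = (sigma / 2) * (Phi^-1(p_top) - Phi^-1(p_runner_up))."""
    if p_top <= p_runner_up:
        return 0.0  # g does not certifiably classify x correctly
    return (sigma / 2.0) * (norm.ppf(p_top) - norm.ppf(p_runner_up))

def smoothed_predict(f, x, sigma, n=1000, num_classes=10, seed=0):
    """Monte Carlo majority vote: g(x) = argmax_c P_eta[f(x + eta) = c]."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        eta = rng.normal(0.0, sigma, size=np.shape(x))
        counts[f(x + eta)] += 1
    return int(np.argmax(counts))
```

In practice, the two probabilities in `certified_radius` are not known exactly and must themselves be lower/upper bounded from samples, as described in the evaluation protocol.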
In detail, the base classifier f consists of three components: θ_denoiser, θ_encoder, and θ_output. Given any input x, the classification process of f is defined as

x̂ = Denoise(x + η; θ_denoiser),   (3)
h = Encode(x̂; θ_encoder),   (4)
ŷ = Predict(h; θ_output).   (5)

As f takes a noisy image as input (see Eq. 1), a denoiser with parameters θ_denoiser is first used to purify x + η into a cleaned image x̂. After that, x̂ is further encoded into a contextual representation h by θ_encoder, and the prediction is obtained from the output head θ_output. Note that θ_denoiser and θ_encoder can be pre-trained by self-supervised approaches. For example, one can use a denoising autoencoder (Vincent et al., 2008; 2010) or a denoising diffusion model (Ho et al., 2020; Nichol & Dhariwal, 2021) to pre-train θ_denoiser, and leverage contrastive learning (Chen et al., 2020; He et al., 2020) or masked image modelling (He et al., 2022; Xie et al., 2022) to pre-train θ_encoder. In particular, Carlini et al. (2022) achieved state-of-the-art performance on ImageNet by applying a pre-trained denoising diffusion model as the denoiser and a pre-trained BEiT (Bao et al., 2021) as the encoder.
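The three-stage pipeline above composes as follows. This is a schematic with identity and linear stand-ins for the real networks: the actual θ_denoiser and θ_encoder are large pre-trained models, and all names here are placeholders of ours:

```python
import numpy as np

def denoise(x_noisy):
    # Stand-in purifier: in Carlini et al. (2022), theta_denoiser is a
    # pre-trained denoising diffusion model.
    return x_noisy

def encode(x):
    # Stand-in encoder: in practice a pre-trained ViT/BEiT producing a
    # contextual representation h.
    return np.asarray(x).reshape(-1)

def predict(h, W):
    # Linear output head followed by argmax over class scores.
    return int(np.argmax(W @ h))

def base_classifier_f(x_noisy, W):
    """Denoise-then-predict: f(x + eta) = Predict(Encode(Denoise(x + eta)))."""
    return predict(encode(denoise(x_noisy)), W)
```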

3.2. DENOISING MASKED AUTOENCODERS

In the denoise-then-predict network structure above, if the denoiser is perfect, h will be robust to the Gaussian additive noise η. Then the robust accuracy of g can be as high as the standard accuracy of models trained on clean images. However, the denoiser requires a huge number of parameters to obtain acceptable results (Nichol & Dhariwal, 2021), limiting the practical usage of the compositional method in real applications. Note that our goal is to learn a representation h that captures rich semantics for classification and resists Gaussian additive noise. Using an explicit purification step before encoding is sufficient to achieve this, but it may not be necessary. Instead of using multiple training stages for different purposes, we adopt a single-stage approach that learns a robust h directly through self-supervised learning. In particular, we extend the standard masked autoencoder with an additional denoising task, which we call the Denoising Masked AutoEncoder (DMAE). DMAE works as follows: an image x is first divided into regular non-overlapping patches. Denote by Mask(x) the operation that randomly masks patches with a pre-defined masking ratio. As shown in Fig. 1, we train an autoencoder that takes Mask(x + η) as input and reconstructs the original image:

x → x + η → Mask(x + η) --Encoder--> h --Decoder--> x̂.

Like MAE (He et al., 2022), we adopt an asymmetric encoder-decoder design for DMAE. Both the encoder and the decoder use stacked Transformer layers. The encoder takes noisy unmasked patches with positional encodings as inputs and generates the representation h. The decoder then takes the representations of all patches as inputs (h for unmasked patches and a mask token embedding for masked patches) and reconstructs the original image. Pixel-level mean squared error is used as the loss function. Slightly different from MAE, the loss is calculated on all patches, as the model can also learn purification at the unmasked positions.
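The corruption pipeline x → x + η → Mask(x + η) can be illustrated with a minimal NumPy sketch. Patch size, masking ratio, and noise level follow the paper's defaults; the helper names are ours:

```python
import numpy as np

def dmae_corrupt(x, sigma=0.25, patch=16, mask_ratio=0.75, seed=0):
    """Build the DMAE input Mask(x + eta): add i.i.d. Gaussian noise to every
    pixel, then mask a random subset of non-overlapping patches.
    Returns the noisy image, the kept (visible) patch indices, and a
    boolean mask over patches (True = masked)."""
    rng = np.random.default_rng(seed)
    h, w = x.shape[0], x.shape[1]
    noisy = x + rng.normal(0.0, sigma, size=x.shape)     # x + eta
    n_patches = (h // patch) * (w // patch)
    n_keep = int(round(n_patches * (1.0 - mask_ratio)))
    keep = rng.permutation(n_patches)[:n_keep]           # visible patch ids
    mask = np.ones(n_patches, dtype=bool)
    mask[keep] = False
    return noisy, keep, mask

def dmae_loss(pred, target):
    """Pixel-level mean squared error over ALL patches (MAE computes it on
    masked patches only), so visible patches are also denoised."""
    return float(np.mean((pred - target) ** 2))
```

For a 224 x 224 image with 16 x 16 patches, this keeps 49 of 196 patches visible at the default 0.75 masking ratio.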
During pre-training, the encoder and decoder are jointly optimized from scratch, and the decoder is removed when learning downstream tasks. To reconstruct the original image, the encoder and the decoder must learn semantics from the unmasked patches and remove the noise simultaneously. To force the encoder (rather than the decoder) to learn robust semantic features, we limit the capacity of the decoder by using a smaller hidden dimension and depth, following He et al. (2022).

Robust fine-tuning for downstream classification tasks. As the encoder of DMAE already learns robust features, we can simplify the classification process of the base classifier to

h = Encode(x + η; θ_encoder),   (6)
ŷ = Predict(h; θ_output).

To avoid any confusion, we explicitly parameterize the base classifier as f(x; θ_encoder, θ_output) = Predict(Encode(x; θ_encoder); θ_output), and denote by F(x; θ_encoder, θ_output) the output of the last softmax layer of f, i.e., the probability distribution over classes. We aim to maximize the certified accuracy of the corresponding smoothed classifier g by optimizing θ_encoder and θ_output, where θ_encoder is initialized with the pre-trained DMAE model. To achieve the best performance, we use the consistency regularization training method developed in Jeong & Shin (2020) to learn θ_encoder and θ_output. The loss is defined as

L(x, y; θ_encoder, θ_output) = E_η[CrossEntropy(F(x + η; θ_encoder, θ_output), y)]
  + λ · E_η[D_KL(F̄(x; θ_encoder, θ_output) ∥ F(x + η; θ_encoder, θ_output))]
  + µ · H(F̄(x; θ_encoder, θ_output)),   (7)

where F̄(x; θ_encoder, θ_output) := E_{η∼N(0,σ²I_d)}[F(x + η; θ_encoder, θ_output)].
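The fine-tuning objective can be estimated by Monte Carlo sampling over a few noise draws, e.g. as in the following sketch of the loss computation (our own naming and simplifications, not the authors' implementation):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_fn, x, y, sigma, lam=2.0, mu=0.5, m=4, seed=0):
    """Monte Carlo estimate of the fine-tuning objective:
    E[CE(F(x+eta), y)] + lam * E[KL(F_bar || F(x+eta))] + mu * H(F_bar),
    where F_bar averages the prediction distribution over the m noise draws.
    `logits_fn` maps an input to class logits."""
    rng = np.random.default_rng(seed)
    probs = np.stack([softmax(logits_fn(x + rng.normal(0.0, sigma, np.shape(x))))
                      for _ in range(m)])                    # shape (m, C)
    f_bar = probs.mean(axis=0)                               # average prediction
    eps = 1e-12
    ce = float(-np.mean(np.log(probs[:, y] + eps)))          # cross-entropy term
    kl = float(np.mean(np.sum(f_bar * (np.log(f_bar + eps)
                                       - np.log(probs + eps)), axis=-1)))
    ent = float(-np.sum(f_bar * np.log(f_bar + eps)))        # entropy term
    return ce + lam * kl + mu * ent
```

Note that the KL term vanishes exactly when the classifier's output is constant across noise draws, which is the behavior the regularizer encourages.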

4. EXPERIMENTS

In this section, we empirically evaluate our proposed DMAE on the ImageNet and CIFAR-10 datasets. We also study the influence of different hyperparameters and training strategies on the final model performance. All experiments are repeated ten times with different seeds. Average performance is reported, and details can be found in the appendix.

4.1. PRE-TRAINING SETUP

For the pre-training of the two DMAE models, we set the masking ratio to 0.75 following He et al. (2022). The noise level σ is set to 0.25. Random resizing and cropping are used as data augmentation to avoid overfitting. The ViT-B and ViT-L models are pre-trained for 1100 and 1600 epochs respectively, with the batch size set to 4096. We use the AdamW optimizer with (β₁, β₂) = (0.9, 0.95) and a learning rate of 1.5 × 10⁻⁴. The weight decay factor is set to 0.05. After pre-training, we fine-tune the encoder on downstream classification tasks.

4.2. FINE-TUNING FOR IMAGENET CLASSIFICATION

Setup. In the fine-tuning stage, we add a linear prediction head on top of the encoder for classification. The ViT-B model is fine-tuned for 100 epochs, while the ViT-L is fine-tuned for 50 epochs. Both settings use AdamW with (β₁, β₂) = (0.9, 0.999). The weight decay factor is set to 0.05. We set the base learning rate to 5 × 10⁻⁴ for ViT-B and 1 × 10⁻³ for ViT-L. Following Bao et al. (2021), we use layer-wise learning rate decay (Kevin Clark & Manning, 2020) with an exponential rate of 0.65 for ViT-B and 0.75 for ViT-L. We apply standard augmentation for training ViT models, including RandAug (Ekin D Cubuk & Le, 2020), label smoothing (Szegedy et al., 2016), mixup (Hongyi Zhang & Lopez-Paz, 2018), and cutmix (Yun et al., 2019). Following most previous works, we conduct experiments with different noise levels σ ∈ {0.25, 0.5, 1.0}. For the consistency regularization loss terms, we set the hyperparameters λ = 2.0 and µ = 0.5 for σ ∈ {0.25, 0.5}, and λ = 2.0 and µ = 0.1 for σ = 1.0.

Evaluation. Following previous works, we report the percentage of samples that can be certified to be robust (a.k.a. certified accuracy) at radius r for pre-defined values. For a fair comparison, we use the official implementation of CERTIFY to calculate the certified radius for any data point, with n = 10,000, n₀ = 100, and α = 0.001. The result is averaged over 1,000 images uniformly selected from the ImageNet validation set, following Carlini et al. (2022).

Results. We list the detailed results of our model and representative baseline methods in Table 2. We also provide a summarized result that contains the best performance of different methods at each radius r in Table 1. It can be seen from Table 2 that our DMAE ViT-B model significantly surpasses all baselines in all settings except Carlini et al. (2022). This clearly demonstrates the strength of self-supervised learning. Compared with Carlini et al.
(2022), our model achieves better results when r ≥ 1.0 and is slightly worse when r is small. We would like to point out that the DMAE ViT-B model uses only 10% of the parameters of the model in Carlini et al. (2022), which suggests our single-stage pre-training method is more parameter-efficient than the denoise-then-predict approach. Although the diffusion model used in Carlini et al. (2022) can be applied with different noise levels, its huge number of parameters and long inference time make it more difficult to deploy. Our DMAE ViT-L model achieves the best performance over all prior works in all settings and boosts the certified accuracy by a significant margin when σ and r are large. For example, at r = 1.5 it achieves 53.7% accuracy, which is 15.3 points better than Boosting (Horváth et al., 2022), and it surpasses Diffusion (Carlini et al., 2022) by 12.0 points at r = 2.0. This observation differs from the one reported in Carlini et al. (2022), where the authors found that the diffusion model coupled with an off-the-shelf BEiT only yields better performance with smaller σ and r.
Certified accuracy (%) at ℓ2 radius r, grouped by noise level σ; the columns are r = 0.0 / 0.5 / 1.0 / 1.5 / 2.0 / 3.0.

σ = 0.25:
- RS (Cohen et al., 2019): 67.0 / 49.0 / 0 / 0 / 0 / 0
- SmoothAdv (Salman et al., 2019): 63.0 / 54.0 / 0 / 0 / 0 / 0
- Consistency (Jeong & Shin, 2020): –
- MACER (Zhai et al., 2021): 68.0 / 57.0 / 0 / 0 / 0 / 0
- Boosting (Horváth et al., 2022): 65.6 / 57.0 / 0 / 0 / 0 / 0
- SmoothMix (Jeong et al., 2021): –
- Diffusion+BEiT (Carlini et al., 2022): –

σ = 0.5:
- Consistency (Jeong & Shin, 2020): 55.0 / 50.0 / 44.0 / 34.0 / 0 / 0
- MACER (Zhai et al., 2021): 64.0 / 53.0 / 43.0 / 31.0 / 0 / 0
- Boosting (Horváth et al., 2022): 57.0 / 52.0 / 44.6 / 38.4 / 0 / 0
- SmoothMix (Jeong et al., 2021): 55.0 / 50.0 / 43.0 / 38.0 / 0 / 0
- Diffusion+BEiT (Carlini et al., 2022): –

σ = 1.0:
- Consistency (Jeong & Shin, 2020): 41.0 / 37.0 / 32.0 / 28.0 / 24.0 / 17.0
- MACER (Zhai et al., 2021): 48.0 / 43.0 / 36.0 / 30.0 / 25.0 / 14.0
- Boosting (Horváth et al., 2022): 44.6 / 40.2 / 37.2 / 34.0 / 28.6 / 20.2
- SmoothMix (Jeong et al., 2021): 40.0 / 37.0 / 34.0 / 30.0 / 26.0 / 20.0
- Diffusion+BEiT (Carlini et al., 2022): –

** denotes the best result, and * denotes the second best at each radius r.

4.3. FINE-TUNING FOR CIFAR-10 CLASSIFICATION

Setup. We show that the DMAE models benefit not only ImageNet but also CIFAR-10 classification, suggesting the good transferability of our pre-trained models. We use the DMAE ViT-B checkpoint as a showcase. As the image sizes of ImageNet and CIFAR-10 differ, we pre-process the CIFAR-10 images to 224 × 224 to match the pre-trained model. Note that the data distributions of ImageNet and CIFAR-10 are far apart. To address this significant distributional shift, we continue pre-training the DMAE model on the CIFAR-10 dataset. We set the continued pre-training stage to 50 epochs, the base learning rate to 5 × 10⁻⁵, and the batch size to 512. Most of the fine-tuning details are the same as on ImageNet in Sec. 4.2, except that we use a smaller batch size of 256, apply only random horizontal flipping as data augmentation, and reduce the number of fine-tuning epochs to 50.

Result. The evaluation protocol is the same as on ImageNet in Sec. 4.2. We draw n = 100,000 noise samples and report results averaged over the entire CIFAR-10 test set. The results are presented in Table 3. Without continued pre-training, our DMAE ViT-B model still yields performance comparable to Carlini et al. (2022), and it outperforms that work when continued pre-training is applied. It is worth noting that Carlini et al. (2022) uses more parameters, and its diffusion model is trained on CIFAR data; in comparison, our model uses fewer parameters, and the pre-trained checkpoint is directly borrowed from Sec. 4.1. Our model also performs significantly better than the original consistency regularization method (Jeong & Shin, 2020), demonstrating the transferability of the pre-trained model: specifically, our method outperforms it by 12.0 points at r = 0.25 and by 9.0 points at r = 0.5. We believe our pre-trained checkpoint can also improve other baseline methods.
We compare our DMAE ViT-B model with the MAE ViT-B checkpoint released by He et al. (2022) in the linear probing setting on ImageNet. Linear probing is a popular scheme for comparing the representations learned by different models: we freeze the parameters of the pre-trained encoders and use a linear layer with batch normalization to make predictions. For both DMAE and MAE, we train the linear layer for 90 epochs with a base learning rate of 0.1. The weight decay factor is set to 0.0. As overfitting seldom occurs in linear probing, we only apply random resizing and cropping as data augmentation and use a large batch size of 16,384. As shown in Table 4, our DMAE outperforms MAE by a large margin in linear probing. For example, with Gaussian noise magnitude σ = 0.25, DMAE achieves 45.3% certified accuracy at r = 0.5, 32.0 points higher than MAE. Note that even though our models were pre-trained with a small magnitude of Gaussian noise (σ = 0.25), they still yield much better results than MAE under large Gaussian noise (σ = 0.5, 1.0). This clearly indicates that our method learns much more robust features compared with MAE.

Effects of the pre-training steps. To investigate whether longer pre-training also helps in our setting, we conduct experiments to study the downstream performance of model checkpoints at different pre-training steps. In particular, we compare the DMAE ViT-B model (1100 epochs) trained in Sec. 4.1 with an early checkpoint (700 epochs). Both models are fine-tuned under the same configuration. All results on ImageNet are presented in Table 5, which shows that the 1100-epoch model consistently outperforms its 700-epoch counterpart in almost all settings.

Other fine-tuning methods. In the main experiment, we use Consistency Regularization (CR) in the fine-tuning stage, and one may wonder how much the pre-trained model can improve with other methods. To study this, we fine-tune our pre-trained DMAE ViT-L model with the RS algorithm (Cohen et al., 2019), where the only loss used in training is the standard cross-entropy classification loss in Eq. 7.
For this experiment, we use the same configuration as in Sec. 4.2. The results are provided in Table 6. First, we can see that the regularization loss consistently leads to better certified accuracy; in particular, it yields up to a 3-5% improvement at larger ℓ2 radii (r ≥ 1.0). Second, the RS model fine-tuned on DMAE ViT-L significantly surpasses many baselines on ImageNet. This suggests that our pre-trained DMAE ViT-L model may be combined with other training methods in the literature to improve their performance.

5. CONCLUSION

This paper proposes a new self-supervised method, Denoising Masked AutoEncoders (DMAE), for learning certified robust image classifiers. DMAE corrupts each image by adding Gaussian noise to each pixel value and randomly masking several patches. A vision Transformer is then trained to reconstruct the original image from the corrupted one. The pre-trained encoder of DMAE can naturally be used as the base classifier in Gaussian smoothed models to achieve certified robustness. Extensive experiments show that the pre-trained model is parameter-efficient, achieves state-of-the-art performance, and transfers well. We believe the pre-trained model has great potential in many aspects; in the future, we plan to apply it to more tasks, including image segmentation and detection, and to investigate the interpretability of the models.

A APPENDIX

We present the full settings of pre-training and fine-tuning in Table 7 and Table 8, respectively.

B EVALUATION PROTOCOL

In this section, we describe the details of how to estimate the approximate certified radius and certified accuracy. Recall from Theorem 1 that the certified radius R for a data point x can be expressed as

R = (σ/2) [Φ⁻¹(P_η[f(x + η) = y]) − Φ⁻¹(max_{y′≠y} P_η[f(x + η) = y′])].

For computational convenience, one can obtain a slightly looser but simpler bound on the certified radius by using the upper bound max_{y′≠y} P_η[f(x + η) = y′] ≤ 1 − P_η[f(x + η) = y].



https://github.com/locuslab/smoothing

One may notice that randomized smoothing methods require a significant number of samples (e.g., 10⁵) for evaluation. Here, the samples are used to calculate the certified radius accurately; a much smaller number of samples is enough to make predictions in practice.



Figure 1: Illustration of our DMAE pre-training. We first corrupt the image by adding Gaussian noise to each pixel value, and then randomly masking several patches. The encoder and decoder are trained to reconstruct the clean image from the corrupted one.

Here, F̄(x; θ_encoder, θ_output) denotes the average prediction distribution of the base classifier, and λ, µ > 0 are hyperparameters. D_KL(·∥·) and H(·) denote the Kullback-Leibler (KL) divergence and the entropy, respectively. The loss function contains three terms: intuitively, the first term maximizes the accuracy of the base classifier on perturbed inputs; the second regularizes F(x + η; θ_encoder, θ_output) to be consistent across different draws of η; the last prevents the prediction from having low confidence. All expectations are estimated by Monte Carlo sampling.

PRE-TRAINING SETUP. Following He et al. (2022) and Xie et al. (2022), we use ImageNet-1k as the pre-training corpus, which contains 1.28 million images. All images are resized to a fixed resolution of 224 × 224. We utilize two vision Transformer variants as the encoder, the Base model (ViT-B) and the Large model (ViT-L), with 16 × 16 input patch size (Kolesnikov et al., 2021). The ViT-B encoder consists of 12 Transformer blocks with embedding dimension 768, while the ViT-L encoder consists of 24 blocks with embedding dimension 1024. For both settings, the decoder uses 8 Transformer blocks with embedding dimension 512 and a linear projection whose number of output channels equals the number of pixel values in a patch. All the Transformer blocks have 16 attention heads. The ViT-B and ViT-L encoders have roughly 87M and 304M parameters, respectively.
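For concreteness, the token and reconstruction geometry implied by this configuration works out as follows (a quick arithmetic check, not code from the paper):

```python
# Geometry of the DMAE encoder/decoder described above.
img_size, patch_size, channels = 224, 16, 3

num_patches = (img_size // patch_size) ** 2    # 14 x 14 = 196 tokens per image
pixels_per_patch = patch_size ** 2 * channels  # 16 * 16 * 3 = 768 output
                                               # channels of the decoder's
                                               # final linear projection
```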

Figure 2: Visualization. For each group, the leftmost column shows the original image. The following two correspond to the image with Gaussian perturbation (noise level σ = 0.25) and the masked noisy image. The last column illustrates the reconstructed image by our DMAE ViT-L model.

λ = 2, µ = 0.5 (σ ∈ {0.25, 0.5}); λ = 2, µ = 0.1 (σ = 1.0)

Since max_{y′≠y} P_η[f(x + η) = y′] ≤ 1 − P_η[f(x + η) = y], this gives

R ≥ (σ/2) [Φ⁻¹(P_η[f(x + η) = y]) − Φ⁻¹(1 − P_η[f(x + η) = y])] = σ · Φ⁻¹(P_η[f(x + η) = y]) =: R̄.

Table 1: Certified accuracy (top-1) of different models on ImageNet. Following Carlini et al. (2022), for each noise level σ we select the best certified accuracy from the original papers. ** denotes the best result, and * denotes the second best at each ℓ2 radius. † Carlini et al. (2022) uses a diffusion model with 552M parameters and a BEiT-Large model with 305M parameters. It can be seen that our DMAE ViT-B/ViT-L models achieve the best performance in most of the settings.

Table 2: Certified accuracy (top-1) of different models on ImageNet with different noise levels.

Table 3: Certified accuracy (top-1) of different models on CIFAR-10. Each entry lists the certified accuracy at the best Gaussian noise level σ from the original papers. ** denotes the best result and * denotes the second best at each ℓ2 radius. † Carlini et al. (2022) uses a 50M-parameter diffusion model and an 87M-parameter ViT-B model.

Table 4: DMAE vs. MAE by linear probing on ImageNet. Our proposed DMAE is significantly better than MAE on the ImageNet classification task, indicating that the proposed pre-training method is effective and learns more robust features. Numbers in parentheses give the gap between the two methods in the same setting.

Whether DMAE learns more robust features than MAE. Compared with MAE, we additionally use a denoising objective in pre-training to learn robust features. Therefore, we examine the quality of the representations learned by DMAE and MAE to investigate whether the proposed objective helps. For a fair comparison, we compare our DMAE ViT-B model with the MAE ViT-B checkpoint released by He et al. (2022).

Effects of the pre-training steps. Many previous works observe that longer pre-training usually helps the model perform better on downstream tasks. We investigate whether this phenomenon also holds in our setting.

Table 5: Effects of the pre-training steps. The 1100-epoch model consistently outperforms the 700-epoch model in almost all settings, demonstrating that longer pre-training leads to better downstream task performance. Numbers in parentheses give the gap between the two methods in the same setting.

Table 6: DMAE with different fine-tuning methods. Our pre-trained model is compatible with different fine-tuning methods. Numbers in parentheses give the gap between the two methods in the same setting.

Table 7: Robust pre-training setting.

Table 8: Fine-tuning setting.

ACKNOWLEDGEMENT

This work is partially supported by the National Key R&D Program of China (2022ZD0160304). The work is supported by National Science Foundation of China (NSFC62276005), The Major Key Project of PCL (PCL2021A12), Exploratory Research Project of Zhejiang Lab (No. 2022RC0AN02), and Project 2020BD006 supported by PKUBaidu Fund. We thank all the anonymous reviewers for the very careful and detailed reviews as well as the valuable suggestions. Their help has further enhanced our work.


We use the official implementation of CERTIFY to calculate the certified radius for any data point. For one data point x, the smoothed classifier samples n₀ = 100 points from the Gaussian distribution N(x, σ²I_d) and takes a majority vote for the predicted class, while we draw n = 10,000 samples (for ImageNet) to estimate the lower bound of P_η[f(x + η) = y] and certify the robustness. Following previous works, we report the percentage of samples that can be certified to be robust (a.k.a. certified accuracy) at radius r for pre-defined values.
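The procedure can be sketched as follows, using a one-sided Clopper-Pearson lower confidence bound on the top-class probability and the simpler radius R̄ = σ · Φ⁻¹(p_lower). This is an illustrative re-implementation under our own naming, not the official code:

```python
import numpy as np
from scipy.stats import beta, norm

def lower_conf_bound(k: int, n: int, alpha: float) -> float:
    """One-sided (1 - alpha) Clopper-Pearson lower bound on p from k hits in n."""
    return 0.0 if k == 0 else float(beta.ppf(alpha, k, n - k + 1))

def certify(f, x, sigma, n0=100, n=10_000, alpha=0.001, seed=0):
    """Sketch of CERTIFY (Cohen et al., 2019): guess the top class with n0
    draws, lower-bound its probability with n fresh draws, and convert the
    bound to a radius sigma * Phi^-1(p_lower); abstain if p_lower <= 1/2."""
    rng = np.random.default_rng(seed)

    def sample_counts(num):
        counts = {}
        for _ in range(num):
            pred = f(x + rng.normal(0.0, sigma, size=np.shape(x)))
            counts[pred] = counts.get(pred, 0) + 1
        return counts

    guess = max(sample_counts(n0).items(), key=lambda kv: kv[1])[0]
    k = sample_counts(n).get(guess, 0)
    p_lower = lower_conf_bound(k, n, alpha)
    if p_lower <= 0.5:
        return None, 0.0                       # abstain
    return guess, sigma * float(norm.ppf(p_lower))
```

A certified-accuracy curve at radius r then counts the fraction of test points whose returned class is correct and whose radius exceeds r.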

C SUPPLEMENTARY EXPERIMENTS

In this section, we report the results of several supplementary experiments.

More comparison with MAE. For a fair comparison, we compare our DMAE ViT-B with MAE ViT-B in the fine-tuning setting on ImageNet, in addition to the linear probing. The checkpoints are fine-tuned with the RS and CR methods described in Sec. 4.4. In Table 9, we can see that DMAE outperforms MAE at all radii, which indicates the effectiveness of the denoising pre-training task.

Fine-tuning with various levels of noise. In the previous experiments, all models are fine-tuned and evaluated under a specific level of Gaussian noise. One may wonder whether a single model can perform well under various levels of noise. To investigate this, we conducted a small experiment on the CIFAR-10 dataset (drawing fewer noise samples, n = 10,000, and reporting results averaged over 1,000 images) and report the results in Table 10. Specifically, we sample the noise scale σ from a uniform distribution over an interval (σ ∈ [0, 0.75]) during fine-tuning, resulting in a classifier that is robust under different magnitudes of adversarial perturbation, and we compute the certified radius with this single model. The evaluation shows that it even outperforms the original model trained with a fixed noise scale at r = 1.0, which suggests that one DMAE model can indeed be used across different settings without retraining.

