CAN CNNS BE MORE ROBUST THAN TRANSFORMERS?

Abstract

The recent success of Vision Transformers is challenging the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition. Specifically, in terms of robustness on out-of-distribution samples, recent research finds that Transformers are inherently more robust than CNNs, regardless of the training setup. Moreover, it is believed that such superiority of Transformers should largely be credited to their self-attention-like architectures per se. In this paper, we question that belief by closely examining the design of Transformers. Our findings lead to three highly effective architecture designs for boosting robustness, each simple enough to be implemented in several lines of code, namely a) patchifying input images, b) enlarging the kernel size, and c) reducing activation layers and normalization layers. Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers. We hope this work can help the community better understand the design of robust neural architectures.
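To make design a) concrete: patchifying an image into non-overlapping patches amounts to a strided reshape. The following is a minimal NumPy sketch (the function name and layout conventions are ours, not the paper's implementation, which would realize this as a strided convolution stem):

```python
import numpy as np

def patchify(img, patch_size):
    """Split an image of shape (H, W, C) into non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. the flattened-patch sequence that a ViT-style stem would embed.
    """
    H, W, C = img.shape
    assert H % patch_size == 0 and W % patch_size == 0
    p = patch_size
    # (H//p, p, W//p, p, C) -> (H//p, W//p, p, p, C) -> flatten patches
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)

x = np.zeros((224, 224, 3))
print(patchify(x, 16).shape)  # (196, 768)
```

A 224 x 224 RGB image with patch size 16 yields the familiar 14 x 14 = 196 tokens of dimension 768.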

1. INTRODUCTION

The success of deep learning in computer vision is largely driven by Convolutional Neural Networks (CNNs). Starting from the milestone work AlexNet (Krizhevsky et al., 2012), CNNs keep pushing the frontier of computer vision (Simonyan & Zisserman, 2015; He et al., 2016; Tan & Le, 2019). Interestingly, the recently emerged Vision Transformer (ViT) (Dosovitskiy et al., 2020) challenges the leading position of CNNs. ViT offers a completely different roadmap: by applying a pure self-attention-based architecture to sequences of image patches, ViTs attain competitive performance on a wide range of visual benchmarks compared to CNNs. Recent studies on out-of-distribution robustness (Bai et al., 2021; Zhang et al., 2022; Paul & Chen, 2022) further heat up the debate between CNNs and Transformers. Unlike on standard visual benchmarks, where both models are closely matched, Transformers are much more robust than CNNs when tested out of the box. Moreover, Bai et al. (2021) argue that, rather than benefiting from the advanced training recipe provided in (Touvron et al., 2021a), such strong out-of-distribution robustness comes inherently with the Transformer's self-attention-like architecture. For example, simply "upgrading" a pure CNN to a hybrid architecture (i.e., with both CNN blocks and Transformer blocks) can effectively improve out-of-distribution robustness (Bai et al., 2021). Though it is generally believed that the architecture difference is the key factor behind the robustness gap between Transformers and CNNs, existing works fail to answer which architectural elements of Transformers account for such stronger robustness. The most relevant analysis is provided in (Bai et al., 2021; Shao et al., 2021): both works point out that Transformer blocks, in which the self-attention operation is the pivotal unit, are critical for robustness.
Nonetheless, given that a) the Transformer block itself is already a compound design, and b) the Transformer also contains many other layers (e.g., the patch embedding layer), the relationship between robustness and the Transformer's architectural elements remains confounding. In this work, we take a closer look at the architecture design of Transformers. More importantly, we aim to explore, with the help of the architectural elements from Transformers, whether CNNs can be robust learners as well. Our diagnosis delivers three key messages for improving out-of-distribution robustness, from the perspective of neural architecture design. Firstly, patchifying images into non-overlapping patches can substantially contribute to out-of-distribution robustness; more interestingly, regarding the choice of patch size, we find the larger the better. Secondly, although small convolutional kernels are a popular design recipe, we observe that adopting a much larger convolutional kernel size (e.g., from 3 × 3 to 7 × 7, or even to 11 × 11) is necessary for securing model robustness on out-of-distribution samples. Lastly, inspired by the recent work (Liu et al., 2022), we note that aggressively reducing the number of normalization layers and activation functions is beneficial for out-of-distribution robustness; meanwhile, as a byproduct, the training speed can be accelerated by up to ∼23%, due to fewer normalization layers being used (Gitman & Ginsburg, 2017; Brock et al., 2021). Our experiments verify that all three of these architectural elements consistently and effectively improve out-of-distribution robustness on a set of CNN architectures. The largest improvement is reported by integrating all of them into CNNs' architecture design: as shown in Fig. 1, without applying any self-attention-like components, our enhanced ResNet (dubbed Robust-ResNet) is able to outperform a similar-scale Transformer, DeiT-S, by 2.4% on Stylized-ImageNet (16.2% vs. 18.6%), 0.5% on ImageNet-C (42.3% vs. 42.8%), 4.0% on ImageNet-R (41.9% vs. 45.9%), and 3.9% on ImageNet-Sketch (29.1% vs. 33.0%). We hope this work can help the community better understand the underlying principle of designing robust neural architectures.
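To illustrate the kernel-size knob above, the following is a minimal single-channel NumPy sketch of a "same"-padded convolution; the function name and the uniform averaging kernel are ours, purely for illustration, and a real model would of course use learned multi-channel (e.g., depthwise) kernels:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 2D convolution with zero 'same' padding.

    Enlarging `kernel` (e.g. 3x3 -> 7x7 -> 11x11) widens the receptive
    field of a single layer while keeping the spatial output size fixed.
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, pad)  # zero-pad so the output matches the input size
    H, W = x.shape
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * kernel)
    return out

x = np.random.rand(8, 8)
for k in (3, 7, 11):
    y = conv2d_same(x, np.ones((k, k)) / (k * k))
    assert y.shape == x.shape  # spatial size preserved for every kernel size
```

Each output pixel with a 7 × 7 kernel aggregates a 49-pixel neighborhood rather than the 9 pixels of a 3 × 3 kernel, which is the single-layer receptive-field enlargement the design exploits.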

2. RELATED WORKS

Vision Transformers. Transformers (Vaswani et al., 2017), which apply self-attention to enable global interactions between input elements, underpin the success of building foundation models in natural language processing (Devlin et al., 2019; Yang et al., 2019; Dai et al., 2019; Radford et al., 2018; 2019; Brown et al., 2020; Bommasani et al., 2021). Recently, Dosovitskiy et al. (2020) showed that Transformers attain competitive performance on the challenging ImageNet classification task. Later works keep pushing the potential of Transformers on a variety of visual tasks, in both supervised learning (Touvron et al., 2021a; Yuan et al., 2021; Liu et al., 2021; Wang et al., 2021; Zhai et al., 2021; Touvron et al., 2021b; Xue et al., 2021) and self-supervised learning (Caron et al., 2021; Chen et al., 2021; Bao et al., 2022; Zhou et al., 2022; Xie et al., 2021; He et al., 2022), showing a seemingly inevitable trend of replacing CNNs in computer vision.

CNNs striking back. There is a recent surge in works that aim at retaking the position of CNNs as the favored architecture. Our work is closely related to ConvNeXt (Liu et al., 2022), while shifting the study focus from standard accuracy to robustness. Moreover, rather than specifically offering a unique neural architecture as in (Liu et al., 2022), this paper aims to provide a set of useful architectural elements that allow CNNs to match, or even outperform, Transformers when measuring robustness.

Out-of-distribution robustness. Dealing with data from shifted distributions is a commonly encountered problem when deploying models in the real world. To simulate such challenges, several out-of-distribution benchmarks have been established, including measuring model performance
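The ConvNeXt-style reduction of normalization and activation layers mentioned above can be made concrete with a purely illustrative layer tally; the layer sequences below follow common ResNet-bottleneck and ConvNeXt-block conventions and are our own sketch, not counts taken from this paper:

```python
# Layer sequences of a classic ResNet bottleneck block vs. a
# ConvNeXt-style block, written as simple tag lists for counting.
resnet_block = ["conv1x1", "bn", "relu",
                "conv3x3", "bn", "relu",
                "conv1x1", "bn", "relu"]
convnext_block = ["dwconv7x7", "ln",
                  "conv1x1", "gelu",
                  "conv1x1"]

def count(block, kinds):
    """Count how many layers in `block` belong to the given tag set."""
    return sum(1 for layer in block if layer in kinds)

norms = {"bn", "ln"}
acts = {"relu", "gelu"}
print(count(resnet_block, norms), count(resnet_block, acts))      # 3 3
print(count(convnext_block, norms), count(convnext_block, acts))  # 1 1
```

Going from three normalization layers and three activations per block to one of each is the kind of aggressive reduction that the paper reports as both robustness-improving and training-speed-friendly.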



Figure 1: Comparison of out-of-distribution robustness among ResNet, DeiT, and our enhanced ResNet (dubbed Robust-ResNet). Though DeiT-S largely outperforms the vanilla ResNet, it performs worse than Robust-ResNet on these robustness benchmarks.

Wightman et al. (2021) demonstrate that, with an advanced training setup, the canonical ResNet-50 is able to boost its performance by 4% on ImageNet.

Availability: //github.com/UCSC-VLAA

