SIMILARITY OF NEURAL ARCHITECTURES BASED ON INPUT GRADIENT TRANSFERABILITY

Abstract

In this paper, we aim to design a quantitative similarity function between two neural architectures. Specifically, we define model similarity using input gradient transferability. We generate adversarial samples for two networks and measure the average accuracy of each network on the adversarial samples of the other. If two networks are highly correlated, the attack transferability will be high, resulting in a high similarity. Using the similarity score, we investigate two questions: (1) Which network components contribute to model diversity? (2) How does model diversity affect practical scenarios? We answer the first question with a feature importance analysis and a clustering analysis. The second question is validated in two different scenarios: model ensemble and knowledge distillation. We conduct a large-scale analysis of 69 state-of-the-art ImageNet classifiers. Our findings show that model diversity plays a key role when different neural architectures interact. For example, we find that greater diversity leads to better ensemble performance. We also observe that the relationship between teacher-student similarity and distillation performance depends on the choice of the base architectures of the teacher and student networks. We expect our analysis tool to support a high-level understanding of the differences between various neural architectures as well as practical guidance when using multiple architectures.

1. INTRODUCTION

The last couple of decades have seen the great success of deep neural networks (DNNs) in real-world applications, e.g., image classification (He et al., 2016a) and natural language processing (Vaswani et al., 2017). Advances in DNN architecture design have played a key role in this success by making the learning process easier (e.g., normalization methods (Ioffe & Szegedy, 2015; Wu & He, 2018; Ba et al., 2016) or skip connections (He et al., 2016a)), enforcing human inductive bias in the architecture (e.g., convolutional neural networks (CNNs) (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015)), or increasing model capability with the self-attention mechanism (e.g., Transformers (Vaswani et al., 2017)). Following these different design principles and architectural elements, a number of different neural architectures have been proposed; however, designing a distinguishable architecture is expensive and requires heavy expertise. One reason for this difficulty is that little is known about how two different neural architectures actually differ. In particular, if one can quantify the similarity between two models, then one can measure which design components contribute significantly to the diverse properties of neural networks. The quantity can also be utilized for new model design (e.g., via neural architecture search (NAS) (Zoph & Le, 2017)).

In this paper, we aim to define a similarity between two networks to quantify the difference and diversity between neural architectures. Existing studies have focused on dissecting each network component layer-by-layer (Kornblith et al., 2019; Raghu et al., 2021) or providing a high-level understanding by visualizing the loss surface (Dinh et al., 2017), input gradients (Springenberg et al., 2015; Smilkov et al., 2017), or decision boundaries (Somepalli et al., 2022).
In contrast, we aim to design an architecture-agnostic and quantitative score to measure the difference between two architectures. We especially focus on input gradients, a widely-used framework to understand model behavior, e.g., how a model changes its predictions under local pixel changes (Sung, 1998; Simonyan et al., 2014; Springenberg et al., 2015; Smilkov et al., 2017; Sundararajan et al., 2017; Bansal et al., 2020; Choe et al., 2022). If two models are similar, their input gradients are similar. However, because input gradients are very noisy, directly measuring the difference between them is also very noisy. Instead, we use adversarial attack transferability as a proxy measure of the difference between the input gradients of two networks. Consider two models A and B and an input x. We generate adversarial samples x_A and x_B against A and B, respectively. Then we measure the accuracy of B on x_A (denoted acc_{A→B}) and vice versa. If A and B are similar and we assume an optimal adversary, then acc_{A→B} will be almost zero, while if A and B have distinct input gradients, then acc_{A→B} will not drop significantly. We also note that adversarial attack transferability provides a high-level understanding of the difference between model decision boundaries (Karimi & Tang, 2020). We define a model similarity based on attack transferability and analyze existing neural architectures, e.g., which network component affects model diversity the most? We measure the pairwise attack-transferability-based network similarity of 69 different neural architectures trained on ImageNet (Russakovsky et al., 2015), provided by Wightman (2019). Our work is the first extensive study of model similarity on a large number of state-of-the-art ImageNet models. Our first goal is to understand the effect of each network module on model diversity.
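The cross-attack procedure above can be sketched in a few lines. The following is a minimal numpy toy, with linear classifiers standing in for networks A, B, and C and a single L2 fast-gradient step as the adversary; the final score (one minus the mean cross-attack accuracy) is one plausible instantiation of a transferability-based similarity, not necessarily the exact formula used in the paper, and all names (`attack`, `similarity`, the toy weights) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data labeled by a ground-truth linear rule.
d, n = 10, 2000
w_true = np.ones(d)
X = rng.normal(size=(n, d))
y = (X @ w_true > 0).astype(int)

# Three linear "networks": A and B are near-identical,
# C relies on a different subset of features.
w_a = w_true.copy()
w_b = w_true + 0.1 * rng.normal(size=d)
w_c = np.array([2.0] * 5 + [0.0] * 5)

def accuracy(w, X, y):
    return float(np.mean((X @ w > 0).astype(int) == y))

def attack(w, X, y, eps):
    """One L2 fast-gradient step against a linear scorer s(x) = w.x:
    move each input in the direction that lowers its correct-class score."""
    direction = w / np.linalg.norm(w)
    sign = np.where(y == 1, -1.0, 1.0)[:, None]
    return X + eps * sign * direction[None, :]

def similarity(w1, w2, X, y, eps=2.0):
    """1 minus the mean accuracy under transferred attacks: the more
    the attacks transfer, the more similar the two models."""
    acc_12 = accuracy(w2, attack(w1, X, y, eps), y)  # acc_{1->2}
    acc_21 = accuracy(w1, attack(w2, X, y, eps), y)  # acc_{2->1}
    return 1.0 - 0.5 * (acc_12 + acc_21)

sim_ab = similarity(w_a, w_b, X, y)
sim_ac = similarity(w_a, w_c, X, y)
print(f"sim(A,B) = {sim_ab:.2f}, sim(A,C) = {sim_ac:.2f}")
```

Since A and B share almost the same input-gradient direction, attacks crafted against one transfer to the other, so sim(A,B) comes out higher than sim(A,C).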
We first choose 13 basic components (e.g., normalization, activation, the design choice of stem layers) that constitute neural architectures and list the components of the 69 networks. For example, we represent ResNet (He et al., 2016a) as f_ResNet = [Base architecture = CNN, Norm = BN, Activation = ReLU, . . .]. We analyze the contribution of each network component to model diversity by a feature importance analysis, using gradient boosting regression on the model similarities, and by a clustering analysis on the model similarities. Our analyses show that the choice of base architecture (e.g., CNN, Transformer) contributes most to network diversity. Interestingly, our analysis also shows that the design choices for the input-level layers (e.g., the stem layer design) determine network diversity as much as the choice of core modules (e.g., normalization layers, activation functions). Our study is not limited to understanding component-level architecture design; we also analyze the effect of model diversity in practical scenarios. In particular, we measure model ensemble performance while controlling the diversity of the candidate models, e.g., ensembling "similar" or "dissimilar" models according to our similarity score. Here, we observe that ensembling more dissimilar models yields higher ensemble accuracy; more diversity leads to better ensemble performance. We also observe that the diversity caused by different initializations, different hyper-parameter choices, and different training regimes is not as significant as the diversity caused by architecture changes, and that an ensemble of models from the same cluster performs worse than an ensemble of models from different clusters. Similarly, choosing more diverse models decreases the number of samples misclassified by all models. Our findings confirm that a previous study conducted on simple linear classifiers (Kuncheva & Whitaker, 2003) also holds for recent complex large-scale neural networks.
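To make the component-level analysis concrete, the sketch below encodes a handful of architectures as component vectors (in the style of f_ResNet above), turns each pair into "does the pair share this component?" features, and ranks components by how strongly feature agreement tracks pairwise similarity. The architectures, component values, and similarity numbers are hypothetical stand-ins, and the |correlation| ranking is a dependency-free simplification of the paper's approach, which fits a gradient boosting regressor and reads off its feature importances.

```python
import numpy as np

# Hypothetical component encodings (field names and values illustrative).
archs = {
    "ResNet":   {"base": "CNN", "norm": "BN", "act": "ReLU", "stem": "7x7conv"},
    "ViT":      {"base": "Transformer", "norm": "LN", "act": "GELU", "stem": "patchify"},
    "DeiT":     {"base": "Transformer", "norm": "LN", "act": "GELU", "stem": "patchify"},
    "ConvNeXt": {"base": "CNN", "norm": "LN", "act": "GELU", "stem": "patchify"},
}
components = ["base", "norm", "act", "stem"]

# Hypothetical pairwise similarity scores (stand-ins for measured values).
sims = {("ResNet", "ViT"): 0.20, ("ResNet", "DeiT"): 0.25,
        ("ResNet", "ConvNeXt"): 0.50, ("ViT", "DeiT"): 0.90,
        ("ViT", "ConvNeXt"): 0.40, ("DeiT", "ConvNeXt"): 0.45}

# Feature = 1 if the pair shares that component; target = similarity.
X, t = [], []
for (a, b), s in sims.items():
    X.append([float(archs[a][c] == archs[b][c]) for c in components])
    t.append(s)
X, t = np.array(X), np.array(t)

# Simplified importance: |correlation| of each match-feature with the
# similarity target (a crude proxy for gradient-boosting importances).
imp = []
for j in range(X.shape[1]):
    col = X[:, j]
    imp.append(0.0 if col.std() == 0 else abs(np.corrcoef(col, t)[0, 1]))

ranking = sorted(zip(components, imp), key=lambda p: -p[1])
print(ranking)
```

With these toy numbers, sharing the base architecture explains the similarity target best, mirroring the finding that the base architecture choice dominates network diversity.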
As our third contribution, we provide a practical guideline for choosing a teacher network for knowledge distillation (KD) (Hinton et al., 2015). We train 25 distilled ViT-Ti models with diverse teacher networks. Interestingly, our findings show that the performance of the distilled model is highly correlated with the similarity between the teacher and student networks rather than with the accuracy of the teacher network: if the student and teacher networks are based on the same architecture (e.g., both are Transformers), then a similar teacher provides better knowledge; if the student and teacher networks are based on different architectures (e.g., Transformer and CNN), then choosing a more dissimilar teacher leads to better distillation performance. Our findings are partially aligned with previous KD studies showing that a similar teacher network leads to better KD performance (Jin et al., 2019; Mirzadeh et al., 2020). However, those studies only consider the scenario in which the teacher and student networks are based on the same architecture, which is exactly the case where our findings agree. Our observation extends this knowledge to the case where the teacher and student networks differ significantly (e.g., Transformer and CNN).
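For reference, the distillation objective in question is the temperature-softened soft-target loss of Hinton et al. (2015). Below is a minimal numpy sketch of that loss (function names and the toy logits are illustrative, not from the paper): the student is trained to match the teacher's softened output distribution, with the usual T² scaling so gradients stay comparable across temperatures.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_soft_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in Hinton et al. (2015)."""
    p_t = softmax(teacher_logits, T)
    log_p_t = np.log(p_t + 1e-12)
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    kl = np.sum(p_t * (log_p_t - log_p_s), axis=-1)
    return float((T ** 2) * kl.mean())

# A student whose logits track the teacher incurs a lower soft loss.
teacher = np.array([[5.0, 1.0, -2.0]])
close_student = np.array([[4.0, 1.5, -1.0]])
far_student = np.array([[-3.0, 0.5, 4.0]])
print(kd_soft_loss(close_student, teacher),
      kd_soft_loss(far_student, teacher))
```

The loss is zero when student and teacher logits coincide and grows as their softened distributions diverge, which is why teacher choice interacts with teacher-student similarity in the experiments above.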

2. NETWORK SIMILARITY BY INPUT GRADIENT TRANSFERABILITY

In this section, we propose a similarity measure between two networks based on adversarial attack transferability. Our interest lies in developing a practical toolbox to measure the architectural difference between two models quantitatively. Existing studies on the similarity between deep neural networks have focused on comparing intermediate features (Kornblith et al., 2019; Raghu et al., 2021), understanding loss landscapes (Dinh et al., 2017; Li et al., 2018; Park & Kim, 2022), or decision boundaries (Somepalli et al., 2022). However, their approaches cannot measure the distance




