EXPLORING NEURAL NETWORK REPRESENTATIONAL SIMILARITY USING FILTER SUBSPACES

Abstract

Analyzing representational similarity in neural networks is crucial to numerous tasks, such as interpreting or transferring deep models. One typical approach is to feed probing data into convolutional neural networks (CNNs) as stimuli to reveal their deep representations for model similarity analysis. Such methods are often computationally expensive and stimulus-dependent. By representing the filter subspace of a CNN as a set of filter atoms, previous work has reported competitive performance in continual learning by learning a different set of filter atoms for each task while sharing common atom coefficients across tasks. Inspired by this observation, in this paper we propose a new paradigm that reduces representational similarity analysis in CNNs to filter subspace distance assessment. Specifically, when filter atom coefficients are shared across networks, model representational similarity can be significantly simplified to calculating the cosine distance between the respective filter atoms, achieving a millions-fold reduction in computation. We provide both theoretical and empirical evidence that this simplified filter subspace-based similarity preserves a strong linear correlation with other popular stimulus-based metrics, while being significantly more efficient and robust to the choice of probing data. We further validate the effectiveness of the proposed method in various applications, such as analyzing training dynamics as well as federated and continual learning. We hope our findings can facilitate further exploration of real-time, large-scale representational similarity analysis in neural networks.

1. INTRODUCTION

Deep neural networks have shown unprecedented performance in a large variety of tasks (Krizhevsky et al., 2012; Ronneberger et al., 2015). The cornerstone of this success is the deep representation learned by neural networks (NNs), which contains high-level semantic information about a task. By viewing a deep representation as the characterization of its task in a high-dimensional space, the representational similarity between a pair of deep models can be exploited to understand the intrinsic relationship between the associated tasks. In this way, representational similarity provides a way to open the black box of deep learning by exposing training dynamics (Kornblith et al., 2019), and it further empowers machine learning systems to transfer knowledge from one task to another (Huang et al., 2021a).

Previous works (Raghu et al., 2017; Morcos et al., 2018) measure representational similarity directly from deep representations revealed by input data. These approaches incur heavy computation from both the forward pass of numerous stimulus inputs and the calculation of high-dimensional covariance matrices. Since these similarity metrics are stimulus-dependent, their quality can deteriorate when probing data are inappropriately chosen, scarce, or unavailable.

We are inspired by the continual learning framework of Miao et al. (2021), where a group of tasks is simultaneously modeled by learning a different set of filter atoms for each task while sharing common atom coefficients across all tasks. Miao et al. (2021) analyzed and validated this framework in detail in a continual learning context. In this setting, it is easy to observe that representation variations across different NNs are dominated by the respective filter atoms. Thus, Miao et al. (2021) adopts filter subspace distance in its experiments to assess task relevancy, albeit without formal justification.
In this paper, we formally explore NN representational similarity using filter subspace distance, with detailed theoretical and empirical justification. We first simplify the filter subspace distance to the cosine distance between two sets of filter atoms, eliminating the heavy computation of singular value decompositions for calculating principal angles. Then, we show both theoretically and empirically that the resulting filter atom-based similarity preserves a strong linear correlation with popular stimulus-dependent similarity measures such as CCA (Raghu et al., 2017). Our representational similarity is also immune to inappropriate choices of probing data, whereas stimulus-dependent metrics can be perturbed drastically. The proposed filter atom-based similarity is extremely efficient in both memory and computation. Since the similarity computation does not involve a network forward pass, no GPU memory is required, whereas other stimulus-based measures consume the same amount of GPU memory as regular inference. Moreover, the proposed method involves only inner product calculations on filter atoms, which take negligible time, while the evaluation time of stimulus-based measures includes both the forward pass of probing data and the calculation of high-dimensional covariance matrices. We report later the dramatically improved evaluation time of the proposed method against other popular methods, e.g., CKA (Kornblith et al., 2019). We further validate our atom-based similarity for knowledge transfer in various continual learning and federated learning tasks. In both settings, we fix the atom coefficients, learn the filter atoms for each task, and finally conduct knowledge transfer among tasks by recalling the most similar models for ensembling. Compared with stimulus-based similarity metrics, the proposed measure achieves competitive performance with a millions-fold reduction in computational cost.
We summarize our contributions as follows:
• We formally explore an NN representational similarity measure using filter subspace distance.
• We show both theoretically and empirically that the proposed filter atom-based measure preserves a strong linear correlation with other popular stimulus-dependent measures while being significantly more robust and efficient in both memory and computation.
• We demonstrate the effectiveness of the proposed similarity measure under various example settings, such as analyzing training dynamics as well as in federated and continual learning.

2. METHODOLOGY

In this section, we first provide a filter subspace formulation for NNs and propose a model similarity metric based on a simplified filter subspace distance. Then, we review stimulus-based representational similarities and show their limitations. We further demonstrate that under certain assumptions, the proposed measure shows a strong linear relationship with popular stimulus-based measures, while exhibiting dramatic improvement in computational efficiency and data robustness. These unique characteristics of the proposed measure can potentially enable real-time large-scale NN similarity assessment, e.g., helping fast knowledge transfer across a large number of models. 
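As a concrete point of reference for the stimulus-based measures reviewed here, the following NumPy sketch gives a minimal implementation of linear CKA (Kornblith et al., 2019). It is an illustrative rendering, not the paper's code: the activation matrices `X` and `Y` must first be obtained by a forward pass of the same probing inputs through the two models, which is exactly the cost the proposed atom-based measure avoids.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two representations.

    X, Y: (n_examples, n_features) activation matrices produced by
    feeding the same probing data through two models. Obtaining them
    requires a full forward pass, which dominates the evaluation cost.
    """
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # Numerator: squared Frobenius norm of the cross-covariance.
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    # Denominator: normalization by each representation's self-similarity.
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```

By the Cauchy-Schwarz inequality the score lies in [0, 1], reaching 1 when a representation is compared with itself; the covariance products are where the high-dimensional matrix computation mentioned above arises.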



2.1 REPRESENTATIONAL SIMILARITY IN FILTER SUBSPACE

Filter subspace. As in Qiu et al. (2018), a convolutional filter W ∈ R^{c'×c×k×k} (c' and c are the numbers of input and output channels, k is the kernel size) can be decomposed over m filter atoms D[i] ∈ R^{k×k} (i = 1, ..., m), linearly combined by atom coefficients α ∈ R^{m×c'×c}, as W = α × D. The filter subspace is then expressed as V = Span{D[1], ..., D[m]}. With this formulation, we consider a paradigm where atom coefficients are shared across different deep models while filter subspaces are model-specific. This paradigm has been analyzed and validated in detail in Miao et al. (2021), which reports state-of-the-art performance in a continual learning context. In this setting, we dive deep into the relationship between filter atoms and representations.

For simplicity, let c = c' = 1; the argument extends to the general case. Given an input image X(b) (b ∈ B, B ⊂ Z²), define the local input norm

||X||_{F,N_b} := (Σ_{b'∈N_b} X(b − b')²)^{1/2}

and the convolution

⟨X, w⟩_{N_b} := Σ_{b'∈N_b} X(b − b') w(b'),

where N_b ⊂ B is a local Euclidean grid centered at b. The decomposed convolution can then be written as

Z(b) = σ(Σ_{i=1}^{m} α_i ⟨X, D[i]⟩_{N_b}),

where D[i] denotes the i-th atom and α_i the corresponding i-th coefficient.

Proposition 1. Suppose D_u and D_v are two different sets of filter atoms for a convolutional layer with common atom coefficients α, and the activation function σ is non-expansive. Then the change in the corresponding features Z_u, Z_v is upper bounded by the change in the atoms:

||Z_u − Z_v||_F ≤ (||α||_F λ) √|B| · ||D_u − D_v||_F,  with λ = sup_{b∈B} ||X||_{F,N_b},
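To make the formulation concrete, the following NumPy sketch reconstructs filters from shared atom coefficients and model-specific atoms, and evaluates the atom-based cosine distance between two models. The shapes (m = 6 atoms, 3×3 kernels) and random values are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, c_out, c_in, k = 6, 4, 3, 3  # atoms, output/input channels, kernel size

# Shared atom coefficients alpha and two model-specific atom sets D_u, D_v.
alpha = rng.standard_normal((m, c_out, c_in))
D_u = rng.standard_normal((m, k, k))
D_v = rng.standard_normal((m, k, k))

# Reconstruct filters W = alpha x D: each (output, input) channel pair's
# k x k filter is a linear combination of the m atoms.
W_u = np.einsum("moi,mkl->oikl", alpha, D_u)  # shape (c_out, c_in, k, k)

def atom_cosine_distance(Da, Db):
    """Cosine distance between two flattened sets of filter atoms.

    Only an inner product over the atom parameters is needed: no probing
    data, no forward pass, no covariance matrices.
    """
    a, b = Da.ravel(), Db.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

d_same = atom_cosine_distance(D_u, D_u)  # ~0.0 for identical atoms
d_diff = atom_cosine_distance(D_u, D_v)
```

Because the atoms for a layer contain only m·k² parameters (54 floats here, versus activations over an entire probing set), this comparison runs in negligible time on a CPU, which is the source of the efficiency gains claimed above.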

