EXPLORING NEURAL NETWORK REPRESENTATIONAL SIMILARITY USING FILTER SUBSPACES

Abstract

Analyzing representational similarity in neural networks is crucial to numerous tasks, such as interpreting or transferring deep models. One typical approach is to feed probing data into convolutional neural networks (CNNs) as stimuli and use the resulting deep representations for model similarity analysis. These methods are often computationally expensive and stimulus-dependent. By representing the filter subspace of a CNN as a set of filter atoms, previous work has reported competitive performance in continual learning by learning a different set of filter atoms for each task while sharing common atom coefficients across tasks. Inspired by this observation, in this paper we propose a new paradigm that reduces representational similarity analysis in CNNs to a filter subspace distance assessment. Specifically, when filter atom coefficients are shared across networks, model representational similarity simplifies to the cosine distance between the respective filter atoms, yielding a computation reduction of millions of times. We provide both theoretical and empirical evidence that this simplified filter subspace-based similarity preserves a strong linear correlation with other popular stimulus-based metrics, while being significantly more efficient and robust to the choice of probing data. We further validate the effectiveness of the proposed method in various applications, such as analyzing training dynamics as well as in federated and continual learning. We hope our findings can facilitate further exploration of real-time, large-scale representational similarity analysis in neural networks.

1. INTRODUCTION

Deep neural networks have shown unprecedented performance on a large variety of tasks (Krizhevsky et al., 2012; Ronneberger et al., 2015). The cornerstone of this success is the deep representation learned by neural networks (NNs), which contains high-level semantic information about a task. By viewing a deep representation as the characterization of a task in a high-dimensional space, the representational similarity between a pair of deep models can be exploited to understand the intrinsic relationship between the associated tasks. In this way, representational similarity provides a way to open the black box of deep learning by revealing training dynamics (Kornblith et al., 2019), and it further empowers machine learning systems with the ability to transfer knowledge from one task to another (Huang et al., 2021a). Previous works (Raghu et al., 2017; Morcos et al., 2018) measure representational similarity directly from deep representations revealed by input data. These approaches incur heavy computation from both the forward pass of numerous stimulus inputs and the calculation of high-dimensional covariance matrices. Since these similarity metrics are stimulus-dependent, their quality can deteriorate when probing data are inappropriately chosen, scarce, or unavailable. We are inspired by the continual learning framework of Miao et al. (2021), where a group of tasks is simultaneously modeled with NNs by learning a different set of filter atoms for each task while sharing common atom coefficients across tasks. Miao et al. (2021) analyzed and validated this framework in detail in a continual learning context. In this setting, it is easy to observe that the representation variations across different NNs are dominated by the respective filter atoms. Miao et al. (2021) thus adopted the filter subspace distance in experiments to assess task relevancy, however, without formal justification.
In this paper, we formally explore NN representational similarity using filter subspace distance, with detailed theoretical and empirical justification. We first simplify the filter subspace distance to the cosine distance between two sets of filter atoms, eliminating the heavy computation of singular value decomposition required to obtain principal angles. We then show, both theoretically and empirically, that the obtained filter atom-based similarity preserves a strong linear correlation with popular stimulus-dependent similarity measures such as CCA (Raghu et al., 2017). Our representational similarity is also immune to inappropriate choices of probing data, whereas stimulus-dependent metrics can be perturbed drastically. The proposed filter atom-based similarity is extremely efficient in both memory and computation. Since the similarity computation involves no network forward pass, no GPU memory is required, whereas stimulus-based measures consume the same amount of GPU memory as regular inference. Moreover, the proposed method involves only inner products of filter atoms, which takes negligible time, while the evaluation time of stimulus-based measures includes both the forward pass of probing data and the calculation of high-dimensional covariance matrices. We later report the dramatically improved evaluation time of the proposed method against other popular methods, e.g., CKA (Kornblith et al., 2019). We further validate our atom-based similarity for knowledge transfer on various continual learning and federated learning tasks. In both settings, we fix the atom coefficients, learn the filter atoms for each task, and finally conduct knowledge transfer among tasks by recalling the most similar models for the ensemble. Compared with stimulus-based similarity metrics, the proposed measure achieves competitive performance with a millions-of-times reduction in computational cost.
We summarize our contributions as follows:
• We formally explore NN representational similarity measures using filter subspace distance.
• We show, both theoretically and empirically, that the proposed filter atom-based measure preserves a strong linear correlation with popular stimulus-dependent measures while being significantly more robust and efficient in both memory and computation.
• We demonstrate the effectiveness of the proposed similarity measure in various example settings, such as analyzing training dynamics as well as in federated and continual learning.

2. METHODOLOGY

In this section, we first provide a filter subspace formulation for NNs and propose a model similarity metric based on a simplified filter subspace distance. Then, we review stimulus-based representational similarities and show their limitations. We further demonstrate that under certain assumptions, the proposed measure shows a strong linear relationship with popular stimulus-based measures, while exhibiting dramatic improvement in computational efficiency and data robustness. These unique characteristics of the proposed measure can potentially enable real-time large-scale NN similarity assessment, e.g., helping fast knowledge transfer across a large number of models.

2.1. REPRESENTATIONAL SIMILARITY IN FILTER SUBSPACE

Filter subspace. As in Qiu et al. (2018), a convolutional filter $W \in \mathbb{R}^{c' \times c \times k \times k}$ (where $c'$ and $c$ are the numbers of input and output channels and $k$ is the kernel size) can be decomposed over $m$ filter atoms $D[i] \in \mathbb{R}^{k \times k}$ ($i = 1, \dots, m$), linearly combined by atom coefficients $\alpha \in \mathbb{R}^{m \times c' \times c}$, as $W = \alpha \times D$. The filter subspace is then $\mathcal{V} = \mathrm{Span}\{D[1], \dots, D[m]\}$. With this formulation, we consider a paradigm where atom coefficients are shared across different deep models while filter subspaces are model-specific. This paradigm has been analyzed and validated in detail in Miao et al. (2021), which reports state-of-the-art performance in a continual learning context. In this setting, we dive into the relationship between filter atoms and representations. For simplicity, let $c = c' = 1$; the argument extends to the general case. Given an input image $X(b)$ ($b \in \mathcal{B}$, $\mathcal{B} \subset \mathbb{Z}^2$), define the local input norm $\|X\|_{F,N_b} := (\sum_{b' \in N_b} X(b - b')^2)^{1/2}$ and the convolution $\langle X, w \rangle_{N_b} := \sum_{b' \in N_b} X(b - b') w(b')$, where $N_b \subset \mathcal{B}$ is a local Euclidean grid centered at $b$. The decomposed convolution can then be written as $Z(b) = \sigma(\sum_{i=1}^m \alpha_i \langle X, D[i] \rangle_{N_b})$, where $D[i]$ denotes the $i$-th atom and $\alpha_i$ the corresponding $i$-th coefficient.

Proposition 1. Suppose $D_u$ and $D_v$ are two different sets of filter atoms for a convolutional layer with common atom coefficients $\alpha$, and the activation function $\sigma$ is non-expansive. Then the change in the corresponding features $Z_u$, $Z_v$ is upper bounded by the atom change,
$$\|Z_u - Z_v\|_F \le (\|\alpha\|_F \lambda) \sqrt{|\mathcal{B}|} \cdot \|D_u - D_v\|_F, \quad \text{with } \lambda = \sup_{b \in \mathcal{B}} \|X\|_{F,N_b}.$$
The proof is provided in Appendix A.1, and we further validate this relationship empirically in Section 3.1.

Filter subspace similarity. The above proposition suggests the possibility of measuring the representational similarity of two CNNs by simply measuring the distance between their filter subspaces. As proposed in Miao et al.
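To make the decomposition concrete, here is a minimal NumPy sketch of reconstructing a filter bank from shared coefficients and model-specific atoms; the shapes and names are illustrative assumptions, not the authors' code:

```python
import numpy as np

def compose_filters(alpha, atoms):
    """Reconstruct a convolutional filter bank W = alpha x D.

    alpha: (m, c_out, c_in) atom coefficients (shared across models)
    atoms: (m, k, k) filter atoms (model-specific)
    Returns W of shape (c_out, c_in, k, k), with W[o, i] = sum_j alpha[j, o, i] * atoms[j].
    """
    return np.einsum('moi,mkl->oikl', alpha, atoms)

rng = np.random.default_rng(0)
m, c_out, c_in, k = 9, 4, 3, 3
alpha = rng.standard_normal((m, c_out, c_in))
D = rng.standard_normal((m, k, k))
W = compose_filters(alpha, D)
print(W.shape)  # (4, 3, 3, 3)
```

Because the composition is linear in the atoms, swapping in a different atom set while keeping `alpha` fixed is exactly the per-task adaptation described above.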
(2021), the representational similarity of two models with different filter subspaces $\mathcal{V}_u$, $\mathcal{V}_v$ can be assessed via the Grassmann distance between $\mathcal{V}_u$ and $\mathcal{V}_v$:
$$S_{\mathrm{Gras}}(F_u, F_v) = d(\mathcal{V}_u, \mathcal{V}_v) = \frac{1}{m} \sum_i \cos\theta_i,$$
where $\theta_i$ is the $i$-th principal angle between $\mathcal{V}_u$ and $\mathcal{V}_v$. However, this metric requires a costly singular value decomposition. Note that filter atoms in different models are intrinsically aligned under shared atom coefficients, which allows us to approximate the filter subspace similarity with the cosine similarity of the corresponding filter atoms. To this end, as shown in Figure 1, we propose a significantly simplified representational similarity measure based on filter atoms.

Definition 1. Suppose two convolutional neural networks $F_u$, $F_v$ share atom coefficients layer-wise, assumed to be full-rank matrices, and their model-specific filter atoms are $D_u$, $D_v$. The atom-based representational similarity is defined as
$$S_{\mathrm{Atom}}(F_u, F_v) = \cos(D_u, D_v) = \frac{\langle \mathrm{vec}(D_u), \mathrm{vec}(D_v) \rangle}{\|\mathrm{vec}(D_u)\|_F \cdot \|\mathrm{vec}(D_v)\|_F}.$$

The above definition is layer-wise, allowing us to compare the similarity of different networks per layer; we simply average the layer-wise similarities to obtain the network-wise similarity. We further show that $S_{\mathrm{Atom}}$ and $S_{\mathrm{Gras}}$ are equivalent under a certain assumption.

Proposition 2. Assume $D_u, D_v \in \mathbb{R}^{k^2 \times m}$ are orthogonal matrices; then $S_{\mathrm{Gras}} = S_{\mathrm{Atom}}$.

The proof is provided in Appendix A.1. We empirically show in Figure 2(a) that the atom-based similarity still has a strong linear correlation with the Grassmann subspace similarity even without imposing the above orthogonality on the atoms. Note that our atom-based similarity measure involves only linear operations on vectorized atoms of around 100 dimensions, which requires negligible computation.
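The contrast between the two measures can be sketched in a few lines of NumPy; atoms are stored as columns of a $k^2 \times m$ matrix, and the dimensions are illustrative assumptions:

```python
import numpy as np

def s_atom(D_u, D_v):
    """Atom-based similarity (Definition 1): cosine of the vectorized atom sets."""
    u, v = D_u.ravel(), D_v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def s_gras(D_u, D_v):
    """Grassmann-based similarity: mean cosine of the principal angles between
    the subspaces spanned by the atom columns (requires a QR plus an SVD)."""
    Q_u, _ = np.linalg.qr(D_u)
    Q_v, _ = np.linalg.qr(D_v)
    cos_thetas = np.linalg.svd(Q_u.T @ Q_v, compute_uv=False)
    return float(cos_thetas.mean())

rng = np.random.default_rng(0)
k2, m = 9, 6  # atoms as columns of a k^2 x m matrix (illustrative sizes)
D_u = rng.standard_normal((k2, m))
D_v = rng.standard_normal((k2, m))
print(s_atom(D_u, D_v), s_gras(D_u, D_v))
```

`s_atom` needs only one inner product and two norms, while `s_gras` pays for orthogonalization and an SVD, which is exactly the cost the simplification avoids.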
Additionally, the proposed method depends solely on the models themselves and eliminates the reliance on probing data, making our similarity robust to inappropriate choices of probing data.

2.2. STIMULUS-BASED REPRESENTATIONAL SIMILARITIES

Intuitively, representational similarity can be directly assessed via features generated from different neural networks. As shown in Figure 1, evaluating a stimulus-based representational similarity between two NNs $F_u$ and $F_v$ usually involves three steps: (1) collect an appropriate and sufficient amount of probing data $X_p \in \mathbb{R}^{n \times c' \times h' \times w'}$ that represents the whole data distribution; (2) generate the features $Z_u$ and $Z_v$ ($Z_u, Z_v \in \mathbb{R}^{n \times c \times h \times w}$) by forward passes of the probing data through the networks, $Z_u = F_u(X_p, \theta_u)$ and $Z_v = F_v(X_p, \theta_v)$, where $\theta_u$, $\theta_v$ denote the parameters of the two NNs; (3) choose a stimulus-based metric to assess model similarity. Several popular methods can be adopted in step (3); we briefly introduce them below.

CCA. Raghu et al. (2017) propose to analyze representational similarity by conducting canonical correlation analysis on $Z_u$, $Z_v$, a recursive process of finding projection directions for the two matrices such that their correlation is maximized. Specifically, let $Q_u$, $Q_v$ denote orthonormal bases of $Z_u$, $Z_v$, and let $\Lambda_{u,v} = Q_u^\top Q_v$. The CCA similarity is
$$S_{\mathrm{CCA}}(F_u, F_v) = \frac{1}{c} \sum_{l=1}^{c} \sigma_l^2,$$
where $\sigma_l$ denotes the $l$-th singular value of $\Lambda_{u,v}$.

CKA. Kornblith et al. (2019) propose another way to assess similarity based on Centered Kernel Alignment (CKA). Let $K_u = Z_u Z_u^\top$, $K_v = Z_v Z_v^\top$ denote the Gram matrices of the two feature spaces. CKA is computed as
$$S_{\mathrm{CKA}}(F_u, F_v) = \frac{\mathrm{HSIC}(K_u, K_v)}{\sqrt{\mathrm{HSIC}(K_u, K_u)\,\mathrm{HSIC}(K_v, K_v)}},$$
where HSIC is the Hilbert-Schmidt Independence Criterion (Gretton et al., 2005).
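For reference, a minimal sketch of the linear-kernel case of CKA (Kornblith et al., 2019) on two feature matrices; this is a standard formulation, not the paper's own implementation:

```python
import numpy as np

def linear_cka(Z_u, Z_v):
    """Linear-kernel CKA between feature matrices of shape (n_samples, n_features).

    CKA = HSIC(K_u, K_v) / sqrt(HSIC(K_u, K_u) * HSIC(K_v, K_v)); with linear
    kernels this reduces to Frobenius norms of (cross-)covariance matrices.
    """
    Z_u = Z_u - Z_u.mean(axis=0)  # centering the features centers the Gram matrices
    Z_v = Z_v - Z_v.mean(axis=0)
    hsic_uv = np.linalg.norm(Z_v.T @ Z_u, 'fro') ** 2
    hsic_uu = np.linalg.norm(Z_u.T @ Z_u, 'fro') ** 2
    hsic_vv = np.linalg.norm(Z_v.T @ Z_v, 'fro') ** 2
    return float(hsic_uv / np.sqrt(hsic_uu * hsic_vv))

rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 16))  # n = 100 probing samples, 16 features
print(linear_cka(Z, Z))  # identical features give CKA = 1.0
```

Even in this small sketch the cost is dominated by the $n \times$ feature matrix products, which grow with the amount of probing data, unlike the atom-based measure.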
However, in addition to the forward pass, all the aforementioned approaches introduce significant computational costs when performing the evaluation in representation space. Moreover, their quality relies heavily on a careful choice of probing data $X_p$, which undermines their robustness.

2.3. ALGORITHM COMPLEXITY ANALYSIS

Here, we provide a detailed comparison of computational complexity between the proposed atom-based similarity and stimulus-based similarities. Consider one convolutional layer with filter $W \in \mathbb{R}^{c' \times c \times k \times k}$ ($W = \alpha \times D$, $D \in \mathbb{R}^{m \times k \times k}$) that transforms the input $X_p \in \mathbb{R}^{n \times c' \times h' \times w'}$ into the output $Z \in \mathbb{R}^{n \times c \times h \times w}$. The complexity of our method is dominated by the inner product of two tiny sets of filter atoms, $O(m \cdot k^2)$, e.g., $m = 9$, $k = 3$ in a typical setting. In contrast, a stimulus-based similarity measure first forward-feeds $n$ probing samples with a complexity of $O(n \cdot h'w' \cdot k^2 \cdot cc')$, then calculates a covariance matrix with complexity $O(n^2 \cdot hw \cdot c)$. In total, the time complexity of CCA is $O(n \cdot h'w' \cdot k^2 \cdot cc' + n^2 \cdot hw \cdot c)$. Our method is therefore at least $\frac{n h'w' k^2 cc' + n^2 hw c}{m k^2}$ times more efficient than stimulus-based similarity measures. As $h \gg k$ and $cc' \gg m$, the computational cost of our method is negligible. For example, with 10k probing data points, the CCA calculation requires $1.14 \times 10^7$ times more FLOPs than the proposed atom-based similarity.
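The counting above can be reproduced with a few lines of arithmetic; the layer sizes below are illustrative assumptions rather than the paper's exact configuration, so the printed ratio differs from the $1.14 \times 10^7$ figure:

```python
# Back-of-the-envelope FLOP comparison for one convolutional layer.
n = 10_000            # number of probing samples (stimulus-based methods only)
h = w = hp = wp = 32  # output / input spatial sizes (illustrative)
c = cp = 64           # output / input channel counts (illustrative)
k, m = 3, 9           # kernel size and number of filter atoms

# Forward pass of probing data plus covariance-matrix computation.
stimulus_flops = n * hp * wp * k**2 * c * cp + n**2 * h * w * c
# Inner product of two atom sets.
atom_flops = m * k**2

print(f"stimulus-based: {stimulus_flops:.2e} FLOPs")
print(f"atom-based:     {atom_flops} FLOPs")
print(f"ratio:          {stimulus_flops / atom_flops:.2e}")
```

Note how the $n^2$ covariance term dominates as probing data grows, while the atom-based cost stays constant.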

2.4. RELATIONSHIP WITH STIMULUS-BASED SIMILARITIES

The proposed atom-based measure not only shows extreme efficiency, but also exhibits a linear relationship with other popular stimulus-based similarities. Here, we relate the proposed atom-based similarity $S_{\mathrm{Atom}}$ to CCA $S_{\mathrm{CCA}}$ (Raghu et al., 2017). Suppose the forward passes of the decomposed convolutional layer for $F_u$ and $F_v$ are $Z_u = \alpha X_p D_u$ and $Z_v = \alpha X_p D_v$, respectively. To start with, we show that $S_{\mathrm{CCA}}$ is upper bounded by the proposed $S_{\mathrm{Atom}}$.

Theorem 1. Let $T = \mathrm{Tr}(X_p^\top \alpha^\top \alpha X_p)$ and $C = \sigma_{\min}(X_p^\top \alpha^\top \alpha X_p)$. Assume $K(Z_u^\top Z_u), K(Z_v^\top Z_v) \le \gamma$. Then $S_{\mathrm{CCA}}(F_u, F_v)$ is upper bounded by $S_{\mathrm{Atom}}(F_u, F_v)$:
$$\frac{C}{\gamma c^{3/2} T} \cdot S_{\mathrm{CCA}}(F_u, F_v) \le S_{\mathrm{Atom}}(F_u, F_v),$$
where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, $\sigma_{\min}$ the minimum eigenvalue, and $K(A)$ the condition number of matrix $A$. We provide the proof in Appendix A.1. Since $S_{\mathrm{CCA}}$ is stimulus-dependent, its value varies with the choice of probing data, and its range is bounded by our atom-based similarity as in the theorem above. With additional assumptions, we can further show a simple linear relationship between CCA and our atom-based similarity.

Assumption 1. Suppose the diagonal elements of $Z_u^\top Z_u$, $Z_u^\top Z_v$ and $Z_v^\top Z_v$ dominate the non-diagonal elements, i.e., $(Z_u^\top Z_u)_{ii} \gg (Z_u^\top Z_u)_{ij}$ for $i \ne j$.

Assumption 1 suggests that different channels of the feature $Z$ have low correlation. Reducing channel-wise dependencies has been studied in Zhang et al. (2021) and shown to benefit model stability.

Theorem 2. If Assumption 1 holds, $S_{\mathrm{CCA}}(F_u, F_v)$ is approximately linear in the atom-based similarity,
$$\frac{\sqrt{c}}{\gamma_1 \gamma_2 \gamma_3} \cdot S_{\mathrm{CCA}}(F_u, F_v) = S_{\mathrm{Atom}}(F_u, F_v),$$
where $\gamma_1$, $\gamma_2$ and $\gamma_3$ contain higher-order terms of the features; details and the proof are in Appendix A.1. As shown in Figure 2, we empirically observe the linear correlation between CCA and atom-based similarity, which agrees with our theoretical findings.
In addition, we find that the proposed similarity also shows a strong correlation with CKA with different kernels.

3. EXPERIMENTS

In this section, we first validate our theorems with several simple experiments, and then demonstrate various applications of the proposed atom-based similarity in efficiently analyzing training dynamics as well as in federated and continual learning scenarios.

3.1. SIMPLE VALIDATION EXPERIMENTS

We empirically validate that the change of features is bounded by the change of atoms, as well as the near-linear relationship between atom-based and stimulus-based similarities. We also demonstrate the limitations of stimulus-based similarities and verify our assumption and theorems.

Representation dependency on filter atoms. We first validate the dependency of deep features on filter atoms stated in Proposition 1 with a simple experiment. The model $F$ here is a 2-layer CNN with coefficients $\alpha$ and atoms $D$ drawn from the normal distribution $N(0, 1)$; the input sample $X$ is also drawn from $N(0, 1)$. Figure 4(a) shows the relation between $\|Z_u - Z_v\|_F$ and $\|D_u - D_v\|_F$ obtained by fixing the coefficients $\alpha$ and the input $X$ while randomly varying the filter atoms $D$. All points lie below the line given by the bound in Proposition 1, confirming that the representation variations are dominated by the filter atoms.

Correlation of CCA and atom-based similarity. We next empirically verify that CCA and atom-based similarity have a strong correlation. In this experiment, 10 tasks are generated from CIFAR-100 (Krizhevsky et al., 2009) with 10 classes per task. Only the filter atoms of each task are trained, while the atom coefficients are fixed. We calculate CCA and atom-based similarity among the 45 pairs of models. The correlation between CCA and atom-based similarity is 0.8638, as shown in Figure 2(b). Similarly, the correlation between CKA and atom-based similarity is reported in Figure 2 (Table).
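The bound check above can be reproduced in a self-contained sketch; it uses a single-channel layer with a plain 'valid' convolution and ReLU (a non-expansive activation), with illustrative sizes rather than the paper's 2-layer setup:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, size = 9, 3, 16           # atoms, kernel size, input size (illustrative)
alpha = rng.standard_normal(m)  # shared atom coefficients (c = c' = 1 case)
X = rng.standard_normal((size, size))

def conv_valid(X, w):
    """Plain 'valid' 2-D correlation of X with a single k x k filter w."""
    k = w.shape[0]
    H = X.shape[0] - k + 1
    out = np.empty((H, H))
    for i in range(H):
        for j in range(H):
            out[i, j] = np.sum(X[i:i + k, j:j + k] * w)
    return out

def forward(D):
    """Z(b) = relu(sum_i alpha_i <X, D_i>_{N_b}); ReLU is non-expansive."""
    w = np.tensordot(alpha, D, axes=1)  # combine atoms first (convolution is linear)
    return np.maximum(conv_valid(X, w), 0.0)

D_u = rng.standard_normal((m, k, k))
D_v = rng.standard_normal((m, k, k))
lhs = np.linalg.norm(forward(D_u) - forward(D_v))

# Proposition 1 bound: ||alpha||_F * lambda * sqrt(|B|) * ||D_u - D_v||_F
lam = max(np.linalg.norm(X[i:i + k, j:j + k])
          for i in range(size - k + 1) for j in range(size - k + 1))
B = (size - k + 1) ** 2
rhs = np.linalg.norm(alpha) * lam * np.sqrt(B) * np.linalg.norm(D_u - D_v)
print(lhs <= rhs)  # True: the feature change stays below the atom-change bound
```

Re-running with fresh random atoms keeps every `(lhs, rhs)` pair below the bound, mirroring the scatter in Figure 4(a).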
These results clearly show that the proposed atom-based similarity has a strong linear relationship with popular stimulus-based similarities, which agrees with Theorem 1 and Theorem 2.

Effect of channel decorrelation. We further design a regularization term $\beta \sum_{i \ne j} (Z_u^\top Z_u)_{ij}^2$ to push the features toward Assumption 1, i.e., $(Z_u^\top Z_u)_{ii} \gg (Z_u^\top Z_u)_{ij}$.

Limitations of stimulus-based similarities. Depending on the stimulus data, stimulus-based similarities can be inconsistent. We expect a high value when evaluating the similarity between two models trained on the same dataset. In this experiment, we train two models on CIFAR-100. As shown in Figure 4 (Table), the CCA similarity of the two models with stimuli from CIFAR-100 is 0.8793, but it drops to 0.4862 with stimuli from SVHN (Netzer et al., 2011). The CKA similarity exhibits the same inconsistency across choices of stimuli. In contrast, the atom-based similarity of the two models is 0.9136, in line with our expectation.

3.2. LEARNING DYNAMICS

The atom-based similarity has various applications in analyzing NNs. It is capable of reflecting data similarity and measuring the evolution of model similarity over training time. We examine the training dynamics through heat maps of atom-based similarities. In this experiment, AlexNet (Krizhevsky et al., 2012) is trained on CIFAR-100 (Krizhevsky et al., 2009) for 150 epochs, and VGG11 (Simonyan & Zisserman, 2014) is fine-tuned on ImageNet (Russakovsky et al., 2015) for 20 epochs. For both models, we train and store atoms at each epoch. Figure 5 shows heat maps of similarities of the model among different training epochs. Figures 5(a-c) are heat maps of the 1st, 3rd and 5th convolutional layers of AlexNet. We mark the epoch at which the parameters of each layer reach 0.99 similarity with their states in the last epoch. The first layer reaches 0.99 similarity at epoch 36, earlier than the final layers. In Figures 5(d-f), VGG11 shows similar behavior. Several previous works have also indicated this bottom-up learning dynamic, in which layers closer to the input solidify into their final states faster than the top layers (Raghu et al., 2017; Morcos et al., 2018). Our atom-based similarity provides a highly efficient way to examine training dynamics while showing results in accord with previous studies. Moreover, we can apply our method to calculate the similarity of a model trained on different tasks, allowing us to track the process of the same model interacting with different datasets. Details are given in Appendix A.2.
3.3. FEDERATED LEARNING

Federated learning (FL) aims to learn models collaboratively by leveraging the local computational power and data of all users while preserving privacy (McMahan et al., 2017). Personalized federated learning (PFL) emerged to address challenges in FL such as poor convergence on heterogeneous data and the lack of solution personalization (Tan et al., 2022). In this setting, our framework achieves personalization by endowing FL models with shared atom coefficients for all users and specific filter atoms for each user. As illustrated in Figure 7, the shared coefficients preserve the common knowledge, while user-specific atoms hold personalized information about each user. We can then assess model relationships with our atom-based similarity without any stimulus data, which meets the privacy requirements of the FL scenario. The shared atom coefficients can be obtained in different ways: from a model pre-trained on a public dataset, from a global model trained by other FL approaches, or by training the model locally and evolving the coefficients at each communication round.

Measuring user similarity. With shared atom coefficients and user-specific filter atoms, we can obtain the relations among users by simply calculating the atom-based similarity. Specifically, we expect users with similar data to have higher similarity. In this experiment, we combine two datasets, CIFAR-100 (Krizhevsky et al., 2009) and SVHN (Netzer et al., 2011). The computational cost of the three different approaches is shown in Figure 3. Notably, calculating the atom-based similarity is significantly (millions of times) faster and requires no GPU memory, unlike stimulus-based methods. The computational advantages of atom-based similarity become more prominent as the number of models increases.

Improving the personalized model with an ensemble of similar users.
Once we obtain the relationships among users, we can further improve the accuracy of the current model via an ensemble of similar models, which effectively mitigates the data heterogeneity problem in FL. The experiment is described in detail in Appendix A.2, and the final results are shown in Table 1. With the ensemble, the accuracies of all FL methods improve. Note that the results of the model ensemble selected by our atom-based similarity are comparable with stimulus-based methods while consuming far fewer resources.
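The server-side similarity lookup described above can be sketched as a single matrix product over normalized atom vectors; user count and atom shapes are illustrative assumptions:

```python
import numpy as np

def pairwise_atom_similarity(atom_sets):
    """Pairwise atom-based similarity matrix over users' atom tensors.

    No probing data or forward passes are needed, so a server can compute
    this without ever touching user data.
    """
    V = np.stack([a.ravel() / np.linalg.norm(a) for a in atom_sets])
    return V @ V.T  # entry (i, j) is cos(D_i, D_j)

rng = np.random.default_rng(2)
users = [rng.standard_normal((9, 3, 3)) for _ in range(4)]  # 4 users, m=9 atoms of 3x3
S = pairwise_atom_similarity(users)
top_match = int(np.argsort(S[0])[-2])  # most similar other user to user 0
print(S.shape, top_match)
```

Selecting ensemble members then reduces to reading off the largest off-diagonal entries of `S` for each user.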

3.4. CONTINUAL LEARNING

Continual learning is an open problem in machine learning in which data from multiple tasks arrive sequentially, and the model must adapt to new tasks without forgetting knowledge from the past (Parisi et al., 2019). Since some tasks in continual learning are related, models trained on these tasks can benefit from aggregating knowledge from one another. We adopt the setting in Miao et al. (2021) and apply atom-based similarity to find related models. Specifically, we use the 10-split CIFAR-100 dataset, where the 100 classes are broken down into 10 tasks with 10 classes per task. We train AlexNet, including atoms and atom coefficients, on the first task, and train only the atoms on the following tasks. We then calculate the task similarity with atom-based similarity and report the model ensemble result with the most similar members. The accuracy and similarity computation costs are shown in Table 2. Our method achieves higher accuracy and faster speed compared with stimulus-based methods.

4. RELATED WORK

4.1. MODEL SIMILARITY

Representational similarity analysis (RSA) (Kriegeskorte et al., 2008) introduced a method for understanding brain activity by computing similarities between brain responses in different regions. Measuring the similarity of models is beneficial for understanding neural network (NN) architectures and learning dynamics (Raghu et al., 2017; Kornblith et al., 2019; Morcos et al., 2018; Dwivedi & Roig, 2019). Model similarity can be used to understand or improve various machine learning paradigms across different areas, including contrastive learning (Islam et al., 2021; Hua et al., 2021), knowledge distillation (Stanton et al., 2021), meta-learning (Raghu et al., 2019a), and transfer learning (Raghu et al., 2019b; Neyshabur et al., 2020; Bolya et al., 2021). Multiple approaches have been proposed to estimate the representational similarity of NNs. Some early works show that individual neurons can capture meaningful information (Bau et al., 2017; Zeiler & Fergus, 2014; Zhou et al., 2016; Bau et al., 2018). Later, gradient-based methods emerged to provide visual explanations of deep neural networks (Selvaraju et al., 2017). Currently popular representational similarity methods rely on the features of NNs. Raghu et al. (2017) propose SVCCA to measure similarity by calculating the covariance matrix of the features of each layer after channel alignment. Kornblith et al. (2019) discuss the invariance properties of similarity indices and propose CKA with consistent correspondences between layers. Stimulus-based similarities are data-dependent and computationally expensive. In contrast, our method measures representational similarity only via atoms, a small portion of the model parameters, making it data-agnostic and far more efficient.

4.2. LEARNING PARADIGM WITH NUMEROUS MODELS

Some machine learning tasks involve numerous models. For example, in federated learning (Tan et al., 2022), thousands of models are trained across clients.
In continual learning, multiple models are generated across time (Kirkpatrick et al., 2017). Federated learning (FL) aims to improve the performance of the system by continuously training and aggregating models from users without collecting their data (McMahan et al., 2017; Smith et al., 2017; Konečnỳ et al., 2016). FL requires communication efficiency, as thousands or even millions of clients may be involved (Li et al., 2020a). It is also required to achieve personalization (Tan et al., 2022; Huang et al., 2021b), considering the data heterogeneity of different users (Li et al., 2020a; Kairouz et al., 2021). Estimating user similarity can effectively address these challenges in FL. Continual learning (CL) aims to provide long-term knowledge accumulation, with the main challenge of avoiding catastrophic forgetting: learning new tasks while remembering old ones (Kirkpatrick et al., 2017; Aljundi et al., 2018; Lee et al., 2017; Zenke et al., 2017; Kolouri et al., 2019). One promising approach is to store neural networks for each task (Lopez-Paz & Ranzato, 2017; Rusu et al., 2016; Yoon et al., 2018; Jerfel et al., 2019; Li et al., 2019). As the number of tasks increases, a large number of models are generated and stored, and it becomes important to assess their relations in order to reuse models.

5. CONCLUSION

In this paper, we proposed a new paradigm that reduces representational similarity analysis in CNNs to a filter subspace distance assessment. We provided both theoretical and empirical evidence that the proposed filter subspace-based similarity exhibits a strong linear correlation with popular stimulus-based metrics, while being significantly more efficient and robust to the choice of probing data. Evaluated on both federated learning and continual learning tasks, it achieves competitive performance with a millions-of-times reduction in computational cost. Our method currently assumes the respective layers of the compared CNNs to have coefficients of the same dimension. In future work, we will explore sharing coefficients across layers to enable atom-based similarity between layers of different dimensions.

A APPENDIX

A.1. THEORETICAL PROOFS

Proposition 1. Suppose $D_u$ and $D_v$ are two different sets of filter atoms for a convolutional layer with common atom coefficients $\alpha$, and the activation function $\sigma$ is non-expansive. Then
$$\|Z_u - Z_v\|_F \le (\|\alpha\|_F \lambda) \sqrt{|\mathcal{B}|} \cdot \|D_u - D_v\|_F, \quad \text{with } \lambda = \sup_{b \in \mathcal{B}} \|X\|_{F,N_b}.$$

Proof. Recall that the decomposed convolution can be expressed as
$$Z(b) = \sigma\Big(\sum_{i=1}^m \alpha_i \langle X, D[i] \rangle_{N_b}\Big). \tag{9}$$
Since $\sigma$ is non-expansive, for all $b$ we have
$$|Z_u(b) - Z_v(b)| \le \Big|\sum_{i=1}^m \alpha_i \langle X, D_u[i] \rangle_{N_b} - \sum_{i=1}^m \alpha_i \langle X, D_v[i] \rangle_{N_b}\Big| \le \|\alpha\|_F \Big(\sum_{i=1}^m |\langle X, D_u[i] - D_v[i] \rangle_{N_b}|^2\Big)^{1/2}. \tag{10}$$
By the Cauchy-Schwarz inequality,
$$|\langle X, D_u[i] - D_v[i] \rangle_{N_b}| \le \|X\|_{F,N_b} \cdot \|D_u[i] - D_v[i]\|_{F,N_b} \le \lambda \cdot \|D_u[i] - D_v[i]\|_{F,N_b}, \tag{11}$$
so that
$$\sum_{b \in \mathcal{B}} |Z_u(b) - Z_v(b)|^2 \le \|\alpha\|_F^2 \sum_{b} \sum_{i=1}^m |\langle X, D_u[i] - D_v[i] \rangle_{N_b}|^2 \le (\|\alpha\|_F \lambda)^2 \sum_{b,i} \|D_u[i] - D_v[i]\|_{F,N_b}^2. \tag{12}$$
Observe that
$$\sum_{b,i} \|D_u[i] - D_v[i]\|_{F,N_b}^2 = \sum_{b \in \mathcal{B}} \sum_{i=1}^m \|D_u[i] - D_v[i]\|_{F,N_b}^2 = |\mathcal{B}| \cdot \|D_u - D_v\|_F^2, \tag{13}$$
where $|\mathcal{B}|$ is the area of the domain of $X$. Then Eq. 12 becomes
$$\sum_{b \in \mathcal{B}} |Z_u(b) - Z_v(b)|^2 \le (\|\alpha\|_F \lambda)^2 |\mathcal{B}| \cdot \|D_u - D_v\|_F^2, \tag{14}$$
which proves $\|Z_u - Z_v\|_F \le (\|\alpha\|_F \lambda) \sqrt{|\mathcal{B}|} \cdot \|D_u - D_v\|_F$ as claimed.

Proposition 2. Assume the filter atoms $D_u, D_v$ are orthogonal matrices; then $S_{\mathrm{Gras}} = S_{\mathrm{Atom}}$.

Proof. Since $D_u, D_v \in \mathbb{R}^{k^2 \times m}$ are orthogonal matrices, i.e., $D_u^\top D_u = D_v^\top D_v = I$, the Grassmann similarity can be represented as
$$S_{\mathrm{Gras}}(F_u, F_v) = \frac{1}{m} \sum_{i=1}^m \cos\theta_i = \frac{1}{m} \sum_{i=1}^m \sigma_i, \quad \text{where } \sigma_i = \Sigma_{ii}, \; U \Sigma V = D_u^\top D_v.$$
$S_{\mathrm{Atom}}$ is defined as
$$S_{\mathrm{Atom}}(F_u, F_v) = \cos(D_u, D_v) = \frac{\langle \mathrm{vec}(D_u), \mathrm{vec}(D_v) \rangle}{\|\mathrm{vec}(D_u)\|_F \cdot \|\mathrm{vec}(D_v)\|_F}.$$
Analyzing each part separately, we have $\langle \mathrm{vec}(D_u), \mathrm{vec}(D_v) \rangle = \mathrm{Tr}(D_u^\top D_v) = \sum_{i=1}^m \sigma_i$, $\|\mathrm{vec}(D_u)\|_F = \sqrt{\mathrm{Tr}(D_u^\top D_u)} = \sqrt{\mathrm{Tr}(I)} = \sqrt{m}$, and likewise $\|\mathrm{vec}(D_v)\|_F = \sqrt{m}$.
In total, the atom-based similarity becomes
$$S_{\mathrm{Atom}}(F_u, F_v) = \cos(D_u, D_v) = \frac{\sum_{i=1}^m \sigma_i}{m},$$
which equals $S_{\mathrm{Gras}}$. The claim is proved.

Lemma 1. For two positive semidefinite matrices $A$, $B$,
$$\mathrm{Tr}(AB) \ge \sigma_{\min}(A)\,\mathrm{Tr}(B), \tag{17}$$
where $\sigma_{\min}$ denotes the minimum eigenvalue of $A$.

Proof. It is equivalent to prove that
$$\mathrm{Tr}((A - \sigma_{\min}(A) I) B) \ge 0. \tag{18}$$
Let $C$, $D$ be matrices such that $A - \sigma_{\min}(A) I = C^\top C$ and $B = D^\top D$; then
$$\mathrm{Tr}((A - \sigma_{\min}(A) I) B) = \mathrm{Tr}(C^\top C D^\top D) = \mathrm{Tr}(C D^\top D C^\top) = \mathrm{Tr}((D C^\top)^\top (D C^\top)) \ge 0. \tag{19}$$

Theorem 1. Suppose the forward pass of the decomposed convolutional layer for the $u$-th model is $Z_u = \alpha X D_u$; $Z_u$, $Z_v$ nearly have zero mean since $X_p$ is preprocessed to be normalized. The CCA coefficient is defined as $S(Z_u, Z_v) = \big(\frac{1}{c} \sum_{i=1}^c \sigma_i^2\big)^{1/2}$, where $\sigma_i$ denotes the $i$-th singular value of $\Lambda_{u,v} = Q_u^\top Q_v$, $Q_u = Z_u (Z_u^\top Z_u)^{-1/2}$. Then $S(Z_u, Z_v)$ is upper bounded:
$$S(Z_u, Z_v) \le \frac{\gamma T c^{3/2}}{C} \cos(D_u, D_v),$$
where $T = \mathrm{Tr}(X^\top \alpha^\top \alpha X)$ and $C = \sigma_{\min}(X^\top \alpha^\top \alpha X)$.

Proof. Consider
$$S^2 = \frac{1}{c} \sum_{i=1}^c \sigma_i^2 = \frac{1}{c} \mathrm{Tr}(\Lambda_{u,v} \Lambda_{u,v}^\top), \tag{22}$$
where
$$\mathrm{Tr}(\Lambda_{u,v} \Lambda_{u,v}^\top) = \mathrm{Tr}(Q_u^\top Q_v Q_v^\top Q_u) = \mathrm{Tr}(Q_v Q_v^\top Q_u Q_u^\top). \tag{23}$$
As defined above, we have
$$Q_u Q_u^\top = Z_u (Z_u^\top Z_u)^{-1/2} (Z_u^\top Z_u)^{-1/2} Z_u^\top = Z_u (Z_u^\top Z_u)^{-1} Z_u^\top, \quad Q_v Q_v^\top = Z_v (Z_v^\top Z_v)^{-1} Z_v^\top. \tag{24}$$
Then Equation 23 becomes
$$\mathrm{Tr}(\Lambda_{u,v} \Lambda_{u,v}^\top) = \mathrm{Tr}(Z_u (Z_u^\top Z_u)^{-1} Z_u^\top Z_v (Z_v^\top Z_v)^{-1} Z_v^\top) = \mathrm{Tr}((Z_u^\top Z_u)^{-1} Z_u^\top Z_v (Z_v^\top Z_v)^{-1} Z_v^\top Z_u). \tag{25}$$
By the Cauchy-Schwarz inequality,
$$\mathrm{Tr}(\Lambda_{u,v} \Lambda_{u,v}^\top) \le \mathrm{Tr}((Z_u^\top Z_u)^{-1})\, \mathrm{Tr}((Z_v^\top Z_v)^{-1})\, \mathrm{Tr}(Z_u^\top Z_v)^2. \tag{26}$$
We then analyze these terms individually:
$$\mathrm{Tr}(Z_u^\top Z_v) = \mathrm{Tr}(D_u^\top X^\top \alpha^\top \alpha X D_v) = \mathrm{Tr}(X^\top \alpha^\top \alpha X D_v D_u^\top) \le \mathrm{Tr}(X^\top \alpha^\top \alpha X)\, \mathrm{Tr}(D_u^\top D_v) \le T \cdot \mathrm{Tr}(D_u^\top D_v).$$
As for $\mathrm{Tr}((Z_u^\top Z_u)^{-1})$, let $\lambda_1, \lambda_2, \dots, \lambda_c$ be the eigenvalues of $Z_u^\top Z_u$ listed in descending order ($\lambda_1 \ge \lambda_2 \ge \dots$
Assume the condition numbers of $Z_u^\top Z_u$ and $Z_v^\top Z_v$ satisfy $\lambda_{\max}/\lambda_{\min} \le \gamma$. Then
$$\mathrm{Tr}\big((Z_u^\top Z_u)^{-1}\big) = \sum_{i=1}^c\frac{1}{\lambda_i} \le c\cdot\frac{1}{\lambda_c} \le \frac{\gamma c}{\lambda_1}, \tag{30}$$
where $\lambda_1 = \|Z_u^\top Z_u\|_2$ and $\|\cdot\|_2$ denotes the operator norm induced by the vector $L_2$-norm. With the norm inequalities for any positive semidefinite matrix $A$,
$$\|A\|_2 \ge \frac{1}{\sqrt c}\|A\|_F \ge \frac1c\|A\|_* \ge \frac1c\mathrm{Tr}(A),$$
where $\|\cdot\|_F$ and $\|\cdot\|_*$ denote the Frobenius norm and the nuclear norm, respectively, Equation (30) then becomes
$$\mathrm{Tr}\big((Z_u^\top Z_u)^{-1}\big) \le \gamma c\cdot\frac{1}{\|Z_u^\top Z_u\|_2} \le \frac{\gamma c^2}{\mathrm{Tr}(Z_u^\top Z_u)}.$$
By Lemma 1,
$$\mathrm{Tr}(Z_u^\top Z_u) = \mathrm{Tr}(D_u^\top X^\top\alpha^\top\alpha XD_u) = \mathrm{Tr}(X^\top\alpha^\top\alpha XD_uD_u^\top) \ge \sigma_{\min}(X^\top\alpha^\top\alpha X)\,\mathrm{Tr}(D_u^\top D_u) = C\cdot\mathrm{Tr}(D_u^\top D_u) = C\cdot\|\mathrm{vec}(D_u)\|_2^2,$$
where $\mathrm{vec}(\cdot)$ denotes the vectorization of a matrix. Then Equation 30 is further derived as
$$\mathrm{Tr}\big((Z_u^\top Z_u)^{-1}\big) \le \frac{\gamma c^2}{C\cdot\|\mathrm{vec}(D_u)\|_2^2}.$$
Similarly, we have
$$\mathrm{Tr}\big((Z_v^\top Z_v)^{-1}\big) \le \frac{\gamma c^2}{C\cdot\|\mathrm{vec}(D_v)\|_2^2}.$$
Finally, with $\mathrm{Tr}(D_u^\top D_v) = \langle\mathrm{vec}(D_u),\mathrm{vec}(D_v)\rangle$, we have
$$\mathrm{Tr}(\Lambda_{u,v}\Lambda_{u,v}^\top) \le \frac{\gamma^2T^2c^4\,\langle\mathrm{vec}(D_u),\mathrm{vec}(D_v)\rangle^2}{C^2\,\|\mathrm{vec}(D_u)\|_2^2\cdot\|\mathrm{vec}(D_v)\|_2^2} = \frac{\gamma^2T^2c^4}{C^2}\cos^2(D_u,D_v),$$
and thus
$$S(Z_u,Z_v) = \sqrt{\frac1c\mathrm{Tr}(\Lambda_{u,v}\Lambda_{u,v}^\top)} \le \frac{\gamma Tc^{\frac32}}{C}\cos(D_u,D_v).$$
The claimed theorem is proved.

Lemma 2. For two matrices $A$, $B$, the Frobenius norm of their product satisfies
$$\|AB\|_F = \|A\|_F\|B\|_F\sqrt{1 - \frac{\Delta_1}{\|A\|_F^2\|B\|_F^2}},$$
where $\Delta_1 = \sum_{ij}\big(\sum_kA_{ik}^2\big)\big(\sum_kB_{kj}^2\big)\sin^2(\langle A_{i:},B_{:j}\rangle)$.

Proof. According to the definition of the Frobenius norm, $\|A\|_F = \sqrt{\sum_{ij}|A_{ij}|^2}$, we have
$$\|AB\|_F = \sqrt{\sum_{ij}\Big(\sum_kA_{ik}B_{kj}\Big)^2}.$$
Note that
$$\Big(\sum_i x_iy_i\Big)^2 = \Big(\sum_ix_i^2\Big)\Big(\sum_iy_i^2\Big)\cos^2(\langle x,y\rangle) = \Big(\sum_ix_i^2\Big)\Big(\sum_iy_i^2\Big) - \Big(\sum_ix_i^2\Big)\Big(\sum_iy_i^2\Big)\sin^2(\langle x,y\rangle),$$
where $\langle x,y\rangle$ is the angle between the two vectors $x$ and $y$.
We then have
$$\sum_{ij}\Big(\sum_kA_{ik}B_{kj}\Big)^2 = \sum_{ij}\Big[\Big(\sum_kA_{ik}^2\Big)\Big(\sum_kB_{kj}^2\Big) - \Big(\sum_kA_{ik}^2\Big)\Big(\sum_kB_{kj}^2\Big)\sin^2(\langle A_{i:},B_{:j}\rangle)\Big]$$
$$= \sum_{ik}A_{ik}^2\sum_{kj}B_{kj}^2\Bigg[1 - \frac{\sum_{ij}\big(\sum_kA_{ik}^2\big)\big(\sum_kB_{kj}^2\big)\sin^2(\langle A_{i:},B_{:j}\rangle)}{\sum_{ik}A_{ik}^2\sum_{kj}B_{kj}^2}\Bigg] = \|A\|_F^2\|B\|_F^2\Big(1 - \frac{\Delta_1}{\|A\|_F^2\|B\|_F^2}\Big),$$
where $A_{i:}$ is the $i$-th row of $A$, $B_{:j}$ is the $j$-th column of $B$, and $\Delta_1 = \sum_{ij}\big(\sum_kA_{ik}^2\big)\big(\sum_kB_{kj}^2\big)\sin^2(\langle A_{i:},B_{:j}\rangle)$. Taking the square root proves the claim. As $A_{i:}$ and $B_{:j}$ become more correlated, $\langle A_{i:},B_{:j}\rangle \to 0$, and thus $\Delta_1 \ll \|A\|_F^2\|B\|_F^2$.

Lemma 3. $\|A^{1/2}\|_F = \|A\|_F^{1/2}\Big(1 + \frac{\Delta_{1,A^{1/2}}}{\|A\|_F^2}\Big)^{1/4}$.

Proof. Applying Lemma 2 to $A = A^{1/2}A^{1/2}$, we have $\|A\|_F^2 = \|A^{1/2}\|_F^4 - \Delta_{1,A^{1/2}}$. Thus
$$\|A^{1/2}\|_F = \|A\|_F^{1/2}\Big(1 + \frac{\Delta_{1,A^{1/2}}}{\|A\|_F^2}\Big)^{1/4},$$
where $\Delta_{1,A^{1/2}} = \sum_{ij}\big(\sum_k(A^{1/2})_{ik}^2\big)\big(\sum_k(A^{1/2})_{kj}^2\big)\sin^2(\langle(A^{1/2})_{i:},(A^{1/2})_{:j}\rangle)$. As $(A^{1/2})_{i:}$ and $(A^{1/2})_{:j}$ become more correlated, $\langle(A^{1/2})_{i:},(A^{1/2})_{:j}\rangle \to 0$, and thus $\Delta_{1,A^{1/2}} \ll \|A\|_F^2$.

Lemma 4. For three matrices $A$, $B$, and $C$, the Frobenius norm of their product satisfies
$$\|ABC\|_F = \|A\|_F\|B\|_F\|C\|_F\sqrt{1 - \frac{\Delta_2+\Delta_3}{\|A\|_F^2\|B\|_F^2\|C\|_F^2}},$$
where
$$\Delta_2 = \frac12\Big[\|A\|_F^2\sum_{kj}\Big(\sum_lB_{kl}^2\Big)\Big(\sum_lC_{lj}^2\Big)\sin^2(\langle B_{k:},C_{:j}\rangle) + \|C\|_F^2\sum_{il}\Big(\sum_kA_{ik}^2\Big)\Big(\sum_kB_{kl}^2\Big)\sin^2(\langle A_{i:},B_{:l}\rangle)\Big]$$
and
$$\Delta_3 = \frac12\Big[\sum_{ij}\Big(\sum_kA_{ik}^2\Big)\Big(\sum_k(BC)_{kj}^2\Big)\sin^2(\langle A_{i:},(BC)_{:j}\rangle) + \sum_{ij}\Big(\sum_l(AB)_{il}^2\Big)\Big(\sum_lC_{lj}^2\Big)\sin^2(\langle(AB)_{i:},C_{:j}\rangle)\Big].$$

Proof.
Based on Lemma 2, we have
$$\|ABC\|_F^2 = \|AB\|_F^2\|C\|_F^2 - \sum_{ij}\Big(\sum_l(AB)_{il}^2\Big)\Big(\sum_lC_{lj}^2\Big)\sin^2(\langle(AB)_{i:},C_{:j}\rangle)$$
$$= \|A\|_F^2\|B\|_F^2\|C\|_F^2 - \|C\|_F^2\sum_{il}\Big(\sum_kA_{ik}^2\Big)\Big(\sum_kB_{kl}^2\Big)\sin^2(\langle A_{i:},B_{:l}\rangle) - \sum_{ij}\Big(\sum_l(AB)_{il}^2\Big)\Big(\sum_lC_{lj}^2\Big)\sin^2(\langle(AB)_{i:},C_{:j}\rangle).$$
Symmetrically, we also have
$$\|ABC\|_F^2 = \|A\|_F^2\|BC\|_F^2 - \sum_{ij}\Big(\sum_kA_{ik}^2\Big)\Big(\sum_k(BC)_{kj}^2\Big)\sin^2(\langle A_{i:},(BC)_{:j}\rangle)$$
$$= \|A\|_F^2\|B\|_F^2\|C\|_F^2 - \|A\|_F^2\sum_{kj}\Big(\sum_lB_{kl}^2\Big)\Big(\sum_lC_{lj}^2\Big)\sin^2(\langle B_{k:},C_{:j}\rangle) - \sum_{ij}\Big(\sum_kA_{ik}^2\Big)\Big(\sum_k(BC)_{kj}^2\Big)\sin^2(\langle A_{i:},(BC)_{:j}\rangle).$$
Averaging the two expansions yields
$$\|ABC\|_F^2 = \|A\|_F^2\|B\|_F^2\|C\|_F^2 - \Delta_2 - \Delta_3,$$
with $\Delta_2$ and $\Delta_3$ as defined in the statement of the lemma. Therefore,
$$\|ABC\|_F = \|A\|_F\|B\|_F\|C\|_F\sqrt{1 - \frac{\Delta_2+\Delta_3}{\|A\|_F^2\|B\|_F^2\|C\|_F^2}}.$$
As $A_{i:}$ and $B_{:l}$, and $B_{k:}$ and $C_{:j}$, become more correlated, $\langle A_{i:},B_{:l}\rangle, \langle B_{k:},C_{:j}\rangle, \langle A_{i:},(BC)_{:j}\rangle, \langle(AB)_{i:},C_{:j}\rangle \to 0$, and thus $\Delta_2 \ll \|A\|_F^2\|B\|_F^2\|C\|_F^2$ and $\Delta_3 \ll \|A\|_F^2\|B\|_F^2\|C\|_F^2$.

Lemma 5.
$$\|A^{-1/2}BC^{-1/2}\|_F = \kappa_F(A^{1/2})\,\kappa_F(C^{1/2})\,\frac{\|B\|_F}{\|A^{1/2}\|_F\|C^{1/2}\|_F}\sqrt{1 - \frac{\Delta_2+\Delta_3}{\|A^{-1/2}\|_F^2\|B\|_F^2\|C^{-1/2}\|_F^2}},$$
where $\kappa_F(A^{1/2})$ and $\kappa_F(C^{1/2})$ are the condition numbers of $A^{1/2}$ and $C^{1/2}$,
$$\kappa_F(A^{1/2}) = \sqrt{\Big(\sum_i\sigma_i^2(A^{1/2})\Big)\Big(\sum_i\frac{1}{\sigma_i^2(A^{1/2})}\Big)}, \qquad \kappa_F(C^{1/2}) = \sqrt{\Big(\sum_i\sigma_i^2(C^{1/2})\Big)\Big(\sum_i\frac{1}{\sigma_i^2(C^{1/2})}\Big)},$$
and $\sigma_i(A^{1/2})$, $\sigma_i(C^{1/2})$ are the singular values of $A^{1/2}$ and $C^{1/2}$, respectively.

Proof.
Based on Lemma 4, we have
$$\|A^{-1/2}BC^{-1/2}\|_F = \|A^{-1/2}\|_F\|B\|_F\|C^{-1/2}\|_F\sqrt{1 - \frac{\Delta_2+\Delta_3}{\|A^{-1/2}\|_F^2\|B\|_F^2\|C^{-1/2}\|_F^2}}.$$
By the definition of the condition number, $\kappa_F(X) = \|X\|_F\|X^{-1}\|_F = \sqrt{\big(\sum_i\sigma_i^2(X)\big)\big(\sum_i\frac{1}{\sigma_i^2(X)}\big)}$, we have $\|A^{-1/2}\|_F = \kappa_F(A^{1/2})/\|A^{1/2}\|_F$ and $\|C^{-1/2}\|_F = \kappa_F(C^{1/2})/\|C^{1/2}\|_F$, and hence
$$\|A^{-1/2}BC^{-1/2}\|_F = \kappa_F(A^{1/2})\,\kappa_F(C^{1/2})\,\frac{\|B\|_F}{\|A^{1/2}\|_F\|C^{1/2}\|_F}\sqrt{1 - \frac{\Delta_2+\Delta_3}{\|A^{-1/2}\|_F^2\|B\|_F^2\|C^{-1/2}\|_F^2}}.$$

Theorem 2. Suppose the forward pass of the decomposed convolution layer for the $u$-th model is $Z_u = \alpha XD_u$, and let the CCA coefficient be $S(Z_u,Z_v) = \big(\frac1c\sum_{i=1}^c\sigma_i^2\big)^{1/2}$, where $\sigma_i^2$ denotes the $i$-th eigenvalue of $\Lambda_{u,v}\Lambda_{u,v}^\top$, with $\Lambda_{u,v} = Q_u^\top Q_v$ and $Q_u = Z_u(Z_u^\top Z_u)^{-\frac12}$. Then $S(Z_u,Z_v)$ is approximately linear in the atom-based similarity,
$$S(Z_u,Z_v) = \frac{\gamma_1\gamma_2\gamma_3}{\sqrt c}\cos(D_u,D_v).$$

Proof. Based on $S(Z_u,Z_v) = \big(\frac1c\sum_{i=1}^c\sigma_i^2\big)^{1/2}$ and $\|\Lambda_{u,v}\|_F = \sqrt{\sum_{i=1}^c\sigma_i^2}$, where $\sigma_i$ are the singular values of $\Lambda_{u,v}$,
$$S = \Big(\frac1c\sum_{i=1}^c\sigma_i^2\Big)^{1/2} = \frac{1}{\sqrt c}\|\Lambda_{u,v}\|_F = \frac{1}{\sqrt c}\big\|(Z_u^\top Z_u)^{-\frac12}Z_u^\top Z_v(Z_v^\top Z_v)^{-\frac12}\big\|_F.$$
According to Lemma 5, we have
$$\frac{1}{\sqrt c}\big\|(Z_u^\top Z_u)^{-\frac12}Z_u^\top Z_v(Z_v^\top Z_v)^{-\frac12}\big\|_F = \frac{\gamma_1\gamma_2}{\sqrt c}\,\frac{\|Z_u^\top Z_v\|_F}{\|(Z_u^\top Z_u)^{\frac12}\|_F\,\|(Z_v^\top Z_v)^{\frac12}\|_F},$$
where $\gamma_1 = \kappa_F\big((Z_u^\top Z_u)^{\frac12}\big)\cdot\kappa_F\big((Z_v^\top Z_v)^{\frac12}\big)$ and $\gamma_2 = \sqrt{1 - \frac{\Delta_2+\Delta_3}{\|(Z_u^\top Z_u)^{-1/2}\|_F^2\,\|Z_u^\top Z_v\|_F^2\,\|(Z_v^\top Z_v)^{-1/2}\|_F^2}}$. As $Z_u = \alpha XD_u$ and $Z_v = \alpha XD_v$, we have
$$\frac{\gamma_1\gamma_2}{\sqrt c}\,\frac{\|Z_u^\top Z_v\|_F}{\|(Z_u^\top Z_u)^{\frac12}\|_F\,\|(Z_v^\top Z_v)^{\frac12}\|_F} = \frac{\gamma_1\gamma_2}{\sqrt c}\,\frac{\|D_u^\top X^\top\alpha^\top\alpha XD_v\|_F}{\|(D_u^\top X^\top\alpha^\top\alpha XD_u)^{\frac12}\|_F\,\|(D_v^\top X^\top\alpha^\top\alpha XD_v)^{\frac12}\|_F}.$$
According to Lemma 3,
$$\frac{\gamma_1\gamma_2}{\sqrt c}\,\frac{\|D_u^\top X^\top\alpha^\top\alpha XD_v\|_F}{\|(D_u^\top X^\top\alpha^\top\alpha XD_u)^{\frac12}\|_F\,\|(D_v^\top X^\top\alpha^\top\alpha XD_v)^{\frac12}\|_F} = \frac{\gamma_1\gamma_2\gamma_3}{\sqrt c}\,\frac{\|D_u^\top X^\top\alpha^\top\alpha XD_v\|_F}{\|D_u^\top X^\top\alpha^\top\alpha XD_u\|_F^{\frac12}\,\|D_v^\top X^\top\alpha^\top\alpha XD_v\|_F^{\frac12}},$$
where $\gamma_3 = \Big(1 + \frac{\Delta_1}{\|D_u^\top X^\top\alpha^\top\alpha XD_u\|_F^2}\Big)^{-\frac14}\Big(1 + \frac{\Delta_1}{\|D_v^\top X^\top\alpha^\top\alpha XD_v\|_F^2}\Big)^{-\frac14}$.
As Assumption 1 holds, this becomes
$$\frac{\gamma_1\gamma_2\gamma_3}{\sqrt c}\,\frac{\|D_u^\top X^\top\alpha^\top\alpha XD_v\|_F}{\|D_u^\top X^\top\alpha^\top\alpha XD_u\|_F^{\frac12}\,\|D_v^\top X^\top\alpha^\top\alpha XD_v\|_F^{\frac12}} = \frac{\gamma_1\gamma_2\gamma_3}{\sqrt c}\,\frac{\|D_u^\top D_v\|_F\,\|X^\top\alpha^\top\alpha X\|_F}{\|D_u^\top\|_F^{\frac12}\|X^\top\alpha^\top\alpha X\|_F^{\frac12}\|D_u\|_F^{\frac12}\cdot\|D_v^\top\|_F^{\frac12}\|X^\top\alpha^\top\alpha X\|_F^{\frac12}\|D_v\|_F^{\frac12}} = \frac{\gamma_1\gamma_2\gamma_3}{\sqrt c}\,\frac{\|D_u^\top D_v\|_F}{\|D_u\|_F\|D_v\|_F} = \frac{\gamma_1\gamma_2\gamma_3}{\sqrt c}\cos(D_u,D_v).$$
Thus, we have
$$S(Z_u,Z_v) = \frac{\gamma_1\gamma_2\gamma_3}{\sqrt c}\cos(D_u,D_v),$$
and the claimed theorem is proved.

Evolving shared atom coefficients. Apart from using pre-trained models or other FL approaches, we can obtain shared atom coefficients by evolving them during communication. In FL, the server aggregates models from every involved client at each communication round. Our method parameterizes each client model with shared atom coefficients and client-specific atoms. At each communication round, the clients train both atoms and atom coefficients on their locally stored data. The server then aggregates only the atom coefficients of the selected clients to obtain updated coefficients. In this way, we obtain coefficients shared across clients.

Comparison with other FL approaches. We compare our approach of evolving shared atom coefficients with various personalized federated learning methods and federated learning methods with local fine-tuning. Among these methods, FedPer (Arivazhagan et al., 2019) and FedRep (Collins et al., 2021) share a similar idea of learning a shared global representation together with personalized local heads; FedRep captures common knowledge with the shared representation. Ditto (Li et al., 2021) and FedProx (Li et al., 2020b) introduce global regularization to improve model performance. We also compare our method with FedAvg (McMahan et al., 2017). The code is adapted from the official FedRep repository. We evaluate the test accuracy on CIFAR-10 and CIFAR-100 under different FL settings. As shown in Table 3, our method achieves comparable performance among the compared methods.

Fine-tuning models for ensemble.
We select 3 models under different similarity measures for the ensemble. For stimulus-based similarity methods, we randomly select 1,000 examples from the CIFAR-100 dataset. The fully-connected layer of each model is fine-tuned on the user's local data for 100 epochs; the fine-tuning takes about 12 hours on an Nvidia RTX A5000. After fine-tuning, accuracy is measured on the local test data using the predictions of the current model together with the 3 selected models.

Similar representations across datasets. Similar to Kornblith et al. (2019), we can use atom-based similarity to compare networks trained on different datasets. In Figure 9(a), we show that pairs of models trained on CIFAR-10 and on CIFAR-100 have high atom-based similarities, and models learned on the two datasets respectively still show high similarity to each other. In contrast, similarities between trained and untrained models are significantly lower.


As illustrated in Figure 9(b), consider input data $\{(x_i = 0, y_i, z_i)\}_{i=1}^n$, where $z_i = f(x_i, y_i) + \epsilon_i$ and $y_i, \epsilon_i \sim \mathcal{N}(0.5, 0.1)$. Two NN models $F_1$ and $F_2$ with the same initialization and atom coefficients are trained, through their different atoms, to learn $F : (X, Y) \to Z$. It is easy to see that the atom-based similarity of $F_1$ and $F_2$ is 1, and the stimulus-based similarity is also 1 when the same "red" data $\{(x_i = 0, y_i)\}$ is used as probing data. However, if we choose the "blue" data $\{(x'_i = y_i, y'_i = 0)\}$ as stimuli, the stimulus-based similarities directly become 0, as the probing data are now orthogonal to the model parameters. Since both models are learned on the "red" data, their similarity should be 1, which only our atom similarity faithfully indicates.

Training dynamics. We investigate the training dynamics of AlexNet (Krizhevsky et al., 2012) and VGG (Simonyan & Zisserman, 2014) on CIFAR-100 (Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015), respectively. The training dynamics of models with atoms from different time points during training are shown in Figure 10 and Figure 11. Moreover, we examine the similarity between two participating models that share the same initialization and are trained, with only their atoms updated, on two different tasks. The results are shown in Figure 12.



https://github.com/lgcollins/FedRep



Figure 1: Comparison between our method and stimulus-based methods. (left) Stimulus-based similarity metrics, e.g., CCA, rely on probing data and calculate the correlation between large groups of features generated by forwarding the probing data through NNs. (right) In comparison, our atom-based method decomposes convolutional filters W into filter atoms D and atom coefficients α, W = α × D, and calculates similarity only over a small portion of the parameters, i.e., the atoms, which is stimulus-independent and computationally efficient. The proposed atom-based method achieves a computation reduction of millions of times compared with popular stimulus-based methods.
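As a concrete reference for the computation this caption describes, the atom-based similarity reduces to a single cosine between vectorized atoms. A minimal sketch (the function name is ours, not from the paper's code):

```python
import numpy as np

def atom_similarity(D_u, D_v):
    """Cosine similarity between two sets of vectorized filter atoms."""
    du, dv = D_u.ravel(), D_v.ravel()
    return float(du @ dv / (np.linalg.norm(du) * np.linalg.norm(dv)))

rng = np.random.default_rng(6)
D = rng.normal(size=(9, 6))              # e.g. flattened 3x3 atoms, m = 6
assert np.isclose(atom_similarity(D, D), 1.0)
assert np.isclose(atom_similarity(D, -D), -1.0)
```

The cost is O(k²m) per layer pair and needs no probing data, which is where the claimed computation reduction over stimulus-based metrics comes from.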

Figure 2: (a) Correlation between Grassmann similarity and atom-based similarity; (b) Correlation between CCA and atom-based similarity. (Table) Correlation between atom-based similarity and other approaches.

Figure 4: (a) The change of features ∥Z_u − Z_v∥_F is bounded by the change of atoms ∥D_u − D_v∥_F. (b) Channel decorrelation leads to a higher correlation between CCA and atom-based similarity; the correlation reaches 0.985 with β = 3 × 10⁻³, indicating a near-linear relation between CCA and atom-based similarity. (Table) The performance of stimulus-based similarities can be compromised by poorly selected stimulus data: two models trained on CIFAR-100 have high CCA and CKA similarities with stimuli from CIFAR-100 but low similarities with stimuli from SVHN. In contrast, our atom-based similarity does not depend on stimulus data and shows a high similarity between the two networks, as expected.
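The bound in panel (a) can be sanity-checked numerically in the simplest setting of Proposition 1: a single spatial window (|B| = 1) with λ = ∥X∥ and a ReLU (non-expansive) activation. This is an illustrative toy check under those simplifying assumptions, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
k2, m = 9, 6                        # flattened 3x3 window, m filter atoms
X = rng.normal(size=k2)             # one input window (|B| = 1)
alpha = rng.normal(size=m)          # shared atom coefficients
D_u = rng.normal(size=(k2, m))      # atoms of model u
D_v = rng.normal(size=(k2, m))      # atoms of model v

relu = lambda t: np.maximum(t, 0.0)  # non-expansive activation

# Features of the two models sharing the coefficients alpha.
Z_u = relu(alpha @ (D_u.T @ X))
Z_v = relu(alpha @ (D_v.T @ X))

# Bound of Proposition 1 with |B| = 1 and lambda = ||X||.
lam = np.linalg.norm(X)
bound = np.linalg.norm(alpha) * lam * np.linalg.norm(D_u - D_v, "fro")
assert abs(Z_u - Z_v) <= bound + 1e-12
```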

This supports Assumption 1: as shown in Figure 4(b), the correlation between CCA and atom-based similarity keeps increasing as β increases, reaching 0.985 when β = 3 × 10⁻³. This indicates a near-linear relationship, which is aligned with Theorem 2.
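For readers who want to reproduce the CCA side of this comparison, the coefficient S of Theorems 1 and 2 can be computed directly from its definition. The sketch below uses a random stand-in for αX, checks the identity S = ∥Λ_{u,v}∥_F/√c used in the proof of Theorem 2, and computes the atom-based cosine alongside it (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, c = 500, 16, 8
AX = rng.normal(size=(n, d))                # stand-in for alpha X
D_u = rng.normal(size=(d, c))               # atoms of model u
D_v = D_u + 0.1 * rng.normal(size=(d, c))   # slightly perturbed atoms

Z_u, Z_v = AX @ D_u, AX @ D_v

def whiten(Z):
    # Q = Z (Z^T Z)^{-1/2} via the eigendecomposition of Z^T Z
    e, V = np.linalg.eigh(Z.T @ Z)
    return Z @ (V * e**-0.5) @ V.T

Lam = whiten(Z_u).T @ whiten(Z_v)
sigma = np.linalg.svd(Lam, compute_uv=False)

S_from_sv = np.sqrt((sigma**2).mean())              # (1/c sum sigma_i^2)^{1/2}
S_from_fro = np.linalg.norm(Lam, "fro") / np.sqrt(c)
assert np.isclose(S_from_sv, S_from_fro)

cos_atom = (D_u.ravel() @ D_v.ravel()) / (
    np.linalg.norm(D_u) * np.linalg.norm(D_v))
# S and the atom-based cosine can now be compared empirically.
```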

Figure 5: Layer-wise similarity matrices showing relations of model parameters at different training time points. (a)(b)(c) are the 1st, 3rd, and 5th convolutional layers of AlexNet trained on CIFAR-100; (d)(e)(f) are the 1st, 4th, and 8th convolutional layers of VGG11 trained on ImageNet. We mark with white lines the epoch when a parameter reaches 0.99/0.999 similarity to its final state. For both models, we observe bottom-up learning dynamics: layers closer to the input solidify into their final states faster than the top layers, in accord with previous studies (Raghu et al., 2017; Morcos et al., 2018).
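The white lines can be reproduced by scanning the checkpoints of a layer for the first epoch whose atoms reach the similarity threshold with the final state. A toy sketch with a synthetic checkpoint trajectory (names and data are ours, not the paper's):

```python
import numpy as np

def convergence_epoch(checkpoints, threshold=0.99):
    """Index of the first checkpoint whose atoms reach `threshold`
    cosine similarity with the final checkpoint's atoms."""
    final = checkpoints[-1].ravel()
    final = final / np.linalg.norm(final)
    for t, D in enumerate(checkpoints):
        d = D.ravel() / np.linalg.norm(D.ravel())
        if d @ final >= threshold:
            return t
    return len(checkpoints) - 1

# Toy trajectory: atoms interpolate from a random init to a fixed target.
rng = np.random.default_rng(8)
init, target = rng.normal(size=(9, 6)), rng.normal(size=(9, 6))
ckpts = [(1 - t) * init + t * target for t in np.linspace(0, 1, 11)]
e = convergence_epoch(ckpts)
assert 0 < e <= 10
```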

Figure 6: Similarity matrices showing relations among 60 users in FL with our atom-based similarity through the training process. The x-axis labels represent the IDs of the CIFAR tasks. User clusters are clearly visible in all three figures. Specifically, the last 20 clients with SVHN data show higher similarities among themselves than with the first 40 clients with CIFAR data, while every five of the first 40 clients sharing the same CIFAR task also show high similarities among themselves.

We separate the data across 120 clients. Specifically, SVHN is randomly split into 20 tasks of 5 classes each, and CIFAR-100 is split in class order into 20 tasks of 5 classes each. 20 clients are trained on the 20 SVHN tasks, one task per client, while the other 100 clients are trained on the 20 CIFAR-100 tasks, where every 5 clients equally share the data of one task by partitioning it in a class-balanced way. The models share the same random initialization, and filter atoms are trained independently without communication among clients. The experimental details are described in Appendix A.2.
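The class-based task split described above can be sketched as follows (a simplified stand-in, not the paper's data pipeline):

```python
import numpy as np

def split_by_class(labels, n_tasks=20, classes_per_task=5, rng=None):
    """Split a labeled dataset into disjoint tasks of a few classes each.

    Classes are taken in order (as for CIFAR-100 in the text); pass an
    rng to shuffle classes first (as for the random SVHN split).
    """
    classes = np.arange(n_tasks * classes_per_task)
    if rng is not None:
        rng.shuffle(classes)
    tasks = classes.reshape(n_tasks, classes_per_task)
    return [np.flatnonzero(np.isin(labels, t)) for t in tasks]

labels = np.repeat(np.arange(100), 10)      # toy CIFAR-100-like labels
tasks = split_by_class(labels)
assert len(tasks) == 20
assert sum(len(t) for t in tasks) == len(labels)   # disjoint and exhaustive
```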

Figure 6 shows the atom-based similarity among the last 40 clients with CIFAR-100 tasks and the 20 clients with SVHN tasks. In Figure 6(c), where clients are trained for 90 epochs, we can clearly see the cluster of 20 SVHN clients: the SVHN clients have a higher similarity among themselves but are dissimilar from the CIFAR clients. Every 5 CIFAR clients who share the same task also have a high similarity among themselves. The clusters already appear at an early stage of training in Figure 6(a)(b), so the method can be applied to quickly find clusters in FL (Tan et al., 2022). The full results for all 120 clients are shown in Appendix Figure 8. We compare atom-based similarity with CCA and CKA; note that stimulus-based similarities need probing data, which could violate the privacy requirements of FL.
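The similarity matrices of Figure 6 are built by evaluating the atom-based cosine between every pair of clients' atoms. A minimal sketch with two synthetic clusters (all names and data are ours):

```python
import numpy as np

def similarity_matrix(client_atoms):
    """Pairwise atom-based (cosine) similarity among clients' filter atoms."""
    V = np.stack([D.ravel() for D in client_atoms])
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V @ V.T

rng = np.random.default_rng(7)
base_a, base_b = rng.normal(size=(9, 6)), rng.normal(size=(9, 6))
# Two clusters of clients: small perturbations around two base atom sets.
atoms = [base_a + 0.01 * rng.normal(size=(9, 6)) for _ in range(3)] + \
        [base_b + 0.01 * rng.normal(size=(9, 6)) for _ in range(3)]
S = similarity_matrix(atoms)
# Within-cluster similarity is near 1; the block structure reveals clusters.
assert S[0, 1] > 0.99 and S[3, 4] > 0.99
assert abs(S[0, 3]) < 0.7   # across clusters, similarity is much lower
```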

Figure 7: The shared coefficients and user-specific atoms represent common knowledge and personalized information. The atom-based similarity is used to calculate the relations among users. Users with heterogeneous data result in lower similarity, as illustrated in a similarity matrix.
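The aggregation step this figure summarizes, averaging only the shared coefficients while atoms stay client-specific, can be sketched as follows (a simplified stand-in for the actual FL training loop; field names are ours):

```python
import numpy as np

def server_round(clients):
    """Aggregate only the atom coefficients; atoms stay client-specific.

    Each client is a dict with 'alpha' (atom coefficients) and 'atoms'
    (filter atoms); the structure is illustrative, not the paper's code.
    """
    shared_alpha = np.mean([c["alpha"] for c in clients], axis=0)
    for c in clients:
        c["alpha"] = shared_alpha      # broadcast the updated coefficients
    return shared_alpha

rng = np.random.default_rng(4)
clients = [{"alpha": rng.normal(size=(3, 4)),
            "atoms": rng.normal(size=(9, 4))} for _ in range(5)]
alpha = server_round(clients)
assert all(np.allclose(c["alpha"], alpha) for c in clients)
```

After enough rounds the coefficients are shared across clients, while the per-client atoms carry the personalized information used for the similarity analysis.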

Figure 8: Similarity matrices that show relations among 120 users in FL with our atom-based similarity through the training process.

Figure 9: (a) Using atom-based similarity, models trained on different datasets (CIFAR-10 and CIFAR-100) are similar among themselves, but they differ from untrained models. (b) Illustration of the limitations of stimulus-based similarities. Input data from "red" ({(x_i = 0, y_i)}) and "blue" ({(x'_i = y_i, y'_i = 0)}) are orthogonal. Since the two models are learned on "red" data, their similarity should be 1, which is faithfully indicated by our atom similarity. However, stimulus-based similarities become 0 with the "blue" probing data.
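The failure mode in panel (b) is easy to reproduce with two linear toy models whose weights are supported only on the "red" coordinate: probing with "blue" data, which is orthogonal to the weights, yields all-zero features, so any stimulus-based similarity computed from them is degenerate (a toy illustration, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
y = rng.normal(0.5, 0.1, size=n)
red = np.stack([np.zeros(n), y], axis=1)      # (x_i = 0, y_i)
blue = np.stack([y, np.zeros(n)], axis=1)     # (x'_i = y_i, y'_i = 0)

# Two linear "models" fitted on red data: the weight on x is zero.
w1 = np.array([0.0, 1.3])
w2 = np.array([0.0, 0.7])

def feats(X, w):
    return X @ w

# Red probes give informative, perfectly correlated features.
f1, f2 = feats(red, w1), feats(red, w2)
corr = np.corrcoef(f1, f2)[0, 1]
assert np.isclose(corr, 1.0)

# Blue probes are orthogonal to the weights: all features vanish,
# so stimulus-based similarity computed from them is degenerate.
assert np.allclose(feats(blue, w1), 0) and np.allclose(feats(blue, w2), 0)
```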

Figure 10: Similarity of AlexNet with atoms from different time points during training.

Figure 11: Similarity of VGG with atoms from different time points during training.

Figure 12: Similarity of AlexNet trained on different tasks during training. The models trained on different tasks differ little in the first few layers but more in the middle layers, reflecting that the middle layers are more critical than the others, which is aligned with previous work (Neyshabur et al., 2020).

Figure 13: Similarity of VGG trained on different tasks during training.





Classification accuracy of model ensembles using different FL methods and model selection strategies: models are selected with different similarity measures in each setting. The model ensemble performance using our atom-based method is comparable with stimulus-based methods, while being millions of times faster and consuming far fewer resources.

Continual Learning Results. The model ensemble using our atom-based similarity shows better results than stimulus-based methods. Our similarity is also faster and consumes far fewer resources.

Comparison of accuracy with different approaches.

