INFORMATION DISTANCE FOR NEURAL NETWORK FUNCTIONS

Abstract

We provide a practical distance measure on the space of functions parameterized by neural networks. It is based on the classical information distance, and we propose to replace the uncomputable Kolmogorov complexity with information measured by the codelength of prequential coding. We also provide a method for directly estimating the expectation of this codelength from limited examples. Empirically, we show that information distance is invariant with respect to different parameterizations of the neural networks. We also verify that information distance can faithfully reflect similarities of neural network functions. Finally, we apply information distance to investigate the relationships between neural network models, and demonstrate the connection between information distance and multiple characteristics and behaviors of neural networks.

1. INTRODUCTION

Deep neural networks can be trained to represent complex functions that describe sophisticated input-output relationships, such as image classification and machine translation. Because the functions are highly non-linear and are parameterized in high-dimensional spaces, there is relatively little understanding of the functions represented by deep neural networks. One could interpret deep models by linear approximations (Ribeiro et al., 2016), or from the perspective of piece-wise linear functions, as in (Arora et al., 2018). If the space of functions representable by neural networks admits a distance measure, it would be a useful tool for analyzing and gaining insight into neural networks. A major difficulty is the vast number of ways of parameterizing a function, which makes it difficult to characterize the similarity of two networks. Measuring similarity in the parameter space is straightforward but is restricted to networks with the same structure. Measuring similarity at the output is likewise restricted to networks trained on the same task. Similarity of representations produced by intermediate layers of networks has proven more reliable and consistent (Kornblith et al., 2019), but is not invariant to linear transformations and can fail in some situations, as shown in our experiments. In this paper, we provide a distance measure on functions based on information distance (Bennett et al., 1998), which is independent of the parameterization of the neural network. This also removes the arbitrariness of choosing "where" to measure similarity in a neural network. Information distance has mostly been used in data mining (Cilibrasi & Vitányi, 2007; Zhang et al., 2007). Intuitively, information distance measures how much information is needed to transform one function into the other. We rely on prequential coding to estimate this quantity, as prequential coding can efficiently encode neural networks and datasets (Blier & Ollivier, 2018).
If we regard prequential coding as a compression algorithm for neural networks, then the codelength gives an upper bound on the information quantity in a model. We propose a method for calculating an approximate version of information distance with prequential coding for arbitrary networks. In this method, we use the KL-divergence in prequential training and coding, which allows us to directly estimate the expected codelength without any sampling process. We then perform experiments to demonstrate that this information distance is invariant to the parameterization of the network while also being faithful to the intrinsic similarity of models. Using information distance, we are able to sketch a rough view of the space of deep neural networks and uncover the relationships between datasets and models. We also find that information distance can help us understand regularization techniques, measure the diversity of models, and predict a model's ability to generalize.

2. METHODOLOGY

Information distance measures the difference between two objects by information quantity. The information distance between two functions f_A and f_B can be defined as (Bennett et al., 1998):

d(f_A, f_B) = max{ K(f_A|f_B), K(f_B|f_A) }    (1)

This definition makes use of Kolmogorov complexity: K(f_B|f_A) is the length of the shortest program that transforms f_A into f_B, and the information distance d is the larger length of the two directions. (Note that this is not the only way to define information distance with Kolmogorov complexity; we settle on this definition for its simplicity.) Intuitively, this is the minimum number of bits we need to encode f_B with the help of f_A, or how much information is needed to know f_B if f_A is already known. Given two functions f_A: X → Y and f_B: X → Y defined on the same input space X, each parameterized by a neural network with weights θ_A and θ_B, we want to estimate the information distance between f_A and f_B. The estimation of the Kolmogorov complexity terms is done by calculating the codelength of prequential coding, so what we get is an upper bound of d, which we denote by d_p (p for prequential coding).

2.1. ESTIMATING K(f_B|f_A) WITH PREQUENTIAL CODING

To send f_B to someone who already knows f_A, we generate predictions y_i from f_B using inputs x_i sampled from X. Assuming that {x_i} is known, we can use prequential coding to send the labels {y_i}. If we send enough labels, the receiver can use {x_i, y_i} to train a model that recovers f_B. If f_A and f_B have something in common, i.e. K(f_B|f_A) < K(f_B), then with the help of f_A we can reduce the codelength used to transmit f_B. A convenient way of doing so is to use θ_A as the initial model in prequential coding. The codelength of k samples is:

L_preq(y_{1:k}|x_{1:k}) := - Σ_{i=1}^k log p_{θ_i}(y_i | x_{1:i}, y_{1:i-1})    (2)

where θ_i is the parameter of the model trained on {x_{1:i-1}, y_{1:i-1}}, and θ_1 = θ_A. With sufficiently large k, the function parameterized by θ_k converges to f_B. If both f_A and f_B are classification models, we can sample y from the output distribution of f_B. In this case, the codelength (2) not only transmits f_B, but also the k specific samples we draw from f_B. The information contained in these specific samples is - Σ_{i=1}^k log p_{θ_B}(y_i|x_i). Because we only care about estimating K(f_B|f_A), using the "bits-back protocol" (Hinton & van Camp, 1993) the information of the samples can be subtracted from the codelength, resulting in an estimate of K(f_B|f_A), which we denote L_k(f_B|f_A):

L_k(f_B|f_A) = - Σ_{i=1}^k log p_{θ_i}(y_i | x_{1:i}, y_{1:i-1}) + Σ_{i=1}^k log p_{θ_B}(y_i|x_i)    (3)

In practice, we want k sufficiently large that f_{θ_k} converges to f_B, for example by the criterion E_x[D_KL(f_B(x) || f_{θ_k}(x))] ≤ ε. However, we found empirically that this often requires a large k, which can make the estimation using (3) infeasible when the number of available x is small. Also, the exact value of (3) depends on the specific samples used, introducing variance into the estimation.
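The train-as-you-transmit scheme of (2) can be illustrated with a toy predictor in place of a neural network. The sketch below (all names are illustrative) uses a Laplace-smoothed Bernoulli model: each binary label is encoded with the model fitted to all previous labels, then the model is updated, so a sequence the model can learn quickly costs few bits.

```python
import math

def prequential_codelength(labels, prior_counts=(1, 1)):
    """Codelength (in bits) of a binary label sequence under prequential
    coding, with a Laplace-smoothed Bernoulli predictor standing in for
    the paper's neural network model."""
    c0, c1 = prior_counts            # pseudo-counts acting as the initial model
    total_bits = 0.0
    for y in labels:
        p1 = c1 / (c0 + c1)          # current model's predictive probability of y=1
        p = p1 if y == 1 else 1.0 - p1
        total_bits += -math.log2(p)  # cost of encoding y with the current model
        if y == 1:                   # "train" on the label just transmitted
            c1 += 1
        else:
            c0 += 1
    return total_bits

# A highly regular sequence compresses far better than an alternating one,
# because the model converges to it after a short prefix:
assert prequential_codelength([1] * 64) < prequential_codelength([0, 1] * 32)
```

The first symbol always costs one bit (the prior is uniform); the all-ones sequence of length 64 costs only about log2(65) ≈ 6 bits in total, which is the sense in which prequential coding "compresses" a learnable function.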

2.2. THE PRACTICAL INFORMATION DISTANCE d p

We propose to directly estimate the expectation of L_k(f_B|f_A), which turns out to be much more efficient in the number of examples x, by leveraging infinite y samples. The expectation of the codelength over all possible samples y_{1:k} from f_B is:

E_{y_{1:k} ~ f_B(x_{1:k})}[L_k(f_B|f_A)]
  = - Σ_{i=1}^k E_{y_{1:i}} log p_{θ_i}(y_i | x_{1:i}, y_{1:i-1}) + Σ_{i=1}^k E_{y_i} log p_{θ_B}(y_i|x_i)
  ≥ - Σ_{i=1}^k E_{y_i} log E_{y_{1:i-1}} p_{θ_i}(y_i | x_{1:i}, y_{1:i-1}) + Σ_{i=1}^k E_{y_i} log p_{θ_B}(y_i|x_i)
  = Σ_{i=1}^k D_KL( f_B(x_i) || E_{y_{1:i-1}} f_{θ_i}(x_i) )    (5)
  ≈ Σ_{i=1}^k D_KL( f_B(x_i) || f_{θ̂_i}(x_i) ) =: L̂(f_B|f_A)    (6)

In (5), E_{y_{1:i-1}} f_{θ_i}(x_i) represents an infinite ensemble of models θ_i estimated from all possible samples y_{1:i-1}. We replace this ensemble with a single model θ̂_i that is directly trained on all the samples. θ̂_i is trained using the KL-divergence as objective, which is equivalent to training with infinite samples (see Appendix A for details of the step from (5) to (6)). The expected codelength E[L_k] is thus related, via (6), to the KL-divergence between the output distributions of f_B and f_{θ̂_i}. Another interpretation of the above estimate is that we finetune model θ_A with an increasing number of outputs generated by θ_B, and aggregate the KL-divergence between the two models along the way. The more information f_A shares with f_B, the faster the KL-divergence decreases, resulting in a lower estimate of K(f_B|f_A). Now d_p, the approximation of (1) we propose in this paper, is:

d_p(f_A, f_B) := max{ L̂(f_A|f_B), L̂(f_B|f_A) }    (7)
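A minimal sketch of the estimator (6), under strong simplifying assumptions: each "model" is reduced to a table of per-input logits (a hypothetical stand-in for a network, so there is no generalization across inputs), training uses plain gradient descent on the KL objective, and all sizes are illustrative. A copy of model A is finetuned toward f_B's output distributions on a growing prefix of inputs, accumulating the KL term on each new input along the way.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    """D_KL(p || q) for one probability vector pair."""
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def L_hat(logits_A, target_probs, steps=300, lr=1.0):
    """Accumulate D_KL(f_B(x_i) || f_theta_hat_i(x_i)) as in Eq. (6), where
    theta_hat_i has been trained with the KL objective on inputs x_{1:i-1}."""
    theta = logits_A.copy()
    total = 0.0
    for i in range(len(target_probs)):
        # codelength contribution: how far the current model still is from f_B on x_i
        total += kl(target_probs[i], softmax(theta[i]))
        # then "train" on x_{1:i+1}; the gradient of KL(t || softmax(z)) wrt z
        # is softmax(z) - t, i.e. the expected cross-entropy gradient
        for _ in range(steps):
            p = softmax(theta[: i + 1])
            theta[: i + 1] -= lr * (p - target_probs[: i + 1])
    return total

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))           # model A's logits on 8 inputs
B = softmax(rng.normal(size=(8, 5)))  # model B's output distributions
assert L_hat(A, B) > 0.0              # transmitting an unrelated f_B costs bits
assert L_hat(np.log(B), B) < 1e-6     # if A already equals f_B, the cost is ~0
```

The two assertions capture the intuition of the section: the more information f_A already shares with f_B, the smaller the accumulated KL and hence the smaller the estimated K(f_B|f_A).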

2.3. PROPERTIES OF d p

The information distance d in (1) applied to functions defines a metric on the space of functions. Now we check whether d_p satisfies the axioms of a metric:

1. d_p(f_A, f_B) = 0 ⇔ f_A = f_B: f_A = f_B if and only if they always produce the same predictions, which is equivalent to d_p(f_A, f_B) = 0.
2. d_p(f_A, f_B) = d_p(f_B, f_A): by definition.
3. d_p(f_A, f_B) ≤ d_p(f_A, f_C) + d_p(f_C, f_B): whether d_p keeps this property of d depends on the efficiency of prequential coding, which in turn depends on model optimization.

Another important property of the information distance d is invariance with respect to the parameterization of the function f. We found that d_p is also largely invariant to the parameterization of the functions. d_p can be used to compare models trained differently, having different structures, or even trained on different tasks. The only condition is that both models should have sufficient expressibility to allow approximation of each other. There is also a connection between L_k(f_B|f_A) and the information transfer measure L_IT (Zhang et al., 2020):

L^k_IT(θ_n) = L_preq_{θ_0}(y_{1:k}|x_{1:k}) - L_preq_{θ_n}(y_{1:k}|x_{1:k})

As n → ∞, θ_n → θ_B, and when y_i ~ f_B(x_i), we have

E[L^k_IT(θ_n)] = E[L_preq_{θ_A}(y_{1:k}|x_{1:k})] - E[L_preq_{θ_B}(y_{1:k}|x_{1:k})] = E[L_k(f_B|f_A)]

2.4. DATA-DEPENDENCY AND EQUIVALENT INTERPRETATIONS OF DATA

In machine learning, we often only care about the output of a model on the data distribution of the task. Neural network models are trained on input data from a specific domain, for example, image classification models take natural images in RGB format as valid input. It would be meaningless to discuss the behavior of such a model on non-RGB images. This is an important distinction between a neural network function and a function in the mathematical sense. This motivates us to take a data-dependent formulation of distance measure. In this paper, we limit our discussion to distribution-dependent information distance: d p (f A , f B ) = max{K(f A |f B ), K(f B |f A )} where f A = arg min f ∈F A K(f |f B ), f B = arg min f ∈F B K(f |f A ) are equivalencies of f A and f B in the below function family ( can be A or B): F = {f |E x∼D [D KL (f (x)||f (x))] ≤ } F is a set containing all the functions producing outputs almost indistinguishable from f , in the expected sense over x drawn from data distribution D. Because they produce almost identical outputs for x ∼ D, we call them equivalent interpretations of data in D. Intuitively, this means that instead of transmitting f B , we can transmit f B , which is equivalent to f B on D, if f B can be transmitted in fewer bits. A quick note on why data-dependency here in the context of neural network models does not break the definition of information distance: if f is a neural network trained on dataset D, then f is fully determined by the information in D plus a random seed (which is of negligible information). By introducing data-dependency, it enables us to approximate Kolmogorov complexity by coding samples drawn from data distribution D, in other words we can use the training set for coding.
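In practice, membership in the family F_{ε,•} can be checked by averaging the KL-divergence over a batch of inputs drawn from D. A small sketch with hypothetical softmax "models" (each model represented only by its output distributions on the batch):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def expected_kl(f_out, g_out):
    """Monte-Carlo estimate of E_x[ D_KL(f(x) || g(x)) ]; each row of the
    inputs is one model's output distribution on one sample x ~ D."""
    return float(np.mean(np.sum(f_out * (np.log(f_out) - np.log(g_out)), axis=1)))

def in_family(f_out, g_out, eps=0.01):
    """Is g an epsilon-equivalent interpretation of f on this data sample?"""
    return expected_kl(f_out, g_out) <= eps

rng = np.random.default_rng(0)
f = softmax(rng.normal(size=(100, 10)))                          # reference model outputs
g_close = softmax(np.log(f) + 0.001 * rng.normal(size=f.shape))  # tiny logit noise
g_far = softmax(rng.normal(size=(100, 10)))                      # unrelated model

assert in_family(f, f)           # KL is exactly zero
assert in_family(f, g_close)     # outputs almost indistinguishable on D
assert not in_family(f, g_far)   # a clearly different function
```

The threshold eps plays the role of ε in the definition above; how to choose it in practice is an experimental decision, not something the definition fixes.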

3. EMPIRICAL STUDY

The proposed information distance d_p relies on an estimate of the Kolmogorov complexity of neural networks with prequential codelength, which unfortunately has no known theoretical guarantees. Therefore we validate the performance of d_p mainly with empirical results. We use experiments to show the advantages of d_p and in what situations d_p is useful.

3.1. EXPERIMENT SETUP

All experiments in this section are performed using ResNet-56 models (He et al., 2016) on Tiny-ImageNet, a 200-class image classification dataset. To make the codelength estimate more reliable, we try to achieve a lower codelength in prequential coding by optimizing the model estimation process. We performed a hyper-parameter search to select the hyper-parameters that result in the lowest codelength. Unless otherwise stated, we use k = 10000 in experiments throughout this paper, which we found in most cases allows the difference between the coding model and the reference model, E_x[D_KL(f_B(x) || f_{θ_k}(x))], to converge.

3.2. INVARIANCES OF INFORMATION DISTANCE

A prominent advantage of information distance is its independence of parameterization. Neural networks like multi-layer perceptrons can have a very large number of different configurations (units, weights) that correspond to the same function. Because there is no "canonical" way of parameterizing neural networks, comparing the input-output functions represented by different neural networks is difficult by merely looking at the network parameters. There exist many metrics for measuring the similarity or distance between neural networks. But because neural networks are so versatile, networks can look similar under some metrics while appearing very dissimilar under others. There lacks a universal definition of similarity, which is precisely the problem we try to solve with d_p. To empirically examine the invariance of d_p, we evaluated d_p under different re-parameterizations of a neural network. We also include a number of distance metrics as baselines for comparison. Descriptions of the test scenarios, the baselines, and the results are shown in Figure 1. Table 1 summarizes the observed invariance of the distance measures with a quantitative measure. The results indicate that d_p is relatively stable under different kinds of re-parameterization of the network and is the most invariant overall. The other distance measures all exhibit strong dependency on certain kinds of parameterization or are inapplicable for some parameterizations. For re-parameterizations that do not change or only minimally change the function f (scaling, neuron swapping, initialization, architecture), d_p also exhibits minimal change.
When adding perturbations to the network, the information distance only starts to increase once the perturbation is large enough. This is because only large noise starts to "wipe out" information in the network. d_p is also robust to small adversarial perturbations, while showing that adversarial perturbations destroy information in the network faster than random noise. As we interpolate two functions f_A and f_B in the parameter space, if θ_A and θ_B are parameterized similarly, we observe d_p(f, f_B) to monotonically decrease as f gets closer to f_B (Figure 2, left). At the beginning, when c is small, an increase in c introduces more "fresh information" about f_B, so d_p decreases faster than later in the interpolation. On the other hand, if we interpolate two functions that are parameterized differently, linear mixing of θ_A and θ_B in parameter space leads to a degraded network, so the distance first increases and then decreases, indicating a loss of information in the middle of the interpolation. Overall, the general trend of d_p agrees with advanced similarity measures like representation EMD and CKA in this scenario. To summarize, parameter space distances fail when function similarity does not correspond to parameter value similarity, and representation space distances can be too noisy to be reliable when similarity is high. Only the information distance d_p remains faithful in both scenarios.
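The scaling re-parameterization leaves a ReLU network's function exactly unchanged because ReLU is positively homogeneous: relu(c·z) = c·relu(z) for c > 0. A minimal numpy sketch with a hypothetical bias-free two-layer network (sizes are illustrative; with biases, the scaled layer's bias would have to be scaled as well):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x, W1, W2):
    """Bias-free two-layer ReLU network (illustrative stand-in for a ResNet)."""
    return relu(x @ W1) @ W2

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))
W1 = rng.normal(size=(6, 8))
W2 = rng.normal(size=(8, 3))

c = 3.7  # any positive coefficient
# Scaling the first layer by c and the next by 1/c preserves the function,
# while scaling only one layer changes it:
assert np.allclose(net(x, W1, W2), net(x, c * W1, W2 / c))
assert not np.allclose(net(x, W1, W2), net(x, c * W1, W2))
```

This is exactly the kind of re-parameterization that changes parameter-space distances (L2, cosine) arbitrarily while d_p should remain unchanged.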

4. APPLICATION

To illustrate the utility of a universal function distance, we provide a few scenarios where we use d p for understanding and making predictions.

4.1. SKETCHING THE GEOMETRY OF DATA AND MODEL SPACE

A distance measure can help us understand the relationships between datasets and between models. Datasets and models usually live in very high-dimensional spaces, which makes it hard to compare them directly. Instead, we can use d_p to get the information distance between datasets and models. In computer vision there is a myriad of datasets and model structures, and we use the Visual Task Adaptation Benchmark (VTAB) (Zhai et al., 2019) as a collection of vision datasets. On each dataset, a model is trained to represent the input-output function of the task. Then we use d_p to measure pairwise distances between these functions. To help visualize the relationships, we use Isometric Mapping (Tenenbaum et al., 2000), a manifold learning algorithm, to generate three-dimensional embeddings for each function. The distances of points in three-dimensional space are optimized to keep the original structure. Distances can tell a lot about the relationship between models. Among the nine large datasets of VTAB, datasets cluster largely according to the three categories proposed in VTAB (natural, specialized, and structured). CIFAR-100 is very different from any other dataset, but is relatively closer to the satellite image datasets than to the artificial shape datasets. SVHN (Netzer et al., 2011) is close to the 2D shape datasets. The four small datasets are evenly distributed in space: no pair of them is very similar. In terms of model architecture, the ResNet variants are relatively similar, while AlexNet (Krizhevsky, 2014) and VGG (Simonyan & Zisserman, 2014) are more distant; VGG with batch normalization is closer to ResNet than without. ResNet-50, ResNeXt-50 (Xie et al., 2017) and WideResNet-50 (Zagoruyko & Komodakis, 2016) are closest, as they are indeed very similar.
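The embedding step takes only a matrix of pairwise d_p values as input. The paper uses Isomap; as a dependency-free illustration of the same idea (embedding a distance matrix into a low-dimensional space that preserves its structure), here is classical multidimensional scaling, which recovers Euclidean configurations exactly:

```python
import numpy as np

def classical_mds(D, dim=3):
    """Embed points from a pairwise-distance matrix (e.g. pairwise d_p
    values between task functions) into `dim` dimensions via classical MDS.
    Note: a stand-in for the Isomap used in the paper, not the same method."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # double-centered squared distances
    w, V = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]        # keep the top `dim` components
    L = np.sqrt(np.clip(w[idx], 0.0, None))
    return V[:, idx] * L

# Sanity check: distances between the corners of a unit square are
# reproduced exactly by a 2D embedding.
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
emb = classical_mds(D, dim=2)
D2 = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
assert np.allclose(D, D2, atol=1e-6)
```

For a non-Euclidean matrix such as pairwise d_p values, the embedding is only approximate, which is why a manifold learner like Isomap is the more appropriate choice in the paper.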

4.2. UNDERSTANDING REGULARIZATIONS

Regularization techniques like L2 regularization can bias the learned neural networks toward less complex functions, while for techniques like dropout (Srivastava et al., 2014) and self-distillation (Furlanello et al., 2018) , the regularization effect may be less straightforward to explain. We can use d p to examine the (information) complexity of a network f , by measuring its distance d p (f, 0) to an empty function. From Figure 5 , we observe that all the listed techniques result in a reduction of d p (f, 0), which means that the information complexity of the model function f is reduced. For weight decay, information complexity only starts to decrease after the regularization coefficient is larger than a threshold. Self-distillation has a similar effect to regularization, with the number of distillation iterations controlling regularization strength. This agrees with the theoretical analysis in (Mobahi et al., 2020) . Label smoothing and dropout also result in simpler models, highlighting their regularization effect. 

4.3. ENSEMBLES AND MODEL DIVERSITY

Distance can be used as an indicator of model diversity: the larger d_p is between models, the more diverse the models are. Ensembling is a common technique that uses the consensus of multiple models to deliver performance superior to that of a single model. We speculate that larger model diversity results in more performance gain from ensembling. To verify this connection, we train a number of models on Tiny-ImageNet, all to the same performance on the validation set, but with different initializations and different subsets of the training set. Then we choose models in pairs to measure their ensemble performance as well as the distance d_p(f_1, f_2) between them. The result is given in Figure 6: we found a clear correlation between d_p and ensemble performance, and the relationship is roughly linear. This also indicates that d_p captures model diversity.
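The intuition that diversity drives ensemble gains can be made concrete with a toy numpy sketch (hypothetical data: the true class is always 0, and KL-divergence stands in for d_p as the diversity measure). Two equally accurate models that err on disjoint halves of the inputs gain more from averaging than two identical models:

```python
import numpy as np

def accuracy(probs, labels):
    return float(np.mean(np.argmax(probs, axis=1) == labels))

def avg_pairwise_kl(p, q):
    """Symmetrized mean KL between two models' output distributions,
    used here as a simple stand-in for the diversity captured by d_p."""
    kl = lambda a, b: np.sum(a * (np.log(a) - np.log(b)), axis=1)
    return float(np.mean(kl(p, q) + kl(q, p)) / 2)

labels = np.zeros(100, dtype=int)                 # true class is always 0
right = np.array([0.9, 0.05, 0.05])
wrong = np.array([0.4, 0.6, 1e-9])                # mistaken prediction
wrong = wrong / wrong.sum()

# Pair 1: two identical models, both wrong on the first half of the inputs.
m1 = np.vstack([np.tile(wrong, (50, 1)), np.tile(right, (50, 1))])
m2 = m1.copy()
# Pair 2: equally accurate models that err on disjoint halves (more diverse).
m3 = m1
m4 = np.vstack([np.tile(right, (50, 1)), np.tile(wrong, (50, 1))])

ens_same = accuracy((m1 + m2) / 2, labels)        # errors cannot cancel: 0.5
ens_diverse = accuracy((m3 + m4) / 2, labels)     # errors cancel out: 1.0

assert avg_pairwise_kl(m3, m4) > avg_pairwise_kl(m1, m2)
assert ens_diverse > ens_same
```

Each single model in both pairs has the same 50% accuracy, mirroring the experimental setup where all models reach the same validation performance; only the diversity between pair members differs.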

4.4. PREDICTING GENERALIZATION

Finally, d_p is also linked with model generalization. Generalization of neural networks is heavily affected by hyper-parameters and optimization. There have been several works aiming to find the relationship between generalization performance and properties of the network, but it turns out that predicting the generalization gap can be a challenging task (Jiang et al., 2019; 2020). We perform a small-scale experiment to illustrate the connection between information distance and the generalization gap. We train a number of models with different hyper-parameters (batch size, learning rate, optimizer, etc.), all to the same loss on the training set, and then measure the distance to a random model by d_p(f, 0). In Figure 7, we observe that the information complexity of the model is also linked with the generalization gap, and the relationship again turns out to be roughly linear. Models that generalize better are farther away from a random model than worse-performing models.

5. DISCUSSION AND CONCLUSION

The proposed distance d_p is based on the information distance defined with Kolmogorov complexity K. We do not attempt to give a good estimate of K; instead, relying on the efficiency of prequential coding, we empirically illustrate that d_p shares the invariance properties of information distance and reflects the similarity relationships of functions parameterized by neural networks. We also found that d_p is linked with the behaviors of models, making it a potential tool for analyzing neural networks. The most notable difference between d_p and other similarity metrics is universality. Theoretically rooted in information distance, d_p is independent of parameterization and widely applicable in situations involving different tasks and models. However, d_p's reliance on prequential coding also introduces limitations: it might not work in situations where prequential coding fails, for example when f cannot be efficiently approximated by neural networks. d_p could introduce a potential scale-free, or even parameterization-free, geometry of the space spanned by neural models. Optimization with manifold descent by d_p could also remove the dependency on parameterization, thus avoiding ill-posed conditions in some parameterizations (Dinh et al., 2017).

A TECHNICAL DETAILS IN CALCULATING d p

In equations (5)-(6), we used f_{θ̂_i} in place of the ensemble model E_{y_{1:i-1}} f_{θ_i}(x_i):

E_{y_{1:k} ~ f_B(x_{1:k})}[L_k(f_B|f_A)]
  ≥ Σ_{i=1}^k D_KL( f_B(x_i) || E_{y_{1:i-1}} f_{θ_i}(x_i) )    (12)
  = Σ_{i=1}^k [ D_KL( f_B(x_i) || f_{θ̂_i}(x_i) ) + E_{y_i ~ f_B(x_i)} log( f_{θ̂_i}(x_i) / E_{y_{1:i-1}} f_{θ_i}(x_i) ) ]    (13)

θ̂_i is trained with the objective function c_KL:

c_KL(θ) = Σ_{j=1}^i D_KL( f_B(x_j) || f_θ(x_j) )    (14)
        = Σ_{j=1}^i [ H_c( f_B(x_j), f_θ(x_j) ) - H( f_B(x_j) ) ]    (15)
        = E_{y_{1:i} ~ f_B(x_{1:i})} [ Σ_{j=1}^i H_c( y_j, f_θ(x_j) ) ] - Σ_{j=1}^i H( f_B(x_j) )    (16)
        = E_{y_{1:i} ~ f_B(x_{1:i})} [ c_CE(θ) ] - Σ_{j=1}^i H( f_B(x_j) )    (17)

where c_CE is the cross-entropy objective function used to train θ_i, and H_c stands for cross-entropy. In other words, θ̂_i is trained with the average of the loss used in training θ_i (the entropy term in (17) does not depend on θ and has no effect on training). Therefore f_{θ̂_i} should mimic the behavior of the infinite ensemble E_{y_{1:i-1}} f_{θ_i}(x_i) reasonably well and make the second term in (13) small. Using (6) instead of (3) to estimate the codelength not only makes the estimate independent of the sampling process, but also requires fewer input examples x. This is because we make the most of each x by essentially drawing infinitely many y samples from each f(x). Generally speaking, to estimate d_p, one first needs to sample some inputs x from D (based on the data-dependency introduced in Section 2.4). Usually D is unknown, but we have a dataset S containing samples from D, so we can use examples from S instead. When the size of S is small, there may not be enough examples to train f_A to converge to f_B by (3). We found that this is often the case for small datasets, for example in Section 4.1. Even when S is large, we can save computation time by using a smaller sample size k. In Section 2.4 we introduced data-dependency, where we study functions restricted to the data distribution. We can quickly see that if f represents a model trained on data distribution D, then

K(f'|f, D) = K(f|f', D) = 0

where f' means f restricted to D, because f' is fully determined by f. It follows that K(f'_B|f'_A, D) = K(f_B|f_A, D), so we can study K(f_B|f_A, D) by coding examples from D. This requires the functions to be compared (f_B and f_A) to be trained on the same kind of input. This is a reasonable restriction, because it is unlikely one would be concerned about the distance between functions defined on different input spaces.
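The identity behind (14)-(17), that the KL objective equals the expected cross-entropy loss up to a constant entropy term, can be checked numerically (distributions here are random illustrative vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

t = softmax(rng.normal(size=10))   # target distribution f_B(x_j)
q = softmax(rng.normal(size=10))   # model output f_theta(x_j)

kl = np.sum(t * (np.log(t) - np.log(q)))   # D_KL(t || q), Eq. (14) for one x_j
cross_entropy = -np.sum(t * np.log(q))     # H_c(t, q) = E_{y~t}[-log q(y)]
entropy = -np.sum(t * np.log(t))           # H(t), independent of the model

# Eq. (15): KL objective = expected cross-entropy loss minus a constant,
# so training with KL is equivalent to cross-entropy training averaged
# over infinitely many samples y ~ t.
assert np.isclose(kl, cross_entropy - entropy)
```

Since the entropy term does not depend on the model parameters, minimizing the KL objective and minimizing the expected cross-entropy produce identical gradients, which is why f_{θ̂_i} tracks the infinite ensemble.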

B EXPERIMENT DETAILS

B.1 INVARIANCE EXPERIMENTS

We provide details for the re-parameterizations used in the invariance experiments:

• Scaling: the weights in layer i are multiplied by c, and the weights in layer i-1 are multiplied by 1/c. For ReLU networks, this keeps the output of the network unchanged.
• Neuron swapping: we randomly permute c · (total number of neurons) neurons in layer i. We also correspondingly permute the input of layer i+1 so that the network output is unchanged.
• Perturbation: we add Gaussian noise of zero mean and standard deviation c to each individual weight of the network.
• Adversarial perturbation: we add a vector v of standard deviation c to the weight vector of the network, and optimize the vector to maximize the deviation of the second-to-last layer representations, i.e. max_{v: std(v)=c} E_x[ (f^r_θ(x) - f^r_{θ+v}(x))^2 ].
• Initialization: we experiment with random initialization and with initializing from a network pre-trained on another dataset (CIFAR-10).
• Architecture: we use ResNet architectures with different width and depth. ResNet-56-s refers to ResNet-56 with half the width in each layer.

Next we list the baseline distance (or similarity) measures and describe how to calculate them for networks f_A and f_B. Parameter space distances: we denote the i-th layer weight matrix of network f_A as w^i_A. For the L2 and cosine measures, we first flatten and concatenate all weight matrices w^i_A and biases of the network into a long vector w^all_A (excluding parameters in batch normalization layers, because some statistics variables in them can be large and dominate the norm of the vector).

• L2: d_l2 = ||w^all_A - w^all_B||_2.
• Cosine: d_cosine = w^all_A · w^all_B / (||w^all_A||_2 ||w^all_B||_2).
• EMD: we use "Optimal Transport of Neurons" in (Li et al., 2020).
The distance matrix M is taken to be the pairwise L2 distance between the weights of each neuron, M^i_{mn} = ||w^i_{Am·} - w^i_{Bn·}||_2. The EMD distance is the optimal transport cost of matching neurons of one network to neurons of the other network, d_emd = min_{P ∈ Π(μ,ν)} ⟨P, M⟩_F, where P is the optimal transport plan.

Representation space distances: we sample x_i from the data distribution D and denote the output representation vector of the second-to-last layer of model f_A by f_Ai.

• L2: d_l2 = (1/k) Σ_{i=1}^k ||f_Ai - f_Bi||_2.
• Cosine: d_cosine = (1/k) Σ_{i=1}^k f_Ai · f_Bi / (||f_Ai||_2 ||f_Bi||_2).
• EMD: same as in "parameter space distances", except that the distance matrix M is taken to be the pairwise L2 distance between the activation vectors of each neuron, M_mn = ( Σ_{i=1}^k (f_Aim - f_Bin)^2 )^{1/2}.
• Linear CKA: we use the implementation provided by (Kornblith et al., 2019): d_cka = ||f_B·^T f_A·||_F^2 / ( ||f_A·^T f_A·||_F ||f_B·^T f_B·||_F ), where f_A· is a matrix whose k rows are the vectors f_A1, ..., f_Ak.

Output space distances: we use the output distributions of f_A and f_B.

• KL-divergence: E_x[ D_KL( f_A(x) || f_B(x) ) ].

B.2 MORE RESULTS OF SECTION 3.3

Figures 8 and 9 show the results of the model interpolation experiments and the training progress experiments, for all distance measures studied in this paper. We also show whether each method gives the correct trend, and whether it can be used to identify different models.
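Linear CKA, per the formula above, takes only a few lines of numpy. The sketch below (with centered features, and illustrative sizes) also checks the invariance properties that make CKA a strong baseline, namely insensitivity to orthogonal transformations and isotropic scaling of the representations:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices whose k rows are the
    per-example feature vectors: ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F),
    computed on mean-centered features."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                   # 50 examples, 10 features
R, _ = np.linalg.qr(rng.normal(size=(10, 10)))  # a random orthogonal transform

assert np.isclose(linear_cka(X, X @ R), 1.0)    # invariant to rotations
assert np.isclose(linear_cka(X, 2.5 * X), 1.0)  # invariant to isotropic scaling
```

These invariances hold for orthogonal transformations and uniform scaling but not for arbitrary invertible linear maps, which is consistent with the failure cases of representation-space measures discussed in Section 3.2.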

B.3 GEOMETRY EXPERIMENTS

From the 19 datasets included in VTAB (Zhai et al., 2019), we were able to download 13 datasets for use in this work. Because the dataset sizes vary greatly among the 13 datasets, we divide them into two groups: larger datasets (size > 10000), which include:

• cifar100: CIFAR-100 (Krizhevsky & Hinton, 2009)
• svhn: SVHN (Netzer et al., 2011)
• eurosat: EuroSAT (Helber et al., 2019)
• resisc45: Resisc45 (Cheng et al., 2017)
• dsprites position: dSprites/location (Matthey et al., 2017)
• dsprites orientation: dSprites/orientation
• smallnorb azimuth: SmallNORB/azimuth (LeCun et al., 2004)
• smallnorb elevation: SmallNORB/elevation
• dmlab: DMLab (Beattie et al., 2016)

and smaller datasets (size < 10000), which include:

B.5 GENERALIZATION EXPERIMENTS

We run experiments on CIFAR-10 with different hyper-parameters and model configurations, and in all configurations we train the model to a cross-entropy loss of 0.1 on the training set. We then measure the generalization gap as the loss on the testing set minus the loss on the training set. Starting from a default configuration (the same hyper-parameters we use in other experiments in this paper), each time we modify one of the hyper-parameters. Results are listed in Table 3. In terms of studying the generalization gap, our experiments are far less thorough than those in (Jiang et al., 2020), but here we would like to illustrate the connection between d_p and the generalization gap under different experiment settings without spending too many machine hours.



Footnote URLs referenced in the text: https://tiny-imagenet.herokuapp.com and https://github.com/pytorch/vision

To avoid clutter in the graphs, we did not include the L2 and cosine measures in Figures 2 and 3, as they fail the basic invariance tests in Section 3.2. Full results are given in Appendix B.2.



Figure 1: Distance d_p changes with respect to changes in the parameterization of the networks. We measure the distance between a pair of ResNet-56 models θ_A and θ_B, trained on two subsets of Tiny-ImageNet, respectively. We modify the configuration of θ_A while keeping θ_B fixed. Modifications include: scaling, where we multiply the weights of the network by a coefficient c; neuron swapping, where we randomly permute a fraction c of the units within a layer; perturbation, where Gaussian random noise with zero mean and standard deviation c is added to the weights; and adversarial perturbation, where a vector with standard deviation c is added to the weights so as to maximize the change in the second-to-last-layer representations. We also include training θ_A with different initializations and with different network architectures. Some commonly used distance measures serve as baselines, covering three different spaces. Parameter-space distances measure distance using metrics on the parameter matrices: plain L2 distance, cosine similarity, and Earth Mover Distance (EMD) (Monge, 1781; Rubner et al., 1998), which computes the cost of aligning neurons. Representation-space distances measure distance on the second-to-last-layer representation vectors: plain L2, cosine, the EMD cost of aligning feature dimensions, and Linear Centered Kernel Alignment (CKA) (Kornblith et al., 2019), which is based on pairwise sample similarity and outperforms previously proposed similarity measures. For the output-space distance, we use the KL-divergence between the probability distributions generated by the final softmax layer. The baselines thus include both common, straightforward measures and more sophisticated measures such as EMD and CKA that possess invariance properties. Underlined labels on the x-axis denote the distances measured between the unmodified θ_A and θ_B.
Curves for each measure are individually scaled (only scaled, not shifted) to ease viewing on the same graph. Cosine similarity and CKA values lie within [0, 1] and are inversely correlated with distance.
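As a reference for the CKA baseline above, linear CKA between two representation matrices has a compact closed form; the following is a minimal sketch written by us for illustration, not the authors' code. It also checks the invariance properties mentioned in the text: linear CKA is invariant to orthogonal transformations and isotropic scaling, but not to general invertible linear maps.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between representation matrices
    X (n x d1) and Y (n x d2), where rows index examples."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

For example, with a random orthogonal matrix Q, linear_cka(X, X @ Q) equals 1 (up to floating-point error), whereas multiplying by a random non-orthogonal matrix generally lowers the score.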

Figure 2: Distance from the interpolation model θ to θ_B. Left: θ_A and θ_B are trained on different but related tasks (the first 100 classes and the last 100 classes of Tiny-ImageNet) from the same initialization. Middle: θ_A and θ_B are trained on the same task but from different initializations. Right: whether each measure's trend is correct, and whether it can be used to identify different interpolation coefficients.
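We assume the interpolation model here is the usual elementwise linear mix in parameter space (the caption does not spell this out, so this is our reading); a minimal sketch with hypothetical variable names:

```python
def interpolate(theta_a, theta_b, alpha):
    """Elementwise linear interpolation theta = (1 - alpha) * theta_a + alpha * theta_b.
    theta_a, theta_b: dicts mapping parameter names to lists of floats;
    both models must share the same architecture (same keys and shapes)."""
    assert theta_a.keys() == theta_b.keys()
    return {name: [(1 - alpha) * a + alpha * b
                   for a, b in zip(theta_a[name], theta_b[name])]
            for name in theta_a}
```

At alpha = 0 this recovers θ_A exactly, and at alpha = 1 it recovers θ_B, matching the endpoints of the interpolation axis in the figure.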

Figure 3: Distance from the i-th epoch model θ_i to the initial model θ_0 (left) and the final model θ_14 (middle). Right: whether each measure has a monotonic trend with respect to training progress, and whether it can be used to identify models from different epochs.

Figure 4: Visualizing distances between datasets and models in three-dimensional space. Top-left: large datasets in VTAB. Top-right: small datasets in VTAB. Bottom: various model architectures trained on ImageNet. The numbers on colored lines are pairwise distances.

Figure 5: Distance to an empty function d_p(f, 0) for models with different kinds of regularization and varying strength. From left to right: weight decay (L2 regularization), self-distillation, label smoothing, dropout. For dropout, a different base model without batch normalization is used.
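Among the regularizers in this figure, label smoothing has the simplest closed form: each one-hot target is mixed with the uniform distribution. A brief sketch (eps and K are our notation for the smoothing strength and the number of classes):

```python
def smooth_labels(onehot, eps):
    """Label smoothing: replace a one-hot target with
    (1 - eps) * onehot + eps / K, where K is the number of classes."""
    K = len(onehot)
    return [(1 - eps) * y + eps / K for y in onehot]
```

The smoothed target remains a valid probability distribution (entries sum to 1), which is why it plugs directly into the usual cross-entropy loss.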

Figure 6: Relation between ensemble performance and model diversity given by d_p(f_1, f_2).

Figure 7: Relation between the generalization gap and model complexity d_p(f, 0).

Figure 8: Full results of Figure 2: distance from the interpolation model θ to θ_B.

and only if they always produce the same predictions, which is equivalent to d_p = 0.

Measuring the invariancy of distance measures. Invariancy is measured by the relative change of the distance in each test scenario (averaged over all data points); a lower value means the measure is more invariant. Oracle refers to an ideal distance measure. N/A means the method cannot be used in that test. In the pre-training experiments, we initialize θ_A with a network pre-trained on CIFAR-10 (Krizhevsky & Hinton, 2009). d_p increases by 7% compared to random initialization, because θ_A carries over some information from CIFAR-10, making θ_A and θ_B slightly less similar. In the architecture experiments, if θ_A uses a different architecture than θ_B (which is ResNet-56), we also observe an increase in d_p: the more a model differs from ResNet-56 (in terms of the number of layers), the slightly higher the distance d_p. This indicates that while d_p is largely invariant to model parameterization, it is also consistent with intuitive similarities between models. This is not observed with the EMD and CKA distances.
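Under our reading of this protocol, the invariancy score is the mean relative change of a distance measure across the modified configurations; a minimal sketch (d_ref and d_modified are hypothetical names, not from the paper's code):

```python
def invariancy_score(d_ref, d_modified):
    """Mean relative change |d - d_ref| / d_ref over all modified
    configurations; a lower score means the measure is more invariant.
    d_ref: distance between the unmodified pair; d_modified: distances
    after each modification (scaling, neuron swapping, etc.)."""
    return sum(abs(d - d_ref) for d in d_modified) / (len(d_modified) * d_ref)
```

An ideal (oracle) measure would return 0 here, since no reparameterization would change its value.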

Details of configurations in generalization experiments.


• caltech101: Caltech101 (Li et al., 2006)
• dtd: DTD (Cimpoi et al., 2014)
• oxford flowers102: Flowers102 (Nilsback & Zisserman, 2008)
• oxford iiit pet: Pets (Parkhi et al., 2012)

For larger datasets, we use k = 10000 as in the other experiments. For smaller datasets, we use k = 2000. The model is ResNet-56 trained from scratch.

In the model geometry experiments, we use the following models trained on ImageNet, as provided by torchvision:

• resnet18: ResNet-18 (He et al., 2016)
• resnet34: ResNet-34
• resnet50: ResNet-50
• vgg11: VGG-11 without batch normalization (Simonyan & Zisserman, 2014)
• vgg11 bn: VGG-11 with batch normalization
• alexnet: AlexNet (Krizhevsky, 2014)
• resnext50: ResNeXt-50-32x4d (Xie et al., 2017)
• wide resnet50: WideResNet-50-2 (Zagoruyko & Komodakis, 2016)
• densenet121: DenseNet-121 (Huang et al., 2017)
• squeezenet: SqueezeNet 1.1 (Iandola et al., 2016)
• mobilenet: MobileNet V2 (Sandler et al., 2018)

Codelength is calculated on the training set of ILSVRC 2012 (Russakovsky et al., 2015).

B.4 ENSEMBLE EXPERIMENTS

We train multiple ResNet-56 models with 2 different random initializations, each on half of the examples sampled from the Tiny-ImageNet training set. This means the training examples seen by any two models can overlap anywhere from 0% to 100%. We then select two models out of this collection and ensemble them. Ensemble performance and the distance between the two models measured by d_p(f_A, f_B) are given in Table 2. We list pairs with the same or different initialization, and with different overlap in training examples. Generally speaking, the less the training examples overlap, the larger the distance between the models. Different initializations can also make the models more dissimilar. Note that from Figure 6, we see that the distance d_p(f_A, f_B) is correlated with ensemble performance regardless of whether diversity comes from differences in training examples or differences in initialization.
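Ensembling two classifiers here can be done by averaging their softmax outputs and predicting the most probable class; this is a common scheme that we assume for illustration (function names are ours, not from the paper):

```python
def ensemble(probs_a, probs_b):
    """Average the softmax outputs of two models for one example.
    probs_a, probs_b: per-class probability lists of equal length."""
    return [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]

def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)
```

When the two models disagree, the averaged distribution lets a confident model outvote an uncertain one, which is one intuition for why more diverse pairs (larger d_p) can yield better ensembles.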

