INFORMATION DISTANCE FOR NEURAL NETWORK FUNCTIONS

Abstract

We provide a practical distance measure on the space of functions parameterized by neural networks. It is based on the classical information distance, and we propose to replace the uncomputable Kolmogorov complexity with information measured by the codelength of prequential coding. We also provide a method for directly estimating the expectation of this codelength from limited examples. Empirically, we show that the information distance is invariant with respect to different parameterizations of the neural networks. We also verify that information distance faithfully reflects the similarity of neural network functions. Finally, we apply information distance to investigate the relationships between neural network models, and demonstrate its connection to multiple characteristics and behaviors of neural networks.

1. INTRODUCTION

Deep neural networks can be trained to represent complex functions that describe sophisticated input-output relationships, such as image classification and machine translation. Because these functions are highly non-linear and are parameterized in high-dimensional spaces, there is relatively little understanding of the functions represented by deep neural networks. One can interpret deep models by linear approximations (Ribeiro et al., 2016), or from the perspective of piece-wise linear functions, as in (Arora et al., 2018). If the space of functions representable by neural networks admits a distance measure, it would be a useful tool for analyzing and gaining insight into neural networks. A major difficulty is the vast number of ways of parameterizing a function, which makes it hard to characterize the similarity of two networks. Measuring similarity in the parameter space is straightforward but is restricted to networks with the same structure. Measuring similarity at the output is likewise restricted to networks trained on the same task. Similarity of representations produced by intermediate layers of networks has proven more reliable and consistent (Kornblith et al., 2019), but is not invariant to linear transformations and can fail in some situations, as shown in our experiments.

In this paper, we provide a distance measure on functions based on information distance (Bennett et al., 1998), which is independent of the parameterization of the neural network. This also removes the arbitrariness of choosing "where" to measure the similarity in a neural network. Information distance has mostly been used in data mining (Cilibrasi & Vitányi, 2007; Zhang et al., 2007). Intuitively, information distance measures how much information is needed to transform one function into the other. We rely on prequential coding to estimate this quantity; prequential coding can efficiently encode neural networks and datasets (Blier & Ollivier, 2018).
If we regard prequential coding as a compression algorithm for neural networks, then the codelength gives an upper bound on the information content of a model. We propose a method for calculating an approximate version of information distance with prequential coding for arbitrary networks. In this method, we use the KL-divergence in prequential training and coding, which allows us to directly estimate the expected codelength without any sampling process. We then perform experiments demonstrating that this information distance is invariant to the parameterization of the network while remaining faithful to the intrinsic similarity of models. Using information distance, we are able to sketch a rough view of the space of deep neural networks and uncover the relationships between datasets and models. We also find that information distance can help us understand regularization techniques, measure the diversity of models, and predict a model's ability to generalize.
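To illustrate the idea of estimating the expected codelength without sampling, here is a minimal numpy sketch under our own assumptions (the function names are ours, not from the paper): for classification models, taking the expectation of the bits-back codelength over labels sampled from f_B turns each per-example term into a KL-divergence between f_B's output distribution and the intermediate model's output distribution, so the expected codelength can be computed from the two distributions directly.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) in nats between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def expected_codelength(target_dists, step_dists):
    """Expected bits-back prequential codelength (a sketch, ignoring the
    dependence of each intermediate model on earlier sampled labels):

        sum_i E_{y_i ~ f_B(x_i)} [ -log p_{theta_i}(y_i) + log p_{theta_B}(y_i) ]
          = sum_i D_KL( f_B(x_i) || p_{theta_i}(. | x_i) )

    target_dists[i]: f_B's output distribution on input x_i
    step_dists[i]  : output distribution of the i-th intermediate model on x_i
    """
    return sum(kl(p, q) for p, q in zip(target_dists, step_dists))
```

When the intermediate models already match f_B everywhere, every KL term vanishes and the expected codelength is zero, reflecting that nothing beyond f_A is needed.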

2. METHODOLOGY

Information distance measures the difference between two objects by information quantity. The information distance between two functions f_A and f_B can be defined as (Bennett et al., 1998):

d(f_A, f_B) = \max\{ K(f_A \mid f_B), K(f_B \mid f_A) \}    (1)

This definition makes use of Kolmogorov complexity: K(f_B | f_A) is the length of the shortest program that transforms f_A into f_B, and the information distance d is the larger of the two directional lengths. (This is not the only way to define information distance with Kolmogorov complexity; we settle on this definition for its simplicity.) Intuitively, K(f_B | f_A) is the minimum number of bits needed to encode f_B with the help of f_A, i.e., how much information is needed to know f_B if f_A is already known. Given two functions f_A : X → Y and f_B : X → Y defined on the same input space X, each parameterized by a neural network with weights θ_A and θ_B, we want to estimate the information distance between f_A and f_B. The Kolmogorov complexity terms are estimated by the codelength of prequential coding, so what we obtain is an upper bound on d, which we denote d_p (p for prequential coding).
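Given codelength estimates in both directions, d_p is simply the larger of the two, mirroring eq. (1) with prequential codelengths standing in for the uncomputable Kolmogorov complexities (a trivial helper; the name is ours):

```python
def information_distance(codelength_b_given_a: float,
                         codelength_a_given_b: float) -> float:
    """d_p: upper bound on eq. (1), taking the max of the two
    directional prequential codelength estimates (in nats or bits)."""
    return max(codelength_b_given_a, codelength_a_given_b)
```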

2.1. ESTIMATING K(f_B | f_A) WITH PREQUENTIAL CODING

To send f_B to someone who already knows f_A, we generate predictions y_i from f_B using inputs x_i sampled from X. Assuming that {x_i} is known to the receiver, we can use prequential coding to send the labels {y_i}. If we send enough labels, the receiver can use {x_i, y_i} to train a model that recovers f_B. If f_A and f_B have something in common, i.e. K(f_B | f_A) < K(f_B), then with the help of f_A we can reduce the codelength used to transmit f_B. A convenient way of doing so is to use θ_A as the initial model in prequential coding. The codelength of k samples is:

L_preq(y_{1:k} \mid x_{1:k}) := -\sum_{i=1}^{k} \log p_{\theta_i}(y_i \mid x_{1:i}, y_{1:i-1})    (2)

where θ_i is the parameter of the model trained on {x_{1:i-1}, y_{1:i-1}}, and θ_1 = θ_A. With sufficiently large k, the function parameterized by θ_k converges to f_B. If both f_A and f_B are classification models, we can sample y from the output distribution of f_B. In this case, the codelength (2) not only transmits f_B, but also the k specific samples we draw from f_B. The information contained in these specific samples is -\sum_{i=1}^{k} \log p_{\theta_B}(y_i \mid x_i). Because we only care about estimating K(f_B | f_A), using the "bits-back protocol" (Hinton & van Camp, 1993) the information of the samples can be subtracted from the codelength, yielding an estimate of K(f_B | f_A), denoted L_k(f_B | f_A):

L_k(f_B \mid f_A) = -\sum_{i=1}^{k} \log p_{\theta_i}(y_i \mid x_{1:i}, y_{1:i-1}) + \sum_{i=1}^{k} \log p_{\theta_B}(y_i \mid x_i)    (3)

In practice, we want k sufficiently large that f_{θ_k} converges to f_B, for example by the criterion E_x[D_KL(f_B(x) ‖ f_{θ_k}(x))] ≤ ε for some small ε. However, empirically we found that this often requires a large k, which can make estimation using (3) infeasible when the number of available x is small. Moreover, the exact value of (3) depends on the specific samples drawn, introducing variance into the estimation.
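The bits-back accounting in eqs. (2)-(3) can be sketched as follows (a minimal numpy illustration under our own assumptions: the per-step retraining that produces each θ_i is abstracted away, and we take each model's and f_B's output distributions as given; the function names are ours):

```python
import numpy as np

def bits_back_codelength(step_dists, target_dists, rng):
    """Sampled estimate of L_k(f_B | f_A) from eq. (3), in nats.

    step_dists[i]  : p_{theta_i}(. | x_i), the predictive distribution of the
                     model trained on the first i-1 pairs (theta_1 = theta_A)
    target_dists[i]: f_B's output distribution on x_i
    """
    total = 0.0
    for p_model, p_target in zip(step_dists, target_dists):
        y = rng.choice(len(p_target), p=p_target)  # sample label y_i ~ f_B(x_i)
        total += -np.log(p_model[y])               # prequential cost, eq. (2)
        total -= -np.log(p_target[y])              # bits-back refund for the sample
    return total
```

If the intermediate models already agree with f_B on every x_i, the cost and the refund cancel exactly at each step and the estimate is zero; the gap between the two terms is what varies from sample to sample, which is the variance issue noted above.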

