INFORMATION DISTANCE FOR NEURAL NETWORK FUNCTIONS

Abstract

We provide a practical distance measure in the space of functions parameterized by neural networks. It is based on the classical information distance, and we propose to replace the uncomputable Kolmogorov complexity with information measured by the codelength of prequential coding. We also provide a method for directly estimating the expectation of this codelength from a limited number of examples. Empirically, we show that information distance is invariant to different parameterizations of a neural network. We also verify that information distance faithfully reflects similarities between neural network functions. Finally, we apply information distance to investigate the relationships between neural network models, and demonstrate its connection to multiple characteristics and behaviors of neural networks.

1. INTRODUCTION

Deep neural networks can be trained to represent complex functions that describe sophisticated input-output relationships, such as image classification and machine translation. Because these functions are highly non-linear and are parameterized in high-dimensional spaces, there is relatively little understanding of the functions represented by deep neural networks. One could interpret deep models by linear approximations (Ribeiro et al., 2016), or from the perspective of piece-wise linear functions, as in (Arora et al., 2018). If the space of functions representable by neural networks admits a distance measure, it would be a useful tool for analyzing and gaining insight into neural networks.

A major difficulty is the vast number of ways to parameterize a function, which makes it hard to characterize the similarity of two given networks. Measuring similarity in parameter space is straightforward but is restricted to networks with the same structure. Measuring similarity at the output is likewise restricted to networks trained on the same task. Similarity of representations produced by intermediate layers of networks has proved more reliable and consistent (Kornblith et al., 2019), but it is not invariant to linear transformations and can fail in some situations, as shown in our experiments.

In this paper, we provide a distance measure on functions based on information distance (Bennett et al., 1998), which is independent of the parameterization of the neural network. This also removes the arbitrariness of choosing "where" to measure similarity in a neural network. Information distance has mostly been used in data mining (Cilibrasi & Vitányi, 2007; Zhang et al., 2007). Intuitively, information distance measures how much information is needed to transform one function into the other. We rely on prequential coding to estimate this quantity. Prequential coding can efficiently encode neural networks and datasets (Blier & Ollivier, 2018).
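The information distance of Bennett et al. (1998) between objects x and y is, up to logarithmic terms, max(K(x|y), K(y|x)), where K is the (uncomputable) Kolmogorov complexity. To illustrate the general strategy of substituting a real codelength for K — the same move we make later with prequential coding, though with a different coder — the following toy sketch computes the normalized compression distance of Cilibrasi & Vitányi (2007) using an off-the-shelf compressor. This is only the classical computable surrogate, not our method:

```python
import zlib

def codelength(data: bytes) -> int:
    """Compressed size in bytes: a crude, computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    which approximates normalized information distance when C approximates K."""
    cx, cy, cxy = codelength(x), codelength(y), codelength(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox jumps over the lazy cat " * 20
    c = bytes(range(256)) * 4  # incompressible-looking, unrelated data

    print(ncd(a, a))  # near 0: an object shares all information with itself
    print(ncd(a, b))  # small: the two strings are closely related
    print(ncd(a, c))  # much larger: unrelated data share little information
```

The compressor plays the role that prequential codelength plays in our setting: any shortcoming of the coder only loosens the upper bound on the true information quantity.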
If we regard prequential coding as a compression algorithm for neural networks, then the codelength gives an upper bound on the amount of information in a model. We propose a method for calculating an approximate version of information distance with prequential coding for arbitrary networks. In this method, we use the KL-divergence in prequential training and coding, which allows us to directly estimate the expected codelength without any sampling process. We then perform experiments demonstrating that this information distance is invariant to the parameterization of the network while remaining faithful to the intrinsic similarity of models. Using information distance, we are able to sketch a rough view of the space of deep neural networks and uncover the relationships between datasets and models. We also found that information distance can

