MODEL INFORMATION AS AN ANALYSIS TOOL IN DEEP LEARNING

Abstract

Information-theoretic perspectives can provide an alternative dimension for analyzing the learning process and complement the usual performance metrics. Recently, several works have proposed methods for quantifying the information content of a model (which we refer to as "model information"). We demonstrate the use of model information as a general analysis tool for gaining insight into problems that arise in deep learning. By applying model information in different scenarios with different control variables, we adapt it to analyze the fundamental elements of learning: task, data, model, and algorithm. In each domain we provide an example where model information is used as a tool to derive new solutions to problems or to gain insight into the nature of the particular learning setting. These examples illustrate the versatility and potential utility of model information as an analysis tool in deep learning.

1. INTRODUCTION

The ultimate goal of much deep learning research has been to improve performance on specific datasets, for example, aiming for superior classification accuracy on the ILSVRC challenge. We have witnessed super-human performance on tasks in vision and language processing, but we are still far from understanding how the learning process works and whether it resembles human learning. This is partly because the metrics we use provide too little information on the dynamics of learning. Moreover, performance on the test set as a sole goal can sometimes lead to undesirable outcomes or misleading conclusions (Lipton & Steinhardt, 2019). Recently, several works have proposed using the description length of a model (or surrogate estimates thereof) to understand the behavior of learning. In this paper, we refer to such measures of the amount of information content in a model as model information. Blier & Ollivier (2018) first demonstrated efficient encoding of a deep neural network with the prequential coding technique. Zhang et al. (2020) then proposed an approximation of model information and used it to analyze the information content of a task. They also used model information to explain phenomena in transfer learning and continual learning, showing that model information provides a different perspective from performance, directly characterizing the information transfer in the learning process. Voita & Titov (2020) used model information as a probe to analyze what kind of information is present in a text representation model, claiming that model information is more informative and stable than performance metrics when used as a probe. Model information can provide an informational perspective on learning. It can potentially help answer questions about learning dynamics, such as how much information exists in a dataset or model, or how much information is transferred in a learning step.
Furthermore, we can reformulate existing problems as problems about information, for example, similarity and capacity, as we show in this paper. Compared with model performance, model information accounts not only for how well a model can perform, but also for how fast it learns to perform well (in the sense of sample efficiency; a discussion is given by Yogatama et al. (2019)). The latter can be interpreted as the quantity of information transferred during model training. In this paper, we illustrate that model information can provide a framework for analyzing and understanding phenomena in deep learning. In the next section, we give a general definition of model information, independent of how it is estimated. We then unify the analysis of the fundamental elements of deep learning under the framework of model information. In the following sections, we use several common problems as examples to show how to use the model information framework as an analysis tool.

2. AN INFORMATIONAL PERSPECTIVE TO ELEMENTS OF DEEP LEARNING

To measure the amount of information in a model, Voita & Titov (2020) use the codelength of a dataset minus the cross-entropy of the final model:

L = L^preq_{θ_init}(y_{1:N} | x_{1:N}) + Σ_{i=1}^{N} log p_{θ_N}(y_i | x_i).  (1)

Zhang et al. (2020) instead compare the prequential codelength of k held-out examples under the initial model and the final model:

L = L^preq_{θ_init}(y_{1:k} | x_{1:k}) − L^preq_{θ_N}(y_{1:k} | x_{1:k}).  (2)

Both methods share the idea that model information can be derived by comparing the codelength of encoding the model and data together with the codelength of encoding the data alone. This is essentially an approximation to the Kolmogorov complexity of the "generalizable knowledge" in a model M (in Zhang et al. (2020), this is denoted by K(M) − K(M | T)). If the task T is defined by y_i = f(x_i, ε_i) (where ε_i is i.i.d. noise), then the "generalizable knowledge" in T is K(f), and the generalizable knowledge in a trained model M_D is at most K(f). We denote the amount of generalizable information a model M contains about a dataset D as L(M, D). Then L(M_D, D) is the "model information": the amount of generalizable information a trained model M_D contains about dataset D. Ideally, if learning is "perfectly efficient", we can expect L(M_D, D) = K(M_D) − K(M_D | T) = K(f). In this paper, we investigate what can be achieved with a model information measure L. We find that model information is a powerful tool for analyzing and understanding many phenomena in deep learning, covering the fundamental elements of machine learning: task, data, model, and algorithm. In Table 1, we list an example problem for each domain and summarize how model information can be used to analyze it. Table 1: Model information as an analysis tool in deep learning

Element            Relevant problem         Use of model information
Task               Task difficulty          L(M_D, D)
Data               Domain similarity        L(M_D1, D2)
Model (structure)  Model capacity           max_D L(M_D, D)
Model (parameter)  Ablation study           L(M^c_D, D)
Algorithm          Knowledge distillation   L(M_{D+D_T}, D)

As we shall see, model information enables attacking some of the above problems from a neat and theoretically sound perspective. Table 1 is not an exhaustive list, but a set of examples of what one can do with model information. In the following sections, we detail the application of model information to each problem and explain how it can lead to useful insights. For the following experiments in this paper, we use (2) to estimate model information and set the encoding set size k = 10000, unless otherwise stated.

3. TASK: DIFFICULTY

The success of deep learning is marked by its superior performance at solving difficult tasks, like large-scale visual classification and question answering. For evaluating the capability of algorithms, it is important to understand the difficulty of the tasks and the datasets we have. Human definitions of difficulty are mostly subjective, for example, difficulty scoring, task performance, and time needed to complete the task (Ionescu et al., 2016). Several metrics have been proposed for measuring the difficulty of a task. Many focus on the complexity of the data. For images, difficulty is linked with image quality, objectness, and clutter (Ionescu et al., 2016). For text, the relevant factors include perplexity, reading ease, and diversity (Collins et al., 2018). For general real-valued data, one can measure the distribution of feature values, class separability, and the geometry of class manifolds (Ho & Basu, 2002). There are also other factors that affect the difficulty of a task, for example, class balance and class ambiguity. As pointed out by Collins et al. (2018), these characteristics each describe some aspect of the task, and difficulty cannot be described by any one of them alone. Complexity measures of the input data can sometimes have little impact on task difficulty.
An important line of work defines the difficulty of a classification problem by the complexity of the optimal decision boundary (Ho & Basu, 2002). The complexity of the boundary can be characterized by its Kolmogorov complexity or its minimum description length. This coincides with the idea of model information. While for simple tasks one can directly characterize the geometry of the decision boundary, for complex tasks that use a neural network to model the decision boundary, the description length can only be approximated. With description length, the difficulty of a task can be interpreted as the amount of information needed to solve the task. This amount of information is at most K(f), the complexity of the input-output relationship of the task. We measure the model information L(M_D, D) of a trained model and use it as a measure of the information content of dataset D. In Figure 1, we use five 10-class classification tasks to illustrate the idea. The results show that the datasets vary greatly in information content: for classifying digits, SVHN (Netzer et al., 2011) requires more information than MNIST (LeCun et al., 1998) or EMNIST (Cohen et al., 2017), meaning that classifying digits in street view images is more difficult than in standardized MNIST images. CIFAR-10 (Krizhevsky & Hinton, 2009) is even richer in information content and therefore more difficult, as natural objects are more complex than digits. There is also a general trend that the more difficult the task, the lower the performance and confidence of the model. However, performance and confidence are not determined by task complexity alone; they are also affected by noise present in the task. Next, we disentangle the two factors: task complexity (measured by information) and noise. We apply three kinds of noise to MNIST to represent different forms of noise present in a dataset (Figure 2).
We use two kinds of image noise: missing pixels, where a small random block of pixels is removed from the image, and missing objects, where a large portion of the digit is removed; and one form of label noise: Massart noise, where a small fraction of labels is replaced by random labels. To see how different kinds of noise affect model behavior, we plot model information, model accuracy, and confidence in Figure 3. We observe that information complexity and noise are two independent dimensions that affect performance on a task. Missing pixels make the task more complex and require more information to classify the digits correctly, because the model must learn more features for each digit, as any feature could be absent at any time. However, the model can still perform well after learning enough information about the task. Missing objects and wrong labels, on the other hand, are pure noise that does not affect the information complexity of the task. They effectively make some examples uninformative and simply confusing, which makes the model less confident in its predictions.
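The noise models above can be sketched in a few lines. This is an illustrative sketch, not the paper's exact implementation: the block size, fill value, and noise ratio are hypothetical choices, and "missing objects" would correspond to the same block-removal operation with a much larger block.

```python
import numpy as np

rng = np.random.default_rng(0)

def missing_pixels(img, block=6):
    """Image noise: zero out a random block of pixels.
    A small block models 'missing pixels'; a large block would model
    'missing objects'. Sizes here are illustrative, not the paper's settings."""
    out = img.copy()
    h, w = out.shape
    r = rng.integers(0, h - block)
    c = rng.integers(0, w - block)
    out[r:r + block, c:c + block] = 0
    return out

def massart_noise(labels, ratio=0.1, num_classes=10):
    """Label noise: replace a `ratio` fraction of labels with uniformly
    random class labels (Massart-style label noise)."""
    out = labels.copy()
    flip = rng.random(len(out)) < ratio
    out[flip] = rng.integers(0, num_classes, flip.sum())
    return out
```

Measuring model information, accuracy, and confidence on datasets corrupted this way is what separates the two axes: block removal raises information content, while label flipping leaves it unchanged but lowers confidence.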

4. DATA: DOMAIN SIMILARITY

Understanding the data is a fundamental part of understanding learning. Labeled data for deep learning comes from many sources and domains, and people seek to train models that can adapt well to a new domain, generalize to unseen domains, or be independent of domain (Gulrajani & Lopez-Paz, 2020). However, few attempt to understand the data before using it to train models. Similarity is a fundamental concept for understanding the relationship between different data. In deep learning, we are more concerned with similarity on the semantic level; semantically similar images can sometimes differ greatly in RGB space. The informational similarity measure proposed by Lin (1998) measures the similarity between A and B by the ratio between the amount of information needed to describe the commonality of A and B and the amount of information needed to describe A together with B:

sim(A, B) = I(common(A, B)) / I(description(A, B)) = (K(f_B) − K(f_B | f_A)) / K(f_A, f_B).  (3)

It is the Jaccard similarity between A and B measured with information. A property of the informational similarity measure is universality: it does not depend on a particular representation (or modeling) of A and B. We can use model information to estimate the information terms in (3), turning the similarity measure into (4) (using the approximations L(M_A, A) ≈ K(f_A) and L(M_A, B) ≈ K(f_B) − K(f_B | f_A); see the discussion on the symmetry of S in Appendix A.3):

S(A, B) = L(M_A, B) / (L(M_A, A) + L(M_B, B) − L(M_A, B)).  (4)

Another benefit of informational similarity is that similarity can be measured with respect to one side, resulting in a unidirectional similarity measure:

S_uni(A, B) = I(common(A, B)) / I(description(B)) = L(M_A, B) / L(M_B, B).  (5)

Note that generally S_uni(A, B) ≠ S_uni(B, A). The unidirectional similarity measure is useful for depicting the relationship between A and B. For example, if S_uni(A, B) < 1 and S_uni(B, A) = 1, then one can tell that A is a subset of B.
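Given measured model-information values, the two similarity measures reduce to simple ratios. A minimal sketch, where the L arguments are hypothetical measurements of L(M_A, A), L(M_B, B), and L(M_A, B):

```python
def info_similarity(L_AA, L_BB, L_AB):
    """Eq. (4): Jaccard-style informational similarity between domains A and B.
    L_AA = L(M_A, A), L_BB = L(M_B, B), L_AB = L(M_A, B), all in the same
    codelength units (e.g. bits)."""
    return L_AB / (L_AA + L_BB - L_AB)

def info_similarity_uni(L_AB, L_BB):
    """Eq. (5): unidirectional similarity S_uni(A, B) = L(M_A, B) / L(M_B, B),
    the fraction of B's information already captured by a model trained on A."""
    return L_AB / L_BB
```

For identical domains the measure is 1 (all information is shared); for domains with no transferable information it is 0, which is what makes the score comparable across datasets.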
We perform experiments on two commonly used domain adaptation datasets: Office-31 (Saenko et al., 2010) and Office-Home (Venkateswara et al., 2017). They have 3 and 4 different domains, respectively, for images of the same set of classes (Figure 4). We calculate the informational similarity S and the unidirectional informational similarity S_uni for each pair of domains in each dataset. The baseline we use for comparison is the distance of first-order and second-order statistics in the feature space. As illustrated in Figure 5, the informational similarity measure gives a similarity score between 0 and 1, which is intuitive and comparable across datasets. S and S_uni largely agree with the feature distances, but the latter are not comparable across datasets. For instance, dslr and webcam in Office-31 are much more similar than any other pair in the two datasets, which is reflected in S but not in the feature distances. Within each dataset, S also corresponds more faithfully to visual similarities. The unidirectional similarity S_uni gives extra information; for instance, S_uni(amazon, dslr) > S_uni(dslr, amazon) shows that amazon contains more distinct information than dslr. This can be explained by the fact that each category in amazon has images of 90 different objects, while each category in dslr only has images of 5 objects.

5. MODEL STRUCTURE: CAPACITY

Neural networks are powerful learners that can scale to extremely large datasets given enough neurons and layers, yet it largely remains a mystery how to quantify the true capacity of a neural network. Capacity can be defined in different ways: for example, Collins et al. (2017) measure capacity by the number of bits of data a network can remember, while Baldi & Vershynin (2019) define capacity as the number of distinct functions a network can represent. We define the information capacity of a network M as the maximum amount of information the model can hold over a family of tasks T, measured directly with model information:

C(M) = max_{D ∈ T} L(M_D, D).

To illustrate the information capacity of a model, we experiment with models from small to large (4 configurations of ResNet-11: large, standard, small, and tiny; see the Appendix for details) and datasets of varying complexity (subsets of TinyImageNet with 50, 100, 150, and 200 classes). We measure the model information of every model trained on each task and plot the results in Figure 6. The first thing we notice in Figure 6.a is that with increasing task complexity, model information saturates at a certain point for each model. This reveals the capacity of the model (dotted lines). Larger models can store more information and saturate later, on larger tasks. Figure 6.b plots the increase of information capacity against the number of parameters: information capacity roughly grows with the number of parameters in the model, which agrees with the observation in Collins et al. (2017). The same observation also applies to datasets: each dataset has a definite amount of information content. Larger models can learn more information from a dataset, but can hardly learn more than this amount. This is shown in Figure 6.c, where dataset information is indicated by the dotted lines.
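The capacity definition can be made concrete in a few lines: given measured L(M_D, D) values for a family of tasks, capacity is the maximum, and the saturation point is where the curve flattens. The numbers in the usage example below are hypothetical, not the paper's measurements.

```python
def information_capacity(info_by_task):
    """C(M) = max over tasks D of the measured model information L(M_D, D).
    `info_by_task` maps a task identifier to its measured value."""
    return max(info_by_task.values())

def saturation_point(sizes_and_info, tol=0.05):
    """Return the first task size at which model information stops growing by
    more than a `tol` relative increment, i.e. where capacity is reached.
    `sizes_and_info` is a list of (task_size, L(M_D, D)) pairs, sorted by size."""
    for (_, L0), (s1, L1) in zip(sizes_and_info, sizes_and_info[1:]):
        if L1 - L0 < tol * L0:
            return s1
    return None
```

With hypothetical measurements such as {50: 1.2, 100: 2.0, 150: 2.1, 200: 2.1}, capacity is the plateau value 2.1, mirroring the dotted lines in Figure 6.a.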

6. MODEL PARAMETER: ABLATION

Next we turn to the analysis of parameters within a network. A common way to understand components of a network is to perform ablations: remove some units or structural components from the network and compare performance before and after ablation (Meyes et al., 2019). A larger performance drop signifies greater importance of the ablated component for performing the task. However, there are several caveats to this approach. First, ablation cannot reveal the true contribution of a component "in vivo": if a network layer is removed and the model re-trained, the functionality of the layer can be taken over by other layers, so ablation fails to reveal the contribution of the layer in the original network. Second, ablation cannot be performed on components vital to model performance; for example, if residual connections are removed from a very deep CNN, the network can fail to train, and it would be unreasonable to conclude that residual connections contribute 100% of the model. Here we propose to use model information to measure the contribution of a network component. The information I in a network component c can be measured as follows (where M^c_D denotes the trained model with component c ablated):

I(c) = L(M_D, D) − L(M^c_D, D).

To calculate the information in c, we reset all parameters within this component to their initialized values before training, then measure the model information of this ablated model M^c_D. The difference in information between M_D and M^c_D is the information content in the parameters of component c. In effect, we are performing information ablation: only the information stored in the parameters is ablated, while the model structure is kept intact. Information ablation helps us uncover how the information in each component contributes to the whole network, or in other words, how information in a network is stored by its components. In Figure 7, we perform information ablation on a ResNet model as an example.
The model consists of a series of residual blocks, each containing several convolutional layers. Figure 7.b shows the information in each block, as well as the information per model parameter ("density"). There are several observations: blocks in the middle (block2-5) contribute the most information; blocks near the input have larger information density; and blocks that reduce spatial resolution (marked '/2') contain more information than blocks that preserve resolution. The last phenomenon is likely because convolutional layers in downsampling blocks do more of the work of combining smaller object parts into larger ones, and thus hold more knowledge about the constitution of objects. Figure 7.c illustrates something unique to the information ablation method: here we measure the information contributed by four different kinds of convolutional layers within each block. We find that the majority of the information resides in the 3x3 bottleneck layer, which is where most spatial feature transformations take place. Convolutional layers on the shortcut path contain surprisingly little information, despite having roughly the same number of parameters as the layers before and after the bottleneck. This indicates that the shortcut layers serve more of a structural role that eases training, rather than doing substantial information processing.
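The parameter-reset step of information ablation can be sketched independently of any deep learning framework, representing a model as a dict mapping parameter names to arrays. The naming scheme ("block/param") is an illustrative assumption, not the paper's implementation:

```python
import copy

def information_ablation(trained, init, component):
    """Reset the parameters of one named component to their initialization,
    leaving the rest of the trained model intact. `trained` and `init` are
    dicts mapping hypothetical 'component/param' names to parameter values.
    Model information of the returned model would then be measured via (2),
    and I(c) = L(M_D, D) - L(M^c_D, D)."""
    ablated = copy.deepcopy(trained)
    for name in ablated:
        if name.startswith(component + "/"):
            ablated[name] = copy.deepcopy(init[name])
    return ablated
```

Note that the structure is untouched: every parameter tensor keeps its shape, and only the learned values inside the chosen component are discarded, which is exactly what distinguishes information ablation from structural ablation.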

7. ALGORITHM: DISTILLATION

Knowledge distillation is a method to transfer knowledge from a larger teacher model to a smaller student model, which can result in better-performing compact models (Hinton et al., 2015). Student models are trained with the output probabilities of a teacher model in addition to the true labels. It is speculated that the student model benefits from "dark knowledge" in the teacher's predictions. If "dark knowledge" means the presence of extra information, then with proper tools we can verify the existence of extra information provided by the teacher and quantify how much information is transferred in distillation. We measure the information transferred from teacher to student by the following formula:

I = L(M_{D+D_T}, D) − L(M_D, D).  (6)

The process of online coding gradually increases the size of the dataset D and updates the model accordingly. In synced distillation, we update the teacher and the student model in synchronization (Figure 8.a): in each iteration, we train the teacher on a subset D_t, then use D_t and the teacher's predictions to train the student via distillation. The student model is then used to encode y_{t+1}. In the end, the student is the same as one trained with conventional distillation, but during the process the teacher never leaks information about future examples to the student. This enables the student to generate a valid encoding of D, which can then be used to calculate model information. As pointed out in Zhang et al. (2020), model information can help illustrate a different aspect of learning dynamics than model performance. For the simple 5-layer CNN model that we use as the student, distillation with different teachers and distillation coefficients α yields similar performance (Figure 8.b and c).
However, measuring the information gain I reveals that ResNet-56 and ResNet-101 transfer more information to the student than ResNet-20. This helps the student reach lower cross-entropy with fewer examples, even though the final performance is similar. A larger distillation coefficient also increases information transfer: as the weighting of the KL-divergence term in the student's loss increases, the student learns more from the teacher. Observations like these can help analyze and design better algorithms, or explain why algorithms like knowledge distillation lead to performance gains (Yim et al., 2017).
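The role of the coefficient α can be made explicit with the standard soft-target distillation objective. This is a sketch of the usual formulation (per example, without temperature scaling); the paper's exact loss may differ:

```python
import numpy as np

def distillation_loss(student_logp, teacher_p, onehot, alpha=0.1):
    """Per-example student loss: (1 - alpha) * cross-entropy with the true
    label + alpha * KL(teacher || student) between output distributions.
    Standard soft-target formulation; details of the paper's loss (e.g. a
    softmax temperature) are not specified here and may differ."""
    student_logp = np.asarray(student_logp, dtype=float)
    teacher_p = np.asarray(teacher_p, dtype=float)
    onehot = np.asarray(onehot, dtype=float)
    ce = -np.sum(onehot * student_logp)
    kl = np.sum(teacher_p * (np.log(teacher_p) - student_logp))
    return (1.0 - alpha) * ce + alpha * kl
```

As α grows, the KL term dominates, so more of the teacher's predictive distribution (and hence more of its information) is pushed into the student, matching the trend measured via (6).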

8. DISCUSSION

Currently, methods for estimating the description length of neural network models are still preliminary, lacking thorough analysis and guarantees of efficiency. Using prequential coding to estimate model information or Kolmogorov complexity also introduces dependencies on model architecture and training procedure; for example, dropout, batch normalization, and the SGD optimizer all affect model information estimates. Because the prequential codelength is always larger than the Kolmogorov complexity, we can optimize the training hyperparameters to achieve as low a codelength as possible, which makes the estimate tighter. When applying model information for analysis, as we did in this paper, it is also necessary to vary only one variable at a time. Because L(M, D) is a function of both the model and the dataset, when analyzing models the dataset must be fixed for codelengths to be comparable; similarly, when datasets are the center of interest, the model must be fixed. This also introduces dependency: theoretically we can define difficulty by K(f), but empirically, when estimating K(f) using L(M_D, D), dependency on the model architecture becomes inevitable. One can choose any adequate model architecture for training on the task, then fix it to compare the difficulty of tasks. Nonetheless, we demonstrate that model information can be a powerful analysis tool in deep learning. To summarize, the key observations from our experiments are:

• Model information allows us to quantify properties such as difficulty, capacity, and similarity in a consistent fashion, especially for complex data and deep models.
• Model information can help analyze and understand learning in deep networks by showing how information is transferred and stored.
• Such a tool is widely applicable to many kinds of problems because it does not depend on specific models or learning algorithms.
We hope problems discussed in this work serve as examples of the versatility of an informational perspective in investigating neural network learning.

A EXPERIMENTS

A.1 GENERAL EXPERIMENT SETTINGS

For experiments in this paper, we use (2) to estimate model information and set the encoding set size k = 10000 (except for experiments on Office-31 and Office-Home, see below). All models are trained using the Adam optimizer with early stopping on the validation set.

A.2 DIFFICULTY

Statistics of the noise injection experiments are given in Tables 2-4. We trained ResNet-56 models on MNIST with different kinds of injected noise. Model confidence is defined as the margin between the probabilities of the top and second classes predicted by the model:

confidence = E_x[p_c1(x) − p_c2(x)],  c1 = argmax_k p_k(x),  c2 = argmax_{k≠c1} p_k(x).  (9)

A.3 SIMILARITY

L(M_T, T) − L(M_S, T) can be directly calculated using (2):

L(M_T, T) − L(M_S, T) = L^preq_{M_S}(y_{1:k} | x_{1:k}) − L^preq_{M_T}(y_{1:k} | x_{1:k}).

This assumes that information from ImageNet is known and is not included when calculating similarities. It is possible to use randomly initialized models instead, but as the number of examples is too small for some domains in Office-31, we found that using pre-trained models gives more stable codelength estimates. Also, because the smallest domains in Office-31 and Office-Home have only 399 and 1942 training examples respectively, we use about half of each dataset, i.e., k = 200 for Office-31 and k = 1000 for Office-Home, to estimate L in (2). Theoretically, the similarity measure in (3) and (4) is symmetric:

S(A, B) = I(common(A, B)) / I(description(A, B)) = (K(f_B) − K(f_B | f_A)) / K(f_A, f_B) = (K(f_A) − K(f_A | f_B)) / K(f_A, f_B) = S(B, A),

because K(f_A) + K(f_B | f_A) = K(f_B) + K(f_A | f_B) = K(f_A, f_B). Empirically, when using L(M_A, B) as an approximation of K(f_B) − K(f_B | f_A), usually S(A, B) ≠ S(B, A), because L(M_A, B) and L(M_B, A) do not match exactly. In this paper we report empirical results using (4). The first-order and second-order feature distances d_1 and d_2 are defined as:

d_1 = ||E_x[f_A(x)] − E_x[f_B(x)]||_2,  (13)
d_2 = ||Cov_x[f_A(x)] − Cov_x[f_B(x)]||_F,  (14)

where f_A(x) and f_B(x) are the representations of x produced by models trained on domains A and B, and F stands for the Frobenius norm.
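The baseline feature distances (13) and (14) reduce to a short numpy computation over representation matrices; the array shapes here are an illustrative assumption (one row per example):

```python
import numpy as np

def feature_distances(feats_A, feats_B):
    """First- and second-order feature distances, Eqs. (13) and (14):
    L2 distance between mean feature vectors, and Frobenius distance between
    feature covariance matrices. `feats_A`, `feats_B` are (num_examples, dim)
    arrays of representations from models trained on domains A and B."""
    d1 = np.linalg.norm(feats_A.mean(axis=0) - feats_B.mean(axis=0))
    cov_A = np.cov(feats_A, rowvar=False)
    cov_B = np.cov(feats_B, rowvar=False)
    d2 = np.linalg.norm(cov_A - cov_B, ord="fro")
    return d1, d2
```

Unlike S and S_uni, these distances are unbounded and scale with the feature space, which is why they are not comparable across datasets.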



Footnote 1: TinyImageNet, https://tiny-imagenet.herokuapp.com



Figure 1: Information content, model performance and model confidence on five tasks.

Figure 3: Task information, model performance, and model confidence, with different kinds and varying levels of noise.

Figure 4: Image domains in Office-31 and Office-Home.

Figure 5: Pairwise domain similarity on Office-31 (top) and Office-Home (bottom). S and S uni are information similarities, D 1 and D 2 are first-order and second-order feature distances.

Figure 6: Measuring capacity using information. (a) the amount of information a model can store is capped by its capacity. (b) increase in capacity is correlated with an increase in the number of parameters. (c) a task has an inherent amount of information content.

Figure 7: Information ablation. (a) Structure of a variant of ResNet model in (He et al., 2016a). (b) Information ablation by each block in the model. (c) Information ablation by different types of convolution layers. 'bn' stands for the bottleneck layer.

Figure 8: (a) illustration of the training process in synced distillation; arrows indicate training procedures, with information flowing in the arrow directions. (b) information gain of students from different teachers. (c) information gain of the student when trained with different distillation coefficients α.

We use a ResNet-18 model pre-trained on ImageNet as the initial model, then finetune it on the source dataset S and measure L(M_S, T) on the target dataset T. Pairwise codelengths representing L(M_T, T) − L(M_S, T) are given in Tables 5-6. Diagonal values are L(M_T, T).

Zhang et al. (2020) propose to estimate model information by subtracting the codelength of k examples (independent of and different from the training set of size N) under the final model from their codelength under the initial model, and show that this is more reliable than (1):

L = L^preq_{θ_init}(y_{1:k} | x_{1:k}) − L^preq_{θ_N}(y_{1:k} | x_{1:k}).  (2)
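As a toy numeric illustration of this estimator, the codelength difference can be computed from per-example predicted probabilities; the probabilities below are made-up numbers, not measurements from the experiments:

```python
import numpy as np

def codelength_bits(probs):
    """Codelength (in bits) of labels under a model's predictions, given one
    probability p(y_i | x_i) per held-out example: sum of -log2 p."""
    return float(-np.log2(np.asarray(probs)).sum())

def model_information(p_init, p_final):
    """Eq. (2): codelength of k held-out examples under the initial model
    minus their codelength under the trained model. `p_init` and `p_final`
    are hypothetical per-example probabilities from the two models."""
    return codelength_bits(p_init) - codelength_bits(p_final)

# Toy 10-class setting: an untrained model is near-uniform (p = 0.1), the
# trained model is confident (p = 0.9) on each of k = 4 held-out examples.
L = model_information([0.1] * 4, [0.9] * 4)
```

A positive L means the trained model compresses the held-out labels better than the initial model, and that saving is what the paper treats as the information the model acquired.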

Model statistics on MNIST with image noise (missing pixels)

Model statistics on MNIST with image noise (missing objects)

Model statistics on MNIST with label noise (massart noise)

L(M T , T ) -L(M S , T ) on domains in Office-31

L(M T , T ) -L(M S , T ) on domains in Office-Home

A.4 CAPACITY

To show the capacity of a model, we need to saturate the model with information from the training set. We therefore use ResNet-11, the smallest of the ResNet configurations, with four different layer widths, as follows: Large: [16, 48, 96], Standard: [16, 32, 64], Small: [16, 24, 32], Tiny: [16, 16, 24]. For example, [16, 48, 96] means the layer width is 16, 48, and 96 in the first, second, and third residual blocks. The datasets we use are subsets of Tiny-ImageNet, containing 50, 100, 150, and 200 classes; the number of examples in each dataset is fixed at 12500. L(M_D, D) for models of different sizes and datasets of different complexity is listed in Table 7.

L(M_D, D) for model and dataset of different size

A.5 ABLATION

We train a ResNet-20 model (referred to as M_D) on CIFAR-10, reset the parameters of a network component c (this ablated model is referred to as M^c_D), and measure the difference in model information by I(c) = L(M_D, D) − L(M^c_D, D). Ablation results for different network layers are shown in Tables 8-9.

A.6 DISTILLATION

We also experiment with different coefficients α to control the influence of the teacher model. Results are shown in Tables 10-11. We observe that the student's information gain plateaus after α = 0.2, which roughly agrees with the conventional choice of α = 0.1 in knowledge distillation.

