QUANTIFYING TASK COMPLEXITY THROUGH GENERALIZED INFORMATION MEASURES

Abstract

How can we measure the "complexity" of a learning task so that we can compare one task to another? From classical information theory, we know that entropy is a useful measure of the complexity of a random variable and provides a lower bound on the minimum expected number of bits needed for transmitting its state. In this paper, we propose to measure the complexity of a learning task by the minimum expected number of questions that need to be answered to solve the task. For example, the minimum expected number of patches that need to be observed to classify FashionMNIST images. We prove several properties of the proposed complexity measure, including connections with classical entropy and sub-additivity for multiple tasks. As the computation of the minimum expected number of questions is generally intractable, we propose a greedy procedure called "information pursuit" (IP), which selects one question at a time depending on previous questions and their answers. This requires learning a probabilistic generative model relating data and questions to the task, for which we employ variational autoencoders and normalizing flows. We illustrate the usefulness of the proposed measure on various binary image classification tasks using image patches as the query set. Our results indicate that the complexity of a classification task increases as signal-to-noise ratio decreases, and that classification of the KMNIST dataset is more complex than classification of the FashionMNIST dataset. As a byproduct of choosing patches as queries, our approach also provides a principled way of determining which pixels in an image are most informative for a task.

1. INTRODUCTION

Deep networks have shown remarkable progress in both simple and complex machine learning tasks. But how does one measure the "complexity" of a learning task? Is it possible to ascertain in a principled manner which tasks are "harder" to solve than others? How "close" is one task to another? Answers to these questions would have implications in many fields of machine learning such as transfer learning, multi-task learning, un/semi/self-supervised learning, and domain adaptation. In classical information theory, the entropy of a random variable X is a useful measure of complexity for tasks such as compression and transmission, which essentially require reconstructing X. However, the entropy of X is insufficient for measuring the complexity of a supervised learning task T_{X,Y}, where the goal is to predict an output Y from an input X, i.e., to estimate the conditional p_{Y|X}(y | x) from a finite set of samples from p_{XY}(x, y), which we refer to as solving the learning task. Complexity measures provided by statistical learning theory, like VC-dimension or covering numbers, are also inadequate for this purpose because they ignore the dependence between X and Y for the particular task at hand. Information-theoretic measures such as mutual information, the information bottleneck (Tishby et al., 2000) and its variants (Strouse & Schwab, 2017) have been used to study the trade-off between model complexity and accuracy, but were not developed to assess task complexity and can give unsatisfactory results when comparing different tasks (see Section 5 for details). Measures based on Kolmogorov complexity (Li, 2006; Vereshchagin & Vitányi, 2004) could in principle be used to compare different tasks, but they are sensitive to permutations of the dataset and not easily computable.
The work of Achille et al. (2019a) proposes to quantify task complexity by measuring the information stored in the network weights, but the approach depends on the specific neural network architecture used for training. The work of Tran et al. (2019) does not require trained models, but makes strict assumptions that limit its broad applicability. In this work, we introduce a novel perspective on task complexity which generalizes classical measures from information theory. Specifically, one well-known interpretation of classical Shannon entropy is the following: given a random variable X, find the minimum number of bits needed on average to encode instances of X so that the instances can be perfectly recovered from the binary code. Stated differently, if one lets Q be the set of all possible binary functions on the domain of X, then Shannon entropy essentially asks for the optimal sequence of queries {q_1(X), q_2(X), ... : q_i ∈ Q} to compute (i.e., how to encode X as a binary string) so that X can be perfectly recovered from the shortest (on average) sequence of binary answers to the queries (see Section 2 for more discussion of this interpretation). As discussed above, however, in most learning tasks we are not interested in simply compressing X but rather in making a prediction about some other variable Y. Further, notions of complexity can be made more relevant to a specific task by taking Q to be not the set of all possible binary functions on X but a smaller set of queries specific to a measure of interest. From this intuition, we define the complexity of a learning task as the minimum expected number of queries, selected from a fixed set Q, that one needs to ask in order to predict Y (to some user-defined level of confidence) from the respective answers.
As a few specific examples of potential query sets:

• Decision boundary complexity: Here, Q is the set of all possible half-spaces in R^d (assuming X ∈ R^d) and q(x) is a binary answer indicating whether x lies in a particular half-space (q ∈ Q). Task complexity then formalizes the intuitive "level of non-linearity" of the decision boundary. For example, the complexity of any linearly-separable binary classification task is 1, whereas for a non-linearly separable task this value depends on the curvature of the decision boundary.

• Task-specific input feature complexity: Here, Q is the set of projection functions of X and q is of the form q(x) = x_q, where x_q is the value observed at the q-th entry of x. Task complexity then formalizes the intuition that "the greater the redundancy between the input features, the easier it is to solve the task". For example, Y being a constant function of X would be the simplest task, with complexity 0, since no feature needs to be queried to predict it. This notion of complexity would help in answering questions such as "which input features are most important for solving a given task?" and could in turn help in developing more "interpretable" learning algorithms.

• Visual semantic complexity: Given a vocabulary V of possible entities, their attributes, and their relations in a visual scene, Q could be the set of all binary functions indicating the presence or absence of an entity, attribute, or relation (supplied by V) in a designated region of the image X. For example, a particular q could implement the query "Is there a person in the top left corner of the image?". This notion of complexity would allow one to gauge the semantic complexity of a visual task. For instance, tasks which ask complex questions like "Is there a person playing with his dog, next to a river in the image?"
would inherently be more complex than simple object detection tasks such as "Where is the dog in this image?", and this difference could be quantified by semantically relevant queries. While our proposed formal definition of task complexity is applicable to all such choices of query functions {q}_{q∈Q} and enjoys several nice theoretical properties that we discuss in Section 2, its computation is generally intractable. As a result, in Section 3 we propose to reduce the cost of selecting a minimal set of questions by using the Information Pursuit (IP) algorithm, which selects questions sequentially, depending on previous questions and answers, in order of information gain. While IP is generally applicable to any task and query set, its implementation may still be intractable depending on the complexity of the model p(X, Y) and of the set Q. To address this issue, we posit a tractable graphical model for p(X, Y) and learn the required distributions using variational autoencoders and normalizing flows. In Section 4 we evaluate our approach on various binary image classification tasks (MNIST, KMNIST, FashionMNIST, Caltech Silhouettes) that can be tackled using a common set of queries (the set of image patches). Our results show that complexity computed using patch queries aligns with the intuition that the complexity of a classification task increases as signal-to-noise ratio decreases, and that classification of the KMNIST dataset is more complex than classification of the FashionMNIST dataset, something that isn't obvious a priori. While these experiments are restricted to simple tasks and queries, the proposed framework is generally applicable provided that tractable models, inference and learning methods can be developed, which is the subject of ongoing and future work.
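The decision-boundary query set described earlier can be sketched concretely. The weights, bias, and sample points below are arbitrary stand-ins for illustration, not values from the paper:

```python
import numpy as np

def halfspace_query(w, b):
    """Return a binary query q(x) = 1 if w.x + b >= 0, else 0."""
    return lambda x: int(np.dot(w, x) + b >= 0)

# A linearly separable toy task: label = sign of the first coordinate.
# A single half-space query suffices, so the task complexity is 1.
q = halfspace_query(np.array([1.0, 0.0]), 0.0)
print(q(np.array([2.0, -1.0])))  # -> 1
print(q(np.array([-0.5, 3.0])))  # -> 0
```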
Finally, we note that, to the best of our knowledge, this is the first time a subjective notion of task complexity has been proposed in the literature, where users can incorporate in Q the perception of complexity they wish to measure.

2. QUANTIFYING TASK COMPLEXITY

Let the input data be represented by a random variable X and the corresponding output/hypothesis by a random variable Y, with sample spaces X and Y respectively. A supervised learning task T_{X;Y} is defined as the task of estimating the conditional distribution p_{Y|X}(y | x) from a finite set of samples from the joint distribution p_{XY}(x, y). We propose to quantify the complexity of a task as the minimum expected number of queries needed to solve the task. Queries q ∈ Q are functions of the input data, whose answers for a given x are denoted {q(x)}_{q∈Q}. Naturally, the query set needs to be sufficiently rich so that the task is solvable based on answers to the queries. More formally, we say the set Q is sufficient for task T_{X;Y} if for all (x, y) ∈ X × Y, p(y | x) = p(y | {x' ∈ X : q(x') = q(x) for all q ∈ Q}). In other words, Q is sufficient for task T_{X;Y} if, whenever two points x and x' have identical answers for all queries in Q, their true posteriors are equal, p(y | x) = p(y | x'). Given a fixed query set Q, we now formally define an encoding function E_Q, which we refer to as a Q-Coder, along with our proposed complexity measure C_Q(X; Y) for task T_{X;Y}.

Defining the Encoder function. Given a query set Q, an Encoder is a function E : S* → Q, where S* is the set of all finite-length sequences generated using elements from the set S = {(q, q(x)) | q ∈ Q, x ∈ X}. Additionally, we require that Q contains a special query q_STOP which signals the encoder to stop asking queries and output the code for x as the set of query-answer pairs asked so far. Given E and an input sample x, the process is as follows:

1. q_1 = E(∅). The first question is independent of x.
2. q_{k+1} = E({(q_i, q_i(x))}_{1:k}). All subsequent queries depend on the query-answer pairs observed so far for x.
3. If q_{L+1} = q_STOP, terminate and return Code^E_Q(x) := ((q_i, q_i(x)))_{1:L} as the code for x.
Note that each q_i depends on x, but we drop this from the notation for brevity. Note also that the number of queries L = |Code^E_Q(x)| for a particular x generalizes coding length in information theory. The query q_STOP constrains the code Code^E_Q(x) to be prefix-free.

Defining task complexity. Given a joint distribution p_{XY}(x, y) and a sufficient query set Q, we define the task complexity C_Q(X; Y) as the minimum over all encoders E (which are mappings from X to a subset of queries in Q) of the mean coding length:

C_Q(X; Y) := min_E E_X[ |Code^E_Q(X)| ]
s.t. p(y | x) = p(y | Code^E_Q(x)) for all x ∈ X, y ∈ Y (Sufficiency)   (2)

Figure 1: Schematic view of the overall framework for quantifying the complexity of a task T_{X;Y}.

The constraint ensures sufficiency of the code for all x ∈ X, in the sense that there exists at least one Encoder E for which it is satisfied; here p(y | Code^E_Q(x)) should be interpreted as the conditional probability of y given the event {x' ∈ X | Code^E_Q(x') = Code^E_Q(x)}. The solution to (2) provides the optimal encoder for task T_{X;Y}, and Fig. 1 illustrates the overall framework in detail.

Connection with Shannon entropy H. It can be shown that when Q is taken to be the set of all possible binary questions on X, and Y is a function of X (which it usually is for supervised classification problems (Kolchinsky et al., 2018)), then the solution E* to (2) gives a coding length within one bit of the entropy of Y, denoted H(Y). More formally, one can show that

H(Y) ≤ C_Q(X; Y) = E_X[ |Code^{E*}_Q(X)| ] < H(Y) + 1.   (3)

Note that one example of such an optimal encoder E*(x) is given by the Huffman code for Y.

Connection with task complexity & equivalence classes. Notions of complexity of an entity/object often relate to the level of structural regularity present in the entity (Gell-Mann, 2002). C_Q(X; Y) measures the degree of regularity present in the structure of T_{X;Y}.
This structure is implicitly defined by the conditional p_{Y|X}(y | x) and the query set Q. Notice that any E partitions X into equivalence classes [x] = {x' ∈ X | Code^E_Q(x') = Code^E_Q(x)}. The prefix-free constraint on the codes generated by E (due to q_STOP) ensures that for all x' ∈ [x] and all y ∈ Y, p_{Y|X=x}(y) = p_{Y|X=x'}(y). It is then natural to relate task complexity to the number of equivalence classes induced by the optimal E*: the greater the number of equivalence classes, the higher the complexity of the task. This is because knowing the distribution p_{Y|X}(y | x') for any one element x' in each equivalence class is sufficient to predict Y. An extreme case is the constant function, which arguably has the simplest possible structure of any task T_{X;Y}. The equivalence class in this case is just X, since p_{Y|X}(y | x) is the same for all x ∈ X. Thus, the number of equivalence classes for the constant function is 1 (the minimum possible). This is expected, since for a constant function there is no structure to learn from data whatsoever! The following result (see Appendix A.1 for a proof) relates C_Q(X; Y) and the number of equivalence classes, the latter being hard to compute from finite samples.

Proposition 1. Given a finite query set Q, b-valued query-answers {q(X)}_{q∈Q} and any δ > 0, the number of equivalence classes induced by the minimizer of (2) can be upper bounded by b^{C_Q(X;Y) + |Q| sqrt(log(1/δ)/2)} with probability of misclassifying X at most δ.

Proposition 1 indicates that, for the same Q, tasks with greater C_Q(X; Y) have larger complexity (by comparing the upper bound on the number of equivalence classes). It also illustrates the role played by different bases b. For example, in visual recognition tasks, if one queries the intensities of all the pixels in the image at once then C_Q(X; Y) = 1 (observing the intensities of all pixels is sufficient to predict Y from X).
However, b in that case would be large (exponential in the size of the image). If instead one queries individual pixels, b would be the number of intensity values each pixel can take, but C_Q(X; Y) ≥ 1.

Properties of C_Q(X; Y). The following proposition, whose proof can be found in Appendix A.2, establishes some key properties of our proposed measure of complexity.

Proposition 2. For any query set Q that is sufficient for task T_{X;Y}, C_Q(X; Y) satisfies the following properties:

1. C_Q(X; Y) ≥ 0. (non-negativity)
2. C_Q(X; Y) = 0 iff X ⊥⊥ Y. (trivial structure)
3. If for all x, x' ∈ X with x ≠ x' there exists y ∈ Y such that p_{Y|X=x}(y) ≠ p_{Y|X=x'}(y), then C_Q(X; Y) ≥ C_Q(X; Ỹ) for all tasks T_{X;Ỹ}, provided Q is sufficient for T_{X;Ỹ}. (total structure)
4. C_Q(X; Y_1, Y_2) ≤ C_Q(X; Y_1) + C_Q(X; Y_2) for any two tasks with X ∼ p_X(x) and Y_1 ⊥⊥ Y_2 | X. (sub-additivity under union)

The property "trivial structure" captures the fact that if Y is independent of X, the learning task is trivial (i.e., there is nothing to be learned about Y from observing X or functions of X). The property "total structure" captures the intuition that if, for a given task T_{X;Y}, the conditional distribution p_{Y|X=x}(y) is different for every x ∈ X, then learning is "impossible" (assuming Y is a categorical variable, as it is for most classification tasks), since no inductive bias would help generalization to unseen inputs. An example is Y = f(X) where f : X → Y is injective. The last property, "sub-additivity under union", is especially interesting in transfer learning settings where the source task T_{X;Y_1} and target task T_{X;Y_2} have the same marginal X ∼ p_X(x) but different conditionals p_{Y_1|X}(y_1 | x) and p_{Y_2|X}(y_2 | x).
Here C_Q(X; Y_1, Y_2) refers to the complexity of the task T_{X;Y_1,Y_2} defined by p_{XY_1Y_2}(x, y_1, y_2), where the corresponding sufficiency constraint in (2) becomes

p(y_1, y_2 | x) = p(y_1, y_2 | Code^E_Q(x)) for all x ∈ X, y_1 ∈ Y_1, y_2 ∈ Y_2.   (5)

This property could potentially be exploited to predict the success of transfer learning for different choices of source and target tasks. Further, the assumption Y_1 ⊥⊥ Y_2 | X is not particularly strict, as it simply states that given the input X, knowledge of Y_1 is not required to predict Y_2 and vice-versa.

ε-approximate task complexity. In practice, we are often interested in solving a task "approximately" rather than "exactly". To accommodate this, we extend the definition of Sufficiency in (2) to ε-Approximate Sufficiency by relaxing it to d(p(y | x), p(y | Code^E_Q(x))) ≤ ε for all x ∈ X. Here, d is any distance-like metric on distributions, such as the KL-divergence, total variation, or Wasserstein distance (see Appendix A.3).

Having established the above properties of our complexity measure, we note that, unfortunately, for a general query set Q the problem in (2) is NP-Complete and hence generally intractable (Laurent & Rivest, 1976). As a result, we instead consider a greedy approximation to C_Q(X; Y) via an algorithm called Information Pursuit (IP), introduced by Geman & Jedynak (1996), which we describe in detail next.
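Although (2) is NP-Complete in general, on tiny discrete problems it can still be solved exactly by exhaustive search over adaptive encoders (decision trees). The sketch below is our own toy illustration of the definition, not the paper's algorithm: it recursively picks the query that minimizes the expected remaining depth until the posterior over Y is constant within the current event.

```python
from itertools import product

def task_complexity(xs, p_x, posterior, queries):
    """Exact C_Q(X;Y) for a finite, deterministic toy task: minimal
    expected number of queries until all points consistent with the
    answers so far share the same posterior over Y."""
    def solve(event):
        # Sufficiency reached: every x in the event has the same posterior.
        if len({posterior(x) for x in event}) == 1:
            return 0.0
        best = float("inf")
        for q in queries:
            # Split the current event by the answer to q.
            groups = {}
            for x in event:
                groups.setdefault(q(x), []).append(x)
            if len(groups) == 1:
                continue  # q is uninformative within this event
            mass = sum(p_x[x] for x in event)
            cost = 1.0 + sum(
                (sum(p_x[x] for x in g) / mass) * solve(tuple(g))
                for g in groups.values()
            )
            best = min(best, cost)
        return best
    return solve(tuple(xs))

# Toy task: X uniform on {0,1}^2, Y = XOR of the bits,
# Q = coordinate projections (the "input feature" query set).
xs = list(product([0, 1], repeat=2))
p_x = {x: 0.25 for x in xs}
posterior = lambda x: x[0] ^ x[1]
queries = [lambda x: x[0], lambda x: x[1]]
print(task_complexity(xs, p_x, posterior, queries))  # -> 2.0
```

Both bits must always be queried for XOR, so the complexity is exactly 2; a constant-label task would return 0, matching the "trivial structure" property.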

3. APPROXIMATING TASK COMPLEXITY USING INFORMATION PURSUIT

From this section onwards we assume Q is a finite set. Information pursuit (IP) is a greedy algorithm introduced by Geman & Jedynak (1996) which provides an approximate solution to (2). The Encoder in IP, denoted E_IP, is recursively defined as follows:

q_1 = E_IP(∅) = argmax_{q∈Q} I(q(X); Y)
q_{k+1} = E_IP({(q_i, q_i(x))}_{1:k}) = argmax_{q∈Q} I(q(X); Y | B_{x,k})   (6)

where x is an input data-point and I stands for mutual information. In other words, IP chooses the next query q_{k+1} as the one whose answer maximizes the mutual information with Y given the history of questions and answers about x chosen by IP up to time k, i.e., given the event B_{x,k} := {x' ∈ X | {(q_i, q_i(x'))}_{1:k} = {(q_i, q_i(x))}_{1:k}}. Ties in choosing q_{k+1} are broken arbitrarily if the maximum is not unique. The algorithm stops when the following condition is satisfied:

q_{L+1} = q_STOP if max_{q∈Q} I(q(X); Y | B_{x,m}) = 0 for all m ∈ {L, L+1, ..., L+T},   (7)

where T > 0 is a hyper-parameter chosen via cross-validation; the rationale behind this choice is provided in Appendix A.4. We denote the mean coding length of this sub-optimal encoder E_IP by C̃_Q(X; Y). To compute an approximation to the ε-approximate task complexity, we modify the stopping criterion in (7) to max_{q∈Q} I(q(X); Y | B_{x,m}) ≤ ε and call this estimate C̃^ε_Q(X; Y).
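The IP recursion (6) with the simple stopping rule can be sketched on an empirical discrete distribution using plug-in mutual information estimates. This is our own simplified illustration (exact enumeration of the joint, no generative model), not the paper's implementation:

```python
import math
from collections import Counter

def mutual_info(pairs):
    """Plug-in estimate of I(A; B) in bits from (a, b) samples."""
    n = len(pairs)
    pa, pb, pab = Counter(), Counter(), Counter(pairs)
    for a, b in pairs:
        pa[a] += 1
        pb[b] += 1
    return sum(
        (c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )

def information_pursuit(x, samples, queries, tol=1e-9):
    """Greedy IP: repeatedly pick the query maximizing I(q(X); Y)
    restricted to the event B consistent with the answers for x so far."""
    event = list(samples)          # (x_sample, y_sample) pairs
    history = []
    remaining = list(queries)
    while remaining:
        gains = [
            (mutual_info([(q(s), y) for s, y in event]), i)
            for i, q in enumerate(remaining)
        ]
        gain, i = max(gains)
        if gain <= tol:            # stopping rule: no informative query left
            break
        q = remaining.pop(i)
        ans = q(x)
        history.append(ans)
        event = [(s, y) for s, y in event if q(s) == ans]
    return history

# Toy AND task: X uniform on {0,1}^2, Y = AND of the bits.
samples = [((a, b), a & b) for a in (0, 1) for b in (0, 1)]
queries = [lambda s: s[0], lambda s: s[1]]
print(len(information_pursuit((1, 1), samples, queries)))  # -> 2
```

For x = (1, 1) both bits must be observed before Y is determined, whereas for x = (0, 0) a single 0-answer already pins Y to 0 and IP stops after one query, illustrating that the coding length adapts to the input.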

3.1. APPROXIMATION GUARANTEES FOR INFORMATION PURSUIT

While it is difficult in general to give performance guarantees for IP, in the specialized setting in which Q indexes the set of all possible binary functions of X such that H(q(X) | Y) = 0 for all q ∈ Q, and Y is a function of X, we have the following proposition (see Appendix A.6 for a proof).

Proposition 3. Given a task T_{X;Y} with Y a discrete random variable, if there exists a function f such that Y = f(X) and Q is the set of all possible binary functions of X such that H(q(X) | Y) = 0 for all q ∈ Q, then H(Y) ≤ C̃_Q(X; Y) ≤ H(Y) + 1.

While the above statement is often assumed to be true, to the best of our knowledge this is the first time a rigorous proof has been presented. Thus, in this special case, from (3) we have C̃_Q(X; Y) ≤ C_Q(X; Y) + 1, and so IP is a tight approximation to our proposed complexity measure.

3.2. INFORMATION PURSUIT USING VARIATIONAL AUTOENCODERS + NORMALIZING FLOWS

IP requires probabilistic models relating query-answers and data in order to compute the required mutual information terms in (6). Specifically, computing q_{k+1} in (6) (for any iteration k) requires computing the mutual information between q(X) and Y given the history B_{x,k} up to time k. As histories become longer, we quickly run out of samples in our dataset which belong to the event B_{x,k}. As a result, non-parametric sample-based methods for estimating mutual information (such as Belghazi et al. (2018)) would be impractical. In this subsection, we propose a model-based approach to address this challenge for a general task T_{X;Y} and query set Q. In Section 4 we adapt this model to the specific case where Q indexes image patches.

Information Pursuit Generative Model. Let Q(X) = {q(X) : q ∈ Q}. To make learning tractable, we introduce latent variables Z to account for all the dependencies between different query answers, and posit the following factorization of (Q(X), Y, Z):

p_{Q(X),Z,Y}(Q(x), η, y) = ∏_{q∈Q} p_{q(X)|Z,Y}(q(x) | η, y) p_Y(y) p_Z(η)   (8)

Throughout the paper, η and q(x) denote realizations of Z and q(X) respectively. Equation (8) is a reasonable assumption unless the answers q(X) are causally related to each other (Reichenbach's common cause principle (Hofer-Szabó et al., 1999)). In other words, assuming that the query-answers are conditionally independent given the hypothesis and "some" latent vector is benign and ubiquitous in many machine learning applications.

1. q(X) as object presence indicators evaluated at non-overlapping windows: Let Q be a set of non-overlapping windows in the image X, with q(X) the random variable indicating the presence of an object at the q-th location. The correlation between the qs is entirely due to latent image-generating factors Z, such as lighting, camera position, scene layout, and texture, along with the scene description signal Y.

2.
q(X) as snippets of speech utterances: A common assumption in speech recognition tasks is that the audio frame features (q(X)) are conditionally independent given latent phonemes Z (often modelled as a Hidden Markov Model).

The latent space Z is often lower-dimensional than the original high-dimensional X. We learn Z from data in an unsupervised manner using variational inference. Note that the assumption of conditional independence will not hold in scenarios where the q(X) directly cause each other; for instance, if in example 1 we considered overlapping patches. Specifically, we parameterize the distributions {p_ω(q(x) | η, y) : q ∈ Q} with a neural network with shared weights ω, called the Decoder Network. These weights are learnt using stochastic variational Bayes (Kingma & Welling, 2013) by introducing an approximate posterior distribution p_φ(η | y, Q(x)), parameterized by another neural network with weights φ called the Encoder Network, together with priors p_Y(y) and p_Z(η).

Implementing E_IP using the generative model. Using the learnt Decoder network, one can estimate the distribution p_{q(X),Y}(q(x), y) via Monte Carlo integration and compute q_1 = E_IP(∅). For subsequent queries (k > 1), the computation of q_{k+1} requires the mutual information conditioned on the current history B_{x,k}, which can be calculated from the distribution

p(q(x), y | B_{x,k}) = ∫ p(q(x) | η, y) p(η | y, B_{x,k}) p(y | B_{x,k}) dη.   (9)

To estimate the left-hand side of (9) via Monte Carlo integration, one draws N i.i.d. samples η^(i) ∼ p(η | y, B_{x,k}) and computes (1/N) Σ_{i=1}^N p_ω(q(x) | η^(i), y) p(y | B_{x,k}), where the term p(y | B_{x,k}) can be estimated recursively starting from p(y | B_{x,0}) := p_Y(y). Appendix A.7 gives more details on these computations.
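The Monte Carlo estimate of (9) can be sketched as follows. The decoder and the posterior sampler below are illustrative stand-ins (a hand-written Bernoulli likelihood and a fixed-seed Gaussian), not the paper's trained networks:

```python
import numpy as np

def decoder_prob(q_ans, eta, y):
    """Stand-in for the learnt decoder p_w(q(x) | eta, y): a Bernoulli
    whose mean depends on the latent eta and the label y."""
    p1 = 1.0 / (1.0 + np.exp(-(eta.sum() + y)))
    return p1 if q_ans == 1 else 1.0 - p1

def estimate_joint(q_ans, y, p_y_given_B, posterior_sampler, n=5000):
    """Monte Carlo estimate of (9):
    p(q(x), y | B) ~= (1/n) sum_i p(q(x) | eta_i, y) * p(y | B),
    with eta_i drawn from p(eta | y, B)."""
    etas = posterior_sampler(y, n)
    mc = np.mean([decoder_prob(q_ans, e, y) for e in etas])
    return mc * p_y_given_B

# Stand-in posterior sampler (fixed seed so repeated calls agree).
sampler = lambda y, n: np.random.default_rng(1).normal(size=(n, 4))
p0 = estimate_joint(0, 1, 0.5, sampler)
p1 = estimate_joint(1, 1, 0.5, sampler)
# Sanity check: the two answer values exhaust the mass given y and B.
print(abs((p0 + p1) - 0.5) < 1e-9)  # -> True
```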
Note, however, that it is not straightforward to sample from the posterior p(η | y, B_{x,k}) without resorting to advanced Markov Chain Monte Carlo techniques, which often suffer from the curse of dimensionality and are computationally intensive. To mitigate this issue, we implement p(η | y, B_{x,k}) with another neural network trained to learn this posterior.

Estimating p(η | y, B_{x,k}) with Normalizing Flows. We amortize the cost of modelling p(η | y, B_{x,k}) for each history encountered during the E_IP recursion by assuming the existence of a function Ψ such that p(η | y, B_{x,k}) = Ψ({(q_i, q_i(x))}_{1:k}, y, η) for all x ∈ X and any iteration k. We employ Normalizing Flows (Dinh et al., 2016) to approximate Ψ from data. Specifically, we construct a sequence of invertible mappings of the form η_l = Ψ_l({(q_i, q_i(x))}_{1:k}, y, η_{l+1}; γ_l), each parameterized by a neural network with weights γ_l, such that η_0 is constrained to be normally distributed. The composite function η_0 = Ψ({(q_i, q_i(x))}_{1:k}, y, η; γ) is a composition of N neural networks with parameters γ = {γ_l}_{l∈{1,...,N}}. We call this network Ψ the Conditional Inference Network (see Fig. 2); see Appendix A.8 for details on the training procedure. Using the trained Ψ, one can sample from the posterior p(η | y, B_{x,k}), as required in (9), for any observed history B_{x,k}. The sampling procedure is as follows: (i) sample η_0 ∼ N(0, I_d) (assuming η_0 ∈ R^d); (ii) compute η = Ψ^{-1}({(q_i, q_i(x))}_{1:k}, y, η_0).
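The sampling procedure (i)-(ii) can be sketched with a single conditional affine coupling layer standing in for Ψ. The "network" here is one random linear map with a tanh, and the conditioning vector is an arbitrary stand-in for the encoded history and label, so this is purely structural, not the trained model:

```python
import numpy as np

def coupling_forward(eta, cond, w):
    """One conditional affine coupling layer: the second half of eta is
    scaled and shifted by functions of (first half of eta, cond)."""
    d = eta.shape[0] // 2
    h = np.tanh(w @ np.concatenate([eta[:d], cond]))   # tiny stand-in "network"
    s, t = h[:d], h[d:]
    out = eta.copy()
    out[d:] = eta[d:] * np.exp(s) + t
    return out

def coupling_inverse(z, cond, w):
    """Exact inverse: the first half passes through unchanged, so the
    same s, t can be recomputed and undone."""
    d = z.shape[0] // 2
    h = np.tanh(w @ np.concatenate([z[:d], cond]))
    s, t = h[:d], h[d:]
    out = z.copy()
    out[d:] = (z[d:] - t) * np.exp(-s)
    return out

# Step (i): eta0 ~ N(0, I); step (ii): push through the inverse map
# conditioned on (history, y) to obtain a posterior sample eta.
rng = np.random.default_rng(0)
dim, cond_dim = 4, 3
w = rng.normal(size=(dim, dim // 2 + cond_dim))
cond = rng.normal(size=cond_dim)     # stand-in encoding of {(q_i, q_i(x))}, y
eta0 = rng.normal(size=dim)
eta = coupling_inverse(eta0, cond, w)
# Invertibility check: forward(inverse(z)) recovers z.
print(np.allclose(coupling_forward(eta, cond, w), eta0))  # -> True
```

Stacking several such layers (alternating which half is transformed) yields the composite Ψ; each layer stays invertible because the untransformed half carries the information needed to recompute its scale and shift.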

4. CASE STUDY: COMPLEXITY OF BINARY IMAGE CLASSIFICATION TASKS

As a practical instantiation of our theory, we concentrate on the task of classifying binary images. We choose Q as the set of image patches, with answers being the intensities observed at each patch. The reason for this choice is two-fold: (i) patches provide a sufficiently rich query set to compare different learning tasks on binary images and allow us to measure the task-specific input feature complexity of different tasks; (ii) from a practical standpoint, state-of-the-art deep generative models for binary images can be assumed to be near-"perfect" models, allowing us to illustrate the usefulness of the framework with minimal modelling bias. For all our experiments, we take Q as the index set of all 3 × 3 overlapping patches in the image. This requires some modelling changes. Recall that (8) only holds if the q(X) are not causally related; with overlapping patches this assumption is clearly violated. Instead, we model (8) at the level of pixels X_j (X denoting the binary image and j the j-th pixel): p(x, η, y) = ∏_j p(x_j | η, y) p(y) p(η). Further network training details are in Appendix A.9.

Complexity increases as signal-to-noise ratio decreases. We tested the effect of two different task-specific nuisances on C_Q(X; Y) for MNIST classification: (i) MNIST-α: randomly flipping pixels in MNIST images with probability α ∈ [0, 1]; (ii) MNIST-Translated: translating the digits by at most 4 pixels. Fig. 3a shows the results. The plot shows the trade-off between accuracy and task complexity for different values of ε. To normalize for the effect of different datasets having different Bayes error rates, we report the trade-off using relative test accuracies, i.e., the accuracy obtained by predicting Y according to argmax_{y∈Y} p(y | B_{x,L}) divided by the accuracy of the prediction made upon observing the entire image (all the patches indexed by Q).
The results indicate that, at almost all desired relative accuracy levels, MNIST, MNIST-0.05, and MNIST-0.1 are in increasing order of complexity. Our experiments also indicate that the complexity of non-centered MNIST digit classification is the greatest (evaluated at any fixed accuracy level).

Comparing different classification tasks. While the previous experiment had an expected "correct" trend (complexity increases as the nuisance level increases), the complexity ordering is intuitively less clear when comparing across datasets. We used our framework to compute complexities of image classification on four different datasets of binary images: MNIST, FashionMNIST, KMNIST, and Caltech Silhouettes. Fig. 3b reports the results. Our findings indicate that MNIST < FashionMNIST < KMNIST < Caltech Silhouettes in terms of task complexity at almost all relative test accuracy levels.

Connections with interpretability. A common theme across the different notions of interpretability in the ML literature is a "short description length". An interesting consequence of our formulation is that, when evaluated using queries that index image patches, it gives a principled way of selecting the most important parts of an image for a task. Figs. 3c-f illustrate this with an example. IP predicts the image label "motorbike" by observing just the edges of the silhouette, which intuitively should be the most important parts of the image for this task. Indeed, Fig. 3f reveals that E_IP visits these edges with high probability for any random image labelled "motorbike". This heatmap makes sense since the dataset only has centered images; in the general case a more in-depth analysis would be required. This property of IP could be utilized to develop more interpretable ML algorithms.
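The overlapping-patch query set used in these experiments can be sketched directly. This is a minimal illustration of the query construction only (the image and sizes are arbitrary stand-ins); the experiments themselves rely on the learnt generative model:

```python
import numpy as np

def patch_queries(patch=3):
    """Build the query set Q for binary images: one query per top-left
    corner (i, j), answering the intensities of the (overlapping) patch."""
    def make_q(i, j):
        return lambda img: tuple(
            int(v) for v in img[i:i + patch, j:j + patch].ravel()
        )
    def build(shape):
        H, W = shape
        return {(i, j): make_q(i, j)
                for i in range(H - patch + 1)
                for j in range(W - patch + 1)}
    return build

# A 5x5 binary image with a single "on" pixel at the centre.
img = np.zeros((5, 5), dtype=int)
img[2, 2] = 1
Q = patch_queries()(img.shape)
print(len(Q))          # -> 9  (i.e., (5-3+1)^2 overlapping 3x3 patches)
print(Q[(1, 1)](img))  # -> (0, 0, 0, 0, 1, 0, 0, 0, 0)
```

Because the patches overlap, adjacent queries share pixels, which is exactly why the factorization (8) is moved to the pixel level in this case.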

5. CONCLUSION AND RELATED WORK

In this paper, we introduced a novel notion of task complexity intimately tied to a query set Q. In the following paragraphs, we briefly discuss some relevant prior work with connections to our proposed measure. The information pursuit algorithm has roots in the seminal work of Geman & Jedynak (1996), which proposes an active testing framework for tracking roads in an image. That algorithm was extended in Sznitman & Jedynak (2010) for face detection and localization, in Sznitman et al. (2012) for detection and tracking of surgical instruments, and in Jahangiri et al. (2017) for scene interpretation. Also, while Sznitman & Jedynak (2010) and Jahangiri et al. (2017) learn generative models for their tasks, their models are radically different from ours. The problem of classifying objects by sequentially observing different image locations has recently been re-branded as Hard Attention in vision (Mnih et al., 2014; Elsayed et al., 2019; Li et al., 2016), and several deep learning solutions have been proposed. These methods typically try to learn a policy for iteratively choosing different parts of an image to solve different visual tasks, often using optimization techniques from reinforcement learning. High variance in the gradient estimates and scalability issues prevent their widespread adoption. It would be interesting to see whether IP can be combined with reinforcement learning-based approaches to design better reward functions that facilitate efficient policy search. The information bottleneck (IB) method proposed by Tishby et al. (2000) is perhaps the closest to our work. They define complexity in terms of the mutual information (MI) between the input X and its representation X̃, such that X̃ preserves a certain amount (determined by a user-defined parameter) of information about the output variable Y. In this way, their measure of complexity accounts for the relationship between X and Y.
However, this complexity is not very useful for comparing different learning tasks. For instance, in Figs. 1 & 2 of Kolchinsky et al. (2019), the MI between X and X̃ for the FashionMNIST and MNIST datasets is roughly the same at the same level of accuracy. This is a problem, since we know from practical experience that MNIST is a much "simpler" task than FashionMNIST. Interestingly, when Q is taken to be the set of all possible binary functions of X, our proposed measure recovers IB and its variants (Strouse & Schwab, 2017); that discussion, however, is outside the scope of this paper. We note in passing that there has recently been work on quantifying the information of a system under limited computation and model constraints (Xu et al., 2020), which could potentially be explored in the future in conjunction with our framework. In a sense, the choice of Q constrains the way information from X can be extracted to predict Y. Task complexity measures based on Kolmogorov complexity are not computable (Li, 2006; Achille et al., 2019b; Vereshchagin & Vitányi, 2004). These measures are based on the idea of finding the minimum-length computer program that, given input x, outputs label y for every (x, y) in the training dataset. This is in stark contrast to our definition, which is a property of the joint distribution p_{XY}(x, y) and not of any given finite dataset. Computing C_Q(X; Y) is NP-Complete but not uncomputable: there exist dynamic programming based algorithms that compute C_Q(X; Y) exactly, though their complexity is typically exponential in |Q| and hence feasible only when |Q| is small. For large |Q| we must turn to approximations, and Information Pursuit is one such strategy. An algorithm for computing Kolmogorov complexity, on the other hand, does not exist, let alone an efficient one.
Besides computability, a more pressing issue with Kolmogorov complexity is that the resulting measures are sensitive to permutations of the dataset, which is undesirable. Moreover, Kolmogorov complexity fails to distinguish between memorization and learning. A dataset sampled from $p_{XY}(x, y)$ where Y is independent of X attains the maximum Kolmogorov-based complexity. However, from a learning point of view there is nothing to learn: an optimal strategy is to simply predict $p(Y \mid X) = p(Y)$ regardless of the value of X, so the complexity of such tasks should be 0. Achille et al. (2019b) present a more detailed discussion of this point. The proposed measure $C_Q(X; Y)$ is not sensitive to dataset permutations, since it is a property of the distribution $p_{XY}(x, y)$. Moreover, $C_Q(X; Y) = 0$ when Y is independent of X (see Proposition 2.2), and so it distinguishes learning from memorization. Our work is related in spirit to that of Achille et al. (2019b), which introduces an alternative measure of task complexity based on the intuition that the information stored in the weights of a trained network can serve as a measure of task complexity. They show that their measure recovers Kolmogorov complexity, Shannon information, and Fisher information as special cases, and is also related to PAC-Bayes generalization bounds.

A APPENDIX

In the proofs of propositions and lemmas, we restate the claim for convenience.

A.1 PROOF OF PROPOSITION 1

Proposition. Given a finite query set Q, b-valued query-answers $\{q(X)\}_{q \in Q}$ and any $\delta > 0$, the number of equivalence classes induced by the minimizer of (2) can be upper bounded by $b^{\,C_Q(X;Y) + |Q|\sqrt{\frac{1}{2}\log\frac{1}{\delta}}}$, with probability of misclassifying X at most $\delta$.

Proof. Let $d_Q(X)$ be the random variable denoting the optimal code-length for X under $E^*$, the solution to (2). By construction, $0 \le d_Q(X) \le |Q|$, and by definition $C_Q(X;Y) = \mathbb{E}_X[d_Q(X)]$. Hence, by Hoeffding's lemma,

$$d_Q(X) - C_Q(X;Y) \sim \mathrm{subG}\!\left(\tfrac{|Q|^2}{4}\right) \quad (10)$$

Since $d_Q(X)$ is sub-Gaussian, for any $\delta > 0$,

$$P\!\left(d_Q(X) > C_Q(X;Y) + |Q|\sqrt{\tfrac{1}{2}\log\tfrac{1}{\delta}}\right) \le \delta \quad (11)$$

Notice that the prefix-free nature of $E^*$ naturally gives rise to a tree $T_Q$, with $d_Q(x)$ the depth of the leaf reached by input x. The number of equivalence classes is exactly the number of leaves of $T_Q$. Since each query-answer $q(X)$ is b-valued, we can upper bound the number of leaves by replacing $T_Q$ with a balanced b-ary tree $\tilde{T}_Q$ with $b^{\,C_Q(X;Y) + |Q|\sqrt{\frac{1}{2}\log\frac{1}{\delta}}}$ leaves. $\tilde{T}_Q$ is constructed from $T_Q$ as follows:

1. For any leaf node with depth $d(x) < C_Q(X;Y) + |Q|\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$, vacuously increase the depth until equality is achieved by repeatedly asking the last query. For any internal node of $T_Q$ that is not balanced, add vacuous subtrees by randomly selecting queries from Q (no $x \in \mathcal{X}$ is sent to these new subtrees) so that every internal node has b children. As a result, for every $x \in \mathcal{X}$ with $d(x) < C_Q(X;Y) + |Q|\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$, $\tilde{T}_Q(x)$ computes the same posterior as $T_Q(x)$, namely $p(y \mid \mathrm{Code}^{E^*}_Q(x))$, and thus makes no error.

2. For any leaf node with depth $d(x) > C_Q(X;Y) + |Q|\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$, cut the path at depth $u(x) = C_Q(X;Y) + |Q|\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$ by merging the subtree rooted at depth $u(x)$ into a single leaf node. An input $x \in \mathcal{X}$ sent to this new leaf node may be misclassified; the probability of this event is bounded by (11).
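For completeness, the step from (10) to (11) is the standard sub-Gaussian tail bound; a brief check in our notation, with Z denoting the centered code length:

```latex
% Z := d_Q(X) - C_Q(X;Y) is subG(sigma^2) with sigma^2 = |Q|^2/4, so
P(Z > t) \le \exp\!\Big(-\frac{t^2}{2\sigma^2}\Big)
         = \exp\!\Big(-\frac{2t^2}{|Q|^2}\Big), \qquad
t = |Q|\sqrt{\tfrac{1}{2}\log\tfrac{1}{\delta}}
\;\Longrightarrow\;
P(Z > t) \le \exp\!\big(-\log\tfrac{1}{\delta}\big) = \delta
```

which is exactly the bound stated in (11).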

A.2 PROOF OF PROPOSITION 2

Proposition. For any query set Q that is sufficient for task $T_{X,Y}$, $C_Q(X;Y)$ satisfies the following properties.

1. $C_Q(X;Y) \ge 0$. (non-negativity)
2. $C_Q(X;Y) = 0$ iff $X \perp\!\!\!\perp Y$. (trivial structure)
3. If $\forall x, x' \in \mathcal{X}$, $x \ne x'$, $\exists y \in \mathcal{Y}$ such that $p_{Y|X=x}(y) \ne p_{Y|X=x'}(y)$, then $C_Q(X;Y) \ge C_Q(X;\tilde{Y})$ for all tasks $T_{X,\tilde{Y}}$, provided Q is sufficient for $T_{X,\tilde{Y}}$. (total structure)
4. $C_Q(X;Y_1,Y_2) \le C_Q(X;Y_1) + C_Q(X;Y_2)$ for any two tasks with $X \sim p_X(x)$ and $Y_1 \perp\!\!\!\perp Y_2 \mid X$. (sub-additivity under union)

Proof.

1. Follows trivially from the definition.

2. For the "if" part, observe that if $X \perp\!\!\!\perp Y$ then $p_{Y|X=x}(y) = p_Y(y)$ for all $y \in \mathcal{Y}$, $x \in \mathcal{X}$. Choose $E_1$ to be the encoder with $q_1 = q_{STOP}$. Then $\mathbb{E}_X\big[|\mathrm{Code}^{E_1}_Q(X)|\big] = 0$, and $E_1$ is trivially prefix-free (every $x \in \mathcal{X}$ is mapped to the empty code $\emptyset$). Recall that $p(y \mid \mathrm{Code}^{E_1}_Q(x))$ should be interpreted as the conditional probability of y given the event $\{x' \in \mathcal{X} \mid \mathrm{Code}^{E_1}_Q(x') = \mathrm{Code}^{E_1}_Q(x)\}$. Since $\mathrm{Code}^{E_1}_Q(x) = \emptyset$ for all x, we have $p(y \mid \mathrm{Code}^{E_1}_Q(x)) = p(y \mid \mathcal{X}) = p(y)$. Hence $E_1$ also provides a sufficient code and is thus a feasible solution to the optimization problem in (2) with objective value 0; by property 1, it achieves the optimum, so $C_Q(X;Y) = 0$. The proof of the "only if" part is very similar. Let $E^*$ be the optimal solution to (2) with $C_Q(X;Y) = 0$. This implies $\mathrm{Code}^{E^*}_Q(x) = \emptyset$ for all $x \in \mathcal{X}$. Since $E^*$ is sufficient, $p(y \mid x) = p(y \mid \mathrm{Code}^{E^*}_Q(x)) = p(y)$ for all x and y, which implies $X \perp\!\!\!\perp Y$.

3. Let $E_T$ denote the minimizer of (2) for the task $T_{X,Y}$, where Y given X is distributed as stated. From sufficiency, together with the assumption that distinct inputs have distinct posteriors, no two different inputs can share a code, that is, $\mathrm{Code}^{E_T}_Q(x) \ne \mathrm{Code}^{E_T}_Q(x')$ for all $x, x' \in \mathcal{X}$, $x \ne x'$. Consider any other task $T_{X,\tilde{Y}}$ defined by $p(X, \tilde{Y})$ with the same marginal $X \sim p_X(x)$ but a possibly different conditional $p_{\tilde{Y}|X}(\tilde{y} \mid x)$, such that Q is also sufficient for $T_{X,\tilde{Y}}$. Then $E_T$ is a feasible solution to (2) for $T_{X,\tilde{Y}}$, since $p(\tilde{y} \mid \mathrm{Code}^{E_T}_Q(x)) = p(\tilde{y} \mid \{x\}) = p(\tilde{y} \mid x)$. Thus, $C_Q(X;\tilde{Y}) \le \mathbb{E}_X\big[|\mathrm{Code}^{E_T}_Q(X)|\big] = C_Q(X;Y)$ for any task $T_{X,\tilde{Y}}$ for which Q is sufficient.

4. Let $E_1$ and $E_2$ be the optimal encoders for tasks $T_{X,Y_1}$ and $T_{X,Y_2}$ respectively. Construct an encoder $E_{12}$ for the task $T_{X,(Y_1,Y_2)}$ by concatenating the two codes. It is not hard to see that $\mathbb{E}_X\big[|\mathrm{Code}^{E_{12}}_Q(X)|\big] \le \mathbb{E}_X\big[|\mathrm{Code}^{E_1}_Q(X)|\big] + \mathbb{E}_X\big[|\mathrm{Code}^{E_2}_Q(X)|\big]$; the inequality arises because (query, answer) tuples can overlap between the codes constructed by $E_1$ and $E_2$ for the same input x. $E_{12}$ is prefix-free by construction. Now, for any observation $y_1$ of the output $Y_1$,

$$\begin{aligned} p(y_1 \mid \mathrm{Code}^{E_{12}}_Q(x)) &= p(y_1 \mid \mathrm{Code}^{E_1}_Q(x) \cup \mathrm{Code}^{E_2}_Q(x)) \\ &= p(y_1 \mid \{x' \in \mathcal{X} : \mathrm{Code}^{E_1}_Q(x') \cup \mathrm{Code}^{E_2}_Q(x') = \mathrm{Code}^{E_1}_Q(x) \cup \mathrm{Code}^{E_2}_Q(x)\}) \\ &= p(y_1 \mid \{x' \in \mathcal{X} : \mathrm{Code}^{E_1}_Q(x') = \mathrm{Code}^{E_1}_Q(x)\} \cap \{x' \in \mathcal{X} : \mathrm{Code}^{E_2}_Q(x') = \mathrm{Code}^{E_2}_Q(x)\}) \\ &= p(y_1 \mid x) \end{aligned}$$

The last equality appeals to the fact that $E_1$ satisfies the "sufficiency" constraint in (2) for task $T_{X,Y_1}$. Similarly, $p(y_2 \mid \mathrm{Code}^{E_{12}}_Q(x)) = p(y_2 \mid x)$ for any observation $y_2$ of the output $Y_2$. Given $Y_1 \perp\!\!\!\perp Y_2 \mid X$, this implies that for all $x \in \mathcal{X}$,

$$p(y_1, y_2 \mid x) = p(y_1 \mid x)\, p(y_2 \mid x) = p(y_1 \mid \mathrm{Code}^{E_{12}}_Q(x))\, p(y_2 \mid \mathrm{Code}^{E_{12}}_Q(x)) = p(y_1, y_2 \mid \mathrm{Code}^{E_{12}}_Q(x))$$

This proves that $E_{12}$ is sufficient for the task $T_{X,(Y_1,Y_2)}$. Since $E_{12}$ is a feasible solution to the optimization problem with respect to $T_{X,(Y_1,Y_2)}$, we obtain the required inequality.

A.3 $\epsilon$-APPROXIMATE COMPLEXITY OF A TASK

In practice, we are often interested in solving a task "approximately" rather than "exactly". This requires introducing a notion of approximate sufficiency instead of exact sufficiency, and we extend the definition in (2) to incorporate it:

$$C^{\epsilon}_Q(X;Y) := \min_E \mathbb{E}_X\big[|\mathrm{Code}^E_Q(X)|\big] \quad (12)$$
$$\text{s.t.}\quad d\big(p(y \mid x),\, p(y \mid \mathrm{Code}^E_Q(x))\big) \le \epsilon \quad \forall x \in \mathcal{X} \quad \text{(Approx. Sufficiency)}$$

Here, d is any distance-like metric on distributions, such as the KL-divergence, total variation, or Wasserstein distance. Additionally, if d is convex in both its arguments, symmetric, and satisfies the triangle inequality, then $C^{\epsilon}_Q(X;Y)$ satisfies the properties in Proposition 2 with two key differences:

1. In Property 3, $C^{\epsilon}_Q(X;Y)$ is the complexity of a task for which $d\big(p_{Y|X=x}(y), p_{Y|X=x'}(y)\big) > 2\epsilon$ for all $x, x' \in \mathcal{X}$, $x \ne x'$.
2. In Property 4, the relation becomes $C^{\epsilon}_Q(X;Y_1,Y_2) \le C^{\epsilon/2}_Q(X;Y_1) + C^{\epsilon/2}_Q(X;Y_2)$.

A.4 TERMINATION CRITERION FOR $E^{IP}_Q$

Ideally, we would like to terminate ($E^{IP}$ outputs $q_{STOP}$) after L steps if

$$p(y \mid x) = p(y \mid x') \quad \forall x, x' \in B_{x,L},\ y \in \mathcal{Y} \quad (14)$$

However, detecting this is difficult in practice. We have the following lemma.

Lemma A.4.1. Assume $Y \perp\!\!\!\perp q(X) \mid X$ for all $q \in Q$. If the event $B_{x,L}$ satisfies the condition specified by (14), then for all subsequent queries $q_m$, $m \ge L$, $\max_{q \in Q} I(q(X); Y \mid B_{x,m}) = 0$. Since ties are broken arbitrarily when the maximum is not unique, $E^{IP}$ chooses any $q \in Q$ as a subsequent query $q_m$.

Refer to Appendix A.5 for a proof. The assumption $Y \perp\!\!\!\perp q(X) \mid X$ for all $q \in Q$ is generally true, since we have the Markov chain $Y \to X \to q(X)$ for every $q \in Q$. Using Lemma A.4.1, the correct stopping criterion would be

$$L = \inf\{k \in \{1, 2, \ldots, |Q|\} : \max_{q \in Q} I(q(X); Y \mid B_{x,m}) = 0 \ \ \forall m,\ k \le m \le |Q|\} \quad (15)$$

Evaluating (15) would be computationally costly, since it would involve processing all the queries for every input x. We instead employ a more practically amenable criterion:

$$q_{L+1} = q_{STOP} \quad \text{if} \quad \max_{q \in Q} I(q(X); Y \mid B_{x,m}) = 0 \ \ \forall m \in \{L, L+1, \ldots, L+T\}$$

where $T > 0$ is a hyper-parameter chosen via cross-validation.
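The patience-based criterion above can be sketched as the following loop (a minimal sketch; `mutual_info(q, history)` stands in for an estimator of $I(q(X); Y \mid B_{x,k})$, and `patience` plays the role of T; all names are ours):

```python
def run_ip(queries, mutual_info, max_steps, patience, tol=1e-6):
    """Greedy information pursuit with the patience-based stopping rule:
    emit q_STOP once the best available query has had (near-)zero
    conditional mutual information for more than `patience` consecutive
    iterations."""
    history = []   # chosen (query, answer) pairs; they define the event B_{x,k}
    stall = 0
    for _ in range(max_steps):
        gains = {q: mutual_info(q, history) for q in queries}
        q_best = max(gains, key=gains.get)
        if gains[q_best] <= tol:
            stall += 1
            if stall > patience:
                return history                    # q_STOP
        else:
            stall = 0
        # ties (and zero-gain steps) are broken arbitrarily, as in Lemma A.4.1
        history = history + [(q_best, None)]      # answer observed at run time
    return history
```

Note that the loop keeps asking (arbitrarily chosen) queries during a zero-gain stretch precisely because a transiently uninformative query set can become informative again, as the XOR example in this section illustrates.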
Note that it is possible that no informative query exists at one iteration, yet informative queries appear at the next iteration once a question has been chosen. For example, consider the XOR problem: $X \in \mathbb{R}^2$, $Y \in \{0, 1\}$, and Q is the set of two axis-aligned half-space queries. Each half-space on its own has zero mutual information with Y. However, upon choosing either one as $q_1$, the other half-space suddenly becomes informative about Y.

A.5 PROOF OF LEMMA A.4.1

Lemma. Assume $Y \perp\!\!\!\perp q(X) \mid X$ for all $q \in Q$. If the event $B_{x,L}$ satisfies the condition specified by (14), then for all subsequent queries $q_m$, $m \ge L$, $\max_{q \in Q} I(q(X); Y \mid B_{x,m}) = 0$. Since ties are broken arbitrarily, $E^{IP}$ chooses any $q \in Q$ as a subsequent query $q_m$.

Proof. Recall that each query q partitions the set $\mathcal{X}$, and $B_{x,L}$ is the event $\{x' \in \mathcal{X} \mid \{q_i, q_i(x')\}_{1:L} = \{q_i, q_i(x)\}_{1:L}\}$. It is easy to see that if $B_{x,L}$ satisfies the condition specified by (14), then

$$P(y \mid B_{x,m}) = P(y \mid x') \quad \forall x' \in B_{x,m},\ \forall m \ge L,\ \forall q \in Q \quad (17)$$

This is because subsequent query-answers partition a set in which all the data points have the same posterior distribution. Now, for all $q \in Q$, $a \in \mathrm{Range}(q)$ and $y \in \mathcal{Y}$,

$$p(q(X) = a, y \mid B_{x,m}) = p(q(X) = a \mid B_{x,m})\, p(y \mid q(X) = a, B_{x,m}) \quad (18)$$

which is just an application of the chain rule of probability; the randomness in $q(X)$ is entirely due to the randomness in X. For any $a \in \mathrm{Range}(q)$ and $y \in \mathcal{Y}$,

$$\begin{aligned} p(y \mid q(X) = a, B_{x,m}) &= \sum_{x' \in B_{x,m} \cap \{x'' \in \mathcal{X} \mid q(x'') = a\}} p(y, x' \mid a, B_{x,m}) \\ &= \sum_{x'} p(y \mid x', a, B_{x,m})\, p(x' \mid a, B_{x,m}) \\ &= \sum_{x'} p(y \mid x')\, p(x' \mid a, B_{x,m}) \\ &= p(y \mid B_{x,m}) \sum_{x'} p(x' \mid a, B_{x,m}) = p(y \mid B_{x,m}) \end{aligned} \quad (19)$$

The first equality is an application of the law of total probability, the third is due to the conditional independence of the history and the hypothesis given $X = x'$ (the assumption), and the fourth follows by invoking (17). Substituting (19) into (18), we obtain $Y \perp\!\!\!\perp q(X) \mid B_{x,m}$ for all $m \ge L$ and $q \in Q$.
This implies that for all subsequent queries $q_m$, $m > L$, $\max_{q \in Q} I(q(X); Y \mid B_{x,m}) = 0$, which completes the proof.

A.6 PROOF OF PROPOSITION 3

Proposition. Given a task $T_{X,Y}$ with Y a discrete random variable, if there exists a function f such that $Y = f(X)$, and Q is the index set of all possible binary functions of X such that $H(q(X) \mid Y) = 0$ for all $q \in Q$, then $H(Y) \le \tilde{C}_Q(X; Y) \le H(Y) + 1$.

We make two remarks before turning to the proof.

Remark 1: The task is to determine the true state of a latent variable $Y \in \mathcal{Y}$ based on querying an observed data point $x_0$. We assume $Y = f(X)$ with f unknown. Were Y observable, the natural queries would be indexed by subsets $D \subset \mathcal{Y}$, one query for every such D, namely $q(Y) = 1$ if $Y \in D$ and 0 otherwise. (This is essentially the classic "twenty questions" game, but with an "oracle" and "complete tests".) There are $2^{|\mathcal{Y}|}$ such queries, and obviously they collectively determine Y. Now, since $Y = f(X)$, these queries are, at least implicitly, functions of the data X, but we need realizable functions that do not require knowledge of f. So our fundamental assumption is that for each subset $D \subset \mathcal{Y}$, the corresponding subset $D' = f^{-1}(D) \subset \mathcal{X}$ can be checked for inclusion of X, i.e., $Y \in D$ if and only if $X \in D'$; or, what is the same, there is a binary query q(X) (still denoted q for simplicity) with $q(X) = q(Y)$. In effect, we are assuming that whereas we cannot determine Y directly from X, we can answer simple binary queries which determine Y and can be expressed as observable data features.

Remark 2: The sequence of queries $q_1, q_2, \ldots$ generated by the IP algorithm for a particular data point can be seen as one branch, root to leaf, of a decision tree constructed by the standard machine learning strategy based on successive reduction of uncertainty as measured by entropy: $q_1 = \arg\max_{q \in Q} I(q(X); Y)$ and $q_{k+1} = \arg\max_{q \in Q} I(q(X); Y \mid B_{x_0,k})$, where $B_{x_0,k}$ is the event that the answers to the first k questions agree with those for $x_0$.
We stop as soon as Y is determined. Whereas a decision tree accommodates all x simultaneously, the questions along the branch depend on a particular, fixed data point; but the learning problem in the branch version ("active testing") is exponentially simpler.

Proof of Proposition 3: The lower bound $H(Y) \le \tilde{C}_Q(X; Y)$ comes from Shannon's source coding theorem for symbol codes. For the upper bound, since $I(q(X); Y \mid B_{x_0,k}) = H(q(X) \mid B_{x_0,k}) - H(q(X) \mid Y, B_{x_0,k})$, and since Y determines q(Y) and hence also q(X), the second entropy term is zero (because $H(q(X) \mid Y) = 0$ by assumption). So our problem is to maximize the conditional entropy of the binary random variable q(X) given $B_{x_0,k}$, and the IP algorithm is clearly just "divide and conquer": $q_1 = \arg\max_{q \in Q} H(q(X))$, $q_{k+1} = \arg\max_{q \in Q} H(q(X) \mid B_{x_0,k})$. Equivalently, since the entropy of a binary random variable $\rho$ is maximized when $P(\rho = 1) = \frac{1}{2}$,

$$q_{k+1} = \arg\min_{q \in Q} \left| P(q(X) = 1 \mid B_{x_0,k}) - \tfrac{1}{2} \right|$$

Let $A_k \subseteq \mathcal{Y}$ be the set of "active hypotheses" after k queries, namely those y with positive posterior probability $P(Y = y \mid B_{x_0,k}) > 0$. Indeed,

$$P(Y = y \mid B_{x_0,k}) = \frac{P(B_{x_0,k} \mid Y = y)\, p(y)}{\sum_{y'} P(B_{x_0,k} \mid Y = y')\, p(y')} = \frac{p(y)}{\sum_{y' \in A_k} p(y')} \quad \text{for } y \in A_k,$$

since $P(B_{x_0,k} \mid Y = y) = 1$ if $y \in A_k$ and 0 if $y \notin A_k$. In particular, the classes in the active set have the same relative weights as at the outset. In summary:

$$p(y \mid B_{x_0,k}) = \begin{cases} p(y) \big/ \sum_{l \in A_k} p(l), & y \in A_k \\ 0, & \text{otherwise} \end{cases} \quad (22)$$

The key observation to prove the theorem is that if a hypothesis y generates the same answers to the first m or more questions as $y_0$, and hence is active at step m, then its prior likelihood p(y) is at most $2^{-(m-1)}$, $m = 1, 2, \ldots$.
This is intuitively clear: if y has the same answer as $y_0$ on the first question and $p(y_0) > \frac{1}{2}$, then only one question is needed and the active set is empty at step two; if $q_1(y) = q_1(y_0)$, $q_2(y) = q_2(y_0)$ and $p(y_0) > \frac{1}{4}$, then only two questions are needed and the active set is empty at step three; and so on.

$B_{x,0} = \emptyset$ (since no evidence via queries has yet been gathered from x), and so $p(y \mid B_{x,0}) = p_Y(y)$. $p(y \mid B_{x,k})$ is obtained by normalizing the last equation in (22) so that $\sum_y p(y \mid B_{x,k}) = 1$. $p(q_k(x) \mid y, B_{x,k-1})$ can be estimated using (9).

A.8 TRAINING THE CONDITIONAL INFERENCE NETWORK

A normalizing flow is a sequence of invertible transformations that takes a random variable from a simple source distribution (say, uniform or Gaussian) to an arbitrarily complex multi-modal target distribution. These invertible transformations are parameterized by deep neural networks, which can express a richer family of distributions than the Gaussian/uniform family. Specifically, we construct a sequence of invertible mappings of the form $\eta_l = \Psi_l(\{(q_i, q_i(x))\}_{1:k}, y, \eta_{l+1}; \gamma_l)$, each parameterized by a neural network with weights $\gamma_l$, such that $\eta_0$ is constrained to be normally distributed. The composite function $\eta_0 = \Psi(\{(q_i, q_i(x))\}_{1:k}, y, \eta; \gamma)$ is a composition of N neural networks with parameters $\gamma = \{\gamma_l\}_{l \in \{1, 2, \ldots, N\}}$. We call this network $\Psi$ the Conditional Inference Network (refer to Fig. 2 for a pictorial depiction). By the change-of-variables formula for probability densities, $p(\eta \mid y, B_{x,k})$ can be written as

$$p_\eta(\eta \mid y, B_{x,k}) = p_{\eta_0}\big(\Psi(\{(q_i, q_i(x))\}_{1:k}, y, \eta; \gamma)\big)\, \big|\nabla_\eta \Psi(\{(q_i, q_i(x))\}_{1:k}, y, \eta; \gamma)\big| \quad (23)$$

To ensure that $\Psi$ is invertible and the determinant in (23) is efficiently computable, the family of functions used is often constrained to those that admit an upper/lower triangular Jacobian. The normalizing flow model employed here is the realNVP model introduced in Dinh et al. (2016). For training, we construct a dataset $D^*$ from the given dataset D (of N i.i.d. samples $\{x_i, y_i\} \sim p_{XY}(x, y)$) in the following manner:

1. Since we assumed Q is finite, fix an enumeration. For every (x, y) in D, evaluate all the queries $q \in Q$ and obtain the sample (Q(x), y). Sample $k \sim p_K(k)$, then sample k random positions in Q(x) as $m_k \sim p_M(m_k)$. Here $p_K(k)$ and $p_M(m_k)$ are user-defined distributions; in our applications, $p_K(k)$ is taken to be Poisson($\lambda = 10$) and $p_M(m_k) = \mathcal{U}\{1, 2, \ldots, |Q|\}$.

2. Sample $\eta \sim q_\phi(\eta \mid y, Q(x))$ using the trained Encoder network.
3. Mask Q(x) according to the positions in $m_k$ to obtain the k-length sequence $\{(q, q(x))\}_{q \in \tilde{Q} \subseteq Q}$.

In this way, we obtain samples from the desired joint $p(K, B_K(X), Y, \eta)$, where $B_K(X)$ is a K-length random sequence $\{(q, q(X)) : q \in \tilde{Q}\}$ for some $\tilde{Q} \subseteq Q$. The weights $\gamma$ are then learnt using stochastic approximation by optimizing the following objective over $D^*$:

$$\max_\gamma \mathbb{E}_{K, B_K(X), Y, \eta}\big[\log p_\eta(\eta \mid Y, B_K(X))\big] = \max_\gamma \mathbb{E}_{K, B_K(X), Y, \eta}\big[\log p_{\eta_0}\big(\Psi(B_K(X), Y, \eta; \gamma)\big) + \log \big|\nabla_\eta \Psi(B_K(X), Y, \eta; \gamma)\big|\big] \quad (24)$$

The second equality is obtained by substituting (23) into (24), with $p(\eta_0) = \mathcal{N}(0, I_d)$. To understand the objective in (24), consider general random variables $\xi$ and $\psi$; let our proposal distribution be $\hat{p}(\psi \mid \xi)$ and the true distribution be $p(\psi \mid \xi)$. The first term in (26) disappears since it does not depend on $\hat{p}(\psi \mid \xi)$. Comparing (27) and (24), substitute $\psi := \eta$ and $\xi := (K, B_K(X), Y)$; the proposal $\hat{p}$ is parameterized by the Conditional Inference Network as $p_\eta(\eta \mid y, B_{x,k})$. The issue is that we do not have access to the true distribution $p(\psi \mid \xi)$ to generate samples for a training set and learn an optimal $\hat{p}(\psi \mid \xi)$ for each $\Xi = \xi$. However, we do have access to the joint $p(\xi, \psi) := p(k, B_k(x), y, \eta)$, from which we can generate the data $D^*$ for optimizing (27) using stochastic approximation.

A.9 NETWORK ARCHITECTURES AND TRAINING PROCEDURE

A.9.1 ENCODER AND DECODER NETWORKS

For the Information Pursuit generative model in subsection 3.2, we implemented a β-VAE as introduced in Higgins et al. (2017). The encoder-decoder architecture used is depicted in Fig. 4. Notation for Fig. 4:

• L(x, y): Linear layer with x input units and y output units, followed by a BatchNorm layer and leakyReLU activation.
• C x: Convolution layer with x 3×3 filters, followed by a BatchNorm layer and leakyReLU activation.
• C x M: Same as C x, followed by a Maxpool layer.
• DC x y z: Transposed convolution layer with x y×y filters and stride z, followed by a BatchNorm layer and leakyReLU activation.
The slope for the leakyReLU activation was taken to be 0.3. BatchNorm and the leakyReLU activation were not applied to the output layer of the encoder network (Fig. 4a). BatchNorm was not applied, and the leakyReLU activation was replaced by a Sigmoid, in the output layer of the decoder network (Fig. 4b).

Training. The β-VAE was trained by optimizing the Evidence Lower BOund (ELBO) objective

$$\max_{\omega,\phi} \sum_{i=1}^n \Big[\mathbb{E}_{\eta \sim p_\phi(\eta \mid y^{(i)}, x^{(i)})}\big[\log p_\omega(x^{(i)} \mid \eta, y^{(i)})\big] - \beta\, KL\big(p_\phi(\eta \mid y^{(i)}, x^{(i)}) \,\|\, p(\eta)\big)\Big]$$

using ADAM with learning rate 0.001 and momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$. β was taken as 4.0 for all experiments. The prior over latents $p(\eta)$ is taken as $\mathcal{N}(0, I_{100})$, and $p_Y(y)$ is estimated from the empirical distribution of the outputs in the training data. Data augmentation was performed on MNIST and its variants via elastic deformations; no data augmentation was done for the other datasets. The network was trained for 200 epochs.

In Fig. 5, for our use-case with binary images and patch queries, $\{q_i, q_i(x)\}_{1:k}$ represents a masked image (the sequence of patches observed), y denotes the class label, η are the nuisances distributed according to the complex target distribution, and $\eta_0$ are the transformed random variables distributed according to a uni-modal standard Gaussian distribution. Each triplet of (Actnorm layer, Permute layer, Affine layer) forms one layer of the flow network; the overall network is 25 layers deep. Each layer represents the function $\eta_l = \Psi_l(\{q_i, q_i(x)\}_{1:k}, y, \eta_{l+1}; \gamma_l)$ with weights $\gamma_l$, and the weights of the flow network are denoted $\gamma = \{\gamma_l\}_{l \in \{0, 1, \ldots, 24\}}$. In what follows, we describe in detail each of the layers in a triplet; • denotes element-wise multiplication throughout.

Actnorm Layer: applies an element-wise affine normalization to $\eta_l$, conditioned on $\{q_i, q_i(x)\}_{1:k}$ and y (see Fig. 6), and outputs $\eta^{actnorm}_{l-1}$.

Permute Layer: $\eta^{permute}_{l-1} = P\, \eta^{actnorm}_{l-1}$, where P is a permutation matrix which swaps the first 50 dimensions of $\eta^{actnorm}_{l-1}$ with the next 50 dimensions. Recall that $\eta_l \in \mathbb{R}^{100}$ for all $l \in \{0, 1, \ldots, 25\}$.

Affine Layer: This layer implements the following operations.
$$h^1_{l-1},\ h^2_{l-1} = \mathrm{split}(\eta^{permute}_{l-1})$$
$$s,\ t = NN_{affine}(h^1_{l-1}, B_k(x), y) \quad (31)$$
$$\alpha^1_{l-1} = h^1_{l-1}, \qquad \alpha^2_{l-1} = h^2_{l-1} \bullet s + (1 - s) \bullet t \quad (32)$$

where • denotes element-wise multiplication.

Training. The dataset for optimizing (24) was generated as outlined in Section A.8 (denoted $D^*$). The Conditional Inference Network was trained by optimizing this objective using ADAM with learning rate 0.001 and momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$. L2 regularization was added to stabilize training and prevent gradients from exploding (a common problem when training normalizing flow networks), and a schedule was used for the regularization constant: we optimized (24) with L2 regularization constant $10^{12}$, $10^{8}$, and $10^{2}$ for 5 epochs each. Finally, the L2 regularization was relaxed (the regularization constant was set to 0) and the network was trained for 500 epochs.
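To illustrate the coupling structure in (31)-(32) and why it is trivially invertible, here is a minimal NumPy sketch of one conditional affine coupling step (a toy in which a single linear map stands in for $NN_{affine}$; all names and shapes are our own, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def affine_coupling_forward(eta, context, W, b):
    """One conditional affine coupling step: the first half of eta passes
    through unchanged and, together with the context (masked image + label),
    parameterizes (s, t); the second half is transformed as in (32)."""
    h1, h2 = np.split(eta, 2)
    pre = np.concatenate([h1, context]) @ W + b   # toy stand-in for NN_affine
    s_raw, t = np.split(pre, 2)
    s = 1.0 / (1.0 + np.exp(-s_raw))              # sigmoid keeps the scale nonzero
    out2 = h2 * s + (1.0 - s) * t                 # eq. (32)
    return np.concatenate([h1, out2])

def affine_coupling_inverse(out, context, W, b):
    """Exact inverse: h1 is untouched, so (s, t) can be recomputed and
    the affine map on the second half solved in closed form."""
    h1, out2 = np.split(out, 2)
    pre = np.concatenate([h1, context]) @ W + b
    s_raw, t = np.split(pre, 2)
    s = 1.0 / (1.0 + np.exp(-s_raw))
    h2 = (out2 - (1.0 - s) * t) / s
    return np.concatenate([h1, h2])

# invertibility check: 100-dim latents as in the paper, toy 16-dim context
d, c = 100, 16
W = rng.normal(size=(d // 2 + c, d)) * 0.1
b = rng.normal(size=d) * 0.1
eta, ctx = rng.normal(size=d), rng.normal(size=c)
out = affine_coupling_forward(eta, ctx, W, b)
assert np.allclose(affine_coupling_inverse(out, ctx, W, b), eta)
```

Because $h^1_{l-1}$ is copied through unchanged, the inverse can recompute (s, t) exactly and undo the affine map, which is what makes the composed 25-layer network invertible as required by (23).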



As commonly used, we denote random variables by capital letters and their realizations by small letters. Recall that L is the iteration after which IP terminates. We refer to the distribution $p(y \mid x)$ for any $x \in \mathcal{X}$ as the posterior distribution of x.



Figure 2: Conditional Inference Network.

Figure 3: Plots (a) & (b) show the trade-off between accuracy and complexity as $\epsilon$ is varied; (c), (d), (e) & (f) pertain to our discussion on interpretability. (a) Complexity results on MNIST with different levels of nuisances; (b) complexity results for different image classification tasks; (c) test image $x_0$ from the Caltech Silhouettes dataset with class "Motorbike"; (d) patches queried by $E^{IP}$ before termination for $x_0$ (shown by the overlaid coloured 3×3 boxes); (e) the part of image $x_0$ observed through the queried patches; (f) heatmap of the probability that a pixel is visited by the IP Encoder for a randomly chosen image with label "Motorbike" from the Caltech Silhouettes dataset.

Consider the loss function

$$KL\big(p(\psi \mid \xi)\,\|\,\hat{p}(\psi \mid \xi)\big) = \int p(\psi \mid \xi) \log p(\psi \mid \xi)\, d\psi - \int p(\psi \mid \xi) \log \hat{p}(\psi \mid \xi)\, d\psi \quad (25)$$

Since (25) should hold for all $\xi$, we take an expectation over $\xi$:

$$\begin{aligned} \mathbb{E}_\xi\big[KL\big(p(\psi \mid \xi)\,\|\,\hat{p}(\psi \mid \xi)\big)\big] &= \iint p(\xi)\, p(\psi \mid \xi) \log p(\psi \mid \xi)\, d\psi\, d\xi - \iint p(\xi)\, p(\psi \mid \xi) \log \hat{p}(\psi \mid \xi)\, d\psi\, d\xi \\ &= \iint p(\xi, \psi) \log p(\psi \mid \xi)\, d\psi\, d\xi - \iint p(\xi, \psi) \log \hat{p}(\psi \mid \xi)\, d\psi\, d\xi \end{aligned} \quad (26)$$
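Dropping the first term of (26), which does not depend on the proposal $\hat{p}$, minimizing the expected KL is equivalent to maximizing the expected log-likelihood under the joint, which is presumably the objective labeled (27) in the text:

```latex
\min_{\hat p}\; \mathbb{E}_{\xi}\!\left[KL\big(p(\psi\mid\xi)\,\|\,\hat p(\psi\mid\xi)\big)\right]
\;\Longleftrightarrow\;
\max_{\hat p}\; \iint p(\xi,\psi)\,\log \hat p(\psi\mid\xi)\, d\psi\, d\xi
\;=\;
\max_{\hat p}\; \mathbb{E}_{\xi,\psi}\big[\log \hat p(\psi\mid\xi)\big] \tag{27}
```

Substituting $\psi := \eta$ and $\xi := (K, B_K(X), Y)$ then recovers the training objective in (24).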

Figure 4: The encoder-decoder architecture used in the β-VAE network. Recall from section 3.2 that $X_j$ are the image pixels, Y is the class label and Z are the nuisances. (a) Encoder network: takes the image $\{X_j\}_{j \in \{1, 2, \ldots, 28 \times 28\}}$ and class label Y as inputs and predicts the mean μ and diagonal covariance matrix Σ of the nuisances Z. The weights of this network are denoted as φ in the main text. (b) Decoder network: takes the nuisances Z and class label Y as inputs and predicts the Bernoulli parameter of each pixel $X_j$ in the image. The weights of this network are denoted as ω in the main text. Best viewed in colour.

Figure 5: Overall architecture of the Conditional Inference Network, based on RealNVP normalizing flows. The network takes $\{q_i, q_i(x)\}_{1:k}$, Y and η as inputs and performs 25 transformations $\eta \to \eta_{24} \to \eta_{23} \to \ldots \to \eta_0$. Each transformation is referred to as $\Psi_l := DNN(\{q_i, q_i(x)\}_{1:k}, y, \eta_{l+1}; \gamma_l)$ in Fig. 2.

Figure 6: The Actnorm layer used in each layer of the Conditional Inference Network. This layer takes $\eta_l$, $\{q_i, q_i(x)\}_{1:k}$ and y as inputs and outputs $\eta^{actnorm}_{l-1}$.

Fig. 7 depicts the network architecture used for $NN_{affine}$ in (31).

Figure 7: The $NN_{affine}$ used in each Affine layer (31) of the Conditional Inference Network. This layer takes $h^1_{l-1}$, $\{q_i, q_i(x)\}_{1:k}$ and y as inputs and outputs the scale vector s and shift vector t. Notice that each of the layers is an invertible function, and hence their composition (the 25-layer deep flow network) is also invertible, which is a key requirement for (23).


Finally, since the code length takes values in the non-negative integers $\{0, 1, \ldots\}$, the upper bound $\tilde{C}_Q(X; Y) \le H(Y) + 1$ follows, as for integer-length symbol codes.

Once the Decoder network has been learnt using variational inference, the first question $q_1$ can be calculated as per (6), since the mutual information is completely determined by $p(q(x), y)$, which is obtained by numerically marginalizing the nuisances Z from (8) using Monte Carlo integration. In the last approximation, $p_\omega(q(x) \mid y, \eta^{(i)})$ is the distribution obtained using the trained Decoder network, N is the number of i.i.d. samples drawn, and $\eta^{(i)} \sim p_Z(\eta)$.
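The Monte Carlo marginalization just described can be sketched as follows (a toy illustration for a binary query; `decoder_prob(y, eta)` is a hypothetical stand-in for the trained decoder's $p_\omega(q(x) = 1 \mid y, \eta)$, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_query_label_joint(decoder_prob, p_y, n_samples=1000):
    """Monte Carlo estimate of the joint p(q(X)=a, y) for a binary query:
    marginalize the nuisance eta out of p_w(q(x)=1 | y, eta) with
    eta^(i) ~ N(0, I_100), then multiply by the label prior p(y)."""
    joint = {}
    for y, prior in p_y.items():
        etas = rng.normal(size=(n_samples, 100))
        p1 = float(np.mean([decoder_prob(y, eta) for eta in etas]))
        joint[(1, y)] = p1 * prior
        joint[(0, y)] = (1.0 - p1) * prior
    return joint

def mutual_information(joint):
    """I(q(X); Y) in nats, computed from the estimated joint table."""
    pa = {a: sum(p for (aa, _), p in joint.items() if aa == a) for a in (0, 1)}
    py = {}
    for (_, y), p in joint.items():
        py[y] = py.get(y, 0.0) + p
    return sum(p * np.log(p / (pa[a] * py[y]))
               for (a, y), p in joint.items() if p > 0)
```

With the joint table in hand, $I(q(X); Y)$ in (6) is a finite sum over query-answer/label pairs, and the query maximizing it is selected as $q_1$.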

A.7.2 DERIVATION FOR (9)

The first equality is an application of the law of total probability. The last equality appeals to the assumption that $\{q(X)\}_{q \in Q}$ are conditionally independent given Y and Z (see (8)).

