QUANTIFYING TASK COMPLEXITY THROUGH GENERALIZED INFORMATION MEASURES

Abstract

How can we measure the "complexity" of a learning task so that we can compare one task to another? From classical information theory, we know that entropy is a useful measure of the complexity of a random variable and provides a lower bound on the minimum expected number of bits needed to transmit its state. In this paper, we propose to measure the complexity of a learning task by the minimum expected number of questions that need to be answered to solve the task, for example, the minimum expected number of patches that need to be observed to classify FashionMNIST images. We prove several properties of the proposed complexity measure, including connections with classical entropy and sub-additivity for multiple tasks. As the computation of the minimum expected number of questions is generally intractable, we propose a greedy procedure called "information pursuit" (IP), which selects one question at a time depending on previous questions and their answers. This requires learning a probabilistic generative model relating data and questions to the task, for which we employ variational autoencoders and normalizing flows. We illustrate the usefulness of the proposed measure on various binary image classification tasks using image patches as the query set. Our results indicate that the complexity of a classification task increases as the signal-to-noise ratio decreases, and that classification of the KMNIST dataset is more complex than classification of the FashionMNIST dataset. As a byproduct of choosing patches as queries, our approach also provides a principled way of determining which pixels in an image are most informative for a task.

1. INTRODUCTION

Deep networks have shown remarkable progress in both simple and complex machine learning tasks. But how does one measure the "complexity" of a learning task? Is it possible to ascertain in a principled manner which tasks are "harder" to solve than others? How "close" is one task to another? Answers to these questions would have implications for many fields of machine learning, such as transfer learning, multi-task learning, un/semi/self-supervised learning, and domain adaptation. In classical information theory, the entropy of a random variable X is a useful measure of complexity for tasks such as compression and transmission, which essentially require reconstructing X. However, the entropy of X is insufficient for measuring the complexity of a supervised learning task T_{X,Y}, where the goal is to predict an output Y from an input X, i.e., to estimate the conditional p_{Y|X}(y|x) from a finite set of samples from p_{XY}(x,y), which we refer to as solving the learning task. Complexity measures provided by statistical learning theory, such as the VC dimension or covering numbers, are also inadequate for this purpose because they ignore the dependence between X and Y for the particular task at hand. Information-theoretic measures such as mutual information, the information bottleneck (Tishby et al., 2000), and its variants (Strouse & Schwab, 2017) have been used to study the trade-off between model complexity and accuracy, but they were not developed to assess task complexity and can give unsatisfactory results when comparing different tasks (see Section 5 for details). Measures based on Kolmogorov complexity (Li, 2006; Vereshchagin & Vitányi, 2004) could in principle be used to compare different tasks, but they are sensitive to dataset permutations and not easily computable.
The work of Achille et al. (2019a) proposes to quantify task complexity by measuring the information stored in the network weights, but the approach depends on the specific neural network architecture used for training. The work of Tran et al. (2019) does not require or assume trained models, but makes strict assumptions that limit its broad applicability. In this work, we introduce a novel perspective on task complexity that generalizes classical measures from information theory. Specifically, one well-known interpretation of classical Shannon entropy is the following: given a random variable X, find the minimum number of bits needed on average to encode instances of X so that each instance can be perfectly recovered from its binary code. Stated differently, if one lets Q be the set of all possible binary functions on the domain of X, then Shannon entropy essentially asks for the optimal sequence of queries {q_1(X), q_2(X), ... : q_i ∈ Q} (i.e., how to encode X as a binary string) so that X can be perfectly recovered from the shortest (on average) sequence of binary answers to the queries (see Section 2 for more discussion of this interpretation). As discussed above, however, in most learning tasks we are not interested in simply compressing X but rather in making a prediction about some other variable Y. Further, notions of complexity can be made more relevant to a specific task by taking Q to be not the set of all possible binary functions on X, but a smaller set of queries specific to a measure of interest. From this intuition, we define the complexity of a learning task as the minimum expected number of queries, selected from a fixed set Q, that one needs to ask in order to predict Y (to some user-defined level of confidence) from the answers to those queries.
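The classical interpretation above can be checked numerically: for a discrete source, Huffman coding answers binary questions (one per code bit) and attains an expected question count L with H(X) ≤ L < H(X) + 1. A minimal sketch, using an assumed toy distribution chosen for illustration:

```python
import heapq
import itertools
import math

# Toy source distribution (dyadic, so Huffman meets the entropy bound exactly).
p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

# Shannon entropy: lower bound on the expected number of binary questions.
H = -sum(px * math.log2(px) for px in p.values())

# Build a Huffman code; each bit of a codeword is the answer to one yes/no query.
tie = itertools.count()  # tiebreaker so the heap never compares dicts
heap = [(px, next(tie), {s: ''}) for s, px in p.items()]
heapq.heapify(heap)
while len(heap) > 1:
    p0, _, c0 = heapq.heappop(heap)
    p1, _, c1 = heapq.heappop(heap)
    merged = {s: '0' + w for s, w in c0.items()}
    merged.update({s: '1' + w for s, w in c1.items()})
    heapq.heappush(heap, (p0 + p1, next(tie), merged))
code = heap[0][2]

# Expected codeword length = expected number of binary answers needed to recover X.
L = sum(p[s] * len(w) for s, w in code.items())
```

For this distribution H = 1.75 bits and the Huffman code assigns lengths (1, 2, 3, 3), so L = H exactly; for non-dyadic distributions L exceeds H by less than one question.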
As a few specific examples of potential query sets:
• Decision boundary complexity: Here, Q is the set of all half-spaces in R^d (assuming X ∈ R^d), and each q ∈ Q is a binary function indicating whether x lies in a particular half-space. Task complexity then formalizes the intuitive "level of non-linearity" of the decision boundary. For example, the complexity of any linearly separable binary classification task is 1, whereas for a non-linearly separable task this value depends on the curvature of the decision boundary.
• Task-specific input feature complexity: Here, Q is the set of coordinate projection functions of X, with each q of the form q(x) = x_q, where x_q is the value of the q-th entry of x. Task complexity then formalizes the intuition that the greater the redundancy among the input features, the easier the task is to solve. For example, Y being a constant function of X would be the simplest task, with complexity 0, since no feature needs to be queried to predict it. This notion of complexity would help answer questions such as "which input features are most important for solving a given task?" and could in turn help in developing more "interpretable" learning algorithms.
• Visual semantic complexity: Given a vocabulary V of possible entities, their attributes, and their relations in a visual scene, Q could be the set of all binary functions indicating the presence or absence of an entity, an attribute, or a relation with other entities (supplied by V) in a designated region of the image X. For example, a particular q could implement the query "Is there a person in the top left corner of the image?". This notion of complexity would allow one to gauge the semantic complexity of a visual task. For instance, tasks posing complex questions like "Is there a person playing with his dog, next to a river, in the image?" would inherently be more complex than simple object detection tasks such as "Where is the dog in this image?", and this difference could be quantified with semantically relevant queries.
While our proposed formal definition of task complexity applies to all such choices of query functions {q}_{q ∈ Q} and enjoys several nice theoretical properties that we discuss in Section 2, its computation is generally intractable. As a result, in Section 3 we propose to reduce the cost of selecting a minimal set of questions by using the Information Pursuit (IP) algorithm, which selects questions sequentially, in order of information gain, with each choice depending on the previous questions and their answers. While IP is applicable to any task and query set, its implementation can still be intractable depending on the complexity of the model p(X, Y) and of the set Q. To address this issue, we posit a tractable graphical model for p(X, Y) and learn the required distributions using variational autoencoders and normalizing flows. In Section 4 we evaluate our approach on various binary image classification tasks (MNIST, KMNIST, FashionMNIST, Caltech Silhouettes) that can be tackled using a common set of queries (the set of image patches). Our results show that complexity computed using patch queries aligns with the intuition that the complexity of a classification task increases as the signal-to-noise ratio decreases, and that classification of the KMNIST dataset is more complex than classification of the FashionMNIST dataset, something that is not obvious a priori. While these experiments are restricted to simple tasks and queries, the proposed framework is generally applicable provided that tractable models, inference, and learning methods can be developed, which is the subject of ongoing and future work.
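The greedy selection rule at the heart of IP can be illustrated on a tiny discrete model. The sketch below is a hypothetical toy example, not the paper's VAE/normalizing-flow implementation: X is a uniform 3-bit string, Y = X_1 AND X_2 (the third bit is irrelevant), and the query set is the input-feature set, each query revealing one coordinate. At every step IP asks the coordinate whose answer maximally reduces the posterior entropy of Y, stopping once Y is determined; averaging the number of questions over X gives the task complexity under this query set.

```python
import itertools
import math

XS = list(itertools.product([0, 1], repeat=3))  # uniform 3-bit inputs

def label(x):
    return x[0] & x[1]  # Y = X_1 AND X_2; bit 3 carries no information about Y

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def h_Y(consistent):
    """Posterior entropy of Y given the set of x's consistent with the answers."""
    counts = {}
    for x in consistent:
        counts[label(x)] = counts.get(label(x), 0) + 1
    return entropy(counts)

def ip_num_queries(x_true):
    """Greedy information pursuit: ask coordinates in order of information gain."""
    asked = set()
    consistent = XS[:]
    n = 0
    while h_Y(consistent) > 0:  # stop once Y is determined
        best_q, best_gain = None, -1.0
        for q in range(3):
            if q in asked:
                continue
            # Expected reduction in H(Y) from observing coordinate q.
            gain = h_Y(consistent)
            for a in (0, 1):
                sub = [x for x in consistent if x[q] == a]
                if sub:
                    gain -= len(sub) / len(consistent) * h_Y(sub)
            if gain > best_gain:
                best_q, best_gain = q, gain
        asked.add(best_q)
        consistent = [x for x in consistent if x[best_q] == x_true[best_q]]
        n += 1
    return n

# Task complexity: expected number of queries under the uniform distribution on X.
complexity = sum(ip_num_queries(x) for x in XS) / len(XS)
```

In this toy task, whenever the first queried bit is 0 the label is settled after one question, and otherwise a second question is needed, so the complexity is 1.5; the irrelevant third bit is never queried, which is the same mechanism by which patch queries identify the informative pixels of an image.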
Finally, we note that, to the best of our knowledge, this is the first time a subjective notion of task complexity has been proposed in the literature, one in which the user can incorporate in Q the perception of complexity they wish to measure.

