LEARNING TO ACTIVELY LEARN: A ROBUST APPROACH

Abstract

This work proposes a procedure for designing algorithms for specific adaptive data collection tasks like active learning and pure-exploration multi-armed bandits. Unlike traditional adaptive algorithms, whose design relies on concentration of measure and careful analysis to justify correctness and sample complexity, our adaptive algorithm is learned via adversarial training over equivalence classes of problems derived from information-theoretic lower bounds. In particular, a single adaptive learning algorithm is learned that competes with the best adaptive algorithm learned for each equivalence class. Our procedure takes as input just the available queries, the set of hypotheses, the loss function, and the total query budget. This contrasts with existing meta-learning work, which learns an adaptive algorithm relative to an explicit, user-defined subset of or prior distribution over problems that can be challenging to define and may be mismatched with the instance encountered at test time. This work is particularly focused on the regime where the total query budget is very small, such as a few dozen, which is much smaller than the budgets typically considered by theoretically derived algorithms. We perform synthetic experiments to justify the stability and effectiveness of the training procedure, and then evaluate the method on tasks derived from real data, including a noisy 20 Questions game and a joke recommendation task.

1. INTRODUCTION

Closed-loop learning algorithms use previous observations to inform what measurements to take next in order to accomplish inference tasks far faster than any fixed measurement plan set in advance. For example, active learning algorithms for binary classification have been proposed that, under favorable conditions, require exponentially fewer labels than passive, random sampling to identify the optimal classifier (Hanneke et al., 2014). And in the multi-armed bandits literature, adaptive sampling techniques have demonstrated the ability to identify the "best arm" that optimizes some metric with far fewer experiments than a fixed design (Garivier & Kaufmann, 2016; Fiez et al., 2019). Unfortunately, such guarantees often either require simplifying assumptions that limit robustness and applicability, or appeal to concentration inequalities that are very loose unless the number of samples is very large (e.g., web-scale). The aim of this work is a framework that achieves the best of both worlds: algorithms that learn through simulated experience to be as effective as possible with a tiny measurement budget (e.g., 20 queries), while remaining robust due to adversarial training. Our work fits into a recent trend sometimes referred to as learning to actively learn (Konyushkova et al., 2017; Bachman et al., 2017; Fang et al., 2017; Boutilier et al., 2020; Kveton et al., 2020), which tunes existing algorithms or learns entirely new active learning algorithms by policy optimization. Previous works in this area learn a policy by optimizing with respect to data observed through prior experience (e.g., meta-learning or transfer learning) or an assumed explicit prior distribution over problem parameters (e.g., the true weight vector for linear regression). In contrast, our approach makes no assumptions about which parameters are likely to be encountered at test time, and therefore produces algorithms that do not suffer from a potential mismatch of priors.
Instead, our method learns a policy that attempts to mirror the guarantees of frequentist algorithms with instance-dependent sample complexities: if the problem is hard you will suffer a large loss; if it is easy you will suffer little. The learning framework is general enough to be applied to many active learning settings of interest and is intended to produce novel, robust, high-performing algorithms. The difference from the frequentist approach is that instead of hand-crafting hard instances that witness the difficulty of the problem, we use adversarial training inspired by the robust reinforcement learning literature to automatically train minimax policies. Embracing the use of a simulator allows our learned policies to be very aggressive while maintaining robustness. Indeed, this work is particularly useful in the setting where relatively few rounds of querying can be made and the concentration inequalities of existing algorithms are vacuous. To demonstrate the efficacy of our approach we implement the framework for the (transductive) linear bandit problem. This paradigm includes pure-exploration combinatorial bandits (e.g., shortest path, matchings) as a special case, which itself reduces to active binary classification. We empirically validate our framework on a simple synthetic experiment before turning our attention to datasets derived from real data, including a noisy 20 Questions game and a joke recommendation task.

2. PROPOSED FRAMEWORK FOR ROBUST LEARNING TO ACTIVELY LEARN

Whether learned or defined by an expert, any algorithm for active learning can be thought of as a policy from the perspective of reinforcement learning. At time $t$, based on an internal state $s_t$, the policy takes action $x_t$ and receives observation $y_t$, which then updates the state, and the process repeats. In our work, at time $t$ the state $s_t \in \mathcal{S}$ is a function of the history $\{(x_i, y_i)\}_{i=1}^{t-1}$, such as its sufficient statistics. Without loss of generality, a policy $\pi$ takes a state as input and defines a probability distribution over $\mathcal{X}$, so that at time $t$ we have $x_t \sim \pi(s_t)$. Fix a horizon $T$. For $t = 1, 2, \ldots, T$:

• state $s_t \in \mathcal{S}$ is a function of the history $\{(x_i, y_i)\}_{i=1}^{t-1}$,
• action $x_t \in \mathcal{X}$ is drawn at random from the distribution $\pi(s_t)$ defined over $\mathcal{X}$, and
• the next state $s_{t+1} \in \mathcal{S}$ is constructed by taking action $x_t$ in state $s_t$ and observing $y_t \sim f(\cdot \mid \theta_*, s_t, x_t)$,

until the game terminates at time $t = T$ and the policy receives loss $L_T$. Note that $L_T$ is a random variable that depends on the tuple $(\pi, \{(x_i, y_i)\}_{i=1}^{T}, \theta_*)$. We assume that $f$ is a distribution of known parametric form to the policy (e.g., $f(\cdot \mid \theta, s, x) \equiv \mathcal{N}(\langle x, \theta \rangle, 1)$), but the parameter $\theta$ is unknown to the policy. Let $\mathbb{P}_{\pi,\theta}$, $\mathbb{E}_{\pi,\theta}$ denote the probability and expectation under the probability law induced by executing policy $\pi$ in the game with $\theta_* = \theta$ to completion. Note that $\mathbb{P}_{\pi,\theta}$ includes any internal randomness of the policy $\pi$ and the random observations $y_t \sim f(\cdot \mid \theta, s_t, x_t)$. Thus, $\mathbb{P}_{\pi,\theta}$ assigns a probability to any trajectory $\{(x_i, y_i)\}_{i=1}^{T}$. For a given policy $\pi$ and $\theta_* = \theta$, the metric of interest we wish to minimize is the expected loss $\ell(\pi, \theta) := \mathbb{E}_{\pi,\theta}[L_T]$, where $L_T$ as defined above is the loss observed at the end of the episode. For a fixed policy $\pi$, $\ell(\pi, \theta)$ defines a loss surface over all possible values of $\theta$.
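The interaction loop and the expected loss $\ell(\pi, \theta)$ can be sketched in code. This is a minimal illustration, not the paper's learned policy: the linear-Gaussian observation model $\mathcal{N}(\langle x, \theta \rangle, 1)$, the uniform baseline policy, and all function names here are illustrative assumptions.

```python
import numpy as np

def rollout(policy, theta_star, actions, T, loss_fn, rng):
    """Play one episode of the game: sample x_t ~ pi(s_t), observe
    y_t ~ f(.|theta*, s_t, x_t), and return the terminal loss L_T.
    Here the state s_t is simply the raw history {(x_i, y_i)}_{i<t},
    and f is taken to be the linear-Gaussian model N(<x, theta>, 1)."""
    history = []
    for t in range(T):
        probs = policy(history, actions)            # pi(s_t): a distribution over X
        x_t = actions[rng.choice(len(actions), p=probs)]
        y_t = float(x_t @ theta_star) + rng.standard_normal()
        history.append((x_t, y_t))
    return loss_fn(history, theta_star)             # terminal loss L_T

def uniform_policy(history, actions):
    # Trivial non-adaptive baseline: ignore the history entirely.
    return np.full(len(actions), 1.0 / len(actions))

def expected_loss(policy, theta, actions, T, loss_fn, n_episodes=100, seed=0):
    """Monte Carlo estimate of l(pi, theta) = E_{pi,theta}[L_T]."""
    rng = np.random.default_rng(seed)
    return float(np.mean([rollout(policy, theta, actions, T, loss_fn, rng)
                          for _ in range(n_episodes)]))
```

Sweeping `expected_loss` over a grid of $\theta$ values traces out the loss surface $\ell(\pi, \cdot)$ for a fixed policy.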
This loss surface captures the fact that some values of $\theta$ are just intrinsically harder than others, but also that a policy may be better suited for some values of $\theta$ than others.

Example: In active binary classification, $T$ is a label budget, $\mathcal{X}$ could be a set of images such that we can query the label of example image $x_t \in \mathcal{X}$, $y_t \in \{-1, 1\}$ is the requested binary label, and the loss $L_T$ is the classification error of a classifier trained on the collected labels. Finally, $\theta_x = p(y = 1 \mid x)$ for all $x \in \mathcal{X}$. More examples can be found in Appendix A.

[Figure 1: The r-dependent baseline defines a different policy for each value of r; thus, the blue curve may be unachievable with just a single policy. $\pi^*$ is the single policy that minimizes the maximum gap to this r-dependent baseline policy.]
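The active binary classification example can be made concrete with a short sketch. The majority-vote plug-in classifier and every name below are illustrative assumptions for exposition, not the paper's method; only the label model $\theta_x = p(y = 1 \mid x)$ comes from the example above.

```python
import numpy as np

def label_oracle(x_index, theta, rng):
    """Noisy label model from the example: theta_x = p(y = 1 | x),
    so querying image x returns +1 w.p. theta_x and -1 otherwise."""
    return 1 if rng.random() < theta[x_index] else -1

def classification_loss(labels_seen, theta):
    """L_T: error of a plug-in classifier built from the queried labels
    (majority vote per image), measured against the Bayes-optimal
    labels sign(2 * theta_x - 1).  Unqueried images default to +1."""
    n = len(theta)
    preds = np.array([np.sign(np.mean(labels_seen[i])) if labels_seen.get(i)
                      else 1.0 for i in range(n)])
    bayes = np.sign(2.0 * np.asarray(theta) - 1.0)
    return float(np.mean(preds != bayes))
```

A budget-$T$ policy would spend its queries on `label_oracle` and be scored by `classification_loss` at the end of the episode.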

2.1. INSTANCE DEPENDENT PERFORMANCE METRIC

We now define the sense in which we wish to evaluate a particular policy. For any fixed value of $\theta$ one could clearly design an algorithm that maximizes performance on $\theta$, but such an algorithm might have very poor performance on some other value $\theta' \neq \theta$. Thus, we would ideally like $\pi$ to perform uniformly well over a set of $\theta$'s that are all equivalent in a certain sense. Define a positive function $C : \Theta \to (0, \infty)$ that assigns a score to each $\theta \in \Theta$, intuitively capturing the "difficulty" of a particular $\theta$, and which can be used as a partial ordering of $\Theta$. Ideally, $C(\theta)$ is a monotonic transformation of $\ell(\pi, \theta)$ for some "best" policy $\pi$ that we will define shortly. We give the explicit C(θ)

