LEARNING TO ACTIVELY LEARN: A ROBUST APPROACH

Abstract

This work proposes a procedure for designing algorithms for specific adaptive data collection tasks like active learning and pure-exploration multi-armed bandits. Unlike the design of traditional adaptive algorithms that rely on concentration of measure and careful analysis to justify the correctness and sample complexity of the procedure, our adaptive algorithm is learned via adversarial training over equivalence classes of problems derived from information theoretic lower bounds. In particular, a single adaptive learning algorithm is learned that competes with the best adaptive algorithm learned for each equivalence class. Our procedure takes as input just the available queries, set of hypotheses, loss function, and total query budget. This is in contrast to existing meta-learning work that learns an adaptive algorithm relative to an explicit, user-defined subset or prior distribution over problems, which can be challenging to define and may be mismatched to the instance encountered at test time. This work is particularly focused on the regime when the total query budget is very small, such as a few dozen, which is much smaller than the budgets typically considered by theoretically derived algorithms. We perform synthetic experiments to justify the stability and effectiveness of the training procedure, and then evaluate the method on tasks derived from real data including a noisy 20 Questions game and a joke recommendation task.

1. INTRODUCTION

Closed-loop learning algorithms use previous observations to inform what measurements to take next in a closed-loop in order to accomplish inference tasks far faster than any fixed measurement plan set in advance. For example, active learning algorithms for binary classification have been proposed that under favorable conditions require exponentially fewer labels than passive, random sampling to identify the optimal classifier (Hanneke et al., 2014) . And in the multi-armed bandits literature, adaptive sampling techniques have demonstrated the ability to identify the "best arm" that optimizes some metric with far fewer experiments than a fixed design (Garivier & Kaufmann, 2016; Fiez et al., 2019) . Unfortunately, such guarantees often either require simplifying assumptions that limit robustness and applicability, or appeal to concentration inequalities that are very loose unless the number of samples is very large (e.g., web-scale). The aim of this work is a framework that achieves the best of both worlds: algorithms that learn through simulated experience to be as effective as possible with a tiny measurement budget (e.g., 20 queries), while remaining robust due to adversarial training. Our work fits into a recent trend sometimes referred to as learning to actively learn (Konyushkova et al., 2017; Bachman et al., 2017; Fang et al., 2017; Boutilier et al., 2020; Kveton et al., 2020) which tunes existing algorithms or learns entirely new active learning algorithms by policy optimization. Previous works in this area learn a policy by optimizing with respect to data observed through prior experience (e.g., metalearning or transfer learning) or an assumed explicit prior distribution of problem parameters (e.g. the true weight vector for linear regression). In contrast, our approach makes no assumptions about what parameters are likely to be encountered at test time, and therefore produces algorithms that do not suffer from a potential mismatch of priors. 
Instead, our method learns a policy that attempts to mirror the guarantees of frequentist algorithms with instance dependent sample complexities: if the problem is hard you will suffer a large loss, if it is easy you will suffer little. The learning framework is general enough to be applied to many active learning settings of interest and is intended to be used to produce novel and robust high performing algorithms. The difference is that instead of hand-crafting hard instances that witness the difficulty of the problem, we use adversarial training inspired by the robust reinforcement learning literature to automatically train minimax policies. Embracing the use of a simulator allows our learned policies to be very aggressive while maintaining robustness. Indeed, this work is particularly useful in the setting where relatively few rounds of querying can be made, where concentration inequalities of existing algorithms are vacuous. To demonstrate the efficacy of our approach we implement the framework for the (transductive) linear bandit problem. This paradigm includes pure-exploration combinatorial bandits (e.g., shortest path, matchings) as a special case which itself reduces to active binary classification. We empirically validate our framework on a simple synthetic experiment before turning our attention to datasets derived from real data including a noisy 20 questions game and a joke recommendation task.

2. PROPOSED FRAMEWORK FOR ROBUST LEARNING TO ACTIVELY LEARN

Whether learned or defined by an expert, any algorithm for active learning can be thought of as a policy from the perspective of reinforcement learning. At time t, based on an internal state s_t, the policy takes action x_t and receives observation y_t, which then updates the state, and the process repeats. In our work, at time t the state s_t ∈ S is a function of the history {(x_i, y_i)}_{i=1}^{t-1}, such as its sufficient statistics. Without loss of generality, a policy π takes a state as input and defines a probability distribution over X so that at time t we have x_t ∼ π(s_t). Fix a horizon T. For t = 1, 2, . . . , T:
• state s_t ∈ S is a function of the history {(x_i, y_i)}_{i=1}^{t-1},
• action x_t ∈ X is drawn at random from the distribution π(s_t) defined over X, and
• next state s_{t+1} ∈ S is constructed by taking action x_t in state s_t and observing y_t ∼ f(·|θ*, s_t, x_t),
until the game terminates at time t = T and the policy receives loss L_T. Note that L_T is a random variable that depends on the tuple (π, {(x_i, y_i)}_{i=1}^T, θ*). We assume that f is a distribution of known parametric form to the policy (e.g., f(·|θ, s, x) ≡ N(⟨x, θ⟩, 1)) but the parameter θ is unknown to the policy. Let P_{π,θ} and E_{π,θ} denote the probability and expectation under the probability law induced by executing policy π in the game with θ* = θ to completion. Note that P_{π,θ} includes any internal randomness of the policy π and the random observations y_t ∼ f(·|θ, s_t, x_t). Thus, P_{π,θ} assigns a probability to any trajectory {(x_i, y_i)}_{i=1}^T. For a given policy π and θ* = θ, the metric of interest we wish to minimize is the expected loss ℓ(π, θ) := E_{π,θ}[L_T], where L_T as defined above is the loss observed at the end of the episode. For a fixed policy π, ℓ(π, θ) defines a loss surface over all possible values of θ.
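The interaction loop above means that the expected loss ℓ(π, θ) = E_{π,θ}[L_T] can be estimated by plain Monte Carlo rollouts. The sketch below is illustrative only: the `policy` and `loss_fn` interfaces, the Gaussian observation model, and the simple-regret loss are assumptions of this example, not part of the paper's implementation.

```python
import numpy as np

def expected_loss(policy, theta, loss_fn, T, n_rollouts=1000, rng=None):
    """Monte Carlo estimate of ell(pi, theta) = E_{pi,theta}[L_T].

    `policy(history)` returns a probability vector over the d actions and
    `loss_fn(history, theta)` scores a completed trajectory; both are
    hypothetical interfaces chosen for this sketch.
    """
    rng = np.random.default_rng(rng)
    d = len(theta)
    losses = []
    for _ in range(n_rollouts):
        history = []  # state s_t is a function of {(x_i, y_i)}_{i<t}
        for _ in range(T):
            probs = policy(history)
            x = rng.choice(d, p=probs)            # x_t ~ pi(s_t)
            y = theta[x] + rng.standard_normal()  # y_t ~ N(<e_x, theta>, 1)
            history.append((x, y))
        losses.append(loss_fn(history, theta))
    return float(np.mean(losses))

# Uniform policy on 3 arms; loss = simple regret of the empirical best arm.
theta = np.array([0.9, 0.1, 0.0])
uniform = lambda hist: np.ones(3) / 3

def simple_regret(history, theta):
    sums, counts = np.zeros(3), np.zeros(3)
    for x, y in history:
        sums[x] += y
        counts[x] += 1
    means = np.where(counts > 0, sums / np.maximum(counts, 1), -np.inf)
    return theta.max() - theta[int(np.argmax(means))]

loss = expected_loss(uniform, theta, simple_regret, T=20, n_rollouts=200, rng=0)
```

The same loop works for any f and L_T; only the observation line and the loss function change.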
This loss surface captures the fact that some values of θ are just intrinsically harder than others, but also that a policy may be better suited for some values of θ than others. Example: In active binary classification, T is a label budget, X could be a set of images such that we can query the label of example image x_t ∈ X, y_t ∈ {-1, 1} is the requested binary label, and the loss L_T is the classification error of a classifier trained on the collected labels. Finally, θ_x = p(y = 1|x) for all x ∈ X. More examples can be found in Appendix A.

Figure 1: The r-dependent baseline defines a different policy for each value of r; thus, the blue curve may be unachievable with just a single policy. π* is the single policy that minimizes the maximum gap to this r-dependent baseline policy.

2.1. INSTANCE DEPENDENT PERFORMANCE METRIC

We now define the sense in which we wish to evaluate a particular policy. For any fixed value of θ one could clearly design an algorithm that would maximize performance on θ, but then it might have very poor performance on some other value θ' ≠ θ. Thus, we would ideally like π to perform uniformly well over a set of θ's that are all equivalent in a certain sense. Define a positive function C : Θ → (0, ∞) that assigns a score to each θ ∈ Θ that intuitively captures the "difficulty" of a particular θ and can be used as a partial ordering of Θ. Ideally, C(θ) is a monotonic transformation of ℓ(π̄, θ) for some "best" policy π̄ that we will define shortly. We give the explicit C(θ) for the active binary classification example in Section 3, further description of C in Section 2.2, and more examples in Appendix A. For any set of problem instances Θ' define ℓ(π, Θ') := sup_{θ∈Θ'} ℓ(π, θ). And for any r ≥ 0, define Θ^{(r)} = {θ : C(θ) ≤ r}. The quantity ℓ(π, Θ^{(r)}) − inf_{π'} ℓ(π', Θ^{(r)}) is then a function of r that describes the sub-optimality gap of a given policy π relative to an r-dependent baseline policy trained specifically for each r. For a fixed r_k > 0, a policy π that aims to minimize just ℓ(π, Θ^{(r_k)}) might focus only on the hard instances (i.e., those with C(θ) close to r_k), and there may exist a different policy π' that performs far better than π on easier instances (i.e., those with C(θ) ≪ r_k). To avoid this, assuming sup_r ℓ(π, Θ^{(r)}) − inf_{π'} ℓ(π', Θ^{(r)}) < ∞, we define

π* := arg inf_π sup_{r>0} [ ℓ(π, Θ^{(r)}) − inf_{π'} ℓ(π', Θ^{(r)}) ]   (1)

as the policy that minimizes the worst-case sub-optimality gap over all r > 0. Figure 1 illustrates these definitions. Instead of computing inf_{π'} ℓ(π', Θ^{(r)}) for all r, in practice we define a grid with an increasing sequence {r_k}_{k=1}^K to find an approximation to π*.
We are now ready to state the goal of this work.

Objective: Given an increasing sequence r_1 < · · · < r_K that indexes nested sets of problem instances of increasing difficulty, Θ^{(r_1)} ⊂ Θ^{(r_2)} ⊂ · · · ⊂ Θ^{(r_K)}, we wish to identify a policy π that minimizes the maximum sub-optimality gap with respect to this sequence. Explicitly, we seek to learn

π̂ := arg inf_π max_{k≤K} [ ℓ(π, Θ^{(r_k)}) − inf_{π'} ℓ(π', Θ^{(r_k)}) ]   (2)

where ℓ(π, Θ') := sup_{θ∈Θ'} ℓ(π, θ) and ℓ(π, θ) is the expected loss incurred by policy π on instance θ. Note that as K → ∞ and sup_k r_{k+1}/r_k → 1, (1) and (2) are essentially equivalent under benign smoothness conditions on C(θ), in which case π̂ → π*. In practice, we choose a finite K such that Θ^{(r_K)} contains all problems that can be solved relatively accurately within the budget T, and a small ε > 0 such that max_k r_{k+1}/r_k = 1 + ε. Furthermore, the objective in (2) is equivalent to

π̂ = arg inf_π max_{k≤K} [ ℓ(π, Θ^{(r_k)}) − ℓ(π_k, Θ^{(r_k)}) ]   where   π_k ∈ arg inf_π sup_{θ : C(θ) ≤ r_k} ℓ(π, θ).

We can efficiently solve this objective by first computing π_k for all k ∈ [K] to obtain the benchmarks ℓ(π_k, Θ^{(r_k)}), and then using these benchmarks to train π̂.
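Once the worst-case losses ℓ(π, Θ^{(r_k)}) are available, selecting the policy that solves (2) is a small computation. A toy sketch with made-up loss values (the three candidate-policy profiles here are purely illustrative):

```python
import numpy as np

# Hypothetical worst-case losses ell[j, k] = ell(pi_j, Theta^(r_k)) for three
# candidate policies pi_j on nested difficulty classes r_1 < r_2 < r_3.
ell = np.array([
    [0.05, 0.30, 0.60],  # aggressive policy: great on easy, poor on hard
    [0.20, 0.25, 0.35],  # conservative policy: flat profile
    [0.10, 0.22, 0.40],  # intermediate policy
])
benchmark = ell.min(axis=0)              # stand-in for inf_pi' ell(pi', Theta^(r_k))
gaps = ell - benchmark                   # sub-optimality gap per difficulty class
best = int(np.argmin(gaps.max(axis=1)))  # arginf_pi max_k gap, as in (2)
```

Here the intermediate policy (index 2) wins: it is never best on any single class, yet its maximum gap to the class-specific benchmarks is smallest, which is exactly the behavior (2) rewards.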

2.2. PICKING THE COMPLEXITY FUNCTION C(θ)

We have defined an optimal policy in terms of a function C(θ) that determines a partial ordering over instances θ. This function can come from a heuristic that intuitively captures the difficulty of an instance. Or it can be defined and motivated from information theoretic lower bounds, which often describe a general ordering but are typically very loose relative to empirical performance. For example, consider the standard multi-armed bandit game where an agent has access to K distributions and in each round t ∈ [T] she chooses a distribution I_t ∈ [K] and observes a random variable in [0, 1] with mean θ_{I_t}. If her strategy is described by a policy π, once t reaches T she receives loss L_T = Σ_{t=1}^T (max_{i∈[K]} θ_i − θ_{I_t}) with expectation ℓ(π, θ) = E[L_T], where the expectation is taken with respect to the randomness in the observations and potentially any randomness of the policy. Under benign conditions, it is known that any policy must suffer ℓ(π, θ) ≳ min{√(KT), Σ_{i ≠ i*} (θ* − θ_i)^{-1}}, where θ* = max_{i∈[K]} θ_i (Lattimore & Szepesvári, 2018). Such a lower bound is an ideal candidate for C(θ). We define a different C(θ) for our particular experiments of interest, and others are described in Appendix A. The bottom line is that any function C(θ) works, but if it happens to correspond to an information theoretic lower bound, the resulting policy will match the lower bound if it is achievable.
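As a concrete illustration, the regret lower bound above can be evaluated directly as a candidate C(θ); the instances below are made up for this sketch:

```python
import math

def c_regret(theta, T):
    """Regret lower-bound complexity: min{sqrt(K*T), sum_{i != i*} 1/gap_i}."""
    K = len(theta)
    best = max(theta)
    inv_gaps = sum(1.0 / (best - t) for t in theta if t < best)
    return min(math.sqrt(K * T), inv_gaps)

# An easy instance (large gaps) scores lower than a hard one (small gaps).
easy = c_regret([0.9, 0.1, 0.1], T=400)  # sum of inverse gaps: 2 / 0.8 = 2.5
hard = c_regret([0.9, 0.8, 0.8], T=400)  # min(sqrt(1200), 2 / 0.1) = 20
```

The √(KT) term caps the score, so every instance lands in some finite difficulty class Θ^{(r)}.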

2.3. DIFFERENTIABLE POLICY OPTIMIZATION

The first step in learning the policy π̂ defined in Equation (2) is to learn each π_k := arg inf_π sup_{θ : C(θ) ≤ r_k} ℓ(π, θ) for all k = 1, . . . , K. Once all π_k are defined, π̂ of (2) is an optimization of the same form after shifting the loss by the scalar ℓ(π_k, Θ^{(r_k)}). Consequently, to learn π̂ it suffices to develop a training procedure to solve inf_π sup_{θ∈Ω} ℓ(π, θ) for an arbitrary set Ω and generic loss function ℓ(π, θ). To make the optimization problem inf_π sup_{θ∈Ω} ℓ(π, θ) tractable, we parameterize it as follows. First, to compute the supremum over Ω, we consider a finite set Θ̃ := {θ̃_i}_{i=1}^N ⊂ Ω weighted by SOFTMAX(w), where w ∈ R^N. In addition, instead of optimizing over all possible policies, we restrict the policy class to neural networks that take a state representation as input and output a probability distribution over actions, parameterized by weights ψ. Mathematically, this can be stated as:

inf_π sup_{θ∈Ω} ℓ(π, θ) = inf_π sup_{θ̃_{1:N} ⊂ Ω} max_{i∈[N]} ℓ(π, θ̃_i)   (3)
 = inf_π sup_{w∈R^N, θ̃_{1:N} ⊂ Ω} E_{i∼SOFTMAX(w)} [ℓ(π, θ̃_i)]   (4)
 ≈ inf_ψ sup_{w∈R^N, θ̃_{1:N} ⊂ Ω} E_{i∼SOFTMAX(w)} [ℓ(π_ψ, θ̃_i)].   (5)

Algorithm 1: Gradient Based Optimization of (5)
1 Input: partition Ω, number of iterations N_it, number of problem samples M, number of rollouts per problem L, and loss variable L_T at horizon T (see beginning of Section 2).
2 Goal: Compute the optimal policy arg inf_π sup_{θ∈Ω} ℓ(π, θ) = arg inf_π sup_{θ∈Ω} E_{π,θ}[L_T].
For each of N_it iterations: sample problem indices I_1, . . . , I_M ∼ SOFTMAX(w) and, for each m, collect L rollouts τ_{m,1}, . . . , τ_{m,L} of π_ψ on θ̃_{I_m}.
Optimize generating distribution: update by taking ascending steps on the gradient estimates

w ← w + (1/ML) Σ_{m=1}^M ∇_w log(SOFTMAX(w)_{I_m}) · (Σ_{l=1}^L L_T(π_ψ, τ_{m,l}, θ̃_{I_m}))
Θ̃ ← Θ̃ + (1/ML) Σ_{m=1}^M Σ_{l=1}^L [ ∇_Θ̃ L_barrier(θ̃_{I_m}, Ω) + ∇_Θ̃ L_T(π_ψ, τ_{m,l}, θ̃_{I_m}) + L_T(π_ψ, τ_{m,l}, θ̃_{I_m}) · ∇_Θ̃ log(P_{π_ψ, θ̃_{I_m}}(τ_{m,l})) ]   (7)

where L_barrier is a differentiable barrier loss that heavily penalizes θ̃_{I_m}'s outside Ω.
Optimize policy: update the policy by taking a descending step on the gradient estimate

ψ ← ψ − (1/ML) Σ_{m=1}^M Σ_{l=1}^L L_T(π_ψ, τ_{m,l}, θ̃_{I_m}) · ∇_ψ log(P_{π_ψ, θ̃_{I_m}}(τ_{m,l}))   (8)

end

Note that the objectives in (3) and (4) are indeed equivalent, since θ̃_{1:N} are free parameters we optimize over rather than fixed values. Now, to motivate (5), starting from the left-hand side of (3), observe that a small change in π may result in a large change in argsup_{θ∈Ω} ℓ(π, θ). Therefore, with the goal of covering the entire Ω, we optimize the N points so that when π changes a bit, there is at least one θ̃_i close to the optimal argsup. Moreover, to cover the entire space Ω, N is expected to be very large in practice; however, to optimize the objective effectively we can only evaluate M of the θ̃'s (M ≪ N) in each iteration. Therefore, instead of naively sampling M points uniformly at random from the N points, in (4) we optimize an extra multinomial distribution, SOFTMAX(w), over the N points so that the points around the argsup are sampled more often. The final approximation in (5) comes from parameterizing the policy by a neural network. To solve the saddle point optimization problem in (5), we use an instance of the Gradient Descent Ascent (GDA) algorithm as shown in Algorithm 1. The gradient estimates are unbiased estimates of the true gradients with respect to ψ, w, and Θ̃ (shown in Appendix B). We choose N large enough to avoid mode collapse, and M, L as large as possible to reduce the variance of the gradient estimates while fitting the memory constraint. We use Adam (Kingma & Ba, 2014) for the gradient updates and regularize some of the parameters (an example is presented in the next section). Note the decomposition of log(P_{π_ψ,θ}(τ)) in (7) and (8), where a rollout is τ = {(x_t, y_t)}_{t=1}^T and

log(P_{π_ψ,θ}({(x_t, y_t)}_{t=1}^T)) = log( π_ψ(x_1 | s_1) · f(y_1 | θ, s_1, x_1) · Π_{t=2}^T π_ψ(x_t | s_t) · f(y_t | θ, s_t, x_t) ).
Here π_ψ and f depend only on ψ and θ̃, respectively. During evaluation of a fixed policy π, we are interested in solving sup_{θ∈Ω} ℓ(π, θ) by gradient ascent updates like (7). The decoupling of π_ψ and f thus enables us to optimize this objective without differentiating through the policy π, which may be non-differentiable (e.g., a deterministic algorithm). Finally, we make a few remarks on the parameterization of (5). As given in (5), we represent the generating distribution as a finite set of weighted particles, analogous to a particle filter. Our policy parameterization π_ψ could be modelled by multi-layer perceptrons, recurrent neural networks, etc. We note that with alternative generator parameterizations like GANs (Goodfellow et al., 2014), an unbiased gradient can be derived similarly.
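To make the gradient-descent-ascent scheme concrete, here is a heavily simplified, self-contained sketch: the policy is a state-free softmax over two arms standing in for π_ψ, the two particles are held fixed (so the Θ̃-step of (7) is omitted), and only the score-function updates for w and ψ are mirrored. All names and numbers are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()

# Two fixed candidate thetas (the "particles"); adversary weights w pick among
# them, policy parameters psi define a sampling distribution over the 2 arms.
particles = [np.array([0.8, 0.2]), np.array([0.2, 0.8])]
N, T, M = len(particles), 10, 32
w, psi = np.zeros(N), np.zeros(2)
lr = 0.05

def rollout(theta, probs):
    xs = rng.choice(2, size=T, p=probs)
    ys = theta[xs] + 0.1 * rng.standard_normal(T)
    # BEST IDENTIFICATION loss: 1 if the empirical best arm is wrong.
    means = [ys[xs == i].mean() if (xs == i).any() else -np.inf for i in range(2)]
    return float(np.argmax(means) != np.argmax(theta)), xs

for _ in range(300):
    pw, probs = softmax(w), softmax(psi)
    grad_w, grad_psi = np.zeros(N), np.zeros(2)
    for _ in range(M):
        i = rng.choice(N, p=pw)
        loss, xs = rollout(particles[i], probs)
        grad_w += (np.eye(N)[i] - pw) * loss / M            # score-function grad in w
        counts = np.bincount(xs, minlength=2)
        grad_psi += (counts - T * probs) * loss / M         # REINFORCE grad in psi
    w += lr * grad_w      # adversary: ascent, as in the w-step of Algorithm 1
    psi -= lr * grad_psi  # policy: descent, as in (8)

probs = softmax(psi)
```

Because the two particles are mirror images, the minimax policy should keep sampling both arms; concentrating on one arm lets the adversary shift all weight onto the particle whose best arm is never pulled.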

3. IMPLEMENTATION FOR LINEAR BANDITS AND CLASSIFICATION

We now apply the general framework of the previous section to a specific problem: transductive linear bandits. As described in Section 5 and Appendix A, this setting generalizes standard multi-armed bandits, linear bandits, and all of binary classification through a simple reduction to combinatorial bandits. We are particularly motivated to look at classification because the existing agnostic active learning algorithms are very inefficient (see Section 5); indeed, when applied to our setting of T = 20 they never get past their first stage of uniform random sampling. Consider the game:

Input: Policy π, X ⊂ R^d, Z ⊂ R^d, time horizon T.
Initialization: Nature chooses θ* ∈ R^d (hidden from the policy).
For t = 1, 2, . . . , T:
• Policy π selects x_t ∈ X using history {(x_s, y_s)}_{s=1}^{t-1}.
• Nature reveals y_t ∼ f(·|θ*, x_t) with E[y_t | θ*, x_t] = ⟨x_t, θ*⟩.
Output: Policy π recommends ẑ ∈ Z as an estimate of z*(θ*) := argmax_{z∈Z} ⟨z, θ*⟩ and suffers loss

L_T = ⟨z*(θ*) − ẑ, θ*⟩ if SIMPLE REGRET,   L_T = 1{z*(θ*) ≠ ẑ} if BEST IDENTIFICATION.

The observation distribution f(·|θ, x) is domain specific but typically taken to be either a Bernoulli distribution for binary data or a Gaussian for real-valued data. We are generally interested in two objectives: BEST IDENTIFICATION, which attempts to exactly identify the vector z ∈ Z that is most aligned with θ*, and SIMPLE REGRET, which settles for an approximate maximizer.

Defining C(θ). Recalling the discussion of Section 2.1, C(θ) should ideally be monotonically increasing in the intrinsic difficulty of minimizing the loss with respect to a particular θ. For arbitrary X ⊂ R^d and Z ⊂ R^d, it is shown in Fiez et al. (2019) that the sample complexity of identifying z*(θ) = arg max_{z∈Z} ⟨z, θ⟩ with high probability is proportional to a quantity ρ*(θ), the value of an optimization program.
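The two terminal losses of the game can be written down directly. The sketch below uses the thresholds-on-a-line action set from the experiments later in this section; the `final_loss` helper is a name invented for this illustration.

```python
import numpy as np

def final_loss(theta, Z, z_hat, objective="SIMPLE REGRET"):
    """Loss suffered when the policy recommends z_hat from Z under theta."""
    Z = np.asarray(Z, dtype=float)
    z_star = Z[np.argmax(Z @ theta)]            # z*(theta) = argmax_z <z, theta>
    if objective == "SIMPLE REGRET":
        return float((z_star - z_hat) @ theta)  # <z*(theta) - z_hat, theta>
    return float(not np.array_equal(z_star, np.asarray(z_hat, dtype=float)))

# Thresholds on a line: z^(k) = sum_{i<=k} e_i for k = 0, ..., d.
d = 4
Z = [np.concatenate([np.ones(k), np.zeros(d - k)]) for k in range(d + 1)]
theta = np.array([1.0, 1.0, -1.0, -1.0])        # true threshold after index 2
sr = final_loss(theta, Z, Z[1], "SIMPLE REGRET")        # off by one coordinate
bi = final_loss(theta, Z, Z[2], "BEST IDENTIFICATION")  # correct recommendation
```

SIMPLE REGRET degrades gracefully with near-misses while BEST IDENTIFICATION is all-or-nothing, which is why training below warm-starts on the former.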
Another complexity term appears in the combinatorial bandits literature (Cao & Krishnamurthy, 2017), where X = {e_i : i ∈ [d]} and Z ⊂ {0, 1}^d:

ρ(θ) = Σ_{i=1}^d max_{z : z_i ≠ z*_i(θ)} ||z − z*(θ)||_2^2 / ⟨z − z*(θ), θ⟩^2.   (9)

One can show ρ*(θ) ≤ ρ(θ), and in many cases the two track each other. Because ρ(θ) can be computed much more efficiently than ρ*(θ), we use C(θ) = ρ(θ) in our experiments.

Algorithm 2: Training Workflow
1 Input: sequence {r_k}_{k=1}^K, complexity function C, and obj ∈ {SIMPLE REGRET, BEST IDENTIFICATION}.
2 Define k(θ) ∈ [K] such that r_{k(θ)−1} < C(θ) ≤ r_{k(θ)} for all θ with C(θ) ≤ r_K.
3 For each k ∈ [K], obtain policy π̃_k by Algorithm 1 with Ω = Θ^{(r_k)} and SIMPLE REGRET loss.
4 if obj is SIMPLE REGRET then
5   For each k ∈ [K], compute ℓ(π̃_k, Θ^{(r_k)}) // In this case, π_k = π̃_k.
6   Warm start π̂ = π̃_{K/2}; optimize π̂ by Algorithm 1 with Ω = Θ^{(r_K)} and the objective in (2), i.e., L_T = ⟨z*(θ) − ẑ, θ⟩ − ℓ(π̃_{k(θ)}, Θ^{(r_{k(θ)})})
7 else if obj is BEST IDENTIFICATION then
8   For each k ∈ [K], warm start π_k = π̃_k; optimize π_k by Algorithm 1 with Ω = Θ^{(r_k)} and BEST IDENTIFICATION loss; compute ℓ(π_k, Θ^{(r_k)})
9   Warm start π̂ = π_{K/2}; optimize π̂ by Algorithm 1 with Ω = Θ^{(r_K)} and the objective in (2), i.e., L_T = 1{z*(θ) ≠ ẑ} − ℓ(π_{k(θ)}, Θ^{(r_{k(θ)})})
10 end
11 Output: π̂ (an approximate solution to (2))

Training. We train our policies following Algorithm 2. Note that even when we are training for BEST IDENTIFICATION, we still warm start the training by optimizing SIMPLE REGRET: a randomly initialized policy performs so poorly that the BEST IDENTIFICATION loss is nearly always 1, making it difficult to improve the policy. Our generating distribution parameterization exactly follows Section 2.3, while detailed state representations, policy parameterizations, and hyperparameters can be found in Appendices C and D.

Loss functions.
Instead of optimizing the approximated quantity from (5) directly, we add regularizers to the losses for both the policy and the generator. First, we choose the barrier loss L_barrier in (7) to be λ_barrier · max{0, log(C(X, Z, θ̃)) − log(r_k)} for some large constant λ_barrier. To discourage the policy from over-committing to a particular action and the generating distribution from covering only a small subset of particles (i.e., mode collapse), we also add negative entropy penalties to both the policy's output distributions and SOFTMAX(w), with scaling factors λ_Pol-reg and λ_Gen-reg.
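The complexity ρ(θ) of equation (9) is straightforward to evaluate for small Z. The following sketch (with an invented `rho` helper and a toy thresholds instance) illustrates that smaller margins ⟨z − z*, θ⟩ yield larger complexity:

```python
import numpy as np

def rho(theta, Z):
    """Combinatorial-bandit complexity from equation (9):
    sum_i max_{z: z_i != z*_i} ||z - z*||_2^2 / <z - z*, theta>^2.
    Assumes each coordinate i has at least one z disagreeing with z*."""
    Z = np.asarray(Z, dtype=float)
    theta = np.asarray(theta, dtype=float)
    z_star = Z[np.argmax(Z @ theta)]
    total = 0.0
    for i in range(len(theta)):
        cands = Z[Z[:, i] != z_star[i]]
        diffs = cands - z_star
        total += np.max(np.sum(diffs**2, axis=1) / (diffs @ theta) ** 2)
    return float(total)

# Thresholds on a line (d = 3): shrinking all margins by 5x scales rho by 25x.
d = 3
Z = [np.concatenate([np.ones(k), np.zeros(d - k)]) for k in range(d + 1)]
easy = rho([1.0, -1.0, -1.0], Z)
hard = rho([0.2, -0.2, -0.2], Z)
```

Since ρ(θ) scales as the inverse squared margin, the nested classes Θ^{(r_k)} sweep from large-margin to small-margin instances.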

4. EXPERIMENTS

We now evaluate the approach described in the previous section for combinatorial bandits with X = {e_i : i ∈ [d]} and Z ⊂ {0, 1}^d. We stress that the framework implemented here can be applied to any X, Z ⊂ R^d and any appropriate f: just plug and play to learn a new policy. In our experiments we take particular instances of combinatorial bandits with Bernoulli observations. We evaluate based on two criteria: instance-dependent worst-case and average-case. For the instance-dependent worst-case, we measure, for each r_k and policy π, ℓ(π, Θ^{(r_k)}) := max_{θ∈Θ^{(r_k)}} ℓ(π, θ) and plot this value as a function of r_k; note that our algorithm is designed to optimize this metric. For the secondary average-case metric, we instead measure, for policy π and some collected set Θ̄, (1/|Θ̄|) Σ_{θ∈Θ̄} ℓ(π, θ). Instance-dependent worst-case performance is reported in Figures 2, 3, 4, 6, and 7 below, while average-case performance is reported in the tables and Figure 5. Full-scale versions of the figures can be found in Appendix F.

Algorithms. We compare against a number of baseline active learning algorithms (see Section 5 for a review). UNCERTAINTY SAMPLING at time t computes the empirical maximizer of ⟨z, θ̂⟩ and the runner-up, and samples an index uniformly from their symmetric difference; if either is not unique, an index is sampled from the region of disagreement of the winners (see Appendix G for details). The greedy methods are represented by soft generalized binary search (SGBS) (Nowak, 2011), which maintains a posterior distribution over Z and samples to maximize information gain. A hyperparameter β ∈ (0, 1/2) of SGBS determines the strength of the likelihood update. We plot or report a range of performance over β ∈ {.01, .03, .1, .2, .3, .4}.
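For concreteness, the UNCERTAINTY SAMPLING baseline described above can be sketched as follows (the tie-breaking step via the region of disagreement is omitted, and the instance is invented for this example):

```python
import numpy as np

def uncertainty_sample(theta_hat, Z, rng):
    """Pick the coordinate to query: find the empirical best z and the
    runner-up under theta_hat, then draw a coordinate uniformly from their
    symmetric difference (where the two vectors disagree)."""
    Z = np.asarray(Z, dtype=float)
    scores = Z @ theta_hat
    order = np.argsort(scores)[::-1]
    winner, runner_up = Z[order[0]], Z[order[1]]
    disagree = np.flatnonzero(winner != runner_up)
    return int(rng.choice(disagree))

rng = np.random.default_rng(0)
d = 4
Z = [np.concatenate([np.ones(k), np.zeros(d - k)]) for k in range(d + 1)]
theta_hat = np.array([1.0, 0.4, -0.8, -1.0])
i = uncertainty_sample(theta_hat, Z, rng)
```

For thresholds on a line, the winner and runner-up are adjacent thresholds, so the method queries exactly the coordinate at the current decision boundary.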
The agnostic algorithms for classification (Dasgupta, 2006; Hanneke, 2007b;a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019) and combinatorial bandits (Chen et al., 2014; Gabillon et al., 2016; Chen et al., 2017; Cao & Krishnamurthy, 2017; Fiez et al., 2019; Jain & Jamieson, 2019) are so conservative that given just T = 20 samples they are all exactly equivalent to uniform sampling, and hence are represented by UNIFORM. To represent a policy based on learning to actively learn (LAL), we employ the method of Kveton et al. (2020) with a fixed prior P constructed by drawing a z uniformly at random from Z and defining θ = 2z − 1 ∈ [−1, 1]^d (details in Appendix H). When evaluating each policy, we use the successive halving algorithm (Li et al., 2017; 2018) for optimizing our non-convex objective with randomly initialized gradient descent and restarts (details in Appendix E).

We begin with a very simple instance to demonstrate the instance-dependent performance achieved by our learned policy. For d = 25, let X = {e_i : i ∈ [d]}, Z = {Σ_{i=1}^k e_i : k = 0, 1, . . . , d}, and let f(·|θ, x) be a Bernoulli distribution over {−1, 1} with mean ⟨x, θ⟩ ∈ [−1, 1]. Note this is a binary classification task in one dimension where the set of classifiers are thresholds on a line. We trained baseline policies {π_k}_{k=1}^9 for the BEST IDENTIFICATION metric with C(θ) = ρ(X, Z, θ) and r_{i+1} = 2^{3+i/2} for i ∈ {0, . . . , 8}. First we compare the base policies π_k to π̂. Figure 2 presents ℓ(π, Θ^{(r)}) = sup_{θ: ρ(θ)≤r} ℓ(π, θ) = sup_{θ: ρ(θ)≤r} P_{π,θ}(ẑ ≠ z*(θ)) as a function of r for our base policies {π_k}_k and the global policy π*, each as an individual curve. Figure 3 plots the same information in terms of the gap: ℓ(π, Θ^{(r)}) − min_{k: r_{k−1} < r ≤ r_k} ℓ(π_k, Θ^{(r_k)}). We observe that each π_k performs best in a particular region, and π* performs almost as well as the r-dependent baseline policies over the full range of r.
This plot confirms that our optimization of objective (2) was successful. Under the same conditions as Figure 2, Figure 4 compares the performance of π* to the algorithm benchmarks. Since SGBS and LAL are deterministic, the adversarial training finds a θ that tricks them into catastrophic failure. Figure 5 trades adversarial evaluation for evaluation with respect to a parameterized prior: for each h ∈ {0.5, 0.6, . . . , 1}, θ ∼ P_h is defined by drawing a z uniformly at random from Z and then setting θ_i = (2z_i − 1)(2α_i − 1) where α_i ∼ Bernoulli(h). Thus, each sign of 2z − 1 is kept with probability h and flipped with probability 1 − h. We then compute E_{θ∼P_h}[P_{π,θ}(ẑ ≠ z*(θ))] = E_{θ∼P_h}[ℓ(π, θ)]. While SGBS now performs much better than uniform and uncertainty sampling, our policy π* is still superior to these policies. However, LAL is best overall, which is expected since the support of P_h is essentially a rescaled version of the prior used in LAL.

20 Questions. We now address an instance constructed from the real data of Hu et al. (2018). A potential explanation for the strong performance of uncertainty sampling on this instance is that on a noiseless instance (e.g., θ = 2z − 1 for some z ∈ Z), our implementation of uncertainty sampling is equivalent to CAL (Cohn et al., 1994) and is known to have near-optimal sample complexity (Hanneke et al., 2014). Uncertainty sampling even outperforms our r-dependent baseline by a bit, which in theory should not occur; we conjecture this is due to insufficient convergence of our policies or local minima. Our second experiment constructs a distribution P̃ based on the dataset: to draw θ ∼ P̃, we uniformly at random select j ∈ [1000] and set θ_i = 2p_i^{(j)} − 1 for all i ∈ [d]. As shown in Table 1, SGBS and π* are the winners. LAL performs much worse in this case, potentially because of the distribution shift from P (the prior trained on) to P̃ (the prior at test time).
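Sampling from the prior P_h used in Figure 5 is a two-line procedure; a sketch (the `sample_theta` helper name is invented here):

```python
import numpy as np

def sample_theta(Z, h, rng):
    """Draw theta ~ P_h: pick z uniformly from Z, then set
    theta_i = (2 z_i - 1)(2 alpha_i - 1) with alpha_i ~ Bernoulli(h),
    so each sign of 2z - 1 is kept with probability h."""
    z = Z[rng.integers(len(Z))]
    alpha = rng.random(len(z)) < h
    return (2 * z - 1) * (2 * alpha.astype(float) - 1)

rng = np.random.default_rng(0)
d = 5
Z = np.array([np.r_[np.ones(k), np.zeros(d - k)] for k in range(d + 1)])
theta = sample_theta(Z, h=1.0, rng=rng)  # h = 1: noiseless, theta = 2z - 1
```

At h = 1 every instance is realizable (θ = 2z − 1 exactly), and as h decreases toward 1/2 the signs of θ become uninformative coin flips, interpolating from the noiseless to the hardest regime.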
The strong performance of SGBS may be due to the fact that sign(θ_i) = 2z*(θ)_i − 1 for all i and θ ∼ P̃, a realizability condition under which SGBS has strong guarantees (Nowak, 2011).

Joke recommendation. Here z_i^{(k)} = 1 corresponds to recommending the i-th joke in user cluster z^{(k)} ∈ Z. Figure 7 shows the same style of plot as Figures 4 and 6 but for this jokes dataset, with our policy alone nearly achieving the r-dependent baseline for all r. Mirroring the construction of the 20Q prior, we construct P̃ by uniformly sampling a user and employing their θ to answer queries. Table 2 shows that despite our policy not being trained for this setting, its performance is still among the top.

5. RELATED WORK

Learning to actively learn. Previous works vary in how they parameterize the policy, ranging from parameterized mixtures of existing expertly designed active learning algorithms (Baram et al., 2004; Hsu & Lin, 2015; Agarwal et al., 2016), to parameterizing hyperparameters (e.g., learning rate, rate of forced exploration) of an existing popular algorithm (e.g., EXP3) (Konyushkova et al., 2017; Bachman et al., 2017; Cella et al., 2020), to the most ambitious, policies parameterized end-to-end as in this work (Boutilier et al., 2020; Kveton et al., 2020; Sharaf & Daumé III, 2019; Fang et al., 2017; Woodward & Finn, 2017). These works define a prior distribution either through past experience (meta-learning) or expert knowledge (e.g., θ ∼ N(0, Σ)), and then evaluate their policy with respect to this prior distribution. Defining this prior can be difficult, and moreover, if the θ encountered at test time does not follow this prior distribution, performance can suffer significantly. Our approach, on the other hand, takes an adversarial training approach and can be interpreted as learning a parameterized least favorable prior (Wasserman, 2013), thus yielding a much more robust policy. Robust and Safe Reinforcement Learning: Our work is also closely related to the field of robust and safe reinforcement learning, where our objective can be considered an instance of the minimax criterion under parameter uncertainty (Garcıa & Fernández, 2015). Widely applied in areas such as robotics, these methods train a policy in a simulator like MuJoCo (Todorov et al., 2012) to minimize a defined loss objective while remaining robust to uncertainties and perturbations of the environment (Mordatch et al., 2015; Rajeswaran et al., 2016). Ranges of these uncertainty parameters are chosen based on potential values that could be encountered when deploying the robot in the real world.
In our setting, however, defining the set of environments is far less straightforward, a difficulty overcome by the adoption of the C(θ) function. Active Binary Classification Algorithms. The literature on active learning algorithms can be partitioned into model-based heuristics like uncertainty sampling, query by committee, or model-change sampling (Settles, 2009); greedy binary-search-like algorithms that typically rely on a form of bounded noise for correctness (Dasgupta, 2005; Kääriäinen, 2006; Golovin & Krause, 2011; Nowak, 2011); and agnostic algorithms that make no assumptions on the probabilistic model (Dasgupta, 2006; Hanneke, 2007b;a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019). Though the heuristics and greedy methods can perform very well on some problems, it is typically easy to construct counter-examples (e.g., outside their assumptions) on which they catastrophically fail (as demonstrated in our experiments). The agnostic algorithms have strong robustness guarantees but rely on concentration inequalities, and consequently require at least hundreds of labels to observe any deviation from random sampling (see Huang et al. (2015) for a comparison). They were therefore not included in our experiments explicitly but are represented by UNIFORM. Pure-exploration Multi-armed Bandit Algorithms. In the linear structure setting, for sets X, Z ⊂ R^d known to the player, pulling an "arm" x ∈ X results in an observation ⟨x, θ*⟩ + zero-mean noise, and the objective is to identify arg max_{z∈Z} ⟨z, θ*⟩ for a vector θ* unknown to the player (Soare et al., 2014; Karnin, 2016; Tao et al., 2018; Xu et al., 2017; Fiez et al., 2019). A special case of linear bandits is combinatorial bandits, where X = {e_i : i ∈ [d]} and Z ⊂ {0, 1}^d (Chen et al., 2014; Gabillon et al., 2016; Chen et al., 2017; Cao & Krishnamurthy, 2017; Fiez et al., 2019; Jain & Jamieson, 2019).
Active binary classification is a special case of combinatorial pure-exploration multi-armed bandits (Jain & Jamieson, 2019) , which we exploit in the threshold experiments. While the above works have made great theoretical advances in deriving algorithms and information theoretic lower bounds that match up to constants, the constants are so large that these algorithms only behave well when the number of measurements is very large. When applied to the instances of our paper (only 20 queries are made), these algorithms behave no differently than random sampling.

6. DISCUSSION AND FUTURE DIRECTIONS

We see this work as an exciting but preliminary step towards realizing the full potential of this general approach. From a practical perspective, training a π* can take many hours of computation even for these small instances. Scaling these methods to larger instances is an important next step. While training time scales linearly with the horizon length T, we note that one can take multiple samples per time step with minimal computational overhead, enabling problems that require larger sample complexities. In our implementation we hard-coded the decision rule for z given s_T, while it could also be learned as in (Luedtke et al., 2020). Likewise, the parameterization of the policy and generator worked well for our purposes but was chosen somewhat arbitrarily: are there more natural choices? Finally, while we focused on stochastic settings, this work naturally extends to constrained, fully adaptive adversarial sequences, which is an interesting direction of future work.

A INSTANCE DEPENDENT SAMPLE COMPLEXITY

Identifying forms of C(θ) is not as difficult a task as one might think, due to the proliferation of tools for proving lower bounds for active learning (Mannor & Tsitsiklis, 2004; Tsybakov, 2008; Garivier & Kaufmann, 2016; Carpentier & Locatelli, 2016; Simchowitz et al., 2017; Chen et al., 2014). One can directly extract values of C(θ) from the literature on regret minimization for linear or other structured bandits (Lattimore & Szepesvari, 2016; Van Parys & Golrezaei, 2020), contextual bandits (Hao et al., 2019), and tabular as well as structured MDPs (Simchowitz & Jamieson, 2019; Ok et al., 2018). Moreover, we believe that even reasonable surrogates of C(θ) should result in a high-quality policy π*. We review some canonical examples:

• Multi-armed bandits. In the best-arm identification problem, there are $d \in \mathbb{N}$ Gaussian distributions, where the $i$th distribution has mean $\theta_i \in \mathbb{R}$ for $i = 1, \dots, d$. In the above formulation, this problem is encoded as: action $x_t = i_t$ results in observation $y_t \sim \mathrm{Bernoulli}(\theta_{i_t})$, and the loss is $\ell(\pi, \theta) := \mathbb{E}_{\pi,\theta}[\mathbf{1}\{\hat{i} \neq i^*(\theta)\}]$, where $\hat{i}$ is π's recommended index and $i^*(\theta) = \arg\max_i \theta_i$. It has been shown that there exists a constant $c_0 > 0$ such that for any sufficiently large $\nu > 0$ we have $\inf_\pi \sup_{\theta : C_{\mathrm{MAB}}(\theta) \le \nu} \ell(\pi, \theta) \ge \exp(-c_0 T/\nu)$, where $C_{\mathrm{MAB}}(\theta) := \sum_{i \neq i^*(\theta)} (\theta_{i^*(\theta)} - \theta_i)^{-2}$. Moreover, for any $\theta \in \mathbb{R}^d$ there exists a policy $\bar{\pi}$ that achieves $\ell(\bar{\pi}, \theta) \le c_1 \exp(-c_2 T / C_{\mathrm{MAB}}(\theta))$, where $c_1, c_2$ capture constant and low-order terms (Carpentier & Locatelli, 2016; Karnin et al., 2013; Simchowitz et al., 2017; Garivier & Kaufmann, 2016). The correspondence between the lower bound and the upper bound suggests that $C_{\mathrm{MAB}}(\theta)$ plays a critical role in determining the difficulty of identifying $i^*(\theta)$ for any θ. This exercise extends to more structured settings as well:

• Content recommendation / active search.
Consider n items (e.g., movies, proteins) where the ith item is represented by a feature vector $x_i \in \mathcal{X} \subset \mathbb{R}^d$, and a measurement of $x_t = x_i$ (e.g., preference rating, binding affinity to a target) is modeled with a linear response model such that $y_t \sim \mathcal{N}(\langle x_i, \theta \rangle, 1)$ for some unknown $\theta \in \mathbb{R}^d$. If $\ell(\pi, \theta) := \mathbb{E}_{\pi,\theta}[\mathbf{1}\{\hat{i} \neq i^*(\theta)\}]$ as above, then nearly identical results to the above hold for an analogous function of $C_{\mathrm{MAB}}(\theta)$ (Soare et al., 2014; Karnin, 2016; Fiez et al., 2019).

• Active binary classification. For $i = 1, \dots, d$ let $\phi_i \in \mathbb{R}^p$ be the feature vector of an unlabeled item (e.g., an image) that can be queried for its binary label $y_i \in \{-1, 1\}$, where $y_i \sim \mathrm{Bernoulli}(\theta_i)$ for some $\theta \in \mathbb{R}^d$. Let $\mathcal{H}$ be an arbitrary set of classifiers (e.g., neural nets, random forests, etc.) such that each $h \in \mathcal{H}$ assigns a label in $\{-1, 1\}$ to each of the items $\{\phi_i\}_{i=1}^d$ in the pool. If items are chosen sequentially to observe their labels, the objective is to identify the true risk minimizer $h^*(\theta) = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^d \mathbb{E}_\theta[\mathbf{1}\{h(\phi_i) \neq y_i\}]$ using as few requested labels as possible, with $\ell(\pi, \theta) := \mathbb{E}_{\pi, \theta}[\mathbf{1}\{\hat{h} \neq h^*(\theta)\}]$, where $\hat{h} \in \mathcal{H}$ is π's recommended classifier. Many candidates for C(θ) have been proposed in the agnostic active learning literature (Dasgupta, 2006; Hanneke, 2007b;a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019), but we believe the most granular candidates come from the combinatorial bandit literature (Chen et al., 2017; Fiez et al., 2019; Cao & Krishnamurthy, 2017; Jain & Jamieson, 2019). To make the reduction, for each $h \in \mathcal{H}$ assign a $z^{(h)} \in \{0,1\}^d$ such that $[z^{(h)}]_i := \mathbf{1}\{h(\phi_i) = 1\}$ for all $i = 1, \dots, d$, and set $\mathcal{Z} = \{z^{(h)} : h \in \mathcal{H}\}$. It is easy to check that $z^*(\theta) := \arg\max_{z \in \mathcal{Z}} \langle z, \theta \rangle$ satisfies $z^*(\theta) = z^{(h^*(\theta))}$. Thus, requesting the label of example i is equivalent to sampling $y_i \in \{-1,1\}$ from $\mathrm{Bernoulli}(\langle e_i, \theta \rangle)$, completing the reduction to combinatorial bandits: $\mathcal{X} = \{e_i : i \in [d]\}$, $\mathcal{Z} \subset \{0,1\}^d$.
We then apply the exact same C(θ) as above for linear bandits.
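As a concrete illustration of these definitions (a minimal sketch, not the paper's implementation; the function names `c_mab` and `classifiers_to_arms` are ours), the complexity measure $C_{\mathrm{MAB}}(\theta)$ and the classifier-to-arm encoding $z^{(h)}$ can be computed as:

```python
import numpy as np

def c_mab(theta):
    """C_MAB(theta) = sum over i != i* of (theta_{i*} - theta_i)^{-2}."""
    i_star = int(np.argmax(theta))
    gaps = theta[i_star] - np.delete(theta, i_star)
    return float(np.sum(gaps ** -2.0))

def classifiers_to_arms(H, phis):
    """Encode each hypothesis h as z^(h) with [z^(h)]_i = 1{h(phi_i) = 1}."""
    return [np.array([1 if h(phi) == 1 else 0 for phi in phis]) for h in H]
```

For example, `c_mab(np.array([1.0, 0.5, 0.0]))` sums the inverse-squared gaps to the best arm, so small gaps drive the complexity (and hence the required budget) up.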

B GRADIENT ESTIMATE DERIVATION

Here we derive the unbiased gradient estimates (6), (7), and (8) in Algorithm 1. Since each of the gradient estimates above averages over $M \cdot L$ identically distributed trajectories, it suffices to show that our gradient estimate is unbiased for a single problem $\theta_i$ and its rollout trajectory $\{(x_t, y_t)\}_{t=1}^T$.

Under review as a conference paper at ICLR 2021

For a feasible $w$, using the score-function identity (Aleksandrov et al., 1968),
$\nabla_w \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}[\ell(\pi_\psi, \theta_i)] = \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}[\ell(\pi_\psi, \theta_i) \cdot \nabla_w \log(\mathrm{SOFTMAX}(w)_i)]$.
Observe that if $i \sim \mathrm{SOFTMAX}(w)$ and $\{(x_t, y_t)\}_{t=1}^T$ is the result of rolling out a policy $\pi_\psi$ on $\theta_i$, then
$g_w := L_T(\pi_\psi, \{(x_t, y_t)\}_{t=1}^T, \theta_i) \cdot \nabla_w \log(\mathrm{SOFTMAX}(w)_i)$
is an unbiased estimate of $\nabla_w \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}[\ell(\pi_\psi, \theta_i)]$.

For a feasible set $\Theta$, by the definition of $\ell(\pi, \theta)$,
$\nabla_\Theta \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}[\ell(\pi_\psi, \theta_i)] = \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}[\nabla_\Theta \mathbb{E}_{\pi, \theta_i} L_T(\pi, \{(x_t, y_t)\}_{t=1}^T, \theta_i)]$
$= \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}\big[\mathbb{E}_{\pi, \theta_i}\big[\nabla_\Theta L_T(\pi, \{(x_t, y_t)\}_{t=1}^T, \theta_i) + L_T(\pi, \{(x_t, y_t)\}_{t=1}^T, \theta_i) \cdot \nabla_\Theta \log(P_{\pi_\psi, \theta_i}(\{(x_t, y_t)\}_{t=1}^T))\big]\big]$ (10)
where the last equality follows from the chain rule and the score-function identity (Aleksandrov et al., 1968). The quantity inside the expectations, call it $g_\Theta$, is then an unbiased estimator of $\nabla_\Theta \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}[\ell(\pi_\psi, \theta_i)]$, given that $i$ and $\{(x_t, y_t)\}_{t=1}^T$ are rolled out accordingly. Note that if $L_{\mathrm{barrier}} = 0$, $\nabla_\Theta L_{\mathrm{barrier}}(\theta_i, \Omega)$ is clearly an unbiased gradient estimator of $\mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}[\mathbb{E}_{\pi, \theta_i}[L_{\mathrm{barrier}}(\theta_i, \Omega)]]$, given that $i$ and the rollout are sampled accordingly. Likewise, for the policy,
$g_\psi := L_T(\pi_\psi, \{(x_t, y_t)\}_{t=1}^T, \theta_i) \cdot \nabla_\psi \log(P_{\pi_\psi, \theta_i}(\{(x_t, y_t)\}_{t=1}^T))$
is an unbiased estimate of $\nabla_\psi \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}[\ell(\pi_\psi, \theta_i)]$.
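The score-function step for $w$ can be sketched numerically (a sketch under our own naming; `losses` stands in for $\ell(\pi_\psi, \theta_i)$, which here is treated as a fixed vector for clarity). The key identity is $\nabla_w \log(\mathrm{SOFTMAX}(w)_i) = e_i - \mathrm{SOFTMAX}(w)$:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - np.max(w))  # shift for numerical stability
    return e / e.sum()

def grad_log_softmax(w, i):
    """grad_w log(softmax(w)_i) = e_i - softmax(w)."""
    g = -softmax(w)
    g[i] += 1.0
    return g

def g_w_estimate(w, losses, rng):
    """Single-sample score-function (REINFORCE-style) estimate of
    grad_w E_{i ~ softmax(w)}[losses[i]]."""
    i = rng.choice(len(w), p=softmax(w))
    return losses[i] * grad_log_softmax(w, i)
```

Averaging many such single-sample estimates recovers the exact gradient, which is what makes $g_w$ unbiased.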

C LINEAR BANDIT PARAMETERIZATION C.1 STATE REPRESENTATION

We parameterize our state space S as a flattened $|\mathcal{X}| \times 3$ matrix where each row represents a distinct $x \in \mathcal{X}$. Specifically, at time $t$ the row of $s_t$ corresponding to some $x \in \mathcal{X}$ records the number of times that action $x$ has been taken, the inverse of that count, and the running sum of the observations received from $x$.
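A minimal sketch of this state construction (our own code, not the paper's; we assume the inverse count is 0 for arms that have not yet been pulled, which the text does not specify):

```python
import numpy as np

def state_features(history, arms):
    """Flattened |X| x 3 state from a rollout history [(arm_index, y), ...]:
    per arm, the pull count, its inverse (0 if unpulled), and the sum of
    observations received from that arm."""
    s = np.zeros((len(arms), 3))
    for x_idx, y in history:
        s[x_idx, 0] += 1          # pull count
        s[x_idx, 2] += y          # running sum of observations
    pulled = s[:, 0] > 0
    s[pulled, 1] = 1.0 / s[pulled, 0]
    return s.flatten()
```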

C.2 POLICY MLP ARCHITECTURE

Our policy $\pi_\psi$ is a multi-layer perceptron with weights $\psi$. The policy takes a $3|\mathcal{X}|$-sized state as input and outputs a vector of size $|\mathcal{X}|$, which is then pushed through a soft-max to create a probability distribution over $\mathcal{X}$. At the end of the game, regardless of the policy's weights, we set $\hat{z} = \arg\max_{z \in \mathcal{Z}} \langle z, \hat{\theta} \rangle$, where $\hat{\theta}$ is the minimum $\ell_2$-norm solution to $\arg\min_\theta \sum_{s=1}^T (y_s - \langle x_s, \theta \rangle)^2$. Our policy network is a simple 6-layer MLP with layer sizes $\{3|\mathcal{X}|, 256, 256, 256, 256, |\mathcal{X}|\}$.
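The forward pass of this architecture can be sketched as follows (a framework-free illustration with our own function names, shown with small layer sizes; the paper's network uses the sizes listed above):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0, x, slope * x)

def mlp_policy(state, weights, biases):
    """Forward pass of the policy MLP: leaky-ReLU hidden layers, then a
    soft-max over the |X| outputs to give a distribution over arms."""
    h = state
    for W, b in zip(weights[:-1], biases[:-1]):
        h = leaky_relu(W @ h + b)
    logits = weights[-1] @ h + biases[-1]
    e = np.exp(logits - logits.max())  # stable soft-max
    return e / e.sum()
```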

D HYPER-PARAMETERS

In this section, we list our hyperparameters. First we define $\lambda_{\mathrm{binary}}$ to be a coefficient multiplied into the binary losses, so instead of $\mathbf{1}\{z^*(\theta^*) \neq \hat{z}\}$ we receive the loss $\lambda_{\mathrm{binary}} \cdot \mathbf{1}\{z^*(\theta^*) \neq \hat{z}\}$. We choose $\lambda_{\mathrm{binary}}$ so that the received rewards are approximately on the same scale as SIMPLE REGRET. In all of our experiments the optimizer is Adam, and all budget sizes are T = 20. For fairness of evaluation, within each experiment (1D thresholds or 20 Questions) all parameters below are shared across all evaluated policies. To elaborate on the training strategy proposed in Algorithm 2, we divide our training into four procedures, as indicated in Table 3:

• Init. The initialization procedure takes up a rather small portion of iterations, primarily to optimize $L_{\mathrm{barrier}}$ so that the particles converge into the constrained difficulty sets. In addition, during initialization we initialize and freeze w = 0, putting a uniform distribution over the particles. This allows us to utilize the entire set of particles without w converging to only a few particles early on.

• Regret training. Training with the SIMPLE REGRET objective usually takes the longest among the procedures. The primary purpose of this procedure is to let the policy converge to a reasonable warm start that already captures some essence of the task.

• Fine-tune $\pi_i$. Training with the BEST IDENTIFICATION objective is run multiple times, once for each $\pi_i$ with its corresponding complexity set $\Theta_i$. During each run, we start with a warm-started policy and reinitialize the rest of the models by running the initialization procedure, followed by optimizing the BEST IDENTIFICATION objective.

• Fine-tune $\pi^*$. This procedure optimizes (2), with baselines $\min_k \ell(\pi_k, \Theta^{(r_k)})$ evaluated based on each $\pi_i$ learned in the previous procedure. Similar to fine-tuning each individual $\pi_i$, we warm-start a policy $\pi_{K/2}$ and reinitialize w and $\Theta$ by running the initialization procedure again.
To provide a general strategy for choosing hyper-parameters: firstly, L, $\lambda_{\mathrm{binary}}$, and $\lambda_{\mathrm{Pol\text{-}reg}}$ are primarily tuned for $|\mathcal{X}|$, as the noisiness and scale of the gradients and the entropy over the arms $\mathcal{X}$ grow with $|\mathcal{X}|$. Secondly, $\lambda_{\mathrm{Gen\text{-}reg}}$ is primarily tuned for $|\mathcal{Z}|$, as it penalizes the entropy over the N particles, which is a multiple of $|\mathcal{Z}|$. Thirdly, the learning rate of $\theta$ is primarily tuned for the convergence of the particles into the restricted class; $L_{\mathrm{barrier}}$ becoming 0 after the specified number of initialization iterations is a good indicator. Finally, we choose N and M based on the memory constraints of our GPU. The hyper-parameters for each experiment were tuned with fewer than 20 hyper-parameter assignments; metrics to inspect while tuning include, but are not limited to, the gradient magnitudes of each component, the convergence of each loss, and the entropy losses for each regularization term (how close each is to the entropy of a uniform distribution).

E POLICY EVALUATION

When evaluating a policy, we are essentially solving the following objective for a fixed policy π: $\max_{\theta \in \Omega} \ell(\pi, \theta)$, where Ω is a set of problems. However, due to the non-concavity of this loss function, gradient descent initialized randomly may converge to a local maximum. To reduce this possibility, we randomly initialize many iterates, take gradient steps round-robin, and eliminate poorly performing trajectories. To do this with a fixed amount of computational resources, we apply the successive halving algorithm of Li et al. (2018). Specifically, we choose hyperparameters η = 4, r = 100, R = 1600, and s = 0. This translates to:
• Initialize $|\tilde{\Theta}| = 1600$ and optimize each $\theta_i \in \tilde{\Theta}$ for 100 iterations.
• Take the top 400 of them and optimize for another 400 iterations.
• Take the top 100 of the remaining 400 and optimize for an additional 1600 iterations.
We take gradient steps with the Adam optimizer (Kingma & Ba, 2014) with learning rate $10^{-3}$, $\beta_1 = .9$, and $\beta_2 = .999$.
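The pruning schedule above can be sketched generically (our own sketch; `improve` stands in for running Adam on a particle for a given number of iterations, and `score` for the adversarial loss $\ell(\pi, \theta)$, higher meaning a harder problem found):

```python
def successive_halving(particles, improve, score, eta=4, r0=100, rounds=3):
    """Round-robin successive halving in the style of Li et al. (2018):
    each round optimizes every surviving particle for the current budget,
    keeps the top 1/eta fraction by score, and multiplies the budget by
    eta. With eta=4, r0=100, rounds=3 this reproduces the
    1600 -> 400 -> 100 particle schedule with 100/400/1600 iterations."""
    budget = r0
    for r in range(rounds):
        particles = [improve(p, budget) for p in particles]
        if r < rounds - 1:  # no pruning after the final round
            particles = sorted(particles, key=score, reverse=True)
            particles = particles[: max(1, len(particles) // eta)]
        budget *= eta
    return particles
```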



3 Initialization: $w$, finite set $\tilde{\Theta}$, and $\psi$
4 for $t = 1, \dots, N_{it}$ do
5     Collect rollouts of play:
6     Sample M problem indices $I_1, \dots, I_M$ i.i.d. $\sim$ SOFTMAX(w)
7     for $m = 1, \dots, M$ do
8         Collect L independent rollout trajectories, denoted $\tau_{m,1:L}$, by running the policy $\pi_\psi$ on problem instance $\theta_{I_m}$, and observe the losses $L_T(\pi_\psi, \tau_{m,l}, \theta_{I_m})$ for all $1 \le l \le L$.
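The rollout-collection steps above can be sketched as follows (a sketch, not the authors' code; `policy_rollout` stands in for running $\pi_\psi$ for T steps on a problem and returning the trajectory loss $L_T$):

```python
import numpy as np

def collect_rollouts(w, thetas, policy_rollout, M, L, rng):
    """Sample M problem indices i.i.d. from SOFTMAX(w), then collect L
    independent rollout losses for each sampled problem."""
    p = np.exp(w - w.max())
    p /= p.sum()
    indices = rng.choice(len(thetas), size=M, p=p)
    return [(i, [policy_rollout(thetas[i]) for _ in range(L)]) for i in indices]
```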

Figure 2: Learned policies (lower is better).
Figure 3: Sub-optimality of individual policies (lower is better).

(2018) (see Appendix I for details), 100 yes/no questions were considered for 1000 celebrities. Each question $i \in [100]$ for each person $j \in [1000]$ was answered by several annotators to construct an empirical probability $p^{(j)}_i \in [0, 1]$ denoting the proportion of annotators that answered "yes." To construct our instance, we take $\mathcal{X} = \{e_i : i \in [100]\}$ and $\mathcal{Z} = \{z^{(j)} : [z^{(j)}]_i = \mathbf{1}\{p^{(j)}_i > 1/2\}\} \subset \{0,1\}^{100}$. Just as before, we trained $\{\pi_k\}_{k=1}^4$ for the BEST IDENTIFICATION metric with $C(\theta) = \rho(\mathcal{X}, \mathcal{Z}, \theta)$ and $r_i = 2^{3+i/2}$ for $i \in \{1, \dots, 4\}$.
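The instance construction can be sketched directly from the empirical probabilities (our own sketch; `build_instance` is a hypothetical helper name):

```python
import numpy as np

def build_instance(p):
    """Build the combinatorial-bandit instance from empirical 'yes'
    probabilities p[j][i] (person j, question i): X is the standard basis
    over questions, and [z^(j)]_i = 1{p^(j)_i > 1/2} is the majority-vote
    answer vector for person j."""
    p = np.asarray(p)
    n_questions = p.shape[1]
    X = np.eye(n_questions)
    Z = (p > 0.5).astype(int)
    return X, Z
```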

Figure 6: Max {θ : ρ(θ) ≤ r}

Figure 6 is analogous to Figure 4 but for this 20 Questions instance. Uncertainty sampling performs remarkably well on this instance. A potential explanation is that on a noiseless instance (e.g., θ = 2z − 1 for some z ∈ Z), our implementation of uncertainty sampling is equivalent to CAL (Cohn et al., 1994), which is known to have near-optimal sample complexity (Hanneke et al., 2014). Uncertainty sampling even slightly outperforms our r-dependent baseline, which in theory should not occur; we conjecture this is due to insufficient convergence of our policies or local minima. Our second experiment constructs a distribution $\tilde{P}$ based on the dataset: to draw a $\theta \sim \tilde{P}$, we uniformly at random select a $j \in [1000]$ and set $\theta_i = 2p^{(j)}_i - 1$ for each $i$.

Figure 7: Max {θ : ρ(θ) ≤ r}

Concretely, the row of $s_t$ corresponding to arm $x$ contains the count $\sum_{s=1}^{t-1} \mathbf{1}\{x_s = x\}$, its inverse $(\sum_{s=1}^{t-1} \mathbf{1}\{x_s = x\})^{-1}$, and the sum of the observations $\sum_{s=1}^{t-1} \mathbf{1}\{x_s = x\} y_s$.

The full sequence of layer sizes is $\{3|\mathcal{X}|, 256, 256, 256, 256, |\mathcal{X}|\}$, where $3|\mathcal{X}|$ corresponds to the input layer and $|\mathcal{X}|$ is the size of the output layer, which is then pushed through a softmax function to create a probability distribution over arms. In addition, all intermediate layers are activated with leaky ReLU units with negative slope .01. The 1D thresholds and 20 Questions experiments share the same network structure as described above, with $|\mathcal{X}| = 25$ and $|\mathcal{X}| = 100$ respectively.

Figure 8: Full scale of Figure 2

Figure 10: Full scale of Figure 4

Figure 12: Full scale of Figure 6


To initialize $\tilde{\Theta}$, we sample 2/3 of the N particles uniformly from $[-1, 1]^{|\mathcal{X}|}$, and the remaining 1/3 of the particles by sampling, for each $i \in [|\mathcal{Z}|]$, $\frac{N}{3|\mathcal{Z}|}$ particles uniformly from $\{\theta : \arg\max_j \langle \theta, z_j \rangle = i\}$. We initialize our policy weights by Xavier initialization, with weights sampled from a normal distribution and scaled by .01.
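This particle initialization can be sketched as follows (our own sketch; the paper does not specify how to sample uniformly from each cell $\{\theta : \arg\max_j \langle \theta, z_j \rangle = i\}$, so we use simple rejection sampling, which assumes every cell has nonzero volume):

```python
import numpy as np

def init_particles(N, d, Z, rng):
    """Initialize Theta: 2/3 of the N particles uniform on [-1,1]^d, and
    the remaining 1/3 split evenly across the |Z| cells
    {theta : argmax_j <theta, z_j> = i} via rejection sampling."""
    uniform = rng.uniform(-1, 1, size=(2 * N // 3, d))
    per_cell = N // (3 * len(Z))
    cells = []
    for i in range(len(Z)):
        while len(cells) < (i + 1) * per_cell:
            theta = rng.uniform(-1, 1, size=d)
            if int(np.argmax(Z @ theta)) == i:  # theta lies in cell i
                cells.append(theta)
    return np.vstack([uniform, np.array(cells)])
```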

Number of Iterations and Learning Rates

Parallel Sizes and Regularization coefficients

FUNDING DISCLOSURE

Removed for anonymization purposes.

ACKNOWLEDGEMENT

Removed for anonymization purposes.

G UNCERTAINTY SAMPLING

We define the symmetric difference of a set of binary vectors, $\mathrm{SymDiff}(\{z_1, \dots, z_n\}) = \{i : \exists j, k \in [n] \text{ s.t. } [z_j]_i = 1 \wedge [z_k]_i = 0\}$, as the set of dimensions on which inconsistencies exist.

Algorithm 3: Uncertainty sampling in the very-small-budget setting
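A minimal sketch of the SymDiff computation used by Algorithm 3 (the function name `sym_diff` is ours): a coordinate is in the symmetric difference exactly when some surviving vector has a 1 there and some other has a 0.

```python
import numpy as np

def sym_diff(zs):
    """SymDiff({z_1,...,z_n}): coordinates i where some pair z_j, z_k
    disagrees, i.e. [z_j]_i = 1 and [z_k]_i = 0."""
    zs = np.asarray(zs)
    return set(np.where(zs.any(axis=0) & ~zs.all(axis=0))[0])
```

An uncertainty-sampling step would then restrict its next query to a coordinate in `sym_diff` of the surviving hypotheses, since all other coordinates are already agreed upon.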

H LEARNING TO ACTIVELY LEARN ALGORITHM

To train a policy in the learning to actively learn (LAL) setting, we aim to solve the LAL objective, where our policy and states are parameterized in the same way as in Appendix C for a fair comparison. To optimize the parameters, we take gradient steps like (8), but with the new sampling and rollout where $\theta_i \sim P$. This gradient step follows from the classical policy gradient algorithm in reinforcement learning as well as from the recent LAL work of Kveton et al. (2020). Moreover, note that the optimal policy for this objective must be deterministic, since deterministic policies are optimal for MDPs. Therefore, under our experimental setting, the deterministic LAL policy will perform poorly in the adversarial setting (for the same reason that SGBS performs poorly).

I 20 QUESTIONS SETUP

Hu et al. (2018) collected a dataset of 1000 celebrities and 500 possible questions to ask about each celebrity. We chose 100 of the 500 questions by first constructing $p$, $\mathcal{X}$, and $\mathcal{Z}$ for the 500-dimensional data, and then sampling 100 of the 500 dimensions without replacement from a distribution derived from a static allocation. We down-sampled the number of questions so that training could run with sufficient M and L to de-noise the gradients while being prototyped on a single GPU. Specifically, the dataset from Hu et al. (2018) consists of the probabilities of people answering Yes / No / Unknown to each celebrity-question pair, collected from some population. To better fit the linear bandit scenario, we re-normalize the probabilities of Yes / No, conditioning on the event that these people did not answer Unknown. The probabilities of answering Yes to all 500 questions for each celebrity then constitute vectors $p^{(1)}, \dots, p^{(1000)} \in \mathbb{R}^{500}$, where dimension $i$ of a given $p^{(j)}$ represents the probability of a Yes to the ith question about the jth person.
The action set is then constructed as $\mathcal{X} = \{e_i : i \in [500]\}$, while $\mathcal{Z} = \{z^{(j)} : [z^{(j)}]_i = \mathbf{1}\{p^{(j)}_i > 1/2\}\} \subset \{0,1\}^{500}$ consists of the binary vectors taking the majority votes. To sub-sample 100 questions from the 500, we could have selected questions uniformly at random, but many of these questions are not very discriminative. Thus, we chose a "good" set of queries based on the design recommended by the $\rho$ of Fiez et al. (2019). If questions were answered noiselessly in response to a particular $z \in \mathcal{Z}$, then equivalently we have for this setting that $\theta = 2z - 1$. Since $\rho$ optimizes allocations $\lambda$ over $\mathcal{X}$ that reduce the number of required queries as much as possible (according to the information-theoretic bound of Fiez et al. (2019)), if we want to find a single allocation for all $z \in \mathcal{Z}$ simultaneously, we can solve the corresponding optimization problem. We then sample elements from $\mathcal{X}$ according to this optimal $\lambda$ without replacement and add them to $\mathcal{X}$ until $|\mathcal{X}| = 100$.
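The final sub-sampling step can be sketched as follows (our own sketch with a hypothetical helper name; `lam` stands in for the optimal design allocation computed from $\rho$):

```python
import numpy as np

def subsample_questions(lam, k, rng):
    """Draw k distinct question indices without replacement, with
    inclusion driven by the (normalized) allocation lam over questions."""
    lam = np.asarray(lam, dtype=float)
    lam = lam / lam.sum()
    return rng.choice(len(lam), size=k, replace=False, p=lam)
```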

J JESTER JOKE RECOMMENDATION SETUP

We consider the Jester jokes dataset of Goldberg et al. (2001), which contains jokes ranging from pun-based to grossly offensive. We filter the dataset to contain only users that rated all 100 jokes, resulting in 14116 users. Each joke was rated on a [-10, 10] scale, which was shrunk to [-1, 1]. Denote this set of ratings as $\Theta = \{\theta_i : i \in [14116], \theta_i \in [-1, 1]^{100}\}$, where $\theta_i$ encodes the ratings of all 100 jokes by user $i$. To construct the set of arms $\mathcal{Z}$, we then clustered the ratings of these users into 10 groups to obtain $\mathcal{Z} = \{z_i : i \in [10], z_i \in \{0,1\}^{100}\}$ by minimizing the following metric: To solve for $\mathcal{Z}$, we adapt the k-means algorithm, using the metric above in place of the $\ell_2$ metric used traditionally.
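A k-means-style procedure with a pluggable metric can be sketched as follows (our own sketch: the paper's exact metric is omitted above, so `metric` is a caller-supplied function, and we assume a coordinate-wise majority-vote center update, i.e. $z_i = 1$ when the cluster's average rating of joke $i$ is positive):

```python
import numpy as np

def kmeans_binary(thetas, k, metric, n_iter=20, rng=None):
    """k-means-style clustering of rating vectors in [-1,1]^d into binary
    centers z in {0,1}^d: assign each theta to the nearest center under
    `metric`, then update each center by a coordinate-wise majority vote
    (mean rating > 0) over its cluster."""
    rng = rng or np.random.default_rng(0)
    thetas = np.asarray(thetas)
    centers = (rng.uniform(size=(k, thetas.shape[1])) > 0.5).astype(int)
    for _ in range(n_iter):
        assign = np.array([min(range(k), key=lambda j: metric(t, centers[j]))
                           for t in thetas])
        for j in range(k):
            members = thetas[assign == j]
            if len(members):  # leave empty clusters unchanged
                centers[j] = (members.mean(axis=0) > 0).astype(int)
    return centers
```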

