UNDERSTANDING CURRICULUM LEARNING IN POLICY OPTIMIZATION FOR ONLINE COMBINATORIAL OPTIMIZATION

Abstract

In recent years, reinforcement learning (RL) has started to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, the theoretical understanding of why RL helps is still in its early stages. This paper presents the first systematic study of policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Secretary Problem, we formally prove that the distribution shift is reduced exponentially with curriculum learning, even if the curriculum is randomly generated. Our theory also shows that we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Secretary Problem and Online Knapsack to verify our findings.

1. INTRODUCTION

In recent years, machine learning techniques have shown promising results in solving combinatorial optimization (CO) problems, including the traveling salesman problem (TSP, Kool et al. (2019)), maximum cut (Khalil et al., 2017) and the satisfiability problem (Selsam et al., 2019). While some CO problems are NP-hard in the worst case, in practice the probability that we need to solve a worst-case problem instance is low (Cappart et al., 2021). Machine learning techniques are able to find generic models with exceptional performance on the majority of a class of CO problems. A significant subclass of CO problems, online CO problems, has gained much attention (Grötschel et al., 2001; Huang, 2019; Garg et al., 2008). Online CO problems entail a sequential decision-making process, which perfectly matches the nature of reinforcement learning (RL). This paper concerns using RL to tackle online CO problems.

RL is often coupled with specialized techniques, including (a particular type of) Curriculum Learning (Kong et al., 2019), human feedback and correction (Pérez-Dattari et al. (2018), Scholten et al. (2019)), and policy aggregation (boosting, Brukhim et al. (2021)). Practitioners use these techniques to accelerate training. While these hybrid techniques enjoy empirical success, the theoretical understanding is still limited: it is unclear when and why they improve performance. In this paper, we particularly focus on RL with Curriculum Learning (Bengio et al. (2009), also named "bootstrapping" in Kong et al. (2019)): train the agent on an easy task and gradually increase the difficulty up to the target task. Interestingly, these techniques exploit the special structures of online CO problems.

Main contributions. In this paper, we initiate the formal study of using RL to tackle online CO problems, with a particular emphasis on understanding the specialized techniques developed in this emerging subarea.
Our contributions are summarized below.

• Formalization. For online CO problems, we want to learn a single policy that enjoys good performance over a distribution of problem instances. This motivates us to use the Latent Markov Decision Process (LMDP) (Kwon et al., 2021a) instead of the standard MDP formulation. We give concrete examples, the Secretary Problem (SP) and Online Knapsack, to show how LMDPs model online CO problems. With this formulation, we can systematically analyze RL algorithms.

• Provable efficiency of policy optimization. By leveraging recent theory on Natural Policy Gradient (NPG) for standard MDPs (Agarwal et al., 2021), we analyze the performance of NPG for LMDPs. The performance bound is characterized by the number of iterations, the excess risk of policy evaluation, the transfer error, and the relative condition number κ that characterizes the distribution shift between the sampling policy and the optimal policy. To our knowledge, this is the first performance bound for policy optimization methods on LMDPs.

• Understanding and simplifying Curriculum Learning. Using our performance guarantee on NPG for LMDPs, we study when and why Curriculum Learning is beneficial to RL for online CO problems. Our main finding is that the main effect of Curriculum Learning is to provide a stronger sampling policy. Under certain circumstances, Curriculum Learning reduces the relative condition number κ, improving the convergence rate. For the Secretary Problem, we provably show that Curriculum Learning can exponentially reduce κ compared with the naïve sampling policy. Surprisingly, this means that even a random curriculum for SP accelerates training exponentially. As a direct implication, we show that the multi-step Curriculum Learning proposed in Kong et al. (2019) can be significantly simplified into a single-step scheme.
Lastly, to obtain a complete understanding, we study the failure mode of Curriculum Learning, so as to help practitioners decide whether to use Curriculum Learning based on their prior knowledge. To verify our theories, we conduct extensive experiments on two classical online CO problems (the Secretary Problem and Online Knapsack) and carefully track the dependency between the performance of the policy and κ.

2. RELATED WORK

RL for CO. There is a rich literature studying RL for CO problems, e.g., using Pointer Networks in REINFORCE and Actor-Critic for routing problems (Nazari et al., 2018), combining Graph Attention Networks with Monte Carlo Tree Search for TSP (Drori et al., 2020), and incorporating Structure-to-Vector Networks in Deep Q-Networks for maximum independent set problems (Cappart et al., 2019). Bello et al. (2017) proposed a framework to tackle CO problems using RL and neural networks. Kool et al. (2019) combined REINFORCE and attention to learn routing problems. Vesselinova et al. (2020) and Mazyavkina et al. (2021) are taxonomic surveys of RL approaches for graph problems. Bengio et al. (2020) summarized learning methods, algorithmic structures, and objective design, and discussed generalization; in particular, scaling to larger problems was mentioned as a major challenge. Compared to supervised learning, RL not only mimics existing heuristics but also discovers novel ones that humans have not thought of, for example in chip design (Mirhoseini et al., 2021) and compiler optimization (Zhou et al., 2020b). Kong et al. (2019) focused on using RL to tackle online CO problems, in which the agent must make sequential and irrevocable decisions. They encoded the input in a length-independent manner: for example, the i-th element of an n-length sequence is encoded by the fraction i/n and other features, so that the agent can generalize to unseen n, paving the way for Curriculum Learning. Three online CO problems were considered in their paper: Online Matching, Online Knapsack and the Secretary Problem (SP). Currently, Online Matching and Online Knapsack admit only approximation algorithms (Huang et al., 2019; Albers et al., 2021). There are also other works on RL for online CO problems. Alomrani et al. (2021) use deep RL for Online Matching. Oren et al.
(2021) study the Parallel Machine Job Scheduling Problem (PMSP) and the Capacitated Vehicle Routing Problem (CVRP), both online CO problems, using offline learning and Monte Carlo Tree Search.

LMDP. We provide the exact definition of LMDP in Sec. 4.1. As studied by Steimle et al. (2021), in general, optimal policies for LMDPs are history-dependent. This differs from the standard MDP case, where there always exists an optimal history-independent policy. They showed that even finding the optimal history-independent policy is NP-hard. Kwon et al. (2021a) investigated the sample complexity and regret bounds of LMDPs within the history-independent policy class. They presented an exponential lower bound for general LMDPs and derived algorithms with polynomial sample complexity for cases with special assumptions. Kwon et al. (2021b) showed that in reward-mixing MDPs, where the MDPs share the same transition model, a polynomial sample complexity is achievable without any assumption for finding an optimal history-independent policy.

Convergence rates for policy gradient methods. There is a line of work on the convergence rates of policy gradient methods for standard MDPs (Bhandari & Russo (2021), Wang et al. (2020), Liu et al. (2020), Ding et al. (2021), Zhang et al. (2021)). For softmax tabular parameterization, NPG obtains an O(1/T) rate (Agarwal et al., 2021), where T is the number of iterations; with entropy regularization, both PG and NPG achieve linear convergence (Mei et al., 2020; Cen et al., 2021). For log-linear policies, sample-based NPG achieves an O(1/√T) convergence rate under assumptions on ε_stat, ε_bias and κ (Agarwal et al., 2021) (see Def. 4); exact NPG with entropy regularization enjoys a linear convergence rate up to ε_bias (Cayci et al., 2021). We extend the analysis to LMDPs.

Curriculum Learning.
There is a rich body of literature on Curriculum Learning (Zhou et al., 2021b; a; 2020a; Ao et al., 2021; Willems et al., 2020; Graves et al., 2017). As surveyed in Bengio et al. (2009), Curriculum Learning has been applied to training deep neural networks and non-convex optimization, and improves convergence in several cases. Narvekar et al. (2020) rigorously modeled a curriculum as a directed acyclic graph and surveyed work on curriculum design. Kong et al. (2019) proposed a bootstrapping / Curriculum Learning approach: gradually increase the problem size once the model works sufficiently well on the current problem size.

3. MOTIVATING ONLINE COMBINATORIAL OPTIMIZATION PROBLEMS

Online CO problems are a natural class of problems that admit constructions of small-scale instances: their hardness can be characterized by the input length, and instances of different scales are similar. This property simplifies the construction of curricula and makes curriculum learning a natural fit. We also believe LMDPs are a suitable model for online CO problems, because under a proper distribution {w_m}, instances with drastically different solutions do not occupy much probability mass. In this section we introduce two motivating online CO problems. We are interested in these problems because they have been extensively studied and have closed-form, easy-to-implement policies as references. Furthermore, they were studied in Kong et al. (2019), the paper that motivates our work. They also have real-world applications, e.g., auction design (Babaioff et al., 2007).

3.1. SECRETARY PROBLEM

In SP, the goal is to maximize the probability of choosing the best among n candidates, where n is known. All candidates have distinct scores quantifying their abilities. They arrive sequentially, and when the i-th candidate shows up, the decision-maker observes the relative ranking X_i among the first i candidates, meaning the candidate is the X_i-th best so far. A decision whether to accept or reject the i-th candidate must be made immediately upon arrival, and such decisions cannot be revoked. Once a candidate is accepted, the game ends immediately. The ordering of the candidates is unknown. There are in total n! permutations, and an instance of SP is drawn from an unknown distribution over these permutations. In the classical SP, each permutation is sampled with equal probability. The optimal solution for the classical SP is the well-known 1/e-threshold strategy: reject the first ⌊n/e⌋ candidates, then accept the first subsequent candidate who is the best so far. In this paper, we also study other distributions.
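As a quick empirical sanity check (not part of the original analysis), the 1/e-threshold strategy can be simulated directly; the sketch below assumes the classical uniform distribution over permutations.

```python
import math
import random

def run_secretary(n, threshold, rng):
    """Simulate one classical-SP instance under a threshold policy: reject the
    first floor(threshold * n) candidates, then accept the first candidate who
    is the best so far. Returns 1 iff the overall best candidate is accepted."""
    scores = list(range(n))
    rng.shuffle(scores)          # uniform random arrival order
    cutoff = int(threshold * n)
    best_so_far = -1
    for i, score in enumerate(scores):
        if score > best_so_far:
            best_so_far = score
            if i >= cutoff:      # first best-so-far candidate after the cutoff
                return 1 if score == n - 1 else 0
    return 0                     # nobody was accepted after the cutoff

rng = random.Random(0)
n, trials = 50, 20000
win_rate = sum(run_secretary(n, 1 / math.e, rng) for _ in range(trials)) / trials
```

For n = 50, the empirical success probability concentrates near 1/e ≈ 0.368, as the classical analysis predicts.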

3.2. ONLINE KNAPSACK (DECISION VERSION)

In Online Knapsack, the decision-maker observes n (which is known) items arriving sequentially, each with value v_i and size s_i revealed upon arrival. A decision to either accept or reject the i-th item must be made immediately when it arrives, and such decisions cannot be revoked. At any time, the accepted items must have total size no larger than a known budget B. The goal of standard Online Knapsack is to maximize the total value of accepted items. In this paper, we study the decision version, denoted OKD, whose goal is to maximize the probability that the total value reaches a known target V. We assume that all values and sizes are sampled independently from two fixed distributions, namely v_1, v_2, ..., v_n i.i.d. ∼ F_v and s_1, s_2, ..., s_n i.i.d. ∼ F_s. In Kong et al. (2019) the experiments were carried out with F_v = F_s = Unif[0,1]; we also study other distributions. Remark 1. A challenge in OKD is the sparse reward: the only signal is a reward of 1 when the total value of accepted items first exceeds V (see the detailed formulation in Sec. C.2), whereas in Online Knapsack a reward of v_i is given instantly after the i-th item is successfully accepted. This makes it hard for random exploration to obtain any reward signal, necessitating Curriculum Learning.
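The OKD environment and its sparse reward (Remark 1) can be sketched as follows; the greedy accept-if-it-fits rule below is a hypothetical illustration, not the reference policy used in the experiments.

```python
import random

def run_okd(n, budget, target, policy, rng):
    """One OKD episode with v_i, s_i ~ Unif[0,1]. `policy` returns True to
    accept; an item is accepted only if it also fits the remaining budget.
    The single sparse reward is 1 iff the accepted total value reaches
    `target` at some point during the episode."""
    remaining_b, remaining_v = budget, target
    for i in range(n):
        v, s = rng.random(), rng.random()
        if s <= remaining_b and policy(v, s, remaining_b, remaining_v, i, n):
            remaining_b -= s
            remaining_v -= v
            if remaining_v <= 0:
                return 1         # target reached: the only reward signal
    return 0

def greedy(v, s, rb, rv, i, n):
    return True                  # illustrative rule: accept whatever fits

rng = random.Random(1)
succ = sum(run_okd(20, 2.0, 1.5, greedy, rng) for _ in range(5000)) / 5000
```

Because the reward is emitted at most once per episode, a weak policy yields long stretches of all-zero returns, which is exactly the exploration difficulty Remark 1 describes.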

4. PROBLEM SETUP

In this section, we first introduce LMDPs and explain why they naturally formulate online CO problems. We then present the components required by our algorithm, Natural Policy Gradient.

4.1. LATENT MARKOV DECISION PROCESS

Tackling an online CO problem entails handling a family of problem instances. Each instance can be modeled as a Markov Decision Process. However, for online CO problems, we want to find one algorithm that works for a family of problem instances and performs well on average over an (unknown) distribution over this family. To this end, we adopt the concept of Latent MDP, which naturally models online CO problems.

A Latent MDP (Kwon et al., 2021a) is a collection of MDPs M = {M_1, M_2, ..., M_M}. All the MDPs share state set S, action set A and horizon H. Each MDP M_m = (S, A, H, ν_m, P_m, r_m) has its own initial state distribution ν_m ∈ ∆(S), transition P_m : S × A → ∆(S) and reward r_m : S × A → [0, 1], where ∆(S) is the probability simplex over S. Let w_1, w_2, ..., w_M be the mixing weights of the MDPs, such that w_m > 0 for every m and Σ_{m=1}^M w_m = 1. At the start of every episode, one MDP M_m ∈ M is randomly chosen with probability w_m. Due to the time and space complexities of finding the optimal history-dependent policies, we stay in line with Kong et al. (2019) and only seek the optimal history-independent policy. Let Π = {π : S → ∆(A)} denote the class of all history-independent policies.

Log-linear policy. Let φ : S × A → R^d be a feature mapping, where d denotes the dimension of the feature space. Assume that ‖φ(s, a)‖_2 ≤ B. A log-linear policy takes the form
π_θ(a|s) = exp(θᵀφ(s, a)) / Σ_{a'∈A} exp(θᵀφ(s, a')),
where θ ∈ R^d. Remark 2. Log-linear parameterization generalizes the softmax tabular parameterization, which is recovered by setting d = |S||A| and φ(s, a) = One-hot(s, a). Log-linear policies are "scalable": if φ extracts important features from different S × A's with a fixed dimension d ≪ |S||A|, then a single π_θ can generalize.

Entropy-regularized value function, Q-function and advantage function. The expected reward of executing π on M_m is defined via value functions.
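A minimal sketch of the log-linear parameterization above; the 2-dimensional feature map here is hypothetical, chosen only for illustration (it is not the paper's actual feature encoding).

```python
import numpy as np

def log_linear_policy(theta, phi, state, actions):
    """pi_theta(a|s) proportional to exp(theta . phi(s, a)): a softmax
    over linear scores, returned as a dict action -> probability."""
    logits = np.array([theta @ phi(state, a) for a in actions])
    logits -= logits.max()               # numerical stabilization
    probs = np.exp(logits)
    probs /= probs.sum()
    return dict(zip(actions, probs))

# Hypothetical features for an SP-like state (i/n, best-so-far flag):
phi = lambda s, a: np.array([s[0] * a, s[1] * a])
pi = log_linear_policy(np.array([1.0, 2.0]), phi, state=(0.5, 1.0), actions=[0, 1])
# probabilities sum to 1 by construction; here action 1 ("accept") gets most mass
```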
We incorporate entropy regularization for completeness because prior works (especially empirical ones) used it to facilitate training. We define the value function in a unified way: V^{π,λ}_{m,h}(s) is the sum of future λ-regularized rewards starting from s and executing π for h steps in M_m, i.e.,
V^{π,λ}_{m,h}(s) := E[ Σ_{t=0}^{h−1} r^{π,λ}_m(s_t, a_t) | M_m, π, s_0 = s ],
where r^{π,λ}_m(s, a) := r_m(s, a) + λ ln(1/π(a|s)), and the expectation is with respect to the randomness of the trajectory induced by π in M_m. Denote V^π_{m,h}(s) := V^{π,0}_{m,h}(s), the unregularized value function. For any M_m, π, h, with H(π(·|s)) := Σ_{a∈A} π(a|s) ln(1/π(a|s)) ∈ [0, ln |A|], we define
H^π_{m,h}(s) := E[ Σ_{t=0}^{h−1} H(π(·|s_t)) | M_m, π, s_0 = s ].
In fact, V^{π,λ}_{m,h}(s) = V^π_{m,h}(s) + λ H^π_{m,h}(s). Denote V^{π,λ} := Σ_{m=1}^M w_m Σ_{s_0∈S} ν_m(s_0) V^{π,λ}_{m,H}(s_0) and V^π := V^{π,0}. We aim to find π* = argmax_{π∈Π} V^π. Under regularization, we instead seek π*_λ = argmax_{π∈Π} V^{π,λ}. Denote V* := V^{π*} and V^{*,λ} := V^{π*_λ,λ}. Since V* ≤ V^{π*,λ} ≤ V^{*,λ} ≤ V^{π*_λ} + λH ln |A|, the regularized optimal policy π*_λ is nearly optimal as long as the regularization coefficient λ is small enough. For notational ease, we abuse notation and write π* for π*_λ. The Q-function is defined in a similar manner:
Q^{π,λ}_{m,h}(s, a) := E[ Σ_{t=0}^{h−1} r^{π,λ}_m(s_t, a_t) | M_m, π, (s_0, a_0) = (s, a) ],
and the advantage function is defined as A^{π,λ}_{m,h}(s, a) := Q^{π,λ}_{m,h}(s, a) − V^{π,λ}_{m,h}(s).

Modeling SP. For SP, each instance is a permutation of length n, and in each round an instance is drawn from an unknown distribution over all permutations. In the i-th step for i ∈ [n], the state encodes the i-th candidate and the relative ranking so far. The transition is deterministic according to the problem definition. A reward of 1 is given if and only if the best candidate is accepted. We model the distribution as follows: candidate i is the best so far with probability P_i, independently of the other candidates.
Hence, the weight of each instance is simply the product of the probabilities at each position. The classical SP satisfies P_i = 1/i.

Modeling OKD. For OKD, each instance is a sequence of items with values and sizes drawn from unknown distributions F_v and F_s. In the i-th step for i ∈ [n], the state encodes the i-th item's value and size, the remaining budget, and the remaining target value to fulfill. The transition is also deterministic according to the problem definition, and a reward of 1 is given if and only if the agent reaches the target value for the first time. In Kong et al. (2019), F_v = F_s = Unif[0,1].
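For the classical SP, the modeling assumption above (candidate i is best so far with probability P_i = 1/i, independently across positions) can be verified by brute force for a small n; a minimal check:

```python
from itertools import permutations

# Enumerate all n! arrival orders and record, for each position i, whether
# the i-th candidate is the best so far.
n = 4
counts = [0] * n     # counts[i]: position i+1 holds a best-so-far candidate
joint = 0            # positions 2..n are ALL best-so-far simultaneously
total = 0
for perm in permutations(range(n)):
    total += 1
    flags = [perm[i] == max(perm[: i + 1]) for i in range(n)]
    for i, f in enumerate(flags):
        counts[i] += f
    joint += all(flags[1:])

marginals = [c / total for c in counts]   # should equal [1, 1/2, 1/3, 1/4]
# Independence check: P(all best so far) should equal 1/2 * 1/3 * 1/4 = 1/n!
```

Only the fully increasing arrival order makes every position a best-so-far candidate, so the joint probability is 1/n!, matching the product of the marginals.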

4.2. ALGORITHM COMPONENTS

In this subsection we introduce the notions used by our main algorithm.

Definition 1 (State(-action) Visitation Distribution). The state visitation distribution and state-action visitation distribution at step h ≥ 0 with respect to π in M_m are defined as
d^π_{m,h}(s) := P(s_h = s | M_m, π),  d^π_{m,h}(s, a) := P(s_h = s, a_h = a | M_m, π).

We will encounter a grafted distribution d̃^π_{m,h}(s, a) = d^π_{m,h}(s) · Unif_A(a), which in general is not the state-action visitation distribution of any policy. However, it can be attained by first acting under π for h steps to reach a state and then sampling an action from the uniform distribution Unif_A. This distribution is useful when we apply a variant of NPG in which the sampling policy is fixed. Denote d^♣_{m,h} := d^{π♣}_{m,h} and d^♣ as shorthand for {d^♣_{m,h}}_{1≤m≤M, 0≤h≤H−1}, where ♣ can be any symbol.

We also need the following definitions for NPG, which differ from the standard versions for discounted MDPs because the weights {w_m} must be incorporated to handle LMDPs. In the following definitions, let v be a collection of distributions, which will be instantiated by d*, d^t, etc. in the remaining sections.

Definition 2 (Compatible Function Approximation Loss). Let g be the parameter update direction; NPG amounts to finding the minimizer of
L(g; θ, v) := Σ_{m=1}^M w_m Σ_{h=1}^H E_{s,a∼v_{m,H−h}} [ (A^{π_θ,λ}_{m,h}(s, a) − gᵀ∇_θ ln π_θ(a|s))² ].

Definition 3 (Generic Fisher Information Matrix).
Σ^θ_v := Σ_{m=1}^M w_m Σ_{h=1}^H E_{s,a∼v_{m,H−h}} [ ∇_θ ln π_θ(a|s) (∇_θ ln π_θ(a|s))ᵀ ].
In particular, denote F(θ) := Σ^θ_{d^θ}, the Fisher information matrix induced by π_θ.
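The visitation distributions of Def. 1 can be estimated by Monte-Carlo rollouts. Below is a minimal sketch on a hypothetical two-action chain MDP, with the uniform policy playing the role of Unif_A:

```python
import random
from collections import Counter

def visitation(step_fn, init_state, policy, h, episodes, rng):
    """Monte-Carlo estimate of the state visitation distribution d^pi_h(s):
    the probability of occupying state s after h steps under policy pi,
    within a single component MDP M_m of the latent MDP."""
    counts = Counter()
    for _ in range(episodes):
        s = init_state
        for _ in range(h):
            a = policy(s, rng)
            s = step_fn(s, a, rng)
        counts[s] += 1
    return {s: c / episodes for s, c in counts.items()}

# Toy deterministic chain: action 0 stays, action 1 moves one state forward.
step = lambda s, a, rng: s + a
uniform = lambda s, rng: rng.randint(0, 1)   # uniform over both actions
rng = random.Random(2)
d2 = visitation(step, 0, uniform, h=2, episodes=20000, rng=rng)
# d2 is approximately {0: 0.25, 1: 0.5, 2: 0.25}: a binomial over two coin flips
```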

5. LEARNING PROCEDURE

In this section we introduce the algorithms: NPG with support for a customized sampler, and our proposed Curriculum Learning framework.

Natural Policy Gradient. The learning procedure generates a series of parameters and policies. Starting from θ_0, the algorithm updates the parameter by θ_{t+1} = θ_t + η g_t, where η is a predefined constant learning rate and g_t is the update direction. Denote π_t := π_{θ_t}, V^{t,λ} := V^{π_t,λ} and A^{t,λ}_{m,h} := A^{π_t,λ}_{m,h} for convenience. We adopt NPG (Kakade, 2002) because it is efficient in training parameterized policies and admits a clean theoretical analysis. NPG satisfies g_t ∈ argmin_g L(g; θ_t, d^{θ_t}) (see Sec. D.1 for an explanation). When we only have samples, we use the approximate version of NPG: ĝ_t ≈ argmin_{g∈G} L(g; θ_t, d^{θ_t}), where G = {x : ‖x‖_2 ≤ G} for some hyper-parameter G. We also introduce a variant of NPG: instead of sampling from d^{θ_t} using the current policy π_t, we sample from d^{π_s} using a fixed sampling policy π_s. The update rule becomes ĝ_t ≈ argmin_{g∈G} L(g; θ_t, d^{π_s}). This version makes a closed-form analysis for SP possible.

The main algorithm is shown in Alg. 1. It admits two types of training: (i) if π_s = None, it calls Alg. 4 (deferred to App. A) to sample s, a ∼ d^{θ_t}; (ii) if π_s ≠ None, it calls Alg. 4 to sample s, a ∼ d^{π_s}. In both cases, Alg. 4 also returns an unbiased estimate of A^{π_t}_{H−h}(s, a). We denote by d^t the sampling distribution and by Σ_t the induced Fisher information matrix used in step t, i.e., d^t := d^{θ_t}, Σ_t := F(θ_t) if π_s = None; d^t := d^{π_s}, Σ_t := Σ^{θ_t}_{d^{π_s}} otherwise. The update rule can be written in a unified way as ĝ_t ≈ argmin_{g∈G} L(g; θ_t, d^t). This is equivalent to solving a constrained quadratic optimization problem, for which we can use existing solvers. Remark 3. Alg. 1 differs from Alg. 4 of Agarwal et al. (2021) in that we use a "batched" update, while they used successive Projected Gradient Descent (PGD) steps.
This is an important implementation technique to speed up training in our experiments.

Curriculum Learning. We use Curriculum Learning to facilitate training. Alg. 2 is our proposed training framework, which first constructs an easy environment E' and trains a (near-)optimal policy π_s on it. In the target environment E, we either use π_s to sample data while training a new policy from scratch, or simply continue training π_s. To be specific and to provide clarity for the results in Sec. 7, we name a few training modes (without regularization) here; the rest are in Tab. 1. curl, the standard Curriculum Learning, runs Alg. 2 with samp = pi_t; fix_samp_curl stands for fixed-sampler Curriculum Learning, running Alg. 2 with samp = pi_s. direct means directly learning in E without a curriculum, i.e., running Alg. 1 with π_s = None; naive_samp also directly learns in E, while using the naïve random policy as π_s to sample data in Alg. 1.

Algorithm 1: NPG for LMDP.
1: Input: environment E; learning rate η; episode number T; batch size N; initialization θ_0; sampling policy π_s; regularization coefficient λ; entropy clip bound U; optimization domain G.
2: for t ← 0, 1, ..., T−1 do
3:   For 0 ≤ n ≤ N−1 and 0 ≤ h ≤ H−1, sample (s_h^{(n)}, a_h^{(n)}) and estimate Â_{H−h}^{(n)} using Alg. 4.
4:   Calculate F̂_t ← Σ_{n=0}^{N−1} Σ_{h=0}^{H−1} ∇_θ ln π_{θ_t}(a_h^{(n)}|s_h^{(n)}) (∇_θ ln π_{θ_t}(a_h^{(n)}|s_h^{(n)}))ᵀ and ∇̂_t ← Σ_{n=0}^{N−1} Σ_{h=0}^{H−1} Â_{H−h}^{(n)} ∇_θ ln π_{θ_t}(a_h^{(n)}|s_h^{(n)}).
5:   Call any solver to get ĝ_t ← argmin_{g∈G} gᵀF̂_t g − 2gᵀ∇̂_t.
6:   Update θ_{t+1} ← θ_t + η ĝ_t.
7: end for
8: Return: θ_T.
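As a concrete illustration of the NPG update, the sketch below performs one batched step for a log-linear policy, assuming advantage estimates are given; projecting the unconstrained minimizer back onto the ball ‖g‖₂ ≤ G is a simple heuristic stand-in for the constrained quadratic solver, not the paper's exact procedure.

```python
import numpy as np

def npg_update(theta, phi, batch, actions, eta, G):
    """One batched NPG step: build F_t and grad_t from samples, approximately
    solve min_{|g| <= G} g'F_t g - 2 g'grad_t, then take theta <- theta + eta*g.
    `batch` holds (state, action, advantage_estimate) tuples; for a log-linear
    policy, grad_theta ln pi(a|s) = phi(s,a) - E_{a'~pi}[phi(s,a')]."""
    d = theta.shape[0]
    F, grad = np.zeros((d, d)), np.zeros(d)
    for s, a, adv in batch:
        feats = np.array([phi(s, b) for b in actions])
        logits = feats @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        score = phi(s, a) - probs @ feats       # grad of log-policy
        F += np.outer(score, score)
        grad += adv * score
    g = np.linalg.pinv(F) @ grad                # unconstrained minimizer
    norm = np.linalg.norm(g)
    if norm > G:
        g *= G / norm                           # heuristic projection onto G
    return theta + eta * g
```

With a positive advantage estimate for one action, the update moves θ so that π_θ puts more probability mass on that action.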

6. PERFORMANCE ANALYSIS

Our analysis contains two components: the sub-optimality gap guarantee for the proposed NPG, and the efficacy guarantee of Curriculum Learning on the Secretary Problem. The first component can also be extended to history-dependent policies, with features being the tensor products of the features from each time step (which is exponentially large).

Algorithm 2: Curriculum Learning framework.
1: Input: environment E; learning rate η; episode number T; batch size N; sampler type samp ∈ {pi_s, pi_t}; regularization coefficient λ; entropy clip bound U; optimization domain G.
2: Construct an environment E' with a task easier than E. This environment should have an optimal policy similar to that of E.
3: θ_s ← NPG(E', η, T, N, 0_d, None, λ, U, G) (see Alg. 1).
4: if samp = pi_s then
5:   θ_T ← NPG(E, η, T, N, 0_d, π_s, λ, U, G).
6: else
7:   θ_T ← NPG(E, η, T, N, θ_s, None, λ, U, G).
8: end if
9: Return: θ_T.
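The control flow of Alg. 2 can be sketched in a few lines; `make_env` and `npg_train` are hypothetical stand-ins for the environment constructor and Alg. 1, respectively.

```python
def curriculum_train(make_env, npg_train, samp="pi_t", **kw):
    """Two-phase curriculum framework: warm up on an easier environment,
    then either (pi_s) keep the warm-up policy as a fixed sampler while
    training a fresh policy, or (pi_t) continue training it directly."""
    easy_env, target_env = make_env(small=True), make_env(small=False)
    theta_s = npg_train(easy_env, theta0=None, sampler=None, **kw)
    if samp == "pi_s":
        # fix_samp_curl: fresh policy, data sampled by the warm-up policy
        return npg_train(target_env, theta0=None, sampler=theta_s, **kw)
    # curl: continue training the warm-up policy on the target environment
    return npg_train(target_env, theta0=theta_s, sampler=None, **kw)
```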

6.1. NATURAL POLICY GRADIENT FOR LATENT MDP

Let g_t ∈ argmin_{g∈G} L(g; θ_t, d^t) denote the exact minimizer. We have the following definitions.

Definition 4. Define, for 0 ≤ t ≤ T:
• (Excess risk) ε_stat := max_t E[L(ĝ_t; θ_t, d^t) − L(g_t; θ_t, d^t)];
• (Transfer error) ε_bias := max_t E[L(g_t; θ_t, d*)];
• (Relative condition number) κ := max_t E[sup_{x∈R^d} (xᵀ Σ^{θ_t}_{d*} x)/(xᵀ Σ_t x)].

Note that the term inside each expectation is a random quantity, since θ_t is random; the expectation is taken with respect to the randomness in the sequence of update directions ĝ_0, ĝ_1, ..., ĝ_T. All these quantities are commonly used in the literature discussed in Sec. 2. ε_stat arises because the minimizer ĝ_t obtained from samples may not minimize the population loss L. ε_bias quantifies the approximation error due to the feature maps. κ characterizes the distribution mismatch between d^t and d*; it is a key quantity in Curriculum Learning and will be studied in more detail in the following sections.

Our main result is based on a fitting error which depicts the closeness between π* and any policy π.

Definition 5 (Fitting Error). Suppose the update rule is θ_{t+1} = θ_t + η ĝ_t; define
err_t := Σ_{m=1}^M w_m Σ_{h=1}^H E_{(s,a)∼d*_{m,H−h}} [ A^{t,λ}_{m,h}(s, a) − ĝ_tᵀ ∇_θ ln π_t(a|s) ].

Thm. 6 shows the convergence rate of Alg. 1; its proof is deferred to Sec. B.1.

Theorem 6. With Def. 4, 5 and 8, our algorithm enjoys the following performance bound:
E[ min_{0≤t≤T} (V^{*,λ} − V^{t,λ}) ]
  ≤ λ(1−ηλ)^{T+1} Φ(π_0) / (1 − (1−ηλ)^{T+1}) + η B²G²/2 + ( Σ_{t=0}^T (1−ηλ)^{T−t} E[err_t] ) / ( Σ_{t'=0}^T (1−ηλ)^{T−t'} )
  ≤ λ(1−ηλ)^{T+1} Φ(π_0) / (1 − (1−ηλ)^{T+1}) + η B²G²/2 + √(H ε_bias) + √(Hκ ε_stat),
where Φ(π_0) is the Lyapunov potential function, which depends only on the initialization.

Remark 4. Several comments on Thm. 6 are in order. (i) With λ small and η chosen appropriately, the bound yields an O(1/√T) rate, matching the result in Agarwal et al. (2021). (ii) ε_stat can be reduced using a larger batch size N (Lem. 19); the typical scaling is ε_stat = O(1/√N). (iii) If some d^t (especially the initialization d^0) is far from d*, κ may be extremely large (see Sec. 6.2 for an example).
In other words, if we can find a sampling policy with small κ using a single curriculum, we do not need the multi-step curriculum learning procedure used in Kong et al. (2019).

6.2. CURRICULUM LEARNING FOR SECRETARY PROBLEM

For SP, there exists an optimal threshold policy (Beckmann, 1990): given a threshold p ∈ (0, 1), accept candidate i if and only if i/n > p and X_i = 1. For the classical SP, where all n! instances are sampled with equal probability, the optimal threshold is 1/e. Thm. 6 gives a direct hint as to why curriculum learning makes training converge faster: curriculum learning produces a good sampler, leading to a much smaller κ than that of a naïve random sampler. Here we focus on the case samp = pi_s, where the sampler is fixed; when samp = pi_t the sampling distribution changes over time, which makes an analytical treatment intractable. Thm. 7 characterizes κ in SP; its full statement and proof are deferred to Sec. B.2.

Theorem 7. Assume that each candidate is independent of the others and the i-th candidate has probability P_i of being the best so far (Sec. 4.1). Assume the optimal policy is a p-threshold policy and the sampling policy is a q-threshold policy. There exists a policy parameterization such that
κ_curl = Θ( Π_{j=⌊nq⌋+1}^{⌊np⌋} 1/(1−P_j) ) if q ≤ p, and Θ(1) if q > p;
κ_naïve = Θ( 2^{⌊np⌋} max{ 1, max_{i≥⌊np⌋+2} Π_{j=⌊np⌋+1}^{i−1} 2(1−P_j) } ),   (1)
where κ_curl and κ_naïve are the κ of the curriculum sampling policy and of the naïve random policy, respectively.

To understand how curriculum learning influences κ, we apply Thm. 7 to three concrete cases. They show that when the state distribution induced by the optimal policy in the small problem is similar to that in the original large problem, a single-step curriculum suffices (cf. Rem. 4).

The classical case: an exponential improvement. We study the classical SP first, where all n! permutations are sampled with equal probability, so that P_i = 1/i. Substituting into Eq. 1 directly gives
κ_curl = ⌊n/e⌋/⌊nq⌋ if q ≤ 1/e, and 1 if q > 1/e;   κ_naïve = 2^{n−1} ⌊n/e⌋/(n−1).
Except for the corner case q < 1/n, we have κ_curl = O(n) while κ_naïve = Ω(2^n).
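As a numerical sanity check of the classical case, the sketch below plugs P_i = 1/i into the Θ(·) expressions of Thm. 7 (floor rounding via `int`, optimal threshold p = 1/e assumed), showing that κ_curl stays small while κ_naïve blows up exponentially.

```python
import math

def kappa_classical(n, q):
    """Evaluate the Thm. 7 expressions for the classical SP (P_i = 1/i)
    with a q-threshold sampling policy and optimal threshold p = 1/e."""
    p = 1 / math.e
    np_, nq_ = int(n * p), int(n * q)
    if q > p:
        k_curl = 1.0
    else:
        k_curl = 1.0
        for j in range(nq_ + 1, np_ + 1):
            k_curl *= 1 / (1 - 1 / j)    # telescopes to floor(n/e)/floor(nq)
    inner = 1.0
    for i in range(np_ + 2, n + 1):
        prod = 1.0
        for j in range(np_ + 1, i):
            prod *= 2 * (1 - 1 / j)
        inner = max(inner, prod)
    k_naive = 2 ** np_ * inner           # maximized at i = n for P_i = 1/i
    return k_curl, k_naive

k_curl, k_naive = kappa_classical(n=30, q=0.25)
# k_curl = 11/7 (about 1.57), while k_naive is already on the order of 1e8
```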
Notice that any distribution with P_i ≤ 1/i leads to an exponential improvement.

A more general case. We now loosen the condition P_i ≤ 1/i. Consider the case where P_i ≤ 1/2 for i ≥ 2 (by definition P_1 always equals 1). Eq. 1 now becomes
κ_curl ≤ 2^{⌊np⌋−⌊nq⌋} if q ≤ p (and 1 if q > p);   κ_naïve ≥ 2^{⌊np⌋}.
Clearly, κ_curl ≤ κ_naïve always holds. When q is close to p, the gap is exponential in ⌊nq⌋.

Failure mode of Curriculum Learning. Lastly, we show that further relaxing the assumption on P_i leads to failure cases. The extreme case is P_i = 1 for all i, i.e., the best candidate always comes last. Suppose q < (n−1)/n; then d^{π_q}(n/n) = 0, hence κ_curl = ∞, which is larger than κ_naïve: from Eq. 1, κ_naïve ≤ 2^{n−1}. Similarly to Sec. 3 of Beckmann (1990), the optimal threshold p satisfies
Σ_{i=⌊np⌋+2}^n P_i/(1−P_i) ≤ 1 < Σ_{i=⌊np⌋+1}^n P_i/(1−P_i),
so letting P_n > 1/2 results in p ∈ [(n−1)/n, 1). Further, if q < (n−1)/n and P_j > 1 − 2^{−n/(n−⌊nq⌋−1)} for every ⌊nq⌋+1 ≤ j ≤ n−1, then from Eq. 1, κ_curl > 2^n > κ_naïve. This means that Curriculum Learning can always be manipulated adversarially: sometimes there is hardly any reasonable curriculum.

Remark 5. We provide theoretical explanations only for SP with samp = pi_s, because κ is highly problem-dependent and its analytical form is tractable only when the sampler is fixed. For samp = pi_t and other CO problems such as OKD, we do not have analytical forms, so we resort to empirical studies (Sec. 7).

7. EXPERIMENTS

The experiments' formulations are modified from Kong et al. (2019). Due to the page limit, more formulation details and results are presented in Sec. C. In Curriculum Learning, the entire training process splits into two phases: we call training on the curriculum (small-scale instances) the "warm-up phase" and training on large-scale instances the "final phase". We ran more than one experiment for each problem; each experiment contains multiple training processes to show the effect of different samplers and regularization coefficients. To highlight the effect of Curriculum Learning, we omit the results regarding regularization; they can be found in the supplementary files. All trainings in the same experiment share the same distributions over LMDPs for the final phase and the warm-up phase (if any), respectively.

Secretary Problem. We show one of the four experiments in Fig. 1. Aside from the reward and ln κ, we plot the weighted average of err_t according to Thm. 6: avg(err_t) = ( Σ_{i=0}^t (1−ηλ)^{t−i} err_i ) / ( Σ_{i'=0}^t (1−ηλ)^{t−i'} ). All instance distributions are generated from parameterized series {P_n} with fixed random seeds, which guarantees reproducibility and comparability. There is no explicit relation between the curriculum and the target environment, so the curriculum can be viewed as random and independent. The experiments clearly demonstrate that curriculum learning can boost performance by a large margin and indeed dramatically reduces κ, even when the curriculum is randomly generated.
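The weighted average avg(err_t) above can be maintained incrementally during training; a minimal sketch:

```python
def discounted_running_avg(errs, eta, lam):
    """Running weighted average with weights (1 - eta*lam)^(t - i):
    avg(err_t) = sum_i gamma^(t-i) err_i / sum_i gamma^(t-i)."""
    gamma = 1 - eta * lam
    num = den = 0.0
    out = []
    for e in errs:
        num = gamma * num + e      # numerator:   sum_i gamma^(t-i) err_i
        den = gamma * den + 1.0    # denominator: sum_i gamma^(t-i)
        out.append(num / den)
    return out
```

With λ = 0 this reduces to the plain running mean of the err_t series.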

Online Knapsack (decision version). We show one of the three experiments in Fig. 2. ln κ and avg(err_t) are computed with respect to the reference policy, a bang-per-buck algorithm, which is not the optimal policy; thus they are only for reference. The curriculum generation is also parameterized, random, and independent of the target environment. The experiments again demonstrate the effectiveness of curriculum learning, and confirm that curriculum learning indeed dramatically reduces κ.

8. CONCLUSION

We showed that online CO problems can be naturally formulated as LMDPs, and we analyzed the convergence rate of NPG for LMDPs. Our theory shows that the main benefit of curriculum learning is finding a stronger sampling policy; in particular, for the standard SP, any curriculum exponentially improves convergence. Our empirical results corroborate these findings. Our work is the first attempt to systematically study techniques devoted to using RL to tackle online CO problems, which we believe is a fruitful direction worth further investigation.

Algorithm (App. A): detailed implementation of the NPG update with sampling.
1: Input: environment E; learning rate η; episode number T; batch size N; initialization θ_0; sampling policy π_s; regularization coefficient λ; entropy clip bound U; optimization domain G.
2: for t ← 0, 1, ..., T−1 do
3:   Initialize F̂_t ← 0_{d×d}, ∇̂_t ← 0_d.
4:   for n ← 0, 1, ..., N−1 do
5:     for h ← 0, 1, ..., H−1 do
6:       if π_s is not None then
7:         s_h, a_h, Â_{H−h}(s_h, a_h) ← Sample(E, π_s, True, π_t, h, λ, U) (see Alg. 4).
8:       else
9:         s_h, a_h, Â_{H−h}(s_h, a_h) ← Sample(E, π_t, False, π_t, h, λ, U).
10:      end if
11:    end for
12:    Update F̂_t ← F̂_t + Σ_{h=0}^{H−1} ∇_θ ln π_{θ_t}(a_h|s_h)(∇_θ ln π_{θ_t}(a_h|s_h))ᵀ and ∇̂_t ← ∇̂_t + Σ_{h=0}^{H−1} Â_{H−h}(s_h, a_h) ∇_θ ln π_{θ_t}(a_h|s_h).
13:  end for
14:  Call any solver to get ĝ_t ← argmin_{g∈G} gᵀ F̂_t g − 2gᵀ ∇̂_t.
15:  Update θ_{t+1} ← θ_t + η ĝ_t.
16: end for
17: Return: θ_T.

B.1 PERFORMANCE OF NATURAL POLICY GRADIENT FOR LMDP

First we give the skipped definition of the Lyapunov potential function Φ, then prove Thm. 6.

Definition 8 (Lyapunov Potential Function (Cayci et al., 2021)). We define the potential function Φ : Π → R as follows: for any π ∈ Π,
Φ(π) = Σ_{m=1}^{M} w_m Σ_{h=0}^{H-1} E_{(s,a)∼d*_{m,h}}[ln(π*(a|s)/π(a|s))].

Theorem 6 (Restatement of Thm. 6). With Def. 4, 5 and 8, our algorithm enjoys the performance bound
E[min_{0≤t≤T} (V^{*,λ} - V^{t,λ})] ≤ λ(1-ηλ)^{T+1} Φ(π_0) / (1 - (1-ηλ)^{T+1}) + ηB²G²/2 + √(H ε_bias) + √(Hκ ε_stat).

Proof. We use the shorthands Δ_t := V^{*,λ} - V^{t,λ} and Φ_t := Φ(π_t) for the sub-optimality gap and the potential. From Lem. 15 we have
ηΔ_t ≤ (1-ηλ)Φ_t - Φ_{t+1} + η err_t + η²B²G²/2.
Taking expectation over the update weights,
E[ηΔ_t] ≤ (1-ηλ)E[Φ_t] - E[Φ_{t+1}] + η E[err_t] + η²B²G²/2.
Thus
E[η Σ_{t=0}^{T} (1-ηλ)^{T-t} Δ_t] ≤ Σ_{t=0}^{T} (1-ηλ)^{T-t+1} E[Φ_t] - Σ_{t=0}^{T} (1-ηλ)^{T-t} E[Φ_{t+1}] + η Σ_{t=0}^{T} (1-ηλ)^{T-t} E[err_t] + (η²B²G²/2) Σ_{t=0}^{T} (1-ηλ)^{T-t}
= (1-ηλ)^{T+1} Φ_0 - E[Φ_{T+1}] + η Σ_{t=0}^{T} (1-ηλ)^{T-t} E[err_t] + (η²B²G²/2) Σ_{t=0}^{T} (1-ηλ)^{T-t}
≤ (1-ηλ)^{T+1} Φ_0 + η Σ_{t=0}^{T} (1-ηλ)^{T-t} E[err_t] + (η²B²G²/2) Σ_{t=0}^{T} (1-ηλ)^{T-t},
where the last step uses the fact that Φ(π) ≥ 0. The left-hand side dominates a weighted average, so normalizing the coefficients (note Σ_{t'=0}^{T} (1-ηλ)^{T-t'} = (1 - (1-ηλ)^{T+1})/(ηλ)) gives
E[min_{0≤t≤T} Δ_t] ≤ λ(1-ηλ)^{T+1} Φ_0 / (1 - (1-ηλ)^{T+1}) + ηB²G²/2 + (Σ_{t=0}^{T} (1-ηλ)^{T-t} E[err_t]) / (Σ_{t'=0}^{T} (1-ηλ)^{T-t'})
≤ λ(1-ηλ)^{T+1} Φ_0 / (1 - (1-ηλ)^{T+1}) + ηB²G²/2 + √(H ε_bias) + √(Hκ ε_stat),
where the last step comes from Lem. 16 and Jensen's inequality. This completes the proof.
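The weighted telescoping step in the proof can be checked numerically: with ρ = 1 - ηλ and arbitrary nonnegative potentials Φ_t, the weighted sum collapses exactly. A small sketch (values are arbitrary placeholders):

```python
def weighted_telescope(phis, rho):
    """sum_{t=0}^{T} rho^(T-t) * (rho * phi_t - phi_{t+1}), with T = len(phis) - 2.
    The proof's rearrangement says this equals rho^(T+1) * phi_0 - phi_{T+1}."""
    T = len(phis) - 2
    return sum(rho ** (T - t) * (rho * phis[t] - phis[t + 1]) for t in range(T + 1))
```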

B.2 CURRICULUM LEARNING AND THE CONSTANT GAP FOR SECRETARY PROBLEM

Theorem 7 (Formal statement of Thm. 7). For SP, set samp = pi_s in Alg. 2. Assume that each candidate is independent of the others and that the i-th candidate has probability P_i of being the best so far (see the formulation in Sec. 4.1 and C.1). Assume the optimal policy is a p-threshold policy and the sampling policy is a q-threshold policy. There exists a policy parameterization and quantities
k_curl = Π_{j=⌊nq⌋+1}^{⌊np⌋} 1/(1-P_j) if q ≤ p, and k_curl = 1 if q > p;
k_naive = 2^{⌊np⌋} max{1, max_{i≥⌊np⌋+2} Π_{j=⌊np⌋+1}^{i-1} 2(1-P_j)},
such that k_curl ≤ κ_curl ≤ 2 k_curl and k_naive ≤ κ_naive ≤ 2 k_naive. Here κ_curl and κ_naive correspond to the κ induced by the q-threshold policy and the naïve random policy, respectively.

Proof. We need to calculate three state-action visitation distributions: that induced by the optimal policy, d*; that induced by the sampler which is optimal for the curriculum, d_curl; and that induced by the naïve random sampler, d_naive. This boils down to calculating the state(-action) visitation distribution under two types of policies: any threshold policy and the naïve random policy. For any policy π, denote by d^π(i/n) the probability that an agent acting under π sees state (i/n, x_i) with arbitrary x_i. We do not need to take the terminal state g into consideration, since it stays in a zero-reward loop and contributes 0 to L(g; θ, d). We use the LMDP distribution described in Sec. 7.

Denote by π_p the p-threshold policy, i.e., accept if and only if i/n > p and x_i = 1. Under π_p, state (j/n, x_j) is rejected with probability 1 if j/n ≤ p and with probability 1-P_j otherwise, so
d^{π_p}(i/n) = P(reject all previous i-1 states | π_p) = Π_{j=⌊np⌋+1}^{i-1} (1-P_j).
Denote by π_naive the naïve random policy, i.e., accept with probability 1/2 regardless of the state. Then
d^{π_naive}(i/n) = P(reject all previous i-1 states | π_naive) = (1/2)^{i-1}.
For any π, the state-action visitation distribution satisfies d^π(i/n, 1) = P_i d^π(i/n) and d^π(i/n, 0) = (1-P_i) d^π(i/n).

To exhibit the largest possible difference, we use the parameterization φ(s) = One-hot(s) for each state s. The policy is then
π_θ(accept|s) = exp(θᵀφ(s)) / (exp(θᵀφ(s)) + 1), π_θ(reject|s) = 1 / (exp(θᵀφ(s)) + 1).
Writing π_θ(s) = π_θ(accept|s), we have
∇_θ ln π_θ(accept|s) = (1 - π_θ(s)) φ(s), ∇_θ ln π_θ(reject|s) = -π_θ(s) φ(s).
Now suppose the optimal threshold and the threshold learned through the curriculum are p and q. Then
Σ^θ_{d*} = Σ_{s∈S} d^{π_p}(s) [π_p(s)(1-π_θ(s))² + (1-π_p(s))π_θ(s)²] φ(s)φ(s)ᵀ,
Σ^θ_{d_curl} = Σ_{s∈S} d^{π_q}(s) [½(1-π_θ(s))² + ½π_θ(s)²] φ(s)φ(s)ᵀ,
Σ^θ_{d_naive} = Σ_{s∈S} d^{π_naive}(s) [½(1-π_θ(s))² + ½π_θ(s)²] φ(s)φ(s)ᵀ.
Denote κ_♣(θ) = sup_{x∈R^d} (xᵀ Σ^θ_{d*} x) / (xᵀ Σ^θ_{d_♣} x). From the parameterization we know all φ(s) are orthogonal. Abusing notation by writing π_q for π_curl, we have
κ_♣(θ) = max_{s∈S} [d^{π_p}(s)(π_p(s)(1-π_θ(s))² + (1-π_p(s))π_θ(s)²)] / [d_♣(s)(½(1-π_θ(s))² + ½π_θ(s)²)].
We can consider each s ∈ S separately because of the orthogonal features. Observe that π_p(s) ∈ {0, 1}, so for each s ∈ S, its corresponding term in κ_♣(θ) is maximized when π_θ(s) = 1 - π_p(s), where it equals 2 d^{π_p}(s)/d_♣(s). By definition, κ_♣ = max_{0≤t≤T} E[κ_♣(θ_t)]. Since θ_0 = 0_d, we have κ_♣ ≥ κ_♣(0_d), where π_θ(s) = 1/2 and the corresponding term is d^{π_p}(s)/d_♣(s). So
max_{s∈S} d^{π_p}(s)/d_♣(s) ≤ κ_♣ ≤ 2 max_{s∈S} d^{π_p}(s)/d_♣(s).
We thus have an order-accurate quantity k_♣ = max_{s∈S} d^{π_p}(s)/d_♣(s) for κ_♣. Direct computation gives
k_curl = Π_{j=⌊nq⌋+1}^{⌊np⌋} 1/(1-P_j) if q ≤ p, and k_curl = 1 if q > p;
k_naive = 2^{⌊np⌋} max{1, max_{i≥⌊np⌋+2} Π_{j=⌊np⌋+1}^{i-1} 2(1-P_j)}.
This completes the proof.
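To see the scale of the gap in Thm. 7, the order-accurate quantities can be evaluated for the classical SP (P_1 = 1, P_j = 1/j). The script below is ours and purely illustrative; it takes the integer thresholds ⌊np⌋ and ⌊nq⌋ directly:

```python
from math import prod

def k_curl(kp, kq, P):
    """kp = floor(n*p), kq = floor(n*q). Product of 1/(1-P_j) for j = kq+1..kp;
    the empty product (kq >= kp, i.e. q > p) gives 1, matching the theorem."""
    return prod(1.0 / (1.0 - P(j)) for j in range(kq + 1, kp + 1))

def k_naive(n, kp, P):
    """2^kp * max(1, max over i >= kp+2 of prod_{j=kp+1}^{i-1} 2*(1-P_j))."""
    inner = max([1.0] + [prod(2.0 * (1.0 - P(j)) for j in range(kp + 1, i))
                         for i in range(kp + 2, n + 1)])
    return 2.0 ** kp * inner

P = lambda j: 1.0 / j   # classical SP: candidate j is the best so far w.p. 1/j
```

With n = 100, ⌊np⌋ = 37 and a curriculum threshold ⌊nq⌋ = 30, k_naive exceeds 2^37 while k_curl stays below 2, illustrating the exponential reduction in distribution shift.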

C FULL EXPERIMENTS

Here we present all the experiments not shown in Sec. 7. All experiments were run on a server with an AMD Ryzen 9 3950X CPU, an NVIDIA GeForce 2080 Super GPU, and 128 GB of memory. For the legend description please refer to the caption of Fig. 1. For experiment data (code, checkpoints, logging data and policy visualizations) please refer to the supplementary files.

Policy parameterization. Since all experiments have exactly two actions, we can use φ(s) = φ(s, accept) - φ(s, reject) instead of φ(s, accept) and φ(s, reject). The policy is then π_θ(accept|s) = exp(θᵀφ(s))/(exp(θᵀφ(s)) + 1) and π_θ(reject|s) = 1/(exp(θᵀφ(s)) + 1).

Training schemes. We ran seven experiments in total: four for Secretary Problem and three for Online Knapsack (decision version). The experiments for the same problem differ in the distribution over instances (i.e., {w_m}). In the following subsections we describe in detail how we parameterized the distributions. Within a single experiment, we ran eight setups, each a combination of sampler policy, initialization policy of the final phase, and whether we used regularization. For visual clarity, we did not plot the setups with entropy regularization, but readers can plot them using plot.py (comment L55-58 and uncomment L59-62) in the supplementary files. We list the training schemes in detail in Tab. 1.

Table 1: Detailed setups for each training scheme.

fix samp curl: Fixed-sampler curriculum learning. In the warm-up phase, train a policy π_s from scratch (with zero initialization in parameters) in a small environment E′. In the final phase, switch to the true environment E and use π_s as the sampler policy to train a policy from scratch. Script: run Alg. 2 with samp = pi_s and λ = 0.

fix samp curl reg: The same as fix samp curl, but with entropy regularization in both phases. Script: run Alg. 2 with samp = pi_s and λ = 0.01.

direct: Direct learning. Only the final phase: train a policy from scratch directly in E. Script: run Alg. 1 with θ_0 = 0_d, π_s = None and λ = 0.

direct reg: The same as direct, but with entropy regularization. Script: run Alg. 1 with θ_0 = 0_d, π_s = None and λ = 0.01.

naive samp: Learning with the naïve sampler. Only the final phase: use the naïve random policy as the sampler to train a policy from scratch in E. Script: run Alg. 1 with θ_0 = 0_d, π_s = naïve random policy and λ = 0.

naive samp reg: The same as naive samp, but with entropy regularization. Script: run Alg. 1 with θ_0 = 0_d, π_s = naïve random policy and λ = 0.01.

curl: Curriculum learning. In the warm-up phase, train a policy π_s from scratch in E′. In the final phase, switch to E and continue training π_s. Script: run Alg. 2 with samp = pi_t and λ = 0.

curl reg: The same as curl, but with entropy regularization. Script: run Alg. 2 with samp = pi_t and λ = 0.01.

reference: The reference policy. For SP it is exactly the optimal policy, since it can be computed; for OKD it is a bang-per-buck policy and not the optimal policy (whose exact form is unclear).
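The two-action parameterization used by all schemes above (with φ(s) = φ(s, accept) - φ(s, reject)) can be sketched as follows; the function names are ours:

```python
import numpy as np

def policy(theta, phi_s):
    """Return (pi(accept|s), pi(reject|s)) for the logistic two-action policy."""
    z = np.exp(theta @ phi_s)
    return z / (z + 1.0), 1.0 / (z + 1.0)

def grad_log_policy(theta, phi_s, accept):
    """grad_theta log pi: (1 - pi(s)) * phi(s) for accept, -pi(s) * phi(s) for reject."""
    p_accept, _ = policy(theta, phi_s)
    return (1.0 - p_accept) * phi_s if accept else -p_accept * phi_s
```

At θ = 0_d both actions have probability 1/2, which is exactly the naïve random policy used as a sampler in the naive samp schemes.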

C.1 SECRETARY PROBLEM

State and action spaces. All states with X_i > 1 are treated as identical. To make the problem "scale-invariant", we use i/n to represent i. So the states are (i/n, x_i = 1[X_i = 1]). There is an additional terminal state g = (0, 0). In each state, the agent can either accept or reject.

Transition and reward. Any action in g leads back to g. Once the agent accepts the i-th candidate, the state transits to g, and the reward is 1 if candidate i is the best in the instance. If the agent rejects, the state goes to ((i+1)/n, x_{i+1}) if i < n and to g if i = n. In all other cases the reward is 0.

Feature mapping. Recall that all states are of the form (f, x) where f ∈ [0, 1] and x ∈ {0, 1}. We choose a degree d_0 and construct the feature mapping as the collection of polynomial bases with degree less than d_0 (so d = 2d_0):
φ(f, x) = (1, f, ..., f^{d_0-1}, x, fx, ..., f^{d_0-1}x).

LMDP distribution. We model the distribution as follows: for each i, x_i = 1 with probability P_i, independently across i. By definition P_1 = 1, while the other P_i can be arbitrary. The classical SP satisfies P_i = 1/i. We also experimented on three other distributions (so in total there are four experiments), each generated from numbers p_2, p_3, ..., p_n i.i.d. ∼ Unif[0,1] with P_i = 1/i^{2p_i+0.25}. For each experiment, we ran eight setups, each with a different combination of sampler policy, initialization policy of the final phase, and regularization coefficient λ. For the warm-up phases we set n = 10 and for the final phases n = 100.

Results. Fig. 3 (with its full view Fig. 4), Fig. 5 and Fig. 6, along with Fig. 1 (with seed 2018011309), show the four experiments of SP. They share a learning rate of 0.2, a batch size of 100 per step in the horizon, final n = 100 and warm-up n = 10 (when curriculum learning is applied). The experiment in Fig. 3 was done in the classical SP environment, i.e., all permutations have probability 1/n! of being sampled. The experiments in Fig. 1, Fig. 5 and Fig. 6 were done with other distributions (see LMDP distribution of Sec. 7): the only differences are the random seeds, which we fixed and used to generate the P_i's for reproducibility. The experiment on classical SP was run until the direct training with n = 100 converged, while all other experiments were run to a maximum of 30000 episodes (hence a sample count of T·H·b = 30000 × 100 × 100 = 3 × 10^8). The optimal policy was derived by dynamic programming.
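For reference, the optimal threshold for the classical SP can also be recovered by directly evaluating the success probability of each k-threshold rule; this sketch is ours, not the paper's dynamic program:

```python
def best_threshold(n):
    """Classical SP: rejecting the first k candidates, then accepting the first
    best-so-far candidate, succeeds with probability (k/n) * sum_{i=k+1}^n 1/(i-1)
    (and 1/n for k = 0). Return the maximizing k."""
    def win_prob(k):
        if k == 0:
            return 1.0 / n
        return (k / n) * sum(1.0 / (i - 1) for i in range(k + 1, n + 1))
    return max(range(n), key=win_prob)
```

For n = 100 this returns 37, the familiar ~n/e cutoff, consistent with the p-threshold form assumed in Thm. 7.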

C.2 ONLINE KNAPSACK (DECISION VERSION)

State and action spaces. A state is represented as (i/n, s_i, v_i, Σ_{j=1}^{i-1} x_j s_j / B, Σ_{j=1}^{i-1} x_j v_j / V), where x_j = 1[item j was successfully chosen] for 1 ≤ j ≤ i-1 (in the instance). There is an additional terminal state g = (0, 0, 0, 0, 0). In each state (including g, for simplicity), the agent can either accept or reject.

Transition and reward. The transition is implied by the definition of the problem. Any action in the terminal state g leads back to g. An item is successfully chosen if and only if the agent accepts it and the budget is sufficient. A reward of 1 is given only the first time Σ_{j=1}^{i} x_j v_j ≥ V, after which the state goes to g. In all other cases the reward is 0.

Feature mapping. Suppose the state is (f, s, v, r, q). We choose a degree d_0 and construct the feature mapping as the collection of polynomial bases with degree less than d_0 (so d = d_0^5):
φ(f, s, v, r, q) = (f^{i_f} s^{i_s} v^{i_v} r^{i_r} q^{i_q})_{i_f, i_s, i_v, i_r, i_q}, where each i_♣ ∈ {0, 1, ..., d_0 - 1}.

LMDP distribution. In Sec. 3.2 the values and sizes are sampled from F_v and F_s. When F_v or F_s is not Unif[0,1], we model the distribution as follows: first fix a granularity gran and take gran numbers p_1, p_2, ..., p_gran i.i.d. ∼ Unif[0,1]; p_i represents the (unnormalized) probability that x ∈ ((i-1)/gran, i/gran). To sample, we take i ∼ Multinomial(p_1, p_2, ..., p_gran) and return x ∼ (i - 1 + Unif[0,1])/gran. For each experiment, we ran four setups, each with a different combination of sampler policy and initialization policy of the final phase. For the warm-up phases n = 10 and for the final phases n = 100 in all experiments, while B and V vary. In each experiment, B/n is close between the warm-up and final phases, and V/B increases from warm-up to final.

Results. The experiments in Fig. 7 and Fig. 8 were done with other value and size distributions (see LMDP distribution of Sec. 7): the only differences are the random seeds, which we fixed and used to generate F_v and F_s for reproducibility. All experiments were run to a maximum of 50000 episodes (hence a sample count of T·H·b = 50000 × 100 × 100 = 5 × 10^8). The reference policy is a bang-per-buck algorithm (Sec. 3.1 of Kong et al. (2019)): given a threshold r, accept the i-th item if v_i/s_i ≥ r. We searched for the optimal r with respect to Online Knapsack because we found that in general the reward is unimodal in r and contains no flat region, so we can easily apply ternary search (the reward of OKD does contain flat regions).

D.1 POLICY GRADIENT AND THE NPG UPDATE RULE

According to Agarwal et al. (2021), the unconstrained, full-information NPG update weight satisfies F(θ_t) g_t = ∇_θ V^{t,λ}. Lem. 12 and Lem. 13 together show that this is equivalent to finding a minimizer of the compatible function approximation loss (Def. 2).

Theorem 11 (Policy Gradient Theorem for LMDP). For any policy π_θ parameterized by θ and any 1 ≤ m ≤ M,
∇_θ E_{s_0∼ν_m}[V^{π_θ,λ}_{m,H}(s_0)] = Σ_{h=1}^{H} E_{(s,a)∼d^θ_{m,H-h}}[Q^{π_θ,λ}_{m,h}(s,a) ∇_θ ln π_θ(a|s)].
As a result,
∇_θ V^{π_θ,λ} = Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d^θ_{m,H-h}}[Q^{π_θ,λ}_{m,h}(s,a) ∇_θ ln π_θ(a|s)].

Proof. For any 1 ≤ h ≤ H and s ∈ S, since V^{π_θ,λ}_{m,h}(s) = Σ_{a∈A} π_θ(a|s) Q^{π_θ,λ}_{m,h}(s,a), we have
∇_θ V^{π_θ,λ}_{m,h}(s) = Σ_{a∈A} [Q^{π_θ,λ}_{m,h}(s,a) ∇_θ π_θ(a|s) + π_θ(a|s) ∇_θ Q^{π_θ,λ}_{m,h}(s,a)].
Hence, using ∇_θ π_θ(a|s) = π_θ(a|s) ∇_θ ln π_θ(a|s),
Σ_{h=1}^{H} Σ_{s∈S} d^θ_{m,H-h}(s) ∇_θ V^{π_θ,λ}_{m,h}(s) = Σ_{h=1}^{H} E_{(s,a)∼d^θ_{m,H-h}}[Q^{π_θ,λ}_{m,h}(s,a) ∇_θ ln π_θ(a|s)] + Σ_{h=1}^{H} Σ_{s∈S} d^θ_{m,H-h}(s) Σ_{a∈A} π_θ(a|s) ∇_θ Q^{π_θ,λ}_{m,h}(s,a).
Next we focus on the second term.
From the Bellman equation,
∇_θ Q^{π_θ,λ}_{m,h}(s,a) = ∇_θ [r_m(s,a) - λ ln π_θ(a|s) + Σ_{s'∈S} P(s'|s,a) V^{π_θ,λ}_{m,h-1}(s')] = -λ ∇_θ ln π_θ(a|s) + Σ_{s'∈S} P(s'|s,a) ∇_θ V^{π_θ,λ}_{m,h-1}(s').
In particular, ∇_θ Q^{π_θ,λ}_{m,1}(s,a) = -λ ∇_θ ln π_θ(a|s). So
Σ_{h=1}^{H} Σ_{s∈S} d^θ_{m,H-h}(s) Σ_{a∈A} π_θ(a|s) ∇_θ Q^{π_θ,λ}_{m,h}(s,a)
= -λ Σ_{h=1}^{H} Σ_{s∈S} d^θ_{m,H-h}(s) Σ_{a∈A} ∇_θ π_θ(a|s) [= 0, since Σ_a π_θ(a|s) = 1]
 + Σ_{h=2}^{H} Σ_{s'∈S} ∇_θ V^{π_θ,λ}_{m,h-1}(s') Σ_{s∈S} d^θ_{m,H-h}(s) Σ_{a∈A} π_θ(a|s) P(s'|s,a) [= d^θ_{m,H-h+1}(s')]
= Σ_{h=2}^{H} Σ_{s'∈S} d^θ_{m,H-h+1}(s') ∇_θ V^{π_θ,λ}_{m,h-1}(s')
= Σ_{h=1}^{H} Σ_{s∈S} d^θ_{m,H-h}(s) ∇_θ V^{π_θ,λ}_{m,h}(s) - Σ_{s_0∈S} ν_m(s_0) ∇_θ V^{π_θ,λ}_{m,H}(s_0),
where we used the definitions of d and ν_m. Rearranging the terms completes the proof.

Lemma 12. Suppose Γ ∈ R^{n×m}, D = diag(d_1, d_2, ..., d_m) ∈ R^{m×m} with d_i ≥ 0, and q ∈ R^m. Then x = (ΓDΓᵀ)† ΓDq is a solution of the equation ΓDΓᵀ x = ΓDq.

Proof. Denote D^{1/2} = diag(√d_1, √d_2, ..., √d_m), P = ΓD^{1/2} and p = D^{1/2} q; the equation reduces to PPᵀ x = Pp. Let the singular value decomposition of P be UΣVᵀ, where U ∈ R^{n×n} and V ∈ R^{m×m} are unitary, Σ ∈ R^{n×m}, and the singular values are σ_1, σ_2, ..., σ_k. Then PPᵀ = U(ΣΣᵀ)Uᵀ and (PPᵀ)† = U(ΣΣᵀ)†Uᵀ. Since ΣΣᵀ = diag(σ_1², σ_2², ..., σ_k², 0, ..., 0) ∈ R^{n×n}, the pseudo-inverse of this particular diagonal matrix is (ΣΣᵀ)† = diag(σ_1^{-2}, σ_2^{-2}, ..., σ_k^{-2}, 0, ..., 0), and it is easy to verify that (ΣΣᵀ)(ΣΣᵀ)†Σ = Σ. Finally,
PPᵀ x = (PPᵀ)[(PPᵀ)† Pp] = U(ΣΣᵀ)Uᵀ U(ΣΣᵀ)†Uᵀ UΣVᵀ p = U(ΣΣᵀ)(ΣΣᵀ)†ΣVᵀ p = UΣVᵀ p = Pp.
This completes the proof.

Lemma 13 (NPG Update Rule). The update rule θ ← θ + η F(θ)† ∇_θ V^{π_θ,λ}, where
F(θ) = Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d^θ_{m,H-h}}[∇_θ ln π_θ(a|s) (∇_θ ln π_θ(a|s))ᵀ],
is equivalent to θ ← θ + η g*, where g* is a minimizer of
L(g) = Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d^θ_{m,H-h}}[(A^{π_θ,λ}_{m,h}(s,a) - gᵀ ∇_θ ln π_θ(a|s))²].

Proof.
∇_g L(g) = -2 Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d^θ_{m,H-h}}[(A^{π_θ,λ}_{m,h}(s,a) - gᵀ ∇_θ ln π_θ(a|s)) ∇_θ ln π_θ(a|s)].
For any minimizer g* of L(g) we have ∇_g L(g*) = 0, hence
Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d^θ_{m,H-h}}[(g*ᵀ ∇_θ ln π_θ(a|s)) ∇_θ ln π_θ(a|s)] = Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d^θ_{m,H-h}}[A^{π_θ,λ}_{m,h}(s,a) ∇_θ ln π_θ(a|s)] = Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d^θ_{m,H-h}}[Q^{π_θ,λ}_{m,h}(s,a) ∇_θ ln π_θ(a|s)],
where the last equality holds because the state-only baseline V contributes 0 (Σ_{a} ∇_θ π_θ(a|s) = 0). Since (uᵀv)v = (vvᵀ)u, this gives F(θ) g* = ∇_θ V^{π_θ,λ} by Thm. 11. Now assign indices 1, 2, ..., MHSA to all tuples (m, h, s, a) ∈ {1, ..., M} × {1, ..., H} × S × A, and set γ_j = ∇_θ ln π_θ(a|s), d_j = w_m d^θ_{m,H-h}(s,a), q_j = Q^{π_θ,λ}_{m,h}(s,a), where j is the index assigned to (m, h, s, a). Then F(θ) = ΓDΓᵀ and ∇_θ V^{π_θ,λ} = ΓDq, where Γ = [γ_1, γ_2, ..., γ_{MHSA}] ∈ R^{d×MHSA}, D = diag(d_1, d_2, ..., d_{MHSA}) ∈ R^{MHSA×MHSA} and q = (q_1, q_2, ..., q_{MHSA})ᵀ ∈ R^{MHSA}. We conclude the proof by applying Lem. 12.
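Lem. 12's claim can be checked numerically on a rank-deficient instance: the pseudo-inverse solution still satisfies the normal equations exactly, even when the Fisher matrix is singular. A small NumPy sketch (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6
Gamma = rng.standard_normal((n, m))
Gamma[-1] = Gamma[0]                      # make Gamma (hence F) rank-deficient
D = np.diag(rng.uniform(0.1, 1.0, m))     # nonnegative weights d_i
q = rng.standard_normal(m)

F = Gamma @ D @ Gamma.T                   # plays the role of F(theta) in Lem. 13
b = Gamma @ D @ q                         # plays the role of grad_theta V
x = np.linalg.pinv(F) @ b                 # x = (Gamma D Gamma^T)^† Gamma D q
```

Here b lies in the range of F by construction, so F @ x reproduces b, as Lem. 12 asserts.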

D.2 AUXILIARY LEMMAS USED IN THE MAIN RESULTS

Lemma 14 (Performance Difference Lemma). For any two policies π_1 and π_2 and any 1 ≤ m ≤ M,
E_{s_0∼ν_m}[V^{π_1,λ}_{m,H}(s_0) - V^{π_2,λ}_{m,H}(s_0)] = Σ_{h=1}^{H} E_{(s,a)∼d^{π_1}_{m,H-h}}[A^{π_2,λ}_{m,h}(s,a) + λ ln(π_2(a|s)/π_1(a|s))].
As a result,
V^{π_1,λ} - V^{π_2,λ} = Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d^{π_1}_{m,H-h}}[A^{π_2,λ}_{m,h}(s,a) + λ ln(π_2(a|s)/π_1(a|s))].

Proof. First fix s_0. By the definition of the value function,
V^{π_1,λ}_{m,H}(s_0) - V^{π_2,λ}_{m,H}(s_0)
= E[Σ_{h=0}^{H-1} (r_m(s_h,a_h) - λ ln π_1(a_h|s_h)) | M_m, π_1, s_0] - V^{π_2,λ}_{m,H}(s_0)
= E[Σ_{h=0}^{H-1} (r_m(s_h,a_h) - λ ln π_1(a_h|s_h) + V^{π_2,λ}_{m,H-1-h}(s_{h+1}) - V^{π_2,λ}_{m,H-h}(s_h)) | M_m, π_1, s_0]
= E[Σ_{h=0}^{H-1} E[r_m(s_h,a_h) - λ ln π_2(a_h|s_h) + V^{π_2,λ}_{m,H-1-h}(s_{h+1}) | M_m, π_2, s_h, a_h] | M_m, π_1, s_0]
 + E[Σ_{h=0}^{H-1} (-V^{π_2,λ}_{m,H-h}(s_h) + λ ln(π_2(a_h|s_h)/π_1(a_h|s_h))) | M_m, π_1, s_0],
where the last step uses the law of iterated expectations. Since
E[r_m(s_h,a_h) - λ ln π_2(a_h|s_h) + V^{π_2,λ}_{m,H-1-h}(s_{h+1}) | M_m, π_2, s_h, a_h] = Q^{π_2,λ}_{m,H-h}(s_h,a_h),
we have
V^{π_1,λ}_{m,H}(s_0) - V^{π_2,λ}_{m,H}(s_0) = E[Σ_{h=0}^{H-1} (Q^{π_2,λ}_{m,H-h}(s_h,a_h) - V^{π_2,λ}_{m,H-h}(s_h) + λ ln(π_2(a_h|s_h)/π_1(a_h|s_h))) | M_m, π_1, s_0] = E[Σ_{h=0}^{H-1} (A^{π_2,λ}_{m,H-h}(s_h,a_h) + λ ln(π_2(a_h|s_h)/π_1(a_h|s_h))) | M_m, π_1, s_0].
Taking expectation over s_0 ∼ ν_m,
E_{s_0∼ν_m}[V^{π_1,λ}_{m,H}(s_0) - V^{π_2,λ}_{m,H}(s_0)] = Σ_{h=0}^{H-1} Σ_{(s,a)∈S×A} d^{π_1}_{m,h}(s,a) [A^{π_2,λ}_{m,H-h}(s,a) + λ ln(π_2(a|s)/π_1(a|s))].
The proof is completed by reversing the order of h.

Lemma 15 (Lyapunov Drift). Recall the definitions in Def. 8 and 5. We have
Φ(π_{t+1}) - Φ(π_t) ≤ -ηλ Φ(π_t) + η err_t - η(V^{*,λ} - V^{t,λ}) + η²B²||g_t||²_2/2.

Proof. Denote Φ_t := Φ(π_t). This proof follows the same lines as that of Lem. 6 in Cayci et al. (2021). By smoothness (see Rem. 6.7 in Agarwal et al. (2021)),
ln(π_t(a|s)/π_{t+1}(a|s)) ≤ (θ_t - θ_{t+1})ᵀ ∇_θ ln π_t(a|s) + (B²/2)||θ_{t+1} - θ_t||²_2 = -η g_tᵀ ∇_θ ln π_t(a|s) + η²B²||g_t||²_2/2.
By the definition of Φ,
Φ_{t+1} - Φ_t = Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d*_{m,H-h}}[ln(π_t(a|s)/π_{t+1}(a|s))] ≤ -η Σ_{m=1}^{M} w_m Σ_{h=1}^{H} E_{(s,a)∼d*_{m,H-h}}[g_tᵀ ∇_θ ln π_t(a|s)] + η²B²||g_t||²_2/2.
By the definition of err_t, Lem. 14 and again the definition of Φ, we finally have
Φ_{t+1} - Φ_t ≤ η Σ_{m} w_m Σ_{h} E_{d*_{m,H-h}}[A^{t,λ}_{m,h}(s,a) - g_tᵀ ∇_θ ln π_t(a|s)] - η Σ_{m} w_m Σ_{h} E_{d*_{m,H-h}}[A^{t,λ}_{m,h}(s,a) + λ ln(π_t(a|s)/π*(a|s))] - ηλ Σ_{m} w_m Σ_{h} E_{d*_{m,H-h}}[ln(π*(a|s)/π_t(a|s))] + η²B²||g_t||²_2/2
= η err_t - η(V^{*,λ} - V^{t,λ}) - ηλ Φ_t + η²B²||g_t||²_2/2,
which completes the proof.

Lemma 16. Recall that g_t is the true minimizer of L(g; θ_t, d^t) over the domain G, while ĝ_t is the update weight used by the algorithm. The quantity err_t defined in Def. 5 satisfies
err_t ≤ √(H L(g_t; θ_t, d*)) + √(Hκ (L(ĝ_t; θ_t, d^t) - L(g_t; θ_t, d^t))).

Proof. The proof is similar to that of Thm. 6.1 in Agarwal et al. (2021). We decompose err_t as
err_t = Σ_{m=1}^{M} w_m Σ_{h=0}^{H-1} E_{(s,a)∼d*_{m,h}}[A^{t,λ}_{m,h}(s,a) - g_tᵀ ∇_θ ln π_t(a|s)] (term x)
 + Σ_{m=1}^{M} w_m Σ_{h=0}^{H-1} E_{(s,a)∼d*_{m,h}}[(g_t - ĝ_t)ᵀ ∇_θ ln π_t(a|s)] (term y).
Since Σ_{m=1}^{M} w_m Σ_{h=0}^{H-1} Σ_{(s,a)∈S×A} d*_{m,h}(s,a) = H, we can normalize the coefficients and apply Jensen's inequality to obtain x ≤ √(H L(g_t; θ_t, d*)). The term y is bounded in the continuation of this proof at the end of this appendix, which yields the claim.

Lemma 19 (Loss Function Concentration). If π_s = None and U ≥ ln|A| - 1, then with probability at least 1 - 2(T+1) exp(-2Nε²/C²), the update weight sequence of Alg. 1 satisfies: for any 0 ≤ t ≤ T,
L(ĝ_t; θ_t, d^{θ_t}) - L(g_t; θ_t, d^{θ_t}) ≤ 2ε + 8λGB|A|/e^{U+1},
where C = 16HGB[1 + λU + H(1 + λ ln|A|)] + 4HG²B². If π_s ≠ None and λ = 0, then with probability at least 1 - 2(T+1) exp(-2Nε²/C²), the update weight sequence of Alg. 1 satisfies: for any 0 ≤ t ≤ T,
L(ĝ_t; θ_t, d^{π_s}) - L(g_t; θ_t, d^{π_s}) ≤ 2ε,
where C = 16H²GB + 4HG²B².

Proof. We first prove the π_s = None case. At time step t, Alg. 1 samples HN trajectories. Abusing notation, denote
F̂_t = (1/N) Σ_{n=1}^{N} Σ_{h=0}^{H-1} ∇_θ ln π_θ(a_{n,h}|s_{n,h}) (∇_θ ln π_θ(a_{n,h}|s_{n,h}))ᵀ. (2)
To apply a standard concentration inequality, we next calculate the expectation of the estimated loss. By Monte Carlo sampling and Lem. 18, for any 1 ≤ m ≤ M, 1 ≤ h ≤ H and (s,a) ∈ S × A, we have
A^{t,λ}_{m,h}(s,a) - λ|A|/e^{U+1} ≤ E[Â^{t,λ}_{m,h}(s,a)] ≤ A^{t,λ}_{m,h}(s,a).
Denote by ∇_t the exact policy gradient at time step t. Then
|E[gᵀ∇̂_t] - gᵀ∇_t| ≤ ||g||_2 ||E[∇̂_t] - ∇_t||_2 ≤ ||g||_2 · H ||∇_θ ln π_θ(a|s)||_2 ||E[Â(s,a)] - A(s,a)||_∞ ≤ 2λGB|A|/e^{U+1}.
Since Monte Carlo sampling correctly estimates the state-action visitation distribution, E[F̂_t] = F(θ_t). Noticing that gᵀF̂_t g is linear in the entries of F̂_t, we have E[gᵀF̂_t g] = gᵀF(θ_t)g. We are now in a position to show that |E[L̂(g)] - L(g)| ≤ 4λGB|A|/e^{U+1}. Hoeffding's inequality (Lem. 17) gives
P(|L̂(g) - E[L̂(g)]| ≥ ε) ≤ 2 exp(-2Nε²/C²),
where, from Eq. (2), C = 16HGB[1 + λU + H(1 + λ ln|A|)] + 4HG²B². Applying a union bound over all t, with probability at least 1 - 2(T+1) exp(-2Nε²/C²), the claimed bound follows for every 0 ≤ t ≤ T. For the case π_s ≠ None and λ = 0, we notice that |Â| ≤ 2H and hence -8H²GB ≤ y ≤ 8H²GB + 4HG²B². Moreover, E[Â^{t,λ}_{m,h}(s,a)] = A^{t,λ}_{m,h}(s,a). Slightly modifying the proof yields the result.



All four trainings shown in the figures have counterparts with regularization (λ = 0.01). Check the supplementary files and use TensorBoard for visualization.



Remark 4. (a) This is the first such result for LMDPs with sample-based NPG and entropy regularization. (b) For any fixed λ > 0 we obtain linear convergence, which matches the result for discounted infinite-horizon MDPs (Thm. 1 in Cayci et al. (2021)); as λ tends to 0, the bound tends to O(1/(ηT) + η), which implies an O(1/√T) rate.
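The rate claim can be made explicit by optimizing the bound O(1/(ηT) + η) over the step size (a standard calculation, included for completeness):

```latex
\min_{\eta > 0}\left(\frac{1}{\eta T} + \eta\right) = \frac{2}{\sqrt{T}},
\qquad \text{attained at } \eta = \frac{1}{\sqrt{T}},
```

so choosing η = Θ(1/√T) turns the λ → 0 limit of the bound into an O(1/√T) convergence rate.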

Figure 1: One experiment of SP. The x-axis is the number of trajectories, i.e., number of episodes × horizon × batch size. Dashed lines represent final-phase-only training and solid lines represent curriculum learning. The shaded area shows the 95% confidence interval for the expectation. The explanation of the different modes can be found in Sec. 5. The reference policy is the optimal threshold policy.

Figure 2: One experiment of OKD. The legend description is the same as that of Fig. 1. The reference policy is the bang-per-buck algorithm for Online Knapsack (Sec. 3.1 of Kong et al. (2019)).

Figure 3: Classical SP, truncated to 3 × 10 8 samples.

Figure 4: Classical SP, full view.

Figure 5: SP, with seed 20000308.

Figure 6: SP, with seed 19283746.

Fig. 7 and Fig. 8, along with Fig. 2 (with F_v = F_s = Unif[0,1]), show the three experiments of OKD. They share a learning rate of 0.1, a batch size of 100 per step in the horizon, final n = 100 and warm-up n = 10 (when curriculum learning is applied).

Figure 7: OKD, with seed 2018011309.

Figure 8: OKD, with seed 20000308.



BOUNDING ε_stat

Lemma 17 (Hoeffding's Inequality). Suppose X_1, X_2, ..., X_n are i.i.d. random variables taking values in [a, b], with expectation μ. Let X̄ denote their average. Then for any ε ≥ 0,
P(|X̄ - μ| ≥ ε) ≤ 2 exp(-2nε²/(b - a)²).

Lemma 18. For any policy π, any state s ∈ S and any U ≥ ln|A| - 1,
0 ≤ Σ_{a∈A} π(a|s) [ln(1/π(a|s)) - min{ln(1/π(a|s)), U}] ≤ |A|/e^{U+1}.

Proof. The first inequality is straightforward, so we focus on the second. Set Ā = {a ∈ A : ln(1/π(a|s)) > U} = {a ∈ A : π(a|s) < e^{-U}} and p = Σ_{a∈Ā} π(a|s). By the concavity of ln x and Jensen's inequality,
Σ_{a∈Ā} π(a|s)[ln(1/π(a|s)) - U] ≤ p ln(|A|/p) - pU.
Let f(p) = p ln(|A|/p) - pU; then f'(p) = ln|A| - U - 1 - ln p. Recall that U ≥ ln|A| - 1, so f increases on (0, |A|/e^{U+1}) and decreases on (|A|/e^{U+1}, 1). Since f(|A|/e^{U+1}) = |A|/e^{U+1}, we complete the proof.

Lemma 19 (Loss Function Concentration). If π_s = None and U ≥ ln|A| - 1, then with probability at least 1 - 2(T+1) exp(-2Nε²/C²), the update weight sequence of Alg. 1 satisfies: for any 0 ≤ t ≤ T, L(ĝ_t; θ_t, d^{θ_t}) - L(g_t; θ_t, d^{θ_t}) ≤ 2ε + 8λGB|A|/e^{U+1}, where C = 16HGB[1 + λU + H(1 + λ ln|A|)] + 4HG²B².
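Lem. 17 can be sanity-checked empirically; the simulation below is ours and compares the violation frequency of Unif[0,1] sample means against the bound:

```python
import random
from math import exp

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Two-sided Hoeffding bound 2 * exp(-2 * n * eps^2 / (b - a)^2)."""
    return 2.0 * exp(-2.0 * n * eps ** 2 / (b - a) ** 2)

random.seed(0)
n, eps, trials = 200, 0.1, 2000
violations = sum(
    abs(sum(random.random() for _ in range(n)) / n - 0.5) >= eps
    for _ in range(trials)
)
# The empirical frequency violations / trials stays below hoeffding_bound(n, eps).
```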

In this section, we present the algorithms skipped in the main text. Alg. 3 is the full version of Alg. 1, and Alg. 4 is the sampling function.

Algorithm 4 (Sample; fragment). Returns m ∼ Multinomial(w_1, ..., w_M), a state s, an action a ∼ Unif(A) if unif = True and a ∼ π_samp(·|s) otherwise, and an estimate of A^{t,λ}_{m,H-h}(s, a).
1: Input: environment E; sampler policy π_samp; whether to sample uniform actions after the state, unif; current policy π_t; time step h; regularization coefficient λ; entropy clip bound U.
...
Sample action a_i ∼ π_samp(·|s_i) and E.execute(a_i).


[Continuation of the proof of Lem. 16.] By Cauchy-Schwarz,
y = Σ_{m} w_m Σ_{h} E_{(s,a)∼d*_{m,h}}[(g_t - ĝ_t)ᵀ ∇_θ ln π_t(a|s)] ≤ (i) √(Hκ) ||g_t - ĝ_t||_{Σ_t},
where in (i), for a vector v, we denote ||v||_A = √(vᵀAv) for a symmetric positive semi-definite matrix A. Since g_t minimizes L(g; θ_t, d^t) over the set G, the first-order optimality condition implies (g - g_t)ᵀ ∇_g L(g_t; θ_t, d^t) ≥ 0 for any g ∈ G. Therefore,
L(g; θ_t, d^t) - L(g_t; θ_t, d^t) = ||g_t - g||²_{Σ_t} + (g - g_t)ᵀ ∇_g L(g_t; θ_t, d^t) ≥ ||g_t - g||²_{Σ_t}.
So finally
err_t ≤ √(H L(g_t; θ_t, d*)) + √(Hκ (L(ĝ_t; θ_t, d^t) - L(g_t; θ_t, d^t))).
This completes the proof.


[Continuation of the proof of Lem. 19.] With probability at least 1 - 2(T+1) exp(-2Nε²/C²), for all g ∈ G and 0 ≤ t ≤ T,
|L̂(g; θ_t, d^{θ_t}) - L(g; θ_t, d^{θ_t})| ≤ ε + 4λGB|A|/e^{U+1}.
Hence
L(ĝ_t; θ_t, d^{θ_t}) ≤ L̂(ĝ_t; θ_t, d^{θ_t}) + ε + 4λGB|A|/e^{U+1} ≤ L̂(g_t; θ_t, d^{θ_t}) + ε + 4λGB|A|/e^{U+1} ≤ L(g_t; θ_t, d^{θ_t}) + 2ε + 8λGB|A|/e^{U+1},
where the middle step uses that ĝ_t minimizes the empirical loss L̂.

