UNDERSTANDING CURRICULUM LEARNING IN POLICY OPTIMIZATION FOR ONLINE COMBINATORIAL OPTIMIZATION

Abstract

In recent years, reinforcement learning (RL) has started to show promising results in tackling combinatorial optimization (CO) problems, particularly when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, the theoretical study of why RL helps is still in its early stages. This paper presents the first systematic study of policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Secretary Problem, we formally prove that the distribution shift is reduced exponentially with curriculum learning, even if the curriculum is randomly generated. Our theory also shows that the curriculum learning scheme used in prior work can be simplified from multi-step to single-step. Lastly, we provide extensive experiments on the Secretary Problem and Online Knapsack to verify our findings.

1. INTRODUCTION

In recent years, machine learning techniques have shown promising results in solving combinatorial optimization (CO) problems, including the traveling salesman problem (TSP; Kool et al., 2019), maximum cut (Khalil et al., 2017), and the satisfiability problem (Selsam et al., 2019). While some CO problems are NP-hard in the worst case, in practice the probability that we need to solve a worst-case problem instance is low (Cappart et al., 2021). Machine learning techniques are able to find generic models that perform exceptionally well on the majority of a class of CO problems. A significant subclass of CO problems, online CO problems, has gained much attention (Grötschel et al., 2001; Huang, 2019; Garg et al., 2008). Online CO problems entail a sequential decision-making process, which naturally matches reinforcement learning (RL). This paper concerns using RL to tackle online CO problems. RL is often coupled with specialized techniques, including (a particular type of) Curriculum Learning (Kong et al., 2019), human feedback and correction (Pérez-Dattari et al., 2018; Scholten et al., 2019), and policy aggregation (boosting; Brukhim et al., 2021). Practitioners use these techniques to accelerate training. While these hybrid techniques enjoy empirical success, the theoretical understanding is still limited: it is unclear when and why they improve performance. In this paper, we particularly focus on RL with Curriculum Learning (Bengio et al., 2009; also named "bootstrapping" in Kong et al., 2019): train the agent on an easy task and gradually increase the difficulty up to the target task. Interestingly, these techniques exploit the special structures of online CO problems.

Main contributions. In this paper, we initiate the formal study of using RL to tackle online CO problems, with a particular emphasis on understanding the specialized techniques developed in this emerging subarea.
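As a concrete illustration of the "easy task first" scheme on the Secretary Problem, consider a threshold policy that skips a c-fraction of candidates and then accepts the first record. The sketch below is our own illustrative stand-in, not the algorithm of Kong et al. (2019): the helper names (`secretary_success`, `curriculum_train`) and the coarse-grid-then-local-refinement training loop are hypothetical, and only the fraction-based parameterization (which transfers across problem sizes n) is taken from the text.

```python
import random

def secretary_success(c, n, trials=2000, seed=0):
    """Monte Carlo estimate of the success probability of the threshold
    policy: skip the first floor(c*n) candidates, then accept the first
    candidate better than everything seen so far."""
    rng = random.Random(seed)
    wins, k = 0, int(c * n)
    for _ in range(trials):
        vals = [rng.random() for _ in range(n)]
        best_seen = max(vals[:k]) if k else float("-inf")
        # first candidate after the skip phase that beats the prefix, if any
        chosen = next((v for v in vals[k:] if v > best_seen), None)
        wins += chosen == max(vals)  # success iff it is the overall best
    return wins / trials

def curriculum_train(sizes, grid):
    """Single-step curriculum: solve the easy small-n task by a coarse grid
    search, then only refine the threshold locally at each larger size.
    Because the policy is parameterized by the fraction c, it transfers
    directly across sizes n."""
    c = max(grid, key=lambda x: secretary_success(x, sizes[0]))
    for n in sizes[1:]:
        c = max((c - 0.05, c, c + 0.05),
                key=lambda x: secretary_success(x, n))
    return c

c_star = curriculum_train(sizes=[10, 100], grid=[i / 20 for i in range(1, 20)])
```

For large n the estimated optimum lands near the classical 1/e ≈ 0.368 threshold; the point here is only that the small-n solution already provides a warm start for the target size.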
Our contributions are summarized below.

• Formalization. For online CO problems, we want to learn a single policy that enjoys good performance over a distribution of problem instances. This motivates us to use the Latent Markov Decision Process (LMDP) (Kwon et al., 2021a) instead of the standard MDP formulation. We give concrete examples, the Secretary Problem (SP) and Online Knapsack, to show how an LMDP models online CO problems. With this formulation, we can systematically analyze RL algorithms.

• Provable efficiency of policy optimization. By leveraging recent theory on Natural Policy Gradient (NPG) for standard MDPs (Agarwal et al., 2021), we analyze the performance of NPG for LMDPs. The performance bound is characterized by the number of iterations, the excess risk of policy evaluation, the transfer error, and the relative condition number κ, which characterizes the distribution shift between the sampling policy and the optimal policy. To our knowledge, this is the first performance bound for policy optimization methods on LMDPs.

• Understanding and simplifying Curriculum Learning. Using our performance guarantee for NPG on LMDPs, we study when and why Curriculum Learning benefits RL for online CO problems. Our main finding is that the principal effect of Curriculum Learning is to produce a stronger sampling policy. Under certain circumstances, Curriculum Learning reduces the relative condition number κ, improving the convergence rate. For the Secretary Problem, we provably show that Curriculum Learning can exponentially reduce κ compared with the naïve sampling policy. Surprisingly, this means even a random curriculum for SP accelerates training exponentially. As a direct implication, we show that the multi-step Curriculum Learning proposed in Kong et al. (2019) can be significantly simplified into a single-step scheme.
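To make the formalization concrete, the Secretary Problem can be sketched as an LMDP-style environment: each episode first draws a latent variable (the arrival permutation) that the agent never observes directly; the agent only sees the length-independent features (the fraction i/n and a best-so-far flag) and makes irrevocable accept/reject decisions. The class and method names below are illustrative stand-ins, not from any referenced codebase.

```python
import random

class SecretaryLMDP:
    """Secretary Problem as a latent MDP: the latent context is the arrival
    permutation, drawn once per episode; observations are the fraction i/n
    and a flag saying whether candidate i is the best seen so far."""

    def __init__(self, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)

    def reset(self):
        self.perm = self.rng.sample(range(self.n), self.n)  # latent context
        self.i = 0
        return self._obs()

    def _obs(self):
        is_record = self.perm[self.i] == max(self.perm[: self.i + 1])
        return ((self.i + 1) / self.n, is_record)

    def step(self, accept):
        """Irrevocable decision; returns (obs, reward, done).
        Reward 1 iff the accepted (or forced last) candidate is the best."""
        if accept or self.i == self.n - 1:
            return None, float(self.perm[self.i] == self.n - 1), True
        self.i += 1
        return self._obs(), 0.0, False
```

Note that the classical 1/e rule is expressible purely in these observations (accept the first record once the fraction exceeds 1/e), so a single history-independent policy transfers across latent instances and across n.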
Lastly, to obtain a complete understanding, we study the failure mode of Curriculum Learning, helping practitioners decide whether to use it based on their prior knowledge. To verify our theory, we conduct extensive experiments on two classical online CO problems (Secretary Problem and Online Knapsack) and carefully track the dependency between the performance of the policy and κ.
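As a hedged illustration of why the sampling policy matters, consider how often each policy even reaches a late decision state of SP. A uniformly random accept/reject policy reaches state i only after i−1 rejections, i.e. with probability 2^{−(i−1)}, while a curriculum-style threshold policy reaches all early states with probability 1. The ratio computed below is only a crude Monte Carlo proxy for the distribution-shift quantity κ discussed in the text, not its formal definition; all function names are ours.

```python
import random

def reach_frac(policy, n, state_i, trials=5000, seed=0):
    """Fraction of episodes in which the agent is still deciding when
    candidate state_i (1-based) arrives, i.e. it rejected all earlier ones."""
    rng = random.Random(seed)
    reached = 0
    for _ in range(trials):
        for i in range(1, n + 1):
            if i == state_i:
                reached += 1
                break
            if policy(i, n, rng):  # accepted earlier: state_i never visited
                break
    return reached / trials

def random_policy(i, n, rng):     # naive sampler: coin-flip accept
    return rng.random() < 0.5

def threshold_policy(i, n, rng):  # curriculum-style: reject all early states
    return i > int(0.37 * n)

n, s = 100, 12
shift = (reach_frac(threshold_policy, n, s)
         / max(reach_frac(random_policy, n, s), 1e-9))
```

Here `shift` is on the order of 2^{s−1}: the naive sampler almost never visits the states the optimal policy lives in, mirroring the exponential gap in κ that the theory attributes to curriculum learning.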

2. RELATED WORK

RL for CO. There is a rich literature studying RL for CO problems, e.g., using Pointer Networks with REINFORCE and Actor-Critic for routing problems (Nazari et al., 2018), combining Graph Attention Networks with Monte Carlo Tree Search for TSP (Drori et al., 2020), and incorporating Structure-to-Vector Networks in Deep Q-networks for maximum independent set problems (Cappart et al., 2019). Bello et al. (2017) proposed a framework to tackle CO problems using RL and neural networks. Kool et al. (2019) combined REINFORCE with an attention technique to learn routing problems. Vesselinova et al. (2020) and Mazyavkina et al. (2021) are taxonomic surveys of RL approaches for graph problems. Bengio et al. (2020) summarized learning methods, algorithmic structures, and objective design, and discussed generalization; in particular, scaling to larger problems was mentioned as a major challenge. Compared to supervised learning, RL not only mimics existing heuristics but can also discover novel ones that humans have not thought of, for example in chip design (Mirhoseini et al., 2021) and compiler optimization (Zhou et al., 2020b). Kong et al. (2019) focused on using RL to tackle online CO problems, in which the agent must make sequential and irrevocable decisions. They encoded the input in a length-independent manner: for example, the i-th element of an n-length sequence is encoded by the fraction i/n and other features, so that the agent can generalize to unseen n, paving the way for Curriculum Learning. Three online CO problems were mentioned in their paper: Online Matching, Online Knapsack, and the Secretary Problem (SP). Currently, Online Matching and Online Knapsack have only approximation algorithms (Huang et al., 2019; Albers et al., 2021). There are also other works on RL for online CO problems: Alomrani et al. (2021) use deep RL for Online Matching, and Oren et al. (2021) study the Parallel Machine Scheduling Problem (PMSP) and the Capacitated Vehicle Routing Problem (CVRP), both online CO problems, using offline learning and Monte Carlo Tree Search.

LMDP. We provide the exact definition of an LMDP in Sec. 4.1. As studied by Steimle et al. (2021), in general, optimal policies for LMDPs are history-dependent. This differs from the standard MDP setting, where there always exists an optimal history-independent policy. They showed that even finding the optimal history-independent policy is NP-hard. Kwon et al. (2021a) investigated the sample complexity and regret bounds of LMDPs in the history-independent policy class. They presented an exponential lower bound for general LMDPs and derived algorithms with polynomial sample complexity under special assumptions. Kwon et al. (2021b) showed that in reward-mixing MDPs, where the MDPs share the same transition model, a polynomial sample complexity is achievable without any assumption for finding an optimal history-independent policy.

