UNDERSTANDING CURRICULUM LEARNING IN POLICY OPTIMIZATION FOR ONLINE COMBINATORIAL OPTIMIZATION

Abstract

In recent years, reinforcement learning (RL) has started to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, the theoretical study of why RL helps is still in its early stage. This paper presents the first systematic study of policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as Latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Secretary Problem, we formally prove that the distribution shift is reduced exponentially with curriculum learning, even if the curriculum is randomly generated. Our theory also shows that the curriculum learning scheme used in prior work can be simplified from multi-step to single-step. Lastly, we provide extensive experiments on the Secretary Problem and Online Knapsack to verify our findings.

1. INTRODUCTION

In recent years, machine learning techniques have shown promising results in solving combinatorial optimization (CO) problems, including the traveling salesman problem (TSP, Kool et al. (2019)), maximum cut (Khalil et al., 2017), and the satisfiability problem (Selsam et al., 2019). While some CO problems are NP-hard in the worst case, in practice the probability that we need to solve a worst-case problem instance is low (Cappart et al., 2021). Machine learning techniques are able to find generic models that achieve exceptional performance on the majority of instances within a class of CO problems.

A significant subclass of CO problems, online CO problems, has gained much attention (Grötschel et al., 2001; Huang, 2019; Garg et al., 2008). Online CO problems entail a sequential decision-making process, which perfectly matches the nature of reinforcement learning (RL). While these hybrid techniques enjoy empirical success, the theoretical understanding is still limited: it is unclear when and why they improve performance. In this paper, we focus in particular on RL with Curriculum Learning (Bengio et al. (2009), also named "bootstrapping" in Kong et al. (2019)): train the agent on an easy task and gradually increase the difficulty until reaching the target task. Interestingly, these techniques exploit the special structures of online CO problems.

Main contributions. In this paper, we initiate the formal study of using RL to tackle online CO problems, with a particular emphasis on understanding the specialized techniques developed in this emerging subarea. Our contributions are summarized below.

• Formalization. For online CO problems, we want to learn a single policy that enjoys good performance over a distribution of problem instances. This motivates us to use Latent Markov Decision Processes (LMDPs).
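As a concrete illustration of the sequential decision-making structure of online CO, consider the classical Secretary Problem: candidates arrive in random order, each must be accepted or rejected on arrival, and the goal is to pick the overall best. The sketch below simulates the well-known threshold rule (observe roughly n/e candidates, then accept the first one that beats all seen so far); the function names and parameters are our own illustrative choices, not from the paper.

```python
import random

def secretary_threshold_policy(values, k):
    """Classical threshold rule: reject the first k candidates, then
    accept the first later candidate better than all seen so far."""
    best_seen = max(values[:k]) if k > 0 else float("-inf")
    for i in range(k, len(values)):
        if values[i] > best_seen:
            return i  # accept candidate i
    return len(values) - 1  # forced to take the last candidate

def success_rate(n, k, trials=20000, seed=0):
    """Monte Carlo estimate of P(policy selects the overall best)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        values = rng.sample(range(10 * n), n)  # distinct values, random order
        pick = secretary_threshold_policy(values, k)
        wins += values[pick] == max(values)
    return wins / trials

# With k ≈ n/e, the success probability approaches 1/e ≈ 0.368.
rate = success_rate(n=50, k=18)  # 50/e ≈ 18.4
```

The RL view treats the accept/reject choice at each arrival as an action; the curriculum studied in the paper trains on short horizons (small n) before longer ones.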



This paper concerns using RL to tackle online CO problems. RL is often coupled with specialized techniques, including (a particular type of) Curriculum Learning (Kong et al., 2019), human feedback and correction (Pérez-Dattari et al., 2018; Scholten et al., 2019), and policy aggregation (boosting, Brukhim et al., 2021). Practitioners use these techniques to accelerate training.

