POPULATION-BASED REINFORCEMENT LEARNING FOR COMBINATORIAL OPTIMIZATION PROBLEMS

Abstract

Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of complementary policies, and obtains state-of-the-art RL results on three popular NP-hard problems: the traveling salesman (TSP), capacitated vehicle routing (CVRP), and 0-1 knapsack (KP) problems. On TSP specifically, Poppy outperforms the previous state-of-the-art, dividing the optimality gap by 5 while reducing the inference time by more than an order of magnitude.

1. INTRODUCTION

In recent years, machine learning (ML) approaches have overtaken algorithms that use handcrafted features and strategies across a variety of challenging tasks (Mnih et al., 2015; van den Oord et al., 2016; Silver et al., 2017; Brown et al., 2020). In particular, solving combinatorial optimization (CO) problems, where the maxima or minima of an objective function acting on a finite set of discrete variables are sought, has attracted significant interest (Bengio et al., 2021) due to both their (often NP-)hard nature and their numerous practical applications, in domains ranging from logistics (Sbihi & Eglese, 2007) to fundamental science (Wagner, 2020). As the search space of feasible solutions typically grows exponentially with the problem size, exact solvers can be challenging to scale, hence CO problems are often also tackled with handcrafted heuristics built on expert knowledge. Whilst a variety of ML-based heuristics have been proposed, reinforcement learning (RL; Sutton & Barto, 2018) is a particularly promising paradigm as it does not require pre-solved examples of these hard problems. Indeed, algorithmic improvements to RL-based CO solvers, coupled with their low inference cost and the fact that they are by design targeted at specific problem distributions, have progressively narrowed the gap with traditional solvers.

To improve the quality of proposed solutions, RL methods typically generate multiple candidates with additional search procedures, which can be divided into two families (Mazyavkina et al., 2021). First, improvement methods start from a feasible solution and iteratively improve it through small modifications (actions). However, such incremental search cannot quickly reach very different solutions, and it requires handcrafted procedures to define a sensible action space. Second, construction methods incrementally build a solution by selecting one element at a time. Multiple solutions can be built using sampling strategies, such as stochastic sampling of the policy or beam search. However, just as improvement methods are biased by the initial solution, construction methods are biased by the single underlying policy. Thus, a balance must be struck between exploiting the learned policy (which may be ill-suited to a given problem instance) and exploring different solutions (where the extreme case of a purely random policy is likely to be highly inefficient).

In this work, we propose Poppy, a construction method that uses a population of agents with suitably diverse policies to improve the exploration of the solution space of hard CO problems. Whereas a single agent aims to perform well across the entire problem distribution, and thus has to make compromises, a population can learn a set of heuristics such that only one of them has to be performant on any given problem instance. However, realizing this intuition presents several challenges: (i) naïvely training a population of agents is expensive and challenging to scale, (ii) the trained population should have complementary policies that propose different solutions, and (iii) the training approach should not impose any handcrafted notion of diversity within the set of policies, given the absence of clear behavioral markers aligned with performance for typical CO problems. Challenge (i) is addressed by sharing a large fraction of the computations across the population and specializing only lightweight policy heads to realize the diversity of agents.
Challenges (ii) and (iii) are jointly addressed by introducing an RL objective aimed at specializing agents on distinct subsets of the problem distribution. Concretely, we derive a lower bound of the true population-level objective, which corresponds to training only the agent that performs best on each problem (sketched below). This is intuitively justified as the performance of the population on a given problem is not improved by training an agent on an instance where another agent already performs better. Strikingly, we find that judicious application of this conceptually simple objective gives rise to a population where the diversity of policies is obtained without explicit supervision (and hence is applicable across a range of problems without modification) and is essential for strong performance.

Our contributions are summarized as follows:
1. We motivate the use of populations for CO problems as an efficient way to explore environments that are not reliably solved by single-shot inference.
2. We derive a new training objective and present a practical training procedure that encourages performance-driven diversity (i.e. effective diversity without the use of explicit behavioral markers or other external supervision).
3. We evaluate Poppy on three CO problems: TSP, CVRP, and 0-1 knapsack (KP). On TSP and KP, Poppy significantly outperforms other RL-based approaches. On CVRP, it consistently outperforms other inference-only methods and approaches the performance of methods that actively fine-tune problem-specific policies.
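As a minimal illustration of this best-agent-only objective, the following sketch takes per-agent rewards and trajectory log-likelihoods as given arrays and masks the learning signal so that only the winning agent on each instance contributes. The array names, shapes, and the choice of the second-best return as a baseline are assumptions of this sketch, not a specification of the paper's algorithm.

```python
import numpy as np

# Hypothetical quantities for a population evaluated on a batch of instances.
# rewards[i, k]  : return of agent i's rollout on problem instance k
# log_probs[i, k]: log-likelihood of the trajectory produced by agent i on instance k
rng = np.random.default_rng(0)
n_agents, batch_size = 4, 512
rewards = rng.random((n_agents, batch_size))
log_probs = rng.normal(size=(n_agents, batch_size))

# Population-level objective: only the agent that performs best on each instance
# receives a learning signal for that instance.
winner = rewards.argmax(axis=0)                      # index of the best agent per instance
winner_mask = np.zeros_like(rewards)
winner_mask[winner, np.arange(batch_size)] = 1.0

# REINFORCE-style surrogate, using the second-best return as a per-instance baseline
# (an assumption of this sketch; any reasonable baseline could be substituted).
baseline = np.sort(rewards, axis=0)[-2]              # second-best return per instance
surrogate_loss = -(winner_mask * (rewards - baseline) * log_probs).sum() / batch_size
```

Gradients of such a surrogate flow only through the winning agent on each instance, so agents that are already dominated on a problem are free to specialize elsewhere, which is the mechanism behind the unsupervised specialization described above.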

2. RELATED WORK

ML for Combinatorial Optimization
The first attempt to solve TSP with neural networks is due to Hopfield & Tank (1985), and only scaled up to 30 cities. Recent developments of bespoke neural architectures (Vinyals et al., 2015; Vaswani et al., 2017) and performant hardware have made ML approaches increasingly efficient. Indeed, several architectures have been used to address CO problems, such as graph neural networks (Dai et al., 2017), recurrent neural networks (Nazari et al., 2018), and attention mechanisms (Deudon et al., 2018). In this paper, we use an encoder-decoder architecture that draws from the one proposed by Kool et al. (2019): the costly encoder is run once per problem instance, and the resulting embeddings are fed to a small decoder that is iteratively rolled out to produce the whole trajectory, which enables efficient inference. This approach was furthered by Kwon et al. (2020), who leveraged the underlying symmetries of typical CO problems (e.g. of starting positions and rotations) to realize improved training and inference performance using instance augmentations. Kim et al. (2021) also draws on Kool et al. and uses a hierarchical strategy where a seeder proposes solution candidates, which are refined bit by bit by a reviser. Closer to our work, Xin et al. (2021) trains multiple policies using a shared encoder and separate decoders. Whilst this work (MDAM) shares our architecture and goal of training a population, our approach for enforcing diversity differs substantially. MDAM explicitly trades off performance with diversity by jointly optimizing policies and their KL divergence. Moreover, as computing the KL divergence for the whole trajectory is intractable, MDAM is restricted to only using it to drive diversity at the first timestep. In contrast, Poppy drives diversity by maximizing population-level performance (i.e. without any explicit diversity metric), uses the whole trajectory, and scales better with the population size (we have used up to 32 agents instead of only 5).

Additionally, ML approaches usually rely on mechanisms to generate multiple candidate solutions (Mazyavkina et al., 2021). One such mechanism consists in using improvement methods on an

