BQ-NCO: BISIMULATION QUOTIENTING FOR GENERALIZABLE NEURAL COMBINATORIAL OPTIMIZATION

Abstract

Despite the success of Neural Combinatorial Optimization methods for end-to-end heuristic learning, out-of-distribution generalization remains a challenge. In this paper, we present a novel formulation of combinatorial optimization (CO) problems as Markov Decision Processes (MDPs) that effectively leverages symmetries of the CO problems to improve out-of-distribution robustness. Starting from the standard MDP formulation of constructive heuristics, we introduce a generic transformation based on bisimulation quotienting (BQ) in MDPs. This transformation makes it possible to reduce the state space by accounting for the intrinsic symmetries of the CO problem, and thereby facilitates solving the MDP. We illustrate our approach on the Traveling Salesman, Capacitated Vehicle Routing and Knapsack Problems. We present a BQ reformulation of these problems and introduce a simple attention-based policy network that we train by imitation of (near-)optimal solutions for small instances from a single distribution. We obtain new state-of-the-art generalization results for instances with up to 1000 nodes from synthetic and realistic benchmarks that vary both in size and node distributions.

1. INTRODUCTION

Combinatorial Optimization problems are crucial in many application domains such as transportation, energy, and logistics. Because they are generally NP-hard (Cook et al., 1997), their resolution at real-life scales is mainly done by heuristics, which are efficient algorithms that generally produce good-quality solutions (Boussaïd et al., 2013). However, strong heuristics are generally problem-specific and designed by domain experts. Neural Combinatorial Optimization (NCO) is a relatively recent line of research that focuses on using deep neural networks to learn such heuristics from data, possibly exploiting information on the specific distribution of problem instances of interest (Bengio et al., 2021; Cappart et al., 2021). Despite the impressive progress in this field over the last few years, the out-of-distribution generalization of the learned heuristics, especially to larger instances, remains a major hurdle (Joshi et al., 2022; Manchanda et al., 2022).

In this paper, we are interested in constructive NCO methods, which build a solution incrementally by applying a sequence of elementary steps. These methods are often quite generic; see e.g. the seminal papers by Khalil et al. (2017); Kool et al. (2019). Most CO problems can indeed be represented in this way, although the representation is not unique, as the nature of the steps is, to a large extent, a matter of choice. Given a choice of step space, solving the CO problem amounts to computing an optimal policy for sequentially selecting the steps in the construction. This task can typically be performed in the framework of Markov Decision Processes (MDPs), through imitation or reinforcement learning. The exponential size of the state space, inherent to the NP-hardness of combinatorial problems, usually precludes other methods such as (tabular) dynamic programming.
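To make the constructive view concrete, here is a minimal sketch (our own illustration, not a method from this paper) of a constructive TSP heuristic seen as an MDP rollout: the state is the partial tour together with the set of unvisited nodes, each action appends one node, and the step policy here is simply nearest-neighbor.

```python
import math

# Illustrative sketch: a constructive TSP heuristic as an MDP rollout.
# State = (partial tour, unvisited nodes); an action appends one node.

def greedy_tsp(coords):
    """Roll out a nearest-neighbor step policy starting from node 0."""
    n = len(coords)
    tour = [0]                       # partial solution (the state, part 1)
    unvisited = list(range(1, n))    # remaining decision set (part 2)
    while unvisited:                 # one MDP transition per step
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: math.dist(coords[last], coords[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

print(greedy_tsp([(0, 0), (0, 1), (1, 1), (1, 0)]))  # [0, 1, 2, 3]
```

A learned constructive heuristic replaces the hand-crafted nearest-neighbor rule with a neural policy over the same step space; the MDP structure is unchanged.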
Whatever the learning method used to solve the MDP, its efficiency, and in particular its out-of-distribution generalization capabilities, greatly depends on the state representation. The state space is often characterized by deep symmetries which, if they are not adequately identified and leveraged, hinder the training process by forcing it to independently learn the policy at states which are in fact essentially the same (modulo some symmetry). In this work, we investigate a type of symmetry which often occurs in MDP formulations of constructive CO heuristics. We first introduce a generic framework to systematically derive a naive CO problem-specific MDP. We formally demonstrate the equivalence between solving the MDP and
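The kind of symmetry at stake can be illustrated with a small sketch (an assumed example of ours, not the paper's formal construction): in the TSP, two partial tours sharing the same start node, current node, and unvisited set leave exactly the same remaining subproblem, so a quotiented state may forget the order in which the visited prefix was traversed.

```python
# Illustrative state reduction for the TSP: collapse symmetric partial
# tours that induce the same remaining subproblem.

def naive_state(tour, n):
    # naive MDP state: the full partial tour (order of the prefix matters)
    return tuple(tour)

def quotient_state(tour, n):
    # reduced state: (start node, current node, unvisited nodes)
    unvisited = frozenset(range(n)) - set(tour)
    return (tour[0], tour[-1], unvisited)

a, b = [0, 1, 2, 3], [0, 2, 1, 3]   # symmetric prefixes over 6 nodes
print(naive_state(a, 6) == naive_state(b, 6))        # False
print(quotient_state(a, 6) == quotient_state(b, 6))  # True
```

A policy trained on the naive states must learn the same decision twice for `a` and `b`, whereas on the quotiented states the two coincide; this is the intuition behind reducing the state space before learning.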

