DATA-EFFICIENT SUPERVISED LEARNING IS POWERFUL FOR NEURAL COMBINATORIAL OPTIMIZATION

Abstract

Neural combinatorial optimization (NCO) is a promising learning-based approach to solving difficult combinatorial optimization problems. However, how to efficiently train a powerful NCO solver remains challenging. The widely used reinforcement learning method suffers from sparse rewards and low data efficiency, while the supervised learning approach requires a large number of high-quality solutions. In this work, we develop efficient methods to extract sufficient supervised information from limited labeled data, which can significantly overcome the main shortcoming of supervised learning. For the traveling salesman problem (TSP), a representative combinatorial optimization problem, we propose a set of efficient data augmentation methods and a novel bidirectional loss to better leverage the equivalence properties of problem instances, which together lead to a promising supervised learning approach. Thorough experimental studies demonstrate that our proposed method can achieve state-of-the-art performance on TSP with only a small set of 50,000 labeled instances, while also achieving promising generalization performance on tasks with different sizes or different distributions. We believe this somewhat surprising finding could lead to a valuable rethinking of the value of efficient supervised learning for NCO.



1. INTRODUCTION

Many real-world applications involve challenging combinatorial optimization problems, which can be NP-hard and cannot be solved exactly in a reasonable time (Papadimitriou & Steiglitz, 1998). The traditional approach requires designing handcrafted heuristic rules for each specific problem, and a long search process to solve every problem instance, even when the instances are similar to each other (Korte et al., 2011). In recent years, many learning-based algorithms have been proposed to efficiently find a good approximate solution for a given problem instance (Bengio et al., 2021). In this work, we focus on the neural combinatorial optimization (NCO) approach (Bello et al., 2016), since it can directly generate an approximate solution in real time without any expert knowledge or predefined heuristic rules.

Although a combinatorial optimization problem may be NP-hard, a real-world application typically cares about only a small subset of instances (Bengio et al., 2021). Therefore, it is possible to leverage the similar patterns shared by these instances to learn an efficient neural combinatorial solver (Vinyals et al., 2015). Supervised learning (SL) and reinforcement learning (RL) are the two main methods for training an NCO solver, which learn the pattern either directly from high-quality solutions (Vinyals et al., 2015) or through extensive interaction with the environment (i.e., the problem instances) (Bello et al., 2016). It is challenging to efficiently train a powerful NCO solver. The RL method suffers from sparse rewards (Vecerik et al., 2017; Hare, 2019) and low data efficiency (Laskin et al., 2020), which can require a huge computational budget and lead to extremely long training times (e.g., more than a week) (Joshi et al., 2020; Kwon et al., 2020).
By directly learning from high-quality solutions at each step, the SL method has better sample efficiency and is a promising alternative for training an NCO solver (Joshi et al., 2019; 2020). Nevertheless, SL suffers from the difficulty of collecting sufficient labeled data (i.e., optimal or near-optimal solutions of combinatorial optimization instances). In addition, there are also concerns about the generalization performance of NCO solvers trained with SL (Joshi et al., 2020).

In this work, we investigate how to overcome these shortcomings of SL-based NCO training. By leveraging the equivariance and symmetries of problem instances and their solutions, we develop novel approaches to extract sufficient information from limited high-quality solutions for data-efficient supervised learning, and we demonstrate that training POMO (Kwon et al., 2020) with our method outperforms reinforcement learning. Our main contributions can be summarized as follows:

• We design four simple yet efficient data augmentation approaches to significantly enlarge the training set from limited high-quality solutions, and develop a novel bidirectional supervised loss that leverages the equivalence of solutions to further improve training efficiency. With these two powerful methods, we propose a novel Supervised Learning with Data Augmentation and Bidirectional Loss (SL-DABL) algorithm for TSP.

• We conduct thorough experiments to study the efficiency of our proposed method. The results confirm that SL-DABL can achieve state-of-the-art performance on TSP with only 50,000 training instances, and that it also generalizes well to real-world instances with different sizes.

Our findings could be somewhat surprising and run contrary to some current beliefs about the NCO method, such as those in Joshi et al. (2020).
We show that 1) the huge supervised data requirement (a major drawback) is indeed not necessary for SL, and 2) RL is not always the best choice for training an NCO model. We hope these findings can help the community rethink the role and value of efficient SL-based NCO training.
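To make the symmetry argument above concrete, the sketch below (our own illustration, not the paper's exact four augmentations or loss; `augment_instance`, `equivalent_tours`, and `bidirectional_nll` are hypothetical names) shows how distance-preserving coordinate transformations and tour relabelings turn one labeled TSP instance into many, and how a bidirectional loss can average the negative log-likelihood of a target tour and its reversal:

```python
import numpy as np

def augment_instance(coords, tour):
    """Enumerate equivalent labeled TSP instances from one (coords, tour) pair.

    coords: (n, 2) array of city coordinates in the unit square [0, 1]^2.
    tour:   length-n permutation of city indices (an optimal tour).
    Rotating or reflecting all coordinates about the square's center preserves
    pairwise distances, so the same tour stays optimal for each new instance.
    """
    pairs = []
    center = 0.5
    for k in range(4):                              # rotations by k * 90 degrees
        t = k * np.pi / 2.0
        rot = np.array([[np.cos(t), -np.sin(t)],
                        [np.sin(t),  np.cos(t)]])
        rotated = (coords - center) @ rot.T + center
        for flip in (False, True):                  # optional horizontal reflection
            aug = rotated.copy()
            if flip:
                aug[:, 0] = 1.0 - aug[:, 0]
            pairs.append((aug, tour.copy()))
    return pairs                                    # 8 equivalent labeled instances

def equivalent_tours(tour):
    """All tours describing the same cycle: any start city, either direction."""
    n = len(tour)
    return [np.roll(tour, s) for s in range(n)] + \
           [np.roll(tour[::-1], s) for s in range(n)]

def bidirectional_nll(step_logp_fn, tour):
    """Sketch of a bidirectional loss: average the per-step negative
    log-likelihood of the target tour and of its reversal.

    step_logp_fn(target) is a hypothetical interface returning the per-step
    log-probabilities a model assigns to `target` under teacher forcing.
    """
    fwd = -np.mean(step_logp_fn(tour))
    bwd = -np.mean(step_logp_fn(tour[::-1]))
    return 0.5 * (fwd + bwd)
```

Because every transformation above is an isometry (or a relabeling of the same cycle), each augmented pair carries exactly as much supervised signal as a freshly solved instance, which is what makes the limited labeled set go further.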

2. RELATED WORK

In the past few years, many promising learning-based approaches have been proposed to tackle different combinatorial optimization problems. We briefly review the neural combinatorial optimization methods that are closely related to this work, and refer readers to Bengio et al. (2021) for a more comprehensive survey.



Figure 1: The optimality gap of models trained with different training strategies on the validation set.
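For reference, the optimality gap reported in Figure 1 is the relative excess length of a predicted tour over the optimal one. A minimal computation (our own sketch; the function names are illustrative) looks like:

```python
import numpy as np

def tour_length(coords, tour):
    """Total Euclidean length of the closed tour visiting cities in `tour` order."""
    c = coords[np.asarray(tour)]
    # Distance from each city to the next, wrapping back to the start.
    return float(np.linalg.norm(c - np.roll(c, -1, axis=0), axis=1).sum())

def optimality_gap(coords, predicted_tour, optimal_tour):
    """Relative excess length of the predicted tour over the optimum, in percent."""
    pred = tour_length(coords, predicted_tour)
    opt = tour_length(coords, optimal_tour)
    return 100.0 * (pred - opt) / opt
```

For instance, on the four corners of the unit square, the perimeter tour has length 4, while a tour that crosses the diagonals is longer, giving a strictly positive gap.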

Joshi et al. (2020) systematically studied the performance of different learning methods on both autoregressive and non-autoregressive models. Our work focuses on the construction-based autoregressive model. According to the results in (Joshi et al., 2020), even with 1,280,000 training instances, the SL approach is still outperformed by the RL approach on zero-shot greedy prediction, in terms of both testing and generalization performance. In this work, we propose a novel data-efficient SL method that achieves state-of-the-art performance with 50,000 training instances, only 4% of the training dataset used in Joshi et al. (2020).

