DATA-EFFICIENT SUPERVISED LEARNING IS POWERFUL FOR NEURAL COMBINATORIAL OPTIMIZATION

Abstract

Neural combinatorial optimization (NCO) is a promising learning-based approach to solving difficult combinatorial optimization problems. However, how to efficiently train a powerful NCO solver remains challenging. The widely-used reinforcement learning method suffers from sparse rewards and low data efficiency, while the supervised learning approach requires a large number of high-quality solutions. In this work, we develop efficient methods to extract sufficient supervised information from limited labeled data, which significantly mitigates the main shortcoming of supervised learning. For the traveling salesman problem (TSP), a representative combinatorial optimization problem, we propose a set of efficient data augmentation methods and a novel bidirectional loss that better leverage the equivalence properties of problem instances, which together lead to a promising supervised learning approach. Thorough experimental studies demonstrate that our proposed method achieves state-of-the-art performance on TSP with only a small set of 50,000 labeled instances, while also generalizing well to tasks with different sizes or different distributions. We believe this somewhat surprising finding could prompt a valuable rethinking of the value of efficient supervised learning for NCO.



1. INTRODUCTION

Many real-world applications involve challenging combinatorial optimization problems, which can be NP-hard and cannot be solved exactly in reasonable time (Papadimitriou & Steiglitz, 1998). The traditional approach designs handcrafted heuristic rules for each specific problem and requires a long search process to solve every problem instance, even when the instances are similar to each other (Korte et al., 2011). In recent years, many learning-based algorithms have been proposed to efficiently find a good approximate solution for a given problem instance (Bengio et al., 2021). In this work, we focus on the neural combinatorial optimization (NCO) approach (Bello et al., 2016), since it can directly generate an approximate solution in real time without any expert knowledge or predefined heuristic rules. Although a combinatorial optimization problem can be NP-hard, a real-world application typically only cares about a small subset of instances (Bengio et al., 2021). It is therefore possible to leverage the similar patterns shared by these instances to learn an efficient neural combinatorial solver (Vinyals et al., 2015).

Supervised learning (SL) and reinforcement learning (RL) are the two main methods for training an NCO solver: the former learns the pattern directly from high-quality solutions (Vinyals et al., 2015), while the latter learns through extensive interaction with the environment, i.e., the problem instances (Bello et al., 2016). Efficiently training a powerful NCO solver is challenging. The RL method suffers from sparse rewards (Vecerik et al., 2017; Hare, 2019) and low data efficiency (Laskin et al., 2020), which can require a huge computational budget and lead to extremely long training times, e.g., more than a week (Joshi et al., 2020; Kwon et al., 2020).
By directly learning from high-quality solutions at each step, the SL method has better sample efficiency and is a promising alternative for training an NCO solver (Joshi et al., 2019; 2020). Nevertheless, SL suffers from the difficulty of collecting sufficient labeled data (i.e., optimal or near-optimal solutions of combinatorial optimization instances). In addition, there are also concerns about the generalization performance of NCO solvers trained with SL (Joshi et al., 2020).

In this work, we investigate how to overcome these shortcomings of SL-based NCO training. By leveraging the equivariance and symmetries of problem instances and solutions, we develop novel approaches to extract sufficient information from limited high-quality solutions for data-efficient supervised learning, and we demonstrate that training POMO (Kwon et al., 2020) with our method is better than training it with reinforcement learning. Our main contributions can be summarized as follows:

• We design four simple yet efficient data augmentation approaches to significantly enlarge the training set from limited high-quality solutions, and develop a novel bidirectional supervised loss that leverages the equivalence of solutions to further improve training efficiency. With these two methods, we propose the novel Supervised Learning with Data Augmentation and Bidirectional Loss (SL-DABL) algorithm for TSP.

• We conduct thorough experiments to study the efficiency of the proposed method. The results confirm that SL-DABL achieves state-of-the-art performance on TSP with only 50,000 training instances, and also generalizes well to real-world instances with different sizes.

Our findings could be somewhat surprising and run counter to some current beliefs about NCO, such as those in Joshi et al. (2020). We show that 1) the huge supervised data requirement, a major drawback of SL, is in fact not necessary, and 2) RL is not always the best choice for training an NCO model. We hope these results are helpful for rethinking the role and value of efficient SL-based NCO training.

2. RELATED WORKS

In the past few years, many promising learning-based approaches have been proposed to tackle different combinatorial optimization problems. We briefly review the neural combinatorial optimization methods that are closely related to this work, and refer readers to Bengio et al. (2021) and Cappart et al. (2021) for comprehensive surveys.

2.1. SUPERVISED LEARNING FOR NCO

Vinyals et al. (2015) proposed the Pointer Network, with an RNN encoder-decoder structure and an attention mechanism, to solve TSP in an autoregressive manner. Milan et al. (2017) found that it is costly to generate enough high-quality solutions to serve as a training dataset for supervised learning, and proposed to update the initial dataset with superior solutions generated during the training process. Joshi et al. (2019) trained a graph neural network with non-autoregressive decoding to predict a heatmap for each instance. The heatmap measures the probability that each edge belongs to the optimal solution, and can be converted into a valid solution with beam search, Monte Carlo tree search (Fu et al., 2021), guided local search (Hudson et al., 2021), or dynamic programming (Kool et al., 2021). Joshi et al. (2020) systematically studied the performance of different learning methods on both autoregressive and non-autoregressive models. This work focuses on the construction-based autoregressive model. According to the results in Joshi et al. (2020), even with 1,280,000 training instances, the SL approach is still outperformed by the RL approach in zero-shot greedy prediction, for both testing and generalization performance. In this work, we propose a novel data-efficient SL method that achieves state-of-the-art performance with 50,000 training instances, only 4% of the training dataset used in Joshi et al. (2020).

2.2. REINFORCEMENT LEARNING FOR NCO

Many RL-based methods have been proposed to train NCO solvers (Bello et al., 2016; Khalil et al., 2017; Nazari et al., 2018; Deudon et al., 2018; Ma et al., 2019). Kool et al. (2018) proposed the Attention Model (AM) framework to solve different vehicle routing problems, and various follow-up works have improved the AM performance with diverse solution generation (Xin et al., 2021; Kim et al., 2021). Kwon et al. (2020) proposed the POMO method, which leverages the multiple-optima property with multiple greedy rollouts and is the current state-of-the-art RL algorithm for NCO. In this work, with the same model and inference strategy, our proposed SL-DABL method outperforms the RL counterpart in POMO for both testing and generalization performance.

Data augmentation is a widely-used approach to increase the amount of training data for supervised learning (Shorten & Khoshgoftaar, 2019; Feng et al., 2021). Kwon et al. (2020) also applied a fixed ×8 instance augmentation, but only in the inference phase.

3. DATA AUGMENTATION

3.1. PROBLEM FORMULATION

A TSP instance can be represented as a fully connected graph S with n nodes, where s_i denotes the coordinates of node i. A solution (tour) τ = (t_1, t_2, . . . , t_n) is a permutation of the n nodes, and its cost is the total tour length

c(τ|S) = Σ_{i=1}^{n−1} ∥s_{t_{i+1}} − s_{t_i}∥_2 + ∥s_{t_n} − s_{t_1}∥_2.

In this work, we represent the graph S as [x; y], where x = [x_1, x_2, . . . , x_n] and y = [y_1, y_2, . . . , y_n].

[Figure 3 appears here: before/after illustrations of the (a) rotation, (b) symmetry, (c) shrink, and (d) noise operators.]
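As a minimal sketch, the tour cost c(τ|S) above can be computed as follows (NumPy; the function name `tour_cost` and the array layout are our own conventions, not from the paper):

```python
import numpy as np

def tour_cost(coords, tour):
    """Length of the closed tour: the sum of consecutive edge lengths
    plus the closing edge from the last node back to the first."""
    ordered = coords[tour]                    # (n, 2) nodes in visiting order
    diffs = np.diff(ordered, axis=0)          # consecutive segments s_{t_{i+1}} - s_{t_i}
    closing = ordered[0] - ordered[-1]        # return edge s_{t_n} -> s_{t_1}
    return float(np.linalg.norm(diffs, axis=1).sum() + np.linalg.norm(closing))

# Unit square: visiting the corners in order gives the perimeter, 4.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(tour_cost(square, np.array([0, 1, 2, 3])))  # 4.0
```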

3.2. ROTATION

Each TSP instance is a fully connected graph whose optimal solution is invariant under rotation of the entire graph. Therefore, given a single optimal solution to an instance, we can generate multiple new instances by rotating the original instance by random angles. The details of the rotation operator are provided in Algorithm 1.

Algorithm 1 Rotation
Input: the original graph S. Output: the augmented graph S′.
1: [x_m; y_m] ← [x − 0.5; y − 0.5];
2: ρ, θ ← Cartesian2Polar(x_m, y_m);
3: ∆θ ∼ U(0, 2π);
4: θ′ ← θ + ∆θ;
5: [x_m; y_m] ← Polar2Cartesian(ρ, θ′);
6: S′ ← [x_m + 0.5; y_m + 0.5];

Algorithm 2 Symmetry
Input: the original graph S. Output: the augmented graph S′.
1: [x_m; y_m] ← [x − 0.5; y − 0.5];
2: ρ, θ ← Cartesian2Polar(x_m, y_m);
3: ∆θ ∼ U(0, 2π);
4: θ′ ← −(θ + ∆θ);
5: [x_m; y_m] ← Polar2Cartesian(ρ, θ′);
6: S′ ← [x_m + 0.5; y_m + 0.5];

As shown in lines 1–2 of Algorithm 1, we first translate the graph so that its center moves from (0.5, 0.5)⊺ to (0, 0)⊺, and express the nodes in polar coordinates as ρ = √(x_m² + y_m²), θ = arctan(y_m / x_m). We then randomly sample an angle ∆θ ∈ (0, 2π) and add it to the current θ to obtain a new instance (lines 3–4). After that, we transform the new instance back to the Cartesian coordinate system as x_m = ρ cos θ′, y_m = ρ sin θ′. Finally, we translate the graph back so that it is centered at (0.5, 0.5)⊺ and output the augmented graph S′. Figure 3(a) illustrates an example of generating a new instance via the rotation operator. It is clear that the new instance has the same optimal solution as the original one.
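The rotation operator can be sketched as follows (NumPy; the function name `rotate` is ours, and `np.arctan2` is used so the polar angle is correct in all quadrants):

```python
import numpy as np

def rotate(coords, dtheta=None):
    """Rotate all nodes of an (n, 2) coordinate array about (0.5, 0.5)
    by the angle dtheta (drawn from U(0, 2*pi) if not given)."""
    if dtheta is None:
        dtheta = np.random.uniform(0.0, 2 * np.pi)
    xm, ym = coords[:, 0] - 0.5, coords[:, 1] - 0.5     # center at the origin
    rho = np.hypot(xm, ym)                              # polar radius
    theta = np.arctan2(ym, xm) + dtheta                 # polar angle + rotation
    return np.stack([rho * np.cos(theta) + 0.5,
                     rho * np.sin(theta) + 0.5], axis=1)

pts = np.random.rand(8, 2)
rot = rotate(pts, 1.234)
```

Since rotation is an isometry, all pairwise distances (and hence the optimal tour) are preserved.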

3.3. SYMMETRY

Similar to rotation, the optimal solution is also invariant under the symmetry (reflection) operator. Since all nodes are located in [0, 1]², for uniformity we set the axis of symmetry to be a line passing through the midpoint (0.5, 0.5)⊺. Algorithm 2 presents the symmetry operator in detail. Compared with the rotation operator, it additionally flips the rotated graph along the horizontal axis of the polar coordinate system in line 4. The axis of symmetry between the new graph and the original graph can be expressed as y = kx + ½(1 − k), where k = tan(−∆θ/2). An example of generating a new instance via the symmetry operator is depicted in Figure 3(b). We can see that the original instance and the new instance have the same optimal solution.
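A sketch of the symmetry operator in the style of Algorithm 2 (the function name `reflect` is ours): negating the polar angle after adding ∆θ reflects the graph across a line through the center.

```python
import numpy as np

def reflect(coords, dtheta=None):
    """Algorithm-2-style symmetry: rotate by dtheta, then negate the polar
    angle, i.e., reflect across a line through (0.5, 0.5)."""
    if dtheta is None:
        dtheta = np.random.uniform(0.0, 2 * np.pi)
    xm, ym = coords[:, 0] - 0.5, coords[:, 1] - 0.5
    rho = np.hypot(xm, ym)
    theta = -(np.arctan2(ym, xm) + dtheta)      # theta' = -(theta + dtheta)
    return np.stack([rho * np.cos(theta) + 0.5,
                     rho * np.sin(theta) + 0.5], axis=1)

pts = np.random.rand(8, 2)
mirrored = reflect(pts, 0.0)   # dtheta = 0: reflection across the line y = 0.5
```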

3.4. SHRINK

Algorithm 3 Shrink
Input: the original graph S, threshold parameter γ. Output: the augmented graph S′.
1: [x_m; y_m] ← [x − 0.5; y − 0.5];
2: ρ, θ ← Cartesian2Polar(x_m, y_m);
3: β ∼ U(1 − γ, 1 + γ);
4: ρ′ ← βρ;
5: [x_m; y_m] ← Polar2Cartesian(ρ′, θ);
6: S′ ← [x_m + 0.5; y_m + 0.5];

Algorithm 4 Noise
Input: the original graph S. Output: the augmented graph S′.
1: d ← NodesMinimumDistance(x, y);
2: r_ρ ∼ U(0, 1);
3: r_θ ∼ U(0, 2π);
4: r′_ρ ← (d/2) r_ρ;
5: r_x, r_y ← Polar2Cartesian(r′_ρ, r_θ);
6: S′ ← [x + r_x; y + r_y];

The shrink operator linearly scales the original graph. Since the relative positions of the nodes are unchanged, the optimal solution of the new instance is the same as that of the original one. As with the rotation and symmetry operators, we use the midpoint (0.5, 0.5)⊺ as the center of the shrink operator. The detailed procedure is given in Algorithm 3. We first translate the given instance so that its midpoint is (0, 0)⊺ (line 1) and map all nodes into the polar coordinate system (line 2). Then a coefficient β is randomly sampled from U(1 − γ, 1 + γ) to control the degree of scaling; the predefined threshold parameter γ prevents the graph from being scaled to an extremely small or large size. In this paper, we set γ = 0.3. The final step of Algorithm 3 restores the scaled graph to the original Cartesian coordinate system. Figure 3(c) provides an example of generating a new instance via the shrink operator, where the original and shrunken instances share the same optimal solution.
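Since scaling the polar radius ρ by β is equivalent to scaling the centered Cartesian offsets by β, the shrink operator reduces to one line (a sketch; `shrink` is our name):

```python
import numpy as np

def shrink(coords, gamma=0.3, beta=None):
    """Scale all nodes about (0.5, 0.5) by beta ~ U(1 - gamma, 1 + gamma)."""
    if beta is None:
        beta = np.random.uniform(1.0 - gamma, 1.0 + gamma)
    return (coords - 0.5) * beta + 0.5   # same effect as rho' = beta * rho

pts = np.random.rand(8, 2)
scaled = shrink(pts, beta=0.7)
```

All pairwise distances are multiplied by the same factor β, so the ranking of tours, and hence the optimal solution, is unchanged.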

3.5. NOISE

Unlike the above three operators, which perfectly preserve the relative positions of the nodes, the noise operator generates new graphs by randomly perturbing each node of the original graph. In other words, without further restriction, the newly generated instances could have optimal solutions different from that of the original instance, which is undesirable for data augmentation. However, the optimal solution can remain the same if we only slightly change the node coordinates while qualitatively maintaining the relative positions. To be specific, we add to each node a small noise whose magnitude is bounded above by half the minimum distance between any pair of nodes, d/2. In this way, the perturbed nodes stay in a small region around their original locations and do not overlap with each other; this tight restriction also guarantees that the newly generated instance has the same optimal solution as the original instance. Algorithm 4 describes the noise operator in detail. As shown in line 1, the minimum pairwise distance is computed first. In lines 2–3, two coefficient vectors r_ρ ∈ (0, 1)ⁿ and r_θ ∈ (0, 2π)ⁿ are randomly sampled to determine the magnitude and direction of the noise, respectively. Finally, the corresponding displacements along the x and y axes are added to each node. The noise operator thus adaptively sets the noise upper bound and generates new graphs without changing the optimal solution. As illustrated by the example in Figure 3(d), the optimality of the original label solution is preserved.
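A sketch of the noise operator (the function name `add_noise` is ours), bounding each node's displacement by half the minimum pairwise distance:

```python
import numpy as np

def add_noise(coords):
    """Perturb each node within a disk of radius d/2, where d is the minimum
    pairwise distance, so the perturbed nodes cannot collide."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)               # ignore self-distances
    d = dist.min()                               # minimum pairwise distance
    n = coords.shape[0]
    r_rho = np.random.uniform(0.0, 1.0, n) * d / 2.0   # noise magnitude < d/2
    r_theta = np.random.uniform(0.0, 2 * np.pi, n)     # noise direction
    rx, ry = r_rho * np.cos(r_theta), r_rho * np.sin(r_theta)
    return coords + np.stack([rx, ry], axis=1)

pts = np.random.rand(8, 2)
noisy = add_noise(pts)
```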

3.6. COMPARISON

These four DA operators can be applied independently or stacked together to generate new problem instances for SL-based training. They are also model-agnostic and can be used to train any SL-based NCO solver. In this subsection, we investigate their effectiveness for training the AM solver with the implementation in Kwon et al. (2020). We compare the performance of models trained by SL with different DA operators on a validation set of 1,000 randomly generated instances. Both training and validation use 50-node TSP instances (denoted as TSP50). Performance is evaluated by the average optimal gap, where the optimal values are computed by Concorde (Applegate et al., 2006). As shown in Figure 4, each of the proposed DA operators improves the performance compared to using the original training dataset directly. Specifically, the rotation and symmetry operators remarkably alleviate the overfitting problem and reduce the optimal gap, whereas using the shrink or noise operator alone brings only modest improvement. In addition to assessing each individual DA operator, we also investigate the efficiency of combined DA operators. We stack all DA operators and remove one of them at a time; for instance, All−Rotation denotes that we employ all DA operators except rotation. Since the rotation and symmetry operators overlap, we randomly employ one of them when both are used. As shown in Figure 5, all combinations significantly reduce the optimal gap. The symmetry operator appears to be the most effective, but the other operators also contribute to the improvement to varying degrees. In this paper, we adopt the combination of all four DA operators as our DA strategy.
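The average optimal gap used above can be sketched as follows (the function name and the relative-gap-in-percent convention are our assumptions; the paper does not spell out the exact formula):

```python
import numpy as np

def average_optimal_gap(model_costs, optimal_costs):
    """Mean relative gap (in %) between the solver's tour lengths and the
    optimal tour lengths (e.g., computed by Concorde)."""
    model_costs = np.asarray(model_costs, dtype=float)
    optimal_costs = np.asarray(optimal_costs, dtype=float)
    return float(np.mean((model_costs - optimal_costs) / optimal_costs) * 100.0)

# A 5% gap on one instance and a 0% gap on another average to 2.5%.
print(average_optimal_gap([10.5, 20.0], [10.0, 20.0]))
```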

4. BIDIRECTIONAL LOSS

In this section, by leveraging the equivalence of optimal solutions, we propose a novel bidirectional loss to further improve the data efficiency of SL-based training. A construction-based NCO solver generates the solution sequentially in an autoregressive manner, selecting one node at each step. The node selection can be viewed as a classification problem, and the goal of the SL-based method is to minimize the conditional cross-entropy loss with respect to the optimal solution (Vinyals et al., 2015).

4.1. LOSS FUNCTIONS

[Figure 6 appears here: (a) a solution τ = (1, 3, 2, 5, 4) for a 5-node TSP instance; (b) its equivalent optimal solutions.]

The supervised loss for an instance S with optimal solution →τ is

L(S, →τ) = −Σ_{t=1}^{n} log p_ϕ(→τ_t | S, →τ_{0:t−1}),

where →τ_{0:t−1} is the partial tour up to step t − 1, →τ_t is the node selected at step t, and ϕ denotes the parameters of the model being trained. However, a CO problem instance can have multiple optimal solutions (Kwon et al., 2020; Kotary et al., 2021). As shown in Figure 6, for a TSP instance with n nodes and a given optimal solution →τ, there are n equivalent optimal solutions →τ_i with different starting nodes i. In addition, since the tour can be traversed in the reverse direction, there are another n optimal solutions ←τ_i. Therefore, in total there are 2n solutions equivalent to the single given optimal solution →τ. In this work, we propose a novel bidirectional loss function to leverage all these equivalent optimal solutions for data-efficient SL-based training:

L_B(S, →τ) = (1/n) Σ_{i=1}^{n} A(L(S, →τ_i), L(S, ←τ_i)),

where A(·, ·) denotes the aggregation function over the two traversal directions from the same starting node, which can be one of {min, mean, max}. The min aggregation greedily optimizes the solution in the preferred direction, the mean aggregation considers both directions, and the max aggregation optimizes an upper bound (i.e., the direction with the worse loss).
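The 2n equivalent solutions of Figure 6 can be enumerated with a few lines of plain Python (a sketch; tours are given as lists of node labels):

```python
def equivalent_tours(tour):
    """All 2n tours equivalent to `tour` on an n-node TSP instance:
    n cyclic shifts (one per starting node) in the forward direction,
    plus the same n starts traversed in the reverse direction."""
    n = len(tour)
    forward = [tour[i:] + tour[:i] for i in range(n)]
    # Reverse each tour while keeping its starting node fixed.
    backward = [t[:1] + t[1:][::-1] for t in forward]
    return forward, backward

fwd, bwd = equivalent_tours([1, 3, 2, 5, 4])
print(fwd)  # forward starts, e.g. [2, 5, 4, 1, 3] as in Figure 6(b)
print(bwd)  # reversed starts, e.g. [1, 4, 5, 2, 3]
```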
4.2. COMPARISON OF AGGREGATION FUNCTIONS

In this subsection, we compare the performance of the different aggregations for the bidirectional loss, combined with data augmentation. As shown in Figure 7, all aggregated bidirectional losses outperform the original supervised loss given the same amount of provided optimal solutions (i.e., 50,000). Among the three aggregation functions, min achieves the best performance, and we use it as the default setting in the rest of this work. An ablation study of the three aggregations under different settings can be found in the Appendix.
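For intuition, the three aggregations can be contrasted on illustrative per-direction losses for a single instance (the numbers below are made up; `L_B` mirrors the formula, with any per-solution losses supplied as inputs):

```python
# Per-start-node losses for the forward and reverse traversals of one
# instance (illustrative values, n = 4 starting nodes).
fwd = [1.2, 0.8, 1.5, 1.0]
bwd = [0.9, 1.1, 1.4, 1.3]

def L_B(fwd, bwd, aggregate):
    """Bidirectional loss: aggregate the two directions per start node,
    then average over the n start nodes."""
    return sum(aggregate(f, b) for f, b in zip(fwd, bwd)) / len(fwd)

print(L_B(fwd, bwd, min))                       # greedy, preferred direction
print(L_B(fwd, bwd, lambda a, b: (a + b) / 2))  # mean of both directions
print(L_B(fwd, bwd, max))                       # upper bound (worse direction)
```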

4.3. SL-DABL

We combine the aforementioned data augmentation approaches and the bidirectional loss to form our Supervised Learning with Data Augmentation and Bidirectional Loss (SL-DABL) method, shown in Algorithm 5. SL-DABL is model-agnostic and can be used to train different construction-based NCO solvers. In this work, we adopt the same model structure, hyperparameter settings, and multi-start inference strategy as POMO (Kwon et al., 2020); in other words, the only difference is our proposed SL-DABL v.s. the RL-based training method with multiple rollouts developed in Kwon et al. (2020). Following POMO, we also evaluate a variant (denoted SL-DABL ×8) that performs an additional ×8 instance augmentation in the inference phase.

Algorithm 5 SL-DABL
1: Input: the training dataset D, the number of training steps iter_max, the batch size B, and the shrink threshold parameter γ.
2: Output: the trained model with parameters ϕ*.
3: Initialize the model with parameters ϕ;
4: for iter = 1, . . . , iter_max do
5:   S_i, →τ_i ∼ SampleInstance(D) ∀i ∈ {1, . . . , B};
6:   a ∼ U(0, 1);
7:   if a < 0.5 then
8:     S_i ← Rotation(S_i);
9:   else
10:    S_i ← Symmetry(S_i);
11:  end if
12:  S_i ← Shrink(S_i, γ);
13:  S_i ← Noise(S_i);
14:  ∇L(ϕ) ← (1/B) Σ_{i=1}^{B} ∇L_B(S_i, →τ_i);
15:  ϕ ← ADAM(ϕ, ∇L(ϕ));
16: end for
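A minimal sketch of the per-batch augmentation step of Algorithm 5 (NumPy; the gradient step needs the actual model and is omitted, and the function name `augment_instance` is ours):

```python
import numpy as np

def augment_instance(S, gamma=0.3):
    """One augmentation pass over an (n, 2) coordinate array: rotation OR
    symmetry (probability 0.5 each), then shrink, then noise."""
    xm, ym = S[:, 0] - 0.5, S[:, 1] - 0.5             # center at the origin
    rho, theta = np.hypot(xm, ym), np.arctan2(ym, xm)
    theta = theta + np.random.uniform(0, 2 * np.pi)   # rotation by a random angle
    if np.random.rand() < 0.5:
        theta = -theta                                # symmetry (Algorithm 2)
    rho = rho * np.random.uniform(1 - gamma, 1 + gamma)  # shrink
    S = np.stack([rho * np.cos(theta) + 0.5,
                  rho * np.sin(theta) + 0.5], axis=1)
    # Noise bounded by half the minimum pairwise distance (Algorithm 4).
    dist = np.linalg.norm(S[:, None] - S[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    r = np.random.uniform(0, 1, len(S)) * dist.min() / 2
    ang = np.random.uniform(0, 2 * np.pi, len(S))
    return S + np.stack([r * np.cos(ang), r * np.sin(ang)], axis=1)

batch = [np.random.rand(50, 2) for _ in range(4)]     # B = 4 TSP50 instances
augmented = [augment_instance(S) for S in batch]
```

The labeled tours →τ_i are carried through unchanged, since every operator preserves the optimal solution.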

5. EXPERIMENT

In this section, we first compare our SL-DABL method with other widely-used learning and non-learning solvers on uniform TSP instances. Then we conduct ablation studies to analyze each component of SL-DABL. Finally, SL-DABL is compared against RL in terms of generalization to different problem sizes and to real-world TSPLib instances. We compare with several groups of baselines. In the first group, Concorde (Applegate et al., 2006) and Gurobi (Gurobi Optimization, LLC, 2022) are two exact solvers, and LKH3 (Helsgaun, 2017) is a powerful heuristic algorithm. The second group consists of two learn-to-improve algorithms, proposed by Wu et al. (2021) and d O Costa et al. (2020), respectively. The third group contains two RL-based NCO algorithms, AM (Kool et al., 2018) and POMO (Kwon et al., 2020), each of which has two variants; we also evaluate POMO trained through reinforcement learning with data augmentation for a clear ablation. The three GCN variants (Joshi et al., 2019) are SL-based two-stage algorithms that generate solutions from a predicted heatmap. For Gurobi, LKH3, and the three GCN variants, we directly use the results reported in Joshi et al. (2019) and Fu et al. (2021).

5.1. OVERALL COMPARISON

We run the other algorithms ourselves with the code and pretrained models from their official implementations. Following the common setting in other NCO work, we separately train three models for TSP instances with 20, 50, and 100 nodes (denoted TSP20/50/100, respectively). All training datasets are from Hottung et al. (2020), where each contains 50,000 TSP instances with optimal solutions computed by Concorde (Applegate et al., 2006). We evaluate performance and inference time on a test set of 10,000 randomly generated instances for each problem size. From these results, we can see that SL-DABL outperforms the other nine learning-based algorithms and achieves state-of-the-art results on all three TSP sizes. Notably, SL-DABL ×8 matches the performance of Concorde on TSP20. Furthermore, our SL-DABL models retain real-time inference capability and are significantly faster than the exact and heuristic solvers as well as the improvement-based algorithms. It is worth noting that SL-DABL outperforms POMO even though they share the same model structure and inference strategy. These results confirm that SL-DABL is a data-efficient and powerful training method for NCO solvers.

5.2. ABLATION EXPERIMENT

To investigate the effectiveness of the DA approach and the bidirectional loss, we conduct ablation experiments on each component of SL-DABL. We train the model with four different SL strategies (i.e., original SL, SL-BL, SL-DA, and SL-DABL) using the dataset of 50,000 labeled TSP50 instances. For comparison, we also train the model via RL using 51,000,000 random TSP50 instances. The five trained models are compared on the validation set of 1,000 random instances. As shown in Figure 1, the original SL starts overfitting after about 50 epochs due to the lack of sufficient supervised data. In contrast, SL-DA optimizes smoothly throughout the training stage without overfitting; the only difference is that its training dataset is expanded by our proposed DA approach. The bidirectional loss also demonstrates its effectiveness, especially when training data is scarce: comparing original SL and SL-BL, the bidirectional loss helps the latter moderately alleviate overfitting and extract more information from the limited dataset in the early training stage. When the training set contains abundant data, as with SL-DA and SL-DABL, the bidirectional loss still further improves data efficiency: SL-DABL extracts information from the same amount of training data more efficiently than SL-DA. The two reach similar performance at the end of training, since both are very close to the optimal solutions (e.g., within a 0.001% optimal gap). According to the results in Figure 1, the SL-DA and SL-DABL methods are more efficient than RL throughout the whole training process. In summary, the original SL approach is inferior to RL mainly due to overfitting with limited training data; our proposed data augmentation approach can help SL extract sufficient supervised information from only 50,000 training instances and thereby address the overfitting issue with negligible cost.
Meanwhile, the bidirectional loss can further improve the data efficiency, especially when the training set is relatively small.

5.3. GENERALIZATION

Generalization ability is an important concern for learning-based NCO solvers. In real-world applications, problem instances typically have different sizes and may come from different distributions, so the learning-based solver should still be able to generate good approximate solutions for those unseen instances. In this section, we compare the generalization ability of SL-DABL and its RL counterpart with the POMO model and different inference strategies. We train two models on TSP100, with SL-DABL and with the RL approach in POMO respectively, and then compare their performance on TSP instances with up to 300 nodes. As shown in Table 2, the model trained by SL-DABL consistently achieves better generalization performance. We also test our model on the widely-used TSPLib benchmark (Reinelt, 1991), which contains real-world instances from dramatically different distributions. The statistical results on 30 2-D Euclidean instances with 100 to 300 nodes are shown in Table 3. SL-DABL outperforms its RL counterpart on 23 of the 30 instances and has a better average optimal gap. The detailed results for each instance can be found in the Appendix.

Table 2: Generalization of models trained on TSP100

6. CONCLUSION

In this paper, we have proposed a powerful SL-DABL method for learning to solve the traveling salesman problem. It integrates data augmentation, to efficiently extract sufficient supervised information from limited training data, and a bidirectional loss, to better exploit the equivalence properties of optimal solutions. The experiments validate that SL-DABL achieves state-of-the-art performance on TSP with only a small set of 50,000 labeled training instances, while also generalizing better than its RL counterpart to real-world instances of various sizes. These findings could be helpful in rethinking the value of efficient SL methods for NCO training.

A APPENDIX

Table 5: Generalization of models trained on TSP100

A.1 EXPERIMENTS SETTING

In this work, we plug our SL-DABL into POMO (Kwon et al., 2020); the details of the model can be found in the corresponding literature. We only describe the hyperparameter settings here, all of which are identical to POMO: there are 100,000 training samples per epoch and the batch size is 64. The models are optimized by the Adam optimizer for 510 epochs. The first 500 epochs use a learning rate η = 1e−4 with weight decay w = 1e−6, while the last 10 epochs fine-tune the model with η = 1e−5. The only difference between the training data of SL-DABL and RL is that the former is augmented from a small dataset of 50,000 labeled instances, while the latter is randomly generated; both use 51,000,000 samples to optimize the models. All experiments are run on a single Tesla V100 GPU.

A.2 BIDIRECTIONAL LOSS ABLATION

In this part, we extend the overall comparison with models trained using the other two aggregated bidirectional losses, max and mean. As shown in Table 4, all variants of SL-DABL outperform all other learning-based baselines.

A.3 ABLATION OF TRAINING DATASET SIZES

In this subsection, we study the effect of different numbers of labeled instances (e.g., 10K and 100K) on our proposed SL-DABL method. As shown in Figure 8, SL-DABL with only 10K instances still suffers from overfitting, although it is significantly alleviated compared to the original SL method, and its performance remains better than the reinforcement learning counterpart. In practice, this overfitting could be properly handled by early stopping. On the other hand, the improvement from increasing the labeled instances from 50K to 100K is not significant.

A.4 GENERALIZATION

We first discuss the generalization of the models trained on TSP100 to instances with up to 500 nodes. As shown in Table 5, SL-DABL with the max aggregation function is superior to RL on all test sets, and the min variant performs better on TSP100–TSP300. The detailed results on TSPLib are given in Table 6.



Figure 1: The optimality gap of models trained with different training strategies on the validation set.

Figure 3: Examples of generating a new instance via the (a) rotation, (b) symmetry, (c) shrink, and (d) noise operators, respectively.

Figure 4: The average optimal gap on the validation set of the Attention Model trained with supervised learning using each individual DA operator.

Figure 6: Equivalent optimal solutions for a TSP instance.

Figure 7: The average optimal gaps on the validation set with different loss functions.

Figure 8: Ablation of training dataset sizes

Table 3: Statistical results on TSPLib



Table 6: Results on TSPLib

A.5 EXTENSION TO CVRP

In this subsection, we investigate the effect of our proposed SL-DABL method on the capacitated vehicle routing problem (CVRP). Most experimental settings are the same as in the TSP experiments, and we use 10K labeled training instances whose solutions are obtained by the powerful LKH solver (Helsgaun, 2017). We slightly modify the data augmentation and the loss function to accommodate CVRP. For data augmentation, to guarantee the optimality of the original solutions, the noise operator is removed, since the additional depot hampers the design of the noise upper bound. For the loss function, the sub-tours of the given (near-)optimal solutions are reordered to find the representation most similar to the current model's output; the purpose of this operation is to alleviate the influence of the equivalent solutions. Indeed, different from the 2n equivalent solutions of a TSP instance, each CVRP instance has m!·2^m equivalent solutions, where m is the number of sub-tours. As shown in Figure 9, for CVRP20, our SL-DABL significantly mitigates overfitting compared to the original SL, and its performance is stably superior to POMO; the numerical results are reported in Table 7.

A.6 DATA AUGMENTATION FOR THE GCN MODEL

In this subsection, we investigate the performance of our proposed data augmentation approaches on the GCN model (Joshi et al., 2019). Since the GCN model directly predicts the solution heatmap as an adjacency matrix, the bidirectional loss is not applicable in this setting. We follow the same experimental settings as reported in the main paper. As shown in Figure 10, for TSP20, our data augmentation still modestly improves the performance of the GCN model, even though both models employ beam search with a width of 1,280.

