POLICY-BASED SELF-COMPETITION FOR PLANNING PROBLEMS

Abstract

AlphaZero-type algorithms may stop improving on single-player tasks when the value network guiding the tree search is unable to approximate the outcome of an episode sufficiently well. One technique to address this problem is transforming the single-player task through self-competition. The main idea is to compute a scalar baseline from the agent's historical performances and to reshape an episode's reward into a binary output indicating whether the baseline has been exceeded or not. However, this baseline carries only limited information for the agent about how to improve. We leverage the idea of self-competition and directly incorporate a historical policy into the planning process instead of its scalar performance. Based on the recently introduced Gumbel AlphaZero (GAZ), we propose our algorithm GAZ 'Play-to-Plan' (GAZ PTP), in which the agent learns to find strong trajectories by planning against possible strategies of its past self. We show the effectiveness of our approach in two well-known combinatorial optimization problems, the Traveling Salesman Problem and the Job-Shop Scheduling Problem. With only half of the simulation budget for search, GAZ PTP consistently outperforms all selected single-player variants of GAZ.

1. INTRODUCTION

One of the reasons for the success of AlphaZero (Silver et al., 2017) is the use of a policy and value network to guide the Monte Carlo tree search (MCTS), decreasing the search tree's width and depth. Trained on state-outcome pairs, the value network develops an 'intuition' for telling from single game positions which player might win. By normalizing values in the tree to handle changing reward scales (Schadd et al., 2008; Schrittwieser et al., 2020), AlphaZero's mechanisms can be applied to single-agent (or single-player) tasks. Although powerful, the MCTS relies on value approximations which can be hard to predict (van Hasselt et al., 2016; Pohlen et al., 2018). Furthermore, without proper normalization, it can be difficult for value function approximators to adapt to small improvements in later stages of training. In recent years, there has been increasing interest in learning sequential solutions for combinatorial optimization problems (COPs) from zero knowledge via deep reinforcement learning. Particularly strong results have been achieved with policy gradient methods using variants of self-critical training (Rennie et al., 2017; Kool et al., 2018; Kwon et al., 2020), which avoid learning a value function altogether. By baselining the gradient estimate with the outcome of rolling out a current or historical policy, actions are reinforced by how much better (or worse) an episode is compared to the rollouts. Something similar is achieved in MCTS-based algorithms for single-player tasks by computing a scalar baseline from the agent's historical performance. The reward of the original task is reshaped to a binary ±1 outcome indicating whether an episode has exceeded this baseline or not (Laterre et al., 2018; Schmidt et al., 2019; Mandhane et al., 2022). This self-competition brings the original single-player task closer to a two-player game and bypasses the need for in-tree value scaling during training.
The scalar baseline against which the agent is planning must be carefully chosen, as it should be neither too difficult nor too easy to outperform. Additionally, in complex problems, a single scalar value holds limited information about the instance at hand and the agent's strategies for reaching the threshold performance. In this paper, we follow the idea of self-competition in AlphaZero-style algorithms for deterministic single-player sequential planning problems. Inspired by AlphaZero's original powerful 'intuition' for evaluating board positions, we propose to evaluate states in the value function not by comparing them against a scalar threshold but directly against states at similar timesteps coming from a historical version of the agent. The agent learns by reasoning about potential strategies of its past self. We summarize our contributions as follows: (i) We assume that in a self-competitive framework, the scalar outcome of a trajectory ζ under some baseline policy is less informative for tree-based planning than ζ's intermediate states. Our aim is to bring the flexibility of rollouts in self-critical training into a self-competitive framework while maintaining the policy's information in intermediate states of the rollout. We motivate this setup from the viewpoint of advantage baselines, show that policy improvements are preserved, and arrive at a simple instance of gamification where two players start from the same initial state, take actions in turn, and aim to find a better trajectory than the opponent. (ii) We propose the algorithm GAZ Play-to-Plan (GAZ PTP) based on Gumbel AlphaZero (GAZ), the latest addition to the AlphaZero family, introduced by Danihelka et al. (2022). An agent plays the above game against a historical version of itself to improve a policy for the original single-player problem. The idea is to allow only one player in the game to employ MCTS and to compare its states to the opponent's to guide the search.
Policy improvements obtained through GAZ's tree search propagate from the game to the original task. We show the superiority of GAZ PTP over single-player variants of GAZ on two COP classes, the Traveling Salesman Problem and the standard Job-Shop Scheduling Problem. We compare GAZ PTP with different single-player variants of GAZ, with and without self-competition. We consistently outperform all competitors even when granting GAZ PTP only half of the simulation budget for search. In addition, we reach competitive results for both problem classes compared to benchmarks in the literature.

2. RELATED WORK

Gumbel AlphaZero In GAZ, Danihelka et al. (2022) redesigned the action selection mechanisms of AlphaZero and MuZero (Silver et al., 2017; Schrittwieser et al., 2020). At the root node, actions to explore are sampled without replacement using the Gumbel-Top-k trick (Vieira; Kool et al., 2019b). Sequential halving (Karnin et al., 2013) is used to distribute the search simulation budget among the sampled actions. The single action remaining from the halving procedure is selected as the action to be taken in the environment. At non-root nodes, action values of unvisited nodes are completed by a value interpolation, yielding an updated policy. An action is then selected deterministically by matching this updated policy to the visit count distribution (Grill et al., 2020). This procedure theoretically guarantees a policy improvement for correctly estimated action values, both for the root action selection and the updated policy at non-root nodes. Consequently, the principled search of GAZ works well even for a small number of simulations, as opposed to AlphaZero, which might perform poorly if not all actions are visited at the root node. Similarly to MuZero, GAZ normalizes action values with a min-max normalization based on the values found during the tree search to handle changing and unbounded reward scales. However, if, for example, a node value is overestimated, the probability of a good action might be reduced even when all simulations reach the end of the episode. Additionally, value function approximations might be challenging if the magnitude of rewards changes over time or must be approximated over long time horizons (van Hasselt et al., 2016; Pohlen et al., 2018).

Self-critical training Rennie et al. (2017) introduce self-critical training, a policy gradient method that baselines the REINFORCE (Williams, 1992) gradient estimator with the reward obtained by rolling out the current policy greedily. As a result, trajectories outperforming the greedy policy are given positive weight while inferior ones are suppressed. Self-critical training eliminates the need for learning a value function approximator (and thus all innate training challenges) and reduces the variance of the gradient estimates. The agent is further provided with an automatically controlled curriculum to keep improving. Self-critical methods have shown great success in neural combinatorial optimization. Kool et al. (2018) use a greedy rollout of a periodically updated best-so-far policy as a baseline for routing problems. Kwon et al. (2020) and Kool et al. (2019a) bundle the return of multiple sampled trajectories to baseline the estimator applied to various COPs. The idea of using rollouts to control the agent's learning curriculum is hard to transfer one-to-one to MCTS-based algorithms, as introducing a value network to avoid full Monte Carlo rollouts in the tree search was exactly one of the great strengths of AlphaZero (Silver et al., 2017).


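As a toy illustration of the self-critical baseline idea, the following sketch (ours, not taken from any of the cited papers) trains a softmax policy on a two-armed bandit with REINFORCE, using the reward of the current policy's greedy action as baseline:

```python
import math
import random

# Toy self-critical REINFORCE sketch (hedged illustration): a 2-armed
# bandit where arm 0 pays 1.0 and arm 1 pays 0.0. The greedy "rollout"
# here is just the argmax action of the current policy.
ARM_REWARDS = [1.0, 0.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_self_critical(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        # Sampled "trajectory" (a single action) and its return.
        a = rng.choices([0, 1], weights=probs)[0]
        sampled_return = ARM_REWARDS[a]
        # Baseline: return of the greedy rollout of the current policy.
        greedy_return = ARM_REWARDS[max(range(2), key=lambda k: logits[k])]
        adv = sampled_return - greedy_return
        # REINFORCE update: d log pi(a) / d logit_k = 1{k=a} - p_k.
        for k in range(2):
            logits[k] += lr * adv * ((k == a) - probs[k])
    return softmax(logits)

probs = train_self_critical()
```

Trajectories worse than the greedy baseline receive negative weight and are suppressed, so the policy concentrates on the better arm without any learned value function.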
Self-competition Literature is abundant on learning by creating competitive environments for (practical) single-player problems, e.g., Bansal et al. (2017); Sukhbaatar et al. (2018); Zhong et al. (2019); Xu & Lieberherr (2020); Wang et al. (2020); Göttl et al. (2022). Reward mechanisms based on self-competition for AlphaZero-type algorithms are different from but comparable to self-critical training in policy gradient methods. A scalar baseline is computed from the agent's historical performance, against which the current agent needs to compete. A reward of ±1 is returned at the end of an episode, depending on whether the trajectory performed better than the baseline, eliminating the need for value scaling and normalization. An iterative process of improvement is created by continuously or periodically updating the baseline according to the agent's performance. As in self-critical training, the baseline automatically provides the agent with the right curriculum to improve. We use the term self-competition in the context of competing against historical performance to distinguish it from self-play as in board games, where usually the latest agent version is used. Ranked Reward (Laterre et al., 2018) stores historical rewards in a buffer, which is used to calculate a threshold value based on a predetermined percentile. Mandhane et al. (2022) use an exponential moving average-based scheme for the threshold and apply self-competition to maximize a constrained objective. Beloborodov et al. (2020) use a similar approach to Ranked Reward, but also allow non-binary rewards in [-1, 1]. The described self-competition based on combining historical performances cannot differentiate between problem instances. If an instance's difficulty is higher than usual, in extreme cases the agent might obtain a reward of -1 even though the found solution is optimal, thus adding significant noise to the reshaped reward (Bello et al., 2016; Laterre et al., 2018).
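The Ranked-Reward-style reshaping described above can be sketched as follows (a minimal illustration; the buffer size and percentile are our own placeholder choices, not the values used by any of the cited works):

```python
from collections import deque

# Hedged sketch of percentile-based self-competition (in the spirit of
# Laterre et al., 2018): episode rewards are kept in a bounded buffer,
# and a new episode receives +1 iff it beats a fixed percentile of the
# agent's historical performance.
class SelfCompetitionReward:
    def __init__(self, buffer_size=256, percentile=75):
        self.buffer = deque(maxlen=buffer_size)
        self.percentile = percentile

    def threshold(self):
        ranked = sorted(self.buffer)
        idx = (len(ranked) - 1) * self.percentile // 100
        return ranked[idx]

    def reshape(self, episode_reward):
        """Return the binary +/-1 reward and update the history buffer."""
        z = 1 if not self.buffer or episode_reward > self.threshold() else -1
        self.buffer.append(episode_reward)
        return z
```

Because the buffer mixes rewards from many instances, such a threshold cannot adapt to the difficulty of the instance at hand, which is exactly the weakness discussed above.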

3.1. PROBLEM FORMULATION

We consider undiscounted Markov decision processes (MDPs) of finite problem horizon $T$ with state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and an initial state distribution $\rho_0$. The goal is to find a state-dependent policy $\pi$ which maximizes the expected total reward
$$G^\pi := \mathbb{E}_{s_0 \sim \rho_0} \, \mathbb{E}_{\zeta_0 \sim \eta^\pi(\cdot|s_0)} \left[ \sum_{t=0}^{T-1} r(s_t, a_t) \right],$$
where $\eta^\pi(\cdot|s_t)$ is the distribution over possible trajectories $\zeta_t = (s_t, a_t, \dots, s_{T-1}, a_{T-1}, s_T)$ obtained by rolling out policy $\pi$ from state $s_t$ at timestep $t$ up to the finite horizon $T$. We use the subscript $t$ in actions and states to indicate the timestep index. For an action $a_t \sim \pi(\cdot|s_t)$ in the trajectory, the state transitions deterministically to $s_{t+1} = F(s_t, a_t)$ according to some known deterministic state transition function $F : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$. We shortly write $a_t s_t := F(s_t, a_t)$ for a state transition in the following. With abuse of notation, we write $r(\zeta_t) := \sum_{l=0}^{T-1-t} r(s_{t+l}, a_{t+l})$ for the accumulated return of a trajectory. As usual, we denote by
$$V^\pi(s_t) := \mathbb{E}_{\zeta_t \sim \eta^\pi(\cdot|s_t)}[r(\zeta_t)],$$
$$Q^\pi(s_t, a_t) := r(s_t, a_t) + \mathbb{E}_{\zeta_{t+1} \sim \eta^\pi(\cdot|a_t s_t)}[r(\zeta_{t+1})],$$
$$A^\pi(s_t, a_t) := Q^\pi(s_t, a_t) - V^\pi(s_t)$$
the state-value function, action-value function, and advantage function w.r.t. the policy $\pi$. Furthermore, for two policies $\pi$ and $\mu$, we define
$$V^{\pi,\mu}(s_t, s_l) := \mathbb{E}_{\zeta_t \sim \eta^\pi(\cdot|s_t), \, \zeta_l \sim \eta^\mu(\cdot|s_l)}[r(\zeta_t) - r(\zeta_l)]$$
as the expected difference in accumulated rewards taken over the joint distribution of $\eta^\pi(\cdot|s_t)$ and $\eta^\mu(\cdot|s_l)$. We define $Q^{\pi,\mu}(s_t, s_l; a_t) := r(s_t, a_t) + V^{\pi,\mu}(a_t s_t, s_l)$ and $A^{\pi,\mu}(s_t, s_l; a_t) := Q^{\pi,\mu}(s_t, s_l; a_t) - V^{\pi,\mu}(s_t, s_l)$ analogously. The following lemma follows directly from the definitions. It tells us that we can work with the policy $\mu$ as with any other scalar baseline, and that policy improvements are preserved. A proof is given in Appendix A.1.

Lemma 1 Let $\pi$, $\pi'$, and $\mu$ be state-dependent policies. For any states $s_t, s_l \in \mathcal{S}$ and action $a_t \in \mathcal{A}$, we have
$$A^\pi(s_t, a_t) = A^{\pi,\mu}(s_t, s_l; a_t), \quad (2)$$
and
$$\sum_{a_t} \pi'(a_t|s_t) Q^\pi(s_t, a_t) - V^\pi(s_t) = \sum_{a_t} \pi'(a_t|s_t) Q^{\pi,\mu}(s_t, s_l; a_t) - V^{\pi,\mu}(s_t, s_l). \quad (3)$$
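The relative value $V^{\pi,\mu}$ defined above can be estimated by paired Monte Carlo rollouts. The following sketch is ours, purely for illustration: the toy MDP (state = timestep, reward = chosen action in {0, 1}) and all names are placeholder assumptions:

```python
import random

# Hedged sketch: Monte Carlo estimation of the relative value
# V^{pi,mu}(s_t, s_l) = E[r(zeta_t) - r(zeta_l)] from paired rollouts.
T = 5  # problem horizon of the toy MDP

def rollout_return(policy, t, rng):
    """Accumulated reward of one trajectory from timestep t to horizon T."""
    total = 0.0
    for step in range(t, T):
        total += policy(step, rng)  # reward r(s, a) = a in this toy MDP
    return total

def relative_value(pi, mu, t, l, n_samples, seed=0):
    """Average of r(zeta_t) - r(zeta_l) over paired rollouts of pi and mu."""
    rng = random.Random(seed)
    diffs = [rollout_return(pi, t, rng) - rollout_return(mu, l, rng)
             for _ in range(n_samples)]
    return sum(diffs) / n_samples

always_one = lambda step, rng: 1.0                       # pi: always action 1
coin_flip = lambda step, rng: 1.0 if rng.random() < 0.5 else 0.0  # mu: uniform
```

By Lemma 1, such a relative estimate can stand in for an ordinary value baseline without changing which policies count as improvements.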

3.2. MOTIVATION FOR THE TWO-PLAYER GAME

We now consider MDPs with episodic rewards, i.e., we assume $r(s_t, a_t) = 0$ for $t < T-1$. The policy target and action selection within the search tree in GAZ are based on the following policy update: at some state $s$, given logits of the form $\mathrm{logit}_\pi(a)$ for an action $a$ predicted by a policy $\pi$, an updated policy $\pi^{\mathrm{GAZ}}$ is obtained by setting
$$\mathrm{logit}_{\pi^{\mathrm{GAZ}}}(a) = \mathrm{logit}_\pi(a) + \sigma(\hat{A}^\pi(s, a)). \quad (4)$$
Here, $\sigma$ is some monotonically increasing linear function and $\hat{A}^\pi(s, a) := \hat{Q}^\pi(s, a) - \hat{V}^\pi(s)$ is an advantage estimation, where $\hat{V}^\pi(s)$ is a value approximation based on the output of a value network and $\hat{Q}^\pi(s, a)$ is a Q-value approximation coming from the tree search statistics (and is set to $\hat{V}^\pi(s)$ for unvisited actions). Note that (4) differs from the presentation in Danihelka et al. (2022) due to the additional assumption that $\sigma$ is linear, which matches its practical choice (see Appendix A.2 for a derivation). The update (4) is proven to provide a policy improvement but relies on the correctness of the value approximations $\hat{V}^\pi$.

Our aim is to couple the ideas of self-critical training and self-competition using a historical policy. Firstly, the baseline should adapt to the difficulty of an instance via rollouts as in self-critical training. Secondly, we want to avoid losing information about intermediate states of the rollout, as when condensing it to a scalar value. Consider some initial state $s_0 \sim \rho_0$, a policy $\pi$, and a historical version $\mu$ of $\pi$. Let $\zeta_0^{\mathrm{greedy}} = (s_0', a_0', \dots, s_{T-1}', a_{T-1}', s_T')$ be the trajectory obtained by rolling out $\mu$ greedily. We propose to plan against $\zeta_0^{\mathrm{greedy}}$ timestep by timestep instead of baselining the episode with $r(\zeta_0^{\mathrm{greedy}})$, i.e., we want to approximate $V^{\pi,\mu}(s_t, s_t')$ for any state $s_t$. By Lemma 1, we have for any policy $\pi'$
$$\sum_{a_t} \pi'(a_t|s_t) Q^\pi(s_t, a_t) \ge V^\pi(s_t) \iff \sum_{a_t} \pi'(a_t|s_t) Q^{\pi,\mu}(s_t, s_t'; a_t) \ge V^{\pi,\mu}(s_t, s_t'). \quad (5)$$
So if $\pi'$ improves policy $\pi$ baselined by $\mu$, then $\pi'$ improves $\pi$ in the original MDP.
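The update (4) can be sketched in a few lines. This is our own minimal illustration: `c_scale` stands in for the slope of the linear $\sigma$ and is not GAZ's actual parameterization, and `None` marks unvisited actions whose Q-values are completed by the state value:

```python
import math

# Hedged sketch of the GAZ policy update in (4): logits are shifted by a
# monotonically increasing linear transform of the estimated advantage,
# with Q-values of unvisited actions completed by the state value.
def gaz_updated_policy(logits, q_hat, v_hat, c_scale=1.0):
    """q_hat[a] is the tree-search Q estimate, or None if a was unvisited."""
    completed_q = [v_hat if q is None else q for q in q_hat]
    sigma = lambda adv: c_scale * adv            # monotone and linear
    new_logits = [lg + sigma(q - v_hat) for lg, q in zip(logits, completed_q)]
    m = max(new_logits)
    exps = [math.exp(x - m) for x in new_logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Completing unvisited actions with $\hat{V}^\pi(s)$ gives them zero advantage, so their logits are left unchanged while visited actions are re-weighted by their estimated advantage.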
Furthermore, $\hat{A}^\pi(s_t, a_t)$ in (4) can be swapped with an approximation of $\hat{A}^{\pi,\mu}(s_t, s_t'; a_t)$ for the update in GAZ. Note that
$$V^{\pi,\mu}(s_t, s_t') = \mathbb{E}_{\zeta_t, \zeta_t'}[r(\zeta_t) - r(\zeta_t')] = \mathbb{E}_{\zeta_t, \zeta_t'}\left[\mathrm{sgn}(r(\zeta_t) - r(\zeta_t')) \cdot |r(\zeta_t) - r(\zeta_t')|\right].$$
Especially in later stages of training, improvements might be small, and it can be hard to approximate the expected reward difference $|r(\zeta_t) - r(\zeta_t')|$. Thus, we switch to a binary reward as in self-competitive methods and arrive at
$$V^{\pi,\mu}_{\mathrm{sgn}}(s_t, s_t') := \mathbb{E}_{\zeta_t \sim \eta^\pi(\cdot|s_t), \, \zeta_t' \sim \eta^\mu(\cdot|s_t')}[\mathrm{sgn}(r(\zeta_t) - r(\zeta_t'))], \quad (6)$$
the final target output of the value network in the self-competitive formulation. Note that (6) uncouples the network from the explicit requirement to predict the expected outcome of the original MDP to decide if $\pi$ is in a more advantageous state at timestep $t$ than $\mu$ (for further details, see Appendix D). We can define $Q^{\pi,\mu}_{\mathrm{sgn}}$ analogously to (6) and obtain an improved policy $\pi'$ via
$$\mathrm{logit}_{\pi'}(a_t) := \mathrm{logit}_\pi(a_t) + \sigma\!\left(\hat{Q}^{\pi,\mu}_{\mathrm{sgn}}(s_t, s_t'; a_t) - \hat{V}^{\pi,\mu}_{\mathrm{sgn}}(s_t, s_t')\right).$$

4. GAZ PLAY-TO-PLAN

4.1. GAME MECHANICS

Given the original MDP, the above discussion generalizes to a two-player zero-sum perfect-information game with separated states: two players start with identical copies $s_0^1$ and $s_0^{-1}$, respectively, of an initial state $s_0 \sim \rho_0$. The superscripts $1$ and $-1$ indicate the first (max-) and second (min-) player, respectively. Player $1$ observes both states $(s_0^1, s_0^{-1})$, chooses an action $a_0^1$, and transitions to the next state $s_1^1 = a_0^1 s_0^1$. Afterwards, player $-1$ observes $(s_1^1, s_0^{-1})$, chooses action $a_0^{-1}$, and transitions to state $s_1^{-1} = a_0^{-1} s_0^{-1}$. Then, player $1$ observes $(s_1^1, s_1^{-1})$, chooses action $a_1^1$, and in this manner both players take turns until they arrive at terminal states $s_T^1$ and $s_T^{-1}$. Given the resulting trajectory $\zeta_0^p$ for player $p \in \{1, -1\}$, the outcome $z$ of the game is set to $z = 1$ if $r(\zeta_0^1) \ge r(\zeta_0^{-1})$ (player $1$ wins) and $z = -1$ otherwise (player $1$ loses). In case of equality, player $1$ wins to discourage player $-1$ from simply copying moves.

Algorithm 1: GAZ Play-to-Plan (GAZ PTP) Training
  Input: $\rho_0$: initial state distribution; $J_{\mathrm{arena}}$: set of initial states sampled from $\rho_0$
  Input: $0 \le \gamma < 1$: self-play parameter
  Init policy replay buffer $M_\pi \leftarrow \emptyset$ and value replay buffer $M_V \leftarrow \emptyset$
  Init parameters $\theta, \nu$ for policy net $\pi_\theta : \mathcal{S} \to \Delta\mathcal{A}$ and value net $V_\nu : \mathcal{S} \times \mathcal{S} \to [-1, 1]$
  Init 'best' parameters $\theta_B \leftarrow \theta$
  foreach episode do
    Sample initial state $s_0 \sim \rho_0$ and set $s_0^p \leftarrow s_0$ for $p = 1, -1$
    Assign learning actor to player position: $l \leftarrow \mathrm{random}(\{1, -1\})$
    Set greedy actor's policy $\mu \leftarrow \pi_\theta$ with probability $\gamma$, $\pi_{\theta_B}$ with probability $1 - \gamma$
    for $t = 0, \dots, T-1$ do
      for player $p = 1, -1$ do
        if player $p \ne l$ then take greedy action $a_t^p$ according to policy $\mu(s_t^p)$
        else choose $a_t^p$ with the GAZ tree search and store the resulting policy target for $s_t^p$ in $M_\pi$
    Have trajectories $\zeta^p \leftarrow (s_0^p, a_0^p, \dots, s_{T-1}^p, a_{T-1}^p, s_T^p)$ for players $p \in \{1, -1\}$
    $z \leftarrow 1$ if $r(\zeta^1) \ge r(\zeta^{-1})$, $-1$ else  (game outcome from perspective of player 1)
    Store tuples $(s_t^1, s_t^{-1}, z)$ and $(s_t^{-1}, s_{t+1}^1, -z)$ in $M_V$ for all timesteps $t$
    Periodically update $\theta_B \leftarrow \theta$ if $\sum_{s_0 \in J_{\mathrm{arena}}} r(\zeta^{\mathrm{greedy}}_{0,\pi_\theta}) - r(\zeta^{\mathrm{greedy}}_{0,\pi_{\theta_B}}) > 0$

4.2. ALGORITHM

Algorithm 1 summarizes our approach for improving the agent's performance, and we provide an illustration in the appendix in Figure 2. In each episode, a 'learning actor' improves its policy by playing against a 'greedy actor', which is a historical greedy version of itself. We elaborate on the main parts in the following (more details can be found in Appendix B).
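The turn-taking game of Section 4.1 can be sketched as a small driver over a generic deterministic environment. The driver, `transition`, and `episode_reward` below are our own stand-ins for $F$ and $r$, purely for illustration:

```python
# Hedged sketch of the two-player separated-states game: both players
# start from copies of s0, act in turn while observing both states, and
# the outcome z compares their episode rewards (ties go to player 1 to
# discourage player -1 from simply copying moves).
def play_game(s0, policy_1, policy_m1, transition, episode_reward, T):
    """Returns (trajectory_1, trajectory_m1, z) after T turns per player."""
    s = {1: s0, -1: s0}
    traj = {1: [s0], -1: [s0]}
    for t in range(T):
        for p, policy in ((1, policy_1), (-1, policy_m1)):
            a = policy(s[1], s[-1], p)   # each player observes both states
            s[p] = transition(s[p], a)
            traj[p].append(s[p])
    z = 1 if episode_reward(traj[1]) >= episode_reward(traj[-1]) else -1
    return traj[1], traj[-1], z
```

Note that each player only ever transitions its own copy of the state; the opponent's state enters solely through the policies' observations and the final comparison.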

Network and training

An agent maintains a state-dependent policy network $\pi_\theta : \mathcal{S} \to \Delta\mathcal{A}$ and value network $V_\nu : \mathcal{S} \times \mathcal{S} \to [-1, 1]$, parameterized by $\theta$ and $\nu$, respectively, where $\Delta\mathcal{A}$ is the probability simplex over actions and $V_\nu$ serves as a function approximator of (6). The agent generates experience in each episode for training the networks by playing the game in Section 4.1 against a frozen, historically best policy version $\pi_{\theta_B}$ of itself. Initially, $\theta$ and $\theta_B$ are equal. The networks are trained from experience replay as in AlphaZero, where we store value targets from the perspective of both players and only policy targets coming from the learning actor.
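The wiring of the two networks can be sketched as follows. This is a deliberately simplified stand-in: the paper uses a Transformer encoder, whereas the tiny random linear maps here (dimensions and weights ours) only illustrate how the policy head consumes one encoding and the value head consumes a pair:

```python
import math
import random

# Hedged sketch of the network layout: a shared encoder f : S -> R^d,
# a policy head g over f(s), and a value head h over (f(s), f(s')) with
# output squashed into [-1, 1] via tanh.
D = 4  # latent dimension (illustrative)

rng = random.Random(0)
W_f = [[rng.gauss(0, 0.5) for _ in range(3)] for _ in range(D)]  # encoder
W_g = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(2)]  # policy head
W_h = [rng.gauss(0, 0.5) for _ in range(2 * D)]                  # value head

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def f(state):             # shared encoding part
    return [math.tanh(v) for v in matvec(W_f, state)]

def policy(state):        # pi_theta(s) = g(f(s)), a distribution over actions
    logits = matvec(W_g, f(state))
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def value(state, other):  # V_nu(s, s') = h(f(s), f(s')) in [-1, 1]
    joint = f(state) + f(other)
    return math.tanh(sum(w * x for w, x in zip(W_h, joint)))
```

The key structural point is that both heads share the encoder $f$, so the state comparison in the value head reuses the same representation that drives the policy.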

Choice of players

The learning actor and the greedy actor are randomly assigned to player positions at the beginning of each episode. The greedy actor simply chooses actions greedily according to its policy $\pi_{\theta_B}$. In contrast, the learning actor chooses actions using the search and action selection mechanism of GAZ to improve its policy and dominate over $\pi_{\theta_B}$. We found the policy to become stronger when the learning actor makes its moves sometimes after and sometimes before the greedy actor. At test time, we fix the learning actor to the position of player 1.

GAZ tree search for learning actor

The learning actor uses MCTS to improve its policy and dominate over the greedy behavior of $\pi_{\theta_B}$. We use the AlphaZero variant (as opposed to MuZero) in Danihelka et al. (2022) for the search, i.e., we do not learn models for the reward and transition dynamics. The search tree is built through several simulations, as usual, each consisting of the phases selection, expansion, and backpropagation. Each node in the tree is of the form $N = (s^1, s^{-1}, k)$, consisting of both players' states, where $k \in \{1, -1\}$ indicates which player's turn it is. Each edge $(N, a)$ is a node paired with a certain action $a$. We apply the tree search of GAZ similarly to a two-player board game, with two major modifications:

(i) Choice of actions: For the sake of explanation, let us assume that $l = 1$, i.e., the learning actor takes the position of player 1. In the selection phase, an action is chosen at any node as follows: if the learning actor is to move, we regularly choose an action according to GAZ's tree search rules, which are based on the completed Q-values. If it is the greedy actor's turn, there are two ways to use the policy $\pi_{\theta_B}$: we either always sample an action from $\pi_{\theta_B}$, or always choose one greedily. By sampling a move from $\pi_{\theta_B}$, the learning actor is forced to plan not only against one (possibly suboptimal) greedy trajectory but also against other potential trajectories. This encourages exploration (see Figure 1 in the experimental results), but a single action of the learning actor can lead to separate branches in the search tree. On the other hand, greedily choosing an action is computationally more efficient, as only the states in the trajectory $\zeta^{\mathrm{greedy}}_{0,\pi_{\theta_B}}$ are needed for comparison in the MCTS (see Appendix B.6 for notes on efficient implementation). However, if the policy $\pi_{\theta_B}$ is too strong or too weak, the learning actor might be slow to improve in the beginning. We refer to the variant of sampling actions for the greedy actor in the tree search as GAZ PTP 'sampled tree' (GAZ PTP ST) and to the variant of choosing actions greedily as GAZ PTP 'greedy tree' (GAZ PTP GT).

(ii) Backpropagation: Actions are chosen as above until a leaf edge $(N, a)$ with $N = (s^1, s^{-1}, k)$ is reached. When a leaf edge $(N, a)$ is expanded and $k = -l = -1$, the learning actor is to move in the new node $\tilde{N} = (s^1, a s^{-1}, 1)$. We proceed by querying the policy and value network to obtain $\pi_\theta(s^1)$ and $v := V_\nu(s^1, a s^{-1})$. The new node $\tilde{N}$ is added to the tree, and the value approximation $v$ is backpropagated up the search path. If $k = 1$, i.e., it is the greedy actor's turn in $\tilde{N} = (a s^1, s^{-1}, -1)$, we only query the policy network $\pi_{\theta_B}(s^{-1})$, choose an action $\tilde{a}$ from it (sampled or greedy), and directly expand the edge $(\tilde{N}, \tilde{a})$. In particular, we do not backpropagate a value approximation of state pairs where the greedy actor is to move. Nodes where it is the greedy actor's turn are similar to afterstates (Sutton & Barto, 2018), which represent chance transitions in stochastic environments. We further illustrate the procedure in Appendix B.2.

Arena mode To ensure we always play against the strongest greedy actor, we fix at the beginning of training a large enough 'arena' set of initial states $J_{\mathrm{arena}} \sim \rho_0$, on which $\pi_\theta$ and $\pi_{\theta_B}$ are periodically pitted against each other. For each initial state $s_0 \in J_{\mathrm{arena}}$, the policies $\pi_\theta$ and $\pi_{\theta_B}$ are unrolled greedily to obtain the trajectories $\zeta^{\mathrm{greedy}}_{0,\pi_\theta}$ and $\zeta^{\mathrm{greedy}}_{0,\pi_{\theta_B}}$, respectively. We replace $\theta_B$ with $\theta$ if $\sum_{s_0 \in J_{\mathrm{arena}}} r(\zeta^{\mathrm{greedy}}_{0,\pi_\theta}) - r(\zeta^{\mathrm{greedy}}_{0,\pi_{\theta_B}}) > 0$.

Self-play parameter The parameters $\theta, \theta_B$ might be initialized disadvantageously (worse than uniformly random play), so that the learning actor generates no useful experience while maintaining a greedy policy worse than $\pi_{\theta_B}$. We especially experienced this behavior when using a small number of simulations for the tree search.
To mitigate this and to stabilize training, we switch to self-play in an episode with a small nonzero probability $\gamma$, i.e., the greedy actor uses the non-stationary policy $\pi_\theta$ instead of $\pi_{\theta_B}$.
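The arena-mode promotion test described in this section amounts to a single summed comparison of greedy returns. A minimal sketch (names ours; `greedy_return` stands in for unrolling a policy greedily from an initial state and summing rewards):

```python
# Hedged sketch of the arena update: theta_B is replaced by theta only if
# the current greedy policy achieves a strictly higher total reward than
# the frozen best policy over a fixed arena set of initial states.
def should_promote(arena_states, greedy_return, policy_new, policy_best):
    margin = sum(greedy_return(policy_new, s0) - greedy_return(policy_best, s0)
                 for s0 in arena_states)
    return margin > 0
```

Requiring a strictly positive margin means ties keep the incumbent $\theta_B$, so the greedy actor is only replaced by a demonstrably stronger version.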

5. EXPERIMENTS

We evaluate our algorithm on the Traveling Salesman Problem (TSP) and the standard Job-Shop Scheduling Problem (JSSP). For this purpose, we train the agent on randomly generated instances to learn heuristics that generalize beyond the presented instances. Additionally, learned solvers often focus on short evaluation times while accepting a high number of training instances. Our main goal is to show the performance of GAZ PTP compared with applying GAZ in a single-player way. Nevertheless, we also present results of recent learned solvers to put our approach into perspective. An in-depth comparison is difficult for several reasons. (i) Running MCTS is slower than a single (possibly batch-wise) rollout of a policy network. As we do not aim to present an optimized learned solver for TSP or JSSP, we omit comparisons of inference times. (ii) The network architecture is kept simple to be adjustable for both problem classes. (iii) We give only episodic rewards, even if there would be (canonical) intermediate rewards. Furthermore, we do not include heuristics in state representations. (iv) Finally, the agent starts from zero knowledge and runs for a comparably low number of episodes: 100k for TSP (for comparison, Kool et al. (2018) train on 128M trajectories) and 20k for JSSP (40k trajectories in Zhang et al. (2020)).

5.1. SINGLE-PLAYER VARIANTS

We compare our approach GAZ PTP with the following single-player GAZ variants:

GAZ Single Vanilla We apply GAZ for single-agent domains to the original MDP. For the value target, we predict the final outcome of the episode, i.e., we train the value network $V_\nu : \mathcal{S} \to \mathbb{R}$ on tuples of the form $(s_t, r(\zeta_0))$ for a trajectory $\zeta_0 = (s_0, a_0, \dots, a_{T-1}, s_T)$. We transform the Q-values with a min-max normalization based on the values found inside the tree search to cope with different reward scales in MCTS (Schrittwieser et al., 2020; Danihelka et al., 2022). For details, see Appendix C.1.3.
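The min-max normalization used here can be sketched as follows, modeled on the tree-statistics normalization published with MuZero (a hedged sketch, not our exact implementation):

```python
# Hedged sketch of in-tree min-max normalization (Schrittwieser et al.,
# 2020): Q-values are rescaled to [0, 1] using the extrema of the values
# observed so far in the current search tree.
class MinMaxStats:
    def __init__(self):
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, value):
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def normalize(self, value):
        if self.maximum > self.minimum:
            return (value - self.minimum) / (self.maximum - self.minimum)
        return value  # fewer than two distinct values seen so far
```

Because the extrema are tracked per search, the normalization adapts to whatever reward scale the current instance produces, which is exactly what the self-competitive ±1 formulation lets GAZ PTP avoid.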

GAZ Single N-Step Same as 'GAZ Single Vanilla', except we bootstrap $N$ steps into the future for the value target, as used in MuZero (Schrittwieser et al., 2020) and its variants (Hubert et al., 2021; Antonoglou et al., 2022). As we do not assume intermediate rewards, this is equivalent to predicting the undiscounted root value of the search tree $N$ steps into the future. We set $N = 20$ in all instances.

GAZ Greedy Scalar This is a self-competitive version with a scalar baseline, which can differentiate between instances as in a self-critical setting. At the start of each episode, a full greedy rollout $\zeta^{\mathrm{greedy}}_0$ from the initial state $s_0$ is performed with a historical policy to obtain an instance-specific baseline $R := r(\zeta^{\mathrm{greedy}}_0) \in \mathbb{R}$. At test time, $\max\{R, \text{learning actor's outcome}\}$ is taken as the result in the original MDP. A listing of the algorithm is provided in Appendix C.1.4.

5.2. GENERAL SETUP

We outline the common experimental setup for both TSP and JSSP. Full details and specifics can be found in Appendix C.1.

GAZ MCTS We closely follow the original work of Danihelka et al. (2022) for the choice of the function $\sigma$ in (4) and the completion of Q-values. The number of simulations allowed from the search tree's root node is critical, since a higher simulation budget generally yields more accurate value estimates. Although two edges are expanded in a single simulation in the modified tree search of GAZ PTP, the learning actor considers only one additional action in the expansion step, because the greedy actor's moves are treated as environment transitions. Nevertheless, we grant the single-player variants twice as many simulations at the root node as GAZ PTP in all experiments.
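Sequential halving (Karnin et al., 2013), which GAZ uses to split the root simulation budget among the sampled actions, can be sketched as follows. This is a hedged illustration: `simulate` stands in for running one MCTS simulation below an action and returning a value estimate, and the per-round budget split is a simplification:

```python
import math

# Hedged sketch of sequential halving: all surviving candidate actions
# are simulated equally often, then the worse half is discarded, until a
# single action remains.
def sequential_halving(actions, budget, simulate):
    scores = {a: 0.0 for a in actions}
    visits = {a: 0 for a in actions}
    rounds = max(1, math.ceil(math.log2(len(actions))))
    survivors = list(actions)
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        sims_each = max(1, budget // (rounds * len(survivors)))
        for a in survivors:
            for _ in range(sims_each):
                scores[a] += simulate(a)
                visits[a] += 1
        survivors.sort(key=lambda a: scores[a] / visits[a], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]
```

Spending the budget evenly per round on ever fewer candidates concentrates simulations on the most promising actions, which is why the search remains effective at small simulation counts.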

Network architecture The policy and value network share a common encoding part $f : \mathcal{S} \to \mathbb{R}^d$ mapping states to some latent space $\mathbb{R}^d$. A policy head $g : \mathbb{R}^d \to \Delta\mathcal{A}$ and a value head $h : \mathbb{R}^d \times \mathbb{R}^d \to [-1, 1]$ are stacked on top of $f$ such that $\pi_\theta(s) = g(f(s))$ and $V_\nu(s, s') = h(f(s), f(s'))$. For both problems, $f$ is based on the Transformer architecture (Vaswani et al., 2017), and the policy head $g$ uses a pointing mechanism as in Bello et al. (2016) and Kool et al. (2018). For GAZ PTP and all single-player variants, the architecture of $f$ is identical, and $g$, $h$ are similar (see Appendix C.1.5).

Results are reported in Table 1. For TSP, we also list the greedy results of the self-critical method of Kool et al. (2018). They train a similar attention model autoregressively with a greedy rollout baseline. For JSSP, we include the results of the L2D method of Zhang et al. (2020), who propose a single-agent approach taking advantage of the disjunctive graph representation of JSSP. There exist optimized approaches for TSP and JSSP which achieve even better results. Since it is not our aim to present a new state-of-the-art solver, an exhaustive comparison is not necessary. All experiments, except for the robustness experiments below, are seeded with 42.

Traveling Salesman Problem The TSP is a fundamental routing problem that, given a graph, asks for a node permutation (a complete tour) with minimal edge weight. We focus on the two-dimensional Euclidean case in the unit square $[0, 1]^2 \subseteq \mathbb{R}^2$, and train on small- to medium-sized instances with $n = 20$, $50$, and $100$ nodes. Tours are constructed sequentially by choosing one node to visit after the other. We run the agent for 100k episodes (see Appendix C.2 for further details). We report the average performance of all GAZ variants on the 10,000 test instances of Kool et al. (2018). Optimality gaps are calculated using optimal solutions found by the Concorde TSP solver (Applegate et al., 2006).
Both GAZ PTP ST and GAZ PTP GT consistently outperform all single-player variants and obtain strong results even when simply unrolling the learned policy greedily (see 'Num Sims = greedy' in Table 1). GAZ PTP GT yields better results than GAZ PTP ST across all instances, especially when rolling out the learned policy greedily. While all methods perform well on $n = 20$, the single-player variants strongly underperform on $n = 50$ and $n = 100$. GAZ Greedy Scalar fails to improve early in the training process for $n = 100$.

Job-Shop Scheduling Problem

The JSSP is an optimization problem in operations research, where we are given k jobs consisting of individual operations, which need to be scheduled on m machines. We focus on the standard case, where there is a bijection between the machines and the operations of a job. The objective is to find a schedule with a minimum makespan, i.e., the time when all jobs are finished (see Appendix C.3 for details). The size of a problem instance is denoted by k ×m (jobs × machines). To construct a schedule, we iteratively choose unfinished jobs of which to process the next operation. The agent is run for 20k episodes. We report results on instances of medium size 15 × 15, 20 × 20, and 30 × 20 of the well-known Taillard benchmark set (Taillard, 1993) , consisting of 10 instances for each problem size. Optimality gaps are calculated with respect to the best upper bounds found in the literature (see Appendix C.3.3). As for TSP, our method outperforms all single-player variants of GAZ. In contrast to TSP, the performance of GAZ PTP ST and GAZ PTP GT is comparable. Due to the reduced network evaluations, GAZ PTP GT is generally more favorable. GAZ Greedy Scalar yields strong results only for 15 × 15 but fails to learn on larger instances, similar to GAZ Single N-Step. Reproducibility We further evaluate the robustness and plausibility on four different seeds for TSP n = 50 and JSSP 15 × 15 and 20 × 20 in Figure 1 . To lower the computational cost, we reduce the number of episodes on TSP to 50k and JSSP to 10k, with a simulation budget of 100 for TSP (70 for JSSP) for single-player variants, and 50 for TSP (35 for JSSP) for GAZ PTP. The different seeding leads to only small variations in performance for TSP. JSSP is more challenging. Again, we can observe that GAZ PTP outperforms the single-player variants across all seeds. Especially, GAZ PTP escapes bad initializations early in the training process and becomes stable. 
The in-tree sampling of GAZ PTP ST encourages exploration, leading to swift early improvements.

Value estimates as baselines By comparing pairs of states to guide the tree search, we give the network some freedom to model the problem without relying explicitly on predicted expected objectives. In Appendix D, we compare our approach with using value estimates of a historical policy as baselines in the advantage function of GAZ's tree search.

Limitations In contrast to GAZ PTP GT, the search space increases exponentially for the variant GAZ PTP ST, and an additional network evaluation is needed for $\pi_{\theta_B}(\cdot)$ in each MCTS simulation (see Appendix B.6). Nevertheless, our empirical results show that both GAZ PTP variants perform well with a small number of simulations. In this paper, we only consider problem classes with constant episode length. However, the presented methodology extends to varying episode lengths: once a player finishes, only its terminal state is considered in the remaining tree search (i.e., within the two-player game, the unfinished player keeps its turn). Furthermore, even though GAZ's policy improvements propagate from the two-player setup to the original single-player problem by Lemma 1, an improved policy does not necessarily imply improved greedy behavior. Especially in later stages of training, it can take thousands of steps until the distribution of $\pi_\theta$ is sharp enough for updating the parameters $\theta_B$.

6. CONCLUSION

We introduced GAZ PTP, a self-competitive method that combines greedy rollouts, as in self-critical training, with the planning dynamics of two-player games. The self-critical transformation of a deterministic single-player task does not alter the theoretical policy improvement guarantees obtained through the principled search of GAZ. Experiments on the TSP and JSSP confirm that our method learns strong policies in the original task with a low number of search simulations.

7. REPRODUCIBILITY STATEMENT

We provide information about network architectures, hyperparameters, and training details in the appendix. Our code in PyTorch (Paszke et al., 2017) is available at https://github.com/grimmlab/policy-based-self-competition.

A PROOFS

A.1 PROOF OF LEMMA 1

Lemma: Let $\pi$ and $\mu$ be state-dependent policies. For any states $s_t, s_l \in \mathcal{S}$ and action $a_t \in \mathcal{A}$, we have $A^{\pi}(s_t, a_t) = A^{\pi,\mu}(s_t, s_l; a_t)$, and
$$\sum_{a_t} \pi(a_t \mid s_t)\, Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) = \sum_{a_t} \pi(a_t \mid s_t)\, Q^{\pi,\mu}(s_t, s_l; a_t) - V^{\pi,\mu}(s_t, s_l).$$

Proof: We recall the definitions for the sake of clarity, where $a_t s_t$ denotes the successor state of $s_t$ after taking action $a_t$:
$$V^{\pi,\mu}(s_t, s_l) := \mathbb{E}_{\substack{\zeta_t \sim \eta^{\pi}(\cdot \mid s_t) \\ \zeta_l \sim \eta^{\mu}(\cdot \mid s_l)}} \left[ r(\zeta_t) - r(\zeta_l) \right]$$
$$Q^{\pi,\mu}(s_t, s_l; a_t) := r(s_t, a_t) + V^{\pi,\mu}(a_t s_t, s_l)$$
$$A^{\pi,\mu}(s_t, s_l; a_t) := Q^{\pi,\mu}(s_t, s_l; a_t) - V^{\pi,\mu}(s_t, s_l)$$
Note that
$$V^{\pi}(s_t) = \mathbb{E}_{\zeta_t \sim \eta^{\pi}(\cdot \mid s_t)}[r(\zeta_t)] = \mathbb{E}_{\zeta_l \sim \eta^{\mu}(\cdot \mid s_l)} \mathbb{E}_{\zeta_t \sim \eta^{\pi}(\cdot \mid s_t)}[r(\zeta_t)] = \mathbb{E}_{\substack{\zeta_t \sim \eta^{\pi}(\cdot \mid s_t) \\ \zeta_l \sim \eta^{\mu}(\cdot \mid s_l)}}[r(\zeta_t)], \tag{8}$$
by the law of iterated expectations and because the realization of a trajectory following $\pi$ does not depend on $\mu$ and vice versa. Hence, by linearity of expectation,
$$V^{\pi}(s_t) - V^{\mu}(s_l) = \mathbb{E}_{\substack{\zeta_t \sim \eta^{\pi}(\cdot \mid s_t) \\ \zeta_l \sim \eta^{\mu}(\cdot \mid s_l)}}\left[ r(\zeta_t) - r(\zeta_l) \right] = V^{\pi,\mu}(s_t, s_l), \tag{10}$$
and it follows that
$$\begin{aligned}
A^{\pi}(s_t, a_t) &= Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \\
&= Q^{\pi}(s_t, a_t) - V^{\mu}(s_l) - \left( V^{\pi}(s_t) - V^{\mu}(s_l) \right) \\
&= r(s_t, a_t) + V^{\pi}(a_t s_t) - V^{\mu}(s_l) - \left( V^{\pi}(s_t) - V^{\mu}(s_l) \right) \\
&\overset{(10)}{=} r(s_t, a_t) + V^{\pi,\mu}(a_t s_t, s_l) - V^{\pi,\mu}(s_t, s_l) \\
&= Q^{\pi,\mu}(s_t, s_l; a_t) - V^{\pi,\mu}(s_t, s_l) \\
&= A^{\pi,\mu}(s_t, s_l; a_t).
\end{aligned}$$
Furthermore, we have
$$\begin{aligned}
\sum_{a_t} \pi(a_t \mid s_t)\, Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
&= \sum_{a_t} \pi(a_t \mid s_t) \left( r(s_t, a_t) + V^{\pi}(a_t s_t) \right) - V^{\pi}(s_t) \\
&= \sum_{a_t} \pi(a_t \mid s_t) \left( r(s_t, a_t) + V^{\pi}(a_t s_t) - V^{\mu}(s_l) \right) + V^{\mu}(s_l) - V^{\pi}(s_t) \\
&\overset{(10)}{=} \sum_{a_t} \pi(a_t \mid s_t) \underbrace{\left( r(s_t, a_t) + V^{\pi,\mu}(a_t s_t, s_l) \right)}_{= Q^{\pi,\mu}(s_t, s_l; a_t)} + V^{\mu}(s_l) - V^{\pi}(s_t) \\
&= \sum_{a_t} \pi(a_t \mid s_t)\, Q^{\pi,\mu}(s_t, s_l; a_t) - \left( V^{\pi}(s_t) - V^{\mu}(s_l) \right) \\
&= \sum_{a_t} \pi(a_t \mid s_t)\, Q^{\pi,\mu}(s_t, s_l; a_t) - V^{\pi,\mu}(s_t, s_l).
\end{aligned}$$

A.2 DERIVATION OF THE LOGIT UPDATE (4)

In Danihelka et al.
(2022), the improved policy $\pi_{\text{GAZ}}$ is obtained by the logit update
$$\mathrm{logit}_{\pi_{\text{GAZ}}}(a) := \mathrm{logit}_{\pi}(a) + \sigma(\hat{Q}^{\pi}(s, a)). \tag{24}$$
Subtracting the constant $\sigma(\hat{V}^{\pi}(s))$ does not alter the subsequent softmax output of the logits, so we can equivalently set
$$\mathrm{logit}_{\pi_{\text{GAZ}}}(a) = \mathrm{logit}_{\pi}(a) + \sigma(\hat{Q}^{\pi}(s, a)) - \sigma(\hat{V}^{\pi}(s)) \overset{\sigma \text{ linear}}{=} \mathrm{logit}_{\pi}(a) + \sigma(\hat{Q}^{\pi}(s, a) - \hat{V}^{\pi}(s)) = \mathrm{logit}_{\pi}(a) + \sigma(\hat{A}^{\pi}(s, a)). \tag{25}$$
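The softmax invariance used in this derivation is easy to check numerically. Below is a minimal sketch with a simple linear σ and made-up logits and Q-values (all numbers hypothetical):

```python
import math

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def improved_policy(logits, q_values, v, sigma):
    # logit_{pi_GAZ}(a) = logit_pi(a) + sigma(Q(a)) - sigma(V);
    # with a linear sigma this equals logit_pi(a) + sigma(A(a)).
    return softmax([l + sigma(q) - sigma(v) for l, q in zip(logits, q_values)])

sigma = lambda q: 2.0 * q          # linear sigma (hypothetical scale)
logits = [0.1, -0.3, 0.7]
qs = [0.5, 0.2, -0.1]
v = 0.3

with_const = improved_policy(logits, qs, v, sigma)
without_const = softmax([l + sigma(q) for l, q in zip(logits, qs)])
# Subtracting the constant sigma(V) from every logit leaves the softmax unchanged.
assert all(abs(a - b) < 1e-12 for a, b in zip(with_const, without_const))
```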

B ADDITIONAL ALGORITHM DETAILS

B.1 A NOTE ON SELF-PLAY IN THE TWO-PLAYER GAME

The mechanics of the proposed two-player game differ from classical board games such as chess or Go, as a player's action does not influence the opponent's state. We can interpret trajectories in the original MDP as strategies in the game: if some trajectory α yields a higher final reward than trajectory β, its corresponding strategy α weakly dominates strategy β (Leyton-Brown & Shoham, 2008). This is desired and means that playing α will always yield an outcome z at least as good as playing β, no matter what the opponent does. Usually, in AlphaZero-type self-play for board games, both players choose their moves utilizing MCTS. By limiting the tree search to the learning actor, we can effectively distill even incremental policy improvements into the policy network.

B.2 MODIFIED TREE SEARCH

We provide a schematic view of the proposed game in Figure 2, and an illustration of the modified selection, expansion, and backpropagation in Figure 3. As outlined in Section 4.2, we do not evaluate state pairs with $V_\nu$ in nodes where it is the greedy actor's turn. This is due to computational efficiency. Suppose the greedy actor is player −1, and it is the greedy actor's turn in a node $\tilde{\mathcal{N}} = (a s^1, s^{-1}, -1)$, which was added to the tree in an expansion step. By the modified tree search, an action is sampled from $\pi_{\theta_B}(s^{-1})$. As the policy only depends on the player's own state, $\pi_{\theta_B}(s^{-1})$ can be reused for the action sampling in a node $(s^1, s^{-1}, -1)$ for any state $s^1$ without additional network evaluations, as the value function does not need to be queried. This can reduce the number of network evaluations in subsequent simulations.

Figure 3: Example MCTS for the learning actor, when sampling actions for the greedy actor: In nodes with solid lines (blue), the learning actor (player 1) is to move, and in nodes with dashed lines (red), the greedy actor (player −1) is to move. In the selection phase (solid arrows), an action is selected in solid nodes according to the search principles of GAZ, based on $\pi_\theta$ and the completed Q-values. In nodes with dashed lines, an action is sampled from $\pi_{\theta_B}$. Suppose the dashed edge (A) is expanded in the expansion phase; then it is the greedy actor's turn in the following node: an action is sampled from $\pi_{\theta_B}(\tilde{a}^{-1} s^{-1})$ and the corresponding edge (B) is immediately expanded as well. Only then is the predicted value of the following solid node backpropagated through the search path.
This differs from the experiments in Stochastic MuZero (Antonoglou et al., 2022), where the authors report significantly improved results for AlphaZero on Backgammon and 2048 when a Q-value is learned for afterstates (chance nodes). For GAZ PTP, we did not observe notable improvements and trade the learned Q-value for increased computational efficiency. At test time, the greedy actor's moves are not sampled in the tree search but chosen greedily, to increase the search tree's depth.

B.3 A NOTE ON THE ARENA MODE

As only the sign of the reward difference $r(\zeta_{0,\pi_\theta}) - r(\zeta^{\text{greedy}}_{0,\pi_{\theta_B}})$ is considered to compute the outcome of the game, and not its magnitude, training a non-stationary policy through self-play can exhibit behavior where the overall performance of a policy in the original MDP gradually degrades, even though the agent gets better at beating itself. In the optimal case, we would like to replace the frozen parameters $\theta_B$ with $\theta$ if $\mathbb{E}_{s_0 \sim \rho_0} \mathbb{E}_{\zeta_0 \sim \eta^{\pi_\theta}(\cdot \mid s_0)}[r(\zeta_0)] > \mathbb{E}_{s_0 \sim \rho_0} \mathbb{E}_{\zeta'_0 \sim \eta^{\pi_{\theta_B}}(\cdot \mid s_0)}[r(\zeta'_0)]$. This cannot be guaranteed by the binary reward in our self-competitive framework alone. We settle for a fixed arena set $J_{\text{arena}}$ on which the policies are pitted greedily, to ensure an improved greedy actor and to avoid cycling performance due to stochasticity as far as possible.
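The arena check can be sketched in a few lines; `greedy_reward(params, s0)` is a hypothetical helper that unrolls the policy with the given parameters greedily from s0 and returns the episodic reward. This is a sketch of the mechanism, not the authors' implementation:

```python
def arena_update(theta, theta_B, greedy_reward, arena_states):
    # Replace the frozen parameters theta_B by theta only if the current
    # policy's greedy rollouts beat the historical policy's rollouts in
    # total over the fixed arena set.
    margin = sum(
        greedy_reward(theta, s0) - greedy_reward(theta_B, s0)
        for s0 in arena_states
    )
    return theta if margin > 0 else theta_B
```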

B.4 TRAINING DATA

Given trajectories $\zeta_0^p = (s_0^p, a_0^p, \ldots, s_{T-1}^p, a_{T-1}^p, s_T^p)$ for player $p \in \{1, -1\}$, training data is stored in the replay buffers at the end of an episode as follows. For the value network, the final game outcome $z$ is taken from the perspective of both players, i.e., we store the tuples $(s_t^1, s_t^{-1}, z)$ and $(s_t^{-1}, s_{t+1}^1, -z)$. For the policy network, we store the tuples $(s_t^l, \mathcal{I}\pi(s_t^l))$ only for the learning actor $l \in \{1, -1\}$, where $\mathcal{I}\pi(s_t^l)$ is the improved policy obtained through the GAZ tree search at state $s_t^l$.
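The storage scheme can be sketched as follows, assuming player 1 is the learning actor and plain Python lists as buffers (layout and names are hypothetical):

```python
def store_episode(value_buffer, policy_buffer, states_p1, states_p2,
                  improved_policies, z):
    """Store training tuples at the end of an episode.

    states_p1 / states_p2: states s_0^p, ..., s_T^p of players 1 and -1;
    improved_policies: improved policy targets I_pi(s_t^1) for the learning
    actor (player 1 here); z: final game outcome from player 1's view.
    """
    T = len(states_p1) - 1  # number of actions taken by each player
    for t in range(T):
        # Outcome from both perspectives; note the shifted index for
        # player -1, who is compared against player 1's *next* state.
        value_buffer.append((states_p1[t], states_p2[t], z))
        value_buffer.append((states_p2[t], states_p1[t + 1], -z))
    for s, target in zip(states_p1, improved_policies):
        policy_buffer.append((s, target))
```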

B.5 LOSS FUNCTIONS

We train the value network by minimizing the squared error $(V_\nu(s, s') - z)^2$ for a sampled tuple $(s, s', z)$ from the replay buffer, and the policy network by minimizing the Kullback-Leibler divergence $\mathrm{KL}(\mathcal{I}\pi(s) \,\|\, \pi_\theta(s))$ for a tuple $(s, \mathcal{I}\pi(s))$.
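For a single sampled tuple, the two loss terms can be sketched in plain Python (operating on probability vectors rather than raw network outputs; the small eps is an assumption for numerical safety):

```python
import math

def value_loss(v_pred, z):
    # Squared error between predicted state-pair value and game outcome z.
    return (v_pred - z) ** 2

def policy_loss(improved_pi, current_pi, eps=1e-12):
    # KL(improved || current) for one sampled state.
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(improved_pi, current_pi)
    )

assert value_loss(0.5, 1.0) == 0.25
assert abs(policy_loss([0.5, 0.5], [0.5, 0.5])) < 1e-9  # identical -> KL = 0
```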

B.6 EFFICIENT IMPLEMENTATION OF GAZ PTP GT

Algorithm 1 is a general formulation of our method encompassing both variants, in which the greedy actor samples actions in the search tree in one (GAZ PTP ST) and chooses actions greedily in the other (GAZ PTP GT). For GAZ PTP GT, only the states of the greedy actor encountered in the trajectory $\zeta^{\text{greedy}}_{0,\pi_{\theta_B}} = (s_0, a_0, \ldots, s_{T-1}, a_{T-1}, s_T)$ are needed in the MCTS for the learning actor. Furthermore, in practice the policy and value network share an encoding part $f: \mathcal{S} \to \mathbb{R}^d$ such that $V_\nu(s, s') = h(f(s), f(s'))$ for a value head $h: \mathbb{R}^d \times \mathbb{R}^d \to [-1, 1]$ (see Section 5.2). Hence, at the beginning of an episode, $\zeta^{\text{greedy}}_{0,\pi_{\theta_B}}$ can be obtained once in advance of the tree search, as the policies $\pi_\theta$ and $\pi_{\theta_B}$ are independent. The states $s_0, \ldots, s_{T-1}$ can be batched, and their latent representations $f(s_0), \ldots, f(s_{T-1}) \in \mathbb{R}^d$ can be stored in memory, effectively reducing the number of network evaluations in each search simulation from two to one (as desired in network-guided MCTS).
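The caching described above can be sketched as follows; `encoder` stands in for f, `value_head` for h, and the class name is a hypothetical illustration of the idea, not the authors' code:

```python
class GreedyRolloutCache:
    """Precompute the latent encodings f(s_t) of the greedy actor's
    trajectory once per episode (in practice batched through the network),
    so each MCTS simulation needs only one fresh network evaluation."""

    def __init__(self, encoder, greedy_states):
        self.latents = [encoder(s) for s in greedy_states]

    def value(self, value_head, learner_latent, t):
        # V_nu(s, s') = h(f(s), f(s')) with the cached f(s'_t).
        return value_head(learner_latent, self.latents[t])
```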

C EXPERIMENTAL DETAILS

C.1 GENERAL SETUP

C.1.1 TRAINING LOOP

All algorithmic variants fit into the asynchronous training loop commonly used for MuZero, where a learning process receives generated trajectories, stores them in a replay buffer, and performs training steps for the network. Multiple playing processes generate trajectories from initial states randomly sampled on the fly, using periodically updated network checkpoints from the learning process for MCTS. We generate experience with 100 playing processes and update the network checkpoint every 100 training steps in all variants.

C.1.2 GAZ MCTS

Given a node $\mathcal{N}$ in the search tree, denote by $N(a)$ the visit count of the edge $(\mathcal{N}, a)$ for action $a$. We follow Danihelka et al. (2022) for the choice of the monotonically increasing linear function $\sigma$ in (4) and set $\sigma(q) = (c_{\text{visit}} + \max_b N(b)) \cdot c_{\text{scale}} \cdot q$.

For both TSP and JSSP, $f$ is based on the Transformer architecture and its underlying multi-head attention (MHA) layers (Vaswani et al., 2017). For some state $s \in \mathcal{S}$, the network $f$ outputs a concatenation of vectors $f(s) = [\bar{s}; \tilde{a}_1; \ldots; \tilde{a}_m]$, where $\bar{s} \in \mathbb{R}^d$ and $\tilde{a}_i \in \mathbb{R}^d$ are latent representations of the state $s$ and the actions $a_1, \ldots, a_m$. The network architecture of $f$ is identical for GAZ PTP and all single-player variants (see C.2.3 and C.3.4).

Value head The value head consists of an MLP with two hidden layers of size $d$ with GELU activation for TSP (and three hidden layers of size $2d$ for JSSP). The input and output of the MLP differ between variants:
• GAZ PTP: We input a concatenation $[\bar{s}^1; \bar{s}^{-1}] \in \mathbb{R}^{2d}$ of the latent state vectors for both players. The output is mapped to $[-1, 1]$ via tanh-activation.
• GAZ Greedy Scalar: We input a concatenation $[\bar{s}^1; R] \in \mathbb{R}^{d+1}$, where $R \in \mathbb{R}$ is the outcome of the greedy rollout. As in GAZ PTP, the output is mapped to $[-1, 1]$ via tanh-activation.
• GAZ Single Vanilla/N-Step: Only the latent state $\bar{s} \in \mathbb{R}^d$ serves as input, with linear output.

Policy head For the policy head $g$, the logit for an action $a_i \in \{a_1, \ldots, a_m\}$ is computed using a pointing mechanism based on the attention of $\bar{s}$ and the $\tilde{a}_i$'s, similarly to Bello et al. (2016) and Kool et al. (2018): we compute (single-head) attention weights $u_1, \ldots, u_m \in [-C, C] \subseteq \mathbb{R}$ via
$$u_i = C \cdot \tanh\left( \frac{(W_Q \bar{s})^{T} (W_K \tilde{a}_i)}{\sqrt{d}} \right),$$
with constant $C = 10$ and learnable linear maps $W_Q, W_K: \mathbb{R}^d \to \mathbb{R}^d$. The weight $u_i$ is interpreted as the logit for the probability of action $a_i$ and is set to $-\infty$ for infeasible (masked) actions.
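The transform σ from C.1.2 can be written down directly from its formula; a minimal sketch with the constants stated in the text:

```python
def sigma(q, visit_counts, c_visit=50, c_scale=1.0):
    # sigma(q) = (c_visit + max_b N(b)) * c_scale * q,
    # a monotonically increasing linear function of q.
    return (c_visit + max(visit_counts)) * c_scale * q

# With visit counts [3, 0, 1]: sigma(0.2) = (50 + 3) * 1.0 * 0.2 = 10.6
assert abs(sigma(0.2, [3, 0, 1]) - 10.6) < 1e-9
```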
In GAZ PTP, the above process is performed with $w = \mathrm{FF}(y) + y$ instead of $\bar{s}$, where $y = \mathrm{SHA}(\bar{s}; \tilde{a}_1, \ldots, \tilde{a}_m)$ is a transformation of the vector $\bar{s}$ through a layer of single-head attention (SHA) with $\bar{s}$ as the only query and keys $\tilde{a}_1, \ldots, \tilde{a}_m$. This simplifies the task of aligning the state encoding $f$ for the value and policy head, as in GAZ PTP the value head operates on two separate state encodings. The same design choice did not make a difference in GAZ Greedy Scalar/Single Vanilla/N-Step, so we removed it there to speed up computation.

C.1.6 HYPERPARAMETERS

Arena episodes are played every 400 episodes in GAZ PTP and GAZ Greedy Scalar. In all experiments, the replay buffer holds the data of the latest 2000 episodes. We set the self-play parameter to γ = 0.2. We use Adam (Kingma & Ba, 2014) as the optimizer with a constant learning rate of $10^{-4}$, sampling batches of size 256 at each training step. Gradients are clipped to unit $L_2$-norm.

C.2 TSP

C.2.1 ENVIRONMENT

An initial state $s_0$ is given by $n$ nodes $s_0 = \{x_1, \ldots, x_n\} \subseteq [0, 1]^2 \subseteq \mathbb{R}^2$. An ordered tour is constructed sequentially by picking one node to visit after the other, iteratively completing a partial tour. In particular, actions are represented by (unvisited) nodes. The agent decides from which node to start the tour. We take a relative view at timestep $t > 0$ and represent a state (partial tour) by a tuple $s_t = (l_t, x_{t,\text{start}}, x_{t,\text{end}}, X_t = \{x_{t_1}, \ldots, x_{t_{n-t}}\})$, where $l_t$ is the length of the current partial tour, $x_{t,\text{start}} = a_{t-1}$ is the last node in the partial tour, $x_{t,\text{end}} = a_0$ is the first chosen node (to which the tour must eventually return), and $X_t$ is the remaining set of unvisited nodes from which the next action is picked. The terminal state $s_T$ is a complete solution with $X_T = \emptyset$. The negative length of the full tour is given as a reward at the end of the episode (with zero rewards in between).
All tour lengths are scaled by division with $\sqrt{2}\, n$, an upper bound on the length of any tour through $n$ nodes in the unit square (each of the $n$ tour edges has length at most $\sqrt{2}$).

C.2.2 DATA GENERATION AND TRAINING

Initial states for training are uniformly sampled on the fly. We fix 300 states for the arena set J arena , which is identical for GAZ PTP and GAZ Greedy Scalar. During training, the model is evaluated periodically on a fixed validation set of 100 states to determine the final model. The final model is evaluated on the 10,000 instances of Kool et al. (2018) , which were generated with the random seed 1234. The agent is run for 100k episodes, keeping a ratio of the number of played episodes to the number of optimizer steps of approximately 1 to 0.1n. The network architecture is independent of the number of input nodes, but for comparability, we train it from scratch for each n ∈ {20, 50, 100}. We sample (at most) 16 actions without replacement at the root of GAZ's search tree. The simulation budget in the tree search is 100 for our approach and 200 for all single-player variants. We augment training data by applying a random reflection, rotation, and linear scaling within the unit square to states sampled from the replay buffer.
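The augmentation step can be sketched as follows; the exact way reflections, rotations, and scalings are sampled is not specified in the text, so the scheme below (an axis reflection, a rotation by a multiple of 90 degrees around the square's center, and a uniform scaling) is an assumption:

```python
import random

def augment_tsp_nodes(nodes, rng=random):
    # Random reflection, rotation and linear scaling within the unit square.
    k = rng.randrange(4)               # rotation by k * 90 degrees
    flip = rng.random() < 0.5          # reflection across x = 0.5
    scale = 0.5 + 0.5 * rng.random()   # uniform scaling, stays inside the square
    out = []
    for x, y in nodes:
        if flip:
            x = 1.0 - x
        for _ in range(k):
            # Rotate by 90 degrees around the center (0.5, 0.5).
            x, y = y, 1.0 - x
        out.append((scale * x, scale * y))
    return out
```

Reflections and rotations are isometries, and a uniform scaling multiplies all pairwise distances by the same factor, so an optimal tour of the augmented instance corresponds to an optimal tour of the original one.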

C.2.3 STATE ENCODING NETWORK

The state encoding network $f$ consists of a sequence-to-sequence Transformer architecture similar to the encoder in Kool et al. (2018). We use a latent dimension of $d = 128$ in all TSP experiments. The network is composed of the following components:
• Learnable lookup embeddings $E_{\text{token}}, E_{\text{start}}, E_{\text{end}}, E_{\text{start-ind}}, E_{\text{end-ind}} \in \mathbb{R}^d$.
• Affine maps $W_{\text{len}}, W_{\text{num}}: \mathbb{R} \to \mathbb{R}^d$ and $W_{\text{node}}: \mathbb{R}^2 \to \mathbb{R}^d$.
• A simple stack of five Transformer blocks with eight heads in the self-attention and layer normalization before the MHA and the FF (Wang et al., 2019).

The structure of a Transformer block and the encoding procedure are summarized in Figure 4. For an intermediate state $s_t = (l_t, x_{t,\text{start}}, x_{t,\text{end}}, X_t = \{x_{t_1}, \ldots, x_{t_{n-t}}\})$, we construct an input sequence
$$(E_{\text{token}},\ W_{\text{len}}(l_t),\ W_{\text{num}}(n - t),\ W_{\text{node}}(x_{t,\text{start}}) + E_{\text{start-ind}},\ W_{\text{node}}(x_{t,\text{end}}) + E_{\text{end-ind}},\ W_{\text{node}}(x_{t_1}), \ldots, W_{\text{node}}(x_{t_{n-t}})).$$
The sequence element corresponding to $E_{\text{token}}$ is a token representing the state, similar to the class token in natural language processing (NLP) (Devlin et al., 2019). The two-dimensional nodes are affinely embedded into $\mathbb{R}^d$. As Transformer architectures are invariant to sequence permutations by design, we add the learnable lookup embeddings $E_{\text{start-ind}}, E_{\text{end-ind}}$ to the start and end nodes to mark them, comparable to position embeddings in NLP. For an initial state $s_0$, we use $W_{\text{len}}(0)$ for the second sequence element. Further, as there are no start and end nodes in the initial state yet, we use the learnable embeddings $E_{\text{start}}, E_{\text{end}}$ instead of the affine embeddings $W_{\text{node}}(x_{t,\text{start}}), W_{\text{node}}(x_{t,\text{end}})$. The sequence is passed to the stack of Transformer blocks. For each attention head and pair of nodes $x, y$ in the sequence, we add a spatial bias $w_h \cdot \|x - y\|_2 + b_h \in \mathbb{R}$ to the attention weight corresponding to $x, y$, similarly to Graphormer architectures (Ying et al., 2021).
As outlined in C.1.5, $f$ eventually outputs $f(s_t) = [\bar{s}; \tilde{a}_1; \ldots; \tilde{a}_{n-t}]$, where $\bar{s}$ is the first output sequence element, corresponding to the state token $E_{\text{token}}$, and $\tilde{a}_i$ corresponds to the output sequence element of the $i$-th remaining node $W_{\text{node}}(x_{t_i})$. The main structural difference to Kool et al. (2018) is that in our case the latent representation of a state is not computed autoregressively but is recomputed at each state from the partial tour length and the two-dimensional coordinates of the remaining nodes.

C.3.4 STATE ENCODING NETWORK

There are two types of sequences in a problem instance: (i) for each job $j_i$, a sequence of operations, where the order of the operations matters; (ii) the entirety of jobs, a sequence whose order does not matter (similar to the sequence of nodes in the TSP). The encoding network $f$ consists of two stacked Transformer models: the first computes a latent representation for each job separately, and the second operates on the sequence of these job representations to compute a state encoding. Both networks operate in a latent space of dimension $d = 64$.

Job encoding

The Transformer network for encoding each job from the sequence of its unfinished operations consists of:
• Learnable one-dimensional embeddings $P_{\text{machine}}, P_{\text{operation}}: \{1, \ldots, m\} \to \mathbb{R}^d$.
• A learnable lookup embedding $E_{\text{job}} \in \mathbb{R}^d$.
• Affine maps $W_{\text{op}}: \mathbb{R}^{2d+1} \to \mathbb{R}^d$, $W_{\text{m-avail}}: \mathbb{R}^m \to \mathbb{R}^d$, and $W_{\text{j-avail}}: \mathbb{R} \to \mathbb{R}^d$.
• A simple stack of three Transformer blocks with four heads in the self-attention and instance normalization before the MHA and the FF.

Let $j_i \in J_t$ be an unfinished job. We construct a sequence in $\mathbb{R}^d$
$$(E_{\text{job}},\ W_{\text{m-avail}}((c_{t,1} - \bar{c}_t, \ldots, c_{t,m} - \bar{c}_t)),\ W_{\text{j-avail}}(\max\{0, e_{t,i} - \bar{c}_t\}),\ W_{\text{op}}([P_{\text{machine}}(o_{i,1}); P_{\text{operation}}(1); p_{i,1}]), \ldots, W_{\text{op}}([P_{\text{machine}}(o_{i,m}); P_{\text{operation}}(m); p_{i,m}])),$$
which is passed to the stack of Transformer blocks. We mask all operations which have already been scheduled. The first element of the output sequence, corresponding to $E_{\text{job}}$, is taken as a latent representation of the job $j_i$, which we denote by $j_i^{\text{latent}} \in \mathbb{R}^d$. The job encoding procedure is depicted in Figure 5.

State encoding

The network for encoding the state representation from the sequence of latent job representations consists of:
• A learnable lookup embedding $E_{\text{token}} \in \mathbb{R}^d$.
• Affine maps $W_{\text{num}}, W_{\text{len}}: \mathbb{R} \to \mathbb{R}^d$, $W_{\text{job}}: \mathbb{R}^d \to \mathbb{R}^d$, and $W_{\text{m-avail-2}}: \mathbb{R}^m \to \mathbb{R}^d$.
• A simple stack of four Transformer blocks with four heads in the self-attention and instance normalization before the MHA and the FF.

From the set of unfinished jobs $J_t = \{j_{t,1}, \ldots, j_{t,n_t}\}$, we construct a sequence
$$(E_{\text{token}},\ W_{\text{num}}(n_t),\ W_{\text{len}}(\bar{c}_t),\ W_{\text{m-avail-2}}((c_{t,1} - \bar{c}_t, \ldots, c_{t,m} - \bar{c}_t)),\ W_{\text{job}}(j_{t,1}^{\text{latent}}), \ldots, W_{\text{job}}(j_{t,n_t}^{\text{latent}})),$$
which is passed to the stack of Transformer blocks. Finally, $f$ outputs $f(s_t) = [\bar{s}; \tilde{a}_1; \ldots; \tilde{a}_{n_t}]$, where $\bar{s}$ is the first output sequence element, corresponding to $E_{\text{token}}$, and $\tilde{a}_i$ corresponds to the output sequence element of the $i$-th unfinished job $W_{\text{job}}(j_{t,i}^{\text{latent}})$.

D VALUE ESTIMATES AS TIMESTEP-DEPENDENT BASELINES

In this section, we provide more insight into why comparing pairs of states in the value function, instead of directly comparing pairs of predicted values, can be beneficial in a self-competitive setting. To recall, at some state $s_t$, the policy logit update in GAZ (both for the in-tree action selection and for obtaining a policy training target) is given by $\mathrm{logit}_{\pi_{\text{GAZ}}}(a) = \mathrm{logit}_\pi(a) + \sigma(\hat{A}^\pi(s, a))$, where $\hat{A}^\pi(s, a) = \hat{Q}^\pi(s, a) - \hat{V}^\pi(s)$ is an advantage estimation based on the Q-value estimates from the tree search and value network evaluations (cf. equation (4) in Section 3.2). In the following discussion, for ease of notation, we denote by $\pi := \pi_\theta$ the current policy of the learning actor and by $\mu := \pi_{\theta_B}$ the historical best greedy policy. As proposed in Section 3.2, in GAZ PTP we compare pairs of states at a timestep $t$ to assess how well the learning actor performs compared with its historical version. This is achieved by replacing the term $\hat{A}^\pi(s_t, a_t)$ in the logit update with $\hat{A}^{\pi,\mu}_{\mathrm{sgn}}(s_t, s'_t; a_t) := \hat{Q}^{\pi,\mu}_{\mathrm{sgn}}(s_t, s'_t; a_t) - \hat{V}^{\pi,\mu}_{\mathrm{sgn}}(s_t, s'_t)$, where the state $s'_t$ comes from a (greedy) trajectory of the policy $\mu$ (see equation (7)). The sgn in the subscript indicates that the value network is trained to estimate the sign of the episodic reward difference in the original MDP (cf. equation (6)). In the self-competitive framework, we work under the assumption that predicting the expected episode outcome without sophisticated techniques can be a hard task. By supplying the value network with training data consisting of state pairs and binary targets deciding which state is more advantageous, the network is given a certain amount of freedom in how to model the problem and does not rely explicitly (as, e.g., GAZ Single Vanilla does) on the value network's capability to predict the expected outcome.
In contrast, by explicitly using value predictions, the advantage estimation at a timestep $t$ can also be formulated as
$$\hat{A}^{\pi,\mu}_t(s_t, a_t) := \hat{Q}^\pi(s_t, a_t) - b^\mu_t, \tag{28}$$
where the baseline $b^\mu_t$ is an estimate of $\mathbb{E}_{\zeta_0 = (s_0, a_0, s_1, \ldots, s_T) \sim \eta^\mu(s_0)}\left[ V^\mu(s_t) \right]$. In this case, we baseline the Q-values from the standard single-player tree search with the expected value of the historical policy $\mu$ at timestep $t$. This is sensible, as it allows us to compute the estimate $b^\mu_t$ for all timesteps in advance of the learning actor's episode (similarly to GAZ PTP GT, see B.6), without the need to re-evaluate the states encountered by $\mu$ in the tree search. By using $b^\mu_t$, we maintain the postulated benefit of keeping instance-specific information about the behavior of $\mu$ at intermediate timesteps (via value estimates). Additionally, the MCTS can be run with GAZ in the standard single-player way, as baselining with $b^\mu_t$ fits elegantly into GAZ's in-tree action selection and policy improvement mechanisms (see equations (4) and (28)). In the following, we refer to this method as GAZ Single Timestep Baseline (GAZ Single TB) and consider two ways of obtaining $b^\mu_t$: (i) (GAZ Single TB Greedy) Let $\zeta^{\text{greedy}}_0 = (s_0, a_0, \ldots, s_{T-1}, a_{T-1}, s_T)$ be the trajectory obtained by rolling out $\mu$ greedily. As a counterpart to GAZ PTP GT, we set $b^\mu_t := V_\nu(s_t)$, where $V_\nu: \mathcal{S} \to \mathbb{R}$ is the learning actor's (single-player) value network. (ii) (GAZ Single TB Sampled) For $i \in \{1, \ldots, k\}$ for some $k \in \mathbb{N}$, we sample trajectories $\zeta_{0,i} = (s_0, a_{0,i}, \ldots, s_{T-1,i}, a_{T-1,i}, s_{T,i})$ using $\mu$ and average over the value network evaluations via $b^\mu_t := \frac{1}{k} \sum_{i=1}^k V_\nu(s_{t,i})$. We provide a listing of the method in Algorithm 3. The algorithmic variant GAZ PTP GT is similar to GAZ Single TB Greedy, except that we compare pairs of states in GAZ PTP GT and value estimates in GAZ Single TB Greedy.
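Both ways of obtaining $b^\mu_t$ can be sketched in a few lines, with `value_net` standing in for $V_\nu$ (a sketch, not the authors' code):

```python
def baselines_tb_greedy(value_net, greedy_states):
    # GAZ Single TB Greedy: b_t = V_nu(s_t) along one greedy rollout of mu.
    return [value_net(s) for s in greedy_states]

def baselines_tb_sampled(value_net, sampled_trajectories):
    # GAZ Single TB Sampled: average V_nu over k sampled rollouts of mu.
    k = len(sampled_trajectories)
    T = len(sampled_trajectories[0])
    return [sum(value_net(traj[t]) for traj in sampled_trajectories) / k
            for t in range(T)]

def advantage(q_value, b_t):
    # Equation (28): baseline the in-tree Q-value with the timestep baseline.
    return q_value - b_t
```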
Furthermore, the number of network evaluations is the same in both variants, because only a single greedy rollout of µ is required. Value estimates (resp. latent states for GAZ PTP GT) can be stored in memory for usage in the MCTS. We compare the two variants of GAZ Single TB with GAZ Single Vanilla and GAZ PTP GT in small-scale experiments for TSP n = 100 (50 search simulations, 20k episodes) and JSSP 15 × 15 (35 search simulations, 10k episodes). For GAZ Single TB Sampled, we sample k = 10 trajectories.



We greedily unroll the policy which was trained with 100 (resp. 50 for JSSP) simulations.



There is a plethora of deep reinforcement learning approaches for training heuristic solvers for COPs, including Bello et al. (2016); Deudon et al. (2018); Kool et al. (2018); Xing et al. (2020); da Costa et al. (2020); Kool et al. (2022) for the TSP and Zhang et al. (2020); Park et al. (2021a;b); Tassel et al. (2021) for the JSSP. Most of them are problem-specific and especially concentrate on how to optimize (graph) neural network-based architectures, state representations,

Figure 1: Mean training performance on TSP and JSSP, run with 4 distinct seeds ∈ {42, 43, 44, 45}. Shades denote standard errors. We omit the results for GAZ Single N-Step on JSSP for readability.

Figure 2: Schematic view of the proposed game in an example instance of the Traveling Salesman Problem:The learning actor is player 1 (solid outline, blue), the greedy actor is player -1 (dashed outline, red). The learning actor chooses moves via MCTS, taking into account the states of both players, whereas the greedy actor moves greedily with respect to only its own state.

Table 1: Results for TSP and JSSP. 'Num Sims' refers to the number of simulations starting from the root of the search tree.

ACKNOWLEDGMENTS

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) -466387255 -within the Priority Programme "SPP 2331: Machine Learning in Chemical Engineering". The authors gratefully acknowledge the Leibniz Supercomputing Centre for providing computing time on its Linux-Cluster.


Algorithm 2: GAZ Greedy Scalar Training
Input: $\rho_0$: initial state distribution
Input: $J_{\text{arena}}$: set of initial states sampled from $\rho_0$
Init policy replay buffer $M_\pi \leftarrow \emptyset$ and value replay buffer $M_V \leftarrow \emptyset$
Init parameters $\theta, \nu$ for policy net $\pi_\theta: \mathcal{S} \to \Delta\mathcal{A}$ and value net $V_\nu: \mathcal{S} \times \mathbb{R} \to [-1, 1]$
Init 'best' parameters $\theta_B \leftarrow \theta$
foreach episode do
    Sample initial state $s_0 \sim \rho_0$
    Perform greedy rollout using $\pi_{\theta_B}$ and obtain $R \leftarrow r(\zeta^{\text{greedy}}_{0,\pi_{\theta_B}})$
    for $t = 0, \ldots, T-1$ do
        Perform policy improvement $\mathcal{I}$ with MCTS using $V_\nu(\cdot, R)$ and policy $\pi_\theta(\cdot)$
        Receive improved policy $\mathcal{I}\pi(s_t)$, action $a_t$ and new state $s_{t+1}$
        Store $(s_t, \mathcal{I}\pi(s_t))$ in $M_\pi$
    Have trajectory $\zeta \leftarrow (s_0, a_0, \ldots, s_{T-1}, a_{T-1}, s_T)$ with binary outcome $z$ (reshaped reward indicating whether $r(\zeta)$ exceeds $R$)
    Store tuples $(s_t, R, z)$ in $M_V$ for all timesteps $t$
    Periodically update $\theta_B \leftarrow \theta$ if $\sum_{s_0 \in J_{\text{arena}}} \left( r(\zeta^{\text{greedy}}_{0,\pi_\theta}) - r(\zeta^{\text{greedy}}_{0,\pi_{\theta_B}}) \right) > 0$

We set the constants to $c_{\text{visit}} = 50$ and $c_{\text{scale}} = 1.0$, which has proven stable across various simulation budgets (Danihelka et al., 2022). We complete the vector of Q-values using the authors' value interpolation proposed in their appendix: for every unvisited action, the Q-value approximation in (4) is set to $\bar{V}$. Here, $\bar{V}$ is the value approximation of node $\mathcal{N}$ coming from the value network. We abuse the notation by omitting states (resp. state pairs for GAZ PTP), as (26) is used in all GAZ variants. In particular, unvisited actions are given zero advantage in (4).

C.1.3 GAZ NORMALIZATION

Rewards in single-player tasks usually have different scales and are not limited to ±1. We follow Schrittwieser et al. (2020) and Danihelka et al. (2022) and perform a min-max normalization in GAZ Single Vanilla and GAZ Single N-Step (prior to rescaling by σ) on the vector of completed Q-values, using the extreme values observed in the tree up to that point. That is, the maximum is given by the largest Q-value $Q(s, a)$ over all state-action pairs in the tree (and analogously for the minimum). This pushes the normalized advantages into the interval [−1, 1].
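A minimal sketch of this normalization; since every normalized value lies in [0, 1], any difference of two normalized values (an advantage) lies in [−1, 1]:

```python
def minmax(q, tree_min, tree_max):
    # Min-max normalization with the extreme values observed in the tree.
    span = tree_max - tree_min
    return (q - tree_min) / span if span > 0 else 0.0

qs = [-30.0, -12.0, -25.5]          # completed Q-values (hypothetical numbers)
lo, hi = min(qs), max(qs)
norm = [minmax(q, lo, hi) for q in qs]
assert min(norm) == 0.0 and max(norm) == 1.0
# Advantages, i.e. differences of normalized values, lie in [-1, 1].
assert all(-1.0 <= a - b <= 1.0 for a in norm for b in norm)
```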

C.1.4 GAZ GREEDY SCALAR

We provide a listing of the training algorithm for GAZ Greedy Scalar in Algorithm 2.

C.1.5 GENERAL NETWORK ARCHITECTURE

Feed-forward In the following, a feed-forward network (FF) always refers to a multilayer perceptron (MLP) with equal input and output dimensions and one hidden layer of four times the input dimension, with GELU activation (Hendrycks & Gimpel, 2016) .

State encoding

We model a state-encoding network $f: \mathcal{S} \to \mathbb{R}^d$, followed by a policy head $g: \mathbb{R}^d \to \Delta\mathcal{A}$ and a state-value head $h$; for GAZ PTP, $h$ is of the form $h: \mathbb{R}^d \times \mathbb{R}^d \to [-1, 1]$ (cf. B.6).

C.3 JSSP

In the standard JSSP, we are given a set of $k$ jobs $J = \{j_1, \ldots, j_k\}$, each consisting of $m$ operations which need to be scheduled on $m$ machines. Each job $j_i$ is associated with a permutation $(o_{i,l})_{l=1}^m$ of the machines, where $o_{i,l} \in \{1, \ldots, m\}$ indicates on which machine the $l$-th operation of job $j_i$ must run. Finishing an operation takes a processing time $p_{i,l} \in (0, 1]$. The operations of a job must run in order (precedence constraints), a machine can only process one operation at a time, and there is no preemption. The objective is to find a schedule with minimum makespan.

C.3.2 ENVIRONMENT

A schedule must satisfy all precedence constraints, and there is a bijection between the operations of a job and the $m$ machines. Thus, we can represent a schedule by a (not necessarily unique) sequence of jobs $(\hat{j}_1, \ldots, \hat{j}_{k \cdot m})$, where $\hat{j}_i \in J$ is an unfinished job whose next unscheduled operation should be scheduled at the earliest time possible. In particular, we can represent feasible actions in the environment by the set of unfinished jobs, limiting the number of possible actions to $k$. Note that a timestep $t$ in the environment corresponds to the $t$-th chosen action (unfinished job) and is not equal to the elapsed processing time in the schedule.

A state $s_t$ is given by a tuple $((c_{t,l})_{l=1}^m, (e_{t,i})_{i=1}^k, J_t)$, where
• $c_{t,l} \in \mathbb{R}_{\geq 0}$ is the finishing time of the latest scheduled operation on the $l$-th machine ('machine availability'),
• $e_{t,i} \in \mathbb{R}_{\geq 0}$ is the finishing time of the last scheduled operation of job $j_i$ ('job availability'), and
• $J_t = \{j_{t,1}, \ldots, j_{t,n_t}\} \subseteq J$ is the subset of jobs with unscheduled operations ('unfinished jobs').

In particular, we have $c_{0,l} = 0$, $e_{0,i} = 0$ and $J_0 = J$ at the initial state $s_0$. Let $\bar{c}_t := \min_l c_{t,l}$ in the following. The negative makespan $-\max_l c_{T,l}$ is given as a reward at the end of the episode (with zero rewards in between). Similarly to the TSP environment, we scale the final makespan by division with 100, so that for all problem sizes the makespan lies roughly in (0, 1). We illustrate the state representation in Figure 5, assuming an instance with three jobs $j_1, j_2, j_3$ and three machines. The tuple below a job indicates on which machines its operations must run. In the depicted unfinished schedule, all three operations of $j_1$, the first operation of $j_2$, and the first two operations of $j_3$ have been scheduled. The schedule can be represented by the sequence $(j_2, j_1, j_1, j_1, j_3, j_3)$.
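The environment dynamics above can be sketched as a small simulator. Scheduling 'at the earliest time possible' is implemented here in the simple semi-active sense (an operation starts once both its job and its machine are free); whether operations are additionally inserted into earlier idle gaps is not specified in the text, so this is an assumption:

```python
def makespan(machine_orders, proc_times, job_sequence):
    """machine_orders[i][l]: machine of the l-th operation of job i;
    proc_times[i][l]: its processing time; job_sequence: chosen actions."""
    k = len(machine_orders)
    machine_avail = {}        # c_l: time when machine l becomes free
    job_avail = [0.0] * k     # e_i: finishing time of job i's last operation
    next_op = [0] * k         # index of the next unscheduled operation per job
    for j in job_sequence:
        l = next_op[j]
        m = machine_orders[j][l]
        start = max(job_avail[j], machine_avail.get(m, 0.0))
        finish = start + proc_times[j][l]
        machine_avail[m] = finish
        job_avail[j] = finish
        next_op[j] += 1
    return max(job_avail)

# Two jobs on two machines with unit processing times: interleaving the jobs
# gives makespan 2, while scheduling job 0 completely first gives makespan 4.
orders, times = [[0, 1], [1, 0]], [[1.0, 1.0], [1.0, 1.0]]
assert makespan(orders, times, [0, 1, 0, 1]) == 2.0
assert makespan(orders, times, [0, 0, 1, 1]) == 4.0
```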
For a given number of jobs k and machines m, we generate training data on the fly by randomly sampling a machine permutation (o_{i,l})_{l=1}^m ∈ S_m for each job j_i and a random processing time p_{i,l} ∈ (0, 1] for each operation. We fix 200 instances for J_arena and a small validation set of size 20. We evaluate the model on the 10 benchmark instances of Taillard (1993) for each size: ta01-ta10 (size 15 × 15), ta21-ta30 (size 20 × 20) and ta41-ta50 (size 30 × 20). The benchmark instances have integer processing times in [1, 100], which we rescale to the unit interval by division with 100. We summarize the best upper bounds from the literature in Table 2, as reported in Zhang et al. (2020).

An episode takes k · m actions, so to reduce computation time, we limit the simulation budget to 50 for our approach and 100 for all single-player variants, running the agent for 20k episodes. We keep a ratio of the number of played episodes to the number of optimizer steps of approximately 1 to 0.02 k · m. We train from scratch for each problem size. As the action space is rather small (at most k actions at each timestep), we consider all feasible actions for the simulations at the root of the search tree.

During training, we augment states sampled from the replay buffer by linearly scaling the processing times p_{i,l}, job availability times e_{t,i} and machine availability times c_{t,l} with a random scalar in (0, 1). Furthermore, we shuffle the machines on which operations must be scheduled (e.g., operations on some machine A are reassigned to some machine B and vice versa). For the small-scale experiments, we use problem size 15 × 15 (35 search simulations, 10k episodes). For GAZ Single TB Sampled, we sample k = 10 trajectories.
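The on-the-fly instance generation and the replay-buffer augmentation described above can be sketched as follows. This is a minimal illustration under our reading of the text (function names are assumptions); in the described setup the same random scalar would also be applied to the availability features e_{t,i} and c_{t,l} of the stored states.

```python
import random

def sample_instance(k, m, rng=random):
    """Sample a training instance on the fly: a random machine permutation
    per job and a processing time in (0, 1] per operation."""
    machines = [rng.sample(range(m), m) for _ in range(k)]               # (o_{i,l})
    times = [[rng.uniform(1e-9, 1.0) for _ in range(m)] for _ in range(k)]  # p_{i,l}
    return machines, times

def augment(machines, times, rng=random):
    """Augmentation sketch: linearly scale all processing times by a random
    scalar in (0, 1) and relabel machines by a random permutation
    (operations on machine A move to machine B and vice versa)."""
    s = rng.random()
    scaled = [[p * s for p in row] for row in times]
    m = len(machines[0])
    perm = rng.sample(range(m), m)                  # machine relabeling
    relabeled = [[perm[mach] for mach in row] for row in machines]
    return relabeled, scaled
```

Both transformations preserve the structure of the instance (each job still visits every machine exactly once), so an optimal policy for the augmented state corresponds to an optimal policy for the original one.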
The results are presented in Table 3. Even though GAZ Single TB still relies on the value network being able to predict the expected outcome of an episode sufficiently well, the small-scale experiments indicate that GAZ Single TB provides a better advantage baseline for forming a curriculum for the learning actor than the value interpolation of GAZ Single Vanilla. In particular, GAZ Single TB Greedy obtains results comparable to GAZ PTP GT on the JSSP and improves faster than GAZ Single Vanilla on the TSP. We believe that GAZ Single TB can become a much stronger alternative to the value interpolation (26) of GAZ for baselining the Q-values in problems where the expected outcome of an episode is 'easier' to predict, or in combination with more sophisticated value prediction techniques, such as the target scaling proposed in Pohlen et al. (2018) or Implicit Quantile Networks (Dabney et al., 2018).

