A DIFFERENTIABLE LOSS FUNCTION FOR LEARNING HEURISTICS IN A*

Abstract

Optimization of heuristic functions for the A* algorithm, realized by deep neural networks, is usually done by minimizing the squared error (L2 loss) of the estimated cost-to-goal values. This paper argues that this does not necessarily lead to a faster A* search, since the execution of A* relies on relative values rather than absolute ones. As a mitigation, we propose the L* loss, which upper-bounds the number of excessively expanded states inside the A* search. The L* loss, when used in the optimization of state-of-the-art deep neural networks for automated planning in maze domains like Sokoban and maze with teleports, significantly improves the fraction of solved problems and the quality of found plans, and reduces the number of expanded states to approximately 50%.

1. INTRODUCTION

Automated planning aims to find a sequence of actions that reaches a goal in a model of the environment provided by the user. Planning is considered one of the core problems of Artificial Intelligence, and it is behind some of its successful applications Samuel (1967); Knuth & Moore (1975); Silver et al. (2017). Early analyses of planning tasks McDermott (1996) indicated that optimising the heuristic function steering the search for a given problem domain can dramatically improve the performance of the search. Learning in planning means optimizing heuristic functions from plans of already solved problems and their instances. This definition includes the selection of proper heuristics in a set of pattern databases Franco et al. (2017); Haslum et al. (2007); Moraru et al. (2019); Edelkamp (2006), the selection of a planner from a portfolio Katz et al. (2018), learning planning operators from instances Ménager et al. (2018); Wang (1994), and learning of macro-operators and entanglements Chrpa (2010); Korf (1985). Recent years have seen a renewed interest in learning heuristic functions, fuelled by the success of deep learning and reinforcement learning in the same area Shen et al. (2020); Groshev et al. (2018); Ferber et al. (2020); Bhardwaj et al. (2017). In this work, we are interested in optimising the heuristic function for A* Hart et al. (1968), which, despite the popularity of Monte Carlo tree search Coulom (2006); Silver et al. (2017), remains attractive due to its guarantees on solution optimality. A* is also optimally efficient in the sense that it expands the minimal number of states. The majority of prior art Shen et al. (2020); Toyer et al. (2020); Groshev et al. (2018); Ferber et al. (2020); Bhardwaj et al. (2017) optimises the heuristic function by minimizing the error of the predicted cost to the goal on a training set of problem instances,foot_0 where the error is measured by the L2 error function or a variant thereof. However, L2 = 0 does not guarantee the optimal efficiency of A*, hence it gives a false sense of security. We propose the L* loss function, tailored for A*, which minimizes an upper bound on the number of expanded states. This is achieved by encouraging states on an optimal path to have a smaller cost function f = g + h than those off the optimal path. Thereby, L* effectively utilizes all states generated during the exploration of A*, providing much more information to the learner. If L* on a given problem instance equals zero, A* is guaranteed to expand only states on the optimal path, which, under conditions on the training set detailed below, implies optimal efficiency of A*. We emphasize that optimal efficiency is retained even on problems with exponentially many optimal paths Helmert & Röger (2008); the heuristic function therefore has to learn a tie-breaking mechanism. The proposed L* is compared to the state of the art on seven domains: Sokoban, Maze with teleports, Sliding tile puzzle, Blocksworld, Ferry, Gripper, and N-Puzzle, and on all of them it consistently outperforms heuristic functions optimizing L2.
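The exact definition of L* is given later in the paper. Purely to illustrate the ranking idea sketched above, namely that states on the optimal path should obtain a smaller f = g + h than generated off-path states, the snippet below implements a generic hinge-style surrogate over pairs of f-values. The pairwise scheme, the margin, and the function name are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def ranking_loss(f_on, f_off, margin=1.0):
    """Hinge surrogate (illustrative, not the paper's L*): penalize every
    pair where an on-path state s does not beat an off-path state s' by
    the margin, i.e. where f(s) + margin > f(s').  Zero loss means A*
    would prefer every on-path state over every off-path state."""
    f_on = np.asarray(f_on, dtype=float)    # f = g + h for states on pi*
    f_off = np.asarray(f_off, dtype=float)  # f = g + h for off-path states
    # All on/off pairs via broadcasting; average the hinge penalties.
    gaps = f_on[:, None] + margin - f_off[None, :]
    return float(np.maximum(gaps, 0.0).mean())

# Perfect separation: every on-path f is below every off-path f by >= 1.
print(ranking_loss([2.0, 3.0], [5.0, 6.0]))  # 0.0
```

A squared (L2) regression loss would instead compare each f-value to an absolute target, which is exactly the property the paper argues is unnecessary for a fast search.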

2. PRELIMINARIES

We define a search problem instance by a directed weighted graph Γ = ⟨S, E, w⟩, a distinguished node s_0 ∈ S, and a distinguished set of nodes S* ⊆ S. The nodes S denote all possible states s ∈ S of the underlying transition system represented by the graph. The set of edges E contains all possible transitions e ∈ E between the states, in the form e = (s, s′). s_0 ∈ S is the initial state of the problem instance and S* ⊆ S is the set of allowed goal states. The graph weights (alias action costs) are a mapping w : E → R≥0. Let π = (e_1, e_2, ..., e_l); we call π a path (alias a plan) of length l solving the task ⟨Γ, s_0, S*⟩ iff π = ((s_0, s_1), (s_1, s_2), ..., (s_{l−1}, s_l)) and s_l ∈ S*. An optimal path π* is a path of minimal cost for the problem instance ⟨Γ, s_0, S*⟩, with cost f* = w(π*) = w(e_1) + w(e_2) + ... + w(e_l). We are often interested in the minimal-cost solution π* of a problem instance ⟨Γ, s_0, S*⟩ together with its length l* = |π*|.
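As a concrete illustration, the triple ⟨S, E, w⟩ together with s_0 and S* can be encoded directly; the sketch below (plain Python, with a dict-of-dicts adjacency as an illustrative representation choice) builds a small instance and evaluates the cost of a candidate plan:

```python
# A minimal sketch of a search problem instance <S, E, w> with s0 and S*.
# Edge weights w: E -> R>=0 are stored as adjacency: E[s][s'] = w(s, s').
E = {
    "s0": {"s1": 1.0, "s2": 4.0},
    "s1": {"s2": 1.0, "s3": 5.0},
    "s2": {"s3": 1.0},
    "s3": {},
}
s0 = "s0"          # initial state
goals = {"s3"}     # set of goal states S*

def plan_cost(plan):
    """Cost w(pi): sum of edge weights along a path given as a state sequence."""
    return sum(E[a][b] for a, b in zip(plan, plan[1:]))

pi = ["s0", "s1", "s2", "s3"]           # a plan of length l = 3
assert pi[0] == s0 and pi[-1] in goals  # it solves the instance
print(plan_cost(pi))                    # w(pi) = 1.0 + 1.0 + 1.0 = 3.0
```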

2.1. A* ALGORITHM

Let us briefly recall how the A* algorithm works. For consistent heuristics, where h(s) − h(s′) ≤ w(s, s′) for all edges (s, s′) in the w-weighted state-space graph, it mimics Dijkstra's shortest-path algorithm Dijkstra (1959) and maintains the set of generated but not yet expanded nodes in O (the Open list) and the set of already expanded nodes in C (the Closed list). It works as follows.

1. Add the start node s_0 to the Open list O_0.
2. Set g(s_0) = 0.
3. Initialize the Closed list to empty, i.e. C_0 = ∅.
4. For i = 1, 2, ... while O_{i−1} ≠ ∅:
   (a) Select the state s_i = arg min_{s ∈ O_{i−1}} g(s) + h(s).
   (b) Remove s_i from O_{i−1}: O_i = O_{i−1} \ {s_i}.
   (c) If s_i ∈ S*, i.e. it is a goal state, go to 5.
   (d) Insert the state s_i into the Closed list: C_i = C_{i−1} ∪ {s_i}.
   (e) Expand the state s_i into all states s′ with (s_i, s′) ∈ E, and for each s′:
      i. set g(s′) = g(s_i) + w(s_i, s′);
      ii. if s′ is in the Closed list as s_c and g(s′) < g(s_c), then s_c is reopened (i.e., moved from the Closed to the Open list); otherwise continue with the next s′ in (e);
      iii. if s′ is in the Open list as s_o and g(s′) < g(s_o), then s_o is updated (i.e., removed from the Open list and re-added in the next step with the updated g(·)); otherwise continue with the next s′ in (e);
      iv. add s′ to the Open list.
5. Walk back to retrieve the optimal path.

In the above algorithm, g(s) denotes the accumulated cost w of moving from the initial state s_0 to a given state s. Consistent heuristics are called monotone because the estimated cost of a partial solution, f(s) = g(s) + h(s), is monotonically non-decreasing along the best path to the goal. Moreover, f is monotone on all edges (s, s′) if and only if h is consistent, since f(s′) = g(s′) + h(s′) ≥ g(s) + w(s, s′) + h(s) − w(s, s′) = f(s) and h(s) − h(s′) = f(s) − g(s) − (f(s′) − g(s′)) = f(s) − f(s′) + w(s, s′) ≤ w(s, s′). For consistent heuristics, no reopening (moving nodes back from Closed to Open) is needed, as we essentially traverse a state-space graph with non-negative edge weights w(s, s′) + h(s′) − h(s) ≥ 0. For the trivial heuristic h_0, we have h_0(s) = 0; for the perfect heuristic h*, we have f(s) = g(s) + h*(s) = f* for all nodes s on an optimal path. Both heuristics h_0 and h* are consistent.
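The steps above translate directly into code. The sketch below is a compact Python rendering for consistent heuristics (so no reopening is performed; stale Open-list entries are simply skipped), using a heap-based Open list; the data structures and tie-breaking are implementation choices, not prescribed by the paper.

```python
import heapq

def astar(E, s0, goals, h):
    """A* for a consistent heuristic h; E[s][s'] = w(s, s') >= 0.
    Returns (cost, path) or (inf, None) if no goal is reachable."""
    g = {s0: 0.0}
    parent = {s0: None}
    open_list = [(h(s0), s0)]          # priority = f = g + h
    closed = set()
    while open_list:
        f, s = heapq.heappop(open_list)
        if s in closed:                # stale entry (state was updated later)
            continue
        if s in goals:                 # step 4(c): goal reached
            path = []
            while s is not None:       # step 5: walk back via parents
                path.append(s)
                s = parent[s]
            return g[path[0]], path[::-1]
        closed.add(s)                  # step 4(d): insert into Closed list
        for s2, w in E[s].items():     # step 4(e): expand s
            g2 = g[s] + w
            if s2 not in g or g2 < g[s2]:
                g[s2] = g2
                parent[s2] = s
                heapq.heappush(open_list, (g2 + h(s2), s2))
    return float("inf"), None

E = {"s0": {"s1": 1.0, "s2": 4.0}, "s1": {"s2": 1.0, "s3": 5.0},
     "s2": {"s3": 1.0}, "s3": {}}
cost, path = astar(E, "s0", {"s3"}, h=lambda s: 0.0)  # h0: trivial heuristic
print(cost, path)  # 3.0 ['s0', 's1', 's2', 's3']
```

With the trivial heuristic h_0 the run degenerates to Dijkstra's algorithm; a consistent, more informative h only changes the order in which states are popped, never the returned cost.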



foot_0: The training set contains solved problem instances, where the solutions should ideally be found by a search guaranteeing optimality, such as A* with an admissible heuristic function.





