SCALING UP AND STABILIZING DIFFERENTIABLE PLANNING WITH IMPLICIT DIFFERENTIATION

Abstract

Differentiable planning promises end-to-end differentiability and adaptivity. However, a core issue prevents it from scaling up to larger problems: differentiable planners must differentiate through their forward iteration layers to compute gradients, which couples forward computation and backpropagation and forces a trade-off between forward planner performance and the computational cost of the backward pass. To alleviate this issue, we propose to differentiate through the Bellman fixed-point equation to decouple the forward and backward passes of Value Iteration Networks and their variants, which enables constant backward cost (in planning horizon) and a flexible forward budget, and helps scale up to large tasks. We study the convergence stability, scalability, and efficiency of the proposed implicit versions of VIN and its variants and demonstrate their superiority on a range of planning tasks: 2D navigation, visual navigation, and 2-DOF manipulation in configuration space and workspace.

1. INTRODUCTION

Planning is a crucial ability in artificial intelligence, robotics, and reinforcement learning (LaValle, 2006; Sutton & Barto, 2018). However, most planning algorithms require either a model that matches the true dynamics or a model learned from data. In contrast, differentiable planning (Tamar et al., 2016; Schrittwieser et al., 2019; Oh et al., 2017; Grimm et al., 2020; 2021) trains models and policies in an end-to-end manner. This approach allows learning a compact Markov Decision Process (MDP) and ensures that the learned value is equivalent to that of the original problem. For instance, differentiable planning can learn to play Atari games with minimal supervision (Oh et al., 2017). However, differentiable planning faces scalability and convergence stability issues because it needs to differentiate through the planning computation. This requires unrolling network layers to iteratively improve value estimates, especially for long-horizon planning problems. As a result, it leads to slower inference and to inefficient and unstable gradient computation through many network layers. Therefore, this work addresses the question: how can we scale up differentiable planning while keeping training efficient and stable?

In this work, we focus on the bottleneck caused by algorithmic differentiation, which backpropagates gradients through layers, coupling the forward and backward passes and slowing down both inference and gradient computation. To address this issue, we propose implicit differentiable planning (IDP), which uses implicit differentiation to solve the fixed-point problem defined by the Bellman equations without unrolling network layers. Value Iteration Networks (VINs) (Tamar et al., 2016) embed value iteration into a convolutional network and solve this fixed-point problem by iterating layers; we refer to such planners as algorithmic differentiable planners, or ADPs for short. We apply IDP to VIN-based planners such as GPPN (Lee et al., 2018) and SymVIN (Zhao et al., 2022).
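To make the fixed-point view concrete, the following is a minimal NumPy sketch of value iteration on a toy tabular MDP. The transition and reward arrays are illustrative and unrelated to the paper's tasks, and this is the plain Bellman backup rather than the convolutional architecture used in VIN.

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, iters=300):
    """Plain tabular value iteration: repeatedly apply the Bellman backup
    V <- max_a (R[a] + gamma * P[a] @ V). R: (A, S) rewards; P: (A, S, S)
    transitions. A toy stand-in for the convolutional backups inside VIN."""
    V = np.zeros(R.shape[1])
    for _ in range(iters):
        Q = R + gamma * np.einsum("ast,t->as", P, V)  # Q[a, s]
        V = Q.max(axis=0)                             # greedy Bellman backup
    return V

# Illustrative 2-state, 2-action MDP: action 0 stays, action 1 swaps states.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])
V = value_iteration(R, P)
```

Since the backup is a gamma-contraction, the iterates converge to the unique fixed point V⋆ of the Bellman optimality equation; unrolling these iterations as network layers is exactly what couples the forward and backward passes.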
This implicit differentiation idea has also recently been studied in supervised learning (Bai et al., 2019; Winston & Kolter, 2021; Amos & Yarats, 2019; Amos & Kolter, 2019). Using implicit differentiation in planning brings several benefits. It decouples the forward and backward passes, so that when the forward pass scales up to more iterations for long-horizon planning problems, the backward pass remains constant in cost. It is also no longer constrained to differentiable forward solvers/planners, potentially allowing non-differentiable operations in planning. Finally, it can potentially reuse intermediate computation from the forward pass in the backward pass, which is infeasible under algorithmic differentiation. We focus on scaling up implicit differentiable planning to larger planning problems and stabilizing its convergence, and also experiment with different optimization techniques and setups. In our experiments on various tasks, planners with implicit differentiation can train on larger tasks, plan with a longer horizon, use less (backward) time in training, converge more stably, and exhibit better performance than their explicit counterparts. We summarize our contributions below:
• We apply implicit differentiation to VIN-based differentiable planning algorithms. This connects to deep equilibrium models (DEQ) (Bai et al., 2019) and prior work on both sides, including (Bai et al., 2021; Nikishin et al., 2021; Gehring et al., 2021).
• We propose a practical implicit differentiable planning pipeline and implement implicit-differentiation versions of VIN, as well as GPPN (Lee et al., 2018) and SymVIN (Zhao et al., 2022).
• We empirically study the convergence stability, scalability, and efficiency of the ADPs and the proposed IDPs on four planning tasks: 2D navigation, visual navigation, and 2-degrees-of-freedom (2-DOF) manipulation in configuration space and workspace.
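As a sketch of how the backward pass decouples from the forward pass, the toy example below solves a linear policy-evaluation fixed point V = R + γPV by iteration, then computes the gradient of a loss with respect to R via the implicit function theorem: one linear solve against (I − γP)ᵀ, instead of backpropagating through the forward iterations. All names and the toy MDP are illustrative, not the paper's setup.

```python
import numpy as np

def solve_fixed_point(f, V0, iters=500):
    """Forward pass: iterate a contraction map to its equilibrium V*.
    The iteration count can grow with the planning horizon without
    affecting the cost of the implicit backward pass below."""
    V = V0
    for _ in range(iters):
        V = f(V)
    return V

# Toy policy-evaluation fixed point V = R + gamma * P @ V (illustrative data).
S = 4
rng = np.random.default_rng(0)
P = rng.random((S, S))
P /= P.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
R = rng.random(S)
gamma = 0.9

V_star = solve_fixed_point(lambda V: R + gamma * P @ V, np.zeros(S))

# Backward pass via the implicit function theorem: at the equilibrium,
# dV*/dR = (I - gamma * P)^{-1}, so for a loss L(V*) with gradient
# g = dL/dV*, the gradient w.r.t. R is (I - gamma * P)^{-T} g --
# a single linear solve, independent of the number of forward iterations.
g = np.ones(S)  # take L(V*) = sum(V*), so dL/dV* is all ones
grad_R = np.linalg.solve(np.eye(S) - gamma * P.T, g)

# Sanity check against finite differences through the full forward solve.
eps = 1e-6
fd = np.zeros(S)
for i in range(S):
    Rp = R.copy()
    Rp[i] += eps
    Vp = solve_fixed_point(lambda V: Rp + gamma * P @ V, np.zeros(S))
    fd[i] = (Vp.sum() - V_star.sum()) / eps
```

In a learned planner the linear solve is itself typically replaced by an iterative solver applied to the vector-Jacobian product at the equilibrium, which is what keeps the backward cost constant in the planning horizon.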

2. RELATED WORK

Differentiable Planning In this paper, we use differentiable planning to refer to planning with neural networks; it can also be called learning to plan and may be viewed as a subclass of integrating planning and learning (Sutton & Barto, 2018). It is promising because it can be integrated into a larger differentiable system to form a closed loop. Grimm et al. (2020; 2021) propose to understand model-based planning algorithms from a value-equivalence perspective. Value iteration network (VIN) (Tamar et al., 2016) is a representative work that performs value iteration using convolution on lattice grids, and has been further extended by (Niu et al., 2017; Lee et al., 2018; Chaplot et al., 2021; Deac et al., 2021) and Abstract VIN (Schleich et al., 2019). Beyond convolutional networks, work on combining learning and planning includes (Oh et al., 2017; Karkus et al., 2017; Weber et al., 2018; Srinivas et al., 2018; Schrittwieser et al., 2019; Amos & Yarats, 2019; Wang & Ba, 2019; Guez et al., 2019; Hafner et al., 2020; Pong et al., 2018; Clavera et al., 2020).

Implicit Differentiation Beyond computing gradients by following the forward pass layer-by-layer, gradients can also be computed with implicit differentiation, bypassing differentiation through even advanced root-finding solvers. This strategy has been used in a body of recent work



Figure 1: An overview of VIN, a planner with algorithmic differentiation, and ID-VIN, our proposed planner with implicit differentiation. Lighter colors for the backward pass with algorithmic differentiation (solid red arrows) indicate larger backpropagation depth. For backward passes with implicit differentiation, the dashed red arrows start from the solved equilibrium V⋆ and end at each forward layer (black arrows).

