PARAMETERIZED PROJECTED BELLMAN OPERATOR

Abstract

The Bellman operator is a cornerstone of reinforcement learning, widely used in methods ranging from value-based algorithms to modern actor-critic approaches. In problems with unknown models, the Bellman operator requires transition samples that strongly determine its behavior, as uninformative samples can result in negligible updates or long detours before reaching the fixed point. In this work, we introduce the novel idea of learning an approximation of the Bellman operator, which we call the projected Bellman operator (PBO). Our PBO is an operator on the parameter space of a given value function: given the parameters of a value function, it outputs the parameters of a new value function and, like the standard Bellman operator, converges to a fixed point in the limit. Notably, our PBO can approximate repeated applications of the true Bellman operator, as opposed to the inherently sequential nature of the standard Bellman operator. We show how to obtain PBOs for representative classes of RL problems, and how to approximate them via neural network regression. Finally, we propose an approximate value-iteration algorithm to learn PBOs and empirically show how it can overcome the limitations of classical methods, opening up multiple research directions as a novel paradigm in reinforcement learning.

1. INTRODUCTION

Value-based reinforcement learning (RL) is a popular class of algorithms for solving sequential decision-making problems with unknown dynamics (Sutton & Barto, 2018). For a given problem, value-based algorithms aim at obtaining the most accurate estimate of the expected return from each state, i.e., a value function. For instance, the well-known value-iteration algorithm computes value functions by iterated applications of the Bellman operator (Bellman, 1966), of which the true value function is the fixed point. Although the Bellman operator can be applied exactly in dynamic programming, it needs to be estimated from samples at each application when dealing with unknown models of RL problems, i.e., the empirical Bellman operator (Watkins, 1989; Bertsekas, 2019). Intuitively, the dependence of value iteration on the samples has an impact on the efficiency of the algorithms and on the quality of the estimated value function, which becomes accentuated when solving continuous problems that require value-based methods with function approximation, e.g., approximate value iteration (AVI) (Munos, 2005; Munos & Szepesvári, 2008). Moreover, in AVI approaches, costly function approximation steps are needed to project the output of the Bellman operator back to the considered action-value functional space. In this paper, we tackle these limitations by introducing the novel approach of using samples to obtain a new operator, which we call the projected Bellman operator (PBO). Our PBO is a function Λ : Ω → Ω defined on parameters ω ∈ Ω of the value function. In contrast to the standard (empirical) Bellman operator, which uses action-value functions Q_{ω_k} to compute targets that are then projected to obtain Q_{ω_{k+1}}, our PBO uses the parameters of the action-value function to compute updated parameters ω_{k+1} = Λ(ω_k) directly (Figure 1).
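To make the contrast concrete, the following is a minimal sketch, not the paper's implementation: a linear Q-function Q_ω(s, a) = ω·φ(s, a), a standard empirical Bellman step that regresses projected targets from transition samples, and a PBO that maps parameters to parameters directly. The feature map `phi`, the affine form of `pbo`, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
gamma = 0.9  # discount factor (illustrative)

def phi(s, a):
    """Toy feature map: place the state vector in the block of action a."""
    f = np.zeros(n_features * n_actions)
    f[a * n_features:(a + 1) * n_features] = s
    return f

def q(w, s, a):
    """Linear action-value function Q_w(s, a) = w . phi(s, a)."""
    return w @ phi(s, a)

def bellman_step(w, transitions):
    """Standard empirical Bellman step: compute bootstrapped targets
    from (s, a, r, s') samples, then project back onto the linear
    functional space via least-squares regression."""
    X = np.stack([phi(s, a) for s, a, r, s2 in transitions])
    y = np.array([r + gamma * max(q(w, s2, b) for b in range(n_actions))
                  for s, a, r, s2 in transitions])
    w_next, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_next

# PBO instead acts on parameter space: w_{k+1} = Lambda(w_k).
# Here Lambda is a hypothetical affine contraction standing in for a
# learned operator; no projection step is needed.
A = 0.5 * np.eye(n_features * n_actions)
b = rng.normal(size=n_features * n_actions)

def pbo(w):
    return A @ w + b
```

Note that `bellman_step` needs fresh transition samples and a regression at every iteration, while `pbo` consumes only the current parameter vector.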
The crucial advantages of our approach are twofold: (i) the output of PBO always belongs to the considered action-value functional space, thus avoiding the costly projection step typical of the standard Bellman operator, and (ii) once learned, PBO can be applied for an arbitrary number of iterations without using further samples, as visualized in Figure 2. Starting from initial parameters ω_0, AVI approaches obtain consecutive approximations of the value function Q_{ω_k} by projecting the output of the Bellman operator back onto the considered functional space, whereas our PBO directly maps ω_k to ω_{k+1}.
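Advantage (ii) can be sketched as follows: once an operator Λ on parameter space has been learned, it can be iterated to its fixed point with no further environment interaction. The affine contraction below is a stand-in assumption for a learned PBO, not the paper's operator.

```python
import numpy as np

# Hypothetical learned operator Lambda(w) = A w + b with ||A|| < 1,
# so repeated application converges to a unique fixed point.
A = 0.5 * np.eye(3)
b = np.array([1.0, -2.0, 0.5])
Lambda = lambda w: A @ w + b

w = np.zeros(3)
for _ in range(50):   # sample-free iterations on parameters alone
    w = Lambda(w)

# Exact fixed point w* solves (I - A) w* = b.
w_star = np.linalg.solve(np.eye(3) - A, b)
print(np.allclose(w, w_star))  # prints True
```

The loop touches no transition data: the cost of each iteration is a single forward pass through Λ, regardless of how many Bellman applications it emulates.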



Figure 1: PBO operates on value function parameters.

