PARAMETERIZED PROJECTED BELLMAN OPERATOR

Abstract

The Bellman operator is a cornerstone of reinforcement learning, widely used in a plethora of works, from value-based methods to modern actor-critic approaches. In problems with unknown models, the Bellman operator requires transition samples that strongly determine its behavior, as uninformative samples can result in negligible updates or long detours before reaching the fixed point. In this work, we introduce the novel idea of obtaining an approximation of the Bellman operator, which we call projected Bellman operator (PBO). Our PBO is an operator on the parameter space of a given value function. Given the parameters of a value function, PBO outputs the parameters of a new value function and, like a standard Bellman operator, converges to a fixed point in the limit. Notably, once learned, our PBO can produce approximations of repeated applications of the true Bellman operator without requiring further samples, in contrast to the sample-dependent, sequential nature of the standard Bellman operator. We show how to obtain PBOs for representative classes of RL problems, and how to approximate them using neural network regression. Eventually, we propose an approximate value-iteration algorithm to learn PBOs and empirically show how it can overcome the limitations of classical methods, opening up multiple research directions as a novel paradigm in reinforcement learning.

1. INTRODUCTION

Value-based reinforcement learning (RL) is a popular class of algorithms for solving sequential decision-making problems with unknown dynamics (Sutton & Barto, 2018). For a given problem, value-based algorithms aim at obtaining the most accurate estimate of the expected return from each state, i.e., a value function. For instance, the well-known value-iteration algorithm computes value functions by iterated applications of the Bellman operator (Bellman, 1966), of which the true value function is the fixed point. Although the Bellman operator can be applied exactly in dynamic programming, it needs to be estimated from samples at each application when dealing with unknown models of RL problems, i.e., the empirical Bellman operator (Watkins, 1989; Bertsekas, 2019). Intuitively, the dependence of value iteration on the samples has an impact on the efficiency of the algorithms and on the quality of the estimated value function, which becomes accentuated when solving continuous problems that require value-based methods with function approximation, e.g., approximate value iteration (AVI) (Munos, 2005; Munos & Szepesvári, 2008). Moreover, in AVI approaches, costly function approximation steps are needed to project the output of the Bellman operator back to the considered action-value functional space. In this paper, we tackle these limitations by introducing the novel approach of using samples to obtain a new operator, which we call projected Bellman operator (PBO). Our PBO is a function Λ : Ω → Ω defined on parameters ω ∈ Ω of the value function. Unlike the standard (empirical) Bellman operator, which uses action-value functions Qωk to compute targets that are then projected to obtain Qωk+1, our PBO uses the parameters of the action-value function to compute updated parameters ωk+1 = Λ(ωk) directly (Figure 1).
The crucial advantages of our approach are twofold: (i) the output of PBO always belongs to the considered action-value functional space, thus avoiding the costly projection step typical of approaches based on the Bellman operator, and (ii) once learned, PBO is applicable for an arbitrary number of iterations without using further samples, as visualized in Figure 2. Starting from initial parameters ω0, AVI approaches obtain consecutive approximations of the value function Qωk by applying the Bellman operator iteratively over samples (Figure 2b). Instead, our PBO makes use of the samples only to learn the operator. Then, starting from initial parameters ω0, PBO can produce a chain of updated parameters of arbitrary length (blue lines in Figure 2a) without requiring further samples. In the following, after formally introducing PBO and a novel algorithm for value estimation based on it, we analyze its advantageous properties for different classes of problems. Thus, our contribution is threefold: (i) we introduce the notion of projected Bellman operator (PBO); (ii) we show how to derive different PBOs according to the class of problems at hand; (iii) we develop a novel algorithm for value estimation based on PBO and show its advantages over related baselines on several RL problems.
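To make the contrast concrete, the following sketch compares one AVI step, which regresses empirical Bellman targets back onto the parametric space, with a PBO step that maps parameters to parameters directly. The linear Q-function, the feature map `PHI`, and the affine map standing in for a trained Λ are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical linear Q-function: Q_w(s, a) = w . phi(s, a),
# with a fixed random feature map phi (illustrative only).
rng = np.random.default_rng(0)
PHI = rng.normal(size=(5, 2, 3))  # 5 states, 2 actions, 3 features

def q_values(w):
    """Q_w(s, a) for all state-action pairs, shape (5, 2)."""
    return PHI @ w

# (a) Standard AVI step: compute empirical Bellman targets from samples,
# then project them back onto the parametric space via regression.
def avi_step(w, transitions, gamma=0.9):
    feats, targets = [], []
    for (s, a, r, s_next) in transitions:
        targets.append(r + gamma * q_values(w)[s_next].max())
        feats.append(PHI[s, a])
    # least-squares projection onto the span of phi
    w_next, *_ = np.linalg.lstsq(np.asarray(feats), np.asarray(targets), rcond=None)
    return w_next

# (b) PBO step: a learned operator Lambda acts on the parameters directly.
# Here Lambda is a toy affine contraction standing in for a trained network.
A = 0.5 * np.eye(3)
b = np.ones(3)
def pbo_step(w):
    return A @ w + b  # w_{k+1} = Lambda(w_k), no samples needed

# Once Lambda is learned, it can be iterated arbitrarily many times
# without touching the dataset; this toy map contracts to w* = (I - A)^{-1} b.
w = np.zeros(3)
for _ in range(50):
    w = pbo_step(w)
```

Note that `avi_step` needs a fresh pass over transition samples at every iteration, whereas the loop over `pbo_step` consumes no samples at all, which is precisely advantage (ii) above.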

2. RELATED WORKS

Our work is, to the best of our knowledge, the first attempt to obtain a variant of the Bellman operator that acts on the parameters of action-value functions. Nevertheless, several works in the literature have proposed variants of the standard Bellman operator to induce some desired behavior. Variants of the Bellman operator are widely studied for entropy-regularized MDPs (Neu et al., 2017; Geist et al., 2019; Belousov & Peters, 2019). The softmax (Haarnoja et al., 2017; Song et al., 2019), mellowmax (Asadi & Littman, 2017), and optimistic (Tosatto et al., 2019) operators are all variants of the Bellman operator designed to obtain maximum-entropy exploratory policies. Besides favoring exploration, other approaches address the limitations of the standard Bellman operator. For instance, the consistent Bellman operator (Bellemare et al., 2016) is a modified operator that addresses the inconsistency of optimal action-value functions for suboptimal actions. The distributional Bellman operator (Bellemare et al., 2017) makes it possible to operate on the whole return distribution, instead of its expectation, i.e., the value function (Bellemare et al., 2023). Furthermore, the logistic Bellman operator uses a logistic loss to solve a convex linear programming problem to find optimal value functions (Bas-Serrano et al., 2021). Finally, the Bayesian Bellman operator is a method in Bayesian RL that infers a posterior over Bellman operators centered on the true Bellman operator (Fellows et al., 2021). We point out that our PBO can be seamlessly applied to an arbitrary variant of the standard Bellman operator with just minor adaptations.

Operator learning. Literature in operator learning is mostly focused on supervised learning, with methods for learning operators over vector spaces (Micchelli & Pontil, 2005) and parametric approaches for learning non-linear operators (Chen & Chen, 1995), with a resurgence of recent contributions in deep learning.
For example, Kovachki et al. (2021) learn mappings between infinite-dimensional function spaces with deep neural networks, and Kissas et al. (2022) apply an attention mechanism to learn correlations in the target function for efficient operator learning. The literature on operator learning is much larger, but we consider it out of the scope of this work, which, to the best of our knowledge, is the first to deal with the original problem of operator learning in RL.

3. PRELIMINARIES

We consider discounted Markov decision processes (MDPs) M = ⟨S, A, P, R, γ⟩, where S is a measurable state space, A is a measurable action space, P : S × A → ∆(S)¹ is the transition



¹∆(X) denotes the set of probability measures over the set X.
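To make these definitions concrete, the following toy example builds a small tabular MDP matching the tuple ⟨S, A, P, R, γ⟩ and iterates the exact Bellman optimality operator to its fixed point; all sizes and numbers are illustrative, not taken from the paper's experiments.

```python
import numpy as np

# Toy MDP with |S| = 3 states, |A| = 2 actions (illustrative numbers).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
# P[s, a] is a distribution over next states, i.e., an element of Delta(S).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))  # reward for each (s, a) pair

def bellman_optimality(Q):
    """One exact application of the Bellman optimality operator (T*Q)(s, a)."""
    return R + gamma * P @ Q.max(axis=1)

# T* is a gamma-contraction, so iterating it converges to the unique
# fixed point Q*; 200 iterations leave a residual on the order of gamma^200.
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    Q = bellman_optimality(Q)
```

In dynamic programming, with P and R known, this backup can be applied exactly as above; the empirical Bellman operator discussed in the introduction replaces the expectation P @ (...) with transition samples.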



Figure 1: PBO operates on value function parameters.

Figure 2: Behavior of our PBO and approximate value iteration (AVI) in the space of value functions. Q* and Qω* are respectively the optimal value function and its projection on the parametric space.

