FASTER REINFORCEMENT LEARNING WITH VALUE TARGET LOWER BOUNDING

Abstract

We show that an arbitrary lower bound on the maximum achievable value can be used to improve the Bellman value target during value learning. In the tabular case, value learning with the lower-bounded Bellman operator converges to the same optimal value as with the original Bellman operator, at a potentially faster rate. In practice, the discounted episodic return in episodic tasks and the n-step bootstrapped return in continuing tasks can serve as lower bounds that improve the value target. We experiment on Atari games, FetchEnv tasks, and a challenging physically simulated car push-and-reach task. We observe large gains in sample efficiency as well as converged performance over common baselines such as TD3, SAC, and Hindsight Experience Replay (HER) in most tasks, and reliable, competitive performance against stronger n-step methods such as TD(λ), Retrace, and optimality tightening. Prior works have successfully applied a special case of lower bounding, using the episodic return, to a limited number of episodic tasks. To the best of our knowledge, we are the first to propose the general method of value target lower bounding with possibly bootstrapped returns, to demonstrate its optimality in theory, and to show its effectiveness across a wide range of tasks against many strong baselines.

1. INTRODUCTION

The value function is a key concept in dynamic programming approaches to Reinforcement Learning (RL) (Bellman, 1957). Given a starting state, the value function estimates the sum of all future rewards (usually time-discounted). In temporal difference (TD) learning, the value function is adjusted toward its Bellman target, which simply adds the reward of the current step to the (discounted) value of the next state (Sutton & Barto, 2018). This forms the basis of many state-of-the-art RL algorithms such as DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2016), TD3 (Fujimoto et al., 2018), and SAC (Haarnoja et al., 2018). The value of the next state is typically estimated or derived from the value function itself, which is being actively learned during training, a process called "bootstrapping". The bootstrapped values can be random and far from the optimal value, especially at the initial stage of training, or in sparse-reward tasks where rewards can only be achieved through a long sequence of actions. Consequently, the Bellman value targets, as well as the learned values, are usually far from the optimal value (the value of the optimal policy).

Naturally, this leads to the following idea: if we can make the value target closer to the optimal value, we may speed up TD learning. For example, the optimal value is just the expected discounted return of the optimal policy, which always upper bounds the expected return of any policy. For episodic RL tasks, we could use the observed discounted return up to the episode end from the training trajectories to lower bound the value target. When the empirical return is higher than the Bellman target, lower bounding brings the new value target closer to the optimal value. For continuing or non-episodic tasks, it is less clear how a lower bound may be estimated; when a continuing task can return negative rewards, a lower bound may not even exist. One could use the n-step bootstrapped return as a lower bound, but a bootstrapped return can overestimate and exceed the optimal value, and it is unclear whether the resulting algorithm will still converge.
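For the episodic case, the lower-bounded target described above can be sketched as follows. This is an illustrative implementation only; the function name and the array-based interface are our own, not from the paper:

```python
import numpy as np

def lower_bounded_targets(rewards, next_values, gamma=0.99):
    """For each step t of a finished episode, compute the discounted
    return-to-go G_t = r_t + gamma*r_{t+1} + ... + gamma^{T-t-1}*r_{T-1}
    and use it to lower bound the one-step Bellman target
    r_t + gamma*V(s_{t+1})."""
    T = len(rewards)
    returns = np.zeros(T)
    g = 0.0
    for t in reversed(range(T)):          # accumulate return-to-go backwards
        g = rewards[t] + gamma * g
        returns[t] = g
    bellman = rewards + gamma * next_values   # one-step TD targets
    return np.maximum(bellman, returns)       # element-wise lower bounding
```

When the learned value estimates `next_values` are still near zero early in training, the observed returns dominate and pull the targets toward the optimal value faster than pure bootstrapping would.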

Algorithm 1 Value iteration with value target lower bounding

Input: finite MDP p(s′, r | s, a); convergence threshold θ; a lower bound f(s) of the maximum achievable value Ḡ_v(s)
Output: state value v(s)

v(s) ← 0 for all s
repeat
  Δ ← 0
  v_p(s) ← v(s) for all s
  for each state s do
    v(s) ← max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v_p(s′)]
    v_f(s) ← max(f(s), v(s))
    v(s) ← v_f(s)
    Δ ← max(Δ, |v(s) − v_p(s)|)
  end for
until Δ < θ

This work presents a general framework proving that value target lower bounding converges to the optimal value in both the episodic and non-episodic cases, under certain conditions on the lower bound function. We demonstrate faster training with an illustrative example and extensive experiments on a variety of environments against strong baselines.
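Algorithm 1 can be sketched in a few lines of vectorized code. This is our own illustrative rendering (tensor layout and function name are assumptions), shown here to make the lower bounding step concrete:

```python
import numpy as np

def value_iteration_lb(P, R, f, gamma=0.9, theta=1e-8):
    """Tabular value iteration with value target lower bounding (Algorithm 1).
    P[a, s, s']: transition probabilities; R[a, s, s']: rewards;
    f[s]: a lower bound on the maximum achievable value in state s."""
    nA, nS, _ = P.shape
    v = np.zeros(nS)
    while True:
        # Bellman update: max_a sum_{s'} p(s'|s,a) [r + gamma v(s')]
        q = (P * (R + gamma * v[None, None, :])).sum(axis=2)  # shape (nA, nS)
        v_new = np.maximum(q.max(axis=0), f)                  # lower bounding
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```

On a two-state example where action 0 moves state 0 to an absorbing state 1 with reward 1, the iteration converges to v = (1, 0) with or without a valid nonzero lower bound f, matching the convergence claim.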

2. THEORETICAL RESULTS FOR THE TABULAR CASE

Here we show, for the tabular case, that arbitrary functions below a certain bootstrap bound can be used to lower bound the value target while still converging to the same optimal value.

2.1. BACKGROUND

In finite MDPs with a limited number of states and actions, a table can keep track of the value of each state. Using dynamic programming algorithms such as value iteration, values are guaranteed to converge to the optimum through Bellman updates (Chapter 4.4 of Sutton & Barto (2018)). The core of the value iteration algorithm (Algorithm 1) is the Bellman update of the value function, B(v), where v(s′) is the bootstrapped value:

B(v)(s) := max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v(s′)]   (1)

Here a is an available action in state s; s′ is the resulting state and r the resulting reward of executing a in s, with p(s′, r | s, a) being the transition probability. It is well known that the Bellman operator B is a contraction mapping over value functions (Denardo, 1967). That is, for any two value functions v_1 and v_2, ‖B(v_1) − B(v_2)‖_∞ ≤ γ‖v_1 − v_2‖_∞ for the discount factor γ ∈ [0, 1), where ‖x‖_∞ := max_i |x_i| (the L_∞ norm). This guarantees that repeated application of the operator converges from any value function to the optimal value, B^∞(v) = v*.¹
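The contraction property can be checked numerically on a small random MDP. The sketch below is our own (the MDP sizes and tensor layout are arbitrary choices), and the assertion holds for any pair of value functions because the contraction bound is a theorem, not a statistical observation:

```python
import numpy as np

# Numerical check that the Bellman operator B is a gamma-contraction
# in the L-infinity norm, on a small random MDP.
rng = np.random.default_rng(0)
nA, nS, gamma = 2, 4, 0.9
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)      # normalize transition probabilities
R = rng.random((nA, nS, nS))           # reward for each (a, s, s') triple

def bellman(v):
    # B(v)(s) = max_a sum_{s'} p(s'|s,a) [r(a,s,s') + gamma v(s')]
    return (P * (R + gamma * v[None, None, :])).sum(axis=2).max(axis=0)

v1, v2 = rng.random(nS), rng.random(nS)
lhs = np.max(np.abs(bellman(v1) - bellman(v2)))  # ||B(v1) - B(v2)||_inf
rhs = gamma * np.max(np.abs(v1 - v2))            # gamma * ||v1 - v2||_inf
print(lhs <= rhs)
```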

2.2. CONVERGENCE OF VALUE TARGET LOWER BOUNDING

Definition 2.1. The expected n-step bootstrapped return for a given policy π and value function v(s) is defined as the expected bootstrapped return of taking n steps according to policy π:

G_n^{π,v}(s_0) := E_π { r_1 + γ r_2 + ... + γ^{n−1} r_n + γ^n v(s_n) }   (2)

Here, the step rewards r_i and the resulting n-th-step state s_n are random variables, with the expectation E_π taken over all possible n-step trajectories under the policy π and the given MDP.
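A single Monte-Carlo sample of the quantity in Definition 2.1, computed from one observed n-step trajectory, might look as follows (the function name and signature are ours, for illustration only):

```python
def n_step_return(rewards, v_sn, gamma=0.99):
    """One sampled n-step bootstrapped return per Definition 2.1:
    r_1 + gamma*r_2 + ... + gamma^(n-1)*r_n + gamma^n * v(s_n),
    where rewards = [r_1, ..., r_n] comes from a single n-step
    trajectory under pi, and v_sn = v(s_n) is the bootstrapped value
    of the final state."""
    n = len(rewards)
    g = sum(gamma**i * r for i, r in enumerate(rewards))  # discounted rewards
    return g + gamma**n * v_sn                            # bootstrap term
```

Averaging such samples over trajectories drawn from π estimates the expectation E_π in equation (2).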



¹ For the gist of the proof, see for example page 8 of https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/lecture-notes/lecture5-2pp.pdf

