IMPLICIT UNDER-PARAMETERIZATION INHIBITS DATA-EFFICIENT DEEP REINFORCEMENT LEARNING

Abstract

We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity via a drop in the rank of the learned value network features, and show that this typically corresponds to a performance drop. We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance.

1. INTRODUCTION

Many widely used deep reinforcement learning (RL) algorithms estimate value functions using bootstrapping, that is, by sequentially fitting value functions to target value estimates generated from the value function learned in the previous iteration. Despite high-profile achievements (Silver et al., 2017), these algorithms are highly unreliable due to poorly understood optimization issues. Although a number of hypotheses have been proposed to explain these issues (Achiam et al., 2019; Bengio et al., 2020; Fu et al., 2019; Igl et al., 2020; Liu et al., 2018; Kumar et al., 2020a), a complete understanding remains elusive.

We identify an "implicit under-parameterization" phenomenon that emerges when value networks are trained using gradient descent combined with bootstrapping. This phenomenon manifests as an excessive aliasing of features learned by the value network across states, which is exacerbated with more gradient updates. While the supervised deep learning literature suggests that some feature aliasing is desirable for generalization (e.g., Gunasekar et al., 2017; Arora et al., 2019), implicit under-parameterization exhibits more pronounced aliasing than in supervised learning. This over-aliasing causes an otherwise expressive value network to implicitly behave as an under-parameterized network, often resulting in poor performance.

Implicit under-parameterization becomes aggravated when the rate of data re-use is increased, restricting the sample efficiency of deep RL methods. In online RL, increasing the number of gradient steps between data collection steps for data-efficient RL (Fu et al., 2019; Fedus et al., 2020b) causes the problem to emerge more frequently. In the extreme case where no additional data is collected, referred to as offline RL (Lange et al., 2012; Agarwal et al., 2020; Levine et al., 2020), implicit under-parameterization manifests consistently, limiting the viability of offline methods.
We demonstrate the existence of implicit under-parameterization in common value-based deep RL methods, including Q-learning (Mnih et al., 2015; Hessel et al., 2018) and actor-critic methods (Haarnoja et al., 2018), as well as neural fitted Q-iteration (Riedmiller, 2005; Ernst et al., 2005). To isolate the issue, we study the effective rank of the features in the penultimate layer of the value network (Section 3). We observe that after an initial learning period, the rank of the learned features drops steeply. As the rank decreases, the ability of the features to fit subsequent target values and the optimal value function generally deteriorates, resulting in a sharp decrease in performance (Section 3.1).

To better understand the emergence of implicit under-parameterization, we formally study the dynamics of Q-learning under two distinct models of neural net behavior (Section 4): kernel regression (Jacot et al., 2018; Mobahi et al., 2020) and deep linear networks (Arora et al., 2018). We corroborate the existence of this phenomenon in both models, and show that implicit under-parameterization stems from a pathological interaction between bootstrapping and the implicit regularization of gradient descent. Since value networks are trained to regress towards targets generated by a previous version of the same model, training produces a sequence of value networks of potentially decreasing expressivity, which can result in degenerate behavior and a drop in performance.

The main contribution of this work is the identification of implicit under-parameterization in deep RL methods that use bootstrapping. Empirically, we demonstrate a collapse in the rank of the learned features during training, and show that it typically corresponds to a drop in performance on the Atari (Bellemare et al., 2013) and continuous control Gym (Brockman et al., 2016) benchmarks, in both the offline and data-efficient online RL settings.
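The effective-rank measure referenced above is defined precisely in Section 3; as a sketch, a common formulation (srank with threshold δ) counts the smallest number of singular values whose cumulative mass reaches a (1 − δ) fraction of the total. A minimal NumPy implementation under that assumption:

```python
import numpy as np

def effective_rank(features: np.ndarray, delta: float = 0.01) -> int:
    """srank_delta(Phi): the smallest k such that the top-k singular
    values of the feature matrix account for a (1 - delta) fraction of
    the total singular-value mass. (Sketch; the paper's exact
    definition appears in Section 3.)"""
    sv = np.linalg.svd(features, compute_uv=False)  # descending order
    cumulative = np.cumsum(sv) / np.sum(sv)
    # Index of the first cumulative fraction >= 1 - delta, converted
    # to a 1-based count of singular values.
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# A full-rank identity feature matrix has maximal effective rank,
# while a 64 x 16 matrix with only two nonzero rows collapses to 2.
assert effective_rank(np.eye(16)) == 16
collapsed = np.zeros((64, 16))
collapsed[0, 0], collapsed[1, 1] = 1.0, 1.0
assert effective_rank(collapsed) == 2
```

In practice this quantity would be computed periodically on the penultimate-layer features of a minibatch of states and actions, and tracked over the course of training.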
We verify the emergence of this phenomenon theoretically and characterize settings where implicit under-parameterization can emerge. We then show that mitigating this phenomenon via a simple penalty on the singular values of the learned features improves performance of value-based RL methods in the offline setting on Atari.
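One plausible instantiation of such a penalty (a sketch under our own assumptions; the paper's exact regularizer may differ) penalizes the gap between the largest and smallest singular values of the minibatch feature matrix, which discourages the small singular values from collapsing toward zero:

```python
import numpy as np

def rank_penalty(features: np.ndarray) -> float:
    """Hypothetical rank-preserving regularizer: the gap
    sigma_max^2 - sigma_min^2 of the (batch x dim) feature matrix Phi.
    Driving this gap down keeps the singular-value spectrum flat and
    counteracts effective-rank collapse. In a deep RL implementation,
    this term would be added to the TD loss with a small coefficient
    and differentiated through the SVD by the autodiff framework."""
    sv = np.linalg.svd(features, compute_uv=False)
    return float(sv[0] ** 2 - sv[-1] ** 2)

# A feature matrix with a flat spectrum incurs (near-)zero penalty ...
assert abs(rank_penalty(np.eye(4))) < 1e-9
# ... while a nearly rank-collapsed one is penalized heavily.
collapsed = np.outer(np.ones(4), np.ones(4)) + 1e-3 * np.eye(4)
assert rank_penalty(collapsed) > 1.0
```

The coefficient on such a penalty would be a hyperparameter traded off against the TD error, e.g. `loss = td_error + alpha * rank_penalty(phi)` with a small `alpha`.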

2. PRELIMINARIES

The goal in RL is to maximize long-term discounted reward in a Markov decision process (MDP), defined as a tuple (S, A, R, P, γ) (Puterman, 1994), with state space S, action space A, a reward function R(s, a), transition dynamics P(s′|s, a) and a discount factor γ ∈ [0, 1). The Q-function Q^π(s, a) for a policy π(a|s) is the expected long-term discounted reward obtained by executing action a at state s and following π(a|s) thereafter: Q^π(s, a) := E[Σ_{t=0}^∞ γ^t R(s_t, a_t)]. Q^π(s, a) is the fixed point of the Bellman operator T^π, ∀s, a: T^π Q(s, a) := R(s, a) + γ E_{s′∼P(·|s,a), a′∼π(·|s′)}[Q(s′, a′)], which can be written in vector form as Q^π = R + γP^π Q^π. The optimal Q-function, Q*(s, a), is the fixed point of the Bellman optimality operator T*: T*Q(s, a) := R(s, a) + γ E_{s′∼P(·|s,a)}[max_{a′} Q(s′, a′)].

Practical Q-learning methods (e.g., Mnih et al., 2015; Hessel et al., 2018; Haarnoja et al., 2018) convert the Bellman equation into a bootstrapping-based objective for training a Q-network, Q_θ, via gradient descent. This objective, known as the mean-squared temporal difference (TD) error, is given by L(θ) = Σ_{s,a} (R(s, a) + γ Q̄_θ(s′, a′) − Q_θ(s, a))², where Q̄_θ is a delayed copy of the Q-function, typically referred to as the target network. These methods train Q-networks via gradient descent and slowly update the target network via Polyak averaging on its parameters. We refer to the output of the penultimate layer of the deep Q-network as the learned feature matrix Φ, such that Q(s, a) = w^T Φ(s, a), where w ∈ R^d and Φ ∈ R^{|S||A|×d}.

For simplicity of analysis, we abstract deep Q-learning methods into a generic fitted Q-iteration (FQI) framework (Ernst et al., 2005); we refer to FQI with neural nets as neural FQI (Riedmiller, 2005). In the k-th fitting iteration, FQI trains the Q-function, Q_k, to match the target values y_k = R + γP^π Q_{k−1} generated using the previous Q-function, Q_{k−1} (Algorithm 1). Practical methods can be instantiated as variants of FQI, with different target update styles, different optimizers, etc.

Algorithm 1 Fitted Q-Iteration (FQI)
1: Initialize Q-network Q_θ, buffer µ.
2: for fitting iteration k in {1, . . . , N} do
3:   Compute Q_θ(s, a) and target values y_k(s, a) = r + γ max_{a′} Q_{k−1}(s′, a′) on {(s, a)} ∼ µ for training
4:   Minimize TD error for Q_θ via t = 1, . . . , T gradient descent updates, min_θ Σ (Q_θ(s, a) − y_k(s, a))²
5: end for

Figure 1: Implicit under-parameterization. Schematic diagram depicting the emergence of an effective rank collapse in deep Q-learning. Minimizing TD errors using gradient descent with a deep neural network Q-function leads to a collapse in the effective rank of the learned features Φ, which is exacerbated with further training.
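Fitted Q-iteration can be instantiated exactly in the tabular setting, where each fitting step matches the targets perfectly rather than via T gradient updates. A minimal NumPy sketch, on a hypothetical 2-state, 2-action MDP (illustrative values only), iterating the Bellman optimality backup from Algorithm 1:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (values chosen for illustration).
n_states, n_actions, gamma = 2, 2, 0.9
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                      # R[s, a]
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s']
P[0, 0] = [0.8, 0.2]
P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.0, 1.0]

# Fitted Q-iteration (Algorithm 1) with an exact tabular "regressor":
# each iteration fits Q_k to the targets y_k = r + gamma * max_a' Q_{k-1}.
Q = np.zeros((n_states, n_actions))
for k in range(500):
    targets = R + gamma * (P @ Q.max(axis=1))   # y_k(s, a)
    Q = targets  # exact fit; a neural net would take T gradient steps

# Q now approximates Q*, the fixed point of the Bellman optimality
# operator: Q* = R + gamma * P max_a' Q*.
assert np.allclose(Q, R + gamma * (P @ Q.max(axis=1)), atol=1e-4)
```

With a neural network regressor, the exact-fit line is replaced by T gradient steps on the squared TD error against fixed targets, which is precisely where the implicit regularization effects studied in this paper enter.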

