IMPLICIT UNDER-PARAMETERIZATION INHIBITS DATA-EFFICIENT DEEP REINFORCEMENT LEARNING

Abstract

We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity via a drop in the rank of the learned value network features, and show that this typically corresponds to a performance drop. We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance.

1. INTRODUCTION

Many widely used deep reinforcement learning (RL) algorithms estimate value functions using bootstrapping, that is, by sequentially fitting value functions to target value estimates generated from the value function learned in the previous iteration. Despite high-profile achievements (Silver et al., 2017), these algorithms are highly unreliable due to poorly understood optimization issues. Although a number of hypotheses have been proposed to explain these issues (Achiam et al., 2019; Bengio et al., 2020; Fu et al., 2019; Igl et al., 2020; Liu et al., 2018; Kumar et al., 2020a), a complete understanding remains elusive.

We identify an "implicit under-parameterization" phenomenon that emerges when value networks are trained using gradient descent combined with bootstrapping. This phenomenon manifests as excessive aliasing of the features learned by the value network across states, and it is exacerbated by additional gradient updates. While the supervised deep learning literature suggests that some feature aliasing is desirable for generalization (e.g., Gunasekar et al., 2017; Arora et al., 2019), implicit under-parameterization exhibits more pronounced aliasing than supervised learning. This over-aliasing causes an otherwise expressive value network to implicitly behave as an under-parameterized network, often resulting in poor performance.

Implicit under-parameterization is aggravated when the rate of data re-use is increased, restricting the sample efficiency of deep RL methods. In online RL, increasing the number of gradient steps between data collection steps for data-efficient RL (Fu et al., 2019; Fedus et al., 2020b) causes the problem to emerge more frequently. In the extreme case where no additional data is collected, referred to as offline RL (Lange et al., 2012; Agarwal et al., 2020; Levine et al., 2020), implicit under-parameterization manifests consistently, limiting the viability of offline methods.
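To make the bootstrapping structure concrete, the sketch below runs exact fitted Q-iteration on a small, randomly generated tabular MDP. All sizes, the discount factor, and the random MDP itself are illustrative assumptions, not details from this paper; the point is only the loop structure, where each iteration regresses the Q-function onto targets built from the previous iterate (deep RL methods implement this regression approximately, with gradient steps on a neural network).

```python
import numpy as np

# Hypothetical toy MDP: 10 states, 2 actions, discount 0.9 (assumed values).
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 10, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.standard_normal((n_states, n_actions))                    # rewards r(s, a)

Q = np.zeros((n_states, n_actions))
for _ in range(200):
    # Bootstrapped target: r(s, a) + gamma * E_{s'}[max_a' Q_prev(s', a')].
    # Each iteration fits Q to targets generated by the PREVIOUS Q.
    target = R + gamma * (P @ Q.max(axis=1))
    # Here the "regression" onto the target is exact; deep RL methods
    # instead take gradient steps toward it with a neural network.
    Q = target
```

Because the Bellman optimality operator is a gamma-contraction, this exact version converges to the optimal Q-function; the paper's analysis concerns what changes when the regression step is replaced by gradient descent on an expressive network.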
We demonstrate the existence of implicit under-parameterization in common value-based deep RL methods, including Q-learning (Mnih et al., 2015; Hessel et al., 2018) and actor-critic methods (Haarnoja et al., 2018), as well as neural fitted Q-iteration (Riedmiller, 2005; Ernst et al., 2005). To isolate the issue, we study the effective rank of the features in the penultimate layer of the value network (Section 3). We observe that after an initial learning period, the rank of the learned features drops steeply. As the rank decreases, the ability of the features to fit subsequent target values and the optimal value function generally deteriorates, resulting in a sharp drop in performance (Section 3.1).
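Since the analysis centers on the effective rank of penultimate-layer features, a minimal sketch of one common rank measure may help: the smallest k such that the top-k singular values capture a (1 - delta) fraction of the spectrum. The threshold form, delta = 0.01, and the synthetic feature matrices below are illustrative assumptions rather than this paper's exact experimental protocol.

```python
import numpy as np

def effective_rank(features, delta=0.01):
    """Smallest k such that the top-k singular values of the feature
    matrix account for at least a (1 - delta) fraction of the total
    singular-value mass. (Assumed threshold form for illustration.)"""
    sv = np.linalg.svd(features, compute_uv=False)
    cum = np.cumsum(sv) / np.sum(sv)
    return int(np.searchsorted(cum, 1.0 - delta) + 1)

rng = np.random.default_rng(0)

# A generic random feature matrix (256 states x 64 features) has a
# well-spread spectrum, so its effective rank is close to 64.
full = rng.standard_normal((256, 64))

# Heavily aliased features -- here forced through a rank-2 bottleneck --
# collapse to a tiny effective rank, mimicking the drop described above.
aliased = rng.standard_normal((256, 2)) @ rng.standard_normal((2, 64))

print(effective_rank(full))     # high, near the feature dimension
print(effective_rank(aliased))  # tiny, at most the bottleneck width
```

In the deep RL setting, `features` would be the penultimate-layer activations of the value network evaluated on a batch of states; tracking this quantity over training is what reveals the rank collapse.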

