EFFICIENT DEEP REINFORCEMENT LEARNING REQUIRES REGULATING OVERFITTING

Abstract

Deep reinforcement learning algorithms that learn policies by trial-and-error must learn from limited amounts of data collected by actively interacting with the environment. While many prior works have shown that proper regularization techniques are crucial for enabling data-efficient RL, a general understanding of the bottlenecks in data-efficient RL has remained elusive. Consequently, it has been difficult to devise a universal technique that works well across all domains. In this paper, we attempt to understand the primary bottleneck in sample-efficient deep RL by examining several potential hypotheses such as non-stationarity, excessive action distribution shift, and overfitting. We perform a thorough empirical analysis on state-based DeepMind control suite (DMC) tasks in a controlled and systematic way to show that high temporal-difference (TD) error on a held-out validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms, and that prior methods that lead to good performance do, in fact, keep the validation TD error low. This observation gives us a robust principle for making deep RL efficient: we can hill-climb on the validation TD error by utilizing any form of regularization technique from supervised learning. We show that a simple online model selection method that targets the validation TD error is effective across state-based DMC and Gym tasks.

1. INTRODUCTION

Reinforcement learning (RL) methods, when combined with high-capacity deep neural net function approximators, have shown promise in domains such as robot manipulation (Andrychowicz et al., 2020), chip placement (Mirhoseini et al., 2020), games (Silver et al., 2016), and data-center cooling (Lazic et al., 2018). Since every unit of active online data collection comes at an expense (e.g., running real robots, chip evaluation using simulation), it is important to develop sample-efficient deep RL algorithms that can learn efficiently even with a limited amount of experience. Devising such efficient RL algorithms has been an important thread of research in recent years (Janner et al., 2019; Chen et al., 2021; Hiraoka et al., 2021). In principle, off-policy RL methods (e.g., SAC (Haarnoja et al., 2018), TD3 (Fujimoto et al., 2018), Rainbow (Hessel et al., 2018)) should provide good sample efficiency, because they make it possible to improve the policy and value functions for many gradient steps per step of data collection. However, this benefit does not appear to be realizable in practice, as taking too many training steps for each collected transition actually harms performance in many environments. Several hypotheses, such as overestimation (Thrun & Schwartz, 1993; Fujimoto et al., 2018), non-stationarities (Lyle et al., 2022), or overfitting (Nikishin et al., 2022), have been proposed as the underlying causes. Building on these hypotheses, several mitigation strategies, such as model-based data augmentation (Janner et al., 2019), the use of ensembles (Chen et al., 2021), network regularization (Hiraoka et al., 2021), and periodically resetting the RL agent from scratch while keeping the replay buffer (Nikishin et al., 2022), have been proposed as methods for enabling off-policy RL with more gradient steps.
While each of these approaches significantly improves sample efficiency, the efficacy of these fixes can be highly task-dependent (as we will show), and the underlying issue and the behavior of these methods are still not well understood. In this paper, we attempt to understand why taking more gradient steps can lead to worse performance with deep RL algorithms, why heuristic strategies can help in some cases, and how this challenge can be mitigated in a more principled and direct way. Through empirical analysis with the recently proposed tandem learning paradigm (Ostrovski et al., 2021), we show that in the early stages of training, TD-learning algorithms tend to quickly obtain high validation temporal-difference (TD) error (i.e., the error between the Q-network and the bootstrapping targets on a held-out validation set), and arrive at a worse final solution. We further show that many existing methods devised for the data-efficient RL setting are effective insofar as they keep the validation TD error low. This insight gives a robust principle for making deep RL efficient: in order to improve data efficiency, we can simply select the most suitable regularization for any given problem by hill-climbing on the validation TD error. We realize this principle as a simple online model selection method, which we call Automatic model selection using Validation TD error (AVTD), that attempts to automatically discover the best regularization strategy for a given task during the course of online RL training. AVTD trains several off-policy RL agents on a shared replay buffer, where each agent applies a different regularizer. Then, AVTD dynamically selects the agent with the smallest validation TD error for acting in the environment. We find that this simple strategy alone often performs similarly to or outperforms individual regularization schemes across a wide array of Gym and DeepMind control suite (DMC) tasks.
Critically, unlike prior regularization methods, whose performance can vary drastically across domains, our approach behaves robustly across all domains. To summarize, our first contribution is an empirical analysis of the bottlenecks in sample-efficient deep RL. We rigorously evaluate several potential explanations behind these challenges, and observe that obtaining high validation TD error in the early stages of training is one of the biggest culprits that inhibit the performance of data-efficient deep RL. Our second contribution is a simple active model selection method (AVTD) that attempts to automatically select regularization schemes by hill-climbing on validation TD error. Our method often matches or outperforms the best individual regularization scheme across a wide range of Gym and DMC tasks.
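
The selection rule described above (several agents trained on a shared buffer, with the agent of smallest validation TD error chosen to act) can be illustrated with a minimal sketch. This is not the authors' implementation: the tabular Q arrays stand in for Q-networks, and the names validation_td_error, select_acting_agent, and the dictionary keys are hypothetical choices made for this example.

```python
import numpy as np


def validation_td_error(q, q_target, policy, batch, gamma=0.99):
    """Mean squared TD error of one agent on held-out transitions.

    Assumptions of this sketch (not from the paper):
      q, q_target: float arrays of shape (num_states, num_actions), used as
                   tabular stand-ins for the Q-network and target network.
      policy:      int array of shape (num_states,), the action a' taken at s'.
      batch:       dict with int arrays "s", "a", "s_next" and float array "r".
    """
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    a_next = policy[s_next]
    # Bootstrapping target: r(s, a) + gamma * Q_target(s', a')
    target = r + gamma * q_target[s_next, a_next]
    return float(np.mean((target - q[s, a]) ** 2))


def select_acting_agent(agents, val_batch, gamma=0.99):
    """Return (index of the agent with the smallest validation TD error, all errors)."""
    errors = [
        validation_td_error(ag["q"], ag["q_target"], ag["policy"], val_batch, gamma)
        for ag in agents
    ]
    return int(np.argmin(errors)), errors
```

In an online loop, one would presumably re-evaluate this selection periodically on transitions that are held out from every agent's training batches, so that the error measures generalization rather than fit to the training data.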

2. PRELIMINARIES AND PROBLEM STATEMENT

The objective in RL is to maximize the long-term discounted return in a Markov decision process (MDP), $(S, A, P, r, \gamma)$, consisting of a state space $S$, an action space $A$, a transition dynamics model $P(s' \mid s, a)$, a reward function $r(s, a)$, and a discount factor $\gamma \in [0, 1)$. The Q-function $Q^\pi(s, a)$ for a policy $\pi(a \mid s)$ is the expected discounted reward obtained by executing action $a$ at state $s$ and following $\pi(a \mid s)$ thereafter, $Q^\pi(s, a) := \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right]$. The optimal Q-function $Q^\star$ satisfies the Bellman equation: $Q^\star(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ r(s, a) + \gamma \max_{a'} Q^\star(s', a') \right]$. Practical off-policy methods (e.g., Mnih et al., 2015; Hessel et al., 2018; Haarnoja et al., 2018) train a Q-network, $Q_\theta$ (parameterized by $\theta$), to minimize the temporal difference (TD) error:
$$L(\theta) = \mathbb{E}_{(s, a, s') \sim D} \left[ \left( r(s, a) + \gamma \bar{Q}(s', a') - Q_\theta(s, a) \right)^2 \right],$$
where $D$ is the replay buffer consisting of the transitions $(s, a, s')$ collected so far, $\bar{Q}$ is the target Q-network that is often updated to follow the Q-network $Q_\theta$ with delay or smoothing (Fujimoto et al., 2018) so that the target does not move too quickly, and $a'$ is usually drawn from a policy $\pi(a \mid s)$ that maximizes or approximately maximizes $Q_\theta(s, a)$. In theory, these off-policy algorithms can be made very sample efficient by minimizing the TD error fully over any data batch, which in practice translates to making more update steps to the Q-network per environment step, i.e., a higher "update-to-data" (UTD) ratio (Chen et al., 2021). However, when done naïvely, this can lead to worse performance (e.g., on DMC (Nikishin et al., 2022) and on MuJoCo Gym (Janner et al., 2019)). Many prior methods have been proposed for dealing with high-UTD issues (e.g., DroQ (Hiraoka et al., 2021), REDQ (Chen et al., 2021), and resets (Nikishin et al., 2022)).
However, we find that none of these prior methods, nor other simple baseline regularization schemes such as weight decay, dropout, and spectral normalization (Miyato et al., 2018), works well across all the tasks (see Appendix A, Figure 4). What is the primary culprit that can explain the high UTD challenge? Can we address it in a more direct and principled way?
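
To make the TD objective and the update-to-data ratio concrete, the following is a minimal sketch that performs a fixed number of TD updates per environment step. It rests on several assumptions that are not from the paper: a tabular Q array replaces the Q-network, $a'$ is taken as the greedy action under the target table, the target is smoothed with a Polyak coefficient tau, and the names td_loss and high_utd_updates are hypothetical. It illustrates the mechanics only and does not reproduce any of the methods cited above.

```python
import numpy as np


def td_loss(q, q_target, batch, gamma):
    """Mean squared TD error on a batch (s, a, r, s_next) of array-valued transitions."""
    s, a, r, s_next = batch
    # Greedy a' under the target table (an assumption of this sketch).
    target = r + gamma * q_target[s_next].max(axis=1)
    return float(np.mean((target - q[s, a]) ** 2))


def high_utd_updates(q, q_target, replay, utd, batch_size,
                     gamma=0.99, lr=0.1, tau=0.005, rng=None):
    """Run `utd` TD updates on minibatches from `replay` for one environment step.

    q and q_target are float arrays of shape (num_states, num_actions) and are
    modified in place. The constant factor 2 of the squared-error gradient is
    folded into `lr`.
    """
    rng = np.random.default_rng() if rng is None else rng
    s, a, r, s_next = replay
    for _ in range(utd):
        idx = rng.integers(0, len(s), size=batch_size)
        target = r[idx] + gamma * q_target[s_next[idx]].max(axis=1)
        td = target - q[s[idx], a[idx]]
        # Semi-gradient step: the target is treated as a constant.
        # np.add.at accumulates correctly when (s, a) pairs repeat in the batch.
        np.add.at(q, (s[idx], a[idx]), lr * td / batch_size)
        # Smoothed (Polyak) target update so the target does not move too quickly.
        q_target += tau * (q - q_target)
    return q, q_target
```

Raising utd fits the sampled training transitions more closely; it does not by itself say anything about the TD error on transitions that were never used for training.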

