A CONNECTION BETWEEN ONE-STEP RL AND CRITIC REGULARIZATION IN REINFORCEMENT LEARNING

Abstract

As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting. One-step methods perform regularization by doing just a single step of policy improvement, while critic regularization methods do many steps of policy improvement with a regularized objective. These methods appear distinct. One-step methods, such as advantage-weighted regression and conditional behavioral cloning, truncate policy iteration after just one step. This "early stopping" makes one-step RL simple and stable, but can limit its asymptotic performance. Critic regularization typically requires more compute but has appealing lower-bound guarantees. In this paper, we draw a close connection between these methods: applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL. While our theoretical results require assumptions (e.g., deterministic dynamics), our experiments nevertheless show that our analysis makes accurate, testable predictions about practical offline RL methods (CQL and one-step RL) with commonly-used hyperparameters.



1. INTRODUCTION

Reinforcement learning (RL) algorithms tend to perform better when regularized, especially when given access to only limited data, and especially in batch (i.e., offline) settings where the agent is unable to collect new experience. While RL algorithms can be regularized using the same tools as in supervised learning (e.g., weight decay, dropout), we will use "regularization" to refer to regularization methods unique to the RL setting. Such regularization methods include policy regularization (penalizing the policy for sampling out-of-distribution actions) and value regularization (penalizing the critic for making large predictions). Research on these sorts of regularization has grown significantly in recent years, yet theoretical work studying the tradeoffs between regularization methods remains limited.

Many RL methods perform regularization, and can be classified by whether they perform one or many steps of policy improvement. One-step RL methods (Brandfonbrener et al., 2021; Peng et al., 2019; Peters & Schaal, 2007; Peters et al., 2010) perform one step of policy iteration, updating the policy to choose actions that are best according to the Q-function of the behavioral policy. The policy is often regularized to not deviate far from the behavioral policy. In theory, policy iteration can take a large number of iterations (Õ(|S||A|/(1 - γ)) (Scherrer, 2013)) to converge, so one-step RL (one step of policy iteration) fails to find the optimal policy on most tasks. Empirically, policy iteration often converges in a smaller number of iterations (Sutton & Barto, 2018, Sec. 4.3), and the policy after just a single iteration can sometimes achieve performance comparable to multi-step RL methods (Brandfonbrener et al., 2021).
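To make the one-step recipe concrete, here is a minimal tabular sketch on a hypothetical 2-state, 2-action deterministic MDP (all names and numbers are illustrative, not from the paper): first evaluate Q^β of a uniform behavioral policy by iterating the Bellman expectation backup, then perform a single (unregularized, for simplicity) greedy improvement step.

```python
import numpy as np

# Hypothetical 2-state, 2-action deterministic MDP (illustrative only).
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[0, 1], [0, 1]])        # P[s, a] = next state (deterministic)
R = np.array([[0.0, 1.0], [0.0, 2.0]])  # R[s, a] = reward

beta = np.full((n_s, n_a), 0.5)       # uniform behavioral policy

# Policy evaluation: iterate the Bellman expectation backup to get Q^beta.
Q = np.zeros((n_s, n_a))
for _ in range(1000):
    V = (beta * Q).sum(axis=1)        # V^beta(s) = E_{a~beta}[Q^beta(s, a)]
    Q = R + gamma * V[P]              # deterministic dynamics: s' = P[s, a]

# One step of policy improvement w.r.t. Q^beta (one-step methods typically
# also add an actor regularizer toward beta; omitted here for brevity).
one_step_policy = Q.argmax(axis=1)
print(one_step_policy)
```

The key distinction from multi-step RL is that `Q` is the behavioral policy's Q-function, not the optimum of repeated improvement steps.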
Critic regularization methods modify the training of the value function such that it predicts smaller returns for unseen actions (Kumar et al., 2020; Chebotar et al., 2021; Yu et al., 2021; Hatch et al., 2022; Nachum et al., 2019; An et al., 2021; Bai et al., 2022; Buckman et al., 2020). Errors in the critic might cause it to overestimate the value of some unseen actions, but that overestimation can be combated by decreasing the values predicted for all unseen actions. In this paper, we will use "critic regularization" to specifically refer to multi-step methods that use critic regularization.

In this paper, we show that a certain type of actor and critic regularization can be equivalent, under some assumptions (see Fig. 1). The key idea is that, when using a certain TD loss, the regularized critic updates converge not to the true Q-values, but rather to the Q-values multiplied by an importance weight. For the critic, these importance weights mean that the Q-values end up estimating the expected returns of the behavioral policy (Q^β, as in many one-step methods (Peters et al., 2010; Peters & Schaal, 2007; Peng et al., 2019; Brandfonbrener et al., 2021)), rather than the expected returns of the optimal policy (Q^π). For the actor, these importance weights mean that the logarithm of the Q-values includes a term that looks like a KL divergence. So, optimizing the policy with these Q-values results in a standard form of actor regularization.

The main contributions of this paper are as follows:
• We prove that one-step RL produces the same policy as a multi-step critic regularization method, for a certain regularization coefficient and when applied in deterministic settings.
• We show that similar connections hold for goal-conditioned RL, as well as other RL settings.
• We provide experiments validating the theoretical results in settings where our assumptions hold.
• We show that the theoretical results make accurate, testable predictions for practical offline RL methods, which can violate our assumptions.

2. RELATED WORK

Regularization has been applied to RL in many different ways (Neu et al., 2017; Geist et al., 2019), and features prominently in offline RL methods (Lange et al., 2012; Levine et al., 2020). While RL algorithms can be regularized using the same techniques as in supervised learning (e.g., weight decay, dropout), our focus will be on regularization methods unique to the RL setting. Such RL-specific regularization methods can be categorized based on whether they regularize the actor or the critic.

One-step RL methods (Brandfonbrener et al., 2021; Gülçehre et al., 2020; Peters & Schaal, 2007; Peng et al., 2019; Peters et al., 2010; Wang et al., 2018) apply a single step of policy improvement to the behavioral policy. These methods first estimate the Q-values of the behavioral policy, either via regression or iterative Bellman updates. Then, these methods optimize the policy to maximize these Q-values minus an actor regularizer. Many goal-conditioned or task-conditioned imitation learning methods (Savinov et al., 2018; Ding et al., 2019; Sun et al., 2019; Ghosh et al., 2020; Paster et al., 2020; Yang et al., 2021; Srivastava et al., 2019; Kumar et al., 2019; Chen et al., 2021; Lynch & Sermanet, 2021; Li et al., 2020; Eysenbach et al., 2020a) also fit into this mold (Eysenbach et al., 2022), yielding policies that maximize the Q-values of the behavioral policy while avoiding unseen actions. Note that non-conditional imitation learning methods do not perform policy improvement, and do not fit into this mold. One-step methods are typically simple to implement and computationally efficient.

Critic regularization methods instead modify the objective for the Q-function so that it predicts lower returns for unseen actions (Kumar et al., 2020; Chebotar et al., 2021; Yu et al., 2021; Hatch et al., 2022; Nachum et al., 2019; An et al., 2021; Bai et al., 2022; Buckman et al., 2020). Critic regularization methods are typically more challenging to implement correctly and more computationally demanding (Kumar et al., 2020; Nachum et al., 2019; Bai et al., 2022; An et al., 2021), but can lead to better results on some challenging problems (Kostrikov et al., 2021). Our analysis will show that one-step RL is equivalent to a certain type of critic regularization.

3. PRELIMINARIES

We start by defining the single-task RL problem, and then introduce prototypical examples of one-step RL and critic regularization. We then introduce an actor-critic algorithm that we will use for our analysis.

3.1. NOTATION

We assume an MDP with states s, actions a, initial state distribution p0(s0), dynamics p(s′ | s, a), and reward function r(s, a).¹ We assume episodes always have infinite length (i.e., there are no terminal states). Without loss of generality, we assume rewards are positive, adding a positive constant to all rewards if needed.



¹ If the reward also depends on the next state, then define r(s, a) = E_{p(·|s,a)}[r(s, a, s′)].



Figure 1: Both one-step RL and critic regularization can interpolate between behavioral cloning (left) and unregularized RL (right) by varying the regularization parameter. Endpoints of these regularization paths are the same. We prove that these methods also obtain the same policy for an intermediate degree of regularization.

