A CONNECTION BETWEEN ONE-STEP RL AND CRITIC REGULARIZATION IN REINFORCEMENT LEARNING

Abstract

As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting. One-step methods perform regularization by doing just a single step of policy improvement, while critic regularization methods do many steps of policy improvement with a regularized objective. These methods appear distinct. One-step methods, such as advantage-weighted regression and conditional behavioral cloning, truncate policy iteration after just one step. This "early stopping" makes one-step RL simple and stable, but can limit its asymptotic performance. Critic regularization typically requires more compute but has appealing lower-bound guarantees. In this paper, we draw a close connection between these methods: applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL. While our theoretical results require assumptions (e.g., deterministic dynamics), our experiments nevertheless show that our analysis makes accurate, testable predictions about practical offline RL methods (CQL and one-step RL) with commonly-used hyperparameters.



1. INTRODUCTION

[Figure 1 caption, partially recovered: Endpoints of these regularization paths are the same. We prove that these methods also obtain the same policy for an intermediate degree of regularization.]

Reinforcement learning (RL) algorithms tend to perform better when regularized, especially when given access to only limited data, and especially in batch (i.e., offline) settings where the agent is unable to collect new experience. While RL algorithms can be regularized using the same tools as in supervised learning (e.g., weight decay, dropout), we will use "regularization" to refer to regularization methods unique to the RL setting. Such regularization methods include policy regularization (penalizing the policy for sampling out-of-distribution actions) and value regularization (penalizing the critic for making large predictions). Research on these sorts of regularization has grown significantly in recent years, yet theoretical work studying the tradeoffs between regularization methods remains limited.

Many RL methods perform regularization, and can be classified by whether they perform one or many steps of policy improvement. One-step RL methods (Brandfonbrener et al., 2021; Peng et al., 2019; Peters & Schaal, 2007; Peters et al., 2010) perform one step of policy iteration, updating the policy to choose actions that are best according to the Q-function of the behavioral policy. The policy is often regularized to not deviate far from the behavioral policy. In theory, policy iteration can take a large number of iterations ($\tilde{O}(|S||A|/(1-\gamma))$; Scherrer, 2013) to converge, so one-step RL (one step of policy iteration) fails to find the optimal policy on most tasks. Empirically, policy iteration often converges in a smaller number of iterations (Sutton & Barto, 2018, Sec. 4.3), and the policy after just a single iteration can sometimes achieve performance comparable to multi-step RL methods (Brandfonbrener et al., 2021).
Critic regularization methods modify the training of the value function such that it predicts smaller returns for unseen actions (Kumar et al., 2020; Chebotar et al., 2021; Yu et al., 2021; Hatch et al., 2022; Nachum et al., 2019; An et al., 2021; Bai et al., 2022; Buckman et al., 2020). Errors in the critic might cause it to overestimate the value of some unseen actions, but that overestimation can be combated by decreasing the values predicted for all unseen actions. In this paper, we will use "critic regularization" to specifically refer to multi-step methods that use critic regularization.

In this paper, we show that a certain type of actor and critic regularization can be equivalent, under some assumptions (see Fig. 1). The key idea is that, when using a certain TD loss, the regularized critic updates converge not to the true Q-values, but rather the Q-values multiplied by an importance weight. For the critic, these importance weights mean that the Q-values end up estimating the expected returns of the behavioral policy ($Q^\beta$, as in many one-step methods (Peters et al., 2010; Peters & Schaal, 2007; Peng et al., 2019; Brandfonbrener et al., 2021)), rather than the expected returns of the optimal policy ($Q^\pi$). For the actor, these importance weights mean that the logarithm of the Q-values includes a term that looks like a KL divergence. So, optimizing the policy with these Q-values results in a standard form of actor regularization.

The main contributions of this paper are as follows:
• We prove that one-step RL produces the same policy as a multi-step critic regularization method, for a certain regularization coefficient and when applied in deterministic settings.
• We show that similar connections hold for goal-conditioned RL, as well as other RL settings.
• We provide experiments validating the theoretical results in settings where our assumptions hold.
• We show that the theoretical results make accurate, testable predictions for practical offline RL methods, which can violate our assumptions.

2. RELATED WORK

Regularization has been applied to RL in many different ways (Neu et al., 2017; Geist et al., 2019), and features prominently in offline RL methods (Lange et al., 2012; Levine et al., 2020). While RL algorithms can be regularized using the same techniques as in supervised learning (e.g., weight decay, dropout), our focus will be on regularization methods unique to the RL setting. Such RL-specific regularization methods can be categorized based on whether they regularize the actor or the critic.

[...] rewards can make them all positive without changing the optimal policy. We will learn a Markovian policy $\pi(a \mid s)$ to maximize the expected discounted sum of rewards:

$$\max_\pi \; \mathbb{E}_{\pi(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle|\; s_0 \sim p_0(s_0)\right],$$

where $\pi(\tau) = p_0(s_0) \prod_{t=0}^{\infty} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$ is the probability of policy $\pi$ sampling an infinite-length trajectory $\tau = (s_0, a_0, \cdots)$. We define Q-values for policy $\pi(a \mid s)$ as

$$Q^\pi(s, a) = \mathbb{E}_{\pi(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle|\; s_0 = s,\, a_0 = a\right].$$

Note that the reward being positive implies that the Q-values are also positive, $Q^\pi(s, a) > 0$. Since we focus on the offline setting, we will consider two policies: $\beta(a \mid s)$ is the behavioral policy that collected the dataset, and $\pi(a \mid s)$ is the online policy output by the algorithm that attempts to maximize the rewards. We will use $p(s, a, s')$ to denote the empirical distribution of transitions in an offline dataset, and $p(s, a)$ and $p(s)$ denote the corresponding marginal distributions. The behavioral policy is defined as $\beta(a \mid s) = p(a \mid s)$.
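As a concrete illustration of the Q-value definition above, the following minimal sketch evaluates $Q^\pi$ exactly for a small tabular MDP by solving the linear Bellman equations. The function name `q_values` and the two-state MDP are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import numpy as np

def q_values(P, r, pi, gamma):
    """Solve Q = r + gamma * E_{p(s'|s,a) pi(a'|s')}[Q] for a tabular MDP.

    P:  (S, A, S) array of transition probabilities p(s' | s, a)
    r:  (S, A) array of positive rewards
    pi: (S, A) array of policy probabilities pi(a | s)
    """
    S, A = r.shape
    # M[(s, a), (s', a')] = p(s' | s, a) * pi(a' | s')
    M = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * M, r.reshape(S * A))
    return q.reshape(S, A)

# Hypothetical deterministic MDP: action 0 leads to state 0, action 1 to state 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
r = np.array([[1.0, 2.0], [0.5, 3.0]])   # positive rewards
beta = np.full((2, 2), 0.5)              # uniform behavioral policy
Q_beta = q_values(P, r, beta, gamma=0.9)
```

Passing the behavioral policy gives $Q^\beta$; passing any other policy table gives the corresponding $Q^\pi$.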

3.2. EXAMPLES OF REGULARIZATION IN RL

While actor and critic regularization methods can be implemented in many ways, we introduce two prototypical examples below to make our discussion more concrete.

Example of one-step RL: Brandfonbrener et al. (2021). One-step RL first estimates the Q-values of the behavioral policy ($Q^\beta(s, a)$), and then optimizes the policy to maximize the Q-values minus an actor regularizer. While the actor regularizer can take different forms and the Q-values can be learned via regression, we will use a reverse KL regularizer and TD-style critic update so that the objective is similar to critic regularization:

$$\max_\pi \; \mathbb{E}_{p(s)\pi(a \mid s)}\left[Q^\beta(s, a) + \lambda\left(\log \beta(a \mid s) - \log \pi(a \mid s)\right)\right],$$

where $Q^\beta(s, a) = \lim_{t \to \infty} Q_t(s, a)$ and

$$Q_{t+1} \leftarrow \arg\min_Q \; \mathbb{E}_{p(s,a)}\left[\left(Q(s, a) - y^{\beta,Q_t}(s, a)\right)^2\right], \qquad y^{\beta,Q_t}(s, a) \triangleq r(s, a) + \gamma\, \mathbb{E}_{p(s' \mid s,a)\beta(a' \mid s')}\left[Q_t(s', a')\right].$$

Here $\lambda$ is the regularization coefficient and $\beta(a \mid s)$ is an estimate of the behavioral policy, typically learned via behavioral cloning. Like most TD methods (Haarnoja et al., 2018; Mnih et al., 2013; Fujimoto et al., 2018), the TD targets $y$ are not considered learnable. In practice, most methods do not optimize the critic to convergence at each step, instead taking just a few gradient steps before updating the TD targets. This one-step critic loss is different from the multi-step critic losses used in other RL methods (e.g., TD3, SVG(0)) because it uses the TD target $y^{\beta,Q}(s, a)$ (corresponding to a fixed policy) rather than $y^{\pi,Q}(s, a)$ (corresponding to a sequence of learned policies). One-step RL amounts to performing one step of policy iteration, rather than full policy optimization. While truncating the iterations of policy iteration can be suboptimal, it can also be interpreted as a form of early stopping regularization.

Example of critic regularization: Kumar et al. (2020). CQL (Kumar et al., 2020) modifies the standard Bellman loss to include an additional term that decreases the values predicted for unseen actions. The actor objective is to maximize Q-values; some CQL implementations also regularize the actor loss (Hoffman et al., 2020; Kumar et al., 2020). The objectives can then be written as

$$\max_\pi \; \mathbb{E}_{p(s)\pi(a \mid s)}\left[Q^\pi(s, a)\right],$$

where $Q^\pi(s, a) = \lim_{t \to \infty} Q_t(s, a)$ and

$$Q_{t+1} = \arg\min_Q \; \mathbb{E}_{p(s,a)}\left[\left(Q(s, a) - y^{\pi,Q_t}(s, a)\right)^2\right] + \lambda\left(\mathbb{E}_{p(s)\pi(a \mid s)}\left[Q(s, a)\right] - \mathbb{E}_{p(s)\beta(a \mid s)}\left[Q(s, a)\right]\right).$$

The second term decreases the Q-values for unseen actions (those sampled from $\pi(a \mid s)$) while the third term increases the values predicted for seen actions (those sampled from the behavioral policy $\beta(a \mid s)$). Unlike standard temporal difference methods, the CQL updates resemble a competitive game between the actor and the critic. In practice, this cyclic dependency can create unstable learning (Kumar et al., 2020; Hoffman et al., 2020).
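In tabular settings, the reverse-KL-regularized actor objective of one-step RL has a well-known closed-form maximizer, $\pi(a \mid s) \propto \beta(a \mid s) \exp(Q^\beta(s, a)/\lambda)$. The sketch below illustrates this under the assumption that $Q^\beta$ and $\beta$ are available as tables; the helper names and the numerical values are hypothetical and not taken from the paper or from any existing implementation.

```python
import numpy as np

def one_step_policy(Q_beta, beta, lam):
    """Per-state maximizer of E_pi[Q_beta + lam * (log beta - log pi)]."""
    logits = np.log(beta) + Q_beta / lam
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

def actor_objective(pi, Q_beta, beta, lam):
    """Per-state value of the regularized one-step actor objective."""
    return (pi * (Q_beta + lam * (np.log(beta) - np.log(pi)))).sum(axis=1)

# Hypothetical values for a single state with three actions.
Q_beta = np.array([[1.0, 2.0, 0.5]])
beta = np.array([[0.7, 0.1, 0.2]])
pi_star = one_step_policy(Q_beta, beta, lam=1.0)
```

Larger values of `lam` keep `pi_star` closer to `beta`, while smaller values move it toward the greedy action under `Q_beta`.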

3.3. HOW ARE THESE METHODS CONNECTED?

Prior work has observed that one-step methods and critic regularization methods perform similarly on many (Fujimoto & Gu, 2021; Emmons et al., 2021) (but not all (Kostrikov et al., 2021) ) tasks. Despite the differences in objectives and implementations of these two methods (and, more broadly, the actor/critic regularization methods for which they are prototypes), are there deeper, unifying connections between the methods? In the next section, we introduce a different actor-critic method that will allow us to draw a connection between one-step RL and critic regularization. We experimentally validate this equivalence in Sec. 5.1. Despite its difference from practically-used methods, such as one-step RL and CQL, we will show that it makes accurate predictions about the behavior of these practical methods (Sec. 5.2 and 5.3).

3.4. CLASSIFIER ACTOR CRITIC

To support our analysis, we will introduce a new actor-critic algorithm. This algorithm is similar to prior work, but trains the critic using a cross entropy loss instead of an MSE loss. We introduce this algorithm not because we expect it to perform better than existing actor-critic methods, but rather because it allows us to make precise a connection between actor and critic regularization. This method treats the value function like a classifier, so we will call it classifier actor critic. We will then introduce actor-regularized and critic-regularized versions of this method. The subsequent section (Sec. 4) will show that these two regularized methods learn the same policy.

The key to our analysis will be to treat Q-values like probabilities, so we define the critic loss in terms of a cross-entropy loss, similar to prior work (Kalashnikov et al., 2018; Eysenbach et al., 2021). Recalling that Q-values are positive (Sec. 3.1), we transform the Q-values to have the correct range by using $\frac{Q}{Q+1} \in [0, 1)$. We will minimize the cross-entropy loss applied to the transformed Q-values:

$$\mathbb{E}_{p(s,a)}\left[\mathrm{CE}\left(\frac{Q(s, a)}{Q(s, a) + 1};\; \frac{y^{\pi,Q_t}(s, a)}{y^{\pi,Q_t}(s, a) + 1}\right)\right] \quad (3)$$

$$= -\mathbb{E}_{p(s,a)}\left[\frac{y^{\pi,Q_t}(s, a)}{y^{\pi,Q_t}(s, a) + 1} \log \frac{Q(s, a)}{Q(s, a) + 1} + \frac{1}{y^{\pi,Q_t}(s, a) + 1} \log \frac{1}{Q(s, a) + 1}\right]$$

$$\overset{\text{const.}}{=} -\mathbb{E}_{p(s,a)}\left[y^{\pi,Q_t}(s, a) \log \frac{Q(s, a)}{Q(s, a) + 1} + \log \frac{1}{Q(s, a) + 1}\right] \triangleq \mathcal{L}_{\text{critic}}(Q, y^{\pi,Q_t}).$$

In the last line we scale both the positive and negative term by $y^{\pi,Q_t}(s, a) + 1$, a choice that does not change the optimal classifier but reduces notational clutter. When the TD target can be computed exactly, solving this optimization problem results in performing one SARSA update: $Q(s, a) \leftarrow r(s, a) + \gamma Q(s', a')$ (see Lemma 4.1). Thus, by solving this optimization problem many times, each time using the previous Q-value to compute the TD targets, we will converge to the correct Q-values (see Lemma 4.1).
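The claim that minimizing the classifier critic loss reproduces the TD target can be checked numerically for a single state-action pair. The following is a minimal sketch, assuming a fixed scalar target `y` (a hypothetical value) and using a simple grid search rather than any particular optimizer.

```python
import numpy as np

def critic_loss(Q, y):
    """Pointwise classifier critic loss: -(y log(Q/(Q+1)) + log(1/(Q+1)))."""
    return -(y * np.log(Q / (Q + 1.0)) + np.log(1.0 / (Q + 1.0)))

y = 2.5                                   # hypothetical, exactly computed TD target
grid = np.linspace(0.01, 10.0, 100000)    # candidate positive Q-values
Q_star = grid[np.argmin(critic_loss(grid, y))]
print(f"minimizer of the critic loss: {Q_star:.3f} (TD target: {y})")
```

Setting the derivative of the pointwise loss to zero gives $(Q - y)/(Q(Q+1)) = 0$, so the grid minimizer should sit at the grid point closest to `y`.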
The actor objective is to maximize the expected log of the Q-values:

$$\max_\pi \; \mathcal{L}_{\text{actor}}(\pi) \triangleq \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q^\pi(s, a)\right],$$

where $Q^\pi(s, a) = \lim_{t \to \infty} Q_t(s, a)$ and $Q_{t+1} = \arg\min_Q \mathcal{L}_{\text{critic}}(Q, y^{\pi,Q_t})$. While most actor-critic methods do not use the logarithm transformation, prior work on conditional behavioral cloning (e.g., Savinov et al., 2018; Ding et al., 2019; Sun et al., 2019; Ghosh et al., 2020; Srivastava et al., 2019) implicitly includes this transformation (Eysenbach et al., 2022). In the absence of additional regularization, the optimal policy $\pi(a \mid s) = \mathbb{1}(a = \arg\max_{a'} Q(s, a'))$ is the same as the optimal policy for the standard actor objective (without the logarithm).

We next introduce a one-step version of this method, as well as a critic regularization variant that resembles CQL. While we will implicitly use a regularization coefficient of 1 below, Appendix B.1 discusses versions of classifier actor critic with varying degrees of regularization.

One-step RL. To make classifier actor critic resemble one-step RL (Brandfonbrener et al., 2021), we make two changes: estimating the value of the behavioral policy and adding a regularization term to the actor objective. To estimate the value of the behavioral policy, we modify the critic loss to sample the next action $a'$ from the behavioral policy (i.e., we use $y^{\beta,Q_t}(s, a)$ rather than $y^{\pi,Q_t}(s, a)$). We also regularize the policy by adding a relative entropy term to the actor loss, analogous to the reverse KL penalty used in one-step RL:

$$\max_\pi \; \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q^\beta(s, a) + \log \beta(a \mid s) - \log \pi(a \mid s)\right], \quad \text{where } Q^\beta(s, a) = \lim_{t \to \infty} Q_t(s, a) \quad (6)$$

and $Q_{t+1} = \arg\min_Q \mathcal{L}_{\text{critic}}(Q, y^{\beta,Q_t})$. In tabular settings, this critic objective estimates the Q-values for $\beta(a \mid s)$ (Lemma 4.1).

Critic regularization. To emulate CQL, we modify the critic loss (Eq. 4) by adding a penalty term that decreases the values for unseen actions. Whereas CQL applies this penalty to the Q-values directly, we will apply it to the logarithm of the Q-values:

$$\max_\pi \; \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q^\pi_r(s, a)\right], \quad \text{where } Q^\pi_r(s, a) = \lim_{t \to \infty} Q_t(s, a) \quad (7)$$

$$Q_{t+1}(s, a) = \arg\min_Q \; \underbrace{\mathcal{L}_{\text{critic}}(Q, y^{\pi,Q_t}) + \lambda\left(\mathbb{E}_{p(s)\pi(a \mid s)}\left[\log(Q(s, a) + 1)\right] - \mathbb{E}_{p(s)\beta(a \mid s)}\left[\log(Q(s, a) + 1)\right]\right)}_{\mathcal{L}^r_{\text{critic}}(Q,\, y^{\pi,Q_t})}.$$

4. A CONNECTION BETWEEN ONE-STEP RL AND CRITIC REGULARIZATION

This section provides our main result, which is that actor and critic regularization yield the same policy under some settings. The key to proving this connection will be to analyze the Q-values learned by critic regularization. While we mainly focus on the single-task setting, Sec. 4.2 describes how similar results also apply to other settings, including goal-conditioned RL, example-based control, and settings with smaller degrees of regularization. All proofs are in Appendix A.

To relate one-step RL to critic regularization, we start by analyzing the Q-values learned by both methods. We first show that the classifier critic converges to the correct Q-values:

Lemma 4.1. Assume that states and actions are tabular (discrete and finite), that rewards are positive, and that TD targets can be computed exactly (without sampling). Incrementally update the critic by solving a sequence of optimization problems:

$$Q_{t+1} \leftarrow \arg\min_Q \mathcal{L}_{\text{critic}}(Q, y^{\pi,Q_t}).$$

In the limit, this sequence of Q-functions will converge to $Q^\pi$: $\lim_{t \to \infty} Q_t(s, a) = Q^\pi(s, a)$ for all states $s$ and actions $a$.

Because one-step RL trains the critic using $\mathcal{L}_{\text{critic}}(Q, y^{\beta,Q})$, it learns Q-values corresponding to $Q^\beta(s, a)$. When regularization is added to the critic updates, it learns different Q-values. Perhaps surprisingly, this regularization means that our estimates for the value of policy $\pi(a \mid s)$ look like the value of the original behavioral policy:

Lemma 4.2. Assume that states and actions are tabular (discrete and finite), that rewards are positive, and that TD targets can be computed exactly (without sampling). Incrementally update the critic by minimizing a sequence of regularized critic losses using policy $\pi$ and hyperparameter $\lambda = 1$:

$$Q_{t+1} \leftarrow \arg\min_Q \mathcal{L}^r_{\text{critic}}(Q, y^{\pi,Q_t}).$$

In the limit, this sequence of Q-functions will converge to the Q-values for the behavioral policy ($\beta(a \mid s)$), weighted by the ratio of the behavioral and online policies:

$$\lim_{t \to \infty} Q_t(s, a) = Q^\beta(s, a)\, \frac{\beta(a \mid s)}{\pi(a \mid s)}$$

for all states $s$ and actions $a$.

Proof sketch. The ratio $\frac{\beta(a \mid s)}{\pi(a \mid s)}$ above is an importance weight. Ordinarily, a TD backup for policy $\pi(a \mid s)$ would entail sampling an action $a \sim \pi(a \mid s)$. However, this importance weight means that the TD backup is effectively performed by sampling an action $a \sim \beta(a \mid s)$. Such a TD backup resembles the TD backup for $\beta(a \mid s)$. The full proof is in Appendix A.

Intuitively, this result says that critic regularization reweights the Q-values to assign higher values to in-distribution actions, where $\beta(a \mid s)$ is large. An unexpected part of this result is that the Q-values correspond to the behavioral policy. In other words, critic regularization added to a multi-step RL method (one using $y^{\pi,Q_t}(s, a)$) yields the same critic as a one-step RL method (one using $y^{\beta,Q_t}(s, a)$). Our main result is a direct corollary of this Lemma:

Theorem 4.3. Let a behavioral policy $\beta(a \mid s)$ be given and let $Q^\beta(s, a)$ be the corresponding value function. Let $\pi(a \mid s)$ be an arbitrary policy (typically learned) with support constrained to $\beta(a \mid s)$ (i.e., $\pi(a \mid s) > 0 \implies \beta(a \mid s) > 0$). Let $Q^\pi_r(s, a)$ be the critic obtained by applying the regularized critic update (Eq. 7) to this policy with $\lambda = 1$. Then critic regularization results in the same policy as one-step RL:

$$\mathbb{E}_{\pi(a \mid s)}\left[\log Q^\pi_r(s, a)\right] = \mathbb{E}_{\pi(a \mid s)}\left[\log Q^\beta(s, a) + \log \beta(a \mid s) - \log \pi(a \mid s)\right]$$

for all states $s$.

Since both forms of regularization result in the same objective for the actor, they must produce the same policy in the end.
While prior work has mentioned that critic regularization implicitly regularizes the policy (Yu et al., 2021), this result shows that under the assumptions stated above, the implicit regularization of critic regularization results in the exact same policy learning objective as one-step RL. This equivalence holds when λ = 1, and not necessarily for other regularization coefficients. Appendix B.1 shows how a variant of this result that includes an additional regularization mechanism does apply to different regularization coefficients. This connection between one-step RL and critic regularization concerns their objective functions, not the procedures used to optimize those objective functions. Indeed, because practical offline RL algorithms sometimes use different optimization procedures (e.g., TD vs. MC estimates of Q β (s, a)), they will incur errors in estimating Q β (s, a), violating Theorem 4.3's assumption that these Q-values are estimated exactly.
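Lemma 4.2 and Theorem 4.3 can be illustrated numerically in a small tabular setting. The sketch below is not the paper's experimental code: it assumes a randomly generated deterministic MDP with hypothetical sizes, and it applies the closed-form regularized update derived in Appendix A ($Q \leftarrow y^{\pi,Q_t}\, \beta/\pi$ with $\lambda = 1$) rather than running gradient descent on the critic loss.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9                        # hypothetical problem sizes
next_state = rng.integers(S, size=(S, A))      # deterministic dynamics s' = f(s, a)
r = rng.uniform(0.1, 1.0, size=(S, A))         # positive rewards
beta = rng.dirichlet(np.ones(A), size=S)       # behavioral policy beta(a | s)
pi = rng.dirichlet(np.ones(A), size=S)         # arbitrary policy with full support

def td_target(Q, policy):
    """Exact TD target r(s, a) + gamma * E_{policy(a'|s')}[Q(s', a')]."""
    v = (policy * Q).sum(axis=1)               # expected value of each state
    return r + gamma * v[next_state]

Q_beta = np.zeros((S, A))                      # one-step critic (targets use beta)
Q_reg = np.zeros((S, A))                       # regularized critic (targets use pi)
for _ in range(2000):
    Q_beta = td_target(Q_beta, beta)
    Q_reg = td_target(Q_reg, pi) * beta / pi   # regularized update, lambda = 1

gap = np.abs(np.log(Q_reg) - (np.log(Q_beta) + np.log(beta) - np.log(pi))).max()
print(f"max |log Q_r - (log Q_beta + log beta - log pi)| = {gap:.2e}")
```

The printed gap compares the two actor objectives from Theorem 4.3 elementwise, so a value near machine precision indicates that both methods present the actor with the same quantity in this toy problem.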

4.1. LIMITATIONS

Our theoretical analysis makes assumptions that may not always hold in practice. For example, our results use a critic loss based on the cross entropy loss, while most (but not all (Kalashnikov et al., 2018; Eysenbach et al., 2020b) ) practical methods use the MSE. Our analysis assumes that critic regularization arrives at an equilibrium, and ignores errors introduced by function approximation and sampling. Nonetheless, our theoretical results will make accurate predictions about practically-used offline RL methods.

4.2. EXTENSIONS OF THE ANALYSIS

We extend this analysis in three ways. First, we show that a similar connection can be established for lesser degrees of regularization (λ < 1) (see Appendix B.1). Second, we show that a similar connection holds for RL problems defined via success examples (Pinto & Gupta, 2016; Tung et al., 2018; Kalashnikov et al., 2021; Singh et al., 2019; Zolna et al., 2020; Calandra et al., 2017; Eysenbach et al., 2021). These results use an existing actor-critic method, rather than classifier actor critic (see Appendix C). Third, we extend our analysis to multi-task settings by looking at goal-conditioned RL problems.

5. NUMERICAL SIMULATIONS

Our numerical simulations study whether the theoretical connection between actor and critic regularization holds empirically. The first experiments (Sec. 5.1) will use classifier actor-critic, and we will expect the equivalence to hold exactly in this setting. We then study whether this connection still holds for practical prior methods (one-step RL and CQL), which violate our assumptions. We study these commonly-used methods in both tabular settings (Sec. 5.2) and on a benchmark offline RL task with continuous states and actions (Sec. 5.3). We do not expect these methods to always be the same (see, e.g., Kostrikov et al. (2021, Table 1)), and we will focus our experiments on critic regularization with moderate regularization coefficients. See Appendix E for details and hyperparameters for the experiments. Code will be released.

[Figure caption fragment: We also plot the action probabilities for a policy learned by an unregularized policy to confirm that the equivalence between one-step RL and critic regularization is not a coincidence.]

5.1. EXACT EQUIVALENCE WHEN USING CLASSIFIER ACTOR CRITIC

Our first experiment aims to validate our theoretical result under the required assumptions: when using classifier actor-critic as the RL algorithm, and when using a tabular environment. We use a 5 × 5 deterministic gridworld with 5 actions (up/down/left/right/nothing). We describe the reward function and other details in Appendix E. To ensure that critic regularization converges to a fixed point and to avoid oscillatory learning dynamics, we update the policy using an exponential moving average. We also include (unregularized) classifier actor-critic to confirm that regularization is important in some settings.

We compare these three methods in three environments. The first setting (Fig. 2 (left)) checks our theory that one-step RL and critic regularization should obtain the same policy. The second setting (Fig. 2 (center)) shows that one-step RL and critic regularization learn the same (suboptimal) policy in settings where using the Q-values for the behavioral policy leads to a suboptimal policy. The final setting is designed so that regularization increases the expected returns. The dataset is a single trajectory from the initial state to the goal. With such limited data, unregularized classifier actor critic overestimates the Q-values at unseen actions, learning a policy that mistakenly takes these actions. In contrast, the regularized approaches learn to imitate the expert trajectory. Fig. 2 (right) shows that both forms of regularization produce the optimal policy. In summary, these tabular experiments validate our theoretical results, including in settings where regularization is useful and settings where it is harmful.
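For readers who want to reproduce a comparable tabular setup, the following sketch builds the transition table of a 5 × 5 deterministic gridworld with five actions. It is an assumption-laden illustration rather than the environment used in the paper: the action ordering, the choice that moves into a wall leave the agent in place, and the function name `build_gridworld` are all hypothetical, and the reward function (described in Appendix E) is omitted.

```python
import numpy as np

N = 5
# Assumed action ordering: up, down, left, right, nothing.
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]

def build_gridworld(n=N):
    """Return next_state[s, a] for an n x n grid with states indexed row-major."""
    next_state = np.zeros((n * n, len(MOVES)), dtype=int)
    for row in range(n):
        for col in range(n):
            for a, (d_row, d_col) in enumerate(MOVES):
                # Assumption: moving into a wall leaves the agent where it is.
                new_row = min(max(row + d_row, 0), n - 1)
                new_col = min(max(col + d_col, 0), n - 1)
                next_state[row * n + col, a] = new_row * n + new_col
    return next_state

next_state = build_gridworld()
```

A table of this form is enough to compute exact TD targets, since each state-action pair has a single successor state.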

5.2. PRACTICAL IMPLEMENTATIONS EXHIBIT SIMILAR BEHAVIOR

Based on our theoretical analysis, we predict that practical implementations of one-step RL and critic regularization will exhibit similar behavior, for a certain critic regularization coefficient. This section studies the tabular setting, and the following section will use a continuous control benchmark. For critic regularization, we used CQL (Kumar et al., 2020) together with soft value iteration; following Brandfonbrener et al. (2021), we implement one-step RL (reverse KL) using Q-learning.

We designed a deterministic gridworld so one-step RL would fail to learn the optimal policy (see Fig. 3 (left)). If CQL interpolates between the behavioral policy (random) and the optimal policy, then the argmax action would always be the same as the action for π*. Based on our analysis, we make a different prediction: that CQL will learn a policy similar to the one-step RL policy. We show results in Fig. 3 (right), just showing the argmax action for visual clarity. The CQL policy takes actions away from both the high-reward state and the low-reward state, similar to the one-step RL policy but different from both the behavioral policy and the optimal policy. This experiment suggests that CQL can exhibit behavior similar to one-step RL. Of course, this effect is mediated by the regularization strength: a larger regularization coefficient would cause CQL to learn a random policy, and a coefficient of 0 would make CQL identical to Q-learning.

How often does one-step RL approximate CQL? To show that the results in Fig. 3 are not cherry-picked, we repeated this experiment using 100 MDPs that are structurally similar to that in Fig. 3, but where the locations of the high-reward and low-reward states are randomized. In each randomly generated MDP, we determine whether CQL exhibits behavior similar to one-step RL by looking at the states where CQL takes actions that differ from the reward-maximizing actions (as determined by running Q-learning with unlimited data). Since there are five total actions, a random policy would have a similarity score of 20%. As shown in Fig. 4, the similarity score is significantly higher than chance for the vast majority of MDPs, showing that one-step RL and CQL (λ = 10) produce similar policies on most such gridworlds.

When does one-step RL approximate CQL? Because one-step RL is highly regularized (policy iteration is truncated after just one step), one might imagine that it would be most similar to CQL with a very large regularization coefficient. To study this, we use the same environment (Fig. 3) and measure the fraction of states where one-step RL and CQL choose the same argmax action. As shown in Fig. 5, one-step RL is most similar to CQL with moderate regularization (λ = 10), and is less similar to CQL with very strong regularization.
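The agreement measure used in the last comparison, the fraction of states where two policies choose the same argmax action, can be sketched as follows. The function name and the two toy policy tables are hypothetical; the tables stand in for the learned one-step RL and CQL policies.

```python
import numpy as np

def argmax_agreement(pi_a, pi_b):
    """Fraction of states where two tabular policies share the same argmax action.

    pi_a, pi_b: (S, A) arrays of action probabilities.
    """
    return float(np.mean(pi_a.argmax(axis=1) == pi_b.argmax(axis=1)))

# Hypothetical policies over three states and five actions.
pi_one_step = np.array([[0.6, 0.1, 0.1, 0.1, 0.1],
                        [0.1, 0.6, 0.1, 0.1, 0.1],
                        [0.1, 0.1, 0.1, 0.1, 0.6]])
pi_cql = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],
                   [0.1, 0.5, 0.2, 0.1, 0.1],
                   [0.1, 0.1, 0.6, 0.1, 0.1]])
score = argmax_agreement(pi_one_step, pi_cql)
```

With five actions, two unrelated policies with independent, uniformly distributed argmax actions would agree on about 20% of states, which matches the chance level quoted in the text.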

5.3. TESTING PREDICTIONS ABOUT EXISTING OFFLINE RL METHODS

Our final set of experiments studies whether our theoretical results can make accurate testable predictions about practically-used regularization methods in a setting where they are commonly used: offline RL benchmarks with continuous states and actions. For these experiments, we will use well-tuned implementations of CQL and one-step RL from Hoffman et al. (2020), using the default hyperparameters without modification. We made one change to the one-step RL implementation to make the comparison more fair: because CQL learns two Q functions and takes the minimum (a trick introduced in Fujimoto et al. (2018)), we applied this same parametrization to the one-step RL implementation. Since offline RL methods can perform differently on datasets of varying quality (Wang et al., 2020; Fujimoto & Gu, 2021; Paine et al., 2020; Wang et al., 2021; Fujimoto et al., 2019), we will repeat our experiments on four datasets from the D4RL benchmark (Fu et al., 2020).

Lower bounds on Q-values. One oft-cited benefit of critic regularization is that it has guarantees about value-estimation (Kumar et al., 2020): under appropriate assumptions, the learned value function will underestimate the discounted expected returns of the policy. Because our analysis shows a connection between one-step RL and critic regularization, it raises the question of whether one-step RL methods have similar value-estimation properties. Taken at face value, this hypothesis seems obvious: the behavioral critic estimates the value of the behavioral policy, so it should underestimate the value of any policy that is better than the behavioral policy. Despite this, the lower bound property of methods like one-step RL is rarely discussed, suggesting that it has yet to be widely appreciated. [...] one-step RL underestimate the actual returns.
In contrast, we observe that critic regularization overestimates the true returns on 2/4 environments, perhaps because the regularization coefficients used to achieve good returns in practice are too weak to guarantee the lower bound property, and perhaps because the theoretical guarantees are only guaranteed to hold at convergence. In total, these experiments confirm our theoretical predictions that one-step RL will result in Q-values that are underestimations, while also questioning the claim that critic regularization methods are always preferable for ensuring underestimation.

Critic regularization causes actor regularization. Our analysis in Sec. 4 not only suggests that one-step RL methods might inherit properties of critic regularization (as studied in the previous section), but also suggests that critic regularization methods may behave like one-step methods. In particular, while critic regularization methods such as CQL do not explicitly regularize their actor, we hypothesize that they implicitly regularize the actor (Lemma 4.2), similar to how one-step RL methods explicitly regularize the actor. We measure the MSE between the action in the dataset and the action predicted by the learned policy. Fig. 7 shows the results. Some CQL implementations, including ours, "warm-start" the actor by applying a behavioral cloning loss for 50,000 iterations; we omit these initial gradient steps from our plots so that any effect is caused solely by the critic regularization. On 4/4 datasets, we observe that the MSE between the CQL policy's actions and the actions in the datasets decreases throughout training. Perhaps the one exception is on the medium-replay dataset, where the MSE eventually starts to increase after 5e5 gradient steps. While directly regularizing the actor leads to MSE values that are ∼3× smaller, this plot nevertheless provides evidence that critic regularization indirectly regularizes the actor.

6. CONCLUSION

In this paper, we drew a connection between two seemingly distinct RL regularization methods: one-step RL and critic regularization. While our analysis made assumptions that are typically violated in practice, it nonetheless made accurate, testable predictions about practical methods with commonly used hyperparameters: critic regularization methods can behave like one-step methods, and vice versa.

REPRODUCIBILITY STATEMENT

We have included full experimental details in Appendix E and proofs in Appendix A.

APPENDICES

In the Appendices, we provide proofs of the theoretical results (Appendix A), extend the analysis to other RL settings (Appendices C-D), and then provide details of the experiments (Appendix E).

A PROOFS

A.1 PROOF OF LEMMA 4.1 Proof sketch. Lemma 4.1 shows that classifier actor critic converges. The key idea of the proof will be to show that the incremental updates for classifier actor critic are exactly the same as the incremental updates for Q-learning. Q-learning converges, so an algorithm that performs the same incremental updates as Q-learning must also converge. Proof. As the cross entropy loss is minimized when the predictions equal the labels, updates for L critic (Q, π) can be written as Q(s,a) Q(s,a)+1 ← y π,Q t (s,a) y π,Q t (s,a)+1 . If the updates are performed by averaging over all possible next states (e.g., in the tabular setting), these updates are equivalent to directly updating Q(s, a) ← y π,Qt (s, a) = r(s, a) + γE p(s ′ |s,a)π(a ′ |s ′ ) [Q t (s ′ , a ′ )], which is the standard policy evaluation update for policy π(a | s). Thus, we can invoke the standard result that policy evaluation converges to Q π (Agarwal et al., 2019, Theorem 1.14.) to argue that updates for L critic likewise converge to Q π . In this proof, the TD targets were the expectation over the next state and next action. If Eq. 4 were optimized using a single-sample estimate of this expectation, y = r(s, a) + γQ t (s ′ , a ′ ), then the updates would be biased: Q(s, a) Q(s, a) + 1 ← E y y + 1 ≤ E[y] E[y] + 1 = y π,Qt (s, a) y π,Qt (s, a) + 1 . In settings with stochastic transitions or policies, these updates would result in estimating a lower bound on Q π (s, a). A.2 PROOF OF LEMMA 4.2 AND THEOREM 4.3 Proof. Our proof proceeds in three steps. First, we derive the update equations for the regularized critic update. That is, if we maintained a table of Q-values, what would the new value for Q(s, a) be? Second, we show that these updates are equivalent to performing policy evaluation on a reparametrized critic Q(s, a) = Q(s, a) π(a|s) β(a|s) . We invoke the standard results for policy evaluation to prove convergence that Q(s, a) convergences. 
Finally, we undo the reparametrization to obtain convergence results for $Q(s, a)$.

Step 0. We start by rearranging the regularized critic objective:

$$\begin{aligned}
\mathcal{L}^r_{\text{critic}}(Q, y^{\pi, Q_t}) &\triangleq \mathcal{L}_{\text{critic}}(Q, y^{\pi, Q_t}) + \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log(Q(s, a) + 1)\right] - \mathbb{E}_{p(s)\beta(a \mid s)}\left[\log(Q(s, a) + 1)\right] \\
&= -\mathbb{E}_{p(s, a)}\left[y^{\pi, Q_t}(s, a)\log\frac{Q(s, a)}{Q(s, a) + 1} + \log\frac{1}{Q(s, a) + 1}\right] + \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log(Q(s, a) + 1)\right] - \mathbb{E}_{p(s)\beta(a \mid s)}\left[\log(Q(s, a) + 1)\right] \\
&= -\mathbb{E}_{p(s, a)}\left[y^{\pi, Q_t}(s, a)\log\frac{Q(s, a)}{Q(s, a) + 1} + \cancel{\log\frac{1}{Q(s, a) + 1}}\right] - \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log\frac{1}{Q(s, a) + 1}\right] + \cancel{\mathbb{E}_{p(s)\beta(a \mid s)}\left[\log\frac{1}{Q(s, a) + 1}\right]} \\
&= -\mathbb{E}_{p(s, a)}\left[y^{\pi, Q_t}(s, a)\log\frac{Q(s, a)}{Q(s, a) + 1}\right] - \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log\frac{1}{Q(s, a) + 1}\right].
\end{aligned}$$

For the cancelation on the third line, we used the fact that $p(s, a) = p(s)\beta(a \mid s)$.

Step 1. To start, note that the regularized critic update is equivalent to a weighted classification loss: positive examples are sampled $(s, a) \sim p(s)\beta(a \mid s)$ and receive weight $\frac{y^{\pi, Q_t}(s, a)}{y^{\pi, Q_t}(s, a) + 1}$, and negative examples are sampled $(s, a) \sim p(s)\pi(a \mid s)$ and receive weight $\frac{1}{y^{\pi, Q_t}(s, a) + 1}$. The Bayes' optimal classifier is given by

$$\frac{Q(s, a)}{Q(s, a) + 1} = \frac{\frac{y^{\pi, Q_t}(s, a)}{y^{\pi, Q_t}(s, a) + 1}p(s)\beta(a \mid s)}{\frac{y^{\pi, Q_t}(s, a)}{y^{\pi, Q_t}(s, a) + 1}p(s)\beta(a \mid s) + \frac{1}{y^{\pi, Q_t}(s, a) + 1}p(s)\pi(a \mid s)} = \frac{y^{\pi, Q_t}(s, a)\beta(a \mid s)}{y^{\pi, Q_t}(s, a)\beta(a \mid s) + \pi(a \mid s)}.$$

Solving for $Q(s, a)$ on the left hand side, the optimal value for $Q(s, a)$ is given by

$$Q(s, a) = y^{\pi, Q_t}(s, a)\frac{\beta(a \mid s)}{\pi(a \mid s)} = \left(r(s, a) + \gamma\mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s')}\left[Q_t(s', a')\right]\right)\frac{\beta(a \mid s)}{\pi(a \mid s)}.$$

This equation tells us what each update for the regularized critic loss does.

Step 2. To analyze these updates, we define $\tilde{Q}(s, a) \triangleq Q_t(s, a)\frac{\pi(a \mid s)}{\beta(a \mid s)}$. Then these updates can be written using $\tilde{Q}(s, a)$ as

$$\tilde{Q}(s, a)\frac{\beta(a \mid s)}{\pi(a \mid s)} = \left(r(s, a) + \gamma\mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s')}\left[\tilde{Q}(s', a')\frac{\beta(a' \mid s')}{\pi(a' \mid s')}\right]\right)\frac{\beta(a \mid s)}{\pi(a \mid s)},$$

which can be simplified to

$$\tilde{Q}(s, a) = r(s, a) + \gamma\mathbb{E}_{p(s' \mid s, a)\beta(a' \mid s')}\left[\tilde{Q}(s', a')\right].$$
Note that the ratio $\frac{\beta(a' \mid s')}{\pi(a' \mid s')}$ inside the expectation acts like an importance weight, so that the expectation over $\pi(a' \mid s')$ becomes an expectation over $\beta(a' \mid s')$. Thus, the regularized critic updates are equivalent to performing policy evaluation on $\tilde{Q}(s, a)$. An immediate consequence is that the regularized critic updates converge, and they converge to $\tilde{Q}^*(s, a) = Q^\beta(s, a)$.

Step 3. Finally, we translate these convergence results for $\tilde{Q}(s, a)$ into convergence results for $Q(s, a)$. Written in terms of the original Q-values, we see that the optimal critic for the regularized critic update is

$$Q^*(s, a) = \tilde{Q}^*(s, a)\frac{\beta(a \mid s)}{\pi(a \mid s)} = Q^\beta(s, a)\frac{\beta(a \mid s)}{\pi(a \mid s)}.$$

This completes the proof of Lemma 4.2. We now prove Theorem 4.3 by applying a logarithm:

Proof.

$$\log Q^*(s, a) = \log\left(Q^\beta(s, a)\frac{\beta(a \mid s)}{\pi(a \mid s)}\right) = \log Q^\beta(s, a) + \log\beta(a \mid s) - \log\pi(a \mid s).$$

We note that our proof does not account for stochastic and function approximation errors. However, if we assume that the TD updates are deterministic (e.g., as they are in deterministic MDPs), then the updates for classifier actor-critic are identical to those of Q-learning (Lemma 4.1). Thus, it immediately inherits any theoretical results regarding the propagation of errors for Q-learning. While Theorem 4.3 shows that one-step RL and critic regularization have the same fixed point, it does not say how many transitions or gradient updates are required to reach those fixed points.
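The fixed point in Lemma 4.2 can be checked numerically on a small tabular MDP. The sketch below is illustrative only: the random MDP, the helper names (`random_dist`, `td_target`, `fixed_point`), and the choice of $\gamma = 0.9$ are assumptions made for this example and are not part of the paper's experiments. It uses exact expectations over next states and actions, iterates the regularized update from Step 1, and compares the result with $Q^\beta(s, a)\,\beta(a \mid s)/\pi(a \mid s)$.

```python
import random


def random_dist(rng, n):
    """Return a strictly positive probability vector of length n."""
    w = [rng.random() + 0.1 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def td_target(Q, P, r, policy, gamma, s, a):
    """Exact TD target r(s,a) + gamma * E_{s'~P(.|s,a), a'~policy(.|s')}[Q(s',a')]."""
    expected_next = sum(
        P[s][a][s2] * sum(policy[s2][a2] * Q[s2][a2] for a2 in range(len(policy[s2])))
        for s2 in range(len(P))
    )
    return r[s][a] + gamma * expected_next


def fixed_point(update, n_states, n_actions, n_iters=500):
    """Iterate a synchronous tabular update Q(s,a) <- update(Q, s, a)."""
    Q = [[1.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        Q = [[update(Q, s, a) for a in range(n_actions)] for s in range(n_states)]
    return Q


# Hypothetical random MDP (assumption: not one of the paper's environments).
rng = random.Random(0)
S, A, gamma = 4, 3, 0.9
P = [[random_dist(rng, S) for _ in range(A)] for _ in range(S)]  # P[s][a][s']
r = [[rng.random() for _ in range(A)] for _ in range(S)]         # rewards in [0, 1)
beta = [random_dist(rng, A) for _ in range(S)]                   # behavioral policy
pi = [random_dist(rng, A) for _ in range(S)]                     # current policy

# Ordinary policy evaluation for beta gives Q^beta.
Q_beta = fixed_point(lambda Q, s, a: td_target(Q, P, r, beta, gamma, s, a), S, A)

# Regularized critic update from Step 1: Q <- y^{pi,Q}(s,a) * beta(a|s) / pi(a|s).
Q_reg = fixed_point(
    lambda Q, s, a: td_target(Q, P, r, pi, gamma, s, a) * beta[s][a] / pi[s][a], S, A
)

max_err = max(
    abs(Q_reg[s][a] - Q_beta[s][a] * beta[s][a] / pi[s][a])
    for s in range(S)
    for a in range(A)
)
print(f"max |Q_reg - Q_beta * beta / pi| = {max_err:.2e}")
```

If the derivation above holds, the printed gap should be at the level of floating-point error. The check says nothing about sampling or function approximation error, which the proof also sets aside.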

A.3 WHY USE THE CROSS-ENTROPY LOSS?

Our proof of Theorem 4.3 helps explain why classifier actor-critic uses the cross entropy loss for the critic loss, rather than the MSE loss. Precisely, our analysis requires that the optimal Q function be a ratio, $\tilde{Q}(s, a) = \frac{Q(s, a)\pi(a \mid s)}{\beta(a \mid s)}$. The cross entropy loss can readily estimate ratios. For example, the optimal classifier for data drawn from $p(x)$ and $q(x)$ is $C(x) = \frac{p(x)}{p(x) + q(x)}$, so the ratio can be expressed as $\frac{C(x)}{1 - C(x)} = \frac{p(x)}{q(x)}$. However, fitting a function $C(x)$ to data drawn from (say) a 1:1 mixture of $p(x)$ and $q(x)$ would result in $C(x) = \frac{1}{2}p(x) + \frac{1}{2}q(x)$, which we cannot transform to express the ratio $\frac{p(x)}{q(x)}$ as a function of $C(x)$.

Our theoretical results suggest that one-step RL and critic regularization should be most similar when critic regularization is applied with a regularization coefficient of λ = 1. To test this hypothesis, we took the task from Fig. 2 (Left) and measured the similarity between one-step RL and critic-regularized classifier actor critic, for varying values of the critic regularization parameter. We measured the similarity of the policies obtained by the two methods by counting the fraction of states where the two methods choose the same (argmax) action. The results, shown in Fig. 8, validate our theoretical prediction that these methods should be most similar with λ = 1.
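As a small numerical illustration of the ratio argument at the start of Appendix A.3, the sketch below fits a classifier with the cross entropy loss and recovers $p(x)/q(x)$ from $C(x)/(1 - C(x))$. The toy distributions `p` and `q`, the helper `fit_classifier`, and the step size are assumptions chosen for this example; the classifier has one free logit per outcome and is trained on the exact expected loss rather than on samples.

```python
import math


def fit_classifier(p, q, lr=0.5, n_steps=5000):
    """Minimize the expected cross-entropy loss for telling samples of p (label 1)
    apart from samples of q (label 0), with one free logit per outcome x."""
    logits = [0.0] * len(p)
    for _ in range(n_steps):
        for x in range(len(p)):
            c = 1.0 / (1.0 + math.exp(-logits[x]))
            # Gradient of -[p(x) log c + q(x) log(1 - c)] with respect to the logit.
            grad = (p[x] + q[x]) * c - p[x]
            logits[x] -= lr * grad
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Hypothetical discrete distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

C = fit_classifier(p, q)
ratio_from_classifier = [c / (1.0 - c) for c in C]
true_ratio = [px / qx for px, qx in zip(p, q)]

for est, ref in zip(ratio_from_classifier, true_ratio):
    print(f"estimated ratio {est:.6f}   true ratio {ref:.6f}")
```

The same transformation, $C/(1 - C)$, is what lets the classifier-style critic represent the ratio that the proof relies on.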

A.5 WHAT ABOUT USING THE POLICY GRADIENT?

Our analysis fundamentally requires using TD learning: the key step is that doing TD backups using one policy is equivalent to doing (modified) TD backups with a different policy. However, the actor updates for both methods could be implemented using a policy gradient or natural gradient, rather than a straight-through gradient estimator. Indeed, much of the work on one-step RL methods (Peng et al., 2019; Siegel et al., 2020) uses an actor update that resembles a policy gradient or natural policy gradient (e.g., 1-step RL with a reverse KL penalty (Brandfonbrener et al., 2021) ).

B VARYING THE REGULARIZATION COEFFICIENT

While our main analysis (Theorem 4.3) showed that actor regularization and critic regularization yield the same policy when these regularizers are applied with a certain strength, in practice the strength of regularization is controlled by a hyperparameter. This hyperparameter raises a question: does the connection between one-step RL and critic regularization hold for different values of this hyperparameter? In this section, we show that there remains a precise connection between actor and critic regularization, even for different values of this hyperparameter. This result suggests that the connection is stronger than initially suggested by the main result, and proving it also helps highlight how many regularization methods can be cast from a similar mold.

B.1 A REGULARIZATION COEFFICIENT.

We start by modifying the actor regularizer and critic regularizer introduced in Sec. 3.4 to include an additional hyperparameter.

Mixture policy. Both the actor and critic losses will make use of a mixture policy, $(1 - \lambda)\pi(a \mid s) + \lambda\beta(a \mid s)$, where $\lambda \in [0, 1]$ will be a hyperparameter. Larger values of λ yield a mixture policy that is closer to the behavioral policy; this will correspond to higher degrees of regularization. Mixtures of policies are commonly used in practice (Kumar et al., 2020, Appendix F; Villaflor et al., 2020, Eq. 11; Finn et al., 2016, Sec. 4.3; Lyu et al., 2022; Hazan et al., 2019, Eq. 2.5), even though they rarely appear in the theoretical offline RL literature. Indeed, because critic regularization resembles a two-player zero-sum game, mixture policies might even be required to find a (Nash) equilibrium of the critic regularizer (Nash, 1951).

λ-weighted critic loss. With this concept of a mixture policy, we define the λ-weighted actor and critic regularizers. For the λ-weighted critic loss, we will change how the TD targets are computed. Instead of sampling the next action from π or β, we will sample the next action from a $\lambda_{\text{TD}}$-weighted combination of these two policies, reminiscent of how prior work has regularized the actions sampled for the TD backup (Fujimoto et al., 2019; Zhou et al., 2020):

$$y^{\lambda_{\text{TD}}}(s, a) \triangleq y^{(1 - \lambda_{\text{TD}})\pi + \lambda_{\text{TD}}\beta}(s, a) = r(s, a) + \gamma\mathbb{E}_{p(s' \mid s, a),\, a' \sim (1 - \lambda_{\text{TD}})\pi(\cdot \mid s') + \lambda_{\text{TD}}\beta(\cdot \mid s')}\left[Q(s', a')\right].$$

When introducing one-step RL in Sec. 3.4, we used $\lambda_{\text{TD}} = 1$. Using this TD target, the λ-weighted critic loss can now be written as a combination of the unregularized objective (Eq. 4) plus the regularized objective (Eq. 7):

$$\begin{aligned}
\mathcal{L}^r_{\text{critic}}(Q, \lambda_{\text{critic}}) \triangleq{}& (1 - \lambda_{\text{critic}})\left(-\mathbb{E}_{p(s, a)}\left[\frac{y^{\lambda_{\text{TD}}}(s, a)}{y^{\lambda_{\text{TD}}}(s, a) + 1}\log\frac{Q(s, a)}{Q(s, a) + 1} + \frac{1}{y^{\lambda_{\text{TD}}}(s, a) + 1}\log\frac{1}{Q(s, a) + 1}\right]\right) \\
&+ \lambda_{\text{critic}}\left(-\mathbb{E}_{p(s, a),\, a^- \sim \pi(\cdot \mid s)}\left[\frac{y^{\lambda_{\text{TD}}}(s, a)}{y^{\lambda_{\text{TD}}}(s, a) + 1}\log\frac{Q(s, a)}{Q(s, a) + 1} + \frac{1}{y^{\lambda_{\text{TD}}}(s, a) + 1}\log\frac{1}{Q(s, a^-) + 1}\right]\right) \\
={}& -\mathbb{E}_{p(s, a),\, a^- \sim (1 - \lambda_{\text{critic}})\pi(\cdot \mid s) + \lambda_{\text{critic}}\beta(\cdot \mid s)}\left[\frac{y^{\lambda_{\text{TD}}}(s, a)}{y^{\lambda_{\text{TD}}}(s, a) + 1}\log\frac{Q(s, a)}{Q(s, a) + 1} + \frac{1}{y^{\lambda_{\text{TD}}}(s, a) + 1}\log\frac{1}{Q(s, a^-) + 1}\right]. \qquad (12)
\end{aligned}$$

The second line rewrites this objective: the first term looks the same as the original "positive" term in the critic objective, while the "negative" term uses actions sampled from a mixture of the current policy and the behavioral policy. When $\lambda_{\text{critic}} = 1$, we recover the regularized critic loss introduced in Sec. 3.4.

λ-weighted actor loss. Finally, the strength of the actor regularizer can be controlled by changing the reverse KL penalty. While it may seem like changing the reward scale would vary the strength of the actor loss, this is not the case for classifier actor critic because of the log(•) in the actor loss. Instead, we will relax the reverse KL penalty between the learned policy $\pi(a \mid s)$ and the behavioral policy $\beta(a \mid s)$ so that only the mixture policy needs to be close to the behavioral policy:

$$\mathcal{L}^r_{\text{actor}}(\pi, \lambda_{\text{KL}}) \triangleq \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q(s, a) + \log\beta(a \mid s) - \log\left((1 - \lambda_{\text{KL}})\pi(a \mid s) + \lambda_{\text{KL}}\beta(a \mid s)\right)\right]. \qquad (13)$$

As indicated on the second line, replacing β(a | s) with the mixture policy has an effect similar to that of decreasing the weight applied to the KL penalty. The approximation on the second line is determined by the Jensen Gap (Abramovich & Persson, 2016; Gao et al., 2017). When introducing one-step RL in Sec. 3.4, we used $\lambda_{\text{KL}} = 1$, together with $\lambda_{\text{TD}} = 1$.

In summary, the strength of the actor and critic regularizers can be controlled through additional hyperparameters ($\lambda_{\text{critic}}$, $\lambda_{\text{TD}}$, $\lambda_{\text{KL}}$).
Indeed, it is typical for offline RL methods to require many hyperparameters (Brandfonbrener et al., 2021; Lu et al., 2021; Paine et al., 2020; Wu et al., 2019) , and performance is sensitive to their settings. However, the close connection that we have shown between actor and critic regularizers allows us to decrease the number of hyperparameters.
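To make the mixture policy and the $\lambda_{\text{TD}}$-weighted TD target from this section concrete, here is a minimal tabular sketch. The helper names (`mixture_policy`, `td_target_mixture`) and all numbers are made up for illustration; the code only assumes the definitions given above, with exact expectations over the next state and next action.

```python
def mixture_policy(pi_s, beta_s, lam):
    """(1 - lam) * pi(.|s) + lam * beta(.|s) for one state's action probabilities."""
    return [(1.0 - lam) * p + lam * b for p, b in zip(pi_s, beta_s)]


def td_target_mixture(Q, P_sa, r_sa, pi, beta, lam_td, gamma):
    """r(s,a) + gamma * E_{s'~p(.|s,a), a'~mixture(.|s')}[Q(s',a')]."""
    expected_next = 0.0
    for s2, p_next in enumerate(P_sa):
        mix = mixture_policy(pi[s2], beta[s2], lam_td)
        expected_next += p_next * sum(m * q for m, q in zip(mix, Q[s2]))
    return r_sa + gamma * expected_next


# Toy two-state, two-action example (hypothetical values).
Q = [[1.0, 2.0], [3.0, 5.0]]
pi = [[1.0, 0.0], [0.0, 1.0]]     # learned policy
beta = [[0.5, 0.5], [0.5, 0.5]]   # behavioral policy
P_sa = [0.0, 1.0]                 # from the chosen (s, a), the next state is always state 1
r_sa, gamma = 1.0, 0.5

y_pi = td_target_mixture(Q, P_sa, r_sa, pi, beta, lam_td=0.0, gamma=gamma)    # backup under pi
y_beta = td_target_mixture(Q, P_sa, r_sa, pi, beta, lam_td=1.0, gamma=gamma)  # backup under beta
y_half = td_target_mixture(Q, P_sa, r_sa, pi, beta, lam_td=0.5, gamma=gamma)
print(y_pi, y_beta, y_half)
```

Because the target is linear in the next-action distribution, intermediate values of `lam_td` interpolate linearly between the backup under π and the backup under β.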

B.2 ANALYSIS

In our main result (Thm. 4.3), we showed that one-step RL and critic regularization are equivalent when $\lambda_{\text{critic}} = \lambda_{\text{TD}} = \lambda_{\text{KL}} = 1$. This is a large value for the regularization strength, and we now consider what happens for smaller degrees of regularization: is there still a connection between one-step RL and critic regularization? The following theorem proves that this is the case. In particular, applying critic regularization with coefficient $\lambda_{\text{critic}}$ yields the same policy as applying one-step RL with $\lambda_{\text{TD}} = \lambda_{\text{KL}} = \lambda_{\text{critic}}$. That is, there is a very simple recipe for converting the hyperparameters for critic regularization into the hyperparameters for one-step RL.

Theorem B.1. Let policy $\pi(a \mid s)$ be given, let $Q^\beta(s, a)$ be the Q-function of the behavioral policy, and let $Q^{\lambda_{\text{TD}}}_r(s, a, \lambda_{\text{critic}})$ be the critic obtained by the $\lambda_{\text{critic}}$-weighted regularized critic update (Eq. 12) using TD targets $y^{\lambda_{\text{TD}}}(s, a)$. If $\lambda_{\text{critic}} = \lambda_{\text{TD}} = \lambda_{\text{KL}}$, then the $\lambda_{\text{KL}}$-weighted actor loss (Eq. 13) is equivalent to the un-regularized policy objective using the regularized critic:

$$\mathbb{E}_{\pi(a \mid s)}\left[\log Q^\beta(s, a) + \log\beta(a \mid s) - \log\left((1 - \lambda_{\text{KL}})\pi(a \mid s) + \lambda_{\text{KL}}\beta(a \mid s)\right)\right] = \mathbb{E}_{\pi(a \mid s)}\left[\log Q^{\lambda_{\text{TD}}}_r(s, a, \lambda_{\text{critic}})\right]$$

for all states $s$.

While we used the cross entropy loss for this result, it turns out that the result also holds for the more standard MSE loss (we omit the proof for brevity).

Limitations. Before presenting the proof in Sec. B.3, we discuss a few limitations of this result. Like the rest of the analysis in this paper, the form of the critic regularizer is different from that often used in practice. Additionally, our analysis ignores many sources of errors (e.g., sampling, function approximation), and assumes that each objective is optimized exactly.

B.3 PROOF OF THEOREM B.1

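Proof. We start by defining the fixed point of the λ-weighted regularized critic loss. Like in the single-task setting, this loss resembles a weighted classification problem, so we can write down the Bayes' optimal classifier as

$$\frac{Q(s, a)}{Q(s, a) + 1} = \frac{\frac{y^{\lambda_{\text{TD}}}(s, a)}{y^{\lambda_{\text{TD}}}(s, a) + 1}p(s)\beta(a \mid s)}{\frac{y^{\lambda_{\text{TD}}}(s, a)}{y^{\lambda_{\text{TD}}}(s, a) + 1}p(s)\beta(a \mid s) + \frac{1}{y^{\lambda_{\text{TD}}}(s, a) + 1}p(s)\left((1 - \lambda_{\text{critic}})\pi(a \mid s) + \lambda_{\text{critic}}\beta(a \mid s)\right)} = \frac{y^{\lambda_{\text{TD}}}(s, a)\beta(a \mid s)}{y^{\lambda_{\text{TD}}}(s, a)\beta(a \mid s) + (1 - \lambda_{\text{critic}})\pi(a \mid s) + \lambda_{\text{critic}}\beta(a \mid s)}.$$

Solving for $Q(s, a)$ on the left hand side, the optimal value for $Q(s, a)$ is given by

$$Q(s, a) = y^{\lambda_{\text{TD}}}(s, a)\frac{\beta(a \mid s)}{(1 - \lambda_{\text{critic}})\pi(a \mid s) + \lambda_{\text{critic}}\beta(a \mid s)} = \left(r(s, a) + \gamma\mathbb{E}_{p(s' \mid s, a),\, a' \sim (1 - \lambda_{\text{TD}})\pi(\cdot \mid s') + \lambda_{\text{TD}}\beta(\cdot \mid s')}\left[Q(s', a')\right]\right)\frac{\beta(a \mid s)}{(1 - \lambda_{\text{critic}})\pi(a \mid s) + \lambda_{\text{critic}}\beta(a \mid s)}. \qquad (14)$$

Note that the next action $a'$ is sampled from a mixture policy defined by $\lambda_{\text{TD}}$. This equation tells us what each update for the λ-weighted regularized critic loss does.

To analyze these updates, we define

$$\tilde{Q}(s, a) \triangleq Q(s, a)\frac{(1 - \lambda_{\text{critic}})\pi(a \mid s) + \lambda_{\text{critic}}\beta(a \mid s)}{\beta(a \mid s)}.$$

Like before, the ratio $\frac{\beta(a' \mid s')}{(1 - \lambda_{\text{TD}})\pi(a' \mid s') + \lambda_{\text{TD}}\beta(a' \mid s')}$ can act like an importance weight. When $\lambda_{\text{TD}} = \lambda_{\text{critic}}$, this importance weight cancels with the sampling distribution, providing the following identity:

$$\begin{aligned}
\mathbb{E}_{p(s' \mid s, a),\, a' \sim (1 - \lambda_{\text{TD}})\pi(\cdot \mid s') + \lambda_{\text{TD}}\beta(\cdot \mid s')}\left[Q(s', a')\right] &= \mathbb{E}_{p(s' \mid s, a),\, a' \sim (1 - \lambda_{\text{TD}})\pi(\cdot \mid s') + \lambda_{\text{TD}}\beta(\cdot \mid s')}\left[\tilde{Q}(s', a')\frac{\beta(a' \mid s')}{(1 - \lambda_{\text{critic}})\pi(a' \mid s') + \lambda_{\text{critic}}\beta(a' \mid s')}\right] \\
&= \mathbb{E}_{p(s' \mid s, a),\, a' \sim \beta(\cdot \mid s')}\left[\tilde{Q}(s', a')\right].
\end{aligned}$$

Substituting this identity in Eq. 14, we can write the updates using $\tilde{Q}(s, a)$:

$$\tilde{Q}(s, a)\frac{\beta(a \mid s)}{(1 - \lambda_{\text{critic}})\pi(a \mid s) + \lambda_{\text{critic}}\beta(a \mid s)} = \left(r(s, a) + \gamma\mathbb{E}_{p(s' \mid s, a),\, a' \sim \beta(\cdot \mid s')}\left[\tilde{Q}(s', a')\right]\right)\frac{\beta(a \mid s)}{(1 - \lambda_{\text{critic}})\pi(a \mid s) + \lambda_{\text{critic}}\beta(a \mid s)},$$

which can be simplified to

$$\tilde{Q}(s, a) = r(s, a) + \gamma\mathbb{E}_{p(s' \mid s, a),\, a' \sim \beta(\cdot \mid s')}\left[\tilde{Q}(s', a')\right].$$

This is the policy evaluation update for the behavioral policy β, so $\tilde{Q}(s, a)$ converges to $Q^\beta(s, a)$. We then translate these convergence results for $\tilde{Q}(s, a)$ into convergence results for $Q(s, a)$.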
Written in terms of the original Q-values, we see that the optimal critic for the regularized critic update is

$$Q^*(s, a) = Q^\beta(s, a)\frac{\beta(a \mid s)}{(1 - \lambda_{\text{critic}})\pi(a \mid s) + \lambda_{\text{critic}}\beta(a \mid s)}.$$

Note that this holds for any value of $\lambda_{\text{critic}} = \lambda_{\text{TD}} \in [0, 1]$. This result suggests that two common forms of regularization, decreasing the values predicted at unseen actions and regularizing the actions used in the TD backup, can produce the same effect: a critic that estimates the Q-values of the behavioral policy (multiplied by some importance weight). Finally, substituting this Q-function into the un-regularized actor loss, we see that the result is equivalent to the λ-weighted actor loss:

$$\mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q^*(s, a)\right] = \mathbb{E}_{p(s)\pi(a \mid s)}\Big[\log Q^\beta(s, a) + \underbrace{\log\beta(a \mid s) - \log\left((1 - \lambda_{\text{KL}})\pi(a \mid s) + \lambda_{\text{KL}}\beta(a \mid s)\right)}_{\lambda\text{-weighted actor regularizer}}\Big].$$
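The fixed point derived in this proof can be checked for several values of the coefficient at once. The following sketch is a hypothetical tabular experiment, not one reported in the paper: the random MDP, the helper names (`random_dist`, `iterate_backup`), and the grid of λ values are assumptions. It sets $\lambda_{\text{TD}} = \lambda_{\text{critic}} = \lambda$, iterates the update in Eq. 14 with exact expectations, and compares the result with $Q^\beta(s, a)\,\beta(a \mid s) / ((1 - \lambda)\pi(a \mid s) + \lambda\beta(a \mid s))$.

```python
import random


def random_dist(rng, n):
    """Return a strictly positive probability vector of length n."""
    w = [rng.random() + 0.1 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def iterate_backup(P, r, backup, weight, gamma, n_iters=500):
    """Iterate Q(s,a) <- weight[s][a] * (r[s][a] + gamma * E_{s'~P, a'~backup}[Q(s',a')])."""
    S, A = len(r), len(r[0])
    Q = [[1.0] * A for _ in range(S)]
    for _ in range(n_iters):
        # Expected value of the next state under the backup policy.
        V = [sum(backup[s][a] * Q[s][a] for a in range(A)) for s in range(S)]
        Q = [
            [weight[s][a] * (r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(S)))
             for a in range(A)]
            for s in range(S)
        ]
    return Q


# Hypothetical random MDP and policies.
rng = random.Random(1)
S, A, gamma = 4, 3, 0.9
P = [[random_dist(rng, S) for _ in range(A)] for _ in range(S)]  # P[s][a][s']
r = [[rng.random() for _ in range(A)] for _ in range(S)]
beta = [random_dist(rng, A) for _ in range(S)]
pi = [random_dist(rng, A) for _ in range(S)]

ones = [[1.0] * A for _ in range(S)]
Q_beta = iterate_backup(P, r, beta, ones, gamma)  # plain policy evaluation for beta

errors = {}
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    # Same mixture for the TD backup and for the critic regularizer.
    mix = [[(1 - lam) * pi[s][a] + lam * beta[s][a] for a in range(A)] for s in range(S)]
    weight = [[beta[s][a] / mix[s][a] for a in range(A)] for s in range(S)]
    Q_reg = iterate_backup(P, r, mix, weight, gamma)
    errors[lam] = max(
        abs(Q_reg[s][a] - Q_beta[s][a] * weight[s][a]) for s in range(S) for a in range(A)
    )

for lam, err in errors.items():
    print(f"lambda = {lam:.2f}: max deviation from predicted fixed point = {err:.2e}")
```

The check deliberately uses one shared λ. The cancellation in the proof depends on the two mixtures matching, so the sketch does not speak to the case $\lambda_{\text{TD}} \neq \lambda_{\text{critic}}$.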

C REGULARIZATION FOR GOAL-CONDITIONED PROBLEMS

Like single-task RL problems, goal-conditioned RL problems have also been approached with both one-step methods (Ghosh et al., 2020; Ding et al., 2019; Sun et al., 2019) and critic regularization (Chebotar et al., 2021). In these problems, the aim is to learn a goal-conditioned policy $\pi(a \mid s, s_g)$ that maximizes the expected discounted sum of goal-conditioned rewards $r_g(s, a)$, where goals are sampled $s_g \sim p_g(s_g)$:

$$\max_\pi \mathbb{E}_{p_g(s_g)}\mathbb{E}_{\pi(\tau \mid s_g)}\left[\sum_{t=0}^{\infty}\gamma^t r_g(s_t, a_t)\right].$$

We will use the goal-conditioned reward function $r_g(s, a) = p(s' = s_g \mid s, a)$, which is defined in terms of the environment dynamics. In settings with discrete states, maximizing this reward function is equivalent to maximizing the sparse indicator reward function ($r_g(s, a) = \mathbb{1}(s_g = s)$).

In this section, we show that one-step RL and critic regularization are equivalent for a certain goal-conditioned actor-critic method. Unlike our analysis in the single-task setting, this analysis here uses an existing method, C-learning (Eysenbach et al., 2020b). C-learning is a TD method that already makes use of the cross entropy loss for training the critic:

$$\max_Q\; (1 - \gamma)\mathbb{E}_{p(s, a, s')}\left[\log\frac{Q(s, a, s_g = s')}{Q(s, a, s_g = s') + 1}\right] + \gamma\mathbb{E}_{p(s, a)p_g(s_g)}\left[y^{\pi, Q_t}(s, a, s_g)\log\frac{Q(s, a, s_g)}{Q(s, a, s_g) + 1}\right] + \mathbb{E}_{p(s, a)p_g(s_g)}\left[\log\frac{1}{Q(s, a, s_g) + 1}\right],$$

where $y^{\pi, Q_t}(s, a, s_g) = \mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s', s_g)}\left[Q_t(s', a', s_g)\right]$ serves the role of the TD target. The first two terms increase the Q-values while the last term decreases the Q-values. The actor is updated to maximize the Q-values. While this objective for the actor can be written in many ways, we will write it as maximizing a log ratio because it will allow us to draw a precise equivalence between actor and critic regularization:

$$\max_\pi \mathbb{E}_{p_g(s_g)p(s)\pi(a \mid s, s_g)}\left[\log Q(s, a, s_g)\right].$$

We will now consider variants of C-learning that incorporate actor and critic regularization.

One-step RL.
We will consider a variant of C-learning that resembles one-step RL (Brandfonbrener et al., 2021). The critic update will be similar to before, but the next-actions sampled for the TD updates will be sampled from the marginal behavioral policy:

$$\max_Q\; (1 - \gamma)\mathbb{E}_{p(s, a, s')}\left[\log\frac{Q(s, a, s_g = s')}{Q(s, a, s_g = s') + 1}\right] + \gamma\mathbb{E}_{p(s, a)p_g(s_g)}\left[y^{\beta, Q_t}(s, a, s_g)\log\frac{Q(s, a, s_g)}{Q(s, a, s_g) + 1}\right] + \mathbb{E}_{p(s, a)p_g(s_g)}\left[\log\frac{1}{Q(s, a, s_g) + 1}\right],$$

where $y^{\beta, Q_t}(s, a, s_g) = \mathbb{E}_{p(s' \mid s, a)\beta(a' \mid s')}\left[Q_t(s', a', s_g)\right]$. The actor update will be modified to include a reverse KL divergence:

$$\max_\pi \mathbb{E}_{p(s)p_g(s_g)\pi(a \mid s, s_g)}\left[\log Q(s, a, s_g) + \log\beta(a \mid s) - \log\pi(a \mid s, s_g)\right].$$

Note that we are regularizing the policy to be similar to the average behavioral policy, $\beta(a \mid s)$. Compared to regularization towards a goal-conditioned behavioral policy $\beta(a \mid s, s_g)$, this choice gives the policy additional flexibility: when trying to reach goal $s_g$, it is allowed to take actions that were not taken by $\beta(a \mid s, s_g)$, as long as they were taken by the behavioral policy when trying to reach some other goal $s'_g$.

Critic regularization. To regularize the critic, we will modify the "negative" term in the C-learning objective to use actions sampled from the policy:

$$\begin{aligned}
\max_Q\; & (1 - \gamma)\mathbb{E}_{p(s, a, s')}\left[\log\frac{Q(s, a, s_g = s')}{Q(s, a, s_g = s') + 1}\right] + \gamma\mathbb{E}_{p(s, a)p_g(s_g)}\left[y^{\pi, Q_t}(s, a, s_g)\log\frac{Q(s, a, s_g)}{Q(s, a, s_g) + 1}\right] \qquad (18) \\
& + \mathbb{E}_{p(s)p_g(s_g),\, a \sim \pi(\cdot \mid s, s_g)}\left[\log\frac{1}{Q(s, a, s_g) + 1}\right]. \qquad (19)
\end{aligned}$$

C.1 ANALYSIS FOR GOAL-CONDITIONED PROBLEMS

Like in the single-task setting, these two forms of regularization yield the same fixed points:

Theorem C.1. Let policy $\pi(a \mid s, s_g)$ be given, let $Q^\beta(s, a, s_g)$ be the Q-values for the marginal behavioral policy $\beta(a \mid s)$, and let $Q^\pi_r(s, a, s_g)$ be the critic obtained by the regularized critic update (Eq. 19). Then performing regularized policy updates (Eq. 16) using the behavioral critic is equivalent to the un-regularized policy objective using the regularized critic:

$$\mathbb{E}_{\pi(a \mid s, s_g)}\left[\log Q^\beta(s, a, s_g) + \log\beta(a \mid s) - \log\pi(a \mid s, s_g)\right] = \mathbb{E}_{\pi(a \mid s, s_g)}\left[\log Q^\pi_r(s, a, s_g)\right]$$

for all states $s$ and goals $s_g$.

Proof. We start by determining the fixed point of critic-regularized C-learning. Like in the single-task setting, the C-learning objective resembles a weighted-classification problem, so we can write down the Bayes' optimal classifier as

$$\frac{Q(s, a, s_g)}{Q(s, a, s_g) + 1} = \frac{\left((1 - \gamma)p(s' = s_g \mid s, a) + \gamma p(s = s_g)y(s', s_g)\right)\beta(a \mid s)}{\left((1 - \gamma)p(s' = s_g \mid s, a) + \gamma p(s = s_g)y(s', s_g)\right)\beta(a \mid s) + p(s_g)\pi(a \mid s, s_g)}.$$

Solving for $Q(s, a, s_g)$ on the left hand side, the optimal value for $Q(s, a, s_g)$ is given by

$$Q(s, a, s_g) = \left((1 - \gamma)p(s' = s_g \mid s, a) + \gamma p(s = s_g)y(s', s_g)\right)\frac{\beta(a \mid s)}{\pi(a \mid s, s_g)}.$$

This tells us what each critic-regularized C-learning update does.

To analyze these updates, we define $\tilde{Q}(s, a, s_g) \triangleq Q(s, a, s_g)\frac{\pi(a \mid s, s_g)}{\beta(a \mid s)}$. Then these updates can be written using $\tilde{Q}(s, a, s_g)$ as

$$\tilde{Q}(s, a, s_g)\frac{\beta(a \mid s)}{\pi(a \mid s, s_g)} = \left((1 - \gamma)p(s' = s_g \mid s, a) + \gamma\mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s', s_g)}\left[\tilde{Q}(s', a', s_g)\frac{\beta(a' \mid s')}{\pi(a' \mid s', s_g)}\right]\right)\frac{\beta(a \mid s)}{\pi(a \mid s, s_g)}.$$

These updates can be simplified to

$$\tilde{Q}(s, a, s_g) = (1 - \gamma)p(s' = s_g \mid s, a) + \gamma\mathbb{E}_{p(s' \mid s, a)\beta(a' \mid s')}\left[\tilde{Q}(s', a', s_g)\right].$$

Like before, the ratio $\frac{\beta(a' \mid s')}{\pi(a' \mid s', s_g)}$ inside the expectation acts like an importance weight. Thus, the regularized critic updates are equivalent to performing policy evaluation on $\tilde{Q}(s, a, s_g)$. Note that this is estimating the probability that the average behavioral policy $\beta(a \mid s)$ reaches goal $s_g$; this is not the probability that a goal-directed behavioral policy $\beta(a \mid s, s_g)$ reaches the goal. Finally, we translate these convergence results for $\tilde{Q}(s, a, s_g)$ into convergence results for $Q(s, a, s_g)$. Written in terms of the original Q-values, we see that the optimal critic for the regularized critic update is

$$Q^*(s, a, s_g) = \tilde{Q}^*(s, a, s_g)\frac{\beta(a \mid s)}{\pi(a \mid s, s_g)} = Q^{\beta(\cdot \mid \cdot)}(s, a, s_g)\frac{\beta(a \mid s)}{\pi(a \mid s, s_g)}.$$

Thus, critic regularization implicitly regularizes the actor objective so that it is the same objective as one-step RL:

$$\mathbb{E}_{p(s),\, s_g \sim p(s),\, \pi(a \mid s, s_g)}\left[\log Q^*(s, a, s_g)\right] = \mathbb{E}_{p(s),\, s_g \sim p(s),\, \pi(a \mid s, s_g)}\left[\log Q^{\beta(\cdot \mid \cdot)}(s, a, s_g) + \log\beta(a \mid s) - \log\pi(a \mid s, s_g)\right].$$

D REGULARIZATION FOR EXAMPLE-BASED CONTROL PROBLEMS

While specifying tasks in terms of reward functions is standard for MDPs, it can be difficult for real-world applications of RL. So, prior work has looked at specifying tasks by goal states (as in the previous section) or sets of states representing good outcomes (Pinto & Gupta, 2016; Tung et al., 2018; Fu et al., 2018). In addition to requiring more flexible and user-friendly forms of task specification, these algorithms targeted at real-world applications often demand regularization. In the same way that prior goal-conditioned RL algorithms have employed critic regularization, so too have prior example-based control algorithms (Singh et al., 2019; Hatch et al., 2022).

In this section, we extend our analysis to regularization of an example-based control algorithm. Again, we will show that a certain form of critic regularization is equivalent to regularizing the actor. We first define the problem of example-based control (Fu et al., 2018). In these problems, the agent is given a small collection of states $s \sim p_e(s)$, which are examples of successful outcomes. The aim is to learn a policy $\pi(a \mid s)$ that maximizes the probability of reaching a success state:

$$\max_\pi \mathbb{E}_{p(s_g)}\mathbb{E}_{\pi(\tau \mid s_g)}\left[\sum_{t=0}^{\infty}\gamma^t p_e(s_t)\right].$$

Note that this objective function is exactly equivalent to a reward-maximization problem, with a reward function $r(s_t, a_t) = p_e(s_t)$.

In this section, we show that one-step RL and critic regularization are equivalent for a certain example-based control algorithm. Unlike our analysis in the single-task setting, this analysis here uses an existing method, RCE (Eysenbach et al., 2021). RCE is a TD method that already makes use of the cross entropy loss for training the critic:

$$\max_Q\; (1 - \gamma)\mathbb{E}_{p_e(s)\beta(a \mid s)}\left[\log\frac{Q(s, a)}{Q(s, a) + 1}\right] + \mathbb{E}_{p(s, a)}\left[\gamma y^{\pi, Q_t}(s, a)\log\frac{Q(s, a)}{Q(s, a) + 1} + \log\frac{1}{Q(s, a) + 1}\right],$$

where $y^{\pi, Q_t}(s, a) = \mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s')}\left[Q(s', a')\right]$ serves the role of the TD target.
The first two terms increase the Q-values while the last term decreases the Q-values. The actor is updated to maximize the Q-values. While this objective for the actor can be written in many ways, we will write it as maximizing a log ratio because it will allow us to draw a precise equivalence between actor and critic regularization:

$$\max_\pi \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q(s, a)\right].$$

We will now consider variants of RCE that incorporate actor and critic regularization.

One-step RL. We will consider a variant of RCE that resembles one-step RL (Brandfonbrener et al., 2021). The critic update will be similar to before, but the next-actions sampled for the TD updates will be sampled from the behavioral policy:

$$\max_Q\; (1 - \gamma)\mathbb{E}_{p_e(s)\beta(a \mid s)}\left[\log\frac{Q(s, a)}{Q(s, a) + 1}\right] + \mathbb{E}_{p(s, a)}\left[\gamma y^{\beta, Q_t}(s, a)\log\frac{Q(s, a)}{Q(s, a) + 1} + \log\frac{1}{Q(s, a) + 1}\right],$$

where $y^{\beta, Q_t}(s, a) = \mathbb{E}_{p(s' \mid s, a)\beta(a' \mid s')}\left[Q(s', a')\right]$. The actor update will be modified to include a reverse KL divergence:

$$\max_\pi \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q(s, a) + \log\beta(a \mid s) - \log\pi(a \mid s)\right].$$

Critic regularization. To regularize the critic, we will modify the "negative" term in the RCE objective to use actions sampled from the policy:

$$\max_Q\; (1 - \gamma)\mathbb{E}_{p_e(s)\beta(a \mid s)}\left[\log\frac{Q(s, a)}{Q(s, a) + 1}\right] + \mathbb{E}_{p(s, a),\, a^- \sim \pi(\cdot \mid s)}\left[\gamma y^{\pi, Q_t}(s, a)\log\frac{Q(s, a)}{Q(s, a) + 1} + \log\frac{1}{Q(s, a^-) + 1}\right].$$

D.1 ANALYSIS FOR EXAMPLE-BASED CONTROL PROBLEMS

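Like in the single-task setting, these two forms of regularization yield the same fixed points:

Theorem D.1. Let policy $\pi(a \mid s)$ be given, let $Q^\beta(s, a)$ be the Q-values for the behavioral policy $\beta(a \mid s)$, and let $Q^\pi_r(s, a)$ be the critic obtained by the regularized critic update (Eq. 21). Then performing regularized policy updates (Eq. 20) using the behavioral critic is equivalent to the un-regularized policy objective using the regularized critic:

$$\mathbb{E}_{\pi(a \mid s)}\left[\log Q^\beta(s, a) + \log\beta(a \mid s) - \log\pi(a \mid s)\right] = \mathbb{E}_{\pi(a \mid s)}\left[\log Q^\pi_r(s, a)\right]$$

for all states $s$.

Proof. We start by determining the fixed point of critic-regularized RCE. Like in the single-task setting, the RCE objective resembles a weighted-classification problem, so we can write down the Bayes' optimal classifier as

$$\frac{Q(s, a)}{Q(s, a) + 1} = \frac{\left((1 - \gamma)p_e(s) + \gamma y^{\pi, Q_t}(s, a)\right)\beta(a \mid s)}{\left((1 - \gamma)p_e(s) + \gamma y^{\pi, Q_t}(s, a)\right)\beta(a \mid s) + \pi(a \mid s)}.$$

Solving for $Q(s, a)$ on the left hand side, the optimal value for $Q(s, a)$ is given by

$$Q(s, a) = \left((1 - \gamma)p_e(s) + \gamma y^{\pi, Q_t}(s, a)\right)\frac{\beta(a \mid s)}{\pi(a \mid s)}.$$

This tells us what each critic-regularized RCE update does.

To analyze these updates, we define $\tilde{Q}(s, a) \triangleq Q(s, a)\frac{\pi(a \mid s)}{\beta(a \mid s)}$. Then these updates can be written using $\tilde{Q}(s, a)$ as

$$\tilde{Q}(s, a)\frac{\beta(a \mid s)}{\pi(a \mid s)} = \left((1 - \gamma)p_e(s) + \gamma\mathbb{E}_{p(s' \mid s, a)\pi(a' \mid s')}\left[\tilde{Q}(s', a')\frac{\beta(a' \mid s')}{\pi(a' \mid s')}\right]\right)\frac{\beta(a \mid s)}{\pi(a \mid s)}.$$

These updates can be simplified to

$$\tilde{Q}(s, a) = (1 - \gamma)p_e(s) + \gamma\mathbb{E}_{p(s' \mid s, a)\beta(a' \mid s')}\left[\tilde{Q}(s', a')\right].$$

Like before, the ratio $\frac{\beta(a' \mid s')}{\pi(a' \mid s')}$ inside the expectation acts like an importance weight. Thus, the regularized critic updates are equivalent to performing policy evaluation on $\tilde{Q}(s, a)$. Finally, we translate these convergence results for $\tilde{Q}(s, a)$ into convergence results for $Q(s, a)$. Written in terms of the original Q-values, we see that the optimal critic for the regularized critic update is

$$Q^*(s, a) = \tilde{Q}^*(s, a)\frac{\beta(a \mid s)}{\pi(a \mid s)} = Q^\beta(s, a)\frac{\beta(a \mid s)}{\pi(a \mid s)}.$$

Thus, critic regularization implicitly regularizes the actor objective so that it is the same objective as one-step RL:

$$\mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q^*(s, a)\right] = \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q^\beta(s, a) + \log\beta(a \mid s) - \log\pi(a \mid s)\right].$$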

E EXPERIMENTAL DETAILS

E.1 TABULAR EXPERIMENTS

Implementing critic regularization for classifier actor critic. The objective for critic regularization in classifier actor critic (Eq. 7) is nontrivial to optimize because of the cyclic dependency between the policy and the critic: simply alternating between optimizing the actor and the critic does not converge. In our experiments, we update the critic using an exponential moving average of the policy, as proposed in Wen et al. (2021). We found that this decision was sufficient for ensuring convergence. When applying CQL in the tabular setting (Figures 3 and 4), we did not do this because soft value iteration represents the policy implicitly in terms of the value function.

Fig. 2 (left). The initial state and goal state are located in opposite corners. The reward function is +1 for reaching the goal and 0 otherwise. We use a dataset of 20 trajectories, 50 steps each, collected by a random policy. We use γ = 0.95 and train for 20k full-batch updates, using a learning rate of 1e-2. The Q table is randomly initialized using a standard normal distribution.

Fig. 2 (center). The initial state and goal state are located in adjacent corners. The goal state has a reward of +3.5, the states between the initial state and goal state have a reward of +1, and all other states (including the initial state) have a reward of +2. We use a dataset of 20 trajectories, 50 steps each, collected by a random policy. We use γ = 0.95 and train for 20k full-batch updates, using a learning rate of 1e-2. The Q table is randomly initialized using a standard normal distribution.

Fig. 2 (right). The initial state and goal state are located in adjacent corners. The reward is +0.01 at the goal state and 0 otherwise. We use a dataset of 1 trajectory with 10 steps, which traces the following path: [(0, 0), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (0, 4), (0, 4), (0, 4), (0, 4)]. We use γ = 0.95 and train for 10k full-batch updates, using a learning rate of 1e-2.
The Q table is randomly initialized using a standard normal distribution.

Fig. 3. There is a bad state (reward of -10) next to the optimal state (reward of +1), so the behavioral policy navigates away from the optimal state. We generate 10 trajectories of length 100 from a uniform random policy. We use γ = 0.95 and train each method for 10k full-batch updates. The Q table is randomly initialized using a standard normal distribution. One-step RL performs SARSA updates while CQL performs soft value iteration (as suggested in the CQL paper).

Fig. 4. We generate 100 random variants of Fig. 3 by randomly sampling the high-reward state and low-reward state (without replacement). The datasets are generated in the same way.

Fig. 5. We use the same environment and dataset as in Fig. 3, but train the CQL agent with varying values of λ, each with 5 random seeds. We train the one-step RL agent for 5 random seeds. For each point on the X axis of Fig. 5, we compute 5 × 5 pairwise comparisons and report the mean and standard deviation.
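The pairwise comparison described for Fig. 5 can be sketched as follows. This is an illustration under assumptions, not the authors' evaluation code: it assumes that each comparison is the argmax-agreement measure described in Appendix A, that policies are stored as per-state lists of action probabilities, and that the reported spread is the population standard deviation. The function names and the toy policies are hypothetical.

```python
from itertools import product
from statistics import mean, pstdev


def argmax_agreement(policy_a, policy_b):
    """Fraction of states where two tabular policies share the same argmax action."""
    matches = [
        max(range(len(pa)), key=pa.__getitem__) == max(range(len(pb)), key=pb.__getitem__)
        for pa, pb in zip(policy_a, policy_b)
    ]
    return sum(matches) / len(matches)


def pairwise_similarity(policies_x, policies_y):
    """Mean and standard deviation of agreement over all (x, y) seed pairs."""
    scores = [argmax_agreement(x, y) for x, y in product(policies_x, policies_y)]
    return mean(scores), pstdev(scores)


# Toy example with two seeds of one method and one seed of the other
# (the paper uses 5 seeds per method, giving 5 x 5 = 25 pairs).
cql_policies = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.7, 0.3]],
]
one_step_policies = [
    [[0.8, 0.2], [0.1, 0.9]],
]

avg, std = pairwise_similarity(cql_policies, one_step_policies)
print(f"mean agreement = {avg:.2f}, std = {std:.2f}")
```

With five seeds per method, `pairwise_similarity` would be called once per value of λ on the X axis.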

E.2 CONTINUOUS CONTROL EXPERIMENTS

For the experiments in Figures 6 and 7 , we used the implementation of one-step RL (reverse KL) and CQL provided by Hoffman et al. (2020) . We choose this implementation because it is well tuned and uses similar hyperparameters for the two methods. As mentioned in the main text, the only change we made to the implementation was adding the twin-Q trick to one-step RL, such that it matched the critic architecture used by CQL. We did not change any of the other hyperparameters, including hyperparameters controlling the regularization strength.



If the reward also depends on the next state, then define $r(s, a) = \mathbb{E}_{p(s' \mid s, a)}\left[r(s, a, s')\right]$.

From a dimensional analysis perspective (Huntley, 1967), this choice makes sense because it allows the penalty term to have the same "units" as the critic loss: log Q-values. A second motivation for regularizing the logarithm is that the actor loss uses a logarithm.



Figure 1: Both n-step RL and critic regularization can interpolate between behavioral cloning (left) and unregularized RL (right) by varying the regularization parameter. Endpoints of these regularization paths are the same. We prove that these methods also obtain the same policy for an intermediate degree of regularization.

Figure 2: Actor and critic regularization produce identical policies. Across three tabular settings, we plot the action probabilities π(a | s) for the policies produced by one-step RL and critic-regularized classifier actor-critic (R² ≥ 0.999). We also plot the action probabilities for a policy learned by an unregularized policy to confirm that the equivalence between one-step RL and critic regularization is not a coincidence.

Figure 3: CQL can behave like one-step RL. We design a gridworld (a) so that one-step RL (c) learns a suboptimal policy. For the three cells highlighted in blue, the optimal policy (b) navigates towards the high-reward state (green) while the one-step RL policy (c) navigates away from the high-reward state. (d) CQL with a large regularization coefficient exhibits the same suboptimal behavior as one-step RL, taking actions that lead away from the high-reward states. (e) CQL with a small regularization coefficient behaves like Q-learning. For clarity, we only show the argmax action in each state; we omit the arrow when the argmax action is "do nothing".

Figure 4: CQL and one-step RL take similar actions on most MDPs that resemble Fig. 3

Fig. 6 shows both these predicted and actual (discounted) returns throughout the course of training. The results for one-step RL confirm our theoretical prediction on 4/4 datasets: the Q-values from

Figure 8: Under the assumptions of Theorem 4.3, one-step RL is most similar to critic regularization with a coefficient of λ = 1.

$$\mathbb{E}_{\pi(a \mid s)}\left[\log Q^\beta(s, a) + \log\beta(a \mid s) - \log\pi(a \mid s)\right] = \mathbb{E}_{\pi(a \mid s)}\left[\log Q^\pi_r(s, a)\right]$$ for all states $s$.

We start by determining the fixed point of critic-regularized RCE. Like in the single-task setting, the RCE objective resembles a weighted-classification problem, so we can write down the Bayes' optimal classifier as

$$\frac{Q(s, a)}{Q(s, a) + 1} = \frac{\left((1 - \gamma)p_e(s) + \gamma y^{\pi, Q_t}(s, a)\right)\beta(a \mid s)}{\left((1 - \gamma)p_e(s) + \gamma y^{\pi, Q_t}(s, a)\right)\beta(a \mid s) + \pi(a \mid s)}.$$

Solving for $Q(s, a)$ on the left hand side, the optimal value for $Q(s, a)$ is given by

$$Q(s, a) = \left((1 - \gamma)p_e(s) + \gamma y^{\pi, Q_t}(s, a)\right)\frac{\beta(a \mid s)}{\pi(a \mid s)}.$$

This tells us what each critic-regularized RCE update does.

$$\mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q^*(s, a)\right] = \mathbb{E}_{p(s)\pi(a \mid s)}\left[\log Q^\beta(s, a) + \log\beta(a \mid s) - \log\pi(a \mid s)\right].$$

