ACTOR-CRITIC ALIGNMENT FOR OFFLINE-TO-ONLINE REINFORCEMENT LEARNING

Abstract

Deep offline reinforcement learning has recently demonstrated considerable promise in leveraging offline datasets, providing high-quality models that significantly reduce the online interactions required for fine-tuning. However, this benefit is often diminished by the marked state-action distribution shift, which causes significant bootstrap error and wipes out the good initial policy. Existing solutions resort to constraining the policy shift or balancing sample replay based on their online-ness; however, they require online estimation of distribution divergence or density ratio. To avoid such complications, we propose deviating from existing actor-critic approaches that directly transfer the state-action value functions. Instead, we post-process them by aligning them with the offline-learned policy, so that the Q-values for actions outside the offline policy are also tamed. As a result, online fine-tuning can be performed exactly as in standard actor-critic algorithms. We show empirically that the proposed method improves the performance of fine-tuned robotic agents on various simulated tasks.

1. INTRODUCTION

Offline reinforcement learning (RL) provides a novel tool that allows offline batch data to be leveraged by RL algorithms without having to interact with the environment (Levine et al., 2020). This opens up new opportunities for important scenarios such as health-care decision making and goal-directed dialog learning. Due to the limitations of offline data, it generally remains beneficial and necessary to fine-tune the learned model through online interactions, and ideally the latter will enjoy a faster learning curve thanks to the favorable initialization. Unfortunately, it has long been observed that a direct offline-to-online (O2O) transfer often leads to catastrophic degradation of performance in the online stage, which is unacceptable in critical applications including medical treatment and autonomous driving. A key cause lies in the significant shift of the state distribution in the online phase compared with the offline data (Fujimoto et al., 2019; Kumar et al., 2019; Fu et al., 2019; Kumar et al., 2020a). As a result, the Bellman backup suffers a compounded error (Farahmand et al., 2010; Munos, 2005), because the Q-value has not been well estimated for state-actions lying outside the offline distribution. A number of solutions have been developed to address this issue. The most straightforward is importance sampling (Laroche et al., 2019; Gelada & Bellemare, 2019; Zhang et al., 2020; Huang & Jiang, 2020), which requires the additional effort of estimating the behavior policy and suffers from high variance, especially when the behavior policy differs markedly from the learned policy (a more realistic issue in the offline setting than in the conventional off-policy setting). The model-based approach, on the other hand, also suffers from distribution shift in state marginals and actions (Mao et al., 2022; Kidambi et al., 2020; Yu et al., 2020; Janner et al., 2019).
It may exploit the model to pursue out-of-distribution states and actions where the model mistakenly believes a high return can be achieved, so these methods also require detecting and quantifying the shift. In addition, they suffer from the standard challenges plaguing model-based RL, such as long horizons and high dimensionality. Dynamic programming offers lower variance and directly learns the value functions and policy, and several such approaches have been proposed to combat distribution shift. A natural idea is to constrain the policy to the proximity of the behavior policy, implemented via probability divergences (Nair et al., 2020; Siegel et al., 2020; Peng et al., 2019; Wu et al., 2019; Kumar et al., 2019) or behavior cloning regularization (Zhao et al., 2021; Fujimoto & Gu, 2021). A second class of approaches resorts to pessimistic under-estimates of the state-action values (Kumar et al., 2020b; Kostrikov et al., 2021), especially for out-of-distribution actions that could otherwise carry unjustifiably high values. Conservative Q-learning (CQL, Kumar et al., 2020b) has been shown to produce a relatively safer O2O transfer in balanced replay (Lee et al., 2022), which further prioritizes experience transitions that are closer to the current policy. Unfortunately, all these methods require online estimation of distribution divergence or density ratio (for the priority score or regularization weight), and excess conservatism can also slow down online fine-tuning. A third category of methods avoids these complications by estimating the epistemic uncertainty of the Q-function, so that out-of-distribution actions carry larger uncertainty, which in turn yields conservative target values for the Bellman backup (Jaksch et al., 2010; O'Donoghue et al., 2018; Osband et al., 2016; Kumar et al., 2019). However, it is generally hard to obtain calibrated uncertainty estimates, especially for deep neural nets (Fujimoto et al., 2019).
To resolve the aforementioned issues, we propose a novel alignment step for actor-critic RL that can be flexibly inserted between offline and online training, dispensing with any estimation of Q-function uncertainty, distribution divergence, or density ratio. Our key insight is drawn from soft actor-critic (SAC, Haarnoja et al., 2018), where the optimal entropy-regularized policy is simply the softmax of the Q-function. Since the Q-function is generally problematic for out-of-distribution actions while the policy learned offline is assumed trustworthy (though it still needs fine-tuning), it is natural to align the critic to the actor upon the completion of offline learning, so that the Q-function is tamed to be consistent with the policy under the softmax function, especially for actions that lie outside the behavior policy. As a result, online fine-tuning only needs to take the simple form of standard SAC, and empirically the proposed method outperforms state-of-the-art fine-tuned robotic agents on various simulated tasks. Our contributions and novelty can be summarized as follows:

• We propose a novel O2O RL approach that outperforms or matches the current SOTAs.
• Our approach does not rely on offline pessimism or conservatism, allowing it to transfer to a broader range of offline models.
• We propose, for the first time, discarding Q-values learned offline as a means to combat distribution shift in O2O RL. We also design a novel reconstruction of Q-functions for online fine-tuning.
• When offline data is not available at online fine-tuning time (a very realistic scenario due to data privacy concerns), our method remains applicable and stable, while strong competitors such as balanced replay cease to be applicable.

2. RELATED WORK

Decision transformer (Chen et al., 2021) and trajectory transformer (Janner et al., 2021) have recently been shown to be effective for offline reinforcement learning, where the likelihood of batch trajectories is maximized auto-regressively to model action sequences conditioned on a task. Zheng et al. (2022) extended them to online decision transformers (ODTs) by populating the replay buffer with online ODT rollouts labeled via hindsight experience replay, making sequence modeling effective for online fine-tuning. Our method remains in the actor-critic framework, and we demonstrate similar or superior empirical performance to ODT. Behavior cloning often plays an important role in effective O2O RL. It can take the form of constraining the policy around the behavior policy under a certain probability discrepancy measure, or simply imposing a least-squares or cross-entropy regularizer to drive the policy to imitate transitions (Zhao et al., 2021; Fujimoto & Gu, 2021). Such a regularizer often requires delicate annealing, and to this end Zhao et al. (2021) designed heuristic rules based on reward feedback. Recently, Kostrikov et al. (2022) employed behavior cloning to guide policy extraction from an expectile-based implicit Q-learning. It is noteworthy that behavior cloning is also commonly used in imitation learning, where the goal is to imitate rather than outperform the demonstrator, differing from the O2O setting. A number of efforts have been made to fuse it with RL for improvement (Lu et al., 2021). A similar line of research boosts online learning from demonstrations (e.g., Hester et al., 2018; Reddy et al., 2019). However, these works focus on accelerating online learning by utilizing offline data, and are not concerned with the safety or performance drop incurred when porting the pre-trained policy online.

3. PRELIMINARY

We follow the standard protocol that formulates an RL environment as a Markov decision process (MDP). An MDP M is described as a 5-tuple (S, A, P, r, γ), where S is the state space, A is the action space, P : S × A → ∆(S) is the transition function, r : S × A → ℝ is the reward function, and γ ∈ [0, 1) is a discount factor. A policy is a distribution π(a|s) ∈ ∆(A), and the agent aims to find a policy that maximizes the expected return E_π[Σ_{t=0}^∞ γ^t r_t].

Soft actor-critic. To learn from offline data generated by a behavior policy, we focus on off-policy RL methods. In particular, the soft actor-critic method (SAC, Haarnoja et al., 2017; 2018) learns a Q-function Q_µ(s, a) with parameter µ, and a Gaussian policy π_θ(a|s) whose sufficient statistics are determined by a neural network with parameter θ. Let d be the empirical distribution corresponding to the replay buffer; we intentionally leave it flexible over states, state-actions, or transitions. SAC then alternates between updating the critic and the actor by minimizing the following respective objectives:

L^SAC_π(θ, d) := E_{s∼d} E_{a∼π_θ(·|s)}[α log π_θ(a|s) − Q_µ(s, a)],  (1)

L^SAC_Q(µ, d) := E_{(s,a,r,s′,d)∼d}[(Q_µ(s, a) − y(r, s′, d))²],  (2)

where y(r, s′, d) := r + γ(1 − d) E_{a′∼π_θ(·|s′)}[Q_µ̄(s′, a′) − α log π_θ(a′|s′)].  (3)

Here, α > 0 is the temperature parameter, and µ̄ is the delayed Q-function parameter. If π_θ is based on a universal neural network, its optimal value minimizing L^SAC_π(θ, d) admits a closed form:

π_θ(a|s) = exp(Q_µ(s, a)/α) / Σ_{a∈A} exp(Q_µ(s, a)/α).  (4)

In practice, one simply performs gradient descent steps on L^SAC_π because, even if the network is universal, the value of θ corresponding to the optimal solution (4) is hard to find. It is important to note that adding a baseline function Z(s) to Q_µ(s, a) does not change the optimal π_θ in (4), as long as Z(s) does not depend on a.
Therefore, given π_θ, Q_µ(s, a) can be inferred as

Q_µ(s, a) = Z(s) + α log π_θ(a|s),  (5)

where Z(s) provides additional freedom to fit other aspects of the problem; see Section 4.2.
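The closed-form policy (4) and the baseline invariance behind (5) can be illustrated with a minimal numpy sketch (an illustration only, not part of the method's implementation):

```python
import numpy as np

def softmax_policy(q_values, alpha=1.0):
    # Optimal entropy-regularized policy of Eq. (4): pi(a|s) ∝ exp(Q(s,a)/alpha)
    z = q_values / alpha
    z = z - z.max()          # subtracting a constant is itself a baseline shift
    p = np.exp(z)
    return p / p.sum()

# Adding a state-dependent baseline Z(s) leaves the policy unchanged.
q = np.array([1.0, 2.0, 0.5])
assert np.allclose(softmax_policy(q, alpha=0.5),
                   softmax_policy(q + 10.0, alpha=0.5))  # Z(s) = 10 for this state
```

Conversely, given the policy, the Q-function is recoverable only up to such a per-state constant, which is exactly the freedom that Z(s) in (5) absorbs.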

4. ALIGNING CRITICS WITH ACTORS FOR OFFLINE-TO-ONLINE RL

We now detail our method that consists of three phases: offline, actor-critic (AC) alignment, and online. The whole procedure is summarized in Table 5 in Appendix A.

4.1. OFFLINE TRAINING

Motivating our offline training is TD3+BC (Fujimoto & Gu, 2021), which runs TD3 (Fujimoto et al., 2018) on the offline dataset with a behavior cloning (BC) regularization (Bain & Sammut, 1995). Similar approaches such as SAC+BC can be found in Nair et al. (2020). However, we replace TD3 with SAC to enable stochastic policies and to be consistent with the subsequent AC alignment, where the Q-function is obtained in closed form under maximum-entropy RL. We also replace the BC regularization with a maximum-likelihood (ML) regularizer, in order to be consistent with the online phase, which also uses an ML regularizer (see Section 4.3). As a result, we naturally name our offline method SAC+ML. We compare SAC+ML against TD3+BC in Appendix C.

Actor update. Let d be the empirical distribution of a mini-batch sampled from the offline dataset D. The actor updates of TD3+BC and SAC+ML aim to minimize the following respective objectives:

L^{TD3+BC}_π(θ, d) = E_{(s,a)∼d} E_{b∼π_θ(·|s)}[−λ Q_µ(s, b) + (b − a)²],  (6)

L^{SAC+ML}_π(θ, d) = E_{(s,a)∼d} E_{b∼π_θ(·|s)}[−λ(Q_µ(s, b) − α log π_θ(b|s)) − log π_θ(a|s)],  (7)

where the hyper-parameter λ balances the Q values against the BC/ML regularization. In practice, we employ the clipped double Q-learning technique (Hasselt, 2010) to train two Q-networks Q_µ1 and Q_µ2; it is beneficial for both offline and online training (Fujimoto et al., 2018). λ is then set to λ := ω / E_{(s,a)∼d}[Q_µ(s, a)], where Q_µ := min{Q_µ1, Q_µ2} and ω is a predetermined hyper-parameter. λ is thus recomputed after every critic update, requiring almost no additional computation.

Critic update. SAC+ML follows the same critic update as SAC in (2), except for the double-Q part:

L^{SAC+ML}_Q(µ_i, d) := E_{(s,a,r,s′,d)∼d}[(Q_µi(s, a) − y(r, s′, d))²],  (9)

where y(r, s′, d) := r + γ(1 − d) E_{a′∼π_θ(·|s′)}[Q_µ̄(s′, a′) − α log π_θ(a′|s′)],  (10)

and µ̄ is a delayed version of µ (a.k.a. the target network), with Q_µ̄(s, a) = min_{i∈{1,2}} Q_µ̄i(s, a), akin to Q_µ.
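For concreteness, the SAC+ML actor objective (7) with the adaptive λ can be sketched on pre-computed batch quantities. This is a numpy illustration under our own assumptions: ω = 2.5 is the TD3+BC default (not stated here), and we normalize λ by the batch mean of |Q| as in TD3+BC, which keeps λ positive; the small constant guarding the division is also our addition.

```python
import numpy as np

def sac_ml_actor_loss(q_sb, logp_b, logp_a, alpha, omega=2.5):
    """Sketch of the SAC+ML actor objective (7).
    q_sb   : Q(s, b) for sampled actions b ~ pi_theta          (batch,)
    logp_b : log pi_theta(b|s) for the same sampled actions b  (batch,)
    logp_a : log pi_theta(a|s) for the dataset actions a       (batch,)
    """
    # lambda is recomputed after every critic update (Section 4.1)
    lam = omega / (np.abs(q_sb).mean() + 1e-8)
    # -lam * (soft Q value) minus the maximum-likelihood (ML) term
    return np.mean(-lam * (q_sb - alpha * logp_b) - logp_a)
```

The ML term −log π_θ(a|s) replaces TD3+BC's squared error (b − a)², so the same gradient step both maximizes the entropy-regularized value and pulls the policy toward the dataset actions.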
The temperature update reduces the following objective over α > 0:

L^{SAC+ML}_temp(α, d) := E_{s∼d} E_{a∼π_θ(·|s)}[−α(log π_θ(a|s) + H)].  (11)

Here, H > 0 is the target entropy value, a hyper-parameter specified a priori. The Lagrange multiplier α is automatically tuned in (11), enforcing a lower bound H on the entropy of π_θ. The pseudo-code of SAC+ML is relegated to Appendix A.

4.2. ACTOR-CRITIC ALIGNMENT

At the end of offline learning, the learned policy π_θ0 often performs reasonably well and is ready for online fine-tuning, so we denote the policy with index 0. In conventional actor-critic, the critic is supposed to be updated frequently enough to accurately track the state-action values of the current policy. However, even if such updates are conducted proactively, the distribution shift problem in O2O still plagues the critic under deep-net approximation, because the Q-values are not trustworthy beyond what has been visited under the behavior policy. The over-estimated Q-values can thus rapidly destroy the learned actor and critic through the Bellman backup. To avoid this issue, we propose taming the out-of-distribution Q-values by directly aligning the critics with the actors, as a post-processing step for offline learning, or equivalently an initialization step for online learning. In particular, inspired by (5), we choose to discard the Q_µi learned in the offline phase and reset them to

Q_i(s, a) = log π_θ0(a|s) + Z_ψi(s).  (12)

The baseline Z_ψi(s) can be naturally calibrated by minimizing the Bellman residual on offline data:

L^{SAC+ML}_Z(ψ_i, d) := E_{(s,a,r,s′,d)∼d}[(log π_θ0(a|s) + Z_ψi(s) − y(r, s′, d))²],  (13)

where y(r, s′, d) := r + γ(1 − d) E_{a′∼π_θ0(·|s′)}[log π_θ0(a′|s′) + Z_ψ(s′)], Z_ψ := min{Z_ψ1, Z_ψ2}.  (14)

Here, Z_ψ in (14) employs a standard semi-gradient. This optimization is simply a regression problem and can be conducted with Adam (Kingma & Ba, 2015). The details are deferred to Appendix A.1, where pseudo-code is also given in Algorithm 1.

Generality. Thanks to this alignment step that disregards the Q-function learned offline, the offline learning algorithm is not limited to SAC+ML, even though SAC+ML is the most natural fit. In Section 6.3, we show that our alignment approach applies equally well to an offline policy learned with CQL.
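The alignment regression (13)-(14) is an ordinary Bellman-residual fit. A minimal numpy sketch of the target and residual computation, with the inner expectation approximated by a single sampled a′ and a semi-gradient (the target is treated as a constant):

```python
import numpy as np

def alignment_target(r, done, logp0_next, z_next, gamma=0.99):
    # Bellman target of Eq. (14): the expectation over a' ~ pi_theta0 is
    # approximated with one sample; no gradient flows through this target.
    return r + gamma * (1.0 - done) * (logp0_next + z_next)

def z_residual(logp0, z, y):
    # Squared Bellman residual of Eq. (13) for the reset critic
    # Q_i(s, a) = log pi_theta0(a|s) + Z_i(s); only Z is trainable.
    return np.mean((logp0 + z - y) ** 2)
```

In practice these residuals are minimized over the parameters of the Z-network with Adam, exactly as a supervised regression.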

4.3. ONLINE TRAINING

During online fine-tuning, we restore the full flexibility of the Q-functions by using the following parameterization:

Q_ϕi(s, a) := log π_θ0(a|s) + R_ϕi(s, a),  (16)

where R_ϕi(s, a) is initialized with Z_ψi(s). Such an initialization can be implemented simply by loading the weights of Z_ψi and setting the weights corresponding to the action to zeros. It is noteworthy that one should refrain from constraining Q to the closed-form manifold induced by the latest π_θ throughout the online phase, i.e., setting Q_ϕi(s, a) to log π_θ(a|s) + Z_ϕi(s) for some trainable baseline Z_ϕi, because doing so would yield no improvement of the policy. As such, we only leverage the closed form for initialization. The update on the temperature is exactly the same as (11), and the update on the critic resembles that of the offline phase in (9), except that the training variable is now only R_ϕi(s, a). In particular, we adapt the SAC critic update to our Q_ϕ, along with the standard tricks of target networks and double-Q clipping:

L_Q(ϕ_i, d) := E_{(s,a,r,s′,d)∼d}[(log π_θ0(a|s) + R_ϕi(s, a) − y(r, s′, d))²],  (17)

where y(r, s′, d) := r + γ(1 − d) E_{a′∼π_θ(·|s′)}[log π_θ0(a′|s′) + R_ϕ̄(s′, a′) − α log π_θ(a′|s′)].  (18)

The actor's objective follows from vanilla SAC and can be written as follows, with d being a mini-batch sampled from the replay buffer, R_ϕ := min_{i∈{1,2}} R_ϕi, and Q_ϕ := log π_θ0 + R_ϕ:

L_π(θ, d) := −E_{s∼d} E_{a∼π_θ(·|s)}[Q_ϕ(s, a) − α log π_θ(a|s)]  (19)

= −E_{s∼d} E_{a∼π_θ(·|s)}[R_ϕ(s, a) − α log π_θ(a|s)] − E_{s∼d} E_{a∼π_θ(·|s)}[log π_θ0(a|s)],  (20)

where the last term penalizes the deviation of π_θ from π_θ0.

Behavior cloning in (20). Naturally unfolding from SAC under the parameterization (16) is the expectation of the log-likelihood of π_θ0 under π_θ in (20), a maximum-likelihood term that enforces the actions favored by the new policy π_θ to also enjoy a high log-likelihood under the offline policy π_θ0.
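The weight-loading initialization described above, which makes R_ϕ(s, a) equal Z_ψ(s) exactly at the start of fine-tuning, can be sketched for a single dense first layer (hypothetical layer shapes; deeper layers and biases would be copied unchanged):

```python
import numpy as np

def init_r_first_layer(w_z, state_dim, action_dim):
    # First-layer weights of R_phi(s, a): copy Z_psi's state columns,
    # zero the action columns, so the action input has no effect at t = 0.
    w_r = np.zeros((w_z.shape[0], state_dim + action_dim))
    w_r[:, :state_dim] = w_z
    return w_r

rng = np.random.default_rng(0)
w_z = rng.normal(size=(8, 3))            # hidden width 8, state_dim 3
w_r = init_r_first_layer(w_z, 3, 2)      # action_dim 2
s, a = rng.normal(size=3), rng.normal(size=2)
# For any action a, the initialized layer reproduces Z's pre-activation.
assert np.allclose(w_r @ np.concatenate([s, a]), w_z @ s)
```

Since the action columns start at zero, gradient updates then gradually grow the action dependence of R_ϕ during online training.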
Different from AdaBC (Zhao et al., 2021), we sidestep an ad-hoc introduction of a behavior cloning regularization and the tweaking of its weight. This regularization is applied not to the offline data, but to the policy π_θ0 obtained from offline learning. To be consistent, we also use the maximum-likelihood regularization in offline training. One might argue that this interpretation is artificial because, after all, the log π_θ0 term can be subsumed into the free variable R_ϕi in (16), obliterating the BC regularizer in (20). This would indeed hold if the entire optimization were convex and the range of R_ϕi as a function set were closed under addition with log π_θ0. However, since R_ϕi is a neural network, such conditions do not hold. As a result, the composite form in (16) does play a crucial role in the good empirical performance, as manifested in our ablation study in Section 6.5. In practice, we also introduce two techniques to stabilize online learning. The first, the β-clipping trick, addresses the excessively large magnitude of log π_θ0 by capping its absolute value. The second, critic interpolation, provides the flexibility to balance safe transfer against policy improvement. For the sake of space, they are deferred to Appendix A.2.

5. ILLUSTRATION OF ALIGNMENT UNDER DISTRIBUTION SHIFT

We first demonstrate how the critic alignment makes the Q-function more consistent with the real actions in the offline dataset, compared with the Q-values learned by offline actor-critic. We trained SAC+ML on the halfcheetah-medium dataset and sampled in-distribution states from it. To sample out-of-distribution states, we resorted to the halfcheetah-expert dataset; the details are available in Appendix D, and Figure 5 there further illustrates the difference between the two state distributions. Figure 1 (top row) compares the Q-values learned by SAC+ML with our aligned/reconstructed Q-values, where the state-action pairs are sampled in-distribution. The bottom row shows a similar comparison on out-of-distribution samples. The actions have 6 dimensions, and for the i-th subplot, we perturbed the i-th dimension over [−1, 1] with all other dimensions fixed. Clearly, the offline-learned Q-values are often inconsistent with the real action from the dataset, even for in-distribution samples. Our alignment greatly improves this consistency, which encourages the policy to stay close to the offline policy and safeguards the transfer process.
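The per-dimension sweep used to generate the Figure 1 subplots can be sketched as follows (a small numpy helper for illustration, not taken from the paper's code):

```python
import numpy as np

def sweep_action_dim(action, i, n=51):
    # Copies of `action` with its i-th dimension swept over [-1, 1],
    # all other dimensions held fixed, as in the Figure 1 subplots.
    acts = np.tile(action, (n, 1))
    acts[:, i] = np.linspace(-1.0, 1.0, n)
    return acts
```

Evaluating the learned and the aligned Q-networks on each row of the returned array (with the state held fixed) then yields one curve per action dimension.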

6. EXPERIMENTS

We next compared our actor-critic alignment method (ACA) with a number of state-of-the-art methods, as summarized in Table 1. Although CQL was not developed for O2O transfer, we still included it due to its strong performance. Implementation of our ACA algorithm can be found anonymously at Online Supplementary, along with the pre-trained models. Our experiments aim to demonstrate:

• SAC→ACA matches or outperforms SOTAs, e.g., balanced replay (BR, Lee et al., 2022), advantage weighted actor critic (AWAC, Nair et al., 2020), and online decision transformer (ODT, Zheng et al., 2022);
• Direct transfer such as SAC→SAC and CQL→SAC suffers a significant performance drop;
• Transfer from an offline method significantly outperforms training SAC online from scratch.

We additionally present ablation studies to examine the various components of ACA.

6.1. COMPARISON WITH BASELINE METHODS

We used three environments from the D4RL-v2 datasets (Fu et al., 2020): HalfCheetah, Hopper, and Walker2d, each with five levels. All offline/online experiments ran 5 random seeds. We ran all offline algorithms for 500 episodes with 1000 mini-batches each, and all online experiments for 100 episodes with 1000 environment interactions each; this protocol is commonly used. More implementation details are deferred to Appendix B. Figure 2 shows the average return as a function of training episodes, for each offline model (left half of the subplots) and online model (right half). Since SAC→ACA and SAC→SAC share the same offline method, their curves coincide on the left of the subplots, so only the green curve is shown there (no blue). A similar situation occurs for CQL→BR and CQL→SAC, where only the purple curve is shown on the left half (no pink). In Figure 2, CQL→SAC (purple→pink) drops significantly at the expert level (fifth row) and the medium-expert level (fourth row). SAC→SAC (green→blue) drops in almost all cases, except random (first row) and medium-replay (third row). It is clear that our SAC→ACA (green) barely suffers any performance drop. The only exception is Hopper-medium-expert, but all other methods (except AWAC, which performs poorly offline) also suffer a drop there, while ours recovers most rapidly. Besides, ours offers policy improvement comparable to the strongest baseline, which is CQL→BR in most cases.

6.2. COMPARISON WITH ONLINE DECISION TRANSFORMER

As shown in Table 3, for almost all medium and medium-replay tasks, our SAC→ACA outperforms ODT in both final performance and performance increase (δ). We also note that ODT (offline) outperforms SAC+ML on the hopper-medium-replay task by a large margin, which leaves our approach more room to improve. Therefore, we made the same comparison excluding the hopper-medium-replay task.
In this case, ODT and ours were initialized from roughly the same performance, and ours still outperforms ODT in both total final performance and total performance increase.

6.3. FLEXIBILITY IN INITIALIZATION FROM DIFFERENT OFFLINE METHODS

A key advantage of our alignment method lies in the flexibility to leverage any offline RL method, as long as it outputs a parameterized Gaussian policy, because the Q-function is reset anyway. In contrast, SOTA methods sometimes require certain properties of the offline method, such as pessimism. For example, BR's performance depends critically on the use of CQL. To demonstrate our flexibility, we adopted CQL for offline learning and made a simple change to the alignment step: in (13), log π_θ0 is clipped to 0 when it is negative. For comparison, we also tested BR using SAC+ML as the offline learner. Figure 3 shows the results of ACA/BR initialized from SAC+ML/CQL. While the performance of our approach does not change much when initialized from different offline models, BR shows significant performance drops when initialized from SAC+ML, i.e., non-pessimistic offline training.

Figure 3: ACA and BR initialized from different offline methods. ACA achieves similar performance when initialized from either SAC+ML or CQL; BR requires CQL initialization.

6.4. ONLINE TRAINING WITHOUT OFFLINE DATA

When the application precludes access to the offline data during online fine-tuning, we re-ran the benchmarks for medium-replay, medium-expert, and expert. There is obviously no reason to replay offline data at the random level, and empirically we observed that online fine-tuning already performed well on medium when no offline data was replayed. Figure 6 in Appendix F shows the online average return of our method without using offline data, compared with the other baselines that also do not access offline data during online fine-tuning. The balanced replay algorithm requires offline data, so compared with Figure 2, we no longer have the purple line corresponding to CQL→BR. It turns out that all the other baselines retain online performance similar to that in Figure 2, which had already been shown inferior to our SAC→ACA. Table 4 further highlights that SAC→ACA does not exhibit significant change in online performance in the absence of offline data.

6.5. ABLATION STUDY

As mentioned in Section 4.3, log π_θ0 in (20) can be considered a "behavior cloning" regularization. One may wonder whether this, rather than the actor-critic alignment, is the primary contributor to the empirical effectiveness. We therefore conducted the following ablation study, which answers this question in the negative. In contrast to the parameterization of the online Q-function in (16), we designed two alternatives. The first directly copies the Q-function from the conclusion of offline learning and then fine-tunes it online. The second adopts the same decomposed parameterization as in (16), but initializes R_ϕi with the offline-learned Q_µi instead of Z_ψi. As a result, log π_θ0 + R_ϕ, whose expectation serves in the actor objective, becomes a regular offline-trained Q-function with a regularizer log π_θ0, rather than an actor-critic alignment. The update objectives of the two ablations are relegated to Appendix E. Figure 4 shows that on tasks vulnerable to transfer risk, such as hopper-medium and walker-medium (second and third subplots), the two ablation alternatives suffer a clear performance drop due to the errors inherited from the offline-trained Q-functions. However, some tasks are less vulnerable. For example, on halfcheetah-medium, Figure 2 shows that SAC→SAC (green→blue) suffers only a small drop, although it employs no mechanism to combat distribution shift. On such a task, the two ablation alternatives unsurprisingly remain competitive (first subplot).

7. CONCLUSION AND FUTURE WORK

We proposed a new actor-critic alignment method that allows safe offline-to-online reinforcement learning and achieves strong empirical performance. To combat distribution shift, we designed a novel approach that disregards the offline-learned Q-functions and reconstructs them from the learned policy using a closed form motivated by the entropy-regularized actor update. Since it does not need an offline critic, online actor-critic fine-tuning becomes possible for offline-learned decision transformers, as well as other supervised learning methods such as RvS (Emmons et al., 2022).

A ALGORITHM DETAILS

The pseudo-code of the offline, alignment, and online phases is provided in Algorithms 1, 2, and 3, respectively. The online critic is parameterized as Q_ϕ(s, a) := log π_θ0(a|s) + R_ϕ(s, a), where R_ϕ(s, a) is initialized by Z_ψ(s).

Algorithm 1 Offline SAC+ML
  Initialize parameters θ, α, ψ_i, µ_i, µ̄_i for i ∈ {1, 2}
  for each iteration do
    sample mini-batch from dataset D
    update α with Eq. (11)
    update µ_i with Eq. (9) for i ∈ {1, 2}
    update θ with Eq. (7)
    update ψ_i with Eq. (13) for i ∈ {1, 2}
    µ̄_i ← τ µ_i + (1 − τ) µ̄_i for i ∈ {1, 2}
  end for

Algorithm 2 Actor-Critic Alignment
  Require: θ, α, ψ_i, µ_i, µ̄_i for i ∈ {1, 2}
  Initialize parameters ϕ_i, ϕ̄_i for i ∈ {1, 2}
  Set R_ϕi(s, a) ← Z_ψi(s) for i ∈ {1, 2}
  Copy ϕ̄_i ← ϕ_i
  Copy θ_0 ← θ
  Reset α ← 1
  Delete µ_i and µ̄_i for i ∈ {1, 2}

A.1 OPTIMIZATION OF Z_ψi

Since the alignment objective (13) needs access to offline data, we blended it into the offline training, as shown in the second-to-last step of Algorithm 1. It is noteworthy that this is only for convenience of implementation, and the ψ_i values do not have any influence on SAC+ML training itself. Conversely, the optimized value of ψ_i provides a good initialization for a standalone optimization of objective (13). In practice, we observed that the ψ_i found during offline training is good enough, and we directly used it to initialize the online critic R_ϕi.

A.2 TECHNIQUES TO STABILIZE ONLINE LEARNING

We propose the β-clipping trick and critic interpolation to achieve better empirical performance. They are used inside the online updates of Algorithm 3:

Algorithm 3 Online Fine-tuning
  for each iteration do
    update α with Eq. (11)
    update ϕ_i with Eq. (30) for i ∈ {1, 2}
    update θ with Eq. (29)
    ϕ̄_i ← τ ϕ_i + (1 − τ) ϕ̄_i
  end for

A.2.1 β-CLIPPING

As log π_θ0 is unbounded below, the training can be numerically unstable. Here, β is a hyper-parameter, and SoftPlus(x) = log(1 + exp(x)). Essentially, CLIP_β clips log π_θ0(a|s) at C_β(s). This CLIP_β(·) operator bounds the log π_θ0 term within a reasonable range and requires minimal hyper-parameter tuning; see Sections G and H for details. Using CLIP_β, we define Q^β_ϕ as

Q^β_ϕ(s, a) := CLIP_β(log π_θ0(a|s)) + R_ϕ(s, a).

The clipped online actor/critic updates can then be summarized as

L^β_π(θ, d) = E_{s∼d} E_{a∼π_θ}[α log π_θ(a|s) − Q^β_ϕ(s, a)],

L^β_Q(ϕ_i, d) = E_{(s,a,r,s′,d)∼d}[(Q^β_ϕi(s, a) − y(r, s′, d))²],

where y(r, s′, d) = r + γ(1 − d) E_{a′∼π_θ(·|s′)}[Q^β_ϕ(s′, a′) − α log π_θ(a′|s′)].
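The exact form of C_β(s) is specified in the appendix referenced above and is not reproduced here; the following is only one plausible SoftPlus-based realization of a smooth floor that caps a log-density from below at a level c (our assumption for illustration, not the authors' exact operator):

```python
import numpy as np

def softplus(x):
    # numerically stable SoftPlus(x) = log(1 + exp(x))
    return np.logaddexp(0.0, x)

def smooth_floor(logp0, c):
    # Smooth lower cap: c + SoftPlus(x - c) ~= max(x, c).
    # Moderate values of logp0 pass through nearly unchanged, while
    # very negative values are floored near c instead of diverging.
    # `c` plays the role of C_beta(s) in this sketch.
    return c + softplus(logp0 - c)
```

A soft cap of this kind keeps the clipped term differentiable everywhere, unlike a hard max, which may matter for the actor gradient.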

A.2.2 CRITIC INTERPOLATION

At the initial phase of online training, CLIP_β(log π_θ0(a|s)) dominates the actor update, safeguarding the policy. As training proceeds, R_ϕ grows to overcome this barrier and starts to improve the policy. Ideally, we wish to finely control this junction so that the safety of the O2O transition does not excessively slow down policy improvement. To this end, we introduce an interpolation between the closed-form initialized critic and the restriction-free critic. We call it critic interpolation, written as

Q^{k,β}_ϕ(s, a) := k [CLIP_β(log π_θ0(a|s)) + R_ϕ(s, a)] + (1 − k) R_ϕ(s, a)  (27)
                     (closed-form initialized critic)      (restriction-free critic)

= k · CLIP_β(log π_θ0(a|s)) + R_ϕ(s, a).  (28)

We set k = 1 at t = 0 to assert the closed-form initialization, and then linearly decay k during the course of online training, allowing a transition from the closed-form initialization to the free SAC update. The detailed decay rate can be found in Appendix I.
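The interpolation (27)-(28) and its schedule can be sketched as follows (the linear decay over the whole run is an assumption for illustration; the paper's actual decay rate is in Appendix I):

```python
def k_schedule(t, total_steps):
    # Linear decay of k from 1 (pure closed-form initialization)
    # to 0 (free SAC update), clipped at zero.
    return max(0.0, 1.0 - t / total_steps)

def q_interp(k, clipped_logp0, r_value):
    # Eq. (27)-(28): k*(CLIP(log pi0) + R) + (1-k)*R = k*CLIP(log pi0) + R
    return k * clipped_logp0 + r_value
```

The simplification in (28) shows why this is cheap: the interpolation never requires evaluating two separate critics, only rescaling the clipped log-density term.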

A.3 CONCLUDED ONLINE TRAINING

Our final online update rules are summarized as follows:

L^online_π(θ, d) = E_{s∼d} E_{a∼π_θ}[α log π_θ(a|s) − Q^{k,β}_ϕ(s, a)],  (29)

L^online_Q(ϕ_i, d) = E_{(s,a,r,s′,d)∼d}[(Q^{k,β}_ϕi(s, a) − y(r, s′, d))²],  (30)

where y(r, s′, d) = r + γ(1 − d) E_{a′∼π_θ(·|s′)}[Q^{k,β}_ϕ(s′, a′) − α log π_θ(a′|s′)].  (31)

B IMPLEMENTATIONS

Overall, all our implementations are from, or based on, d3rlpy (Takuma Seno, 2021), a popular RL library specialized for offline RL. Using the same library helps minimize the impact of implementation differences. Many of our baselines (see Table 9) are implemented on top of SAC, with the changes proposed in their respective original papers.

B.1 GENERAL IMPLEMENTATION DETAILS

Evaluation protocol: All offline/online experiments ran 5 random seeds. We ran all offline algorithms for 500 episodes with 1000 mini-batches each, and all online experiments for 100 episodes with 1000 environment interactions each. After each episode, we conducted 10 evaluations and computed the average return. Reported results are the mean and standard deviation of average returns over the 5 random seeds.

Choice of offline checkpoints: Evaluation in the offline phase in fact requires online interactions. Therefore, we do not pick the best-performing checkpoints; instead, we use the last checkpoints as the initialization models for the online phase.

Squashed Gaussian: For methods with stochastic policies, we parameterized the policies by a unimodal Gaussian and applied the squashed-Gaussian trick (Haarnoja et al., 2018) to bound the range of actions to [−1, 1].
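The squashed-Gaussian trick can be sketched as follows: a pre-squash sample u from the Gaussian is passed through tanh, and the log-density picks up the standard change-of-variables correction (a numpy sketch; the eps guard near |a| = 1 is a common implementation detail we assume, not something specified here):

```python
import numpy as np

def squashed_gaussian_logp(u, logp_u, eps=1e-6):
    # a = tanh(u) bounds actions to [-1, 1]; the density correction is
    #   log pi(a|s) = log N(u|s) - sum_i log(1 - tanh(u_i)^2)
    a = np.tanh(u)
    correction = np.sum(np.log(1.0 - a ** 2 + eps), axis=-1)
    return a, logp_u - correction
```

Without this correction, the log-probabilities used in the temperature and actor updates would be those of the unsquashed Gaussian and the entropy target would be miscalibrated.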

Buffer initialization:

We followed the instructions in the AWAC and BR papers for initializing the online replay buffers. For AWAC, we added all transitions in D to the buffer B. For BR, we refer to their original implementation at this URL for details. For SAC→SAC and CQL→SAC, we also added all transitions in D to the buffer B, as there are no explicit instructions or common protocols. All replay buffer sizes were set to 1e6, unless otherwise specified in Appendix I.

B.2 OFFLINE

AWAC and CQL: We used the d3rlpy implementations of AWAC and CQL.

SAC+ML: Our SAC+ML implementation was adapted from d3rlpy's TD3+BC implementation, changing the actor update rule to Eq. (7) and adding the learning of the baseline Z_ψ.

B.3 ONLINE

Training details for online: For all methods, we performed a temperature update (if applicable), a critic update, and an actor update after every environment interaction, provided there were enough transitions (i.e., more than the batch size) in the replay buffer. Target networks were all updated in a Polyak-averaging fashion with step size τ = 0.005 for all experiments. See Appendix I for more hyper-parameter details. Online results reported in tables also used the last checkpoints rather than the best-performing ones.
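The two bookkeeping rules above, Polyak target averaging with τ = 0.005 and gating updates on buffer size, can be sketched as follows (`polyak_update` and `should_update` are our illustrative names; the batch size of 256 is an assumption for the example only).

```python
def polyak_update(target_params, online_params, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target,
    applied element-wise to each parameter."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

def should_update(buffer_size, batch_size=256):
    """Only start learning once the buffer holds more than one batch."""
    return buffer_size > batch_size
```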

SAC:

We used d3rlpy implementation for SAC.

SAC→SAC and CQL→SAC:

We simply loaded offline-trained SAC+ML and CQL, respectively, and then ran SAC online.

BR:

We adapted all parts related to prioritized replay from the official BR implementation to a d3rlpy SAC implementation base, as the original BR paper also runs SAC online.

ACA (ours): The implementation of our approach can be found anonymously in the Online Supplementary. In addition to Algorithm 3, we also applied gradient norm clipping to actor updates, which is commonly used in RL implementations.
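The gradient norm clipping mentioned above is the standard global-norm variant (as in `torch.nn.utils.clip_grad_norm_`); a minimal NumPy sketch, with our own function name and a max-norm value chosen only for illustration:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly when their global L2 norm exceeds
    max_norm; returns the (possibly rescaled) gradients and the original norm."""
    total = float(np.sqrt(sum(float(np.sum(g ** 2)) for g in grads)))
    if total > max_norm:
        scale = max_norm / (total + 1e-12)  # epsilon guards against div-by-zero
        grads = [g * scale for g in grads]
    return grads, total
```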

C SAC+ML VS. TD3+BC

We would like to emphasize that our goal is not to propose a stronger offline RL method. Table 6 is presented to show that our SAC+ML modification performs comparably to the original SOTA method, TD3+BC. The TD3+BC results in Table 6 were copied from Appendix C.3 of their paper (Fujimoto & Gu, 2021). The evaluation protocol is identical to theirs: (1) all experiments were done on D4RL-v2 datasets; and (2) the reported results were from the last evaluation step, averaged over 5 random seeds.

D ILLUSTRATION DISTRIBUTION SHIFT

Figure 5 shows the histogram of the ℓ1 norm of state vectors from the halfcheetah-medium-v2 and halfcheetah-expert-v2 datasets. Clearly, there is a distribution shift. We can therefore obtain in-distribution and out-of-distribution samples (with respect to the medium dataset) by (1) sampling a transition from the medium dataset, and (2) sampling from the 98th to 100th percentile of the expert dataset (in terms of the ℓ1 norm of the state vector), so that the sample is out-of-distribution for a medium-level agent.

E CRITIC OBJECTIVES IN ABLATION STUDIES

Here we write out the detailed formulas of the critic objectives in the two ablation studies of Section 6.5:

$$\mathcal{L}^{\text{ablation1}}_{Q}(\phi_i, d) = \mathbb{E}_{(s,a,r,s',d)\sim d}\!\left[\left(R_{\phi}(s,a) - y(r,s',d)\right)^2\right] \qquad (32)$$
$$y(r,s',d) = r + \gamma(1-d)\,\mathbb{E}_{a'\sim\pi_\theta(\cdot|s')}\!\left[R_{\phi}(s',a') - \alpha\log\pi_\theta(a'|s')\right] \qquad (33)$$
$$\mathcal{L}^{\text{ablation2}}_{Q}(\phi_i, d) = \mathbb{E}_{(s,a,r,s',d)\sim d}\!\left[\left(Q^{k,\beta}_{\phi}(s,a) - y(r,s',d)\right)^2\right] \qquad (34)$$
$$y(r,s',d) = r + \gamma(1-d)\,\mathbb{E}_{a'\sim\pi_\theta(\cdot|s')}\!\left[Q^{k,\beta}_{\phi}(s',a') - \alpha\log\pi_\theta(a'|s')\right]$$
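The in-/out-of-distribution sampling scheme of Appendix D (medium states as in-distribution, top-percentile expert states by ℓ1 norm as out-of-distribution) might be sketched as below; `split_id_ood` and the array layout are our assumptions.

```python
import numpy as np

def split_id_ood(medium_states, expert_states, lo=98.0, hi=100.0):
    """In-distribution: any state from the medium dataset.
    Out-of-distribution: expert states whose l1 norm lies in the
    [lo, hi] percentile range of the expert dataset's l1 norms."""
    l1 = np.abs(expert_states).sum(axis=1)          # per-state l1 norm
    lo_v, hi_v = np.percentile(l1, [lo, hi])        # percentile thresholds
    ood = expert_states[(l1 >= lo_v) & (l1 <= hi_v)]
    return medium_states, ood
```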

F MORE RESULTS ON ONLINE TRAINING WITHOUT OFFLINE DATA

The distributional shift issue is clearly more severe when offline data are not accessible during the online phase. To be more conservative, we therefore set β_{w/o} = 1.5 β_{w/} for experiments without offline data, excluding the random and medium levels, as both already used no offline data for our main results. (β_{w/} denotes the hyper-parameter used for our main results; see Table 8 for details.) All other hyper-parameters remained unchanged.

G SENSITIVITY ON β

Figure 7 shows that the performance of SAC→ACA is not very sensitive to the choice of β.

H MAGNITUDES OF C_β AND Z_ψ

For each random seed, we randomly sampled an episode {(s_i, a_i, s_{i+1}, r_i) : i = 1, 2, ..., T} from the corresponding offline dataset. We then computed {|C_β(s_i)|} and {|Z_ψ(s_i)|} to make box plots, where outliers were omitted for better visualization. We used the same βs as in Table 8 to make this plot. Figure 8 shows that |C_β(s)| and |Z_ψ(s)| have comparable values. Our empirical performance already outperforms all baselines even though we did not extensively match their magnitudes for every task. This in turn implies that tuning β requires minimal effort, in addition to the advantage that β can be chosen entirely via offline comparison.

I HYPER-PARAMETERS

J IQL

We observed that IQL struggles with online improvement. This was in fact also observed by ODT (Zheng et al., 2022). See Table 10.

K ACTOR-UPDATE-ONLY EXPERIMENTS

To test how different offline/aligned Q-functions affect the actor update, we ran experiments where only the actor is updated:
- The actor/entropy updates are the same as the SAC actor/entropy updates.
- Updates are run on the random dataset to simulate OOD data.
- We run 10k steps in total and evaluate performance every 100 steps.

Takeaway 1: our reparameterization is able to attain its offline performance better than the other baselines. Takeaway 2: our reparameterization applies to different offline critics, as also mentioned in Section 6.3. As SAC+ML collapses very quickly, we excluded it from further comparison.

We provide further comparison of CQL vs. aligned CQL.

Figure 11: Aligned CQL vs. CQL.

Table 11: Expert-level policies are more fragile. (Columns: relative change (%) at 1k, 2k, 3k, 4k, and 5k online steps, for CQL and for aligned CQL, per dataset and environment.)

Additional observation: near-optimal policies are more fragile (even if we exclude HalfCheetah-expert), which highlights our advantage on expert-level datasets.
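The box-plot statistics over a sampled episode can be sketched as follows; `c_beta` and `z_psi` are hypothetical stand-ins for the learned C_β and Z_ψ, and `magnitude_stats` is our illustrative name.

```python
import numpy as np

def magnitude_stats(states, c_beta, z_psi):
    """Collect |C_beta(s)| and |Z_psi(s)| over one episode's states; the
    box plots are drawn from exactly these magnitude arrays. Returning the
    medians makes it easy to check the two quantities are comparable."""
    c_mag = np.abs([c_beta(s) for s in states])
    z_mag = np.abs([z_psi(s) for s in states])
    return float(np.median(c_mag)), float(np.median(z_mag))
```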

L AWR OBJECTIVES

Akin to the experiments conducted in Section K, we ran actor-update-only experiments with the AWAC actor objective (also categorized as AWR) instead of the SAC actor objective. Critics were trained offline by AWAC.

Note: By "over-estimation", we mean that there exists some a ≠ π(s) such that Q(s, a) > Q(s, π(s)), which to a certain degree means that the critic Q "disagrees" with the policy π.

Details: All agents are trained on halfcheetah-medium-v2. By in-distribution samples, we refer to states drawn from the halfcheetah-medium-v2 dataset. By out-of-distribution samples, we refer to samples from the halfcheetah-expert-v2 dataset. We randomly drew 200 samples per seed, resulting in a total of 1000 samples for each plot.
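The over-estimation counting behind Figures 14 and 15 might be sketched as follows. The function name, perturbation scale, and perturbation count are our assumptions for illustration, not the paper's settings.

```python
import numpy as np

def overestimated_fraction(q_fn, states, policy,
                           perturb_scale=0.1, n_perturb=20, seed=0):
    """Fraction of perturbed actions a~ = pi(s) + noise for which
    Q(s, a~) > Q(s, pi(s)), i.e. the critic 'disagrees' with the policy."""
    rng = np.random.default_rng(seed)
    over, total = 0, 0
    for s in states:
        a = policy(s)
        q_ref = q_fn(s, a)  # Q at the policy-favored action
        for _ in range(n_perturb):
            a_tilde = a + perturb_scale * rng.standard_normal(np.shape(a))
            over += q_fn(s, a_tilde) > q_ref
            total += 1
    return over / total
```

For a critic that is exactly peaked at the policy action, the fraction is zero; a large fraction signals the over-estimation the figures quantify.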



Compared with (5), it appears that we have set α there to 1, while its value at the end of offline learning is rarely close to 1. This creates no contradiction, however, because log π_{θ_0} will be used to parameterize the online Q-function as in (16), and the α for the online-phase SAC is initialized to 1. So their product, passed through the softmax, will recover π_{θ_0}.
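The recovery claim takes one line of algebra. Writing the aligned critic initialization as $Q(s,a) = \log \pi_{\theta_0}(a\,|\,s) + C(s)$ for some action-independent term $C(s)$ (our shorthand for the role $\log \pi_{\theta_0}$ plays in (16)), the softmax with $\alpha = 1$ gives

```latex
\frac{\exp\big(Q(s,a)\big)}{\int \exp\big(Q(s,a')\big)\,da'}
= \frac{\pi_{\theta_0}(a\,|\,s)\,e^{C(s)}}{e^{C(s)}\int \pi_{\theta_0}(a'\,|\,s)\,da'}
= \pi_{\theta_0}(a\,|\,s),
```

since $e^{C(s)}$ cancels and $\pi_{\theta_0}(\cdot\,|\,s)$ integrates to 1.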



Figure 1: SAC+ML Q-values vs. aligned Q-values for an in-distribution sample (top row) and an out-of-distribution sample (bottom row). Left y-axis: SAC+ML Q-values; right y-axis: aligned Q-values. Since only the trend of each curve matters, we omit the y-axis tick values.

Figure 2: Comparing SAC→ACA (ours) with other baselines for offline-to-online RL. The shaded areas stand for the standard deviation. Refer to Table 1 for legend meanings.

Figure 4: Ablation study: we keep the actor update the same as Eq. (20) but change the critic update (details in Section 6.5) to show that the re-parameterized Q-function is critical. BC only (ablation 1): R_φ is treated as the online critic and is initialized by Q_µ. Critic updates are made as regular SAC critic updates, without our re-parameterization; it can therefore be seen as SAC with a behavior-cloning-regularized actor update. R_φ init by Q_µ (ablation 2): We keep the ACA framework but initialize R_φ by Q_µ instead of Z_ψ, to show the importance of the baseline.

Figure 5: Histogram of ℓ 1 norm of state vectors in halfcheetah medium and expert datasets.

Figure 6: Comparison with other baselines when offline data are not accessible.

Figure 7: Results for different β

Figure 9: IQL

Figure 10: Our alignment (applied to both CQL and SAC+ML) attains the offline performance better than the original critics do.

Figure 12: (a) shows that AWAC also collapses to nearly zero performance; around 20 actor updates are already enough to destroy the performance.

Figure 13: Demonstration of how Figure 14 and Figure 15 are created. (a) Given a sample (s, a), we perturb a along a dimension to plot Q(s, ã) and compare Q(s, ã) to Q(s, π(s)); (b) We plot Q(s, ã) with its deviation from Q(s, π(s)), so that Q(s, π(s)) is centered at y = 0 and π(s) is centered at x = 0; (c) Such a centralization allows us to place multiple samples (different s) in the same plot, where points above the x-axis correspond to "over-estimated" perturbations; (d) We aggregate multiple samples by counting how many points are above 0. This way, the height of the red part in the bar plot quantifies the fraction of points that are "over-estimated". For an "over-estimated" point (s, a), its x-coordinate stands for the distance between a and the policy-favored action π(s).

Figure 14: Quantifying fraction of over-estimated perturbations for in-distribution samples.

Figure 15: Quantifying fraction of over-estimated perturbations for out-of-distribution samples.

Baseline algorithms for O2O RL. See acronyms below.

COMPARISON WITH ONLINE DECISION TRANSFORMER

Since ODT is not based on dynamic programming, we compared it with SAC→ACA in this separate section. As Zheng et al. (2022) experimented using 200k online samples and averaged over 10 seeds, we ran SAC+ML with 5 additional seeds and ran 200k online steps for all 10 SAC+ML runs, to make the comparison fair.

Comparing SAC→ACA with online decision transformer (ODT), with a focus on the online improvement upon offline policy (δ ODT and δ ACA ).

Scores for SAC→ACA at 100k online steps, and the increase from the offline result (in parentheses). Comparisons are made with and without offline data. HC = HalfCheetah, H = Hopper, W = Walker2d.

Learning in three phases for offline-to-online RL with actor-critic alignment

SAC+ML vs. TD3+BC


Specific hyper-parameters for different baselines. Please refer to the original paper for the meaning of hyper-parameter names.

Hyper-params used for our main results reported in Section 6.1.

General hyper-parameters. ACA and BR stand for SAC→ACA and CQL→BR, respectively, for the sake of space.

IQL: Disable critic updates, target updates, data collection, etc. (In other words, keep only the actor and entropy updates.)


Reproducibility Statement. Our ACA implementation can be found anonymously at Online Supplementary, along with the pre-trained offline models. More implementation details regarding offline approaches and baselines are available in Appendix B, and our choice of hyper-parameters can be found in Appendix I.

O INITIALIZATION FROM BEST CHKPTS

We only saved checkpoints at fixed intervals; therefore, we pick the best checkpoints among those saved. Sub-figures that are not shown indicate that the last checkpoint is also the best checkpoint we saved. (We determine the "best" checkpoint by the average performance over all seeds.) We do not observe significant differences between initialization from the best checkpoints and from the last ones.

P DOES OVER-FITTING AFFECT TRANSFER?

In addition to Section O, we initialize at 100k, 200k, 300k, 400k, and 500k steps respectively for walker2d-medium-v2, to see whether initialization from different checkpoints makes a difference. We do not observe conclusive evidence that over-fitting leads to unstable transfer. Unsurprisingly, random initialization performs poorly, as it discards the offline-learned model that motivates our approach.

