BEHAVIOR PRIOR REPRESENTATION LEARNING FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on the pre-training of state representations, followed by policy training. In this work, we introduce a simple, yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm. Theoretically, we prove that BPR carries performance guarantees when integrated into algorithms that have either policy improvement guarantees (conservative algorithms) or produce lower bounds of the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks.

1. INTRODUCTION

Offline Reinforcement Learning (Offline RL) is one of the most promising data-driven ways of optimizing sequential decision-making. Offline RL differs from the typical settings of Deep Reinforcement Learning (DRL) in that the agent is trained on a fixed dataset that was previously collected by some arbitrary process, and does not interact with the environment during learning (Lange et al., 2012; Levine et al., 2020) . Consequently, it benefits the scenarios where online exploration is challenging and/or unsafe, especially for application domains such as healthcare (Wang et al., 2018; Gottesman et al., 2019; Satija et al., 2021) and autonomous driving (Bojarski et al., 2016; Yurtsever et al., 2020) . A common baseline of Offline RL is Behavior Cloning (BC) (Pomerleau, 1991) . BC performs maximum-likelihood training on a collected set of demonstrations, essentially mimicking the behavior policy to produce predictions (actions) conditioned on observations. While BC can only achieve proficient policies when dealing with expert demonstrations, Offline RL goes beyond the goal of simply imitating and aims to train a policy that improves over the behavior one. Despite promising results, Offline RL algorithms still suffer from two main issues: i) difficulty dealing with limited high-dimensional data, especially visual observations with continuous action space (Lu et al., 2022) ; ii) implicit under-parameterization of value networks exacerbated by highly re-used data, that is, an expressive value network implicitly behaves as an under-parameterized one when trained using bootstrapping (Kumar et al., 2021a; b) . In this paper, we focus on state representation learning for Offline RL to mitigate the above issues: projecting the high-dimensional observations to a low-dimensional space can lead to a better performance given limited data in the Offline RL scenario. 
Moreover, disentangling representation learning from policy training (or value function learning), referred to as pre-training the state representations, can potentially mitigate the "implicit under-parameterization" phenomenon associated with the emergence of low-rank features in the value network (Wang et al., 2022). In contrast to previous works that pre-train state representations by specifying the required properties, e.g., maximizing the diversity of states encountered by the agent (Liu & Abbeel, 2021; Eysenbach et al., 2019), exploring the attentive knowledge on sub-trajectory (Yang & Nachum, 2021), or capturing temporal information about the environment (Schwarzer et al., 2021a), we use the behavior policy to learn generic state representations instead of specifying particular properties. Many existing Offline RL methods regularize the policy to be close to the behavior policy (Fujimoto et al., 2019; Laroche et al., 2019b; Kumar et al., 2019) or constrain the learned value function of out-of-distribution (OOD) actions so that it is not overestimated (Kumar et al., 2020; Kostrikov et al., 2021). Beyond these uses, the behavior policy is often ignored, as it does not directly provide information on the environment. However, the choice of behavior has a huge impact on the Offline RL task. As shown by recent theoretical work (Xiao et al., 2022; Foster et al., 2022), under an agnostic baseline, the Offline RL task is intractable (near optimality is exponential in the state space size), but it becomes tractable with a well-designed behavior (e.g., the optimal policy or a policy trained online). This impact indicates that the information collected from the behavior policy might deserve more attention. To this end, we propose Behavior Prior Representation (BPR), a state representation learning method tailored to Offline RL settings (Figure 1).
BPR learns state representations implicitly by enforcing them to be predictive of the action performed by the behavior policy, normalized to be on the unit sphere. Then, the learned encoder is frozen and utilized to train a downstream policy with any Offline RL algorithm. Intuitively, to be predictive of the normalized actions, BPR encourages the encoder to ignore the task-irrelevant information while maintaining the task-specific knowledge relative to the behavior, which we posit is efficient for learning a state representation. Theoretically, we prove that BPR carries performance guarantees when combined with conservative or pessimistic Offline RL algorithms. While an uninformative behavioral policy may lead to bad representations and therefore degraded performance, we note that such a scenario may be predicted from the empirical returns of the dataset. Furthermore, since the learning procedure of BPR does not involve value functions or bootstrapping methods like Temporal-Difference, it can naturally mitigate the "implicit under-parameterization" phenomenon. We verify this empirically by using the effective dimension to evaluate feature compactness in the value network's penultimate layer.

The key contributions of our work are summarized below:
• We propose a simple, yet effective method for state representation learning in Offline RL, relying on the behavior cloning of actions, and find that this approach is effective across several offline benchmarks, including raw-state and pixel-based ones. Our approach can be combined with any existing Offline RL pipeline with minimal changes.
• Behavior Prior Representation (BPR) is theoretically grounded: we show, under usual assumptions, that policy improvement guarantees from Offline RL algorithms are retained through BPR, at the only expense of an additive behavior cloning error term.
• We provide extensive empirical studies, comparing BPR to several state representation objectives for Offline RL, and show that it outperforms the baselines across a wide range of tasks.

2. RELATED WORK

Offline RL with behavior regularization. Although to the best of our knowledge, we are the first to leverage behavior cloning (BC) to learn a state representation in Offline RL, we remark that combining Offline RL with behavior regularization has been considered previously by many works. A common way of combining BC with RL is to utilize it as a reference for policy optimization with baseline methods, such as natural policy gradient (Rajeswaran et al., 2018) , DDPG (Nair et al., 2018; Goecks et al., 2020) , BCQ (Fujimoto et al., 2019) , SPIBB (Laroche et al., 2019a; Nadjahi et al., 2019; Simão et al., 2020; Satija et al., 2021; Brandfonbrener et al., 2022) , CQL (Kumar et al., 2020) , and TD3 (Fujimoto & Gu, 2021) . Other previous works include learning adaptive behavior policies that are biased towards higher-rewards trajectories (Cang et al., 2021) , and pretraining behavior policies to guide downstream policies training (Zhang et al., 2021b) . However, all these approaches neglect the potential of using the behavior policy to guide representation learning. In contrast, we investigate state representation learning via a BC-style approach and show that the downstream policy can be greatly boosted in that regime. Representation learning in Offline RL. Pretraining representation has been recently studied in Offline RL settings, where several studies presented its effectiveness (Arora et al., 2020; Schwarzer et al., 2021a; Nachum & Yang, 2021a) . Some typical auxiliary tasks for pretraining state representations include capturing the dynamical (Nachum & Yang, 2021b) and temporal (Schwarzer et al., 2021a) information of the environment, exploring the attentive knowledge on sub-trajectory (Yang & Nachum, 2021) , improving policy performance by applying data augmentations techniques to the pixel-based inputs (Chen et al., 2021; Lu et al., 2022) ... 
While driven by various motivations, most of these methods can be thought of as including different inductive biases via the specific design of the target properties of the representation. In this paper, the inductive bias we enforce is that the representation should allow matching the behavior action normalized to the unit hypersphere. We demonstrate its effectiveness on tasks with both raw-state and visual-observation inputs. Additional related works are discussed in Appendix B.

3. PRELIMINARIES

Offline RL We consider the standard Markov decision process (MDP) framework, in which the environment is given by a tuple $M = (S, A, T, \rho, r, \gamma)$, with state space $S$, action space $A$, transition function $T$ that decides the next state $s' \sim T(\cdot \mid s, a)$, initial state distribution $\rho$, reward function $r(s, a)$ bounded by $R_{\max}$, and a discount factor $\gamma \in [0, 1)$. The agent in state $s \in S$ selects an action $a \in A$ according to its policy, mapping states to a probability distribution over actions: $a \sim \pi(\cdot \mid s)$. We make use of the state value function $V^{\pi}(s) = \mathbb{E}_{M,\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0 = s\right]$ to describe the long-term discounted reward of policy $\pi$ starting at state $s$. In Offline RL, we are given a fixed dataset of environment interactions that includes $N$ transition samples, i.e., $D = \{s_i, a_i, s'_i, r_i\}_{i=1}^{N}$. We assume that the dataset $D$ is generated i.i.d. from a distribution $\mu(s, a)$ that specifies the effective behavior policy $\pi_\beta(a \mid s) = \mu(s, a) / \sum_{a} \mu(s, a)$, and denote by $d^{\pi_\beta}(s)$ its discounted state occupancy density. Following Rashidinejad et al. (2021), the goal of Offline RL is to minimize the suboptimality of $\pi$ with respect to the optimal policy $\pi^{\star}$ given a dataset $D$:

$$\mathrm{SubOpt}(\pi) = \mathbb{E}_{D \sim \mu}\left[J(\pi^{\star}) - J(\pi)\right] = \mathbb{E}_{D \sim \mu}\left[\mathbb{E}_{s_0 \sim \rho}\left[V^{\star}(s_0) - V^{\pi}(s_0)\right]\right],$$

where $J(\pi) = \mathbb{E}_{s_0 \sim \rho}\left[V^{\pi}(s_0)\right]$ is the performance of the policy $\pi$.

State Representation Learning

In this paper, our objective is to find a state representation function $\phi : S \to Z$ that maps each state $s \in S$ to its representation $z = \phi(s) \in Z$. The desired representations should provide necessary and useful information to summarize the task-relevant knowledge and facilitate policy learning. To investigate the capacity of state representation learning for mitigating the "implicit under-parameterization" phenomenon, following Lyle et al. (2022) and Kumar et al. (2021a), we measure the compactness of the feature in the penultimate layer of the value network by utilizing the Effective Dimension, where we refer to the output of the penultimate layer of the state-action value network as the feature matrix $\Psi \in \mathbb{R}^{|S||A| \times d}$:

Definition 1. Effective Dimension. Let $\mathrm{ED}(M)$ denote the eigenvalues of a square matrix $M$, $D$ be the offline dataset, and $\epsilon$ be a fixed hyperparameter. Then, the effective dimension of $\Psi$ is defined as

$$\zeta(\Psi, D, \epsilon) = \mathbb{E}_{D}\left[\,\left|\left\{\sigma \in \mathrm{ED}\left(|S|^{-1}|A|^{-1}\Psi^{\top}\Psi\right) \mid \sigma > \epsilon\right\}\right|\,\right].$$

It is the expected number of eigenvalues of $\Psi^{\top}\Psi$ that are larger than $|S||A|\epsilon$. By studying this quantity, we can explicitly observe the usefulness of the state representation objective in mitigating the "implicit under-parameterization" problem in value networks (see Section 6.3 for the results).

Given an offline dataset $D$ consisting of $(s, a)$ pairs, our goal is to learn an encoder $\phi(s)$ that produces state representations $z = \phi(s)$ allowing efficient and successful downstream policy learning. During pre-training, the BPR network is comprised of two connected components: an encoder $\phi_\theta$ and a predictor $f_\omega$. The encoder maps the state to the representation space, while the predictor projects the representations onto the unit sphere in dimension $|A|$. In state $s$, the BPR network outputs a representation $z_\theta = \phi_\theta(s)$, and a prediction $y = f_\omega(z_\theta)$.
We then $\ell_2$-normalize the prediction $y$ and the action $a$ to $\bar{y} = y / \|y\|_2$ and $\bar{a} = a / \|a\|_2$, and minimize the mean squared error to train the state representation (or equivalently maximize the cosine similarity between $y$ and $a$):

$$L_{\theta,\omega} = \mathbb{E}_{(s,a) \sim D}\left[\|\bar{y} - \bar{a}\|_2^2\right] = \mathbb{E}_{(s,a) \sim D}\left[2 - 2 \cdot \frac{\langle f_\omega(\phi_\theta(s)), a \rangle}{\|f_\omega(\phi_\theta(s))\|_2 \cdot \|a\|_2}\right],$$

where we set the action from the pair $(s, a)$ as the target. During the training process, stochastic optimization is performed to minimize $L_{\theta,\omega}$ with respect to $\theta$ and $\omega$. Note that although the encoder and predictor are updated together through this optimization procedure, only the encoder $\phi_\theta$ is used in the downstream task. At the end of the training, we keep the encoder $\phi_\theta$ fixed and build the downstream Offline RL agents on top of the state representation that BPR learnt.

Implementation details The BPR method does not rely on any specific architecture for its encoder network. Depending on the nature of the given inputs, it can either be a convolutional neural network (CNN) for visual observation inputs, or a multi-layer perceptron (MLP) for physically meaningful state inputs. The representation $z_\theta$ that corresponds to the output of the encoder is then projected to the action space. In this paper, the projection is done by an MLP consisting of two linear layers followed by rectified linear units (ReLU) (Nair & Hinton, 2010), and a final linear layer followed by a tanh activation layer.
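The objective above can be illustrated with a minimal NumPy sketch. It only evaluates the loss on a batch and checks numerically that the squared-error form and the cosine form agree; the function names (`l2_normalize`, `bpr_loss`, `cosine_form`) and the random stand-in data are assumptions made for illustration and are not taken from the released code.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Project each row of x onto the unit sphere (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def bpr_loss(pred, actions):
    """Mean squared l2 distance between normalized predictions and normalized actions."""
    y_bar = l2_normalize(pred)
    a_bar = l2_normalize(actions)
    return float(np.mean(np.sum((y_bar - a_bar) ** 2, axis=-1)))


def cosine_form(pred, actions):
    """Equivalent form: 2 - 2 * cosine similarity, averaged over the batch."""
    cos = np.sum(l2_normalize(pred) * l2_normalize(actions), axis=-1)
    return float(np.mean(2.0 - 2.0 * cos))


# Hypothetical stand-ins: `pred` plays the role of f_omega(phi_theta(s)),
# `actions` the role of the dataset actions a.
rng = np.random.default_rng(0)
pred = rng.normal(size=(32, 6))
actions = rng.normal(size=(32, 6))
```

Because both vectors are normalized, the loss depends only on the direction of the predicted action, not on its magnitude, and it lies in the interval [0, 4].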

5. THEORETICAL ANALYSIS

For simplicity, we consider the optimization problem of BPR without the normalization term as:

$$\min\left\{\frac{1}{n}\sum_{i=1}^{n}\|f(\phi(s_i)) - a_i\|_2^2 \;:\; \phi \in \Phi,\ f \in F\right\},$$

where $(s_i, a_i)$ is an i.i.d. sample from the offline dataset (of size $n$). BPR consists in learning a representation $\phi$ impacting the policy search. In other words, with BPR, the function class for the policy consists in $\Pi_{\mathrm{BPR}} \doteq \{\pi \text{ s.t. } \exists\, \omega \text{ with } \pi(\cdot \mid s) = f_\omega(\phi_\theta(s))\ \forall s\}$, where $f_\omega$ is the neural model parameterized with $\omega$ characterizing the policy on top of embedding $\phi_\theta(\cdot)$. Like any representation learning technique, the potential benefits are (i) enhancing the signal-to-noise ratio and (ii) reducing the size of the policy search. In this section, we develop an analysis showing that the potential harm of using BPR is upper bounded by an error $\epsilon_\beta$ that we control in Section 5.3. To our knowledge, this is the first representation learning technique for Offline RL with such guarantees.

Idealized assumptions: Letting $\hat{x}$ denote an estimate of a quantity $x$ computed using $D$, we start by considering the following idealized assumptions:

Assumption 1. (1.1) Access to the true behavior: $\hat{\pi}_\beta = \pi_\beta$. (1.2) Access to the true performance of policies: $\hat{J}(\pi) = J(\pi)$ for all policies $\pi$. (1.3) The embedding $\phi$ allows to represent the behavior policy estimate: $\hat{\pi}_\beta(a \mid \phi(s)) = \hat{\pi}_\beta(a \mid s) \in \Pi_{\mathrm{BPR}}$. (1.4) The Offline RL algorithm performs perfect optimization on top of $\phi$: $J_{\mathrm{BPR}} = \max_{\pi \in \Pi_{\mathrm{BPR}}} \hat{J}(\pi)$.

Theorem 1. Under idealized Assumption 1, BPR returns a policy that improves over the behavior policy: $J_{\mathrm{BPR}} \geq J(\pi_\beta)$.

Its proof is immediate: under perfect estimation and optimization, the fact that $\hat{\pi}_\beta \in \Pi_{\mathrm{BPR}}$ guarantees policy improvement.
The above assumptions are stringent, and we propose to relax them in two different ways: for conservative algorithms that derive safe policy improvement guarantees (Petrik et al., 2016; Fujimoto et al., 2019; Laroche et al., 2019b; Simão et al., 2020), and for pessimistic algorithms that derive a value function lower bound (Kumar et al., 2020; Buckman et al., 2021). We note that the policy improvement is constrained by the set $\Pi_{\mathrm{BPR}}$ over which the policy search is conducted. Appendix G.6 empirically shows that the method will fail to improve over the behavioral policy in the extreme case where the behavioral policy is uniform and therefore uninformative. The BPR efficiency therefore strongly relies on the assumption that the behavioral policy provides a beneficial inductive bias.

5.1. SAFE POLICY IMPROVEMENT

Conservative algorithms constrain the policy search to the set of policies $\Pi_{\mathrm{PI}}$ for which the true policy improvement $\Delta_{\pi,\hat{\pi}_\beta} \doteq J(\pi) - J(\hat{\pi}_\beta)$ can be safely estimated from $\hat{\Delta}_{\pi,\hat{\pi}_\beta} \doteq \hat{J}(\pi) - \hat{J}(\hat{\pi}_\beta)$:

$$\forall \pi \in \Pi_{\mathrm{PI}}, \quad \hat{\Delta}_{\pi,\hat{\pi}_\beta} - \Delta_{\pi,\hat{\pi}_\beta} \leq \epsilon_\Delta.$$

It is worth noting that the estimated behavior policy $\hat{\pi}_\beta$ necessarily belongs to $\Pi_{\mathrm{PI}}$, since trivially $\hat{\Delta}_{\hat{\pi}_\beta,\hat{\pi}_\beta} = \Delta_{\hat{\pi}_\beta,\hat{\pi}_\beta} = 0$. Combined with BPR, conservative algorithms optimize on the policy set $\Pi_{\mathrm{PI}} \cap \Pi_{\mathrm{BPR}}$, and we consider the following relaxed assumptions, corresponding respectively to Assumptions 1.1 and 1.2:

Assumption 2. (2.1) Access to an accurate estimate $\hat{\pi}_\beta$ of the true behavior $\pi_\beta$: $J(\pi_\beta) - J(\hat{\pi}_\beta) \leq \epsilon_\beta$. (2.2) Access to an accurate estimate $\hat{\Delta}_{\pi,\hat{\pi}_\beta}$ of the true policy improvement $\Delta_{\pi,\hat{\pi}_\beta}$ for all policies $\pi \in \Pi_{\mathrm{PI}} \cap \Pi_{\mathrm{BPR}}$: $\hat{\Delta}_{\pi,\hat{\pi}_\beta} - \Delta_{\pi,\hat{\pi}_\beta} \leq \epsilon_\Delta$.

Theorem 2. Under Assumption 2, BPR returns a policy $\pi_{\mathrm{BPR}}$ with the following performance bounds:

$$J_{\mathrm{BPR}} \geq J(\pi_\beta) + \hat{\Delta}_{\pi_{\mathrm{BPR}},\hat{\pi}_\beta} - \epsilon_\Delta - \epsilon_\beta.$$

Assumption 2.1 amounts to behavior cloning (BC) guarantees. It is generally assumed to be an easier task than policy optimization, as BC is supervised learning. Theorem 4 provides bounds on $\epsilon_\beta$. Assumption 2.2 reflects the policy improvement objective. For instance, the SPIBB principle guarantees it with high probability in finite MDPs (Laroche et al., 2019b; Nadjahi et al., 2019; Simão et al., 2020). It is important to notice that Assumption 2.2 forbids, in theory, the use of the BPR representation to estimate the value directly. But in practice, we show that the representation learned by BPR is still useful to estimate the value function. We provide some experimental results in Appendix G.7 and a counterexample in Appendix F. Additionally, it is worth noting that ensuring that $\hat{\Delta}_{\pi_{\mathrm{BPR}},\hat{\pi}_\beta}$ is positive is necessary to obtain approximate policy improvement guarantees. To do so, it suffices to fall back to $\hat{\pi}_\beta$ when the optimization of $\pi_{\mathrm{BPR}}$ does not lead to a positive expected improvement $\hat{\Delta}_{\pi_{\mathrm{BPR}},\hat{\pi}_\beta}$.
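The fallback rule described at the end of this subsection, together with the lower bound of Theorem 2, can be sketched in a few lines of Python. The function names and the numeric values are hypothetical and only illustrate the arithmetic; they are not part of any Offline RL library.

```python
def select_policy(candidate, behavior_estimate, estimated_improvement):
    """Keep the optimized policy only if its estimated improvement is positive.

    Otherwise fall back to the estimated behavior policy, whose estimated
    improvement over itself is zero by definition.
    """
    if estimated_improvement > 0.0:
        return candidate
    return behavior_estimate


def performance_lower_bound(j_behavior, estimated_improvement, eps_delta, eps_beta):
    """Right-hand side of the Theorem 2 bound.

    j_behavior is J(pi_beta); eps_delta and eps_beta are the two error terms.
    """
    return j_behavior + estimated_improvement - eps_delta - eps_beta


# Hypothetical numbers, for illustration only.
chosen = select_policy("pi_bpr", "pi_beta_hat", estimated_improvement=0.3)
bound = performance_lower_bound(10.0, 0.5, eps_delta=0.1, eps_beta=0.2)
```

With the fallback in place, the estimated improvement entering the bound is never negative, so the guarantee can only degrade through the two error terms.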

5.2. VALUE FUNCTION LOWER BOUND

In this section, we propose an alternative to Assumption 2.2 following the performance lower bound principle derived in pessimistic algorithms:

Assumption 3. (3.1) Access to a lower bound of the performance $J_\perp(\pi) \leq J(\pi)$ for all policies, which is accurate for $\hat{\pi}_\beta$: $J_\perp(\hat{\pi}_\beta) \geq J(\hat{\pi}_\beta) - \epsilon_\perp$.

In order to provide performance improvement bounds, we make use of the lower bound value gap $\Delta^{\perp}_{\pi,\hat{\pi}_\beta} \doteq J_\perp(\pi) - J_\perp(\hat{\pi}_\beta)$.

Theorem 3. Under Assumptions 2.1 and 3.1, BPR returns a policy $\pi_{\mathrm{BPR}}$ with the following performance bounds:

$$J_{\mathrm{BPR}} \geq J(\pi_\beta) + \Delta^{\perp}_{\pi_{\mathrm{BPR}},\hat{\pi}_\beta} - \epsilon_\perp - \epsilon_\beta.$$

Assumption 3.1 is satisfied by the Offline RL algorithm that is used in combination with BPR. For instance, CQL (Kumar et al., 2020; Yu et al., 2021) relies on the computation of a lower bound of the value function of the considered policies. More generally, pessimistic algorithms (Petrik et al., 2016; Yu et al., 2020; Kidambi et al., 2020; Jin et al., 2021; Buckman et al., 2021) add, by principle, an uncertainty-based penalty to the reward function, in order to control the risk of overestimating it. It is important to notice that, like for Assumption 2.2, this assumption would forbid (in theory) the use of the BPR representation to estimate the value. Both approaches (safe policy improvement and value function lower bounds) suffer the same cost $\epsilon_\beta$, in comparison to their guarantees without BPR. The next subsection controls this quantity.
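One way to see where the two error terms of Theorem 3 come from is to chain the assumptions, writing $\hat{\pi}_\beta$ for the estimated behavior policy and $\pi_{\mathrm{BPR}}$ for the returned policy. This is only a sketch under the assumptions as stated in this section, not a substitute for the proofs in Appendix E:

```latex
\begin{aligned}
J(\pi_{\mathrm{BPR}})
  &\geq J_\perp(\pi_{\mathrm{BPR}})
  && \text{(lower bound, Assumption 3.1)} \\
  &= J_\perp(\hat{\pi}_\beta) + \Delta^{\perp}_{\pi_{\mathrm{BPR}},\hat{\pi}_\beta}
  && \text{(definition of the lower bound value gap)} \\
  &\geq J(\hat{\pi}_\beta) - \epsilon_\perp + \Delta^{\perp}_{\pi_{\mathrm{BPR}},\hat{\pi}_\beta}
  && \text{(accuracy at } \hat{\pi}_\beta\text{, Assumption 3.1)} \\
  &\geq J(\pi_\beta) - \epsilon_\beta - \epsilon_\perp + \Delta^{\perp}_{\pi_{\mathrm{BPR}},\hat{\pi}_\beta}
  && \text{(Assumption 2.1)}
\end{aligned}
```

The behavior cloning error $\epsilon_\beta$ enters only in the last step, which is why it appears as the same additive cost in both the conservative and the pessimistic setting.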

5.3. PERFORMANCE BOUND WITH BPR OBJECTIVE

In this section, we are concerned with the statistical property of the error between the estimated behavior based on the representation and the true behavior. To this end, we upper bound $\epsilon_\beta$ as follows:

Theorem 4. With probability at least $1 - \delta$, for any $\delta \in (0, 1)$:

$$\epsilon_\beta \leq CK \cdot \frac{1}{n}\sum_{i=1}^{n}\left\|\pi_\beta(\cdot \mid s_i) - \hat{\pi}_\beta(\cdot \mid \phi(s_i))\right\|_2 + 2\sqrt{2}K \cdot \mathrm{Rad}(\Phi) + K \cdot \sqrt{\frac{2\ln\frac{1}{\delta}}{n}} \tag{6}$$

where $n$ is the size of the dataset, $\mathrm{Rad}(\Phi)$ is the Rademacher complexity of $\phi$'s function class $\Phi$, $\pi_\beta$ is the behavior policy over the dataset, $C$ is a constant, and $K = \frac{R_{\max}}{1 - \gamma}$.

Note that the first term in Equation 6 is the exact optimization problem of BPR in Equation 4 multiplied by the constant $C \cdot K$, where the action $a_i$ is sampled from the offline dataset pairs $(s_i, a_i)$ whose behavior policy is $\pi_\beta$, and the predictor $f$ plays a role similar to that of $\hat{\pi}_\beta$. The second and third terms are both independent of the specific representation and of the estimated behavior. This indicates that the potential harm of utilizing BPR can be reduced as the representation training procedure goes on.

6. EXPERIMENTS

BPR can be easily combined with any existing Offline RL pipeline. Our experimental results are organized around the following questions:
• Does BPR outperform baseline Offline RL algorithms, and other representation objectives, on standard raw-state input benchmarks?
• Is BPR effective in learning policies on high-dimensional pixel-based inputs? Can BPR improve the robustness of the representation when the input contains complex distractors?
• Can BPR successfully improve the effective dimension of the feature in the value network?

Performance Comparison in D4RL Benchmark

Experimental Setup: We analyze our proposed method BPR on the D4RL benchmark (Fu et al., 2020) of OpenAI gym MuJoCo tasks (Todorov et al., 2012), which includes a variety of datasets that have been commonly used in the Offline RL community. We evaluate our method by integrating it into three Offline RL methods: TD3+BC (Fujimoto & Gu, 2021), CQL (Kumar et al., 2020), and EDAC (An et al., 2021). We consider three environments: halfcheetah, hopper and walker2d, and two datasets per task: expert and medium-expert. Instead of passing the raw state as input to the value and policy networks as in the baseline methods, we first pretrain the encoder for 100k timesteps, then freeze it, pass the raw state through the frozen encoder to obtain its representation, and use that as the input of the Offline RL algorithm. Further details on the experiment setup are included in Appendix G.

Analysis: Figure 2 shows the performance of all models on the D4RL tasks. We observe that pretraining the encoder with BPR leads to faster convergence for all Offline RL algorithms, and can improve policy performance. Notably, since different Offline RL algorithms share the same encoder network architecture, the pretrained BPR encoder can be reused across them, which substantially amortizes the pretraining cost.
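The "pretrain, freeze, then encode" step described in the setup can be sketched as follows. The single tanh layer is a deliberately tiny stand-in for the pretrained encoder, and `FrozenEncoder` and `encode_transitions` are hypothetical names introduced here for illustration; the actual architectures are those described in the paper.

```python
import numpy as np


class FrozenEncoder:
    """Stand-in for a pretrained encoder whose parameters are no longer updated."""

    def __init__(self, weight, bias):
        # Copy the pretrained parameters and mark them read-only.
        self.weight = np.array(weight, dtype=float)
        self.bias = np.array(bias, dtype=float)
        self.weight.setflags(write=False)
        self.bias.setflags(write=False)

    def __call__(self, states):
        # One linear layer followed by tanh, used here only as a placeholder.
        return np.tanh(states @ self.weight + self.bias)


def encode_transitions(encoder, states, next_states):
    """Replace raw states by their representations before downstream training."""
    return encoder(states), encoder(next_states)


# Hypothetical shapes: 3-dimensional raw states, 4-dimensional representations.
enc = FrozenEncoder(np.ones((3, 4)), np.zeros(4))
z, z_next = encode_transitions(enc, np.zeros((5, 3)), np.ones((5, 3)))
```

Since the encoder is fixed, the encoded dataset can be computed once and shared by every downstream algorithm that uses the same encoder, which is the amortization noted in the analysis above.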
(2021), and use the inter-quartile mean (IQM) normalized score, which is calculated over runs rather than tasks, and the percentile bootstrap confidence intervals, to compare the performance of the different objectives. The detailed results are provided in Appendix G. Figure 3 presents the IQM normalized return and 95% bootstrap confidence intervals for all methods; the numerical values can be found in Table 1. The performance gain of BPR over the two most competitive baselines, Fourier and ACL, is statistically significant (p < 0.05).

Performance comparison in V-D4RL benchmark

Experimental Setup: We compare our method with five representation objectives that are considered state-of-the-art, on a benchmarking suite for Offline RL from visual observations of DMControl suite (DMC) tasks (Lu et al., 2022). We note that not all methods have been shown to be effective for visual-based continuous control problems. The baselines include: (i) temporal contrastive learning methods such as DRIML (Mazoure et al., 2020) and HOMER (Misra et al., 2020); (ii) a spatial contrastive approach, CURL (Laskin et al., 2020); (iii) a one-step inverse action prediction model, Inverse model (Pathak et al., 2017); and (iv) a representation module which combines self-predictive, goal-conditioned RL and inverse model, namely SGI (Schwarzer et al., 2021a).
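For readers unfamiliar with the IQM score and percentile bootstrap intervals used in the D4RL comparison above, the following NumPy sketch shows one simple way to compute them over a pooled set of normalized run scores. It uses a plain (non-stratified) bootstrap, which is an assumption of this sketch and may differ from the exact evaluation protocol used for Figure 3; the scores are made-up numbers.

```python
import numpy as np


def iqm(scores):
    """Inter-quartile mean: average of the middle 50% of the sorted scores."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(np.floor(0.25 * s.size))  # number of scores trimmed on each side
    return float(s[cut:s.size - cut].mean())


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the IQM over runs."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float).ravel()
    stats = [
        iqm(rng.choice(scores, size=scores.size, replace=True))
        for _ in range(n_boot)
    ]
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)


# Made-up normalized scores from eight runs, with one outlier.
run_scores = [1, 2, 3, 4, 5, 6, 7, 100]
point = iqm(run_scores)
lo, hi = bootstrap_ci(run_scores)
```

Trimming the top and bottom quarter makes the statistic far less sensitive to a single outlying run than the mean, while still using more of the data than the median.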

6.2. EFFECTIVENESS OF BEHAVIOR REPRESENTATIONS IN VISUAL OFFLINE RL BENCHMARK

We evaluate all representation objectives by integrating the pre-trained encoder from each into an Offline RL method DrQ+BC (Lu et al., 2022) , which combines data augmentation techniques with the TD3+BC method (TD3 with a regularizing behavioral-cloning term to the policy loss). Analysis As shown in Figure 4 , BPR consistently improves over the other state-of-the-art algorithms, except for SGI in the walker_walk task, while SGI is the most time-consuming representation objective of all methods. We evaluate the wall-clock training time of each representation objective for the 100K time steps. Our approach compares favorably against all previous methods. In particular, BPR is 2.5x faster than the most competitive SGI method. This suggests that the BPR objective can improve sample efficiency more effectively than the other representation objectives.

Robustness to Visual Distractions Experimental Setup:

To test the policy robustness brought by BPR, we use three levels of distractors: easy, medium, and hard, to evaluate the performance of the model, following Lu et al. (2022) . Each distraction represents a shift in the data distribution, where task-irrelevant visual factors are added (i.e., backgrounds, agent colors, and camera positions). Those factors of variation disturb the model training (see the demonstrations provided in Appendix G.8). We train the baseline agent, DrQ+BC, and its variant using a fixed encoder pretrained by BPR, on the three levels of the distraction of the cheetah-run-medium-expert dataset with 1 million data points. In the experiment, the agent is evaluated on three different test environments: i) Original Env, the evaluation environment is the original environment i.e., without any distractors; ii) Distractor Train, the evaluation environment has the same distraction configuration as the training dataset, and the configuration is fixed during the evaluation procedure; iii) Distractor Test, the specific distractions of the evaluation environment changes over evaluation, while the level of the distractions remains the same. Analysis: The final performances are shown in Table 2 . As can be seen, BPR improves the policy performance in most cases, except for the task with a combination of the hard level and the distractor test setting where the averaged performance of the agent with BPR is close to the one without BPR but has a lower variance. Besides, the different evaluation environment corresponds to different degrees of the distribution shift from the training dataset. It is therefore not surprising to see that the agent performs better in Distractor train environment than the others. 
This experiment shows that more robust policies can be obtained against distractions with the help of BPR, indicating that the BPR objective can, to some extent, disentangle sources of distractions through training, facilitating the learning of general features.

Experimental Setup: We investigate whether utilizing the fixed encoder learned from BPR can alleviate the "implicit under-parameterization" phenomenon of the value network. To this end, we propose to use the effective dimension (defined in Definition 1) as a way to evaluate the compactness of features represented in the value network. We first sample a batch of state-action pairs from the offline dataset and then compute the effective dimension of the output of the penultimate layer of the value network. In this experiment, we evaluate the effective dimension of the value network of the CQL agent on the halfcheetah-medium-expert-v2 task in the D4RL datasets, which operates on physically meaningful raw-state inputs. For comparison, we also develop a variant of CQL whose input is still the raw state, but where a shared encoder head is applied to both the value network and the policy network and is trained with the critic loss along the policy learning procedure.

Analysis: Figure 5 illustrates that over the course of training, the BPR objective successfully improves the effective dimension of the state representation, while achieving better performance compared to the agent without an encoder and the agent whose encoder is trained with the critic loss. Notably, when training the encoder via the critic loss along the policy learning procedure, the effective dimension increases significantly early in training, suggesting that even without any specific representation loss, the encoder can still disentangle some similar states. As training goes on, the agent observes more diverse states as the policy improves, which induces a mismatch between the learnt representations and the newly observed states, leading to a decline in the effective dimension. In contrast, with a fixed encoder learned using the BPR objective, CQL produces a higher effective dimension than its vanilla version, leading to better performance, which indicates that BPR is capable of providing compact information for downstream Offline RL training.
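The measurement described in this experiment can be sketched with NumPy as follows. The sketch estimates the effective dimension of Definition 1 on a single sampled batch, replacing the normalization by the number of state-action pairs with the batch size; the threshold value and the synthetic feature matrices are assumptions chosen for illustration.

```python
import numpy as np


def effective_dimension(features, eps=0.01):
    """Count eigenvalues of (1/n) * Psi^T Psi that exceed eps.

    features: array of shape (n, d) holding penultimate-layer outputs for
    n sampled state-action pairs.
    """
    n = features.shape[0]
    gram = features.T @ features / n
    eigvals = np.linalg.eigvalsh(gram)  # symmetric matrix, real eigenvalues
    return int(np.sum(eigvals > eps))


rng = np.random.default_rng(0)
# Collapsed features: 8 output units, but only 2 independent directions.
low_rank = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 8))
# Well-spread features: all 8 directions carry variance.
full_rank = rng.normal(size=(256, 8))

ed_low = effective_dimension(low_rank)
ed_full = effective_dimension(full_rank)
```

A value network whose features collapse onto a few directions yields a small count even when the layer is wide, which is the low-rank behavior that the experiment tracks over training.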

7. DISCUSSION

Limitations and Future Work Although this work has shown effectiveness in learning representations that are robust to visual distractions, it still struggles when the distribution of evaluation environments shifts too much from the training environment. This suggests that further improvement of the generalization abilities of the representation is required. Another limitation related to the nature of the approach is that if the behavioral policy is uninformative, then our approach will likely result in a policy that has the same performance as the behavior, without any improvement. However, it is straightforward to leverage information about the behavior (either a priori or from statistics on the returns in the dataset) to decide when to use BPR. From these perspectives, further development of a more theoretically grounded representation objective might be needed.

Conclusion

Offline RL algorithms typically suffer from two main issues: one is the difficulty of learning from high-dimensional visual input data, the other is the implicit under-parameterization phenomenon that can lead to low-rank features of the value network, further resulting in low performance. In this work, we propose a simple yet effective method for state representation learning in an Offline RL setting which can be combined with any existing Offline RL pipeline with minimal changes.

8. REPRODUCIBILITY STATEMENT

To ensure the reproducibility of all empirical results, we provide the code base in the supplementary material, which contains training scripts and requirements for the virtual environment. All datasets and code platforms (PyTorch and TensorFlow) we use are public. To rebuild the architecture of the BPR model and plug BPR into any Offline RL algorithm, readers are advised to check the descriptions in Section 4, especially the Implementation details paragraph. All proofs are stated in Appendix E with explanations and underlying assumptions. We also provide the pseudocode of the pretraining process and the co-training process in Algorithms 1 and 2 in the Appendix. All training details are specified in Section 6 and Appendix G. In the experiment on D4RL tasks, all representation objectives use the same encoder architecture, i.e., a 4-layer MLP activated by ReLU, followed by another linear layer activated by Tanh, where the final output feature dimension of the encoder is 256. Besides, all representation objectives follow the same optimizer settings, pre-training data, and number of pre-training epochs. In the experiment on V-D4RL tasks, all representation objectives use the same encoder architecture, i.e., four convolutional layers activated by ReLU, followed by a linear layer normalized by LayerNorm and activated by Tanh, where the final output feature dimension of the encoder is 256. We also provide the pseudocode for each visual representation objective in Algorithms 4-7 in the Appendix.

A NOTATION

(Andre & Russell, 2002; Ferns et al., 2006; Mannor et al., 2004; Comanici et al., 2012), and most of these methods aim to reduce the original state space size and to minimize the system complexity. Recent studies present promising results in learning robust representations that can then be used to accelerate policy learning in pixel-based observation spaces.

C COMPARISON

C.1 BEHAVIOR CLONING

One might consider BPR similar to behavior cloning, which imitates the behavior policy of the data, since both learn a projection from state to action. Despite resembling behavior cloning in form, we emphasize that the goal of BPR is not to approximate the exact behavior policy. Instead, it is a two-stage process that reuses the learned encoder on downstream tasks to accelerate policy learning, while discarding the initial projection layer. Additionally, this allows us to apply ℓ2-normalization to the prediction and the action rather than using their true values. This normalization has been shown to be important for representation learning (Wang & Isola, 2020; Grill et al., 2020; Zang et al., 2022). The difference is illustrated in Figure 6.
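The contrast between the two objectives can be sketched in a few lines. This is a minimal illustration, assuming that the BPR loss is the squared distance between the ℓ2-normalized prediction and the ℓ2-normalized action, and that plain behavior cloning regresses on the raw action values; the function names `l2_normalize`, `bpr_loss`, and `bc_loss` are hypothetical and not taken from the released code.

```python
import math

def l2_normalize(vec, eps=1e-8):
    """Scale a vector to unit Euclidean norm (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def bpr_loss(predictions, actions):
    """Assumed BPR-style loss: squared distance between normalized vectors."""
    total = 0.0
    for pred, act in zip(predictions, actions):
        p, a = l2_normalize(pred), l2_normalize(act)
        total += sum((pi - ai) ** 2 for pi, ai in zip(p, a))
    return total / len(predictions)

def bc_loss(predictions, actions):
    """Plain behavior-cloning regression loss on the raw action values."""
    total = 0.0
    for pred, act in zip(predictions, actions):
        total += sum((pi - ai) ** 2 for pi, ai in zip(pred, act))
    return total / len(predictions)

# Toy batch: the first prediction has the right direction but the wrong scale,
# the second points in the opposite direction of the action.
preds = [[2.0, 0.0], [0.0, 3.0]]
acts = [[1.0, 0.0], [0.0, -1.0]]
loss_bpr = bpr_loss(preds, acts)
loss_bc = bc_loss(preds, acts)
```

In this toy batch the normalized loss ignores the scale mismatch of the first pair and only penalizes the direction error of the second, whereas the raw regression loss penalizes both.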

C.2 π * -IRRELEVANCE ABSTRACTION

A π*-irrelevance abstraction $\phi_{\pi^*}$ (Jong & Stone, 2005; Li et al., 2006) is such that every abstract class has an action $a^*$ that is optimal for all the states in that class, that is, $\phi_{\pi^*}(s_1) = \phi_{\pi^*}(s_2)$ implies that $Q^*(s_1, a^*) = \max_a Q^*(s_1, a)$ and $Q^*(s_2, a^*) = \max_a Q^*(s_2, a)$, which attempt to preserve the optimal action. (To keep the notation aligned with Li et al. (2006), we replace $\pi^\star$ with $\pi^*$ and $\mathcal{M}$ with $M$ in this section.) In Li et al. (2006), the π*-irrelevance abstraction was applied to Q-learning, and it was proven that the induced value function could not converge to the optimal one in the ground MDP. Compared with BPR, the biggest difference is that, in theory, as described in Assumption 2.2, we use the BPR representation only for the policy search, while Li et al. (2006) assume that the value function should be estimated for learning reasonable Q-values in the representation space. As a consequence, Li et al. (2006) require much stronger assumptions for the policy abstractions, such as bisimulation or invariance of the optimal value (policy). As an example, consider the MDP shown in Figure 7 (taken from the MDP investigated by Li et al. (2006)): the π*-irrelevance abstraction induces an abstract MDP $\bar{M} = \langle \bar{S}, A, \bar{P}, \bar{R}, \gamma \rangle$, on which Q-learning is applied. Under this abstract MDP, since $S_1$ and $S_2$ are accidentally aggregated into one abstract state, the abstract reward $\bar{R}$ and the abstract transition $\bar{P}$ will be estimated incorrectly, and therefore the Q-value cannot be updated to the optimal one. Notably, the value estimation we perform on the original input space, which belongs to the ground MDP state space, differs from the value estimation on the representation space, which raises issues that can only be alleviated with the drastic assumptions in Li et al. (2006).

Figure 6: A summary of the BPR algorithm that shows its difference from Behavior Cloning. BPR focuses on learning a state representation to benefit the downstream Offline RL algorithms instead of concentrating on learning the behavior policy.

Figure 7: The solid and dashed lines represent actions 1 and 2, respectively. The corresponding graph shows the value function in the aggregated (middle) state for each of its actions. $\phi_{\pi^*}$ yields an optimal policy for $\bar{M}$ that is suboptimal in $M$, while BPR can still find the optimal policy in $M$ by policy search.
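The definition of a π*-irrelevance abstraction given at the start of this subsection can be checked mechanically when a tabular $Q^*$ is available. The sketch below is only an illustration under that assumption; the helper `is_pi_star_irrelevant`, the state names, and the Q-values are hypothetical and do not correspond to the MDP of Figure 7.

```python
def is_pi_star_irrelevant(phi, q_star, tol=1e-9):
    """Return True if every abstract class shares at least one optimal action.

    phi:    dict mapping ground state -> abstract state
    q_star: dict mapping ground state -> list of optimal Q-values, one per action
    """
    shared = {}  # abstract state -> actions optimal for all members seen so far
    for state, z in phi.items():
        q = q_star[state]
        best = max(q)
        optimal = {a for a, v in enumerate(q) if best - v <= tol}
        shared[z] = optimal if z not in shared else shared[z] & optimal
    return all(len(actions) > 0 for actions in shared.values())

# Hypothetical optimal Q-values for three ground states and two actions.
q_star = {"s1": [1.0, 0.5], "s2": [2.0, 0.1], "s3": [0.0, 3.0]}

# s1 and s2 both prefer action 0, so aggregating them preserves the optimal action.
ok_phi = {"s1": "z0", "s2": "z0", "s3": "z1"}
# s1 prefers action 0 but s3 prefers action 1, so this aggregation violates the definition.
bad_phi = {"s1": "z0", "s2": "z1", "s3": "z0"}

ok_result = is_pi_star_irrelevant(ok_phi, q_star)
bad_result = is_pi_star_irrelevant(bad_phi, q_star)
```

Note that passing this check says nothing about whether Q-learning on the abstract MDP converges to the optimal values, which is exactly the gap discussed above.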

C.3 FOURIER

Nachum & Yang (2021a) developed a representation objective that combines contrastive learning with a linear dynamics model (i.e., a learnable function whose input is the pair of next state and action $(s', a)$ and whose output is a representation of the current state), where the contrastive objective is approximated by leveraging Random Fourier Features. Though effective, this method assumes that the underlying dynamics model is linear, while the BPR objective does not rely on such an assumption. Since the BPR objective can be oblivious to the reward function and the transition model, it can be applied to a wide range of domains where the reward or the transition could be hard to approximate. On the other hand, without modeling the reward function or the transition dynamics, the BPR objective is slightly weaker in theory (see the counterexample in the Appendix). A combination of BPR and methods like Fourier should be further investigated in future work, which could provide a more theoretically grounded yet still simple approach.

C.4 ACL

Yang & Nachum (2021) introduced Attentive Contrastive Learning (ACL) to learn state representations. ACL uses a transformer-based architecture as its skeleton, where a subset of sub-trajectories is randomly masked to make predictions. Similar to the Fourier approach described above, ACL also utilizes a contrastive learning objective to update its feature mapping, while further reconstructing the action and the reward to stabilize the training. BPR, on the other hand, can be seen as a representation objective that, with only a reconstruction module for predicting actions, is far simpler than the transformer-based module that requires sequential prediction. Another potential benefit of the BPR objective is that it can possibly be integrated into other kinds of representation objectives due to its independence from the encoder architecture.

D EFFECTIVE DIMENSION AND FEATURE RANK

The desirable state representation should not only be able to guide a good policy learning procedure, but also be compact enough to provide concise yet effective information. We focus on measuring the "compactness" of the state representation to show that BPR suffices for the agent to benefit from learning a representation with an auxiliary representation loss, which can alleviate the "implicit under-parameterization" phenomenon.

Assumption 4. (Feature reachability) Denote $\lambda_{\min}(A)$ as the smallest eigenvalue of a positive-definite matrix $A$. With a mapping function $\phi^\star$, we assume that there exists $\epsilon \in \mathbb{R}^+$ such that

$$\sup_{\pi} \lambda_{\min}\left(\mathbb{E}_{s \sim d^{\pi_\beta}(s)}\left[\phi^\star(s)\phi^\star(s)^\top\right]\right) \geq \epsilon.$$

Assumption 4 posits that in the MDP $\mathcal{M}$, for each latent state in the representation space $\mathbb{R}^d$, there exists a policy that reaches it with non-zero probability. This ensures that the optimal state representation has no redundant dimensions, which is a reasonable assumption for a compact latent space. While a similar assumption is commonly made in previous work (Agarwal et al., 2022; Modi et al., 2021), few of them consider leveraging it to measure feature effectiveness. Based on this assumption, we define the following measurement, inspired by the feature rank defined in Lyle et al. (2022):

Definition 2. (Effective Dimension) Let $\mathrm{ED}(M)$ denote the multiset of eigenvalues of a square matrix $M$. Then the effective dimension of $\Psi$ is defined to be

$$\zeta(\Psi, D, \epsilon) = \mathbb{E}_D\left[\left|\left\{\sigma \in \mathrm{ED}\left(\left(\tfrac{1}{\sqrt{|S||A|}}\Psi\right)^\top \tfrac{1}{\sqrt{|S||A|}}\Psi\right) \,\middle|\, \sigma > \epsilon\right\}\right|\right], \quad (8)$$

where $\Psi \in \mathbb{R}^{|S||A| \times d}$ is the feature matrix with respect to the output of the penultimate layer of the state-action value network.

Lemma 5. Let $X_n \subset \mathbb{R}^{|S||A|}$ be a set of $n$ state-action pairs sampled from a fixed distribution $d^{\pi_\beta}(s, a)$, with corresponding feature matrix $\Psi_n$. A consistent estimator can be constructed as:

$$\hat{\zeta}(\Psi, D, \epsilon) = \left|\left\{\sigma \in \mathrm{ED}\left(\left(\tfrac{1}{\sqrt{n}}\Psi_n\right)^\top \tfrac{1}{\sqrt{n}}\Psi_n\right) \,\middle|\, \sigma > \epsilon\right\}\right|.$$

Proof. The following proof mimics the derivation in Lyle et al. (2022). Recall that

$$\left(\tfrac{1}{\sqrt{n}}\Psi_n\right)^\top \tfrac{1}{\sqrt{n}}\Psi_n = \frac{1}{n}\sum_{i=1}^{n} \psi(x_i)\psi(x_i)^\top.$$

The following property of the expected value holds:

$$\mathbb{E}_D\left[\psi(x)\psi(x)^\top\right] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \psi(x_i)\psi(x_i)^\top\right].$$

Then, consider each element of the matrix $M$:

$$\mathbb{E}\left[\psi(x)\psi(x)^\top\right]_{ij} = M_{ij} = \mathbb{E}\left[\psi_i(x)\psi_j(x)\right] \implies \sum_{k=1}^{n} \frac{1}{n}\psi_i(x_k)\psi_j(x_k) \xrightarrow{\text{a.s.}} M_{ij}. \quad (12)$$

Since we have convergence for any $M_{ij}$, we get convergence of the resulting empirical matrix $M_n$ to $M$. And since the eigenvalues are continuous functions of $M$, the eigenvalues of $M_n$ converge to those of $M$ almost surely.

Intuitively, it possesses a similar equivalent form of feature reachability. Empirically, we set $\epsilon$ to 0.01, so that only higher eigenvalues are considered, eliminating the distraction caused by small noise. With the effective dimension measurement, we can show that BPR stabilizes the training procedure and provides a higher effective feature dimension for the value network, thus helping to alleviate the "implicit under-parameterization" phenomenon. Notably, Equation 8 can also be used to measure the effective dimension of the state representation, i.e., $\zeta(\Phi, D, \epsilon)$, when we substitute the feature matrix $\Psi$ by the representation matrix $\Phi \in \mathbb{R}^{|S| \times |Z|}$, where $Z$ is the latent state representation space.

Connection to the effective dimension proposed in Lan et al. (2022). To develop a measurement of the effective dimension of the state representation, Lan et al. (2022) borrow the concept of µ-coherence (Candès & Recht, 2009; Mohri & Talwalkar, 2011), which is related to the statistical leverage scores (Drineas et al., 2006; Mahoney et al., 2012).

Definition 3. Given an arbitrary $n \times d$ matrix $A$, with $n > d$ and $\mathrm{rank}(A) = r$, let $U$ denote the $n \times d$ matrix consisting of the $d$ left singular vectors of $A$, and let $U_{(i)}$ denote the $i$-th row of the matrix $U$ as a row vector. Then the statistical leverage scores are given by $\|U_{(i)}\|_2^2$ for $i \in \{1, \cdots, n\}$; the coherence $\mu$ is

$$\mu(U) = \max_{i \in \{1, \cdots, n\}} \|U_{(i)}\|_2^2,$$

i.e., the largest statistical leverage score. When we let $P_U$ be the orthogonal projection onto $U$ and $(e_i)$ the canonical basis, we can define the µ′-coherence as:

$$\mu'(U) = n \max_{i} \|P_U e_i\|_2^2 = n \max_{i} \|U U^\top e_i\|_2^2 = n \max_{i} \|U^\top e_i\|_2^2 = n \max_{i} \|U_{(i)}\|_2^2, \quad i \in \{1, \cdots, n\}.$$

Following the above definition, Lan et al. (2022) defined the effective dimension as:

Definition 4 (Effective dimension in Lan et al. (2022)). Let $\Phi \in \mathbb{R}^{S \times k}$ be a feature matrix. The effective dimension of $\Phi$ (vis-a-vis the standard basis $(e_i)$) is defined as the quantity

$$d_{\mathrm{eff}}(\Phi) := S \max_{i = 1, \ldots, S} \|P_\Phi e_i\|_2^2, \quad (15)$$

where $P_\Phi$ is the orthogonal projector onto the column space of $\Phi$ and $S$ is the number of states.

Note that for any feature matrix, the smallest $d_{\mathrm{eff}}$ can be the rank of the state space, achieved, for example, if $\Phi$ is spanned by vectors whose entries all have magnitude $1/\sqrt{S}$, meaning that the feature matrix is perfectly learned. The largest possible value for $d_{\mathrm{eff}}$ is $S$, which would correspond to any subspace that contains a standard basis element, meaning that the state space is full-rank. As a comparison, the effective dimension in this paper and the one in Lan et al. (2022) are both developed from the Singular Value Decomposition (SVD), which is generally utilized for dimensionality reduction (the singular values of the feature matrix are the eigenvalues of $M$ in Equation 12 when we only consider the state feature). The difference is that we compute the number of eigenvalues of the Gram matrix of the feature matrix as the measurement, while Lan et al. (2022) compute the largest leverage score of the feature matrix, considered as the left singular vectors of the state space. In some sense, the effective dimension in Lan et al. (2022) can be seen as a variation of Equation 8 with $\epsilon = 0$, which also indicates that the small disturbance component of the feature matrix cannot be discarded in Equation 15 without prior knowledge of the rank of the state space.
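Both measurements discussed in this appendix can be computed from a feature matrix in a few lines. The following is a minimal NumPy sketch, assuming a small dense feature matrix is available in memory; the function names and the toy matrices are hypothetical, and the first function implements the sample-based estimator of the lemma above rather than the expectation in Equation 8.

```python
import numpy as np

def effective_dimension(features, eps=0.01):
    """Count the eigenvalues of (1/n) * Psi^T Psi that exceed eps."""
    n = features.shape[0]
    gram = features.T @ features / n
    eigenvalues = np.linalg.eigvalsh(gram)  # Gram matrix is symmetric PSD
    return int(np.sum(eigenvalues > eps))

def leverage_effective_dimension(features):
    """S * max_i ||P_Phi e_i||^2, the leverage-score quantity of Lan et al. (2022)."""
    num_states = features.shape[0]
    rank = np.linalg.matrix_rank(features)
    u, _, _ = np.linalg.svd(features, full_matrices=False)
    u = u[:, :rank]                    # orthonormal basis of the column space
    leverage = np.sum(u ** 2, axis=1)  # ||P e_i||^2 equals the squared row norm of U
    return float(num_states * leverage.max())

# Toy feature matrix: two strong orthogonal directions plus one tiny "noise" direction.
psi = np.array([
    [1.0,  1.0,  1e-3],
    [1.0, -1.0,  1e-3],
    [1.0,  1.0, -1e-3],
    [1.0, -1.0, -1e-3],
])
zeta = effective_dimension(psi, eps=0.01)       # the noise direction is discarded
zeta_no_threshold = effective_dimension(psi, eps=0.0)
d_eff = leverage_effective_dimension(psi)       # the noise direction still counts

# A feature matrix whose column space contains a standard basis vector.
d_eff_basis = leverage_effective_dimension(np.array([[1.0], [0.0], [0.0], [0.0]]))
```

In the toy matrix, the thresholded count ignores the tiny third direction, while the leverage-based quantity treats it like any other direction, which mirrors the remark that small disturbance components cannot be discarded in Equation 15.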

E PROOFS E.1 USEFUL TECHNICAL RESULTS

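We begin this section by introducing Rademacher complexity, which can be used to derive data-dependent upper bounds on the learnability of function classes.

Definition 5. Let $\mathcal{X}$ be any set, and let $X_1, X_2, \cdots, X_n$ be i.i.d. random variables with values in $\mathcal{X}$. Let $G$ be a class of functions $g: \mathcal{X} \to \mathbb{R}$. The quantity

$$\mathrm{Rad}(G) := \mathbb{E}_{X_j, \sigma_j}\left[\sup_{g \in G} \frac{1}{n}\sum_{j=1}^{n} \sigma_j g(X_j)\right] \quad (16)$$

is called the Rademacher complexity of the class $G$ on the sample $X = (X_1, X_2, \cdots, X_n) \in \mathcal{X}^n$, where the $\sigma_j$ are Rademacher random variables drawn uniformly at random from $\{\pm 1\}$. The corresponding vectorized version of the Rademacher complexity is

$$\mathrm{Rad}(G) := \mathbb{E}_{X_j, \sigma_{ji}}\left[\sup_{g \in G} \frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{K} \sigma_{ji}\, g(X_j)_i\right],$$

where $G$ is a class of functions $g: \mathcal{X} \to \mathbb{R}^K$.

E.2 THEORETICAL ANALYSIS

Assumption 1. (1.1) Access to the true behavior: $\hat{\pi}_\beta = \pi_\beta$. (1.2) Access to the true performance of policies: $\hat{J}(\pi) = J(\pi)$ for all policies $\pi$. (1.3) The embedding $\phi$ allows to represent the behavior policy estimate: $\hat{\pi}_\beta(a|\phi(s)) = \hat{\pi}_\beta(a|s) \in \Pi_{\mathrm{BPR}}$. (1.4) The algorithm performs perfect optimization: $J_{\mathrm{BPR}} = \max_{\pi \in \Pi_{\mathrm{BPR}}} \hat{J}(\pi)$.

Theorem 2. Under the idealized Assumptions 1.1-1.4, BPR returns a policy that improves over the behavior policy: $J_{\mathrm{BPR}} \geq J(\pi_\beta)$.

Proof.

$$\begin{aligned}
J_{\mathrm{BPR}} &= J\left(\arg\max_{\pi \in \Pi_{\mathrm{BPR}}} \hat{J}(\pi)\right) && \text{(Assumption 1.4)} \\
&= \max_{\pi \in \Pi_{\mathrm{BPR}}} J(\pi) && \text{(Assumption 1.2)} \\
&\geq J(\hat{\pi}_\beta) && \text{(Assumption 1.3)} \\
&\geq J(\pi_\beta) && \text{(Assumption 1.1)}
\end{aligned}$$

which concludes the proof.

Assumption 2. (2.1) Access to an accurate estimate $\hat{\pi}_\beta$ of the true behavior $\beta$: $J(\pi_\beta) - J(\hat{\pi}_\beta) \leq \epsilon_\beta$. (2.2) Access to an accurate estimate $\hat{\Delta}_{\pi, \hat{\pi}_\beta}$ of the true policy improvement $\Delta_{\pi, \hat{\pi}_\beta}$ for all policies $\pi \in \Pi_{\mathrm{PI}} \cap \Pi_{\mathrm{BPR}}$: $\hat{\Delta}_{\pi, \hat{\pi}_\beta} - \Delta_{\pi, \hat{\pi}_\beta} \leq \epsilon_\Delta$.

Theorem 3. Under Assumption 2, BPR returns a policy $\pi_{\mathrm{BPR}}$ with the following performance bound: $J_{\mathrm{BPR}} \geq J(\pi_\beta) + \hat{\Delta}_{\pi_{\mathrm{BPR}}, \hat{\pi}_\beta} - \epsilon_\Delta - \epsilon_\beta$.

Proof.

$$\begin{aligned}
J_{\mathrm{BPR}} &= J(\pi_{\mathrm{BPR}}) && \text{(by definition)} \\
&\geq J(\hat{\pi}_\beta) + \hat{\Delta}_{\pi_{\mathrm{BPR}}, \hat{\pi}_\beta} - \epsilon_\Delta && \text{(Assumption 2.2)} \\
&\geq J(\pi_\beta) + \hat{\Delta}_{\pi_{\mathrm{BPR}}, \hat{\pi}_\beta} - \epsilon_\Delta - \epsilon_\beta && \text{(Assumption 2.1)}
\end{aligned}$$

which concludes the proof.

Assumption 3. (3.1) Access to a lower bound of the performance $J_\perp(\pi) \leq J(\pi)$ for all policies, which is tight for $\hat{\beta}$: $J_\perp(\hat{\pi}_\beta) \geq J(\hat{\pi}_\beta) - \epsilon_\perp$.

Theorem 4. Under Assumptions 2.1 and 3.1, BPR returns a policy $\pi_{\mathrm{BPR}}$ with the following performance bound: $J_{\mathrm{BPR}} \geq J(\pi_\beta) + \Delta^\perp_{\pi_{\mathrm{BPR}}, \hat{\pi}_\beta} - \epsilon_\perp - \epsilon_\beta$.

Proof.

$$\begin{aligned}
J_{\mathrm{BPR}} &= J(\pi_{\mathrm{BPR}}) && \text{(by definition)} \\
&\geq J_\perp(\pi_{\mathrm{BPR}}) && \text{(Assumption 3.1)} \\
&= J_\perp(\hat{\pi}_\beta) + \Delta^\perp_{\pi_{\mathrm{BPR}}, \hat{\pi}_\beta} && \text{(by definition)} \\
&\geq J(\hat{\pi}_\beta) + \Delta^\perp_{\pi_{\mathrm{BPR}}, \hat{\pi}_\beta} - \epsilon_\perp && \text{(Assumption 3.1)} \\
&\geq J(\pi_\beta) + \Delta^\perp_{\pi_{\mathrm{BPR}}, \hat{\pi}_\beta} - \epsilon_\perp - \epsilon_\beta && \text{(Assumption 2.1)}
\end{aligned}$$

which concludes the proof.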

E.3 PERFORMANCE BOUND WITH BPR OBJECTIVE

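BPR solves the optimization problem

$$\min_{\phi \in \Phi,\, f \in F} \frac{1}{n}\sum_{i=1}^{n} \|f(\phi(s_i)) - a_i\|^2,$$

where $(s_i, a_i)$ is an i.i.d. sample from the offline dataset of size $n$. The difference between the estimated behavior policy and the true behavior policy is:

$$\mathbb{E}\left[J(\pi_\beta) - J(\hat{\pi}_\beta) \mid D\right] = \mathbb{E}_{s \sim \rho}\left[V^{\pi_\beta}(s) - V^{\hat{\pi}_\beta}(s)\right].$$

Lemma 5. Consider a fixed dataset $D$ and a reward bounded by $R_{\max}$. Then the difference between $\pi_\beta$ and $\hat{\pi}_\beta$ is bounded as:

$$\mathbb{E}\left[J(\pi_\beta) - J(\hat{\pi}_\beta) \mid D\right] \leq \frac{2R_{\max}}{(1-\gamma)^2} \cdot \mathbb{E}\left[D_{\mathrm{TV}}\left(\hat{\pi}_\beta(s) \,\|\, \pi_\beta(s)\right) \mid D\right].$$

Proof. Recall that the total variation distance (TVD) between the estimated policy $\hat{\pi}_\beta$ and the behavior policy $\pi_\beta$ for a given state $s$ is defined as:

$$D_{\mathrm{TV}}\left(\hat{\pi}_\beta(s), \pi_\beta(s)\right) = \sup_{a \in A}\left|\hat{\pi}_\beta(a|s) - \pi_\beta(a|s)\right| = \frac{1}{2}\sum_{a}\left\|\hat{\pi}_\beta(a|s) - \pi_\beta(a|s)\right\|_1.$$

We expand the expectation in the value bound:

$$\begin{aligned}
\mathbb{E}_{s \sim \rho}\left[V^{\pi_\beta}(s) - V^{\hat{\pi}_\beta}(s)\right] &\leq \sum_{s}\rho(s)\sum_{a}\left|\pi_\beta(a \mid s) - \hat{\pi}_\beta(a \mid s)\right|\left(R(s,a) + \gamma\sum_{s'}T(s,a,s')\left|V^{\pi_\beta}(s') - V^{\hat{\pi}_\beta}(s')\right|\right) \\
&= \sum_{s}\sum_{a}\rho(s)\left|\pi_\beta(a \mid s) - \hat{\pi}_\beta(a \mid s)\right|\left(R(s,a) + \gamma\sum_{s'}T(s,a,s')\left|V^{\pi_\beta}(s') - V^{\hat{\pi}_\beta}(s')\right|\right).
\end{aligned}$$

Applying the upper bound of the value function, $R_{\max}/(1-\gamma)$, to the above equation gives:

$$\begin{aligned}
\mathbb{E}\left[J(\pi_\beta) - J(\hat{\pi}_\beta) \mid D\right] &= \mathbb{E}_{s \sim \rho}\left[V^{\pi_\beta}(s) - V^{\hat{\pi}_\beta}(s)\right] \\
&\leq \frac{R_{\max}}{1-\gamma}\,\mathbb{E}_{s \sim \rho}\sum_{a}\left|\pi_\beta(a \mid s) - \hat{\pi}_\beta(a \mid s)\right| \\
&= \frac{2R_{\max}}{1-\gamma} \cdot \mathbb{E}\left[\mathbb{E}_{s \sim \rho}\, D_{\mathrm{TV}}\left(\hat{\pi}_\beta(s) \,\|\, \pi_\beta(s)\right) \mid D\right] \\
&= \frac{2R_{\max}}{(1-\gamma)^2}\,\mathbb{E}_{s \sim d^{\pi_\rho}}\, D_{\mathrm{TV}}\left(\hat{\pi}_\beta(s) \,\|\, \pi_\beta(s)\right).
\end{aligned}$$

For simplicity, we denote $\mathbb{E}_{s \sim d^{\pi_\rho}} D_{\mathrm{TV}}(\hat{\pi}_\beta(s) \,\|\, \pi_\beta(s))$ as $\mathbb{E}[D_{\mathrm{TV}}(\hat{\pi}_\beta(s) \,\|\, \pi_\beta(s)) \mid D]$.

This lemma tells us that the suboptimality of an arbitrary policy is upper-bounded by the TVD between the optimal policy and itself over the dataset $D$. Consider that we learn the representation mapping $\phi: S \to Z$, and that we learn a policy $\pi_\phi: Z \to A$ based on the fixed mapping $\phi$, instead of learning the policy $\pi: S \to A$. The following theorem provides an upper bound on the suboptimality of the policy $\pi_\phi$. (Note that we abuse the notation of $\pi_\phi$, $\pi(\phi(s))$, and $\pi(\cdot|\phi(s))$ for readability and simplicity.)

Theorem 5. With probability at least $1-\delta$, for any $\delta \in (0,1)$:

$$\epsilon_\beta = \mathbb{E}\left[J(\pi_\beta) - J(\hat{\pi}_\beta) \mid D\right] \leq CK \cdot \frac{1}{n}\sum_{i=1}^{n}\left\|\pi_\beta(\cdot|s_i) - \hat{\pi}_\beta(\cdot \mid \phi(s_i))\right\|_2 + 2\sqrt{2}K \cdot \mathrm{Rad}(\Phi) + K \cdot \sqrt{\frac{2\ln\frac{1}{\delta}}{n}},$$

where $n$ is the size of the dataset, $\mathrm{Rad}$ is the Rademacher complexity, $\pi_\beta$ is the behavior policy over the dataset, $C$ is a constant, and $K = \frac{R_{\max}}{1-\gamma}$.

Proof. The following proof builds on techniques provided by Asadi et al. (2020). First, for all $\phi$, we have:

$$\mathbb{E}_{s \sim d^{\pi_\beta}}\left\|\pi_\beta(\cdot \mid s) - \hat{\pi}_\beta(\cdot \mid \phi(s))\right\|_1 - \frac{1}{n}\sum_{i=1}^{n}\left\|\pi_\beta(\cdot \mid s_i) - \hat{\pi}_\beta(\cdot \mid \phi(s_i))\right\|_1 \leq \underbrace{\sup_{\phi \in \Phi}\left[\mathbb{E}_{s \sim d^{\pi_\beta}}\left\|\pi_\beta(\cdot \mid s) - \hat{\pi}_\beta(\cdot \mid \phi(s))\right\|_1 - \frac{1}{n}\sum_{i=1}^{n}\left\|\pi_\beta(\cdot \mid s_i) - \hat{\pi}_\beta(\cdot \mid \phi(s_i))\right\|_1\right]}_{:= \Xi(s_1, \ldots, s_n)} \quad (25)$$

Lemma 6. (Asadi et al., 2020) The expected value of $\Xi$ can be bounded as:

$$\mathbb{E}[\Xi] \leq 2\sqrt{2}\,\mathrm{Rad}(\Phi), \quad (26)$$

where $\mathrm{Rad}$ is the Rademacher complexity.

Given the fact that $|\Xi(s_1, \ldots, s_i, \ldots, s_n) - \Xi(s_1, \ldots, s_i', \ldots, s_n)| \leq \frac{2}{n}$, and applying McDiarmid's inequality, for all $\delta \in (0,1)$ we have:

$$\Pr\left(\Xi \leq \mathbb{E}[\Xi] + \sqrt{\frac{2\ln\frac{1}{\delta}}{n}}\right) \geq 1 - \delta. \quad (27)$$

Combining Equation 26 and Equation 27 with Equation 25, we get that the following holds with probability at least $1-\delta$:

$$\mathbb{E}_\rho\left\|\pi_\beta(\cdot \mid s) - \hat{\pi}_\beta(\cdot \mid \phi(s))\right\|_1 \leq \frac{1}{n}\sum_{i=1}^{n}\left\|\pi_\beta(\cdot \mid s_i) - \hat{\pi}_\beta(\cdot \mid \phi(s_i))\right\|_1 + 2\sqrt{2}\,\mathrm{Rad}(\Phi) + \sqrt{\frac{2\ln\frac{1}{\delta}}{n}}. \quad (28)$$

Now, from the left-hand side, with the help of Lemma 5, we get:

$$\mathbb{E}\left[J(\pi_\beta) - J(\hat{\pi}_\beta) \mid D\right] \leq \frac{2R_{\max}}{(1-\gamma)^2} \cdot \mathbb{E}\left[D_{\mathrm{TV}}\left(\hat{\pi}_\beta(s) \,\|\, \pi_\beta(s)\right) \mid D\right] = \frac{R_{\max}}{1-\gamma} \cdot \mathbb{E}_\rho\left\|\pi_\beta(\cdot \mid s) - \hat{\pi}_\beta(\cdot \mid \phi(s))\right\|_1. \quad (29)$$

And from the right-hand side, we have:

$$\frac{1}{n}\sum_{i=1}^{n}\left\|\pi_\beta(\cdot \mid s_i) - \hat{\pi}_\beta(\cdot \mid \phi(s_i))\right\|_1 + 2\sqrt{2}\,\mathrm{Rad}(\Phi) + \sqrt{\frac{2\ln\frac{1}{\delta}}{n}} \leq C \cdot \frac{1}{n}\sum_{i=1}^{n}\left\|\pi_\beta(\cdot \mid s_i) - \hat{\pi}_\beta(\cdot \mid \phi(s_i))\right\|_2 + 2\sqrt{2}\,\mathrm{Rad}(\Phi) + \sqrt{\frac{2\ln\frac{1}{\delta}}{n}}. \quad (30)$$

Combining Equation 29 and Equation 30, we can conclude the proof:

$$\epsilon_\beta = \mathbb{E}\left[J(\pi_\beta) - J(\hat{\pi}_\beta) \mid D\right] \leq C\frac{R_{\max}}{1-\gamma} \cdot \frac{1}{n}\sum_{i=1}^{n}\left\|\pi_\beta(\cdot|s_i) - \hat{\pi}_\beta(\cdot \mid \phi(s_i))\right\|_2 + 2\sqrt{2}\frac{R_{\max}}{1-\gamma} \cdot \mathrm{Rad}(\Phi) + \frac{R_{\max}}{1-\gamma} \cdot \sqrt{\frac{2\ln\frac{1}{\delta}}{n}}$$

with probability at least $1-\delta$.

Note that the second term and the last term do not depend on the exact form of the representation. Minimizing the first term can decrease the difference between the learned policy and the behavior policy, and the first term itself aligns with the optimization problem of BPR (Equation 4), indicating that with the BPR objective we can improve the performance of the policy. The potential harm of using BPR is bounded by the error $\epsilon_\beta$, which we can control.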
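To give a feel for the two representation-independent terms of the bound, the sketch below estimates a Rademacher complexity by Monte Carlo and evaluates the concentration term from McDiarmid's inequality. It is an illustration under simplifying assumptions: the function class is a small finite set of scalar functions, the sample is held fixed (so this is the empirical complexity conditioned on the sample, not the full expectation of Definition 5), and the names `empirical_rademacher` and `deviation_term` are hypothetical.

```python
import math
import random

def empirical_rademacher(function_class, sample, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_g (1/n) * sum_j sigma_j * g(x_j) ]."""
    rng = random.Random(seed)
    n = len(sample)
    values = [[g(x) for x in sample] for g in function_class]  # cache g(x_j)
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, row)) / n for row in values)
    return total / num_draws

def deviation_term(n, delta):
    """The sqrt(2 * ln(1/delta) / n) term obtained from McDiarmid's inequality."""
    return math.sqrt(2.0 * math.log(1.0 / delta) / n)

# A toy class {x -> x, x -> -x} evaluated on a fixed sample of four points.
sample = [0.5, -1.0, 2.0, 1.5]
rad_hat = empirical_rademacher([lambda x: x, lambda x: -x], sample)

# Concentration term for a dataset of 200 samples at confidence level 1 - 0.05.
slack = deviation_term(n=200, delta=0.05)
```

The concentration term shrinks at rate $1/\sqrt{n}$ and involves no learned quantity, which is consistent with the observation that only the first term of the bound is affected by the choice of representation.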

F COUNTEREXAMPLE
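The arithmetic of the counterexample in this appendix can be reproduced with exact fractions. The sketch below is a minimal illustration, assuming the four-trajectory dataset and the collapse of $s_0$ and $s_1$ into a single embedding $z$ described in this appendix; the helper names `abstract_model`, `abstract_value`, and `true_return` are hypothetical, and the abstract model is a plain maximum-likelihood estimate over the collapsed transitions.

```python
from fractions import Fraction

# Transitions (state, action, next_state, reward) from the four trajectories.
DATASET = [
    ("s0", "a0", "sf", 0), ("s0", "a0", "sf", 0),
    ("s0", "a1", "s1", 0), ("s1", "a0", "sf", 1),
    ("s0", "a1", "s1", 0), ("s1", "a1", "sf", 0),
]

def abstract_model(dataset, phi):
    """Per action: (mean reward, probability of staying in z) after collapsing states."""
    stats = {}
    for s, a, s_next, r in dataset:
        count, reward_sum, stay = stats.get(a, (0, 0, 0))
        stay += 1 if phi.get(s_next) == phi[s] else 0
        stats[a] = (count + 1, reward_sum + r, stay)
    return {a: (Fraction(rs, c), Fraction(st, c)) for a, (c, rs, st) in stats.items()}

def abstract_value(model, prob_a0):
    """Solve V = r_pi + p_stay_pi * V for a policy playing a0 w.p. prob_a0 (gamma = 1)."""
    weights = {"a0": prob_a0, "a1": 1 - prob_a0}
    reward = sum(w * model[a][0] for a, w in weights.items())
    stay = sum(w * model[a][1] for a, w in weights.items())
    return reward / (1 - stay)

def true_return(dataset, action, start="s0"):
    """Return of always playing `action` in the deterministic ground MDP."""
    step = {(s, a): (s_next, r) for s, a, s_next, r in dataset}
    state, total = start, 0
    while state != "sf":
        state, r = step[(state, action)]
        total += r
    return total

phi = {"s0": "z", "s1": "z"}  # both ground states collapse to one embedding
model = abstract_model(DATASET, phi)
value_a0 = abstract_value(model, Fraction(1))          # always a0 in z
value_a1 = abstract_value(model, Fraction(0))          # always a1 in z
value_uniform = abstract_value(model, Fraction(1, 2))  # uniform policy in z
ground_value_a0 = true_return(DATASET, "a0")
```

The abstract values rank the always-$a_0$ policy highest, while its return in the ground MDP is zero, which is the failure mode the counterexample is meant to exhibit.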

We consider the following dataset composed of four trajectories $\tau_1$, $\tau_2$, $\tau_3$, and $\tau_4$ in the deterministic MDP $m = \langle \{s_0, s_1\}, \{a_0, a_1\}, p_0(s_0) = 1, p, r, \gamma = 1 \rangle$:

$$D = \Big\{ \underbrace{\langle s_0, a_0, s_f, 0 \rangle}_{\tau_1},\ \underbrace{\langle s_0, a_0, s_f, 0 \rangle}_{\tau_2},\ \underbrace{\langle s_0, a_1, s_1, 0 \rangle, \langle s_1, a_0, s_f, 1 \rangle}_{\tau_3},\ \underbrace{\langle s_0, a_1, s_1, 0 \rangle, \langle s_1, a_1, s_f, 0 \rangle}_{\tau_4} \Big\},$$

where the final state $s_f$ denotes the termination of the trajectory. Then, BPR may collapse $s_0$ and $s_1$ to a single embedding $z$ where the estimated behavior policy is uniform: $\hat{\pi}_\beta(a_0|z) = \hat{\pi}_\beta(a_1|z) = \pi_\beta(a_0|s_0) = \cdots$. Then, we have:

$$\begin{aligned}
J(\pi_\beta) &= \tfrac{1}{4} && (33) \\
J_{\mathrm{BPR}}(\pi(a_0|z) = 1) &= \tfrac{1}{3} && (34) \\
J_{\mathrm{BPR}}(\pi(a_1|z) = 1) &= 0 && (35) \\
J_{\mathrm{BPR}}(\pi(a_0|z) = 0.5) &= \tfrac{1}{2} \times \tfrac{1}{3} + \tfrac{1}{2} \times \tfrac{2}{3} \times J_{\mathrm{BPR}}(\pi(a_0|z) = 0.5) && (36) \\
&= \tfrac{1}{6} + \tfrac{1}{3} \times J_{\mathrm{BPR}}(\pi(a_0|z) = 0.5) && (37) \\
&= \tfrac{3}{2} \times \tfrac{1}{6} = \tfrac{1}{4}
\end{aligned}$$

As a conclusion, with a BPR value, a greedy algorithm would converge to $\pi(a_0|z) = 1$, which has a performance of 0 in the environment MDP $m$, and is therefore not a policy improvement.

Experimental Setup The purpose of this experiment is to demonstrate that, when compared to the state-of-the-art representation objectives, BPR still has a competitive advantage. In this experiment, we follow the pretrain-finetune paradigm suggested by Yang & Nachum (2021), i.e., we first pretrain state representations for 100k time steps; then the learned encoder is fixed and applied to the downstream task, which trains a BRAC (Wu et al., 2019) agent for 100k steps. As in the configuration of Yang & Nachum (2021), we also fix the regularization strengths and policy learning rates in BRAC for all domains. Since representation is more valuable in informative data, our experiment does not include interaction pairs collected by a random policy, but rather focuses on datasets collected from a learned (or at least partially learned) agent. The experiment is still conducted on the D4RL MuJoCo tasks, with 16 tasks included in total.
In this setting, the pretraining datasets and downstream datasets are the same, determined by a single choice of task. We compare our method to several published leading representation learning algorithms for RL, which include:
• ACL (Yang & Nachum, 2021) applies contrastive learning on a transformer-based architecture: (1) take a sub-trajectory $s_{t:t+k}, a_{t:t+k}, r_{t:t+k}$, (2) randomly mask a subset of these, (3) pass the masked sequence into a transformer, and then (4) for each masked input state, apply a contrastive loss between its representation $\phi(s)$ and the transformer output at its sequential position.
• Fourier (Nachum & Yang, 2021a) is implemented as contrastive learning where the transition dynamics are approximated by an implicit linear model with representations given by random Fourier features.
• Forward model (Pathak et al., 2017) uses the state representation and the action to predict the reward and the next state given a sub-trajectory, where the transition probability is defined as an entropy-based model, i.e., given a sub-trajectory $\tau_{t:t+1}$, use $\phi(s_t), a_t$ to predict $s_{t+1}, r_t$.
• Inverse model (Pathak et al., 2017) uses the current state representation and the next state representation to predict the current action, i.e., given a sub-trajectory $\tau_{t:t+1}$, use $\phi(s_{t:t+1})$ to predict $a_t$.
• Contrastive (Yang & Nachum, 2021) learns a state representation with a contrastive objective. Given a sub-trajectory $\tau_{t:t+1}$, a contrastive loss is applied between $\phi(s_t)$ and $\phi(s_{t+1})$ as: $-\phi(s_{t+1})^\top W \phi(s_t) + \log \mathbb{E}_\rho\left[\exp\left(\phi(s)^\top W \phi(s_t)\right)\right]$.
• Super (Yang & Nachum, 2021) is a combination of the forward model and the inverse model, which has a self-predictive module and an inverse module to learn dynamical information.
In this experiment, all methods use the same encoder architecture, i.e., a 4-layer MLP activated by ReLU, followed by another linear layer activated by Tanh, where the final output feature dimension of the encoder is 256.
Besides, all methods follow the same optimizer settings, pre-training data, and number of pre-training epochs.

Analysis Figure 9 and Figure 10 provide the performance comparison results for BPR and other representation objectives on the D4RL MuJoCo dataset. We find that BPR outperforms or is at least competitive with the previous state-of-the-art methods in the majority of environments. Notably, the variance on the expert and medium datasets is much smaller than that on the medium-expert and medium-replay datasets, especially for BPR, which indicates that the learned representation is more stable when the dataset comes from an agent of a single level. The underlying reason may be that the behavior policy of a mixed dataset can differ even within the same mini-batch during training, which limits the ability of the neural network to learn the information related to (near-)optimal decision-making. This suggests that the representation quality could be impacted more by the characteristics of the dataset than by the representation objective, which confirms the observation of Schweighofer et al. (2021). We also notice that, even with the pretrained encoder learned from BPR, there still remains a large performance gap between the BRAC agent and the baseline TD3+BC agent. This indicates that, although the representation objective is helpful for Offline RL, choosing a suitable offline agent for each task is still essential for improving the performance, which may also partially solve the aforementioned high-variance issue across datasets. Furthermore, to understand the performance differences between representation objectives comprehensively, we additionally use uncertainty-aware comparisons for all these methods.
Experimental Setup There are two design choices for representation learning in the experiment: i) learn state representations via pretraining, then freeze the encoder and apply it to the downstream tasks; ii) learn state representations simultaneously with the policy, where the representation and the policy are learned side by side. One may wonder how BPR performs under these different design choices, and what the difference is between the latter design and one that additionally applies BC to policy learning. To address both questions, we evaluate both designs on the state-of-the-art TD3+BC backbone. For the pretraining design, we still set the pretraining stage to 100,000 time steps. For the co-training design, we train the state encoder simultaneously with the value network and the policy network; the corresponding pseudocode is listed in Algorithm 2. For reference, we also provide the pseudocode of the baseline TD3+BC method in Algorithm 3. Concretely, the differences between BPR and the behavior cloning in TD3+BC are as follows: i) they use different projection (or actor) heads; ii) the gradient of the policy loss is not visible to the encoder, but the gradient of the encoder loss can pass through and update both the encoder layer and the projection layer; iii) BPR requires ℓ2-normalization of the representation and the action, which is known to be effective for representation learning, while ℓ2-normalization is not a good fit for the action in BC.

Analysis Figure 11 shows the performance of the three models on D4RL tasks, indicating that the BPR objective can improve sample efficiency in most cases. BPR consistently outperforms the baseline method, TD3+BC, on the majority of tasks and has a considerably lower variance, demonstrating that our approach has an advantage in convergence rate and stability.
On the walker2d tasks, the baseline method performs better than the model with BPR. A possible reason is that the baseline is sufficient to solve this task, so that adding extra auxiliary losses may, in turn, damage the performance. Meanwhile, this finding is consistent with our hypothesis that the representation objective becomes more valuable as the difficulty of the task increases. Another notable result is that, compared to co-training the BPR objective with the policy, BPR yields larger performance gains and faster convergence under the pretrain-finetune paradigm, illustrating that learning a policy from a "good" fixed encoder might be more suitable for Offline RL.

The substantial improvement of the modified variant of DrQ+CQL suggests that BPR can be effective in improving sample efficiency, decreasing computational cost, and reaching better final performance.

Experimental Setup We compare our method with five other representation objectives that are considered state-of-the-art on a benchmarking suite for Offline RL from visual observations of DMControl suite (DMC) tasks (Lu et al., 2022). The baselines include:
• DRIML (Mazoure et al., 2020) and HOMER (Misra et al., 2020) (Time Contrastive methods) learn representations that can discriminate between adjacent observations in a rollout and pairs of random observations.
• CURL (Laskin et al., 2020) (Augmentation Contrastive method) learns a representation that is invariant to a class of data augmentations while being different across random example pairs.
• Inverse model (Pathak et al., 2017) (One-Step Inverse Models) predicts the action taken, conditioned on the previous and resulting observations.
• SGI (Schwarzer et al., 2021a) is a combination of three kinds of techniques (self-predictive representations, inverse modelling, and goal-conditioned RL).
As SGI was not tested on the DeepMind Control suite (continuous action space) setting, we modified its code (https://github.com/mila-iqia/SGI) to allow it to be fairly compared to BPR. Specifically, for the inverse model, we follow the same architecture as the above Inverse model; for the self-predictive representation, we design a simple MLP network acting as an inverse model to capture the dynamic information; for the goal-conditioned parts, we utilize the same architecture as the FiLM module in DRIML, apply DrQ+BC as the backbone, and define the potential-based reward of goal-conditioned RL to pretrain the state representations. In this experiment, all methods use the same encoder architecture, i.e., four convolutional layers activated by ReLU, followed by a linear layer normalized by LayerNorm and activated by Tanh, where the final output feature dimension of the encoder is 256. Besides, all methods follow the same optimizer settings, image augmentation, pre-training data, and number of pre-training epochs. We provide pseudocode for each method in Algorithms 4-7.

It should be noted that in this work, we assume the inductive bias of the behavior policy is efficient. In particular, a bad behavior policy may lead BPR to encode an undesirable bias and therefore deteriorate the performance of the algorithm it is paired with. To verify this, we run BPR on data collected using a random policy and report the results in Figures 13 and 14. We see that the gains provided by BPR on datasets where the behavior policy is informative vanish.

raw state as its input; ii) the learned encoder is used as the input of both the policy network and the value network.

Analysis As shown in Figure 15, the performance of both variants is almost equivalent, which means that $\phi(s)$ is good enough to learn the state-action value empirically.
Notably, when taking $\phi(s)$ as the input, the agent has a lower variance along the training procedure in 2 of 6 tasks, which is empirical evidence that using $\phi(s)$ as the input of the value network is an acceptable choice.

We use three levels of distractions (i.e., easy, medium, hard) to evaluate the performance of the model (see Figure 16). Each distraction represents a shift in the data distribution, where we add task-irrelevant visual factors (i.e., backgrounds, agent colors, and camera positions) that vary with environments to disturb the model training.



Notably, when we characterize the state by its corresponding state representation, the dimension of the feature matrix changes accordingly as $\mathbb{R}^{|Z||A| \times d}$. We defer a detailed discussion to Appendix D.

For readability, we defer rigorous proofs of theorems to Appendix E.

Due to the page limit, we provide some additional experiment results in Appendix G.

We use the codebase from Yang & Nachum (2021) in these experiments: https://github.com/google-research/google-research/tree/master/rl_repr

Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent - A new approach to self-supervised learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

To keep the notation aligned with Li et al. (2006), we will replace $\pi^\star$ with $\pi^*$, and replace $\mathcal{M}$ with $M$ in this section.

The singular values of the feature matrix are the eigenvalues of $M$ in Equation 12 when we only consider the state feature.

Note that we will abuse the notation of $\pi_\phi$, $\pi(\phi(s))$, and $\pi(\cdot|\phi(s))$ for readability and simplicity.

https://github.com/mila-iqia/SGI



Figure 1: Illustration of Behavior Prior Representations and comparison with Behavior Cloning.

Figure 2: Performance comparison on D4RL dataset. The x-axis indicates the number of environment steps. The vertical axis reports the normalized cumulative returns. We train our 6 algorithms (EDAC, TD3+BC, CQL, and their BPR variants) on 6 seeds, and evaluate each one every 5,000 environment steps by computing the average return over 10 evaluation episodes. Lighter lines correspond to baselines, darker ones to versions pretrained with BPR. The black dotted line shows the time when representation pre-training ends and the policy training process begins.

Figure 3: Bootstrapping distributions for uncertainty in IQM measurements with 5000 resamples.

Performance Comparison among Different Representation Objectives

Experimental Setup: We follow the pre-training and fine-tuning paradigm of Yang & Nachum (2021), where we first pre-train the encoder for 100k timesteps, and then fine-tune a BRAC (Wu et al., 2019) agent on downstream tasks. We conduct the performance comparison by training the encoder with different objectives, including BPR and several other state-of-the-art representation objectives (we use the codebase from Yang & Nachum (2021) in these experiments: https://github.com/google-research/google-research/tree/master/rl_repr).

Analysis: We follow the performance criterion from Schwarzer et al. (2021b) and Agarwal et al. (2021), and use the inter-quartile mean (IQM) normalized score, which is calculated over runs rather than tasks, together with percentile bootstrap confidence intervals, to compare the performance of the different objectives. The detailed results are provided in Appendix G. Figure 3 presents the IQM normalized return and 95% bootstrap confidence intervals for all methods; the numerical values can be found in Table 1. The performance gain of BPR over the two most competitive baselines, Fourier and ACL, is statistically significant (p < 0.05).
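The evaluation criterion described above can be sketched with the standard library alone. This is a simplified illustration, not the evaluation code used for Figure 3: it assumes a flat list of per-run normalized scores, trims floor(n/4) runs from each end for the IQM, and uses a plain (non-stratified) percentile bootstrap; the function names and the scores below are hypothetical.

```python
import random
import statistics

def iqm(scores):
    """Inter-quartile mean: average of the scores left after trimming 25% per side."""
    ordered = sorted(scores)
    cut = len(ordered) // 4  # floor(0.25 * n) values dropped from each end
    return statistics.fmean(ordered[cut:len(ordered) - cut])

def bootstrap_ci(scores, num_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the IQM of the runs."""
    rng = random.Random(seed)
    stats = sorted(
        iqm([rng.choice(scores) for _ in scores]) for _ in range(num_resamples)
    )
    lo_idx = int((1 - level) / 2 * num_resamples)
    hi_idx = int((1 + level) / 2 * num_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical normalized scores from eight runs, including one outlier run.
run_scores = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.05]
point_estimate = iqm(run_scores)
ci_low, ci_high = bootstrap_ci(run_scores)
```

Because the IQM discards the top and bottom quarter of the runs, a single collapsed run such as the outlier above has a limited effect on the point estimate compared with the plain mean.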

Figure 4: Performance comparison of BPR with several other baselines on DMC environments (left). Run time comparison for all representation objectives running on cheetah_run_expert task with 100K timesteps, the vertical axis indicates the wall-clock time in seconds (right).

Figure 5: Effective dimension of the penultimate layer of the value network (top) and performance (bottom) over the course of training on the Halfcheetah-medium-expert-v2 task.

i.e., $\zeta(\Phi, D, \epsilon)$, when we substitute the feature matrix $\Psi$ by the representation matrix $\Phi \in \mathbb{R}^{|S| \times |Z|}$, where $Z$ is the latent state representation space.

Connection to the effective dimension proposed in Lan et al. (2022): To develop a measurement of the effective dimension of state representation, Lan et al. (

ANALYSIS

Assumption 1. (1.1) Access to the true behavior: $\hat{\pi}_\beta = \pi_\beta$. (1.2) Access to the true performance of policies: $\hat{J}(\pi) = J(\pi)$ for all policies $\pi$. (1.3) The embedding $\phi$ can represent the behavior policy estimate: $\hat{\pi}_\beta(a \mid \phi(s)) = \hat{\pi}_\beta(a \mid s) \in \Pi_{\mathrm{BPR}}$.

Figure 9: Performance comparison on D4RL mujoco dataset.

Figure 10: Performance comparison with different representation objectives on D4RL Mujoco dataset at 100K timesteps. The x-axis shows the average normalized score over the final 10 evaluations and 6 seeds. Blue dotted lines show the average normalized return without pretraining.

Figure 11: Performance comparison on D4RL mujoco dataset. The horizontal axis indicates the number of environment steps. The vertical axis indicates the normalized cumulative returns. We trained 6 instances of each algorithm with different random seeds, evaluating each instance every 5,000 environment steps by computing the average episode return over 10 evaluation episodes. The solid lines represent the mean over the 6 trials.

Figure 12: Performance comparison on the V-D4RL benchmark. The results are averaged over 10 evaluations and 6 seeds.

DRIML (HOMER) Pretraining Pseudocode, PyTorch-like

    import torch

    # encoder: CNN encoder network
    # driml_net: MLP containing a FiLM block and a residual block
    def pretrain(encoder, driml_net, replay_buffer, batch_size, temp=0.1):
        iteration = 0
        while iteration < 1e5:
            # state_k_step is the observation k steps after the current state,
            # where k=1 in HOMER and k=5 in DRIML
            state, action, state_k_step, _, _ = replay_buffer.sample(batch_size)
            state = encoder(state)
            state_k = encoder(state_k_step)
            u_t, u_tpk = driml_net(state, state_k, action)
            # pairwise similarity between every u_t and every u_tpk in the batch
            outer_prod = torch.einsum('ik,jk->ij', u_t, u_tpk)
            scores = torch.log_softmax(outer_prod / temp, -1)  # temp is the temperature
            mask = torch.eye(scores.shape[-1])  # keep only the matching (diagonal) pairs
            scores = (mask * scores).sum(-1).mean()
            encoder_loss = -scores  # loss for the encoder
            encoder_loss.backward()
            update(encoder, driml_net)  # optimizer step, left abstract in the pseudocode
            iteration += 1

Algorithm 5 CURL Pretraining Pseudocode, PyTorch-like

    import torch
    import torch.nn as nn

    # encoder: CNN encoder network
    # W: bilinear similarity matrix (learnable parameters)
    def pretrain(encoder, W, replay_buffer, batch_size, temp=0.1):
        iteration = 0
        while iteration < 1e5:
            state, action, next_state, _, _ = replay_buffer.sample(batch_size)
            obs = augment(state.float())               # first augmented view
            pos = augment(torch.clone(state).float())  # second augmented view
            z_a = encoder(obs)
            z_pos = encoder(pos, ema=True)  # using the EMA copy of the encoder network
            logits = torch.matmul(z_a, torch.matmul(W, z_pos.T))
            # subtract the row-wise maximum for numerical stability
            logits = logits - torch.max(logits, 1)[0][:, None]
            labels = torch.arange(logits.shape[0])  # positives lie on the diagonal
            encoder_loss = nn.CrossEntropyLoss()(logits, labels)
            encoder_loss.backward()
            update(encoder, W)  # optimizer step, left abstract in the pseudocode
            iteration += 1

G.6 PERFORMANCE ON THE DATASETS COLLECTED BY THE RANDOM POLICY

Figure 13: Performance comparison on D4RL mujoco dataset where the data is collected by a random policy.

Figure 15: Performance comparison on D4RL mujoco dataset, ϕ(s) vs. s as input of the value network.

Interquartile mean, median, and mean normalized scores, evaluated over all six runs for each of 16 DMC tasks. Confidence intervals computed by percentile bootstrap with 5000 resamples.

Cumulative return on cheetah-run-medium-expert dataset with visual distractions ranging from easy to hard. We compare DrQ+BC (before →) to DrQ+BC with an encoder pretrained with BPR (after →). The final performance is averaged over 6 seeds (highest result for each task underlined).

summarizes our notation.

Table of Notation.

funding

* Correspondence to Xin Li. This work was partially supported by NSFC under Grant 62276024 and 92270125.
† Work done while at Microsoft Research Montreal.

availability

The code is available at https://github.com/bit1029public/offline_bpr.

Experimental Setup

We analyze our proposed method BPR on the D4RL benchmark (Fu et al., 2020) of OpenAI gym MuJoCo tasks (Todorov et al., 2012), which includes a variety of dataset domains that are commonly used in the Offline RL community. We evaluate our method by integrating it into three Offline RL methods: i) the TD3+BC agent (Fujimoto & Gu, 2021), one of the existing state-of-the-art Offline RL approaches, which combines Behavior Cloning with TD3 and strikes a good balance between simplicity and efficiency; ii) the CQL agent (Kumar et al., 2020), which learns a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value; iii) the EDAC agent (An et al., 2021), an uncertainty-based, ensemble-diversified Offline RL algorithm. We consider three simulated tasks from the Mujoco control domain of the D4RL benchmark (halfcheetah, hopper, and walker2d) and use the expert and medium-expert datasets. In our experiments, the baseline methods take the raw state as the input of both the value network and the policy network, while the BPR variants first pre-train the representation for 100k timesteps, then freeze the encoder and use its output as the input during Offline RL optimization. We follow the standard evaluation procedure of the D4RL benchmark. We provide the pseudocode of the pretraining process in Algorithm 1, and the training curves that do not account for the pre-training steps (100K timesteps) of BPR in Figure 8.
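The two-stage recipe described above (pre-train an encoder by cloning dataset actions, then freeze it and feed its output to the Offline RL agent) can be illustrated with a small sketch. This is not the paper's Algorithm 1: the linear encoder, the linear action head, the squared-error cloning loss, plain gradient descent, and all names (`pretrain_encoder`, `encode`, `bc_loss`) are assumptions chosen to keep the example short and dependency-free.

```python
import numpy as np


def bc_loss(W_enc, W_head, states, actions):
    """Mean squared error between predicted and dataset actions."""
    pred = states @ W_enc @ W_head
    return float(np.mean(np.sum((pred - actions) ** 2, axis=1)))


def pretrain_encoder(states, actions, latent_dim=4, steps=1000, lr=0.02, seed=0):
    """Stage 1: fit encoder and action head jointly by behavior cloning."""
    rng = np.random.default_rng(seed)
    n, state_dim = states.shape
    action_dim = actions.shape[1]
    W_enc = rng.normal(scale=0.1, size=(state_dim, latent_dim))
    W_head = rng.normal(scale=0.1, size=(latent_dim, action_dim))
    for _ in range(steps):
        z = states @ W_enc                              # latent representation
        grad_pred = 2.0 * (z @ W_head - actions) / n    # d loss / d prediction
        grad_head = z.T @ grad_pred
        grad_enc = states.T @ (grad_pred @ W_head.T)
        W_head -= lr * grad_head
        W_enc -= lr * grad_enc
    return W_enc, W_head


def encode(W_enc, states):
    """Stage 2 input: the encoder is frozen, so this is a fixed mapping."""
    return states @ W_enc
```

After pre-training, only `encode(W_enc, states)` would be passed to the downstream value and policy networks; the action head `W_head` is discarded and `W_enc` receives no further updates.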

G.4 PERFORMANCE COMPARISON IN V-D4RL BENCHMARK

Experimental Setup: We evaluate our method on a benchmarking suite for Offline RL from visual observations of DMControl suite (DMC) tasks (Lu et al., 2022). We consider three environment domains (walker-walk, humanoid-walk, and cheetah-run) and two different datasets per environment (medium-expert and expert). We include two baselines from Lu et al. (2022) in this experiment:
• DrQ+BC: combines data augmentation techniques with the TD3+BC method, which applies TD3 in the offline setting with a regularizing behavioral-cloning term added to the policy loss. The policy objective is:
• DrQ+CQL: adds the CQL regularization to the Q-function of an actor-critic approach, namely DrQ-v2.
To investigate whether BPR learns a valuable representation, we pretrain the encoder with the BPR objective for 100k timesteps, followed by a fine-tuning stage in which we keep the encoder fixed and train the two baselines on top of the learned encoder.

Analysis: Figure 12 shows the performance of the four models on V-D4RL. We emphasize that these tasks are more challenging than their conventional D4RL counterparts. Neither DrQ+BC nor DrQ+CQL can solve them without further modifications. By integrating the frozen encoder learned with BPR, both of them obtain significant performance gains. Note that DrQ+CQL typically requires a longer training time to achieve performance similar to DrQ+BC.

Nevertheless, taking the BPR representation as the input of the value network is acceptable empirically. We still pretrain the encoder and then fix it, while developing two variants of the architecture afterward: i) the learned encoder is only used for training the policy network, while the value network takes the

