CONTRASTIVE VALUE LEARNING: IMPLICIT MODELS FOR SIMPLE OFFLINE RL

Abstract

Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks.

1. INTRODUCTION

While the offline RL setting is relevant to many real-world applications where the ability for online data collection is limited, it often requires RL algorithms to find policies that are not well-supported by the training data. Instead of learning via trial-and-error, offline RL algorithms must leverage logged historical data to learn about the outcome of different actions, potentially by capturing environment dynamics as a proxy signal. Many prior approaches for this offline RL setting have been proposed, whether in model-free (Wu et al., 2019; Fujimoto et al., 2019; Kumar et al., 2020) or model-based (Kidambi et al., 2020; Yu et al., 2021) settings. Our focus will be on those that address this prediction problem head-on: by learning a predictive model of the environment which can be used in conjunction with most model-free algorithms. Prior model-based methods (Yu et al., 2020b; Argenson and Dulac-Arnold, 2020; Kidambi et al., 2020; Yu et al., 2021) learn a model that predicts the observation at the next time step. This model is then used to generate synthetic data that can be passed to an off-the-shelf RL algorithm. While these approaches can work well on some benchmarks, they can be complex and expensive: the model must predict high-dimensional observations, and determining the value of an action may require unrolling the model for many steps. Learning a model of the environment has not made the RL problem any simpler. Moreover, as we will show later in the paper, the environment dynamics are intertwined with the policy inside the value function; model-based methods aim to decouple these quantities by separately estimating them. On the other hand, we show that one can directly learn a long-horizon transition model for a given policy, which is then used to estimate the value function.
A natural use case for learning this long-horizon transition model (specifically, a state occupancy measure) from unlabelled data is multi-task pretraining, where the implicit dynamics model is trained on trajectory data across a collection of tasks, often exhibiting positive transfer properties. As we demonstrate in our experiments, this multi-task occupancy measure can then be finetuned using reward-labelled states on the task of interest, greatly improving performance upon existing pretraining methods as well as tabula rasa approaches.

In this paper, we propose to learn a different type of model for offline RL, a model which (1) will not require predicting high-dimensional observations and (2) can be directly used to estimate Q-values without requiring either model-based rollouts or model-free temporal difference learning. Precisely, we will learn an implicit model of the discounted state occupancy measure, i.e. a function which takes in a state, action and future state and outputs a scalar proportional to the likelihood of visiting the future state under some fixed policy. We will learn this implicit model via contrastive learning, treating it as a classifier rather than a generative model of observations. Once learned, we predict the likelihood of reaching every reward-labeled state. By weighting these predictions by the corresponding rewards, we form an unbiased estimate of the Q-function. Whereas methods like Q-learning estimate the Q-function of a state by "backing up" reward values, our approach goes in the opposite direction, "propagating forward" predictions about where the agent will go. We name our proposed algorithm Contrastive Value Learning (CVL). CVL is a simple algorithm for offline RL which learns the future state occupancy measure using contrastive learning and re-weights it with the future reward samples to construct a quantity proportional to the true value function.
Because CVL represents multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Using the same algorithm, we can handle settings where reward-free data is provided, which cannot be directly handled by classical offline RL methods such as FQI (Munos, 2003) or BCQ (Fujimoto et al., 2019). We compare our proposed method to competitive offline RL baselines, notably CQL (Kumar et al., 2020) and CQL+UDS (Yu et al., 2022), on an offline version of the multi-task Metaworld benchmark (Yu et al., 2020a), and find that CVL greatly outperforms the baseline approaches as measured by the rliable library (Agarwal et al., 2021b). Additional experiments on image-based tasks from this same benchmark show that our approach scales to high-dimensional tasks more seamlessly than the baselines. We also conduct a series of ablation experiments highlighting critical components of our method.

2. RELATED WORKS

Prior work has given rise to multiple offline RL algorithms, which often rely on behavior regularization in order to be well-supported by the training data. The key idea of offline RL methods is to balance interpolation and extrapolation errors, while ensuring proper diversity of out-of-dataset actions. Popular offline RL algorithms such as BCQ and CQL rely on a behavior regularization loss (Wu et al., 2019) as a way to control the extrapolation error. This regularization term ensures that the learned policy is well-supported by the data, i.e. does not stray too far away from the logging policy. The major issue with current offline RL algorithms is that they fail to fully capture the entire distribution over state-action pairs present in the training data.

To directly learn a value function using policy or value iteration, one needs to have information about the transition model in the form of sequences of state-action pairs, as well as the reward emitted by this transition. However, in some real-world scenarios, the reward might only be available for a small subset of data. For instance, in the case of recommending products available in an online catalog to the user, the true long-term reward (user buys the product) is only available for users who have browsed the item list for long enough and have purchased a given item. It is possible to decompose the value function into reward-dependent and reward-free parts, as was done by Barreto et al. (2016) through the successor representation framework (Dayan, 1993). More recent approaches (Janner et al., 2020; Eysenbach et al., 2020; 2022) use a generative model to learn the occupancy measure over future states for each state-action pair in the dataset; its expectation corresponds to the successor representation. However, learning an explicit multi-step model such as that of Janner et al. (2020) can be unstable due to the bootstrapping term in the temporal difference loss.
Similarly to model-based approaches, our method will learn a reward-free representation of the world, but will do so without having to predict high-dimensional observations and without having to do costly autoregressive rollouts. Thus, while our critic is trained without requiring rewards, it is much more similar to a value function than a standard 1-step model.

Learning a conditional probability distribution over a highly complex space can be a challenging task, which is why it is often easier to instead approximate it using a density ratio specified by an inner product in a much lower-dimensional latent space. To learn an occupancy measure over future states without passing via the temporal difference route, one can use noise-contrastive estimation (NCE, Gutmann and Hyvärinen, 2010; Oord et al., 2018) to approximate the corresponding log ratio of densities as an implicit function. Contrastive learning was originally proposed as an alternative to classical maximum likelihood estimation, but has since then seen successes in static self-supervised learning (He et al., 2020; Chen et al., 2020). In reinforcement learning, NCE was shown to improve the robustness of state representations to exogenous noise (Srinivas et al., 2020; Mazoure et al., 2020; Agarwal et al., 2021a) and, more recently, to be an efficient replacement for traditional goal-conditioned methods (Eysenbach et al., 2022).

3. PRELIMINARIES

Reinforcement learning. We assume a Markov decision process $M$ defined by the tuple $M = \langle \mathcal{S}, \mathcal{S}_0, \mathcal{A}, \mathcal{T}, r, \gamma \rangle$, where $\mathcal{S}$ is a state space, $\mathcal{S}_0 \subseteq \mathcal{S}$ is the set of starting states, $\mathcal{A}$ is an action space, $\mathcal{T} = P[\cdot \mid s_t, a_t] : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a one-step transition function (footnote 1), $r : \mathcal{S} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ is a reward function and $\gamma \in [0,1)$ is a discount factor. The system starts in one of the initial states $s_0 \in \mathcal{S}_0$. At every timestep $t = 1, 2, 3, \ldots$, the policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ samples an action $a_t \sim \pi(\cdot \mid s_t)$. The environment transitions into a next state $s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t)$ and emits a reward $r_t = r(s_t, a_t)$. The aim is to learn a Markovian policy $\pi(a \mid s)$ that maximizes the discounted sum of returns over an episode of length $H$:

$$\max_{\pi \in \Pi} \mathbb{E}_{P^{\pi}_{0:H}, \mathcal{S}_0}\left[\sum_{t=0}^{H} \gamma^{t} r(s_t, a_t)\right], \tag{1}$$

where $P^{\pi}_{t:t+K}$ denotes the joint distribution of $\{s_{t+k}, a_{t+k}\}_{k=1}^{K}$ obtained by executing $\pi$ in the environment for $K$ timesteps starting at timestep $t$. We assume that the offline RL algorithm cannot interact with the environment, but instead must learn from an offline dataset of logged trajectories $\{(s_0, a_0, s_1, a_1, \cdots)\}$.

Value-based RL algorithms maximize cumulative episodic rewards by estimating the state-action value function under a policy $\pi$:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{P^{\pi}_{t}}\left[\sum_{k=1}^{H} \gamma^{k} r(s_{t+k}, a_{t+k}) \,\middle|\, s_t, a_t\right], \tag{2}$$

for $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}$. Alternatively, the value function can be written as the expectation of the reward over the discounted occupancy measure:

$$Q^{\pi}(s_t, a_t) = \frac{1}{1-\gamma}\, \mathbb{E}_{s,a \sim P^{\pi}_{t:H}(s_t, a_t),\, \pi(s)}\left[r(s, a)\right], \tag{3}$$

where $P^{\pi}_{t:H}(s \mid s_t, a_t) = (1-\gamma) \sum_{\Delta t=1}^{H} \gamma^{\Delta t - 1} P[S_{t+\Delta t} = s \mid s_t, a_t; \pi]$, as defined in Janner et al. (2020).
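The relation between the discounted sum of rewards and the occupancy-measure form can be checked numerically on a small example. The sketch below is illustrative only: it assumes a hypothetical 3-state Markov chain in which the fixed policy is already folded into the transition matrix `P`, rewards that depend on the state alone, and the $\gamma^{\Delta t - 1}$ weighting used in the definition of the occupancy measure. None of the names come from the paper.

```python
def step(dist, P):
    """Push a distribution over states one step through the chain P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def discounted_return(P, r, s, gamma, H):
    """sum_{dt=1..H} gamma^(dt-1) * E[r(S_{t+dt}) | S_t = s]."""
    dist = [1.0 if i == s else 0.0 for i in range(len(P))]
    total = 0.0
    for dt in range(1, H + 1):
        dist = step(dist, P)  # distribution of S_{t+dt}
        total += gamma ** (dt - 1) * sum(d * x for d, x in zip(dist, r))
    return total


def occupancy(P, s, gamma, H):
    """(1-gamma) * sum_{dt=1..H} gamma^(dt-1) * P[S_{t+dt} = . | S_t = s]."""
    dist = [1.0 if i == s else 0.0 for i in range(len(P))]
    occ = [0.0] * len(P)
    for dt in range(1, H + 1):
        dist = step(dist, P)
        weight = (1.0 - gamma) * gamma ** (dt - 1)
        occ = [o + weight * d for o, d in zip(occ, dist)]
    return occ


# Hypothetical 3-state chain induced by a fixed policy, with state-only rewards.
P = [[0.9, 0.1, 0.0],
     [0.0, 0.8, 0.2],
     [0.1, 0.0, 0.9]]
r = [0.0, 1.0, 5.0]
gamma, H = 0.95, 200

q_sum = discounted_return(P, r, 0, gamma, H)
occ = occupancy(P, 0, gamma, H)
q_occ = sum(o * x for o, x in zip(occ, r)) / (1.0 - gamma)
print(q_sum, q_occ)  # the two values agree up to floating-point error
```

For a finite horizon the occupancy weights sum to $1 - \gamma^{H}$ rather than 1, which is the truncation effect that motivates the truncated geometric distribution used later in the paper.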
Note that the occupancy measure can equivalently be re-written in terms of the geometric distribution over the time interval $[0, \infty)$ for infinite-horizon rollouts:

$$P^{\pi}_{0:\infty}(s \mid s_0, a_0) = \mathbb{E}_{\Delta t \sim \mathrm{Geom}(1-\gamma)}\left[P[S_{t+\Delta t} \mid s_0, a_0; \Delta t; \pi]\right] \tag{4}$$

This decomposition of the value function has already been used in previous works based on the successor representation (Dayan, 1993; Barreto et al., 2016) and, more recently, γ-models (Janner et al., 2020). We will use it to efficiently learn an implicit density ratio proportional to the state occupancy measure using contrastive learning.

Noise-contrastive estimation. Noise-contrastive estimation (NCE, Gutmann and Hyvärinen, 2010) spans a broad class of learning algorithms, at the core of which is negative sampling (Mikolov et al., 2013), i.e., learning an implicit metric space from positive and negative examples. Given reference samples, samples from a positive distribution (i.e., high similarity with reference points) and samples from a negative distribution (i.e., low similarity with reference points), contrastive learning methods learn an embedding where positive examples are located closer to the reference points than negative examples. One of the most well-known and commonly used NCE objectives is InfoNCE (Oord et al., 2018), which solves

$$\max_{\phi, \psi \in \Phi} \mathbb{E}_{x, y, y^{-}}\left[\log \frac{e^{\phi(x)^{\top} \psi(y)}}{\sum_{y' \in y \cup y^{-}} e^{\phi(x)^{\top} \psi(y')}}\right] \tag{5}$$

over some hypothesis class $\Phi = \{\phi : \mathcal{X} \to \mathcal{Z}\}$ for input space $\mathcal{X}$, latent space $\mathcal{Z}$, $x \sim P(X)$, $y \sim P_{\text{positives}}(X)$ and $y^{-} \sim P_{\text{negatives}}(X)$. Contrastive learning has been widely studied in the static unsupervised/supervised learning settings (Hjelm et al., 2018; Chen et al., 2020; He et al., 2020), as well as in reinforcement learning (Kim et al., 2018; Mazoure et al., 2020) for learning state representations with desirable properties such as alignment and uniformity (Wang and Isola, 2020).
Solving Equation (5) for $(\phi^{*}, \psi^{*})$ yields a critic $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ which decomposes as $f^{*}(x, y) = \phi^{*}(x)^{\top} \psi^{*}(y)$ and, at optimality (footnote 2), captures the log-ratio of $P_{\text{positives}}(X)$ and $P_{\text{negatives}}(X)$:

$$f^{*}(x, y) \propto \log \frac{P[y \mid x]}{P[y]}. \tag{6}$$

Implicit dynamics models via NCE. Various prior works (Du and Mordatch, 2019; Mazoure et al., 2020; Nachum and Yang, 2021) have studied the use of NCE to approximate a single-step dynamics model, where triplets $(s_t, a_t, s_{t+1})$ have higher similarity than $(s_t, a_t, s_{t' \neq t+1})$, effectively defining positive and negative distributions over trajectory data. More recently, contrastive goal-conditioned RL (Eysenbach et al., 2022) used InfoNCE to condition the critic on goal states sampled from the replay buffer. These methods use asymmetric encoders, $\phi(s_t, a_t)$ and $\psi(s_{t+\Delta t})$, where positive samples of $s_{t+\Delta t}$ are sampled from the discounted state occupancy measure for $t \geq 0$. The conditional probability distribution of future states given the current state-action pair can be efficiently estimated using an implicit model trained via contrastive learning over positive and negative feature distributions, as shown in Equation (7):

$$\ell_{\text{InfoNCE}}(\phi, \psi) = \mathbb{E}_{s_t, a_t, \Delta t, \Delta t^{-}}\left[-\log \frac{e^{\phi(s_t, a_t)^{\top} \psi(s_{t+\Delta t})}}{\sum_{\Delta t' \in \Delta t \cup \Delta t^{-}} e^{\phi(s_t, a_t)^{\top} \psi(s_{t+\Delta t'})}}\right] \tag{7}$$

Minimizing $\ell_{\text{InfoNCE}}$ over trajectory data yields a critic which, at optimality, approximates the future discounted state occupancy measure up to a multiplicative term as per Equation (6):

$$f^{*}(s_t, a_t, s_{t+\Delta t}) \propto \log \frac{P[s_{t+\Delta t} \mid s_t, a_t; \pi]}{P[s_{t+\Delta t}; \pi]}. \tag{8}$$

Intuitively, $f^{*}$ approximates an $H$-step dynamics model which has an implicit dependence on the policy $\pi$ that collected the training data, but is time-independent since Equation (8) is optimized on average across multiple $t, \Delta t$. Ordinarily, training state-space models is hard when the dimensions are large, e.g. image-based domains.
However, by using contrastive learning, we can learn this model without requiring it to predict high-dimensional observations, as similarity is evaluated in a lower-dimensional latent space (observe that in Equation (7) the inner product is computed in $\mathcal{Z}$, whose dimension we control, instead of $\mathcal{X}$, which is specified externally). An apparent limitation of the approach is that the probability of future states $s_{t+\Delta t}$ is recovered only up to a constant. However, it turns out that we can still use this model to get accurate estimates of the Q-values, as is described in the next section.
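To make the contrastive objective concrete, the following minimal sketch evaluates the InfoNCE loss for a single anchor, one positive and a handful of negatives. It assumes the encoders have already produced the embeddings; the 2-D vectors and the function name `info_nce_loss` are hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(phi_sa, psi_pos, psi_negs):
    """-log( exp(phi.psi_pos) / sum_j exp(phi.psi_j) ), j over the positive and all negatives."""
    logits = [dot(phi_sa, psi_pos)] + [dot(phi_sa, p) for p in psi_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_partition = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_partition - logits[0]


# Hypothetical 2-D embeddings standing in for phi(s_t, a_t) and psi(s_{t+dt}).
phi_sa = [1.0, 0.0]
aligned = info_nce_loss(phi_sa, [5.0, 0.0], [[0.0, 5.0], [-5.0, 0.0]])
misaligned = info_nce_loss(phi_sa, [-5.0, 0.0], [[0.0, 5.0], [5.0, 0.0]])
uninformative = info_nce_loss([0.0, 0.0], [5.0, 0.0], [[0.0, 5.0], [-5.0, 0.0], [1.0, 1.0]])
```

The loss is small when the positive future state has the largest inner product with the anchor, large when a negative does, and equal to the logarithm of the number of candidates when the critic cannot tell them apart.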

4. ESTIMATING AND MAXIMIZING RETURNS VIA CONTRASTIVE LEARNING

In this section, we show how NCE can be used to learn a quantity proportional to a value function, and how the latter can be used in a policy iteration scheme.

4.1. ESTIMATING Q-VALUES USING THE CONTRASTIVE MODEL

As shown in Equation (3), the Q-function at $(s_t, a_t)$ can be thought of as evaluating the reward function at states sampled from the discounted occupancy measure $P^{\pi}_{t:H}(s_t, a_t)$. That is, to estimate a quantity akin to $Q^{\pi}$, we can first estimate the occupancy measure and take a weighted average of rewards over future states using the probabilities from the log-density ratio learned by the contrastive model. Precisely, Equation (3) corresponds to using an importance-weighted estimator, where an optimal critic that minimizes Equation (7) approximates the density ratio from Equation (8).

The positive samples come from the discounted state occupancy measure: we first sample a time offset $\Delta t \sim \mathrm{Geometric}(1-\gamma)$ (column in the dataset), and then sample a state from the distribution of states at this given offset (row in the dataset). As per the classical InfoNCE formulation, this forms the joint distribution $(s_t, a_t, s_{t+\Delta t})$, which is contrasted against the negative distribution given by the product of marginals $p(s_t, a_t) \times p(s_{t+\Delta t})$. The critic itself can be trained using the occupancy measure formulation specified in Equation (4) over all state-action pairs in a given episode. However, Equation (4) needs to be re-adjusted to account for finite-horizon truncation of the geometric mass function presented in Definition 1.

Definition 1 (Truncated distribution) Let $X$ be a random variable with distribution function $F_X$. $Y$ is called the truncated distribution of $X$ with support $[m, M]$ s.t. $0 < m < M$ if

$$P[Y = y] = \frac{F_X(y - m) - F_X(y - 1 - m)}{F_X(M) - F_X(m)}, \quad y = m, m+1, m+2, \ldots, M \tag{9}$$

We denote the special case of the truncated geometric distribution as $\mathrm{TruncGeom}(p, m, M)$. The contrastive objective to train the critic to approximate the discounted occupancy measure over a dataset $\mathcal{D}$ is then

$$\ell_{\text{OM-InfoNCE}}(\phi, \psi) = \mathbb{E}_{\substack{s_t, a_t \sim \mathcal{D},\ \Delta t \sim \mathrm{TruncGeom}(1-\gamma, t, H), \\ \Delta t^{-} \sim \mathrm{TruncGeom}(1-\gamma, t' \neq t, H)}}\left[-\log \frac{e^{\phi(s_t, a_t)^{\top} \psi(s_{t+\Delta t})}}{\sum_{\Delta t' \in \Delta t \cup \Delta t^{-}} e^{\phi(s_t, a_t)^{\top} \psi(s_{t+\Delta t'})}}\right] \tag{10}$$

It is possible that multiple optimal critics exist such that the multiplicative proportionality constant depends on the action. To avoid this, we adopt a similar approach to Eysenbach et al. (2022) and introduce a regularization term over the partition function (Equation (11)).

Now, suppose we found an optimal critic $f$. Combining Equation (4) with Definition 1, we obtain the following form of the Q-function for an optimal critic $f$ which minimizes Equation (7):

$$\begin{aligned} Q_{\text{NCE}}(s_t, a_t) &= \sum_{\Delta t=1}^{\infty} \gamma^{\Delta t - 1} \int_{s_{t+\Delta t}} r(s_{t+\Delta t})\, P[s_{t+\Delta t} \mid s_t, a_t; \pi]\, ds_{t+\Delta t} \\ &\propto \frac{1}{1-\gamma}\, \mathbb{E}_{\Delta t \sim \mathrm{TruncGeom}(1-\gamma, t, H)}\left[\int_{s_{t+\Delta t}} r(s_{t+\Delta t})\, e^{f(s_t, a_t, s_{t+\Delta t})}\, P[s_{t+\Delta t}; \pi]\, ds_{t+\Delta t}\right] \\ &= \frac{1}{1-\gamma}\, \mathbb{E}_{\Delta t \sim \mathrm{TruncGeom}(1-\gamma, t, H)}\left[\mathbb{E}_{P^{\pi}_{t+\Delta t}}\left[r(s_{t+\Delta t})\, e^{f(s_t, a_t, s_{t+\Delta t})}\right]\right] \end{aligned} \tag{12}$$

Here, the offset $\Delta t$ is a random variable sampled from $\mathrm{TruncGeom}(1-\gamma, t, H)$, where $H$ is the horizon of the MDP (footnote 3). We can also show that $Q(s, a) < Q(s, a') \implies Q_{\text{NCE}}(s, a) < Q_{\text{NCE}}(s, a')$ for all $s \in \mathcal{S}$ and $a, a' \in \mathcal{A}$, which makes the contrastive Q-values suitable for policy evaluation. However, we do not, in general, expect $Q_{\text{NCE}}$ to recover the optimal Q-function, as the recovered Q-values are on-policy with respect to $\pi$.
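The estimator above can be sketched as a Monte-Carlo average: sample an offset from a truncated geometric distribution, pick a reward-labelled future state at that offset from the logged data, and weight its reward by the exponentiated critic. The code below is a sketch under stated assumptions, not the authors' implementation. In particular, it reads the truncated geometric distribution as a pmf over offsets $1, \ldots, K$ proportional to $\gamma^{\Delta t - 1}$ with $K$ the number of remaining steps, and `toy_critic` is a hypothetical stand-in for a trained contrastive model.

```python
import math
import random


def trunc_geom_pmf(gamma, K):
    """pmf over offsets dt = 1..K with P[dt] proportional to gamma^(dt-1)."""
    weights = [gamma ** (dt - 1) for dt in range(1, K + 1)]
    z = sum(weights)
    return [w / z for w in weights]


def sample_trunc_geom(gamma, K, rng):
    """Inverse-CDF sample of an offset in {1, ..., K}."""
    u, acc = rng.random(), 0.0
    for dt, p in enumerate(trunc_geom_pmf(gamma, K), start=1):
        acc += p
        if u <= acc:
            return dt
    return K  # guard against floating-point round-off


def q_nce(critic, s, a, t, episodes, gamma, n_samples, rng):
    """Monte-Carlo estimate of (1/(1-gamma)) E_dt E_{s'}[ r(s') exp(f(s, a, s')) ]."""
    total = 0.0
    for _ in range(n_samples):
        episode = rng.choice(episodes)      # marginal over logged trajectories
        dt = sample_trunc_geom(gamma, len(episode) - 1 - t, rng)
        s_future, reward = episode[t + dt]  # reward-labelled future state
        total += reward * math.exp(critic(s, a, s_future))
    return total / n_samples / (1.0 - gamma)


# Toy data: 1-D states, reward 1 only for future states above 0.5. The critic is a
# hypothetical stand-in for a trained model; it scores future states near the action.
def toy_critic(s, a, s_future):
    return -(s_future - a) ** 2


episodes = [[(0.0, 0.0)] + [(x, 1.0 if x > 0.5 else 0.0)] * 5 for x in (-1.0, 0.0, 1.0)]
rng = random.Random(0)
pmf = trunc_geom_pmf(0.9, 5)
q_right = q_nce(toy_critic, 0.0, 1.0, 0, episodes, 0.9, 2000, rng)
q_left = q_nce(toy_critic, 0.0, -1.0, 0, episodes, 0.9, 2000, rng)
```

In this toy setting the action whose predicted futures overlap with rewarded states receives the larger estimate, which is the ordering property the section relies on for policy evaluation.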

4.2. EFFICIENT ESTIMATION USING RANDOM FOURIER FEATURES

A major issue with using $Q_{\text{NCE}}$ out-of-the-box is that it is computationally expensive, requiring evaluation of the inner product $\phi(s_t, a_t)^{\top} \psi(s_{t+\Delta t})$ with a large number of future states and hence multiple forward passes through $\psi$. The underlying cause of this computational overhead is the RBF kernel term $e^{\phi(s_t, a_t)^{\top} \psi(s_{t+\Delta t})}$. If we instead used a linear kernel, the constant term $\phi(s_t, a_t)$ would be factored out, and we could separately keep track of reward-weighted future expected features. This would (1) reduce the computational complexity of $N$ actor updates over $\mathcal{D}$ from $O(|\mathcal{D}| N)$ to $O(|\mathcal{D}| + N)$ and (2) reduce the variance of the representation if averaging features of future states using an exponential moving average.

It turns out that the RBF kernel can be approximately linearized by using random Fourier features (Rahimi and Recht, 2007; Nachum and Yang, 2021). Lemma 1 is a straightforward modification of the result from Rahimi and Recht (2007) and allows us to reduce the RBF kernel to an expectation over $d$-dimensional random feature vectors:

$$\begin{aligned} Q_{\text{NCE}}(s_t, a_t) &= \frac{1}{1-\gamma}\, \mathbb{E}_{\Delta t \sim \mathrm{TruncGeom}(1-\gamma, t, H)}\left[\mathbb{E}_{P(s_{t+\Delta t}; \pi)}\left[e^{\phi(s_t, a_t)^{\top} \psi(s_{t+\Delta t})}\, r(s_{t+\Delta t})\right]\right] \\ &= \frac{1}{1-\gamma}\, F_{W,b}(\phi(s_t, a_t))^{\top}\, \mathbb{E}_{\Delta t \sim \mathrm{TruncGeom}(1-\gamma, t, H)}\left[\mathbb{E}_{P(s_{t+\Delta t}; \pi)}\left[F_{W,b}(\psi(s_{t+\Delta t}))\, r(s_{t+\Delta t})\right]\right] \\ &= \frac{1}{1-\gamma}\, F_{W,b}(\phi(s_t, a_t))^{\top}\, \xi(\pi) \end{aligned} \tag{13}$$

The advantage of using the RFF approximation is that it allows us to split the exponential term inside the expectation and separately keep track of the policy-dependent, reward-weighted future state probability term, while the state-action dependence term is learned online. Specifically, we keep track of $\xi(\pi)$ via an exponential moving average during the entire duration of training (footnote 4).
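The linearization can be illustrated with a small numeric check. The sketch below draws random Fourier features of the form $\sqrt{2e/d}\cos(Wx + b)$, compares their inner product with $e^{x^{\top} y}$ for unit vectors, and then forms a factorized Q-value in which the reward-weighted feature average plays the role of $\xi(\pi)$. All embeddings, rewards and names are hypothetical, the plain average stands in for the exponential moving average, and unit-norm embeddings are assumed as Lemma 1 requires.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(x):
    norm = math.sqrt(dot(x, x))
    return [v / norm for v in x]


def make_rff(in_dim, d, rng):
    """Return F(x) = sqrt(2e/d) * cos(Wx + b) with W ~ Normal(0, I), b ~ Uniform(0, 2*pi)."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(d)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(d)]
    scale = math.sqrt(2.0 * math.e / d)

    def features(x):
        return [scale * math.cos(dot(w, x) + b_i) for w, b_i in zip(W, b)]

    return features


rng = random.Random(0)
F = make_rff(in_dim=3, d=10000, rng=rng)

# Kernel check on two hypothetical unit-norm embeddings.
x = normalize([1.0, 2.0, 0.5])
y = normalize([0.5, 1.0, -1.0])
approx, exact = dot(F(x), F(y)), math.exp(dot(x, y))

# Factorised Q-value: xi averages reward-weighted features of future states once,
# after which every new (s, a) costs a single inner product.
gamma = 0.9
phi_sa = normalize([0.2, 1.0, 0.3])
futures = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.3, 0.9, 0.1])]
rewards = [1.0, 0.5, 0.0]
feats = [F(psi) for psi in futures]
xi = [sum(r * f[k] for r, f in zip(rewards, feats)) / len(feats)
      for k in range(len(feats[0]))]
q_rff = dot(F(phi_sa), xi) / (1.0 - gamma)
q_direct = sum(r * math.exp(dot(phi_sa, psi))
               for r, psi in zip(rewards, futures)) / len(futures) / (1.0 - gamma)
```

The approximation error shrinks roughly as $1/\sqrt{d}$, so a small feature dimension gives a noisy kernel estimate; the value `d=10000` here is chosen only to make the check tight.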

4.3. LEARNING THE POLICY

Once the policy evaluation phase completes and we have an estimate $Q_{\text{NCE}}$, we optimize a policy to maximize the returns predicted by this Q-value. We can decode the policy by minimizing its Kullback-Leibler divergence to the Boltzmann Q-value distribution (see Haarnoja et al. (2018)), which can be efficiently done by minimizing the following objective:

$$\ell_{\text{Policy}}(\theta) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[D_{KL}\left(\pi_{\theta}(s_t) \,\Big\|\, \frac{e^{Q(s_t, \cdot)/\tau}}{\int_{a \in \mathcal{A}} e^{Q(s_t, a)/\tau}\, da}\right)\right]. \tag{14}$$

Note that in discrete action spaces, minimizing Equation (14) leads to a soft version of the greedy policy decoding $\pi_{\text{greedy}}(s) = \arg\max_{a \in \mathcal{A}} Q_{\text{NCE}}(s, a)$ for $s \in \mathcal{S}$. In practice, we approximate the KL term in Equation (14) using $N_a$ Monte-Carlo action samples. Decoding $\pi$ in such a way can lead to sampling out-of-distribution actions in regions where the Q-function might be inaccurate due to poor dataset coverage. To mitigate this issue, we follow prior work (Cobbe et al., 2021; Zhao et al., 2021; Schwarzer et al., 2021) and add a behavior cloning term which prevents the new policy from straying too far away from the data:

$$\ell_{\text{BC}}(\theta) = \mathbb{E}_{a, s \sim \mathcal{D}}\left[\log \pi_{\theta}(a \mid s)\right] + \tau\, \mathbb{E}_{s \sim \mathcal{D}}\left[\mathcal{H}(\pi_{\theta}(s))\right] \tag{15}$$

for entropy estimator $\mathcal{H}(\pi(s)) = -\mathbb{E}_{a \sim \pi(s)}[\log \pi(a \mid s)]$. We add this extra loss to $\ell_{\text{Policy}}$ to learn a policy $\pi$ which prioritizes high Q-values that are well-supported by the offline dataset $\mathcal{D}$. Thus, the final policy optimization objective becomes

$$\ell_{\text{Policy}}(\theta) = \ell_{\text{Policy}}(\theta) + \lambda_{\text{BC}}\, \ell_{\text{BC}}(\theta). \tag{16}$$

The policy found by minimizing $\ell_{\text{Policy}}$ has, on average, non-decreasing returns, as per Lemma 2.

Lemma 2 (Contrastive policy improvement) Let $\mu$ be a policy and let $Q^{\mu}_{\text{NCE}} = \min_{\phi, \psi \in \Phi} \mathbb{E}_{\mathcal{D}^{\mu}}[\ell_{\text{Critic}}(\phi, \psi)]$. If

$$\pi(s) = \arg\min_{\pi \in \Pi} D_{KL}\left(\pi(s) \,\Big\|\, \frac{e^{Q^{\mu}_{\text{NCE}}(s_t, \cdot)/\tau}}{\int_{a \in \mathcal{A}} e^{Q^{\mu}_{\text{NCE}}(s_t, a)/\tau}\, da}\right) \tag{17}$$

then $Q^{\pi}(s, a) \geq Q^{\mu}(s, a)$ for all $(s, a) \in \mathcal{D}^{\mu}$.

The proof of Lemma 2 is located in Section 6.2.
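For a finite set of candidate actions, the KL-based decoding step reduces to comparing the policy with a softmax over Q-values. The sketch below is a minimal illustration under that assumption (for continuous control, the candidates would be the Monte-Carlo action samples). The Q-values are hypothetical, and the behavior cloning term is deliberately left out.

```python
import math


def boltzmann(q_values, tau):
    """Softmax of Q / tau over a finite set of (sampled) actions."""
    scaled = [q / tau for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def policy_loss(pi_probs, q_values, tau):
    """KL between the policy and the Boltzmann distribution induced by Q."""
    return kl_divergence(pi_probs, boltzmann(q_values, tau))


q_values = [1.0, 2.0, 4.0]  # hypothetical Q_NCE values for three actions
uniform = [1.0 / 3.0] * 3
target = boltzmann(q_values, tau=1.0)
sharp = boltzmann(q_values, tau=0.1)
loss_uniform = policy_loss(uniform, q_values, tau=1.0)
loss_target = policy_loss(target, q_values, tau=1.0)
```

The loss vanishes when the policy matches the Boltzmann target, and lowering the temperature concentrates the target on the greedy action, which is the "soft greedy" behavior described above.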
Specifically, Lemma 2 tells us that using CVL as a surrogate Q-function corresponds to one step of conservative policy improvement, where $\pi$ satisfies soft constraints of Equation (14) and small $\mathbb{E}_{\mathcal{D}^{\mu}}[D_{KL}(\pi(s) \,\|\, \mu(s))]$ via the BC term.

4.4. PRACTICAL IMPLEMENTATION

We now present our complete method, which can be viewed as an actor-critic method for offline RL. We learn the critic via contrastive learning (Equation (11)) and learn the policy via Equation (16). We will interleave these steps in most of our experiments, but experiments in Section 5.1 show that the critic can be pretrained, e.g. in the presence of unlabeled data from related tasks. We summarize the method in Algorithm 1.

4.5. INTERPRETATIONS AND CONNECTIONS WITH PRIOR WORK

The main distinction between Contrastive Value Learning and prior works consists specifically in representing the Q-values in a two-step decomposition: the Q-value is represented as an occupancy measure weighted by the reward signal; the occupancy measure itself is represented using a powerful likelihood-based model parameterized using an implicit function. Decoupling the learning of the occupancy measure from reward maximization allows, among others, for efficient pretraining strategies on unlabeled data, i.e. trajectory data without reward information, and can be used to learn provably optimal state representations for any reward function (Touati and Ollivier, 2021). While CVL is similar in spirit to the successor representation (Dayan, 1993; Barreto et al., 2016), the occupancy measure learned by CVL is much richer than that of SR, as it captures the entire distribution over future states instead of only the first moment. Another method, γ-models (Janner et al., 2020), is closely related to CVL, but uses a surrogate single-step TD objective to learn the occupancy measure, similarly to C-learning (Eysenbach et al., 2020).

5. EXPERIMENTS

Our experiments aim to answer three questions. First, we study how CVL compares with baseline approaches on a large benchmark of state-based tasks. Our second set of experiments looks at image-based tasks, testing the hypothesis that CVL scales to these tasks more effectively than the baselines. We conclude with ablation experiments. Our main point of comparison will be a high-performing offline RL method, CQL (Kumar et al., 2020). While CVL learns an implicit model, that model is structurally much more similar to value-based RL methods than model-based methods, motivating our comparison to a value-based baseline (CQL). We will also include behavioral cloning as a baseline.

Metaworld. We first test our approach on the complex MetaWorld benchmark (Yu et al., 2020a), which consists of 50 robotic manipulation tasks such as opening a door, picking up an object, or reaching a certain area of the table, executed by a robotic arm (see Figure 2 (left)). This domain is an ideal testbed for CVL, as it allows . We report the results on all tasks of the MetaWorld suite over 5 random seeds, according to the aggregation methodology proposed by Agarwal et al. (2021b). Per-environment scores are available in Table 4. Results presented in Table 1 show that CVL is also able to learn meaningful Q-values and achieve good empirical performance on hard image-based tasks.

5.1. ABLATION EXPERIMENTS.

When is pretraining the model useful? In theory, the model can be pretrained on the data from other tasks; however, we do not always expect this to help (e.g., when the pretraining tasks are very different). We ran an experiment to test this capability. The results, shown in Fig. 3, show that pretraining sometimes speeds up learning. In particular, we observe that pretraining is effective when the pretraining tasks are similar to the target task and contain a diverse set of state-action pairs.

How reliable is the $Q_{\text{NCE}}$ approximation? Given that contrastive Q-values are proportional to the true Q-function, a natural question to ask is how well $Q_{\text{NCE}}$ captures the topology of $Q$. First, we conduct an ablation demonstrating how linearizing the RBF kernel via random Fourier features provides a performance gain on the offline MetaWorld tasks (Figure 4). Specifically, we hypothesize that this is due to the reduced variance of the RFF Q-value estimator, which keeps track of future reward-weighted state features using a rolling average. Next, we qualitatively assess the similarity between contrastive and true Q-values on the continuous Mountain Car environment (Moore, 1990) by first pre-training SAC online on the task and then fitting CVL to the data from SAC's replay buffer.

6. DISCUSSION

This paper presented an RL algorithm that learns a contrastive model of the world, and uses that model to obtain Q-values by estimating the likelihood of visiting future states. Our experiments demonstrate that this approach can effectively solve a large number of offline RL tasks, including from image-based observations. Our pretraining results hinted that CVL can be pretrained on datasets from other tasks, and we are excited to pretrain our model on datasets of increasing size.

Limitations. One limitation of our approach is that it corresponds to a single step of policy improvement. This limitation might be lifted by training the contrastive model using a temporal difference update (Eysenbach et al., 2020; Blier et al., 2021). A second limitation is that the RFF approximation can be poor when the feature dimension is small. We tried to train the contrastive model using non-exponentiated features (akin to HaoChen et al. (2021)), but failed to achieve satisfactory results. Figuring out how to effectively train these spectral models remains an important question.

REPRODUCIBILITY STATEMENT

We ensure reproducibility of our method via a) releasing the offline MetaWorld dataset that we used for our main results to allow the community to conduct further research on this domain upon publication, b) releasing the code used to obtain our results upon publication and c) detailing the hyperparameters and design choices for the implementation of CVL and the computational resources used for our experiments in Section 6.1.

Model architecture. All algorithms (baselines as well as CVL) were based on a common architecture, where an encoder (IMPALA (Espeholt et al., 2018) for image data and a two-layer DenseNet MLP (Huang et al., 2017) for full states) generated state features which, combined with actions, gave rise to the Q-value and the policy (we used a diagonal Gaussian policy with a Tanh bijector, as is common for continuous control tasks). The main difference between CVL and the baselines is that the critic is defined implicitly via the dot-product of current state-action features passed through one encoder, and future state features passed into a separate DenseNet. The output of both encoders was optionally normalized using the $\ell_2$ norm. All methods had a LayerNorm layer (Ba et al., 2016)

All experiments were run on the equivalent of 8 V100 GPUs with 64 GB of RAM and 8 CPUs. For all methods, the corresponding auxiliary loss weights have been selected through best aggregated performance on the drawer and door domains with hyperparameter values of {0, 0.01, 0.1, 1.0}.

Dataset composition

The offline MetaWorld dataset was constructed by first pre-training SAC on all 50 tasks from full states for 500k environment interactions. The replay buffer at the end of the training was then used as the training dataset for BC, CQL, CQL+UDS and CVL. An identical approach was used to construct the image-based MetaWorld datasets and the Mountain Car dataset.

Pretraining setup. When pretraining CVL, we first optimize the critic on unlabeled data from the datasets of all the semantically related tasks, i.e. tasks which belong to the same domain, and then finetune both the critic and the policy on reward-labeled data from the target task. Semantically related tasks in MetaWorld are easily identifiable by their domain name, e.g. drawer-open and drawer-close belong to the drawer domain. We use a similar approach when pretraining CQL+UDS, where we perform TD updates with all rewards equal to 0 during the pretraining phase.

6.2. PROOFS

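Proof 1 (Random Fourier features approximation, Lemma 1) For unit vectors $x, y \in \mathbb{R}^{d}$, $d > 0$,

$$\begin{aligned} \mathbb{E}\left[\left(\sqrt{\tfrac{2}{d}} \cos(Wx + b)\right)^{\top} \left(\sqrt{\tfrac{2}{d}} \cos(Wy + b)\right)\right] &= \exp\left(-\|x - y\|_2^2 / 2\right) \\ &= \exp\left(-(\|x\|_2^2 - 2 x^{\top} y + \|y\|_2^2)/2\right) \\ &= \exp\left(-(2 - 2 x^{\top} y)/2\right) \\ &= e^{x^{\top} y - 1} = \frac{e^{x^{\top} y}}{e} \end{aligned}$$

by re-arranging the terms in the result from Rahimi and Recht (2007). Therefore,

$$e^{x^{\top} y} = \mathbb{E}\left[\left(\sqrt{\tfrac{2e}{d}} \cos(Wx + b)\right)^{\top} \left(\sqrt{\tfrac{2e}{d}} \cos(Wy + b)\right)\right]$$

Proof 2 (CVL induces a single step of policy improvement, Lemma 2) Since, for the optimal critic $f^{*}$, e^{f*(s_t, a_t [...]

Now, the following relation holds using the previous result:

$$\begin{aligned} Q^{\mu}_{\text{NCE}}(s_t, a_t) &= \frac{1}{1-\gamma}\, \mathbb{E}_{\Delta t \sim \mathrm{TruncGeom}(1-\gamma, t, H)}\left[\mathbb{E}_{P^{\mu}_{t+\Delta t}}\left[r(s_{t+\Delta t})\, e^{f(s_t, a_t, s_{t+\Delta t})}\right]\right] \\ &= \frac{\alpha}{1-\gamma}\, \mathbb{E}_{\Delta t \sim \mathrm{TruncGeom}(1-\gamma, t, H)}\left[\int_{s_{t+\Delta t}} r(s_{t+\Delta t})\, P[s_{t+\Delta t} \mid s_t, a_t; \mu]\, ds_{t+\Delta t}\right] \\ &= \alpha\, Q^{\mu}(s_t, a_t) \end{aligned}$$

Using this relation yields [...]

It follows that

$$\begin{aligned} \arg\min_{\pi \in \Pi} D_{KL}\left(\pi(s_t) \,\Big\|\, \frac{e^{Q^{\mu}_{\text{NCE}}(s_t, \cdot)/\tau}}{\int_{a \in \mathcal{A}} e^{Q^{\mu}_{\text{NCE}}(s_t, a)/\tau}\, da}\right) &= \arg\min_{\pi \in \Pi} D_{KL}\left(\pi(s_t) \,\Big\|\, \frac{e^{Q^{\mu}(s_t, \cdot)/\tau}}{\int_{a \in \mathcal{A}} e^{Q^{\mu}(s_t, a)/\tau}\, da}\right) \\ &= \pi(s_t) \end{aligned}$$

Now, we invoke Lemma 2 from Haarnoja et al. (2018) by using the equivalence of the policy decoded from contrastive Q-values to the policy found by soft policy iteration, which concludes the proof.

6.3. ADDITIONAL RESULTS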

6.3.1. METAWORLD

Ablation on the BC coefficient: We ablate the impact of the behavior cloning loss on CVL's performance in Figure 7. We can see that, although adding a behavior cloning loss improves the performance by a small amount, it is not essential to the fundamental functioning of CVL.

Quantitative evaluation of the contrastive occupancy measure: From Wang and Tay (2022), we know that

$$\mathrm{MMD}(P, Q) \leq 2\sqrt{1 - e^{-\mathrm{KL}(P, Q)}} \tag{25}$$

We also know that

6.3.2. MOUNTAIN CAR

$$I(P, Q) = \mathrm{KL}\left((P, Q) \,\|\, P \otimes Q\right) \geq \log N - \ell_{\text{InfoNCE}}(P_N, Q_N)$$

which simplifies the above expression to

$$\mathrm{MMD}_N(P, Q) \leq 2\sqrt{1 - e^{-\log N + \ell_{\text{InfoNCE}}(P_N, Q_N)}}$$

Figure 9 shows the upper-bound on the MMD between occupancy measures learned with temporal difference and contrastive learning methods.



$\Delta(\mathcal{X})$ denotes the entire set of distributions over the space $\mathcal{X}$.

See Ma and Collins (2018) for the exact derivation.

While using the truncated geometric distribution makes Equation (12) proportional to the true value function, the relation becomes an equality in the infinite-horizon case since $\lim_{H\to\infty} 1-(1-p)^{H-m} = 1$.

This idea can be adapted to online learning settings as well by clipping policy improvement steps so that $\xi$ doesn't change too fast under newly collected data.

For CQL+UDS, we combine all data from the current task with unlabeled data from related tasks with rewards set to 0. In the absence of related tasks, we pre-train the critic on the current task with 0 rewards.
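The normalizer $1-(1-p)^{H-m}$ of the truncated geometric distribution mentioned above can be checked numerically. The sketch below is illustrative only: the helper names `trunc_geom_pmf` and `sample_trunc_geom` are hypothetical, and it assumes the support is $\Delta t \in \{1, \ldots, n\}$ with $n = H - m$ remaining steps, which the notes above do not spell out.

```python
import numpy as np

def trunc_geom_pmf(p, n):
    # Geom(p) restricted to k = 1..n and renormalized (assumed support).
    k = np.arange(1, n + 1)
    unnormalized = p * (1.0 - p) ** (k - 1)
    normalizer = 1.0 - (1.0 - p) ** n  # mass Geom(p) places on {1, ..., n}
    return unnormalized / normalizer, normalizer

def sample_trunc_geom(rng, p, n, size):
    # Draw time offsets from the truncated geometric distribution.
    pmf, _ = trunc_geom_pmf(p, n)
    return rng.choice(np.arange(1, n + 1), size=size, p=pmf)

gamma = 0.9
pmf_short, z_short = trunc_geom_pmf(1.0 - gamma, n=5)    # short remaining horizon
pmf_long, z_long = trunc_geom_pmf(1.0 - gamma, n=500)    # long remaining horizon
samples = sample_trunc_geom(np.random.default_rng(0), 1.0 - gamma, n=5, size=1000)
```

With a short remaining horizon the normalizer is well below 1, so the truncated estimate is only proportional to the untruncated one; as the horizon grows the normalizer approaches 1, matching the limit stated above.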



Figure 1: Contrastive Value Learning: A stylized illustration of trajectories (grey) and the rewards at future states (e.g., +8, -5). (Left) Q-learning estimates Q-values by "backing up" the rewards at future states. (Right) Our method learns the Q-values by fitting an implicit model to estimate the likelihoods of future states (blue), and taking the reward-weighted average of these likelihoods.
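The "reward-weighted average of likelihoods" in the caption can be made concrete with a small sketch. It assumes the critic factorizes as $f(s_t,a_t,s_{t+\Delta t}) = \phi(s_t,a_t)^{\top}\psi(s_{t+\Delta t})$ and that $K$ future states have already been sampled from the dataset; the function name `contrastive_q` and the array shapes are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def contrastive_q(phi_sa, psi_future, rewards, gamma):
    # phi_sa:     (d,)   embedding of the state-action pair (s_t, a_t)
    # psi_future: (K, d) embeddings of K sampled future states
    # rewards:    (K,)   rewards observed at those future states
    logits = psi_future @ phi_sa   # critic values f(s_t, a_t, s_future)
    ratios = np.exp(logits)        # implicit (unnormalized) likelihood ratios
    # Monte Carlo estimate of the reward-weighted average, scaled by 1 / (1 - gamma).
    return float(np.mean(rewards * ratios) / (1.0 - gamma))

# Toy check: with zero embeddings every ratio is 1, so the estimate
# reduces to the mean reward divided by (1 - gamma).
phi_sa = np.zeros(4)
psi_future = np.zeros((2, 4))
rewards = np.array([8.0, -5.0])
q = contrastive_q(phi_sa, psi_future, rewards, gamma=0.5)
```

No temporal-difference backup appears anywhere in this estimate: the rewards enter only as weights on the learned ratios.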

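(2022) and introduce a regularization term over the partition function, making the critic training objective

$$\mathcal{L}_{\mathrm{Critic}} = \mathcal{L}_{\mathrm{OM\text{-}InfoNCE}} + \lambda_{\mathrm{Partition}}\,\mathbb{E}_{s_t,a_t,\Delta t,\Delta t^{-}}\left[\left(\log \sum_{\Delta t' \in \Delta t \cup \Delta t^{-}} e^{\phi(s_t,a_t)^{\top}\psi(s_{t+\Delta t'})}\right)^{2}\right] \tag{11}$$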
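A minimal NumPy sketch of an objective with this shape is given below. It is not the authors' implementation: it substitutes a standard in-batch InfoNCE loss for the OM-InfoNCE term (row i's positive is column i, all other columns act as negatives), and the names `critic_loss` and `lam` are hypothetical.

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def critic_loss(phi, psi, lam):
    # phi: (B, d) embeddings of (s_t, a_t); psi: (B, d) embeddings of future states.
    logits = phi @ psi.T                  # (B, B); entry [i, j] = phi_i . psi_j
    log_z = logsumexp(logits, axis=1)     # log partition function per row
    info_nce = np.mean(log_z - np.diag(logits))  # standard InfoNCE (assumed stand-in)
    partition = np.mean(log_z ** 2)       # squared log-partition regularizer
    return info_nce + lam * partition, info_nce, partition

# Toy check: zero embeddings give uniform logits, so log_z = log(B) for every row.
B, d = 4, 3
loss, info_nce, partition = critic_loss(np.zeros((B, d)), np.zeros((B, d)), lam=0.1)
```

The regularizer pushes each row's log-partition toward zero, which keeps the exponentiated critic values close to normalized ratios.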

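Let $x, y \in \mathbb{R}^d$ be unit vectors, and let $F_{W,b}(x) = \sqrt{\tfrac{2e}{d}}\cos(Wx+b)$, where $W \sim \mathrm{Normal}(0, I)$ and $b \sim \mathrm{Uniform}(0, 2\pi)$ are fixed at initialization. Then, $\mathbb{E}[F_{W,b}(x)^{\top} F_{W,b}(y)] = e^{x^{\top}y}$.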
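The identity in this lemma can be checked with a quick Monte Carlo experiment. The sketch below is illustrative: it uses a separate symbol `D` for the number of random features (the lemma uses $d$ for the scaling), and the function name `rff_features`, the dimensions and the seed are arbitrary choices, not values from the paper.

```python
import numpy as np

def rff_features(x, W, b):
    # Random Fourier features scaled by sqrt(2e / D), D = number of features.
    D = W.shape[0]
    return np.sqrt(2.0 * np.e / D) * np.cos(W @ x + b)

rng = np.random.default_rng(0)
d, D = 8, 200_000  # input dimension and number of random features (assumed values)

# Two random unit vectors.
x = rng.normal(size=d)
x /= np.linalg.norm(x)
y = rng.normal(size=d)
y /= np.linalg.norm(y)

# Random projection fixed once, as in the lemma.
W = rng.normal(size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

estimate = rff_features(x, W, b) @ rff_features(y, W, b)  # Monte Carlo estimate
exact = np.exp(x @ y)                                     # target value e^{x . y}
```

With many random features the inner product of the feature maps should land close to `exact`; the gap shrinks roughly as one over the square root of `D`.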

Figure 2: MetaWorld benchmark. (Left) We evaluate CVL on 50 tasks from MetaWorld, a subset of which are shown here. (Right) Compared with three offline RL baselines, CVL achieves statistically significant improvements in offline performance. Results are reported over 5 random seeds.

Figure 5 (left) shows the contrastive Q-values on a log scale, evaluated on trajectories from the SAC replay buffer; for comparison, we also show the Q-values learned by online SAC in Figure 5 (right). Note that the value function learned by CVL preserves the topology of the true value function, up to a multiplicative rescaling.

Figure 3: Offline Learning Curves for Metaworld. Episode return curves as a function of gradient steps taken during training on 10 random MetaWorld tasks; curves show mean ± standard deviation. Pretraining the reward-free occupancy measure on related tasks allows CVL to outperform baseline approaches and even CVL trained tabula rasa.

Figure 4: CVL with RFF (orange) performs slightly better than without RFF (blue).

Figure 5: Visualizing the estimated Q-values. (Left) Normalized $\log Q_{\mathrm{NCE}}$ learned by CVL offline on the Mountain Car environment. (Right) Normalized Q learned by online SAC on the same environment.

. Yu, A. Kumar, Y. Chebotar, K. Hausman, C. Finn, and S. Levine. How to leverage unlabeled data in offline reinforcement learning. arXiv preprint arXiv:2202.01741, 2022.

Y. Zhao, R. Boney, A. Ilin, J. Kannala, and J. Pajarinen. Adaptive behavior cloning regularization for stable offline-to-online reinforcement learning. 2021.

(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Yes
(b) Did you describe the limitations of your work? See end of Section 4
(c) Did you discuss any potential negative societal impacts of your work? N/A
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? Yes

2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? Yes, they are outlined in the lemmas.
(b) Did you include complete proofs of all theoretical results? See Appendix Section 6.2

3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Code will be released upon publication, along with the offline MetaWorld dataset.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Appendix Section 6.1
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Yes, we include multiple uncertainty estimates and measures of central tendency, e.g. as outlined in Agarwal et al. (2021b), over 5 random replicates (i.e. seeds).
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix Section 6.1

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? N/A
(b) Did you mention the license of the assets? N/A
(c) Did you include any new assets either in the supplemental material or as a URL? Due to the large size of the offline MetaWorld dataset, we will release the code required to re-generate it exactly upon publication.
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? N/A
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? N/A

5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? N/A
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? N/A
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? N/A

6.1. EXPERIMENTAL DETAILS

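$$\frac{e^{Q^{\mu}_{\mathrm{NCE}}(s_t,a_t)/\tau}}{\int_{a\in\mathcal{A}} e^{Q^{\mu}_{\mathrm{NCE}}(s_t,a)/\tau}\,da} = \frac{e^{\alpha Q^{\mu}(s_t,a_t)/\tau}}{\int_{a\in\mathcal{A}} e^{\alpha Q^{\mu}(s_t,a)/\tau}\,da} = \frac{e^{Q^{\mu}(s_t,a_t)/\tau}}{\int_{a\in\mathcal{A}} e^{Q^{\mu}(s_t,a)/\tau}\,da} \tag{23}$$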

Figure 6: Performance profile of BC (red), CQL (green), CQL+UDS (orange) and CVL (blue) generated by the rliable library (Agarwal et al., 2021b) for the offline MetaWorld experiments over 5 random seeds.

Figure 7: Aggregated performance metrics for CVL with different behavior cloning weights.

Figure 8: Evaluation returns on Mountain Car during training on data from the SAC replay buffer. The red dotted line indicates the highest possible return.

Figure 9: Upper-bound on the MMD between occupancy measures estimated via TD and contrastive learning.

Algorithm 1: Contrastive Value Learning (CVL)
Input: Dataset $D \sim \mu$, networks $\psi, \phi$, temperature parameter $\tau$, exponential moving average parameter $\beta$
for epoch $j = 1, 2, \ldots, J$ do

Offline RL with Images. We compare CVL to baselines on four offline, image-based tasks from MetaWorld. Average ± std. dev. are shown for 5 random seeds.

in between each linear layer to ensure proper feature scaling.

Hyperparameters that are consistent between methods.

Hyperparameters that are different between methods.

Evaluation returns on MetaWorld offline tasks. Average ± standard deviation are shown for 5 random seeds.

