CLARE: CONSERVATIVE MODEL-BASED REWARD LEARNING FOR OFFLINE INVERSE REINFORCEMENT LEARNING

Abstract

This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL), namely the reward extrapolation error, where the learned reward function may fail to explain the task correctly and misguide the agent in unseen environments due to the intrinsic covariate shift. Leveraging both expert data and lower-quality diverse data, we devise a principled algorithm (namely CLARE) that solves offline IRL efficiently via integrating "conservatism" into a learned reward function and utilizing an estimated dynamics model. Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy, based on which we characterize the impact of covariate shift by examining subtle two-tier tradeoffs between the "exploitation" (on both expert and diverse data) and "exploration" (on the estimated dynamics model). We show that CLARE can provably alleviate the reward extrapolation error by striking the right "exploitation-exploration" balance therein. Extensive experiments corroborate the significant performance gains of CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks (especially with a small offline dataset), and the learned reward is highly instructive for further learning (source code).

1. INTRODUCTION

The primary objective of Inverse Reinforcement Learning (IRL) is to learn a reward function from demonstrations (Arora & Doshi, 2021; Russell, 1998). In general, conventional IRL methods rely on extensive online trial and error, which can be costly, or require a fully known transition model (Abbeel & Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008; Syed & Schapire, 2007; Boularias et al., 2011; Osa et al., 2018), and hence struggle to scale in many real-world applications. To tackle this problem, this paper studies offline IRL, with a focus on learning from a previously collected dataset without online interaction with the environment. Offline IRL holds tremendous promise for safety-sensitive applications where manually specifying an appropriate reward is difficult but historical datasets of human demonstrations are readily available (e.g., in healthcare, autonomous driving, and robotics). In particular, since the learned reward function is a succinct representation of an expert's intention, it is useful for policy learning (e.g., in offline Imitation Learning (IL) (Chan & van der Schaar, 2021)) as well as a number of broader applications (e.g., task description (Ng et al., 2000) and transfer learning (Herman et al., 2016)). This work aims to address a major challenge in offline IRL, namely the reward extrapolation error, where the learned reward function may fail to correctly explain the task and misguide the agent in unseen environments. This issue results from the partial coverage of states in the restricted expert demonstrations (i.e., covariate shift) as well as the high-dimensional and expressive function approximation for the reward. It is further exacerbated by the absence of any reinforcement signal for supervision and the intrinsic reward ambiguity therein. In fact, similar challenges related to the extrapolation error in the value function have been widely observed in offline (forward) RL, e.g., in Kumar et al. (2020); Yu et al. (2020; 2021).
Unfortunately, to the best of our knowledge, this challenge remains not well understood in offline IRL, albeit there is some recent progress (Zolna et al., 2020; Garg et al., 2021; Chan & van der Schaar, 2021). Thus motivated, the key question this paper seeks to answer is: "How can we devise offline IRL algorithms that effectively ameliorate the reward extrapolation error?" We answer this question by introducing a principled offline IRL algorithm, named conservative model-based reward learning (CLARE), leveraging not only (limited) higher-quality expert data but also (potentially abundant) lower-quality diverse data to enhance the coverage of the state-action space for combating covariate shift. CLARE addresses the above-mentioned challenge by appropriately integrating conservatism into the learned reward to alleviate possible misguidance in out-of-distribution states, and improves the reward's generalization ability by utilizing a learned dynamics model. More specifically, CLARE iterates between conservative reward updating and safe policy improvement: the reward function is updated by increasing its values on weighted expert and diverse state-actions while cautiously penalizing those generated from model rollouts. As a result, it can encapsulate the expert intention while conservatively evaluating out-of-distribution state-actions, which in turn encourages the policy to visit data-supported states and follow expert behaviors, thereby achieving safe policy search. Technically, there are highly nontrivial two-tier tradeoffs that CLARE has to delicately calibrate: "balanced exploitation" of the expert and diverse data, and "exploration" of the estimated model. As illustrated in Fig. 1, the first tradeoff arises because CLARE relies on both exploiting expert demonstrations to infer the reward and exploiting diverse data to handle the covariate shift caused by the insufficient state-action coverage of limited demonstration data.
At a higher level, CLARE needs to judiciously explore the estimated model to escape the offline data manifold for better generalization. To this end, we first introduce new pointwise weight parameters for offline data points (state-action pairs) to capture the subtle two-tier exploitation-exploration tradeoffs. Then, we rigorously quantify their impact on performance by providing an upper bound on the return gap between the learned policy and the expert policy. Based on this theoretical quantification, we derive the optimal weight parameters with which CLARE strikes the balance appropriately to minimize the return gap. Our findings reveal that the reward function obtained by CLARE can effectively capture the expert intention and provably ameliorate the extrapolation error in offline IRL. Finally, extensive experiments are carried out to compare CLARE with state-of-the-art offline IRL and offline IL algorithms on MuJoCo continuous control tasks. Our results demonstrate that even with small offline datasets, CLARE obtains significant performance gains over existing algorithms in continuous, high-dimensional environments. We also show that the learned reward function explains the expert behaviors well and is highly instructive for further learning.

2. PRELIMINARIES

A Markov decision process (MDP) can be specified by a tuple $M \doteq \langle S, A, T, R, \mu, \gamma \rangle$, consisting of state space $S$, action space $A$, transition function $T: S \times A \to P(S)$, reward function $R: S \times A \to \mathbb{R}$, initial state distribution $\mu: S \to [0, 1]$, and discount factor $\gamma \in (0, 1)$. A stationary stochastic policy maps states to distributions over actions, $\pi: S \to P(A)$. We define the normalized state-action occupancy measure (abbreviated as occupancy measure) of policy $\pi$ under transition dynamics $T$ as $\rho^{\pi}(s, a) \doteq (1-\gamma) \sum_{h=0}^{\infty} \gamma^{h} \Pr(s_h = s \mid T, \pi, \mu)\, \pi(a|s)$. The objective of reinforcement learning (RL) can be expressed as maximizing the expected cumulative reward, $\max_{\pi \in \Pi} J(\pi) \doteq \mathbb{E}_{s,a \sim \rho^{\pi}}[R(s, a)]$, where $\Pi$ is the set of all stationary stochastic policies that take actions in $A$ given states in $S$.

Maximum entropy IRL (MaxEnt IRL) aims to learn the reward function from expert demonstrations and reason about the stochasticity therein (Ziebart et al., 2008; Ho & Ermon, 2016). Given demonstrations sampled from expert policy $\pi_E$, the MaxEnt IRL problem is
$$\min_{r \in \mathcal{R}} \max_{\pi \in \Pi}\; \alpha H(\pi) + \mathbb{E}_{s,a \sim \rho^{\pi}}[r(s, a)] - \mathbb{E}_{s,a \sim \rho_E}[r(s, a)] + \psi(r), \tag{1}$$
with $H(\pi) \doteq -\iint \rho^{\pi}(s, a) \log \pi(a|s)\, \mathrm{d}s\, \mathrm{d}a$ the $\gamma$-discounted causal entropy, $\mathcal{R}$ a family of reward functions, $\alpha \ge 0$ a weight parameter, and $\psi: \mathbb{R}^{S \times A} \to \mathbb{R} \cup \{\infty\}$ a convex reward regularizer (Fu et al., 2018; Qureshi et al., 2018). Problem (1) looks for a reward function assigning higher rewards to the expert policy and lower rewards to other policies, along with the best policy under the learned reward function. Although it enjoys strong theoretical justification and achieves great performance in many applications, MaxEnt IRL has to solve a forward RL problem in its inner loop, which involves extensive online interactions with the environment.

Offline IRL is the setting where the algorithm is neither allowed to interact with the environment nor provided reinforcement signals.
It only has access to a static dataset $D = D_E \cup D_B$ consisting of expert dataset $D_E \doteq \{(s_i, a_i, s'_i)\}_{i=1}^{D_E}$ and diverse dataset $D_B \doteq \{(s_i, a_i, s'_i)\}_{i=1}^{D_B}$ collected by expert policy $\pi_E$ and behavior policy $\pi_B$, respectively (with slight abuse of notation, $D_E$ and $D_B$ also denote the dataset sizes). The goal of offline IRL is to infer a reward function capable of explaining the expert's preferences from the given dataset.
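To make the occupancy-measure definition above concrete, the following illustrative numpy sketch (not part of the paper's implementation) computes $\rho^{\pi}$ exactly in a tiny tabular MDP by unrolling the discounted state distribution:

```python
import numpy as np

def occupancy_measure(T, pi, mu, gamma, horizon=2000):
    """rho^pi(s,a) = (1-gamma) * sum_h gamma^h * Pr(s_h = s | T, pi, mu) * pi(a|s).

    T: (S, A, S) transition tensor, pi: (S, A) policy, mu: (S,) initial distribution.
    """
    S, A = pi.shape
    d = mu.copy()                # state distribution at step h
    disc = np.zeros(S)           # discounted state visitation
    for h in range(horizon):
        disc += gamma ** h * d
        d = np.einsum('s,sa,sat->t', d, pi, T)   # push d through pi and T
    return (1 - gamma) * disc[:, None] * pi      # shape (S, A)

# two-state chain: action 0 stays, action 1 moves to the other state
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0    # stay
T[0, 1, 1] = T[1, 1, 0] = 1.0    # switch
pi = np.full((2, 2), 0.5)        # uniform policy
mu = np.array([1.0, 0.0])
rho = occupancy_measure(T, pi, mu, gamma=0.9)
assert abs(rho.sum() - 1.0) < 1e-6   # a valid occupancy measure is normalized
```

The normalization factor $(1-\gamma)$ is what makes $\rho^{\pi}$ a proper probability distribution over state-action pairs, which is used throughout the analysis.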

3. CLARE: CONSERVATIVE MODEL-BASED REWARD LEARNING

A naive solution for offline IRL is to retrofit MaxEnt IRL to the offline setting by estimating a dynamics model from offline data (e.g., in Tanwani & Billard (2013); Herman et al. (2016)). Unfortunately, it has been reported that this naive paradigm often suffers from unsatisfactory performance in high-dimensional and continuous environments (Jarrett et al., 2020). The underlying reasons for this issue include: (1) the dependence on full knowledge of the reward feature function, and (2) the lack of effective mechanisms to tackle the reward extrapolation error caused by covariate shift (as stated in Section 1). Nevertheless, we believe that utilizing a learned dynamics model is beneficial because it is expected to provide broader generalization by learning on additional model-generated synthetic data (Yu et al., 2020; 2021; Lin et al., 2021). With this insight, this work focuses on a model-based offline IRL method that is robust to covariate shift while enjoying the model's generalization ability. As illustrated in Fig. 1, there are subtle two-tier tradeoffs that need to be carefully balanced between exploiting the offline data and exploring model-based synthetic data. On one hand, the higher-quality expert demonstrations are exploited to infer the intention and abstract the reward function, while the lower-quality diverse data is exploited to enrich data support. On the other hand, it is essential to prudently explore the estimated dynamics model to improve the generalization capability while mitigating overfitting errors in inaccurate regions. To this end, we devise conservative model-based reward learning (CLARE) based on MaxEnt IRL, where new pointwise weight parameters are introduced for each offline state-action pair to capture the tradeoffs subtly. We elaborate further in what follows.
As outlined below, CLARE iterates between (I) conservative reward updating and (II) safe policy improvement, under a dynamics model (denoted by $\hat T$) learned from the offline dataset.

(I) Conservative reward updating. Given the current policy $\pi$, the reward function is updated by solving
$$\min_{r \in \mathcal{R}}\; Z_\beta\, \mathbb{E}_{s,a \sim \hat\rho^{\pi}}[r(s, a)] - \Big(\mathbb{E}_{s,a \sim \hat\rho_E}[r(s, a)] + \mathbb{E}_{s,a \sim \hat\rho_D}[\beta(s, a)\, r(s, a)]\Big) + Z_\beta\, \psi(r), \tag{2}$$
where $\hat\rho_D(s, a) \doteq |D(s, a)|/D$ is the empirical state-action distribution of the union dataset $D$ and $\hat\rho_E(s, a) \doteq |D_E(s, a)|/D_E$ is that for expert dataset $D_E$; $\hat\rho^{\pi}$ is the occupancy measure when rolling out $\pi$ with dynamics model $\hat T$; and $\psi$ denotes a convex regularizer mentioned above. One key step is the additional term weighting the reward of each offline state-action by $\beta(s, a)$, which provides "fine-grained control" over the exploitation of the offline data. For data deserving more exploitation (e.g., expert behaviors with sufficient data support), we can set a relatively large $\beta(s, a)$; otherwise, we decrease its value. Besides, it can also control the exploration of the model subtly (observe that if we set all $\beta(s, a) = 0$, Eq. (2) reduces to MaxEnt IRL, enabling the agent to explore the model without restrictions). Here, $Z_\beta \doteq 1 + \mathbb{E}_{s',a' \sim \hat\rho_D}[\beta(s', a')]$ is a normalization term. The new ingredients beyond MaxEnt IRL are highlighted in blue. Observe that in Eq. (2), by decreasing the reward loss, CLARE pushes up the reward on good offline state-actions characterized by larger $\beta(s, a)$, while pushing down the reward on potentially out-of-distribution ones generated from model rollouts. This is similar in spirit to COMBO (Yu et al., 2021), a state-of-the-art offline forward RL algorithm, and results in a conservative reward function. It encourages the policy to cautiously explore state-actions beyond the offline data manifold, thus mitigating the misguidance issue and guiding safe policy search. In Section 4, we will derive a closed-form optimal $\beta(s, a)$ that enables CLARE to achieve a proper exploration-exploitation trade-off by minimizing a return gap from the expert policy.

(II) Safe policy improvement. Given the updated reward function $r$, the policy is improved by solving $\max_{\pi \in \Pi} L(\pi|r)$
$= Z_\beta\, \mathbb{E}_{s,a \sim \hat\rho^{\pi}}[r(s, a)] + \alpha \hat H(\pi)$ (Problem (3)), where $\alpha \ge 0$ is a weight parameter, and $\hat H(\pi) \doteq -\iint \hat\rho^{\pi}(s, a) \log \pi(a|s)\, \mathrm{d}s\, \mathrm{d}a$ is the $\gamma$-discounted causal entropy induced by the policy and the learned dynamics model. Due to the expert intention and conservatism embedded in the reward function, the policy is updated safely by carrying out conservative model-based exploration. One can use any well-established MaxEnt RL approach to solve this problem by simulating with model $\hat T$ and reward function $r$. It is worth noting that for Problem (3), the practical implementation of CLARE works well with a small number of updates in each iteration (see Sections 5 and 6).
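In tabular form, the alternation between steps (I) and (II) can be sketched end to end. The snippet below is an illustrative toy only (numpy, with soft value iteration standing in for a MaxEnt RL solver, a quadratic regularizer $\psi(r) = \mathrm{reg}\cdot r^2$, and made-up toy distributions and weights), not the paper's implementation:

```python
import numpy as np

def occupancy(T_hat, pi, mu, gamma=0.9, iters=400):
    """Occupancy measure of pi rolled out in the learned model T_hat (S, A, S)."""
    S, A = pi.shape
    d, disc = mu.copy(), np.zeros(S)
    for h in range(iters):
        disc += gamma ** h * d
        d = np.einsum('s,sa,sat->t', d, pi, T_hat)   # one-step state push-forward
    return (1 - gamma) * disc[:, None] * pi          # (S, A), sums to ~1

def reward_step(r, rho_pi, rho_E, rho_D, beta, lr=0.5, reg=0.01):
    """(I) Conservative reward update: one gradient step on the Eq. (2)-style loss
    Z_b * E_rho_pi[r] - E_rho_E[r] - E_rho_D[beta * r] + Z_b * reg * ||r||^2."""
    Z = 1.0 + (rho_D * beta).sum()
    grad = Z * rho_pi - rho_E - beta * rho_D + 2 * reg * Z * r
    return r - lr * grad    # push r down on rollouts, up on (weighted) offline data

def soft_policy(r, T_hat, alpha=0.1, gamma=0.9, iters=300):
    """(II) Safe policy improvement: soft value iteration, pi(a|s) ~ exp(Q(s,a)/alpha)."""
    Q = np.zeros(r.shape)
    for _ in range(iters):
        m = Q.max(axis=1)
        V = m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))
        Q = r + gamma * T_hat @ V                    # (S,A,S) @ (S,) -> (S,A)
    return np.exp((Q - V[:, None]) / alpha)

# toy setup: 2 states, 2 actions; the "expert" prefers action 0 everywhere
rng = np.random.default_rng(0)
T_hat = rng.dirichlet(np.ones(2), size=(2, 2))       # stand-in learned dynamics
mu = np.array([0.5, 0.5])
rho_E = occupancy(T_hat, np.array([[0.9, 0.1], [0.9, 0.1]]), mu)
rho_D = np.full((2, 2), 0.25)                        # uniform "diverse" coverage
beta = np.full((2, 2), 0.1)                          # fixed weights for illustration

r, pi = np.zeros((2, 2)), np.full((2, 2), 0.5)
for _ in range(30):                                  # CLARE alternation
    r = reward_step(r, occupancy(T_hat, pi, mu), rho_E, rho_D, beta)
    pi = soft_policy(r, T_hat)
assert (pi[:, 0] > pi[:, 1]).all()                   # learned policy mimics the expert
```

The design choice mirrored here is the alternation itself: the reward step only moves reward mass between rollout and offline distributions, and the policy step fully re-solves a MaxEnt RL problem on the fixed learned model.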

4. THEORETICAL ANALYSIS OF CLARE

In this section, we focus on answering the following question: "How to set β(s, a) for each offline state-action pair to strike the two-tier exploitation-exploration balance appropriately?" To this end, we first quantify the impact of the tradeoffs via bounding the return gap between the learned policy and expert policy. Then, we derive the optimal weight parameters to minimize this gap. All the detailed proofs can be found in Appendix B. Notably, this section works with finite state and action spaces, but our algorithms and experiments run in high-dimensional and continuous environments.

4.1. CONVERGENCE ANALYSIS

We first characterize the policy learned by CLARE in terms of $\beta(s, a)$ and the empirical distributions $\hat\rho_E$ and $\hat\rho_D$. Before proceeding, it is easy to see that CLARE iteratively solves the min-max problem:
$$\min_{r \in \mathcal{R}} \max_{\pi \in \Pi}\; \underbrace{\alpha \hat H(\pi) + Z_\beta\, \mathbb{E}_{\hat\rho^{\pi}}[r(s, a)] - \mathbb{E}_{\hat\rho_D}[\beta(s, a) r(s, a)] - \mathbb{E}_{\hat\rho_E}[r(s, a)] + Z_\beta\, \psi(r)}_{\doteq L(\pi, r)}. \tag{4}$$
For dynamics $T$, define the set of occupancy measures satisfying the Bellman flow constraints as
$$\mathcal{C}_T \doteq \Big\{\rho \in \mathbb{R}^{|S||A|}: \rho \ge 0 \text{ and } \sum_a \rho(s, a) = \mu(s) + \gamma \sum_{s', a} T(s|s', a)\rho(s', a)\;\; \forall s \in S\Big\}. \tag{5}$$
We first provide the following results for switching between policies and occupancy measures, which allow us to use $\pi_\rho$ to denote the unique policy for occupancy measure $\rho$.

Lemma 4.1 (Theorem 2 in Syed et al. (2008)). If $\rho \in \mathcal{C}_T$, then $\rho$ is the occupancy measure for stationary policy $\pi_\rho(a|s) \doteq \rho(s, a)/\sum_{a'} \rho(s, a')$, and $\pi_\rho$ is the only stationary policy with occupancy measure $\rho$.

Lemma 4.2 (Lemma 3.2 in Ho & Ermon (2016)). Denote $\bar H(\rho) \doteq -\sum_{s,a} \rho(s, a) \log \frac{\rho(s, a)}{\sum_{a'} \rho(s, a')}$. Then $\bar H$ is strictly concave, and for all $\pi \in \Pi$ and $\rho \in \mathcal{C}_T$, $H(\pi) = \bar H(\rho^{\pi})$ and $\bar H(\rho) = H(\pi_\rho)$ hold true, where $\pi_\rho(a|s) \doteq \rho(s, a)/\sum_{a'} \rho(s, a')$.

Based on Lemma 4.1 and Lemma 4.2, we have the following result on the learned policy.

Theorem 4.1. Assume that $\beta(s, a) \ge -\hat\rho_E(s, a)/\hat\rho_D(s, a)$ holds for all $(s, a) \in D$. For Problem (4), the following relationship holds:
$$\min_{r \in \mathcal{R}} \max_{\pi \in \Pi} L(\pi, r) = \max_{\rho \in \mathcal{C}_{\hat T}}\; \alpha \bar H(\rho) - Z_\beta\, D_\psi\Big(\rho,\; \frac{\hat\rho_E + \beta \hat\rho_D}{Z_\beta}\Big), \tag{6}$$
with $D_\psi(\rho_1, \rho_2) \doteq \psi^*(\rho_2 - \rho_1)$, where $\psi^*$ is the convex conjugate of $\psi$.

Notably, by selecting appropriate forms of the reward regularizer $\psi$, $D_\psi$ can belong to a wide range of statistical distances. For example, if $\psi(r) = \alpha r^2$, then $D_\psi(\rho_1, \rho_2) = \frac{1}{4\alpha} \chi^2(\rho_1, \rho_2)$; if $\psi$ restricts $r \in [-R_{\max}, R_{\max}]$, then $D_\psi(\rho_1, \rho_2) = 2R_{\max} D_{TV}(\rho_1, \rho_2)$ (Garg et al., 2021).

Theorem 4.1 implies that CLARE implicitly seeks a policy under $\hat T$ whose occupancy measure stays close to an interpolation of the empirical distributions of expert dataset $D_E$ and union offline dataset $D$. The interpolation reveals that CLARE trades off the exploration of the model against the exploitation of offline data by selecting proper weight parameters $\beta(s, a)$. For example, if $\beta(s, a) = 0$ for all $(s, a) \in D$, CLARE will completely follow the occupancy measure of the (empirical) expert policy while exploring the model freely. In contrast, if $\beta(s, a)$ increases with $\hat\rho_D(s, a)$, the learned policy will look for richer data support.

Remarks. Looking deeper into Eq. (6), the target occupancy measure can be expressed equivalently as $\frac{(1 + \beta D_E/D)\hat\rho_E + (\beta D_B/D)\hat\rho_B}{Z_\beta}$ after rearranging terms in the above interpolation. As a result, CLARE also subtly balances the exploitation between the expert and diverse datasets to extract potentially valuable information from the sub-optimal data.
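The interpolated target distribution in Theorem 4.1 is easy to sanity-check numerically. The snippet below (illustrative, not from the paper) verifies that $(\hat\rho_E + \beta \odot \hat\rho_D)/Z_\beta$ is itself a valid distribution whenever $\beta$ satisfies the assumption of the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
rho_E = rng.dirichlet(np.ones(8))        # empirical expert occupancy
rho_D = rng.dirichlet(np.ones(8))        # empirical occupancy of the union dataset
beta = rng.uniform(-0.5, 1.5, size=8)    # pointwise weights
beta = np.maximum(beta, -rho_E / rho_D)  # enforce beta >= -rho_E/rho_D (Theorem 4.1)

Z = 1.0 + np.sum(rho_D * beta)           # normalization Z_beta
target = (rho_E + beta * rho_D) / Z      # the interpolated target distribution

assert np.all(target >= 0) and abs(target.sum() - 1.0) < 1e-9
# beta = 0 everywhere recovers the empirical expert occupancy exactly
assert np.allclose((rho_E + 0 * rho_D) / 1.0, rho_E)
```

The assumption $\beta(s, a) \ge -\hat\rho_E(s, a)/\hat\rho_D(s, a)$ is exactly what keeps the numerator non-negative, and the choice of $Z_\beta$ is what makes the mixture sum to one.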

4.2. STRIKING THE RIGHT EXPLORATION-EXPLOITATION BALANCE

Next, we show how to set $\beta(s, a)$ properly to achieve the right two-tier balance. Recall that $J(\pi) \doteq \mathbb{E}_{s,a \sim \rho^{\pi}}[R(s, a)]$ is the return achieved by policy $\pi$. The next result provides an upper bound on the gap between $J(\pi)$ and $J(\pi_E)$, which hinges on the intrinsic trade-offs.

Theorem 4.2. Suppose $|R(s, a)| \le 1$ for all $s \in S$, $a \in A$. For any stationary policy $\pi$, let $\hat\rho^{\pi}$ denote the occupancy measure of $\pi$ under the estimated model $\hat T$. We have
$$J(\pi_E) - J(\pi) \le C \cdot \mathbb{E}_{s,a \sim \hat\rho^{\pi}}\big[D_{TV}\big(\hat T(\cdot|s, a), T(\cdot|s, a)\big)\big] + 2 D_{TV}(\hat\rho^{\pi}, \hat\rho_E) + 2 D_{TV}(\hat\rho_E, \rho_E),$$
where $C \doteq \frac{2\gamma}{1-\gamma}$, and $\rho_E$ is the occupancy measure of expert policy $\pi_E$ under the true dynamics $T$.

Remarks. Theorem 4.2 indicates that a good policy learned from the estimated model not only follows the expert behaviors but also keeps to the "safe region" of the learned model, i.e., visiting the state-actions with less model estimation inaccuracy. Under the concentration assumption, the following holds with probability greater than $1-\delta$:
$$J(\pi_E) - J(\pi) \le \underbrace{\mathbb{E}_{s,a \sim \hat\rho^{\pi}}\Big[\frac{C\, C_\delta}{\sqrt{|D_E(s, a)| + |D_B(s, a)|}}\Big]}_{(a)} + \underbrace{2 D_{TV}(\hat\rho^{\pi}, \hat\rho_E)}_{(b)} + \underbrace{2 D_{TV}(\hat\rho_E, \rho_E)}_{(c)}, \tag{7}$$
where $D(s, a) \doteq \{(s', a') \in D: s' = s, a' = a\}$. This aligns well with the aforementioned exploration-exploitation balance: 1) term (a) captures the exploitation of offline data support; 2) term (b) captures the exploitation of expert data and the exploration of the model (recall that $\hat\rho^{\pi}$ is the occupancy measure of rolling out $\pi$ with $\hat T$); and 3) term (c) captures the distributional shift in offline learning. Importantly, the result in Theorem 4.2 connects the true return of a policy with its occupancy measure on the learned model. This gives us a criterion for evaluating the performance of a policy offline. Define $c(s, a) \doteq C \cdot D_{TV}(\hat T(\cdot|s, a), T(\cdot|s, a))$ and $c_{\min} \doteq \min_{s,a} c(s, a)$. Subsequently, we derive the occupancy measure that minimizes the RHS of Eq. (7).

Theorem 4.3. Under the same conditions as in Theorem 4.2, the optimal occupancy measure minimizing the upper bound of Eq. (7) is given by
$$\rho^*(s, a) = \begin{cases} \hat\rho_E(s, a) + \Delta_\rho, & \text{if } c(s, a) \le c_{\min}, \\ 0, & \text{if } c(s, a) > c_{\min} + 2, \\ \hat\rho_E(s, a), & \text{otherwise,} \end{cases}$$
where $\Delta_\rho \doteq \frac{\sum_{s', a'} 1[c(s', a') - c_{\min} > 2] \cdot \hat\rho_E(s', a')}{|N_{\min}|}$ and $N_{\min} \doteq \{(s, a) \in D: c(s, a) \le c_{\min}\}$.

As shown in Theorem 4.3, the "optimal" policy learned on model $\hat T$ conservatively explores the model by avoiding visits to risky state-actions. Meanwhile, it cleverly exploits the accurate region, so that it does not deviate far from the expert. Now, we are ready to derive the optimal values of the weight parameters.

Corollary 4.1. Under the same conditions as in Theorem 4.2, if the weights $\beta^*(s, a)$ are set such that $\big(\hat\rho_E(s, a) + \beta^*(s, a) \hat\rho_D(s, a)\big)/Z_{\beta^*} = \rho^*(s, a)$ for all $(s, a) \in D$, then it follows that $\min_{r \in \mathcal{R}} \max_{\pi \in \Pi} L(\pi, r) = \max_{\pi}\; \alpha \bar H(\rho^{\pi}) - Z_\beta D_\psi(\rho^{\pi}, \rho^*)$.

Corollary 4.1 provides the value of $\beta(s, a)$ for each $(s, a) \in D$ such that the learned reward function guides the policy to minimize the return gap in Eq. (7). It indicates that the right exploitation-exploration trade-off can be provably balanced by setting the weight parameters properly. In particular, $\beta^*$ assigns positive weight to offline state-actions with accurate model estimation and negative weight to those with large model error. It enables CLARE to learn a conservative reward function that pessimistically evaluates out-of-distribution states and actions, capable of ameliorating the extrapolation error in unseen environments. However, the optimal weights require the model error, $c(s, a)$, which is typically hard to obtain (especially in high-dimensional and continuous spaces). Section 5 will solve this problem by extending this result with the aid of model ensembles and uncertainty quantification techniques.
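The three-case construction of $\rho^*$ in Theorem 4.3 can be written out directly. Below is a small illustrative numpy sketch (with made-up numbers; it assumes, as in the theorem, that the removed mass is redistributed uniformly over the most accurate pairs):

```python
import numpy as np

def optimal_occupancy(rho_E, c):
    """Theorem 4.3-style target: keep expert mass where the model is accurate,
    zero it where the error exceeds c_min + 2, and redistribute the removed
    mass uniformly over the most accurate pairs (illustrative tabular sketch)."""
    c_min = c.min()
    keep_best = c <= c_min               # N_min: most accurate state-actions
    drop = c > c_min + 2.0               # risky region gets zero mass
    delta = rho_E[drop].sum() / keep_best.sum()
    rho = rho_E.copy()
    rho[drop] = 0.0
    rho[keep_best] += delta              # redistribute the dropped expert mass
    return rho

rho_E = np.array([0.4, 0.3, 0.2, 0.1])   # toy empirical expert occupancy
c = np.array([0.0, 0.5, 3.0, 0.0])       # pair 2 lies far outside the accurate region
rho_star = optimal_occupancy(rho_E, c)
assert abs(rho_star.sum() - 1.0) < 1e-9 and rho_star[2] == 0.0
```

Note how the construction preserves total mass: whatever is removed from the risky region reappears as $\Delta_\rho$ on $N_{\min}$, so $\rho^*$ remains a valid occupancy measure.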

5. PRACTICAL IMPLEMENTATION

Learning dynamics models. Following state-of-the-art model-based methods (Yu et al., 2020; 2021), we model the transition dynamics by an ensemble of neural networks, each of which outputs a Gaussian distribution over next states, i.e., $\{\hat T_i(s'|s, a) = \mathcal{N}(\mu_i(s, a), \Sigma_i(s, a))\}_{i=1}^{N}$.

Weights in continuous environments. The ideas behind implementing CLARE in continuous environments are 1) to approximately treat the offline data as sampled from a large discrete space, and 2) to use an uncertainty quantification technique to estimate the model error. Specifically, because state-action pairs are essentially distinct from each other in this setting, we let $\hat\rho_D(s, a) = 1/D$ and $\hat\rho_E(s, a) = 1/D_E$, and employ the uncertainty estimator $c(s, a) = \max_{i \in [N]} \|\Sigma_i(s, a)\|_F$ proposed in Yu et al. (2020) for model error evaluation. Guided by the analytical results in Corollary 4.1, we compute the weight for each $(s, a) \in D$ via a slight relaxation as follows:
$$\beta(s, a) = \begin{cases} \dfrac{N'' D}{N' D_E}, & \text{if } c(s, a) \le u, \\[4pt] -\dfrac{D}{D_E} \cdot 1[(s, a) \in D_E], & \text{if } c(s, a) > u, \\[4pt] 0, & \text{otherwise,} \end{cases}$$
where $N' \doteq \sum_{(s,a) \in D} 1[c(s, a) \le u]$ and $N'' \doteq \sum_{(s,a) \in D_E} 1[c(s, a) > u]$. Here, the coefficient $u$ is a user-chosen hyper-parameter controlling the conservatism level of CLARE. If one wants the learned policy to be trained more conservatively on the offline data support, $u$ should be small; otherwise, $u$ can be chosen to be large for better exploration.

Reward and policy regularizers. In the experiments, we use $\psi(r) = r^2$ as the reward regularizer. Additionally, when updating the policy, we use a KL divergence regularizer with the empirical behavior policy $\pi_b$ induced by a subset of the offline dataset, $D' \subset D$:
$$D_{KL}(\pi_b \| \pi) \doteq \mathbb{E}_{s \in D'}\big[\mathbb{E}_{a \sim \pi_b(\cdot|s)} \log \pi_b(a|s) - \mathbb{E}_{a \sim \pi_b(\cdot|s)} \log \pi(a|s)\big],$$
where $\pi_b(a|s) = \frac{\sum_{(s',a') \in D'} 1[s' = s,\, a' = a]}{\sum_{(s',a') \in D'} 1[s' = s]}$ if $(s, a) \in D'$, and $\pi_b(a|s) = 0$ otherwise. It can be implemented by adding $-\mathbb{E}_{s,a \sim D'}[\log \pi(a|s)]$ to the actor loss.
The intuition is to encourage the actor to stay in the support of the real data, accelerating safe policy improvement. While this regularization lacks theoretical guarantees, we empirically find that it can indeed speed up training.

Practical algorithm design. The pseudocode of CLARE is depicted in Algorithm 1. The policy improvement phase can be implemented with a standard implementation of SAC (Haarnoja et al., 2018), modified only by the additional policy regularizer. We provide more details in Appendix A.
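The piecewise weights of Section 5 are simple to compute once the ensemble uncertainty $c(s, a)$ is available. An illustrative numpy sketch (variable names are ours, not the paper's; it assumes at least one offline pair satisfies $c \le u$):

```python
import numpy as np

def clare_weights(c, in_expert, u):
    """Piecewise weights from Section 5: c is the per-pair ensemble uncertainty,
    in_expert marks pairs drawn from D_E, u is the conservatism threshold.
    Assumes at least one pair satisfies c <= u (so N' > 0)."""
    D = len(c)
    D_E = int(in_expert.sum())
    low = c <= u
    n_prime = int(low.sum())                    # N'  = |{(s,a) in D   : c <= u}|
    n_pprime = int((in_expert & ~low).sum())    # N'' = |{(s,a) in D_E : c >  u}|
    beta = np.zeros(D)
    beta[low] = n_pprime * D / (n_prime * D_E)  # reward pushed up where model is accurate
    beta[in_expert & ~low] = -D / D_E           # expert pairs in risky regions penalized
    return beta

c = np.array([0.1, 0.9, 0.2, 0.8])              # normalized uncertainties
in_expert = np.array([True, True, False, False])
beta = clare_weights(c, in_expert, u=0.5)
assert beta[1] == -2.0 and beta[3] == 0.0       # risky expert pair penalized; risky
                                                # non-expert pair ignored
```

Shrinking $u$ makes more pairs fall in the risky branch, so the learned reward grows more conservative, matching the role of $u$ described above.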

6. EXPERIMENTS

Next, we conduct experimental studies to evaluate CLARE and answer the following key questions: (1) How does CLARE perform on standard offline RL benchmarks in comparison to existing state-of-the-art algorithms? (2) How does CLARE perform given different dataset sizes? (3) How does the "conservatism level", u, affect the performance? (4) How fast does CLARE converge? (5) Can the learned reward function effectively explain the expert intention?

To answer these questions, we compare CLARE with the following offline IRL methods on the D4RL benchmark (Fu et al., 2020): 1) IQ-LEARN (Garg et al., 2021), a state-of-the-art model-free offline IRL algorithm; 2) AVRIL (Chan & van der Schaar, 2021), another recent model-free offline IRL method; 3) EDM (Jarrett et al., 2020), a state-of-the-art offline IL approach; and 4) Behavior Cloning (BC). To demonstrate the poor performance of the naive approach using a simple combination of IRL with a model-based offline forward RL (MORL) method, we also consider a baseline algorithm, namely MOMAX, that directly uses COMBO (Yu et al., 2021) in the inner loop of MaxEnt IRL. We present results on continuous control tasks (including Half-Cheetah, Walker2d, Hopper, and Ant) with three data qualities (random, medium, and expert). The experimental setup and hyperparameters are described in detail in Appendix A.

Results under different dataset sizes. To answer the second question, we vary the total number of state-action tuples from 2k to 100k and present the results on different tasks in Figure 2. CLARE reaches expert performance on each task with sufficient data. Even with very limited data, CLARE achieves strong performance over existing algorithms, revealing its great sample efficiency.

Results under different u. To answer the third question, we normalize the uncertainty measure to [0, 1] and vary u from 0.1 to 1.0. Due to Eq. (11), a smaller u corresponds to a more conservative CLARE.
As illustrated in Figure 3(a), the performance improves as the value of u decreases, which validates the importance of the embedded conservatism in alleviating the extrapolation error. We empirically find that the performance with respect to u varies across tasks; thus, we treat it as a hyper-parameter to tune in practice.

Convergence speed. To answer the fourth question, we present the results on the convergence speed of CLARE in Figure 3(b), revealing its great learning efficiency. It shows that CLARE converges within 5 iterations and fewer than 50k total gradient steps.

Recovered reward function. To answer the last question, we evaluate the learned reward function by transferring it to the real environment. As demonstrated in Figure 3(c), the reward function is highly instructive for online learning, implying that it can effectively reduce the reward extrapolation error and represent the task preferences well. Surprisingly, compared to the true reward function, the policy trained with the learned one performs more stably. The reason is that the learned reward incorporates conservatism and is thus capable of penalizing risks and guiding safe policy search.

7. RELATED WORK

Offline IRL. To side-step the expensive online environmental interactions in classic IRL, offline IRL aims to infer a reward function and recover the expert policy only from a static dataset, with no access to the environment. Klein et al. (2011) extend classic apprenticeship learning (Abbeel & Ng, 2004) to batch and off-policy cases by introducing a temporal difference method, namely LSTD-µ, to compute the feature expectations therein. However, the required knowledge of the reward feature function is often unrealistic, because the choice of features is problem-dependent and can become a very hard task for complex problems (Arora & Doshi, 2021; Piot et al., 2014). To address this problem, Piot et al. (2014) propose a non-parametric algorithm, called RCAL, using a boosting method to directly minimize the criterion without the step of choosing features. Konyushkova et al. (2020) propose two semi-supervised learning algorithms that learn a reward function from limited human reward annotations. Zolna et al. (2020) further propose ORIL, which can learn from both expert demonstrations and a large unlabeled set of experiences without human annotations. Chan & van der Schaar (2021) use a variational method to jointly learn an approximate posterior distribution over the reward and policy. Garg et al. (2021) propose an off-policy IRL approach, namely IQ-Learn, implicitly representing both reward and policy via a learned soft Q-function. Nevertheless, these methods primarily concentrate on offline policy learning, with reward learning being an intermediate step. Due to the intrinsic covariate shift, they may suffer from severe reward extrapolation error, leading to misguidance in unseen environments and low learning efficiency.

Offline IL. Akin to offline IRL, offline imitation learning (offline IL) deals with training an agent to directly mimic the actions of a demonstrator in an entirely offline fashion.
Behavioral cloning (BC (Ross & Bagnell, 2010)) is an intrinsically offline solution, but it fails to exploit valuable dynamics information. To tackle this issue, several recent works propose dynamics-aware offline IL approaches, e.g., Kostrikov et al. (2019); Jarrett et al. (2020); Chang et al. (2021); Swamy et al. (2021). In contrast to directly mimicking the expert as done in offline IL, offline IRL explicitly learns the expert's reward function from offline datasets, which can take into account the temporal structure and inform what the expert wishes to achieve, rather than simply what they are reacting to. It enables agents to understand and generalize these "intentions" when encountering similar environments, and therefore makes offline IRL more robust (Lee et al., 2019). In addition, the learned reward function can succinctly explain the expert's objective, which is also useful in a number of broader applications (e.g., task description (Ng et al., 2000) and transfer learning (Herman et al., 2016)).

8. CONCLUSION

This paper introduces a new offline IRL algorithm (namely CLARE) to address the reward extrapolation error (caused by covariate shift) by incorporating conservatism into a learned reward function and utilizing an estimated dynamics model. Our theoretical analysis characterizes the impact of covariate shift by quantifying the subtle two-tier exploitation-exploration tradeoffs, and we show that CLARE can provably alleviate the reward extrapolation error by striking the right balance therein. Extensive experiments corroborate that CLARE outperforms existing methods in continuous, high-dimensional environments by a significant margin, and that the learned reward function represents the task preferences well.

Conservatism level u. For all tasks, we normalize the uncertainty measure to [0, 1] and test u from the set {0.4, 0.6, 0.8}. The results are shown in Table 3. In each experiment, we select the u value that achieves the maximum corresponding score.

Learning rates. For all experiments, the reward learning rate is $\eta = 5 \times 10^{-5}$. Our empirical studies indicate that a relatively small reward learning rate leads to more stable training. Additionally, the learning rates for the actor and critic are both $3 \times 10^{-4}$, and that for the dynamics model is $10^{-3}$.

Policy regularization. For all experiments, the policy regularization weight is $\lambda = 0.25$.

A.3 MORE EXPERIMENTAL RESULTS

We further evaluate CLARE by answering the following questions: 1) Can CLARE exploit useful information from diverse datasets? 2) How does CLARE perform compared to a simple combination of MORL and (online) IRL methods? 3) What is the impact of reward weighting? 4) What is the impact of expert sample sizes?

Exploitation of diverse data. Table 4 shows the results under different data combinations. By using additional medium data, the performance can be improved over that using only 5k expert tuples. The underlying rationale is: 1) the diverse datasets contain some good state-actions; and 2) the diverse data support enables CLARE to safely generalize to states beyond the expert data manifold.

Expert sample sizes. Table 5 shows the average returns (over 5 random seeds) under different expert sample sizes with a fixed amount of medium data (50k). This corroborates our analytical results that, with relatively sufficient data coverage of the empirical expert behaviors, the performance is dominated by the expert sample size (combining Theorem 4.2, Theorem 4.3, and Corollary 4.1).

Comparison to MOMAX. To demonstrate the poor performance of the naive approach using a simple combination of IRL with a model-based offline forward RL (MORL) method, we design a baseline directly using a state-of-the-art MORL method, COMBO (Yu et al., 2021), in the inner loop of MaxEnt IRL (Eq. (1)), called MOMAX. As shown in Figure 4, MOMAX does not work well in these tasks.

A.4 COMPUTATIONAL COMPLEXITY

We implement the code in PyTorch 1.11.0 on a server with a 32-core AMD Ryzen Threadripper PRO 3975WX and an NVIDIA GeForce RTX 3090 Ti. For all tasks, CLARE converges within one hour (around 5-10 iterations with 50k-100k total gradient steps).

B PROOFS

In this section, we provide detailed proofs of main results in Section 4.

B.1 PROOF OF THEOREM 4.1

This proof builds on that of Ho & Ermon (2016, Proposition 3.1). First, it follows from Eq. (4) that

$$L(\pi, r) = \alpha \bar H(\pi) + Z_\beta \big( \mathbb{E}_{\bar\rho_\pi}[r(s,a)] - \mathbb{E}_{\bar\rho_I}[r(s,a)] + \psi(r) \big).$$

It is easy to see that $R$ is compact and convex. Besides, from the proof of Ho & Ermon (2016, Proposition 3.1), $C_{\bar T}$ is also a compact and convex set. Accordingly, based on the concavity of $\bar H$ (Lemma 4.2), the minimax theorem holds (Du & Pardalos, 2013), and hence we have

$$\min_{r \in R} \max_{\rho \in C_{\bar T}} L(\rho, r) = \max_{\rho \in C_{\bar T}} \min_{r \in R} L(\rho, r) = \max_{\rho \in C_{\bar T}} \alpha \bar H(\rho) + Z_\beta \min_{r \in R} \big( \mathbb{E}_{\rho}[r(s,a)] - \mathbb{E}_{\bar\rho_I}[r(s,a)] + \psi(r) \big)$$
$$= \max_{\rho \in C_{\bar T}} \alpha \bar H(\rho) + Z_\beta\, \psi^*\big(\bar\rho_I - \rho\big) \quad \text{(from the definition of the convex conjugate)}$$
$$= \max_{\rho \in C_{\bar T}} \alpha \bar H(\rho) + Z_\beta\, D_\psi\big(\rho, \bar\rho_I\big).$$

Due to Eq. (14), $(r^*, \rho^*)$ is a saddle point of $L$, and thus $\rho^* \in \arg\max_{\rho \in C_{\bar T}} L(\rho, r^*)$. By Lemma 4.1, it is easy to see that the policy $\pi^*$ corresponding to $\rho^*$ satisfies $\pi^* \in \arg\max_{\pi \in \Pi} L(\pi, r^*)$, thereby completing the proof.

B.2 PROOF OF THEOREM 4.2

We present the following two lemmas before proving the main result.

Lemma B.1. Denote $p_1(x,y) = q_1(x) q_1(y|x)$ and $p_2(x,y) = q_2(x) q_2(y|x)$ as two joint distributions over finite spaces. Then the total variation distance (TVD) between $p_1$ and $p_2$ is bounded as

$$D_{TV}(p_1, p_2) \le \mathbb{E}_{x \sim q_1}\big[ D_{TV}(q_1(\cdot|x), q_2(\cdot|x)) \big] + D_{TV}(q_1, q_2).$$

Proof. The proof is straightforward:

$$D_{TV}(p_1, p_2) = \frac{1}{2} \sum_{x,y} \big| p_1(x,y) - p_2(x,y) \big| = \frac{1}{2} \sum_{x,y} \big| q_1(x) q_1(y|x) - q_2(x) q_2(y|x) \big|$$
$$= \frac{1}{2} \sum_{x,y} \big| q_1(x) q_1(y|x) - q_1(x) q_2(y|x) + q_1(x) q_2(y|x) - q_2(x) q_2(y|x) \big|$$
$$\le \frac{1}{2} \sum_{x,y} q_1(x) \big| q_1(y|x) - q_2(y|x) \big| + \frac{1}{2} \sum_{x,y} q_2(y|x) \big| q_1(x) - q_2(x) \big|$$
$$= \sum_{x} q_1(x) \cdot \frac{1}{2} \sum_{y} \big| q_1(y|x) - q_2(y|x) \big| + \frac{1}{2} \sum_{x} \big| q_1(x) - q_2(x) \big| \sum_{y} q_2(y|x)$$
$$= \mathbb{E}_{x \sim q_1}\big[ D_{TV}(q_1(\cdot|x), q_2(\cdot|x)) \big] + D_{TV}(q_1, q_2),$$

where the last equality holds because $\sum_y q_2(y|x) = 1$.

Lemma B.2. Suppose that we have two Markov chain transition distributions $T_1(s'|s)$ and $T_2(s'|s)$ with the same initial state distribution $\mu$. Then, for each $h \in \{1, 2, \ldots\}$
, the TVD of the state marginals at time step $h$ is bounded as

$$D_{TV}(p_1^h, p_2^h) \le \sum_{h'=0}^{h-1} \mathbb{E}_{s \sim p_2^{h'}}\big[ D_{TV}(T_1(\cdot|s), T_2(\cdot|s)) \big],$$

where $p_i^h(s) := \Pr(s_h = s \mid T_i, \mu)$ for $i = 1, 2$.

Proof. First, we have

$$\big| p_1^h(s) - p_2^h(s) \big| = \Big| \sum_{s'} T_1(s|s') p_1^{h-1}(s') - \sum_{s'} T_2(s|s') p_2^{h-1}(s') \Big| \le \sum_{s'} \big| T_1(s|s') p_1^{h-1}(s') - T_2(s|s') p_2^{h-1}(s') \big|$$
$$= \sum_{s'} \big| T_1(s|s') p_1^{h-1}(s') - T_1(s|s') p_2^{h-1}(s') + T_1(s|s') p_2^{h-1}(s') - T_2(s|s') p_2^{h-1}(s') \big|$$
$$\le \sum_{s'} T_1(s|s') \big| p_1^{h-1}(s') - p_2^{h-1}(s') \big| + p_2^{h-1}(s') \big| T_1(s|s') - T_2(s|s') \big|$$
$$= \sum_{s'} T_1(s|s') \big| p_1^{h-1}(s') - p_2^{h-1}(s') \big| + \mathbb{E}_{s' \sim p_2^{h-1}} \big| T_1(s|s') - T_2(s|s') \big|. \tag{19}$$

Thus, we can write

$$D_{TV}(p_1^h, p_2^h) = \frac{1}{2} \sum_{s} \big| p_1^h(s) - p_2^h(s) \big| \le \frac{1}{2} \sum_{s} \mathbb{E}_{s' \sim p_2^{h-1}} \big| T_1(s|s') - T_2(s|s') \big| + \frac{1}{2} \sum_{s} \sum_{s'} T_1(s|s') \big| p_1^{h-1}(s') - p_2^{h-1}(s') \big| \quad \text{(using Eq. (19))}$$
$$= \mathbb{E}_{s' \sim p_2^{h-1}}\big[ D_{TV}(T_1(\cdot|s'), T_2(\cdot|s')) \big] + D_{TV}(p_1^{h-1}, p_2^{h-1}) \quad \text{(using } \textstyle\sum_s T_1(s|s') = 1\text{)} \tag{20}$$
$$\le \sum_{h'=0}^{h-1} \mathbb{E}_{s \sim p_2^{h'}}\big[ D_{TV}(T_1(\cdot|s), T_2(\cdot|s)) \big] + D_{TV}(p_1^0, p_2^0) \quad \text{(iterating Eq. (20))}$$
$$= \sum_{h'=0}^{h-1} \mathbb{E}_{s \sim p_2^{h'}}\big[ D_{TV}(T_1(\cdot|s), T_2(\cdot|s)) \big], \quad \text{(due to the same initial state distributions)}$$

which completes the proof.

Observe that Lemma B.1 bounds the TVD of a joint distribution by the TVDs of its corresponding conditional and marginal distributions, and that Lemma B.2 bounds the gap between two Markov chains' state distributions at each time step by the cumulative dynamics differences. Next, we provide a lemma that bounds the difference between the expert's and the learned policy's occupancy measures from above.

Lemma B.3. For each $\rho \in C_{\bar T}$, denote by $\pi$ its corresponding stationary policy, i.e., $\pi(a|s) := \rho(s,a) / \sum_{a'} \rho(s,a')$, and let $\rho_\pi$ denote the occupancy measure of $\pi$ under the true transition dynamics $T$.
Then the following holds:

$$D_{TV}(\rho_\pi, \rho_E) \le \frac{\gamma}{1-\gamma} \mathbb{E}_{(s,a) \sim \rho}\big[ D_{TV}(T(\cdot|s,a), \bar T(\cdot|s,a)) \big] + D_{TV}(\rho, \hat\rho_E) + D_{TV}(\hat\rho_E, \rho_E), \tag{21}$$

where $\rho_E$ is the occupancy measure of the expert policy $\pi_E$ under the true transition dynamics and $\hat\rho_E$ is its empirical counterpart.

Proof. For conciseness, let $\rho_1 := \rho_\pi$ and $\rho_2 := \rho$. Using the triangle inequality, it is easy to see that

$$D_{TV}(\rho_1, \rho_E) \le D_{TV}(\rho_1, \rho_2) + D_{TV}(\rho_2, \hat\rho_E) + D_{TV}(\hat\rho_E, \rho_E). \tag{22}$$

To bound $D_{TV}(\rho_1, \rho_2)$, denote $p_1^h(s,a) := \Pr(s_h = s, a_h = a \mid T, \pi, \mu)$ and $p_2^h(s,a) := \Pr(s_h = s, a_h = a \mid \bar T, \pi, \mu)$ (they differ only in the dynamics). We can write

$$D_{TV}(\rho_1, \rho_2) = \frac{1}{2} \sum_{s,a} \big| \rho_1(s,a) - \rho_2(s,a) \big| = \frac{1}{2} \sum_{s,a} \Big| (1-\gamma) \sum_{h=0}^{\infty} \gamma^h p_1^h(s,a) - (1-\gamma) \sum_{h=0}^{\infty} \gamma^h p_2^h(s,a) \Big| \quad \text{(definition of the occupancy measure, Section 2)}$$
$$\le (1-\gamma) \sum_{h=0}^{\infty} \gamma^h \cdot \frac{1}{2} \sum_{s,a} \big| p_1^h(s,a) - p_2^h(s,a) \big| = (1-\gamma) \sum_{h=0}^{\infty} \gamma^h D_{TV}\big( p_1^h(s,a), p_2^h(s,a) \big) \le (1-\gamma) \sum_{h=0}^{\infty} \gamma^h D_{TV}\big( p_1^h(s), p_2^h(s) \big),$$

where the last step uses Lemma B.1 and the fact that both joints share the conditional $\pi(a|s)$. The state marginals $p_1^h(s) := \Pr(s_h = s \mid T, \pi, \mu)$ and $p_2^h(s) := \Pr(s_h = s \mid \bar T, \pi, \mu)$ evolve according to the Markov chains $T_1(s'|s) := \sum_a T(s',a|s)$ and $T_2(s'|s) := \sum_a \bar T(s',a|s)$, where we slightly overload notation with $T(s',a|s) := \pi(a|s) T(s'|s,a)$ and $\bar T(s',a|s) := \pi(a|s) \bar T(s'|s,a)$. Applying Lemma B.1 (viewing $\pi(a|s)$ as both $q_1(x)$ and $q_2(x)$, and $T(\cdot|s,a)$, $\bar T(\cdot|s,a)$ as $q_1(y|x)$, $q_2(y|x)$), we obtain

$$D_{TV}\big(T_1(\cdot|s), T_2(\cdot|s)\big) = \frac{1}{2} \sum_{s'} \Big| \sum_a T(s',a|s) - \bar T(s',a|s) \Big| \le \frac{1}{2} \sum_{s',a} \big| T(s',a|s) - \bar T(s',a|s) \big| = D_{TV}\big( T(s',a|s), \bar T(s',a|s) \big) \le \mathbb{E}_{a \sim \pi(\cdot|s)}\big[ D_{TV}(T(\cdot|s,a), \bar T(\cdot|s,a)) \big].$$

Based on this, the following holds:

$$D_{TV}(\rho_1, \rho_2) \le (1-\gamma) \sum_{h=0}^{\infty} \gamma^h D_{TV}\big( p_1^h(s), p_2^h(s) \big)$$
$$\le (1-\gamma) \sum_{h=1}^{\infty} \gamma^h \sum_{h'=0}^{h-1} \mathbb{E}_{s \sim p_2^{h'}}\big[ D_{TV}(T_1(\cdot|s), T_2(\cdot|s)) \big] \quad \text{(using } D_{TV}(p_1^0, p_2^0) = D_{TV}(\mu, \mu) = 0 \text{ and Lemma B.2)}$$
$$\le (1-\gamma) \sum_{h=1}^{\infty} \gamma^h \sum_{h'=0}^{h-1} \mathbb{E}_{s \sim p_2^{h'}} \mathbb{E}_{a \sim \pi(\cdot|s)}\big[ D_{TV}(T(\cdot|s,a), \bar T(\cdot|s,a)) \big] \quad \text{(using the result above)}$$
$$= (1-\gamma) \sum_{h=1}^{\infty} \gamma^h \sum_{h'=0}^{h-1} \mathbb{E}_{(s,a) \sim p_2^{h'}}\big[ D_{TV}(T(\cdot|s,a), \bar T(\cdot|s,a)) \big] \quad \text{(noting that } p_2^h(s,a) = p_2^h(s)\, \pi(a|s)\text{)}$$
$$= (1-\gamma) \sum_{s,a} D_{TV}\big(T(\cdot|s,a), \bar T(\cdot|s,a)\big) \sum_{h=1}^{\infty} \gamma^h \sum_{h'=0}^{h-1} p_2^{h'}(s,a)$$
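Both lemmas are easy to sanity-check numerically on random finite distributions. The following sketch is illustrative only (not part of the proof); all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

# Lemma B.1 on random joints p_i(x, y) = q_i(x) q_i(y|x) over finite spaces:
nx, ny = 4, 5
q1x, q2x = rng.dirichlet(np.ones(nx)), rng.dirichlet(np.ones(nx))
q1yx = rng.dirichlet(np.ones(ny), size=nx)  # rows are q1(.|x)
q2yx = rng.dirichlet(np.ones(ny), size=nx)
p1, p2 = q1x[:, None] * q1yx, q2x[:, None] * q2yx
lhs = tvd(p1.ravel(), p2.ravel())
rhs = sum(q1x[x] * tvd(q1yx[x], q2yx[x]) for x in range(nx)) + tvd(q1x, q2x)
assert lhs <= rhs + 1e-12

# Lemma B.2 on two random Markov chains with a shared initial distribution:
n = 5
T1 = rng.dirichlet(np.ones(n), size=n)  # row s is T1(.|s)
T2 = rng.dirichlet(np.ones(n), size=n)
mu = rng.dirichlet(np.ones(n))
p1h, p2h, bound = mu.copy(), mu.copy(), 0.0
for h in range(1, 10):
    bound += sum(p2h[s] * tvd(T1[s], T2[s]) for s in range(n))  # term for h' = h-1
    p1h, p2h = p1h @ T1, p2h @ T2                               # advance both chains
    assert tvd(p1h, p2h) <= bound + 1e-12
```

Running this for any seed leaves both assertions satisfied, matching the two bounds.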



Exploration in the context of this manuscript refers to enhancing the generalization capability of the algorithm by escaping the offline data manifold via model rollouts. For conciseness, we omit the constant multiplier $1/(1-\gamma)$ in the objective, i.e., the complete objective function is $\max_{\pi \in \Pi} \mathbb{E}_{(s,a) \sim \rho_\pi}[R(s,a)/(1-\gamma)]$. To avoid ambiguity, we use $D_{TV}(p_1^h(s,a), p_2^h(s,a))$ and $D_{TV}(p_1^h(s), p_2^h(s))$ to denote the TVDs between the corresponding state-action distributions and state distributions, respectively.



Figure 1: An illustration of the two-tier tradeoffs in CLARE.

I) Conservative reward updating. Given the current policy $\pi$, dynamics model $\bar T$, and offline datasets $D_E$ and $D_B$, CLARE updates the reward function $r$ based on the following loss:

$$L(r|\pi) := \underbrace{Z_\beta\, \mathbb{E}_{(s,a) \sim \bar\rho_\pi}[r(s,a)]}_{\text{penalized on model rollouts}} - \underbrace{\mathbb{E}_{(s,a) \sim \hat\rho_E}[r(s,a)]}_{\text{increased on expert data}} - \underbrace{\mathbb{E}_{(s,a) \sim \hat\rho_D}[\beta(s,a)\, r(s,a)]}_{\text{weighting expert and diverse data}} + \underbrace{Z_\beta\, \psi(r)}_{\text{regularizer}}, \tag{2}$$

where $\hat\rho_D(s,a) := (|D_E(s,a)| + |D_B(s,a)|)/(|D_E| + |D_B|)$ is the empirical distribution of $(s,a)$ in the union dataset $D = D_E \cup D_B$, and $\hat\rho_E$ is the empirical state-action distribution of the expert data.
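A minimal, framework-agnostic sketch of this loss follows. Here `r` is any callable mapping a batch of state-action features to rewards and `psi` is a pluggable regularizer; all names are ours, not from the released implementation:

```python
import numpy as np

def reward_loss(r, rollout_sa, expert_sa, offline_sa, beta, z_beta, psi=None):
    """Conservative reward loss of Eq. (2), as a sketch.

    r          -- callable mapping a batch of (s, a) features to scalar rewards
    rollout_sa -- samples from model rollouts of the current policy (rho_bar_pi)
    expert_sa  -- samples from the expert dataset D_E (rho_hat_E)
    offline_sa -- samples from the union dataset D = D_E u D_B (rho_hat_D)
    beta       -- per-sample weights beta(s, a) aligned with offline_sa
    psi        -- optional convex regularizer psi(r)
    """
    penalty = z_beta * r(rollout_sa).mean()   # pushed down on model rollouts
    expert = r(expert_sa).mean()              # pushed up on expert data
    weighted = (beta * r(offline_sa)).mean()  # weighted expert/diverse data
    reg = z_beta * psi(r) if psi is not None else 0.0
    return penalty - expert - weighted + reg
```

Minimizing this loss drives the reward down on model rollouts and up on the (weighted) offline data, which is exactly the conservatism mechanism described above.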

Algorithm: Conservative model-based reward learning (CLARE)
Input: expert data $D_E$, diverse data $D_B$, bar $u$, learning rate $\eta$, policy regularizer weight $\lambda$
Learn dynamics model $\bar T$, represented by an ensemble of neural networks, using all offline data;
Set weight $\beta(s,a)$ for each offline state-action tuple $(s,a) \in D_E \cup D_B$ by Eq. (11);
Initialize the policy $\pi_\theta$ and reward function $r_\phi$, parameterized by $\theta$ and $\phi$ respectively;
while not done do
    (Safe policy improvement) Run a MaxEnt RL algorithm for some steps with model $\bar T$ and the current reward function $r_\phi$ to update policy $\pi_\theta$, based on $L(\pi_\theta | r_\phi) - \lambda D_{KL}(\pi_b \| \pi_\theta)$;
    (Conservative reward updating) Update $r_\phi$ by $\phi \leftarrow \phi - \eta \nabla_\phi L(r_\phi | \pi_\theta)$ for a few steps;
end

Theorem 4.3.
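The CLARE loop can be sketched as plain Python with every subroutine injected as a callable. This is a structural sketch only; all names are illustrative, not from the released code:

```python
def clare(expert_data, diverse_data, policy, reward,
          learn_model, set_weights, policy_step, reward_step, n_iters=10):
    """Skeleton of the CLARE loop; subroutines are injected callables."""
    model = learn_model(expert_data + diverse_data)  # ensemble dynamics model T_bar
    beta = set_weights(expert_data, diverse_data)    # weights beta(s, a) via Eq. (11)
    for _ in range(n_iters):
        # Safe policy improvement: MaxEnt RL inside the model, with a
        # lambda * KL regularizer toward the behavior policy.
        policy = policy_step(policy, reward, model)
        # Conservative reward updating: a few gradient steps on Eq. (2).
        reward = reward_step(reward, policy, model, beta)
    return policy, reward
```

The alternation mirrors the listing above: the model and the weights are fit once from all offline data, while the policy and reward are updated in turn until convergence.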

Corollary 4.1. Suppose that $c(s,a) > c_{\min}$ holds for each $(s,a) \in S \times A$ with $\hat\rho_D(s,a) = 0$. Under the same condition as in Theorem 4.3, suppose $\beta(s,a)$ are set as

$$\beta^*(s,a) = \begin{cases} \ldots, & \text{if } c(s,a) \le c_{\min} \text{ and } \hat\rho_D(s,a) > 0, \\ -\dfrac{\hat\rho_E(s,a)}{\hat\rho_D(s,a)}, & \text{if } c(s,a) > c_{\min} + 2 \text{ and } \hat\rho_D(s,a) > 0, \\ 0, & \text{otherwise,} \end{cases}$$

Figure 2: CLARE against other algorithms on all tasks over different dataset sizes, with the datasets composed equally of expert and medium data.

Figure 3: Performance of CLARE. 1) Impact of u: Figure 3(a) shows the impact of the user-chosen parameter u on performance using 10k expert tuples. 2) Convergence speed: Figures 3(b) and 3(c) show the convergence of CLARE using 10k expert and 10k medium tuples. In each iteration, CLARE carries out policy improvement with 10k total gradient updates (500 epochs with 20 gradient steps per epoch) for the actor and critic networks using SAC. 3) Recovered reward: Figure 3(d) shows the result of training SAC with the underlying reward replaced by the one learned by CLARE.

Results on MuJoCo control. To answer the first question and validate the effectiveness of the learned reward, we evaluate CLARE on different tasks using limited state-action tuples sampled from D4RL datasets. As shown in Table 6, CLARE yields the best performance by a significant margin on almost all datasets, especially those containing low-quality data. This demonstrates that the reward function learned by CLARE can effectively guide offline policy search while exploiting the useful knowledge in the diverse data.

Algorithm 3: Conservative reward updating
Input: expert data $D_E$, diverse data $D_B$, replay buffer $D_{replay}$, model buffer $D_{model}$, reward function $r_\phi$, learning rate $\eta$, number of steps $T$
Update replay buffer $D_{replay} \leftarrow D_{replay} \cup D_{model}$;
for $t = 1$ to $T$ do
    Update the parameters of the reward function by $\phi \leftarrow \phi - \eta \nabla_\phi L(r_\phi)$;
end

Algorithm 4: Conservative model-based reward learning (CLARE)
Input: expert data $D_E$, diverse data $D_B$, bar $u$, learning rate $\eta$, policy regularizer weight $\lambda$
Learn dynamics model $\bar T$, represented by an ensemble of neural networks, using all offline data;
Set weight $\beta(s,a)$ for each offline state-action tuple $(s,a) \in D_E \cup D_B$ by Eq. (11);
Initialize the policy $\pi_\theta$ and reward function $r_\phi$, parameterized by $\theta$ and $\phi$ respectively;
Initialize replay buffer $D_{replay} \leftarrow \emptyset$;
while not done do
    (Safe policy improvement) Run Algorithm 2 to update policy $\pi_\theta$ and obtain model buffer $D_{model}$;
    (Conservative reward updating) Run Algorithm 3 to update reward function $r_\phi$;
end

A.2 HYPERPARAMETERS

We summarize the hyperparameters used in the evaluation as follows.

Figure 4: Comparison to MOMAX. Each experiment uses 10k expert and 10k medium state-actions.

$$L(\pi, r) = \alpha \bar H(\pi) + Z_\beta\, \mathbb{E}_{\bar\rho_\pi}[r(s,a)] - \mathbb{E}_{\hat\rho_D}[\beta(s,a)\, r(s,a)] - \mathbb{E}_{\hat\rho_E}[r(s,a)] + Z_\beta\, \psi(r)$$
$$= \alpha \bar H(\pi) + \sum_{s,a} \big( Z_\beta\, \bar\rho_\pi(s,a) - \hat\rho_D(s,a)\beta(s,a) - \hat\rho_E(s,a) \big)\, r(s,a) + Z_\beta\, \psi(r)$$
$$= \alpha \bar H(\pi) + Z_\beta \sum_{s,a} \Big( \bar\rho_\pi(s,a) - \frac{\hat\rho_D(s,a)\beta(s,a) + \hat\rho_E(s,a)}{Z_\beta} \Big)\, r(s,a) + Z_\beta\, \psi(r)$$
$$= \alpha \bar H(\pi) + Z_\beta \big( \mathbb{E}_{\bar\rho_\pi}[r(s,a)] - \mathbb{E}_{\bar\rho_I}[r(s,a)] + \psi(r) \big), \quad \Big(\text{denoting } \bar\rho_I(s,a) := \frac{\hat\rho_D(s,a)\beta(s,a) + \hat\rho_E(s,a)}{Z_\beta}\Big)$$

where the last equality holds due to

$$Z_\beta = 1 + \mathbb{E}_{(s,a) \sim \hat\rho_D}[\beta(s,a)] = \sum_{s,a} \hat\rho_E(s,a) + \hat\rho_D(s,a)\beta(s,a) \ge 0, \quad \big(\text{from } \beta(s,a) \ge -\hat\rho_E(s,a)/\hat\rho_D(s,a) \text{ for } (s,a) \in D\big)$$

so that $\bar\rho_I$ is a valid distribution. Thanks to Lemma 4.1, there exists a one-to-one correspondence between $\Pi$ and $C_{\bar T}$. Thus, we can rewrite $L(\pi, r)$ in terms of occupancy measures:

$$\bar L(\rho, r) := \alpha \bar H(\rho) + Z_\beta \big( \mathbb{E}_{\rho}[r(s,a)] - \mathbb{E}_{\bar\rho_I}[r(s,a)] + \psi(r) \big).$$
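The claim that $\bar\rho_I$ is a valid distribution whenever $\beta(s,a) \ge -\hat\rho_E(s,a)/\hat\rho_D(s,a)$ can be checked numerically (variable names below are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 6
rho_e = rng.dirichlet(np.ones(n))        # empirical expert distribution
rho_d = rng.dirichlet(np.ones(n))        # empirical union-data distribution
beta = rng.uniform(-1.0, 1.0, size=n)
beta = np.maximum(beta, -rho_e / rho_d)  # enforce beta >= -rho_e / rho_d
beta[0] = 1.0                            # keep Z_beta strictly positive

z_beta = 1.0 + (rho_d * beta).sum()      # Z_beta = 1 + E_{rho_d}[beta]
rho_i = (rho_d * beta + rho_e) / z_beta  # the induced target distribution

assert np.all(rho_i >= -1e-12)           # nonnegative by the beta constraint
assert abs(rho_i.sum() - 1.0) < 1e-12    # sums to one since sum(rho_e) = 1
```

Nonnegativity follows termwise from the constraint, and normalization follows from $Z_\beta = \sum_{s,a} \hat\rho_E + \hat\rho_D \beta$.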

Additionally, denote $r^*$ and $\rho^*$ as

$$r^* \in \arg\min_{r \in R} \max_{\rho \in C_{\bar T}} \bar L(\rho, r), \qquad \rho^* \in \arg\max_{\rho \in C_{\bar T}} \alpha \bar H(\rho) + Z_\beta\, D_\psi(\rho, \bar\rho_I).$$

$$D_{TV}\big( T(s',a|s), \bar T(s',a|s) \big) \le \mathbb{E}_{a \sim \pi(a|s)}\big[ D_{TV}(T(\cdot|s,a), \bar T(\cdot|s,a)) \big].$$

$$= (1-\gamma) \sum_{s,a} D_{TV}\big(T(\cdot|s,a), \bar T(\cdot|s,a)\big) \sum_{h'=0}^{\infty} p_2^{h'}(s,a)\, \frac{\gamma^{h'+1}}{1-\gamma} \quad \text{(rearranging the order of summation)}$$
$$= \frac{\gamma}{1-\gamma} \sum_{s,a} \rho_2(s,a)\, D_{TV}\big(T(\cdot|s,a), \bar T(\cdot|s,a)\big) \quad \Big(\text{noting that } \rho_2(s,a) = (1-\gamma) \textstyle\sum_{h'=0}^{\infty} \gamma^{h'} p_2^{h'}(s,a)\Big)$$
$$= \frac{\gamma}{1-\gamma}\, \mathbb{E}_{(s,a) \sim \rho_2}\big[ D_{TV}(T(\cdot|s,a), \bar T(\cdot|s,a)) \big]. \tag{24}$$

Substituting Eq. (24) into Eq. (22) gives the desired result.

Denoting $\rho_\pi$ as the occupancy measure of $\pi$ under the underlying dynamics $T$, we can write

$$J(\pi_E) - J(\pi) = \sum_{s,a} \rho_E(s,a) R(s,a) - \sum_{s,a} \rho_\pi(s,a) R(s,a) \quad \text{(from the definition)}$$
$$= \sum_{s,a} \big( \rho_E(s,a) - \rho_\pi(s,a) \big) R(s,a) \le \sum_{s,a} \big| \rho_E(s,a) - \rho_\pi(s,a) \big| \quad \text{(due to } |R(s,a)| \le 1\text{)}$$
$$= 2 D_{TV}(\rho_\pi, \rho_E). \tag{25}$$

Then, based on Lemma B.3, the desired result in Theorem 4.2 is obtained by combining Eq. (25) with Eq. (21).

B.3 PROOF OF THEOREM 4.3

Recall that $c(s,a) = C \cdot D_{TV}(T(\cdot|s,a), \bar T(\cdot|s,a))$. We define

$$f(\rho) := \mathbb{E}_{(s,a) \sim \rho}[c(s,a)] + 2 D_{TV}(\rho, \hat\rho_E) = \sum_{s,a} c(s,a)\rho(s,a) + \big| \rho(s,a) - \hat\rho_E(s,a) \big|. \tag{26}$$

Thanks to Lemma 4.1, minimizing the RHS of Eq. (7) is equivalent to the following problem:

$$\min_{\rho \in P(S \times A)} f(\rho). \tag{27}$$

Let $\delta(s,a) := \rho(s,a) - \hat\rho_E(s,a)$. Then Problem (27) can be transformed to the following one:

$$\delta(s,a) \ge -\hat\rho_E(s,a), \quad s \in S,\; a \in A. \tag{30}$$

For conciseness, we rewrite Problem (28)-(30) in vector form, where each index $i$ corresponds to a state-action pair, $n := |S| \cdot |A|$, $[n] := \{1, 2, \ldots, n\}$, and $\delta := \{\delta_i : i \in [n]\}$. Due to Eq. (32) and Eq. (33), $[n]$ can be divided into two disjoint sets, $N_1(\delta) := \{i \in [n] : \delta_i > 0\}$ and $N_2(\delta) := \{i \in [n] : \delta_i \le 0\}$ ($N_1(\delta) = \emptyset$ iff all $\delta_i = 0$), so that the objective can be written as

$$g(\delta) = \sum_{i \in N_1(\delta)} (c_i + 1)\, \delta_i + \sum_{j \in N_2(\delta)} (c_j - 1)\, \delta_j.$$

For any $\delta$ meeting Constraints (32) and (33), define $\delta'$ (which should be written $\delta'_{N_1}$ in full) by $\delta'_j = -\mathbb{1}[c_j - c_1^{\min} > 2] \cdot \hat\rho_{E,j}$ for all $j \in N_2(\delta)$, $\delta'_i = 0$ for all $i \in N_1(\delta) \setminus N_{\min}(\delta)$, and $\delta'_i = \sum_{j \in N_2(\delta)} \mathbb{1}[c_j - c_1^{\min} > 2] \cdot \hat\rho_{E,j} / |N_{\min}(\delta)|$ for all $i \in N_{\min}(\delta)$, where $N_{\min}(\delta) := \arg\min_{i' \in N_1(\delta)} c_{i'}$ and $c_1^{\min} := \min_{i \in N_1(\delta)} c_i$. Then we have

$$g(\delta') = \sum_{i \in N_{\min}(\delta)} (c_i + 1)\, \delta'_i + \sum_{i' \in N_1(\delta) \setminus N_{\min}(\delta)} (c_{i'} + 1)\, \delta'_{i'} + \sum_{j \in N_2(\delta)} (c_j - 1)\, \delta'_j.$$

Adding and subtracting $\sum_{j \in N_2(\delta)} \mathbb{1}[c_j - c_1^{\min} > 2] \cdot (c_j - 1)\, \hat\rho_{E,j}$, and using $\mathbb{1}[c_j - c_1^{\min} > 2] \cdot (c_1^{\min} - c_j + 2)\, \hat\rho_{E,j} \le 0$ together with Constraint (32), one can verify that $g(\delta) \ge g(\delta')$, i.e., $\delta'$ attains an objective value no larger than that of $\delta$.
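Eq. (25) in the proof of Theorem 4.2 above (the return gap is at most twice the TVD of the occupancy measures for any reward bounded by 1) admits a quick numeric check; the stand-in distributions below are random, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

# For any |R(s, a)| <= 1, |J(pi_E) - J(pi)| <= 2 * D_TV(rho_pi, rho_E):
n = 12
rho_e = rng.dirichlet(np.ones(n))    # stand-in expert occupancy measure
rho_p = rng.dirichlet(np.ones(n))    # stand-in learned-policy occupancy measure
R = rng.uniform(-1.0, 1.0, size=n)   # arbitrary bounded reward
gap = abs(rho_e @ R - rho_p @ R)
assert gap <= 2.0 * tvd(rho_p, rho_e) + 1e-12
```

The inequality holds for every draw, since $\sum |\rho_E - \rho_\pi| = 2 D_{TV}(\rho_\pi, \rho_E)$ by definition.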

Results on D4RL datasets. For each task, the experiments are carried out with three different data combinations: 1) 10k expert tuples, 2) 5k expert and 5k medium tuples, and 3) 5k expert and 5k random tuples. The data scores below for 1), 2), and 3) correspond to expert, medium, and random data, respectively. We tune IQ-LEARN, EDM, and AVRIL based on their publicly available source code. Results are averaged over 7 random seeds. The highest score across all algorithms is in bold.


Klein et al. (2012) further introduce a linearly parameterized, score-function-based multi-class classification algorithm that outputs a reward function based on an estimate of the expert feature expectation. Herman et al. (2016) present a gradient-based solution that simultaneously estimates the feature weights and the parameters of the transition model by taking into account the bias of the demonstrations. Lee et al. (2019) propose Deep Successor Feature Networks (DSFN), which estimates feature expectations in an off-policy setting. However, the assumption of full knowledge of the reward feature functions in Klein et al. (2011); Herman et al.

Table A.2. Hyperparameters for CLARE. Except for u, the hyperparameters used in the evaluation are identical across the different tasks (Half-Cheetah, Walker2d, Hopper, and Ant).

Performance under different u values. We tune u over the set {0.4, 0.6, 0.8}. For each MuJoCo task, the experiments are carried out with three data combinations: 1) 10k expert state-action tuples, 2) 5k expert and 5k medium state-action tuples, and 3) 5k expert and 5k random state-action tuples. The highest score across the different u values is in bold.

Impact of diverse data. For each MuJoCo task, the experiments are carried out with four data combinations: 1) 10k expert state-action tuples, 2) 5k expert state-action tuples, 3) 5k expert and 5k medium state-action tuples, and 4) 5k expert and 5k random state-action tuples.

Results under different expert sample sizes.

ACKNOWLEDGMENTS

This research was supported in part by the National Natural Science Foundation of China under Grants 62122095, 62072472, and U19A2067; by NSF Grants CNS-2203239, CNS-2203412, and RINGS-2148253; and by a grant from the Guoqiang Institute, Tsinghua University.

A EXPERIMENTAL DETAILS

In this section, we present necessary experimental details for reproducibility.

A.1 PRACTICAL IMPLEMENTATION DETAILS

In the experiments, our implementation is built upon the open-source framework of offline RL algorithms provided at https://github.com/polixir/OfflineRL, including data sampling, policy testing, the dynamics model structure, etc. The implementation of SAC in the policy improvement uses the open-source code available at https://github.com/pranz24/pytorch-soft-actor-critic (under the MIT License). Additionally, the expert and diverse state-action pairs are sampled at random from the D4RL datasets provided at https://github.com/rail-berkeley/d4rl (under the Apache License 2.0).

Model learning. Following the same line as Yu et al. (2020; 2021), we model the transition dynamics by an ensemble of probabilistic neural networks, each of which takes the current state and action as input and outputs a Gaussian distribution over next states, i.e., $\{\bar T_i(s'|s,a) = \mathcal{N}(\mu_i(s,a), \Sigma_i(s,a))\}_{i=1}^N$. Using the offline state-action pairs, 7 models are trained independently via maximum likelihood, each represented by a 4-layer feedforward neural network with 256 hidden units. The best 5 models are picked based on the validation prediction error on a held-out set. During model rollouts, one model is selected at random from the ensemble.

Policy improvement. We represent both the critic and the actor as 2-layer feedforward neural networks with 256 hidden units and Swish activation functions. In each iteration, we update the critic and actor networks using SAC (Haarnoja et al., 2018) for 500 epochs (each with 20 gradient updates). As described in Section 5, we use a KL divergence with the behavior policy to accelerate the inner-loop policy search; it is implemented by adding $-\mathbb{E}_{s,a \sim D'}[\log \pi(a|s)]$ ($D' \subseteq D$) to the actor loss. An instantiation of the policy improvement can be found in Algorithm 2, whose inner loop samples batches from $D_{model}$ and uses SAC to update the policy $\pi_\theta$ with $-\mathbb{E}_{s,a \sim D'}[\log \pi_\theta(a|s)]$ added to the policy loss.

Reward updating.
We represent the reward function as a 4-layer feedforward neural network with 256 hidden units and Swish activation functions. In each iteration, the reward function is updated by 5 gradient steps with stepsize $5 \times 10^{-5}$, based on the practical version of the reward loss in Eq. (2).

Combining Constraints (32) and (33) with Eq. (35), we can express the minimizer $\delta^*$ accordingly, where $N_{\min} = N_1^*$. Due to $\delta(s,a) + \hat\rho_E(s,a) = \rho(s,a)$, we obtain the optimal solution of Problem (27).
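The probabilistic dynamics ensemble described under "Model learning" in A.1 can be sketched in PyTorch as follows. This is a sketch only: the class and function names are ours, and the clamp on the log-stddev head is a common stabilization convention assumed here, not stated in the text:

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """One ensemble member: a 4-layer MLP (256 units, Swish/SiLU) that outputs a
    Gaussian over next states, N(mu(s, a), diag(sigma^2(s, a)))."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * state_dim),  # mean and log-std of next state
        )

    def forward(self, s, a):
        mu, log_std = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        # Clamp log-std for numerical stability (assumed convention).
        return torch.distributions.Normal(mu, log_std.clamp(-10.0, 2.0).exp())

def nll_loss(model, s, a, s_next):
    """Maximum-likelihood training objective: negative log-likelihood."""
    return -model(s, a).log_prob(s_next).sum(dim=-1).mean()
```

In the setup above, several such members would be trained independently on the offline data, the best ones kept by held-out prediction error, and one member sampled at random per rollout step.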

B.5 MINIMIZING A CHI-SQUARED DIVERGENCE

The $f$-divergence between two distributions $\rho_1$ and $\rho_2$ is defined variationally as

$$D_f(\rho_1, \rho_2) = \sup_{g}\; \mathbb{E}_{x \sim \rho_1}[g(x)] - \mathbb{E}_{x \sim \rho_2}[f^*(g(x))],$$

where $f^*$ is the convex conjugate of $f$. The $\chi^2$-divergence is the $f$-divergence with $f(x) = (x-1)^2$ and $f^*(y) = \frac{y^2}{4} + y$, i.e.,

$$\chi^2(\rho_1, \rho_2) = \sup_{g}\; \mathbb{E}_{x \sim \rho_1}[g(x)] - \mathbb{E}_{x \sim \rho_2}\Big[\frac{g(x)^2}{4} + g(x)\Big].$$

Published as a conference paper at ICLR 2023

By interpreting $g = -r$ and $x = (s,a)$, the following holds: using a convex reward regularizer $\psi(r) = \frac{r^2}{4\delta}$ enables CLARE to minimize a $\chi^2$-divergence between the learned occupancy measure and the target one, i.e.,

$$\max_{\rho \in C_{\bar T}} \alpha \bar H(\rho) - Z_\beta\, \delta \cdot \chi^2(\rho, \rho^*).$$
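The conjugate pair used here is easy to verify: $f^*(y) = \sup_x \{xy - (x-1)^2\}$ is attained at $x = 1 + y/2$, giving $y + y^2/4$. A quick grid check:

```python
import numpy as np

# f(x) = (x - 1)^2  =>  f*(y) = sup_x { x*y - (x - 1)^2 } = y^2/4 + y,
# with the supremum attained at x = 1 + y/2. Verify on a fine grid:
xs = np.linspace(-50.0, 50.0, 200001)
for y in (-2.0, -0.5, 0.0, 1.0, 3.0):
    sup = (xs * y - (xs - 1.0) ** 2).max()
    assert abs(sup - (y * y / 4.0 + y)) < 1e-6
```

The grid spacing is $5 \times 10^{-4}$, so the discretization error of the quadratic supremum is far below the tolerance.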

