MIND THE GAP: OFFLINE POLICY OPTIMIZATION FOR IMPERFECT REWARDS

Abstract

The reward function is essential in reinforcement learning (RL): it serves as the guiding signal that incentivizes agents to solve given tasks, yet it is notoriously difficult to design. In many cases, only imperfect rewards are available, which inflicts substantial performance loss on RL agents. In this study, we propose a unified offline policy optimization approach, RGM (Reward Gap Minimization), which can handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data; the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sample-based learning without any online interactions. Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Further, RGM can effectively correct wrong or inconsistent rewards against expert preference and retrieve useful information from biased rewards. Code is available at https://github.com/Facebear-ljx/RGM.

1. INTRODUCTION

Reward plays an imperative role in every reinforcement learning (RL) problem. It encodes the desired task behaviors, serving as a guiding signal to incentivize agents to learn and solve a given task. As widely recognized in RL studies, a desirable reward function should not only define the task the agent learns to solve, but also offer the "bread crumbs" that allow the agent to efficiently learn to solve the task (Abel et al., 2021; Singh et al., 2009; Sorg, 2011). However, due to task complexity and human cognitive biases (Hadfield-Menell et al., 2017), accurately describing a complex task using numerical rewards is often difficult or impossible (Abel et al., 2021; Li et al., 2019). In most practical settings, the rewards are typically "imperfect" and hard to fix through reward tuning when online interactions are costly or dangerous (Zhan et al., 2022). Such imperfect rewards are widespread in real-world applications and can appear in forms such as partially correct rewards, sparse rewards, mismatched rewards from other tasks, and completely incorrect rewards (see Figure 1 for an intuitive illustration). These rewards either fail to incentivize agents to learn correct behaviors or cannot provide effective signals to speed up the learning process. Consequently, it is of great importance and practical value to devise a versatile method that can perform robust offline policy optimization under diverse settings of imperfect rewards. Reward shaping (Ng et al., 1999) is the most common approach to tackling imperfect rewards, but it requires tremendous human effort and numerous online evaluations. Another possible avenue is imitation learning (IL) (Pomerleau, 1988; Kostrikov et al., 2019) or offline inverse reinforcement learning (IRL) methods (Jarboui & Perchet, 2021), which directly imitate expert behaviors or derive new rewards from them.
However, these methods heavily depend on the quantity and quality of expert demonstrations and offline datasets, which are often beyond reach in practice. Another key challenge is how to precisely measure the discrepancy between the given reward in the data and the true reward of the task: in the offline setting, it is impossible to evaluate the learned policy's behavior under a specific reward function through environment interactions, let alone revise the reward.

In this paper, we investigate the challenge of learning effective offline RL policies under imperfect rewards when environment interactions are not possible. We first formally define the relative gap between the given and perfect rewards based on state-action visitation distribution matching (referred to as the reward gap), and formulate the problem as a bi-level optimization problem. In the upper layer, the imperfect rewards are adjusted by a reward correction term, which is learned by minimizing the reward gap toward expert behaviors. In the lower layer, we solve a pessimistic RL problem to obtain the optimized policy under the corrected rewards. By exploiting Lagrangian duality of the lower-level problem, the overall optimization procedure can be tractably solved in a fully offline manner without any online interactions. We call this approach Reward Gap Minimization (RGM). Compared to existing methods, RGM can: 1) evaluate and minimize the reward gap without any online interactions; 2) eliminate the strong dependency on human effort and numerous expert demonstrations; and 3) handle diverse types of reward settings (e.g., perfect, partially correct, sparse, multi-task data sharing, incorrect) in a unified framework for reliable offline policy optimization.
Through extensive experiments on D4RL datasets (Fu et al., 2020), sparse reward tasks, multi-task data sharing tasks and a discrete-space navigation task, we demonstrate that RGM can achieve superior performance across diverse settings of imperfect rewards. Furthermore, we show that RGM effectively corrects wrong or inconsistent rewards against expert preference and retrieves useful information from biased rewards, making it an ideal tool for practical applications where reward functions are difficult to design.

However, reward shaping requires online evaluation and tuning, which is not applicable in the offline setting. Currently, few mechanisms are specifically designed for offline RL to handle sparse rewards.

Imperfect rewards in multi-task data sharing. Sharing data across different tasks can potentially enhance offline RL performance on a target task by utilizing additional data from other relevant tasks. As the goals of other relevant tasks are different from that of the target task, the rewards designed for other tasks are naturally imperfect for solving the target task. Since directly sharing datasets from other tasks exacerbates the distribution shift in offline RL (Yu et al., 2021; Bai et al., 2023), prior work such as CDS (Yu et al., 2021) shares data relevant to the target task based on learned Q-values, but it requires access to the functional form of the reward for relabeling. CDS+UDS (Yu et al., 2022) directly sets the shared rewards to zero without reward relabeling to reduce the bias in the shared rewards, but it cannot completely remedy the reward bias.

Completely incorrect rewards. When rewards are believed to be totally wrong or missing, researchers typically adopt offline imitation learning (IL) methods. These methods directly mimic the expert from demonstrations without the presence of a reward signal.
Among these approaches, behavior cloning (BC) (Pomerleau, 1988; Florence et al., 2022) is the simplest one, but is vulnerable to covariate shift and compounding errors (Rajaraman et al., 2020) . Recent works tackle this problem via distribution matching (Jarboui & Perchet, 2021; Kostrikov et al., 2019; Kim et al., 2021; Ma et al., 2022) or using a discriminator to measure the optimal level of the data and further guide policy learning (Zolna et al., 2020; Xu et al., 2022b; Zhang et al., 2022) . These approaches all have strong requirements on the size and coverage of the expert datasets, and only try to imitate the expert rather than improve beyond the policies in data via RL based on the underlying reward of the task.

3. PRELIMINARIES

Markov decision process under imperfect rewards. We consider the typical Markov decision process (MDP) setting (Puterman, 2014), which is defined by a tuple $M := (S, A, r, T, \mu_0, \gamma)$. $S$ and $A$ represent the state and action spaces, $r : S \times A \to \mathbb{R}$ is the perfect reward function, and $T : S \times A \to \Delta(S)$ is the transition dynamics, which gives the probability $T(s_{t+1}|s_t, a_t)$ of transitioning from state $s_t$ to state $s_{t+1}$ by executing action $a_t$ at timestep $t$. $\mu_0 \in \Delta(S)$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor. The perfect reward function $r(s, a)$ encodes the desired behaviors of the task. In most cases, however, we only have access to an imperfect human-designed reward function $\tilde{r}(s, a)$, which may not align well with the target task. This leads to a biased MDP $\tilde{M} := (S, A, \tilde{r}, T, \mu_0, \gamma)$ as compared to the original MDP $M$. To remedy the adverse effects of imperfect reward signals, existing offline policy learning studies (Zolna et al., 2020; Xu et al., 2022b; Ma et al., 2022; Kim et al., 2021; Jarboui & Perchet, 2021) introduce additional expert demonstrations $D^E = \{(s_0^E, a_0^E, s_1^E, \cdots)^{(i)}\}_{i=0}^{N^E}$ to provide extra information on the desired policy behaviors. We follow a similar setup, but only consume very limited expert demonstrations. In our offline policy optimization setting, we are given a pre-collected dataset $D = \{(s_0, a_0, \tilde{r}_0, s_1, \cdots)^{(i)}\}_{i=0}^{N}$ that is generated by an unknown behavior policy $\pi_\beta$ and annotated with imperfect rewards $\tilde{r}$. We aim to learn an effective policy $\pi_r : S \to \Delta(A)$ that captures the optimized agent behavior in $M$ rather than $\tilde{M}$, using both $D$ and a very small expert dataset $D^E$.

Reinforcement learning. Given an MDP and the reward function $r(s, a)$, the goal of RL is to find an optimized policy $\pi_r^*$ that maximizes the expected cumulative discounted reward:

$$\pi_r^* = \arg\max_{\pi_r} \; (1 - \gamma)\, \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 \sim \mu_0(\cdot),\; a_t \sim \pi_r(\cdot|s_t),\; s_{t+1} \sim T(\cdot|s_t, a_t)\Big].$$
This optimization objective can be equivalently written in the following succinct form (Puterman, 2014; Nachum et al., 2019b) by defining the normalized discounted state-action visitation distribution $d^{\pi_r}(s, a)$ (in the rest of the paper, we omit "normalized discounted state-action" for brevity unless otherwise specified):

$$\pi_r^* = \arg\max_{\pi_r} \; \mathbb{E}_{(s,a)\sim d^{\pi_r}}[r(s, a)] \tag{1}$$

$$d^{\pi_r}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr\big[s_t = s, a_t = a \,\big|\, s_0 \sim \mu_0(\cdot),\; a_t \sim \pi_r(\cdot|s_t),\; s_{t+1} \sim T(\cdot|s_t, a_t)\big]$$

This RL objective is not directly applicable to the offline setting, as it is no longer possible to sample from $d^{\pi_r}$ via online interactions, and serious distributional shift (Kumar et al., 2019) may occur without proper data-related regularization when learning from offline datasets. To tackle these problems, several recent works (Nachum et al., 2019b; Nachum & Dai, 2020; Lee et al., 2021) incorporate a regularizer into Eq. (1) to formulate a pessimistic RL framework that is solvable in the offline setting:

$$\pi_r^* = \arg\max_{\pi_r} \; \mathbb{E}_{(s,a)\sim d^{\pi_r}}[r(s, a)] - \alpha D\big(d^{\pi_r} \,\|\, d^D\big) \tag{2}$$

where $d^D$ is the visitation distribution of dataset $D$, $D(\cdot\|\cdot)$ represents some statistical discrepancy measure, and $\alpha > 0$ controls the strength of the regularization.
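To make the visitation distribution concrete, the following is a minimal sketch that computes $d^{\pi}(s,a)$ for a small tabular MDP by accumulating the discounted sum in its definition, and then checks that it is normalized and satisfies the flow identity $\sum_a d(s,a) = (1-\gamma)\mu_0(s) + \gamma \sum_{\bar s,\bar a} T(s|\bar s,\bar a) d(\bar s,\bar a)$. The two-state MDP, the policy, and all function names are illustrative assumptions and are not taken from the paper.

```python
def visitation(T, pi, mu0, gamma, horizon=2000):
    """d(s,a) = (1-gamma) * sum_t gamma^t Pr[s_t=s, a_t=a], truncated at `horizon`."""
    n_s, n_a = len(pi), len(pi[0])
    d = [[0.0] * n_a for _ in range(n_s)]
    p = list(mu0)            # state distribution at the current timestep
    weight = 1.0 - gamma     # (1-gamma) * gamma^t
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                mass = p[s] * pi[s][a]
                d[s][a] += weight * mass
                for s2 in range(n_s):
                    nxt[s2] += mass * T[s][a][s2]
        p, weight = nxt, weight * gamma
    return d


def flow_residual(d, T, mu0, gamma):
    """Per-state violation of the Bellman flow identity."""
    n_s, n_a = len(d), len(d[0])
    res = []
    for s in range(n_s):
        inflow = sum(T[sb][ab][s] * d[sb][ab]
                     for sb in range(n_s) for ab in range(n_a))
        res.append(sum(d[s]) - (1.0 - gamma) * mu0[s] - gamma * inflow)
    return res


# Hypothetical toy MDP: T[s][a][s'] transition probabilities, pi[s][a] policy.
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.3, 0.7], [0.6, 0.4]]
mu0 = [1.0, 0.0]
r = [[0.0, 1.0], [2.0, 0.0]]
gamma = 0.9

d = visitation(T, pi, mu0, gamma)
expected_reward = sum(d[s][a] * r[s][a] for s in range(2) for a in range(2))
print("d =", d)
print("E_d[r] =", expected_reward)
```

Because $d^{\pi}$ is normalized, `expected_reward` equals the $(1-\gamma)$-scaled discounted return that appears in the RL objective above.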

4. REWARD GAP MINIMIZATION

To handle diverse imperfect reward settings, three challenges have to be tackled: 1) Measure the gap between the given rewards and the underlying unknown perfect rewards; 2) Unify different reward settings and bridge the reward gap; 3) Perform offline policy optimization using an integrated framework. Our solution to these challenges is Reward Gap Minimization (RGM). We formally define the reward gap in the perspective of visitation distribution matching and introduce a correction term to correct the problematic rewards. Then, we model RGM as a bi-level optimization problem, with the upper layer minimizing the reward gap and the lower layer solving a pessimistic RL problem. To derive a tractable algorithm, we leverage Lagrangian duality to eliminate the requirement for online samples.

4.1. DEFINITION OF REWARD GAP

As observed in recent literature, some tasks cannot be captured by a numerical Markovian reward function (Abel et al., 2021). Hence, learning an explicit proxy of the perfect reward function and comparing it to the given rewards is unlikely to be the best option for characterizing the reward gap. In this study, we define the reward gap based on the outcome of the learned agent behavior, i.e., from the perspective of visitation distribution matching.

Definition 1. (Reward gap) Given an arbitrary reward function $\hat{r}(s, a)$ and the visitation distribution $d^*$ of the optimal policy induced from the perfect rewards $r$, the reward gap between $\hat{r}$ and $r$ is:

$$D_f\big(d^{\pi^*_{\hat{r}}} \,\|\, d^*\big) \tag{3}$$

where $D_f(p\|q) = \mathbb{E}_{z\sim q}\big[f\big(\tfrac{p(z)}{q(z)}\big)\big]$ is the $f$-divergence between distributions $p$ and $q$, and $d^{\pi^*_{\hat{r}}}$ represents the visitation distribution induced by $\pi^*_{\hat{r}}$, which is derived using Eq. (2) with $\hat{r}$.

Note that $d^*$ is unobtainable since the perfect reward function is unknown. We can alternatively use the visitation distribution $d^E$ induced by the unknown $\pi^E$ in expert demonstrations $D^E$ to approximate $d^*$. Next, we discuss how to adjust $\hat{r}$ to minimize the reward gap.
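As a small illustration of Definition 1, the sketch below evaluates $D_f(p\|q) = \mathbb{E}_{z\sim q}[f(p(z)/q(z))]$ for discrete distributions with two common generators, $f(x) = x\log x$ (KL) and $f(x) = \tfrac{1}{2}(x-1)^2$ ($\chi^2$). The three-outcome distributions are made-up numbers standing in for flattened visitation distributions; they are assumptions for illustration only.

```python
import math


def f_divergence(p, q, f):
    """D_f(p||q) = E_{z~q}[ f(p(z)/q(z)) ] for discrete p, q with q(z) > 0."""
    return sum(qz * f(pz / qz) for pz, qz in zip(p, q))


def f_kl(x):
    # Generator of the KL divergence, with the convention 0 * log 0 = 0.
    return x * math.log(x) if x > 0.0 else 0.0


def f_chi2(x):
    # Generator of the chi-square divergence used in Appendix B.2.
    return 0.5 * (x - 1.0) ** 2


# Hypothetical visitation distributions over three state-action pairs.
d_expert = [0.7, 0.2, 0.1]
d_matching = [0.7, 0.2, 0.1]     # policy that reproduces expert visitation
d_mismatched = [0.1, 0.2, 0.7]   # policy that visits the wrong pairs

gap_zero = f_divergence(d_matching, d_expert, f_kl)
gap_kl = f_divergence(d_mismatched, d_expert, f_kl)
gap_chi2 = f_divergence(d_mismatched, d_expert, f_chi2)
print(gap_zero, gap_kl, gap_chi2)
```

The gap vanishes exactly when the induced visitation matches the expert's, which is the sense in which a reward is "good enough" under this definition even if it differs numerically from the perfect reward.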

4.2. BI-LEVEL OPTIMIZATION

Reward correction. In our study, we consider $\hat{r}(s, a) := \tilde{r}(s, a) + \Delta r(s, a, \tilde{r})$, where $\Delta r(s, a, \tilde{r})$ is a learnable reward correction term that is correlated with the given imperfect rewards $\tilde{r}$ in $D$. The introduction of $\Delta r(s, a, \tilde{r})$ enables us to exploit useful information within the partially correct rewards, while also correcting the wrong or inconsistent reward signals. We can further use it to construct a bi-level optimization formulation for RGM, where the upper-level problem optimizes the reward correction term to minimize the $f$-divergence between $d^{\pi^*_{\hat{r}}}$ and $d^E$, and the lower-level problem solves $\pi^*_{\hat{r}}$ as the optimal policy of a pessimistic RL problem with the corrected rewards:

$$\Delta r^* = \arg\min_{\Delta r} \; D_f\big(d^{\pi^*_{\hat{r}}} \,\|\, d^E\big) \tag{4}$$

$$\text{s.t.}\quad \pi^*_{\hat{r}} = \arg\max_{\pi_{\hat{r}}} \; \mathbb{E}_{(s,a)\sim d^{\pi_{\hat{r}}}}[\hat{r}(s, a)] - \alpha D_f\big(d^{\pi_{\hat{r}}} \,\|\, d^D\big) \tag{5}$$

The above bi-level optimization formulation poses several technical difficulties, stemming from the complexity of deriving $d^{\pi^*_{\hat{r}}}$ from $\pi^*_{\hat{r}}$, as well as the requirement of online samples from $d^{\pi^*_{\hat{r}}}$, which is impossible under the offline setting. In the following, we present reformulations for both the lower-level and upper-level problems, which lead to a tractable form and an easy-to-implement algorithm.

Reformulation of the lower-level problem. We first reformulate the lower-level problem by exploiting duality and the Bellman flow constraint (Puterman, 2014). With this constraint, the lower-level problem Eq. (5) can be rewritten as a constrained maximization problem w.r.t. $d$ in place of $\pi_{\hat{r}}$:

$$d^{\pi^*_{\hat{r}}} = \arg\max_{d \ge 0} \; \mathbb{E}_{(s,a)\sim d}[\hat{r}(s, a)] - \alpha D_f\big(d \,\|\, d^D\big) \quad \text{s.t.}\quad \sum_a d(s, a) = (1 - \gamma)\mu_0(s) + \gamma T_\star d(s), \;\; \forall s \in S \tag{7}$$

where $T_\star d(s) = \sum_{\bar{s},\bar{a}} T(s|\bar{s}, \bar{a})\, d(\bar{s}, \bar{a})$, as expanded in the proof of Lemma 2. The Lagrange dual problem of Eq. (7) is as follows:

$$\min_{V(s)} \max_{d \ge 0} \; \mathbb{E}_{(s,a)\sim d}[\hat{r}(s, a)] - \alpha D_f\big(d \,\|\, d^D\big) + \sum_s V(s)\Big((1 - \gamma)\mu_0(s) + \gamma T_\star d(s) - \sum_a d(s, a)\Big) \tag{8}$$

where $V(s)$ are Lagrange multipliers. Note that the primal problem Eq. (7) is convex w.r.t. $d$, and under a mild assumption (see Assumption 1 in Appendix A.2), Slater's condition (Boyd et al., 2004) holds, which means that, by strong duality, we can solve the original primal problem by solving Eq. (8). After rearranging the terms, Eq. (8) can be equivalently written in the following form (see Lemma 2 in Appendix A.2 for the detailed deduction):

$$\min_{V(s)} \max_{d \ge 0} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \mathbb{E}_{(s,a)\sim d}\big[\hat{r}(s, a) + \gamma TV(s, a) - V(s)\big] - \alpha D_f\big(d \,\|\, d^D\big) \tag{9}$$

in which $TV(s, a) = \sum_{s'} T(s'|s, a) V(s')$ denotes the transition operator. Next, by exploiting the Fenchel conjugate, we can further transform the minimax problem Eq. (9) into a tractable single-level unconstrained minimization problem (see Proposition 1 in Appendix A.2 for the detailed derivation), which eliminates the requirement of online samples:

$$\min_{V(s)} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \alpha\, \mathbb{E}_{(s,a)\sim d^D}\Big[f_\star\Big(\frac{\hat{r}(s, a) + \gamma TV(s, a) - V(s)}{\alpha}\Big)\Big] \tag{10}$$

where $f_\star$ is the Fenchel conjugate of $f$. In the above formulation, the Lagrange multipliers $V(s)$ can be equivalently perceived as a state-value function, which can be learned and optimized via a parameterized neural network, similar to the treatment used in the DICE family of RL algorithms (Nachum et al., 2019a; Nachum & Dai, 2020).

Reformulation of the upper-level problem. Using the property of the Fenchel conjugate, the optimal $d^{\pi^*_{\hat{r}}}$ and $V^*$ from the lower-level problem satisfy the following relationship (see Proposition 2 in Appendix A.3 for details):

$$\frac{d^{\pi^*_{\hat{r}}}(s, a)}{d^D(s, a)} = f'_\star\Big(\frac{\hat{r}(s, a) + \gamma TV^*(s, a) - V^*(s)}{\alpha}\Big) \tag{11}$$

Plugging the above equation into the upper-level objective of Eq. (4), we obtain a new objective for the upper-level problem:

$$\Delta r^* = \arg\min_{\Delta r} \; D_f\Big(f'_\star\Big(\frac{\hat{r} + \gamma TV^* - V^*}{\alpha}\Big)\, d^D \,\Big\|\, d^E\Big) \tag{12}$$

For simplicity, we denote $f'_\star\big(\frac{\hat{r} + \gamma TV^* - V^*}{\alpha}\big)$ as $g$.
By expanding the $f$-divergence, we have:

$$D_f\big(d^D g \,\|\, d^E\big) = \mathbb{E}_{(s,a)\sim d^E}\Big[f\Big(\frac{d^D(s, a)\, g(s, a)}{d^E(s, a)}\Big)\Big] = \mathbb{E}_{(s,a)\sim d^D}\Big[\frac{d^E(s, a)}{d^D(s, a)}\, f\Big(\frac{d^D(s, a)}{d^E(s, a)}\, g(s, a)\Big)\Big] \tag{13}$$

The above objective involves computing the distribution ratio $w(s, a) \triangleq d^E(s, a)/d^D(s, a)$. In the tabular case, we can empirically estimate

$$w(s, a) = \frac{\sum_{(\bar{s},\bar{a})\in D^E} \mathbb{1}(\bar{s} = s, \bar{a} = a)/N^E}{\sum_{(\bar{s},\bar{a})\in D} \mathbb{1}(\bar{s} = s, \bar{a} = a)/N}.$$

In continuous state-action settings, however, estimating the distribution ratio $w$ using only samples from $d^D$ and $d^E$ becomes a challenge. Inspired by previous studies (Goodfellow et al., 2020; Ma et al., 2022), we instead train a discriminator $h : S \times A \to (0, 1)$ to infer whether $(s, a)$ samples are from $D^E$ or not:

$$h^* = \arg\max_h \; \mathbb{E}_{(s,a)\sim d^D}[\log h(s, a)] + \mathbb{E}_{(s,a)\sim d^E}[\log(1 - h(s, a))] \tag{14}$$

where the optimal discriminator is $h^*(s, a) = \frac{d^D(s,a)}{d^D(s,a) + d^E(s,a)}$ (Goodfellow et al., 2020). We can optimize the above objective to obtain the optimal $h^*$, and further recover $w(s, a) = 1/h^*(s, a) - 1$.

Finally, combining all the reformulations, the final tractable form of the original bi-level optimization problem Eq. (4)-(5) is given as follows:

$$\Delta r^* = \arg\min_{\Delta r} \; \mathbb{E}_{(s,a)\sim d^D}\Big[w(s, a) \cdot f\Big(f'_\star\Big(\frac{\hat{r}(s, a) + \gamma TV^*(s, a) - V^*(s)}{\alpha}\Big) \Big/ w(s, a)\Big)\Big]$$

$$\text{s.t.}\quad V^*(s) = \arg\min_{V(s)} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \alpha\, \mathbb{E}_{(s,a)\sim d^D}\Big[f_\star\Big(\frac{\hat{r}(s, a) + \gamma TV(s, a) - V(s)}{\alpha}\Big)\Big] \tag{15}$$

Policy extraction. With the learned reward correction term $\Delta r(s, a, \tilde{r})$, we can in principle use existing offline RL algorithms to learn the policy with the corrected rewards. However, this entails additional policy evaluation and policy improvement steps. A simpler way is to extract the policy through weighted BC as follows, which is substantially more robust and less expensive:

$$\pi^* = \arg\min_\pi \; -\mathbb{E}_{(s,a)\sim d^{\pi^*_{\hat{r}}}}[\log \pi(a|s)] = \arg\min_\pi \; -\mathbb{E}_{(s,a)\sim d^D}\Big[\frac{d^{\pi^*_{\hat{r}}}(s, a)}{d^D(s, a)} \log \pi(a|s)\Big] \tag{16}$$

where $\frac{d^{\pi^*_{\hat{r}}}(s,a)}{d^D(s,a)}$ can be calculated from Eq. (11).
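The two ways of obtaining the ratio $w(s,a) = d^E(s,a)/d^D(s,a)$ described above can be compared on a toy tabular example. The sketch below estimates $w$ by counting, and separately recovers it as $1/h - 1$ from the discriminator value $h = d^D/(d^D + d^E)$ computed from the same empirical frequencies. The sample lists and function names are hypothetical; in continuous settings $h$ would be a trained network rather than a closed-form expression.

```python
from collections import Counter


def tabular_ratio(expert_pairs, data_pairs):
    """Count-based estimate of w(s,a) = d^E(s,a) / d^D(s,a) on the support of D."""
    c_e, c_d = Counter(expert_pairs), Counter(data_pairs)
    n_e, n_d = len(expert_pairs), len(data_pairs)
    return {sa: (c_e[sa] / n_e) / (c_d[sa] / n_d) for sa in c_d}


def ratio_from_discriminator(h):
    """Recover w = 1/h - 1 from a discriminator output h in (0, 1]."""
    return 1.0 / h - 1.0


# Hypothetical (state, action) samples.
expert_pairs = [(0, 1), (0, 1), (1, 0), (1, 0)]
data_pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 1), (0, 0), (1, 1), (0, 0)]

w_counts = tabular_ratio(expert_pairs, data_pairs)

# Closed-form optimal discriminator built from the same empirical frequencies.
c_e, c_d = Counter(expert_pairs), Counter(data_pairs)
w_disc = {}
for sa in c_d:
    d_data = c_d[sa] / len(data_pairs)
    d_exp = c_e[sa] / len(expert_pairs)
    h_star = d_data / (d_data + d_exp)
    w_disc[sa] = ratio_from_discriminator(h_star)

print(w_counts)
print(w_disc)
```

Pairs that never appear in the expert data receive $w = 0$, and pairs that the expert visits more often than the dataset receive $w > 1$.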

4.3. PRACTICAL IMPLEMENTATION

In our implementation, we use a stochastic first-order two-timescale optimization technique (Borkar, 1997), which has been successfully applied in several RL algorithms (Hong et al., 2020; Cheng et al., 2022), to solve the bi-level optimization problem. Specifically, we make the gradient update step size of the upper layer much smaller than that of the lower layer (see Figure 2 for the RGM framework and Appendix B for additional implementation details of RGM).

Table 1: Average normalized scores of RGM compared with offline IL and RL baselines on D4RL datasets. The scores are from the final 10 evaluations with 5 seeds. (T), (P) and (C) mean policy optimization with true rewards, partially correct rewards and completely incorrect rewards, respectively. "-r", "-m", "-m-r", and "-m-e" are short for random, medium, medium-replay, and medium-expert, respectively. We obtain the results by running author-provided open-source code, and some scores are reported from the TD3+BC and IQL papers. For each dataset, the top 2 scores under partially correct rewards are marked in blue.
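The two-timescale idea from Section 4.3 can be illustrated on a deliberately simple bi-level problem. The sketch below is not RGM's actual objective: it assumes a scalar lower-level problem $y^*(x) = \arg\min_y (y - x)^2$ and an upper-level problem $\min_x (y^*(x) - 3)^2 + 0.1x^2$, and updates both variables simultaneously with a large step size for the lower level and a much smaller one for the upper level. All names, objectives, and step sizes are illustrative assumptions.

```python
def two_timescale(steps=20000, lr_lower=0.1, lr_upper=0.001):
    """Simultaneous gradient updates with lr_upper << lr_lower (toy problem)."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        # Fast timescale: gradient step on the lower-level loss (y - x)^2.
        grad_y = 2.0 * (y - x)
        # Slow timescale: gradient of (y - 3)^2 + 0.1 * x^2 w.r.t. x, using the
        # current y in place of y*(x) and the known sensitivity dy*/dx = 1.
        grad_x = 2.0 * (y - 3.0) + 0.2 * x
        y -= lr_lower * grad_y
        x -= lr_upper * grad_x
    return x, y


x_final, y_final = two_timescale()
# Analytic solution of the toy problem: y* = x and 2(x - 3) + 0.2x = 0.
x_star = 30.0 / 11.0
print(x_final, y_final, x_star)
```

Because the lower-level variable moves much faster, it stays close to $y^*(x)$ while $x$ drifts slowly, so the upper-level update sees an approximately solved lower-level problem at every step.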

5.1. COMPARATIVE RESULTS

Comparisons for partially correct rewards. We train RGM and SOTA offline RL methods (TD3+BC (Fujimoto & Gu, 2021), IQL (Kostrikov et al., 2021b) and CQL (Kumar et al., 2020)) under partially correct rewards and report their performances evaluated based on the perfect rewards in Table 1. Table 1 shows that RGM surpasses offline RL methods under partially correct rewards by a large margin and achieves similar performance to offline RL policies that are trained on perfect rewards. This shows a remarkable advantage of RGM: it can alleviate severe performance degradation when perfect rewards are unattainable and hence removes the restrictive requirement for perfect rewards, which can be particularly useful for a wide range of real-world scenarios.

Comparisons for completely incorrect rewards. When rewards are believed to be completely incorrect, one generally resorts to IL methods. We compare RGM with BC and SOTA offline IL methods (DWBC (Xu et al., 2022b) and SMODICE (Ma et al., 2022)) that can learn from mixed-quality data. Only offline IL methods are considered as baselines, because other existing methods that tackle incorrect rewards can only be applied in online settings (see Section 2 for discussions). In our setting, we train offline IL baselines using the original D4RL dataset D, which may not cover enough expert trajectories. However, DWBC and SMODICE both build on the strong assumption that D already covers a large proportion of expert data, which is rarely the case in real scenarios. As a result, Table 1 shows that these two methods suffer from inferior performance when the restrictive requirements on the quality and state-action space coverage of expert data are not satisfied. RGM, however, performs well when nearly no expert trajectories are contained in the offline dataset, because RGM optimizes an RL objective that relaxes the requirements on the quality of the dataset.
To further illustrate the superiority of RGM, we compare RGM against DWBC and SMODICE under their settings by adding 100∼200 expert trajectories into D. Results show that RGM can still outperform SOTA offline IL methods by a large margin (see Table 8 in Appendix D).

Comparisons for sparse rewards. We evaluate RGM against BC and the offline RL methods TD3+BC, CQL and IQL on the Robomimic (Mandlekar et al., 2021) Lift and Can tasks. We also evaluate on the well-known and extremely difficult AntMaze tasks. We report the average max success rate as the evaluation metric in Table 2. Table 2 shows that the offline RL baselines fail on the AntMaze tasks, as sparse rewards are hard to back-propagate through a very long horizon (≈ 1K steps), while RGM can correctly provide dense signals to guide the ant to navigate to the destination. For the Robomimic Lift and Can tasks, RGM again outperforms existing methods, while other methods can also achieve reasonable performance. We suspect that these offline datasets may already contain near-optimal trajectories, as BC can achieve reasonable performance. Moreover, the planning horizons of both tasks are relatively short (≈ 150 steps), so it is relatively simple for offline RL to back-propagate the sparse signals.

Extension to multi-task data sharing. We highlight that RGM can also perform well in offline multi-task data sharing tasks (Yu et al., 2021), which utilize datasets from other relevant tasks to enhance the offline RL performance on a target task. Prior works either require the functional form of rewards to be known for relabeling (Yu et al., 2021) or only partially correct the reward biases (Yu et al., 2022). In contrast, RGM systematically corrects the reward biases without reward relabeling, using just one expert trajectory from the target task.
To demonstrate the efficacy of RGM compared to SOTA multi-task data sharing algorithms CDS (Yu et al., 2021) and CDS+UDS (Yu et al., 2022) , we conduct experiments in multi-task Walker (Stand, Walk, Run, Flip) and Quadruped (Walk, Run, Roll-Fast, Jump) domains built on DeepMind Control Suite (Tassa et al., 2018) . For each task, we use TD3 (Fujimoto et al., 2018) to collect three types of datasets (expert, medium, replay), and share the replay dataset of the relevant task with the medium dataset of the target task. For RGM, we only draw one expert trajectory for the discriminator training. We report the experimental results in Figure 3 , which shows that RGM substantially outperforms CDS and CDS+UDS (see Appendix C.3 and D.5 for more experiment details and results).

5.2. INVESTIGATIONS ON REWARD CORRECTION

Benefits of learned rewards. We investigate the potential benefits of the learned rewards via demonstrative experiments in an 8×8 grid world environment. We observe that the learned rewards in RGM enjoy three desirable properties that are unlikely to be provided by other existing methods: 1) they encode long-horizon information; 2) they correct wrong rewards against expert preference; and 3) they retrieve useful information from existing rewards, as shown in Figure 4. Specifically, Figure 4b shows that the learned rewards not only recover correct learning signals on the path of the expert, but also generalize well to regions not covered by expert data. In most locations, the agent can navigate to the destination by simply maximizing the one-step reward, meaning that the learned rewards encode long-horizon information. Moreover, Figure 4c shows that the learned rewards can avoid the dangerous fire locations by retrieving useful information provided in the imperfect rewards, while correcting the wrong rewards against expert preference.

Offline RL with corrected rewards. The corrected rewards learned by RGM can also be used in other offline RL approaches. Note that the corrected rewards are optimized for the specific α in Eq. (5), and hence may not be optimal for other offline RL methods. Nevertheless, Figure 5 shows that the corrected rewards can largely remedy the negative effects of the partially correct rewards and even surpass perfect rewards on some datasets.

Ablations on learned rewards. Additionally, we investigate the learned rewards in high-dimensional continuous control tasks by inspecting the learning process of both the reward correction term ∆r and the final learned rewards. Figure 6a shows that the reward correction term ∆r initially cannot distinguish expert and non-expert data well, but adapts and converges quickly. After a few training steps, ∆r correctly rewards expert data and penalizes non-expert data.
We also perform ablations on the effect of diverse types of imperfect rewards r on ∆r and r. Figure 6b shows that a perfect r is beneficial to enlarge reward differences on expert and non-expert samples, and incorrect r can be counterproductive. Nevertheless, RGM can largely correct the wrong rewards and produce reasonable learning signals. Similar effects are also observed on ∆r, as Figure 6c shows.

6. DISCUSSION AND CONCLUSION

In this paper, we propose RGM (Reward Gap Minimization), a unified offline policy optimization approach applicable to diverse settings of imperfect rewards. RGM is formulated as a bi-level optimization problem, which achieves reward correction and simultaneous policy learning in a fully offline paradigm. Extensive experiments and illustrative examples show that RGM can perform robust policy optimization under imperfect rewards. Several desirable properties are also identified in the corrected rewards learned by RGM. One limitation of RGM is the need for a small expert dataset, which may not be easily accessible in some applications. However, RGM relaxes the strong dependencies on online reward tuning and tedious human efforts, which renders it a powerful tool to solve many real-world problems.

A PROOFS

A.1 BACKGROUND

We begin by briefly introducing the Fenchel conjugate (also known as the convex conjugate or Legendre-Fenchel transformation):

Definition 3. (Fenchel conjugate) In a real Hilbert space $\mathcal{X}$, if a function $f(x)$ is proper, then the Fenchel conjugate $f_\star$ of $f$ at $y$ is:

$$f_\star(y) = \sup_{x\in\mathcal{X}} \big(y^T x - f(x)\big) \tag{17}$$

where the domain of $f_\star(y)$ is given by:

$$\mathrm{dom}\, f_\star = \Big\{y : \sup_{x\in \mathrm{dom} f} \big(y^T x - f(x)\big) < \infty\Big\} \tag{18}$$

If $f$ is also convex and lower semi-continuous, we have the duality $f_{\star\star}(x) = f(x)$. Furthermore, if $f$ is also differentiable, then the maximizer $x^*$ of $f_\star(y)$ satisfies:

$$x^* = f'_\star(y) \tag{19}$$

Next, we present the interchangeability principle, which plays a key role in Proposition 1.

Lemma 1. (Interchangeability principle) Let $\xi$ be a random variable on $\Xi$ and assume that, for any $\xi \in \Xi$, the function $g(\cdot, \xi)$ is a proper and upper semi-continuous concave function. Then

$$\mathbb{E}_\xi\Big[\max_{u\in\mathbb{R}} g(u, \xi)\Big] = \max_{u(\cdot)\in\mathcal{G}(\Xi)} \mathbb{E}_\xi\big[g(u(\xi), \xi)\big] \tag{20}$$

where $\mathcal{G}(\Xi) = \{u(\cdot) : \Xi \to \mathbb{R}\}$ is the entire space of functions defined on the support $\Xi$.

Proof. Please refer to (Dai et al., 2017; Rockafellar & Wets, 2009).
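Definition 3 and Eq. (19) can be checked numerically for the KL generator $f(x) = x\log x$ restricted to $x \ge 0$, whose conjugate is $f_\star(y) = e^{y-1}$ (the form quoted in Appendix B.1), so that $f'_\star(y) = e^{y-1}$ as well. The sketch below approximates the supremum with a ternary search, which is valid here because $xy - f(x)$ is concave in $x$; the search bounds and function names are assumptions chosen for this illustration.

```python
import math


def f_kl(x):
    # KL generator with the convention 0 * log 0 = 0.
    return x * math.log(x) if x > 0.0 else 0.0


def conjugate_numeric(f, y, lo=0.0, hi=50.0, iters=200):
    """Approximate f*(y) = sup_{x in [lo, hi]} (x*y - f(x)) by ternary search.

    Returns (value, maximizer). Assumes x*y - f(x) is concave on [lo, hi].
    """
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if m1 * y - f(m1) < m2 * y - f(m2):
            lo = m1
        else:
            hi = m2
    x = 0.5 * (lo + hi)
    return x * y - f(x), x


results = {}
for y in (-1.0, 0.0, 0.5, 2.0):
    value, maximizer = conjugate_numeric(f_kl, y)
    results[y] = (value, maximizer)
    print(y, value, maximizer, math.exp(y - 1.0))
```

For each tested $y$, both the supremum and the maximizing $x$ agree with $e^{y-1}$, which is the relationship that Eq. (11) later exploits to read the visitation ratio off the dual solution.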

A.2 PROOF OF TRACTABLE TRANSFORMATION OF THE LOWER-LEVEL PROBLEM

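We start our proof from the original bi-level optimization problem Eq. (4) and Eq. (5). Using the Bellman flow constraint for Eq. (5) yields:

$$\Delta r^* = \arg\min_{\Delta r} \; D_f\big(d^{\pi^*_{\hat{r}}} \,\|\, d^E\big)$$

$$\text{s.t.}\quad d^{\pi^*_{\hat{r}}} = \arg\max_{d\ge 0} \; \mathbb{E}_{(s,a)\sim d}[\hat{r}(s, a)] - \alpha D_f\big(d \,\|\, d^D\big) \quad \text{s.t.}\quad \sum_a d(s, a) = (1 - \gamma)\mu_0(s) + \gamma T_\star d(s), \;\; \forall s \in S$$

Assumption 1. There exists at least one $d$ such that:

$$\sum_a d(s, a) = (1 - \gamma)\mu_0(s) + \gamma T_\star d(s), \quad d(s) > 0, \quad \forall s \in S$$

We note that this assumption is mild: it is satisfied whenever every state is reachable from the initial state distribution, which is common in practice. Slater's theorem (Boyd et al., 2004) states that strong duality holds if the optimization problem is strictly feasible (Slater's condition holds) and the problem is convex. So, under Assumption 1 and given that the lower-level problem is convex w.r.t. $d$, strong duality holds, which means that the above lower-level problem can be rewritten in the following form:

$$\min_{V(s)} \max_{d\ge 0} \; \mathbb{E}_{(s,a)\sim d}[\hat{r}(s, a)] - \alpha D_f\big(d \,\|\, d^D\big) + \sum_s V(s)\Big((1 - \gamma)\mu_0(s) + \gamma T_\star d(s) - \sum_a d(s, a)\Big)$$

Lemma 2. The minimax problem:

$$\min_{V(s)} \max_{d\ge 0} \; \mathbb{E}_{(s,a)\sim d}[\hat{r}(s, a)] - \alpha D_f\big(d \,\|\, d^D\big) + \sum_s V(s)\Big((1 - \gamma)\mu_0(s) + \gamma T_\star d(s) - \sum_a d(s, a)\Big)$$

can be equivalently written as:

$$\min_{V(s)} \max_{d\ge 0} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \mathbb{E}_{(s,a)\sim d}\big[\hat{r}(s, a) + \gamma TV(s, a) - V(s)\big] - \alpha D_f\big(d \,\|\, d^D\big) \tag{25}$$

Proof.

$$\begin{aligned}
&\mathbb{E}_{(s,a)\sim d}[\hat{r}(s, a)] - \alpha D_f\big(d \,\|\, d^D\big) + \sum_s V(s)\Big((1 - \gamma)\mu_0(s) + \gamma T_\star d(s) - \sum_a d(s, a)\Big) \\
&= \mathbb{E}_{(s,a)\sim d}[\hat{r}(s, a)] - \alpha D_f\big(d \,\|\, d^D\big) + \sum_s V(s)\Big((1 - \gamma)\mu_0(s) + \gamma \sum_{\bar{s},\bar{a}} T(s|\bar{s}, \bar{a})\, d(\bar{s}, \bar{a}) - \sum_a d(s, a)\Big) \\
&= \sum_{s,a} d(s, a)\hat{r}(s, a) - \alpha D_f\big(d \,\|\, d^D\big) + (1 - \gamma)\sum_s \mu_0(s)V(s) + \gamma \sum_{\bar{s},\bar{a}} d(\bar{s}, \bar{a}) \sum_s T(s|\bar{s}, \bar{a})V(s) - \sum_{s,a} d(s, a)V(s) \\
&= \sum_{s,a} d(s, a)\hat{r}(s, a) - \alpha D_f\big(d \,\|\, d^D\big) + (1 - \gamma)\sum_s \mu_0(s)V(s) + \gamma \sum_{s,a} d(s, a) \sum_{s'} T(s'|s, a)V(s') - \sum_{s,a} d(s, a)V(s) \\
&= (1 - \gamma)\sum_s \mu_0(s)V(s) + \sum_{s,a} d(s, a)\Big(\hat{r}(s, a) + \gamma \sum_{s'} T(s'|s, a)V(s') - V(s)\Big) - \alpha D_f\big(d \,\|\, d^D\big) \\
&= (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \mathbb{E}_{(s,a)\sim d}\big[\hat{r}(s, a) + \gamma TV(s, a) - V(s)\big] - \alpha D_f\big(d \,\|\, d^D\big)
\end{aligned}$$

Proposition 1. The minimax problem:

$$\min_{V(s)} \max_{d\ge 0} \; \mathbb{E}_{(s,a)\sim d}[\hat{r}(s, a)] - \alpha D_f\big(d \,\|\, d^D\big) + \sum_s V(s)\Big((1 - \gamma)\mu_0(s) + \gamma T_\star d(s) - \sum_a d(s, a)\Big)$$

shares the same optimal value as the following minimization problem:

$$\min_{V(s)} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \alpha\, \mathbb{E}_{(s,a)\sim d^D}\Big[f_\star\Big(\frac{\hat{r}(s, a) + \gamma TV(s, a) - V(s)}{\alpha}\Big)\Big]$$

where $f_\star$ is the Fenchel conjugate function of $f$ with $\mathrm{dom}\, f = \{u : u \ge 0\}$.

Proof. Using Lemma 2, this minimax problem can be rewritten as:

$$\min_{V(s)} \max_{d\ge 0} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \mathbb{E}_{(s,a)\sim d}\big[\hat{r}(s, a) + \gamma TV(s, a) - V(s)\big] - \alpha D_f\big(d \,\|\, d^D\big) \tag{29}$$

Next,

$$\begin{aligned}
&\min_{V(s)} \max_{d\ge 0} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \mathbb{E}_{(s,a)\sim d}\big[\hat{r}(s, a) + \gamma TV(s, a) - V(s)\big] - \alpha D_f\big(d \,\|\, d^D\big) \\
&= \min_{V(s)} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \max_{d\ge 0} \Big(\mathbb{E}_{(s,a)\sim d}\big[\hat{r}(s, a) + \gamma TV(s, a) - V(s)\big] - \alpha D_f\big(d \,\|\, d^D\big)\Big) \\
&= \min_{V(s)} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \underbrace{\alpha \max_{d\ge 0} \Big(\mathbb{E}_{(s,a)\sim d}\Big[\frac{\hat{r}(s, a) + \gamma TV(s, a) - V(s)}{\alpha}\Big] - D_f\big(d \,\|\, d^D\big)\Big)}_{L}
\end{aligned} \tag{30}$$

$L$ in the last step reduces to:

$$\begin{aligned}
&\alpha \max_{d\ge 0} \Big(\mathbb{E}_{(s,a)\sim d}\Big[\frac{\hat{r}(s, a) + \gamma TV(s, a) - V(s)}{\alpha}\Big] - D_f\big(d \,\|\, d^D\big)\Big) \\
&= \alpha \max_{d\ge 0} \Big(\mathbb{E}_{(s,a)\sim d^D}\Big[\frac{d(s, a)}{d^D(s, a)}\, y(s, a)\Big] - \mathbb{E}_{(s,a)\sim d^D}\Big[f\Big(\frac{d(s, a)}{d^D(s, a)}\Big)\Big]\Big) \\
&= \alpha \max_{d\ge 0} \; \mathbb{E}_{(s,a)\sim d^D}\Big[\frac{d(s, a)}{d^D(s, a)}\, y(s, a) - f\Big(\frac{d(s, a)}{d^D(s, a)}\Big)\Big] \\
&= \alpha\, \mathbb{E}_{(s,a)\sim d^D}\Big[\max_{\frac{d(s,a)}{d^D(s,a)} \ge 0} \Big(\frac{d(s, a)}{d^D(s, a)}\, y(s, a) - f\Big(\frac{d(s, a)}{d^D(s, a)}\Big)\Big)\Big] \\
&= \alpha\, \mathbb{E}_{(s,a)\sim d^D}\big[f_\star(y(s, a))\big]
\end{aligned} \tag{31}$$

where $y(s, a) = \frac{\hat{r}(s,a) + \gamma TV(s,a) - V(s)}{\alpha}$, the third step follows from the interchangeability principle (Lemma 1), and the last step comes from the Fenchel conjugate of the convex function $f$. Using this result, we finally obtain the tractable lower-level problem Eq. (10).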
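Proposition 1 can be sanity-checked in the simplest possible setting: a single-state MDP with a self-loop, where $TV(s,a) = V$ and the Bellman flow constraint reduces to $\sum_a d(a) = 1$. The sketch below assumes the KL generator, for which $f_\star(y) = e^{y-1}$ (as quoted in Appendix B.1), minimizes the dual objective of Eq. (10) numerically, recovers the ratio $d/d^D = f'_\star(\cdot)$ as in Eq. (11), and compares the primal and dual optimal values. The rewards, data distribution, $\alpha$, and $\gamma$ are made-up numbers, and the helper names are hypothetical.

```python
import math


def dual_objective(V, r_hat, d_data, alpha, gamma):
    """Eq. (10) for KL in a one-state MDP: TV(s,a) = V, f*(y) = exp(y - 1)."""
    return (1.0 - gamma) * V + alpha * sum(
        p * math.exp((r + gamma * V - V) / alpha - 1.0)
        for r, p in zip(r_hat, d_data))


def minimize_convex(fn, lo, hi, iters=300):
    """Ternary search for the minimizer of a convex scalar function."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fn(m1) > fn(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


# Hypothetical corrected rewards and dataset action distribution.
r_hat = [1.0, 0.0, -0.5]
d_data = [0.2, 0.5, 0.3]
alpha, gamma = 0.5, 0.9

V_num = minimize_convex(
    lambda v: dual_objective(v, r_hat, d_data, alpha, gamma), -100.0, 100.0)
dual_value = dual_objective(V_num, r_hat, d_data, alpha, gamma)

# Stationarity of the dual gives (1 - gamma) V* = alpha * (log Z - 1).
Z = sum(p * math.exp(r / alpha) for r, p in zip(r_hat, d_data))
V_closed = alpha * (math.log(Z) - 1.0) / (1.0 - gamma)

# Ratio from Eq. (11): d/d^D = f*'(y) = exp(y - 1), evaluated at V_closed.
ratio = [math.exp((r + gamma * V_closed - V_closed) / alpha - 1.0) for r in r_hat]
d_opt = [w * p for w, p in zip(ratio, d_data)]

# Primal value of the regularized problem: E_d[r] - alpha * KL(d || d^D).
primal_value = sum(d * r for d, r in zip(d_opt, r_hat)) - alpha * sum(
    d * math.log(w) for d, w in zip(d_opt, ratio))
print(V_num, V_closed, primal_value, dual_value)
```

In this toy case the recovered $d$ is proportional to $d^D(a)\exp(\hat r(a)/\alpha)$, it sums to one as the flow constraint requires, and the primal and dual values coincide, as strong duality predicts.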

A.3 PROOF OF TRACTABLE TRANSFORMATION OF THE UPPER-LEVEL PROBLEM

Proposition 2. The original upper-level problem

$$\min_{\Delta r} \; D_f\big(d^{\pi^*_{\hat{r}}} \,\|\, d^E\big) \tag{32}$$

can be equivalently written as:

$$\min_{\Delta r} \; D_f\Big(f'_\star\Big(\frac{\hat{r} + \gamma TV^* - V^*}{\alpha}\Big)\, d^D \,\Big\|\, d^E\Big) \tag{33}$$

where $d^{\pi^*_{\hat{r}}}$ is the optimal state-action visitation distribution of Eq. (7).

Proof. By the property in Eq. (19), the maximizer $\big(\frac{d(s,a)}{d^D(s,a)}\big)^*$ of $f_\star(y(s, a))$ in Eq. (31) satisfies

$$\Big(\frac{d(s, a)}{d^D(s, a)}\Big)^* = f'_\star\Big(\frac{\hat{r}(s, a) + \gamma TV(s, a) - V(s)}{\alpha}\Big) \tag{34}$$

Given $V^*$, we have:

$$\frac{d^{\pi^*_{\hat{r}}}(s, a)}{d^D(s, a)} = f'_\star\Big(\frac{\hat{r}(s, a) + \gamma TV^*(s, a) - V^*(s)}{\alpha}\Big) \tag{35}$$

Substituting this result into the original upper-level problem completes the proof.

Next, we denote $f'_\star\big(\frac{\hat{r} + \gamma TV^* - V^*}{\alpha}\big)$ as $g$. By expanding the $f$-divergence, we have the upper-level objective:

$$D_f\big(d^D g \,\|\, d^E\big) = \mathbb{E}_{(s,a)\sim d^E}\Big[f\Big(\frac{d^D(s, a)\, g(s, a)}{d^E(s, a)}\Big)\Big] \tag{36}$$

$$= \mathbb{E}_{(s,a)\sim d^D}\Big[\frac{d^E(s, a)}{d^D(s, a)}\, f\Big(\frac{d^D(s, a)}{d^E(s, a)}\, g(s, a)\Big)\Big] \tag{37}$$

$$= \mathbb{E}_{(s,a)\sim d^D}\Big[w(s, a)\, f\Big(\frac{g(s, a)}{w(s, a)}\Big)\Big] \tag{38}$$

where the distribution ratio is $w(s, a) \triangleq d^E(s, a)/d^D(s, a)$. Finally, by combining Proposition 1 and Proposition 2, the original bi-level optimization problem Eq. (4)-(5) is rewritten equivalently as follows:

$$\Delta r^* = \arg\min_{\Delta r} \; \mathbb{E}_{(s,a)\sim d^D}\Big[w(s, a)\, f\Big(f'_\star\Big(\frac{\hat{r}(s, a) + \gamma TV^*(s, a) - V^*(s)}{\alpha}\Big) \Big/ w(s, a)\Big)\Big]$$

$$\text{s.t.}\quad V^*(s) = \arg\min_{V(s)} \; (1 - \gamma)\mathbb{E}_{s\sim\mu_0}[V(s)] + \alpha\, \mathbb{E}_{(s,a)\sim d^D}\Big[f_\star\Big(\frac{\hat{r}(s, a) + \gamma TV(s, a) - V(s)}{\alpha}\Big)\Big] \tag{39}$$

B IMPLEMENTATION DETAILS OF RGM

B.1 RGM WITH KL-DIVERGENCE

In this section, we introduce the implementation details of RGM. For the KL-divergence, we have $f(x) = x \log x$ and its Fenchel conjugate is $f_\star(x) = e^{x-1}$. However, this exponential form is numerically unstable and prone to value explosion in practice. We address this issue by using the fact that the conjugate of the negative entropy function, restricted to the probability simplex, is the log-sum-exp function (Boyd et al., 2004), i.e., $D_{\star,f}(y) = \log \mathbb{E}_{x\sim q}[\exp y(x)]$.
Then, the optimization problem of RGM with KL divergence is

$$
\begin{aligned}
\min_{\Delta r} \; & \mathbb{E}_{(s,a)\sim d^D}\Big[ \mathrm{Softmax}\Big(\frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha}\Big) \Big( \log\frac{d^D(s,a)}{d^E(s,a)} + \log \mathrm{Softmax}\Big(\frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha}\Big) \Big) \Big] \\
\text{s.t.} \; & V^* = \arg\min_{V} \; (1-\gamma)\,\mathbb{E}_{s\sim\mu_0}[V(s)] + \alpha \log \mathbb{E}_{(s,a)\sim d^D}\Big[\exp\Big(\frac{\mathrm{Adv}(\Delta r, V)}{\alpha}\Big)\Big]
\end{aligned} \tag{40}
$$

where $\mathrm{Adv}(\Delta r, V) := \hat r(s,a) + \gamma \mathcal{T}V(s,a) - V(s) = \tilde r(s,a) + \Delta r(s,a,\tilde r) + \gamma \mathcal{T}V(s,a) - V(s)$, and $\log\frac{d^D(s,a)}{d^E(s,a)}$ can be obtained by training a discriminator, $\log\frac{d^D(s,a)}{d^E(s,a)} = -\log\big(\frac{1}{h^*} - 1\big)$, using Eq. (14) in continuous MDPs. The importance ratio used to extract the policy is

$$
\psi^*(s,a) = \frac{d^{\pi^*_{\hat r}}(s,a)}{d^D(s,a)} = \mathrm{Softmax}\Big(\frac{\tilde r + \Delta r + \gamma \mathcal{T}V^*(s,a) - V^*(s)}{\alpha}\Big) \tag{41}
$$

B.1.1 OPTIMIZE WITHOUT SUM-EXP

Note that in the upper-level objective of Eq. (40), we need to calculate a log-sum-exp value in the denominator of the log(Softmax) term, where $\log \mathrm{Softmax}(\mathrm{Adv}(\Delta r, V^*)/\alpha) = \mathrm{Adv}(\Delta r, V^*)/\alpha - \log \sum_{s,a \in S\times A} \exp(\mathrm{Adv}(\Delta r, V^*)/\alpha)$. In a low-dimensional discrete state-action space, this value is easily obtained by summing over the whole space. In high-dimensional continuous MDPs, however, it is difficult to obtain because it requires integration over the entire space. CQL (Kumar et al., 2020) approximates this value via importance sampling but requires additional samples from the entire state-action space. Other methods such as Markov Chain Monte Carlo (MCMC) or Score Matching (SM) (Song & Kingma, 2021) can approximate the update gradient but bring additional computation costs and suffer from some technical issues. Fortunately, we can circumvent the log-sum-exp term by optimizing an upper bound of the original upper-level problem using the following inequality (Boyd et al., 2004):

$$
\max_{x_i \in B}\{x_1, \ldots, x_n\} \le \max\{x_1, \ldots, x_n\} \le \log \sum_{i}^{n} \exp(x_i) \tag{42}
$$

where $\max_{x_i \in B}\{x_1, \ldots, x_n\}$ is the maximum value in a mini-batch $B$ sampled from $\{x_1, \ldots, x_n\}$. For simplicity, we denote $\max_{x_i \in B}\{x_1, \ldots, x_n\}$ as $\max_{x_i \in B}\{x\}$. Substituting Eq. (42) into the upper-level problem of Eq. (40), we get the upper bound of the original upper-level optimization objective:

$$
\begin{aligned}
\mathrm{Upper}(40) &= \mathbb{E}_{(s,a)\sim d^D}\Big[ \mathrm{Softmax}\Big(\frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha}\Big) \Big( \log\frac{d^D(s,a)}{d^E(s,a)} + \frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha} - \log \sum_{s,a \in S\times A} \exp\Big(\frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha}\Big) \Big) \Big] \\
&\le \mathbb{E}_{(s,a)\sim d^D}\Big[ \mathrm{Softmax}\Big(\frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha}\Big) \Big( \log\frac{d^D(s,a)}{d^E(s,a)} + \frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha} - \max_B \frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha} \Big) \Big] \\
&\propto \mathbb{E}_{(s,a)\sim d^D}\Big[ \exp\Big(\frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha}\Big) \Big( \log\frac{d^D(s,a)}{d^E(s,a)} + \frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha} - \max_B \frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha} \Big) \Big]
\end{aligned} \tag{43}
$$

where Upper(40) denotes the upper-level objective in Eq. (40). Replacing the upper-level objective in Eq. (40) with Eq. (43), we obtain the final optimization problem:

$$
\begin{aligned}
\min_{\Delta r} \; & \mathbb{E}_{(s,a)\sim d^D}\Big[ \exp\Big(\frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha}\Big) \Big( \log\frac{d^D(s,a)}{d^E(s,a)} + \frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha} - \max_B \frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha} \Big) \Big] \\
\text{s.t.} \; & V^* = \arg\min_{V} \; (1-\gamma)\,\mathbb{E}_{s\sim\mu_0}[V(s)] + \alpha \log \mathbb{E}_{(s,a)\sim d^D}\Big[\exp\Big(\frac{\mathrm{Adv}(\Delta r, V)}{\alpha}\Big)\Big]
\end{aligned} \tag{44}
$$

Note that the exp term in the upper-level problem is prone to value explosion in practice, so we clip the exp value to $(-\infty, 100]$, as IQL (Kostrikov et al., 2021b) does, to improve training stability.

When extracting the policy, we can ignore the sum-exp term in the denominator of Softmax, because it does not influence the direction of the gradients used to update the policy, and obtain the following ratio:

$$
\psi^*(s,a) = \frac{d^{\pi^*_{\hat r}}(s,a)}{d^D(s,a)} \propto \exp\Big(\frac{\tilde r + \Delta r + \gamma \mathcal{T}V^*(s,a) - V^*(s)}{\alpha}\Big) := \tilde\psi^*(s,a) \tag{45}
$$

However, Eq. (45) only gives an unnormalized distribution ratio instead of an exact one. We resort to self-normalized importance sampling (Owen, 2013) to obtain a normalized ratio:

$$
\psi^*(s,a) = \frac{\tilde\psi^*(s,a)}{\mathbb{E}_{(s,a)\sim d^D}[\tilde\psi^*(s,a)]} \tag{46}
$$

B.2 RGM WITH χ²-DIVERGENCE

Additionally, we can also implement RGM using the χ²-divergence. For the χ²-divergence, we have $f(x) = \frac{1}{2}(x-1)^2$ with $\mathrm{dom}\, f = \{x : x \ge 0\}$ (footnote 6), its Fenchel conjugate is $f_\star(x) = \frac{1}{2}(x+1)^2$, and $f'_\star(x) = \max(0, x+1)$.
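Then, substituting $f$ and $f_\star$ into Eq. (39), the optimization objective of RGM with χ² divergence is

$$
\begin{aligned}
\min_{\Delta r} \; & \mathbb{E}_{(s,a)\sim d^D}\Big[ \frac{d^E(s,a)}{2\, d^D(s,a)} \Big( \max\Big(0, \frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha} + 1\Big) \frac{d^D(s,a)}{d^E(s,a)} - 1 \Big)^2 \Big] \\
\text{s.t.} \; & V^* = \arg\min_{V} \; (1-\gamma)\,\mathbb{E}_{s\sim\mu_0}[V(s)] + \frac{\alpha}{2}\, \mathbb{E}_{(s,a)\sim d^D}\Big[ \Big(\frac{\mathrm{Adv}(\Delta r, V)}{\alpha} + 1\Big)^2 \Big]
\end{aligned}
$$

The importance ratio used to extract the policy is:

$$
\psi^*(s,a) = \frac{d^{\pi^*_{\hat r}}(s,a)}{d^D(s,a)} = \max\Big(0, \frac{\tilde r + \Delta r + \gamma \mathcal{T}V^*(s,a) - V^*(s)}{\alpha} + 1\Big)
$$

For RGM with KL-divergence, the upper layer contains an exponential term $\exp\big(\frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha}\big)$, which may cause numerical instability. For RGM with χ² divergence, $f'_\star(x) = \max(0, x+1)$, so the gradient vanishes when $x + 1 < 0$, which can make policy learning slow or even cause it to fail. In practice, we follow the criterion from SMODICE (Ma et al., 2022) and monitor the initial policy loss to choose the type of $f$-divergence.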
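The two policy-extraction ratios can be sketched on a single mini-batch as follows. This is an illustrative sketch rather than the authors' code: the function names and the sample advantages are hypothetical, the expectation in the self-normalization is assumed to be estimated by the batch mean, and the clipping is assumed to cap the exponentiated value at 100 (implemented by capping the exponent at log 100, which is equivalent and avoids overflow).

```python
import math

def kl_weights(adv, alpha, exp_cap=100.0):
    """Self-normalized KL ratios: exp(adv / alpha), capped at exp_cap, divided by the batch mean."""
    log_cap = math.log(exp_cap)
    # Capping the exponent at log(exp_cap) equals capping exp(.) at exp_cap.
    unnorm = [math.exp(min(a / alpha, log_cap)) for a in adv]
    mean = sum(unnorm) / len(unnorm)
    return [u / mean for u in unnorm]

def chi2_weights(adv, alpha):
    """Chi-square ratios: max(0, x + 1) with x = adv / alpha."""
    return [max(0.0, a / alpha + 1.0) for a in adv]

adv = [-3.0, -0.5, 0.0, 1.2]  # hypothetical Adv(Δr, V*) values for one mini-batch
print(kl_weights(adv, alpha=1.0))
print(chi2_weights(adv, alpha=1.0))
```

In this toy batch the first χ² weight is exactly zero, which is the vanishing-gradient region discussed above, while the KL weights stay strictly positive and average to one.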

B.3 RGM HYPERPARAMETERS AND PSEUDOCODE

For continuous MDPs with high-dimensional state-action spaces, we implement RGM by parameterizing $h_\tau$, $\Delta r_\phi$, $V_\theta$ and $\pi_w$ using deep neural networks with parameters $\tau$, $\phi$, $\theta$ and $w$, respectively. We implement RGM based on a two-time-scale first-order stochastic gradient update, where the reward correction term is updated much more slowly than the Lagrangian multiplier $V$. We use a cosine annealing learning rate schedule for the reward correction term and the policy network to stabilize the training process. To make the reward correction term comparable with the original imperfect rewards, we normalize the imperfect rewards to the standard Gaussian distribution $N(0, 1)$ and restrict the output range of $\Delta r_\phi$ to $[-3, 3]$ with a Tanh function. The final hyperparameters can be found in Table 3. The pseudocode of RGM with deep neural networks can be found in Algorithm 1. We run RGM on one RTX 3080Ti GPU; applying 1M gradient steps takes about 1h30min of training time. We report the wall-clock training time of RGM compared with SOTA offline RL methods as well as SOTA offline IL methods that can learn from mixed-quality data in Table 4. RGM is as efficient as most baselines but has the additional ability to combat the negative impacts of imperfect rewards.

Task descriptions. The Robomimic (Mandlekar et al., 2021) tasks we try to solve include Lift and Can. For the Lift task, the RL policy needs to control a 7-DOF robot arm to lift a cube that is randomly placed on a table. For the Can task, the RL policy needs to control a 7-DOF robot arm to pick up a can that is randomly placed on a table and place it in a specific location. The AntMaze tasks we try to solve include the AntMaze medium tasks, where an ant not only needs to learn to walk but also needs to navigate from the start to the destination in a medium-size maze.
This task is extremely difficult due to the non-Markovian and mixed-quality offline dataset, the stochasticity of the environment, and the high-dimensional state-action space (Fu et al., 2020).

AntMaze dataset composition. The expert dataset of RGM is composed of 30 successful trajectories (which may be suboptimal) collected by training IQL with dense rewards. We use the original D4RL Antmaze-medium-play-v2 and Antmaze-medium-diverse-v2 datasets as non-expert datasets.

C.3 MULTI-TASK DATA SHARING EXPERIMENTS

Task descriptions. The multi-task data sharing experiments contain 2 domains with 4 tasks per domain, built on the DeepMind Control Suite (Tassa et al., 2018). The immediate rewards in all 8 tasks lie in the unit interval, r(s, a) ∈ [0, 1]. (a) For the Walker (Stand, Walk, Run, Flip) domain, the agent needs to control a biped in a 2D vertical plane to master four different locomotion skills. The observation space is 24-dimensional, and the action space is 6-dimensional. The episode length is set to 1000. (b) For the Quadruped (Walk, Run, Roll-Fast, Jump) domain, the agent needs to control a quadruped in a 3D space to master four different moving skills. The observation space is 78-dimensional, and the action space is 12-dimensional. The episode length is set to 1000.

Dataset composition. We follow the same dataset generation rule and similar task settings as Bai et al. (2023). For each task, we utilize TD3 (Fujimoto et al., 2018)

D ADDITIONAL RESULTS

In this section, we provide additional comparative and ablation results of RGM against baseline methods. Recall that DWBC (Xu et al., 2022b) and SMODICE (Ma et al., 2022) both assume that the offline dataset already covers many expert trajectories, which is more restrictive than the requirement of RGM. Therefore, we further demonstrate the superiority of RGM over these offline IL methods by evaluating RGM under the same settings as DWBC and SMODICE. We combine the original D4RL dataset with 200 or 100 expert trajectories as the offline dataset D; see Table 7 for descriptions of the expert trajectories. The comparisons under these dataset configurations can be found in Table 8. Table 8 shows that RGM still outperforms existing SOTA offline IL methods under their settings.

We also implemented discounted visitation distribution sampling in RGM. This is done by augmenting the D4RL datasets with the timestep of each (s, a) pair in an episode. When sampling in Eq. (14)-(16) and calculating the gradient, we sample (s, a, t) from the D4RL datasets and then multiply the gradient by $\gamma^t$. Empirically, we found that the discounted visitation distribution version does not perform better than the sampling distribution version of RGM. Figure 12 and Table 9 show that RGM (sampling distribution) surpasses RGM (discounted visitation distribution) in most cases with lower variance, while the latter wins by a slight margin in only a few cases.
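A minimal sketch of the discounted weighting, assuming each sample carries its within-episode timestep t and that a per-sample loss has already been computed (the helper name and all numbers are hypothetical). Because the gradient is linear in the loss, weighting each per-sample loss by $\gamma^t$ is equivalent to multiplying its gradient by $\gamma^t$.

```python
def discounted_batch_loss(per_sample_losses, timesteps, gamma=0.99):
    """Average of per-sample losses, each weighted by gamma ** t."""
    weights = [gamma ** t for t in timesteps]
    weighted = [w * l for w, l in zip(weights, per_sample_losses)]
    return sum(weighted) / len(weighted)

losses = [0.8, 0.5, 0.3]     # hypothetical losses for three (s, a, t) samples
timesteps = [0, 10, 200]     # timestep of each sample within its episode
print(discounted_batch_loss(losses, timesteps))
```

Samples late in an episode receive very small weights under this scheme, so fewer samples contribute effectively to each update.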

D.3 EXPERIMENTS ON NOISY PARTIALLY CORRECT REWARDS

We add i.i.d. Gaussian noise with different standard deviations σ to the original D4RL rewards to construct noisy imperfect rewards with different degrees of imperfection. We set σ = 1 to construct partially correct rewards and σ = 10 to construct largely incorrect rewards; see Table 10 for detailed results. Table 10 shows that RGM under perfect rewards only slightly outperforms RGM with partially correct rewards, indicating that RGM can largely remedy the negative impacts caused by reward noise with σ = 1. Meanwhile, the highly noisy rewards (σ = 10) do impact the performance, but the mean score is 45.0, which is still considerably higher than that of other offline RL and IL methods under partially correct rewards, whose largest mean value is 35.5 as shown in Table 1.

designed for mixed-quality data (DWBC and SMODICE). It is found that RGM also enjoys a higher level of performance gains when the amount of expert data is increased.
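The noisy-reward construction used in the noise experiments above can be sketched as follows. The sketch assumes zero-mean noise (the text only specifies the standard deviation σ); the helper name, the seed and the sample rewards are hypothetical.

```python
import random

def add_reward_noise(rewards, sigma, seed=0):
    """Return rewards corrupted by i.i.d. zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [r + rng.gauss(0.0, sigma) for r in rewards]

rewards = [1.0, 0.5, 2.0, 1.5]                               # hypothetical original rewards
partially_correct = add_reward_noise(rewards, sigma=1.0)     # sigma = 1
largely_incorrect = add_reward_noise(rewards, sigma=10.0)    # sigma = 10
print(partially_correct)
print(largely_incorrect)
```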

D.5 EXPERIMENTS ON MULTI-TASK DATA SHARING

We present concrete results of the multi-task data sharing experiment. Table 14 shows the evaluated scores on multi-task data sharing, which are illustrated in Fig. 3 .

D.6 ADDITIONAL LEARNING CURVES OF RGM

We present the learning curves of RGM compared with offline IL and RL baselines on D4RL datasets related to the results presented in Table 1 .

D.7 ILLUSTRATIVE EXAMPLE FOR THE NON-TABULAR SCENARIOS

The results of the 8×8 grid world experiments in Section 5.2 and Appendix C.4 illustrate the potential benefits of the learned rewards in the tabular case. In this subsection, we consider a one-dimensional random walk task in the non-tabular case and visualize the learned corrected rewards $\hat r$. In this task, the state space is the interval [0, 3], and at each step the agent can move by an amount in the range [-0.5, 0.5]. If the agent goes beyond an edge (s < 0 or s > 3), we keep it at that edge (s = 0 or s = 3). The agent starts from state s = 0 and needs to reach the destination located at s = 3 as fast as possible. The expert dataset $D^E$ consists of one trajectory in which the expert takes action a = 0.5 at every state. The offline dataset D consists of 1000 trajectories generated by a completely random policy, where the agent draws its action uniformly from [-0.5, 0.5] at every state. The sparse reward is $\tilde r = +10$ when reaching the destination and $\tilde r = 0$ everywhere else. The visualization of the learned rewards $\hat r$ at each state-action pair is shown in Figure 14.
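The task and the two datasets described above can be sketched in a few lines. The termination rule (the episode ends on reaching s = 3), the horizon of 50 steps for random rollouts, and all function names are assumptions made for illustration; they are not specified in the text.

```python
import random

def step(state, action):
    """One transition of the 1-D walk on [0, 3] with a sparse reward at s = 3."""
    action = max(-0.5, min(0.5, action))              # actions are limited to [-0.5, 0.5]
    next_state = max(0.0, min(3.0, state + action))   # keep the agent inside [0, 3]
    done = next_state >= 3.0
    reward = 10.0 if done else 0.0
    return next_state, reward, done

def rollout(policy, max_steps=50):
    """Collect one trajectory of (s, a, r, s') tuples starting from s = 0."""
    state, traj = 0.0, []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = step(state, action)
        traj.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return traj

rng = random.Random(0)
expert_data = [rollout(lambda s: 0.5)]                                            # one expert trajectory
offline_data = [rollout(lambda s: rng.uniform(-0.5, 0.5)) for _ in range(1000)]   # random policy
print(len(expert_data[0]), len(offline_data))
```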

E DISCUSSION ON THE APPLICABILITY TO ONLINE SETTINGS

It should be noted that the proposed RGM framework can also be applied to the online setting. This can be achieved by simply setting α = 0 in Eq. (4)-(5), which gives the bi-level objective of the online setting. Since online samples from $d^{\pi^*_{\hat r}}$ are available in the online setting, we do not have to eliminate $d^{\pi^*_{\hat r}}$. One can use existing popular online RL algorithms to solve the lower-level problem, while leveraging the online samples from $d^{\pi^*_{\hat r}}$ to solve the upper-level problem. Hence, the online version of RGM can be viewed as a reduced and simplified version of the original RGM. The core idea of reward correction is unchanged in the online setting, which illustrates that, to some extent, RGM is a unified policy optimization method for imperfect rewards.



EXPERIMENTS

In this section, we present empirical evaluations of RGM under diverse imperfect reward settings, including partially correct rewards, completely incorrect rewards, sparse rewards, and the multi-task data sharing setting, on Robomimic (Mandlekar et al., 2021), D4RL-v2 (Fu et al., 2020) and a dataset of a grid-world navigation task.

As D4RL MuJoCo tasks are deterministic, we use only one expert trajectory to assist the reward correction and policy learning for these tasks.

The signs of 50% of the D4RL rewards are flipped, and hence only half of the rewards can give correct learning signals.

We regard the original D4RL rewards as perfect since we evaluate the policies in terms of these rewards, which can be perceived as solving the tasks encoded in the original D4RL rewards.

All signs of the original rewards are flipped.

Note that the IQL and CQL papers turn the original sparse rewards into dense rewards by applying the reward subtraction trick (subtracting 1 from every reward, so that the reward becomes negative except at the goal).

$\mathrm{dom}\, f = \{u : u \ge 0\}$ and $f$ is convex, so $f_\star(y) = -f(0)$ when $y \le f'(0)$.

On account of the state-action visitation distribution d ≥ 0



Figure 1: Diverse settings of imperfect rewards.

(Bellman flow constraint) Let $\mathcal{T}_\star d(s) = \sum_{\bar s, \bar a} T(s \mid \bar s, \bar a)\, d(\bar s, \bar a)$ denote the transpose (or adjoint) transition operator. The Bellman flow constraint for the visitation distribution $d(s, a)$ is:

$$
\sum_a d(s, a) = (1-\gamma)\mu_0(s) + \gamma \mathcal{T}_\star d(s), \quad \forall s \in S \tag{6}
$$

If $d(s, a) \ge 0$ satisfies the Bellman flow constraint, then $d(s, a)$ is feasible and there is a one-to-one correspondence between $d$ and the related policy $\pi$: i.e., $d$ is the only visitation distribution for policy $\pi(a \mid s) = \frac{d(s,a)}{\sum_{\bar a} d(s, \bar a)}$, while $\pi$ is the only policy whose visitation distribution is $d$ (for detailed proof see Puterman (

Figure 2: Illustration of the reformulated bi-level optimization problem.

Figure 3: Results on multi-task data sharing tasks.

Figure 4: Learned rewards $\hat r$ and optimal distribution $d^{\pi^*_{\hat r}}$ trained on two types of imperfect rewards $\tilde r$. The opacity of each square represents the value of the marginal state distribution $d^{\pi^*_{\hat r}}(s)$. The opacity of the arrow shows the learned reward $\hat r$, where the darkest arrow points in the direction of the highest reward. The expert follows the path and arrow from the start to the goal. $\tilde r$ in (b) is +10 at the goal and zero at other states. $\tilde r$ in (c) falsely punishes the agent on fake fire marks and correctly punishes the agent on true fire marks.

Figure 5: Performance drop of normalized returns of SOTA offline RL methods on D4RL datasets under perfect and RGM-corrected rewards. The wrong rewards are the partially correct rewards as in Table 1. H: Hopper; HC: HalfCheetah; W: Walker2d.

Figure 6: Experiments on learned rewards in the hopper-m-r task. The superscript "¯" denotes the mean value over mini-batch samples. The subscripts "E" and "O" denote values on expert and non-expert data. In (b)(c), large $\bar r_E - \bar r_O$ and $\overline{\Delta r}_E - \overline{\Delta r}_O$ indicate that expert and non-expert data are clearly distinguishable according to the learned rewards, and small values mean the opposite.

$\mathbb{E}_{(s,a)\sim d^D}[f_\star(y(s, a))]$

In practice, we use the same mini-batch $B$ as that of the SGD gradient update step to calculate $\max_B \frac{\mathrm{Adv}(\Delta r, V^*)}{\alpha}$ in Eq. (44).

Figure 8: Robomimic tasks

Figure 9: AntMaze medium task.

Robomimic dataset composition. The Robomimic (Mandlekar et al., 2021) datasets used in this paper contain 3 types of datasets. PH (Proficient-Human): datasets collected by a single, experienced human operator. MH (Multi-Human): datasets collected by 6 human operators of varying proficiency. MG (Machine-Generated): datasets collected by first training SAC on the Lift and Can tasks, taking agent checkpoints that are saved regularly during training, and collecting 300 rollout trajectories from each checkpoint. We treat the PH dataset as the expert dataset because the environment is stochastic, so a single expert trajectory can hardly capture the expert distribution. We use MG datasets as the large, potentially suboptimal dataset rather than MH datasets, since MH datasets are non-Markovian and thus hard to solve with modern offline RL methods (Mandlekar et al., 2021), which is not the main challenge we try to solve.

Figure 10: Different tasks in Walker and Quadruped domain

to collect three types of datasets (expert, medium, replay). The expert dataset contains only one expert episode, the medium dataset contains 1000 episodes of interactions, and the replay dataset contains 2000 episodes of interactions. For the Walker (Stand, Walk, Run, Flip) domain, the Stand task is set as the target task, and the others are relevant tasks. For the Quadruped (Walk, Run, Jump, Roll-Fast) domain, the Walk task is set as the target task, and the others are relevant tasks. We conduct two-task data sharing experiments, in which we share the replay dataset of the relevant task with the medium dataset of the target task.

C.4 GRID WORLD EXPERIMENTS

Dataset composition. The offline dataset D we use in the grid world experiments consists of 1000 trajectories generated by a completely random policy (Figure 11 (b)). There are two settings of imperfect rewards r: (i) r = +10 when reaching the goal and r = 0 everywhere else. (ii) (Figure 11 (c)) r = +10 when reaching the goal, r = -10 when encountering a fire (true fire or fake fire), and r = 0 everywhere else. The expert demonstration dataset $D^E$ consists of only one expert demonstration (Figure 11 (a)).

Figure 11: (a) The single expert demonstration path, which follows the path and arrow from the start to the goal. (b) The empirical distribution heatmap of the offline dataset $D^O$, which consists of trajectories generated by a random policy. The darker the color, the more frequently the agent passes. (c) Illustration of imperfect rewards. The agent gets r = -10 when reaching a fire (true fire or fake fire), r = +10 when reaching the goal, and r = 0 everywhere else.



Results on sparse reward tasks.

The hyperparameters of RGM with deep neural networks

Wall-clock run time comparison of RGM and other baselines

Dataset compositions for Robomimic Datasets

The details about the expert data that are used to construct the non-expert dataset in offline IL settings.

Average normalized scores of RGM compared with SOTA offline IL methods that can learn from mixed-quality data under their settings. The notation "-w.e" stands for the mixed dataset that combines the original D4RL dataset with some expert trajectories. The scores are taken over the final 10 evaluations with 5 random seeds. We obtain the results by running the author-provided open-source code. RGM achieves the highest score in 7 of 12 tasks.

Figure 12: Experiments on sampling from discounted and undiscounted distributions.

score is 45.0, which is still considerably higher than other offline RL and IL methods under partially correct rewards with the largest mean value of 35.5 as shown in Table 1.

Normalized scores of RGM sampling from discounted distribution and undiscounted distribution

Normalized scores of RGM on different degrees of noisy datasets.

Normalized scores of RGM and offline IL baselines when $D^E$ contains 10 expert trajectories.
Dataset | DWBC ($N_E = 10$) | SMODICE ($N_E = 10$) | RGM ($N_E = 10$)

Normalized scores of RGM and offline IL baselines when $D^E$ contains 40 expert trajectories.
Dataset | DWBC ($N_E = 40$) | SMODICE ($N_E = 40$) | RGM ($N_E = 40$)

Normalized scores of RGM and offline IL baselines when $D^E$ contains 80 expert trajectories.
Dataset | DWBC ($N_E = 80$) | SMODICE ($N_E = 80$) | RGM ($N_E = 80$)

Evaluated scores on multi-task data sharing.

ACKNOWLEDGMENTS

This work is supported by funding from Haomo.AI and the National Natural Science Foundation of China under Grants 62125304 and 62073182. The authors would also like to thank the anonymous reviewers for their feedback on the manuscript.

Algorithm 1 RGM (KL-divergence) with Deep Neural Networks

Input: one expert demonstration $D^E$, offline dataset $D$; set $D \leftarrow D^E \cup D$. Initialize $\tau, \phi, \theta, w$.
// Discriminator learning
Train $h_\tau$ using $D^E$ and $D$ via Eq. (14).
for t = 0, 1, 2, ..., N do
    Sample mini-batch transitions $(s, a, r, s') \sim D$
    // Reward Gap Minimization: bi-level optimization
    Update $V_\theta$, $\Delta r_\phi$ using Eq. (44) with $l_\phi \ll l_\theta$
    // Policy extraction
    Update $\pi_w$ based on Eq. (16) and Eq. (46)
end for

C EXPERIMENTAL DETAILS

In this section, we introduce the detailed experimental setups in our paper.

C.1 D4RL EXPERIMENTS

Task descriptions. The D4RL (Fu et al., 2020) tasks we try to solve include Hopper, Halfcheetah and Walker2d. For these tasks, RL policies need to control the robots to move in the forward (right) direction by applying torques on the joints.

Dataset composition. The D4RL (Fu et al., 2020) datasets used in this paper contain 5 types of datasets. random: roll out a random policy for 1M steps. expert: roll out an expert policy that is trained with SAC (Haarnoja et al., 2018) for 1M steps. medium: roll out a medium policy that achieves 1/3 of the performance of the expert for 1M steps. medium-replay: the replay buffer of a SAC agent that is trained to the performance of the medium policy. medium-expert: an equal mixture of medium and expert data. We sample only one trajectory from the expert dataset to serve as the expert demonstration $D^E$. The other datasets are treated as non-expert datasets D.

Imperfect rewards. We assume the original rewards in the D4RL datasets are perfect, since we evaluate the policy performance based on the perfect reward function of the original gym environment. We randomly flip the sign of 50% of the original rewards to construct partially correct rewards, where half of the rewards provide correct learning signals while the other half do not. We flip the signs of all original rewards to construct completely incorrect rewards.

The expert data contains only 7 states (s = 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0), but the learned rewards still generalize well across the state space, even in regions that are not covered by the expert data. Similar to the 8×8 grid world experiment, the agent can successfully navigate to the destination by only maximizing the per-step reward $\hat r$, which means that the learned rewards also encode long-horizon information.
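The two D4RL reward corruptions described under "Imperfect rewards" above can be sketched as follows. The sketch assumes that the flipped rewards are chosen uniformly at random without replacement so that exactly the requested fraction is flipped; the helper name, the seed and the sample rewards are hypothetical.

```python
import random

def flip_reward_signs(rewards, fraction, seed=0):
    """Flip the sign of a randomly chosen `fraction` of the rewards."""
    rng = random.Random(seed)
    n_flip = int(round(fraction * len(rewards)))
    flip_idx = set(rng.sample(range(len(rewards)), n_flip))
    return [-r if i in flip_idx else r for i, r in enumerate(rewards)]

rewards = [1.0, 0.5, 2.0, 1.5]                                    # hypothetical original rewards
partially_correct = flip_reward_signs(rewards, fraction=0.5)      # half of the signs flipped
completely_incorrect = flip_reward_signs(rewards, fraction=1.0)   # all signs flipped
print(partially_correct)
print(completely_incorrect)
```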

