OFFLINE REINFORCEMENT LEARNING FROM RANDOMLY PERTURBED DATA SOURCES

Abstract

Most existing offline reinforcement learning (RL) studies assume that the available dataset is sampled directly from the target task. However, in many practical applications, the available data often come from several related but heterogeneous environments. A theoretical understanding of efficient learning from heterogeneous offline datasets remains lacking. In this work, we study the problem of offline RL based on multiple data sources that are randomly perturbed versions of the target Markov decision process (MDP). A novel HetPEVI algorithm is first proposed, which simultaneously considers two types of uncertainty: sample uncertainties from the finite number of data samples per data source, and source uncertainties due to the finite number of data sources. In particular, the sample uncertainties from all data sources are jointly aggregated, while an additional penalty term is specially constructed to compensate for the source uncertainties. Theoretical analysis demonstrates the near-optimal performance of HetPEVI. More importantly, the costs and benefits of learning with randomly perturbed data sources are explicitly characterized: on the one hand, an unavoidable performance loss occurs due to the indirect access to the target MDP; on the other hand, efficient learning is achievable as long as the sources collectively (instead of individually) provide good data coverage. Finally, we extend the study to linear function approximation and propose the HetPEVI-Lin algorithm, which provides additional efficiency guarantees beyond the tabular case.

1. INTRODUCTION

Offline reinforcement learning (RL) (Levine et al., 2020), a.k.a. batch RL (Lange et al., 2012), has received growing interest in recent years. It aims at training RL agents using accessible datasets collected a priori and thus avoids expensive online interactions. Along with its tremendous empirical successes (Kidambi et al., 2020), recent studies have also established theoretical understandings of offline RL (Rashidinejad et al., 2021; Jin et al., 2021b; Duan et al., 2021; Uehara & Sun, 2021). Despite these advances, the majority of offline RL research focuses on learning via data collected exactly from the target task environment (Kumar et al., 2020). However, in practice, it is difficult to ensure that all such data come perfectly from one source environment. Instead, in many cases, it is more reasonable to assume that data are collected from different sources that are perturbed versions of the target task. For example, when training a chatbot (Jaques et al., 2020), the offline dialogue datasets typically consist of short conversations between different people, who naturally have varying language habits. The training objective is the common underlying language structure, e.g., basic grammar, which cannot be reflected in any individual dialogue but must be learned holistically from their aggregation. More examples can be found in healthcare (Tang & Wiens, 2021), autonomous driving (Sallab et al., 2017) and other areas. While a few empirical investigations under the offline meta-RL framework have been reported (Dorfman et al., 2021; Lin et al., 2022; Mitchell et al., 2021), theoretical understandings of effectively and efficiently learning the underlying task using datasets from multiple heterogeneous sources are largely lacking. Motivated by both practical and theoretical limitations, this work makes progress on the underexplored RL problem of learning the target task using data from heterogeneously perturbed data sources.
In particular, we study the problem of learning a target Markov decision process (MDP) from offline datasets sampled from multiple heterogeneously realized source MDPs. Several provably efficient designs are proposed, targeting both tabular and linear MDPs. To the best of our knowledge, this is the first work to propose provably efficient offline RL algorithms that handle perturbed data sources, which can benefit relevant applications and further shed light on the theoretical understanding of offline meta-RL. The contributions are summarized as follows:

• We study a new offline RL problem where the datasets are collected from multiple heterogeneous source MDPs, with possibly different rewards and transition dynamics, instead of directly from the target MDP. Motivated by practical applications, the source MDPs are modeled as random perturbations of the target MDP. Compared with studies of offline RL using data directly from the target MDP (Rashidinejad et al., 2021; Jin et al., 2021b), we face the new challenge of jointly considering uncertainties caused by the finite number of data samples per source (referred to as sample uncertainties) and by the finite number of data sources (referred to as source uncertainties).

• A novel HetPEVI algorithm is proposed, which generalizes the idea of pessimistic value iteration (Jin et al., 2021b) and uses carefully crafted penalty terms to address the sample and source uncertainties simultaneously. Specifically, the specially designed penalty term in HetPEVI contains two parts: one aggregating the sample uncertainties associated with each dataset, and the other compensating for the source uncertainties. Together, these two parts characterize the overall uncertainty associated with the collected datasets.
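To illustrate the structure of such a pessimism-based update, the following is a simplified tabular sketch (the function and coefficient names `c_sample` and `c_source` are placeholders, not the exact bonus terms derived in the paper's analysis): one backup step subtracts a two-part penalty from the empirical Bellman backup before acting greedily.

```python
import numpy as np

def pessimistic_backup(r_hat, p_hat, v_next, counts, num_sources,
                       c_sample=1.0, c_source=1.0):
    """One step of pessimistic value iteration with a two-part penalty.

    r_hat  : empirical rewards pooled over all source datasets, shape (S, A)
    p_hat  : empirical transitions pooled over all datasets, shape (S, A, S)
    counts : pooled visitation counts n(s, a), shape (S, A)

    The penalty combines a sample-uncertainty term (shrinking as the
    pooled counts grow) and a source-uncertainty term (shrinking only
    as the number of sources grows).
    """
    q = r_hat + p_hat @ v_next                       # empirical Bellman backup
    penalty = (c_sample / np.sqrt(np.maximum(counts, 1))
               + c_source / np.sqrt(num_sources))
    q = np.clip(q - penalty, 0.0, None)              # pessimistic Q-estimate
    v = q.max(axis=1)                                # greedy value
    greedy = q.argmax(axis=1)                        # greedy action per state
    return q, v, greedy
```

Note how the source term depends only on `num_sources`: collecting more samples per source cannot shrink it, mirroring the learning cost discussed below.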
• Theoretical analysis proves the effectiveness of HetPEVI and provides a corresponding lower bound, which demonstrates for the first time that even with finitely many randomly perturbed MDPs and finite data samples from each of them, it is feasible to efficiently learn the target MDP. More importantly, the analysis reveals that learning with multiple perturbed data sources brings both costs and benefits. On the one hand, due to the indirect access to the target MDP, an unavoidable learning cost occurs; this cost scales only with the number of data sources and cannot be reduced by increasing the size of the datasets, which highlights the importance of the diversity of data sources. On the other hand, effective learning only requires that the datasets collectively (instead of individually) provide good coverage of the optimal policy, which may provide additional insights for practical data collection.

• Moreover, we extend the study to linear function approximation, where offline data are collected from linear MDPs with a shared feature mapping but heterogeneously realized system dynamics. The HetPEVI-Lin algorithm is developed to jointly consider the sample and source uncertainties while incorporating the linear structure. Theoretical analysis demonstrates the effectiveness of HetPEVI-Lin and verifies the sufficiency of good collective coverage.

Related Works. With the empirical successes of offline RL (Levine et al., 2020), its theoretical understanding has been gradually established in recent years. In particular, the principle of "pessimism" has been incorporated and proved efficient for offline RL (Jin et al., 2021b; Rashidinejad et al., 2021). Following this line, Xie et al. (2021b); Li et al. (2022); Shi et al. (2022) further fine-tune the designs for the tabular setting, and Zanette et al. (2021); Min et al. (2021); Yin et al. (2022); Xiong et al. (2022) for linear MDPs (Jin et al., 2020). These theoretical advances are mainly focused on learning with data directly from the target task.
However, in practical studies of RL, there has been growing interest in utilizing data from heterogeneous sources, e.g., offline meta-RL (Mitchell et al., 2021; Dorfman et al., 2021; Lin et al., 2022; Li et al., 2020b). This work is thus motivated to provide a theoretical understanding of how to extract information about the target task from multiple sources. A more detailed literature review covering both online and offline RL with single or heterogeneous environments is provided in Appendix A.1.

2. PROBLEM FORMULATION

Preliminaries of RL. We consider an RL problem characterized by an episodic MDP M := (H, S, A, P, r). In this tuple, H is the length of each episode, S is the state space, A is the action space, P is the transition kernel such that P_h(s'|s, a) gives the probability of transitioning to state s' when action a is taken in state s at step h, and r_h(s, a) ∈ [0, 1] is the deterministic reward of taking action a in state s at step h.¹ Specifically, in each episode, starting from an initial state s_1, at each step h ∈ [H], the agent observes state s_h ∈ S, picks action a_h ∈ A, receives reward r_h(s_h, a_h), and transitions to the next state s_{h+1} ~ P_h(·|s_h, a_h).
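As a concrete illustration of this interaction protocol (the array-based representation below is a hypothetical sketch, not notation from the paper), one episode of the MDP M = (H, S, A, P, r) can be simulated as:

```python
import numpy as np

def rollout(P, r, policy, s1, H, rng):
    """Simulate one episode of the episodic MDP M = (H, S, A, P, r).

    P[h][s, a] : probability distribution over next states at step h
    r[h][s, a] : deterministic reward in [0, 1]
    policy[h]  : array mapping each state to an action at step h

    Returns the trajectory [(s_h, a_h, r_h), ...] and the total return.
    """
    s, traj, ret = s1, [], 0.0
    for h in range(H):
        a = policy[h][s]                               # pick action a_h
        reward = r[h][s, a]                            # receive r_h(s_h, a_h)
        s_next = rng.choice(len(P[h][s, a]), p=P[h][s, a])  # s_{h+1} ~ P_h
        traj.append((s, a, reward))
        ret += reward
        s = s_next
    return traj, ret
```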



¹ The assumption of deterministic rewards is standard in theoretical analyses of RL (Jin et al., 2018; 2020), as the uncertainties in estimating rewards are dominated by those in estimating transitions.




