OFFLINE REINFORCEMENT LEARNING FROM RANDOMLY PERTURBED DATA SOURCES

Abstract

Most existing offline reinforcement learning (RL) studies assume the available dataset is sampled directly from the target task. However, in some practical applications, the available data often come from several related but heterogeneous environments. A theoretical understanding of efficient learning from heterogeneous offline datasets remains lacking. In this work, we study the problem of offline RL based on multiple data sources that are randomly perturbed versions of the target Markov decision process (MDP). A novel HetPEVI algorithm is first proposed, which simultaneously considers two types of uncertainties: sample uncertainties from a finite number of data samples per data source, and source uncertainties due to a finite number of data sources. In particular, the sample uncertainties from all data sources are jointly aggregated, while an additional penalty term is specially constructed to compensate for the source uncertainties. Theoretical analysis demonstrates the near-optimal performance of HetPEVI. More importantly, the costs and benefits of learning with randomly perturbed data sources are explicitly characterized: on one hand, an unavoidable performance loss occurs due to the indirect access to the target MDP; on the other hand, efficient learning is achievable as long as the sources collectively (instead of individually) provide good data coverage. Finally, we extend the study to linear function approximation and propose the HetPEVI-Lin algorithm, which provides additional efficiency guarantees beyond the tabular case.
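The pessimism principle described above, aggregating pooled sample counts while subtracting a separate penalty for having only finitely many sources, can be illustrated with a minimal tabular sketch. This is not the paper's exact HetPEVI algorithm: the function name `het_pevi_sketch` and the constants `c1`, `c2` are placeholders standing in for the theory-driven bonus terms, and rewards are assumed known and bounded in [0, 1].

```python
import numpy as np

def het_pevi_sketch(counts, rewards, H, M, c1=1.0, c2=1.0):
    """Illustrative pessimistic value iteration over pooled heterogeneous data.

    counts:  array (H, S, A, S) of next-state counts pooled over all M sources.
    rewards: array (H, S, A) of known rewards in [0, 1].
    Returns pessimistic Q-estimates of shape (H, S, A).
    """
    _, S, A, _ = counts.shape
    V = np.zeros(S)                          # V_{H+1} = 0
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        n = counts[h].sum(axis=-1)           # pooled visit counts n_h(s, a)
        # Empirical transition model; uniform fallback for unvisited pairs.
        p_hat = np.where(n[..., None] > 0,
                         counts[h] / np.maximum(n, 1)[..., None],
                         1.0 / S)
        q = rewards[h] + p_hat @ V           # empirical Bellman backup
        # Penalty 1: shrinks with the total pooled sample count (sample uncertainty).
        sample_pen = c1 * H * np.sqrt(1.0 / np.maximum(n, 1))
        # Penalty 2: shrinks only with the number of sources (source uncertainty).
        source_pen = c2 * H / np.sqrt(M)
        Q[h] = np.clip(q - sample_pen - source_pen, 0.0, H - h)
        V = Q[h].max(axis=-1)                # greedy pessimistic value
    return Q
```

The key structural point the sketch captures is that the source penalty does not vanish as per-source sample sizes grow; it decays only in `M`, mirroring the abstract's claim that indirect access to the target MDP incurs an unavoidable loss unless more sources are available.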

1. INTRODUCTION

Offline reinforcement learning (RL) (Levine et al., 2020), a.k.a. batch RL (Lange et al., 2012), has received growing interest in recent years. It aims at training RL agents using datasets collected a priori and thus avoids expensive online interactions. Along with its tremendous empirical successes (Kidambi et al., 2020), recent studies have also established theoretical understandings of offline RL (Rashidinejad et al., 2021; Jin et al., 2021b; Duan et al., 2021; Uehara & Sun, 2021). Despite these advances, the majority of offline RL research focuses on learning via data collected exactly from the target task environment (Kumar et al., 2020). However, in practice, it is difficult to ensure that all such data come perfectly from one source environment. Instead, in many cases, it is more reasonable to assume that data are collected from different sources that are perturbed versions of the target task. For example, when training a chatbot (Jaques et al., 2020), the offline dialogue datasets typically consist of short conversations between different people, who naturally have varying language habits. The training objective is the common underlying language structure, e.g., basic grammar, which cannot be reflected in any individual dialogue but must be learned holistically from their aggregation. More examples can be found in healthcare (Tang & Wiens, 2021), autonomous driving (Sallab et al., 2017) and others. While a few empirical investigations under the offline meta-RL framework have been reported (Dorfman et al., 2021; Lin et al., 2022; Mitchell et al., 2021), theoretical understandings of effectively and efficiently learning the underlying task using datasets from multiple heterogeneous sources are largely lacking. Motivated by these practical needs and theoretical gaps, this work makes progress on the underexplored RL problem of learning the target task using data from heterogeneously perturbed data sources.
In particular, we study the problem of learning a target Markov decision process (MDP) 1

