PESSIMISM IN THE FACE OF CONFOUNDERS: PROVABLY EFFICIENT OFFLINE REINFORCEMENT LEARNING IN PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES

Abstract

We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that P3O achieves an n^{-1/2}-suboptimality, where n is the number of trajectories in the dataset. To the best of our knowledge, P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
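For readers unfamiliar with proximal causal inference, the minimax estimation mentioned above can be illustrated schematically. The following display is a generic sketch, not the paper's exact objective: in proximal causal inference one posits an outcome-inducing proxy W and an action-inducing proxy Z, and estimates a bridge function h satisfying a conditional moment restriction E[Y - h(W, A) | Z, A] = 0, which is typically solved by a regularized minimax program over function classes H and F:

\[
\hat{h} \;=\; \operatorname*{arg\,min}_{h \in \mathcal{H}} \; \max_{f \in \mathcal{F}} \; \widehat{\mathbb{E}}_n\!\big[ f(Z, A)\,\big( Y - h(W, A) \big) \big] \;-\; \lambda \, \| f \|_{\mathcal{F}}^2,
\]

where \(\widehat{\mathbb{E}}_n\) denotes the empirical average over the n offline trajectories and \(\lambda > 0\) is a regularization parameter. The inner maximization over the test-function class F enforces the conditional moment restriction, while the regularizer stabilizes the estimate; all symbols here (W, Z, h, f, λ) are illustrative notation rather than the paper's own.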

1. INTRODUCTION

Offline reinforcement learning (RL) (Sutton and Barto, 2018) aims to learn an optimal policy of a sequential decision making problem purely from an offline dataset collected a priori, without any further interactions with the environment. Offline RL is particularly pertinent to applications in critical domains such as precision medicine (Gottesman et al., 2019) and autonomous driving (Shalev-Shwartz et al., 2016). In these scenarios, interacting with the environment via online experiments might be risky, slow, or even unethical. But oftentimes offline datasets consisting of past interactions, e.g., treatment records for precision medicine (Chakraborty and Moodie, 2013; Chakraborty and Murphy, 2014) and human driving data for autonomous driving (Sun et al., 2020), are readily available. As a result, offline RL has attracted substantial research interest recently (Levine et al., 2020). Most of the existing works on offline RL develop algorithms and theory for Markov decision processes (MDPs). However, in many real-world applications, due to privacy concerns or limitations of the sensor apparatus, the states of the environment cannot be directly stored in the offline datasets. Instead, only partial observations generated from the states of the environment are stored (Dulac-Arnold et al., 2021). For example, in precision medicine, a physician's treatment might consciously or subconsciously depend on the patient's mood and socioeconomic status (Zhang and Bareinboim, 2016), which are not recorded in the data due to privacy concerns. As another example, in autonomous driving, a human driver makes decisions based on multimodal information of the environment that is not limited to visual and auditory inputs, but only the observations captured by LIDARs and cameras are stored in the datasets (Sun et al., 2020).
In light of the partial observations in the datasets, these situations are better modeled as partially observable Markov decision processes (POMDPs) (Lovejoy, 1991). Existing offline RL methods for MDPs, which fail to handle partial observations, are thus not applicable.

