PESSIMISM IN THE FACE OF CONFOUNDERS: PROVABLY EFFICIENT OFFLINE REINFORCEMENT LEARNING IN PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES

Abstract

We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that P3O achieves an n^{-1/2}-suboptimality, where n is the number of trajectories in the dataset. To the best of our knowledge, P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.

1. INTRODUCTION

Offline reinforcement learning (RL) (Sutton and Barto, 2018) aims to learn an optimal policy of a sequential decision making problem purely from an offline dataset collected a priori, without any further interactions with the environment. Offline RL is particularly pertinent to applications in critical domains such as precision medicine (Gottesman et al., 2019) and autonomous driving (Shalev-Shwartz et al., 2016). In these scenarios, interacting with the environment via online experiments might be risky, slow, or even unethical. But oftentimes offline datasets consisting of past interactions, e.g., treatment records for precision medicine (Chakraborty and Moodie, 2013; Chakraborty and Murphy, 2014) and human driving data for autonomous driving (Sun et al., 2020), are adequately available. As a result, offline RL has attracted substantial research interest recently (Levine et al., 2020). Most of the existing works on offline RL develop algorithms and theory for the model of Markov decision processes (MDPs). However, in many real-world applications, due to privacy concerns or limitations of the sensor apparatus, the states of the environment cannot be directly stored in the offline datasets. Instead, only partial observations generated from the states of the environment are stored (Dulac-Arnold et al., 2021). For example, in precision medicine, a physician's treatment might consciously or subconsciously depend on the patient's mood and socioeconomic status (Zhang and Bareinboim, 2016), which are not recorded in the data due to privacy concerns. As another example, in autonomous driving, a human driver makes decisions based on multimodal information about the environment that is not limited to visual and auditory inputs, but only observations captured by LIDARs and cameras are stored in the datasets (Sun et al., 2020).
In light of the partial observations in the datasets, these situations are better modeled as partially observable Markov decision processes (POMDPs) (Lovejoy, 1991). Existing offline RL methods for MDPs, which fail to handle partial observations, are thus not applicable. In this work, we take the initial step towards studying offline RL in POMDPs where the datasets only contain partial observations of the states. In particular, motivated by the aforementioned real-world applications, we consider the case where the behavior policy takes actions based on the states of the environment, which are not part of the dataset and thus are latent variables. Instead, the trajectories in the dataset consist of partial observations emitted from the latent states, as well as the actions and rewards. For such a dataset, our goal is to learn an optimal policy in the context of general function approximation. Offline RL in POMDPs suffers from several challenges. First of all, it is known that both planning and estimation in POMDPs are intractable in the worst case (Papadimitriou and Tsitsiklis, 1987; Burago et al., 1996; Goldsmith and Mundhenk, 1998; Mundhenk et al., 2000; Vlassis et al., 2012). Thus, we have to identify a set of sufficient conditions that warrants efficient offline RL. More importantly, our problem faces the unique challenge of the confounding issue caused by the latent states, which does not appear in either online or offline MDPs, or in online POMDPs. In particular, both the actions and observations in the offline dataset depend on the unobserved latent states, and thus are confounded (Pearl, 2009).

[Figure 1: causal graph of the confounded POMDP, with latent states S_{h-1}, S_h, S_{h+1}, observation O_h, and rewards R_{h-1}, R_h; each latent state affects both the action and the observation.]

Such a confounding issue is illustrated by the causal graph in Figure 1. As a result, directly applying offline RL methods for MDPs will nevertheless incur a considerable confounding bias. Besides, since the latent states evolve according to the Markov transition kernel, the causal structure is dynamic, which makes the confounding issue more challenging to handle than that in static causal problems. Furthermore, apart from the confounding issue, since we aim to learn the optimal policy, our algorithm also needs to handle the distributional shift between the trajectories induced by the behavior policy and the family of target policies. Finally, to handle large observation spaces, we need to employ powerful function approximators. As a result, the coupled challenges due to (i) the confounding bias, (ii) the distributional shift, and (iii) large observation spaces that are distinctive in our problem necessitate new algorithm design and theory. To this end, by leveraging tools from proximal causal inference (Lipsitch et al., 2010; Tchetgen et al., 2020; Miao et al., 2018a;b), we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which provably addresses the challenges of the confounding bias and the distributional shift in the context of general function approximation. Specifically, we focus on a benign class of POMDPs where the causal structure involving the latent states can be captured by the past and current observations, which serve as the negative control action and negative control outcome, respectively (Miao et al., 2018a;b; Cui et al., 2020; Singh, 2020; Kallus et al., 2021; Bennett and Kallus, 2021; Shi et al., 2021). Then the value of each policy can be identified by a set of confounding bridge functions corresponding to that policy, which satisfy a sequence of backward moment equations that are similar to the celebrated Bellman equations in classical RL (Bellman and Kalaba, 1965).
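To make the proximal construction concrete, the following display sketches the static version of proximal causal inference that the confounding bridge functions generalize; it is an illustrative schematic with generic variables (Y, A, Z, W), not the paper's exact POMDP moment equations.

```latex
% Static proximal causal inference (cf. Miao et al., 2018a;b):
% Z is a negative control action and W a negative control outcome,
% both driven by the latent confounder U but with no direct causal
% link to the outcome Y.  An outcome bridge function h solves the
% conditional moment equation
\mathbb{E}\left[\, Y - h(W, a) \;\middle|\; Z,\, A = a \,\right] = 0,
% and, under suitable completeness conditions, identifies the
% causal effect of the action:
\mathbb{E}\left[\, Y \mid \mathrm{do}(A = a) \,\right]
  = \mathbb{E}\left[\, h(W, a) \,\right].
% In the POMDP setting, a sequence of bridge functions b_h^{\pi}
% satisfies analogous backward moment equations coupling b_h^{\pi}
% to b_{h+1}^{\pi}, in the spirit of the Bellman equations.
```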
Thus, by estimating these confounding bridge functions from offline data, we can estimate the value of each policy without incurring the confounding bias. More concretely, P3O involves two components: policy evaluation via minimax estimation and policy optimization via pessimism. Specifically, to tackle the distributional shift, P3O returns the policy that maximizes pessimistic estimates of the values obtained by policy evaluation. Meanwhile, in policy evaluation, to ensure pessimism, we construct a coupled sequence of confidence regions for the confounding bridge functions via minimax estimation, using function approximators. Furthermore, under a partial coverage assumption on the confounded dataset, we prove that P3O achieves an Õ(H·√(log(N_fun)/n)) suboptimality, where n is the number of trajectories, H is the length of each trajectory, N_fun stands for the complexity of the employed function classes (e.g., the covering number), and Õ(·) hides logarithmic factors. When specified to linear function classes, the suboptimality of
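As a toy numerical illustration of how moment-based estimation removes confounding bias, the following sketch simulates a static confounded problem with proxy variables and compares a naive regression against a proximal moment estimate. With linear function classes the minimax problem collapses to a small linear system, which is all this sketch solves; the data-generating process and every variable name here are invented for illustration and are not part of the P3O algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Static analogue of the confounded setting: latent confounder U
# affects both the action A and the outcome Y.  Z plays the role of
# a negative control action and W of a negative control outcome.
U = rng.normal(size=n)                      # latent confounder (unobserved)
Z = U + rng.normal(size=n)                  # proxy observed before the action
A = 0.8 * U + rng.normal(size=n)            # behavior action depends on U
W = U + rng.normal(size=n)                  # negative control outcome
Y = 1.5 * A + 2.0 * U + rng.normal(size=n)  # true causal effect of A is 1.5

# Naive regression of Y on A is confounded: it absorbs the effect of U.
naive = (A @ Y) / (A @ A)

# Proximal moment estimate: solve E[(Y - b*A - c*W) * g] = 0 for the
# test functions g in {Z, A}; with linear classes this is a 2x2 system.
G = np.column_stack([Z, A])                 # test-function features
X = np.column_stack([A, W])                 # bridge-function features
theta = np.linalg.solve(G.T @ X, G.T @ Y)
proximal = theta[0]                         # estimated causal coefficient of A

print(f"naive:    {naive:.2f}")             # biased away from 1.5
print(f"proximal: {proximal:.2f}")          # close to 1.5
```

In P3O itself, the analogous minimax estimation is run backward over the horizon to fit the confounding bridge functions, and the resulting confidence regions drive the pessimistic policy choice.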

