WHAT ARE THE STATISTICAL LIMITS OF OFFLINE RL WITH LINEAR FUNCTION APPROXIMATION?

Abstract

Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning, coupled with function approximation methods (to deal with the curse of dimensionality), can help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, as the literature largely consists of sufficient conditions. This work focuses on the basic question of what representational and distributional conditions are necessary to permit provably sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: 1) we have realizability, in that the true value function of every policy is linear in a given set of features, and 2) our off-policy data has good coverage over all features (under a strong spectral condition), any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon to nontrivially estimate the value of any given policy. Our results highlight that sample-efficient offline policy evaluation is not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).

1. INTRODUCTION

Offline methods (also known as off-policy methods or batch methods) are a promising methodology to alleviate the sample complexity burden in challenging reinforcement learning (RL) settings, particularly those where sample efficiency is paramount (Mandel et al., 2014; Gottesman et al., 2018; Wang et al., 2018; Yu et al., 2019). Off-policy methods are often applied together with function approximation schemes; such methods take sample transition data and reward values as inputs, and approximate the value of a target policy or the value function of the optimal policy. Indeed, many practical deep RL algorithms find their prototypes in the literature of offline RL. For example, when running on off-policy data (sometimes termed "experience replay"), deep Q-networks (DQN) (Mnih et al., 2015) can be viewed as an analog of Fitted Q-Iteration (Gordon, 1999) with neural networks as the function approximators. More recently, there have been an increasing number of both model-free (Laroche et al., 2019; Fujimoto et al., 2019; Jaques et al., 2020; Kumar et al., 2019; Agarwal et al., 2020) and model-based (Ross & Bagnell, 2012; Kidambi et al., 2020) offline RL methods, with steady improvements in performance (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2020; Kidambi et al., 2020). However, despite the importance of these methods, the extent to which data reuse is possible, especially when off-policy methods are combined with function approximation, is not well understood. For example, deep Q-networks require millions of samples to solve certain Atari games (Mnih et al., 2015). Also important is that in some safety-critical settings, we seek guarantees on when offline-trained policies will be effective (Thomas, 2014; Thomas et al., 2019). A basic question here is whether there are fundamental statistical limits on such methods, under which sample-efficient offline RL is simply not possible without further restrictions on the problem.
In the context of supervised learning, it is well known that empirical risk minimization is sample-efficient if the hypothesis class has bounded complexity. For example, suppose the agent is given a d-dimensional feature extractor, and the ground truth labeling function is a (realizable) linear function with respect to the feature mapping. Here, it is well known that a number of samples polynomial in d suffices for a given target accuracy. Furthermore, in this realizable case, provided the training data has good feature coverage, we will have good accuracy against any test distribution.1

In the more challenging offline RL setting, it is unclear if sample-efficient methods are possible, even under analogous assumptions. This is our motivation to consider the following question:

What are the statistical limits for offline RL with linear function approximation?

Here, one may hope that value estimation for a given policy is possible in the offline RL setting under the analogous set of assumptions that enable sample-efficient supervised learning, i.e., 1) (realizability) the features can perfectly represent the value functions, and 2) (good coverage) the feature covariance matrix of our off-policy data has lower bounded eigenvalues. The extant body of provable methods on offline RL either makes representational assumptions that are far stronger than realizability or assumes distribution shift conditions that are far stronger than having coverage with regards to the spectrum of the feature covariance matrix of the data distribution. For example, Szepesvári & Munos (2005) analyze offline RL methods by assuming a representational condition where the features satisfy (approximate) closedness under Bellman updates, which is a far stronger representation condition than realizability. Recently, Xie & Jiang (2020a) proposed an offline RL algorithm that only requires realizability as the representation condition.
However, the algorithm in (Xie & Jiang, 2020a) requires a more stringent data distribution condition. Whether it is possible to design a sample-efficient offline RL method under the realizability assumption and a reasonable data coverage assumption (an open problem in (Chen & Jiang, 2019)) is the focus of this work.

Our Contributions. Perhaps surprisingly, our main result shows that, under only the above two assumptions, it is information-theoretically not possible to design a sample-efficient algorithm that non-trivially estimates the value of a given policy. The following theorem is an informal version of the result in Section 4.

Theorem 1.1 (Informal). In the offline RL setting, suppose the data distributions have (polynomially) lower bounded eigenvalues, and the Q-functions of every policy are linear with respect to a given feature mapping. Then any algorithm requires a number of samples exponential in the horizon H to output a non-trivially accurate estimate of the value of any given policy π, with constant probability.

This hardness result states that even if the Q-functions of all policies are linear with respect to the given feature mapping, we still require an exponential number of samples to evaluate any given policy. Note that this representation condition is significantly stronger than assuming realizability with regards to only a single target policy; it assumes realizability for all policies. Regardless, even under this stronger representation condition, it is hard to evaluate any policy, as specified in our hardness result. This result also formalizes a key issue in offline reinforcement learning with function approximation: geometric error amplification. To better illustrate this issue, in Section 5 we analyze the classical Least-Squares Policy Evaluation (LSPE) algorithm under the realizability assumption, which demonstrates how the error propagates as the algorithm proceeds.
Here, our analysis shows that if we rely only on the realizability assumption, then a far more stringent condition is required for sample-efficient offline policy evaluation: the off-policy data distribution must be quite close to the distribution induced by the policy to be evaluated.
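To make geometric error amplification concrete, the following toy sketch (our own illustration with assumed constants, not the LSPE algorithm or the paper's lower-bound construction) unrolls the one-step error recursion err_h ≈ kappa * err_{h+1} + eps, where eps models the per-step statistical error and the hypothetical factor kappa models how much distribution shift amplifies the bootstrapped error at each backup. With kappa > 1, the final error is exponential in the horizon H.

```python
# Toy model of error propagation in approximate dynamic programming:
#   err[h] <= kappa * err[h+1] + eps,
# where eps is the per-step statistical error and kappa (an assumed,
# illustrative quantity) captures amplification due to distribution shift.

def propagated_error(kappa: float, eps: float, H: int) -> float:
    """Unroll the error recursion over H backup steps."""
    err = 0.0
    for _ in range(H):
        err = kappa * err + eps
    return err

H, eps = 50, 1e-3
low_shift = propagated_error(kappa=1.0, eps=eps, H=H)
high_shift = propagated_error(kappa=2.0, eps=eps, H=H)

print(low_shift)   # grows only linearly in H (about H * eps)
print(high_shift)  # grows like 2**H * eps: exponential in the horizon
```

The contrast between the two settings mirrors the dichotomy above: when the off-policy data distribution is close to the target policy's distribution (kappa near 1), error accumulates additively; under substantial shift, it compounds geometrically.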



1 Specifically, if the features have a uniformly bounded norm and if the minimum eigenvalue of the feature covariance matrix of our data is bounded away from 0, say by 1/poly(d), then we have good accuracy on any test distribution. See Assumption and the comments thereafter.
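As a quick numerical illustration of this claim (a hedged sketch with illustrative dimensions and constants, not part of the paper), the following fits ordinary least squares to a realizable linear target and checks that the empirical feature covariance has its minimum eigenvalue bounded away from 0, in which case the learned weights, and hence predictions under any test distribution, are accurate.

```python
import numpy as np

# Illustrative sketch: realizable linear regression under good coverage.
# All constants (d, n, noise level, eigenvalue threshold) are assumptions
# chosen for the example, not values from the paper.

rng = np.random.default_rng(0)
d, n = 5, 10_000

theta_star = rng.normal(size=d)                 # realizable linear ground truth
X = rng.normal(size=(n, d))                     # training features (good coverage)
y = X @ theta_star + 0.1 * rng.normal(size=n)   # noisy labels

# Coverage condition: minimum eigenvalue of the empirical feature
# covariance matrix is bounded away from 0.
cov = X.T @ X / n
min_eig = np.linalg.eigvalsh(cov).min()
assert min_eig > 1e-2

# Ordinary least squares estimate.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The parameter error is controlled by the eigenvalue lower bound, so
# predictions are accurate on any test distribution with bounded features.
print(np.linalg.norm(theta_hat - theta_star))   # small; shrinks as n grows
```

The hardness result in the paper shows that this supervised-learning guarantee has no analog for multi-step offline policy evaluation, even under the corresponding realizability and coverage assumptions.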

