VIPeR: PROVABLY EFFICIENT ALGORITHM FOR OFFLINE RL WITH NEURAL FUNCTION APPROXIMATION

Abstract

We propose a novel algorithm for offline reinforcement learning called Value Iteration with Perturbed Rewards (VIPeR), which amalgamates the pessimism principle with random perturbations of the value function. Most current offline RL algorithms explicitly construct statistical confidence regions to obtain pessimism via lower confidence bounds (LCB), which cannot easily scale to complex problems where a neural network is used to estimate the value functions. Instead, VIPeR implicitly obtains pessimism by simply perturbing the offline data multiple times with carefully designed i.i.d. Gaussian noise to learn an ensemble of estimated state-action value functions and acting greedily with respect to the minimum of the ensemble. The estimated state-action values are obtained by fitting a parametric model (e.g., neural networks) to the perturbed datasets using gradient descent. As a result, VIPeR only needs O(1) time complexity for action selection, while LCB-based algorithms require at least Ω(K^2), where K is the total number of trajectories in the offline data. We also propose a novel data-splitting technique that helps remove a factor involving the log of the covering number in our bound. We prove that VIPeR yields a provable uncertainty quantifier with overparameterized neural networks and enjoys a bound on sub-optimality of Õ(κH^{5/2}d/√K), where d is the effective dimension, H is the horizon length, and κ measures the distributional shift. We corroborate the statistical and computational efficiency of VIPeR with an empirical evaluation on a wide set of synthetic and real-world datasets. To the best of our knowledge, VIPeR is the first algorithm for offline RL that is provably efficient for general Markov decision processes (MDPs) with neural network function approximation.

1. INTRODUCTION

Offline reinforcement learning (offline RL) (Lange et al., 2012; Levine et al., 2020) is a practical paradigm of RL for domains where active exploration is not permissible. Instead, the learner has access to a fixed dataset of previous experiences available a priori. Offline RL finds applications in several critical domains where exploration is prohibitively expensive or even implausible, including healthcare (Gottesman et al., 2019; Nie et al., 2021), recommendation systems (Strehl et al., 2010; Thomas et al., 2017), and econometrics (Kitagawa & Tetenov, 2018; Athey & Wager, 2021), among others. The recent surge of interest in this area and renewed research efforts have yielded several important empirical successes (Chen et al., 2021; Wang et al., 2023; 2022; Meng et al., 2021).

A key challenge in offline RL is to efficiently exploit the given offline dataset to learn an optimal policy in the absence of any further exploration. The dominant approaches address this challenge by incorporating uncertainty from the offline dataset into decision-making (Buckman et al., 2021; Jin et al., 2021; Xiao et al., 2021; Nguyen-Tang et al., 2022a; Ghasemipour et al., 2022; An et al., 2021; Bai et al., 2022). The main component of these uncertainty-aware approaches is the pessimism principle, which constrains the learned policy to the offline data and leads to various lower confidence bound (LCB)-based algorithms. However, these methods are not easily extended or scaled to complex problems where neural function approximation is used to estimate the value functions. In particular, it is costly to explicitly compute the statistical confidence regions of the model or value functions when the class of function approximators consists of overparameterized neural networks. For example, constructing the LCB for neural offline contextual bandits (Nguyen-Tang et al., 2022a) and RL (Xu & Liang, 2022) requires computing the inverse of a large covariance matrix whose size scales with the number of parameters in the neural network. This computational cost hinders the practical application of these provably efficient offline RL algorithms. Therefore, a largely open question is how to design provably computationally efficient algorithms for offline RL with neural network function approximation.

In this work, we present a solution based on a computational approach that combines the pessimism principle with randomizing the value function (Osband et al., 2016; Ishfaq et al., 2021). The algorithm is strikingly simple: we randomly perturb the offline rewards several times and act greedily with respect to the minimum of the estimated state-action values. The intuition is that taking the minimum of an ensemble of randomized state-action values can efficiently achieve pessimism with high probability while avoiding explicit computation of statistical confidence regions. We learn the state-action value function by training a neural network using gradient descent (GD). Further, we propose a novel data-splitting technique that removes the dependence on the potentially large log covering number in the learning bound. We show that the proposed algorithm yields a provable uncertainty quantifier with overparameterized neural network function approximation and achieves a sub-optimality bound of Õ(κH^{5/2}d/√K), where K is the total number of episodes in the offline data, d is the effective dimension, H is the horizon length, and κ measures the distributional shift.
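The data-splitting idea can be pictured as follows. This is an illustrative sketch under our own assumptions (the fold layout and the helper name split_indices are hypothetical, not the paper's exact scheme): since the value function is estimated by backward induction with one regression per step h, partitioning the K trajectories into H disjoint folds and letting the step-h regression see only fold h keeps each regression target, built from the step-(h+1) estimate, statistically independent of the data it is fit on, which is plausibly what avoids the uniform-convergence argument that introduces the log covering number.

```python
# Hypothetical illustration of the data-splitting idea: partition the
# K offline trajectories into H disjoint folds, one per backward-
# induction step. Fold sizes and index layout are illustrative only.
def split_indices(K: int, H: int) -> list[list[int]]:
    fold_size = K // H
    return [list(range(h * fold_size, (h + 1) * fold_size)) for h in range(H)]

folds = split_indices(K=1000, H=20)   # folds[h] indexes the step-h data
```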
We achieve computational efficiency since the proposed algorithm needs only O(1) time complexity for action selection, while LCB-based algorithms require at least Ω(K^2) time complexity. We empirically corroborate the statistical and computational efficiency of our proposed algorithm on a wide set of synthetic and real-world datasets. The experimental results show that the proposed algorithm has a strong advantage in computational efficiency while outperforming LCB-based neural algorithms. To the best of our knowledge, ours is the first offline RL algorithm that is both provably efficient and computationally efficient in general MDPs with neural network function approximation.
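To make the overall procedure concrete, below is a minimal sketch of the perturb-then-pessimize recipe in PyTorch. It is an illustration under simplifying assumptions rather than the paper's implementation: the names (QNet, fit_perturbed_ensemble, act_greedy_pessimistic) and hyperparameters (ensemble size M, noise scale sigma) are our own, the sketch fits a single regression target instead of running backward induction over the horizon, and it omits the data splitting sketched above.

```python
# A minimal sketch of the perturb-then-pessimize recipe, not the paper's
# implementation. QNet, fit_perturbed_ensemble, act_greedy_pessimistic,
# M, and sigma are illustrative names/values; a faithful version would
# run backward induction over the H steps and use disjoint data folds.
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP Q(s, a); stands in for the overparameterized network."""
    def __init__(self, state_dim: int, action_dim: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, width), nn.ReLU(),
            nn.Linear(width, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def fit_perturbed_ensemble(states, actions, targets, M=10, sigma=0.1,
                           steps=200, lr=1e-2):
    """Train M networks, each on targets perturbed by fresh i.i.d.
    Gaussian noise; the noise is the implicit source of pessimism."""
    ensemble = []
    for _ in range(M):
        noise = sigma * torch.randn_like(targets)        # new draw per member
        qnet = QNet(states.shape[-1], actions.shape[-1])
        opt = torch.optim.SGD(qnet.parameters(), lr=lr)  # plain gradient descent
        for _ in range(steps):
            loss = ((qnet(states, actions) - (targets + noise)) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        ensemble.append(qnet)
    return ensemble

def act_greedy_pessimistic(ensemble, state, candidate_actions):
    """Greedy action w.r.t. the elementwise minimum over the ensemble;
    the cost does not depend on the number of offline trajectories K."""
    s = state.unsqueeze(0).expand(candidate_actions.shape[0], -1)
    with torch.no_grad():
        q = torch.stack([qnet(s, candidate_actions) for qnet in ensemble])
    return candidate_actions[q.min(dim=0).values.argmax()]
```

Note that action selection here requires only M forward passes and an elementwise minimum, with cost independent of the dataset size K; this is the source of the O(1) action-selection claim, whereas an LCB-based method must additionally evaluate an explicit bonus derived from a data covariance structure.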

2. RELATED WORK

Randomized value functions for RL. For online RL, Osband et al. (2016; 2019) were the first to explore randomization of estimates of the value function for exploration. Their approach was inspired by posterior sampling for RL (Osband et al., 2013), which samples a value function from a posterior distribution and acts greedily with respect to the sampled function. Concretely, Osband et al. (2016; 2019) generate randomized value functions by injecting Gaussian noise into the training data and fitting a model on the perturbed data. Jia et al. (2022) extended the idea of perturbing rewards to online contextual bandits with neural function approximation. Ishfaq et al. (2021) obtained a provably efficient method for online RL with general function approximation using perturbed rewards. While randomizing the value function is an intuitive approach to obtaining optimism in online RL, obtaining pessimism from randomized value functions can be tricky in offline RL. Indeed, Ghasemipour et al. (2022) point out a critical flaw in several popular existing methods for offline RL that update an ensemble of randomized Q-networks toward a shared pessimistic temporal difference target. In this paper, we propose a simple fix to obtain pessimism properly by updating each randomized value function independently and taking the minimum over an ensemble of randomized value functions to form a pessimistic value function.
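The distinction can be made concrete with a toy example. The snippet below is a simplified sketch with hypothetical shapes, and a discount factor standing in for the finite-horizon backup; it contrasts the shared pessimistic target that Ghasemipour et al. (2022) identify as flawed with the independent perturbed targets described above.

```python
# Toy contrast between a shared pessimistic TD target and independent
# perturbed targets. Shapes, gamma, and sigma are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
M, n = 10, 5                       # ensemble size, batch size (illustrative)
gamma, sigma = 0.99, 0.1           # discount stand-in, noise scale
rewards = rng.normal(size=n)
next_q = rng.normal(size=(M, n))   # each member's own next-state estimate

# Flawed pattern (Ghasemipour et al., 2022): every member regresses
# toward one shared pessimistic target, so the ensemble collapses and
# its spread no longer reflects uncertainty.
shared_target = rewards + gamma * next_q.min(axis=0)   # identical for all M

# The fix: each member gets its own target, built from its own bootstrap
# estimate plus fresh i.i.d. noise; the minimum is taken only at
# decision time, never inside the regression target.
noise = sigma * rng.normal(size=(M, n))
independent_targets = rewards[None, :] + gamma * next_q + noise
```

Because each member regresses toward its own bootstrapped target, the ensemble retains its spread, and the minimum taken at decision time remains a meaningful pessimistic estimate.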

Offline RL with function approximation. Provably efficient offline RL has been studied extensively for linear function approximation. Jin et al. (2021) were the first to show that pessimistic value iteration is provably efficient for offline linear MDPs. Xiong et al. (2023); Yin et al. (2022) improved upon Jin et al. (2021) by leveraging variance reduction. Xie et al. (2021) proposed a Bellman-consistency assumption with general function approximation, which improves the bound of Jin et al. (2021) by a factor of √d when specialized to finite action spaces and linear MDPs. Wang et al. (2021); Zanette (2021) studied the statistical hardness of offline RL with linear function approximation via exponential lower bounds, and Foster et al. (2021) showed that realizability and strong uniform data coverage alone are not sufficient for sample-efficient offline RL. Beyond linearity, some works study offline RL with general function approximation, both parametric and nonparametric. These approaches are either based on Fitted-Q Iteration (FQI) (Munos & Szepesvári, 2008; Le

