VIPER: PROVABLY EFFICIENT ALGORITHM FOR OFFLINE RL WITH NEURAL FUNCTION APPROXIMATION

Abstract

We propose a novel algorithm for offline reinforcement learning called Value Iteration with Perturbed Rewards (VIPeR), which amalgamates the pessimism principle with random perturbations of the value function. Most current offline RL algorithms explicitly construct statistical confidence regions to obtain pessimism via lower confidence bounds (LCB), which cannot easily scale to complex problems where a neural network is used to estimate the value functions. Instead, VIPeR implicitly obtains pessimism by simply perturbing the offline data multiple times with carefully designed i.i.d. Gaussian noise to learn an ensemble of estimated state-action value functions and acting greedily with respect to the minimum of the ensemble. The estimated state-action values are obtained by fitting a parametric model (e.g., neural networks) to the perturbed datasets using gradient descent. As a result, VIPeR only needs O(1) time complexity for action selection, while LCB-based algorithms require at least Ω(K^2), where K is the total number of trajectories in the offline data. We also propose a novel data-splitting technique that helps remove a factor involving the log of the covering number in our bound. We prove that VIPeR yields a provable uncertainty quantifier with overparameterized neural networks and enjoys a sub-optimality bound of Õ(κH^{5/2}d/√K), where d is the effective dimension, H is the horizon length, and κ measures the distributional shift. We corroborate the statistical and computational efficiency of VIPeR with an empirical evaluation on a wide set of synthetic and real-world datasets. To the best of our knowledge, VIPeR is the first algorithm for offline RL that is provably efficient for general Markov decision processes (MDPs) with neural network function approximation.
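To make the perturbed-ensemble idea above concrete, the following is a minimal sketch of the perturb-then-fit step (not the authors' reference implementation): the regression targets are perturbed with independent Gaussian noise once per ensemble member, a value network is fit to each perturbed copy by gradient descent, and pessimism is obtained implicitly via the ensemble minimum. The `QNet` architecture and the hyperparameters `num_ensemble`, `sigma`, `epochs`, and `lr` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP state-action value estimator (illustrative architecture)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def fit_perturbed_ensemble(states, actions, targets, num_ensemble=10,
                           sigma=0.1, epochs=50, lr=1e-3):
    """Fit one Q-network per independently Gaussian-perturbed copy of the targets."""
    ensemble = []
    for _ in range(num_ensemble):
        noise = sigma * torch.randn_like(targets)      # i.i.d. Gaussian perturbation
        q = QNet(states.shape[-1], actions.shape[-1])
        opt = torch.optim.SGD(q.parameters(), lr=lr)   # full-batch gradient descent
        for _ in range(epochs):
            loss = ((q(states, actions) - (targets + noise)) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        ensemble.append(q)
    return ensemble

def pessimistic_q(ensemble, s, a):
    """Implicit pessimism: take the ensemble minimum instead of an explicit LCB."""
    with torch.no_grad():
        return torch.min(torch.stack([q(s, a) for q in ensemble]), dim=0).values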

1. INTRODUCTION

Offline reinforcement learning (offline RL) (Lange et al., 2012; Levine et al., 2020) is a practical paradigm of RL for domains where active exploration is not permissible. Instead, the learner can access a fixed dataset of previous experiences available a priori. Offline RL finds applications in several critical domains where exploration is prohibitively expensive or even implausible, including healthcare (Gottesman et al., 2019; Nie et al., 2021), recommendation systems (Strehl et al., 2010; Thomas et al., 2017), and econometrics (Kitagawa & Tetenov, 2018; Athey & Wager, 2021), among others. The recent surge of interest in this area and renewed research efforts have yielded several important empirical successes (Chen et al., 2021; Wang et al., 2023; 2022; Meng et al., 2021). A key challenge in offline RL is to efficiently exploit the given offline dataset to learn an optimal policy in the absence of any further exploration. The dominant approaches to offline RL address this challenge by incorporating uncertainty from the offline dataset into decision-making (Buckman et al., 2021; Jin et al., 2021; Xiao et al., 2021; Nguyen-Tang et al., 2022a; Ghasemipour et al., 2022; An et al., 2021; Bai et al., 2022). The main component of these uncertainty-aware approaches is the pessimism principle, which constrains the learned policy to the support of the offline data and leads to various lower confidence bound (LCB)-based algorithms. However, these methods are not easily extended or scaled to complex problems where neural function approximation is used to estimate the value functions.
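For contrast with VIPeR's ensemble minimum, explicit LCB-style pessimism subtracts an uncertainty bonus computed from a confidence region. In a kernel (e.g., NTK-style) setting this bonus touches the K × K Gram matrix built from all offline samples, which is the source of the Ω(K^2) action-selection cost mentioned in the abstract. Below is a hedged sketch of such a bonus under these assumptions; it is illustrative and not any specific published baseline.

```python
import numpy as np

def kernel_lcb_bonus(k_query, K_gram, lam=1.0):
    """Uncertainty at a query point x: sqrt(k(x,x) - k_x^T (K + lam*I)^{-1} k_x).

    k_query : (K+1,) vector [k(x, x_1), ..., k(x, x_K), k(x, x)]
    K_gram  : (K, K) Gram matrix over the K offline samples

    Evaluating this for each candidate action requires a solve against the
    regularized K x K Gram matrix, hence the superlinear cost in K.
    """
    k_x, k_xx = k_query[:-1], k_query[-1]
    K_reg = K_gram + lam * np.eye(K_gram.shape[0])
    return np.sqrt(max(k_xx - k_x @ np.linalg.solve(K_reg, k_x), 0.0))
```

An LCB-based method would then act greedily with respect to the estimated value minus a multiple of this bonus; VIPeR avoids forming the Gram matrix altogether by drawing pessimism from the perturbed-reward ensemble.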

