OPTIMISTIC EXPLORATION WITH LEARNED FEATURES PROVABLY SOLVES MARKOV DECISION PROCESSES WITH NEURAL DYNAMICS

Abstract

Incorporating recent advances in deep learning, deep reinforcement learning (DRL) has achieved tremendous empirical success. However, analyzing DRL remains challenging due to the complexity of the neural network function class. In this paper, we address this challenge by analyzing Markov decision processes (MDPs) with neural dynamics, a model that covers several existing models as special cases, including the kernelized nonlinear regulator (KNR) and the linear MDP. We propose a novel algorithm that designs exploration incentives via learnable representations of the dynamics model, obtained by embedding the neural dynamics into a kernel space induced by the system noise. We further establish an upper bound on the sample complexity of the algorithm, which demonstrates its sample efficiency. We highlight that, unlike previous analyses of RL algorithms with function approximation, our sample complexity bound does not depend on the Eluder dimension of the neural network class, which is known to be exponentially large (Dong et al., 2021).

1. INTRODUCTION

Reinforcement learning (RL) aims to accomplish sequential decision-making in an uncertain environment by iteratively interacting with the environment (see Sutton & Barto (1998)). Equipped with modern function approximators such as deep neural networks, deep RL algorithms have achieved tremendous empirical successes (Mnih et al., 2015; Silver et al., 2017; Hafner et al., 2019). Despite these successes, the theoretical understanding of deep RL remains relatively underdeveloped. Several recent works (Abbasi-Yadkori et al., 2019; Wang et al., 2019; Fan et al., 2020) analyze RL algorithms with neural network parameterization, including policy iteration (PI) (Lagoudakis & Parr, 2003), policy gradient (PG) (Williams, 1992), and deep Q-learning (Mnih et al., 2013). However, these works rely on restrictive assumptions: either the agent has access to a simulator, or the MDP has bounded concentrability coefficients, which in effect implies that the state space is already well explored. Another line of research (Jiang et al., 2017; Jin et al., 2020; Cai et al., 2019; Du et al., 2021) removes such assumptions by conducting provably efficient exploration. This direction typically hinges on a low-rank MDP assumption: these works either assume that the MDP is linear in a known feature map or propose computationally inefficient algorithms, limiting their ability to explore environments with neural network parameterization. To enable such exploration, a recent line of work (Wang et al., 2020; Jin et al., 2021a) analyzes general function approximators in RL, covering neural network parameterization as a special case. These analyses typically depend on the Eluder dimension (Russo & Van Roy, 2013), which unfortunately can be exponentially large even for a simple neural network class (Dong et al., 2021), making the results statistically inefficient for neural network parameterization.
Therefore, we raise the following question: Can we design RL algorithms that conduct provably efficient exploration in structured environments with neural network parameterization? Specifically, our goal is to develop computationally efficient algorithms for such environments whose sample efficiency does not depend on the Eluder dimension of the neural network class. Our key insight is that, when the transition dynamics are captured by an energy-based model, we can leverage the spectral decomposition of the induced kernel so that the challenge of distribution shift is characterized by the effective dimension of the kernel.
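To make the effective-dimension quantity above concrete, one common definition for a kernel with Gram matrix K and regularizer lam is d_eff = sum_i eig_i / (eig_i + lam) over the eigenvalues of K. The sketch below computes this quantity; the RBF kernel and bandwidth are illustrative choices for a toy dataset, not the paper's construction.

```python
import numpy as np

def effective_dimension(K, lam=1.0):
    """One common notion of kernel effective dimension:
    d_eff = trace(K (K + lam I)^{-1}) = sum_i eig_i / (eig_i + lam),
    computed from the eigenvalues of the Gram matrix K."""
    eig = np.linalg.eigvalsh(K)  # K is symmetric PSD
    return float(np.sum(eig / (eig + lam)))

# Toy example: RBF Gram matrix on 50 one-dimensional points.
X = np.linspace(0.0, 1.0, 50)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.1**2)
d_eff = effective_dimension(K, lam=1.0)  # far smaller than n = 50
```

When the eigenvalues of K decay quickly, d_eff is much smaller than the number of data points, which is what makes kernel-based exploration bounds non-vacuous.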

1.1. RELATED WORK

Our work is closely related to the line of research on provably efficient exploration in the function approximation setting (Jiang et al., 2017; Jin et al., 2020; Cai et al., 2019; Du et al., 2021; Uehara et al., 2021; Zhang et al., 2022a). Such research typically hinges on MDPs with a low-rank structure. For instance, the study of linear MDPs (Jin et al., 2020; Cai et al., 2019) requires the transition dynamics to be linear in a known feature map. In contrast, the feature maps in our setting are unknown and need to be estimated. The study of low-rank MDPs (Jiang et al., 2017; Du et al., 2021; Uehara et al., 2021; Ren et al., 2022) is more closely aligned with our work in the sense that the feature map is unknown and needs to be estimated.



Northwestern University · University of Chicago · Yale University · University of Alberta · DeepMind. Corresponding author: siruizheng2025@u.northwestern.edu



Jiang et al. (2017) and Du et al. (2021) require optimistic planning over a confidence set of transition dynamics, which is computationally inefficient. Uehara et al. (2021) and Ren et al. (2022) propose algorithms for low-rank MDPs that are both computationally and sample efficient. Nevertheless, they only consider finite hypothesis classes and require sampling from the stationary distribution of the MDP. Our work is also related to the study of provably efficient exploration with general function approximation (Wang et al., 2020; Jin et al., 2021a). Nevertheless, previous results typically depend on the Eluder dimension (Russo & Van Roy, 2013) of the hypothesis class, which is exponentially large even for simple neural network classes (Dong et al., 2021). Yang et al. (2020) achieve sample-efficient exploration using overparameterized neural networks (Simsek et al., 2021) as function approximators. However, their analysis hinges on the neural tangent kernel (NTK) and cannot handle neural networks beyond the NTK regime. In contrast, our analysis adapts to generic neural network classes. Our work is also related to the analysis of model-based RL (Osband & Van Roy, 2014; Ayoub et al., 2020; Kakade et al., 2020) and representation learning (Ren et al., 2021; Nachum & Yang, 2021; Zhang et al., 2022b). Our definition of MDPs with neural dynamics generalizes those in Kakade et al. (2020) and Ren et al. (2021). In contrast to the KNR model in Kakade et al. (2020), we can handle infinite neural network hypothesis classes and do not require the nonlinear feature map to be known. Ren et al. (2021) require sampling from the posterior distribution over the hypothesis class, which is computationally inefficient when the hypothesis class is large. In addition, the sample

To illustrate this insight, we propose a new model, MDPs with neural dynamics, which allows neural network parameterization and captures various MDP models proposed in previous works, including the KNR model (Kakade et al., 2020) and the linear MDP (Jin et al., 2020). We then propose an algorithm, Exploration with Learnable Neural Features (ELNF), and show that it is sample efficient. ELNF iteratively fits the transition dynamics and reward functions with neural networks. Upon fitting the models, ELNF conducts exploration based on upper confidence bounds (UCB) (Abbasi-Yadkori et al., 2011), which are obtained from the feature maps corresponding to the fitted model. We remark that the bonus in ELNF can be computed efficiently.
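To give a flavor of how such a UCB bonus can be computed from learned features, the sketch below implements the standard elliptical bonus beta * sqrt(phi^T Lambda^{-1} phi); the feature matrix `Phi_history`, regularizer `lam`, and scale `beta` are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def elliptical_bonus(Phi_history, phi_query, lam=1.0, beta=1.0):
    """UCB-style exploration bonus beta * sqrt(phi^T Lambda^{-1} phi),
    where Lambda = lam * I + Phi_history^T Phi_history is the regularized
    empirical covariance of previously visited (learned) features."""
    d = phi_query.shape[0]
    Lambda = lam * np.eye(d) + Phi_history.T @ Phi_history
    # Solve a linear system instead of forming an explicit inverse.
    quad = phi_query @ np.linalg.solve(Lambda, phi_query)
    return beta * np.sqrt(quad)
```

The bonus is large in feature directions the agent has rarely visited and shrinks as visitation data accumulates, which is what drives optimistic exploration; computing it only requires a d x d linear solve per query.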

