SIMULTANEOUSLY LEARNING STOCHASTIC AND ADVERSARIAL MARKOV DECISION PROCESS WITH LINEAR FUNCTION APPROXIMATION

Abstract

Reinforcement learning (RL) has been widely used in practice. To deal with the numerous states and actions in real applications, function approximation methods have been widely employed to improve learning efficiency, among which linear function approximation has attracted great interest both theoretically and empirically. Previous works on the linear Markov Decision Process (MDP) mainly study two settings: the stochastic setting, where the reward is generated in a stochastic way, and the adversarial setting, where the reward can be chosen arbitrarily by an adversary. All these works treat the two environments separately. However, the learning agent often has no idea of how the rewards are generated, and a wrong guess of the reward type can severely degrade the performance of those specially designed algorithms. A natural question is thus whether one can derive an algorithm that learns efficiently in both environments without knowing the reward type. In this paper, we first consider this best-of-both-worlds problem for linear MDP with known transition. We propose an algorithm and prove that it simultaneously achieves O(poly log K) regret in the stochastic setting and Õ(√K) regret in the adversarial setting, where K is the number of episodes. To the best of our knowledge, this is the first such result for linear MDP.

1. INTRODUCTION

Reinforcement learning (RL) studies the problem where a learning agent interacts with the environment over time and aims to maximize its cumulative reward over a given horizon. It has a wide range of real applications including robotics (Kober et al., 2013) and games (Mnih et al., 2013; Silver et al., 2016). The environment dynamics are usually modeled by a Markov Decision Process (MDP) with a fixed transition function. We consider the general episodic MDP setting where the interaction lasts for several episodes and the length of each episode is fixed (Jin et al., 2018; 2020b; Luo et al., 2021; Yang et al., 2021). In each episode, the agent observes its current state and decides which action to take. After making the decision, it receives an instant reward and the environment transitions to the next state. The cumulative reward in an episode is called the value, and the objective of the agent is equivalent to minimizing the regret, defined as the cumulative difference between the optimal value and its received values over episodes. Many previous works focus on the tabular MDP setting, where the state and action spaces are finite and the values can be represented by a table (Azar et al., 2017; Jin et al., 2018; Chen et al., 2021; Luo et al., 2021). Most of them study the stochastic setting with stationary rewards, in which the reward of a state-action pair is generated from a fixed distribution (Azar et al., 2017; Jin et al., 2018; Simchowitz & Jamieson, 2019; Yang et al., 2021). Since the reward may change over time in applications, some works consider the adversarial MDP, where the reward can be generated arbitrarily across episodes (Yu et al., 2009; Rosenberg & Mansour, 2019; Jin et al., 2020a; Chen et al., 2021; Luo et al., 2021). All of these works aim to learn the value function table to find the optimal policy, and their computational complexity depends heavily on the sizes of the state and action spaces.
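The regret objective described above can be made precise. Writing π_k for the policy played in episode k, s_1^k for the initial state of episode k, and V_1^π for the value function at the first step, the standard definition (stated here in our notation; the paper's formal definitions may differ slightly) is:

```latex
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \left( V_1^{*}(s_1^k) - V_1^{\pi_k}(s_1^k) \right),
\qquad
V_1^{*}(s) \;=\; \max_{\pi} V_1^{\pi}(s).
```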
However, in real applications such as the game of Go, there are numerous states and the value function table is huge, which poses a great computational challenge for traditional tabular algorithms. To cope with the curse of dimensionality, a rich line of works employs function approximation methods, such as linear functions and deep neural networks, to approximate the value functions or the policies and improve learning efficiency. These methods have also achieved great success in practical scenarios such as the Atari and Go games (Mnih et al., 2013; Silver et al., 2016). Despite their great empirical performance, they also bring a series of challenges for theoretical analysis. To build a better theoretical understanding of these approximation methods, many works start by deriving regret guarantees for linear function classes. The linear MDP is a popular model which assumes that both the transition and the reward at a state-action pair are linear in a corresponding d-dimensional feature (Jin et al., 2020b; He et al., 2021; Hu et al., 2022). As in the tabular case, the reward can be either stochastic or adversarial. However, the learning agent usually has no idea of how the reward is generated, and once the reward type is guessed wrongly, an algorithm specially designed for one setting may suffer a great loss. Deriving an algorithm that can adapt to different environment types is thus a natural solution to this problem. This direction has attracted great research interest in the simpler bandit (Bubeck & Slivkins, 2012; Zimmert et al., 2019; Lee et al., 2021; Kong et al., 2022) and tabular MDP settings (Jin & Luo, 2020; Jin et al., 2021b), but it remains open for linear MDP. In this paper, we try to answer the question of deriving best-of-both-worlds (BoBW) guarantees for linear MDP. Due to the challenge of learning in the adversarial setting, we also consider the known transition case.
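For reference, the linear MDP assumption mentioned above (Jin et al., 2020b) can be stated as follows: there is a known feature map φ : S × A → R^d such that, for every step h, the transition and reward factorize linearly in φ, with unknown measures μ_h and an unknown vector θ_h (normalization conditions omitted here):

```latex
\mathbb{P}_h(s' \mid s, a) \;=\; \langle \phi(s, a),\, \mu_h(s') \rangle,
\qquad
r_h(s, a) \;=\; \langle \phi(s, a),\, \theta_h \rangle .
```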
We propose an algorithm that continuously detects the real environment type and adjusts its strategy accordingly. We show that our algorithm can simultaneously achieve O(poly log K) regret in the stochastic setting and Õ(√K) regret in the adversarial setting. To the best of our knowledge, these are the first BoBW results for linear MDP. It is also worth noting that our BoBW algorithm relies on an algorithm with a high-probability guarantee for the adversarial setting, which previous works fail to provide, and we give the first analysis of a high-probability regret bound for adversarial linear MDP.

For the stochastic setting, Jin et al. (2020b) provide the first efficient algorithm, named Least-Squares Value Iteration with UCB (LSVI-UCB), and show that its regret over K episodes can be upper bounded by O(√K). Seeking a tighter result with respect to the specific problem structure, He et al. (2021) provide a new analysis of LSVI-UCB and show that it achieves an O(poly log K) instance-dependent regret upper bound. The adversarial setting is much harder than the stochastic one, since the reward can change arbitrarily while the agent only observes the rewards on the experienced trajectory. For this more challenging case, a regret upper bound of order O(√K) has only been obtained in the known transition case by Neu & Olkhovskaya (2021). All these works treat the two environment types separately.

2. RELATED WORK

Linear MDP. Recently, deriving theoretically guaranteed algorithms for RL with linear function approximation has attracted great interest, and the linear MDP is one of the most popular models. Jin et al. (2020b) develop LSVI-UCB, the first algorithm for this setting that is efficient in both sample and computation complexity. They show that the algorithm achieves O(√(d^3 H^3 K)) regret, where d is the feature dimension and H is the length of each episode. This result was recently improved to the optimal order O(dH√K) by Hu et al. (2022) with a tighter concentration analysis. Apart from UCB, a TS-type algorithm has also been proposed for this setting (Zanette et al., 2020a). None of these results considers the specific problem structure. In the stochastic setting, deriving an instance-dependent regret bound is more attractive as it characterizes the tighter performance of an algorithm on a specific problem. This type of regret has been widely studied in the tabular MDP setting (Simchowitz & Jamieson, 2019; Yang et al., 2021). He et al. (2021) are the first to provide this type of regret bound for linear MDP. Using a different proof framework, they show that the LSVI-UCB algorithm achieves O(d^3 H^5 log K / ∆) regret, where ∆ is the minimum value gap in the episodic MDP. All these works consider the stochastic setting with stationary rewards. Neu & Olkhovskaya (2021) first attempt to analyze the more challenging adversarial environment; they consider the simpler setting with known transition and provide an O(√(dHK)) regret upper bound. For the unknown transition case, Luo et al. (2021) provide an O(d^(2/3) H^2 K^(2/3)) upper bound with the help of a simulator and an O(d^2 H^4 K^(14/15)) guarantee for the general case. Above all, even in the purely adversarial setting, O(√K) regret has only been derived for the known transition case. We also study the known transition setting.
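To make the LSVI-UCB update discussed above concrete, the following sketch shows its two ingredients at a single step h: a ridge regression over past features to estimate the Q-function weights, and an elliptical bonus for optimism. This is an illustrative sketch, not a reference implementation: the function name is ours, and `beta` is left as a free exploration coefficient rather than the theoretically prescribed value.

```python
import numpy as np

def lsvi_ucb_q(features, rewards, next_values, lam=1.0, beta=1.0):
    """One least-squares value-iteration step with a UCB bonus (a sketch
    of the LSVI-UCB update of Jin et al. (2020b); simplified constants).

    features    : (n, d) array of phi(s_h^k, a_h^k) from past episodes
    rewards     : (n,) observed rewards at step h
    next_values : (n,) estimated next-step values V_{h+1}(s_{h+1}^k)
    Returns a function q(phi) giving the optimistic Q-estimate at step h.
    """
    n, d = features.shape
    # Regularized Gram matrix: Lambda = lam * I + sum_k phi_k phi_k^T
    Lam = lam * np.eye(d) + features.T @ features
    Lam_inv = np.linalg.inv(Lam)
    # Ridge-regression weights: w = Lambda^{-1} sum_k phi_k (r_k + V_{h+1})
    w = Lam_inv @ (features.T @ (rewards + next_values))
    def q(phi):
        # Optimistic estimate: linear prediction plus elliptical bonus
        bonus = beta * np.sqrt(phi @ Lam_inv @ phi)
        return float(phi @ w + bonus)
    return q
```

With no data, the estimate reduces to the pure bonus beta * ||phi|| / sqrt(lam), which is what drives early exploration; as data accumulates, the bonus shrinks along well-explored feature directions.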

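The detect-and-switch idea described in the introduction can be sketched at a high level as follows. This is a generic best-of-both-worlds template, NOT our algorithm: the learner names, the `ToyAlg` stand-in, and the abstract `looks_adversarial` test are all hypothetical placeholders for illustration only.

```python
class ToyAlg:
    """Hypothetical stand-in for a learner; a real BoBW algorithm would
    maintain full policy/value state. Here it only returns toy rewards."""
    def __init__(self, name):
        self.name = name
    def play_episode(self, k):
        return 1.0  # toy constant reward

def run_bobw(K, stoch_alg, adv_alg, looks_adversarial):
    """Generic detect-and-switch skeleton for best-of-both-worlds learning:
    run the learner tuned for stochastic rewards, and switch permanently to
    the adversarial-setting learner once the detection test fires."""
    alg, total = stoch_alg, 0.0
    for k in range(K):
        total += alg.play_episode(k)
        if alg is stoch_alg and looks_adversarial(k):
            alg = adv_alg  # evidence of non-stationary rewards: switch
    return total, alg.name
```

The key design question, which the analysis must answer, is how to build a detection test whose false alarms are rare enough to preserve the O(poly log K) stochastic guarantee while still firing quickly enough to retain the Õ(√K) adversarial guarantee.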
