SIMULTANEOUSLY LEARNING STOCHASTIC AND ADVERSARIAL MARKOV DECISION PROCESS WITH LINEAR FUNCTION APPROXIMATION

Abstract

Reinforcement learning (RL) has been widely used in practice. To deal with the numerous states and actions in real applications, function approximation methods have been widely employed to improve learning efficiency, among which linear function approximation has attracted great interest both theoretically and empirically. Previous works on the linear Markov Decision Process (MDP) mainly study two settings: the stochastic setting, where the reward is generated in a stochastic way, and the adversarial setting, where the reward can be chosen arbitrarily by an adversary. All these works treat the two environments separately. However, the learning agent often has no idea of how rewards are generated, and a wrong guess of the reward type can severely disrupt the performance of algorithms specially designed for one setting. A natural question is therefore whether an algorithm can be derived that learns efficiently in both environments without knowing the reward type. In this paper, we first consider such a best-of-both-worlds problem for linear MDPs with known transitions. We propose an algorithm and prove that it simultaneously achieves O(poly log K) regret in the stochastic setting and Õ(√K) regret in the adversarial setting, where K is the horizon. To the best of our knowledge, this is the first such result for linear MDPs.

1. INTRODUCTION

Reinforcement learning (RL) studies the problem where a learning agent interacts with the environment over time and aims to maximize its cumulative reward over a given horizon. It has a wide range of real applications including robotics (Kober et al., 2013), games (Mnih et al., 2013; Silver et al., 2016), etc. The environment dynamics are usually modeled by a Markov Decision Process (MDP) with a fixed transition function. We consider the general episodic MDP setting where the interactions last for several episodes and the length of each episode is fixed (Jin et al., 2018; 2020b; Luo et al., 2021; Yang et al., 2021). In each episode, the agent first observes its current state and decides which action to take. After making the decision, it receives an instant reward and the environment then transitions to the next state. The cumulative reward in an episode is called the value, and maximizing it is equivalent to minimizing the regret, defined as the cumulative difference between the optimal value and the values the agent receives over episodes. Many previous works focus on the tabular MDP setting, where the state and action spaces are finite and the values can be represented by a table (Azar et al., 2017; Jin et al., 2018; Chen et al., 2021; Luo et al., 2021). Most of them study the stochastic setting with stationary rewards, in which the reward of a state-action pair is generated from a fixed distribution (Azar et al., 2017; Jin et al., 2018; Simchowitz & Jamieson, 2019; Yang et al., 2021). Since the reward may change over time in applications, some works consider the adversarial MDP, where the reward can be generated arbitrarily across episodes (Yu et al., 2009; Rosenberg & Mansour, 2019; Jin et al., 2020a; Chen et al., 2021; Luo et al., 2021). All of these works devote effort to learning the value function table to find the optimal policy, and their computational complexity depends heavily on the sizes of the state and action spaces.
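Concretely, writing $\pi_k$ for the policy the agent plays in episode $k$, $\pi^*$ for an optimal policy, and $V^{\pi}$ for the value (expected cumulative reward of an episode) under policy $\pi$, the regret over $K$ episodes can be sketched as follows (the notation here is a generic sketch; the paper's own symbols may differ):

```latex
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \left( V^{\pi^{*}} - V^{\pi_{k}} \right)
```

Minimizing this quantity is equivalent to maximizing the agent's cumulative reward, since $V^{\pi^*}$ is fixed by the environment and does not depend on the agent's choices.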
However, in real applications such as the game of Go, there are numerous states and the value function table is huge, which poses a great computational challenge for traditional tabular algorithms. To cope with this curse of dimensionality, a rich line of works employ the function

