ACHIEVING SUB-LINEAR REGRET IN INFINITE HORIZON AVERAGE REWARD CONSTRAINED MDP WITH LINEAR FUNCTION APPROXIMATION

Abstract

We study the infinite horizon average reward constrained Markov Decision Process (CMDP). In contrast to existing works, which are model-based and restricted to finite state spaces, we consider the model-free linear CMDP setup. We first propose a computationally inefficient algorithm and show that Õ(√(d³T)) regret and constraint violation can be achieved, where T is the number of interactions and d is the dimension of the feature mapping. We also propose an efficient variant based on a primal-dual adaptation of the LSVI-UCB algorithm and show that Õ((dT)^{3/4}) regret and constraint violation can be achieved. This improves the known regret bound of Õ(T^{5/6}) for finite state-space model-free constrained RL, which was obtained under a stronger assumption than ours. We also develop an efficient policy-based algorithm via a novel adaptation of the MDP-EXP2 algorithm to the primal-dual setup, with Õ(√T) regret and even a zero constraint violation bound under a stronger set of assumptions.

1. INTRODUCTION

In many standard applications of reinforcement learning (RL) (e.g., autonomous vehicles), the agent needs to satisfy certain constraints (e.g., safety, fairness). These problems can be formulated as a constrained Markov Decision Process (CMDP) in which the agent must ensure that the average utility exceeds a certain threshold (or, for a cost, stays below a threshold). While CMDPs with finite state spaces have been studied, those studies do not extend to large state spaces. RL with value function approximation has demonstrated empirical success for large-scale RL applications using deep neural networks. However, the theoretical understanding of constrained RL with value function approximation is quite limited. Recently, Ghosh et al. (2022) made some progress toward understanding constrained RL for linear MDPs in the episodic setting. In particular, Ghosh et al. (2022) developed a primal-dual adaptation of LSVI-UCB and showed Õ(√(d³T)) regret and violation, where d is the dimension of the feature space and T is the number of interactions. Importantly, the above bounds are independent of the cardinality of the state space.

However, the infinite-horizon model fits many real-world applications (e.g., stock-market investment, routing decisions) better than the finite-horizon setting. Compared to the discounted-reward model, maximizing the long-term average reward under a long-term average utility constraint also has the advantage that the transient behavior of the learner does not matter (Wei et al., 2020). Recently, a model-based RL algorithm for infinite-horizon average reward CMDPs was proposed (Chen et al., 2022). However, it considers the tabular setup. Further, the model-based approach requires large memory to store the model parameters, and it is computationally hard to extend model-based approaches to infinite state spaces such as linear MDPs (Wei et al., 2020). Model-free RL algorithms are more popular because of their ease of implementation and their computational and storage efficiency, particularly for large state spaces. However, model-free learning in the infinite horizon average reward setup is even more challenging. For example, it is still unknown whether a computationally efficient model-free algorithm with Õ(√T) regret is achievable even in the unconstrained tabular setup for a weakly communicating MDP (Wei et al., 2020).

To the best of our knowledge, Wei et al. (2022) is the only paper to study model-free algorithms for CMDPs in the infinite horizon average reward setup. In particular, they consider the finite-state tabular setting, and their regret scales polynomially with the number of states. Thus, the result is not useful for large-scale RL applications, where the number of states could even be infinite. To summarize, little is known about the performance guarantees of model-free algorithms for CMDPs beyond the tabular setting under infinite-horizon average reward, even in the case of linear CMDPs. Motivated by this, we are interested in the following question: Can we achieve provably sample-efficient, model-free exploration for CMDPs beyond the tabular setting in the infinite horizon average reward regime?

Contribution. To answer the above question, we consider CMDPs with linear function approximation, where the transition dynamics, the utility function, and the reward function can each be represented as a linear function of a known feature mapping.
Our main contributions are as follows.

• We propose an algorithm (Algorithm 1) that achieves Õ(√(d³T)) regret and constraint violation bounds with high probability when the optimal policy belongs to a smooth function class (Definition 1). This is the first result showing that Õ(√T) regret and violation are achievable for linear CMDPs in the infinite-horizon regime using model-free RL. Achieving a uniform concentration bound for the individual value functions turns out to be challenging, and, unlike in the unconstrained case, we need to rely on the smoothness of the policy. The algorithm relies on an optimizer that returns the parameters of the state-action bias function by solving a constrained optimization problem.

• We also propose an efficient variant and show that Õ((dT)^{3/4}) regret and violation bounds can be achieved. This is the first result providing sub-linear regret and violation guarantees under only Assumption 1 for linear CMDPs using a computationally efficient algorithm. The idea is to reduce to a finite-horizon episodic setup by dividing the entire horizon T into T/H episodes of H steps each. We then invoke the primal-dual adaptation of the LSVI-UCB algorithm proposed in Ghosh et al. (2022) to learn a good policy for the finite-horizon setting by carefully crafting the constraint for the episodic case (see the sketch after this list). Finally, we bound the gap between the infinite-horizon average and the finite-horizon result to obtain the final bound.

• We also propose an algorithm that can be implemented efficiently and achieves Õ(√T) regret and Õ(√T) constraint violation under a stronger set of assumptions (similar to those made in Wei et al. (2021a) for the unconstrained setup). We further show that one can achieve zero constraint violation for large enough (still finite) T while maintaining the same order of regret.

• We attain our bounds without estimating the unknown transition model or requiring a simulator, and our bounds depend on the state space only through the dimension of the feature mapping.
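To make the episode-splitting reduction in the second bullet concrete, the following is a minimal sketch of the outer primal-dual loop: the horizon T is split into T/H episodes, each episode is handled by a finite-horizon solver (stubbed out here), and the Lagrange multiplier on the utility constraint is updated by projected gradient ascent. The episode solver run_lsvi_ucb_episode, the step size eta, and the clipping bound lam_max are illustrative placeholders, not the paper's exact algorithm or constants.

import numpy as np

def run_lsvi_ucb_episode(lam, H, rng):
    """Placeholder for one finite-horizon episode of primal-dual LSVI-UCB.

    The real algorithm acts greedily w.r.t. optimistic Q-estimates of the
    combined objective r + lam * g; here we only simulate episode returns.
    """
    reward_return = rng.uniform(0.0, 1.0) * H   # total reward over H steps
    utility_return = rng.uniform(0.0, 1.0) * H  # total utility over H steps
    return reward_return, utility_return

def primal_dual_over_episodes(T=10_000, H=100, b=0.5, eta=0.01,
                              lam_max=10.0, seed=0):
    """Split horizon T into T // H episodes and run projected dual ascent.

    lam is the Lagrange multiplier on the constraint (average utility >= b);
    it grows when an episode's utility falls short of the target b * H.
    """
    rng = np.random.default_rng(seed)
    lam = 0.0
    total_reward, total_utility = 0.0, 0.0
    for _ in range(T // H):
        ep_reward, ep_utility = run_lsvi_ucb_episode(lam, H, rng)
        total_reward += ep_reward
        total_utility += ep_utility
        # Projected gradient ascent on the dual variable: step in the
        # direction of the constraint violation, then clip to [0, lam_max].
        lam = min(max(lam + eta * (b * H - ep_utility), 0.0), lam_max)
    return total_reward / T, total_utility / T, lam

if __name__ == "__main__":
    avg_r, avg_g, lam = primal_dual_over_episodes()
    print(f"avg reward {avg_r:.3f}, avg utility {avg_g:.3f}, lambda {lam:.2f}")

The design choice to cap lam at a finite lam_max mirrors the standard boundedness requirement on the dual variable used in primal-dual regret analyses; the gap between the infinite-horizon average and the per-episode finite-horizon objective is what the third step of the analysis controls.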

2. PRELIMINARIES

We consider an infinite horizon constrained MDP, denoted by (S, A, P, r, g), where S is the state space, A is the action space, P is the transition probability measure, and r and g are the reward and utility functions, respectively. We assume that S is a measurable space with a possibly infinite number of elements and that A is a finite action set. P(·|x, a) is the transition probability kernel, which gives the probability of reaching the next state when action a is taken in state x. We also write P as p to simplify notation; p satisfies ∫_S p(dx′|x, a) = 1 (following the integral notation of Hernández-Lerma (2012)).
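For concreteness, the linear structure and the learning objective can be written as follows. This is the standard linear MDP formalization together with the usual average-reward objective; the symbols φ, μ, θ_r, θ_g and the precise forms of the regret and violation below are stated for illustration and may differ in minor details from the paper's formal definitions.

\[
p(x' \mid x, a) = \langle \phi(x, a), \mu(x') \rangle, \qquad
r(x, a) = \langle \phi(x, a), \theta_r \rangle, \qquad
g(x, a) = \langle \phi(x, a), \theta_g \rangle,
\]
where \(\phi : S \times A \to \mathbb{R}^d\) is a known feature map. The learner seeks a policy \(\pi\) maximizing the long-term average reward subject to an average-utility constraint,
\[
J_r(\pi) = \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} r(x_t, a_t)\Big]
\quad \text{s.t.} \quad
J_g(\pi) = \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} g(x_t, a_t)\Big] \ge b,
\]
and, writing \(\pi^*\) for the best feasible policy, the regret and constraint violation over T steps take the form
\[
\mathrm{Reg}(T) = \sum_{t=1}^{T} \big(J_r(\pi^*) - r(x_t, a_t)\big), \qquad
\mathrm{Vio}(T) = \sum_{t=1}^{T} \big(b - g(x_t, a_t)\big).
\]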

To the best of our knowledge, these sub-linear bounds are the first results for model-free (or model-based) online RL algorithms for infinite-horizon average reward CMDPs with function approximation. Wei et al. (2022) propose a model-free algorithm in the tabular setting which achieves Õ(T^{5/6}) regret. Since the linear MDP setting contains the tabular setting as a special case, our result improves upon the existing one (in terms of the dependence on T, Õ((dT)^{3/4}) is smaller than Õ(T^{5/6}) once T ≳ d⁹). Further, we show that we can achieve zero constraint violation while maintaining the same order of regret under the same set of assumptions as Wei et al. (2022). We relegate Related Work to Appendix A.

