A REDUCTION APPROACH TO CONSTRAINED REINFORCEMENT LEARNING

Abstract

Many applications of reinforcement learning (RL) optimize a long-term reward subject to risk, safety, budget, diversity or other constraints. Although the constrained RL problem has been studied as a way to incorporate various constraints, existing methods are either tied to specific families of RL algorithms or require storing infinitely many individual policies found by an RL oracle to approach a feasible solution. In this paper, we present a novel reduction approach for the constrained RL problem that ensures convergence when using any off-the-shelf RL algorithm to construct an RL oracle, yet requires storing at most constantly many policies. The key idea is to reduce the constrained RL problem to a distance minimization problem, and a novel variant of the Frank-Wolfe algorithm is proposed for this task. Throughout the learning process, our method maintains at most constantly many individual policies, where the constant is shown to be worst-case optimal to ensure convergence of any RL oracle. Our method comes with rigorous convergence and complexity analysis, and does not introduce any extra hyper-parameters. Experiments on a grid-world navigation task demonstrate the efficiency of our method.

1. INTRODUCTION

Contemporary approaches in reinforcement learning (RL) largely focus on optimizing the behavior of an agent against a single reward function. RL algorithms such as value function methods (Zou et al., 2019; Zheng et al., 2018) or policy optimization methods (Chen et al., 2019; Zhao et al., 2017) are widely used in real-world tasks. This can be sufficient for simple tasks. However, for complicated applications, designing a reward function that implicitly defines the desired behavior can be challenging. For instance, applications concerning risk (Geibel & Wysotzki, 2005; Chow & Ghavamzadeh, 2014; Chow et al., 2017), safety (Chow et al., 2018) or budget (Boutilier & Lu, 2016; Xiao et al., 2019) are naturally modelled by augmenting the RL problem with orthant constraints. Exploration suggestions, such as visiting all states as evenly as possible, can be modelled by using a vector to measure the behavior of the agent and finding a policy whose measurement vector lies in a convex set (Miryoosefi et al., 2019).

To solve the RL problem under constraints, existing methods either ensure convergence only for a specific family of RL algorithms, or treat the underlying RL algorithm as a black-box oracle that finds individual policies and seek a mixed policy that randomizes among these individual policies. Although the second group of methods has the advantage of working with arbitrary RL algorithms that best suit the underlying problem, existing methods have practically infeasible memory requirements: to obtain an ε-approximate solution, they must store O(1/ε) individual policies, and an exact solution requires storing infinitely many policies. This limits the applicability of such methods, especially when each individual policy is a deep neural network.

In this paper, we propose a novel reduction approach for the general convex constrained RL (C2RL) problem. Our approach retains the advantage of the second group of methods, yet requires storing at most constantly many policies.
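To make the reduction mentioned in the abstract concrete, the following is a minimal sketch of Frank-Wolfe applied to distance minimization between the polytope of achievable measurement vectors and a target convex set. All names here (`rl_oracle`, `project_onto_target`) and the toy triangle polytope are illustrative assumptions, not the paper's actual algorithm or experiments; the point is only that the linear minimization step of Frank-Wolfe corresponds to one call to a scalar-reward RL oracle.

```python
import numpy as np

def fw_distance_min(rl_oracle, project_onto_target, x0, num_iters=500, tol=1e-9):
    """Frank-Wolfe sketch for min_x dist(x, C), with x ranging over the
    polytope of achievable measurement vectors.

    rl_oracle(w): returns the measurement vector of a policy trained to
        maximize the scalar reward <w, measurement>; this plays the role
        of the linear minimization step of Frank-Wolfe.
    project_onto_target(x): Euclidean projection onto the target convex set C.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(num_iters):
        grad = x - project_onto_target(x)   # gradient of (1/2)*dist(x, C)^2
        if np.linalg.norm(grad) < tol:      # measurement vector already in C
            break
        v = rl_oracle(-grad)                # best polytope vertex along -grad
        gamma = 2.0 / (t + 2)               # standard Frank-Wolfe step size
        x = (1.0 - gamma) * x + gamma * v
    return x

# Toy stand-in for the RL oracle: achievable measurements form the triangle
# with vertices (0,0), (1,0), (0,1); the target set is {x : x >= 0.4}.
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
oracle = lambda w: vertices[np.argmax(vertices @ w)]
project = lambda x: np.maximum(x, 0.4)

x = fw_distance_min(oracle, project, x0=[0.0, 0.0])
print(np.linalg.norm(x - project(x)))  # distance to the target set, ~0
```

Note that the iterate `x` is always a convex combination of oracle-returned vertices, i.e., the measurement vector of a mixed policy; the memory question the paper addresses is how many of those vertices (individual policies) must actually be kept.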
For a vector-valued Markov Decision Process (MDP) and any given target convex set, our method finds a mixed policy whose measurement vector lies in the target convex set, using any off-the-shelf RL algorithm that optimizes a scalar reward as an RL oracle. To do so, the C2RL problem is reduced to a distance minimization problem between a polytope and a convex set, and a novel Frank-Wolfe-type algorithm is proposed to solve this distance minimization problem. To find an ε-approximate solution in an m-dimensional vector-valued MDP,

