A REDUCTION APPROACH TO CONSTRAINED REINFORCEMENT LEARNING

Abstract

Many applications of reinforcement learning (RL) optimize a long-term reward subject to risk, safety, budget, diversity, or other constraints. Although the constrained RL problem has been studied with various kinds of constraints, existing methods either are tied to specific families of RL algorithms or require storing infinitely many individual policies found by an RL oracle in order to approach a feasible solution. In this paper, we present a novel reduction approach for the constrained RL problem that ensures convergence when any off-the-shelf RL algorithm is used to construct the RL oracle, yet requires storing at most constantly many policies. The key idea is to reduce the constrained RL problem to a distance minimization problem, for which we propose a novel variant of the Frank-Wolfe algorithm. Throughout the learning process, our method maintains at most constantly many individual policies, and this constant is shown to be worst-case optimal for ensuring the convergence of any RL oracle. Our method comes with rigorous convergence and complexity analyses and introduces no extra hyper-parameters. Experiments on a grid-world navigation task demonstrate the efficiency of our method.

1. INTRODUCTION

Contemporary approaches in reinforcement learning (RL) largely focus on optimizing the behavior of an agent against a single reward function. RL algorithms like value function methods (Zou et al., 2019; Zheng et al., 2018) or policy optimization methods (Chen et al., 2019; Zhao et al., 2017) are widely used in real-world tasks. This can be sufficient for simple tasks. However, for complicated applications, designing a reward function that implicitly defines the desired behavior can be challenging. For instance, applications concerning risk (Geibel & Wysotzki, 2005; Chow & Ghavamzadeh, 2014; Chow et al., 2017), safety (Chow et al., 2018) or budget (Boutilier & Lu, 2016; Xiao et al., 2019) are naturally modelled by augmenting the RL problem with orthant constraints. Exploration suggestions, such as visiting all states as evenly as possible, can be modelled by measuring the behavior of the agent with a vector and seeking a policy whose measurement vector lies in a given convex set (Miryoosefi et al., 2019). To solve the RL problem under constraints, existing methods either ensure convergence only for a specific family of RL algorithms, or treat the underlying RL algorithm as a black-box oracle that finds individual policies and seek a mixed policy that randomizes among these individual policies. Though the second group of methods has the advantage of working with arbitrary RL algorithms that best suit the underlying problem, existing methods have a practically infeasible memory requirement: to obtain an ε-approximate solution, they must store O(1/ε) individual policies, and an exact solution requires storing infinitely many policies. This limits the applicability of such methods, especially when each individual policy is a deep neural network. In this paper, we propose a novel reduction approach for the general convex constrained RL (C2RL) problem. Our approach retains the advantage of the second group of methods, yet requires storing at most constantly many policies.
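To fix ideas, the feasibility form of the C2RL problem sketched above can be stated as follows; the notation here is ours for illustration, not quoted from the paper:

```latex
% Each policy \pi induces a measurement vector z(\pi) \in \mathbb{R}^m
% (e.g., a vector of expected cumulative costs), and
% \mathcal{C} \subseteq \mathbb{R}^m is the given target convex set.
\text{find } \mu \in \Delta(\Pi)
\quad\text{s.t.}\quad
\bar{z}(\mu) := \mathbb{E}_{\pi \sim \mu}\big[\, z(\pi) \,\big] \in \mathcal{C},
\qquad\text{recast as}\qquad
\min_{\mu \in \Delta(\Pi)} \; \operatorname{dist}\!\big(\bar{z}(\mu),\, \mathcal{C}\big),
```

where Δ(Π) is the set of mixed policies, i.e., distributions over individual policies; the distance minimization form is what the reduction in this paper targets.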
For a vector-valued Markov Decision Process (MDP) and any given target convex set, our method finds a mixed policy whose measurement vector lies in the target set, using any off-the-shelf RL algorithm that optimizes a scalar reward as an RL oracle. To do so, the C2RL problem is reduced to a distance minimization problem between a polytope and a convex set, and a novel variant of the Frank-Wolfe algorithm is proposed to solve it. To find an ε-approximate solution in an m-dimensional vector-valued MDP, our method stores at most m + 1 individual policies, which improves the memory requirement from O(1/ε) (Le et al., 2019; Miryoosefi et al., 2019) to a constant. We also show that this constant, m + 1, is worst-case optimal to ensure convergence for RL oracles that use deterministic policies. Moreover, our method introduces no extra hyper-parameters, which is favorable for practical usage. A preliminary experimental comparison demonstrates the performance of the proposed method and the sparsity of the policy found.
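The distance minimization just described can be attacked with a plain Frank-Wolfe loop, since each linear minimization step is exactly one scalar-reward RL oracle call. The sketch below is a minimal numerical illustration of that reduction, not the paper's algorithm; `rl_oracle` and `project` are hypothetical stubs standing in for a full RL algorithm and a projection onto the target set.

```python
import numpy as np

def fw_distance_min(rl_oracle, project, m, iters=100):
    """Frank-Wolfe sketch for min_x dist(x, C) over the polytope of
    achievable measurement vectors.  `rl_oracle(w)` returns the
    measurement vector of a policy maximizing the scalarized reward
    <w, .>; `project(x)` is the Euclidean projection onto the target
    convex set C.  Both interfaces are illustrative assumptions."""
    x = rl_oracle(np.zeros(m))            # start from any achievable vertex
    for t in range(1, iters + 1):
        grad = x - project(x)             # gradient of 0.5 * dist(x, C)^2
        v = rl_oracle(-grad)              # linear minimization oracle =
                                          # one scalar-reward RL call
        gamma = 2.0 / (t + 2)             # standard FW step size
        x = (1 - gamma) * x + gamma * v   # mix the new policy in
    return x
```

As a toy check, if the achievable measurements are the vertices of the unit square and the target set is the single point (2, 2), the loop converges to the nearest achievable point (1, 1).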

2. RELATED WORK

For high-dimensional constrained RL, one line of approaches incorporates the constraint as a penalty signal into the reward function and makes updates in a multiple-time-scale scheme (Tessler et al., 2018; Chow & Ghavamzadeh, 2014). When used with policy gradient or actor-critic algorithms (Sutton & Barto, 2018), this penalty signal guides the policy to converge to a constraint-satisfying one (Paternain et al., 2019; Chow et al., 2017). However, the convergence guarantee requires that the RL algorithm can find a single policy satisfying the constraint, hence ruling out methods that search for deterministic policies, such as Deep Q-Networks (DQN) (Mnih et al., 2013), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) and their variants (Van Hasselt et al., 2015; Wang et al., 2016; Fujimoto et al., 2018; Barth-Maron et al., 2018). Another line of approaches uses a game-theoretic framework and is not tied to specific families of RL algorithms. The constrained problem is relaxed to a zero-sum game whose equilibrium is found by online learning (Agarwal et al., 2018). The game is played repeatedly, and in each round any RL algorithm can be used to find a best-response policy to play against a no-regret online learner. The mixed policy that distributes uniformly over all played policies can be shown to converge to an optimal policy of the constrained problem (Freund & Schapire, 1999; Abernethy et al., 2011). Taking this approach, Le et al. (2019) use Lagrangian relaxation to solve the orthant-constrained case, and Miryoosefi et al. (2019) use conic duality to solve the convex-constrained case. However, since convergence is established via the no-regret property, the policy found by these methods must randomize among all policies found during the learning process, which limits their practical applicability.
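The repeated-game scheme above can be made concrete with a toy sketch: a gradient-based online learner updates a Lagrange multiplier for a single orthant constraint, the oracle best-responds to the penalized reward, and the final mixed policy averages over all played policies, which is precisely why O(1/ε) policies must be stored. All names below (`best_response`, `lagrangian_game`, the budget constraint z[1] ≤ 0.5) are illustrative assumptions, not interfaces from the cited works.

```python
import numpy as np

def lagrangian_game(best_response, T=400, eta=0.05, budget=0.5):
    """Toy sketch of the game-theoretic scheme (in the spirit of
    Agarwal et al., 2018; Le et al., 2019): an online learner updates
    the Lagrange multiplier `lam` for the constraint z[1] <= budget,
    the RL oracle best-responds, and the returned mixed policy is the
    uniform average over ALL T played policies."""
    lam, played = 0.0, []
    for _ in range(T):
        z = best_response(lam)                       # oracle call on the
        played.append(z)                             # penalized reward
        lam = max(0.0, lam + eta * (z[1] - budget))  # projected gradient
    return np.mean(played, axis=0), played           # uniform mixed policy
```

In a toy instance where the high-reward measurement (1, 1) violates the budget and (1, 0) satisfies it, the multiplier oscillates around its equilibrium value and the averaged play meets the constraint, while one stored policy accumulates per round.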
Different from the game-theoretic approaches, we reduce C2RL to a distance minimization problem and propose a novel variant of the Frank-Wolfe (FW) algorithm to solve it. Our result builds on the recent finding that the standard FW algorithm emerges from computing the equilibrium of a special convex-concave zero-sum game (Abernethy & Wang, 2017). This connects our approach with previous approaches from the game-theoretic framework (Agarwal et al., 2018; Le et al., 2019; Miryoosefi et al., 2019). The main advantage of our reduction approach is that the convergence of the FW algorithm does not rely on the no-regret property of an online learner. Hence there is no need to introduce extra hyper-parameters, such as the learning rate of the online learner, and, intuitively, we can eliminate unnecessary policies to achieve better sparsity. To do so, we extend Wolfe's method for the minimum-norm-point problem (Wolfe, 1976) to solve our distance minimization problem. Throughout the learning process, we maintain an active policy set and constantly eliminate policies whose measurement vectors are affinely dependent on the others. Unlike the norm function in Wolfe's method, our objective function is not strongly convex. Hence we cannot achieve the linear convergence of Wolfe's method shown in Lacoste-Julien & Jaggi (2015). Instead, we analyze the complexity of our method using techniques from Chakrabarty et al. (2014). A theoretical comparison between our method and various approaches to constrained RL is provided in Table 1.
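The elimination of affinely dependent policies is what caps the active set at m + 1 members, by Carathéodory's theorem: any point in the convex hull of points in R^m is a convex combination of at most m + 1 of them. The routine below is an illustrative pruning step in that spirit, not the paper's exact procedure; it reduces the support of a convex combination without changing the combined measurement vector.

```python
import numpy as np

def caratheodory_prune(points, weights, tol=1e-9):
    """Given a convex combination x = sum_i w_i p_i of points in R^m,
    repeatedly remove affinely dependent points until at most m + 1
    carry nonzero weight, keeping x and the total weight unchanged.
    Illustrative sketch of active-set elimination, not the paper's
    exact routine."""
    P = np.array(points, dtype=float)
    w = np.array(weights, dtype=float)
    m = P.shape[1]
    while np.count_nonzero(w > tol) > m + 1:
        idx = np.flatnonzero(w > tol)
        # A direction alpha with sum(alpha) = 0 and sum_i alpha_i p_i = 0
        # exists because more than m + 1 points are affinely dependent.
        M = np.vstack([P[idx].T, np.ones(len(idx))])
        alpha = np.linalg.svd(M)[2][-1]        # null-space direction of M
        if not (alpha > tol).any():
            alpha = -alpha                     # ensure a positive entry
        pos = alpha > tol
        t = np.min(w[idx][pos] / alpha[pos])   # largest step keeping w >= 0
        w[idx] -= t * alpha                    # drives >= 1 weight to zero
        w[w < tol] = 0.0
    return w
```

Moving the weights along a null-space direction leaves both the combined point and the weight sum invariant, so each pass zeroes out at least one weight until the Carathéodory bound is reached.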



Table 1: Comparison with previous approaches. To find an ε-approximate solution, time complexity under orthant or convex constraints is compared by the number of RL oracle calls; memory is measured by the number of individual policies stored for an ε-approximate solution.

