A REDUCTION APPROACH TO CONSTRAINED REINFORCEMENT LEARNING

Abstract

Many applications of reinforcement learning (RL) optimize a long-term reward subject to risk, safety, budget, diversity or other constraints. Although the constrained RL problem has been studied under various kinds of constraints, existing methods are either tied to specific families of RL algorithms or require storing infinitely many individual policies found by an RL oracle to approach a feasible solution. In this paper, we present a novel reduction approach for the constrained RL problem that ensures convergence when any off-the-shelf RL algorithm is used to construct the RL oracle, yet requires storing at most constantly many policies. The key idea is to reduce the constrained RL problem to a distance minimization problem, for which we propose a novel variant of the Frank-Wolfe algorithm. Throughout the learning process, our method maintains at most constantly many individual policies, where the constant is shown to be worst-case optimal to ensure convergence with any RL oracle. Our method comes with a rigorous convergence and complexity analysis, and does not introduce any extra hyper-parameter. Experiments on a grid-world navigation task demonstrate the efficiency of our method.

1. INTRODUCTION

Contemporary approaches in reinforcement learning (RL) largely focus on optimizing the behavior of an agent against a single reward function. RL algorithms such as value function methods (Zou et al., 2019; Zheng et al., 2018) and policy optimization methods (Chen et al., 2019; Zhao et al., 2017) are widely used in real-world tasks. A single reward function can be sufficient for simple tasks. However, for complicated applications, designing a reward function that implicitly defines the desired behavior can be challenging. For instance, applications concerning risk (Geibel & Wysotzki, 2005; Chow & Ghavamzadeh, 2014; Chow et al., 2017), safety (Chow et al., 2018) or budget (Boutilier & Lu, 2016; Xiao et al., 2019) are naturally modelled by augmenting the RL problem with orthant constraints. Exploration suggestions, such as visiting all states as evenly as possible, can be modelled by using a vector to measure the behavior of the agent and seeking a policy whose measurement vector lies in a convex set (Miryoosefi et al., 2019).

To solve RL problems under constraints, existing methods either ensure convergence only for a specific family of RL algorithms, or treat the underlying RL algorithm as a black-box oracle that finds individual policies and look for a mixed policy that randomizes among these individual policies. Although the second group of methods has the advantage of working with whichever RL algorithm best suits the underlying problem, existing methods in this group have a practically infeasible memory requirement. To get an $\epsilon$-approximate solution, they require storing $O(1/\epsilon)$ individual policies, and an exact solution requires storing infinitely many policies. This limits the prevalence of such methods, especially when the individual policies are deep neural networks.

In this paper, we propose a novel reduction approach for the general convex constrained RL (C2RL) problem. Our approach has the advantage of the second group of methods, yet requires storing at most constantly many policies.
For a vector-valued Markov decision process (MDP) and any given target convex set, our method finds a mixed policy whose measurement vector lies in the target convex set, using any off-the-shelf RL algorithm that optimizes a scalar reward as an RL oracle. To do so, the C2RL problem is reduced to a distance minimization problem between a polytope and a convex set, and a novel variant of the Frank-Wolfe algorithm is proposed to solve this distance minimization problem. To find an $\epsilon$-approximate solution in an $m$-dimensional vector-valued MDP, our method stores at most $m + 1$ policies, which reduces the memory requirement from the $O(1/\epsilon)$ policies of previous oracle-based methods (Le et al., 2019; Miryoosefi et al., 2019), unbounded as $\epsilon \to 0$, to a constant. We also show that this constant $m + 1$ is worst-case optimal to ensure convergence of RL algorithms using deterministic policies. Moreover, our method introduces no extra hyper-parameter, which is favorable for practical usage. A preliminary experimental comparison demonstrates the performance of the proposed method and the sparsity of the policy found.

Table 1: Comparison with previous approaches. To find an $\epsilon$-approximate solution, time complexity under orthant or convex constraints is compared using the number of RL oracle calls. The memory requirement is measured by the number of individual policies stored for an $\epsilon$-approximate solution.

Method                     Orthant constraint   Convex constraint   Converges for any RL algo.   No extra hyper-parameter   Memory requirement
Tessler et al. (2018)      To a fixed point     -                   No                           No                         1
Le et al. (2019)           $O(1/\epsilon)$      -                   Yes                          No                         $O(1/\epsilon)$
Miryoosefi et al. (2019)   $O(1/\epsilon)$      $O(1/\epsilon)$     Yes                          No                         $O(1/\epsilon)$
C2RL (this paper)          $O(1/\epsilon)$      $O(1/\epsilon)$     Yes                          Yes                        $\le m + 1$

2. RELATED WORK

For high-dimensional constrained RL, one line of approaches incorporates the constraint as a penalty signal into the reward function, and makes updates in a multiple time-scale scheme (Tessler et al., 2018; Chow & Ghavamzadeh, 2014). When used with policy gradient or actor-critic algorithms (Sutton & Barto, 2018), this penalty signal guides the policy to converge to a constraint-satisfying one (Paternain et al., 2019; Chow et al., 2017). However, the convergence guarantee requires that the RL algorithm can find a single policy that satisfies the constraint, hence ruling out methods that search for deterministic policies, such as Deep Q-Networks (DQN) (Mnih et al., 2013), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) and their variants (Van Hasselt et al., 2015; Wang et al., 2016; Fujimoto et al., 2018; Barth-Maron et al., 2018).

Another line of approaches uses a game-theoretic framework and is not tied to specific families of RL algorithms. The constrained problem is relaxed to a zero-sum game, whose equilibrium is solved by online learning (Agarwal et al., 2018). The game is played repeatedly; in each round, any RL algorithm can be used to find a best-response policy to play against a no-regret online learner. The mixed policy that is uniformly distributed among all played policies can be shown to converge to an optimal policy of the constrained problem (Freund & Schapire, 1999; Abernethy et al., 2011). Taking this approach, Le et al. (2019) use Lagrangian relaxation to solve the orthant constraint case, and Miryoosefi et al. (2019) use conic duality to solve the convex constraint case. However, since the convergence is established by the no-regret property, the policy found by these methods requires randomization among all policies found during the learning process, which limits their prevalence.

Different from the game-theoretic approaches, we reduce C2RL to a distance minimization problem and propose a novel variant of the Frank-Wolfe (FW) algorithm to solve it. Our result builds on the recent finding that the standard FW algorithm emerges from computing the equilibrium of a special convex-concave zero-sum game (Abernethy & Wang, 2017). This connects our approach with previous approaches from the game-theoretic framework (Agarwal et al., 2018; Le et al., 2019; Miryoosefi et al., 2019). The main advantage of our reduction approach is that the convergence of the FW algorithm does not rely on the no-regret property of an online learner. Hence there is no need to introduce extra hyper-parameters, such as the learning rate of the online learner, and, intuitively, we can eliminate unnecessary policies to achieve better sparsity. To do so, we extend Wolfe's method for the minimum norm point problem (Wolfe, 1976) to solve our distance minimization problem. Throughout the learning process, we maintain an active policy set, and constantly eliminate policies whose measurement vectors are affinely dependent on the others. Unlike the norm function in Wolfe's method, our objective function is not strongly convex. Hence we cannot achieve the linear convergence of Wolfe's method shown in Lacoste-Julien & Jaggi (2015). Instead, we analyze the complexity of our method based on techniques from Chakrabarty et al. (2014). A theoretical comparison between our method and various approaches in constrained RL is provided in Table 1.

3. PRELIMINARIES

A vector-valued Markov decision process can be identified by a tuple $\{S, A, \beta, P, c\}$, where $S$ is the set of states, $A$ is the set of actions and $\beta$ is the initial state distribution. At the start of each episode, an initial state $s_0$ is drawn from the distribution $\beta$. Then, at each step $t = 0, 1, \dots$, the agent observes a state $s_t \in S$ and decides on an action $a_t$. After $a_t$ is chosen, the state evolves to $s_{t+1} \in S$ with probability $P(s_{t+1} \mid s_t, a_t)$. However, instead of a scalar reward, in our setting the agent receives an $m$-dimensional vector $c_t \in \mathbb{R}^m$ that may implicitly contain measurements of reward, risk or violation of other constraints. The episode ends after a certain number of steps, called the horizon, or when a terminal state is reached.

Actions are typically selected according to a policy $\pi$, where $\pi(s)$ is a distribution over actions for any $s \in S$. Policies that take a single action in every state are deterministic policies, and can be identified with a mapping $\pi : S \to A$. The set of all deterministic policies is denoted by $\Pi$. For a discount factor $\gamma \in [0, 1)$, the discounted long-term measurement vector of a policy $\pi \in \Pi$ is defined as

$$c(\pi) := \mathbb{E}\Big(\sum_{t=0}^{T} \gamma^t c_t(s_t, \pi(s_t))\Big),$$

where the expectation is over trajectories generated by the random process described above.

Unlike in the unconstrained setting, for a constrained RL problem it is possible that all feasible policies are non-deterministic (see Appendix D for an example). This limits the use of RL algorithms that search for deterministic policies in constrained RL. One workaround is to use mixed policies. For a set of policies $U$, a mixed policy is a distribution over $U$, and the set of all mixed policies over $U$ is denoted by $\Delta(U)$. To execute a mixed policy $\mu \in \Delta(U)$, we first select a policy $\pi \in U$ according to $\pi \sim \mu(\pi)$, and then execute $\pi$ for the entire episode. Altman (1999) shows that any achievable $c(\cdot)$ can be achieved by some mixed deterministic policy $\mu \in \Delta(\Pi)$. Therefore, although an off-the-shelf RL algorithm may not converge to any constraint-satisfying policy, it can be used as a subroutine to find individual policies (possibly deterministic), and a randomization among these policies can converge to a feasible policy. The discounted long-term measurement vector of a mixed policy $\mu \in \Delta(\Pi)$ is defined similarly as

$$c(\mu) := \mathbb{E}_{\pi \sim \mu}\big(c(\pi)\big) = \sum_{\pi \in \Pi} \mu(\pi)\, c(\pi).$$

For a mixed policy $\mu \in \Delta(U)$, its active set is defined to be the set of policies with non-zero weights, $A := \{\pi \in U \mid \mu(\pi) > 0\}$. The memory requirement of storing $\mu$ is then proportional to the size of its active set. Since a mixed policy can be interpreted as a convex combination of the policies in its active set, in the following the term sparsity of a mixed policy refers to the sparsity of this combination.

Our learning problem, convex constrained reinforcement learning (C2RL), is to find a policy whose expected long-term measurement vector lies in a given convex set; i.e., for a given convex target set $\Omega \subset \mathbb{R}^m$, our target is to

$$\text{find } \mu^* \text{ such that } c(\mu^*) \in \Omega \qquad \text{(C2RL)}. \tag{3}$$

Any policy $\mu^*$ that satisfies $c(\mu^*) \in \Omega$ is called a feasible policy, and a C2RL problem is feasible if there exists some feasible policy. In the following, we assume the C2RL problem is feasible.
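As a small illustration of the definitions above, the following sketch computes the measurement vector of a mixed policy, its active set, and the per-episode sampling step. The policy names and the numeric measurement vectors are hypothetical placeholders chosen for illustration; they do not come from the paper.

```python
import random

def mixed_measurement(weights, vectors):
    """c(mu) = sum_pi mu(pi) * c(pi) for a mixed policy over finitely many policies."""
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w >= 0 for w in weights)
    m = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(m)]

def active_set(weights, policies):
    """Policies with non-zero weight; only these need to be stored."""
    return [p for w, p in zip(weights, policies) if w > 0]

def sample_policy(weights, policies, rng=random):
    """Draw one individual policy; it is then executed for the entire episode."""
    return rng.choices(policies, weights=weights, k=1)[0]

# Hypothetical example with m = 2 measurements (steps, risk).
policies = ["short_risky", "long_safe", "unused"]
vectors = [[1.0, 0.10], [1.3, 0.00], [2.0, 0.00]]
mu = [0.4, 0.6, 0.0]
c_mu = mixed_measurement(mu, vectors)   # convex combination of the two active vectors
```

Here neither active policy alone satisfies hypothetical bounds such as (1.2, 0.05), while their mixture does, which is the situation that motivates mixed policies.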

4. APPROACH, ALGORITHM AND ANALYSIS

We now show how the C2RL problem (3) can be reduced to a distance minimization problem (7) between a polytope and a convex set. A novel variant of the Frank-Wolfe algorithm is then proposed to solve the distance minimization problem, followed by a theoretical analysis of the convergence and sparsity of the proposed method.

4.1. REDUCE C2RL TO A DISTANCE MINIMIZATION PROBLEM

Let $\|\cdot\|$ denote the Euclidean norm. For a convex set $\Omega \subset \mathbb{R}^m$, let $\mathrm{Proj}_\Omega(x) \in \arg\min_{y \in \Omega} \|x - y\|$ be the projection operator, and let $\mathrm{dist}^2(x, \Omega) := \frac{1}{2}\|x - \mathrm{Proj}_\Omega(x)\|^2$ be half of the squared Euclidean distance function. We then consider the problem of finding a policy whose measurement vector is closest to the target convex set,

$$\arg\min_{\mu \in \Delta(\Pi)} \mathrm{dist}^2(c(\mu), \Omega). \tag{4}$$

A policy $\mu^* \in \Delta(\Pi)$ is defined to be an optimal solution if it minimizes (4). More generally, the approximation error of $\mu \in \Delta(\Pi)$ is defined as

$$\mathrm{err}(\mu) := \mathrm{dist}^2(c(\mu), \Omega) - \mathrm{dist}^2(c(\mu^*), \Omega) \qquad \text{(Approximation Error)} \tag{5}$$

where $\mu^*$ is an optimal solution, and a policy is defined to be an $\epsilon$-approximate solution if its approximation error is no larger than $\epsilon$.

When C2RL (3) is feasible, being optimal for (4) is equivalent to being feasible for C2RL. Since the measurement vector of a feasible policy lies inside $\Omega$, a feasible policy attains the value zero of the non-negative $\mathrm{dist}^2$ function, and hence is optimal for (4). Conversely, the measurement vector of any optimal solution to (4) lies inside $\Omega$, so the solution is feasible for C2RL.

From a geometric perspective, let $c(\Pi) := \{c(\pi) \mid \pi \in \Pi\}$ be the set of all values achievable by deterministic policies. If the MDP has finitely many states and actions (though possibly extremely many), then $\Pi$ is finite as well, and hence $c(\Pi)$ contains finitely many points in $\mathbb{R}^m$. Then the set of values achievable by mixed deterministic policies,

$$c(\Delta(\Pi)) := \{c(\mu) \mid \mu \in \Delta(\Pi)\} = \Big\{\sum_{\pi} \mu(\pi)\, c(\pi) \;\Big|\; \sum_{\pi} \mu(\pi) = 1,\ \mu(\pi) \ge 0\Big\} \subset \mathbb{R}^m, \tag{6}$$

is the convex hull of $c(\Pi)$; i.e., $c(\Delta(\Pi))$ is a polytope in $\mathbb{R}^m$ whose vertices belong to $c(\Pi)$. Therefore finding a policy whose value is closest to the target convex set (4) is equivalent to finding a point in the polytope $c(\Delta(\Pi))$ that is closest to the convex set $\Omega$,

$$\arg\min_{c(\mu) \in c(\Delta(\Pi))} \mathrm{dist}^2(c(\mu), \Omega) \qquad \text{(Distance minimization problem)}. \tag{7}$$

To solve this constrained optimization problem, it might be tempting to consider projection methods. However, constructing a projection operator for $c(\Delta(\Pi))$ is non-trivial. For a given measurement vector, it is unclear how to modify a general RL algorithm to update its parameters such that the discounted expected measurement vector is closest to the given value. Therefore, projection-free methods are preferable for this task. The Frank-Wolfe (FW) algorithm does not require any projection onto the feasible set; instead, it uses a linear minimizer oracle. Intuitively, finding a linear minimizer is similar to the reward maximization performed by a general RL algorithm. In Section 4.3, we formalize this idea and show that, after simple modifications, any RL algorithm that maximizes a scalar reward can be used to construct such a linear minimizer oracle. Before getting into the details of the construction, we discuss FW-type algorithms over polytopes and their application to the distance minimization problem (7).
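To make the objective concrete, the sketch below evaluates the half squared distance and its gradient, which is x minus the projection of x onto Ω, for a target set given by componentwise bounds. The box-shaped Ω, the helper names and the numbers are assumptions made for illustration; any convex set with a computable projection could be substituted.

```python
def proj_box(x, upper, lower=None):
    """Euclidean projection onto {y : lower <= y <= upper}, taken componentwise."""
    if lower is None:
        lower = [float("-inf")] * len(x)
    return [min(max(xi, lo), up) for xi, lo, up in zip(x, lower, upper)]

def dist2(x, proj):
    """Half squared Euclidean distance from x to the set with projector `proj`."""
    w = proj(x)
    return 0.5 * sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def grad_dist2(x, proj):
    """Gradient of dist2 at x, namely x - Proj(x)."""
    return [xi - wi for xi, wi in zip(x, proj(x))]

# Hypothetical orthant-type target: first measurement <= 1.2, second <= 0.05.
proj = lambda z: proj_box(z, upper=[1.2, 0.05])
x = [1.0, 0.10]            # violates only the second bound
value = dist2(x, proj)     # half of the squared violation of the second bound
grad = grad_dist2(x, proj) # non-zero only in the violated coordinate
```

The gradient is the direction passed to the linear minimizer oracle, so only the violated coordinates influence which policy the oracle is asked to find.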

4.2. DISTANCE MINIMIZATION BY FRANK-WOLFE-TYPE ALGORITHMS

The Frank-Wolfe algorithm (FW) is a first-order method to minimize a convex function $f : P \to \mathbb{R}$ over a compact and convex set $P$, with access only to a linear minimizer oracle. When the feasible set is a polytope $P := \mathrm{conv}(\{s_1, s_2, \dots, s_n\}) \subset \mathbb{R}^m$, defined as the convex hull of finitely many points, FW-type algorithms are discussed by Lacoste-Julien & Jaggi (2015) to optimize

$$\min_{x \in P} f(x) \quad \text{using} \quad \mathrm{Oracle}(v) := \arg\min_{s \in \{s_1, \dots, s_n\}} s^\top v. \tag{8}$$

The standard FW (Algorithm 2 in Appendix A.1) makes repeated calls to the linear minimizer oracle to find an improving point $s$, each followed by a convex averaging step of the current iterate $x_{t-1}$ and the oracle's output $s$. If we have already constructed an RL oracle$(\lambda)$ that outputs a policy $\pi \in \arg\min_{\pi \in \Pi} \lambda^\top c(\pi)$ together with its measurement vector $c(\pi)$, then the distance minimization problem (7) can be solved with standard FW by using

$$\pi, c(\pi) \leftarrow \text{RL oracle}\big(\nabla \mathrm{dist}^2(x_{t-1}, \Omega)\big) = \text{RL oracle}\big(x_{t-1} - \mathrm{Proj}_\Omega(x_{t-1})\big) \tag{9}$$

to find an improving policy and its measurement vector. For $\eta_t := \frac{2}{t+2}$, the convex averaging steps

$$\mu_t \leftarrow (1 - \eta_t)\mu_{t-1} + \eta_t \pi, \qquad x_t \leftarrow (1 - \eta_t)x_{t-1} + \eta_t c(\pi), \tag{10}$$

then maintain the mixed policy and the corresponding measurement vector, respectively. However, after $T$ rounds of iteration, the $\mu_T$ found has an active set containing up to $T$ individual policies, and is not sparse enough. If neural networks are used to parameterize the policies, this requires storing $T$ copies of the network parameters, which is unaffordable for large-scale usage.

To find sparser policies, we turn to variants of FW-type algorithms, in particular Wolfe's method for the minimum norm point in a polytope (Wolfe, 1976; De Loera et al., 2018). In Wolfe's method (Algorithm 3 in Appendix A.2), the loop in FW is called a major cycle, and the convex averaging step is replaced by a weight optimization process, called a minor cycle. Wolfe's method maintains an active set $S$, and the current point can be represented by a sparse combination of points in the active set. The minor cycles maintain $S$ as an affinely independent set whose affine minimizer lies inside $\mathrm{conv}(S)$; Wolfe calls such sets corrals. Recall that the affine minimizer is defined as $\arg\min_{s \in \mathrm{aff}(S)} \|s\|^2$, where $\mathrm{aff}(S) := \{y \mid y = \sum_{z \in S} \alpha_z z,\ \sum_{z \in S} \alpha_z = 1\}$ is the affine hull formed by $S$. Since the active set is affinely independent, the number of active atoms is at most $m + 1$ at any time. Wolfe's method is shown to strictly decrease the approximation error between two major cycles.

Algorithm 1 C2RL method
     …
 2:  $\omega \leftarrow \mathrm{Proj}_\Omega(x)$
 3:  $\pi, c(\pi) \leftarrow$ RL Oracle$(x - \omega)$   // Potential improving point
 4:  if $(x - \omega)^\top (x - c(\pi)) \le \epsilon$ then break
 5:  if $S_c \cup \{c(\pi)\}$ is affinely independent then $S_c \leftarrow S_c \cup \{c(\pi)\}$, …
     …
     // If $y \notin \mathrm{conv}(S_c)$, then update $y$ to the intersection of $\mathrm{conv}(S_c)$ and the segment joining $x$ and $y$. Then remove points in $S_c$ unnecessary for describing $y$.
11:  $y \leftarrow \theta y + (1 - \theta)x$, $\lambda_i = \theta\alpha_i + (1 - \theta)\lambda_i$
12:  $S_c \leftarrow \{c(\pi_i) \mid c(\pi_i) \in S_c \text{ and } \lambda_i > 0\}$, $S_p \leftarrow \{\pi_i \mid \pi_i \in S_p \text{ and } \lambda_i > 0\}$
13:  end while
14:  Update $\mu \leftarrow \sum_{\pi \in S_p} \lambda_\pi \pi$, $x \leftarrow y$, $\lambda \leftarrow \alpha$.
15:  end while
16:  return $\mu$, $c(\mu) \leftarrow x$
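The standard FW updates (9) and (10) can be sketched as follows. This is a minimal illustration under stated assumptions: the finite list `vertices` stands in for the measurement vectors an RL oracle could return, the oracle is an exhaustive search over that list, the target set is a hypothetical orthant-type set, and every oracle output is stored as a separate entry, as would happen if each call returned a freshly trained network.

```python
def proj_upper(x, upper):
    """Projection onto {y : y <= upper}; a hypothetical orthant-type target set."""
    return [min(xi, ui) for xi, ui in zip(x, upper)]

def standard_fw(vertices, proj, T):
    """Standard Frank-Wolfe on dist2 over conv(vertices) with step size 2/(t+2)."""
    x = list(vertices[0])
    active = [(1.0, vertices[0])]          # (weight, measurement vector) pairs
    for t in range(1, T + 1):
        g = [xi - wi for xi, wi in zip(x, proj(x))]               # gradient, as in (9)
        s = min(vertices, key=lambda v: sum(gj * vj for gj, vj in zip(g, v)))
        eta = 2.0 / (t + 2)
        # Convex averaging step (10): shrink old weights, append the new policy.
        active = [((1 - eta) * w, v) for w, v in active] + [(eta, s)]
        x = [(1 - eta) * xi + eta * si for xi, si in zip(x, s)]
    return x, active

vertices = [[1.0, 0.10], [1.3, 0.00], [2.0, 0.00]]   # hypothetical c(pi) values
upper = [1.2, 0.05]
T = 200
x, active = standard_fw(vertices, lambda z: proj_upper(z, upper), T)
err = 0.5 * sum((xi - wi) ** 2 for xi, wi in zip(x, proj_upper(x, upper)))
```

In this sketch the stored list grows by one entry per oracle call, so its length is T + 1 after T rounds, which is the memory growth that motivates the sparser variant.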

4.3. OUR MAIN ALGORITHM

The main obstacle to applying Wolfe's method to our distance minimization problem (7) is that the objective function in Wolfe's method is the norm function, whereas in our problem the objective function is the distance function to a convex set. Unlike the norm function, the distance function to a convex set is not strongly convex, and the affine minimizer is ill-defined with respect to a convex set. To tackle these problems, we modify Wolfe's method: at the core of our new variant of the FW algorithm, we add a projection step to Wolfe's method.

Projection Step

In each major cycle, we minimize the distance to a projected point $\omega := \mathrm{Proj}_\Omega(x)$. Intuitively, since the distance to the convex set is upper bounded by the distance to this projected point $\omega$, if the distance to $\omega$ converges, so does the distance to the target convex set. Formally, for a set of points $S \subset \mathbb{R}^m$ and a point $x \in \mathbb{R}^m$, we extend the definition of an affine minimizer and define the affine minimizer with respect to $x$ as $\arg\min_{s \in \mathrm{aff}(S)} \|s - x\|^2$. For $x$ being the affine minimizer of $S$ with respect to $\omega$, the extended affine minimizer property gives

$$\text{Given } \omega,\ \forall v \in \mathrm{aff}(S): \quad (v - x)^\top (x - \omega) = 0 \qquad \text{(Extended affine minimizer property)}. \tag{11}$$

Similar to Wolfe's method, our C2RL method (Algo. 1) contains an outer loop (called the major cycle) to find improving policies and their measurement vectors, and an inner loop (called the minor cycle) to maintain the affine independence of the active set $S_c$. At the start of each major cycle step, $S_c$ is an affinely independent set. Then the RL oracle (defined in (15)) finds a potential improving policy $\pi \in U$ and its long-term measurement vector $c(\pi)$. If $c(\pi)$ does not get strictly closer to $\omega := \mathrm{Proj}_\Omega(x)$, then we are done, and $x$ is the optimal value. Otherwise, $c(\pi)$ is added to the active set, and the minor cycle is run to eliminate policies whose measurement vectors are affinely dependent. Lines 6 to 13 contain the minor cycle, which is the same as in the original Wolfe's method (except that in line 6 we find the affine minimizer with respect to $\omega$). The elimination is executed as a series of affine projections. The minor cycle terminates if the active set $S_c$ is affinely independent. Although interleaving major and minor cycles makes the size of the active set $S_c$ oscillate, the minor cycles keep $S_c$ affinely independent, and a minor cycle terminates whenever $S_c$ contains a single element. Therefore, at the start of any major cycle, the size of the active set satisfies $|S_c| \in [0, m + 1]$.
More background about the minor cycle in Wolfe's method is provided in Appendix A.2.
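The extended affine minimizer property (11) is the first-order optimality condition of a least-squares problem over an affine set. The short derivation below is a standard argument added for illustration; the auxiliary function $\varphi$ is notation introduced here and does not appear in the paper.

```latex
% x minimizes ||s - omega||^2 over s in aff(S). For any v in aff(S), the whole
% line x + t (v - x), t in R, stays inside aff(S), so t = 0 minimizes phi.
\begin{align*}
\varphi(t) &:= \tfrac{1}{2}\,\lVert x + t\,(v - x) - \omega \rVert^2,
  \qquad v \in \operatorname{aff}(S),\ t \in \mathbb{R}, \\
\varphi'(t) &= (v - x)^\top \bigl(x + t\,(v - x) - \omega\bigr), \\
0 = \varphi'(0) &= (v - x)^\top (x - \omega).
\end{align*}
% The upper bound used in the projection step follows from omega being in Omega:
\[
\operatorname{dist}^2(x', \Omega) \;\le\; \tfrac{1}{2}\,\lVert x' - \omega \rVert^2
\qquad \text{for every } x' \in \mathbb{R}^m .
\]
```

Because $t$ ranges over all of $\mathbb{R}$ rather than only $[0, 1]$, the condition holds with equality, which is what distinguishes the affine hull from the convex hull.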

Construction of RL Oracle

The construction of our RL oracle can use any off-the-shelf RL algorithm that maximizes a scalar reward. For any given $\lambda \in \mathbb{R}^m$, we define any algorithm that finds a policy minimizing the linear function $\lambda^\top c(\cdot)$ as an RL oracle, that is,

$$\text{RL oracle}_p(\lambda) \in \arg\min_{\pi \in \Pi} \lambda^\top c(\pi). \tag{12}$$

Recall that a standard RL algorithm receives a scalar reward after each state transition, instead of the long-term measurement vector $c(\pi) \in \mathbb{R}^m$. We use the following linearity property to reformulate the right-hand side of (12) as a standard RL problem:

$$\arg\min_{\pi \in \Pi} \lambda^\top c(\pi) = \arg\min_{\pi \in \Pi} \lambda^\top \mathbb{E}\Big(\sum_{t=0}^{T} \gamma^t c_t\Big) = \arg\max_{\pi \in \Pi} \mathbb{E}\Big(\sum_{t=0}^{T} \gamma^t (-\lambda^\top c_t)\Big). \tag{13}$$

This shows that if we consider the Markov decision process with the same states, actions and transition probabilities, and construct the scalar reward $r := -\lambda^\top c_t$, then any policy that maximizes the expected discounted sum of $r$ is a linear minimizer of (12). Therefore, whichever RL algorithm best suits the underlying problem can be used to construct an RL oracle.

Certifying constraint satisfaction amounts to evaluating the measurement vector of the current policy. This is convenient in online settings, where simulations can be used to evaluate the measurement vector of the policy directly. In batch settings, various off-policy evaluation methods, such as importance sampling (Precup, 2000; Precup et al., 2001) or doubly robust estimation (Jiang & Li, 2016; Dudík et al., 2011), can be used to evaluate the policy. We denote the resulting measurement vector by

$$\text{RL oracle}_c(\lambda) := c\Big(\arg\min_{\pi \in \Pi} \lambda^\top c(\pi)\Big) = \arg\min_{c(\pi),\, \pi \in \Pi} \lambda^\top c(\pi). \tag{14}$$

To simplify notation, we assume that an RL oracle returns a policy as well as its measurement vector:

$$\text{RL Oracle}(\lambda) := \pi, c(\pi) = \text{RL oracle}_p(\lambda),\ \text{RL oracle}_c(\lambda). \tag{15}$$

Finding Extended Affine Minimizer

The process AffineMinimizer$(S, x)$ returns $(y, \alpha)$, where $y$ is the affine minimizer of $S$ with respect to $x$ and $\alpha := \{\alpha_s \mid s \in S\}$ is the set of coefficients expressing $y$ as an affine combination of the points in $S$, that is, $y = \sum_{s \in S} \alpha_s s$, where $\alpha_s$ is the weight associated with $s$.
The process AffineMinimizer(S, x) can be straightforwardly implemented using linear algebra. Wolfe (1976) also provides a more efficient implementation that uses a triangular array representation of the active set.
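Putting the pieces of this section together, the following is a rough sketch of the major and minor cycle structure, not a reproduction of Algorithm 1. Several parts are assumptions made for illustration: an exhaustive search over a known finite list of measurement vectors stands in for the RL oracle, `proj` is any user-supplied projection onto the target set, the step size `theta` follows the usual minor-cycle rule of Wolfe's method, a new vertex that is affinely dependent on the active set is skipped while the weights are still re-optimized, and `tol` and `eps` are arbitrary numerical tolerances.

```python
def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def comb(w, pts):
    return [sum(wi * p[j] for wi, p in zip(w, pts)) for j in range(len(pts[0]))]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def affine_minimizer(S, x):
    """Point of aff(S) closest to x and its affine weights; S affinely independent."""
    D = [sub(s, S[0]) for s in S[1:]]
    if not D:
        return list(S[0]), [1.0]
    G = [[dot(di, dj) for dj in D] for di in D]          # Gram matrix of directions
    beta = solve(G, [dot(di, sub(x, S[0])) for di in D])
    alpha = [1.0 - sum(beta)] + beta
    return comb(alpha, S), alpha

def c2rl(vertices, proj, eps=1e-8, max_major=1000, tol=1e-12):
    S, lam = [list(vertices[0])], [1.0]
    x = list(S[0])
    for _ in range(max_major):                           # major cycle
        w = proj(x)                                      # projection step
        g = sub(x, w)
        s = max(vertices, key=lambda v: -dot(g, v))      # maximize scalar return -g^T c
        if dot(g, sub(x, s)) <= eps:                     # no significant improvement
            break
        y_s, _ = affine_minimizer(S, s)
        if dot(sub(s, y_s), sub(s, y_s)) > tol:          # add only if affinely independent
            S.append(list(s))
            lam.append(0.0)
        while True:                                      # minor cycle
            y, alpha = affine_minimizer(S, w)
            if min(alpha) >= -tol:                       # y lies in conv(S)
                x, lam = y, alpha
                break
            theta = min(l / (l - a) for l, a in zip(lam, alpha) if a < -tol)
            lam = [theta * a + (1 - theta) * l for l, a in zip(lam, alpha)]
            keep = [i for i, l in enumerate(lam) if l > tol]
            S, lam = [S[i] for i in keep], [lam[i] for i in keep]
            x = comb(lam, S)
    return x, S, lam

vertices = [[1.0, 0.10], [1.3, 0.00], [2.0, 0.00]]       # hypothetical c(pi) values
upper = [1.2, 0.05]                                      # hypothetical orthant bounds
x, S, lam = c2rl(vertices, lambda z: [min(zi, ui) for zi, ui in zip(z, upper)])
```

In this toy instance the active set never exceeds m + 1 = 3 vectors, and the weights `lam` describe the mixed policy over the stored vectors. A practical implementation would replace the exhaustive search by a trained RL agent and store the policies alongside their measurement vectors.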

4.4. CONVERGENCE AND SPARSITY

In this section, we analyze the convergence and complexity of the proposed C2RL method (Algo. 1). We first show that the approximation error of C2RL strictly decreases between any two major cycle steps and that it converges at an $O(1/t)$ rate. Then we show that our method ensures convergence with an arbitrary RL algorithm, including those searching for deterministic policies. Moreover, concerning the memory complexity, we show that maintaining an active policy set of size $m + 1$ is worst-case optimal to ensure convergence with an arbitrary RL algorithm. Therefore, the proposed C2RL indeed achieves the optimal sparsity for the found policy, making it favorable for large-scale usage. The main difference between the convergence analysis of C2RL and that of Wolfe's method is the addition of the projection step. Intuitively, at each major step, if we make significant progress toward the projected point, then the distance to the convex set decreases by at least the same amount.

Time Complexity

In our analysis, we consider the approximation error as defined in (5). We use the superscript $t$ to denote a variable in the $t$-th major cycle before any minor cycle is executed. To simplify notation, we let $x^t := c(\mu^t)$ and $s^t := c(\pi^t)$. When discussing one step with $t$ fixed, let $y_i$ denote the affine minimizer found in the $i$-th minor cycle (line 6 of Algo. 1). We first show that the C2RL method strictly reduces the approximation error between two calls of the RL oracle.

Theorem 4.1 (Approximation Error Strictly Decreases). For any non-terminal step $t$, we have $\mathrm{err}(\mu^{t+1}) < \mathrm{err}(\mu^t)$.

That is, the measurement vector of the $\mu^t$ found by the C2RL method gets strictly closer to the convex set $\Omega$ after each major cycle step. The proof is provided in Appendix B. The idea is to consider the distance between $x^t$ and $\omega^t$. When the major cycle has no minor cycle, the non-terminal condition and the affine minimizer property imply $\mathrm{dist}^2(x^{t+1}, \omega^t) < \mathrm{dist}^2(x^t, \omega^t)$. Otherwise, we show that the first minor cycle strictly reduces $\mathrm{dist}^2(x^t, \omega^t)$ by moving along the segment joining $x$ and $y$, and that the subsequent minor cycles cannot increase it. Since $\omega^t \in \Omega$, we conclude $\mathrm{err}(\mu^{t+1}) \le \mathrm{dist}^2(x^{t+1}, \omega^t) < \mathrm{dist}^2(x^t, \omega^t) = \mathrm{err}(\mu^t)$, and the approximation error strictly decreases.

Given that the approximation error strictly decreases, Wolfe's method for the minimum norm point can be shown to terminate finitely (Wolfe, 1976). However, this finite termination property does not hold for our algorithm. Since a changed $\omega^t$ may yield a lower distance to the same active set $S_c^t$, the active set may stay unchanged across major cycles (see Figure 2 Middle for an example). Therefore we establish the convergence of the C2RL method by the following theorem.

Theorem 4.2 (Convergence in Approximation Error). For $t \ge 1$, the mixed policy $\mu^t$ found by the C2RL method satisfies $\mathrm{err}(\mu^t) \le 16Q^2/(t + 2)$, where $Q := \max_{\mu \in \Delta(U)} \|c(\mu)\|$ is the maximum norm of a measurement vector.

The proof is provided in Appendix C and relies on the following two lemmas. We briefly discuss the main idea here. Define major cycle steps with at most one minor cycle as "non-drop steps" and major cycle steps with more than one minor cycle as "drop steps". The following lemma shows that in each non-drop step, Algorithm 1 is guaranteed to make enough progress.

Lemma 4.3. For a non-drop step of the C2RL method, we have $\mathrm{err}(\mu^t) - \mathrm{err}(\mu^{t+1}) \ge \mathrm{err}^2(\mu^t)/(8Q^2)$.

Although this does not hold for drop steps, we can bound the frequency of drop steps as follows.

Lemma 4.4. After $t$ major cycle steps of the C2RL method, the number of drop steps is less than $t/2$.

Since the approximation error strictly decreases (Thm. 4.1), and the C2RL method makes significant progress in more than half of the major cycle steps, Thm. 4.2 can then be proved using an induction argument (Appendix C).

Convergence with Arbitrary RL Algo.

The convergence of the C2RL method when used with RL algorithms that search for deterministic policies, such as DQN, DDPG and their variants, is straightforward. In (8), although the oracle yields a vertex each time, FW-type algorithms optimize over the polytope formed by these vertices. Since Altman (1999) shows that any achievable $c(\cdot)$ can be achieved by some mixed deterministic policy, we conclude that if the underlying problem is feasible, then our C2RL method is able to find a feasible policy.
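As a rough check on the rate in Theorem 4.2, the following sketch indicates how Theorem 4.1, Lemma 4.3 and Lemma 4.4 combine into a $1/t$ bound. It is an informal reading and not the induction proof of Appendix C. The shorthand $e_k := \mathrm{err}(\mu^k)$, the starting index $k = 0$, and the initial bound $e_0 \le 2Q^2$ (which uses feasibility and $\|c(\mu)\| \le Q$) are assumptions introduced here.

```latex
% Non-drop step (Lemma 4.3), with u_k := e_k / (8 Q^2) in [0, 1):
\begin{align*}
e_{k+1} \le e_k\,(1 - u_k)
\;\Longrightarrow\;
\frac{1}{e_{k+1}} \;\ge\; \frac{1}{e_k}\cdot\frac{1}{1 - u_k}
\;\ge\; \frac{1}{e_k}\,(1 + u_k)
\;=\; \frac{1}{e_k} + \frac{1}{8Q^2}.
\end{align*}
% Drop step (Theorem 4.1): 1/e_{k+1} >= 1/e_k.
% By Lemma 4.4, at least t/2 of the first t major cycle steps are non-drop steps:
\begin{align*}
\frac{1}{e_t} \;\ge\; \frac{1}{e_0} + \frac{t}{2}\cdot\frac{1}{8Q^2}
\;\ge\; \frac{2}{16Q^2} + \frac{t}{16Q^2}
\quad\Longrightarrow\quad
e_t \;\le\; \frac{16Q^2}{t + 2}.
\end{align*}
```

The second inequality uses $1/e_0 \ge 1/(2Q^2) \ge 2/(16Q^2)$, and the argument assumes $e_k > 0$ throughout, since otherwise the method has already reached a feasible policy.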

Memory Complexity

We now discuss the sparsity of mixed policies for the constrained RL problem. We give a constructive proof in Appendix D to show that, to ensure convergence for RL algorithms that search for deterministic policies, storing $m + 1$ policies is required in the worst case.

Theorem 4.5 (Memory Complexity Bound). For a constrained RL problem with an $m$-dimensional measurement vector, in the worst case, a mixed policy needs to randomize among $m + 1$ individual policies to ensure convergence of RL oracles that search for deterministic policies.

Since the minor cycles in the C2RL method eliminate policies with affinely dependent measurement vectors, after the termination of the minor cycles the size of the active set is at most $m + 1$. That is, the policy found by the C2RL method requires randomization among no more than $m + 1$ individual policies. Therefore the proposed C2RL indeed achieves the optimal sparsity in the worst case, making it favorable for large-scale usage.

Corollary 4.5.1. The C2RL method, which randomizes among at most $m + 1$ policies, is worst-case optimal to ensure convergence of any RL oracle.
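To see why $m + 1$ policies can be unavoidable, consider the following simple instance. It is given only as an illustration and is not necessarily the construction used in Appendix D. The one-state, one-step MDP, the vectors $v_i$ and the singleton target set are assumptions introduced here.

```latex
% One state, one step, m + 1 actions; action i yields the measurement vector v_i,
% where v_0, ..., v_m in R^m are affinely independent. The deterministic policies
% are pi_i = "take action i", so c(Pi) = {v_0, ..., v_m}.
\[
\Omega := \{\omega\}, \qquad \omega := \frac{1}{m+1}\sum_{i=0}^{m} v_i .
\]
% Any mixed policy mu with c(mu) in Omega must satisfy
\[
\sum_{i=0}^{m} \mu(\pi_i)\, v_i = \omega, \qquad \sum_{i=0}^{m} \mu(\pi_i) = 1 .
\]
% Affine independence makes these barycentric coordinates unique, hence
\[
\mu(\pi_i) = \frac{1}{m+1} > 0 \quad \text{for all } i = 0, \dots, m ,
\]
% and every feasible mixed policy has an active set of size exactly m + 1.
```

For $m = 1$ this reduces to two policies with values $0$ and $1$ and the target $\{1/2\}$, which no single deterministic policy can meet.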

5. EXPERIMENTS

We evaluate the performance of C2RL in a grid-world navigation task (Fig. 1), and demonstrate its ability to efficiently find sparse policies. In this Risky Mars Rover environment, the agent is required to navigate from the starting point to the goal point by moving to one of the four neighboring cells at each step. The episodes terminate when the goal point is reached or after 300 steps. To enforce robustness, we add a risky area to indicate the dangerous states. The agent receives a measurement vector that indicates the steps it takes (0.1 for every step) and whether it stays in the risky area (0.1 for every risky step, and 0 otherwise), with discount factor $\gamma = 0.99$. We constrain the agent to reach the goal point with an expected cumulative steps measure within 1.1 and expected cumulative risky steps within 0.05. Note that, by design, the shortest path from the starting point to the goal point does not satisfy the constraint. This is common in practice, as robustness typically involves a trade-off between the reward and constraint satisfaction.

The proposed C2RL method is compared with approachability-based policy optimization (ApproPO) (Miryoosefi et al., 2019) and with reward constrained policy optimization (RCPO) (Tessler et al., 2018). ApproPO solves the same convex constrained RL problem by using an RL oracle to play against a no-regret online learner (Hazan et al., 2008; Zinkevich, 2003). Since ApproPO and C2RL both use an RL oracle, ApproPO is a natural baseline for our method. We also compare with RCPO, which takes a Lagrangian approach to incorporate the constraints as a penalty signal into the reward. Using advantage actor-critic (A2C) (Mnih et al., 2016), RCPO has been shown to converge to a fixed point. For a fair comparison, C2RL and ApproPO use an A2C agent as the RL oracle, with the same hyper-parameters as used in RCPO. The approximation errors are compared after training for the same number of samples.

Note that the C2RL method does not introduce any extra hyper-parameter. In contrast, ApproPO and RCPO require extra hyper-parameters for the initialization and learning rate of a variable equivalent to our $\lambda$ in the outer loop. Because our approach does not rely on the online learning framework, there is no need to tune the initialization and learning rate of $\lambda$, which eases usage.

We first showcase the consequences of our theoretical results using an optimal RL oracle. For any $x \in \mathbb{R}^m$, an optimal policy can be easily found via Dijkstra's algorithm. If multiple optimal paths exist, one is picked at random to form a deterministic policy. Using this as an optimal RL oracle, the convergence properties of C2RL and ApproPO are compared. Figure 2 Middle shows the values $c(\mu^t)$ of the policies found after each call to the oracle. In Figure 2 Right, when approaching the boundary of the feasible set, the iterates of approachability-based methods start to zigzag. Since C2RL contains a minor cycle to re-optimize the weights among the active set, C2RL progresses quickly to reach the exact optimal solution. In Figure 3 Left, the approximation error is shown for 300 calls of the optimal RL oracle.

We then compare C2RL, ApproPO and RCPO using the same A2C agent (details of the model structures and hyper-parameters are provided in Appendix E). We run each algorithm 50 times, each run for a maximum of 100 thousand samples. The mean and standard deviation of the results are presented in Figure 3. The original ApproPO paper suggests using a cache to save memory, and the memory requirement of this variant is also presented. Figure 3 demonstrates that C2RL converges to an optimal policy faster than previous methods, and that a sparse combination of individual policies is maintained throughout the iteration process.
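To give a feel for the scale of the constraints described above, the sketch below computes the discounted measurement vector of a single deterministic path. It assumes, for illustration only, that the per-step vector is emitted at every step starting from t = 0 and that a path is summarized by its length and by how many of its first steps are risky; the function names are hypothetical.

```python
def discounted_measurement(per_step_vectors, gamma=0.99):
    """Discounted sum of per-step measurement vectors along one trajectory."""
    m = len(per_step_vectors[0])
    total = [0.0] * m
    for t, c in enumerate(per_step_vectors):
        for j in range(m):
            total[j] += (gamma ** t) * c[j]
    return total

def path_vector(n_steps, n_risky, gamma=0.99):
    """Hypothetical path: n_steps moves, the first n_risky of them in the risky area."""
    steps = [[0.1, 0.1 if t < n_risky else 0.0] for t in range(n_steps)]
    return discounted_measurement(steps, gamma)

# Longest path whose discounted steps measure stays within the 1.1 bound.
max_len = max(n for n in range(1, 301) if path_vector(n, 0)[0] <= 1.1)

# A path that is risky at its very first step already has a risk measure of 0.1.
risk_one_early_step = path_vector(11, 1)[1]
```

Under these assumptions a single early risky step exceeds the 0.05 risk bound on its own, so a deterministic path through the risky area cannot be feasible unless it enters late, whereas a mixture of a short risky path and a longer safe path can be.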

6. CONCLUSION

In this paper, we introduce C2RL, an algorithm for solving RL problems under orthant or convex constraints. Our method reduces the constrained RL problem to a distance minimization problem, and we propose a novel Frank-Wolfe-type algorithm to solve it. Our method comes with rigorous theoretical guarantees and does not introduce any extra hyper-parameter. To find an ε-approximate solution, C2RL takes O(1/ε) calls to an RL oracle, and convergence is ensured for arbitrary RL algorithms. Moreover, C2RL strictly reduces the approximation error between consecutive calls to the RL oracle, and for m-dimensional constraints, the memory requirement is reduced from storing O(1/ε) policies (infinitely many for an exact solution) to storing at most constantly many (m + 1) policies. We further show that this constant is worst-case optimal for ensuring convergence with RL algorithms that search for deterministic policies. Experimentally, we demonstrate that the proposed C2RL method finds sparse solutions efficiently and outperforms previous methods.

A MORE ON FRANK-WOLFE-TYPE ALGORITHMS
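
As a concrete companion to the standard Frank-Wolfe iteration discussed in this appendix, the following minimal sketch minimizes a squared Euclidean distance over the convex hull of a finite vertex set, using the step size 2/(t+2). The function name, the triangle, and the target point are assumptions made for illustration. The linear minimization oracle here is a plain scan over the vertices, whereas in C2RL that role is played by an RL oracle.

```python
def frank_wolfe(vertices, target, steps):
    """Minimize ||x - target||^2 over conv(vertices) with step size 2/(t+2)."""
    x = list(vertices[0])
    for t in range(1, steps + 1):
        grad = [2.0 * (xi - ti) for xi, ti in zip(x, target)]
        # Linear minimization oracle: vertex with the smallest inner product.
        s = min(vertices, key=lambda v: sum(vi * gi for vi, gi in zip(v, grad)))
        eta = 2.0 / (t + 2)
        x = [(1 - eta) * xi + eta * si for xi, si in zip(x, s)]
    return x


def sq_dist(x, target):
    """Squared Euclidean distance between two points."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


# Hypothetical instance: a triangle and a target outside it. The closest
# point of the triangle to (1, 1) is (0.5, 0.5), at squared distance 0.5.
TRIANGLE = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
TARGET = (1.0, 1.0)
x_fw = frank_wolfe(TRIANGLE, TARGET, steps=2000)
print(x_fw, sq_dist(x_fw, TARGET))
```

The iterate is an average over all vertices returned so far, which is why plain Frank-Wolfe does not by itself yield a sparse combination.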

A.1 STANDARD FRANK-WOLFE ALGORITHM

Algorithm 2 Frank-Wolfe algorithm (Frank et al., 1956)
Input: objective $f : \mathcal{Y} \to \mathbb{R}$, oracle $O(\cdot)$, initial point $x_0 \in \mathcal{Y}$
1: for $t = 1, 2, 3, \ldots, T$ do
2:   $s \leftarrow \mathrm{Oracle}(\nabla f(x_{t-1})) = \arg\min_{s \in \{s_1, \ldots, s_n\}} s^\top \nabla f(x_{t-1})$
3:   $x_t \leftarrow (1 - \eta_t) x_{t-1} + \eta_t s$, for $\eta_t := \frac{2}{t+2}$
4: end for
5: return $x_T$

For a convex function $f : \mathcal{X} \to \mathbb{R}$, the Frank-Wolfe algorithm (FW) solves the constrained optimization problem over a compact and convex set $\mathcal{X}$. The standard FW is known to have a sublinear convergence rate, and various methods have been proposed to improve its performance. For example, when the underlying feasible set is a polytope and the objective function is strongly convex, multiple variants, such as away-step FW (Wolfe, 1970; Jaggi, 2013), pairwise FW (Mitchell et al., 1974), and Wolfe's method (Wolfe, 1976), are shown to enjoy a linear convergence rate. Linear convergence under other conditions has also been studied (Beck & Shtern, 2017; Garber & Hazan, 2013a;b).

// If $y \notin \mathrm{conv}(S)$, then update $y$ to the intersection of $\mathrm{conv}(S)$ and the segment joining $x$ and $y$. Then remove the points in $S$ that are unnecessary for describing $y$.
Update $x = y$ and $\lambda = \alpha$.
14: end while
15: return $x$

Wolfe's method is an iterative algorithm for finding the point with minimum Euclidean norm in a polytope, which is defined as the convex hull of a set of finitely many points. Wolfe's method consists of a finite number of major cycles, each of which consists of a finite number of minor cycles. At the start of each major cycle, let $H(x) := \{y : y^\top x = x^\top x\}$ be the hyperplane defined by $x$. If $H(x)$ separates the polytope from the origin, then the major cycle is terminated. Otherwise, it invokes an oracle to find any point on the near side of the hyperplane. The point is then added to the active set $S$, and a minor cycle starts. In a minor cycle, let $y$ be the point of smallest norm in the affine hull $\mathrm{aff}(S)$.
If $y$ is in the relative interior of the convex hull $\mathrm{conv}(S)$, then $x$ is updated to $y$ and the minor cycle is terminated. Otherwise, $y$ is updated to the point nearest to $y$ on the line segment $\mathrm{conv}(S) \cap [x, y]$. Thus $y$ is updated to a boundary point of $\mathrm{conv}(S)$, and any point that is not on the face of $\mathrm{conv}(S)$ in which $y$ lies is deleted. The minor cycles are executed repeatedly until $S$ becomes a corral, that is, a set whose affine minimizer lies inside its convex hull. Since a set of one point is always a corral, the minor cycles terminate after a finite number of runs.

B PROOF OF THEOREM 4.1

Theorem 4.1 (Approximation Error Strictly Decreases). For any non-terminal step $t$, we have $\mathrm{err}(\mu_{t+1}) < \mathrm{err}(\mu_t)$. That is, the measurement vector of $\mu_t$ found by the C2RL method gets strictly closer to the convex set $\Omega$ after each major cycle step.

Proof. If the current step is a major cycle with no minor cycle, then $x_{t+1}$ is the affine minimizer of $\mathrm{aff}(S \cup \{s_t\})$ with respect to $\omega_t$. The affine minimizer property then implies $(s_t - x_{t+1})^\top (x_{t+1} - \omega_t) = 0$. Since the iteration does not terminate at step $t$, we have $(x_t - \omega_t)^\top (x_t - s_t) > 0$, and therefore $x_{t+1} \neq x_t$. Since $x_{t+1}$ is the unique affine minimizer, this implies $f_\Omega(x_{t+1}) = \min_{\omega \in \Omega} \|x_{t+1} - \omega\|^2 \le \|x_{t+1} - \omega_t\|^2 < \|x_t - \omega_t\|^2 = f_\Omega(x_t)$.

Otherwise, the current step contains one or more minor cycles. In this case, we show that the first minor cycle strictly reduces the approximation error, and that any following minor cycles cannot increase it. For the first minor cycle, the affine minimizer $y_0$ of $\mathrm{aff}(S \cup \{s_t\})$ with respect to $\omega_t$ is outside $\mathrm{conv}(S \cup \{s_t\})$. Let $z = \theta y_0 + (1 - \theta) x_t$ be the intersection of $\mathrm{conv}(S \cup \{s_t\})$ and the segment joining $x_t$ and $y_0$. Let $V_0 := S_t$ and let $V_i$ denote the active set after the $i$-th minor cycle.
Then since $y_1$ is the affine minimizer of $V_1$ with respect to $\omega_t$, we have $\|z - \omega_t\| = \|\theta y_0 + (1 - \theta) x_t - \omega_t\| \le \theta \|y_0 - \omega_t\| + (1 - \theta) \|x_t - \omega_t\| < \|x_t - \omega_t\|$, where the second step uses the triangle inequality and the last step follows since the segment $x_t y_0$ intersects the interior of $\mathrm{conv}(S \cup \{s_t\})$, and the distance to $\omega_t$ strictly decreases along this segment. Therefore the point $z$ found by the first minor cycle satisfies $f_\Omega(z) = \min_{\omega \in \Omega} \|z - \omega\|^2 \le \|z - \omega_t\|^2 < \|x_t - \omega_t\|^2 = f_\Omega(x_t)$. Hence $h(y_1) < h(x_t)$, and the first minor cycle strictly decreases the approximation error. By a similar argument, the approximation error cannot increase in subsequent minor cycles. However, after the first minor cycle, the iterate may already be at the intersection point, so the strict inequality in the last step of Eq. 17 needs to be replaced by a non-strict inequality. Therefore any major cycle either finds an improving point and continues, or enters minor cycles, where the first minor cycle finds an improving point and the subsequent minor cycles do not increase the distance. Subtracting $f_\Omega(x^*)$ from both sides of $f_\Omega(x_{t+1}) < f_\Omega(x_t)$, we conclude that the approximation error strictly decreases, $h(x_{t+1}) < h(x_t)$.

C PROOF OF THEOREM 4.2

We first prove Theorem 4.2 using Lemma 4.3 and Lemma 4.4. Then we present the proofs of the lemmas.

Theorem 4.2 (Convergence in Approximation Error). For $t \ge 1$, the mixed policy $\mu_t$ found by the C2RL method satisfies $\mathrm{err}(\mu_t) \le 16Q^2/(t + 2)$, where $Q := \max_{\mu \in \Delta(U)} \|c(\mu)\|$ is the maximum norm of a measurement vector.

Proof. Lemma 4.4 shows that drop steps make up no more than half of the total major cycle steps, and Theorem 4.1 guarantees that these drop steps reduce the approximation error. We can therefore safely skip these steps and re-index the step numbers with $k$ so that they include non-drop steps only.
Since $\Omega$ is a convex set, the squared Euclidean distance function $\mathrm{dist}^2(x, \Omega)$ is convex in $x$, which implies $\mathrm{dist}^2(x_t, \Omega) + (q - x_t)^\top \nabla \mathrm{dist}^2(x_t, \Omega) \le \mathrm{dist}^2(q, \Omega)$. (37) Putting in $\nabla \mathrm{dist}^2(x_t, \Omega) = (x_t - \mathrm{Proj}_\Omega(x_t)) = (x_t - \omega_t)$, we get $(x_t - \omega_t)^\top (x_t - q) \ge \mathrm{err}(\mu_t)$, which together with Eq. 32 and Eq. 36 shows that for a non-drop step with no minor cycles, we have $\mathrm{err}(\mu_t) - \mathrm{err}(\mu_{t+1}) \ge \mathrm{err}^2(\mu_t)/(8Q^2)$. For a non-drop step with one minor cycle, we use Theorem 6 of (Chakrabarty et al., 2014). After translating all points by $-\omega_t$, it gives $\|x_t - \omega_t\|^2 - \|x_{t+1} - \omega_t\|^2 \ge ((x_t - \omega_t)^\top (x_t - q))^2/(8Q^2)$. Applying the same argument as for Eq. 37 then completes the proof.

Lemma 4.4. After $t$ major cycle steps of the C2RL method, the number of drop steps is less than $t/2$.

Proof. Recall that at the termination of a minor cycle, the size of the active set satisfies $|S_c| \in [1, m]$. Since in each major cycle step the size of the active set $S_t$ increases by one, and each drop step reduces the size of $S_t$ by at least one, the number of drop steps is always less than half of the total number of major cycle steps.

D PROOF OF THEOREM 4.5

Theorem 4.5 (Memory Complexity Bound). For a constrained RL problem with an $m$-dimensional measurement vector, in the worst case, a mixed policy needs to randomize among $m + 1$ individual policies to ensure convergence of RL oracles that search for deterministic policies.

Proof. We give a constructive proof. Consider an $m$-dimensional vector-valued MDP with a single state and $m + 1$ actions, where $c(a_i) := e_i$ is the unit vector of the $i$-th dimension for $i \in [1, m]$, $c(a_{m+1}) := 0$, and the episode terminates after 1 step. The constrained RL problem is to find a policy whose measurement vector lies in the convex set consisting of the single point whose coordinates all equal $1/(2m)$. By linear programming, it is clear that the only feasible mixture of deterministic policies selects $a_{m+1}$ with probability $1/2$ and each of the remaining $m$ actions with probability $1/(2m)$; i.e., the unique feasible policy for this problem has an active set containing $m + 1$ deterministic policies. Therefore any method that randomizes among fewer than $m + 1$ individual policies does not ensure convergence when used with RL algorithms that search for deterministic policies.
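The construction in this proof can be checked with exact arithmetic. The sketch below is only a sanity check under the reading that the target is the point with every coordinate equal to 1/(2m); the helper names are hypothetical.

```python
from fractions import Fraction


def mixture_weights(m):
    """Weights on a_1..a_{m+1} whose mixed value is (1/(2m), ..., 1/(2m)).

    Since c(a_i) = e_i for i <= m and c(a_{m+1}) = 0, coordinate i of the
    mixed value equals the weight on a_i, so those weights are forced and
    the zero action a_{m+1} receives the remaining probability mass.
    """
    w = [Fraction(1, 2 * m)] * m
    w.append(1 - sum(w))
    return w


def feasible_without(m, dropped):
    """Can the target still be reached when action index `dropped` (0-based) is removed?"""
    target = Fraction(1, 2 * m)
    # A removed action has weight 0; the other unit-vector actions are forced.
    forced = [Fraction(0) if i == dropped else target for i in range(m)]
    if any(wi != target for wi in forced):
        return False  # some coordinate of the mixed value misses the target
    rest = 1 - sum(forced)  # mass that must go to the zero action
    return rest == 0 if dropped == m else rest >= 0


weights = mixture_weights(4)
print(weights)
print([feasible_without(4, d) for d in range(5)])
```

For m = 4 the forced weights are 1/8 on each unit-vector action and 1/2 on the zero action, and removing any single action makes the target unreachable, in line with the claim that m + 1 policies are needed.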

E ADDITIONAL EXPERIMENT DETAILS

All the methods use the same A2C agent. The input is the one-hot encoded current position index. The A2C network is a standard fully connected multi-layer perceptron with ReLU activation functions. The actor and critic share the internal representation and each have their own final layer. Both actor and critic networks use the Adam optimizer with learning rate set to 1e-2. The network is as follows:

               Actor                                     Critic
Input layer    One-hot encoded state index (dim=54)      (shared)
Hidden layer   Linear(in=54, out=128, act="relu")        (shared)
Output layer   Linear(in=128, out=4, act="relu")         Linear(in=128, out=1, act="relu")
Output name    Action score                              State value

For ApproPO, the constant κ for projecting the convex set to a convex cone is set to 20, and θ is initialized to 0, following the original paper. For RCPO, the learning rate of its λ is set to 2.5e-5, and its λ is initialized to 0 and updated by online gradient descent with learning rate set to 1, as used by the original paper. The proposed C2RL introduces no extra hyper-parameters, so there is nothing to report.



λ satisfies $x = \sum_{s \in S_c} \lambda_s s$

Figure 1: Left: The Risky Mars Rover environment. The agent is required to navigate from the starting point to the goal point without staying long (0.5 steps in expectation) in the risky area (cross-hatched region). Middle, Right: Example of an optimal mixed policy found by C2RL in a single run. After 10k samples, C2RL finds a mixed policy that randomizes between two policies with weights 0.49 and 0.51. The visitation probabilities of the two policies are plotted.

Figure 2: Left: Visualization of the distance minimization problem (7) in R 2 , where the number of steps and the number of steps in risky zone are measured. The green hatched region is the polytope formed by values achievable by mixed deterministic policies c(∆(Π)), and the red hatched region is the target set. Middle: Using an optimal RL oracle, 10 paths are sampled to showcase the convergence property of C2RL and ApproPO, where each cross on the dashed line corresponds to a call to the oracle. Right: If we zoom in, ApproPO suffers from the zig-zagging problem.

Figure 3: Left: Time complexity measured by the number of calls to an optimal RL oracle. Middle, Right: Using A2C to approximate an RL oracle, time complexity measured in thousands of samples and memory complexity measured by the number of policies stored are compared.

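λ satisfies $x = \sum_{s \in S} \lambda_s s$
10: $y \leftarrow \theta y + (1 - \theta) x$, $\lambda_i = \theta \alpha_i + (1 - \theta) \lambda_i$
11: $S \leftarrow \{s_i \mid s_i \in S \text{ and } \lambda_i > 0\}$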
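The weight update and pruning in the steps above can be sketched as follows. The rule used for θ, the largest step toward the affine minimizer that keeps every weight non-negative, is the standard choice in Wolfe-type methods and is an assumption here, since the corresponding step is not shown above. The function name and the numerical tolerance are likewise hypothetical.

```python
def minor_cycle_step(points, lam, alpha, tol=1e-12):
    """One line-search-and-drop step of a Wolfe-style minor cycle.

    lam:   current convex weights (non-negative, summing to 1) describing x.
    alpha: affine-minimizer weights (summing to 1, possibly negative) describing y.
    Returns the surviving points, their weights, and the step size theta.
    """
    # Largest theta in [0, 1] with theta * alpha_i + (1 - theta) * lam_i >= 0.
    theta = min([1.0] + [l / (l - a) for l, a in zip(lam, alpha) if a < 0])
    new_lam = [theta * a + (1 - theta) * l for l, a in zip(lam, alpha)]
    # Keep only points whose weight stays strictly positive.
    keep = [i for i, w in enumerate(new_lam) if w > tol]
    return [points[i] for i in keep], [new_lam[i] for i in keep], theta


# Hypothetical weights: the third point has a negative affine weight, so the
# step stops when its convex weight reaches zero and the point is dropped.
pts = ["s1", "s2", "s3"]
kept, weights, theta = minor_cycle_step(pts, [0.5, 0.25, 0.25], [1.0, 0.5, -0.5])
print(kept, weights, theta)
```

Because at least one weight reaches zero whenever the affine minimizer lies outside the convex hull, each such step shrinks the active set, which is what keeps the number of stored policies bounded.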

Algorithm 1 Convex Constrained Reinforcement Learning (C2RL)
Input: RL oracle constructed by any RL algorithm, projection operator onto the target set $\mathrm{Proj}_\Omega$.
Initialize: random policy $\pi$, value $x = c(\pi)$, active sets $S_p := [\pi]$, $S_c := [x]$, and weight $\lambda = [1]$.
Output: mixed policy $\mu$ and its value $c(\mu)$ such that $c(\mu)$ minimizes the distance to the target set $\Omega$.

$S_p \leftarrow S_p \cup \{\pi\}$

A.2 WOLFE'S METHOD FOR MINIMUM NORM POINT

Algorithm 3 Wolfe's Method for Minimum Norm Point
Initialize: $x \in P$, active set $S = [x]$ and weight $\lambda = [1]$.
Output: $x \in P$ that has the minimum Euclidean norm.


For these non-drop steps, we claim that $\mathrm{err}(\mu_k) \le 8Q^2/(k + 1)$. Using Lemma 4.3, we prove the convergence rate by induction. We first bound the error $\mathrm{err}(\mu_k)$. For any $k \ge 1$ where Eq. 21 uses the definition of our squared Euclidean distance function. Eq. 22 follows from the triangle inequality, and Eq. 23 is by the contractive property of the Euclidean distance.

When $k = 1$, Eq. 25 establishes the base case. Now for $k \ge 1$, assume that $\mathrm{err}(\mu_k) \le 8Q^2/(k + 1)$. Since the quadratic function on the right-hand side is monotonically increasing on $(-\infty, 4Q^2]$, using the inductive hypothesis

Since for $t$ major cycle steps the number of non-drop steps satisfies $k > t/2$, we conclude that $\mathrm{err}(\mu_t) \le 16Q^2/(t + 2)$.

Then we prove the lemmas.

Lemma 4.3. For a non-drop step, we have $\mathrm{err}(\mu_t) - \mathrm{err}(\mu_{t+1}) \ge \mathrm{err}^2(\mu_t)/(8Q^2)$.

Proof. A non-drop step contains either no minor cycle or one minor cycle. We first consider the case with no minor cycle.

If a major cycle contains no minor cycle, then $x_{t+1}$ is the affine minimizer of $S \cup \{s_t\}$. where equation (31) follows from the affine minimizer property Eq. (11). For $\|x_t - x_{t+1}\|$ in the last equation, and $\forall q \in \mathrm{aff}(S \cup \{s_t\})$, we have

$\ge \frac{1}{2Q} (x_t - x_{t+1})^\top (x_t - q)$ (Cauchy-Schwarz inequality) (35)
$= \frac{1}{2Q} (x_t - \omega_t)^\top (x_t - q)$ (affine minimizer property). (36)

Then it suffices to show that $(x_t - \omega_t)^\top (x_t - q) \ge \mathrm{err}(\mu_t)$.
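The inductive step for the claim $\mathrm{err}(\mu_k) \le 8Q^2/(k+1)$ can be spelled out as follows. This is a sketch reconstructed only from Lemma 4.3 and the stated monotonicity of the quadratic, so it may differ in detail from the original derivation; the shorthand $e_k := \mathrm{err}(\mu_k)$ is introduced here for readability.

```latex
\begin{align*}
e_{k+1} &\le e_k - \frac{e_k^2}{8Q^2}
  && \text{(Lemma 4.3)} \\
&\le \frac{8Q^2}{k+1} - \frac{8Q^2}{(k+1)^2}
  && \text{(monotonicity on $(-\infty, 4Q^2]$, with $e_k \le \tfrac{8Q^2}{k+1} \le 4Q^2$)} \\
&= 8Q^2 \, \frac{k}{(k+1)^2}
 \;\le\; \frac{8Q^2}{k+2}
  && \text{(since $k(k+2) \le (k+1)^2$)}.
\end{align*}
% Passing from non-drop steps k back to major cycle steps t, with k > t/2:
\[
\mathrm{err}(\mu_t) \;\le\; \frac{8Q^2}{k+1} \;<\; \frac{8Q^2}{t/2 + 1} \;=\; \frac{16Q^2}{t+2}.
\]
```

The map $e \mapsto e - e^2/(8Q^2)$ has derivative $1 - e/(4Q^2)$, which is non-negative exactly when $e \le 4Q^2$. This is why the inductive hypothesis can be substituted in the second line for every $k \ge 1$.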

