ENTROPIC RISK-SENSITIVE REINFORCEMENT LEARNING: A META REGRET FRAMEWORK WITH FUNCTION APPROXIMATION

Abstract

We study risk-sensitive reinforcement learning with the entropic risk measure and function approximation. We consider the finite-horizon episodic MDP setting and propose a meta algorithm based on value iteration. We then derive two algorithms as special instances of the meta algorithm: RSVI.L for linear function approximation and RSVI.G for general function approximation. We illustrate that the success of RSVI.L depends crucially on carefully designed feature mappings and regularization that adapt to risk sensitivity. In addition, both RSVI.L and RSVI.G maintain risk-sensitive optimism that facilitates efficient exploration. On the analytic side, we provide regret analysis for the algorithms by developing a meta analytic framework, at the core of which is a risk-sensitive optimism condition. We show that any instance of the meta algorithm satisfying the condition enjoys a meta regret bound. We further verify the condition for RSVI.L and RSVI.G under their respective function approximation settings to obtain concrete regret bounds that scale sublinearly in the number of episodes.

1. INTRODUCTION

Risk is one of the most important considerations in decision making, and so it should be in reinforcement learning (RL). As a prominent paradigm in RL that performs learning while accounting for risk, risk-sensitive RL explicitly models the risk of decisions via certain risk measures and simultaneously optimizes for rewards. It is poised to play an essential role in application domains where accounting for risk in decision making is crucial. A partial list of such domains includes autonomous driving (Buehler et al., 2009; Thrun, 2010), behavior modeling (Niv et al., 2012; Shen et al., 2014), real-time strategy games (Berner et al., 2019; Vinyals et al., 2019) and robotic surgery (Fagogenis et al., 2019; Shademan et al., 2016). In this paper, we study risk-sensitive RL through the lens of function approximation, an important apparatus for scaling up and accelerating RL algorithms in high-dimensional applications. We focus on risk-sensitive RL with the entropic risk measure, a classical framework established by the seminal work of Howard & Matheson (1972). Informally, for a fixed risk parameter β ≠ 0, our goal is to maximize the objective V^β = (1/β) log{E[e^{βR}]}. The definition of V^β will be made formal later in (2). The objective (1) admits the Taylor expansion V^β = E[R] + (β/2) Var(R) + O(β²). Comparing (1) with the risk-neutral objective V = E[R] studied in the standard RL setting, we see that β > 0 induces a risk-seeking objective and β < 0 induces a risk-averse one. Therefore, the formulation with the entropic risk measure in (1) accounts for both risk-seeking and risk-averse modes of decision making, whereas most other formulations are restricted to the risk-averse setting (Fu et al., 2018). It can also be seen that V^β tends to the risk-neutral V as β → 0.
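To make the objective concrete, the following sketch estimates V^β from sampled returns via a numerically stable log-sum-exp and checks it against the Taylor approximation E[R] + (β/2) Var(R) for a small β. The function name, the Gaussian test distribution, and the interpretation of R as an episode's cumulative reward are our own illustrative choices, not part of the paper's formal setup.

```python
import numpy as np

def entropic_risk(returns, beta):
    """Sample estimate of V^beta = (1/beta) * log E[exp(beta * R)].

    Uses a log-sum-exp shift for numerical stability; beta must be nonzero.
    """
    r = np.asarray(returns, dtype=float)
    m = beta * r
    c = m.max()
    log_mgf = c + np.log(np.mean(np.exp(m - c)))  # log E[e^{beta R}]
    return log_mgf / beta

rng = np.random.default_rng(0)
R = rng.normal(loc=1.0, scale=0.5, size=200_000)  # sampled returns
beta = 0.1
taylor = R.mean() + (beta / 2) * R.var()  # E[R] + (beta/2) Var(R)
print(entropic_risk(R, beta), taylor)
```

Note that for β > 0 the estimate exceeds the sample mean (risk-seeking) and for β < 0 it falls below it (risk-averse), matching the discussion above.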
Existing works on function approximation for RL have mostly focused on the risk-neutral setting and heavily exploit the linearity of the risk-neutral objective V in both the transition dynamics (implicitly captured by the expectation) and the reward R, a linearity that is clearly unavailable in the risk-sensitive objective (1). It is also well known that, even in the risk-neutral setting, improperly implemented function approximation can result in errors that scale exponentially in the size of the state space. Combined with the nonlinearity of the risk-sensitive objective (1), this compounds the difficulty of implementing function approximation in risk-sensitive RL with provable guarantees. This work provides a principled solution to function approximation in risk-sensitive RL by overcoming the above difficulties. Under the finite-horizon MDP setting, we propose a meta algorithm based on value iteration, and from it we derive two concrete algorithms for linear and general function approximation, which we name RSVI.L and RSVI.G, respectively. By modeling a shifted exponential transformation of estimated value functions, RSVI.L and RSVI.G cater to the nonlinearity of the risk-sensitive objective (1) and adapt to both risk-seeking and risk-averse settings. Moreover, both RSVI.L and RSVI.G maintain risk-sensitive optimism in the face of uncertainty for effective exploration. In particular, RSVI.L exploits a synergistic relationship between feature mapping and regularization in a risk-sensitive fashion. The resulting structure makes RSVI.L more efficient in runtime and memory than RSVI.G under linear function approximation, while RSVI.G is more general and allows for function approximation beyond the linear setting. Furthermore, we develop a meta regret analytic framework and identify a risk-sensitive optimism condition that serves as its core component.
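The idea of a shifted exponential transformation can be illustrated as follows. Assuming the stage-h value estimate lies in [0, H − h + 1] (a standard bound in finite-horizon episodic MDPs with rewards in [0, 1]), one natural shift keeps the transformed quantity in (0, 1] for either sign of β. The precise transform used by RSVI.L and RSVI.G is specified in the algorithms themselves; the helpers below are only a hedged sketch of the idea.

```python
import numpy as np

def exp_transform(V, beta, h, H):
    """Shifted exponential transform of a stage-h value estimate (sketch).

    Assuming V lies in [0, H - h + 1], shifting by H - h + 1 when beta > 0
    keeps the output in (0, 1]; when beta < 0, exp(beta * V) is already in
    (0, 1]. The exact transform in RSVI.L/RSVI.G may differ.
    """
    shift = (H - h + 1) if beta > 0 else 0.0
    return np.exp(beta * (np.asarray(V, dtype=float) - shift))

def inv_transform(W, beta, h, H):
    """Invert exp_transform to recover the value estimate."""
    shift = (H - h + 1) if beta > 0 else 0.0
    return np.log(np.asarray(W, dtype=float)) / beta + shift
```

Working with the transformed, bounded quantity is what allows the nonlinear objective (1) to be handled by regression-style updates in both the risk-seeking and risk-averse regimes.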
Under the optimism condition, we prove a meta regret bound incurred by any instance of the meta algorithm, regardless of the function approximation setting. Furthermore, we show that both RSVI.L and RSVI.G satisfy the optimism condition under their respective function approximation settings and achieve regret that scales sublinearly in the number of episodes. The meta framework therefore helps us disentangle the analysis associated with function approximation from the generic analysis, shedding light on the role of function approximation in regret guarantees. We hope that our meta framework will motivate and benefit future studies of function approximation in risk-sensitive RL.

Our contributions. We summarize the contributions of the present paper as follows:
• we study function approximation in risk-sensitive RL with the entropic risk measure; we provide a meta algorithm, from which we derive two concrete algorithms for linear and general function approximation, respectively; the concrete algorithms are both shown to adapt to all levels of risk sensitivity and to maintain risk-sensitive optimism over the learning process;
• we develop a meta regret analytic framework and identify a risk-sensitive optimism condition, under which we prove a meta regret bound for the meta algorithm; furthermore, by showing that the optimism condition holds for both concrete algorithms, we establish regret bounds for them under linear and general function approximation, respectively.

Notations. For a positive integer n, we let [n] := {1, 2, . . . , n}. For a number u ≠ 0, we define sign(u) = 1 if u > 0 and sign(u) = -1 if u < 0. For two non-negative sequences {a_i} and {b_i}, we write a_i ≲ b_i if there exists a universal constant C > 0 such that a_i ≤ C b_i for all i, and write a_i ≍ b_i if a_i ≲ b_i and b_i ≲ a_i. We use Õ(·) to denote O(·) while hiding logarithmic factors. For any ε > 0 and set X, we let N_ε(X, ‖·‖) be the ε-net of the set X with respect to the norm ‖·‖. We let ∆(X) be the set of probability distributions supported on X. For any vector u ∈ R^n and any symmetric positive definite matrix Γ ∈ R^{n×n}, we let ‖u‖_Γ := √(u^⊤ Γ u). We denote by I_n the n × n identity matrix.
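The weighted norm ‖u‖_Γ is computed directly as √(u^⊤ Γ u); the minimal snippet below (our own illustration) shows that with Γ = I_n it reduces to the Euclidean norm, and that rescaling Γ rescales the norm by the square root of the factor.

```python
import numpy as np

def gamma_norm(u, Gamma):
    """Compute ||u||_Gamma = sqrt(u^T Gamma u) for symmetric PD Gamma."""
    u = np.asarray(u, dtype=float)
    return float(np.sqrt(u @ Gamma @ u))

u = np.array([3.0, 4.0])
print(gamma_norm(u, np.eye(2)))        # Euclidean norm: 5.0
print(gamma_norm(u, 2.0 * np.eye(2)))  # sqrt(2) * 5.0
```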

2. RELATED WORK

Initiated by the seminal work of Howard & Matheson (1972), risk-sensitive control/RL with the entropic risk measure has been studied in a vast body of literature (Bäuerle & Rieder, 2014; Borkar, 2001; 2002; 2010; Borkar & Meyn, 2002; Cavazos-Cadena & Hernández-Hernández, 2011; Coraluppi & Marcus, 1999; Di Masi & Stettner, 1999; 2000; 2007; Fleming & McEneaney, 1995; Hernández-Hernández & Marcus, 1996; Jaśkiewicz, 2007; Marcus et al., 1997; Mihatsch & Neuneier, 2002; Osogami, 2012; Patek, 2001; Shen et al., 2013; 2014; Whittle, 1990). Yet, this line of work either assumes known transition kernels or focuses on asymptotic behaviors of the problem/algorithms; finite-sample and finite-time results with unknown transitions have rarely been investigated.

