ENTROPIC RISK-SENSITIVE REINFORCEMENT LEARNING: A META REGRET FRAMEWORK WITH FUNCTION APPROXIMATION

Abstract

We study risk-sensitive reinforcement learning with the entropic risk measure and function approximation. We consider the finite-horizon episodic MDP setting and propose a meta algorithm based on value iteration. We then derive two algorithms for linear and general function approximation, namely RSVI.L and RSVI.G, respectively, as special instances of the meta algorithm. We illustrate that the success of RSVI.L depends crucially on a carefully designed feature mapping and regularization that adapt to risk sensitivity. In addition, both RSVI.L and RSVI.G maintain risk-sensitive optimism that facilitates efficient exploration. On the analytic side, we provide regret analysis for the algorithms by developing a meta analytic framework, at the core of which is a risk-sensitive optimism condition. We show that any instance of the meta algorithm satisfying the condition yields a meta regret bound. We further verify the condition for RSVI.L and RSVI.G under their respective function approximation settings to obtain concrete regret bounds that scale sublinearly in the number of episodes.

1. INTRODUCTION

Risk is one of the most important considerations in decision making, and it should be no less important in reinforcement learning (RL). As a prominent paradigm in RL that performs learning while accounting for risk, risk-sensitive RL explicitly models the risk of decisions via certain risk measures while simultaneously optimizing for rewards. It is poised to play an essential role in application domains where accounting for risk in decision making is crucial. A partial list of such domains includes autonomous driving (Buehler et al., 2009; Thrun, 2010), behavior modeling (Niv et al., 2012; Shen et al., 2014), real-time strategy games (Berner et al., 2019; Vinyals et al., 2019) and robotic surgery (Fagogenis et al., 2019; Shademan et al., 2016).

In this paper, we study risk-sensitive RL through the lens of function approximation, which is an important apparatus for scaling up and accelerating RL algorithms in high-dimensional applications. We focus on risk-sensitive RL with the entropic risk measure, a classical framework established by the seminal work of Howard & Matheson (1972). Informally, for a fixed risk parameter $\beta \neq 0$, our goal is to maximize the objective
$$V^{\beta} = \frac{1}{\beta} \log\big\{ \mathbb{E}\, e^{\beta R} \big\}, \qquad (1)$$
where $R$ denotes the cumulative reward. The definition of $V^{\beta}$ will be made formal later in (2). The objective (1) admits the Taylor expansion
$$V^{\beta} = \mathbb{E}[R] + \frac{\beta}{2} \mathrm{Var}(R) + O(\beta^{2}).$$
Comparing (1) with the risk-neutral objective $V = \mathbb{E}[R]$ studied in the standard RL setting, we see that $\beta > 0$ induces a risk-seeking objective (higher variance is rewarded) and $\beta < 0$ induces a risk-averse one (higher variance is penalized). Therefore, the formulation with the entropic risk measure in (1) accounts for both risk-seeking and risk-averse modes of decision making, whereas most other formulations are restricted to the risk-averse setting (Fu et al., 2018). It can also be seen that $V^{\beta}$ tends to the risk-neutral $V$ as $\beta \to 0$.

Existing work on function approximation for RL has mostly focused on the risk-neutral setting and heavily exploits the linearity of the risk-neutral objective $V$ in both the transition dynamics (implicitly captured by the expectation) and the reward $R$; such linearity is clearly not available in the risk-sensitive objective (1). It is also well known that, even in the risk-neutral setting, improperly implemented function approximation can result in errors that scale exponentially in the size of the state space.
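To illustrate how the objective (1) behaves, the following is a minimal numerical sketch, not part of the algorithms in this paper; the function `entropic_risk`, the Gaussian reward distribution, and all numeric values are illustrative assumptions. It estimates $V^{\beta}$ from samples of $R$ and compares it with the second-order Taylor approximation $\mathbb{E}[R] + \frac{\beta}{2}\mathrm{Var}(R)$, showing the risk-seeking ($\beta > 0$), risk-neutral ($\beta \to 0$) and risk-averse ($\beta < 0$) regimes.

```python
import numpy as np


def entropic_risk(rewards, beta):
    """Sample-based estimate of the entropic risk measure
    V^beta = (1/beta) * log E[exp(beta * R)].

    For beta == 0 the measure reduces to the risk-neutral mean E[R].
    """
    rewards = np.asarray(rewards, dtype=float)
    if beta == 0.0:
        return rewards.mean()
    # Numerically stable log-mean-exp.
    x = beta * rewards
    c = x.max()
    return (c + np.log(np.mean(np.exp(x - c)))) / beta


# Hypothetical Gaussian reward samples, purely for illustration.
rng = np.random.default_rng(0)
R = rng.normal(loc=1.0, scale=0.5, size=200_000)

for beta in (-2.0, -0.1, 0.0, 0.1, 2.0):
    v = entropic_risk(R, beta)
    taylor = R.mean() + 0.5 * beta * R.var()  # E[R] + (beta/2) Var(R)
    print(f"beta={beta:+.1f}  V^beta={v:.4f}  Taylor={taylor:.4f}  E[R]={R.mean():.4f}")
```

For Gaussian rewards the identity $V^{\beta} = \mu + \frac{\beta}{2}\sigma^{2}$ holds exactly, so the sample estimate and the Taylor approximation agree up to sampling error; for general reward distributions they differ by the $O(\beta^{2})$ remainder in the expansion above.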

