THREE PROBLEM CLASSES THAT MARKOV REWARDS CANNOT EXPRESS

Abstract

In this paper, we study the expressivity of Markovian reward functions, and identify several limitations to what they can express. Specifically, we look at three classes of reinforcement learning tasks (multi-objective reinforcement learning, risk-averse reinforcement learning, and modal reinforcement learning), and prove mathematically that most of the tasks in each of these classes cannot be expressed using scalar, Markovian reward functions. In the process, we provide necessary and sufficient conditions for when a multi-objective reinforcement learning problem can be reduced to ordinary, scalar-reward reinforcement learning. We also call attention to a new class of reinforcement learning problems (namely those we call "modal" problems), which have so far not been given any systematic treatment in the reinforcement learning literature. In addition, we show that many of these problems can be solved effectively using reinforcement learning. This rules out the possibility that the tasks which cannot be expressed using Markovian reward functions are also impossible to learn effectively.

1. INTRODUCTION

To use reinforcement learning (RL) to solve a task, it is first necessary to encode that task as a reward function (Sutton & Barto, 2018). Usually, these reward functions are Markovian functions from state-action-next-state triples to the reals. In this paper, we study the expressivity of Markovian reward functions, and identify several limitations to what they can express. Specifically, we examine three classes of tasks, all of which are both intuitive to understand and useful in practical situations. We then show that almost all tasks in each of these three classes are impossible to express using Markovian reward functions. Moreover, we also show that many of these problems can be solved effectively with RL, either by providing references to existing literature, or by outlining a possible approach. This rules out the possibility that the tasks which cannot be expressed using Markovian reward functions are also impossible to learn effectively.

The first class of problems we look at, in Section 2, is the single-policy version of multi-objective RL (MORL). In such a problem, the agent receives multiple reward signals, and the aim is to learn a single policy that achieves an optimal trade-off of those rewards according to some criterion (Roijers et al., 2013; Liu et al., 2015). For example, a single-policy MORL algorithm might attempt to maximise the rewards lexicographically (Skalse et al., 2022b). We look at the question of which MORL problems can be reduced to ordinary RL, by providing a scalar reward function that induces the same preferences as the original MORL problem, and we provide a complete solution to this question, in the form of necessary and sufficient conditions. We find that this reduction is possible only for MORL problems that correspond to a linear weighting of the rewards, which means that it cannot be done for the vast majority of interesting MORL problems.
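As a concrete illustration of the distinction drawn above (a minimal sketch; the reward values, weights, and function names are made up for illustration and do not come from the paper), a linear scalarisation collapses vector rewards into a scalar with fixed weights, which is the kind of criterion that reduces to scalar RL, whereas a lexicographic criterion cannot be written this way:

```python
import numpy as np

# Hypothetical vector rewards for three actions under two objectives
# (illustrative numbers only).
vector_rewards = np.array([
    [1.0, 0.0],  # good on objective 1 only
    [0.0, 1.0],  # good on objective 2 only
    [0.6, 0.6],  # balanced across both objectives
])

def linear_scalarisation(r_vec, weights):
    """Collapse a vector reward into a scalar via a fixed linear weighting.
    Linear weightings are the MORL criteria that reduce to scalar,
    Markovian rewards."""
    return r_vec @ weights

def lex_better(r1, r2, eps=1e-9):
    """Lexicographic preference: objective 1 strictly dominates, with
    later objectives only breaking ties. This is not a linear weighting."""
    for a, b in zip(r1, r2):
        if a > b + eps:
            return True
        if b > a + eps:
            return False
    return False

weights = np.array([0.5, 0.5])
scalars = linear_scalarisation(vector_rewards, weights)
best = int(np.argmax(scalars))  # the balanced action wins under equal weights
```

Under equal weights the balanced action scores highest, while the lexicographic criterion prefers the first action; no single fixed weighting reproduces lexicographic preferences across all reward magnitudes, which is the intuition behind why such MORL problems cannot be reduced to scalar RL.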
The next class of problems we look at, in Section 3, is risk-sensitive RL. There are many contexts where it is desirable to be risk-averse. In economics and related fields, this is often modelled using utility functions U : R → R which are concave in some underlying quantity. Can the same thing be done with reward functions? Is it possible to take a reward function, and create a version of it that induces more risk-averse behaviour? We show that the answer is no: none of the standard risk-averse utility functions can be expressed using reward functions. This demonstrates another limitation in the expressive power of Markovian rewards.

The last class of problems we look at, in Section 4, is something we call modal tasks. These are tasks where the agent is evaluated not only based on which trajectories it generates, but also on what it could have done along those trajectories. For example, consider the instruction "you should always be able to return to the start state". We provide a formalisation of such tasks, argue that there are many situations in which they could be useful, and finally prove that these tasks also typically cannot be formalised using ordinary reward functions.

In Section 5, we discuss how to solve tasks from each of these classes using RL. We provide references to existing literature, and then sketch both an approach for learning a wide class of MORL problems, and an approach for learning a wide class of modal problems. Finally, in Section 6, we discuss the significance and limitations of our results, together with ways to extend them.
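The risk-aversion point from Section 3 can be illustrated numerically (a minimal sketch; the exponential utility and the reward values are hypothetical choices for illustration, not taken from the paper). A concave utility prefers a certain return over a fair lottery with the same mean, and because the utility is applied to the whole return rather than per step, it does not decompose the way a Markovian reward would require:

```python
import math

def utility(x):
    """A standard concave (risk-averse) utility, U(x) = 1 - e^(-x).
    Illustrative choice; any strictly concave U behaves the same way."""
    return 1.0 - math.exp(-x)

# Risk aversion: a certain return of 1.0 is preferred to a fair lottery
# over returns {0.0, 2.0}, even though both have the same expected return.
certain = utility(1.0)
lottery = 0.5 * utility(0.0) + 0.5 * utility(2.0)
# Jensen's inequality for concave U: E[U(X)] < U(E[X]).

# The obstacle to a Markovian encoding: the risk-averse objective applies
# U to the *whole* return, and U is nonlinear, so applying it per step
# yields a genuinely different quantity.
gamma = 0.9
whole_return = utility(1.0 + gamma * 1.0)           # U(r0 + gamma*r1)
per_step = utility(1.0) + gamma * utility(1.0)      # U(r0) + gamma*U(r1)
```

Since `whole_return` and `per_step` differ, per-step reward shaping with a concave function is not the same objective as maximising the utility of the return, which is the intuition the impossibility result makes precise.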

1.1. RELATED WORK

There have been a few recent papers which examine the expressivity of Markovian reward functions. The first of these is the work by Abel et al. (2021), who point to three different ways to formalise the notion of a "task" (namely, as a set of acceptable policies, as an ordering over policies, or as an ordering over trajectories). They then demonstrate that each of these classes contains at least one instance which cannot be expressed using a reward function (using the fact that the set of all optimal policies forms a convex set, and the fact that the reward function is Markovian). They also provide algorithms which compute reward functions for these types of tasks, by constructing a linear program. We extend their work by providing significantly stronger results. Another important paper is the work by Vamplew et al. (2022), who argue that there are many important aspects of intelligence which can be captured by MORL, but not by scalar RL. Like them, we argue that MORL is a genuine extension of scalar RL, but our approach is quite different. They focus on the question of whether MORL or (scalar) RL is a better foundation for the development of general intelligence (considering feasibility, safety, and so on), and provide qualitative arguments and biological evidence. By contrast, we are more narrowly focused on which incentive structures can be expressed by MORL and scalar RL, and our results are mathematical. There is also other relevant work that is less strongly related. For example, Icarte et al. (2022) point out that there are certain tasks which cannot be expressed using Markovian rewards, and propose a way to extend their expressivity by augmenting the reward function with an automaton that they call a reward machine. Similar approaches have also been used by e.g. Hasanbeig et al. (2020); Hammond et al. (2021). There are also other ways to extend Markovian rewards to a more general setting, such as convex RL, as studied by e.g. Hazan et al. (2019); Zhang et al. (2020); Zahavy et al. (2021); Geist et al. (2022); Mutti et al. (2022), and vectorial RL, as studied by e.g. Cheung (2019a;b). Also related is the work by Skalse et al. (2022c), who show that there are certain relationships that are never satisfied by any pair of reward functions. This paper can also be seen as relating to earlier work on characterising what kinds of preference structures can be expressed using utility functions, such as the famous work by von Neumann & Morgenstern (1947), and other work in game theory. There is a large literature on (the overlapping topics of) single-policy MORL, constrained RL, and risk-sensitive RL. Some notable examples of this work include Achiam et al. (2017); Chow et al. (2017); Miryoosefi et al. (2019); Tessler et al. (2019); Skalse et al. (2022b).

This existing literature typically focuses on the creation of algorithms for solving particular MORL problems, and has so far not tackled the problem of characterising when MORL problems can be reduced to scalar RL. Modal RL has (to the best of our knowledge) never been discussed explicitly in the literature before. However, it relates to some existing work, such as side-effect avoidance (Krakovna et al., 2018; 2020; Turner et al., 2020), and the work by Wang et al. (2020).

1.2. PRELIMINARIES

The standard RL setting is formalised using Markov Decision Processes (MDPs), which are tuples ⟨S, A, τ, µ0, R, γ⟩, where S is a set of states, A is a set of actions, τ : S × A ⇝ S is a transition function, µ0 is an initial state distribution over S, R : S × A × S ⇝ R is a reward function, where R(s, a, s′) is the reward obtained if the agent moves from state s to s′ by taking action a, and γ ∈ (0, 1) is a discount factor. Here, f : X ⇝ Y denotes a probabilistic mapping from X to Y. A state s is terminal if τ(s, a) = s and R(s, a, s) = 0 for all a. A trajectory ξ is a path s0, a0, s1, . . . in an
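The MDP tuple defined in the preliminaries above can be sketched as a minimal data structure (illustrative only; the class layout, state names, and action names are made-up choices, not part of the paper's formalism):

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    """Minimal MDP <S, A, tau, mu0, R, gamma> with a probabilistic
    transition function tau : S x A ~> S, matching the preliminaries."""
    states: list
    actions: list
    tau: dict       # tau[s][a] is a dict {s_next: probability}
    mu0: dict       # initial state distribution over S
    reward: object  # callable R(s, a, s_next) -> float
    gamma: float

    def step(self, s, a, rng=random):
        """Sample s' ~ tau(s, a) and return (s', R(s, a, s'))."""
        succs = self.tau[s][a]
        s_next = rng.choices(list(succs), weights=list(succs.values()))[0]
        return s_next, self.reward(s, a, s_next)

    def is_terminal(self, s):
        """Terminal state: every action self-loops with probability 1
        and yields reward 0, as in the definition above."""
        return all(self.tau[s][a] == {s: 1.0} and self.reward(s, a, s) == 0.0
                   for a in self.actions)

# A tiny two-state example (names are hypothetical):
mdp = MDP(
    states=["s0", "end"],
    actions=["a"],
    tau={"s0": {"a": {"end": 1.0}}, "end": {"a": {"end": 1.0}}},
    mu0={"s0": 1.0},
    reward=lambda s, a, s_next: 1.0 if s == "s0" else 0.0,
    gamma=0.9,
)
```

Here "end" satisfies the paper's terminal-state condition, while "s0" does not.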

