THREE PROBLEM CLASSES THAT MARKOV REWARDS CANNOT EXPRESS

Abstract

In this paper, we study the expressivity of Markovian reward functions and identify several limitations to what they can express. Specifically, we examine three classes of reinforcement learning tasks (multi-objective reinforcement learning, risk-averse reinforcement learning, and modal reinforcement learning), and prove that most of the tasks in each of these classes cannot be expressed using scalar, Markovian reward functions. In the process, we provide necessary and sufficient conditions for when a multi-objective reinforcement learning problem can be reduced to ordinary, scalar-reward reinforcement learning. We also call attention to a new class of reinforcement learning problems (namely those we call "modal" problems), which has so far not received systematic treatment in the reinforcement learning literature. Finally, we show that many of these problems can nonetheless be solved effectively using reinforcement learning. This rules out the possibility that the problems which cannot be expressed using Markovian reward functions are also impossible to learn effectively.

1. INTRODUCTION

To use reinforcement learning (RL) to solve a task, it is necessary to first encode that task using a reward function (Sutton & Barto, 2018). Usually, these reward functions are Markovian functions from state-action-next-state triples to the reals. In this paper, we study the expressivity of Markovian reward functions and identify several limitations to what they can express. Specifically, we examine three classes of tasks, all of which are both intuitive to understand and useful in practical situations. We then show that almost all tasks in each of these three classes are impossible to express using Markovian reward functions. Moreover, we show that many of these problems can nonetheless be solved effectively with RL, either by pointing to existing literature or by outlining a possible approach. This rules out the possibility that the problems which cannot be expressed using Markovian reward functions are also impossible to learn effectively.

The first class of problems we look at, in Section 2, is the single-policy version of multi-objective RL (MORL). In such a problem, the agent receives multiple reward signals, and the aim is to learn a single policy that achieves an optimal trade-off between those rewards according to some criterion (Roijers et al., 2013; Liu et al., 2015). For example, a single-policy MORL algorithm might attempt to maximise the rewards lexicographically (Skalse et al., 2022b). We examine which MORL problems can be reduced to ordinary RL by providing a scalar reward function that induces the same preferences as the original MORL problem, and we give a complete solution to this question in the form of necessary and sufficient conditions. We find that this reduction is possible only for MORL problems that correspond to a linear weighting of the rewards, which means it is impossible for the vast majority of interesting MORL problems.
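To make the scalarisation question concrete, the following sketch (purely illustrative; the numbers and policy summaries are invented for this example and are not part of the paper's formal setting) contrasts a linear weighting of two reward signals with a lexicographic criterion, and shows that the two orderings can disagree on a pair of policies:

```python
# Two candidate policies, each summarised by its expected return under
# two reward signals (r1, r2). These values are hypothetical.
policy_a = (1.0, 0.0)   # higher primary return, no secondary return
policy_b = (0.9, 10.0)  # slightly lower primary, much higher secondary

def linear_scalarisation(returns, weights):
    """Collapse a vector of returns into a scalar via a fixed linear weighting."""
    return sum(w * r for w, r in zip(weights, returns))

def lexicographic_prefers(x, y, tol=1e-9):
    """True iff x is preferred to y lexicographically (r1 first, then r2)."""
    if abs(x[0] - y[0]) > tol:
        return x[0] > y[0]
    return x[1] > y[1]

# Lexicographically, policy_a beats policy_b: its primary return is higher.
assert lexicographic_prefers(policy_a, policy_b)

# But a linear weighting with non-negligible weight on r2 ranks policy_b
# higher, because the gap in r2 outweighs the gap in r1. Since the r2 gap
# can be made arbitrarily large, no single fixed weighting reproduces the
# lexicographic ordering across all pairs of policies.
weights = (1.0, 0.5)
assert linear_scalarisation(policy_b, weights) > linear_scalarisation(policy_a, weights)
```

The design point is that a linear weighting trades the objectives off at a fixed rate, whereas a lexicographic criterion tolerates no trade-off against the primary objective, which is why it falls outside the linearly-expressible class identified in the paper.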
The next class of problems we look at, in Section 3, is risk-sensitive RL. There are many contexts where it is desirable to be risk-averse. In economics and related fields, this is often modelled using utility functions U : R → R that are concave in some underlying quantity. Can the same thing be done with reward functions? Is it possible to take a reward function and create a version of it that induces more risk-averse behaviour? We show that the answer is no: none of the standard risk-averse utility functions can be expressed using reward functions. This demonstrates another limitation of the expressive power of Markovian rewards.
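As a quick illustration of the economics-style modelling mentioned above (a generic sketch of concave-utility risk aversion via Jensen's inequality, not a construction from this paper), a concave utility such as U(x) = √x values a risky lottery below a certain payoff with the same mean:

```python
import math

def expected_utility(outcomes, probs, utility):
    """Expected utility of a lottery under a given utility function."""
    return sum(p * utility(x) for x, p in zip(outcomes, probs))

# A 50/50 lottery over payouts 0 and 100, whose expected payout is 50.
outcomes, probs = [0.0, 100.0], [0.5, 0.5]
concave_u = math.sqrt  # concave utility => risk aversion (Jensen's inequality)

eu_lottery = expected_utility(outcomes, probs, concave_u)  # 0.5*0 + 0.5*10 = 5.0
u_certain = concave_u(50.0)                                # sqrt(50), about 7.07

# A certain payoff of 50 is preferred to the risky lottery with mean 50:
# E[U(X)] < U(E[X]) for concave U.
assert u_certain > eu_lottery
```

The question the section raises is whether this kind of concave reshaping of preferences can be reproduced by modifying a Markovian reward function, and the paper's answer is that it cannot.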

