THREE PROBLEM CLASSES THAT MARKOV REWARDS CANNOT EXPRESS

Abstract

In this paper, we study the expressivity of Markovian reward functions, and identify several limitations to what they can express. Specifically, we look at three classes of reinforcement learning tasks (multi-objective reinforcement learning, risk-averse reinforcement learning, and modal reinforcement learning), and then prove mathematically that most of the tasks in each of these classes cannot be expressed using scalar, Markovian reward functions. In the process, we provide necessary and sufficient conditions for when a multi-objective reinforcement learning problem can be reduced to ordinary, scalar reward reinforcement learning. We also call attention to a new class of reinforcement learning problems (namely those we call "modal" problems), which have so far not been given any systematic treatment in the reinforcement learning literature. In addition, we also show that many of these problems can be solved effectively using reinforcement learning. This rules out the possibility that those problems which cannot be expressed using Markovian reward functions also are impossible to learn effectively.

1. INTRODUCTION

To use reinforcement learning (RL) to solve a task, it is necessary to first encode that task using a reward function (Sutton & Barto, 2018). Usually, these reward functions are Markovian functions from state-action-next-state triples to reals. In this paper, we study the expressivity of Markovian reward functions, and identify several limitations to what they can express. Specifically, we will examine three classes of tasks, all of which are both intuitive to understand and useful in practical situations. We will then show that almost all tasks in each of these three classes are impossible to express using Markovian reward functions. Moreover, we also show that many of these problems can be solved effectively with RL, either by providing references to existing literature, or by providing an outline of a possible approach. This rules out the possibility that the problems which cannot be expressed using Markovian reward functions are also impossible to learn effectively.
The first class of problems we look at, in Section 2, is the single-policy version of multi-objective RL (MORL). In such a problem, the agent receives multiple reward signals, and the aim is to learn a single policy that achieves an optimal trade-off of those rewards according to some criterion (Roijers et al., 2013; Liu et al., 2015). For example, a single-policy MORL algorithm might attempt to maximise the rewards lexicographically (Skalse et al., 2022b). We will look at the question of which MORL problems can be reduced to ordinary RL, by providing a scalar reward function that induces the same preferences as the original MORL problem. Moreover, we will provide a complete solution to this problem, in the form of necessary and sufficient conditions. We find that this can only be done for MORL problems that correspond to a linear weighting of the rewards, which means that it cannot be done for the vast majority of all interesting MORL problems.
The next class of problems we look at, in Section 3, is risk-sensitive RL. There are many contexts where it is desirable to be risk-averse. In economics, and related fields, this is often modelled using utility functions $U : \mathbb{R} \to \mathbb{R}$ which are concave in some underlying quantity. Can the same thing be done with reward functions? Is it possible to take a reward function, and then create a version of that reward function which induces more risk-averse behaviour? We show that the answer is no: none of the standard risk-averse utility functions can be expressed using reward functions. This demonstrates another limitation in the expressive power of Markovian rewards.
The last class of problems we look at, in Section 4, is something we call modal tasks. These are tasks where the agent is evaluated not only based on what trajectories it generates, but also based on what it could have done along those trajectories. For example, consider the instruction "you should always be able to return to the start state". We provide a formalisation of such tasks, argue that there are many situations in which these tasks could be useful, and finally prove that these tasks also typically cannot be formalised using ordinary reward functions.
In Section 5, we discuss how to solve tasks from each of these classes using RL. We provide references to existing literature, and then sketch both an approach for learning a wide class of MORL problems, and an approach for learning a wide class of modal problems. Finally, in Section 6, we discuss the significance and limitations of our results, together with ways to extend them.

1.1. RELATED WORK

There have been a few recent papers which examine the expressivity of Markovian reward functions. The first of these is the work by Abel et al. (2021), who point to three different ways to formalise the notion of a "task" (namely, as a set of acceptable policies, as an ordering over policies, or as an ordering over trajectories). They then demonstrate that each of these classes contains at least one instance which cannot be expressed using a reward function (by using the fact that the set of all optimal policies forms a convex set, and the fact that the reward function is Markovian). They also provide algorithms which compute reward functions for these types of tasks, by constructing a linear program. We extend their work by showing that most tasks in three broad classes cannot be expressed, rather than exhibiting individual inexpressible instances.
Another important paper is the work by Vamplew et al. (2022), who argue that there are many important aspects of intelligence which can be captured by MORL, but not by scalar RL. Like them, we also argue that MORL is a genuine extension of scalar RL, but our approach is quite different. They focus on the question of whether MORL or (scalar) RL is a better foundation for the development of general intelligence (considering feasibility, safety, etc.), and they provide qualitative arguments and biological evidence. By contrast, we are more narrowly focused on what incentive structures can be expressed by MORL and scalar RL, and our results are mathematical.
There is also other relevant work that is less strongly related. For example, Icarte et al. (2022) point out that there are certain tasks which cannot be expressed using Markovian rewards, and propose a way to extend their expressivity by augmenting the reward function with an automaton that they call a reward machine. Similar approaches have also been used by e.g. Hasanbeig et al. (2020); Hammond et al. (2021). There are also other ways to extend Markovian rewards to a more general setting, such as convex RL, as studied by e.g. Hazan et al. (2019); Zhang et al. (2020); Zahavy et al. (2021); Geist et al. (2022); Mutti et al. (2022), and vectorial RL, as studied by e.g. Cheung (2019a;b). Also related is the work by Skalse et al. (2022c), who show that there are certain relationships that are never satisfied by any pair of reward functions. This paper can also be seen as relating to earlier work on characterising what kinds of preference structures can be expressed using utility functions, such as the famous work by von Neumann & Morgenstern (1947), and other work in game theory.
There is a large literature on (the overlapping topics of) single-policy MORL, constrained RL, and risk-sensitive RL. Some notable examples of this work include Achiam et al. (2017); Chow et al. (2017); Miryoosefi et al. (2019); Tessler et al. (2019); Skalse et al. (2022b). This existing literature typically focuses on the creation of algorithms for solving particular MORL problems, and has so far not tackled the problem of characterising when MORL problems can be reduced to scalar RL. Modal RL has (to the best of our knowledge) never been discussed explicitly in the literature before. However, it relates to some existing work, such as side-effect avoidance (Krakovna et al., 2018; 2020; Turner et al., 2020), and the work by Wang et al. (2020).

1.2. PRELIMINARIES

The standard RL setting is formalised using Markov Decision Processes (MDPs), which are tuples $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ where $S$ is a set of states, $A$ is a set of actions, $\tau : S \times A \rightsquigarrow S$ is a transition function, $\mu_0$ is an initial state distribution over $S$, $R : S \times A \times S \rightsquigarrow \mathbb{R}$ is a reward function, where $R(s, a, s')$ is the reward obtained if the agent moves from state $s$ to $s'$ by taking action $a$, and $\gamma \in (0, 1)$ is a discount factor. Here, $f : X \rightsquigarrow Y$ denotes a probabilistic mapping $f$ from $X$ to $Y$. A state $s$ is terminal if $\tau(s, a) = s$ and $R(s, a, s) = 0$ for all $a$. A trajectory $\xi$ is a path $s_0, a_0, s_1, \dots$ in an MDP that is possible according to $\mu_0$ and $\tau$. We use $G$ to denote the trajectory return function, where $G(\xi) = \sum_{t=0}^{\infty} \gamma^t r_t$ and $r_t$ is the reward obtained at step $t$ of $\xi$. A policy is a mapping $\pi : S \rightsquigarrow A$, and $\Pi$ is the set of all policies. Given a policy $\pi$, its value function $V^\pi : S \to \mathbb{R}$ is the function where $V^\pi(s)$ is the expected future discounted reward when following $\pi$ from $s$, and its Q-function $Q^\pi : S \times A \to \mathbb{R}$ is $Q^\pi(s, a) = \mathbb{E}_{S' \sim \tau(s,a)}[R(s, a, S') + \gamma \cdot V^\pi(S')]$. The policy evaluation function $J : \Pi \to \mathbb{R}$ is $J(\pi) = \mathbb{E}_{S_0 \sim \mu_0}[V^\pi(S_0)]$. If a policy maximises $J$, then we say that this policy is optimal. We denote optimal policies by $\pi^\star$, and their value function and Q-function by $V^\star$ and $Q^\star$. Moreover, given an MDP $\mathcal{M}$, we say that $\mathcal{M}$'s policy order is the ordering $\prec$ on $\Pi$ induced by $\pi_1 \prec \pi_2 \iff J(\pi_1) < J(\pi_2)$ for all $\pi_1, \pi_2$. For a more comprehensive overview, see Sutton & Barto (2018).
In this paper, we will say that a reward function $R$ is trivial if $J(\pi_1) = J(\pi_2)$ for all $\pi_1, \pi_2$. Moreover, we say that $R_1$ and $R_2$ are equivalent if $J_1(\pi_1) < J_1(\pi_2) \iff J_2(\pi_1) < J_2(\pi_2)$ for all $\pi_1, \pi_2$, and that they are opposites if $J_1(\pi_1) < J_1(\pi_2) \iff J_2(\pi_1) > J_2(\pi_2)$ for all $\pi_1, \pi_2$.
MORL problems are formalised using Multi-Objective MDPs (MOMDPs), which are tuples $\langle S, A, \tau, \mu_0, \vec{R}, \gamma \rangle$. The only place where MOMDPs differ from MDPs is $\vec{R}$, which is a function $\vec{R} : S \times A \times S \rightsquigarrow \mathbb{R}^k$ that, for each transition $\langle s, a, s' \rangle$, returns $k$ different rewards (for some $k$). We denote the reward function that returns the $i$'th component of $\vec{R}$ as $R_i$, and use $V^\pi_i$, $Q^\pi_i$, $J_i$, $G_i$, etc., to refer to its value function, Q-function, evaluation function, return function, etc. Since there may not be any single policy which maximises each component of $\vec{R}$, a MORL problem additionally needs a rule for how to combine and trade off each reward.
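As an illustration of the definitions above, the following minimal sketch computes $V^\pi$ and $J(\pi)$ for a small tabular MDP by repeatedly applying the Bellman equation. The two-state environment, the dictionary encoding of $\tau$ and $\pi$, and the names `evaluate_policy`, `tau`, `reward`, `stay` and `mix` are all hypothetical choices made for this example; they do not come from the paper.

```python
from typing import Callable, Dict, List, Tuple

State, Action = int, int
# tau[s][a] is a list of (next_state, probability) pairs.
Transition = Dict[State, Dict[Action, List[Tuple[State, float]]]]
Policy = Dict[State, Dict[Action, float]]


def evaluate_policy(
    tau: Transition,
    reward: Callable[[State, Action, State], float],
    policy: Policy,
    mu0: Dict[State, float],
    gamma: float,
    sweeps: int = 2000,
) -> Tuple[Dict[State, float], float]:
    """Approximate V^pi by fixed-point iteration, then return (V^pi, J(pi))."""
    V = {s: 0.0 for s in tau}
    for _ in range(sweeps):
        # Synchronous Bellman backup: the right-hand side uses the old V.
        V = {
            s: sum(
                policy[s][a] * p * (reward(s, a, s2) + gamma * V[s2])
                for a in tau[s]
                for s2, p in tau[s][a]
            )
            for s in tau
        }
    J = sum(mu0[s] * V[s] for s in mu0)
    return V, J


# Hypothetical toy MDP: state 0 has a rewarding self-loop (action 0) and an
# exit (action 1) to state 1, which is terminal in the sense of Section 1.2.
tau: Transition = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(1, 1.0)], 1: [(1, 1.0)]},
}


def reward(s: State, a: Action, s2: State) -> float:
    return 1.0 if (s == 0 and a == 0) else 0.0


mu0 = {0: 1.0}
stay: Policy = {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
mix: Policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}

_, J_stay = evaluate_policy(tau, reward, stay, mu0, gamma=0.9)
_, J_mix = evaluate_policy(tau, reward, mix, mu0, gamma=0.9)
print(J_stay, J_mix)
```

In this toy example the policy order of the MDP places `mix` strictly below `stay`, since $J$ is smaller for the policy that sometimes leaves the rewarding loop.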

1.3. A REMARK ON "TASKS"

In order to determine if a given task can be expressed by Markovian reward functions, we must first determine what it means for a reward function to express a task. One answer to this question is to say that a task corresponds to a desired policy π, and that a reward function R expresses the task if π is optimal under R (possibly with the additional requirement that π is the only policy that is optimal under R). With this definition, we find that any task can be expressed as a Markovian reward function, at least as long as π is stationary and deterministic (see Appendix B). Another possible definition is to say that a task corresponds to an ordering ≺ on Π, which encodes a preference ordering over all policies, and that a reward function R expresses the task if J orders Π according to ≺. It is primarily this latter definition that we will use in this paper. The main reason for this is that it often is impossible to find the optimal policy in complex environments. This means that it is not enough for R to have the right optimal policy; it must also induce the right preferences between the (sub-optimal) policies that the policy optimisation algorithm actually considers. The only way to robustly ensure that this is the case is if R induces the right policy ordering. These are not the only two reasonable definitions. As mentioned previously, more definitions can be found in Abel et al. (2021) .

2. MULTI-OBJECTIVE REINFORCEMENT LEARNING

In this section, we examine the MORL setting. We first need a general definition of what a single-policy MORL problem is. Recall that a MOMDP $\langle S, A, \tau, \mu_0, \vec{R}, \gamma \rangle$ by itself has no one canonical objective to maximise. We therefore introduce the notion of a MORL objective:
Definition 1. A MORL objective over $k$ rewards is a function $O$ that takes $k$ policy evaluation functions $J_1, \dots, J_k$ and returns a (total) ordering $\prec_O$ over the set of all policies $\Pi$.
Given a MOMDP $\mathcal{M} = \langle S, A, \tau, \mu_0, \vec{R}, \gamma \rangle$, a MORL objective $O$ gives us an ordering over $\Pi$ that tells us when a policy is preferred over another. We use $\prec^{\mathcal{M}}_O$ to denote the policy ordering that is obtained when we apply $O$ to $\mathcal{M}$'s policy evaluation functions. For the purposes of this paper, we will not need to impose any further requirements on $\prec_O$. For example, we will not insist that $\prec_O$ must have a greatest element in $\Pi$, or that $\pi_1 \prec_O \pi_2$ whenever $\pi_2$ is a Pareto improvement over $\pi_1$, etc., even though a reasonable MORL objective presumably would have these properties. We next give a few examples of some interesting MORL objectives:
Definition 2. Given $J_1, \dots, J_k$, the LexMax objective $\prec_{\mathrm{Lex}}$ is given by $\pi_1 \prec_{\mathrm{Lex}} \pi_2$ if and only if there is an $i \in \{1, \dots, k\}$ such that $J_i(\pi_1) < J_i(\pi_2)$, and $J_j(\pi_1) = J_j(\pi_2)$ for $j < i$.
Definition 3. Given $J_1, \dots, J_k$, the MaxMin objective $\prec_{\mathrm{Min}}$ is given by $\pi_1 \prec_{\mathrm{Min}} \pi_2 \iff \min_i J_i(\pi_1) < \min_i J_i(\pi_2)$.
Definition 4. Given $J_1, \dots, J_k$ and some $c_1, \dots, c_k \in \mathbb{R}$, the MaxSat objective $\prec_{\mathrm{Sat}}$ is given by $\pi_1 \prec_{\mathrm{Sat}} \pi_2$ if and only if the number of rewards that satisfy $J_i(\pi_1) \geq c_i$ is smaller than the number of rewards that satisfy $J_i(\pi_2) \geq c_i$.
Definition 5. Given $J_1, J_2$ and some $c \in \mathbb{R}$, the ConSat objective $\prec_{\mathrm{Con}}$ is given by $\pi_1 \prec_{\mathrm{Con}} \pi_2$ if and only if either $J_1(\pi_1) < c$ and $J_1(\pi_1) < J_1(\pi_2)$, or $J_1(\pi_1), J_1(\pi_2) \geq c$ and $J_2(\pi_1) < J_2(\pi_2)$.
In other words, the LexMax objective has lexicographic preferences over $R_1, \dots, R_k$, so that policies are first ordered by their expected discounted $R_1$-reward, and then policies that obtain the same expected discounted $R_1$-reward are ordered by their expected discounted $R_2$-reward, and so on. The MaxMin objective orders policies by their worst performance according to any of $R_1, \dots, R_k$ (which could be used to obtain worst-case guarantees). The MaxSat objective only cares about whether a policy reaches a certain threshold for each reward, and ranks policies based on how many thresholds they reach. The ConSat objective wants to maximise $J_2$, but under the constraint that $J_1$ reaches a certain threshold. These MORL objectives are simply a short list of illustrative examples, demonstrating the flexibility of the framework. A few more examples are given in Appendix D. We next need to define what it means to reduce a MORL problem to a (scalar) RL problem:
Definition 6. A MOMDP $\mathcal{M} = \langle S, A, \tau, \mu_0, \vec{R}, \gamma \rangle$ with objective $O$ is equivalent to the MDP $\tilde{\mathcal{M}} = \langle S, A, \tau, \mu_0, \tilde{R}, \gamma \rangle$ if and only if $\tilde{\mathcal{M}}$'s policy order is $\prec^{\mathcal{M}}_O$.
Note that $\tilde{\mathcal{M}}$ must have the same states, actions, transition function, initial state distribution, and discount factor as $\mathcal{M}$. This definition therefore says that $\mathcal{M}$ with $O$ is equivalent to $\tilde{\mathcal{M}}$ if $\tilde{\mathcal{M}}$ is given by replacing $\vec{R} = \langle R_1, \dots, R_k \rangle$ with a single reward function $\tilde{R}$, and $\tilde{R}$ induces the same preferences between all policies as $O(J_1, \dots, J_k)$. We can now derive necessary and sufficient conditions for when a MORL problem can be reduced to a scalar-reward RL problem.
Theorem 1. If a MOMDP $\mathcal{M} = \langle S, A, \tau, \mu_0, \vec{R}, \gamma \rangle$ with objective $O$ is equivalent to an MDP $\tilde{\mathcal{M}} = \langle S, A, \tau, \mu_0, \tilde{R}, \gamma \rangle$ with policy evaluation function $\tilde{J}$, then $\tilde{J}(\pi) = \sum_{i=1}^{k} w_i \cdot J_i(\pi)$ for some $w_1, \dots, w_k \in \mathbb{R}$. Moreover, $\mathcal{M}$ with $O$ is also equivalent to the MDP with reward $\hat{R}(s, a, s') = \sum_{i=1}^{k} w_i \cdot R_i(s, a, s')$.
Proof. Suppose $\mathcal{M}$ with $O$ is equivalent to an MDP $\tilde{\mathcal{M}} = \langle S, A, \tau, \mu_0, \tilde{R}, \gamma \rangle$, and let $\tilde{J}$ be its policy evaluation function.
First, let $m : \Pi \to \mathbb{R}^{|S||A|}$ be the function that maps each policy $\pi$ to the $|S||A|$-dimensional vector where $m(\pi)[s, a] = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{P}_{\xi \sim \pi}(S_t = s, A_t = a)$. Moreover, for a reward function $R$, let $\vec{R} \in \mathbb{R}^{|S||A|}$ be the $|S||A|$-dimensional vector where $\vec{R}[s, a] = \mathbb{E}_{S' \sim \tau(s,a)}[R(s, a, S')]$. Note that we now have that $J(\pi) = m(\pi) \cdot \vec{R}$, for any reward function $R$. Recall also that taking the dot product with a fixed $|S||A|$-dimensional vector induces a linear function over $\mathbb{R}^{|S||A|}$. This means that, for any reward function $R$, we can express its policy evaluation function $J : \Pi \to \mathbb{R}$ as $L \circ m$, where $L$ is a linear function. In particular, writing $\tilde{J}$ for the policy evaluation function of the scalar MDP $\tilde{\mathcal{M}}$, we have $\tilde{J} = L \circ m$, and $J_i = L_i \circ m$ for each of $R_1, \dots, R_k$. From the definition of MORL objectives, we have that $\tilde{J}(\pi)$ is a function of $J_1(\pi), \dots, J_k(\pi)$. This, in turn, means that $L(v)$ is a function of $L_1(v), \dots, L_k(v)$, for any $v \in \mathrm{Im}(m)$. Let $M$ be the $(|S||A| \times k)$-dimensional matrix that maps each vector $v \in \mathbb{R}^{|S||A|}$ to $\langle L_1(v), \dots, L_k(v) \rangle$ (in other words, the matrix whose rows are $\vec{R}_1, \dots, \vec{R}_k$). Since $L(v)$ is a function of $L_1(v), \dots, L_k(v)$, we have that $L$ can be expressed as $f \circ M$ for some function $f$. Since $L$ is a linear function, and since $M$ is a linear transformation, we have that $f$ must be a linear function as well. This means that there are $w_1, \dots, w_k \in \mathbb{R}$ such that $f(x) = \sum_{i=1}^{k} w_i \cdot x_i$, which implies that $L(v) = \sum_{i=1}^{k} w_i \cdot L_i(v)$, and further that $\tilde{J}(\pi) = \sum_{i=1}^{k} w_i \cdot J_i(\pi)$. This completes the first part. Next, let $\hat{R}(s, a, s') = \sum_{i=1}^{k} w_i \cdot R_i(s, a, s')$, and let $\hat{J}$ be its policy evaluation function. Straightforward algebra shows that $\hat{J}(\pi) = \sum_{i=1}^{k} w_i \cdot J_i(\pi)$. Now, since $\hat{J} = \tilde{J}$, and since $\mathcal{M}$ with $O$ is equivalent to $\tilde{\mathcal{M}}$, we have that $\mathcal{M}$ with $O$ is equivalent to the MDP with reward $\hat{R}$. This completes the second part.
This theorem effectively tells us that only linear MORL objectives can be represented using scalar-reward RL! This imposes a harsh limitation on what kinds of tasks can be encoded using scalar rewards.
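The linearity that drives the proof of Theorem 1 can be checked numerically. The sketch below builds a small randomly generated MOMDP with $k = 2$, computes a truncated version of the occupancy vector $m(\pi)$, and confirms that the scalar reward $\sum_i w_i \cdot R_i$ has evaluation function $\sum_i w_i \cdot J_i$. Everything specific here is an assumption for illustration: the environment (`NEXT`, `R_VEC`), the weights `w`, deterministic transitions (so that rewards can be indexed by state-action pairs), a fixed start state, and truncation of the infinite sum at a finite horizon.

```python
import random

GAMMA = 0.9
STATES, ACTIONS = range(3), range(2)
random.seed(0)

# Hypothetical toy MOMDP: deterministic transitions, k = 2 reward components.
NEXT = {(s, a): random.choice(list(STATES)) for s in STATES for a in ACTIONS}
R_VEC = {sa: (random.uniform(-1, 1), random.uniform(-1, 1)) for sa in NEXT}


def occupancy(policy, start=0, horizon=400):
    """Truncated m(pi)[s, a] = sum_t gamma^t * P(S_t = s, A_t = a)."""
    m = {sa: 0.0 for sa in NEXT}
    dist = {s: (1.0 if s == start else 0.0) for s in STATES}
    for t in range(horizon):
        nxt = {s: 0.0 for s in STATES}
        for s in STATES:
            for a in ACTIONS:
                p = dist[s] * policy[s][a]
                m[(s, a)] += (GAMMA ** t) * p
                nxt[NEXT[(s, a)]] += p
        dist = nxt
    return m


def J(policy, reward):
    """J(pi) as the dot product of m(pi) with the reward vector."""
    m = occupancy(policy)
    return sum(m[sa] * reward[sa] for sa in m)


w = (0.3, -1.2)  # arbitrary weights chosen for the example
R1 = {sa: r[0] for sa, r in R_VEC.items()}
R2 = {sa: r[1] for sa, r in R_VEC.items()}
R_w = {sa: w[0] * r[0] + w[1] * r[1] for sa, r in R_VEC.items()}

max_gap = 0.0
for _ in range(20):
    # Random stochastic policy: policy[s][a] is the probability of action a.
    policy = {}
    for s in STATES:
        p = random.random()
        policy[s] = (p, 1.0 - p)
    gap = abs(J(policy, R_w) - (w[0] * J(policy, R1) + w[1] * J(policy, R2)))
    max_gap = max(max_gap, gap)

print(max_gap)  # only floating-point error remains
```

This only illustrates the easy direction (a linear weighting of rewards yields a linear weighting of evaluation functions); the converse is what the proof establishes.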
Theorem 1 also has the following corollary, which is useful for demonstrating when some MORL objective cannot be expressed using scalar reward functions. Given an ordering $\prec$ over $\Pi$ dependent on some evaluation functions $J_1, \dots, J_k$, we say that a function $U : \Pi \to \mathbb{R}$ represents $\prec$ if $U(\pi_1) < U(\pi_2) \iff \pi_1 \prec \pi_2$. We say that $U$ is a linear representation if $U(\pi) = f(\sum_{i=1}^{k} w_i \cdot J_i(\pi))$ for some $w_1, \dots, w_k \in \mathbb{R}$ and some $f$ that is strictly monotonic; otherwise, we say that $U$ is a non-linear representation.
Corollary 1. If $O(J_1, \dots, J_k)$ has a non-linear representation $U$, and $\mathcal{M}$ is a MOMDP whose J-functions are $J_1, \dots, J_k$, then $\mathcal{M}$ with $O$ is not equivalent to any MDP.
Proof. Assume for contradiction that $\mathcal{M}$ with $O$ is equivalent to the MDP $\tilde{\mathcal{M}} = \langle S, A, \tau, \mu_0, \tilde{R}, \gamma \rangle$, with policy evaluation function $\tilde{J}$. Then $\tilde{J}$ represents $O(J_1, \dots, J_k)$, and this in turn means that $U$ must be strictly monotonic in $\tilde{J}$. Moreover, Theorem 1 implies that $\tilde{J} = \sum_{i=1}^{k} w_i \cdot J_i$ for some $w_1, \dots, w_k \in \mathbb{R}$. However, this contradicts our assumption that $U$ is a non-linear representation.
Therefore, we can prove that $\mathcal{M}$ with $O$ is not equivalent to any MDP by finding a non-linear representation of $\prec^{\mathcal{M}}_O$. We will now show that none of the MORL objectives given in Definitions 2-5 can be expressed using single-objective RL, except in a few degenerate edge cases.
Theorem 2. There is no MDP equivalent to $\mathcal{M}$ with LexMax, as long as $\mathcal{M}$ has at least two reward functions that are neither trivial, equivalent, nor opposites.
Proof. Suppose $\mathcal{M}$ with LexMax is equivalent to $\tilde{\mathcal{M}} = \langle S, A, \tau, \mu_0, \tilde{R}, \gamma \rangle$, with policy evaluation function $\tilde{J}$. Let $i$ be the smallest number such that $R_i$ is non-trivial, and let $j$ be the smallest number greater than $i$ such that $R_j$ is non-trivial, and not equivalent to or opposite of $R_i$. Then there are $\pi_1, \pi_2$ such that $J_i(\pi_1) = J_i(\pi_2)$ and $J_j(\pi_1) < J_j(\pi_2)$, which means that $\pi_1 \prec^{\mathcal{M}}_{\mathrm{Lex}} \pi_2$. Moreover, since $\tilde{J}$ represents $\prec^{\mathcal{M}}_{\mathrm{Lex}}$, it follows that there are no $\pi, \pi'$ such that $J_i(\pi) < J_i(\pi')$ and $\tilde{J}(\pi) > \tilde{J}(\pi')$. Then Theorem 1 in Skalse et al. (2022c) implies that $R_i$ is equivalent to $\tilde{R}$. However, then $\tilde{J}(\pi_1) = \tilde{J}(\pi_2)$, which means that $\tilde{J}$ cannot represent $\prec^{\mathcal{M}}_{\mathrm{Lex}}$.
Theorem 3. There is no MDP equivalent to $\mathcal{M}$ with MaxMin, unless $\mathcal{M}$ has a reward function $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $j \in \{1, \dots, k\}$ and all $\pi$.

Proof. $\prec^{\mathcal{M}}_{\mathrm{Min}}$ is represented by the function $U(\pi) = \min_i J_i(\pi)$. Moreover, if $\mathcal{M}$ has no reward function $R_i$ such that $J_i(\pi) \leq J_j(\pi)$ for all $j \in \{1, \dots, k\}$ and all $\pi$, then this representation is non-linear. Corollary 1 then implies that $\mathcal{M}$ with MaxMin is not equivalent to any MDP.
Theorem 4. There is no MDP equivalent to $\mathcal{M}$ with MaxSat, as long as $\mathcal{M}$ has at least one reward $R_i$ where $J_i(\pi_1) < c_i$ and $J_i(\pi_2) \geq c_i$ for some $\pi_1, \pi_2 \in \Pi$.
Proof. Note that $\prec^{\mathcal{M}}_{\mathrm{Sat}}$ is represented by the function $U(\pi) = \sum_{i=1}^{k} \mathbb{1}[J_i(\pi) \geq c_i]$, where $\mathbb{1}[J_i(\pi) \geq c_i]$ is the function that is equal to 1 when $J_i(\pi) \geq c_i$, and 0 otherwise. Moreover, $U$ is not strictly monotonic in any function that is linear in $J_1, \dots, J_k$. Corollary 1 thus implies that $\mathcal{M}$ with MaxSat is not equivalent to any MDP.
Theorem 5. There is no MDP equivalent to $\mathcal{M}$ with ConSat, unless either $R_1$ and $R_2$ are equivalent, or $\max_\pi J_1(\pi) \leq c$.
Proof. $\prec^{\mathcal{M}}_{\mathrm{Con}}$ is represented by $U(\pi) = \begin{cases} J_1(\pi) & \text{if } J_1(\pi) \leq c, \\ J_2(\pi) - \min_{\pi'} J_2(\pi') + c & \text{otherwise.} \end{cases}$ Moreover, this representation is non-linear, unless either $R_1$ and $R_2$ are equivalent, or $\max_\pi J_1(\pi) \leq c$. Corollary 1 then implies that $\mathcal{M}$ with ConSat is not equivalent to any MDP.
Theorems 2-5 show that none of the MORL objectives given in Definitions 2-5 can be expressed using single-objective RL, except in a few degenerate cases where those MORL objectives are uninteresting. This demonstrates that there is no satisfactory way to reduce MORL problems to scalar-reward RL (and hence that scalar RL is unable to express many natural task specifications).
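The objectives of Definitions 2-5 can be written as comparison functions on vectors of $J$-values, and Theorem 3 can be illustrated on a very small instance. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, MaxSat follows the representation $U$ used in the proof of Theorem 4 (reaching more thresholds is better), and the three $J$-vectors are those of a hypothetical single-state MOMDP with two actions, rewards $(1, 0)$ and $(0, 1)$, and $\gamma = 0.9$, for the two deterministic policies and their 50/50 mixture. The grid search is a sanity check, not a proof.

```python
from typing import Sequence


def lexmax_less(j1: Sequence[float], j2: Sequence[float]) -> bool:
    """pi_1 < pi_2 under LexMax: the first differing component decides."""
    for a, b in zip(j1, j2):
        if a != b:
            return a < b
    return False


def maxmin_less(j1: Sequence[float], j2: Sequence[float]) -> bool:
    """pi_1 < pi_2 under MaxMin: compare the worst component."""
    return min(j1) < min(j2)


def maxsat_less(j1: Sequence[float], j2: Sequence[float], c: Sequence[float]) -> bool:
    """pi_1 < pi_2 under MaxSat: pi_1 reaches fewer thresholds than pi_2."""
    def count(j: Sequence[float]) -> int:
        return sum(x >= ci for x, ci in zip(j, c))
    return count(j1) < count(j2)


def consat_less(j1: Sequence[float], j2: Sequence[float], c: float) -> bool:
    """pi_1 < pi_2 under ConSat (k = 2), following Definition 5."""
    below = j1[0] < c and j1[0] < j2[0]
    both_ok = j1[0] >= c and j2[0] >= c and j1[1] < j2[1]
    return below or both_ok


# (J_1, J_2) for: always action 0, always action 1, and a 50/50 mixture.
J_A, J_B, J_MIX = (10.0, 0.0), (0.0, 10.0), (5.0, 5.0)


def linear_matches_maxmin(w: Sequence[float]) -> bool:
    """Does the weighting w also rank the mixture strictly above both others?"""
    def score(j: Sequence[float]) -> float:
        return w[0] * j[0] + w[1] * j[1]
    return score(J_A) < score(J_MIX) and score(J_B) < score(J_MIX)


grid = [x / 4.0 for x in range(-20, 21)]
found = any(linear_matches_maxmin((w1, w2)) for w1 in grid for w2 in grid)
print(maxmin_less(J_A, J_MIX), maxmin_less(J_B, J_MIX), found)
```

MaxMin strictly prefers the mixture to both deterministic policies, whereas any linear weighting scores the mixture as the average of the other two, so it can never rank it strictly above both. This is the kind of non-linearity that Corollary 1 exploits.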

3. RISK-SENSITIVE REINFORCEMENT LEARNING

The next area we will look at is that of risk-sensitive reinforcement learning. An ordinary RL agent tries to maximise the expectation of its reward function. However, there are many cases where it is natural to want the agent to be risk-averse. In economics, risk-aversion is typically modelled by using utility functions $U(c)$ that are concave in some relevant quantity $c$ (which might be money, for example). A natural question is then whether a similar trick may be used with reward functions. That is, given a reward function $R_1$ and a concave function $f$, can we construct a reward function $R_2$ such that $G_2(\xi) = f(G_1(\xi))$ for all trajectories $\xi$? We will examine this question.
Some of the most common risk-averse utility functions include exponential utility, isoelastic utility, and quadratic utility. The exponential utility function is given by $U(c) = -e^{\alpha c}$, where $\alpha > 0$ is a parameter controlling the degree of risk aversion. The isoelastic utility function is given by $U(c) = c^{1-\alpha}$, for $\alpha > 0$, $\alpha \neq 1$, or by $U(c) = \ln(c)$ (corresponding to the case when $\alpha = 1$). The quadratic utility function is given by $U(c) = c - \alpha c^2$, where $\alpha > 0$. Since this function is decreasing for sufficiently large $c$, its domain is typically restricted to $(-\infty, 1/(2\alpha)]$. We will examine each of these, and show that none of them can be expressed using reward functions.
In this section, we will consider the domain of $G$ to be the set of all coherent trajectories, not the set of trajectories which are possible under some transition function $\tau$. In other words, we consider the set of all trajectories to be $(S \times A)^{\omega}$. The reason for this is that we do not want to presume any prior knowledge of the environment. If we restrict the set of trajectories we consider, then some risk-averse utility functions can become possible to express (consider the case of a tree-shaped MDP, for example). Finally, we will say that $R$ is constant if it has a constant value for all $s, a, s'$.
To prove our results, we will make use of three lemmas. The proofs of these lemmas are fairly long, but not very illuminating, and so we have relegated them to Appendix A.
Lemma 1. If $R$ is non-constant, then for any state $s$ there exist trajectories $\zeta_1, \zeta_2, \zeta_3$ starting in $s$ such that $G(\zeta_1) \neq G(\zeta_2)$, $G(\zeta_2) \neq G(\zeta_3)$, and $G(\zeta_1) \neq G(\zeta_3)$.
Lemma 2. If $G_2(\xi) = f(G_1(\xi))$ for all $\xi$ and some $f$, then for any transition $\langle s, a, s' \rangle$ and any trajectory $\zeta$ starting in $s'$, $R_2(s, a, s') = f(R_1(s, a, s') + \gamma G_1(\zeta)) - \gamma f(G_1(\zeta))$.
Lemma 3. For any non-constant reward $R_1$ and any $f$ that is injective on $\mathrm{range}(G_1)$, if for any $y \in \mathrm{range}(R_1)$ and any $\gamma \in (0, 1)$ there are at most two distinct $x_1, x_2$ such that $f(y + \gamma x_1) - \gamma f(x_1) = f(y + \gamma x_2) - \gamma f(x_2)$, then there is no reward $R_2$ such that $G_2(\xi) = f(G_1(\xi))$ for all $\xi$.
Using these lemmas, we can now derive our main results:
Theorem 6. For any non-constant reward function $R_1$ and any constant $\alpha \neq 0$, there is no reward function $R_2$ such that $G_2(\xi) = -e^{\alpha G_1(\xi)}$ for all valid trajectories $\xi$.
Proof. With $f(x) = -e^{\alpha x}$, the expression in Lemma 3 becomes $-e^{\alpha(y + \gamma x)} + \gamma e^{\alpha x}$. The derivative of this expression with respect to $x$ is $\gamma\alpha(-e^{\alpha(y + \gamma x)} + e^{\alpha x})$, which has only one root when $\gamma \neq 0$ and $\alpha \neq 0$. This means that there can be at most two distinct values $x_1, x_2$ such that $-e^{\alpha(y + \gamma x_1)} + \gamma e^{\alpha x_1} = -e^{\alpha(y + \gamma x_2)} + \gamma e^{\alpha x_2}$. Since $-e^{\alpha x}$ is injective, we can thus apply Lemma 3, which completes the proof.
Theorem 7. For any non-constant reward function $R_1$ and any constant $\alpha > 0$, $\alpha \neq 1$, there is no reward function $R_2$ such that $G_2(\xi) = G_1(\xi)^{1-\alpha}$ for all valid trajectories $\xi$.
Proof. With $f(x) = x^{1-\alpha}$, the expression in Lemma 3 becomes $(y + \gamma x)^{1-\alpha} - \gamma x^{1-\alpha}$. The derivative of this expression with respect to $x$ is $\gamma(\alpha - 1)(x^{-\alpha} - (\gamma x + y)^{-\alpha})$, which has only one root when $\gamma \neq 0$ and $\alpha \notin \{0, 1\}$.
This means that there can be at most two distinct values $x_1, x_2$ such that $(y + \gamma x_1)^{1-\alpha} - \gamma x_1^{1-\alpha} = (y + \gamma x_2)^{1-\alpha} - \gamma x_2^{1-\alpha}$. Since $x^{1-\alpha}$ is injective, we can thus apply Lemma 3, which completes the proof.
Theorem 8. For any non-constant reward function $R_1$, there is no reward function $R_2$ such that $G_2(\xi) = \ln(G_1(\xi))$ for all valid trajectories $\xi$.
Proof. With $f(x) = \ln(x)$, the expression in Lemma 3 becomes $\ln(y + \gamma x) - \gamma \ln(x)$. The derivative of this expression with respect to $x$ is $\gamma(1/(y + \gamma x) - 1/x)$, which has only one root when $\gamma \neq 0$. Since $\ln(x)$ is injective, we can thus apply Lemma 3, which completes the proof.
Theorem 9. For any non-constant reward function $R_1$ and any $\alpha > 0$ where $\max_\xi G_1(\xi) \leq \frac{1}{2\alpha}$, there is no reward function $R_2$ such that $G_2(\xi) = G_1(\xi) - \alpha G_1(\xi)^2$ for all $\xi$.
Proof. With $f(x) = x - \alpha x^2$, the expression in Lemma 3 becomes $(y + \gamma x) - \alpha(y + \gamma x)^2 - \gamma(x - \alpha x^2)$. This is a second-degree polynomial in $x$ (its leading coefficient is $\alpha\gamma(1 - \gamma) \neq 0$), which means that there can be at most two distinct values $x_1, x_2$ such that $(y + \gamma x_1) - \alpha(y + \gamma x_1)^2 - \gamma(x_1 - \alpha x_1^2) = (y + \gamma x_2) - \alpha(y + \gamma x_2)^2 - \gamma(x_2 - \alpha x_2^2)$. Moreover, if $\max_\xi G_1(\xi) \leq \frac{1}{2\alpha}$ then $f(x) = x - \alpha x^2$ is injective on $\mathrm{range}(G_1)$. We can thus apply Lemma 3.
We can thus see that Lemma 3 is quite flexible. It allows us to rule out many modifications to $G$ as impossible, including all the standard risk-averse utility functions. It would be desirable to strengthen these results, and provide necessary and sufficient conditions for when it is possible to construct a reward $R_2$ such that $G_2(\xi) = f(G_1(\xi))$ for some function $f$ and some (non-constant) reward $R_1$. We consider this to be an important question for further work.
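The constraint in Lemma 2 can be explored numerically: if a suitable $R_2$ existed, then $f(r + \gamma g) - \gamma f(g)$ would have to take the same value for every continuation return $g$ available after a transition with reward $r$. The sketch below evaluates this quantity for the utility functions of this section at a few continuation returns. The numbers $r = 1$, $\gamma = 0.9$, the continuation returns, the values of $\alpha$, and the names `implied_reward` and `spread` are arbitrary assumptions for the example, and an affine $f$ is included only as a contrast case. It is an illustration of the lemma, not a substitute for the proofs.

```python
import math


def implied_reward(f, r: float, g: float, gamma: float) -> float:
    """Value that R_2(s, a, s') would need, given continuation return g."""
    return f(r + gamma * g) - gamma * f(g)


def spread(f, r: float = 1.0, gamma: float = 0.9,
           continuations=(1.0, 2.0, 3.0)) -> float:
    """Largest disagreement between the implied rewards across continuations."""
    vals = [implied_reward(f, r, g, gamma) for g in continuations]
    return max(vals) - min(vals)


utilities = {
    "exponential": lambda x: -math.exp(0.5 * x),   # alpha = 0.5
    "isoelastic": lambda x: x ** (1 - 0.5),        # alpha = 0.5
    "logarithmic": lambda x: math.log(x),
    "quadratic": lambda x: x - 0.1 * x ** 2,       # alpha = 0.1, returns <= 5
}
affine = lambda x: 2.0 * x + 3.0                   # contrast case

spreads = {name: spread(f) for name, f in utilities.items()}
affine_spread = spread(affine)

for name, value in spreads.items():
    print(f"{name}: implied reward varies by {value:.4f}")
print(f"affine: implied reward varies by {affine_spread:.2e}")
```

For the affine function the implied reward does not depend on the continuation, so a Markovian $R_2$ exists; for the four risk-averse utilities the implied reward changes with $g$, which no function of $\langle s, a, s' \rangle$ alone can accommodate.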

4. MODAL REINFORCEMENT LEARNING

The final class of tasks we will examine is one which we have decided to refer to as modal tasks. Before we give a formal definition of this class, we will first provide some intuition. In analytic philosophy, a distinction is made between categorical facts and modal facts. In short, categorical facts only concern what is true in actuality, whereas modal facts concern what must be true, could have been true, or cannot be true. For example, it is a categorical fact that the Eiffel Tower is brown, and a modal fact that it could have had a different colour. It is (arguably) a categorical fact that the number 3 is prime, and a modal fact that it could not have been otherwise. To give another example, there is a difference between stating that nothing can travel faster than light and that nothing does travel faster than light: the former statement, which is modal, is stronger than the latter, which is categorical. One can further distinguish between different kinds of possibility (e.g. logical vs physical possibility), and discussions about modality also involve topics such as causality and counterfactuals. A complete treatment of this subject is far beyond the scope of this paper, but for an overview, see e.g. Menzel (2021). Modality does of course relate to modal logic, but it also relates to temporal logic. In particular, computation tree logic (CTL), and its extensions, can express many modal statements.
The intuition behind this section is that a reward function always is expressed in terms of categorical facts, whereas many tasks are naturally expressed in terms of modal facts. For example, consider an instruction such as "you should always be able to return to the start state". This instruction seems quite reasonable, but it is not obvious how to translate it into a reward function. Note that this instruction is not telling the agent to actually return to the start state; it merely says that it should maintain the ability to do so.
To give a few other examples, consider instructions such as "you should never enter a state from which it is possible to quickly enter an unsafe state", "you should always be able to press the emergency shutdown button", or "you should never enter a state where you would be unable to receive a feedback signal". These instructions all seem very reasonable, and they are expressed in terms of what should be possible or impossible along the trajectory of the agent, rather than in terms of what in fact occurs along that trajectory. Given this background motivation, we can now give a formal definition of modal tasks: Definition 7. Given a set of states S and a set of actions A, a modal reward function R ♢ is a function R ♢ : S × A × S × (S × A ⇝ S) → R which takes two states s, s ′ ∈ S, an action a ∈ A, and a transition function τ over S and A, and returns a real number. R ♢ (s, a, s ′ , τ ) is the reward that is obtained when transitioning from state s to s ′ using action a in an environment whose transition function is τ . Here we allow R ♢ an unrestricted dependence on τ , to make our results as general as possible, even if a practical algorithm for solving modal tasks presumably would require restrictions on what this dependence can look like (see Appendix E). Modal reward functions can be used to express instructions such as those we gave above. For example, a simple case might be "you get 1 reward if you reach this goal state, and -1 reward if you ever enter a state from which you cannot reach the initial state". This reward depends on the transition function, because the transition function determines from which states you can reach the initial state. As usual, R ♢ then induces a Q-function Q ♢ , value function V ♢ , and evaluation function J ♢ , etc. 
We say that a modal reward $R^\Diamond$ and an ordinary reward $R$ are contingently equivalent given a transition function $\tau$ if $J^\Diamond$ and $J$ induce the same ordering of policies given $\tau$, and that they are robustly equivalent if $J^\Diamond$ and $J$ induce the same ordering of policies for all $\tau$. We use $R^\Diamond_\tau$ to denote the reward function $R^\Diamond_\tau(s, a, s') = R^\Diamond(s, a, s', \tau)$. We will also use the following definition:
Definition 8. A modal reward function $R^\Diamond$ is trivial if there is a reward function $R$ such that for all $\tau$, $R$ and $R^\Diamond_\tau$ have the same policy ordering under $\tau$.
The intuition here is that a trivial modal reward function does not actually depend on $\tau$ in any important sense. Note that this is not necessarily to say that $R^\Diamond_\tau = R$ for all $\tau$. For example, it could be the case that $R^\Diamond_\tau$ is a scaled version of $R$, or that $R^\Diamond_\tau$ and $R$ differ by potential shaping (Ng et al., 1999), or that $R^\Diamond_\tau$ is modified in a way such that $\mathbb{E}_{S' \sim \tau(s,a)}[R^\Diamond_\tau(s, a, S')] = \mathbb{E}_{S' \sim \tau(s,a)}[R(s, a, S')]$, since none of these differences affect the policy ordering.
Theorem 10. For any modal reward $R^\Diamond$ and any transition function $\tau$, there exists a reward function $R$ that is contingently equivalent to $R^\Diamond$ given $\tau$. Moreover, unless $R^\Diamond$ is trivial, there is no reward function that is robustly equivalent to $R^\Diamond$.
Proof. This is straightforward. For the first part, simply let $R(s, a, s') = R^\Diamond(s, a, s', \tau)$. The second part is immediate from the definition of trivial modal reward functions.
In other words, every modal task can be expressed with an ordinary reward function in each particular environment, but no reward function expresses a (non-trivial) modal task in all environments. Is this enough? We argue that it is not, because the construction of $R^\Diamond_\tau$ will invariably be laborious, and require detailed knowledge of the environment.
For example, consider the task "you should always be able to return to the start state"; here, constructing R ♢ τ would amount to manually enumerating all the states from which the start state is reachable. This is very much against the spirit of reinforcement learning, where much of the point is that we want to be able to specify tasks which can be pursued in unknown environments. In short, a method which requires a model of the environment is arguably not a reinforcement learning method. We thus argue that reward functions are unable to capture modal tasks in a satisfactory way. One remaining question might be why one would want to express instructions for reinforcement learning agents in terms of modal properties. After all, what benefit is there to the instruction "never enter a state from which it is possible to quickly enter an unsafe state" over the instruction "never enter an unsafe state"? One reason is that the former task might lead to behaviour that is more robust to changes in the environment. For example, if an RL agent is trained in a simulated environment, and deployed in the real world, then it seems like it would be preferable to tell the agent to avoid risky states, rather than unsafe states, since imperfections in the simulation could lead to an underestimation of the risk involved. Another example is the existing work on avoiding side effects (Krakovna et al., 2018; 2020; Turner et al., 2020) , which it is natural to express in modal terms. This work can be viewed as being aimed at making the behaviour of an RL agent more robust to misspecification of the reward function.
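To make the dependence on τ concrete, the following is a minimal sketch of the example modal reward from above ("1 reward if you reach this goal state, and -1 reward if you ever enter a state from which you cannot reach the initial state"). The state names, the dictionary encoding of τ, and the helper names `can_reach`, `modal_reward`, and `induced_reward` are all assumptions made for illustration; they do not come from the paper. The sketch shows that building R ♢ τ requires a reachability analysis of the environment model, and that the same R ♢ induces different ordinary rewards in different environments.

```python
from collections import deque

def can_reach(tau, source, target):
    """Return True if `target` is reachable from `source` with positive probability.

    `tau` maps (state, action) pairs to {next_state: probability} dictionaries
    (an encoding assumed for this sketch).
    """
    seen, frontier = {source}, deque([source])
    while frontier:
        state = frontier.popleft()
        if state == target:
            return True
        for (s, _action), next_dist in tau.items():
            if s != state:
                continue
            for nxt, prob in next_dist.items():
                if prob > 0 and nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return False

def modal_reward(s, a, s_next, tau, init="init", goal="goal"):
    """Modal reward R_diamond(s, a, s', tau) for the example task.

    Checking the goal first is a choice made in this sketch; the text does not
    say what happens if the goal itself cannot reach the initial state.
    """
    if s_next == goal:
        return 1.0
    if not can_reach(tau, s_next, init):
        return -1.0
    return 0.0

def induced_reward(tau):
    """Ordinary Markovian reward R(s, a, s') = R_diamond(s, a, s', tau) for a fixed tau."""
    return lambda s, a, s_next: modal_reward(s, a, s_next, tau)

# Two hypothetical environments over the same states and actions.
# In TAU_A the pit is absorbing; in TAU_B the pit has a way back to "init".
TAU_A = {
    ("init", "left"): {"pit": 1.0},
    ("init", "right"): {"goal": 1.0},
    ("pit", "left"): {"pit": 1.0},
    ("pit", "right"): {"pit": 1.0},
    ("goal", "left"): {"init": 1.0},
    ("goal", "right"): {"init": 1.0},
}
TAU_B = dict(TAU_A)
TAU_B[("pit", "left")] = {"init": 1.0}
```

Under these assumptions, the transition from "init" to "pit" is penalised by `induced_reward(TAU_A)` but not by `induced_reward(TAU_B)`, so no single ordinary reward function reproduces this modal reward in both environments.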

5. SOLVING "INEXPRESSIBLE" TASKS

We have pointed to three classes of tasks which cannot be expressed using reward functions (namely multi-objective tasks, risk-sensitive tasks, and modal tasks). A natural next question is whether these tasks could be solved using RL, or whether only the tasks which correspond to Markovian reward functions can be effectively learnt. We discuss this issue below. In short, it is possible to design RL algorithms for tasks in each of these categories.

Multi-objective reinforcement learning is well-explored, with many existing algorithms (see Section 1.1). Most of these algorithms are designed to solve a specific MORL objective; for example, Skalse et al. (2022b) solve the LexMax objective, and Tessler et al. (2019) solve the ConSat objective. There is (to the best of our knowledge) not yet any algorithm for solving, e.g., the MaxMin objective, but there is no good reason to believe that such an algorithm could not be made. Similarly, there are existing algorithms for risk-sensitive RL (e.g. Chow et al. (2017)), and even algorithms that solve certain modal tasks (Krakovna et al., 2018; 2020; Turner et al., 2020; Wang et al., 2020).

It should also be possible to design algorithms which can flexibly solve many different tasks from the classes we have discussed (instead of having to be designed for just one particular task). For example, suppose a MORL objective can be represented by a function U : R^k → R, such that π 1 ≺ π 2 when U (J 1 (π 1 ) . . . J k (π 1 )) < U (J 1 (π 2 ) . . . J k (π 2 )), and that U is differentiable. We give a few examples of such objectives in Appendix D, including e.g. a "soft" version of MaxMin. With such an objective, if we have a policy π that is differentiable with respect to some parameters θ, then it should be possible to compute the gradient of U (J 1 (π) . . . J k (π)) with respect to θ, and then use a policy gradient method to increase U .
This means that it should be possible to design an actor-critic algorithm which can solve any differentiable MORL objective. We consider the development and evaluation of such methods to be a promising direction for further work. We outline a possible approach for solving a wide class of modal tasks in Appendix E.
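The gradient computation described above follows from the chain rule: the gradient of U (J 1 (π) . . . J k (π)) with respect to θ is the sum over i of (∂U/∂J i ) times the gradient of J i . The sketch below illustrates this on a deliberately tiny problem, under several assumptions that are not from the paper: a two-action, single-state bandit with a softmax policy, exact values of J i and their gradients (a practical method would estimate them, e.g. with one critic per objective), a Boltzmann-weighted soft minimum as U, and finite differences for ∂U/∂J i . All function names are hypothetical.

```python
import math

# Hypothetical two-action bandit with k = 2 objectives:
# REWARDS[i][a] is the reward of action a under objective i.
REWARDS = [[1.0, 0.0], [0.0, 1.0]]

def softmax(theta):
    """Softmax policy over actions, parameterised by one logit per action."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def objectives(theta):
    """Exact J_1 ... J_k of the softmax policy in the bandit."""
    pi = softmax(theta)
    return [sum(p * r for p, r in zip(pi, row)) for row in REWARDS]

def objective_jacobian(theta):
    """Exact dJ_i/dtheta_b = pi(b) * (r_i(b) - J_i) for a softmax policy."""
    pi = softmax(theta)
    js = objectives(theta)
    return [[pi[b] * (row[b] - j) for b in range(len(theta))]
            for row, j in zip(REWARDS, js)]

def soft_min(js, alpha=5.0):
    """A smooth stand-in for min_i J_i (Boltzmann-weighted average)."""
    m = min(js)
    weights = [math.exp(-alpha * (j - m)) for j in js]
    return sum(j * w for j, w in zip(js, weights)) / sum(weights)

def grad_u(u, js, h=1e-6):
    """Central finite-difference estimate of dU/dJ_i for any smooth U."""
    grads = []
    for i in range(len(js)):
        up, down = list(js), list(js)
        up[i] += h
        down[i] -= h
        grads.append((u(up) - u(down)) / (2 * h))
    return grads

def ascend(theta, u, lr=0.5, steps=300):
    """Gradient ascent on U(J_1(theta), ..., J_k(theta)) via the chain rule."""
    theta = list(theta)
    for _ in range(steps):
        du = grad_u(u, objectives(theta))
        jac = objective_jacobian(theta)
        for b in range(len(theta)):
            theta[b] += lr * sum(du[i] * jac[i][b] for i in range(len(du)))
    return theta
```

Starting from a policy that strongly favours the first action, e.g. `ascend([2.0, 0.0], soft_min)`, the update moves the policy towards an even mix of the two actions, which is where the soft minimum of the two objectives is largest in this toy problem. Only `grad_u` and `ascend` are generic; the remaining functions are specific to the bandit.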

6. DISCUSSION

In this paper, we have studied the ability of Markovian reward functions to express different kinds of problems. We have looked at three classes of tasks (multi-objective tasks, risk-sensitive tasks, and modal tasks), and found that Markovian reward functions are unable to express most of the tasks in each of these three classes. We have also provided necessary and sufficient conditions for when a single-policy MORL problem can be expressed using a single reward function (which, as it turns out, is almost never), and also drawn attention to a class of tasks which have just barely been explored previously (namely modal tasks). Finally, we have also shown that many of these problems still can be solved with RL, and even outlined some methods for how to extend these solutions.

There are several ways to extend our work. First of all, while we have given many examples of tasks which cannot be formalised using Markovian reward functions, we have not given a general characterisation of what reward functions are or are not able to express. It would be very desirable to have a set of intuitive necessary and sufficient conditions, which exactly describe those policy orderings that can be expressed using reward functions, similar to what the VNM axioms provide in the case of utility functions. We outline some initial steps towards such a characterisation in Appendix B. Note that the VNM axioms themselves cannot be directly applied to RL; see Appendix C. Additionally, it would also be desirable to provide necessary and sufficient conditions for when it is possible to construct a reward R 2 such that G 2 (ξ) = f (G 1 (ξ)) for some function f and some (non-constant) reward R 1 , as we discussed at the end of Section 3.

Our work also provides a strong motivation for developing more RL algorithms that can learn tasks which cannot be expressed using Markovian reward functions. There are several ways to do this.
In Section 5, we outline an approach for learning any differentiable MORL objective using policy gradients, and in Appendix E, we outline an approach for learning a large class of modal tasks. It would also be very interesting to explore more general ways to express RL tasks, and study their expressivity. For example, it would be interesting to know if (and to what extent) MORL tasks can be expressed using reward machines (Icarte et al., 2022), and similar formalisms.

[…] for all trajectories ζ starting in s ′ . For clarity, let x = G 1 (ζ) and y = R 1 (s, a, s ′ ), so that R 2 (s, a, s ′ ) = f (y + γx) − γf (x). By assumption, there can be at most two distinct values x 1 , x 2 such that f (y + γx 1 ) − γf (x 1 ) = f (y + γx 2 ) − γf (x 2 ). However, Lemma 1 implies that there are at least three ζ 1 , ζ 2 , ζ 3 starting in s ′ with distinct values of G 1 . Since f is injective on range(G 1 ), this means that there are at least three distinct values of x for which f (y + γx) − γf (x) must be constant (and equal to R 2 (s, a, s ′ )), which is a contradiction.

B TOWARDS NECESSARY AND SUFFICIENT CONDITIONS

In this paper, we have provided several examples of "natural" policy orderings which cannot be represented using a reward function. It would be desirable to have a set of necessary and sufficient conditions to characterise those orderings over Π that can be expressed by reward functions, similar to that provided by the VNM axioms (the VNM axioms themselves do not provide this, see Appendix C). We consider this to be an important topic for future work. In this section, we will discuss a few interesting properties which are shared by all policy orderings which can be represented by reward functions. We believe that these examples will help with building an intuition for what reward functions can and cannot express. We would first like to point out that, while it seems difficult to characterise the policy orderings which can be expressed by reward functions, it is fairly straightforward to exactly characterise the sets of policies Π that can be optimal under some reward function: Proposition 1. A set of policies Π is the optimal policy set for some reward function if and only if there is a function o : S → P(A)\∅ that maps each state to a (non-empty) set of "optimal actions", and π ∈ Π if and only if supp(π(s)) ⊆ o(s). Proof. For the "if" part, consider the reward function R where R(s, a, s ′ ) = 0 if a ∈ o(s), and R(s, a, s ′ ) = -1 otherwise. The "only if" part follows from the fact that the optimal Q-function Q ⋆ is the same for all optimal policies, so we can let o(s) = arg max a Q ⋆ (s, a). This immediately lets us rule out many policy orderings as inexpressible. For example, consider the task "always go in the same direction" -this task cannot be expressed as a reward function, because any policy that mixes the actions of two other optimal policies must itself be optimal. It also shows that Markovian reward functions cannot be used to encourage stochastic policies. 
For example, there is no Markovian reward function under which "play rock, paper, and scissors with equal probability" is the unique optimal policy. The next thing we would like to point out is that no reward function can express an ordering over Π that has a countable number of equivalence classes (except trivial reward functions, which have only one equivalence class). This simple fact also rules out many orderings. Proposition 2. If R is non-trivial then J has an uncountable number of equivalence classes. Proof. This follows from the intermediate value theorem, and the fact that J is continuous in Π. This simple observation can be used to e.g. create an alternative proof of Theorem 4, which says that the MaxSat objective cannot be represented as a (scalar) reward function. It also shows that objectives such as e.g. J(π) = min ξ∈supp(π) G(ξ), which evaluates policies according to the worst trajectory in their support, cannot be represented (since any policy then has the same value as some deterministic policy, and since there is only a finite number of deterministic policies).
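The appeal to the intermediate value theorem in the proof of Proposition 2 can be spelled out as follows. This is only a sketch: it reads "non-trivial" as meaning that J(π 1 ) < J(π 2 ) for some pair of policies, and the mixture policy π λ and the function g are notation introduced here rather than taken from the paper.

```latex
\begin{align*}
\pi_\lambda(a \mid s) &= (1-\lambda)\,\pi_1(a \mid s) + \lambda\,\pi_2(a \mid s),
  \qquad \lambda \in [0,1], \\
g(\lambda) &= J(\pi_\lambda),
  \qquad g(0) = J(\pi_1) < J(\pi_2) = g(1).
\end{align*}
% Since J is continuous in the policy, g is continuous on [0,1]. By the
% intermediate value theorem, for every c in [J(\pi_1), J(\pi_2)] there is
% some \lambda with g(\lambda) = c.
```

Every value in the interval between J(π 1 ) and J(π 2 ) is therefore attained by some policy, and policies with different values lie in different equivalence classes, so there are uncountably many classes.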

C A DIGRESSION ON THE VON NEUMANN-MORGENSTERN AXIOMS

The famous VNM axioms, due to von Neumann & Morgenstern (1947), provide necessary and sufficient conditions for when a utility function can be used to represent a preference ordering for lotteries over a finite choice set. In an MDP, a policy induces a distribution over trajectories, and a reward function assigns a value to each trajectory. One might then wonder if the VNM axioms could provide necessary and sufficient conditions for when an ordering over Π can be realised using a reward function. This is not the case, and in this appendix, we briefly point out why. These results are not novel to this paper, but are instead provided to help with intuition building. First of all, the VNM theorem assumes that the choice set is finite, whereas in an MDP, the number of trajectories is (countably) infinite. There are preferences between distributions over countable choice sets which satisfy the VNM axioms, but which can nonetheless not be represented using utility functions (see footnote 0). Second, not all distributions over trajectories can be represented as a policy (unless we allow both the policy and the transition function to be non-stationary). Third, there is a special structure to how a reward function assigns value to a trajectory, and not all functions Ξ → R can be represented in this way. This means that the VNM axioms are not applicable to RL. However, it may still be possible to provide similar intuitive necessary and sufficient conditions for the RL case. We consider this to be an important topic for future work.

D MORE MORL OBJECTIVES

In this Appendix, we give even more examples of MORL objectives, and some comments on how to construct them; the purpose of this is mainly just to show how rich this space is. First, similar to the MaxMin objective, we might want to judge a policy according to its best performance:

Definition 9. Given J 1 . . . J k , the MaxMax objective ≺ Max is given by π 1 ≺ Max π 2 ⇐⇒ max i J i (π 1 ) < max i J i (π 2 ).

We would next like to point out that it is possible to create smooth versions of almost any MORL objective. In Section 5, we outline an approach for learning any continuous, differentiable MORL objective, so this is quite useful. We begin with a soft version of the MaxMax objective:

Definition 10. Given J 1 . . . J k and α > 0, the Soft MaxMax objective ≺ MaxSoft is given by

$$J_{\mathrm{MaxSoft}}(\pi) = \frac{\sum_{i=1}^{k} J_i(\pi)\, e^{\alpha J_i(\pi)}}{\sum_{i=1}^{k} e^{\alpha J_i(\pi)}}.$$

This is of course not the only way to continuously approximate MaxMax, it is just an example of one way of doing it. Here α controls how "sharp" the approximation is: the larger α is, the closer J MaxSoft gets to the sharp max function, and the smaller α is, the closer it gets to the arithmetic mean function (so by varying α, we can continuously interpolate between them). Similarly, we can also create a smooth version of MaxMin:

Definition 11. Given J 1 . . . J k and α > 0, the Soft MaxMin objective ≺ MinSoft is given by

$$J_{\mathrm{MinSoft}}(\pi) = \frac{\sum_{i=1}^{k} J_i(\pi)\, e^{-\alpha J_i(\pi)}}{\sum_{i=1}^{k} e^{-\alpha J_i(\pi)}}.$$

As before, the larger α is, the closer J MinSoft gets to the sharp min function, and the smaller α is, the closer it gets to the arithmetic mean function. We can also smoothen MaxSat:

Definition 12. Given J 1 . . . J k , c 1 . . . c k , and α > 0, the Soft MaxSat objective ≺ SatSoft is given by

$$J_{\mathrm{SatSoft}}(\pi) = \sum_{i=1}^{k} \frac{1}{1 + e^{-\alpha (J_i(\pi) - c_i)}}.$$

The larger α is, the closer J SatSoft gets to the sharp MaxSat function (and the smaller α gets, the closer each term of J SatSoft gets to a flat 0.5). And, again, this is of course not the only way to create a smooth version of MaxSat.
It is unclear if it is possible to create a smooth version of ConSat without having any prior knowledge of (a lower bound of) the value of min π J 1 (π), but with this value it should be reasonably straightforward (see the construction in Theorem 5). As for LexMax, we can of course create a smooth approximation of it by taking a linear approximation of the weights, but here we would need some prior knowledge of max π J 1 (π) . . . max π J k (π).
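The limiting behaviour of the soft objectives in Definitions 10 to 12 can be checked numerically. The sketch below reads Soft MaxMax and Soft MaxMin as Boltzmann-weighted averages of J 1 . . . J k (with weights proportional to e^{αJ i} and e^{-αJ i} respectively), and Soft MaxSat as a sum of logistic functions of J i − c i . The function names and the example values are assumptions made for illustration.

```python
import math

def boltzmann_average(values, beta):
    """Average of `values` with weights proportional to exp(beta * value).

    The exponent is shifted by its maximum for numerical stability; this does
    not change the result.
    """
    m = max(beta * v for v in values)
    weights = [math.exp(beta * v - m) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def soft_maxmax(js, alpha):
    """Soft MaxMax: approaches max(js) as alpha grows, mean(js) as alpha -> 0."""
    return boltzmann_average(js, alpha)

def soft_maxmin(js, alpha):
    """Soft MaxMin: approaches min(js) as alpha grows, mean(js) as alpha -> 0."""
    return boltzmann_average(js, -alpha)

def soft_maxsat(js, cs, alpha):
    """Soft MaxSat: a sum of logistic functions of J_i - c_i.

    For large alpha each term approaches 1 if J_i > c_i and 0 if J_i < c_i.
    No overflow guard is included, so very large alpha * |J_i - c_i| may
    raise OverflowError in math.exp.
    """
    return sum(1.0 / (1.0 + math.exp(-alpha * (j - c))) for j, c in zip(js, cs))

# Hypothetical values of J_1..J_3 for one policy, and thresholds c_1..c_3.
EXAMPLE_JS = [0.2, 0.5, 0.9]
EXAMPLE_CS = [0.1, 0.6, 0.5]
```

With these example values and a large α (say 200), `soft_maxmax` and `soft_maxmin` are numerically indistinguishable from 0.9 and 0.2, and `soft_maxsat` is close to 2, the number of thresholds that are exceeded. With α near 0, the first two collapse to the arithmetic mean and each term of `soft_maxsat` is close to 0.5.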

E A METHOD FOR SOLVING MODAL TASKS

In this Appendix, we give an outline of one possible method for solving modal tasks. We mainly want to show that it is feasible to learn modal tasks, and so we only provide a solution sketch; the task of implementing and evaluating this method is something we leave as a topic for future work. We will first define a restricted class of modal tasks, which is both very expressive, and also more amenable to learning than the more general version given in Definition 7: Definition 13. An affordance consists of a reward function and a discount factor, ⟨R, γ⟩, and an affordance-based reward is a function R ♢ : S × A × S × R 2k → R, that is continuous in the last 2k arguments. An affordance-based MDP is a tuple ⟨S, A, τ, µ 0 , R ♢ , γ, ⟨R, γ⟩ k ⟩, where the reward given for transitioning from s to s ′ via a is R ♢ (s, a, s ′ , V ⋆ 1 (s) . . . V ⋆ k (s), V ⋆ 1 (s ′ ) . . . V ⋆ k (s ′ )), where V ⋆ i is the value function of the i'th affordance. This definition requires some explanation. In psychology (and other fields, such as user interface design), an affordance is, roughly, a perceived possible action, or a perceived way to use an object. For example, if you see a button, then the fact that you can press that button, and expect something to happen, is part of how you perceive it, in a way that might not be the case if you could somehow show the button to a premodern human. It can also be used to refer to a choice or action that is perceived as available in some context (without being tied to an object). Here, we are using it to refer to a task that could be performed in an MDP. The intuition is that R ♢ is allowed to depend on what could be done from s and s ′ , in addition to the state features of s and s ′ . Before outlining an algorithm, let us first give a few examples of how to formalise modal tasks within this framework. First consider the instruction "you should always be able to return to the start state". 
We can formalise this using a reward function R 1 that gives 1 reward if the start state is entered, and 0 otherwise, and pair it up with a discount parameter γ that is very close to 1. We could then set R ♢ to, for example, R ♢ (s, a, s ′ , V ⋆ 1 (s), V ⋆ 1 (s ′ )) = R(s, a, s ′ ) · tanh(V ⋆ 1 (s ′ )), where R describes some base task. In this way, no reward is given if the start state cannot be reached from s ′ . Next, consider the instruction "never enter a state from which it is possible to quickly enter an unsafe state". To formalise this, let R 1 give 1 reward if an unsafe state is entered, and 0 otherwise, and let γ correspond to a very high discount rate (e.g. a discount factor of 0.7). We could then set R ♢ to, for example, R ♢ (s, a, s ′ , V ⋆ 1 (s), V ⋆ 1 (s ′ )) = R(s, a, s ′ ) − V ⋆ 1 (s ′ ), where R again describes some base task. These examples show that our "affordance-based" MDPs are quite flexible, and that they should be able to formalise many natural modal tasks in a satisfactory way, including most of our motivating examples (see footnote 1). The definition could of course be made more general. For example, we could allow the affordances to themselves be based on affordance-based reward functions, etc. However, it is not clear if this would bring much benefit in practice.

Let us now outline an approach for solving affordance-based MDPs using reinforcement learning, specifically using an action-value method. First, let the agent maintain k + 1 Q-functions, Q ♢ , Q 1 , . . . , Q k , one for R ♢ and one for each affordance ⟨R i , γ i ⟩. Next, we suppose that the agent updates each of Q 1 , . . . , Q k using an off-policy update rule, such as Q-learning; this will ensure that Q 1 , . . . , Q k converge to their true values (i.e. to Q ⋆ 1 . . . Q ⋆ k ), as long as the agent explores infinitely often. Note that the use of an off-policy update rule is crucial: the agent's behaviour is chosen to optimise R ♢ rather than any R i , so an on-policy rule would estimate the value of the behaviour policy instead of V ⋆ i .
Next, let the agent update Q ♢ as if it were an ordinary Markovian reward function, using the reward R(s, a, s ′ ) = R ♢ (s, a, s ′ , V 1 (s) . . . V k (s), V 1 (s ′ ) . . . V k (s ′ )), where V i (s) is given by max a Q i (s, a). In other words, we let it update Q ♢ using an estimate of the true value of R ♢ , expressed in terms of its current estimates of V ⋆ 1 . . . V ⋆ k . The fact that Q 1 , . . . , Q k converge to Q ⋆ 1 , . . . , Q ⋆ k , and the fact that R ♢ is continuous in its value function arguments, will ensure that the estimate R also converges to the true value of R ♢ . The update rule used for Q ♢ could be either on-policy or off-policy. We then suppose that the agent selects its actions by applying a Bandit algorithm to Q ♢ , and that this Bandit algorithm is greedy in the limit, but also explores infinitely often, as usual. This algorithm should be able to learn to optimise the reward in any affordance-based MDP. In the tabular case, it should be possible (and reasonably straightforward) to prove that it always converges to an optimal policy (assuming that appropriate learning rates are used, etc), using Lemma 1 in



Footnote 0: For example, consider the ordering that prefers all distributions with infinite support over all distributions with finite support, and which is indifferent between any two distributions in either of these classes.

Footnote 1: This arguably excludes "you should never enter a state where you would be unable to receive a feedback signal". However, this instruction only makes sense in a multi-agent setting.

A PROOFS OF LEMMAS

In this Appendix, we provide the proofs of the lemmas from Section 3.

Lemma 1. If R is non-constant, then for any state s there exist trajectories ξ 1 , ξ 2 , ξ 3 starting in s such that G(ξ 1 ), G(ξ 2 ), and G(ξ 3 ) are pairwise distinct.

Proof. First note that if R is non-constant, then there must be some state s and some trajectories ξ 1 , ξ 2 starting in s such that G(ξ 1 ) ≠ G(ξ 2 ) (this follows from Theorem 3.8 in Skalse et al. (2022a)). We will establish that there is a ξ 3 starting in s such that G(ξ 3 ) ≠ G(ξ 1 ) and G(ξ 3 ) ≠ G(ξ 2 ), and then show that this implies that such trajectories exist for all states.

Suppose for contradiction that for any ξ 3 starting in s, either G(ξ 3 ) = G(ξ 1 ) or G(ξ 3 ) = G(ξ 2 ). Consider a transition ⟨s, a, s⟩, and let ζ 1 = ⟨s, a, s⟩ + ξ 1 and ζ 2 = ⟨s, a, s⟩ + ξ 2 ; we will do a case enumeration, and show that either […] Combining this, and rearranging, gives […] This exhausts all cases, which means that if R is non-constant, then there must be some state s and some trajectories ξ 1 , ξ 2 , ξ 3 starting in s with pairwise distinct values of G. Finally, note that this means that we can construct such trajectories for any state s ′ , by simply composing a transition ⟨s ′ , a, s⟩ with each of ξ 1 , ξ 2 , ξ 3 .

Lemma 2. If G 2 (ξ) = f (G 1 (ξ)) for all ξ and some f , then for any transition ⟨s, a, s ′ ⟩ and any trajectory ζ starting in s ′ , we have R 2 (s, a, s ′ ) = f (R 1 (s, a, s ′ ) + γG 1 (ζ)) − γf (G 1 (ζ)).

Proof. Suppose that G 2 (ξ) = f (G 1 (ξ)) for all trajectories ξ. Let ⟨s, a, s ′ ⟩ be an arbitrary transition, let ζ be an arbitrary trajectory starting in s ′ , and let ξ = ⟨s, a, s ′ ⟩ + ζ. By using the fact that G i (ξ) = R i (s, a, s ′ ) + γG i (ζ) for i ∈ {1, 2}, and rearranging, we get that R 2 (s, a, s ′ ) = f (R 1 (s, a, s ′ ) + γG 1 (ζ)) − γf (G 1 (ζ)). Since ⟨s, a, s ′ ⟩ and ζ were chosen arbitrarily, this completes the proof.

Lemma 3. For any non-constant reward R 1 and any f that is injective on range(G 1 ), if for any y ∈ range(R 1 ) and any γ ∈ (0, 1) there are at most two distinct x 1 , x 2 such that f (y + γx 1 ) − γf (x 1 ) = f (y + γx 2 ) − γf (x 2 ), then there is no reward R 2 such that G 2 (ξ) = f (G 1 (ξ)) for all ξ.

Proof. Suppose for contradiction that G 2 (ξ) = f (G 1 (ξ)) for all ξ. Let ⟨s, a, s ′ ⟩ be an arbitrary transition. Applying Lemma 2, we get that R 2 (s, a, s ′ ) = f (R 1 (s, a, s ′ ) + γG 1 (ζ)) − γf (G 1 (ζ))

Singh et al. (2000). We would also expect it to perform well in practice, when used with function approximators (such as neural networks). However, we leave the task of implementing and properly evaluating this approach as a topic for future work.

There are also several ways that this algorithm could be tweaked or improved. For example, the algorithm we have described is an action-value algorithm, but the same approach could of course be used to make an actor-critic algorithm instead. We also suspect that there could be interesting modifications one could make to the exploration strategy of the algorithm. If a standard bandit algorithm (such as ϵ-greedy) is used, then the agent will mostly take actions that are optimal under its current estimate of Q ♢ . In the ordinary case, this is good, because it leads the agent to spend more time in the parts of the MDP that are relevant for maximising the reward. However, in this case, there is a worry that it could lead the agent to neglect the parts of the (affordance-based) MDP that are relevant for learning more about V ⋆ 1 . . . V ⋆ k , which might slow down the learning. Again, we leave such developments for future work, since our aim here is only to show that it is feasible to learn non-trivial modal tasks.

We also want to point out that the work by Wang et al. (2020) could provide another starting point for learning modal tasks using RL.
In their work, they present some RL-based methods for determining whether a specification in Probabilistic Computational Tree Logic (PCTL) holds in an MDP. PCTL can be used to specify many kinds of properties of states in MDPs which depend on the transition function, including e.g. what states can and cannot be reached from a particular state, and with what probability, etc. We can therefore specify non-trivial modal tasks by providing a number of PCTL formulas, and allowing the reward function to depend on the truth values of these formulas. That is, we could consider a setup that is analogous to that which we give in Definition 13, but where the "affordances" are replaced by PCTL formulas. It should then be possible to learn tasks specified in this manner by using the techniques of Wang et al. (2020) to learn the values of the PCTL formulas, and then using ordinary RL to train on the resulting reward function.
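The action-value method outlined earlier in this Appendix (one Q-function per affordance, updated off-policy, plus Q ♢ updated on an estimate of R ♢ ) can be sketched in the tabular case as follows. Everything concrete here is an assumption made for illustration and is not the paper's implementation: the three-state deterministic MDP, the reward values, the hyperparameters, the use of Q-learning for Q ♢ as well, and a fixed ϵ for exploration (the text asks for exploration that is greedy in the limit). The task is "you should always be able to return to the start state", using R ♢ = R · tanh(V ⋆ 1 (s ′ )) with a single affordance.

```python
import math
import random

START, SAFE, TRAP = 0, 1, 2
ACTIONS = (0, 1)
# Deterministic toy transitions: from START, action 0 leads to SAFE and
# action 1 leads to TRAP; SAFE always returns to START; TRAP is absorbing.
NEXT = {(START, 0): SAFE, (START, 1): TRAP,
        (SAFE, 0): START, (SAFE, 1): START,
        (TRAP, 0): TRAP, (TRAP, 1): TRAP}

def base_reward(s, a, s2):
    """Base task R: the trap pays more than the safe state."""
    return 2.0 if s2 == TRAP else (1.0 if s2 == SAFE else 0.0)

def affordance_reward(s, a, s2):
    """Affordance R_1: 1 reward whenever the start state is entered."""
    return 1.0 if s2 == START else 0.0

def modal_reward(s, a, s2, v1_s, v1_s2):
    """R_diamond = R * tanh(V_1(s')): no reward if START is unreachable from s'."""
    return base_reward(s, a, s2) * math.tanh(v1_s2)

def state_value(q, s):
    return max(q[(s, a)] for a in ACTIONS)

def train(episodes=300, horizon=20, lr=0.5, gamma=0.9, gamma1=0.9,
          eps=0.2, seed=0):
    rng = random.Random(seed)
    q1 = {(s, a): 0.0 for s in (START, SAFE, TRAP) for a in ACTIONS}
    q_modal = dict(q1)
    for _ in range(episodes):
        s = START
        for _ in range(horizon):
            # Epsilon-greedy on Q_diamond (fixed eps is a simplification).
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: q_modal[(s, b)])
            s2 = NEXT[(s, a)]
            # Off-policy (Q-learning) update for the affordance.
            target1 = affordance_reward(s, a, s2) + gamma1 * state_value(q1, s2)
            q1[(s, a)] += lr * (target1 - q1[(s, a)])
            # Update Q_diamond on the current estimate of the modal reward.
            r_hat = modal_reward(s, a, s2, state_value(q1, s), state_value(q1, s2))
            target = r_hat + gamma * state_value(q_modal, s2)
            q_modal[(s, a)] += lr * (target - q_modal[(s, a)])
            s = s2
    return q1, q_modal
```

In this toy problem the base reward alone would favour entering the trap, whereas the learned Q ♢ prefers the action that keeps the start state reachable, because the affordance value of the trap stays at zero and so does the estimated modal reward for entering it.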

