NON-LINEAR REWARDS FOR SUCCESSOR FEATURES

ABSTRACT

Reinforcement Learning algorithms have reached new heights in performance, often surpassing humans on several challenging tasks such as Atari and Go. However, the resulting models typically learn fragile policies that cannot transfer between tasks without full retraining. Successor features aim to improve this situation by decomposing the policy into two components: one capturing environmental dynamics and the other modelling reward. Under this framework, transfer between related tasks requires training only the reward component. However, the successor feature framework builds upon the assumption that the current reward can be predicted from a linear combination of state features, an assumption with no guarantee. This paper proposes a novel improvement to the successor feature framework, in which we instead assume that the reward function is a non-linear function of the state features, thereby increasing its representational power. After derivation of the new state-action value function, the decomposition includes a second term that learns the auto-correlation matrix of the state features. Experimentally, we show this term explicitly models the environment's stochasticity and can also be used in place of ε-greedy exploration methods during transfer. The performance of the proposed improvements to the successor feature framework is validated empirically on navigation tasks and control of a simulated robotic arm.

INTRODUCTION

Recently, Reinforcement Learning (RL) algorithms have achieved superhuman performance in several challenging domains, such as Atari (Mnih et al., 2015), Go (Silver et al., 2016), and Starcraft II (Vinyals et al., 2019). The main driver of these successes has been the combination of RL algorithms with deep neural networks, a class of powerful non-linear function approximators (LeCun et al., 2015).
However, this class of Deep Reinforcement Learning (Deep RL) algorithms requires immense amounts of data within an environment, often ranging from tens to hundreds of millions of samples (Arulkumaran et al., 2017). Furthermore, commonly used algorithms often have difficulty transferring a learned policy between related tasks, for instance when the environmental dynamics remain constant but the goal changes. In this case, the model must either be retrained completely or fine-tuned on the new task, in both cases requiring millions of additional samples. If the state dynamics are constant but the reward structure varies between tasks, it is wasteful to retrain the entire model. A more pragmatic approach is to decompose the RL agent's policy such that separate functions learn the state dynamics and the reward structure; doing so enables reuse of the dynamics model and requires learning only the reward component. Successor features (Dayan, 1993) do precisely this: a model-free policy's action-value function is expressed as the dot product between a vector of expected discounted future state occupancies, the successor features, and another vector representing the immediate reward in each of those successor states. The factorization follows from the assumption that the reward can be predicted as the dot product between a state representation vector and a learned reward vector. Transfer to a new task therefore requires relearning only the reward parameters instead of the entire model, and amounts to the supervised learning problem of predicting the current state's immediate reward. This factorization can be limiting because it assumes that the reward is a linear function of the current state, which might not always hold, as the encoded features might not capture the quantities required for accurate reward modelling (Eysenbach et al., 2018; Hansen et al., 2019).
Therefore, this paper introduces a new form for the reward function: non-linear with respect to the current state. We assume that the learned features are not optimal and that the reward cannot be predicted directly from the raw features, which is not a strong assumption. This form increases the reward function's representational power and makes it possible to incorporate the current state into reward estimation, lessening the burden on the encoder components. Under the new reward formulation, a secondary term emerges that learns the expected future auto-correlation matrix of the state features. This secondary term, referred to as Λ, can be exploited as a possible avenue for directed exploration. Exploring the environment using Λ allows us to exploit and reuse learned environmental knowledge instead of relying on a purely random approach to exploration, such as ε-greedy. The contributions of this research are as follows:

• A novel formulation of successor features that uses a non-linear reward function, increasing the representational power of the reward function.

• Under the new reward formulation, a second term appears that models the expected future auto-correlation matrix of the state features.

• Preliminary results showing that the second term can be used for guided exploration during transfer instead of relying on ε-greedy exploration.

After the introduction of relevant background material in Section 1, we present the successor feature framework with a non-linear reward function in Section 2. Section 3 provides experimental support and an analysis of the new term in the decomposition. The paper concludes with a final discussion and possible avenues for future work in Section 4.
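To build intuition for why a matrix-valued term like Λ arises, consider a minimal numerical sketch. This is an illustrative assumption, not the paper's derivation (which appears in Section 2): if the one-step reward were a quadratic function of the features, r = φᵀMφ, then the discounted return decomposes against the discounted feature auto-correlation matrix Λ = E[Σᵢ γⁱ φφᵀ], because φᵀMφ equals the Frobenius inner product ⟨φφᵀ, M⟩. All names and values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, d = 0.9, 4

# Hypothetical feature trajectory phi(s_{t+1}), phi(s_{t+2}), ... under some policy.
phis = rng.normal(size=(50, d))
# Hypothetical quadratic reward parameters: r = phi^T M phi.
M = rng.normal(size=(d, d))

# Discounted return computed directly from the quadratic reward...
Q_direct = sum(gamma**i * phis[i] @ M @ phis[i] for i in range(50))

# ...equals <Lambda, M>, where Lambda is the discounted feature
# auto-correlation matrix accumulated along the trajectory.
Lam = sum(gamma**i * np.outer(phis[i], phis[i]) for i in range(50))
Q_decomposed = np.sum(Lam * M)  # Frobenius inner product
```

The point of the sketch is only that the reward parameters (here M) and the accumulated environment statistics (here Λ) separate cleanly, mirroring how ψ and w separate in the linear case.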

1. BACKGROUND

1.1 REINFORCEMENT LEARNING

Consider the interaction between an agent and an environment modelled by a Markov decision process (MDP) (Puterman, 2014). An MDP is defined by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a reward function $R : \mathcal{S} \to \mathbb{R}$, a discount factor $\gamma \in [0, 1]$, and a transition function $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$. The transition function gives the next-state distribution upon taking action $a$ in state $s$ and is often referred to as the dynamics of the MDP. The objective of the agent in RL is to find a policy $\pi$, a mapping from states to actions, that maximizes the expected discounted sum of rewards within the environment. One solution to this problem is to rely on learning a value function, where the action-value function of a policy $\pi$ is defined as:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i R(s_{t+i+1}) \,\Big|\, S_t = s, A_t = a\right],$$
where $\mathbb{E}_\pi[\dots]$ denotes the expected value when following the policy $\pi$. The policy is learned using an alternating process of policy evaluation, which computes the action-value function of a particular policy, and policy improvement, which derives a new policy that is greedy with respect to $Q^\pi(s, a)$ (Puterman, 2014).
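The alternation between evaluation and greedy improvement can be sketched as value iteration on a toy MDP. The two-state, two-action MDP below is hypothetical and chosen only to keep the example self-contained.

```python
import numpy as np

# Toy MDP (hypothetical): T[s, a, s'] is the transition probability,
# R[s] is the reward received in state s, gamma the discount factor.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.3, 0.7]]])
R = np.array([0.0, 1.0])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
# Q(s, a) = R(s) + gamma * sum_s' T[s, a, s'] * max_a' Q(s', a').
Q = np.zeros((2, 2))
for _ in range(500):
    Q = R[:, None] + gamma * T @ Q.max(axis=1)

# The greedy policy follows directly from the converged action-values.
policy = Q.argmax(axis=1)
```

Here `T @ Q.max(axis=1)` computes the expected value of the greedy next-state value for every state-action pair in one matrix product; in state 0 the greedy action is the one that drifts toward the rewarding state 1.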

1.2. SUCCESSOR FEATURES

Successor Features (SF) offer a decomposition of the Q-value function and have appeared under various names and interpretations (Dayan, 1993; Kulkarni et al., 2016; Barreto et al., 2017; Machado et al., 2017). This decomposition follows from the assumption that the reward function can be approximately represented as a linear combination of learned features $\phi(s; \theta_\phi)$, extracted by a neural network with parameters $\theta_\phi$, and a reward weight vector $w$. As such, the expected one-step reward can be computed as $r(s, a) = \phi(s; \theta_\phi)^\top w$. Following from this, the Q function can be rewritten as:
$$\begin{aligned} Q^\pi(s, a) &\approx \mathbb{E}_\pi\left[ r_{t+1} + \gamma r_{t+2} + \dots \mid S_t = s, A_t = a \right] \\ &= \mathbb{E}_\pi\left[ \phi(s_{t+1}; \theta_\phi)^\top w + \gamma\, \phi(s_{t+2}; \theta_\phi)^\top w + \dots \mid S_t = s, A_t = a \right] \\ &= \psi^\pi(s, a)^\top w, \end{aligned}$$
where $\psi^\pi(s, a)$ are referred to as the successor features under policy $\pi$. The $i$-th component of $\psi^\pi(s, a)$ provides the expected discounted sum of $\phi^{(i)}_t$ when following policy $\pi$ starting from state $s$ and taking action $a$.
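The decomposition, and the transfer it enables, can be illustrated in the tabular case. The sketch below assumes one-hot features and a fixed policy with a hypothetical transition matrix, so the successor features have a closed form; none of this is specific to the paper's method.

```python
import numpy as np

n, gamma = 3, 0.9
# Hypothetical state-transition matrix P[s, s'] under a fixed policy pi.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
Phi = np.eye(n)  # one-hot features: row s is phi(s)

# Successor features of each state under pi:
# Psi = sum_k gamma^k P^{k+1} Phi = (I - gamma P)^{-1} P Phi.
Psi = np.linalg.solve(np.eye(n) - gamma * P, P @ Phi)

# Transfer: each task needs only a reward vector w with r(s) = phi(s)^T w.
w_task1 = np.array([0.0, 0.0, 1.0])
Q_task1 = Psi @ w_task1  # values under pi for task 1
w_task2 = np.array([1.0, 0.0, 0.0])
Q_task2 = Psi @ w_task2  # re-evaluated instantly, dynamics untouched
```

Swapping `w_task1` for `w_task2` re-evaluates the policy on a new reward structure with a single matrix-vector product, which is exactly the transfer benefit the factorization is meant to provide.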

