NON-LINEAR REWARDS FOR SUCCESSOR FEATURES

ABSTRACT

Reinforcement Learning algorithms have reached new heights in performance, often overtaking humans on several challenging tasks such as Atari and Go. However, the resulting models typically learn fragile policies that are unable to transfer between tasks without full retraining. Successor features aim to improve this situation by decomposing the policy into two components: one capturing environmental dynamics and the other modelling reward. Under this framework, transfer between related tasks requires only training the reward component. However, successor features build upon the assumption that the current reward can be predicted from a linear combination of state features, an assumption with no guarantee. This paper proposes a novel improvement to the successor feature framework, where we instead assume that the reward function is a non-linear function of the state features, thereby increasing its representational power. After derivation of the new state-action value function, the decomposition includes a second term that learns the auto-correlation matrix between state features. Experimentally, we show that this term explicitly models the environment's stochasticity and can also be used in place of ε-greedy exploration methods during transfer. The performance of the proposed improvements to the successor feature framework is validated empirically on navigation tasks and on the control of a simulated robotic arm.

1 INTRODUCTION

Recently, Reinforcement Learning (RL) algorithms have achieved superhuman performance in several challenging domains, such as Atari (Mnih et al., 2015), Go (Silver et al., 2016), and Starcraft II (Vinyals et al., 2019). The main driver of these successes has been the combination of deep neural networks, a class of powerful non-linear function approximators, with RL algorithms (LeCun et al., 2015).
However, this class of Deep Reinforcement Learning (Deep RL) algorithms requires immense amounts of data within an environment, often ranging from tens to hundreds of millions of samples (Arulkumaran et al., 2017). Furthermore, commonly used algorithms often have difficulty transferring a learned policy between related tasks, for example where the environmental dynamics remain constant but the goal changes. In this case, the model must either be retrained completely or fine-tuned on the new task, in both cases requiring millions of additional samples. If the state dynamics are constant but the reward structure varies between tasks, it is wasteful to retrain the entire model. A more pragmatic approach is to decompose the RL agent's policy such that separate functions learn the state dynamics and the reward structure; doing so enables reuse of the dynamics model and only requires learning the reward component. Successor features (Dayan, 1993) do precisely this: a model-free policy's action-value function is expressed as the dot product between a vector of expected discounted future state occupancies, the successor features, and another vector representing the immediate reward in each of those successor states. The factorization follows from the assumption that reward can be predicted as the dot product between a state representation vector and a learned reward vector. Therefore, transfer to a new task requires relearning only the reward parameters instead of the entire model, and amounts to the supervised learning problem of predicting the current state's immediate reward. This factorization can be limiting because it assumes that the reward is a linear function of the current state, which might not always be the case, as the encoded features might not capture the quantities required for accurate reward modelling (Eysenbach et al., 2018; Hansen et al., 2019).
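The linear-reward assumption and the resulting transfer procedure can be sketched in a few lines of numpy. All dimensions and variable names below are illustrative, not taken from the paper: rewards are generated to be exactly linear in the features, so relearning the task reduces to the supervised regression described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4    # feature dimension (illustrative)
n = 500  # number of observed transitions

# State features phi(s) and a ground-truth reward vector w, so that
# r(s) = phi(s) . w  -- the linear-reward assumption of successor features.
Phi = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
rewards = Phi @ w_true

# Transfer under this framework: relearn only w by regressing
# observed rewards on state features (a supervised learning problem).
w_hat, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)

# With successor features psi(s, a) = E[sum_t gamma^t phi(s_t)],
# the action value then factorizes as Q(s, a) = psi(s, a) . w.
psi = rng.normal(size=d)  # a stand-in successor-feature vector
q_value = psi @ w_hat
```

Because the simulated rewards are noiselessly linear in the features, `w_hat` recovers `w_true` exactly; when the true reward is not linear in the features, no choice of `w` can fit it, which is the limitation the paper targets.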
Therefore, this paper introduces a new form for the reward function: non-linear with respect to the current state. We assume that the learned features are not optimal and that the reward cannot be predicted directly from the raw features, which is not a strong assumption. This form increases the reward function's representational power and makes it possible to incorporate the current state into reward
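One concrete way such a non-linear reward can give rise to the auto-correlation term mentioned in the abstract is a quadratic form in the features. The symbols below ($M$, $\Psi^\pi$) are illustrative notation for this sketch, not necessarily the paper's:

```latex
\begin{align*}
  r(s) &= \phi(s)^\top M\, \phi(s) + \phi(s)^\top w
    && \text{(quadratic, i.e.\ non-linear, reward)} \\
  Q^\pi(s,a) &= \big\langle \Psi^\pi(s,a),\, M \big\rangle
    + \psi^\pi(s,a)^\top w,
    && \text{where} \\
  \Psi^\pi(s,a) &= \mathbb{E}^\pi\!\Big[\textstyle\sum_{t=0}^{\infty}
    \gamma^t\, \phi(s_t)\phi(s_t)^\top \,\Big|\, s_0 = s,\, a_0 = a\Big]
\end{align*}
```

Under this assumption the value decomposition gains a second component, $\Psi^\pi$, the expected discounted auto-correlation matrix of the state features, alongside the usual successor features $\psi^\pi$.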

