ALIGN-RUDDER: LEARNING FROM FEW DEMONSTRATIONS BY REWARD REDISTRIBUTION

Abstract

Reinforcement learning algorithms require a large number of samples to solve complex tasks with sparse and delayed rewards. Complex tasks are often hierarchically composed of sub-tasks. A step in the Q-function indicates solving a sub-task, where the expectation of the return increases. RUDDER identifies these steps and then redistributes reward to them, thus giving reward immediately when sub-tasks are solved. Since the delay of rewards is reduced, learning is considerably sped up. However, for complex tasks, current exploration strategies struggle to discover episodes with high rewards. Therefore, we assume that episodes with high rewards are given as demonstrations and do not have to be discovered by exploration. Typically, the number of demonstrations is small, and RUDDER's LSTM model does not learn well from so few examples. Hence, we introduce Align-RUDDER, which is RUDDER with two major modifications. First, Align-RUDDER assumes that episodes with high rewards are given as demonstrations, replacing RUDDER's safe exploration and lessons replay buffer. Second, we replace RUDDER's LSTM model with a profile model that is obtained from multiple sequence alignment of the demonstrations. Profile models can be constructed from as few as two demonstrations. Align-RUDDER inherits the concept of reward redistribution, which speeds up learning by reducing the delay of rewards. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the MineCraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently.
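The reward redistribution inherited from RUDDER can be illustrated with a minimal, hypothetical sketch (this is not the authors' implementation; the predictor `toy_g` and all names are made up for illustration): given any predictor of the final return evaluated on growing state-action prefixes, the redistributed reward at each step is the difference of two consecutive predictions, so the rewards telescope to the predicted return while concentrating at the steps where the prediction jumps.

```python
# Hypothetical sketch of reward redistribution via return decomposition.
# Any predictor g of the final return, evaluated on growing prefixes of an
# episode, yields redistributed rewards as differences of consecutive
# predictions. Their sum telescopes to the final prediction, so the total
# reward is preserved while being moved to the steps that cause it.

def redistribute_reward(episode, g):
    """episode: list of (state, action) pairs; g: maps a prefix to a
    predicted return. Returns one redistributed reward per time step."""
    rewards = []
    prev = 0.0  # prediction for the empty prefix
    for t in range(len(episode)):
        pred = g(episode[: t + 1])
        rewards.append(pred - prev)  # height of the prediction step at t
        prev = pred
    return rewards

# Toy predictor (made up): return becomes more certain once "key" and then
# "door" appear in the action sequence, mirroring the key-door-treasure task.
def toy_g(prefix):
    actions = [a for _, a in prefix]
    if "key" in actions and "door" in actions:
        return 1.0
    if "key" in actions:
        return 0.5
    return 0.0

episode = [("s0", "move"), ("s1", "key"), ("s2", "move"), ("s3", "door")]
rr = redistribute_reward(episode, toy_g)
# Reward concentrates at the "key" and "door" steps; the sum equals the return.
```

Under this sketch, the delayed reward at the episode end is replaced by immediate rewards at the two sub-task completions, which is exactly the effect that shortens the reward delay.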

1. INTRODUCTION

Overview of our method, Align-RUDDER. Reinforcement learning algorithms struggle with learning complex tasks that have sparse and delayed rewards (Sutton & Barto, 2018; Rahmandad et al., 2009; Luoma et al., 2017). RUDDER (Arjona-Medina et al., 2019) has been shown to excel at learning with sparse and delayed rewards. RUDDER requires episodes with high rewards to store in its lessons replay buffer for learning. However, for complex tasks, episodes with high rewards are difficult to find with current exploration strategies. Humans and animals obtain high-reward episodes from teachers, role models, or prototypes. In this context, we assume that episodes with high rewards are given as demonstrations. Consequently, RUDDER's safe exploration and lessons replay buffer can be replaced by these demonstrations. Generating demonstrations is often tedious for humans and time-consuming for automated exploration strategies; therefore, typically only few demonstrations are available. However, RUDDER's LSTM, as a deep learning method, requires many examples for learning. Therefore, we replace RUDDER's LSTM with a profile model obtained from multiple sequence alignment of the demonstrations. Profile models are well known in bioinformatics, where they are used to score new sequences according to their similarity to the aligned sequences. RUDDER's LSTM predicts the return for an episode given a state-action sub-sequence. Our method replaces the LSTM prediction with the score of this sub-sequence when aligned to the profile model. In the RUDDER implementation, the LSTM predictions are used for return decomposition and reward redistribution via the difference of consecutive predictions. Align-RUDDER performs return decomposition and reward redistribution via the difference of alignment scores for consecutive sub-sequences.

Align-RUDDER vs. temporal difference and Monte Carlo. We assume to have high-reward episodes as demonstrations.
Align-RUDDER uses these episodes to identify, through alignment techniques, state-actions that are indicative of high rewards. Align-RUDDER redistributes rewards to these state-actions to adjust the policy so that they are reached more often. Consequently, the return increases and relevant episodes are sampled more frequently. For delayed rewards and model-free reinforcement learning: (I) temporal difference (TD) learning suffers from vanishing information (Arjona-Medina et al., 2019); (II) Monte Carlo (MC) estimation averages over all possible futures, leading to high variance (Arjona-Medina et al., 2019). Monte Carlo Tree Search (MCTS), which was used for Go and chess, is a model-based method that can handle delayed and rare rewards (Silver et al., 2016; 2017).

Basic insight: Q-functions (action-value functions) are step functions. Complex tasks are often hierarchically composed of sub-tasks (see Fig. 1). Hence, the Q-function of an optimal policy resembles a step function. A step is a change in return expectation, that is, the expected amount of the return or the probability of obtaining the return changes. Steps indicate achievements, failures, accomplished sub-tasks, or changes of the environment. Identifying large steps in the Q-function speeds up learning, since it allows (i) increasing the return and (ii) sampling more relevant episodes. A Q-function predicts the expected return from every state-action pair (see Fig. 1), which is prone to prediction errors that hamper learning. Since the Q-function is mostly constant, it is not necessary to predict the expected return for every state-action pair. It is sufficient to identify the relevant state-actions across the whole episode and use them for predicting the expected return (see Fig. 1). An LSTM network (Hochreiter & Schmidhuber, 1995; 1997a;b) can store the relevant state-actions in its memory cells.
Subsequently, it updates them only when a new relevant state-action pair appears, which is also when its output changes; otherwise the output stays constant. The basic insight that Q-functions are step functions motivates identifying these steps via return decomposition and speeding up learning via reward redistribution.

Reward redistribution: idea and return decomposition. We consider reward redistributions that are obtained by return decomposition given an episodic Markov decision process (MDP). The Q-function is assumed to be a step function (blue curve, row 1 of Fig. 1, right panel). Return decomposition identifies the steps of the Q-function (green arrows in Fig. 1, right panel). A function (an LSTM in RUDDER, an alignment model in Align-RUDDER) predicts the expected return (red arrow, row 1 of Fig. 1, right panel) given the state-action sub-sequence. The prediction is decomposed into the single steps of the Q-function (green arrows in Fig. 1). The redistributed rewards (small red arrows in the second and third rows of the right panel of Fig. 1) remove the steps. Consequently, the expected future reward is equal to zero (blue curve at zero in the last row of the right panel of Fig. 1). With future rewards of zero, learning the Q-values simplifies to estimating the expected immediate rewards (small red arrows in the right panel of Fig. 1), since delayed rewards are no longer present.

Reward redistribution using multiple sequence alignment. RUDDER uses an LSTM model for reward redistribution via return decomposition. The reward redistribution is the difference of two consecutive predictions of the LSTM model. If a state-action pair increases the prediction of the return,



Figure 1: Basic insight of RUDDER (left panel) and reward redistribution (right panel). Left panel, row 1: An agent has to take a key to open the door to a treasure. Both events increase the probability of obtaining the treasure. Row 2: Learning the Q-function with a fully connected network requires predicting the expected return from every state-action pair (red arrows). Row 3: Learning the Q-function by memorizing relevant state-action pairs requires predicting only the steps (red arrows). Right panel, row 1: The Q-function is the expected future reward (blue curve), since reward is given only at the end. The Q-function is a step function, where green arrows indicate steps and the big red arrow the expected return. Row 2: The redistributed reward (small red arrow) removes a step in the Q-function, and the expected future return becomes constant at this step (blue curve). Row 3: After redistributing the return, the expected future reward is equal to zero (blue curve at zero). Learning focuses on the expected immediate reward (red arrows), since delayed rewards are no longer present.
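The alignment-based return decomposition can be sketched with a deliberately simplified stand-in for a profile model (hypothetical example; the actual method builds profile models via multiple sequence alignment, which this toy column-frequency "profile" only approximates, and all function names are made up): a sub-sequence is scored by how well it matches the demonstrations position by position, and the difference of consecutive scores gives the redistributed reward.

```python
# Hypothetical, simplified sketch of alignment-score-based reward
# redistribution. The "profile" here is just the per-position event
# frequency over equally long demonstrations; a prefix is scored by its
# match to the profile, and score differences yield per-step rewards.

def build_profile(demonstrations):
    """Column-wise event counts over equally long demonstration sequences."""
    profile = []
    for pos in range(len(demonstrations[0])):
        counts = {}
        for demo in demonstrations:
            counts[demo[pos]] = counts.get(demo[pos], 0) + 1
        profile.append(counts)
    return profile

def score(prefix, profile, n_demos):
    """Match score of a prefix: summed relative event frequency per position."""
    return sum(
        profile[pos].get(event, 0) / n_demos
        for pos, event in enumerate(prefix)
        if pos < len(profile)
    )

def redistributed_rewards(sequence, profile, n_demos):
    """Reward at step t is the increase in alignment score at step t."""
    rewards, prev = [], 0.0
    for t in range(len(sequence)):
        s = score(sequence[: t + 1], profile, n_demos)
        rewards.append(s - prev)
        prev = s
    return rewards

# Two toy demonstrations of a key-door task; only two are needed.
demos = [["move", "key", "move", "door"],
         ["move", "key", "jump", "door"]]
profile = build_profile(demos)
rr = redistributed_rewards(["move", "key", "move", "door"], profile, 2)
# Steps shared by all demonstrations ("key", "door") receive full reward;
# the ambiguous third step ("move" vs. "jump") receives only half.
```

This illustrates why steps that all demonstrations agree on, i.e., conserved positions of the alignment, attract the redistributed reward, while steps on which demonstrations differ contribute less.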

