ALIGN-RUDDER: LEARNING FROM FEW DEMONSTRATIONS BY REWARD REDISTRIBUTION

Abstract

Reinforcement learning algorithms require a large number of samples to solve complex tasks with sparse and delayed rewards. Complex tasks are often hierarchically composed of sub-tasks. A step in the Q-function indicates solving a sub-task, where the expectation of the return increases. RUDDER identifies these steps and then redistributes reward to them, thus immediately giving reward when sub-tasks are solved. Since the delay of rewards is reduced, learning is considerably sped up. However, for complex tasks, current exploration strategies struggle to discover episodes with high rewards. Therefore, we assume that episodes with high rewards are given as demonstrations and do not have to be discovered by exploration. Typically, the number of demonstrations is small, and RUDDER's LSTM model does not learn well from so few examples. Hence, we introduce Align-RUDDER, which is RUDDER with two major modifications. First, Align-RUDDER assumes that episodes with high rewards are given as demonstrations, replacing RUDDER's safe exploration and lessons replay buffer. Second, we substitute RUDDER's LSTM model with a profile model obtained from a multiple sequence alignment of the demonstrations. Profile models can be constructed from as few as two demonstrations. Align-RUDDER inherits the concept of reward redistribution, which speeds up learning by reducing the delay of rewards. Align-RUDDER outperforms competitors on complex artificial tasks with delayed reward and few demonstrations. On the MineCraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently.

1. INTRODUCTION

Overview of our method, Align-RUDDER. Reinforcement learning algorithms struggle with learning complex tasks that have sparse and delayed rewards (Sutton & Barto, 2018; Rahmandad et al., 2009; Luoma et al., 2017). RUDDER (Arjona-Medina et al., 2019) has been shown to excel at learning from sparse and delayed rewards. RUDDER requires episodes with high rewards, which it stores in its lessons replay buffer for learning. However, for complex tasks, episodes with high rewards are difficult to find with current exploration strategies. Humans and animals obtain high-reward episodes from teachers, role models, or prototypes. In this spirit, we assume that episodes with high rewards are given as demonstrations. Consequently, RUDDER's safe exploration and lessons replay buffer can be replaced by these demonstrations. Generating demonstrations is often tedious for humans and time-consuming for automated exploration strategies; therefore, typically only few demonstrations are available. However, RUDDER's LSTM, as a deep learning method, requires many examples for learning. Therefore, we replace RUDDER's LSTM by a profile model obtained from a multiple sequence alignment of the demonstrations. Profile models are well known in bioinformatics, where they are used to score new sequences according to their similarity to the aligned sequences. RUDDER's LSTM predicts the return of an episode given a state-action sub-sequence. Our method replaces the LSTM prediction by the score of this sub-sequence when aligned to the profile model. In the RUDDER implementation, the LSTM predictions are used for return decomposition and reward redistribution by taking the difference of consecutive predictions. Align-RUDDER performs return decomposition and reward redistribution via the difference of alignment scores of consecutive sub-sequences.

Align-RUDDER vs. temporal difference and Monte Carlo. We assume to have high reward episodes as demonstrations. Align-RUDDER uses these episodes to identify through alignment
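The return decomposition described above can be sketched in a few lines. The following is an illustrative sketch, not the paper's implementation: `scores[t]` stands in for either RUDDER's LSTM return prediction or Align-RUDDER's profile-model alignment score of the state-action sub-sequence up to step t, and the function name and uniform error correction are our own illustrative choices.

```python
def redistribute_reward(scores, episode_return):
    """Turn per-step return predictions into redistributed rewards.

    scores[t] is the predicted return (or alignment score) after
    observing the state-action sub-sequence up to step t. The
    redistributed reward at step t is the difference of consecutive
    predictions, so steps where the prediction jumps (solved
    sub-tasks) receive large immediate reward.
    """
    redistributed = [scores[0]] + [
        scores[t] - scores[t - 1] for t in range(1, len(scores))
    ]
    # Correct any prediction error uniformly so the redistributed
    # rewards still sum to the true episode return.
    error = episode_return - sum(redistributed)
    n = len(redistributed)
    return [r + error / n for r in redistributed]

# A step in the predicted return at t=2 marks a solved sub-task;
# that step receives most of the redistributed reward.
rewards = redistribute_reward([0.0, 0.1, 0.9, 1.0], episode_return=1.0)
```

Because the differences telescope, the redistributed rewards sum to the final prediction; the uniform correction then guarantees the sum equals the observed return, so the optimal policy is unchanged while the reward delay is removed.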

