MOMENTUM BOOSTED EPISODIC MEMORY FOR IMPROVING LEARNING IN LONG-TAILED RL ENVIRONMENTS

Abstract

Conventional Reinforcement Learning (RL) algorithms assume that the data distribution is uniform or close to uniform. However, this is not the case in most real-world settings, such as autonomous driving or natural environments in which animals roam: a few objects are encountered frequently, while most other experiences occur only rarely. The resulting distribution is called Zipfian. Taking inspiration from the theory of complementary learning systems, we propose an architecture for learning from Zipfian distributions in which long-tail states are discovered in an unsupervised manner and are retained, along with their recurrent activations, for longer in episodic memory. The recurrent activations are then reinstated from episodic memory via a similarity search, with each retrieved activation given a weighted importance. The proposed architecture yields improved performance on a Zipfian task over conventional architectures: our method outperforms IMPALA by a significant margin of 20.3% when maps/objects occur with a uniform distribution and by 50.2% on the rarest 20% of the distribution.
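To make the retrieval step concrete, the following is a minimal sketch (the function name, dimensions, and softmax weighting are our illustrative assumptions, not the paper's implementation) of reinstating stored recurrent activations from an episodic memory by similarity search, with each retrieved activation weighted by its match score:

```python
# Hedged sketch of similarity-weighted reinstatement from an episodic memory.
# All names and details here are illustrative assumptions, not the paper's code.
import numpy as np

def reinstate(query, keys, values, top_k=3):
    """Return a similarity-weighted mixture of stored recurrent activations.

    query:  (d,)   embedding of the current state
    keys:   (n, d) stored state embeddings
    values: (n, h) recurrent activations stored alongside the keys
    """
    # Cosine similarity between the query and every stored key.
    sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
    idx = np.argsort(sims)[-top_k:]  # indices of the top-k nearest neighbours
    w = np.exp(sims[idx])
    w /= w.sum()                     # softmax over the match scores
    return w @ values[idx]           # weighted reinstated activation, shape (h,)

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 8))
values = rng.normal(size=(100, 16))
out = reinstate(rng.normal(size=8), keys, values)
print(out.shape)
```

The softmax weighting is one simple way to realize "weighted importance"; the paper's actual weighting scheme may differ.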

1. INTRODUCTION

Humans and animals roam around in environments that are unstructured in nature. However, existing reinforcement learning algorithms are built around the assumption that environments are mostly uniform. In practice, a small subset of experiences recurs frequently while many important experiences occur only rarely (Zipf, 2013; Smith et al., 2018). For example, imagine a deer trying to survive in an environment with predators. If the deer escapes a potential predator while drinking from a water source, it cannot afford to learn slowly from many such encounters before learning to avoid dangerous water sources; it needs to learn from that experience quickly and generalize across similar instances. Similarly, in autonomous driving, experiences are not uniform, and the rare instances, such as accidents or other unusual events, are usually the most critical in real-world settings. This is the fundamental premise on which the theory of complementary learning systems (McClelland et al., 1995; Kumaran et al., 2016) is built. In this framework, an intelligent agent needs a fast learning system and a slow learning system operating together to internally restructure the statistics of the environment for better survival, rather than naively expecting uniform environments. In the brain, this is hypothesized to arise through the interplay between the hippocampus, a fast learning system, and the neocortex, a slow learning system; together they manage to generalize and to retain experiences crucial to the goals of the organism (Kumaran et al., 2016). To achieve their respective learning outcomes, the two systems also need each other (Botvinick et al., 2019). The hippocampus achieves fast learning by relying on the slow learning system of the cortex, where high-dimensional data coming from the sensory systems are converted into low-dimensional representations that the hippocampus can operate on.
Such top-down modulation from the cortex influences processing in the hippocampus (Kumaran & Maguire, 2007). Similarly, the slow, structured learning of the cortex happens through interleaved learning by replaying experiences stored in the hippocampus (O'Neill et al., 2010). Here, we focus on this interplay between a fast learning and a slow learning system and apply it to the long-tailed phenomenon. The reinforcement learning algorithm (Sutton & Barto, 2018b; Espeholt et al., 2018) uses the episodic buffer to generalize across experiences, and a

