MOMENTUM BOOSTED EPISODIC MEMORY FOR IMPROVING LEARNING IN LONG-TAILED RL ENVIRONMENTS

Abstract

Conventional reinforcement learning (RL) algorithms assume that the data distribution is uniform or nearly uniform. However, this is not the case in most real-world applications, such as autonomous driving, or in nature, where animals roam: some objects are encountered frequently, while most other experiences occur only rarely; the resulting distribution is called Zipfian. Taking inspiration from the theory of complementary learning systems, we propose an architecture for learning from Zipfian distributions in which long-tail states are discovered in an unsupervised manner and kept longer, together with their recurrent activations, in episodic memory. These recurrent activations are then reinstated from episodic memory via a similarity search with weighted importance. The proposed architecture yields improved performance over conventional architectures on a Zipfian task. Our method outperforms IMPALA by a significant margin of 20.3% when maps/objects occur with a uniform distribution and by 50.2% on the rarest 20% of the distribution.

1. INTRODUCTION

Humans and animals roam around in environments that are unstructured. Existing reinforcement learning algorithms, however, are built around the assumption that environments are mostly uniform. In practice, a small subset of experiences recurs frequently, while many important experiences occur only rarely (Zipf, 2013; Smith et al., 2018). For example, imagine a deer trying to survive in an environment with predators. If, while drinking at a water source, it narrowly escapes a predator, the deer cannot afford to learn slowly from many such experiences; it needs to learn from that single experience quickly, and generalize across similar instances, to avoid dangerous water sources. Similarly, in autonomous driving, experiences are not uniform, and it is usually the rare instances, such as accidents or unusual situations, that are most critical in real-world settings. This is the fundamental premise of the theory of complementary learning systems (McClelland et al., 1995; Kumaran et al., 2016). In this framework, an intelligent agent needs a fast learning system and a slow learning system operating together to internally restructure the statistics of the environment for better survival, rather than naively expecting the environment to be uniform. In the brain, this is hypothesized to happen through the interplay between the hippocampus, a fast learning system, and the neocortex, a slow learning system; together they manage to generalize and retain experiences crucial to the goals of the organism (Kumaran et al., 2016). To achieve their respective learning outcomes, the two systems also need each other (Botvinick et al., 2019). The hippocampus achieves fast learning through its reliance on the slow learning system of the cortex, where high-dimensional data coming from the sensory systems are converted into low-dimensional representations that the hippocampus can operate on.
Such top-down modulation from the cortex influences processing in the hippocampus (Kumaran & Maguire, 2007). Similarly, the slow, structured learning of the cortex happens through interleaved learning, by replaying experiences stored in the hippocampus (O'Neill et al., 2010). Here, we look specifically at this interplay between a fast learning and a slow learning system and apply it to the long-tailed phenomenon. The reinforcement learning algorithm (Sutton & Barto, 2018b; Espeholt et al., 2018) uses the episodic buffer to generalize across experiences, and a familiarity memory prioritizes long-tail data based on the outputs generated by the RL algorithm. This prioritization of samples happens through a contrastive momentum loss, which enables the unsupervised discovery of long-tailed data from the stream of experiences (Zhou et al., 2022). The prioritized samples are kept longer in memory, and the hidden activations corresponding to these samples are then reinstated in the recurrent layers of the RL network. Our main contributions are:
• Proposing a first solution to the problem of navigating to objects occurring with a long-tailed distribution using deep reinforcement learning.
• Applying a contrastive momentum loss for the unsupervised discovery of long-tail states in the context of reinforcement learning.
• A novel method to prioritize long-tail states in the buffer and then reinstate their hidden activations in recurrent layers.
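The prioritization-and-reinstatement idea above can be illustrated with a toy buffer: entries with a higher rarity score (here standing in for the contrastive momentum loss) survive eviction longer, and a similarity search over stored keys returns a weighted blend of stored hidden activations. All class and method names below are illustrative, not the paper's actual implementation.

```python
import numpy as np

class EpisodicBuffer:
    """Toy episodic memory sketch: rare states are kept longer, and retrieval
    reinstates a similarity-weighted blend of stored hidden activations."""

    def __init__(self, capacity, key_dim, hidden_dim):
        self.capacity = capacity
        self.keys = np.empty((0, key_dim))
        self.hiddens = np.empty((0, hidden_dim))
        self.scores = np.empty(0)  # rarity score, e.g. a momentum contrastive loss

    def write(self, key, hidden, rarity):
        self.keys = np.vstack([self.keys, key[None]])
        self.hiddens = np.vstack([self.hiddens, hidden[None]])
        self.scores = np.append(self.scores, rarity)
        if len(self.scores) > self.capacity:
            evict = np.argmin(self.scores)  # drop the most common state
            self.keys = np.delete(self.keys, evict, axis=0)
            self.hiddens = np.delete(self.hiddens, evict, axis=0)
            self.scores = np.delete(self.scores, evict)

    def reinstate(self, query, k=3):
        # cosine-similarity search; softmax-weighted blend of the top-k hiddens
        sims = self.keys @ query / (np.linalg.norm(self.keys, axis=1)
                                    * np.linalg.norm(query) + 1e-8)
        top = np.argsort(sims)[-k:]
        w = np.exp(sims[top])
        w /= w.sum()
        return w @ self.hiddens[top]
```

In this sketch, eviction by lowest rarity score is what keeps long-tail states in memory longer than frequent ones.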

2. BACKGROUND

2.1. MARKOV DECISION PROCESSES

Assume an environment E that provides the agent with an observation S_t; the agent selects an action A_t, and the environment responds by providing the agent with the next state S_{t+1}. These interactions between agent and environment are formalized as Markov decision processes (MDPs), reinforcement learning tasks that satisfy the Markov property (Sutton & Barto, 2018a). An MDP is defined by the tuple ⟨S, A, R, T, γ⟩, where S is the set of states, A is the set of actions, R : S × A → ℝ is the reward function, T : S × A → Dist(S) is the transition function mapping state-action pairs to distributions over next states, and γ ∈ [0, 1] is the discount factor.
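As a concrete instance of the tuple ⟨S, A, R, T, γ⟩, the following sketch defines a toy two-state MDP and samples a discounted return under a policy. The states, transition probabilities, and rewards are invented purely for illustration.

```python
import numpy as np

# A toy MDP as the tuple <S, A, R, T, gamma>.
S = [0, 1]
A = [0, 1]
gamma = 0.9
# T[s][a] is a distribution over next states, i.e. an element of Dist(S).
T = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [1.0, 0.0], 1: [0.5, 0.5]}}
# R maps state-action pairs to real-valued rewards.
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}

def rollout(policy, s0=0, steps=5, seed=0):
    """Sample a discounted return by following `policy` for `steps` steps."""
    rng = np.random.default_rng(seed)
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(steps):
        a = policy(s)
        ret += discount * R[(s, a)]
        s = rng.choice(S, p=T[s][a])  # next state drawn from T
        discount *= gamma
    return ret
```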

2.2. IMPALA

Previous work in the actor-critic framework has been based on single-learner, multiple-actor architectures in which communication consists of gradients with respect to the policy parameters (Mnih et al., 2016b). IMPALA is a distributed off-policy actor-critic framework (Espeholt et al., 2018) in which actors instead communicate sequences of trajectories to the learners, giving the system very high throughput. We take IMPALA as our base architecture and build on top of it.
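Because IMPALA's actors lag behind the learner, it corrects for the resulting off-policy data with V-trace targets. The following is a simplified single-trajectory numpy sketch of the V-trace computation from Espeholt et al. (2018), with all argument names our own; the actual implementation operates on batched tensors.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap, rhos, gamma=0.99,
                   rho_bar=1.0, c_bar=1.0):
    """Simplified V-trace value targets for one trajectory.
    `rhos` are the importance ratios pi(a|x) / mu(a|x) between the learner
    policy pi and the (stale) actor behaviour policy mu."""
    n = len(rewards)
    clipped_rho = np.minimum(rhos, rho_bar)   # truncated IS weights
    clipped_c = np.minimum(rhos, c_bar)       # trace-cutting coefficients
    next_values = np.append(values[1:], bootstrap)
    deltas = clipped_rho * (rewards + gamma * next_values - values)
    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    vs = np.zeros(n)
    acc = 0.0
    for t in reversed(range(n)):
        acc = deltas[t] + gamma * clipped_c[t] * acc
        vs[t] = values[t] + acc
    return vs
```

When the behaviour and learner policies coincide (all ratios equal to 1), the targets reduce to ordinary n-step returns.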

2.3. MEMORY SYSTEMS IN RL

Memory systems allow humans to retrieve the relevant set of experiences for decision-making in unseen circumstances. Among the types of memory studied in neuroscience are working memory and episodic memory: working memory is short-term, temporary storage, while episodic memory is non-parametric or semi-parametric long-term storage.

2.4. SELF SUPERVISED LEARNING

Self-supervised learning has been widely used in reinforcement learning to learn better representations for learning actions (Anand et al., 2019; Laskin et al., 2020). Here we look at the contrastive learning framework (Chopra et al., 2005; Chen et al., 2020; He et al., 2019), which uses similarity constraints to learn representations: different views of the same image are brought together in representation space, while views of different images are pushed apart. Self-supervised long-tailed learning methods have typically been developed from either a loss perspective (e.g., focal loss) or a model perspective.
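The similarity constraints described above are commonly expressed as an InfoNCE-style loss (the form used by Chen et al., 2020, among others). The sketch below is a minimal numpy version for paired views; function and argument names are ours, and a practical implementation would use a deep-learning framework with gradients.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Minimal InfoNCE-style contrastive loss.
    Row i of `anchors` and row i of `positives` are two views of the same
    input; all other rows act as negatives."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    a, p = normalize(anchors), normalize(positives)
    logits = a @ p.T / temperature               # cosine similarity of all pairs
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # matched pairs lie on the diagonal
```

Minimizing this loss pulls matched views together and pushes mismatched views apart, which is the mechanism the contrastive momentum loss builds on.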



Deep reinforcement learning agents with episodic memory, in particular combinations of non-parametric and parametric networks, have shown improved sample efficiency and are suitable for decision-making in rare events. Blundell et al. (2016) used a non-parametric model to keep the best Q-values in tabular memory. Pritzel et al. (2017), in Neural Episodic Control, proposed a differentiable neural dictionary to keep representations and Q-values in semi-tabular form. Hansen et al. (2018) took a trajectory-centric approach to modelling such systems. Our work builds on Fortunato et al. (2019)'s state-centric formulation of an episodic memory (MEM), which implements working memory using a latent recurrent neural network alongside an episodic memory.
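The semi-tabular lookups in these episodic-control methods share a common pattern: estimate a Q-value as a distance-weighted average over the nearest stored keys. The sketch below illustrates that pattern in numpy, using the inverse-distance kernel of Pritzel et al. (2017); the names and the fixed memory arrays are illustrative only.

```python
import numpy as np

def episodic_q(memory_keys, memory_qs, query, k=2):
    """Episodic-control-style lookup: the Q estimate for `query` is an
    inverse-distance weighted average over the k nearest stored keys."""
    dists = np.linalg.norm(memory_keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] + 1e-3)  # inverse-distance kernel
    return np.sum(w * memory_qs[nearest]) / np.sum(w)
```

A query that exactly matches a stored key essentially reads back that key's stored value, which is what makes such memories fast to exploit after a single rare experience.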

