THE ACT OF REMEMBERING: A STUDY IN PARTIALLY OBSERVABLE REINFORCEMENT LEARNING

Abstract

Partial observability remains a major challenge for reinforcement learning (RL). In fully observable environments it is sufficient for RL agents to learn memoryless policies. However, some form of memory is necessary when RL agents are faced with partial observability. In this paper we study a lightweight approach: we augment the environment with an external memory and additional actions to control what, if anything, is written to the memory. At every step, the current memory state is part of the agent's observation, and the agent selects a tuple of actions: one action that modifies the environment and another that modifies the memory. When the external memory is sufficiently expressive, optimal memoryless policies yield globally optimal solutions. We develop the theory for memory-augmented environments and formalize the RL problem. Previous attempts to use external memory in the form of binary memory have produced poor results in practice. We propose and experimentally evaluate alternative forms of k-size buffer memory where the agent can decide to remember observations by pushing them (or not) into the buffer. Our memories are simple to implement and outperform binary and LSTM-based memories in well-established partially observable domains.

1. INTRODUCTION

Reinforcement Learning (RL) agents learn policies (i.e., mappings from observations to actions) by interacting with an environment. RL agents usually learn memoryless policies, which only consider the last observation when selecting the next action. In fully observable environments, learning memoryless policies is an effective strategy. However, RL methods often struggle when the environment is partially observable. Indeed, partial observability is one of the main challenges to applying RL in real-world settings (Dulac-Arnold et al., 2019). When faced with partially observable environments, RL agents require some form of memory to learn optimal behaviours. This is usually accomplished using k-order memories (Mnih et al., 2015), recurrent networks (Hausknecht & Stone, 2015), or memory-augmented networks (Oh et al., 2016). In this paper, we study a lightweight alternative approach to tackling partial observability in RL. The approach consists of providing the agent with an external memory and extra actions to control it (as shown in Figure 1). The resulting RL problem is still partially observable, but if the external memory is sufficiently expressive, then optimal memoryless policies will also yield globally optimal solutions. Previous works that explored this idea using external binary or continuous memories produced poor results with standard RL methods (Peshkin et al., 1999; Zhang et al., 2016). Our work shows that the main issue lies in the type of memory they used, and that RL agents can learn effective strategies for exploiting external memories when those memories are structured appropriately. In what follows, we:

• formalize the RL problem in the context of memory-augmented environments and study the theory behind memoryless policies that jointly decide what to do and what to remember;

• propose two novel forms of external memory, called Ok and OAk. These k-size buffer memories generalize k-order memories by letting the agent (learn to) decide whether to push the current observation into the memory buffer;

• empirically evaluate Ok and OAk against previously proposed binary (Bk), k-order (Kk), and LSTM memories (the most widely used approach for partially observable RL).
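To make the buffer-memory idea concrete, the following is a minimal sketch (not the paper's implementation) of an environment wrapper in the spirit of Ok: the agent's action is a tuple (environment action, memory action), and the memory action decides whether the current observation is pushed into a k-size buffer whose contents are appended to the observation. The class and method names (`BufferMemoryWrapper`, `_augment`) are our own, and a Gym-style `reset`/`step` interface is assumed.

```python
from collections import deque


class BufferMemoryWrapper:
    """Augment a partially observable environment with an external
    k-size observation buffer.

    At every step the agent selects (env_action, mem_action).  If
    mem_action == PUSH, the observation the agent currently sees is
    appended to the buffer (the oldest entry is evicted once the
    buffer holds k items).  The buffer contents are exposed as part
    of the observation, so a memoryless policy over the augmented
    observation can condition on what it chose to remember.
    """

    SKIP, PUSH = 0, 1  # memory actions

    def __init__(self, env, k):
        self.env = env
        self.k = k
        self.buffer = deque(maxlen=k)
        self.last_obs = None

    def reset(self):
        self.buffer.clear()
        self.last_obs = self.env.reset()
        return self._augment(self.last_obs)

    def step(self, action):
        env_action, mem_action = action
        if mem_action == self.PUSH:
            # Remember the observation the agent is currently seeing.
            self.buffer.append(self.last_obs)
        obs, reward, done = self.env.step(env_action)
        self.last_obs = obs
        return self._augment(obs), reward, done

    def _augment(self, obs):
        # Pad so the augmented observation always has a fixed shape.
        pad = (None,) * (self.k - len(self.buffer))
        return (obs, tuple(self.buffer) + pad)
```

An OAk-style memory would analogously push (observation, action) pairs instead of observations alone; a k-order (Kk) memory corresponds to the degenerate case where every step is a forced PUSH.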

