THE ACT OF REMEMBERING: A STUDY IN PARTIALLY OBSERVABLE REINFORCEMENT LEARNING

Abstract

Partial observability remains a major challenge for reinforcement learning (RL). In fully observable environments, it is sufficient for RL agents to learn memoryless policies. However, some form of memory is necessary when RL agents are faced with partial observability. In this paper we study a lightweight approach: we augment the environment with an external memory and additional actions to control what, if anything, is written to the memory. At every step, the current memory state is part of the agent's observation, and the agent selects a tuple of actions: one action that modifies the environment and another that modifies the memory. When the external memory is sufficiently expressive, optimal memoryless policies yield globally optimal solutions. We develop the theory for memory-augmented environments and formalize the RL problem. Previous attempts to use external memory in the form of binary memory have produced poor results in practice. We propose and experimentally evaluate alternative forms of k-size buffer memory, where the agent can decide to remember observations by pushing them (or not) into the buffer. Our memories are simple to implement and outperform binary and LSTM-based memories in well-established partially observable domains.

1. INTRODUCTION

Reinforcement Learning (RL) agents learn policies (i.e., mappings from observations to actions) by interacting with an environment. RL agents usually learn memoryless policies, which consider only the last observation when selecting the next action. In fully observable environments, learning memoryless policies is an effective strategy. However, RL methods often struggle when the environment is partially observable. Indeed, partial observability is one of the main challenges to applying RL in real-world settings (Dulac-Arnold et al., 2019). When faced with partially observable environments, RL agents require some form of memory to learn optimal behaviours. This is usually accomplished using k-order memories (Mnih et al., 2015), recurrent networks (Hausknecht & Stone, 2015), or memory-augmented networks (Oh et al., 2016). In this paper, we study a lightweight alternative approach to tackling partial observability in RL. The approach consists of providing the agent with an external memory and extra actions to control it (as shown in Figure 1). The resulting RL problem is still partially observable, but if the external memory is sufficiently expressive, then optimal memoryless policies will also yield globally optimal solutions. Previous works that explored this idea using external binary or continuous memories produced poor results with standard RL methods (Peshkin et al., 1999; Zhang et al., 2016). Our work shows that the main issue lies in the type of memory they used, and that RL agents are capable of learning effective strategies for utilizing external memories when those memories are structured appropriately. In what follows, we:
• formalize the RL problem in the context of memory-augmented environments and study the theory behind memoryless policies that jointly decide what to do and what to remember;
• propose two novel forms of external memory, called Ok and OAk. These k-size buffer memories generalize k-order memories by letting the agent (learn to) decide whether or not to push the current observation into the memory buffer;
• empirically evaluate Ok and OAk relative to previously proposed binary (Bk), k-order (Kk), and LSTM memories (the most widely used approach for partially observable RL).
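The Ok buffer memory described above can be sketched in a few lines. This is an illustrative implementation rather than the authors' code: the class name ObservationBuffer and its methods are our own, and an OAk variant would push (observation, action) pairs instead of observations alone.

```python
from collections import deque

class ObservationBuffer:
    """k-size buffer memory (Ok): at every step, the agent chooses
    whether to push the current observation into the buffer."""

    def __init__(self, k, default_obs):
        # Start with k copies of a default observation so the memory
        # content always has a fixed size.
        self.buffer = deque([default_obs] * k, maxlen=k)

    def update(self, obs, push):
        # `push` is the extra memory action: 1 = remember, 0 = skip.
        # When full, pushing evicts the oldest remembered observation.
        if push:
            self.buffer.append(obs)

    def read(self):
        # The memory content is concatenated with the current
        # observation to form the agent's augmented observation.
        return tuple(self.buffer)
```

Setting `push=1` at every step recovers an ordinary k-order memory; the point of Ok is that the agent can learn to skip uninformative observations.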


Figure 1: A diagram of a Memory-Augmented Environment. The agent selects a joint action ā = ⟨a, w⟩, where a acts on the environment and w controls the memory; the memory is updated from the experience (o, a, r, o′) and the write action w; and the agent observes ō = ⟨o, m⟩, i.e., the environment observation together with the current memory content m.

Results show that Ok and OAk memories are usually more sample efficient than LSTM memories, while being faster to train and trivial to integrate with off-the-shelf RL methods. We therefore advocate for the adoption of Ok and OAk memories for partially observable RL problems. We end with a discussion of the limitations of Ok and OAk and interesting avenues for future work.
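The interaction loop of Figure 1 can be sketched as a thin wrapper around an existing environment. This is a minimal sketch under our own naming assumptions: MemoryAugmentedEnv is not from the paper, and the wrapped `env` and `memory` objects are assumed to expose the small interfaces documented in the comments.

```python
class MemoryAugmentedEnv:
    """Wraps a partially observable environment with an external memory.
    The agent's joint action is (a, w): `a` acts on the environment,
    `w` controls what is written to the memory. The agent observes (o, m)."""

    def __init__(self, env, memory):
        self.env = env        # assumed: reset() -> o, step(a) -> (o, r, done)
        self.memory = memory  # assumed: reset() -> m, update(o, a, w) -> m

    def reset(self):
        o = self.env.reset()
        m = self.memory.reset()
        return (o, m)

    def step(self, a, w):
        # Environment action `a` and memory action `w` are applied jointly.
        o_next, r, done = self.env.step(a)
        m = self.memory.update(o_next, a, w)  # w decides what is written
        return (o_next, m), r, done
```

Because the augmented observation (o, m) is all the agent conditions on, any off-the-shelf RL method for memoryless policies can be run on the wrapped environment unchanged.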

2. PRELIMINARIES

RL agents learn how to act by interacting with an environment. Often these environments are modelled as a Markov Decision Process (MDP). An MDP is a tuple M = ⟨S, A, R, p, γ, μ⟩, where S is a finite set of states, A is a finite set of actions, R is the finite set of possible rewards, p(s′, r|s, a) defines the dynamics of the MDP, γ is the discount factor, and μ is the initial state distribution. The interaction is usually divided into episodes. At the beginning of an episode, the environment is set to an initial state s_0, sampled from μ. Then, at time step t, the agent observes the current state s_t ∈ S and executes an action a_t ∈ A. In response, the environment returns the next state s_{t+1} and immediate reward r_t, sampled from p(s_{t+1}, r_t|s_t, a_t). The process then repeats until the end of the episode (when a new episode begins) or continues forever in non-episodic MDPs.

Agents select actions according to a policy π(a|s), which maps each state to a probability distribution over actions. The prediction task is to estimate how "good" a policy is, where the policy is evaluated according to the expected discounted return in any state. This can be done by estimating the action-value function q_π of policy π, where q_π(s, a) represents the expected discounted return when executing action a in state s and following π thereafter. Formally, q_π(s, a) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k} | S_t = s, A_t = a ], where E_π[·] denotes the expected value of a random variable given that the agent follows policy π, and t is any time step. q_π is usually estimated using Monte Carlo samples (Barto & Duff, 1994) or TD methods (Sutton, 1988). The control task involves finding the optimal policy π*, i.e., the policy that maximizes the expected discounted return in every state. To do so, most RL methods rely on the policy improvement theorem, which we discuss in Section 5.

We use a Partially Observable Markov Decision Process (POMDP) formulation to model partial observability.
A POMDP is a tuple P = ⟨S, O, A, R, p, ω, γ, μ⟩, where S, A, R, p, γ, and μ are as in the MDP above, O is a finite set of observations, and ω(o|s) is the observation probability distribution. Interacting with a POMDP is similar to interacting with an MDP. The environment starts from a sampled initial state s_0 ∼ μ. At time step t, the agent is in state s_t ∈ S, executes an action a_t ∈ A, receives an immediate reward r_t, and moves to s_{t+1} according to p(s_{t+1}, r_t|s_t, a_t). However, the agent does not observe s_t directly. Instead, the agent observes o_t ∈ O, which is linked to s_t via ω(o_t|s_t).
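The POMDP interaction protocol and the discounted return it produces can be made concrete in a short episode loop. This is an illustrative sketch, not code from the paper: the function name pomdp_episode and the callable interfaces for p, ω, μ, and the policy are our own conventions, and a deterministic toy dynamics is assumed purely for readability.

```python
def pomdp_episode(p, omega, mu, policy, gamma, max_steps=100):
    """Run one episode and return the discounted return G = sum_t gamma^t * r_t.
    Assumed interfaces: p(s, a) -> (s_next, r, done), omega(s) -> o sampled
    from omega(.|s), and mu() -> initial state. The agent only ever sees o."""
    s = mu()
    G, discount = 0.0, 1.0
    for _ in range(max_steps):
        o = omega(s)        # the agent observes o, never the state s itself
        a = policy(o)       # memoryless policy: conditions on o alone
        s, r, done = p(s, a)
        G += discount * r
        discount *= gamma
        if done:
            break
    return G
```

Note that because `policy` conditions only on o, two states with the same observation are indistinguishable to the agent; this aliasing is exactly what the external memories in this paper are designed to resolve.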

3. RELATED WORK

Early attempts to perform RL in partially observable domains focused on learning memoryless policies. Jaakkola et al. (1995) identified an RL algorithm that was guaranteed to converge to locally optimal memoryless policies, and similar guarantees have been given in the POMDP literature (Li et al., 2011). Unfortunately, Singh et al. (1994) showed that an optimal memoryless policy π*(a_t|o_t) can be arbitrarily worse than the optimal history-based policy π*(a_t|o_0, a_0, ..., o_t) for POMDPs. Different approaches have been proposed to learn history-based policies using some form of state-approximation technique. For example, model-based RL methods learn a state representation of

