TRAINING EQUILIBRIA IN REINFORCEMENT LEARNING

Abstract

In partially observable environments, reinforcement learning algorithms such as policy gradient and Q-learning may have multiple equilibria (policies that are stable under further training) and can converge to equilibria that are strictly suboptimal. Prior work blames insufficient exploration, but suboptimal equilibria can arise despite full exploration and other favorable circumstances like a flexible policy parametrization. We show theoretically that the core problem is that in partially observed environments, an agent's past actions induce a distribution on hidden states. Equipping the policy with memory helps it model the hidden state and leads to convergence to a higher-reward equilibrium, even when there exists a memoryless optimal policy. Experiments show that policies with insufficient memory tend to learn to use the environment as auxiliary memory, and that parameter noise helps policies escape suboptimal equilibria.

1. INTRODUCTION

In Markov decision processes (MDPs), Q-learning and policy gradient methods are known to always converge to an optimal policy (Watkins and Dayan 1992; Bhandari and Russo 2019), subject to the assumptions detailed in Section 3. In non-Markovian environments such as partially observable MDPs (POMDPs), this guarantee fails when using Markovian policies: the algorithms do not always converge to an optimal policy, and there may be multiple (suboptimal) policies surrounded by 'basins of attraction'. This is true even in the tabular setting with full exploration, where formal convergence guarantees are strongest.

For example, consider the double-tap environment described in Figure 1. The optimal policy in this environment is to always choose action a_2. However, a policy that favors a_1 is stuck in a local optimum, and will never converge to the optimal policy even with further training, despite exploring the full state-action space. We call a policy that is fixed under further training an equilibrium of the training process. For policy gradient methods, equilibria correspond to stationary points of the expected return; for Q-learning, equilibria are fixed points of the Bellman equation.

Pendrith and McGarity (1998) construct an environment in which Q-learning may converge to a suboptimal policy. We extend their finding and show that in POMDPs, policy gradient methods and Q-learning may have multiple training equilibria and sometimes converge to a suboptimal equilibrium despite full exploration and other favorable circumstances (described in Section 3).

Figure 1: The double-tap environment. Arrows are transitions between states, labeled with the corresponding action and reward. The agent is rewarded when it chooses the same action twice in a row. The states s_1, s_2 are unobserved. The optimal policy is to always choose action a_2.
However, a policy that favors a_1 is caught in a local optimum, since picking a_2 is only profitable once it is done with high probability (see Figure 2a).

Theoretical results. We show that multiple equilibria can only emerge when the distribution of unobserved states depends on past actions (Proposition 3.1). Our interpretation is that a memoryless policy must play a coordination game with its past self, and may get trapped in suboptimal Nash equilibria. For some contexts, we formalize this interpretation and show that Nash equilibria of coordination games correspond to the training equilibria of an associated RL environment (Theorem 3.2). A sufficient amount of memory resolves the coordination game and provably implies a unique optimal training equilibrium (Proposition 4.1). Thus memory can be crucial for convergence of RL algorithms even when a task does not require memory for optimal performance.

Empirical results (reported in Section 5). We confirm empirically that our theoretical results hold in practice. In addition, we show that even a memoryless policy can often learn to use the external environment as auxiliary memory, thus improving convergence in the same way that policy memory does. However, there exist counterexamples in which even a flexible environment that allows for external memory is not sufficient to learn an optimal policy. We also confirm the hypothesis that in environments with multiple equilibria, parameter noise (Rückstieß et al. 2008; Plappert et al. 2018) can lead to convergence to better equilibria, thus providing a novel explanation of why parameter noise is observed to be beneficial.
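The two basins of attraction in the double-tap environment can be checked with a few lines of exact gradient ascent. The following is a minimal sketch of our own (not the paper's experimental setup): since the hidden state simply records the last action, a stationary memoryless policy that plays a_2 with probability p induces P(s_2) = p, so the expected per-step reward is J(p) = (1-p)^2 + 2p^2, with dJ/dp = 6p - 2.

```python
import math

def expected_reward(p):
    # Stationary memoryless policy: play a2 with probability p.
    # The hidden state records the last action, so P(s2) = p, and the
    # expected per-step reward is (1-p)^2 * 1 + p^2 * 2.
    return (1 - p) ** 2 + 2 * p ** 2

def gradient_ascent(theta, lr=0.1, steps=2000):
    # Exact policy gradient on J(sigmoid(theta)); dJ/dp = 6p - 2.
    for _ in range(steps):
        p = 1 / (1 + math.exp(-theta))
        theta += lr * (6 * p - 2) * p * (1 - p)  # chain rule through sigmoid
    return 1 / (1 + math.exp(-theta))

print(gradient_ascent(-1.0))  # init favoring a1: converges toward p = 0
print(gradient_ascent(+1.0))  # init favoring a2: converges toward p = 1
```

The unique interior stationary point p = 1/3 is a reward minimum, so the two attracting equilibria are the deterministic policies; initialization with p < 1/3 lands on the suboptimal always-a_1 equilibrium (reward 1 per step) rather than the optimal always-a_2 equilibrium (reward 2 per step).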

2. BACKGROUND ON POMDPS

Reinforcement learning (RL) (Sutton and Barto 2018) is the problem of training an agent that takes actions in an environment in order to maximize a reward function. The most common formalism for reinforcement learning environments is the Markov decision process (MDP). An MDP models an environment which is Markovian in the sense that the environment state s_t and reward r_t at time t depend only on the previous state s_{t-1} and the previous action a_{t-1}.[1] Crucially, in this formalism a policy has access to the entire state s_t at each step, and thus no memory is necessary to perform optimally.

Most realistic environments are not MDPs, since real-world problems are invariably partially observable. For example, a driving agent must anticipate the possibility of other drivers emerging from side roads even if they are currently out of sight. To model partial observability, it is common to extend the MDP formalism to define a partially observable Markov decision process (POMDP) (Åström 1965). The main idea is to model the environment as an unobserved MDP, and to allow the policy to access observations sampled from an observation distribution O(o | s) conditional on the current state.

Formally, a POMDP is a tuple (S, A, O, T, O, R, γ, η_0), where S is the set of states, A the set of possible actions, O the set of observations, T the transition kernel, O(o | s) the conditional distribution of observations, R the reward function, γ ∈ [0, 1) the discount factor, and η_0 the initial state distribution. Let s_t denote the state at time t. A timestep then proceeds as follows: an observation is drawn according to the distribution O(o_t | s_t) and given as input to the policy, the policy outputs an action a_t, the agent receives reward R(s_t, a_t), and the next state is generated according to the transition kernel T(s_{t+1} | s_t, a_t).
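The timestep just described can be sketched as a generic interaction loop. In this illustration the POMDP components T, O, R, and η_0 are passed as plain Python callables, which is our own convention for the sketch, not a standard API; the double-tap environment from Figure 1 is instantiated below it.

```python
def rollout(policy, T, O, R, eta0, gamma=0.99, horizon=100):
    """Discounted return of one episode. T, O, R, eta0 are the POMDP's
    components, passed as plain callables (an illustrative convention)."""
    s = eta0()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        o = O(s)                  # the policy sees only the observation, not s
        a = policy(o)
        ret += discount * R(s, a)
        discount *= gamma
        s = T(s, a)               # the hidden state evolves Markovianly
    return ret

# Double-tap instantiation: two hidden states, one uninformative observation.
T = lambda s, a: a                # the hidden state records the last action
O = lambda s: 0                   # every state yields the same observation
R = lambda s, a: {("a1", "a1"): 1, ("a2", "a2"): 2}.get((s, a), 0)
eta0 = lambda: "a1"

print(rollout(lambda o: "a2", T, O, R, eta0) >
      rollout(lambda o: "a1", T, O, R, eta0))  # always-a2 beats always-a1
```

Note that because O(s) is constant here, a memoryless policy receives no information about the hidden state, which is exactly the setting in which the multiple-equilibria phenomenon arises.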

3. TRAINING EQUILIBRIA IN POMDPS

It is well known that Q-learning and policy gradient methods converge to a globally optimal policy if the environment is a (fully observable) MDP (Watkins and Dayan 1992; Bhandari and Russo 2019). However, in partially observable environments we have no such general guarantees. In particular, in this section we study partially observable environments where RL algorithms do converge, but where there exist multiple policies (training equilibria) to which they might converge, some of which are suboptimal. It is also possible that a training algorithm does not converge at all: Q-learning with discontinuous action-selection methods[2] may end up oscillating between two suboptimal policies (Gordon 1996). However, it is known that for Q-learning with continuous action selection, fixed points always exist
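The multiplicity of Q-learning fixed points can be observed directly in the double-tap environment. Below is a minimal tabular sketch of our own, with arbitrary hyperparameters: the policy is memoryless and there is a single observation, so the Q-table is indexed by action alone, and ε-greedy exploration visits the full state-action space.

```python
import random

def q_learning(q_init, eps=0.1, alpha=0.05, gamma=0.9, steps=50_000, seed=0):
    """Tabular Q-learning in the double-tap environment. The single
    observation means Q depends only on the action (memoryless policy)."""
    rng = random.Random(seed)
    q = list(q_init)
    s = 0                          # hidden state = last action (0 -> s1, 1 -> s2)
    for _ in range(steps):
        # epsilon-greedy action selection over the two actions
        a = rng.randrange(2) if rng.random() < eps else q.index(max(q))
        r = 1 if (s == 0 and a == 0) else 2 if (s == 1 and a == 1) else 0
        q[a] += alpha * (r + gamma * max(q) - q[a])  # standard Q-update
        s = a                      # the hidden state records the last action
    return q

q_from_a1 = q_learning([1.0, 0.0])   # initialized to favor a1
q_from_a2 = q_learning([0.0, 1.0])   # initialized to favor a2
```

Despite identical environments and full exploration, the greedy action at convergence matches the initialization: q_from_a1 ends with a_1 greedy and q_from_a2 with a_2 greedy, so which fixed point Q-learning reaches is determined by the basin of attraction it starts in, not by exploration.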



[1] In some formulations, the reward may also depend on the current state s_t.
[2] For example, ε-greedy action selection (Sutton and Barto 2018).

