TRAINING EQUILIBRIA IN REINFORCEMENT LEARNING

Abstract

In partially observable environments, reinforcement learning algorithms such as policy gradient and Q-learning may have multiple equilibria (policies that are stable under further training) and can converge to equilibria that are strictly suboptimal. Prior work blames insufficient exploration, but suboptimal equilibria can arise despite full exploration and other favorable circumstances such as a flexible policy parametrization. We show theoretically that the core problem is that in partially observed environments, an agent's past actions induce a distribution over hidden states. Equipping the policy with memory helps it model the hidden state and leads to convergence to a higher-reward equilibrium, even when a memoryless optimal policy exists. Experiments show that policies with insufficient memory tend to learn to use the environment as auxiliary memory, and that parameter noise helps policies escape suboptimal equilibria.

1. INTRODUCTION

In Markov decision processes (MDPs), Q-learning and policy gradient methods are known to converge to an optimal policy (Watkins and Dayan, 1992; Bhandari and Russo, 2019), subject to the assumptions detailed in Section 3. In non-Markovian environments such as partially observable MDPs (POMDPs), this guarantee fails for Markovian policies: the algorithms do not always converge to an optimal policy, and there may be multiple (suboptimal) policies, each surrounded by its own basin of attraction. This holds even in the tabular setting with full exploration, where formal convergence guarantees are strongest.

For example, consider the double-tap environment described in Figure 1. The optimal policy in this environment is to always choose action a_2. However, a policy that favors a_1 is stuck in a local optimum. Such a policy will never converge to the optimal policy under further training, despite exploring the full state-action space. We call a policy that is fixed under further training an equilibrium of the training process. For policy gradient methods, equilibria correspond to stationary points of the expected return; for Q-learning, equilibria are fixed points of the Bellman equation.

Pendrith and McGarity (1998) construct an environment in which Q-learning may converge to a suboptimal policy. We extend their finding and show that in POMDPs, policy gradient methods and Q-learning may have multiple training equilibria and can converge to a suboptimal equilibrium despite full exploration and other favorable circumstances (described in Section 3).

Figure 1: The double-tap environment. Arrows are transitions between states, labeled with the corresponding action and reward: from s_1, action a_1 yields reward 1 and action a_2 yields reward 0; from s_2, action a_1 yields reward 0 and action a_2 yields reward 2. The agent is rewarded when it chooses the same action twice in a row. The states s_1, s_2 are unobserved. The optimal policy is to always choose action a_2; however, a policy that favors a_1 is caught in a local optimum, since picking a_2 is only profitable once done with high probability (see Figure 2a).
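The two basins of attraction in the double-tap environment can be made concrete with a short sketch (ours, not the paper's code). For a memoryless policy that plays a_1 with probability p, the hidden state records the last action, so its stationary distribution is (p, 1 - p) and the average reward is J(p) = p^2 + 2(1 - p)^2. This function has constrained local maxima at both endpoints: p = 0 (return 2, optimal) and p = 1 (return 1, suboptimal), separated at p = 2/3.

```python
def expected_return(p):
    """Average reward of a memoryless policy playing a1 with probability p.

    The hidden state is the previous action, so its stationary distribution
    is (p, 1 - p), giving J(p) = p^2 + 2 * (1 - p)^2.
    """
    return p ** 2 + 2 * (1 - p) ** 2


def gradient_ascent(p, lr=0.01, steps=2000):
    """Exact policy-gradient ascent on J, with p clipped to [0, 1]."""
    for _ in range(steps):
        grad = 6 * p - 4  # dJ/dp for J(p) = 3p^2 - 4p + 2
        p = min(1.0, max(0.0, p + lr * grad))
    return p
```

Starting above the separatrix p = 2/3 (a policy that already favors a_1), even exact gradient ascent converges to the suboptimal equilibrium p = 1 with return 1; starting below it, ascent reaches p = 0 with return 2. No amount of exploration changes this, since the gradient here is computed exactly.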
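A similar illustration works for Q-learning (again a sketch under our reading of the environment, with a hypothetical helper `q_fixed_point`). With a single observation, the memoryless Q-table has one value per action, and the expected Bellman update depends on the behavior policy: a behavior that mostly plays a_1 makes the hidden state mostly s_1, so a_1 earns reward near 1 while exploratory a_2 earns near 0. The resulting fixed point is greedy with respect to a_1, i.e. a self-consistent but suboptimal equilibrium.

```python
def q_fixed_point(p_a1, gamma=0.9, iters=200):
    """Iterate the expected Q-learning update for a memoryless Q-table
    in double-tap, under a fixed behavior policy playing a1 w.p. p_a1.

    The hidden state (last action) matches the behavior policy, so the
    expected reward for a1 is p_a1 * 1 and for a2 is (1 - p_a1) * 2.
    Returns [Q(a1), Q(a2)] at the fixed point.
    """
    q = [0.0, 0.0]
    for _ in range(iters):
        v = max(q)  # memoryless bootstrap: one observation, greedy value
        q[0] = p_a1 * 1 + gamma * v
        q[1] = (1 - p_a1) * 2 + gamma * v
    return q
```

Under a behavior policy with p_a1 = 0.9, the fixed point satisfies Q(a1) > Q(a2), so greedy action selection keeps playing a_1: a suboptimal Q-learning equilibrium reached despite every state-action pair being visited. Under p_a1 = 0.1, the ordering flips and the optimal a_2 is greedy.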

