MEASURING AND MITIGATING INTERFERENCE IN REINFORCEMENT LEARNING

Abstract

Catastrophic interference is common in many network-based learning systems, and many proposals exist for mitigating it. But before we can overcome interference, we must understand it better. In this work, we first provide a definition and novel measure of interference for value-based control methods such as Fitted Q Iteration and DQN. We systematically evaluate our measure of interference, showing that it correlates with forgetting, across a variety of network architectures. Our new interference measure allows us to ask novel scientific questions about commonly used deep learning architectures and to develop new learning algorithms. In particular, we show that updates on the last layer result in significantly higher interference than updates internal to the network. Lastly, we introduce a novel online-aware representation learning algorithm to minimize interference, and we empirically demonstrate that it improves stability and has lower interference.

1. INTRODUCTION

Generalization is a key property of reinforcement learning (RL) algorithms with function approximation. An agent must correctly generalize its recent experience to both states it has not yet encountered and other states it encountered in the past. Generalization has been extensively studied in supervised learning, where inputs are sampled i.i.d. from a fixed input distribution and the targets are sampled from a fixed conditional distribution. In RL, however, the distribution of training data is often not i.i.d. When learning from a stream of temporally correlated data, the learner might fit the learned function to recent data and potentially overwrite previous learning, for example, previously estimated values. This phenomenon is commonly called interference or forgetting in RL (Bengio et al., 2020; Goodrich, 2015; Liu et al., 2019; Kirkpatrick et al., 2017; Riemer et al., 2018). The conventional wisdom is that interference is particularly problematic in RL, even single-task RL, because (a) when an agent explores, it processes a sequence of observations, which are likely to be temporally correlated; (b) the agent continually changes its policy, changing the distribution of samples over time; and (c) most algorithms use bootstrap targets (as in temporal difference learning), making the update targets non-stationary. It is difficult to verify this conventional wisdom, as there is no established online measure of interference for RL. There has been significant progress quantifying interference in supervised learning (Chaudhry et al., 2018; Fort et al., 2019; Kemker et al., 2018; Riemer et al., 2018), with some empirical work even correlating interference and properties of task sequences (Nguyen et al., 2019), and investigations into (un)forgettable examples in classification (Toneva et al., 2019). In RL, recent efforts have focused on generalization and transfer, rather than characterizing or measuring interference.
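To make the notion of interference concrete, the following is a minimal sketch, not the paper's exact formulation: it quantifies interference as the change in squared TD error on a held-out batch of previously seen transitions after a single value-function update. All names (the linear weights w, the step size, the randomly generated transitions) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, alpha = 8, 0.99, 0.1

w = rng.normal(size=d)                 # linear value estimate: v(s) = w @ s
old_s = rng.normal(size=(32, d))       # features of previously seen states
old_s2 = rng.normal(size=(32, d))      # successor-state features
old_r = rng.normal(size=32)            # rewards

def sq_td_errors(w):
    # squared TD errors on the held-out past transitions
    delta = old_r + gamma * old_s2 @ w - old_s @ w
    return delta ** 2

before = sq_td_errors(w)

# one semi-gradient TD(0) update on a single new transition
new_s, new_s2, new_r = rng.normal(size=d), rng.normal(size=d), 1.0
delta = new_r + gamma * new_s2 @ w - new_s @ w
w = w + alpha * delta * new_s

after = sq_td_errors(w)

# positive values indicate the update made old predictions worse
interference = np.mean(after - before)
print(interference)
```

Under this sketch, a large positive value for a given update signals forgetting of earlier learning, while a negative value indicates the update also helped the old transitions (positive generalization).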
Learning on new environments often results in drops in performance on previously learned environments (Farebrother et al., 2018; Packer et al., 2018; Rajeswaran et al., 2017; Cobbe et al., 2018). DQN-based agents can hit performance plateaus in Atari, presumably due to interference. In fact, if the learning process is segmented in the right way, the interference can be more precisely characterized with TD errors across different game contexts (Fedus et al., 2020). Unfortunately, this analysis cannot be done online as learning progresses. Finally, recent work investigated several different possible measures of interference, but did not land on a clear measure (Bengio et al., 2020). In this paper we advocate for a simpler approach to characterizing interference in RL. In most systems the value estimates and actions change on every time step, conflating many different sources of non-stationarity, stochasticity, and error. If an update to the value function interferes, the result of

