MEASURING AND MITIGATING INTERFERENCE IN REINFORCEMENT LEARNING

Abstract

Catastrophic interference is common in many network-based learning systems, and many proposals exist for mitigating it. But before we can overcome interference, we must understand it better. In this work, we first provide a definition and novel measure of interference for value-based control methods such as Fitted Q Iteration and DQN. We systematically evaluate our measure of interference, showing that it correlates with forgetting, across a variety of network architectures. Our new interference measure allows us to ask novel scientific questions about commonly used deep learning architectures and to develop new learning algorithms. In particular, we show that updates on the last layer result in significantly higher interference than updates internal to the network. Lastly, we introduce a novel online-aware representation learning algorithm that minimizes interference, and we empirically demonstrate that it improves stability and has lower interference.

1. INTRODUCTION

Generalization is a key property of reinforcement learning (RL) algorithms with function approximation. An agent must correctly generalize its recent experience to both states it has not yet encountered and other states it encountered in the past. Generalization has been extensively studied in supervised learning, where inputs are sampled iid from a fixed input distribution and the targets are sampled from a fixed conditional distribution. In RL, however, the training data is often not iid. When learning from a stream of temporally correlated data, the learner might fit the learned function to recent data and potentially overwrite previous learning, such as the estimated values of other states. This phenomenon is commonly called interference or forgetting in RL (Bengio et al., 2020; Goodrich, 2015; Liu et al., 2019; Kirkpatrick et al., 2017; Riemer et al., 2018). The conventional wisdom is that interference is particularly problematic in RL, even single-task RL, because (a) when an agent explores, it processes a sequence of observations, which are likely to be temporally correlated; (b) the agent continually changes its policy, changing the distribution of samples over time; and (c) most algorithms use bootstrap targets (as in temporal difference learning), making the update targets non-stationary. It is difficult to verify this conventional wisdom, however, as there is no established online measure of interference for RL. There has been significant progress quantifying interference in supervised learning (Chaudhry et al., 2018; Fort et al., 2019; Kemker et al., 2018; Riemer et al., 2018), with some empirical work even correlating interference and properties of task sequences (Nguyen et al., 2019), and investigations into (un)forgettable examples in classification (Toneva et al., 2019). In RL, recent efforts have focused on generalization and transfer, rather than on characterizing or measuring interference.
Learning on new environments often results in drops in performance on previously learned environments (Farebrother et al., 2018; Packer et al., 2018; Rajeswaran et al., 2017; Cobbe et al., 2018). DQN-based agents can hit performance plateaus in Atari, presumably due to interference. In fact, if the learning process is segmented in the right way, the interference can be more precisely characterized with TD errors across different game contexts (Fedus et al., 2020). Unfortunately, this analysis cannot be done online as learning progresses. Finally, recent work investigated several different possible measures of interference, but did not settle on a clear measure (Bengio et al., 2020). In this paper, we advocate for a simpler approach to characterizing interference in RL. In most systems, the value estimates and actions change on every time step, conflating many different sources of non-stationarity, stochasticity, and error. If an update to the value function interferes, the result of that update might not manifest in the policy's performance for several time steps, if at all. Interference classically refers to an update negatively impacting the agent's previous learning, eroding the knowledge stored in the value function. Therefore, it makes sense to first characterize interference in the value function updates, rather than in the policy or the return. We define interference in terms of prediction error for two common approximate dynamic programming algorithms, approximate policy iteration and fitted Q iteration; most value-based deep RL algorithms are based on these two algorithms. Interference is defined as the change in prediction errors, which is similar to previous definitions of interference in supervised learning. Additionally, our approach yields an online estimate of interference, which can even be directly optimized. In this work, we provide a clear justification for the use of differences in squared TD errors as the definition of interference.
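To make the idea of "change in prediction errors" concrete, the following is a minimal sketch, not the paper's implementation: interference of a single update on a held-out transition is measured as the change in that transition's squared TD error before versus after the update. The linear features, learning rate, and toy transitions are all illustrative assumptions; a positive value indicates interference, a negative value beneficial generalization.

```python
import numpy as np

def squared_td_error(w, phi_s, phi_sp, r, gamma):
    # Squared TD error for one transition under linear values v(s) = w . phi(s).
    delta = r + gamma * w @ phi_sp - w @ phi_s
    return delta ** 2

def interference(w_before, w_after, transition, gamma=0.9):
    # Change in squared TD error on a held-out transition caused by an update.
    phi_s, phi_sp, r = transition
    return (squared_td_error(w_after, phi_s, phi_sp, r, gamma)
            - squared_td_error(w_before, phi_s, phi_sp, r, gamma))

# Toy example: one TD(0) step on transition A changes the error on transition B.
w = np.zeros(2)
phi_a, phi_a_next, r_a = np.array([1.0, 0.0]), np.array([0.0, 1.0]), 1.0
phi_b, phi_b_next, r_b = np.array([1.0, 1.0]), np.array([0.0, 0.0]), 0.0

alpha, gamma = 0.5, 0.9
delta = r_a + gamma * w @ phi_a_next - w @ phi_a
w_new = w + alpha * delta * phi_a   # semi-gradient TD(0) step on transition A

change = interference(w, w_new, (phi_b, phi_b_next, r_b), gamma)  # > 0 here
```

Because transitions A and B share a feature, the update on A raises B's squared TD error, which this measure reports as interference.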
We highlight definitions of interference at different granularities, and the utility of considering different statistics to summarize interference within iterations versus over time. We evaluate our interference measure by computing its correlation with a forgetting metric, which reflects instability in control performance. We show that high interference correlates with forgetting, and simultaneously show interference and forgetting properties across a variety of architectures and optimization choices. We then use our measure to highlight that updates to the internal layers of the network, the representation, contribute much less to interference than updates to the last layer. This motivates the design of a new algorithm that learns representations online that explicitly minimize interference. We conclude with a demonstration that this algorithm does indeed significantly improve stability and reduce interference.

2. PROBLEM FORMULATION AND LEARNING ALGORITHMS

In reinforcement learning (RL), an agent interacts with its environment, receiving observations and selecting actions to maximize a reward signal. We assume the environment can be formalized as a Markov decision process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, \Pr, R, \gamma, d_0)$ where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $\Pr : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability, $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, $\gamma \in [0, 1]$ is a discount factor, and $d_0$ is the initial state distribution. The goal of the agent is to find a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$ that maximizes the expected discounted sum of rewards. Given a fixed policy $\pi$, the action-value function $Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is defined as $Q^\pi(s, a) := \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$, where $R_{t+1} = R(S_t, A_t, S_{t+1})$, $S_{t+1} \sim \Pr(\cdot \mid S_t, A_t)$, and actions are taken according to policy $\pi$: $A_t \sim \pi(\cdot \mid S_t)$. Given a policy $\pi$, the value function can be obtained using the Bellman operator for action values $T^\pi : \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|} \to \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$: $(T^\pi Q)(s, a) := \sum_{s' \in \mathcal{S}} \Pr(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid s') Q(s', a') \right]$. $Q^\pi$ is the unique solution of the Bellman equation $T^\pi Q = Q$. The optimal value function $Q^*$ is defined as $Q^*(s, a) := \sup_\pi Q^\pi(s, a)$, with $\pi^*$ the policy that is greedy with respect to $Q^*$. Similarly, the optimal value function can be obtained using the Bellman optimality operator for action values $T : \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|} \to \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$: $(T Q)(s, a) := \sum_{s' \in \mathcal{S}} \Pr(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q(s', a') \right]$. $Q^*$ is the unique solution of the Bellman equation $T Q = Q$. We can use neural networks to learn an approximation $Q_\theta$ to the optimal action-value function, with parameters $\theta$. In this work, we restrict our attention to Iterative Value Estimation algorithms: algorithms with an explicit evaluation phase in which the policy is fixed and the agent has several steps $T_{\text{eval}}$ to improve its value estimates. Two examples of such algorithms are Approximate Policy Iteration (API) and Fitted Q-Iteration (FQI) (Ernst et al., 2005).
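The Bellman optimality operator above is straightforward to apply exactly in the tabular case. The following is an illustrative sketch on a toy two-state, two-action MDP (the MDP itself is an assumption, not from the paper): repeatedly applying $T$ converges to its unique fixed point $Q^*$.

```python
import numpy as np

# Toy MDP: action 1 moves to (or stays in) state 1 and earns reward 1;
# action 0 moves to (or stays in) state 0 with no reward.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] = Pr(s'|s, a)
P[0, 0, 0] = 1.0; P[0, 1, 1] = 1.0
P[1, 0, 0] = 1.0; P[1, 1, 1] = 1.0
R = np.zeros((n_states, n_actions, n_states))   # R[s, a, s']
R[0, 1, 1] = 1.0
R[1, 1, 1] = 1.0

def bellman_optimality(Q):
    # (T Q)(s, a) = sum_{s'} Pr(s'|s, a) [R(s, a, s') + gamma * max_{a'} Q(s', a')]
    backup = R + gamma * Q.max(axis=1)[None, None, :]
    return np.einsum('ijk,ijk->ij', P, backup)

# Value iteration: iterate T from Q = 0 until (numerical) convergence to Q*.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = bellman_optimality(Q)
```

Since $T$ is a $\gamma$-contraction, the iterates converge geometrically; here the fixed point has $Q^*(s, 1) = 1/(1-\gamma) = 10$ in both states.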
In API, for the current policy $\pi_k$, the agent updates its estimate of $Q^{\pi_k}$ by taking $T_{\text{eval}}$ steps in the environment and performing a mini-batch update from a replay buffer on each step using the Sarsa update: $\theta_{t+1} \leftarrow \theta_t + \alpha \delta_t \nabla_{\theta} Q_{\theta_t}(S_t, A_t)$, where $\delta_t := R_{t+1} + \gamma Q_{\theta_t}(S_{t+1}, \pi_k(S_{t+1})) - Q_{\theta_t}(S_t, A_t)$. In FQI, the policy and targets $Q_k$ are held fixed for $T_{\text{eval}}$ steps, with these fixed targets used as the regression target in the update. Again, a mini-batch update from a replay buffer is used on each step as above, but with a different TD error: $\delta_t = R_{t+1} + \gamma \max_{a' \in \mathcal{A}} Q_k(S_{t+1}, a') - Q_{\theta_t}(S_t, A_t)$. The procedure for both algorithms is summarized in Algorithm 1.
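One FQI evaluation phase can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1: it uses linear action values $Q(s, a) = w_a^\top \phi(s)$ with sampled single-transition updates, and the replay buffer contents, feature dimension, and step size are all invented for the example. The frozen parameters $w_k$ play the role of the fixed targets $Q_k$ held constant for $T_{\text{eval}}$ steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, alpha, T_eval = 4, 2, 0.9, 0.1, 50

# Current parameters theta_t: one weight vector per action.
w = rng.normal(size=(n_actions, n_features))

# Illustrative replay buffer of transitions (phi(S_t), A_t, R_{t+1}, phi(S_{t+1})).
buffer = [(rng.normal(size=n_features),
           int(rng.integers(n_actions)),
           float(rng.normal()),
           rng.normal(size=n_features))
          for _ in range(100)]

w_k = w.copy()  # frozen target parameters Q_k, fixed for the whole phase
for _ in range(T_eval):
    phi, a, r, phi_next = buffer[rng.integers(len(buffer))]
    # FQI TD error: delta = r + gamma * max_a' Q_k(s', a') - Q_theta(s, a)
    target = r + gamma * max(w_k[b] @ phi_next for b in range(n_actions))
    delta = target - w[a] @ phi
    w[a] += alpha * delta * phi  # semi-gradient update on theta only
```

The API variant would differ only in the target line, bootstrapping from the current parameters at the action chosen by $\pi_k$ instead of the frozen max over $w_k$.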

