DEMYSTIFYING APPROXIMATE RL WITH ϵ-GREEDY EXPLORATION: A DIFFERENTIAL INCLUSION VIEW

Anonymous

Abstract

Q-learning and SARSA(0) with ϵ-greedy exploration are leading reinforcement learning methods, and their tabular forms converge to the optimal Q-function under reasonable conditions. However, with function approximation, these methods exhibit pathological behaviors beyond the usual instability, e.g., policy oscillation and chattering, and convergence to different attractors (possibly even the worst policy) on different runs. Accordingly, a theory to explain these phenomena has been a long-standing open problem, even for basic linear function approximation (Sutton, 1999). Our work uses differential inclusion theory to provide the first framework for resolving this problem. We further illustrate via numerical examples how this framework helps explain these algorithms' asymptotic behaviors.

1. INTRODUCTION

Tabular versions of value-based Reinforcement Learning (RL) algorithms such as Q-learning and SARSA are known to converge to the optimal Q-function under reasonable conditions (Singh et al., 2000; Jaakkola et al., 1993; Tsitsiklis, 1994). However, the story of their approximate variants (those using function approximation) with ϵ-greedy exploration has been inconclusive. On the one hand, these variants, e.g., the Deep Q-Network (DQN) (Mnih et al., 2015), have shown significant empirical successes. On the other, there is growing evidence of undesirable behaviors such as policy oscillation, i.e., indefinitely cycling between multiple policies, or convergence to a sub-optimal or even the worst possible policy (Gordon, 1996; 2000; De Farias & Van Roy, 2000; Bertsekas, 2011; Young & Sutton, 2020). Accordingly, a mathematical framework to explain such behaviors and, in turn, to identify conditions for some minimal reliability has been a long-standing open problem, even in the basic linear function approximation case (Sutton, 1999, Problem 1).

By reliability of an approximate value-based method, we mean some basic notions such as i.) stability, ii.) convergence to the optimal policy when the optimal Q-value function lies in the approximating function class, or iii.) convergence to a policy with a better Q-value function than the initial policy. Tabular variants of Q-learning and SARSA are reliable in all three of these senses under the reasonable conditions that guarantee their convergence (see references above). Likewise, there are sufficient, albeit restrictive, conditions (Melo et al., 2008; Chen et al., 2019; Carvalho et al., 2020; Lee & He, 2020; Xu & Gu, 2020) under which approximate Q-learning and SARSA with a fixed behavior policy are reliable, at least in the first two senses above. For example, these conditions hold when the behavior policy is close to the optimal policy (the one to be estimated).

In contrast, we claim that the reliability of approximate value-based RL methods with ϵ-greedy exploration cannot be taken for granted. To see this, consider Figure 1, showing trajectories of different runs of a variant of DQN. This variant also employs experience replay and a target network as in (Mnih et al., 2015), but uses a linear function instead of a neural network for approximating the optimal Q-value function. The reduction in approximation power is offset by including the optimal Q-value function in this linear function class.¹ For all the trajectories, the starting conditions are the same and are set so that the initial behavior policy is close to the optimal policy, in line with the conditions in the fixed-behavior-policy literature. Thus, one would expect the greedy policy along these trajectories to converge to the optimal policy. Surprisingly, we observe three different behaviors: i.) convergence to a sub-optimal policy (red), ii.) oscillation between two sub-optimal policies (blue, tail end), and iii.) convergence to the optimal policy (green). A minimal code sketch of this experimental setup is given below.
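To make the setup concrete, here is a minimal sketch of such a linear DQN variant. The 2-state 2-action MDP (the transitions P and rewards R), the hyperparameters, and the random second feature column are illustrative choices of ours, not the paper's benchmark; the essential ingredients are only that one feature column equals Q* (so Q* is representable) and that the agent uses ϵ-greedy exploration with experience replay and a target network.

```python
# Sketch of a linear DQN variant: epsilon-greedy exploration, experience
# replay, and a target network, on a hypothetical 2-state 2-action MDP
# whose transitions and rewards are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eps, alpha = 2, 2, 0.9, 0.1, 0.05

# Hypothetical MDP: P[s, a] is a distribution over next states, R[s, a] a reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# Compute Q* by value iteration so it can be embedded in the feature matrix.
Q_star = np.zeros((nS, nA))
for _ in range(1000):
    Q_star = R + gamma * (P @ Q_star.max(axis=1))

# Features: the first column is Q* itself, so Q* = Phi @ [1, 0] is representable.
Phi = np.stack([Q_star.ravel(), rng.normal(size=nS * nA)], axis=1)

def q(theta):                       # linear Q-values, shape (nS, nA)
    return (Phi @ theta).reshape(nS, nA)

def eps_greedy(qs, s):
    if rng.random() < eps:
        return int(rng.integers(nA))
    return int(np.argmax(qs[s]))

# Start near Q*'s parameters, so the initial behavior policy is the
# epsilon-greedy version of a near-optimal policy.
theta = np.array([1.0, 0.0]) + rng.normal(scale=0.1, size=2)
target = theta.copy()
replay, s = [], 0

for n in range(20000):
    a = eps_greedy(q(theta), s)
    s2 = int(rng.choice(nS, p=P[s, a]))
    replay.append((s, a, R[s, a], s2))
    s = s2
    # Sample a stored transition and take a semi-gradient Q-learning step
    # against the (periodically refreshed) target network.
    si, ai, ri, s2i = replay[int(rng.integers(len(replay)))]
    td = ri + gamma * q(target)[s2i].max() - q(theta)[si, ai]
    theta = theta + alpha * td * Phi[si * nA + ai]
    if n % 500 == 0:
        target = theta.copy()

print("learned theta:", np.round(theta, 3), "(theta = [1, 0] represents Q*)")
```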
The above discussion, along with those in (Young & Sutton, 2020), shows that approximate value-based methods with ϵ-greedy exploration can exhibit several pathological behaviors beyond the textbook instability phenomenon (Sutton & Barto, 2018), raising serious doubts about their reliability in practice. Towards addressing these concerns, a theory to explain the limiting behaviors of approximate value-based RL with greedification is thus an extremely important first step.

Existing analyses based on Ordinary Differential Equations (ODEs) are of limited utility in building such a theory. To see why, note that RL schemes can be viewed as update rules of the form

    θ_{n+1} = θ_n + α_n [h(θ_n) + M_{n+1}],   n ≥ 0,    (1)

where h is some driving function, α_n is the stepsize, and M_{n+1} is noise. When h is 'nice' overall, e.g., globally Lipschitz continuous, the ODE method is useful to show that the limiting dynamics of (1) is governed by the ODE θ̇(t) = h(θ(t)) (Benaïm, 1999; Borkar, 2009). This is indeed the case in policy evaluation. In value-based RL methods, however, h is quite complex, even with linear function approximation: the update rules involve sampling from distributions that change depending on the iterates. Accordingly, the ODE method has been made to work here only via restrictive assumptions on the sampling distribution, e.g., a fixed behavior policy (Carvalho et al., 2020), a near-optimal behavior policy (Melo et al., 2008; Chen et al., 2019), a smooth soft-max behavior policy (Perkins & Precup, 2002; Zou et al., 2019), etc. With ϵ-greedy exploration, the situation is even worse since, as we show, the resultant dynamics also turns out to be discontinuous.
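The following toy construction, assumed here purely for illustration (it is not the paper's example), shows why ϵ-greedy exploration makes h in (1) discontinuous: the greedy action, and hence the sampling distribution inside h, flips abruptly as θ crosses a decision boundary, so the iterates can chatter where no classical ODE solution exists.

```python
# Toy illustration of a discontinuous driving function h under
# epsilon-greedy greedification (a made-up 1-D example).
import numpy as np

def h(theta):
    # Two hypothetical mean update directions, one per greedy action;
    # the greedy action switches at theta = 0, so h jumps there.
    return -1.0 if theta > 0 else 2.0

rng = np.random.default_rng(1)
theta, alpha, traj = 1.0, 0.01, []
for _ in range(2000):
    # The iteration theta_{n+1} = theta_n + alpha [h(theta_n) + M_{n+1}].
    theta += alpha * (h(theta) + rng.normal(scale=0.1))
    traj.append(theta)

# The iterates are driven to theta = 0 and chatter around it. No classical
# ODE solution exists at 0, but the Filippov differential inclusion
# dtheta/dt in co{-1, 2} = [-1, 2] at theta = 0 admits one (stay at 0).
print("last 5 iterates:", np.round(traj[-5:], 3))
```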

Key Contributions:

The main highlights of this work can be summarized as follows.

1. Analysis Framework: Our work uses Differential Inclusion (DI) theory (Aubin & Cellina, 2012) to develop a new framework for analyzing value-based RL methods. Our key steps include i.) breaking down the parameter space into regions where the algorithm's dynamics are simple, ii.) identifying a DI that stitches these local dynamics together, and iii.) using this DI to explain the algorithm's overall (possibly complex) behavior. A DI is a generalization of an ODE that enables this stitching by allowing multiple update directions at every point; a toy sketch of this construction follows.
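Below is a toy numerical sketch of the stitching idea on a two-region example of our own making: two simple dynamics h1 and h2 are glued along the switching surface {θ_0 = 0} via the Filippov convexification, and a basic Euler scheme selects the sliding direction on the boundary. It is a sketch of the construction, not the paper's analysis.

```python
# Filippov stitching: dtheta/dt in F(theta), where F = {h1} on {theta_0 < 0},
# {h2} on {theta_0 > 0}, and co{h1, h2} on the boundary {theta_0 = 0}.
import numpy as np

def h1(th):  # hypothetical simple dynamics in the region theta[0] < 0
    return np.array([+1.0, -th[1]])

def h2(th):  # hypothetical simple dynamics in the region theta[0] > 0
    return np.array([-1.0, -th[1]])

def filippov_step(th, dt, tol=1e-9):
    if abs(th[0]) > tol:
        v = h1(th) if th[0] < 0 else h2(th)
    else:
        # On the boundary: choose lam in [0, 1] so the convex combination
        # lam*h1 + (1-lam)*h2 is tangent to {theta_0 = 0} (sliding mode).
        a, b = h1(th)[0], h2(th)[0]
        lam = b / (b - a) if a != b else 0.5
        lam = min(max(lam, 0.0), 1.0)
        v = lam * h1(th) + (1 - lam) * h2(th)
    th = th + dt * v
    if abs(th[0]) < dt:   # snap onto the switching surface once we'd overshoot
        th[0] = 0.0
    return th

th = np.array([-0.5, 1.0])
for _ in range(2000):
    th = filippov_step(th, 0.005)
print("limit point:", np.round(th, 4))  # slides along theta_0 = 0 to the origin
```

Here both regions push toward the boundary, so the DI's solution slides along it; neither h1 nor h2 alone describes this motion, which is exactly why the set-valued (DI) view is needed.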



¹ This is ensured by setting one column of the state-action feature matrix to the optimal Q-value function.



Figure 1: Trajectories of three runs of DQN on a 2-state 2-action MDP, with a linear 2-dimensional Q-value approximation which perfectly represents the optimal Q-value function Q*. Figure 1a shows these trajectories in the parameter space. The parameters of Q* are denoted by the black star at (1, 0). The initial parameter for all trajectories is the same (the black dot) and is chosen so that the initial behavior is the ϵ-greedy version of the optimal policy. In this idealized setting, one would have expected all trajectories to go to the star. In reality, all of them do converge, but only the green one has the desired limit (the initial fading of colors is for ease of exposition). Figure 1b shows the greedy policies associated with the different trajectories. The limiting greedy policy for the green trajectory is unique and is the optimal one; for the red, it is some sub-optimal policy. In contrast, that of the blue trajectory oscillates between two other sub-optimal policies.

