DEMYSTIFYING APPROXIMATE RL WITH ϵ-GREEDY EXPLORATION: A DIFFERENTIAL INCLUSION VIEW Anonymous

Abstract

Q-learning and SARSA(0) with ϵ-greedy exploration are leading reinforcement learning methods, and their tabular forms converge to the optimal Q-function under reasonable conditions. With function approximation, however, these methods exhibit erratic behaviors beyond the usual instability, e.g., policy oscillation and chattering, and convergence to different attractors (possibly even the worst policy) on different runs. Accordingly, a theory to explain these phenomena has been a long-standing open problem, even for basic linear function approximation (Sutton, 1999). Our work uses differential inclusion theory to provide the first framework for resolving this problem. We further illustrate via numerical examples how this framework helps explain these algorithms' asymptotic behaviors.

1. INTRODUCTION

Tabular versions of value-based Reinforcement Learning (RL) algorithms such as Q-learning and SARSA are known to converge to the optimal Q-function under reasonable conditions (Singh et al., 2000; Jaakkola et al., 1993; Tsitsiklis, 1994). However, the story of their approximate variants (those using function approximation) with ϵ-greedy exploration has been inconclusive. On the one hand, these variants, e.g., the Deep Q-Network (DQN) (Mnih et al., 2015), have shown significant empirical successes. On the other, there is also growing evidence of undesirable behaviors such as policy oscillation, i.e., indefinitely cycling between multiple policies, or convergence to a sub-optimal or even the worst possible policy (Gordon, 1996; 2000; De Farias & Van Roy, 2000; Bertsekas, 2011; Young & Sutton, 2020). Accordingly, a mathematical framework to explain such behaviors and, in turn, to identify conditions for some minimal reliability has been a long-standing open problem, even in the basic linear function approximation case (Sutton, 1999, Problem 1). By reliability of an approximate value-based method, we mean some basic notions such as i.) stability, ii.) convergence to the optimal policy when the optimal Q-value function lies in the approximating function class, or iii.) convergence to a policy with a better Q-value function than that of the initial policy. Tabular variants of Q-learning and SARSA are reliable in all three of these senses under the reasonable conditions that guarantee their convergence (see references above). Likewise, there are sufficient, albeit restrictive, conditions (Melo et al., 2008; Chen et al., 2019; Carvalho et al., 2020; Lee & He, 2020; Xu & Gu, 2020) under which approximate Q-learning and SARSA with a fixed behavior policy are reliable, at least per the first two notions above. For example, these conditions hold when the behavior policy is close to the optimal policy (the one to be estimated).
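For concreteness, the tabular algorithms discussed above can be sketched as follows. This is a minimal illustration (not the paper's method): `Q` is an assumed state-action value table indexed as `Q[s, a]`, and the step sizes and MDP interface are hypothetical.

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps, rng):
    """With probability eps pick a uniformly random action; otherwise pick a
    greedy action for state s (ties broken uniformly at random)."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    q = Q[s]
    return int(rng.choice(np.flatnonzero(q == q.max())))

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """Tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Under the standard conditions cited above (e.g., all state-action pairs visited infinitely often and suitably decaying step sizes), iterating this update converges to the optimal Q-function; the paper's subject is what changes once `Q` is replaced by a parametric approximation.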
In contrast, we claim that the reliability of approximate value-based RL methods with ϵ-greedy exploration cannot be taken for granted. To see this, consider Figure 1, which shows trajectories of different runs of a variant of DQN. This variant also employs experience replay and a target network as in (Mnih et al., 2015), but uses a linear function instead of a neural network for approximating the optimal Q-value function. The reduction in approximation power is offset by including the optimal Q-value function in this linear function class¹. For all the trajectories, the starting conditions are the same and set so that the initial behavior policy is close to the optimal policy, in line with the conditions in the fixed-behavior-policy literature. Thus, one would expect the greedy policy along these trajectories to converge to the optimal policy. Surprisingly, we observe three different behaviors: i.) convergence to a sub-optimal policy (red), ii.) oscillation between two sub-optimal policies (blue, tail end), and iii.) convergence to the optimal policy (green).
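The core update underlying such a linear variant is the semi-gradient Q-learning step, sketched below under simplifying assumptions (no experience replay or target network, which the paper's variant does use). Here `phi` is an assumed feature function mapping a state-action pair to a vector, and `Q(s, a) = phi(s, a) @ w`.

```python
import numpy as np

def linear_q_update(w, phi, s, a, r, s_next, n_actions, alpha, gamma):
    """One semi-gradient Q-learning step with linear approximation:
    w <- w + alpha * (r + gamma * max_b phi(s',b)@w - phi(s,a)@w) * phi(s,a)."""
    q_next = max(phi(s_next, b) @ w for b in range(n_actions))
    td_error = r + gamma * q_next - phi(s, a) @ w
    return w + alpha * td_error * phi(s, a)
```

With one-hot (tabular) features this reduces exactly to tabular Q-learning; the behaviors in Figure 1 arise for general feature matrices, even ones whose span contains the optimal Q-function.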



¹ This is ensured by setting one column of the state-action feature matrix to the optimal Q-value function.
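The construction in the footnote can be checked in a few lines. This is a toy illustration with made-up numbers: `q_star` is a hypothetical optimal Q-function (one entry per state-action pair), and `Phi` is an otherwise arbitrary feature matrix whose first column is set to `q_star`, so the unit weight vector on that column represents `q_star` exactly within the linear class.

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = np.array([1.0, 0.5, 2.0, 0.0])  # hypothetical optimal Q-values
Phi = rng.standard_normal((4, 3))        # random state-action feature matrix
Phi[:, 0] = q_star                       # one column set to the optimal Q-function

w = np.zeros(3)
w[0] = 1.0                               # unit weight on that column
assert np.allclose(Phi @ w, q_star)      # q_star lies in the linear class
```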

