Q-LEARNING WITH REGULARIZATION CONVERGES WITH NON-LINEAR NON-STATIONARY FEATURES

Abstract

The deep Q-learning architecture is a neural network composed of non-linear hidden layers that learn features of states and actions and a final linear layer that learns the Q-values of the features. The parameters of both components can possibly diverge. Regularization of the updates is known to solve the divergence problem of fully linear architectures, where features are stationary and known a priori. We propose a deep Q-learning scheme that uses regularization of the final linear layer of architecture, updating it along a faster time-scale, and stochastic full-gradient descent updates for the non-linear features at a slower time-scale. We prove the proposed scheme converges with probability 1. Finally, we provide a bound on the error introduced by regularization of the final linear layer of the architecture.

1. INTRODUCTION

The Q-learning algorithm, introduced in the seminal paper of Watkins & Dayan (1992), is a stochastic semi-gradient descent algorithm that allows agents to learn to make sequential decisions towards long-term goals by learning the optimal state-action value function of a given problem. The relevance of Q-learning in reinforcement learning (RL) cannot be overstated, as Q-learning with deep neural networks underpins the biggest breakthrough the field has seen (Mnih et al., 2015). We can cast a deep Q-learning architecture as a neural network that combines a non-linear component of hidden layers, learning features of the input, with a final linear component that learns the Q-values of the learned features, as depicted in Figure 1. Despite its merits, there is no guarantee that Q-learning with function approximation, even linear function approximation, converges to the desired solution. In fact, divergence happens in well-known examples where the parameters of the approximator do not approach any solution, either oscillating within a window (Boyan & Moore, 1995; Gordon, 2001) or growing without bound (Tsitsiklis & Van Roy, 1996; Baird, 1995). There is also evidence of convergence to incompetent solutions (van Hasselt et al., 2018). Recently, the works of Carvalho et al. (2020), Zhang et al. (2021) and Lim et al. (2022) provided insights on the role of regularization of the parameters and of the Q-values in stabilizing Q-learning with linear function approximation and obtaining a provably convergent scheme. In light of the architecture in Figure 1, their setting is one in which the features are stationary and known a priori, and only the final linear component is learned. In this work, we investigate whether regularized Q-learning with linear function approximation still converges when the features themselves are non-stationary.
In Section 3, as a first result, we assume that the features are updated along a slower time-scale than the final layer and that they converge, and we prove that the final layer converges as well. Our setting and proof build on two time-scale stochastic approximation ideas. We also bound the distance between the optimal Q-function, generally outside the span of the features, and the regularized solution. Then, in Section 4, we investigate how to learn the non-linear features along the slow time-scale with provable convergence guarantees, thus verifying the assumption of the first result. We propose three learning schemes that perform stochastic full-gradient descent on well-defined loss functions, which allows us to invoke a recent result of Mertikopoulos et al. (2020) to establish their convergence. Putting our two results together, we obtain the first convergence result for stochastic semi-gradient Q-learning schemes with non-linear function approximation. Our scheme operates on two time-scales: the final layer of a neural network is updated faster and learns regularized Q-values of non-linear features, which are in turn updated more slowly with stochastic full-gradient descent.
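In symbols, the proposed two time-scale structure can be sketched as follows; the ridge-style regularization term η v_t and the generic feature loss ℓ below are illustrative stand-ins (the exact regularized operator and feature losses are specified in Sections 3 and 4):

```latex
% Fast time-scale: regularized semi-gradient update of the final linear layer v
% (illustrative ridge regularization with strength \eta > 0).
v_{t+1} = v_t + \alpha_t \Big[ \big( r_t + \gamma \max_{a' \in \mathcal{A}}
          \phi_{u_t}(x'_t, a')^\top v_t - \phi_{u_t}(x_t, a_t)^\top v_t \big)
          \, \phi_{u_t}(x_t, a_t) - \eta \, v_t \Big]

% Slow time-scale: stochastic full-gradient descent on a feature loss \ell,
% with step sizes satisfying \beta_t / \alpha_t \to 0.
u_{t+1} = u_t - \beta_t \, \nabla_u \ell(u_t; \xi_t)
```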

Figure 1: A general deep Q-learning neural network architecture. The state-action pair input (x, a) is fed to non-linear hidden layers that output features ϕ_u(x, a) of the input. A final linear activation layer, parameterized by v, then outputs the Q-value Q_{v,u}(x, a) = ϕ_u(x, a) · v. Usually, the architecture is learned by performing stochastic semi-gradient descent updates on the Bellman error.
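As a concrete illustration, the architecture of Figure 1 can be sketched in a few lines of NumPy; the dimensions, the tanh activation, and the random input below are arbitrary placeholders, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4-dimensional encoded (x, a) input, K = 8 learned features.
D_IN, K = 4, 8

# Non-linear hidden layer parameters u and final linear layer parameters v.
u = {"W": rng.normal(size=(K, D_IN)) * 0.1, "b": np.zeros(K)}
v = rng.normal(size=K) * 0.1

def phi(u, xa):
    """Non-linear features phi_u(x, a) of an encoded state-action input."""
    return np.tanh(u["W"] @ xa + u["b"])

def q_value(v, u, xa):
    """Q_{v,u}(x, a) = phi_u(x, a) . v  (the final linear activation layer)."""
    return phi(u, xa) @ v

xa = rng.normal(size=D_IN)  # an encoded state-action pair (x, a)
print(q_value(v, u, xa))    # a scalar Q-value
```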

2. BACKGROUND

A Markov decision process M is a tuple (X, A, P, R), where X is a finite set of states, A is a finite set of actions, P is a set of transition probability distributions P(x, a) ∈ ∆(X)¹ and R is a set of bounded real-valued random variables R(x, a) ∈ [−r_max, r_max], called the rewards. An agent interacts with an environment described as a Markov decision process in discrete steps: it observes the random state X_t of the process, performs a random action A_t and receives a random reward R_t ∼ R(X_t, A_t). The state of the process then changes to X_{t+1} and the interaction repeats. The way the agent selects actions upon observing a state is prescribed by a policy π that, for each state, specifies a probability distribution π(x) ∈ ∆(A). For a given policy π, we measure the value of performing an action in a state through the function Q^π : X × A → ℝ. Given a state-action pair (x, a), the Q-function gives the expected discounted sum of rewards the agent receives by performing action a in state x and choosing actions according to π thereafter, with discount factor γ ∈ [0, 1). The Markov decision problem is that of finding a policy π* that, in every state, maximizes the value of the best action. Such a policy is known to exist (Puterman, 2005, Section 6.2); it may, however, not be unique. While an optimal policy π* is not necessarily unique, the optimal value function Q* is, and it verifies the fixed-point equation of the Bellman operator

Q*(x, a) = E[ R(x, a) + γ max_{a′∈A} Q*(X′, a′) ],     (1)

where X′ ∼ P(x, a) and the expectation is taken with respect to X′ and R(x, a). Additionally, from Q* we can obtain an optimal policy π* by greedily choosing, for each state, the action with the highest Q-value. Consequently, we can solve the Markov decision problem by solving the fixed-point equation above.
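For intuition, the fixed point Q* of equation 1 can be computed on a small example by iterating the Bellman operator, which is a γ-contraction; the 2-state, 2-action MDP below is a made-up toy, not one from the paper:

```python
import numpy as np

# Toy MDP: 2 states, 2 actions; P[x, a] is a distribution over next states.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])  # expected rewards E[R(x, a)]
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Bellman optimality operator: (HQ)(x,a) = E[R(x,a)] + gamma E[max_a' Q(X', a')]
    Q_new = R + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new

print(Q)  # approximates the unique fixed point Q*
```

Because the operator is a γ-contraction in the sup-norm, the iteration converges to Q* from any initialization, which is why the fixed-point equation characterizes the optimal values.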
To solve equation 1, we can define a loss function h : ℝ^L → ℝ such that

h(w) = ½ E[ ( R(X, A) + γ max_{a′∈A} Q_w(X′, a′) − Q_w(X, A) )² ],     (2)

where Q_w : X × A → ℝ is a function approximator, for instance a neural network, with parameters w ∈ ℝ^L. The expectation is taken with respect to a distribution over state, action, next-state and reward transitions (X, A, X′, R(X, A)). If we assume the off-policy target R(x, a) + γ max_{a′∈A} Q_w(x′, a′) in equation 2 is held fixed,² we obtain a stochastic semi-gradient descent scheme to minimize h. The resulting algorithm is called Q-learning with function approximation and takes the form

w_{t+1} = w_t + α_t ( r_t + γ max_{a′∈A} Q_{w_t}(x′_t, a′) − Q_{w_t}(x_t, a_t) ) ∇_w Q_{w_t}(x_t, a_t),

where the state-action samples (x_t, a_t) ∼ μ follow the data distribution μ ∈ ∆(X × A), next-states x′_t ∼ P(x_t, a_t), rewards r_t ∼ R(x_t, a_t), and α_t ∈ ℝ⁺ is a learning rate.
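A minimal sketch of this update with linear function approximation, Q_w(x, a) = φ(x, a)ᵀw, so that ∇_w Q_w(x, a) = φ(x, a); the features and the single sampled transition below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
L, gamma, alpha = 3, 0.9, 0.05  # feature dimension, discount, learning rate

# Hypothetical fixed features phi(x, a) in R^L for a 2-state, 2-action problem.
phi = {(x, a): rng.normal(size=L) for x in range(2) for a in range(2)}
w = np.zeros(L)

def q(w, x, a):
    return phi[(x, a)] @ w

def semi_gradient_step(w, x, a, r, x_next, alpha):
    """One Q-learning update: the bootstrapped target is treated as fixed,
    so the gradient flows only through Q_w(x, a) (hence 'semi-gradient')."""
    target = r + gamma * max(q(w, x_next, b) for b in range(2))
    td_error = target - q(w, x, a)
    return w + alpha * td_error * phi[(x, a)]  # grad_w Q_w(x, a) = phi(x, a)

# One update on a sampled transition (x_t, a_t, r_t, x'_t).
w = semi_gradient_step(w, x=0, a=1, r=1.0, x_next=1, alpha=alpha)
```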



¹ We use ∆(B) to denote the set of probability distributions over a set B.
² The technique is usually referred to as bootstrapping.

