Q-LEARNING WITH REGULARIZATION CONVERGES WITH NON-LINEAR NON-STATIONARY FEATURES

Abstract

The deep Q-learning architecture is a neural network composed of non-linear hidden layers, which learn features of states and actions, and a final linear layer, which learns the Q-values of those features. The parameters of both components can diverge during learning. Regularization of the updates is known to solve the divergence problem of fully linear architectures, where the features are stationary and known a priori. We propose a deep Q-learning scheme that regularizes the final linear layer of the architecture, updating it along a faster time-scale, and performs stochastic full-gradient descent updates on the non-linear features along a slower time-scale. We prove that the proposed scheme converges with probability 1. Finally, we provide a bound on the error introduced by regularizing the final linear layer of the architecture.
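In symbols (our notation, used only for illustration), this architecture factorizes the approximate Q-function into a non-linear feature map and a linear head,

    Q_{\theta, w}(s, a) = w^\top \phi_\theta(s, a),

where \phi_\theta collects the parameters of the hidden layers and w those of the final linear layer; the proposed scheme updates w on the fast time-scale with regularization and \theta on the slow time-scale with stochastic full-gradient descent.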

1. INTRODUCTION

The Q-learning algorithm, introduced in the seminal paper of Watkins & Dayan (1992), is a stochastic semi-gradient descent algorithm that allows agents to learn to make sequential decisions towards long-term goals by learning the optimal state-action value function of a given problem. The relevance of Q-learning in reinforcement learning (RL) cannot be overstated, as Q-learning with deep neural networks underpins the biggest breakthrough the field has seen (Mnih et al., 2015). We can cast a deep Q-learning architecture as a neural network that combines a non-linear component of hidden layers, learning features of the input, and a final linear component that learns the Q-values of the learned features, as depicted in Figure 1. Despite its merits, there is no guarantee that Q-learning with function approximation architectures, even linear ones, converges to the desired solution. In fact, divergence happens in well-known examples where the parameters of the approximator do not approach any solution, either oscillating within a window (Boyan & Moore, 1995; Gordon, 2001) or growing without bound (Tsitsiklis & Van Roy, 1996; Baird, 1995). There is also evidence of convergence to incompetent solutions (van Hasselt et al., 2018).

Recently, the works of Carvalho et al. (2020), Zhang et al. (2021) and Lim et al. (2022) provided insights on the role of regularization of the parameters and of the Q-values for stabilizing Q-learning with linear function approximation and obtaining a provably convergent scheme. In the light of the architecture in Figure 1, their setting is one in which the features are stationary and known a priori, and only the final linear component is learned.

In this work, we investigate whether regularized Q-learning with linear function approximation converges when the features are non-stationary. In Section 3, as a first result, we assume that the features are updated along a slower time-scale than the final layer and that they converge, and we prove that the final layer converges as well. Our setting and proof build on two time-scale stochastic approximation ideas. We also bound the distance between the optimal Q-function, generally outside the span of the features, and the regularized solution. Then, in Section 4, we investigate how the non-linear features can be learned along the slow time-scale with provable convergence guarantees, thus verifying the assumption of the first result. We propose three learning schemes that perform stochastic full-gradient descent on well-defined loss functions, and we use a recent result from Mertikopoulos et al. (2020) to establish their convergence.

Putting the two results together, we obtain the first convergence result for stochastic semi-gradient Q-learning schemes with non-linear function approximation. Our scheme is two time-scale: the final layer of a neural network is updated faster and learns regularized Q-values of non-linear features, which are updated slower with stochastic full-gradient descent updates. A sketch of the resulting coupled updates is given below.
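To make the two time-scale structure concrete, the following is a minimal sketch in PyTorch-style Python, assuming the parameterization Q_{\theta,w}(s,a) = \phi_\theta(s)^\top w_a with one weight vector per action. The \ell_2 form of the regularizer, the constant step sizes alpha and beta, and the squared Bellman-residual loss used for the features are illustrative assumptions, not the paper's definitions; Sections 3 and 4 specify the actual updates and loss functions.

# Illustrative two time-scale update (not the paper's exact algorithm).
# Network sizes, step sizes, and the regularization weight are hypothetical choices.
import torch
import torch.nn as nn

state_dim, num_actions, feat_dim = 4, 2, 32
gamma, lambda_reg = 0.99, 0.1
alpha, beta = 1e-2, 1e-4   # fast step size (linear head) >> slow step size (features)

phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                    nn.Linear(64, feat_dim), nn.ReLU())   # non-linear feature layers (theta)
w = torch.zeros(feat_dim, num_actions)                    # final linear layer
feat_opt = torch.optim.SGD(phi.parameters(), lr=beta)

def update(s, a, r, s_next):
    # Fast time-scale: regularized stochastic semi-gradient update of the linear head,
    # treating the features as fixed.
    with torch.no_grad():
        feats = phi(s)
        target = r + gamma * (phi(s_next) @ w).max()
        td_error = target - feats @ w[:, a]
        w[:, a] += alpha * (td_error * feats - lambda_reg * w[:, a])
    # Slow time-scale: stochastic full-gradient step on a feature loss. Here the loss is
    # a squared Bellman residual differentiated through both terms; it is only a
    # placeholder for the well-defined losses studied in Section 4.
    feat_opt.zero_grad()
    residual = (r + gamma * (phi(s_next) @ w).max()) - phi(s) @ w[:, a]
    (residual ** 2).backward()
    feat_opt.step()

# Example call on a single hypothetical transition (s, a, r, s_next).
s, s_next = torch.randn(state_dim), torch.randn(state_dim)
update(s, a=0, r=1.0, s_next=s_next)

In the analysis the step sizes are diminishing with \beta_t / \alpha_t -> 0, which is what formally separates the two time-scales; the constants above only mimic that separation.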

