Q-LEARNING WITH REGULARIZATION CONVERGES WITH NON-LINEAR NON-STATIONARY FEATURES

Abstract

The deep Q-learning architecture is a neural network composed of non-linear hidden layers that learn features of states and actions and a final linear layer that learns the Q-values of the features. The parameters of both components can diverge. Regularization of the updates is known to solve the divergence problem of fully linear architectures, where features are stationary and known a priori. We propose a deep Q-learning scheme that regularizes the final linear layer of the architecture, updating it along a faster time-scale, and uses stochastic full-gradient descent updates for the non-linear features at a slower time-scale. We prove that the proposed scheme converges with probability 1. Finally, we provide a bound on the error introduced by regularization of the final linear layer of the architecture.

1. INTRODUCTION

The Q-learning algorithm, introduced in the seminal paper of Watkins & Dayan (1992), is a stochastic semi-gradient descent algorithm that allows agents to learn to make sequential decisions towards long-term goals by learning the optimal state-action value function of a given problem. The relevance of Q-learning in reinforcement learning (RL) cannot be overstated, as Q-learning with deep neural networks underpins the biggest breakthrough the field has seen (Mnih et al., 2015). We can cast a deep Q-learning architecture as a neural network that combines a non-linear component of hidden layers, learning features of the input, and a final linear component that learns the Q-values of the learned features, as depicted in Figure 1. Despite its merits, there is no guarantee that Q-learning with function approximation architectures, even linear ones, converges to the desired solution. In fact, divergence happens in well-known examples where the parameters of the approximator do not approach any solution, either oscillating within a window (Boyan & Moore, 1995; Gordon, 2001) or growing without bound (Tsitsiklis & Van Roy, 1996; Baird, 1995). There is also evidence of convergence to poorly performing solutions (van Hasselt et al., 2018). Recently, the works of Carvalho et al. (2020), Zhang et al. (2021) and Lim et al. (2022) provided insights on the role of regularization of the parameters and of the Q-values for stabilizing Q-learning with linear function approximation and obtaining a provably convergent scheme. In light of the architecture in Figure 1, their setting is one in which the features are stationary and known a priori and only the final component is learned. In this work, we investigate whether a regularized version of Q-learning with linear function approximation converges when the features are non-stationary.
In Section 3, as a first result, we assume that the features are updated along a slower time-scale than the final layer and that they converge, and prove that the final layer converges. Our setting and proof are based on two time-scale stochastic approximation ideas. We also bound the distance between the optimal Q-function, generally outside the span of the features, and the regularized solution. Then, in Section 4, we investigate how we can learn the non-linear features along the slow time-scale with provable convergence guarantees, thus verifying the assumption of the first result. We propose three learning schemes that perform stochastic full-gradient descent on well defined loss functions and are able to use a recent result from Mertikopoulos et al. (2020) to establish their convergence. Putting our two results together, we obtain the first convergence result for stochastic semi-gradient Q-learning schemes with non-linear function approximation. Our scheme is two time-scale, where the final layer of a neural network is updated faster and learns regularized Q-values of non-linear features that are updated slower and with stochastic full-gradient descent updates.

[Figure 1 diagram: the input $(x, a)$ passes through hidden layers (non-linear features) that output $\phi_u(x, a)$, followed by a final layer (linear activation) that outputs $Q_{v,u}(x, a) = \phi_u(x, a) \cdot v$.]

Figure 1: A general deep Q-learning neural network architecture. The state-action pair inputs $(x, a)$ are fed to non-linear hidden layers that output features of the input, $\phi_u(x, a)$. Then a linear activation layer parameterized by $v$ outputs the Q-values of $(x, a)$ using the features. Usually, the architecture is learned by performing stochastic semi-gradient descent updates on the Bellman error.
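As an illustration of the architecture in Figure 1, the following minimal sketch computes a Q-value as the dot product between non-linear features and the parameters of a final linear layer. The single hidden layer of tanh units, the scalar encoding of the state and action, and all parameter values are hypothetical choices made only for this example.

```python
import math


def features(u, x, a):
    """Hypothetical hidden layer: one tanh unit per feature of the input (x, a)."""
    return [math.tanh(w_x * x + w_a * a + b) for (w_x, w_a, b) in u]


def q_value(v, u, x, a):
    """Final linear layer: dot product between the features and v."""
    phi = features(u, x, a)
    return sum(p * w for p, w in zip(phi, v))


# Illustrative hidden parameters u (K = 3 features) and final parameters v.
u = [(0.5, -0.3, 0.1), (-0.2, 0.8, 0.0), (1.0, 1.0, -0.5)]
v = [0.4, -1.0, 0.25]
print(q_value(v, u, x=1.0, a=0.0))
```

Because the final layer is linear, the Q-value is linear in $v$ for fixed hidden parameters $u$, which is the property exploited when the two sets of parameters are analyzed separately.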

2. BACKGROUND

A Markov decision process $\mathcal{M}$ is a tuple $(\mathcal{X}, \mathcal{A}, P, R)$, where $\mathcal{X}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P$ is a set of transition probability distributions $P(x, a) \in \Delta(\mathcal{X})$¹ and $R$ is a set of bounded real-valued random variables, $R(x, a) \in [-r_{\max}, r_{\max}]$, called the rewards. An agent interacts, in discrete time, with an environment described as a Markov decision process by observing the random state of the process $X_t$, performing a random action $A_t$ and receiving a random reward $R_t \sim R(X_t, A_t)$. The state of the process changes to $X_{t+1}$ and the interaction repeats. The way the agent selects actions once it observes a state is prescribed by a policy $\pi$ that, for each state $x$, is a probability distribution $\pi(x) \in \Delta(\mathcal{A})$. For a given policy $\pi$, we measure the value of performing some action at some state through the function $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$. The Q-function, given a state-action pair $(x, a)$, gives the expected discounted sum of rewards the agent receives throughout its interaction with the environment after performing action $a$ in state $x$ and then choosing actions according to $\pi$, with discount factor $\gamma \in [0, 1)$. The Markov decision problem is that of finding a policy $\pi^*$ that, for every state, maximizes the value of the best action. Such a policy is known to exist (Puterman, 2005, Section 6.2). While an optimal policy $\pi^*$ is not necessarily unique, the optimal value $Q^*$ is, and it verifies the fixed-point equation of the Bellman operator

$$Q^*(x, a) = \mathbb{E}\Big[ R(x, a) + \gamma \max_{a' \in \mathcal{A}} Q^*(X', a') \Big], \tag{1}$$

where $X' \sim P(x, a)$ and the expectation is taken with respect to $X'$ and $R(x, a)$. Additionally, from $Q^*$ we can obtain an optimal policy $\pi^*$ by greedily choosing, for each state, the action with the highest Q-value. Consequently, we can solve the Markov decision problem by solving the fixed-point equation above.
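When the model is known and the state and action sets are small, the fixed-point equation of the Bellman operator can be solved by repeatedly applying the operator. The sketch below does so for a hypothetical two-state, two-action process with deterministic transitions; the transition structure, the rewards and the discount factor are assumptions made only for this example.

```python
def q_value_iteration(P, R, gamma, tol=1e-10):
    """Apply the Bellman operator until successive iterates differ by < tol.

    P[x][a] is a list of (next_state, probability) pairs and R[x][a] is the
    expected reward of performing action a in state x.
    """
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    while True:
        new_Q = [
            [R[x][a] + gamma * sum(p * max(Q[y]) for y, p in P[x][a])
             for a in range(n_actions)]
            for x in range(n_states)
        ]
        gap = max(abs(new_Q[x][a] - Q[x][a])
                  for x in range(n_states) for a in range(n_actions))
        Q = new_Q
        if gap < tol:
            return Q


def greedy_policy(Q):
    """For each state, pick the action with the highest Q-value."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


# Hypothetical MDP: action 0 leads to state 0, action 1 leads to state 1.
P = [[[(0, 1.0)], [(1, 1.0)]],
     [[(0, 1.0)], [(1, 1.0)]]]
R = [[0.0, 1.0],
     [0.0, 2.0]]
Q_star = q_value_iteration(P, R, gamma=0.9)
pi_star = greedy_policy(Q_star)
```

In this toy process the greedy policy always moves to (or stays in) the second state, where the larger reward is collected at every step.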
To solve equation 1, we can define a loss function $h : \mathbb{R}^L \to \mathbb{R}$ such that

$$h(w) = \frac{1}{2} \mathbb{E}\Big[ \Big( R(X, A) + \gamma \max_{a' \in \mathcal{A}} Q_w(X', a') - Q_w(X, A) \Big)^2 \Big], \tag{2}$$

where $Q_w : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is a function approximator, for instance a neural network, and $w \in \mathbb{R}^L$ are its parameters. The expectation is taken with respect to a distribution over state, action, next-state and reward transitions $(X, A, X', R(X, A))$. If we assume that the off-policy target $R(x, a) + \gamma \max_{a' \in \mathcal{A}} Q_w(x', a')$ of equation 2 is fixed², we obtain a stochastic semi-gradient descent scheme to minimize $h$. The resulting algorithm is called Q-learning with function approximation and takes the form

$$w_{t+1} = w_t + \alpha_t \Big( r_t + \gamma \max_{a' \in \mathcal{A}} Q_{w_t}(x'_t, a') - Q_{w_t}(x_t, a_t) \Big) \nabla_w Q_{w_t}(x_t, a_t),$$

where the state-action samples $(x_t, a_t) \sim \mu$ are drawn from the data distribution $\mu \in \Delta(\mathcal{X} \times \mathcal{A})$, the next states are $x'_t \sim P(x_t, a_t)$, the rewards are $r_t \sim R(x_t, a_t)$ and $\alpha_t \in \mathbb{R}^+$ is a learning rate.
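The update of Q-learning with function approximation can be sketched for the special case of a linear approximator $Q_w(x, a) = w \cdot \phi(x, a)$, for which $\nabla_w Q_w(x, a) = \phi(x, a)$. The one-hot feature map and the sampled transition below are hypothetical and serve only to make the example concrete.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def q_learning_step(w, phi, x, a, r, x_next, actions, alpha, gamma):
    """One semi-gradient Q-learning update for Q_w(x, a) = w . phi(x, a)."""
    # The bootstrapped target is treated as a constant (no gradient through it).
    target = r + gamma * max(dot(w, phi(x_next, b)) for b in actions)
    td_error = target - dot(w, phi(x, a))
    grad = phi(x, a)  # gradient of the linear approximator with respect to w
    return [wi + alpha * td_error * gi for wi, gi in zip(w, grad)]


def one_hot(x, a, n_actions=2, size=4):
    """Hypothetical tabular features: one indicator per state-action pair."""
    vec = [0.0] * size
    vec[x * n_actions + a] = 1.0
    return vec


w = [0.0] * 4
w = q_learning_step(w, one_hot, x=0, a=1, r=1.0, x_next=1,
                    actions=(0, 1), alpha=0.5, gamma=0.9)
```

With one-hot features the update reduces to tabular Q-learning: only the entry of $w$ associated with the sampled pair $(x_t, a_t)$ changes.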

3. LEARNING WITH NON-STATIONARY FEATURES

In our work, we consider parameterized functions as depicted in Figure 1. A state-action pair $(x, a)$ is the input and is processed by non-linear features $\phi_u$ parameterized by $u$. Then, $\phi_u(x, a)$ is passed on to a linear layer parameterized by $v$. The output is $Q_{v,u}(x, a)$. In our architecture, we use $Q_{v,u} : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ such that

$$Q_{v,u}(x, a) = \phi_u(x, a) \cdot \mathrm{Proj}_\rho(v),$$

where $\mathrm{Proj}_\rho : \mathbb{R}^K \to B_\rho$ maps $v$ to a ball of radius $\rho$ in $\mathbb{R}^K$ that can be arbitrarily large. $\mathrm{Proj}_\rho$ ensures boundedness of the Q-values without requiring boundedness of the parameters. We refer to the parameters of the final linear layer, $v \in \mathbb{R}^K$, as the final parameters, to the parameters of the non-linear hidden layers, $u \in \mathbb{R}^D$, as the hidden parameters, and to $\phi_u : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^K$ as the features.

In this section, we argue that learning the final parameters $v$ at a faster time-scale than the hidden parameters $u$ allows us to decouple the convergence analysis of the two. Consequently, a regularized version of Q-learning that has convergence guarantees in the stationary features case also converges if the features are non-stationary but convergent. Specifically, following Borkar (2008, Chapter 6), assuming that the features change slower than the final layer allows us to treat the former as quasi-static from the point of view of the latter, even though both learning processes evolve simultaneously. We note that our analysis does not require the features to be learned through the same supervision signal as the final layer. We define the Q-learning scheme of the final linear layer with the addition of ridge regularization of the parameters, merging ideas from Lim et al. (2022) and Zhang et al. (2021).

Definition 1.
In Q-learning with regularization, the final parameters are updated according to

$$v_{t+1} = v_t + \alpha_t \Big( r_t + \gamma \max_{a' \in \mathcal{A}} Q_{v_t,u_t}(x'_t, a') - \xi Q_{v_t,u_t}(x_t, a_t) \Big) \phi_{u_t}(x_t, a_t) - \alpha_t \epsilon v_t,$$

where $\xi, \epsilon > 0$ are regularization hyper-parameters and the positive learning rates in $\{\alpha_t\}_{t \ge 0}$ are such that $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$.

We observe that if $\xi = 1$, $\epsilon \to 0$ and $\rho \to \infty$, we recover the original Q-learning with linear function approximation algorithm, which can diverge. For our first result, let us assume the features are updated much slower than the final layer through a stochastic approximation scheme with well-behaved noise. Formally, we assume the following.

Assumption 1. The hidden parameters $u$ are updated according to

$$u_{t+1} = u_t + \beta_t \big( g(v_t, u_t) + N_{t+1} \big),$$

where the vector field $g : \mathbb{R}^{K+D} \to \mathbb{R}^D$ is Lipschitz-continuous; the sequence of random vectors $\{N_t\}_{t \ge 0}$ has zero mean and finite variance; and the learning rates in $\{\beta_t\}_{t \ge 0}$ are such that $\sum_{t=0}^{\infty} \beta_t = \infty$, $\sum_{t=0}^{\infty} \beta_t^2 < \infty$ and $\beta_t / \alpha_t \to 0$. We finally assume that the sequence $\{u_t\}_{t \ge 0}$ converges, i.e., that $u_t \to u^*$ for some $u^* \in \mathbb{R}^D$.

In Section 4, we propose three feature learning schemes that satisfy Assumption 1, i.e., specific choices of $g$ and $N$ that are provably convergent. For now, we establish that Q-learning with regularization converges with non-stationary convergent features.

Theorem 1. Suppose that Assumption 1 holds and moreover
(i) The distribution $\mu \in \Delta(\mathcal{X} \times \mathcal{A})$ is such that, for all $t \ge 0$, the pairs $(X_t, A_t) \sim \mu$ are i.i.d.;
(ii) The architecture $\phi : \mathbb{R}^D \times \mathcal{X} \times \mathcal{A} \to \mathbb{R}^K$ is Lipschitz-continuous in its first argument;
(iii) For all $u \in \mathbb{R}^D$, the features $\phi_u$ are such that, for all $(x, a) \in \mathcal{X} \times \mathcal{A}$, $\|\phi_u(x, a)\| \le 1$;
(iv) For all $u \in \mathbb{R}^D$, the features $\phi_u$ and the distribution $\mu$ are such that the $K \times K$ matrix $\Sigma_u := \mathbb{E}\big[ \phi_u(X, A) \phi_u^T(X, A) \big]$ is positive-definite and its minimum eigenvalue is $\sigma$.
Additionally suppose that the regularization parameter $\xi > 1$ is large enough, specifically that $\xi > \frac{\gamma}{\sigma}$; that $\epsilon > 0$ is small enough, specifically that $\epsilon < \xi\sigma - \gamma$; and that the radius $\rho > 0$ of the ball $B_\rho$ is also large enough, specifically that $\rho > \frac{r_{\max}}{\xi\sigma - \gamma - \epsilon}$. Then, it holds that $v_t \to v^*(u^*)$ w.p.1, with $v^* : \mathbb{R}^D \to \mathbb{R}^K$ such that

$$v^*(u) = \frac{1}{\xi} \Sigma_u^{-1} \mathbb{E}\Big[ \Big( R(X, A) + \gamma \max_{a' \in \mathcal{A}} Q_{v^*(u),u}(X', a') \Big) \phi_u(X, A) \Big] - \frac{\epsilon}{\xi} \Sigma_u^{-1} v^*(u).$$

Before we present the proof, we discuss Assumptions (i) to (iv) of Theorem 1. While not necessary, Assumption (i) facilitates the formal analysis of the stochastic processes, as well as its exposition in the document. Assumption (ii) allows us to use theoretical results on the convergence of stochastic approximation processes, such as the ones of Borkar (2008, Chapter 6; Theorem 2) and Mertikopoulos et al. (2020, Theorem 2). Assumption (iii) is important to the formal analysis of the limiting ordinary differential equations, as well as to ensure technical requirements such as Lipschitz continuity of the update. Specifically, since the Q-learning update considered features products between $v$-dependent and $u$-dependent quantities, the assumption ensures that the expected update is Lipschitz-continuous. Finally, Assumption (iv) is used to guarantee the existence of a solution to the limiting o.d.e., as well as to characterize that solution.

Proof. We present an outline of the proof, referring to the supplementary material for proofs of auxiliary lemmas. The Q-learning algorithm presented is a two time-scale stochastic approximation algorithm where the fast component takes the form

$$v_{t+1} = v_t + \alpha_t \big( f(v_t, u_t) + M_{t+1} \big),$$

with $f : \mathbb{R}^{K+D} \to \mathbb{R}^K$ the expected update

$$f(v, u) = \mathbb{E}\Big[ \Big( R(X, A) + \gamma \max_{a' \in \mathcal{A}} Q_{v,u}(X', a') - \xi Q_{v,u}(X, A) \Big) \phi_u(X, A) \Big] - \epsilon v$$

and $M_t \in \mathbb{R}^K$ its noise. Borkar (2008, Chapter 6; Theorem 2) provides conditions under which the stochastic process above converges.
The conditions include that $f$ is Lipschitz and that $M_{t+1}$ is a martingale-difference sequence, which we show through Lemmas 1 and 2, respectively. In addition, Lemma 3 establishes, using a Lyapunov argument, that for each $u \in \mathbb{R}^D$ the ordinary differential equation (o.d.e.) $\dot{v}_t = f(v_t, u)$ has a unique globally asymptotically stable equilibrium $v^*(u)$. Since we additionally show in Lemma 4 that the iterates remain bounded, we conclude through Lemma 5 that they converge to the equilibrium w.p.1.

We finish this section by providing a bound on the distance between the solution obtained by the Q-learning scheme considered and the optimal solution. Let us denote the optimal Q-function by $Q^*$, which exists, is unique and is generally outside the linear space generated by the features $\phi_{u^*}$. Let us define the orthogonal projection of $Q^*$ onto that linear space as the operator $\Phi_{u^*}$ such that

$$(\Phi_{u^*} Q)(x, a) = \phi_{u^*}^T(x, a) \, \Sigma_{u^*}^{-1} \, \mathbb{E}\big[ \phi_{u^*}(X, A) Q(X, A) \big].$$

We can think of $\Phi_{u^*} Q^*$ as the best linear approximator of $Q^*$. Unfortunately, such an approximator is, in general, not reachable through Q-learning. Using $w^*$ to jointly denote the parameters $(v^*(u^*), u^*)$, we have the following result for the solution $Q_{w^*}$ obtained by the regularized Q-learning scheme.

Theorem 2. Under Assumptions 1 and (i) to (iv) of Theorem 1, we have the following error bound on $Q_{w^*}$:

$$\|Q^* - Q_{w^*}\|_\infty \le \frac{\xi\sigma}{\xi\sigma - \gamma} \|Q^* - \Phi_{u^*} Q^*\|_\infty + \frac{r_{\max}(\xi - 1)}{(1 - \gamma)(\xi\sigma - \gamma)} + f_\epsilon, \tag{4}$$

where $f_\epsilon = \frac{\epsilon r_{\max}}{\xi\sigma(\sigma - \gamma - 1)} \cdot \frac{\xi\sigma}{\xi\sigma - \gamma}$.

In equation 4, we observe that the bound depends on the regularization, through the hyper-parameters $\xi$ and $\epsilon$, and on the features, through $\sigma$ and $u^*$. We can make $\epsilon \to 0$ and thus make $f_\epsilon$ arbitrarily small. As for the second term, it disappears if $\xi = 1$ and the error then depends only on the best possible solution for the given features, $\Phi_{u^*} Q^*$. However, if $\xi = 1$, the Q-learning scheme may diverge.

Proof. We have that

$$\|Q^* - Q_{w^*}\|_\infty \le \|Q^* - \Phi_{u^*} Q^*\|_\infty + \|\Phi_{u^*} Q^* - Q_{w^*}\|_\infty.$$
Let us consider the second term on the right-hand side. We know that $Q^* = HQ^*$, using $H$ to denote the Bellman operator, and that $Q_{w^*} = \frac{1}{\xi} \Phi_{u^*} H Q_{w^*} - \frac{\epsilon}{\xi} \Sigma_{u^*}^{-1} v^*(u^*)$ from the characterization of the limit solution in Theorem 1. Then, we have that

$$\|\Phi_{u^*} Q^* - Q_{w^*}\|_\infty \le \Big\| \Phi_{u^*} Q^* - \frac{1}{\xi} \Phi_{u^*} Q^* \Big\|_\infty + \frac{1}{\xi} \|\Phi_{u^*} H Q^* - \Phi_{u^*} H Q_{w^*}\|_\infty + \frac{\epsilon}{\xi\sigma} \|v^*(u^*)\| \tag{5}$$

by means of the Cauchy-Schwarz and Jensen inequalities. For the first term on the right-hand side of equation 5, we can establish that

$$\Big\| \Phi_{u^*} Q^* - \frac{1}{\xi} \Phi_{u^*} Q^* \Big\|_\infty \le \Big( 1 - \frac{1}{\xi} \Big) \|\Phi_{u^*} Q^*\|_\infty \le \frac{r_{\max}(\xi - 1)}{\xi\sigma(1 - \gamma)}.$$

For the second term on the right-hand side of equation 5, we have that

$$\frac{1}{\xi} \|\Phi_{u^*} H Q^* - \Phi_{u^*} H Q_{w^*}\|_\infty \le \frac{\gamma}{\xi\sigma} \|Q^* - Q_{w^*}\|_\infty.$$

Finally, for the third term on the right-hand side of equation 5, we start by noting that

$$\|v^*(u^*)\| \le \frac{1}{\sigma} \big( r_{\max} + \gamma \|v^*(u^*)\|_\infty \big) - \frac{\epsilon}{\xi\sigma} \|v^*(u^*)\|_\infty.$$

Equivalently, for the third term on the right-hand side of equation 5 we have

$$\frac{\epsilon}{\xi\sigma} \|v^*(u^*)\| \le \frac{\epsilon r_{\max}}{\xi\sigma(\sigma - \gamma) - \epsilon\sigma}.$$

Putting everything together, we conclude the result.
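To make the regularized update of this section concrete, the following sketch implements one step of the rule for the final parameters, together with a check of the hyper-parameter conditions stated in Theorem 1. Several choices are assumptions of this example and are not prescribed by the text: the projection is taken to be the Euclidean projection onto the ball of radius $\rho$, and the step sizes $\alpha_t = (t+1)^{-0.6}$ and $\beta_t = (t+1)^{-1}$ are just one schedule that satisfies the summability conditions with $\beta_t / \alpha_t \to 0$.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def project(v, rho):
    """Assumed form of Proj_rho: Euclidean projection onto the ball of radius rho."""
    norm = math.sqrt(dot(v, v))
    return list(v) if norm <= rho else [vi * rho / norm for vi in v]


def regularized_step(v, phi_xa, phi_next, r, alpha, gamma, xi, eps, rho):
    """One regularized Q-learning step for the final parameters v.

    phi_xa is the feature vector of the sampled pair (x_t, a_t) and phi_next
    is a list with the feature vectors of (x'_t, a') for every action a'.
    """
    pv = project(v, rho)
    q = dot(phi_xa, pv)
    q_next = max(dot(p, pv) for p in phi_next)
    td = r + gamma * q_next - xi * q
    return [vi + alpha * (td * pi - eps * vi) for vi, pi in zip(v, phi_xa)]


def theorem1_conditions(gamma, sigma, xi, eps, rho, r_max):
    """Check xi > 1, xi > gamma/sigma, eps < xi*sigma - gamma and the radius bound."""
    margin = xi * sigma - gamma
    return (xi > 1 and margin > 0 and 0 < eps < margin
            and rho > r_max / (margin - eps))


def alpha(t):
    return (t + 1) ** -0.6  # fast time-scale (assumed schedule)


def beta(t):
    return 1.0 / (t + 1)    # slow time-scale (assumed schedule)
```

For instance, with $\gamma = 0.9$ and $\sigma = 0.5$, the choice $\xi = 2$, $\epsilon = 0.05$ requires $\rho > 20$ when $r_{\max} = 1$, whereas $\xi = 1.5$ violates $\xi > \gamma / \sigma = 1.8$.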

3.1. EXPERIMENTAL RESULTS

We illustrate our proposed learning architecture on three examples with converging features. In all of them, the original Q-learning diverges while the proposed architecture does not.

LINEAR v → 2v EXAMPLE

In the $v \to 2v$ example of Tsitsiklis & Van Roy (1996) there are two states and a single action. The first state always transitions to the second; the second state always transitions to itself. All rewards are 0 and, consequently, so are the Q-values. We consider the features $\phi_u(x) = \psi(x) + u$, where $\psi(x)$ is 1 for the first state and 2 for the second state, and $u \in \mathbb{R}$. We consider $u_t = \frac{(-1)^t}{t} \to 0$. We divide the features by 2 in order to respect Assumption (iii). The desirable behavior of Q-learning would be $v \to 0$. Figure 2a shows the results. We can see that when $\xi = 1$ the parameter $v$ diverges. As $\xi$ increases, learning is more stable. When $\xi = 2$, $v$ converges to the desired solution $v = 0$.
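The qualitative behavior described for this example can be reproduced with a short simulation. The sketch below iterates the expected regularized update rather than a sampled one, and it relies on assumptions that are not stated in the text: a uniform distribution over the two states, a discount factor $\gamma = 0.95$, step sizes $\alpha_t = t^{-0.6}$, an initial value $v_0 = 1$ and no projection (i.e., $\rho \to \infty$). It is not the code used to produce Figure 2a.

```python
def run_v2v(xi, gamma=0.95, eps=1e-8, steps=20000, v0=1.0):
    """Iterate the expected regularized Q-learning update on the v -> 2v chain."""
    v = v0
    for t in range(1, steps + 1):
        u = (-1) ** t / t                          # converging hidden parameter
        phi = [(1.0 + u) / 2.0, (2.0 + u) / 2.0]   # features divided by 2
        alpha = t ** -0.6                          # assumed step-size schedule
        expected = 0.0
        for x in (0, 1):
            # Rewards are 0 and both states transition to the second state.
            td = gamma * phi[1] * v - xi * phi[x] * v
            expected += 0.5 * td * phi[x]          # uniform state distribution
        v += alpha * (expected - eps * v)
    return v


v_unregularized = run_v2v(xi=1.0)
v_regularized = run_v2v(xi=2.0)
```

Under these assumptions, the iterate should grow in magnitude for $\xi = 1$ and shrink towards 0 for $\xi = 2$, in line with the behavior reported above.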

STAR EXAMPLE

The star example of Baird (1995), slightly modified by Sutton & Barto (2018), has seven states and two actions. One of the actions transitions to any of the first six states uniformly; the other action transitions to the seventh state. All rewards are 0 and so are the Q-values. The behavioral policy chooses the first action with probability $\frac{6}{7}$ and the second action with probability $\frac{1}{7}$. Therefore, the next-state distribution is uniform. The target policy, however, always chooses the second action. For the first six states, the state features are $\phi_u(x) = \psi(x) + u \in \mathbb{R}^8$, where $\psi(x)$ is 2 in the $x$-th component, 1 in the eighth component and 0 otherwise. For the seventh state, the features are 1 in the seventh component and 2 in the eighth component. We consider again converging hidden parameters $u_t = (-1)^t \big( \frac{1}{t}, \frac{1}{t}, \frac{1}{t}, \frac{1}{t}, \frac{1}{t}, \frac{1}{t}, \frac{1}{t}, \frac{1}{t} \big) \to 0$. We divide the features by $\sqrt{5}$ in order to respect Assumption (iii). Figure 2b shows the results obtained. When $\xi = 1$, the parameters $v$ grow. However, we see that as $\xi$ increases, the final parameters $v$ do not grow, as desired.

NON-LINEAR v → 2v EXAMPLE

We modify the $v \to 2v$ example to use a non-linear learning architecture. The Markov chain remains the original from Tsitsiklis & Van Roy (1996), and so does the data distribution. We learn the features with a sigmoid activation function, $\phi_u(x) = \sigma(\psi(x) u)$, with the final linear layer parameterized by $v$ and again $\psi(x) = x$. Then, we have $Q_{v,u}(x) = \sigma(\psi(x) u) v$, and $v = 0$ recovers the correct Q-values. In Figure 3a, we see divergence of the final parameter when $\xi = 1$ and convergence to the correct solution when $\xi = 1.5$ and $\xi = 2$. In all cases the features converge, as can be seen in Figure 3b.

4. LEARNING NON-LINEAR FEATURES

Theorem 1 states that the regularized Q-learning scheme is convergent with features that change over time, throughout the learning process, as long as those features converge. We now present three learning settings for the hidden layers that we can show to satisfy the assumption, i.e., three learning settings under which convergence of the features is guaranteed. We consider a $D$-times differentiable objective function $h : \mathbb{R}^D \to \mathbb{R}$. We want to find $z^*$ such that

$$z^* = \arg\min_{z \in \mathbb{R}^D} h(z).$$

The stochastic gradient descent scheme for solving the problem above takes the form

$$z_{t+1} = z_t - \beta_t \big( \nabla_z h(z_t) + Y_{t+1} \big),$$

where the random variables $Y_t$ have zero mean and bounded variance. We have the following result from Mertikopoulos et al. (2020).

Theorem 3. Suppose that the function $h$ is Lipschitz-continuous, Lipschitz-smooth, coercive and not asymptotically flat. Then, the set of critical points $Z^* := \{z : \nabla_z h(z) = 0\}$ is non-empty. Further suppose that the random variables $Y_t$ have zero mean and finite variance. Then, $z_t \to Z^*_\infty$ w.p.1, where $Z^*_\infty \subseteq Z^*$ is a bounded connected component over which $h$ is constant.

Theorem 3 provides general conditions under which the parameters of a stochastic approximation scheme that is, in particular, stochastic gradient descent on a loss function $h$ converge to a bounded region with constant value. While Q-learning is not true stochastic gradient descent and divergence of the parameters is known to happen, it is possible to learn the parameters of the features through stochastic full-gradient descent and obtain convergence guarantees. In the sequel, we propose three such feature learning schemes. One of the proposed schemes is based on an unsupervised learning update; another is based on a semi-supervised learning update; the last is based on a reinforcement learning update. In light of Theorem 3, all the proposed feature learning schemes are in accordance with Assumption 1 and are thus guaranteed to converge.
We note, however, that in order to guarantee boundedness of the features, we should post-process them with a sigmoid final layer, $\sigma : \mathbb{R} \to [0, 1]$ such that $\sigma(x) = \frac{1}{1 + e^{-x}}$, thus respecting Assumption (iii) of Theorem 1. We define the generalized Huber loss $H : \mathbb{R}^L \to \mathbb{R}$ such that $H(l) = \min_{p \in \{1,2\}} \frac{1}{p} \|l\|_p^p$, which is a robust loss function satisfying the conditions demanded by the theorem. The Huber loss considered is thus the 2-norm if its input is close to the origin and the 1-norm otherwise. Additionally, we remark that finite linear combinations and finite compositions of Lipschitz-continuous and Lipschitz-smooth functions are also Lipschitz-continuous and Lipschitz-smooth. Finally, we note that $\nabla H(l) = \mathrm{sign}(l)$ if $\arg\min_{p \in \{1,2\}} \frac{1}{p} \|l\|_p^p = 1$ and $\nabla H(l) = l$ if $\arg\min_{p \in \{1,2\}} \frac{1}{p} \|l\|_p^p = 2$.
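The generalized Huber loss defined as the minimum between the 1-norm and half the squared 2-norm, together with its gradient, can be sketched as follows. Breaking ties in favor of the quadratic branch is an arbitrary choice of this example.

```python
def huber(l):
    """H(l) = min(||l||_1, 0.5 * ||l||_2^2) for a vector l given as a list."""
    l1 = sum(abs(x) for x in l)
    l2 = 0.5 * sum(x * x for x in l)
    return min(l1, l2)


def huber_grad(l):
    """Gradient of H: l on the quadratic branch, sign(l) on the 1-norm branch."""
    l1 = sum(abs(x) for x in l)
    l2 = 0.5 * sum(x * x for x in l)
    if l2 <= l1:  # p = 2 attains the minimum (ties go to this branch)
        return list(l)
    return [float((x > 0) - (x < 0)) for x in l]
```

Close to the origin the quadratic branch is active, so the gradient shrinks with the input, while far from the origin each gradient component has magnitude at most 1, which is what makes the loss robust to large errors.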

4.1. UNSUPERVISED LEARNING

We can learn a linear map that reduces the input space through principal component analysis. In the non-linear case, we can instantiate such learning using auto-encoder (Liou et al., 2014) or variational auto-encoder (Kingma & Welling, 2014) architectures. Finally, contrastive learning has also been used in feature extraction (Laskin et al., 2020). All these methods have no task information but can still be powerful if dimensionality is an issue or if we want to transfer learning across tasks or even domains (Higgins et al., 2017). We focus here on the definition of a loss function over which the stochastic gradient descent scheme is guaranteed to converge: that of the auto-encoder. Formally, the auto-encoder performs stochastic gradient descent over the loss function

$$h(u, s) = \mathbb{E}\Big[ H\big( \kappa_s(\phi_u(\psi(X, A))) - \psi(X, A) \big) \Big],$$

where $\psi(x, a)$ is a Euclidean representation of $(x, a)$ in $\mathbb{R}^P$. In the auto-encoder, an encoder $\phi_u : \mathbb{R}^P \to \mathbb{R}^K$, $u \in \mathbb{R}^D$, learns to map the features into a latent space and a decoder $\kappa_s : \mathbb{R}^K \to \mathbb{R}^P$ learns to reconstruct the original input. The features $\phi_u$ can then be normalized and fed to the final layer of our regularized Q-learning scheme. The auto-encoder has been applied successfully in reinforcement learning tasks (Lange et al., 2012). In practice, the stochastic gradient updates are as follows:

$$u_{t+1} = u_t - \beta_t \nabla_u \phi_{u_t}(\psi(x, a)) \, \nabla \kappa_{s_t}(\phi_{u_t}(\psi(x, a))) \, \nabla H\big( \kappa_{s_t}(\phi_{u_t}(\psi(x, a))) - \psi(x, a) \big),$$

$$s_{t+1} = s_t - \beta_t \nabla_s \kappa_{s_t}(\phi_{u_t}(\psi(x, a))) \, \nabla \phi_{u_t}(\psi(x, a)) \, \nabla H\big( \kappa_{s_t}(\phi_{u_t}(\psi(x, a))) - \psi(x, a) \big).$$

The process of learning the final linear layer parameterized by $v$ evolves concurrently as described in Section 3. As long as $\beta_t$ is $o(\alpha_t)$, convergence of both sets of parameters happens with probability 1 according to Theorem 3.
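As a minimal sketch of a stochastic gradient step on an auto-encoder loss of this kind, consider a deliberately small model in which the encoder has a single sigmoid unit ($K = 1$) and the decoder is linear. This model, its parameter shapes and the numerical values used with it are assumptions of the example; the gradients are obtained by applying the chain rule to this specific model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def huber(l):
    return min(sum(abs(x) for x in l), 0.5 * sum(x * x for x in l))


def huber_grad(l):
    if 0.5 * sum(x * x for x in l) <= sum(abs(x) for x in l):
        return list(l)
    return [float((x > 0) - (x < 0)) for x in l]


def reconstruction_loss(u, s, z):
    """H(kappa_s(phi_u(z)) - z) with phi_u(z) = sigmoid(u . z), kappa_s(y) = s * y."""
    y = sigmoid(sum(ui * zi for ui, zi in zip(u, z)))
    return huber([si * y - zi for si, zi in zip(s, z)])


def gradients(u, s, z):
    """Chain-rule gradients of the reconstruction loss with respect to u and s."""
    y = sigmoid(sum(ui * zi for ui, zi in zip(u, z)))
    g = huber_grad([si * y - zi for si, zi in zip(s, z)])
    grad_s = [gi * y for gi in g]                       # decoder parameters
    back = sum(gi * si for gi, si in zip(g, s)) * y * (1.0 - y)
    grad_u = [back * zi for zi in z]                    # encoder parameters
    return grad_u, grad_s


def sgd_step(u, s, z, beta):
    grad_u, grad_s = gradients(u, s, z)
    new_u = [ui - beta * gi for ui, gi in zip(u, grad_u)]
    new_s = [si - beta * gi for si, gi in zip(s, grad_s)]
    return new_u, new_s
```

In a two time-scale implementation, `sgd_step` would be called with the slow step size $\beta_t$ while the final layer is updated with the faster step size $\alpha_t$.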

4.2. SEMI-SUPERVISED LEARNING

We can also approach the feature learning problem in a semi-supervised way grounded in MDP theory (Ferns et al., 2004). Specifically, instead of only learning to put together inputs that are close to each other in the original space, we can learn to put them together if they are close to each other in the Markov decision process. Bisimulation metrics (Ferns et al., 2011) give us a way to perform such learning, by considering that state-action pairs are similar if they produce similar rewards and lead to similar states. We can thus define the loss function

$$h(u) = \mathbb{E}\Big[ H\Big( H\big( \phi_u(X, A) - \phi_u(\tilde{X}, \tilde{A}) \big) - H(R - \tilde{R}) - \gamma H\big( \phi_u(X', \cdot) - \phi_u(\tilde{X}', \cdot) \big) \Big) \Big]$$

and perform again stochastic gradient descent. The technique has also provided positive results in practice, specifically when compared to unsupervised learning (Zhang et al., 2020). In this semi-supervised setting, writing $\delta_t = \phi_{u_t}(x_t, a_t) - \phi_{u_t}(\tilde{x}_t, \tilde{a}_t)$ and $\delta'_t = \phi_{u_t}(x'_t, \cdot) - \phi_{u_t}(\tilde{x}'_t, \cdot)$, the stochastic gradient updates take the form

$$u_{t+1} = u_t - \beta_t \Big( \nabla_u \delta_t \, \nabla H(\delta_t) - \gamma \nabla_u \delta'_t \, \nabla H(\delta'_t) \Big) \nabla H\Big( H(\delta_t) - H(r_t - \tilde{r}_t) - \gamma H(\delta'_t) \Big),$$

where $(\tilde{x}_t, \tilde{a}_t, \tilde{x}'_t, \tilde{r}_t)$ are sampled independently and identically distributed. Again, the learning scheme used to update the hidden parameters $u$ can run alongside the learning of the final linear layer parameterized by $v$, and convergence is guaranteed by Theorem 3.

4.3. REINFORCEMENT LEARNING

Q-learning assumes bootstrapped targets and thus performs only stochastic semi-gradient descent over the loss function. In practice, that choice may produce good results but is often unstable. We propose to learn the features through full stochastic gradient descent and to learn the final layer through the usual regularized stochastic semi-gradient descent scheme. The loss function considered is

$$h(u, v') = \mathbb{E}\Big[ H\Big( R + \gamma \max_{a' \in \mathcal{A}} \phi_u(X', a') \cdot v' - \phi_u(X, A) \cdot v' \Big) \Big].$$

Recent work from Avrachenkov et al. (2021) shows that stochastic full-gradient descent over the loss function is often able to perform competitively with the stochastic semi-gradient scheme. Writing $\delta_t = r_t + \gamma \max_{a' \in \mathcal{A}} \phi_{u_t}(x'_t, a') \cdot v'_t - \phi_{u_t}(x_t, a_t) \cdot v'_t$, the updates take the form

$$u_{t+1} = u_t + \beta_t \nabla H(\delta_t) \Big( \nabla_u \phi_{u_t}(x_t, a_t) v'_t - \gamma \nabla_u \max_{a' \in \mathcal{A}} \phi_{u_t}(x'_t, a') v'_t \Big),$$

$$v'_{t+1} = v'_t + \beta_t \nabla H(\delta_t) \Big( \phi_{u_t}(x_t, a_t) - \gamma \nabla_v \max_{a' \in \mathcal{A}} \phi_{u_t}(x'_t, a') v'_t \Big).$$

For the gradient of the max operator, we may consider a smooth approximation parameterized by $\alpha$, $\max_\alpha$, such that

$$\max\nolimits_\alpha(v_1, \ldots, v_n) = \frac{\sum_{i=1}^{n} v_i e^{\alpha v_i}}{\sum_{i=1}^{n} e^{\alpha v_i}}.$$

As $\alpha \to \infty$, $\max_\alpha \to \max$. The hidden parameters $u$ that are being updated can, at the same time, be used to learn the final parameters $v$ using the regular semi-gradient Q-learning update. With the regularized version of Q-learning presented in Section 3, convergence is guaranteed (Theorem 3). In practice, our proposed approach is halfway between the full-gradient and semi-gradient schemes for reinforcement learning and is able to capture the stability of full-gradient schemes and the optimality of semi-gradient schemes. The additional computational and training costs are minimal. More specifically, one additional linear layer of parameters $v' \in \mathbb{R}^K$ is required.
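The smooth approximation of the max operator, a weighted average of the inputs with exponential weights, can be sketched as below. Subtracting the largest input before exponentiating is an implementation detail added here for numerical stability; it leaves the value unchanged because the common factor cancels between numerator and denominator.

```python
import math


def smooth_max(values, alpha):
    """Exponentially weighted average of the inputs with sharpness alpha."""
    m = max(values)
    weights = [math.exp(alpha * (v - m)) for v in values]
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


q_values = [1.0, 2.0, 3.0]
for a in (0.0, 1.0, 5.0, 50.0):
    print(a, smooth_max(q_values, a))
```

With sharpness 0 the operator returns the arithmetic mean of the inputs, and as the sharpness grows it approaches the largest input.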

5. RELATED WORK

There were initial efforts to provide stable Q-learning methods in the presence of function approximation (Singh et al., 1994; Ormoneit & Sen, 2002; Szepesvári & Smart, 2004; Melo et al., 2008; Maei et al., 2010). While meritorious and insightful, these works were not only restricted to the linear function approximation case but also assumed that the data was particularly well-aligned with specific distributions (Melo et al., 2008). More recently, finite-time error bounds were established for Q-learning when certain fixed behavior policies are used (Chen et al., 2019). Such policies are scarce or may not even exist as the number of features grows. Finite-time error bounds for Adaptive Dynamic Programming methods (Bertsekas & Tsitsiklis, 1995), including Fitted Q-iteration (FQI) (Ernst et al., 2005), assume not only the realizability of the optimal Q-function but also closedness under the Bellman update (Szepesvári & Munos, 2005). Such conditions have been discussed in a recent work (Chen & Jiang, 2019). The problem of divergence of Q-learning with function approximation was revived after a significant empirical success story of the use of Q-learning with deep neural networks (Mnih et al., 2015), where the function approximation setting is non-linear and the features are non-stationary. One of the components of the renowned deep Q-network (DQN) is a target network that mitigates the negative impact of bootstrapping, i.e., of the stochastic semi-gradient update. The works of Carvalho et al. (2020), Zhang et al. (2021) and Chen et al. (2022) provided theoretical insights and convergence guarantees for Q-learning with the target network. There is also work pointing out that regularization of the Q-values or of the parameters themselves can stabilize Q-learning, resulting in a convergent algorithm (Zhang et al., 2021; Carvalho et al., 2020; Lim et al., 2022). All these works apply only to the linear function approximation case.
In the case of linear function approximation, the features are assumed to be stationary and known prior to learning. The practical applications of reinforcement learning, however, are moving in the opposite direction, where the features are also learned. Additionally, such a learning setting is typically non-linear. Consequently, the gap between theory and practice remains significant. For the case of non-linear function approximation, a recent work suggests a loss function that is decreasing over time, assuming that the neural network converges to the targets from a target network at each step (Wang & Ueda, 2021). Finally, Xu & Gu (2020) provide a finite-time result for Q-learning with over-parameterized neural networks. While this is an interesting result, as the size of the network grows to infinity the learning architecture also grows closer to a tabular representation.

6. CONCLUSION

In this work, we provided the first convergence result for a Q-learning scheme with non-linear non-stationary features without the use of a target network. In our scheme, the final layer of a network is updated faster than the hidden layers. We show that if the features converge, the final layer also converges. We complement our theoretical analysis with experiments showcasing our result. Finally, we propose three schemes that result in guaranteed convergence of the features. In the future, it would be relevant to compare experimentally the three learning schemes considered for the features, specifically unsupervised, semi-supervised and full-gradient reinforcement learning. Additionally, it would be important to characterize the solutions obtained by each scheme.



¹ We use $\Delta(B)$ to denote the set of probability distributions over a set $B$.
² The technique is usually referred to as bootstrapping.



Figure 2: Experimental results on the linear $v \to 2v$ and star problems under non-stationary convergent features for different values of the regularization parameter $\xi$ and fixed $\epsilon = 10^{-8}$. As the regularization parameter $\xi$ increases, Q-learning updates stabilize and the parameters $v$ converge.

Figure 3: Experimental results on the non-linear $v \to 2v$ problem for different values of the regularization parameter $\xi$ and fixed $\epsilon = 10^{-8}$. As the regularization parameter $\xi$ increases, Q-learning updates stabilize and the final parameters $v$ converge. The hidden parameter $u$ converges regardless of the regularization parameter $\xi$, though the limit solutions are different for different values of $\xi$.


