IMPLICIT UNDER-PARAMETERIZATION INHIBITS DATA-EFFICIENT DEEP REINFORCEMENT LEARNING

Abstract

We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity via a drop in the rank of the learned value network features, and show that this typically corresponds to a performance drop. We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance.

1. INTRODUCTION

Many widely used deep reinforcement learning (RL) algorithms estimate value functions using bootstrapping, that is, by sequentially fitting value functions to target value estimates generated from the value function learned in the previous iteration. Despite high-profile achievements (Silver et al., 2017), these algorithms are highly unreliable due to poorly understood optimization issues. Although a number of hypotheses have been proposed to explain these issues (Achiam et al., 2019; Bengio et al., 2020; Fu et al., 2019; Igl et al., 2020; Liu et al., 2018; Kumar et al., 2020a), a complete understanding remains elusive. We identify an "implicit under-parameterization" phenomenon that emerges when value networks are trained using gradient descent combined with bootstrapping. This phenomenon manifests as an excessive aliasing of features learned by the value network across states, which is exacerbated with more gradient updates. While the supervised deep learning literature suggests that some feature aliasing is desirable for generalization (e.g., Gunasekar et al., 2017; Arora et al., 2019), implicit under-parameterization exhibits more pronounced aliasing than in supervised learning. This over-aliasing causes an otherwise expressive value network to implicitly behave as an under-parameterized network, often resulting in poor performance. Implicit under-parameterization becomes aggravated when the rate of data re-use is increased, restricting the sample efficiency of deep RL methods. In online RL, increasing the number of gradient steps between data collection steps for data-efficient RL (Fu et al., 2019; Fedus et al., 2020b) causes the problem to emerge more frequently. In the extreme case when no additional data is collected, referred to as offline RL (Lange et al., 2012; Agarwal et al., 2020; Levine et al., 2020), implicit under-parameterization manifests consistently, limiting the viability of offline methods.
We demonstrate the existence of implicit under-parameterization in common value-based deep RL methods, including Q-learning (Mnih et al., 2015; Hessel et al., 2018) and actor-critic (Haarnoja et al., 2018), as well as neural fitted Q-iteration (Riedmiller, 2005; Ernst et al., 2005). To isolate the issue, we study the effective rank of the features in the penultimate layer of the value network (Section 3). We observe that after an initial learning period, the rank of the learned features drops steeply. As the rank decreases, the ability of the features to fit subsequent target values and the optimal value function generally deteriorates, resulting in a sharp decrease in performance (Section 3.1). To better understand the emergence of implicit under-parameterization, we formally study the dynamics of Q-learning under two distinct models of neural net behavior (Section 4): kernel regression (Jacot et al., 2018; Mobahi et al., 2020) and deep linear networks (Arora et al., 2018). We corroborate the existence of this phenomenon in both models, and show that implicit under-parameterization stems from a pathological interaction between bootstrapping and the implicit regularization of gradient descent. Since value networks are trained to regress towards targets generated by a previous version of the same model, this leads to a sequence of value networks of potentially decreasing expressivity, which can result in degenerate behavior and a drop in performance. The main contribution of this work is the identification of implicit under-parameterization in deep RL methods that use bootstrapping. Empirically, we demonstrate a collapse in the rank of the learned features during training, and show that it typically corresponds to a drop in performance in the Atari (Bellemare et al., 2013) and continuous control Gym (Brockman et al., 2016) benchmarks, in both the offline and data-efficient online RL settings.
We verify the emergence of this phenomenon theoretically and characterize settings where implicit under-parameterization can emerge. We then show that mitigating this phenomenon via a simple penalty on the singular values of the learned features improves performance of value-based RL methods in the offline setting on Atari.

2. PRELIMINARIES

The goal in RL is to maximize long-term discounted reward in a Markov decision process (MDP), defined as a tuple (S, A, R, P, γ) (Puterman, 1994), with state space S, action space A, a reward function R(s, a), transition dynamics P(s'|s, a) and a discount factor γ ∈ [0, 1). The Q-function Q^π(s, a) for a policy π(a|s) is the expected long-term discounted reward obtained by executing action a at state s and following π(a|s) thereafter: Q^π(s, a) := E[∑_{t=0}^∞ γ^t R(s_t, a_t)]. Q^π is the fixed point of the Bellman operator T^π, defined for all s, a by T^π Q(s, a) := R(s, a) + γ E_{s'∼P(·|s,a), a'∼π(·|s')}[Q(s', a')], which can be written in vector form as Q^π = R + γP^π Q^π. The optimal Q-function, Q*(s, a), is the fixed point of the Bellman optimality operator T: T Q(s, a) := R(s, a) + γ E_{s'∼P(·|s,a)}[max_{a'} Q(s', a')].

Practical Q-learning methods (e.g., Mnih et al., 2015; Hessel et al., 2018; Haarnoja et al., 2018) convert the Bellman equation into a bootstrapping-based objective for training a Q-network, Q_θ, via gradient descent. This objective, known as the mean-squared temporal difference (TD) error, is given by L(θ) = ∑_{s,a} (R(s, a) + γ Q̄_θ(s', a') − Q_θ(s, a))², where Q̄_θ is a delayed copy of the Q-function, typically referred to as the target network. These methods train Q-networks via gradient descent and slowly update the target network via Polyak averaging on its parameters. We refer to the output of the penultimate layer of the deep Q-network as the learned feature matrix Φ, such that Q(s, a) = w^T Φ(s, a), where w ∈ R^d and Φ ∈ R^{|S||A|×d}.

For simplicity of analysis, we abstract deep Q-learning methods into a generic fitted Q-iteration (FQI) framework (Ernst et al., 2005), shown in Algorithm 1. We refer to FQI with neural nets as neural FQI (Riedmiller, 2005).

Algorithm 1 Fitted Q-Iteration (FQI)
1: Initialize Q-network Q_θ.
2: for fitting iteration k = 1, . . . , K do
3:   Compute target values y_k = R + γP^π Q_{k−1}.
4:   Minimize TD error for Q_θ via t = 1, . . . , T gradient descent updates, min_θ (Q_θ(s, a) − y_k)².
5: end for
In the k-th fitting iteration, FQI trains the Q-function, Q_k, to match the target values, y_k = R + γP^π Q_{k−1}, generated using the previous Q-function, Q_{k−1} (Algorithm 1). Practical methods can be instantiated as variants of FQI, with different target update styles, different optimizers, etc.

[Figure 2: Offline RL. srank_δ(Φ) and performance of neural FQI on gridworld, DQN on Atari, and SAC on Gym environments in the offline RL setting. Note that low rank (top row) generally corresponds to worse policy performance (bottom row). Rank collapse is worse with more gradient steps per fitting iteration (T = 10 vs. 200 on gridworld). Even when a larger, high-coverage dataset is used, marked as DQN (4x data), rank collapse occurs for Asterix; also see Figure A.2 for a complete figure with a larger number of gradient updates.]
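As a concrete instance of Algorithm 1, the following sketch runs FQI on a small synthetic MDP with a tabular Q-table standing in for the Q-network; the MDP size, learning rate, and iteration counts are illustrative choices of ours, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 6, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a] = next-state distribution
R = rng.uniform(size=(n_s, n_a))

# Reference solution: exact value iteration for Q*.
Q_star = np.zeros((n_s, n_a))
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Algorithm 1: K fitting iterations, each minimizing the squared error
# to the *frozen* target values y_k with T gradient updates.
Q = np.zeros((n_s, n_a))
for k in range(300):                       # K fitting iterations
    y = R + gamma * P @ Q.max(axis=1)      # y_k from the previous Q-function
    for t in range(30):                    # T gradient updates on (Q - y_k)^2
        Q = Q - 0.25 * 2.0 * (Q - y)       # lr * gradient of the squared error

print(np.abs(Q - Q_star).max())            # tiny: FQI recovers Q* in this MDP
```

In the tabular case FQI reduces to (approximate) value iteration and converges; the phenomenon studied in this paper concerns what happens to the learned features when Q_θ is instead a deep network.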

3. IMPLICIT UNDER-PARAMETERIZATION IN DEEP Q-LEARNING

In this section, we empirically demonstrate the existence of implicit under-parameterization in deep RL methods that use bootstrapping. We characterize implicit under-parameterization in terms of the effective rank (Yang et al., 2019) of the features learned by a Q-network. The effective rank of the feature matrix Φ for a threshold δ (we choose δ = 0.01), denoted srank_δ(Φ), is given by

srank_δ(Φ) = min{ k : (∑_{i=1}^{k} σ_i(Φ)) / (∑_{i=1}^{d} σ_i(Φ)) ≥ 1 − δ },

where {σ_i(Φ)} are the singular values of Φ in decreasing order, i.e., σ_1 ≥ ··· ≥ σ_d ≥ 0. Intuitively, srank_δ(Φ) represents the number of "effective" unique components of the feature matrix Φ that form the basis for linearly approximating the Q-values. When the network maps different states to orthogonal feature vectors, srank_δ(Φ) is high, close to d. When the network "aliases" state-action pairs by mapping them to a smaller subspace, Φ has only a few active singular directions, and srank_δ(Φ) takes on a small value.

Definition 1. Implicit under-parameterization refers to a reduction in the effective rank of the features, srank_δ(Φ), that occurs implicitly as a by-product of learning deep neural network Q-functions.

While rank decrease also occurs in supervised learning, it is usually beneficial for obtaining generalizable solutions (Gunasekar et al., 2017; Arora et al., 2019). However, we will show that in deep Q-learning, an interaction between bootstrapping and gradient descent can lead to more aggressive rank reduction (or rank collapse), which can hurt performance.

Experimental setup. To study implicit under-parameterization empirically, we compute srank_δ(Φ) on a minibatch of state-action pairs sampled i.i.d. from the training data (i.e., the dataset in the offline setting, and the replay buffer in the online setting). We investigate offline and online RL settings on benchmarks including Atari games (Bellemare et al., 2013) and Gym environments (Brockman et al., 2016).
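The srank_δ computation itself is a few lines given the singular values of a feature minibatch; a minimal sketch of the definition above (the function name is ours):

```python
import numpy as np

def srank(phi, delta=0.01):
    """Effective rank srank_delta(Phi): the smallest k such that the top-k
    singular values account for a (1 - delta) fraction of the total mass."""
    sigma = np.linalg.svd(phi, compute_uv=False)    # returned in decreasing order
    cumulative = np.cumsum(sigma) / np.sum(sigma)
    return int(np.searchsorted(cumulative, 1.0 - delta)) + 1

# Orthogonal features use all d directions; aliased features use very few.
d = 8
print(srank(np.eye(d)))                            # full effective rank: 8
print(srank(np.diag([100.0] + [0.1] * (d - 1))))   # one dominant direction: 1
```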
We also utilize gridworlds described by Fu et al. (2019) to compare the learned Q-function against the oracle solution computed using tabular value iteration. We evaluate DQN (Mnih et al., 2015) on gridworld and Atari, and SAC (Haarnoja et al., 2018) on Gym domains.

Data-efficient offline RL. In offline RL, our goal is to learn effective policies by performing Q-learning on a fixed dataset of transitions. We investigate the presence of rank collapse when deep Q-learning is used with offline datasets with broad state coverage from Agarwal et al. (2020). In the top row of Figure 2, we show that after an initial learning period, srank_δ(Φ) decreases in all domains (Atari, Gym and the gridworld). The final value of srank_δ(Φ) is often quite small; e.g., in Atari, only 20-100 singular components are active for 512-dimensional features, implying significant underutilization of network capacity. Since under-parameterization is implicitly induced by the learning process, even high-capacity value networks behave as low-capacity networks as more training is performed with a bootstrapped objective (e.g., mean squared TD error). On the gridworld environment, regressing to Q* using supervised regression results in a much higher srank_δ(Φ) (black dashed line in Figure 2 (left)) than when using neural FQI. On Atari, even when a 4x larger offline dataset with much broader coverage is used (blue line in Figure 2), rank collapse still persists, indicating that implicit under-parameterization is not due to limited offline dataset size. Figure 2 (2nd row) illustrates that policy performance generally deteriorates as srank_δ(Φ) drops, and eventually collapses simultaneously with the rank collapse. While we do not claim that implicit under-parameterization is the only issue in deep Q-learning, the results in Figure 2 show that the emergence of this under-parameterization is strongly associated with poor performance.
To prevent confounding from the distribution mismatch between the learned policy and the offline dataset, which often affects the performance of Q-learning methods, we also study CQL (Kumar et al., 2020b), an offline RL algorithm designed to handle distribution mismatch. We find a similar degradation in effective rank and performance for CQL (Figure A.3), implying that under-parameterization does not stem from distribution mismatch and arises even when the resulting policy is within the behavior distribution (though the policy may not exactly pick the actions observed in the dataset). We provide more evidence in Atari and Gym domains in Appendix A.1.

Data-efficient online RL. Deep Q-learning methods typically use very few gradient updates (n) per environment step (e.g., DQN takes 1 update every 4 steps on Atari, n = 0.25). Improving the sample efficiency of these methods requires increasing n to utilize the replay data more effectively. However, we find that using larger values of n results in higher levels of rank collapse as well as performance degradation. In the top row of Figure 3, we show that larger values of n lead to a more aggressive drop in srank_δ(Φ) (red vs. blue/orange lines), and that rank continues to decrease with more training. Furthermore, the bottom row illustrates that larger values of n result in worse performance, corroborating Fu et al. (2019); Fedus et al. (2020b). We find similar results with the Rainbow algorithm (Hessel et al., 2018) (Appendix A.2). As in the offline setting, directly regressing to Q* via supervised learning does not cause rank collapse (black line in Figure 3).

3.1. UNDERSTANDING IMPLICIT UNDER-PARAMETERIZATION AND ITS IMPLICATIONS

How does implicit under-parameterization degrade performance? Having established the presence of rank collapse in data-efficient RL, we now discuss how it can adversely affect performance. As the effective rank of the network features Φ decreases, so does the network's ability to fit the subsequent target values, eventually resulting in an inability to fit Q*. In the gridworld domain, we measure this loss of expressivity via the error in fitting oracle-computed Q* values with a linear transformation of Φ. When rank collapse occurs, the error in fitting Q* steadily increases during training, and the resulting network is unable to predict Q* at all by the end of training (Figure 4a); this entails a drop in performance. In Atari domains, we do not have access to Q*, so we instead measure the TD error, that is, the error in fitting the target value estimates, R + γP^π Q_k. In SEAQUEST, as rank decreases, the TD error increases (Figure 4b) and the value function is unable to fit the target values, culminating in a performance plateau (Figure 3). This observation is consistent across other environments; we present further supporting evidence in Appendix A.4.
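The "error in fitting Q*" above is simply the residual of the best linear fit of the oracle values from the current features Φ. A minimal sketch (dimensions, seed, and the random stand-in for Q* are our illustrative choices) shows why rank-collapsed features incur higher fitting error:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 256, 32
q_star = rng.normal(size=n)        # stand-in for oracle-computed Q* values

def qstar_fit_error(phi, q):
    """Residual of the best linear predictor: min_w ||phi @ w - q||."""
    w, *_ = np.linalg.lstsq(phi, q, rcond=None)
    return float(np.linalg.norm(phi @ w - q))

phi_full = rng.normal(size=(n, d))  # healthy, high-rank features
phi_low = phi_full.copy()
phi_low[:, 3:] = 0.0                # rank-collapsed: only 3 active directions

# The collapsed features span a subspace of the full ones, so they fit Q* worse.
print(qstar_fit_error(phi_low, q_star) > qstar_fit_error(phi_full, q_star))  # True
```

Because the collapsed feature columns span a subspace of the full ones, the inequality holds by construction here; in the paper's experiments the analogous gap opens up over training as rank collapses.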

Does bootstrapping cause implicit under-parameterization?

We perform a number of controlled experiments in the gridworld and Atari environments to isolate the connection between rank collapse and bootstrapping. We first remove the confounding issues of poor network initialization (Fedus et al., 2020a) and non-stationarity (Igl et al., 2020) by showing that rank collapse occurs even when the Q-network is re-initialized from scratch at the start of each fitting iteration (Figure 4c). To show that the problem is not isolated to the control setting, we show evidence of rank collapse in the policy evaluation setting as well: we trained a value network using fitted Q-evaluation for a fixed policy π (i.e., using the Bellman operator T^π instead of T), and found that rank drop still occurs (FQE in Figure 4d). Finally, we show that removing bootstrapped updates and instead regressing directly to Monte-Carlo (MC) estimates of the value does not cause the effective rank to collapse (MC Returns in Figure 4d). These results, along with similar findings on other Atari environments (Appendix A.3), indicate that bootstrapping is at the core of implicit under-parameterization.

4. THEORETICAL ANALYSIS OF IMPLICIT UNDER-PARAMETERIZATION

In this section, we formally analyze implicit under-parameterization and prove that training neural networks with bootstrapping reduces the effective rank of the Q-network, corroborating the empirical observations in the previous section. For ease of analysis, we focus on policy evaluation (Figure 4d and Figure A.9), where we aim to learn a Q-function that satisfies Q = R + γP^π Q for a fixed policy π. We also assume a fixed dataset of transitions, D, for learning the Q-function.

4.1. ANALYSIS VIA KERNEL REGRESSION

We first study bootstrapping with neural networks through a mathematical abstraction that treats the Q-network as a kernel machine, following the neural tangent kernel (NTK) formalism (Jacot et al., 2018). Building on prior analysis of self-distillation (Mobahi et al., 2020), we assume that in each iteration of bootstrapping, the Q-function optimizes the squared TD error to target labels y_k with a kernel regularizer. This regularizer captures the inductive bias from gradient-based optimization of the TD error and resembles the regularization imposed by gradient descent under the NTK (Mobahi et al., 2020). The error is computed on (s_i, a_i) ∈ D, whereas the regularization, imposed by a universal kernel u with a coefficient c ≥ 0, is applied to the Q-values at all state-action pairs, as shown in Equation 1. We consider a setting where c > 0 for all rounds of bootstrapping, which corresponds to the solution obtained by performing gradient descent on the TD error for a small number of iterations with early stopping in each round (Suggala et al., 2018), and thus resembles how the updates in Algorithm 1 are typically implemented in practice.

Q_{k+1} ← argmin_{Q ∈ Q} ∑_{(s_i, a_i) ∈ D} (Q(s_i, a_i) − y_k(s_i, a_i))² + c ∑_{(s,a)} ∑_{(s',a')} u((s, a), (s', a')) Q(s, a) Q(s', a').   (1)

The solution to Equation 1 can be expressed as Q_{k+1}(s, a) = g_{(s,a)}^T (cI + G)^{−1} y_k, where G is the Gram matrix for a special positive-definite kernel (Duffy, 2015) and g_{(s,a)} denotes the row of G corresponding to the input (s, a) (Mobahi et al., 2020, Proposition 1). A detailed proof is in Appendix C. When combined with the fitted Q-iteration recursion, setting labels y_k = R + γP^π Q_k and writing A := G(cI + G)^{−1}, we recover a recurrence that relates subsequent value function iterates:

Q_{k+1} = G(cI + G)^{−1} y_k = A[R + γP^π Q_k] = A ∑_{i=0}^{k} γ^i (P^π A)^i R =: A M_k R.   (2)

Equation 2 follows from unrolling the recurrence and setting the algorithm-agnostic initial Q-value, Q_0, to be 0.
We now show that the sparsity of the singular values of the matrix M_k generally increases over fitting iterations, implying that the effective rank of M_k diminishes with more iterations. For this result, we assume that the matrix S is normal, i.e., the moduli of the (complex) eigenvalues of S equal its singular values. We discuss how this assumption can be relaxed in Appendix A.7.

Theorem 4.1. Let S be shorthand for S = γP^π A and assume S is a normal matrix. Then there exists an infinite, strictly increasing sequence of fitting iterations (k_l)_{l=1}^∞, starting from k_1 = 0, such that for any two singular values σ_i(S) and σ_j(S) of S with σ_i(S) < σ_j(S), and any l' > l,

σ_i(M_{k_{l'}}) / σ_j(M_{k_{l'}}) < σ_i(M_{k_l}) / σ_j(M_{k_l}) ≤ σ_i(S) / σ_j(S).

Hence, srank_δ(M_{k_{l'}}) ≤ srank_δ(M_{k_l}). Moreover, if S is positive semi-definite, then (k_l)_{l=1}^∞ = N, i.e., srank continuously decreases in each fitting iteration.

We provide a proof of the theorem above, as well as a stronger variant that shows a gradual decrease in the effective rank for fitting iterations outside this infinite sequence, in Appendix C. As k increases along the sequence of iterations (k_l)_{l=1}^∞, the effective rank of the matrix M_k drops, leading to low expressivity of this matrix. Since M_k linearly maps rewards to the Q-function (Equation 2), a drop in the expressivity of M_k results in an inability to model the actual Q^π.

Summary of our analysis. Our analysis of bootstrapping and gradient descent from the view of regularized kernel regression suggests that rank drop happens with more training (i.e., with more rounds of bootstrapping). In contrast to self-distillation (Mobahi et al., 2020), rank drop may not happen in every iteration (and rank may occasionally increase between two consecutive iterations), but srank_δ exhibits a generally decreasing trend.
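The recursion in Equation 2 can be simulated numerically to observe this rank drop. The sketch below is an illustrative construction of ours: it takes P^π = I so that S = γA is symmetric positive semi-definite (the case of Theorem 4.1 in which srank decreases every iteration), with an arbitrary geometrically decaying kernel spectrum:

```python
import numpy as np

def srank_from_sv(sigma, delta=0.01):
    """srank_delta from singular values sorted in decreasing order."""
    cumulative = np.cumsum(sigma) / np.sum(sigma)
    return int(np.searchsorted(cumulative, 1.0 - delta)) + 1

rng = np.random.default_rng(0)
d, c, gamma = 12, 1.0, 0.99
# Gram matrix G with a geometrically decaying spectrum, in a random basis.
B, _ = np.linalg.qr(rng.normal(size=(d, d)))
G = B @ np.diag(1000.0 * 0.5 ** np.arange(d)) @ B.T
A = G @ np.linalg.inv(c * np.eye(d) + G)   # A = G (cI + G)^{-1}
S = gamma * A                              # with P^pi = I, S is PSD (hence normal)

# M_k = sum_{i=0}^k S^i (Equation 2); track srank_delta(M_k) over iterations.
M, S_pow, ranks = np.eye(d), np.eye(d), []
for k in range(200):
    S_pow = S_pow @ S
    M = M + S_pow
    ranks.append(srank_from_sv(np.linalg.svd(M, compute_uv=False)))

print(ranks[0], ranks[-1])   # the effective rank of M_k only decreases
```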
4.2. ANALYSIS WITH DEEP LINEAR NETWORKS UNDER GRADIENT DESCENT

While Section 4.1 demonstrates that rank collapse will occur in a kernel-regression model of Q-learning, it does not illustrate when the rank collapse occurs. To better specify a point in training where rank collapse emerges, we present a complementary derivation for the case when the Q-function is represented as a deep linear neural network (Arora et al., 2019), a widely studied setting for analyzing the implicit regularization of gradient descent in supervised learning (Gunasekar et al., 2017; 2018; Arora et al., 2018; 2019). Our analysis will show that rank collapse can emerge as the generated target values begin to approach the previous value estimate, in particular when in the vicinity of the optimal Q-function.

Proof strategy. Our proof consists of two steps: (1) we show that the effective rank of the feature matrix decreases within one fitting iteration (for a given target value) due to the low-rank affinity of gradient descent; (2) we show that this effective rank drop is "compounded" as we train using a bootstrapped objective. Proposition 4.1 establishes (1), and Proposition 4.2, Theorem 4.2 and Appendix D.2 discuss (2).

Additional notation and assumptions. We represent the Q-function as a deep linear network with N ≥ 3 layers, Q(s, a) = W_N W_φ [s; a], where W_N ∈ R^{1×d_{N−1}} and W_φ = W_{N−1} W_{N−2} ··· W_1 with W_i ∈ R^{d_i×d_{i−1}} for i = 1, . . . , N − 1. W_φ maps an input [s; a] to the corresponding penultimate-layer features Φ(s, a). Let W_j(k, t) denote the weight matrix W_j at the t-th step of gradient descent during the k-th fitting iteration (Algorithm 1). We define W_{k,t} = W_N(k, t) W_φ(k, t) and let L_{N,k+1}(W_{k,t}) be the TD error objective in the k-th fitting iteration. We study srank_δ(W_φ(k, t)), since the rank of the features Φ = W_φ(k, t)[S; A] is equal to the rank of W_φ(k, t) provided the state-action inputs have high rank.

We assume that the evolution of the weights is governed by a continuous-time differential equation (Arora et al., 2018) within each fitting iteration k. To simplify the analysis, we also assume that all except the last-layer weights satisfy a "balancedness" property (Equation D.4), which implies that the weight matrices in consecutive layers of the deep linear network share the same singular values (but with different permutations). Note, however, that we do not assume balancedness for the last layer, which would trivially lead to rank-1 features, making our analysis strictly more general than conventionally studied deep linear networks. In this model, we can characterize the evolution of the singular values of the feature matrix W_φ(k, t) using techniques analogous to Arora et al. (2019):

Proposition 4.1. The singular values of the feature matrix W_φ(k, t) evolve according to:

σ̇_r(k, t) = −N · (σ_r²(k, t))^{1 − 1/(N−1)} · ⟨ W_N(k, t)^T (dL_{N,k+1}(W_{k,t}) / dW), u_r(k, t) v_r(k, t)^T ⟩,   (4)

for r = 1, . . . , min_{i=1}^{N−1} d_i, where u_r(k, t) and v_r(k, t) denote the corresponding left and right singular vectors of W_φ(k, t).

Abstract optimization problem for the low-rank solution. Building on Proposition 4.1, we note that the final solution obtained in a bootstrapping round (i.e., a fitting iteration) can be equivalently expressed as the solution that minimizes a weighted sum of the TD error and a data-dependent implicit regularizer h_D(W_φ, W_N) that encourages disproportionate singular values of W_φ, and hence a low effective rank of W_φ. While the actual form of h is unknown, to facilitate our analysis of bootstrapping we make a simplification and express this solution as the minimum of Equation 5:

min_{W_φ, W_N ∈ M}  ||W_N W_φ [s; a] − y_k(s, a)||² + λ_k srank_δ(W_φ).   (5)

Note that the entire optimization path may not correspond to the objective in Equation 5, but Equation 5 represents the final solution of a given fitting iteration.
M denotes the set of constraints that W_N, obtained via gradient optimization of the TD error, must satisfy; however, we do not need to explicitly characterize M in our analysis. λ_k is a constant that denotes the strength of the rank regularization. Since srank_δ is always regularized, our analysis assumes that λ_k > 0 (see Appendix D.1).

Rank drop within a fitting iteration "compounds" due to bootstrapping. In the RL setting, the target values are given by y_k(s, a) = R(s, a) + γP^π Q_{k−1}(s, a). First note that when R(s, a) = 0 and P^π = I, i.e., when the bootstrapping update resembles self-regression, simply "copying over the weights" from iteration k − 1 to iteration k is a feasible point for Equation 5, attaining zero TD error with no increase in srank_δ. A better solution to Equation 5 can thus be obtained by incurring non-zero TD error for the benefit of a decreased srank, indicating that in this setting srank_δ(W_φ) drops in each fitting iteration, leading to a compounding rank-drop effect.

We next extend this analysis to the full bootstrapping setting. Unlike the self-training setting, y_k(s, a) is not directly expressible as a function of the previous features W_φ(k − 1, T) due to the additional reward and dynamics transformations. Assuming closure of the function class under the Bellman update (Assumption D.1) (Munos & Szepesvári, 2008; Chen & Jiang, 2019), we reason about the compounding effect of rank drop across iterations in Proposition 4.2 (proof in Appendix D.2). Specifically, srank_δ can increase in each fitting iteration due to the R and P^π transformations, but will decrease due to the low-rank preference of gradient descent. This change in rank then compounds, as shown below.

Proposition 4.2. Assume that the Q-function is initialized to W_φ(0) and W_N(0). Let the Q-function class be closed under the backup, i.e., ∃ W_N^P, W_φ^P, s.t.
(R + γP^π Q_{k−1})^T = W_N^P(k) W_φ^P(k) [S; A]^T, and assume that the change in srank due to the dynamics and reward transformations is bounded: srank_δ(W_φ^P(k)) ≤ srank_δ(W_φ(k − 1)) + c_k. Then,

srank_δ(W_φ(k)) ≤ srank_δ(W_φ(0)) + ∑_{j=1}^{k} c_j − ∑_{j=1}^{k} ||Q_j − y_j|| / λ_j.

Proposition 4.2 provides a bound on the value of srank after k rounds of bootstrapping: srank decreases in each iteration due to non-zero TD errors, but potentially increases due to the reward and bootstrapping transformations. To instantiate a concrete case where rank clearly collapses, we investigate c_k as the value function gets closer to the Bellman fixed point, which is a favourable initialization for the Q-function, in Theorem 4.2. In this case, the learning dynamics begin to resemble the self-training regime, as the target values approach the previous value iterate, y_k ≈ Q_{k−1}, and thus, as we show next, the potential increase in srank (c_k in Proposition 4.2) converges to 0.

Theorem 4.2. Suppose the target values y_k = R + γP^π Q_{k−1} are close to the previous value estimate Q_{k−1}, i.e., ∀ s, a, y_k(s, a) = Q_{k−1}(s, a) + ε(s, a), with |ε(s, a)| ≪ |Q_{k−1}(s, a)|. Then there is a constant ε_0, depending on W_N and W_φ, such that for all ε < ε_0, c_k = 0. Thus, srank decreases in iteration k:

srank_δ(W_φ(k)) ≤ srank_δ(W_φ(k − 1)) − ||Q_k − y_k|| / λ_k.

We provide the complete statement, including the expression for ε_0, and a proof in Appendix D.3. To empirically show the consequence of Theorem 4.2, that a decrease in srank_δ(W_φ) can lead to an increase in the distance to the fixed point in a neighborhood around the fixed point, we performed a controlled experiment on a deep linear net, shown in Figure 5.

5. MITIGATING UNDER-PARAMETERIZATION IMPROVES DEEP Q-LEARNING

We now show that mitigating implicit under-parameterization by preventing rank collapse can improve performance. We place special emphasis on the offline RL setting in this section, since it is particularly vulnerable to the adverse effects of rank collapse. We devise a penalty (or regularizer) L_p(Φ) that encourages a higher effective rank of the learned features, srank_δ(Φ), to prevent rank collapse. The effective rank function srank_δ(Φ) is non-differentiable, so we choose a simple surrogate that can be optimized over deep networks. Since effective rank is maximized when the magnitudes of the singular values are roughly balanced, one way to increase effective rank is to minimize the largest singular value of Φ, σ_max(Φ), while simultaneously maximizing the smallest singular value, σ_min(Φ). We construct a simple penalty L_p(Φ) derived from this intuition, given by:

L_p(Φ) = σ²_max(Φ) − σ²_min(Φ).

L_p(Φ) can be computed by invoking the singular value decomposition subroutines in standard automatic differentiation frameworks (Abadi et al., 2016; Paszke et al., 2019). We estimate the singular values over the feature matrix computed on a minibatch, and add the resulting value of L_p as a penalty to the TD error objective, with a tradeoff factor α = 0.001.

Does L_p(Φ) address rank collapse? We first verify whether controlling the minimum and maximum singular values using L_p(Φ) actually prevents rank collapse. When using this penalty on the gridworld problem (Figure 6a), the effective rank does not collapse, instead gradually decreasing at the onset and then plateauing, akin to the evolution of effective rank in supervised learning. In Figure 6b, we plot the evolution of effective rank on two Atari games in the offline setting (all games in Appendix A.5), and observe that using L_p also generally leads to increasing rank values.
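A sketch of the penalty computation (written here with NumPy for clarity; in practice one would use the differentiable SVD of an autodiff framework, e.g. torch.linalg.svdvals, so that gradients flow back into Φ):

```python
import numpy as np

def rank_penalty(phi):
    """L_p(Phi) = sigma_max(Phi)^2 - sigma_min(Phi)^2.

    Minimizing this pushes the largest singular value down and the smallest
    one up, balancing the spectrum and hence raising the effective rank."""
    sigma = np.linalg.svd(phi, compute_uv=False)  # decreasing order
    return float(sigma[0] ** 2 - sigma[-1] ** 2)

print(rank_penalty(np.diag([3.0, 2.0, 1.0])))  # sigma_max^2 - sigma_min^2 = 3^2 - 1^2
print(rank_penalty(np.eye(3)))                 # 0: a perfectly balanced spectrum
# During training, the total loss would be of the form
#   td_error + alpha * rank_penalty(phi_minibatch),  with alpha = 0.001 as above.
```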

Does mitigating rank collapse improve performance?

We now evaluate the performance of the penalty using DQN (Mnih et al., 2015) and CQL (Kumar et al., 2020b) on the Atari dataset from Agarwal et al. (2020) (5% replay data) used in Section 3. Figure 7 summarizes the relative improvement from using the penalty on 16 Atari games. Adding the penalty to DQN improves performance on 16/16 games with a median improvement of 74.5%; adding it to CQL, a state-of-the-art offline algorithm, improves performance on 11/16 games with a median improvement of 14.1%. Prior work has discussed that standard Q-learning methods designed for the online setting, such as DQN, are generally ineffective with small offline datasets (Kumar et al., 2020b; Agarwal et al., 2020). Our results show that mitigating rank collapse makes even such simple methods substantially more effective in this setting, suggesting that rank collapse and the resulting implicit under-parameterization may be a crucial piece of the puzzle in explaining the challenges of offline RL. We also evaluated the regularizer L_p(Φ) in the data-efficient online RL setting, with results in Appendix A.6. This variant achieved a median performance improvement of 20.6% with Rainbow (Hessel et al., 2018), but performed poorly with DQN, where it reduced median performance by 11.5%. Thus, while our proposed penalty is effective in many cases in both the offline and online settings, it does not solve the problem fully: it addresses a symptom rather than the root cause of implicit under-parameterization, and a more sophisticated solution may better prevent these issues. Nevertheless, our results suggest that mitigating implicit under-parameterization can improve the performance of data-efficient RL.

6. RELATED WORK

Prior work has extensively studied the learning dynamics of Q-learning with tabular and linear function approximation, to study error propagation (Munos, 2003; Farahmand et al., 2010) and to prevent divergence (De Farias, 2002; Maei et al., 2009; Sutton et al., 2009; Dai et al., 2018), as opposed to the deep Q-learning analyzed in this work. Q-learning has been shown to have favorable optimization properties with certain classes of features (Ghosh & Bellemare, 2020), but our work shows that the features learned by a neural net when minimizing TD error do not enjoy such guarantees, and instead suffer from rank collapse. Recent theoretical analyses of deep Q-learning have shown convergence under restrictive assumptions (Yang et al., 2020; Cai et al., 2019; Zhang et al., 2020; Xu & Gu, 2019), but Theorem 4.2 shows that implicit under-parameterization appears when the estimates of the value function approach the optimum, potentially preventing convergence. Xu et al. (2005; 2007) present variants of LSTD (Boyan, 1999) which model the Q-function as a kernel machine, but do not take into account the regularization from gradient descent, as done in Equation 1, which is essential for implicit under-parameterization. Igl et al. (2020) and Fedus et al. (2020a) argue that non-stationarity arising from distribution shift hinders generalization and recommend periodic network re-initialization. Under-parameterization is not caused by this distribution shift, and we find that network re-initialization does little to prevent rank collapse (Figure 4c). Luo et al. (2020) propose a regularization similar to ours, but in a different setting, finding that more expressive features increase the performance of on-policy RL methods. Finally, Yang et al. (2019) study the effective rank of the Q*-values when expressed as an |S| × |A| matrix in online RL and find that low ranks for this Q*-matrix are preferable.
We analyze a fundamentally different object: the learned features (and illustrate that a rank-collapse of features can hurt), not the Q * -matrix, whose rank is upper-bounded by the number of actions (e.g., 24 for Atari).

7. DISCUSSION

We identified an implicit under-parameterization phenomenon in deep RL algorithms that use bootstrapping, where gradient-based optimization of a bootstrapped objective can lead to a reduction in the expressive power of the value network. This effect manifests as a collapse of the rank of the features learned by the value network, causing aliasing across states and often leading to poor performance. Our analysis reveals that this phenomenon is caused by the implicit regularization due to gradient descent on bootstrapped objectives. We observed that mitigating this problem by means of a simple regularization scheme improves the performance of deep Q-learning methods. While our proposed regularization provides some improvement, devising better mitigation strategies for implicit under-parameterization remains an exciting direction for future work. Our method explicitly attempts to prevent rank collapse, but relies on the emergence of useful features solely through the bootstrapped signal. An alternative path may be to develop new auxiliary losses (e.g., Jaderberg et al., 2016) that learn useful features while passively preventing under-parameterization. More broadly, understanding the effects of neural nets and associated factors such as initialization, choice of optimizer, etc., on the learning dynamics of deep RL algorithms, using tools from deep learning theory, is likely to be key to developing robust and data-efficient deep RL algorithms.

A ADDITIONAL EVIDENCE FOR IMPLICIT UNDER-PARAMETERIZATION

In this section, we present additional evidence demonstrating the existence of the implicit under-parameterization phenomenon from Section 3. In all cases, we plot the values of srank_δ(Φ) computed on a batch of 2048 i.i.d. sampled transitions from the dataset.
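The quantity srank_δ(Φ), defined as the smallest number of singular values whose cumulative sum accounts for a (1 − δ) fraction of the total singular-value mass, can be computed directly from an SVD of the feature matrix. A minimal numpy sketch (batch-by-feature layout assumed):

```python
import numpy as np

def srank(phi: np.ndarray, delta: float = 0.01) -> int:
    """Effective rank srank_delta(Phi) as defined in the text: the smallest
    number k of (descending) singular values whose cumulative sum accounts
    for at least a (1 - delta) fraction of the total singular-value mass."""
    s = np.linalg.svd(phi, compute_uv=False)      # singular values, descending
    cumulative = np.cumsum(s) / np.sum(s)
    # First index where the cumulative fraction crosses 1 - delta, 1-based.
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)
```

For example, a full-rank identity feature matrix has effective rank equal to its dimension, while a rank-1 outer product has effective rank 1.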

A.2 DATA EFFICIENT ONLINE RL

In the data-efficient online RL setting, we verify the presence of implicit under-parameterization for both DQN and Rainbow (Hessel et al., 2018) when a larger number of gradient updates is made per environment step. In these settings, we find that more gradient updates per environment step lead to a larger decrease in effective rank, whereas effective rank can increase when the amount of data re-use is reduced by taking fewer gradient steps.

A.3 DOES BOOTSTRAPPING CAUSE IMPLICIT UNDER-PARAMETERIZATION?

In this section, we provide additional evidence to support our claim from Section 3 that bootstrapping-based updates are a key component behind the existence of implicit under-parameterization. To do so, we empirically demonstrate the following points:

• Implicit under-parameterization occurs even when the form of the bootstrapping update is changed from Q-learning, which utilizes a max_a backup operator, to a policy evaluation (fitted Q-evaluation) backup operator, which computes an expectation of the target Q-values under the distribution specified by a different policy. Thus, the phenomenon still appears with different bootstrapped updates.

Figure A.9: Offline Policy Evaluation on Atari. srank_δ(Φ) and performance of offline policy evaluation (FQE) on 5 Atari games in the offline RL setting, using the 5% and 20% DQN Replay datasets (Agarwal et al., 2020). The rank degradation shows that under-parameterization is not specific to the Bellman optimality operator, but happens even when other bootstrapping-based backup operators are combined with gradient descent. Furthermore, the rank degradation also happens when we increase the dataset size.

• Implicit under-parameterization does not occur with Monte-Carlo regression targets, which compute regression targets for the Q-function via a non-parametric estimate of the future trajectory return, i.e., y_k(s_t, a_t) = Σ_{t'=t}^∞ γ^{t'−t} r(s_{t'}, a_{t'}), and do not use bootstrapping. In this setting, we find that the values of effective rank actually increase over time and stabilize, unlike the corresponding case for bootstrapped updates. Thus, with all other factors kept identical, rank collapse does not happen when Monte-Carlo returns are used as targets, implying that bootstrapping is essential for under-parameterization. We perform these experiments using the 5% and 20% DQN replay datasets from Agarwal et al. (2020).

• Finally, we verify that the non-stationarity of the policy in the Q-learning (control) setting, i.e., when the Bellman optimality operator is used, is not the reason behind the emergence of implicit under-parameterization. The non-stationary policy in a control setting causes the targets to change and, as a consequence, leads to non-zero errors. However, the rank drop is primarily caused by bootstrapping rather than by the non-stationarity of the control objective. To illustrate this, we ran an experiment in the control setting on gridworld, regressing to targets computed using the true value function Q^π for the current policy π (computed using tabular Q-evaluation) instead of using the bootstrapped TD estimate. The results, shown in Figure A.11a, show that srank_δ does not decrease significantly when regressing to true control values, and in fact increases with more iterations, in contrast to Figure 6a, where the rank drops with bootstrapping. This experiment, together with the experiments above that ablate bootstrapping in the stationary policy evaluation setting, shows that rank deficiency is caused by bootstrapping.
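The distinction between the two target types discussed above can be made concrete on a toy finite trajectory. The sketch below is illustrative (function names are ours, not the paper's):

```python
import numpy as np

def monte_carlo_targets(rewards, gamma):
    """Non-parametric Monte-Carlo regression targets: the discounted
    return-to-go G_t = sum_{t' >= t} gamma^(t'-t) * r_t' for each step of
    a (finite) trajectory. No bootstrapping on a learned Q-function."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def bootstrapped_targets(rewards, next_q, gamma):
    """TD / fitted-Q targets: y_t = r_t + gamma * Q_k(s_{t+1}, a_{t+1}),
    where next_q holds value estimates from the previous network iterate.
    Bootstrapping replaces the tail of the return with a learned estimate."""
    return np.asarray(rewards) + gamma * np.asarray(next_q)
```

For rewards (1, 1, 1) with γ = 0.5, the Monte-Carlo returns are (1.75, 1.5, 1.0); the bootstrapped targets coincide with them exactly when next_q happens to equal the true return-to-go, but in training next_q comes from the previous network iterate, which is what drives the dynamics studied here.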

A.4 HOW DOES IMPLICIT REGULARIZATION INHIBIT DATA-EFFICIENT RL?

Implicit under-parameterization leads to a trade-off between minimizing the TD error and the implicit preference for low-rank features, as shown in Figure 4b. This trade-off often results in a decrease in effective rank at the expense of an increase in TD error, resulting in lower performance. Here we present additional evidence to support this. In settings with very little rank drop, the error in fitting the projection of oracle-computed Q* values is quite small, and the regression error to Q* actually decreases, unlike the case in Figure 4a, where it remains the same or even increases. The method is also able to learn policies that attain good performance. Hence, when there is very little rank drop (for example, 5 rank units in the example on the right), FQI methods are generally able to learn a Φ that can fit Q*. This provides evidence that typical Q-networks learn a Φ that can fit the optimal Q-function when rank collapse does not occur. In Atari, we do not have access to Q*, and so we instead measure the error in fitting the target value estimates, R + γP^π Q_k. As rank decreases, the TD error increases (Figure A.12).

Figure A.12: TD error vs. effective rank on Atari. We observe that the Huber-loss TD error is often higher when implicit under-parameterization is more severe, measured in terms of the drop in effective rank. The results are shown for the data-efficient online RL setting.
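For reference, the Huber-loss TD error tracked in Figure A.12 can be sketched as follows (a standard formulation; the exact per-agent implementation may differ):

```python
import numpy as np

def huber(x, kappa=1.0):
    """Huber loss, commonly used for the TD error in DQN-style agents:
    quadratic for |x| <= kappa, linear beyond."""
    x = np.abs(x)
    return np.where(x <= kappa, 0.5 * x ** 2, kappa * (x - 0.5 * kappa))

def td_errors(q_pred, rewards, next_q, gamma):
    """Per-transition Huber TD error against bootstrapped targets
    y = r + gamma * Q_k(s', a'); the quantity tracked in the text is the
    average of such errors over a sampled batch."""
    targets = np.asarray(rewards) + gamma * np.asarray(next_q)
    return huber(np.asarray(q_pred) - targets)
```

The quadratic-to-linear transition makes the loss robust to the large target errors that arise early in training.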

A.5 TRENDS IN VALUES OF EFFECTIVE RANK WITH PENALTY.

In this section, we present the trend in the values of effective rank when the penalty L_p(Φ) is added. In each plot below, we show the value of srank_δ(Φ) with and without the penalty, respectively.

Figure: Trends in effective rank and performance for offline DQN with a 4x larger dataset, where distribution shift effects are generally removed. Note that the performance of DQN with the penalty is generally better than without it, and that the penalty (blue) is effective in increasing the values of effective rank in most cases. In fact, on PONG, where the penalty is not effective in increasing rank, we observe suboptimal performance (blue vs. red).

A.6 PREVENTING RANK COLLAPSE IMPROVES PERFORMANCE

In this section, we present additional results supporting the hypothesis that preventing rank collapse leads to better performance. In the first set of experiments, we apply the proposed L_p penalty to Rainbow in the data-efficient online RL setting (n = 4). In the second set of experiments, we present evidence for the prevention of rank collapse by comparing rank values across different runs.


As discussed above, the state-of-the-art Rainbow (Hessel et al., 2018) algorithm also suffers from rank collapse in the data-efficient online RL setting when more updates are performed per environment step. We applied our penalty L_p to Rainbow with n = 4 and obtained a median 20.66% improvement over the base method. This result is summarized below.

Figure: Learning curves with and without the penalty on 16 games, corresponding to the bar plot above. One unit on the x-axis is equivalent to 1M environment steps.

A.7 RELAXING THE NORMALITY ASSUMPTION IN THEOREM 4.1

We can relax the normality assumption on S in Theorem 4.1. An analogous statement holds for non-normal matrices S under a slightly different notion of effective rank, denoted srank_{δ,λ}(M_k), that uses eigenvalue norms instead of singular values. Formally, let λ_1(M_k), λ_2(M_k), … be the (complex) eigenvalues of M_k arranged in decreasing order of norm, i.e., |λ_1(M_k)| ≥ |λ_2(M_k)| ≥ ⋯; then

srank_{δ,λ}(M_k) = min { k : (Σ_{i=1}^{k} |λ_i(M_k)|) / (Σ_{i=1}^{d} |λ_i(M_k)|) ≥ 1 − δ }.

A statement essentially analogous to Theorem 4.1 then says that srank_{δ,λ}(M_k) decreases for all (complex) diagonalizable matrices S, which form the set of almost all matrices of size dim(S). Moreover, if S is approximately normal, i.e., when |σ_i(S) − |λ_i(S)|| is small, then the result in Theorem 4.1 also holds approximately, as we discuss at the end of Appendix C. We now provide empirical evidence that the trend in the values of effective rank computed using singular values, srank_δ(Φ), is almost identical to the trend computed using normalized eigenvalues, srank_{δ,λ}(Φ). Since eigenvalues are only defined for square matrices, in practice we use a batch of d = dim(φ(s, a)) state-action pairs to compute the eigenvalue rank, and compare it to the corresponding singular-value rank in Figures A.20 and A.21.

Connection to Theorem 4.1. We computed the effective rank of Φ instead of S, since S is a theoretical abstraction that cannot be computed in practice: it depends on the Green's kernel (Duffy, 2015) obtained by assuming that the neural network behaves as a kernel regressor. Instead, we compare the different notions of rank for Φ, since Φ is the practical counterpart of the matrix S when using neural networks (as also indicated by the analysis in Section 4.2).
In fact, on the gridworld (Figure A.21), we experiment with a feature matrix Φ whose dimension equals the number of state-action pairs, i.e., dim(φ(s, a)) = |S||A|, with the same number of parameters as a kernel parameterization of the Q-function: Q(s, a) = Σ_{s', a'} w(s', a') k(s, a, s', a'). This can also be viewed as performing gradient descent on a "wide" linear network, and we measure the feature rank while observing similar rank trends. Since Theorem 4.1 does not require the assumption that S is normal to obtain a decreasing trend in srank_{δ,λ}(Φ), and since in practical scenarios (Figures A.20 and A.21) we find that srank_δ(Φ) ≈ srank_{δ,λ}(Φ) with an extremely similar qualitative trend, we believe that Theorem 4.1 explains the rank collapse observed in practice in deep Q-learning and is not vacuous.
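The claim that srank_δ(Φ) ≈ srank_{δ,λ}(Φ) is exact for normal matrices, where singular values equal eigenvalue norms. A small numerical check (illustrative code, not the paper's):

```python
import numpy as np

def srank_singular(m, delta=0.01):
    """Effective rank from singular values (srank_delta)."""
    s = np.linalg.svd(m, compute_uv=False)
    c = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(c, 1.0 - delta) + 1)

def srank_eigen(m, delta=0.01):
    """Effective rank from eigenvalue norms (srank_{delta,lambda});
    only defined for square matrices."""
    lam = np.sort(np.abs(np.linalg.eigvals(m)))[::-1]
    c = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(c, 1.0 - delta) + 1)

# For a normal matrix (here, symmetric), singular values equal eigenvalue
# norms, so the two notions of effective rank coincide.
rng = np.random.default_rng(0)
a = rng.normal(size=(32, 32))
sym = a + a.T  # symmetric, hence normal
```

For non-normal matrices the two quantities can differ, which is exactly why the text checks empirically that their trends agree for learned features Φ.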

A.8 NORMALIZED PLOTS FOR FIGURE 3/ FIGURE A.6

In this section, we provide a set of normalized srank and performance trends for Atari games (the corresponding unnormalized plots are found in Figure A.6). In these plots, each unit on the x-axis is equivalent to one gradient update; since n = 8 prescribes 8× as many updates as n = 1, it runs for 8× as long as n = 1. Note that the trend that effective rank decreases with larger n persists, in all but one game, even when the x-axis is rescaled to account for the number of gradient steps. This is expected: it tells us that performing bootstrapping-based updates in the data-efficient setting (larger n) still leads to a more aggressive rank drop, since updates are performed on a relatively more static dataset for larger values of n.

B HYPERPARAMETERS & EXPERIMENT DETAILS

B.1 ATARI EXPERIMENTS

We follow the experiment protocol of Agarwal et al. (2020) for all our experiments, including the hyperparameters and agent architectures provided in Dopamine, and report them for completeness and ease of reproducibility in Table B.1. We only perform hyperparameter selection for the regularization coefficient α_p, based on results from 5 Atari games (Asterix, Seaquest, Pong, Breakout and Seaquest). We will also open-source our code to further aid in reproducing our results.

Evaluation Protocol. Following Agarwal et al. (2020), the Atari environments used in our experiments are stochastic due to sticky actions, i.e., there is a 25% chance at every time step that the environment will execute the agent's previous action again, instead of the agent's new action. All agents (online or offline) are compared using the best evaluation score (averaged over 5 runs) achieved during training, where evaluation is done online every training iteration using an ε-greedy policy with ε = 0.001. We report offline training results with the same hyperparameters over 5 random seeds of the DQN replay data collection, game simulator, and network initialization.

Offline Dataset. As suggested by Agarwal et al. (2020), we randomly subsample the DQN Replay dataset, containing 50 million transitions, to create smaller offline datasets with the same data distribution as the original dataset. We use the 5% DQN replay dataset for most of our experiments. We also report results in the 20% dataset setting (4x larger) to show that our claims remain valid even with higher coverage of the state space.

Optimizer-related hyperparameters. For existing off-policy agents, the step size and optimizer were taken as published. We used the DQN (Adam) algorithm for all our experiments, given its superior performance over DQN (Nature), which uses RMSProp, as reported by Agarwal et al. (2020).

Atari 2600 games used.
For all our experiments in Section 3, we used the same set of 5 games utilized by Agarwal et al. (2020) and Bellemare et al. (2017) to present analytical results.

B.2 GRIDWORLD EXPERIMENTS

We use the gridworld suite from Fu et al. (2019) to obtain gridworlds for our experiments. All of our gridworld results are computed using the 16 × 16 GRID16SMOOTHOBS environment, which consists of a 256-cell grid, with walls arising randomly with probability 0.2. Each state allows 5 different actions (subject to hitting the boundary of the grid): move left, move right, move up, move down, and no-op. The goal in this environment is to minimize the cumulative discounted distance to a fixed goal location, where the discount factor is γ = 0.95. The features for the Q-function are given by randomly chosen vectors which are smoothed spatially over a local neighborhood of a grid cell (x, y). We use a deep Q-network with two hidden layers of size (64, 64) and train it using soft Q-learning with an entropy coefficient of 0.1, following the code provided by the authors of Fu et al. (2019). We use a first-in-first-out replay buffer of size 10000 to store past transitions.

C PROOFS FOR SECTION 4.1

In this section, we provide the technical proofs from Section 4.1. We first derive a solution to the optimization problem in Equation 1 and show that it indeed has the form described in Equation 2. We first introduce some notation, including the definition of the kernel G used in this proof. This proof closely follows Mobahi et al. (2020).

Definitions. For any universal kernel u, the Green's function (Duffy, 2015) of the linear kernel operator L, given by

[LQ](s, a) := Σ_{(s', a')} u((s, a), (s', a')) Q(s', a'),

is the function g((s, a), (s', a')) that satisfies

Σ_{(s', a')} u((s, a), (s', a')) g((s', a'), (s̄, ā)) = δ((s, a) − (s̄, ā)),   (C.1)

where δ is the Dirac-delta function.
Thus, the Green's function can be understood as a kernel that "inverts" the universal kernel u to the identity (Dirac-delta) matrix. We then define the matrix G as the matrix of the vectors g_{(s,a)} evaluated on the training dataset D; note, however, that the function g_{(s,a)} can also be evaluated at state-action tuples not present in D:

G((s_i, a_i), (s_j, a_j)) := g((s_i, a_i), (s_j, a_j))   and   g_{(s,a)}[i] = g((s, a), (s_i, a_i))   ∀ (s_i, a_i) ∈ D.   (C.2)

Lemma C.0.1. The solution to Equation 1 is given by Equation 2.

Proof. This proof closely follows the proof of Proposition 1 from Mobahi et al. (2020); we revisit its key parts here. We restate the optimization problem below and solve for the optimal Q_k by applying the functional derivative principle:

min_{Q ∈ Q} J(Q) := Σ_{(s_i, a_i) ∈ D} (Q(s_i, a_i) − y_k(s_i, a_i))² + c Σ_{(s,a)} Σ_{(s',a')} u((s, a), (s', a')) Q(s, a) Q(s', a').

The functional derivative principle states that the optimal Q_k for this problem satisfies, for any other function f and small enough ε → 0,

∀ f ∈ Q : [∂J(Q_k + εf) / ∂ε]_{ε=0} = 0.   (C.3)

Setting this derivative to 0, we obtain the following stationarity condition on Q_k (writing x_i := (s_i, a_i) for brevity):

Σ_{x_i ∈ D} δ(x − x_i)(Q_k(x_i) − y_k(x_i)) + c Σ_{x'} u(x, x') Q_k(x') = 0.   (C.4)

Invoking the definition of the Green's function above, and using the fact that the Dirac-delta function can be expressed in terms of the Green's function, we obtain a simplified version of this relation:

Σ_{x'} u(x, x') Σ_{x_i ∈ D} (Q_k(x_i) − y_k(x_i)) g(x', x_i) = −c Σ_{x'} u(x, x') Q_k(x').   (C.5)

Since the kernel u(x, x') is universal and positive definite, the optimal solution Q_k(x) is given by

Q_k(s, a) = −(1/c) Σ_{(s_i, a_i) ∈ D} (Q_k(s_i, a_i) − y_k(s_i, a_i)) · g((s, a), (s_i, a_i)).   (C.6)
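The kernel-ridge structure of this solution can be sanity-checked numerically. In the special case where Q is evaluated only on the training points, the objective reduces to J(Q) = ||Q − y||² + c Q^T U Q with U the kernel Gram matrix, whose minimizer (I + cU)^{−1} y coincides algebraically with the Green's-function form G(cI + G)^{−1} y, where G = U^{−1}. A minimal numpy sketch under these assumptions (all variable names illustrative, not from the paper's code):

```python
import numpy as np

# Finite-dimensional check of the Green's-function solution: on the training
# set alone, J(Q) = ||Q - y||^2 + c * Q^T U Q has minimizer (I + cU)^{-1} y,
# and the Green's matrix G = U^{-1} gives the equivalent form G(cI + G)^{-1} y.
rng = np.random.default_rng(0)
x = 2.0 * rng.normal(size=(10, 3))                # training "state-action" points
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
u_mat = np.exp(-0.5 * sq)                         # RBF (universal) kernel Gram matrix
y = rng.normal(size=10)                           # regression targets y_k
c = 0.1

q_direct = np.linalg.solve(np.eye(10) + c * u_mat, y)         # argmin of J(Q)
g_mat = np.linalg.inv(u_mat)                                  # Green's matrix G = U^{-1}
q_green = g_mat @ np.linalg.solve(c * np.eye(10) + g_mat, y)  # G (cI + G)^{-1} y
```

Both routes agree because G(cI + G)^{−1} = U^{−1}·U(cU + I)^{−1} = (I + cU)^{−1}, and the solution satisfies the stationarity condition (Q − y) + cUQ = 0.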
Finally, we can replace the expression for the residual error Q_k(s_i, a_i) − y_k(s_i, a_i) using the Green's kernel on the training data, by solving for it in closed form, which yields the solution in Equation 2:

Q_k(s, a) = −(1/c) g_{(s,a)}^T (Q_k − y_k) = g_{(s,a)}^T (cI + G)^{−1} y_k.   (C.7)

Next, we state and prove a slightly stronger version of Theorem 4.1 that immediately implies the original theorem.

Theorem C.1. Let S be shorthand for S = γP^π A and assume S is a normal matrix. Then there exists an infinite, strictly increasing sequence of fitting iterations (k_l)_{l=1}^∞, starting from k_1 = 0, such that for any two singular values σ_i(S) and σ_j(S) of S with σ_i(S) ≤ σ_j(S), and for all l, l' ∈ N with l' > l,

σ_i(M_{k_{l'}}) / σ_j(M_{k_{l'}}) < σ_i(M_{k_l}) / σ_j(M_{k_l}) ≤ σ_i(S) / σ_j(S).   (C.8)

Therefore, the effective rank of M_k satisfies srank_δ(M_{k_{l'}}) ≤ srank_δ(M_{k_l}). Furthermore, for all l ∈ N and t ≥ k_l,

σ_i(M_t) / σ_j(M_t) < σ_i(M_{k_l}) / σ_j(M_{k_l}) + O((σ_i(S) / σ_j(S))^{k_l}).   (C.9)

Therefore, the effective rank srank_δ(M_t) outside the chosen subsequence is also controlled from above by the effective rank on the subsequence, (srank_δ(M_{k_l}))_{l=1}^∞.

To prove this theorem, we first show that for any two fitting iterations t < t', if S^t and S^{t'} are positive semi-definite, then the ratio of singular values and the effective rank decrease from t to t'. As an immediate consequence, when S is positive semi-definite, the effective rank decreases at every iteration, i.e., we may set k_l = l (Corollary C.1.1). To extend the proof to arbitrary normal matrices, we show that for any S, a sequence of fitting iterations (k_l)_{l=1}^∞ can be chosen such that S^{k_l} is (approximately) positive semi-definite. For this subsequence of fitting iterations, the ratio of singular values and the effective rank also decrease.
Finally, to control the ratio and effective rank at fitting iterations t outside this subsequence, we construct an upper bound f(t) on the ratio, σ_i(M_t)/σ_j(M_t) < f(t), and relate this bound to the ratio of singular values on the chosen subsequence.

Lemma C.1.1 (srank_δ(M_k) decreases when S^k is PSD). Let S be shorthand for S = γP^π A and assume S is a normal matrix. Choose any t, t' ∈ N with t < t'. If S^t and S^{t'} are positive semi-definite, then for any two singular values σ_i(S) and σ_j(S) of S with 0 < σ_i(S) < σ_j(S),

σ_i(M_{t'}) / σ_j(M_{t'}) < σ_i(M_t) / σ_j(M_t) ≤ σ_i(S) / σ_j(S).   (C.10)

Hence, the effective rank decreases from t to t': srank_δ(M_{t'}) ≤ srank_δ(M_t).

Proof. First note that M_k is given by

M_k := Σ_{i=1}^{k} γ^{k−i} (P^π A)^{k−i} = Σ_{i=1}^{k} S^{k−i}.   (C.11)

From here on, we omit the leading γ term, since it is a constant scaling factor that affects neither the ratio nor the effective rank. Almost every matrix S admits a complex orthogonal eigendecomposition, so we can write S := U λ(S) U^H. Any power S^i can then be expressed as S^i = U λ(S)^i U^H, and hence

M_k := U (Σ_{i=0}^{k−1} λ(S)^i) U^H = U · diag((1 − λ(S)^k) / (1 − λ(S))) · U^H.   (C.12)

Since S is normal, its eigenvalues and singular values are related as σ_k(S) = |λ_k(S)|. This also means that M_k is normal, so σ_i(M_k) = |λ_i(M_k)|. Thus, the singular values of M_k can be expressed as

σ_i(M_k) := |1 − λ_i(S)^k| / |1 − λ_i(S)|.   (C.13)

When S^k is positive semi-definite, λ_i(S)^k = σ_i(S)^k, enabling the simplification

σ_i(M_k) = |1 − σ_i(S)^k| / |1 − λ_i(S)|.   (C.14)

To show that the ratio of singular values decreases from t to t', we need to show that f(σ) = |1 − σ^{t'}| / |1 − σ^t| is an increasing function of σ when t' > t. This can be verified directly, which implies the desired result.
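The closed form above can be checked numerically for a PSD S with spectrum inside (0, 1): the ratio of the smallest to the largest singular value of M_k strictly decreases with k, as the lemma asserts. A small sketch with an illustrative spectrum (not taken from any experiment in the paper):

```python
import numpy as np

# Numerical illustration of Lemma C.1.1 / Corollary C.1.1: for PSD S with
# singular values sigma in (0, 1), the singular values of
# M_k = sum_{i=1}^{k} S^{k-i} are (1 - sigma^k) / (1 - sigma), and the ratio
# sigma_i(M_k) / sigma_j(M_k) for sigma_i < sigma_j strictly decreases in k.
sigmas = np.array([0.9, 0.5, 0.2])   # illustrative spectrum of a PSD S

def m_k_singular_values(sigmas, k):
    # Closed form specialized to PSD S (all eigenvalues real, non-negative).
    return (1.0 - sigmas ** k) / (1.0 - sigmas)

# Ratio of the smallest to the largest singular value of M_k, over k.
ratios = [m_k_singular_values(sigmas, k)[2] / m_k_singular_values(sigmas, k)[0]
          for k in range(1, 30)]
```

At k = 1, M_1 = S^0 = I, so the ratio starts at 1 and then decays monotonically, shrinking the tail of the spectrum and hence srank_δ(M_k).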
To further show that srank_δ(M_{t'}) ≤ srank_δ(M_t), it suffices to show that for all i ∈ [1, …, n], the quantity h_k(i) := (Σ_{j=1}^{i} σ_j(M_k)) / (Σ_{j=1}^{n} σ_j(M_k)) increases with k; this implies that srank_δ(M_k) cannot increase from k = t to k = t'. We can decompose h_k(i) as

h_k(i) = (Σ_{j=1}^{i} σ_j(M_k)) / (Σ_{l} σ_l(M_k)) = 1 / (1 + (Σ_{j=i+1}^{n} σ_j(M_k)) / (Σ_{j=1}^{i} σ_j(M_k))).   (C.15)

Since σ_j(M_k)/σ_l(M_k) decreases with k for all j, l with σ_j(S) ≤ σ_l(S), the ratio in the denominator of h_k(i) decreases with increasing k, implying that h_k(i) increases from t to t'.

Corollary C.1.1 (srank_δ(M_k) decreases for PSD S matrices). Let S be shorthand for S = γP^π A and assume S is positive semi-definite. Then for any k, t ∈ N with t > k, and for any two singular values σ_i(S) < σ_j(S),

σ_i(M_t) / σ_j(M_t) < σ_i(M_k) / σ_j(M_k) ≤ σ_i(S) / σ_j(S).   (C.16)

Hence, the effective rank decreases with more fitting iterations: srank_δ(M_t) ≤ srank_δ(M_k).

To extend the result to arbitrary normal matrices, we must construct a subsequence of fitting iterations (k_l)_{l=1}^∞ for which S^{k_l} is (approximately) positive semi-definite. To do so, we first prove a technical lemma showing that rational numbers are "dense" in the real numbers in a quantitative sense.

Lemma C.1.2 (Rational numbers are dense in the real space). For any real number α, there exist infinitely many rational numbers p/q (with integers p, q ∈ Z) such that α is approximated by p/q up to accuracy 1/q²:

|α − p/q| ≤ 1/q².   (C.17)

Proof. We first use Dirichlet's approximation theorem (see Hlawka et al. (1991) for a proof via a pigeonhole argument, and extensions) to obtain that for any real number α and any N ≥ 1, there exist integers p and q with 1 ≤ q ≤ N such that

|qα − p| ≤ 1/(N + 1) < 1/N.   (C.18)

Now, since q ≥ 1 > 0, dividing both sides by q gives

|α − p/q| ≤ 1/(Nq) ≤ 1/q².   (C.19)
To obtain infinitely many choices of p/q, observe that Dirichlet's theorem gives a nontrivial bound only for values of N satisfying N ≤ 1/|qα − p|. Thus, choose a new N ≥ N_max, where

N_max = max { 1/|q'α − p'| : p', q' ∈ Z, 1 ≤ q' ≤ q }.   (C.20)

Equation C.20 picks a new value of N such that the current choices of p and q, which were valid for the first value of N, no longer satisfy the approximation error bound. Applying Dirichlet's theorem to this new value of N hence gives a new pair (p, q) satisfying the 1/q² approximation bound. Repeating this process yields countably many pairs (p, q) satisfying the bound. As a result, rational numbers are dense in the real numbers in this quantitative sense: for any accuracy 1/q², we can find at least one rational number p/q that is closer to α than 1/q². This proof is based on Johnson (2016).

Recall from Equation C.13 that

σ_i(M_k) := |1 − λ_i(S)^k| / |1 − λ_i(S)|.   (C.21)

Bound on the singular value ratio. The ratio between σ_i(M_k) and σ_j(M_k) can be expressed as

σ_i(M_k) / σ_j(M_k) = (|1 − λ_i(S)^k| / |1 − λ_j(S)^k|) · (|1 − λ_j(S)| / |1 − λ_i(S)|).   (C.22)

For a normal matrix S, σ_i(S) = |λ_i(S)|, so this ratio can be bounded above as

σ_i(M_k) / σ_j(M_k) ≤ ((1 + σ_i(S)^k) / |1 − σ_j(S)^k|) · (|1 − λ_j(S)| / |1 − λ_i(S)|).   (C.23)

Defining f(k) to be the right-hand side of this equation, one can verify that f is monotonically decreasing in k when σ_i < σ_j. This shows that the ratio of singular values is bounded above and, in general, must decrease toward the limit lim_{k→∞} f(k).

Construction of the subsequence. We now show that there exists a subsequence (k_l)_{l=1}^∞ for which S^{k_l} is approximately positive semi-definite. For ease of notation, write the i-th eigenvalue as λ_i(S) = |λ_i(S)| · e^{iθ_i}, where θ_i > 0 is the polar angle of the complex number λ_i(S) and |λ_i(S)| is its magnitude (norm).
Now, using Lemma C.1.2, we can approximate any polar angle θ_i by a rational number; applying the lemma to θ_i/2π,

∃ p_i, q_i ∈ N such that |θ_i/2π − p_i/q_i| ≤ 1/q_i².   (C.24)

Since the choice of q_i is within our control, we can estimate θ_i for all eigenvalues λ_i(S) to arbitrarily fine accuracy; hence we approximate θ_i ≈ 2π p_i/q_i. We use this approximation to construct an infinite sequence (k_l)_{l=1}^∞:

k_l = l · LCM(q_1, …, q_n)   ∀ l ∈ N,   (C.25)

where LCM denotes the least common multiple of the natural numbers q_1, …, q_n. In the absence of any approximation error in θ_i, for any i and any l ∈ N, λ_i(S)^{k_l} = |λ_i(S)|^{k_l} · exp(2iπ · (p_i/q_i) · k_l) = |λ_i(S)|^{k_l}, since the polar angle for any k_l is a multiple of 2π and hence falls on the positive real line. As a result, S^{k_l} is positive semi-definite, since all of its eigenvalues are real and non-negative. By the proof of Lemma C.1.1, the ratio of the i-th and j-th singular values is therefore decreasing over the sequence of iterations (k_l)_{l=1}^∞. Since the approximation error in θ_i can be made small enough to prevent any increase in the value of srank_δ (this is possible because srank_δ takes discrete values), the argument applies even with the approximation, proving the required result on the subsequence.
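The subsequence construction can be illustrated with a concrete normal matrix. For a scaled 90° rotation, the eigenvalue angles satisfy θ/2π = 1/4 (so q = 4): powers along k_l = 4l land on the positive real axis and are PSD, while intermediate powers are not. An illustrative check:

```python
import numpy as np

# S is 0.9 times a 90-degree rotation: a normal matrix with complex
# eigenvalues 0.9 * exp(+-i*pi/2). Along the subsequence k_l = 4l the polar
# angles are multiples of 2*pi, so S^{k_l} is a non-negative multiple of the
# identity (hence PSD); S^2 has negative eigenvalues and is not PSD.
rot = np.array([[0.0, -1.0],
                [1.0,  0.0]])
s_mat = 0.9 * rot

s4 = np.linalg.matrix_power(s_mat, 4)   # k_1 = 4: 0.9^4 * I
s8 = np.linalg.matrix_power(s_mat, 8)   # k_2 = 8: 0.9^8 * I
s2 = np.linalg.matrix_power(s_mat, 2)   # off-subsequence: -0.81 * I, not PSD
```

This is the mechanism exploited in the proof: a rational approximation of each eigenvalue's angle yields a common period (the LCM of the q_i) along which all eigenvalues become real and non-negative.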

Controlling All Fitting Iterations using Subsequence:

We now relate the ratio of singular values within the chosen subsequence to the ratio of singular values elsewhere. Choose t, l ∈ N such that t > k_l. Earlier in this proof, we showed that the ratio between singular values is bounded above by a monotonically decreasing function f(t), so

σ_i(M_t) / σ_j(M_t) ≤ f(t) < f(k_l).   (C.26)

Next, we show that f(k_l) is in fact very close to the ratio of singular values:

f(k_l) = ((1 + σ_i(S)^{k_l}) / |1 − σ_j(S)^{k_l}|) · (|1 − λ_j(S)| / |1 − λ_i(S)|) ≤ σ_i(M_{k_l}) / σ_j(M_{k_l}) + (2σ_i(S)^{k_l} / |1 − σ_j(S)^{k_l}|) · (|1 − λ_j(S)| / |1 − λ_i(S)|).   (C.27)

The second term goes to zero as k_l increases; algebraic manipulation shows that this gap can be bounded as

f(k_l) ≤ σ_i(M_{k_l}) / σ_j(M_{k_l}) + (σ_i(S) / σ_j(S))^{k_l} · [(2σ_j(S) / |1 − σ_j(S)|) · (|1 − λ_j(S)| / |1 − λ_i(S)|)],   (C.28)

where the bracketed factor is a constant independent of k_l.

D PROOFS FOR SECTION 4.2

Notation. Let W_φ(k, t) = W_{N−1}(k, t) ⋯ W_1(k, t) denote the product of the first N − 1 weight matrices; W_φ(k, t) is the matrix that maps an input [s; a] to the corresponding features Φ(s, a). In our analysis, it suffices to consider the effective rank of W_φ(k, t), since the features Φ are given by Φ(k, t) = W_φ(k, t)[S; A], which implies

rank(Φ(k, t)) = rank(W_φ(k, t)[S; A]) ≤ min(rank(W_φ(k, t)), rank([S; A])).

Assuming the state-action inputs have full rank, we are only concerned with rank(W_φ(k, t)), which justifies analyzing srank_δ(W_φ(k, t)). Let L_{k+1}(W_{N:1}(k, t)) denote the mean squared Bellman error objective in the k-th fitting iteration:

L_{k+1}(W_{N:1}(k, t)) = Σ_{i=1}^{|D|} (W_N(k, t) W_φ(k, t)[s_i; a_i] − y_k(s_i, a_i))²,   where y_k = R + γP^π Q_k.

When gradient descent is used to update the weight matrices, the updates to W_j(k, t) are given by

W_j(k, t + 1) ← W_j(k, t) − η ∂L_{k+1}(W_{N:1}(k, t)) / ∂W_j(k, t).

If the learning rate η is small, we can approximate this discrete-time process with a continuous-time differential equation, which we use for our analysis. We write Ẇ(k, t) for the derivative of W(k, t) with respect to t, for a given k.
Ẇ_j(k, t) = −η ∂L_{k+1}(W_{N:1}(k, t)) / ∂W_j(k, t).   (D.2)

To quantify the evolution of the singular values of the feature matrix W_φ(k, t), we first describe the evolution of W_φ(k, t) itself via a more interpretable differential equation. To do so, we make an assumption similar to, but not identical to, that of Arora et al. (2018): all weight matrices except the last are "balanced" at initialization (t = 0, k = 0), i.e.,

∀ j ∈ [1, …, N − 2] : W_{j+1}(0, 0)^T W_{j+1}(0, 0) = W_j(0, 0) W_j(0, 0)^T.   (D.3)

Note the distinction from Arora et al. (2018): the last layer is not assumed to be balanced. As a result, we may not be able to characterize the learning dynamics of the end-to-end weight matrix, but we avoid the vacuous case where all the weight matrices are rank 1. We are now ready to derive the evolution of the feature matrix W_φ(k, t).

Lemma D.0.1 (Adaptation of balanced weights (Arora et al., 2018) across FQI iterations). Assume the weight matrices evolve according to Equation D.2, with respect to L_k, for all fitting iterations k, and assume balanced initialization for the first N − 1 layers, i.e., W_{j+1}(0, 0)^T W_{j+1}(0, 0) = W_j(0, 0) W_j(0, 0)^T for all j ∈ [1, …, N − 2]. Then the weights remain balanced throughout, i.e., for all k, t:

W_{j+1}(k, t)^T W_{j+1}(k, t) = W_j(k, t) W_j(k, t)^T,   ∀ j ∈ [1, …, N − 2].   (D.4)

Proof. First consider the special case k = 0. To begin with, to show that the weights remain balanced throughout training in the k = 0 iteration, we follow the proof technique of Arora et al. (2018), with some modifications. First note that the gradient ∂L_{k+1}(W_{N:1}(k, t)) / ∂W_j(k, t) can be expressed as

∂L_{k+1}(W_{N:1}(k, t)) / ∂W_j(k, t) = (Π_{i=j+1}^{N} W_i^T) · (dL_{k+1}(W_{N:1}) / dW_{N:1}) · (Π_{i=1}^{j−1} W_i^T) = W_{j+1}^T W_{j+2}^T ⋯ W_N^T · (dL_{k+1}(W_{N:1}) / dW_{N:1}) · W_1^T ⋯ W_{j−1}^T.
Now, since the weight matrices evolve according to Equation D.2, multiplying the differential equation for W_j by W_j(k, t)^T on the right, multiplying the equation for W_{j+1} by W_{j+1}(k, t)^T on the left, and comparing the two, we obtain

∀ j ∈ [1, …, N − 2] : W_{j+1}(0, t)^T Ẇ_{j+1}(0, t) = Ẇ_j(0, t) W_j(0, t)^T.   (D.5)

Taking the transpose of this equation and adding it to itself yields an easily integrable expression:

d(W_{j+1}(0, t)^T W_{j+1}(0, t)) / dt = W_{j+1}(0, t)^T Ẇ_{j+1}(0, t) + Ẇ_{j+1}(0, t)^T W_{j+1}(0, t) = Ẇ_j(0, t) W_j(0, t)^T + W_j(0, t) Ẇ_j(0, t)^T = d(W_j(0, t) W_j(0, t)^T) / dt.   (D.6)

Since the balancedness condition holds at the initial time step 0, and the derivatives of the two quantities are equal, their integrals are also equal; hence we obtain

W_{j+1}(0, t)^T W_{j+1}(0, t) = W_j(0, t) W_j(0, t)^T.   (D.7)

Since the weights after T steps of fitting iteration k = 0 are still balanced, the initialization for k = 1 is balanced. Note that the balancedness property does not depend on which objective's gradient is used to optimize the weights, as long as W_j and W_{j+1} use the gradient of the same loss function. Proceeding inductively, the weights remain balanced across all fitting iterations k and at all steps t within each fitting iteration. Thus, we have shown the result in Equation D.4.

Our next result derives the evolution of the feature matrix under the balancedness condition. We show that W_φ(k, t) evolves according to a similar, but distinct, differential equation as the end-to-end weight matrix W_{N:1}(k, t), which still allows us to appeal to techniques and results from Arora et al. (2019) to study the singular value evolution and hence the effective rank of the matrix W_φ(k, t).

Lemma D.0.2 (Extension of Theorem 1 from Arora et al. (2018)).
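The conservation law behind Lemma D.0.1 (Equation D.5) can be verified exactly for a toy two-layer linear network, since the gradient identity W_{j+1}^T Ẇ_{j+1} = Ẇ_j W_j^T holds term-by-term under gradient flow. An illustrative numpy check with arbitrary shapes and data:

```python
import numpy as np

# Check the balancedness invariant for L(W2, W1) = ||W2 W1 x - y||^2:
# under gradient flow dWj/dt = -eta * dL/dWj, the identity
# W2^T (dW2/dt) = (dW1/dt) W1^T holds exactly, which is what makes the
# balancedness condition (D.4) conserved during training.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=(2, 4))
x = rng.normal(size=(3, 5))
y = rng.normal(size=(2, 5))

r = w2 @ w1 @ x - y            # residual of the end-to-end map
grad_w1 = w2.T @ r @ x.T       # dL/dW1 (up to a constant factor of 2)
grad_w2 = r @ (w1 @ x).T       # dL/dW2 (same constant factor)

lhs = w2.T @ (-grad_w2)        # W2^T * dW2/dt (eta = 1 cancels in the identity)
rhs = (-grad_w1) @ w1.T        # dW1/dt * W1^T
```

Both sides equal −W2^T r x^T W1^T, so the identity is exact, not approximate; the discrete-time gradient-descent version only conserves balancedness up to O(η²) terms.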
Under the conditions specified in Lemma D.0.1, the feature matrix W_φ(k, t) evolves as per the following continuous-time differential equation, for all fitting iterations k:

$$\dot{W}_\varphi(k, t) = -\eta \sum_{j=1}^{N-1} \left[ W_\varphi(k, t)\, W_\varphi(k, t)^T \right]^{\frac{N-j}{N-1}} \cdot W_N(k, t)^T\, \frac{dL_k(W_{N:1}(k, t))}{dW_{N:1}} \cdot \left[ W_\varphi(k, t)^T W_\varphi(k, t) \right]^{\frac{j-1}{N-1}}.$$

Proof. To prove this statement, we utilize the fact that the first N−1 layers remain balanced throughout training. Consider the singular value decomposition of any weight matrix W_j (unless otherwise stated, we write W_j for W_j(k, t) in this section, for ease of notation): W_j = U_j Σ_j V_j^T. The balancedness condition, rewritten using the SVDs of the weight matrices, is equivalent to:

$$V_{j+1}\, \Sigma_{j+1}^T \Sigma_{j+1}\, V_{j+1}^T = U_j\, \Sigma_j \Sigma_j^T\, U_j^T. \qquad (D.8)$$

Thus, for all j for which the balancedness condition holds, it must be that Σ_{j+1}^T Σ_{j+1} = Σ_j Σ_j^T, since both sides of Equation D.8 are eigendecompositions of the same matrix. As a result, the weight matrices W_j and W_{j+1} share the same set of singular values, which we write as ρ_1 ≥ ρ_2 ≥ ⋯ ≥ ρ_m. The ordering of the eigenvalues can differ, and the matrices U and V can differ (being rotations of one another), but the values taken by the singular values are the same. Note the distinction from Arora et al. (2018), who apply balancedness to all matrices; in our case that would trivially give a rank-1 matrix. This implies that we can express each weight matrix in terms of the common singular values (ρ_1, ρ_2, …, ρ_m), for example as W_j(k, t) = U_j Diag(√ρ_1, …, √ρ_m) V_j^T, with U_j = V_{j+1} O_j, where O_j is an orthonormal matrix. Using this relationship, we can say the following:

$$\left( \prod_{i=j}^{N-1} W_i(k, t) \right) \left( \prod_{i=j}^{N-1} W_i(k, t) \right)^{\!T} = \left[ W_\varphi(k, t)\, W_\varphi(k, t)^T \right]^{\frac{N-j}{N-1}}, \qquad \left( \prod_{i=1}^{j} W_i(k, t) \right)^{\!T} \left( \prod_{i=1}^{j} W_i(k, t) \right) = \left[ W_\varphi(k, t)^T W_\varphi(k, t) \right]^{\frac{j}{N-1}}.$$
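The intermediate claim above — a balanced pair of consecutive layers shares the same set of singular values — is easy to verify numerically. The dimensions and seed below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
Wj = rng.standard_normal((d, d))

# Construct W_{j+1} satisfying the balancedness condition
# W_{j+1}^T W_{j+1} = W_j W_j^T via a symmetric square root
lam, Q = np.linalg.eigh(Wj @ Wj.T)
S = Q @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ Q.T
O, _ = np.linalg.qr(rng.standard_normal((d, d)))   # arbitrary rotation O_j
Wj1 = O @ S

sv_j = np.sort(np.linalg.svd(Wj, compute_uv=False))
sv_j1 = np.sort(np.linalg.svd(Wj1, compute_uv=False))
sv_gap = np.max(np.abs(sv_j - sv_j1))   # the two spectra coincide
```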
Now, we can use these expressions to obtain the desired result: take the differential equations governing the evolution of W_i(k, t) for i ∈ {1, …, N−1}, multiply the i-th equation by ∏_{j=i+1}^{N−1} W_j(k, t) from the left and by ∏_{j=1}^{i−1} W_j(k, t) from the right, and sum over i:

$$\dot{W}_\varphi(k, t) = \sum_{i=1}^{N-1} \left( \prod_{j=i+1}^{N-1} W_j(k, t) \right) \dot{W}_i(k, t) \left( \prod_{j=1}^{i-1} W_j(k, t) \right) = -\eta \sum_{i=1}^{N-1} \left( \prod_{j=i+1}^{N-1} W_j \right) \left( \prod_{j=i+1}^{N} W_j \right)^{\!T} \frac{dL_k(W_{N:1})}{dW_{N:1}} \left( \prod_{j=1}^{i-1} W_j \right)^{\!T} \left( \prod_{j=1}^{i-1} W_j \right).$$

The above equation simplifies to the desired result by taking W_N(k, t) out of the first product, and applying the identities above to each of the terms.

Comparing the previous result with Theorem 1 of Arora et al. (2018), we note that the resulting differential equation holds for arbitrary representations or features in the network, provided that the layers from the input up to the feature layer are balanced. A direct application of Arora et al. (2018) would restrict the model class to fully balanced networks for the convergence analysis, and the resulting solutions for the feature matrix would then have only one active singular value, leading to less expressive neural network configurations.

Proof of Proposition 4.1. Finally, we are ready to use Lemma D.0.2 to derive the evolution of the singular values. The proof follows from a direct application of Theorem 3 in Arora et al. (2019): since the feature matrix W_φ(k, t) satisfies a differential equation very similar to that of the end-to-end matrix, with the exception that the gradient of the loss with respect to the end-to-end matrix is pre-multiplied by W_N(k, t)^T, we can directly invoke the result of Arora et al. (2019), and hence we have, for all r ∈ {1, …, dim(W_φ)}:

$$\dot{\sigma}_r(k, t) = -N \cdot \sigma_r(k, t)^{2 - \frac{2}{N}} \cdot \left\langle W_N(k, t)^T\, \frac{dL_k(W_{N:1}(k, t))}{dW_{N:1}},\; u_r(k, t)\, v_r(k, t)^T \right\rangle, \qquad (D.9)$$
Further, as another consequence of the result describing the evolution of the weight matrices, we can also obtain a result similar to Arora et al. (2019) stating that the goal of the gradient update on the singular vectors U(k, t) and V(k, t) of the features W_φ(k, t) = U(k, t) S(k, t) V(k, t)^T is to align these spaces with those of W_N(k, t)^T dL_k(W_{N:1}(k, t))/dW_{N:1}.

In this section, we discuss why the evolution of singular values described in Equation 4 indicates a decrease in the rank of the feature matrix within a fitting iteration k. To see this, consider the case when gradient descent has been run long enough (i.e., the data-efficient RL case) for the singular vectors to stabilize, and consider the evolution of the singular values past a timestep t ≥ t_0 in training. When the singular vectors stabilize, the matrix with entries u_r(k, t)^T W_N(k, t)^T [dL_k(W_{N:1}(k, t))/dW_{N:1}] v_{r'}(k, t) is diagonal (extending Corollary 1 of Arora et al. (2019)). Thus, we can assume that

$$u_r(k, t)^T\, W_N(k, t)^T\, \frac{dL_k(W_{N:1}(k, t))}{dW_{N:1}}\, v_r(k, t) = f(k, t) \cdot e_r \cdot d_r,$$

where e_r corresponds to the unit basis vector of the standard orthonormal basis, f is shorthand for the gradient norm of the loss function pre-multiplied by the transpose of the last-layer weight matrix, and d_r denotes the r-th singular value of the state-action input matrix [S; A]. In this case, we can rewrite Equation 4 as:

$$\dot{\sigma}_r(k, t) = -N\, \sigma_r(k, t)^{2 - \frac{2}{N}} \cdot f(k, t) \cdot e_r \cdot d_r. \qquad (D.10)$$

Note again that, unlike in Arora et al. (2019), the expression for f(k, t) differs from the gradient of the loss, since the weight matrix W_N(k, t) multiplies the gradient in our case. Since the expression for f(k, t) is shared across all singular values, the differential equations for different singular values σ_r(k, t) and σ_{r'}(k, t) differ only through the factors e_r d_r and e_{r'} d_{r'}, and can therefore be compared directly.
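The qualitative behavior of Equation D.10 — larger singular values growing disproportionately faster — can be seen by Euler-integrating the ODE. The constants below are illustrative assumptions, and the driving term −f·e_r·d_r is folded into a single positive coefficient g shared across r:

```python
import numpy as np

N_layers = 4                       # network depth N (illustrative)
g = 0.05                           # stands in for |f(k, t) * e_r * d_r|, equal across r
dt, steps = 0.01, 500
sig = np.array([1.0, 0.8, 0.6])    # initial singular values sigma_r(k, 0)

ratios = []                        # track sigma_2 / sigma_1 over time
for _ in range(steps):
    # Euler step for sigma_r' = N * g * sigma_r^(2 - 2/N)
    sig = sig + dt * N_layers * g * sig ** (2.0 - 2.0 / N_layers)
    ratios.append(sig[1] / sig[0])

# d log(sigma_r)/dt grows with sigma_r, so the smaller/larger ratio shrinks,
# which is exactly what drives srank_delta down
monotone = all(b < a for a, b in zip(ratios, ratios[1:]))
```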
Integrating, we can show that, depending on the ratio (e_r d_r)/(e_{r'} d_{r'}) and on the value of N, the singular values σ_r(k, t) and σ_{r'}(k, t) take on magnitudes of different orders; in particular, they grow at disproportionate rates. By then appealing to the result of Proposition 4.1 in Section 4.1 (the kernel regression argument), we can say that disproportionately growing singular values imply that the value of srank_δ(W_φ(k, t)) decreases within each iteration k, as we discuss next.

Interpretation of the abstract rank-regularized objective. We now discuss the form of the abstract rank-regularized objective in Equation 5 that we utilize in Section 4.2. Intuitively, the justification for Equation 5 is that larger singular values grow much faster than smaller singular values, which dictates how the ratio of singular values evolves through time, thereby reducing the effective rank. The discussion in the previous subsection shows that the effective rank srank_δ(W_φ(k, t)) decreases within each fitting iteration (i.e., decreases over the variable t). Now, we will argue that, for a given value of the effective rank at t = 0 of a fitting iteration k (denoted srank_δ(W_φ(k, 0))), the effective rank after T steps of gradient descent, srank_δ(W_φ(k, T)), is constrained to equal some constant ε_k < srank_δ(W_φ(k, 0)) due to the implicit regularization effect of gradient descent. To see this, note that if σ_i(k, 0) > σ_j(k, 0), then σ_i(k, t) increases at a much faster rate than σ_j(k, t) with increasing t (see Equation D.10 and Arora et al. (2019), Equation 11), and hence the value of h_{k,t}(i) increases over t:

$$h_{k,t}(i) = \frac{\sum_{j=1}^{i} \sigma_j(W_\varphi(k, t))}{\sum_{l} \sigma_l(W_\varphi(k, t))} = \frac{1}{1 + \frac{\sum_{j=i+1}^{n} \sigma_j(W_\varphi(k, t))}{\sum_{j=1}^{i} \sigma_j(W_\varphi(k, t))}} \;\geq\; \frac{1}{1 + \frac{\sum_{j=i+1}^{n} \sigma_j(W_\varphi(k, t))}{\sigma_1(W_\varphi(k, t))}}, \qquad (D.11)$$

as the ratio between the larger and smaller singular values increases.
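The cumulative-spectrum function h_{k,t}(i) and the induced effective rank srank_δ can be computed directly from a vector of singular values; a minimal sketch follows (the spectra are synthetic, purely for illustration):

```python
import numpy as np

def srank(singular_values, delta=0.01):
    """srank_delta: smallest j such that the top-j singular values
    account for at least a (1 - delta) fraction of the total mass."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    h = np.cumsum(s) / np.sum(s)          # h(i) as in Equation D.11
    return int(np.argmax(h >= 1.0 - delta)) + 1

flat = np.ones(10)                                     # evenly spread spectrum
concentrated = np.array([100.0, 50.0] + [0.01] * 8)    # mass on two directions

r_flat = srank(flat)          # full effective rank
r_conc = srank(concentrated)  # effectively rank 2
```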
Using Equation D.11, we note that the value of h_{k,t}(i) approaches 1 as the ratios σ_j/σ_1 decrease (in the limit, these ratios are zero). An increase in the value of h_{k,t}(i) implies a decrease in srank_δ, which can be expressed as:

$$\mathrm{srank}_\delta(W_\varphi(k, t)) = \min_j \left\{ j : h_{k,t}(j) \geq 1 - \delta \right\}.$$

Thus, by running gradient descent for T_0 = T(ε_k) steps (a function of ε_k) within a fitting iteration k, the effective rank of the resulting W_φ(k, T_0) satisfies srank_δ(W_φ(k, T_0)) = ε_k. Now, since the solution W_φ(k, T_0) satisfies the constraint srank_δ(W_φ(k, T_0)) = ε_k < srank_δ(W_φ(k, 0)), we can express this solution as the solution of a penalized optimization problem (akin to penalty methods) that penalizes srank_δ(W_φ) with a suitable coefficient λ_k, which is a function of ε_k. We take λ_k > 0 since, by running gradient descent long enough (i.e., for T_0 steps in the above discussion), we can reduce srank_δ(W_φ) below its initial value (note that ε_k < srank_δ(W_φ(k, 0))), indicating a non-zero amount of regularization.

In this section, we illustrate that the rank-drop effect compounds due to bootstrapping, and prove Proposition 4.2 formally. Our goal is to demonstrate the compounding effect: a change in the rank at a given iteration gets propagated into the next iteration. We first illustrate this compounding effect and then discuss Theorem 4.2 as a special case of this phenomenon.

Assumptions. For this analysis, we require two assumptions, pertaining to closure of the function class and to the change in effective rank due to the reward and dynamics transformations. We assume that the following hold for any fitting iteration k of the algorithm.

Assumption D.1 (Closure). The chosen function class {W_N W_{N−1} ⋯ W_1} is closed under the Bellman evaluation backup for policy π.
That is, if Q_{k−1}^T = W_N(k−1) W_φ(k−1) [S; A]^T, then there exists an assignment of weights in the deep linear network such that the corresponding target value R + γP^π Q_{k−1} can be written as (R + γP^π Q_{k−1})^T = W_N^P(k) W_φ^P(k) [S; A]^T. This assumption is commonly used in analyses of approximate dynamic programming with function approximation.

Assumption D.2 (Change in srank_δ). For every fitting iteration k, we assume that the difference between srank_δ(W_φ^P(k)) — the effective rank of the feature matrix obtained when R + γP^π Q_{k−1} is expressed in the function class — and srank_δ(W_φ(k−1)) is bounded by an iteration-specific threshold c_k, i.e., srank_δ(W_φ^P(k)) ≤ srank_δ(W_φ(k−1)) + c_k.

We will characterize the properties of c_k below. Using Assumptions D.1 and D.2, we obtain the following chain of inequalities:

srank_δ(W_φ(k)) ≤ srank_δ(W_φ^P(k)) − ||Q_k − y_k|| / λ_k
  ≤ srank_δ(W_φ(k−1)) + c_k − ||Q_k − y_k|| / λ_k
  ≤ srank_δ(W_φ(k−2)) + c_{k−1} − ||Q_{k−1} − y_{k−1}|| / λ_{k−1} + c_k − ||Q_k − y_k|| / λ_k
  ⋮

$$\mathrm{srank}_\delta(W_\varphi(k)) \leq \mathrm{srank}_\delta(W_\varphi(0)) + \sum_{j=1}^{k} c_j - \sum_{j=1}^{k} \frac{\|Q_j - y_j\|}{\lambda_j}. \qquad (D.12)$$

Derivation of the steps: The first inequality holds via the weight-copying argument: since the weights for the target value R + γP^π Q_{k−1} can be directly copied to obtain a feasible point with zero TD error and equal srank, the optimum of Equation 5 attains a better objective value, i.e., a smaller srank, whenever the TD error is non-zero. We then use Assumption D.2 to relate srank_δ(W_φ^P(k)) to the effective rank of the previous Q-function, srank_δ(W_φ(k−1)), which gives us a recursive inequality. Finally, the last two steps follow from repeated application of this recursion. This proves the result shown in Proposition 4.2. Now, if the decrease in effective rank due to accumulating TD errors is larger than the possible cumulative increase in rank Σ_j c_j, then we observe a net rank drop. Also note that c_k need not be positive: when c_k is negative, both terms contribute to a drop in rank; in the most general case, however, c_k can be positive for some iterations.
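The recursion behind Equation D.12 can be unrolled mechanically; the toy values of c_k, TD errors, and λ_k below are purely illustrative, and serve only to show that the step-by-step bound telescopes into the closed form:

```python
# Unroll: srank(k) <= srank(k-1) + c_k - ||Q_k - y_k|| / lambda_k
srank0 = 40.0                                # srank_delta(W_phi(0)) (toy value)
c = [1.0, 0.5, -0.2, 0.3, 0.1]               # per-iteration slack c_k
td_err = [2.0, 1.5, 1.2, 1.0, 0.8]           # ||Q_k - y_k||
lam = [1.0, 1.0, 1.0, 1.0, 1.0]              # penalty coefficients lambda_k

bound = [srank0]
for ck, e, lk in zip(c, td_err, lam):
    bound.append(bound[-1] + ck - e / lk)

# Closed form of Equation D.12 after k iterations
closed = srank0 + sum(c) - sum(e / l for e, l in zip(td_err, lam))
```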
Equation D.12 indicates how a change in rank at one iteration gets compounded into the next iteration due to bootstrapping.

What can we say about the value of c_k in general? In the general case, c_k can be positive, for example when the addition of the reward and the dynamics projection increases srank. Our current set of assumptions is insufficient to characterize c_k in the general case, and more assumptions are needed. For instance, as shown in Theorem 4.2, under Assumption D.1 we find that c_k ≤ 0 in a small neighborhood around the fixed point.

Empirical verification: While our analysis does not dictate the value of c_k, we performed an experiment on Seaquest to approximately visualize srank_δ(W_φ^P(k)), by computing the effective rank of the feature matrix Φ obtained when P^π Q_{k−1} is expressed in the Q-function class. We do not include the reward function in this experiment; since the reward function in Atari is "simple" (it takes only three possible values), we believe it may not contribute much to srank_δ(W_φ^P(k)). As shown in Figure D.2, the effective rank of the target value features decreases, indicating that the coefficient c_k is expected to be bounded in practice.

D.3 PROOF FOR THEOREM 4.2: RANK COLLAPSE NEAR A FIXED POINT

We now prove Theorem 4.2 by showing that a rank decrease occurs when the current Q-function iterate Q_k is close to a fixed point but has not yet converged, i.e., when ||Q_k − (R + γP^π Q_k)|| ≤ ε.
We will prove this as a special case of the discussion in Section D.2, by showing that for any iteration k, when the current Q-function iterate Q_k is close to a fixed point, the value of c_k can be made close to 0. The compounding argument (Equation D.12) may not directly apply here, since we cannot guarantee that an update to the Q-function keeps it within the ε-ball around the fixed point; we therefore prove a "one-step" result. To prove Theorem 4.2, we evaluate the (infinitesimal) change in the singular values of the features W_φ(k, t) resulting from an additive perturbation of magnitude ε. In this case, the change (or differential) in the singular value matrix S(k, T), where W_φ(k, T) = U(k, T) S(k, T) V(k, T)^T, is given by:

$$dS(k, T) = I_d \circ \left( U(k, T)^T \cdot dW_\varphi(k, T) \cdot V(k, T) \right), \qquad (D.13)$$

using results on computing the derivative of the singular value decomposition (Townsend, 2016).

Proof strategy: Our high-level strategy in this proof is to design a form for ε such that the effective rank of the feature matrix obtained when the resulting target value y_k = Q_{k−1} + ε is expressed in the deep linear network function class can be controlled. Using this form of ε(s, a), we note that the targets y_k used for the backup can be written as:

$$y_k = Q_{k-1} + \varepsilon = W_N(k-1, T)\, W_\varphi(k-1, T)\, [S; A] + W_N(k-1, T)\, \zeta\, [S; A] = W_N(k-1, T) \cdot \left( W_\varphi(k-1, T) + \zeta \right) [S; A].$$
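The first-order sensitivity in Equation D.13 (the diagonal of U^T dW V, cf. Townsend, 2016) can be checked against the actual shift of the singular values under a small perturbation. The shapes, seed, and perturbation scale below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((6, 4))                  # stands in for W_phi(k, T)
U, s, Vt = np.linalg.svd(W, full_matrices=False)

dW = 1e-7 * rng.standard_normal(W.shape)         # infinitesimal perturbation

# Predicted change, I_d o (U^T dW V): the diagonal entries only
pred = np.diag(U.T @ dW @ Vt.T)
actual = np.linalg.svd(W + dW, compute_uv=False) - s

pert_err = np.max(np.abs(actual - pred))         # agreement to first order
```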



Figure 1: Implicit under-parameterization. Schematic diagram depicting the emergence of an effective rank collapse in deep Q-learning. Minimizing TD errors using gradient descent with deep neural network Q-function leads to a collapse in the effective rank of the learned features Φ, which is exacerbated with further training.

Algorithm 1 (fragment):
1: Initialize replay buffer µ.
2: for fitting iteration k in {1, . . . , N} do
3:   Compute Q_θ(s, a) and target values y_k(s, a) = r + γ max_{a'} Q_{k−1}(s', a') on {(s, a)} ∼ µ for training
4:

Figure 3: Data Efficient Online RL. srank δ (Φ) and performance of neural FQI on gridworld, DQN on Atari and SAC on Gym domains in the online RL setting, with varying numbers of gradient steps per environment step (n). Rank collapse happens earlier with more gradient steps, and the corresponding performance is poor.

Figure 4: (a) Fitting error for Q * prediction for n = 10 vs n = 200 steps in Figure 3 (left). Observe that rank collapse inhibits fitting Q * as the fitting error rises over training while rank collapses. (b) TD error for varying values of n for SEAQUEST in Figure 3 (middle). TD error increases with rank degradation. (c) Q-network re-initialization in each fitting iteration on gridworld. (d) Trend of srank δ (Φ) for policy evaluation based on bootstrapped updates (FQE) vs Monte-Carlo returns (no bootstrapping). Note that rank-collapse still persists with reinitialization and FQE, but goes away in the absence of bootstrapping.

Figure 5: Trend of srank δ (Φ) vs. error (log scale) to the projected TD fixed point. A drop in srank δ (Φ) (shown as blue and yellow circles) corresponds to an increase in the distance to the fixed point.

that measures the relationship between srank δ (Φ) and the error to the projected TD fixed point, |Q - Q*|. Note that a drop in srank δ (Φ) corresponds to an increased value of |Q - Q*|, indicating that a rank drop occurring when Q gets close to a fixed point can hinder convergence to it.

Figure 6: (a): srank δ (Φ) (top) and performance (bottom) of FQI on gridworld in the offline setting with 200 gradient updates per fitting iteration. Note reduced rank collapse and higher performance with the regularizer Lp(Φ). (b): Lp(Φ) mitigates the rank collapse in DQN and CQL in the offline RL setting on Atari.

Figure 7: DQN and CQL with Lp(Φ) penalty vs. their standard counterparts in the 5% offline setting on Atari from Section 3. Lp improves DQN on 16/16 and CQL on 11/16 games.

Figure A.1: Offline DQN on Atari. srank δ (Φ) and performance of DQN on five Atari games in the offline RL setting using the 5% DQN Replay dataset (Agarwal et al., 2020) (marked as DQN). Note that low srank (top row) generally corresponds to worse policy performance (bottom row), and that rank collapse generally begins to occur close to the position of peak return. The average across 5 runs is shown for each game, along with individual runs.

Figure A.4: Offline Control on MuJoCo. srank δ (Φ) and performance of SAC on three Gym environments in the offline RL setting. Implicit under-parameterization is conspicuous from the rank reduction, which correlates strongly with performance degradation. We use 20% of the data sampled uniformly from the entire replay experience of an online SAC agent, similar to the 20% setting from Agarwal et al. (2020).

Figure A.5: Offline Control on MuJoCo. srank δ (Φ) and performance of CQL on three Gym environments in the offline RL setting. Implicit under-parameterization is conspicuous from the rank reduction, which correlates strongly with performance degradation. We use 20% of the data sampled uniformly from the entire replay experience of an online SAC agent, similar to the 20% setting from Agarwal et al. (2020).

Figure A.6: Online DQN on Atari. srank δ (Φ) and performance of DQN on 5 Atari games in the online RL setting, with varying numbers of gradient steps per environment step (n). Rank collapse happens earlier with more gradient steps, and the corresponding performance is poor. This indicates that implicit under-parameterization is aggravated as the rate of data re-use increases.

Figure A.7: Online SAC on MuJoCo. srank δ (Φ) and performance of SAC on Gym environments in the online RL setting, with varying numbers of gradient steps per environment step (n). In the simpler environments HalfCheetah-v2, Hopper-v2, and Walker2d-v2, we actually observe an increase in effective rank, which also corresponds to good performance with large n; on the more complex Ant-v2 environment, rank decreases with larger n, and the corresponding performance is worse with more gradient updates.

Figure A.10: Monte Carlo Offline Policy Evaluation. srank δ (Φ) on 5 Atari games when using Monte Carlo returns as targets, thus removing bootstrapped updates. Rank collapse does not occur in this setting, implying that bootstrapping is essential for implicit under-parameterization. We perform the experiments using the 5% and 20% DQN Replay datasets from Agarwal et al. (2020).

Figure A.11b  shows a gridworld problem with one-hot features, which naturally leads to reduced state-aliasing. In this setting, we find that the amount of rank drop with respect to the supervised

Figure A.12), and the value function is unable to fit the target values, culminating in a performance plateau (Figure A.6).

Figure A.13: Effective rank values with the penalty on DQN. Trends in effective rank and performance for offline DQN. Note that the performance of DQN with the penalty is generally better than DQN, and that the penalty (blue) is effective in increasing the effective rank. We report performance at the end of 100 epochs, following the protocol set by Agarwal et al. (2020), in Figure 7.

Offline CQL with Lp(Φ).

Figure A.15: Performance improvement of (a) offline DQN and (b) offline CQL with the Lp(Φ) penalty on 20% Atari dataset, i.e., the dataset referred to as 4x large in Figure 2.

Figure A.18: Performance of Rainbow (n = 4) and Rainbow (n = 4) with the Lp(Φ) penalty (Equation 6). Note that the penalty improves on base Rainbow in 12/16 games.

Figure A.19: Learning curves with n = 4 gradient updates per environment step for Rainbow (blue) and Rainbow with the Lp(Φ) penalty (Equation 6) (red) on 16 games, corresponding to the bar plot above. One unit on the x-axis is equivalent to 1M environment steps.

Figure A.20: Comparing different measures of effective rank on a run of offline DQN in the 5% replay setting, previously studied in Figure A.3.

Figure A.21: Comparison of srank δ,λ (Φ) and srank δ (Φ) on the gridworld in the offline setting (left) and the online setting (right), when using a deep linear network with dim(φ(s, a)) = |S||A|. Note that both notions of effective rank exhibit a similar decreasing trend and are closely related to each other.

Figure A.22: Rank collapse in DQN as a function of the number of gradient updates (x-axis) for five Atari games in the data-efficient online RL setting. This setting was previously studied in Figure A.6. Note that fewer updates per unit of data, i.e., smaller values of n, result in larger srank δ values.

Now we utilize Lemmas C.1.1 and C.1.2 to prove Proposition 4.1. Proof of Proposition 4.1 and Theorem C.1 Recall from the proof of Lemma C.1.1 that the singular values of M k are given by:

Figure D.1: Evolution of singular values of Φ on Atari. The larger singular values of the feature matrix Φ grow at a disproportionately higher rate than the smaller singular values, as described by Equation 4.

D.2 PROOF OF PROPOSITION 4.2: COMPOUNDING RANK DROP IN SECTION 4.2

We characterize c_k in special cases by showing how Theorem 4.2 arises as a special case of the following argument, where we can analyze the trend in c_k. To first see how a change in rank, indicated by c_k in Assumption D.2, propagates through bootstrapping, note that Assumptions D.1 and D.2 yield the following relation:

srank_δ(W_φ(k)) ≤ srank_δ(W_φ^P(k)) − ||Q_k − y_k|| / λ_k   (since we can copy over the weights)
  ≤ srank_δ(W_φ(k−1)) + c_k − ||Q_k − y_k|| / λ_k   (using Assumption D.2)

Figure D.2: Two pairs of runs depicting the trends of the target feature rank (approximately equal to srank δ (W P φ (k)), denoted as "Target") and srank δ (W φ (k -1)) (denoted as "Main") on Seaquest in the data-efficient online RL setting. Note that the contribution of the dynamics transformation to the value of c k is generally negative for Pair 1 and is small/positive for Pair 2.

expressed in the deep linear network function class — say with feature matrix W'_φ(k, T) and last-layer weights W'_N(k, T) = W_N(k−1, T) — then the value of srank_δ(W'_φ(k, T)) can exceed srank_δ(W_φ(k−1, T)) by only a bounded amount α that depends on ε. More formally, we desire a result of the form:

$$\mathrm{srank}_\delta(W'_\varphi(k, T)) \leq \mathrm{srank}_\delta(W_\varphi(k-1, T)) + \alpha, \quad \text{where } \|\varepsilon\| \ll \|Q_{k-1}\|, \text{ and } y_k(s, a) = W_N(k-1, T)\, W'_\varphi(k, T)\, [s; a]. \qquad (D.14)$$

Once we obtain such a parameterization of y_k in the linear network function class, then, using the argument in Section 4.2 about self-training, simply copying over the weights W'_φ(k, T) is sufficient to obtain a bounded increase in srank_δ(W_φ(k, T)) at the cost of incurring zero TD error (since the targets can be fit exactly). As a result, the optimal solution found by Equation 5 will attain a lower srank_δ(W_φ(k, T)) value if the TD error is non-zero. Specifically,

$$\mathrm{srank}_\delta(W_\varphi(k, T)) + \frac{\|Q_k - y_k\|}{\lambda_k} \leq \mathrm{srank}_\delta(W_\varphi(k-1, T)) + \alpha. \qquad (D.15)$$

As a result, we need to examine the factors that affect srank_δ(W_φ(k, T)): (1) an increase in srank_δ(W_φ(k, T)) due to the addition of α to the rank, and (2) a decrease in srank_δ(W_φ(k, T)) due to the implicit regularization behavior of gradient descent.

Proof: To obtain the desired result (Equation D.14), we can express ε(s, a) as a function of the last-layer weight matrix of the current Q-function, W_N(k−1, T), and the state-action inputs [s; a]. Formally, we assume that ε(s, a) = W_N(k−1, T) ζ [s; a], where ζ is a matrix of the same size as W_φ such that ||ζ||_∞ ≪ τ = O(||W_φ(k, T)||_∞), i.e., ζ has entries of negligible magnitude compared to the actual feature matrix W_φ(k, T). Such a choice is possible by the closure assumption (Assumption D.1) from Section D.2.

Using the equation for the sensitivity of the singular values to a change in the matrix entries, we obtain that the maximum change in the singular values of the resulting "effective" feature matrix of the targets y_k in the linear function class, denoted as W'_φ(k, T), is bounded:

$$dS'(k, T) = I_d \circ \left( U(k, T)^T \cdot dW_\varphi(k, T) \cdot V(k, T) \right) \;\Longrightarrow\; \|dS'(k, T)\|_\infty \leq \|\zeta\|_\infty. \qquad (D.16)$$

where u r (k, t) and v r (k, t) denote the left and right singular vectors of the feature matrix, W φ (k, t), respectively.

Table 1: Hyperparameters used by the offline and online RL agents in our experiments.

ACKNOWLEDGEMENTS

We thank Lihong Li, Aaron Courville, Aurick Zhou, Abhishek Gupta, George Tucker, Ofir Nachum, Wesley Chung, Emmanuel Bengio, Zafarali Ahmed, and Jacob Buckman for feedback on an earlier version of this paper. We thank Hossein Mobahi for insightful discussions about self-distillation and Hanie Sedghi for insightful discussions about implicit regularization and generalization in deep networks. We additionally thank Michael Janner, Aaron Courville, Dale Schuurmans and Marc Bellemare for helpful discussions. AK was partly funded by the DARPA Assured Autonomy program, and DG was supported by a NSF graduate fellowship and compute support from Amazon.


Published as a conference paper at ICLR 2021

Putting these inequalities together proves the final statement.

Extension to approximately-normal S. We can extend the result in Theorem C.1 (and hence also Theorem 4.1) to approximately-normal S. Note that the main purpose of requiring normality of S (i.e., σ_i(S) = |λ_i(S)|) is that it makes it straightforward to relate the eigenvalues of S to those of M_k, as shown below. Since the matrix S is approximately normal, we can express it using its Schur triangular form as S = U · (Λ + N) · U^H, where Λ is a diagonal matrix and N is an "offset" matrix. The departure from normality of S is defined as ∆(S) := inf_N ||N||_2, where the infimum is computed over all matrices N that can appear in a Schur triangular form of S. For a normal S, only N = 0 satisfies the Schur triangular form; for an approximately normal matrix S, we have ||N||_2 ≤ ∆(S) ≤ ε for a small ε. Furthermore, from Equation 6 in Ruhe (1975), we obtain that the singular values and the eigenvalue norms of S are close to each other.

Next, let us evaluate the departure from normality of M_k. Expanding M_k in terms of the Schur form of S and, for a small ε, retaining only the terms linear in N in (Λ + N)^j, we find (Equation C.32) that the matrix M_k is also approximately normal, provided that the maximum eigenvalue norm of S is less than 1. This is indeed the case, since S = γP^π A (see Theorem 4.1), where both P^π and A have eigenvalues of norm less than 1, and γ < 1.

Given that we have shown that M_k is approximately normal, srank_δ(M_k) differs from srank_δ,λ(M_k), i.e., the effective rank computed from the eigenvalues, by only a bounded amount. If the value of ε is small enough, we thus retain the conclusion that srank_δ(M_k) generally decreases with more training, by following the proof of Theorem C.1.
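Departure from normality can be probed numerically. Rather than computing the Schur-form infimum ∆(S), the sketch below uses the commutator defect ||A A^H − A^H A||_F, a standard NumPy-only proxy that vanishes exactly when A is normal; the test matrices are illustrative:

```python
import numpy as np

def normality_defect(A):
    """||A A^H - A^H A||_F: zero iff A is normal (a proxy for Delta(A))."""
    A = np.asarray(A)
    return np.linalg.norm(A @ A.conj().T - A.conj().T @ A)

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 5))
sym = B + B.T                              # symmetric, hence normal
nonnormal = np.triu(np.ones((5, 5)))       # triangular with off-diagonal mass

d_normal = normality_defect(sym)           # essentially zero
d_nonnormal = normality_defect(nonnormal)  # clearly positive
```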

D PROOFS FOR SECTION 4.2

In this section, we provide technical proofs from Section 4.2. We start by deriving properties of the optimization trajectories of the weight matrices of the deep linear network, similar to Arora et al. (2018) but customized to our set of assumptions; we then prove Proposition 4.1, and finally discuss how to extend these results to the fitted Q-iteration setting, along with some extensions not discussed in the main paper. Similar to Section 4.1, we assume access to a dataset of transitions, D = {(s_i, a_i, r(s_i, a_i), s'_i)}, and assume that the same data is used to re-train the Q-function.

Notation and definitions. The Q-function is represented using a deep linear network with at least 3 layers, such that Q^T = W_N W_{N−1} ⋯ W_1 [S; A]^T, with W_i ∈ R^{d_i × d_{i−1}} for i = 1, …, N−1. We index the weight matrices by a tuple (k, t): W_j(k, t) denotes the weight matrix W_j at the t-th step of gradient descent during the k-th fitting iteration (Algorithm 1). We denote the end-to-end weight matrix W_N W_{N−1} ⋯ W_1 by the shorthand W_{N:1}, and the features of the penultimate layer of the network by W_φ(k, t) := W_{N−1}(k, t) ⋯ W_1(k, t).

Now, let us use this inequality to bound the maximum change in the function h_k(i) used to compute srank_δ. The above equation implies that the maximal difference between the effective rank of the feature matrix generated by the targets y_k (denoted W'_φ(k, T)) and the effective rank of the features of the current Q-function Q_k (denoted W_φ(k, T)) is given by a quantity α, which can formally be written as the cardinality of the set in Equation D.17. Note that by choosing ε(s, a), and thus ζ, to be small enough, we can obtain W'_φ(k, T) such that α = 0. The self-training argument discussed above then applies, and gradient descent in the next iteration will yield solutions that reduce rank. Assuming r > 1 and srank_δ(W_φ(k, T)) = r, we know that h_k(r−1) < 1 − δ, while h_k(r) ≥ 1 − δ.
Thus, for srank_δ(W'_φ(k, T)) to be equal to r, it is sufficient to show that h_k(r) − |dh_k(r)| ≥ 1 − δ, since both h_k(i) and dh_k(i) are increasing for i = 0, 1, …, r. Hence srank_δ(W'_φ(k, T)) = r for all ζ whenever this condition holds.

Proof summary: We have thus shown that there exists a neighborhood around the optimal fixed point of the Bellman equation, parameterized by ε(s, a), where bootstrapping behaves like self-training. In this regime, it is possible to reduce srank while the TD error is non-zero, which gives rise to rank reduction close to the optimal solution.

