RLX2: TRAINING A SPARSE DEEP REINFORCEMENT LEARNING MODEL FROM SCRATCH

Abstract

Training deep reinforcement learning (DRL) models usually requires high computation costs. Therefore, compressing DRL models possesses immense potential for training acceleration and model deployment. However, existing methods that generate small models mainly adopt the knowledge distillation-based approach by iteratively training a dense network. As a result, the training process still demands massive computing resources. Indeed, sparse training from scratch in DRL has not been well explored and is particularly challenging due to non-stationarity in bootstrap training. In this work, we propose a novel sparse DRL training framework, "the Rigged Reinforcement Learning Lottery" (RLx2), which builds upon gradient-based topology evolution and is capable of training a DRL model based entirely on sparse networks. Specifically, RLx2 introduces a novel delayed multistep TD target mechanism with a dynamic-capacity replay buffer to achieve robust value learning and efficient topology exploration in sparse models. It also reaches state-of-the-art sparse training performance in several tasks, showing 7.5×-20× model compression with less than 3% performance degradation and up to 20× and 50× FLOPs reduction for training and inference, respectively.

1. INTRODUCTION

Deep reinforcement learning (DRL) has found successful applications in many important areas, e.g., games (Silver et al., 2017), robotics (Gu et al., 2017), and nuclear fusion (Degrave et al., 2022). However, training a DRL model demands heavy computational resources. For instance, AlphaGo-Zero for Go games (Silver et al., 2017), which defeats all Go-AIs and human experts, requires more than 40 days of training time on four tensor processing units (TPUs). This heavy resource requirement leads to high costs and hinders the application of DRL on resource-limited devices. Sparse networks, initially proposed in deep supervised learning, have demonstrated great potential for model compression and training acceleration of deep reinforcement learning. Specifically, in deep supervised learning, the state-of-the-art sparse training frameworks, e.g., SET (Mocanu et al., 2018) and RigL (Evci et al., 2020), can train a 90%-sparse network (i.e., the resulting network size is 10% of the original network) from scratch without performance degradation. On the DRL side, existing works including Rusu et al. (2016); Schmitt et al. (2018); Zhang et al. (2019) succeeded in generating ultimately sparse DRL networks. Yet, their approaches still require iteratively training dense networks, e.g., pre-trained dense teachers may be needed. As a result, the training cost for DRL remains prohibitively high, and existing methods cannot be directly implemented on resource-limited devices, leading to low flexibility in adapting the compressed DRL models to new environments, i.e., on-device models have to be retrained at large servers and re-deployed. Training a sparse DRL model from scratch, if done perfectly, has the potential to significantly reduce computation expenditure, enable efficient deployment on resource-limited devices, and achieve excellent flexibility in model adaptation.
However, training an ultra-sparse network (e.g., 90% sparsity) from scratch in DRL is challenging due to the non-stationarity in bootstrap training. Specifically, in DRL, the learning target is not fixed but evolves in a bootstrap way (Tesauro et al., 1995), and the distribution of the training data can also be non-stationary (Desai et al., 2019). Moreover, using a sparse network structure means searching in a smaller hypothesis space, which further reduces the learning target's confidence. As a result, improper sparsification can cause irreversible damage to the learning path (Igl et al., 2021), resulting in poor performance. Indeed, recent works (Sokar et al., 2021; Graesser et al., 2022) show that a direct adoption of a dynamic sparse training (DST) framework in DRL still fails to achieve good compression of the model uniformly across environments. Therefore, the following interesting question remains open: Can an efficient DRL agent be trained from scratch with an ultra-sparse network throughout? In this paper, we give an affirmative answer to this question and propose a novel sparse training framework, "the Rigged Reinforcement Learning Lottery" (RLx2), for off-policy RL, which is the first algorithm to achieve sparse training throughout at sparsity levels of more than 90% with only minimal performance loss. RLx2 is inspired by the gradient-based topology evolution criteria in RigL (Evci et al., 2020) for supervised learning. However, a direct application of RigL does not achieve high sparsity, because sparse DRL models suffer from unreliable value estimation due to the limited hypothesis space, which further disturbs topology evolution. Thus, RLx2 is equipped with a delayed multi-step Temporal Difference (TD) target mechanism and a novel dynamic-capacity replay buffer to achieve robust value learning and efficient topology exploration.
These two new components address the value estimation problem under sparse topology and, together with RigL, achieve superior sparse-training performance. The main contributions of the paper are summarized as follows.

• We investigate the fundamental obstacles in training a sparse DRL agent from scratch, and discover two key factors for achieving good performance under sparse networks, namely robust value estimation and efficient topology exploration.

• Motivated by our findings, we propose RLx2, the first framework that enables DRL training based entirely on sparse networks. RLx2 possesses two key functions, i.e., a gradient-based search scheme for efficient topology exploration, and a delayed multi-step TD target mechanism with a dynamic-capacity replay buffer for robust value learning.

• Through extensive experiments, we demonstrate the state-of-the-art sparse training performance of RLx2 with two popular DRL algorithms, TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018), on several MuJoCo (Todorov et al., 2012) continuous control tasks. Our results show up to 20× model compression. RLx2 also achieves 20× acceleration in training and 50× in inference in terms of floating-point operations (FLOPs).

2. RELATED WORKS

We discuss the related works on training sparse models in deep supervised learning and reinforcement learning below. We also provide a comprehensive performance comparison in Table 1. Sparse Models in Deep Supervised Learning Han et al. (2015; 2016); Srinivas et al. (2017); Zhu & Gupta (2018) focus on finding a sparse network by pruning pre-trained dense networks. Iterative Magnitude Pruning (IMP) in Han et al. (2016) achieves a sparsity of more than 90%. Techniques including neuron characteristics (Hu et al., 2016), dynamic network surgery (Guo et al., 2016), derivatives (Dong et al., 2017; Molchanov et al., 2019b), regularization (Louizos et al., 2018; Tartaglione et al., 2018), dropout (Molchanov et al., 2017), and weight reparameterization (Schwarz et al., 2021) have also been applied to network pruning. Another line of work focuses on the Lottery Ticket Hypothesis (LTH), first proposed in Frankle & Carbin (2019), which shows that training a sparse network from scratch is possible if one finds a sparse "winning ticket" initialization in deep supervised learning. The LTH has also been validated in other deep learning models (Chen et al., 2020; Brix et al., 2020; Chen et al., 2021). Many works (Bellec et al., 2017; Mocanu et al., 2018; Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2020) also try to train a sparse neural network from scratch without pre-trained dense models. These works adjust the structure of sparse networks during training, and include Deep Rewiring (DeepR) (Bellec et al., 2017), Sparse Evolutionary Training (SET) (Mocanu et al., 2018), Dynamic Sparse Reparameterization (DSR) (Mostafa & Wang, 2019), Sparse Networks from Scratch (SNFS) (Dettmers & Zettlemoyer, 2019), and the Rigged Lottery (RigL) (Evci et al., 2020). Single-Shot Network Pruning (SNIP) (Lee et al., 2019) and Gradient Signal Preservation (GraSP) (Wang et al., 2020) focus on finding static sparse networks before training.
Here ST and TA stand for "sparse throughout training" and "training acceleration", respectively. The shown sparsity is the maximum sparsity level without performance degradation under each algorithm. †: There are multiple method combinations in (Graesser et al., 2022), where "TE" stands for two topology evolution schemes (SET and RigL) and "RL" refers to two RL algorithms (TD3 and SAC). Sparse Models in Deep Reinforcement Learning Existing methods either require iteratively training dense networks, e.g., knowledge-distillation-based approaches (Rusu et al., 2016; Schmitt et al., 2018; Zhang et al., 2019; Livne & Cohen, 2020), or fail in ultra-sparse models, e.g., DST-based methods (Sokar et al., 2021; Graesser et al., 2022). In this paper, we further improve the performance of DST by introducing a delayed multi-step TD target mechanism with a dynamic-capacity replay buffer, which effectively addresses the unreliability of value estimation in sparse models during training.

3. DEEP REINFORCEMENT LEARNING PRELIMINARIES

In reinforcement learning, an agent interacts with an unknown environment to learn an optimal policy. The learning process is formulated as a Markov decision process (MDP) M = ⟨S, A, r, P, γ⟩, where S is the state space, A is the action space, r is the reward function, P denotes the transition matrix, and γ stands for the discount factor. Specifically, at time slot t, given the current state s_t ∈ S, the agent selects an action a_t ∈ A by policy π : S → A, which then incurs a reward r(s_t, a_t). Denote the Q function associated with the policy π for state-action pair (s, a) as

Q^π(s, a) = E_π[ Σ_{i=t}^T γ^{i−t} r(s_i, a_i) | s_t = s, a_t = a ]. (1)

In actor-critic methods (Silver et al., 2014), the policy π(s; ϕ) is parameterized by a policy (actor) network with weight parameter ϕ, and the Q function Q(s, a; θ) is parameterized by a value (critic) network with parameter θ. The goal of the agent is to find an optimal policy π*(s; ϕ*) which maximizes the long-term cumulative reward, i.e., J* = max_ϕ E_{π(ϕ)}[ Σ_{i=0}^T γ^i r(s_i, a_i) | s_0, a_0 ]. There are various DRL methods for learning an efficient policy. In this paper, we focus on off-policy TD learning methods, including a broad range of state-of-the-art algorithms, e.g., TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018). Specifically, the critic network is updated by gradient descent to fit the one-step TD target T_1 generated by a target network Q(s, a; θ′), i.e.,

T_1 = r(s, a) + γ Q(s′, a′; θ′) (2)

for each state-action pair (s, a), where a′ = π(s′; ϕ). The loss function of the value network is defined as the expected squared loss between the current value network and the TD targets:

L(θ) = E_{π(ϕ)}[ (Q(s, a; θ) − T_1)^2 ]. (3)

The policy π(s; ϕ) is updated by the deterministic policy gradient algorithm in Silver et al. (2014):

∇_ϕ J(ϕ) = E_{π(ϕ)}[ ∇_a Q^π(s, a; θ)|_{a=π(s;ϕ)} ∇_ϕ π(s; ϕ) ].
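As a concrete illustration of the update above, the one-step TD target in Eq. (2) and the squared critic loss can be sketched in a few lines. This is a minimal NumPy sketch with hypothetical helper names, not the paper's implementation:

```python
import numpy as np

def td_target_1step(r, q_next, gamma=0.99, done=False):
    """One-step TD target T_1 = r + gamma * Q(s', pi(s'); theta') (Eq. 2).

    `q_next` is the target critic's value at (s', pi(s')); it is zeroed
    at terminal states so no bootstrapping occurs past episode end.
    """
    return r + (0.0 if done else gamma * q_next)

def critic_loss(q_values, targets):
    """Mean squared error between Q(s, a; theta) and the TD targets."""
    q_values, targets = np.asarray(q_values), np.asarray(targets)
    return float(np.mean((q_values - targets) ** 2))
```

In practice the targets are computed with the (frozen) target network and the loss is minimized by gradient descent on the critic parameters θ only.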

4. RLX2: RIGGING THE LOTTERY IN DRL

In this section, we present the RLx2 framework. As a motivating study, we compare sparse training variants under the same sparsity budget, including a random static sparse network (SS), RigL, and a variant denoted RigL+Q*. From the results, we make the following important observations. (i) Topology evolution is essential: a random static sparse network (SS) leads to much worse performance than RigL. (ii) Robust value estimation is significant: this is validated by the comparison between RigL and RigL+Q*, both using the same topology adjustment scheme but with different Q-values. Motivated by the above findings, RLx2 utilizes gradient-based topology adjustment, i.e., RigL (for topology evolution), and introduces a delayed multi-step TD target mechanism with a dynamic-capacity replay buffer (for robust value estimation). Below, we explain the key components of RLx2 in detail, to illustrate why RLx2 is capable of achieving robust value learning and efficient topology exploration simultaneously.

4.1. GRADIENT-BASED TOPOLOGY EVOLUTION

The topology evolution in RLx2 is conducted by adopting the RigL method (Evci et al., 2020). Specifically, we compute the gradients of the loss function with respect to the link weights. Then, we dynamically grow connections (connecting neurons) with large gradient magnitudes and remove existing links with the smallest absolute weights. In this way, we obtain a sparse mask that evolves by self-adjustment. The pseudo-code of our scheme is given in Algorithm 1, where ⊙ is the element-wise multiplication operator and M_θ is the binary mask representing the sparse topology of the network θ. The update fraction anneals during training according to ζ_t = (ζ_0 / 2)(1 + cos(πt / T_end)), where ζ_0 is the initial update fraction and T_end is the total number of iterations. Finding the top-k links with maximum gradients in Line 10 can be implemented efficiently, so that Algorithm 1 has time complexity O((1 − s)N log N) (detailed in Appendix A.1), where s is the total sparsity. Besides, the topology adjustment happens very infrequently during training, i.e., every 10,000 steps in our setup, so its cost is negligible (detailed in Appendix C.3). Our topology evolution scheme can thus be implemented efficiently on resource-limited devices.

Algorithm 1 Topology Evolution (Evci et al., 2020)

4.2. ROBUST VALUE LEARNING

To achieve robust value estimation and properly guide the topology search, RLx2 utilizes two novel components: (i) delayed multi-step TD targets to bootstrap value estimation; (ii) a dynamic-capacity replay buffer to eliminate the potential data inconsistency due to policy change during training.
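The drop-and-grow update of the topology evolution in Section 4.1 (Algorithm 1) can be sketched as follows. This is a minimal NumPy illustration of the RigL-style criteria — drop the active links with smallest weight magnitude, grow the inactive links with largest gradient magnitude — with illustrative function names, not the authors' code:

```python
import numpy as np

def rigl_update(weights, grads, mask, zeta):
    """One RigL-style topology update on a single layer.

    Drops a fraction `zeta` of active links with the smallest |weight|,
    then grows the same number of currently-inactive links with the
    largest |gradient|, so total sparsity stays fixed.
    """
    mask = mask.copy()
    flat_mask = mask.ravel()
    active = np.flatnonzero(flat_mask)
    k = int(zeta * active.size)
    if k == 0:
        return mask
    # Drop: active links with smallest weight magnitude.
    abs_w = np.abs(weights).ravel()
    drop = active[np.argsort(abs_w[active])[:k]]
    flat_mask[drop] = 0
    # Grow: inactive links with largest gradient magnitude,
    # excluding the positions just dropped.
    inactive = np.setdiff1d(np.flatnonzero(flat_mask == 0), drop)
    abs_g = np.abs(grads).ravel()
    grow = inactive[np.argsort(abs_g[inactive])[-k:]]
    flat_mask[grow] = 1
    return mask
```

Newly grown weights are typically initialized to zero before training continues with the updated mask.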

4.2.1. MULTI-STEP TD TARGET

In TD learning, a TD target is generated, and the value network is iteratively updated by minimizing a squared loss induced by the TD target. Single-step methods generate the TD target by combining the one-step reward and the discounted target network output, i.e., T_1 = r_t + γ Q(s_{t+1}, π(s_{t+1}); θ). However, a sparse network, with parameter θ ⊙ M_θ obtained from its dense counterpart θ, inevitably resides in a smaller hypothesis space due to having fewer parameters. This means that the output of the sparse value network can be unreliable and may lead to inaccurate value estimation. Denote the fitting error of the value network as ϵ(s, a) = Q(s, a; θ) − Q^π(s, a). This error may be larger under a sparse model than under a dense network. To overcome this issue, we adopt a multi-step target, i.e., T_n = Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n Q(s_{t+n}, π(s_{t+n}); θ), where the target combines an n-step reward sample and the output of the sparse value network after n steps, both appropriately discounted. By doing so, we reduce the expected error between the TD target and the true target. Specifically, Eq. (4) shows the expected TD error between the multi-step TD target T_n and the true Q-value Q^π associated with the target policy π, conditioned on transitions from the behavior policy b (see detailed derivation in Appendix A.2):

E_b[T_n(s, a)] − Q^π(s, a) = (E_b[T_n(s, a)] − E_π[T_n(s, a)]) + γ^n E_π[ϵ(s_n, π(s_n))], (4)

where the first term is the policy inconsistency error and the second term is the network fitting error. The multi-step target has been studied in existing works (Bertsekas & Ioffe, 1996; Precup, 2000; Munos et al., 2016) for improving TD learning. In our case, we also find that introducing a multi-step target reduces the network fitting error by a multiplicative factor γ^n, as shown in Eq. (4). On the other hand, it has been observed, e.g., in Fedus et al. (2020), that an immediate adoption of multi-step TD targets may cause a larger policy inconsistency error (the first term in Eq. (4)).
Thus, we adopt a delayed scheme to suppress policy inconsistency and further improve value learning. Specifically, at the early stage of training, we use one-step TD targets to better handle the quickly changing policy during this period, when a multi-step target may not be meaningful. Then, after several training epochs, when the policy changes less abruptly, we permanently switch to multi-step TD targets to exploit their better approximation of the value function.
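The delayed multi-step target amounts to a discounted n-step return plus a bootstrapped tail, with the step length switching from 1 to n partway through training. A minimal sketch, with illustrative function names and a hypothetical `switch_step` hyperparameter:

```python
def multi_step_td_target(rewards, q_boot, gamma=0.99):
    """n-step TD target T_n = sum_{k=0}^{n-1} gamma^k r_{t+k}
    + gamma^n Q(s_{t+n}, pi(s_{t+n}); theta).

    `rewards` holds r_t, ..., r_{t+n-1}; `q_boot` is the target
    critic's value at the bootstrap state s_{t+n}.
    """
    n = len(rewards)
    n_step_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    return n_step_return + gamma ** n * q_boot

def delayed_step_length(step, switch_step, n=3):
    """Delayed scheme: 1-step targets early in training, n-step after."""
    return 1 if step < switch_step else n
```

With gamma = 0.5, rewards [1, 1, 1], and a bootstrap value of 10, the target is 1 + 0.5 + 0.25 + 0.125 · 10 = 3.0, illustrating how the bootstrapped (and possibly unreliable) critic output is damped by γ^n.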

4.2.2. DYNAMIC-CAPACITY BUFFER

The second component of RLx2 for robust value learning is a novel dynamic buffer scheme for controlling data inconsistency. Off-policy algorithms use a replay buffer to store collected data and train networks with batches sampled from the buffer. Their performance generally improves when larger replay capacities are used (Fedus et al., 2020). However, off-policy algorithms with unlimited-size replay buffers can suffer from policy inconsistency in the following two aspects. (i) Inconsistent multi-step targets: In off-policy algorithms with multi-step TD targets, the value function is updated to minimize the squared loss in Eq. (3) on transitions sampled from the replay buffer, i.e., reward sequences r_t, r_{t+1}, ..., r_{t+n} collected during training. However, since the policy evolves during training, the data in the replay buffer, used for a Monte-Carlo approximation of the current policy π, may have been collected under a different behavior policy b (Hernandez-Garcia & Sutton, 2019; Fedus et al., 2020). As a result, it may lead to a large policy inconsistency error in Eq. (4), causing inaccurate estimation. (ii) Mismatched training objective: In practice, the loss in Eq. (3) is approximated on a mini-batch B_t sampled from the buffer:

L̂(θ) = (1 / |B_t|) Σ_{(s_i, a_i) ∼ B_t} (Q(s_i, a_i; θ) − T)^2. (5)

Compared to Eq. (3), the difference between the distribution of transitions in the mini-batch B_t and the true transition distribution induced by the current policy also leads to a mismatch in the training objective (Fujimoto et al., 2019). Indeed, our analysis in Appendix A.4 shows that training performance is closely connected to policy consistency. Motivated by this analysis, we introduce a dynamically sized buffer that reduces the policy gap based on the policy distance of the collected data. The formal scheme is given in Algorithm 3.
We introduce the following policy distance measure to evaluate the inconsistency of data in the buffer:

D(B, ϕ) = (1 / K) Σ_{(s_i, a_i) ∈ Old_K(B)} ‖π(s_i; ϕ) − a_i‖^2,

where B denotes the current replay buffer, Old_K(B) denotes the oldest K transitions in B, and π(•; ϕ) is the current policy. Here K is a hyperparameter. We calculate D(B, ϕ) every ∆_b steps. If D(B, ϕ) rises above a pre-specified threshold D_0, we pop items from B in First-In-First-Out (FIFO) order until the distance measure falls below the threshold. A visualization of the number of stored samples and the proposed policy distance metric during training is shown in Figure 4. We see that the policy distance oscillates in the early stage as the policy evolves, but it is tightly controlled and does not violate the threshold condition, which effectively addresses the off-policyness issue. As the policy converges, the policy distance tends to decrease and converge (Appendix C.8 also shows that the performance is insensitive to the policy threshold D_0).

5. EXPERIMENTS

In this section, we evaluate RLx2 experimentally: we compare RLx2 with existing baselines in Section 5.1, conduct an ablation study in Section 5.2, and examine the role topology evolution plays in sparse training in Section 5.3. Our experiments are conducted in four popular MuJoCo environments: HalfCheetah-v3 (Hal.), Hopper-v3 (Hop.), Walker2d-v3 (Wal.), and Ant-v3 (Ant.), for RLx2 with two off-policy algorithms, TD3 and SAC. Instantiations of RLx2 on TD3 and SAC are provided in Appendix B. Each result is averaged over eight random seeds. The code is available at https://github.com/tyq1024/RLx2.

5.1. COMPARATIVE EVALUATION

Table 2 summarizes the comparison results. In our experiments, we compare RLx2 with the following baselines: (i) Tiny: tiny dense networks with the same number of parameters as the sparse model in training. (ii) SS: static sparse networks with random initialization. (iii) SET (Mocanu et al., 2018): dynamic sparse training that drops connections according to weight magnitude and grows connections randomly. Note that previous work (Sokar et al., 2021) also adopts the SET algorithm for topology evolution in reinforcement learning; our implementation reaches better performance due to different hyperparameters. (iv) RigL (Evci et al., 2020): dynamic sparse training that drops and grows connections with magnitude and gradient criteria, respectively, the same as RLx2's topology evolution procedure. In our experiments, we allow the actor and critic networks to take different sparsities. We define the ultimate compression ratio as the largest sparsity level under which the performance degradation of RLx2 is within 3% of the original dense models. This can also be understood as the minimum size of a sparse model that retains the full performance of the original dense model. We present performance comparison results in Table 2 based on the ultimate compression ratio. The performance of each algorithm is evaluated with the average reward per episode over the last 30 policy evaluations of training (policy evaluation is conducted every 5000 steps). Hyperparameters are fixed across all four environments for TD3 and SAC, respectively, and are presented in Appendix C.2. Performance Table 2 shows that RLx2 performs best among all baselines in all four environments by a large margin (except for a close performance with RigL under SAC in Hopper). In addition, tiny dense networks (Tiny) and random static sparse networks (SS) perform worst on average.
SET and RigL are better, yet fail to maintain performance in Walker2d-v3 and Ant-v3, which shows that robust value learning is necessary for sparse training. To further validate the performance of RLx2, we compare different methods under different sparsity levels in Hopper-v3 and Ant-v3 in Figure 5, showing that RLx2 has a significant performance gain over the other baselines.

5.2. ABLATION STUDY

We conduct a comprehensive ablation study on the three critical components of RLx2 on TD3, i.e., topology evolution, multi-step TD targets, and the dynamic-capacity buffer, to examine the effect of each component in RLx2 and its robustness to hyperparameters. In addition, we provide a sensitivity analysis for algorithm hyperparameters, e.g., the initial mask update fraction, mask update interval, buffer adjustment interval, and buffer policy distance threshold, in Appendix C.8. Topology evolution RLx2 drops and grows connections with magnitude and gradient criteria, respectively, as adopted in RigL (Evci et al., 2020) for deep supervised learning. To validate the necessity of our topology evolution criteria, we compare RLx2 with three baselines, which replace the topology evolution scheme in RLx2 with Tiny, SS, and SET, while keeping the other components of RLx2 unchanged. The left part of Table 3 shows that RigL as a topology adjustment scheme (the resulting scheme is RLx2) performs best among the four variants. We also observe that Tiny performs worst, which is consistent with the conclusion in existing works (Zhu & Gupta, 2018) that a sparse network may contain a smaller hypothesis space and lead to performance loss, which necessitates a topology evolution scheme. Table 3: Ablation study on topology evolution and multi-step targets, where the performance (%) is normalized with respect to the performance of dense models.


Multi-step TD targets

We also compare different step lengths for the multi-step TD targets in RLx2 in the right part of Table 3. We find that multi-step TD targets with a step length of 3 achieve the best performance. In particular, multi-step TD targets improve the performance dramatically in Hopper-v3 and Walker2d-v3, while the improvement in HalfCheetah-v3 and Ant-v3 is minor.

Dynamic-capacity Buffer

We compare different buffer sizing schemes, including our dynamic scheme, fixed-capacity buffers of different sizes, and an unlimited buffer. Figure 6 shows that our dynamic-capacity buffer performs best among all buffer settings. A smaller buffer capacity benefits performance in the early stage but may reduce the final performance: a small buffer yields higher sample efficiency early in training but fails to reach high performance in the long term, whereas a large or even unlimited buffer may perform poorly in all stages.

A ADDITIONAL DETAILS FOR SECTION 4

This section provides additional details for Section 4, including how to efficiently implement Algorithm 1 with limited resources, the derivation of Eq. (4) in Section 4.2.1, and the full algorithm of the dynamic-capacity buffer in Section 4.2.2.

A.1 EFFICIENT IMPLEMENTATION FOR ALGORITHM 1

For simplicity, in this section we omit the layer index l in the symbols that appear in Algorithm 1. Parameter Storing Suppose layer l takes an n^(in)-dimensional vector x as input, and outputs an n^(out)-dimensional feature vector y via a linear transformation. Then the layer's number of parameters is N = n^(in) × n^(out). A naive implementation stores both θ ∈ R^{n^(out) × n^(in)} and M_θ ∈ {0, 1}^{n^(out) × n^(in)} as two dense n^(out) × n^(in) matrices in memory. In forward and backward propagation, one simply performs a dense matrix-vector multiplication on θ and x. However, this implementation enjoys no speed-up even when the sparsity ratio s is close to 1, i.e., when the network is highly sparse. Also, the actual memory occupied by the model is always proportional to N and independent of s. A better way is to store θ in a more compact manner, keeping only the non-zero indices (i.e., the positions of the ones in M_θ) and their values. As a result, the weights of the layer occupy Θ((1 − s)N) memory, and the matrix-vector multiplication costs only O((1 − s)N + n^(in) + n^(out)). Such sparse matrix (or higher-order tensor) structures are supported by many modern machine learning frameworks, e.g., torch.sparse in PyTorch. Many of them also support automatic gradient calculation and backward propagation. Link Dropping With this sparse representation of θ, the link dropping step of Algorithm 1 (Line 9) can be done in O((1 − s)N log N) time by sorting all the weight entries by their absolute values and then picking the top-K items.
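The compact parameter storing described above can be sketched with a COO-style (row, column, value) representation in NumPy; the class name and interface here are illustrative, not the paper's implementation (which could equally use torch.sparse):

```python
import numpy as np

class SparseLinear:
    """Compact storage for a sparse layer: only the positions of the
    ones in the mask M_theta and their weight values are kept, so a
    layer with sparsity s occupies Theta((1 - s) * N) memory instead
    of N = n_out * n_in.
    """

    def __init__(self, dense_weights, mask):
        self.shape = dense_weights.shape
        # COO representation: parallel arrays of row/col indices and values.
        self.rows, self.cols = np.nonzero(mask)
        self.vals = dense_weights[self.rows, self.cols]

    def matvec(self, x):
        """y = W x touching only the (1 - s) * N stored entries."""
        y = np.zeros(self.shape[0])
        np.add.at(y, self.rows, self.vals * x[self.cols])
        return y
```

Both the forward pass and (by the chain rule) the backward pass then scale with the number of stored entries rather than with the dense layer size.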
Link Growing It remains to implement the link growing step of Algorithm 1 (Line 10). Denote by L the scalar loss function of the whole neural network containing layer l. Assume that we have just performed a backward propagation, where layer l contributes only O((1 − s)N + n^(in) + n^(out)) to the computation time, and g^(x) := ∂L/∂x, g^(y) := ∂L/∂y, and g^(θ) := ∂L/∂θ have all been computed. Since θ is in a compact representation with (1 − s)N elements, g^(θ) obtained by auto-grad also contains only (1 − s)N items, but the growing step in Line 10 asks for the top-K items of the whole dense gradient matrix with N elements. By the chain rule, for a link θ_ji between the i-th input neuron and the j-th output neuron, the partial derivative of L with respect to θ_ji is given by (abusing notation slightly)

g^(θ)_ji := ∂L/∂θ_ji = (∂L/∂y_j)(∂y_j/∂θ_ji) = g^(y)_j x_i.

Hence the desired dense n^(out) × n^(in) gradient matrix g^(θ) equals g^(y) x^T. Our task reduces to collecting the K entries with the largest absolute values while avoiding the locations that have just been dropped in Line 9. This procedure can be efficiently implemented by scanning via n^(out) pointers with the help of a heap (a.k.a. a priority queue), as described in the pseudo-code of Algorithm 2. It can be seen that Algorithm 2 consumes O(n^(out) + n^(in) + |U| + (1 − s)N) heap operations and set operations. If all sets S and U are implemented using binary search trees or hash tables, the cost of each heap operation and each set operation is within O(log N). Therefore, the total running time of Algorithm 2 is O((1 − s)N log N).
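The heap-based scan just described can be sketched as follows: the dense gradient matrix g^(y) x^T is kept implicit, one column pointer is maintained per output row, and a max-heap is popped until K links are collected while skipping excluded (just-dropped) positions. This is an illustrative sketch of the idea behind Algorithm 2, not the authors' code:

```python
import heapq
import numpy as np

def top_k_outer(g_y, x, k, excluded=frozenset()):
    """Collect the k entries of the implicit matrix g_y x^T with the
    largest absolute values, skipping positions in `excluded`, without
    materializing the dense n_out x n_in matrix.

    Within each row j, entries |g_y[j]| * |x[i]| decrease when columns
    are visited in order of decreasing |x_i|, so the global maximum of
    all unvisited entries is always one of the per-row heap candidates.
    """
    order = np.argsort(-np.abs(x))  # column indices by decreasing |x_i|
    heap = [(-abs(g) * abs(x[order[0]]), j, 0) for j, g in enumerate(g_y)]
    heapq.heapify(heap)
    picked = []
    while heap and len(picked) < k:
        _, j, p = heapq.heappop(heap)
        i = order[p]
        if (j, i) not in excluded:
            picked.append((j, i))
        if p + 1 < len(order):  # advance this row's pointer
            heapq.heappush(heap, (-abs(g_y[j]) * abs(x[order[p + 1]]), j, p + 1))
    return picked
```

Each popped entry triggers at most one push, so collecting K links takes O(K) heap operations of cost O(log N) each, matching the stated complexity.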

Algorithm 2 Efficient Link Growing

Input: input feature vector x, output gradient vector g^(y), base index set S_0, dense matrix size n^(out) × n^(in)

A.2 DERIVATION OF EQ. (4)

Consider transitions generated by the behavior policy b, i.e., τ ∼ µ_b(•|s_t, a_t), where π and b denote the current policy and the behavior policy, respectively. Q^π(s, a) denotes the Q function associated with policy π as defined in Eq. (1) in the manuscript, i.e., Q^π(s, a) = E_π[ Σ_{i=t}^T γ^{i−t} r(s_i, a_i) | s_t = s, a_t = a ]. We also use ϵ(s, a) to denote the network fitting error, i.e., ϵ(s, a) = Q(s, a; θ) − Q^π(s, a). Subsequently, we have:

E_b[T_n(s_t, a_t)] − Q^π(s_t, a_t)
= E_b[ Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n Q(s_{t+n}, π(s_{t+n}); θ) ] − E_π[ Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n Q^π(s_{t+n}, π(s_{t+n})) ]
= E_b[ Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n Q(s_{t+n}, π(s_{t+n}); θ) ] − E_π[ Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n Q(s_{t+n}, π(s_{t+n}); θ) ] + E_π[ γ^n (Q(s_{t+n}, π(s_{t+n}); θ) − Q^π(s_{t+n}, π(s_{t+n}))) ]
= (E_b[T_n(s_t, a_t)] − E_π[T_n(s_t, a_t)]) + γ^n E_π[ ϵ(s_{t+n}, π(s_{t+n})) ].

The first equality holds by the definitions in Eq. (1) and of the multi-step TD target. The second equality holds by adding and then subtracting the term E_π[ γ^n Q(s_{t+n}, π(s_{t+n}); θ) ], and the last equality holds by the definitions of T_n(s, a) and ϵ(s, a). This decomposition shows that the expected error consists of two parts, i.e., the network fitting error and the policy inconsistency error, which are well handled by our multi-step TD targets with a dynamic-capacity buffer.

A.3 ALGORITHM OF DYNAMIC-CAPACITY BUFFER IN SECTION 4.2.2

Algorithm 3 presents our formal procedure for dynamically controlling the buffer capacity in Section 4.2.2. At each step, a new transition is inserted into the replay buffer. To avoid policy inconsistency, we check the buffer every ∆_b steps and drop the oldest transitions if needed. Specifically, we first set a hard lower bound B_min and a hard upper bound B_max on the buffer capacity. (i) If the buffer size is below B_min, we store all newly collected data samples. (ii) If the number of buffered transitions has reached B_max, the oldest transitions are replaced by the latest ones. (iii) When the buffer size is in (B_min, B_max), each time we calculate the policy distance between the oldest behavior policy and the current policy, based on the oldest transitions stored in the buffer. If the policy distance exceeds the threshold D_0, the oldest transitions are discarded. The full algorithm is given in Algorithm 3.
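The buffer check can be sketched as follows, combining the policy distance D(B, ϕ) from Section 4.2.2 with the three capacity regimes above. Data structures and function names are illustrative (a sketch of the logic in Algorithm 3, not the authors' implementation):

```python
import numpy as np

def policy_distance(buffer, policy, k):
    """D(B, phi) = (1/K) * sum over the oldest K transitions of
    ||pi(s_i; phi) - a_i||^2: how far the stored actions are from
    what the current deterministic policy would take."""
    oldest = buffer[:k]  # buffer is ordered oldest-first
    return float(np.mean([np.sum((policy(s) - a) ** 2) for s, a in oldest]))

def adjust_buffer(buffer, policy, k, b_min, b_max, d0):
    """One capacity check of the dynamic buffer.

    (i) below B_min nothing is dropped; (ii) the size never exceeds
    B_max (FIFO replacement); (iii) in between, the oldest transitions
    are popped while D(B, phi) exceeds the threshold D_0.
    """
    del buffer[:max(0, len(buffer) - b_max)]  # enforce hard upper bound
    while len(buffer) > b_min and policy_distance(buffer, policy, k) > d0:
        buffer.pop(0)  # FIFO drop of the oldest transition
    return buffer
```

In an actual agent this check would run every ∆_b environment steps, with `policy` being the current actor network.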

A.4 DETAILED ANALYSIS OF DYNAMIC BUFFER

Using a dynamic buffer can reduce the gap between the target policy and the behavior policy, as we have shown empirically in Section 5.2. In this section, we give a more detailed analysis of the influence of the dynamic buffer. We first define the notation used in our analysis. Let d_π(s) = Σ_{t=0}^∞ γ^t Pr(s_t = s), with a_t ∼ π(•|s_t), denote the (discounted) state visitation distribution under policy π, and let ρ^(s)_{b,t+k} denote the state distribution at step t+k under the behavior policy b (with ρ^(s,a)_{b,t+k} the corresponding state-action distribution). Lemma A.1 below first shows that, for two trajectory distributions generated by different policies, their KL divergence can be expressed by the KL divergence between the two policies.

Lemma A.1. D_KL(µ_b(•|s_t, a_t) || µ_π(•|s_t, a_t)) = Σ_{k=1}^n E_{s_{t+k} ∼ ρ^(s)_{b,t+k}}[ D_KL(b(•|s_{t+k}) || π(•|s_{t+k})) ].

Proof. The conditional trajectory distribution can be expressed as:

µ_π(τ | s_t, a_t) = Π_{k=1}^n π(a_{t+k} | s_{t+k}) p(s_{t+k} | s_{t+k−1}, a_{t+k−1}).

Thus,

D_KL(µ_b(•|s_t, a_t) || µ_π(•|s_t, a_t))
= Σ_τ µ_b(τ | s_t, a_t) log [ µ_b(τ | s_t, a_t) / µ_π(τ | s_t, a_t) ]
= Σ_τ µ_b(τ | s_t, a_t) log [ Π_{k=1}^n b(a_{t+k} | s_{t+k}) / Π_{k=1}^n π(a_{t+k} | s_{t+k}) ]
= Σ_{k=1}^n Σ_τ µ_b(τ) log [ b(a_{t+k} | s_{t+k}) / π(a_{t+k} | s_{t+k}) ].

Note that E_{τ ∼ µ_b(•)}[ log (b(a_{t+k}|s_{t+k}) / π(a_{t+k}|s_{t+k})) ] = E_{(s_{t+k}, a_{t+k}) ∼ ρ^(s,a)_{b,t+k}(•)}[ log (b(a_{t+k}|s_{t+k}) / π(a_{t+k}|s_{t+k})) ]; then

D_KL(µ_b(•|s_t, a_t) || µ_π(•|s_t, a_t)) = Σ_{k=1}^n Σ_{(s_{t+k}, a_{t+k})} ρ^(s,a)_{b,t+k} log [ b(a_{t+k}|s_{t+k}) / π(a_{t+k}|s_{t+k}) ] = Σ_{k=1}^n E_{s_{t+k} ∼ ρ^(s)_{b,t+k}}[ D_KL(b(•|s_{t+k}) || π(•|s_{t+k})) ].

Proposition A.2 below shows the relation between the policy inconsistency error defined in Eq. (4) and the policy distance. This proposition shows that multi-step TD learning can indeed be made more robust with a dynamic buffer.

Proposition A.2. The policy inconsistency error in Eq. (4) can be upper bounded by

|E_b[T_n] − E_π[T_n]| ≤ ( ((1 − γ^n)/(1 − γ)) r_m + γ^n Q_m ) √( (1/2) Σ_{k=1}^n E_{s ∼ ρ^(s)_{b,t+k}(•)}[ D_KL(b(•|s) || π(•|s)) ] ),

where r_m = sup r − inf r and Q_m = sup Q(s, a; θ) − inf Q(s, a; θ).

Proof. Since the multi-step TD target is bounded, we have

|E_b[T_n] − E_π[T_n]| = |E_{τ ∼ µ_b(•|s_t,a_t)}[T_n] − E_{τ ∼ µ_π(•|s_t,a_t)}[T_n]| ≤ (sup T_n − inf T_n) D_TV(µ_b(•|s_t, a_t) || µ_π(•|s_t, a_t)).

According to Pinsker's inequality,

D_TV(µ_b(•|s_t, a_t) || µ_π(•|s_t, a_t)) ≤ √( D_KL(µ_b(•|s_t, a_t) || µ_π(•|s_t, a_t)) / 2 ).

Thus,

|E_b[T_n] − E_π[T_n]| ≤ ( Σ_{k=0}^{n−1} γ^k r_m + γ^n Q_m ) √( D_KL(µ_b(•|s_t, a_t) || µ_π(•|s_t, a_t)) / 2 ).

Finally, using Lemma A.1, we obtain

|E_b[T_n] − E_π[T_n]| ≤ ( ((1 − γ^n)/(1 − γ)) r_m + γ^n Q_m ) √( (1/2) Σ_{k=1}^n E_{s ∼ ρ^(s)_{b,t+k}(•)}[ D_KL(b(•|s) || π(•|s)) ] ).

Our next result, Proposition A.3, shows that the mismatch between L(θ) and L̂(θ) can be controlled by reducing the KL divergence between the target policy and the behavior policy. Therefore, one can improve the value estimation by eliminating data inconsistency. Proposition A.3.
For the target policy $\pi$ and behavior policy $b$, we have
$$|L(\theta)-\hat{L}(\theta)| \le \frac{2\gamma}{1-\gamma}\Delta\,\mathbb{E}_{s\sim d_b}\left[\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\pi(\cdot|s),b(\cdot|s))}\right],$$
where $\Delta=\sup|(Q(s,a;\theta)-\mathcal{T}(s,a))^2|$ and the loss functions are defined as
$$L(\theta)=\mathbb{E}_{(s_i,a_i)\sim d_\pi}[(Q(s_i,a_i;\theta)-\mathcal{T}(s_i,a_i))^2], \qquad \hat{L}(\theta)=\mathbb{E}_{(s_i,a_i)\sim d_b}[(Q(s_i,a_i;\theta)-\mathcal{T}(s_i,a_i))^2].$$

Proof. Denote $\Delta=\sup|(Q(s,a;\theta)-\mathcal{T}(s,a))^2|$. Then, we have
$$|L(\theta)-\hat{L}(\theta)| = |\mathbb{E}_{(s_i,a_i)\sim d_b}[(Q(s_i,a_i;\theta)-\mathcal{T}(s_i,a_i))^2]-\mathbb{E}_{(s_i,a_i)\sim d_\pi}[(Q(s_i,a_i;\theta)-\mathcal{T}(s_i,a_i))^2]| \le 2 D_{\mathrm{TV}}(d_\pi,d_b)\,\Delta,$$
i.e., the gap between the two loss functions can be bounded by the total variation distance between the two state-action visitation distributions. According to Achiam et al. (2017), we have
$$D_{\mathrm{TV}}(d_\pi,d_b) \le \frac{\gamma}{1-\gamma}\mathbb{E}_{s\sim d_b}[D_{\mathrm{TV}}(\pi(\cdot|s),b(\cdot|s))].$$
Thus, the loss-function gap can be bounded by the total variation distance between the two policies, i.e.,
$$|L(\theta)-\hat{L}(\theta)| \le \frac{2\gamma}{1-\gamma}\Delta\,\mathbb{E}_{s\sim d_b}[D_{\mathrm{TV}}(\pi(\cdot|s),b(\cdot|s))].$$
With Pinsker's inequality, we can express the upper bound via the KL divergence between the two policies:
$$|L(\theta)-\hat{L}(\theta)| \le \frac{2\gamma}{1-\gamma}\Delta\,\mathbb{E}_{s\sim d_b}\left[\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\pi(\cdot|s),b(\cdot|s))}\right].$$
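Both proofs above hinge on Pinsker's inequality, $D_{\mathrm{TV}}(p\,\|\,q) \le \sqrt{D_{\mathrm{KL}}(p\,\|\,q)/2}$. As a quick numeric sanity check, a minimal pure-Python sketch with two made-up three-action policies (illustrative values, not from the paper):

```python
import math

def tv_distance(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    # D_KL(p || q) in nats; assumes q > 0 wherever p > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

b  = [0.5, 0.3, 0.2]   # hypothetical behavior policy over 3 actions
pi = [0.4, 0.4, 0.2]   # hypothetical target policy

tv = tv_distance(b, pi)
bound = math.sqrt(kl_divergence(b, pi) / 2.0)
assert tv <= bound     # Pinsker: D_TV <= sqrt(D_KL / 2)
```

The same inequality is applied once at the trajectory level (Proposition A.2) and once per state (Proposition A.3).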

B DETAILS OF RLX2 WITH TD3 AND SAC

In this section, we provide pseudo-codes for instantiations of RLx2 on TD3 and SAC in Algorithm 4 and Algorithm 5, respectively. We emphasize that RLx2 is a general sparse training framework for off-policy DRL and can be applied to training other DRL algorithms, apart from TD3 and SAC, with sparse networks from scratch. Below, we first illustrate the critical steps of RLx2 in Algorithm 4, using TD3 as the base algorithm.

Topology evolution is performed in Lines 15-17 and Lines 20-22 of Algorithm 4. Specifically, we first calculate the sparsity of each layer according to the target sparsity of the total model at initialization; the sparsity of each layer is then fixed during training. We use the Erdős-Rényi strategy introduced in Mocanu et al. (2018) to allocate the sparsity across layers. For a sparse network with $L$ layers, this strategy uses the equations below:
$$(1-S)\sum_{l=1}^{L} I_l O_l = \sum_{l=1}^{L} (1-S_l) I_l O_l, \qquad 1-S_l = k\,\frac{I_l+O_l}{I_l O_l},$$
where $S$ is the target sparsity of the model, $S_l$ is the sparsity of the $l$-th layer, $I_l$ and $O_l$ are the input and output dimensionalities of the $l$-th layer, and $k$ is a constant. The motivation for this strategy is that a layer with more parameters contains more redundancy and can therefore be compressed with higher sparsity. The topology evolution update is performed every $\Delta_m$ time steps. Definitions of other hyperparameters related to topology evolution are listed in Algorithm 1 in the manuscript.

Buffer capacity adjustment is performed in Lines 7-9. This adjustment is conducted every $\Delta_b$ steps, with the detailed procedure shown in Algorithm 3.

Multi-step TD target is computed in Lines 10-13. We found that using a multi-step TD target in the early stage of training may result in poor performance, because the policy may evolve quickly, which causes severe policy inconsistency.
Thus, we start the multi-step TD target only when the number of training steps exceeds a pre-set threshold $T_0$. As mentioned in Section 4.2.1, the one-step and multi-step TD targets in TD3 are computed as:
$$\mathcal{T}_1 = r_t + \gamma Q(s_{t+1}, \pi(s_{t+1}); \theta'), \qquad \mathcal{T}_n = \sum_{k=0}^{n-1}\gamma^k r_{t+k} + \gamma^n Q(s_{t+n}, \pi(s_{t+n}); \theta').$$
Note that the calculation of the multi-step TD target in SAC is slightly different from that in TD3. Specifically, the one-step TD target for SAC is computed as:
$$\mathcal{T}_1 = r_t + \gamma\big(Q(s_{t+1}, \tilde{a}_{t+1}; \theta') - \alpha\log\pi(\tilde{a}_{t+1}|s_{t+1})\big),$$
where $\tilde{a}_{t+1}\sim\pi(\cdot|s_{t+1})$, and the $n$-step TD target for SAC is computed as:
$$\mathcal{T}_n = \sum_{k=0}^{n-1}\gamma^k r_{t+k} + \gamma^n Q(s_{t+n}, \tilde{a}_{t+n}; \theta') - \alpha\sum_{k=0}^{n-1}\gamma^{k+1}\log\pi(\tilde{a}_{t+k+1}|s_{t+k+1}),$$
where $\tilde{a}_{t+k+1}\sim\pi(\cdot|s_{t+k+1})$ for $k = 0, 1, \dots, n-1$. Due to this difference, we will see later in Section C.3 that the resulting FLOPs are slightly different for TD3 and SAC. Besides, Algorithm 5 gives the detailed implementation of RLx2 with SAC, where topology evolution is performed in Lines 15-17 and Lines 20-22, buffer capacity adjustment is performed in Lines 7-9, and the multi-step TD target is computed in Lines 10-13.

Algorithm 4 RLx2-TD3
1: Initialize sparse critic networks Q_{θ_1}, Q_{θ_2} and a sparse actor network π_φ with random parameters θ_1, θ_2, φ and random masks M_{θ_1}, M_{θ_2}, M_φ with determined sparsities S^{(c)}, S^{(a)}.
2: θ_1 ← θ_1 ⊙ M_{θ_1}, θ_2 ← θ_2 ⊙ M_{θ_2}, φ ← φ ⊙ M_φ // Start with a random sparse network
3: Initialize target networks θ′_1 ← θ_1, θ′_2 ← θ_2, φ′ ← φ. Initialize the replay buffer B.
4: for t = 1 to T do
5:   Select an action with exploration noise a_t ∼ π_φ(s_t) + ε, ε ∼ N(0, σ), and observe reward r_t and new state s_{t+1}
6:   Store the transition tuple (s_t, a_t, r_t, s_{t+1}) in B
7:   if t mod Δ_b = 0 then
8:     Buffer capacity adjustment
9:   end if // Check the buffer periodically
10:  Set N = 1 temporarily if t < T_0 // Delay the use of the multi-step TD target
11:  Sample a mini-batch of B multi-step transitions (s_i, a_i, r_i, s_{i+1}, a_{i+1}, …, s_{i+N}) from B
12:  ã ← π_{φ′}(s_{i+N}) + ε, ε ∼ clip(N(0, σ), −c, c)
13:  Calculate the multi-step TD target y ← Σ_{k=0}^{N−1} γ^k r_{i+k} + γ^N min_{j=1,2} Q_{θ′_j}(s_{i+N}, ã) // Multi-step TD target
14:  Update the critic networks θ_j ← θ_j − λ∇_{θ_j} (1/B) Σ_i (y − Q_{θ_j}(s_i, a_i))^2 for j = 1, 2
15:  if t mod Δ_m = 0 then
16:    Topology Evolution(Q_{θ_j}) for j = 1, 2
17:  end if // Update the mask of the critic periodically
18:  if t mod d = 0 then
19:    Update the actor network φ ← φ − λ∇_φ (−(1/B) Σ_i Q_{θ_1}(s_i, π_φ(s_i)))
20:    if t/d mod Δ_m = 0 then
21:      Topology Evolution(π_φ)
22:    end if // Update the mask of the actor periodically
23:    Update the target networks: θ′_j ← τθ_j + (1−τ)θ′_j, φ′ ← τφ + (1−τ)φ′, θ′_j ← θ′_j ⊙ M_{θ_j}, φ′ ← φ′ ⊙ M_φ // Target networks are also sparsified with the same masks
24:  end if
25: end for

For both TD3 and SAC, double Q-learning is adopted, i.e., two value networks are trained concurrently. We also use target networks in our implementations: target critics for both TD3 and SAC, and a target actor only for TD3. Thus, denoting M_Actor and M_Critic as the model sizes of the actor and critic, respectively, the detailed calculation of model sizes is shown in the second column of Table 5, where the model size of a sparse network with L fully-connected layers is Σ_{l=1}^L (1−S_l) I_l O_l, with S_l the sparsity, I_l the input dimensionality, and O_l the output dimensionality of the l-th layer. Specifically, the "Total Size" column in Table 2 in the manuscript refers to the model size including both actor and critic networks during training. We denote B as the batch size used in the training process.
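The Erdős-Rényi sparsity allocation described above can be sketched in a few lines of Python. The layer dimensions below are hypothetical, and the sketch assumes the target sparsity is moderate enough that no layer's density exceeds 1 (practical implementations keep such layers dense and recompute the constant):

```python
def erdos_renyi_sparsities(layer_dims, target_sparsity):
    """Allocate per-layer sparsities S_l so the whole model hits the target
    sparsity S, with density 1 - S_l proportional to (I_l + O_l) / (I_l * O_l).

    layer_dims: list of (I_l, O_l) pairs for each fully-connected layer.
    """
    total_params = sum(i * o for i, o in layer_dims)
    # Solve (1 - S) * sum(I*O) = k * sum(I + O) for the constant k.
    k = (1.0 - target_sparsity) * total_params / sum(i + o for i, o in layer_dims)
    return [1.0 - k * (i + o) / (i * o) for i, o in layer_dims]

# Example: a 3-layer MLP with hypothetical dimensions, 90% target sparsity.
dims = [(17, 256), (256, 256), (256, 256)]
sparsities = erdos_renyi_sparsities(dims, 0.9)
# The global sparsity constraint holds exactly by construction.
kept = sum((1 - s) * i * o for s, (i, o) in zip(sparsities, dims))
total = sum(i * o for i, o in dims)
assert abs(kept / total - 0.1) < 1e-9
```

Note how the small first layer ends up much denser than the wide hidden layers, reflecting the intuition that larger layers contain more redundancy.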
For a sparse network with $L$ fully-connected layers, the required FLOPs for a forward pass are computed as follows (also adopted in Evci et al. (2020) and Molchanov et al. (2019a)):
$$\mathrm{FLOPs} = \sum_{l=1}^{L} (1-S_l)(2I_l-1)O_l,$$
where $S_l$ is the sparsity, $I_l$ is the input dimensionality, and $O_l$ is the output dimensionality of the $l$-th layer. Denote FLOPs_Actor and FLOPs_Critic as the FLOPs required in a forward pass of a single actor and critic network, respectively. The inference FLOPs are exactly FLOPs_Actor, as shown in the last column of Table 5. The training FLOPs consist of multiple forward and backward passes through several networks, which will be detailed below; in particular, we compute the FLOPs needed for each training iteration. We omit the FLOPs of the following processes since they have little influence on the final result. (i) Interaction with the environment: each time the agent selects an action to interact with the environment, it takes FLOPs_Actor, which is much smaller than the FLOPs needed for updating the networks (shown in Table 5) since $B \gg 1$. (ii) Updating target networks: every parameter in the networks is updated as $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, so the number of FLOPs here equals the model size, which is also negligible. (iii) Topology evolution and buffer capacity adjustment: these two components are performed every 10000 steps. Formally, the average FLOPs of topology evolution are given by $B \times \frac{2\,\mathrm{FLOPs_{Actor}}}{(1-S^{(a)})\Delta_m}$ for the actor and $B \times \frac{4\,\mathrm{FLOPs_{Critic}}}{(1-S^{(c)})\Delta_m}$ for the critic, where $S^{(a)}$ and $S^{(c)}$ are the sparsities of the actor and critic, respectively. The FLOPs of buffer capacity adjustment are $8B \times \frac{\mathrm{FLOPs_{Actor}}}{\Delta_b}$, where the factor $8B$ arises because we use the oldest $8B$ transitions to compute the policy distance. Both are thus negligible. Therefore, we focus on the FLOPs of updating the actor and critic.
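The forward-pass FLOPs formula above can be written directly as a function; the MLP dimensions in the example are hypothetical:

```python
def forward_flops(layer_dims, sparsities):
    # FLOPs of one forward pass: sum over layers of (1 - S_l) * (2*I_l - 1) * O_l.
    # Each output unit costs I_l multiplies + (I_l - 1) adds, scaled by density.
    return sum((1 - s) * (2 * i - 1) * o
               for (i, o), s in zip(layer_dims, sparsities))

# Hypothetical actor MLP 17 -> 256 -> 256 -> 6, uniformly 90% sparse.
dims = [(17, 256), (256, 256), (256, 6)]
dense_flops = forward_flops(dims, [0.0, 0.0, 0.0])
sparse_flops = forward_flops(dims, [0.9, 0.9, 0.9])
assert abs(dense_flops / sparse_flops - 10.0) < 1e-9
```

With uniform sparsity the density factors out, so a 90%-sparse network needs exactly one tenth of the dense forward FLOPs.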
The average FLOPs of updating the actor and critic are given by:
$$\mathrm{FLOPs_{train}} = \mathrm{FLOPs_{update\ critic}} + \frac{1}{d}\,\mathrm{FLOPs_{update\ actor}},$$
where $d$ is the actor update interval (2 for TD3 and 1 for SAC in our implementations). Next, we calculate the FLOPs of updating the actor and critic, i.e., FLOPs_update critic and FLOPs_update actor. We first focus on TD3; the calculation for SAC is similar.

Training FLOPs calculation in TD3. (i) Critic FLOPs: Recall that the two critics $\theta_1$ and $\theta_2$ in TD3 are updated as
$$\theta_j \leftarrow \theta_j - \lambda\nabla_{\theta_j}\frac{1}{B}\sum_i(\mathcal{T}_n - Q(s_i,a_i;\theta_j))^2, \quad j=1,2,$$
where $B$ is the batch size, the $n$-step TD target is $\mathcal{T}_n = \sum_{k=0}^{N-1}\gamma^k r_{i+k} + \gamma^N \min_{j=1,2} Q(s_{i+N}, \tilde{a}; \theta'_j)$, and $\theta'_j$ refers to the target network. We can then decompose the FLOPs of updating the critic as:
$$\mathrm{FLOPs_{update\ critic}} = \mathrm{FLOPs_{TD\ target}} + \mathrm{FLOPs_{compute\ loss}} + \mathrm{FLOPs_{backward\ pass}}, \quad (14)$$
where the three terms refer to the numbers of FLOPs for computing the TD targets (forward pass), the loss function (forward pass), and the gradients (backward propagation), respectively. By Eq. (12) and Eq. (13) we have:
$$\mathrm{FLOPs_{TD\ target}} = B\times(\mathrm{FLOPs_{Actor}} + 2\,\mathrm{FLOPs_{Critic}}), \qquad \mathrm{FLOPs_{compute\ loss}} = B\times 2\,\mathrm{FLOPs_{Critic}}. \quad (15)$$
For the backward-propagation term, FLOPs_backward pass, we count it as two times the computational expense of the forward pass, following existing literature (Evci et al., 2020), i.e.,
$$\mathrm{FLOPs_{backward\ pass}} = B\times 2\times 2\,\mathrm{FLOPs_{Critic}}, \quad (16)$$
where the extra factor 2 comes from the cost of double Q-learning. Combining Eq. (14), Eq. (15), and Eq. (16), the FLOPs of updating the critic in TD3 are:
$$\mathrm{FLOPs_{update\ critic}} = B\times(\mathrm{FLOPs_{Actor}} + 8\,\mathrm{FLOPs_{Critic}}).$$
(ii) Actor FLOPs: Recall that the actor (parameterized by $\phi$) in TD3 is updated as
$$\phi \leftarrow \phi - \lambda\nabla_\phi\Big(-\frac{1}{B}\sum_i Q(s_i, a_i; \theta_1)\Big),$$
where $\theta_1$ refers to a critic network.
Subsequently, we compute the FLOPs of updating the actor as:
$$\mathrm{FLOPs_{update\ actor}} = \mathrm{FLOPs_{compute\ loss}} + \mathrm{FLOPs_{backward\ pass}},$$
and, similar to the critic update, we have:
$$\mathrm{FLOPs_{compute\ loss}} = B\times(\mathrm{FLOPs_{Actor}} + \mathrm{FLOPs_{Critic}}), \qquad \mathrm{FLOPs_{backward\ pass}} = B\times 2\,\mathrm{FLOPs_{Actor}}.$$
Combining the equations above, the FLOPs of updating the actor in TD3 are:
$$\mathrm{FLOPs_{update\ actor}} = B\times(3\,\mathrm{FLOPs_{Actor}} + \mathrm{FLOPs_{Critic}}).$$

Training FLOPs calculation in SAC. (i) Critic FLOPs: The calculations of FLOPs for SAC are similar to those for TD3. The critic in SAC is updated as:
$$\theta_j \leftarrow \theta_j - \lambda\nabla_{\theta_j}\frac{1}{B}\sum_i(\mathcal{T}_n - Q_{\theta_j}(s_i,a_i))^2, \quad j=1,2,$$
where $B$ is the batch size, the $n$-step TD target is
$$\mathcal{T}_n = \sum_{k=0}^{N-1}\gamma^k r_{i+k} + \gamma^N\min_{j=1,2}Q_{\theta'_j}(s_{i+N},\tilde{a}_{i+N}) - \alpha\sum_{k=0}^{N-1}\gamma^{k+1}\log\pi(\tilde{a}_{i+k+1}|s_{i+k+1}),$$
and $\theta'_j$ refers to the target network. Since the multi-step TD target in SAC is computed slightly differently from that in TD3, we have:
$$\mathrm{FLOPs_{TD\ target}} = B\times(2\,\mathrm{FLOPs_{Actor}} + 2\,\mathrm{FLOPs_{Critic}}).$$
The other terms of the critic update are the same as those in Eq. (14) for TD3. Thus, the FLOPs of updating the critic in SAC are:
$$\mathrm{FLOPs_{update\ critic}} = B\times(2\,\mathrm{FLOPs_{Actor}} + 8\,\mathrm{FLOPs_{Critic}}).$$
(ii) Actor FLOPs: The actor in SAC is updated as:
$$\phi \leftarrow \phi - \lambda\nabla_\phi\Big(-\frac{1}{B}\sum_i\min_{j=1,2}Q_{\theta_j}(s_i,a_i)\Big),$$
where $\theta_j$ refers to a critic network. Subsequently:
$$\mathrm{FLOPs_{compute\ loss}} = B\times(\mathrm{FLOPs_{Actor}} + 2\,\mathrm{FLOPs_{Critic}}).$$
The backward-pass FLOPs are the same as in TD3, i.e., $\mathrm{FLOPs_{backward\ pass}} = B\times 2\,\mathrm{FLOPs_{Actor}}$. Thus, the FLOPs of updating the actor in SAC are:
$$\mathrm{FLOPs_{update\ actor}} = B\times(3\,\mathrm{FLOPs_{Actor}} + 2\,\mathrm{FLOPs_{Critic}}).$$
Table 6 shows the relative average FLOPs of each iteration for different algorithms, where the FLOPs of training a sparse network without any of these three methods are normalized to 1×. The sparsity is set to the average sparsity in different environments. The additional computations induced by topology evolution and the dynamic buffer are negligible.
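Combining the per-update formulas above gives the per-iteration training FLOPs. The sketch below is under the same accounting assumptions (backward pass counted as twice the forward pass, auxiliary costs neglected); the function names are ours, and `f_actor`/`f_critic` stand for FLOPs_Actor and FLOPs_Critic computed via the forward-pass formula:

```python
def td3_train_flops(f_actor, f_critic, batch, d=2):
    # TD3: critic update B*(F_a + 8*F_c); actor update B*(3*F_a + F_c) every d steps.
    critic = batch * (f_actor + 8 * f_critic)
    actor = batch * (3 * f_actor + f_critic)
    return critic + actor / d

def sac_train_flops(f_actor, f_critic, batch, d=1):
    # SAC: critic update B*(2*F_a + 8*F_c); actor update B*(3*F_a + 2*F_c) every d steps.
    critic = batch * (2 * f_actor + 8 * f_critic)
    actor = batch * (3 * f_actor + 2 * f_critic)
    return critic + actor / d

# Every term is linear in the two forward-pass costs, so sparsifying both
# networks by 90% cuts per-iteration training FLOPs by the same 10x.
dense = td3_train_flops(1.0, 1.0, batch=256)
sparse = td3_train_flops(0.1, 0.1, batch=256)
assert abs(dense / sparse - 10.0) < 1e-9
```

This linearity is why the reported training-FLOPs reduction tracks the model sparsity so closely.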
Using multi-step TD learning also does not increase computation in TD3, and only introduces a small extra computation in SAC (< 5%), as analyzed above.

Figure 8 shows the improvement of the learned sparse network topology due to robust value learning. We see that, in addition to Ant-v3, significant improvements are also achieved in Hopper-v3 and Walker2d-v3. The only exception is HalfCheetah-v3. We hypothesize that this environment is easier for the agent to learn (its performance improves much faster in the early stage than in the other three environments), so a good mask can be found comparatively easily. Systematically analyzing this phenomenon is an interesting direction for future work.

Table 7 shows the performance of different algorithms on four MuJoCo environments with standard deviations. Each result is computed over 8 random seeds. RLx2 does not incur a larger variance with topology evolution.

In this section, we provide a detailed sensitivity analysis for the new hyperparameters used in RLx2, including the initial mask update fraction ζ, the mask update interval ∆_m, the buffer adjustment interval ∆_b, the buffer policy distance threshold D_0, and the multi-step delay T_0.

Initial mask update fraction. Table 9 shows the performance with different initial mask update fractions (denoted ζ) in different environments. We also include the special case of keeping the mask static, i.e., ζ = 0. From Table 9, we find that the sensitivity to the initial mask update fraction is similar across environments. Moreover, RLx2 achieves better performance with a large initial mask update fraction; there is no apparent performance degradation even when ζ is set to 0.9, which may be due to the update-fraction annealing scheme.

Mask update interval. Table 10 shows the performance with different mask update intervals (denoted ∆_m) in different environments. We also include the special case of never updating the mask, i.e., ∆_m = ∞.
From Table 10, we find that the sensitivity to the mask update interval is similar across environments. Table 10 also shows that a small mask update interval reduces performance, since adjusting the mask too frequently may drop critical connections before the optimizer has updated their weights to large values. On the contrary, a large mask update interval reduces the impact of topology evolution but degenerates toward training a static sparse network. In general, a moderate value of 1 × 10^4 is favoured.

Buffer adjustment interval. Table 11 shows the performance with different buffer adjustment intervals (denoted ∆_b) in different environments. We also include the special case of never adjusting the buffer capacity, i.e., ∆_b = ∞. From Table 11, we find that the sensitivity to the buffer adjustment interval is similar across environments, and the performance is not sensitive to ∆_b. We only observe an apparent performance degradation when the buffer adjustment interval is so large that the policy distance cannot be reduced promptly.

Buffer policy distance threshold. Table 12 shows the performance with different buffer policy distance thresholds (denoted D_0) in different environments. We also include the special case of never adjusting the buffer capacity, i.e., D_0 = ∞. From Table 12, we find that the sensitivity to the buffer policy distance threshold is similar across environments. A very small threshold may reduce performance, while the dynamic-capacity buffer improves performance over a wide range of thresholds, especially when D_0 = 0.1 or 0.2.

Multi-step delay. Table 13 shows the performance with different multi-step delays (denoted T_0) in different environments. We also include the special cases of no delay, i.e., T_0 = 0, and delaying forever, i.e., T_0 = ∞.
From Table 13, we find that the sensitivity to the multi-step delay is similar across environments. Compared to the multi-step method without delay, performance is improved by using a delay mechanism in the early stage of training, which also outperforms the one-step scheme. In addition, the performance is not very sensitive to the multi-step delay T_0, as the performance gains with different delays are similar.

Table 13: Sensitivity analysis on multi-step delay.

Environment | No delay (T_0 = 0) | T_0 = 1 × 10^5


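The delayed multi-step TD target (Lines 10-13 of Algorithm 4) can be sketched as follows. The function signature and the value of T_0 are illustrative, and `q_values[m]` stands in for the target-critic estimate min_j Q(s_{t+m}, ·; θ'_j):

```python
def delayed_nstep_target(rewards, q_values, t, gamma, n, t0):
    """Delayed multi-step TD target (hedged sketch, not the authors' code).

    rewards:  r_t, r_{t+1}, ... along the sampled transition sequence.
    q_values: q_values[m] is the target-critic value at s_{t+m}.
    Before training step t0 the target falls back to one step (N = 1)
    to avoid policy inconsistency early in training.
    """
    steps = 1 if t < t0 else n
    # Discounted reward sum over the first `steps` transitions...
    ret = sum(gamma ** k * rewards[k] for k in range(steps))
    # ...plus the bootstrapped value at s_{t+steps}.
    return ret + gamma ** steps * q_values[steps]

rewards = [1.0, 1.0, 1.0]          # r_t, r_{t+1}, r_{t+2} (toy values)
q_values = [0.0, 2.0, 2.0, 2.0]    # toy target-critic values at s_t .. s_{t+3}
early = delayed_nstep_target(rewards, q_values, t=100, gamma=0.99, n=3, t0=50_000)
late = delayed_nstep_target(rewards, q_values, t=60_000, gamma=0.99, n=3, t0=50_000)
assert abs(early - (1.0 + 0.99 * 2.0)) < 1e-9   # one-step while t < T_0
```

After the threshold, the same call returns the full n-step return, matching the switch in Line 10 of the pseudocode.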

Figure 2 caption: Four schemes: 1) training with a random static sparse network (SS); 2) training with RigL (RigL); 3) dynamic sparse training guided by the true Q-value, i.e., Q-values from a fully trained expert critic with a dense network (RigL+Q*); and 4) dynamic sparse training guided by the learned Q-value with TD targets (RLx2).

Frankle & Carbin (2019) obtain the mask, i.e., the "lottery ticket", by pruning a pretrained dense model. Our sparse mask is the final mask obtained by dynamic sparse training.

5 EXPERIMENTS

In this section, we investigate the performance improvement of RLx2 in Section 5.1 and the importance of each component in RLx2 in Section 5.2. In particular, we pay extra attention to the role

A more complex environment with a larger state space, Humanoid-v3, is also evaluated in Appendix C.9. Take Algorithm 4 in Appendix B as an example: only Lines 16 and 21, "Topology Evolution(·)", are changed, while the other parts remain unchanged. We also regard a static topology as a special case of topology evolution. In BC, the actor network is trained under the guidance of a well-trained expert instead of the critic network.



Figure 4: Dynamic buffer capacity & policy inconsistency

Figure 5: Performance comparison under different model sparsity.

Acceleration in FLOPs. Different from knowledge-distillation/BC-based methods, e.g., Livne & Cohen (2020); Vischer et al. (2022); Lee et al. (2021), RLx2 uses a sparse network throughout training. Thus, it has the additional advantage of greatly accelerating training and saving computation, i.e., 12× training acceleration and 20× inference acceleration for RLx2-TD3, and 7× training acceleration and 12× inference acceleration for RLx2-SAC.

Figure 7: Comparison of different sparse network architectures for training a sparse DRL agent, where the sparsity is the same as that in Table 2.

performance, which is comparable with that under the original dense model. Due to the potential data inconsistency problem in value learning and the smaller hypothesis search space under sparse networks, training with a single fixed topology does not fully reap the benefit of high sparsity and can cause significantly degraded performance. This is why the winning ticket and random ticket both incur significant performance loss compared to RLx2. On the other hand, Figure 7(b) shows that in BC tasks, the winning ticket and RLx2 perform almost the same as the dense model, while the random ticket performs worst. This indicates that an appropriate fixed topology can indeed be sufficient to reach satisfactory performance in BC, which is intuitive since BC adopts a supervised learning approach and eliminates the non-stationarity due to bootstrapping training. In conclusion, we find that a fixed winning ticket can perform as well as a dynamic topology that evolves during training in behavior cloning, while RLx2 outperforms the winning ticket in RL training. This observation indicates that topology evolution not only helps find the winning ticket in sparse DRL training but is also necessary for training a sparse DRL agent, due to the extra non-stationarity in bootstrapping training compared to deep supervised learning.

p(s′|s, a): Environment transition probability
ρ^{(s)}_{π,t}: State distribution at time t under policy π
ρ^{(s,a)}_{π,t}: State-action pair distribution at time t under policy π
μ_π(τ): Distribution of trajectory τ = (s_t, a_t, s_{t+1}, …, s_{t+n}, a_{t+n}) under policy π
d_π: State-action visitation distribution under policy π

Figure 8: Comparison of different sparse topologies learned in HalfCheetah-v3, Hopper-v3, and Walker2d-v3.

C.5 TRAINING CURVES OF COMPARATIVE EVALUATION IN SECTION 5.1

Figure 9 and Figure 10 show the training curves of different algorithms in four MuJoCo environments. RLx2 outperforms the baseline algorithms in all four environments with both TD3 and SAC.

Figure 9: Training processes of RLx2-TD3 on four MuJoCo environments. The performance is calculated as the average reward per episode over the last 30 evaluations of the training.

Figure 14: Visualization of the binary masks of the actor.

Comparison of different sparse training techniques in DRL.

algorithm, which is capable of training a sparse DRL model from scratch. An overview of the RLx2 framework on an actor-critic architecture is shown in Figure 1. To motivate the design of RLx2, we present a comparison of four sparse DRL training methods using TD3 with different topology update schemes on InvertedPendulum-v2, a simple control task from MuJoCo, in Figure 2.

1: N_l: Number of parameters in layer l
2: θ_l: Parameters in layer l
3: M_{θ_l}: Sparse mask of layer l

Comparisons of RLx2 with sparse training baselines. Here "Sp." refers to the sparsity level (percentage of model size reduced), and "Total Size" refers to the total number of parameters of both critic and actor networks (detailed calculations of training and inference FLOPs are given in Appendix C.3). The rightmost five columns show the final performance of different methods. "Total Size", "FLOPs", and "Performance" are all normalized w.r.t. the original large dense model (detailed in Appendix C.2).

environment by more than 96%, and the critic is compressed by 85%-95%. The results for SAC are similar. RLx2 with SAC achieves a 5×-20× model compression.

Implementation details (including an efficient implementation of RLx2, implementation details of the dynamic buffer, hyperparameters, and network architectures) are included in Appendix C for reproduction. The proof for our analysis of the dynamic buffer can be found in Appendix A.4. The code is open-sourced at https://github.com/tyq1024/RLx2.

Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP. In International Conference on Learning Representations, 2020.

Hongjie Zhang, Zhuocheng He, and Jing Li. Accelerating the deep reinforcement learning with neural network compression.
In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1-8. IEEE, 2019.

Input: An index set S_0, weight values x ∈ R^N, target sparsity s, forbidden index set U
Output: An index set S such that |S| = (1 − s)N, S ⊇ S_0, and S ∩ U = ∅
1: Initialize S ← S_0.
2: Sort |x| to get a permutation σ_1, …, σ_{n^{(in)}} such that |x_{σ_1}| ≥ |x_{σ_2}| ≥ … ≥ |x_{σ_{n^{(in)}}}|.

Denote p_π(s_{t+1}, a_{t+1}, …, s_{t+n}, a_{t+n} | s_t, a_t) the distribution of the trajectory starting from the current state s_t and action a_t under policy π. For simplicity, we use E_π to denote E_{(s_{t+1},a_{t+1},…,s_{t+n},a_{t+n})∼p_π(·|s_t,a_t)}, and E_b to denote E_{(s_{t+1},a_{t+1},…,s_{t+n},a_{t+n})∼p_b(·|s_t,a_t)}.
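The magnitude-based selection step in the algorithm fragment above (extend S with the largest-|x| indices while avoiding a forbidden set) can be sketched as follows; the function name and example values are ours:

```python
def grow_to_density(x, selected, forbidden, target_sparsity):
    # Extend `selected` with the largest-|x| indices not in `forbidden`
    # until (1 - s) * N indices are kept, mirroring the selection step above.
    n = len(x)
    keep = round((1 - target_sparsity) * n)
    chosen = set(selected)
    # Visit indices in order of decreasing magnitude (the permutation sigma).
    for idx in sorted(range(n), key=lambda i: -abs(x[i])):
        if len(chosen) >= keep:
            break
        if idx in chosen or idx in forbidden:
            continue
        chosen.add(idx)
    return chosen

# Hypothetical weight magnitudes; index 1 is forbidden, index 0 pre-selected.
x = [0.1, -3.0, 0.5, 2.0, -0.2, 0.05]
assert grow_to_density(x, {0}, {1}, target_sparsity=0.5) == {0, 2, 3}
```

Index 1 has the largest magnitude but is skipped because it lies in the forbidden set, so the next-largest free indices (3 and 2) are taken instead.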

Target networks are also sparsified with the same mask

Table 4: Hyperparameters of RLx2-TD3 and RLx2-SAC.

Table 5: FLOPs and model size for RLx2-TD3 and RLx2-SAC.



Table 7: Results in Table 2 with standard deviations.

shows the performance of four environments with different buffer capacities. Consistent with the results in Section 5.2, a buffer that is either too small or too large can result in poor performance. Our dynamic buffer outperforms buffers with a fixed capacity.

Table 9: Sensitivity analysis on initial mask update fraction.
Environment | Static Sparse | ζ = 0.1 | ζ = 0.3 | ζ = 0.5 | ζ = 0.7 | ζ = 0.9

Table 10: Sensitivity analysis on mask update interval.

Table 11: Sensitivity analysis on buffer adjustment interval.

Table 12: Sensitivity analysis on buffer policy distance threshold.
Environment | D_0 = 0.05 | D_0 = 0.1 | D_0 = 0.2 | D_0 = 0.3 | D_0 = 0.5

ACKNOWLEDGEMENTS

The work is supported by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2020AAA0108400 and 2020AAA0108403, the Tsinghua University Initiative Scientific Research Program, and Tsinghua Precision Medicine Foundation 10001020109.

Supplementary Materials

Algorithm 5 RLx2-SAC (the overall structure parallels Algorithm 4)
1: Initialize sparse critic networks Q_{θ_1}, Q_{θ_2} and a sparse actor network π_φ with random parameters θ_1, θ_2, φ and random masks M_{θ_1}, M_{θ_2}, M_φ with determined sparsities S^{(c)}, S^{(a)}.
15: if t mod Δ_m = 0 then
16:   Topology Evolution(Q_{θ_j}) for j = 1, 2
17: end if // Update the mask of the critic periodically
18: Update the actor network φ ← φ − λ∇_φ (−(1/B) Σ_i min_{j=1,2} Q_{θ_j}(s_i, ã_i))
19: Automating entropy adjustment: update the temperature α
20: if t mod Δ_m = 0 then
21:   Topology Evolution(π_φ)
22: end if // Update the mask of the actor periodically
23: Update the target networks // Target networks are also sparsified with the same masks
24: end for

C EXPERIMENTAL DETAILS

We provide more experimental details in this section, including the detailed experimental setup, the calculations of model size and FLOPs, and supplementary experiment results.

C.1 HARDWARE SETUP

Our experiments are implemented with PyTorch (Paszke et al., 2017) and run on 8x P100 GPUs. Each run needs 12 hours for TD3 and 2 days for SAC for three million steps. The code will be open-sourced upon publication of the paper.

C.2 HYPERPARAMETER SETTINGS FOR REPRODUCTION

Table 4 presents detailed hyperparameters of RLx2-TD3 and RLx2-SAC in our experiments.

C.3 CALCULATION OF MODEL SIZE AND FLOPS

We present the details of calculating model sizes and FLOPs in this subsection, where we focus on fully-connected layers since the networks used in our experiments are all multilayer perceptrons (MLPs). These calculations can be easily extended to convolutional layers and other architectures. Besides, we omit the offset terms of fully-connected layers in our calculations.

C.3.1 MODEL SIZE

We first illustrate the calculation of model sizes, i.e., the total number of parameters in the model. For a sparse network with $L$ fully-connected layers, we calculate the model size as:
$$\mathrm{Size} = \sum_{l=1}^{L} (1-S_l)\, I_l O_l,$$
where $S_l$ is the sparsity, $I_l$ is the input dimensionality, and $O_l$ is the output dimensionality of the $l$-th layer.
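Equivalently, since every layer carries a binary mask, the model size can be counted directly as the number of ones across all masks; the tiny masks below are illustrative:

```python
def sparse_model_size(masks):
    # Remaining parameters: total number of ones across all layer masks
    # (offset terms omitted, as in the calculation above).
    return sum(sum(row.count(1) for row in mask) for mask in masks)

# Toy binary masks for a hypothetical 2-layer network.
masks = [
    [[1, 0, 1], [0, 0, 1]],    # layer 1: 3 of 6 weights kept
    [[1, 1], [0, 1], [1, 0]],  # layer 2: 4 of 6 weights kept
]
assert sparse_model_size(masks) == 7
```

Counting mask entries and evaluating $\sum_l (1-S_l) I_l O_l$ give the same number by definition of the per-layer sparsity.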

C.9 ADDITIONAL RESULTS IN HUMANOID-V3

In this subsection, we investigate the effect of RLx2 in Humanoid-v3, one of the control tasks from MuJoCo. Humanoid-v3 is considered relatively complex due to the high input dimensionality (376). Thus, apart from the standard 256 neurons in each hidden layer (same as other environments), we also train a dense model with 1024 neurons in each hidden layer. As shown in Table 14 , the model with more hidden parameters (1024 hidden dimensions) does not achieve a better performance than a small model (256 hidden dimensions). This implies that the latter one seems to have sufficient capacity for the control task in Humanoid-v3. In addition, Table 14 shows the performance of the sparse models trained with RLx2 and other baseline algorithms, where RLx2 succeeds in training a highly sparse model in Humanoid-v3 with performance degradation less than 5%. In particular, RLx2 can achieve sparsity of around 90% in a small model with only 256 hidden neurons, showing its robustness for complex control tasks. Also, RLx2 outperforms most of the baseline algorithms. Although SET shows comparable performance with RLx2, RLx2 shows much higher sample efficiency than SET according to Figure 11 . When applying RLx2 to larger models with 1024 neurons in each hidden layer, we find that it still performs well with extremely high sparsity (around 99%). We calculate the number of parameters of the two sparse models and find they are very close, as shown in Table 14 . It suggests that RLx2 is an effective way to train a sparse model with the least parameters, and is robust to the hidden width of the dense model counterpart.

C.10 SUPPLEMENTARY RESULTS FOR INVESTIGATION OF THE LOTTERY TICKET HYPOTHESIS

We provide additional experiments with TD3 for investigation of the lottery ticket hypothesis (LTH) in the other three environments, including HalfCheetah-v3, Hopper-v3, and Walker2d-v3. As shown in Figure 12 , winning tickets fail to achieve the same performance as the dynamic topology in the reinforcement learning setting, while they all perform well under behavior cloning. This shows the necessity for reinforcement learning to adjust the network structure during the training process. 

C.11 VISUALIZATION OF SPARSE MODELS

In this section, we show visualizations of the sparse networks obtained by RLx2 in our experiments. Note that each layer of the sparse network in our implementation is bound to a binary mask M_{θ_l}, i.e., θ_l = θ_l ⊙ M_{θ_l} for the l-th layer. In the rest of this section, we investigate the properties of the binary masks with visualization. By visualizing the final mask after training, we find that RLx2 drops redundant dimensions of the raw inputs adaptively and efficiently, as shown in Figure 13. In Figure 14, we further show visualizations of the raw binary masks (i.e., matrices whose entries represent neuron connections) of the actor in Ant-v3, where a black dot denotes an existing connection between the corresponding neurons and a white dot means there is no connection. As mentioned above, only very few connections are kept for the redundant input dimensions. We also find that the connections in the hidden layers tend to concentrate on a subset of neurons, showing that different neurons can play different roles in representations.
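A simple way to reproduce the "dropped input dimension" observation without plotting is to count the remaining first-layer connections per input dimension; the mask below is a hypothetical toy example, with rows as input dimensions and columns as hidden units:

```python
def input_dim_usage(mask):
    # Remaining connections per raw input dimension in the first-layer mask.
    # Rows with a (near-)zero count correspond to input dimensions that
    # topology evolution has effectively dropped.
    return [sum(row) for row in mask]

# Hypothetical 4-input, 3-hidden-unit first-layer binary mask.
mask = [
    [1, 0, 1],  # input dim 0: 2 connections kept
    [0, 0, 0],  # input dim 1: pruned away entirely
    [1, 1, 1],  # input dim 2: fully connected
    [0, 1, 0],  # input dim 3: 1 connection kept
]
assert input_dim_usage(mask) == [2, 0, 3, 1]
```

Applied to the trained actor masks, zero-count rows identify the redundant raw input dimensions discussed above.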

