DEEP Q-LEARNING WITH LOW SWITCHING COST

Abstract

We initiate the study of deep reinforcement learning problems that require low switching cost, i.e., a small number of policy switches during training. Such a requirement is ubiquitous in applications such as medicine, recommendation systems, education, robotics, and dialogue agents, where the deployed policy that actually interacts with the environment cannot change frequently. Our paper investigates different policy switching criteria based on deep Q-networks and further proposes an adaptive approach based on the feature distance between the deployed Q-network and the underlying learning Q-network. Through extensive experiments on a medical treatment environment and a collection of Atari games, we find that our feature-based switching criterion substantially decreases the switching cost while maintaining a sample efficiency similar to the case without the low-switching-cost constraint. We also complement this empirical finding with a theoretical justification from a representation learning perspective.

1. INTRODUCTION

Reinforcement learning (RL) is often used to model real-world sequential decision-making problems such as medical treatment, personalized recommendation, hardware placement, and database optimization. In these applications, it is often desirable to restrict the agent from adjusting its policy frequently. In medical domains, changing a policy requires a thorough approval process by experts. For large-scale software and hardware systems, changing a policy requires redeploying the whole environment. Formally, we would like our RL algorithm to admit a low switching cost: the deployed policy that interacts with the environment cannot change many times. In real-world RL applications such as robotics, education, and dialogue systems, changing the deployed policy frequently may incur high costs and risks. Gu et al. (2017) trained robotic manipulation by decoupling the training and experience-collecting threads; Mandel et al. (2014) applied RL to educational games by taking a data-driven methodology for comparing and validating policies offline, and ran the strongest policy online; Jaques et al. (2019) developed an off-policy batch RL algorithm for dialogue systems that can effectively learn in an offline fashion, without using different policies to interact with the environment. All of these works avoid changing the deployed policy frequently: they either train the policy offline or validate a policy before deciding whether to deploy it online.

For RL problems with a low-switching-cost constraint, the central question is how to design a criterion that decides when to change the deployed policy. Ideally, we would like this criterion to have the following four properties:

1. Low switching cost: This is the purpose of the criterion. An algorithm equipped with this policy switching criterion should have low switching cost.

2. High reward: Since the deployed policy determines the collected samples and the agent uses fewer deployed policies, the collected data may not be informative enough to learn the optimal policy with high reward. We need the criterion to deploy policies that collect informative samples.

3. Sample efficiency: Since the agent only uses a few deployed policies, there may be more redundant samples, which would not be collected if the agent switched the policy frequently. We would like algorithms equipped with the criterion to have a sample efficiency similar to the case without the low-switching-cost constraint.

4. Generality: We would like the criterion to be effective not only on a specific task but broadly across a wide range of tasks.

In this paper, we take a step toward this important problem. We focus on designing a principled policy switching criterion for deep Q-network (DQN) learning algorithms, which have been widely used in applications. For example, Ahn & Park (2020) apply DQN to control balancing between different HVAC systems, Ao et al. (2019) propose a thermal process control method based on DQN, and Chen et al. (2018) apply it to online recommendation. Notably, these applications all require low switching cost. Our paper conducts a systematic study of DQN with low switching cost. Our contributions are summarized below.

Our Contributions

• We conduct the first systematic empirical study on benchmark environments that require modern reinforcement learning algorithms. We test two naive policy switching criteria: 1) switching the policy after a fixed number of steps and 2) switching the policy after an interval that increases at a fixed rate. We find that neither criterion is a generic solution: sometimes they either cannot find the best policy or significantly decrease the sample efficiency.

• Inspired by representation learning theory, we propose a new feature-based switching criterion that uses the feature distance between the deployed Q-network and the underlying learning Q-network. Through extensive experiments, we find that our proposed criterion is a generic solution: it substantially decreases the switching cost while maintaining a performance similar to the case without the low-switching-cost constraint.

• Along the way, we also derive a deterministic Rainbow DQN (Hessel et al., 2018), which may be of independent interest.

Organization. This paper is organized as follows. In Section 2, we review related work. In Section 3, we describe our problem setup and review the necessary background. In Section 4, we describe deterministic Rainbow DQN with the low-switching-cost constraint. In Section 5, we introduce our feature-based policy switching criterion and its theoretical support. In Section 6, we conduct experiments to evaluate different criteria. We conclude in Section 7 and leave experiment details to the appendix.

2. RELATED WORK

Low switching cost algorithms were first studied in the bandit setting (Auer et al., 2002; Cesa-Bianchi et al., 2013). Existing work on RL with low switching cost is mostly theoretical. To our knowledge, Bai et al. (2019) is the first work that studies this problem, in the episodic finite-horizon tabular RL setting. Bai et al. (2019) gave a low-regret algorithm with an O(H^3 SA log K) local switching upper bound, where S is the number of states, A is the number of actions, H is the planning horizon, and K is the number of episodes the agent plays. The upper bound was improved in Zhang et al. (2020b;a). The only empirical study on RL with switching cost is Matsushima et al. (2020), which proposed the concept of deployment efficiency and gave a model-based algorithm. During the training process, the algorithm fixes the number of deployments and alternates between training a dynamics model ensemble and updating the deployed policy. After each deployment, the deployed policy collects transitions in the real environment to improve the models, and the models then optimize the policy by providing imagined trajectories. In other words, they reduce the number of deployments by training on simulated environments. Our goal is different: we design a criterion to decide when to change the deployed policy, and this criterion can be employed by model-free algorithms. There is also a line of work on offline RL (also called batch RL) methods, where the policy does not interact with the environment directly and only learns from a fixed dataset (Lange et al., 2012; Levine et al., 2020). Some methods interpolate between offline and online methods, i.e., semi-batch RL algorithms (Singh et al., 1995; Lange et al., 2012), which update the policy many times on a large batch of transitions. However, switching cost is not their focus.

3. PRELIMINARIES

3.1. MARKOV DECISION PROCESS

Throughout our paper, we consider the episodic Markov decision process (S, A, H, P, r). In this model, S is the state space, A is the action space, and H ∈ Z+ is the planning horizon. P is the transition operator, where P(x'|x, a) denotes the probability of transitioning to state x' after taking action a in state x. r : S × A → R is the reward function. A policy is a mapping from a state to an action, π : S → A. In this paper, we focus on deterministic policies, as required by the motivating applications. The dynamics of the episodic MDP can be viewed as the periodic interaction of an agent with the environment. We let K be the total number of episodes the agent plays. At the beginning of an episode k ∈ [K], the agent chooses a policy π_k. The initial state x^k_1 ∈ S is sampled from a distribution, and the agent is then at step 1. At each step h ∈ [H] of the episode, based on the current state x^k_h ∈ S, the agent chooses the action a^k_h = π_k(x^k_h). The environment gives the reward r(x^k_h, a^k_h) for this step and moves the agent to the next state x^k_{h+1} ∼ P(·|x^k_h, a^k_h). The episode automatically ends when the agent reaches step H + 1.

The Q-function is used to evaluate the long-term value of an action a and the subsequent decisions. The Q-function of a policy π at time step h is defined as

Q^π_h(x, a) := r_h(x, a) + E[ Σ_{i=h+1}^{H} r(x_i, π(x_i)) | x_h = x, a_h = a ].

The goal of the agent is to find a policy π* that maximizes the expected reward, π* = argmax_π E_π[ Σ_{h=1}^{H} r_h ]. Ideally, we want to use as few episodes (K) as possible to learn π*.
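As an illustration of the definitions above, the following toy sketch evaluates Q^π_h(x, a) by rolling out a policy in a made-up deterministic MDP; the states, rewards, transitions, and horizon here are invented purely for the example.

```python
# A toy 2-state, 2-action episodic MDP (all quantities are illustrative).
H = 3
STATES, ACTIONS = [0, 1], [0, 1]

def reward(x, a):
    # hypothetical reward function r(x, a)
    return 1.0 if x == a else 0.0

def step(x, a):
    # deterministic toy transition P(.|x, a): taking action a moves to state a
    return a

def q_pi(pi, x, a, h):
    """Q^pi_h(x, a) = r(x, a) + sum of rewards from step h+1 to H under pi."""
    total = reward(x, a)
    x = step(x, a)
    for i in range(h + 1, H + 1):
        a_i = pi(x)           # a_i = pi(x_i)
        total += reward(x, a_i)
        x = step(x, a_i)
    return total

greedy = lambda x: x  # a deterministic policy pi: S -> A
print(q_pi(greedy, 0, 0, 1))  # -> 3.0 (reward 1 at each of steps 1, 2, 3)
```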

3.2. SWITCHING COST

The concept of switching cost is used to quantify the adaptivity of RL algorithms. The switching cost is defined as the number of changes of the deployed policy during the run of the algorithm over K episodes, namely

N_switch := Σ_{k=1}^{K-1} I{π_k ≠ π_{k+1}}.

The goal of this paper is to equip an algorithm with a criterion that learns π* using few episodes while at the same time keeping N_switch small.
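The definition above can be computed directly from the sequence of deployed policies; a minimal sketch (the string identifiers stand in for actual policies):

```python
def switching_cost(policies):
    """N_switch = number of k in [1, K-1] with pi_k != pi_{k+1}.

    `policies` holds one deployed-policy identifier per episode; the
    identifiers here are illustrative stand-ins for actual policies.
    """
    return sum(1 for a, b in zip(policies, policies[1:]) if a != b)

# 8 episodes but only 2 policy switches:
print(switching_cost(["p0", "p0", "p0", "p1", "p1", "p1", "p2", "p2"]))  # -> 2
```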

3.3. DEEP Q-LEARNING

If we can estimate the Q-function well for each state-action pair, there is no difficulty in finding π*: for example, we can always select a* = argmax_a Q(x, a). However, it is not easy to learn Q-value estimates for each state-action pair, especially when the state or action space is large. In deep Q-learning (DQN), Mnih et al. (2015) successfully combined deep networks and reinforcement learning by using a deep neural network to approximate Q(x, a). Given the current state x_h, the agent selects an action a_h greedily based on Q(x_h, a); the state then moves to x_{h+1} and a reward r_{h+1} is obtained. The transition (x_h, a_h, r_{h+1}, x_{h+1}) is saved to the replay memory buffer. At each step, a batch of transitions is sampled from this buffer, and the parameters of the neural network are optimized by stochastic gradient descent to minimize the loss

( r_{h+1} + γ_{h+1} max_{a'} q_θ̄(x_{h+1}, a') − q_θ(x_h, a_h) )²,   (3)

where γ_{h+1} is the discount at time step h + 1, θ denotes the parameters of the online network, and θ̄ denotes the parameters of the target network. The gradient of the loss is back-propagated only to update θ, while θ̄ is not optimized directly. DQN has been successful, leading to superhuman performance on several Atari games. Nevertheless, the algorithm also has several limitations, and many extensions have been proposed. Rainbow (Hessel et al., 2018) combines six of these extensions and obtains excellent and stable performance on many Atari games. In Appendix A, we review these extensions. In this paper, we focus on Rainbow DQN with low switching cost.
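To make Equation 3 concrete, the following sketch assembles the TD loss from sampled transitions, using toy Q-tables as stand-ins for the online network q_θ and the target network q_θ̄; all sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the online and target networks: each "network" is just a
# table of Q-values, which is enough to show how the loss is assembled.
n_states, n_actions, gamma = 5, 3, 0.99
q_online = rng.normal(size=(n_states, n_actions))
q_target = q_online.copy()  # target parameters lag behind the online ones

def td_loss(batch):
    """Mean of (r + gamma * max_a' q_target(x', a') - q_online(x, a))^2."""
    losses = []
    for (x, a, r, x_next) in batch:
        target = r + gamma * q_target[x_next].max()  # bootstrap from target net
        losses.append((target - q_online[x, a]) ** 2)
    return float(np.mean(losses))

batch = [(0, 1, 1.0, 2), (3, 0, 0.0, 4)]
print(td_loss(batch) >= 0.0)  # the squared TD error is always non-negative
```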

3.4. COUNT-BASED EXPLORATION

In many real-world scenarios that require low switching cost, it is also required to use deterministic policies, especially in the applications mentioned above. Exploration strategies like ε-greedy and noisy networks make the policy stochastic, so we cannot use them here. Count-based exploration algorithms are known to perform near-optimally when used in reinforcement learning for solving tabular MDPs. Tang et al. (2017) successfully applied these algorithms to high-dimensional state spaces. They discretize the state space with a hash function φ : S → Z. An exploration bonus r⁺(x) = β/√(n(φ(x))) is then added to the reward function, where n(φ(x)) counts the visits to the hash code φ(x), and the agent is trained with the modified reward r + r⁺. Note that with count-based exploration, the policy is deterministic.
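A minimal sketch of this count-based bonus, following the SimHash-style discretization φ(x) = sgn(A g(x)) with g taken as the identity; the dimensions and the value of β are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, code_bits, beta = 8, 16, 0.01

# A has i.i.d. standard Gaussian entries, as in Tang et al. (2017).
A = rng.normal(size=(code_bits, state_dim))
counts = {}

def bonus(x):
    """Exploration bonus r+(x) = beta / sqrt(n(phi(x)))."""
    code = tuple(np.sign(A @ x).astype(int))   # phi(x) = sgn(A g(x))
    counts[code] = counts.get(code, 0) + 1     # increment visit count n(phi(x))
    return beta / np.sqrt(counts[code])

x = rng.normal(size=state_dim)
b1, b2 = bonus(x), bonus(x)
print(b1 > b2)  # revisiting the same hashed state shrinks the bonus
```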

4. DETERMINISTIC RAINBOW DQN WITH A POLICY SWITCHING CRITERION

In this section, we introduce how to implement a deterministic Rainbow DQN with a policy switching criterion. The implementation combines six DQN extensions, and the policy is always deterministic.

4.1. DETERMINISTIC RAINBOW DQN

We first discuss how to make Rainbow DQN always output deterministic policies. Recall that, to explore more efficiently, Rainbow adopts Noisy Net (Fortunato et al., 2018), which makes the policy stochastic. To obtain a deterministic policy while still exploring, we remove Noisy Net and employ the count-based exploration described in the previous section. When the deployed policy interacts with the environment, it selects the action that maximizes the Q-function. After obtaining the reward by taking this action, the exploration bonus r⁺ = β/√(n(φ(x))) is added to the reward.

4.2. DQN WITH A POLICY SWITCHING CRITERION

By default, Rainbow directly updates the policy that interacts with the environment. This policy typically switches millions of times during the training process (since the parameters are updated millions of times). In our implementation, we add an online policy in addition to the deployed policy that interacts with the environment. The deployed policy collects the data for the experience replay buffer, while the online policy is updated frequently during training. The deployed policy is replaced with the online policy whenever the switching criterion, discussed below, is met. The resulting deterministic switching Rainbow is shown in Algorithm 1.

4.3. POLICY SWITCHING CRITERIA

Here we first introduce two straightforward policy switching criteria.

Fixed Interval Switching. This is the simplest criterion: we switch the deployed policy at a fixed interval. Under the FIX n criterion, we switch the deployed policy whenever the online policy has been updated n times, where n is a pre-specified number. We will specify n in our experiments.

Adaptive Interval Switching. This criterion aims to switch the deployed policy frequently at first and reduce the switching frequency gradually. Under the Adaptive n2m criterion, we increase the deployment interval from n to m. The interval between the i-th deployment and the (i + 1)-th deployment is min((i + 1) × n, m). We will specify n and m in our experiments.

Algorithm 1 Deterministic Switching Rainbow
1: Initialize parameters θ_online, θ_deployed, θ_target for the online policy, deployed policy, and target policy; initialize an empty replay buffer D
2: Denote the state encoders of the online policy and the deployed policy as f_online and f_deployed
3: Set the step to start training H_start, the step to end training H_max, and the interval to update the target policy H_target
4: Set n(accumulated updates) = 0, n(deployment) = 0
5: for h = 1 to H_max do
6:   Select a_h = argmax_a Q_deployed(x_h, a)
7:   Execute action a_h and observe reward r_h and state x_{h+1}
8:   Compute the hash code for x_h: φ(x_h) = sgn(A g(x_h))
9:   (A is a fixed matrix with i.i.d. entries drawn from a standard Gaussian distribution N(0, 1) and g is a flattening function)
10:  Update the hash table counts: n(φ(x_h)) = n(φ(x_h)) + 1
11:  Update the reward: r_h = r_h + β/√(n(φ(x_h)))
12:  Store (x_h, a_h, r_h, x_{h+1}) in D
13:  if h > H_start then
14:    Sample a minibatch of transitions from D
15:    Update θ_online by stochastic gradient descent on the sampled minibatch once

Feature-based switching F_a
  input: f_deployed, f_online, D
  Sample a minibatch B of states x_h from D with probability p_h
  Compute the similarity sim(x_h) = f_deployed(x_h) · f_online(x_h) / (||f_deployed(x_h)|| × ||f_online(x_h)||)
  Compute the average similarity over the sampled batch: sim(B) = Σ_{x∈B} sim(x) / |B|
  output: bool(sim(B) ≤ a)

These two criteria and our proposed new criterion are summarized in Algorithm 2. Unfortunately, as will be shown in our experiments, the two interval-based criteria do not perform well. This motivates us to design a new, principled criterion.
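The two interval-based criteria can be sketched directly from their definitions (the inputs mirror the counters used in Algorithm 2; the numeric values are illustrative):

```python
def fix_criterion(n_updates, n):
    """FIX n: switch once the online policy has been updated n times."""
    return n_updates >= n

def adaptive_criterion(n_updates, n_deployments, n, m):
    """Adaptive n2m: the deployment interval grows as (i + 1) * n, capped at m."""
    return n_updates >= min((n_deployments + 1) * n, m)

# Early on (after the 1st deployment) the adaptive interval is short ...
print(adaptive_criterion(200, 1, 100, 10_000))   # 200 >= min(200, 10000) -> True
# ... while late in training it behaves like FIX m:
print(adaptive_criterion(200, 500, 100, 10_000)) # 200 >= min(50100, 10000) -> False
```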

5. FEATURE-BASED SWITCHING CRITERION

In this section, we describe our new policy switching criterion based on feature learning. We first describe the criterion and then provide theoretical justification from a representation learning point of view.

Feature-based Switching Criterion. We adopt the view that DQN learns to extract informative features of the states of the environment. Our proposed criterion switches the deployed policy according to the extracted features. When deciding whether to switch the deployed policy, we first sample a batch of states B from the experience replay buffer, and then extract the features of all states with both the deployed deep Q-network and the online deep Q-network. For a state x, the extracted features are denoted f_deployed(x) and f_online(x), respectively. The similarity score between f_deployed and f_online on state x is defined as

sim(x) = ⟨f_deployed(x), f_online(x)⟩ / (||f_deployed(x)|| × ||f_online(x)||).

We then compute the average similarity score over the batch of states B:

sim(B) = Σ_{x∈B} sim(x) / |B|.

With a hyper-parameter a ∈ [0, 1], the feature-based policy switching criterion is to change the deployed policy whenever sim(B) ≤ a.

Theoretical Justification. Our criterion is inspired by representation learning. To illustrate the idea, we consider the following setting. Suppose we want to learn f(·), a representation function that maps the input to a k-dimensional vector. We assume we have input-output pairs (x, y) with y = ⟨w, f*(x)⟩ for some underlying representation function f*(·) and a linear predictor w ∈ R^k. For ease of presentation, let us assume we know w, and our goal is to learn the underlying representation which, together with w, gives 0 prediction error. Suppose we have datasets D_1 and D_2. We use D_1 to train an estimator of f*, denoted f_1, and D_1 ∪ D_2 to train another estimator of f*, denoted f_{1+2}. The training method is empirical risk minimization, i.e.,

f_1 ← argmin_{f∈F} (1/|D_1|) Σ_{(x,y)∈D_1} (y − ⟨w, f(x)⟩)²  and  f_{1+2} ← argmin_{f∈F} (1/|D_1 ∪ D_2|) Σ_{(x,y)∈D_1∪D_2} (y − ⟨w, f(x)⟩)²,

where F is a pre-specified representation function class. The following theorem suggests that if the similarity score between f_1 and f_{1+2} is small, then f_1 is also far from the underlying representation f*.

Theorem 1. Suppose f_1 and f_{1+2} are trained via the aforementioned scheme. There exist datasets D_1, D_2, a function class F, and a predictor w such that if the similarity score between f_1 and f_{1+2} on D_1 ∪ D_2 is smaller than α, then the prediction error of f_1 on D_1 ∪ D_2 is at least 1 − α.

The proof is deferred to Appendix B, where we give explicit constructions. Theorem 1 suggests that, in certain scenarios, if the learned representation has not converged (the similarity score is small), then it cannot be the optimal representation, which in turn hurts prediction accuracy. Therefore, if we find the similarity score is small, we should change the deployed policy.
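A sketch of the similarity computation underlying the F_a criterion, with toy linear encoders standing in for the Q-networks' feature layers; the encoders, the batch, and the drift added to the online features are invented for illustration.

```python
import numpy as np

def similarity_score(f_deployed, f_online, batch):
    """Average cosine similarity between the two feature maps on a batch."""
    sims = []
    for x in batch:
        u, v = f_deployed(x), f_online(x)
        sims.append(float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
    return sum(sims) / len(sims)

def should_switch(f_deployed, f_online, batch, a=0.98):
    """Feature-based criterion F_a: switch when sim(B) <= a."""
    return similarity_score(f_deployed, f_online, batch) <= a

# Hypothetical linear "encoders" standing in for the feature layers:
W = np.eye(4)
f_dep = lambda x: W @ x
f_onl = lambda x: (W + 0.5 * np.ones((4, 4))) @ x  # online features drifted

batch = [np.ones(4), np.array([1.0, 0.0, 0.0, 1.0])]
print(should_switch(f_dep, f_onl, batch))  # -> True (features diverged)
```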

6. EXPERIMENTS

In this section, we conduct experiments to evaluate different policy switching criteria for DQN. We study several Atari game environments along with an environment simulating sepsis treatment for ICU patients, and we compare the efficiency of the different switching criteria in these environments. Implementation details and hyper-parameters are listed in Appendix A.

6.1. ENVIRONMENTS

GYMIC. GYMIC is an OpenAI Gym environment for simulating sepsis treatment for ICU patients, where sepsis is caused by the body's response to an infection and can be life-threatening. GYMIC builds an environment simulating the MIMIC sepsis cohort, where MIMIC is an open EHR dataset from ICU patients. The environment generates a sparse reward: the reward is set to +15 if the patient recovers and −15 if the patient dies. The environment has 46 clinical features and a 5 × 5 action space. For GYMIC, we display the learning curve over 1.5 million environment steps, after which the reward converges. We choose this environment because it simulates a real-world problem that requires low switching cost.

Atari 2600. Atari 2600 games are widely employed to evaluate the performance of DQN-based agents. We also evaluate the efficiency of the different switching criteria on several games, such as Pong, Road Runner, and Beam Rider. For the Atari games, all experiments run for 3.5 million environment steps.

6.2. RESULTS AND DISCUSSIONS

For all environments, we evaluate a feature-based criterion F 0.98, three fixed interval criteria covering a vast range (FIX 10^2, FIX 10^3, and FIX 10^4), and an adaptive criterion increasing the deployment interval from 100 to 10,000. Besides the switching criteria discussed above, we use "None" to denote an experiment without the low-switching-cost constraint, where the deployed policy is kept in sync with the online policy at all times; note that this is equivalent to FIX 1.

GYMIC. As shown in Figure 1, none of the switching criteria affects performance, but they all drastically reduce the number of policy switches, which indicates that reducing the switching cost in such a medical environment is feasible. In particular, we find FIX 10^4 and F 0.98 are the two best criteria for reducing switching cost. The feature-based criterion maintains good performance with the minimal switching cost.

Atari 2600. GYMIC may be too simple, so we also compare the criteria in more difficult environments. We evaluate the performance of the different criteria on the Atari games, which are image-based environments; in particular, the state space is much more complex. Figure 2 shows the results on 6 Atari games. In each subgraph, the upper curves show the reward over steps, while the lower curves show the switching cost. First, we observe the overall trend that higher switching cost leads to better performance. In general, Rainbow DQN without the switching cost constraint often gives the best performance. Also, FIX 10^2 enjoys better performance than FIX 10^3 and FIX 10^4. In some games such as Qbert and Riverraid, although FIX 10^4 and Adaptive lead to lower switching cost, they fail to learn to play the games well. Therefore, they are not desirable generic policy switching criteria. Second, we observe that changing the policy at an adaptive interval may be better than at a fixed interval. Comparing FIX 10^4 and Adaptive 10^2 to 10^4: the adaptive criterion switches the online policy quickly at first and decreases its switching speed gradually, so that by the end it switches at the same speed as the fixed one. Consequently, there is no significant difference between the total switching cost of these two criteria. However, the adaptive criterion performs better than the fixed one on Beam Rider, Pong, and Qbert, and the two obtain similar performance on the remaining games. Lastly, we find that our proposed feature-based criterion (F 0.98) is the desired one that satisfies all four properties discussed in Section 1. It significantly reduces the switching cost compared to "None", and its switching cost is smaller than that of FIX 10^2. While it incurs higher switching cost than FIX 10^3, FIX 10^4, and Adaptive 10^2 to 10^4, on all environments the feature-based criterion consistently performs as well as "None", in the sense that 1) it eventually finds the optimal policy and 2) it has the same sample efficiency as "None". In contrast, the other criteria sometimes perform significantly worse than "None", so none of them is a generic solution.

7. CONCLUSION

In this paper, we focus on the concept of switching cost and take a step toward designing a generic solution that reduces switching cost while maintaining performance. Inspired by representation learning theory, we proposed a new feature-based policy switching criterion for deep Q-learning methods. Through experiments on one medical simulation environment and six Atari games, we find that our proposed criterion significantly reduces the switching cost while enjoying the same performance as the case without the switching cost constraint. We believe our paper is only the first step on this important problem. One interesting question is how to design principled policy switching criteria for policy-based and model-based methods. Another direction is to give provable guarantees for policy switching criteria that work for methods handling large state spaces, in contrast to existing analyses, which all concern tabular RL (Bai et al., 2019; Zhang et al., 2020b;a).

A DETAILS OF EXPERIMENTS

A.1 DETAILED ALGORITHM

For completeness, we introduce the extensions of DQN and display our detailed algorithm.

Double Q-learning. Conventional Q-learning is affected by an overestimation bias due to the maximization step in Equation 3. Double Q-learning (Hasselt et al., 2016) addresses this problem by decoupling action selection from action evaluation, using the loss

( r_{h+1} + γ_{h+1} q_θ̄(x_{h+1}, argmax_{a'} q_θ(x_{h+1}, a')) − q_θ(x_h, a_h) )².   (4)

This change reduces the harmful overestimations present in DQN, which leads to an improvement.

Multi-step learning. Q-learning accumulates a single reward; Sutton (2005) long ago used a forward-view multi-step accumulated reward, where the n-step accumulated reward is defined as

r^(n)_h := Σ_{k=0}^{n-1} γ^(k)_h r_{h+k+1},

and the final loss is

( r^(n)_h + γ^(n)_h max_{a'} q_θ̄(x_{h+n}, a') − q_θ(x_h, a_h) )².

Multi-step targets often lead to faster learning (Sutton & Barto, 2005).

Dueling networks. Wang et al. (2016) split the DQN network into two streams, a value stream V and an advantage stream A; the two streams share the convolutional encoder f, and the action value Q(x, a) is computed as

Q(x, a) = V(f(x)) + A(f(x), a) − (1/N_actions) Σ_{a'} A(f(x), a').

Prioritized replay. DQN samples uniformly from the replay buffer. To sample more frequently those transitions from which the policy can learn the most, Schaul et al. (2016) sample transitions with probability p_h proportional to the last encountered absolute TD error,

p_h ∝ | r_{h+1} + γ_{h+1} max_{a'} q_θ̄(x_{h+1}, a') − q_θ(x_h, a_h) |^ω,   (8)

where ω is a hyper-parameter; new transitions enter the buffer with maximum priority to ensure a bias toward unseen transitions.

Distributional RL. Instead of approximating the expected return as DQN does, Bellemare et al. (2017) proposed approximating the distribution of returns on a discrete support z, a vector with N_atoms atoms defined as

z_i = V_min + (i − 1)(V_max − V_min)/(N_atoms − 1),  i ∈ {1, 2, ..., N_atoms},

where V_min and V_max are the minimal and maximal values of the support. The approximating distribution d_h is defined on this support with probability p^i_θ(x_h, a_h) on each atom i, i.e., d_h = (z, p_θ(x_h, a_h)). The goal is to update the trainable parameters θ to match this distribution to the actual distribution of returns. To learn the probabilities p^i_θ with a variant of Bellman's equation, they minimize the Kullback-Leibler divergence D_KL(Φ_z d'_h || d_h) between the distribution d_h and the target distribution d'_h := (r_{h+1} + γ_{h+1} z, p_θ̄(x_{h+1}, argmax_a q_θ̄(x_{h+1}, a))), where Φ_z is an L2-projection onto the fixed support z and q_θ̄(x_{h+1}, a) = zᵀ p_θ̄(x_{h+1}, a).

Noisy Net. To address the limitations of exploring with ε-greedy policies, Fortunato et al. (2018) propose a noisy linear layer y = b + Wx + (b_noisy ⊙ ε_b + (W_noisy ⊙ ε_w)x) to replace the standard linear layer y = b + Wx, where ε_b and ε_w are random variables and ⊙ denotes the element-wise product.

Rainbow combines all six of these extensions. To make the policy deterministic while still exploring during training, we remove Noisy Net and adopt count-based exploration. The detailed algorithm follows.
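As a concrete illustration of the multi-step target above, the n-step accumulated reward can be sketched in a few lines; the reward list and the indexing convention (rewards[h + k] standing for r_{h+k+1}) are illustrative.

```python
def n_step_return(rewards, h, n, gamma):
    """r_h^(n) = sum_{k=0}^{n-1} gamma^k * r_{h+k+1}.

    `rewards[h + k]` stands in for r_{h+k+1} in the paper's notation.
    """
    return sum(gamma ** k * rewards[h + k] for k in range(n))

rewards = [1.0, 1.0, 1.0, 1.0]
print(n_step_return(rewards, 0, 3, 0.5))  # 1 + 0.5 + 0.25 = 1.75
```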
Algorithm 3 Detailed algorithm
1: Hyper-parameters for extensions
2: Distributional RL: number of atoms N_atoms, min/max values V_min/V_max
3: Prioritized replay memory: exponent ω, capacity N
4: Multi-step: number of steps n
5: Initialization
6: Initialize prioritized replay memory D, state encoder f_online, value stream V_online, and advantage stream A_online for the online policy P_online
7: Initialize the deployed policy P_deployed and the target policy P_target with the parameters of P_online
8: Definition of the Q-function
9: z_i = V_min + (i − 1)(V_max − V_min)/(N_atoms − 1)
10: Ā_online(x) = (1/N_actions) Σ_a A_online(f_online(x), a)
11: p^i_online(x, a) = softmax(V^i_online(f_online(x)) + A^i_online(f_online(x), a) − Ā^i_online(x))
12: Q_online(x, a) = Σ_i z_i p^i_online(x, a)
13: Similarly, Q_deployed(x, a) = Σ_i z_i p^i_deployed(x, a) and Q_target(x, a) = Σ_i z_i p^i_target(x, a)
...
22: Select a_h = argmax_a Q_deployed(x_h, a)
23: Execute action a_h in the emulator and observe reward r_h and state x_{h+1}
24: Compute the hash code for x_h: φ(x_h) = sgn(A g(x_h))
25: (A is a matrix with i.i.d. entries drawn from a standard Gaussian distribution N(0, 1) and g is a flattening function)
26: Update the hash table counts: n(φ(x_h)) = n(φ(x_h)) + 1
27: Update the reward: r_h = r_h + β/√(n(φ(x_h)))
28: Store (x_h, a_h, r_h, x_{h+1}) in D
29: if h > H_start then
30:   Sample a minibatch of transitions (x_h', a_h', w_h', r_{h'+1}, ..., r_{h'+n−1}, x_{h'+n}) from D with probability p_h'
31:   (w_h' is the importance-sampling weight for the transition at h')
32:   a* = argmax_a Q_online(x_{h'+n}, a)
33:   m_i = 0 for i ∈ {1, 2, ..., N_atoms}
34:   for j = 1 to N_atoms do
35:     T z_j = [r^(n)_h' + γ^n z_j].clip(V_min, V_max)
36:     Δz = (V_max − V_min)/(N_atoms − 1)
37:     b_j = (T z_j − V_min)/Δz, l = ⌊b_j⌋, u = ⌈b_j⌉
38:     m_l = m_l + p^j_target(x_{h'+n}, a*)(u − b_j), m_u = m_u + p^j_target(x_{h'+n}, a*)(b_j − l)
39:   end for
40:   D_KL,h' = −Σ_i m_i log p^i_online(x_h', a_h')

Table 1: The basic hyper-parameters. We use the Adam optimizer with learning rate α = 0.0000625 and ε = 1.5 × 10⁻⁴. Before training the online policy, we let the initialized random policy take 20K steps to collect transitions, and the capacity of the replay buffer is 1M. During the training process, we sample 32 transitions from the replay buffer and update the online policy every four steps. The reward is clipped to [−1, 1] and ReLU is adopted as the activation function. For replay prioritization we use the recommended proportional variant, with importance sampling annealed from 0.4 to 1; the prioritization exponent ω is set to 0.5. In addition, we employ N_atoms = 51, V_min = −10, V_max = 10 for distributional RL and n = 3 for multi-step returns. Finally, the count-based bonus β is set to 0.01.

In the end, we list the switching cost and reward of the different criteria when the environment takes 1.5 million steps and 3 million steps in Table 4.

B PROOF FOR SECTION 5

Proof of Theorem 1. Let w = (1, 1, ..., 1) ∈ R^k be the k-dimensional all-ones vector. Let

F = { f : f(x) = (2σ(⟨v_1, x⟩) − 1, 2σ(⟨v_2, x⟩) − 1, ..., 2σ(⟨v_k, x⟩) − 1) } ⊂ {R^k → R^k},

with σ(·) being the ReLU activation function (see the footnote below) and v_i ∈ {e_i, −e_i}, where e_i ∈ R^k denotes the vector whose i-th coordinate is 1 and whose other coordinates are 0. We assume k is even and αk is an integer for simplicity. We let the underlying f* be the function corresponding to (e_1, e_2, ..., e_k). We let D_1 = {(e_1, 1), (e_2, 1), ..., (e_{(1−α)k}, 1)} and D_2 = {(e_{(1−α)k+1}, 1), ..., (e_k, 1)}. Because we use the ERM training scheme, it is clear that training on D_1 ∪ D_2 recovers f*, i.e., f_{1+2} = f*, since any other choice has higher empirical risk (f* has 0 error). Now if the similarity score between f_1 and f_{1+2} is smaller than α, it means that for f_1 the corresponding {v_{(1−α)k+1}, ..., v_k} are not correct. In this case, f_1's prediction error is at least 1 − α on D_1 ∪ D_2, because it predicts incorrectly on all inputs of D_2.

Table 4: We list the switching cost and reward of the different criteria when the environment takes 1.5 million steps and 3 million steps. "Reward" corresponds to the absolute value of the reward, and "Gap" denotes the difference between the reward under a specific criterion and under "None". "Switching Cost" corresponds to the switching cost under a criterion at this time step.



Footnote: We define σ(0) = 0.5.



Algorithm 1 (continued)
16: if h % H_target == 0 then
17:   Update θ_target = θ_online
18:   Set n(accumulated updates) = n(accumulated updates) + 1
19: end if
20: if J(f_deployed, f_online, D, n(accumulated updates), n(deployment)) = true then
21:   Update θ_deployed = θ_online
22:   Update n(accumulated updates) = 0
23:   Update n(deployment) = n(deployment) + 1

Algorithm 2 Switching Criteria (J in Algorithm 1)
Fixed interval switching FIX n
  input: n(accumulated updates)
  output: bool(n(accumulated updates) ≥ n)
Adaptive interval switching Adaptive n2m
  input: n(accumulated updates), n(deployment)
  output: bool(n(accumulated updates) ≥ min((n(deployment) + 1) × n, m))
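The deployed/online/target bookkeeping and the criterion J can be sketched as a toy loop; `update_online`, `criterion_J`, and the integer "parameters" below are hypothetical stand-ins, not the paper's actual implementation.

```python
def train_loop(criterion_J, update_online, H_max, H_target):
    theta_online, theta_deployed, theta_target = 0, 0, 0  # toy "parameters"
    n_updates, n_deployments, history = 0, 0, []
    for h in range(1, H_max + 1):
        theta_online = update_online(theta_online)  # one SGD step (toy)
        n_updates += 1
        if h % H_target == 0:
            theta_target = theta_online             # sync target policy
        if criterion_J(n_updates, n_deployments):
            theta_deployed = theta_online           # deploy the online policy
            n_updates = 0
            n_deployments += 1
        history.append(theta_deployed)
    return n_deployments, history

# A FIX-100-style criterion over 1,000 steps yields 10 deployments:
n_dep, _ = train_loop(lambda u, d: u >= 100, lambda t: t + 1, 1_000, 250)
print(n_dep)  # -> 10
```

Note that in this sketch the update counter increments every training step, which is one simple reading of the bookkeeping; the exact placement of the increment relative to the target-network sync follows Algorithm 1's structure only loosely.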

Figure 1: Results on GYMIC. "Step" means the number of steps of the environment. We show the learning curve over 1.5 million steps. The figure above shows the learning curve of the reward, while the figure below displays the switching cost. "None" means no low-switching-cost constraint, where the deployed policy always keeps in sync with the online policy. Curves of reward are smoothed with a moving average over 5 points.

Figure 2: Results on the Atari games; we compare different switching criteria on six Atari games. "Step" means the number of steps of the environment. We constrain all environments to 3.5 million steps. In each environment, we display the reward over the steps on the top and the switching cost on a log scale at the bottom. "None" means no switching criterion, under which the deployed policy always keeps in sync with the online policy. We evaluate a feature-based criterion, three fixed interval criteria covering a vast range, and an adaptive criterion increasing the deployment interval from 100 to 10,000. Curves of reward are smoothed with a moving average over 5 points.




Table 1 lists the basic hyper-parameters of the algorithm; all of our experiments share these hyper-parameters, except that the experiments on GYMIC adopt H_target = 1K. Most of these parameters are the same as in the raw Rainbow algorithm. For count-based exploration, the bonus β is set to 0.01.



We also list the remaining hyper-parameters for the experiments on GYMIC. Since there are 46 clinical features in this environment, we stack 4 consecutive states to compose a 184-dimensional vector as the input to the state encoder f_online (likewise f_deployed and f_target). The state encoder is a 2-layer MLP with hidden size 128.



Extra hyper-parameters for the experiments on GYMIC: we stack 4 consecutive states and adopt a 2-layer MLP with hidden size 128 to extract the features of states.

Additional hyper-parameters for the experiments on Atari games: observations are grey-scaled and rescaled to 84 × 84; 4 consecutive frames are stacked as the state, and each action is repeated four times. We limit the maximum number of frames per episode to 108K. The state encoder consists of 3 convolutional layers.

