DEEP Q-LEARNING WITH LOW SWITCHING COST

Abstract

We initiate the study of deep reinforcement learning problems that require low switching cost, i.e., a small number of policy switches during training. Such a requirement is ubiquitous in applications such as medical treatment, recommendation systems, education, robotics, and dialogue agents, where the deployed policy that actually interacts with the environment cannot change frequently. Our paper investigates different policy switching criteria based on deep Q-networks and further proposes an adaptive approach based on the feature distance between the deployed Q-network and the underlying learning Q-network. Through extensive experiments on a medical treatment environment and a collection of Atari games, we find that our feature-based switching criterion substantially decreases the switching cost while maintaining sample efficiency similar to the case without the low-switching-cost constraint. We also complement this empirical finding with a theoretical justification from a representation learning perspective.

1. INTRODUCTION

Reinforcement learning (RL) is often used to model real-world sequential decision-making problems in areas such as medical treatment, personalized recommendation, hardware placement, and database optimization. In these applications, it is often desirable to restrict the agent from adjusting its policy frequently. In medical domains, changing a policy requires a thorough approval process by experts. For large-scale software and hardware systems, changing a policy requires redeploying the whole environment. Formally, we would like our RL algorithm to admit a low switching cost: the deployed policy that interacts with the environment cannot change many times. In real-world RL applications such as robotics, education, and dialogue systems, changing the deployed policy frequently may incur high costs and risks. Gu et al. (2017) trained robotic manipulation by decoupling the training and experience-collecting threads; Mandel et al. (2014) applied RL to educational games by taking a data-driven methodology for comparing and validating policies offline and running the strongest policy online; Jaques et al. (2019) developed an off-policy batch RL algorithm for dialogue systems that can learn effectively in an offline fashion, without using different policies to interact with the environment. All of these works avoid changing the deployed policy frequently, either by training the policy offline or by validating a policy before deploying it online.

For RL problems with a low-switching-cost constraint, the central question is how to design a criterion that decides when to change the deployed policy. Ideally, we would like this criterion to have the following three properties:

1. Low switching cost: This is the purpose of the criterion. An algorithm equipped with this policy switching criterion should have low switching cost.

2. High reward: Since the deployed policy determines the collected samples and the agent uses fewer deployed policies, the collected data may not be informative enough to learn the optimal policy with high reward. The criterion should deploy policies that collect informative samples.

3. Sample efficiency: Since the agent uses only a few deployed policies, it may collect more redundant samples than an agent that switches frequently. We would like algorithms equipped with the criterion to match the sample efficiency of the case without the low-switching-cost constraint.

In this paper, we take a step toward this important problem. We focus on designing a principled policy switching criterion for deep Q-network (DQN) learning algorithms, which have been widely used in applications. For example, Ahn & Park (2020) apply DQN to control balancing between different HVAC systems, Ao et al. (2019) propose a thermal process control method based on DQN, and Chen et al. (2018) apply it to online recommendation. Notably, these applications all require low switching cost. Our paper conducts a systematic study of DQN with low switching cost. Our contributions are summarized below.
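To make the switching criteria concrete, here is a minimal sketch of the two naive baselines discussed later (fixed-interval and geometrically growing intervals). The function names and the interval parameters (`interval=1000`, `base=1000`, `rate=2.0`) are illustrative choices, not values from the paper:

```python
def fixed_interval(step: int, interval: int = 1000) -> bool:
    """Naive criterion 1: redeploy the policy every `interval` environment steps,
    giving T / interval switches over T steps."""
    return step % interval == 0


def growing_interval(n_switches: int, step: int, base: int = 1000,
                     rate: float = 2.0) -> bool:
    """Naive criterion 2: the step threshold for the next deployment grows
    geometrically with the number of switches made so far, so only
    O(log T) switches occur over T steps."""
    return step >= base * rate ** n_switches


if __name__ == "__main__":
    # Count switches over 10,000 steps under each criterion.
    fixed = sum(1 for t in range(1, 10_001) if fixed_interval(t))
    grown = 0
    for t in range(1, 10_001):
        if growing_interval(grown, t):
            grown += 1
    print(fixed, grown)  # fixed-interval switches far more often
```

Under this toy setting the fixed-interval criterion switches 10 times while the growing-interval criterion switches only 4 times (at steps 1000, 2000, 4000, and 8000), illustrating why the growing schedule achieves logarithmic switching cost.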

Our Contributions

• We conduct the first systematic empirical study on benchmark environments that require modern reinforcement learning algorithms. We test two naive policy switching criteria: 1) switching the policy after a fixed number of steps and 2) switching the policy after intervals that grow at a fixed rate. We find that neither criterion is a generic solution, because each sometimes either fails to find the best policy or significantly decreases sample efficiency.

• Inspired by representation learning theory, we propose a new feature-based switching criterion that uses the feature distance between the deployed Q-network and the underlying learning Q-network. Through extensive experiments, we find that our proposed criterion is a generic solution: it substantially decreases the switching cost while maintaining performance similar to the case without the low-switching-cost constraint.

• Along the way, we also derive a deterministic Rainbow DQN (Hessel et al., 2018), which may be of independent interest.

Organization. This paper is organized as follows. In Section 2, we review related work. In Section 3, we describe our problem setup and review the necessary background. In Section 4, we describe deterministic Rainbow DQN with the low-switching-cost constraint. In Section 5, we introduce our feature-based policy switching criterion and its theoretical support. In Section 6, we conduct experiments to evaluate different criteria. We conclude in Section 7 and leave experiment details to the appendix.
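The feature-based criterion above can be sketched as follows: treat the penultimate-layer activations of a Q-network as its features, and trigger a redeployment when the features of the learning network drift far enough from those of the deployed network. The network architecture, the distance measure (mean L2 over a batch), and the threshold are all hypothetical choices for illustration, not the paper's exact instantiation:

```python
import copy

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Toy Q-network; the penultimate activations serve as the 'features'."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_actions)

    def features(self, obs: torch.Tensor) -> torch.Tensor:
        return self.body(obs)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(obs))


def should_switch(deployed: QNetwork, learner: QNetwork,
                  batch: torch.Tensor, threshold: float) -> bool:
    """Redeploy when the mean feature distance on a batch of observations
    exceeds the threshold (a hypothetical instantiation of the criterion)."""
    with torch.no_grad():
        dist = (deployed.features(batch) - learner.features(batch)).norm(dim=1).mean()
    return dist.item() > threshold
```

In a training loop, `should_switch` would be evaluated periodically on a batch sampled from the replay buffer; copying the learner into the deployed network (e.g., via `deployed.load_state_dict(learner.state_dict())`) then counts as one policy switch.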

2. RELATED WORK

Low switching cost algorithms were first studied in the bandit setting (Auer et al., 2002; Cesa-Bianchi et al., 2013). Existing work on RL with low switching cost is mostly theoretical. The closest work to ours is Matsushima et al. (2020), which proposed the concept of deployment efficiency and gave a model-based algorithm. During training, the algorithm fixes the number of deployments and alternately trains a dynamics model ensemble and updates the deployed policy. After each deployment, the deployed policy collects transitions in the real environment to improve the models, and the models then optimize the policy by providing imagined trajectories. In other words, they reduce the number of deployments by training on simulated environments. Our goal is different: we design a criterion that decides when to change the deployed policy, and this criterion can be employed by model-free algorithms. There is also a line of work on offline RL (also called batch RL), where the policy does not interact with the environment directly and learns only from a fixed dataset (Lange et al., 2012; Levine et al., 2020). Some methods interpolate between offline and online methods, i.e., semi-batch RL.



To our knowledge, Bai et al. (2019) is the first work that studies this problem in the episodic finite-horizon tabular RL setting. Bai et al. (2019) gave a low-regret algorithm with an O(H^3 SA log K) local switching cost upper bound, where S is the number of states, A is the number of actions, H is the planning horizon, and K is the number of episodes the agent plays. The upper bound was improved in Zhang et al. (2020b;a).

