DEEP Q-LEARNING WITH LOW SWITCHING COST

Abstract

We initiate the study of deep reinforcement learning problems that require low switching cost, i.e., a small number of policy switches during training. Such a requirement is ubiquitous in many applications, such as medical domains, recommendation systems, education, robotics, and dialogue agents, where the deployed policy that actually interacts with the environment cannot change frequently. Our paper investigates different policy switching criteria based on deep Q-networks and further proposes an adaptive approach based on the feature distance between the deployed Q-network and the underlying learning Q-network. Through extensive experiments on a medical treatment environment and a collection of Atari games, we find that our feature-switching criterion substantially decreases the switching cost while maintaining sample efficiency similar to the case without the low-switching-cost constraint. We also complement this empirical finding with a theoretical justification from a representation learning perspective.

1. INTRODUCTION

Reinforcement learning (RL) is often used for modeling real-world sequential decision-making problems such as medical treatment, personalized recommendation, hardware placement, and database optimization. For these applications, it is often desirable to restrict the agent from adjusting its policy frequently. In medical domains, changing a policy requires a thorough approval process by experts. For large-scale software and hardware systems, changing a policy requires redeploying the whole environment. Formally, we would like our RL algorithm to admit a low switching cost: the deployed policy that interacts with the environment cannot change many times. In some real-world RL applications such as robotics, education, and dialogue systems, changing the deployed policy frequently may incur high costs and risks. Gu et al. (2017) trained robotic manipulation by decoupling the training and experience-collecting threads; Mandel et al. (2014) applied RL to educational games by taking a data-driven methodology for comparing and validating policies offline and running the strongest policy online; Jaques et al. (2019) developed an off-policy batch RL algorithm for dialogue systems that can learn effectively in an offline fashion, without using different policies to interact with the environment. All of these works avoid changing the deployed policy frequently by training the policy effectively offline or by validating a policy to decide whether to deploy it online.

For RL problems with a low-switching-cost constraint, the central question is how to design a criterion that decides when to change the deployed policy. Ideally, we would like this criterion to have the following properties:

1. Low switching cost: This is the purpose of the criterion. An algorithm equipped with this policy switching criterion should have low switching cost.

2. High reward: Since the deployed policy determines the collected samples and the agent uses fewer deployed policies, the collected data may not be informative enough to learn an optimal, high-reward policy. The criterion must deploy policies that collect informative samples.

3. Sample efficiency: Since the agent only uses a few deployed policies, it may collect more redundant samples, which would not be gathered if the agent switched policies frequently. We would like algorithms equipped with the criterion to have sample efficiency similar to the case without the low-switching-cost constraint.
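To make the feature-distance idea concrete, the following is a minimal sketch of such a switching rule. The function names, the mean-L2 feature metric, and the threshold value are our illustrative assumptions, not the paper's exact specification; in practice the features would be the penultimate-layer activations of the two Q-networks on a shared batch of states:

```python
import numpy as np

def feature_distance(deployed_feats, learning_feats):
    # Mean Euclidean distance between the deployed Q-network's and the
    # learning Q-network's features on the same batch of states.
    # (The exact metric here is an assumption for illustration.)
    return float(np.mean(np.linalg.norm(deployed_feats - learning_feats, axis=1)))

def should_switch(deployed_feats, learning_feats, threshold=1.0):
    # Redeploy the learning network only when its features have drifted
    # far enough from the deployed network's, keeping switches rare.
    return feature_distance(deployed_feats, learning_feats) > threshold

# Toy usage: features for a batch of 2 states, 3-dimensional each.
deployed = np.zeros((2, 3))
learning = np.ones((2, 3))   # per-state distance = sqrt(3) ~ 1.73
print(should_switch(deployed, learning, threshold=1.0))  # True: switch
print(should_switch(deployed, learning, threshold=2.0))  # False: keep policy
```

A larger threshold yields fewer switches (lower switching cost) at the risk of collecting data from a stale policy; the adaptive criterion in the paper is about choosing when this drift is large enough to justify a switch.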

