NEAR-OPTIMAL DEPLOYMENT EFFICIENCY IN REWARD-FREE REINFORCEMENT LEARNING WITH LINEAR FUNCTION APPROXIMATION

Abstract

We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting. This problem is well motivated because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $O(d^2 H^5 / \epsilon^2)$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve both optimal deployment complexity and optimal $d$ dependence in sample complexity, even when the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, both of which may be of independent interest. Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity.

1. INTRODUCTION

In many practical reinforcement learning (RL) based tasks, limited computing resources hinder the application of fully adaptive algorithms that frequently deploy new exploration policies. Instead, it is usually cheaper to collect data in large batches using the currently deployed policy. Take recommendation systems (Afsar et al., 2021) as an example: the system can gather plentiful new data in a very short time, while deploying a new policy often takes much longer, as it requires extensive computing and human resources. Therefore, it is impractical to switch the policy based on instantaneous data as a typical RL algorithm would demand. A feasible alternative is to run a large batch of experiments in parallel and decide whether to update the policy only after the whole batch is complete. The same constraint also appears in other RL applications such as healthcare (Yu et al., 2021), robotics (Kober et al., 2013), and new material design (Zhou et al., 2019). In those scenarios, the agent needs to minimize the number of policy deployments while learning a good policy using (nearly) the same number of trajectories as its fully adaptive counterparts.

On the empirical side, Matsushima et al. (2020) first proposed the notion of deployment efficiency. Later, Huang et al. (2022) formally defined deployment complexity. Briefly speaking, deployment complexity measures the number of policy deployments while requiring each deployment to have a similar size. We measure the adaptivity of our algorithms via deployment complexity and leave its formal definition to Section 2. Toward deployment efficiency, the recent work by Qiao et al. (2022) designed an algorithm that solves reward-free exploration within $O(H)$ deployments. However, their sample complexity $O(|S|^2 |A| H^5 / \epsilon^2)$, although near-optimal in the tabular setting, can be unacceptably large in real-life applications where the state space is enormous or continuous.
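To make the constraint concrete, the deployment-complexity notion of Huang et al. (2022) can be sketched informally as follows (this is only a sketch; the formal definition is deferred to Section 2, and the notation here is ours for illustration):

```latex
% Deployment complexity (informal sketch, following Huang et al., 2022):
% an algorithm has deployment complexity $K$ with per-deployment sample size $N$
% if it deploys at most $K$ (possibly mixture) policies $\pi^{(1)},\dots,\pi^{(K)}$,
% collects $N$ trajectories under each, and updates only between deployments:
\[
  \text{total trajectories} \;=\; K \cdot N,
  \qquad
  \pi^{(k)} \text{ may depend only on data from deployments } 1, \dots, k-1 .
\]
```

The key point is that within a deployment the policy is frozen, so adaptivity is charged per deployment rather than per trajectory.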
For environments with large state spaces, function approximation is necessary to represent the features of states. Among existing work studying function approximation in RL, linear function approximation is arguably the simplest yet most fundamental setting. In this paper, we study deployment-efficient RL with linear function approximation under the reward-free setting, and we consider the following question:

Question 1.1. Is it possible to design deployment-efficient and sample-efficient reward-free RL algorithms with linear function approximation?
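For concreteness, the linear MDP model that underlies this setting can be sketched as follows (a standard formulation, stated here informally with our own notation; the paper's formal setup appears in Section 2): both transitions and rewards are linear in a known $d$-dimensional feature map.

```latex
% Linear MDP (sketch): there is a known feature map
% $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ such that, at each step $h$,
% some unknown measures $\mu_h(\cdot)$ and vector $\theta_h \in \mathbb{R}^d$ satisfy
\begin{align*}
  \mathbb{P}_h(s' \mid s, a) &= \langle \phi(s, a), \mu_h(s') \rangle, &
  r_h(s, a) &= \langle \phi(s, a), \theta_h \rangle .
\end{align*}
% In the reward-free setting, the reward parameters $\theta_h$ are unavailable
% during exploration, so the collected data must cover all directions of $\phi$.
```

This is why the sample complexity above scales with the feature dimension $d$ rather than with $|S|$ and $|A|$.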

