NEAR-OPTIMAL DEPLOYMENT EFFICIENCY IN REWARD-FREE REINFORCEMENT LEARNING WITH LINEAR FUNCTION APPROXIMATION

Abstract

We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting. This problem is well motivated because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension d and planning horizon H, we propose a new algorithm that collects at most O(d^2 H^5 / ϵ^2) trajectories within H deployments to identify an ϵ-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal dependence on d in the sample complexity at the same time, even when the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, both of which may be of independent interest. Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity.

1. INTRODUCTION

In many practical reinforcement learning (RL) tasks, limited computing resources hinder the application of fully adaptive algorithms that frequently deploy new exploration policies. Instead, it is usually cheaper to collect data in large batches using the currently deployed policy. Take recommendation systems (Afsar et al., 2021) as an example: the system can gather plentiful new data in a very short time, whereas deploying a new policy often takes much longer, as it requires extensive computing and human resources. It is therefore impractical to switch the policy based on instantaneous data, as a typical RL algorithm would demand. A feasible alternative is to run a large batch of experiments in parallel and decide whether to update the policy only after the whole batch is complete. The same constraint also appears in other RL applications such as healthcare (Yu et al., 2021), robotics (Kober et al., 2013), and new material design (Zhou et al., 2019). In these scenarios, the agent needs to minimize the number of policy deployments while learning a good policy from (nearly) the same number of trajectories as its fully adaptive counterparts. On the empirical side, Matsushima et al. (2020) first proposed the notion of deployment efficiency. Later, Huang et al. (2022) formally defined deployment complexity. Briefly speaking, deployment complexity measures the number of policy deployments while requiring each deployment to have a similar size. We measure the adaptivity of our algorithms via deployment complexity and defer its formal definition to Section 2. With deployment efficiency in mind, the recent work by Qiao et al. (2022) designed an algorithm that solves reward-free exploration within O(H) deployments. However, their sample complexity O(|S|^2 |A| H^5 / ϵ^2), although near-optimal in the tabular setting, can be unacceptably large in real-life applications where the state space is enormous or continuous.
For environments with a large state space, function approximation is necessary to represent the features of each state. Among existing work that studies function approximation in RL, linear function approximation is arguably the simplest yet most fundamental setting. In this paper, we study deployment-efficient RL with linear function approximation under the reward-free setting, and we consider the following question:

Question 1.1. Is it possible to design deployment-efficient and sample-efficient reward-free RL algorithms with linear function approximation?

Algorithms for reward-free RL               Sample complexity               Deployment complexity
Algorithm 1 & 2 in Wang et al. (2020)       O(d^3 H^6 / ϵ^2)                O(d^3 H^6 / ϵ^2)
FRANCIS (Zanette et al., 2020b) ‡           O(d^3 H^5 / ϵ^2)                O(d^3 H^5 / ϵ^2)
RFLIN (Wagenmaker et al., 2022b) ‡          O(d^2 H^5 / ϵ^2)                O(d^2 H^5 / ϵ^2)
Algorithm 2 & 4 in Huang et al. (2022) ‡    O(d^3 H^5 / (ϵ^2 ν_min^2)) *    H
LARFE (Qiao et al., 2022) †                 O(S^2 A H^5 / ϵ^2)              2H
Our Algorithm 1 & 2 (Theorem 5.1) ‡         O(d^2 H^5 / ϵ^2)                H
Our Algorithm 1 & 2 (Theorem 7.1) ⋆         O(S^2 A H^5 / ϵ^2)              H
Lower bound (Wagenmaker et al., 2022b)      Ω(d^2 H^2 / ϵ^2)                N.A.
Lower bound (Huang et al., 2022)            If polynomial sample            Ω(H)

Our contributions. In this paper, we answer the above question affirmatively by constructing an algorithm with near-optimal deployment and sample complexities. Our contributions are threefold.

• A new layer-by-layer type algorithm (Algorithm 1) for reward-free RL that achieves a deployment complexity of H and a sample complexity of O(d^2 H^5 / ϵ^2). Our deployment complexity is optimal, while our sample complexity has optimal dependence on d and ϵ. In addition, when applied to tabular MDPs, our sample complexity (Theorem 7.1) recovers the best known result O(S^2 A H^5 / ϵ^2).

• We generalize G-optimal design and select a near-optimal policy via uniform policy evaluation on a finite set of representative policies instead of using optimism and LSVI.
This technique helps tighten our sample complexity and may be of independent interest.

• We show that "no optimal-regret online learner can be deployment efficient": deployment efficiency is incompatible with the highly relevant regret-minimization setting. For regret minimization under linear MDPs, we present lower bounds (Theorems 7.2 and 7.3) for other measures of adaptivity: switching cost and batch complexity.
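The classical G-optimal experiment design that the second bullet builds on can be approximated over a finite set of feature vectors with a Frank-Wolfe iteration. The sketch below is a minimal illustration of that classical primitive only, not of the generalized design or the policy-evaluation scheme developed in this paper; the function name and parameters are our own. By the Kiefer-Wolfowitz theorem, the optimal design drives the worst-case prediction variance max_i x_i^T A(π)^{-1} x_i down to d.

```python
import numpy as np

def g_optimal_design(X, iters=1000):
    """Frank-Wolfe iteration for a G-optimal design over the rows of X.

    Returns a distribution pi over the n feature vectors that (approximately)
    minimizes max_i x_i^T A(pi)^{-1} x_i, where A(pi) = sum_i pi_i x_i x_i^T.
    At the optimum this value equals d (Kiefer-Wolfowitz).
    """
    n, d = X.shape
    pi = np.full(n, 1.0 / n)                # start from the uniform design
    for _ in range(iters):
        A = X.T @ (pi[:, None] * X)         # design matrix A(pi)
        # leverage scores x_i^T A(pi)^{-1} x_i for every row
        lev = np.einsum('ij,jk,ik->i', X, np.linalg.inv(A), X)
        i = int(np.argmax(lev))             # direction of largest variance
        g = lev[i]                          # g >= d always holds
        gamma = (g / d - 1.0) / (g - 1.0)   # exact line-search step size
        pi = (1.0 - gamma) * pi             # mix toward the vertex e_i
        pi[i] += gamma
    return pi
```

A quick check on random features: after running the iteration, the maximum leverage score should be close to the dimension d rather than to its uniform-design value.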

1.1. CLOSELY RELATED WORKS

There is a large and growing body of literature on the statistical theory of reinforcement learning that we will not attempt to review thoroughly. Detailed comparisons with existing work on reward-free RL (Wang et al., 2020; Zanette et al., 2020b; Wagenmaker et al., 2022b; Huang et al., 2022; Qiao et al., 2022) are given in Table 1. For more discussion of the relevant literature, please refer to Appendix A and the references therein. Notably, all existing algorithms under linear MDPs either admit a fully adaptive structure (which leads to deployment inefficiency) or suffer from sub-optimal sample complexity. In addition, when applied to tabular MDPs, our algorithm has the same sample complexity and a slightly better deployment complexity compared to Qiao et al. (2022). The deployment-efficient setting differs from other measures of adaptivity. The low-switching setting (Bai et al., 2019) restricts the number of policy updates, but the agent may decide whether to update the policy after collecting every single trajectory, which can be difficult to implement in practical applications. A more relevant setting, batched RL (Zhang et al., 2022), requires decisions about policy changes to be made at only a few (often predefined) checkpoints. Compared to batched RL, the requirement of deployment efficiency is stronger in that each deployment must collect the same number of trajectories. As a result, deployment-efficient algorithms are easier to run in parallel (see, e.g., Huang et al., 2022, for a more elaborate discussion). Lastly, we remark that our algorithms also work under the batched RL setting by running in H batches.
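To make the distinction among these adaptivity measures concrete, the following toy sketch (our own illustration, not part of any cited algorithm) computes the switching cost, the number of deployments, and the equal-size requirement of deployment efficiency from a schedule recording which policy generated each trajectory:

```python
from itertools import groupby

def run_lengths(schedule):
    """Lengths of maximal contiguous runs of the same policy."""
    return [len(list(run)) for _, run in groupby(schedule)]

def switching_cost(schedule):
    """Low-switching RL counts every policy change between consecutive
    trajectories, wherever in the execution it happens."""
    return len(run_lengths(schedule)) - 1

def num_deployments(schedule):
    """Each maximal run of a single policy is one deployment."""
    return len(run_lengths(schedule))

def is_deployment_efficient(schedule):
    """Deployment efficiency additionally requires every deployment
    to collect the same number of trajectories."""
    return len(set(run_lengths(schedule))) == 1

# Three equal-size deployments: allowed under deployment efficiency.
even = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
# A doubling schedule: same switching cost, but deployments differ in size.
doubling = [0, 1, 1, 2, 2, 2, 2]
```

Both schedules have switching cost 2 and three deployments, yet only `even` satisfies the equal-size constraint, which is what makes deployment-efficient algorithms straightforward to parallelize within each deployment.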



Table 1: Comparison of our results to existing work regarding sample complexity and deployment complexity. We highlight that our results match the best known results for both sample complexity and deployment complexity at the same time. ‡: We ignore lower-order terms in sample complexity for simplicity.

