PROVABLY MORE EFFICIENT Q-LEARNING IN THE ONE-SIDED-FEEDBACK/FULL-FEEDBACK SETTINGS

Abstract

We propose a new Q-learning-based algorithm, Elimination-Based Half-Q-Learning (HQL), that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. We also provide a simpler variant of the algorithm, Full-Q-Learning (FQL), for the full-feedback setting. We establish that HQL incurs Õ(H³√T) regret and FQL incurs Õ(H²√T) regret, where H is the length of each episode and T is the total length of the horizon. The regret bounds are not affected by the possibly huge state and action space. Our numerical experiments demonstrate the superior efficiency of HQL and FQL, and the potential of combining reinforcement learning with richer feedback models.

1. INTRODUCTION

Motivated by the classical operations research (OR) problem of inventory control, we customize Q-learning to more efficiently solve a wide range of problems with richer feedback than the usual bandit feedback. Q-learning is a popular reinforcement learning (RL) method that estimates the state-action value functions without estimating the huge transition matrix in a large MDP (Watkins & Dayan (1992), Jaakkola et al. (1993)). This paper is concerned with devising Q-learning algorithms that leverage the natural one-sided-feedback/full-feedback structures in many OR and finance problems.
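To make the Q-learning idea mentioned above concrete, the sketch below shows the standard tabular update rule, which estimates state-action values from sampled transitions without ever modeling the transition matrix. All quantities (sizes, learning rate, the single transition) are hypothetical illustrations, not taken from this paper:

```python
import numpy as np

# Minimal tabular Q-learning sketch (all quantities hypothetical).
# The update Q(x, y) <- (1 - a) * Q(x, y) + a * (r + max_y' Q(x', y'))
# learns state-action values directly from observed transitions.
S, A = 5, 3                  # number of states and actions
Q = np.zeros((S, A))
alpha = 0.5                  # learning rate

def q_update(x, y, r, x_next):
    # Blend the old estimate with the bootstrapped target r + max_y' Q(x', y')
    Q[x, y] = (1 - alpha) * Q[x, y] + alpha * (r + Q[x_next].max())

q_update(x=0, y=1, r=1.0, x_next=2)   # one observed transition
```

The algorithms in this paper modify how such updates are performed (and which actions are eliminated) to exploit the richer feedback structures.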

Motivation

The topic of developing efficient RL algorithms catering to special structures is fundamental and important, especially for the purpose of adopting RL more widely in real applications. By contrast, most RL literature considers settings with little feedback, while the study of single-stage online learning for bandits has a long history of considering a plethora of graph-based feedback models. We are particularly interested in the one-sided-feedback/full-feedback models because of their prevalence in many well-known problems, such as inventory control, online auctions, and portfolio management. In these real applications, RL has typically been outperformed by domain-specific algorithms or heuristics. We propose algorithms aimed at bridging this divide by incorporating problem-specific structures into classical reinforcement learning algorithms.

Our algorithms substantially improve the regret bounds (see Table 1) by catering to the full-feedback/one-sided-feedback structures of many problems. Because our regret bounds are unaffected by the cardinality of the state and action space, our Q-learning algorithms can handle huge state-action spaces, and even continuous state spaces in some cases (Section 8). Note that both our work and Dong et al. (2019) are designed for a subset of general episodic MDP problems: we focus on problems with richer feedback, while Dong et al. (2019) focus on problems with a nice aggregate structure known to the decision-maker. The one-sided-feedback setting, or similar notions, has attracted much research interest in many learning problems outside the scope of episodic MDPs, for example learning in auctions with binary feedback, dynamic pricing, and binary search (Weed et al. (2016), Feng et al. (2018), Cohen et al. (2020), Lobel et al. (2016)).
In particular, Zhao & Chen (2019) study the one-sided-feedback setting in the learning problem for bandits, using a similar idea of elimination. However, the episodic MDP setting for RL presents new challenges. Our results can be applied to their setting and solve the bandit problem as a special case. The idea of optimization by elimination has a long history (Even-Dar et al. (2002)). A recent example of the idea being used in RL is Lykouris et al. (2019), which solves the very different problem of robustness to adversarial corruptions. Q-learning has also been studied in settings with continuous states via adaptive discretization (Sinclair et al. (2019)). In many situations this is more efficient than the uniform discretization scheme we use; however, our algorithms' regret bounds are unaffected by the state-action space cardinality, so the difference is immaterial. Our special case, the full-feedback setting, shares similarities with the generative-model setting in that both allow access to the feedback for any state-action transition (Sidford et al. (2018)). However, the generative model is a strong oracle that can query any state-action transition, while the full-feedback model can only query for the current time step, after having chosen an action from the feasible set based on the current state, while accumulating regret.

Table 1: Regret comparisons for Q-learning algorithms on episodic MDP

Algorithm | Regret | Time complexity | Space complexity
Q-learning with UCB-Bernstein (Jin et al. (2018)) | Õ(√(H³SAT)) | O(T) | O(SAH)
Aggregated Q-learning (Dong et al. (2019)) | Õ(√(H⁴MT) + εT)¹ | O(MAT) | O(MT)
Full-Q-learning (FQL) | Õ(√(H⁴T)) | O(SAT) | O(SAH)
Elimination-Based Half-Q-learning (HQL) | Õ(√(H⁶T)) | O(SAT) | O(SAH)
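To make the one-sided-feedback structure concrete, consider a newsvendor-style sketch (the price, cost, and quantities below are hypothetical, not taken from the paper): once an order y is placed and the censored sales min(D, y) are observed, the realized reward of every smaller order y' ≤ y is fully determined, while larger orders may remain unknown.

```python
# One-sided feedback in a newsvendor-style problem (hypothetical numbers).
price, cost = 2.0, 1.0

def reward(order, demand):
    # revenue on units actually sold, minus purchasing cost of the whole order
    return price * min(order, demand) - cost * order

y_taken, demand = 7, 5
sales = min(demand, y_taken)          # all we observe when demand is censored
# Every alternative order y' <= y_taken has a computable counterfactual reward,
# because min(sales, y') == min(demand, y') whenever y' <= y_taken.
counterfactual = {y: price * min(sales, y) - cost * y
                  for y in range(y_taken + 1)}
assert all(counterfactual[y] == reward(y, demand)
           for y in range(y_taken + 1))
```

This is exactly the kind of "one side of the taken action" counterfactual information that HQL's elimination step is designed to exploit.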

2. PRELIMINARIES

We consider an episodic Markov decision process, MDP(S, A, H, P, r), where S is the set of states with |S| = S, A is the set of actions with |A| = A, H is the constant length of each episode, P is the unknown transition matrix giving the distribution over next states if some action y is taken at some state x at step h ∈ [H], and r_h : S × A → [0, 1] is the reward function at stage h that depends on the environment randomness D_h.

In each episode, an initial state x_1 is picked arbitrarily by an adversary. Then, at each stage h, the agent observes state x_h ∈ S, picks an action y_h ∈ A, receives a realized reward r_h(x_h, y_h), and then transitions to the next state x_{h+1}, which is determined by x_h, y_h, D_h. At the final stage H, the episode terminates after the agent takes action y_H and receives reward r_H. Then the next episode begins. Let K denote the number of episodes, and T the total length of the horizon, T = H × K, where H is a constant. This is the classic episodic MDP setting, except that in the one-sided-feedback setting the environment randomness D_h, once realized, can help us determine the reward/transition of any alternative feasible action that "lies on one side" of our taken action (Section 2.1). The goal is to maximize the total reward accrued in each episode.

A policy π of an agent is a collection of functions {π_h : S → A}_{h ∈ [H]}. We use V^π_h : S → R to denote the value function at stage h under policy π, so that V^π_h(x) gives the expected sum of remaining rewards under policy π until the end of the episode, starting from x_h = x:

    V^π_h(x) := E[ Σ_{h'=h}^{H} r_{h'}(x_{h'}, π_{h'}(x_{h'})) | x_h = x ].

Q^π_h : S × A → R denotes the Q-value function at stage h, so that Q^π_h(x, y) gives the expected sum of remaining rewards under policy π until the end of the episode, starting from x_h = x, y_h = y:

    Q^π_h(x, y) := E[ r_h(x_h, y) + Σ_{h'=h+1}^{H} r_{h'}(x_{h'}, π_{h'}(x_{h'})) | x_h = x, y_h = y ].
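The definitions above can be checked numerically by backward induction on a small synthetic MDP. The sketch below (with hypothetical toy sizes and random rewards/transitions) computes Q^π_h and V^π_h for a fixed policy π and uses the identity V^π_h(x) = Q^π_h(x, π_h(x)):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 3, 5                                # hypothetical toy sizes
P = rng.dirichlet(np.ones(S), size=(H, S, A))    # P[h, x, y]: distribution over next states
r = rng.uniform(0.0, 1.0, size=(H, S, A))        # expected rewards in [0, 1]
pi = rng.integers(0, A, size=(H, S))             # an arbitrary deterministic policy

V = np.zeros((H + 1, S))                         # V[H] = 0: nothing after the last stage
Q = np.zeros((H, S, A))
for h in reversed(range(H)):
    Q[h] = r[h] + P[h] @ V[h + 1]                # Bellman backup under policy pi
    V[h] = Q[h, np.arange(S), pi[h]]             # V^pi_h(x) = Q^pi_h(x, pi_h(x))

# Since each stage reward lies in [0, 1], values are bounded by the remaining horizon.
assert np.all(V >= 0.0) and np.all(V[0] <= H)
```

This also makes visible why regret bounds carry factors of H: each V^π_1(x) can be as large as H, since every one of the H stage rewards lies in [0, 1].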



¹ Here M is the number of aggregate state-action pairs; ε is the largest difference between any pair of optimal state-action values associated with a common aggregate state-action pair.



PRIOR WORK

The most relevant literature to this paper is Jin et al. (2018), who prove the optimality of Q-learning with Upper-Confidence-Bound bonus and Bernstein-style bonus in tabular MDPs. The recent work of Dong et al. (2019) improves upon Jin et al. (2018) when an aggregation of the state-action pairs with known error is given beforehand.



