PROVABLY MORE EFFICIENT Q-LEARNING IN THE ONE-SIDED-FEEDBACK/FULL-FEEDBACK SETTINGS

Abstract

We propose a new Q-learning-based algorithm, Elimination-Based Half-Q-Learning (HQL), that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. We also provide a simpler variant of the algorithm, Full-Q-Learning (FQL), for the full-feedback setting. We establish that HQL incurs Õ(H³√T) regret and FQL incurs Õ(H²√T) regret, where H is the length of each episode and T is the total length of the horizon. The regret bounds are not affected by the possibly huge state and action spaces. Our numerical experiments demonstrate the superior efficiency of HQL and FQL, and the potential of combining reinforcement learning with richer feedback models.

1. INTRODUCTION

Motivated by a classical operations research (OR) problem, inventory control, we customize Q-learning to solve a wide range of problems with richer feedback than the usual bandit feedback more efficiently. Q-learning is a popular reinforcement learning (RL) method that estimates the state-action value functions without estimating the huge transition matrix of a large MDP (Watkins & Dayan (1992), Jaakkola et al. (1993)). This paper is concerned with devising Q-learning algorithms that leverage the natural one-sided-feedback/full-feedback structures in many OR and finance problems.

Motivation. The topic of developing efficient RL algorithms catering to special structures is fundamental and important, especially for the purpose of adopting RL more widely in real applications. By contrast, most RL literature considers settings with little feedback, while the study of single-stage online learning for bandits has a long history of considering a plethora of graph-based feedback models. We are particularly interested in the one-sided-feedback/full-feedback models because of their prevalence in many famous problems, such as inventory control, online auctions, and portfolio management. In these real applications, RL has typically been outperformed by domain-specific algorithms or heuristics. We propose algorithms aimed at bridging this divide by incorporating problem-specific structures into classical reinforcement learning algorithms.

1.1. PRIOR WORK

The literature most relevant to this paper is Jin et al. (2018), who prove the optimality of Q-learning with Upper-Confidence-Bound bonus and Bernstein-style bonus in tabular MDPs. The recent work of Dong et al. (2019) improves upon Jin et al. (2018) when an aggregation of the state-action pairs with known error is given beforehand. Our algorithms substantially improve the regret bounds (see Table 1) by catering to the full-feedback/one-sided-feedback structures of many problems. Because our regret bounds are unaffected by the cardinality of the state and action space, our Q-learning algorithms can handle huge state-action spaces, and even continuous state spaces in some cases (Section 8). Note that both our work and Dong et al. (2019) are designed for a subset of general episodic MDP problems: we focus on problems with richer feedback, while Dong et al. (2019) focus on problems with a nice aggregate structure known to the decision-maker. The one-sided-feedback setting, or similar notions, has attracted substantial research interest in many learning problems outside the scope of episodic MDPs, for example learning in auctions with binary feedback, dynamic pricing, and binary search (Weed et al. (2016), Feng et al. (2018), Cohen et al. (2020), Lobel et al. (2016)). In particular, Zhao & Chen (2019) study the one-sided-feedback setting in the bandit learning problem, using a similar idea of elimination; however, the episodic MDP setting for RL presents new challenges. Our results can be applied to their setting and solve the bandit problem as a special case. The idea of optimization by elimination has a long history (Even-Dar et al. (2002)). A recent example of this idea in RL is Lykouris et al. (2019), which solves the very different problem of robustness to adversarial corruptions. Q-learning has also been studied in continuous-state settings with adaptive discretization (Sinclair et al. (2019)).
In many situations adaptive discretization is more efficient than the uniform discretization scheme we use; however, our algorithms' regret bounds are unaffected by the state-action space cardinality, so the difference is immaterial. Our special case, the full-feedback setting, shares similarities with the generative-model setting in that both allow access to the feedback for any state-action transition (Sidford et al. (2018)). However, the generative model is a strong oracle that can query any state-action transition, whereas the full-feedback model can only query after having chosen an action from the feasible set based on the current state, all while accumulating regret.

Table 1: Regret comparisons of Q-learning algorithms on episodic MDPs.

Algorithm | Regret | Time | Space
Aggregated Q-learning (Dong et al., 2019) | Õ(√(H⁴MT) + εT)¹ | O(MAT) | O(MT)
Full-Q-Learning (FQL) | Õ(√(H⁴T)) | O(SAT) | O(SAH)
Elimination-Based Half-Q-Learning (HQL) | Õ(√(H⁶T)) | O(SAT) | O(SAH)

¹ Here M is the number of aggregate state-action pairs; ε is the largest difference between any pair of optimal state-action values associated with a common aggregate state-action pair.

2. PRELIMINARIES

We consider an episodic Markov decision process MDP(S, A, H, P, r), where S is the set of states with |S| = S, A is the set of actions with |A| = A, H is the constant length of each episode, P is the unknown transition kernel giving the distribution over next states when action y is taken at state x at step h ∈ [H], and r_h : S × A → [0, 1] is the reward function at stage h, which depends on the environment randomness D_h. In each episode, an initial state x_1 is picked arbitrarily by an adversary. Then, at each stage h, the agent observes state x_h ∈ S, picks an action y_h ∈ A, receives a realized reward r_h(x_h, y_h), and transitions to the next state x_{h+1}, which is determined by x_h, y_h, D_h. At the final stage H, the episode terminates after the agent takes action y_H and receives reward r_H; then the next episode begins. Let K denote the number of episodes and T the total length of the horizon, T = H × K, where H is a constant. This is the classic episodic MDP setting, except that in the one-sided-feedback setting the environment randomness D_h, once realized, lets us determine the reward/transition of any alternative feasible action that "lies on one side" of our taken action (Section 2.1). The goal is to maximize the total reward accrued in each episode.

A policy π of an agent is a collection of functions {π_h : S → A}_{h ∈ [H]}. We use V^π_h : S → R to denote the value function at stage h under policy π, so that V^π_h(x) gives the expected sum of remaining rewards under policy π until the end of the episode, starting from x_h = x:

    V^π_h(x) := E[ Σ_{h'=h}^{H} r_{h'}(x_{h'}, π_{h'}(x_{h'})) | x_h = x ].

Q^π_h : S × A → R denotes the Q-value function at stage h, so that Q^π_h(x, y) gives the expected sum of remaining rewards under policy π until the end of the episode, starting from x_h = x, y_h = y:

    Q^π_h(x, y) := E[ r_h(x_h, y) + Σ_{h'=h+1}^{H} r_{h'}(x_{h'}, π_{h'}(x_{h'})) | x_h = x, y_h = y ].

Let π* denote an optimal policy of the MDP, which attains the optimal value functions V*_h(x) = sup_π V^π_h(x) for all x ∈ S and h ∈ [H]. Recall the Bellman equations:

    V^π_h(x) = Q^π_h(x, π_h(x)),
    Q^π_h(x, y) := E_{x', r_h ∼ P(·|x,y)}[ r_h + V^π_{h+1}(x') ],
    V^π_{H+1}(x) = 0 for all x ∈ S;

and the Bellman optimality equations:

    V*_h(x) = max_y Q*_h(x, y),
    Q*_h(x, y) := E_{x', r_h ∼ P(·|x,y)}[ r_h + V*_{h+1}(x') ],
    V*_{H+1}(x) = 0 for all x ∈ S.

We let Regret_MDP(K) denote the expected cumulative regret against π* on the MDP up to the end of episode K. Let π_k denote the policy the agent chooses at the beginning of the k-th episode. Then

    Regret_MDP(K) = Σ_{k=1}^{K} [ V*_1(x^k_1) − V^{π_k}_1(x^k_1) ].    (1)

2.1. ONE-SIDED-FEEDBACK

Whenever we take an action y at stage h, once the environment randomness D_h is realized, we can learn the rewards/transitions for all actions that lie on one side of y: all y' ≤ y in the lower-one-sided-feedback setting (or all y' ≥ y for the higher side). This setting requires that the action space can be embedded in a compact subset of R (Appendix B), and that the reward/transition depends only on the action, the time step, and the environment randomness, even though the feasible action set depends on the state and is assumed to be an interval A ∩ [a, ∞) for some a = a_h(x_h). We assume that given D_h, the next state x_{h+1}(·) is increasing in y_h, and that a_h(·) is increasing in x_h in the lower-one-sided-feedback setting. We also assume the optimal value functions are concave. These assumptions may seem strong, but they are widely satisfied in OR/finance problems, such as inventory control (lost-sales model), portfolio management, airlines' overbooking policies, and online auctions.
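To ground the notation, the Bellman optimality backup can be sketched in a few lines of Python on a randomly generated toy MDP. The sizes S = 3, A = 2, H = 4 and the random rewards and transitions below are illustrative assumptions, not from the paper:

```python
import numpy as np

# Backward induction for the Bellman optimality equations on a toy MDP
# (hypothetical sizes and random P, r; purely illustrative).
rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
P = rng.dirichlet(np.ones(S), size=(H, S, A))  # P[h, x, y] = dist over next states
r = rng.uniform(0.0, 1.0, size=(H, S, A))      # r[h, x, y] in [0, 1]

V = np.zeros((H + 2, S))                        # boundary condition V_{H+1} = 0
Q = np.zeros((H + 1, S, A))
for h in range(H, 0, -1):                       # backward induction h = H, ..., 1
    Q[h] = r[h - 1] + P[h - 1] @ V[h + 1]       # Q*_h(x,y) = E[r_h + V*_{h+1}(x')]
    V[h] = Q[h].max(axis=1)                     # V*_h(x) = max_y Q*_h(x,y)
```

Since every reward lies in [0, 1], the computed values satisfy 0 ≤ V*_h ≤ H, consistent with the optimistic initialization Q_h ← H used by the algorithms below.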

2.2. FULL-FEEDBACK

Whenever we take an action at stage h, once D_h is realized, we can learn the rewards/transitions for all state-action pairs. This special case does not require the assumptions of Section 2.1. Example problems include inventory control (backlogged model) and portfolio management.
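The difference between the two feedback models can be made concrete with an inventory-style toy example; the cost function and numbers below are illustrative assumptions, not the paper's models:

```python
# Contrast of feedback models on a made-up inventory-style reward:
# once the demand d is realized, full feedback reveals the reward of EVERY
# order-up-to level, while lower-one-sided feedback reveals it only for
# levels y' <= y, where y is the action actually taken.
def reward(y, d, holding=1.0, backlog=2.0):
    return -holding * max(y - d, 0.0) - backlog * max(d - y, 0.0)

levels = list(range(10))
y_taken, d = 6, 4.0
full_obs = {yp: reward(yp, d) for yp in levels}                       # all levels
one_sided_obs = {yp: reward(yp, d) for yp in levels if yp <= y_taken}  # y' <= y only
```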

3. ALGORITHMS

Algorithm 1: Elimination-Based Half-Q-Learning (HQL)

    Initialization: Q_h(y) ← H for all (y, h) ∈ A × [H]; A^0_h ← A for all h ∈ [H]; A^k_{H+1} ← A for all k ∈ [K].
    for k = 1, ..., K:
        initialize the list of realized environment randomness to be empty: D^k ← [ ];
        receive x^k_1;
        for h = 1, ..., H:
            if max{A^k_h} is not feasible:
                take y^k_h ← the closest feasible action to A^k_h;
            else:
                take y^k_h ← max{A^k_h};
            observe the realized environment randomness D^k_h and append it to D^k;
            update x^k_{h+1} ← x'_{h+1}(x^k_h, y^k_h, D^k_h);
        for h = H, ..., 1:
            for y ∈ A^k_h:
                simulate the trajectory x'_{h+1}, ..., x'_{τ^k_h(x,y)} as if we had chosen y at stage h, using D^k, until we find τ^k_h(x, y), the next time we are able to choose from A^k_{τ^k_h(x,y)};
                update Q_h(y) ← (1 − α_k) Q_h(y) + α_k [ r_{h,τ^k_h(x,y)} + V_{τ^k_h(x,y)}(x'_{τ^k_h(x,y)}) ];
            update y^{k*}_h ← argmax_{y ∈ A^k_h} Q_h(y);
            update A^{k+1}_h ← { y ∈ A^k_h : Q_h(y^{k*}_h) − Q_h(y) ≤ Confidence Interval };
            update V_h(x) ← max_{feasible y given x} Q_h(y).

Without loss of generality, we present HQL in the lower-one-sided-feedback setting. We define constants α_k = (H + 1)/(H + k) for all k ∈ [K]. We use r_{h,h'} to denote the cumulative reward from stage h to stage h'. We use x'_{h+1}(x, y, D^k_h) to denote the next state given x, y, and D^k_h. By the assumptions in Section 2.1, Q_h(x, y) depends only on y in Algorithm 1, so we simplify the notation to Q_h(y).

Main idea of Algorithm 1. In any episode k, we maintain a "running set" A^k_h of all the actions that are possibly the best action for stage h. Whenever we take an action, we update the Q-values of all the actions in A^k_h. To maximize the utility of the lower-sided feedback, we always select the largest action in A^k_h, which lets us observe the most feedback. We might be in a state where we cannot choose from A^k_h; then we take the closest feasible action to A^k_h (the smallest feasible action, in the lower-sided-feedback case). By the assumptions in Section 2.1, this is with high probability the optimal action in that state, and we are always able to observe all the rewards and next states for actions in the running set. During episode k, we act in real time and keep track of the realized environment randomness. At the end of the episode, for each h, we simulate the trajectories as if we had taken each action in A^k_h, and update the corresponding value functions, so as to shrink the running sets.

Algorithm 2: Full-Q-Learning (FQL)

    Initialization: Q_h(x, y) ← H for all (x, y, h) ∈ S × A × [H].
    for k = 1, ..., K:
        receive x^k_1;
        for h = 1, ..., H:
            take action y^k_h ← argmax_{feasible y given x^k_h} Q_h(x^k_h, y), and observe the realized D^k_h;
            for x ∈ S:
                for y ∈ A:
                    update Q_h(x, y) ← (1 − α_k) Q_h(x, y) + α_k [ r_h(x, y, D^k_h) + V_{h+1}(x'_{h+1}(x, y, D^k_h)) ];
                update V_h(x) ← max_{feasible y given x} Q_h(x, y);
            update x^k_{h+1} ← x'_{h+1}(x^k_h, y^k_h, D^k_h).

Algorithm 2 is a simpler variant of Algorithm 1: we effectively set the "Confidence Interval" to be infinite and select the estimated best action instead of the maximum of the running set. It can also be viewed as an adaptation of Jin et al. (2018) to the full-feedback setting.
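To make the full-feedback update loop concrete, here is a minimal runnable sketch of the FQL-style update on a made-up toy environment. The reward and transition functions, the sizes S, A, H, K, and the uniform randomness D_h are all illustrative assumptions, not the paper's models:

```python
import numpy as np

# Sketch of the Full-Q-Learning (FQL) loop on a hypothetical toy environment.
# Once D_h is realized, the reward and next state of EVERY (x, y) pair can be
# evaluated, so all Q-entries at stage h are updated in each step.
rng = np.random.default_rng(1)
S, A, H, K = 4, 4, 3, 200

def reward(x, y, d):                    # illustrative reward in [0, 1]
    return 1.0 - abs(y - d) / (A - 1)

def next_state(x, y, d):                # illustrative deterministic transition
    return (x + y + d) % S

Q = np.full((H + 1, S, A), float(H))    # optimistic initialization Q_h = H
V = np.zeros((H + 2, S))                # boundary V_{H+1} = 0
for k in range(1, K + 1):
    alpha = (H + 1) / (H + k)           # the paper's learning rate alpha_k
    x = int(rng.integers(S))            # arbitrary initial state x_1
    for h in range(1, H + 1):
        y = int(np.argmax(Q[h, x]))     # estimated best action (all feasible here)
        d = int(rng.integers(A))        # realized environment randomness D_h
        for xs in range(S):             # full feedback: update every (x, y) pair
            for ys in range(A):
                target = reward(xs, ys, d) + V[h + 1, next_state(xs, ys, d)]
                Q[h, xs, ys] = (1 - alpha) * Q[h, xs, ys] + alpha * target
        V[h] = Q[h].max(axis=1)         # V_h(x) = max_y Q_h(x, y)
        x = next_state(x, y, d)
```

Note how the double loop over (xs, ys) is what distinguishes full feedback from the usual bandit feedback, where only the visited pair (x, y) could be updated.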

4. MAIN RESULTS

Theorem 1. HQL has O(H³√(Tι)) total expected regret on the episodic MDP problem in the one-sided-feedback setting. FQL has O(H²√(Tι)) total expected regret in the full-feedback setting.

Theorem 2. For any (randomized or deterministic) algorithm, there exists a full-feedback episodic MDP problem with expected regret Ω(√(HT)), even if the Q-values are independent of the state.

5. OVERVIEW OF PROOF

Following Jin et al. (2018) and Dong et al. (2019), we use Q^k_h, V^k_h to denote the Q_h, V_h functions at the beginning of episode k. Recall α_k = (H + 1)/(H + k). As in Jin et al. (2018), define

    α^0_k := Π_{j=1}^{k} (1 − α_j),    α^i_k := α_i Π_{j=i+1}^{k} (1 − α_j),

and we provide some useful properties in Lemma 3. Note that Property 3 is tighter than the corresponding bound in Lemma 4.1 of Jin et al. (2018), which we obtain by a more careful algebraic analysis.

Lemma 3. The following properties hold for α^i_t:
1. Σ_{i=1}^{t} α^i_t = 1 and α^0_t = 0 for t ≥ 1; Σ_{i=1}^{t} α^i_t = 0 and α^0_t = 1 for t = 0.
2. max_{i ∈ [t]} α^i_t ≤ 2H/t and Σ_{i=1}^{t} (α^i_t)² ≤ 2H/t for every t ≥ 1.
3. Σ_{t=i}^{∞} α^i_t = 1 + 1/H for every i ≥ 1.
4. 1/√t ≤ Σ_{i=1}^{t} α^i_t/√i ≤ (1 + 1/H)/√t for every t ≥ 1.

All missing proofs for the lemmas in this section are in Appendix B. (For convenience, we use a "Confidence Interval" of (8/√(k − 1))·√(H⁵ι), where ι = 9 log(AT).)

Lemma 4 (shortfall decomposition). For any policy π and any k ∈ [K], the regret in episode k is

    (V*_1 − V^π_1)(x^k_1) = E_π[ Σ_{h=1}^{H} ( max_{y ∈ A} Q*_h(x^k_h, y) − Q*_h(x^k_h, y^k_h) ) ].

Shortfall decomposition lets us calculate the regret of our policy by summing, over the steps, the difference between the Q-value of the action our policy takes and the Q-value of the action the optimal policy π* would have taken in the same state. We would then need to take the expectation of this random sum, but we get around this by finding high-probability upper bounds on the random sum, as follows. Recall that for any (x, h, k) ∈ S × [H] × [K] and any y ∈ A^k_h, τ^k_h(x, y) is the next stage after h in episode k at which our policy lands on a simulated next state x'_{τ^k_h(x,y)} that allows us to take an action in the running set A^k_{τ^k_h(x,y)}. The stages in between are "skipped" in the sense that, when we take y at time (h, k), we do not perform Q-value or V-value updates during those stages; over all h' ∈ [H], we only update Q-values and V-values while it is feasible to choose from the running set. For example, if no skipping happens, then τ^k_h(x, y) = h + 1. Therefore, τ^k_h(x, y) is a stopping time. Using the optional stopping theorem, E[M_τ] = M_0 for any stopping time τ and discrete-time martingale M, our Bellman equation becomes

    Q*_h(y) = E_{r*, x', τ^k_h ∼ P(·|x,y)}[ r*_{h,τ^k_h} + V*_{τ^k_h}(x'_{τ^k_h}) ],

where we simplify τ^k_h(x, y) to τ^k_h when there is no confusion, and recall that r_{h,h'} denotes the cumulative reward from stage h to h'. On the other hand, by simulating paths, HQL updates the Q functions backward, h = H, ..., 1, for any x ∈ S, y ∈ A^k_h, at any stage h in any episode k:

    Q^{k+1}_h(y) ← (1 − α_k) Q^k_h(y) + α_k [ r^k_{h,τ^k_h(x,y)} + V^{k+1}_{τ^k_h(x,y)}(x'_{τ^k_h(x,y)}) ].

Then, by this update rule and the definition of the α^i_k's, we have

    Q^k_h(y) = α^0_{k−1} H + Σ_{i=1}^{k−1} α^i_{k−1} [ r^i_{h,τ^i_h(x,y)} + V^{i+1}_{τ^i_h(x,y)}(x^i_{τ^i_h(x,y)}) ],    (⋆)

which naturally gives us Lemma 5. For simpler notation, write τ^i_h = τ^i_h(x, y).

Lemma 5. For any (x, h, k) ∈ S × [H] × [K] and any y ∈ A^k_h, we have

    (Q^k_h − Q*_h)(y) = α^0_{k−1} (H − Q*_h(y))
        + Σ_{i=1}^{k−1} α^i_{k−1} [ (V^{i+1}_{τ^i_h} − V*_{τ^i_h})(x^i_{τ^i_h}) + (r^i_{h,τ^i_h} − r*_{h,τ^i_h})
        + ( V*_{τ^i_h}(x^i_{τ^i_h}) + r*_{h,τ^i_h} − E_{r*, x', τ^i_h ∼ P(·|x,y)}[ r*_{h,τ^i_h} + V*_{τ^i_h}(x'_{τ^i_h}) ] ) ].

Then we can bound the difference between our Q-value estimates and the optimal Q-values:

Lemma 6. For any (x, h, k) ∈ S × [H] × [K] and any y ∈ A^k_h, letting ι = 9 log(AT), we have

    (Q^k_h − Q*_h)(y) ≤ α^0_{k−1} H + Σ_{i=1}^{k−1} α^i_{k−1} [ (V^{i+1}_{τ^i_h} − V*_{τ^i_h})(x^i_{τ^i_h}) + (r^i_{h,τ^i_h} − r*_{h,τ^i_h}) ] + c √(H³ι/(k − 1))

with probability at least 1 − 1/(AT)⁸, where we can choose c = 2√2.

Now we define {φ_h}_{h=1}^{H+1} to be a list of values that satisfy the recursive relationship

    φ_h = H + (1 + 1/H) φ_{h+1} + c √(H³ι)  for any h ∈ [H],  with φ_{H+1} = 0,

where c is the same constant as in Lemma 6. By Lemma 6, we get:

Lemma 7. For any (h, k) ∈ [H] × [K], the sequence {φ_h}_{h=1}^{H} satisfies

    max_{y ∈ A^k_h} (Q^k_h − Q*_h)(y) ≤ φ_h/√(k − 1)

with probability at least 1 − 1/(AT)⁵.

Lemma 7 helps the following three lemmas establish the validity of the running sets A^k_h:

Lemma 8. For any h ∈ [H] and k ∈ [K], the optimal action y*_h is in the running set A^k_h with probability at least 1 − 1/(AT)⁵.

Lemma 9. Whenever we can play in A^k_h, the optimal Q-value of our action is within 3φ_h/√(k − 1) of the optimal Q-value of the optimal policy's action, with probability at least 1 − 2/(AT)⁵.

Lemma 10. Whenever we cannot play in A^k_h, our action, the feasible action closest to the running set, is the optimal action for the state x with probability at least 1 − 1/(AT)⁵.

Naturally, we partition the stages h = 1, ..., H in each episode k into two sets, Φ^k_A and Φ^k_B, where Φ^k_A contains the stages h at which we are able to choose from the running set, and Φ^k_B contains the stages h at which we are not, so that Φ^k_A ⊔ Φ^k_B = [H] for all k ∈ [K]. Now we can prove Theorem 1. By Lemma 4 we have

    (V*_1 − V^{π_k}_1)(x^k_1) = E[ Σ_{h=1}^{H} ( max_{y ∈ A} Q*_h(y) − Q*_h(y^k_h) ) ]
        ≤ E[ Σ_{h ∈ Φ^k_A} max_{y ∈ A} ( Q*_h(y) − Q*_h(y^k_h) ) ] + E[ Σ_{h ∈ Φ^k_B} max_{y ∈ A} ( Q*_h(y) − Q*_h(y^k_h) ) ].

By Lemma 10, the second term is upper bounded by

    0 · (1 − 1/(A⁵T⁵)) + Σ_{h ∈ Φ^k_B} H · 1/(A⁵T⁵) ≤ Σ_{h ∈ Φ^k_B} H/(A⁵T⁵).

By Lemma 7, the first term is upper bounded by

    E[ Σ_{h ∈ Φ^k_A} O(φ_h/√(k − 1)) · P( max_{y ∈ A^k_h} (Q*_h(y) − Q*_h(y^k_h)) ≤ φ_h/√(k − 1) )
        + Σ_{h ∈ Φ^k_A} H · P( max_{y ∈ A^k_h} (Q*_h(y) − Q*_h(y^k_h)) > φ_h/√(k − 1) ) ]
    ≤ O( Σ_{h ∈ Φ^k_A} φ_h/√(k − 1) ) + O( Σ_{h ∈ Φ^k_A} H/(A⁵T⁵) ).

Then the expected cumulative regret of HQL against the optimal policy is

    Regret_MDP(K) = Σ_{k=1}^{K} (V*_1 − V^{π_k}_1)(x^k_1)
        = (V*_1 − V^{π_1}_1)(x^1_1) + Σ_{k=2}^{K} (V*_1 − V^{π_k}_1)(x^k_1)
        ≤ H + Σ_{k=2}^{K} ( Σ_{h ∈ Φ^k_B} H/(A⁵T⁵) + Σ_{h ∈ Φ^k_A} φ_h/√(k − 1) + Σ_{h ∈ Φ^k_A} H/(A⁵T⁵) )
        ≤ Σ_{k=2}^{K} O(√(H⁷ι))/√(k − 1) ≤ O(H³√(Tι)).  ∎
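The weight properties of Lemma 3 and the final summation step of the proof can both be checked numerically. A small verification sketch, where the values of H, t, i, and K are arbitrary test choices:

```python
import math

# Numeric sanity checks (illustrative) for Lemma 3 and the last summation
# step of the regret proof; H, t, i, K below are arbitrary test values.
H = 5

def alpha(k):                          # alpha_k = (H+1)/(H+k)
    return (H + 1) / (H + k)

def weight(i, t):                      # alpha^i_t = alpha_i * prod_{j=i+1}^t (1 - alpha_j)
    w = alpha(i)
    for j in range(i + 1, t + 1):
        w *= 1.0 - alpha(j)
    return w

t = 50
w = [weight(i, t) for i in range(1, t + 1)]
total = sum(w)                         # Property 1: should equal 1
wmax = max(w)                          # Property 2: should be <= 2H/t
sumsq = sum(x * x for x in w)          # Property 2: should be <= 2H/t
tail = sum(weight(3, tt) for tt in range(3, 2000))  # Property 3 with i = 3 (truncated)

# Final step of the proof: sum_{k=2}^K 1/sqrt(k-1) = O(sqrt(K)).
K = 10_000
s = sum(1.0 / math.sqrt(k - 1) for k in range(2, K + 1))
```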

5.1. PROOFS FOR FQL

Our proof for HQL can be conveniently adapted to recover the same regret bound for FQL in the full-feedback setting. We need a variant of Lemma 9: whenever we take the estimated best feasible action in FQL, the optimal Q-value of our action is within 3φ_h/√(k − 1) of the optimal Q-value of the optimal action, with probability at least 1 − 2/(AT)⁵. Then, using Lemmas 4, 5, 6, and 8 with every Q^k_h(y) replaced by Q^k_h(x, y), the rest of the proof follows without the assumptions of the one-sided-feedback setting. For the tighter O(H²√(Tι)) regret bound for FQL in Theorem 1, we adopt notation and a proof structure similar to Jin et al. (2018) (adapted to the full-feedback setting) to facilitate quick comprehension for readers familiar with that paper. The idea is to use (V^k_1 − V^{π_k}_1)(x^k_1) as a high-probability upper bound on (V*_1 − V^{π_k}_1)(x^k_1), and then to upper-bound it using martingale properties and recursion. Because FQL leverages the full feedback, it shrinks the concentration bounds much faster than existing algorithms, resulting in a significantly lower regret bound. See Appendix E.

6. EXAMPLE APPLICATIONS: INVENTORY CONTROL AND MORE

Inventory control is one of the most fundamental problems in supply chain optimization. It is known that base-stock policies (a.k.a. order-up-to policies) are optimal for the classical models we consider (Zipkin (2000), Simchi-Levi et al. (2014)). Therefore, we let the actions of the episodic MDP be the levels to order inventory up to. At the beginning of each step h, the retailer sees the inventory x_h ∈ R and places an order to raise the inventory level up to y_h ≥ x_h. Without loss of generality, we assume the purchasing cost is 0 (Appendix C). The replenishment of y_h − x_h units arrives instantly. Then an independently distributed random demand D_h from an unknown distribution F_h is realized, and we use the replenished inventory y_h to satisfy it. At the end of stage h, if the demand D_h is less than the inventory, the remainder becomes the starting inventory for the next period, x_{h+1} = (y_h − D_h)⁺, and we pay a holding cost o_h for each unit of left-over inventory.

Backlogged model: if the demand D_h exceeds the inventory, the excess demand is backlogged, so the starting inventory for the next period is x_{h+1} = y_h − D_h < 0. We pay a backlogging cost b_h > 0 for each unit of excess demand. The reward for period h is the negative cost:

    r_h(x_h, y_h) = −c_h(y_h − x_h) − o_h(y_h − D_h)⁺ − b_h(D_h − y_h)⁺.

This model has full feedback because once the environment randomness, the demand, is realized, we can deduce what the reward and leftover inventory would be for all possible state-action pairs.

Lost-sales model: this model is considered more difficult. When the demand exceeds the inventory, the excess demand is lost and unobserved instead of backlogged, and we pay a penalty of p_h > 0 for each unit of lost demand; the starting inventory for the next period is x_{h+1} = 0. The reward for period h is

    r_h(x_h, y_h) = −c_h(y_h − x_h) − o_h(y_h − D_h)⁺ − p_h(D_h − y_h)⁺.

Note that we cannot observe the realized reward, because the excess demand (D_h − y_h)⁺ is unobserved in the lost-sales model. However, we can use a pseudo-reward

    r̃_h(x_h, y_h) = −o_h(y_h − D_h)⁺ + p_h min(y_h, D_h)

that leaves the regret of any policy against the optimal policy unchanged (Agrawal & Jia (2019), Yuan et al. (2019)). This pseudo-reward can be observed because we can always observe min(y_h, D_h). The model then has (lower) one-sided feedback: once the demand is realized, we can deduce the reward and leftover inventory for all state-action pairs whose action (order-up-to level) is lower than our chosen action, since we also observe min(y'_h, D_h) for all y'_h ≤ y_h. Past literature typically assumes that the demands along the horizon are i.i.d. (Agrawal & Jia (2019), Zhang et al. (2018)). Our algorithms are, to our knowledge, the first to optimally solve the episodic version of the problem, where the demand distributions are arbitrary within each episode.

Our result: it is easy to see that for both the backlogged and lost-sales models, the reward depends only on the action, the time step, and the realized demand, not on the state (the starting inventory). However, the feasibility of an action depends on the state, because we can only order up to a quantity no lower than the starting inventory: the feasible action set at any time is A ∩ [x_h, ∞). The next state x_{h+1}(·) and a_h(·) are monotonically non-decreasing, and the optimal value functions are concave.

Online second-price auctions: the auctioneer needs to decide the reserve price for the same item at each round (Zhao & Chen (2019)). Each bidder draws a value from its unknown distribution and submits a bid only if the value is no lower than the reserve price. The auctioneer observes the bids, gives the item to the highest bidder if there is one, and collects the second-highest bid price (including the reserve price) as profit.
In the episodic version, the bidders' distributions can vary with time within an episode, and the horizon consists of K episodes. This is a (higher) one-sided-feedback problem that can be solved efficiently by HQL, because once the bids are submitted, the auctioneer can deduce what bids it would have received for any reserve price higher than the announced reserve price.

Airline overbooking policy: the problem of deciding how many customers an airline allows to overbook a flight (Chatwin (1998)). This problem has lower-sided feedback because once the overbooking limit is reached, extra customers are unobserved, similar to the lost-sales inventory control problem.

Portfolio management is the allocation of a fixed sum of cash over a variety of financial instruments (Markowitz (1952)). In the episodic version, the return distributions are episodic. On each day, the manager collects the increase in the portfolio value as the reward, and is penalized for any decrease. This is a full-feedback problem, because once the returns of all instruments are realized for the day, the manager can deduce what his reward would have been for all feasible portfolios.

Regret bounds for inventory control: since the inventory control literature typically considers a continuous action space [0, M] for some M ∈ R₊, we discretize [0, M] with step size M/T², so that A = |A| = T². Discretization incurs additional regret Regret_gap = O((M/T²) · HT) = o(1) by Lipschitzness of the reward function. For the lost-sales model, HQL gives O(H³√(T log T)) regret. For the backlogged model, FQL gives O(H²√(T log T)) regret, and HQL gives O(H³√(T log T)) regret. See details in Appendix C.

Comparison with existing Q-learning algorithms: if we discretize the state-action space optimally for Jin et al. (2018) and for Dong et al. (2019), then applying Jin et al. (2018) to the backlogged model gives a regret bound of O(T^{3/4}√(log T)), and applying Dong et al. (2019) to the backlogged inventory model with optimized aggregation gives O(T^{2/3}√(log T)). (Recall from Table 1 that M there denotes the number of aggregate state-action pairs, and ε the largest difference between any pair of optimal state-action values associated with a common aggregate state-action pair.) See details in Appendix D.
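The pseudo-reward device for the lost-sales model above can be checked numerically: it shifts every action's reward by the same amount p·D_h, so the maximizer, and hence the regret of any policy against the optimal policy, is unchanged. A sketch with made-up cost parameters:

```python
# Numeric check (illustrative) that the lost-sales pseudo-reward shifts every
# action's reward by the same policy-independent amount p*D; the cost
# parameters o, p and the demand d are made up for the sketch.
o, p = 1.0, 2.0                          # holding cost o_h and lost-sales penalty p_h

def true_reward(y, d):                   # needs the unobservable lost demand (d - y)^+
    return -o * max(y - d, 0.0) - p * max(d - y, 0.0)

def pseudo_reward(y, d):                 # observable: uses only (y - d)^+ and min(y, d)
    return -o * max(y - d, 0.0) + p * min(y, d)

d = 7.0
levels = range(0, 15)
gaps = [pseudo_reward(y, d) - true_reward(y, d) for y in levels]
best_true = max(levels, key=lambda y: true_reward(y, d))
best_pseudo = max(levels, key=lambda y: pseudo_reward(y, d))
```

Every gap equals p·d, so the two reward functions induce the same ordering over actions.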

7. NUMERICAL EXPERIMENTS

We compare FQL and HQL on the backlogged episodic inventory control problem against three benchmarks: the optimal policy (OPT), which knows the demand distributions beforehand and minimizes the expected cost; QL-UCB from Jin et al. (2018); and Aggregated QL from Dong et al. (2019). For Aggregated QL and QL-UCB, we optimize by taking the Q-values to depend only on the action, thus reducing the state-action pair space. Aggregated QL requires a good aggregation of the state-action pairs to be known beforehand, which is usually unavailable for online problems; we aggregate the states and actions to be multiples of 1 for Dong et al. (2019) in Table 2. We do not fine-tune the confidence interval in HQL, but use a general formula √(H log(HKA)/k) for all settings. We do not fine-tune the UCB bonus in QL-UCB either. Each experimental point is run 300 times for statistical significance. Table 2 shows that both FQL and HQL perform promisingly, with a significant advantage over the other two algorithms. FQL stays consistently very close to the clairvoyant optimal, while HQL catches up quickly using only one-sided feedback. See more experiments in Appendix F.

8. CONCLUSION

We propose a new Q-learning-based framework for reinforcement learning problems with richer feedback. Our algorithms have only logarithmic dependence on the state-action space size, and hence are barely hampered by even infinitely large state-action sets. This gives us not only efficiency but also more flexibility in formulating the MDP to solve a problem. Consequently, we obtain the first O(√T)-regret algorithms for episodic inventory control problems. We consider this work a proof of concept showing the potential of adapting reinforcement learning techniques to problems with a broader range of structures.

