PROVABLY MORE EFFICIENT Q-LEARNING IN THE ONE-SIDED-FEEDBACK/FULL-FEEDBACK SETTINGS

Abstract

We propose a new Q-learning-based algorithm, Elimination-Based Half-Q-Learning (HQL), that enjoys improved efficiency over existing algorithms for a wide variety of problems in the one-sided-feedback setting. We also provide a simpler variant of the algorithm, Full-Q-Learning (FQL), for the full-feedback setting. We establish that HQL incurs Õ(H³√T) regret and FQL incurs Õ(H²√T) regret, where H is the length of each episode and T is the total length of the horizon. The regret bounds are not affected by the possibly huge state and action space. Our numerical experiments demonstrate the superior efficiency of HQL and FQL, and the potential to combine reinforcement learning with richer feedback models.

1. INTRODUCTION

Motivated by the classical operations research (OR) problem of inventory control, we customize Q-learning to more efficiently solve a wide range of problems with richer feedback than the usual bandit feedback. Q-learning is a popular reinforcement learning (RL) method that estimates the state-action value functions without estimating the huge transition matrix in a large MDP (Watkins & Dayan (1992), Jaakkola et al. (1993)). This paper is concerned with devising Q-learning algorithms that leverage the natural one-sided-feedback/full-feedback structures in many OR and finance problems.

Motivation The topic of developing efficient RL algorithms that cater to special structures is fundamental and important, especially for the purpose of adopting RL more widely in real applications. By contrast, most RL literature considers settings with little feedback, while the study of single-stage online learning for bandits has a long history of considering a plethora of graph-based feedback models. We are particularly interested in the one-sided-feedback/full-feedback models because of their prevalence in many well-known problems, such as inventory control, online auctions, and portfolio management. In these real applications, RL has typically been outperformed by domain-specific algorithms or heuristics. We propose algorithms aimed at bridging this divide by incorporating problem-specific structures into classical reinforcement learning algorithms.
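For concreteness, the following is a minimal sketch of the classical tabular Q-learning update of Watkins & Dayan (1992) referenced above, which estimates state-action values directly from sampled transitions without building a transition matrix. All names and hyperparameters here (the `env` interface, `alpha`, `gamma`, `epsilon`) are illustrative assumptions; this is not the paper's HQL or FQL algorithm, which adapt this template to the one-sided-/full-feedback settings.

```python
import numpy as np

# Minimal sketch of classical tabular Q-learning (Watkins & Dayan, 1992).
# The environment interface and hyperparameters are illustrative assumptions;
# the paper's HQL/FQL algorithms modify this template and are not shown here.

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Estimate Q(s, a) from sampled transitions, with no transition matrix."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration over the current value estimates.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, done = env.step(action)

            # One-step temporal-difference update toward the Bellman target.
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```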



PRIOR WORK The most relevant literature to this paper is Jin et al. (2018), who prove the optimality of Q-learning with Upper-Confidence-Bound bonus and Bernstein-style bonus in tabular MDPs. The recent work of Dong et al. (2019) improves upon Jin et al. (2018) when an aggregation of the state-action pairs with known error is given beforehand. Our algorithms substantially improve the regret bounds (see Table 1) by catering to the full-feedback/one-sided-feedback structures of many problems. Because our regret bounds are unaffected by the cardinality of the state and action space, our Q-learning algorithms can handle huge state-action spaces, and even continuous state spaces in some cases (Section 8). Note that both our work and Dong et al. (2019) are designed for a subset of the general episodic MDP problems: we focus on problems with richer feedback, while Dong et al. (2019) focus on problems with a nice aggregate structure known to the decision-maker. The one-sided-feedback setting, or similar notions, has attracted much research interest in learning problems outside the scope of episodic MDP settings, for example learning in auctions with binary feedback, dynamic pricing, and binary search (Weed et al. (2016), Feng et al. (2018), Cohen et al. (2020), Lobel et al. (2016)). In particular, Zhao & Chen (2019) study the

