ASYMPTOTIC INSTANCE-OPTIMAL ALGORITHMS FOR INTERACTIVE DECISION MAKING

Abstract

Past research on interactive decision making problems (bandits, reinforcement learning, etc.) mostly focuses on the minimax regret, which measures an algorithm's performance on the hardest instances. However, an ideal algorithm should adapt to the complexity of a particular problem instance and incur smaller regret on easy instances than on worst-case instances. In this paper, we design the first asymptotic instance-optimal algorithm for general interactive decision making problems with a finite number of decisions under mild conditions. On every instance f, our algorithm outperforms all consistent algorithms (those achieving non-trivial regret on all instances) and has asymptotic regret C(f) ln n, where C(f) is an exact characterization of the complexity of f. The key step of the algorithm involves hypothesis testing with active data collection: it computes the most economical decisions with which the algorithm collects observations to test whether an estimated instance is indeed correct; thus, the complexity C(f) is the minimum cost to test the instance f against other instances. Our results, instantiated on concrete problems, recover the classical gap-dependent bounds for multi-armed bandits (Lai et al., 1985) and prior works on linear bandits (Lattimore & Szepesvari, 2017), and improve upon the previous best instance-dependent upper bound (Xu et al., 2021) for reinforcement learning.

1. INTRODUCTION

Bandit and reinforcement learning (RL) algorithms have demonstrated a wide range of successful real-life applications (Silver et al., 2016; 2017; Mnih et al., 2013; Berner et al., 2019; Vinyals et al., 2019; Mnih et al., 2015; Degrave et al., 2022). Past works have theoretically studied the regret or sample complexity of various interactive decision making problems, such as contextual bandits, reinforcement learning (RL), and partially observable Markov decision processes (see Azar et al. (2017); Jin et al. (2018); Dong et al. (2021); Li et al. (2019); Agarwal et al. (2014); Foster & Rakhlin (2020); Jin et al. (2020), and references therein). Recently, Foster et al. (2021) present a unified algorithmic principle for achieving the minimax regret, the optimal regret for the worst-case problem instances. However, minimax regret bounds do not always present a full picture of the statistical complexity of the problem: they characterize the complexity of the most difficult instances, but many other instances may be much easier. An ideal algorithm should adapt to the complexity of a particular instance and incur smaller regret on easy instances than on worst-case instances. Thus, an ideal regret bound should be instance-dependent, that is, depend on some properties of each instance. Prior works designed algorithms with instance-dependent regret bounds that are stronger than minimax regret bounds, but these bounds are often not directly comparable because they depend on different properties of the instances, such as gap conditions and the variance of the value function (Zanette & Brunskill, 2019; Xu et al., 2021; Foster et al., 2020; Tirinzoni et al., 2021). A more ambitious and challenging goal is to design instance-optimal algorithms that outperform, on every instance, all consistent algorithms (those achieving non-trivial regret on all instances). Past works designed instance-optimal algorithms for multi-armed bandits (Lai et al., 1985), linear bandits (Kirschner et al., 2021; Hao et al., 2020), Lipschitz bandits (Magureanu et al., 2014), and ergodic MDPs (Ok et al., 2018).
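For concreteness, in the Bernoulli multi-armed bandit case the instance-optimal constant of Lai et al. (1985) has a closed form: C(f) is the sum, over suboptimal arms a, of the gap (mu* - mu_a) divided by KL(mu_a, mu*). The snippet below is our own illustration of this quantity (the function names are not from the paper or any library):

```python
import math

def bernoulli_kl(p, q):
    """KL divergence KL(Ber(p) || Ber(q)), clamped to avoid log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_complexity(means):
    """C(f) = sum over suboptimal arms a of (mu* - mu_a) / KL(mu_a, mu*)."""
    mu_star = max(means)
    return sum((mu_star - mu) / bernoulli_kl(mu, mu_star)
               for mu in means if mu < mu_star)

# Large gaps are easy to distinguish, so the instance complexity is small;
# nearly tied arms make the constant in front of ln(n) blow up.
easy = lai_robbins_complexity([0.9, 0.1, 0.1])     # well-separated arms
hard = lai_robbins_complexity([0.55, 0.45, 0.45])  # nearly tied arms
```

This illustrates why instance-dependent bounds can be far below the minimax rate: the leading constant degrades only as the gaps shrink.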
However, instance-optimal regret bounds for tabular reinforcement learning remain an open question, despite recent progress (Tirinzoni et al., 2021; 2022). The challenge partly stems from the fact that the existence of such an instance-optimal algorithm is a priori unclear even for general interactive decision making problems. Conceivably, each algorithm could have its own specialty on a subset of instances, so that no algorithm dominates all others on all instances. Somewhat surprisingly, we prove that there exists a simple algorithm (T2C, stated in Alg. 1) that is asymptotically instance-optimal for general interactive decision making problems with a finite number of decisions. We determine the exact leading term of the optimal asymptotic regret for instance f to be C(f) ln n. Here, n is the number of interactions and C(f) is a complexity measure of the instance f that intuitively captures the difficulty of distinguishing f from other instances (those with different optimal decisions) using observations. Concretely, under mild conditions on the interactive decision problem, our algorithm achieves an asymptotic regret of C(f) ln n (Theorem 5.2) on every instance f, while every consistent algorithm must have an asymptotic regret of at least C(f) ln n (Theorem 3.2). Our algorithm consists of three simple steps. First, it explores uniformly for an o(1)-fraction of the steps and computes the MLE estimate of the instance with relatively low confidence. Then, it tests whether the estimated instance (or, more precisely, its associated optimal decision) is indeed correct using the most economical set of queries/decisions. Concretely, it computes a set of decisions with minimal regret such that, using a log-likelihood ratio test, it can either distinguish the estimated instance from all other instances (with different optimal decisions) with high confidence, or determine that the estimate was incorrect.
Finally, with the high-confidence estimate, it commits to the optimal decision of the estimated instance for the rest of the steps. The algorithmic framework essentially reduces the problem to the key second step: optimal hypothesis testing with active data collection. Our results recover the classical gap-dependent regret bounds for multi-armed bandits (Lai et al., 1985) and prior results for linear bandits (Lattimore & Szepesvari, 2017; Hao et al., 2020). As an instantiation of the general algorithm, we present the first asymptotic instance-optimal algorithm for tabular RL, improving upon prior instance-dependent algorithms (Xu et al., 2021; Simchowitz & Jamieson, 2019; Tirinzoni et al., 2021; 2022).
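As an illustration of the explore/test/commit structure, here is a schematic Bernoulli-bandit version of the loop. This is a sketch under our own simplifications, not the paper's Alg. 1: `pull` is an assumed reward oracle, the log(horizon) threshold is heuristic, and the economical choice of test decisions is replaced by directly sampling each suboptimal arm.

```python
import math

def bernoulli_kl(p, q):
    """KL(Ber(p) || Ber(q)), with clamping to avoid log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def t2c_sketch(pull, n_arms, horizon):
    """Schematic Test-to-Commit loop: explore uniformly, test the estimated
    best arm with a likelihood-ratio criterion, then commit."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    history = []
    # Phase 1: uniform exploration for a vanishing fraction of the horizon,
    # then a maximum-likelihood (empirical-mean) estimate of the instance.
    warmup = max(2, int(math.log(horizon)))
    for t in range(warmup * n_arms):
        a = t % n_arms
        counts[a] += 1
        sums[a] += pull(a)
        history.append(a)
    means = [sums[a] / counts[a] for a in range(n_arms)]
    best = max(range(n_arms), key=lambda a: means[a])
    # Phase 2: sample each suboptimal arm until the log-likelihood ratio
    # against "arm a is actually optimal" clears a ~log(horizon) threshold.
    threshold = math.log(horizon)
    for a in range(n_arms):
        if a == best:
            continue
        while counts[a] * bernoulli_kl(sums[a] / counts[a], means[best]) < threshold:
            counts[a] += 1
            sums[a] += pull(a)
            history.append(a)
            if sums[a] / counts[a] >= means[best]:
                # Test failed: the estimate looks wrong. The full algorithm
                # would re-estimate; this sketch just stops testing this arm.
                break
    # Phase 3: commit to the tested optimal decision for the remaining steps.
    while len(history) < horizon:
        history.append(best)
    return best, history

# Example: a deterministic two-armed instance where arm 0 always pays 1.
best, history = t2c_sketch(lambda a: 1.0 if a == 0 else 0.0, n_arms=2, horizon=500)
```

Only the test phase pulls suboptimal arms beyond the warm-up, so the regret is dominated by the cost of the hypothesis test, mirroring how C(f) arises as a minimum testing cost.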

1.1. ADDITIONAL RELATED WORKS

Several algorithms have been proven instance-optimal for specific interactive decision making problems. Variants of UCB algorithms are instance-optimal for bandits under various assumptions (Lattimore & Szepesvári, 2020; Gupta et al., 2021; Tirinzoni et al., 2020; Degenne et al., 2020; Magureanu et al., 2014), but are suboptimal for linear bandits (Lattimore & Szepesvari, 2017). These algorithms rely on the optimism-in-the-face-of-uncertainty principle to handle the exploration-exploitation tradeoff, whereas our algorithm explicitly computes the best tradeoff. Kirschner et al. (2021); Lattimore & Szepesvari (2017); Hao et al. (2020) design non-optimistic instance-optimal algorithms for linear bandits. There are also instance-optimal algorithms for ergodic MDPs, where the regret is less sensitive to the exploration policy (Ok et al., 2018; Burnetas & Katehakis, 1997; Graves & Lai, 1997), for interactive decision making with a finite hypothesis class, finite state-action space, and known rewards (Rajeev et al., 1989), and for interactive decision making with finite observations (Komiyama et al., 2015). The most closely related setup is structured bandits (Combes et al., 2017; Van Parys & Golrezaei, 2020; Jun & Zhang, 2020), where the instances also belong to an abstract, arbitrary family F. Structured bandits are a very special case of general decision making problems and do not cover RL because the observation is a scalar. In contrast, the observation in general decision making problems can be high-dimensional (e.g., a trajectory with multiple states, actions, and rewards in episodic RL), which makes our results technically challenging. Many algorithms' regret bounds depend on properties of the instance, such as gap conditions. Foster et al. (2020) prove a gap-dependent regret bound for contextual bandits.
For reinforcement learning, the regret bound may depend on the variance of the optimal value function (Zanette & Brunskill, 2019) or the gap of the Q-function (Xu et al., 2021; Simchowitz & Jamieson, 2019; Yang et al., 2021). Xu et al. (2021) and Foster et al. (2020) also prove that the gap-dependent bounds cannot be improved on some instances. To some extent, these instance-dependent lower bounds can be viewed as minimax bounds for a fine-grained instance family (e.g., all instances with the same Q-function gap), and are therefore different from ours.




