ASYMPTOTIC INSTANCE-OPTIMAL ALGORITHMS FOR INTERACTIVE DECISION MAKING

Abstract

Past research on interactive decision making problems (bandits, reinforcement learning, etc.) has mostly focused on the minimax regret, which measures an algorithm's performance on the hardest instance. However, an ideal algorithm should adapt to the complexity of a particular problem instance and incur smaller regret on easy instances than on worst-case instances. In this paper, we design the first asymptotic instance-optimal algorithm for general interactive decision making problems with a finite number of decisions under mild conditions. On every instance f, our algorithm outperforms all consistent algorithms (those achieving non-trivial regret on all instances), and has asymptotic regret C(f) ln n, where C(f) is an exact characterization of the complexity of f. The key step of the algorithm involves hypothesis testing with active data collection. It computes the most economical decisions with which the algorithm collects observations to test whether an estimated instance is indeed correct; thus, the complexity C(f) is the minimum cost to test the instance f against other instances. Our results, instantiated on concrete problems, recover the classical gap-dependent bounds for multi-armed bandits (Lai et al., 1985) and prior works on linear bandits (Lattimore & Szepesvari, 2017), and improve upon the previous best instance-dependent upper bound (Xu et al., 2021) for reinforcement learning.
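For intuition, the classical multi-armed bandit result recovered above makes the constant C(f) concrete. For a Bernoulli bandit instance f with arm means \mu_1, \dots, \mu_K and optimal mean \mu^* = \max_i \mu_i, the well-known lower bound of Lai et al. (1985) states that every consistent algorithm satisfies

```latex
\liminf_{n \to \infty} \frac{\mathbb{E}[R_n]}{\ln n}
  \;\ge\; \sum_{i \,:\, \mu_i < \mu^*} \frac{\mu^* - \mu_i}{\mathrm{KL}(\mu_i, \mu^*)},
\qquad
\mathrm{KL}(p, q) \;=\; p \ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q},
```

so in this special case C(f) is the sum on the right-hand side: each suboptimal arm must be sampled roughly \ln n / \mathrm{KL}(\mu_i, \mu^*) times to be distinguished from the best arm, at a per-sample cost of its gap \mu^* - \mu_i.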

1. INTRODUCTION

Bandit and reinforcement learning (RL) algorithms have demonstrated a wide range of successful real-life applications (Silver et al., 2016; 2017; Mnih et al., 2013; Berner et al., 2019; Vinyals et al., 2019; Mnih et al., 2015; Degrave et al., 2022). Past works have theoretically studied the regret or sample complexity of various interactive decision making problems, such as contextual bandits, reinforcement learning (RL), and partially observable Markov decision processes (see Azar et al. (2017); Jin et al. (2018); Dong et al. (2021); Li et al. (2019); Agarwal et al. (2014); Foster & Rakhlin (2020); Jin et al. (2020), and references therein). Recently, Foster et al. (2021) presented a unified algorithmic principle for achieving the minimax regret, the optimal regret for the worst-case problem instances. However, minimax regret bounds do not always present a full picture of the statistical complexity of the problem. They characterize the complexity of the most difficult instances, but potentially many other instances are much easier. An ideal algorithm should adapt to the complexity of a particular instance and incur smaller regret on easy instances than on the worst-case instances. Thus, an ideal regret bound should be instance-dependent, that is, depending on some properties of each instance. Prior works designed algorithms with instance-dependent regret bounds that are stronger than minimax regret bounds, but they are often not directly comparable because they depend on different properties of the instances, such as the gap conditions and the variance of the value function (Zanette & Brunskill, 2019; Xu et al., 2021; Foster et al., 2020; Tirinzoni et al., 2021). A more ambitious and challenging goal is to design instance-optimal algorithms that outperform, on every instance, all consistent algorithms (those achieving non-trivial regret on all instances). Past works designed instance-optimal algorithms for multi-armed bandits (Lai et al., 1985), linear bandits (Kirschner et al., 2021; Hao et al., 2020), Lipschitz bandits (Magureanu et al., 2014), and ergodic MDPs (Ok et al., 2018).

However, instance-optimal regret bounds for tabular reinforcement learning remain an open question, despite recent progress (Tirinzoni et al., 2021; 2022). The challenge partly


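The contrast between instance-dependent and worst-case regret discussed above can be seen empirically even with a standard algorithm. The following is a minimal sketch (not the algorithm of this paper) running classical UCB1 on two Bernoulli bandit instances: an "easy" instance with a large gap between arm means and a "hard" instance with a small gap. The function name `ucb1_regret` and the specific instances are illustrative choices, not taken from the paper.

```python
import math
import random

def ucb1_regret(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit and return the pseudo-regret
    sum_t (mu* - mu_{a_t}) over the horizon."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # initialization: pull each arm once
        else:
            # UCB index: empirical mean plus exploration bonus
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret

# Large-gap ("easy") instance vs. small-gap ("hard") instance:
easy = ucb1_regret([0.9, 0.1], horizon=20000)
hard = ucb1_regret([0.55, 0.45], horizon=20000)
```

At the same horizon, the easy instance incurs far smaller regret than the hard one, even though both share the same minimax bound; an instance-optimal algorithm makes this adaptivity match the exact constant C(f) for every instance.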