ON THE INTERPLAY BETWEEN MISSPECIFICATION AND SUB-OPTIMALITY GAP: FROM LINEAR CONTEXTUAL BANDITS TO LINEAR MDPS

Abstract

We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class up to a bounded misspecification level ζ > 0. We propose an algorithm based on a novel data selection scheme, which selects only the contextual vectors with large uncertainty for online regression. We show that, when the misspecification level ζ is dominated by O(∆/√d), with ∆ being the minimal sub-optimality gap and d being the dimension of the contextual vectors, our algorithm enjoys the same gap-dependent regret bound O(d^2/∆) as in the well-specified setting, up to logarithmic factors. Together with a lower bound adapted from Du et al. (2019) and Lattimore et al. (2020), our result suggests an interplay between the misspecification level and the sub-optimality gap: (1) the linear contextual bandit model is efficiently learnable when ζ ≤ O(∆/√d); and (2) it is not efficiently learnable when ζ ≥ Ω(∆/√d). We also extend our algorithm to reinforcement learning with linear Markov decision processes (linear MDPs) and obtain a parallel gap-dependent regret bound. Experiments on both synthetic and real-world datasets corroborate our theoretical results.

1. INTRODUCTION

Linear contextual bandits (Li et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013) have been extensively studied when the reward function can be represented as a linear function of the contextual vectors. However, such a well-specified linear model assumption sometimes does not hold in practice. This motivates the study of misspecified linear models, in which we only assume that the reward function can be approximated by a linear function up to some worst-case error ζ, called the misspecification level. Existing algorithms for misspecified linear contextual bandits (Lattimore et al., 2020; Foster et al., 2020) can only achieve an O(d√K + ζK√d log K) regret bound, where K is the total number of rounds and d is the dimension of the contextual vectors. Such a regret bound, however, suggests that the performance of these algorithms degenerates to linear in K when K is sufficiently large. The reason for this degeneration is that existing algorithms, such as OFUL (Abbasi-Yadkori et al., 2011) and linear Thompson sampling (Agrawal & Goyal, 2013), utilize all the collected data without selection, which makes them vulnerable to "outliers" caused by the misspecified model. Meanwhile, the aforementioned results do not consider the sub-optimality gap in expected reward between the best arm and the second-best arm. Intuitively, if the sub-optimality gap is smaller than the misspecification level, there is no hope of obtaining sublinear regret. Therefore, it is sensible to take the sub-optimality gap into account in the misspecified setting and pursue a gap-dependent regret bound.

The same misspecification issue also appears in reinforcement learning with linear function approximation, when a linear function cannot exactly represent the transition kernel or value function of the underlying MDP. In this setting, Du et al. (2019) provided a negative result showing that if the misspecification level is larger than a certain threshold, any RL algorithm will suffer an exponentially large sample complexity. This result was later revisited in the stochastic linear bandit setting by Lattimore et al. (2020), who showed that a large misspecification error makes the bandit model not efficiently learnable. However, these results cannot well explain the tremendous success of deep reinforcement learning on various tasks (Mnih et al., 2013; Schulman et al., 2015; 2017), where deep neural networks are used as function approximators with misspecification error.

In this paper, we aim to understand the role of model misspecification in linear contextual bandits through the lens of the sub-optimality gap. By proposing a new algorithm with data selection, we achieve a constant regret bound for this problem. We also extend our algorithm to linear Markov decision processes (Jin et al., 2020) and obtain a regret bound of a similar flavor. Our contributions are highlighted as follows:

• We propose a new algorithm called DS-OFUL (Data Selection OFUL), which only learns from the data with large uncertainty. We prove an O(d^2/∆) gap-dependent regret bound¹ when the misspecification level is small (i.e., ζ = O(∆/√d)) and the minimal sub-optimality gap ∆ is known. Our regret bound improves upon the gap-dependent regret in the well-specified setting (Abbasi-Yadkori et al., 2011) by a logarithmic factor. To the best of our knowledge, this is the first constant gap-dependent regret bound for misspecified linear contextual bandits, even assuming a known minimal sub-optimality gap.

• We also prove a gap-dependent lower bound following the lower bound proof techniques of Du et al. (2019) and Lattimore et al. (2020). Together with the upper bound, this suggests an interplay between the misspecification level and the sub-optimality gap: the linear contextual bandit is efficiently learnable if ζ ≤ O(∆/√d), while it is not efficiently learnable if ζ ≥ Ω(∆/√d).

• We extend the same idea to misspecified linear MDPs and propose an algorithm called DS-LSVI (Data-Selection LSVI). DS-LSVI enjoys a logarithmic gap-dependent regret bound O(H^5 d^3 log(K)/∆), which suggests a similar interplay between the misspecification level and the sub-optimality gap in episodic MDPs.

• Finally, we conduct experiments on linear contextual bandits with both synthetic and real datasets, and demonstrate the superior performance of the DS-OFUL algorithm. This corroborates our theoretical results.
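To make the data-selection idea concrete, below is a minimal sketch of an uncertainty-filtered ridge regressor. This is an illustration of the general principle only, not the exact DS-OFUL procedure from this paper: the class name and the threshold `gamma` are hypothetical, and in the paper the selection threshold would be set from ∆, d, and ζ.

```python
import numpy as np

class UncertaintyFilteredRegressor:
    """Ridge regression that only absorbs high-uncertainty observations.

    Sketch of uncertainty-based data selection: a contextual vector x
    updates the regression only if its elliptical uncertainty
    ||x||_{A^{-1}} exceeds a threshold `gamma` (hypothetical name).
    """

    def __init__(self, d, lam=1.0, gamma=0.1):
        self.A = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)       # running sum of reward * x
        self.gamma = gamma

    def uncertainty(self, x):
        # ||x||_{A^{-1}} = sqrt(x^T A^{-1} x)
        return float(np.sqrt(x @ np.linalg.solve(self.A, x)))

    def observe(self, x, reward):
        """Use (x, reward) for regression only if its uncertainty is large."""
        if self.uncertainty(x) > self.gamma:
            self.A += np.outer(x, x)
            self.b += reward * x
            return True   # sample selected
        return False      # sample discarded

    def theta_hat(self):
        # ridge-regression estimate A^{-1} b
        return np.linalg.solve(self.A, self.b)
```

Once every incoming vector's uncertainty falls below the threshold, no further samples are absorbed and the estimate stops drifting; intuitively, this filtering is what shields the regression from accumulating the bounded misspecification error over all K rounds.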

2. RELATED WORK

In this section, we review related work on misspecified linear bandits and misspecified reinforcement learning. We defer further related work on function approximation in bandits and RL to Appendix A.

Misspecified Linear Bandits. Ghosh et al. (2017) is probably the first work to consider misspecified linear bandits; it shows that the OFUL algorithm (Abbasi-Yadkori et al., 2011) cannot achieve sublinear regret in the presence of misspecification. They therefore proposed a new algorithm with a hypothesis-testing module for linearity, which decides whether to run OFUL (Abbasi-Yadkori et al., 2011) or the multi-armed UCB algorithm. Their algorithm enjoys the same performance guarantee as OFUL in the well-specified setting and avoids linear regret under certain misspecification regimes. Lattimore et al. (2020) proposed a phase-elimination algorithm for misspecified stochastic linear bandits, which achieves an O(√(dK) + ζK√d) regret bound. For contextual linear bandits, both Lattimore et al. (2020) and Foster et al. (2020) proved an O(d√K + ζK√d) regret bound. Takemura et al. (2021) and Vial et al. (2022) also provide similar regret bounds without knowledge of the misspecification level. Van Roy & Dong (2019) proved a sample-complexity lower bound, which suggests that when ζ√d ≥ 8 log |D|, any best arm identification algorithm will

¹We use the notation O(·) to hide logarithmic factors other than those in the number of rounds (or episodes) K.

Notation. Scalars and constants are denoted by lower- and upper-case letters, respectively. Vectors are denoted by lower-case boldface letters x, and matrices by upper-case boldface letters A. We denote by [k] the set {1, 2, …, k} for a positive integer k. For two non-negative sequences {a_n}, {b_n}, a_n = O(b_n) means that there exists a positive constant C such that a_n ≤ C b_n, and we use O(·) to hide logarithmic factors in O(·) other than those in the number of rounds (or episodes) K; a_n = Ω(b_n) means that there exists a positive constant C such that a_n ≥ C b_n, and we use Ω(·) to hide logarithmic factors. For a vector x ∈ R^d and a positive semi-definite matrix A ∈ R^{d×d}, we define ∥x∥_A^2 = x^⊤ A x. For any set C, we use |C| to denote its cardinality.
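As a quick sanity check of the matrix-weighted norm ∥x∥_A defined in the notation above, the quadratic form can be evaluated directly; the values of A and x here are arbitrary:

```python
import numpy as np

# ||x||_A^2 = x^T A x for a positive semi-definite A
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
x = np.array([1.0, 2.0])

norm_sq = x @ A @ x        # 2*1^2 + 3*2^2 = 14
norm_A = np.sqrt(norm_sq)  # ||x||_A
```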

