TIGHT NON-ASYMPTOTIC INFERENCE VIA SUB-GAUSSIAN INTRINSIC MOMENT NORM

Abstract

In non-asymptotic statistical inference, variance-type parameters of sub-Gaussian distributions play a crucial role. However, direct estimation of these parameters based on the empirical moment generating function (MGF) is infeasible. Instead, we recommend using the sub-Gaussian intrinsic moment norm [Buldygin and Kozachenko (2000), Theorem 1.3], obtained by maximizing a sequence of normalized moments. Importantly, the recommended norm can not only recover the exponential moment bounds of the corresponding MGFs, but also lead to tighter sub-Gaussian Hoeffding concentration inequalities. In practice, we propose an intuitive way of checking whether data with a finite sample size are sub-Gaussian, via the sub-Gaussian plot. The intrinsic moment norm can be robustly estimated by a simple plug-in approach. Our theoretical results are applied to non-asymptotic analyses, including the multi-armed bandit.

1. INTRODUCTION

With the advancement of machine learning techniques, computer scientists have become increasingly interested in establishing rigorous error bounds for learning procedures, especially bounds with finite-sample validity (Wainwright, 2019; Zhang & Chen, 2021; Yang et al., 2020). In specific settings, statisticians, econometricians, engineers and physicists have developed non-asymptotic inference to quantify uncertainty in data; see Romano & Wolf (2000); Chassang (2009); Arlot et al. (2010); Yang et al. (2020); Horowitz & Lee (2020); Armstrong & Kolesár (2021); Zheng & Cheng (2021); Lucas et al. (2008); Owhadi et al. (2013); Wang (2020). Concentration-based statistical inference has therefore received considerable attention, especially for bounded data (Romano & Wolf, 2000; Auer et al., 2002; Hao et al., 2019; Wang et al., 2021; Shiu, 2022) and Gaussian data (Arlot et al., 2010; Duy & Takeuchi, 2022; Bettache et al., 2021; Feng et al., 2021). For example, Hoeffding's inequality can be applied to construct non-asymptotic confidence intervals from bounded data [1]. In reality, however, the support of the data or its underlying distribution may be unknown, and misusing Hoeffding's inequality (Hoeffding, 1963) on unbounded data results in a notably loose confidence interval (CI); see Appendix A.1. Hence, it is common practice to assume that the data follow a sub-Gaussian distribution (Kahane, 1960). By the Chernoff inequality [2], we have

$$P(X \ge t) \le \inf_{s>0} \exp\{-st\}\, E\exp\{sX\}, \quad \forall\, t \ge 0.$$

Hence, the tightness of a confidence interval relies on how we upper bound the moment generating function (MGF) $E\exp\{sX\}$ for all $s > 0$. This can be further translated into the following optimal variance proxy of a sub-Gaussian distribution.

Definition 1. A r.v. $X$ is sub-Gaussian (sub-G) with a variance proxy $\sigma^2$ [denoted $X \sim \mathrm{subG}(\sigma^2)$] if its MGF satisfies $E\exp(tX) \le \exp(\sigma^2 t^2/2)$ for all $t \in \mathbb{R}$. The sub-Gaussian parameter $\sigma_{\mathrm{opt}}(X)$ is defined by the optimal variance proxy (Chow, 1966):

$$\sigma^2_{\mathrm{opt}}(X) := \inf\big\{\sigma^2 > 0 : E\exp(tX) \le \exp\{\sigma^2 t^2/2\},\ \forall\, t \in \mathbb{R}\big\} = 2\sup_{t\in\mathbb{R}} t^{-2}\log[E\exp(tX)].$$

Note that $\sigma^2_{\mathrm{opt}}(X) \ge \operatorname{Var}X$; see (14) in Appendix A.2. When $\sigma^2_{\mathrm{opt}}(X) = \operatorname{Var}X$, $X$ is called strictly sub-Gaussian (Arbel et al., 2020). Based on Theorem 1.5 in Buldygin & Kozachenko (2000), we have

$$P(X \ge t) \le \exp\Big\{-\frac{t^2}{2\sigma^2_{\mathrm{opt}}(X)}\Big\}, \qquad P\Big(\Big|\sum_{i=1}^n X_i\Big| \ge t\Big) \le 2\exp\Big\{-\frac{t^2}{2\sum_{i=1}^n \sigma^2_{\mathrm{opt}}(X_i)}\Big\} \tag{2}$$

for a sub-G r.v. $X$ and independent sub-G r.v.s $\{X_i\}_{i=1}^n$. The inequalities (2) provide the tightest upper bounds of the form $P(X > t) \le \exp(-Ct^2)$ [or $P(|\sum_{i=1}^n X_i| > t) \le \exp(-Ct^2)$] for a positive constant $C$ obtainable via the Chernoff inequality. Given $\{X_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} \mathrm{subG}(\sigma^2_{\mathrm{opt}}(X))$, a straightforward application of (2) gives the non-asymptotic $100(1-\alpha)\%$ CI

$$EX = 0 \in \big[\bar{X}_n \pm \sigma_{\mathrm{opt}}(X)\sqrt{2n^{-1}\log(2/\alpha)}\big].$$

A naive plug-in estimate [3] of $\sigma^2_{\mathrm{opt}}(X) := 2\sup_{t\in\mathbb{R}} t^{-2}\log[E\exp(tX)]$ (Arbel et al., 2020) is

$$\widehat{\sigma}^2_{\mathrm{opt}}(X) := 2\sup_{t\in\mathbb{R}} t^{-2}\log\Big[n^{-1}\sum_{i=1}^n \exp(tX_i)\Big]. \tag{4}$$

However, two weaknesses of (4) substantially hinder its application: (i) the optimization is unstable owing to the possible non-convexity of the objective function; (ii) an exponentially large $n$ is required to keep the variance term $\operatorname{Var}(n^{-1}\sum_{i=1}^n \exp(tX_i))$ from exploding when $t$ is large. In Section 3, we present some simulation evidence.
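The following is a minimal sketch of the naive plug-in estimator (4) on a finite grid of $t$ values (the grid and function name are our own choices). It illustrates weaknesses (i)-(ii): a noisy, non-concave objective whose variance blows up for large $|t|$ unless $n$ is very large.

```python
# A minimal sketch of the naive plug-in estimator (4); grid choice is ours.
import numpy as np

def sigma2_opt_plugin(x, t_grid=None):
    x = np.asarray(x) - np.mean(x)              # centered, as assumed throughout
    if t_grid is None:
        t_grid = np.linspace(-5.0, 5.0, 1001)
    t_grid = t_grid[np.abs(t_grid) > 1e-8]      # avoid t = 0
    # Empirical log-MGF; for large |t| this term is extremely noisy.
    log_mgf = np.log(np.mean(np.exp(np.outer(t_grid, x)), axis=1))
    return 2.0 * np.max(log_mgf / t_grid**2)    # 2 sup_t t^{-2} log MGF_hat(t)

rng = np.random.default_rng(0)
print(sigma2_opt_plugin(rng.standard_normal(200)))  # unstable around the truth 1
```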
On the other hand, other forms of variance-type parameters are available. For instance, van der Vaart & Wellner (1996) introduced the Orlicz norm $\|X\|_{w_2} := \inf\{c > 0 : E\exp\{|X|^2/c^2\} \le 2\}$, frequently used in empirical process theory. Additionally, Vershynin (2010, Page 6) suggested the moment-based norm $\|X\|_{\psi_2} := \max_{k\ge2} k^{-1/2}(E|X|^k)^{1/k}$. However, as shown in Table 1 and Appendix A.2.1, both types of norm fail to deliver sharp probability bounds even for strictly sub-G distributions, such as the standard Gaussian distribution and symmetric beta distributions.

Table 1: Comparison of sub-Gaussian norms $\|\cdot\|_*$ for centralized and symmetric $X$.

| $\|\cdot\|_*$-norm | sharp tail for $P(|X| \ge t)$ | sharp MGF bound | half length of $(1-\delta)$-CI | easy to estimate |
| $\sigma_{\mathrm{opt}}(X)$ | Yes $[2\exp\{-\frac{t^2}{2\sigma^2_{\mathrm{opt}}(X)}\}]$ | Yes $[\exp\{\sigma^2_{\mathrm{opt}}(X)\frac{t^2}{2}\}]$ | $\sqrt{2\log(2/\delta)}\,\sigma_{\mathrm{opt}}(X)$ | No |
| $\|X\|_{w_2}$ | Yes $[2\exp\{-\frac{t^2}{2}/(\frac{\|X\|_{w_2}}{\sqrt{2}})^2\}]$ | No $[\exp\{(4\sqrt{e}\|X\|_{\psi_2})^2\frac{t^2}{2}\}]$ | $\sqrt{2\log(2/\delta)}\,\sqrt{2e}\,\|X\|_{\psi_2}$ | Yes |
| $\|X\|_G$ (Def. 2) | Yes $[2\exp\{-\frac{t^2}{2\|X\|_G^2}\}]$ | Yes $[\exp\{\|X\|_G^2\frac{t^2}{2}\}]$ | $\sqrt{2\log(2/\delta)}\,\|X\|_G$ | Yes |

1.1 CONTRIBUTIONS

In light of the above discussion, we advocate the use of the intrinsic moment norm of Definition 2 in the construction of tight non-asymptotic CIs, for two specific reasons: (i) it approximately recovers the tight inequalities (2); (ii) it can be estimated easily (in closed form) and robustly. Definition 2 below is from Page 6 and Theorem 1.3 of Buldygin & Kozachenko (2000).

Definition 2 (Intrinsic moment norm). $\|X\|_G := \max_{k\ge1}\big[\frac{2^k k!}{(2k)!}EX^{2k}\big]^{1/(2k)} = \max_{k\ge1}\big[\frac{1}{(2k-1)!!}EX^{2k}\big]^{1/(2k)}$.

From the sub-G characterization (see Theorem 2.6 in Wainwright (2019)), $\|X\|_G < \infty$ iff $\sigma_{\mathrm{opt}}(X) < \infty$ for any zero-mean r.v. $X$. Hence, a finite intrinsic moment norm of a r.v. $X$ ensures sub-Gaussianity (satisfying Definition 1). Our contributions in this paper can be summarized as follows.

1. By $\|X\|_G$, we achieve a sharper Hoeffding-type inequality under asymmetric distributions; see Theorem 2(b).
2. Compared to the normal approximation based on Berry-Esseen (B-E) bounds, our results are more applicable to data of extremely small sample size. We illustrate with Bernoulli observations a comparison of two types of CIs, based on the B-E-corrected CLT and on Hoeffding's inequality, in Figure 1; see Appendix A for details.
3. A novel method, the sub-Gaussian plot, is proposed for checking whether unbounded data are sub-Gaussian. We introduce plug-in and robust plug-in estimators for $\|X\|_G$, and establish finite-sample theories.
4. Finally, we employ the intrinsic moment norm estimation in non-asymptotic inference for a bandit problem: a bootstrapped UCB algorithm for multi-armed bandits. This algorithm is shown to achieve feasible error bounds and competitive cumulative regret on unbounded sub-Gaussian data.
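Before proceeding, here is a small closed-form sanity check of Definition 2 (our own illustration, not the paper's code): for $X \sim N(0, \sigma^2)$ we have $EX^{2k} = (2k-1)!!\,\sigma^{2k}$, so every normalized moment equals $\sigma$ and hence $\|X\|_G = \sigma$.

```python
# Sanity check of Definition 2 for Gaussian X (our own illustration).
from math import prod

def dfact(m):                     # double factorial m!! for positive odd m
    return prod(range(m, 0, -2))

sigma = 2.0
vals = [(dfact(2*k - 1) * sigma**(2*k) / dfact(2*k - 1)) ** (1 / (2*k))
        for k in range(1, 9)]
print(vals)   # numerically equal to sigma for every k, so the max is sigma
```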

2. SUB-GAUSSIAN PLOT AND TESTING

Before estimating $\|X\|_G$, the first step is to verify that $X$ is indeed sub-G given its i.i.d. copies $\{X_i\}_{i=1}^n$. Corollary 7.2(b) in Zhang & Chen (2021) shows that for r.v.s $X_i \sim \mathrm{subG}(\sigma^2_{\mathrm{opt}}(X))$ (without an independence assumption),

$$P\big(\max_{1\le i\le j} X_i \le \sigma_{\mathrm{opt}}(X)\sqrt{2(\log j + t)}\big) \ge 1 - \exp\{-t\}, \tag{5}$$

which implies $\max_{1\le i\le j} X_i = O_P(\sqrt{\log j})$. Moreover, we will show that this rate is sharp for a class of unbounded sub-G r.v.s characterized by the lower intrinsic moment norm below.

Definition 3 (Lower intrinsic moment norm). The lower intrinsic moment norm of a sub-G $X$ is defined as $\|X\|_{\mathcal{G}} := \min_{k\ge1}\{[(2k-1)!!]^{-1}EX^{2k}\}^{1/(2k)}$.

By the method in Theorem 1 of Zhang & Zhou (2020), we obtain the following tight rate result with a lower bound.

Theorem 1. (a) If $\|X\|_{\mathcal{G}} > 0$ for i.i.d. symmetric sub-G r.v.s $\{X_i\}_{i=1}^n \sim X$, then with probability at least $1-\delta$,

$$\frac{\|X\|_{\mathcal{G}}/\|X\|_G}{2\sqrt{2\|X\|_G^2/\|X\|_{\mathcal{G}}^2 - 1}}\sqrt{\log n - \log C^{-2}(X) - \log\log\frac{2}{\delta}} \;\le\; \frac{\max_{1\le i\le n} X_i}{\|X\|_G} \;\le\; \sqrt{2\Big[\log n + \log\frac{2}{\delta}\Big]},$$

where $C(X) < 1$ is the constant defined in Lemma 1 below; (b) if $X$ is a bounded r.v., then $\|X\|_{\mathcal{G}} = 0$.

The upper bound follows similarly from the proof of (5). The proof of the lower bound relies on a sharp reverse Chernoff inequality derived from the Paley-Zygmund inequality (see Paley & Zygmund (1932)).

Lemma 1 (A reverse Chernoff inequality). Suppose $\|X\|_{\mathcal{G}} > 0$ for a symmetric sub-G r.v. $X$. Then for $t > 0$,

$$P(X \ge t) \ge C^2(X)\exp\big\{-4\big[2\|X\|_G^2/\|X\|_{\mathcal{G}}^4 - \|X\|_{\mathcal{G}}^{-2}\big]t^2\big\},$$

where $C(X) := \frac{\|X\|_{\mathcal{G}}^2}{4\|X\|_G^2 - \|X\|_{\mathcal{G}}^2}\Big(\frac{4\|X\|_G^2 - 2\|X\|_{\mathcal{G}}^2}{4\|X\|_G^2 - \|X\|_{\mathcal{G}}^2}\Big)^{2[2\|X\|_G^2/\|X\|_{\mathcal{G}}^2 - 1]} \in (0,1)$.

Theorem 1 of Zhang & Zhou (2020) does not optimize the constant in the Paley-Zygmund inequality; in contrast, our Lemma 1 has an optimal constant; see Appendix C for details.

Sub-Gaussian plot under the unboundedness assumption [4]. Motivated by Theorem 1, we propose a novel sub-Gaussian plot to check whether i.i.d. data $\{X_i\}_{i=1}^n$ follow a sub-G distribution. Suppose that, for each $j$, $\{X_i^*\}_{i=1}^j$ are independently sampled from the empirical distribution $F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}(X_i \le x)$ of $\{X_i\}_{i=1}^n$. We plot the maxima $\{\max_{1\le i\le j} X_i^*\}_{j=1}^n$ in the plane, where the x-axis represents $\sqrt{\log j + 1}$ and the y-axis the value of $\max_{1\le i\le j} X_i^*$. We then check whether those points have a linear tendency at the boundary: the closer they are to a straight line, the more we can trust that the data are sub-Gaussian. Figure 2 shows the sub-Gaussian plots of $N(0,1)$ and $\mathrm{Exp}(1)$: the plot for $N(0,1)$ shows a linear tendency at the boundary, while the plot for $\mathrm{Exp}(1)$ shows a quadratic tendency. Regarding the quadratic tendency, note that if $\{X_i\}_{i=1}^n$ have heavier tails, such as sub-exponential tails, then $\max_{1\le i\le j} X_i = O_P(\log j)$ instead of the order $O_P(\sqrt{\log j})$; see Corollary 7.3 in Zhang & Chen (2021).
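A minimal sketch of the sub-Gaussian plot follows (the plotting choices are our own): for each $j$, resample $j$ points from the empirical distribution and record their maximum.

```python
# A minimal sketch of the sub-Gaussian plot of Section 2 (choices ours).
import numpy as np
import matplotlib.pyplot as plt

def sub_gaussian_plot(x, seed=0, label=None):
    rng = np.random.default_rng(seed)
    n = len(x)
    js = np.arange(1, n + 1)
    # For each j, draw j points from F_n and record max_{1<=i<=j} X*_i.
    max_star = np.array([rng.choice(x, size=j, replace=True).max() for j in js])
    plt.scatter(np.sqrt(np.log(js) + 1), max_star, s=4, label=label)
    plt.xlabel(r"$\sqrt{\log j + 1}$")
    plt.ylabel(r"$\max_{1\le i\le j} X_i^*$")

rng = np.random.default_rng(1)
sub_gaussian_plot(rng.standard_normal(1000), label="N(0,1): linear boundary")
sub_gaussian_plot(rng.exponential(size=1000), label="Exp(1): quadratic boundary")
plt.legend(); plt.show()
```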

3. FINITE SAMPLE PROPERTIES OF INTRINSIC MOMENT NORM

In this section, we characterize two important properties of the intrinsic moment norm that are used in constructing non-asymptotic confidence intervals.

3.1. BASIC PROPERTIES

Lemma 2 below establishes that the intrinsic moment norm is estimable.

Lemma 2. For sub-G $X$, we have $\arg\max_{m\in2\mathbb{N}}\big[\frac{EX^m}{(m-1)!!}\big]^{1/m} < \infty$, where $2\mathbb{N} := \{2, 4, \dots\}$ is the set of even numbers.

Lemma 2 ensures that for any sub-Gaussian variable $X$, its intrinsic moment norm can be computed as

$$\|X\|_G := \max_{m\in2\mathbb{N}}\Big[\frac{EX^m}{(m-1)!!}\Big]^{1/m} = \max_{1\le k\le k_X}\Big[\frac{EX^{2k}}{(2k-1)!!}\Big]^{1/(2k)}$$

for some finite $k_X < \infty$. This is an important property that other norms may not share. The norm $\|X\|_{\psi_2} := \max_{k\ge2} k^{-1/2}(E|X|^k)^{1/k}$ for Gaussian $X$ attains its optimum at $k = \infty$; see Example 3 in Appendix A.2.1. As for $\sigma^2_{\mathrm{opt}}(X) := 2\sup_{t\in\mathbb{R}} t^{-2}\log[E\exp(tX)]$, it is unclear whether its value is attained at a finite $t$. Note that if $k_X = 1$, one has $\|X\|_G^2 = \operatorname{Var}(X)$. Next, we present an example of calculating the value of $k_X$. Denote by $\mathrm{Exp}(1)|_{[0,M]}$ the standard exponential distribution truncated to $[0,M]$, with density $f(x) = \frac{e^{-x}}{\int_0^M e^{-u}\,du}\mathbb{1}_{\{x\in[0,M]\}}$.

Example 1. (a) $X \sim U[-a,a]$: $k_X = 1$ for any $a \in \mathbb{R}$; (b) $X \sim \mathrm{Exp}(1)|_{[0,2.75]} - E\,\mathrm{Exp}(1)|_{[0,2.75]}$: $k_X = 2$; (c) $X \sim \mathrm{Exp}(1)|_{[0,3]} - E\,\mathrm{Exp}(1)|_{[0,3]}$: $k_X = 3$. Indeed, for any fixed $k_0 \in \mathbb{N}$, we can construct a centered truncated exponential r.v. $X := \mathrm{Exp}(1)|_{[0,M]}$ such that $k_X = k_0$ by properly adjusting the truncation level $M$.
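A short numerical sketch of Example 1(b)-(c) (our own reproduction, under the stated truncation levels): compute the normalized moments of the centered truncated exponential by quadrature and locate the maximizing index.

```python
# Our own numerical reproduction of Example 1(b)-(c): for centered Exp(1)
# truncated to [0, M], find k_X = argmax_k [E X^{2k}/(2k-1)!!]^{1/(2k)}.
import numpy as np
from math import prod

def k_X_truncated_exp(M, k_max=8, grid=200_000):
    x = np.linspace(0.0, M, grid)
    w = np.exp(-x)
    w /= np.trapz(w, x)                          # density of Exp(1)|[0, M]
    xc = x - np.trapz(x * w, x)                  # center the variable
    vals = [(np.trapz(xc**(2*k) * w, x) / prod(range(2*k - 1, 0, -2))) ** (1 / (2*k))
            for k in range(1, k_max + 1)]
    return int(np.argmax(vals)) + 1

print(k_X_truncated_exp(2.75), k_X_truncated_exp(3.0))  # per Example 1: 2 and 3
```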

3.2. CONCENTRATION FOR SUMMATION

In what follows, we show another property of $\|X\|_G$: it recovers nearly tight MGF bounds as in Definition 1. More powerfully, it enables us to derive the sub-G Hoeffding inequality (2).

Theorem 2. Suppose that $\{X_i\}_{i=1}^n$ are independent r.v.s with $\max_{i\in[n]}\|X_i\|_G < \infty$. Then:
(a) If $X_i$ is symmetric about zero, then $E\exp\{tX_i\} \le \exp\{t^2\|X_i\|_G^2/2\}$ for any $t \in \mathbb{R}$, and

$$P\Big(\Big|\sum_{i=1}^n X_i\Big| \ge s\Big) \le 2\exp\Big\{-\frac{s^2}{2\sum_{i=1}^n \|X_i\|_G^2}\Big\}, \quad s \ge 0.$$

(b) If $X_i$ is not symmetric, then $E\exp\{tX_i\} \le \exp\{(17/12)t^2\|X_i\|_G^2/2\}$ for any $t \in \mathbb{R}$, and

$$P\Big(\Big|\sum_{i=1}^n X_i\Big| \ge s\Big) \le 2\exp\Big\{-\frac{(12/17)s^2}{2\sum_{i=1}^n \|X_i\|_G^2}\Big\}, \quad s \ge 0.$$

Theorem 2(a) is an existing result, Theorem 2.6 of Wainwright (2019). For Theorem 2(b), we obtain the factor $\sqrt{17/12} \approx 1.19$, while Lemma 1.5 in Buldygin & Kozachenko (2000) obtained $E\exp\{tX_i\} \le \exp\{\frac{t^2}{2}(\sqrt[4]{3.1}\,\|X_i\|_G)^2\}$ for $t \in \mathbb{R}$ with $\sqrt[4]{3.1} \approx 1.32$. Essentially, the factor $17/12 > 1$ appears for asymmetric variables, since $\|\cdot\|_G$ is defined by comparison with a Gaussian variable $G$, which is symmetric. A technical reason for this improvement is that $\|\cdot\|_G$ does not need Stirling's approximation to attain a sharper MGF bound when expanding the exponential function by Taylor's formula. To show the tightness of Theorem 2(b), Figure 5 in Appendix C compares $\sigma_{\mathrm{opt}}(X)$, $\sqrt{17/12}\,\|X\|_G$, $\sqrt{2e}\,\|X\|_{\psi_2}$, $\|X\|_{w_2}/\sqrt{2}$ and $\sqrt{\operatorname{Var}X}$ in terms of the confidence lengths of Table 1, when $X$ follows a Bernoulli or beta distribution.
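Below is a hedged sketch of the two-sided CI half-width implied by Theorem 2 (the helper name is ours): inverting $2\exp\{-(12/17)s^2/(2n\|X\|_G^2)\} = \alpha$ at $s = nt$ gives the half-width; the symmetric case (a) drops the $17/12$ factor.

```python
# CI half-width implied by Theorem 2; a sketch, helper name ours.
import numpy as np

def hoeffding_ci_halfwidth(norm_G, n, alpha=0.05, symmetric=False):
    factor = 1.0 if symmetric else 17.0 / 12.0   # Theorem 2(a) vs. 2(b)
    return np.sqrt(factor) * norm_G * np.sqrt(2.0 * np.log(2.0 / alpha) / n)

print(hoeffding_ci_halfwidth(norm_G=1.0, n=100))  # ~0.32 at the 95% level
```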

4. ESTIMATION OF THE INTRINSIC MOMENT NORM

A first thought for estimating $\|X\|_G$ is the plug-in approach. Although $k_X$ is proven to be finite in Lemma 2, its (possibly large) exact value is still unknown in practice. Instead, we use a non-decreasing index sequence $\{\kappa_n\}$ to replace $k_X$ in the estimation. Hence, we suggest the feasible plug-in estimator

$$\widehat{\|X\|}_G = \max_{1\le k\le\kappa_n}\Big[\frac{1}{(2k-1)!!}\frac{1}{n}\sum_{i=1}^n X_i^{2k}\Big]^{1/(2k)}. \tag{6}$$

Deriving the non-asymptotic properties of $\widehat{\|X\|}_G$ is not an easy task: the maximum point $\hat{k}(\kappa_n) := \arg\max_{1\le k\le\kappa_n}\big[\frac{1}{(2k-1)!!}\frac{1}{n}\sum_{i=1}^n X_i^{2k}\big]^{1/(2k)}$ changes with the sample size $n$ even if $\kappa_n$ is fixed. To resolve this, we first examine the oracle estimator

$$\widetilde{\|X\|}_G = \Big[\frac{1}{(2k_X-1)!!}\frac{1}{n}\sum_{i=1}^n X_i^{2k_X}\Big]^{1/(2k_X)}.$$

Here, based on the Orlicz norm $\|Y\|_{\psi_\theta} := \inf\{t > 0 : E\exp\{|Y|^\theta/t^\theta\} \le 2\}$ of a sub-Weibull r.v. $Y$ with $\theta > 0$ (Hao et al., 2019; Zhang & Wei, 2022), we present the non-asymptotic concentration of $\widetilde{\|X\|}_G$ around its true value $\|X\|_G$.

Proposition 1. Suppose $\{X_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} X$ and $X$ satisfies $\|X\|_{\psi_{1/k_X}} < \infty$. Then for any $t > 0$,

$$P\Big(\big|\widetilde{\|X\|}_G^{2k_X} - \|X\|_G^{2k_X}\big| \le 2e\|X\|_{\psi_{1/k_X}}\Big[C(k_X^{-1})\sqrt{\frac{t}{n}} + \gamma^{2k_X}A(k_X^{-1})\frac{t^{k_X}}{n}\Big]\Big) \ge 1 - 2e^{-t},$$

where the constant $\gamma \approx 1.78$, and the constant functions $C(\cdot)$ and $A(\cdot)$ are defined in Appendix C.

The exponential-moment condition $\|X\|_{\psi_{1/k_X}} < \infty$ is too strong for the error bound of $\widetilde{\|X\|}_G$, although it has exponential decay probability $1 - 2\exp(-t)$. Beyond the direct plug-in estimator, we therefore resort to the median-of-means (MOM, Page 244 in Nemirovskij & Yudin (1983)) as a robust plug-in estimator of the intrinsic moment norm. Let $m$ and $b$ be positive integers such that $n = mb$, and let $B_1, \dots, B_b$ be a partition of $[n]$ into blocks of equal cardinality $m$. For any $s \in [b]$, let $P_{B_s}^m X = m^{-1}\sum_{i\in B_s} X_i$ for independent data $\{X_i\}_{i=1}^n$. The MOM version of the intrinsic moment norm estimator is defined as

$$\widehat{\|X\|}_{b,G} := \max_{1\le k\le\kappa_n}\Big[\frac{1}{(2k-1)!!}\operatorname{med}_{s\in[b]}\big(P_{B_s}^m X^{2k}\big)\Big]^{1/(2k)}. \tag{7}$$

As stated in Proposition 1, the naive plug-in estimator $\widehat{\|X\|}_G = \widehat{\|X\|}_{1,G}$ is not robust. The MOM estimator (7) with $b \gg 1$ has two merits: (a) it only needs finite moment conditions, yet still achieves exponential concentration bounds; (b) it permits some outliers in the data. Non-asymptotic inference requires bounding $\|X\|_G$ exactly by a feasible estimator $\widehat{\|X\|}_{b,G}$ up to sharp constants. Next, we establish a high-probability upper bound for the estimated norm under the following $O \cup I$ outlier assumptions.

• (M.1) Suppose that the data $\{X_i\}_{i=1}^n$ contain $n - n_o$ inliers $\{X_i\}_{i\in I}$ drawn i.i.d. from a target distribution, and there are no distributional assumptions on the $n_o$ outliers $\{X_i\}_{i\in O}$.
• (M.2) $b = b_O + b_S$, where $b_O$ is the number of blocks containing at least one outlier and $b_S$ is the number of sane blocks containing no outliers. Let $\varepsilon := n_o/n$ be the fraction of outliers with $\frac{n_o}{b} < \frac{1}{2}$, and assume there exists a fraction function $\eta(\varepsilon) \in (0, 1]$ for the sane blocks such that $b_S \ge \eta(\varepsilon)b$.

To serve the error bounds in the presence of outliers, (M.2) considers a specific fraction function for the polluted inputs; see Laforgue et al. (2021). Define the sequences $g_{k,m}(\sigma_k)$ and $\bar{g}_{k,m}(\sigma_k)$ for any $m \in \mathbb{N}$ and $1 \le k \le \kappa_n$:

$$g_{k,m}(\sigma_k) := 1 - \big[1 - 2[m/\eta(\varepsilon)]^{-1/2}\sigma_k^k/(EX^{2k})\big]^{1/(2k)}, \qquad \bar{g}_{k,m}(\sigma_k) := \big[1 + 2[m/\eta(\varepsilon)]^{-1/2}\sigma_k^k/(EX^{2k})\big]^{1/(2k)} - 1. \tag{8}$$

We obtain a robust and non-asymptotic CI for $\|X\|_G$.

Theorem 3 (Finite-sample guaranteed coverage). Suppose $\sqrt{\operatorname{Var}X^{2k}} \le \sigma_k^k$ for a sequence $\{\sigma_k\}_{k=1}^{\kappa_n}$. Then, under (M.1)-(M.2) and for $\kappa_n \ge k_X$, we have

$$P\Big\{\|X\|_G \le \big[1 - \max_{1\le k\le\kappa_n} g_{k,m}(\sigma_k)\big]^{-1}\widehat{\|X\|}_{b,G}\Big\} > 1 - \kappa_n\cdot e^{-2b\eta(\varepsilon)(1 - \frac{3}{4\eta(\varepsilon)})^2}$$

and

$$P\Big\{\|X\|_G \ge \big[1 + \max_{1\le k\le\kappa_n} \bar{g}_{k,m}(\sigma_k)\big]^{-1}\widehat{\|X\|}_{b,G}\Big\} > 1 - \kappa_n\cdot e^{-2b\eta(\varepsilon)(1 - \frac{3}{4\eta(\varepsilon)})^2}.$$

Theorem 3 ensures the concentration of the estimator $\widehat{\|X\|}_{b,G}$ when $\kappa_n \ge k_X$, given enough samples. If $\eta(\varepsilon) = 1$ with $\varepsilon = 0$, the data are i.i.d. without outliers, and the outlier assumptions (M.1)-(M.2) can be dropped in Theorem 3. When the data are i.i.d. Gaussian vectors, Proposition 4.1 in Auer et al. (2002) also gives a high-probability estimated upper bound for the $\ell_p$-norm of the vector of Gaussian standard deviations; our result is for the intrinsic moment norm. In practice, the block number $b$ can be chosen adaptively based on the Lepski method (Depersin & Lecué, 2022).
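The two estimators just defined are straightforward to implement. Below is a sketch of (6) and (7); the function names are ours, and `np.array_split` gives near-equal blocks, a mild simplification of the equal-cardinality partition $B_1, \dots, B_b$.

```python
# Sketches of the plug-in estimator (6) and the MOM estimator (7).
import numpy as np
from math import prod

def dfact(m):                                  # double factorial m!! for odd m
    return prod(range(m, 0, -2))

def norm_G_plugin(x, kappa_n):                 # estimator (6)
    x = np.asarray(x)
    return max((np.mean(x**(2*k)) / dfact(2*k - 1)) ** (1 / (2*k))
               for k in range(1, kappa_n + 1))

def norm_G_mom(x, kappa_n, b, seed=0):         # estimator (7)
    x = np.random.default_rng(seed).permutation(np.asarray(x))
    blocks = np.array_split(x, b)              # near-equal blocks B_1..B_b
    return max((np.median([np.mean(blk**(2*k)) for blk in blocks])
                / dfact(2*k - 1)) ** (1 / (2*k))
               for k in range(1, kappa_n + 1))
```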
To guarantee the high-probability events in Theorem 3, the index sequence $\kappa_n$ should not be too large for a fixed $b$; a larger $\kappa_n$ requires a larger number of blocks $B_1, \dots, B_b$. In the simulations, we will see that an index sequence $\kappa_n$ increasing at a slow rate leads to good performance. Finally, we compare our two estimators (6) and (7), as well as the estimator (4), in Figure 3. We consider $X$ distributed as standard Gaussian and as Rademacher; in both cases $\|X\|_G^2 = \sigma^2_{\mathrm{opt}}(X) = \operatorname{Var}(X) = 1$. Figure 3 shows the performance of the three estimators for sample sizes $n = 10$ to $1000$, with $\kappa_n$ chosen simply as $\lceil\log n\rceil$. For the MOM method, we use five blocks in this simple setting. For more complex cases, one can use Lepski's method to choose $b$ (see Page & Grünewälder (2021)), though it may introduce considerable computational cost. From Figure 3, we see that the MOM estimator performs best, while the naive estimator (4) performs worst. For high-quality data of extremely small sample size, the leave-one-out Hodges-Lehmann method (Rousseeuw & Verboven, 2002) can be applied for further numerical improvement; see Appendix B for details.
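A small driver mimicking the Figure 3 setup ($\kappa_n = \lceil\log n\rceil$, $b = 5$ blocks), reusing `norm_G_plugin` and `norm_G_mom` from the sketch above; the truth is $\|X\|_G = 1$ for standard Gaussian data.

```python
# Driver for the Figure 3 comparison setup (a sketch, reusing the functions above).
import numpy as np

rng = np.random.default_rng(2)
for n in (10, 100, 1000):
    x = rng.standard_normal(n)
    kappa_n = max(1, int(np.ceil(np.log(n))))
    print(n, round(norm_G_plugin(x, kappa_n), 3), round(norm_G_mom(x, kappa_n, b=5), 3))
```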

5. APPLICATION IN MULTI-ARMED BANDIT PROBLEM

In the multi-armed bandit (MAB) problem, a player chooses among $K$ different slot machines (a $K$-armed bandit), each with an unknown random reward r.v., $\{Y_k\}_{k=1}^K \subseteq \mathbb{R}$; each realization of a fixed arm $k$ is independent and shares the same distribution. Further, we assume the rewards are sub-Gaussian, i.e.,

$$\|Y_k - \mu_k\|_G < \infty, \quad k \in [K]. \tag{9}$$

Our goal is to find the best arm, say $Y_{t^\star}$, with the largest expected reward, by pulling arms. In each round $t \in [T]$, the player pulls an arm (an action) $A_t \in [K]$. Conditioning on $\{A_t = k\}$, we define the observed rewards $\{Y_{k,t}\}_{t\in[T]} \overset{\text{i.i.d.}}{\sim} P_k$. The goal of exploration in MAB is to minimize the cumulative regret after $T$ steps, $\mathrm{Reg}_T(Y, A) := \sum_{t=1}^T (\mu_{t^\star} - Y_{A_t,t})$; exploration performs better when $\mathrm{Reg}_T(Y, A)$ is smaller. Without loss of generality, we assume $t^\star = 1$. We seek to evaluate the expected bounds from the decomposition (see Lemma 4.5 in Lattimore & Szepesvári (2020))

$$\mathrm{Reg}_T := E\,\mathrm{Reg}_T(Y, A) = \sum_{k=1}^K \Delta_k\, E\Big[\sum_{t=1}^T \mathbb{1}\{A_t = k\}\Big],$$

where $E$ is taken over the randomness of the player's actions $\{A_t\}_{t\in[T]}$, and $\Delta_k = \mu_1 - \mu_k$ is the sub-optimality gap for arm $k \in [K]\setminus\{1\}$. The upper bound on $\mathrm{Reg}_T$ is called problem-independent if the regret bound depends on the distribution of the data but not on the gaps $\Delta_k$. For each round $t$, let $T_k(t) := \operatorname{card}\{1 \le \tau \le t : A_\tau = k\}$ be the number of pulls of arm $k$ up to time $t$ during the bandit process, and define $\bar{Y}_{T_k(t)} := \frac{1}{T_k(t)}\sum_{\tau\le t, A_\tau = k} Y_{k,\tau}$ as the running average reward of arm $k$ at time $t$. Suppose we obtain a $100(1-\delta)\%$ CI $[\bar{Y}_{T_k(t)} - c_k(t), \bar{Y}_{T_k(t)} + c_k(t)]$ for $\mu_k$ from a tight concentration inequality. We then optimistically reckon the reward of arm $k$ to be $\bar{Y}_{T_k(t)} + c_k(t)$ and play the arm $A_t = k$ maximizing this quantity, hoping to maximize the reward with high probability for finite $t$. This is the upper confidence bound (UCB, Auer et al. (2002)) family of algorithms, on which many recent works build; for example, Hao et al. (2019) use a bootstrap method with a second-order correction to obtain explicit regret bounds for sub-Gaussian rewards. However, many existing algorithms involve unknown norms of the random rewards and are thus actually infeasible: for instance, the algorithm of Hao et al. (2019) needs the unknown Orlicz norm of $Y_k - \mu_k$. Fortunately, our estimator solves this problem, and Theorem 4 below gives an example with an explicit regret bound.

Suppose that $Y_k - \mu_k$ is symmetric around zero. By the one-sided version of Theorem 2, (9) implies that for all $k$ and all $t$,

$$P\Big(\bar{Y}_{T_k(t)} > \mu_k + \|Y_k - \mu_k\|_G\sqrt{\tfrac{2}{T_k(t)}\log\tfrac{1}{\delta}}\Big) \le \delta. \tag{10}$$

Let the subsample size $m_k$ and block number $b_k$ be positive integers such that $T_k(t) = m_k b_k$ for the MOM estimators $\widehat{\|Y_k - \mu_k\|}_{b_k,G}$ of Section 3. Theorem 3(a) guarantees that the true norms can be replaced by the MOM-estimated norms, so that

$$P\Big(\bar{Y}_{T_k(t)} \le \mu_k + \frac{\widehat{\|Y_k - \mu_k\|}_{b_k,G}}{1 - o(1)}\sqrt{\tfrac{2}{T_k(t)}\log\tfrac{1}{\delta}}\Big) \ge 1 - \delta - k_{Y_k}\cdot\exp(-b_k/8)$$

if $\eta(\varepsilon) = 1$ with $\varepsilon = 0$. If the UCB algorithm is correctly applied, then for finite $T_k(t)$ we pull the best arm with high probability. In practice, we have almost no prior knowledge about the data. As a flexible tool for uncertainty quantification, the multiplier bootstrap (Arlot et al., 2010) mimics the non-asymptotic properties of the target statistic by reweighting the summands of the centralized empirical mean. The multiplier bootstrapped quantile for the i.i.d. observations $Y^n := \{Y_i\}_{i=1}^n$ is the $(1-\alpha)$-quantile of the distribution of $n^{-1}\sum_{i=1}^n w_i(Y_i - \bar{Y}_n)$, defined as

$$q_\alpha(Y^n - \bar{Y}_n, w) := \inf\Big\{x \in \mathbb{R} \;\Big|\; P_w\Big(n^{-1}\sum_{i=1}^n w_i(Y_i - \bar{Y}_n) > x\Big) \le \alpha\Big\},$$

where $w := \{w_i\}_{i=1}^n$ are bootstrap random weights independent of $Y^n$.
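The quantile $q_\alpha$ has no closed form but is easy to approximate by Monte Carlo. Below is a sketch with Rademacher weights; the Monte Carlo size `n_boot` is our own choice.

```python
# Sketch of the multiplier-bootstrap quantile q_alpha(Y^n - Ybar_n, w).
import numpy as np

def bootstrap_quantile(y, alpha, n_boot=2000, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    y = np.asarray(y)
    centered = y - y.mean()
    w = rng.choice([-1.0, 1.0], size=(n_boot, len(y)))    # Rademacher weights
    stats = (w * centered).mean(axis=1)                   # n^{-1} sum_i w_i (Y_i - Ybar_n)
    return float(np.quantile(stats, 1.0 - alpha))         # (1 - alpha)-quantile
```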
We denote by $\varphi_G(Y^n)$ any statistic satisfying $P_{Y^n}(|\bar{Y}_n - EY_1| \ge \varphi_G(Y^n)) \le \alpha$.

Algorithm 1: Bootstrapped UCB
  Input: $\varphi_G(Y^{T_k(t)})$ given by (11); confidence level $\alpha \in (0,1)$.
  for t = 1, ..., K do: pull each arm once to initialize the algorithm. end
  for t = K + 1, ..., T do:
    Calculate the bootstrapped quantile $q_{\alpha/2}(Y^{T_k(t)} - \bar{Y}_{T_k(t)}, w)$ with Rademacher bootstrap weights $w$ independent of any $Y$.
    Pull the arm $A_t = \arg\max_{k\in[K]} \mathrm{UCB}_k(t) := \arg\max_{k\in[K]}\big\{\bar{Y}_{T_k(t)} + q_{\alpha/2}(Y^{T_k(t)} - \bar{Y}_{T_k(t)}, w) + \sqrt{\tfrac{2\log(4/\alpha)}{T_k(t)}}\,\varphi_G(Y^{T_k(t)})\big\}$.
    Receive reward $Y_{A_t}$.
  end

Motivated by Hao et al. (2019), we design Algorithm 1 based on estimators of the UCB. It guarantees a relatively small regret via the bootstrapped threshold $q_{\alpha/2}(Y^{T_k(t)} - \bar{Y}_{T_k(t)}, w)$ plus a concentration-based second-order correction $\varphi_G(Y^{T_k(t)})$ specified via Theorem 3. In the following regret bounds, we assume the mean reward $\mu_k$ of the $k$-th arm is known; in practice, it can be replaced by a robust estimator, and we obtain the results with the MOM estimator.

Theorem 4. Consider a $K$-armed sub-G bandit under (9), and suppose that $Y_k - \mu_k$ is symmetric around zero. For any round $T$, under the moment conditions of Theorem 3, choose

$$\varphi_G(Y^{T_k(t)}) = \sqrt{\tfrac{2\log(4/\alpha)}{T_k(t)}}\,[1 - o(1)]^{-1}\,\widehat{\|Y_k - \mu_k\|}_{b_k,G} \tag{11}$$

as a re-scaled version of the MOM estimator $\widehat{\|Y_k - \mu_k\|}_{b_k,G}$, with block number $b_k$ satisfying the moment assumptions C[UCB1] and C[UCB2] in Appendix C. Fix the confidence level $\alpha = 4/T^2$. If the player pulls arms $A_t \in [K]$ according to Algorithm 1, then the problem-dependent regret of Algorithm 1 is bounded by

$$\mathrm{Reg}_T \le 16(2+\sqrt{2})^2 \max_{k\in[K]}\|Y_k - \mu_k\|_G^2 \log T \sum_{k=2}^K \Delta_k^{-1} + \big(4T^{-1} + 2T^{-25-16\sqrt{2}} + 8\big)\sum_{k=2}^K \Delta_k,$$

where $\Delta_k$ is the sub-optimality gap. Moreover, letting $\mu_1^* := \max_{k_1\in[K]}\mu_{k_1} - \min_{k_2\in[K]}\mu_{k_2}$ be the range of the rewards, the problem-independent regret satisfies

$$\mathrm{Reg}_T \le 8(2+\sqrt{2})\max_{k\in[K]}\|Y_k - \mu_k\|_G\sqrt{TK\log T} + \big(4T^{-1} + 2T^{-25-16\sqrt{2}} + 8\big)K\mu_1^*.$$

From Theorem 4, the regret of our method achieves the minimax rate $\log T$ in the problem-dependent case and $\sqrt{KT}$ in the problem-independent case (see Tao et al. (2022)), so Algorithm 1 can be regarded as an optimal algorithm. Compared with traditional vanilla UCB, we also improve the constant: when $Y_k \sim N(\mu_k, 1)$, the constant factor in the regret bound of Auer et al. (2002) is 256, which is larger than our $16(2+\sqrt{2})^2$. When the UCB involves unknown sub-G parameters, Theorem 4 is the first to study a feasible UCB algorithm with plug-in estimation of the sub-G parameter; many previous UCB algorithms based on non-asymptotic inference assume that the sub-G parameter is a preset constant, see e.g. the algorithm in Hao et al. (2019). Next, we give a simulation for Theorem 4 in two sub-G cases to verify the performance of the estimated norms. Similar to Hao et al. (2019); Wang et al. (2020), we design three methods, the first being our method $\varphi_G(Y^{T_k(t)})$ with the estimated norm from Theorem 4; the remaining baselines, including a CLT-based UCB, are the ones compared in Figure 4. Both examples, EG1 and EG2, use sub-Gaussian rewards. In the simulation, $\mu_k$ is assigned slightly small values so that $\max_k \Delta_k$ is bounded, a standard setting in the MAB problem (see the condition of Corollary 1 in Wang et al. (2020), for instance). The simulation results are shown in Figure 4, which illustrates that our method outperforms the other two methods under unbounded sub-Gaussian rewards and small samples $T \in [1, 150]$. In the Gaussian case, Algorithm 1 also gives better results than the CLT-based UCB when the round count is relatively small. For the Gaussian mixture case, the algorithm based on the intrinsic moment norm has smaller regret than the other two methods for $T \in [1, 800]$.
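To make the loop of Algorithm 1 concrete, here is a compact sketch; reward generation, data structures, and the `phi_G` argument are our own simplifications, and `bootstrap_quantile` is the helper sketched earlier in this section.

```python
# A compact sketch of Algorithm 1 (Bootstrapped UCB); simplifications ours.
import numpy as np

def bootstrapped_ucb(pull, K, T, alpha, phi_G, seed=0):
    """pull(k) samples one reward of arm k; phi_G(y) is the correction of (11)."""
    rng = np.random.default_rng(seed)
    rewards = [[pull(k)] for k in range(K)]            # rounds 1..K: one pull per arm
    for t in range(K, T):
        ucb = np.empty(K)
        for k in range(K):
            y = np.asarray(rewards[k])
            q = bootstrap_quantile(y, alpha / 2, rng=rng)
            ucb[k] = y.mean() + q + np.sqrt(2 * np.log(4 / alpha) / len(y)) * phi_G(y)
        a = int(np.argmax(ucb))                        # optimistic arm choice
        rewards[a].append(pull(a))
    return rewards
```

Per our reading of (11), `phi_G` could be instantiated as `lambda y: np.sqrt(2*np.log(4/alpha)/len(y)) * norm_G_mom(y - y.mean(), kappa_n, b)` with the MOM estimator of Section 4; this is a hedged interpretation, not the authors' code.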



Footnotes:
[1] Recently, Phan et al. (2021) obtained a sharper result than Hoeffding's inequality for bounded data.
[2] For simplicity, we consider centered random variables (r.v.s) with zero mean throughout the paper for all sub-Gaussian r.v.s.
[3] We point out that a conservative and inconsistent estimator $2\inf_{t\in\mathbb{R}}\log(n^{-1}\sum_{i=1}^n\exp(tX_i))/t^2$ was proposed in the statistical physics literature (Wang, 2020).
[4] The sub-G plot can only be applied to data with enough samples. When $n$ is very small, there is not enough information to suggest unbounded trends; we roughly treat the data as bounded r.v.s for very small $n$, and there is no need for a sub-G plot in this case.



Figure 1: CIs via Hoeffding's inequality (red line) and the B-E-corrected CLT (blue line). The figure illustrates a deficiency of the B-E-corrected CLT for small samples and suggests that even simple Hoeffding's inequality can perform better.

Figure 2: Sub-Gaussian plots of the standard Gaussian and standard exponential distributions for n = 1000. Left: the two dotted lines indicate that the points fall in a triangular region with high probability. Right: the points for the exponential distribution approximately lie in a curved triangular region with a quadratic trend.



Figure 3: DE represents the naive plug-in estimator (6), MOM the MOM estimator (7), and OP the naive plug-in estimator (4) of the optimal variance proxy.



Figure 4: Regret of the MAB with sub-G rewards under the three methods. The x-axis represents the round and the y-axis the cumulative regret.



