TIGHT NON-ASYMPTOTIC INFERENCE VIA SUB-GAUSSIAN INTRINSIC MOMENT NORM

Abstract

In non-asymptotic statistical inference, variance-type parameters of sub-Gaussian distributions play a crucial role. However, direct estimation of these parameters based on the empirical moment generating function (MGF) is infeasible. Instead, we recommend using the sub-Gaussian intrinsic moment norm [Buldygin and Kozachenko (2000), Theorem 1.3], obtained by maximizing a sequence of normalized moments. Importantly, the recommended norm can not only recover the exponential moment bounds of the corresponding MGFs, but also lead to tighter Hoeffding-type sub-Gaussian concentration inequalities. In practice, we propose an intuitive way of checking whether data with a finite sample size are sub-Gaussian via the sub-Gaussian plot. The intrinsic moment norm can be robustly estimated by a simple plug-in approach. Our theoretical results are applied to non-asymptotic analyses, including the multi-armed bandit problem.

1. INTRODUCTION

With the advancement of machine learning techniques, computer scientists have become increasingly interested in establishing rigorous error bounds for learning procedures, especially bounds with finite-sample validity (Wainwright, 2019; Zhang & Chen, 2021; Yang et al., 2020). In specific settings, statisticians, econometricians, engineers and physicists have developed non-asymptotic inference to quantify uncertainty in data; see Romano & Wolf (2000); Chassang (2020). Concentration-based statistical inference has therefore received considerable attention, especially for bounded data (Romano & Wolf, 2000; Auer et al., 2002; Hao et al., 2019; Wang et al., 2021; Shiu, 2022) and Gaussian data (Arlot et al., 2010; Duy & Takeuchi, 2022; Bettache et al., 2021; Feng et al., 2021). For example, Hoeffding's inequality can be applied to construct non-asymptotic confidence intervals based on bounded data¹. However, in practice it may be hard to know the support of the data or the underlying distribution. In this case, misusing Hoeffding's inequality (Hoeffding, 1963) for unbounded data results in a notably loose confidence interval (CI); see Appendix A.1. Hence, it is common practice to assume that the data follow a sub-Gaussian distribution (Kahane, 1960). By the Chernoff inequality², we have
$$\mathbb{P}(X \ge t) \le \inf_{s>0} \exp\{-st\}\,\mathbb{E}\exp\{sX\}, \quad \forall\, t \ge 0.$$
Hence, the tightness of a confidence interval relies on how we upper bound the moment generating function (MGF) $\mathbb{E}\exp\{sX\}$ for all $s > 0$. This can be further translated into the following optimal variance proxy of a sub-Gaussian distribution.

Definition 1. A r.v. $X$ is sub-Gaussian (sub-G) with a variance proxy $\sigma^2$ [denoted as $X \sim \mathrm{subG}(\sigma^2)$] if its MGF satisfies $\mathbb{E}\exp(tX) \le \exp(\sigma^2 t^2/2)$ for all $t \in \mathbb{R}$. The sub-Gaussian parameter $\sigma_{\mathrm{opt}}(X)$ is defined by the optimal variance proxy (Chow, 1966):
$$\sigma^2_{\mathrm{opt}}(X) := \inf\big\{\sigma^2 > 0 : \mathbb{E}\exp(tX) \le \exp\{\sigma^2 t^2/2\},\ \forall\, t \in \mathbb{R}\big\} = 2\sup_{t \in \mathbb{R}} t^{-2}\log[\mathbb{E}\exp(tX)].$$

Note that $\sigma^2_{\mathrm{opt}}(X) \ge \operatorname{Var} X$; see (14) in Appendix A.2. When $\sigma^2_{\mathrm{opt}}(X) = \operatorname{Var} X$, $X$ is called strictly sub-Gaussian (Arbel et al., 2020). Based on Theorem 1.5 in Buldygin & Kozachenko (2000), we have
$$\mathbb{P}(X \ge t) \le \exp\Big(-\frac{t^2}{2\sigma^2_{\mathrm{opt}}(X)}\Big), \qquad \mathbb{P}\Big(\Big|\sum_{i=1}^{n} X_i\Big| \ge t\Big) \le 2\exp\Big(-\frac{t^2}{2\sum_{i=1}^{n}\sigma^2_{\mathrm{opt}}(X_i)}\Big) \quad (2)$$
for independent sub-G r.v.s $X$ and $\{X_i\}_{i=1}^n$. The inequalities in (2) provide the tightest upper bounds of the form $\mathbb{P}(X \ge t) \le \exp(-Ct^2)$ [or $\mathbb{P}(|\sum_{i=1}^{n} X_i| \ge t) \le \exp(-Ct^2)$], for some positive constant $C$, that are obtainable via the Chernoff inequality. Given $\{X_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} \mathrm{subG}(\sigma^2_{\mathrm{opt}}(X))$, a straightforward application of (2) gives the non-asymptotic $100(1-\alpha)\%$ CI
$$\mathbb{E}X = 0 \in \big[\bar{X}_n \pm \sigma_{\mathrm{opt}}(X)\sqrt{2n^{-1}\log(2/\alpha)}\,\big].$$
A naive plug-in estimate³ of $\sigma^2_{\mathrm{opt}}(X) := 2\sup_{t\in\mathbb{R}} t^{-2}\log[\mathbb{E}\exp(tX)]$ (Arbel et al., 2020) is
$$\hat{\sigma}^2_{\mathrm{opt}}(X) := 2\sup_{t\in\mathbb{R}} t^{-2}\log\Big[n^{-1}\sum_{i=1}^{n}\exp(tX_i)\Big]. \quad (4)$$
However, two weaknesses of (4) substantially hinder its application: (i) the optimization is unstable due to the possible non-convexity of the objective function; (ii) an exponentially large $n$ is required to prevent the variance term $\operatorname{Var}(n^{-1}\sum_{i=1}^{n}\exp(tX_i))$ from exploding when $t$ is large. In Section 3, we present some simulation evidence. On the other hand, we are aware of other forms of variance-type parameters. For instance, van der Vaart & Wellner (1996) introduced the Orlicz norm $\|X\|_{w_2} := \inf\{c > 0 : \mathbb{E}\exp\{|X|^2/c^2\} \le 2\}$, frequently used in empirical process theory. Additionally, Vershynin (2010) suggested a norm based on the scale of moments, $\|X\|_{\psi_2} := \max_{k \ge 2} k^{-1/2}(\mathbb{E}|X|^k)^{1/k}$; see also Page 6 of Buldygin & Kozachenko (2000).
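Before turning to these alternative norms, the following minimal numerical sketch illustrates why (4) is problematic. It is our own illustration rather than the paper's procedure: the helper name, grid range, and grid resolution are assumptions, and a grid search only approximates the supremum over $t$. It computes the naive plug-in estimate on simulated standard Gaussian data, for which $\sigma^2_{\mathrm{opt}}(X) = 1$, and forms the corresponding CI; rerunning with different seeds or wider grids shows how unstable the estimate becomes when $t$ is large.

import numpy as np

def naive_sigma_opt_sq(x, t_max=5.0, n_grid=2001):
    """Naive plug-in estimate (4): 2 * sup_t t^{-2} log(n^{-1} sum_i exp(t*x_i)),
    with the supremum approximated by a grid search over t in [-t_max, t_max] \ {0}."""
    t_grid = np.linspace(-t_max, t_max, n_grid)
    t_grid = t_grid[np.abs(t_grid) > 1e-3]                       # avoid t = 0
    log_emp_mgf = np.array([np.log(np.mean(np.exp(t * x))) for t in t_grid])
    return 2.0 * np.max(log_emp_mgf / t_grid ** 2)

rng = np.random.default_rng(0)
n, alpha = 50, 0.05
x = rng.normal(size=n)                                           # sigma_opt^2(X) = 1 for N(0, 1)
sigma2_hat = naive_sigma_opt_sq(x)
half_width = np.sqrt(sigma2_hat * 2 * np.log(2 / alpha) / n)     # CI half length implied by (2)
print(sigma2_hat, (x.mean() - half_width, x.mean() + half_width))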
However, as shown in Table 1 and Appendix A.2.1, both the Orlicz norm and the $\psi_2$ norm fail to deliver sharp probability bounds even for strictly sub-G distributions, such as the standard Gaussian distribution and the symmetric Beta distribution.

Table 1: Comparison of sub-Gaussian norms $\|\cdot\|_*$ for centralized and symmetric $X$.

$\|\cdot\|_*$-norm | sharp tail for $\mathbb{P}(|X| \ge t)$ | sharp MGF bound | half length of $(1-\delta)$-CI | easy to estimate
$\sigma_{\mathrm{opt}}(X)$ | Yes $[2\exp\{-\frac{t^2}{2}/\sigma^2_{\mathrm{opt}}(X)\}]$ | Yes $[\exp\{\sigma^2_{\mathrm{opt}}(X)\frac{t^2}{2}\}]$ | $\sqrt{2\log(2/\delta)}\,\sigma_{\mathrm{opt}}(X)$ | No
$\|X\|_{w_2}$ | Yes $[2\exp\{-\frac{t^2}{2}/(\|X\|_{w_2}/\sqrt{2})^2\}]$ | No $[\exp\{(2\|X\|_{w_2})^2\frac{t^2}{2}\}]$ | $\sqrt{2\log(2/\delta)}\,\|X\|_{w_2}/\sqrt{2}$ | No
$\|X\|_{\psi_2}$ | No $[2\exp\{-\frac{t^2}{2}/(2e\|X\|^2_{\psi_2})\}]$ | No $[\exp\{(4\sqrt{e}\,\|X\|_{\psi_2})^2\frac{t^2}{2}\}]$ | $\sqrt{2\log(2/\delta)}\,\sqrt{2e}\,\|X\|_{\psi_2}$ | Yes
$\|X\|_G$ (Def. 2) | Yes $[2\exp\{-\frac{t^2}{2}/\|X\|^2_G\}]$ | Yes $[\exp\{\|X\|^2_G\frac{t^2}{2}\}]$ | $\sqrt{2\log(2/\delta)}\,\|X\|_G$ | Yes

1.1 CONTRIBUTIONS

In light of the above discussion, we advocate the use of the intrinsic moment norm in Definition 2 for the construction of tight non-asymptotic CIs, for two specific reasons: (i) it approximately recovers the tight inequalities (2); (ii) it can be estimated easily (in closed form) and robustly. The following Definition 2 is from Page 6 and Theorem 1.3 in Buldygin & Kozachenko (2000).

Definition 2 (Intrinsic moment norm).
$$\|X\|_G := \max_{k \ge 1}\Big[\frac{2^k k!}{(2k)!}\,\mathbb{E}X^{2k}\Big]^{1/(2k)} = \max_{k \ge 1}\Big[\frac{1}{(2k-1)!!}\,\mathbb{E}X^{2k}\Big]^{1/(2k)}.$$

From the sub-G characterization (see Theorem 2.6 in Wainwright (2019)), $\|X\|_G < \infty$ iff $\sigma_{\mathrm{opt}}(X) < \infty$ for any zero-mean r.v. $X$. Hence, a finite intrinsic moment norm of a r.v. $X$ ensures sub-Gaussianity (i.e., Definition 1 is satisfied); a small numerical sketch of its plug-in estimate is given after the contribution list below. Our contributions in this paper can be summarized as follows.

1. Using $\|X\|_G$, we achieve a sharper Hoeffding-type inequality under asymmetric distributions; see Theorem 2(b).
2. Compared to the normal approximation based on Berry-Esseen (B-E) bounds, our results are more applicable to data of extremely small sample size. For Bernoulli observations, Figure 1 compares two types of CIs based on the B-E-corrected CLT and Hoeffding's inequality; see Appendix A for details.
3. A novel method called the sub-Gaussian plot is proposed for checking whether unbounded data are sub-Gaussian. We introduce plug-in and robust plug-in estimators for $\|X\|_G$ and establish finite-sample theories.
4. Finally, we employ the intrinsic moment norm estimation in the non-asymptotic inference for a bandit problem: a Bootstrapped UCB algorithm for multi-armed bandits. This algorithm is shown to achieve feasible error bounds and competitive cumulative regret on unbounded sub-Gaussian data.
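As a minimal sketch of the plug-in idea behind Definition 2 (our own illustration, not the paper's exact estimator; the helper name and the truncation level K are assumptions), the snippet below replaces $\mathbb{E}X^{2k}$ with empirical moments, normalizes by $(2k-1)!!$, and takes the maximum over $k \le K$. For standard Gaussian data the population value is $\|X\|_G = 1$, since $\mathbb{E}X^{2k} = (2k-1)!!$.

import numpy as np

def intrinsic_moment_norm_plugin(x, K=10):
    """Plug-in estimate of ||X||_G = max_{k>=1} [EX^{2k} / (2k-1)!!]^{1/(2k)},
    with EX^{2k} replaced by the empirical moment and the max truncated at k = K."""
    x = np.asarray(x, dtype=float)
    vals = []
    for k in range(1, K + 1):
        m2k = np.mean(x ** (2 * k))                                # empirical 2k-th moment
        dfact = np.prod(np.arange(1, 2 * k, 2, dtype=float))       # (2k-1)!!
        vals.append((m2k / dfact) ** (1.0 / (2 * k)))
    return max(vals)

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
print(intrinsic_moment_norm_plugin(x))                             # close to 1 for large n

High-order empirical moments are sensitive to outliers, which is one motivation for the robust plug-in variant mentioned in Contribution 3.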



¹ Recently, Phan et al. (2021) obtained a sharper result than Hoeffding's inequality for bounded data.
² For simplicity, all sub-Gaussian random variables (r.v.s) throughout the paper are assumed to be centered, i.e., to have zero mean.
³ We point out that a conservative and inconsistent estimator $2\inf_{t\in\mathbb{R}} \log\big(n^{-1}\sum_{i=1}^{n}\exp(tX_i)\big)/t^2$ was proposed in the statistical physics literature (Wang, 2020).




