SHARPER BOUNDS FOR UNIFORMLY STABLE ALGORITHMS WITH STATIONARY MIXING PROCESSES

Abstract

Generalization analysis of learning algorithms often builds on a critical assumption that training examples are independently and identically distributed, which is often violated in practical problems such as time series prediction. In this paper, we use algorithmic stability to study the generalization performance of learning algorithms with ψ-mixing data, where the dependency between observations weakens over time. We show uniformly stable algorithms guarantee high-probability generalization bounds of the order $O(1/\sqrt{n})$ (within a logarithmic factor), where $n$ is the sample size. We apply our general result to specific algorithms including regularization schemes, stochastic gradient descent and localized iterative regularization, and develop excess population risk bounds for learning with ψ-mixing data. Our analysis builds on a novel moment bound for weakly-dependent random variables on a φ-mixing sequence and a novel error decomposition of generalization error.

* We use $\widetilde{O}$ to hide logarithmic factors.

1. INTRODUCTION

The generalization gap refers to the discrepancy between training and testing, which is a quantity of central importance in statistical learning theory (SLT) (Shalev-Shwartz & Ben-David, 2014). A popular approach to controlling the generalization gap is to bound it by the uniform convergence between training and testing errors over a function space (Bartlett & Mendelson, 2002), which leads to bounds depending on the complexity of function spaces, such as VC dimension (Vapnik, 2013), covering number (Cucker & Zhou, 2007) and Rademacher complexity (Bartlett & Mendelson, 2002). These complexity-based bounds do not exploit the properties of a learning algorithm and generally admit a square-root dependency on the dimension (Feldman, 2016), which is not favorable for large-scale problems. To incorporate the properties of a learning algorithm and remove the dependency on dimension, the concept of algorithmic stability has been introduced into SLT (Bousquet & Elisseeff, 2002). Intuitively speaking, algorithmic stability measures how a small perturbation of the training dataset affects the output model of a learning algorithm, and it is closely connected to several key properties such as learnability (Shalev-Shwartz et al., 2010), robustness and privacy (Bassily et al., 2020). Recent research has witnessed an increasing interest in leveraging stability to study the generalization behavior of various algorithms, such as stochastic gradient descent (Hardt et al., 2016), structured prediction (London et al., 2016), meta learning (Maurer, 2005) and transfer learning (Kuzborskij & Lampert, 2018). Most of these discussions are based on a critical assumption that the training examples are independently and identically distributed (i.i.d.). This assumption is often violated in practical applications. For example, the i.i.d. assumption is too restrictive in time series prediction (Vidyasagar, 2013).
The prices of the same stock on different days may have temporal dependence. These phenomena motivate several analyses to derive meaningful bounds for learning problems with observations drawn from a non-i.i.d. process (Yu, 1994; Vidyasagar, 2013). A widely used relaxation of the i.i.d. assumption is to assume the observations are drawn from a mixing process (Yu, 1994; Meir, 2000; Lozano et al., 2005; Vidyasagar, 2013), where the dependency between two observations is quantified by a mixing coefficient as a function of the gap between the two associated indices. These mixing coefficients decay either as a polynomial function or as an exponential function of the gap (Vidyasagar, 2013). Several mixing processes have been introduced into the literature, including the β-mixing, φ-mixing and ψ-mixing processes (Yu, 1994; Meir, 2000; Lozano et al., 2005). Within this formulation, various generalization bounds have been developed to show how the dependency among observations affects the learning process. Interestingly, these discussions imply a concept called "effective size", which plays a role similar to that of the sample size in the i.i.d. scenario (Yu, 1994; Kuznetsov & Mohri, 2017). As in the i.i.d. case, most generalization analyses in the non-i.i.d. case focus on complexity-based bounds (Meir, 2000; Yu, 1994; Kuznetsov & Mohri, 2017). There are few stability analyses of learning algorithms in the non-i.i.d. case. An exception is the work of Mohri & Rostamizadeh (2010), which, to our knowledge, gives the first systematic analysis of stability and generalization in a non-i.i.d. case. The authors developed high-probability generalization bounds for learning with stationary φ-mixing and β-mixing sequences, which are then applied to general kernel regularization-based algorithms. Due to their algorithm-specific nature, these bounds are preferable to complexity-based bounds if the associated hypothesis space has a very large complexity.
However, the stability analysis (Mohri & Rostamizadeh, 2010) only implies sub-optimal generalization bounds. Indeed, for β-uniformly stable algorithms, the high-probability bounds in Mohri & Rostamizadeh (2010) are of the order $O(\sqrt{n}\beta + \Delta_n/\sqrt{n})$, where $n$ is the sample size and $\Delta_n$ is a term depending on the decay rate of mixing coefficients. For learning with λ-strongly convex problems, the uniform stability parameter is of the order $O(1/(n\lambda))$ (Bousquet & Elisseeff, 2002) and therefore the bounds in Mohri & Rostamizadeh (2010) become $O(1/(\sqrt{n}\lambda) + \Delta_n/\sqrt{n})$. A typical choice of λ is $\lambda \approx n^{-\alpha}$ for $\alpha > 0$ (Shalev-Shwartz & Ben-David, 2014) and then the bounds further become $O(n^{\alpha-\frac{1}{2}} + \Delta_n/\sqrt{n})$, which cannot imply the optimal bounds $O(1/\sqrt{n})$ even if $\Delta_n = O(1)$. For learning with i.i.d. data, recent breakthroughs (Feldman & Vondrak, 2019; Bousquet et al., 2020) in stability analysis show that β-uniformly stable algorithms enjoy generalization bounds of the order $\widetilde{O}(1/\sqrt{n})$ *. This motivates a natural question: can we develop generalization bounds of the order $O(1/\sqrt{n})$ for uniformly stable algorithms applied to mixing processes? This paper provides an affirmative answer to the above question. Our contributions are listed below.

1. We develop a moment bound for weakly dependent random variables defined on a φ-mixing sequence. We show our bound matches the existing moment bounds for i.i.d. random variables up to a logarithmic factor. As a byproduct, we develop a Marcinkiewicz-Zygmund inequality for a φ-mixing sequence, which may be interesting in its own right.
2. We develop high-probability bounds of order $O(1/\sqrt{n})$ for uniformly stable algorithms for learning with ψ-mixing sequences (our results actually require assumptions on φ′-mixing coefficients, which are weaker than assumptions on ψ-mixing coefficients). We achieve this by introducing a different decomposition of generalization errors to make sure we get weakly-dependent and mean-zero random variables, which is more challenging than in the i.i.d. case. Our results recover the existing bounds within a constant factor in the i.i.d. case.
3. We apply our general bound to some specific algorithms to show the effectiveness of our results, including kernel regularization schemes, stochastic gradient descent (SGD) and iterative localization.

The paper is organized as follows. We present the related work in Section 2. We develop concentration inequalities for φ-mixing data in Section 3 and present general stability-based bounds in Section 4. We apply our general result to specific algorithms in Section 5. We conclude the paper in Section 6.
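The arithmetic behind the sub-optimality of the earlier stability bound, discussed in this introduction, can be written out explicitly. The following worked computation is only a sketch: it ignores constants and uses nothing beyond the quantities stated above, namely the stability parameter $\beta \asymp 1/(n\lambda)$ of a λ-strongly convex problem and the choice $\lambda = n^{-\alpha}$.

```latex
\begin{align*}
\sqrt{n}\,\beta + \frac{\Delta_n}{\sqrt{n}}
  &\asymp \sqrt{n}\cdot\frac{1}{n\lambda} + \frac{\Delta_n}{\sqrt{n}}
   = \frac{1}{\sqrt{n}\,\lambda} + \frac{\Delta_n}{\sqrt{n}}
   && \big(\beta \asymp 1/(n\lambda)\big) \\
  &= n^{\alpha-\frac{1}{2}} + \frac{\Delta_n}{\sqrt{n}}
   && \big(\lambda = n^{-\alpha},\ \alpha > 0\big).
\end{align*}
```

Since $\alpha > 0$, the first term $n^{\alpha-\frac{1}{2}}$ decays strictly more slowly than $n^{-\frac{1}{2}}$, whatever the value of $\Delta_n$.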

2. RELATED WORK

In this section, we discuss the related work. We first discuss the related work on algorithmic stability and then the related work on learning with dependent data.

Algorithmic stability. Algorithmic stability measures how the replacement or removal of a single example (or a few examples) affects the output model, which is an important concept in SLT (Bousquet & Elisseeff, 2002). A nice property of algorithmic stability is that it only considers the behavior of the output model and therefore can imply capacity-independent generalization bounds. An important stability measure called uniform stability was introduced in an influential work (Bousquet & Elisseeff, 2002), where it was used to study the generalization behavior of regularization schemes. Uniform stability was extended to the setting of randomized algorithms (Elisseeff et al., 2005), and was further used to study the generalization guarantee of SGD (Hardt et al., 2016). To better exploit the training examples for better generalization bounds, a relaxation of uniform stability called on-average stability has been introduced (Shalev-Shwartz et al., 2010). In particular, on-average stability was shown to be equivalent to learnability (Shalev-Shwartz et al., 2010) and was used to derive data-dependent error bounds (Kuzborskij & Lampert, 2018; Lei & Ying, 2020; Zhou et al., 2021; Li et al., 2020; Nikolakakis et al., 2022). The smoothness assumption for stability analysis of SGD was removed in the papers (Lei & Ying, 2020; Bassily et al., 2020). For nonconvex problems, stability and generalization of learning algorithms that converge to global optima were studied for gradient-dominated problems (Charles & Papailiopoulos, 2018; Lei & Ying, 2021). While most discussions focus on upper bounds on the stability, recent work also develops lower bounds on the stability of SGD (Bassily et al., 2020; Amir et al., 2021). While most stability analyses imply optimal bounds in expectation, recent studies show that uniform stability can imply almost optimal bounds with high probability (Feldman & Vondrak, 2019; Bousquet et al., 2020; Klochkov & Zhivotovskiy, 2021; Yuan & Li, 2022; Li & Liu, 2022). Algorithmic stability has found wide applications in various learning problems, including transfer learning (Kuzborskij & Lampert, 2018), meta-learning (Maurer, 2005), structured prediction (London et al., 2016), hyperparameter optimization (Bao et al., 2021), neural networks (Richards & Kuzborskij, 2021) and adversarial training (Xing et al., 2021).

Learning with dependent data. For learning with dependent data, one generally assumes that the data are drawn from stationary and mixing sequences with the dependence between observations diminishing appropriately over time (Doukhan, 1994; Smale & Zhou, 2009). Initially, generalization bounds were established via a uniform convergence approach based on complexity measures of function classes, such as VC dimension (Yu, 1994), covering numbers (Meir, 2000) and Rademacher complexity (Mohri & Rostamizadeh, 2008). Based on a localization idea and self-bounding loss functions, Steinwart & Christmann (2009) developed fast learning rates for regularized algorithms with geometrically α-mixing data. Ralaivola et al. (2010) and Alquier et al. (2013) established convergence rates under the assumption of stationarity and weak dependence. While most discussions focus on stationary sequences, Kuznetsov & Mohri (2017) used Rademacher complexity to study learning bounds with non-stationary φ-mixing and β-mixing sequences. The first stability analysis of learning with mixing sequences was given in the paper (Mohri & Rostamizadeh, 2010). The stability approach was also used to study online learning with dependent data (Agarwal & Duchi, 2013) and learning with graph-dependent data (Zhang et al., 2019). SGD with Markov sampling has also been recently studied (Sun et al., 2018; Wang et al., 2022).

3. CONCENTRATION INEQUALITIES FOR φ-MIXING SEQUENCES

We consider learning problems with a sequence of dependent observations. We assume the dependency between two observations decays with their gap. There are several concepts to quantify the dependency within a stationary sequence, such as β-mixing, φ-mixing and ψ-mixing (Mohri & Rostamizadeh, 2010; Yu, 1994). We focus on φ-mixing and ψ-mixing sequences in this paper. Let $Z = \{Z_t\}_{t=-\infty}^{\infty}$ be a stationary sequence of random variables. For any $i, j \in \mathbb{N}$, let $\sigma_i^j$ denote the σ-algebra generated by the random variables $Z_k$, $i \le k \le j$.

Definition 1 (φ-Mixing Sequence). For any $k \in \mathbb{N}$, the φ-mixing coefficient of $Z$ is defined as
$$\varphi(k) = \sup_{n,\ A \in \sigma_{n+k}^{\infty},\ B \in \sigma_{-\infty}^{n}} \big|\Pr(A \mid B) - \Pr(A)\big|.$$
$Z$ is said to be φ-mixing if $\varphi(k) \to 0$ as $k \to \infty$. It is said to be algebraically φ-mixing (with degree $r > 0$) if there exists a real number $\varphi_0 > 0$ such that $\varphi(k) \le \varphi_0/k^r$ for all $k$, and exponentially φ-mixing (with degree $r$) if there exist real numbers $\varphi_0, \varphi_1$ such that $\varphi(k) \le \varphi_0 \exp(-\varphi_1 k^r)$ for all $k$.

Definition 2 (ψ-Mixing Sequence (Bradley, 2007)). For any $k \in \mathbb{N}$, the ψ-mixing coefficient of the stochastic process $Z$ is defined as
$$\psi(k) = \sup_{n,\ A \in \sigma_{n+k}^{\infty},\ B \in \sigma_{-\infty}^{n}} \Big|\frac{\Pr(A \cap B)}{\Pr(A)\Pr(B)} - 1\Big|.$$
$Z$ is said to be ψ-mixing if $\psi(k) \to 0$ as $k \to \infty$. It is said to be algebraically ψ-mixing (with degree $r > 0$) if there exists a real number $\psi_0 > 0$ such that $\psi(k) \le \psi_0/k^r$ for all $k$, and exponentially ψ-mixing (with degree $r$) if there exist real numbers $\psi_0, \psi_1$ such that $\psi(k) \le \psi_0 \exp(-\psi_1 k^r)$ for all $k$.

We will use the ψ-mixing coefficient to bound the φ′-mixing coefficient that appears in our stability analysis (Lemma 3). Intuitively speaking, $\varphi(k)$ and $\psi(k)$ measure the dependency of an event on events that happened $k$ units of time earlier. By definition, ψ-mixing is stronger than φ-mixing. Below we provide examples of φ-mixing and ψ-mixing sequences from Kesten & O'Brien (1976).
We first consider random variables $V_n$, $U_n$ and $S_n$ for $n \in \mathbb{Z}$ defined on the probability space $(\Omega, \mathcal{F}, P)$, which are independent of each other and have the following distributions: for all $n \in \mathbb{Z}$, $P(V_n = i) = \beta_i$, $i = 0, 1$ and $0 < \beta_0 < \beta_0 + \beta_1 = 1$; $P(U_n = k) = p_k \ge 0$, $k = 0, 1, \ldots$ and $\sum_{k=0}^{\infty} p_k = 1$; $P(S_n = j) = \gamma_j$, $j = 0, 1, 2$ and $0 < \gamma_0 < \gamma_0 + \gamma_1 < \gamma_0 + \gamma_1 + \gamma_2 = 1$.

Example 1 (Example of φ-mixing sequence). Let $\{f_k, k \ge 1\}$ be a non-increasing sequence such that $f_1 \le 1$, $f_k \to 0$ as $k \to \infty$ and $2\log(1 - f_{k+1}) \ge \log(1 - f_k) + \log(1 - f_{k+2})$ for $\{k : f_k < 1\}$. For any $0 < \epsilon < \frac{1}{2}$ we define $\beta_0 = \epsilon$, $\beta_1 = 1 - \epsilon$. Define $\{p_n\}$ by
$$\sum_{k=0}^{n-1} p_k = \begin{cases} (1 - f_n)(1 - f_{n+1})^{-1} & \text{if } f_n < 1, \\ 0 & \text{if } f_n = 1. \end{cases}$$
Note that $p_k = (1 - f_{k+1})(1 - f_{k+2})^{-1} - (1 - f_k)(1 - f_{k+1})^{-1} \ge 0$ for $\{k : f_k < 1\}$ since $2\log(1 - f_{k+1}) \ge \log(1 - f_k) + \log(1 - f_{k+2})$. Let $X_n = U_n + \frac{1}{2} V_n + \frac{1}{4} W_n$ where $W_n = V_{n - U_n}$. Then $\{X_n, n \in \mathbb{Z}\}$ is a φ-mixing sequence with $(1 - \epsilon) f_k \le \varphi(k) \le f_k$.

Next we provide an example of a ψ-mixing sequence.

Example 2 (Example of ψ-mixing sequence). Let $\{g_k, k \ge 1\}$ be a sequence such that $g_1 - g_2 = 1$, $g_k \to 0$ as $k \to \infty$ and $2 g_{k+1} \le g_k + g_{k+2}$, $k = 1, 2, \ldots$. For any $\epsilon \in (0, 1)$, let $\gamma_2 = \epsilon$, $\gamma_0 = \gamma_1 = \frac{1}{2}(1 - \epsilon)$, $\beta_0 = \beta_1 = \frac{1}{2}$. Define $\{p_n\}$ by $p_0 = 0$, $p_k = g_k - 2 g_{k+1} + g_{k+2}$, $k = 1, 2, \ldots$. Note that $p_k \ge 0$ for $k = 0, 1, \ldots$. Let $U_n$, $V_n$ and $S_n$ be as before. Let $Z_n = S_n \mathbb{I}_{[S_n = 0 \text{ or } 1]} + V_{n - U_n} \mathbb{I}_{[S_n = 2]}$, where $\mathbb{I}$ is the indicator function. Finally we define $X_n = V_n + 2 Z_n$. Then $\{X_n, n \in \mathbb{Z}\}$ is a ψ-mixing sequence with $\epsilon (1 + \epsilon)^{-1} g_k \le \psi(k) \le \exp\big(\epsilon (1 - \epsilon)^{-1} g_k\big) - 1$.

To develop error bounds for learning with φ-mixing sequences, we first develop concentration inequalities for φ-mixing sequences. In the following theorem to be proved in Section A, we derive a tail bound for the summation of dependent random variables in terms of the tail behavior of each individual random variable. Let $\Delta_n = 1 + 2 \sum_{k=1}^{n} \varphi(k)$.
The $L_p$-norm of a real-valued random variable $Z$ is denoted by $\|Z\|_p := \big(\mathbb{E}[|Z|^p]\big)^{\frac{1}{p}}$, $p \ge 1$.

Theorem 1. Let $X_1, \ldots, X_n$ be a finite contiguous subsequence from a φ-mixing sequence. Let $Z_i$ be a function of $X_i$ with $\mathbb{E}[Z_i] = 0$ and $\Pr\{|Z_i| > \varepsilon\} \le 2\exp(-\varepsilon^2/b)$. Then for any $p \ge 1$ we have
$$\Big\|\sum_{i=1}^{n} Z_i\Big\|_p \le (9 + \log(n))\, p\, \Delta_n \sqrt{2nb}.$$

Remark 1. Theorem 1 is an extension of the Marcinkiewicz-Zygmund inequality for independent random variables to φ-mixing sequences. Indeed, if $Z_i$ are i.i.d., it was shown that $\big\|\sum_{i=1}^{n} Z_i\big\|_p = O(p\sqrt{nb})$ (Ren & Liang, 2001). We show how the mixing behavior affects the concentration by including $\Delta_n$ in our bound. In particular, if $Z_i$ are independent then $\Delta_n = 1$ and in this case, our result matches the Marcinkiewicz-Zygmund inequality up to a logarithmic factor. Note $\Delta_n = O(1)$ for algebraically φ-mixing sequences with $r > 1$ and for exponentially φ-mixing sequences (Mohri & Rostamizadeh, 2010). Under the assumption $\sum_{k=1}^{\infty} \varphi^{\frac{1}{2}}(k) < \infty$, it was shown that
$$\Big\|\sum_{i=1}^{n} Z_i\Big\|_p \le C_p \bigg[\Big(\sum_{i=1}^{n} \|Z_i\|_p^p\Big)^{\frac{1}{p}} + \Big(\sum_{i=1}^{n} \|Z_i\|_2^2\Big)^{\frac{1}{2}}\bigg]$$
(Xuejun et al., 2010). This bound requires an assumption involving $\sum_{k=1}^{\infty} \varphi^{\frac{1}{2}}(k)$, which is larger than $\sum_{k=1}^{n} \varphi(k)$ in $\Delta_n$ since $\varphi^{\frac{1}{2}}(k) \ge \varphi(k)$. For example, if $\varphi(k) = O(k^{-1})$ then $\sum_{k=1}^{n} \varphi(k) = O(\log n)$ while $\sum_{k=1}^{n} \varphi^{\frac{1}{2}}(k) = O(\sqrt{n})$. Moreover, the bound in Xuejun et al. (2010) involves a constant $C_p$ which is not explicitly stated. As a comparison, all constants in our bound are explicit.

Remark 2. Our basic idea to prove Theorem 1 is to apply a McDiarmid inequality (Lemma A.1) to a Lipschitz function defined on a φ-mixing sequence. If we define $\Phi'(X_1, \ldots, X_n) = \sum_{i=1}^{n} Z_i$, one cannot guarantee the Lipschitz continuity of $\Phi'$ due to the unboundedness of $Z_i$. Our novelty is to define $\tilde{Z}_i = Z_i \mathbb{I}_{[|Z_i| \le \epsilon]}$ and $\Phi = \sum_{i=1}^{n} \tilde{Z}_i$, where $\mathbb{I}_{[\cdot]}$ is an indicator function and $\epsilon = O\big(\sqrt{b \log(1/\delta)}\big)$. The boundedness of $\tilde{Z}_i$ implies the $(2\epsilon)$-Lipschitz continuity of $\Phi$ and therefore we can apply the McDiarmid inequality to study the decay rate of $\Phi$. Furthermore, the assumption $\Pr\{|Z_i| > \varepsilon\} \le 2\exp(-\varepsilon^2/b)$ shows that $\Phi$ and $\Phi'$ are equal with high probability. We then combine these two observations to derive a high-probability bound for $\Phi'$, which further leads to a bound on the $L_p$-norm of $\Phi'$ by the equivalence between high-probability bounds and $L_p$-norm bounds.

Based on Theorem 1, we develop a moment bound for Lipschitz functions (w.r.t. the Hamming distance) defined on mixing sequences. The following theorem is an extension of a result in the i.i.d. case (Bousquet et al., 2020) to the case with mixing sequences. This result plays a major role in developing our generalization bounds for learning with mixing sequences. The first assumption is a conditional boundedness assumption which is standard for concentration inequalities. The second assumption implies that $g_i$ is of mean zero conditioned on any fixed $Z_{[n] \setminus [i]}$, which is stronger than $\mathbb{E}[g_i] = 0$ since the conditional expectation holds for any fixed $Z_{[n] \setminus [i]}$.
The last assumption implies that $g_i$ is insensitive to the change of any single example, which implies that $g_i$ is concentrated around its expectation by McDiarmid's inequality. Our proof follows from the framework in Bousquet et al. (2020), which is given in Section B.

Theorem 2 (Concentration Inequality for φ-Mixing Sequence). Let $Z_1, \ldots, Z_n$ be a finite contiguous subsequence from a φ-mixing sequence. Denote $Z = \{Z_1, \ldots, Z_n\}$. Let $g_1, \ldots, g_n$ be some functions $g_i : \mathcal{Z}^n \to \mathbb{R}$ such that the following holds for any $i \in [n]$
• $\big|\mathbb{E}_{Z_{[n] \setminus [i]}}[g_i(Z) \mid Z_i]\big| \le M$ almost surely,
• $\mathbb{E}_{Z_i}[g_i(Z) \mid Z_{[n] \setminus [i]}] = 0$ a.s.,
• $g_i$ is β-Lipschitz w.r.t. the Hamming distance.
Then for any $p \ge 1$ we have ($k = \lceil \log_2 n \rceil$)
$$\Big\|\sum_{i=1}^{n} g_i\Big\|_p \le 3 M \Delta_n \sqrt{2pn} + 2^k p \beta \sum_{l=0}^{k-1} (9 + l) \Delta_{2^l}^2.$$

Remark 3. If the sequence is i.i.d., the following bound was developed (Bousquet et al., 2020)
$$\Big\|\sum_{i=1}^{n} g_i\Big\|_p \le 3 M \sqrt{2pn} + 12\sqrt{6}\, p n \beta \log_2 n. \tag{3.1}$$
Our bound recovers the existing result in the i.i.d. case. Note $\Delta_n = 1$ for any $n$ and then Theorem 2 implies $\big\|\sum_{i=1}^{n} g_i\big\|_p \le 3 M \sqrt{2pn} + n p \beta (5 + \log_2 \lceil n \rceil)^2$, which matches Eq. (3.1) up to a logarithmic factor. This is the first extension of the result in Bousquet et al. (2020) to a φ-mixing setting. We follow the framework in Bousquet et al. (2020) to prove Theorem 2. The difference is to replace the Marcinkiewicz-Zygmund inequality for i.i.d. random variables by Theorem 1 for φ-mixing random variables. The basic idea is to use the representation
$$\sum_{i=1}^{n} g_i = \sum_{i=1}^{n} \mathbb{E}[g_i \mid Z_i] + \sum_{l=0}^{k-1} \sum_{i=1}^{n} \big(g_i^l - g_i^{l+1}\big),$$
where $g_i^l$ is the expectation of $g_i$ conditioned on some random variables and $k$ is an integer depending on $n$. We then use a McDiarmid inequality and the conditional boundedness of $\mathbb{E}[g_i \mid Z_i]$ to control $\sum_{i=1}^{n} \mathbb{E}[g_i \mid Z_i]$, and use Theorem 1 to control $\sum_{l=0}^{k-1} \sum_{i=1}^{n} \big(g_i^l - g_i^{l+1}\big)$.
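To get a feel for how the mixing coefficients enter the bounds of Theorem 1 and Theorem 2, the following sketch evaluates $\Delta_n$ and the two right-hand sides numerically. It is an illustration only: the three mixing profiles and their constants (0.5, $r = 2$, $\varphi_1 = 1$) are assumptions chosen for the example, not values taken from the analysis.

```python
import math

def delta_n(phi, n):
    """Delta_n = 1 + 2 * sum_{k=1}^{n} phi(k)."""
    return 1.0 + 2.0 * sum(phi(k) for k in range(1, n + 1))

def ceil_log2(n):
    """k = ceil(log2 n) for an integer n >= 1, computed without floating point."""
    return (n - 1).bit_length()

def theorem1_rhs(n, p, b, phi):
    """(9 + log n) * p * Delta_n * sqrt(2 n b), the right-hand side of Theorem 1."""
    return (9.0 + math.log(n)) * p * delta_n(phi, n) * math.sqrt(2.0 * n * b)

def theorem2_rhs(n, p, M, beta, phi):
    """3 M Delta_n sqrt(2 p n) + 2^k p beta sum_{l<k} (9 + l) Delta_{2^l}^2."""
    k = ceil_log2(n)
    tail = sum((9 + l) * delta_n(phi, 2 ** l) ** 2 for l in range(k))
    return 3.0 * M * delta_n(phi, n) * math.sqrt(2.0 * p * n) + (2 ** k) * p * beta * tail

# Illustrative mixing profiles (hypothetical constants, see the lead-in).
def phi_iid(k):
    return 0.0                      # independent data: all coefficients vanish

def phi_algebraic(k):
    return 0.5 / k ** 2             # phi(k) <= phi_0 / k^r with r = 2

def phi_exponential(k):
    return 0.5 * math.exp(-k)       # phi(k) <= phi_0 * exp(-phi_1 * k^r) with r = 1

if __name__ == "__main__":
    for n in (2 ** 6, 2 ** 10, 2 ** 14):
        deltas = [delta_n(phi, n) for phi in (phi_iid, phi_algebraic, phi_exponential)]
        bound = theorem2_rhs(n, p=2.0, M=1.0, beta=1.0 / n, phi=phi_algebraic)
        print(n, [round(d, 4) for d in deltas], round(bound, 2))
```

For the two decaying profiles, $\Delta_n$ stays bounded as $n$ grows, which is the regime $\Delta_n = O(1)$ mentioned in Remark 1.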

4. STABILITY AND GENERALIZATION

Let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ be a sample space, where $\mathcal{X} \subseteq \mathbb{R}^d$ is an input space and $\mathcal{Y}$ is an output space. We consider supervised learning problems where $S = \{z_1, \ldots, z_n\} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ is a contiguous subsequence from a ψ-mixing sequence. Based on $S$, we wish to find a model $h : \mathcal{X} \to \mathcal{Y}$. We consider parametric models where the model is determined by a parameter $w$ in a parameter space $\mathcal{W}$. The performance of a model $w$ on a single example $z$ can be measured by a loss function $f(w; z)$. The empirical risk and population risk are then defined by
$$F_S(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i) \quad \text{and} \quad F(w) = \mathbb{E}_z[f(w; z)], \tag{4.1}$$
which measure the behavior of $w$ on training examples and test examples, respectively. Here the test point $z$ is assumed to be dependent on $S$ (i.e., $z$ is assumed to follow immediately after the sample $S$), which is the most realistic setting considered in Mohri & Rostamizadeh (2010). We refer to the discrepancy between training and testing as the generalization gap $F(w) - F_S(w)$. In machine learning, we often apply an algorithm to get a model with a small training error. Meanwhile, we wish the output model also admits a small generalization gap to enjoy good generalization to test data. In this paper, we are interested in developing generalization error bounds that decay to zero as the sample size goes to infinity. Our basic tool is algorithmic stability, which measures the sensitivity of the output to the perturbation of a single example. Various concepts of stability have been introduced in the literature. In this paper, we focus on uniform stability, which is arguably the most widely used notion of algorithmic stability. We use $w_S$ to mean the output model if we apply an algorithm $A$ to the dataset $S$. Note we omit the dependency of the notation on $A$, which should be clear from the context. We say two sets are neighboring datasets if they differ by one example.

Definition 3 (Uniform Stability (Bousquet & Elisseeff, 2002)).
A randomized algorithm $A$ is ϵ-uniformly stable if for all neighboring datasets $S, S' \in \mathcal{Z}^n$ we have $\sup_z \big|f(w_S; z) - f(w_{S'}; z)\big| \le \epsilon$.

Our stability analysis requires a different mixing coefficient defined as follows
$$\varphi'(k) = \sup_{n,\ A \in \sigma_{-\infty}^{n-k},\ z_n \in \sigma_n^n,\ B \in \sigma_{n+k}^{\infty}} \big|\Pr(z_n \mid A, B) - \Pr(z_n)\big|. \tag{4.2}$$
It is clear that $\varphi'(k) \ge \varphi(k)$. The following lemma controls $\varphi'(k)$ in terms of the ψ-mixing coefficients. According to the following lemma, one can show that if $\psi(k) = O(k^{-r})$ then $\varphi'(k) = O(k^{-r})$. If $\psi(k) = O(\exp(-\psi_1 k^r))$ then $\varphi'(k) = O(\exp(-\psi_1 k^r))$.

Lemma 3. Let $Z = \{Z_t\}_{t=-\infty}^{\infty}$ be drawn from a ψ-mixing distribution and assume $\psi(k) < 1$. Then
$$\varphi'(k) \le \max\Big\{\frac{(1 + \psi(k))^2}{1 - \psi(k)} - 1,\ 1 - \frac{(1 - \psi(k))^2}{1 + \psi(k)}\Big\}.$$

To apply Theorem 2 to learning with mixing sequences, we need to introduce a sequence of functions $g_i$ satisfying the conditions in Theorem 2 and relate them to the generalization gap. Let $z_i'$ (resp. $z_i''$) be drawn from the same distribution as $z_i$, i.e., the conditional distribution of $z_i'$ (resp. $z_i''$) given $z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n$ is the same as that of $z_i$ given $z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n$. Let $S_{i,b} = \{z_1, \ldots, z_{i-b-1}, z_i, z_{i+b+1}, \ldots, z_{n-b}\}$, i.e., we remove $2b$ points around $z_i$. For any $i \in [n]$, let $S_{i,b}^i = \{z_1, \ldots, z_{i-b-1}, z_i', z_{i+b+1}, \ldots, z_{n-b}\}$. We then define the following random variables
$$g_i = \mathbb{E}_{z_i'}\Big[\mathbb{E}_{z_i''}\big[f(w_{S_{i,b}^i}; z_i'')\big] - f(w_{S_{i,b}^i}; z_i)\Big], \quad \forall i \in [n]. \tag{4.3}$$
The following lemma to be proved in Section C gives generalization bounds in terms of stability and $\sum_{i=1}^{n} g_i$. We will use $\varphi'(b)$ in Theorem 5 and all corollaries in Section 5. The underlying reason is that we need to remove $2b$ points around $z_i$ to get $S_{i,b}$ for the application of Theorem 2. Upper bounding $\big|F(w_S) - \mathbb{E}_{z_i''}[f(w_{S_{i,b}}; z_i'')]\big|$ requires the use of $\varphi'(b)$.

Lemma 4. Let $S$ be drawn from a ψ-mixing distribution. Let $b \in \{0, \ldots, n\}$ denote the number of last points removed in $S$, i.e., $S_b = \{z_1, \ldots, z_{n-b}\}$. Let $w_S$ denote the hypothesis trained on $S$. If the algorithm $A$ is β-uniformly stable and the loss function is bounded by $M > 0$, then the following inequality holds with $g_i$ defined in Eq. (4.3)
$$n\big(F(w_S) - F_S(w_S)\big) \le 2n(3b + 1)\beta + nM\big(\varphi(b) + \varphi'(b)\big) + \sum_{i=1}^{n} g_i.$$

We can apply Theorem 2 to control the term $\sum_{i=1}^{n} g_i$ and derive the following generalization bounds in terms of mixing coefficients.

Theorem 5 (General Mixing Stability Bound). Let $w_S$ denote the hypothesis returned by a β-uniformly stable algorithm trained on a sample $S$ drawn from a ψ-mixing stationary distribution. Let $M$ denote the uniform bound of the loss function. Then for any $b \in \{0, \ldots, n\}$ and any $\delta \in (0, 1)$, the following inequality holds with probability at least $1 - \delta$ ($k = \lceil \log_2 n \rceil$)
$$F(w_S) - F_S(w_S) \le 2(3b + 1)\beta + M\big(\varphi(b) + \varphi'(b)\big) + 3eM\Delta_n \sqrt{\frac{2\log(e/\delta)}{n}} + \frac{2^{k+1} e \beta \log(e/\delta)}{n} \sum_{l=0}^{k-1} (9 + l) \Delta_{2^l}^2.$$

Remark 4. For φ-mixing sequences, the following high-probability bounds were developed (Mohri & Rostamizadeh, 2010)
$$F(w_S) - F_S(w_S) = O\Big(\sqrt{\log(1/\delta)}\, \Delta_n \big(\sqrt{n}(b + 1)\beta + \sqrt{n}\varphi(b) + n^{-\frac{1}{2}}\big)\Big). \tag{4.4}$$
As a comparison, our generalization bound in Theorem 5 becomes
$$F(w_S) - F_S(w_S) = O\Big(\varphi'(b) + \Delta_n \sqrt{n^{-1}\log(1/\delta)} + \beta\big(b + \Delta_n^2 \log^2 n \log(1/\delta)\big)\Big). \tag{4.5}$$
Eq. (4.5) improves Eq. (4.4) as follows: (1) we replace $\sqrt{n}\varphi(b)$ with $\varphi'(b)$; (2) we replace $\sqrt{n} b \beta \Delta_n$ with $\beta(b + \Delta_n^2 \log^2 n)$. Both replacements save a factor of $\sqrt{n}$. Meanwhile, it should also be mentioned that our bounds involve $\varphi'$, while the bounds in Mohri & Rostamizadeh (2010) involve $\varphi$. We now compare these two bounds under assumptions on $\varphi'$ in three cases (results in Mohri & Rostamizadeh (2010) hold for φ-mixing coefficients and also hold for φ′-mixing coefficients).

1. In the i.i.d.
case, Eq. (4.5) becomes
$$F(w_S) - F_S(w_S) = O\Big(\sqrt{n^{-1}\log(1/\delta)} + \beta \log^2 n \log(1/\delta)\Big),$$
which matches existing stability-based bounds (Bousquet et al., 2020) up to a logarithmic factor. As a comparison, Eq. (4.4) becomes $F(w_S) - F_S(w_S) = O\big(\sqrt{\log(1/\delta)}\,(\sqrt{n}\beta + 1/\sqrt{n})\big)$.

2. We now consider algebraically mixing sequences, i.e., $\varphi'(k) \le \varphi_0 k^{-r}$ with $r > 1$. In this case, we know $\Delta_n = O(1)$ (Mohri & Rostamizadeh, 2010). If we choose $b \asymp \beta^{-\frac{1}{r+1}}$ in Eq. (4.5), we get $\beta b \asymp b^{-r} \asymp \beta^{\frac{r}{r+1}}$ and therefore (we denote $B \asymp \tilde{B}$ if there are absolute constants $c_1$ and $c_2$ such that $c_1 B \le \tilde{B} \le c_2 B$)
$$F(w_S) - F_S(w_S) = O\Big(\beta^{\frac{r}{r+1}} + \sqrt{n^{-1}\log(1/\delta)} + \beta \log^2 n \log(1/\delta)\Big). \tag{4.6}$$
As a comparison, the optimal choice $b \asymp \beta^{-\frac{1}{r+1}}$ in Eq. (4.4) implies
$$F(w_S) - F_S(w_S) = O\Big(\sqrt{\log(1/\delta)}\,\big(\sqrt{n}\beta^{\frac{r}{r+1}} + n^{-\frac{1}{2}}\big)\Big).$$

3. For exponentially mixing sequences, i.e., $\varphi'(k) \le \varphi_0 \exp(-\varphi_1 k^r)$, Eq. (4.4) implies
$$F(w_S) - F_S(w_S) = O\Big(\sqrt{\log(1/\delta)}\,\big(\sqrt{n}\beta \log^{\frac{1}{r}}(1/\beta) + \sqrt{n}\beta + 1/\sqrt{n}\big)\Big).$$

It is clear that our analysis removes a factor of $\sqrt{n}$ in front of the stability parameter β.

Remark 5. The analysis in the mixing case is more challenging than that in the i.i.d. case. The analysis in Bousquet et al. (2020) introduces $\tilde{g}_i = \mathbb{E}_{z_i'} \mathbb{E}_{z_i}\big[f(w_{S^i}; z) - f(w_{S^i}; z_i)\big]$, where $z$ is independently drawn from the stationary distribution and $S^i = \{z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n\}$. While $\mathbb{E}_{z_i}[\tilde{g}_i] = 0$ in the i.i.d. case, we cannot guarantee $\mathbb{E}_{z_i}[\tilde{g}_i] = 0$ in the mixing case due to the dependency between $z_i$ and $z_j$ ($j \ne i$). Hence, one cannot apply Theorem 2 to $\tilde{g}_i$ in the mixing case. We use a much more complicated decomposition of the generalization error in terms of $g_i$ in Eq. (4.3), which is of mean zero in the mixing case. In this process, we introduce concepts such as $S_{i,b}$ and $S_{i,b}^i$ to fully exploit the stability and mixing property. It should be mentioned that $S_{i,b}$ and $S_{i,b}^i$ have been introduced in Mohri & Rostamizadeh (2010). However, the aim is different.
These concepts are used in Mohri & Rostamizadeh (2010) to get bounds in expectation for $\Phi(S) := F(w_S) - F_S(w_S)$ via a lemma in Yu (1994) for β-mixing sequences, where the concentration of $\Phi(S)$ around its expectation can be directly studied via the McDiarmid inequality for φ-mixing sequences. As a comparison, our aim is to replace the sequence $\tilde{g}_i = \mathbb{E}_{z_i'} \mathbb{E}_{z_i}\big[f(w_{S^i}; z) - f(w_{S^i}; z_i)\big]$ (with non-zero conditional mean) by the sequence $g_i$ in Eq. (4.3) with zero conditional mean, which is then controlled by our new concentration inequality for φ-mixing sequences (Theorem 2).
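The interplay between the gap $b$, the stability parameter β and the mixing coefficients in Theorem 5 can be explored numerically. The sketch below evaluates the Lemma 3 bound on $\varphi'(k)$ and the right-hand side of Theorem 5. All numerical choices are assumptions made for illustration: the profile $\psi(k) = 0.5\,k^{-3}$, the stability level $\beta = 1/n$, and the use of $\psi(k)$ as a stand-in upper bound for $\varphi(k)$.

```python
import math

def phi_prime_bound(psi_k):
    """Lemma 3: upper bound on phi'(k) in terms of psi(k), for 0 <= psi(k) < 1."""
    if not 0.0 <= psi_k < 1.0:
        raise ValueError("Lemma 3 assumes 0 <= psi(k) < 1")
    upper = (1.0 + psi_k) ** 2 / (1.0 - psi_k) - 1.0
    lower = 1.0 - (1.0 - psi_k) ** 2 / (1.0 + psi_k)
    return max(upper, lower)

def delta_n(phi, n):
    """Delta_n = 1 + 2 * sum_{k=1}^{n} phi(k)."""
    return 1.0 + 2.0 * sum(phi(k) for k in range(1, n + 1))

def theorem5_rhs(n, b, beta, M, delta, phi, phi_prime):
    """Right-hand side of Theorem 5 with k = ceil(log2 n)."""
    k = (n - 1).bit_length()                      # ceil(log2 n) for integer n >= 1
    log_term = math.log(math.e / delta)           # log(e / delta)
    tail = sum((9 + l) * delta_n(phi, 2 ** l) ** 2 for l in range(k))
    return (2.0 * (3 * b + 1) * beta
            + M * (phi(b) + phi_prime(b))
            + 3.0 * math.e * M * delta_n(phi, n) * math.sqrt(2.0 * log_term / n)
            + (2 ** (k + 1)) * math.e * beta * log_term / n * tail)

if __name__ == "__main__":
    # Hypothetical algebraically decaying psi-mixing profile (only used for k >= 1).
    psi = lambda k: 0.5 / k ** 3
    phi = psi                                     # assumption: phi(k) replaced by psi(k)
    phi_prime = lambda k: phi_prime_bound(psi(k))
    n, M, delta = 4096, 1.0, 0.05
    beta = 1.0 / n                                # assumed stability level
    best_b = min(range(1, 65), key=lambda b: theorem5_rhs(n, b, beta, M, delta, phi, phi_prime))
    print(best_b, theorem5_rhs(n, best_b, beta, M, delta, phi, phi_prime))
```

Scanning over $b$ makes the trade-off in Remark 4 visible: a larger gap shrinks the mixing terms $\varphi(b) + \varphi'(b)$ but inflates the stability term $2(3b + 1)\beta$.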

5. APPLICATIONS

We now present applications of Theorem 5 to several algorithms, including the kernel regularization algorithm, SGD and localized iterative regularization. Let $A$ be an algorithm which outputs a model $A(S)$ after observing the dataset $S$. We are interested in the excess population risk of a model $A(S)$ defined by $F(A(S)) - F(w^*)$, which measures the performance of the output model $A(S)$ as compared to the best model $w^* = \arg\min_w F(w)$. An efficient approach to this aim is based on the following error decomposition (Bousquet & Bottou, 2008)
$$F(A(S)) - F(w^*) = F(A(S)) - F_S(A(S)) + F_S(A(S)) - F_S(w^*) + F_S(w^*) - F(w^*), \tag{5.1}$$
where we refer to $F(A(S)) - F_S(A(S))$ as the generalization gap and $F_S(A(S)) - F_S(w^*)$ as the optimization error. The last term $F_S(w^*) - F(w^*)$ is easy to control since $w^*$ is independent of $S$. We will apply stability analysis to control the generalization gap, and tools in optimization theory to control the optimization error. To this aim, we give some necessary definitions. The Lipschitz condition means the gradient is bounded, and the smoothness means the gradient is Lipschitz continuous. Examples of Lipschitz loss functions include the hinge loss, logistic loss and absolute loss. Examples of Lipschitz and smooth loss functions include the logistic loss and Huber loss.

Definition 4 (Lipschitz, smoothness and convexity). Let $L, \gamma > 0$ and $\mu \ge 0$. Let $f : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$.
• We say $f$ is $L$-Lipschitz continuous if $|f(w; z) - f(w'; z)| \le L\|w - w'\|$ for any $w, w', z$.
• We say $f$ is γ-smooth if $\|\nabla f(w; z) - \nabla f(w'; z)\| \le \gamma\|w - w'\|$ for any $w, w', z$.
• We say $f$ is µ-strongly convex if $f(w; z) - f(w'; z) - \langle w - w', \nabla f(w'; z)\rangle \ge \frac{\mu}{2}\|w - w'\|^2$ for any $w, w', z$. We say $f$ is convex if it is µ-strongly convex with $\mu = 0$.
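The three conditions of Definition 4 can be checked numerically on a concrete loss. The sketch below does so for the logistic loss $f(w; (x, y)) = \log(1 + \exp(-y\langle w, x\rangle))$ with labels $y \in \{-1, +1\}$, for which $L = \|x\|$ and $\gamma = \|x\|^2/4$ are valid constants for a fixed example. The helper names, the sampling range of the test points and the tolerance are assumptions of this illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def logistic_loss(w, z):
    x, y = z
    m = y * dot(w, x)
    # numerically stable evaluation of log(1 + exp(-m))
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))

def logistic_grad(w, z):
    x, y = z
    m = y * dot(w, x)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), evaluated without overflow
    if m >= 0:
        e = math.exp(-m)
        s = -e / (1.0 + e)
    else:
        s = -1.0 / (1.0 + math.exp(m))
    return [s * y * xi for xi in x]

def check_definition4(z, dim, trials=1000, seed=0, tol=1e-9):
    """Test the Lipschitz, smoothness and convexity inequalities on random pairs (w, w')."""
    rng = random.Random(seed)
    x, _ = z
    L = norm(x)                  # Lipschitz constant of w -> f(w; z)
    gamma = norm(x) ** 2 / 4.0   # smoothness constant of w -> f(w; z)
    for _ in range(trials):
        w = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        v = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        diff = [a - b for a, b in zip(w, v)]
        d = norm(diff)
        gw, gv = logistic_grad(w, z), logistic_grad(v, z)
        gap = logistic_loss(w, z) - logistic_loss(v, z)
        lipschitz = abs(gap) <= L * d + tol
        smooth = norm([a - b for a, b in zip(gw, gv)]) <= gamma * d + tol
        convex = gap - dot(diff, gv) >= -tol      # mu = 0 in Definition 4
        if not (lipschitz and smooth and convex):
            return False
    return True

if __name__ == "__main__":
    print(check_definition4(([0.6, -0.8], 1), dim=2))
```

A random check of this kind cannot prove the inequalities, but it is a quick way to catch a wrong constant before plugging it into a step-size condition such as $\eta_t \le 2/\gamma$.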

5.1. KERNEL REGULARIZATION SCHEMES

We first consider kernel regularization schemes with convex and Lipschitz loss functions. Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel (i.e., $K$ is symmetric and positive definite) and $\mathcal{W}$ be the associated reproducing kernel Hilbert space with the norm $\|\cdot\|_K$. We consider the following model
$$w_{S,\lambda} = \arg\min_{w \in \mathcal{W}} F_S(w) + \lambda\|w\|_K^2, \tag{5.2}$$
where $\lambda > 0$ is a regularization parameter that trades off the data-fitting term $F_S$ against the regularizer $\|w\|_K^2$. The following corollary gives high-probability excess population risk bounds on kernel regularization. The proof is given in Section D.1.

Corollary 6. Let $w_{S,\lambda}$ denote the hypothesis returned by Eq. (5.2) when trained on a sample $S$ drawn from a ψ-mixing stationary distribution. Assume $f$ is convex, $L$-Lipschitz and bounded by $M > 0$. Then, with probability at least $1 - \delta$, the following excess risk bound holds ($k = \lceil \log_2 n \rceil$)
$$F(w_{S,\lambda}) - F(w^*) = O\bigg(\frac{\Delta_n \log^{\frac{1}{2}}(1/\delta)}{\sqrt{n}} + \frac{b}{n\lambda} + M\varphi'(b) + \frac{\sum_{l=0}^{k-1} l \Delta_{2^l}^2 \log(1/\delta)}{n\lambda} + \lambda\|w^*\|_K^2\bigg).$$

Remark 6. We now instantiate the above bounds under special mixing sequences. We first consider the algebraically mixing sequence, i.e., $\varphi'(k) \le \varphi_0 k^{-r}$ with $r > 1$. In this case, analysis similar to Remark 4 implies the following bound with an appropriate choice of $b$
$$F(w_{S,\lambda}) - F(w^*) = O\bigg(n^{-\frac{1}{2}} \log^{\frac{1}{2}}(1/\delta) + (n\lambda)^{-\frac{r}{r+1}} + \frac{\log^2 n \log(1/\delta)}{n\lambda} + \lambda\|w^*\|_K^2\bigg).$$
If $\|w^*\|_K = O(1)$, then we choose $\lambda \asymp 1/\sqrt{n}$ and get
$$F(w_{S,\lambda}) - F(w^*) = O\Big(n^{-\frac{1}{2}} \log^{\frac{1}{2}}(1/\delta) + n^{-\frac{r}{2(r+1)}} \log^2 n \log(1/\delta)\Big).$$
We now consider the exponential mixing case, i.e., $\varphi'(k) \le \varphi_0 \exp(-\varphi_1 k^r)$. In this case, analysis similar to Remark 4 implies
$$F(w_{S,\lambda}) - F(w^*) = O\Big(\sqrt{n^{-1}\log(1/\delta)} + (n\lambda)^{-1} \log^2 n \log(1/\delta) + (n\lambda)^{-1} \log^{\frac{1}{r}}(n\lambda) + \lambda\|w^*\|_K^2\Big).$$
If $\|w^*\|_K = O(1)$, then we can choose $\lambda \asymp 1/\sqrt{n}$ to derive
$$F(w_{S,\lambda}) - F(w^*) = O\Big(n^{-\frac{1}{2}}\big(\sqrt{\log(1/\delta)} + \log^2 n \log(1/\delta) + \log^{\frac{1}{r}}(n)\big)\Big).$$
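The choices of $b$ and λ in the algebraically mixing case of Remark 6 can be traced step by step. The following derivation is a sketch that ignores constants (including $M$ and $\varphi_0$) and assumes, as in the discussion of strongly convex problems in the introduction, that the stability parameter of Eq. (5.2) scales as $\beta \asymp 1/(n\lambda)$.

```latex
\begin{align*}
\frac{b}{n\lambda} \asymp b^{-r}
  &\iff b \asymp (n\lambda)^{\frac{1}{r+1}}
  \quad\Longrightarrow\quad
  \frac{b}{n\lambda} + M\varphi'(b) = O\big((n\lambda)^{-\frac{r}{r+1}}\big), \\
\lambda \asymp n^{-\frac{1}{2}}
  &\quad\Longrightarrow\quad
  (n\lambda)^{-\frac{r}{r+1}} \asymp n^{-\frac{r}{2(r+1)}}, \qquad
  \frac{\log^2 n \,\log(1/\delta)}{n\lambda} \asymp n^{-\frac{1}{2}} \log^2 n \,\log(1/\delta), \qquad
  \lambda \|w^*\|_K^2 = O\big(n^{-\frac{1}{2}}\big).
\end{align*}
```

Since $\frac{r}{2(r+1)} < \frac{1}{2}$ for every $r > 0$, the polynomial factor $n^{-\frac{r}{2(r+1)}}$ is the slowest to decay, which is why it appears in the final rate of Remark 6.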

5.2. STOCHASTIC GRADIENT DESCENT

We apply our generalization bounds to SGD with convex and smooth loss functions, which has wide applications in training complex models in the big-data era due to its simplicity and efficiency.

Definition 5 (Stochastic Gradient Descent). Let $w_1 = 0 \in \mathbb{R}^d$ be an initial point and $\{\eta_t\}_t$ be a sequence of positive step sizes. SGD updates models by

$$w_{t+1} = w_t - \eta_t \nabla f(w_t; z_{i_t}),$$

where $\nabla f(w_t; z_{i_t})$ denotes a gradient of $f$ w.r.t. the first argument and $i_t$ is independently drawn from the uniform distribution over $[n] = \{1, \ldots, n\}$. We assume the algorithm produces $w_S = \frac{1}{T}\sum_{t=1}^{T} w_t$, which is an average of SGD iterates.

We first present the generalization error bounds. The proofs are given in Section D.2.

Corollary 7 (Generalization bound). Assume that the loss function $f(\cdot; z)$ is $\gamma$-smooth, convex, $L$-Lipschitz and bounded by $M > 0$ for every $z$. Suppose that we run SGD with step sizes $\eta_t \le \min(2/\gamma, \eta)$ for $T \asymp n$ steps on a sample $S$ drawn from a ψ-mixing stationary distribution. Then, with probability at least $1 - \delta$ we have ($k = \lceil \log_2 n \rceil$)

$$F(w_S) - F_S(w_S) = O\Big(\eta b \log(1/\delta) + \eta \sum_{l=0}^{k-1} l\,\Delta_{2^l}^2 \log^2(1/\delta) + M\varphi'(b) + \Delta_n\sqrt{\frac{\log(1/\delta)}{n}}\Big).$$

As a corollary, we develop the following excess risk bounds.

Corollary 8 (Excess risk bound). Assume that the loss function $f(\cdot; z)$ is $\gamma$-smooth, convex, $L$-Lipschitz and bounded by $M > 0$ for every $z$. Suppose that we run SGD with step sizes $\eta_t = \eta \asymp 1/\sqrt{T}$ for $T \asymp n$ steps on a sample $S$ drawn from a ψ-mixing stationary distribution. Then, with probability at least $1 - \delta$ we have ($k = \lceil \log_2 n \rceil$)

$$F(w_S) - F(w^*) = O\Big(n^{-1/2} b \log(1/\delta) + n^{-1/2}\sum_{l=0}^{k-1} l\,\Delta_{2^l}^2 \log^2(1/\delta) + M\varphi'(b) + \frac{\Delta_n \log^{1/2}(1/\delta) + \log^{3/2}(n/\delta)}{\sqrt{n}}\Big).$$
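The SGD scheme of Definition 5, with the step size and iteration count of Corollary 8 ($\eta \asymp 1/\sqrt{T}$, $T \asymp n$), can be sketched as follows. The logistic loss, the Gaussian AR(1) feature sequence and all function names are assumptions chosen for this example. The AR(1) sequence serves only as a simple instance of temporally dependent data; it is not claimed to satisfy the ψ-mixing assumption.

```python
import numpy as np

def ar1_features(n, dim, rho, rng):
    """Stationary Gaussian AR(1) sequence: x_t = rho * x_{t-1} + sqrt(1 - rho^2) * noise_t."""
    x = np.empty((n, dim))
    x[0] = rng.normal(size=dim)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1.0 - rho ** 2) * rng.normal(size=dim)
    return x

def logistic_grad(w, x, y):
    """Gradient of log(1 + exp(-y <w, x>)) with respect to w."""
    margin = y * np.dot(w, x)
    return -y * x * 0.5 * (1.0 - np.tanh(0.5 * margin))

def averaged_sgd(X, y, num_steps, step_size, rng):
    """Run SGD from w_1 = 0 with uniformly sampled indices and return the average of w_1..w_T."""
    n, dim = X.shape
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(num_steps):
        w_sum += w                      # accumulate w_t before the update
        i = rng.integers(n)             # i_t drawn uniformly from the n examples
        w = w - step_size * logistic_grad(w, X[i], y[i])
    return w_sum / num_steps

def empirical_risk(w, X, y):
    """F_S(w) for the logistic loss."""
    return float(np.mean(np.logaddexp(0.0, -y * (X @ w))))

rng = np.random.default_rng(0)
n, dim = 2000, 5
X = ar1_features(n, dim, rho=0.5, rng=rng)
y = np.where(X @ np.ones(dim) >= 0, 1.0, -1.0)   # labels from a hypothetical linear rule
w_S = averaged_sgd(X, y, num_steps=n, step_size=1.0 / np.sqrt(n), rng=rng)
```

The randomness of the index sequence $i_t$ is independent of the data, which is why the optimization error analysis in Section D.2 does not depend on the mixing property.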

5.3. ITERATIVE LOCALIZED ALGORITHM

We now turn to convex and non-smooth problems. In this case, SGD requires a very small step size to enjoy good stability, which however would affect the optimization process. Indeed, a tradeoff between generalization and optimization requires running SGD with $O(n^2)$ iterations, which is not computationally efficient (Lei & Ying, 2020; Bassily et al., 2020). To speed up the algorithm, we consider an iterative localization scheme (Algorithm 1 is deferred to Section D.3), which was introduced in Feldman et al. (2020). The basic idea of Algorithm 1 is to implement the optimization in epochs. At each epoch, Algorithm 1 builds an objective function with a regularizer depending on the output of the previous epoch, which is solved by SGD with $T_i$ iterations and learning rates $\{\eta_t\}$. The following corollary gives error bounds for Algorithm 1. The proof is given in Section D.3.

Corollary 9. Assume that the loss function $f(\cdot; z)$ is convex, $L$-Lipschitz and bounded by $M > 0$ for every $z$. We run Algorithm 1 on samples $S_i$, $i \in [m]$, drawn from a ψ-mixing stationary distribution. If we choose $\gamma \asymp n^{-1/2}$, then, with probability at least $1 - \delta$

$$F(w_m) - F(w^*) = O\Big(n^{-1/2}\log n\,\Delta_n \log^{1/2}(n/\delta) + b n^{-1/2} + M \log n\,\varphi'(b) + n^{-1/2}\sum_{l=0}^{k-1} l\,\Delta_{2^l}^2 \log(n/\delta)\Big),$$

where $k = \lceil \log_2 n \rceil$. Moreover, Algorithm 1 requires only $O(n \log n)$ gradient computations to achieve this generalization bound.

6. CONCLUSIONS

We develop the first high-probability stability-based generalization bounds of the order O(1/√n) for learning with a mixing sequence. We apply our results to several specific algorithms such as regularization schemes, SGD and localized iterative regularization. Our analysis relies on a new moment bound for weakly-dependent random variables defined on a mixing sequence. Our generalization bounds involve φ′-mixing coefficients, which are larger than the φ-mixing coefficients. It would be very interesting to investigate whether these φ′-mixing coefficients can be replaced by φ-mixing coefficients. We conjecture that φ′-mixing is closer to φ-mixing than to ψ-mixing, since both the φ- and φ′-coefficients measure the difference between a conditional probability and a probability (i.e., of the form |Pr(A|B) - Pr(A)|). As a comparison, ψ-mixing considers the difference between 1 and the ratio of probabilities (i.e., of the form |1 - Pr(A ∩ B)/(Pr(A)Pr(B))|).

The following lemma shows the equivalence of tails and moments (Bousquet et al., 2020).

Lemma A.2. Let $Y$ be a random variable. If for any $\delta \in (0, 1)$, with probability at least $1 - \delta$

$$|Y| \le a\sqrt{\log(e/\delta)} + b\log(e/\delta),$$

then for any $p \ge 1$ it holds that $\|Y\|_p \le 3\sqrt{p}\,a + 9pb$. If $\|Y\|_p \le \sqrt{p}\,a + pb$ for any $p \ge 1$, then for any $\delta \in (0, 1)$ we have with probability at least $1 - \delta$

$$|Y| \le e\big(a\sqrt{\log(e/\delta)} + b\log(e/\delta)\big).$$

Proof of Theorem 1. Let $\epsilon > 0$ be a number to be fixed later and define $\tilde{Z}_i = Z_i\,\mathbb{I}[|Z_i| \le \epsilon]$ for all $i \in [n]$, where $\mathbb{I}[\cdot]$ is the indicator function (1 if the argument is true and 0 otherwise). Define $\Phi(X_1, \ldots, X_n) = \sum_{i=1}^{n} \tilde{Z}_i$. First we show the Lipschitz continuity of $\Phi$ w.r.t. the Hamming distance. Suppose we replace $X_1$ by $X_1'$ and keep the other $X_i$. Define $\tilde{Z}_1', \ldots, \tilde{Z}_n'$ similarly to $\tilde{Z}_1, \ldots, \tilde{Z}_n$ but as functions of $X_1', X_2, \ldots, X_n$. Since $\tilde{Z}_j$ is a function of $X_j$ (i.e., once we know $X_j$ we know $\tilde{Z}_j$), we know $\tilde{Z}_j = \tilde{Z}_j'$ for all $j \ne 1$. Then

$$\big|\Phi(X_1, \ldots, X_n) - \Phi(X_1', X_2, \ldots, X_n)\big| = \big|(\tilde{Z}_1 + \cdots + \tilde{Z}_n) - (\tilde{Z}_1' + \tilde{Z}_2 + \cdots + \tilde{Z}_n)\big| = |\tilde{Z}_1 - \tilde{Z}_1'| \le 2\epsilon.$$

We fix $\epsilon = \sqrt{b\log(en/\delta)}$. We now assume that Eq. (A.1) and Eq. (A.2) hold for all $i \in [n]$, which happens with probability at least $1 - \delta$. Under this event, we have $\tilde{Z}_i = Z_i$ and (for simplicity we assume $n \ge 2/(e-2)$)

$$\Big|\sum_{i=1}^{n} Z_i\Big| = \Big|\sum_{i=1}^{n} \tilde{Z}_i\Big| \le \Delta_n\sqrt{2nb\log(en/\delta)\log\big(2e/((e-2)\delta)\big)} \le \sqrt{2nb}\,\Delta_n\log(en/\delta).$$

The following inequality then holds with probability at least $1 - \delta$

$$\Big|\sum_{i=1}^{n} Z_i\Big| - \sqrt{2nb}\,\Delta_n\log(n) \le \sqrt{2nb}\,\Delta_n\log(e/\delta).$$

According to Lemma A.2, for any $p \ge 1$ it holds that

$$\Big\|\,\Big|\sum_{i=1}^{n} Z_i\Big| - \sqrt{2nb}\,\Delta_n\log(n)\Big\|_p \le 9p\sqrt{2nb}\,\Delta_n.$$

The stated bound follows directly. The proof is completed.

B PROOF OF THEOREM 2

We follow the framework in Bousquet et al. (2020) to prove Theorem 2.

Proof of Theorem 2. Without loss of generality, we assume $n = 2^k$. Consider a sequence of partitions $\mathcal{B}_0, \ldots, \mathcal{B}_k$ of $[n] = \{1, 2, \ldots, 2^k\}$ with $\mathcal{B}_k = \{[n]\}$. We then obtain $\mathcal{B}_l$ from $\mathcal{B}_{l+1}$ by splitting each subset in $\mathcal{B}_{l+1}$ into two equal parts. In this way, we get

$$\mathcal{B}_0 = \{\{1\}, \{2\}, \ldots, \{2^k\}\}, \quad \mathcal{B}_1 = \{\{1, 2\}, \{3, 4\}, \ldots, \{2^k - 1, 2^k\}\}, \quad \ldots, \quad \mathcal{B}_k = \{[n]\}.$$

For each $i \in [n]$ and $l = 0, 1, \ldots, k$, denote by $B_l(i) \in \mathcal{B}_l$ the only set from $\mathcal{B}_l$ that contains $i$. In particular, $B_0(i) = \{i\}$ and $B_k(i) = [n]$. For each $i \in [n]$ and each $l = 0, 1, \ldots, k$, introduce the random variables

$$g_i^l = g_i^l\big(Z_i, Z_{[n]\setminus B_l(i)}\big) = \mathbb{E}\big[g_i \mid Z_i, Z_{[n]\setminus B_l(i)}\big].$$

That is, we condition on $Z_i$ and all the variables that are not in the same set as $Z_i$ in the partition $\mathcal{B}_l$. One can check that $g_i^0 = g_i$ and $g_i^k = \mathbb{E}[g_i \mid Z_i]$. We can write a telescopic sum for each $i \in [n]$

$$g_i = \mathbb{E}[g_i \mid Z_i] + \sum_{l=0}^{k-1}\big(g_i^l - g_i^{l+1}\big).$$

It then follows from the triangle inequality that

$$\Big\|\sum_{i=1}^{n} g_i\Big\|_p \le \Big\|\sum_{i=1}^{n}\mathbb{E}[g_i \mid Z_i]\Big\|_p + \sum_{l=0}^{k-1}\Big\|\sum_{i=1}^{n}\big(g_i^l - g_i^{l+1}\big)\Big\|_p. \tag{B.1}$$

Since $|\mathbb{E}[g_i \mid Z_i]| \le M$, one can check that $\Phi(Z_1, \ldots, Z_n) = \sum_{i=1}^{n}\mathbb{E}[g_i \mid Z_i]$ is $2M$-Lipschitz w.r.t. the Hamming distance. Applying Lemma A.1 with $c = 2M$ gives, with probability at least $1 - \delta$,

$$\Big|\sum_{i=1}^{n}\mathbb{E}[g_i \mid Z_i]\Big| \le M\Delta_n\sqrt{2n\log(2/\delta)}.$$

It then follows from Lemma A.2 that

$$\Big\|\sum_{i=1}^{n}\mathbb{E}[g_i \mid Z_i]\Big\|_p \le 3M\Delta_n\sqrt{2pn}. \tag{B.2}$$

According to the definition of $g_i^l$, one can see that $\mathbb{E}_{Z_{B_{l+1}(i)\setminus B_l(i)}}[g_i^l] = g_i^{l+1}$. Furthermore, according to our assumption we know $g_i^l$ as a function of $Z_j$, $j \in B_{l+1}(i)\setminus B_l(i)$, satisfies the $\beta$-Lipschitz continuity w.r.t. the Hamming distance. Therefore, one can apply Lemma A.1 with $c = \beta$ and $\Phi = g_i^l$ to derive the following inequality (there are $2^l$ random variables)

$$\Pr\big\{|g_i^l - g_i^{l+1}| \ge \varepsilon\big\} \le 2\exp\Big(-\frac{2\varepsilon^2}{\beta^2 \cdot 2^l\,\Delta_{2^l}^2}\Big), \tag{B.3}$$

where the probability is w.r.t. $Z_{B_{l+1}(i)\setminus B_l(i)}$. Let us consider the sum $\sum_{i \in B}(g_i^l - g_i^{l+1})$ for any block $B \in \mathcal{B}_l$. Note $Z_i' := g_i^l - g_i^{l+1}$ is a function of $Z_i$ and $Z_{[n]\setminus B}$. We now condition on $Z_{[n]\setminus B}$ and then $Z_i'$ is a function of $Z_i$. According to Eq. (B.3), we can apply Theorem 1 with $b = 2^{l-1}\beta^2\Delta_{2^l}^2$ to derive the following inequality

$$\Big(\mathbb{E}\Big[\Big|\sum_{i \in B}\big(g_i^l - g_i^{l+1}\big)\Big|^p \,\Big|\, Z_{[n]\setminus B}\Big]\Big)^{1/p} \le (9 + \log|B|)\,p\sqrt{2^l|B|\beta^2\Delta_{2^l}^2\Delta_{|B|}^2} = (9 + l)\,p\sqrt{2^{2l}\beta^2\Delta_{2^l}^4} = (9 + l)\,p\,2^l\beta\Delta_{2^l}^2.$$

We now take integration w.r.t. $Z_{[n]\setminus B}$ and get

$$\Big\|\sum_{i \in B}\big(g_i^l - g_i^{l+1}\big)\Big\|_p \le (9 + l)\,p\,2^l\beta\Delta_{2^l}^2.$$

According to the triangle inequality, we further get

$$\Big\|\sum_{i \in [n]}\big(g_i^l - g_i^{l+1}\big)\Big\|_p \le \sum_{B \in \mathcal{B}_l}\Big\|\sum_{i \in B}\big(g_i^l - g_i^{l+1}\big)\Big\|_p \le 2^{k-l}\cdot(9 + l)\,p\,2^l\beta\Delta_{2^l}^2 = (9 + l)\,2^k p\beta\Delta_{2^l}^2,$$

where we have used the fact that $|\mathcal{B}_l| = 2^{k-l}$. It follows that

$$\sum_{l=0}^{k-1}\Big\|\sum_{i=1}^{n}\big(g_i^l - g_i^{l+1}\big)\Big\|_p \le 2^k p\beta\sum_{l=0}^{k-1}(9 + l)\Delta_{2^l}^2.$$

We can plug Eq. (B.2) and the above inequality back into Eq. (B.1) to derive

$$\Big\|\sum_{i=1}^{n} g_i\Big\|_p \le 3M\Delta_n\sqrt{2pn} + 2^k p\beta\sum_{l=0}^{k-1}(9 + l)\Delta_{2^l}^2.$$

The proof is completed.
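The nested partitions used in the proof of Theorem 2 can be made concrete with a short sketch. The function names `dyadic_partitions` and `block_of` are hypothetical helpers introduced for this illustration; indices are 1-based to match the proof.

```python
def dyadic_partitions(k):
    """Return [B_0, ..., B_k], where B_l splits {1, ..., 2**k} into consecutive blocks of size 2**l."""
    n = 2 ** k
    partitions = []
    for level in range(k + 1):
        size = 2 ** level
        blocks = [list(range(start, start + size)) for start in range(1, n + 1, size)]
        partitions.append(blocks)
    return partitions

def block_of(partitions, level, i):
    """Return B_level(i), the unique block at the given level that contains index i (1-based)."""
    size = len(partitions[level][0])
    return partitions[level][(i - 1) // size]
```

For $k = 3$, level 0 consists of the singletons, level 3 is the single block $\{1, \ldots, 8\}$, and moving from level $l$ to $l + 1$ adds exactly $2^l$ new indices to the block containing $i$. These are the $2^l$ variables that are averaged out in the step leading to Eq. (B.3).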

C PROOF OF THEOREM 5

To prove Theorem 5, we require the following lemma to control the difference between two test errors: one with the test example drawn from the mixing sequence and one with the test example drawn from the independent stationary distribution. Let $S_b = \{z_1, \ldots, z_{n-b}\}$ be the sequence obtained by removing the last $b$ points of $S$. Lemma C.1 bounds this difference as

$$\big|\mathbb{E}_z[f(w_{S_b}; z)] - \mathbb{E}_z[f(w_S; z) \mid S]\big| \le b\beta + M\varphi(b).$$

Proof of Lemma 4. Let $z_i'$ (resp. $z_i''$) be drawn from the same distribution as $z_i$, i.e., the conditional distribution of $z_i'$ (resp. $z_i''$) given $z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n$ is the same as that of $z_i$ given $z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n$. Let $S_{i,b} = \{z_1, \ldots, z_{i-b-1}, z_i, z_{i+b+1}, \ldots, z_{n-b}\}$. For any $i \in [n]$, let $S_{i,b}^i = \{z_1, \ldots, z_{i-b-1}, z_i', z_{i+b+1}, \ldots, z_{n-b}\}$. We have the following decomposition

$$\sum_{i=1}^{n}\Big(\mathbb{E}_{z_i''}[f(w_{S_{i,b}}; z_i'')] - f(w_{S_{i,b}}; z_i)\Big) = \sum_{i=1}^{n}\Big(\mathbb{E}_{z_i''}[f(w_{S_{i,b}}; z_i'')] - \mathbb{E}_{z_i'}\mathbb{E}_{z_i''}[f(w_{S_{i,b}^i}; z_i'')]\Big) + \sum_{i=1}^{n}\mathbb{E}_{z_i'}\Big[\mathbb{E}_{z_i''}[f(w_{S_{i,b}^i}; z_i'')] - f(w_{S_{i,b}^i}; z_i)\Big] + \sum_{i=1}^{n}\mathbb{E}_{z_i'}\Big[f(w_{S_{i,b}^i}; z_i) - f(w_{S_{i,b}}; z_i)\Big].$$

According to the definition of $\beta$-uniform stability we know

$$\sum_{i=1}^{n}\Big(\mathbb{E}_{z_i''}[f(w_{S_{i,b}}; z_i'')] - f(w_{S_{i,b}}; z_i)\Big) \le 2\beta n + \sum_{i=1}^{n}\mathbb{E}_{z_i'}\Big[\mathbb{E}_{z_i''}[f(w_{S_{i,b}^i}; z_i'')] - f(w_{S_{i,b}^i}; z_i)\Big]. \tag{C.1}$$

For any $i \in [n]$, introduce $g_i = \mathbb{E}_{z_i'}\big[\mathbb{E}_{z_i''}[f(w_{S_{i,b}^i}; z_i'')] - f(w_{S_{i,b}^i}; z_i)\big]$. Then, we have

$$\sum_{i=1}^{n}\Big(\mathbb{E}_{z_i''}[f(w_{S_{i,b}}; z_i'')] - f(w_{S_{i,b}}; z_i)\Big) \le 2\beta n + \sum_{i=1}^{n} g_i. \tag{C.2}$$

According to Lemma C.1 we know $\big|F(w_S) - \mathbb{E}_z[f(w_{S_{i,b}}; z)]\big| \le 3b\beta + M\varphi(b)$. By the definition of $\varphi'$, we know $\big|\mathbb{E}_{z_i''}[f(w_{S_{i,b}}; z_i'')] - \mathbb{E}_z[f(w_{S_{i,b}}; z)]\big| \le M\varphi'(b)$. Furthermore, the definition of stability implies $|f(w_{S_{i,b}}; z_i) - f(w_S; z_i)| \le 3b\beta$. We combine the above three inequalities together and derive

$$\sum_{i=1}^{n}\big|F(w_S) - \mathbb{E}_{z_i''}[f(w_{S_{i,b}}; z_i'')]\big| + \sum_{i=1}^{n}\big|f(w_{S_{i,b}}; z_i) - f(w_S; z_i)\big| \le \big(6b\beta + M\varphi(b) + M\varphi'(b)\big)n.$$

Combining the above inequality and Eq. (C.1), we obtain

$$n\big(F(w_S) - F_S(w_S)\big) \le \sum_{i=1}^{n}\Big(\mathbb{E}_{z_i''}[f(w_{S_{i,b}}; z_i'')] - f(w_{S_{i,b}}; z_i)\Big) + \sum_{i=1}^{n}\big|F(w_S) - \mathbb{E}_{z_i''}[f(w_{S_{i,b}}; z_i'')]\big| + \sum_{i=1}^{n}\big|f(w_{S_{i,b}}; z_i) - f(w_S; z_i)\big| \le (6b + 2)n\beta + n\big(M\varphi(b) + M\varphi'(b)\big) + \sum_{i=1}^{n} g_i.$$

The proof is completed.

Proof of Theorem 5. Recall the definition of $g_i$, $g_i = \mathbb{E}_{z_i'}\big[\mathbb{E}_{z_i''}[f(w_{S_{i,b}^i}; z_i'')] - f(w_{S_{i,b}^i}; z_i)\big]$. Since $z_i$ and $z_i''$ follow the same distribution, we know $\mathbb{E}[g_i \mid z_{[n]\setminus\{i\}}] = 0$. One can check that the other assumptions in Theorem 2 also hold. Therefore, one can apply Theorem 2 to derive the following moment bound

$$\Big\|\sum_{i=1}^{n} g_i\Big\|_p \le 3M\Delta_n\sqrt{2pn} + 2^{k+1}p\beta\sum_{l=0}^{k-1}(9 + l)\Delta_{2^l}^2,$$

where $k = \lceil \log_2 n \rceil$. According to Lemma A.2 we further get the following inequality with probability at least $1 - \delta$

$$\Big|\sum_{i=1}^{n} g_i\Big| \le e\Big(3M\Delta_n\sqrt{2n\log(e/\delta)} + 2^{k+1}\beta\sum_{l=0}^{k-1}(9 + l)\Delta_{2^l}^2\log(e/\delta)\Big). \tag{C.3}$$

According to Lemma 4, we know

$$n\big(F(w_S) - F_S(w_S)\big) \le (6b + 2)n\beta + n\big(M\varphi(b) + M\varphi'(b)\big) + \sum_{i=1}^{n} g_i.$$

We can combine the above inequality with Eq. (C.3) to derive the following inequality with probability at least $1 - \delta$

$$n\big(F(w_S) - F_S(w_S)\big) \le e\Big(3M\Delta_n\sqrt{2n\log(e/\delta)} + 2^{k+1}\beta\sum_{l=0}^{k-1}(9 + l)\Delta_{2^l}^2\log(e/\delta)\Big) + (6b + 2)n\beta + n\big(M\varphi(b) + M\varphi'(b)\big).$$

The proof is completed.

D PROOF OF APPLICATIONS

D.1 PROOF OF COROLLARY 6

To prove Corollary 6, we require the following lemma on the uniform stability of kernel regularization (Bousquet & Elisseeff, 2002).

Lemma D.1 (Bousquet & Elisseeff 2002). Let the loss function $f$ be $L$-Lipschitz and convex. Let the algorithm $A$ be defined in (5.2). Then $A$ is $\beta$-uniformly stable with $\beta \le \frac{L^2}{\lambda n}$.

Proof of Corollary 6. Let $A$ be the algorithm which returns $w_{S,\lambda}$. By Lemma D.1, we know $A$ is $\beta$-uniformly stable, where $\beta \le \frac{L^2}{n\lambda}$. Plugging the above inequality into Theorem 5, we derive a bound on the generalization gap $F(w_{S,\lambda}) - F_S(w_{S,\lambda})$ that holds with probability at least $1 - \delta$. Furthermore, we have the following error decomposition

$$F(w_{S,\lambda}) - F(w^*) + \lambda\|w_{S,\lambda}\|_K^2 = F(w_{S,\lambda}) - F_S(w_{S,\lambda}) + \big(F_S(w_{S,\lambda}) - F_S(w^*) + \lambda\|w_{S,\lambda}\|_K^2 - \lambda\|w^*\|_K^2\big) + \lambda\|w^*\|_K^2 + F_S(w^*) - F(w^*) \le F(w_{S,\lambda}) - F_S(w_{S,\lambda}) + F_S(w^*) - F(w^*) + \lambda\|w^*\|_K^2,$$

where we have used the definition of $w_{S,\lambda}$. We can combine the above three inequalities together to derive the stated bound. The proof is completed.

D.2 PROOF OF COROLLARY 7 AND COROLLARY 8

First, we prove the stability bound for convex loss minimization via SGD. Then, we apply the stability bound and Theorem 5 to derive the generalization bound. To develop high-probability bounds, we need to introduce a concentration inequality (Wainwright, 2019).

Lemma D.2 (Chernoff's Bound). Let $X_1, \ldots, X_t$ be independent random variables taking values in $\{0, 1\}$. Let $X = \sum_{j=1}^{t} X_j$ and $\mu = \mathbb{E}[X]$. Then for any $\tilde{\delta} > 0$ with probability at least $1 - \exp\big(-\mu\tilde{\delta}^2/(2 + \tilde{\delta})\big)$ we have $X \le (1 + \tilde{\delta})\mu$. Furthermore, for any $\delta \in (0, 1)$ with probability at least $1 - \delta$ we have

$$X \le \mu + \log(1/\delta) + \sqrt{2\mu\log(1/\delta)}.$$

Proof of Corollary 7. Let $S$ and $S'$ be two samples of size $n$ differing in only a single example. Consider the gradient updates $w_1, \ldots, w_T$ and $w_1', \ldots, w_T'$ induced by running SGD on samples $S$ and $S'$. We now suppose $S$ and $S'$ differ by the first example and apply the Lipschitz condition on $f(\cdot; z)$ to get

$$|f(w_T; z) - f(w_T'; z)| \le L\,\delta_T, \tag{D.1}$$

where $\delta_T = \|w_T - w_T'\|$. Observe that at step $t$, with probability $1 - 1/n$, the example selected by SGD is the same in both $S$ and $S'$. The convexity and $\gamma$-smoothness imply that (Hardt et al., 2016)

$$\langle \nabla f(v; z) - \nabla f(w; z), v - w\rangle \ge \frac{1}{\gamma}\|\nabla f(v; z) - \nabla f(w; z)\|^2. \tag{D.2}$$

If $i_t \ne 1$ and $\eta_t \le 2/\gamma$, this inequality shows that the SGD update does not increase the distance between the two iterates; if $i_t = 1$, the distance grows by at most $2\eta_t L$. Counting the number of steps with $i_t = 1$ by Lemma D.2, with probability at least $1 - \delta$, there holds

$$\|w_{t+1} - w_{t+1}'\| \le 2L\eta\Big(t/n + \log(1/\delta) + \sqrt{2tn^{-1}\log(1/\delta)}\Big).$$

By the convexity of the norm $\|\cdot\|$, we get the following inequality with probability at least $1 - \delta$

$$\|w_S - w_S'\| \le 2L\eta\Big(T/n + \log(1/\delta) + \sqrt{2Tn^{-1}\log(1/\delta)}\Big).$$

Plugging the inequality back into Eq. (D.1), we obtain that, with probability at least $1 - \delta$

$$|f(w_S; z) - f(w_S'; z)| \le 2L^2\eta\Big(T/n + \log(1/\delta) + \sqrt{2Tn^{-1}\log(1/\delta)}\Big). \tag{D.7}$$



† The work was done when Shi Fu was an intern at JD Explore Academy * Corresponding authors



log 2 n log(1/δ), which is smaller.3. Finally, we consider exponential mixing sequences, i.e., φ ′ (k) ≤ φ 0 exp (-φ 1 k r ), which imply ∆ n = O(1). If we fix b = ⌈log 1 r (1/β)⌉, we know exp(-b r ) ≤ bβ = O(β log 1 r (1/β)) and F (w S ) -F S (w S ) = O n -1 log(1/δ) + β log 2 n log(1/δ) + β log

According to Lemma A.1 with $c = 2\epsilon$ we derive the following inequality with probability at least $1 - (e-2)\delta/e$

$$\Big|\sum_{i=1}^{n}\tilde{Z}_i\Big| \le \epsilon\Delta_n\sqrt{2n\log\big(2e/((e-2)\delta)\big)}. \tag{A.1}$$

According to the assumption $\Pr\{|Z_i| > \varepsilon\} \le 2\exp(-\varepsilon^2/b)$, we know with probability at least $1 - 2\delta/(en)$

$$|Z_i| \le \sqrt{b\log(en/\delta)}. \tag{A.2}$$

Lemma C.1 (Mohri & Rostamizadeh 2010). Let F (w S ) = E z [f (w S ; z) | S] denote the expectation in the dependent case (i.e., z depends on S) and F (w S b ) = E z [f (w S b ; z)] denote the expectation where the test points are assumed independent of the training data (i.e., z is independent of S). If A is β-uniformly stable and f (w; z) ∈ [0, M ], then the following inequality holds for any b > 0

(w S,λ ) -F S (w S,λ ) ≤ 2(3b + 1)L 2 nλ + M φ(b) + M φ ′we have the following inequality with probability at least 1 -δF S (w * ) -F (w * ) = O

r.t. the Hamming distance. Furthermore, we have $\mathbb{E}[\mathbb{E}[g_i \mid Z_i]] = 0$. Now we can apply Lemma A.1 with $c = 2M$ to derive the following inequality with probability at least $1 - \delta$

by $\eta_t \le 2/\gamma$ we know

$$\|w_{t+1} - w_{t+1}'\|^2 = \|w_t - w_t'\|^2 - 2\eta_t\langle \nabla f(w_t; z_{i_t}) - \nabla f(w_t'; z_{i_t}'), w_t - w_t'\rangle + \eta_t^2\|\nabla f(w_t; z_{i_t}) - \nabla f(w_t'; z_{i_t}')\|^2 \le \|w_t - w_t'\|^2.$$

With probability $1/n$, the example selected is different, i.e., $i_t = 1$. Then, by the triangle inequality and Eq. (D.3),

$$\|w_{t+1} - w_{t+1}'\| \le \big\|w_t - \eta_t\nabla f(w_t; z_{i_t}) - \big(w_t' - \eta_t\nabla f(w_t'; z_{i_t})\big)\big\| + \eta_t\|\nabla f(w_t'; z_{i_t}') - \nabla f(w_t'; z_{i_t})\| \le \|w_t - w_t'\| + 2\eta_t L.$$

Combining the above two cases, we can conclude that for every $t$,

$$\|w_{t+1} - w_{t+1}'\| \le \|w_t - w_t'\| + 2\eta_t L\,\mathbb{I}_{[i_t = 1]}, \tag{D.5}$$

where $\mathbb{I}$ denotes the indicator function. Solving this recursive inequality and applying Lemma D.2 with $X_k = \mathbb{I}_{[i_k = 1]}$ and $\mu = t/n$, we get the following inequality with probability at least $1 - \delta$

ACKNOWLEDGEMENT

This work is supported by the Major Science and Technology Innovation 2030 "New Generation Artificial Intelligence" key project (No. 2021ZD0111700), NSFC No. 62222117 and National Social Science Fund of China "Research on Virtual Reality Media Narrative" (Grant No. 21&ZD326). The work was done when Yunwen Lei was at the School of Computer Science, University of Birmingham.

A PROOF OF THEOREM 1

In this section, we prove Theorem 1. To this aim, we introduce several lemmas. The following lemma is a McDiarmid inequality for stable functions defined on mixing sequences (Kontorovich & Ramanan, 2008).

Lemma A.1. Let $\Phi : Z^n \to \mathbb{R}$ be a measurable function that is $c$-Lipschitz w.r.t. the Hamming distance for some $c > 0$, and let $Z_1, \ldots, Z_n$ be random variables distributed according to a φ-mixing distribution. Then for any $\epsilon > 0$ the following inequality holds

$$\Pr\big\{|\Phi - \mathbb{E}[\Phi]| \ge \epsilon\big\} \le 2\exp\Big(-\frac{2\epsilon^2}{n c^2 \Delta_n^2}\Big).$$

Furthermore, for any $\delta \in (0, 1)$ the following inequality holds with probability at least $1 - \delta$

$$|\Phi - \mathbb{E}[\Phi]| \le c\,\Delta_n\sqrt{\frac{n\log(2/\delta)}{2}}.$$

We can combine Eq. (D.7) and Theorem 5 to obtain the following inequality with probability at least

By the choice of $T \asymp n$, we can get

The proof is completed.

To prove excess risk bounds, we require the following high-probability bound on optimization error. Notice that optimization error analysis does not depend on the mixing property of the dataset since the randomness is taken with respect to the random indices.

Lemma D.3 (Optimization Error (Lei & Tang, 2018)). Assume that the loss function $f(\cdot; z)$ is convex and $L$-Lipschitz for every $z$. Suppose that we run SGD with step sizes $\eta_t = \eta \asymp 1/\sqrt{T}$. Then with probability at least $1 - \delta$ we have

Proof of Corollary 8. By Corollary 7, we know with probability at least $1 - \delta$ that

By Lemma A.1, we have the following inequality with probability at least $1 - \delta$

Lemma D.3 shows the following inequality with probability at least $1 - \delta$

We plug the above three inequalities back into Eq. (5.1), and derive the following inequality with probability at least $1 - 3\delta$

The proof is completed.

D.3 PROOF OF COROLLARY 9

In this section, we present the proof on the stability of the iterative localization technique. To this aim, we first present Algorithm 1.

Algorithm 1: Iterative Localized Algorithm
draw a sample $S_i$ of size $n$ from the mixing distribution
4. apply SGD with $T_i$ iterations and step size $\eta_t$ to minimize the following problem and get $w_i$
5. end

Then we move to the generalization bound for Corollary 9. To this aim, we need to introduce some definitions for our proof. For any $i$, let

where $F_{S_i}$ is defined in Algorithm 1.

Lemma D.4 (Optimization Error Bound). Suppose that the function $w \mapsto f(w; z)$ is $\mu$-strongly convex (with respect to $\|\cdot\|$) and $L$-Lipschitz. Let $\{w_t\}_t$ be produced by SGD on sample $S$ and step

Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

The proof of Lemma D.4 can be found in Harvey et al. (2019). According to Algorithm 1, $w_i$ is the output of SGD with $\eta_t = \gamma_i n/(t + 1)$ to minimize $F_{S_i}(w)$, with the iterates weighted as in Lemma D.4. The following lemma bounds the Euclidean distance between $w_i$ and $\hat{w}_i$.

Lemma D.5. Suppose that the function $w \mapsto f(w; z)$ is $L$-Lipschitz and $\mu$-strongly convex. For any $\delta \in (0, 1)$, the following inequality holds with probability at least $1 - \delta$

Proof. From Algorithm 1, we know that $F_{S_i}$ is $\lambda_i$-strongly convex with $\lambda_i := 2/(\gamma_i n)$. According to Lemma D.4, the following inequality holds with probability at least $1 - \delta$

It then follows from the definition of $\hat{w}_i$ and the strong convexity that

and therefore

Lemma D.6 (Bousquet & Elisseeff 2002). Suppose the function $f : W \times Z \to \mathbb{R}$ takes a structure $f = \ell + r$, where $\ell : W \times Z \to \mathbb{R}$ and $r : W \to \mathbb{R}$. Assume for all $z$, we have $\|\nabla\ell(w; z)\| \le L$. Suppose $F_S(w) = \frac{1}{n}\sum_{i=1}^{n} f(w; z_i)$ is $\mu$-strongly convex and define $A$ as $A(S) = \arg\min_{w \in W} F_S(w)$. Then $A$ is $\frac{4L^2}{n\mu}$-uniformly stable.

Lemma D.7. Assume for any $z$, $w \mapsto f(w; z)$ is $L$-Lipschitz. Let $\hat{w}_i$ be defined in Eq. (D.9). With probability at least $1 - \delta/(2m)$ we have the following inequality uniformly for any $w$

where $k = \lceil \log_2 n \rceil$.

Proof. For any $i$, define

and $w_i^* = \arg\min_w F_i(w)$, where we assume $z$ is independently drawn from the stationary distribution of the mixing sequence. Let $A_i$ be the algorithm outputting the minimizer of $F_{S_i}$. We know $F_{S_i}$ is $\lambda_i$-strongly convex with $\lambda_i = 2/(\gamma_i n)$. Then analysis similar to Corollary 6 implies the stated inequality with probability at least $1 - \delta/(2m)$. The proof is completed.

Based on the above lemmas, we now turn to proving Corollary 9.

Proof of Corollary 9. Let $\hat{w}_0 = w^*$ and $\hat{w}_i$ be defined by Eq. (D.9). We can decompose $F(w_m) - F(w^*)$ by

(D.10)

Since $f$ is $L$-Lipschitz, Lemma D.5 implies the following inequality with probability at least $1 - \delta/(2m)$

Furthermore, we can apply Lemma D.7 with $w = \hat{w}_{i-1}$ to derive the following inequality with probability at least $1 - \delta/(2m)$

The following inequality then holds with probability at least

By Lemma D.5, we further get the following inequality with probability at least $1 - \delta$

where in the last two steps we have used $\gamma_i = \gamma/2^i$. We can plug the above inequality and Eq. (D.11) back into Eq. (D.10), and derive the following inequality with probability $1 - \delta$

This gives the stated bound. The proof is completed.

E PROOF OF LEMMA 3

Proof of Lemma 3. For simplicity, we only consider discrete random variables. According to the definition of mixing sequence, we know

It then follows that

In a similar way, one can show

The proof is completed by combining the above two inequalities together.

