SHARPER BOUNDS FOR UNIFORMLY STABLE ALGORITHMS WITH STATIONARY MIXING PROCESSES

Abstract

Generalization analysis of learning algorithms often builds on the critical assumption that training examples are independently and identically distributed, which is frequently violated in practical problems such as time series prediction. In this paper, we use algorithmic stability to study the generalization performance of learning algorithms with ψ-mixing data, where the dependency between observations weakens over time. We show that uniformly stable algorithms enjoy high-probability generalization bounds of order O(1/√n) (up to a logarithmic factor), where n is the sample size. We apply our general result to specific algorithms, including regularization schemes, stochastic gradient descent and localized iterative regularization, and develop excess population risk bounds for learning with ψ-mixing data. Our analysis builds on a novel moment bound for weakly dependent random variables on a φ-mixing sequence and a novel decomposition of the generalization error.

1. INTRODUCTION

The generalization gap refers to the discrepancy between training and testing performance, a quantity of central importance in statistical learning theory (SLT) (Shalev-Shwartz & Ben-David, 2014). A popular approach to controlling the generalization gap is to bound it by the uniform convergence of training errors to testing errors over a function space (Bartlett & Mendelson, 2002), which leads to bounds depending on the complexity of the function space, such as the VC dimension (Vapnik, 2013), covering numbers (Cucker & Zhou, 2007) and Rademacher complexity (Bartlett & Mendelson, 2002). These complexity-based bounds do not exploit the properties of the learning algorithm and generally admit a square-root dependency on the dimension (Feldman, 2016), which is unfavorable for large-scale problems. To incorporate the properties of a learning algorithm and remove the dependency on the dimension, the concept of algorithmic stability has been introduced into SLT (Bousquet & Elisseeff, 2002). Intuitively, algorithmic stability measures how a small perturbation of the training dataset affects the output model of a learning algorithm, and it has close connections to several key properties such as learnability (Shalev-Shwartz et al., 2010), robustness and privacy (Bassily et al., 2020). Recent research has witnessed an increasing interest in leveraging stability to study the generalization behavior of various algorithms, such as stochastic gradient descent (Hardt et al., 2016), structured prediction (London et al., 2016), meta learning (Maurer, 2005) and transfer learning (Kuzborskij & Lampert, 2018). Most of these discussions rest on the critical assumption that the training examples are independently and identically distributed (i.i.d.). This assumption is often violated in practice: for example, it is too restrictive in time series prediction (Vidyasagar, 2013).
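As an illustrative sketch (not taken from this paper; the ridge-regression setup and all names below are hypothetical), the uniform-stability idea can be probed empirically by refitting a learner with one training example removed and measuring the worst-case change in the loss on a fixed test point:

```python
# Hypothetical illustration of uniform stability: for a strongly convex
# learner (ridge regression here), removing any single training example
# should change the test loss by at most O(1/(n * lam)).
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: argmin_w ||Xw - y||^2 / n + lam * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def stability_estimate(X, y, lam, z_test):
    """Max change in the loss at z_test over all leave-one-out refits."""
    n = X.shape[0]
    w_full = ridge_fit(X, y, lam)
    x_t, y_t = z_test
    loss_full = (x_t @ w_full - y_t) ** 2
    gaps = []
    for i in range(n):
        mask = np.arange(n) != i          # drop the i-th training example
        w_i = ridge_fit(X[mask], y[mask], lam)
        gaps.append(abs((x_t @ w_i - y_t) ** 2 - loss_full))
    return max(gaps)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
beta_hat = stability_estimate(X, y, lam=1.0, z_test=(rng.normal(size=5), 0.0))
```

The estimate `beta_hat` shrinks as the sample size n or the regularization strength `lam` grows, matching the O(1/(nλ)) rate for strongly convex objectives discussed below.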
Published as a conference paper at ICLR 2023

The prices of the same stock on different days may exhibit temporal dependence. These phenomena motivate several analyses that derive meaningful bounds for learning problems with observations drawn from a non-i.i.d. process (Yu, 1994; Vidyasagar, 2013). A widely used relaxation of the i.i.d. assumption is to assume that the observations are drawn from a mixing process (Yu, 1994; Meir, 2000; Lozano et al., 2005; Vidyasagar, 2013), where the dependency between two observations is quantified by a mixing coefficient as a function of the distance between the associated indices. These mixing coefficients decay either polynomially or exponentially in this distance (Vidyasagar, 2013). Several mixing processes have been introduced in the literature, including β-mixing, φ-mixing and ψ-mixing processes (Yu, 1994; Meir, 2000; Lozano et al., 2005). Within this formulation, various generalization bounds have been developed to show how the dependency among observations affects the learning process. Interestingly, these discussions imply a concept called the "effective size", which plays a role similar to that of the sample size in the i.i.d. scenario (Yu, 1994; Kuznetsov & Mohri, 2017). As in the i.i.d. case, most generalization analyses in the non-i.i.d. case focus on complexity-based bounds (Meir, 2000; Yu, 1994; Kuznetsov & Mohri, 2017); there are few stability analyses of learning algorithms in non-i.i.d. settings. An exception is the work of Mohri & Rostamizadeh (2010), which, to our knowledge, gives the first systematic analysis of stability and generalization in a non-i.i.d. setting. The authors developed high-probability generalization bounds for learning with stationary φ-mixing and β-mixing sequences, which are then applied to general kernel regularization-based algorithms.
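To give a rough feel for the "effective size" idea (the decay form, the 1/n threshold and all constants below are our own hypothetical choices, not taken from this paper): under exponential mixing, one typically groups the n observations into blocks that are long enough to be nearly independent, and the number of such blocks plays the role of the sample size.

```python
# Hypothetical sketch of the "effective size" heuristic for a stationary
# exponentially phi-mixing sequence with phi(k) = phi0 * exp(-k / tau).
# The block length mu is chosen so that phi(mu) <= 1/n, i.e. dependence
# between blocks is negligible; the n // mu nearly-independent blocks then
# play the role of the sample size in i.i.d.-style bounds.
import math

def effective_size(n, phi0=1.0, tau=2.0):
    mu = max(1, math.ceil(tau * math.log(phi0 * n)))  # solves phi(mu) <= 1/n
    return n // mu
```

Slower mixing (a larger `tau`) forces longer blocks and hence a smaller effective sample, which is exactly how dependence degrades the rates discussed below.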
Due to their algorithm-specific nature, these bounds are preferable to complexity-based bounds when the associated hypothesis space has a very large complexity. However, the stability analysis in Mohri & Rostamizadeh (2010) only implies sub-optimal generalization bounds. Indeed, for β-uniformly stable algorithms, the high-probability bounds in Mohri & Rostamizadeh (2010) are of order O(√n β + Δ_n/√n), where n is the sample size and Δ_n is a term depending on the decay rate of the mixing coefficients. For learning with λ-strongly convex problems, the uniform stability parameter is of order O(1/(nλ)) (Bousquet & Elisseeff, 2002), and the bounds in Mohri & Rostamizadeh (2010) therefore become O(1/(√n λ) + Δ_n/√n). A typical choice is λ ≍ n^{-α} for some α > 0 (Shalev-Shwartz & Ben-David, 2014), under which the bounds further become O(n^{α-1/2} + Δ_n/√n), which cannot imply the optimal bound O(1/√n) even if Δ_n = O(1). For learning with i.i.d. data, recent breakthroughs in stability analysis (Feldman & Vondrak, 2019; Bousquet et al., 2020) show that β-uniformly stable algorithms enjoy generalization bounds of order O(1/√n). This motivates a natural question: can we develop generalization bounds of order O(1/√n) for uniformly stable algorithms applied to mixing processes? This paper provides an affirmative answer to this question. Our contributions are listed below.

1. We develop a moment bound for weakly dependent random variables defined on a φ-mixing sequence, and show that it matches the existing moment bounds for i.i.d. random variables up to a logarithmic factor. As a byproduct, we develop a Marcinkiewicz-Zygmund inequality for φ-mixing sequences, which may be of independent interest.

2. We develop high-probability bounds of order O(1/√n) for uniformly stable algorithms for learning with ψ-mixing sequences (our results actually require assumptions on φ′-mixing coefficients, which are weaker than assumptions on ψ-mixing coefficients). We achieve this by introducing a different decomposition of the generalization error that yields weakly dependent, mean-zero random variables, which is more challenging than in the i.i.d. case. Our results recover the existing bounds up to a constant factor in the i.i.d. case.

3. We apply our general bound to specific algorithms, including kernel regularization schemes, stochastic gradient descent (SGD) and iterative localization, to show the effectiveness of our results.

The paper is organized as follows. We present the related work in Section 2. We develop concentration inequalities for φ-mixing data in Section 3 and present general stability-based bounds in Section 4. We apply our general result to specific algorithms in Section 5. We conclude the paper in Section 6.
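For concreteness, the sub-optimality of the earlier rates follows from a one-line substitution (a standard calculation, not specific to this paper): plugging the strongly convex stability parameter into the bound of Mohri & Rostamizadeh (2010) gives

```latex
\beta = O\Big(\frac{1}{n\lambda}\Big)
\;\Longrightarrow\;
\sqrt{n}\,\beta + \frac{\Delta_n}{\sqrt{n}}
= O\Big(\frac{1}{\sqrt{n}\,\lambda}\Big) + \frac{\Delta_n}{\sqrt{n}}
\;\stackrel{\lambda \asymp n^{-\alpha}}{=}\;
O\big(n^{\alpha-\frac{1}{2}}\big) + \frac{\Delta_n}{\sqrt{n}},
```

which strictly exceeds the target rate O(1/√n) for any α > 0, even when Δ_n = O(1).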

2. RELATED WORK

In this section, we discuss the related work, first on algorithmic stability and then on learning with dependent data. Algorithmic stability. Algorithmic stability measures how the replacement or removal of a single (or a few) training examples affects the output model, and is an important concept in SLT (Bousquet & Elisseeff, 2002). A nice property of algorithmic stability is that it only considers the behavior of the



† The work was done when Shi Fu was an intern at JD Explore Academy.
* Corresponding authors.

