SHARPER BOUNDS FOR UNIFORMLY STABLE ALGORITHMS WITH STATIONARY MIXING PROCESSES

Abstract

Generalization analysis of learning algorithms often builds on the critical assumption that training examples are independently and identically distributed, an assumption that is violated in practical problems such as time series prediction. In this paper, we use algorithmic stability to study the generalization performance of learning algorithms with φ-mixing data, where the dependency between observations weakens over time. We show that uniformly stable algorithms enjoy high-probability generalization bounds of the order O(1/√n) (up to a logarithmic factor), where n is the sample size. We apply our general result to specific algorithms, including regularization schemes, stochastic gradient descent and localized iterative regularization, and develop excess population risk bounds for learning with φ-mixing data. Our analysis builds on a novel moment bound for weakly dependent random variables on a φ-mixing sequence and a novel decomposition of the generalization error.

1. INTRODUCTION

The generalization gap refers to the discrepancy between training and test performance, a quantity of central importance in statistical learning theory (SLT) (Shalev-Shwartz & Ben-David, 2014). A popular approach to controlling the generalization gap is to bound it by the uniform convergence of training errors to testing errors over a function space (Bartlett & Mendelson, 2002), which leads to bounds depending on the complexity of the function space, such as the VC dimension (Vapnik, 2013), covering numbers (Cucker & Zhou, 2007) and Rademacher complexity (Bartlett & Mendelson, 2002). These complexity-based bounds do not exploit the properties of the learning algorithm and generally admit a square-root dependency on the dimension (Feldman, 2016), which is unfavorable for large-scale problems. To incorporate the properties of the learning algorithm and remove the dependency on the dimension, the concept of algorithmic stability was introduced into SLT (Bousquet & Elisseeff, 2002). Intuitively, algorithmic stability measures how a small perturbation of the training dataset affects the output model of a learning algorithm (see the formal definition recalled below), and it has close connections to several key properties such as learnability (Shalev-Shwartz et al., 2010), robustness and privacy (Bassily et al., 2020). Recent research has witnessed an increasing interest in leveraging stability to study the generalization behavior of various algorithms, such as stochastic gradient descent (Hardt et al., 2016), structured prediction (London et al., 2016), meta learning (Maurer, 2005) and transfer learning (Kuzborskij & Lampert, 2018).

Most of these discussions are based on the critical assumption that the training examples are independently and identically distributed (i.i.d.). This assumption is often violated in practical applications. For example, the i.i.d. assumption is too restrictive in time series prediction (Vidyasagar, 2013): the prices of the same stock on different days may exhibit temporal dependence. These phenomena motivate several analyses that derive meaningful bounds for learning problems with observations drawn from a non-i.i.d. process (Yu, 1994; Vidyasagar, 2013). A widely used relaxation of the i.i.d. assumption is to assume the observations are drawn from a mixing process (Yu, 1994; Meir, 2000; Lozano et al., 2005; Vidyasagar, 2013), where the dependency between observations weakens as their temporal distance increases.
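For orientation, we recall the standard notion of uniform stability (Bousquet & Elisseeff, 2002); the notation here ($\mathcal{Z}$ for the sample space, $\ell$ for the loss, $A(S)$ for the model trained on $S$) is generic, and the precise setup used in our analysis is given later in the paper. An algorithm $A$ is said to be $\gamma$-uniformly stable if, for any two datasets $S, S' \in \mathcal{Z}^n$ differing in at most one example,
$$\sup_{z \in \mathcal{Z}} \big|\ell(A(S); z) - \ell(A(S'); z)\big| \le \gamma.$$
For instance, Bousquet & Elisseeff (2002) showed that regularization schemes with strongly convex regularizers satisfy this with $\gamma = O(1/(n\lambda))$, where $\lambda$ is the regularization parameter, so that $\gamma$ decays as the sample size grows.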


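As background for the mixing assumption discussed above, a common formalization (used, e.g., by Yu, 1994; Vidyasagar, 2013) is through the φ-mixing coefficient; the exact assumption underlying our results is stated formally later in the paper. For a stationary sequence $\{Z_t\}_{t \ge 1}$, let $\mathcal{F}_1^m = \sigma(Z_1, \dots, Z_m)$ and $\mathcal{F}_{m+k}^{\infty} = \sigma(Z_{m+k}, Z_{m+k+1}, \dots)$, and define
$$\varphi(k) = \sup_{m \ge 1}\; \sup_{A \in \mathcal{F}_1^m,\, B \in \mathcal{F}_{m+k}^{\infty},\, \mathbb{P}(A) > 0} \big|\mathbb{P}(B \mid A) - \mathbb{P}(B)\big|.$$
The sequence is φ-mixing if $\varphi(k) \to 0$ as $k \to \infty$; the i.i.d. setting corresponds to the special case $\varphi(k) = 0$ for all $k \ge 1$.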

† The work was done when Shi Fu was an intern at JD Explore Academy.
* Corresponding authors.

