ON STABILITY AND GENERALIZATION OF BILEVEL OPTIMIZATION PROBLEM

Abstract

(Stochastic) bilevel optimization is a frequently encountered problem in machine learning with a wide range of applications, such as meta-learning, hyper-parameter optimization, and reinforcement learning. Most existing studies of this problem focus only on analyzing convergence or improving the convergence rate, while little effort has been devoted to understanding its generalization behavior. In this paper, we conduct a thorough analysis of the generalization of first-order (gradient-based) methods for the bilevel optimization problem. We first establish a fundamental connection between algorithmic stability and the generalization gap in different forms, and give a high-probability generalization bound that improves the previous best one from O(√n) to O(log n), where n is the sample size. We then provide the first stability bounds for the general case where both inner- and outer-level parameters are subject to continuous updates, whereas existing work allows only the outer-level parameters to be updated. Our analysis applies to various standard settings, such as strongly-convex-strongly-convex (SC-SC), convex-convex (C-C), and nonconvex-nonconvex (NC-NC). Our analysis for the NC-NC setting also extends to a particular nonconvex-strongly-convex (NC-SC) setting that is commonly encountered in practice. Finally, we corroborate our theoretical analysis and demonstrate how the number of iterations affects the generalization gap through experiments on meta-learning and hyper-parameter optimization.

1. INTRODUCTION

(Stochastic) bilevel optimization is a widely encountered problem in machine learning with various applications, such as meta-learning (Finn et al., 2017; Bertinetto et al., 2018; Rajeswaran et al., 2019), hyper-parameter optimization (Franceschi et al., 2018; Shaban et al., 2019; Baydin et al., 2017; Bergstra et al., 2011; Luketina et al., 2016), reinforcement learning (Hong et al., 2020), and few-shot learning (Koch et al., 2015; Santoro et al., 2016; Vinyals et al., 2016). The basic form of this problem can be defined as follows:

min_{x ∈ R^{d_1}} R(x) = F(x, y*(x)) := E_ξ [ f(x, y*(x); ξ)ั]
s.t. y*(x) = argmin_{y ∈ R^{d_2}} { G(x, y) := E_ζ [ g(x, y; ζ) ] },     (1)

where f : R^{d_1} × R^{d_2} → R and g : R^{d_1} × R^{d_2} → R are two continuously differentiable loss functions with respect to x and y. Problem (1) has an optimization hierarchy of two levels, where the outer-level objective function f depends on the minimizer y*(x) of the inner-level objective function g.

Due to its importance, the above bilevel optimization problem has received considerable attention in recent years. A natural way to solve problem (1) is to apply alternating stochastic gradient updates that approximate ∇_y g(x, y) and ∇f(x, y), respectively. Briefly speaking, previous efforts mainly examined two types of methods for obtaining an approximate solution close to the optimum y*(x). One is the single-timescale strategy (Chen et al., 2021; Guo et al., 2021; Khanduri et al., 2021; Hu et al., 2022), where the updates for y and x are carried out simultaneously. The other is the two-timescale strategy (Ghadimi & Wang, 2018; Ji et al., 2021; Hong et al., 2020; Pedregosa, 2016), where the update of y is repeated multiple times to achieve a more accurate approximation before each update of x.

| Reference        | SC-SC           | C-C                 | NC-NC                     | NC-SC                     |
|------------------|-----------------|---------------------|---------------------------|---------------------------|
| SSGD (this work) | O(1/m_1)        | O(κ_1^{K/2}/m_1)    | O(K^{κ_2}/m_1)            | O(K^{κ_3}/m_1)            |
| TSGD (this work) | O((κ_4)^K/m_1)  | O((κ_4)^K/m_1)      | O(T^{1-κ_5} K^{κ_5}/m_1)  | O(T^{1-κ_6} K^{κ_6}/m_1)  |

Table 1: Summary of main results (stability bounds in various settings). κ_i: a constant for each i above; T: number of inner iterations; K: number of outer iterations; m_1: size of the outer-level dataset. SSGD and TSGD stand for Algorithm 1 and Algorithm 2, the single-timescale and two-timescale methods via stochastic gradient descent, respectively.

While there is a long list of work on bilevel optimization, most of the existing work focuses either on analyzing its convergence behaviors (Ghadimi & Wang, 2018; Hong et al., 2020; Ji et al., 2021) or on improving its convergence rate based on the convexity and smoothness properties of f(·,·) and/or g(·,·) (Liu et al., 2020; Li et al., 2020). In contrast, little effort has been devoted to understanding the generalization behavior of the problem. To the best of our knowledge, there is only one recent work on generalization analysis for bilevel problems (Bao et al., 2021), which presents the first expected uniform stability bound. However, several undesirable issues remain in that work: (1) Its result covers only uniform stability (which can be deduced from argument stability under certain conditions; see Definition 4 for details), leaving the analysis of other, stronger notions of algorithmic stability open. (2) The UD algorithm it studies allows the outer-level parameters to be updated continuously but must reinitialize the inner-level parameters before each iteration of the inner loop, which is not commonly done in practice due to its inefficiency (see line 4 in Algorithm 3). (3) The proof of Theorem 2 in that work does not make clear whether the update of the outer-level parameters is argument-dependent on the inner-level parameters, so there may be a gap in the analysis of the UD algorithm (see Appendix E for detailed discussions). (4) Its experiments consider only hyper-parameter optimization and neglect other applications of bilevel optimization.

To address all the aforementioned issues, in this paper we give a thorough analysis of the generalization behaviors of first-order (gradient-based) methods for the general bilevel optimization problem. We employ recent advances in algorithmic stability to investigate the generalization behaviors in different settings. Specifically, our main contributions can be summarized as follows:

• Firstly, we establish a fundamental connection between the generalization gap and different notions of algorithmic stability (argument stability and uniform stability) for any randomized bilevel optimization algorithm, in both expectation and high-probability forms. Specifically, we show that the high-probability form of the generalization gap bound can be improved from O(√n) to O(log n) compared with the result in Bao et al. (2021).

• Next, we present stability bounds for gradient-based methods with either the single-timescale or the two-timescale update strategy under different standard settings. To the best of our knowledge, this work provides the first stability bounds for two-timescale (double-loop) algorithms, which allow the accumulation of sub-sampled gradients in the inner level. In detail, we consider the strongly-convex-strongly-convex (SC-SC), convex-convex (C-C), and nonconvex-nonconvex (NC-NC) settings, and further extend our analysis to a particular nonconvex-strongly-convex (NC-SC) setting that arises widely in practice. Table 1 summarizes our main results.

• Thirdly, we provide the first generalization bounds for the case where both the outer- and inner-level parameters are subject to continuous (iterative) updates. Compared with the previous work (Bao et al., 2021), our analysis does not require a reinitialization step before each inner-level iteration, so our algorithm can carry over the last updated inner-level parameters, which is more general and practical.

• Finally, we conduct empirical studies on meta-learning and hyper-parameter optimization, two applications of bilevel optimization, to corroborate our theories. Due to space limitations, all proofs and additional experiments are included in the Appendix.
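To make the two update strategies concrete, the following minimal sketch (our own illustrative construction, not the paper's Algorithms 1 and 2) contrasts single-timescale and two-timescale updates on a toy one-dimensional ridge-regression instance, where the outer variable x is the log regularization strength and the hypergradient is computed exactly via the implicit function theorem; for clarity it uses full-batch gradients in place of stochastic ones. All names (`ssgd`, `tsgd`, the data splits) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: scalar weight y, outer variable x = log(reg. strength).
# Inner loss  g(x, y) = mean 0.5*(u*y - v)^2 + 0.5*exp(x)*y^2  (training split)
# Outer loss  f(y)    = mean 0.5*(u*y - v)^2                   (validation split)
u_tr = rng.normal(size=50); v_tr = 2.0 * u_tr + 0.3 * rng.normal(size=50)
u_va = rng.normal(size=50); v_va = 2.0 * u_va + 0.3 * rng.normal(size=50)

def grad_y_inner(x, y):
    # ∇_y g(x, y)
    return np.mean(u_tr * (u_tr * y - v_tr)) + np.exp(x) * y

def hypergrad(x, y):
    # Implicit-function-theorem hypergradient (scalar case):
    # dF/dx = -∇_y f * ∇_{xy} g / ∇_{yy} g   (here ∇_x f = 0)
    gy_f = np.mean(u_va * (u_va * y - v_va))
    g_yy = np.mean(u_tr ** 2) + np.exp(x)
    g_xy = np.exp(x) * y
    return -gy_f * g_xy / g_yy

def ssgd(K=500, alpha=0.1, beta=0.05):
    """Single-timescale: x and y are updated simultaneously each iteration."""
    x, y = 0.0, 0.0
    for _ in range(K):
        x, y = x - beta * hypergrad(x, y), y - alpha * grad_y_inner(x, y)
    return x, y

def tsgd(K=100, T=10, alpha=0.1, beta=0.05):
    """Two-timescale: T inner steps per outer step; note that y is carried
    over between outer iterations rather than reinitialized."""
    x, y = 0.0, 0.0
    for _ in range(K):
        for _ in range(T):
            y -= alpha * grad_y_inner(x, y)
        x -= beta * hypergrad(x, y)
    return x, y

def val_loss(y):
    return np.mean(0.5 * (u_va * y - v_va) ** 2)

x_t, y_t = tsgd()
print(f"TSGD: x = {x_t:.3f}, val loss = {val_loss(y_t):.4f}")
```

The only structural difference between the two routines is whether the inner variable is refined T times before each outer update; the reinitialization step of the UD algorithm discussed above would correspond to resetting y to 0 at the top of each outer iteration of `tsgd`.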


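The notion of argument stability underlying our bounds can also be probed numerically: run the same bilevel procedure on two outer-level datasets that differ in a single example and measure the distance between the returned parameters. The toy routine below (a self-contained, full-batch sketch of a two-timescale update on a one-dimensional ridge instance; our own construction, not the paper's experimental setup) does exactly this, with m_1 playing the role of the outer dataset size from Table 1.

```python
import numpy as np

def tsgd_toy(u_tr, v_tr, u_va, v_va, K=100, T=5, alpha=0.1, beta=0.05):
    """Two-timescale bilevel gradient descent on a 1-D ridge instance
    (x = log regularization strength, y = model weight)."""
    x, y = 0.0, 0.0
    for _ in range(K):
        for _ in range(T):
            y -= alpha * (np.mean(u_tr * (u_tr * y - v_tr)) + np.exp(x) * y)
        gy_f = np.mean(u_va * (u_va * y - v_va))       # ∇_y f on outer data
        g_yy = np.mean(u_tr ** 2) + np.exp(x)          # ∇_yy g
        x -= beta * (-gy_f * np.exp(x) * y / g_yy)     # IFT hypergradient
    return x, y

rng = np.random.default_rng(1)
m1 = 40  # outer (validation) sample size
u_tr = rng.normal(size=40); v_tr = 2.0 * u_tr + 0.3 * rng.normal(size=40)
u_va = rng.normal(size=m1); v_va = 2.0 * u_va + 0.3 * rng.normal(size=m1)

# Neighboring outer dataset S': replace a single validation example.
u_va2, v_va2 = u_va.copy(), v_va.copy()
u_va2[0], v_va2[0] = rng.normal(), rng.normal()

x1, y1 = tsgd_toy(u_tr, v_tr, u_va, v_va)
x2, y2 = tsgd_toy(u_tr, v_tr, u_va2, v_va2)
dist = abs(x1 - x2) + abs(y1 - y2)
print(f"argument-stability proxy |x - x'| + |y - y'| = {dist:.4f}")
```

A stability bound of order O(·/m_1) predicts that this parameter distance shrinks as the outer sample size m_1 grows and, in the nonconvex settings, grows with the iteration counts K and T; sweeping those parameters in this toy setup mirrors the iteration-dependence experiments described above.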