ON STABILITY AND GENERALIZATION OF BILEVEL OPTIMIZATION PROBLEM

Abstract

(Stochastic) bilevel optimization is a frequently encountered problem in machine learning with a wide range of applications such as meta-learning, hyper-parameter optimization, and reinforcement learning. Most existing studies on this problem focus only on analyzing or improving the convergence rate, while little effort has been devoted to understanding its generalization behavior. In this paper, we conduct a thorough analysis of the generalization of first-order (gradient-based) methods for the bilevel optimization problem. We first establish a fundamental connection between algorithmic stability and the generalization gap in different forms, and give a high-probability generalization bound that improves the previous best one from O(√n) to O(log n), where n is the sample size. We then provide the first stability bounds for the general case where both inner- and outer-level parameters are subject to continuous update, whereas existing work allows only the outer-level parameter to be updated. Our analysis applies to various standard settings such as strongly-convex-strongly-convex (SC-SC), convex-convex (C-C), and nonconvex-nonconvex (NC-NC). Our analysis for the NC-NC setting can also be extended to a particular nonconvex-strongly-convex (NC-SC) setting that is commonly encountered in practice. Finally, we corroborate our theoretical analysis and demonstrate how the number of iterations affects the generalization gap through experiments on meta-learning and hyper-parameter optimization.

1. INTRODUCTION

(Stochastic) bilevel optimization is a frequently encountered problem in machine learning with various applications such as meta-learning (Finn et al., 2017; Bertinetto et al., 2018; Rajeswaran et al., 2019), hyper-parameter optimization (Franceschi et al., 2018; Shaban et al., 2019; Baydin et al., 2017; Bergstra et al., 2011; Luketina et al., 2016), reinforcement learning (Hong et al., 2020), and few-shot learning (Koch et al., 2015; Santoro et al., 2016; Vinyals et al., 2016). The basic form of this problem can be defined as follows:

$$\min_{x \in \mathbb{R}^{d_1}} R(x) = F(x, y^*(x)) := \mathbb{E}_{\xi}\left[f(x, y^*(x); \xi)\right] \quad \text{s.t.} \quad y^*(x) = \arg\min_{y \in \mathbb{R}^{d_2}} \left\{ G(x, y) := \mathbb{E}_{\zeta}\left[g(x, y; \zeta)\right] \right\}, \tag{1}$$

where f : R^{d_1} × R^{d_2} → R and g : R^{d_1} × R^{d_2} → R are two loss functions that are continuously differentiable with respect to x and y. Problem (1) has a two-level optimization hierarchy, where the outer-level objective function f depends on the minimizer of the inner-level objective function g. Due to its importance, the above bilevel optimization problem has received considerable attention in recent years. A natural way to solve problem (1) is to apply alternating stochastic gradient updates, approximating ∇_y g(x, y) and ∇f(x, y), respectively. Briefly speaking, previous efforts mainly examined two types of methods for obtaining an approximate solution that is close to the optimum y^*(x). One is the single-timescale strategy (Chen et al., 2021; Guo et al., 2021; Khanduri et al., 2021; Hu et al., 2022), where the updates for y and x are carried out simultaneously. The other is the two-timescale strategy (Ghadimi & Wang, 2018; Ji et al., 2021;
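To make the alternating update scheme concrete, the following is a minimal sketch on a hypothetical toy problem (the quadratic objectives, variable names, and step sizes below are illustrative assumptions, not taken from the paper). It alternates an inner gradient step on g with an outer step on the hypergradient of R, computed here in closed form via the implicit function theorem:

```python
# Toy bilevel instance (illustrative, not from the paper):
#   inner: g(x, y) = 0.5 * (y - x)^2   =>   y*(x) = x
#   outer: f(x, y) = 0.5 * (y - 1)^2   =>   R(x) = 0.5 * (x - 1)^2, minimized at x = 1.

def grad_g_y(x, y):
    """Inner-level gradient ∇_y g(x, y) = y - x."""
    return y - x

def approx_hypergrad(x, y):
    """Approximate hypergradient of R at x, using the current inner iterate y.

    By the implicit function theorem,
        ∇R(x) = ∇_x f - ∇_{xy} g (∇_{yy} g)^{-1} ∇_y f.
    For this toy problem ∇_x f = 0, ∇_{xy} g = -1, ∇_{yy} g = 1,
    and ∇_y f = y - 1, so ∇R(x) ≈ y - 1 when y ≈ y*(x).
    """
    return y - 1.0

x, y = 5.0, 0.0          # initial outer / inner parameters
alpha, beta = 0.5, 0.5   # outer / inner step sizes (assumed values)

for _ in range(200):
    y -= beta * grad_g_y(x, y)          # inner-level step toward y*(x)
    x -= alpha * approx_hypergrad(x, y)  # outer-level step with approximate y

print(round(x, 3), round(y, 3))  # both iterates approach the optimum x = y = 1
```

Because both variables are updated in every iteration, this sketch follows the single-timescale pattern; a two-timescale variant would instead run several inner steps (or use a faster-decaying inner step size) before each outer update, so that y tracks y*(x) more closely.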

