ASYNCHRONOUS DISTRIBUTED BILEVEL OPTIMIZATION

Abstract

Bilevel optimization plays an essential role in many machine learning tasks, ranging from hyperparameter optimization to meta-learning. Existing studies on bilevel optimization, however, focus on either the centralized or the synchronous distributed setting. Centralized bilevel optimization approaches require collecting a massive amount of data on a single server, which inevitably incurs significant communication expenses and may give rise to data privacy risks. Synchronous distributed bilevel optimization algorithms, on the other hand, often face the straggler problem and stop working entirely if a few workers fail to respond. As a remedy, we propose the Asynchronous Distributed Bilevel Optimization (ADBO) algorithm. ADBO can tackle bilevel optimization problems whose upper-level and lower-level objective functions are both nonconvex, and its convergence is theoretically guaranteed. Furthermore, our theoretical analysis reveals that the iteration complexity of ADBO to obtain an ϵ-stationary point is upper bounded by O(1/ϵ^2). Thorough empirical studies on public datasets have been conducted to elucidate the effectiveness and efficiency of the proposed ADBO.

1. INTRODUCTION

Recently, bilevel optimization has emerged due to its popularity in various machine learning applications, e.g., hyperparameter optimization (Khanduri et al., 2021; Liu et al., 2021a), meta-learning (Likhosherstov et al., 2021; Ji et al., 2020), reinforcement learning (Hong et al., 2020; Zhou & Liu, 2022), and neural architecture search (Jiang et al., 2020; Jiao et al., 2022b). In bilevel optimization, one optimization problem is embedded or nested within another. Specifically, the outer problem is called the upper-level optimization problem and the inner problem is called the lower-level optimization problem. A general form of the bilevel optimization problem can be written as

min_{x, y} F(x, y)   s.t.   y = argmin_{y'} f(x, y'),

where F and f denote the upper-level and lower-level objective functions, respectively, and x ∈ R^n and y ∈ R^m are the variables. Bilevel optimization can be treated as a special case of constrained optimization, since the lower-level optimization problem can be viewed as a constraint on the upper-level optimization problem (Sinha et al., 2017).

The proliferation of smartphones and Internet of Things (IoT) devices has generated a plethora of data in various real-world applications. Centralized bilevel optimization approaches require collecting a massive amount of data from distributed edge devices and passing it to a centralized server for model training. These methods, however, may give rise to data privacy risks and encounter communication bottlenecks (Subramanya & Riggio, 2021). To tackle these challenges, distributed algorithms have recently been developed to solve decentralized bilevel optimization problems (Yang et al., 2022; Chen et al., 2022b; Lu et al., 2022). Tarzanagh et al. (2022) and Li et al. (2022) study bilevel optimization problems under a federated setting.
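To make the nested structure concrete, here is a minimal sketch (not from the paper) of hyperparameter optimization cast as a bilevel problem: the lower level fits a ridge regression with penalty e^x on training data, and the upper level evaluates the resulting weights on validation data. All data and names (A_tr, b_tr, etc.) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data (hypothetical, not from the paper): a training
# split for the lower level and a validation split for the upper level.
A_tr, b_tr = rng.normal(size=(20, 5)), rng.normal(size=20)
A_val, b_val = rng.normal(size=(20, 5)), rng.normal(size=20)

def lower_level(x):
    # y*(x) = argmin_y 0.5||A_tr y - b_tr||^2 + 0.5 e^x ||y||^2 (ridge, closed form)
    n = A_tr.shape[1]
    return np.linalg.solve(A_tr.T @ A_tr + np.exp(x) * np.eye(n), A_tr.T @ b_tr)

def upper_level(x):
    # F(x, y*(x)): every upper-level evaluation requires solving the inner problem.
    y_star = lower_level(x)
    return 0.5 * np.sum((A_val @ y_star - b_val) ** 2)

# Naive outer loop: grid search over the log-penalty x.
grid = np.linspace(-4.0, 4.0, 81)
losses = np.array([upper_level(x) for x in grid])
x_best = grid[np.argmin(losses)]
```

The sketch highlights why bilevel problems are expensive: each evaluation of the upper-level objective embeds a full solve of the lower-level problem.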
Specifically, the distributed bilevel optimization problem can be given by

min_{x, y} F(x, y) = Σ_{i=1}^{N} G_i(x, y)   s.t.   y = argmin_{y'} f(x, y') = Σ_{i=1}^{N} g_i(x, y'),

where N is the number of workers (devices), and G_i and g_i denote the local upper-level and lower-level objective functions, respectively. Although existing approaches have shown success in solving distributed bilevel optimization problems, they focus only on the synchronous distributed setting. Synchronous distributed methods may encounter the straggler problem (Jiang et al., 2021), and their speed is limited by the worker with the maximum delay (Chang et al., 2016). Moreover, synchronous distributed methods stop working entirely if a few workers fail to respond (Zhang & Kwok, 2014), which is common in large-scale distributed systems. These issues give rise to the following question: Can we design an asynchronous distributed algorithm for bilevel optimization?

To this end, we develop an Asynchronous Distributed Bilevel Optimization (ADBO) algorithm, which is single-loop and computationally efficient. ADBO regards the lower-level optimization problem as a constraint on the upper-level optimization problem and utilizes cutting planes to approximate this constraint. The approximate problem is then solved in an asynchronous distributed manner. We prove that even if both the upper-level and lower-level objectives are nonconvex, ADBO is guaranteed to converge, and we theoretically derive its iteration complexity. To facilitate comparison, we present a centralized bilevel optimization algorithm in Appendix A and compare the convergence results of ADBO to state-of-the-art bilevel optimization algorithms in both centralized and distributed settings in Table 1.

Contributions. Our contributions can be summarized as follows:

1. We propose a novel algorithm, ADBO, to solve the bilevel optimization problem in an asynchronous distributed manner. ADBO is a single-loop algorithm and is computationally efficient. To the best of our knowledge, this is the first work to tackle the asynchronous distributed bilevel optimization problem.

2. We demonstrate that ADBO can be applied to bilevel optimization with nonconvex upper-level and lower-level objectives with constraints. We also theoretically show that the iteration complexity for ADBO to obtain an ϵ-stationary point is upper bounded by O(1/ϵ^2).

3. Our thorough empirical studies justify the superiority of ADBO over existing state-of-the-art methods.
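To illustrate the general cutting-plane idea that ADBO builds on — approximating a difficult set or function by the maximum of linear tangent planes — here is a generic one-dimensional Kelley-style sketch. This is not the ADBO procedure itself; the objective f and all constants are illustrative assumptions.

```python
import numpy as np

# Kelley's cutting-plane method in one dimension: approximate a convex function
# by the maximum of its tangent (cutting) planes, minimize that piecewise-linear
# model over a grid, and add a new cut at the model minimizer.
f  = lambda x: x ** 2 + np.exp(x)      # illustrative convex objective
df = lambda x: 2 * x + np.exp(x)       # derivative; defines the slope of each cut

lo, hi = -3.0, 3.0
grid = np.linspace(lo, hi, 2001)
points = [lo, hi]                      # locations where cuts have been generated
for _ in range(30):
    # model m(x) = max_k [ f(x_k) + f'(x_k)(x - x_k) ], a lower bound on f
    model = np.max([f(xk) + df(xk) * (grid - xk) for xk in points], axis=0)
    x_new = grid[np.argmin(model)]     # minimize the model and refine it there
    points.append(x_new)
```

Each iteration tightens the piecewise-linear under-approximation near the current model minimizer, so the model minimum approaches the true minimum from below.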

2. RELATED WORK

Bilevel optimization: The bilevel optimization problem was first introduced by Bracken & McGill (1973). In recent years, many approaches have been developed to solve this problem, and they can be divided into three categories (Gould et al., 2016). The first type of approach assumes there is an analytical solution ϕ(x) = argmin_{y'} f(x, y') to the lower-level optimization problem (Zhang et al., 2021). In this case, the bilevel optimization problem can be simplified to a single-level optimization problem, min_x F(x, ϕ(x)). Nevertheless, finding the analytical solution to the lower-level optimization problem is often very difficult, if not impossible. The second type of approach replaces the lower-level optimization problem with sufficient conditions for optimality (e.g., the KKT conditions) (Biswas & Hoyle, 2019; Sinha et al., 2017). The bilevel program can then be reformulated as a single-level constrained optimization problem. However, the resulting problem can be hard to solve, since it often involves a large number of constraints (Ji et al., 2021; Gould et al., 2016). The third type of approach comprises gradient-based methods (Ghadimi & Wang, 2018; Hong et al., 2020; Liao et al., 2018) that compute the hypergradient (or an estimate of it), i.e., ∂F(x,y)/∂x + (∂y/∂x)^T ∂F(x,y)/∂y, and use gradient descent to solve the bilevel optimization
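For a concrete instance of the hypergradient, consider a quadratic lower-level problem, where the implicit function theorem gives dy*/dx in closed form. This toy setup (matrices H and B, target y_t, weight mu) is an illustrative assumption, not taken from any cited method.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, mu = 3, 4, 0.1
H = rng.normal(size=(m, m))
H = H @ H.T + m * np.eye(m)            # SPD Hessian of the lower level in y
B = rng.normal(size=(m, n))
y_t = rng.normal(size=m)               # target used by the upper-level loss

# Lower level: f(x, y) = 0.5 y^T H y - y^T B x, so y*(x) = H^{-1} B x.
def y_star(x):
    return np.linalg.solve(H, B @ x)

# Upper level: F(x, y) = 0.5 ||y - y_t||^2 + 0.5 mu ||x||^2.
def hypergradient(x):
    # Implicit function theorem: dy*/dx = -(grad^2_yy f)^{-1} grad^2_yx f,
    # which equals H^{-1} B here since grad_y f = H y - B x.
    y = y_star(x)
    dy_dx = np.linalg.solve(H, B)
    return mu * x + dy_dx.T @ (y - y_t)

x0 = rng.normal(size=n)
g = hypergradient(x0)
```

Because the lower-level Hessian H is positive definite, y*(x) is the unique minimizer, and the hypergradient matches the derivative of x ↦ F(x, y*(x)); a finite-difference check confirms this.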

