ASYNCHRONOUS DISTRIBUTED BILEVEL OPTIMIZATION

Abstract

Bilevel optimization plays an essential role in many machine learning tasks, ranging from hyperparameter optimization to meta-learning. Existing studies on bilevel optimization, however, focus on either the centralized or the synchronous distributed setting. Centralized bilevel optimization approaches require collecting a massive amount of data on a single server, which inevitably incurs significant communication expenses and may give rise to data privacy risks. Synchronous distributed bilevel optimization algorithms, on the other hand, often face the straggler problem and will immediately stop working if a few workers fail to respond. As a remedy, we propose the Asynchronous Distributed Bilevel Optimization (ADBO) algorithm. The proposed ADBO can tackle bilevel optimization problems with both nonconvex upper-level and lower-level objective functions, and its convergence is theoretically guaranteed. Furthermore, our analysis reveals that the iteration complexity of ADBO to obtain an ϵ-stationary point is upper bounded by O(1/ϵ²). Thorough empirical studies on public datasets have been conducted to elucidate the effectiveness and efficiency of the proposed ADBO.

1. INTRODUCTION

Recently, bilevel optimization has attracted significant attention due to its popularity in various machine learning applications, e.g., hyperparameter optimization (Khanduri et al., 2021; Liu et al., 2021a), meta-learning (Likhosherstov et al., 2021; Ji et al., 2020), reinforcement learning (Hong et al., 2020; Zhou & Liu, 2022), and neural architecture search (Jiang et al., 2020; Jiao et al., 2022b). In bilevel optimization, one optimization problem is nested within another. Specifically, the outer problem is called the upper-level optimization problem and the inner problem is called the lower-level optimization problem. A general form of the bilevel optimization problem can be written as,

min_{x, y} F(x, y)  s.t.  y = argmin_{y′} f(x, y′),   (1)

where F and f denote the upper-level and lower-level objective functions, respectively, and x ∈ R^n and y ∈ R^m are variables. Bilevel optimization can be treated as a special case of constrained optimization, since the lower-level optimization problem can be viewed as a constraint on the upper-level optimization problem (Sinha et al., 2017). The proliferation of smartphones and Internet of Things (IoT) devices has generated a plethora of data in various real-world applications. Centralized bilevel optimization approaches require collecting a massive amount of data from distributed edge devices and passing it to a centralized server for model training. These methods, however, may give rise to data privacy risks and encounter communication bottlenecks (Subramanya & Riggio, 2021). To tackle these challenges, distributed algorithms have recently been developed to solve decentralized bilevel optimization problems (Yang et al., 2022; Chen et al., 2022b; Lu et al., 2022). Tarzanagh et al. (2022) and Li et al. (2022) study bilevel optimization problems under a federated setting.
Specifically, the distributed bilevel optimization problem can be given by

min_{x, y} F(x, y) = Σ_{i=1}^N G_i(x, y)  s.t.  y = argmin_{y′} f(x, y′) = Σ_{i=1}^N g_i(x, y′),   (2)

where N is the number of workers (devices), and G_i and g_i denote the local upper-level and lower-level objective functions, respectively. Although existing approaches have shown their success in resolving distributed bilevel optimization problems, they only focus on the synchronous distributed setting. Synchronous distributed methods may encounter the straggler problem (Jiang et al., 2021), and their speed is limited by the worker with the maximum delay (Chang et al., 2016). Moreover, a synchronous distributed method will immediately stop working if a few workers fail to respond (Zhang & Kwok, 2014), which is common in large-scale distributed systems. The aforementioned issues give rise to the following question: Can we design an asynchronous distributed algorithm for bilevel optimization? To this end, we develop an Asynchronous Distributed Bilevel Optimization (ADBO) algorithm, which is single-loop and computationally efficient. The proposed ADBO regards the lower-level optimization problem as a constraint on the upper-level optimization problem and utilizes cutting planes to approximate this constraint. The approximate problem is then solved in an asynchronous distributed manner. We prove that even if both the upper-level and lower-level objectives are nonconvex, the proposed ADBO is guaranteed to converge. The iteration complexity of ADBO is also theoretically derived. To facilitate comparison, we not only present a centralized bilevel optimization algorithm in Appendix A, but also compare the convergence results of ADBO to state-of-the-art bilevel optimization algorithms in both centralized and distributed settings in Table 1. Contributions. Our contributions can be summarized as: 1.
We propose a novel algorithm, ADBO, to solve the bilevel optimization problem in an asynchronous distributed manner. ADBO is a single-loop algorithm and is computationally efficient. To the best of our knowledge, this is the first work to tackle the asynchronous distributed bilevel optimization problem. 2. We demonstrate that the proposed ADBO can be applied to bilevel optimization with nonconvex upper-level and lower-level objectives with constraints. We also theoretically derive that the iteration complexity for the proposed ADBO to obtain an ϵ-stationary point is upper bounded by O(1/ϵ²). 3. Our thorough empirical studies justify the superiority of the proposed ADBO over existing state-of-the-art methods.

2. RELATED WORK

Bilevel optimization: The bilevel optimization problem was first introduced by Bracken & McGill (1973). In recent years, many approaches have been developed to solve this problem, and they can be divided into three categories (Gould et al., 2016). The first type of approach assumes there is an analytical solution to the lower-level optimization problem (i.e., ϕ(x) = argmin_{y′} f(x, y′)) (Zhang et al., 2021). In this case, the bilevel optimization problem can be simplified to a single-level optimization problem (i.e., min_x F(x, ϕ(x))). Nevertheless, finding the analytical solution of the lower-level optimization problem is often very difficult, if not impossible. The second type of approach replaces the lower-level optimization problem with sufficient conditions for optimality (e.g., KKT conditions) (Biswas & Hoyle, 2019; Sinha et al., 2017). The bilevel program can then be reformulated as a single-level constrained optimization problem. However, the resulting problem can be hard to solve since it often involves a large number of constraints (Ji et al., 2021; Gould et al., 2016). The third type of approach comprises gradient-based methods (Ghadimi & Wang, 2018; Hong et al., 2020; Liao et al., 2018) that compute the hypergradient (or an estimate of it), i.e., ∂F(x,y)/∂x + (∂F(x,y)/∂y)(∂y/∂x), and use gradient descent to solve the bilevel optimization problem. Most existing bilevel optimization methods focus on centralized settings and require collecting a massive amount of data from distributed edge devices (workers). This may give rise to data privacy risks and encounter communication bottlenecks (Subramanya & Riggio, 2021). Asynchronous distributed optimization: To alleviate the aforementioned issues in the centralized setting, various distributed optimization methods can be employed.
Distributed optimization methods can be broadly divided into synchronous and asynchronous methods (Assran et al., 2020). In synchronous distributed methods (Boyd et al., 2011), the master needs to wait for updates from all workers before it proceeds to the next iteration. Therefore, it may suffer from the straggler problem, and its speed is limited by the worker with the maximum delay (Chang et al., 2016). Several advanced techniques have been proposed to make synchronous algorithms more efficient, such as large batch sizes and warmup (Goyal et al., 2017; You et al., 2019; Huo et al., 2021; Liu & Mozafari, 2022; Wang et al., 2020). In asynchronous distributed methods (Chen et al., 2020; Matamoros, 2017), the master can update its variables once it receives updates from S workers, i.e., the active workers (1 ≤ S ≤ N, where N is the number of all workers). Asynchronous distributed algorithms are strongly preferred for large-scale distributed systems in practice since they do not suffer from the straggler problem (Jiang et al., 2021). Asynchronous distributed methods (Wu et al., 2017; Liu et al., 2017) have been employed in many real-world applications, such as Google's DistBelief system (Dean et al., 2012), the training on 10 million YouTube videos (Le, 2013), and federated learning for edge computing (Lu et al., 2019; Liu et al., 2021c). Since the action orders of the workers differ in the asynchronous distributed setting, resulting in complex interaction dynamics (Jiang et al., 2021), the theoretical analysis of asynchronous distributed algorithms is usually more challenging than that of their synchronous counterparts. In summary, synchronous and asynchronous algorithms have different application scenarios. When the delays of the workers do not differ much, a synchronous algorithm suits better.
When there are stragglers in the distributed system, an asynchronous algorithm is preferable. So far, existing works on distributed bilevel optimization focus only on the synchronous setting (Tarzanagh et al., 2022; Li et al., 2022; Chen et al., 2022b); how to design an asynchronous algorithm for distributed bilevel optimization remains under-explored. To the best of our knowledge, this is the first work that designs an asynchronous algorithm for distributed bilevel optimization.
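To make the gradient-based (third) category above concrete, the following is a minimal sketch of a hypergradient computation on a toy one-dimensional bilevel problem whose lower-level solution map y*(x) is available in closed form; the functions F and f below are illustrative choices, not objectives from any of the cited works:

```python
# Toy 1-D bilevel problem (hypothetical illustration):
#   upper level: F(x, y) = (y - 3)^2 + 0.1 * x^2
#   lower level: f(x, y) = 0.5 * (y - 2*x)^2, so y*(x) = 2*x and dy*/dx = 2.
def F(x, y):
    return (y - 3.0) ** 2 + 0.1 * x ** 2

def y_star(x):                      # closed-form lower-level solution
    return 2.0 * x

def hypergradient(x):
    y = y_star(x)
    dF_dx = 0.2 * x                 # partial F / partial x
    dF_dy = 2.0 * (y - 3.0)         # partial F / partial y
    dy_dx = 2.0                     # dy*/dx from the lower-level solution map
    return dF_dx + dF_dy * dy_dx    # total derivative of F(x, y*(x))

# check against a central finite difference of x -> F(x, y*(x))
x0, eps = 0.7, 1e-6
fd = (F(x0 + eps, y_star(x0 + eps)) - F(x0 - eps, y_star(x0 - eps))) / (2 * eps)
assert abs(hypergradient(x0) - fd) < 1e-4
```

When no closed form for y*(x) exists, gradient-based methods replace y_star with K steps of gradient descent on f and differentiate through (or around) that loop, which is exactly where the estimation error discussed above enters.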

3. ASYNCHRONOUS DISTRIBUTED BILEVEL OPTIMIZATION

In this section, we propose Asynchronous Distributed Bilevel Optimization (ADBO) to solve the distributed bilevel optimization problem in an asynchronous manner. First, we reformulate the problem in Eq. (2) as a consensus problem (Matamoros, 2017; Chang et al., 2016),

min F({x_i}, {y_i}, v, z) = Σ_{i=1}^N G_i(x_i, y_i)
s.t. x_i = v, i = 1, ..., N,
     ({y_i}, z) = argmin_{{y′_i}, z′} { f(v, {y′_i}, z′) = Σ_{i=1}^N g_i(v, y′_i)  s.t.  y′_i = z′, i = 1, ..., N }
var. {x_i}, {y_i}, v, z,   (3)

where x_i ∈ R^n and y_i ∈ R^m are the local variables on the i-th worker, and v ∈ R^n and z ∈ R^m are the consensus variables in the master node. The reformulation in Eq. (3) is a consensus problem, which allows us to develop distributed training algorithms for bilevel optimization based on the parameter server architecture (Assran et al., 2020). As shown in Figure 13, in the parameter server architecture the communication is centralized around the master: workers pull the consensus variables v, z from the master and send their local variables x_i, y_i to it. Parameter server training is a well-known data-parallel approach for scaling up machine learning model training to a multitude of machines (Verbraeken et al., 2020). Most existing bilevel optimization works in machine learning consider only bilevel programs without upper-level and lower-level constraints (Franceschi et al., 2018; Yang et al., 2021; Chen et al., 2022a) or bilevel programs with only an upper-level (or lower-level) constraint (Zhang et al., 2022; Mehra & Hamm, 2021). In contrast, we focus on bilevel programs (i.e., Eq. (3)) with both lower-level and upper-level constraints, which is more challenging. By defining ϕ(v) = argmin_{{y′_i}, z′} f(v, {y′_i}, z′) and h(v, {y_i}, z) = ||[{y_i}; z] − ϕ(v)||², we can reformulate the problem in Eq. (3) as:

min F({x_i}, {y_i}, v, z) = Σ_{i=1}^N G_i(x_i, y_i)
s.t. x_i = v, i = 1, ..., N,
     h(v, {y_i}, z) = 0,
var. {x_i}, {y_i}, v, z.   (4)

To clarify how ADBO works, we sketch its procedure.
First, ADBO computes an estimate of the solution to the lower-level optimization problem. Then, inspired by the cutting plane method, a set of cutting planes is utilized to approximate the feasible region of the upper-level optimization problem. Finally, we propose the asynchronous algorithm for solving the resulting problem and the procedure for updating the cutting planes. The remaining contents are divided into four parts: the estimate of the solution to the lower-level optimization problem, the polyhedral approximation, the asynchronous algorithm, and updating the cutting planes.

3.1. Estimate of Solution to Lower-level Optimization Problem

A consensus problem, i.e., the lower-level optimization problem in Eq. (3), needs to be solved in a distributed manner if an exact ϕ(v) is desired. Following existing works on bilevel optimization (Li et al., 2022; Gould et al., 2016; Yang et al., 2021), instead of pursuing the exact ϕ(v), an estimate of ϕ(v) can be utilized. For this purpose, we first obtain the first-order Taylor approximation of g_i(v, {y′_i}) with respect to v, i.e., for a given point v̄,

ḡ_i(v, {y′_i}) = g_i(v̄, {y′_i}) + ∇_v g_i(v̄, {y′_i})^⊤ (v − v̄).

Then, similar to many works that use K steps of gradient descent (GD) to approximate the optimal solution of the lower-level optimization problem (Ji et al., 2021; Yang et al., 2021; Liu et al., 2021b), we utilize the results after K communication rounds between the workers and the master to approximate ϕ(v). Specifically, given ḡ_i(v, {y′_i}), the augmented Lagrangian function of the lower-level optimization problem in Eq. (3) can be expressed as,

g_p(v, {y′_i}, z′, {φ_i}) = Σ_{i=1}^N [ ḡ_i(v, y′_i) + φ_i^⊤ (y′_i − z′) + (µ/2) ||y′_i − z′||² ],

where φ_i ∈ R^m is the dual variable and µ > 0 is a penalty parameter. In the (k+1)-th iteration, we have: (1) Workers update their local variables as follows,

y′_{i,k+1} = y′_{i,k} − η_y ∇_{y_i} g_p(v, {y′_{i,k}}, z′_k, {φ_{i,k}}),

where η_y is the step-size. The workers then transmit the local variables y′_{i,k+1} to the master. (2) After receiving the updates from the workers, the master updates its variables as follows,

z′_{k+1} = z′_k − η_z ∇_z g_p(v, {y′_{i,k}}, z′_k, {φ_{i,k}}),
φ_{i,k+1} = φ_{i,k} + η_φ ∇_{φ_i} g_p(v, {y′_{i,k+1}}, z′_{k+1}, {φ_{i,k}}),

where η_z and η_φ are step-sizes. Next, the master broadcasts z′_{k+1} and φ_{i,k+1} to the workers. As mentioned above, we utilize the results after K communication rounds to approximate ϕ(v), i.e.,

ϕ(v) = [ {y′_{i,0} − Σ_{k=0}^{K−1} η_y ∇_{y_i} g_p(v, {y′_{i,k}}, z′_k, {φ_{i,k}})} ; z′_0 − Σ_{k=0}^{K−1} η_z ∇_z g_p(v, {y′_{i,k}}, z′_k, {φ_{i,k}}) ].   (9)
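As a sanity check of these update rules, the sketch below runs the worker/master rounds on a toy scalar instance with two workers and quadratic local objectives g_i(v, y) = 0.5(y − a_i v)², whose consensus solution is z* = v · mean(a_i). The data a_i, step-sizes, and round count K are illustrative choices, and the loop is simulated in a single process rather than actually distributed:

```python
# Toy instance of the lower-level primal-dual rounds in Section 3.1
# (hypothetical data and step-sizes, scalar variables for brevity).
a = [1.0, 3.0]                      # hypothetical per-worker data
N, mu = len(a), 1.0
eta_y, eta_z, eta_phi = 0.1, 0.1, 0.05

def lower_level_estimate(v, K=3000):
    y, z, phi = [0.0] * N, 0.0, [0.0] * N
    for _ in range(K):
        # (1) workers: gradient step on the augmented Lagrangian w.r.t. y_i
        y_new = [y[i] - eta_y * ((y[i] - a[i] * v) + phi[i] + mu * (y[i] - z))
                 for i in range(N)]
        # (2) master: gradient step on z using the y_i it last received,
        #     then a dual ascent step on each phi_i
        z_new = z - eta_z * sum(-phi[i] - mu * (y[i] - z) for i in range(N))
        phi = [phi[i] + eta_phi * (y_new[i] - z_new) for i in range(N)]
        y, z = y_new, z_new
    return y, z

y, z = lower_level_estimate(v=1.0)
assert abs(z - 2.0) < 1e-2                   # consensus value mean(a) = 2
assert all(abs(yi - z) < 1e-2 for yi in y)   # local copies agree with z
```

With a large K the iterates settle at the consensus solution; the paper instead truncates at a small K and uses the resulting map as the estimate ϕ(v) in Eq. (9).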

3.2. Polyhedral Approximation

Given ϕ(v) in Eq. (9), the relaxed problem with respect to the problem in Eq. (4) is,

min F({x_i}, {y_i}, v, z) = Σ_{i=1}^N G_i(x_i, y_i)
s.t. x_i = v, i = 1, ..., N,
     h(v, {y_i}, z) ≤ ε,
var. {x_i}, {y_i}, v, z,   (10)

where ε > 0 is a constant. We assume that h(v, {y_i}, z) is a convex function with respect to (v, {y_i}, z), which is always satisfied when we set K = 1 in Eq. (9) according to the operations that preserve convexity (Boyd et al., 2004). Since the sublevel set of a convex function is convex (Boyd et al., 2004), the feasible set with respect to the constraint h(v, {y_i}, z) ≤ ε is a convex set. In this paper, inspired by the cutting plane method (Boyd & Vandenberghe, 2007; Michalka, 2013; Franc et al., 2011; Yang et al., 2014), a set of cutting planes is utilized to approximate the feasible region of the constraint h(v, {y_i}, z) ≤ ε in Eq. (10). The set of cutting planes forms a polytope; let P_t denote the polytope in the (t+1)-th iteration, which can be expressed as,

P_t = { a_l^⊤ v + Σ_{i=1}^N b_{i,l}^⊤ y_i + c_l^⊤ z + κ_l ≤ 0, l = 1, ..., |P_t| },

where a_l ∈ R^n, b_{i,l} ∈ R^m, c_l ∈ R^m and κ_l ∈ R are the parameters of the l-th cutting plane, and |P_t| denotes the number of cutting planes in P_t. Thus, the approximate problem in the (t+1)-th iteration can be expressed as follows,

min F({x_i}, {y_i}, v, z) = Σ_{i=1}^N G_i(x_i, y_i)
s.t. x_i = v, i = 1, ..., N,
     a_l^⊤ v + Σ_{i=1}^N b_{i,l}^⊤ y_i + c_l^⊤ z + κ_l ≤ 0, l = 1, ..., |P_t|,
var. {x_i}, {y_i}, v, z.   (12)

The cutting planes are updated to refine the approximation; details are given in Section 3.4.
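A polytope of cutting planes is just a list of linear constraints, so checking whether a candidate point satisfies the current polyhedral approximation takes a few lines. The planes and dimensions below are hypothetical (n = m = 1, N = 2):

```python
import numpy as np

# Each cut is stored as (a, [b_1, ..., b_N], c, kappa) and reads
#   a^T v + sum_i b_i^T y_i + c^T z + kappa <= 0.
def violates(plane, v, ys, z):
    a, bs, c, kappa = plane
    value = float(a @ v + sum(b @ y for b, y in zip(bs, ys)) + c @ z + kappa)
    return value > 0.0

# two hypothetical cutting planes
P = [
    (np.array([1.0]), [np.array([0.5]), np.array([0.5])], np.array([0.0]), -2.0),
    (np.array([-1.0]), [np.array([0.0]), np.array([0.0])], np.array([1.0]), -1.0),
]
v, ys, z = np.array([1.0]), [np.array([1.0]), np.array([1.0])], np.array([0.5])
inside = not any(violates(p, v, ys, z) for p in P)   # point satisfies all cuts
```

A point that violates at least one cut lies outside the current polytope, which is precisely the situation the update rules of Section 3.4 react to.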

3.3. Asynchronous Algorithm

In the proposed ADBO, we solve the distributed bilevel optimization problem in an asynchronous manner. The Lagrangian function of Eq. (12) can be written as:

L_p = Σ_{i=1}^N G_i(x_i, y_i) + Σ_{l=1}^{|P_t|} λ_l ( a_l^⊤ v + Σ_{i=1}^N b_{i,l}^⊤ y_i + c_l^⊤ z + κ_l ) + Σ_{i=1}^N θ_i^⊤ (x_i − v),   (13)

where λ_l ∈ R and θ_i ∈ R^n are dual variables, and L_p is the simplified form of L_p({x_i}, {y_i}, v, z, {λ_l}, {θ_i}). The regularized version (Xu et al., 2020) of Eq. (13) is employed to update all variables:

L̃_p({x_i}, {y_i}, v, z, {λ_l}, {θ_i}) = L_p − Σ_{l=1}^{|P_t|} (c_1^t/2) ||λ_l||² − Σ_{i=1}^N (c_2^t/2) ||θ_i||²,

where c_1^t and c_2^t denote the regularization coefficients in the (t+1)-th iteration; with a slight abuse of notation, we continue to write L_p for this regularized function in the sequel. In each iteration, we require that |P_t| ≤ M, ∀t. c_1^t = 1/(η_λ (t+1)^{1/4}) ≥ c_1 and c_2^t = 1/(η_θ (t+1)^{1/4}) ≥ c_2 are two nonnegative non-increasing sequences, where η_λ and η_θ are positive constants, and the constants c_1, c_2 satisfy 0 < c_1 ≤ 1/(η_λ c), 0 < c_2 ≤ 1/(η_θ c), with c = ((4Mα_3/η_λ² + 4Nα_4/η_θ²)² (1/ϵ²) + 1)^{1/4} (ϵ, α_3, α_4 are introduced in Section 4). Following (Zhang & Kwok, 2014), to alleviate the staleness issue in ADBO, the master updates its variables once it receives updates from S active workers at every iteration, and every worker has to communicate with the master at least once every τ iterations. In the (t+1)-th iteration, let Q^{t+1} denote the index set of active workers; the proposed algorithm proceeds as follows. (1) Active workers update their local variables as follows,

x_i^{t+1} = x_i^t − η_x ∇_{x_i} L_p({x_i^{t_i}}, {y_i^{t_i}}, v^{t_i}, z^{t_i}, {λ_l^{t_i}}, {θ_i^{t_i}}) if i ∈ Q^{t+1}, and x_i^{t+1} = x_i^t otherwise,   (15)
y_i^{t+1} = y_i^t − η_y ∇_{y_i} L_p({x_i^{t_i}}, {y_i^{t_i}}, v^{t_i}, z^{t_i}, {λ_l^{t_i}}, {θ_i^{t_i}}) if i ∈ Q^{t+1}, and y_i^{t+1} = y_i^t otherwise,   (16)

where t_i denotes the last iteration during which worker i was active, and η_x and η_y are step-sizes. The active workers then transmit their local variables x_i^{t+1} and y_i^{t+1} to the master.
(2) After receiving the updates from the active workers, the master updates its variables as follows,

v^{t+1} = v^t − η_v ∇_v L_p({x_i^{t+1}}, {y_i^{t+1}}, v^t, z^t, {λ_l^t}, {θ_i^t}),   (17)
z^{t+1} = z^t − η_z ∇_z L_p({x_i^{t+1}}, {y_i^{t+1}}, v^{t+1}, z^t, {λ_l^t}, {θ_i^t}),   (18)
λ_l^{t+1} = λ_l^t + η_λ ∇_{λ_l} L_p({x_i^{t+1}}, {y_i^{t+1}}, v^{t+1}, z^{t+1}, {λ_l^t}, {θ_i^t}),   (19)
θ_i^{t+1} = θ_i^t + η_θ ∇_{θ_i} L_p({x_i^{t+1}}, {y_i^{t+1}}, v^{t+1}, z^{t+1}, {λ_l^{t+1}}, {θ_i^t}) if i ∈ Q^{t+1}, and θ_i^{t+1} = θ_i^t otherwise,   (20)

where η_v, η_z, η_λ and η_θ are step-sizes. Next, the master broadcasts v^{t+1}, z^{t+1}, θ_i^{t+1} and {λ_l^{t+1}} to each worker i ∈ Q^{t+1} (i.e., the active workers). Details are summarized in Algorithm 1.
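The waiting pattern above (the master proceeds after the first S arrivals, while stragglers keep working on stale copies) can be simulated with an event queue. In the sketch below the delay model and the master "update" (a bare version counter) are placeholders rather than the actual gradient steps of Eqs. (15)-(20), and the rule that every worker reports at least once every τ iterations is omitted for brevity:

```python
import heapq
import random

# Event-driven simulation of the asynchronous protocol: the master applies
# an update as soon as S of the N workers have reported.
random.seed(0)
N, S, T = 6, 3, 20
clock, version = 0.0, 0
staleness = []   # age (in master versions) of each applied worker update

# priority queue of (arrival_time, worker_id, model_version_the_worker_used)
events = [(random.expovariate(1.0), i, version) for i in range(N)]
heapq.heapify(events)

for _ in range(T):
    arrived = []
    while len(arrived) < S:                  # wait for S active workers only
        clock, i, ver = heapq.heappop(events)
        arrived.append((i, ver))
    version += 1                             # master update (placeholder)
    for i, ver in arrived:                   # active workers pull the new
        staleness.append(version - 1 - ver)  # version and start a new round
        heapq.heappush(events, (clock + random.expovariate(1.0), i, version))
```

Running this shows why the theory must track staleness: an applied update may have been computed against a model several versions old, which is exactly the t_i-indexed gradients in Eqs. (15) and (16).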

3.4. Updating Cutting Planes

Every k_pre iterations (k_pre > 0 is a pre-set constant, which can be chosen flexibly), the cutting planes are updated based on the following two steps (a) and (b) when t < T_1:

(a) Removing the inactive cutting planes:

P^{t+1} = Drop(P_t, cp_l) if λ_l^{t+1} = 0 and λ_l^t = 0, and P^{t+1} = P_t otherwise,   (21)

where cp_l represents the l-th cutting plane in P_t and Drop(P_t, cp_l) means that the l-th cutting plane cp_l is removed from P_t. The dual variable set {λ^{t+1}} is updated as follows,

{λ^{t+1}} = Drop({λ^t}, λ_l) if λ_l^{t+1} = 0 and λ_l^t = 0, and {λ^{t+1}} = {λ^t} otherwise,   (22)

where {λ^{t+1}} and {λ^t} represent the dual variable sets in the (t+1)-th and t-th iterations, respectively, and Drop({λ^t}, λ_l) means that λ_l is removed from the dual variable set {λ^t}.

(b) Adding new cutting planes. First, we check whether (v^{t+1}, {y_i^{t+1}}, z^{t+1}) is feasible for the constraint h(v, {y_i}, z) ≤ ε; h(v^{t+1}, {y_i^{t+1}}, z^{t+1}) can be obtained from ϕ(v^{t+1}) in Eq. (9). If (v^{t+1}, {y_i^{t+1}}, z^{t+1}) is not a feasible solution to the original problem (Eq. (10)), a new cutting plane cp_new^{t+1} is generated to separate the point (v^{t+1}, {y_i^{t+1}}, z^{t+1}) from the feasible region of the constraint h(v, {y_i}, z) ≤ ε. Thus, a valid cutting plane (Boyd & Vandenberghe, 2007) a_l^⊤ v + Σ_{i=1}^N b_{i,l}^⊤ y_i + c_l^⊤ z + κ_l ≤ 0 must satisfy

a_l^⊤ v + Σ_{i=1}^N b_{i,l}^⊤ y_i + c_l^⊤ z + κ_l ≤ 0 for all (v, {y_i}, z) with h(v, {y_i}, z) ≤ ε, and
a_l^⊤ v^{t+1} + Σ_{i=1}^N b_{i,l}^⊤ y_i^{t+1} + c_l^⊤ z^{t+1} + κ_l > 0.   (23)

Since h(v, {y_i}, z) is a convex function, we have

h(v, {y_i}, z) ≥ h(v^{t+1}, {y_i^{t+1}}, z^{t+1}) + [∂h(v^{t+1}, {y_i^{t+1}}, z^{t+1})/∂v; {∂h(v^{t+1}, {y_i^{t+1}}, z^{t+1})/∂y_i}; ∂h(v^{t+1}, {y_i^{t+1}}, z^{t+1})/∂z]^⊤ ([v; {y_i}; z] − [v^{t+1}; {y_i^{t+1}}; z^{t+1}]).   (24)

Combining Eq. (24) with Eq.
(23), we have that a valid cutting plane (with respect to the point (v^{t+1}, {y_i^{t+1}}, z^{t+1})) can be expressed as,

h(v^{t+1}, {y_i^{t+1}}, z^{t+1}) + [∂h(v^{t+1}, {y_i^{t+1}}, z^{t+1})/∂v; {∂h(v^{t+1}, {y_i^{t+1}}, z^{t+1})/∂y_i}; ∂h(v^{t+1}, {y_i^{t+1}}, z^{t+1})/∂z]^⊤ ([v; {y_i}; z] − [v^{t+1}; {y_i^{t+1}}; z^{t+1}]) ≤ ε.   (25)

For brevity, we use cp_new^{t+1} to denote the newly added cutting plane (i.e., Eq. (25)). The polytope P^{t+1} is then updated as follows,

P^{t+1} = Add(P^{t+1}, cp_new^{t+1}) if h(v^{t+1}, {y_i^{t+1}}, z^{t+1}) > ε, and P^{t+1} is unchanged otherwise,   (26)

where Add(P^{t+1}, cp_new^{t+1}) means that the new cutting plane cp_new^{t+1} is added to the polytope P^{t+1}. The dual variable set {λ^{t+1}} is updated as follows,

{λ^{t+1}} = Add({λ^{t+1}}, λ_{|P^{t+1}|}^{t+1}) if h(v^{t+1}, {y_i^{t+1}}, z^{t+1}) > ε, and {λ^{t+1}} is unchanged otherwise,   (27)

where Add({λ^{t+1}}, λ_{|P^{t+1}|}^{t+1}) means that the dual variable λ_{|P^{t+1}|}^{t+1} is added to the dual variable set {λ^{t+1}}. Finally, the master broadcasts the updated P^{t+1} and {λ^{t+1}} to all workers. The details of the proposed algorithm are summarized in Algorithm 1.
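The two maintenance steps can be sketched as follows, with the stacked point w = (v, {y_i}, z) treated as a single vector and a hypothetical convex function h(w) = ||w||² standing in for the lower-level optimality measure:

```python
import numpy as np

eps = 0.1

def h(w):                       # hypothetical convex surrogate for h(v,{y_i},z)
    return float(w @ w)

def grad_h(w):
    return 2.0 * w

# (a) drop planes whose multiplier was zero in two consecutive iterations
def drop_inactive(planes, lam_prev, lam_curr):
    keep = [l for l in range(len(planes))
            if not (lam_prev[l] == 0.0 and lam_curr[l] == 0.0)]
    return [planes[l] for l in keep], [lam_curr[l] for l in keep]

# (b) if w is infeasible (h(w) > eps), add the first-order cut
#     h(w_t) + grad_h(w_t)^T (w - w_t) <= eps, stored as (g, kappa)
#     so that the constraint reads g^T w + kappa <= 0
def add_cut(planes, lams, w_t):
    if h(w_t) > eps:
        g = grad_h(w_t)
        kappa = h(w_t) - float(g @ w_t) - eps
        planes.append((g, kappa))
        lams.append(0.0)        # fresh dual variable for the new cut
    return planes, lams

planes, lams = drop_inactive([("p0",), ("p1",)], [0.0, 1.0], [0.0, 0.5])
w = np.array([1.0, 1.0])
planes, lams = add_cut(planes, lams, w)
g, kappa = planes[-1]
assert float(g @ w) + kappa > 0     # the new cut separates the infeasible w
assert len(planes) == 2             # one plane dropped, one plane added
```

Convexity is what makes step (b) sound: for any feasible w′ with h(w′) ≤ eps, the first-order lower bound gives g^⊤w′ + kappa ≤ h(w′) − eps ≤ 0, so the cut never excludes a feasible point, yet it is violated at w itself.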

Algorithm 1 ADBO: Asynchronous Distributed Bilevel Optimization

Initialization: master iteration t = 0, variables {x_i^0}, {y_i^0}, v^0, z^0, {λ_l^0}, {θ_i^0} and polytope P_0.
repeat
  for each active worker i ∈ Q^{t+1} do
    update variables x_i^{t+1}, y_i^{t+1} according to Eq. (15) and (16);
  end for
  Active workers transmit their local variables to the master;
  for the master do
    update variables v^{t+1}, z^{t+1}, {λ_l^{t+1}}, {θ_i^{t+1}} according to Eq. (17), (18), (19) and (20);
  end for
  The master broadcasts its variables to the active workers;
  if (t + 1) mod k_pre == 0 and t < T_1 then
    the master computes ϕ(v^{t+1}) according to Eq. (9);
    the master updates P^{t+1} and {λ^{t+1}} according to Eq. (21), (22), (26) and (27);
    the master broadcasts P^{t+1} and {λ^{t+1}} to all workers;
  end if
  t = t + 1;
until termination.

4. DISCUSSION

Theorem 1 (Convergence) As cutting planes continue to be added to the polytope, the optimal objective value of the approximate problem in Eq. (12) converges monotonically. The proof of Theorem 1 is presented in Appendix C.

Definition 1 (Stationarity gap) Following (Xu et al., 2020; Lu et al., 2020; Jiao et al., 2022a), the stationarity gap of our problem at the t-th iteration is defined as the stacked vector

∇G^t = [ {∇_{x_i} L_p(·)}; {∇_{y_i} L_p(·)}; ∇_v L_p(·); ∇_z L_p(·); {∇_{λ_l} L_p(·)}; {∇_{θ_i} L_p(·)} ],

where all gradients are evaluated at ({x_i^t}, {y_i^t}, v^t, z^t, {λ_l^t}, {θ_i^t}).

Definition 2 (ϵ-stationary point) ({x_i^t}, {y_i^t}, v^t, z^t, {λ_l^t}, {θ_i^t}) is an ϵ-stationary point (ϵ ≥ 0) of a differentiable function L_p if ||∇G^t||² ≤ ϵ. T(ϵ) is the first iteration index such that ||∇G^t||² ≤ ϵ, i.e., T(ϵ) = min{t | ||∇G^t||² ≤ ϵ}.

Assumption 1 (Smoothness/Gradient Lipschitz) Following (Ji et al., 2021), we assume that L_p has Lipschitz continuous gradients, i.e., for any ω, ω′, there exists L > 0 such that ||∇L_p(ω) − ∇L_p(ω′)|| ≤ L||ω − ω′||.

Assumption 2 (Boundedness) Following (Qian et al., 2019), we assume that the variables are bounded, i.e., ||x_i||² ≤ α_1, ||v||² ≤ α_1, ||y_i||² ≤ α_2, ||z||² ≤ α_2, ||λ_l||² ≤ α_3, ||θ_i||² ≤ α_4. We also assume that before the ϵ-stationary point is obtained (i.e., t ≤ T(ϵ) − 1), the variables in the master satisfy ||v^{t+1} − v^t||² + ||z^{t+1} − z^t||² + Σ_l ||λ_l^{t+1} − λ_l^t||² ≥ ϑ, where ϑ > 0 is a relatively small constant, and that the change of the master's variables is upper bounded within τ iterations: ||v^t − v^{t−k}||² ≤ τ k_1 ϑ, ||z^t − z^{t−k}||² ≤ τ k_1 ϑ, Σ_l ||λ_l^t − λ_l^{t−k}||² ≤ τ k_1 ϑ, ∀1 ≤ k ≤ τ, where k_1 > 0 is a constant.
Theorem 2 (Iteration complexity) Suppose Assumptions 1 and 2 hold, and set the step-sizes as η_x = η_y = η_v = η_z = 2 / (L + η_λ M L² + η_θ N L² + 8(MγL²η_λ/c_1² + NγL²η_θ/c_2²)), η_λ < min{2/(L + 2c_1^0), 1/(30τ k_1 N L²)} and η_θ ≤ 2/(L + 2c_2^0). For a given ϵ, we have

T(ϵ) ∼ O( max{ (4Mα_3/η_λ² + 4Nα_4/η_θ²)² (1/ϵ²), ( 4(d_7 + η_θ(N−S)L²/2)(d̄ + k_d τ(τ−1)) d_6/ϵ + (T_1 + 2)^{1/2} )² } ),

where γ, d_6, d_7, d̄ and k_d are positive constants.

5. EXPERIMENT

In this section, experiments are conducted on two hyperparameter optimization tasks (i.e., the data hyper-cleaning task and the regularization coefficient optimization task) in the distributed setting to evaluate the performance of the proposed ADBO. The proposed ADBO is compared with the state-of-the-art distributed bilevel optimization method FEDNEST (Tarzanagh et al., 2022). In the data hyper-cleaning task, experiments are carried out on the MNIST (LeCun et al., 1998) and Fashion MNIST (Xiao et al., 2017) datasets. In the coefficient optimization task, following (Chen et al., 2022a), experiments are conducted on the Covertype (Blackard & Dean, 1999) and IJCNN1 (Prokhorov, 2001) datasets.

5.1. DATA HYPER-CLEANING

Following (Ji et al., 2021; Yang et al., 2021), we compare the performance of the proposed ADBO and the distributed bilevel optimization method FEDNEST on the distributed data hyper-cleaning task (Chen et al., 2022b) on the MNIST and Fashion MNIST datasets. Data hyper-cleaning involves training a classifier in a contaminated environment where each training label has been changed to a random class with a given probability (i.e., the corruption rate). In the experiment, the distributed data hyper-cleaning problem is considered, whose formulation can be expressed as,

min F(ψ, w) = Σ_{i=1}^N (1/|D_i^val|) Σ_{(x_j, y_j) ∈ D_i^val} L(x_j^⊤ w, y_j)
s.t. w = argmin_{w′} f(ψ, w′) = Σ_{i=1}^N (1/|D_i^tr|) Σ_{(x_j, y_j) ∈ D_i^tr} σ(ψ_j) L(x_j^⊤ w′, y_j) + C_r ||w′||²
var. ψ, w,

where D_i^tr and D_i^val denote the training and validation datasets on the i-th worker, respectively, and (x_j, y_j) denotes the j-th sample and its label. σ(·) is the sigmoid function, L is the cross-entropy loss, C_r is a regularization parameter, and N is the number of workers in the distributed system. On the MNIST and Fashion MNIST datasets, we set N = 18, S = 9 and τ = 15. Following Cohen et al. (2021), we assume that the communication delay of each worker obeys a heavy-tailed distribution.
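The objectives above can be sketched in a few lines; the version below is specialized to binary labels with the logistic loss (the actual experiments use multi-class cross-entropy, and the data here is random), which is enough to see the role of the hyperparameters ψ_j:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(x, y, w):                       # binary cross-entropy, y in {0, 1}
    p = sigmoid(x @ w)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def lower_level_loss(psi, w, X_tr, y_tr, C_r=0.1):
    per_example = np.array([loss(X_tr[j], y_tr[j], w) for j in range(len(y_tr))])
    # sigma(psi_j) softly keeps (near 1) or discards (near 0) example j
    return float(sigmoid(psi) @ per_example) / len(y_tr) + C_r * float(w @ w)

def upper_level_loss(w, X_val, y_val):
    return float(np.mean([loss(X_val[j], y_val[j], w) for j in range(len(y_val))]))

rng = np.random.default_rng(0)
X, y = rng.standard_normal((8, 3)), rng.integers(0, 2, 8)
w = 0.1 * rng.standard_normal(3)
# driving psi_j strongly negative removes every example from the training loss
assert lower_level_loss(-10.0 * np.ones(8), w, X, y) < lower_level_loss(np.zeros(8), w, X, y)
```

The bilevel structure then lets the validation loss, through the hypergradient with respect to ψ, learn which training examples are corrupted and should be down-weighted.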
The proposed ADBO is compared with the state-of-the-art distributed bilevel optimization method FEDNEST and with SDBO (Synchronous Distributed Bilevel Optimization, i.e., ADBO without the asynchronous setting). The test accuracy versus time is shown in Figure 1, and the test loss versus time is shown in Figure 2. We observe that the proposed ADBO is the most efficient algorithm because 1) the asynchronous setting is adopted in ADBO, so the master can update its variables once it receives updates from S active workers instead of waiting for all workers; and 2) ADBO is a single-loop algorithm and only gradient descent/ascent steps are required at each iteration, so ADBO is computationally more efficient.

5.2. REGULARIZATION COEFFICIENT OPTIMIZATION

Following (Chen et al., 2022a), we compare the proposed ADBO with the baseline algorithms FEDNEST and SDBO on the regularization coefficient optimization task with the Covertype and IJCNN1 datasets. The distributed regularization coefficient optimization problem is given by,

min F(ψ, w) = Σ_{i=1}^N (1/|D_i^val|) Σ_{(x_j, y_j) ∈ D_i^val} L(x_j^⊤ w, y_j)
s.t. w = argmin_{w′} f(ψ, w′) = Σ_{i=1}^N (1/|D_i^tr|) Σ_{(x_j, y_j) ∈ D_i^tr} L(x_j^⊤ w′, y_j) + Σ_{j=1}^n ψ_j (w′_j)²
var. ψ, w,

where ψ ∈ R^n, w ∈ R^n and L respectively denote the regularization coefficients, the model parameters, and the logistic loss, and w′ = [w′_1, ..., w′_n]. On the Covertype and IJCNN1 datasets, we set N = 18, S = 9, τ = 15 and N = 24, S = 12, τ = 15, respectively. We again assume that the delay of each worker obeys a heavy-tailed distribution. First, we compare the performance of the proposed ADBO, SDBO and FEDNEST in terms of test accuracy and test loss on the Covertype and IJCNN1 datasets; the results are shown in Figures 3 and 4. The proposed ADBO is more efficient for the same two reasons given in Section 5.1. We also consider the straggler problem, i.e., the case where workers with high delays (stragglers) exist in the distributed system, which heavily affects the efficiency of bilevel optimization methods in the synchronous distributed setting. In the experiment, we assume there are three stragglers in the distributed system, and the mean (communication + computation) delay of the stragglers is four times that of the normal workers. The results on the Covertype and IJCNN1 datasets are reported in Figures 5 and 6. The efficiency of the synchronous distributed algorithms (FEDNEST and SDBO) is significantly affected, while the proposed ADBO does not suffer from the straggler problem since it is an asynchronous method and only needs to consider the active workers.
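To see how the regularization coefficients steer the lower-level solution, one can replace the logistic loss by a squared loss so the one-dimensional lower-level problem has a closed form; this toy substitution is for illustration only and is not the objective used in the experiments:

```python
# 1-D ridge version of the lower level:
#   min_w (1/n) * sum_j 0.5*(x_j*w - y_j)^2 + psi*w^2
# has the closed form w*(psi) = mean(x*y) / (mean(x^2) + 2*psi).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # noiseless data with true slope 2

def w_star(psi):
    n = len(xs)
    num = sum(x * y for x, y in zip(xs, ys)) / n
    den = sum(x * x for x in xs) / n + 2.0 * psi
    return num / den

assert abs(w_star(0.0) - 2.0) < 1e-12     # no regularization: exact fit
assert w_star(1.0) < w_star(0.0)          # larger psi shrinks w toward 0
```

In the bilevel task, the upper level searches over ψ (here a single coefficient, in the experiments one per coordinate) so that the shrunk solution w*(ψ) generalizes best on the validation split.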

6. CONCLUSION

Existing bilevel optimization works focus on either the centralized or the synchronous distributed setting, which gives rise to data privacy risks or suffers from the straggler problem. As a remedy, we propose ADBO in this paper to solve the bilevel optimization problem in an asynchronous distributed manner. To the best of our knowledge, this is the first work that devises an asynchronous distributed algorithm for bilevel optimization. We demonstrate that the proposed ADBO can effectively tackle bilevel optimization problems with both nonconvex upper-level and lower-level objective functions. Theoretical analysis has also been conducted on the convergence properties and iteration complexity of ADBO. Extensive empirical studies on real-world datasets demonstrate the efficiency and effectiveness of the proposed ADBO.

A CUTTING PLANE METHOD FOR BILEVEL OPTIMIZATION

In this section, a cutting plane method, named CPBO, is proposed for bilevel optimization. Defining ϕ(x) = argmin_{y′} f(x, y′) and h(x, y) = ||y − ϕ(x)||², we can reformulate the problem in Eq. (1) as:

min F(x, y)  s.t.  h(x, y) = 0  var. x, y.   (34)

Following previous works on bilevel optimization (Li et al., 2022; Gould et al., 2016; Yang et al., 2021), it is not necessary to obtain the exact ϕ(x); an approximation of ϕ(x) is computed as follows. First, as in many works (Ji et al., 2021; Yang et al., 2021), we utilize K steps of gradient descent (GD) to approximate ϕ(x), and we consider the first-order Taylor approximation of f(x, y′) with respect to x, i.e., for a given point x̄, f̄(x, y′) = f(x̄, y′) + ∇_x f(x̄, y′)^⊤ (x − x̄). Thus, we have,

ϕ(x) = y′_0 − Σ_{k=0}^{K−1} η ∇_y f̄(x, y′_k),   (35)

where η is the step-size. Given the estimated ϕ(x) in Eq. (35), the relaxed version of the problem in Eq. (34) is,

min F(x, y)  s.t.  h(x, y) ≤ ε  var. x, y.   (36)

We assume that h(x, y) is a convex function with respect to (x, y), which is always satisfied when we set K = 1 in Eq. (35) according to the operations that preserve convexity (Boyd et al., 2004). Since the sublevel set of a convex function is convex, the feasible set of (x, y), i.e., Z_relax = {(x, y) ∈ R^n × R^m | h(x, y) ≤ ε}, is a convex set. We utilize a set of cutting plane constraints (i.e., linear constraints) to approximate the feasible set Z_relax. The set of cutting plane constraints forms a polytope, which can be expressed as follows,

P = {(x, y) ∈ R^n × R^m | a_l^⊤ x + b_l^⊤ y + κ_l ≤ 0, l = 1, ..., L},

where a_l ∈ R^n, b_l ∈ R^m and κ_l ∈ R are the parameters of the l-th cutting plane, and L represents the number of cutting planes in P. Consider the approximate problem, which can be expressed as follows,

min F(x, y)  s.t.  a_l^⊤ x + b_l^⊤ y + κ_l ≤ 0, l = 1, ..., |P_t|   (39)
var.
x, y, where P_t is the polytope at the (t+1)-th iteration, and |P_t| denotes the number of cutting planes in P_t. The Lagrangian function of Eq. (39) can be written as

L_p(x, y, {λ_l}) = F(x, y) + Σ_{l=1}^{|P_t|} λ_l (a_l^⊤ x + b_l^⊤ y + κ_l),

where λ_l is the dual variable of the l-th cutting plane. The proposed algorithm proceeds as follows at the (t+1)-th iteration. If t < T_1, the variables are updated as

x^{t+1} = x^t − η_x ∇_x L_p(x^t, y^t, {λ^t_l}), (41)
y^{t+1} = y^t − η_y ∇_y L_p(x^{t+1}, y^t, {λ^t_l}), (42)
λ^{t+1}_l = λ^t_l + η_{λ_l} ∇_{λ_l} L_p(x^{t+1}, y^{t+1}, {λ^t_l}), l = 1, · · · , |P_t|, (43)

where η_x, η_y and η_{λ_l} are the step sizes. Every k_pre iterations (k_pre > 0 is a preset constant that can be controlled flexibly), the cutting planes are updated based on the following two steps: Table 1: Convergence results of bilevel optimization algorithms (with centralized and distributed settings).
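The primal-dual updates in Eqs. (41)-(43) can be sketched on a toy problem. Everything below is an illustrative assumption: a single fixed cutting plane, a simple quadratic upper-level objective, and a projection of the dual variable onto [0, ∞) (the paper states the λ update as plain gradient ascent; the projection is our addition to keep the dual feasible).

```python
# Toy instance: min x^2 + y^2 subject to one cutting plane -x - y + 1 <= 0.
# KKT solution: x = y = 0.5 with multiplier lam = 1.
eta = 0.05
x = y = lam = 0.0
for _ in range(20000):
    # L_p(x, y, lam) = x^2 + y^2 + lam * (-x - y + 1)
    x_new = x - eta * (2 * x - lam)                    # Eq. (41)
    y_new = y - eta * (2 * y - lam)                    # Eq. (42), uses fresh x_new in general
    lam = max(0.0, lam + eta * (-x_new - y_new + 1))   # Eq. (43) + assumed projection
    x, y = x_new, y_new

print(round(x, 3), round(y, 3), round(lam, 3))  # 0.5 0.5 1.0
```

Because the primal objective is strongly convex and the constraint is linear, this gradient descent-ascent loop converges to the KKT point of the toy instance; the Gauss-Seidel ordering (x first, then y, then λ) matches Eqs. (41)-(43).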

Method | Centralized | Synchronous (Distributed) | Asynchronous (Distributed)
AID-BiO (Ghadimi & Wang, 2018) | O(1/ϵ^{1.25}) | NA | NA
AID-BiO (Ji et al., 2021) | O(1/ϵ) | NA | NA
ITD-BiO (Ji et al., 2021) | O(1/ϵ) | NA | NA
STABLE (Chen et al., 2022a) | O(1/ϵ^2)^1 | NA | NA
stocBio (Ji et al., 2021) | O(1/ϵ^2)^1 | NA | NA
VRBO (Yang et al., 2021) | O(1/ϵ^{1.5})^1 | NA | NA
FEDNEST (Tarzanagh et al., 2022) | NA | O(1/ϵ^2)^1 | NA
SPDB (Lu et al., 2022) | NA | O(1/ϵ^2)^1 | NA
DSBO (Yang et al., 2022) | NA | O(1/ϵ^2)^1 | NA
Proposed Method | O(1/ϵ) | NA | O(1/ϵ^2)^1

^1 Stochastic optimization algorithm.

(a) Removing the inactive cutting planes, that is,

P^{t+1} = Drop(P^t, cp_l) if λ^{t+1}_l = λ^t_l = 0, and P^{t+1} = P^t otherwise, (44)

where cp_l denotes the l-th cutting plane in P^t, and Drop(P^t, cp_l) denotes removing the l-th cutting plane cp_l from P^t. The dual variable set {λ^t} is updated accordingly:

{λ^{t+1}} = Drop({λ^t}, λ^t_l) if λ^{t+1}_l = λ^t_l = 0, and {λ^{t+1}} = {λ^t} otherwise, (45)

where {λ^{t+1}} and {λ^t} respectively denote the dual variable sets at the (t+1)-th and t-th iterations, and Drop({λ^t}, λ^t_l) denotes removing λ^t_l from the dual variable set {λ^t}.

(b) Adding new cutting planes. Firstly, we investigate whether (x^{t+1}, y^{t+1}) is a feasible solution to the original problem in Eq. (36). If it is not, i.e., h(x^{t+1}, y^{t+1}) > ε, a new cutting plane is generated to separate the point (x^{t+1}, y^{t+1}) from Z_relax; that is, a valid cutting plane a_l^⊤ x + b_l^⊤ y + κ_l ≤ 0 must satisfy

a_l^⊤ x + b_l^⊤ y + κ_l ≤ 0, ∀(x, y) ∈ Z_relax, and a_l^⊤ x^{t+1} + b_l^⊤ y^{t+1} + κ_l > 0. (46)

Since h(x, y) is a convex function, we have

h(x, y) ≥ h(x^{t+1}, y^{t+1}) + [∂h(x^{t+1}, y^{t+1})/∂x; ∂h(x^{t+1}, y^{t+1})/∂y]^⊤ ([x; y] − [x^{t+1}; y^{t+1}]). (47)

According to Eq. (47), h(x^{t+1}, y^{t+1}) + [∂h(x^{t+1}, y^{t+1})/∂x; ∂h(x^{t+1}, y^{t+1})/∂y]^⊤ ([x; y] − [x^{t+1}; y^{t+1}]) ≤ ε is a valid cutting plane at the point (x^{t+1}, y^{t+1}) which satisfies Eq. (46).
For brevity, we use cp^{t+1}_new to denote this cutting plane. Thus, we have

P^{t+1} = Add(P^{t+1}, cp^{t+1}_new) if h(x^{t+1}, y^{t+1}) > ε, and P^{t+1} is unchanged if h(x^{t+1}, y^{t+1}) ≤ ε, (48)

where Add(P^{t+1}, cp^{t+1}_new) denotes adding the new cutting plane cp^{t+1}_new to the polytope P^{t+1}. The dual variable set is updated accordingly:

{λ^{t+1}} = Add({λ^{t+1}}, λ^{t+1}_{|P^{t+1}|}) if h(x^{t+1}, y^{t+1}) > ε, and {λ^{t+1}} is unchanged if h(x^{t+1}, y^{t+1}) ≤ ε. (49)

Else if t ≥ T_1, the polytope P^{T_1} and the dual variables are fixed, and the variables x, y are updated as

x^{t+1} = x^t − η_x ∇_x L̃_p(x^t, y^t), (50)
y^{t+1} = y^t − η_y ∇_y L̃_p(x^{t+1}, y^t), (51)

where L̃_p(x, y) = F(x, y) + Σ_{l=1}^{|P^{T_1}|} λ_l [max{0, a_l^⊤ x + b_l^⊤ y + κ_l}]^2.

Algorithm 2 CPBO: Cutting Plane Method for Bilevel Optimization
Initialization: iteration t = 0, variables x^0, y^0, {λ^0_l} and polytope P^0.
repeat
  if t < T_1 then
    update variables x^{t+1}, y^{t+1} and λ^{t+1}_l according to Eqs. (41), (42) and (43);
    if (t + 1) mod k_pre == 0 then
      update the polytope P^{t+1} according to Eqs. (44) and (48);
      update the dual variable set {λ^{t+1}} according to Eqs. (45) and (49);
    end if
  else
    update variables x^{t+1} and y^{t+1} according to Eqs. (50) and (51);
  end if
  t = t + 1;
until termination.

The details of the proposed algorithm are summarized in Algorithm 2, and the comparison of the convergence results between the proposed method and state-of-the-art methods is summarized in Table 1.
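The cut-generation step (b) can be illustrated with a short sketch. The convex constraint function h below is a toy assumption (not the h(x, y) = ||y − ϕ(x)||^2 of CPBO); the sketch only demonstrates the two properties required by Eq. (46): the linearization from Eq. (47) is valid on the relaxed feasible set by convexity, and it separates the current infeasible iterate.

```python
import numpy as np

def new_cut(h, grad_h, z_plus):
    """Linearize the convex constraint function h at the point z_plus,
    as in Eq. (47): g(z) = h(z+) + grad_h(z+)^T (z - z+).
    The cut g(z) <= eps is valid on {z : h(z) <= eps} by convexity of h,
    and it separates z_plus whenever h(z_plus) > eps."""
    hz, g = h(z_plus), grad_h(z_plus)
    return lambda z: hz + g @ (z - z_plus)

# Toy convex h (an illustrative assumption): h(z) = ||z||^2, eps = 1.
h = lambda z: z @ z
grad_h = lambda z: 2.0 * z
eps = 1.0

z_plus = np.array([2.0, 0.0])            # infeasible: h(z_plus) = 4 > eps
cut = new_cut(h, grad_h, z_plus)

print(cut(z_plus) > eps)                 # True: the new cut separates z_plus
print(cut(np.array([0.5, 0.5])) <= eps)  # True: feasible points satisfy the cut
```

Since the linearization never exceeds h, any point with h(z) ≤ ε automatically satisfies the cut, while g(z_plus) = h(z_plus) > ε, which is exactly the separation condition in Eq. (46).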

A.1 EXPERIMENT

To evaluate the performance of the proposed CPBO, experiments are carried out on two applications: 1) hyperparameter optimization and 2) meta-learning. In hyperparameter optimization, we compare CPBO with the baseline algorithms stocBio (Ji et al., 2021), STABLE (Chen et al., 2022a), VRBO (Yang et al., 2021), and AID-CG (Grazzi et al., 2020) on the regularization coefficient optimization task (Chen et al., 2022a) with the Covertype (Blackard & Dean, 1999) and IJCNN1 (Prokhorov, 2001) datasets. We compare the performance of the proposed CPBO with all competing algorithms in terms of both test accuracy and test loss, as shown in Figures 7 and 8. In meta-learning, we focus on the bilevel optimization problem in (Rajeswaran et al., 2019), and compare the proposed CPBO with the baseline algorithms MAML (Finn et al., 2017), iMAML (Rajeswaran et al., 2019), and ANIL (Raghu et al., 2019) on the Omniglot (Lake et al., 2015) and CIFAR-FS (Bertinetto et al., 2018) datasets; the comparisons are shown in Figures 9 and 10. It can be seen that the proposed CPBO achieves a relatively fast convergence rate among all competing algorithms since 1) the iteration complexity of the proposed method is low, and 2) every step in CPBO is computationally efficient.

Assumption A.1 (Smoothness/Gradient Lipschitz) Following (Ji et al., 2021), we assume that L̃_p has Lipschitz continuous gradients, i.e., for any ω, ω', there exists L > 0 such that ||∇L̃_p(ω) − ∇L̃_p(ω')|| ≤ L ||ω − ω'||.

Assumption A.2 (Boundedness) Following (Qian et al., 2019), we assume that the variables are bounded, i.e., ||x||^2 ≤ β_1 and ||y||^2 ≤ β_2.

Theorem 3 (Iteration Complexity) Under Assumptions A.1 and A.2, and setting the step sizes as η_x < 2/L and η_y < 2/L, the iteration complexity (also the gradient complexity) of the proposed algorithm to obtain an ϵ-stationary point is bounded by O(1/ϵ).

Proof of Theorem 3:

According to Assumption A.1 and Eq. (50), when t ≥ T_1, we have

L̃_p(x^{t+1}, y^t) ≤ L̃_p(x^t, y^t) + ⟨∇_x L̃_p(x^t, y^t), x^{t+1} − x^t⟩ + (L/2) ||x^{t+1} − x^t||^2 ≤ L̃_p(x^t, y^t) − η_x ||∇_x L̃_p(x^t, y^t)||^2 + (L η_x^2 / 2) ||∇_x L̃_p(x^t, y^t)||^2. (53)

Similarly, according to Assumption A.1 and Eq. (51), we have

L̃_p(x^{t+1}, y^{t+1}) ≤ L̃_p(x^{t+1}, y^t) + ⟨∇_y L̃_p(x^{t+1}, y^t), y^{t+1} − y^t⟩ + (L/2) ||y^{t+1} − y^t||^2 ≤ L̃_p(x^{t+1}, y^t) − η_y ||∇_y L̃_p(x^{t+1}, y^t)||^2 + (L η_y^2 / 2) ||∇_y L̃_p(x^{t+1}, y^t)||^2. (54)

Combining Eq. (53) with Eq. (54), we have

(η_x − L η_x^2 / 2) ||∇_x L̃_p(x^t, y^t)||^2 + (η_y − L η_y^2 / 2) ||∇_y L̃_p(x^{t+1}, y^t)||^2 ≤ L̃_p(x^t, y^t) − L̃_p(x^{t+1}, y^{t+1}). (55)

According to the setting of η_x, η_y, we have η_x − L η_x^2 / 2 > 0 and η_y − L η_y^2 / 2 > 0. Setting the constant d = min{η_x − L η_x^2 / 2, η_y − L η_y^2 / 2}, we can obtain

||∇_x L̃_p(x^t, y^t)||^2 + ||∇_y L̃_p(x^{t+1}, y^t)||^2 ≤ (L̃_p(x^t, y^t) − L̃_p(x^{t+1}, y^{t+1})) / d. (56)

Summing both sides of Eq. (56) over t = T_1, · · · , T − 1, we obtain

(1 / (T − T_1)) Σ_{t=T_1}^{T−1} (||∇_x L̃_p(x^t, y^t)||^2 + ||∇_y L̃_p(x^{t+1}, y^t)||^2) ≤ (L̃_p(x^{T_1}, y^{T_1}) − L̃*_p) / ((T − T_1) d), (57)

where L̃*_p = min_{x,y} L̃_p(x, y). Combining Eq. (57) with Definition A.1, the number of iterations required by Algorithm 2 to return an ϵ-stationary point is bounded by O(((L̃_p(x^{T_1}, y^{T_1}) − L̃*_p) / d) · (1/ϵ) + T_1).

B PROOF OF THEOREM 2

In this section, we provide the complete proof of Theorem 2. Firstly, we make some definitions for our problem.
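The telescoping argument behind Eqs. (56)-(57) can be checked numerically. The quadratic objective below is an illustrative assumption (not the paper's L̃_p); the sketch only verifies that, for an L-smooth function and step size η < 2/L, the minimum squared gradient over T steps is bounded by (F(x_0) − F*) / (dT) with d = η − Lη^2/2, which is the source of the O(1/ϵ) rate.

```python
import numpy as np

# Toy L-smooth objective: F(x) = 2||x||^2, so grad F is 4-Lipschitz and F* = 0.
L_const = 4.0
F = lambda x: 2.0 * (x @ x)
grad = lambda x: 4.0 * x

eta = 0.1                         # satisfies eta < 2 / L
d = eta - L_const * eta**2 / 2.0  # = 0.08 > 0, the constant in Eq. (56)

x = np.array([3.0, -1.0])
F0, T = F(x), 100
min_sq = np.inf
for _ in range(T):
    g = grad(x)
    min_sq = min(min_sq, g @ g)
    x = x - eta * g

print(min_sq <= F0 / (d * T))     # True: consistent with the O(1/eps) bound
```

Setting the right-hand side to ϵ and solving for T reproduces the T = O(1/ϵ) iteration complexity stated in Theorem 3 (up to the additive T_1 burn-in).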
Definition B.1 Following (Xu et al., 2020), the stationarity gap at the t-th iteration is defined as the stacked vector

∇G^t = [ {∇_{x_i} L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i})}; {∇_{y_i} L_p(·)}; ∇_v L_p(·); ∇_z L_p(·); {∇_{λ_l} L_p(·)}; {∇_{θ_i} L_p(·)} ],

where every block is evaluated at ({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}). We denote the blocks by (∇G^t)_{x_i}, (∇G^t)_{y_i}, (∇G^t)_v, (∇G^t)_z, (∇G^t)_{λ_l} and (∇G^t)_{θ_i}. It follows that

||∇G^t||^2 = Σ_{i=1}^N (||(∇G^t)_{x_i}||^2 + ||(∇G^t)_{y_i}||^2 + ||(∇G^t)_{θ_i}||^2) + ||(∇G^t)_v||^2 + ||(∇G^t)_z||^2 + Σ_{l=1}^{|P_t|} ||(∇G^t)_{λ_l}||^2.

Definition B.2 At the t-th iteration, the stationarity gap with respect to L̄_p({x_i},{y_i},v,z,{λ_l},{θ_i}) is defined analogously: ∇Ḡ^t stacks the blocks (∇Ḡ^t)_{x_i} = ∇_{x_i} L̄_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}), and likewise (∇Ḡ^t)_{y_i}, (∇Ḡ^t)_v, (∇Ḡ^t)_z, (∇Ḡ^t)_{λ_l} and (∇Ḡ^t)_{θ_i}, so that

||∇Ḡ^t||^2 = Σ_{i=1}^N (||(∇Ḡ^t)_{x_i}||^2 + ||(∇Ḡ^t)_{y_i}||^2 + ||(∇Ḡ^t)_{θ_i}||^2) + ||(∇Ḡ^t)_v||^2 + ||(∇Ḡ^t)_z||^2 + Σ_{l=1}^{|P_t|} ||(∇Ḡ^t)_{λ_l}||^2.

Definition B.3 In the proposed asynchronous algorithm, for the i-th worker at the t-th iteration, the last iteration at which this worker was active is denoted t̃_i, and the next iteration at which it will be active is denoted t̂_i. The set of iteration indices at which the i-th worker is active during the T_1 + T + τ iterations is denoted V_i(T), and its j-th element is denoted ṽ_i(j).

Then, we provide some useful lemmas used for proving the main convergence results in Theorem 2.

Lemma 1 Let the step sizes η^t_x = η^t_y = η^t_v = η^t_z = 2 / (L + η_λ |P_t| L^2 + η_θ N L^2 + 8(|P_t| γ L^2 / (η_λ (c^t_1)^2) + N γ L^2 / (η_θ (c^t_2)^2))), and suppose Assumptions 1 and 2 hold. Then we can obtain

L_p({x^{t+1}_i},{y^{t+1}_i},v^{t+1},z^{t+1},{λ^t_l},{θ^t_i}) − L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) ≤ Σ_{i=1}^N ((L+L^2+1)/2 − 1/η^t_x) ||x^{t+1}_i − x^t_i||^2 + Σ_{i=1}^N ((L+1)/2 − 1/η^t_y) ||y^{t+1}_i − y^t_i||^2 + 3N L^2 τ k_1 Σ_{l=1}^{|P_t|} ||λ^{t+1}_l − λ^t_l||^2 + ((L + 6N L^2 τ k_1)/2 − 1/η^t_v) ||v^{t+1} − v^t||^2 + ((L + 6N L^2 τ k_1)/2 − 1/η^t_z) ||z^{t+1} − z^t||^2.
Proof of Lemma 1: Utilizing the Lipschitz properties in Assumption 1, we can obtain

L_p({x^{t+1}_1, x^t_2, · · · , x^t_N},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) − L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) ≤ ⟨∇_{x_1} L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}), x^{t+1}_1 − x^t_1⟩ + (L/2) ||x^{t+1}_1 − x^t_1||^2,

L_p({x^{t+1}_1, x^{t+1}_2, · · · , x^t_N},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) − L_p({x^{t+1}_1, x^t_2, · · · , x^t_N},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) ≤ ⟨∇_{x_2} L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}), x^{t+1}_2 − x^t_2⟩ + (L/2) ||x^{t+1}_2 − x^t_2||^2,

· · ·

L_p({x^{t+1}_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) − L_p({x^{t+1}_1, · · · , x^{t+1}_{N−1}, x^t_N},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) ≤ ⟨∇_{x_N} L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}), x^{t+1}_N − x^t_N⟩ + (L/2) ||x^{t+1}_N − x^t_N||^2. (66)

Summing up the above inequalities in Eq. (66), we have

L_p({x^{t+1}_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) − L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) ≤ Σ_{i=1}^N ⟨∇_{x_i} L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}), x^{t+1}_i − x^t_i⟩ + (L/2) ||x^{t+1}_i − x^t_i||^2. (67)

Combining ∇_{x_i} L_p({x^{t̃_i}_i},{y^{t̃_i}_i},v^{t̃_i},z^{t̃_i},{λ^{t̃_i}_l},{θ^{t̃_i}_i}) = ∇_{x_i} L̄_p({x^{t̃_i}_i},{y^{t̃_i}_i},v^{t̃_i},z^{t̃_i},{λ^{t̃_i}_l},{θ^{t̃_i}_i}) with Eq. (15), we have

⟨x^{t+1}_i − x^t_i, ∇_{x_i} L_p({x^{t̃_i}_i},{y^{t̃_i}_i},v^{t̃_i},z^{t̃_i},{λ^{t̃_i}_l},{θ^{t̃_i}_i})⟩ = −(1/η_x) ||x^{t+1}_i − x^t_i||^2 ≤ −(1/η^t_x) ||x^{t+1}_i − x^t_i||^2. (68)

Next, combining the Cauchy-Schwarz inequality with Assumptions 1 and 2, we can get

⟨x^{t+1}_i − x^t_i, ∇_{x_i} L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) − ∇_{x_i} L_p({x^{t̃_i}_i},{y^{t̃_i}_i},v^{t̃_i},z^{t̃_i},{λ^{t̃_i}_l},{θ^{t̃_i}_i})⟩ ≤ (1/2) ||x^{t+1}_i − x^t_i||^2 + (L^2/2) (||v^t − v^{t_j}||^2 + ||z^t − z^{t_j}||^2 + Σ_{l=1}^{|P_t|} ||λ^t_l − λ^{t_j}_l||^2) ≤ (1/2) ||x^{t+1}_i − x^t_i||^2 + (3L^2 τ k_1 / 2) (||v^{t+1} − v^t||^2 + ||z^{t+1} − z^t||^2 + Σ_{l=1}^{|P_t|} ||λ^{t+1}_l − λ^t_l||^2). (69)

Thus, according to Eq.
(67), (68) and (69), we can obtain

L_p({x^{t+1}_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) − L_p({x^t_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) ≤ Σ_{i=1}^N ((L+1)/2 − 1/η^t_x) ||x^{t+1}_i − x^t_i||^2 + (3N L^2 τ k_1 / 2) (||v^{t+1} − v^t||^2 + ||z^{t+1} − z^t||^2 + Σ_{l=1}^{|P_t|} ||λ^{t+1}_l − λ^t_l||^2). (70)

Similarly, using the Lipschitz properties in Assumption 1, we can obtain

L_p({x^{t+1}_i},{y^{t+1}_i},v^t,z^t,{λ^t_l},{θ^t_i}) − L_p({x^{t+1}_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) ≤ Σ_{i=1}^N ⟨∇_{y_i} L_p({x^{t+1}_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}), y^{t+1}_i − y^t_i⟩ + (L/2) ||y^{t+1}_i − y^t_i||^2. (71)

Combining ∇_{y_i} L_p({x^{t̃_i}_i},{y^{t̃_i}_i},v^{t̃_i},z^{t̃_i},{λ^{t̃_i}_l},{θ^{t̃_i}_i}) = ∇_{y_i} L̄_p({x^{t̃_i}_i},{y^{t̃_i}_i},v^{t̃_i},z^{t̃_i},{λ^{t̃_i}_l},{θ^{t̃_i}_i}) with Eq. (16), we have

⟨y^{t+1}_i − y^t_i, ∇_{y_i} L_p({x^{t̃_i}_i},{y^{t̃_i}_i},v^{t̃_i},z^{t̃_i},{λ^{t̃_i}_l},{θ^{t̃_i}_i})⟩ = −(1/η_y) ||y^{t+1}_i − y^t_i||^2 ≤ −(1/η^t_y) ||y^{t+1}_i − y^t_i||^2. (72)

Then, combining the Cauchy-Schwarz inequality with Assumptions 1 and 2, we can get the following inequalities,

⟨y^{t+1}_i − y^t_i, ∇_{y_i} L_p({x^{t+1}_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) − ∇_{y_i} L_p({x^{t̃_i}_i},{y^{t̃_i}_i},v^{t̃_i},z^{t̃_i},{λ^{t̃_i}_l},{θ^{t̃_i}_i})⟩ ≤ (1/2) ||y^{t+1}_i − y^t_i||^2 + (L^2/2) (||x^{t+1}_i − x^t_i||^2 + ||v^t − v^{t_j}||^2 + ||z^t − z^{t_j}||^2 + Σ_{l=1}^{|P_t|} ||λ^t_l − λ^{t_j}_l||^2) ≤ (1/2) ||y^{t+1}_i − y^t_i||^2 + (L^2/2) ||x^{t+1}_i − x^t_i||^2 + (3L^2 τ k_1 / 2) (||v^{t+1} − v^t||^2 + ||z^{t+1} − z^t||^2 + Σ_{l=1}^{|P_t|} ||λ^{t+1}_l − λ^t_l||^2). (73)

Thus, combining Eq. (71), (72) with (73), we have

L_p({x^{t+1}_i},{y^{t+1}_i},v^t,z^t,{λ^t_l},{θ^t_i}) − L_p({x^{t+1}_i},{y^t_i},v^t,z^t,{λ^t_l},{θ^t_i}) ≤ Σ_{i=1}^N ((L+1)/2 − 1/η^t_y) ||y^{t+1}_i − y^t_i||^2 + Σ_{i=1}^N (L^2/2) ||x^{t+1}_i − x^t_i||^2 + (3N L^2 τ k_1 / 2) (||v^{t+1} − v^t||^2 + ||z^{t+1} − z^t||^2 + Σ_{l=1}^{|P_t|} ||λ^{t+1}_l − λ^t_l||^2). (74)

Combining the Lipschitz properties in Assumption 1 with Eq.
( 17), we have, L p ({x t+1 i },{y t+1 i },v t+1 ,z t ,{λ t l },{θ t i }) -L p ({x t+1 i },{y t+1 i },v t ,z t ,{λ t l },{θ t i }) ≤ ∇ v L p ({x t+1 i },{y t+1 i },v t ,z t ,{λ t l },{θ t i }), v t+1 -v t + L 2 ||v t+1 -v t || 2 ≤ ( L 2 -1 η t v )||v t+1 -v t || 2 . ( ) Similarly, combining the Lipschitz properties in Assumption 1 with Eq. ( 18), we have, L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-L p ({x t+1 i },{y t+1 i },v t+1 ,z t ,{λ t l },{θ t i }) ≤ ∇ z L p ({x t+1 i },{y t+1 i },v t+1 ,z t ,{λ t l },{θ t i }), z t+1 -z t + L 2 ||z t+1 -z t || 2 ≤ ( L 2 -1 η t z )||z t+1 -z t || 2 . ( ) By combining Eq. ( 70), ( 74), ( 75), ( 76), we conclude the proof of Lemma 1. Lemma 2 Suppose Assumption 1 and 2 hold, ∀t ≥ T 1 , we have: L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t+1 i }) -L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }) ≤ ( L+L 2 +1 2 -1 η t x + |P t |L 2 2a1 + |Q t+1 |L 2 2a3 ) N i=1 ||x t+1 i -x t i || 2 +( L+1 2 -1 η t y + |P t |L 2 2a1 + |Q t+1 |L 2 2a3 ) N i=1 ||y t+1 i -y t i || 2 +( L+6τ k1N L 2 2 -1 η t v + |P t |L 2 2a1 + |Q t+1 |L 2 2a3 )||v t+1 -v t || 2 +( L+6τ k1N L 2 2 -1 η t z + |P t |L 2 2a1 + |Q t+1 |L 2 2a3 )||z t+1 -z t || 2 + 1 2η θ N i=1 ||θ t i -θ t-1 i || 2 +( a1+6τ k1N L 2 2 - c t-1 1 -c t 1 2 + 1 2η λ ) |P t | l=1 ||λ t+1 l -λ t l || 2 +( a3 2 - c t-1 2 -c t 2 2 + 1 2η θ ) N i=1 ||θ t+1 i -θ t i || 2 + c t-1 1 2 |P t | l=1 (||λ t+1 l || 2 -||λ t l || 2 )+ 1 2η λ |P t | l=1 ||λ t l -λ t-1 l || 2 + c t-1 2 2 N i=1 (||θ t+1 i || 2 -||θ t i || 2 ), where a 1 > 0 and a 3 > 0 are constants.

Proof of Lemma 2:

According to Eq. ( 19), in (t + 1) th iteration, ∀λ ∈ Λ, it follows that: λ t+1 l -λ t l -η λ ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i }), λ-λ t+1 l = 0. Let λ = λ t l , we can obtain: ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })- 1 η λ (λ t+1 l -λ t l ), λ t l -λ t+1 l = 0. Likewise, in t th iteration, we can obtain: ∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i })- 1 η λ (λ t l -λ t-1 l ), λ t+1 l -λ t l = 0. ( ) Since L p ({x i },{y i },v,z,{λ l },{θ i }) is concave with respect to λ l and follows from Eq. ( 79) and Eq. ( 80), ∀t ≥ T 1 , we have, L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i }) -L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i }) ≤ |P t | l=1 ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i }), λ t+1 l -λ t l ≤ |P t | l=1 ( ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i }), λ t+1 l -λ t l + 1 η λ λ t l -λ t-1 l , λ t+1 l -λ t l ). ( ) Denoting v t+1 1,l = λ t+1 l -λ t l -(λ t l -λ t-1 l ), we can get the following equality, |P t | l=1 ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i }), λ t+1 l -λ t l = |P t | l=1 ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })), λ t+1 l -λ t l (1a) + |P t | l=1 ∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i }), v t+1 1,l + |P t | l=1 ∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i }), λ t l -λ t-1 l (1c). First, we put attention on the (1a) in Eq. 
( 82), (1a) can be expressed as follows, ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), λ t+1 l -λ t l = ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), λ t+1 l -λ t l + c t-1 1 -c t 1 2 (||λ t+1 l || 2 -||λ t l || 2 )- c t-1 1 -c t 1 2 ||λ t+1 l -λ t l || 2 . ( ) Combining Cauchy-Schwarz inequality with Assumption 1, we can obtain, ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), λ t+1 l -λ t l ≤ L 2 2a1 ( N i=1 (||x t+1 i -x t i || 2 + ||y t+1 i -y t i || 2 )+||v t+1 -v t || 2 +||z t+1 -z t || 2 )+ a1 2 ||λ t+1 l -λ t l || 2 , ( ) where a 1 > 0 is a constant. Combining Eq. ( 83) with Eq. ( 84), we can obtain that, |P t | l=1 ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), λ t+1 l -λ t l ≤ |P t | l=1 ( L 2 2a1 ( N i=1 (||x t+1 i -x t i || 2 + ||y t+1 i -y t i || 2 )+||v t+1 -v t || 2 +||z t+1 -z t || 2 )+ a1 2 ||λ t+1 l -λ t l || 2 + c t-1 1 -c t 1 2 (||λ t+1 l || 2 -||λ t l || 2 )- c t-1 1 -c t 1 2 ||λ t+1 l -λ t l || 2 ). Then, we focus on the (1b) in Eq. ( 82). According to Cauchy-Schwarz inequality, (1b) can be expressed as follows, |P t | l=1 ∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i }), v t+1 1,l ≤ |P t | l=1 ( a2 2 ||∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i })|| 2 + 1 2a2 ||v t+1 1,l || 2 ), ) where a 2 > 0 is a constant. Next, we focus on the (1c) in Eq. ( 82). 
Defining L 1 ′ = L + c 0 1 , according to Assumption 1 and the trigonometric inequality, ∀λ l , we have, ||∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i })|| = ||∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t i })-c t-1 1 (λ t l -λ t-1 l )|| ≤ (L + c t-1 1 )||λ t l -λ t-1 l || ≤ L 1 ′ ||λ t l -λ t-1 l ||. ( ) Following from Eq. ( 87) and the strong concavity of terov, 2003; Xu et al., 2020) , we can obtain that, L p ({x i },{y i },v,z,{λ l },{θ i }) w.r.t λ l (Nes |P t | l=1 ∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i }) ≤ |P t | l=1 (- 1 L1 ′ +c t-1 1 ||∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i })|| 2 - c t-1 1 L1 ′ L1 ′ +c t-1 1 ||λ t l -λ t-1 l || 2 ). ( ) In addition, the following inequality can be obtained, 1 η λ λ t l -λ t-1 l , λ t+1 l -λ t l ≤ 1 2η λ ||λ t+1 l -λ t l || 2 -1 2η λ ||v t+1 1,l || 2 + 1 2η λ ||λ t l -λ t-1 l || 2 . ( ) Combining Eq. ( 81), ( 82), ( 85), ( 86), ( 88), ( 89), η λ 2 ≤ 1 L1 ′ +c 0 1 , and setting a 2 = η λ , we have: L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i }) -L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i }) ≤ |P t |L 2 2a1 ( N i=1 (||x t+1 i -x t i || 2 + ||y t+1 i -y t i || 2 )+||v t+1 -v t || 2 +||z t+1 -z t || 2 ) +( a1 2 - c t-1 1 -c t 1 2 + 1 2η λ ) |P t | l=1 ||λ t+1 l -λ t l || 2 + c t-1 1 2 |P t | l=1 (||λ t+1 l || 2 -||λ t l || 2 ) + 1 2η λ |P t | l=1 ||λ t l -λ t-1 l || 2 . ( ) According to Eq. ( 20), in (t + 1) th iteration, ∀θ ∈ Θ, it follows that, θ t+1 i -θ t i -η θ ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i }), θ-θ t+1 i = 0. ( ) Choosing θ = θ t i , we can obtain, ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })- 1 η θ (θ t+1 i -θ t i ), θ t i -θ t+1 i = 0. 
Likewise, in t th iteration, we have, ∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i })- 1 η θ (θ t i -θ t-1 i ), θ t+1 i -θ t i = 0. Since L p ({x i },{y i },v,z,{λ l },{θ i }) is concave with respect to θ i and follows from Eq. ( 93): L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t+1 i })-L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i }) ≤ N i=1 ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i }), θ t+1 i -θ t i ≤ N i=1 ( ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i }), θ t+1 i -θ t i + 1 η θ θ t i -θ t-1 i , θ t+1 i -θ t i ). ( ) Denoting v t+1 2,l = θ t+1 i -θ t i -(θ t i -θ t-1 i ), we have that, N i=1 ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i }), θ t+1 i -θ t i = N i=1 ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), θ t+1 i -θ t i (2a) + N i=1 ∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i }), v t+1 2,l + N i=1 ∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i }), θ t i -θ t-1 i (2c). We firstly focus on the (2a) in Eq. ( 95), we can write the (2a) as, ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), θ t+1 i -θ t i = ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), θ t+1 i -θ t i + c t-1 2 -c t 2 2 (||θ t+1 i || 2 -||θ t i || 2 )- c t-1 2 -c t 2 2 ||θ t+1 i -θ t i || 2 ). 
And combining the Cauchy-Schwarz inequality with Assumption 1, we can obtain, ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), θ t+1 i -θ t i = ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), θ t+1 i -θ t i ≤ L 2 2a3 ( N i=1 (||x t+1 i -x t i || 2 + ||y t+1 i -y t i || 2 )+||v t+1 -v t || 2 +||z t+1 -z t || 2 ) + a3 2 ||θ t+1 i -θ t i || 2 , ( ) where a 3 > 0 is a constant. Thus, we can get the upper bound of (2a) by combining Eq. ( 96) with Eq. ( 97), that is, N i=1 ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), θ t+1 i -θ t i ≤ i∈Q t+1 ( L 2 2a3 ( N i=1 (||x t+1 i -x t i || 2 + ||y t+1 i -y t i || 2 )+||v t+1 -v t || 2 +||z t+1 -z t || 2 )+ a3 2 ||θ t+1 i -θ t i || 2 + c t-1 2 -c t 2 2 (||θ t+1 i || 2 -||θ t i || 2 )- c t-1 2 -c t 2 2 ||θ t+1 i -θ t i || 2 ). Next we focus on the (2b) in Eq. ( 95). According to Cauchy-Schwarz inequality we can write (2b) as, N i=1 ∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i }), v t+1 2,l ≤ N i=1 ( a4 2 ||∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i })|| 2 + 1 2a4 ||v t+1 2,l || 2 ), where a 4 > 0 is a constant. Then, we focus on the (2c) in Eq. ( 95). Defining L 2 ′ = L + c 0 2 , according to Assumption 1 and the trigonometric inequality, we have, ||∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i })|| ≤ L 2 ′ ||θ t i -θ t-1 i ||. Following Eq. 
( 100) and the strong concavity of L p ({x i },{y i },v,z,{λ l },{θ i }) w.r.t θ i , the upper bound of (2c) can be obtained, that is, N i=1 ∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i }), θ t i -θ t-1 i ≤ N i=1 (- 1 L2 ′ +c t-1 2 ||∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i })|| 2 - c t-1 2 L2 ′ L2 ′ +c t-1 2 ||θ t i -θ t-1 i || 2 ). In addition, the following inequality can also be obtained, N i=1 1 η θ θ t i -θ t-1 i , θ t+1 i -θ t i ≤ N i=1 ( 1 2η θ ||θ t+1 i -θ t i || 2 -1 2η θ ||v t+1 2,l || 2 + 1 2η θ ||θ t i -θ t-1 i || 2 ). Combining Eq. ( 94), ( 95), ( 98), ( 99), ( 101), (102), η θ 2 ≤ 1 L2 ′ +c 0 2 , and setting a 4 = η θ , we have, L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t+1 i })-L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i }) ≤ |Q t+1 |L 2 2a3 ( N i=1 (||x t+1 i -x t i || 2 + ||y t+1 i -y t i || 2 )+||v t+1 -v t || 2 +||z t+1 -z t || 2 ) +( a3 2 - c t-1 2 -c t 2 2 + 1 2η θ ) N i=1 ||θ t+1 i -θ t i || 2 + c t-1 2 2 N i=1 (||θ t+1 i || 2 -||θ t i || 2 )+ 1 2η θ N i=1 ||θ t i -θ t-1 i || 2 . ( ) By combining Lemma 1 with Eq. ( 90) and Eq. ( 103), we conclude the proof of Lemma 2. Lemma 3 Firstly, we denote S t+1 1 , S t+1 2 and F t+1 as, S t+1 1 = 4 η λ 2 c t+1 1 |P t | l=1 ||λ t+1 l -λ t l || 2 - 4 η λ ( c t-1 1 c t 1 -1) |P t | l=1 ||λ t+1 l || 2 , S t+1 2 = 4 η θ 2 c t+1 2 N i=1 ||θ t+1 i -θ t i || 2 - 4 η θ ( c t-1 2 c t 2 -1) N i=1 ||θ t+1 i || 2 , F t+1 = L p ({x t+1 i }, {y t+1 i }, z t+1 , h t+1 , {λ t+1 l }, {θ t+1 i }) + S t+1 1 + S t+1 2 -7 2η λ |P t | l=1 ||λ t+1 l -λ t l || 2 - c t 1 2 |P t | l=1 ||λ t+1 l || 2 -7 2η θ N i=1 ||θ t+1 i -θ t i || 2 - c t 2 2 N i=1 ||θ t+1 i || 2 . 
( ) Defining a 5 = max{1, 1 + L 2 , 6τ k 1 N L 2 }, ∀t ≥ T 1 , we have, F t+1 -F t ≤ ( L+a5 2 -1 η t x + η λ |P t |L 2 2 + η θ |Q t+1 |L 2 2 + 8|P t |L 2 η λ (c t 1 ) 2 + 8N L 2 η θ (c t 2 ) 2 ) N i=1 ||x t+1 i -x t i || 2 +( L+a5 2 -1 η t y + η λ |P t |L 2 2 + η θ |Q t+1 |L 2 2 + 8|P t |L 2 η λ (c t 1 ) 2 + 8N L 2 η θ (c t 2 ) 2 ) N i=1 ||y t+1 i -y t i || 2 +( L+a5 2 -1 η t v + η λ |P t |L 2 2 + η θ |Q t+1 |L 2 2 + 8|P t |L 2 η λ (c t 1 ) 2 + 8N L 2 η θ (c t 2 ) 2 )||v t+1 -v t || 2 +( L+a5 2 -1 η t z + η λ |P t |L 2 2 + η θ |Q t+1 |L 2 2 + 8|P t |L 2 η λ (c t 1 ) 2 + 8N L 2 η θ (c t 2 ) 2 )||z t+1 -z t || 2 -( 1 10η λ -6τ k1N L 2 2 ) |P t | l=1 ||λ t+1 l -λ t l || 2 -1 10η θ N i=1 ||θ t+1 i -θ t i || 2 + c t-1 1 -c t 1 2 |P t | l=1 ||λ t+1 l || 2 + c t-1 2 -c t 2 2 N i=1 ||θ t+1 i || 2 + 4 η λ ( c t-2 1 c t-1 1 - c t-1 1 c t 1 ) |P t | l=1 ||λ t l || 2 + 4 η θ ( c t-2 2 c t-1 2 - c t-1 2 c t 2 ) N i=1 ||θ t i || 2 . ( ) Proof of Lemma 3: Let a 1 = 1 η λ , a 3 = 1 η θ and substitute them into the Lemma 2, ∀t ≥ T 1 , we have, L p ({x t+1 i },{y t+1 i }, v t+1 , z t+1 ,{λ t+1 l },{θ t+1 i })-L p ({x t i },{y t i }, v t , z t ,{λ t l },{θ t i }) ≤ ( L+L 2 +1 2 -1 η t x + η λ |P t |L 2 +η θ |Q t+1 |L 2 2 ) N i=1 ||x t+1 i -x t i || 2 +( L+1 2 -1 η t y + η λ |P t |L 2 +η θ |Q t+1 |L 2 2 ) N i=1 ||y t+1 i -y t i || 2 +( L+6τ k1N L 2 2 -1 η t v + η λ |P t |L 2 +η θ |Q t+1 |L 2 2 )||v t+1 -v t || 2 + 1 2η λ |P t | l=1 ||λ t l -λ t-1 l || 2 +( L+6τ k1N L 2 2 -1 η t z + η λ |P t |L 2 +η θ |Q t+1 |L 2 2 )||z t+1 -z t || 2 + 1 2η θ N i=1 ||θ t i -θ t-1 i || 2 +( 6τ k1N L 2 2 - c t-1 1 -c t 1 2 + 1 η λ ) |P t | l=1 ||λ t+1 l -λ t l || 2 +( 1 η θ - c t-1 2 -c t 2 2 ) N i=1 ||θ t+1 i -θ t i || 2 + c t-1 1 2 |P t | l=1 (||λ t+1 l || 2 -||λ t l || 2 )+ c t-1 2 2 N i=1 (||θ t+1 i || 2 -||θ t i || 2 ). According to Eq. ( 19), in (t + 1) th iteration, it follows that: λ t+1 l -λ t l -η λ ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i }), λ t l -λ t+1 l = 0. 
Similar to Eq. ( 109), in t th iteration, we have, λ t l -λ t-1 l -η λ ∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i }), λ t+1 l -λ t l = 0. Thus, ∀t ≥ T 1 , by combining Eq. ( 109) with Eq. ( 110), we can obtain that, 1 η λ v t+1 1,l , λ t+1 l -λ t l = ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i }), λ t+1 l -λ t l = ∇ λ l L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), λ t+1 l -λ t l + ∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i }), v t+1 1,l + ∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l }, {θ t-1 i }), λ t l -λ t-1 l . ( ) Since we have that, 1 η λ v t+1 1,l , λ t+1 l -λ t l = 1 2η λ ||λ t+1 l -λ t l || 2 + 1 2η λ ||v t+1 1,l || 2 -1 2η λ ||λ t l -λ t-1 l || 2 , ( ) it follows from Eq. ( 111) and Eq. ( 112) that, 1 2η λ ||λ t+1 l -λ t l || 2 + 1 2η λ ||v t+1 1,l || 2 -1 2η λ ||λ t l -λ t-1 l || 2 = L 2 2b t 1 ( N i=1 (||x t+1 i -x t i || 2 + ||y t+1 i -y t i || 2 )+||v t+1 -v t || 2 +||z t+1 -z t || 2 )+ b t 1 2 ||λ t+1 l -λ t l || 2 + c t-1 1 -c t 1 2 (||λ t+1 l || 2 -||λ t l || 2 )- c t-1 1 -c t 1 2 ||λ t+1 l -λ t l || 2 + η λ 2 ||∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i })|| 2 + 1 2η λ ||v t+1 1,l || 2 - 1 L1 ′ +c t-1 1 ||∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ λ l L p ({x t i },{y t i },v t ,z t ,{λ t-1 l },{θ t-1 i })|| 2 - c t-1 1 L1 ′ L1 ′ +c t-1 1 ||λ t l -λ t-1 l || ( ) where b t 1 > 0. According to the setting that c 0 1 ≤ L 1 ′ , we have - c t-1 1 L1 ′ L1 ′ +c t-1 1 ≤ - c t-1 1 L1 ′ 2L1 ′ = - c t-1 1 2 ≤ - c t 1 2 . Multiplying both sides of Eq. 
( 113) by 8 η λ c t 1 , we have, 4 η λ 2 c t 1 ||λ t+1 l -λ t l || 2 -4 η λ ( c t-1 1 -c t 1 c t 1 )||λ t+1 l || 2 ≤ 4 η λ 2 c t 1 ||λ t l -λ t-1 l || 2 -4 η λ ( c t-1 1 -c t 1 c t 1 )||λ t l || 2 + 4b t 1 η λ c t 1 ||λ t+1 l -λ t l || 2 -4 η λ ||λ t l -λ t-1 l || 2 + 4L 2 η λ c t 1 b t 1 ( N i=1 (||x t+1 i -x t i || 2 + ||y t+1 i -y t i || 2 )+||v t+1 -v t || 2 +||z t+1 -z t || 2 ). ( ) Setting b t 1 = c t 1 2 in Eq. ( 114) and using the definition of S t 1 , ∀t ≥ T 1 , we have, S t+1 1 -S t 1 ≤ |P t | l=1 4 η λ ( c t-2 1 c t-1 1 - c t-1 1 c t 1 )||λ t l || 2 + |P t | l=1 ( 2 η λ + 4 η λ 2 ( 1 c t+1 1 -1 c t 1 ))||λ t+1 l -λ t l || 2 - |P t | l=1 4 η λ ||λ t l -λ t-1 l || 2 + 8|P t |L 2 η λ (c t 1 ) 2 ( N i=1 (||x t+1 i -x t i || 2 + ||y t+1 i -y t i || 2 )+||v t+1 -v t || 2 +||z t+1 -z t || 2 ). Published as a conference paper at ICLR 2023 Similarly, according to Eq. ( 20), it follows that, 1 η θ v t+1 2,l , θ t+1 i -θ t i = ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i }), θ t+1 i -θ t i = ∇ θi L p ({x t+1 i },{y t+1 i },v t+1 ,z t+1 ,{λ t+1 l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i }), θ t+1 i -θ t i + ∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i }), v t+1 2,l + ∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t i })-∇ θi L p ({x t i },{y t i },v t ,z t ,{λ t l },{θ t-1 i }), θ t i -θ t-1 i Proof of Theorem 1: First, we set that, a t 6 = 4|P t |(γ -2)L 2 η λ (c t 1 ) 2 + 4N (γ -2)L 2 η θ (c t 2 ) 2 + η θ (N -|Q t+1 |)L 2 2 - a 5 2 , ( ) where constant γ satisfies that γ > 2 and 4(γ-2)L 2 η λ (c 0 1 ) 2 + 4N (γ-2)L 2 η θ (c 0 2 ) 2 > a5 2 , thus we have that a t 6 > 0, ∀t. 
According to the settings of $\eta_x^t,\eta_y^t,\eta_v^t,\eta_z^t$ and $c_1^t,c_2^t$, we have, for each $w\in\{x,y,v,z\}$,
\[
\frac{L+a_5}{2}-\frac{1}{\eta_w^t}+\frac{\eta_\lambda|P^t|L^2}{2}+\frac{\eta_\theta|Q^{t+1}|L^2}{2}+\frac{8|P^t|L^2}{\eta_\lambda(c_1^t)^2}+\frac{8NL^2}{\eta_\theta(c_2^t)^2}=-a_6^t. \quad (123)\text{--}(126)
\]
Combining Eq. (123), (124), (125), (126) with Lemma 3, $\forall t\ge T_1$, we can obtain
\[
\begin{aligned}
&a_6^t\sum_{i=1}^{N}\big(\|x_i^{t+1}-x_i^{t}\|^2+\|y_i^{t+1}-y_i^{t}\|^2\big)+a_6^t\|v^{t+1}-v^{t}\|^2+a_6^t\|z^{t+1}-z^{t}\|^2\\
&\quad+\Big(\frac{1}{10\eta_\lambda}-\frac{6\tau k_1NL^2}{2}\Big)\sum_{l=1}^{|P^t|}\|\lambda_l^{t+1}-\lambda_l^{t}\|^2+\frac{1}{10\eta_\theta}\sum_{i=1}^{N}\|\theta_i^{t+1}-\theta_i^{t}\|^2\\
&\le \bar F^{t}-\bar F^{t+1}+\frac{c_1^{t-1}-c_1^{t}}{2}\sum_{l=1}^{|P^t|}\|\lambda_l^{t+1}\|^2+\frac{c_2^{t-1}-c_2^{t}}{2}\sum_{i=1}^{N}\|\theta_i^{t+1}\|^2\\
&\quad+\frac{4}{\eta_\lambda}\Big(\frac{c_1^{t-2}}{c_1^{t-1}}-\frac{c_1^{t-1}}{c_1^{t}}\Big)\sum_{l=1}^{|P^t|}\|\lambda_l^{t}\|^2+\frac{4}{\eta_\theta}\Big(\frac{c_2^{t-2}}{c_2^{t-1}}-\frac{c_2^{t-1}}{c_2^{t}}\Big)\sum_{i=1}^{N}\|\theta_i^{t}\|^2. \quad (127)
\end{aligned}
\]
Utilizing the definition of $(\nabla\tilde G^t)_{x_i}$ and combining it with the triangle inequality, the Cauchy–Schwarz inequality, and Assumptions 1 and 2, we can obtain
\[
\|(\nabla\tilde G^t)_{x_i}\|^2\le\frac{2}{\eta_x^2}\|x_i^{\tilde t_i}-x_i^{t}\|^2+6L^2\tau k_1\Big(\|v^{t+1}-v^{t}\|^2+\|z^{t+1}-z^{t}\|^2+\sum_{l=1}^{|P^t|}\|\lambda_l^{t+1}-\lambda_l^{t}\|^2\Big). \quad (128)
\]
Utilizing the definition of $(\nabla\tilde G^t)_{y_i}$ and combining it with the triangle inequality and the Cauchy–Schwarz inequality, it follows that
\[
\|(\nabla\tilde G^t)_{y_i}\|^2\le\frac{2}{\eta_y^2}\|y_i^{\tilde t_i}-y_i^{t}\|^2+6L^2\tau k_1\Big(\|v^{t+1}-v^{t}\|^2+\|z^{t+1}-z^{t}\|^2+\sum_{l=1}^{|P^t|}\|\lambda_l^{t+1}-\lambda_l^{t}\|^2\Big). \quad (129)
\]
Utilizing the definition of $(\nabla\tilde G^t)_{v}$ and combining it with the triangle inequality and the Cauchy–Schwarz inequality, we have
\[
\|(\nabla\tilde G^t)_{v}\|^2\le 2L^2\sum_{i=1}^{N}\big(\|x_i^{t+1}-x_i^{t}\|^2+\|y_i^{t+1}-y_i^{t}\|^2\big)+\frac{2}{\eta_v^2}\|v^{t+1}-v^{t}\|^2. \quad (130)
\]
Using the definition of $(\nabla\tilde G^t)_{z}$ and combining it with the triangle inequality and the Cauchy–Schwarz inequality, it follows that
\[
\|(\nabla\tilde G^t)_{z}\|^2\le 2L^2\Big(\sum_{i=1}^{N}\big(\|x_i^{t+1}-x_i^{t}\|^2+\|y_i^{t+1}-y_i^{t}\|^2\big)+\|v^{t+1}-v^{t}\|^2\Big)+\frac{2}{\eta_z^2}\|z^{t+1}-z^{t}\|^2. \quad (131)
\]
Using the definition of $(\nabla\tilde G^t)_{\lambda_l}$ and combining it with the triangle inequality and the Cauchy–Schwarz inequality, we can obtain
\[
\|(\nabla\tilde G^t)_{\lambda_l}\|^2\le\frac{3}{\eta_\lambda^2}\|\lambda_l^{t+1}-\lambda_l^{t}\|^2+3\big((c_1^{t-1})^2-(c_1^{t})^2\big)\|\lambda_l^{t}\|^2+3L^2\Big(\sum_{i=1}^{N}\big(\|x_i^{t+1}-x_i^{t}\|^2+\|y_i^{t+1}-y_i^{t}\|^2\big)+\|v^{t+1}-v^{t}\|^2+\|z^{t+1}-z^{t}\|^2\Big). \quad (132)
\]
Combining the definition of $(\nabla\tilde G^t)_{\theta_i}$ with the Cauchy–Schwarz inequality and Assumption 2, we have
\[
\begin{aligned}
\|(\nabla\tilde G^t)_{\theta_i}\|^2&\le\frac{3}{\eta_\theta^2}\|\theta_i^{\tilde t_i}-\theta_i^{t}\|^2+3L^2\Big(\sum_{i=1}^{N}\big(\|x_i^{\tilde t_i}-x_i^{t}\|^2+\|y_i^{\tilde t_i}-y_i^{t}\|^2\big)+\|v^{\tilde t_i}-v^{t}\|^2\Big)+3\big((c_2^{\tilde t_i-1})^2-(c_2^{t_i-1})^2\big)\|\theta_i^{t}\|^2\\
&\le\frac{3}{\eta_\theta^2}\|\theta_i^{\tilde t_i}-\theta_i^{t}\|^2+3L^2\sum_{i=1}^{N}\big(\|x_i^{\tilde t_i}-x_i^{t}\|^2+\|y_i^{\tilde t_i}-y_i^{t}\|^2\big)+3L^2\tau k_1\Big(\|v^{t+1}-v^{t}\|^2+\|z^{t+1}-z^{t}\|^2+\sum_{l=1}^{|P^t|}\|\lambda_l^{t+1}-\lambda_l^{t}\|^2\Big)\\
&\quad+3\big((c_2^{\tilde t_i-1})^2-(c_2^{t_i-1})^2\big)\|\theta_i^{t}\|^2. \quad (133)
\end{aligned}
\]
In view of Definition B.2 together with Eq. (128), (129), (130), (131), (132) and Eq. (133), we can obtain
\[
\begin{aligned}
\|\nabla\tilde G^t\|^2&=\sum_{i=1}^{N}\big(\|(\nabla\tilde G^t)_{x_i}\|^2+\|(\nabla\tilde G^t)_{y_i}\|^2+\|(\nabla\tilde G^t)_{\theta_i}\|^2\big)+\|(\nabla\tilde G^t)_{v}\|^2+\|(\nabla\tilde G^t)_{z}\|^2+\sum_{l=1}^{|P^t|}\|(\nabla\tilde G^t)_{\lambda_l}\|^2\\
&\le\Big(\frac{2}{\eta_x^2}+3NL^2\Big)\sum_{i=1}^{N}\|x_i^{\tilde t_i}-x_i^{t}\|^2+\Big(\frac{2}{\eta_y^2}+3NL^2\Big)\sum_{i=1}^{N}\|y_i^{\tilde t_i}-y_i^{t}\|^2\\
&\quad+(4+3|P^t|)L^2\sum_{i=1}^{N}\|x_i^{t+1}-x_i^{t}\|^2+(4+3|P^t|)L^2\sum_{i=1}^{N}\|y_i^{t+1}-y_i^{t}\|^2\\
&\quad+\Big(\frac{2}{\eta_v^2}+(2+15\tau k_1N+3|P^t|)L^2\Big)\|v^{t+1}-v^{t}\|^2+\Big(\frac{2}{\eta_z^2}+(15\tau k_1N+3|P^t|)L^2\Big)\|z^{t+1}-z^{t}\|^2\\
&\quad+\sum_{l=1}^{|P^t|}\Big(\frac{3}{\eta_\lambda^2}+15\tau k_1NL^2\Big)\|\lambda_l^{t+1}-\lambda_l^{t}\|^2+\sum_{l=1}^{|P^t|}3\big((c_1^{t-1})^2-(c_1^{t})^2\big)\|\lambda_l^{t}\|^2\\
&\quad+\sum_{i=1}^{N}\frac{3}{\eta_\theta^2}\|\theta_i^{\tilde t_i}-\theta_i^{t}\|^2+\sum_{i=1}^{N}3\big((c_2^{\tilde t_i-1})^2-(c_2^{t_i-1})^2\big)\|\theta_i^{t}\|^2. \quad (134)
\end{aligned}
\]
Let the constant $a_6$ denote the lower bound of $a_6^t$ ($a_6>0$), and define constants $d_1,d_2,d_3,d_4$ by
\[
d_1=\frac{2k_\tau\tau+(4+3M+3k_\tau\tau N)L^2\eta_x^2}{\eta_x^2(a_6)^2}\ \ge\ \frac{2k_\tau\tau+(4+3|P^t|+3k_\tau\tau N)L^2\eta_x^2}{\eta_x^2(a_6^t)^2}, \quad (135)
\]
\[
d_2=\frac{2k_\tau\tau+(4+3M+3k_\tau\tau N)L^2\eta_y^2}{\eta_y^2(a_6)^2}\ \ge\ \frac{2k_\tau\tau+(4+3|P^t|+3k_\tau\tau N)L^2\eta_y^2}{\eta_y^2(a_6^t)^2}, \quad (136)
\]
\[
d_3=\frac{2+(2+15\tau k_1N+3M)L^2\eta_v^2}{\eta_v^2(a_6)^2}\ \ge\ \frac{2+(2+15\tau k_1N+3|P^t|)L^2\eta_v^2}{\eta_v^2(a_6^t)^2}, \quad (137)
\]
\[
d_4=\frac{2+(15\tau k_1N+3M)L^2\eta_z^2}{\eta_z^2(a_6)^2}\ \ge\ \frac{2+(15\tau k_1N+3|P^t|)L^2\eta_z^2}{\eta_z^2(a_6^t)^2}, \quad (138)
\]
where $k_\tau$ is a positive constant. Thus, combining Eq. (134) with Eq. (135), (136), (137), (138), we can obtain
\[
\begin{aligned}
\|\nabla\tilde G^t\|^2&\le\sum_{i=1}^{N}d_1(a_6^t)^2\|x_i^{t+1}-x_i^{t}\|^2+\sum_{i=1}^{N}d_2(a_6^t)^2\|y_i^{t+1}-y_i^{t}\|^2+d_3(a_6^t)^2\|v^{t+1}-v^{t}\|^2+d_4(a_6^t)^2\|z^{t+1}-z^{t}\|^2\\
&\quad+\sum_{i=1}^{N}\frac{3}{\eta_\theta^2}\|\theta_i^{\tilde t_i}-\theta_i^{t}\|^2+\sum_{l=1}^{|P^t|}\Big(\frac{3}{\eta_\lambda^2}+15\tau k_1NL^2\Big)\|\lambda_l^{t+1}-\lambda_l^{t}\|^2+\sum_{l=1}^{|P^t|}3\big((c_1^{t-1})^2-(c_1^{t})^2\big)\|\lambda_l^{t}\|^2\\
&\quad+\sum_{i=1}^{N}3\big((c_2^{\tilde t_i-1})^2-(c_2^{t_i-1})^2\big)\|\theta_i^{t}\|^2+\Big(\frac{2}{\eta_x^2}+3NL^2\Big)\sum_{i=1}^{N}\|x_i^{\tilde t_i}-x_i^{t}\|^2-\Big(\frac{2k_\tau\tau}{\eta_x^2}+3k_\tau\tau NL^2\Big)\sum_{i=1}^{N}\|x_i^{t+1}-x_i^{t}\|^2\\
&\quad+\Big(\frac{2}{\eta_y^2}+3NL^2\Big)\sum_{i=1}^{N}\|y_i^{\tilde t_i}-y_i^{t}\|^2-\Big(\frac{2k_\tau\tau}{\eta_y^2}+3k_\tau\tau NL^2\Big)\sum_{i=1}^{N}\|y_i^{t+1}-y_i^{t}\|^2. \quad (139)
\end{aligned}
\]
Define
\[
d_5^t=\min\Big\{\frac{1}{d_1a_6^t},\ \frac{1}{d_2a_6^t},\ \frac{1}{d_3a_6^t},\ \frac{1}{d_4a_6^t},\ \frac{1-30\eta_\lambda\tau k_1NL^2}{30/\eta_\lambda+150\eta_\lambda\tau k_1NL^2},\ \frac{\eta_\theta}{30\tau}\Big\}.
\]
We denote the upper and lower bounds of $d_5^t$ by $\bar d_5$ and $\underline d_5$, respectively, and set the constant $k_\tau$ to satisfy $k_\tau\ge\max\Big\{\frac{\bar d_5(\frac{2}{\eta_y^2}+3NL^2)}{\underline d_5(\frac{2}{\bar\eta_y^2}+3NL^2)},\ \frac{\bar d_5(\frac{2}{\eta_x^2}+3NL^2)}{\underline d_5(\frac{2}{\bar\eta_x^2}+3NL^2)}\Big\}$, where $\bar\eta_x$ and $\bar\eta_y$ are the upper bounds of $\eta_x^t$ and $\eta_y^t$, respectively. We can obtain the following inequality by combining Eq.
(139) with the definition of $d_5^t$:
\[
\begin{aligned}
d_5^t\|\nabla\tilde G^t\|^2&\le a_6^t\sum_{i=1}^{N}\big(\|x_i^{t+1}-x_i^{t}\|^2+\|y_i^{t+1}-y_i^{t}\|^2\big)+a_6^t\|v^{t+1}-v^{t}\|^2+a_6^t\|z^{t+1}-z^{t}\|^2\\
&\quad+\Big(\frac{1}{10\eta_\lambda}-\frac{6\tau k_1NL^2}{2}\Big)\sum_{l=1}^{|P^t|}\|\lambda_l^{t+1}-\lambda_l^{t}\|^2+\frac{1}{10\tau\eta_\theta}\sum_{i=1}^{N}\|\theta_i^{\tilde t_i}-\theta_i^{t}\|^2\\
&\quad+\sum_{l=1}^{|P^t|}3d_5^t\big((c_1^{t-1})^2-(c_1^{t})^2\big)\|\lambda_l^{t}\|^2+\sum_{i=1}^{N}3d_5^t\big((c_2^{\tilde t_i-1})^2-(c_2^{t_i-1})^2\big)\|\theta_i^{t}\|^2\\
&\quad+d_5^t\Big(\frac{2}{\eta_x^2}+3NL^2\Big)\sum_{i=1}^{N}\|x_i^{\tilde t_i}-x_i^{t}\|^2-d_5^t\Big(\frac{2k_\tau\tau}{\bar\eta_x^2}+3k_\tau\tau NL^2\Big)\sum_{i=1}^{N}\|x_i^{t+1}-x_i^{t}\|^2\\
&\quad+d_5^t\Big(\frac{2}{\eta_y^2}+3NL^2\Big)\sum_{i=1}^{N}\|y_i^{\tilde t_i}-y_i^{t}\|^2-d_5^t\Big(\frac{2k_\tau\tau}{\bar\eta_y^2}+3k_\tau\tau NL^2\Big)\sum_{i=1}^{N}\|y_i^{t+1}-y_i^{t}\|^2. \quad (140)
\end{aligned}
\]
Summing Eq. (140) from $t=T_1+2$ to $t=T_1+\tilde T(\epsilon)$, the delayed-variable terms on the right-hand side can be grouped as
\[
\begin{aligned}
&\frac{1}{10\tau\eta_\theta}\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|\theta_i^{\tilde t_i}-\theta_i^{t}\|^2-\frac{1}{10\eta_\theta}\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|\theta_i^{t+1}-\theta_i^{t}\|^2\\
&\quad+\bar d_5\Big(\frac{2}{\eta_x^2}+3NL^2\Big)\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|x_i^{\tilde t_i}-x_i^{t}\|^2-\underline d_5\Big(\frac{2k_\tau\tau}{\bar\eta_x^2}+3k_\tau\tau NL^2\Big)\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|x_i^{t+1}-x_i^{t}\|^2\\
&\quad+\bar d_5\Big(\frac{2}{\eta_y^2}+3NL^2\Big)\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|y_i^{\tilde t_i}-y_i^{t}\|^2-\underline d_5\Big(\frac{2k_\tau\tau}{\bar\eta_y^2}+3k_\tau\tau NL^2\Big)\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|y_i^{t+1}-y_i^{t}\|^2, \quad (142)
\end{aligned}
\]
where $\sigma_3=\max\{\|\lambda^{t_1}-\lambda^{t_2}\|\}$, $\sigma_4=\max\{\|\theta^{t_1}-\theta^{t_2}\|\}$, and $\underline L=\min L_p(\{x_i^t\},\{y_i^t\},v^t,z^t,\{\lambda_l^t\},\{\theta_i^t\})$, which satisfy, $\forall t\ge T_1+2$,
\[
\bar F^{t}\ge\underline L-\frac{4}{\eta_\lambda}\frac{c_1^{1}}{c_1^{2}}M\alpha_3-\frac{4}{\eta_\theta}\frac{c_2^{1}}{c_2^{2}}N\alpha_4-\frac{7}{2\eta_\lambda}M\sigma_3^2-\frac{7}{2\eta_\theta}N\sigma_4^2-\frac{c_1^{T_1+2}}{2}M\sigma_3^2-\frac{c_2^{T_1+2}}{2}N\sigma_4^2. \quad (143)
\]
For each worker $i$, we have $t_i-\tilde t_i\le\tau$; thus,
\[
\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}3\bar d_5\big((c_2^{\tilde t_i-1})^2-(c_2^{t_i-1})^2\big)\alpha_4\ \le\ \tau\!\!\sum_{\substack{\bar v_i(j)\in V_i(\tilde T(\epsilon)),\\ T_1+2\le\bar v_i(j)\le T_1+\tilde T(\epsilon)}}\!\!3\bar d_5\big((c_2^{\bar v_i(j)-1})^2-(c_2^{\bar v_i(j+1)-1})^2\big)\alpha_4\ \le\ 3\tau\bar d_5(c_2^{1})^2\alpha_4. \quad (144)
\]
Since idle workers do not update their variables in each master iteration, for any $t$ satisfying $\bar v_i(j-1)\le t<\bar v_i(j)$, we have $\theta_i^{t}=\theta_i^{\bar v_i(j)-1}$. And for $t\notin V_i(T)$, we have $\|\theta_i^{t}-\theta_i^{t-1}\|^2=0$.
Combining with $\bar v_i(j)-\bar v_i(j-1)\le\tau$, we can obtain
\[
\begin{aligned}
\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|\theta_i^{\tilde t_i}-\theta_i^{t}\|^2&\le\tau\!\!\sum_{\substack{\bar v_i(j)\in V_i(\tilde T(\epsilon)),\\ T_1+3\le\bar v_i(j)}}\sum_{i=1}^{N}\|\theta_i^{\bar v_i(j)}-\theta_i^{\bar v_i(j)-1}\|^2\\
&=\tau\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|\theta_i^{t+1}-\theta_i^{t}\|^2+\tau\!\!\sum_{t=\tilde T(\epsilon)+1}^{\tilde T(\epsilon)+\tau-1}\sum_{i=1}^{N}\|\theta_i^{t+1}-\theta_i^{t}\|^2\\
&\le\tau\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|\theta_i^{t+1}-\theta_i^{t}\|^2+4\tau(\tau-1)N\alpha_4. \quad (146)
\end{aligned}
\]
Similarly, for any $t$ satisfying $\bar v_i(j-1)\le t<\bar v_i(j)$, we have
\[
x_i^{t}=x_i^{\bar v_i(j)-1},\qquad y_i^{t}=y_i^{\bar v_i(j)-1}, \quad (147)
\]
and for $t\notin V_i(T)$, we have $\|x_i^{t}-x_i^{t-1}\|^2=0$ and $\|y_i^{t}-y_i^{t-1}\|^2=0$. Combining with $\bar v_i(j)-\bar v_i(j-1)\le\tau$, we can get
\[
\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|x_i^{\tilde t_i}-x_i^{t}\|^2\le\tau\!\!\sum_{\substack{\bar v_i(j)\in V_i(\tilde T(\epsilon)),\\ T_1+3\le\bar v_i(j)}}\sum_{i=1}^{N}\|x_i^{\bar v_i(j)}-x_i^{\bar v_i(j)-1}\|^2=\tau\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|x_i^{t+1}-x_i^{t}\|^2+\tau\!\!\sum_{t=\tilde T(\epsilon)+1}^{\tilde T(\epsilon)+\tau-1}\sum_{i=1}^{N}\|x_i^{t+1}-x_i^{t}\|^2\le\tau\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\sum_{i=1}^{N}\|x_i^{t+1}-x_i^{t}\|^2+4\tau(\tau-1)N\alpha_1, \quad (148)
\]
and the analogous bound holds for $y$ with constant $\alpha_2$. It follows from Eq. (142), (144), (145), (146) that
\[
\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}d_5^t\|\nabla\tilde G^t\|^2\le\bar F^{T_1+2}-\underline L+\frac{4}{\eta_\lambda}(\cdots)+\Big(\frac{2N\alpha_4}{5\eta_\theta}+4\bar d_5\Big(\frac{2}{\eta_x^2}+3NL^2\Big)N\alpha_1\tau+4\bar d_5\Big(\frac{2}{\eta_y^2}+3NL^2\Big)N\alpha_2\tau\Big)(\tau-1)=\bar F-\underline d+k_d\tau(\tau-1), \quad (151)
\]
where $\bar F$, $\underline d$ and $k_d$ are constants. According to the settings of $c_1^t,c_2^t$, we have
\[
\frac{1}{a_6^t}\ \ge\ \frac{1}{4(\gamma-2)L^2(M\eta_\lambda+N\eta_\theta)(t+1)^{\frac12}+\frac{\eta_\theta(N-S)L^2}{2}}. \quad (152)
\]
Summing up $\frac{1}{a_6^t}$ from $t=T_1+2$ to $t=T_1+\tilde T(\epsilon)$, it follows that
\[
\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\frac{1}{a_6^t}\ \ge\ \frac{(T_1+\tilde T(\epsilon))^{\frac12}-(T_1+2)^{\frac12}}{4(\gamma-2)L^2(M\eta_\lambda+N\eta_\theta)+\frac{\eta_\theta(N-S)L^2}{2}}. \quad (153)
\]
The second inequality in Eq. (153) is due to that, $\forall t\ge T_1+2$, the denominator $4(\gamma-2)L^2(M\eta_\lambda+N\eta_\theta)(t+1)^{\frac12}+\frac{\eta_\theta(N-S)L^2}{2}$ is upper bounded by $\big(4(\gamma-2)L^2(M\eta_\lambda+N\eta_\theta)+\frac{\eta_\theta(N-S)L^2}{2}\big)(t+1)^{\frac12}$, together with
\[
\sum_{t=T_1+2}^{T_1+\tilde T(\epsilon)}\frac{1}{(t+1)^{\frac12}}\ \ge\ (T_1+\tilde T(\epsilon))^{\frac12}-(T_1+2)^{\frac12}. \quad (154)
\]
Let the constant $d_7=4(\gamma-2)L^2(M\eta_\lambda+N\eta_\theta)$; according to the definition of $\tilde T(\epsilon)$, we have
\[
T_1+\tilde T(\epsilon)\ \ge\ \Big(\frac{4\big(d_7+\frac{\eta_\theta(N-S)L^2}{2}\big)\big(\bar F-\underline d+k_d\tau(\tau-1)\big)d_6}{\epsilon}+(T_1+2)^{\frac12}\Big)^2. \quad (155)
\]
Combining the definitions of $\nabla G^t$ and $\nabla\tilde G^t$ with the triangle inequality, we then get
\[
\|\nabla G^t\|-\|\nabla\tilde G^t\|\ \le\ \|\nabla G^t-\nabla\tilde G^t\|\ \le\ \sqrt{\sum_{l=1}^{|P^t|}\|c_1^{t-1}\lambda_l^{t}\|^2+\sum_{i=1}^{N}\|c_2^{t-1}\theta_i^{t}\|^2}. \quad (156)
\]
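The summation step behind Eq. (153) is the standard integral comparison for the series $\sum_t (t+1)^{-1/2}$. The following is a quick numerical sanity check of both the lower bound used in the proof and its matching upper bound (a sketch; the ranges are arbitrary):

```python
import math

def partial_sum(a, b):
    """Compute sum_{t=a}^{b} 1 / sqrt(t + 1)."""
    return sum(1.0 / math.sqrt(t + 1) for t in range(a, b + 1))

for a, b in [(3, 100), (10, 10_000), (2, 7)]:
    s = partial_sum(a, b)
    # Lower bound of the form used in Eq. (153)-(154): sqrt(b) - sqrt(a).
    assert s >= math.sqrt(b) - math.sqrt(a)
    # Matching upper bound from the same integral comparison.
    assert s <= 2.0 * (math.sqrt(b + 1) - math.sqrt(a))
```

Both bounds follow from comparing each term $1/\sqrt{t+1}$ with the integral of $x^{-1/2}$ over an adjacent unit interval.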

C PROOF OF THEOREM 2

Assume that cutting planes are added every $k$ iterations, i.e.,
\[
P^{0}\subseteq P^{k}\subseteq\cdots\subseteq P^{nk}. \quad (159)
\]
Let $R^{k}$ denote the feasible region of the problem in Eq. (12) in the $k$-th iteration, and let $R'$ denote the feasible region of the problem in Eq. (10); then
\[
R^{0}\supseteq R^{k}\supseteq\cdots\supseteq R^{nk}\supseteq R'. \quad (160)
\]
Let $F(\{x_i^{k*}\},\{y_i^{k*}\},v^{k*},z^{k*})$ denote the optimal objective value of the problem in Eq. (12) in the $k$-th iteration, and let $F^*$ denote the optimal objective value of the problem in Eq. (10). According to Eq. (160) and Eq. (161), we can obtain
\[
\frac{F^*}{F(\{x_i^{0*}\},\{y_i^{0*}\},v^{0*},z^{0*})}\ \ge\ \frac{F^*}{F(\{x_i^{k*}\},\{y_i^{k*}\},v^{k*},z^{k*})}\ \ge\ \cdots\ \ge\ \frac{F^*}{F(\{x_i^{nk*}\},\{y_i^{nk*}\},v^{nk*},z^{nk*})}\ \ge\ \beta. \quad (162)
\]
It is seen from Eq. (162) that the sequence $\big\{F^*/F(\{x_i^{k*}\},\{y_i^{k*}\},v^{k*},z^{k*})\big\}$ is monotonically non-increasing. When $nk\to\infty$, this ratio monotonically converges to $\beta$ ($\beta\ge1$).
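A toy one-dimensional example (the objective and the constraints below are invented for illustration) shows why the chain in Eq. (162) is monotone: each added cutting plane shrinks the feasible region, so the optimal value of the relaxed problem can only increase, and the ratio $F^*/F(\cdot)$ can only decrease while staying at least $1$.

```python
# Toy illustration: minimize f(x) = x + 2 over [0, 10], where cutting planes
# of the form x >= c progressively tighten the feasible region.
def minimize_relaxed(planes):
    lo = max([0.0] + planes)      # intersection of [0, 10] with {x >= c}
    return lo + 2.0               # minimum of f(x) = x + 2 is attained at x = lo

f_star = minimize_relaxed([4.0])  # pretend x >= 4 describes the true feasible set
vals, ratios, planes = [], [], []
for c in [1.0, 2.5, 4.0]:
    planes.append(c)              # add one more cutting plane
    v = minimize_relaxed(planes)
    vals.append(v)
    ratios.append(f_star / v)

assert vals == sorted(vals)                    # optimal values are non-decreasing
assert ratios == sorted(ratios, reverse=True)  # ratios are non-increasing
assert all(r >= 1.0 for r in ratios)           # bounded below by beta = 1
```

Here the last ratio reaches exactly $1$ because the final plane recovers the true feasible set, mirroring the limit behavior in Eq. (162).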

D DETAILS OF EXPERIMENTS D.1 ADDITIONAL RESULTS

In this section, additional experimental results on the CIFAR-10 (Krizhevsky et al., 2009) and Australian (Quinlan, 1987) datasets are reported in Figure 11 and Figure 12. It is seen from Figure 11 and Figure 12 that the proposed ADBO also achieves a faster convergence rate.

D.2 DETAILS OF EXPERIMENTS

In this section, we provide more details of the experimental setup in this work. In the data hyper-cleaning task, experiments are carried out on the MNIST, Fashion MNIST and CIFAR-10 datasets. Following (Ji et al., 2021), we utilize the same model for data hyper-cleaning on the MNIST, Fashion MNIST and CIFAR-10 datasets, and the SGD optimizer is utilized; the step-sizes are summarized in Table 2. On the MNIST and Fashion MNIST datasets, we set N = 18, S = 9, τ = 15; on the CIFAR-10 dataset, we set N = 18, S = 9, τ = 5. The (communication + computation) delays of each worker obey the log-normal distribution LN(3.5, 1). In the regularization coefficient optimization task, experiments are carried out on the Covertype, IJCNN1 and Australian datasets. Following (Chen et al., 2022a), we utilize the same logistic regression model, and the SGD optimizer is used; the step-sizes are summarized in Table 2. On the Covertype dataset, we set N = 18, S = 9, τ = 15; on the IJCNN1 dataset, we set N = 24, S = 12, τ = 15; and on the Australian dataset, we set N = 4, S = 2, τ = 5. In the experiments that consider straggler problems, three stragglers are set in the distributed system, and the mean (communication + computation) delay of the stragglers is four times that of normal workers. Codes are available at https://github.com/ICLR23Submission6251/adbo.
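The delay model described above can be reproduced with a short simulation (a sketch; `worker_delay` is a hypothetical helper, not part of the released code): normal workers draw (communication + computation) delays from LN(3.5, 1), and stragglers get four times the mean delay, which for a log-normal distribution amounts to shifting the location parameter μ by ln 4.

```python
import math
import random

random.seed(0)

def worker_delay(is_straggler: bool) -> float:
    # Normal workers: delay ~ LogNormal(mu=3.5, sigma=1), matching LN(3.5, 1).
    # Stragglers: 4x the mean delay; since the log-normal mean is
    # exp(mu + sigma^2 / 2), this corresponds to shifting mu by ln(4).
    mu = 3.5 + (math.log(4.0) if is_straggler else 0.0)
    return random.lognormvariate(mu, 1.0)

# Empirically check the 4x mean-delay ratio between stragglers and normal workers.
n = 50_000
normal_mean = sum(worker_delay(False) for _ in range(n)) / n
straggler_mean = sum(worker_delay(True) for _ in range(n)) / n
ratio = straggler_mean / normal_mean
assert 3.5 < ratio < 4.5
```

The assertion uses a loose tolerance because the log-normal distribution is heavy-tailed, so sample means fluctuate noticeably even with many draws.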






where $\alpha_3$, $\alpha_4$, $\gamma$, $k_d$, $T_1$, $\underline d$, $d_6$ and $d_7$ are constants. The detailed proof is given in Appendix B.

Figure 1: Test accuracy vs time on (a) MNIST and (b) Fashion MNIST datasets.

Figure 7: Comparison of (a) test accuracy vs time, (b) test loss vs time on Covertype dataset.

Figure 8: Comparison of (a) test accuracy vs time, (b) test loss vs time on IJCNN1 dataset.

Figure 10: Comparison of (a) test accuracy vs time, (b) test loss vs time on CIFAR-FS dataset.


and $k_d$ are constants. The constant $d_6$ is given by
\[
d_6=\max\Big\{d_1,d_2,d_3,d_4,\ \frac{30/\eta_\lambda+150\eta_\lambda\tau k_1NL^2}{(1-30\eta_\lambda\tau k_1NL^2)\,a_6},\ \frac{30\tau}{\eta_\theta a_6}\Big\}\ \ge\ \max\Big\{d_1,d_2,d_3,d_4,\ \frac{30/\eta_\lambda+150\eta_\lambda\tau k_1NL^2}{(1-30\eta_\lambda\tau k_1NL^2)\,a_6^t},\ \frac{30\tau}{\eta_\theta a_6^t}\Big\}.
\]

\[
4(\gamma-2)L^2(M\eta_\lambda+N\eta_\theta)(t+1)^{\frac12}+\frac{\eta_\theta(N-S)L^2}{2}\ \le\ \Big(4(\gamma-2)L^2(M\eta_\lambda+N\eta_\theta)+\frac{\eta_\theta(N-S)L^2}{2}\Big)(t+1)^{\frac12}.
\]

Substituting Eq. (153) into Eq. (151), we can obtain:
\[
\|\nabla\tilde G^{T_1+\tilde T(\epsilon)}\|^2\ \le\ \frac{\big(4(\gamma-2)L^2(M\eta_\lambda+N\eta_\theta)+\frac{\eta_\theta(N-S)L^2}{2}\big)\big(\bar F-\underline d+k_d\tau(\tau-1)\big)d_6}{(T_1+\tilde T(\epsilon))^{\frac12}-(T_1+2)^{\frac12}}.
\]

\[
F(\{x_i^{0*}\},\{y_i^{0*}\},v^{0*},z^{0*})\ \le\ F(\{x_i^{k*}\},\{y_i^{k*}\},v^{k*},z^{k*})\ \le\ \cdots\ \le\ F(\{x_i^{nk*}\},\{y_i^{nk*}\},v^{nk*},z^{nk*}). \quad (161)
\]

Figure 11: (a) Test accuracy vs time and (b) Test loss vs time on CIFAR-10 dataset on distributed data hyper-cleaning task.


Table 2: Step-sizes of all variables in the experiments.

Thus, $\|\nabla G^t\|^2\le\epsilon$, which concludes our proof.

ACKNOWLEDGMENTS

The work of Yang Jiao, Kai Yang and Chengtao Jian was supported in part by the National Natural Science Foundation of China under Grant 61771013, in part by the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS), in part by the Fundamental Research Funds for the Central Universities of China, and in part by the Fundamental Research Funds of Shanghai Jiading District. We thank Haibo Zhao for providing the experiment results in meta-learning.

E PARAMETER SERVER ARCHITECTURE

In this section, we illustrate the parameter server architecture, which is shown in Figure 13. In the parameter server architecture, communication is centralized around a set of master nodes (or servers) that constitute the hubs of a star network, while worker nodes (or clients) pull the shared parameters from, and send their updates to, the master nodes.
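As a concrete illustration of this pull/push pattern, the following is a minimal single-process sketch of a parameter server (all class and function names are hypothetical, and the quadratic objective is invented for illustration): workers pull the shared parameter, compute a local gradient, and push it back, while the master applies each update as it arrives.

```python
class ParameterServer:
    """Minimal master node: workers pull the shared parameter and push
    (possibly stale) gradient updates, which are applied immediately."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.version = 0          # how many updates have been applied

    def pull(self):
        return list(self.w), self.version

    def push(self, grad, lr=0.1):
        # Apply a worker's gradient step as soon as it arrives (asynchronous).
        self.w = [wi - lr * gi for wi, gi in zip(self.w, grad)]
        self.version += 1

def worker_grad(w, target):
    # Gradient of the toy objective 0.5 * ||w - target||^2.
    return [wi - ti for wi, ti in zip(w, target)]

server = ParameterServer(dim=2)
target = [1.0, -2.0]
for _ in range(200):
    w, _ = server.pull()          # worker pulls the shared parameter
    server.push(worker_grad(w, target))  # worker pushes its update

assert server.version == 200
assert all(abs(wi - ti) < 1e-3 for wi, ti in zip(server.w, target))
```

In a real deployment the pull/push calls would be remote procedure calls from concurrent workers, so pushed gradients may be computed against stale parameter versions; the `version` counter is one simple way to measure that staleness.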

