PROMETHEUS: ENDOWING LOW SAMPLE AND COMMUNICATION COMPLEXITIES TO CONSTRAINED DECENTRALIZED STOCHASTIC BILEVEL LEARNING

Abstract

In recent years, constrained decentralized stochastic bilevel optimization has become increasingly important due to its versatility in modeling a wide range of multi-agent learning problems, such as multi-agent reinforcement learning and multi-agent meta-learning with safety constraints. However, one under-explored and fundamental challenge in constrained decentralized stochastic bilevel optimization is how to achieve low sample and communication complexities, which, if not addressed appropriately, could affect the long-term prospects of many emerging multi-agent learning paradigms that use decentralized bilevel optimization as a bedrock. In this paper, we investigate a class of constrained decentralized bilevel optimization problems, where multiple agents collectively solve a nonconvex-strongly-convex bilevel problem with constraints on the upper-level variables. Such problems arise naturally in many multi-agent reinforcement learning and meta-learning applications. We propose an algorithm called Prometheus (proximal tracked stochastic recursive estimator) that achieves the first O(ε^{-1}) results in both sample and communication complexities for constrained decentralized bilevel optimization, where ε > 0 is a desired stationarity error. Collectively, the results in this work contribute to a theoretical foundation for low sample- and communication-complexity constrained decentralized bilevel learning.

1. INTRODUCTION

In recent years, the problem of constrained decentralized bilevel optimization has attracted increasing attention due to its foundational role in many emerging multi-agent learning paradigms with safety or regularization constraints. Such applications include, but are not limited to, safety-constrained multi-agent reinforcement learning for autonomous driving (Bennajeh et al., 2019), sparsity-regularized multi-agent meta-learning (Poon & Peyré, 2021), and rank-constrained decentralized matrix completion for recommender systems (Pochmann & Von Zuben, 2022). As its name suggests, a defining feature of constrained decentralized bilevel optimization is "decentralized," which implies that the problem needs to be solved over a network without any coordination from a centralized server. As a result, all agents must rely on communications to reach a consensus on an optimal solution. Due to potentially unreliable network connections and the limited computation capability at each agent, such network-consensus approaches for constrained decentralized bilevel optimization typically call for low sample and communication complexities. To date, however, none of the existing works on sample- and communication-efficient decentralized bilevel optimization in the literature has considered domain constraints (e.g., Gao et al. (2022); Yang et al. (2022); Lu et al. (2022); Chen et al. (2022b); see Section 2 for detailed discussions). In light of the growing importance of constrained decentralized bilevel optimization, our goal in this paper is to fill this gap by developing sample- and communication-efficient consensus-based algorithms that can effectively handle domain constraints.
Specifically, this paper focuses on a class of constrained decentralized multi-task bilevel optimization problems, where we aim to solve a decentralized nonconvex-strongly-convex bilevel optimization problem with i) multiple lower-level problems and ii) consensus and domain constraints on the upper level. Such problems naturally arise in security-constrained bilevel models for integrated natural gas and electricity systems (Li et al., 2017), multi-agent actor-critic reinforcement learning (Zhang et al., 2020), and constrained meta-learning (Liu et al., 2019). In the optimization literature, a natural approach for handling domain constraints is the proximal operator. However, as will be shown later, proximal algorithm design and theoretical analysis for constrained decentralized bilevel optimization problems are much more complicated than those of their unconstrained counterparts, and the existing results are very limited. In fact, in the literature, the proximal operator for constrained bilevel optimization has been under-explored even in the single-agent setting, not to mention the more complex multi-agent settings. The most related works in terms of handling domain constraints can be found in (Hong et al., 2020; Chen et al., 2022a; Ghadimi & Wang, 2018), which rely on direct projected (stochastic) gradient descent to solve the constrained bilevel problem. In contrast, our work considers general domain constraints that require the evaluation of proximal operators in each iteration. Also, these works only considered the single-agent setting, and hence their techniques are not implementable over networks. In fact, prior to this work, it was unclear how to design proximal algorithms to handle domain constraints for decentralized bilevel optimization.
Moreover, it is worth noting that existing methods for hyper-gradient approximation in both single- and multi-agent bilevel optimization are based on first-order Taylor-type approximation approaches (Khanduri et al., 2021; Ghadimi & Wang, 2018; Hong et al., 2020), implicit differentiation (Ghadimi & Wang, 2018; Gould et al., 2016; Ji et al., 2021), or iterative differentiation (Franceschi et al., 2017; Maclaurin et al., 2015; Ji et al., 2021), all of which suffer from high communication and sample complexities that are problematic in decentralized settings over networks. The main contribution of this paper is that we propose a series of new proximal-type algorithmic techniques to overcome the challenges mentioned above and achieve low sample and communication complexities for constrained decentralized bilevel optimization problems. The main technical contributions of this work are summarized below:
• We propose a decentralized optimization approach called Prometheus (proximal tracked stochastic recursive estimator), a carefully designed hybrid algorithm that integrates proximal operations, recursive variance reduction, lower-level gradient tracking, and upper-level consensus techniques. We show that, to achieve an ε-stationary point, Prometheus enjoys a convergence rate of O(1/T), where T is the maximum number of iterations. This implies an O(ε^{-1}) communication complexity and an O(√n·Kε^{-1} + n) sample complexity per agent.
• We propose a new hyper-gradient estimator for the upper-level function, which leads to a far more accurate stochastic estimation than the conventional stochastic estimator used in (Khanduri et al., 2021; Ghadimi & Wang, 2018; Hong et al., 2020; Liu et al., 2022). We show that our new hyper-gradient stochastic estimator has a smaller variance and outperforms existing estimators both theoretically and experimentally. We note that our proposed estimator could be of independent interest for other bilevel optimization problems.
• We reveal an interesting insight that the variance reduction in Prometheus is not only sufficient but also necessary in the following sense: a "non-variance-reduced" special version of Prometheus can only achieve a much slower O(1/√T) convergence to a constant error ball, rather than an ε-stationary point with an arbitrarily small ε-tolerance. This insight advances our understanding and the state of the art of algorithm design for constrained decentralized bilevel optimization.
The rest of the paper is organized as follows. In Section 2, we review related literature. In Section 3, we provide the preliminaries of the decentralized bilevel optimization problem. In Section 4, we provide details on our proposed Prometheus algorithm, including the convergence rate, communication complexity, and sample complexity results. Section 5 provides numerical results to verify our theoretical findings, and Section 6 concludes this paper.

2. RELATED WORK

In this section, we provide a quick overview of the state of the art on single-agent constrained bilevel optimization as well as decentralized bilevel optimization. 1) Constrained Bilevel Optimization in the Single-Agent Setting: As mentioned in Section 1, various techniques have been proposed to solve single-agent bilevel optimization, such as full-gradient-based techniques (e.g., AID-based methods (Rajeswaran et al., 2019; Franceschi et al., 2018; Ji et al., 2021), ITD-based methods (Pedregosa, 2016; Maclaurin et al., 2015; Ji et al., 2021)), stochastic gradient-based techniques (Ghadimi & Wang, 2018; Khanduri et al., 2021; Guo & Yang, 2021), STORM-based techniques (Cutkosky & Orabona, 2019), and VR-based techniques (Yang et al., 2021). However, none of these existing works considered domain constraints. To our knowledge, the only works that considered domain constraints in the single-agent setting can be found in (Hong et al., 2020; Chen et al., 2022a; Ghadimi & Wang, 2018). In (Ghadimi & Wang, 2018), the authors proposed a double-loop algorithm called BSA, where in the inner loop the lower-level problem is solved to sufficient accuracy, while in the outer loop projected (stochastic) gradient descent is utilized to update the model parameters. The double-loop structure of BSA led to slow convergence. In (Hong et al., 2020), a two-timescale single-loop stochastic approximation (TTSA) algorithm based on projected (stochastic) gradient descent was proposed to solve constrained bilevel optimization problems. However, TTSA has to choose step-sizes of different orders for the upper- and lower-level problems to ensure convergence, which leads to suboptimal complexity results. Later, in (Chen et al., 2022a), an algorithm called STABLE was proposed, which utilizes a momentum-based gradient estimator and combines it with a Moreau-envelope-based analysis to achieve an O(ε^{-2}) sample complexity.
As mentioned in Section 1, however, the methods in (Ghadimi & Wang, 2018; Hong et al., 2020; Chen et al., 2022a) consider only simple constraints. Moreover, the aforementioned methods are not applicable in the decentralized setting. 2) Decentralized Bilevel Optimization: Decentralized bilevel optimization has also received increasing attention in recent years. For example, Yang et al. (2022), Lu et al. (2022), and Chen et al. (2022b) respectively proposed stochastic gradient (SG)-type decentralized algorithms for bilevel optimization and achieved an O(ε^{-2}) sample and communication complexity. The VRDBO method in (Gao et al., 2022) employed momentum-based techniques for decentralized bilevel optimization to achieve a better O(ε^{-1.5}) complexity result. However, VRDBO updates upper- and lower-level variables in an alternating fashion. As will be shown later, our Prometheus algorithm updates upper- and lower-level variables simultaneously, which renders a much lower implementation complexity than VRDBO. Besides, Prometheus achieves an O(√n·Kε^{-1} + n) sample complexity, which is near-optimal and outperforms existing decentralized bilevel algorithms. It is worth noting that, in the aforementioned works, consensus requirements exist on both the lower- and upper-level subproblems. To a certain extent, such a formulation can be viewed as multiple agents collaboratively solving the same bilevel optimization problem. In contrast, our work only has a consensus requirement in the upper-level subproblem, which implies multiple different lower-level tasks. This is more practically relevant and a more appropriate formulation for multi-agent reinforcement learning, multi-agent meta-learning, etc. We note that the most related work on decentralized bilevel optimization is (Liu et al., 2022), which also considered multiple lower-level tasks.
However, the INTERACT method in (Liu et al., 2022) is designed for unconstrained problems and cannot handle the non-smooth objectives considered in our work. In contrast, we propose a special proximal operator x̂_i(·) to address this challenge. Last but not least, we note that none of the aforementioned works on decentralized bilevel optimization took domain constraints into consideration. For clearer comparisons, we summarize and compare the complexity results of all algorithms mentioned above in Table 1.

3. PROBLEM FORMULATION AND MOTIVATING APPLICATIONS

1) Network Consensus Formulation for Decentralized Bilevel Optimization: Consider an undirected connected network G = (N, L) that represents a peer-to-peer network, where N and L are the sets of agents (nodes) and edges, respectively, with |N| = m. Each agent i has local computation capability and can share information with its neighboring agents, denoted as N_i ≜ {i' ∈ N : (i, i') ∈ L}. Each agent i has access to a local dataset of size n. All agents in the network collaboratively solve the following constrained decentralized bilevel optimization problem:

min_{x_i ∈ X} (1/m) Σ_{i=1}^m [ℓ(x_i) + h(x_i)] ≜ (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n [f(x_i, y_i^*(x_i); ξ_{ij}) + h(x_i)],
s.t. y_i^*(x_i) = argmin_{y_i ∈ R^{p_2}} g(x_i, y_i) ≜ (1/n) Σ_{j=1}^n g(x_i, y_i; ζ_{ij}), ∀i;  x_i = x_{i'}, if (i, i') ∈ L,  (1)

where X ⊆ R^{p_1} is a convex constraint set, and x_i ∈ X and y_i ∈ R^{p_2} are parameters to be trained for the upper-level and lower-level subproblems at agent i, respectively. Here, ℓ(x_i) ≜ f(x_i, y_i^*(x_i)) = (1/n) Σ_{j=1}^n f(x_i, y_i^*(x_i); ξ_{ij}) is the local objective function, and h(x_i) is a convex proximal function (possibly non-differentiable) for regularization. The equality constraints x_i = x_{i'} ensure that the local copies at connected agents i and i' are equal to each other, hence the name "consensus form."

Algorithm | Sample complexity | Communication complexity
DSBO (Yang et al., 2022) | O(ε^{-2}) | O(ε^{-2})
SPDB (Lu et al., 2022) | O(ε^{-2}) | O(ε^{-2})
DSBO (Chen et al., 2022b) | O(ε^{-2}) | O(ε^{-2})
VRDBO (Gao et al., 2022) | O(ε^{-1.5}) | O(ε^{-1.5})
INTERACT (Liu et al., 2022) | O(nε^{-1}) | O(ε^{-1})
INTERACT-VR (Liu et al., 2022) | O(√n·Kε^{-1} + n) | O(ε^{-1})
Prometheus [Ours] | O(√n·Kε^{-1} + n) | O(ε^{-1})

Next, we define the notion of an ε-stationarity point for Problem (1) for convergence performance characterization.
We say that {x_i, y_i, ∀i ∈ [m]} is an ε-stationarity point if it satisfies:

E‖x̂ − 1 ⊗ x̄‖² (saddle point error) + E‖x − 1 ⊗ x̄‖² (consensus error) + E‖y − y*‖² (lower problem error) ≤ ε,  (2)

where x̄ ≜ (1/m) Σ_{i=1}^m x_i, y ≜ [y_1^T, ..., y_m^T]^T, y* ≜ [y_1^{*T}, ..., y_m^{*T}]^T, and x̂ is a proximal point that will be defined later in Section 4. The first term in (2) quantifies the convergence of the x-iterates to a proximal stationarity point of the global objective. The second term in (2) measures the consensus error among the local copies of the upper-level variable, while the last term in (2) quantifies the (aggregated) error in the lower-level iterates across all agents. Thus, ε → 0 implies that the algorithm achieves three goals simultaneously: i) consensus of the upper-level variables, ii) a stationary point of Problem (1), and iii) a solution to the lower-level problems. As mentioned in Section 1, two of the most important performance metrics in decentralized optimization are the sample and communication complexities. 2) Motivating Applications: Problem (1) arises naturally in many interesting real-world applications. Here, we present two motivating applications to showcase its practical relevance:
• Multi-agent meta-learning (Rajeswaran et al., 2019): Meta-learning (or learning to learn) aims to find a model that can adapt to multiple related tasks. A popular meta-learning framework is model-agnostic meta-learning (MAML), which minimizes an upper-level objective of empirical risk over all tasks. Consider a multi-agent meta-learning setting with m lower-level problems, where m agents collectively solve this meta-learning problem over a network. The problem can be formulated as: min_{x∈X} Σ_{i=1}^m f(x, y_i^*(x)), s.t. y_i^*(x) ∈ argmin_{y_i∈R^{p_2}} g(x, y_i), i = 1, ..., m. Here, agent i has a local dataset with n samples, x ∈ X is the constrained (e.g., due to safety) model parameter shared by all agents, and the y_i are task-specific parameters solved by each agent.
• Decentralized min-max optimization (Huang et al., 2022): Another application of the constrained decentralized bilevel optimization in (1) is the decentralized nonconvex-strongly-concave min-max optimization problem, which typically arises in, e.g., multi-agent reinforcement learning (Zhang et al., 2021), fair multi-agent machine learning (Baharlouei et al., 2019), and data poisoning attacks (Liu et al., 2020b). A decentralized min-max optimization problem is a special case of a decentralized bilevel optimization problem because:

min_{x∈X} max_{y_i∈R^{p_2}, i=1,...,m} Σ_{i=1}^m f(x, y_i) ⟺ min_{x∈X} Σ_{i=1}^m f(x, y_i^*(x)), s.t. y_i^*(x) = argmin_{y_i∈R^{p_2}} −f(x, y_i), ∀i.
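To illustrate this reduction, the following toy numerical check (our own illustrative example, not an experiment from the paper) verifies on a simple strongly-concave-in-y function that plugging the lower-level minimizer of −f back into f recovers the inner maximum:

```python
import numpy as np

# Toy check of the min-max-to-bilevel reduction: for
# f(x, y) = x*y - 0.5*y^2 (strongly concave in y), the lower-level
# problem y*(x) = argmin_y -f(x, y) has the closed-form solution y = x.
def f(x, y):
    return x * y - 0.5 * y ** 2

def y_star(x):
    # first-order condition of argmin_y -f(x, y): -(x - y) = 0  =>  y = x
    return x

x = 1.7
ys = np.linspace(-5.0, 5.0, 100001)
inner_max = f(x, ys).max()        # brute-force inner maximization max_y f(x, y)
bilevel_value = f(x, y_star(x))   # value obtained via the bilevel reformulation
```

Both quantities agree up to the grid resolution, confirming the equivalence on this instance.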

4. SOLUTION APPROACH

In this section, we first present the Prometheus algorithm for solving the constrained decentralized bilevel optimization problem in (1) in Sections 4.1-4.2. Then, we provide its theoretical convergence guarantees in Section 4.3. Lastly, we reveal a key insight on the necessity of the proposed variance reduction techniques in Section 4.4. Due to space limitation, we relegate the proofs and the notation table (Table 2) to the supplementary material.

4.1. PRELIMINARIES

To present the Prometheus algorithm, we first introduce several basic components as preparation. 1) Network-Consensus Matrix: Our Prometheus algorithm is based on the network-consensus mixing approach: in each iteration, every agent exchanges and aggregates neighboring information through a consensus weight matrix M ∈ R^{m×m}. We define λ as the second largest eigenvalue of the matrix M. Let [M]_{ii'} represent the element in the i-th row and the i'-th column of M. The choice of M should satisfy the following properties: (a) doubly stochastic: Σ_{i=1}^m [M]_{ii'} = Σ_{i'=1}^m [M]_{ii'} = 1; (b) symmetric: [M]_{ii'} = [M]_{i'i}, ∀i, i' ∈ N; and (c) network-defined sparsity: [M]_{ii'} > 0 if (i, i') ∈ L; otherwise [M]_{ii'} = 0, ∀i, i' ∈ N. 2) Stochastic Estimators: In Prometheus, we need to estimate the stochastic gradient of the bilevel problem using the implicit function theorem. We note that in the literature on bilevel optimization with stochastic gradients, a commonly adopted stochastic gradient estimator is of the form (Khanduri et al., 2021; Ghadimi & Wang, 2018; Hong et al., 2020; Liu et al., 2022):

∇̃f(x_{i,t}, y_{i,t}; ξ̄_{ij}) = ∇_x f(x_{i,t}, y_{i,t}; ξ_i^0) − (1/L_g) ∇²_{xy} g(x_{i,t}, y_{i,t}; ζ_i^0) Ĥ_{i,k} ∇_y f(x_{i,t}, y_{i,t}; ξ_i^0),  (4)

where Ĥ_{i,k} ≜ K ∏_{p=1}^{k(K)} (I − ∇²_{yy} g(x_{i,t}, y_{i,t}; ζ_i^p)/L_g). Here, K ∈ N is a predefined parameter and k(K) ~ U{0, ..., K−1} is an integer-valued random variable uniformly chosen from {0, ..., K−1}. It can be shown that Ĥ_{i,k} is a biased estimator of the Hessian inverse, which admits the Neumann-series representation [∇²_{yy} g(x, y; ζ)]^{-1} = (1/L_g) Σ_{i=0}^∞ (I − ∇²_{yy} g(x, y; ζ)/L_g)^i. However, this estimator has the limitation that it only incorporates a single (randomly truncated) term of this Taylor-type series, thus resulting in a large variance that could eventually increase the communication complexity of decentralized bilevel optimization.
To address this issue, in this paper, we propose a new stochastic gradient estimator as follows:

H_{i,0} = I;  H_{i,k} = I + (I − ∇²_{yy} g(x_{i,t}, y_{i,t}; ζ_i^k)/L_g) H_{i,k−1} = I + Σ_{j=1}^{k(K)} ∏_{p=1}^{j} (I − ∇²_{yy} g(x_{i,t}, y_{i,t}; ζ_i^p)/L_g);  (5)
∇̃f(x_{i,t}, y_{i,t}; ξ̄_{ij}) = ∇_x f(x_{i,t}, y_{i,t}; ξ_i^0) − (1/L_g) ∇²_{xy} g(x_{i,t}, y_{i,t}; ζ_i^0) H_{i,k} ∇_y f(x_{i,t}, y_{i,t}; ξ_i^0).

Compared to the conventional estimator, the key difference in our new estimator lies in the matrix H_{i,k}. The new Hessian inverse estimator is inspired by ideas in stochastic second-order optimization (Agarwal et al., 2016). A similar technique for estimating the Hessian inverse can also be found in Koh & Liang (2017). However, our Hessian inverse estimator differs from Koh & Liang (2017) in the following key aspects: (i) In our Hessian inverse estimator, we multiply the Hessian term ∇²_{yy} g(x, y; ζ) by 1/L_g, which ensures that (1/L_g)∇²_{yy} g(x, y; ζ) has eigenvalues less than one. Otherwise, the power series for the Hessian inverse would not converge. In comparison, Koh & Liang (2017) does not have the 1/L_g term because the authors assume w.l.o.g. that the Hessian satisfies ∇²_{yy} g(x, y) ⪯ I, which implicitly assumes L_g = 1. (ii) Koh & Liang (2017) is designed only for solving a conventional single-level minimization problem with loss function L(·). In comparison, our proposed stochastic estimator can be used in bilevel learning, especially for solving non-smooth regularizers in the upper-level problems. Note that our H_{i,k} is in a recursive form that is able to capture the entire Taylor series at once without increasing the sample complexity. Thanks to this recursive form, H_{i,k} utilizes O(k²) samples, as opposed to only O(k) samples in the conventional Ĥ_{i,k} Hessian inverse estimator, thus leading to a much smaller variance and eventually a much lower communication complexity.
It is worth noting that although our H i,k estimator leverages more training samples, the computation cost is the same as that of Ĥi,k due to the recursive structure in (5). In Sections 4.3 and 5, we will theoretically and numerically demonstrate the smaller variance of our new estimator over the conventional one.
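The variance gap can already be seen in one dimension. The following sketch (a scalar simplification of the estimator-comparison experiment reported in Section 5, with the illustrative assumptions L_g = 1, noise standard deviation 0.01, and the new estimator summed to the full depth K) contrasts the single-product estimator in (4) with the recursive partial-sum estimator in (5) for A = 0.25, whose inverse is 4:

```python
import numpy as np

# Scalar comparison of the two Hessian-inverse estimators (illustrative
# assumptions: L_g = 1, noise std 0.01, new estimator run at full depth K).
# Target: A = 0.25, so A^{-1} = 4 = sum_{j>=0} (1 - A)^j (Neumann series).
rng = np.random.default_rng(0)
K, trials, a = 10, 2000, 0.25
conv, new = [], []
for _ in range(trials):
    samples = a + 0.01 * rng.standard_normal(K)   # noisy Hessian samples A_p
    k = rng.integers(0, K)                        # k(K) ~ U{0, ..., K-1}
    conv.append(K * np.prod(1.0 - samples[:k]))   # conventional: K * prod_{p<=k(K)} (1 - A_p)
    new.append(1.0 + np.sum(np.cumprod(1.0 - samples)))  # recursive: 1 + sum_j prod_{p<=j} (1 - A_p)
```

Both estimators are biased toward the truncated Neumann sum Σ_{j=0}^{K−1} 0.75^j ≈ 3.77, but the conventional one additionally carries the large variance of the random truncation index k(K), which the recursive sum eliminates.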

4.2. THE Prometheus ALGORITHM.

Our Prometheus algorithm is an advanced triple-hybrid of proximal, gradient tracking, and variance reduction techniques. The procedure of Prometheus can be organized into the following key steps:
• Step 1 (Local Proximal Operations): In each iteration t, each agent i performs the following proximal operation to cope with the domain constraint set X for the upper-level variables:

x̂_{i,t} = x̂_i(x_{i,t}) = argmin_{x∈X} [⟨u_{i,t}, x − x_{i,t}⟩ + (τ/2)‖x − x_{i,t}‖² + h(x)],  (6)

where τ > 0 is a proximal control parameter and u_{i,t} is an auxiliary vector. The proximal update rule is motivated by the SONATA method (Scutari & Sun, 2019) used in decentralized minimization.
• Step 2 (Consensus Update in Upper-Level Variables): Next, each agent i updates the upper- and lower-level model parameters x_i, y_i as follows:

x_{i,t+1} = Σ_{i'∈N_i} [M]_{ii'} x_{i',t} + α(x̂_i(x_{i,t}) − x_{i,t}),  y_{i,t+1} = y_{i,t} − β v_{i,t},  (7)

where α and β are constant step-sizes for updating the x- and y-variables, respectively. Note that the x_{i,t+1}-update in Eq. (7) is a local weighted average at agent i plus a local update in the spirit of Frank-Wolfe given a proximal point. The y-update in Eq. (7) performs a local stochastic gradient descent step at each agent i.
Remark 1. The auxiliary proximal operator x̂_i(·) and the resultant local update α(x̂_i(x_{i,t}) − x_{i,t}) in the consensus step play an important role in helping us alleviate the non-smooth objective challenge. It would be difficult to achieve convergence guarantees in decentralized learning if we instead used x_{i,t+1} = P_X(x_{i,t} − α u_{i,t}) = argmin_{x∈X} ‖x − (x_{i,t} − α u_{i,t})‖². See Lemmas 5 and 7 in our appendix for proof details.
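As a worked instance of the proximal step (6), take h(x) = λ‖x‖₁ with X = R^{p_1}; completing the square shows that x̂_i(x_{i,t}) is a soft-thresholding of x_{i,t} − u_{i,t}/τ. A minimal sketch (with hypothetical values for u_{i,t}, x_{i,t}, τ, and λ) is:

```python
import numpy as np

# Closed-form proximal step for h(x) = lam * ||x||_1 and X = R^{p1}:
#   argmin_x <u, x - x_t> + (tau/2)*||x - x_t||^2 + lam*||x||_1
#     = soft_threshold(x_t - u/tau, lam/tau).
def prox_step(u, x_t, tau, lam):
    z = x_t - u / tau
    return np.sign(z) * np.maximum(np.abs(z) - lam / tau, 0.0)

def objective(x, u, x_t, tau, lam):
    return u @ (x - x_t) + 0.5 * tau * np.sum((x - x_t) ** 2) + lam * np.sum(np.abs(x))

u = np.array([0.3, -1.2, 0.05])    # hypothetical tracked gradient u_{i,t}
x_t = np.array([1.0, -0.5, 0.0])   # hypothetical current iterate x_{i,t}
tau, lam = 2.0, 0.1
x_hat = prox_step(u, x_t, tau, lam)
```

For a general convex set X an extra projection would be needed; the ℓ₁ case shown here corresponds to the sparsity-regularized setting mentioned in Section 1.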
• Step 3 (Local Variance-Reduced Stochastic Gradient Estimates): In the local gradient estimation step, each agent i estimates its local gradients using the following stochastic gradient estimators:

p_i(x_{i,t}, y_{i,t}) = ∇̃f(x_{i,t}, y_{i,t}) ≜ (1/n) Σ_{j=1}^n ∇̃f(x_{i,t}, y_{i,t}; ξ̄_{ij}), if mod(t, q) = 0;
p_i(x_{i,t}, y_{i,t}) = p_i(x_{i,t−1}, y_{i,t−1}) + (1/|S_{i,t}|) Σ_{j∈S_{i,t}} [∇̃f(x_{i,t}, y_{i,t}; ξ̄_{ij}) − ∇̃f(x_{i,t−1}, y_{i,t−1}; ξ̄_{ij})], otherwise;  (8a)

d_i(x_{i,t}, y_{i,t}) = ∇g(x_{i,t}, y_{i,t}) ≜ (1/n) Σ_{j=1}^n ∇g(x_{i,t}, y_{i,t}; ζ_{ij}), if mod(t, q) = 0;
d_i(x_{i,t}, y_{i,t}) = d_i(x_{i,t−1}, y_{i,t−1}) + (1/|S_{i,t}|) Σ_{j∈S_{i,t}} [∇g(x_{i,t}, y_{i,t}; ζ_{ij}) − ∇g(x_{i,t−1}, y_{i,t−1}; ζ_{ij})], otherwise.  (8b)

Here, S_{i,t} is the sample mini-batch in the t-th iteration, and q is a pre-determined inner-loop iteration number. The local stochastic gradient estimation is a recursive estimator that shares some structural similarity with those in SARAH (Nguyen et al., 2017), SPIDER (Fang et al., 2018), and PAGE (Li et al., 2021) used for traditional minimization problems.
• Step 4 (Gradient Tracking in Upper-Level Parameters): Each agent i updates u_{i,t} and v_{i,t} by averaging over its neighboring tracked gradients:

u_{i,t} = Σ_{i'∈N_i} [M]_{ii'} u_{i',t−1} + p_i(x_{i,t}, y_{i,t}) − p_i(x_{i,t−1}, y_{i,t−1});  v_{i,t} = d_i(x_{i,t}, y_{i,t}).  (9)

To summarize, we illustrate the Prometheus algorithm in Algorithm 1.
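To make the recursion in (8a) concrete, the following minimal single-agent sketch (a toy least-squares objective with illustrative data, batch size, and step size of our own choosing) refreshes the full local gradient every q iterations and applies cheap mini-batch corrections in between:

```python
import numpy as np

# Minimal single-agent sketch of the recursive variance-reduced estimator
# in (8a) on a toy least-squares objective 0.5*mean_j (a_j @ x - b_j)^2.
rng = np.random.default_rng(1)
n, p, q, batch, step = 32, 4, 8, 8, 0.05   # illustrative problem/algorithm sizes
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)

def grad(x, idx):
    # gradient of the loss restricted to the index set idx
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def loss(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

x = rng.standard_normal(p)
loss0 = loss(x)
x_prev, p_est = x.copy(), np.zeros(p)
for t in range(3 * q):
    if t % q == 0:
        p_est = grad(x, np.arange(n))                 # periodic full-gradient refresh
    else:
        S = rng.integers(0, n, size=batch)            # mini-batch S_{i,t}
        p_est = p_est + grad(x, S) - grad(x_prev, S)  # recursive correction
    x_prev, x = x, x - step * p_est
```

Between refreshes, each iteration touches only |S_{i,t}| samples, which is what keeps the per-iteration sample cost low between the periodic full evaluations.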

4.3. CONVERGENCE PERFORMANCE ANALYSIS OF THE Prometheus ALGORITHM

Now, we focus on the convergence performance analysis of the proposed Prometheus algorithm. Before presenting the main convergence results, we first state several technical assumptions: Assumption 1. For all ζ ∈ supp(π_g), where supp(π) denotes the support of π, all x ∈ X with X ⊆ R^{p_1}, and all y ∈ R^{p_2}, the lower-level function g has the following properties: i) g(x, y; ζ) is μ_g-strongly convex with μ_g > 0, and ∇_y g(x, y; ζ) is L_g-Lipschitz continuous with L_g > 0; ii) ‖∇²_{xy} g(x, y; ζ)‖ ≤ C_{gxy} for some C_{gxy} > 0, and ∇²_{xy} g(x, y; ζ) and ∇²_{yy} g(x, y; ζ) are Lipschitz continuous with constants L_{gxy} > 0 and L_{gyy} > 0, respectively.

Algorithm 1

The Prometheus Algorithm at Each Agent i.
  Set the parameter pair (x_{i,0}, y_{i,0}) = (x_0, y_0);
  Calculate local gradients: u_{i,0} = ∇̃f(x_{i,0}, y_{i,0}); v_{i,0} = ∇_y g(x_{i,0}, y_{i,0});
  for t = 1, ..., T do
    Update local parameters (x_{i,t+1}, y_{i,t+1}) as in Eqs. (6)-(7);
    if Prometheus then
      Compute local estimators (p_i(x_{i,t+1}, y_{i,t+1}), d_i(x_{i,t+1}, y_{i,t+1})) as in Eq. (8);
    end if
    if Prometheus-SG then
      Compute local estimators (p_i(x_{i,t+1}, y_{i,t+1}), d_i(x_{i,t+1}, y_{i,t+1})) as in Eq. (10);
    end if
    Track global gradients (u_{i,t+1}, v_{i,t+1}) as in Eq. (9);
  end for

Assumption 2. For all ξ ∈ supp(π_f), where supp(π) denotes the support of π, and all x ∈ X with X ⊆ R^{p_1}, the upper-level function f has the following properties: ∇_x f(x, y; ξ) and ∇_y f(x, y; ξ) are Lipschitz continuous (w.r.t. y) with constants L_{fx} ≥ 0 and L_{fy} ≥ 0, respectively, and ‖∇_y f(x, y; ξ)‖ ≤ C_{fy} for some C_{fy} ≥ 0.
Assumption 3. i) The stochastic gradient estimate of the upper-level function satisfies E_ξ[‖∇̃f(x, y; ξ̄) − E_ξ[∇̃f(x, y; ξ̄)]‖²] ≤ σ_f²; and ii) the stochastic gradient estimate of the lower-level function satisfies E_ζ[‖∇_y g(x, y; ζ) − ∇_y g(x, y)‖²] ≤ σ_g².
We note that Assumptions 1, 2, and 3(ii) are standard in the literature on bilevel optimization (see, e.g., Ghadimi & Wang (2018); Khanduri et al. (2021)). In addition, Assumption 3(i) has been verified in (Khanduri et al., 2021). To establish the convergence result of Prometheus, we first prove the Lipschitz smoothness of the new gradient estimator proposed in (5), which is stated as follows:
Lemma 1 (Lipschitz smoothness of the new stochastic gradient estimator in (5)). If the stochastic functions f(x, y; ξ) and g(x, y; ζ) satisfy Assumptions 1-3, then we have (i) for a fixed y ∈ R^{p_2}, ‖∇̃f(x¹, y; ξ̄) − ∇̃f(x², y; ξ̄)‖² ≤ L_f² ‖x¹ − x²‖², ∀x¹, x² ∈ R^{p_1}; and (ii) for a fixed x ∈ R^{p_1}, ‖∇̃f(x, y¹; ξ̄) − ∇̃f(x, y²; ξ̄)‖² ≤ L_f² ‖y¹ − y²‖², ∀y¹, y² ∈ R^{p_2}.
In the above expressions, L_f > 0 is defined as: L_f² := 2L_{fx}² + 6C_{gxy}² L_{fy}² K/(2μ_g L_g − μ_g²) + 6C_{fy}² L_{gxy}² K/(2μ_g L_g − μ_g²) + (6C_{gxy}² C_{fy}² K/L_g²) Σ_{j=1}^K j² (1 − μ_g/L_g)^{2(j−1)} L_{gyy}²/L_g².
Remark 2. We note that the Lipschitz smoothness constant L_f in Lemma 1 is smaller than that of the conventional estimator in (4), which we denote as L_conv here, i.e., L_f ≤ L_conv. This also shows the superiority of our new estimator. Due to space limitation, we state the definition of L_conv in Lemma 4 in the appendix. Next, we need the following Lipschitz continuity properties of the approximate gradient ∇̄f(x, y), the lower-level solution y*, and the true gradient ∇ℓ(x), which have been proved in the literature:
Lemma 2 (Ghadimi & Wang, 2018). Under Assumptions 1-2, we have ‖∇̄f(x, y) − ∇ℓ(x)‖ ≤ L‖y*(x) − y‖, ‖y*(x₁) − y*(x₂)‖ ≤ L_y‖x₁ − x₂‖, and ‖∇ℓ(x₁) − ∇ℓ(x₂)‖ ≤ L_ℓ‖x₁ − x₂‖ for all x, x₁, x₂ ∈ R^{p_1} and y ∈ R^{p_2}.
Lemma 2 establishes the smoothness of the implicit function in (1), which relies only on Assumptions 1 and 2 to hold. Lastly, following the same token as in (Hong et al., 2020), we show a critical fact on the exponentially fast decay of the bias of our stochastic estimator in (5), which is stated below.
Lemma 3 (Exponentially Decaying Bias). Under Assumptions 1-3, the stochastic gradient estimate of the upper-level objective in (5) satisfies ‖∇̄f(x, y) − E[∇̃f(x, y; ξ̄)]‖ ≤ (C_{gxy} C_{fy}/μ_g)(1 − μ_g/L_g)^K.
The assumptions and Lemmas 1-3 above lead to the main convergence result of Prometheus, which is stated next.
Under review as a conference paper at ICLR 2023
Theorem 1.
Under Assumptions 1-3, if the step-sizes satisfy α ≤ min{((1−λ)m/(2√β(L_ℓ+τ)))√(τ/(6+3τ)), ((1−λ)m/(8√β L_f²))√(τ/(6+3τ)), τ/(3L_ℓ), (1−λ)μ_g²β^{1.5}/(23040 L_y² L²), 8√β τ/(12m(1−λ)), 20L_y²τ/(27(1−λ)β^{1.5} L_f² m), τ(1−λ)/(24m L_f² β), τ√β(1−λ)/(12m), (μ_g(1−λ)/(240 L_y² β^{2.5} · 9 L_f² m))√(τ/(6+3τ)), ((1−λ)β/(2m³))√(τ/(6+3τ))} and β ≤ min{√40 L_y/(3L_f), (1−λ)/(16L_f), (μ_g(1−λ)²/(1440 L_y² L_f²))², 2μ_g/(81 L_f²)}, then the outputs of Prometheus satisfy:

(1/T) Σ_{t=0}^{T−1} [E‖x̂_t − 1 ⊗ x̄_t‖² + E‖x_t − 1 ⊗ x̄_t‖² + E‖y_t − y_t*‖²] = O(1/T).

Remark 3. It is worth noting that, compared to existing works on decentralized bilevel optimization, the major challenge in proving the convergence results in Theorem 1 stems from the proximal operator needed to solve the upper-level subproblem, which prevents the use of the conventional descent lemma for convergence analysis (see Eq. (34) in the appendix). Also, compared to single-agent constrained bilevel optimization, one cannot provide theoretical convergence guarantees by using the direct projection method x̂_{i,t} = argmin_{x∈X} ‖x − (x_{i,t} − τ u_{i,t})‖² as in (Hong et al., 2020; Chen et al., 2022a).
Interestingly, the following convergence result states that there always exists a non-vanishing constant error, independent of m, n, and α, if (10) is used in Step 3 of Prometheus (i.e., a constant that depends only on the problem instance and cannot be made arbitrarily small algorithmically).
Proposition 3. Under Assumptions 1-3, with step-sizes α ≤ min{(1−λ)/(8βL_f), τ/(3L_ℓ), ((1−λ)m/(2√β(L_ℓ+τ)))√(τ/(6+3τ)), τ√β/(6m(1−λ)), τ(1−λ)/(48m L_f² β), (1−λ)μ_g²β^{1.5}/(23040 L_y² L²), O(T^{−1/2}), (1−λ)m/(4√β τ)} and β ≤ min{(1−λ)/(8L_f), (1−λ)⁴μ_g²/(480² L_y² L_f²), O(T^{−1/3})}, we have the following result if (10) replaces Step 3 in Prometheus:

(1/T) Σ_{t=0}^{T−1} [E‖x̂_t − 1 ⊗ x̄_t‖² + E‖x_t − 1 ⊗ x̄_t‖²] = O(1/√T) + C_σ,

where the constant C_σ is defined as C_σ ≜ (9(6+3τ)/τ²)·[((C_{gxy} C_{fy}/μ_g)(1 − μ_g/L_g)^K)² + σ_f²] + (27(1−λ)(8+4α²) L_y² β^{1.5}/(40ατ))·σ_g².
Remark 4. A key insight of Proposition 3 is in order.
The SG-type update in (10) is similar to the SG-type update in unconstrained bilevel optimization in the single-agent setting (Ji et al., 2021). However, unlike the SG-type method in (Ji et al., 2021), which approaches zero at an O(1/√T) convergence rate, the SG-type method can only approach a constant error C_σ at an O(1/√T) convergence rate in the constrained decentralized setting. The non-vanishing constant error C_σ is caused by the variances σ_f² and σ_g² of the stochastic gradients. This highlights the benefit of using variance reduction techniques to eliminate the {σ_f, σ_g}-variance and thus approach zero asymptotically.
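Before turning to the experiments, the ε-stationarity measure in (2) that is tracked in Theorem 1 and Proposition 3 can be computed from the iterates as follows (an illustrative helper with assumed array shapes; the proximal points x̂_i and exact lower-level solutions y_i* would come from (6) and the lower-level solver, respectively):

```python
import numpy as np

def stationarity_gap(x, x_hat, y, y_star):
    """Illustrative computation of the epsilon-stationarity measure in (2).

    x      : (m, p1) local upper-level copies x_i
    x_hat  : (m, p1) proximal points hat{x}_i from the proximal step (6)
    y      : (m, p2) local lower-level iterates y_i
    y_star : (m, p2) exact lower-level solutions y_i^*(x_i)
    """
    x_bar = x.mean(axis=0, keepdims=True)      # network average (1/m) sum_i x_i
    prox_err = np.sum((x_hat - x_bar) ** 2)    # saddle point (stationarity) error
    consensus_err = np.sum((x - x_bar) ** 2)   # consensus error among local copies
    lower_err = np.sum((y - y_star) ** 2)      # aggregated lower-level error
    return prox_err + consensus_err + lower_err

# At exact consensus, stationarity, and lower-level optimality, the gap is zero.
m, p1, p2 = 4, 3, 2
x = np.ones((m, p1))
gap = stationarity_gap(x, x.copy(), np.zeros((m, p2)), np.zeros((m, p2)))
```

All three error terms must vanish simultaneously for the gap to fall below a target ε, mirroring the three goals listed below (2).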

5. NUMERICAL RESULTS

In this section, we first conduct experiments to demonstrate the smaller variance of our new stochastic gradient estimator. Then, we compare the convergence of Prometheus with several baselines.



6. CONCLUSION

In this paper, we studied constrained decentralized nonconvex-strongly-convex bilevel optimization problems. First, we proposed an algorithm called Prometheus with a new stochastic estimator. We then showed that, to achieve an ε-stationary point, Prometheus achieves a sample complexity of O(√n·Kε^{-1} + n) and a communication complexity of O(ε^{-1}). Our numerical studies also showed the advantages of our proposed Prometheus and verified the theoretical results. Collectively, the results in this work contribute to the state of the art of low sample- and communication-complexity constrained decentralized bilevel learning.



Table 1: Comparisons among algorithms for decentralized bilevel optimization problems. Sample complexities (for both upper- and lower-level functions) are defined in the sense of achieving an ε-stationary point as defined in (2); n is the size of the dataset at each agent.

where the Lipschitz constants are defined as: $L \triangleq L_{f_x} + \cdots$

due to the gradient tracking procedure in decentralized learning. Instead, we use a different proximal update rule, as shown in (6). We will numerically show in Section 5 that Prometheus with the direct proximal operator can only converge to a neighborhood of a stationary point. Further, Theorem 1 implies the following sample and communication complexity results:

Corollary 2 (Sample and Communication Complexities of Prometheus). Under the conditions of Theorem 1, to achieve an $\epsilon$-stationary solution, Prometheus requires: i) a total number of communication rounds of $O(\epsilon^{-1})$, and ii) a total number of samples of $O(\sqrt{n}K\epsilon^{-1} + n)$.

4.4 DISCUSSION: THE BENEFIT OF VARIANCE REDUCTION IN Prometheus

Since the variance-reduced update in (8) in Step 3 of Prometheus requires full gradient evaluations, it is tempting to ask what the benefit of the variance reduction technique is. In other words, could we relinquish variance reduction (VR) in Step 3 to avoid full gradient evaluations? To answer this question, consider changing Step 3 to the following basic stochastic gradient estimator without VR:


1) New estimator vs. conventional estimator: Note that the major difference between the new and conventional estimators lies in how they estimate the inverse of the Hessian matrix $A$. Thus, it suffices to compare the Hessian-inverse approximations. The conventional estimator of $A^{-1}$ can be written as $\tilde{A}^{-1}_{\text{conv}} = K\prod_{p=1}^{k(K)}(I-A_p)$, while the new estimator can be written as $\tilde{A}^{-1} = \sum_{j=1}^{k(K)}\prod_{p=1}^{j}(I-A_p)$, where the $A_p$'s are stochastic samples of $A$. To see the benefits of our estimator, and due to the high complexity of computing matrix inverses, here we consider a small example $A = \big[\begin{smallmatrix} 0.25 & 0 \\ 0 & 0.25 \end{smallmatrix}\big]$, so that $A^{-1} = 4I$. Let each $A_p$ be a random matrix obtained from $A$ plus Gaussian noise. We use $\tilde{A}^{-1}_{\text{conv}}$ and $\tilde{A}^{-1}$ to estimate $A^{-1}$, respectively. We run 10000 independent trials with $K = 10$ and the results are shown in Fig. 1. We can see from Fig. 1 that the new Hessian-inverse estimator has a much smaller variance than the conventional one. Additional experiments on varying $K$ and different matrices $A$ are relegated to the appendix.

• Prometheus with Stochastic Gradient (Prometheus-SG): Prometheus-SG is the SG-type algorithm discussed in Section 4.4: $p_i(x_{i,t}, y_{i,t}) = \nabla f(x_{i,t}, y_{i,t}; \xi_{i,0})$; $d_i(x_{i,t}, y_{i,t}) = \nabla g(x_{i,t}, y_{i,t}; \zeta_{i,0})$.

• Prometheus with Direct Proximal Method (Prometheus-dir): Instead of performing $\hat{x}_{i,t} = \arg\min_{x\in\mathcal{X}}[\langle u_{i,t}, x - x_{i,t}\rangle + \frac{\tau}{2}\|x - x_{i,t}\|^2 + h(x)]$ as in Prometheus, Prometheus-dir directly imposes the constraint on $x$: $\hat{x}_{i,t} = \arg\min_{x\in\mathcal{X}}\|x - (x_{i,t} - \tau u_{i,t})\|^2$.

• Proximal Decentralized Stochastic Gradient Descent (Prox-DSGD): This algorithm is motivated by the DSGD algorithm and can be viewed as Prometheus without gradient tracking. Specifically, we update the local gradients as $u_{i,t} = \nabla f(x_{i,t}, y_{i,t}; \xi_{i,0})$; $v_{i,t} = \nabla g(x_{i,t}, y_{i,t}; \zeta_{i,0})$. We also note that the Prox-DSGD algorithm can be seen as a generalization of DSBO (Yang et al., 2022), SPDB (Lu et al., 2022), and DSBO (Chen et al., 2022b) with the proximal operator.
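The Hessian-inverse estimator comparison in point 1) above can be sketched with the following Monte-Carlo simulation (a sketch under our own reading of the two estimators; the noise level and the uniform choice of the truncation index $k(K)$ for the conventional estimator are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = 0.25 * np.eye(2)                 # so A^{-1} = 4I
K, trials, noise = 10, 10000, 0.05
conv_vals, new_vals = [], []
for _ in range(trials):
    # i.i.d. stochastic samples A_p = A + Gaussian noise
    samples = [A + noise * rng.standard_normal((2, 2)) for _ in range(K)]

    # conventional estimator: K * prod_{p=1}^{k} (I - A_p), random truncation k
    k = rng.integers(0, K)
    P = np.eye(2)
    for p in range(k):
        P = P @ (np.eye(2) - samples[p])
    conv_vals.append((K * P)[0, 0])

    # new estimator: sum_{j=1}^{K} prod_{p=1}^{j} (I - A_p)
    S, P = np.zeros((2, 2)), np.eye(2)
    for j in range(K):
        P = P @ (np.eye(2) - samples[j])
        S += P
    new_vals.append(S[0, 0])

# the randomly truncated product (scaled by K) has far larger variance
print(np.var(conv_vals), np.var(new_vals))
```

The random truncation index multiplies the conventional estimator's spread by a factor on the order of $K$, while the new estimator averages over all partial products deterministically, which is the qualitative effect visible in Fig. 1.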
Prometheus-dir can also be seen as an extension of the algorithm INTERACT (Liu et al., 2020a) to handle constrained decentralized bilevel optimization problems. We compare Prometheus with these baselines using a two-hidden-layer neural network with 20 hidden units. The consensus matrix is chosen as $M = I - \frac{2L}{3\lambda_{\max}(L)}$, where $L$ is the Laplacian matrix of $\mathcal{G}$ and $\lambda_{\max}(L)$ denotes the largest eigenvalue of $L$. Due to space limitations, we relegate the detailed parameter choices of all algorithms to the appendix. In Fig. 2, we compare the performance of Prometheus, Prometheus-SG, Prometheus-dir, and Prox-DSGD on the MNIST and CIFAR-10 datasets with a five-agent network. The network topology can be seen in Fig. 4 in Appendix D. We note that Prometheus converges much faster than all other algorithms in terms of the total number of communication rounds. In Fig. 3, we observe similar results when the number of tasks (and agents) is increased to 10. Our experimental results thus verify our theoretical analysis that Prometheus has the lowest communication complexity.
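This choice of consensus matrix can be illustrated with a short sketch (our own, under the assumption of a five-agent ring topology; the paper's actual topology is shown in its Fig. 4):

```python
import numpy as np

n = 5
adj = np.zeros((n, n))
for i in range(n):                       # assumed ring topology over 5 agents
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
L = np.diag(adj.sum(axis=1)) - adj       # graph Laplacian of G
lam_max = np.linalg.eigvalsh(L).max()
M = np.eye(n) - 2 * L / (3 * lam_max)    # M = I - 2L / (3 * lambda_max(L))

# Since L is symmetric with zero row sums, M is symmetric and doubly stochastic
# with spectral radius 1; the second-largest eigenvalue magnitude of M is the
# mixing rate lambda appearing in the step-size conditions.
eigs = np.sort(np.abs(np.linalg.eigvalsh(M)))
lam = eigs[-2]
print(lam)                               # lambda < 1 gives geometric consensus
```

Any connected topology works here; a smaller mixing rate $\lambda$ (better-connected graph) loosens the $(1-\lambda)$-dependent step-size bounds.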

