PERSONALIZED DECENTRALIZED BILEVEL OPTIMIZATION OVER STOCHASTIC AND DIRECTED NETWORKS

Abstract

While personalization in distributed learning has been extensively studied, existing approaches employ dedicated algorithms to optimize their specific type of parameters (e.g., client clusters or model interpolation weights), making it difficult to simultaneously optimize different types of parameters to yield better performance. Moreover, their algorithms require centralized or static undirected communication networks, which can be vulnerable to center-point failures or deadlocks. This study proposes optimizing various types of parameters using a single algorithm that runs on more practical communication environments. First, we propose a gradient-based bilevel optimization that reduces most personalization approaches to the optimization of client-wise hyperparameters. Second, we propose a decentralized algorithm to estimate gradients with respect to the hyperparameters, which can run even on stochastic and directed communication networks. Our empirical results demonstrated that the gradient-based bilevel optimization enabled combining existing personalization approaches which led to state-of-the-art performance, confirming it can perform on multiple simulated communication environments including a stochastic and directed network.

1. INTRODUCTION

In distributed learning, providing personally tuned models to clients, or personalization, has been shown to be effective when the clients' data are heterogeneously distributed (Tan et al., 2022). While various approaches have been proposed, they are dedicated to optimizing specific types of parameters for personalization. A typical example is clustering-based personalization (Sattler et al., 2020), which employs similarity-based clustering specifically for seeking client clusters. Another approach, called model interpolation (Mansour et al., 2020; Deng et al., 2020), likewise specializes in optimizing interpolation weights between local and global models. These dedicated algorithms prevent developers from combining different personalization methods to achieve better performance.

Another limitation of previous personalization algorithms is that they can run only on centralized or static undirected networks. Most approaches for federated learning (Smith et al., 2017; Sattler et al., 2020; Jiang et al., 2019) require centralized settings in which a host server can communicate with any client. Although a few studies (Lu et al., 2022; Marfoq et al., 2021) consider fully-decentralized settings, they assume that the communication edge between any pair of clients is static and undirected (i.e., synchronized). These communication networks are known to be vulnerable to practical issues, such as bottlenecks or central point failures on the host servers (Assran et al., 2019), or failing nodes and deadlocks on the static undirected networks (Tsianos et al., 2012).

This study proposes optimizing various parameters for personalization using a single algorithm while allowing more practical communication environments. First, we propose a gradient-based Personalized Decentralized Bilevel Optimization (PDBO), which reduces many personalization approaches to the optimization of hyperparameters possessed by each client.
Second, we propose Hyper-gradient Push (HGP), which allows any client to solve PDBO by estimating the gradient with respect to its hyperparameters (hyper-gradient) via stochastic and directed communications, which are immune to the practical problems of centralized or static undirected communications (Assran et al., 2019). We also introduce a variance-reduced HGP to reduce the estimation variance, which is particularly effective when communications are stochastic, and provide its theoretical error bound. We empirically demonstrated that the generality of our gradient-based PDBO enabled combining existing personalization approaches, which led to state-of-the-art performance in a distributed classification task. We also demonstrated that the gradient-based PDBO succeeded in personalization on multiple simulated communication environments, including a stochastic and directed network.

Our contributions are summarized as follows:
• We propose a gradient-based PDBO that can solve existing personalization problems and their combinations as its special cases.
• We propose a decentralized hyper-gradient estimation algorithm called HGP, which can run even on stochastic and directed networks. We also propose a variance-reduced HGP, which is particularly effective in stochastic communications, and provide its theoretical error bound.
• We empirically validated the advantages of the gradient-based PDBO with HGP: it enabled solving a combination of different personalization problems, which led to state-of-the-art performance, and it performed on different communication environments, including a stochastic directed network.

Notation

$\langle A \rangle_{ij}$ denotes the block at the $i$-th row and $j$-th column block of the matrix $A$, and $\langle a \rangle_i$ denotes the $i$-th block vector of the vector $a$. For a function $f : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$, we denote its total and partial derivatives with respect to a vector $x \in \mathbb{R}^{d_1}$ by $d_x f(x) \in \mathbb{R}^{d_1 \times d_2}$ and $\partial_x f(x) \in \mathbb{R}^{d_1 \times d_2}$, respectively.
We denote the product of matrices by $\prod_{s=0}^{m} \hat{A}^{(s)} = \hat{A}^{(m)} \cdots \hat{A}^{(0)}$ and $\prod_{s=0}^{-1} \hat{A}^{(s)} = I$.

2. PRELIMINARIES

We formulate distributed learning (Li et al., 2014), communication networks, and stochastic gradient push (SGP; Nedić & Olshevsky, 2016) as a generalization of gradient-based distributed learning.

Distributed learning

Distributed learning with $n$ clients is commonly formulated for all $i \in [n]$ as

$$x_i^* = \arg\min_{x_i} \frac{1}{n} \sum_{k \in [n]} \mathbb{E}_{\xi_k}\left[ f_k(x_k, \lambda_k; \xi_k) \right], \quad \text{s.t. } x_i = x_j, \ \forall j \in [n], \tag{1}$$

where the $i$-th client pursues the optimal parameter $x_i^* \in \mathbb{R}^{d_x}$ that reaches consensus ($x_i = x_j, \ \forall j \in [n]$) over all the clients, while minimizing its cost $f_i : \mathbb{R}^{d_x} \times \mathbb{R}^{d_\lambda} \to \mathbb{R}$ for the input $\xi_i \in \mathcal{X}$ sampled from its local data distribution. We allow $f_i$ to take the hyperparameters $\lambda_i \in \mathbb{R}^{d_\lambda}$ as its argument. We give examples of the choice of $\lambda_i$ in Sections 3 and 5.

Stochastic and directed communication network

In distributed learning, clients solve Eq. (1) by exchanging messages over a physical communication network. The type of edge connections categorizes the communication network: static undirected (Lian et al., 2017), which represents synchronization over all clients; stochastic undirected (Lian et al., 2018), which represents asynchronicity between different client pairs; and stochastic directed (Nedić & Olshevsky, 2016), which represents push communication where any message passing can be unidirectional. This study considers distributed learning on stochastic and directed communication networks. Such a network has several desirable properties: robustness to failing clients and deadlocks (Tsianos et al., 2012), immunity to central failures, and small communication overhead (Assran et al., 2019).

We model stochastic directed networks by letting communication edges be randomly realized, as simulated in Assran et al. (2019) and Nedić & Olshevsky (2016). Let $\delta^{(t)}_{i \to j} \in \{0, 1\}$ be a random variable where $\delta^{(t)}_{i \to j} = 1$ denotes that there is a communication channel from the $i$-th client to the $j$-th client at time step $t$, and $\delta^{(t)}_{i \to j} = 0$ otherwise. We set $\delta^{(t)}_{i \to i} = 1$ for all $i \in [n]$ and $t \in \mathbb{N}$, allowing every client to send a message to itself at any time step. Note that the edge model above recovers the other fully-decentralized settings as its special cases: the symmetric edges ($\delta^{(t)}_{i \to j} = \delta^{(t)}_{j \to i}, \ \forall i, j \in [n], \ \forall t \in \mathbb{N}$) recover stochastic undirected networks, and the symmetric constant edges, which additionally require $\delta^{(t)}_{i \to j} = \delta^{(t)}_{j \to i} = \delta_{ij}$, recover static and undirected networks.

Stochastic gradient push (SGP)

SGP (Nedić & Olshevsky, 2016) is one of the most general solvers of Eq. (1). This section formulates SGP with further generalization for its variants. The $i$-th client in SGP updates its weight $\omega_i \in \mathbb{R}$ along with a biased parameter $z_i \in \mathbb{R}^{d_x}$ to obtain its debiased parameter $x_i = z_i / \omega_i$.
Let $y_i = [z_i^\top \ \omega_i]^\top \in \mathbb{R}^{d_y}$ be a concatenated vector. At the $t$-th step, the $i$-th client samples its minibatch $\zeta_i^{(t)}$ and sending edges $\delta_i^{(t)} = \{\delta^{(t)}_{i \to 1}, \ldots, \delta^{(t)}_{i \to n}\}$, runs a local update $\psi_i : \mathbb{R}^{d_y} \to \mathbb{R}^{d_y}$ and a message generator $\varphi_i : \mathbb{R}^{d_y} \to \mathbb{R}^{d_y}$, and updates $y_i$ as

$$y_i^{(t+1)} = \sum_{j \in [n]} p_{ji}(\delta_j^{(t)}) \, \varphi_j\big(y_j^{(t)}; \lambda_j, \zeta_j^{(t)}\big) + \psi_i\big(y_i^{(t)}; \lambda_i, \zeta_i^{(t)}\big), \tag{2}$$

$$\text{s.t. } \sum_{k \in [n]} p_{ik}(\delta_i^{(t)}) = 1 \ \text{ and } \ p_{ik}(\delta_i^{(t)}) = \delta^{(t)}_{i \to k} \, p_{ik}(\delta_i^{(t)}), \ \forall k \in [n], \tag{3}$$

where $p_{ji} : \{0, 1\}^n \to [0, 1]$ is a weight function that forms a column stochastic matrix $P^{(t)}$ such that $P^{(t)}_{ij} = p_{ji}(\delta_j^{(t)})$ to ensure the convergence of $x_i$ to the consensus. Denoting the learning rate by $\alpha_i \in \mathbb{R}_+$, the following formulations of $\varphi_i$ and $\psi_i$ recover the two SGP variants:

$$\varphi_i(y_i; \lambda_i, \zeta_i) = \Big[ z_i^\top - \frac{\alpha_i}{|\zeta_i|} \sum_{\xi \in \zeta_i} \partial_{x_i} f_i\Big(\frac{z_i}{\omega_i}, \lambda_i; \xi\Big)^\top \ \ \omega_i \Big]^\top, \quad \psi_i(y_i; \lambda_i, \zeta_i) = \big[ 0_{d_x}^\top \ \ 0 \big]^\top, \tag{4a}$$

$$\varphi_i(y_i; \lambda_i, \zeta_i) = \big[ z_i^\top \ \ \omega_i \big]^\top, \quad \psi_i(y_i; \lambda_i, \zeta_i) = \Big[ -\frac{\alpha_i}{|\zeta_i|} \sum_{\xi \in \zeta_i} \partial_{x_i} f_i\Big(\frac{z_i}{\omega_i}, \lambda_i; \xi\Big)^\top \ \ 0 \Big]^\top, \tag{4b}$$

where Eq. (4a) and Eq. (4b) run local gradient descent with a minibatch before (Assran et al., 2019) and after (Nedić & Olshevsky, 2016) communication, respectively.

We can recover other popular distributed learning schemes as special cases of SGP. By making $p_{ji}(\delta_j^{(t)})$ form a doubly stochastic mixing matrix $P^{(t)}$, Eq. (4a) and Eq. (4b) recover the decentralized stochastic gradient descent (DSGD) in Bianchi et al. (2013) and Lian et al. (2017), respectively. We can also recover FedAvg (McMahan et al., 2017) by choosing a fully-connected graph with averaging over all clients, i.e., $\delta^{(t)}_{i \to j} = 1$ and $p_{ij}(\delta_i^{(t)}) = 1/n$ for all $i, j \in [n]$ and $t \in \mathbb{N}$ in Eq. (4a).
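The update in Eq. (2) with the variant of Eq. (4b) can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions that are not part of the paper: scalar parameters, quadratic local costs $f_i(x) = \frac{1}{2}(x - c_i)^2$ with no hyperparameters or minibatch noise, edges drawn independently with a common probability, uniform weights over the realized out-neighbors (which satisfies Eq. (3)), and a decaying step size. All function and variable names are hypothetical.

```python
import random


def sample_out_edges(n, i, p_edge, rng):
    """Sample delta_{i->j} for all j; the self-loop delta_{i->i} is always 1."""
    return [1 if j == i or rng.random() < p_edge else 0 for j in range(n)]


def sgp(targets, steps, p_edge=0.5, seed=0):
    """Toy SGP, Eq. (2) with Eq. (4b), for f_i(x) = 0.5 * (x - targets[i])**2."""
    rng = random.Random(seed)
    n = len(targets)
    z = [0.0] * n  # biased parameters z_i
    w = [1.0] * n  # push-sum weights omega_i
    for t in range(steps):
        alpha = 0.1 / (1.0 + 0.1 * t)  # decaying step size (an assumption)
        # psi_i: local gradient step evaluated at the debiased x_i = z_i / omega_i
        new_z = [-alpha * (z[i] / w[i] - targets[i]) for i in range(n)]
        new_w = [0.0] * n
        for j in range(n):  # sender j pushes phi_j(y_j) = y_j along realized edges
            delta = sample_out_edges(n, j, p_edge, rng)
            share = 1.0 / sum(delta)  # p_{ji} = delta_{j->i} / out-degree of j
            for i in range(n):
                if delta[i]:
                    new_z[i] += share * z[j]
                    new_w[i] += share * w[j]
        z, w = new_z, new_w
    return [z[i] / w[i] for i in range(n)], w


# The minimizer of the average cost is the mean of the targets, here 2.0.
debiased, weights = sgp([0.0, 1.0, 2.0, 3.0, 4.0], steps=10000)
```

Because each sender splits its mass over its realized out-neighbors, the mixing matrix is column stochastic, so the total weight $\sum_i \omega_i$ stays equal to $n$ even though individual $\omega_i$ fluctuate; dividing by $\omega_i$ removes the resulting bias.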

3. PERSONALIZED DECENTRALIZED BILEVEL OPTIMIZATION (PDBO)

We then propose the formulation of PDBO as a generalization of existing personalization problems. PDBO played by $n$ clients is formulated as follows:

$$\min_{\lambda_1, \ldots, \lambda_n} \frac{1}{n} \sum_{s \in [n]} F_s\big(x_s^*(\lambda_1, \ldots, \lambda_n), \lambda_s\big), \quad \text{s.t. } x_i^* \text{ satisfies Eq. (1)}, \ \forall i \in [n], \tag{5}$$

where the outer-problem (Eq. (5-left)) lets the $i$-th client find its optimal hyperparameter $\lambda_i$ that minimizes the average of the outer-costs $F_i : \mathbb{R}^{d_x} \times \mathbb{R}^{d_\lambda} \to \mathbb{R}$ across all clients. Here, we write $x_i^*(\lambda_1, \ldots, \lambda_n)$ to show its dependency on the hyperparameters explicitly.

The generality of Eq. (5) in personalization comes from the flexibility in the choice of $f_i$, $F_i$, $x_i$, and $\lambda_i$. For example, suppose that $f_i$ is the cross-entropy loss of a DNN with a feature extractor and classifier parameterized by $x_i$ and $\lambda_i$, respectively. By letting $F_i$ be a validation loss, we can recover a family of personalized layer schemes (Arivazhagan et al., 2019; Bui et al., 2019). See Section 5 for further examples.

We then reformulate PDBO by replacing Eq. (1) with the stationary point of an iteration as in Grazzi et al. (2020). Following the original works of SGP (Nedić & Olshevsky, 2014) and the push-sum (Bénézit et al., 2010), we introduce additional assumptions:

Assumption 1. For every $i \in [n]$, and for all $\lambda_i \in \mathbb{R}^{d_\lambda}$ and $\xi_i \in \mathcal{X}$, $f_i(\cdot, \lambda_i; \xi_i)$ is strongly convex.

Assumption 2. A graph with edge set $\{(i, j) \mid \mathbb{E}_{\delta_i}[p_{ij}(\delta_i)] > 0, \ i, j \in [n]\}$ is strongly connected.

Let $\delta = \{\delta_1, \ldots, \delta_n\}$ and $\zeta = \{\zeta_1, \ldots, \zeta_n\}$. For every $i \in [n]$, the expectation of the iteration Eq. (2) with Assumptions 1 and 2 admits the following unique stationary point, which gives the optimum of Eq. (1) (Nedić & Olshevsky, 2014; Assran & Rabbat, 2020):

$$y_i^* = \mathbb{E}_{\delta, \zeta}\Big[ \sum_{j=1}^{n} p_{ji}(\delta_j) \, \varphi_j(y_j^*; \lambda_j, \zeta_j) + \psi_i(y_i^*; \lambda_i, \zeta_i) \Big] = \big[ z_i^{*\top} \ \omega_i^* \big]^\top \quad \text{s.t. } x_i^* = z_i^* / \omega_i^*. \tag{6}$$

Replacing the inner-problem in Eq. (5) by Eq. (6) reformulates PDBO as

$$\min_{\lambda} F\big(x^*(y^*(\lambda)), \lambda\big), \quad \text{s.t. Eq. (6) is satisfied for all } i \in [n], \tag{7}$$

where $\lambda = [\lambda_1^\top \cdots \lambda_n^\top]^\top$, $x^* = [x_1^{*\top} \cdots x_n^{*\top}]^\top$, and $y^*(\lambda) = [y_1^{*\top} \cdots y_n^{*\top}]^\top$ are concatenated parameters, and $F(x, \lambda) := \frac{1}{n} \sum_{k \in [n]} F_k(x_k, \lambda_k)$ is the average outer-cost.

4. HYPER-GRADIENT ESTIMATION OVER STOCHASTIC AND DIRECTED COMMUNICATION NETWORKS

To solve PDBO using gradient-based methods, this section introduces an empirical estimate of the hyper-gradient and its decentralized computation algorithm, which we named HGP.

4.1. EMPIRICAL ESTIMATE VIA APPROXIMATE IMPLICIT DIFFERENTIATION

Below, we derive the estimator of the hyper-gradient following the recurrent backpropagation for approximate implicit differentiation (Grazzi et al., 2020; Lorraine et al., 2020). The hyper-gradient with respect to $\lambda$ is written as $d_\lambda F(x^*(y^*(\lambda)), \lambda)$ under Assumption 3.

Assumption 3. For all $i \in [n]$ and $\zeta_i$, $\varphi_i(y_i; \lambda_i, \zeta_i)$ and $\psi_i(y_i; \lambda_i, \zeta_i)$ are differentiable with respect to $y_i$ and $\lambda_i$, and $F_i(x_i, \lambda_i)$ is differentiable with respect to $x_i$ and $\lambda_i$.

Estimator of hyper-gradient

We introduce Jacobian matrices $A(\delta, \zeta)$ and $B(\delta, \zeta)$ whose $(j, i)$ blocks are the partial derivatives of Eq. (2) with respect to $y_j$ and $\lambda_j$ for $j, i \in [n]$, respectively:

$$\langle A(\delta, \zeta) \rangle_{ji} = p_{ji}(\delta_j) \, \partial_{y_j} \varphi_j(y_j^*; \lambda_j, \zeta_j) + 1_{ji} \, \partial_{y_j} \psi_j(y_j^*; \lambda_j, \zeta_j) \in \mathbb{R}^{d_y \times d_y}, \tag{8a}$$

$$\langle B(\delta, \zeta) \rangle_{ji} = p_{ji}(\delta_j) \, \partial_{\lambda_j} \varphi_j(y_j^*; \lambda_j, \zeta_j) + 1_{ji} \, \partial_{\lambda_j} \psi_j(y_j^*; \lambda_j, \zeta_j) \in \mathbb{R}^{d_\lambda \times d_y}, \tag{8b}$$

where $1_{ij}$ denotes the Kronecker delta. We introduce their expectations $\bar{A} := \mathbb{E}_{\delta, \zeta}[A(\delta, \zeta)]$ and $\bar{B} := \mathbb{E}_{\delta, \zeta}[B(\delta, \zeta)]$, assuming the following:

Assumption 4. The largest singular value of $\bar{A}$ is strictly smaller than one.

Let $c_y = \partial_y F(x^*(y^*(\lambda)), \lambda)$ and $c_\lambda = \partial_\lambda F(x^*(y^*(\lambda)), \lambda)$. Using Assumption 4, Eq. (6), and the empirical estimates $(\hat{A}^{(t)}, \hat{B}^{(t)}) = (A(\delta^{(t)}, \zeta^{(t)}), B(\delta^{(t)}, \zeta^{(t)}))$, we obtain the estimator as

$$d_\lambda F\big(x^*(y^*(\lambda)), \lambda\big) \approx \bar{B} \sum_{m=0}^{M-1} \bar{A}^m c_y + c_\lambda \approx \sum_{m=0}^{M-1} \hat{B}^{(2m)} \prod_{s=0}^{m-1} \hat{A}^{(2s+1)} c_y + c_\lambda =: \hat{d}_\lambda F, \tag{9}$$

where the first approximation is obtained from Grazzi et al. (2020, Eqs. (4), (5), (19)) and the second approximation simply replaces the expected Jacobians with their estimates, as in Ghadimi & Wang (2018, 3.62, 3.66). We estimate the Jacobians in the odd and even rounds separately and introduce the following assumption to ensure the unbiasedness $\bar{B} \sum_{m=0}^{M-1} \bar{A}^m = \mathbb{E}\big[ \sum_{m=0}^{M-1} \hat{B}^{(2m)} \prod_{s=0}^{m-1} \hat{A}^{(2s+1)} \big]$:

Assumption 5. $\delta^{(t)}$ and $\zeta^{(t)}$ are independent across the time steps $t \in \mathbb{N}$.

We include the complete derivation of the estimator in Appendix A.
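As a brief worked step, the truncation behind the first approximation in Eq. (9) can be quantified with the standard geometric-series identity. This is a sketch assuming $\|\bar{A}\|_2 \le \sigma < 1$, which is what Assumption 4 provides; the value $\sigma = 0.9$ below is purely illustrative and not taken from the paper, while $M = 200$ matches the number of HGP iterations used in Section 6.

```latex
\sum_{m=0}^{M-1} \bar{A}^m = (I - \bar{A}^M)(I - \bar{A})^{-1}
\quad\Longrightarrow\quad
\Big\| (I - \bar{A})^{-1} - \sum_{m=0}^{M-1} \bar{A}^m \Big\|_2
= \big\| \bar{A}^M (I - \bar{A})^{-1} \big\|_2
\le \frac{\sigma^M}{1 - \sigma}.
% Illustrative numbers: sigma = 0.9 and M = 200 give 0.9^{200} / 0.1 \approx 7 \times 10^{-9}.
```

The truncation error of the Neumann series therefore decays exponentially in $M$, so for moderate $M$ the error of the estimator is dominated by the sampling noise of the Jacobian estimates rather than by the truncation.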

Recurrent backpropagation

We compute Eq. (9) using the fact that a finite number of recurrent backpropagation steps around the stationary point approximates the hyper-gradient (Lorraine et al., 2020, 4.2), which avoids the explicit computation of the Jacobian matrices $\hat{A}^{(m)}$ and $\hat{B}^{(m)}$. Let $u^{(m)} = \prod_{s=0}^{m-1} \hat{A}^{(2s+1)} c_y$ and $v^{(m)} = \sum_{m'=0}^{m-1} \hat{B}^{(2m')} u^{(m')} + c_\lambda$. By initializing $u^{(0)} \leftarrow c_y$ and $v^{(0)} \leftarrow c_\lambda$, and by the following iterations for $m = 0, \ldots, M-1$,

$$v^{(m+1)} \leftarrow \hat{B}^{(2m)} u^{(m)} + v^{(m)}, \qquad u^{(m+1)} \leftarrow \hat{A}^{(2m+1)} u^{(m)}, \tag{10}$$

we obtain the hyper-gradient estimate as $\hat{d}_\lambda F \leftarrow v^{(M)}$. Eq. (10) only requires the Jacobian-vector products $\hat{B}^{(2m)} u^{(m)}$ and $\hat{A}^{(2m+1)} u^{(m)}$, leading to $O(n d_x + n d_\lambda)$ and $O(n d_x)$ time, respectively.

Decentralizing backpropagation

Decentralized computation of Eq. (10) requires consideration of the data locality and the communication stochasticity. Because of the locality of $\zeta_i^{(m)}$, clients need to communicate: only the $i$-th client can compute Jacobian-vector products of the $i$-th row block of $\hat{A}^{(m)}$ and $\hat{B}^{(m)}$ from their definitions in Eq. (8). Moreover, we need to design a decentralized algorithm so that any required communication can be performed on stochastic and directed networks.
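The recursion in Eq. (10) can be illustrated with a small centralized sketch in plain Python. As a simplifying assumption, the same fixed matrices are used at every round in place of the sampled estimates $\hat{A}^{(2m+1)}$ and $\hat{B}^{(2m)}$, and they are stored explicitly; a practical implementation would instead call an automatic-differentiation routine for the Jacobian-vector products. The matrices, vectors, and function names are hypothetical.

```python
def matvec(mat, vec):
    """Dense matrix-vector product for lists of rows."""
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]


def recurrent_backprop(jac_a, jac_b, c_y, c_lam, num_steps):
    """Run Eq. (10): v <- B u + v, then u <- A u, for num_steps rounds."""
    u = list(c_y)    # u^(0) = c_y
    v = list(c_lam)  # v^(0) = c_lambda
    for _ in range(num_steps):
        bu = matvec(jac_b, u)
        v = [vi + bi for vi, bi in zip(v, bu)]
        u = matvec(jac_a, u)
    return v


# Hypothetical 2x2 example whose largest singular value is below one.
JAC_A = [[0.5, 0.1], [0.2, 0.3]]
JAC_B = [[1.0, -1.0]]  # one hyperparameter, two inner parameters
C_Y = [1.0, 2.0]
C_LAM = [0.5]

estimate = recurrent_backprop(JAC_A, JAC_B, C_Y, C_LAM, num_steps=100)

# Closed form B (I - A)^{-1} c_y + c_lambda, with the 2x2 inverse written out.
det = (1 - JAC_A[0][0]) * (1 - JAC_A[1][1]) - JAC_A[0][1] * JAC_A[1][0]
q = [((1 - JAC_A[1][1]) * C_Y[0] + JAC_A[0][1] * C_Y[1]) / det,
     (JAC_A[1][0] * C_Y[0] + (1 - JAC_A[0][0]) * C_Y[1]) / det]
closed_form = matvec(JAC_B, q)[0] + C_LAM[0]
```

After $M$ rounds, `estimate` equals $B \sum_{m<M} A^m c_y + c_\lambda$, which agrees with `closed_form` up to the exponentially small truncation error.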

4.2. HYPER-GRADIENT PUSH (HGP)

We propose HGP, which enables any $i$-th client to update its hyperparameter $\lambda_i$ by estimating its hyper-gradient $\hat{d}_{\lambda_i} F = \langle \hat{d}_\lambda F \rangle_i$ over stochastic directed networks. HGP runs an unbiased alternative of Eq. (10), based on our observation that the exact computation of Eq. (10) requires undirected edges.

Exact backpropagation requires undirected edges

Suppose the $i$-th client is responsible for computing the $i$-th blocks of $u^{(m+1)}$ and $v^{(m+1)}$, denoted by $u_i^{(m+1)} \in \mathbb{R}^{d_x}$ and $v_i^{(m+1)} \in \mathbb{R}^{d_\lambda}$, respectively. From Eq. (10), we obtain the following recursive iteration performed by the $i$-th client:

$$v_i^{(m+1)} \leftarrow \sum_{j=1}^{n} \delta^{(2m)}_{i \to j} \langle \hat{B}^{(2m)} \rangle_{ij} u_j^{(m)} + v_i^{(m)}, \qquad u_i^{(m+1)} \leftarrow \sum_{j=1}^{n} \delta^{(2m+1)}_{i \to j} \langle \hat{A}^{(2m+1)} \rangle_{ij} u_j^{(m)}, \tag{11}$$

where we use the equivalences $\langle \hat{A}^{(m)} \rangle_{ij} = \delta^{(m)}_{i \to j} \langle \hat{A}^{(m)} \rangle_{ij}$ and $\langle \hat{B}^{(m)} \rangle_{ij} = \delta^{(m)}_{i \to j} \langle \hat{B}^{(m)} \rangle_{ij}$, as these blocks are non-zero only when $\delta^{(m)}_{i \to j} = 1$ from Eq. (8) and Eq. (3). To complete Eq. (11), the $i$-th client needs to receive $u_j^{(m)}$ from every $j$-th client with $\delta^{(m)}_{i \to j} = 1$, which is possible only when there is a communication channel from $j$ to $i$ (i.e., $\delta^{(m)}_{j \to i} = 1$). In other words, the exact computation of Eq. (11) is available only when the communications are undirected (i.e., $\delta^{(m)}_{i \to j} = \delta^{(m)}_{j \to i}, \ \forall m \in \mathbb{N}$).

Unbiased estimation via directed edges

To relax the undirected communication constraint to stochastic directed communication, we propose HGP as a simple yet effective alternative to Eq. (11). We first assume that the $i$-th client knows the receiving frequency $\bar{\delta}_{j \to i} = \mathbb{E}_\delta[\delta_{j \to i}]$ and the expected sending weight $\bar{p}_{ij} = \mathbb{E}_\delta[p_{ij}(\delta_i)]$ for all $j \in [n]$. In practice, we can estimate them through $T$ rounds of SGP communication. We also adopt the following assumptions:

Assumption 6. If $\bar{\delta}_{j \to i} > 0$, then $\bar{\delta}_{i \to j} > 0$ and vice versa.

Assumption 7. The realizations of $\delta^{(m)}_{i \to j}$ are independent over different $j$ and $i$ for all $m \in \mathbb{N}$.

The key idea of HGP is to replace the sending edges $\delta^{(m)}_{i \to j}$ in Eq.
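(11) with the debiased receiving edges $\delta^{(m)}_{j \to i} / \bar{\delta}_{j \to i}$. By initializing $u_i^{(0)} \leftarrow \langle c_y \rangle_i = \frac{1}{n} \partial_{y_i} F_i(x_i^*, \lambda_i)$ and $v_i^{(0)} \leftarrow \langle c_\lambda \rangle_i = \frac{1}{n} \partial_{\lambda_i} F_i(x_i^*, \lambda_i)$, we obtain the estimate as $\hat{d}_{\lambda_i} F \leftarrow v_i^{(M)}$ after the following iterations for $m = 0, \ldots, M-1$:

$$v_i^{(m+1)} \leftarrow \sum_{j=1}^{n} \frac{\delta^{(2m)}_{j \to i}}{\bar{\delta}_{j \to i}} \langle \tilde{B}^{(2m)} \rangle_{ij} u_j^{(m)} + v_i^{(m)}, \qquad u_i^{(m+1)} \leftarrow \sum_{j=1}^{n} \frac{\delta^{(2m+1)}_{j \to i}}{\bar{\delta}_{j \to i}} \langle \tilde{A}^{(2m+1)} \rangle_{ij} u_j^{(m)}, \tag{12}$$

where $\langle \tilde{A}^{(m)} \rangle_{ij}$ and $\langle \tilde{B}^{(m)} \rangle_{ij}$ are defined by replacing $p_{ij}(\delta_i^{(m)})$ in Eq. (8a) and Eq. (8b) with $\bar{p}_{ij}$, respectively. The iterations above are always computable even on stochastic directed networks because the $i$-th client only needs to receive $u_j^{(m)}$ from the clients with $\delta^{(m)}_{j \to i} = 1$, which is always possible. We also note that Assumption 6 ensures that $\langle \tilde{A}^{(m)} \rangle_{ij}$ and $\langle \tilde{B}^{(m)} \rangle_{ij}$ are unbiased:

$$\mathbb{E}_{\delta, \zeta}\Big[ \frac{\delta^{(m)}_{j \to i}}{\bar{\delta}_{j \to i}} \langle \tilde{A}^{(m)} \rangle_{ij} \Big] = \mathbb{E}_{\delta}\Big[ \frac{\delta^{(m)}_{j \to i}}{\bar{\delta}_{j \to i}} \Big] \, \mathbb{E}_{\delta, \zeta}\big[ \langle \hat{A}^{(m)} \rangle_{ij} \big] = \langle \bar{A} \rangle_{ij}.$$

Variance reduction

We now introduce the variance-reduced version of HGP, which we call VR-HGP. The naive HGP above suffers from large variance because of $\delta^{(m)}_{j \to i} / \bar{\delta}_{j \to i}$, which can take a value far larger than one when $\bar{\delta}_{j \to i}$ is small. The multiplication of such values induces a high variance. The idea of VR-HGP is to combine HGP with its following variant, where $w_i^{(0)} \leftarrow \langle c_y \rangle_i$:

$$v_i^{(m+1)} = \sum_{j=1}^{n} \frac{\delta^{(2m)}_{j \to i}}{\bar{\delta}_{j \to i}} \langle \tilde{B}^{(2m)} \rangle_{ij} w_j^{(m)}, \qquad w_i^{(m+1)} = \sum_{j=1}^{n} \frac{\delta^{(2m+1)}_{j \to i}}{\bar{\delta}_{j \to i}} \langle \tilde{A}^{(2m+1)} \rangle_{ij} w_j^{(m)} + \langle c_y \rangle_i. \tag{13}$$

Here, $w^{(m)}$ corresponds to the estimator of $\sum_{m'=0}^{m-1} A^{m'} c_y$. Note that the weighted average of two different estimators results in an estimator with a smaller variance. By averaging Eq. (12) and Eq. (13) with weights $\alpha, \beta \in (0, 1)$, we obtain VR-HGP as the following iterations for $m = 0, \ldots$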
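, $M-1$:

$$v_i^{(m+1)} \leftarrow \alpha \Big( v_i^{(m)} + \sum_{j} \frac{\delta^{(2m)}_{j \to i}}{\bar{\delta}_{j \to i}} \langle \tilde{B}^{(2m)} \rangle_{ij} u_j^{(m)} \Big) + (1 - \alpha) \sum_{j} \frac{\delta^{(2m)}_{j \to i}}{\bar{\delta}_{j \to i}} \langle \tilde{B}^{(2m)} \rangle_{ij} w_j^{(m)},$$

$$u_i^{(m+1)} \leftarrow \sum_{j} \frac{\delta^{(2m+1)}_{j \to i}}{\bar{\delta}_{j \to i}} \langle \tilde{A}^{(2m+1)} \rangle_{ij} u_j^{(m)},$$

$$w_i^{(m+1)} \leftarrow \beta \Big( \sum_{j} \frac{\delta^{(2m+1)}_{j \to i}}{\bar{\delta}_{j \to i}} \langle \tilde{A}^{(2m+1)} \rangle_{ij} w_j^{(m)} + \langle c_y \rangle_i \Big) + (1 - \beta) \big( w_i^{(m)} + u_i^{(m+1)} \big), \tag{14}$$

with $v_i^{(0)} \leftarrow 0_{d_\lambda}$, $u_i^{(0)} \leftarrow \langle c_y \rangle_i$, and $w_i^{(0)} \leftarrow \langle c_y \rangle_i$, which gives the estimate $\hat{d}_{\lambda_i} F \leftarrow v_i^{(M)} + \langle c_\lambda \rangle_i$. The following theorem provides the estimation error of the hyper-gradient using VR-HGP.

Assumption 8. $\exists \eta_A \in (0, 1), \ \eta_B \in (0, \infty)$ such that $\forall y_i, \lambda_i, \zeta_i$ and $\forall i$,

$$\max\big\{ \|\partial_{y_i} \psi_i\|_2, \ \|\partial_{y_i} \varphi_i\|_2 \big\} \le \frac{\eta_A}{2\kappa}, \qquad \max\big\{ \|\partial_{\lambda_i} \psi_i\|_2, \ \|\partial_{\lambda_i} \varphi_i\|_2 \big\} \le \frac{\eta_B}{2\kappa},$$

where $\kappa = \sum_{i,j} \frac{\bar{p}_{ji}}{\bar{\delta}_{i \to j}}$ and $\|\cdot\|_2$ denotes the spectral norm.

Theorem 1 (Estimation Error of VR-HGP). Suppose that Assumptions 1-8 hold true and $|\zeta_i^{(2m)}| = |\zeta_i^{(2m+1)}| = b$ for any $i$ and $m$. Then, for $\alpha, \beta \in (0, 1)$, with probability at least $1 - \epsilon$, we have

$$\big\| \hat{d}_{\lambda_i} F - d_\lambda F(x^*, \lambda) \big\| \le \mu_{\alpha,\beta} \, \tau \Big( \sum_{i,j} \frac{\bar{p}_{ji}^2}{\bar{\delta}_{i \to j}^2} + \frac{4n}{b} \Big) \log \frac{n(d_y + d_\lambda)}{\epsilon} + e^{-O(M)},$$

where $\|\cdot\|$ denotes the $\ell_2$ norm, $e^{-O(M)}$ denotes a term that diminishes exponentially in $M$, and

$$\mu_{\alpha,\beta} = 8 \, \frac{1 - \alpha}{1 + \alpha} \Big( 1 + \frac{1 + \alpha(1 - \beta + \beta \eta_A)}{1 - \alpha(1 - \beta + \beta \eta_A)} \cdot \frac{\beta^2 \eta_A^2}{1 - (1 - \beta + \beta \eta_A)^2} \Big), \qquad \tau = \frac{\eta_B \|c_y\|}{\kappa (1 - \eta_A)}.$$

One can see that the coefficient $\mu_{\alpha,\beta}$ dominates the magnitude of the estimation error. Choosing $\alpha, \beta \in (0, 1)$ that minimize $\mu_{\alpha,\beta}$ can attain a small error. The proof is provided in Appendix C.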
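The debiasing step at the heart of Eq. (12) can be checked numerically with a short Monte Carlo sketch. This is an illustration under simplifying assumptions that are not from the paper: scalar blocks, a fixed matrix standing in for $\tilde{A}$ (no minibatch noise), and a hand-picked table of receiving frequencies $\bar{\delta}_{j \to i}$. All names and numbers are hypothetical.

```python
import random


def hgp_step(a_tilde, u, recv_prob, rng):
    """One debiased update u_i <- sum_j (delta_{j->i} / bar_delta_{j->i}) * A_ij * u_j.

    recv_prob[j][i] plays the role of bar_delta_{j->i}; self-loops always exist.
    """
    n = len(u)
    out = []
    for i in range(n):
        acc = 0.0
        for j in range(n):
            # Client i can only use u_j if the directed edge j -> i is realized.
            received = 1.0 if i == j or rng.random() < recv_prob[j][i] else 0.0
            acc += received / recv_prob[j][i] * a_tilde[i][j] * u[j]
        out.append(acc)
    return out


def monte_carlo_mean(a_tilde, u, recv_prob, num_samples, seed=0):
    """Average many independent realizations of hgp_step."""
    rng = random.Random(seed)
    total = [0.0] * len(u)
    for _ in range(num_samples):
        step = hgp_step(a_tilde, u, recv_prob, rng)
        total = [t + s for t, s in zip(total, step)]
    return [t / num_samples for t in total]


A_TILDE = [[0.3, 0.1, 0.2], [0.2, 0.3, 0.1], [0.1, 0.2, 0.3]]
U = [1.0, 2.0, -1.0]
RECV_PROB = [[1.0, 0.4, 0.8], [0.6, 1.0, 0.5], [0.7, 0.4, 1.0]]

exact = [sum(a * x for a, x in zip(row, U)) for row in A_TILDE]
mc_mean = monte_carlo_mean(A_TILDE, U, RECV_PROB, num_samples=50000)
```

The Monte Carlo average approaches the exact product because each edge indicator divided by its frequency has expectation one. A single realization, however, can deviate substantially when some $\bar{\delta}_{j \to i}$ is small, which is the variance issue that VR-HGP is designed to mitigate.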

5. RELATED WORK

Personalization in federated learning

We compare our work to standard personalization methods by recovering them as special cases of PDBO and pointing out their applicable communication networks. Mansour et al. (2020) and Deng et al. (2020) propose model interpolation, which provides personalized models as the optimal interpolation between local models and the global model; it is recovered by letting the inner-problem train the global model and the outer-problem optimize the interpolation weight. Federated multi-task learning (MTL) (Marfoq et al., 2021) obtains personalized models by allowing clients to tune the ensemble weights of the global base-predictors. In Section 6, we demonstrate that PDBO can recover the federated MTL by letting the inner-problem optimize the base-predictors and the outer-problem learn ensemble weights. We see the clustering personalization (Sattler et al., 2020) as a sub-problem of federated MTL, based on the empirical results (Marfoq et al., 2021, J.4) demonstrating that the personalized ensemble weights recover the client clusters. Data augmentation (Duan et al., 2019; Zhao et al., 2018) mitigates data heterogeneity by over- or undersampling to train a generalized global model. This can be recovered by optimizing pseudo-sampling rates as hyperparameters. Furthermore, the generality of PDBO allows us to optimize different types of parameters simultaneously, which current personalization algorithms cannot handle.

For communication networks, most personalization schemes require a centralized network (Sattler et al., 2020; Jiang et al., 2019), which is vulnerable to a central point of failure (Assran et al., 2019). A few fully-decentralized algorithms (Marfoq et al., 2021; Lu et al., 2022) assume static undirected networks, which are vulnerable to failing clients and deadlocks (Tsianos et al., 2012). While Vanhaesebrouck et al. (2017) and Zantedeschi et al.
(2020) consider stochastic undirected settings, their applicability is limited to linear models or a linear combination of pre-trained models. Gradient-based PDBO can learn more complex models and run on stochastic directed networks, which are immune to practical problems in centralized and static undirected networks.

Distributed bilevel optimization

Distributed bilevel optimizations proposed in concurrent works differ from PDBO in formulations. We categorize them into consensus distributed bilevel optimization (CDBO) (Chen et al., 2022; Tarzanagh et al., 2022; Gao et al., 2022; Yang et al., 2022) and CDBO with the local inner-problem (CDBO-Local) (Li et al., 2022; Liu et al., 2022; Lu et al., 2022). CDBO requires clients to reach a consensus on both the outer- and inner-problems. Lu et al. (2022) require consensus in the outer-problem as in CDBO, whereas their inner-problem is a local optimization. Clients in CDBO-Local thus cannot benefit from others in the inner-loop for better generalization. In our PDBO, both outer- and inner-problems are optimized using global information; the inner-parameters are trained for consensus, and the outer-parameters are optimized to improve the total performance across all clients. We highlight that our gradient-based PDBO recovers CDBO by running SGP using the estimated hyper-gradient for the outer-problem, and recovers CDBO-Local by using SGP for the outer-problem and designing $p_{ij}$ to form the self-loop topology in the inner SGP.

Hyper-gradient estimation over communication networks

We compare our HGP with the other hyper-gradient estimation methods performed over communication networks. Yang et al. (2022) propose a hyper-gradient estimation algorithm in fully-decentralized settings. However, they assume static and undirected networks, and their algorithm is complex both in computation and communication, as it involves computations and communications of full Jacobians and Hessians. Tarzanagh et al.
(2022) consider the hyper-gradient estimation in centralized settings, which is typical in federated learning. While their approach is advantageous in complexity because clients only compute Jacobian-vector products and exchange $O(d_x)$ vectors, its applicability is tied to the centralized host-clients setting. Other CDBO methods (Chen et al., 2022; Gao et al., 2022) estimate different types of hyper-gradient. See Appendix E for further details. Our HGP enjoys reasonable complexity in computation and communication, as stated in Section 4.2, and covers a wide range of communication networks, including stochastic and directed networks.

Hyper-gradient estimation for single agent

The hyper-gradient estimation approaches are categorized into iterative differentiation (ITD), and approximate implicit differentiation with recurrent backpropagation (AID-RB) and conjugate gradient (AID-CG) (Grazzi et al., 2020). We found that applying ITD or AID-CG to the hyper-gradient estimation on stochastic and directed communication networks is infeasible for the following reasons. The ITD variants, namely the backward and forward modes (Franceschi et al., 2017), have limitations in communication: the backward mode requires a static and undirected network, whereas the forward mode requires all-to-all communication at the end of the iteration and exchanging large matrices of size $O(d_y \times d_\lambda)$. A detailed discussion is provided in Appendix H. To apply AID-CG (Pedregosa, 2016), we need to solve $\min_{q \in \mathbb{R}^{d_y}} \frac{1}{2} \|(I - \bar{A}) q - c_y\|^2$. However, in our setting where $I - \bar{A}$ can be asymmetric, AID-CG is slower than AID-RB (Grazzi et al., 2020). AID-RB only requires the network to be undirected, and our HGP relaxes this limitation by a simple and effective modification.

6. EXPERIMENTS

To demonstrate the generality in personalization and applicability to practical communication environments, we introduced three different personalization approaches as special cases of gradient-based PDBO and benchmarked them with baselines on four different communication networks.

6.1. SETTINGS

We followed the settings of EMNIST classification played by n = 100 clients in Marfoq et al. (2021) unless otherwise mentioned. The detailed experimental settings are described in Appendix D.

Communication networks

We simulated four communication networks: fully-connected (FC), static undirected (FixU), stochastic undirected (StoU), and stochastic directed (StoD). FC allows clients to communicate with all the others at any time step, i.e., $\delta^{(t)}_{i \to j} = 1$ for all $i, j \in [n]$ and $t \in \mathbb{N}$. FixU is a static undirected network simulated by a binomial Erdős-Rényi graph (Erdős & Rényi, 1959) with parameter $p = 0.4$, to which we added the self-loop edges. Following the setting in Marfoq et al. (2021), we generated a doubly stochastic mixing matrix using the fast-mixing Markov chain (Boyd et al., 2003) rule. StoU simulates a stochastic undirected network by letting each undirected edge $\delta^{(t)}_{j \to i} = \delta^{(t)}_{i \to j}$ be independently realized at each step with probability $\bar{\delta}_{j \to i} \in [0, 1]$. In StoD, every direction of the edges $\delta^{(t)}_{j \to i}$ is independently sampled with probability $\bar{\delta}_{j \to i}$, simulating a stochastic directed network. For all $i, j \in [n]$, $\bar{\delta}_{j \to i}$ was sampled from the uniform distribution over $[0.4, 0.8]$.

Proposed approaches

We introduced and evaluated three different personalization methods as special cases of PDBO, namely PDBO-DA, PDBO-MTL, and PDBO-MTL&DA. PDBO-DA optimizes the pseudo-sampling rates to recover the data-augmentation-based personalization (Duan et al., 2019; Zhao et al., 2018). PDBO-DA optimizes $\lambda_i \in \mathbb{R}^C$ to obtain the label-wise weight vector $C \, \mathrm{Softmax}(\lambda_i) \in [0, C]^C$. In the inner-problem, the losses of the instances labeled as $c \in [C]$ are multiplied by the $c$-th element of the weight vector. PDBO-MTL is obtained by formulating FedEM (Marfoq et al., 2021) as PDBO. PDBO-MTL lets each client train an ensemble classifier that outputs the weighted average of the predictions of $K = 3$ CNNs. PDBO-MTL trains the CNN parameters as the inner-problem and optimizes the hyperparameters $\lambda_i \in \mathbb{R}^K$ to obtain the ensemble weight vector $\mathrm{Softmax}(\lambda_i) \in [0, 1]^K$. PDBO-MTL&DA combines PDBO-DA and PDBO-MTL by optimizing $\lambda_i \in \mathbb{R}^{C+K}$ to obtain both the label weight and ensemble weight vectors.
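The two hyperparameter parameterizations described above can be sketched as follows. This is a minimal illustration in plain Python, not the experimental implementation: the helper names are hypothetical, per-example losses and member predictions are passed in as plain lists, and the weighted loss is averaged over the examples as an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_weights(lam_da):
    """PDBO-DA: label-wise weights C * Softmax(lambda), each in [0, C]."""
    num_labels = len(lam_da)
    return [num_labels * s for s in softmax(lam_da)]


def weighted_loss(lam_da, labels, per_example_losses):
    """Inner loss where each example's loss is scaled by its label's weight."""
    weights = label_weights(lam_da)
    scaled = [weights[y] * loss for y, loss in zip(labels, per_example_losses)]
    return sum(scaled) / len(labels)


def ensemble_predict(lam_mtl, member_probs):
    """PDBO-MTL: average the members' class probabilities with Softmax(lambda)."""
    weights = softmax(lam_mtl)
    num_classes = len(member_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, member_probs))
            for c in range(num_classes)]


# With zero hyperparameters, every label weight is 1 and the ensemble is uniform.
uniform_label_weights = label_weights([0.0, 0.0, 0.0, 0.0])
uniform_prediction = ensemble_predict([0.0, 0.0, 0.0],
                                      [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
plain_loss = weighted_loss([0.0, 0.0, 0.0], [0, 2], [1.0, 3.0])
```

One consequence of this parameterization is that zero hyperparameters yield unit label weights and a uniform ensemble, so an outer-optimization started from zeros begins at the unweighted loss and the plain average of the $K$ predictors.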
Baseline approaches

We compared our approaches with baselines on each communication setting. For the FC and FixU settings, we compared several personalization approaches: a personalized model trained only on the local dataset (Local), FedAvg with local tuning (FedAvg+) (Jiang et al., 2019), Clustered-FL (Sattler et al., 2020), pFedMe (T Dinh et al., 2020), and centralized and decentralized versions of FedEM (Marfoq et al., 2021). We also trained the global models using SGP (Nedić & Olshevsky, 2016; Assran et al., 2019) and FedProx (Li et al., 2020). As SGP recovers FedAvg and DSGD on FC and FixU, respectively, we treat them as equivalent approaches. For all approaches, including ours, the model architecture follows the setting in Marfoq et al. (2021).

Training procedure

We allowed every client to generate its local dataset, which has its unique label distribution, following Marfoq et al. (2021), and split it into train, validation, and test datasets. All baselines and PDBO inner-optimizations ran the distributed learning following Marfoq et al. (2021) on FC and FixU, and ran SGP of Eq. (4b) on StoU and StoD, using the train dataset. In PDBO, any $i$-th client estimates $\bar{\delta}_{j \to i}$ and $\bar{p}_{ij}$ for all $j \in [n]$ through communications in the inner-optimization, and approximates $y_i^*$ by $y_i^{(T)}$ obtained from the $T$ steps of inner-optimization. Theorem 11 in Appendix C.7 proves that this approximation of $y_i^*$ is reasonable when $\|y_i^{(T)} - y_i^*\|$ is sufficiently small.

PDBO outer-optimizations ran 20 outer-steps while tracing the average validation accuracy, and we reported the average test accuracy at the outer-step that showed the best validation accuracy. Outer-steps were performed by Adam (Kingma & Ba, 2015) starting from the zero-initialized hyperparameters $0_{d_\lambda}$. To estimate the hyper-gradient for each outer-step, clients ran $M = 200$ HGP iterations with Eq. (4b) using the average cross-entropy on the train dataset as $F_i$. We adopted HGP for FC and FixU, and VR-HGP with $(\alpha, \beta) = (0.9, 0.1)$ for StoU and StoD. We also made a practical modification in HGP to sample $\tilde{A}^{(m)}$ and $\tilde{B}^{(m)}$ together at the single $m$-th round, which leads to the same length of the Neumann series with half the sampling cost of the original HGP, although the estimates are no longer unbiased.

6.2. RESULTS AND DISCUSSIONS

Personalization performance

Table 1 shows the average test accuracy with weights proportional to local test dataset sizes. We observed that the ensemble-based approaches, FedEM, PDBO-MTL, and PDBO-MTL&DA, performed the best on FC, and PDBO-MTL&DA outperformed the other approaches on all fully-decentralized settings, namely FixU, StoU, and StoD. Although PDBO-DA improved the average accuracy over SGP in all communication settings, it was especially effective when combined with PDBO-MTL. These results indicate that optimizing different parameters simultaneously, which is newly enabled by our PDBO, is advantageous to the personalization performance. We also investigated whether the accuracy gain was shared among all clients. Table 1 shows the accuracy at the bottom 10th percentile of clients. All our approaches improved the accuracy at the 10th percentile over the global model approaches (SGP and FedProx) in all communication settings, confirming that the clients fairly benefited from our personalization.

Applicability to stochastic communication networks

The communication network limits the available personalization methods, especially when the network is stochastic. Although FedEM is one of the few personalization methods feasible in fully-decentralized settings, it requires the doubly stochastic mixing matrix to be known, which is impractical on stochastic networks (Tsianos et al., 2012). As PDBO encompasses SGP, and HGP can run on stochastic communication networks, our approaches succeeded in personalization on StoU and StoD.

Robustness to communication directionality

Our HGP and VR-HGP estimate the hyper-gradient solely from the directed communication edges, rather than running the standard recurrent backpropagation, which requires undirected edges. The improvement of our approaches on StoD demonstrated that VR-HGP estimated the hyper-gradient with sufficiently small errors to solve PDBO.

7. CONCLUSION

This study proposed a gradient-based PDBO, which reduces most personalization approaches to the optimization of hyperparameters possessed by each client. We also proposed HGP that estimates the hyper-gradient through communications over stochastic and directed communication networks. In addition, we introduced a variance-reduced HGP that mitigated the estimation variance caused by the stochasticity of communication edges and provided its theoretical error bound. Our empirical results demonstrated that our gradient-based PDBO with HGP enabled combining different personalization approaches which led to state-of-the-art performance, and it performed on different simulated communication environments including a stochastic and directed network.

REPRODUCIBILITY STATEMENT

We provide the detailed experimental settings of Section 6 in Appendix D, including any modifications to the benchmark of Marfoq et al. (2021), and will release the implementations after the review process. Our theoretical contributions and required assumptions are stated in Section 4.2. The detailed derivations of our HGP and VR-HGP are provided in Appendix A and Appendix B, respectively. We also provide a detailed proof of the estimation error bound of VR-HGP in Appendix C. The code for reproducing the results in Section 6 and Appendix G is provided in a separate supplement.

A ESTIMATION OF HYPER-GRADIENT

Notation: For a vector $v \in \mathbb{R}^d$, $\|v\| = \sqrt{\sum_{i=1}^{d} v_i^2}$ is its $\ell_2$ norm. For a matrix $V \in \mathbb{R}^{d_1 \times d_2}$, $\|V\|_2$ is its largest singular value.

A.1 STATIONARITY OF SGP

We consider the generalized version of SGP over $n$ nodes as follows:

$$y_i^{(t+1)} = \sum_{j=1}^{n} p_{ji}\big(\delta_j^{(t)}\big)\, \varphi_j\big(y_j^{(t)}; \lambda_j, \zeta_j^{(t)}\big) + \psi_i\big(y_i^{(t)}; \lambda_i, \zeta_i^{(t)}\big),$$

$$p_{ji}\big(\delta_j^{(t)}\big) = \delta_{j \to i}^{(t)}\, p_{ji}\big(\delta_{j \to 1}^{(t)}, \dots, \delta_{j \to n}^{(t)}\big).$$

We set $\delta_{i \to i}^{(t)} = 1$ for all $i$ and $t$, i.e., every client can send a message to itself at any time step. Assumptions 1 and 2 ensure the existence of the unique stationary point $y^*$:

$$y_i^* = \mathbb{E}_{\delta,\zeta}\Big[\sum_{j=1}^{n} p_{ji}\, \varphi_j\big(y_j^*; \lambda_j, \zeta_j\big) + \psi_i\big(y_i^*; \lambda_i, \zeta_i\big)\Big] = \sum_{j=1}^{n} \bar{p}_{ji}\, \mathbb{E}_{\zeta}\big[\varphi_j\big(y_j^*; \lambda_j, \zeta_j\big)\big] + \mathbb{E}_{\zeta}\big[\psi_i\big(y_i^*; \lambda_i, \zeta_i\big)\big],$$

where $\bar{p}_{ji} = \mathbb{E}_{\delta}[p_{ji}]$.

A.2 HYPER-GRADIENT BY IMPLICIT DIFFERENTIATION

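We adopt Assumption 3 so that $\varphi$, $\psi$, and $F$ are differentiable. Differentiating the stationarity condition of $y_i^*$ with respect to $\lambda_j$ gives

$$d_{\lambda_j} y_i^* = \sum_{k=1}^{n} \bar{p}_{ki}\, d_{\lambda_j} y_k^*\, \partial_{y_k} \mathbb{E}_{\zeta}\big[\varphi_k(y_k^*; \lambda_k, \zeta_k)\big] + \bar{p}_{ji}\, \partial_{\lambda_j} \mathbb{E}_{\zeta}\big[\varphi_j(y_j^*; \lambda_j, \zeta_j)\big] + d_{\lambda_j} y_i^*\, \partial_{y_i} \mathbb{E}_{\zeta}\big[\psi_i(y_i^*; \lambda_i, \zeta_i)\big] + 1_{ji}\, \partial_{\lambda_i} \mathbb{E}_{\zeta}\big[\psi_i(y_i^*; \lambda_i, \zeta_i)\big].$$

Let $y^* = [y_i^*]_i$ and $\lambda = [\lambda_i]_i$ be the concatenated parameters and hyperparameters, respectively. We can write the differentiation in matrix form as $d_\lambda y^* = d_\lambda y^* \bar{A} + \bar{B}$, where

$$\bar{A} = \big[1_{ji} \bar{A}^{\psi}_i + \bar{p}_{ji} \bar{A}^{\varphi}_j\big]_{ji} \in \mathbb{R}^{n d_y \times n d_y},$$
$$\bar{A}^{\psi}_i = \partial_{y_i} \mathbb{E}_{\zeta}\big[\psi_i(y_i^*; \lambda_i, \zeta_i)\big] \in \mathbb{R}^{d_y \times d_y}, \qquad \bar{A}^{\varphi}_j = \partial_{y_j} \mathbb{E}_{\zeta}\big[\varphi_j(y_j^*; \lambda_j, \zeta_j)\big] \in \mathbb{R}^{d_y \times d_y},$$
$$\bar{B} = \big[1_{ji} \bar{B}^{\psi}_i + \bar{p}_{ji} \bar{B}^{\varphi}_j\big]_{ji} \in \mathbb{R}^{n d_\lambda \times n d_y},$$
$$\bar{B}^{\psi}_i = \partial_{\lambda_i} \mathbb{E}_{\zeta}\big[\psi_i(y_i^*; \lambda_i, \zeta_i)\big] \in \mathbb{R}^{d_\lambda \times d_y}, \qquad \bar{B}^{\varphi}_j = \partial_{\lambda_j} \mathbb{E}_{\zeta}\big[\varphi_j(y_j^*; \lambda_j, \zeta_j)\big] \in \mathbb{R}^{d_\lambda \times d_y}.$$

Then, we have $d_\lambda y^* = \bar{B}(I - \bar{A})^{-1}$. In particular, we have

$$d_{\lambda_j} y_i^* = \sum_{k} \langle \bar{B} \rangle_{jk}\, \langle (I - \bar{A})^{-1} \rangle_{ki},$$

where $\langle \cdot \rangle_{jk}$ and $\langle \cdot \rangle_{ki}$ denote the $(j,k)$-th and $(k,i)$-th blocks of the matrix. The hyper-gradient of the objective function $F(x^*, \lambda) = \sum_i F_i(x_i^*, \lambda_i)$ is then given as

$$d_{\lambda_j} F(y^*, \lambda) = \sum_{i} d_{\lambda_j} y_i^*\, \underbrace{\partial_{y_i} F_i(y_i^*, \lambda_i)}_{c^y_i} + \underbrace{\partial_{\lambda_j} F_j(y_j^*, \lambda_j)}_{c^\lambda_j} = \sum_{i} d_{\lambda_j} y_i^*\, c^y_i + c^\lambda_j.$$

Under review as a conference paper at ICLR 2023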
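The implicit-differentiation identity $d_\lambda y^* = \bar{B}(I - \bar{A})^{-1}$ can be checked on a toy problem. The sketch below is illustrative only: the scalar map `phi`, the iteration count, and the step size `h` are assumptions chosen for the example and do not come from the paper. It compares the implicit derivative at the fixed point with a central finite difference.

```python
import math

def phi(y, lam):
    """Hypothetical scalar update map; |d phi / d y| <= 0.5, so it is a contraction."""
    return 0.5 * math.tanh(y) + lam

def fixed_point(lam, iters=200):
    """Iterate y <- phi(y, lam) until (numerical) stationarity."""
    y = 0.0
    for _ in range(iters):
        y = phi(y, lam)
    return y

def implicit_grad(lam):
    """d y*/d lam = B (1 - A)^{-1}, the scalar analogue of B (I - A)^{-1}."""
    y_star = fixed_point(lam)
    a = 0.5 * (1.0 - math.tanh(y_star) ** 2)  # A: derivative of phi w.r.t. y at y*
    b = 1.0                                    # B: derivative of phi w.r.t. lam at y*
    return b / (1.0 - a)

lam = 0.3
h = 1e-5
finite_diff = (fixed_point(lam + h) - fixed_point(lam - h)) / (2 * h)
implicit = implicit_grad(lam)
print(implicit, finite_diff)
```

The two printed values should agree to several digits, since the fixed point of a contraction depends smoothly on the hyperparameter.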

A.3 ESTIMATION OF HYPERGRADIENT

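In the remainder, we consider $\psi_i$ and $\varphi_j$ of the following forms:

$$\psi_i(y_i; \lambda_i, \zeta_i) = \frac{1}{|\zeta_i|} \sum_{\xi \in \zeta_i} g_i(y_i; \lambda_i, \xi), \qquad \varphi_j(y_j; \lambda_j, \zeta_j) = \frac{1}{|\zeta_j|} \sum_{\xi \in \zeta_j} h_j(y_j; \lambda_j, \xi),$$

for some $g_i(\cdot; \lambda_i, \xi) : \mathbb{R}^{d_y} \to \mathbb{R}^{d_y}$ and $h_j(\cdot; \lambda_j, \xi) : \mathbb{R}^{d_y} \to \mathbb{R}^{d_y}$, which hold for SGP in Eq. (4a) and Eq. (4b). Assumption 3 ensures that $g_i$ and $h_j$ are differentiable with respect to both $y$ and $\lambda$.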

A.3.1 ESTIMATION OF Ā AND B

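Because the matrices $\bar{A}$ and $\bar{B}$ are defined as expectations over the data minibatches $\zeta_i, \zeta_j$ as well as over the realization of the communication network $\delta$, we estimate them from the observations as follows:

$$\hat{A} = \big[1_{ji} \hat{A}^{\psi}_i + p_{ji} \hat{A}^{\varphi}_j\big]_{ji} \in \mathbb{R}^{n d_y \times n d_y},$$
$$\hat{A}^{\psi}_i = \frac{1}{|\zeta_i|} \sum_{\xi \in \zeta_i} \partial_{y_i} g_i(y_i^*; \lambda_i, \xi) \in \mathbb{R}^{d_y \times d_y}, \qquad \hat{A}^{\varphi}_j = \frac{1}{|\zeta_j|} \sum_{\xi \in \zeta_j} \partial_{y_j} h_j(y_j^*; \lambda_j, \xi) \in \mathbb{R}^{d_y \times d_y},$$
$$\hat{B} = \big[1_{ji} \hat{B}^{\psi}_i + p_{ji} \hat{B}^{\varphi}_j\big]_{ji} \in \mathbb{R}^{n d_\lambda \times n d_y},$$
$$\hat{B}^{\psi}_i = \frac{1}{|\zeta_i|} \sum_{\xi \in \zeta_i} \partial_{\lambda_i} g_i(y_i^*; \lambda_i, \xi) \in \mathbb{R}^{d_\lambda \times d_y}, \qquad \hat{B}^{\varphi}_j = \frac{1}{|\zeta_j|} \sum_{\xi \in \zeta_j} \partial_{\lambda_j} h_j(y_j^*; \lambda_j, \xi) \in \mathbb{R}^{d_\lambda \times d_y}.$$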

A.3.2 APPROXIMATION BY NEUMANN SERIES

With Assumption 4, we have $\|\bar{A}\|_2 < 1$. We can thus approximate $(I - \bar{A})^{-1}$ by the Neumann series truncated after $M$ terms:

$$(I - \bar{A})^{-1} = \sum_{m=0}^{\infty} \bar{A}^m \approx \sum_{m=0}^{M-1} \bar{A}^m.$$

The hyper-gradient can then be approximated as

$$d_\lambda F(x^*, \lambda) \approx \bar{B} \sum_{m=0}^{M-1} \bar{A}^m c^y + c^\lambda.$$

By replacing $\bar{A}$ and $\bar{B}$ with the estimators $\hat{A}$ and $\hat{B}$, we have

$$d_\lambda F(x^*, \lambda) \approx \sum_{m=0}^{M-1} \hat{B}^{(2m)} \prod_{s=0}^{m-1} \hat{A}^{(2s+1)} c^y + c^\lambda,$$

where $\hat{A}^{(2s+1)}$ and $\hat{B}^{(2m)}$ denote the estimators at the $(2s+1)$-th and the $2m$-th step of the communication round, respectively. That is, we estimate $\bar{A}$ in the odd-numbered steps and $\bar{B}$ in the even-numbered steps of the communication round.
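The truncation error of the Neumann series decays geometrically once $\|\bar{A}\|_2 < 1$. The following minimal sketch illustrates this numerically; the matrix size, the spectral norm of 0.5, and the random seed are arbitrary assumptions for the example.

```python
import numpy as np

def neumann(A, M):
    """Return the truncated Neumann series sum_{m=0}^{M-1} A^m."""
    total = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for _ in range(M):
        total += power
        power = power @ A
    return total

rng = np.random.default_rng(0)
d = 6
A = rng.standard_normal((d, d))
A *= 0.5 / np.linalg.norm(A, 2)      # rescale so that the spectral norm is 0.5

exact = np.linalg.inv(np.eye(d) - A)
terms = (1, 5, 20, 60)
errors = [np.linalg.norm(exact - neumann(A, M), 2) for M in terms]
for M, err in zip(terms, errors):
    # The tail sum_{m >= M} A^m is bounded by 0.5**M / (1 - 0.5) in spectral norm.
    print(M, err, 2 * 0.5 ** M)
```

Each printed error should stay below the geometric tail bound in the last column, up to floating-point round-off.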

A.4 HYPER-GRADIENT PUSH (HGP)

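We now present our proposed method, hyper-gradient push (HGP), which is a modified version of recurrent backpropagation. HGP can run even on stochastic and directed networks while enjoying the same order of communication efficiency as SGP. In HGP, we adopt Assumptions 6 and 7, and assume that $\{\bar{\delta}_{j \to i} = \mathbb{E}_{\delta}[\delta_{j \to i}]\}_{j,i}$ and $\{\bar{p}_{ji}\}_{j,i}$ are known. The idea of HGP is to use $\delta_{i \to j}\, \bar{p}_{ji} / \bar{\delta}_{i \to j}$ instead of $p_{ji}$ in $\hat{A}$ and $\hat{B}$ as follows:

$$\hat{A} = \Big[1_{ji} \hat{A}^{\psi}_i + \frac{\delta_{i \to j}\, \bar{p}_{ji}}{\bar{\delta}_{i \to j}} \hat{A}^{\varphi}_j\Big]_{ji}, \qquad \hat{B} = \Big[1_{ji} \hat{B}^{\psi}_i + \frac{\delta_{i \to j}\, \bar{p}_{ji}}{\bar{\delta}_{i \to j}} \hat{B}^{\varphi}_j\Big]_{ji}.$$

Under Assumption 5, where $\delta$ and $\zeta$ are independent, these are unbiased estimators because

$$\mathbb{E}_{\delta,\zeta}\big[\hat{A}\big] = \Big[1_{ji} \underbrace{\mathbb{E}_{\zeta_i}\big[\hat{A}^{\psi}_i\big]}_{= \bar{A}^{\psi}_i} + \underbrace{\mathbb{E}_{\delta_{i \to j}}\Big[\frac{\delta_{i \to j}\, \bar{p}_{ji}}{\bar{\delta}_{i \to j}}\Big]}_{= \bar{\delta}_{i \to j} \bar{p}_{ji} / \bar{\delta}_{i \to j} = \bar{p}_{ji}} \underbrace{\mathbb{E}_{\zeta_j}\big[\hat{A}^{\varphi}_j\big]}_{= \bar{A}^{\varphi}_j}\Big]_{ji} = \big[1_{ji} \bar{A}^{\psi}_i + \bar{p}_{ji} \bar{A}^{\varphi}_j\big]_{ji} = \bar{A},$$

$$\mathbb{E}_{\delta,\zeta}\big[\hat{B}\big] = \Big[1_{ji} \underbrace{\mathbb{E}_{\zeta_i}\big[\hat{B}^{\psi}_i\big]}_{= \bar{B}^{\psi}_i} + \underbrace{\mathbb{E}_{\delta_{i \to j}}\Big[\frac{\delta_{i \to j}\, \bar{p}_{ji}}{\bar{\delta}_{i \to j}}\Big]}_{= \bar{p}_{ji}} \underbrace{\mathbb{E}_{\zeta_j}\big[\hat{B}^{\varphi}_j\big]}_{= \bar{B}^{\varphi}_j}\Big]_{ji} = \big[1_{ji} \bar{B}^{\psi}_i + \bar{p}_{ji} \bar{B}^{\varphi}_j\big]_{ji} = \bar{B}.$$

Recall that the hyper-gradient can be approximated as

$$d_{\lambda_j} F(x^*, \lambda) \approx \sum_{m=0}^{M-1} \sum_{k} \langle \bar{B} \rangle_{jk} \sum_{i} \langle \bar{A}^m \rangle_{ki}\, c^y_i + c^\lambda_j.$$

By replacing the expectations with the above estimators $\hat{A}$ and $\hat{B}$, we have

$$d_{\lambda_j} F(x^*, \lambda) = \sum_{m=0}^{M-1} \sum_{k} \big\langle \hat{B}^{(2m)} \big\rangle_{jk} \sum_{i} \Big\langle \prod_{s=0}^{m-1} \hat{A}^{(2s+1)} \Big\rangle_{ki} c^y_i + c^\lambda_j.$$

Let $u^{(m)}_k = \sum_{i} \big\langle \prod_{s=0}^{m-1} \hat{A}^{(2s+1)} \big\rangle_{ki} c^y_i$. We note that $u^{(m+1)}_k$ can be computed recursively as

$$u^{(m+1)}_k = \sum_{k'} \big\langle \hat{A}^{(2m+1)} \big\rangle_{kk'}\, u^{(m)}_{k'}.$$

By using this fact, we can rewrite the estimator as

$$d_{\lambda_j} F(x^*, \lambda) = \sum_{m=0}^{M-1} \sum_{k} \big\langle \hat{B}^{(2m)} \big\rangle_{jk} \sum_{k'} \big\langle \hat{A}^{(2m-1)} \big\rangle_{kk'} \underbrace{\sum_{i} \Big\langle \prod_{s=0}^{m-2} \hat{A}^{(2s+1)} \Big\rangle_{k'i} c^y_i}_{u^{(m-1)}_{k'}} + c^\lambda_j = \sum_{m=0}^{M-1} \sum_{k} \big\langle \hat{B}^{(2m)} \big\rangle_{jk} \underbrace{\sum_{k'} \big\langle \hat{A}^{(2m-1)} \big\rangle_{kk'} u^{(m-1)}_{k'}}_{u^{(m)}_k} + c^\lambda_j.$$

We can then derive the proposed algorithm, hyper-gradient push, as follows:

Hyper-Gradient Push (HGP)

$$u^{(0)}_j \leftarrow c^y_j, \qquad v^{(0)}_j \leftarrow c^\lambda_j,$$
$$\begin{cases} v^{(m+1)}_j \leftarrow v^{(m)}_j + \sum_{k} \big\langle \hat{B}^{(2m)} \big\rangle_{jk}\, u^{(m)}_k \\ u^{(m+1)}_k \leftarrow \sum_{k'} \big\langle \hat{A}^{(2m+1)} \big\rangle_{kk'}\, u^{(m)}_{k'} \end{cases} \quad \text{for } m = 0, 1, 2, \dots, M-1,$$
$$d_{\lambda_j} F(x^*, \lambda) \leftarrow v^{(M)}_j.$$

In HGP, the estimator is obtained after $2M$ rounds of communication. In each round of communication, the clients communicate only $u^{(m)}_k \in \mathbb{R}^{d_y}$, which amounts to $O(d_y)$ parameters, the same as the standard communication for the SGP update.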
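The HGP recursion can be sketched in concatenated (matrix) form as follows. This is a centralized illustration under stated assumptions, not the decentralized implementation: the function name `hgp`, the dense matrices, and the use of the exact $\bar{A}$ and $\bar{B}$ in place of per-round estimates are all choices made for the example. With noise-free matrices, the recursion should reproduce $\bar{B}(I - \bar{A})^{-1} c^y + c^\lambda$ up to the truncation error.

```python
import numpy as np

def hgp(A_list, B_list, c_y, c_lam):
    """Run v <- v + B^(2m) u, u <- A^(2m+1) u for the given per-round matrices."""
    u = c_y.copy()
    v = c_lam.copy()
    for A_m, B_m in zip(A_list, B_list):
        v = v + B_m @ u   # accumulate the m-th Neumann term
        u = A_m @ u       # push u one step further
    return v

rng = np.random.default_rng(0)
ny, nl, M = 8, 5, 80                      # hypothetical sizes of y, lambda, and rounds
A = rng.standard_normal((ny, ny))
A *= 0.5 / np.linalg.norm(A, 2)           # enforce spectral norm below one
B = rng.standard_normal((nl, ny))
c_y = rng.standard_normal(ny)
c_lam = rng.standard_normal(nl)

estimate = hgp([A] * M, [B] * M, c_y, c_lam)
exact = B @ np.linalg.solve(np.eye(ny) - A, c_y) + c_lam
print(np.max(np.abs(estimate - exact)))
```

Only matrix-vector products appear in the loop, which reflects why each round needs to exchange just a vector of size $d_y$ per client.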

B VARIANCE REDUCTION

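We now introduce the variance-reduced version of HGP. The naive HGP above suffers from large variance because of the factor $\delta^{(m)}_{j \to i} / \bar{\delta}_{j \to i}$; this term can take a value far larger than one when $\bar{\delta}_{j \to i}$ is small, and multiplying such values induces high variance. Recall that, in HGP, we aim at approximating

$$d_{\lambda_j} F(x^*, \lambda) \approx \sum_{m=0}^{M-1} \sum_{k} \langle \bar{B} \rangle_{jk} \sum_{i} \langle \bar{A}^m \rangle_{ki}\, c^y_i + c^\lambda_j.$$

With $v^{(0)}_j \leftarrow 0$ and $u^{(0)}_k \leftarrow c^y_k$, HGP computes the first term of the right-hand side by iterating

$$v^{(m+1)}_j \leftarrow v^{(m)}_j + \sum_{k} \langle \bar{B} \rangle_{jk}\, u^{(m)}_k, \qquad u^{(m+1)}_k \leftarrow \sum_{k'} \langle \bar{A} \rangle_{kk'}\, u^{(m)}_{k'},$$

where $u^{(m+1)}_k$ is equivalent to $\sum_{k'} \langle \bar{A}^{m+1} \rangle_{kk'}\, c^y_{k'}$. We can also consider another way of computing the first term. With $v^{(0)}_j \leftarrow 0_{d_\lambda}$ and $w^{(0)}_k \leftarrow c^y_k$, we can compute

$$v^{(m+1)}_j \leftarrow \sum_{k} \langle \bar{B} \rangle_{jk}\, w^{(m)}_k, \qquad w^{(m+1)}_k \leftarrow \sum_{k'} \langle \bar{A} \rangle_{kk'}\, w^{(m)}_{k'} + c^y_k,$$

where $w^{(m+1)}_k$ is equivalent to

$$\sum_{m'=0}^{m+1} \sum_{k'} \langle \bar{A}^{m'} \rangle_{kk'}\, c^y_{k'} = \sum_{m'=0}^{m+1} u^{(m')}_k = w^{(m)}_k + u^{(m+1)}_k.$$

By combining the above two formulas, we can derive the general expression of HGP as

$$v^{(m+1)}_j \leftarrow \alpha \Big( v^{(m)}_j + \sum_{k} \langle \bar{B} \rangle_{jk}\, u^{(m)}_k \Big) + (1 - \alpha) \sum_{k} \langle \bar{B} \rangle_{jk}\, w^{(m)}_k,$$
$$u^{(m+1)}_k \leftarrow \sum_{k'} \langle \bar{A} \rangle_{kk'}\, u^{(m)}_{k'},$$
$$w^{(m+1)}_k \leftarrow \beta \Big( \sum_{k'} \langle \bar{A} \rangle_{kk'}\, w^{(m)}_{k'} + c^y_k \Big) + (1 - \beta) \Big( w^{(m)}_k + u^{(m+1)}_k \Big),$$

where $\alpha, \beta \in [0, 1]$ are the interpolation weights. By replacing $\bar{A}, \bar{B}$ with the empirical estimates $\hat{A}, \hat{B}$, we obtain the general expression of HGP as follows.

General HGP for Variance Reduction

$$v^{(0)}_j \leftarrow 0_{d_\lambda}, \qquad u^{(0)}_j \leftarrow c^y_j, \qquad w^{(0)}_j \leftarrow c^y_j,$$
$$\begin{cases} v^{(m+1)}_j \leftarrow \alpha \Big( v^{(m)}_j + \sum_{k} \big\langle \hat{B}^{(2m)} \big\rangle_{jk}\, u^{(m)}_k \Big) + (1 - \alpha) \sum_{k} \big\langle \hat{B}^{(2m)} \big\rangle_{jk}\, w^{(m)}_k \\ u^{(m+1)}_k \leftarrow \sum_{k'} \big\langle \hat{A}^{(2m+1)} \big\rangle_{kk'}\, u^{(m)}_{k'} \\ w^{(m+1)}_k \leftarrow \beta \Big( \sum_{k'} \big\langle \hat{A}^{(2m+1)} \big\rangle_{kk'}\, w^{(m)}_{k'} + c^y_k \Big) + (1 - \beta) \Big( w^{(m)}_k + u^{(m+1)}_k \Big) \end{cases} \quad \text{for } m = 0, 1, 2, \dots, M-1,$$
$$d_{\lambda_j} F(x^*, \lambda) \leftarrow v^{(M)}_j + c^\lambda_j.$$

We note that this general HGP is a weighted average of two different estimation algorithms, which results in an estimator with a smaller variance. That is, by choosing $\alpha, \beta \in [0, 1]$ appropriately, we can obtain an estimate of the hyper-gradient with a smaller variance.

From the computational perspective, this general HGP has properties similar to those of the original HGP: it can be computed even on stochastic and directed networks; the estimator is obtained after $2M$ rounds of communication; and the clients communicate $O(d_y)$ parameters in each iteration.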
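The general recursion with interpolation weights can be sketched in concatenated form as below. As an illustration under stated assumptions (the function name `general_hgp`, dense matrices, and noise-free $\bar{A}$, $\bar{B}$ are choices made for the example), it checks that, without estimation noise, every choice of $(\alpha, \beta)$ yields the same truncated Neumann sum; the weights only matter for how noise in the per-round estimates is averaged.

```python
import numpy as np

def general_hgp(A_list, B_list, c_y, c_lam, alpha, beta):
    """General HGP in matrix form; (alpha, beta) = (1, 0) recovers naive HGP."""
    u = c_y.copy()
    w = c_y.copy()
    v = np.zeros_like(c_lam)
    for A_m, B_m in zip(A_list, B_list):
        # v^(m+1) uses u^(m) and w^(m), so update v before u and w.
        v = alpha * (v + B_m @ u) + (1.0 - alpha) * (B_m @ w)
        u_next = A_m @ u
        w = beta * (A_m @ w + c_y) + (1.0 - beta) * (w + u_next)
        u = u_next
    return v + c_lam

rng = np.random.default_rng(1)
ny, nl, M = 8, 5, 80                      # hypothetical problem sizes
A = rng.standard_normal((ny, ny))
A *= 0.5 / np.linalg.norm(A, 2)
B = rng.standard_normal((nl, ny))
c_y = rng.standard_normal(ny)
c_lam = rng.standard_normal(nl)

exact = B @ np.linalg.solve(np.eye(ny) - A, c_y) + c_lam
results = {
    (a, b): general_hgp([A] * M, [B] * M, c_y, c_lam, a, b)
    for (a, b) in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (0.0, 1.0)]
}
for key, value in results.items():
    print(key, np.max(np.abs(value - exact)))
```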

C ESTIMATION ERROR OF HYPER-GRADIENT

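In the following, we assume that the derivatives of $g_i$ and $h_j$ are bounded.

Assumption 9. $\exists\, \eta_A \in (0, 1),\ \eta_B \in (0, \infty)$ such that $\forall \xi$ and $\forall i, j$,

$$\max\Big\{ \sup_{y_i, \lambda_i, \xi} \big\|\partial_{y_i} g_i(y_i, \lambda_i, \xi)\big\|_2,\ \sup_{y_j, \lambda_j, \xi} \big\|\partial_{y_j} h_j(y_j, \lambda_j, \xi)\big\|_2 \Big\} \le \frac{\eta_A}{2 \sum_{i,j} \bar{p}_{ji} / \bar{\delta}_{i \to j}},$$
$$\max\Big\{ \sup_{y_i, \lambda_i, \xi} \big\|\partial_{\lambda_i} g_i(y_i, \lambda_i, \xi)\big\|_2,\ \sup_{y_j, \lambda_j, \xi} \big\|\partial_{\lambda_j} h_j(y_j, \lambda_j, \xi)\big\|_2 \Big\} \le \frac{\eta_B}{2 \sum_{i,j} \bar{p}_{ji} / \bar{\delta}_{i \to j}}.$$

Recall that $\sum_{i,j} \bar{p}_{ji} / \bar{\delta}_{i \to j} \ge n$ by the properties $\sum_{i=1}^{n} \bar{p}_{ji} = 1$ and $\bar{\delta}_{i \to j} \in [0, 1]$. Assumption 9 implies

$$\|\bar{A}\|_2 \le \sum_{i} \sup_{y_i, \lambda_i, \xi} \big\|\partial_{y_i} g_i(y_i, \lambda_i, \xi)\big\|_2 + \sum_{i,j} \bar{p}_{ji} \sup_{y_j, \lambda_j, \xi} \big\|\partial_{y_j} h_j(y_j, \lambda_j, \xi)\big\|_2 \le \Big(n + \sum_{i,j} \bar{p}_{ji}\Big) \frac{\eta_A}{2 \sum_{i,j} \bar{p}_{ji} / \bar{\delta}_{i \to j}} \le \eta_A,$$

$$\|\hat{A}\|_2 \le \sum_{i} \sup_{y_i, \lambda_i, \xi} \big\|\partial_{y_i} g_i(y_i, \lambda_i, \xi)\big\|_2 + \sum_{i,j} \frac{\bar{p}_{ji}}{\bar{\delta}_{i \to j}} \sup_{y_j, \lambda_j, \xi} \big\|\partial_{y_j} h_j(y_j, \lambda_j, \xi)\big\|_2 \le \Big(n + \sum_{i,j} \frac{\bar{p}_{ji}}{\bar{\delta}_{i \to j}}\Big) \frac{\eta_A}{2 \sum_{i,j} \bar{p}_{ji} / \bar{\delta}_{i \to j}} \le \eta_A,$$

$$\|\bar{B}\|_2 \le \sum_{i} \sup_{y_i, \lambda_i, \xi} \big\|\partial_{\lambda_i} g_i(y_i, \lambda_i, \xi)\big\|_2 + \sum_{i,j} \bar{p}_{ji} \sup_{y_j, \lambda_j, \xi} \big\|\partial_{\lambda_j} h_j(y_j, \lambda_j, \xi)\big\|_2 \le \Big(n + \sum_{i,j} \bar{p}_{ji}\Big) \frac{\eta_B}{2 \sum_{i,j} \bar{p}_{ji} / \bar{\delta}_{i \to j}} \le \eta_B,$$

$$\|\hat{B}\|_2 \le \sum_{i} \sup_{y_i, \lambda_i, \xi} \big\|\partial_{\lambda_i} g_i(y_i, \lambda_i, \xi)\big\|_2 + \sum_{i,j} \frac{\bar{p}_{ji}}{\bar{\delta}_{i \to j}} \sup_{y_j, \lambda_j, \xi} \big\|\partial_{\lambda_j} h_j(y_j, \lambda_j, \xi)\big\|_2 \le \Big(n + \sum_{i,j} \frac{\bar{p}_{ji}}{\bar{\delta}_{i \to j}}\Big) \frac{\eta_B}{2 \sum_{i,j} \bar{p}_{ji} / \bar{\delta}_{i \to j}} \le \eta_B.$$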

C.1 PRELIMINARY LEMMAS

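In this section, we present a few preliminary lemmas that we use in the proofs of the theorems. We recall that we can express the general HGP using the concatenated vectors and matrices as

$$v^{(m+1)} = \alpha \big( v^{(m)} + \hat{B}^{(2m)} u^{(m)} \big) + (1 - \alpha) \hat{B}^{(2m)} w^{(m)}, \tag{17}$$
$$u^{(m+1)} = \hat{A}^{(2m+1)} u^{(m)}, \tag{18}$$
$$w^{(m+1)} = \beta \big( \hat{A}^{(2m+1)} w^{(m)} + c^y \big) + (1 - \beta) \big( w^{(m)} + u^{(m+1)} \big), \tag{19}$$

with the initial conditions $v^{(0)} \leftarrow 0$, $u^{(0)} \leftarrow c^y$, and $w^{(0)} \leftarrow c^y$. The following lemmas give explicit formulas for $v$ and $w$ and their decompositions.

Lemma 2 (Explicit Formula of w).

$$w^{(M)} = \prod_{m=0}^{M-1} \big( (1 - \beta) I + \beta \hat{A}^{(2m+1)} \big) c^y + \sum_{i=1}^{M} \prod_{m=i}^{M-1} \big( (1 - \beta) I + \beta \hat{A}^{(2m+1)} \big) \Big( \beta I + (1 - \beta) \prod_{m=0}^{i-1} \hat{A}^{(2m+1)} \Big) c^y, \tag{20}$$

where we define the empty product as the identity, $\prod_{m \in \emptyset} (\cdot)_m = I$, so that $\prod_{m=M}^{M-1} (\cdot)_m = I$.

Proof. We prove the claim by induction. We first recall that

$$u^{(M)} = \prod_{m=0}^{M-1} \hat{A}^{(2m+1)} c^y. \tag{21}$$

By setting $m = 0$ in (19), we have

$$w^{(1)} = \beta \big( \hat{A}^{(1)} w^{(0)} + c^y \big) + (1 - \beta) \big( w^{(0)} + u^{(1)} \big) = \beta \big( \hat{A}^{(1)} c^y + c^y \big) + (1 - \beta) \big( c^y + \hat{A}^{(1)} c^y \big) = c^y + \hat{A}^{(1)} c^y.$$

By setting $M = 1$ in (20), we also have

$$w^{(1)} = \big( (1 - \beta) I + \beta \hat{A}^{(1)} \big) c^y + \big( \beta I + (1 - \beta) \hat{A}^{(1)} \big) c^y = c^y + \hat{A}^{(1)} c^y,$$

which confirms that (20) is valid when $M = 1$. Now, suppose that the statement is true for some $M \ge 1$. Then, by (19),

$$\begin{aligned}
w^{(M+1)} &= \beta \big( \hat{A}^{(2M+1)} w^{(M)} + c^y \big) + (1 - \beta) \big( w^{(M)} + u^{(M+1)} \big) \\
&= \beta c^y + (1 - \beta) \prod_{m=0}^{M} \hat{A}^{(2m+1)} c^y + \big( (1 - \beta) I + \beta \hat{A}^{(2M+1)} \big) w^{(M)} \\
&= \beta c^y + (1 - \beta) \prod_{m=0}^{M} \hat{A}^{(2m+1)} c^y + \big( (1 - \beta) I + \beta \hat{A}^{(2M+1)} \big) \prod_{m=0}^{M-1} \big( (1 - \beta) I + \beta \hat{A}^{(2m+1)} \big) c^y \\
&\quad + \big( (1 - \beta) I + \beta \hat{A}^{(2M+1)} \big) \sum_{i=1}^{M} \prod_{m=i}^{M-1} \big( (1 - \beta) I + \beta \hat{A}^{(2m+1)} \big) \Big( \beta I + (1 - \beta) \prod_{m=0}^{i-1} \hat{A}^{(2m+1)} \Big) c^y \\
&= \prod_{m=0}^{M} \big( (1 - \beta) I + \beta \hat{A}^{(2m+1)} \big) c^y + I \times \Big( \beta I + (1 - \beta) \prod_{m=0}^{M} \hat{A}^{(2m+1)} \Big) c^y \\
&\quad + \sum_{i=1}^{M} \prod_{m=i}^{M} \big( (1 - \beta) I + \beta \hat{A}^{(2m+1)} \big) \Big( \beta I + (1 - \beta) \prod_{m=0}^{i-1} \hat{A}^{(2m+1)} \Big) c^y \\
&= \prod_{m=0}^{M} \big( (1 - \beta) I + \beta \hat{A}^{(2m+1)} \big) c^y + \sum_{i=1}^{M+1} \prod_{m=i}^{M} \big( (1 - \beta) I + \beta \hat{A}^{(2m+1)} \big) \Big( \beta I + (1 - \beta) \prod_{m=0}^{i-1} \hat{A}^{(2m+1)} \Big) c^y,
\end{aligned}$$

where the last line follows from the fact that $\prod_{m=M+1}^{M} (\cdot)_m = I$.

Lemma 3 (Decomposition of w).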
w (M ) - M i=0 Āi c y = M -1 i=0 L(i,M) 1 ( Â(2i+1) -Ā)R (i) 1 + L(i,M) 2 ( Â(2i+1) -Ā) Āi c y , where L(i,M) 1 = β M -1 m=i+1 (1 -β)I + β Â(2m+1) , L(i,M) 2 = (1 -β) M j=i+1   M -1 m=j (1 -β)I + β Â(2m+1)   j-1 m=i+1 Â(2m+1) , R (i) 1 = (1 -β)I + β Ā i + i j=1 (1 -β)I + β Ā i-j βI + (1 -β) Āj . Proof. We first recall that, as the corollary of Lemma 2, M i=0 Āi c y = (1 -β)I + β Ā M c y + M i=1 (1 -β)I + β Ā M -i βI + (1 -β) Āi c y . By using Lemma 2, we can expand the difference as w (M ) - M i=0 Āi c y = M -1 m=0 (1 -β)I + β Â(2m+1) -(1 -β)I + β Ā M c y + M i=1 M -1 m=i (1 -β)I + β Â(2m+1) -(1 -β)I + β Ā M -i βI + (1 -β) Āi c y + M i=1 M -1 m=i (1 -β)I + β Â(2m+1) (1 -β) i-1 m=0 Â(2m+1) -Āi c y = M -1 i=0 M -1 m=i+1 (1 -β)I + β Â(2m+1) β Â(2i+1) -Ā (1 -β)I + β Ā i c y + M -1 j=1 M -1 i=j M -1 m=i+1 (1 -β)I + β Â(2m+1) β Â(2i+1) -Ā (1 -β)I + β Ā i-j × βI + (1 -β) Āj c y + M j=1   M -1 m=j (1 -β)I + β Â(2m+1)   (1 -β) j-1 i=0 j-1 m=i+1 Â(2m+1) Â(2i+1) -Ā Āi c y = M -1 i=0 M -1 m=i+1 (1 -β)I + β Â(2m+1) β Â(2i+1) -Ā (1 -β)I + β Ā i c y + M -1 i=1 M -1 m=i+1 (1 -β)I + β Â(2m+1) β Â(2i+1) -Ā × i j=1 (1 -β)I + β Ā i-j βI + (1 -β) Āj c y + M -1 i=0 M j=i+1   M -1 m=j (1 -β)I + β Â(2m+1)   (1 -β) j-1 m=i+1 Â(2m+1) Â(2i+1) -Ā Āi c y = M -1 i=0 β M -1 m=i+1 (1 -β)I + β Â(2m+1) = L(i,M) 1 Â(2i+1) -Ā ×   (1 -β)I + β Ā i + i j=1 (1 -β)I + β Ā i-j βI + (1 -β) Āj   =R (i) 1 c y + M -1 i=0 (1 -β) M j=i+1   M -1 m=j (1 -β)I + β Â(2m+1)   j-1 m=i+1 Â(2m+1) = L(i,M) 2 Â(2i+1) -Ā Āi c y . Lemma 4 (Explicit Formula of v). v (M +1) = M i=0 α M -i+1 B(2i) u (i) + (1 -α) M i=0 α M -i B(2i) w (i) . Proof. We prove the claim by induction. By setting m = 0 in (17), we have v (1) = α v (0) + B(0) u (0) + (1 -α) B(0) w (0) = α B(0) c y + (1 -α) B(0) c y = B(0) c y . By setting M = 0 in (23), we also have v (1) = α B(0) c y + (1 -α)α B(0) w (0) = B(0) c y , which confirms that ( 23) is valid when M = 0. Now, suppose that the statement is true for some M ≥ 1. 
Then, by (17), v (M +1) = α v (M ) + B(2M) u (M ) + (1 -α) B(2M) w (M ) = α M -1 i=0 α M -i B(2i) u (i) + (1 -α) M -1 i=0 α M -i-1 B(2i) w (i) + α B(2M) u (M ) + (1 -α) B(2M) w (M ) = M -1 i=0 α M -i+1 B(2i) u (i) + α B(2M) u (M ) + (1 -α) M -1 i=0 α M -i B(2i) w (i) + B(2M) w (M +1) = M i=0 α M -i+1 B(2i) u (i) + (1 -α) M i=0 α M -i B(2i) w (i) . Lemma 5 (Decomposition of v). v (M +1) - B M i=0 Āi c y = M i=0 ( B(2i) -B)R (i,M ) 3 c y + M -1 i=0 L(i,M) 4 ( Â(2i+1) -Ā) Āi + L(i,M) 5 ( Â(2i+1) -Ā)R (i) 1 c y , where R (i,M ) 3 = α M -i+1 Āi + (1 -α)α M -i i j=0 Āj , L(i,M) 4 = M j=i+1 α M -j+1 B(2j) j-1 m=i+1 Â(2m+1) + (1 -α)(1 -β) M j=i+1 α M -j B(2j) j k=i+1 j-1 m=k (1 -β)I + β Â(2m+1) k-1 m=i+1 Â(2m+1) , L(i,M) 5 = (1 -α)β M j=i+1 α M -j B(2j) j-1 m=i+1 (1 -β)I + β Â(2m+1) . Proof. We first recall that, as the corollary of Lemma 4, B M i=0 Āi c y = M i=0 α M -i+1 B Āi c y + (1 -α) M i=0 α M -i B i j=0 Āj c y . By using Lemma 4 and Lemma 2, we can expand the difference as v (M +1) - B M i=0 Āi c y = M i=0 α M -i+1 B(2i) i-1 m=0 Â(2m+1) -B Āi c y + (1 -α) M i=0 α M -i   B(2i) w (i) - B i j=0 Āj c y   = M i=0 α M -i+1 B(2i) -B Āi c y + M j=1 α M -j+1 B(2j) j-1 i=0 j-1 m=i+1 Â(2m+1) Â(2i+1) -Ā Āi c y + (1 -α) M i=0 α M -i B(2i) -B i j=0 Āj c y + (1 -α) M i=1 α M -i B(2i)   w (i) - i j=0 Āj c y   . By substituting ( 22), we have v (M +1) - B M i=0 Āi c y = M i=0 α M -i+1 B(2i) -B Āi c y + M j=1 α M -j+1 B(2j) j-1 i=0 j-1 m=i+1 Â(2m+1) Â(2i+1) -Ā Āi c y + (1 -α) M i=0 α M -i B(2i) -B i j=0 Āj c y + (1 -α) M j=1 α M -j B(2j) j-1 i=0 L(i,j) 1 ( Â(2i+1) -Ā)R (i) 1 + L(i,j) 2 ( Â(2i+1) -Ā) Āi c y = M i=0 B(2i) -B   α M -i+1 Āi + (1 -α)α M -i i j=0 Āj   =R (i,M ) 3 c y + M -1 i=0   M j=i+1 α M -j+1 B(2j) j-1 m=i+1 Â(2m+1) + (1 -α) M j=i+1 α M -j B(2j) L(i,j) 2   = L(i,M) 4 Â(2i+1) -Ā Āi c y + M -1 i=0 (1 -α) M j=i+1 α M -j B(2j) L(i,j) 1 = L(i,M) 5 ( Â(2i+1) -Ā)R (i) 1 c y . By substituting L (i,j) 2 , L (i,j) 1 , we obtain the claim. 
To bound the estimation error of hyper-gradient, we need to bound each term of (24). The following lemma gives the bounds for each coefficient matrices in (24). Lemma 6. Under Assumption 9, we have R (i,M ) 3 2 ≤ 1 -α 1 -η A α M -i + 1 1 -η A α M -i+1 η i A - 1 1 -η A α M -i η i+1 A , L(i,M) 4 2 ≤ η B αβ α -(1 -β + βη A ) α M -i - η B 1 -η A η M -i A - η B 1 -η A 1 -α α -(1 -β + βη A ) (1 -β + βη A ) M -i+1 , L(i,M) 5 2 ≤ η B (1 -α)β α -(1 -β + βη A ) α M -i -(1 -β + βη A ) M -i , R (i) 1 2 ≤ 1 -η i+1 A 1 -η A . ( ) Proof. Recall that Assumption 9 ensures Ā 2 ≤ η A , Â 2 ≤ η A , B 2 ≤ η B , B 2 ≤ η B . Then, we have R (i,M ) 3 2 ≤ α M -i+1 Ā i 2 + (1 -α)α M -i i j=0 Ā j 2 ≤ α M -i+1 η i A + (1 -α)α M -i i j=0 η j A = α M -i+1 η i A + (1 -α)α M -i 1 -η i+1 A 1 -η A = α M -i+1 η i A + 1 -α 1 -η A α M -i - 1 1 -η A α M -i η i+1 A + η A 1 -η A α M -i+1 η i A = 1 -α 1 -η A α M -i + 1 1 -η A α M -i+1 η i A - 1 1 -η A α M -i η i+1 A , L(i,M) 4 2 ≤ M j=i+1 α M -j+1 B(2j) 2 j-1 m=i+1 Â(2m+1) 2 + (1 -α)(1 -β) M j=i+1 α M -j B(2j) 2 j k=i+1 j-1 m=k (1 -β)I + β Â(2m+1) 2 k-1 m=i+1 Â(2m+1) 2 ≤ η B M j=i+1 α M -j+1 η j-i-1 A + η B (1 -α)(1 -β) M j=i+1 α M -j j k=i+1 (1 -β + βη A ) j-k η k-i-1 A = η B α α -η A α M -i -η M -i A + η B 1 -α 1 -η A M j=i+1 α M -j (1 -β + βη A ) j-i -η j-i A = η B α α -η A α M -i -η M -i A + η B 1 -α 1 -η A (1 -β + βη A ) α M -i -(1 -β + βη A ) M -i α -(1 -β + βη A ) - η A α M -i -η M -i A α -η A = η B 1 1 -η A α M -i -η M -i A + η B 1 -α 1 -η A 1 -β + βη A α -(1 -β + βη A ) α M -i -(1 -β + βη A ) M -i = η B αβ α -(1 -β + βη A ) α M -i - η B 1 -η A η M -i A - η B 1 -η A 1 -α α -(1 -β + βη A ) (1 -β + βη A ) M -i+1 , L(i,M) 5 2 ≤ (1 -α)β M j=i+1 α M -j B(2j) 2 j-1 m=i+1 (1 -β)I + β Â(2m+1) 2 ≤ η B (1 -α)β M j=i+1 α M -j (1 -β + βη A ) j-i-1 = η B (1 -α)β α M -i -(1 -β + βη A ) M -i α -(1 -β + βη A ) , R (i) 1 2 ≤ (1 -β)I + β Ā i 2 + i j=1 (1 -β)I + β Ā i-j 2 β + (1 -β) Ā j 2 ≤ (1 -β + βη A ) i + β i j=1 (1 -β + βη A ) i-j + (1 -β) i j=1 (1 -β + βη A ) 
i-j η j A = (1 -β + βη A ) i + 1 -(1 -β + βη A ) i 1 -η A + η A (1 -β + βη A ) i -η i A 1 -η A = 1 -η i+1 A 1 -η A .

C.2 DECOMPOSITION OF Â, B

We can decompose the difference Â -Ā and B -B as Â -Ā = 1 ji ( Âψ i -Āψ i ) + pji δ i j δi j Âφ j -Āφ j ji = δ i j δi j -1 pji Âφ j ji + 1 ji ( Âψ i -Āψ i ) + pji Âφ j -Āφ j ji = n i,j=1 e j e ⊤ i ⊗ δ i j δi j -1 pji Âφ j + n i,j=1 e j e ⊤ i ⊗   1 ji |ζ i | ξ∈ζi ∂ yi g i (y * i , λ i , ξ) -Āψ i + pji |ζ j | ξ∈ζj ∂ yj h j (y * j , λ j , ξ) -Āφ j   , B -B = 1 ji ( Bψ i -Bψ i ) + pji δ i j δi j Bφ j -Bφ j ji = δ i j δi j -1 pji Bφ j ji + 1 ji ( Bψ i -Bψ i ) + pji Bφ j -Bφ j ji = n i,j=1 e j e ⊤ i ⊗ δ i j δi j -1 pji Bφ j + n i,j=1 e j e ⊤ i ⊗   1 ji |ζ i | ξ∈ζi ∂ λi g i (y * i , λ i , ξ) -Bψ i + pji |ζ j | ξ∈ζj ∂ λj h j (y * j , λ j , ξ) -Bφ j   , where e i , e j are i-th and j-th canonical basis vectors. By using these expressions, we can rewrite Lemma 5 as v (M +1) - B M i=0 Āi c y = M i=0 n s,t=1 X (i) B,st c y + M i=0 n t=1 ξ∈ζ (2i) s Y (i,ξ) B,t c y + M -1 i=0 n s,t=1 X (i) A,st c y + M -1 i=0 n t=1 ξ∈ζ (2i+1) s Y (i,ξ) A,t c y , where X (i) B,st = e t e ⊤ s ⊗ δ (2i) s t δs t -1 pts Bφ(2i) t R (i,M ) 3 , Y (i) B,t = n s=1 e t e ⊤ s ⊗ 1 |ζ (2i) t | 1 ts ∂ λt g t (y * t , λ t , ξ) + pts ∂ λt h t (y * t , λ t , ξ) -1 ts Bψ t -pts Bφ t R (i,M ) 3 , X (i) A,st = L(i,M) 4 e t e ⊤ s ⊗ δ (2i+1) s t δs t -1 pts Âφ(2i+1) t Āi + L(i,M) 5 e t e ⊤ s ⊗ δ (2i+1) s t δs t -1 pts Âφ(2i+1) t R (i) 1 , Y (i) B,t = L(i,M) 4 n s=1 e t e ⊤ s ⊗ 1 |ζ (2i+1) t | 1 ts ∂ yt g t (y * t , λ t , ξ) + pts ∂ yt h t (y * t , λ t , ξ) -1 ts Āψ t -pts Āφ t Āi + L(i,M) 5 n s=1 e t e ⊤ s ⊗ 1 |ζ (2i+1) t | 1 ts ∂ yt g t (y * t , λ t , ξ) + pts ∂ yt h t (y * t , λ t , ξ) -1 ts Āψ t -pts Āφ t R (i) 1 . Here, we note that L(i,M) 4 and L(i,M) 5 depend only on Â(2i+3) , . . . , Â(2M-1) and B(2i+2) , . . . , B(2M) . We therefore have We now derive the error bound of VR-HGP for the case when α, β ∈ (0, 1). The error bound follows from the next bounds on E δ (2i+1) s t ,ζ (2i+1) t X (i) A,st | Â(2i+3) , . . . , Â(2M-1) , B(2i+2) , . . . 
, B(2M) = 0, E δ (2i+1) s t ,ζ (2i+1) t Y (i,ξ) A,t | Â( X (i) B,st , Y (i) B,t , X (i) A,st , and Y (i) A,t . Lemma 7. Under Assumption 9, when α, β ∈ (0, 1) so that 1 -β + βη A ∈ (η A , 1), we have Proof. X (i) B,st 2 2 ≤ η 2 B κ 2 p2 ts δ2 s t 1 -α 1 -η A 2 α 2(M -i) + exp(-O(M )), X (i) A,st 2 2 ≤ η 2 B κ 2 p2 ts δ2 s t η A 1 -η A (1 -α)β α -(1 -β + βη A ) 2 α M -i -(1 -β + βη A ) M -i 2 + exp(-O(M )), Y (i,ξ) B,t 2 2 ≤ 4η 2 B κ 2 |ζ (2i) t | 2 1 -α 1 -η A 2 α 2(M -i) + exp(-O(M )), Y (i,ξ) A,t 2 2 ≤ 4η 2 B κ 2 |ζ (2i+1) t | 2 η A 1 -η A (1 -α)β α -(1 -β + βη A ) 2 α M -i -(1 -β + βη A ) M -i 2 + exp(-O(M )), X (i) B,st 2 2 ≤ 1 + 1 δs t -1 2 ≤ 1 δ2 s t p2 ts Bφ(2i) t 2 2 R (i,M ) 3 2 2 ≤ p2 ts δ2 s t η B 2κ 2 1 -α 1 -η A α M -i + exp(-O(M )) 2 = η 2 B κ 2 p2 ts δ2 s t 1 -α 1 -η A 2 α 2(M -i) + exp(-O(M )), X (i) A,st 2 2 ≤ 1 + 1 δs t -1 2 p2 ts Âφ(2i+1) t 2 2 L(i,M) 4 2 η i A + L(i,M) 5 2 R (i) 1 2 2 ≤ p2 ts δ2 s t η A 2κ 2 η B 1 -η A (1 -α)β α -(1 -β + βη A ) α M -i -(1 -β + βη A ) M -i + exp(-O(M )) 2 = η 2 B κ 2 p2 ts δ2 s t η A 1 -η A (1 -α)β α -(1 -β + βη A ) 2 α M -i -(1 -β + βη A ) M -i 2 + exp(-O(M )), Y (i,ξ) B,s 2 2 ≤ 1 |ζ (2i) t | 2 n s=1 1 ts ∂ λt g t (y * t , λ t , ξ) + pts ∂ λt h t (y * t , λ t , ξ) -1 ts Bψ t -pts Bφ t 2 2 R (i,M ) 3 2 2 ≤ 1 |ζ (2t) s | 2 η B 2κ 2      2 + 2 n s=1 pts =1      2 1 -α 1 -η A α M -i + exp(-O(M )) 2 = 4η 2 B κ 2 |ζ (2i) t | 2 1 -α 1 -η A 2 α 2(M -i) + exp(-O(M )), Y (i,ξ) A,s 2 2 ≤ 1 |ζ (2i+1) s | 2 n s=1 1 ts ∂ yt g t (y * t , λ t , ξ) + pts ∂ yt h t (y * t , λ t , ξ) -1 ts Āψ t -pts Āφ t 2 2 × L(i,M) 4 2 η i A + L(i,M) 5 2 R (i) 1 2 2 ≤ 1 |ζ (2i+1) t | 2 η A 2κ 2      2 + 2 n s=1 pts =1      2 × η B 1 -η A (1 -α)β α -(1 -β + βη A ) α M -i -(1 -β + βη A ) M -i + exp(-O(M )) 2 = 4η 2 B κ 2 |ζ (2i+1) t | 2 η A 1 -η A (1 -α)β α -(1 -β + βη A ) 2 α M -i -(1 -β + βη A ) M -i 2 + exp(-O(M )). Theorem 8. Suppose Assumptions 1-9 hold true, and |ζ (2i) t | = |ζ (2i+1) t | = b for any t and i. 
Then, with probability at least 1 -ϵ, we have v (M +1) + c λ -d λj F (y * , λ) ≤ µ α,β τ n s,t=1 p2 ts δ2 s t + 4n b log n(d y + d λ ) ϵ + exp(-O(M )), where µ α,β = 8 1 -α 1 + α 1 + 1 + α(1 -β + βη A ) 1 -α(1 -β + βη A ) β 2 η 2 A 1 -(1 -β + βη A ) 2 , τ = η B ∥c y ∥ κ(1 -η A ) . Proof. We first have v (M +1) + c λ -d λj F (y * , λ) ≤ v (M +1) - B M i=0 Āi c y + B ∞ i=M +1 Āi c y . Here, we can bound the second term by B ∞ i=M +1 Āi c y ≤ η B ∥c y ∥ ∞ i=M +1 η i A ≤ η B ∥c y ∥ 1 -η A η M +1 A = exp (-O(M )) . The conditions ( 29) ensure that we can bound the first term by using Matrix Azuma's inequality; with probability at least 1 -ϵ, we have v (M +1) - B M i=0 Āi c y ≤ 8σ 2 n(d y + d λ ) ϵ , where σ 2 ∥c y ∥ 2 ≤ M i=0 n s,t=1 X (i) B,st 2 2 + M -1 i=0 n s,t=1 X (i) A,st 2 2 + M i=0 n t=1 ξ∈ζ (2i) t Y (i,ξ) B,t 2 2 + M -1 i=0 n t=1 ξ∈ζ (2i+1) t Y (i,ξ) A,t 2 2 ≤ M i=0 n s,t=1 η 2 B κ 2 p2 ts δ2 s t 1 -α 1 -η A 2 α 2(M -i) + M -1 i=0 n s,t=1 η 2 B κ 2 p2 ts δ2 s t η A 1 -η A (1 -α)β α -(1 -β + βη A ) 2 α M -i -(1 -β + βη A ) M -i 2 + M i=0 n t=1 ξ∈ζ (2i) t 4η 2 B κ 2 |ζ (2i) t | 2 1 -α 1 -η A 2 α 2(M -i) + M -1 i=0 n t=1 ξ∈ζ (2i+1) t 4η 2 B κ 2 |ζ (2i+1) t | 2 η A 1 -η A (1 -α)β α -(1 -β + βη A ) 2 α M -i -(1 -β + βη A ) M -i 2 + exp(-O(M )) = η 2 B κ 2 n s,t=1 p2 ts δ2 s t 1 -α 1 -η A 2 1 -α 2(M +1) 1 -α 2 + η 2 B κ 2 n s,t=1 p2 ts δ2 s t η A 1 -η A (1 -α)β α -(1 -β + βη A ) 2 × α 2 -α 2(M +1) 1 -α 2 + (1 -β + βη A ) 2 -(1 -β + βη A ) 2(M +1) 1 -(1 -β + βη A ) 2 -2 α(1 -β + βη A ) -α M +1 (1 -β + βη A ) M +1 1 -α(1 -β + βη A ) + η 2 B κ 2 n t=1 4 |ζ (2i) t | 1 -α 1 -η A 2 1 -α 2(M +1) 1 -α 2 + η 2 B κ 2 K s=1 4 |ζ (2i+1) t | η A 1 -η A (1 -α)β α -(1 -β + βη A ) 2 × α 2 -α 2(M +1) 1 -α 2 + (1 -β + βη A ) 2 -(1 -β + βη A ) 2(M +1) 1 -(1 -β + βη A ) 2 -2 α(1 -β + βη A ) -α M +1 (1 -β + βη A ) M +1 1 -α(1 -β + βη A ) + exp(-O(M )) ≤ η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i) t | 1 -α 1 -η A 2 1 1 -α 2 + η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i+1) t | η A 
1 -η A (1 -α)β α -(1 -β + βη A ) 2 × α 2 1 -α 2 + (1 -β + βη A ) 2 1 -(1 -β + βη A ) 2 -2 α(1 -β + βη A ) 1 -α(1 -β + βη A ) + exp(-O(M )) = η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i) t | 1 -α 1 -η A 2 1 1 -α 2 + η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i+1) t | η A 1 -η A (1 -α)β α -(1 -β + βη A ) 2 × (1 + α(1 -β + βη A ))(α -(1 -β + βη A )) 2 (1 -α 2 )(1 -(1 -β + βη A ) 2 )(1 -α(1 -β + βη A )) + exp(-O(M )) = η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i) t | 1 1 -η A 2 1 -α 1 + α + η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i+1) t | 1 1 -η A 2 1 -α 1 + α × 1 + α(1 -β + βη A ) 1 -α(1 -β + βη A ) β 2 η 2 A 1 -(1 -β + βη A ) 2 + exp(-O(M )). When |ζ (2i) t | = |ζ (2i+1) t | = b for any t and i, we further have σ 2 ≤ η 2 B ∥c y ∥ 2 κ 2 (1 -η A ) 2 n s,t=1 p2 ts δ2 s t + 4n b × 1 -α 1 + α 1 + 1 + α(1 -β + βη A ) 1 -α(1 -β + βη A ) β 2 η 2 A 1 -(1 -β + βη A ) 2 + exp(-O(M )). C.4 BOUND FOR α = 1 AND β = 0 Setting α = 1 and β = 0 recovers naive HGP. Here, we derive the error bound for naive HGP. Lemma 9. Under Assumption 9, when α = 1 and β = 0 so that 1 -β + βη A = 1, we have X (i) B,st 2 2 ≤ η 2 B κ 2 p2 ts δ2 s t η 2i A , X (i) A,st 2 2 ≤ η 2 B κ 2 p2 ts δ2 s t η A 1 -η A 2 η 2i A + exp (-O(M )) , Y (i,ξ) B,t 2 2 ≤ 4η 2 B κ 2 |ζ (2i) t | 2 η 2i A , Y (i,ξ) A,t 2 2 ≤ 4η 2 B κ 2 |ζ (2i+1) t | 2 η A 1 -η A 2 η 2i A + exp (-O(M )) . ( ) Proof. 
X (i) B,st 2 2 ≤ 1 + 1 δs t -1 2 p2 ts Bφ(2i) t 2 2 R (i,M ) 3 2 2 ≤ p2 ts δ2 s t η B 2κ 2 η i A 2 = η 2 B κ 2 p2 ts δ2 s t η 2i A , X (i) A,st 2 2 ≤ 1 + 1 δs t -2 2 p2 ts Âφ(2i+1) t 2 2 L(i,M) 4 2 η i A + L(i,M) 5 2 R (i) 1 2 2 ≤ p2 ts δ2 s t η A 2κ 2 η B 1 -η A (η i A -η M A ) 2 = η 2 B κ 2 p2 ts δ2 s t η A 1 -η A 2 η 2i A + exp (-O(M )) , Y (i,ξ) B,s 2 2 ≤ 1 |ζ (2i) t | 2 n s=1 1 ts ∂ λt g t (y * t , λ t , ξ) + pts ∂ λt h t (y * t , λ t , ξ) -1 ts Bψ t -pts Bφ t 2 2 R (i,M ) 3 2 2 ≤ 1 |ζ (2t) s | 2 η B 2κ 2 2 + 2 n s=1 pts 2 η i A 2 = 4η 2 B κ 2 |ζ (2i) t | 2 η 2i A , Y (i,ξ) A,s 2 2 ≤ 1 |ζ (2i+1) s | 2 n s=1 1 ts ∂ yt g t (y * t , λ t , ξ) + pts ∂ yt h t (y * t , λ t , ξ) -1 ts Āψ t -pts Āφ t 2 2 × L(i,M) 4 2 η i A + L(i,M) 5 2 R (i) 1 2 2 ≤ 1 |ζ (2i+1) t | 2 η A 2κ 2 2 + 2 n s=1 pts 2 η B 1 -η A (η i A -η M A ) 2 = 4η 2 B κ 2 |ζ (2i+1) t | 2 η A 1 -η A 2 η 2i A + exp (-O(M )) . Theorem 10. Suppose Assumptions 1-9 hold true, and |ζ (2i) t | = |ζ (2i+1) t | = b for any t and i. When α = 1, β = 0, with probability at least 1 -ϵ, we have v (M +1) -d λj F (y * , λ) ≤ µ 1,0 τ n s,t=1 p2 ts δ2 s t + 4n |ζ| log n(d y + d λ ) ϵ + exp(-O(M )), where µ 1,0 = 8 η 2 A + (1 -η A ) 2 1 -η 2 A , τ = η B ∥c y ∥ κ(1 -η A ) . Proof. We first have v (M +1) -d λj F (y * , λ) ≤ v (M +1) - B M i=0 Āi c y + B ∞ i=M +1 Āi c y . Here, we can bound the second term by B ∞ i=M +1 Āi c y ≤ η B ∥c y ∥ ∞ i=M +1 η i A ≤ η B ∥c y ∥ 1 -η A η M +1 A = exp (-O(M )) . 
We can bound the first term by using Matrix Azuma's inequality; with probability at least 1 -ϵ, we have v (M +1) - B M i=0 Āi c y ≤ 8σ 2 n(d y + d λ ) ϵ , where σ 2 ∥c y ∥ 2 ≤ M i=0 n s,t=1 X (i) B,st 2 2 + M -1 i=0 n s,t=1 X (i) A,st 2 2 + M i=0 n t=1 ξ∈ζ (2i) t Y (i,ξ) B,t 2 2 + M -1 i=0 n t=1 ξ∈ζ (2i+1) t Y (i,ξ) A,t 2 2 ≤ M i=0 n s,t=1 η 2 B κ 2 p2 ts δ2 s t η 2i A + M -1 i=0 n s,t=1 η 2 B κ 2 p2 ts δ2 s t η A 1 -η A 2 η 2i A + M i=0 n t=1 ξ∈ζ (2i) t 4η 2 B κ 2 |ζ (2i) t | 2 η 2i A + M -1 i=0 n t=1 ξ∈ζ (2i+1) t 4η 2 B κ 2 |ζ (2i+1) t | 2 η A 1 -η A 2 η 2i A + exp(-O(M )) = η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i) t | 1 -η 2(M +1) A 1 -η 2 A + η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i+1) t | η A 1 -η A 2 1 -η 2M A 1 -η 2 A + exp(-O(M )) = η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i) t | 1 1 -η 2 A + η 2 B κ 2 n s,t=1 p2 ts δ2 s t + n t=1 4 |ζ (2i+1) t | η A 1 -η A 2 1 1 -η 2 A + exp(-O(M )). When |ζ (2i) t | = |ζ (2i+1) t | = b for any t and i, we further have σ 2 ≤ η 2 B ∥c y ∥ 2 κ 2 (1 -η 2 A ) 1 + η 2 A (1 -η A ) 2 n s,t=1 p2 ts δ2 s t + 4n b + exp(-O(M )). C.5 COMPARISON OF µ α,β AND µ 0,1 The estimation errors of VR-HGP and naive HGP are dominated by their scaling factors. µ α,β = 8 1 -α 1 + α 1 + 1 + α(1 -β + βη A ) 1 -α(1 -β + βη A ) β 2 η 2 A 1 -(1 -β + βη A ) 2 , µ 1,0 = 8 η 2 A + (1 -η A ) 2 1 -η 2 A . Figure 1 shows that µ α,β is a few times smaller than µ 1,0 for any η A ∈ (0, 1) if we choose α close to one and β close to zero. This result indicates that the error of VR-HGP can be a few times smaller than the one of naive HGP for sufficiently large M where the diminishing term exp (-O(M )) is negligibly small. Figure 1 : Comparisons of µ α,β and µ 1,0 for η A ∈ (0, 1).
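The comparison of the two scaling factors can be reproduced numerically. The sketch below transcribes the expressions for $\mu_{\alpha,\beta}$ and $\mu_{1,0}$ as they appear in the text above; the function names and the grid of $\eta_A$ values are assumptions for illustration. If the original formulas carry a square root over these expressions (the extracted text is ambiguous on this point), the ordering of the two factors is unchanged because the square root is monotone, although the ratio itself would differ.

```python
def mu_vr(alpha, beta, eta_a):
    """Scaling factor of VR-HGP for alpha, beta in (0, 1), as transcribed from the text."""
    r = 1.0 - beta + beta * eta_a
    ratio = (1.0 + alpha * r) / (1.0 - alpha * r)
    inner = 1.0 + ratio * (beta ** 2) * (eta_a ** 2) / (1.0 - r ** 2)
    return 8.0 * (1.0 - alpha) / (1.0 + alpha) * inner

def mu_naive(eta_a):
    """Scaling factor of naive HGP, i.e. (alpha, beta) = (1, 0)."""
    return 8.0 * (eta_a ** 2 + (1.0 - eta_a) ** 2) / (1.0 - eta_a ** 2)

etas = [0.05 * k for k in range(1, 20)]          # eta_A in {0.05, ..., 0.95}
ratios = [mu_naive(e) / mu_vr(0.9, 0.1, e) for e in etas]
for e, ratio in zip(etas, ratios):
    print(f"eta_A={e:.2f}  mu_1,0 / mu_0.9,0.1 = {ratio:.2f}")
```

A ratio above one at every grid point indicates that the variance-reduced factor is the smaller of the two for that $\eta_A$.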

C.6 COMPARISON OF α AND β

We empirically evaluated the advantages of VR-HGP under stochastic communication and found that $(\alpha, \beta) = (0.9, 0.1)$ performed well in practice. We compared the $\ell_2$ distance between the hyper-gradient estimate $v^{(m)}$ at the $m$-th round of HGP and the true hyper-gradient $d_\lambda F(x^*, \lambda)$, which was computed using the explicit $(I - \bar{A})^{-1}$. We made a synthetic one-dimensional dataset with two classes by randomly selecting two digits from MNIST and averaging the inputs of each sample. We let $n = 3$ clients perform 500 iterations of Eq. (4b), ensuring the convergence of SGP. For all $i \in [n]$, we used the binary cross-entropy loss for $f_i$ and $F_i$, computed on local training and validation datasets with 100 samples, respectively. We adopted the StoU communication network presented in Section 6. In order to isolate the effect of the edge stochasticity $\delta^{(m)}_{j \to i} / \bar{\delta}_{j \to i}$, which we identified as the source of the high variance in Section 4.2, we excluded the randomness of the minibatches $\zeta$ by adopting $|\zeta^{(t)}_i| = 100$ for all time steps in SGP and HGP and by using the true $\bar{p}_{ij}$ and $\bar{\delta}_{j \to i}$ for all $i, j \in [n]$. We computed $d_\lambda F(x^*, \lambda)$ from the explicit computation of $\bar{B}(I - \bar{A})^{-1} c^y + c^\lambda$ using the expected values of $\bar{p}_{ij}$ and $\bar{\delta}_{j \to i}$ for all $i, j \in [n]$. HGP was run to obtain $v^{(m)}$ after the iterations of SGP, using $M = 500$ and alternating sampling, i.e., $\tilde{A}^{(2m+1)}$ and $\tilde{B}^{(2m)}$ for $m = 0, \dots, M-1$. Fig. 2 shows that VR-HGP with $(\alpha, \beta) = (0.9, 0.1)$ provided the smallest estimation error and that a larger number of estimation rounds tends to yield a smaller error. However, HGP, which is a special case of VR-HGP with parameters $(\alpha, \beta) = (1.0, 0.0)$, failed to attain an error as small as that of the well-tuned VR-HGP with $(\alpha, \beta) = (0.9, 0.1)$. This larger estimation error was also observed in experiments with different random seeds. We also observed that HGP could not reduce the estimation error after around $m = 5$, indicating that a larger number of rounds does not always yield a better estimate in HGP on stochastic communication networks.

C.7 RELAXATION OF CONVERGENCE TO THE STATIONARY POINT

While VR-HGP relies on the assumption that the unique stationary point y * is available, a client may only have y (T ) i ̸ = y * in a practical case where the inner-problem is solved by a finite T of SGP iterations. We show that this assumption can be relaxed by adopting an extra smoothness assumption below. Assumption 10. There exist finite positive constants L y , L λ such that for any i ∈ [n] and for any y i , y ′ i , sup λi,ξ ∥∂ yi g i (y i , λ i , ξ) -∂ yi g i (y ′ i , λ i , ξ)∥ 2 ≤ L y ∥y i -y ′ i ∥, sup λi,ξ ∥∂ yi h i (y i , λ i , ξ) -∂ yi h i (y ′ i , λ i , ξ)∥ 2 ≤ L y ∥y i -y ′ i ∥, sup λi,ξ ∥∂ λi g i (y i , λ i , ξ) -∂ λi g i (y ′ i , λ i , ξ)∥ 2 ≤ η B L λ ∥y i -y ′ i ∥, sup λi,ξ ∥∂ λi h i (y i , λ i , ξ) -∂ λi h i (y ′ i , λ i , ξ)∥ 2 ≤ η B L λ ∥y i -y ′ i ∥. Below, we show that the error between y (T ) i and y * induces a bias to Theorem 1. Theorem 11. Let ṽ(M+1) be the estimate of v (M +1) obtained by VR-HGP using y (T ) instead of y * . Suppose Assumptions 1-10 hold true, |ζ (2i) t | = |ζ (2i+1) t | = b for any t and i, and α, β ∈ (0, 1). Then, with probability at least 1 -ϵ, we have ṽ(M+1) + c λ -d λj F (y * , λ) ≤ µ α,β τ n s,t=1 p2 ts δ2 s t + 4n b log n(d y + d λ ) ϵ + η B ∥c y ∥ 1 -η A (L y + L λ )G + exp(-O(M )), where G = n n i=1 y * i -y (T ) i . Proof. To compute ṽ(M+1) using VR-HGP, Ā and B are estimated by Ã = 1 ji Ãψ i + pji Ãφ j ji ∈ R ndy×ndy , Ãψ i = 1 |ζ i | ξ∈ζi ∂ yi g i (y (T ) i ; λ i , ξ) ∈ R dy×dy , Ãφ j = 1 |ζ j | ξ∈ζj ∂ yj h j (y (T ) j ; λ j , ξ) ∈ R dy×dy , B = 1 ji Bψ i + pji Bφ j ji ∈ R nd λ ×ndy , Bψ i = 1 |ζ i | ξ∈ζi ∂ λi h i (y (T ) i ; λ i , ξ) ∈ R d λ ×dy , Bφ j = 1 |ζ j | ξ∈ζj ∂ λj φ ji (y (T ) j ; λ j , ξ) ∈ R d λ ×dy . We can decompose the difference Ã -Ā and B -B as Ã -Ā = Ã -Â + Â -Ā , B -B = B -B + B -B . Here, by Assumption 10, we have Ã -Â 2 ≤ L y n n i=1 y * i -y (T ) i = L y G, B -B 2 ≤ η B L λ n n i=1 y * i -y (T ) i = η B L λ G. 
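By using these expressions, we can derive an expression similar to Lemma 5 as

$$
\begin{aligned}
\tilde v^{(M+1)} - \bar B \sum_{i=0}^{M} \bar A^{i} c_y
&= \sum_{i=0}^{M} \sum_{s,t=1}^{n} X_{B,st}^{(i)} c_y
 + \sum_{i=0}^{M} \sum_{t=1}^{n} \sum_{\xi \in \zeta_s^{(2i)}} Y_{B,t}^{(i,\xi)} c_y \\
&\quad + \sum_{i=0}^{M-1} \sum_{s,t=1}^{n} \tilde X_{A,st}^{(i)} c_y
 + \sum_{i=0}^{M-1} \sum_{t=1}^{n} \sum_{\xi \in \zeta_s^{(2i+1)}} \tilde Y_{A,t}^{(i,\xi)} c_y \\
&\quad + \underbrace{ \sum_{i=0}^{M} \left( \tilde B^{(2i)} - \hat B \right) R_3^{(i,M)} c_y
 + \sum_{i=0}^{M-1} \left[ \tilde L_4^{(i,M)} \left( \tilde A^{(2i+1)} - \hat A \right) \bar A^{i}
 + \tilde L_5^{(i,M)} \left( \tilde A^{(2i+1)} - \hat A \right) R_1^{(i)} \right] c_y }_{\text{(Bias)}},
\end{aligned}
$$

where the last line corresponds to the bias induced by the use of $y^{(T)}$ instead of $y^*$, and

$$
\begin{aligned}
\tilde L_4^{(i,M)} &= \sum_{j=i+1}^{M} \alpha^{M-j+1} \tilde B^{(2j)} \prod_{m=i+1}^{j-1} \tilde A^{(2m+1)} \\
&\quad + (1-\alpha)(1-\beta) \sum_{j=i+1}^{M} \alpha^{M-j} \tilde B^{(2j)} \sum_{k=i+1}^{j} \prod_{m=k}^{j-1} \left[ (1-\beta) I + \beta \tilde A^{(2m+1)} \right] \prod_{m=i+1}^{k-1} \tilde A^{(2m+1)}, \\
\tilde L_5^{(i,M)} &= (1-\alpha)\beta \sum_{j=i+1}^{M} \alpha^{M-j} \tilde B^{(2j)} \prod_{m=i+1}^{j-1} \left[ (1-\beta) I + \beta \tilde A^{(2m+1)} \right],
\end{aligned}
$$

X(i) A,st = L(i,

Then, we can bound the bias as

$$
\begin{aligned}
\frac{\| \text{(Bias)} \|}{\| c_y \|}
&\le \sum_{i=0}^{M} \left\| \tilde B^{(2i)} - \hat B \right\|_2 \left\| R_3^{(i,M)} \right\|_2
 + \sum_{i=0}^{M-1} \left[ \left\| \tilde L_4^{(i,M)} \right\|_2 \left\| \tilde A^{(2i+1)} - \hat A \right\|_2 \left\| \bar A^{i} \right\|_2
 + \left\| \tilde L_5^{(i,M)} \right\|_2 \left\| \tilde A^{(2i+1)} - \hat A \right\|_2 \left\| R_1^{(i)} \right\|_2 \right] \\
&\le \eta_B L_\lambda G\, \frac{1-\alpha}{1-\eta_A} \sum_{i=0}^{M} \alpha^{M-i}
 + L_y G\, \frac{\eta_B}{1-\eta_A}\, \frac{(1-\alpha)\beta}{\alpha - (1-\beta+\beta\eta_A)} \sum_{i=0}^{M-1} \left[ \alpha^{M-i} - (1-\beta+\beta\eta_A)^{M-i} \right] + \exp(-O(M)) \\
&= \frac{\eta_B}{1-\eta_A} (L_y + L_\lambda)\, G + \exp(-O(M)).
\end{aligned}
$$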

D DETAILED EXPERIMENTAL SETTINGS

The experiments in Section 6 followed the settings of EMNIST (Cohen et al., 2017) classification in Marfoq et al. (2021) , unless otherwise mentioned.

Communication networks

We simulated four communication networks on which the clients perform distributed learning: fully-connected (FC), static undirected (FixU), stochastic undirected (StoU), and stochastic directed (StoD). FC allows every client to communicate with all the other clients at every time step, i.e., $\delta^{(t)}_{ij} = 1$ for all $i, j \in [n]$ and $t \in \mathbb{N}$. FixU uses a time-invariant, sparse undirected communication network simulated by a binomial Erdős-Rényi graph (Erdős & Rényi, 1959) with parameter p = 0.4, to which self-loop edges are added. Following the setting in Marfoq et al. (2021), we generated a doubly stochastic mixing matrix using the Fast Mixing Markov Chain (Boyd et al., 2003) rule. StoU uses a stochastic and undirected network in which each undirected edge $\delta^{(t)}_{ji} = \delta^{(t)}_{ij}$ is independently realized at each step with probability $\bar\delta_{ji} \in [0, 1]$. In StoD, each direction of an edge $\delta^{(t)}_{ji}$ is independently sampled with probability $\bar\delta_{ji}$, forming a stochastic and directed network. The expected mixing matrix given by the StoD network is asymmetric, representing the communication bias between the clients; some clients may communicate less frequently than others due to bottlenecks in the physical network environment or long computation times of local updates caused by poor computational resources. We sampled $\bar\delta_{ji}$ from the uniform distribution over [0.4, 0.8] in both StoU and StoD.

Proposed approaches We solved the personalization of classification models using three different formulations: PDBO-MTL, PDBO-DA, and PDBO-MTL&DA. For PDBO-DA, we optimize the pseudo sampling rate to recover data augmentation-based personalization (Duan et al., 2019; Zhao et al., 2018). PDBO-DA optimizes $\lambda_i^{C} \in \mathbb{R}^{C}$ to learn the label-wise weight vector $C\,\mathrm{Softmax}(\lambda_i^{C}) \in [0, C]^{C}$. In the inner-problem, the losses of instances labeled as $c \in [C]$ are multiplied by the c-th element of the weight vector. PDBO-MTL is obtained by applying PDBO to FedEM (Marfoq et al., 2021).
PDBO-MTL lets each client train an ensemble classifier that outputs the weighted average of the predictions of K = 3 CNNs. We trained the CNN parameters in the inner-problem and optimized the hyperparameters $\lambda_i^{K} \in \mathbb{R}^{K}$ to obtain the ensemble weight vector $\mathrm{Softmax}(\lambda_i^{K}) \in [0, 1]^{K}$. PDBO-MTL&DA combines PDBO-DA and PDBO-MTL, optimizing $[\lambda_i^{K\top}\ \lambda_i^{C\top}]^{\top} \in \mathbb{R}^{C+K}$ to obtain both the label weights and the model weights. For all $i \in [n]$ in the outer-problem, we ran 20 outer-steps of Adam (Kingma & Ba, 2015) iterations with $(\beta_1, \beta_2) = (0.9, 0.999)$ from the initial hyperparameters $0_C$, $0_K$, and $0_{C+K}$ for PDBO-DA, PDBO-MTL, and PDBO-MTL&DA, respectively. For the Adam optimizer, we adopted the different learning rates shown in Table 2 (Hyper-learning rate). We adopted HGP for the FC and FixU settings, and VR-HGP with (α, β) = (0.9, 0.1) for the StoU and StoD settings. Both HGP and VR-HGP ran M = 200 estimation steps using the iteration in Eq. (4b) in all the settings. We also made a practical modification in HGP to sample $\tilde A^{(m)}$ and $\tilde B^{(m)}$ together in the single m-th round, which yields the same length of the Neumann series at half the sampling cost of the original HGP, although the estimates are no longer unbiased. For all the approaches and for all $i \in [n]$, we used the average cross-entropy loss over the local train dataset of the i-th node and the L2 regularization loss of $\lambda_i$ for $F_i$, with the rates shown in Table 2 (L2 reg. rate). We reported the mean test accuracy at the intermediate step that had the maximum validation accuracy (i.e., early stopping), where the validation dataset was sampled independently from the train dataset as described in Appendix D.

Baseline approaches We compared our approaches with baselines for each communication setting. For the FC and FixU settings, we compared with several personalization approaches: a personalized model trained only on the local dataset (Local), FedAvg with local tuning (FedAvg+) (Jiang et al., 2019), Clustered-FL (Sattler et al., 2020), pFedMe (T Dinh et al., 2020), and the centralized and decentralized versions of FedEM adopted in Marfoq et al. (2021). We also trained global models using SGP (Nedić & Olshevsky, 2016; Assran et al., 2019) and FedProx (Li et al., 2020). Since SGP recovers FedAvg and DSGD on FC and FixU, respectively, we treat them as equivalent approaches.
All the approaches on FC and FixU followed the training procedure with epoch-wise communication in Marfoq et al. (2021) while using Eq. (4b) for HGP computation. Every method run on StoU and StoD adopted the SGP iteration (Eq. (4b)) with T = 600 steps, batch size $|\zeta_i| = 128$, and L2 regularization with decay 0.001. For StoU and StoD, we adopted the learning rate $\alpha_i = 0.05$ for SGP, Local, and PDBO-DA, and $\alpha_i = 0.25$ for PDBO-MTL and PDBO-MTL&DA. These learning rates were scheduled to be multiplied by 0.1 at t = 500 and t = 550. As we have no baseline ensemble approach (i.e., FedEM) to compare with our PDBO-MTL and PDBO-MTL&DA in StoU and StoD, we also examined the performance improvement over the initial hyperparameter. PDBO-MTL and PDBO-MTL&DA improved their test accuracy over the initial hyperparameter in both StoU and StoD, confirming that their performance gain over SGP was not solely due to differences in architectures and learning rates.

Dataset and model We adopted the procedure for generating a federated version of EMNIST in Marfoq et al. (2021), except for the train and validation split. In our experiments, we consider 10% of the EMNIST dataset, partitioned according to a Dirichlet allocation with parameter α = 0.4 over n = 100 clients as in Marfoq et al. (2021). We randomly selected 20% of the obtained dataset to make a validation dataset. We use the validation dataset only for the early stopping in the outer-optimization of PDBO-DA, PDBO-MTL, and PDBO-MTL&DA. We trained the same CNN as in Marfoq et al. (2021) for all the single-model baselines and PDBO-DA, and as the base predictor of FedEM, PDBO-MTL, and PDBO-MTL&DA.
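The edge sampling behind the simulated networks in this appendix can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the convention that `adj[j][i] = 1` means client j can send to client i, and the choice to keep self-loops always active in StoU and StoD are assumptions. The Fast Mixing Markov Chain weights used for FixU are not covered here.

```python
import random


def sample_edge_probs(n, low=0.4, high=0.8, symmetric=False, rng=random):
    """Draw per-edge activation probabilities; probs[j][i] is for the edge j -> i."""
    probs = [[rng.uniform(low, high) for _ in range(n)] for _ in range(n)]
    for j in range(n):
        probs[j][j] = 1.0  # assumption: self-loops are always active
        if symmetric:
            for i in range(j):
                probs[j][i] = probs[i][j]  # mirror the upper triangle (StoU)
    return probs


def sample_static_undirected(n, p=0.4, rng=random):
    """FixU: a single Erdos-Renyi draw with self-loops, reused at every step."""
    adj = [[1 if i == j else 0 for i in range(n)] for j in range(n)]
    for j in range(n):
        for i in range(j + 1, n):
            adj[j][i] = adj[i][j] = int(rng.random() < p)
    return adj


def sample_step(probs, directed, rng=random):
    """StoU (directed=False) or StoD (directed=True): active edges of one time step."""
    n = len(probs)
    adj = [[1 if i == j else 0 for i in range(n)] for j in range(n)]
    for j in range(n):
        for i in range(n):
            if i == j:
                continue
            if directed:
                # Each direction is an independent Bernoulli draw.
                adj[j][i] = int(rng.random() < probs[j][i])
            elif i > j:
                # One draw per undirected edge, shared by both directions.
                adj[j][i] = adj[i][j] = int(rng.random() < probs[j][i])
    return adj


def fully_connected(n):
    """FC: every edge is active at every step."""
    return [[1] * n for _ in range(n)]
```

Calling `sample_step` once per SGP iteration yields a fresh realization of the network at each step, whereas the FixU adjacency is drawn once and held fixed.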

E GRADIENT-BASED DISTRIBUTED BILEVEL OPTIMIZATION

We compare concurrent studies of distributed bilevel optimization (Chen et al., 2022; Tarzanagh et al., 2022; Gao et al., 2022; Yang et al., 2022; Li et al., 2022; Liu et al., 2022; Lu et al., 2022) in terms of problem settings, applicability to communication networks, the hyper-gradient value to estimate, and complexity in communication and computation.

Bilevel problem setting We categorize them into two problems (Bilevel problem in Table 3): the consensus distributed bilevel optimization (CDBO) (Chen et al., 2022; Tarzanagh et al., 2022; Gao et al., 2022; Yang et al., 2022) and CDBO with the local inner-problem (CDBO-Local) (Li et al., 2022; Liu et al., 2022; Lu et al., 2022). CDBO pursues consensus also in the outer-problem, which can be obtained by imposing $\lambda_i = \lambda_j$ for all $i, j \in [n]$ on the PDBO outer-problem (Eq. (5-left)):

$$
\min_{\substack{\lambda_i \\ \lambda_i = \lambda_j,\ \forall j}} \frac{1}{n} \sum_{i=1}^{n} F_i\left( x_i^*(\lambda_1, \ldots, \lambda_n), \lambda_i \right),
\quad \text{s.t.} \quad
x_i^* = \arg\min_{\substack{x_i \\ x_i = x_j,\ \forall j}} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\xi_i}\left[ f_i(x_i, \lambda_i; \xi_i) \right].
$$

While CDBO-Local also requires consensus in the outer-problem as in CDBO, its inner-problem is a local optimization problem in which the optimal parameters of each client are independent of the other clients, unlike PDBO and CDBO:

$$
\min_{\substack{\lambda_i \\ \lambda_i = \lambda_j,\ \forall j}} \frac{1}{n} \sum_{i=1}^{n} F_i\left( x_i^*(\lambda_i), \lambda_i \right),
\quad \text{s.t.} \quad
x_i^* = \arg\min_{x_i} \mathbb{E}_{\xi_i}\left[ f_i(x_i, \lambda_i; \xi_i) \right].
$$

However, no client in CDBO-Local can benefit from the others in the inner loop for better generalization. We note that in our PDBO, both the outer and inner problems are optimized from global information; the inner-parameter is trained for consensus among the clients, and the outer-parameter is optimized to improve the total performance across all the clients.

Communication networks

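The communication networks can be categorized into stochastic directed, static undirected, and centralized (Communication network in Table 3). Studies for CDBO (Chen et al., 2022; Gao et al., 2022; Yang et al., 2022) and CDBO-Local (Liu et al., 2022; Lu et al., 2022) suppose that the communication networks are static and undirected. More specifically, they assume the weighted mixing matrix $P^{(t)}$ to be a doubly stochastic matrix at all time steps $t \in \mathbb{N}$ for the consensus of DSGD in the outer-problem (Liu et al., 2022; Lu et al., 2022) (i.e., $x_i = x_j, \forall i, j \in [n]$), and both in the outer-problem and inner-problem (Chen et al., 2022; Gao et al., 2022; Yang et al., 2022) (i.e., $x_i = x_j, \lambda_i = \lambda_j, \forall i, j \in [n]$). Our HGP is the only method that runs even on stochastic and directed communication networks. In terms of the consensus, we can relax the assumption of static undirected communication in Chen et al. (2022); Gao et al. (2022); Yang et al. (2022); Liu et al. (2022); Lu et al. (2022) to stochastic and directed networks by replacing DSGD with SGP for the inner-loop and outer-loop. However, in terms of the hyper-gradient estimation, we cannot naively replace the communication network setting, as discussed in Section 4.2.

Hyper-gradient to estimate Both PDBO and CDBO require hyper-gradient estimation as they involve the interaction of clients in the inner-problem. However, the estimated hyper-gradient varies among the studies, so we categorize them into GlobalGrad, ClientGrad, and LocalGrad (Hypergradient in Table 3). Our HGP and Yang et al. (2022); Tarzanagh et al. (2022) aim at estimating the gradient of the average outer-objective across the clients with respect to the hyperparameter of the client (GlobalGrad), i.e., $d_{\lambda_i} F(x^*(\lambda), \lambda) \in \mathbb{R}^{d_\lambda}$. Chen et al. (2022) estimate a slightly different hyper-gradient, that is, the gradient of the client's outer-objective with respect to the hyperparameter of the client (ClientGrad), i.e., $d_{\lambda_i} F_i(x_i^*(\lambda), \lambda_i)$. Unlike GlobalGrad, ClientGrad only lets the client know how a perturbation of the client's hyperparameter changes its own outer-objective. Thus, the gradient step on the client's hyperparameter using ClientGrad is not supposed to improve the performance of the others, which is not the case with GlobalGrad. Gao et al. (2022) estimate LocalGrad, which is equivalent to the hyper-gradient estimation of SGD that estimates $d_{\lambda_i} F_i(x_i^*(\lambda_i), \lambda_i)$. LocalGrad differs from ClientGrad in that it needs no communication, because the optimal inner-parameter $x_i^*$ is parameterized only by its own hyperparameter $\lambda_i$.

Complexity in communication and computation For a fair comparison, we compare the complexity of communication and computation between methods that intend to estimate the same hyper-gradient. Note that we only focus on the requirement of computation or communication for the full Jacobian matrix, as it is dominant in decentralized hyper-gradient estimation (rightmost two columns of Table 3).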
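No approach for LocalGrad involves full Jacobian computation or communication, as they can naively adopt efficient algorithms such as backward mode. For GlobalGrad, the algorithm proposed by Yang et al. (2022) is complex both in computation and communication as it involves computations and communications of the full Jacobian matrix ($O(d_y \times d_\lambda)$) and Hessian matrix ($O(d_y \times d_y)$). Tarzanagh et al. (2022) and our HGP enjoy reasonable complexity because these methods avoid computation and communication of the full Jacobian by using Jacobian-vector products.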
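To illustrate why Jacobian-vector products suffice, the sketch below accumulates a truncated Neumann series of the form $B \sum_{i=0}^{M} A^{i} c_y + c_\lambda$ using only matrix-vector products. It is a centralized, deterministic toy and not HGP itself: the names `apply_A` and `apply_B` are hypothetical stand-ins for Jacobian-vector product routines (here plain dense matrices), and the sign and transpose conventions are assumptions.

```python
def matvec(mat, vec):
    """Dense matrix-vector product; a stand-in for a Jacobian-vector product."""
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]


def neumann_hypergradient(apply_A, apply_B, c_y, c_lam, num_terms):
    """Approximate B @ sum_{i=0}^{M} A^i @ c_y + c_lam without forming any matrix power.

    apply_A, apply_B: callables mapping a vector to A @ vector and B @ vector.
    num_terms: M, the highest power of A kept in the truncated series.
    """
    u = list(c_y)            # u holds A^i c_y for the current i
    acc = [0.0] * len(c_y)   # running sum of A^i c_y
    for _ in range(num_terms + 1):
        acc = [s + x for s, x in zip(acc, u)]
        u = apply_A(u)
    return [b + c for b, c in zip(apply_B(acc), c_lam)]


# Toy problem with d_y = 2 and d_lambda = 1; the spectral radius of A is below 1.
A = [[0.5, 0.1], [0.2, 0.3]]
B = [[1.0, 2.0]]
v = neumann_hypergradient(lambda x: matvec(A, x), lambda x: matvec(B, x),
                          c_y=[1.0, 1.0], c_lam=[0.5], num_terms=200)
```

Only vectors of size $d_y$ are carried between iterations, so per-round memory and communication stay linear in $d_y$. The series converges only when the spectral radius of the matrix behind `apply_A` is below one.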

F DETAILED ALGORITHMS

We provide Alg. 1, which describes a case of PDBO in which the outer-problem is solved by local SGD. We also describe the complete algorithms of SGP (Alg. 2) formulated by Eq. (2), HGP (Alg. 4) formulated by Eq. (12), VR-HGP (Alg. 5) formulated in Section 4.2 (Variance reduction), and the exact recurrent backpropagation (Alg. 3) formulated by Eq. (11). All the algorithms above are expected to run locally at every i-th client, showing how all clients collaboratively solve PDBO (Eq. (7)) without any central orchestration. For a better understanding, we give below special notes on several lines in the algorithms that characterize our approach.

Outer-loop in PDBO Let $\lambda_i^{(s)}$ be the hyperparameter of the i-th client at the s-th outer-step. PDBO runs multiple outer-steps for s = 0, . . . , S - 1 from a given initial hyperparameter $\lambda_i^{(0)}$. Alg. 1 supposes $\lambda_i^{(s)}$ is updated locally by an SGD step (Line 7 in Alg. 1). As the output of HGP can be seen as an unbiased estimate of the stochastic gradient, the convergence property of the outer-steps is simply given by the common convergence property of SGD whose noise is characterized by Theorem 1. We can also use other optimizers such as Adam (Kingma & Ba, 2015) for the outer-steps, as we did in our experiments (Section 6).
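The local outer-loop described above can be sketched with Adam in place of SGD. This is an illustrative sketch under stated assumptions, not Alg. 1: `hypergrad` is a hypothetical callable standing in for one hyper-gradient estimation by HGP or VR-HGP (including the inner SGP run it depends on), and the learning rate in the toy usage is arbitrary.

```python
import math


def adam_outer_steps(hypergrad, lam0, lr, num_steps=20,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Run local Adam outer-steps on one client's hyperparameter.

    hypergrad: callable returning a hyper-gradient estimate at lam (assumed given).
    """
    lam = list(lam0)
    m = [0.0] * len(lam)  # first-moment estimate
    v = [0.0] * len(lam)  # second-moment estimate
    for s in range(1, num_steps + 1):
        g = hypergrad(lam)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** s) for mi in m]  # bias correction
        v_hat = [vi / (1 - beta2 ** s) for vi in v]
        lam = [li - lr * mh / (math.sqrt(vh) + eps)
               for li, mh, vh in zip(lam, m_hat, v_hat)]
    return lam


# Toy stand-in: the gradient of 0.5 * ||lam - target||^2 plays the role of the hyper-gradient.
target = [1.0, -1.0]
lam = adam_outer_steps(lambda x: [xi - ti for xi, ti in zip(x, target)],
                       lam0=[0.0, 0.0], lr=0.05)
```

Each client runs this loop on its own hyperparameter only; all interaction with other clients is hidden inside the hyper-gradient estimate.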

HGP vs. Exact recurrent backpropagation

We explain the difference between HGP (Eq. (12)) and exact recurrent backpropagation (Eq. (11)) from an algorithmic perspective. As mentioned in Section 4.2, decentralization of exact recurrent backpropagation is impossible on directed communication networks. In exact recurrent backpropagation, Line 9 and Line 17 in Alg. 3 require a client to receive the intermediate backpropagation vector $u_j^{(m)}$ from clients such that $\delta_{ij}^{(m)} = 1$, meaning that the i-th client needs to receive messages from the clients to which it sent messages. This is possible only when the communication network is undirected or synchronized. In our HGP, Line 9 and Line 17 in Alg. 4 let the i-th client receive $u_j^{(m)}$ from clients such that $\delta_{ji}^{(m)} = 1$; thus, any client simply receives the information from all the clients that are able to send to i. The estimation bias incurred by this simple modification is corrected according to the expected sending weight $\bar p_{ij} = \mathbb{E}_\delta[p_{ij}(\delta_i)]$ and the receiving frequency $\bar\delta_{ji} = \mathbb{E}_\delta[\delta_{ji}]$ estimated through inner SGP iterations (Line 7 and Line 12 in Alg. 2). Note that both HGP and the exact recurrent backpropagation enjoy cheap time complexity since the computations related to Jacobians, ∂ λi φ i y * i ; λ i , ζ 



This is a mathematical equivalence; FedAvg runs on a centralized network in practice.

Although setting α = 1 makes $\mu_{\alpha,\beta} = 0$, the remaining error is no longer $e^{-O(M)}$ in that case. This observation implies that α slightly smaller than 1 is preferred. A similar analysis also shows that β slightly larger than 0 is preferred. Our empirical results show that (α, β) = (0.9, 0.1) performs well (Appendix C.6).



and the same for B(m) . HGP enjoys the same complexity as SGP in both communication and computation. HGP exchanges only u (•) i having O (d y ) in communication. In practical cases where d λ ≪ d y , the Jacobian-vector products ⟨ B(•) ⟩ ij u (•) j and ⟨ Ã(•) ⟩ ij u (•) j are computed in O (d y ) time.

2i+3) , . . . , Â(2M-1) , B(2i+2) , . . . , B(2M) = 0, FOR α ∈ (0, 1) AND β ∈ (0, 1)

Figure 2: ℓ 2 norm between the estimation of VR-HGP v (m) and the true hyper-gradient d λ F (x * , λ) at the m-th estimation round with different combinations of α and β.

Chen et al. (2022); Tarzanagh et al. (2022); Gao et al. (2022); Yang et al. (2022) applied CDBO to hyperparameter (e.g. L2 regularization coefficient) optimization.

Lu et al. (2022) demonstrated the ability of CDBO-Local problem to handle personalization tasks.

Tarzanagh et al. (2022); Liu et al. (2022) address the consensus in the outer-problem by adopting centralized communication settings so that a single global hyperparameter is shared among the clients at every step.

is complex both in computation and communication as they involve computations and communications of full Jacobian matrix (O (d y × d λ )) and Hessian matrix (O (d y × d y )). Tarzanagh et al. (

computed by Jacobian-vector product.Algorithm 1: PDBO with SGD ran by the i-th client i , pij , δj i j∈[n] ← Alg. 2 y

Test accuracy of personalized models on EMNIST (average clients / 10% percentile).

Parameters for the outer-problems in Section 6 PDBO-MTL, and PDBO-MTL&DA, respectively. For Adam optimizer, we adopted different learning rate shown in Table2(Hyper-learning rate). We adopt HGP for FC and FixU setting, and VR-HGP with (α, β) = (0.9, 0.1) for StoU and StoD settings. Both HGP and VR-HGP ran M = 200 estimation steps using iteration Eq. (

Comparison of the gradient-based PDBO, CDBO, and CDBO-Local.

Chen et al. (2022);Gao et al. (2022);Yang et al. (2022);Liu et al. (2022);Lu et al. (2022) to the stochastic and directed networks by replacing DSGD with SGP for the inner-loop and outer-loop. However, in terms of the hyper-gradient estimation, we cannot naively replace the communication networks setting as discussed in Section 4.2.Hyper-gradient to estimate Both PDBO and CDBO require hyper-gradient estimation as they involve the interaction of clients in the inner-problem. However, the estimated hyper-gradient varies among the studies, so we categorize them into GlobalGrad, ClientGrad, and LocalGrad (Hypergradient in Table3). Our HGP andYang et al. (2022);Tarzanagh et al. (2022) aim at estimating the gradient of the average outer-objective across the client with respect to the hyperparameter of the client (GlobalGrad), i.e. d λi F (x * (λ) , λ) ∈ R d λ .Chen et al. (2022) estimate slightly different hyper-gradient, that is, the gradient of client outerobjective with respect to the hyperparameter of the client (ClientGrad), i.e. d λi F i (x * i

estimates1 pij ← 0, ∀j ∈ [n] 2 δj i ← 0, ∀j ∈ [n]3 foreach t = 0, . . . , T -1 do // Sample a minibatch and communication edges , pij , δj i j∈[n]    Algorithm 3: (Maybe impossible) Exact recurrent backpropagation ran by the i-th client Input:y * i , λ i // Compute i-th block of ∂y F (x * , λ) denoted by ⟨c y ⟩i ∂ yi F i (x * i , λ i ) // Compute i-th block of ∂ λ F (x * , λ) denoted by ⟨c λ ⟩i ∂ λi F i (x * i , λ i ) 3 foreach m = 0, . . . , M -1 do // Sample a minibatch and communication edges )∂ λi φ i y * i ; λ i , ζ += ∂ λi ψ i y * i ; λ i , ζ += ∂ yi ψ i y * i ; λ i , ζAlgorithm 4: HGP ran by the i-th clientInput: y * i , λ i , pij , δj i j∈[n] // Compute i-th block of ∂y F (x * , λ) denoted by ⟨c y ⟩i ∂ yi F i (x * i , λ i ) // Compute i-th block of ∂ λ F (x * , λ) denoted by ⟨c λ ⟩i ∂ λi F i (x * i , λ i ) 3 foreach m = 0, . . . , M -1 do // Sample a minibatch and communication edges += pij δj i ∂ λi φ i y * i ; λ i , ζ += ∂ λi ψ i y * i ; λ i , ζ += pij δj i ∂ yi φ i y * i ; λ i , ζ += ∂ yi ψ i y * i ; λ i , ζAlgorithm 5: VR-HGP ran by the i-th clientInput: y * i , λ i , pij , δj i j∈[n] , α, β // Compute i-th block of ∂y F (x * , λ) denoted by ⟨c y ⟩i ∂ yi F i (x * i , λ i ) // Compute i-th block of ∂ λ F (x * , λ) denoted by ⟨c λ ⟩i += α pij δj i ∂ λi φ i y * i ; λ i , ζ += α ∂ λi ψ i y * i ; λ i , ζ += pij δj i ∂ yi φ i y * i ; λ i , ζ += β pij δj i ∂ yi φ i y * i ; λ i , ζ += ∂ yi ψ i y * i ; λ i , ζ += β ∂ yi ψ i y * i ; λ i , ζ

G ADDITIONAL EXPERIMENTS

We conducted personalization benchmarks on different tasks: image classification (CIFAR10 and CIFAR100 (Krizhevsky, 2009) ), language modeling (Shakespeare (Caldas et al., 2018; McMahan et al., 2017) ), and handwritten character recognition (EMNIST (Cohen et al., 2017) ) on a simulated stochastic directed communication network.

G.1 SETTINGS

We ran our approaches, PDBO-DA, PDBO-MTL, and PDBO-MTL&DA, together with the Local and SGP baselines explained in Section 6. We adopted the stochastic directed network StoD as the simulated communication network, with the same setting as in Section 6. For each approach, we solved different tasks on the corresponding datasets: CIFAR10, CIFAR100, Shakespeare, and EMNIST.

Tasks For image classification on CIFAR10, we distributed samples with the same labels across clients according to a symmetric Dirichlet distribution with parameter 0.4, as in Marfoq et al. (2021); Wang et al. (2019), to create a federated version. We used 40% of the total data as the train and validation datasets in a 3:1 ratio and the rest as the test dataset. We also tested image classification using CIFAR100, exploiting the availability of "coarse" and "fine" labels and using a two-stage Pachinko allocation method (Li & McCallum, 2006) as in Reddi et al. (2020); Marfoq et al. (2021), to distribute train, validation, and test datasets of sizes 900, 300, and 1800 to each client, respectively. Pachinko allocation ran with the parameters adopted in Marfoq et al. (2021). For both CIFAR10 and CIFAR100, we set n = 20 and trained MobileNet-v2 (Sandler et al., 2018), implemented in TorchVision (Marcel & Rodriguez, 2010), with an additional linear layer.

The Shakespeare dataset was naturally divided by assigning all lines from the same character to the same client, as in Marfoq et al. (2021); McMahan et al. (2017). From 728 characters, we randomly selected n = 20 characters and assigned each of them to a client. We trained two stacked LSTM layers with 256 hidden units followed by a densely-connected layer to predict the next character from a sequence of 200 English characters as input. The model embeds 80 characters into a learnable 8-dimensional embedding space. For each client, we used 80% of the lines as the train and validation datasets in a 3:1 ratio and the rest as the test dataset. The lines are split from the beginning in the order train, validation, and test to simulate the practical time dependence between datasets.

The settings of handwritten character recognition on EMNIST (Cohen et al., 2017) are described in Section 6 and Appendix D.

Approaches For PDBO-DA, PDBO-MTL, and PDBO-MTL&DA, we adopted the same strategies and parameters as in Section 6. PDBO-DA optimizes a loss weight vector whose elements correspond to the labels of EMNIST, CIFAR10, and CIFAR100 (the characters for Shakespeare). PDBO-MTL optimizes the ensemble weights of the predictions of 3 models for each task, and PDBO-MTL&DA simultaneously optimizes the outer-parameters of PDBO-DA and PDBO-MTL. Except for reducing M to 20 rounds for efficiency, we adopted the same VR-HGP setting as in Section 6.

For baselines, due to the absence of personalization methods applicable to stochastic directed communication networks, Local and SGP were adopted. We also trained an ensemble model with uniform prediction weights for both baselines (Local-MTL and SGP-MTL) to fairly compare the performance difference between the baselines and our approaches. This allows us to exclude architectural differences from the reasons for performance improvements.

Results and discussions Table 4 shows the average test accuracy with weights proportional to the local test dataset sizes. We observed that our approaches PDBO-DA, PDBO-MTL, and PDBO-MTL&DA improved accuracy over the baselines on all tasks with a few exceptions: PDBO-MTL on CIFAR10 and PDBO-DA on EMNIST. PDBO-MTL&DA outperformed the other approaches on CIFAR10, CIFAR100, and EMNIST in average accuracy, confirming that the simultaneous optimization of different parameters is effective on complex tasks. Even on the next-character prediction task on time series data, all of our approaches improved the performance over the baselines, indicating that our gradient-based PDBO is effective in a variety of tasks. Note that the performance improvements did not come from the architectural differences in the models (single model or ensemble model), since PDBO-MTL and PDBO-MTL&DA outperformed SGP-MTL and Local-MTL. Accuracy at the 10th percentile also improved over the baselines in all the tasks and for all our approaches, which validates that clients benefited fairly from our personalization.
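The two hyperparameter parameterizations used by these approaches, label weights $C\,\mathrm{Softmax}(\lambda^{C})$ for PDBO-DA and ensemble weights $\mathrm{Softmax}(\lambda^{K})$ for PDBO-MTL, can be sketched as below. The function names and the per-instance loss helper are illustrative assumptions; the actual models and training loops are not reproduced.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_weights(lam_c):
    """PDBO-DA: C * Softmax(lambda) in [0, C]^C; an all-zero lambda weights every label by 1."""
    num_labels = len(lam_c)
    return [num_labels * w for w in softmax(lam_c)]


def ensemble_probs(lam_k, member_probs):
    """PDBO-MTL: Softmax(lambda)-weighted average of the K members' class probabilities."""
    weights = softmax(lam_k)
    num_classes = len(member_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, member_probs))
            for c in range(num_classes)]


def weighted_nll(probs, label, lam_c):
    """Cross-entropy of one instance, scaled by the weight of its label."""
    return -label_weights(lam_c)[label] * math.log(probs[label])
```

With the zero initialization used in the experiments, the label weights all equal one and the ensemble weights are uniform, so the outer-steps start from the unweighted loss and the uniform ensemble.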

H HYPER-GRADIENT ESTIMATION BY ITERATIVE DIFFERENTIATION

Iterative differentiation is categorized into forward (ITD-Forward) and backward (ITD-Backward) modes (Franceschi et al., 2017). ITD-Forward and ITD-Backward have an advantage over AID in that they do not assume convexity of the loss. In this section, we show that ITD-Forward and ITD-Backward suffer from physical limitations and large complexity in communications.

ITD-Forward and ITD-Backward compute the hyper-gradient by recursively tracing back all the inner-loops performed to obtain the trained parameter $y^{(T)}$, without requiring $y^{(T)}$ to be the stationary point. To apply ITD-Forward and ITD-Backward to our setting, we suppose $y^{(T)}$ is obtained by T iterations of Eq. (2). Considering iterations for t = 0, . . . , T - 1, the concatenated hyper-gradient of ITD-Forward and ITD-Backward can be given as

ITD-Forward and ITD-Backward are differentiated by how they compute Eq. (41).

Forward mode iterative differentiation (ITD

Â(s) .

To compute Eq. (41), ITD-Forward updates the matrix $U^{(t)} \in \mathbb{R}^{n d_\lambda \times n d_y}$. After initializing by $U^{(0)} = O_{n d_\lambda \times n d_y}$, the following iterations for t = 0, . . . , T - 1,

provides the hyper-gradient by

We then consider a decentralized algorithm run by the i-th client to obtain the i-th block of the concatenated hyper-gradient; a decentralized algorithm of ITD-Forward can be written as follows.

Forward mode iterative differentiation (ITD-Forward)

The i-th node can compute the first term of the right-hand side of Eq. (43) because the Jacobian-matrix product $U_{kj}^{(t)} \langle \hat A^{(t)} \rangle_{ji}$ is always receivable from the j-th client even when edges are directed; when

In the same manner, the second term can be computed by receiving $\langle \hat B^{(t)} \rangle_{ki}$ from the k-th node. Note that when we choose to let the i-th client update row block matrices U 44). This requires communicating with all the clients. The i-th client, therefore, needs to wait for several communication rounds until it can receive the Jacobian-vector product

Backward mode iterative differentiation (ITD-Backward) ITD-Backward computes Eq. (41) in the reverse time sequence of ITD-Forward, i.e., t = T - 1, . . . , 0, resulting in an iteration similar to Eq. (10). Let uB(s) u (s) + c λ . By initializing $u^{(T)} \leftarrow c_y$ and $v^{(T)} \leftarrow c_\lambda$, and by the following iterations for t = T - 1, . . . , 0,

we obtain the hyper-gradient estimate as $d_\lambda F \leftarrow v^{(0)}$.

Mathematically, the decentralized algorithm of ITD-Backward for the i-th client can be written as

Backward mode iterative differentiation (ITD-Backward)

i .

In centralized bilevel optimization, the iterations of t = T - 1, . . . , 0 are realized by storing the intermediate parameters and indices of every minibatch during the training (Franceschi et al., 2017) to recover all iterations after T inner-gradient descent steps. 
However, when we use SGP iterations for the inner-steps, the stochastic network is not guaranteed to reproduce the communication edges that clients experienced during the training, making the computation of the decentralized ITD-Backward infeasible.

Moreover, ITD-Backward also requires undirected edges, similar to the exact recurrent backpropagation, as pointed out in Section 4.2 (Exact backpropagation requires undirected edges).

We thus conclude that ITD-Backward requires the communication network to be restricted to static and undirected in order to function.

