PERSONALIZED DECENTRALIZED BILEVEL OPTIMIZATION OVER STOCHASTIC AND DIRECTED NETWORKS

Abstract

While personalization in distributed learning has been extensively studied, existing approaches employ dedicated algorithms to optimize their specific type of parameters (e.g., client clusters or model interpolation weights), making it difficult to simultaneously optimize different types of parameters for better performance. Moreover, their algorithms require centralized or static undirected communication networks, which are vulnerable to center-point failures or deadlocks. This study proposes optimizing various types of parameters with a single algorithm that runs on more practical communication environments. First, we propose a gradient-based bilevel optimization that reduces most personalization approaches to the optimization of client-wise hyperparameters. Second, we propose a decentralized algorithm to estimate gradients with respect to the hyperparameters, which can run even on stochastic and directed communication networks. Our empirical results demonstrated that the gradient-based bilevel optimization enabled combining existing personalization approaches, which led to state-of-the-art performance, and confirmed that it performs on multiple simulated communication environments, including a stochastic and directed network.

1. INTRODUCTION

In distributed learning, providing personally tuned models to clients, or personalization, has been shown to be effective when the clients' data are heterogeneously distributed (Tan et al., 2022). While various approaches have been proposed, each is dedicated to optimizing a specific type of parameter for personalization. A typical example is clustering-based personalization (Sattler et al., 2020), which employs similarity-based clustering specifically for seeking client clusters. Another approach, called model interpolation (Mansour et al., 2020; Deng et al., 2020), likewise specializes in optimizing interpolation weights between local and global models. These dedicated algorithms prevent developers from combining different personalization methods to achieve better performance.

Another limitation of previous personalization algorithms is that they can run only on centralized or static undirected networks. Most approaches for federated learning (Smith et al., 2017; Sattler et al., 2020; Jiang et al., 2019) require centralized settings in which a host server can communicate with any client. Although a few studies (Lu et al., 2022; Marfoq et al., 2021) consider fully-decentralized settings, they assume that the communication edge between any pair of clients is static and undirected (i.e., synchronized). These communication networks are known to be vulnerable to practical issues, such as bottlenecks or central-point failures on the host servers (Assran et al., 2019), or failing nodes and deadlocks on static undirected networks (Tsianos et al., 2012).

This study proposes optimizing various parameters for personalization using a single algorithm while allowing more practical communication environments. First, we propose a gradient-based Personalized Decentralized Bilevel Optimization (PDBO), which reduces many personalization approaches to the optimization of hyperparameters possessed by each client.
Second, we propose Hyper-gradient Push (HGP), which allows any client to solve PDBO by estimating the gradient with respect to its hyperparameters (the hyper-gradient) via stochastic and directed communications, which are immune to the practical problems of centralized or static undirected communications (Assran et al., 2019). We also introduce a variance-reduced HGP to reduce estimation variance, which is particularly effective when communications are stochastic, and provide its theoretical error bound. We empirically demonstrated that the generality of our gradient-based PDBO enabled combining existing personalization approaches, which led to state-of-the-art performance in a distributed classification task. We also demonstrated that the gradient-based PDBO succeeded in personalization on multiple simulated communication environments, including a stochastic and directed network.

Our contributions are summarized as follows:

• We propose a gradient-based PDBO that can solve existing personalization problems and their combinations as special cases.
• We propose a decentralized hyper-gradient estimation algorithm called HGP, which can run even on stochastic and directed networks. We also propose a variance-reduced HGP, which is particularly effective under stochastic communications, and provide its theoretical error bound.
• We empirically validated the advantages of the gradient-based PDBO with HGP: it enabled solving a combination of different personalization problems, leading to state-of-the-art performance, and it performed on different communication environments including a stochastic directed network.

Notation $\langle A \rangle_{ij}$ denotes the block of the matrix $A$ at the $i$-th row and $j$-th column, and $\langle a \rangle_i$ denotes the $i$-th block of the vector $a$. For a function $f : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$, we denote its total and partial derivatives with respect to a vector $x \in \mathbb{R}^{d_1}$ by $\mathrm{d}_x f(x) \in \mathbb{R}^{d_1 \times d_2}$ and $\partial_x f(x) \in \mathbb{R}^{d_1 \times d_2}$, respectively. We denote the product of matrices by $\prod_{s=0}^{m} \hat{A}(s) = \hat{A}(m) \cdots \hat{A}(0)$, with the empty product $\prod_{s=0}^{-1} \hat{A}(s) = I$.
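The product convention above multiplies matrices in reverse order (the most recent matrix on the left) and defines the empty product as the identity. A minimal NumPy sketch of this convention, with the helper name `matrix_product` chosen here for illustration:

```python
import numpy as np

def matrix_product(A_seq, m):
    """Compute prod_{s=0}^{m} A(s) = A(m) @ ... @ A(0).

    The empty product (m = -1) is defined as the identity matrix,
    matching the notation in the paper.
    """
    d = A_seq[0].shape[0]
    P = np.eye(d)
    for s in range(m + 1):   # s = 0, ..., m
        P = A_seq[s] @ P     # left-multiply: newest matrix on the left
    return P
```

This left-multiplication order matters because matrix products do not commute; it corresponds to applying $\hat{A}(0)$ first and $\hat{A}(m)$ last, as in a sequence of communication rounds.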

2. PRELIMINARIES

We formulate distributed learning (Li et al., 2014), communication networks, and stochastic gradient push (Nedić & Olshevsky, 2016, SGP) as a generalization of gradient-based distributed learning.

Distributed learning Distributed learning with $n$ clients is commonly formulated, for all $i \in [n]$, as

$$x_i^* = \arg\min_{x_i} \frac{1}{n} \sum_{k \in [n]} \mathbb{E}_{\xi_k}\left[ f_k(x_k, \lambda_k; \xi_k) \right], \quad \text{s.t. } x_i = x_j, \ \forall j \in [n], \tag{1}$$

where the $i$-th client pursues the optimal parameter $x_i^* \in \mathbb{R}^{d_x}$ that reaches consensus ($x_i = x_j$, $\forall j \in [n]$) over all the clients while minimizing its cost $f_i : \mathbb{R}^{d_x} \times \mathbb{R}^{d_\lambda} \to \mathbb{R}$ for the input $\xi_i \in \mathcal{X}$ sampled from its local data distribution. We allow $f_i$ to take the hyperparameters $\lambda_i \in \mathbb{R}^{d_\lambda}$ as an argument. Examples of choices of $\lambda_i$ are given in Sections 3 and 5.
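The paper leaves the local cost $f_i$ abstract; a minimal NumPy sketch of evaluating the objective in Eq. (1) under consensus, assuming (purely for illustration) quadratic local costs where each client's hyperparameter $\lambda_i$ acts as an $\ell_2$ regularization weight:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_x = 4, 3                       # number of clients, parameter dimension

# Hypothetical local data xi_i and per-client hyperparameters lambda_i.
data = [rng.normal(loc=i, size=(10, d_x)) for i in range(n)]
lam = np.full(n, 0.1)

def f_i(i, x):
    """Local cost of client i: empirical squared error on its own
    samples plus an l2 penalty weighted by lambda_i (an assumed form;
    the paper keeps f_i generic)."""
    xi = data[i]                    # samples from the local distribution
    return np.mean(np.sum((xi - x) ** 2, axis=1)) + lam[i] * np.sum(x ** 2)

def global_objective(x):
    """(1/n) * sum_k E_{xi_k}[f_k(x, lambda_k; xi_k)], evaluated under
    the consensus constraint x_1 = ... = x_n = x."""
    return np.mean([f_i(k, x) for k in range(n)])
```

Because the clients' data distributions differ (heterogeneity), the minimizer of `global_objective` generally differs from each client's local optimum, which is exactly what motivates personalizing $\lambda_i$ per client.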

Stochastic and directed communication network

In distributed learning, clients solve Eq. (1) by exchanging messages over a physical communication network. The type of edge connection categorizes the communication network: static undirected (Lian et al., 2017), which represents synchronization over all clients; stochastic undirected (Lian et al., 2018), which represents asynchronicity between different client pairs; and stochastic directed (Nedić & Olshevsky, 2016), which represents push communication where any message passing can be unidirectional. This study considers distributed learning on stochastic and directed communication networks. Such a network has several desirable properties: robustness to failing clients and deadlocks (Tsianos et al., 2012), immunity to central-point failures, and small communication overhead (Assran et al., 2019).

We model stochastic directed networks by letting communication edges be randomly realized, as simulated in Assran et al. (2019) and Nedić & Olshevsky (2016). Let $\delta_{ij}^{(t)} \in \{0,1\}$ be a random variable where $\delta_{ij}^{(t)} = 1$ denotes that there is a communication channel from the $i$-th client to the $j$-th client at time step $t$, and $\delta_{ij}^{(t)} = 0$ otherwise. We set $\delta_{ii}^{(t)} = 1$ for all $i \in [n]$ and $t \in \mathbb{N}$, allowing every client to send a message to itself at any time step. Note that the edge model above recovers the other fully-decentralized settings as special cases: symmetric edges ($\delta_{ij}^{(t)} = \delta_{ji}^{(t)}$, $\forall i, j \in [n]$, $\forall t \in \mathbb{N}$) recover stochastic undirected networks, and symmetric constant edges, which additionally require $\delta_{ij}^{(t)} = \delta_{ji}^{(t)} = \delta_{ij}$, recover static undirected networks.

Stochastic gradient push (SGP) SGP (Nedić & Olshevsky, 2016) is one of the most general solvers of Eq. (1). This section formulates SGP with a further generalization covering its variants.
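A minimal NumPy sketch of this edge model, sampling a random directed edge set $\delta^{(t)}$ and building the column-stochastic mixing weights typical of push-style communication (the edge probability `p` and the uniform weight split are assumptions for illustration, not choices from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5          # number of clients
p = 0.3        # probability that the directed edge (i -> j) is active

def sample_edges():
    """Sample delta^{(t)}: delta[i, j] = 1 iff client i can send to
    client j at this step. Self-loops delta[i, i] = 1 always hold."""
    delta = (rng.random((n, n)) < p).astype(float)
    np.fill_diagonal(delta, 1.0)
    return delta

def mixing_matrix(delta):
    """Column-stochastic weights for push communication: each sender i
    splits its mass uniformly over its current out-neighbors, so every
    column of W sums to 1. Senders only need their own out-degree,
    never the receivers' in-degrees."""
    out_deg = delta.sum(axis=1, keepdims=True)   # out-degree of each sender
    W = (delta / out_deg).T                      # W[j, i]: weight from i to j
    return W
```

Column-stochasticity (rather than the doubly-stochastic weights used on undirected networks) is what makes push-style methods run on directed edges: a sender can normalize its outgoing weights locally without any coordination with receivers.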

