MOMENTUM TRACKING: MOMENTUM ACCELERATION FOR DECENTRALIZED DEEP LEARNING ON HETEROGENEOUS DATA

Abstract

SGD with momentum acceleration is one of the key components for improving the performance of neural networks. For decentralized learning, a straightforward approach using momentum acceleration is Distributed SGD (DSGD) with momentum acceleration (DSGDm). However, DSGDm performs worse than DSGD when the data distributions are statistically heterogeneous. Recently, several studies have addressed this issue and proposed methods with momentum acceleration that are more robust to data heterogeneity than DSGDm, although their convergence rates remain dependent on data heterogeneity and decrease when the data distributions are heterogeneous. In this study, we propose Momentum Tracking, which is a method with momentum acceleration whose convergence rate is proven to be independent of data heterogeneity. More specifically, we analyze the convergence rate of Momentum Tracking in the standard deep learning setting, where the objective function is non-convex and the stochastic gradient is used. Then, we identify that it is independent of data heterogeneity for any momentum coefficient β ∈ [0, 1). Through image classification tasks, we demonstrate that Momentum Tracking is more robust to data heterogeneity than the existing decentralized learning methods with momentum acceleration and can consistently outperform these existing methods when the data distributions are heterogeneous.

1. INTRODUCTION

Neural networks have achieved remarkable success in various fields such as image processing (Simonyan & Zisserman, 2015; Chen et al., 2020) and natural language processing (Devlin et al., 2019). Training neural networks requires large amounts of data. However, because of privacy concerns, it is often difficult to collect large amounts of data, such as medical images, on one server. In such scenarios, decentralized learning has attracted significant attention because it allows us to train neural networks without aggregating all the data onto one server. Recently, decentralized learning has been studied from various perspectives, including data heterogeneity (Tang et al., 2018b; Esfandiari et al., 2021), communication compression (Tang et al., 2018a; Lu & De Sa, 2020; Liu et al., 2021; Takezawa et al., 2022a), and network topologies (Ying et al., 2021).

One of the key components for improving the performance of neural networks is SGD with momentum acceleration (SGDm). Whereas SGD updates the model parameters using the stochastic gradient directly, SGDm updates them using a moving average of the stochastic gradients, called the momentum. Because SGDm can accelerate convergence and improve generalization performance, it has become an indispensable tool for training highly accurate neural networks (He et al., 2016). SGDm has also been refined in many studies, producing methods such as Adam (Kingma & Ba, 2015) and RAdam (Liu et al., 2020a). In decentralized learning, the straightforward approach to using the momentum is Distributed SGD (DSGD) with momentum acceleration (DSGDm) (Gao & Huang, 2020). When the data distributions held by each node are statistically homogeneous, DSGDm works well and improves the performance as well as SGDm does (Lin et al., 2021). However, in real-world decentralized learning settings, the data distributions may be heterogeneous (Hsieh et al., 2020).
In such cases, DSGDm performs worse than DSGD (i.e., DSGD without momentum acceleration) (Yuan et al., 2021).

Table 1: Comparison of the convergence rates. In the "Data-Heterogeneity" column, "✓" indicates that the convergence rate is independent of data heterogeneity, and "(✓)" indicates that it is independent but that data heterogeneity was not discussed either theoretically or experimentally. In the "Momentum," "Stochastic," and "Non-Convex" columns, "✓" indicates, respectively, that the method is accelerated using momentum, that the convergence rate is provided when the stochastic gradient is used, and that the convergence rate is provided when the objective function is non-convex.

| Method | Data-Heterogeneity | Momentum | Stochastic | Non-Convex |
|---|---|---|---|---|
| DSGD (Lian et al., 2017) | | | ✓ | ✓ |
| Gradient Tracking (Koloskova et al., 2021) | ✓ | | ✓ | ✓ |
| DSGDm (Gao & Huang, 2020) | | ✓ | ✓ | ✓ |
| QG-DSGDm (Lin et al., 2021) | | ✓ | ✓ | ✓ |
| DecentLaM (Yuan et al., 2021) | | ✓ | ✓ | ✓ |
| ABm (Xin & Khan, 2020) | (✓) | ✓ | | |
| GTAdam (Carnevale et al., 2022) | (✓) | ✓ | | |
| Momentum Tracking (our work) | ✓ | ✓ | ✓ | ✓ |

This is because, when the data distributions are heterogeneous and the momentum is used instead of the stochastic gradient, the model parameters are updated in increasingly different directions and drift apart more easily. As a result, the convergence rate of DSGDm falls below that of DSGD. To address this issue, Lin et al. (2021) and Yuan et al. (2021) modified the update rules of the momentum in DSGDm and proposed methods that are more robust to data heterogeneity than DSGDm. However, their convergence rates remain dependent on data heterogeneity, and our experiments reveal that their performance degrades when the data distributions are strongly heterogeneous (Sec. 4). Data heterogeneity in decentralized learning has been well studied from both experimental and theoretical perspectives (Hsieh et al., 2020; Koloskova et al., 2020).
Subsequently, many methods, including Gradient Tracking (Lorenzo & Scutari, 2016; Nedić et al., 2017), have been proposed, and their convergence rates have been shown not to depend on data heterogeneity (Tang et al., 2018b; Vogels et al., 2021; Koloskova et al., 2021). However, these studies considered only the case where the momentum is not used, and it remains unclear whether these methods are robust to data heterogeneity when the momentum is applied. In the convex optimization literature, Xin & Khan (2020) and Carnevale et al. (2022) proposed combining Gradient Tracking with the momentum or Adam and analyzed the convergence rates. However, they considered only the case where the objective function is strongly convex and the full gradient is used, which does not hold in the standard deep learning setting, where the objective function is non-convex and only the stochastic gradient is accessible. Hence, their convergence rates are still unknown in the standard deep learning setting, and it remains unclear whether they are independent of data heterogeneity. Furthermore, these studies did not discuss data heterogeneity, either theoretically or experimentally.

In this work, we propose a decentralized learning method with momentum acceleration, which we call Momentum Tracking, whose convergence rate is proven to be independent of data heterogeneity in the standard deep learning setting. More specifically, we provide the convergence rate of Momentum Tracking in a setting in which the objective function is non-convex and the stochastic gradient is used. We then identify that the convergence rate of Momentum Tracking is independent of data heterogeneity for any momentum coefficient β ∈ [0, 1). In Table 1, we compare the convergence rate of Momentum Tracking with those of existing methods.
To the best of our knowledge, Momentum Tracking is the first decentralized learning method with momentum acceleration whose convergence rate has been proven to be independent of data heterogeneity in the standard deep learning setting. Experimentally, we demonstrate that Momentum Tracking is more robust to data heterogeneity than the existing decentralized learning methods with momentum acceleration and can consistently outperform these existing methods when the data distributions are heterogeneous.

2. PRELIMINARIES AND RELATED WORK

2.1. DECENTRALIZED LEARNING

Let $G = (V, E)$ be an undirected graph that represents the underlying network topology, where $V$ denotes the set of nodes and $E$ denotes the set of edges. Let $N := |V|$ be the number of nodes, and we label the nodes in $V$ by the integers $\{1, 2, \cdots, N\}$ for simplicity. We define $\mathcal{N}_i := \{j \in V \mid (i, j) \in E\}$ as the set of neighbors of node $i$ and define $\mathcal{N}_i^+ := \mathcal{N}_i \cup \{i\}$. In decentralized learning, node $i$ has a local data distribution $\mathcal{D}_i$ and a local objective function $f_i : \mathbb{R}^d \to \mathbb{R}$, and can communicate with node $j$ if and only if $(i, j) \in E$. Then, decentralized learning aims to minimize the average of the local objective functions:
$$\min_{x \in \mathbb{R}^d} \Big[ f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x) \Big], \qquad f_i(x) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i} \big[ F_i(x; \xi_i) \big],$$
where $x$ is the model parameter, $\xi_i$ is a data sample drawn from $\mathcal{D}_i$, and the local objective function $f_i(x)$ is defined as the expectation of $F_i(x; \xi_i)$ over $\xi_i$. In the following, $\nabla F_i(x; \xi_i)$ and $\nabla f_i(x) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i} [\nabla F_i(x; \xi_i)]$ denote the stochastic gradient and the full gradient, respectively.

Distributed SGD (DSGD) (Lian et al., 2017) is one of the most well-known algorithms for decentralized learning. Formally, the update rule of DSGD is
$$x_i^{(r+1)} = \sum_{j \in \mathcal{N}_i^+} W_{ij} \Big( x_j^{(r)} - \eta \nabla F_j(x_j^{(r)}; \xi_j^{(r)}) \Big),$$
where $\eta > 0$ is the step size and $W_{ij} \in [0, 1]$ is the weight of edge $(i, j)$. Let $W \in [0, 1]^{N \times N}$ be the matrix whose $(i, j)$-element is $W_{ij}$ if $(i, j) \in E$ and zero otherwise. In general, a mixing matrix is used for $W$ (i.e., $W = W^\top$, $W \mathbf{1} = \mathbf{1}$, and $W^\top \mathbf{1} = \mathbf{1}$). Lian et al. (2018) extended DSGD to the case where the nodes communicate asynchronously and analyzed its convergence rate. Koloskova et al. (2020) analyzed the convergence rate of DSGD when the network topology changes over time. These results revealed that the convergence rate of DSGD decreases and the performance degrades when the data distributions held by each node are statistically heterogeneous.
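As an illustration of the DSGD update rule, the following sketch runs DSGD on a toy problem. The 4-node ring, the uniform $1/3$ mixing weights, and the scalar quadratic objectives $f_i(x) = \frac{1}{2}(x - b_i)^2$ are our own illustrative assumptions, not the paper's experimental setup; full gradients are used in place of stochastic ones for clarity.

```python
import numpy as np

def dsgd_round(X, W, grads, eta):
    """One DSGD round: each node averages its neighbors' locally updated
    parameters, x_i <- sum_j W_ij (x_j - eta * g_j).

    X     : (N, d) array, row i holds node i's parameters x_i
    W     : (N, N) mixing matrix (symmetric, doubly stochastic)
    grads : callable mapping X to an (N, d) array of local gradients g_i
    eta   : step size
    """
    return W @ (X - eta * grads(X))

# Toy setup: N = 4 nodes on a ring, local quadratics f_i(x) = 0.5*(x - b_i)^2.
N, d = 4, 1
W = np.array([[1/3, 1/3, 0, 1/3],
              [1/3, 1/3, 1/3, 0],
              [0, 1/3, 1/3, 1/3],
              [1/3, 0, 1/3, 1/3]])
b = np.arange(N, dtype=float).reshape(N, d)   # heterogeneous local targets b_i
grads = lambda X: X - b                       # full gradients, no noise
X = np.zeros((N, d))
for _ in range(200):
    X = dsgd_round(X, W, grads, eta=0.1)
# The nodes land near the global minimizer mean(b) = 1.5, but with a small
# residual disagreement caused by the constant step size and heterogeneity.
```

With a constant step size the nodes only approach mean$(b_i)$ up to a bias that grows with the disagreement among the $b_i$, which is exactly the heterogeneity effect that the correction methods discussed below are designed to remove.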
This is because the local gradients $\nabla f_i$ differ across nodes and each model parameter $x_i$ tends to drift away when the data distributions are heterogeneous. To address this issue, D$^2$ (Tang et al., 2018b), Gradient Tracking (Lorenzo & Scutari, 2016; Nedić et al., 2017), and primal-dual algorithms (Niwa et al., 2020; 2021; Takezawa et al., 2022b) were proposed to correct the local gradient $\nabla f_i$ toward the global gradient $\nabla f$. As a different approach, Vogels et al. (2021) proposed a novel averaging method to prevent each model parameter $x_i$ from drifting away. The convergence rates of these methods have been shown not to depend on data heterogeneity and not to decrease, even when the data distributions held by each node are statistically heterogeneous. However, these methods do not consider the case in which the momentum is used.

2.2. MOMENTUM ACCELERATION

Momentum acceleration was originally proposed by Polyak (1964), and SGD with momentum acceleration (SGDm) has achieved successful results in training neural networks (Simonyan & Zisserman, 2015; He et al., 2016; Wang et al., 2020b). In decentralized learning, the straightforward approach to using the momentum is DSGD with momentum acceleration (DSGDm) (Gao & Huang, 2020). The update rules of DSGDm are
$$u_i^{(r+1)} = \beta u_i^{(r)} + \nabla F_i(x_i^{(r)}; \xi_i^{(r)}), \qquad x_i^{(r+1)} = \sum_{j \in \mathcal{N}_i^+} W_{ij} \Big( x_j^{(r)} - \eta u_j^{(r+1)} \Big),$$
where $u_i$ is the local momentum of node $i$ and $\beta \in [0, 1)$ is the momentum coefficient. Several variants of DSGDm have also been studied (Yu et al., 2019; Assran et al., 2019; Wang et al., 2020a; Singh et al., 2021). When the data distributions held by each node are statistically homogeneous, DSGDm works well and improves the performance as well as SGDm does. However, when the data distributions are statistically heterogeneous, DSGDm performs worse than DSGD. This is because when the data distributions are statistically heterogeneous (i.e., $\nabla f_i$ varies significantly across nodes), the difference in the parameter updates across nodes (i.e., $\eta u_i$) is amplified by the momentum (Lin et al., 2021). To address this issue, Yuan et al. (2021) and Lin et al. (2021) modified the update rules of the momentum in DSGDm and proposed DecentLaM and QG-DSGDm, respectively. They experimentally demonstrated that these methods are more robust to data heterogeneity than DSGDm. However, their convergence rates have been shown to still depend on data heterogeneity and to decrease when the data distributions are heterogeneous.
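The amplification effect described above can be seen in a two-node toy calculation (our illustration, with constant full local gradients standing in for the stochastic ones): the momentum $u_i$ converges to $\nabla f_i / (1 - \beta)$, so any disagreement between local gradients is magnified by $1/(1-\beta)$, a factor of 10 for the typical $\beta = 0.9$.

```python
import numpy as np

beta = 0.9
g = np.array([-1.0, 1.0])   # two nodes with disagreeing (constant) local gradients
u = np.zeros(2)
for _ in range(200):
    u = beta * u + g        # u_i^{(r+1)} = beta * u_i^{(r)} + grad_i
print(u)                    # approximately [-10., 10.]: the gap grew 10x
```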

2.3. GRADIENT TRACKING

One of the most well-known methods whose convergence rate does not depend on data heterogeneity is Gradient Tracking (Lorenzo & Scutari, 2016). Whereas DSGD exchanges only the model parameter $x_i$, Gradient Tracking exchanges both the model parameter $x_i$ and the local (stochastic) gradient, updating the model parameters while estimating the global gradient $\nabla f$. Nedić et al. (2017) and Qu & Li (2018) analyzed the convergence rate of Gradient Tracking when the objective function is (strongly) convex and the full gradient is used. Pu & Nedic (2021) analyzed the convergence rate when the objective function is strongly convex and the stochastic gradient is used. Recently, Koloskova et al. (2021) analyzed the convergence rate of Gradient Tracking in the standard deep learning setting, where the objective function is non-convex and the stochastic gradient is used, and showed that it does not depend on data heterogeneity. There is also a line of research combining Gradient Tracking with variance reduction methods (Xin et al., 2022). However, these studies consider only the case without momentum acceleration, and the convergence of Gradient Tracking with momentum acceleration has not been analyzed in the aforementioned studies. In the convex optimization literature, Xin & Khan (2020) and Carnevale et al. (2022) proposed combining Gradient Tracking with the momentum or Adam (Kingma & Ba, 2015). However, they considered only the case where the objective function is strongly convex and the full gradient is used; the convergence rate remains unclear in the standard deep learning setting, where the objective function is non-convex and the stochastic gradient is used. Furthermore, these studies do not discuss data heterogeneity, either theoretically or experimentally.

3. PROPOSED METHOD

In this section, we propose Momentum Tracking, which is a decentralized learning method with momentum acceleration whose convergence rate is proven to be independent of the data heterogeneity in the standard deep learning setting.

3.1. SETUP

We assume that the following four standard assumptions hold.

Assumption 1. There exists a constant $f^\star > -\infty$ such that $f(x) \ge f^\star$ for all $x \in \mathbb{R}^d$.

Assumption 2. There exists a constant $p \in (0, 1]$ such that for all $x_1, \cdots, x_N \in \mathbb{R}^d$,
$$\| X W - \bar{X} \|_F^2 \le (1 - p) \| X - \bar{X} \|_F^2,$$
where $X := (x_1, \cdots, x_N) \in \mathbb{R}^{d \times N}$ and $\bar{X} := \frac{1}{N} X \mathbf{1}\mathbf{1}^\top$.

Assumption 3. There exists a constant $L > 0$ such that for all $i \in V$ and $x, y \in \mathbb{R}^d$,
$$\| \nabla f_i(x) - \nabla f_i(y) \| \le L \| x - y \|.$$

Assumption 4. There exists a constant $\sigma^2$ such that for all $i \in V$ and $x_i \in \mathbb{R}^d$,
$$\mathbb{E}_{\xi_i \sim \mathcal{D}_i} \| \nabla F_i(x_i; \xi_i) - \nabla f_i(x_i) \|^2 \le \sigma^2.$$

Assumptions 1, 2, 3, and 4 are commonly used for decentralized learning algorithms (Lian et al., 2017; Yu et al., 2019; Koloskova et al., 2021; Lin et al., 2021). Additionally, the following assumption, which quantifies data heterogeneity, is commonly used in the convergence analysis of decentralized learning algorithms (Lian et al., 2017; Yu et al., 2019; Lin et al., 2021).

Assumption 5. There exists a constant $\zeta^2$ such that for all $x \in \mathbb{R}^d$,
$$\frac{1}{N} \sum_{i=1}^{N} \| \nabla f_i(x) - \nabla f(x) \|^2 \le \zeta^2.$$

Under Assumption 5, the convergence rates of DSGD (Lian et al., 2017), DSGDm (Gao & Huang, 2020; Yuan et al., 2021), QG-DSGDm (Lin et al., 2021), and DecentLaM (Yuan et al., 2021) were shown to depend on data heterogeneity $\zeta^2$ and to decrease as $\zeta^2$ increases. By contrast, in Sec. 3.3, we prove that Momentum Tracking converges without Assumption 5, so its convergence rate is independent of data heterogeneity $\zeta^2$. In addition, we do not assume the convexity of the objective functions $f(x)$ and $f_i(x)$. Therefore, $f(x)$ and $f_i(x)$ are potentially non-convex (e.g., the loss functions of neural networks).
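Assumption 2 quantifies how fast gossip averaging with $W$ contracts disagreement; for a symmetric doubly stochastic $W$ it holds with $p = 1 - \lambda_2(W)^2$, where $\lambda_2$ is the second-largest singular value of $W$. As a sketch, the following computes $p$ for a ring of $N = 8$ nodes (the topology used in Sec. 4); the uniform $1/3$ weights are our assumption, since the paper does not state the mixing weights.

```python
import numpy as np

# Build the mixing matrix of an 8-node ring: each node averages itself and
# its two neighbors with equal weight 1/3 (symmetric, doubly stochastic).
N = 8
W = np.zeros((N, N))
for i in range(N):
    for j in (i - 1, i, i + 1):       # self-loop plus the two ring neighbors
        W[i, j % N] = 1.0 / 3.0

# Second-largest singular value gives the contraction constant of Assumption 2.
sv = np.sort(np.linalg.svd(W, compute_uv=False))[::-1]
p = 1.0 - sv[1] ** 2
print(round(p, 4))                    # spectral-gap parameter p of this ring
```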

3.2. MOMENTUM TRACKING

In this section, we propose Momentum Tracking, which is robust to data heterogeneity and accelerated by the momentum. The update rules of Momentum Tracking are defined as follows:
$$u_i^{(r+1)} = \beta u_i^{(r)} + \nabla F_i(x_i^{(r)}; \xi_i^{(r)}), \tag{7}$$
$$x_i^{(r+1)} = \sum_{j \in \mathcal{N}_i^+} W_{ij} \Big( x_j^{(r)} - \eta \big( u_j^{(r+1)} - c_j^{(r)} \big) \Big), \tag{8}$$
$$c_i^{(r+1)} = \sum_{j \in \mathcal{N}_i^+} W_{ij} \Big( c_j^{(r)} - u_j^{(r+1)} \Big) + u_i^{(r+1)}, \tag{9}$$
where $\beta \in [0, 1)$ is the momentum coefficient. The pseudo-code for Momentum Tracking is presented in Sec. A. In Momentum Tracking, $c_i$ corrects the local momentum $u_i$ toward the global momentum $\frac{1}{N} \sum_j u_j$ and prevents each model parameter $x_i$ from drifting, even when the data distributions are statistically heterogeneous (i.e., the local momentum $u_i$ varies significantly across nodes). Because Momentum Tracking is equivalent to Gradient Tracking when $\beta = 0$, it is a simple extension of Gradient Tracking, and for $\beta = 0$ its convergence rate is already known to be independent of data heterogeneity $\zeta^2$ (Koloskova et al., 2021). However, because data heterogeneity is amplified when the momentum is used instead of the stochastic gradient (i.e., $\beta > 0$) (Lin et al., 2021; Yuan et al., 2021), it is unclear whether the convergence rate of Momentum Tracking is independent of data heterogeneity $\zeta^2$ for any $\beta \in [0, 1)$ or only for a restricted range of $\beta$. In Sec. 3.3, we provide the convergence rate of Momentum Tracking and prove that it is independent of data heterogeneity $\zeta^2$ for any $\beta \in [0, 1)$.
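To make Eqs. (7)-(9) concrete, here is a minimal numerical sketch of Momentum Tracking on strongly heterogeneous quadratic objectives $f_i(x) = \frac{1}{2}(x - b_i)^2$ (a toy setup of our own, with full gradients in place of stochastic ones). In contrast to DSGDm's drift, every node converges to the exact global minimizer $\mathrm{mean}(b_i)$ despite the disagreeing local minimizers.

```python
import numpy as np

def momentum_tracking_round(X, U, C, W, grads, eta, beta):
    """One round of Momentum Tracking, following Eqs. (7)-(9).
    X, U, C: (N, d) arrays holding x_i, u_i, c_i row-wise."""
    U = beta * U + grads(X)          # Eq. (7): local momentum update
    X = W @ (X - eta * (U - C))      # Eq. (8): corrected step, then gossip
    C = W @ (C - U) + U              # Eq. (9): track the global momentum
    return X, U, C

# Toy setup: 4 nodes on a ring, heterogeneous quadratics f_i(x) = 0.5*(x - b_i)^2,
# so the global minimizer is mean(b_i) = 1. Full gradients are used for clarity.
N, d = 4, 1
W = np.array([[1/3, 1/3, 0, 1/3],
              [1/3, 1/3, 1/3, 0],
              [0, 1/3, 1/3, 1/3],
              [1/3, 0, 1/3, 1/3]])
b = np.array([[-3.0], [-1.0], [1.0], [7.0]])
grads = lambda X: X - b
beta, eta = 0.9, 0.05

X = np.zeros((N, d))
G0 = grads(X)
U = (G0 - G0.mean(axis=0, keepdims=True)) / (1 - beta)  # init from Theorem 1
C = U.copy()
for _ in range(3000):
    X, U, C = momentum_tracking_round(X, U, C, W, grads, eta, beta)
# All nodes agree on the exact global minimizer mean(b) = 1,
# even though the local minimizers range from -3 to 7.
```

Setting $\beta = 0$ in this sketch recovers Gradient Tracking, matching the reduction noted above.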

3.3. CONVERGENCE ANALYSIS

Under Assumptions 1, 2, 3, and 4, Theorem 1 provides the convergence rate of Momentum Tracking in the standard deep learning setting. All proofs are presented in Sec. D.

Theorem 1 (Convergence Rate in the Non-Convex Setting). Suppose that Assumptions 1, 2, 3, and 4 hold, each model parameter $x_i$ is initialized with the same parameters, and both $u_i$ and $c_i$ are initialized as $\frac{1}{1-\beta} \big( \nabla F_i(x_i^{(0)}; \xi_i^{(0)}) - \frac{1}{N} \sum_{j=1}^{N} \nabla F_j(x_j^{(0)}; \xi_j^{(0)}) \big)$. Then, for any $\beta \in [0, 1)$ and $R \ge 1$, there exists a step size $\eta$ such that the averaged parameter $\bar{x} := \frac{1}{N} \sum_{i=1}^{N} x_i$ generated by Eqs. (7)-(9) satisfies
$$\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E} \left\| \nabla f(\bar{x}^{(r)}) \right\|^2 \le \mathcal{O}\left( \sqrt{\frac{r_0 \sigma^2 L}{N R}} + \left( \frac{r_0^2 \sigma^2 L^2}{p^4 R^2 (1-\beta)} \left( 1 + \frac{p \beta^2}{1-\beta} \right) \right)^{\frac{1}{3}} + \frac{L r_0}{(1-\beta) p^2 R} \left( 1 + \frac{\beta^2}{(1-\beta^2)^3 p} \right) \right), \tag{10}$$
where $r_0 := f(\bar{x}^{(0)}) - f^\star$.

Remark 1. Combinations of Gradient Tracking with the momentum or Adam have also been proposed by Xin & Khan (2020) and Carnevale et al. (2022). However, they considered only the setting in which the objective function is strongly convex and the full gradient is used, whereas our study focuses on the deep learning setting. Hence, our proof strategies are completely different from those of these previous studies, and Theorem 1 provides the convergence rate in the setting where the objective function is non-convex and the stochastic gradient is used.

Remark 2. The convergence rate of Gradient Tracking in the standard deep learning setting was provided by Koloskova et al. (2021). However, that work did not consider the case where the momentum is used, and it is not trivial to derive the convergence rate of Momentum Tracking from its results.

3.4. DISCUSSION

Comparison with Gradient Tracking: Theorem 1 indicates that the convergence rate of Momentum Tracking does not depend on data heterogeneity $\zeta^2$ for any $\beta \in [0, 1)$ and does not decrease even when the data distributions are statistically heterogeneous (i.e., $\zeta^2 > 0$). Therefore, Momentum Tracking is theoretically robust to data heterogeneity for any $\beta \in [0, 1)$. Although Momentum Tracking is a simple extension of Gradient Tracking, our work is the first to identify that the combination of Gradient Tracking and the momentum converges without being affected by data heterogeneity $\zeta^2$ for any $\beta \in [0, 1)$ in the standard deep learning setting. Because the upper bound in Eq. (10) is minimized when $\beta = 0$, Theorem 1 does not show that the momentum improves the convergence rate. However, the convergence rates of DSGDm and QG-DSGDm provided by Gao & Huang (2020) and Lin et al. (2021) are likewise optimal at $\beta = 0$; that is, they also lack theoretical results matching the experimental observation that convergence improves when $\beta > 0$. As in these studies, we experimentally demonstrate that convergence is accelerated when $\beta > 0$ in Sec. 4 and leave establishing the theoretical benefit of $\beta > 0$ for future work.

Comparison with Existing Algorithms with Momentum Acceleration: Next, we compare the convergence rate of Momentum Tracking with those of the existing decentralized learning algorithms with momentum acceleration: DSGDm (Gao & Huang, 2020), DecentLaM (Yuan et al., 2021), and QG-DSGDm (Lin et al., 2021). Here, we show only the convergence rate of QG-DSGDm, but the same discussion holds for the other methods.

Theorem 2 (Lin et al., 2021). Suppose that Assumptions 1, 2, 3, 4, and 5 hold. Then, for any $\beta \in [0, \frac{p}{21+p}]$ and $R \ge 1$, there exists a step size $\eta$ such that the averaged parameter $\bar{x} := \frac{1}{N} \sum_i x_i$ generated by QG-DSGDm¹ satisfies
$$\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E} \left\| \nabla f(\bar{x}^{(r)}) \right\|^2 \le \mathcal{O}\left( \sqrt{\frac{r_0 \sigma^2 L}{N R}} + \left( \frac{r_0^2 L^2 (\zeta^2 + \sigma^2)}{p^2 R^2} \right)^{\frac{1}{3}} + \frac{L r_0}{R} \left( \frac{1}{p} + \frac{1}{1-\beta} + \frac{\beta}{(1-\beta)^3} \right) \right),$$
where $r_0 := f(\bar{x}^{(0)}) - f^\star$.

Data heterogeneity $\zeta^2$ appears in the second term, so the convergence rate of QG-DSGDm depends on data heterogeneity $\zeta^2$ and decreases when the data distributions held by each node are statistically heterogeneous. By contrast, the convergence rate of Momentum Tracking in Eq. (10) does not depend on $\zeta^2$. Therefore, Momentum Tracking is more robust to data heterogeneity than QG-DSGDm. Because the convergence rates of DSGDm and DecentLaM also depend on $\zeta^2$, the same discussion holds for these methods, and Momentum Tracking is more robust to data heterogeneity than them as well. To the best of our knowledge, Momentum Tracking is the first decentralized learning method with momentum acceleration whose convergence rate has been proven to be independent of data heterogeneity $\zeta^2$ in the standard deep learning setting.

Next, we discuss the range of $\beta$. The convergence rates of QG-DSGDm and DecentLaM provided by Lin et al. (2021) and Yuan et al. (2021) hold only for a restricted range of $\beta$. For instance, Theorem 2 assumes that $\beta \le \frac{p}{21+p}$ ($< 0.05$). These restrictions are violated by the values of $\beta$ used in practice (typically, $\beta = 0.9$), so the convergence rates of QG-DSGDm and DecentLaM are unclear in such practical cases. By contrast, Theorem 1 provides a convergence rate for Momentum Tracking that holds for any $\beta \in [0, 1)$.

Comparison with SGDm: Next, we compare the convergence rate of Momentum Tracking with that of SGDm.
In a setting where the objective function is non-convex and the stochastic gradient is used, SGDm has been proven to converge to a stationary point at rate $\mathcal{O}(1/\sqrt{R})$ (Yan et al., 2018; Liu et al., 2020b). By contrast, Theorem 1 indicates that if the number of rounds $R$ is sufficiently large, Momentum Tracking converges at rate $\mathcal{O}(1/\sqrt{NR})$. Therefore, Momentum Tracking achieves a linear speedup with respect to the number of nodes $N$, a common and important property of decentralized learning methods (Lian et al., 2018; Koloskova et al., 2020).

4. EXPERIMENT

In this section, we present the results of an experimental evaluation of Momentum Tracking and demonstrate that it is more robust to data heterogeneity than the existing decentralized learning methods with momentum acceleration. We focus on test accuracy here; a more detailed evaluation of the convergence behavior is presented in Sec. C.4.

4.1. SETUP

Comparison Methods:
(1) DSGD (Lian et al., 2017): the method described in Sec. 2.1;
(2) DSGDm (Gao & Huang, 2020): the method described in Sec. 2.2;
(3) QG-DSGDm (Lin et al., 2021): a method in which the update rule of the momentum in DSGDm is modified to be more robust to data heterogeneity than DSGDm;
(4) DecentLaM (Yuan et al., 2021): a method in which the update rule of the momentum in DSGDm is modified to be more robust to data heterogeneity;
(5) Gradient Tracking (Nedić et al., 2017): a method without momentum acceleration that is robust to data heterogeneity;
(6) Momentum Tracking: the proposed method described in Sec. 3.

Dataset and Model:

We evaluated Momentum Tracking using three 10-class image classification tasks: FashionMNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky, 2009). Following previous work (Niwa et al., 2020), we distributed the data such that each node was given data of k randomly selected classes. When k = 10, the data distributions held by each node can be regarded as statistically homogeneous; when k < 10, they are statistically heterogeneous, and a smaller k indicates stronger heterogeneity. We evaluated the comparison methods with k ∈ {4, 6, 8, 10} to vary the data heterogeneity. For the neural network architecture, we used LeNet (LeCun et al., 1998) with group normalization (Wu & He, 2018) in Sec. 4.2. In Sec. 4.3, we present a more detailed evaluation with other architectures (e.g., VGG-11 (Simonyan & Zisserman, 2015) and ResNet-34 (He et al., 2016)). For each comparison method, we used 10% of the training data for validation and individually tuned the step size. For DSGDm, QG-DSGDm, DecentLaM, and Momentum Tracking, we set β to 0.9. All experiments were repeated with three different seed values, and we report the averages. More detailed hyperparameter settings are presented in Sec. E.

Network Topology and Implementation: In Secs. 4.2 and 4.3, we set the underlying network topology to a ring of eight nodes (i.e., N = 8). In Sec. C.1, we present a more detailed evaluation with other network topologies. All comparison methods were implemented using PyTorch and run on eight GPUs (NVIDIA RTX 3090).
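The class-based heterogeneous split can be sketched as follows. The function below assigns each node k randomly chosen classes and then divides each class evenly among the nodes that drew it; the even per-class division and the function name are our assumptions for illustration, since the paper only specifies that each node receives data of k randomly selected classes.

```python
import numpy as np

def partition_by_class(labels, num_nodes=8, k=4, num_classes=10, seed=0):
    """Give each node the data of k randomly selected classes only
    (k = num_classes recovers the homogeneous case).
    Returns one index array per node."""
    rng = np.random.default_rng(seed)
    node_classes = [rng.choice(num_classes, size=k, replace=False)
                    for _ in range(num_nodes)]
    node_indices = [[] for _ in range(num_nodes)]
    for c in range(num_classes):
        holders = [i for i in range(num_nodes) if c in node_classes[i]]
        if not holders:                  # a class may be drawn by no node
            continue
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Split this class's samples evenly among the nodes that hold it.
        for part, i in zip(np.array_split(idx, len(holders)), holders):
            node_indices[i].append(part)
    return [np.concatenate(parts) if parts else np.array([], dtype=int)
            for parts in node_indices]

labels = np.repeat(np.arange(10), 100)   # toy labels: 100 samples per class
splits = partition_by_class(labels, num_nodes=8, k=4)
```

With k = 10 every node draws all classes and the split is homogeneous; decreasing k makes the local label distributions increasingly disjoint.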

4.2. EXPERIMENTAL RESULTS

(Under review as a conference paper at ICLR 2023)

Table 2 lists the test accuracy for FashionMNIST, SVHN, and CIFAR-10. Table 3 lists the test accuracy with VGG-11 (Simonyan & Zisserman, 2015) and ResNet-34 (He et al., 2016) when we set k to {2, 4, 10}, and Fig. 2 shows the learning curves. For both neural network architectures, Table 3 reveals that when the data distributions are homogeneous (i.e., 10-class), Momentum Tracking is comparable with DSGDm, QG-DSGDm, and DecentLaM and outperforms DSGD and Gradient Tracking. By contrast, when the data distributions are heterogeneous (e.g., 2-class), Table 3 and Fig. 2 reveal that Momentum Tracking outperforms all comparison methods for both architectures. In particular, Fig. 2 indicates that DSGDm, QG-DSGDm, and DecentLaM are unstable and continue to oscillate in the final training phase, whereas Momentum Tracking converges stably. These results are consistent with those for LeNet presented in Table 2. Therefore, Momentum Tracking is more robust to data heterogeneity than DSGDm, QG-DSGDm, and DecentLaM, and outperforms these methods regardless of the neural network architecture.

5. CONCLUSION

In this study, we proposed Momentum Tracking, a decentralized learning method with momentum acceleration whose convergence rate is proven to be independent of data heterogeneity. More specifically, we provided the convergence rate of Momentum Tracking in the standard deep learning setting, in which the objective function is non-convex and the stochastic gradient is used. Our theoretical analysis revealed that the convergence rate of Momentum Tracking is independent of data heterogeneity for any β ∈ [0, 1). Through image classification tasks, we demonstrated that Momentum Tracking can consistently outperform the decentralized learning methods without momentum acceleration regardless of data heterogeneity. Moreover, we showed that Momentum Tracking is more robust to data heterogeneity than the existing decentralized learning methods with momentum acceleration and can consistently outperform these existing methods when the data distributions are heterogeneous.



¹ For simplicity, we set the additional hyperparameter µ of QG-DSGDm to β.



Figure 1: (a) Learning curve on CIFAR-10 with LeNet in the 10-class (i.e., homogeneous) setting. We evaluated the test accuracy per 10 epochs. (b) Learning curve in the 4-class (i.e., heterogeneous) setting. (c) Average test accuracy for all datasets (i.e., FashionMNIST, SVHN, and CIFAR-10).

Fig. 1 (a) and (b) present the learning curves for CIFAR-10 and Fig. 1 (c) presents the average test accuracy for all datasets.

Figure 2: Learning curves for CIFAR-10 with VGG-11 and ResNet-34 in the 2-class setting.

In summary, when the data distributions are homogeneous, DSGDm, QG-DSGDm, DecentLaM, and Momentum Tracking are comparable and outperform DSGD and Gradient Tracking. When the data distributions are heterogeneous, Momentum Tracking is more robust to data heterogeneity than DSGDm, QG-DSGDm, and DecentLaM, and outperforms all comparison methods.

Table 2: Test accuracy on FashionMNIST, SVHN, and CIFAR-10 with LeNet. "k-class" means that each node has only the data of k randomly selected classes. Bold font indicates the highest accuracy.

Comparison of Momentum Tracking and Gradient Tracking: First, we discuss the results of Momentum Tracking and Gradient Tracking. Table 2 and Fig. 1 indicate that Momentum Tracking reaches a higher accuracy faster than Gradient Tracking and outperforms it in all settings. When the data distributions are homogeneous (i.e., 10-class), Momentum Tracking outperforms Gradient Tracking by 5.8% on average. When the data distributions are heterogeneous (e.g., 4-class), Momentum Tracking outperforms Gradient Tracking by 4.4% on average. Therefore, the results show that Momentum Tracking consistently outperforms Gradient Tracking regardless of data heterogeneity.

Comparison of Momentum Tracking and DSGDm: Next, we discuss the results of Momentum Tracking and DSGDm. When the data distributions are homogeneous (i.e., 10-class), Momentum Tracking and DSGDm are comparable and outperform DSGD and Gradient Tracking. However, when the data distributions are heterogeneous (e.g., 4-class), the test accuracy of DSGDm drops below even that of DSGD, and DSGDm underperforms DSGD by 9.9% on average. By contrast, Momentum Tracking consistently outperforms DSGD and Gradient Tracking by 14.9% and 4.4%, respectively, when the data distributions are heterogeneous. These results indicate that Momentum Tracking is more robust to data heterogeneity than DSGDm and outperforms DSGDm by 24.9% on average.

Table 3: Test accuracy on CIFAR-10 with VGG-11 and ResNet-34. "k-class" indicates that each node has only the data of k randomly selected classes, and bold font indicates the highest accuracy.

