MOMENTUM TRACKING: MOMENTUM ACCELERATION FOR DECENTRALIZED DEEP LEARNING ON HETEROGENEOUS DATA

Abstract

SGD with momentum acceleration is one of the key components for improving the performance of neural networks. For decentralized learning, a straightforward approach to using momentum acceleration is Distributed SGD (DSGD) with momentum acceleration (DSGDm). However, DSGDm performs worse than DSGD when the data distributions are statistically heterogeneous. Recently, several studies have addressed this issue and proposed methods with momentum acceleration that are more robust to data heterogeneity than DSGDm, although their convergence rates remain dependent on data heterogeneity and deteriorate when the data distributions are heterogeneous. In this study, we propose Momentum Tracking, a method with momentum acceleration whose convergence rate is proven to be independent of data heterogeneity. More specifically, we analyze the convergence rate of Momentum Tracking in the standard deep learning setting, where the objective function is non-convex and the stochastic gradient is used. We then show that it is independent of data heterogeneity for any momentum coefficient β ∈ [0, 1). Through image classification tasks, we demonstrate that Momentum Tracking is more robust to data heterogeneity than existing decentralized learning methods with momentum acceleration and consistently outperforms these existing methods when the data distributions are heterogeneous.

1. INTRODUCTION

Neural networks have achieved remarkable success in various fields such as image processing (Simonyan & Zisserman, 2015; Chen et al., 2020) and natural language processing (Devlin et al., 2019). To train neural networks, we need to collect large amounts of training data. However, because of privacy concerns, it is often difficult to collect large amounts of data, such as medical images, on one server. In such scenarios, decentralized learning has attracted significant attention because it allows us to train neural networks without aggregating all the data onto one server. Recently, decentralized learning has been studied from various perspectives, including data heterogeneity (Tang et al., 2018b; Esfandiari et al., 2021), communication compression (Tang et al., 2018a; Lu & De Sa, 2020; Liu et al., 2021; Takezawa et al., 2022a), and network topologies (Ying et al., 2021). One of the key components for improving the performance of neural networks is SGD with momentum acceleration (SGDm). Whereas SGD updates the model parameters using a stochastic gradient, SGDm updates the model parameters using a moving average of the stochastic gradients, which is called the momentum. Because SGDm can accelerate convergence and improve generalization performance, it has become an indispensable tool, enabling neural networks to achieve high accuracy (He et al., 2016). Recently, SGDm has been improved in many studies, and methods such as Adam (Kingma & Ba, 2015) and RAdam (Liu et al., 2020a) have been proposed. In decentralized learning, the straightforward approach to using the momentum is Distributed SGD (DSGD) with momentum acceleration (DSGDm) (Gao & Huang, 2020). When the data distributions held by each node (i.e., the server) are statistically homogeneous, DSGDm works well and improves performance just as SGDm does (Lin et al., 2021). However, in real-world decentralized learning settings, the data distributions may be heterogeneous (Hsieh et al., 2020).
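As a concrete reference for the update rules discussed above, the following is a minimal sketch of SGD and SGDm (heavy-ball momentum). The function names, hyperparameter values, and the toy quadratic objective are our own illustration, not part of the proposed method:

```python
import numpy as np

def sgd_step(x, grad, lr):
    # Plain SGD: move against the (stochastic) gradient.
    return x - lr * grad

def sgdm_step(x, m, grad, lr, beta=0.9):
    # SGDm (heavy ball): the momentum m is an exponential moving
    # average of past gradients, and the parameters are updated
    # with m instead of the raw gradient.
    m = beta * m + grad
    x = x - lr * m
    return x, m

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient at x is x itself.
x = np.array([1.0, -2.0])
m = np.zeros_like(x)
for _ in range(200):
    x, m = sgdm_step(x, m, grad=x, lr=0.1)
```

With β = 0 this reduces exactly to `sgd_step`; with β ∈ (0, 1) the accumulated momentum smooths the update direction across iterations, which underlies the acceleration effect discussed in the text.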
In such cases, DSGDm performs worse than DSGD (i.e., DSGD without momentum acceleration) (Yuan et al., 2021). This is because, when the data distributions are heterogeneous and the momentum is used instead of the stochastic gradient, the model parameters of different nodes are updated in increasingly different directions and drift apart more easily. As a result, the convergence rate of DSGDm falls below that of DSGD. To address this issue, Lin et al. (2021) and Yuan et al. (2021) modified the update rules of the momentum in DSGDm and proposed methods that are more robust to data heterogeneity than DSGDm. However, their convergence rates remain dependent on data heterogeneity, and our experiments reveal that their performance degrades when the data distributions are strongly heterogeneous (Sec. 4).

Table 1: Comparison of the convergence rates. In the "Data-Heterogeneity" column, "✓" indicates that the convergence rate is independent of data heterogeneity, and "(✓)" indicates that it is independent but that data heterogeneity is not discussed either theoretically or experimentally. In the "Momentum," "Stochastic," and "Non-Convex" columns, "✓" indicates, respectively, that the method is accelerated using momentum, that the convergence rate is provided when the stochastic gradient is used, and that the convergence rate is provided when the objective function is non-convex.

Data heterogeneity in decentralized learning has been well studied from both experimental and theoretical perspectives (Hsieh et al., 2020; Koloskova et al., 2020). Subsequently, many methods, including Gradient Tracking (Lorenzo & Scutari, 2016; Nedić et al., 2017), have been proposed, and their convergence rates have been shown not to depend on data heterogeneity (Tang et al., 2018b; Vogels et al., 2021; Koloskova et al., 2021). However, these studies considered only the case where the momentum is not used, and it remains unclear whether these methods are robust to data heterogeneity when the momentum is applied. In the convex optimization literature, Xin & Khan (2020) and Carnevale et al. (2022) proposed combining Gradient Tracking with momentum or Adam and analyzed the convergence rates. However, they considered only the case where the objective function is strongly convex and the full gradient is used, which does not hold in the standard deep learning setting, where the objective function is non-convex and only the stochastic gradient is accessible. Hence, their convergence rates are still unknown in the standard deep learning setting, and it remains unclear whether they are independent of data heterogeneity. Furthermore, these studies did not discuss data heterogeneity, either theoretically or experimentally.

In this work, we propose a decentralized learning method with momentum acceleration, which we call Momentum Tracking, whose convergence rate is proven to be independent of data heterogeneity in the standard deep learning setting. More specifically, we provide the convergence rate of Momentum Tracking in a setting in which the objective function is non-convex and the stochastic gradient is used. We then show that the convergence rate of Momentum Tracking is independent of data heterogeneity for any momentum coefficient β ∈ [0, 1). In Table 1, we compare the convergence rate of Momentum Tracking with those of existing methods. To the best of our knowledge, Momentum Tracking is the first decentralized learning method with momentum acceleration whose convergence rate has been proven to be independent of data heterogeneity in the standard deep learning setting. Experimentally, we demonstrate that Momentum Tracking is more robust to data heterogeneity than existing decentralized learning methods with momentum acceleration and consistently outperforms them when the data distributions are heterogeneous.

2. PRELIMINARIES AND RELATED WORK

2.1 DECENTRALIZED LEARNING

Let G = (V, E) be an undirected graph that represents the underlying network topology, where V denotes the set of nodes and E denotes the set of edges. Let N := |V| be the number of nodes, and we label the nodes in V by the integers {1, 2, …, N} for simplicity. We define N_i := {j ∈

