DOES LEARNING FROM DECENTRALIZED NON-IID UNLABELED DATA BENEFIT FROM SELF SUPERVISION?

Abstract

The success of machine learning relies heavily on massive amounts of data, which are usually generated and stored across a range of diverse and distributed data sources. Decentralized learning has thus been advocated and widely deployed to make efficient use of distributed datasets, with an extensive focus on supervised learning (SL) problems. Unfortunately, the majority of real-world data are unlabeled and can be highly heterogeneous across sources. In this work, we carefully study decentralized learning with unlabeled data through the lens of self-supervised learning (SSL), specifically contrastive visual representation learning. We study the effectiveness of a range of contrastive learning algorithms under a decentralized learning setting, on relatively large-scale datasets including ImageNet-100, MS-COCO, and a new real-world robotic warehouse dataset. Our experiments show that the decentralized SSL (Dec-SSL) approach is robust to the heterogeneity of decentralized datasets, and learns useful representation for object classification, detection, and segmentation tasks, even when combined with the simple and standard decentralized learning algorithm of Federated Averaging (FedAvg). This robustness makes it possible to significantly reduce communication and to reduce the participation ratio of data sources with only minimal drops in performance. Interestingly, using the same amount of data, the representation learned by Dec-SSL can not only perform on par with that learned by centralized SSL which requires communication and excessive data storage costs, but also sometimes outperform representations extracted from decentralized SL which requires extra knowledge about the data labels. 
Finally, we provide theoretical insights into why data heterogeneity is less of a concern for Dec-SSL objectives, and introduce feature alignment and clustering techniques to develop a new Dec-SSL algorithm that further improves performance in the face of highly non-IID data. Our study presents positive evidence to embrace unlabeled data in decentralized learning, and we hope to provide new insights into whether and why decentralized SSL is effective and/or even advantageous.

1. INTRODUCTION

The success of machine learning hinges heavily on access to large-scale and diverse datasets. In practice, most data are generated from different locations, devices, and embodied agents, and stored in a distributed fashion. Examples include a fleet of self-driving cars collecting a massive amount of streaming images under various road and weather conditions during everyday driving, or individuals using mobile devices to take photos of objects and scenery all over the world. Besides being large-scale, these datasets have two salient features: they are heterogeneous across data sources, and mostly unlabeled. For instance, images of road conditions, which are expensive to label, vary across cars driving on highways vs. rural areas, and under sunny vs. snowy weather conditions (Figure 19). Methods that can make the best use of these large-scale distributed datasets can significantly advance the performance of current machine learning algorithms and systems. This has motivated a surge of research in decentralized learning/learning from decentralized data (Konečný et al., 2016; Hsieh et al., 2017; McMahan et al., 2017; Kairouz et al., 2021; Nedic, 2020), where usually a global model is trained on the distributed datasets using communication between the local data sources and a centralized server, or sometimes even only among the local data sources. The goal is typically to reduce or eliminate the exchange of local raw data to save communication costs and protect data privacy. How to mitigate the effect of data heterogeneity remains one of the most important research questions in this area (Zhao et al., 2018; Hsieh et al., 2020; Karimireddy et al., 2020; Ghosh et al., 2020; Li et al., 2021a), as it can heavily degrade the performance of decentralized learning. Moreover, most existing decentralized learning studies have focused on supervised learning (SL) problems that require data labels (McMahan et al., 2017; Jeong et al., 2020; Hsieh et al., 2020). Hence, it remains unclear whether and how decentralized learning can benefit from the large-scale, heterogeneous, and especially unlabeled datasets typically encountered in the real world.
On the other hand, effective methods for learning purely from unlabeled data have been developed and have demonstrated impressive results. Self-supervised learning (SSL), a technique that learns representations by generating supervision signals from the data itself, has unleashed the power of unlabeled data and achieved tremendous successes for a wide range of downstream tasks in computer vision (He et al., 2020; Chen et al., 2020; He et al., 2021b), natural language processing (Devlin et al., 2018; Sarzynska-Wawer et al., 2021), and embodied intelligence (Sermanet et al., 2018; Florence et al., 2018). These SSL algorithms, however, are usually trained in a centralized fashion by pooling all the unlabeled data together, without accounting for the heterogeneous nature of the decentralized data sources.
Very recently, there have been a few contemporaneous/concurrent attempts (He et al., 2021a; Zhuang et al., 2021; 2022; Lu et al., 2022; Makhija et al., 2022) that bridged unsupervised/self-supervised learning and decentralized learning, with a focus on designing better algorithms that mitigate the data heterogeneity issue. In contrast, we revisit this new paradigm and ask the question: Does learning from decentralized non-IID unlabeled data really benefit from SSL? We focus on understanding the use of SSL in decentralized learning when handling unlabeled data. We aim to answer whether and when decentralized SSL (Dec-SSL) is effective (even combined with simple and off-the-shelf decentralized learning algorithms, e.g., FedAvg (McMahan et al., 2017)); what the unique inherent properties of Dec-SSL are compared to its SL counterpart; and how these properties play a role in decentralized learning, especially with highly heterogeneous data. We also aim to validate our observations on large-scale and practical datasets. We defer a more detailed comparison with these most related works to §A.
In this paper, we show that unlike in decentralized (supervised) learning, data heterogeneity can be less concerning in decentralized SSL, with both empirical and theoretical evidence. This leads to more communication-efficient and robust decentralized learning schemes, which can sometimes even outperform their supervised counterpart that assumes the availability of label information. Among the first studies to bridge decentralized learning and SSL, our study provides positive evidence to embrace unlabeled data in decentralized learning, and provides new insights into this setting. We detail our contributions as follows.
Contributions. (i) We show that decentralized SSL, specifically contrastive visual representation learning, is a viable learning paradigm to handle relatively large-scale unlabeled datasets, even when combined with the simple FedAvg algorithm. Moreover, we provide both experimental evidence and theoretical insights that decentralized SSL can be inherently robust to the data heterogeneity across different data sources. This allows more local updates, and can significantly improve the communication efficiency in decentralized learning. (ii) We provide further empirical and theoretical evidence that even when labels are available and decentralized supervised learning (and associated representation learning) is allowed, Dec-SSL still stands out in the face of highly non-IID data. (iii) To further improve the performance of Dec-SSL, we design a new Dec-SSL algorithm, FeatARC, by using an iterative feature alignment and clustering procedure. Finally, we validate our hypothesis and algorithm in practical and large-scale data and task domains, including a new real-world robotic warehouse dataset.

2. PRELIMINARIES AND OVERVIEW

Consider a decentralized learning setting with K different data sources, which might correspond to different devices, machines, embodied agents, or datasets/users that can generate and store data locally. The goal is to collaboratively solve a learning problem, by exploiting the decentralized data from all data sources. More specifically, consider that each data source $k \in [K]$ has a local dataset $D_k = \{x_{k,i}\}_{i=1}^{|D_k|}$, where $x_{k,i} \in X \subseteq \mathbb{R}^d$ are independently and identically distributed (IID) samples from a probability distribution $\mathcal{D}_k$, i.e., $x_{k,i} \sim \mathcal{D}_k$.
Note that the distributions $\mathcal{D}_k$ are in general different across data sources k, yielding an overall heterogeneous (i.e., non-IID) data distribution for the data from all the sources. Let $D = \bigcup_{k \in [K]} D_k$ denote the set of all data samples. Moreover, we are interested in situations where no label is provided alongside the data x. To effectively utilize the large-scale unlabeled data, we resort to self-supervised learning approaches. Specifically, SSL approaches extract representations from these unlabeled data by finding an embedding function $f_w : X \to \mathbb{R}^m$, where w is the parameter of the embedding function. $z = f_w(x)$ is the representation vector that can be useful for downstream tasks, e.g., classification or segmentation. We summarize here several popular SSL approaches that will be used later in the paper.
Self-supervised representation learning. Now consider a given data source $k \in [K]$. There are two popular methods in the SSL community. In contrastive learning (Chen et al., 2020; He et al., 2020) specifically, a sample x is used to provide supervision signals along with two generated positive samples $x^+$ and $x$ (overloaded for notational simplicity) and (possibly multiple) negative samples $x^-$ sampled from the training batch. The goal of SSL is to find an embedding $f_w$ that makes $x$ and $x^+$ close, while keeping $x$ and the $x^-$'s apart, if negative samples are used. One commonly used loss for SSL is the InfoNCE loss (Oord et al., 2018), which has been used in popular SSL approaches such as SimCLR (Chen et al., 2020) and MoCo (He et al., 2020):
$$L_k(w) := \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} -\log \frac{\exp\big(-D(f_w(x_{k,i}), f_w(x^+_{k,i}))/\tau\big)}{\exp\big(-D(f_w(x_{k,i}), f_w(x^+_{k,i}))/\tau\big) + \sum_j \exp\big(-D(f_w(x_{k,i}), f_w(x^-_{k,j}))/\tau\big)} \qquad (2.1)$$
where $\tau > 0$ is a temperature hyperparameter, $j$ is the index for negative samples, and $D(\cdot, \cdot)$ is a distance function such as the cosine distance, i.e., $D(z_1, z_2) = -\frac{z_1 \cdot z_2}{\|z_1\| \|z_2\|}$.
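To make the InfoNCE loss in (2.1) concrete, the following is a minimal NumPy sketch, assuming the embeddings $f_w(\cdot)$ have already been computed. The function names `cosine_distance` and `info_nce`, the array shapes, and the default temperature are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def cosine_distance(a, b):
    """D(z1, z2) = -(z1 . z2) / (||z1|| ||z2||), evaluated over the last axis."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return -np.sum(a * b, axis=-1)

def info_nce(z, z_pos, z_neg, tau=0.5):
    """Mean InfoNCE loss as in (2.1).

    z:     (n, m) anchor embeddings f_w(x)
    z_pos: (n, m) positive embeddings f_w(x^+)
    z_neg: (n, j, m) negative embeddings f_w(x^-), j negatives per anchor
    tau:   temperature (illustrative default, an assumption)
    """
    pos = np.exp(-cosine_distance(z, z_pos) / tau)               # shape (n,)
    neg = np.exp(-cosine_distance(z[:, None, :], z_neg) / tau)   # shape (n, j)
    return float(np.mean(-np.log(pos / (pos + neg.sum(axis=1)))))
```

In this sketch the loss decreases as anchors and positives become better aligned relative to the negatives, which is the behavior the objective is meant to reward.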
Some other effective SSL approaches, such as BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021), remove the terms related to negative samples in (2.1). These methods also add an additional function g, the feature predictor, which is applied only to x to create an asymmetry and avoid collapsed solutions. This usually leads to the following objective:
$$L_k(w) := \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} D\big(g(f_w(x_{k,i})), f_w(x^+_{k,i})\big).$$
In our experiments, we make use of both losses and the SSL approaches associated with them.
Decentralized SSL. To exploit the heterogeneous data distributed at different locations/devices, decentralized SSL optimizes the following global objective:
$$\min_w \sum_{k \in [K]} \frac{|D_k|}{|D|} L_k(w), \qquad (2.2)$$
[...] $k$ for the next round $t+1$. The server then broadcasts the global model to each data source to reset $w^{t+1,0}_k$ as $w^{t+1}$. The number of local updates ($\delta$) determines the communication efficiency (larger $\delta$ means less communication); in the experiments, we use E to denote the number of epochs of local updates (as a surrogate for $\delta$). Both E and the participation rate $\rho$ are important factors that determine the efficiency of decentralized learning. The learned representation $f_w(x)$ can then be used in downstream supervised learning tasks. There are many real-world applications of decentralized SSL, including self-driving cars, warehouse robots, and mobile devices. A further discussion can be found in Appendix §D.
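As a rough illustration of how one FedAvg communication round interacts with the dataset sizes $|D_k|$ and the participation rate $\rho$, consider the sketch below. It follows the standard FedAvg recipe of McMahan et al. (2017); the function name `fedavg_round`, the `local_update` callback, and the choice to renormalize the weights over the participating sources are assumptions made for illustration rather than details taken from the text.

```python
import numpy as np

def fedavg_round(global_w, local_update, sizes, rho, rng):
    """One communication round of FedAvg (illustrative sketch).

    global_w:     current global parameters w^t (NumPy array)
    local_update: hypothetical callback (k, w) -> parameters after the
                  local updates on data source k, starting from w
    sizes:        sizes[k] = |D_k|
    rho:          participation rate in (0, 1]
    rng:          np.random.Generator used to sample participants
    """
    K = len(sizes)
    n_active = max(1, int(round(rho * K)))
    active = rng.choice(K, size=n_active, replace=False)
    # Each participating source starts from the broadcast global model.
    local_ws = [local_update(k, global_w.copy()) for k in active]
    # Weight each source by its share of the participating data;
    # with rho = 1 this equals |D_k| / |D| as in the global objective.
    weights = np.array([sizes[k] for k in active], dtype=float)
    weights /= weights.sum()
    return sum(a * w for a, w in zip(weights, local_ws))
```

With 20 data sources, `rho = 0.05` selects a single source per round and `rho = 1.0` selects all of them; a larger number of local steps inside `local_update` means fewer rounds of communication for the same amount of local computation.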

2.1. OVERVIEW OF OUR STUDY

Terminology & setup. We separate our experiment pipeline into representation learning (pretraining phase) and downstream evaluation (evaluation phase). Our main focus is on the aforementioned Dec-SSL approach. We use FedAvg (McMahan et al., 2017) with SimCLR (Chen et al., 2020) as the default method. Moreover, we will also compare with settings where the label information is available, i.e., the classical decentralized (supervised) learning, which should be more favorable for learning. See Figure 1 for a summary of different settings. The first setting is Dec-SL: we simply run FedAvg on the decentralized labeled data, for end-to-end classification. Dec-SL does not learn representations explicitly, and serves as a natural baseline when labels are available. The second setting is representation learning from Dec-SL, where we train a supervised model with FedAvg, and then use the feature extractor network as the backbone for downstream tasks. This way, we can also learn the representation from decentralized labeled data, and make the comparison with Dec-SSL more fair, since both are learning features for various downstream tasks. We term this setting Dec-SLRep. The evaluation phase tests the representations from Dec-SSL or Dec-SLRep. We consider two protocols in the evaluation phase: linear probing for image classification (Zhang et al., 2016) and finetuning for object detection/segmentation (Doersch et al., 2015).

[Figure 1: Summary of the different settings. Recoverable panel labels: Dec-SSL; Unlabeled Data; Downstream Task; data sources $D_1$, $D_2$, $D_3$, ..., $D_K$; $D$; $f_w$; $L_K(x, x^+, x^-; w)$.]

For classification, we train a linear classifier on top of the frozen pretrained network and evaluate the top-1 classification accuracy. For object detection/segmentation, we finetune the network by using the pretrained weights as initialization and training in an end-to-end fashion, and then we evaluate the mean Average Precision (mAP) metric. Downstream tasks are performed on centralized training and test datasets. Please refer to Appendix §C.1 for implementation details and Table 3 for experiment setups.
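The linear probing protocol can be sketched as follows, assuming the frozen network has already mapped every image to a feature vector. The softmax-regression training loop, the function names `linear_probe` and `top1_accuracy`, and the learning rate and step count are illustrative assumptions; the actual implementation details are in Appendix §C.1 and may differ.

```python
import numpy as np

def linear_probe(feats, labels, n_classes, lr=0.1, steps=200):
    """Fit a linear (softmax) classifier on frozen features.

    feats:  (n, m) features from the frozen pretrained network
    labels: (n,) integer class labels
    Only W and b are trained; the features are never updated.
    """
    n, m = feats.shape
    W = np.zeros((m, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # d(cross-entropy)/d(logits)
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def top1_accuracy(feats, labels, W, b):
    """Fraction of samples whose highest-scoring class is the true label."""
    return float(np.mean(np.argmax(feats @ W + b, axis=1) == labels))
```

Because the backbone stays frozen, the resulting accuracy reflects the quality of the representation rather than the capacity of the classifier.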

Questions of interest.

Through extensive experiments on large-scale datasets, and theoretical analysis in simplified settings, we seek to answer the following questions: (i) How well can decentralized SSL, even instantiated with the simple FedAvg algorithm, rival the performance of its centralized counterpart, and handle the non-IIDness of decentralized unlabeled data? (ii) Is there any unique and inherent property of Dec-SSL compared to its supervised learning counterpart, and how and why may this property benefit decentralized learning, even when the label information is available? (iii) Is there a way to further improve the performance of Dec-SSL in the face of highly non-IID data? Our hypothesis is that SSL, whose objective is not particularly dependent on the x-to-y mappings, learns a relatively uniform representation across decentralized and heterogeneous unlabeled datasets, thus leading to more efficient and robust decentralized learning schemes. We aim to validate this hypothesis and answer these questions in the following sections.

3. DEC-SSL IS EFFICIENT AND ROBUST TO DATA HETEROGENEITY

We first seek to address question (i) in §2.1: how well decentralized SSL performs in the face of non-IID and decentralized unlabeled data. To this end, we first introduce the notion of data heterogeneity in decentralized learning, which is usually categorized as input heterogeneity, label distribution heterogeneity, and heterogeneity in the relationships between the features and labels (Hsieh et al., 2020). We create label heterogeneity by assigning each data source a different proportion of classes; we construct the heterogeneity either by sampling from a Dirichlet process with hyperparameter α or via skewness partitioning (Hsieh et al., 2020) with hyperparameter β. We also create input heterogeneity by leveraging the feature space of a pretrained network on the data. See §C.2 for more details on how we create data heterogeneity across data sources.
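One common way to realize Dirichlet-based label heterogeneity is sketched below: for each class, the share of its samples given to each data source is drawn from a symmetric Dirichlet distribution with concentration α. This is a widely used recipe and an assumption here; the exact procedure used in the paper is described in §C.2 and may differ. The function name `dirichlet_label_partition` is hypothetical.

```python
import numpy as np

def dirichlet_label_partition(labels, n_sources, alpha, rng):
    """Split sample indices across data sources with label skew.

    For each class, per-source proportions are drawn from Dir(alpha):
    a small alpha gives highly non-IID sources, a large alpha gives
    nearly uniform class proportions.
    Returns a list of index lists, one per data source.
    """
    labels = np.asarray(labels)
    parts = [[] for _ in range(n_sources)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        p = rng.dirichlet(alpha * np.ones(n_sources))
        # Convert cumulative proportions into split points within this class.
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for k, chunk in enumerate(np.split(idx, cuts)):
            parts[k].extend(chunk.tolist())
    return parts
```

Every sample is assigned to exactly one source, so the partition changes only how classes are distributed across sources, not the total amount of data.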

3.1. EXPERIMENTAL OBSERVATIONS

[Figure 2: In the pie charts, each pie denotes one data source, and color denotes the sample number of one source of non-IIDness (left to right, more non-IID). We observe that Dec-SSL is surprisingly robust to the non-IIDness in both input (X) and label (Y) and also behaves closer to its centralized counterpart. Y-axis denotes accuracy.]
CIFAR classification under different types of non-IIDness. In this experiment, we construct input and label non-IIDness using 5 data sources in the CIFAR-10 (Krizhevsky et al., 2009) dataset based on the Dirichlet process. The sources of non-IIDness are the feature clusters and labels, respectively. We control the parameter α to create datasets from very IID (each data source has roughly a uniform distribution over 10 classes / 5 feature clusters) to very non-IID (each data source has data from 2 classes / 1 feature cluster). Recall that E denotes the number of epochs for local updates and ρ denotes the participation ratio of data sources at each round. We use E = 50 epochs of local updates in this experiment, which is equivalent to around δ = 1000 iterations, i.e., each local data source updates for 50 epochs independently before averaging. The results are shown in Figure 2. Surprisingly, the performance of downstream classification, with representations trained using decentralized SSL, is very insensitive to the non-IIDness across the datasets and only bears a slight performance drop. This robustness to data non-IIDness is encouraging, and stands in sharp contrast with most existing decentralized supervised learning algorithms, which are known to suffer from data heterogeneity in general (Hsieh et al., 2020). As a baseline, we consider the classical decentralized SL approach of FedAvg, trained over the same non-IID data, but with label information. Indeed, the performance of decentralized SL can drop significantly as the non-IIDness increases.
Finally, we note that the simple use of FedAvg in SSL can achieve performance comparable to centralized SSL, showing that Dec-SSL is an effective decentralized learning scheme to handle unlabeled data.
Finetuning ImageNet representation for COCO detection. In this experiment, we finetune the representations learned from ImageNet on the COCO detection benchmark (Lin et al., 2014) with the Detectron pipeline (Girshick et al., 2018). Specifically, we use ImageNet-100 with ResNet-18 and the 1× training schedule for Mask R-CNN (He et al., 2017) with a ResNet-18 FPN as the backbone. Compared to the contemporary works (Zhuang et al., 2022; Lu et al., 2022) on federated self-supervised learning, our setup is more relevant to real-world applications, as it works on larger-scale and more practical datasets and tasks. We run Dec-SSL on the ImageNet-100 dataset with 5 data sources, and with E = 1 epoch of local updates, which corresponds to around δ = 500 local updates, to learn the global representation using FedAvg. In Table 1 (left), we observe that the representation from Dec-SSL almost reaches the performance of the representation from centralized SSL and improves upon baselines that train the model from scratch, i.e., the "no pretrain" row. This indicates that SSL can learn useful representations in decentralized settings, avoiding the heavy communication cost of centralized learning.
Decentralized SSL for real-world package segmentation. The issues of data heterogeneity and communication efficiency are significant for real-world applications such as those in Amazon warehouses, whose fleets of working robots can generate millions of images per day (see Figure 21 for an illustration). We provide details about the Amazon dataset in §D.1. We use data from one sample warehouse site at Amazon, and split the data based on the session ID (which is usually a sequence of days).
Each decentralized learner is only allowed to access the local data of one session, which is equivalent to the non-IID case where skewness β = 0. We then deploy decentralized self-supervised learning on a subset of the enormous warehouse data, which has around 80,000 images with contour labels output by the Amazon work-cells. We use SimCLR with FedAvg and E = 1 epoch of local updates as the pretraining method. In Table 1 (right), we compare different ways to initialize weights for finetuning, and show that the representations learned from decentralized SSL outperform training from scratch and even match centralized SSL on the Amazon dataset. We also finetune Mask R-CNN for the segmentation task on different fractions of the data, and show that Dec-SSL improves further over training from scratch when less labeled data is available.

3.2. THEORETICAL INSIGHTS

We now provide some theoretical insights into why the objective of Dec-SSL leads to more robust performance in the face of data heterogeneity. In particular, we analyze the property of the solutions to the local and global objectives of Dec-SSL in a simplified setting, and show that the global objective is not affected significantly by the heterogeneity of local datasets. Our setup is inspired by the very recent work of Liu et al. (2021), where the effect of imbalanced data in centralized SSL was studied in a simplified setting. In particular, we generalize the centralized and 3-way classification setting to a decentralized and 2K-way one, carefully design the generation of data distribution across data sources, and establish analyses for both local and global objectives in decentralized SSL. We also improve some analysis therein, and design new metrics to characterize the performance adapted to the decentralized setting. Due to space limitations, we include an abridged introduction here, and defer more details to Appendix §E.

[Figure 3: The learned feature space of SSL is more insensitive to heterogeneity under the linear settings. In §3.2, we consider a decentralized learning setting where each local dataset has a skewed distribution with most data points (each color is a class) concentrated on one axis. Each basis vector inside the sphere denotes how well it is represented in the learned subspace. For contrastive objectives, the learned feature space (green sphere) of the local model is more uniform and close to the global model. On the other hand, the SL objective (red sphere) tends to overfit to the local dataset, and the learned feature spaces become heterogeneous. Panels: $D_1$, $D_2$, ..., $D_K$, each showing SL and SSL feature spaces over basis vectors $e_1$, $e_2$, ....]

[Table 1 (caption, beginning truncated): ... (Tian et al., 2020a) dataset and then finetune on MS-COCO with metrics bounding-box mAP ($AP^{bb}$) and mask mAP ($AP^{mk}$). Right: Finetuning results on the Amazon package segmentation dataset with representations pretrained on the Amazon dataset. We observe that Dec-SSL reaches similar performance ($AP^{mk}$) as centralized SSL and also outperforms training from scratch. Note that 100%, 10%, 1% denote the portion of the data used for finetuning.]

Setup. Consider a Dec-SSL problem with K data sources. Similar to the SimSiam approach, we first augment x, an anchor sample from the dataset, by sampling $\xi, \xi' \sim \mathcal{N}(0, I)$ IID from the Gaussian distribution. Consider the linear embedding function $f_w(x) = wx$, where $w \in \mathbb{R}^{m \times d}$ and $m \geq 2K$. The SSL objective for data source k is given by
$$L_k(w) := -\mathbb{E}\big[(w(x_{k,i} + \xi_{k,i}))^\top (w(x_{k,i} + \xi'_{k,i}))\big] + \frac{1}{2} \|w^\top w\|_F^2, \qquad (3.1)$$
where the expectation $\mathbb{E}$ is taken over the empirical dataset $x_{k,i} \sim D_k$ and the randomness of $\xi_{k,i}$ and $\xi'_{k,i}$. Moreover, recall that the global objective is given in (2.2). Note that (3.1) instantiates the SimSiam loss with the negative inner product, $D(a, b) = -\langle a, b \rangle$, as the distance function and no feature predictor, and with a regularization term for mathematical tractability, as in Liu et al. (2021).
Data heterogeneity. The K data sources collaboratively solve (2.2) to learn a representation for a 2K-way classification task. The K local datasets are generated in a way that for each fixed $k \in [K]$, the labels are skewed in that data from classes $2k-1$ and $2k$ constitute the majority of the data, while other classes are rare, or even unseen. More details on the specifications of data heterogeneity can be found in §E.1. We visualize the heterogeneity of the data distributions in Figure 3. To compare the representations learned across data sources and that learned from jointly solving (2.2), we introduce the following definition on the representability of the representation space.
Definition 3.1 (Representability vector). Let $S \subseteq \mathbb{R}^d$ be the subspace spanned by the rows of the learned feature matrix $w \in \mathbb{R}^{m \times d}$, where the embedding function is $f_w(x) = wx$.
The representability of $S$ is defined as a vector $r = [r_1, \cdots, r_d]^\top \in \mathbb{R}^d$ such that $r_i = \|\Pi_S(e_i)\|_2^2$ for $i \in [d]$, where $\Pi_S(e_i) \in \mathbb{R}^d$ is the projection of the standard basis vector $e_i$ onto $S$, and thus $r_i = \sum_{j=1}^{s} \langle e_i, v_j \rangle^2$, where $s = \dim(S)$ and $\{v_1, \cdots, v_s\}$ is an orthonormal basis of $S$.

The intuition behind this definition is that a good feature space should represent well many of the standard unit basis vectors $e_1, \cdots, e_d$ (which can be used to represent any vector in $\mathbb{R}^d$), i.e., these vectors should have large projections onto it. Note that, as a vector, $r$ provides a quantitative way to compare the representability of two feature spaces across different directions (i.e., different unit basis vectors). In the following theorem, we compare the representability learned by the local objectives and the global one, for Dec-SSL.

Table 2 (top):

| Method / Setting | IID | non-IID |
| FURL (Zhang et al., 2020a) | 71.25 | 68.01 |
| EMA (Zhuang et al., 2022) | 86.26 | 83.34 |
| Per-SSFL (He et al., 2021a) | N/A | 83.10 |
| FEDU (Zhuang et al., 2021) | 83… | |

Participation ratio under high non-IIDness. In this experiment, we split ImageNet-100 into 20 data sources and use $E = 5$ local update epochs. We measure the performance of decentralized learning algorithms with respect to the participation ratio $\rho$ of data sources at each round. For instance, when $\rho = 1$, all data sources update their local weights and upload them to the server at each round, while $\rho = 0.05$ means that a single random data source is selected for update at each round. On the right of Figure 4, we show that with non-IID data, the convergence of Dec-SSL is more stable with fewer participants than that of Dec-SLRep. This allows more efficient decentralized learning, especially when deployed with an extremely large number of data sources and unstable communication channels.
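The participation-ratio protocol described above can be illustrated with a minimal NumPy sketch of one FedAvg round. This is a sketch under stated assumptions, not the authors' implementation: the function names, the `local_update` callback, and the rule of selecting `max(1, round(rho * K))` sources per round are all hypothetical choices made for illustration.

```python
import numpy as np


def sample_participants(num_sources, rho, rng):
    """Pick max(1, round(rho * K)) distinct data sources for one round.

    Assumption: the participation ratio rho is turned into a count by rounding.
    """
    num_selected = max(1, int(round(rho * num_sources)))
    return rng.choice(num_sources, size=num_selected, replace=False)


def fedavg_round(global_weights, local_update, num_sources, rho, rng, sizes=None):
    """One FedAvg round: selected sources train locally, the server averages.

    local_update(k, weights) is a hypothetical callback that runs E local
    epochs on source k and returns the updated weight array.
    """
    selected = sample_participants(num_sources, rho, rng)
    if sizes is None:
        sizes = np.ones(num_sources)  # equal weighting when dataset sizes are unknown
    local_weights = [local_update(k, global_weights.copy()) for k in selected]
    coeffs = np.array([sizes[k] for k in selected], dtype=float)
    coeffs /= coeffs.sum()  # normalize over the participating sources only
    new_weights = sum(c * w for c, w in zip(coeffs, local_weights))
    return new_weights, selected
```

With 20 sources, `rho = 0.05` selects a single source per round and `rho = 1` selects all of them, matching the two cases mentioned in the text.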

4.2. THEORETICAL INSIGHTS

To shed light on the above observations, we analyze the feature spaces learned by the local objective of Dec-SLRep, under the same setup as in §3.2. For Dec-SLRep and each data source $k$, we consider learning a two-layer linear network $g_{u_k, v_k}(x) := v_k u_k x$ as the classifier, where $u_k \in \mathbb{R}^{m \times d}$ and $v_k \in \mathbb{R}^{c \times m}$, and use $u_k x$ as the learned representation for downstream tasks. The network is learned by minimizing $\|(u_k)^\top u_k\|_F^2 + \|(v_k)^\top v_k\|_F^2$ subject to the margin constraint that $[g_{u_k, v_k}(x)]_y \ge [g_{u_k, v_k}(x)]_{y'} + 1$ for all data $(x, y)$ in the local dataset $k$ and all $y' \ne y$. We now have the following proposition on the representations learned by Dec-SLRep across data sources.

Proposition 4.1 (Representations learned by Dec-SLRep across heterogeneous data sources). With high probability, the features $u_k = [u_{k,1}, \cdots, u_{k,m}]^\top \in \mathbb{R}^{m \times d}$ learned from the local dataset $D_k$ satisfy $\sum_{i=1}^{m} \langle u_{k,i}, e_j \rangle^2 \le O(d^{-1/10})$ for $j \in [K] \setminus \{k\}$, while $\sum_{i=1}^{m} \langle u_{k,i}, e_k \rangle^2 \ge 1 - O(d^{-1/20})$. In other words, the correlation between the learned features in $u_k$ and $e_j$ is small for all $j \in [K] \setminus \{k\}$, while the correlation between the features and $e_k$ is large.

The proposition suggests that the feature spaces learned by Dec-SLRep differ significantly across local data sources, given the highly heterogeneous data. More specifically, most of the unit basis vectors in $\{e_1, \cdots, e_K\}$ have small correlations with the features learned at each local data source, while these feature spaces themselves vary significantly across data sources. The unit basis vectors that are not learned might be significant for various other downstream tasks, making the learned representations less favorable. This heterogeneity among local solutions also works against local updates: too many local updates drift each iterate towards its own local solution, so the iterates move far away from each other, hurting the convergence of decentralized learning. Hence, compared with the Dec-SSL case and Theorem 3.2, Dec-SLRep can be less robust to data heterogeneity and less communication-efficient. We note that the advantage of Dec-SSL does not come from using more data, since we use exactly the same data for training Dec-SLRep and Dec-SSL. The intuition is also illustrated in Figure 3. Finally, we remark that the uniformity of features, which is believed to be the key to better transfer performance in SSL (Wang & Isola, 2020; Caron et al., 2020), is not always preferred for specific learning tasks (Burgess et al., 2018).

5. OUR ALGORITHM: FeatARC (FEATURE ALIGNMENT AND CLUSTERING)

Although Dec-SSL tends to learn relatively uniform features that are robust across datasets, uniformity by itself does not imply that features are aligned across datasets: the representation networks from different local data sources can still map the same data point to different regions of the feature space. This misalignment becomes more significant when the data is highly non-IID and can have an adverse effect on the model aggregation process in decentralized learning (Zhang et al., 2020a). To mitigate this issue and address question (iii) in §2.1, we propose to use the same feature distance loss as an auxiliary local objective to align the local models with the global model. The alignment between two features is defined as the negative cosine similarity $D(z_1, z_2) = -\frac{z_1 \cdot z_2}{\|z_1\| \|z_2\|}$.

To further improve the Dec-SSL algorithm, we propose to learn multiple models using a clustering-based approach. In particular, instead of learning a single global model as in (2.2), we learn $C$ models and separate the $K$ data sources into $C$ clusters. The update of the $C$ models and the assignment of data sources to the $C$ clusters are conducted alternately. When $C = K$, the algorithm reduces to learning $K$ local models; when $C = 1$, it reduces to learning a single global one. Intuitively, the clustering approach learns multiple models to interpolate between learning a single global model and $K$ local models, thus achieving a good bias-variance tradeoff when testing on each local dataset (Mansour et al., 2020; Ghosh et al., 2020). However, unlike the supervised learning case, we do not use the loss of the decentralized learning (i.e., (2.1)) as the metric for clustering. This is because, for contrastive learning, it has been observed that the SSL loss might not be indicative enough of the performance of the representation on downstream tasks (Robinson et al., 2021). Hence, we again use the feature alignment distance $D(\cdot, \cdot)$ as the metric for clustering.

We adopt the alignment regularization and clustering techniques to develop a new Dec-SSL algorithm, FeatARC, summarized in Algorithm 1 and Algorithm 2 in the Appendix. We show the performance of FeatARC in Figure 5, in comparison with different baselines including FedAvg, under different levels of data heterogeneity and communication frequency. FeatARC consistently outperforms the baselines, including the variants that use only alignment ("Align Only") or only clustering ("Cluster Only"). Moreover, in the top part of Table 2, we show that FeatARC also outperforms other recent decentralized self-supervised learning algorithms on the CIFAR-10 dataset.
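The two ingredients described in this section, the alignment distance $D$ and the clustering by feature alignment, can be sketched as follows. This is only an illustrative sketch and not Algorithm 1 or Algorithm 2 from the Appendix: it assumes linear encoders, a shared probe batch per source, and a hypothetical regularization weight `lam`; all function names are introduced here for illustration.

```python
import numpy as np


def alignment_distance(z1, z2, eps=1e-12):
    """Mean negative cosine similarity D(z1, z2) between matching rows."""
    z1 = z1 / (np.linalg.norm(z1, axis=1, keepdims=True) + eps)
    z2 = z2 / (np.linalg.norm(z2, axis=1, keepdims=True) + eps)
    return float(-np.mean(np.sum(z1 * z2, axis=1)))


def local_objective(ssl_loss, z_local, z_global, lam=1.0):
    """Local SSL loss plus an alignment penalty towards the global model.

    lam is a hypothetical weight; the source does not specify its value.
    """
    return ssl_loss + lam * alignment_distance(z_local, z_global)


def assign_clusters(local_encoders, cluster_encoders, probe_batches):
    """Assign each source to the cluster model whose features align best.

    Assumption: encoders are linear maps (m x d matrices) and each source k
    evaluates all models on its own probe batch of shape (n, d).
    """
    assignments = []
    for w_k, x_k in zip(local_encoders, probe_batches):
        z_local = x_k @ w_k.T
        dists = [alignment_distance(z_local, x_k @ v_c.T) for v_c in cluster_encoders]
        assignments.append(int(np.argmin(dists)))
    return assignments
```

Using the alignment distance rather than the SSL loss for the assignment step mirrors the design choice stated above; alternating this assignment with per-cluster model updates would give the clustering loop.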

6. EXTENSIONS

In this section, we discuss a few extended experiments of our framework. Please see Appendix §B for a thorough set of experiments and ablation studies with visualizations.

6.1. FULLY DECENTRALIZED CASE AND DIFFERENT NETWORK TOPOLOGY

We conduct experiments on fully decentralized learning in Appendix §B.5, where the local data sources are only allowed to communicate with their neighbors over a peer-to-peer network, without a centralized server. In short, most of our observations regarding Dec-SSL in the setting with a centralized server still hold, even under several different network topologies. This aligns with our theoretical insights in Section 3, which stem from the benign properties of the solution to the Dec-SSL objective, rather than from the properties of the specific algorithms (averaging the iterates via a star or other network topologies) that achieve the solution.
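As a rough illustration of server-free averaging, the sketch below performs neighbor averaging over a ring. The ring topology, the uniform 1/3 mixing weights, and the function names are assumptions made for this example; the topologies actually used in Appendix §B.5 are not described in this section.

```python
import numpy as np


def ring_mixing_matrix(num_nodes):
    """Doubly stochastic matrix: each node averages itself and its two ring neighbors."""
    mix = np.zeros((num_nodes, num_nodes))
    for i in range(num_nodes):
        for j in (i - 1, i, i + 1):
            mix[i, j % num_nodes] += 1.0 / 3.0
    return mix


def gossip_step(weights, mix):
    """One peer-to-peer averaging step.

    weights has shape (num_nodes, dim); row i holds node i's flattened model.
    """
    return mix @ weights
```

Because the mixing matrix is doubly stochastic, each step preserves the network-wide average of the weights, and repeated steps drive all nodes towards that average without any central server.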

6.2. EXTREMELY HETEROGENEOUS CASE FOR DECENTRALIZED LEARNING

In Figure 13, we show that even in the extremely heterogeneous case where each local source only owns one class, the Dec-SSL framework is still robust to the non-IIDness of the data. This also holds true when we scale to more clients, as shown in Figure 15. The Dec-SSL objective is not biased by the highly heterogeneous class labels at each local dataset, while the Dec-SL objective can be. This is consistent with our theoretical insights in §3.2. The key reason for the success of Dec-SSL is that, despite each local dataset having only a single class, the feature information obtained from local datasets may still be useful for the joint classification of all the classes.

6.3. COMPARISON OF FEATARC WITH OTHER ALGORITHMS

We also compare our algorithm with Dec-SSL algorithms that are combined with other federated learning algorithms, including FedProx (Li et al., 2020a) and FedBN (Li et al., 2020b). In Figure 16 (Left), we show that our proposed FeatARC outperforms these two baselines.

7. CONCLUSION

We propose the framework of decentralized SSL, which learns representations from non-IID unlabeled data, and conduct an empirical study on the robustness of Dec-SSL to different types of heterogeneity, communication constraints, and participation rates of data sources. We also provide findings and theoretical analyses of Dec-SSL compared to its supervised learning counterpart, and develop a new algorithm to further address the high heterogeneity in decentralized datasets.



Code is available at https://github.com/liruiw/Dec-SSL

Hereafter, we often use decentralized learning as a shorthand for learning from decentralized data.




Figure 1: Comparisons among Dec-SL, Dec-SLRep, and Dec-SSL.

Figure 2: SSL objective is robust to different types of X and Y heterogeneity on the CIFAR-10 dataset.

Figure 5 panel titles: Ablation Study [ρ = 1, K = 5, E = 5]; Algorithm Comm. [ρ = 1, K = 5, α = 0.02].

Figure 5: Ablation study on the FeatARC algorithm. We observe that under non-IIDness and communication constraints, FeatARC outperforms the baseline variants of the algorithm and FedAvg.

y ′ + 1 for all data (x, y) in the local dataset k with all y ′ ̸ = y. . We now have the following proposition on the representations learned by Dec-SLRep across data sources. Proposition 4.1 (Representations learned by Dec-SLRep across heterogeneous data sources). With high probability, the features u

Left: Object detection and semantic segmentation finetuned on COCO: The model is pretrained on ImageNet-100

Table 2: Top: Algorithm performance comparison. Bottom: CIFAR-10 linear probing on the representation of CIFAR-100. Our algorithm surpasses previous works on federated SSL in both the IID and non-IID settings.


Acknowledgement. This work is supported in part by Amazon.com Services LLC, PO2D-06310236 and Defense Science & Technology Agency, DST00OECI20300823. L.W. was supported by the MIT EECS Xianhong Wu Graduate Fellowship. K.Z. also acknowledges support from the Simons-Berkeley Research Fellowship. We thank MIT Supercloud for providing compute resources. The authors would like to thank Phillip Isola at MIT and Andrew Marchese at Amazon for many helpful discussions.

annex

Theorem 3.2 (Representability of local vs. global objectives for Dec-SSL). For decentralized SSL in the setting described above, with high probability, the representability vector learned from any local objective of source k, denoted by [...] Moreover, the representability vector learned from the global objective, denoted by r [...]

Theorem 3.2 states that the feature spaces learned from local SSL objectives are relatively uniform, in the sense that for the K basis directions e 1 , • • • , e K that generate the data, any two data sources have similar representability in all but two of these directions, especially when the dimension d of the data is large. Furthermore, when solving the global objective (2.2), the learned representation is also uniform, and its representability differs in at most one direction from that of each local data source. Note that the results hold with highly heterogeneous data across data sources. In other words, Dec-SSL is not affected significantly by the non-IIDness of the data, justifying the empirical observations in §3.1. An illustration of the results can also be found in Figure 3.

Intuition & implication. The main intuition behind Theorem 3.2 is that the objective of SSL is not biased by the heterogeneous distribution of labels at each local dataset, and tends to learn uniform representations. Related arguments have also been made in recent works on the theoretical understanding of contrastive learning/SSL (Wang & Isola, 2020; Liu et al., 2021). In the decentralized setting, this insensitivity to data heterogeneity becomes even more relevant, as it potentially allows each local data source to perform many more local updates without drifting the iterates significantly. This enables more communication-efficient decentralized learning schemes, in contrast to most existing ones, which are vulnerable to data non-IIDness. We validate these points next.
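The representability vector of Definition 3.1, which Theorem 3.2 compares across local and global solutions, can be computed numerically as in the sketch below. The function names and the tolerance parameters are assumptions introduced for illustration; the orthonormal basis of the row space is obtained here via an SVD, which is one of several equivalent ways to do it.

```python
import numpy as np


def representability(w, tol=1e-10):
    """Return r with r_i = ||Pi_S(e_i)||_2^2, where S is the row space of w (m x d)."""
    _, sing, vt = np.linalg.svd(w, full_matrices=False)
    # Rows of vt with non-negligible singular values form an orthonormal basis of S.
    basis = vt[sing > tol * sing.max()]
    # r_i = sum_j <e_i, v_j>^2 is the squared norm of column i of the basis matrix.
    return np.sum(basis ** 2, axis=0)


def differing_directions(r_a, r_b, threshold):
    """Count basis directions whose representability differs by more than threshold.

    threshold is a hypothetical cutoff chosen by the user.
    """
    return int(np.sum(np.abs(np.asarray(r_a) - np.asarray(r_b)) > threshold))
```

Each entry of `r` lies in [0, 1] and the entries sum to the dimension of `S`, so comparing two such vectors entry by entry shows which basis directions one feature space captures and the other does not.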

4. DEC-SSL CAN BE FAVORABLE EVEN WHEN LABELS ARE AVAILABLE

We here seek to address question (ii) in §2.1: how do the unique properties of Dec-SSL, such as its robustness to data heterogeneity, benefit decentralized learning? While the lack of labels seems to be a limitation, we show that this might not be the case in decentralized learning with heterogeneous data. First, it is known that decentralized SL in general performs poorly when the data is highly heterogeneous (Zhao et al., 2018; Hsieh et al., 2020). Further, even in the decentralized representation learning setting where labels are available, Dec-SSL still stands out in the face of highly non-IID data.

To make a fair comparison, we mainly compare Dec-SSL with Dec-SLRep (recall the definition in §2.1), which are both decentralized representation learning approaches. We defer the comparison with Dec-SL to Appendix §B. We conduct experiments on both the ImageNet and CIFAR-10 datasets, and evaluate the performance of the learned representations as we vary two commonly used quantities in decentralized learning: the number of local update epochs E and the participation ratio of data sources ρ. We consistently observe that Dec-SSL outperforms Dec-SLRep in learning representations in terms of communication efficiency and participation ratio, especially with highly non-IID data. We remark that such observations are also consistent with those on object detection and semantic segmentation given in Table 1.

4.1. EXPERIMENTAL OBSERVATIONS

In this experiment, we train and evaluate the feature backbone on ImageNet-100 in a decentralized setting. We create non-IIDness across the local datasets based on label skewness and use β = 0.1 (each data source has only 10% of its data coming from the uniform class distribution).

Communication efficiency under high non-IIDness. In Figure 4, we show that in the non-IID scenario, averaging weights with an infrequent communication schedule causes less trouble for Dec-SSL than for Dec-SLRep. In FedAvg, the idea of averaging weights after multiple epochs might sound sub-optimal, but we notice that decentralized SSL is very robust with respect to this parameter. Intuitively, the robustness of Dec-SSL allows each local model to drift for longer, leading to a lower communication frequency for decentralized learning.
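One possible reading of the label-skew construction with parameter β used in this experiment is sketched below. It is an assumption, not the authors' procedure: each sample goes to a uniformly random source with probability β and otherwise to the source that owns its class, so that, for balanced classes, roughly a β fraction of each source's data follows the uniform class distribution. The function name and the block-wise class ownership are hypothetical.

```python
import numpy as np


def label_skew_partition(labels, num_sources, beta, rng):
    """Split sample indices into num_sources label-skewed local datasets."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Assumption: classes are split into contiguous blocks, one block per source.
    owner_of_class = {
        c: k
        for k, block in enumerate(np.array_split(classes, num_sources))
        for c in block
    }
    owners = np.array([owner_of_class[c] for c in labels])
    random_sources = rng.integers(num_sources, size=len(labels))
    go_uniform = rng.random(len(labels)) < beta
    assigned = np.where(go_uniform, random_sources, owners)
    return [np.flatnonzero(assigned == k) for k in range(num_sources)]
```

Setting `beta = 0` gives each source only its own block of classes, while `beta = 1` spreads every class uniformly across sources; `beta = 0.1` corresponds to the highly skewed regime described in the text under this reading.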

