REVISITING THE ACTIVATION FUNCTION FOR FEDERATED IMAGE CLASSIFICATION

Anonymous authors
Paper under double-blind review

Abstract

Federated learning (FL) has become one of the most popular distributed machine learning paradigms; it enables training on a large corpus of decentralized data that resides on devices. Recent progress in FL research is mainly credited to refinements of the training procedure through new optimization methods. However, there has been little verification of other technical improvements, especially of the activation functions (e.g., ReLU) that are widely used in the conventional centralized approach (i.e., standard data-centric optimization). In this work, we verify the effectiveness of activation functions in various federated settings. We empirically observe that off-the-shelf activation functions designed for centralized settings exhibit a markedly different performance trend in federated settings. The experimental results demonstrate that HardTanh achieves the best accuracy under severe data heterogeneity or a low participation rate. We provide a thorough analysis of why the representation power of activation functions changes in a federated setting by measuring similarities in terms of weight parameters and representations. Lastly, we deliver guidelines for selecting activation functions in both a cross-silo setting (i.e., number of clients ≤ 20) and a cross-device setting (i.e., number of clients ≥ 100). We believe that our work provides benchmark data and intriguing insights for designing FL models.

1. INTRODUCTION

Federated learning (FL) has become a common and ubiquitous paradigm for collaborative machine learning (Bonawitz et al., 2019; Caldas et al., 2018; Kairouz et al., 2019; Li et al., 2020; 2019; Shokri & Shmatikov, 2015; McMahan et al., 2017; Smith et al., 2017) because it maintains data privacy. Each client (e.g., a mobile device or an entire business) communicates with the central server by transferring its trained model but not its data; all local updates are aggregated into a global server-side model. Although a centralized method enhances generalization by employing a large amount of training data, the behavior of FL methods differs from that of a centralized method owing to data non-IIDness, client resource capability, and model communication (Kairouz et al., 2021; McMahan et al., 2017; Zhao et al., 2018). Most FL studies focus on improving the performance of the global model by applying a new regularizer in the optimization algorithm. For instance, a proximal term is attached to the local objective to enhance the method's stability (Acar et al., 2021; Karimireddy et al., 2020; Li et al., 2021; 2020), so that the local model does not diverge from the global model. Some studies (Hsu et al., 2019; Lin et al., 2020; Wang et al., 2020a;b; Yurochkin et al., 2019) improve the aggregation step by refining how local models are weight-averaged into the server model. When additional public or synthetic datasets are allowed, the server corrects the model weights by utilizing their data distribution for balancing (Zhao et al., 2018; Jeong et al., 2018; Goetz & Tewari, 2020; Hao et al., 2021). Recently, there has been increasing demand for personalizing models to each client. Jiang et al. (2019) and Fallah et al. (2020) attempt to train personalized models for each client with a few rounds of fine-tuning rather than focusing on the performance of the server model.
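The two ingredients above, server-side weight averaging and a proximal regularizer on the local update, can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's code; models are flattened to plain lists of floats, and the names `fedavg_aggregate` and `proximal_penalty` are ours.

```python
# Minimal sketch of FedAvg-style aggregation (McMahan et al., 2017) and a
# FedProx-style proximal term. Parameter vectors are plain lists of floats.

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of client parameter vectors by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    agg = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for i in range(dim):
            agg[i] += (n / total) * w[i]
    return agg

def proximal_penalty(local_w, global_w, mu=0.01):
    """Proximal term (mu/2) * ||w - w_global||^2 added to the local loss so
    the local model does not drift far from the global model."""
    return 0.5 * mu * sum((a - b) ** 2 for a, b in zip(local_w, global_w))

# Two clients with unequal data sizes: the larger client dominates the average.
server_w = fedavg_aggregate([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])
print(server_w)  # [2.5, 3.5]
```

In real systems the averaging runs over state dicts of tensors per round, but the weighting logic is the same.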
In consideration of system heterogeneity (i.e., clients having different computational and communication capabilities), Avdiukhin & Kasiviswanathan (2021) mitigate model communication problems by using asynchronous local stochastic gradient descent (SGD), and Horvath et al. (2021) improve accuracy under heterogeneous resource capacities by using different model sizes per client. Despite the popularity of FL, some options for federated model optimization remain under-explored. Designing FL-friendly training recipes is essential to optimizing model performance, but few studies have attempted to design a new recipe instead of reusing those intended for a centralized setting. Charles et al. (2021) present an empirical analysis of the impact of hyperparameter settings on federated training dynamics from the perspective of a large cohort size. However, activation functions in FL (McMahan et al., 2017; Karimireddy et al., 2020; Li et al., 2019) have rarely been studied, although they play a crucial role in facilitating generalization and convergence. We thus raise the question: Do activation functions that are popular in centralized settings also produce good optima in FL? To answer this question, we conduct a pilot experiment comparing performance in centralized and FL settings. Figure 1 (b) and (c) show the accuracy of neural networks trained under a centralized setting and an FL setting as the activation function is replaced. Surprisingly, a neural network with Tanh attains better accuracy than one with ReLU, which is a silver bullet in centralized deep learning. These observations lead us to the intriguing question of whether off-the-shelf activation functions intended for a centralized setting also perform appropriately in the FL setting. In this work, we answer this question with thorough empirical evaluations: the most recently developed activation functions tend to degrade the performance of the server model as the heterogeneity becomes more severe. Several considerations (e.g., the total number of clients, client participation, non-IIDness) may significantly improve the selection of activation functions, and combining these considerations may further boost model accuracy.
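The activation functions compared in the pilot experiment differ mainly in boundedness and smoothness, which can be made concrete with pure-Python definitions. This is an illustrative sketch, not the paper's implementation.

```python
import math

# The three activation shapes contrasted in the pilot experiment.

def relu(x):
    # Unbounded above: large pre-activations pass through unchanged.
    return max(0.0, x)

def tanh(x):
    # Saturates smoothly toward the open interval (-1, 1).
    return math.tanh(x)

def hardtanh(x, min_val=-1.0, max_val=1.0):
    # Piecewise-linear clipping to [min_val, max_val] (Collobert et al., 2011):
    # bounded like tanh, but exactly linear (identity) in between.
    return max(min_val, min(max_val, x))

for f in (relu, tanh, hardtanh):
    print(f.__name__, f(-2.0), f(0.5), f(2.0))
```

The bounded functions cap each layer's output range, while ReLU lets activations grow without limit, a difference that matters when many independently drifting local models are averaged.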
We experiment with various activation functions, including functions that are widely used and rarely used in the centralized setting, across various environments based on CIFAR-10 and CIFAR-100. The experiments identify an interesting phenomenon in FL: applying activation functions such as ReLU in stacked convolutional layers yields low accuracy owing to the shape of the function. We also provide an analysis of the representation power of the different activation functions for federated image classification. Our key contributions are summarized as follows: • We provide guidelines for selecting activation functions in FL. FL has the following special considerations: number of clients, participation ratio, and non-IIDness. We provide guide-
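The non-IIDness consideration is commonly realized in FL benchmarks by label-skew partitioning, where each class is spread over clients according to Dirichlet-distributed proportions. The sketch below uses only the standard library (Dirichlet draws via normalized gamma samples); function names are ours and the exact partitioning scheme of the paper's experiments is not specified here.

```python
import random

def dirichlet_proportions(k, alpha, rng):
    """Sample a length-k Dirichlet(alpha) vector via normalized gamma draws."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

def partition_labels(labels, num_clients, alpha=0.5, seed=0):
    """Assign each sample index to one client; smaller alpha = more skew."""
    rng = random.Random(seed)
    num_classes = max(labels) + 1
    # For each class, a preference distribution over clients.
    class_over_clients = [dirichlet_proportions(num_clients, alpha, rng)
                          for _ in range(num_classes)]
    clients = [[] for _ in range(num_clients)]
    for idx, y in enumerate(labels):
        r, acc, chosen = rng.random(), 0.0, num_clients - 1
        for c, p in enumerate(class_over_clients[y]):
            acc += p
            if r < acc:
                chosen = c
                break
        clients[chosen].append(idx)
    return clients

labels = [i % 10 for i in range(1000)]           # 10 classes, as in CIFAR-10
clients = partition_labels(labels, num_clients=5, alpha=0.1)
print([len(c) for c in clients])                 # uneven sizes under small alpha
```

With a large alpha the partition approaches IID; driving alpha toward zero concentrates each class on a few clients, the heterogeneous regime in which the paper reports HardTanh performing best.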



Figure 1: (a) Plots of different activation functions. (b), (c) Accuracies on CIFAR-10 and CIFAR-100, respectively, according to the different activation functions. The blue line indicates the performance of models trained on a single central server. The orange line indicates the performance of models trained in an FL environment where 20 of the total 100 clients participate in training per round. We use a model with four convolution layers and one classifier. Here, 'Linear' indicates a model without any activation function. On a single machine that centralizes the training data, the more up-to-date the activation function, the better the performance (blue line). In contrast, interestingly, HardTanh (Collobert et al., 2011) yields the best server accuracy on both CIFAR-10 and CIFAR-100 under heterogeneous scenarios (orange line). A detailed explanation of the activation functions is provided in Appendix A.

Availability: //anonymous.4open.

