REVISITING THE ACTIVATION FUNCTION FOR FEDERATED IMAGE CLASSIFICATION

Anonymous authors
Paper under double-blind review

Abstract

Federated learning (FL) has become one of the most popular distributed machine learning paradigms; it enables training on a large corpus of decentralized data that resides on devices. The recent evolution in FL research is mainly credited to refinements in training procedures through the development of optimization methods. However, there has been little verification of other technical improvements, especially improvements to the activation functions (e.g., ReLU) that are widely used in the conventional centralized approach (i.e., standard data-centric optimization). In this work, we verify the effectiveness of activation functions in various federated settings. We empirically observe that off-the-shelf activation functions used in centralized settings exhibit a totally different performance trend in federated settings. The experimental results demonstrate that HardTanh achieves the best accuracy when severe data heterogeneity or a low participation rate is present. We provide a thorough analysis to investigate why the representation powers of activation functions change in a federated setting by measuring similarities in terms of weight parameters and representations. Lastly, we deliver guidelines for selecting activation functions in both a cross-silo setting (i.e., number of clients ≤ 20) and a cross-device setting (i.e., number of clients ≥ 100). We believe that our work provides benchmark data and intriguing insights for designing FL models.

1. INTRODUCTION

Federated learning (FL) has become a common and ubiquitous paradigm for collaborative machine learning (Bonawitz et al., 2019; Caldas et al., 2018; Kairouz et al., 2019; Li et al., 2020; 2019; Shokri & Shmatikov, 2015; McMahan et al., 2017; Smith et al., 2017) because it maintains data privacy. Each client (e.g., a mobile device or an entire business) communicates with the central server by transferring its trained model but not its data; all local updates are aggregated into a global server-side model. Although a centralized method enhances generalization by employing a large amount of training data, the properties of FL methods differ from those of a centralized method owing to data non-IIDness, client resource capability, and model communication (Kairouz et al., 2021; McMahan et al., 2017; Zhao et al., 2018). Most FL studies focus on improving the performance of the global model by applying a new regularizer in the optimization algorithm. For instance, a proximal term is attached to the local update objective to enhance stability (Acar et al., 2021; Karimireddy et al., 2020; Li et al., 2021; 2020), so that the local model does not diverge from the global model. Some studies (Hsu et al., 2019; Lin et al., 2020; Wang et al., 2020a; b; Yurochkin et al., 2019) improve the aggregation step of the local models through weight averaging for the server model. When additional public or synthetic datasets are allowed, the server proofreads the weights of the models by utilizing their data distribution for balancing (Zhao et al., 2018; Jeong et al., 2018; Goetz & Tewari, 2020; Hao et al., 2021). Recently, there has been increasing demand for the personalization of models for each client. Jiang et al. (2019) and Fallah et al. (2020) attempt to train personalized models for each client with a few rounds of fine-tuning rather than focusing on the performance of the server model.
In consideration of system heterogeneity (i.e., clients having different computational and communication capabilities), Avdiukhin & Kasiviswanathan (2021) mitigate model communication problems by using asynchronous local stochastic gradient descent (SGD), and Horvath et al. (2021) improve accuracy under heterogeneous resource capacity by using different model sizes per client. Despite the popularity of FL, some options for federated model optimization remain under-explored. Designing FL-friendly training recipes is essential to optimizing model performance, but few studies have attempted to design a new recipe instead of using those intended for a centralized setting. Charles et al. (2021) present an empirical analysis of the impact of hyperparameter settings on federated training dynamics from the perspective of a large cohort size. However, activation functions in FL (McMahan et al., 2017; Karimireddy et al., 2020; Li et al., 2019) have rarely been studied, although they play a crucial role in facilitating generalization and convergence. We thus raise the question: Do activation functions that are popular in centralized settings also produce good optima in FL? To answer this question, we conduct a pilot experiment to compare performance in centralized settings with performance in FL settings. Figure 1 (b) and (c) show the accuracy of neural networks trained under a centralized setting and an FL setting when the activation function is replaced.
Surprisingly, a neural network with Tanh achieves better accuracy than one with ReLU, which is a silver bullet in centralized deep learning. These observations lead us to an intriguing question: Do off-the-shelf activation functions that are intended for a centralized setting also perform appropriately in the FL setting? In this work, we answer this question with thorough empirical evaluations: the most recently developed activation functions tend to degrade the performance of the server model as the heterogeneity becomes more severe. Several considerations (e.g., the total number of clients, client participation, non-IIDness) may significantly improve the selection of the activation function, and combining these considerations may further boost model accuracy. We experiment with various activation functions, including functions that are widely used and rarely used in the centralized setting, in various environments based on CIFAR-10 and CIFAR-100. The experiments identify an interesting phenomenon in FL: applying activation functions like ReLU in stacked convolutional layers yields low accuracy owing to the shape of the function. We also provide an analysis of the representation power of different activation functions for federated image classification. Our key contributions are summarized as follows:

• We provide guidelines for selecting activation functions in FL. FL has the following special considerations: number of clients, participation ratio, and non-IIDness. We provide guidelines for cross-silo settings (i.e., number of clients ≤ 20) and cross-device settings (i.e., number of clients ≥ 100); the suitability of an activation function depends on the situation.

• We provide an explanation for the performance degradation (i.e., the performance difference between centralized and FL settings) of activation functions that are preferred in a centralized setting.
Specifically, we measure similarities in the weight parameters and representations, and visualize the landscape.

• We empirically show that the HardTanh activation function (Collobert et al., 2011) leads to a better optimum than other activation functions such as ReLU (Nair & Hinton, 2010), Leaky ReLU (Maas et al., 2013), and GeLU (Hendrycks & Gimpel, 2016) under a severe non-IID setting, a low participation rate, and a large number of clients. Additionally, we provide benchmark data for activation functions in FL for various models.

2. RELATED WORK

2.1. ACTIVATION FUNCTIONS

Recent activation functions include Swish (Ramachandran et al., 2017), Mish (Misra, 2019), and GeLU (Hendrycks & Gimpel, 2016). Activation functions have been devised to address the gradient exploding/vanishing problem, where the magnitudes of gradients become either near zero or infinite during backward propagation. A common choice has been ReLU, which enables efficient propagation. On the other hand, ReLU is not differentiable at zero, and it causes a significant number of dying neurons by discarding information during propagation. Recent works (Ramachandran et al., 2017; Misra, 2019) have achieved smoother optima in centralized learning by designing self-regularized gradients, whereas their behavior in FL settings remains largely unexplored.

2.2. FL METHODS

Federated optimization methods handle multiple clients without collecting their data, and they use server weights from a central server to coordinate the global model across the network. In particular, these methods aim to minimize the following objective function:

min_w f(w), where f(w) = (1/N) ∑_{k=1}^{N} f^(k)(w),   (1)

where f^(k) is the loss function of client k and N is the total number of clients. At each round, K ≪ N clients are selected from the total set of devices. The selected clients each run local SGD for E local epochs, and the server ultimately aggregates the selected models:

w^t = (1/K) ∑_{k∈S_t} w_k^t,   (2)

where w^t is the server weight at the t-th round, w_k^t is client k's weight after local training starting from w^{t-1}, and S_t is the selected client set. In the FL environment, the global model can drift when optimizing the local clients because statistical data non-IIDness causes different local optima that are far apart from each other; this is called client drift (Karimireddy et al., 2020; Khaled et al., 2019; Reddi et al., 2020). McMahan et al. (2017) empirically show the significance of additionally tuning the hyperparameters in FL training. We examine this from the architectural and operational side. Additional details of related work are explained in Appendix A.
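The aggregation step in Eq. (2) is a parameter-wise average of the selected clients' weights. A minimal sketch in plain Python (the function and variable names are ours, not the paper's; real implementations would operate on framework tensors):

```python
# Hedged sketch of the FedAvg aggregation step, Eq. (2):
# w^t = (1/K) * sum_{k in S_t} w_k^t, applied parameter-wise.
# Each client's model is represented here as a dict of flat parameter lists.

def average_weights(client_weights):
    """Average a list of client weight dicts parameter-wise."""
    K = len(client_weights)
    return {
        key: [sum(w[key][i] for w in client_weights) / K
              for i in range(len(client_weights[0][key]))]
        for key in client_weights[0]
    }

# Two toy clients with a single-layer "model"
clients = [{"conv1": [1.0, 2.0]}, {"conv1": [3.0, 4.0]}]
print(average_weights(clients))  # {'conv1': [2.0, 3.0]}
```

In a full FedAvg round, this average would be computed over the K sampled clients' post-local-training weights and broadcast back as the next round's server model.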

3. EXPERIMENTS

In this section, we compare several activation functions. We categorize the activation functions into two groups: (1) ReLU, Leaky ReLU, Swish, Mish, and GeLU as recent state-of-the-art (SOTA) activation functions that are widely used in centralized settings; and (2) Tanh and HardTanh as Tanh-like activation functions that are not widely used in centralized settings.

3.1. EXPERIMENTAL SETUP

Dataset and non-IID Settings. Two benchmark datasets are employed: CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). We provide descriptions of the datasets in Appendix B. To randomize the non-IID data, we assume that each client's training data use class labels drawn from an independent categorical distribution over N classes parameterized by a vector q, with q_i ≥ 0 for i ∈ [1, N] and ∑_{i∈[1,N]} q_i = 1. For the heterogeneous distribution, the Dirichlet distribution (Hsu et al., 2019; Yurochkin et al., 2019), q ∼ Dir(α), is used, where α is an N-length concentration vector with all elements α > 0; that is, the prior distribution over the N classes controls the heterogeneity of clients.

Models. Our study focuses on compact models that are realistically deployable in FL. Therefore, we mainly use a simple ConvNet having four convolutional layers and one classifier; ConvNet4 refers to ConvNet with four convolutional layers. The first convolution layer has 64 kernels, and deeper layers have a larger number of kernels (O'Shea & Nash, 2015). For additional models, which have shortcut and batch normalization layers, we use Resnet20, Resnet32, Resnet44 (He et al., 2016), and MobileNetv2 (Sandler et al., 2018). Details of the settings and the model architectures of ConvNet are provided in Appendix B.

Training Details. In this study, we conduct numerical experiments by changing the number of clients N, the client participation ratio R, and the Dirichlet concentration constant α. We adopt FedAvg and perform 200 rounds with 5 local epochs using a learning rate of 0.01, with a learning-rate decay of 0.1 at the 50th and 75th rounds, a weight decay of 1e-4, and a momentum of 0.9. The number of clients available differs across FL settings: a cross-silo setting involves a small number of clients, whereas a cross-device setting requires a large number of clients. For the cross-silo setting, we use N = 20 and R = 0.2, i.e., 4 clients are selected at each round.
For the cross-device setting, we use N = 100 and R = 0.2, i.e., 20 clients are selected at each round. We mainly demonstrate the training of ConvNet4 on CIFAR-10 distributed heterogeneously by modifying α in the Dirichlet distribution and the client participation rate R. The captions state the N, R, and α values for each experiment.
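The Dirichlet-based non-IID split described above can be sketched as follows. This is an illustrative partitioner under the usual interpretation of Dir(α) label splitting (the function name and NumPy-based implementation are ours, not the paper's code):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with Dir(alpha) class proportions.

    For each class, a proportion vector over clients is drawn from a
    symmetric Dirichlet; small alpha -> highly skewed (non-IID) splits.
    """
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    client_idx = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # fraction of class-c samples assigned to each client
        props = rng.dirichlet(alpha * np.ones(n_clients))
        splits = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, splits)):
            client_idx[k].extend(part.tolist())
    return client_idx

# Toy example: 100 samples, 2 classes, 5 clients, strongly non-IID
labels = np.array([0] * 50 + [1] * 50)
parts = dirichlet_partition(labels, n_clients=5, alpha=0.1)
print([len(p) for p in parts])  # highly unbalanced client sizes
```

With a large α (e.g., 10) the per-client class proportions approach uniform, recovering a near-IID split.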

3.2. COMPARATIVE EXPERIMENTS ON THE CHANGES IN ACTIVATION FUNCTIONS

Table 1 shows the results of both centralized and FL settings using CIFAR-10 and CIFAR-100 as the datasets. In the centralized setting, GeLU shows the best performance, and the other recent SOTA activation functions surpass the Tanh-like activation functions. However, in the FL setting, the activation functions show a significantly different tendency: HardTanh achieves the highest accuracy. Furthermore, the recent SOTA activation functions show lower accuracy than Linear when CIFAR-100 is used as the dataset. The activation functions also show different accuracy drops; the recent SOTA activation functions have drops of nearly 40 points, whereas HardTanh and Tanh drop by only 26.21 and 28.42 points on CIFAR-10. As a result, we find that the most popular activation function, ReLU (as well as the other recent SOTA activation functions), does not show outstanding performance in an FL setting.

3.3. STRATEGIES FOR SELECTING ACTIVATION FUNCTIONS IN FL

This section presents the experimental results and guidelines for selecting activation functions in various FL settings. FL settings have various environmental limitations relative to centralized settings; there are additional components to consider, such as the number of clients, non-IIDness, and the participation ratio.

Table 1: Server accuracy of ConvNet4 in centralized and FL settings on two datasets (CIFAR-10, CIFAR-100). Centralized settings train one server model using all training data, and FL settings train 100 clients with non-IID data. We use R = 0.2 and α = 0.1. For all tables afterwards, we bold the highest accuracy except for Linear.

As the number of clients increases, the overall drop in accuracy increases. The Tanh-like activation functions surpass the recent SOTA activation functions at larger client numbers (20, 50, 100, 200). Additionally, the accuracy difference between the recent SOTA activation functions and the Tanh-like activation functions grows as the number of clients increases. Considering these observations, we hypothesize that as the number of clients increases, recent SOTA activation functions are increasingly affected and show a more significant accuracy drop.

Non-IIDness. With the Dirichlet distribution parameter α, we can control the IIDness of the data: a larger value of α indicates lower non-IIDness (lower heterogeneity). Table 3 presents the accuracy for different values of α. The overall reduction in accuracy rises as non-IIDness does. In most cases, HardTanh shows the highest accuracy. For 20 clients, the accuracy of the recent SOTA activation functions surpasses the Tanh-like activation functions at low non-IIDness. The shape of the recent SOTA activation functions causes a severe accuracy drop at high non-IIDness, which we discuss in Section 4.
For 100 clients, the Tanh-like activation functions are virtually unaffected by non-IIDness and outperform the recent SOTA activation functions. The low accuracy of the Tanh-like activation functions at α = 0.01 occurs because of the tough training setting, under which the Tanh-like activation functions fail to find an optimum, as Linear does. We conduct an additional experiment for α = 0.01 with a learning rate of 0.005 to compare the Tanh-like and recent SOTA activation functions in a regime where the Tanh-like activation functions do not fail to train. Table 13 in Appendix C shows that the Tanh-like activation functions surpass the recent SOTA activation functions when they do not fail to find an optimum.

Participation Ratio. The participation of clients in FL is limited depending on the environment. In a cross-silo setting, high participation may be possible, whereas only limited participation is possible in a cross-device setting. Table 4 shows the accuracy in FL settings for four different values of the participation ratio R (we use N = 100 and N = 20 with α = 0.1). As client participation decreases, the overall accuracy drop increases. For 100 clients, the Tanh-like activation functions achieve the highest accuracy. With 20 clients, however, there is a noticeably larger accuracy drop as participation decreases for the recent SOTA activation functions. As a result, as the participation ratio decreases, the accuracy of the Tanh-like activation functions overtakes that of the recent SOTA activation functions. With increased client participation, client drift diminishes and the accuracy reduction of the recent SOTA activation functions decreases.

The observations above show that the components of the FL setting affect accuracy differently depending on which activation function is used.
The number of clients is the most dominant component, and it interacts with the effect of the other components when a small number of clients is used. Therefore, the Tanh-like activation functions are favorable for a large number of clients, such as in a cross-device setting. In a cross-silo setting, the Tanh-like activation functions are preferred for data with high non-IIDness and a low ratio of client participation; conversely, the recent SOTA activation functions are preferred for data with low non-IIDness and a high ratio of client participation. We also run additional experiments with a different FL method and different models. We choose FedProx as the additional FL method and Resnet20, Resnet32, Resnet44, and MobileNetv2 as the additional models. For the additional models, we use only ReLU and Leaky ReLU from the set of recent SOTA activation functions. For more experimental results, refer to Appendix C.

3.4. ADDITIONAL EXPERIMENTS

FedProx. As mentioned in Section 1, studies that improve the server model in FL add proximal terms to the local objective to enhance stability, so that the local model rarely diverges from the global model. We choose FedProx (Li et al., 2020) as an additional FL method because it is the most basic method that improves the server model by adding a proximal term. The details of FedProx are given in Appendix A. Table 5 shows the accuracy of different activation functions using FedProx as the learning algorithm. As with FedAvg, HardTanh achieves the best accuracy, followed by the other Tanh-like activation functions. Proximal terms in local training do not appear to help prevent the accuracy loss.

Table 6: Server accuracy of Resnet using four different Dirichlet constants α (10, 1, 0.1, 0.01) with a fixed client participation ratio R = 0.2, and four different client participation ratios R (0.4, 0.3, 0.2, 0.1) with a fixed Dirichlet constant α = 0.1. We use N = 100 with CIFAR-10 as the dataset.

Resnet. Table 6 shows the results of Resnet20, Resnet32, and Resnet44 with four different values of α and R. The Tanh-like activation functions achieve better accuracy than the recent SOTA activation functions. Even when employing models with batch normalization and shortcut layers, the Tanh-like activation functions outperform the recent SOTA activation functions. For α = 0.01 with Resnet32 as the model, the Tanh-like activation functions fail to find the optimum and perform poorly.


MobileNetv2. Table 7 shows the results of MobileNetv2 with four different values of R for N = 100 and α = 0.1. The Tanh-like activation functions surpass the recent SOTA activation functions. In particular, the accuracy of the recent SOTA activation functions is lower than that of Linear. As the participation ratio rises, the recent SOTA activation functions surpass the Tanh-like activation functions on the CIFAR-10 dataset, mirroring the result in Table 4.

Table 7: Server accuracy of MobileNetv2 with four different participation ratios R (0.1, 0.2, 0.3, 0.4). We use N = 100 with α = 0.1.

4. ANALYSIS

We present thorough investigations of model behavior and the changes in representation during the local training to answer the question: Do recent SOTA activation functions have a disadvantage in an FL setting?

4.1. INVESTIGATION OF WEIGHT PARAMETERS AND LATENT REPRESENTATIONS

The shape of the activation function changes accuracy in the FL setting. The activation function selects the important features to pass through each layer, and the model is trained using these features. According to Figure 2 (a), the number of selected features varies with the shape of the activation function. In a conventional centralized setting, a single model can access all the data and select the optimal training features. However, in an FL setting, each client can only access a portion of the data, partitioned under the non-IID condition, and each client trains its model to select the features that are important to itself. This results in the phenomenon known as client drift. During the FL aggregation step, features that are important for the global optimum cannot be selected due to client drift. This phenomenon appears to be severe when the recent SOTA activation functions are used: owing to the shape of these functions, the excluded features are greater in number than for the Tanh-like activation functions, and a severe accuracy drop occurs. In addition, when heterogeneity increases, the failure to select features under the recent SOTA activation functions reaches its peak. Put simply, the Tanh-like activation functions have low sensitivity to the accuracy drop in the FL aggregation step because they exclude far fewer features than the recent SOTA activation functions do. In this respect, deeper models that achieve greater accuracy in a centralized context perform poorly in an FL situation because they reject a greater number of features as the model gets deeper. This perspective can be used to explain the empirical results in Section 3: a greater number of clients, high data non-IIDness, and a low client participation rate all contribute to a high level of heterogeneity.
These findings suggest that the sensitivity of an activation function in the FL aggregation step, where the accuracy drop occurs, is indicated by its shape. The accuracy drop grows as the number of features excluded by an activation function increases, and it reaches its maximum under severe client drift: a large number of clients, a low client participation ratio, and high data non-IIDness.

5. CONCLUSION

This study clarifies that the drop in accuracy varies according to the activation function in FL. Our key finding is that the accuracy of the recent SOTA activation functions drops in an FL setting due to the shape of the functions, and HardTanh outperforms other activation functions in most environments. Additionally, we provide guidelines and benchmark data for selecting activation functions in various FL settings.

A.2 ALGORITHMS OF FEDERATED LEARNING METHODS

We use both FedAvg and FedProx as federated learning methods. Algorithm 1 shows FedAvg, and Algorithm 2 shows FedProx. FedProx is similar to FedAvg in that it selects a subset of clients at each round, performs local training, and then averages the clients' weights to generate a global update. The difference between FedAvg and FedProx appears in line 6: for local training, FedAvg trains each client's model using SGD on its local data, whereas FedProx trains each client with an additional proximal term, (µ/2)∥w − w^t∥². The proximal term contributes to the method's stability by efficiently limiting the impact of local weight changes.

Algorithm 1 Federated Averaging (FedAvg)
1: Input: K, T, η, E, w^0, N, k ∈ [1, …, N]
2: for t = 0, …, T−1 do
3:   Server selects a subset S_t randomly, which includes K devices

Table 16 shows the results of Resnet20, Resnet32, and Resnet44 for all combinations of R and α with N = 100. For all values of R and α, Resnet20 with HardTanh shows the highest accuracy. Owing to the shortcut layers, a deeper layer can use features that the activation function has excluded via a shortcut (He et al., 2016), which helps to prevent the recent SOTA activation functions' accuracy drop. As a result, with Resnet32 and Resnet44, the recent SOTA activation functions have a small accuracy gap with the Tanh-like activation functions and surpass them in some conditions.

Table 14: Server accuracy of Resnet20 with four different α and four different participation ratios R, where N = 100. The rightmost columns show the accuracy drop as non-IIDness increases.
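The FedProx modification in line 6 can be sketched as one local gradient step on the regularized objective f_k(w) + (µ/2)∥w − w^t∥², whose extra gradient contribution is µ(w − w^t). Flat lists stand in for model parameters here, and the function names are illustrative, not the paper's code:

```python
# Hedged sketch of one FedProx local SGD step. The proximal gradient
# mu * (w - w_global) pulls the local weights back toward the server weights.

def prox_gradient(w, w_global, mu):
    """Gradient of (mu/2) * ||w - w_global||^2 with respect to w."""
    return [mu * (wi - gi) for wi, gi in zip(w, w_global)]

def fedprox_local_step(w, grad_f, w_global, mu, lr):
    """One SGD step on f_k(w) + (mu/2)||w - w_global||^2.

    grad_f is the gradient of the client's own loss f_k at w.
    Setting mu = 0 recovers the plain FedAvg local update.
    """
    prox = prox_gradient(w, w_global, mu)
    return [wi - lr * (g + p) for wi, g, p in zip(w, grad_f, prox)]

w_new = fedprox_local_step(w=[1.0, 2.0], grad_f=[0.5, 0.5],
                           w_global=[0.0, 0.0], mu=0.1, lr=0.1)
print(w_new)  # ≈ [0.94, 1.93]
```

With µ = 0 the proximal contribution vanishes and the step reduces to the FedAvg local update, which matches the relationship between Algorithms 1 and 2.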



Figure 1: (a) Plots of different activation functions. (b), (c) Accuracies on CIFAR-10 and CIFAR-100, respectively, for the different activation functions. The blue line indicates the performance of models trained on a single central server. The orange line indicates the performance of models trained in an FL environment where 20 of the total 100 clients participate in training per round. We use a model having four convolution layers and one classifier. Here, 'Linear' indicates a model without any activation function. On a single machine that centralizes the training data, the more up-to-date the activation function used, the better the performance (blue line). In contrast, interestingly, HardTanh (Collobert et al., 2011) yields the best server accuracy for both CIFAR-10 and CIFAR-100 under heterogeneous scenarios (orange line). A detailed explanation of the activation functions is provided in Appendix A.

Figure 2: (a) demonstrates the feature distribution of all class-0 images in the CIFAR-10 test dataset after passing through the first convolution layer and its activation function; we flatten the feature values and plot their distribution. (b) shows the accuracy of ConvNet in a centralized setting with different widths and depths. (c) shows the accuracy of ConvNet in a federated setting with different widths and depths. For all three panels, we use N = 100, R = 0.2, and α = 0.1.

In Figure 2 (c), deeper ConvNets with recent SOTA activation functions in an FL setting demonstrate a significant reduction in accuracy, whereas Figure 2 (b) demonstrates an accuracy improvement for deeper ConvNets in a centralized setting.

Figure 4: Weight differences between the 10 clients selected in Figure 3. We calculate the weight difference by subtracting each client's weights at the same layer from another client's and normalizing the result with the L2 norm.
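The measurement behind Figure 4 can be sketched as a pairwise L2 distance over the clients' weights at a given layer. Flat lists stand in for layer weights, and the helper names are ours; the exact normalization used in the figure is not fully specified, so this shows the unnormalized pairwise distance:

```python
import math

# Illustrative sketch of a pairwise weight-difference measurement between
# clients' copies of the same layer, as in Figure 4 (names are ours).

def l2_diff(w_a, w_b):
    """L2 norm of the element-wise difference between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w_a, w_b)))

def pairwise_diffs(client_layers):
    """Matrix of pairwise L2 differences for one layer across clients."""
    n = len(client_layers)
    return [[l2_diff(client_layers[i], client_layers[j]) for j in range(n)]
            for i in range(n)]

# Two toy clients' copies of the same layer
clients = [[0.0, 0.0], [3.0, 4.0]]
print(pairwise_diffs(clients))  # [[0.0, 5.0], [5.0, 0.0]]
```

Large off-diagonal entries in this matrix indicate strong client drift: the clients' local optima for that layer have moved far apart before aggregation.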

4: for i = 0, …, E−1 do
6:   Selected device k ∈ S_t updates its local weight w_k^{t+1} using SGD with step size η
     Each device k ∈ S_t sends its local weight w_k^{t+1} back to the server
9:   Server aggregates the local weights w_k^{t+1} and obtains the new server weight w^{t+1} = (1/K) ∑_{k∈S_t} w_k^{t+1}

Algorithm 2 FedProx
1: Input: K, T, η, µ, E, w^0, N, k ∈ [1, …, N]
2: for t = 0, …, T−1 do
3: …

Figure 6 shows ConvNet with five different depths (3, 4, 5, 6, and 7). Each version of ConvNet has the number of convolution layers indicated by the number after the model name (e.g., ConvNet3 has three convolution layers and ConvNet7 has seven). The details of each convolution layer are shown in Table 8. ConvNets of varying depth employ convolution layers in the order shown in Table 8 (e.g., ConvNet3 uses Conv1, Conv2, and Conv3, whereas ConvNet7 uses Conv1 through Conv7).

Client drift indicates the inconsistency between local optima. Recently, some works have prevented client drift by designing aggregation methods; Wang et al. (2020b) present a method of normalized averaging that removes objective inconsistency, and Zhang et al. (2021) propose a training algorithm for group knowledge transfer, which allows each client to keep a personalized prediction on the server to assist the local training of other clients. Federated Averaging (FedAvg (McMahan et al.

Table 2: Server accuracy of ConvNet4 with five different numbers of clients N (10, 20, 50, 100, 200). We use α = 0.1 with R = 0.2.

Number of Clients. The number of clients varies across FL strategies: a cross-silo setting uses fewer than 20 clients, and a cross-device setting uses more than 100 clients. The total accuracy drop grows as the number of clients rises. Table 2 shows the accuracy for different numbers of clients.

Table 3: Server accuracy of ConvNet4 with four different Dirichlet constant values α (0.01, 0.1, 1, 10). We use N = 100 and N = 20 with R = 0.2.

Table 4: Server accuracy of ConvNet4 with four different participation ratios R (0.1, 0.2, 0.3, 0.4). We use N = 100 and N = 20 with α = 0.1.

Table 5: Server accuracy of ConvNet4 using FedProx with the following settings: N = 100, R = 0.2, and α = 0.1.


In Table 2, as the number of clients increases, in Table 3 with N = 20, as the Dirichlet constant decreases, and in Table 4, as the participation rate R decreases, heterogeneity increases; as a result, the recent SOTA activation functions demonstrate a significant drop in accuracy and are surpassed by the Tanh-like activation functions.

B.2 DATASET STATISTICS

We use both CIFAR-10 and CIFAR-100 in our experiments. As shown in Table 9, CIFAR-10 consists of 60,000 images of size 32×32. It is divided into 10 classes, each consisting of 6,000 images; each class has 5,000 training images and 1,000 test images. CIFAR-100 also consists of 60,000 images of size 32×32. It is divided into 100 classes, each consisting of 600 images. These 100 classes are grouped into 20 superclasses, which we do not use in our experiments. Each class has 500 training images and 100 test images.


A.1 ACTIVATION FUNCTIONS

In deep neural networks, using activation functions is a ubiquitous technique for learning non-linear latent representations; an input signal is transformed into a non-linear output. Recent evolution has come with enhancements in representation power and computational efficiency. In the experiments, we use the following non-linear activation functions:

Tanh. A smooth, zero-centered hyperbolic tangent function:

Tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})

It is zero-centered but computationally expensive, and it causes vanishing-gradient problems as neural networks become deeper.

HardTanh. A piece-wise linear variant of Tanh with lower computational cost:

HardTanh(x) = −1 if x < −1; x if −1 ≤ x ≤ 1; 1 if x > 1

ReLU. An element-wise threshold operation in which values less than zero are set to zero and the rest are used as-is. The equation of ReLU (Hahnloser et al., 2000; Jarrett et al., 2009; Nair & Hinton, 2010) is:

ReLU(x) = max(0, x)

ReLU is the abbreviation of Rectified Linear Unit. When the input value is negative, the gradient of its output is zero, so the model does not learn there. ReLU has shown great performance with convolutional neural networks (CNNs) (Krizhevsky et al., 2012; LeCun et al., 2015b). Since ReLU is computationally cheap, it is still commonly used despite numerous attempts to replace it (Maas et al., 2013; Goodfellow et al., 2013; He et al., 2015; Clevert et al., 2015; Klambauer et al., 2017; Elfwing et al., 2018).

Leaky ReLU. A variant of ReLU designed to prevent the dying-neuron problem; its gradient is never zero during training. The equation of Leaky ReLU (Maas et al., 2013) is:

LReLU(x) = max(0.01x, x)

It is ReLU multiplied by a tiny constant on the negative part.
Due to the small negative slope, its graph is almost identical to ReLU's.

Swish. An unbounded function that uses the sigmoid in its formula. The curve is similar to ReLU but smooth and non-monotonic. The equation of Swish (Ramachandran et al., 2017) is:

Swish(x) = x · σ(x), where σ(x) = 1/(1 + e^{−x})

This function shows better accuracy than ReLU in deep neural networks regardless of batch size.

Mish. A self-regularized non-monotonic activation function. The curve is similar to Swish, but it has a stronger regularization effect and a smoother gradient. The equation of Mish (Misra, 2019) is:

Mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x))

This function allows gradients to flow better than ReLU's zero bound because it admits some negative values.

GeLU. A smooth, Gaussian-distribution-based variation on ReLU. It is often used in NLP tasks and in vision tasks with Vision Transformer (Dosovitskiy et al., 2020) models. The equation of GeLU (Hendrycks & Gimpel, 2016) is:

GeLU(x) = x · Φ(x), where Φ is the standard Gaussian CDF

GeLU is derived by combining the characteristics of dropout, zoneout, and ReLU.

The experimental results in Section 3 fix variables such as N, R, and α. Table 10 shows the results for all combinations of R and α with N = 20, Table 11 for all combinations with N = 100, and Table 12 for CIFAR-100 with all combinations of R and α and N = 100. Comparing Table 10 and Table 11, we can see that with a larger number of clients, the Tanh-like activation functions surpass the recent SOTA activation functions. Additionally, on the more complex dataset (CIFAR-100) shown in Table 12, the Tanh-like activation functions outperform the recent SOTA activation functions with a bigger accuracy gap. The low accuracy of the Tanh-like activation functions on CIFAR-10 with α = 0.01 occurs because of the tough training setting, under which they fail to find the optimum. To check the performance when the Tanh-like activation functions do not fail to find the optimum, we use an additional setting with a learning rate of 0.005.
Table 13 shows the accuracy of ConvNet4 with the Dirichlet constant fixed at α = 0.01, for which the Tanh-like activation functions failed to find the optimum in Table 10 and Table 11. The Tanh-like activation functions surpass the recent SOTA activation functions in all conditions. In addition, the Tanh-like activation functions surpass the highest α = 0.01 accuracy of the recent SOTA activation functions reported in Table 10 and Table 11.
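The definitions above can be collected into pure-Python reference implementations. These follow the standard formulas for each function (GeLU in its exact Gaussian-CDF form rather than the tanh approximation); they are illustrative, not the experiment code:

```python
import math

# Reference implementations of the activation functions in Appendix A,
# using their standard mathematical definitions.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def hardtanh(x):
    # Identity on [-1, 1], saturated at +/-1 outside
    return max(-1.0, min(1.0, x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x):
    return max(0.01 * x, x)

def swish(x):
    return x * sigmoid(x)

def softplus(x):
    return math.log1p(math.exp(x))

def mish(x):
    return x * math.tanh(softplus(x))

def gelu(x):
    # Exact form: x * Phi(x), with Phi the standard Gaussian CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(hardtanh(0.5), hardtanh(2.0), hardtanh(-2.0))  # 0.5 1.0 -1.0
```

Note the qualitative split discussed in the paper: tanh and hardtanh are bounded and zero-centered, while relu, leaky_relu, swish, mish, and gelu are unbounded above and pass little or no signal (or heavily attenuated signal) for negative inputs.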

