ON NEURAL NETWORK GENERALIZATION VIA PROMOTING WITHIN-LAYER ACTIVATION DIVERSITY Anonymous

Abstract

During the last decade, neural networks have been intensively used to tackle various problems and they have often led to state-of-the-art results. These networks are composed of multiple jointly optimized layers arranged in a hierarchical structure. At each layer, the aim is to learn to extract hidden patterns needed to solve the problem at hand and forward them to the next layers. In the standard form, a neural network is trained with gradient-based optimization, where the errors are back-propagated from the last layer back to the first one. Thus, at each optimization step, neurons at a given layer receive feedback from neurons belonging to higher layers of the hierarchy. In this paper, we propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage diversity of the activations within the same layer. To this end, we measure the pairwise similarity between the outputs of the neurons and use it to model the layer's overall diversity. By penalizing similarities and promoting diversity, we encourage each neuron to learn a distinctive representation and, thus, to enrich the data representation learned within the layer and to increase the total capacity of the model. We theoretically study how the within-layer activation diversity affects the generalization performance of a neural network in a supervised context and we prove that increasing the diversity of hidden activations reduces the estimation error. In addition to the theoretical guarantees, we present an empirical study confirming that the proposed approach enhances the performance of neural networks.

1. INTRODUCTION

Neural networks are a powerful class of non-linear function approximators that have been successfully used to tackle a wide range of problems. They have enabled breakthroughs in many tasks, such as image classification (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012a), and anomaly detection (Golan & El-Yaniv, 2018). Formally, the output of a neural network consisting of $P$ layers can be defined as follows: $f(\mathbf{x}; \mathcal{W}) = \phi_P(W_P\,\phi_{P-1}(\cdots\phi_2(W_2\,\phi_1(W_1\mathbf{x}))))$, where $\phi_i(\cdot)$ is the element-wise activation function, e.g., ReLU or sigmoid, of the $i^{th}$ layer and $\mathcal{W} = \{W_1, \dots, W_P\}$ are the corresponding weights of the network. The parameters of $f(\mathbf{x}; \mathcal{W})$ are optimized by minimizing the empirical loss $\hat{L}(f) = \frac{1}{N}\sum_{i=1}^{N} l\big(f(\mathbf{x}_i; \mathcal{W}), y_i\big)$, where $l(\cdot)$ is the loss function and $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$ are the training samples with their associated ground-truth labels. The loss is minimized using gradient-descent-based optimization coupled with back-propagation. However, neural networks are often over-parameterized, i.e., they have more parameters than training data. As a result, they tend to overfit the training samples and fail to generalize to unseen examples (Goodfellow et al., 2016). While research on double descent (Belkin et al., 2019; Advani et al., 2020; Nakkiran et al., 2020) shows that over-parameterization does not necessarily lead to overfitting, avoiding overfitting has been extensively studied (Neyshabur et al., 2018; Nagarajan & Kolter, 2019; Poggio et al., 2017) and various approaches and strategies have been proposed, such as data augmentation (Goodfellow et al., 2016), regularization (Kukačka et al., 2017; Bietti et al., 2019; Arora et al., 2019), and dropout (Hinton et al., 2012b; Wang et al., 2019; Lee et al., 2019; Li et al., 2016), to close the gap between the empirical loss and the expected loss.
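For concreteness, the layered composition $f(\mathbf{x}; \mathcal{W})$ above can be sketched as a plain forward pass. This is a minimal NumPy illustration; the shapes and random weights are arbitrary stand-ins, not a trained model:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def forward(x, weights, activations):
    """Forward pass f(x; W) = phi_P(W_P phi_{P-1}(... phi_1(W_1 x)))."""
    h = x
    for W, phi in zip(weights, activations):
        h = phi(W @ h)
    return h

# Toy 2-layer network: hidden layer of 4 ReLU units, linear scalar output.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((1, 4))
x = rng.standard_normal(3)
y_hat = forward(x, [W1, W2], [relu, lambda t: t])
print(y_hat.shape)  # (1,)
```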
Diversity of learners is widely known to be important in ensemble learning (Li et al., 2012; Yu et al., 2011) and, particularly in the deep learning context, diversity of the information extracted by the network neurons has been recognized as a viable way to improve generalization (Xie et al., 2017a; 2015b). In most cases, these efforts have focused on making the set of weights more diverse (Yang et al.; Malkin & Bilmes, 2009). However, diversity of the activations has not received much attention. Inspired by dropout's aim of preventing the co-adaptation of neuron activations, Cogswell et al. (2016) proposed to regularize the activations of the network with an additional loss based on the cross-covariance of the hidden activations, which encourages the neurons to learn diverse, non-redundant representations. The proposed approach, known as Decov, has empirically been shown to alleviate overfitting and to improve the generalization ability of neural networks, yet a theoretical analysis of this effect has so far been lacking. In this work, we propose a novel approach to encourage activation diversity within the same layer. We propose complementing the 'between-layer' feedback with an additional 'within-layer' feedback that penalizes similarities between neurons of the same layer. Thus, we encourage each neuron to learn a distinctive representation and to enrich the data representation learned within each layer. Moreover, inspired by Xie et al. (2015b), we provide a theoretical analysis showing that the within-layer activation diversity boosts the generalization performance of neural networks and reduces overfitting. Our contributions in this paper are as follows:
• Methodologically, we propose a new approach to encourage the 'diversification' of the layer-wise feature maps' outputs in neural networks. The proposed approach has three variants based on how the global diversity is defined.
The main intuition is that by promoting within-layer activation diversity, neurons within the same layer learn distinct patterns and, thus, increase the overall capacity of the model.
• Theoretically, we analyze the effect of the within-layer activation diversity on the generalization error bound of neural networks. The analysis is presented in Section 3. As shown in Theorems 3.7, 3.8, 3.9, 3.10, 3.11, and 3.12, we express the upper bound of the estimation error as a function of the diversity factor. Thus, we provide theoretical evidence that within-layer activation diversity can help reduce the generalization error.
• Empirically, we show that within-layer activation diversity boosts the performance of neural networks. Experimental results show that the proposed approach outperforms the competing methods.

2. WITHIN-LAYER ACTIVATION DIVERSITY

We propose a diversification strategy, where we encourage neurons within a layer to activate in a mutually different manner, i.e., to capture different patterns. To this end, we propose an additional within-layer loss which penalizes neurons that activate similarly. The loss function $\hat{L}(f)$ defined in equation 2 is augmented as follows: $\hat{L}_{aug}(f) = \hat{L}(f) + \lambda \sum_{i=1}^{P} J_i$, where $J_i$ expresses the overall pair-wise similarity of the neurons within the $i^{th}$ layer and $\lambda$ is the penalty coefficient for the diversity loss. As in (Cogswell et al., 2016), our proposed diversity loss can be applied to a single layer or to multiple layers in a network. For simplicity, let us focus on a single layer. Let $\phi^i_n(x_j)$ and $\phi^i_m(x_j)$ be the outputs of the $n^{th}$ and $m^{th}$ neurons in the $i^{th}$ layer for the same input sample $x_j$. The similarity $s_{nm}$ between the $n^{th}$ and $m^{th}$ neurons can be obtained as the average similarity measure of their outputs over the $N$ input samples. We use the radial basis function to express the similarity: $s_{nm} = \frac{1}{N}\sum_{j=1}^{N} \exp\big(-\gamma\|\phi^i_n(x_j) - \phi^i_m(x_j)\|^2\big)$, where $\gamma$ is a hyper-parameter. The similarity $s_{nm}$ can be computed over the whole dataset or batch-wise. Intuitively, if two neurons $n$ and $m$ have similar outputs for many samples, their corresponding similarity $s_{nm}$ will be high. Otherwise, their similarity $s_{nm}$ is small and they are considered "diverse". Based on these pair-wise similarities, we propose three variants for the global diversity loss $J_i$ of the $i^{th}$ layer:
• Direct: $J_i = \sum_{n \neq m} s_{nm}$. In this variant, we model the global layer similarity directly as the sum of the pairwise similarities between the neurons. By minimizing their sum, we encourage the neurons to learn different representations.
• Det: $J_i = -\det(S)$, where $S$ is the similarity matrix defined as $S_{nm} = s_{nm}$. This variant is inspired by the Determinantal Point Process (DPP) (Kulesza & Taskar, 2010; 2012), as the determinant of $S$ measures the global diversity of the set.
Geometrically, $\det(S)$ is the volume of the parallelepiped formed by the vectors in the feature space associated with $S$. Vectors that result in a larger volume are considered to be more "diverse". Thus, maximizing $\det(\cdot)$ (minimizing $-\det(\cdot)$) encourages the diversity of the learned features.
• Logdet: $J_i = -\log\det(S)$. This variant has the same motivation as the second one. We use $\log\det$ instead of $\det$ as $\log\det$ is a convex function over the positive definite matrix space. (Note that $\log\det$ is defined only if $S$ is positive definite; in our case, $S$ can be shown to be positive semi-definite, so in practice we use a regularized version $S + I$ to ensure positive definiteness.)
It should be noted here that the first proposed variant, i.e., direct, similar to Decov (Cogswell et al., 2016), captures only the pairwise diversity between components and is unable to capture higher-order "diversity", whereas the other two variants consider the global similarity and are able to measure diversity in a more global manner. Our newly proposed loss function defined in equation 3 has two terms. The first term is the classic loss function: it computes the loss with respect to the ground truth and, in back-propagation, this feedback is propagated from the last layer to the first layer of the network. Thus, it can be considered as a between-layer feedback, whereas the second term is computed within a layer. From equation 3, we can see that our proposed approach can be interpreted as a regularization scheme. However, regularization in deep learning is usually applied directly on the parameters, i.e., the weights (Goodfellow et al., 2016; Kukačka et al., 2017), while in our approach, similar to (Cogswell et al., 2016), an additional term is defined over the output maps of the layers. For a layer with $C$ neurons and a batch size of $N$, the additional computational cost is $O(C^2(N+1))$ for the direct variant and $O(C^3 + C^2 N)$ for both the determinant and log-determinant variants.
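The similarity of equation 4 and the three diversity variants can be sketched as follows. This is an illustrative NumPy implementation assuming scalar neuron outputs per sample; the function names and the identity regularizer `S + np.eye(C)` for the determinant-based variants follow the discussion above, while all other details are our own:

```python
import numpy as np

def similarity_matrix(A, gamma=1.0):
    """Pairwise RBF similarities of neuron outputs.
    A: (N, C) array of activations for N samples and C neurons.
    s_nm = (1/N) * sum_j exp(-gamma * (phi_n(x_j) - phi_m(x_j))**2)
    """
    diff = A[:, :, None] - A[:, None, :]            # (N, C, C)
    return np.exp(-gamma * diff ** 2).mean(axis=0)  # (C, C)

def diversity_loss(A, variant="direct", gamma=1.0):
    S = similarity_matrix(A, gamma)
    C = S.shape[0]
    if variant == "direct":
        return S.sum() - np.trace(S)            # sum over n != m
    if variant == "det":
        return -np.linalg.det(S + np.eye(C))    # regularized for positive definiteness
    if variant == "logdet":
        sign, logdet = np.linalg.slogdet(S + np.eye(C))
        return -logdet
    raise ValueError(variant)

# Identical neurons are maximally similar and incur the largest direct penalty.
A_same = np.ones((5, 3))
print(diversity_loss(A_same, "direct"))  # 6.0: all 3*2 off-diagonal similarities are 1
```

Minimizing any of these quantities alongside the task loss pushes the off-diagonal similarities down, i.e., pushes the neurons apart.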

3. GENERALIZATION ERROR ANALYSIS

In this section, we analyze how the proposed within-layer diversity regularizer affects the generalization error of a neural network. Generalization theory (Zhang et al., 2017; Kawaguchi et al., 2017) focuses on the relation between the empirical loss, as defined in equation 2, and the expected risk defined as follows: $L(f) = \mathbb{E}_{(x,y)\sim Q}[l(f(x), y)]$, where $Q$ is the underlying distribution of the dataset. Let $f^* = \arg\min_f L(f)$ be the expected risk minimizer and $\hat{f} = \arg\min_f \hat{L}(f)$ be the empirical risk minimizer. We are interested in the estimation error, i.e., $L(\hat{f}) - L(f^*)$, defined as the gap in the expected risk between the two minimizers (Barron, 1994). The estimation error represents how well an algorithm can learn. It usually depends on the complexity of the hypothesis class and on the number of training samples (Barron, 1993; Zhai & Wang, 2018). Several techniques have been used to quantify the estimation error, such as PAC learning (Hanneke, 2016; Arora et al., 2018), the VC dimension (Sontag, 1998; Harvey et al., 2017; Bartlett et al., 2019), and the Rademacher complexity (Xie et al., 2015b; Zhai & Wang, 2018; Tang et al., 2020). The Rademacher complexity has been widely used as it usually leads to a tighter generalization error bound (Sokolic et al., 2016; Neyshabur et al., 2018; Golowich et al., 2018). The empirical Rademacher complexity is formally defined as follows:
Definition 3.1. (Bartlett & Mendelson, 2002) For a given dataset with $N$ samples $D = \{x_i, y_i\}_{i=1}^{N}$ generated by a distribution $Q$ and for a model space $\mathcal{F}: \mathcal{X} \to \mathbb{R}$ with a single-dimensional output, the empirical Rademacher complexity $\mathcal{R}_N(\mathcal{F})$ of the set $\mathcal{F}$ is defined as $\mathcal{R}_N(\mathcal{F}) = \mathbb{E}_\sigma\big[\sup_{f\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^{N}\sigma_i f(x_i)\big]$, where the Rademacher variables $\sigma = \{\sigma_1, \dots, \sigma_N\}$ are independent uniform random variables in $\{-1, 1\}$.
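The expectation in Definition 3.1 can be approximated by Monte Carlo for a small finite hypothesis class, which makes the quantity concrete. A minimal sketch, where the "hypothesis class" of three fixed score vectors is a hypothetical toy example, not anything from the paper's setup:

```python
import numpy as np

def empirical_rademacher(F_outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_N(F) = E_sigma[ sup_f (1/N) sum_i sigma_i f(x_i) ].
    F_outputs: (|F|, N) array; row k holds f_k(x_1), ..., f_k(x_N) for one hypothesis.
    """
    rng = np.random.default_rng(seed)
    K, N = F_outputs.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)   # Rademacher variables
        total += np.max(F_outputs @ sigma) / N    # sup over the finite class
    return total / n_draws

# Toy class of 3 fixed hypotheses evaluated on N = 50 points.
rng = np.random.default_rng(1)
F_outputs = rng.standard_normal((3, 50))
print(empirical_rademacher(F_outputs))
```

A richer class (more rows) can only increase the supremum, which is the intuition behind complexity-based bounds such as Lemma 3.2 below.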
In this work, we analyze the estimation error bound of a neural network using the Rademacher complexity, focusing on the effect of the within-layer diversity on the estimation error. In order to study this effect, inspired by (Xie et al., 2015b), we assume that, with a high probability $\tau$, the distance between the outputs of each pair of neurons, $\|\phi_n(x) - \phi_m(x)\|$, is lower-bounded by $d_{min}$ for any input $x$. Note that this condition can be expressed in terms of the similarity $s$ defined in equation 4: $s_{nm} \le e^{-\gamma d_{min}^2} = s_{min}$ for any two distinct neurons, with probability $\tau$. Our analysis starts with the following lemma:
Lemma 3.2. (Xie et al., 2015b; Bartlett & Mendelson, 2002) With a probability of at least $1 - \delta$,
$L(\hat{f}) - L(f^*) \le 4\mathcal{R}_N(\mathcal{A}) + B\sqrt{\frac{2\log(2/\delta)}{N}}$ (7)
for $B \ge \sup_{x,y,f}|l(f(x), y)|$, where $\mathcal{R}_N(\mathcal{A})$ is the Rademacher complexity of the loss set $\mathcal{A}$.
Lemma 3.2 upper-bounds the estimation error using the Rademacher complexity defined over the loss set and $\sup_{x,y,f}|l(f(x), y)|$. Our analysis continues by seeking a tighter upper bound of this error and by showing how the within-layer diversity, expressed through $d_{min}$, affects this upper bound. We start by deriving such an upper bound for a simple network with one hidden layer trained for a regression task, and we then extend it to a general multi-layer network and to different losses.

3.1. SINGLE HIDDEN-LAYER NETWORK

Here, we consider a simple neural network with one hidden layer of $M$ neurons and a one-dimensional output, trained for a regression task. The full characterization of the setup can be summarized in the following assumptions:
Assumptions 1.
• The activation function of the hidden layer, $\phi(t)$, is an $L_\phi$-Lipschitz continuous function.
• The input vector $x \in \mathbb{R}^D$ satisfies $\|x\|_2 \le C_1$.
• The output scalar $y \in \mathbb{R}$ satisfies $|y| \le C_2$.
• The weight matrix $W = [w_1, w_2, \dots, w_M] \in \mathbb{R}^{D\times M}$ connecting the input to the hidden layer satisfies $\|w_m\|_2 \le C_3$.
• The weight vector $v \in \mathbb{R}^M$ connecting the hidden layer to the output neuron satisfies $\|v\|_2 \le C_4$.
• The hypothesis class is $\mathcal{F} = \{f \mid f(x) = \sum_{m=1}^{M} v_m \phi_m(x) = \sum_{m=1}^{M} v_m \phi(w_m^T x)\}$.
• The loss function set is $\mathcal{A} = \{l \mid l(f(x), y) = \frac{1}{2}|f(x) - y|^2\}$.
• With probability $\tau$, for $n \neq m$, $\|\phi_n(x) - \phi_m(x)\|_2 = \|\phi(w_n^T x) - \phi(w_m^T x)\|_2 \ge d_{min}$.
We recall the following two lemmas related to the estimation error and the Rademacher complexity:
Lemma 3.3. (Bartlett & Mendelson, 2002) For $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$, assume that $g: \mathbb{R} \to \mathbb{R}$ is an $L_g$-Lipschitz continuous function and $\mathcal{A} = \{g \circ f : f \in \mathcal{F}\}$. Then we have $\mathcal{R}_N(\mathcal{A}) \le L_g \mathcal{R}_N(\mathcal{F})$. (8)
Lemma 3.4. (Xie et al., 2015b) Under Assumptions 1, the Rademacher complexity $\mathcal{R}_N(\mathcal{F})$ of the hypothesis class $\mathcal{F} = \{f \mid f(x) = \sum_{m=1}^{M} v_m \phi(w_m^T x)\}$ can be upper-bounded as follows: $\mathcal{R}_N(\mathcal{F}) \le \frac{2L_\phi C_{134}\sqrt{M}}{\sqrt{N}} + \frac{C_4|\phi(0)|\sqrt{M}}{\sqrt{N}}$, where $C_{134} = C_1 C_3 C_4$ and $\phi(0)$ is the output of the activation function at the origin.
Lemma 3.4 provides an upper bound of the Rademacher complexity for the hypothesis class. In order to find an upper bound for our estimation error, we start by deriving an upper bound for $\sup_{x,f}|f(x)|$:
Lemma 3.5. Under Assumptions 1, with a probability of at least $\tau^Q$, we have $\sup_{x,f}|f(x)| \le \sqrt{J}$, where $Q = \frac{M(M-1)}{2}$ is the number of neuron pairs defined by the $M$ neurons, $J = C_4^2\big(MC_5^2 + M(M-1)(C_5^2 - d_{min}^2/2)\big)$, and $C_5 = L_\phi C_1 C_3 + \phi(0)$.
The proof can be found in Appendix 7.1. Note that in Lemma 3.5 we have expressed the upper bound of $\sup_{x,f}|f(x)|$ in terms of $d_{min}$. Using this bound, we can now find an upper bound for $\sup_{x,y,f}|l(f(x), y)|$:
Lemma 3.6. Under Assumptions 1, with a probability of at least $\tau^Q$, we have $\sup_{x,y,f}|l(f(x), y)| \le (\sqrt{J} + C_2)^2$.
The proof can be found in Appendix 7.2. The main goal is to analyze the estimation error bound of the neural network and to see how its upper bound is linked to the diversity, expressed by $d_{min}$, of the different neurons. The main result is presented in Theorem 3.7.
Theorem 3.7. Under Assumptions 1, with probability of at least $\tau^Q(1-\delta)$, we have
$L(\hat{f}) - L(f^*) \le 8(\sqrt{J} + C_2)\big(2L_\phi C_{134} + C_4|\phi(0)|\big)\frac{\sqrt{M}}{\sqrt{N}} + (\sqrt{J} + C_2)^2\sqrt{\frac{2\log(2/\delta)}{N}}$
where $C_{134} = C_1 C_3 C_4$, $J = C_4^2\big(MC_5^2 + M(M-1)(C_5^2 - d_{min}^2/2)\big)$, and $C_5 = L_\phi C_1 C_3 + \phi(0)$.
The proof can be found in Appendix 7.3. Theorem 3.7 provides an upper bound for the estimation error. We note that it is a decreasing function of $d_{min}$. Thus, a higher $d_{min}$, i.e., more diverse activations, yields a lower estimation error bound. In other words, by promoting within-layer diversity, we can reduce the generalization error of neural networks. It should also be noted that our Theorem 3.7 has a similar form to Theorem 1 in (Xie et al., 2015b). However, the main difference is that Xie et al. analyze the estimation error with respect to the diversity of the weight vectors, whereas we consider the diversity between the outputs of the activations of the hidden neurons.
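The monotonicity of the bound in Theorem 3.7 can be checked numerically by evaluating the right-hand side for increasing values of $d_{min}$. All constants below ($L_\phi$, $C_1, \dots, C_5$, $M$, $N$, $\delta$) are illustrative values of our own choosing, not quantities from the paper's experiments:

```python
import numpy as np

# Hypothetical constants; chosen only to keep J positive over the tested range.
L_phi, C1, C2, C3, C4 = 1.0, 1.0, 1.0, 1.0, 1.0
phi0 = 0.0
M, N, delta = 16, 1000, 0.05
C5 = L_phi * C1 * C3 + phi0
C134 = C1 * C3 * C4

def J(d_min):
    """J = C4^2 (M C5^2 + M (M-1) (C5^2 - d_min^2 / 2))."""
    return C4**2 * (M * C5**2 + M * (M - 1) * (C5**2 - d_min**2 / 2))

def bound(d_min):
    """Right-hand side of Theorem 3.7's estimation-error bound."""
    rJ = np.sqrt(J(d_min))
    term1 = 8 * (rJ + C2) * (2 * L_phi * C134 + C4 * abs(phi0)) * np.sqrt(M) / np.sqrt(N)
    term2 = (rJ + C2) ** 2 * np.sqrt(2 * np.log(2 / delta) / N)
    return term1 + term2

d_values = [0.0, 0.5, 1.0]
bounds = [bound(d) for d in d_values]
assert bounds[0] > bounds[1] > bounds[2]  # larger d_min => smaller bound
```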

3.2. BINARY CLASSIFICATION

We now extend our analysis of the effect of the within-layer diversity on the generalization error to the case of a binary classification task, i.e., $y \in \{-1, 1\}$. The extensions of Theorem 3.7 to the hinge loss and to the logistic loss are presented in Theorems 3.8 and 3.9, respectively.
Theorem 3.8. Using the hinge loss, with probability of at least $\tau^Q(1-\delta)$, we have
$L(\hat{f}) - L(f^*) \le 4\big(2L_\phi C_{134} + C_4|\phi(0)|\big)\frac{\sqrt{M}}{\sqrt{N}} + (1 + \sqrt{J})\sqrt{\frac{2\log(2/\delta)}{N}}$
where $C_{134} = C_1 C_3 C_4$, $J = C_4^2\big(MC_5^2 + M(M-1)(C_5^2 - d_{min}^2/2)\big)$, and $C_5 = L_\phi C_1 C_3 + \phi(0)$.
Theorem 3.9. Using the logistic loss $l(f(x), y) = \log(1 + e^{-yf(x)})$, with probability of at least $\tau^Q(1-\delta)$, we have
$L(\hat{f}) - L(f^*) \le \frac{4}{1 + e^{-\sqrt{J}}}\big(2L_\phi C_{134} + C_4|\phi(0)|\big)\frac{\sqrt{M}}{\sqrt{N}} + \log(1 + e^{\sqrt{J}})\sqrt{\frac{2\log(2/\delta)}{N}}$
where $C_{134}$, $J$, and $C_5$ are as in Theorem 3.8.
The proofs are similar to those of Lemmas 7 and 8 in (Xie et al., 2015b). As we can see, for the classification task, the estimation error bounds for the hinge and logistic losses are also decreasing with respect to $d_{min}$. Thus, employing a diversity strategy can improve generalization also for the binary classification task.
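The factor $\frac{1}{1 + e^{-\sqrt{J}}}$ in Theorem 3.9 is the Lipschitz constant of the logistic loss restricted to $|f(x)| \le \sqrt{J}$, which can be verified numerically with finite differences. The value $\sqrt{J} = 2$ below is an arbitrary illustration:

```python
import numpy as np

def logistic_loss(f, y=1.0):
    return np.log(1.0 + np.exp(-y * f))

sqrtJ = 2.0  # illustrative bound on |f(x)|
f = np.linspace(-sqrtJ, sqrtJ, 100001)
grad = np.abs(np.gradient(logistic_loss(f), f))  # |dl/df| on the restricted domain
lipschitz_numeric = grad.max()
lipschitz_theory = 1.0 / (1.0 + np.exp(-sqrtJ))  # attained at f = -sqrt(J)
assert abs(lipschitz_numeric - lipschitz_theory) < 1e-3
```

Since the constant approaches 1 as $\sqrt{J}$ grows, reducing $J$ (i.e., increasing $d_{min}$) tightens the first term of the bound.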

3.3. MULTI-LAYER NETWORKS

Here, we extend our result to networks with $P > 1$ hidden layers. We assume that the pair-wise distances between the activations within layer $p$ are lower-bounded by $d_{min}^p$ with probability $\tau_p$. In this case, the hypothesis class can be defined recursively. In addition, we replace the fourth assumption in Assumptions 1 with: $\|W^p\|_\infty \le C_3^p$ for every $W^p$, i.e., the weight matrix of the $p^{th}$ layer. The main theorem is extended as follows:
Theorem 3.10. With probability of at least $\prod_{p=0}^{P-1}(\tau_p)^{Q_p}(1-\delta)$, we have
$L(\hat{f}) - L(f^*) \le 8(\sqrt{J_P} + C_2)\Big(\frac{(2L_\phi)^P C_1 C_3^0}{\sqrt{N}}\prod_{p=0}^{P-1}\sqrt{M_p}\,C_3^p + \frac{|\phi(0)|}{\sqrt{N}}\sum_{p=0}^{P-1}(2L_\phi)^{P-1-p}\prod_{j=p}^{P-1}\sqrt{M_j}\,C_3^j\Big) + (\sqrt{J_P} + C_2)^2\sqrt{\frac{2\log(2/\delta)}{N}}$ (15)
where $Q_p = \frac{M_p(M_p-1)}{2}$ is the number of neuron pairs in the $p^{th}$ layer and $J_P$ is defined recursively via $J_0 = C_3^0 C_1$ and $J_p = M_p C_p^2\Big(M_p\big(L_\phi C_3^{p-1} J_{p-1} + \phi(0)\big)^2 - M_p(M_p-1)\frac{(d_{min}^p)^2}{2}\Big)$ for $p = 1, \dots, P$.
The proof can be found in Appendix 7.4. In Theorem 3.10, we see that $J_P$ is decreasing with respect to $d_{min}^p$. Thus, by maximizing the within-layer diversity, we can reduce the estimation error of a multi-layer neural network.

3.4. MULTIPLE OUTPUTS

Finally, we consider the case of a neural network with a multi-dimensional output, i.e., $\mathbf{y} \in \mathbb{R}^D$. In this case, we can extend Theorem 3.7 by decomposing the problem into $D$ smaller problems and deriving the global error bound as the sum of the $D$ smaller bounds. This yields the following two theorems:
Theorem 3.11. For multivariate regression trained with the squared error, with probability of at least $\tau^Q(1-\delta)$, we have
$L(\hat{f}) - L(f^*) \le 8D(\sqrt{J} + C_2)\big(2L_\phi C_{134} + C_4|\phi(0)|\big)\frac{\sqrt{M}}{\sqrt{N}} + D(\sqrt{J} + C_2)^2\sqrt{\frac{2\log(2/\delta)}{N}}$
where $C_{134} = C_1 C_3 C_4$, $J = C_4^2\big(MC_5^2 + M(M-1)(C_5^2 - d_{min}^2/2)\big)$, and $C_5 = L_\phi C_1 C_3 + \phi(0)$.
Theorem 3.12. For a multi-class classification task using the cross-entropy loss, with probability of at least $\tau^Q(1-\delta)$, we have
$L(\hat{f}) - L(f^*) \le \frac{D(D-1)}{D-1 + e^{-2\sqrt{J}}}\big(2L_\phi C_{134} + C_4|\phi(0)|\big)\frac{\sqrt{M}}{\sqrt{N}} + \log\big(1 + (D-1)e^{2\sqrt{J}}\big)\sqrt{\frac{2\log(2/\delta)}{N}}$
where $C_{134}$, $J$, and $C_5$ are as in Theorem 3.11.
The proofs can be found in Appendix 7.5. Theorems 3.11 and 3.12 extend our result to the multi-dimensional regression and classification tasks, respectively. Both bounds decrease as the diversity factor $d_{min}$ increases; for the classification task, the upper bound even decreases exponentially with respect to $d_{min}$.

4. RELATED WORK

Diversity-promoting strategies have been widely used in ensemble learning (Li et al., 2012; Yu et al., 2011), sampling (Derezinski et al., 2019; Bıyık et al., 2019; Gartrell et al., 2019), ranking (Yang et al.; Gan et al., 2020), and pruning by reducing redundancy (Kondo & Yamauchi, 2014; He et al., 2019; Singh et al., 2020; Lee et al., 2020). In the deep learning context, various approaches have used diversity as a direct regularizer on top of the weight parameters. Here, we present a brief overview of these regularizers. Based on the way diversity is defined, we can group these approaches into two categories. The first group contains regularizers based on the pairwise dissimilarity of components, i.e., the overall set of weights is diverse if every pair of weights is dissimilar. Given the weight vectors $\{w_m\}_{m=1}^{M}$, Yu et al. (2011) define the regularizer as $\sum_{mn}(1 - \theta_{mn})$, where $\theta_{mn}$ represents the cosine similarity between $w_m$ and $w_n$. Bao et al. (2013) proposed an incoherence score defined as $-\log\big(\frac{1}{M(M-1)}\sum_{mn}\beta|\theta_{mn}|^{1/\beta}\big)$, where $\beta$ is a positive hyper-parameter. Xie et al. (2015a; 2016) used $\text{mean}(\theta_{mn}) - \text{var}(\theta_{mn})$ to regularize Boltzmann machines. They theoretically analyzed its effect on the generalization error bounds in (Xie et al., 2015b) and extended it to the kernel space in (Xie et al., 2017a). The second group of regularizers takes a more global view of diversity. For example, a weight regularization based on the determinant of the weight covariance matrix is proposed in (Malkin & Bilmes, 2009; 2008; Xie et al., 2017b), and regularizers based on the determinantal point process are proposed in (Kulesza & Taskar, 2012; Kwok & Adams, 2012). Unlike the aforementioned methods, which promote diversity at the weight level, and similar to our method, Cogswell et al. (2016) proposed to enforce dissimilarity on the feature map outputs, i.e., on the activations. To this end, they proposed an additional loss based on the pairwise covariance of the activation outputs.
Their additional loss, $L_{Decov}$, is defined as the squared sum of the off-diagonal elements of the global covariance matrix $C$ of the activations: $L_{Decov} = \frac{1}{2}\big(\|C\|_F^2 - \|\mathrm{diag}(C)\|_2^2\big)$, where $\|\cdot\|_F$ is the Frobenius norm. Their approach, Decov, yielded superior empirical performance; however, it lacked a theoretical justification. In this paper, we close this gap and show theoretically how employing a diversity strategy on the network activations can indeed decrease the estimation error bound and improve the generalization of the model. In addition, we propose variants of our approach which consider a global view of diversity.
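The Decov penalty of Cogswell et al. (2016) can be sketched in a few lines. The formula follows the text above; the variable names and toy data are our own:

```python
import numpy as np

def decov_loss(H):
    """Decov regularizer: squared sum of off-diagonal batch-covariance entries.
    H: (N, C) hidden activations for a batch of N samples and C neurons.
    L_Decov = 0.5 * (||C||_F^2 - ||diag(C)||_2^2)
    """
    Hc = H - H.mean(axis=0, keepdims=True)   # center the activations
    C = Hc.T @ Hc / H.shape[0]               # batch covariance (C x C)
    return 0.5 * (np.sum(C ** 2) - np.sum(np.diag(C) ** 2))

# Perfectly correlated neurons incur a large penalty; independent ones almost none.
rng = np.random.default_rng(0)
a = rng.standard_normal(200)
H_corr = np.stack([a, a], axis=1)       # two identical neurons
H_ind = rng.standard_normal((200, 2))   # two independent neurons
print(decov_loss(H_corr), decov_loss(H_ind))
```

Note the contrast with our similarity of equation 4: Decov penalizes second-order correlation only, while the RBF similarity also reacts to non-linear redundancy between neurons.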

5. EXPERIMENTAL RESULTS

In this section, we present an empirical study of our approach in a regression context using the Boston Housing price dataset (Dua & Graff, 2017) and in a classification context using the CIFAR10 and CIFAR100 datasets (Krizhevsky et al., 2009). We denote by Vanilla the model trained with no diversity protocol and by Decov the approach proposed in (Cogswell et al., 2016).

5.1. REGRESSION

For regression, we use the Boston Housing price dataset (Dua & Graff, 2017). It has 404 training samples and 102 test samples, each with 13 attributes. We hold out the last 100 samples of the training set as a validation set for hyper-parameter tuning. The loss weight $\lambda$ is chosen from {0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005} for both our approach and Decov (Cogswell et al., 2016). The parameter $\gamma$ in the radial basis function is chosen from {0.00001, 0.0001, 0.01, 0.1, 1, 10, 100}. As a base model, we use a neural network composed of two fully connected hidden layers, each with 64 neurons. The additional loss is applied on top of both hidden layers. We train for 80 epochs using stochastic gradient descent with a learning rate of 0.01 and the mean squared error loss. For hyper-parameter tuning, we keep the model that performs best on the validation set and use it in the test phase. We experiment with three different activation functions for the hidden layers: sigmoid, Rectified Linear Unit (ReLU) (Nair & Hinton, 2010), and LeakyReLU (Maas et al., 2013). Table 1 reports the results in terms of the mean average error of the different approaches on the Boston Housing price dataset. First, we note that employing a diversification strategy (ours and Decov) boosts the results compared to the Vanilla approach for all types of activations. The three variants of our within-layer approach consistently outperform the Decov loss, except for LeakyReLU, where the latter outperforms our direct variant. Table 1 shows that the logdet variant of our approach yields the best performance for all three activation types.
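The augmented objective of this regression setup can be sketched as a single forward evaluation. This is an illustrative NumPy sketch of the described two-hidden-layer model with the direct variant applied to both hidden layers; the random weights stand in for learned parameters, and the $\lambda$ and $\gamma$ values are arbitrary picks from the searched grids:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2 x 64 MLP matching the described setup (weights are random stand-ins).
D = 13
W1, b1 = rng.standard_normal((64, D)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((64, 64)) * 0.1, np.zeros(64)
w3, b3 = rng.standard_normal(64) * 0.1, 0.0

def relu(t):
    return np.maximum(t, 0.0)

def direct_diversity(A, gamma=0.01):
    """Direct variant: sum of off-diagonal RBF similarities (equation 4)."""
    diff = A[:, :, None] - A[:, None, :]
    S = np.exp(-gamma * diff ** 2).mean(axis=0)
    return S.sum() - np.trace(S)

def augmented_loss(X, y, lam=1e-4, gamma=0.01):
    H1 = relu(X @ W1.T + b1)   # first hidden layer, (N, 64)
    H2 = relu(H1 @ W2.T + b2)  # second hidden layer, (N, 64)
    pred = H2 @ w3 + b3
    mse = np.mean((pred - y) ** 2)
    # Diversity penalty on top of both hidden layers, as described in the text.
    return mse + lam * (direct_diversity(H1, gamma) + direct_diversity(H2, gamma))

X = rng.standard_normal((32, D))
y = rng.standard_normal(32)
print(augmented_loss(X, y))
```

In training, this scalar would be minimized by SGD exactly like the plain MSE, with the penalty's gradient flowing back through the hidden activations.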

5.2. CLASSIFICATION

For classification, we evaluate the performance of our approach on the CIFAR10 and CIFAR100 datasets (Krizhevsky et al., 2009). They contain 60,000 32 × 32 images grouped into 10 and 100 distinct categories, respectively. We train on the 50,000 given training examples and test on the 10,000 specified test samples. We hold out the last 10,000 samples of the training set for validation. For the neural network model, we use an architecture composed of three convolutional layers, each with 32 3 × 3 filters followed by 2 × 2 max pooling. The flattened output of the convolutional layers is connected to a fully connected layer with 128 neurons and a softmax layer. The different additional losses, i.e., ours and Decov, are added only on top of the fully connected layer. The models are trained for 150 epochs using stochastic gradient descent with a learning rate of 0.01 and the categorical cross-entropy loss. For hyper-parameter tuning, we keep the model that performs best on the validation set and use it in the test phase. We experiment with three different activation functions for the hidden layers: sigmoid, Rectified Linear Unit (ReLU) (Nair & Hinton, 2010), and LeakyReLU (Maas et al., 2013). All reported results are average performance over 4 trials, with the standard deviation indicated alongside. Tables 2 and 3 report the test error rates of the different approaches on both datasets. Compared to the Vanilla network, our within-layer diversity strategies consistently improve the performance of the model. On CIFAR10, the direct variant yields more than a 0.72% improvement for ReLU and a 2% improvement for the sigmoid activation. For the LeakyReLU case, the determinant variant achieves the lowest error rate. This is in accordance with the results on CIFAR100, where our proposed approach outperforms both the Vanilla and the Decov models, especially in the sigmoid case.
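The feature-map sizes and parameter counts of the described CNN can be worked out with simple shape arithmetic. The text does not state the convolution padding, so the sketch below assumes 'same' padding (and the 10-way CIFAR10 head); 'valid' padding would change the numbers:

```python
def conv_pool(h, w, pool=2):
    # A 3x3 'same' convolution keeps the spatial size; 2x2 max pooling halves it.
    return h // pool, w // pool

h, w, in_ch = 32, 32, 3  # CIFAR input: 32 x 32 RGB
filters = 32
conv_params = []
for _ in range(3):
    conv_params.append((3 * 3 * in_ch + 1) * filters)  # kernel weights + biases
    h, w = conv_pool(h, w)
    in_ch = filters

flatten = h * w * filters            # 4 * 4 * 32 = 512 under the 'same'-padding assumption
fc_params = (flatten + 1) * 128      # fully connected layer with 128 neurons
softmax_params = (128 + 1) * 10      # CIFAR10 softmax head
total_params = sum(conv_params) + fc_params + softmax_params
print(flatten, total_params)
```

The fully connected layer dominates the parameter count, which is consistent with applying the diversity losses on top of that layer, where redundancy among the 128 neurons is most costly.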
Compared to the Vanilla approach, the model training time on CIFAR100 increases by 9% for the direct variant, by 36.1% for the determinant variant, and by 36.2% for the log-determinant variant.

6. CONCLUSIONS

In this paper, we proposed a new approach to encourage the 'diversification' of the layer-wise feature map outputs in neural networks. The main motivation is that by promoting within-layer activation diversity, neurons within the same layer learn to capture mutually distinct patterns. We proposed an additional loss term that can be added on top of any fully connected layer. This term complements the traditional 'between-layer' feedback with an additional 'within-layer' feedback encouraging diversity of the activations. We theoretically proved that the proposed approach decreases the estimation error bound and thus improves the generalization ability of neural networks. This analysis was further supported by experimental results showing that such a strategy can indeed improve the performance of neural networks in regression and classification tasks. Our future work includes an extensive experimental analysis of the relationship between the distribution of the neurons' outputs and generalization.

7. APPENDIX

In the following proofs, we use Lipschitz analysis. In particular, a function $f: A \to \mathbb{R}$, $A \subset \mathbb{R}^n$, is said to be $L$-Lipschitz if there exists a constant $L \ge 0$ such that $|f(a) - f(b)| \le L\|a - b\|$ for every pair of points $a, b \in A$. Moreover:
• $\sup_{x \in A} |f(x)| \le \sup_{x \in A}\big(L\|x\| + |f(0)|\big)$;
• if $f$ is continuous and differentiable, $L = \sup_x |f'(x)|$.

7.1. PROOF OF LEMMA 3.5

Lemma 3.5. Under Assumptions 1, with a probability of at least $\tau^Q$, we have $\sup_{x,f}|f(x)| \le \sqrt{J}$, where $Q = \frac{M(M-1)}{2}$ is the number of neuron pairs defined by the $M$ neurons, $J = C_4^2\big(MC_5^2 + M(M-1)(C_5^2 - d_{min}^2/2)\big)$, and $C_5 = L_\phi C_1 C_3 + \phi(0)$.
Proof. Since $\phi$ is $L_\phi$-Lipschitz, $\sup_{w,x}|\phi(w^T x)| \le \sup\big(L_\phi|w^T x| + \phi(0)\big)$, and thus $\|\phi\|_\infty \le L_\phi C_1 C_3 + \phi(0) = C_5$. Expanding $f(x)^2 = \big(\sum_m v_m \phi_m(x)\big)^2$ and using $\|v\|_2 \le C_4$ yields $f(x)^2 \le C_4^2\big(\sum_m \phi_m(x)^2 + \sum_{m \neq n}\phi_m(x)\phi_n(x)\big)$. For the first term, $\sum_m \phi_m(x)^2 \le M C_5^2$. For the second term, we use the identity $\phi_m(x)\phi_n(x) = \frac{1}{2}\big(\phi_m(x)^2 + \phi_n(x)^2 - (\phi_m(x) - \phi_n(x))^2\big)$ together with the assumption that, with probability $\tau$, $\|\phi_m(x) - \phi_n(x)\| \ge d_{min}$ for $m \neq n$. All $Q = \frac{M(M-1)}{2}$ neuron pairs satisfy this bound simultaneously with a probability of at least $\tau^Q$, so $\sum_{m \neq n}\phi_m(x)\phi_n(x) \le M(M-1)\big(C_5^2 - d_{min}^2/2\big)$. Putting everything together, with a probability of at least $\tau^Q$, $f^2(x) \le C_4^2\big(MC_5^2 + M(M-1)(C_5^2 - d_{min}^2/2)\big) = J$. Thus, with a probability of at least $\tau^Q$, $\sup_{x,f}|f(x)| \le \sup_{x,f}\sqrt{f(x)^2} \le \sqrt{J}$.

7.2. PROOF OF LEMMA 3.6

Lemma 3.6. Under Assumptions 1, with a probability of at least $\tau^Q$, we have $\sup_{x,y,f}|l(f(x), y)| \le (\sqrt{J} + C_2)^2$.
Proof. We have $\sup_{x,y,f}|f(x) - y| \le \sup_{x,f}|f(x)| + \sup_y|y| = \sqrt{J} + C_2$. Thus, $\sup_{x,y,f}|l(f(x), y)| = \frac{1}{2}\sup_{x,y,f}|f(x) - y|^2 \le (\sqrt{J} + C_2)^2$.

Table 1: Mean average error of the different approaches on the Boston Housing price dataset.



Table 3: Test error rates on CIFAR100.

7.3. PROOF OF THEOREM 3.7

Theorem 3.7. Under Assumptions 1, with probability of at least $\tau^Q(1-\delta)$, we have
$L(\hat{f}) - L(f^*) \le 8(\sqrt{J} + C_2)\big(2L_\phi C_{134} + C_4|\phi(0)|\big)\frac{\sqrt{M}}{\sqrt{N}} + (\sqrt{J} + C_2)^2\sqrt{\frac{2\log(2/\delta)}{N}}$.
Proof. The squared loss, restricted to $|f(x)| \le \sqrt{J}$ and $|y| \le C_2$, is Lipschitz in $f(x)$, so Lemma 3.3 bounds $\mathcal{R}_N(\mathcal{A})$ in terms of $\mathcal{R}_N(\mathcal{F})$. For $\mathcal{R}_N(\mathcal{F})$, we use the bound found in Lemma 3.4. Using Lemmas 3.2 and 3.6 completes the proof.

7.4. PROOF OF THEOREM 3.10

Proof. Lemma 5 in (Xie et al., 2015b) provides an upper bound for the Rademacher complexity of the multi-layer hypothesis class. Let $v^p$ denote the outputs of the $p^{th}$ hidden layer before applying the activation function. We use the same decomposition trick for $\phi^p_m \phi^p_n$ as in the proof of Lemma 3.5 to bound $\sup_x \|\phi^p\|$, which yields the recursive bound $J_p$ on $\|v^p\|_2^2$; for $p = 0$, we have $J_0 = C_3^0 C_1$. Combining this recursion with Lemmas 3.2 and 3.3 completes the proof.

7.5. PROOFS OF THEOREMS 3.11 AND 3.12

Proof of Theorem 3.11. The squared loss $\|f(x) - y\|^2$ can be decomposed into $D$ terms $(f(x)_k - y_k)^2$. Using Theorem 3.7, we derive the bound for each term, and summing the $D$ bounds yields the result.
Proof of Theorem 3.12. Using Lemma 9 in (Xie et al., 2015b), we have $\sup_{f,x,y} l = \log\big(1 + (D-1)e^{2\sqrt{J}}\big)$ and $l$ is $\frac{D-1}{D-1 + e^{-2\sqrt{J}}}$-Lipschitz. The decomposition property of the Rademacher complexity then completes the proof.

