ON NEURAL NETWORK GENERALIZATION VIA PROMOTING WITHIN-LAYER ACTIVATION DIVERSITY

Anonymous

Abstract

During the last decade, neural networks have been intensively used to tackle various problems and have often led to state-of-the-art results. These networks are composed of multiple jointly optimized layers arranged in a hierarchical structure. At each layer, the aim is to learn to extract the hidden patterns needed to solve the problem at hand and forward them to the next layers. In the standard form, a neural network is trained with gradient-based optimization, where the errors are back-propagated from the last layer back to the first one. Thus, at each optimization step, neurons at a given layer receive feedback from neurons belonging to higher layers of the hierarchy. In this paper, we propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage diversity of the activations within the same layer. To this end, we measure the pairwise similarity between the outputs of the neurons and use it to model the layer's overall diversity. By penalizing similarities and promoting diversity, we encourage each neuron to learn a distinctive representation and, thus, to enrich the data representation learned within the layer and to increase the total capacity of the model. We theoretically study how within-layer activation diversity affects the generalization performance of a neural network in a supervised context and prove that increasing the diversity of hidden activations reduces the estimation error. In addition to the theoretical guarantees, we present an empirical study confirming that the proposed approach enhances the performance of neural networks.

1. INTRODUCTION

Neural networks are a powerful class of non-linear function approximators that have been successfully used to tackle a wide range of problems. They have enabled breakthroughs in many tasks, such as image classification (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012a), and anomaly detection (Golan & El-Yaniv, 2018). Formally, the output of a neural network consisting of P layers can be defined as follows:

f(x; W) = φ_P(W_P φ_{P-1}(· · · φ_2(W_2 φ_1(W_1 x)))),

where φ_i(·) is the element-wise activation function, e.g., ReLU or sigmoid, of the i-th layer and W = {W_1, . . . , W_P} are the corresponding weights of the network. The parameters of f(x; W) are optimized by minimizing the empirical loss

L(f) = (1/N) Σ_{i=1}^{N} l(f(x_i; W), y_i),

where l(·) is the loss function and {x_i, y_i}_{i=1}^{N} are the training samples and their associated ground-truth labels. The loss is minimized using gradient descent-based optimization coupled with backpropagation. However, neural networks are often over-parameterized, i.e., they have more parameters than training data. As a result, they tend to overfit to the training samples and generalize poorly to unseen examples (Goodfellow et al., 2016). While research on double descent (Belkin et al., 2019; Advani et al., 2020; Nakkiran et al., 2020) shows that over-parameterization does not necessarily lead to overfitting, avoiding overfitting has been extensively studied (Neyshabur et al., 2018; Nagarajan & Kolter, 2019; Poggio et al., 2017), and various strategies have been proposed, such as data augmentation (Goodfellow et al., 2016), regularization (Kukačka et al., 2017; Bietti et al., 2019; Arora et al., 2019), and dropout (Hinton et al., 2012b; Wang et al., 2019; Lee et al., 2019; Li et al., 2016), to close the gap between the empirical loss and the expected loss.
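As a concrete illustration of the two formulas above, the following NumPy sketch implements the stacked forward pass f(x; W) and the empirical loss L(f). The choice of ReLU as φ_i, squared error as the loss l, and the layer sizes are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def relu(z):
    # Element-wise activation phi_i; ReLU is one common choice.
    return np.maximum(z, 0.0)

def forward(x, weights):
    # f(x; W) = phi_P(W_P phi_{P-1}(... phi_2(W_2 phi_1(W_1 x))))
    h = x
    for W in weights:
        h = relu(W @ h)
    return h

def empirical_loss(weights, xs, ys):
    # L(f) = (1/N) * sum_i l(f(x_i; W), y_i), with squared error as l.
    return np.mean([np.sum((forward(x, weights) - y) ** 2)
                    for x, y in zip(xs, ys)])
```

In a real training loop, the gradient of this loss with respect to each W_i would be obtained via backpropagation, which this sketch omits.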
Diversity of learners is widely known to be important in ensemble learning (Li et al., 2012; Yu et al., 2011) and, particularly in the deep learning context, diversity of the information extracted by the network's neurons has been recognized as a viable way to improve generalization (Xie et al., 2017a; 2015b). In most cases, these efforts have focused on making the set of weights more diverse (Yang et al.; Malkin & Bilmes, 2009). However, diversity of the activations has not received much attention. Inspired by dropout's aim of preventing the co-adaptation of neurons, Cogswell et al. (2016) proposed to regularize the activations of the network with an additional loss based on the cross-covariance of the hidden activations, which encourages the neurons to learn diverse, non-redundant representations. The proposed approach, known as DeCov, has empirically been shown to alleviate overfitting and to improve the generalization ability of neural networks, yet a theoretical analysis supporting this has so far been lacking.

In this work, we propose a novel approach to encourage activation diversity within the same layer. We propose complementing the 'between-layer' feedback with additional 'within-layer' feedback that penalizes similarities between neurons within the same layer. Thus, we encourage each neuron to learn a distinctive representation and to enrich the data representation learned within each layer. Moreover, inspired by Xie et al. (2015b), we provide a theoretical analysis showing that within-layer activation diversity boosts the generalization performance of neural networks and reduces overfitting. Our contributions in this paper are as follows:

• Methodologically, we propose a new approach to encourage the 'diversification' of the layer-wise feature maps' outputs in neural networks. The proposed approach has three variants based on how the global diversity is defined. The main intuition is that by promoting within-layer activation diversity, neurons within the same layer learn distinct patterns and, thus, increase the overall capacity of the model.

• Theoretically, we analyse the effect of the within-layer activation diversity on the generalization error bound of neural networks. The analysis is presented in Section 3. As shown in Theorems 3.7, 3.8, 3.9, 3.10, 3.11, and 3.12, we express the upper bound of the estimation error as a function of the diversity factor. Thus, we provide theoretical evidence that within-layer activation diversity can help reduce the generalization error.

• Empirically, we show that within-layer activation diversity boosts the performance of neural networks. Experimental results show that the proposed approach outperforms the competing methods.

2. WITHIN-LAYER ACTIVATION DIVERSITY

We propose a diversification strategy, where we encourage neurons within a layer to activate in a mutually different manner, i.e., to capture different patterns. To this end, we propose an additional within-layer loss which penalizes neurons that activate similarly. The loss function L(f) defined in equation 2 is augmented as follows:

L_aug(f) = L(f) + λ Σ_{i=1}^{P} J_i,

where J_i expresses the overall pairwise similarity of the neurons within the i-th layer and λ is the penalty coefficient for the diversity loss. As in (Cogswell et al., 2016), our proposed diversity loss can be applied to a single layer or to multiple layers in a network. For simplicity, let us focus on a single layer. Let φ_n^i(x_j) and φ_m^i(x_j) be the outputs of the n-th and m-th neurons in the i-th layer for the same input sample x_j. The similarity s_nm between the n-th and m-th neurons can be obtained as the average similarity measure of their outputs over the N input samples. We use the radial basis function to
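To make the augmented objective concrete, the following is a minimal NumPy sketch of the per-layer diversity term J_i and of L_aug. Since the exact formulation continues beyond this excerpt, the RBF similarity exp(-γ(a - b)²), the sum over neuron pairs as the aggregation for J_i, and the values of γ and λ are assumptions made for illustration only.

```python
import numpy as np

def rbf_similarity(a, b, gamma=1.0):
    # Average RBF similarity between two neurons' output vectors,
    # each holding one activation per input sample. The bandwidth
    # gamma is an illustrative hyperparameter.
    return np.mean(np.exp(-gamma * (a - b) ** 2))

def layer_diversity_loss(activations, gamma=1.0):
    # J_i: overall pairwise similarity of the neurons in one layer.
    # `activations` has shape (num_neurons, N): row n holds
    # phi_n(x_1), ..., phi_n(x_N) for the n-th neuron.
    num_neurons = activations.shape[0]
    J = 0.0
    for n in range(num_neurons):
        for m in range(n + 1, num_neurons):
            J += rbf_similarity(activations[n], activations[m], gamma)
    return J

def augmented_loss(task_loss, per_layer_activations, lam=0.01, gamma=1.0):
    # L_aug(f) = L(f) + lambda * sum_i J_i
    return task_loss + lam * sum(layer_diversity_loss(A, gamma)
                                 for A in per_layer_activations)
```

Minimizing this penalty pushes pairwise similarities down: identical neurons contribute a similarity of 1 per pair, while neurons with very different activations contribute close to 0.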

