SVMAX: A FEATURE EMBEDDING REGULARIZER

Abstract

A neural network regularizer (e.g., weight decay) boosts performance by explicitly penalizing the complexity of a network. In this paper, we penalize inferior network activations (feature embeddings), which in turn regularizes the network's weights implicitly. We propose singular value maximization (SVMax) to learn a uniform feature embedding. The SVMax regularizer integrates seamlessly with both supervised and unsupervised learning. During training, our formulation mitigates model collapse and enables larger learning rates. Thus, our formulation converges in fewer epochs, which reduces the training computational cost. We evaluate the SVMax regularizer using both retrieval networks and generative adversarial networks. We leverage a synthetic mixture-of-Gaussians dataset to evaluate SVMax in an unsupervised setting. For retrieval networks, SVMax achieves significant improvement margins across various ranking losses.

1. INTRODUCTION

A neural network's knowledge is embodied in both its weights and activations. This difference manifests in how network pruning and knowledge distillation tackle the model compression problem. While the pruning literature Li et al. (2016); Luo et al. (2017); Yu et al. (2018) compresses models by removing less significant weights, knowledge distillation Hinton et al. (2015) reduces computational complexity by matching a cumbersome network's last layer activations (logits). This perspective, of weight-knowledge versus activation-knowledge, emphasizes how the neural network literature is dominated by explicit weight regularizers. In contrast, this paper leverages singular value decomposition (SVD) to regularize a network through its last layer activations, i.e., its feature embedding. Our formulation is inspired by principal component analysis (PCA). Given a set of points and their covariance, PCA yields a set of orthogonal eigenvectors sorted by their eigenvalues. The principal component (first eigenvector) is the axis with the highest variation (largest eigenvalue), as shown in Figure 1c. The eigenvalues from PCA, and similarly the singular values from SVD, provide insight into the structure of the embedding space. As such, by regularizing the singular values, we reshape the feature embedding.

The main contribution of this paper is to leverage the singular value decomposition of a network's activations to regularize the embedding space. We achieve this objective through singular value maximization (SVMax). The SVMax regularizer is oblivious to both the input class (labels) and the sampling strategy; thus it promotes a uniform embedding space in both supervised and unsupervised learning. Furthermore, we present a mathematical analysis of the mean singular value's lower and upper bounds. When the feature embedding is normalized to the unit circle, this analysis makes tuning SVMax's balancing hyperparameter easier.
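The bound analysis mentioned above admits a short derivation. The following is a sketch under our assumptions: a batch of $b$ unit-norm embeddings of dimension $d$, stacked into a matrix $E$ with $s = \min(b, d)$ singular values, and the mean singular value $\mu$ taken over all $s$ of them (this notation is ours, for illustration):

```latex
% E \in \mathbb{R}^{b \times d}: batch embedding matrix with unit-norm rows.
% \sigma_1, \dots, \sigma_s: singular values of E, where s = \min(b, d).
\begin{align}
  \|E\|_F^2 &= \sum_{i=1}^{s} \sigma_i^2 = b
    && \text{(each of the $b$ rows has unit norm)} \\
  \sqrt{b} = \sqrt{\textstyle\sum_i \sigma_i^2}
    \;\le\; \sum_{i=1}^{s} \sigma_i
    \;&\le\; \sqrt{s \textstyle\sum_i \sigma_i^2} = \sqrt{s\,b}
    && \text{(norm inequality; Cauchy--Schwarz)} \\
  \frac{\sqrt{b}}{s} \;\le\; \mu = \frac{1}{s}\sum_{i=1}^{s} \sigma_i
    \;&\le\; \sqrt{\frac{b}{s}}
\end{align}
```

The lower bound is attained when the embedding collapses onto a single direction (one nonzero singular value), and the upper bound when all singular values are equal, i.e., a maximally spread-out embedding.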
During training, SVMax speeds up convergence by enabling large learning rates. The SVMax regularizer integrates seamlessly with various ranking losses. We apply SVMax to the last feature embedding layer, but the same formulation can be applied to intermediate layers. SVMax mitigates model collapse in both retrieval networks and generative adversarial networks (GANs) Goodfellow et al. (2014); Srivastava et al. (2017); Metz et al. (2017). Furthermore, the SVMax regularizer is useful when training unsupervised feature embedding networks with a contrastive loss (e.g., CPC) Noroozi et al. (2017); Oord et al. (2018); He et al. (2019); Tian et al. (2019).

In summary, we propose singular value maximization to regularize the feature embedding. In addition, we present a mathematical analysis of the mean singular value's lower and upper bounds to reduce hyperparameter tuning (Sec. 3). We quantitatively evaluate how the SVMax regularizer significantly boosts the performance of ranking losses (Sec. 4.1), and we qualitatively evaluate SVMax in the unsupervised learning setting via GAN training (Sec. 4.2).

Figure 1: Feature embeddings scattered over the 2D unit circle. In (a), the features are polarized along a single axis; the singular value of the principal (horizontal) axis is large, while that of the secondary (vertical) axis is small. In (b), the features are spread uniformly across both dimensions; both singular values are comparably large. (c) depicts a PCA analysis of a toy 2D Gaussian dataset to illustrate our intuition. The principal component (green) has the largest eigenvalue, i.e., it is the axis with the highest variation, while the second component (red) has a smaller eigenvalue. Maximizing all eigenvalues promotes data dispersion across all dimensions. In this paper, we maximize the mean singular value to regularize the feature embedding and avoid model collapse.

Zhang et al. (2018) employ SVD to avoid vanishing and exploding gradients in recurrent neural networks. Similarly, Guo & Ye (2019) bound the singular values of convolutional layers around 1 to preserve the layer's input and output norms; a bounded output norm mitigates the exploding/vanishing gradient problem. Weight regularizers share the common limitation that they do not enforce an explicit feature embedding objective and are thus ineffective against model collapse. Feature embedding regularizers have also been extensively studied, especially for classification networks Rippel et al. (2015); Wen et al. (2016); He et al. (2018); Hoffman et al. (2019); Taha et al. (2020). These regularizers aim to maximize class margins, class compactness, or both simultaneously. For instance, Wen et al. (2016) propose center loss to explicitly learn class representatives and thus promote class compactness. In classification tasks, test samples are assumed to lie within the same classes as the training set, i.e., closed-set identification. Retrieval tasks, such as product re-identification, instead assume an open-set setting. Because of this, a retrieval network regularizer should aim to spread features across many dimensions to fully utilize the expressive power of the embedding space. Recent literature Sablayrolles et al. (2018); Zhang et al. (2017) has recognized the importance of a spread-out feature embedding. However, this literature is tailored to triplet loss and therefore assumes a particular sampling procedure. In this paper, we leverage SVD as a regularizer because it is simple, differentiable Ionescu et al. (2015), and class oblivious. SVD has previously been used to promote low-rank models that learn compact intermediate layer representations Kliegl et al. (2017); Sanyal et al. (2019). This helps compress the network and speed up matrix multiplications on embedded devices (e.g., iPhone and Raspberry Pi). In contrast, we regularize the embedding space through a high-rank objective: by maximizing the mean singular value, we promote a higher-rank representation, i.e., a spread-out embedding.
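To make the objective concrete, the following is a minimal, self-contained sketch of the SVMax idea on 2D unit-norm embeddings, matching the setting of Figure 1. The function names (`mean_singular_value_2d`, `svmax_total_loss`) and the balancing weight `lam` are our illustrative choices, not the paper's implementation; a practical version would use a deep learning framework's differentiable batched SVD instead of this closed-form 2x2 eigen-solve.

```python
import math

def mean_singular_value_2d(embeddings):
    """Mean singular value of a b x 2 embedding matrix.

    The singular values of E are the square roots of the eigenvalues of
    the 2x2 Gram matrix G = E^T E, which we solve in closed form.
    """
    # Gram matrix entries of G = E^T E.
    g00 = sum(x * x for x, _ in embeddings)
    g11 = sum(y * y for _, y in embeddings)
    g01 = sum(x * y for x, y in embeddings)
    # Eigenvalues of the symmetric 2x2 matrix via the quadratic formula.
    tr, det = g00 + g11, g00 * g11 - g01 * g01
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    eig_hi, eig_lo = (tr + disc) / 2.0, (tr - disc) / 2.0
    s1, s2 = math.sqrt(max(eig_hi, 0.0)), math.sqrt(max(eig_lo, 0.0))
    return (s1 + s2) / 2.0  # average over min(b, d) = 2 singular values

def svmax_total_loss(task_loss, embeddings, lam):
    """Combined objective sketch: minimizing it maximizes the mean singular value."""
    return task_loss - lam * mean_singular_value_2d(embeddings)

# Collapsed embedding: all unit vectors on one axis (Figure 1a).
collapsed = [(1.0, 0.0)] * 4
# Spread embedding: unit vectors spread over the circle (Figure 1b).
spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

mu_c = mean_singular_value_2d(collapsed)  # singular values (2, 0) -> mean 1.0
mu_s = mean_singular_value_2d(spread)     # singular values (sqrt 2, sqrt 2) -> mean sqrt 2
```

With b = 4 and d = 2, the collapsed batch attains the lower bound sqrt(b)/min(b, d) = 1.0 and the uniformly spread batch attains the upper bound sqrt(b / min(b, d)) = sqrt(2), so the regularized objective always prefers the spread-out embedding.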

