SVMAX: A FEATURE EMBEDDING REGULARIZER

Abstract

A neural network regularizer (e.g., weight decay) boosts performance by explicitly penalizing the complexity of a network. In this paper, we penalize inferior network activations, i.e., feature embeddings, which in turn regularizes the network's weights implicitly. We propose singular value maximization (SVMax) to learn a uniform feature embedding. The SVMax regularizer integrates seamlessly with both supervised and unsupervised learning. During training, our formulation mitigates model collapse and enables larger learning rates. Thus, our formulation converges in fewer epochs, which reduces the training computational cost. We evaluate the SVMax regularizer using both retrieval and generative adversarial networks. We leverage a synthetic mixture-of-Gaussians dataset to evaluate SVMax in an unsupervised setting. For retrieval networks, SVMax achieves significant improvement margins across various ranking losses.

1. INTRODUCTION

A neural network's knowledge is embodied in both its weights and its activations. This difference manifests in how network pruning and knowledge distillation tackle the model compression problem. While the pruning literature Li et al. (2016); Luo et al. (2017); Yu et al. (2018) compresses models by removing less significant weights, knowledge distillation Hinton et al. (2015) reduces computational complexity by matching a cumbersome network's last-layer activations (logits). This perspective, of weight-knowledge versus activation-knowledge, emphasizes how the neural network literature is dominated by explicit weight regularizers. In contrast, this paper leverages singular value decomposition (SVD) to regularize a network through its last-layer activations, i.e., its feature embedding. Our formulation is inspired by principal component analysis (PCA). Given a set of points and their covariance, PCA yields the set of orthogonal eigenvectors sorted by their eigenvalues. The principal component (first eigenvector) is the axis with the highest variation (largest eigenvalue), as shown in Figure 1c. The eigenvalues from PCA, and similarly the singular values from SVD, provide insights into the structure of the embedding space. As such, by regularizing the singular values, we reshape the feature embedding.

The main contribution of this paper is to leverage the singular value decomposition of a network's activations to regularize the embedding space. We achieve this objective through singular value maximization (SVMax). The SVMax regularizer is oblivious to both the input class (labels) and the sampling strategy. Thus, it promotes a uniform embedding space in both supervised and unsupervised learning. Furthermore, we present a mathematical analysis of the mean singular value's lower and upper bounds. This analysis makes tuning SVMax's balancing hyperparameter easier when the feature embedding is normalized to the unit circle.
During training, SVMax speeds up convergence by enabling large learning rates. The SVMax regularizer integrates seamlessly with various ranking losses. We apply the SVMax regularizer to the last feature embedding layer, but the same formulation can be applied to intermediate layers. The SVMax regularizer mitigates model collapse in both retrieval networks and generative adversarial networks (GANs) Goodfellow et al. (2014); Srivastava et al. (2017); Metz et al. (2017). Furthermore, the SVMax regularizer is useful when training unsupervised feature embedding networks with a contrastive loss (e.g., CPC) Noroozi et al. (2017); Oord et al. (2018); He et al. (2019); Tian et al. (2019). In summary, we propose singular value maximization to regularize the feature embedding. In addition, we present a mathematical analysis of the mean singular value's lower and upper bounds to reduce hyperparameter tuning (Sec. 3). We quantitatively evaluate how the SVMax regularizer significantly boosts the performance of ranking losses (Sec. 4.1), and we provide a qualitative evaluation of SVMax in the unsupervised learning setting via GAN training (Sec. 4.2).

2. RELATED WORK

Embedding regularizers in the classification literature aim to maximize class margins, class compactness, or both simultaneously. For instance, Wen et al. (2016) propose the center loss to explicitly learn class representatives and thus promote class compactness. In classification tasks, test samples are assumed to lie within the same classes as the training set, i.e., closed-set identification. However, retrieval tasks, such as product re-identification, assume an open-set setting. Because of this, a retrieval network regularizer should aim to spread features across many dimensions to fully utilize the expressive power of the embedding space.

Recent literature Sablayrolles et al. (2018); Zhang et al. (2017) has recognized the importance of a spread-out feature embedding. However, this literature is tailored to triplet loss and therefore assumes a particular sampling procedure. In this paper, we leverage SVD as a regularizer because it is simple, differentiable Ionescu et al. (2015), and class oblivious. SVD has been used to promote low-rank models that learn compact intermediate-layer representations Kliegl et al. (2017); Sanyal et al. (2019). This helps compress the network and speed up matrix multiplications on embedded devices (e.g., iPhone and Raspberry Pi). In contrast, we regularize the embedding space through a high-rank objective. By maximizing the mean singular value, we promote a higher-rank representation, i.e., a spread-out embedding.

3. SINGULAR VALUE MAXIMIZATION (SVMAX)

We first introduce our mathematical notation. Let I denote the image space and E_I ∈ R^d denote the feature embedding space, where d is the dimension of the features. A feature embedding network is a function F_θ : I → E_I, parameterized by the network's weights θ. We quantify the similarity between an image pair (I_1, I_2) via the Euclidean distance in feature space, i.e., ||E_{I_1} - E_{I_2}||_2. During training, a 2D matrix E ∈ R^{b×d} stores the embeddings of b samples, where b is the mini-batch size. Assuming b ≥ d, the singular value decomposition (SVD) of E provides the singular values S = [s_1, ..., s_d], where s_1 and s_d are the largest and smallest singular values, respectively. We maximize the mean singular value, s_µ = (1/d) Σ_{i=1}^{d} s_i, to regularize the network's last-layer activations, i.e., the feature embedding. By maximizing the mean singular value, the deep network spreads out its embeddings. This has the added benefit of implicitly regularizing the network's weights θ. The proposed SVMax regularizer integrates with both supervised and unsupervised feature embedding networks as follows

L_NN = L_r - λ (1/d) Σ_{i=1}^{d} s_i = L_r - λ s_µ,    (1)

where L_r is the original network loss and λ is a balancing hyperparameter.

Lower and Upper Bounds of the Mean Singular Value: One caveat of equation 1 is the hyperparameter λ. It is difficult to tune because the mean singular value s_µ depends on the range of values inside E and on its dimensions (b, d). Thus, changing the batch size or the embedding dimension requires a different λ. To address this, we utilize a common assumption in metric learning: the unit-circle (L2-normalized) embedding assumption. This assumption provides both lower and upper bounds on ranking losses. It also allows us to impose lower and upper bounds on s_µ.
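The penalty in equation 1 reduces to one SVD call per mini-batch. The following NumPy sketch illustrates the computation (this is our own minimal illustration, not the authors' implementation; the batch size, embedding dimension, and the placeholder ranking loss value are assumptions):

```python
import numpy as np

def svmax_penalty(E):
    # mean singular value s_mu of a (b, d) embedding matrix, assuming b >= d
    s = np.linalg.svd(E, compute_uv=False)  # d singular values, descending
    return s.mean()

# toy mini-batch: b = 8 row-wise L2-normalized embeddings of dimension d = 4
rng = np.random.default_rng(0)
E = rng.standard_normal((8, 4))
E /= np.linalg.norm(E, axis=1, keepdims=True)

L_r = 0.5   # placeholder for the ranking loss L_r (illustrative value)
lam = 1.0   # balancing hyperparameter lambda
L_NN = L_r - lam * svmax_penalty(E)  # equation 1: L_r - lambda * s_mu
```

A fully collapsed batch (all rows identical) attains the smallest possible s_µ, while a spread-out batch yields a larger value, so subtracting λs_µ from the loss rewards dispersed embeddings.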
For an L2-normalized embedding E, the largest singular value s_1 is maximal when the matrix rank of E equals one, i.e., rank(E) = 1 and s_i = 0 for i ∈ [2, d]. Horn & Johnson (1991) provide an upper bound on this largest singular value, s*(E) ≤ sqrt(||E||_1 ||E||_∞), which holds with equality for an L2-normalized E ∈ R^{b×d} with rank(E) = 1. For an L2-normalized matrix E with ||E||_1 = b and ||E||_∞ = 1, this gives s*(E) = sqrt(||E||_1 ||E||_∞) = √b. Thus, the lower bound L on s_µ is

L = s*(E)/d = √b/d.    (2)

Similarly, an upper bound is defined on the sum of the singular values Turkmen & Civciv (2007); Kong et al. (2018); Friedland & Lim (2016). This summation is formally known as the nuclear norm of a matrix, ||E||_*. Hu (2015) established an upper bound on this summation using the Frobenius norm ||E||_F as follows

||E||_* ≤ sqrt(b × d / max(b, d)) ||E||_F,    (3)

where ||E||_F = (Σ_{i=1}^{b} Σ_{j=1}^{d} |E_ij|^2)^{1/2} = √b because of the L2-normalization assumption. Accordingly, the lower and upper bounds of s_µ are [L, U] = [√b/d, sqrt(b × d / max(b, d)) √b/d]. With these bounds, we rewrite our final loss function as follows

L_NN = L_r + λ exp((U - s_µ) / (U - L)).    (4)

The SVMax regularizer term grows exponentially within [1, e]. We employ this loss function in all our retrieval experiments. It is important to note that the L2-normalization assumption makes λ tuning easier, but it is not required. Equation 4 makes the hyperparameter λ dependent only on the range of L_r, which is also bounded for ranking losses.

Lower and Upper Bounds of Ranking Losses: We briefly show that ranking losses are bounded when assuming an L2-normalized embedding. Equations 5 and 6 show the triplet and contrastive losses, respectively.

Table 1: Quantitative evaluation on CUB-200-2011 with batch size b = 144, embedding dimension d = 128, and multiple learning rates lr = {0.01, 0.001, 0.0001}. The R@1 margin column indicates the R@1 improvement relative to the vanilla ranking loss.
A large learning rate lr increases the chance of model collapse, while a small lr slows convergence. λ depends on the ranking loss.

TL = Σ_{(a,p,n)∈T} [D(⟨a⟩, ⟨p⟩) - D(⟨a⟩, ⟨n⟩) + m]_+ ∈ [0, 2 + m],    (5)

CL = Σ_{(x,y)∈P} (1 - δ_{x,y}) D(⟨x⟩, ⟨y⟩) + δ_{x,y} [m - D(⟨x⟩, ⟨y⟩)]_+ ∈ [0, 2],    (6)

where [•]_+ = max(0, •) and m < 2 is the margin between classes, since 2 is the maximum distance on the unit circle. ⟨•⟩ and D(·, ·) denote the embedding and Euclidean distance functions, respectively. In equation 5, a, p, and n are the anchor, positive, and negative images in a single triplet (a, p, n) from the triplet set T. In equation 6, x and y form a single pair of images from the pair set P; δ_{x,y} = 1 when x and y belong to different classes and zero otherwise. In the supplementary material, we (1) show a similar analysis for the N-pair and angular losses, (2) provide an SVMax evaluation on small training batches, i.e., b < d, and (3) evaluate the computational complexity of SVMax.
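Under the unit-circle assumption, the bounds L = √b/d and U = sqrt(b·d/max(b, d))·√b/d and the exponential form of equation 4 can be sketched as follows (a NumPy illustration of our own, not the authors' code; the (b, d) = (8, 4) batch is an arbitrary toy example):

```python
import numpy as np

def svmax_bounds(b, d):
    # bounds on s_mu for a row-wise L2-normalized (b, d) matrix with b >= d
    lower = np.sqrt(b) / d                                # rank-1 collapse
    upper = np.sqrt(b * d / max(b, d)) * np.sqrt(b) / d   # nuclear-norm bound / d
    return lower, upper

def svmax_exp_term(E):
    # exp((U - s_mu) / (U - L)) from equation 4; always inside [1, e]
    b, d = E.shape
    L, U = svmax_bounds(b, d)
    s_mu = np.linalg.svd(E, compute_uv=False).mean()
    return np.exp((U - s_mu) / (U - L))

# a fully collapsed batch sits at the lower bound, so the term approaches e
E_collapsed = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (8, 1))
penalty = svmax_exp_term(E_collapsed)
```

Because the term is confined to [1, e] for any L2-normalized batch, λ only has to be matched against the range of the ranking loss, not against (b, d).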

4. EXPERIMENTS

In this section, we evaluate SVMax using both supervised and unsupervised learning. We leverage retrieval and generative adversarial networks for quantitative and qualitative evaluations. We employ the spread-out regularizer Zhang et al. (2017) as a baseline for its simplicity, with the default hyperparameter α = 1. Unlike SVMax, such regularizers require a supervised setting to push anchor-negative pairs apart. To enable the spread-out regularizer on non-triplet ranking losses, we pair every anchor with a random negative sample from the training mini-batch.

Evaluation Metrics: For quantitative evaluation, we use the Recall@K metric and Normalized Mutual Information (NMI) on the test split. The hyperparameter λ = 1 for both contrastive and N-pair losses, λ = 0.1 for triplet loss, and λ = 2 for angular loss. We fix λ across datasets, architectures, and other hyperparameters (b, d).

Results: Tables 1 and 2 present quantitative retrieval evaluations on the CUB-200 and Stanford Online Products datasets, both using GoogLeNet. These tables provide an in-depth analysis and emphasize our improvement margins on a small and a large dataset. Figure 2 provides a quantitative evaluation on Stanford CARS196. We report the qualitative retrieval evaluation and a quantitative evaluation on ResNet50 in the supplementary material. Our training hyperparameters, the learning rate lr and the number of iterations K, do not favor a particular ranking loss. We evaluate SVMax with various learning rates. A large learning rate, e.g., lr = 0.01, speeds up convergence but increases the chance of model collapse. In contrast, a small rate, e.g., lr = 0.0001, is likely to avoid model collapse but is slow to converge. This undesirable effect is tolerable for small datasets, where increasing the number of training iterations K does not drastically increase the overall training time, but it is infeasible for large datasets.
For contrastive and N-pair losses, SVMax is significantly superior to both the vanilla and spread-out baselines, especially with a large learning rate. A small lr slows convergence, and all approaches become equivalent. The spread-out regularizer Zhang et al. (2017) and its hyperparameters are tuned for triplet loss. Thus, for this particular ranking loss, the SVMax and spread-out regularizers are on par. In our experiments, we employ a large learning rate because it is the simplest way to induce a model collapse. However, the learning rate is not the only factor. Another factor is the training dataset's size and its intra-class variations. A small dataset with large intra-class variations increases the chances of a model collapse. For example, a pair of dissimilar birds from the same class can trigger a model collapse when coupled with a large learning rate. The hard triplet loss experiments emphasize this point because every anchor is paired with the hardest positive and negative samples.

Figure 2: Quantitative evaluation on Stanford CARS196. The X- and Y-axes denote the learning rate lr and the Recall@1 performance, respectively.

On small fine-grained datasets like CUB-200 or CARS196, the vanilla hard triplet loss suffers significantly. Yet, the same implementation is superior on a big dataset like Stanford Online Products. By carefully tuning the training hyperparameters on CUB-200, it is possible to avoid a degenerate solution. However, this tedious tuning process is unnecessary when using either the spread-out or the SVMax regularizer. The vanilla N-pair loss underperforms because it does not support feature embedding on the unit circle. Both spread-out and SVMax mitigate this limitation. For angular loss, a bigger λ = 2 is employed to cope with the angular loss's range. SVMax is a class-oblivious regularizer. Thus, λ should be significant enough to contribute to the loss function without dominating the ranking loss. Wu et al.
(2017) show that the distance between any anchor-negative pair, randomly sampled from an n-dimensional unit sphere, follows the normal distribution N(√2, 1/(2n)). This mean distance √2 is large relative to the triplet loss margin m = 0.2, but comparable to the contrastive loss margin m = 1. Accordingly, the triplet loss converges to zero after a few iterations, because most triplets satisfy the margin m = 0.2 constraint. When the triplet loss equals zero, the SVMax regularizer with λ = 1 becomes the dominant term. However, the SVMax regularizer should not dominate because it is oblivious to data annotations; it pushes anchor-positive and anchor-negative pairs apart equally. Reducing λ to 0.1 solves this problem. A less aggressive triplet loss Schroff et al. (2015); Xuan et al. (2020) is another way to avoid model collapse. For instance, Schroff et al. (2015) propose a triplet loss variant that employs semi-hard negatives. The semi-hard triplet loss is more stable than the aggressive hard triplet and lifted structured losses Oh Song et al. (2016). Unfortunately, the semi-hard triplet loss assumes a very large mini-batch (b = 1,800 in Schroff et al. (2015)), which is impractical. Furthermore, when model collapse is avoided, aggressive triplet loss variants achieve superior performance Hermans et al. (2017). In contrast, the SVMax regularizer only requires a mini-batch larger than the embedding dimension, i.e., b ≥ d, a natural constraint for retrieval networks, which favor compact embedding dimensions. Additionally, SVMax makes no assumption about the sampling procedure. Thus, unlike Sablayrolles et al. (2018); Zhang et al. (2017), SVMax supports various supervised ranking losses.
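The N(√2, 1/(2n)) concentration claimed by Wu et al. (2017) is easy to verify empirically. The NumPy check below (our own sketch; the dimension n = 128 and the number of sampled pairs are arbitrary choices) compares the empirical mean and variance of random anchor-negative distances on the unit sphere against the predicted values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_pairs = 128, 20000  # embedding dimension and number of sampled pairs

def random_unit_vectors(k, n):
    # k uniformly random points on the n-dimensional unit sphere
    x = rng.standard_normal((k, n))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

x = random_unit_vectors(num_pairs, n)
y = random_unit_vectors(num_pairs, n)
dist = np.linalg.norm(x - y, axis=1)

# distances concentrate around sqrt(2) with variance ~ 1/(2n), i.e., far
# above the triplet margin m = 0.2 but close to the contrastive margin m = 1
print(dist.mean(), dist.var())
```

This concentration explains why most random triplets trivially satisfy a small margin, leaving SVMax as the dominant term unless λ is reduced.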

4.2. GENERATIVE ADVERSARIAL NETWORKS

Model collapse is one of the main challenges in training generative adversarial networks (GANs) Metz et al. (2017); Srivastava et al. (2017); Mao et al. (2019); Salimans et al. (2016). To tackle this challenge, Metz et al. (2017) propose the unrolled-GAN to prevent the generator from overfitting to the discriminator. In an unrolled-GAN, the generator observes the discriminator for l steps before updating the generator's parameters using the gradient from the final step. Alternatively, we leverage the simpler SVMax regularizer to avoid model collapse. We evaluate our regularizer using a simple GAN on a 2D mixture of 8 Gaussians arranged in a circle. This 2D baseline Metz et al. (2017); Srivastava et al. (2017); Bang & Shim (2018) provides a simple qualitative evaluation and demonstrates SVMax's potential in unsupervised learning. We leverage this simple baseline because we assume b ≥ d, which does not hold for images.

Figure 3 shows the dynamics of the GAN generator through time. We use a public PyTorch implementation¹ of Metz et al. (2017). We made a single modification to the code: a relatively large learning rate, lr = 0.025, for both the generator and the discriminator. This single modification is a simple and fast way to induce model collapse. The mixture-of-Gaussians circle has a radius r = 2, i.e., the generated fake embedding is neither L2-normalized nor strictly bounded by a network layer. We kept the radius parameter unchanged to emphasize that neither L2-normalization nor strict bounds are required. To mitigate the impact of lurking variables (e.g., random network initialization and mini-batch sampling), we fix the random generator's seed for all experiments. We apply SVMax to a vanilla GAN and to an unrolled GAN with five unrolling steps. We apply the vanilla SVMax regularizer (Eq. 1), i.e., L_NN = L_GAN - λ s_µ, where λ = 0.01 and s_µ is the mean singular value of the generator's fake embedding. GANs are typically used to generate high-resolution images. This high-resolution output is the main limitation of the SVMax regularizer. The current formulation assumes the batch size is larger than the embedding dimension, i.e., b ≥ d. This constraint is trivial for the 2D mixture-of-Gaussians dataset and for retrieval networks with a compact embedding dimensionality (e.g., d = {128, 256}). However, it hinders high-resolution image generators because the mini-batch size constraint becomes b ≥ W × H × C, where W, H, and C are the generated image's width, height, and number of channels, respectively. Nevertheless, this GAN experiment emphasizes the potential of the SVMax regularizer in unsupervised learning.
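A minimal illustration of why the Eq. 1 penalty counteracts generator collapse (our own NumPy sketch; the "generator outputs" below are synthetic stand-ins for the 2D mixture experiment, not samples from the actual model):

```python
import numpy as np

rng = np.random.default_rng(1)

def s_mu(E):
    # mean singular value of a (b, d) batch of generator outputs, b >= d
    return np.linalg.svd(E, compute_uv=False).mean()

b = 512  # mini-batch size; b >= d = 2 holds trivially for 2-D outputs

# a collapsed generator emits nearly identical points near one Gaussian mode
collapsed = np.tile([2.0, 0.0], (b, 1)) + 0.01 * rng.standard_normal((b, 2))
# a healthy generator spreads mass around the radius-2 circle of modes
angles = rng.uniform(0.0, 2.0 * np.pi, b)
spread = 2.0 * np.column_stack([np.cos(angles), np.sin(angles)])

lam = 0.01
L_GAN = 1.0  # placeholder for the adversarial generator loss (illustrative)
# generator objective with SVMax: L_GAN - lam * s_mu; note that the fake
# embedding needs no L2-normalization, matching the GAN experiment above
L_NN = L_GAN - lam * s_mu(spread)
```

Since s_µ is markedly larger for the spread-out batch than for the collapsed one, maximizing it steers the generator away from single-mode solutions.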

4.3. ABLATION STUDY

In this section, we evaluate two hypotheses: (1) the SVMax regularizer boosts retrieval performance because it learns a uniform feature embedding, and (2) the same SVMax hyperparameter λ supports different embedding dimensions and batch sizes, which is the main objective of the mean singular value's bound analysis. To evaluate the SVMax regularizer's impact on feature embeddings, we embed the MNIST dataset onto the 2D unit circle. In this experiment, we use a tiny CNN (one convolutional layer and one hidden layer). Figure 4 shows the embedding space after training for t epochs. With the SVMax regularizer, the feature embeddings spread out more uniformly and more rapidly than with the vanilla contrastive loss. The mean singular value bound analysis makes tuning the hyperparameter λ easier. This hyperparameter becomes dependent only on the ranking loss and independent of both the batch size and the embedding dimension. Figure 5 presents a quantitative evaluation using the CUB-200 dataset. We explore various batch sizes b = {288, 72} and embedding dimensions d = {256, 64}. We employ MobileNetV2 Sandler et al. (2018) to fit the big batch b = 288 on a 24GB GPU. The supplementary material contains a similar evaluation on the Stanford Online Products and CARS196 datasets.
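The bound analysis can be checked numerically: the exponential regularizer term stays within [1, e] for any batch size and embedding dimension, which is why a single λ transfers across (b, d). A NumPy sketch under the L2-normalization assumption (our own illustration; the (b, d) pairs mirror those explored in this ablation):

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_term(E):
    # exp((U - s_mu) / (U - L)), the SVMax term of equation 4
    b, d = E.shape
    L = np.sqrt(b) / d
    U = np.sqrt(b * d / max(b, d)) * np.sqrt(b) / d
    s_mu = np.linalg.svd(E, compute_uv=False).mean()
    return np.exp((U - s_mu) / (U - L))

terms = []
for b, d in [(288, 256), (144, 128), (72, 64)]:
    E = rng.standard_normal((b, d))
    E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-circle embedding
    terms.append(exp_term(E))
# every term lies in [1, e] regardless of (b, d)
```

Because the term's range is fixed, only the scale of the ranking loss determines a good λ.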

5. CONCLUSION

We have proposed singular value maximization (SVMax) as a feature embedding regularizer. SVMax promotes a uniform embedding, mitigates model collapse, and enables large learning rates. Unlike other embedding regularizers, the SVMax regularizer supports a large spectrum of ranking losses. Moreover, it is oblivious to data annotation and, as such, supports both supervised and unsupervised learning. A qualitative evaluation using a generative adversarial network demonstrates SVMax's potential in unsupervised learning. Quantitative retrieval evaluations highlight significant performance improvements due to the SVMax regularizer.



¹ https://github.com/andrewliao11/unrolled-gans



Figure 1: Feature embeddings scattered over the 2D unit circle. In (a), the features are polarized along a single axis; the singular value of the principal (horizontal) axis is large while the singular value of the secondary (vertical) axis is small. In (b), the features are spread uniformly across both dimensions; both singular values are comparably large. (c) depicts the PCA analysis of a toy 2D Gaussian dataset to demonstrate our intuition. The principal component (green) has the highest eigenvalue, i.e., it is the axis with the highest variation, while the second component (red) has a smaller eigenvalue. Maximizing all eigenvalues promotes data dispersion across all dimensions. In this paper, we maximize the mean singular value to regularize the feature embedding and avoid a model collapse.

We evaluate the SVMax regularizer quantitatively using three datasets: CUB-200-2011 Wah et al. (2011), Stanford CARS196 Krause et al. (2013), and Stanford Online Products Oh Song et al. (2016). We use GoogLeNet Szegedy et al. (2015) and ResNet50 He et al. (2016); both pretrained on ImageNet Deng et al. (2009) and fine-tuned for K iterations. These are standard retrieval datasets and architectures. By default, the embedding ∈ R^{d=128} is normalized to the unit circle. In all experiments, a batch size b = 144 is employed; the learning rate lr is fixed for K/2 iterations and then decayed polynomially to 1e-7 at iteration K. We use the SGD optimizer with 0.9 momentum. Each batch contains p different classes and l different samples per class. For example, triplet loss employs p = 24 different classes and l = 6 instances per class. The mini-batch of N-pair loss contains 72 classes and a single positive pair per class, i.e., p = 72 and l = 2. The same mini-batch setting is used for angular loss. For contrastive loss, p = 36 and l = 4 are divided into 72 positive and 72 negative pairs. For CUB-200 and CARS196, K = 5,000 iterations; for Stanford Online Products, K = 20,000.

Figure 3: The SVMax regularizer mitigates model collapse in a GAN trained on a toy 2D mixture of Gaussians dataset. Columns show heatmaps of the generator distributions at different training steps (iterations). The final column shows the groundtruth distribution. The first row shows the distributions generated by training a vanilla GAN suffering a model collapse. The second row shows the generated distribution when penalizing the generator's fake embedding with the SVMax regularizer. The third and fourth rows show two distributions generated using an unrolled-GAN with and without the SVMax regularizer, respectively. This high resolution figure is best viewed on a screen with zoom capabilities.

Figure 4: Qualitative feature embedding evaluation using the MNIST dataset projected onto the 2D unit circle. The first row shows the feature embedding learned with a vanilla contrastive loss; the second row shows the SVMax-regularized embedding. A random subset of the test split is projected for visualization purposes. Different colors denote different classes. The regularized feature embedding spreads out uniformly and rapidly. The supplementary material shows how the feature embedding evolves up to 200 epochs. This high-resolution figure is best seen on a screen.


Table 2: Quantitative evaluation on Stanford Online Products.


