GRASSMANNIAN CLASS REPRESENTATION IN DEEP LEARNING

Abstract

We generalize the class representative vector found in deep classification networks to linear subspaces and show that the new formulation enables the simultaneous enhancement of inter-class discrimination and intra-class feature variation. Traditionally, the logit is computed as the inner product between a feature and the class vector. In our formulation, classes are subspaces and the logit is defined as the norm of the projection of a feature onto the subspace. Since the set of subspaces forms a Grassmann manifold, finding the optimal subspace representation for classes amounts to optimizing a loss on the Grassmannian. We integrate Riemannian SGD into existing deep learning frameworks so that the class subspaces on the Grassmannian are jointly optimized with the other model parameters in Euclidean space. Compared to the vector form, subspaces have two appealing properties: they can be multi-dimensional and they are scaleless. Empirically, we show that these distinct characteristics improve various tasks. (1) Image classification. The new formulation raises the top-1 accuracy of ResNet50-D on ImageNet-1K from 78.04% to 79.37% using the standard augmentation in 100 training epochs, confirming that subspaces have greater representative capability than vectors. (2) Feature transfer. Subspaces give features the freedom to vary, and we observe that the intra-class variability of features increases with the subspace dimension. Consequently, the features are of higher quality for downstream tasks: the average transfer accuracy across 6 datasets improves from 77.98% to 80.12% over the strong baseline of vanilla softmax. (3) Long-tail classification. The scaleless property of subspaces benefits classification in the long-tail scenario and improves the accuracy on ImageNet-LT from 46.83% to 48.94% compared to the standard formulation.
With these encouraging results, we believe that more applications could benefit from the Grassmannian class representation. Code will be released.

1. INTRODUCTION

The idea of representing classes as linear subspaces in machine learning dates back at least to 1973 (Watanabe & Pakvasa (1973)), yet it is mostly ignored in the current deep learning literature. In this paper, we revisit the scheme of representing classes as linear subspaces in the deep learning context. To be specific, each class i is associated with a linear subspace S_i, and for any feature vector x, the i-th class logit is defined as the norm of the projection

l_i := ||proj_{S_i} x||.    (1)

Since a subspace is a point on a Grassmann manifold (Absil et al. (2009)), we call this formulation the Grassmannian class representation. In the following, we answer two critical questions: 1. Is the Grassmannian class representation useful in real applications? 2. How can the subspaces be optimized during training?

The procedure fully-connected layer → softmax → cross-entropy loss is the standard practice in deep classification networks. Each column of the weight matrix of the fully-connected layer is called the class representative vector and serves as a prototype for one class. This representation of classes has achieved huge success, yet it is not without imperfections. In the study of transferable features, researchers noticed a dilemma: representations with higher classification accuracy on the original task lead to less transferable features for downstream tasks (Kornblith et al. (2021); Müller et al. (2019)). This is connected to the fact that such representations tend to collapse the intra-class variability, so the logits lose information about the resemblances between instances of different classes. Furthermore, the neural collapse phenomenon (Papyan et al. (2020)) indicates that as training progresses, the intra-class variation becomes negligible and features collapse to their class-means. This dilemma thus inherently originates from the practice of representing a class by a single vector.
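As a minimal sketch (not the released implementation), the projection-norm logit of Eq. (1) can be computed as follows, assuming each class subspace S_i is stored as a matrix with orthonormal columns:

```python
import numpy as np

def grassmann_logit(x, S):
    """Logit of Eq. (1): norm of the projection of feature x onto span(S).

    S is a (d, k) matrix whose orthonormal columns span the class subspace.
    The projection is S @ S.T @ x, and because the columns are orthonormal
    its norm equals ||S.T @ x||, which is cheaper to compute.
    """
    return np.linalg.norm(S.T @ x)
```

For k = 1 with a unit-norm column, this reduces (up to sign) to the usual inner-product logit of the vector formulation.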
The Grassmannian class representation sheds light on this issue, as features of each class are allowed to vary in a high-dimensional subspace without incurring losses in classification. In the study of long-tail classification, researchers found that the norm of a class representative vector is highly correlated with the number of training instances in the corresponding class (Kang et al. (2019)), which affects recognition accuracy. To counter this effect, the class representative vector is typically rescaled to unit length during training (Liu et al. (2019)) or re-calibrated in an extra post-processing step (Kang et al. (2019)). Beyond these techniques, the Grassmannian class representation provides a natural and elegant solution, as subspaces are scaleless.

It is well known that the set of k-dimensional linear subspaces forms a Grassmann manifold, so finding the optimal subspace representation for classes amounts to optimizing on the Grassmann manifold. Thus, for the second question, the natural solution is geometric optimization (Edelman et al. (1998)), which optimizes an objective function under the constraint of a given manifold. Points being optimized move along geodesics instead of following the direction of Euclidean gradients. The preliminary concepts of geometric optimization are reviewed in Section 3, and the technical details of subspace learning are presented in Section 4. We implement an efficient Riemannian SGD for optimization on the Grassmann manifold, shown in Algorithm 1, which integrates geometric optimization into deep learning frameworks so that the linear subspaces on the Grassmannian and the model weights in Euclidean space are jointly optimized. Returning to the first question, we experiment on three concrete tasks in Section 5 to demonstrate the practicality and effectiveness of the Grassmannian class representation.
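As an illustrative simplification (not Algorithm 1 itself, which additionally handles momentum and mini-batch details), a single Riemannian SGD update on the Grassmannian can be sketched as: project the Euclidean gradient onto the tangent space at the current point, then move along the closed-form geodesic of Edelman et al. (1998):

```python
import numpy as np

def riemannian_sgd_step(S, euc_grad, lr):
    """One simplified Riemannian SGD step for a subspace basis S (d, k)
    with orthonormal columns. euc_grad is the Euclidean gradient dL/dS."""
    # Project the Euclidean gradient onto the tangent space at S:
    # G = (I - S S^T) dL/dS, which satisfies S^T G = 0.
    G = euc_grad - S @ (S.T @ euc_grad)
    # Thin SVD of the descent direction H = -G.
    U, sigma, Vt = np.linalg.svd(-G, full_matrices=False)
    # Closed-form geodesic (Edelman et al., 1998):
    # S(t) = S V cos(Sigma t) V^T + U sin(Sigma t) V^T, with t = lr.
    S_new = (S @ Vt.T @ np.diag(np.cos(sigma * lr)) @ Vt
             + U @ np.diag(np.sin(sigma * lr)) @ Vt)
    # Re-orthonormalize to counter accumulated numerical drift.
    Q, _ = np.linalg.qr(S_new)
    return Q
```

The update stays on the manifold by construction: the new basis again has orthonormal columns, so the scale of the class representation never drifts during training.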
We find that (1) Grassmannian class representation improves large-scale image classification accuracy. (2) Grassmannian class representation produces high-quality features that can better transfer to downstream tasks. (3) Grassmannian class representation improves the long-tail classification accuracy. With these encouraging results, we believe that Grassmannian class representation is a promising formulation and more applications may benefit from its attractive features. 

2. RELATED WORK

Geometric Optimization. Edelman et al. (1998) developed geometric Newton and conjugate gradient algorithms on the Grassmann and Stiefel manifolds in their seminal paper. Riemannian SGD was introduced in Bonnabel (2013) with a convergence analysis, and there are variants such as Riemannian SGD with momentum (Roy et al. (2018)) or with adaptivity (Kasai et al. (2019)). Other popular Euclidean optimization methods such as Adam have also been studied in the Riemannian manifold context (Becigneul & Ganea (2019)). Lezcano-Casado & Martínez-Rubio (2019) study the special cases of SO(n) and U(n) and use the exponential map to enable Euclidean optimization methods for Lie groups. The idea was generalized into trivialization in Lezcano Casado (2019). Our Riemannian SGD in Algorithm 1 is tailored to the Grassmannian, so we have a closed-form equation for geodesics.

Hamm & Lee (2008) propose the Grassmann discriminant analysis, in which features are modeled as linear subspaces. These applications mostly use shallow models. Zhang et al. (2018) use subspaces to model clusters in unsupervised learning, which shares a similar spirit with our work. Simon et al. (2020) model classes as subspaces in few-shot learning; however, their subspaces are computed from the data matrix rather than explicitly parametrized and learned. Roy et al. (2019) use the Stiefel manifold to construct a Mahalanobis distance matrix in Siamese networks in order to improve feature embeddings in deep metric learning.

