DECOUPLED TRAINING FOR LONG-TAILED CLASSIFICATION WITH STOCHASTIC REPRESENTATIONS

Abstract

Decoupling representation learning and classifier learning has been shown to be effective in classification with long-tailed data. There are two main ingredients in constructing a decoupled learning scheme: 1) how to train the feature extractor so that it provides generalizable representations, and 2) how to re-train the classifier so that it constructs proper decision boundaries by handling class imbalances in long-tailed data. In this work, we first apply Stochastic Weight Averaging (SWA), an optimization technique for improving the generalization of deep neural networks, to obtain better generalizing feature extractors for long-tailed classification. We then propose a novel classifier re-training algorithm based on stochastic representations obtained from SWA-Gaussian, a Gaussian-perturbed SWA, together with a self-distillation strategy that harnesses the diverse stochastic representations based on uncertainty estimates to build more robust classifiers. Extensive experiments on the CIFAR10/100-LT, ImageNet-LT, and iNaturalist-2018 benchmarks show that our proposed method improves upon previous methods in terms of both prediction accuracy and uncertainty estimation.

1. INTRODUCTION

While deep neural networks have achieved remarkable performance on various computer vision benchmarks (e.g., image classification (Russakovsky et al., 2015) and object detection (Lin et al., 2014)), many challenges remain when applying them to real-world applications. One such challenge is that real-world classification data are long-tailed: the distribution of class frequencies exhibits a long tail, and many of the classes have only a few observations belonging to them. As a consequence, the class distribution of such data is extremely imbalanced, which degrades the performance of a standard classification model trained under the balanced-class assumption due to a paucity of samples from tail classes (Van Horn et al., 2018; Liu et al., 2019). Thus, it is worth exploring novel techniques for dealing with long-tailed data in real-world deployments.

While several works have diagnosed the performance bottleneck of long-tailed recognition as distinct from that of balanced recognition (e.g., improper decision boundaries over the representation space (Kang et al., 2020), low-quality representations from the feature extractor (Samuel and Chechik, 2021)), their shared design principle is to give tail classes a chance to compete with head classes.

Decoupling (Kang et al., 2020) is one of the learning strategies proven to be effective for long-tailed data, where the representation learning via the feature extractor and the classifier learning via the last classification layer are decoupled. Even for a classification network failing on long-tailed data, the representations obtained from the penultimate layer can be flexible and generalizable, provided that the feature extractor is expressive enough (Donahue et al., 2014; Zeiler and Fergus, 2014; Girshick et al., 2014). The main motivation behind decoupling is that the performance bottleneck of long-tailed classification lies in the improper decision boundaries set over the representation space.
Based on this, Kang et al. (2020) have shown that a simple re-training of the last layer parameters can significantly improve the performance. The success of decoupling naturally motivates obtaining more informative representations from which the classifier re-training can benefit. To this end, we first investigate Stochastic Weight Averaging (SWA; Izmailov et al., 2018), which improves the generalization performance of deep neural networks by seeking flat minima in loss surfaces. Although SWA has been successful for various tasks involving deep neural networks (for instance, supervised learning (Izmailov et al., 2018), semi-supervised learning (Athiwaratkun et al., 2019), and domain generalization (Cha et al., 2021)), to the best of our knowledge, it has never been explored for long-tailed classification problems. In Section 3, we empirically show that a naïve application of SWA to long-tailed classification fails due to a similar bottleneck issue, but when combined with decoupling, SWA significantly improves the classification performance because it yields generalizable representations.

Having confirmed that SWA can benefit long-tailed classification, we take a step further and propose a novel classifier re-training strategy. For this, we first obtain stochastic representations, i.e., the outputs of the penultimate layer computed with multiple feature extractor parameters drawn from an approximate posterior, where we construct the approximate posterior with SWA-Gaussian (SWAG; Maddox et al., 2019). SWAG is a Bayesian extension of SWA that adds Gaussian noise to the parameters obtained from SWA to approximate posterior parameter uncertainty. In Section 3.2, we first show that the diverse stochastic representations obtained from SWAG samples reflect the uncertainty of inputs well.
Building on this observation, we propose a novel self-distillation algorithm where the stochastic representations are used to construct an ensemble of virtual teachers, and the classifier re-training is formulated as a distillation (Hinton et al., 2015) from the virtual teachers. Fig. 1 depicts the overall composition of this paper as a diagram. Using the CIFAR10/100-LT (Cao et al., 2019), ImageNet-LT (Liu et al., 2019), and iNaturalist-2018 (Van Horn et al., 2018) benchmarks for long-tailed image classification, we empirically validate that our proposed method improves upon previous approaches in terms of both prediction accuracy and uncertainty estimation.

2.1. DECOUPLED LEARNING FOR LONG-TAILED CLASSIFICATION

Let $F_\theta : \mathbb{R}^D \to \mathbb{R}^L$ be a neural network parameterized by $\theta$ that produces $L$-dimensional outputs for given $D$-dimensional inputs. For the $K$-way classification problem, an output from $F_\theta$ is first transformed into $K$-dimensional logits via a linear classification layer parameterized by $\phi = (w_k \in \mathbb{R}^L, b_k \in \mathbb{R})_{k=1}^K$, and then turned into a classification probability with the softmax function,

$$p^{(k)}(x; \Theta) = \frac{\exp\left(w_k^\top F_\theta(x) + b_k\right)}{\sum_{j=1}^K \exp\left(w_j^\top F_\theta(x) + b_j\right)}, \quad \text{for } k = 1, \dots, K,$$

where $\Theta = (\theta, \phi)$ denotes the set of trainable parameters. Given a training set $\mathcal{D}$ consisting of pairs of an input $x \in \mathbb{R}^D$ and the corresponding label $y \in \{1, \dots, K\}$, $\Theta$ is trained to minimize the cross-entropy loss over $\mathcal{D}$,

$$\Theta^* = (\theta^*, \phi^*) = \arg\min_\Theta \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[-\log p^{(y)}(x; \Theta)\right]. \tag{2}$$

Throughout the paper, we call $F_\theta$ with parameters $\theta$ the feature extractor, and the last linear layer with parameters $\phi$ the classifier. Also, we refer to the basic algorithm that trains the feature extractor and classifier together with Stochastic Gradient Descent (SGD; Robbins and Monro, 1951) as SGD.

Previous works (Liu et al., 2019; Van Horn et al., 2018) have shown that vanilla SGD suffers from the data paucity of tail classes when applied to long-tailed classification tasks. Notably, Kang et al. (2020) showed that a simple re-training procedure on the classifier effectively resolves this problem. For instance, using the pre-trained $\theta^*$ from Eq. (2), we can re-train the classifier as

$$\phi^{**} = \arg\min_\phi \mathbb{E}_{(x,y) \sim p_{\mathcal{D}_{\mathrm{CB}}}}\left[-\log p^{(y)}(x; (\theta^*, \phi))\right], \tag{3}$$

where $\mathcal{D}_{\mathrm{CB}}$ is the class-balanced training dataset, i.e., the probability of sampling a data point $(x, y)$ is given by $p_{\mathcal{D}_{\mathrm{CB}}}((x, y)) = 1/(K \times n_y)$, where $n_y$ is the number of training examples for class $y$. This classifier Re-Training (cRT; Kang et al., 2020) method effectively improves the classification accuracy on tail classes without any changes to the feature extractor $F_{\theta^*}$.
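As an illustration of the class-balanced sampling distribution $p_{\mathcal{D}_{\mathrm{CB}}}((x, y)) = 1/(K \times n_y)$, the following is a minimal sketch in plain Python. The function name `class_balanced_probs` and the toy label list are hypothetical, and the sketch assumes every class appears at least once in the label list so that $K$ can be read off from it.

```python
import random
from collections import Counter


def class_balanced_probs(labels):
    """Return p((x, y)) = 1 / (K * n_y) for every training example."""
    counts = Counter(labels)   # n_y for each class y
    num_classes = len(counts)  # K (assumes every class occurs in `labels`)
    return [1.0 / (num_classes * counts[y]) for y in labels]


# Toy long-tailed label set (hypothetical): 8 head examples, 2 tail examples.
labels = [0] * 8 + [1] * 2
probs = class_balanced_probs(labels)

# Draw example indices with these probabilities; each class is picked
# with total probability 1/K regardless of its size.
rng = random.Random(0)
batch = rng.choices(range(len(labels)), weights=probs, k=16)
```

Under this distribution, a tail example is sampled more often than a head example, so each class contributes equally to the re-training objective in expectation.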

2.2. STOCHASTIC WEIGHT AVERAGING (SWA)

Stochastic Weight Averaging (SWA; Izmailov et al., 2018) is an optimization method that improves the generalization performance of deep neural networks. Given a loss function $\mathcal{L}(\Theta)$, conventional SGD steps towards a local minimum by following the gradient direction, where $\eta$ denotes the step size,

$$\Theta_t = \Theta_{t-1} - \eta \nabla_\Theta \mathcal{L}(\Theta)\big|_{\Theta = \Theta_{t-1}},$$

forming a parameter trajectory $\{\Theta_t\}_{t \geq 1}$. SWA constructs a moving average of parameters over a periodically sampled subset of this trajectory, starting from $\Theta_{\mathrm{SWA}} = 0$,

$$\Theta_{\mathrm{SWA}} = (n \Theta_{\mathrm{SWA}} + \Theta_n)/(n + 1), \tag{5}$$

where $\Theta_n$ is usually sampled at the end of every training epoch. This averaging in the weight space implicitly seeks flat minima in the loss surface, and thus enhances generalization. In practice, the averaging phase defined in Eq. (5) starts after the SGD trajectory falls into a basin of the loss function (e.g., after 75% of the training epochs), and the learning rate during the averaging phase is kept relatively high to encourage exploration of the loss surface.
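The running average in Eq. (5) can be sketched in a few lines of plain Python. This is only an illustration on flat lists of floats; the helper name `swa_update` and the toy snapshots are hypothetical, and a real implementation would apply the same update to every parameter tensor of the network.

```python
def swa_update(swa_params, new_params, n):
    """One step of Eq. (5): average of the first n snapshots -> first n + 1."""
    return [(n * s + p) / (n + 1) for s, p in zip(swa_params, new_params)]


# Toy parameter snapshots (hypothetical), e.g. collected at the end of epochs.
snapshots = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]

swa_params = [0.0, 0.0]  # starts from zero, as in the text
for n, params in enumerate(snapshots):
    swa_params = swa_update(swa_params, params, n)
```

After the loop, `swa_params` equals the plain arithmetic mean of the collected snapshots, which is what the recursive form computes without storing the whole trajectory.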

2.3. SWA-GAUSSIAN FOR APPROXIMATE BAYESIAN INFERENCE

SWA-Gaussian (SWAG; Maddox et al., 2019) conducts Bayesian inference using a Gaussian approximation to the posterior distribution over the model parameters. With a slight abuse of notation writing the element-wise square as $\Theta^2$, SWAG maintains the second moment of the model parameters in addition to the first moment defined in Eq. (5), starting from $\Theta'_{\mathrm{SWA}} = 0$,

$$\Theta'_{\mathrm{SWA}} = (n \Theta'_{\mathrm{SWA}} + \Theta_n^2)/(n + 1),$$

to compute a diagonal covariance matrix $\Sigma_{\mathrm{SWAG}} = \mathrm{diag}(\Theta'_{\mathrm{SWA}} - \Theta_{\mathrm{SWA}}^2)$ approximating the sample covariance of the parameters captured during SWA. The approximate posterior for the parameters is then constructed as a Gaussian, $q(\Theta) = \mathcal{N}(\Theta; \Theta_{\mathrm{SWA}}, \Sigma_{\mathrm{SWAG}})$. As suggested in Maddox et al. (2019), one can also consider a higher-rank approximation for the covariance, but in this paper, we only consider the diagonal approximation.
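A minimal sketch of the diagonal SWAG bookkeeping and sampling, in plain Python on flat lists of floats. The names `swag_moments` and `swag_sample` are hypothetical, and clamping the variance at zero is an added numerical guard that is not part of the description above.

```python
import math
import random


def swag_moments(snapshots):
    """Running first and second moments, then the diagonal variance."""
    dim = len(snapshots[0])
    mean = [0.0] * dim     # Theta_SWA
    sq_mean = [0.0] * dim  # Theta'_SWA (running mean of element-wise squares)
    for n, theta in enumerate(snapshots):
        mean = [(n * m + t) / (n + 1) for m, t in zip(mean, theta)]
        sq_mean = [(n * s + t * t) / (n + 1) for s, t in zip(sq_mean, theta)]
    # Diagonal of Sigma_SWAG; the clamp guards against tiny negative values
    # caused by floating-point round-off (an assumption of this sketch).
    var = [max(s - m * m, 0.0) for s, m in zip(sq_mean, mean)]
    return mean, var


def swag_sample(mean, var, rng):
    """Draw one parameter vector from N(mean, diag(var))."""
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]


snapshots = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]  # toy values (hypothetical)
mean, var = swag_moments(snapshots)
rng = random.Random(0)
samples = [swag_sample(mean, var, rng) for _ in range(10)]
```

Each element of `samples` plays the role of one draw from the approximate posterior $q(\Theta)$.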

3. LONG-TAILED CLASSIFICATION WITH STOCHASTIC WEIGHT AVERAGING

After its success in supervised image classification tasks (Izmailov et al., 2018), SWA has been further validated for other tasks, including semi-supervised learning (Athiwaratkun et al., 2019) and domain generalization (Cha et al., 2021). In this section, we first study whether the success of SWA carries over to long-tailed classification (Section 3.1). Then we introduce the concept of stochastic representations constructed with SWAG and empirically justify how such stochastic representations can benefit long-tailed classification (Section 3.2).

The results in Table 1 show that SWA does not bring a significant performance gain for long-tailed classification tasks, unlike the previous results on other tasks. Compared to SGD, 1) SWA does not significantly improve performance, and 2) SWA even degrades performance for the 'Few' split, raising a question about the efficacy of SWA for long-tailed classification.

We diagnose the performance bottleneck of SWA from the perspective of decoupled learning. Specifically, we adopt the cRT (Kang et al., 2020) method to verify whether SWA inherently hinders the quality of the feature extractor. If SWA performs worse than SGD even after the classifier re-training, we can conclude that SWA is not preferable for representation learning in long-tailed classification. The results 'After applying cRT' shown in Table 1 report the classification accuracy for the SGD and SWA models after their classifiers are re-trained with Eq. (3). SWA improves performance for all splits, indicating that SWA actually enhances the quality of the feature extractor; the classification layer acts as a bottleneck, as in SGD, and this can be fixed with cRT.

3.2. STOCHASTIC REPRESENTATIONS CAPTURE PREDICTIVE UNCERTAINTIES

Now let the feature extractor parameters obtained from the SWA procedure be $\theta_{\mathrm{SWA}}$ and the re-trained classifier parameters be $\phi^*_{\mathrm{SWA}}$, that is,

$$\phi^*_{\mathrm{SWA}} = \arg\min_\phi \mathbb{E}_{(x,y) \sim p_{\mathcal{D}_{\mathrm{CB}}}}\left[-\log p^{(y)}(x; (\theta_{\mathrm{SWA}}, \phi))\right]. \tag{7}$$

While the parameters $(\theta_{\mathrm{SWA}}, \phi^*_{\mathrm{SWA}})$ may generalize better than those obtained from SGD, they are still a point estimate that does not fully capture the uncertainty of the predictions. To overcome this limitation, we apply SWAG to the feature extractor parameters $\theta$, approximating the posterior of $\theta$ with a Gaussian distribution $q(\theta \mid \mathcal{D}) := \mathcal{N}(\theta \mid \theta_{\mathrm{SWA}}, \Sigma_{\mathrm{SWA}})$. Given $q(\theta \mid \mathcal{D})$, a predictive distribution for an input $x$ can be approximated as

$$p(y \mid x, \mathcal{D}; \phi^*_{\mathrm{SWA}}) \approx \int p^{(y)}(x; (\theta, \phi^*_{\mathrm{SWA}}))\, q(\theta \mid \mathcal{D})\, \mathrm{d}\theta \approx \frac{1}{M} \sum_{m=1}^M p^{(y)}(x; (\theta_m, \phi^*_{\mathrm{SWA}})), \tag{8}$$

where $\theta_1, \dots, \theta_M \overset{\text{i.i.d.}}{\sim} q(\theta \mid \mathcal{D})$. Here, we compute the model average of multiple predictions evaluated with the multiple feature extractor parameters $(\theta_m)_{m=1}^M$ to better capture epistemic uncertainty. During that process, we implicitly compute the stochastic representations of $x$, that is,

$$F_m(x) := F_{\theta_m}(x), \quad \text{for } m = 1, \dots, M.$$

We hypothesize that these stochastic representations reflect the epistemic uncertainty of $x$, so it is important to consider them in the classifier re-training stage. For instance, if $x$ is a hard example (e.g., a sample from a tail class) that is likely to be misclassified, the corresponding stochastic representations are expected to produce predictive distributions with high variance.

[Figure caption: Along with classification accuracy (ACC; left), we plot uncertainty estimates, including negative log-likelihood (NLL; center) and expected calibration error (ECE; right). We evaluate models on the test split of ImageNet-LT and colorize standard deviations over four seeds.]

Measuring per-instance dispersion. We empirically validate our hypothesis on stochastic representations by measuring their dispersion over different inputs.
The average cosine distance of the set elements to the set centroid in the $L$-dimensional representation space can quantify such dispersion. To be concrete, we consider the following per-instance dispersion for each instance $x$ in the representation space,

$$\mathrm{Dispersion}\left(x; \{\theta_m\}_{m=1}^M\right) := \frac{1}{M} \sum_{m=1}^M \left(1 - \frac{\bar{F}(x)^\top F_{\theta_m}(x)}{\|\bar{F}(x)\|_2 \, \|F_{\theta_m}(x)\|_2}\right), \tag{10}$$

where $\bar{F}(x) := \sum_{m=1}^M F_{\theta_m}(x)/M$ is the centroid of the $M$ stochastic representations $\{F_{\theta_m}(x)\}_{m=1}^M$ for an input $x$. Furthermore, we also measure the Jensen-Shannon Divergence (JSD) for a set of predictions $\{p(x; (\theta_m, \phi^*_{\mathrm{SWA}}))\}_{m=1}^M$,

$$\mathrm{Dispersion}\left(x; \{(\theta_m, \phi^*_{\mathrm{SWA}})\}_{m=1}^M\right) := \mathrm{JSD}\left(\{p(x; (\theta_m, \phi^*_{\mathrm{SWA}}))\}_{m=1}^M\right), \tag{11}$$

to quantify the per-instance dispersion for each instance $x$ in the probability space.

Fig. 2 shows the box-and-whisker plots for dispersion values over 20,000 validation examples from ImageNet-LT. We split them into four disjoint groups of 5,000 instances each based on their Negative Log-Likelihood (NLL) values: Q1 (instances with NLLs lower than the first quartile), Q2 (instances with NLLs between the first and second quartiles), Q3 (instances with NLLs between the second and third quartiles), and Q4 (instances with NLLs higher than the third quartile). We confirm that there exists a positive correlation between NLL and dispersion (0.233 for Eq. (10) and 0.385 for Eq. (11)), that is, a hard example (i.e., higher NLL) tends to have more dispersed stochastic representations (i.e., higher dispersion). The higher correlation for Eq. (11) compared to Eq. (10) motivates us to harness the diversity in the probability space instead of the representation space. Empirical results in Section 6.1 will show that this indeed leads to improvements. Moreover, we refer the reader to the first paragraph in Appendix B.5 for further analysis of the per-class dispersion in the context of long-tailed recognition.
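The two dispersion measures can be sketched as follows in plain Python. The function names are hypothetical. The text does not spell out the multi-distribution JSD, so this sketch assumes the common generalized form with uniform weights, i.e., the entropy of the mean prediction minus the mean of the individual entropies.

```python
import math


def cosine_dispersion(features):
    """Eq. (10): mean cosine distance of each representation to the centroid."""
    num = len(features)
    dim = len(features[0])
    centroid = [sum(f[i] for f in features) / num for i in range(dim)]
    c_norm = math.sqrt(sum(c * c for c in centroid))
    total = 0.0
    for f in features:
        f_norm = math.sqrt(sum(v * v for v in f))
        dot = sum(c * v for c, v in zip(centroid, f))
        total += 1.0 - dot / (c_norm * f_norm)
    return total / num


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(v * math.log(v) for v in p if v > 0.0)


def jsd(probs):
    """Generalized JSD with uniform weights (an assumption of this sketch)."""
    num = len(probs)
    mean = [sum(p[k] for p in probs) / num for k in range(len(probs[0]))]
    return entropy(mean) - sum(entropy(p) for p in probs) / num


# Toy inputs (hypothetical): M = 2 representations / predictions per instance.
same_direction = cosine_dispersion([[1.0, 0.0], [2.0, 0.0]])
orthogonal = cosine_dispersion([[1.0, 0.0], [0.0, 1.0]])
agreeing = jsd([[0.7, 0.3], [0.7, 0.3]])
disagreeing = jsd([[1.0, 0.0], [0.0, 1.0]])
```

Representations pointing in the same direction and identical predictions give zero dispersion, while two fully disagreeing one-hot predictions give the maximal two-distribution value of $\log 2$.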
Ensembling predictions with stochastic representations. We further provide empirical evidence that the uncertainty captured by the stochastic representations can be harnessed by the classifier for robust classification. Specifically, we compute the classification probabilities ensembled over stochastic representations using Eq. (8). Note that the classifier parameters $\phi^*_{\mathrm{SWA}}$ are re-trained with $\theta_{\mathrm{SWA}}$ via Eq. (7), not with the sampled parameters $(\theta_m)_{m=1}^M$, so in principle, there is no guarantee that $\phi^*_{\mathrm{SWA}}$ is compatible with the stochastic representations computed from $(\theta_m)_{m=1}^M$. Still, if the stochastic representations properly capture the uncertainty of $x$ in the feature space, the ensembled classifier would provide well-calibrated predictions. While the classification accuracy remains at a similar level, uncertainty estimates improve as the number of stochastic representations increases, indicating that the stochastic representations indeed capture uncertainty about the inputs that is helpful for robust predictions.

We would like to emphasize that the classifier in Eq. (8) is a proof-of-concept model to validate our hypothesis, and there is still room for improvement. First, as mentioned earlier, the classifier parameters $\phi^*_{\mathrm{SWA}}$ need to be re-trained according to the stochastic representations. Second, even with such re-training, the current form requires multiple forward passes through the feature extractor (with multiple $\theta_m$), incurring an undesirably heavy inference cost. In Section 4, we propose a novel re-training algorithm to resolve these issues.

4.1. RE-TRAINING CLASSIFIER WITH STOCHASTIC REPRESENTATIONS

A straightforward way to re-train the classifier with stochastic representations would be minimizing

$$\phi^* = \arg\min_\phi \mathbb{E}_{(x,y) \sim p_{\mathcal{D}_{\mathrm{CB}}}} \mathbb{E}_{\theta \sim q(\theta \mid \mathcal{D})}\left[-\log p^{(y)}_{\theta,\phi}(x)\right] \approx \arg\min_\phi \mathbb{E}_{(x,y) \sim p_{\mathcal{D}_{\mathrm{CB}}}}\left[\mathcal{L}_{\mathrm{CE}}(x, y)\right], \tag{12}$$

where $p^{(y)}_{\theta,\phi}(x) := p^{(y)}(x; (\theta, \phi))$, $\theta_1, \dots, \theta_M \overset{\text{i.i.d.}}{\sim} q(\theta \mid \mathcal{D})$, and

$$\mathcal{L}_{\mathrm{CE}}(x, y) := -\frac{1}{M} \sum_{m=1}^M \log p^{(y)}_{\theta_m,\phi}(x). \tag{13}$$

Fig. 4 depicts a two-dimensional representation space during the classifier re-training stage. While learning from a deterministic feature extractor (i.e., minimizing Eq. (7)) computes the deterministic representations, our proposed method minimizing Eq. (12) builds more robust decision boundaries since it accounts for predictive uncertainties estimated from the stochastic representations.
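The per-example loss in Eq. (13) can be sketched in plain Python as below. The sketch takes the $M$ stochastic representations of one input as given (in practice they would come from forward passes with sampled feature extractor parameters), and the names `linear_logits` and `stochastic_ce_loss` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def linear_logits(feature, weights, biases):
    """Logits w_k^T F(x) + b_k of the linear classifier, for k = 1..K."""
    return [sum(w_i * f_i for w_i, f_i in zip(w, feature)) + b
            for w, b in zip(weights, biases)]


def stochastic_ce_loss(features, label, weights, biases):
    """Eq. (13): mean negative log-likelihood over M stochastic representations."""
    losses = [-log_softmax(linear_logits(f, weights, biases))[label]
              for f in features]
    return sum(losses) / len(losses)


# Toy setup (hypothetical): M = 2 representations in R^2, K = 3 classes.
features = [[1.0, 2.0], [0.5, -1.0]]
zero_w = [[0.0, 0.0]] * 3
zero_b = [0.0, 0.0, 0.0]
loss_uniform = stochastic_ce_loss(features, 1, zero_w, zero_b)

good_w = [[0.0, 0.0], [5.0, 5.0], [0.0, 0.0]]
loss_mixed = stochastic_ce_loss([[1.0, 1.0], [-1.0, -1.0]], 1, good_w, zero_b)
```

With an all-zero classifier the loss equals $\log K$ for any representation. In the second call the classifier is confident and correct on one sampled representation but wrong on the other, so the averaged loss stays large; this is how Eq. (12) discourages decision boundaries that only fit a single point estimate.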

4.2. SELF-DISTILLATION WITH STOCHASTIC REPRESENTATION

While the re-training with Eq. (12) helps build a robust classifier, there is still room for improvement. After all, what we are ultimately interested in is the uncertainty in a set of predictions $\{p_{\theta_m,\phi}(x)\}_{m=1}^M$, not the stochastic representations $\{F_{\theta_m}(x)\}_{m=1}^M$ themselves. The objective in Eq. (12) learns the classifier parameters $\phi$ with the mean loss in Eq. (13), so it does not consider the diversity of the predictions made by different stochastic representations. Moreover, as we pointed out earlier, in the current form we need $M$ forward passes through the feature extractor with $(\theta_m)_{m=1}^M$ to make a prediction. To resolve these issues, we present a self-distillation algorithm that enhances the robustness of the classifier by transferring the diversity in the multiple predictions made with stochastic representations, while reducing the inference cost to that of a single model.

Setting up teachers and a student. We first construct a set of teachers from the set of stochastic representations. Formally, the $m$-th teacher $T_m$ produces a prediction probability for an input $x$ as $p_{T_m}(x) := p_{\theta_m,\phi}(x)$, where $\theta_m \sim q(\theta \mid \mathcal{D})$. For the student, we first fix the feature extractor parameters as $\theta_{\mathrm{SWA}}$, which is a natural choice considering that the mean of the distribution $q(\theta \mid \mathcal{D})$ is $\theta_{\mathrm{SWA}}$. The goal is to learn the classifier parameters $\phi$ such that the student prediction $p_{\theta_{\mathrm{SWA}},\phi}(x)$ can maximally absorb the diversity in the teacher predictions $(p_{T_m}(x))_{m=1}^M$.

Distillation objective. Following Ryabinin et al. (2021), instead of directly distilling from the mean of the ensemble predictions $\bar{p}_T(x) := \sum_m p_{T_m}(x)/M$ as in the original knowledge distillation (Hinton et al., 2015), we distill from the distribution of the ensembled predictions. Specifically, we assume that the teacher prediction probabilities $(p_{T_m}(x))_{m=1}^M$ are Dirichlet distributed with parameter $\beta(x)$, that is,

$$p_{T_1}(x), \dots, p_{T_M}(x) \overset{\text{i.i.d.}}{\sim} \mathrm{Dir}(\beta(x)). \tag{14}$$

The parameter $\beta(x)$ can be estimated via an approximate maximum likelihood procedure whose solution is given in closed form (Minka, 2000):

$$\beta^{(k)}(x) := \bar{p}_T^{(k)}(x) \, \frac{(K - 1)/2}{\sum_{j=1}^K \bar{p}_T^{(j)}(x) \left(\log \bar{p}_T^{(j)}(x) - \frac{1}{M} \sum_{m=1}^M \log p_{T_m}^{(j)}(x)\right)}, \quad \text{for } k = 1, \dots, K.$$

Having computed $\beta(x)$, we assume that the output probabilities from the student model (our original model whose classifier parameters are to be re-trained) are also Dirichlet distributed with parameter $\alpha(x)$. We constrain all components of the parameters to be greater than 1 (i.e., $\beta \leftarrow \beta + 1$ and $\alpha \leftarrow \alpha + 1$) and minimize the reverse KL divergence $D_{\mathrm{KL}}[\mathrm{Dir}(\alpha) \,\|\, \mathrm{Dir}(\beta)]$ for stable training (Ryabinin et al., 2021). Expanding the KL divergence, we obtain the distillation loss as

$$\mathcal{L}_{\mathrm{KD}}(x, y) = \mathbb{E}_{p \sim \mathrm{Dir}(\alpha)}\left[-\sum_{k=1}^K \bar{p}_T^{(k)}(x) \log p^{(k)}\right] + \frac{1}{\sum_{j=1}^K \beta^{(j)}} D_{\mathrm{KL}}[\mathrm{Dir}(\alpha) \,\|\, \mathrm{Dir}(\mathbf{1})], \tag{16}$$

where we compute $\alpha(x) \in \mathbb{R}_+^K$ for $\phi = (w_k, b_k)_{k=1}^K$ as

$$\alpha^{(k)}(x) := \exp\left(w_k^\top F_{\theta_{\mathrm{SWA}}}(x) + b_k\right), \quad \text{for } k = 1, \dots, K.$$

After learning $\alpha(x)$, we can take the mean of the student Dirichlet distribution as a representative classification probability for $x$. By the property of the softmax, this actually coincides with $p_{\theta_{\mathrm{SWA}},\phi}(x)$. That is,

$$\mathbb{E}_{p \sim \mathrm{Dir}(\alpha(x))}\left[p^{(k)}\right] = \alpha^{(k)}(x) \Big/ \sum_{j=1}^K \alpha^{(j)}(x) = p^{(k)}_{\theta_{\mathrm{SWA}},\phi}(x).$$

The final version of our algorithm combines the two loss terms $\mathcal{L}_{\mathrm{CE}}$ and $\mathcal{L}_{\mathrm{KD}}$ for the classifier re-training (see Appendix A.4 for more details). While the first term builds decision boundaries over the stochastic representations, the second term further adjusts the decision boundaries by distilling diverse predictions from the stochastic representations. We call this procedure self-distillation in the sense that the model distills probabilistic outputs within the network itself (Zhang et al., 2019). We empirically validate the effectiveness of the self-distillation in Section 6.
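The closed-form estimate of the teacher Dirichlet parameter $\beta(x)$ can be sketched in plain Python as follows. The function name `estimate_dirichlet` and the small floor on the denominator are assumptions of this sketch (the denominator vanishes when all teachers agree exactly), and the teacher probabilities are assumed to be strictly positive, as softmax outputs are.

```python
import math


def estimate_dirichlet(teacher_probs, eps=1e-12):
    """Approximate maximum-likelihood Dirichlet parameters from M teachers.

    teacher_probs: list of M probability vectors of length K, all entries > 0.
    """
    num_teachers = len(teacher_probs)
    num_classes = len(teacher_probs[0])
    mean = [sum(p[k] for p in teacher_probs) / num_teachers
            for k in range(num_classes)]
    mean_log = [sum(math.log(p[k]) for p in teacher_probs) / num_teachers
                for k in range(num_classes)]
    # Non-negative by Jensen's inequality; zero only if all teachers agree.
    denom = sum(mean[j] * (math.log(mean[j]) - mean_log[j])
                for j in range(num_classes))
    precision = (num_classes - 1) / 2.0 / max(denom, eps)  # floor is a guard
    return [mean[k] * precision for k in range(num_classes)]


# Toy teacher predictions (hypothetical), K = 3 classes, M = 2 teachers.
tight = estimate_dirichlet([[0.70, 0.20, 0.10], [0.69, 0.21, 0.10]])
loose = estimate_dirichlet([[0.70, 0.20, 0.10], [0.20, 0.70, 0.10]])
```

The normalized estimate $\beta / \sum_j \beta^{(j)}$ recovers the mean teacher prediction, while the total $\sum_j \beta^{(j)}$ acts as a concentration: it is large when the teachers agree (`tight`) and small when they disagree (`loose`), which is the diversity signal that the distillation loss passes on to the student.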

5. RELATED WORKS

Decoupled learning. Recent works have shown that the performance bottleneck of deep neural classifiers on long-tailed datasets is improper decision boundaries (Kang et al., 2020; Zhang et al., 2021). The vanilla training with instance-balanced sampling gives generalizable representations (Kang et al., 2020), and thus a simple adjusting strategy for the classifier can alleviate such a bottleneck. Examples include re-training the classifier from scratch or normalizing the classifier weights with class-balanced sampling (Kang et al., 2020), adjusting the classifier biases with the empirical class frequencies on the training dataset (Menon et al., 2021), and training an additional module that performs input-dependent adjustment of the original classifier (Zhang et al., 2021).

Knowledge distillation. The seminal work of Hinton et al. (2015) has shown that distilling knowledge in deep neural networks is an effective way to obtain better generalizing models. One of the common practices in knowledge distillation is employing ensembles as a teacher model (Malinin et al., 2020; Ryabinin et al., 2021), based on the superior performance of ensembles of deep neural networks (Lakshminarayanan et al., 2017; Ovadia et al., 2019), and several works have already applied this to long-tailed recognition (Iscen et al., 2021; Wang et al., 2021). Nevertheless, our method differs from them in that it does not require a costly teacher model (i.e., we do not require an independently trained teacher model).

6. EXPERIMENTS

We present extensive experimental results on long-tailed image classification benchmarks for the family of residual networks (He et al., 2016), i.e., ResNet-32 on CIFAR10/100-LT (Cao et al., 2019) and ResNet-50 on ImageNet-LT (Liu et al., 2019) and iNaturalist-2018 (Van Horn et al., 2018). See Appendix C for more details. While some previous literature only reports classification accuracy (ACC), we also provide Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE), which further evaluate the calibration of classification models. See Appendix A.3 for the definition of each metric. Unless specified otherwise, we report numbers as Avg.±std. over four random seeds.

6.1. ABLATION STUDIES OF PROPOSED METHODS

In this section, we empirically verify that our proposed method progressively improves upon the previous baseline. More precisely, we test the following step by step: (a) introducing SWA for the representation learning from Section 3, (b) introducing stochastic representations for the classifier re-training from Section 4.1, and (c) introducing the self-distillation strategy from Section 4.2. Here, we consider the following balancing strategies during the classifier re-training: 1) Class-Balanced Sampling (CBS; Kang et al., 2020), 2) Generalized Re-Weighting (GRW; Zhang et al., 2021), and 3) Logit Adjustment (LA; Menon et al., 2021). Such strategies for overcoming the difficulty of an imbalanced data distribution are essential to re-train proper decision boundaries in the decoupled learning scheme. Refer to Appendix A.1 for more details on the balancing strategies.

Table 2 shows the evaluation results for each step on ImageNet-LT when we use CBS. To summarize, (a) the results displayed in the first and second rows show that SWA improves the SGD baseline with a decoupling scheme, as discussed in Section 3.1. (b) Moreover, the results displayed in the second and third rows show that using stochastic representations to re-train the classifier improves the classification accuracy. This indicates that the stochastic representations contribute to building more robust decision boundaries, as depicted in Fig. 4. However, up to this point the improvements in uncertainty estimates are not significant. (c) Finally, our proposed approach in Section 4.2 significantly outperforms the previous baselines on every metric we measured. Notably, the improvement appears consistently for all balancing strategies we considered (see Appendix B.1 for the full results). In particular, the clear improvement in uncertainty estimates confirms the effectiveness of the proposed self-distillation method.

6.2. RESULTS ON IMAGE CLASSIFICATION TASKS

We compare our approach to the existing methods for long-tailed learning: cRT (Kang et al., 2020), LWS (Kang et al., 2020), LA (Menon et al., 2021), and DisAlign (Zhang et al., 2021). See Appendices A.1 and A.2 for more details on the existing methods. Since the representation learning using SWA, introduced in Section 3, is also compatible with previous classifier re-training methods, we report the results built upon it for a fair comparison.

Further comparisons with state-of-the-art methods. One may point to a performance gap between our method and the state-of-the-art methods. This gap arises from the shorter training schedule (i.e., 100 training epochs) and the vanilla data augmentation strategy (i.e., random resized cropping and horizontal flipping) in our experimental setups. Table 4 further clarifies that our proposed approach outperforms the state-of-the-art methods (Kang et al., 2021; Zhong et al., 2021; Hong et al., 2021; Liu et al., 2021) when the training costs are aligned (see Appendix B.3 for more details). Moreover, Appendix B.4 verifies the compatibility between our method and existing state-of-the-art methods (Wang et al., 2021; Park et al., 2022; Zhang et al., 2022; Zhu et al., 2022), which shows that ours can be combined with existing algorithms taking different approaches.

7. CONCLUSION

In this paper, we proposed a simple yet effective classifier re-training strategy for long-tailed learning in the decoupled learning scheme. We first showed that successful representation learning is achievable by SWA without any complex training methods. To the best of our knowledge, this is the first attempt to introduce SWA into long-tailed learning. While simply combining existing classifier re-training methods with the representations learned by SWA already performs better than with vanilla SGD, we further proposed a novel self-distillation strategy that significantly improves the uncertainty estimates of the final classifier.

ETHIC AND REPRODUCIBILITY STATEMENTS

For societal impacts, there can be potential negative impacts associated with face recognition (e.g., privacy threats) since face data often exhibit long-tailed distribution over entities (Zhang et al., 2017). For reproducibility, Appendix C contains all the experimental setup including datasets and hyperparameters.

A SUPPLEMENTARY MATERIALS

A.1 BALANCING STRATEGIES FOR RE-TRAINING CLASSIFIER

In Section 6.1, we considered the following balancing strategies for the classifier re-training stage. Such strategies for re-balancing training data with a long-tailed distribution over classes are essential to construct proper decision boundaries over the representation space.

• Class-Balanced Sampling (CBS) is the most straightforward way to re-balance a long-tailed data distribution by sampling. The probability of sampling a training example from class $k$ is $1/K$, where $K$ is the number of training classes. More precisely, the softmax cross-entropy loss over the class-balanced training dataset $\mathcal{D}_{\mathrm{CB}}$ is

$$\mathbb{E}_{(x,y) \sim p_{\mathcal{D}_{\mathrm{CB}}}}\left[-\log \frac{\exp\left(w_y^\top F_\theta(x) + b_y\right)}{\sum_{j=1}^K \exp\left(w_j^\top F_\theta(x) + b_j\right)}\right],$$

where the probability of sampling a data point $(x, y)$ is given by $p_{\mathcal{D}_{\mathrm{CB}}}((x, y)) = 1/(K \times n_y)$ and $n_y$ denotes the number of training examples for class $y$.

• Generalized Re-Weighting (GRW) performs loss re-weighting using the empirical class frequencies $\{\pi_k\}_{k=1}^K$ on the training dataset (Zhang et al., 2021). More precisely, the re-weighted version of the softmax cross-entropy loss over the training dataset $\mathcal{D}$ is

$$\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[-\frac{(1/\pi_y)^\rho}{\sum_{j=1}^K (1/\pi_j)^\rho} \log \frac{\exp\left(w_y^\top F_\theta(x) + b_y\right)}{\sum_{j=1}^K \exp\left(w_j^\top F_\theta(x) + b_j\right)}\right],$$

where $\rho \geq 0$ is a hyper-parameter that controls the scale of the per-class weighting coefficients, e.g., it reduces to the instance-balanced re-weighting with $\rho = 0.0$, and to the class-balanced re-weighting with $\rho = 1.0$. We use $\rho = 1.0$ throughout the experiments.

• Logit Adjustment (LA) is a simple but efficient way to minimize the balanced error (i.e., the average of per-class error rates) using the empirical class frequencies $\{\pi_k\}_{k=1}^K$ on the training dataset (Menon et al., 2021). More precisely, the logit-adjusted version of the softmax cross-entropy loss over the training dataset $\mathcal{D}$ is

$$\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[-\log \frac{\exp\left(w_y^\top F_\theta(x) + b_y + \rho \log \pi_y\right)}{\sum_{j=1}^K \exp\left(w_j^\top F_\theta(x) + b_j + \rho \log \pi_j\right)}\right],$$

where $\rho \geq 0$ is a hyper-parameter that controls the scale of the offset added to each of the logits. We use $\rho = 1.0$ throughout the experiments.
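The GRW weighting coefficients and the LA loss for a single example can be sketched in plain Python as below. The function names and the toy class frequencies are hypothetical; the sketch operates on precomputed logits $w_k^\top F_\theta(x) + b_k$ rather than on a network.

```python
import math


def grw_weights(class_freqs, rho=1.0):
    """Per-class GRW coefficients (1/pi_k)^rho / sum_j (1/pi_j)^rho."""
    raw = [(1.0 / pi) ** rho for pi in class_freqs]
    total = sum(raw)
    return [r / total for r in raw]


def logit_adjusted_loss(logits, label, class_freqs, rho=1.0):
    """LA cross-entropy: add rho * log(pi_k) to each logit before the softmax."""
    adjusted = [z + rho * math.log(pi) for z, pi in zip(logits, class_freqs)]
    m = max(adjusted)
    lse = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return lse - adjusted[label]


# Toy two-class problem (hypothetical): 90% head class, 10% tail class.
class_freqs = [0.9, 0.1]
weights_instance = grw_weights(class_freqs, rho=0.0)
weights_class = grw_weights(class_freqs, rho=1.0)

# With equal logits, the adjusted loss is larger for the tail label, which
# pushes the classifier to assign the tail class a larger raw logit.
loss_head = logit_adjusted_loss([0.0, 0.0], 0, class_freqs)
loss_tail = logit_adjusted_loss([0.0, 0.0], 1, class_freqs)
```

Setting `rho=0.0` gives uniform coefficients (instance-balanced), while `rho=1.0` weights each class inversely to its frequency, matching the two special cases named in the GRW bullet.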

A.2 CLASSIFIER RE-TRAINING METHODS

Assume that we have pre-trained parameters $\Theta^* = (\theta^*, \phi^*)$ after the first representation learning stage. From this, the classifier re-training methods aim to find new classifier parameters $\phi^{**} = (w_k^{**}, b_k^{**})_{k=1}^K$, while the feature extractor parameters $\theta^*$ are frozen.

• Classifier Re-Training (cRT) is the most straightforward way to re-train the classifier from scratch (Kang et al., 2020). Specifically, it first randomly re-initializes the classifier parameters $\phi$ and re-trains them on the class-balanced training dataset $\mathcal{D}_{\mathrm{CB}}$,

$$\phi^{**} = \arg\min_\phi \mathbb{E}_{(x,y) \sim p_{\mathcal{D}_{\mathrm{CB}}}}\left[-\log p^{(y)}(x; (\theta^*, \phi))\right].$$

• Learnable Weight Scaling (LWS) only re-trains the scale of the pre-trained weights $(w_k^*)_{k=1}^K$ while their direction is kept (Kang et al., 2020). Specifically, it introduces a trainable parameter $\tau \in \mathbb{R}$ which controls the intensity of the weight normalization,

$$\phi(\tau) = \left(w_k^* / \|w_k^*\|^\tau, \; b_k^*\right)_{k=1}^K,$$

and finds $\phi^{**} = \phi(\tau^*)$ on the class-balanced training dataset $\mathcal{D}_{\mathrm{CB}}$, where

$$\tau^* = \arg\min_\tau \mathbb{E}_{(x,y) \sim p_{\mathcal{D}_{\mathrm{CB}}}}\left[-\log p^{(y)}(x; (\theta^*, \phi(\tau)))\right].$$

• Distribution Alignment (DisAlign) trains additional modules while the original classifier parameters $\phi^*$ are kept (Zhang et al., 2021). Specifically, it introduces trainable parameters $\{\alpha_k \in \mathbb{R}, \beta_k \in \mathbb{R}\}_{k=1}^K \cup \{\gamma \in \mathbb{R}^K, \delta \in \mathbb{R}\}$ which perform a distribution calibration,

$$z_k(x) \leftarrow w_k^{*\top} F_{\theta^*}(x) + b_k^*,$$

$$\sigma(x) \leftarrow \mathrm{Sigmoid}\left(\gamma^\top z(x) + \delta\right), \tag{25}$$

$$\hat{z}_k(x) \leftarrow \sigma(x) \cdot (\alpha_k z_k(x) + \beta_k) + (1 - \sigma(x)) \cdot z_k(x), \quad \text{for } k = 1, \dots, K,$$

where $z(x) = (z_1(x), \dots, z_K(x))$, and $z_k$ and $\hat{z}_k$ respectively denote the original logits and the calibrated logits. While both $\theta^*$ and $\phi^*$ are fixed, it finds the optimal values $\{\alpha_k^*, \beta_k^*\}_{k=1}^K \cup \{\gamma^*, \delta^*\}$ of $\{\alpha_k, \beta_k\}_{k=1}^K \cup \{\gamma, \delta\}$ on the training dataset $\mathcal{D}$ with Generalized Re-Weighting (GRW).
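To make the LWS re-parameterization $w_k^* / \|w_k^*\|^\tau$ concrete, here is a minimal sketch in plain Python that evaluates it for a fixed $\tau$. The name `lws_rescale` and the toy weights are hypothetical, and the sketch does not learn $\tau$; in LWS, $\tau$ is the trainable quantity.

```python
import math


def lws_rescale(weights, tau):
    """Rescale each class weight vector w_k to w_k / ||w_k||^tau."""
    scaled = []
    for w in weights:
        norm = math.sqrt(sum(v * v for v in w))
        scaled.append([v / norm ** tau for v in w])
    return scaled


def norms(weights):
    """Euclidean norm of each class weight vector."""
    return [math.sqrt(sum(v * v for v in w)) for w in weights]


# Toy classifier weights (hypothetical): a large-norm head class and a
# small-norm tail class.
weights = [[3.0, 4.0], [0.3, 0.4]]
unchanged = lws_rescale(weights, 0.0)   # tau = 0 keeps the weights as they are
normalized = lws_rescale(weights, 1.0)  # tau = 1 gives unit-norm weights
```

Intermediate values of $\tau$ interpolate between the two cases, shrinking the gap between head-class and tail-class weight norms while leaving each weight direction unchanged.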

A.3 EVALUATION METRICS

The problem we addressed in this paper is the $K$-way classification problem. Let $P : x \mapsto p$ be a model that outputs a categorical probability $p \in [0, 1]^K$ for a given input $x$. The following metrics are reported for all of the methods using our implementation.

• Accuracy (ACC; higher is better):

$$\mathrm{ACC}(P, D) = \mathbb{E}_{(x,y)\in D}\left[\, y = \arg\max_k P^{(k)}(x) \,\right],$$

where the inner $[\cdot]$ denotes the Iverson bracket.

• Negative log-likelihood (NLL; lower is better):

$$\mathrm{NLL}(P, D) = \mathbb{E}_{(x,y)\in D}\left[-\log P^{(y)}(x)\right],$$

which is equivalent to the softmax cross-entropy loss used in training.

• Expected calibration error (ECE; lower is better):

$$\mathrm{ECE}(P, D; N) = \sum_{n=1}^{N} \delta_n \, \frac{|B_n|}{|D|},$$

where $\{B_1, ..., B_N\}$ is a partition of $D$,

$$B_n = \left\{ (x, y) \in D \;\middle|\; \max_k P^{(k)}(x) \in \left(\tfrac{n-1}{N}, \tfrac{n}{N}\right] \right\}, \quad n = 1, ..., N,$$

and $\delta_n$ denotes the calibration error for the $n$-th bin $B_n$,

$$\delta_n = \left| \mathrm{ACC}(P, B_n) - \mathbb{E}_{(x,\cdot)\in B_n}\left[\max_k P^{(k)}(x)\right] \right|, \quad n = 1, ..., N.$$

We used $N = 15$ bins in this paper.
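As a worked illustration of the three metrics, the sketch below computes ACC, NLL, and ECE from a list of predicted probability vectors and labels. It is written in plain Python for readability and is not the evaluation code used for the reported numbers; the function name `evaluate` and the choice to take the absolute accuracy-confidence gap per bin are assumptions of this sketch.

```python
import math

def evaluate(probs, labels, num_bins=15):
    """Return (ACC, NLL, ECE) for predicted probabilities and true labels."""
    n = len(labels)
    bin_correct = [0.0] * num_bins  # number of correct predictions per bin
    bin_conf = [0.0] * num_bins     # summed confidence per bin
    num_correct = 0.0
    nll = 0.0
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=lambda k: p[k])
        conf = p[pred]
        hit = 1.0 if pred == y else 0.0
        num_correct += hit
        nll -= math.log(p[y])
        # Bin b (1-based) covers ((b-1)/N, b/N]; convert to a 0-based index.
        idx = min(max(math.ceil(conf * num_bins) - 1, 0), num_bins - 1)
        bin_correct[idx] += hit
        bin_conf[idx] += conf
    # |ACC(B) - mean confidence(B)| * |B| / |D| simplifies to the sum below.
    ece = sum(abs(c - s) for c, s in zip(bin_correct, bin_conf)) / n
    return num_correct / n, nll / n, ece
```

For example, four predictions with confidences 0.9, 0.6, 0.8, and 0.7, of which only the second is wrong, fall into four different bins and give an ECE of (0.1 + 0.6 + 0.2 + 0.3) / 4 = 0.3.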

A.4 DETAILED IMPLEMENTATION OF THE PROPOSED METHOD

Loss function. During the optimization, we simply fix $M = 10$ and use equal weights for the two terms in the final objective, i.e., the actual implementation is

$$0.5\,\underbrace{\mathcal{L}_{\mathrm{CE}}}_{\text{cross-entropy term}} + 0.5\,\underbrace{\mathcal{L}_{\mathrm{KD}}}_{\text{self-distillation term}}.$$

Also, we block the gradient flow through the target Dirichlet distribution in the self-distillation term, i.e., jax.lax.stop_gradient(β) in the JAX library. Our training diverges when using the typical softmax outputs due to the sharp target Dirichlet distribution. Thus, we use the following temperature-scaled softmax outputs to stabilize the optimization of the self-distillation term $\mathcal{L}_{\mathrm{KD}}$,

$$p_\theta^{(k)}(x) \leftarrow \frac{\exp\left(\left(w_k^\top F_\theta(x) + b_k\right)/\tau_{\mathrm{KD}}\right)}{\sum_{j=1}^{K} \exp\left(\left(w_j^\top F_\theta(x) + b_j\right)/\tau_{\mathrm{KD}}\right)}, \quad \text{for } k = 1, ..., K.$$

Table 5 shows results for the final version of our algorithm (i.e., SRepr in Section 6.2) on the validation split of ImageNet-LT swept over $\tau_{\mathrm{KD}} \in \{1, 2, 5, 10, 20\}$. Training losses were unstable for $\tau_{\mathrm{KD}} \in \{1, 2, 5\}$ and thus we fix $\tau_{\mathrm{KD}} = 20$ throughout all experiments.

Pseudo-code for the proposed method. Algorithm 1 summarizes the proposed method in pseudocode. Note that training SWA-Gaussian with a diagonal covariance has virtually no additional cost over conventional training with SGD. It only requires extra space for storing the first and second moments of the backbone parameters (i.e., lines 1-10). Instead, an additional training cost of our approach compared with the existing methods (Kang et al., 2020; Zhang et al., 2021)

Ablation studies of proposed methods (Section 6.1). Table 6 is an extended version of Table 2. It provides the results when we apply the following balancing strategies: CBS, GRW, and LA. The arguments we discussed in Section 6.1 consistently hold for all balancing strategies we considered.

Results on image classification tasks (Section 6.2). Table 7 is an extended version of Table 3.
Here, we also provide detailed classification accuracy on three splits introduced in (Liu et al., 2019)

We also provide the experimental results on CIFAR10-LT and CIFAR100-LT. Table 8 shows that our approach outperforms the baselines in terms of every metric we measured. It clearly demonstrates that the proposed approach consistently benefits from training robust decision boundaries regardless of the scale of datasets.

Per-class weight norms and marginal likelihoods after classifier re-training. Following previous works (Kang et al., 2020; Ren et al., 2020; Alshammari et al., 2022), we further investigate how our proposed approach affects 1) the per-class weight norm, i.e., ∥w_y∥ for y = 1, ..., K, and 2) the per-class marginal likelihood, i.e., p(y) = E_{(x,·)∼D_test}[p

Alshammari et al. (2022) argued that balanced weight norms of the classifier give tail classes a chance to compete with head classes and produce the ideal marginal likelihood following a uniform distribution. However, Fig. 6 demonstrates that our re-training method (i.e., SRepr) achieves both the uniform marginal likelihood (middle) and the well-calibrated prediction (right), even though it does not balance weight norms (left). This phenomenon suggests that our proposed approach works distinctly from the existing works balancing the weight norms for dealing with long-tailed data.

Training costs compared with vanilla decoupled training. Our approach (i.e., SRepr) requires M forward passes of the backbone network during the classifier re-training stage. However, the additional training cost due to these multiple forward passes is not a huge bottleneck since we only

Generating stochastic representation by random augmentation. We can also generate stochastic representations by changing inputs (i.e., random augmentation) instead of model parameters (i.e., SWA-Gaussian, as we introduced in Eq. (9)).
Table 11 shows that the proposed SRepr also works with random augmentation (i.e., random crop augmentation), but it performs worse than the variant we originally proposed. One notable advantage of generating stochastic representations using SWAG is that we do not need to design data-dependent augmentation strategies. This helps when applying our method to domains for which no straightforward data augmentation strategies are available (e.g., text, graph, and speech data).

Ablation study on mixup augmentation and our approach. We also present an ablation study on the mixup augmentation and our approach (i.e., SRepr), since both methods improve the calibration of the classification model. Table 12 shows that the benefit of our approach is distinct from that of the mixup augmentation for ResNet-50 on ImageNet-LT: (1) the performance gain is largest when the two are used together; (2) while both mixup and SRepr improve all metrics over the baseline, the contribution from SRepr is larger than that from mixup.

C EXPERIMENTAL DETAILS

Code is available at https://github.com/cs-giung/long-tailed-srepr. Our implementations are built on JAX (Bradbury et al., 2018), Flax (Heek et al., 2020), and Optax (Hessel et al., 2020). These libraries are available under the Apache-2.0 license (https://www.apache.org/licenses/LICENSE-2.0). For ImageNet-LT and iNaturalist-2018, we conduct all experiments on 8 TPUv3 cores, supported by TPU Research Cloud (https://sites.research.google/trc/about/).

C.1 DATASETS

ImageNet-LT. It is available at https://github.com/zhmiao/OpenLongTailRecognition-OLTR (Liu et al., 2019)

CIFAR10/100-LT. It is available at https://github.com/kaidic/LDAM-DRW (Cao et al., 2019). It consists of 10,847 train examples and 10,000 test examples from 10/100 classes when an exponential decay with an imbalance factor of 0.01 is applied. We use a simple data augmentation policy which consists of random padded cropping and random horizontal flipping (He et al., 2016). All images are standardized with the mean of (0.4914, 0.4822, 0.4465) and standard deviation of (0.2023, 0.1994, 0.2010) in RGB order.

C.3 WEIGHT DECAY (WD)

Weight Decay (WD; Krogh and Hertz, 1991) is the standard regularization technique for training deep neural networks. For instance, we additionally introduce the WD term of $\lambda_{\mathrm{wd}} \|\Theta\|_2^2$ in Eq. (2), where $\lambda_{\mathrm{wd}} > 0$ is a hyperparameter to control the impact of the WD term. Table 13 shows validation accuracy of SGD swept over $\lambda_{\mathrm{wd}} \in \{0.0001, 0.0002, 0.0003, 0.0004, 0.0005\}$. We empirically found that tuning weight decay exerts a strong influence on long-tailed classification performance, as Alshammari et al. (2022) reported. Throughout the paper, we apply $\lambda_{\mathrm{wd}} = 0.0003$ for ImageNet-LT, $\lambda_{\mathrm{wd}} = 0.0001$ for iNaturalist-2018, and $\lambda_{\mathrm{wd}} = 0.0005$ for CIFAR10/100-LT.



We use the cosine distance instead of Euclidean distance since it is a more reasonable choice for the inner-product-based representation space trained with Eq. (1).
https://www.apache.org/licenses/LICENSE-2.0
https://sites.research.google/trc/about/



Figure 1: A schematic diagram depicting the overall composition of the paper. Left: We first apply SWA to obtain more generalizing feature extractor (Section 3). Right: We then propose a novel self-distillation strategy distilling SWAG into SWA to obtain more robust classifier (Section 4).

Figure 2: Box-and-whisker plots for dispersion values in four different groups of examples. We also compute the Pearson Correlation Coefficient (PCC) value to clarify the positive correlation between negative log-likelihood (NLL) and dispersion.

Figure 4: A schematic diagram depicting two-dimensional representation space after the first representation learning stage. Left: The vanilla approach for the second classifier learning stage re-trains the decision boundary between two classes using deterministic representations. Right: Our proposed methods consider stochastic representations having different degrees of dispersion for each input to build more robust decision boundary.

Fig. 3 demonstrates how classification accuracy and uncertainty estimates, including NLL and Expected Calibration Error (ECE), vary with the number of stochastic representations used for ensembling as defined in Eq. (8). While the classification accuracy remains at a similar level, uncertainty estimates improve as the number of stochastic representations increases, indicating that the stochastic representations indeed capture the uncertainty of inputs, which is helpful for robust predictions.

: Many (a set of classes each with over 100 training examples), Medium (a set of classes each with 20-100 training examples), and Few (a set of classes each with under 20 training examples).

B.2 ADDITIONAL RESULTS ON CIFAR10/100-LT

Figure 5: The per-class dispersion along with class indices in the representation space (left) and the probability space (right). The results are with ResNet-50 on ImageNet-LT.

(x)] for y = 1, ..., K, where D test is a balanced test split. Fig. 6 depicts the per-class weight norm (left) and the per-class marginal likelihood (middle), along with class indices in x-axes sorted by the number of training examples for each class. We also plot the reliability diagram (right; Guo et al., 2017), showing whether the classification model produces well-calibrated predictions.

C.2 OPTIMIZATION

ImageNet-LT and iNaturalist-2018. Throughout the main experiments on ImageNet-LT and iNaturalist-2018, we use an SGD optimizer with batch size 256, Nesterov momentum 0.9, and a single-cycle cosine decaying learning rate starting from the base learning rate of 0.1. Unless specified, the optimization for the representation learning stage terminates after 100 training epochs for ImageNet-LT and 200 training epochs for iNaturalist-2018. For the classifier re-training, we introduce an additional 10% of training epochs to re-train the classifier.

CIFAR10/100-LT. Throughout the additional experiments on CIFAR10/100-LT in Appendix B.2, we apply the same optimization strategy as that of ImageNet-LT and iNaturalist-2018, except for the batch size of 128 and the base learning rate of 0.5.
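For reference, a single-cycle cosine decaying learning rate of the kind described above can be sketched as follows. This plain-Python function is only illustrative: the name `cosine_lr` is hypothetical, and the sketch assumes the rate decays all the way to zero with no warmup, which the text does not specify.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1):
    """Single-cycle cosine decay from base_lr at step 0 to 0 at total_steps.

    Assumptions of this sketch: no warmup phase and a final value of zero.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, the rate equals the base learning rate at the first step, half of it at the midpoint of training, and zero at the end.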

Comparison between SGD and SWA before and after applying cRT on benchmarks. Blue denotes a clear improvement of SWA over SGD, while red denotes deterioration.

Ablation studies of proposed methods on ImageNet-LT: classification accuracy (ACC), negative log-likelihood (NLL), and expected calibration error (ECE). These results are with Class-Balanced Sampling (CBS); refer to Appendix B.1 for the results with the other balancing strategies.

Results on ImageNet-LT and iNaturalist-2018: classification accuracy (ACC), negative log-likelihood (NLL), and expected calibration error (ECE). Full results are available in Appendix B.1.

Further comparisons in classification accuracy (ACC) with state-of-the-art methods on ImageNet-LT. Results for the baselines came from the corresponding paper. † This actually requires training cost of 600 epochs since rwSAM doubles forward and backward passes for each iteration.



Validation results of SRepr with varying temperatures (i.e., $\tau_{\mathrm{KD}}$ in Appendix A.4) on ImageNet-LT. 'N/A' denotes that the training diverged.

Ablation study of proposed methods with various balancing strategies on ImageNet-LT: classification accuracy (ACC), negative log-likelihood (NLL), and expected calibration error (ECE).

come from multiple forward passes of the backbone network during the classifier re-training stage (i.e., lines 11-17). Please refer to Appendix B.5 for further investigation regarding this issue.

Training costs compared with vanilla decoupled training.

Ablation study on an alternative way to generate stochastic representations. $F_m(x)$ denotes the $m$-th stochastic representation for a given input $x$.
SRepr (ours) changing parameters, i.e., $F_m(x) = F_{\theta_m}(x)$, where $\theta_1, ..., \theta_M$
SRepr (ours) changing inputs, i.e., $F_m(x) = F_{\theta_{\mathrm{SWA}}}(x_m)$, where $x_m$ is the $m$-th augmented version of $x$. 53.35 2.115 0.068

backpropagate through the last classifier layer while holding the backbone network frozen. To be concrete, Table 10 presents the wall-clock time of re-training the classifier with and without SRepr for ResNet-50 on ImageNet-LT when M is 10. While SRepr takes about 1.8 times as long per epoch as the vanilla classifier re-training (i.e., 32.36 sec/epoch vs. 58.09 sec/epoch), the performance gain achieved by SRepr outweighs the additional training time. That is, even if we run cRT for twice as many training epochs as SRepr, it still performs worse than SRepr trained for half as many epochs. We also note that cRT cannot reach the numbers obtained by SRepr even if it is run for more training epochs.

. It consists of 115,846 train examples, 20,000 validation examples and 50,000 test examples from 1,000 classes. While the train split is imbalanced, with maximally 1,280 images per class and minimally 5 images per class, the validation and test splits are balanced. We follow the standard data augmentation policy which consists of random resized cropping with an image size of 224 × 224 × 3 and random horizontal flipping. All images are standardized by subtracting the per-channel mean and dividing the result by the per-channel standard deviation. For per-channel mean and standard deviation values, we stay consistent with the values from the full ImageNet-1k dataset (Russakovsky et al., 2015), i.e., mean of (0.485, 0.456, 0.406) and standard deviation of (0.229, 0.224, 0.225) in RGB order.

iNaturalist-2018. It is available at https://github.com/visipedia/inat_comp (Van Horn et al., 2018). It consists of 437,513 train examples, 24,426 validation examples and 149,394 test examples from 8,142 classes. Since the ground-truth labels of the test split are not publicly available, we instead use the balanced validation split as the test split. We apply the same data augmentation for training as that of ImageNet-LT.

Ablation study on mixup augmentation and our approach. (+2.15) 1.990 (-0.319) 0.022 (-0.110)

ACKNOWLEDGEMENTS

This work was partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), Samsung Electronics Co., Ltd (IO201214-08176-01), and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A5A708390811).


(Kang et al., 2020) 62.83±0.23 46.92±0.26 26.33±0.16 50.25±0.18 2.364±0.008 0.110±0.001
+ LWS (Kang et al., 2020) 63.23±0.26 47.57±0.24 27.78±0.23 50.91±0.15 2.197±0.007 0.054±0.001
+ LA (Menon et al., 2021) 60.79±0.20 48.11±0.14 33.20±0.34 50.97±0.13 2.231±0.004 0.063±0.001
+ DisAlign (Zhang et al., 2021) 61.63±0.39 48.68±0.11 32.71±0.45 51.49±0.15 2.596±0.012 0.202±0.002

SWA (ours) 67.71±0.11 40.74±0.15 11.01±0.10 47.08±0.12 2.631±0.009 0.187±0.002
+ cRT (Kang et al., 2020) 63.54±0.18 47.68±0.16 26.85±0.28 50.95±0.12 2.353±0.012 0.120±0.002
+ LWS (Kang et al., 2020) 63.51±0.30 48.53±0.07 28.66±0.45 51.60±0.10 2.189±0.007 0.077±0.002
+ LA (Menon et al., 2021) 61.60±0.07 48.70±0.03 33.68±0.34 51.62±0.05 2.206±0.009 0.077±0.002
+ DisAlign (Zhang et al., 2021)

B.3 FURTHER COMPARISONS WITH STATE-OF-THE-ART METHODS

Increasing the number of training epochs. Throughout the main text, we trained all the competing methods for 100 training epochs on ImageNet-LT, which is sufficient to validate the efficacy of our approach. However, state-of-the-art performances reported in other papers are typically obtained with more training epochs than we used in our experiments. We thus provide the results with a matched number of training epochs in Table 4. It clearly demonstrates that ours outperforms the baselines by a wide margin. Besides, we also tested our method with the ResNeXt-50 architecture (Xie et al., 2017) for a fair comparison with LADE (Hong et al., 2021).

Applying mixup augmentation. Zhong et al. (2021) employed mixup augmentation (Zhang et al., 2018) in decoupled training, which is actually one of the main ingredients for achieving the level of performance they reported. However, the mixup augmentation is not exclusive to their method but applicable to generic approaches, including ours. In Table 4, our approach enhanced by the mixup augmentation indeed outperforms the previous baseline utilizing the mixup (Zhong et al., 2021).

B.4 COMBINING OURS WITH THE EXISTING STATE-OF-THE-ART METHODS

Apart from the decoupled learning scheme (Kang et al., 2020) we mainly considered in the main text, there are other groups of methods achieving state-of-the-art performances: a) utilizing multiple experts (Zhou et al., 2020; Xiang et al., 2020; Wang et al., 2021), or b) applying contrastive learning algorithms (Cui et al., 2021; Zhu et al., 2022) for dealing with long-tailed data.

Although our proposed method is based on the decoupled learning scheme, we would like to clarify that we can combine ours with the existing state-of-the-art frameworks due to its simplicity (i.e., it only needs the SWAG posterior and classifier re-training). To this end, we tested ours upon existing state-of-the-art code bases by simply 1) obtaining the SWAG posterior from the publicly available pre-trained checkpoint with a few SGD iterations (e.g., 10 training epochs) and 2) re-training the classification layer as we proposed in Section 4.2. We adapted the following implementations:

(Lakshminarayanan et al., 2017; Ovadia et al., 2019). Even if SWA-Gaussian well captures a single modality, naive ensembling significantly benefits from the multi-modal Bayesian marginalization (Wilson and Izmailov, 2020). Nevertheless, we believe the clear improvements upon the contrastive learning approach (i.e., BCL), which uses only a single model, bear out the value of the proposed method.

Stochastic Weight Averaging (SWA; Izmailov et al., 2018) has three hyper-parameters: 1) when the averaging phase starts, 2) how frequently the moving average is updated, and 3) how the learning rate is set during the averaging phase. Following the instructions of Izmailov et al. (2018), 1) the averaging phase starts at 75% of the training epochs, 2) we capture parameters at each epoch for averaging, and 3) we use a high constant learning rate $\eta_{\mathrm{SWA}}$ during the averaging phase. Table 14 shows validation accuracy of SWA swept over $\eta_{\mathrm{SWA}} \in \{0.001, 0.005, 0.010, 0.020\}$ on ImageNet-LT.
Throughout the paper, we use $\eta_{\mathrm{SWA}} = 0.010$ for ImageNet-LT, $\eta_{\mathrm{SWA}} = 0.005$ for iNaturalist-2018, and $\eta_{\mathrm{SWA}} = 0.1$ for CIFAR10/100-LT.

Algorithm 1 Decoupled training w/ SWA + SRepr (ours).
Require: Train dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, the feature extractor $F$ parameterized by $\theta$, and the linear classification layer parameterized by $\phi = (w_k, b_k)_{k=1}^{K}$.
Require: The number of representation learning steps $T_1$, the number of classifier re-training steps $T_2$, the learning rate schedule $\eta_t$, moment update start time $T_{\mathrm{SWA}}$ and update frequency $f_{\mathrm{SWA}}$.
Ensure: Parameters $(\theta^*, \phi^{**})$, trained in the decoupled learning scheme.
        Update the parameters with stochastic gradients, where the loss is defined as
 7:     if $t > T_{\mathrm{SWA}}$ and $\mathrm{MOD}(t, f_{\mathrm{SWA}}) = 0$ then
 8:         Update the first and second moments via moving average,
 9:     end if
10: end for
11: // Classifier re-training stage
12: We have the pre-trained parameters $\Theta_{T_1}$, where $\Theta_{T_1} \leftarrow \Theta_{\mathrm{SWA}}$ if we applied SWA.
13: We also have the approximate posterior over the feature extractor parameters, i.e.,
        $q(\theta \mid D) = \mathcal{N}(\theta \mid \theta_{\mathrm{SWA}}, \Sigma_{\mathrm{SWAG}})$, where $\Sigma_{\mathrm{SWAG}} = \mathrm{diag}(\theta'_{\mathrm{SWA}} - \theta_{\mathrm{SWA}}^2)$. (39)
14: for $t \in \{T_1 + 1, ..., T_1 + T_2\}$ do
15:     Sample mini-batch $B \subset D$.
16:     Update the parameters with stochastic gradients (with some balancing strategy), where the loss is defined as
        or $L(\phi) = \mathbb{E}_{(x,y)\sim B}\left[0.5\,\mathcal{L}_{\mathrm{CE}} + 0.5\,\mathcal{L}_{\mathrm{KD}}\right]$, if we applied SRepr. (42)
        Here, the first term is $\mathcal{L}_{\mathrm{CE}} = \mathbb{E}_{\theta\sim q(\theta \mid D)}\left[-\log p^{(y)}(x; (\theta, \phi))\right]$. (43)
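The moment tracking in the representation learning stage and the Gaussian posterior used in the classifier re-training stage of Algorithm 1 can be sketched as below. This is a plain-Python illustration over a flat list of parameters, not the authors' JAX code; the class name `DiagonalSWAG` is hypothetical, and the equal-weight running average and the clipping of negative variances to zero are assumptions of this sketch.

```python
import math
import random

class DiagonalSWAG:
    """Tracks first and second moments of parameters for a diagonal Gaussian."""

    def __init__(self, dim):
        self.n = 0                   # number of captured parameter snapshots
        self.mean = [0.0] * dim      # running mean of theta (the SWA solution)
        self.sq_mean = [0.0] * dim   # running mean of theta squared

    def update(self, theta):
        """Fold one parameter snapshot into both running averages."""
        self.n += 1
        for i, v in enumerate(theta):
            self.mean[i] += (v - self.mean[i]) / self.n
            self.sq_mean[i] += (v * v - self.sq_mean[i]) / self.n

    def variance(self):
        """Diagonal covariance: second moment minus squared mean, clipped at 0."""
        return [max(s - m * m, 0.0) for m, s in zip(self.mean, self.sq_mean)]

    def sample(self, rng):
        """Draw one parameter vector from N(mean, diag(variance))."""
        return [rng.gauss(m, math.sqrt(v))
                for m, v in zip(self.mean, self.variance())]
```

Drawing M samples with `sample` and passing the same input through the feature extractor under each sampled parameter vector would yield the M stochastic representations used during classifier re-training.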

