IMPROVING TAIL LABEL PREDICTION FOR EXTREME MULTI-LABEL LEARNING

Abstract

Extreme multi-label learning (XML) aims to annotate objects with relevant labels from an extremely large label set. Many previous methods treat labels uniformly, such that the learned model tends to perform better on head labels, while the performance is severely deteriorated for tail labels. However, it is often desirable to predict more tail labels in many real-world applications. To alleviate this problem, in this work, we show theoretical and experimental evidence for the inferior performance of representative XML methods on tail labels. Our finding is that the norm of label classifier weights typically follows a long-tailed distribution similar to the label frequency, which results in the over-suppression of tail labels. Based on this new finding, we present two new modules: (1) RANKNET learns to re-rank the predictions by optimizing a population-aware loss, which predicts tail labels with high rank; (2) TAUG augments tail labels via a decoupled learning scheme, which can yield a more balanced classification boundary. We conduct experiments on commonly used XML benchmarks with hundreds of thousands of labels, showing that the proposed methods improve the performance of many state-of-the-art XML models by a considerable margin (6% performance gain with respect to PSP@1 on average).

1. INTRODUCTION

Extreme multi-label learning (XML) aims to annotate objects with relevant labels from an extremely large candidate label set. Recently, XML has demonstrated broad applicability. For example, in webpage categorization Partalas et al. (2015), millions of labels (categories) are collected in Wikipedia and one wishes to annotate new webpages with relevant labels from a huge candidate set; in recommender systems McAuley et al. (2015), one hopes to make informative personalized recommendations from millions of items. Because of the high dimensionality of the label space, classic multi-label learning algorithms, such as Zhang & Zhou (2007); Tsoumakas & Vlahavas (2007), become infeasible. To this end, a number of computationally efficient XML approaches have been proposed Weston et al. (2011); Agrawal et al. (2013); Bi & Kwok (2013); Yu et al. (2014); Bhatia et al. (2015); E.-H. Yen et al. (2016); Yeh et al. (2017); Yen et al. (2017); Tagami (2017). In XML, one important statistical characteristic is that labels follow a long-tailed distribution, as illustrated in Figure 1 (left). Most labels occur only a few times in the dataset. Infrequently occurring labels (referred to as tail labels) possess limited training samples and are harder to predict than frequently occurring ones (referred to as head labels). Many existing XML approaches treat labels with equal importance, such as Prabhu & Varma (2014); Babbar & Schölkopf (2017); Khandagale et al. (2019), while Wei & Li (2018) demonstrates that most predictions of well-established methods are head labels. However, in many real-world applications, it is still desirable to predict more tail labels, which are more rewarding and informative, such as in recommender systems Jain et al. (2016); Babbar & Schölkopf (2019); Wei & Li (2018); Wei et al. (2019). To improve the performance on tail labels, existing solutions typically involve optimizing loss functions that are suitable for tail labels Jain et al. (2016); Babbar & Schölkopf (2019), leveraging the sparsity of tail labels in the annotated label matrix Xu et al. (2016), and transferring knowledge from data-rich head labels to data-scarce tail labels K. Dahiya (2019). These methods typically achieve better performance on tail labels than standard XML methods which treat labels equally, but they usually involve high computational costs. Moreover, previous studies do not explicitly explain the underlying cause of the inferior performance of many standard XML methods on tail labels.

Figure 1: Left: Label frequency follows a long-tailed distribution. Middle: Norm of classifier weights of Bonsai models Khandagale et al. (2019). Right: Norm of classifier weights of Bonsai models when decoupled tail label augmentation is applied.

In this work, we disclose theoretical and experimental evidence for the inferior performance of previous XML methods on tail labels. Our finding is that the norm of label classifier weights follows a long-tailed distribution similar to the label frequency, as shown in Figure 1 (middle), and the prediction score of tail labels is thereby underrated. To alleviate this problem, we propose to rectify the classifier's outputs and training data distribution such that the prediction of tail labels is enhanced. We present two general modules suitable for any well-established XML method: (1) RANKNET learns to re-rank the predictions by optimizing a population-aware loss function, which predicts tail labels with high rank; (2) TAUG augments tail labels via a decoupled learning scheme, which reduces the skewness of training data and yields a more balanced classification boundary. We conduct experiments to verify the effectiveness of the aforementioned instantiations.
From our extensive studies across four benchmark datasets, we make the following contributions:

• We show, from both theoretical and experimental perspectives, that the norm of label classifier weights follows a long-tailed distribution, i.e., the norms of head label classifier weights are considerably larger than those of tail label classifiers, which is a key cause of the inferior performance of many XML methods on tail labels.

• We propose two general modules: RANKNET for prediction score re-ranking by optimizing a new population-aware loss, and TAUG for decoupled tail label augmentation. Both methods can be paired with any XML model without changing the model.

• Experiments verify that our proposed modules achieve significant improvements (6% w.r.t. PSP@1 on average) for well-established XML methods on benchmark datasets.

• We provide an ablation study to highlight the effectiveness of each individual factor.

2. PREVIOUS EFFORTS

Existing work on XML can be roughly categorized into three directions.

One-vs-all methods. This branch of work trains a classifier for each label separately. Due to the huge size of the label set, parallelization Babbar & Schölkopf (2017), label partitioning Khandagale et al. (2019), and label filter Niculescu-Mizil & Abbasnejad (2017) techniques are used to facilitate efficient training and testing. To alleviate memory overhead, recent works restrict the model capacity by imposing sparsity constraints E.-H. Yen et al. (2016) or removing spurious parameters Babbar & Schölkopf (2017). One criticism of one-vs-all methods is that they fail to capture label correlations.

Embedding-based methods. Along this direction, researchers have proposed to embed the feature space and label space into a joint low-dimensional space, and then model the correlation between features and labels in the hidden space Tai & Lin (2012); Chen & Lin (2012); Yu et al. (2014); Bhatia et al. (2015); Tagami (2017); Evron et al. (2018). These methods can dramatically reduce the number of model parameters compared with one-vs-all methods, but involve solving complex optimization problems.

Tree-based methods. In comparison to other types of approaches, tree-based methods greatly reduce inference time, which generally scales logarithmically in the number of labels. There are typically two types of trees, instance trees Prabhu & Varma (2014); Siblini et al. (2018) and label trees Daume III et al. (2016); You et al. (2018), depending on whether instances or labels are partitioned in tree nodes. Tree-based methods usually suffer from low prediction accuracy due to the cascading effect, where a prediction error at the top cannot be corrected at a lower level. These methods can readily scale up to problems with hundreds of thousands of labels. However, Wei & Li (2018; 2019) claim that head labels make a significantly higher contribution to the performance than tail labels.
Therefore, much work has been conducted to improve the performance on tail labels.

Optimization. Jain et al. (2016) proposes propensity scored loss functions that promote the prediction of tail labels with high ranks. Xu et al. (2016) decomposes the label matrix into a low-rank matrix and a sparse matrix. The low-rank matrix is expected to capture label correlations, and the sparse matrix is used to capture tail labels. Babbar & Schölkopf (2019) views tail labels from an adversarial perspective and optimizes the hamming loss to yield a robust model.

Knowledge transfer. K. Dahiya (2019) trains two deep models on head labels and tail labels. The semantic representations learned from head labels are transferred to the tail label model.

These methods achieve better performance on tail labels than standard XML methods which treat labels equally, but they do not explicitly explain the underlying cause of the inferior performance of many standard XML methods on tail labels. In this work, we find that the classification boundary of existing XML methods is skewed toward head labels, causing the inferior performance.

3. METHODOLOGY

In XML, as we possess fewer data about tail labels, models learned on long-tailed datasets tend to exhibit inferior performance on tail labels Wei & Li (2018). However, in practice, it is more informative and rewarding to accurately predict tail labels than head labels Jain et al. (2016). In this work, we attempt to alleviate this problem from the perspective of the classification boundary. We make the observation that the norm of label classifier weights follows a long-tailed distribution similar to the label frequency, which means that the prediction of tail labels is over-suppressed. This finding guides us in improving the prediction of tail labels. We present ways of rectifying the classifier's outputs and data distribution via re-ranking and tail label augmentation, respectively.

Notations. We first describe the notations used throughout the paper. Let $X = \{x_i\}_{i=1}^N$, $Y = \{y_i\}_{i=1}^N$ be a training set of size $N$, where $y_i$ is the label vector for data point $x_i$. Formally, XML is the task of learning a function $f$ that maps an input (or instance) $x \in \mathbb{R}^D$ to its target $y \in \{0, 1\}^L$. We denote $n_j = \sum_{i=1}^N y_{ij}$ as the frequency of the $j$-th label. Without loss of generality, we assume that the labels are sorted by cardinality in non-increasing order, i.e., if $j < k$, then $n_j \ge n_k$, where $1 \le j, k \le L$. In our setting, we have $n_1 \gg n_L$. According to the label frequency, we can split the label set into head labels and tail labels by a threshold $\tau \in (0, 1)$. We denote the head label set $H = \{1, \dots, \lfloor\tau L\rfloor\}$ and the tail label set $T = \{\lfloor\tau L\rfloor + 1, \dots, L\}$, where $\tau$ is a user-specified parameter.
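As a minimal sketch of this notation, the head/tail split can be computed directly from the label matrix. The function name and the dense binary NumPy layout are our own illustration, not part of any released code:

```python
import numpy as np

def split_head_tail(Y, tau):
    """Split labels into head set H and tail set T by frequency (Sec. 3 notation).

    Y   : (N, L) binary label matrix, Y[i, j] = 1 iff label j is relevant to x_i
    tau : user-specified splitting threshold in (0, 1)
    """
    n = Y.sum(axis=0)            # n_j: frequency of the j-th label
    order = np.argsort(-n)       # labels sorted by non-increasing frequency
    cut = int(tau * len(order))  # |H| = floor(tau * L)
    return order[:cut], order[cut:]
```

With tau = 0.1, for instance, the 10% most frequent labels form H and the remaining 90% form T.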

3.1. THE LONG-TAILED DISTRIBUTION OF CLASSIFIER WEIGHTS NORM

We present a different perspective on XML models, showing that their inferior performance on tail labels is due to the imbalanced classification boundary. In Figure 1 (middle), we empirically observe that the norm of label classifier weights follows a long-tailed distribution similar to the label frequency. The results are produced on the EUR-Lex dataset using a representative one-vs-all method, Bonsai Khandagale et al. (2019). A similar observation on the Wiki10-31K dataset is presented in the supplementary material. Since the norm of tail label classifier weights is considerably smaller than that of head label classifier weights, the predicted score of tail labels is typically underestimated at inference time. We further support our finding theoretically and demonstrate that the small norm of tail label classifier weights is the root cause of the inferior performance. We make the following mild assumption on the data: every input $x$ is sampled from the feature space completely at random, and there exists a constant threshold $t > 0$ for the input $x$, such that the top-$k$ prediction for $x$ is made as $\beta^{(k)} = \{y_l \mid P(y_l \mid x) \ge t, 1 \le l \le L\}$, where $P(y_l \mid x)$ denotes the estimated label distribution. Let $W = \{w_j\}_{j=1}^L$ be the weight matrix of a standard XML method. In particular, for binary relevance and tree-based classifiers, $W$ can be obtained by optimizing Eq. (1), where $\mathcal{L}$ denotes the loss function, e.g., the squared hinge loss, and the constant $\lambda$ is a trade-off parameter. Note that for some tree-based methods, such as Bonsai Khandagale et al. (2019) and Parabel Prabhu et al. (2018), we consider $W$ to be the label classifier weights in leaf nodes, i.e., excluding meta-labels of internal tree nodes.

$$\min_{w_j} \; \|w_j\|_2^2 + \lambda \sum_{i=1}^{N} \mathcal{L}\left(Y_{i,j}, \, w_j^\top x_i\right), \quad \forall\, 1 \le j \le L \qquad (1)$$

For deep learning methods, we denote by $W$ the weights of the last linear layer for classification, obtained by optimizing Eq. (2), where $\sigma$ is the softmax function, $f_\theta$ is the feature extractor parameterized by $\theta$, and $\mathcal{L}$ denotes the selected loss function, e.g., binary cross entropy. Note that this interpretation can also be adapted to typical embedding-based methods, such as Yu et al. (2014), where $f_\theta$ is linear and $\sigma$ is the identity function.

$$\min_{W} \; \sum_{i=1}^{N} \mathcal{L}\left(y_i, \, \sigma\left(W f_\theta(x_i)\right)\right) \qquad (2)$$

With the above setup, we summarize our findings in Theorem 1.

Theorem 1. Let $D = \{(x_i, y_i)\}_{i=1}^N$ be a sample set and $W$, which can be decomposed as $\{w_j\}_{j=1}^L$, be the label classifier weights learned on $D$ by optimizing Eq. (1) or Eq. (2). For a uniformly sampled point $x$ which is i.i.d. with the points in $D$, we have $\|w_j\| \propto \mathbb{E}\left[y_j \in \beta^{(k)}\right], \forall\, 1 \le j \le L$, where $\beta^{(k)}$ denotes the $k$ top-ranked indices of predicted labels in $P(y \mid x)$.

This theorem shows the need for re-balancing the classifier weights to improve the performance on tail labels. Motivated by our finding, in the following we propose two new modules and discuss their effectiveness on tail labels. The proof of this theorem can be found in the supplementary material.
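The theorem suggests a simple empirical diagnostic: compute the per-label norms of any learned weight matrix and correlate them with label frequency. A hedged NumPy sketch (function names are ours; $W$ is assumed to store one classifier per row):

```python
import numpy as np

def classifier_norms(W):
    """L2 norm of each label classifier; W has shape (L, D), one row per label."""
    return np.linalg.norm(W, axis=1)

def norm_frequency_correlation(W, n):
    """Pearson correlation between ||w_j|| and label frequency n_j.

    A strongly positive value reproduces the long-tailed norm
    distribution shown in Figure 1 (middle).
    """
    return float(np.corrcoef(classifier_norms(W), n)[0, 1])
```

On long-tailed benchmarks, a correlation near 1 indicates that head classifiers dominate tail classifiers in norm, and hence in prediction score.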

3.2. RANKNET: PREDICTION SCORE RE-RANKING NETWORKS

We introduce a novel re-ranking module motivated by our finding that the prediction scores of tail labels are usually over-suppressed. Conventionally, the ranked list of predicted labels is determined by sorting the scores predicted by an XML model $f$. Since the prediction scores of tail labels are usually underrated, as analyzed above, we propose a re-ranking module that prioritizes tail labels with higher scores than head labels. More specifically, for a given sample $x$, the probability of the $l$-th label $y_l$ being relevant to $x$ is specified by $\hat{y}_l = f(x)_l = P(y_l \mid x)$, which can be estimated by any existing XML model $f$. A RankNet block is a computational unit which maps the raw predictions to enhanced, population-aware predictions. Formally, given the raw prediction $\hat{y} = f(x)$, $\hat{y} \in \mathbb{R}^{L \times 1}$, a RankNet building block is defined as $F(x) = W_r \hat{y} + \hat{y}$, where $W_r \in \mathbb{R}^{L \times L}$ contains the learnable parameters of the RankNet block, which is able to capture label correlations. To train RankNet, we propose a new population-aware loss function to minimize, which enforces higher predictions for tail labels than for head labels while maintaining prediction accuracy:

$$\mathcal{L}(F(x), y) = \sum_{i \in L_y} \sum_{j \in L_y} \frac{n_i - n_j}{n_i + n_j} \left[F(x)_i - F(x)_j\right]_+ + \lambda \sum_{i \in L_y} \sum_{k \notin L_y} \left[F(x)_k - F(x)_i + c\right]_+ .$$

Here, $[z]_+ := \max(0, z)$ and $L_y$ denotes the set of indices that are non-zero in $y$, where $|L_y| \ll L$. $c \ge 0$ is a constant which controls the margin. In particular, the first term enforces $F(x)_i < F(x)_j$ if $n_i > n_j$, and the denominator $n_i + n_j$ is used for normalization. The second term aims to retain predictive accuracy by enforcing the predicted scores of relevant labels to be larger than those of irrelevant labels. Finally, $\lambda$ is used to balance the two terms. Note that since $|L_y| \ll L$, the first term can be computed efficiently. To further speed up training, a subset of irrelevant labels is sampled randomly.
Therefore, RankNet introduces limited computational overhead, and more expressive networks can be used. In particular, the function $F$ can also be defined as $F(x) = W_{r2}\, \sigma(W_{r1} \hat{y}) + \hat{y}$, where $W_{r1} \in \mathbb{R}^{V \times L}$ and $W_{r2} \in \mathbb{R}^{L \times V}$ are weights learned by gradient descent, $V \ll L$ is the dimension of a bottleneck layer that reduces the number of parameters, and $\sigma$ is the activation function. In fact, we can stack any number of RankNet blocks to form a deep RankNet module, where the output of each block is an enhancement over the output of the previous block. By doing so, the enhanced predictions $F(x)$ become aware of the label population. We also provide a simple instantiation of $F(x)$ as $F(x)_l = r_l f(x)_l$ for the $l$-th label, where $r_l$ represents the inverse propensity score $r_l = 1 + C (n_l + B)^{-A}$. $A$, $B$, $C$ are set as recommended in Jain et al. (2016). Finally, the top-ranked predictions in $F(x)$, rather than $f(x)$, are selected as the final predictions. As one can expect, this increases the propensity of tail labels in inference, so more tail labels are shortlisted in the final predictions. Unlike existing re-ranking methods Jain et al. (2016), RankNet is a general architecture and can be appended to any standard XML method.
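To make the two ingredients concrete, the sketch below implements the population-aware loss and the inverse-propensity instantiation in plain NumPy. For clarity we keep only the positive-weight pairs of the first term (pairs with $n_i > n_j$), so the sketch is bounded below by zero; all names and the illustrative constant $C$ are our own assumptions:

```python
import numpy as np

def population_aware_loss(F, y, n, lam=1.0, c=0.1):
    """Population-aware ranking loss of Sec. 3.2 (sketch).

    F : (L,) re-ranked scores F(x);  y : (L,) binary relevance vector
    n : (L,) label frequencies;      lam, c : balance weight and margin
    """
    rel = np.flatnonzero(y)        # L_y: indices of relevant labels
    irr = np.flatnonzero(y == 0)   # irrelevant labels (subsampled in practice)
    loss = 0.0
    for i in rel:                  # term 1: if n_i > n_j, penalize F_i > F_j,
        for j in rel:              # pushing tail labels above head labels
            if n[i] > n[j]:
                loss += (n[i] - n[j]) / (n[i] + n[j]) * max(0.0, F[i] - F[j])
    for i in rel:                  # term 2: keep relevant labels above
        for k in irr:              # irrelevant ones by a margin c
            loss += lam * max(0.0, F[k] - F[i] + c)
    return loss

def propensity_rerank(scores, n, A=0.55, B=1.5, C=5.0):
    """Simple instantiation F(x)_l = r_l * f(x)_l with r_l = 1 + C (n_l + B)^(-A).

    A and B follow the form in Jain et al. (2016); C = 5.0 is an arbitrary
    illustrative value, not a recommended setting.
    """
    r = 1.0 + C * (n + B) ** (-A)
    return r * scores
```

Rarer labels receive a larger multiplier $r_l$, so more tail labels are shortlisted in the final top-$k$ predictions.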

3.3. TAUG: DECOUPLED LEARNING SCHEME AND TAIL LABEL AUGMENTATION

From another point of view, since the imbalance of classifier norms is rooted in the long-tailed data distribution, we propose to resolve this problem by reducing the skewness of the training data.

Decoupling the learning of head labels and tail labels. We propose to decouple the learning of head labels and tail labels, instead of learning models jointly. This has two main benefits: (1) the decoupled learning scheme helps avoid modeling highly imbalanced data, i.e., the data distributions within head labels and within tail labels are relatively less imbalanced; (2) the head label model and the tail label model can be trained in parallel, which reduces training time. Recall that $H$ and $T$ denote the head label set and tail label set, respectively. We split the training set $D = \{(x_i, y_i)\}_{i=1}^N$ into two parts: $D_h = \{(x_i, y_i) \mid \exists\, j \in H, y_{ij} = 1\}$ and $D_t = \{(x_i, y_i) \mid \exists\, j \in T, y_{ij} = 1\}$. Models are then learned on $D_h$ and $D_t$, respectively. At inference time, the prediction scores of the two models are integrated.

Tail Label Augmentation. To better explore the data distribution of tail labels, we consider two data augmentation techniques, input dropout and input swap. 1. Input dropout: For a selected keep probability $\rho \in [0, 1]$ and an input sample $x$, it produces an augmented input $x' = x \odot \text{Bernoulli}(\rho, D)$, where $\odot$ denotes element-wise multiplication. 2. Input swap: For each instance $x$, two active features are randomly identified and their values are swapped. This procedure can be repeated multiple times. Formally, for a pair of feature coordinates $i, j$, where $1 \le i, j \le D$, we swap their values $x_i$ and $x_j$. Note that both data augmentation methods are label-invariant. In other words, for a given sample $(x, y)$ and its augmented instance $x'$, we take $y' = y$ as the corresponding label vector of $x'$.
Importantly, it has been observed that there is significant variation in the input features of tail labels between the training set and the test set Babbar & Schölkopf (2019). By generating more similar samples, augmentation discourages the model from fitting spurious patterns in the input features when training data is scarce, and it also promotes robustness to corruption of the input features. The proposed decoupled learning scheme and tail label augmentation methods are observed to yield a more balanced classification boundary, as demonstrated in Figure 1 (right).
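Both augmentations are only a few lines each. The NumPy sketch below follows the definitions above; the fixed seed and function names are our own choices for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)

def input_dropout(x, keep_prob):
    """Keep each coordinate independently with probability keep_prob (rho)."""
    mask = rng.random(x.shape) < keep_prob   # Bernoulli(rho, D) mask
    return x * mask

def input_swap(x, n_swaps=1):
    """Swap the values of two randomly chosen active (non-zero) features."""
    x = x.copy()
    active = np.flatnonzero(x)
    for _ in range(n_swaps):
        if len(active) < 2:
            break
        i, j = rng.choice(active, size=2, replace=False)
        x[i], x[j] = x[j], x[i]
    return x
```

Both transforms are label-invariant, so an augmented sample simply reuses the label vector of its source sample.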

4. EXPERIMENTS

Datasets. We perform experiments on four XML datasets which are publicly available from the XML repository. Detailed statistics are summarized in Table 1, where $\bar{L}$ denotes the average number of labels per sample and $\bar{N}$ denotes the average number of samples per label.foot_0

Implementation. Without further specification, we set the label splitting threshold $\tau = 0.1$ for EUR-Lex, and $\tau = 0.01$ for AmazonCat-13K, Wiki10-31K, and Amazon-670K. For tail label augmentation, we fix n_aug = 4, which means four auxiliary data points are generated for each sample. Recommended settings are used for all XML algorithms as specified in their papers.

Evaluation Metrics. We evaluate XML models on the test set and report results with respect to the commonly used evaluation metrics, i.e., P@k, nDCG@k, PSP@k, and PSnDCG@k (PSN@k), where k ∈ {1, 3, 5}. Due to limited space, we elaborate on the definitions in the supplementary material.

From Table 2, it is easy to observe that in all cases, the three XML methods employing prediction score re-ranking achieve significantly higher PSP@k and PSnDCG@k compared with their baselines. In particular, FastXML achieves as much as 5.81%, 8.4%, 0.94%, and 2.39% overall improvement on the four datasets across PSP@k and PSnDCG@k, respectively. In comparison, Bonsai outperforms its baseline by a larger margin, i.e., 5.31%, 7.49%, 5.84%, and 2.64% on the four datasets, respectively. Similarly, Parabel achieves performance gains comparable to Bonsai, i.e., 5.31%, 7.51%, 5.26%, and 2.66%. This demonstrates that RANKNET provides an effective way to rectify the predictions of existing XML models, whose predicted scores for tail labels are indeed over-suppressed.

In the following, we verify the effectiveness of the decoupled tail label augmentation. We choose Bonsai Khandagale et al. (2019) as our base model for its appealing performance as shown in Table 2.

Bag-of-Words (BOW) vs. Dense Embedding.
Since many benchmark datasets for XML are text data, we find that the dense embeddings used in Chang et al. (2020) achieve significant gains over BOW. We compare the results on EUR-Lex, and find that dense embeddings achieve 2.87% and 3.13% higher performance w.r.t. PSP@k and PSnDCG@k on average, respectively. We conduct experiments using dense embeddings in the rest of this paper, except for Amazon-670K, for which they are not available.

Classifier Weights Normalization. As a straightforward way of balancing the norms of classifier weights Kang et al. (2020), we examine the effectiveness of weight normalization. It does not show a significant effect on performance. In particular, it improves performance by 0.63% w.r.t. PSnDCG@k, but drops performance by 0.57% w.r.t. PSP@k, on EUR-Lex. This suggests that weight normalization, which equalizes the propensity of labels, is undesirable for XML.

Tail Label Augmentation. We now justify our claim that it is beneficial to decouple the learning of head labels and tail labels and to augment tail labels, which generates more balanced classification boundaries. The results are reported in Table 3. Since the focus of this paper is the performance improvement for tail labels, we report and compare the results in terms of PSP@k and PSnDCG@k. As we can see from the results, TAUG achieves average improvements of 3%, 1.06%, 1.59%, and 7.42% w.r.t. PSP@k, and 3.5%, 1.08%, 1.41%, and 7.54% w.r.t. PSnDCG@k, on the four datasets. This demonstrates that the two investigated data augmentation techniques, applied via the decoupled learning scheme, can help the learning of tail labels by yielding a more balanced classification boundary which predicts tail labels with relatively larger scores compared with the baseline.

Comparison with State-of-the-Art Methods. We further compare with methods that report state-of-the-art results on tail labels, including AttentionXML You et al. (2018) and GLaS Guo et al. (2019). Since PSnDCG@k is unavailable for AttentionXML and GLaS, we report and compare the results w.r.t.
PSP@k in this part. We apply the proposed prediction score re-ranking module (RANKNET) individually and jointly with the decoupled tail label augmentation module (TAUG) for comparison. Though both AttentionXML and GLaS are carefully designed deep learning methods, it is surprising to see that our Bonsai-based variants achieve the best results in 8 out of 12 cases, and the second-best results in the other cases. Apart from deep learning methods, PfastreXML and ProXML are two leading approaches which achieve good performance on tail labels, yet they are outperformed by our methods in most cases. In comparison with the baselines in Table 2, RANKNET+TAUG demonstrates more than 6% performance gains w.r.t. PSP@1 on average.foot_1

We are also interested in how the strength of data augmentation affects the classifier weights and the model performance. In Figure 2 (left), we illustrate the norms of classifier weights for n_aug ∈ {0, 1, 4, 8}, where n_aug indicates the number of augmented samples generated for each data point. Note that when n_aug = 0, models are learned on the initial training data without augmentation. It can be noted that the norms of the tail label classifiers become larger as n_aug increases. In Figure 2 (right), the performance w.r.t. PSP@k tends to improve as n_aug increases. As one can expect, P@k drops by a narrow margin. These results suggest that data augmentation can help re-balance the norms of classifier weights, which is beneficial to tail labels. In addition, we conduct ablation studies to compare the two data augmentation techniques, i.e., input dropout and input swap, in the supplementary material. We also demonstrate the effect of different label splitting thresholds τ.

5. CONCLUSION

In this paper, we show from both theoretical and empirical perspectives that the norm of label classifier weights follows a long-tailed distribution when labels are treated uniformly, which is a key cause of the inferior performance on tail labels. To alleviate this problem, we explore a re-ranking module that optimizes a new population-aware loss, and a tail label augmentation module that decouples head labels and tail labels. Through extensive studies, our two proposed modules achieve significant performance gains. Moreover, both modules can be readily applied to any well-established XML method without changing its model. We believe that our findings not only contribute to a deeper understanding of the tail label problem, but can also offer inspiration for future work.

A THEORETICAL JUSTIFICATION OF THE INFERIOR PERFORMANCE ON TAIL LABELS

Theorem 1. Let $D = \{(x_i, y_i)\}_{i=1}^N$ be a sample set and $W$, which can be decomposed as $\{w_j\}_{j=1}^L$, be the label classifier weights learned on $D$ by optimizing Eq. (1) or Eq. (2). For a uniformly sampled point $x$ which is i.i.d. with the points in $D$, we have $\|w_j\| \propto \mathbb{E}\left[y_j \in \beta^{(k)}\right], \forall\, 1 \le j \le L$, where $\beta^{(k)}$ denotes the $k$ top-ranked indices of predicted labels in $P(y \mid x)$.

Proof. Without loss of generality, we assume $\|w_1\| \ge \|w_2\| \ge \cdots \ge \|w_L\| > 0$. For any input $x$, the prediction score for the $j$-th label is computed as $P(y_j \mid x) = g(w_j^\top x)$, where $g(\cdot)$ is a monotonically increasing link function, such as the exponential function; the labels with the $k$ largest prediction scores are selected as the final predictions. For simplicity, we assume $g(z) = z$ is the identity function; our analysis can be easily extended to the exponential function. Suppose that $t \in (0, 1)$ is the threshold for input $x$, such that the final prediction is $\beta^{(k)} = \{y_j \mid w_j^\top x \ge t, 1 \le j \le L\}$, where $|\beta^{(k)}| = k$. Here we assume there exists a small constant $k \ll L$ such that $w_j^\top x \ge 0, \forall\, 1 \le j \le k$, which is reasonable in extreme classification. We have $w_j^\top x = \|w_j\| \cdot \|x\| \cos\theta_j$, where $\theta_j$ denotes the angle between the classifier $w_j$ and the sample $x$. Note that $x$ is usually normalized in advance, so $\|x\|$ can be considered a constant across samples. In other words, the prediction can be rewritten as $\beta^{(k)} = \{y_j \mid \cos\theta_j \ge \frac{t}{\|w_j\| \cdot \|x\|}, 1 \le j \le L\}$. This can be viewed as $x$ being sampled completely at random from a ball of radius $\|x\|$ in the feature space, which means that $\theta_j$ is uniformly sampled from $[0, \pi]$. Note that we always have $\|w_j\| \ge 1, \forall\, 1 \le j \le L$, because the bias term of each label classifier is set to 1, which means $\frac{t}{\|w_j\|} < 1$. Let $b = \arccos \frac{t}{\|w_j\| \cdot \|x\|}$; then $\mathbb{P}(y_j \in \beta^{(k)}) = \frac{b}{\pi}$. Taking the expectation over $\theta_j$, we have $\mathbb{E}\left[y_j \in \beta^{(k)}\right] = \int_0^\pi \mathbb{P}(y_j \in \beta^{(k)})\, d\theta_j = b$.
Since $b$ typically scales with $\|w_j\|$, we conclude that the probability that the $j$-th label is included in the top-$k$ predictions for input $x$ is proportional to its classifier's norm $\|w_j\|$, or formally $\|w_j\| \propto \mathbb{E}\left[y_j \in \beta^{(k)}\right], \forall\, 1 \le j \le L$. We empirically illustrate this finding in Figure 3.

B.2 EVALUATION METRICS

P@k. For each instance $x$, only the $k$ top-ranked predictions will be considered. The top-$k$ precision is defined for a predicted score vector $\hat{y} \in \mathbb{R}^L$ and ground truth label vector $y \in \{-1, 1\}^L$ as
$$\text{P@}k := \frac{1}{k} \sum_{l \in \text{rank}_k(\hat{y})} y_l,$$
where $\text{rank}_k(\hat{y})$ returns the indices of the $k$ largest values in $\hat{y}$ ranked in descending order.

nDCG@k. nDCG@k is another commonly used ranking-based performance measure:
$$\text{nDCG@}k := \frac{\text{DCG@}k}{\sum_{l=1}^{\min(k, \|y\|_0)} \frac{1}{\log(l+1)}}, \quad \text{where} \quad \text{DCG@}k := \sum_{l \in \text{rank}_k(\hat{y})} \frac{y_l}{\log(l+1)}$$
and $\|y\|_0$ returns the 0-norm of the true-label vector.

PSP@k. Propensity scored variants of such losses, including precision@k and nDCG@k, have been developed and proved to give unbiased estimates of the true loss function even when ground-truth labels go missing under arbitrary probabilistic label noise models:
$$\text{PSP@}k := \frac{1}{k} \sum_{l \in \text{rank}_k(\hat{y})} \frac{y_l}{p_l}.$$
Here, $p_l$ is the propensity score for label $l$, which helps in making the metric unbiased.

PSnDCG@k. Similar to nDCG@k, its propensity scored variant is defined as
$$\text{PSnDCG@}k := \frac{\text{PSDCG@}k}{\sum_{l=1}^{k} \frac{1}{\log(l+1)}}, \quad \text{where} \quad \text{PSDCG@}k := \sum_{l \in \text{rank}_k(\hat{y})} \frac{y_l}{p_l \log(l+1)}.$$

B.3 RESULTS OF RE-RANKING W.R.T. P@k AND NDCG@k

As shown in Table 5, performance with respect to P@k and nDCG@k usually deteriorates. By using RANKNET, more tail labels are predicted with higher confidence than head labels, which introduces more false-positive predictions. This shows that there is a trade-off between propensity scored measures and vanilla measures, to be balanced according to the specific demands of applications.
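As a concrete reference for the metric definitions above, PSP@k can be computed in a few lines. In this sketch (names ours) $y$ is taken as a 0/1 vector rather than the $\pm 1$ convention, and $p$ holds precomputed propensity scores:

```python
import numpy as np

def psp_at_k(scores, y, p, k):
    """PSP@k := (1/k) * sum_{l in rank_k(scores)} y_l / p_l."""
    topk = np.argsort(-scores)[:k]   # rank_k: indices of the k largest scores
    return float(np.sum(y[topk] / p[topk]) / k)

def p_at_k(scores, y, k):
    """Vanilla P@k := (1/k) * sum_{l in rank_k(scores)} y_l."""
    topk = np.argsort(-scores)[:k]
    return float(np.sum(y[topk]) / k)
```

Tail labels have small propensities $p_l$, so a correct tail prediction contributes more to PSP@k than a correct head prediction, which is exactly why re-ranking can raise PSP@k while slightly lowering P@k.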

B.4 HOW DOES THE LABEL SPLITTING THRESHOLD MATTER

In Figure 5, we demonstrate how the splitting threshold affects the results. We experiment with $L_h = \lfloor\tau L\rfloor$ and $L_t = L - \lfloor\tau L\rfloor$, where $\tau \in \{0.1, 0.2, 0.3, 0.4, 0.5, 1\}$ for EUR-Lex and $\tau \in \{0.01, 0.02, 0.03, 0.04, 0.05, 1\}$ for Wiki10-31K. Note that when $\tau = 1$, all labels are considered head labels and no data augmentation is conducted. As $L_t$ decreases (i.e., $\tau$ increases), performance in terms of PSP@k typically drops slightly, suggesting that most labels should be considered tail labels and are in need of data augmentation.

Table 5: Comparison between methods with and without re-ranking in terms of P@k and nDCG@k.

Figure 5: The performance in terms of PSP@k as a function of the splitting threshold on EUR-Lex (left) and Wiki10-31K (right). Results are produced using the Bonsai method.

B.5 ABLATIONS ON THE INPUT DROPOUT AND INPUT SWAP

We conduct ablation studies to compare the effectiveness of the two proposed tail label data augmentation strategies. We compare four methods using Bonsai as the base model:

• baseline: this method does not use any data augmentation technique.

• TAUG-d: this method uses input dropout only, with n_aug = 4.



Datasets are available at the Extreme Classification Repository. Our anonymous code is available in the supplementary material.



HOW DOES THE PREDICTION SCORE RE-RANKING AFFECT THE RESULTS

We evaluate the effectiveness of the proposed score re-ranking method RANKNET. We run three popular XML algorithms, including FastXML Prabhu & Varma (2014), Bonsai Khandagale et al. (2019), and Parabel Prabhu et al. (2018), for comparison.

Figure 2: Left: Norms of classifier weights with varying n_aug. Right: The performance w.r.t. P@5 and PSP@k as a function of n_aug. Results in both figures are produced using Bonsai on EUR-Lex.

Figure 3: Illustration of different classifiers and their corresponding decision boundaries, where $w_i$ and $w_j$, with $\|w_i\| > \|w_j\|$, denote the classification weights for classes $i$ and $j$, respectively. $C_i$ and $C_j$ are the classification cones belonging to the $i$-th and $j$-th class in the feature space, respectively. The classifier with the larger weight norm, i.e., $w_i$, has the wider decision boundary.

Table 1: Statistics of datasets.





Comparison with state-of-the-art methods w.r.t. PSP@k. Bold numbers are the best and underlined numbers are the second-best. RANKNET and RANKNET+TAUG are our proposed methods.

