GANDALF: DATA AUGMENTATION IS ALL YOU NEED FOR EXTREME CLASSIFICATION

Abstract

Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign to an input a subset of the most relevant labels from millions of label choices. Recent works in this domain have increasingly focused on the problem setting with short-text input data, and labels endowed with short textual descriptions called label features. Short-text XMC with label features has found numerous applications in areas such as prediction of related searches, title-based product recommendation, and bid-phrase suggestion, amongst others. In this paper, we propose Gandalf, a graph-induced data augmentation based on label features, such that the generated data points can supplement the training distribution. By exploiting the characteristics of the short-text XMC problem, it leverages the label features to construct valid training instances, and uses the label graph for generating the corresponding soft-label targets, hence effectively capturing the label-label correlations. While most recent advances (such as SIAMESEXML and ECLARE) in XMC have been algorithmic, mainly aimed towards developing novel deep-learning architectures, our data-centric augmentation approach is orthogonal to these methodologies. We demonstrate the generality and effectiveness of Gandalf by showing up to 30% relative improvements for 5 state-of-the-art algorithms across 4 benchmark datasets consisting of up to 1.3 million labels.

1. INTRODUCTION

Extreme Multi-label Classification (XMC) has found multiple applications in the domains of related searches (Jain et al., 2019), product recommendation (Medini et al., 2019), dynamic search advertising (Prabhu et al., 2018), etc., which require predicting the most relevant results that either frequently co-occur or are highly correlated with the given product instance or search query. In the XMC setting, these problems are often modelled through embedding-based retrieval-cum-ranking pipelines over millions of possible web pages/products/ad-phrases considered as labels.

Nature of short-text XMC and extreme class imbalance. Typically, in the tasks of related-search prediction, bid-phrase suggestion, and title-based related-product recommendation, the input data instance is a short-text query. These short-text instances (names or titles) consist, on average, of only 3-8 words. In order to effectively model these scenarios, there has been an increasing focus on building encoders, as part of deep learning pipelines, that can capture the nuances of such short-text inputs (Dahiya et al., 2021b; Kharbanda et al., 2021). Real-world datasets in XMC are highly imbalanced towards popular or trending ad-phrases/products. Moreover, these datasets adhere to Zipf's law (Ye et al., 2020), i.e., most labels in these extremely large output spaces are tail labels, having very few (< 5) instances in a training set spanning hundreds of thousands of data points (Tab. 1, Appendix). While there is already an insufficiency of training data, the short-text nature of training instances makes it even more challenging for models to learn meaningful, non-overfitting encoded representations for tail words and labels.
Frugal architectures and label features. Due to the low latency requirements of XMC applications, most recent works also focus on building lightweight, frugal architectures that can predict in milliseconds and scale to millions of labels (Dahiya et al., 2021a). Despite being frugal in terms of the number of layers/parameters in the network, these models are capable of fitting the training data well, although their generalization to test samples remains poor (Fig. 1a). Hence, creating deeper models for better representation learning is perhaps not optimal under this setting. Recent works, however, make expensive architectural adjustments (Mittal et al., 2021a) to leverage the text associated with labels ("label features", discussed in §2) in order to improve generalization.

1.1. RELATED WORK: XMC WITH LABEL FEATURES

Earlier works in XMC primarily focused on problems involving entire long-text documents, consisting of hundreds of words/tokens, such as those encountered in tagging for Wikipedia (Babbar & Schölkopf, 2017; You et al., 2019). On the output side, labels were identified by numeric IDs and hence devoid of any semantic meaning. Most works under this setting aim at scaling up transformers as encoders for XMC tasks (Chang et al., 2020; Zhang et al., 2021). By associating labels with their corresponding texts, which are, in turn, product titles, document names or bid-phrases themselves, the contemporary application of XMC has gone beyond standard document-tagging tasks. With the existence of label features, there are three correlations that can be exploited for better representation learning: (i) query-label, (ii) query-query, and (iii) label-label correlations. Recent works have successfully leveraged label features and pushed the state-of-the-art by exploiting the first two correlations. For example, SIAMESEXML (Dahiya et al., 2021a) employs a siamese pre-training stage based on a contrastive learning objective between a data point and its label features, optimizing a negative log-likelihood loss. GALAXC (Saini et al., 2021) employs a graph convolutional network over a combined query-label bipartite graph. DECAF and ECLARE (Mittal et al., 2021a;b) make architectural additions to exploit higher-order query-label correlations by extending the DeepXML pipeline to accommodate extra ASTEC-like encoders (Dahiya et al., 2021b). In contrast to the recent algorithmic developments for short-text XMC with label features, and following the work of (Banko & Brill, 2001), which posits the higher relevance of developing more training data over the choice of classifier in small-data regimes, we take a data-centric approach and focus on developing data augmentation techniques for short-text XMC.
• We show that by using Gandalf, methods which do not inherently leverage label features beat strong baselines which either employ complicated training procedures (Dahiya et al., 2021a) or make heavy architectural modifications (Mittal et al., 2021a;b) to benefit from label features.
• In order to test Gandalf against a strong data-augmentation baseline, we propose LabelMix, an effective interpolation-based data augmentation baseline which does not currently exist for short-text XMC. In the process of arriving at LabelMix, we also discuss the effectiveness of mixup (Zhang et al., 2018) and its variants, and aim to answer "Can we extend mixup to feature-label extrapolation to guarantee a robust model behavior far away from the training data?", a question posed as future work in (Zhang et al., 2018).

2. WHAT EXACTLY ARE LABEL FEATURES?

To elaborate on label features, we take examples relevant to our datasets: (i) LF-WikiTitles-500K, where the model needs to predict the relevant categories given only the title of a Wikipedia page, and (ii) LF-AmazonTitles-131K, where, given a product's name, the model needs to recommend related products. Observations: In view of these examples, one can affirm two important observations: (i) the short-text XMC problem indeed requires recommending similar items which are either highly correlated or co-occur frequently with the queried item, and (ii) the queried item and the corresponding label features form an "equivalence class" and convey similar intent (Dahiya et al., 2021a). For example, a valid news-headline search on a search engine should result in either a page mentioning the same headline or similar re-phrased headlines from other news media outlets (see Example 1). As a result, it can be argued that data instances are interchangeable with their respective labels' features.

3. GANDALF: DATA AUGMENTATION FOR EXTREME CLASSIFICATION

Notation & Background. For training, we have available a multi-label dataset D = {{(x_i, y_i)}_{i=1}^N, {z_l}_{l=1}^L}¹ comprising N data points. Each i ∈ [N] is associated with a small ground-truth label set y_i ⊂ [L] from L ∼ 10^6 possible labels. Further, x_i, z_l ∈ X denote the textual descriptions of data point i and label l, which, in this setting, derive from the same vocabulary universe V (Dahiya et al., 2021a). The goal is to learn a parameterized function f which maps each instance x_i to the vector of its true labels y_i ∈ {0, 1}^L, where y_{il} = 1 ⇔ l ∈ y_i. A common strategy for handling this learning problem, called the two-towers approach, is to map instances and labels into a common Euclidean space E = R^d, in which the relevance s_l(x) of a label l to an instance is scored using an inner product, s_l(x) = ⟨Φ(x), Ψ(l)⟩. We call Φ(x) the encoding representation of the instance x, and w_l := Ψ(l) the decoding representation of label l. If labels are featureless integers, then Ψ turns into a simple table lookup. In our setting, l is associated with features z_l, so we identify Ψ(l) = Ψ(z_l). The prediction function selects the k highest-scoring labels, f(x) = top_k(⟨Φ(x), Ψ(·)⟩). Training is usually handled using the one-vs-all paradigm, which applies a binary loss function ℓ to each entry of the score vector. In practice, performing the sum over all labels for each instance is prohibitively expensive, so the sum is approximated by a shortlist of labels S(x_i) that typically contains all the positive labels, and only those negative labels which are expected to be particularly challenging for classification (You et al., 2019; Dahiya et al., 2021b; Kharbanda et al., 2021), leading to

L_D[Φ, Ψ] = Σ_{i=1}^N Σ_{l=1}^L ℓ(y_{il}, ⟨Φ(x_i), Ψ(l)⟩) ≈ Σ_{i=1}^N Σ_{l ∈ S(x_i)} ℓ(y_{il}, ⟨Φ(x_i), Ψ(l)⟩).
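As a concrete illustration, the two-tower scoring and the shortlist-approximated one-vs-all loss above can be sketched as follows. This is a minimal NumPy sketch: the random "encoder" outputs, the shortlist S_x, and all dimensions are stand-ins for illustration, not any model's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 100                       # embedding dim, number of labels

# Stand-in label decoder: rows are w_l = Psi(l). In practice Phi and Psi
# are learned networks; random vectors are used here purely for illustration.
W_label = rng.normal(size=(L, d))

def scores(phi_x):
    """Relevance s_l(x) = <Phi(x), Psi(l)> for every label l."""
    return W_label @ phi_x

def predict_topk(phi_x, k=5):
    """f(x) = top-k highest-scoring labels."""
    return np.argsort(-scores(phi_x))[:k]

def shortlist_bce_loss(phi_x, y, shortlist):
    """One-vs-all binary cross-entropy restricted to the shortlist S(x)."""
    s = scores(phi_x)[shortlist]
    p = 1.0 / (1.0 + np.exp(-s))                 # sigmoid
    t = y[shortlist]
    eps = 1e-12
    return -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

phi_x = rng.normal(size=d)                       # pretend encoder output Phi(x)
y = np.zeros(L); y[[3, 7]] = 1.0                 # ground-truth label vector
S_x = np.array([3, 7, 11, 42, 99])               # positives + hard negatives
```

The shortlist reduces the per-instance cost from O(L) to O(|S(x)|), which is what makes one-vs-all training tractable at the extreme scale.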
(1)

Label Features as Data Points. It is known that standard training on XMC datasets can easily lead to overfitting even with simple classifiers (Guo et al., 2019), which is confirmed for our setting in Fig. 1. To reduce the generalization gap, regularization needs to be applied to the label decoder Ψ, either explicitly as a new term in the loss function (Guo et al., 2019), or implicitly through the inductive biases of the network structure (Mittal et al., 2021a;b). Exploiting the interchangeability of label and instance text, SIAMESEXML (Dahiya et al., 2021a) proposes to tie the encoder and decoder together and require Ψ(l) = Φ(z_l). While indeed yielding improved test performance, this approach has two drawbacks. Firstly, the condition Ψ(l) = Φ(z_l) turns out to be too strong, and it has to allow for some fine-tuning corrections η_l, yielding Ψ(l) = Φ(z_l) + η_l. Secondly, training SIAMESEXML becomes a multi-staged process: initially, a contrastive loss needs to be minimized, followed by fine-tuning with a classification objective. Dahiya et al. (2021a) motivate their approach by postulating a self-annotation property (Label Self-Proximity), which claims that a label l is relevant to its own textual features with high probability, P[Y_l = 1 | X = z_l] > 1 - ϵ for some small ϵ ≪ 1. A natural question thus arises: in a label space spanning ∼10^6 labels, what are the other labels that annotate z_l when it is posed as a data point? To effectively augment the training set with z_l as a data point, we need to provide values for the other entries of the label vector y_l. These labels should be sampled according to y_l ∼ P[Y | X = z_l], which means we need to find sensible approximations to the probabilities P[Y_j = 1 | X = z_l] for the other labels. When using the cross-entropy loss, sampling can be forgone by replacing the discrete labels y_l ∈ {0, 1}^L with soft labels y_l^soft = P[Y | X = z_l].
Exploiting Label Co-Occurrences. In order to derive a model for P[Y_l' = 1 | X = z_l], we can take inspiration from the GLaS regularizer (Guo et al., 2019). This regularizer tries to make the Gram matrix of the label embeddings ⟨w_l, w_l'⟩ reproduce the co-occurrence statistics S of the labels,

R_GLaS[Ψ] = L^{-2} Σ_{l=1}^L Σ_{l'=1}^L (⟨w_l, w_l'⟩ - S_{ll'})².   (2)

Here, S denotes the symmetrized conditional probabilities,

S_{ll'} := 0.5 (P[Y_l = 1 | Y_l' = 1] + P[Y_l' = 1 | Y_l = 1]).   (3)

Plugging in w_l = Ψ(z_l), this regularizer reaches its minimum if

⟨Ψ(z_l), Ψ(z_l')⟩ = S_{ll'}.   (4)

By the self-proximity postulate, we can assume Ψ(z_l) ≈ Φ(z_l). For a given augmented instance with target soft label (z_l, y_{ll'}^soft), the training will try to minimize ℓ(⟨Φ(z_l), Ψ(z_l')⟩, y_{ll'}^soft). To be consistent with equation 4, we therefore want to choose y_{ll'}^soft such that S_{ll'} = arg min ℓ(·, y_{ll'}^soft). This is fulfilled for y_{ll'}^soft = σ(S_{ll'}) when ℓ is the binary cross-entropy, where σ denotes the logistic function. If ℓ is the squared error, then the solution is even simpler, with y_{ll'}^soft = S_{ll'}. For simplicity, and because of good empirical performance, we choose y_{ll'}^soft = S_{ll'} even when training with the cross-entropy loss. This results in the following, extended version of the self-proximity postulate:

Postulate 1 (Soft Labels for Label Features). Given a label l with features z_l ∈ X, and a proxy S for the semantic similarity of labels, the label features, when interpreted as an input instance, should result in predictions P[Y_l' = 1 | X = z_l] ≈ S_{ll'}.   (5)

Label Correlation Graph. The label-similarity measure of equation 3, used in the original GLaS regularizer, relies only on direct co-occurrences of labels, which results in a noisy signal that does not capture higher-order label interdependencies. Therefore, we propose to replace it with the label correlation graph (LCG) as constructed in ECLARE.
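A small worked example of how the symmetrized conditionals S can be computed from a binary ground-truth matrix. The toy matrix Y and its values are hypothetical, chosen only to make the arithmetic visible.

```python
import numpy as np

# Toy ground-truth matrix Y (hypothetical): N=4 instances, L=3 labels,
# Y[i, l] = 1 iff label l annotates instance i.
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 0]], dtype=float)

counts = Y.T @ Y                  # counts[l, l'] = #instances carrying both labels
freq = np.diag(counts)            # label frequencies
P_cond = counts / freq[None, :]   # P[Y_l = 1 | Y_l' = 1] (column-normalized)
S = 0.5 * (P_cond + P_cond.T)     # symmetrized conditional probabilities

# Soft-label target when label 0's features z_0 are used as a data point:
y_soft = S[0]                     # y_soft[l'] = S_{0,l'}; note S[0, 0] = 1
```

Note that the diagonal of S is identically 1, which recovers the self-annotation property as a special case of Postulate 1.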
The LCG ∈ R^{L×L} is inferred by performing a random walk (with restarts) over a bipartite graph between input data instances and their corresponding ground-truth labels. Since entries in the LCG are normalized and skewed in favor of tail labels, the LCG can be interpreted as a smoothed and regularized variant of the label co-occurrence matrix. More intuitively, (Mittal et al., 2021b) show that the LCG correctly identifies a set of semantically similar labels that either share tokens with the queried label or co-occur frequently in the same context (for details, see Fig. 4 in Appendix A), thus making it a good candidate for a label-similarity measure. While ECLARE uses the LCG to efficiently mine higher-order query tail-label relations by augmenting the classifier Ψ with graph information, we propose to leverage the graph weights (with an additional row-wise normalization to get values in the range [0, 1]) as probabilistic soft labels for z_l as a data instance. Further, to restrict the impact of noisy correlations in large output spaces (Babbar & Schölkopf, 2019), we empirically find it beneficial to threshold the soft labels obtained from the LCG at δ = 0.1 (uniformly for all datasets). The algorithmic procedure of the data augmentation via Gandalf is shown below:

Algorithm 1: Gandalf Augmentation

    # j: label index, Z: label feature token matrix
    def gandalf(j, Z, LCG, delta=0.1):
        x = Z[j]
        y = LCG[j, :] / LCG[j, j]         # row-normalize LCG to obtain values in [0, 1]
        y = numpy.where(y > delta, y, 0)  # threshold noisy correlations
        return (x, y)

Capturing Label-Label Correlations. The models benefit from Gandalf in two ways: (i) from Fig. 3 it is evident that Φ(z_l) does not exist in the vicinity of Φ(x_i), for l ∈ y_i, for either head or tail labels.
Thus, Gandalf essentially expands the dataset by adding label features as data points which are far from the training instances in D, and (ii) as labels are product names or document titles themselves, the new data points created through Gandalf essentially capture the a priori statistical correlations between products/documents that exist in the label space. As a result, the encoded representations of correlated labels, learnt by an underlying algorithm, are closer in the representation space. This especially benefits the tail labels which, more often than not, either get missed during shortlisting or rank outside the desired top-k predictions. As shown in the experimental results (Table 2), the data points generated by Gandalf indeed lead to significant improvements for a suite of existing algorithms. It may be noted that, apart from the LCG, other sources of modeling correlations, such as those capturing global and local label correlations or a combination thereof, are equally applicable (Huang & Zhou, 2012; Zhu et al., 2017).
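For illustration, a highly simplified construction of an LCG-like matrix via truncated random walks with restart, chained with Gandalf's row-normalization and thresholding, might look as follows. This is a sketch only: ECLARE's actual LCG construction differs in its normalization and walk details, and the toy matrices here are hypothetical.

```python
import numpy as np

def build_lcg(Y, restart=0.8, steps=3):
    """Toy label-correlation graph via truncated random walks with restart
    over the instance-label bipartite graph (simplified; not ECLARE's
    exact construction). Y is the N x L binary ground-truth matrix."""
    P_il = Y / np.maximum(Y.sum(axis=1, keepdims=True), 1)        # instance -> label
    P_li = (Y / np.maximum(Y.sum(axis=0, keepdims=True), 1)).T    # label -> instance
    hop = P_li @ P_il                  # one label -> label step via a shared instance
    L = Y.shape[1]
    lcg, walk = np.eye(L), np.eye(L)
    for _ in range(steps):
        walk = (1.0 - restart) * walk @ hop
        lcg = lcg + walk               # restart mass stays on the diagonal
    return lcg

def gandalf(j, Z, lcg, delta=0.1):
    """Gandalf augmentation: label features as input, thresholded
    row-normalized LCG row as the soft-label target."""
    x = Z[j]
    y = lcg[j] / lcg[j, j]             # row-normalize to (approximately) [0, 1]
    return x, np.where(y > delta, y, 0.0)

Y = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]], dtype=float)
Z = np.arange(9).reshape(3, 3)         # stand-in token matrix, one row per label
lcg = build_lcg(Y)
x, y = gandalf(0, Z, lcg)
```

Because restart mass accumulates on the diagonal, the row-normalized entry for the label itself is exactly 1, while weakly correlated labels fall below δ and are zeroed out.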

4. LABELMIX: QUERY-LABEL INTERPOLATION

Since the introduction of mixup for images (Zhang et al., 2018), approaches adapted to textual data (Guo et al., 2020; Chen et al., 2020) have also been proposed. Similar to Verma et al. (2019), these approaches propose to mix (interpolate) the intermediate representations after t layers, {ϕ_t(x_i), ϕ_t(x_j)}, of the encoder Φ(x) = φ^t(ϕ_t(x)), along with the corresponding label vectors, as:

ϕ̃_t(x_i, x_j) := λ ϕ_t(x_i) + (1 - λ) ϕ_t(x_j);  ỹ := λ y_i + (1 - λ) y_j,   (6)

where the mixing parameter λ ∈ [0, 1] is sampled from Beta(α, α). The mixed latent representation ϕ̃_t is propagated through the rest of the encoder layers, and the loss is calculated using the mixed label vector as ℓ(⟨φ^t(ϕ̃_t), Ψ⟩, ỹ). However, we observe that while such a formulation of mixup does reduce overfitting by acting as a regularizer, it does not improve prediction performance on unseen data (refer to the Mixup curves in Fig. 1). These observations are in line with (Chou et al., 2020), who argue that this formulation of ỹ does not make sense in the imbalanced data regime, and hence propose to create the mixed label vector so as to favor the minority class. In this section, we thus propose a new mixup technique, LabelMix, as a strong data augmentation baseline for XMC, which favors tail labels and is more suitable for highly imbalanced problems as encountered in XMC. Mixup techniques draw inspiration from vicinal risk minimization (VRM) (Chapelle et al., 2000). In VRM, a model is not trained to minimize the risk over the empirical distribution dP_D(x, y) = (1/n) Σ_{i=1}^n δ_{x_i}(x) δ_{y_i}(y), but instead over a smoothed-out version P_v which also comprises the vicinity of x. The key task is then to determine what constitutes the vicinity of a data point.
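The interpolation step of equation 6 can be sketched in a few lines, assuming the intermediate representations ϕ_t(x_i), ϕ_t(x_j) are given as vectors (the function name and shapes are illustrative only):

```python
import numpy as np

def text_mixup(h_i, h_j, y_i, y_j, alpha=0.4, rng=None):
    """Manifold-style text mixup (Eq. 6): interpolate intermediate encoder
    states phi_t(x_i), phi_t(x_j) and the corresponding label vectors."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)       # mixing coefficient in [0, 1]
    h_mix = lam * h_i + (1.0 - lam) * h_j
    y_mix = lam * y_i + (1.0 - lam) * y_j
    return h_mix, y_mix
```

In training, h_mix would be propagated through the remaining encoder layers and scored against all label embeddings, with y_mix as the target.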

Query-Label Interpolation

In recommendation problems formulated as short-text XMC tasks, prior works have focused on reducing the distance between Φ(x_i) and Φ(z_l) ∀ l ∈ y_i in order to ensure a high recall rate during the retrieval step and high efficiency while ranking the relevant labels (Mittal et al., 2021a;b; Saini et al., 2021). Thus, for the short-text XMC task at hand, we require the model to be invariant under a novel mixup transformation that relates more closely to the aforementioned recommendation objective. Since Φ(z_l) is already expected to be in the vicinity of Φ(x_i), and exhibits such behaviour in a trained classifier (Fig. 3), the VRM perspective motivates mixing the encoded representation of a data point with one of its annotating label features, as opposed to another data point as in standard mixup formulations. We therefore propose a new definition of vicinity: given a data point (x_i, y_i) ∈ D, its vicinity is given by V(x_i) := { ϕ̃_t(x_i, z_l) : l ∈ y_i }.

Sampling Label for Mixup. In imbalanced data regimes, tail labels often have very few data points, and it thus makes more sense to sample these labels more often. We use an instance-independent weight vector r ∈ R^L (specifically, label frequency raised to the power 0.5 (Mikolov et al., 2013)); the probability of choosing z_l for interpolation from y_i is given by y_i ⊙ r / ⟨y_i, r⟩, where the denominator ensures summation to unity. While Dahiya et al. (2021a) employ a siamese contrastive loss between Φ(x_i) and Φ(z_l) s.t. l ∈ y_i in order to bring these closer in the latent space, we posit that an interpolation between these encoded representations in the latent space should result in an invariance, i.e., keep the annotating labels unchanged. Intuitively, since the encoded representation of a data point is being mixed with that of one of its labels' text, this should result in a Label-Affirming Invariance.
More formally, we propose a novel postulate for query-label interpolation in a shared embedding space:

Postulate 2 (Label-Affirming Invariance). Let (x, y) be a training data point in D, and l ∈ y a label relevant to x. Then the classifier should be invariant under mixup with z_l in the latent space:

top_k(⟨Φ(x), Ψ⟩) = top_k(⟨φ^t(ϕ̃_t(x, z_l)), Ψ⟩) = y,  k = |y|.   (7)

Modifying Eqn. 6 using Postulate 2 for a data point (x, y), we arrive at:

ϕ̃_t(x, z_l) = λ ϕ_t(x) + (1 - λ) ϕ_t(z_l);  ỹ = y.   (8)

However, we find it empirically beneficial (ref. Tab. 3) to also accommodate the label vector of z_l as proposed in Postulate 1. This gives us LabelMix:

ϕ̃_t(x, z_l) = λ ϕ_t(x) + (1 - λ) ϕ_t(z_l);  ỹ = min(1, y + y_l^soft).

Figure 3: To obtain this plot, we take 50,000 product titles from the LF-AmazonTitles-131K dataset and evaluate the average cosine similarity between Φ(x_i) and (i) Φ(z_l), where z_l is a label feature of one of the annotating labels of x_i, and (ii) Φ(x_j), where x_i and x_j are "co-documents", i.e., share a label. Evidently, Φ(x_i) is already closer to Φ(z_l) in the embedding space than to Φ(x_j), and this correlation increases by using the proposed augmentations.

LabelMix also performs much better than standard mixup techniques (see Fig. 1). As an algorithmic contribution, we extend the INCEPTIONXML encoder to leverage label features in order to further the state-of-the-art on benchmark datasets, and call it INCEPTIONXML-LF. For this, we augment the OvA classifier with additional label-text embeddings (LTE) and graph-augmented label embeddings (GALE), as done in (Mittal et al., 2021b). The implementation details and training strategy can be found in Appendix B. We measure the models' performance using the standard metrics precision@k, denoted P@k, and its propensity-scored version PSP@k (Jain et al., 2016).
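Putting the tail-favoring label sampling and the LabelMix target together, the augmentation can be sketched as follows. The arrays H_label and Y_soft are stand-ins for the label-feature representations ϕ_t(z_l) and the Gandalf soft-label rows; this is an illustrative sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def labelmix(h_x, y, H_label, Y_soft, label_freq, alpha=0.4, rng=None):
    """LabelMix sketch: sample an annotating label l (weighted by
    label_freq ** 0.5), interpolate phi_t(x) with phi_t(z_l), and set the
    target to min(1, y + y_soft_l)."""
    if rng is None:
        rng = np.random.default_rng()
    r = label_freq ** 0.5
    p = (y * r) / np.dot(y, r)             # probability of choosing each l in y
    l = rng.choice(len(y), p=p)
    lam = rng.beta(alpha, alpha)
    h_mix = lam * h_x + (1.0 - lam) * H_label[l]
    y_mix = np.minimum(1.0, y + Y_soft[l])
    return h_mix, y_mix
```

Unlike standard mixup, the target never down-weights the original positives: it only adds the soft labels of the mixed-in label, which is what encodes the Label-Affirming Invariance.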

5.1. MAIN RESULTS

We can make some key observations and develop strong insights from Table 2, not only about the short-text XMC problem with label features but also about specific dataset properties. For example, training on data points generated via Gandalf gives remarkable improvements on top of the base versions of existing algorithms, especially on LF-AmazonTitles-131K and LF-WikiSeeAlsoTitles-320K, where most labels have ∼5 training data points on average. In these low-data regimes, Gandalf helps capture correlations which are not inherently captured by most existing models. In contrast, improvements on LF-WikiTitles-500K remain relatively mild, as there is enough data per label for the models to capture these correlations inherently.

Gandalf. With Gandalf, gains of up to 30% can be observed in the case of ASTEC and INCEPTIONXML which, by default, do not leverage label features and yet perform on par with their LF-counterparts, i.e. DECAF, ECLARE and INCEPTIONXML-LF, across all datasets. While architectural modifications help capture higher-order query-label relations and help the model predict unseen labels better, they are computationally expensive: e.g., DECAF (having LTE) takes ∼2× the time to train, while ECLARE (having both LTE & GALE) takes ∼3×, compared to the base model ASTEC. Gandalf-augmented base encoders, on the other hand, do not need to make any architectural modifications or employ complicated training pipelines to imbue the necessary invariances.

LabelMix. While being effective in capturing query tail-label correlations, LabelMix can only imbue limited additional inductive bias into the model. ECLARE, on the other hand, is able to better capture these higher-order correlations through its graph-augmented classifier (GALE), and thus gains only trivially from LabelMix.
DECAF gains non-trivially on both the LF-AmazonTitles-131K and LF-WikiSeeAlsoTitles-320K datasets, as it only encodes label-text embeddings (LTE) in its classifier, which leaves scope to further capture query tail-label correlations. Similarly, INCEPTIONXML stands to gain significantly more from LabelMix than its LF-counterpart, which also employs GALE. Notably, LabelMix works much better on INCEPTIONXML(-LF) than on ECLARE because of their dynamic negative mining, which enables the augmentation to work more effectively.

Gandalf vs GALE. ECLARE leverages the LCG to encode label-label correlations in w_l through GALE, which helps the model improve prediction performance on new, unseen labels. However, this only allows the classifier to distribute the loss gradient from a training instance {x_i, y_i} across y_i and correlated labels as per the LCG. This essentially captures higher-order query-label correlations without exploiting label-label correlations in the way Gandalf does. Since the correlations learnt from GALE and Gandalf are independent of each other, we find that ECLARE and INCEPTIONXML-LF, both of which employ GALE, benefit from training on data points generated using Gandalf.

5.2. ABLATION STUDY

We evaluate Gandalf and LabelMix without soft labels (SL) from the LCG in Table 3, where Gandalf w/o SL is essentially equivalent to using label features as data points with the self-annotation property alone. However, that only helps the model learn label-to-word associations, like LTE in DECAF. Notably, soft targets play an important role in enabling the encoder to intrinsically learn the label-label correlations (Table 3) and imbue the necessary inductive bias in the models. For further analysis, we provide visualizations depicting differences in prediction performance obtained with and without our proposed augmentations in Appendix B (Table 5).

Table 3: Results demonstrating the effectiveness of using Gandalf soft labels (denoted SL) and synonym replacement on a single INCEPTIONXML model, on LF-AmazonTitles-131K and LF-WikiSeeAlsoTitles-320K (P@k and PSP@k).

6. OTHER RELATED WORK : DATA AUGMENTATION AND XMC

Architectural design choices are often complemented with data augmentation methodologies, which have been found successful in imbuing the necessary problem-specific invariances in the model, thereby improving its generalization to unseen data. Textual augmentations in the discrete space, such as making spelling errors (Xie et al., 2017), WordNet-based (Miller et al., 1990) synonym replacement (Kolomiyets et al., 2011; Li et al., 2017; Wang et al., 2018), text-fragment switching (Andreas, 2020), and the random insertion, swap and deletion proposed in versions of EDA (Wei & Zou, 2019; Karimi et al., 2021), have been shown to bring some performance improvements (Coulombe, 2018). However, such transformations can lead to semantic inconsistency and illegibility, and thus decrease performance on classification tasks (Qiu et al., 2020; Anaby-Tavor et al., 2020). More recent methods have tried to fill these gaps in semantic consistency: (Zhao et al., 2022) improve upon EDA by casting the requirements of diversity and semantic consistency as a min-max optimization problem. Many methods leverage language models to suggest context-specific replacements for masked tokens, either discretely via a single synonym (Kobayashi, 2018; Wu et al., 2019) or as a weighted sum of word embeddings of semantically similar words (Gao et al., 2019). Even though the above approaches help mitigate semantic inconsistency to some extent, they are not able to preserve the annotating label, especially in low-data regimes (Hu et al., 2019) where a major chunk of XMC data lies. These issues of semantic inconsistency and label distortion can be more pronounced for short-text instances in XMC, i.e., document titles or product names, where each word in the query is highly correlated with the labels. Deletion or insertion of a word in the query could completely alter the search, either generalizing or narrowing it, or result in something with little sense.
For example, changing the search query from "Beats Wireless headphones" to "Beats Wireless headphones with microphone" would lead to a filtered result. Furthermore, similar to label-altering random crops in images (which can be considered as the visual equivalent of word deletion) as pointed out by (Balestriero et al., 2022) , altering the aforementioned query to remove or replace "Beats" with a synonym might lead to a result not having the intended brand in top 10 hits.

7. CONCLUSION

In this paper, we proposed Gandalf, a data augmentation strategy particularly suited for short-text extreme classification. It not only eliminates the need for complicated training procedures to imbue inductive biases, but also dramatically increases the prediction performance of state-of-the-art methods in this domain. Additionally, we developed LabelMix, a baseline data augmentation motivated by previous interpolation-based textual mixup techniques. We expect that our treatment of invariances in this domain will spur further data-centric research on designing other data augmentation methods that can effectively replace architectural additions for leveraging label features, while achieving faster inference times.

A VISUALIZATIONS

The highly sparse nature of the XMC problem makes the LCG noisy. In order to reduce this noise in our soft targets, we threshold the correlation values at δ, and quantify its effect by varying this parameter, as shown in Table 4 (on LF-AmazonTitles-131K and LF-WikiSeeAlsoTitles-320K). Additional visualizations capturing the label correlations and their first-order neighbours are shown in Figure 4. To better illustrate the impact of Gandalf on tail-label prediction, we perform a quantile analysis by distributing the labels into 5 equi-voluminous bins based on label frequency in the training data, as shown in Figure 5. Finally, a qualitative comparison of the correctness of outputs generated by the baseline model and those resulting from the proposed augmentations is shown in Table 5.

Figure 4: Correlations between labels and their first-order neighbours, as found by the LCG on the LF-WikiTitles-500K dataset. The legend shows the label in question; the bar chart shows the degree of correlation with its neighbouring labels. Correlated labels often share tokens with each other and/or may be used in the same context.

It may be noted that queries with even just a single word, like "Oat", which predict unrelated labels in the case of the baseline, get all labels right with the addition of Gandalf. Furthermore, even mispredictions get closer when our data augmentation strategy is introduced.

We make two improvements to the inception module of INCEPTIONXML for better efficiency. First, in the inception module, the activation maps from the first convolution layer are concatenated before being passed to the second convolution layer. To make this more computationally efficient, we replace this "inception-like" setting with a "mixture of experts" setting (Yang et al., 2019).
Specifically, a route function is added that produces dynamic weights for each instance, which are used to perform a dynamic element-wise weighted sum of the activation maps of each filter. Along with the three convolutional experts, we also add an average pool as a down-sampling residual connection to ensure better gradient flow across the encoder.
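The dynamic weighted combination of expert activation maps described above can be sketched as follows. This is a toy sketch: the route function here is a plain softmax over given per-expert logits, whereas the actual model produces these logits from the instance representation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_combine(expert_maps, route_logits):
    """Combine per-expert activation maps with instance-dependent weights
    from a route function (a sketch of the idea, not the exact
    InceptionXML-LF implementation)."""
    w = softmax(route_logits)                       # one weight per expert
    return sum(wi * m for wi, m in zip(w, expert_maps))
```

With equal logits this reduces to a plain average of the expert maps; learned logits let each instance emphasize the most useful expert.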

B INCEPTIONXML-LF

Second, we decouple the second convolution layer, using one each for the meta and extreme classification tasks.

B.2 DYNAMIC HARD NEGATIVE MINING

Training one-vs-all (OvA) label classifiers becomes infeasible in the XMC setting, where we have hundreds of thousands or even millions of labels. To mitigate this problem, the final prediction or loss calculation is done on a shortlist of size √L comprising only hard-negative labels. This mechanism reduces the complexity of XMC from an intractable O(NDL) to a computationally feasible O(ND√L). INCEPTIONXML-LF inherits the synchronized hard-negative mining framework used in INCEPTIONXML. Specifically, the encoded meta-representation is passed through the meta-classifier, which predicts the top-K relevant label clusters per input query. All labels present in the top-K shortlisted label clusters then form the hard-negative label shortlist for the extreme task. This allows progressively harder labels to be shortlisted per short-text query as training proceeds and the encoder learns better representations.
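A minimal sketch of the shortlisting step, assuming dense meta-classifier scores for a single query and a precomputed cluster-to-label mapping; the function and variable names are illustrative:

```python
import numpy as np

def hard_negative_shortlist(meta_scores, cluster_to_labels, k):
    """Shortlist labels from the top-k clusters predicted by the meta-classifier.

    meta_scores: (n_clusters,) meta-classifier scores for one query.
    cluster_to_labels: list mapping cluster id -> array of member label ids.
    Returns the label ids forming the hard-negative shortlist (size ~ sqrt(L)
    when cluster sizes and k are chosen accordingly).
    """
    # argpartition gives the k highest-scoring clusters without a full sort.
    topk = np.argpartition(-meta_scores, k)[:k]
    return np.concatenate([cluster_to_labels[c] for c in topk])
```

Only the shortlisted labels participate in the extreme-task loss, which is where the O(ND√L) cost comes from.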



Notation: bold symbols y indicate vectors, capital letters Y indicate random variables, and sans-serif y denotes a set.



Figure 1: Effect of different data augmentations on INCEPTIONXML-LF over the LF-AmazonTitles-131K dataset. (a) shows that a significant generalization gap exists between train and test P@1. However, remarkable improvements can be noted in (b) and (c) as a result of using the proposed data augmentation Gandalf. While text mixup (Chen et al., 2020) provides a regularization effect and is effective in reducing overfitting, our proposed alternative LabelMix baseline performs much better.

Our three-fold contributions include:
• As our primary contribution, we propose Gandalf (GrAph iNduced Data Augmentation based on Label Features), a simple data augmentation algorithm to efficiently leverage label features as valid training instances in XMC. Augmenting training data via Gandalf facilitates the core objective of short-text XMC by enabling the model to effectively capture label-label correlations in the latent space, without the need for architectural modifications.
• Empirically, we demonstrate the generality and effectiveness of Gandalf by showing up to 30% relative improvements for 5 state-of-the-art extreme classifiers across 4 public benchmark datasets.

Figure 2: A pictorial representation of the proposed Gandalf and LabelMix strategies, formed as per Alg : 1 and Eqn. 9. The title of each plot denotes the data point, the y-axis its labels, and the x-axis their target values. We demonstrate our augmentations on the data point Of the Rings of Power and the Third Age, which is the final book in the Lord of the Rings (LOTR) series, along with the labels The Hobbit and The Lord of the Rings. Notably, the labels found through soft targets via the LCG are all related to the LOTR universe: J. R. R. Tolkien is the author, The Quest of Erebor is a central plot line, and Celebrimbor and Gandalf are major characters. Beyond this, the soft targets also cover generic labels like 1954/55 in Literature, which is the correct timeline for the book's release.
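The augmentation illustrated in the caption above can be sketched as follows, under the assumption that each label feature becomes a new training instance whose soft targets are the corresponding row of the LCG; the function name and return format are hypothetical:

```python
import numpy as np

def gandalf_augment(label_features, G):
    """Turn label features into additional training instances.

    label_features: list of L label-feature texts (e.g. "The Hobbit").
    G: (L, L) label correlation graph; row l provides the soft targets
    for the instance built from label l's feature text.
    Returns (texts, soft_targets) to be appended to the training set.
    """
    texts = list(label_features)
    soft_targets = np.stack([G[l] for l in range(len(texts))])
    return texts, soft_targets
```

The augmented pairs are then mixed into the original training distribution and consumed by any encoder unchanged, which is what makes the approach model-agnostic.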

Fig : 1), but also serves as a strong data augmentation baseline for short-text XMC, as shown in Tab : 2.

5 EXPERIMENTS & DISCUSSION

Benchmarks, Baselines & Metrics: We benchmark our experiments on 4 standard public datasets, the details of which are mentioned in Tab : 1. To test the generality and effectiveness of our proposed Gandalf, we apply the augmentation across multiple state-of-the-art short-text extreme classifiers: (i) ASTEC, (ii) DECAF, (iii) ECLARE, and (iv) INCEPTIONXML. Additionally, we also compare against the transformer-encoder based XR-Transformer (Zhang et al., 2021) and SiameseXML++. To compare Gandalf with conventional data augmentation approaches, we test it against LabelMix, which serves as a strong mixup-based data augmentation baseline more suited for short-text XMC.

Figure 6: INCEPTIONXML-LF. The improved Inception Module along with instance attention is shown in detail. Changes to the INCEPTIONXML framework using the ECLARE classifier are also shown.

Table 1: Details of short-text benchmark datasets with label features. APpL stands for avg. points per label, ALpP for avg. labels per point, and AWpP for avg. words per point, i.e., the input length.

Table 2: Results showing the effectiveness and generality of Gandalf on state-of-the-art extreme classifiers.

Table 4: Results demonstrating the sensitivity of Gandalf with respect to δ, as defined in Algorithm 1. All experiments were performed on the InceptionXML-LF model, augmented with Gandalf. As shown, the empirical performance peaks at a δ value of 0.1, which is sufficient to suppress the impact of noisy correlations.
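The δ-thresholding studied in this table can be sketched as below. The per-row renormalization after thresholding is our assumption, included only to keep the soft targets on a comparable scale; the paper specifies just the thresholding itself:

```python
import numpy as np

def threshold_lcg(G, delta=0.1):
    """Zero out noisy correlations below delta in the label correlation
    graph, then renormalize each surviving row (assumption) so that the
    soft targets remain comparable across labels.

    G: (L, L) dense correlation matrix (a sparse matrix in practice).
    """
    G = np.where(G >= delta, G, 0.0)
    row_sums = G.sum(axis=1, keepdims=True)
    # Rows whose correlations were all below delta stay all-zero.
    return np.divide(G, row_sums, out=np.zeros_like(G), where=row_sums > 0)
```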

Table 5: Prediction examples of different datapoints from the LF-WikiSeeAlsoTitles-320K dataset.


Figure 5: Analysis demonstrating the effectiveness of Gandalf in improving performance over tail labels. (a) Contributions to P@5 on LF-AmazonTitles-131K; (b) Contributions to P@5 on LF-WikiSeeAlsoTitles-320K. For this graph, labels were divided into 5 equi-voluminous bins in increasing order of frequency. The graph shows the contribution of each bin to P@5 on different datasets and short-text extreme classifiers.

B.3 LABEL-TEXT AND LCG AUGMENTED CLASSIFIERS

INCEPTIONXML-LF's extreme classifier weight vectors W_e comprise three components, as in Mittal et al. (2021b). Specifically, the weight vectors are an attention-based sum of (i) label-text embeddings, created through Φ_l, (ii) graph-augmented label embeddings, created through the graph encoder Φ_g, and (iii) randomly initialized per-label free weights w_l. As shown in Fig. 6, we first obtain label-text embeddings as z^1_l = E · z^0_l, where z^0_l are the TF-IDF weights of the label feature corresponding to label l. Next, we use the label correlation graph G to create the graph-weighted label-text embeddings z^2_l = Σ_{m∈[L]} G_{lm} · z^0_m to capture higher-order query-tail label correlations. z^1_l and z^2_l are then passed into the frugal encoders Φ_l and Φ_g respectively. These encoders comprise only a residual connection across a fully connected layer, i.e., Φ(ẑ_l) = α · R(g(ẑ_l)) + β · ẑ_l, where ẑ_l ∈ {z^1_l, z^2_l}, g denotes the GELU activation, and α and β are learned weights. Finally, the per-label weight vectors for the extreme task are obtained as

W^e_l = A(Φ_l(z^1_l), Φ_g(z^2_l), w_l) = α_1 · Φ_l(z^1_l) + α_2 · Φ_g(z^2_l) + α_3 · w_l,

where A is the attention block and α_{1,2,3} are the dynamic attention weights produced by the attention block.
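The attention-based combination above can be sketched as follows. The parameterization of the attention block A as a single projection vector is a hypothetical simplification; the actual block may be richer:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def extreme_classifier_weight(phi_l_z1, phi_g_z2, w_free, W_att):
    """Attention-weighted sum of the three per-label components:
    the encoded label-text embedding Phi_l(z1_l), the encoded
    graph-augmented embedding Phi_g(z2_l), and the free weight w_l.
    W_att: (dim,) projection of the attention block (hypothetical)."""
    comps = np.stack([phi_l_z1, phi_g_z2, w_free])   # (3, dim)
    alphas = softmax(comps @ W_att)                  # (3,) dynamic weights
    return alphas @ comps                            # (dim,)
```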

B.4 TWO-PHASED TRAINING

Motivation: We find a mismatch in the training objectives of DeepXML-based approaches like ASTEC, DECAF and ECLARE, which first train their word embeddings on meta-labels in Phase I and then transfer these learnt embeddings for classification over extreme fine-grained labels in Phase III (Dahiya et al., 2021b). Thus, in our two-phased training for INCEPTIONXML-LF, we keep the training objective the same for both phases. Note that in INCEPTIONXML-LF the word embeddings are always learnt on labels instead of meta-labels or label clusters, and we only augment our extreme classifier weight vectors W_e with label-text embeddings and LCG-weighted label embeddings. We keep the meta-classifier W_m as a standard randomly initialized classification layer.

Phase I: In the first phase, we initialize the embedding layer E with pre-trained GloVe embeddings (Pennington et al., 2014), the residual layer R in Φ_l and Φ_g is initialized to identity, and the rest of the model, comprising Φ_q, W_m and A, is randomly initialized. The model is then trained end-to-end, but without using the free weight vectors w_l in the extreme classifier W_e. This setup implies that W_e consists only of weights tied to E through Φ_l and Φ_g, which allows for efficient joint learning of query-label word embeddings (Mittal et al., 2021a) in the absence of free weight vectors. Model training in this phase follows the INCEPTIONXML+ pipeline described in Kharbanda et al. (2021), without detaching any gradients to the extreme classifier for the first few epochs. In this phase, the final per-label score is given by: P_l = A(Φ_l(z^1_l), Φ_g(z^2_l)) · Φ_q(x).

Phase II: In this phase, we first refine our clusters based on the jointly learnt word embeddings. Specifically, we recluster the labels using the dense z^1_l representations instead of their sparse PIFA representations (Chang et al., 2020), and consequently reinitialize W_m. We repeat the Phase I training, but this time the formulation of W_e also includes w_l, which are initialised with the updated z^1_l. Here, the final per-label score is given by: P_l = A(Φ_l(z^1_l), Φ_g(z^2_l), w_l) · Φ_q(x).
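The difference between the two phases' scoring rules can be summarized in a short sketch, with the attention block passed in as a generic callable (a simplification of A):

```python
import numpy as np

def score_phase1(phi_l_z1, phi_g_z2, phi_q_x, attention):
    # Phase I: the classifier weight is tied entirely to the embedding
    # layer; no free per-label weight vectors are used.
    return attention([phi_l_z1, phi_g_z2]) @ phi_q_x

def score_phase2(phi_l_z1, phi_g_z2, w_l, phi_q_x, attention):
    # Phase II: the free vectors w_l (initialised from the updated z1_l)
    # are included as a third component of the classifier weight.
    return attention([phi_l_z1, phi_g_z2, w_l]) @ phi_q_x
```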

