GANDALF: DATA AUGMENTATION IS ALL YOU NEED FOR EXTREME CLASSIFICATION

Abstract

Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign to an input a subset of the most relevant labels from millions of label choices. Recent works in this domain have increasingly focused on the problem setting with short-text input data and labels endowed with short textual descriptions, called label features. Short-text XMC with label features has found numerous applications in areas such as prediction of related searches, title-based product recommendation, and bid-phrase suggestion, amongst others. In this paper, we propose Gandalf, a graph-induced data augmentation based on label features, such that the generated data points can supplement the training distribution. By exploiting the characteristics of the short-text XMC problem, it leverages the label features to construct valid training instances, and uses the label graph to generate the corresponding soft-label targets, hence effectively capturing label-label correlations. While most recent advances in XMC (such as SIAMESEXML and ECLARE) have been algorithmic, mainly aimed at developing novel deep-learning architectures, our data-centric augmentation approach is orthogonal to these methodologies. We demonstrate the generality and effectiveness of Gandalf by showing up to 30% relative improvements for 5 state-of-the-art algorithms across 4 benchmark datasets consisting of up to 1.3 million labels.

1. INTRODUCTION

Extreme Multi-label Classification (XMC) has found multiple applications in the domains of related searches (Jain et al., 2019), product recommendation (Medini et al., 2019), dynamic search advertising (Prabhu et al., 2018), etc., which require predicting the most relevant results that either frequently co-occur or are highly correlated with a given product instance or search query. In the XMC setting, these problems are often modelled through embedding-based retrieval-cum-ranking pipelines over millions of possible web pages/products/ad-phrases considered as labels.

Nature of short-text XMC and extreme class imbalance. Typically, in the tasks of related-search prediction, bid-phrase suggestion, and title-based related-product recommendation, the input data instance is a short-text query. These short-text instances (names or titles) consist, on average, of only 3-8 words. To model these scenarios effectively, there has been an increasing focus on building encoders as part of deep learning pipelines that can capture the nuances of such short-text inputs (Dahiya et al., 2021b; Kharbanda et al., 2021). Real-world datasets in XMC are highly imbalanced towards popular or trending ad-phrases/products. Moreover, these datasets adhere to Zipf's law (Ye et al., 2020), i.e., most labels in these extremely large output spaces are tail labels, having very few (< 5) instances in a training set spanning hundreds of thousands of data points (Table 1, Appendix). While training data is already insufficient, the short-text nature of the training instances makes it even more challenging for models to learn meaningful, non-overfitting encoded representations for tail words and labels.
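To make the tail-label statistic above concrete, the following is a small, self-contained sketch of how one might measure the fraction of tail labels (fewer than 5 training instances) in a multi-label dataset; the function name and toy data are our own illustrations, not part of any benchmark:

```python
from collections import Counter

def tail_label_fraction(label_lists, threshold=5):
    """Fraction of distinct labels occurring in fewer than `threshold` instances."""
    freq = Counter(label for labels in label_lists for label in labels)
    tail = sum(1 for count in freq.values() if count < threshold)
    return tail / len(freq)

# Toy multi-label dataset: each entry is the label set of one short-text instance.
data = [
    ["shoes", "running"],
    ["shoes", "nike"],
    ["shoes"],
    ["laptop"],
    ["shoes", "running"],
    ["shoes"],
]
print(tail_label_fraction(data))  # → 0.75: every label except "shoes" is a tail label
```

On real XMC benchmarks this fraction is computed over hundreds of thousands of instances, where the Zipfian skew makes the tail dominate the label space.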
Frugal architectures and label features. Due to the low-latency requirements of XMC applications, most recent works also focus on building lightweight and frugal architectures that can predict in milliseconds and scale up to millions of labels (Dahiya et al., 2021a). Despite being frugal in terms of the number of layers/parameters in the network, these models are capable of fitting the training data well, although their generalization to test samples remains poor (Fig. 1a). Hence, creating deeper models for better representation learning is perhaps not optimal in this setting. Recent works, however, make expensive architectural adjustments (Mittal et al., 2021a) to leverage the text associated with labels ("label features", discussed in §2) in order to improve generalization.

1.1. RELATED WORK: XMC WITH LABEL FEATURES

Earlier works in XMC primarily focused on problems with long-text documents consisting of hundreds of words/tokens, such as those encountered in tagging Wikipedia articles (Babbar & Schölkopf, 2017; You et al., 2019). On the output side, labels were identified by numeric IDs and hence devoid of any semantic meaning. Most works under this setting aim to scale up transformers as encoders for XMC tasks (Chang et al., 2020; Zhang et al., 2021). By associating labels with their corresponding texts, which are, in turn, product titles, document names or bid-phrases themselves, contemporary applications of XMC have gone beyond standard document tagging. With label features, there exist three correlations that can be exploited for better representation learning: (i) query-label, (ii) query-query, and (iii) label-label correlations. Recent works have been successful in leveraging label features and pushing the state-of-the-art by exploiting the first two correlations. For example, SIAMESEXML (Dahiya et al., 2021a) employs a siamese pre-training stage based on a contrastive learning objective between a data point and its label features, optimizing a negative log-likelihood loss. GALAXC (Saini et al., 2021) employs a graph convolutional network over a combined query-label bipartite graph. DECAF and ECLARE (Mittal et al., 2021a;b) make architectural additions to exploit higher-order query-label correlations by extending the DeepXML pipeline to accommodate extra ASTEC-like encoders (Dahiya et al., 2021b). In contrast to the recent algorithmic developments for short-text XMC with label features, and following the work of Banko & Brill (2001), which posits the higher relevance of developing more training data compared to the choice of classifier in small-data regimes, we take a data-centric approach and focus on developing data augmentation techniques for short-text XMC.
We show that by using Gandalf, methods which do not inherently leverage label features can beat strong baselines that do.



Figure 1: Effect of different data augmentations on INCEPTIONXML-LF on the LF-AmazonTitles-131K dataset. (a) shows that a significant generalization gap exists between train and test P@1. However, remarkable improvements can be noted in (b) and (c) as a result of using the proposed data augmentation, Gandalf. While text mixup (Chen et al., 2020) provides a regularization effect and is effective in reducing overfitting, our proposed LabelMix baseline performs much better.
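For reference, the text-mixup baseline compared in the figure above interpolates input representations and label vectors of two training instances; the following is a minimal, hypothetical sketch, where the Beta-distributed mixing weight follows common mixup practice rather than necessarily the exact setup of Chen et al. (2020):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mix two (embedding, label-vector) pairs with a Beta(alpha, alpha) weight."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Mixing a pair of toy 2-d embeddings with their one-hot label vectors:
x, y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

Each mixed pair stays on the line segment between the two originals, which is what gives mixup its regularizing effect on an overfitting-prone short-text encoder.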

We make the following three-fold contributions:
• As our primary contribution, we propose Gandalf (GrAph iNduced Data Augmentation based on Label Features), a simple data augmentation algorithm to efficiently leverage label features as valid training instances in XMC. Augmenting training data via Gandalf facilitates the core objective of short-text XMC by enabling the model to effectively capture label-label correlations in the latent space, without the need for architectural modifications.
• Empirically, we demonstrate the generality and effectiveness of Gandalf by showing up to 30% relative improvements for 5 state-of-the-art extreme classifiers across 4 public benchmark datasets.
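Based on the description above, a graph-induced augmentation of this kind can be sketched as follows. This is our own minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and we assume the label graph is a co-occurrence graph over ground-truth label sets, with soft targets given by normalized neighbour counts.

```python
from collections import defaultdict
from itertools import combinations

def build_label_graph(label_sets):
    """Symmetric label co-occurrence counts over the training label sets."""
    graph = defaultdict(lambda: defaultdict(int))
    for labels in label_sets:
        for a, b in combinations(sorted(set(labels)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def gandalf_augment(label_features, label_sets):
    """Emit one (label_feature_text, soft_targets) pair per label: the label
    itself gets weight 1.0, and each graph neighbour gets its co-occurrence
    count normalised over the label's neighbourhood."""
    graph = build_label_graph(label_sets)
    augmented = []
    for label, text in label_features.items():
        neighbours = graph[label]
        total = sum(neighbours.values())
        targets = {label: 1.0}
        for nbr, cnt in neighbours.items():
            targets[nbr] = cnt / total  # total > 0 whenever neighbours exist
        augmented.append((text, targets))
    return augmented

# Toy example: three labels with short textual features, used as new instances.
features = {"a": "apple pie", "b": "banana bread", "c": "cherry tart"}
ground_truth = [["a", "b"], ["a", "c"], ["a", "b"]]
augmented = gandalf_augment(features, ground_truth)
```

The augmented pairs are simply appended to the original training set, which is why the approach requires no change to the underlying classifier architecture.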

