GANDALF: DATA AUGMENTATION IS ALL YOU NEED FOR EXTREME CLASSIFICATION

Abstract

Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign an input a subset of the most relevant labels from millions of label choices. Recent works in this domain have increasingly focused on the problem setting with short-text input data, and labels endowed with short textual descriptions called label features. Short-text XMC with label features has found numerous applications in areas such as prediction of related searches, title-based product recommendation, and bid-phrase suggestion, amongst others. In this paper, we propose Gandalf, a graph-induced data augmentation based on label features, such that the generated data points can supplement the training distribution. By exploiting the characteristics of the short-text XMC problem, it leverages the label features to construct valid training instances, and uses the label graph to generate the corresponding soft-label targets, thus effectively capturing label-label correlations. While most recent advances in XMC (such as SIAMESEXML and ECLARE) have been algorithmic, mainly aimed at developing novel deep-learning architectures, our data-centric augmentation approach is orthogonal to these methodologies. We demonstrate the generality and effectiveness of Gandalf by showing up to 30% relative improvements for 5 state-of-the-art algorithms across 4 benchmark datasets consisting of up to 1.3 million labels.
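To make the augmentation recipe in the abstract concrete, the following is a minimal sketch (not the authors' implementation; all function and variable names are hypothetical). It builds a label-label co-occurrence graph from the training set, then turns each label's feature text into a new training instance whose soft target spreads probability mass over correlated labels.

```python
import numpy as np
from collections import defaultdict

def gandalf_augment(train_pairs, label_features, num_labels, temperature=1.0):
    """Sketch of Gandalf-style augmentation (hypothetical API).

    train_pairs:     list of (text, set_of_label_ids) training instances
    label_features:  dict mapping label_id -> short label description text
    Returns a list of (text, soft_target) augmented data points,
    one per label that has a feature text.
    """
    # 1) Build a label-label co-occurrence graph from the training data:
    #    edge weight = number of instances in which two labels co-occur.
    cooc = defaultdict(float)
    for _, labels in train_pairs:
        for i in labels:
            for j in labels:
                cooc[(i, j)] += 1.0

    augmented = []
    for l in range(num_labels):
        if l not in label_features:
            continue
        # 2) The label's own feature text becomes a new (short-text) instance.
        text = label_features[l]
        # 3) Soft targets: normalised co-occurrence weights of the label's
        #    graph neighbours, so correlated labels receive probability mass.
        weights = np.zeros(num_labels)
        for j in range(num_labels):
            weights[j] = cooc.get((l, j), 0.0)
        weights[l] = max(weights[l], 1.0)  # a label is always relevant to itself
        target = weights ** (1.0 / temperature)
        target /= target.sum()
        augmented.append((text, target))
    return augmented
```

The augmented pairs can simply be appended to the original training set, which is what makes the approach orthogonal to the choice of encoder architecture.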

1. INTRODUCTION

Extreme Multi-label Classification (XMC) has found multiple applications in the domains of related searches (Jain et al., 2019), product recommendation (Medini et al., 2019), dynamic search advertising (Prabhu et al., 2018), etc., which require predicting the most relevant results that either frequently co-occur or are highly correlated with the given product instance or search query. In the XMC setting, these problems are often modelled through embedding-based retrieval-cum-ranking pipelines over millions of possible web pages/products/ad-phrases considered as labels.

Nature of short-text XMC and extreme class imbalance
Typically, in the tasks of related search prediction, bid-phrase suggestion, and title-based related-product recommendation, the input data instance is in the form of a short-text query. These short-text instances (names or titles), on average, consist of only 3-8 words. In order to effectively model these scenarios, there has been an increasing focus on building encoders as part of deep learning pipelines that can capture the nuances of such short-text inputs (Dahiya et al., 2021b; Kharbanda et al., 2021). Real-world datasets in XMC are highly imbalanced towards popular or trending ad-phrases/products. Moreover, these datasets adhere to Zipf's law (Ye et al., 2020), i.e., most labels in these extremely large output spaces are tail labels, having very few (< 5) instances in a training set spanning hundreds of thousands of data points (Table 1, Appendix). While the training data is already insufficient, the short-text nature of training instances makes it even more challenging for models to learn meaningful, non-overfitting encoded representations for tail words and labels.
Frugal architectures and label features
Due to the low-latency requirements of XMC applications, most recent works also focus on building lightweight and frugal architectures that can predict in milliseconds and scale up to millions of labels (Dahiya et al., 2021a). Despite being frugal in terms of the number of layers/parameters in the network, these models are capable of fitting the training data well, although their generalization to test samples remains poor (Fig. 1a). Hence, creating deeper models for better representation learning is perhaps not optimal in this setting.

