IMPROVING TAIL LABEL PREDICTION FOR EXTREME MULTI-LABEL LEARNING

Abstract

Extreme multi-label learning (XML) aims to annotate objects with relevant labels from an extremely large label set. Many previous methods treat labels uniformly, so the learned model tends to perform better on head labels while its performance on tail labels deteriorates severely. However, in many real-world applications it is often desirable to predict more tail labels. To alleviate this problem, in this work we provide theoretical and experimental evidence for the inferior performance of representative XML methods on tail labels. Our finding is that the norm of label classifier weights typically follows a long-tailed distribution similar to the label frequency, which results in the over-suppression of tail labels. Based on this new finding, we present two new modules: (1) RANKNET learns to re-rank the predictions by optimizing a population-aware loss, which ranks tail labels higher; (2) TAUG augments tail labels via a decoupled learning scheme, which yields a more balanced classification boundary. We conduct experiments on commonly used XML benchmarks with hundreds of thousands of labels, showing that the proposed methods improve the performance of many state-of-the-art XML models by a considerable margin (a 6% gain in PSP@1 on average).

1. INTRODUCTION

Extreme multi-label learning (XML) aims to annotate objects with relevant labels from an extremely large candidate label set. Recently, XML has demonstrated its broad applications. For example, in webpage categorization Partalas et al. (2015), millions of labels (categories) are collected in Wikipedia and one wishes to annotate new webpages with relevant labels from a huge candidate set; in recommender systems McAuley et al. (2015), one hopes to make informative personalized recommendations from millions of items. Because of the high dimensionality of the label space, classic multi-label learning algorithms, such as Zhang & Zhou (2007); Tsoumakas & Vlahavas (2007), become infeasible. To this end, a number of computationally efficient XML approaches have been proposed Weston et al. (2011); Agrawal et al. (2013); Bi & Kwok (2013); Yu et al. (2014); Bhatia et al. (2015); E.-H. Yen et al. (2016); Yeh et al. (2017); Yen et al. (2017); Tagami (2017).

In XML, one important statistical characteristic is that labels follow a long-tailed distribution, as illustrated in Figure 4 (left). Most labels occur only a few times in the dataset. Infrequently occurring labels (referred to as tail labels) possess limited training samples and are harder to predict than frequently occurring ones (referred to as head labels). Many existing XML approaches treat labels with equal importance, such as Prabhu & Varma (2014); Babbar & Schölkopf (2017); Khandagale et al. (2019), while Wei & Li (2018) demonstrates that most predictions of well-established methods are head labels. However, in many real-world applications it is still desirable to predict more tail labels, which are more rewarding and informative, e.g., in recommender systems Jain et al. (2016); Babbar & Schölkopf (2019); Wei & Li (2018); Wei et al. (2019).

To improve the performance on tail labels, existing solutions typically involve optimizing loss functions that are suitable for tail labels Jain et al. (2016); Babbar & Schölkopf (2019), leveraging the sparsity of tail labels in the annotated label matrix Xu et al. (2016), and transferring knowledge from data-rich head labels to data-scarce tail labels K. Dahiya (2019). These methods typically achieve better performance on tail labels than standard XML methods which treat labels equally, while they
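The over-suppression phenomenon underlying these observations can be reproduced in miniature. The sketch below is purely illustrative (the label frequencies, data model, and training settings are our own hypothetical choices, not the paper's setup): it trains one-vs-all logistic classifiers with identical settings on synthetically long-tailed multi-label data and compares the resulting weight norms, which fall off with label frequency.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
freqs = np.array([0.5, 0.2, 0.05, 0.01, 0.005])  # long-tailed label frequencies
L = len(freqs)

# Each label is associated with its own direction (prototype) in feature space.
protos = rng.normal(size=(L, d))
Y = (rng.random((n, L)) < freqs).astype(float)   # n x L label matrix
X = Y @ protos + 0.5 * rng.normal(size=(n, d))   # features: active prototypes + noise
X -= X.mean(axis=0)                              # center features

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One-vs-all logistic regression, same learning rate and step budget for every
# label, with no rebalancing -- mirroring methods that treat labels uniformly.
norms = []
for l in range(L):
    w, b = np.zeros(d), 0.0
    for _ in range(200):
        p = sigmoid(X @ w + b)
        g = p - Y[:, l]
        w -= 0.1 * (X.T @ g) / n
        b -= 0.1 * g.mean()
    norms.append(np.linalg.norm(w))

# Head labels receive far more gradient signal, so their classifier weights
# grow large, while tail-label weights stay close to zero.
print([round(v, 3) for v in norms])
```

Because the tail classifiers see almost no positive gradient signal, their scores are uniformly suppressed at prediction time, which is consistent with the weight-norm finding stated in the abstract.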

