LEARNING TOP-K CLASSIFICATION WITH LABEL RANKING

Abstract

As the number of classes grows, class confusability and the multi-label nature of examples inevitably arise, which makes classification considerably harder. Top-k classification was proposed to mitigate this problem: the classifier is allowed to predict k candidate labels, and the prediction is considered correct as long as the ground truth label is among them. However, existing top-k classification methods neglect the ranking of the ground truth label among the k predicted labels, which has high application value. In this paper, we propose a novel three-stage approach to learn top-k classification with label ranking. We first propose an ensemble-based relabeling method that relabels the training data with k labels, and the relabeled data are used to train the top-k classifier. We then propose a novel top-k classification loss function that aims to improve the ranking of the ground truth label. Finally, we conduct extensive experiments on four text datasets and four image datasets, and the experimental results show that our method significantly improves on the performance of existing methods.

1. INTRODUCTION

Multi-class classification aims to assign each example to one of more than two classes. As the number of classes grows large, e.g., to thousands of classes, training a multi-class classifier becomes extremely challenging because of the multi-label nature of the examples and class confusability (Gupta et al., 2014; Lapin et al., 2015; Chang et al., 2017). To mitigate this problem, the task of top-k classification was proposed (Berrada et al., 2018; Petersen et al., 2022), where the classifier is allowed to predict k label candidates and the prediction is considered correct as long as the ground truth label is included in the k labels. This evaluation measure is commonly referred to as the top-k error (Lapin et al., 2016), i.e., the loss function does not penalize k - 1 mistakes. Although state-of-the-art models trained directly with cross-entropy can also yield remarkable results in terms of top-k error, the training data must be both large and clean (Berrada et al., 2018), which cannot be guaranteed in real scenarios. Moreover, a traditional top-1 loss function such as cross-entropy may over-fit when noisy labels exist (Berrada et al., 2018). Hence, loss functions tailored for top-k error minimization are needed. However, existing top-k classification loss functions (Lapin et al., 2015; Chang et al., 2017; Berrada et al., 2018) only consider whether the ground truth label appears among the predicted k labels and neglect its ranking within the top-k candidates. In fact, ranking is crucial for top-k classification tasks. For example, in a classic human-in-the-loop (Zanzotto, 2019) data annotation scenario, a classification model trained with a small amount of labeled data predicts a label for each unlabeled example, and humans are required to check the predictions and relabel those with low confidence.
However, manually selecting the correct label from a large label set is time-consuming and inefficient. In this case, the model is allowed to predict the k most likely labels so that humans can easily find the ground truth label among them, i.e., top-k classification. Meanwhile, improving the ranking of the ground truth label among the k labels lets humans find it first, which effectively improves the efficiency of human checking. The ranking motivation behind this scenario aligns with applications such as recommendation systems and search engines (Oosterhuis & de Rijke, 2020). Therefore, in this paper, we aim to design a top-k classification method that predicts the k most likely labels such that the ground truth label is not only included but also ranked as high as possible.

Driven by the idea of multi-label classification (MLC) (Tsoumakas & Katakis, 2007), we propose a novel three-stage approach for top-k classification. As shown in Figure 1, in the first stage, we use the existing training and validation sets to train the base classifiers, which are used to predict k labels for each training example in the next stage. Considering that correct k labels are crucial for top-k classification in our problem setting, we use the idea of ensemble learning (Sagi & Rokach, 2018) to train m base classifiers. Each base classifier is trained with the classic cross-entropy loss function on a subset randomly selected from the training and validation sets. In the second stage, we relabel each single-label example as a k-label example consisting of the one ground truth label and the k - 1 most likely other labels. More specifically, we predict m probability distributions p for each sample with the m base classifiers, and then average them to obtain the average probability distribution $p_{avg}$.
Finally, according to $p_{avg}$, we output the k - 1 most likely labels for each training sample besides its ground truth label. We refer to these k - 1 labels as pseudo-labels. In the third stage, based on the transformed k-label examples, we train a multi-label classifier that predicts exactly k labels for a test example. The trained multi-label classifier can then be viewed as a top-k classifier. Importantly, we propose a new top-k loss function with label ranking (TkLR) for training in this stage. For an example with k labels, TkLR aims to maximize the difference between the scores of these k labels and the scores of the other labels. To improve the ranking of the ground truth label, we embed an additional rank loss in TkLR, which aims to maximize the difference between the score of the ground truth label and the scores of the pseudo-labels.

Finally, we conduct extensive experiments on four text datasets and four image datasets with BERT (Kenton & Toutanova, 2019) and Swin Transformer (Liu et al., 2021) as the respective backbone models. We evaluate the results with top-k accuracy and Normalized Discounted Cumulated Gains at top k (NDCG@k) (Wang et al., 2013). The experimental results demonstrate that our method significantly improves on the performance of existing top-k classification methods.

In brief, the main contributions of this paper are summarized as follows:

• We propose to consider the ranking of the ground truth label in top-k classification, which has high application value but has not been well addressed so far.
• We propose a novel three-stage approach to learn top-k classification with label ranking, which can be easily deployed with different classification models.
• We propose an ensemble-based relabeling method to obtain the k most likely labels for each training example, which benefits the final top-k classification.
• We propose a novel top-k loss function that takes the ranking of the ground truth label into account.
• Extensive experiments over different text and image datasets show that our method greatly outperforms existing baselines in terms of top-k accuracy and NDCG@k.

The remainder of this paper is structured as follows. In Section 2, we present the related work. Section 3 describes our approach. We describe our experiments in Section 4. Section 5 concludes the paper.

2. RELATED WORK

In this section, we present related work on top-k classification and multi-label classification.

Top-k classification. Lapin et al. (2015) proposed a top-k multiclass SVM based on a tight convex upper bound of the top-k hinge loss. Chu et al. (2018) then proposed an optimized top-k multiclass SVM algorithm, which employs a semismooth Newton algorithm for the key building block to improve training speed. Chang et al. (2017) proposed a generic, robust multiclass SVM formulation that directly aims at minimizing a weighted and truncated combination of the ordered prediction scores. More recently, Berrada et al. (2018) proposed the Smooth loss function, a top-k classification loss function for deep neural networks that creates a margin between the correct top-k predictions and the incorrect ones. Petersen et al. (2022) proposed a family of differentiable top-k cross-entropy classification losses and relaxed the assumption of a fixed k. However, none of the above methods takes into account the ranking of the ground truth label among the predicted k labels.

Multi-label classification. Boutell et al. (2004) decomposed multi-label classification into several independent binary classification problems. Read et al. (2011) transformed the multi-label classification problem into a chain of binary classification problems. For text datasets, Yang et al. (2018) proposed a sequence-to-sequence multi-label classification model to learn the correlation among labels, and Huang et al. (2021) introduced balancing loss functions for multi-label text classification. For image datasets, Chen et al. (2019) proposed a multi-label classification model based on a Graph Convolutional Network (GCN) to capture and explore label dependencies. Lanchantin et al. (2021) proposed the Classification Transformer (C-Tran), a general framework for multi-label image classification that leverages Transformers to exploit the complex dependencies among visual features and labels.

Although our method is inspired by multi-label classification, it differs in three respects: 1) The input is different. In multi-label classification, the number of ground truth labels per training sample is indeterminate, while each training sample in our method has only one ground truth label. 2) The output is different. The number of labels predicted by multi-label classification is also uncertain, while the number of labels predicted by our method is fixed at k. 3) The evaluation measure is different. In multi-label classification, all predicted labels are required to be ground truth labels, while our method only requires the ground truth label to be included in the predicted k labels.

3. THREE-STAGE APPROACH

Stage 2: Relabeling training data. The purpose of this step is to relabel each single-label training example with its k most likely labels. We consider the ground truth label to be the most likely label, so we only need to find the other k - 1 most likely labels. Specifically, we are given the single-label training set $\{(x_i, y_i)\}_{i=1}^{t}$, where $x_i$ denotes the input features of example i and $y_i$ is its ground truth label. We aim to relabel each example $(x_i, y_i)$ as $(x_i, Y_i)$, where $Y_i \in \mathbb{R}^k$ is composed of the ground truth label $y_i$ and the k - 1 other labels that are most related to $x_i$. To this end, we use the m base classifiers to predict m different probability distributions p over all labels for each sample in the training set, where $p \in \mathbb{R}^c$ and c is the total number of labels. We then average the m probability distributions to obtain the average probability distribution $p_{avg}$.
According to $p_{avg}$, we output the k - 1 labels with the highest probability, excluding the ground truth label, and use them as the pseudo-labels of the sample. At this point, relabeling is complete, and each example in the training set has one ground truth label and k - 1 pseudo-labels. Note that the training set in this stage is used both for deriving a probability distribution over the labels and for obtaining a modified training set.
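The relabeling step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' released code: the function names `average_distributions` and `relabel` and the toy probabilities are hypothetical, and ties between equal averaged probabilities are broken by label index, which the text does not specify.

```python
from typing import List, Sequence


def average_distributions(dists: Sequence[Sequence[float]]) -> List[float]:
    """Average m probability distributions defined over the same c labels."""
    m = len(dists)
    c = len(dists[0])
    return [sum(d[j] for d in dists) / m for j in range(c)]


def relabel(dists: Sequence[Sequence[float]], y: int, k: int) -> List[int]:
    """Return the ground truth label y followed by k - 1 pseudo-labels."""
    p_avg = average_distributions(dists)
    # Rank every label except the ground truth by its averaged probability.
    # sort() is stable, so equal probabilities keep ascending label order
    # (an assumption; the text does not say how ties are handled).
    candidates = [j for j in range(len(p_avg)) if j != y]
    candidates.sort(key=lambda j: p_avg[j], reverse=True)
    return [y] + candidates[: k - 1]


# Toy example: m = 3 base classifiers, c = 5 labels, ground truth y = 2, k = 3.
ensemble_outputs = [
    [0.10, 0.40, 0.30, 0.15, 0.05],
    [0.05, 0.35, 0.40, 0.10, 0.10],
    [0.20, 0.30, 0.25, 0.20, 0.05],
]
Y = relabel(ensemble_outputs, y=2, k=3)
print(Y)
```

Note that the ground truth label is always kept, even when the averaged distribution ranks another label higher, as happens for label 1 in the toy data.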

3.2. TOP-K LOSS FUNCTION WITH LABEL RANKING

In this section, we introduce our top-k loss function with label ranking. To better understand top-k classification, we start with the simple case of k = 1, where top-1 classification is ordinary multi-class classification. Given a training sample $(x_i, y_i)$, $i = 1, \dots, t$, where $y \in \mathcal{Y} := \{1, \dots, c\}$, the classifier outputs the score of the sample x on each label, i.e., the score vector $s := \{s_1, \dots, s_c\}$. Multi-class classification aims to maximize the probability $p_y$ of the ground truth label y, where $p_y$ is calculated by Softmax. It therefore minimizes the following Softmax cross-entropy loss:

$$L_{k=1} = -\log\left(\frac{e^{s_y}}{\sum_{j=1}^{c} e^{s_j}}\right) \tag{1}$$

where $s_y$ is the score of the ground truth label. From Eq. 1, it can be seen that reducing the loss is equivalent to increasing $s_y$. Taking $s_y$ as the reference, i.e., dividing the numerator and the denominator by $e^{s_y}$, Eq. 1 can be transformed as follows:

$$L_{k=1} = -\log\left(\frac{1}{\sum_{j=1}^{c} e^{s_j - s_y}}\right) = \log\left(1 + \sum_{j \neq y} e^{s_j - s_y}\right) \tag{2}$$

The goal of Eq. 2 is to minimize $(s_j - s_y)$, that is, the score of the ground truth label should be much larger than the scores of the other labels. We define the ground truth label as the positive label and the rest as negative labels; Eq. 2 can then be rewritten as follows:

$$L_{k=1} = \log\left(1 + \sum_{n \in neg} e^{s_n - s_p}\right) \tag{3}$$

where neg is the set of negative labels, $s_n$ is the score of a negative label, and $s_p$ is the score of the positive label. Since k = 1, Eq. 3 has 1 positive label and c - 1 negative labels. Eq. 3 takes maximizing the score of the positive label as the optimization goal, and the label with the largest score is directly output as the prediction. When k > 1, each training sample has k labels and, as in the case of k = 1, the loss function expects the scores of the k labels to be larger than those of the remaining labels.
Specifically, given a training sample $(x_i, Y_i)$, the top-k relabeling result of $(x_i, y_i)$, where $Y_i \in \mathbb{R}^k$ is composed of the ground truth label $y_i$ and the k - 1 pseudo-labels that are most related to $x_i$, the loss function aims to maximize the scores of the labels in $Y_i$ and minimize the scores of the remaining labels. Similarly, we define the labels in $Y_i$ as positive labels and the rest as negative labels; the loss function can then be written as follows:

$$L_{k>1} = \log\left(1 + \sum_{n \in neg,\, p \in pos} e^{s_n - s_p}\right) \tag{4}$$

where pos is the set of the k positive labels and neg is the set of the c - k negative labels. Eq. 4 pushes the scores of the positive labels above those of the negative labels, and the classifier only needs to output the k labels with the largest scores as the prediction. However, the loss function in Eq. 4 does not consider the ranking of the ground truth label. Another goal of our top-k classification is for the ground truth label to rank higher among the predicted k labels, which means that the score of the ground truth label should be as large as possible. More specifically, the score of the ground truth label $s_y$ should be larger than that of the other positive labels, i.e., we maximize $(s_y - s_p)$, $p \neq y$. The final loss function can then be written as follows:

$$L = \log\left(1 + \sum_{n \in neg,\, p \in pos} e^{s_n - s_p} + \sum_{p \in pos \setminus y} e^{s_p - s_y}\right) = \log\left(1 + \sum_{n \in neg} e^{s_n} \sum_{p \in pos} e^{-s_p} + e^{-s_y} \sum_{p \in pos \setminus y} e^{s_p}\right) \tag{5}$$

where $pos \setminus y$ is the set of positive labels other than the ground truth label y. We name the third term, $e^{-s_y} \sum_{p \in pos \setminus y} e^{s_p}$, in the loss function of Eq. 5 the rank loss. Adding the rank loss to the final loss function has two benefits. First, it improves the ranking of the ground truth label among the k predicted labels, as discussed above. Second, top-k classification expects the ground truth label to appear among the k predicted labels, and the rank loss raises the score of the ground truth label, which makes it more likely to be among them. Therefore, the rank loss also helps to improve the accuracy of top-k classification.
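The loss for a single sample can be sketched with the standard library alone. This is a hedged illustration, not the authors' implementation: the names `tklr_loss` and `cross_entropy` and the toy scores are hypothetical, the log-sum-exp shift is a common numerical-stability device added here, and a real training setup would use a differentiable framework such as PyTorch.

```python
import math
from typing import Sequence


def tklr_loss(scores: Sequence[float], positives: Sequence[int], y: int,
              use_rank: bool = True) -> float:
    """Loss for one sample: pairwise term plus optional rank term."""
    pos = set(positives)
    assert y in pos, "the ground truth label must be one of the positives"
    neg = [j for j in range(len(scores)) if j not in pos]

    # Exponents of every term inside log(...); the leading 0.0 is the "1".
    exponents = [0.0]
    exponents += [scores[n] - scores[p] for n in neg for p in pos]
    if use_rank:
        # Rank loss: other positive labels should score below the ground truth.
        exponents += [scores[p] - scores[y] for p in pos if p != y]

    # Log-sum-exp with the maximum factored out, to avoid overflow.
    top = max(exponents)
    return top + math.log(sum(math.exp(e - top) for e in exponents))


def cross_entropy(scores: Sequence[float], y: int) -> float:
    """Softmax cross-entropy for one sample, used only as a reference."""
    return math.log(sum(math.exp(s) for s in scores)) - scores[y]


scores = [2.0, 0.5, 1.5, -1.0, 0.0]  # c = 5 toy scores
print(tklr_loss(scores, [0], 0), cross_entropy(scores, 0))   # k = 1 case
print(tklr_loss(scores, [0, 1, 2], 0))                       # y ranked first
print(tklr_loss(scores, [0, 1, 2], 1))                       # y ranked last
```

With one positive label the value coincides with cross-entropy, and with the same three positives the loss is larger when the ground truth is the lowest-scored positive, which is the behavior the rank loss is meant to produce.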

4. EXPERIMENT

In this section, we perform extensive experiments to validate the effectiveness of our method. First, we describe the experimental setup in detail. Second, we compare our method with different approaches. We then conduct experiments on both text datasets and image datasets, evaluating the results with two evaluation metrics. The source code will be available at https://github.com/Tracy-6914/TkLR

4.1. EXPERIMENTAL SETUP

Dataset and model selection. In our experiments, we select four single-label text datasets and four single-label image datasets: Ohsumed (Joachims, 1998), 20NG (Johnson & Zhang, 2016), WOS (Kowsari et al., 2017), TREC (Li & Roth, 2002), CIFAR100 (Krizhevsky et al., 2009), Aircraft (Lu et al., 2021), CUB-200-2011 (Wah et al., 2011), and Indoor67 (Quattoni & Torralba, 2009). Ohsumed includes medical abstracts from the MEDLINE database that describe different cardiovascular diseases. 20 Newsgroups (20NG) is a collection of newsgroup documents posted on 20 different topics. WOS-46985 (WOS) collects the abstracts of papers published in Web of Science. TREC is a question classification dataset consisting of open-domain, fact-based questions divided into broad semantic categories. CIFAR100 is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. CUB-200-2011 is a challenging dataset of 200 bird species. Aircraft contains different aircraft model variants, most of which are airplanes. Indoor67 contains 67 indoor scene categories and a total of 15620 images. Table 1 shows the statistics of all datasets. Our method is a general top-k classification method that can be applied to various models. In our experiments, we choose BERT (Kenton & Toutanova, 2019) as the backbone model for the text datasets and Swin Transformer (Liu et al., 2021) as the backbone model for the image datasets.

Comparison methods.
For the text datasets, we compare our method with the following approaches:
(1) BERT + CE, a simple multi-class classification method that uses cross-entropy (CE) as the loss function of BERT.
(2) BERT + Top-k SVM, which obtains features from the fine-tuned BERT and feeds them into the top-k SVM (Lapin et al., 2015) for top-k classification.
(3) BERT + Top-k Entropy, which trains the top-k classifier with top-k Entropy (Lapin et al., 2015) on features obtained from the fine-tuned BERT.
(4) BERT + Top-k SVM Semismooth, which uses Top-k SVM Semismooth (Chu et al., 2018) and features obtained from the fine-tuned BERT to train the top-k classifier.
(5) BERT + Smooth, which uses the recent top-k Smooth loss function (Berrada et al., 2018) as the loss function of BERT.
(6) BERT + Relabel, a multi-label classification method based on BERT that uses the training set after the relabeling process and the binary cross-entropy loss function.
(7) BalancedLoss + Relabel, the state-of-the-art multi-label text classification model BalancedLoss (Huang et al., 2021), trained on the training set after the relabeling process.

Similarly, on the image datasets we compare our method with the following approaches:
(1) Swin + CE, a multi-class classification method that uses cross-entropy (CE) as the loss function of Swin Transformer.
(2) Swin + Top-k SVM, which obtains features from the fine-tuned Swin and feeds them into the top-k SVM (Lapin et al., 2015) for top-k classification.
(3) Swin + Top-k Entropy, which trains the top-k classifier with top-k Entropy (Lapin et al., 2015) on features obtained from the fine-tuned Swin.
(4) Swin + Top-k SVM Semismooth, which uses Top-k SVM Semismooth (Chu et al., 2018) and features obtained from the fine-tuned Swin to train the top-k classifier.
(5) Swin + Smooth, which uses the Smooth loss function as the loss function of the Swin Transformer.
(6) Swin + Relabel, a multi-label classification method for Swin Transformer that uses the training set after the relabeling process and the binary cross-entropy loss function.
(7) C-Tran + Relabel, the state-of-the-art multi-label image classification model C-Tran (Lanchantin et al., 2021), trained on the training set after the relabeling process.

Ablation experiment.
We name our method BERT (Swin) + Relabel + TkLR. To demonstrate the effectiveness of the proposed rank loss, we remove it from our method and run the same experiments; this variant is named BERT (Swin) + Relabel + TkLR-No-Rank.

Experimental setting. In the base classifier training process, we choose BERT as the backbone network for the text datasets and Swin-small as the backbone network for the image datasets. In the top-k classifier training process, for the text datasets, we set the number of epochs to 20, the maximum sequence length to 512, the batch size to 10, and the learning rate to 2e-5. We train our model with the AdamW (Loshchilov & Hutter, 2017) optimizer with weight decay = 0.01. For image classification, we set the number of epochs to 100, the batch size to 64, and the learning rate to 5e-5, and adopt the AdamW optimizer with weight decay = 5e-2. The training process and convergence analyses of the loss function are discussed in Appendix A.2.

Evaluation metrics. To account for the ranking of the ground truth label among the predicted k labels, we use top-k accuracy (Yan et al., 2018) and Normalized Discounted Cumulated Gains at top k (NDCG@k) (Wang et al., 2013) as our evaluation metrics. They are calculated as follows. Top-k accuracy is defined as:

$$Acc = \frac{\sum_{i=1}^{n} flag_i}{n}, \qquad flag_i = \begin{cases} 1 & \text{if the ground truth label is among the predicted } k \text{ labels} \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where n is the number of samples. For NDCG@k, suppose that l is the total number of labels. NDCG@k is defined according to the predicted score vector $\hat{y} \in \mathbb{R}^l$ and the ground truth label vector $y \in \{0, 1\}^l$ as follows:

$$DCG@k = \sum_{j \in rank_k(\hat{y})} \frac{y_j}{\log(pos(j) + 1)}, \qquad NDCG@k = \frac{DCG@k}{\sum_{i=1}^{\min(k, \|y\|_0)} \frac{1}{\log(i + 1)}} \tag{7}$$

where $rank_k(\hat{y})$ is the set of label indexes with the top-k highest scores in the current prediction, $pos(j)$ is the position of j in that ranking, i.e., $pos(j) = 1, 2, 3, \dots$, and $\|y\|_0$ counts the number of labels in the ground truth label vector y. All experiments were implemented with Python 3.8 and PyTorch 1.8 on a Linux server.
The text classification experiments were run on an RTX 3080Ti and the image classification experiments on an RTX 3090.
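The two evaluation metrics of Section 4.1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the evaluation script used in the experiments: the function names and toy scores are hypothetical, the natural logarithm is used (the base cancels in the NDCG ratio), and ties in the scores are broken by label index.

```python
import math
from typing import List, Sequence


def top_k_labels(scores: Sequence[float], k: int) -> List[int]:
    """Indexes of the k highest scores, best first (ties: lower index first)."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def top_k_accuracy(all_scores: Sequence[Sequence[float]],
                   truths: Sequence[int], k: int) -> float:
    """Fraction of samples whose ground truth label is among the top k."""
    hits = sum(1 for s, y in zip(all_scores, truths) if y in top_k_labels(s, k))
    return hits / len(truths)


def ndcg_at_k(scores: Sequence[float], y_vec: Sequence[int], k: int) -> float:
    """NDCG@k for one sample, given a 0/1 ground truth label vector."""
    ranked = top_k_labels(scores, k)
    dcg = sum(y_vec[j] / math.log(pos + 1)
              for pos, j in enumerate(ranked, start=1))
    n_relevant = min(k, sum(1 for v in y_vec if v != 0))
    ideal = sum(1.0 / math.log(i + 1) for i in range(1, n_relevant + 1))
    return dcg / ideal if ideal > 0 else 0.0


# Two toy samples over 3 labels; ground truth labels are 2 and 1.
batch_scores = [[0.5, 0.3, 0.2], [0.1, 0.7, 0.2]]
batch_truths = [2, 1]
print(top_k_accuracy(batch_scores, batch_truths, k=2))
print(ndcg_at_k(batch_scores[0], [0, 0, 1], k=3))
print(ndcg_at_k(batch_scores[1], [0, 1, 0], k=3))
```

In the single-label setting used here the normalizer reduces to 1/log(2), so NDCG@k depends only on the position of the ground truth label in the ranking and is 0 when the label falls outside the top k.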

4.2. HYPERPARAMETER SETTING

In our experiments, we employ the Bagging (Zhou, 2012) method to train m base classifiers. The Bagging method randomly selects a proportion α of the total training data for each training run. We conduct experiments on both text and image datasets to determine the number of base classifiers m and the proportion α. We first randomly initialize α and determine the optimal m with α fixed. We then determine α with the optimal m. The experimental results are shown in Figure 2. Figure 2a and Figure 2b show that, in general, the larger m is, the higher the top-k accuracy is. On the Ohsumed dataset, the overall results are good when m = 8, and the improvement is no longer significant as m increases further. On the CIFAR100 dataset, the results are also good when m = 5, and the effect of increasing m further is not obvious. Figure 2c and Figure 2d show the performance of different values of α with the optimal m. The results show that the top-k classifier achieves the best performance on the Ohsumed and CIFAR100 datasets when α = 0.8. We found similar experimental results on the other datasets. Therefore, we set m to 8 for the text datasets and to 5 for the image datasets, and we set α to 0.8 for both. The theoretical analysis of the hyperparameters is presented in Appendix A.1.
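How the m training subsets might be drawn can be sketched as below. The text only states that each subset is a random proportion α of the data, so sampling without replacement inside a subset, the function name `bagging_subsets`, and the fixed seed are all assumptions of this sketch.

```python
import random
from typing import List


def bagging_subsets(n_examples: int, m: int, alpha: float,
                    seed: int = 0) -> List[List[int]]:
    """Draw m random index subsets, each covering a proportion alpha of the data."""
    rng = random.Random(seed)
    size = int(round(alpha * n_examples))
    # Assumption: indices are sampled without replacement within one subset;
    # different subsets are drawn independently and may overlap.
    return [sorted(rng.sample(range(n_examples), size)) for _ in range(m)]


# Settings reported for the text datasets: m = 8 base classifiers, alpha = 0.8.
subsets = bagging_subsets(n_examples=1000, m=8, alpha=0.8)
print(len(subsets), len(subsets[0]))
```

Each index list would then select the examples used to train one base classifier with the cross-entropy loss.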

4.3. EXPERIMENTAL RESULT ON TEXT DATASET

Table 2 and Table 3 show the top-k accuracy and the NDCG values for different k of all methods on the text datasets, respectively. Our method BERT+Relabel+TkLR achieves the best results on both top-k accuracy and NDCG, which demonstrates its effectiveness for the top-k classification problem. As can be seen from the top-k accuracy in Table 2, our method is much better than BERT+CE. On the Ohsumed dataset in particular, the accuracy of our method is at least 3.7% higher than that of BERT+CE for the different values of k. Because cross-entropy aims to maximize the probability of the ground truth label, BERT+CE performs well on the NDCG metric in Table 3. Nevertheless, the NDCG results of our method are still better than those of BERT+CE, which illustrates the effectiveness of our rank loss. The top-k accuracy results of BERT+Relabel are similar to those of BalancedLoss+Relabel. In most cases, the results of BERT+Relabel and BalancedLoss+Relabel are better overall than those of BERT+CE, BERT+Smooth, BERT+Top-k SVM, BERT+Top-k Entropy and BERT+Top-k SVM Semismooth on all text datasets, which illustrates that the relabeling process improves top-k classification accuracy and that the idea of predicting the k most likely labels is effective. However, the NDCG results of BERT+Relabel and BalancedLoss+Relabel are poor because, unlike our loss function, they do not consider the ranking of the ground truth label. Compared with BERT+Relabel, our method BERT+Relabel+TkLR achieves better results in top-k accuracy and NDCG on all datasets, which shows that our proposed loss function is very effective. Even where the top-7 accuracy of BERT+Relabel equals that of BERT+Relabel+TkLR, the NDCG results of BERT+Relabel+TkLR are substantially better, which shows the strength of the rank loss.
Comparing the results of BERT+Relabel+TkLR-No-Rank and BERT+Relabel+TkLR shows that adding the rank loss improves not only the top-k accuracy but also the NDCG metric.

Table 4 and Table 5 show the top-k accuracy and the NDCG values for different k of all methods on the image datasets, respectively. The experimental results on the image datasets are similar to those on the text datasets: our method achieves the best overall results on both top-k accuracy and NDCG, which illustrates its generality. Compared with Swin+CE, our method Swin+Relabel+TkLR improves the top-k accuracy, especially on datasets with low top-k accuracy such as CIFAR100 and Aircraft. Although Swin+CE performs well on the NDCG metric because of the cross-entropy loss, and gets slightly better results than Swin+Relabel+TkLR at NDCG@7 on CUB-200-2011, Swin+Relabel+TkLR gets the best NDCG results in all other cases. This shows that the rank loss is also effective on image datasets. The results of Swin+CE, Swin+Top-k SVM, Swin+Top-k SVM Semismooth and Swin+Smooth are similar to one another and better than those of Swin+Top-k Entropy. However, Swin+CE performs better on the NDCG metric, because the other methods neglect the ranking of the ground truth label. C-Tran+Relabel performs poorly on both metrics, especially on CIFAR100. Compared with Swin+Relabel+TkLR-No-Rank, Swin+Relabel+TkLR performs clearly better on both top-k accuracy and NDCG, which shows that the rank loss also improves both metrics on the image datasets.

5. CONCLUSION

In this paper, we highlight the importance of ground truth label ranking in top-k classification, which none of the previous top-k classification methods takes into account. Inspired by multi-label classification, we transform the top-k classification problem into the problem of predicting the k most likely labels, and we propose a novel three-stage approach for top-k classification that can be easily deployed with different classification models. We propose an ensemble-based relabeling method that relabels the training data with k labels. In addition, we propose a novel top-k loss function that takes the ranking of the ground truth label into account. Finally, we conduct extensive experiments on both text datasets and image datasets, and the experimental results show that our method achieves the best results.

To keep the base classifiers diverse, the training data of each classifier should be different, so α cannot be 1. However, if α is too small, a good base classifier cannot be trained. Therefore, in practice, the number m of ensemble models and the proportion α are usually determined through experiments such as those in Fig. 2. In our experiments, we employ the Bagging method to train multiple base classifiers. The Bagging method randomly selects a proportion α of the total training data for each training run. If α is too small, the classifiers will underfit, and if α is too large, the classifiers will be too similar; both make the prediction performance of the ensemble worse. Thus, α is usually obtained through experimental verification. The experiments that determine m and α on the text and image datasets, and the resulting settings (m = 8 for the text datasets, m = 5 for the image datasets, and α = 0.8 for both), are described in Section 4.2 and Figure 2.
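The ensemble-size argument of Appendix A.1 can be checked numerically. The sketch below assumes the idealized setting of that argument: m independent binary base classifiers, each with error ε, combined by majority vote, so the ensemble errs when at most half of them are correct. It compares the exact binomial tail with the Hoeffding bound exp(-m(2ε - 1)²/2). The function names and the value ε = 0.3 are hypothetical, and real base classifiers trained on overlapping subsets are not independent.

```python
import math


def ensemble_error(m: int, eps: float) -> float:
    """Probability that at most floor(m/2) of m independent classifiers are correct."""
    return sum(math.comb(m, j) * (1 - eps) ** j * eps ** (m - j)
               for j in range(m // 2 + 1))


def hoeffding_bound(m: int, eps: float) -> float:
    """Upper bound exp(-m (2 eps - 1)^2 / 2) on the majority-vote error."""
    return math.exp(-0.5 * m * (2 * eps - 1) ** 2)


eps = 0.3  # assumed error of each base classifier
for m in (1, 3, 5, 7, 9):
    print(m, round(ensemble_error(m, eps), 4), round(hoeffding_bound(m, eps), 4))
```

Odd values of m are used in the loop because an even m counts tied votes as errors, which makes the exact error non-monotone between consecutive sizes.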

A.2 CHARACTERISTICS INVESTIGATION OF THE LOSS FUNCTION

Time complexity. We discuss the time complexity of our loss function in this part. We first ignore the label ranking part and start with k = 1; our loss function is then defined as

$$L_{k=1} = \log\left(1 + \sum_{n \in neg} e^{s_n - s_p}\right)$$

which is equal to cross-entropy. Obviously, the time complexity of the loss function is O(1). Suppose the total number of categories is c. When k > 1, the complexity of Eq. 4 depends on the pairwise sum $\sum_{n \in neg,\, p \in pos} e^{s_n - s_p}$.

Training process. To illustrate the behavior of our loss function, we show its training process in Fig. 3. First of all, it can be seen from Fig. 3a that our loss function converges more slowly than the other loss functions, and its initial loss is larger. Comparing TkLR and TkLR-NO-RANK, it is clear that TkLR-NO-RANK converges faster, which shows that label ranking has an impact on the loss function. At the same time, TkLR-NO-RANK can converge to a loss value close to 0 like the other functions, which shows that our proposed loss ...

Analysis of the convergence. We illustrate the state of convergence by analyzing the two objectives of the loss function. The first objective is top-k classification, which aims to maximize $(s_p - s_n)$. Fig. 4 shows the state of $s_p$ and $s_n$ from unconverged to converged. Fig. 4a shows the distribution of $s_p$ and $s_n$ when epoch = 0 (training has not started). From the figure, it can be seen that $s_p$ and $s_n$ are distributed near (0, 0). When epoch = 1, Fig. 4b shows that the distribution slowly moves to the left and up, which indicates that the objective is converging. When epoch = 20, Fig. 4c shows that the distribution of $s_p$ is concentrated in [0, 15], while the distribution of $s_n$ is concentrated in [-5, 2.5], which is in line with the goal of maximizing $(s_p - s_n)$. At the same time, compared with Fig. 4a and Fig. 4b, the distribution in Fig. 4c is more concentrated, which shows that the convergence of the objective function is complete.
The other objective is label ranking, which aims to maximize $(s_y - s_p)$, $p \neq y$. Fig. 5 shows the state of $s_y$ and $s_p$ from unconverged to converged. Fig. 5a shows that the distribution of $s_y$ and $s_{p \setminus y}$ is concentrated around (0, 0) in the initial state, epoch = 0. Fig. 5b shows that the distribution of $s_y$ and $s_{p \setminus y}$ shifts upward when epoch = 1, which shows that the optimization objective $(s_y - s_p)$ is converging. Fig. 5c shows that $s_y$ is distributed in [8, 15] and $s_{p \setminus y}$ is distributed in [0, 10]. Compared with Fig. 5a and Fig. 5b, the distribution in Fig. 5c is more concentrated and the objective is greatly optimized, so the convergence is complete.



Figure 1: The proposed three-stage approach for top-k classification.

As can be seen from Figure 1, the three stages of our approach are training base classifiers, relabeling training data, and training the top-k classifier.

Stage 1: Training base classifiers. Base classifiers are used to predict the k most likely labels for each training example in Stage 2. Considering that correct k labels are crucial for top-k classification in our problem setting, we use the idea of ensemble learning to train m base classifiers. Specifically, given the single-label dataset D, which is composed of the training set and the validation set, we randomly select m different subsets from D to train the m base classifiers, where each subset is a proportion α of D. The loss function of the training process is the cross-entropy loss function. The settings of the hyperparameters α and m are discussed in Section 4.2.

Stage 3: Training top-k classifier. We take the output of the second stage as training data and use our proposed top-k loss function to train a new top-k classification model. The loss function is described in Section 3.2.

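$$L = \log\left(1 + \sum_{n \in neg,\, p \in pos} e^{s_n - s_p} + \sum_{p \in pos \setminus y} e^{s_p - s_y}\right) = \log\left(1 + \sum_{n \in neg} e^{s_n} \sum_{p \in pos} e^{-s_p} + e^{-s_y} \sum_{p \in pos \setminus y} e^{s_p}\right) \tag{5}$$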

(5) Swin + Smooth, which uses the Smooth loss function as the loss function of the Swin Transformer. (6) Swin + Relabel, a multi-label classification method for Swin Transformer that uses the training set after the relabeling process and the binary cross-entropy loss function. (7) C-Tran + Relabel, the state-of-the-art multi-label image classification model C-Tran (Lanchantin et al., 2021), trained on the training set after the relabeling process.

Figure 2: Setting of hyperparameters α and m

Figure 3: Training process

where |pos| = k and |neg| = c - k. The total number of terms is k(c - k). Because the number of categories c is much larger than k in practice, the complexity of Eq. 4 is O(c · k). At the same time, since the complexity of the label ranking part is O(1), the time complexity of the loss function is O(c · k).
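The pair count can be made concrete with a small sketch. It enumerates the k(c - k) terms of the pairwise sum in Eq. 4 and, for comparison, evaluates the factored form that appears on the right-hand side of Eq. 5 (a sum over negatives multiplied by a sum over positives); both give the same value. The function names and toy scores are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pairwise_sum(scores: Sequence[float], pos: Sequence[int]) -> Tuple[float, int]:
    """Sum exp(s_n - s_p) over all (negative, positive) pairs; also count the terms."""
    pos_set = set(pos)
    neg = [j for j in range(len(scores)) if j not in pos_set]
    terms = [math.exp(scores[n] - scores[p]) for n in neg for p in pos_set]
    return sum(terms), len(terms)


def factored_sum(scores: Sequence[float], pos: Sequence[int]) -> float:
    """Same quantity written as (sum over negatives) * (sum over positives)."""
    pos_set = set(pos)
    neg_part = sum(math.exp(s) for j, s in enumerate(scores) if j not in pos_set)
    pos_part = sum(math.exp(-scores[p]) for p in pos_set)
    return neg_part * pos_part


scores = [0.3 * j - 1.0 for j in range(10)]  # c = 10 toy scores
pos = [9, 8, 7]                              # k = 3 positive labels
total, n_terms = pairwise_sum(scores, pos)
print(n_terms, total, factored_sum(scores, pos))
```

The explicit enumeration touches k(c - k) pairs, whereas the factored form needs one exponential per label.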

Figure 4: Visualization of top-k classification on the test set. Top-k classification maximizes $s_p - s_n$. Panels (a) and (b) show the unconverged state, and panel (c) shows the converged state.

Table 1: Statistics of the datasets

Table 2: Top-k accuracy results on text datasets

Table 3: NDCG@k results on text datasets

Table 4: Top-k accuracy results on image datasets

Table 5: NDCG@k results on image datasets

A APPENDIX

A.1 THEORETICAL ANALYSIS OF HYPERPARAMETERS FOR ENSEMBLE LEARNING

In statistics and machine learning, ensemble learning uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Take simple binary classification as an example. Given $y \in \{-1, +1\}$ and the ground truth function f, suppose that each base classifier $h_i$ has an independent generalization error $\epsilon$, i.e.,

$$P(h_i(x) \neq f(x)) = \epsilon.$$

After combining m such base classifiers according to

$$H(x) = \mathrm{sign}\left(\sum_{i=1}^{m} h_i(x)\right),$$

the ensemble H makes an error only when at least half of its base classifiers make errors. Therefore, by the Hoeffding inequality, the generalization error of the ensemble is

$$P(H(x) \neq f(x)) = \sum_{j=0}^{\lfloor m/2 \rfloor} \binom{m}{j} (1 - \epsilon)^j \epsilon^{m-j} \le \exp\left(-\frac{1}{2} m (2\epsilon - 1)^2\right).$$

The above formula shows that as the number m of classifiers in the ensemble increases, the error rate of the ensemble decreases exponentially. That is to say, a larger number of classifiers can make the prediction results of the ensemble model more accurate.

In our method, the base classifiers in Stage 1 are used to predict the k most likely labels for the samples, and these predictions are extremely important for subsequent training. A single base classifier cannot guarantee the stability and accuracy of the predictions, so we use the idea of ensemble learning to train multiple classifiers in Stage 1. However, more ensemble models are not always better: too many models consume a lot of resources and time. Moreover, once the number of models reaches a certain level, increasing it further has little effect on the prediction results. At the same time, in order to ensure the diversity of each basic classifier,

