LPT: LONG-TAILED PROMPT TUNING FOR IMAGE CLASSIFICATION

Abstract

For long-tailed classification tasks, most works first pretrain a big model on a large-scale (often unlabeled) dataset, and then fine-tune the whole pretrained model to adapt it to the long-tailed data. Though promising, fine-tuning the whole pretrained model incurs high computational cost, requires deploying a different model for each task, and weakens generalization because the model overfits to certain features of the long-tailed data. To alleviate these issues, we propose an effective Long-tailed Prompt Tuning (LPT) method for long-tailed classification tasks. LPT introduces several trainable prompts into a frozen pretrained model to adapt it to long-tailed data. For better effectiveness, we divide the prompts into two groups: 1) a shared prompt for the whole long-tailed dataset, which learns general features and adapts the pretrained model to the target long-tailed domain; and 2) group-specific prompts, which gather group-specific features for samples with similar features and endow the pretrained model with fine-grained discrimination ability. We then design a two-phase training paradigm to learn these prompts. In the first phase, we train the shared prompt via conventional supervised prompt tuning to adapt the pretrained model to the desired long-tailed domain. In the second phase, we use the learned shared prompt as a query to select, from the group-specific prompt set, a small set of prompts best matched to a group of similar samples, so as to mine the common features of these samples; we then optimize these prompts with a dual sampling strategy and the asymmetric Gaussian Clouded Logit loss. By fine-tuning only a few prompts while keeping the pretrained model fixed, LPT reduces both training cost and deployment cost, since only a few prompts need to be stored, and it retains the strong generalization ability of the pretrained model.
Experiments show that, on various long-tailed benchmarks, with only ∼1.1% extra trainable parameters, LPT achieves comparable or higher performance than previous methods that fine-tune the whole model, and is more robust to domain shift.

1. INTRODUCTION

Learning from long-tailed data (Cui et al., 2019; Kang et al., 2020; Zhang et al., 2021b) is very challenging in the deep learning era, since networks tend to overfit to majority classes while ignoring minority classes, owing to the overwhelming number of training samples in the majority classes. To eliminate this negative effect, previous methods focus on three aspects: 1) re-sampling the long-tailed data distribution (Kang et al., 2020; Li et al., 2022; 2021a; Ren et al., 2020) to achieve class balance within each minibatch; 2) re-weighting the training loss (Cui et al., 2019; Li et al., 2022; Menon et al., 2021) to assign heavier weights to minority classes; and 3) specially designed decoupled training (Kang et al., 2020), knowledge distillation (Li et al., 2021b), or ensemble learning (Zhou et al., 2020; Wang et al., 2020). Although these methods alleviate the negative effects of long-tailed learning to some extent and achieve better overall performance, they generally need to train both the feature extractor and the linear classifier from scratch or from models pretrained on large-scale datasets, e.g., ImageNet (Deng et al., 2009), and thus suffer from three issues. First, fine-tuning the whole model to adapt it to long-tailed data incurs much extra training cost. Second, fine-tuning the whole model impairs the generalization ability of the pretrained model: having been trained on abundant data, the pretrained model enjoys strong discriminative ability over various kinds of features, whereas fine-tuning weakens this ability by overfitting to certain features of the long-tailed data, and the resulting model can hardly handle the domain-shifted or out-of-distribution data that occur frequently in long-tailed learning. Finally, fine-tuning yields very different models for different learning tasks, which destroys model compatibility and increases practical deployment cost.

Contributions.
To alleviate the above issues, we propose a novel and effective Long-tailed Prompt Tuning (LPT) approach. Specifically, LPT builds on a pretrained model, e.g., a vision transformer (ViT) (Dosovitskiy et al., 2021), introduces extra trainable prompts into this pretrained model, and fine-tunes only these prompts to adapt the pretrained model to the long-tailed data at hand. There are two kinds of prompts: 1) a shared prompt for all classes, which learns general features (knowledge) and adapts the pretrained model to the target domain; and 2) group-specific prompts, which gather group-specific features for samples with similar features and endow the pretrained model with fine-grained distinguishing ability. For effective training, we design a two-phase framework to learn these two kinds of prompts. In the first phase, LPT optimizes the shared prompt and a classifier on the long-tailed training dataset of interest. The goal of this phase is twofold: 1) to adapt the pretrained model to the target domain via prompt tuning, and 2) to endow the pretrained model, together with the trained classifier, with discriminative ability on the training data, which is the basis for learning the group-specific prompts. In the second phase, we learn the newly added group-specific prompt set and further fine-tune the classifier from the first phase. Specifically, given an input, LPT feeds it into the pretrained model with the learned shared prompt, and uses the output class token as a query to select a small set of matched prompts, by computing the cosine similarity between the query and the corresponding keys of the group-specific prompt set. Next, the matched trainable group-specific prompts are fed into the pretrained model together with the shared prompt to help learn class-specific attributes, and are trained with the asymmetric Gaussian Clouded Logit (A-GCL) loss (Li et al., 2022) and a dual sampling strategy.
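The phase-2 prompt selection described above can be sketched as follows. This is a minimal illustration under assumed names and shapes (`GroupPromptPool`, pool size, prompt length, embedding dimension), not the authors' actual implementation: the class token produced with the shared prompt serves as a query, and the group-specific prompts whose learnable keys have the highest cosine similarity to it are retrieved.

```python
import torch
import torch.nn.functional as F

class GroupPromptPool(torch.nn.Module):
    """Hypothetical sketch of LPT's group-specific prompt selection.
    One learnable key and one learnable prompt per group; the top-k
    groups matched to the query (the class token) are retrieved."""

    def __init__(self, pool_size=20, prompt_len=10, dim=768, top_k=2):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(pool_size, dim))
        self.prompts = torch.nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.top_k = top_k

    def forward(self, query):                        # query: (B, dim) class token
        # cosine similarity between each query and every group key
        sim = F.cosine_similarity(
            query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)  # (B, pool_size)
        idx = sim.topk(self.top_k, dim=-1).indices   # (B, top_k) best-matched groups
        matched = self.prompts[idx]                  # (B, top_k, prompt_len, dim)
        # concatenate matched prompts into one token sequence per sample
        return matched.flatten(1, 2)                 # (B, top_k * prompt_len, dim)
```

In a full pipeline, these retrieved tokens would be prepended (together with the shared prompt) to the frozen ViT's input sequence, with gradients flowing only into the prompts, keys, and classifier.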
LPT can well alleviate the three aforementioned issues of existing methods. Regarding training cost, LPT only needs to fine-tune a few prompts whose size is much smaller than that of the pretrained model, and thus requires much less training cost than fine-tuning the whole pretrained model for adaptation. Regarding generalization, LPT fine-tunes only the prompts while keeping the pretrained model fixed, and thus retains the strong generalization capacity of the pretrained model. Regarding compatibility, LPT shares one pretrained model across different learning tasks and only needs to store the small prompts, which greatly improves model compatibility and reduces practical deployment cost. As shown in Fig. 1, on various long-tailed classification benchmarks, with only ∼1.1% additional prompt parameters, LPT achieves comparable or higher performance than previous methods that fine-tune the whole pretrained model. In particular, using only vision data for training and testing, LPT achieves 50.1% overall classification accuracy and 46.9% few-shot accuracy on the Places-LT dataset (Zhou et al., 2017a), improving over previous methods trained on vision-only data by 8.9% and 11.6%, respectively. Moreover, further experimental results show the superiority of LPT as well as its generalization and robustness on long-tailed data and domain-shifted data.
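The "∼1.1% additional parameters" budget comes from freezing the backbone and counting only prompt and classifier parameters as trainable. A toy illustration of this bookkeeping, with made-up layer sizes standing in for the actual ViT backbone (the exact fraction below is specific to these toy sizes, not the paper's):

```python
import torch

# Freeze a toy "backbone" so its weights receive no gradients.
backbone = torch.nn.Sequential(
    torch.nn.Linear(768, 3072), torch.nn.GELU(), torch.nn.Linear(3072, 768))
for p in backbone.parameters():
    p.requires_grad = False          # pretrained weights stay fixed

# Only the prompt tokens and the classifier head are trainable.
prompts = torch.nn.Parameter(torch.zeros(10, 768))   # 10 shared prompt tokens
classifier = torch.nn.Linear(768, 365)               # e.g. a Places-LT head

trainable = prompts.numel() + sum(p.numel() for p in classifier.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / (trainable + frozen):.2%}")
```

The same counting over a real ViT backbone (tens of millions of frozen parameters) drives the trainable fraction down to the ∼1% regime reported above.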

2. RELATED WORK

Long-tailed Image Classification. To tackle the negative effect of highly imbalanced data distributions, previous works mainly focus on three different aspects, i.e., data re-sampling (Kang et al.,



Figure 1: Comparison among SoTA long-tailed methods on the Places-LT and iNaturalist 2018 datasets, where the size of each spot indicates the size of the overall network, including the backbone, classifier, and prompts. Our LPT only needs ∼1.1% additional trainable parameters while achieving comparable or higher accuracy on two highly long-tailed datasets.

Code availability. Our code is publicly available at https://github.com/DongSky/LPT.

