TOWARDS ESTIMATING TRANSFERABILITY USING HARD SUBSETS

Abstract

As transfer learning techniques are increasingly used to transfer knowledge from a source model to a target task, it becomes important to quantify which source models are suitable for a given target task without performing computationally expensive fine-tuning. In this work, we propose HASTE (HArd Subset TransfErability), a new strategy to estimate the transferability of a source model to a particular target task using only a harder subset of target data. By leveraging the model's internal and output representations, we introduce two techniques, one class-agnostic and another class-specific, to identify harder subsets and show that HASTE can be used with any existing transferability metric to improve its reliability. We further analyze the relation between HASTE and the optimal average log-likelihood as well as negative conditional entropy and empirically validate our theoretical bounds. Our experimental results across multiple source model architectures, target datasets, and transfer learning tasks show that HASTE-modified metrics are consistently better than or on par with the state-of-the-art transferability metrics. Our code is available here.

1. INTRODUCTION

Transfer learning (Pan & Yang, 2009; Torrey & Shavlik, 2010; Weiss et al., 2016) aims to improve the performance of models on target tasks by utilizing the knowledge from source tasks. With the increasing development of large-scale pre-trained models (Devlin et al., 2019; Chen et al., 2020a;b; Radford et al., 2021b) and the availability of multiple model choices for transfer learning (e.g., the model hubs of PyTorch, TensorFlow, and Hugging Face), it is critical to estimate their transferability without training on the target task and to determine how effectively transfer learning algorithms will transfer knowledge from the source to the target task. To this end, transferability estimation metrics (Zamir et al., 2018b; Achille et al., 2019; Tran et al., 2019b; Pándy et al., 2022; Nguyen et al., 2020) have recently been proposed to quantify how easy it is to use the knowledge learned from these models with minimal to no additional training on the target dataset. Given multiple pre-trained source models and target datasets, estimating transferability is essential because it is non-trivial to determine which source model transfers best to a target dataset, and training multiple models on all source-target combinations can be computationally expensive. Recent years have seen a few different approaches (Zamir et al., 2018b; Achille et al., 2019; Tran et al., 2019b; Pándy et al., 2022; Nguyen et al., 2020) for estimating how well a source model transfers on a given transfer learning task. However, existing methods often require performing the transfer learning task for parameter optimization (Achille et al., 2019; Zamir et al., 2018b) or make strong assumptions about the source and target datasets (Tran et al., 2019b; Zamir et al., 2018b). In addition, they are limited to estimating transferability on specific source architectures (Pándy et al., 2022) or achieve lower performance when there are large domain differences between the source and target datasets (Nguyen et al., 2020).
This has recently led to the questioning of the applicability of such metrics beyond specific settings (Agostinelli et al., 2022a). Prior works in other contexts (Khan et al., 2018; Agarwal et al., 2022; Zhang et al., 2021b; Soviany et al., 2022; D'souza et al., 2021) show that machine learning (ML) models find some samples easier to learn while others are much harder. In this work, we observe and leverage a similar phenomenon in transfer learning tasks (Figure 1a), where images belonging to the harder subset of the target dataset achieve lower prediction accuracy than images from the easy subset. The key principle is that easy samples do not contribute much when comparing the performance of a pre-trained model on multiple datasets or ranking the performance of different models on a given dataset. Additionally, in Figure 1b, we observe qualitatively that easy examples of the target dataset (Caltech101) comprise images that are in-distribution with respect to the source dataset (ImageNet), whereas images from the harder subset contain out-of-distribution clip art images that are not present in the source dataset and, hence, may be more challenging in the transfer learning process. Present work. In this work, we incorporate the aforementioned observation and propose a novel framework, HASTE (HArd Subset TransfErability), to estimate transferability using only the hardest subset of the target dataset. More specifically, we introduce two complementary techniques, class-agnostic and class-specific, to identify harder subsets from the target dataset using the model's internal and output representations (Section 4.1).
Further, we theoretically and empirically show that HASTE transferability metrics inherit the properties of their baseline metrics and achieve tighter lower and upper bounds (Section 4.2). We perform experiments across a range of transfer learning tasks like source architecture selection (Section 5.1), target dataset selection (Section 5.2), and ensemble model selection (Section 5.4), as well as on other tasks such as semantic segmentation (Section 5.3) and language models (Section 5.5). Our results show that HASTE scores correlate better with the actual transfer accuracy than their corresponding counterparts (Nguyen et al., 2020; Tran et al., 2019a; Pándy et al., 2022). Finally, we establish that our findings are agnostic to the choice of source architecture for identifying harder subsets, that they scale to transfer learning tasks for different data domains, and that utilizing the hardest subsets can be highly beneficial for estimating transferability.

2. RELATED WORK

This work lies at the intersection of transfer learning and diverse metrics to estimate transferability from a source model to a target dataset. We discuss related works for each of these topics below. Transfer Learning (TL). TL can be organized into three broad categories: i) Inductive Transfer (Erhan et al., 2010; Yosinski et al., 2014), which leverages inductive bias, ii) Transductive Transfer, which is commonly known as Domain Adaptation (Wang & Deng, 2018; Wilson & Cook, 2020), and iii) Task Transfer (Zamir et al., 2018a; Pal & Balasubramanian, 2019), which transfers between different tasks instead of models. Among these, the most common form of transfer learning is fine-tuning a pre-trained source model for a given target dataset. For instance, recent works have demonstrated the use of large-scale pre-trained models such as CLIP (Radford et al., 2021a) and VirTex (Desai & Johnson, 2021) for learning representations for different source tasks. Transferability Metrics. Despite the development of a plethora of source models, achieving an optimal transfer for a given target task is still a nascent research area, as it is non-trivial to identify the source model or dataset for efficient TL. Transferability metrics are used as proxy scores to estimate the transferability from a source to a target task. Prior works have proposed diverse metrics to estimate TL accuracy. For instance, NCE (Tran et al., 2019a) and LEEP (Nguyen et al., 2020) utilize the labels in the source and target task domains to estimate transferability. Further, metrics like H-Score (Bao et al., 2019), GBC (Pándy et al., 2022), and TransRate (Huang et al., 2022) use the embeddings from the source model to estimate transferability. In contrast to the above metrics that focus on a single source model, Agostinelli et al. (2022b) explored metrics to estimate the transferability of an ensemble of models and introduced two metrics, MS-LEEP and E-LEEP, for identifying a subset of model ensembles from the pool of available source models.

3. PRELIMINARIES

Probability Estimations. Let the source model $f_\theta^s$ output softmax scores over the source label space $\mathcal{Z}$. We construct a "source label distribution" of the target dataset over $\mathcal{Z}$ by passing the target samples through $f_\theta^s$ and use it to build an empirical joint distribution over the source and target label spaces, i.e.,

$$P(y, z) = \frac{1}{n} \sum_{i: y_i = y} f_\theta^s(x_i)_z,$$

where $n$ is the number of target samples $(x_i, y_i)$ and $f_\theta^s(x_i)_z$ represents the softmax score of an instance $x_i$ for class $z \in \mathcal{Z}$. Finally, the empirical marginal and conditional distributions can be computed using

$$P(z) = \sum_{y \in \mathcal{Y}} P(y, z), \qquad P(y \mid z) = \frac{P(y, z)}{P(z)},$$

where $\mathcal{Y}$ denotes the target label space for dataset $D_t$.

Transferability Metric. Following prior works, the performance of a transferability metric is evaluated by measuring the correlation between the transferability score $T_{s \to t}$ and the transfer accuracy $A_{s \to t}$. Further, we focus on the fine-tuning style of transfer learning. Here, the final source classification layer of the source model $f_\theta^s$ is replaced with the target classification layer, and the whole model is trained on the target dataset task.
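The probability estimates above can be illustrated with a small sketch. It is not the authors' implementation: the function name `empirical_distributions`, the nested-list layout of the softmax scores, and the toy values are assumptions made purely for illustration.

```python
from collections import defaultdict


def empirical_distributions(softmax_scores, target_labels):
    """Return (joint, marginal, conditional) for toy inputs.

    softmax_scores[i][z] is the source model's softmax score for target
    sample i and source class z; target_labels[i] is its target label y_i.
    """
    n = len(target_labels)
    num_source_classes = len(softmax_scores[0])

    # Joint: P(y, z) = (1/n) * sum of softmax score z over samples labeled y.
    joint = defaultdict(lambda: [0.0] * num_source_classes)
    for scores, y in zip(softmax_scores, target_labels):
        for z, s in enumerate(scores):
            joint[y][z] += s / n

    # Marginal: P(z) = sum over target labels y of P(y, z).
    marginal = [sum(joint[y][z] for y in joint) for z in range(num_source_classes)]

    # Conditional: P(y | z) = P(y, z) / P(z), guarded against P(z) = 0.
    conditional = {
        y: [joint[y][z] / marginal[z] if marginal[z] > 0 else 0.0
            for z in range(num_source_classes)]
        for y in joint
    }
    return dict(joint), marginal, conditional


# Hypothetical toy data: 4 target samples, 3 source classes, 2 target classes.
scores = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]
labels = ["cat", "cat", "dog", "dog"]
joint, marginal, conditional = empirical_distributions(scores, labels)
```

Because every softmax vector sums to one, the marginal also sums to one, and for each source class the conditional sums to one over the target labels.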

4. OUR METHOD: HASTE

Next, we describe HASTE, a meta-transferability metric that improves transferability estimates by leveraging the harder subsets of the target data. We first discuss two complementary techniques (class-agnostic and class-specific; Sec. 4.1) to identify harder subsets and show that they can be applied to any of the existing transferability metrics. Next, we theoretically and empirically show that HASTE transferability metrics inherit the properties of their baseline metric (Sec. 4.2). Problem Statement (Transferability Metric). Given a source model $f_\theta^s$, source dataset $D_s$, and target dataset $D_t$, a transferability metric aims to output a score $T_{s \to t}$ that correlates with the accuracy $A_{s \to t}$ of the target model $f_\theta^{s \to t}$.

4.1. CALCULATING HARDNESS

Here, we define the methods for identifying harder subsets in the target dataset, where one method uses the overall data distribution, and the other controls individual samples/classes to provide more representation of the dataset.

Class-Agnostic Method. The class-agnostic method uses the representation similarity between the samples in the source and target datasets to compute hardness scores. In particular, it employs embeddings from multiple layers of the source model and compares them for the source and target samples using cosine similarity, i.e.,

$$\psi(x_i^s, x_j^t) = \frac{1}{L} \sum_{l=1}^{L} E_l(x_i^s) \cdot E_l(x_j^t), \tag{1}$$

where $\psi$ represents the similarity between a pair of source $x_i^s$ and target $x_j^t$ samples, $E_l(\cdot)$ is the intermediate output from the $l$-th layer of $f_\theta^s$, and $L$ is the total number of layers in $f_\theta^s$. Next, we calculate an activation similarity matrix $S \in \mathbb{R}^{M \times N}$, where $S_{ij} = \psi(x_i^s, x_j^t)$, $M = |D_s^{train}|$, and $N = |D_t^{train}|$. Using the pairwise similarity matrix $S$, we compute the hardness score for a target image, where samples closer to the source dataset obtain lower hardness scores, and vice versa:

$$H(x_j^t)_{CA} = 1 - \frac{1}{M} \sum_{i=1}^{M} S_{ij}. \tag{2}$$

Class-Specific Method. In contrast to the class-agnostic strategy, which does not utilize label information of the target dataset, we introduce a class-specific technique to identify harder subsets by controlling the target classes (as they provide more representation of the dataset). Following Pándy et al. (2022), we model each target class $c$ as a normal distribution in the embedding space of $f_\theta^s$ and define the mean and covariance of the distribution as

$$\mu_c = \frac{1}{N_c} \sum_{i: y_i^t = c} f_\theta^s(x_i^t); \qquad \Sigma_c = \frac{1}{N_c} \sum_{i: y_i^t = c} \big(f_\theta^s(x_i^t) - \mu_c\big)\big(f_\theta^s(x_i^t) - \mu_c\big)^\top, \tag{3}$$

where $N_c$ is the number of samples in class $c$. For each target sample $x_j^t$ with $y_j^t = c$, the hardness is defined as the Mahalanobis distance of the sample from the mean of the corresponding class distribution (Equation 3), i.e.,

$$H(x_j^t)_{CS} = \big(f_\theta^s(x_j^t) - \mu_c\big)^\top \Sigma_c^{-1} \big(f_\theta^s(x_j^t) - \mu_c\big). \tag{4}$$

Next, we use the above-mentioned techniques to identify hard subsets from the target dataset.
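As a rough illustration of the class-agnostic hardness score, the sketch below averages per-layer cosine similarities between every source and target sample and subtracts the mean similarity from one. It is a minimal sketch, not the authors' code: the helper names, the explicit L2 normalization (assumed here so that the dot product equals cosine similarity), and the toy embeddings are all assumptions.

```python
import math


def l2_normalize(v):
    # Assumption: embeddings are L2-normalized so a dot product is a cosine similarity.
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def layer_similarity(src_layers, tgt_layers):
    """psi: mean over layers of the cosine similarity between two samples."""
    sims = []
    for e_s, e_t in zip(src_layers, tgt_layers):
        e_s, e_t = l2_normalize(e_s), l2_normalize(e_t)
        sims.append(sum(a * b for a, b in zip(e_s, e_t)))
    return sum(sims) / len(sims)


def class_agnostic_hardness(source_embs, target_embs):
    """Hardness of each target sample: 1 minus its mean similarity to all source samples."""
    scores = []
    for tgt in target_embs:
        mean_sim = sum(layer_similarity(src, tgt) for src in source_embs) / len(source_embs)
        scores.append(1.0 - mean_sim)
    return scores


# Hypothetical toy data: each sample is a list of per-layer embedding vectors (2 layers).
source_embs = [
    [[1.0, 0.0], [1.0, 0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0, 0.0]],
]
target_embs = [
    [[1.0, 0.0], [1.0, 0.0, 0.0]],  # resembles the source samples
    [[0.0, 1.0], [0.0, 0.0, 1.0]],  # orthogonal to every source sample
]
hardness = class_agnostic_hardness(source_embs, target_embs)
```

In this toy setup the second target sample, which is orthogonal to all source embeddings, receives the larger hardness score, matching the intuition that samples far from the source data are harder.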

HASTE.

To identify hard subsets, we sort the target dataset samples using either of the hardness scores defined in Equations 2 and 4. We denote the indices of the sorted samples by $\{q_1, q_2, \ldots, q_N\}$. The hard subset is then defined as

$$D_t^{hard} = \{(x_{q_1}^t, y_{q_1}^t), \ldots, (x_{q_k}^t, y_{q_k}^t)\}; \quad k \le N,$$

where the hardness of the samples follows $H(x_{q_1}^t) \ge H(x_{q_2}^t) \ge \ldots \ge H(x_{q_N}^t)$. To use our HASTE modification with the existing metrics, we propose using only the identified harder subset $D_t^{hard}$ as the input to these metrics for estimating transferability, i.e.,

$$\text{HASTE} = T(f_\theta^s, D_t^{hard}),$$

where $T(\cdot)$ denotes any existing transferability metric. See Algorithm 1 for details on obtaining harder subsets using the class-specific or class-agnostic methods.

Finally, we show the t-SNE (van der Maaten & Hinton, 2008) embeddings (Figure 2) of the entire target dataset and its harder subsets. We observe that the embeddings from the entire dataset (Figure 2a) are well segregated, but embeddings of samples in the harder subsets are highly entangled (Figure 2b-2c). These observations align with our findings in Figure 1, where images from harder subsets achieve lower transfer accuracies, i.e., the source models struggle to find the decision boundaries between these harder samples.
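The selection step described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `hardest_subset` and `haste` are hypothetical names, the subset size is expressed as a fraction of the dataset (with 0.25 chosen only as an example default), and `toy_metric` is a stand-in for a real transferability metric such as LEEP.

```python
def hardest_subset(samples, hardness, fraction=0.25):
    """Return the k hardest samples, sorted by decreasing hardness score."""
    k = max(1, int(fraction * len(samples)))
    order = sorted(range(len(samples)), key=lambda j: hardness[j], reverse=True)
    return [samples[j] for j in order[:k]]


def haste(metric, samples, hardness, fraction=0.25):
    """Apply any transferability metric to the hard subset only."""
    return metric(hardest_subset(samples, hardness, fraction))


def toy_metric(subset):
    # Stand-in metric (an assumption): fraction of samples with label 1.
    return sum(1 for _, y in subset if y == 1) / len(subset)


# Hypothetical toy data: (input, label) pairs with precomputed hardness scores.
samples = [("a", 0), ("b", 1), ("c", 0), ("d", 1)]
hardness = [0.1, 0.9, 0.4, 0.7]

hard = hardest_subset(samples, hardness, fraction=0.5)
score = haste(toy_metric, samples, hardness, fraction=0.5)
```

Keeping the metric as a plain callable mirrors the idea that the hard-subset step is independent of which baseline metric is applied afterwards.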

4.2. THEORETICAL PROPERTIES

Here, we show that HASTE-modified metrics inherit the theoretical properties of their baseline metric. Note that showing theoretical bounds for all transferability metrics is outside the scope of this work. Hence, we take one representative metric (LEEP) and show that HASTE-LEEP retains its theoretical properties.

LEEP. Let the source model $f_\theta^s$ predict the target label $y$ by directly drawing from the label distribution $p(y \mid x; f_\theta^s, D_t^{train}) = \sum_{z \in \mathcal{Z}} P(y \mid z) f_\theta^s(x)_z$. The LEEP score is then defined as the average log-likelihood:

$$\text{LEEP} = T(f_\theta^s, D_t^{train}) = \frac{1}{n} \sum_{i=1}^{n} \log \Big( \sum_{z \in \mathcal{Z}} P(y_i \mid z) f_\theta^s(x_i)_z \Big). \tag{6}$$

Average log-likelihood. We fix the source model weights $\theta$ and re-train the classification model using maximum likelihood and the target dataset $D_t^{train}$ to obtain a new classifier $f_\theta^*$, i.e.,

$$f_\theta^* = \arg\max_{k \in K} l(\theta, k), \tag{7}$$

where $l(\theta, k)$ is the average log-likelihood for the weights $\theta$ and classifier $k$ on the target dataset $D_t^{train}$, $k$ is selected from a space of classifiers $K$, and $k^*$ denotes the maximizer.

Lemma 1. HASTE-LEEP is a lower bound of the optimal average log-likelihood for the hard subset:

$$T(f_\theta^s, D_t^{hard}) \le l(\theta, k^*)_{hard} \le l(\theta, k^*). \tag{8}$$

Proof. The result follows by definition, since $D_t^{hard} \subset D_t^{train}$ represents the hard subset of the target dataset. Note that $l(\theta, k^*)$ is the maximal average log-likelihood over $k \in K$, and $T(f_\theta^s, D_t^{train})$ is the average log-likelihood of a classifier in $K$. From Nguyen et al. (2020), we know that $T(f_\theta^s, D_t^{train}) \le l(\theta, k^*)$ and, by the definition of $D_t^{hard}$, $T(f_\theta^s, D_t^{hard}) \le l(\theta, k^*)$. In addition, the model struggles to learn the samples in the hard subset, and hence $l(\theta, k^*)_{hard} \le l(\theta, k^*)$.

Lemma 2. HASTE-LEEP is an upper bound of the NCE measure plus the average log-likelihood of the source label distribution, computed over the hard subset, i.e.,

$$T(f_\theta^s, D_t^{hard}) \ge \text{H-NCE}(Y \mid Z) + \frac{1}{|D_t^{hard}|} \sum_{i=1}^{|D_t^{hard}|} \log f_\theta^s(x_i)_{z_i}. \tag{9}$$

Proof Sketch. This proof extends from the property of LEEP. See Appendix A.1 for the detailed proof.

Empirical Analysis. We evaluated the upper and lower bounds for HASTE-LEEP by computing the right-hand sides of Equations 8-9. In Figure 3, our results show HASTE-LEEP and its corresponding theoretical upper and lower bounds, confirming that, across seven source model architectures, none of our theoretical bounds are violated. In addition, we empirically demonstrate that our bounds are tighter than those of LEEP.
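For concreteness, the LEEP score discussed in this section can be sketched in a few lines. This is a minimal sketch that assumes strictly positive softmax scores and integer target labels; the function name `leep` and the toy inputs are illustrative and not taken from the authors' code. Applying the same function to only the hard subset would give the corresponding HASTE-LEEP value.

```python
import math


def leep(softmax_scores, target_labels):
    """Average log-likelihood of the target labels under P(y | z) weighted by softmax scores."""
    n = len(target_labels)
    num_z = len(softmax_scores[0])
    classes = sorted(set(target_labels))

    # Empirical joint P(y, z) and marginal P(z).
    joint = {y: [0.0] * num_z for y in classes}
    for s, y in zip(softmax_scores, target_labels):
        for z in range(num_z):
            joint[y][z] += s[z] / n
    marginal = [sum(joint[y][z] for y in classes) for z in range(num_z)]

    # Empirical conditional P(y | z); assumes every P(z) is positive.
    cond = {y: [joint[y][z] / marginal[z] for z in range(num_z)] for y in classes}

    # Mean over samples of log sum_z P(y_i | z) * f(x_i)_z.
    total = 0.0
    for s, y in zip(softmax_scores, target_labels):
        total += math.log(sum(cond[y][z] * s[z] for z in range(num_z)))
    return total / n


# Hypothetical toy inputs: uninformative (uniform) source predictions on balanced labels.
uniform_score = leep([[0.5, 0.5]] * 4, [0, 1, 0, 1])

# Hypothetical toy inputs: source predictions that separate the two target classes.
informative_score = leep(
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]],
    [0, 0, 1, 1],
)
```

Since each per-sample term is the logarithm of a probability, the score is never positive; with uniform source predictions it reduces to the mean log prior of the target labels.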

5. EXPERIMENTS

Next, we present experimental results to show the effectiveness of HASTE-modified transferability metrics for different transfer learning tasks, including source architecture selection (Sec. 5.1), target dataset selection (Sec. 5.2), semantic segmentation (Sec. 5.3), ensemble model selection (Sec. 5.4), and language domain transferability (Sec. 5.5). Evaluation metrics and Baselines. We use the Pearson Correlation Coefficient (PCC) to measure the correlation between $T_{s \to t}$ and $A_{s \to t}$ (see Tables 10-11 for results with other correlation coefficients). For baselines, we use LEEP, NCE, and GBC for single-model transferability tasks, and MS-LEEP and E-LEEP for ensemble model selection. See Appendix A.2 for more details.
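The evaluation protocol above reduces to computing a Pearson correlation between a list of transferability scores and the matching list of transfer accuracies. The sketch below shows that computation with the standard library only; the function name `pearson` and all numeric values are hypothetical and are not results from this paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical transferability scores and fine-tuned accuracies for four source models.
transferability_scores = [-1.2, -0.9, -0.7, -0.4]
transfer_accuracies = [0.61, 0.70, 0.74, 0.83]

pcc = pearson(transferability_scores, transfer_accuracies)
```

A value close to 1 means the metric ranks source models in nearly the same order as their fine-tuned accuracies.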

5.1. SOURCE ARCHITECTURE SELECTION

Experimental setup. As detailed in Pándy et al. (2022), the target dataset is fixed and the scores $T_{s \to t}$ are computed over multiple source architectures. The correlation scores are computed between $T_{s \to t}$ and the transfer accuracies $A_{s \to t}$. We consider seven target datasets for our source architecture experiments: i) Caltech101 (Fei-Fei et al., 2004), ii) CUB200 (Welinder et al., 2010), iii) Oxford-IIIT Pets (Parkhi et al., 2012), iv) Flowers102 (Nilsback & Zisserman, 2008), v) Stanford Dogs (Khosla et al., 2011), vi) Imagenette (Howard), and vii) PACS-Sketch (Li et al., 2017). Model architectures and Training. We consider seven source architectures pre-trained on the ImageNet (Russakovsky et al., 2015) dataset, including ResNet-50, ResNet-101, ResNet-152 (He et al., 2016), DenseNet-121, DenseNet-169, DenseNet-201 (Huang et al., 2017), and MobileNetV2 (Sandler et al., 2018). All models were initialized using the publicly available pre-trained weights from the Torchvision library (Marcel & Rodriguez, 2010). For each source architecture, we use the ResNet-50 model to calculate the hardness ranking and, hence, the hard subset used to compute HASTE scores. Following Pándy et al. (2022), we calculate the target accuracy $A_{s \to t}$ by fine-tuning the source model on each target dataset. We fine-tune the source model for 100 epochs using an SGD optimizer with a momentum of 0.9, a learning rate of $10^{-4}$, and a batch size of 64. Results. On average, across seven target datasets, HASTE-modified metrics change the correlation scores by +129.74% for LEEP, +29.38% for NCE, and -0.07% for GBC (Table 1). Interestingly, for most target datasets, both CA and CS variants of the HASTE metrics outperform the baseline scores.

5.2. TARGET DATASET SELECTION

Experimental setup. Here, the source model is fixed and the transferability metric is computed over multiple target datasets (Nguyen et al., 2020). We construct 50 target datasets by randomly selecting a subset of classes from the original target dataset. The PCC is computed between $T_{s \to t}$ and $A_{s \to t}$ across all 50 target tasks, where each target subset contains 40% to 100% of the total classes, and for each class, all train and test images are included in the subset. We consider six target datasets: Caltech101, CUB200, Oxford-IIIT Pets, Flowers102, Stanford Dogs, and PACS-Sketch. Model architectures and Training. We consider two source models: ResNet-18 pre-trained on CUB200 and ResNet-34 pre-trained on Caltech101. We train the transferred models for 100 epochs using SGD with a momentum of 0.9, a learning rate of $10^{-3}$, and a batch size of 64. Results. Across two source datasets and four target datasets, HASTE-LEEP achieves the highest correlation for the target selection task, and HASTE-modified metrics outperform their respective baseline methods (Table 2). In particular, we observe an improvement in correlation scores of +0.99% for LEEP, +1.15% for NCE, and +5.11% for GBC.

5.3. SEMANTIC SEGMENTATION

Experimental setup. We follow the fixed target setting described in Pándy et al. (2022) and report the correlation between meanIoU and $T_{s \to t}$ for each target dataset. We consider a Fully Convolutional Network (FCN) (Long et al., 2014) with a ResNet-50 backbone pre-trained on a subset of COCO2017 (Lin et al., 2014). We consider the CityScapes (Cordts et al., 2016), CamVid (Brostow et al., 2009), BDD100k (Yu et al., 2018), IDD (Varma et al., 2019), PascalVOC (Li et al., 2020), and SUIM (Islam et al., 2020) datasets. Among them, we use CityScapes, CamVid, BDD100k, and SUIM as target datasets. Note that we use the CA variant of HASTE for semantic segmentation as segmentation does not have class labels. We train the individual models for 100 epochs using SGD with a momentum of 0.9, a weight decay of $10^{-4}$, a batch size of 16, and a learning rate of $10^{-3}$, which is reduced on plateau by a factor of 0.5. Each model is fine-tuned on the target dataset independently, using SGD with a momentum of 0.9, a learning rate of $10^{-3}$, and a batch size of 16. Results. On average across four target datasets, results show that HASTE-modified metrics outperform their baseline methods (Table 3). In particular, we observe an improvement in correlation scores of +182.23% for LEEP, +33.34% for NCE, and +149.30% for GBC.

5.4. ENSEMBLE MODEL SELECTION

Experimental setup. Given a pool of $P$ source models, this task aims to select the subset of models whose ensemble yields the best performance on a fixed target dataset (Agostinelli et al., 2022b). Since evaluating every ensemble combination of the $P$ source models is very expensive, the ensemble size $K$ (i.e., the number of models per ensemble) is fixed, which yields $\binom{P}{K}$ candidate ensembles. The PCC is then computed between $T_{s \to t}$ and $A_{s \to t}$ across all candidate ensembles. We use $K=4$ and $P=11$ in our experiments and consider the target datasets from Section 5.2. Model architectures and Training. We include source models pre-trained on the above datasets as well as ImageNet. Each ensemble of model architectures consists of one or more models from the pool of ResNet-101, VGG-19 (Simonyan & Zisserman, 2015), and DenseNet-201, with each model pre-trained on the mentioned datasets. For a given candidate ensemble, each member model is fine-tuned individually on a fixed target train dataset $D_t^{train}$, and, finally, the ensemble prediction is calculated as the mean of all individual predictions. Each model is fine-tuned on the target dataset independently, using SGD with a momentum of 0.9, a learning rate of $10^{-4}$, and a batch size of 64. Results. Table 4 shows that, on average across five target datasets, HASTE-modified metrics outperform their baseline methods. In particular, we observe an improvement in correlation scores of +14.43% for MS-LEEP and +4.10% for E-LEEP.
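To make the size of the search space concrete, the short sketch below counts the candidate ensembles for the pool size and ensemble size used in this section and averages member predictions as described. The helper name `ensemble_prediction` and the toy probability vectors are assumptions for illustration only.

```python
import math
from itertools import combinations

P, K = 11, 4  # pool size and ensemble size used in the experiments

# Number of candidate ensembles: P choose K.
num_candidates = math.comb(P, K)

# Enumerating the candidates explicitly as index tuples into the model pool.
candidates = list(combinations(range(P), K))


def ensemble_prediction(member_probs):
    """Mean of the members' class-probability vectors for one input."""
    num_members = len(member_probs)
    num_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / num_members for c in range(num_classes)]


# Hypothetical predictions of two ensemble members on a two-class problem.
avg = ensemble_prediction([[0.2, 0.8], [0.6, 0.4]])
```

With these settings there are 330 candidate ensembles, each of which would require fine-tuning its members to obtain the ground-truth accuracy, which is why a cheap proxy score is attractive.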

5.5. ADDITIONAL RESULTS ON LANGUAGE MODELS

Experimental setup. We now evaluate the performance of HASTE on sentiment classification transfer learning tasks and show results in the target dataset selection setting. We consider three target datasets for our language experiments: TweetEval (Barbieri et al., 2018), IMDB Movie Reviews (Maas et al., 2011), and CARER (Saravia et al., 2018). Model architecture and Training. We include source models trained using a classification head on a pre-trained BERT (Devlin et al., 2019) model on the CARER (Saravia et al., 2018) and AG-News (Zhang et al., 2015) datasets. We fine-tune the entire source model, including the BERT layers, for 3 epochs using the Adam optimizer with a learning rate of $5 \times 10^{-5}$ and a batch size of 8. Results. The results in Table 5 show that HASTE-modified transferability metrics outperform their baseline counterparts. On average across four source-target pairs and two techniques, we observe an improvement of +38.13% for LEEP, +33.40% for NCE, and +57.24% for GBC using HASTE metrics.

5.6. ABLATION STUDIES

We conduct ablations on two key components of HASTE-modified transferability metrics: i) the size of the hardest subset and ii) correlation estimates using different buckets. We also study the impact of different architectures for computing hardness on the performance of the HASTE metrics (see Appendix A.4). Bucket Size Analysis. Here, we follow the experimental setup from source architecture selection (Section 5.1) and choose different sizes of the hardest bucket (or simply hardest subset). We vary the bucket size $b$ by using different fractions of the entire dataset and report the results for $b \in \{0.01, 0.03, 0.1, 0.2, 0.25, 0.33, 0.5\}$. We find that, on average, the correlation performance increases as we increase the bucket size (Figure 4a). Further, bucket sizes in the range $b \in [0.2, 0.4]$ generally give the best transferability estimates. This is intuitive because, for large bucket sizes, the influence of truly hard samples may be diluted by easier samples, which goes against the notion of HASTE, while for very small bucket sizes, the number of samples may not be enough for metrics like LEEP to show their effectiveness. Transferability along Hard to Easy Buckets. A key question in HASTE is to understand the effect of different subsets (depending on their easiness or hardness) on estimating transferability. Here, we follow the experimental setup from source architecture selection (Section 5.1), calculate HASTE-LEEP using different buckets, and compare it with the baseline LEEP score using the entire dataset. Results show that transferability estimates are the best for harder subsets and gradually degrade while moving towards easier subsets (Figure 4b), confirming the core hypothesis of HASTE.

6. CONCLUSION

We propose and address the problem of estimating transferability from a source to the target domain using examples from the harder subset of the target dataset. To this end, we introduce HASTE (HArd Subset TransfErability), which leverages class-agnostic and class-specific strategies to identify harder subsets from a target dataset and can be used with any existing transferability metric. We show that HASTE-modified transferability metrics outperform their counterparts across different transfer learning tasks, data modalities, models, and datasets. In contrast to the findings in Agostinelli et al. (2022b), i.e., that one metric does not work for all transfer learning tasks, we show that HASTE metrics achieve favorable results across diverse transfer learning settings (Sec. 5). Hence, we anticipate that using HASTE could open new frontiers in estimating transferability and pave the way for several exciting future directions, like developing new techniques to identify harder subsets and extending HASTE analysis to other transfer learning tasks.

HASTE-E-LEEP, we need a single common subset as it involves the calculation of adding empirical probabilities followed by log and mean. To this end, we take the union of the hard subsets obtained from different sources and then proceed with further calculations. We report the results in Table 4 following the same.

Models used in Ensemble Selection. We use the following pool of source models for ensemble selection experiment described in Section 5.

Table 6: Results on target task selection using the fine-tuning method for CUB200 source models. Shown are correlation scores (higher the better) computed across all target datasets. Results where HASTE-modified metrics outperform their baselines are bolded.
A.4 ABLATION ON HARDNESS SOURCE ARCHITECTURE

HASTE aims to achieve better transferability estimates irrespective of the source of the hardness scores, i.e., the source architecture we use to calculate harder subsets in the class-agnostic or class-specific way. We follow the experimental setup from the target task selection (Section 5.2) experiments. We calculate HASTE-LEEP scores on harder subsets identified using i) ResNet18 and VGG19 trained on CUB200, and ii) ResNet50 trained on ImageNet. Results show that HASTE-LEEP outperforms LEEP (baseline calculated using the entire dataset) across all three architectures (Table 8).

A natural question which may arise when using HASTE is whether to completely neglect the easier samples. To this end, we conduct an experiment where we add easier samples stochastically to our hard subsets and then compute the respective metric correlation score. We follow the same experiment setting as Section 1. We obtain results by repeating the addition of these easier samples 10 times and taking the mean. We observe that the addition of these easier samples does not particularly enhance the results. In fact, the results are worse than when using only hard subsets. Results are shown in Table 9.

Table 9: Results on source architecture selection task for subset obtained by stochastically adding easier samples to the hard subset. Shown are correlation scores (higher the better) computed across all source architectures trained on ImageNet. Results where HASTE-modified metrics outperform their baselines are in bold.

We include results on LogME as an additional baseline metric. The results cover two experimental settings: Source Architecture Selection (Table 12) and Target Task Selection (Section 5.2). On average, HASTE-LogME shows an improvement of 120.53% in the source architecture selection experiment, and an improvement of 236.16% in the target task selection experiment.
A.8 DISCUSSION ON TASK TRANSFERABILITY WORKS

Our work focuses on the problem of estimating the transferability of a source model on the target dataset prior to fine-tuning. Formally, this seeks to address two downstream tasks: i) of all the source models, find the most suitable to perform transfer learning on a given target dataset, and ii) of all the target datasets, find the most suitable to perform transfer learning on a given source model. Thus, fine-tuning all the possible models or datasets is not a feasible solution here. In stark contrast, some recent transferability works (Zamir et al., 2018b; Dwivedi & Roig, 2019; Song et al., 2020; 2019) consider models that are pre-trained on one or more tasks, and some further transfer these models to another task, requiring the expensive fine-tuning process. These works only discuss task transferability, i.e., transferring across computer vision tasks such as classification to semantic segmentation, semantic segmentation to depth prediction, etc. More precisely, these works establish task relatedness (how similar one task is to another) rather than proposing a transferability metric, and task relatedness is not the focus of our work. Further, these works are not generalizable as they either perform fine-tuning from scratch or have computational costs similar to fine-tuning, which renders these approaches infeasible for our problem. In addition, Zhang et al. (2021a) quantified transferability for the task of Domain Generalization, while Tong et al. (2021) discussed transferability for multi-source transfer, both of which operate in a setting quite different from ours.

A.9 MOTIVATION BEHIND HARDNESS METRIC

Our approach for measuring hardness is motivated by recent works that have shown that the process of transfer learning shows maximum gains when the images from the source and target tasks are in similar domains.
Further, these recent works measure domain similarity as a function of the distance between source and target samples. In addition to these findings from previous works,

Table 13: Results on target task selection using the fine-tuning method for Caltech101 source models. Shown are correlation scores (higher the better) computed across all target datasets. Results where HASTE-modified metrics perform better than their counterparts are in bold.



Code: https://anonymous.4open.science/r/haste/



Figure 1: Analyzing the impact of hard subsets in transfer learning. Column (a): Results show the accuracy of different bins of a target dataset (Caltech101) based on their hardness. Across two source models (VGG-19 and ResNet-18) trained on the ImageNet dataset, we observe that the accuracy for images in the hardest subset (B1) is lower than for the easier subset (B5). Column (b): Top-10 images from hard and easy subsets show that harder subsets comprise images (cliparts) that are out-of-distribution when compared to the source dataset images. See Figures 5-9 for more qualitative images for different source-target pairs.

Figure 2: t-SNE embeddings of the entire target dataset and its hardest subset using a ResNet-50 source model trained on ImageNet. We show the embeddings from five random classes from Stanford Dogs as the target dataset, using both the class-agnostic (b) and class-specific (c) methods. We observe that the embeddings from the harder subset are more entangled than those of the entire dataset.

Figure 3: Empirical results on the StanfordDogs target dataset show no violations of our theoretical bounds. Empirically calculated HASTE-LEEP (in blue) and our theoretical upper (in purple) and lower (in green) bounds from Equations 8-9 across seven source model architectures trained on the ImageNet dataset, where RN = ResNet, MN = MobileNet, DN = DenseNet.

Model architectures and training. We train an FCN ResNet-50 (Long et al., 2014) model for each source training dataset (except the target) and individually fine-tune them on $D_t^{\text{train}}$

Figure 4: Ablation results for HASTE. Left: LEEP results (y-axis) when varying the size of the hard subset as a fraction of the complete dataset (x-axis); a fraction of 25%-33% gives the best LEEP score. Right: LEEP results (y-axis) for hard-to-easy buckets (x-axis); the transferability scores are the highest for Bucket 1 (hardest).

Figure 5: The 5 × 5 grid shows the top-25 images from the easy (left) and hard (right) subset of the target dataset using the class-agnostic technique for the ImageNet-StanfordDogs source-target pair. Images with higher hardness scores tend to be cluttered and taken from atypical vantage points, whereas images with lower hardness scores mostly show dogs against an uncluttered background.

Figure 8: The 5 × 5 grid shows the top-25 images from the easy (left) and hard (right) subset of the target dataset using the class-specific technique for the ImageNet-StanfordDogs source-target pair. Images with higher hardness scores tend to feature classes that are typically harder to classify (since they may have very few distinguishing features), whereas images with lower hardness scores mostly comprise classes that are easily distinguishable.

s→t when evaluated on the unseen target test dataset $D_t^{\text{test}}$. Even though it starts from a fine-tuned source model, training target models is computationally expensive. Hence, we define a transferability metric $T_{s \to t}$ which correlates with the target model accuracy $A_{s \to t}$ and gives an efficient estimate of how transfer learning will unfold for a given pair of source model and target dataset.
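As a rough illustration of what "correlates with the target model accuracy" means in practice, the sketch below computes Pearson and Kendall Tau (tau-a) correlations between a list of transferability scores and the corresponding fine-tuned accuracies. The function names and all numeric values are hypothetical and chosen only for illustration; they are not results from this paper, and the paper's exact evaluation code may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(xs)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


# Hypothetical transferability scores T_{s->t} and accuracies A_{s->t}
# for four candidate source models (illustrative values only).
scores = [-0.9, -0.7, -0.4, -0.3]
accuracies = [0.61, 0.70, 0.68, 0.82]

print("Pearson:", pearson(scores, accuracies))
print("Kendall tau:", kendall_tau(scores, accuracies))
```

A metric whose ranking of candidates agrees with the ranking by accuracy obtains a correlation close to 1, which is why higher correlation scores are better in the result tables.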

    ) for each target image using $S_{ij}$
else if $h_v$ = 'cs' then    ▷ Class-Specific Case
    Collect target dataset activations $E_l(x_j^t)$
    for class $c \in D_t$ do
        Compute $\mu_c$, $\Sigma_c$    ▷ As per Eqn. 3
    end for
    Compute Hardness $H(x_j^t)$ for each target image using Mahalanobis Distance from $\mu_c$
end if
$D_{hard}$
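The class-specific branch above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `class_specific_hardness` and `hard_subset_indices` are hypothetical, the covariance is estimated per class with a small ridge term `eps` added for invertibility (an assumption, since Eqn. 3 is not reproduced here), and the hard subset is taken as the top fraction of samples by hardness.

```python
import numpy as np


def class_specific_hardness(features, labels, eps=1e-6):
    """Mahalanobis distance of each sample from its own class mean.

    features: (n, d) array of target activations (assumed precomputed).
    labels:   (n,) array of target class labels.
    eps:      ridge term keeping each covariance invertible (assumption).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    hardness = np.zeros(len(features))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centered = features[idx] - features[idx].mean(axis=0)
        # Per-class covariance estimate, regularized for stability.
        sigma_c = centered.T @ centered / len(idx)
        sigma_c += eps * np.eye(features.shape[1])
        sigma_inv = np.linalg.inv(sigma_c)
        # Squared Mahalanobis distance for every sample of class c.
        d2 = np.einsum("ij,jk,ik->i", centered, sigma_inv, centered)
        hardness[idx] = np.sqrt(np.maximum(d2, 0.0))
    return hardness


def hard_subset_indices(hardness, fraction=0.25):
    """Indices of the hardest `fraction` of samples, hardest first."""
    hardness = np.asarray(hardness)
    k = max(1, int(round(fraction * len(hardness))))
    return np.argsort(-hardness)[:k]


# Toy example: one class with four clustered points and one outlier.
feats = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [5, 5]], dtype=float)
labs = np.zeros(5, dtype=int)
h = class_specific_hardness(feats, labs)
print(hard_subset_indices(h, fraction=0.2))  # index of the hardest sample
```

Any existing transferability metric can then be evaluated on the selected indices only, which is the sense in which the hard subset modifies a base metric.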

Results on source architecture selection task. Shown are correlation scores (higher the better) computed across all source architectures trained on ImageNet. Results where HASTE modified metrics perform better than their counterparts are in bold.

Results on target task selection using the fine-tuning method for Caltech101 source models. Shown are correlation scores (higher the better) computed across all target datasets. Results where HASTE modified metrics perform better than their counterparts are in bold. See Table 6 for results on CUB200 source models.

Results on the semantic segmentation source architecture selection task. Shown are correlation scores (higher the better) computed across all source architectures. Results where HASTE modified metrics perform better than their counterparts are in bold.

Results on the ensemble model selection task. Shown are correlation scores (higher the better) computed across all ensemble candidates. Results where HASTE modified metrics perform better than their counterparts are in bold. See Appendix Table 7 for results using K=3.

Results on target task selection for sentiment classification. Shown are correlation scores (higher the better) computed across all target candidates. Results where HASTE modified metrics perform better than their counterparts are in bold.

Results on the target task selection task with different source model architectures. Shown are correlation scores (higher the better) computed across the target datasets using HASTE-LEEP. Irrespective of the architecture used for finding the hard subset, HASTE-LEEP outperforms the baseline LEEP score.

Results on source architecture selection. Shown are Kendall Tau correlation scores (higher the better) computed across all source architectures trained on ImageNet. Results where HASTE modified metrics outperform their baselines are in bold.

Results on source architecture selection task. Shown are Weighted Kendall Tau correlation scores (higher the better) computed across all source architectures trained on ImageNet. Results where HASTE modified metrics outperform their baselines are in bold.

Results on source architecture selection task with LogME as the baseline. Shown are correlation scores (higher the better) computed across all source architectures trained on ImageNet. Results where HASTE modified metrics perform better than their counterparts are in bold.

our empirical analysis also confirms that the hardest samples (i.e., those with the lowest similarity) obtain lower transfer accuracies than the rest of the samples (as shown in Figure 1), which supports the efficacy of our hardness measure.

A.10 DEFINITION OF TRANSFERABILITY

Following previous works (Bao et al., 2019; Nguyen et al., 2020), we define transferability as when a transfer may work, and to what extent.

A APPENDIX

A.1 PROOF FOR LEMMA 2

Lemma 2. HASTE-LEEP is an upper bound of the NCE measure plus the average log-likelihood of the source label distribution, computed over the hard subset, i.e.,

Proof. Let $z_i$ be the dummy labels obtained when computing NCE and $y_i$ be the true labels.

A.2 EXPERIMENTAL SETUP

Implementation details. All experiments were run using the PyTorch library (Paszke et al., 2019) with NVIDIA A100/V100 GPUs.

Model Architectures. We use a variety of model architectures (VGG, ResNet, DenseNet), trained on different source datasets across our experiments. For each model architecture, we utilize embeddings from the final layer for the class-specific method (Eqn. 3). For the class-agnostic method, we utilize embeddings from intermediate layers for the similarity computation (Eqn. 1). In particular, we utilize the embeddings from the final layer of each block of convolutions (e.g., the output of each residual block in ResNets). We only include 3/4 layers for any architecture and do not consider embeddings from the first block.

Similarity Computation for Large Source Datasets. For the experiments with models pre-trained on ImageNet as the source, when using the class-agnostic method, it is infeasible to use the entire ImageNet dataset to generate the similarity matrix (using Eqn. 1). Instead, we use a random 10% subset of the ImageNet dataset, uniformly sampled from each class, as the source dataset to compute the similarity matrix. We do not observe any performance drop due to this sub-sampling, and it can be extended to other datasets as well for further computational speedup. Additionally, we do not redo the similarity computation for each source architecture. Instead, we only compute the similarity matrix using the ResNet-50 model and use the hard subset obtained from it to compute HASTE modified transferability metrics for all model architectures pre-trained on ImageNet.
We repeat this in the model ensemble setting, utilizing a single similarity matrix for all model architectures trained on the same source dataset.

GBC Implementation. In all experiments, for computing GBC and HASTE-GBC, we use a spherical covariance matrix, as we found this to yield better results, even for the base GBC score.

Size of Hard Subset. The size of the hard subset in any experiment is a hyperparameter that can be tuned. Due to variations in dataset sizes, the exact value differs significantly. Instead of fixing a size, we set the size of the hard subset to be k% of the size of the target dataset. In general, we found a value of 10%-25% to work well.

Subset for Ensemble Selection. The hard subset for the class-agnostic method depends on the source dataset. As a result, the experimental setting described in Section 5.4 has different hard subsets depending on the source dataset and model for a single target dataset. Since MS-LEEP is simply the sum of LEEP scores, the calculation of HASTE-MS-LEEP is trivial. But in the case of

