META-LEARNING TRANSFERABLE REPRESENTATIONS WITH A SINGLE TARGET DOMAIN

Abstract

Recent works found that fine-tuning and joint training, two popular approaches for transfer learning, do not always improve accuracy on downstream tasks. First, we aim to understand when and why fine-tuning and joint training can be suboptimal or even harmful for transfer learning. We design semi-synthetic datasets where the source task can be solved by either source-specific features or transferable features. We observe that (1) pre-training may have no incentive to learn transferable features and (2) joint training may simultaneously learn source-specific features and overfit to the target. Second, to improve over fine-tuning and joint training, we propose Meta Representation Learning (MeRLin) to learn transferable features. MeRLin meta-learns representations by ensuring that a head fit on top of the representations with target training data also performs well on target validation data. We also prove that MeRLin recovers the target ground-truth model with a quadratic neural net parameterization and a source distribution that contains both transferable and source-specific features. On the same distribution, pre-training and joint training provably fail to learn transferable features. MeRLin empirically outperforms previous state-of-the-art transfer learning algorithms on various real-world vision and NLP transfer learning benchmarks.

1. INTRODUCTION

Transfer learning, i.e., transferring knowledge learned from a large-scale source dataset to a small target dataset, is an important paradigm in machine learning (Yosinski et al., 2014) with wide applications in vision (Donahue et al., 2014) and natural language processing (NLP) (Howard & Ruder, 2018; Devlin et al., 2019). Because the source and target tasks are often related, we expect to be able to learn features from the source data that are transferable to the target task. These features may help learn the target task with fewer examples (Long et al., 2015; Tamkin et al., 2020).

Mainstream approaches for transfer learning are fine-tuning and joint training. Fine-tuning initializes from a model pre-trained on a large-scale source task (e.g., ImageNet) and continues training on the target task with a potentially different set of labels (e.g., object recognition (Wang et al., 2017; Yang et al., 2018; Kolesnikov et al., 2019), object detection (Girshick et al., 2014), and segmentation (Long et al., 2015; He et al., 2017)). Another enormously successful example of fine-tuning is in NLP: pre-training transformers and fine-tuning on downstream tasks leads to state-of-the-art results for many NLP tasks (Devlin et al., 2019; Yang et al., 2019). In contrast to the two-stage optimization process of fine-tuning, joint training optimizes a linear combination of the objectives of the source and the target tasks (Kokkinos, 2017; Kendall et al., 2017; Liu et al., 2019b).

Despite the pervasiveness of fine-tuning and joint training, recent works uncover that they are not always panaceas for transfer learning. Geirhos et al. (2019) found that pre-trained models learn the texture of ImageNet, which is biased and not transferable to target tasks. ImageNet pre-training does not necessarily improve accuracy on COCO (He et al., 2018), fine-grained classification (Kornblith et al., 2019), and medical imaging tasks (Raghu et al., 2019). Wu et al. (2020) observed that large model capacity and discrepancy between the source and target domains eclipse the effect of joint training. Nonetheless, we do not yet have a systematic understanding of what makes the successes of fine-tuning and joint training inconsistent.

The goal of this paper is two-fold: (1) to understand when and why fine-tuning and joint training can be suboptimal or even harmful for transfer learning; (2) to design algorithms that overcome the drawbacks of fine-tuning and joint training and consistently outperform them.

To address the first question, we hypothesize that fine-tuning and joint training have no incentive to prefer transferable features over source-specific features, and thus their ability to learn transferable features is rather accidental, depending on properties of the datasets. To empirically analyze this hypothesis, we design a semi-synthetic dataset that simultaneously contains artificially-amplified transferable features and source-specific features in the source data. Both the transferable and source-specific features can solve the source task, but only the transferable features are useful for the target. We analyze which features fine-tuning and joint training learn. See Figure 1 for an illustration of the semi-synthetic experiments. We observe the following failure patterns of fine-tuning and joint training on the semi-synthetic dataset.

• Pre-training may learn non-transferable features that do not help the target when both transferable and source-specific features can solve the source task, since it is oblivious to the target data. When the dataset contains source-specific features that are more convenient for neural nets to use, pre-training learns them; as a result, fine-tuning starting from the source-specific features does not lead to improvement.

• Joint training learns source-specific features and overfits to the target.
A priori, it may appear that joint training should prefer transferable features because the target data is present in the training loss. However, joint training easily overfits to the target, especially when the target dataset is small. When the source-specific features are the most convenient for the source, joint training simultaneously learns the source-specific features and memorizes the target dataset.

Toward overcoming the drawbacks of fine-tuning and joint training, we first note that any proposed algorithm, unlike fine-tuning, should use the source and the target simultaneously to encourage extracting shared structures. Second and more importantly, we recall that good representations should enable generalization: we should not only be able to fit a target head on top of the representations (as joint training does), but the learned head should also generalize well to a held-out target dataset. With this intuition, we propose Meta Representation Learning (MeRLin) to encourage learning transferable and generalizable features: we meta-learn a feature extractor such that the head fit to a target training set performs well on a target validation set. In contrast to standard model-agnostic meta-learning (MAML) (Finn et al., 2017), which learns prediction models that are adaptable to multiple target tasks from multiple source tasks, our method meta-learns transferable representations with only one source and one target domain.

Empirically, we first verify that MeRLin learns transferable features on the semi-synthetic dataset. We then show that MeRLin outperforms state-of-the-art transfer learning baselines on real-world vision and NLP tasks such as ImageNet to fine-grained classification and language modeling to GLUE. Theoretically, we analyze the mechanism of the improvement brought by MeRLin. In a simple two-layer quadratic neural network setting, we prove that MeRLin recovers the target ground truth with only limited target examples, whereas both fine-tuning and joint training fail to learn transferable features that perform well on the target.

In summary, our contributions are as follows. (1) Using a semi-synthetic dataset, we analyze and diagnose when and why fine-tuning and joint training fail to learn transferable representations. (2) We design a meta representation learning algorithm (MeRLin) that outperforms state-of-the-art transfer learning baselines. (3) We rigorously analyze the behavior of fine-tuning, joint training, and MeRLin in a special two-layer neural net setting.

2. SETUP AND PRELIMINARIES

In this paper, we study supervised transfer learning. Consider an input-label pair $(x, y) \in \mathbb{R}^d \times \mathbb{R}$. We are provided with a source distribution $D_s$ and a target distribution $D_t$ over $\mathbb{R}^d \times \mathbb{R}$. The source dataset $\hat{D}_s = \{x_i^s, y_i^s\}_{i=1}^{n_s}$ and the target dataset $\hat{D}_t = \{x_i^t, y_i^t\}_{i=1}^{n_t}$ consist of $n_s$ i.i.d. samples from $D_s$ and $n_t$ i.i.d. samples from $D_t$, respectively. Typically $n_s \gg n_t$.

We view a predictor as a composition of a feature extractor $h_\phi : \mathbb{R}^d \to \mathbb{R}^m$ parametrized by $\phi \in \Phi$, which is often a deep neural net, and a head classifier $g_\theta : \mathbb{R}^m \to \mathbb{R}$ parametrized by $\theta \in \Theta$, which is often linear. That is, the final prediction is $f_{\theta,\phi}(x) = g_\theta(h_\phi(x))$. Suppose the loss function is $\ell(\cdot, \cdot)$, such as the cross-entropy loss for classification tasks. Our goal is to learn an accurate model on the target domain $D_t$. Since the label sets of the source and target tasks can be different, we usually learn two heads for the source task and the target task separately, denoted by $\theta_s$ and $\theta_t$, with a shared feature extractor $h_\phi$. Using this notation, the standard supervised loss on the source (with the source head $\theta_s$) and the loss on the target (with the target head $\theta_t$) can be written as $L_{\hat{D}_s}(\theta_s, \phi)$ and $L_{\hat{D}_t}(\theta_t, \phi)$, respectively.

We next review mainstream transfer learning baselines and describe them in our notation. Target-only is the trivial algorithm that only trains on the target data $\hat{D}_t$ with the objective $L_{\hat{D}_t}(\theta_t, \phi)$ starting from random initialization. With insufficient target data, target-only is prone to overfitting. Pre-training starts with random initialization and pre-trains on the source dataset with objective function $L_{\hat{D}_s}(\theta_s, \phi)$ to obtain the pre-trained feature extractor $\hat{\phi}_{pre}$ and head $\hat{\theta}_s$. Fine-tuning initializes the target head $\theta_t$ randomly, initializes the feature extractor $\phi$ by the $\hat{\phi}_{pre}$ obtained in pre-training, and fine-tunes $\phi$ and $\theta_t$ on the target by optimizing $L_{\hat{D}_t}(\theta_t, \phi)$ over both $\theta_t$ and $\phi$.
Note that in this paper, fine-tuning refers to fine-tuning all layers by default. Joint training starts with random initialization and trains on the source and target datasets jointly by optimizing a linear combination of their objectives over the heads $\theta_s$, $\theta_t$ and the shared feature extractor $\phi$:

$\min_{\theta_s, \theta_t, \phi} \; L_{joint}(\theta_s, \theta_t, \phi) := (1-\alpha) L_{\hat{D}_s}(\theta_s, \phi) + \alpha L_{\hat{D}_t}(\theta_t, \phi)$.

The hyper-parameter $\alpha$ balances source training and target training. We use cross-validation to select the optimal $\alpha$.

3. LIMITATIONS OF FINE-TUNING AND JOINT TRAINING: ANALYSIS ON SEMI-SYNTHETIC DATA

Previous works (He et al., 2018; Wu et al., 2020)

Analysis: First of all, target-only has low accuracy (38%) because the target training set is small. Except when explicitly mentioned, all the discussions below are about algorithms with AB as the source. Fine-tuning fails because pre-training does not prefer transferable features and fine-tuning overfits. Joint training fails because it simultaneously learns mostly source-specific features and features that overfit to the target. In Section 5, we rigorously analyze the behavior of these algorithms in a more simplified setting and show that the phenomena above can theoretically occur.
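For concreteness, the joint-training objective defined in Section 2 can be sketched in a few lines. This is a minimal NumPy illustration (the function names, the generic feature map `h`, and the linear heads are our assumptions, not the paper's code):

```python
import numpy as np

def softmax_xent(logits, labels):
    """Mean cross-entropy of integer labels under softmax logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def joint_loss(h, theta_s, theta_t, xs, ys, xt, yt, alpha=0.5):
    """L_joint = (1 - alpha) * L_source + alpha * L_target, with a shared feature
    map h(x) and separate linear heads (the label sets may differ)."""
    loss_s = softmax_xent(h(xs) @ theta_s, ys)  # source head on shared features
    loss_t = softmax_xent(h(xt) @ theta_t, yt)  # target head on shared features
    return (1 - alpha) * loss_s + alpha * loss_t
```

In practice both losses are estimated on mini-batches and $\alpha$ is chosen by cross-validation, as described above.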

4. MERLIN: META REPRESENTATION LEARNING

In this section, we design a meta representation learning algorithm that encourages the discovery of transferable features. As shown in the semi-synthetic experiments, fine-tuning has no incentive to learn transferable features if they are not the most convenient for predicting the source labels, because pre-training is oblivious to the target data. Thus we have to use the source and target together to learn transferable representations. A natural attempt would be joint training, but it overfits to the target when the target data is scarce, as shown in the t-SNE visualizations in Figure 2(a).

To fix the drawbacks of joint training, we recall that good representations should not only work well for the target training set but also generalize to the target distribution. More concretely, a good representation $h_\phi$ should enable the generalization of the linear head learned on top of it: a linear head $\theta$ learned with the features $h_\phi(x)$ as inputs should generalize well to a held-out dataset. We design a bi-level optimization objective to learn such features, inspired by meta-learning for fast adaptation (Finn et al., 2017) and learning-to-learn for automatic hyperparameter optimization (Maclaurin et al., 2015; Thrun & Pratt, 2012) (more discussion below). We first split the target training set $\hat{D}_t$ randomly into $\hat{D}_t^{tr}$ and $\hat{D}_t^{val}$. Given a feature extractor $\phi$, let $\hat{\theta}_t(\phi)$ be the linear classifier learned by using $h_\phi(x)$ as the inputs on the dataset $\hat{D}_t^{tr}$:

$\hat{\theta}_t(\phi) = \arg\min_\theta L_{\hat{D}_t^{tr}}(\theta, \phi)$.

Note that $\hat{\theta}_t(\phi)$ depends on the choice of $\phi$ (and is almost uniquely determined by it because the objective is convex in $\theta$).
As alluded to before, our final objective involves the generalizability of $\hat{\theta}_t(\phi)$ on the held-out dataset $\hat{D}_t^{val}$:

$L_{meta,t}(\phi) = L_{\hat{D}_t^{val}}(\hat{\theta}_t(\phi), \phi) = \mathbb{E}_{(x,y) \in \hat{D}_t^{val}} \, \ell(g_{\hat{\theta}_t(\phi)}(h_\phi(x)), y)$.

The final objective is a linear combination of $L_{meta,t}(\phi)$ with the source loss:

$\mathrm{minimize}_{\phi \in \Phi, \theta_s \in \Theta} \; L_{meta}(\phi, \theta_s) := L_{\hat{D}_s}(\theta_s, \phi) + \rho \cdot L_{meta,t}(\phi)$.

To optimize the objective, we can use standard bi-level optimization techniques as in learning-to-learn approaches. We also design a sped-up version of MeRLin by changing the loss to the squared loss, so that $\hat{\theta}_t(\phi)$ has an analytical solution. More details are provided in Section A.3 (Algorithm 2).

Comparison to other meta-learning work. The key distinction of our approach from MAML (Finn et al., 2017) and other meta-learning algorithms (e.g., Nichol et al., 2018a; Bertinetto et al., 2019) is that we only have a single source task and a single target task. Recent work (Raghu et al., 2020) argues that feature reuse is the dominating factor in the effectiveness of MAML. In our case, the training target task is exactly the same as the test task, and thus the only possible contributing factor is a better-learned representation instead of fast adaptation. Our algorithm is in fact closer to the work on hyperparameter optimization (Maclaurin et al., 2015; Zoph & Le, 2016): if we view the parameters of the head $\theta_t$ as $m$ hyperparameters and view $\phi$ and $\theta_s$ as the ordinary parameters, then our algorithm is tuning hyperparameters on the validation set using gradient descent.
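To make the bi-level structure concrete, here is a minimal NumPy sketch of $L_{meta,t}$ with the squared loss, using a closed-form ridge solution for the inner loop in the spirit of the sped-up variant (Section A.3). The linear feature map `x @ phi` and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def ridge_head(H, y, lam=1e-3):
    """Inner loop in closed form: theta_t(phi) = (H^T H + lam I)^{-1} H^T y."""
    m = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y)

def meta_target_loss(phi, x_tr, y_tr, x_val, y_val, lam=1e-3):
    """L_meta,t(phi): fit a linear head on D_t^tr over features h_phi(x) = x @ phi,
    then evaluate the *fitted* head on the held-out D_t^val (squared loss)."""
    theta = ridge_head(x_tr @ phi, y_tr, lam)
    return float(((x_val @ phi @ theta - y_val) ** 2).mean())
```

In the full objective $L_{meta} = L_{\hat{D}_s} + \rho \cdot L_{meta,t}$, the gradient with respect to $\phi$ flows through $\hat{\theta}_t(\phi)$; autodiff frameworks can differentiate through the linear solve, which is what makes the closed-form inner loop cheap.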

4.1. MERLIN LEARNS TRANSFERABLE FEATURES ON SEMI-SYNTHETIC DATASET

We verify that MeRLin learns transferable features in the semi-synthetic setting of Section 3, where fine-tuning and joint training fail. Figure 1 (right) shows that MeRLin outperforms fine-tuning and joint training by a large margin and is close to fine-tuning from the source A, which can almost be viewed as an upper bound on any algorithm's performance with AB as the source. Figure 2(b) shows that MeRLin (trained with source = AB) performs well on A, indicating that it learns the transferable features. Figure 2(a) (MeRLin, train & test) corroborates this conclusion.

5. THEORETICAL ANALYSIS WITH TWO-LAYER QUADRATIC NEURAL NETS

The experiments in Section 3 demonstrate the weaknesses of fine-tuning and joint training, whereas MeRLin is able to learn the transferable features from the source dataset. In this section, we instantiate transfer learning in a quadratic neural network where the algorithms can be rigorously studied. For a specific data distribution, we prove that (1) fine-tuning and joint training fail to learn transferable features, and (2) MeRLin recovers the target ground truth with limited target examples.

Models. Consider a two-layer neural network $f_{\theta,\phi}(x) = g_\theta(h_\phi(x))$ with $g_\theta(z) = \theta^\top z$ and $h_\phi(x) = \sigma(\phi^\top x)$, where $\phi = [\phi_1, \phi_2, \cdots, \phi_m] \in \mathbb{R}^{d \times m}$ is the weight of the first layer, $\theta \in \mathbb{R}^m$ is the linear head, and $\sigma(\cdot)$ is the element-wise quadratic activation. We consider the squared loss $\ell(f_{\theta,\phi}(x), y) = (f_{\theta,\phi}(x) - y)^2$.

Source distribution. Let $k \in \mathbb{Z}^+$ with $2 \le k \le d$. We consider the following source distribution, which can be solved by multiple possible feature extractors. Let $x_{[i]}$ denote the $i$-th entry of $x \in \mathbb{R}^d$. With probability 1/3 we have $y = 0$, and conditioned on $y = 0$, $x_{[i]} = 0$ for $i \le k$, and $x_{[i]} \sim \{\pm 1, 0\}$ uniformly and independently for $i > k$. With probability 2/3 we have $y = 1$, and conditioned on $y = 1$, $x_{[i]} \sim \{\pm 1\}$ uniformly and independently for $i \le k$, and $x_{[i]} \sim \{\pm 1, 0\}$ uniformly and independently for $i > k$. The design choice here is that $x_{[1]}, \ldots, x_{[k]}$ are the useful entries for predicting the source label, because $y = x_{[i]}^2$ for any $i \le k$. In other words, the features $\sigma(e_{[i]}^\top x)$ for $i \le k$ are useful features to learn, and any linear mixture of them works. All other entries of $x$ are independent of the label $y$.

Target distribution. The target distribution is exactly the $k = 1$ version of the source distribution. Therefore, $y = x_{[1]}^2$, and $\sigma(e_{[1]}^\top x)$ is the correct feature extractor for the target. All other $x_{[i]}$ for $i > 1$ are independent of the label.

Source-specific features and transferable features.
As mentioned before, $\sigma(e_{[1]}^\top x), \cdots, \sigma(e_{[k]}^\top x)$ are all good features for the source, whereas only $\sigma(e_{[1]}^\top x)$ is transferable to the target. Since the source dataset is usually much larger than the target, we assume access to infinite source data for simplicity, so $\hat{D}_s = D_s$. We assume access to $n_t$ target data points $\hat{D}_t$.

Regularization. Because of the limited target data, the optimal solutions of the unregularized objectives are often not unique. Therefore, we study $\ell_2$-regularized versions of the baselines and MeRLin. The regularization strength $\lambda > 0$ is selected to achieve the optimal $L_{D_t}$. The regularized MeRLin objective is $L^\lambda_{meta}(\theta_s, \phi) := L_{meta}(\phi, \theta_s) + \lambda(\|\theta_s\|^2 + \|\phi\|_F^2)$. The regularized joint training objective is $L^\lambda_{joint}(\theta_s, \theta_t, \phi) := L_{joint}(\theta_s, \theta_t, \phi) + \lambda(\|\theta_s\|^2 + \|\theta_t\|^2 + \|\phi\|_F^2)$. We also regularize the two objectives of pre-training and fine-tuning: we pre-train with $L^\lambda_{\hat{D}_s}(\theta_s, \phi) := L_{\hat{D}_s}(\theta_s, \phi) + \lambda(\|\theta_s\|^2 + \|\phi\|_F^2)$, and then only fine-tune the head by minimizing the target loss $L^{\lambda, \hat{\phi}_{pre}}_{\hat{D}_t}(\theta_t) := L_{\hat{D}_t}(\theta_t, \hat{\phi}_{pre}) + \lambda \|\theta_t\|^2$.

The following theorem shows that neither joint training nor fine-tuning is capable of recovering the target ground truth given a limited number of target data.

Theorem 1. There exist universal constants $c \in (0, 1)$ and $\epsilon > 0$ such that, so long as $n_t \le cd$, for any $\lambda > 0$ the following statements are true:

• With probability at least $1 - 4\exp(-\Omega(d))$, the solution $(\hat{\theta}_s, \hat{\theta}_t, \hat{\phi}_{joint})$ of joint training satisfies $L_{D_t}(\hat{\theta}_t, \hat{\phi}_{joint}) \ge \epsilon$.

• With probability at least $1 - \frac{1}{k}$ (over the randomness of pre-training), the solution $(\hat{\theta}_t, \hat{\phi}_{pre})$ of head-only fine-tuning satisfies $L_{D_t}(\hat{\theta}_t, \hat{\phi}_{pre}) \ge \epsilon$.

As will be shown in the proof, not surprisingly, fine-tuning fails because pre-training learns a feature $\sigma(e_{[i]}^\top x)$ for an essentially random $i \le k$, which with probability $1 - \frac{1}{k}$ is not the transferable feature $\sigma(e_{[1]}^\top x)$. In contrast, the following theorem shows that MeRLin can recover the ground truth of the target task.

Theorem 2.
For any $\lambda < \lambda_0$, where $\lambda_0$ is some universal constant, and any failure rate $\xi > 0$, if the target set size satisfies $n_t \ge \Theta(\log \frac{k}{\xi})$, then with probability at least $1 - \xi$, the feature extractor $\hat{\phi}_{meta}$ found by MeRLin and the head $\hat{\theta}_t(\hat{\phi}_{meta})$ trained on $\hat{D}_t^{tr}$ recover the ground truth of the target task: $L_{D_t}(\hat{\theta}_t(\hat{\phi}_{meta}), \hat{\phi}_{meta}) = 0$.

Intuitively, MeRLin learns the transferable feature $\sigma(e_{[1]}^\top x)$ because it simultaneously fits the source and enables the generalization of the head on the target. The proof can be found in Section B.
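The distributions above are easy to simulate. The following sketch (our illustration, not the paper's code) samples the source distribution and checks that every coordinate $i \le k$ solves the source task via $y = x_{[i]}^2$, while the target is the $k = 1$ special case:

```python
import numpy as np

def sample_source(n, d, k, rng):
    """Source distribution of Section 5: y = 1 w.p. 2/3 (then x_[1:k] is uniform
    over {+-1}^k) and y = 0 w.p. 1/3 (then x_[1:k] = 0); the remaining d - k
    coordinates are uniform over {0, +-1}, independent of y."""
    y = (rng.random(n) < 2 / 3).astype(float)
    x = rng.choice([-1.0, 0.0, 1.0], size=(n, d))
    # Multiplying by y zeroes out the first k coordinates exactly when y = 0.
    x[:, :k] = rng.choice([-1.0, 1.0], size=(n, k)) * y[:, None]
    return x, y

def sample_target(n, d, rng):
    """Target distribution: the k = 1 special case, so y = x_[1]^2."""
    return sample_source(n, d, 1, rng)
```

By construction, $x_{[i]}^2 = y$ holds for every $i \le k$ on the source, so any mixture of the first $k$ quadratic features fits the source, but only the first coordinate transfers to the target.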

6. EXPERIMENTS

We evaluate MeRLin on several vision and NLP datasets. We show that (1) MeRLin consistently improves over baseline transfer learning algorithms, including fine-tuning and joint training, in both vision and NLP (Section 6.2), and (2) as indicated by our theory, MeRLin succeeds because it learns features that are more transferable than those learned by fine-tuning and joint training (Section 6.3).

6.1. SETUP: TASKS, MODELS, BASELINES, AND OUR ALGORITHMS

The evaluation metric for all tasks is top-1 accuracy. We run each task 3 times and report the means and standard deviations. Further experimental details are deferred to Section A.

Datasets and models. We consider the following four settings. The first three are object recognition problems (with different label sets). The fourth is the prominent NLP benchmark where the source is a language modeling task and the targets are classification problems.

SVHN or Fashion-MNIST → USPS. We use either SVHN (Netzer et al., 2011) or Fashion-MNIST as the source, and USPS (Hull, 1994), a hand-written digit dataset, as the target. We down-sample USPS to simulate the setting where the target dataset is much smaller than the source. We use LeNet (LeCun et al., 1998), a three-layer ReLU network, in this experiment.

ImageNet → CUB-200, Stanford Cars, or Caltech-256. We use ImageNet (Russakovsky et al., 2015) as the source dataset. The target dataset is CUB-200 (Wah et al., 2011), Stanford Cars, or Caltech-256. We use ResNet-18 (He et al., 2016).

Food-101 → CUB-200. Food-101 (Bossard et al., 2014) is a fine-grained classification dataset of 101 classes of food. Here we validate MeRLin when the gap between the source and target is large. We also use ResNet-18.

ImageNet → Stanford Dogs, MIT-indoors, or Aircraft. We further test the proposed method with ResNet-50 (He et al., 2016). We still use ImageNet (Russakovsky et al., 2015) as the source dataset. The target dataset is Stanford Dogs (Khosla et al., 2011), MIT-indoors (Quattoni & Torralba, 2009), or Aircraft (Maji et al., 2013), consisting of 12000, 5360, and 8000 labeled examples, respectively.

Language modeling → GLUE. Pre-training on language modeling tasks and fine-tuning on labeled datasets such as GLUE (Wang et al., 2019) has become dominant following the success of BERT (Devlin et al., 2019).

Baselines. (1) Target-only, (2) fine-tuning, and (3) joint training are defined in Section 2.
Following standard practice, the initial learning rate of fine-tuning is 0.1× the initial learning rate of pre-training to avoid overfitting. For joint training, the overall objective is $(1-\alpha) L_{\hat{D}_s} + \alpha L_{\hat{D}_t}$; we tune $\alpha$ to achieve optimal performance. The fourth baseline is (4) L2-sp (Li et al., 2018), which fine-tunes the model with a regularization term penalizing the parameter distance to the pre-trained feature extractor. (5) DELTA (Li et al., 2019) constrains layer outputs selected with attention. (6) BSS (Chen et al., 2019) regularizes the singular values of the feature matrix. We also tune the strength of the L2-sp, DELTA, and BSS regularizers.

MeRLin. We perform standard training with the cross-entropy loss on the source domain while meta-learning the representation in the target domain, as described in Section 4.

MeRLin-ft.

In BERT experiments, training on the source masked language modeling task is prohibitively time-consuming, so we opt for a lightweight variant instead: start from pre-trained BERT and only meta-learn the representation in the target domain.
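For reference, the L2-sp baseline above amounts to penalizing the distance to the pre-trained weights instead of to zero. A minimal sketch (our own illustration, with an assumed list-of-arrays parameterization rather than the authors' code):

```python
import numpy as np

def l2_sp_penalty(params, pretrained, beta=0.01):
    """L2-SP-style penalty: beta * sum_l ||w_l - w_l0||^2, pulling the fine-tuned
    weights toward the pre-trained solution w0 rather than toward the origin."""
    return beta * sum(float(((w - w0) ** 2).sum())
                      for w, w0 in zip(params, pretrained))
```

This term is added to the target loss during fine-tuning; the strength `beta` is the regularizer strength that is tuned per task, as noted above.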

6.2. RESULTS

Results of digit classification and object recognition are provided in Table 1. MeRLin consistently outperforms all baselines. Note that the discrepancy between Fashion-MNIST and USPS is very large, and fine-tuning and joint training perform even worse than target-only; nonetheless, MeRLin is still capable of harnessing knowledge from the source domain. On Food-101 → CUB-200, MeRLin improves over fine-tuning by 6.58%, indicating that MeRLin helps learn transferable features even when the gap between the source and target tasks is huge. In Table 2, we validate our method on GLUE tasks. MeRLin-ft outperforms standard BERT fine-tuning and L2-sp. Since MeRLin-ft only changes the training objective of fine-tuning, it can be easily applied to NLP models. We next empirically analyze the representations and verify that MeRLin indeed learns more transferable features than fine-tuning and joint training.

6.3. ANALYSIS

Intra-class to inter-class variance ratio. Suppose the representation of the $j$-th example of the $i$-th class is $\phi_{i,j}$. Let $\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \phi_{i,j}$ and $\mu = \frac{1}{C} \sum_{i=1}^{C} \mu_i$. Then the intra-class to inter-class variance ratio is

$\frac{\sigma^2_{intra}}{\sigma^2_{inter}} = \frac{C}{N} \cdot \frac{\sum_{i,j} \|\phi_{i,j} - \mu_i\|^2}{\sum_i \|\mu_i - \mu\|^2}$.
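This ratio can be computed directly from a matrix of representations. A minimal NumPy sketch (ours; `features` is an N x m matrix and `labels` holds integer class indices):

```python
import numpy as np

def variance_ratio(features, labels):
    """Intra-class to inter-class variance ratio:
    (C / N) * sum_ij ||phi_ij - mu_i||^2 / sum_i ||mu_i - mu||^2."""
    classes = np.unique(labels)
    C, N = len(classes), len(labels)
    mus = np.stack([features[labels == c].mean(axis=0) for c in classes])
    mu = mus.mean(axis=0)                      # mean of the per-class means
    intra = sum(((features[labels == c] - mus[i]) ** 2).sum()
                for i, c in enumerate(classes))
    inter = ((mus - mu) ** 2).sum()
    return (C / N) * intra / inter
```

A smaller ratio means examples cluster tightly around their class means relative to the spread between classes, i.e., a more discriminative representation.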

7. ADDITIONAL RELATED WORK

Transfer learning is prevalent in deep learning applications. In computer vision, ImageNet pre-training is a common practice for nearly all target tasks. Early works (Oquab et al., 2014; Donahue et al., 2014) directly apply ImageNet features to target tasks; fine-tuning from ImageNet pre-trained models has since become dominant (Long et al., 2015; He et al., 2017; Kolesnikov et al., 2019). Transfer learning is also crucial to the success of NLP: pre-training transformers on large-scale language tasks boosts performance on downstream tasks (Devlin et al., 2019; Yang et al., 2019).

A recent line of literature casts doubt on the consistency of transfer learning's success (Raghu et al., 2019; He et al., 2018; Kornblith et al., 2019). Huh et al. (2016) observed that some sets of examples in ImageNet are more transferable than others. Geirhos et al. (2019) found that the texture of ImageNet is not transferable to some target tasks. Training on the source dataset may also need early stopping to find optimal transferability (Liu et al., 2019a; Neyshabur et al., 2020).

Meta-learning, originating from the learning-to-learn idea (Hochreiter et al., 2001; Vilalta & Drissi, 2002; Maclaurin et al., 2015; Zoph & Le, 2016), learns from multiple training tasks models that can be swiftly adapted to new tasks (Finn et al., 2017; Rajeswaran et al., 2019; Nichol et al., 2018b). Raghu et al. (2020) and Goldblum et al. (2020) empirically studied the mechanism of MAML's success. Computationally, our method uses bi-level optimization techniques similar to meta-learning work; e.g., Bertinetto et al. (2019) speed up the implementation of MAML (Finn et al., 2017) with a closed-form solution of the inner loop, a technique that we also use. However, the key difference between our paper and the meta-learning approach is that we only learn from a single target task and evaluate on it.
Therefore, conceptually, our algorithm is closer to the learning-to-learn approach for hyperparameter optimization (Maclaurin et al., 2015; Zoph & Le, 2016) , where there is a single distribution that generates the training and validation dataset.

8. CONCLUSION

We study the limitations of fine-tuning and joint training. To overcome their drawbacks, we propose meta representation learning to learn transferable features. Both theoretical and empirical evidence support our findings, and results on vision and NLP tasks validate our method on real-world datasets. Our work raises intriguing questions for further study. Could we apply meta-learning to heterogeneous target tasks? Future work could also explore explicitly disentangling transferable features from non-transferable features for better transfer learning.

A. ADDITIONAL DETAILS OF EXPERIMENTS

A.1. THE SEMI-SYNTHETIC EXPERIMENT

The original CIFAR images are of resolution 32 × 32. For the transferable dataset A, we keep the upper 16 × 32 half and fill the lower half with [0.485, 0.456, 0.406] for the three channels (the mean of CIFAR-10 images). For the non-transferable dataset B, the lower 16 × 32 pixels are generated with an i.i.d. Gaussian distribution, and the upper half is similarly filled with [0.485, 0.456, 0.406]. To make the non-transferable part related to the labels, we set the mean of the Gaussian distribution to 0.1 × c, where c is the class index of the image. The variance of the Gaussian noise is set to 0.2. We always clamp the images to [0, 1] to keep the generated images valid. For the source dataset, we use 49500 CIFAR-10 images; for the target, we use the other 500 to avoid memorizing target examples. We use the ResNet-32 implementation provided at github.com/akamaster/pytorch_resnet_cifar10. We set the initial learning rate to 0.1 and decay it by 0.1 after every 50 epochs. We use the t-SNE (van der Maaten & Hinton, 2008) implementation provided in sklearn. The perplexity is set to 80.
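The construction above can be sketched as follows (our illustration of the Section A.1 procedure, assuming channel-first (3, 32, 32) images with values in [0, 1]):

```python
import numpy as np

CIFAR_MEAN = np.array([0.485, 0.456, 0.406])  # per-channel CIFAR-10 mean

def make_semi_synthetic(img, label, kind, rng):
    """Dataset A keeps the upper 16x32 half (lower half = channel means); dataset B
    fills the upper half with the means and draws the lower half i.i.d. from a
    Gaussian with mean 0.1 * label and variance 0.2, then clamps to [0, 1]."""
    out = img.copy()
    fill = CIFAR_MEAN[:, None, None]
    if kind == "A":
        out[:, 16:, :] = fill
    elif kind == "B":
        out[:, :16, :] = fill
        out[:, 16:, :] = rng.normal(0.1 * label, np.sqrt(0.2), size=(3, 16, 32))
    return np.clip(out, 0.0, 1.0)
```

In dataset B the label information lives only in the mean of the synthetic lower half, which makes it a convenient but non-transferable shortcut for the source task.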

A.2 IMPLEMENTATION ON REAL DATASETS

We implement all models in PyTorch on 2080Ti GPUs. All models are optimized by SGD with momentum 0.9. For object recognition tasks, the initial learning rate is set to 0.01 with 5 × 10^-4 weight decay. The batch size is set to 64. We run each model for 100 epochs. ImageNet pre-trained models can be found in torchvision. We use a batch size of 128 on the source dataset and 512 on the target dataset. The initial learning rate is set to 0.1 for training from scratch and 0.01 for ImageNet initialization. We decay the learning rate by 0.1 every 50 epochs until 150 epochs. The weight decay is set to 5 × 10^-4. For GLUE tasks, we follow the standard practice of Devlin et al. (2019). The BERT model is provided in github.com/Meelfy/pytorch_pretrained_BERT. For each model, we set the head (classifier) to the top linear layer. We use a batch size of 32. The learning rate is set to 5 × 10^-5 with 0.1 warmup proportion. During fine-tuning, the initial learning rate is 10 times smaller than training from scratch, following standard practice. The hyper-parameter ρ is set to 2, and λ is found with cross-validation. We randomly split the training set into 80% and 20%: we train on the 80% part and use the remaining 20% as the validation set. We also provide the results of varying ρ and λ in Section A.7. For object recognition, we also provide a pre-trained version that starts from the ImageNet pre-trained solution.

A.3 IMPLEMENTING THE SPEED-UP VERSION

Practical implementation: speeding up with the MSE loss. Training the head $g_\theta$ in the inner loop of meta-learning can be time-consuming. Even using the implicit function theorem or implicit gradients as proposed in (Bai et al., 2019; Rajeswaran et al., 2019; Lin et al., 2020), we have to approximate the inverse of a Hessian. To alleviate these optimization issues, we propose to analytically calculate the prediction of the linear head $\hat{\theta}_t$ and directly back-propagate to the feature extractor $h_\phi$, so that we only need to compute the gradient once in a single step. Concretely, suppose we use the MSE loss. Denote by $H \in \mathbb{R}^{\frac{n_t}{2} \times m}$ the feature matrix of the $\frac{n_t}{2}$ target samples in the target meta-training set $\hat{D}_t^{tr}$. Then $\hat{\theta}_t$ in Equation 1 can be analytically computed as $\hat{\theta}_t = (H^\top H + \lambda I)^{-1} H^\top y$, where $\lambda$ is a hyper-parameter for regularization. The objective of the outer loop can then be directly computed as

$\mathrm{minimize}_{\phi \in \Phi, \theta_s \in \Theta} \; J(\phi, \theta_s) = L_{\hat{D}_s}(\theta_s, \phi) + \rho \cdot \frac{2}{n_t} \sum_{i=1}^{n_t/2} \ell(g_{\hat{\theta}_t}(h_\phi(x_i^t)), y_i^t)$,

where the sum is over the target meta-validation set. We implement the kernel speed-up version on classification tasks following Arora et al. (2019b), treating the classification problems as multi-variate ridge regression. Suppose the labels are $c \in \{1, 2, \cdots, 10\}$. Then the target encoding for regression is $-0.1 \times \mathbf{1} + e_c$; for example, if the label is 3, the encoding is $(-0.1, -0.1, 0.9, \cdots, -0.1)$. The parameters of the target head in the inner loop can be computed as $\hat{\theta}_t = (H^\top H + \lambda I)^{-1} H^\top Y$, and we compute the MSE loss on the target validation set, $\|H^{val} \hat{\theta}_t - Y^{val}\|_2^2$, in the outer loop. We summarize the details of the vanilla version and the speed-up version in Algorithm 1 and Algorithm 2.

Algorithm 1 (MeRLin, vanilla version). In each iteration, first train the target head on $\hat{D}_t^{tr}$ in the inner loop, $\theta_t \leftarrow \theta_t - \eta \nabla_{\theta_t} L_{\hat{D}_t^{tr}}(\theta_t, \phi)$, for $n$ steps; then, in the outer loop, update the representation $\phi$ and the source head $\theta_s$: $(\phi, \theta_s) \leftarrow (\phi, \theta_s) - \eta \nabla_{(\phi, \theta_s)} [L_{\hat{D}_s}(\theta_s, \phi) + \rho L_{\hat{D}_t^{val}}(\theta_t, \phi)]$.

Algorithm 2 (MeRLin, speed-up version). Analytically calculate the solution of the target head in the inner loop, $\hat{\theta}_t(\phi) = (H^\top H + \lambda I)^{-1} H^\top Y$; then, in the outer loop, update the representation $\phi$ and the source head $\theta_s$: $(\phi, \theta_s) \leftarrow (\phi, \theta_s) - \eta \nabla_{(\phi, \theta_s)} [L_{\hat{D}_s}(\theta_s, \phi) + \rho \cdot \frac{2}{n_t} \|H^{val} \hat{\theta}_t(\phi) - Y^{val}\|_2^2]$.

We extend the last column of Table 2 in Table 4. We further compare with two variants of MeRLin as an ablation study. MeRLin (pre-trained): we first pre-train the model on the source dataset and then optimize the MeRLin objective from the pre-trained solution. MeRLin-target-only: this variant only meta-learns representations on the target domain starting from random initialization, testing whether the meta-learning objective itself has a regularization effect.

Suppose the feature matrix is $H$ and the label vector is $y$; then the correlation between features and labels can be defined as $y^\top (H H^\top)^{-1} y$. As shown by Arora et al. (2019a) and Cao & Gu (2019), this term is closely related to the generalization error of neural networks, with a smaller quantity indicating better generalization. We calculate $y^\top (H H^\top)^{-1} y$ for the learned representations.

We run 10 seeds of each experiment on L2-sp and MeRLin in Table 1 and calculate 95% bootstrapped confidence intervals of the results in Figure 5.

Lemma 1. Suppose $0 \le \epsilon \le \frac{2}{3}$. For each solution $(\theta_s, \theta_t, \phi)$ satisfying $\mathbb{E}_{x,y \sim D_t}[\ell(f_{\theta_t,\phi}(x), y)] \le \epsilon$, the joint training objective is lower bounded:

$L^\lambda_{joint}(\theta_s, \theta_t, \phi) = (1-\alpha)L_{D_s}(\theta_s, \phi) + \alpha L_{\hat{D}_t}(\theta_t, \phi) + \lambda(\|\theta_s\|^2 + \|\theta_t\|^2 + \|\phi\|_F^2) \ge \min_\mu \left[ \frac{3\lambda}{2^{4/3}} |\mu|^{2/3} + \frac{2}{3}(1-\alpha)(\mu-1)^2 \right] + \frac{3\lambda}{2^{4/3}} \left(1 - \sqrt{\frac{3\epsilon}{2}}\right)^{2/3}$.

Proof of Lemma 1. Define the $d \times d$ matrix $A = \sum_{i=1}^m \theta_{s,i} \phi_i \phi_i^\top$.
Let x_{[1:k]} and x_{[k+1:d]} denote the first k and last d-k dimensions of x, and let A_{k,k}, A_{k,k̄}, A_{k̄,k̄} be the k × k, k × (d-k), and (d-k) × (d-k) blocks corresponding to the upper-left, upper-right, and lower-right parts of A. For a random vector x whose first k dimensions are independently uniform on {±1} and whose last d-k dimensions are independently uniform on {0, ±1}, define the random variables

A_1 = x_{[1:k]}^⊤ A_{k,k} x_{[1:k]},   A_2 = x_{[k+1:d]}^⊤ A_{k̄,k̄} x_{[k+1:d]}.

(Note that x is defined on a different distribution than D_s.) We have the bound

(1-α) E_{x,y∼D_s}[ℓ(f_{θ_s,φ}(x), y)] + λ(||θ_s||² + ½||φ||_F²)   (9)
= (1-α) E_{x,y∼D_s}[(x_{[1:k]}^⊤ A_{k,k} x_{[1:k]} + 2 x_{[1:k]}^⊤ A_{k,k̄} x_{[k+1:d]} + x_{[k+1:d]}^⊤ A_{k̄,k̄} x_{[k+1:d]} - y)²] + λ(||θ_s||² + ½||φ||_F²)   (10)
≥ (1-α) E_{x,y∼D_s}[(x_{[1:k]}^⊤ A_{k,k} x_{[1:k]} + x_{[k+1:d]}^⊤ A_{k̄,k̄} x_{[k+1:d]} - y)²] + λ(||θ_s||² + ½||φ||_F²)   (11)
= (1-α) [ (2/3) E[(A_1 - 1)²] + (4/3) E[(A_1 - 1) A_2] + E[A_2²] ] + λ(||θ_s||² + ½||φ||_F²)   (12)
= (1-α) [ (2/3) E[(A_1 + A_2 - 1)²] + (1/3) E[A_2²] ] + λ(||θ_s||² + ½||φ||_F²)   (13)
≥ (2/3)(1-α)(E[A_1 + A_2] - 1)² + (3λ / 2^{4/3}) |E[A_1 + A_2]|^{2/3}.   (14)

The first inequality holds because

E_{x,y∼D_s}[x_{[1:k]}^⊤ A_{k,k} x_{[1:k]} · x_{[1:k]}^⊤ A_{k,k̄} x_{[k+1:d]}] = 0,
E_{x,y∼D_s}[x_{[k+1:d]}^⊤ A_{k̄,k̄} x_{[k+1:d]} · x_{[1:k]}^⊤ A_{k,k̄} x_{[k+1:d]}] = 0,
E_{x,y∼D_s}[x_{[1:k]}^⊤ A_{k,k̄} x_{[k+1:d]} · y] = 0.

The second inequality holds because

Σ_{i=1}^m [ (θ_{si})² + ½||φ_i||² ] ≥ (3 / 2^{4/3}) Σ_{i=1}^m ( |θ_{si}| · ||φ_i||² )^{2/3}   (18)
≥ (3 / 2^{4/3}) ( Σ_{i=1}^m |θ_{si}| · ||φ_i||² )^{2/3}   (19)
≥ (3 / 2^{4/3}) ( Σ_i |A_{[i,i]}| )^{2/3},

where the first inequality is the AM-GM inequality and the second is by concavity of (·)^{2/3}. The third inequality holds because, for the diagonal matrix D with 1 at (i, i) if A_{[i,i]} ≥ 0 and -1 at (i, i) if A_{[i,i]} < 0, we have

Σ_i |A_{[i,i]}| = tr(AD) = Σ_{i=1}^m θ_{si} φ_i^⊤ D φ_i ≤ Σ_{i=1}^m |θ_{si}| · ||φ_i||².

On the other hand, for the target, define the d × d matrix B = Σ_{i=1}^m θ_{ti} φ_i φ_i^⊤. Let x_{[1]} and x_{[2:d]} denote the first and last d-1 dimensions of x, and let B_{1,1}, B_{1,1̄}, B_{1̄,1̄} be the 1 × 1, 1 × (d-1), and (d-1) × (d-1) matrices corresponding to the upper-left, upper-right, and lower-right parts of B. For a random vector x whose first dimension is independently uniform on {±1} and whose last d-1 dimensions are independently uniform on {0, ±1}, define the random variables

B_1 = x_{[1]} B_{1,1} x_{[1]},   B_2 = x_{[2:d]}^⊤ B_{1̄,1̄} x_{[2:d]}.

(Note that x is defined on a different distribution than D_t.) Using a similar argument as above, we have

E_{x,y∼D_t}[ℓ(f_{θ_t,φ}(x), y)] ≥ (2/3) (E[B_1 + B_2] - 1)².
However, we know that E_{x,y∼D_t}[ℓ(f_{θ_t,φ}(x), y)] ≤ ε, so we must have E[B_1 + B_2] ≥ 1 - √(3ε/2). Therefore,

λ(||θ_t||² + ½||φ||_F²) ≥ (3λ / 2^{4/3}) (1 - √(3ε/2))^{2/3}.   (25)

Summing up Equation 9 and Equation 25 finishes the proof.

Lemma 2. Assume φ̂_t is a vector such that ⟨φ̂_t, x_i^t⟩ = x_{i[1]}^t for all x_i^t ∈ D_t. Then there exists a solution (θ_s, θ_t, φ) such that

L^λ_joint(θ_s, θ_t, φ) ≤ min_µ [ (3λ / 2^{2/3}) |µ|^{2/3} + (2/3)(1-α)(µ-1)² ] + (3λ / 2^{2/3}) ||φ̂_t||_2^{4/3}.   (26)

Proof of Lemma 2. Let µ* ∈ argmin_µ [ (3λ / 2^{2/3}) |µ|^{2/3} + (2/3)(1-α)(µ-1)² ]; clearly µ* ∈ [0, 1]. Let φ_1 = (√2 µ*)^{1/3} e_1, φ_2 = 2^{1/6} ||φ̂_t||_2^{-1/3} φ̂_t, φ_i = 0 for i > 2, θ_s = (µ*/2)^{1/3} e_1, and θ_t = 2^{-1/3} ||φ̂_t||_2^{2/3} e_2. We now show that this solution satisfies Equation 26. First, notice that for any x_i^t ∈ D_t,

x_i^{t⊤} ( Σ_{j=1}^m θ_{tj} φ_j φ_j^⊤ ) x_i^t   (27)
= x_i^{t⊤} θ_{t2} φ_2 φ_2^⊤ x_i^t   (28)
= ⟨φ̂_t, x_i^t⟩²   (29)
= y_i^t.   (30)

Therefore L_{D_t}(θ_t, φ) = 0. (31) Moreover, λ(||θ_t||² + ||φ_2||²) = (3λ / 2^{2/3}) ||φ̂_t||_2^{4/3}. On the other hand,

(1-α) L_{D_s}(θ_s, φ) + λ(||θ_s||² + ||φ_1||²)   (32)
= (2/3)(1-α)(µ* - 1)² + (3λ / 2^{2/3}) |µ*|^{2/3}   (33)
= min_µ [ (3λ / 2^{2/3}) |µ|^{2/3} + (2/3)(1-α)(µ-1)² ].

Plugging Equation 31 and Equation 32 into the formula for L^λ_joint(θ_s, θ_t, φ) finishes the proof.

Lemma 3. Let X ∈ R^{d×n} be a random matrix with each entry independently uniform on {0, ±1}, and n < d/2. Let P_X e_1 be the projection of e_1 onto the column space of X. Then there exist absolute constants c_0 > 0 and C > 0 such that, with probability at least 1 - 4 exp(-Cd),

||P_X e_1||_2 ≤ c_0 √(n/d).   (35)

Proof of Lemma 3. Let s_min(X) and s_max(X) be the minimal and maximal singular values of X, respectively. Since each entry of X^⊤ e_1 lies in {0, ±1}, we have ||X^⊤ e_1||_2 ≤ √n, so

||P_X e_1||_2 = ||X (X^⊤X)^{-1} X^⊤ e_1||_2   (36)
≤ ||X||_op ||(X^⊤X)^{-1}||_op ||X^⊤ e_1||_2   (37)
≤ s_max(X) (s_min(X))^{-2} √n.

By Theorem 3.3 in Rudelson & Vershynin (2010), there exist constants c_1, c_2 > 0 such that

P(s_min(X) ≤ c_1(√d - √n)) ≤ 2 exp(-c_2 d).
By Proposition 2.4 in Rudelson & Vershynin (2010), there exist constants c_3, c_4 > 0 such that

P(s_max(X) ≥ c_3(√d + √n)) ≤ 2 exp(-c_4 d).

Let C = min{c_2, c_4}. Then, with probability at least 1 - 4 exp(-Cd),

s_max(X) (s_min(X))^{-2} √n   (41)
≤ (c_3 / c_1²) · (√d + √n) / (√d - √n)² · √n   (42)
≤ ((2 + √2) c_3 / ((√2 - 1)² c_1²)) √(n/d),

which completes the proof.
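Lemma 3 also admits a quick numerical sanity check (not part of the proof). We draw X with i.i.d. entries uniform on {0, ±1} and measure the norm of the projection of e_1 onto its column space; the dimensions below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 400, 20                                      # n well below d/2
X = rng.integers(-1, 2, size=(d, n)).astype(float)  # i.i.d. uniform on {0, +1, -1}
e1 = np.zeros(d)
e1[0] = 1.0
# Projection onto the column space of X: P_X e_1 = X (X^T X)^{-1} X^T e_1.
proj = X @ np.linalg.solve(X.T @ X, X.T @ e1)
norm = float(np.linalg.norm(proj))
```

For these sizes the norm comes out on the order of √(n/d) ≈ 0.22, far below 1, consistent with the lemma.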

B.2 PROOF OF THEOREM 2

Lemma 4. Define the source loss as L^λ_{D_s}(θ_s, φ) = L_{D_s}(θ_s, φ) + λ(||θ_s||² + ||φ||_F²). Then, for any λ > 0, any minimizer of L^λ_{D_s}(θ_s, φ) is one of the following cases: (i) θ_s = 0 and φ = 0; (ii) for one i ∈ [m], θ_{si} > 0 and φ_i = ±(√2 θ_{si}) e_j for some j ≤ k, while for all other i ∈ [m], |θ_{si}| = ||φ_i|| = 0. Furthermore, when 0 < λ < 0.1, all minimizers are of form (ii).

Proof of Lemma 4. Expanding the loss part as in Equations 56-57 and using the distribution of x, we have

= (2/3) E[(A_1 - 1)²] + (4/3) E[(A_1 - 1) A_2] + E[A_2²]   (59)
= (2/3) E[(A_1 + A_2 - 1)²] + (1/3) E[A_2²].   (60)

Combining the two parts gives a lower bound for L^λ_{D_s}(θ_s, φ):

L^λ_{D_s}(θ_s, φ) ≥ (3λ / 2^{2/3}) ( Σ_i |A_{[i,i]}| )^{2/3} + (2/3) E[(A_1 + A_2 - 1)²] + (1/3) E[A_2²]   (68)
≥ (3λ / 2^{2/3}) |tr(A_{k,k} + (2/3) A_{k̄,k̄})|^{2/3} + (2/3) (E[A_1 + A_2] - 1)²,   (69)

where both inequalities are equalities if and only if A_{k̄,k̄} = 0 (hence A_2 = 0) and Var[A_1] = 0 (hence A_{k,k} is diagonal, by Lemma 5). Noticing that E[A_1 + A_2] = tr(A_{k,k} + (2/3) A_{k̄,k̄}), the lower bound above is further minimized when E[A_1] = µ*, where µ* is the minimizer of the function L(µ) = (3λ / 2^{2/3}) |µ|^{2/3} + (2/3)(µ - 1)². To see when this lower bound is achieved, we combine all the conditions for the inequalities to be equalities. When µ* = 0, the bound is achieved if and only if θ_s = 0 and φ = 0. When µ* > 0, it is achieved only by solutions of the following form: for one i ∈ [m], θ_{si} = (µ*/2)^{1/3} and φ_i = ±(√2 θ_{si}) e_j for some j ≤ k, while for all other i ∈ [m], |θ_{si}| = ||φ_i|| = 0. Clearly, either µ* = 0 or µ* > 0. Moreover, when λ < 0.1, the minimizer µ* of L(µ) is strictly larger than 0 (since L(1) < L(0)). This completes the proof.

Lemma 5.
Let M ∈ R^{k×k} be a symmetric matrix, and let x ∈ R^k be a random vector with each dimension independently uniform on {±1}. Then Var[x^⊤ M x] = 0 if and only if M is a diagonal matrix.

Proof of Lemma 5. In one direction, when M is diagonal, x^⊤ M x = tr(M) is constant, so clearly Var[x^⊤ M x] = 0. In the other direction, when Var[x^⊤ M x] = 0, the value x^⊤ M x must be the same for all x ∈ {±1}^k. For any i ≠ j, let x^{(1)} = 1 - 2e_i - 2e_j, x^{(2)} = 1, x^{(3)} = 1 - 2e_i, and x^{(4)} = 1 - 2e_j. Then the (i, j) entry of M equals (1/8)(x^{(1)⊤} M x^{(1)} + x^{(2)⊤} M x^{(2)} - x^{(3)⊤} M x^{(3)} - x^{(4)⊤} M x^{(4)}), which is 0. So M must be a diagonal matrix.

Proof of Theorem 2. Define the source loss as in Lemma 4. Then

L^λ_meta(θ_s, φ) = L^λ_{D_s}(θ_s, φ) + E[L_{D_t^val}(θ̂_t(φ), φ)].

By Lemma 4, the source loss L^λ_{D_s}(θ_s, φ) is minimized by solutions of the following form: for one i ∈ [m], θ_{si} > 0 and φ_i = ±(√2 θ_{si}) e_j for some j ≤ k, while for all other i ∈ [m], |θ_{si}| = ||φ_i|| = 0.

When j = 1, the only feature in φ is e_1. When n_t ≥ 18 log(2/ξ), by the Chernoff bound, with probability at least 1 - ξ/2 strictly less than half of the data satisfy x_1 = 0. Therefore, every D_t^tr contains data with x_1 ≠ 0, and the only target head that fits D_t^tr must recover the ground truth. Hence E[L_{D_t^val}(θ̂_t(φ), φ)] = 0.

When j ≠ 1, the only feature is e_j. This feature can fit the target data if and only if either x_{i[j]}² = x_{i[1]}² for all target data x_i, or x_{i[1]} = 0 for all x_i. Since there are at most k - 1 possible values of j, by the union bound the probability that this happens for any j ≠ 1 is at most k (2/3)^{n_t}.
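Lemma 5, and the entry-recovery formula used in its proof, can be verified by brute force over {±1}^k. A small self-contained check (function names are ours):

```python
import itertools
import numpy as np

def quad_form_variance(M):
    # Exact Var[x^T M x] over x uniform on {+1, -1}^k, by enumerating all 2^k points.
    k = M.shape[0]
    vals = np.array([np.array(x) @ M @ np.array(x)
                     for x in itertools.product([-1.0, 1.0], repeat=k)])
    return float(vals.var())

def recover_entry(M, i, j):
    # The (i, j) entry of M via the four evaluations from the proof of Lemma 5:
    # M_ij = (q(1 - 2e_i - 2e_j) + q(1) - q(1 - 2e_i) - q(1 - 2e_j)) / 8,
    # where q(x) = x^T M x.
    one = np.ones(M.shape[0])
    e = np.eye(M.shape[0])
    x1, x2, x3, x4 = one - 2*e[i] - 2*e[j], one, one - 2*e[i], one - 2*e[j]
    return (x1 @ M @ x1 + x2 @ M @ x2 - x3 @ M @ x3 - x4 @ M @ x4) / 8.0
```

A diagonal matrix yields zero variance, while adding any symmetric off-diagonal entry makes the variance strictly positive, exactly as the lemma states.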



As sanity checks, when the source contains only transferable features (Figure 1, right, source = A), fine-tuning works well, and when the source contains no transferable features (Figure 1, right, source = B), it does not. For the theoretical analysis, we consider fine-tuning only θ_t. It is worth noting that fine-tuning both θ_t and φ converges to the same solution as target-only training, which also has a large generalization gap due to overfitting.



Figure 1: Comparison of fine-tuning, joint training, and MeRLin on the semi-synthetic dataset. Left: the semi-synthetic dataset and qualitative observations on the representations learned by the three algorithms. Right: quantitative results on target test accuracy. See more interpretations, analysis, and results in Section 3.

Let L_D̂(θ, φ) be the empirical loss of the model g_θ(h_φ(x)) on the empirical distribution D̂, that is, L_D̂(θ, φ) := E_{(x,y)∈D̂} ℓ(g_θ(h_φ(x)), y), where (x, y) ∈ D̂ means sampling uniformly from the dataset D̂. Using this notation, the standard supervised loss on the source (with the source head θ_s) and the loss on the target (with the target head θ_t) can be written as L_D̂s(θ_s, φ) and L_D̂t(θ_t, φ), respectively.


Figure 2: (a) T-SNE visualizations of features on the target train and test sets. The representations from pre-training work poorly on both the target train and test sets, indicating that transferable features are not learned. Both joint training and fine-tuning work well on the target train set but poorly on the test set, indicating overfitting. MeRLin works well on the target test set. (b) Evaluation of different methods on A and B. Joint training and pre-training rely heavily on the source-specific feature B and learn the transferable feature A poorly compared to MeRLin. See more details in Section 3.

In Figure 1 (right), we evaluate the algorithms' performance on target test data. In Figure 2(a), we run the algorithms with AB as the source dataset and visualize the learned features on the target training set (left) and target test set (right) to examine the generalizability of the features. In Figure 2(b), we evaluate the algorithms on held-out versions of datasets A and B to examine which features the algorithms learn. ResNet-32 (He et al., 2016) is used for all settings.

Figure 2(b) (pre-training) shows that the pre-trained model has near-trivial accuracy on held-out A but near-perfect accuracy on held-out B, indicating that it relies solely on the source-specific feature (bottom half) and does not learn transferable features. Figure 2(a) (pre-training) shows that the pre-trained features indeed have little correlation with the labels on both the target training and test sets. Figure 2(a) (fine-tuning) shows that fine-tuning improves the features' correlation with the target training labels, but the features do not generalize to the target test set because of overfitting. The performance of fine-tuning (with source = AB) in Figure 1 (right) also corroborates this lack of generalization.

Figure 2(b) (joint training) shows that the joint-training model performs much better on held-out B (92% accuracy) than on held-out A (46% accuracy), indicating that it learns the source-specific feature very well but not the transferable feature. The next question is which features joint training relies on to fit the target training labels. Figure 2(a) shows strong correlation between the joint-training model's features and the labels on the target training set, but much less correlation on the target test set, suggesting that the feature extractor, applied to the target data (which lack source-specific features), overfits the target training set. This corroborates the poor accuracy of joint training on the target test set (Figure 1), which is similar to that of target-only training.

σ(e_{[i]}^⊤ x) (where i ∈ [k]) for the source during pre-training, which does not transfer to the target when i ≠ 1. Joint training fails because it uses one neuron to learn a generalized linear model that fits the n_t target training examples exactly, and uses another neuron to learn a random feature σ(e_{[i]}^⊤ x) (where i ∈ [k]) for the source. The proof of Theorem 1 is deferred to Section B.
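For concreteness, the two-layer quadratic parameterization underlying this analysis, f_{θ,φ}(x) = Σ_i θ_i σ(φ_i^⊤ x) with σ(t) = t² (equivalently x^⊤ A x for A = Σ_i θ_i φ_i φ_i^⊤), can be written down directly. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def quad_net(x, theta, Phi):
    # Two-layer net with quadratic activation sigma(t) = t^2:
    # f(x) = sum_i theta_i * (phi_i^T x)^2, with phi_i the rows of Phi.
    return float(theta @ (Phi @ x) ** 2)

def as_matrix(theta, Phi):
    # Equivalent matrix form: f(x) = x^T A x with A = sum_i theta_i phi_i phi_i^T.
    return Phi.T @ np.diag(theta) @ Phi
```

A single neuron with φ = e_1 computes exactly the transferable feature x_1², while a neuron with φ = e_j (j ≠ 1) computes a source-specific feature x_j².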

Figure 3: Comparison of intra-class to inter-class variance ratio. This quantity is lowest for MeRLin, indicating that it separates classes best.

Low values of this ratio correspond to representations in which classes are well separated. Results on the ImageNet → CUB-200 and ImageNet → Stanford Cars tasks are shown in Figure 3. MeRLin attains a much smaller ratio than the baselines.
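One plausible instantiation of this ratio (the paper's exact normalization may differ) compares within-class scatter to the scatter of class means:

```python
import numpy as np

def variance_ratio(features, labels):
    # Intra-class to inter-class variance ratio of a feature matrix
    # (rows = samples). Lower values mean samples cluster tightly
    # around well-separated class means.
    labels = np.asarray(labels)
    mu = features.mean(axis=0)          # global mean
    intra, inter = 0.0, 0.0
    for c in np.unique(labels):
        Fc = features[labels == c]
        mu_c = Fc.mean(axis=0)
        intra += ((Fc - mu_c) ** 2).sum()          # scatter within class c
        inter += len(Fc) * ((mu_c - mu) ** 2).sum()  # scatter of class means
    return intra / inter
```

On well-separated clusters this ratio is close to 0; on heavily overlapping clusters it is large.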

1: Input: the source dataset D_s and the evolving target dataset D_t.
2: Output: learned representations φ.
3: for iter = 0 to MaxIter do
4:   Initialize the target head θ_t.
5:   Randomly sample a target training set D_t^tr and a target validation set D_t^val from D_t.
6:   for t = 0 to n do

CUB-200 (Wah et al., 2011) is a fine-grained dataset of 200 bird species. The training set consists of 5,994 images and the test set of 5,794 images. http://www.vision.caltech.edu/visipedia/CUB-200-2011.html

Stanford Cars (Krause et al., 2013) contains 16,185 images of 196 classes of cars, split into 8,144 training images and 8,041 test images. http://ai.stanford.edu/~jkrause/cars/car_dataset.html

Food-101 (Bossard et al., 2014) is a fine-grained classification dataset of 101 kinds of food, with 750 training images and 250 test images per class. http://www.vision.ee.ethz.ch/datasets_extra/food-101/

Caltech-256 (Griffin et al., 2007) is an object recognition dataset of 256 categories. In our experiments, the training set consists of 25,468 images and the test set of 5,139 images. http://www.vision.caltech.edu/Image_Datasets/Caltech256/

MIT-Indoors (Quattoni & Torralba, 2009) is a scene recognition dataset of 67 classes. The training set contains 80 examples per class and the test set 20 examples per class. http://web.mit.edu/torralba/www/indoor.html

Stanford Dogs (Khosla et al., 2011) is a fine-grained classification dataset of 120 breeds of dogs. http://vision.stanford.edu/aditya86/ImageNetDogs/main.html

Aircraft (Maji et al., 2013) is an object recognition dataset of 100 aircraft variants. Each class has 80 images for training and 20 images for testing. http://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/

MNIST (LeCun et al., 1998) is a dataset of hand-written digits with 60,000 training and 10,000 test examples. We randomly sample 600 examples as the target training set. http://yann.lecun.com/exdb/mnist/

SVHN (Netzer et al., 2011) is a real-world image dataset of street-view house numbers, with 73,257 digits for training and 26,032 digits for testing.
In Figure 3 of the main text, we randomly sample 600 digits for training, and all 26,032 digits are used for testing. http://ufldl.stanford.edu/housenumbers/

USPS (Hull, 1994) is a database for handwritten text recognition, with 7,291 training and 2,007 test images. We randomly sample 600 examples as the target training set. https://www.kaggle.com/bistaumanga/usps-dataset

A.5 FURTHER ABLATION STUDY.

MeRLin (pre-trained) performs worse than MeRLin, but it still improves over fine-tuning and joint training. Note that MeRLin (pre-trained) only needs to train on ImageNet for 2 epochs, much shorter than joint training. MeRLin-target-only improves over target-only training by 8%, indicating that meta-learning helps avoid overfitting even without the source dataset.

A.6 FEATURE-LABEL CORRELATION

Figure 4: (a) Analysis of Feature Quality. Comparison of feature-label correlation. A lower quantity is better, and MeRLin has the lowest value. (b) Sensitivity of the proposed method to hyper-parameters. We test the accuracy on Food-101→CUB-200 with varying ρ and λ and provide the visualization.

We compute y^⊤(HH^⊤)^{-1}y on ImageNet → CUB-200 and ImageNet → Stanford Cars. As shown in Figure 4(a), the features learned by MeRLin are more closely related to the labels than those learned by fine-tuning and joint training, indicating that MeRLin indeed learns more transferable features than the baselines.

A.7 SENSITIVITY OF THE PROPOSED METHOD TO HYPER-PARAMETERS.

We test the model on Food-101→CUB-200 with varying hyper-parameters ρ and λ. The results in Figure 4(b) indicate that the model is not sensitive to either hyper-parameter. Intuitively, a larger ρ places more emphasis on the target meta-task; as ρ approaches 0, the performance of MeRLin approaches that of fine-tuning. λ regularizes the classifier in the inner-loop training. It is also noteworthy that λ avoids the problem of HH^⊤ occasionally being non-invertible; without λ, the model can sometimes fail to converge.

A.8 BOOTSTRAPPED CONFIDENCE INTERVALS OF OBJECT RECOGNITION RESULTS

Figure 5: Comparison of L2-sp and MeRLin. The results are averaged over 10 runs, with error bars representing 95% confidence interval drawn by 1,000 bootstraps.


Define the d × d matrix A = Σ_{i=1}^m θ_{si} φ_i φ_i^⊤. (55) Let x_{[1:k]} and x_{[k+1:d]} denote the first k and last d-k dimensions of x, and let A_{k,k}, A_{k,k̄}, A_{k̄,k̄} be the k × k, k × (d-k), and (d-k) × (d-k) blocks corresponding to the upper-left, upper-right, and lower-right parts of A. For a random vector x whose first k dimensions are independently uniform on {±1} and whose last d-k dimensions are independently uniform on {0, ±1}, define the random variables A_1 = x_{[1:k]}^⊤ A_{k,k} x_{[1:k]} and A_2 = x_{[k+1:d]}^⊤ A_{k̄,k̄} x_{[k+1:d]}. (Note that x is defined on a different distribution than D_s.) The loss part of L^λ_{D_s}(θ_s, φ) can be lower bounded by:

E_{x,y∼D_s}[ℓ(f_{θ_s,φ}(x), y)]   (56)
= E_{x,y∼D_s}[(x_{[1:k]}^⊤ A_{k,k} x_{[1:k]} + 2 x_{[1:k]}^⊤ A_{k,k̄} x_{[k+1:d]} + x_{[k+1:d]}^⊤ A_{k̄,k̄} x_{[k+1:d]} - y)²]   (57)
≥ E_{x,y∼D_s}[(x_{[1:k]}^⊤ A_{k,k} x_{[1:k]} + x_{[k+1:d]}^⊤ A_{k̄,k̄} x_{[k+1:d]} - y)²]

E_{x,y∼D_s}[x_{[k+1:d]}^⊤ A_{k̄,k̄} x_{[k+1:d]} · x_{[1:k]}^⊤ A_{k,k̄} x_{[k+1:d]}] = 0,   (62)
E_{x,y∼D_s}[x_{[1:k]}^⊤ A_{k,k̄} x_{[k+1:d]} · y] = 0.   (63)

The inequality is an equality if and only if A_{k,k̄} = 0. The regularizer part of L^λ_{D_s}(θ_s, φ) can be lower bounded by:

are equalities if and only if (θ_{si})² = ½||φ_i||² > 0 for at most one i ∈ [m], and |θ_{si}| = ||φ_i|| = 0 for all other i ∈ [m].

Accuracy (%) on computer vision tasks. MeRLin 93.34 ± 0.41 93.10 ± 0.38 75.42 ± 0.47 82.45 ± 0.26 83.68 ± 0.57 58.68 ± 0.43

Accuracy (%) of BERT-base on GLUE sub-tasks dev set.

, or Stanford Cars (Krause et al., 2013). These datasets have 25,468, 5,994, and 8,144 labeled examples respectively, much smaller than ImageNet with 1.2M labeled examples. Caltech-256 is a general image classification dataset of 256 classes. Stanford Cars and CUB-200 are fine-grained classification datasets with 196 categories of cars and 200 categories of birds, respectively. We use ResNet-18

Accuracy (%) of computer vision tasks with ResNet-50 backbone and ImageNet.

Input: the source dataset D_s and the evolving target dataset D_t.

Accuracy on Food → CUB.


Proof of Theorem 1. We prove the joint-training part of Theorem 1 following this intuition: (1) the total loss of every solution with target loss E_{x,y∼D_t}[ℓ(f_{θ_t,φ}(x), y)] ≤ ε is lower bounded, as indicated by Lemma 1, and (2) there exists a solution with loss smaller than this lower bound, as indicated by Lemma 2.

By Lemma 1, for any (θ_s, θ_t, φ) satisfying E_{x,y∼D_t}[ℓ(f_{θ_t,φ}(x), y)] ≤ ε, the joint training loss is lower bounded as in Equation 44. Let P_X e_1 be the projection of the vector e_1 onto the subspace spanned by the target data; note that ⟨P_X e_1, x_i^t⟩ = ⟨e_1, x_i^t⟩ = x_{i[1]}^t for every target example, so P_X e_1 satisfies the assumption on φ̂_t in Lemma 2. Hence, by Lemma 2, there exists a solution (θ_s, θ_t, φ) whose loss is upper bounded as in Equation 45. According to Lemma 3, there exist absolute constants c ∈ (0, 1) and C > 0 such that, as long as n_t ≤ cd, with probability at least 1 - 4 exp(-Cd), Equation 46 holds, i.e., ||P_X e_1||_2 ≤ c_0 √(n_t/d).

Now we prove that the upper bound in Equation 45 is smaller than the lower bound in Equation 44: the first inequality uses the fact that |µ| < 1 for the optimal µ, and the second inequality follows from Equation 46. This completes the proof for joint training.

Next, we prove the result for fine-tuning. According to Lemma 4, any minimizer (θ̂_s, φ̂_pre) of L^λ_{D_s}(θ_s, φ) either satisfies φ̂_pre = 0, or has exactly one non-zero φ_i, which (up to scaling) equals e_j for some j ∈ [k]. When φ̂_pre = 0, the features are identically zero, so fine-tuning the head cannot do better than a constant predictor on the target. When exactly one φ_i is non-zero and equals e_j for some j ∈ [k], since all of the first k dimensions are equivalent for the source task, with probability 1 - 1/k this dimension satisfies j ≠ 1. The target function fine-tuned on this φ̂_pre has the form f_{θ_t, φ̂_pre}(x) = γ x_j² for some γ ∈ R, which cannot fit the target ground truth when j ≠ 1. Combining these two possibilities finishes the proof for fine-tuning. Finally, setting ε = min{ε_0, 10/27} finishes the proof of Theorem 1.

Hence, when n_t ≥ 3 log(2k/ξ), the probability that any e_j with j ≠ 1 fits the target data is smaller than ξ/2.
Therefore, with probability 1 - ξ/2, E[L_{D_t^val}(θ̂_t(φ), φ)] > 0 for any j ≠ 1. So with probability at least 1 - ξ, the only minimizers of L^λ_meta(θ_s, φ) are the subset of minimizers of L^λ_{D_s}(θ_s, φ) with feature e_1, and with this φ and any random D_t^tr, the only θ̂_t(D_t^tr, φ) that fits the target recovers the ground truth, i.e., E_{x,y∼D_t}[ℓ(f_{θ̂_t(D_t^tr,φ),φ}(x), y)] = 0.
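The claim from Lemma 4 that µ* > 0 whenever λ < 0.1 (used in the argument above) is easy to check numerically on a grid; λ below is an arbitrary value of ours in (0, 0.1):

```python
import numpy as np

lam = 0.05  # any value in (0, 0.1)
mu = np.linspace(0.0, 1.0, 100001)
# L(mu) = (3 lam / 2^{2/3}) |mu|^{2/3} + (2/3)(mu - 1)^2, as in Lemma 4
L = 3 * lam / 2 ** (2 / 3) * np.abs(mu) ** (2 / 3) + (2 / 3) * (mu - 1) ** 2
mu_star = float(mu[np.argmin(L)])
```

Since L(1) < L(0) for such λ, the minimizer on the grid lands strictly away from 0, matching the lemma.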

