TO LEARN EFFECTIVE FEATURES: UNDERSTANDING THE TASK-SPECIFIC ADAPTATION OF MAML

Abstract

Meta learning, an effective way of learning unseen tasks from few samples, is an important research area in machine learning. Model-Agnostic Meta-Learning (MAML) (Finn et al. (2017)) is one of the most well-known gradient-based meta learning algorithms; it learns a meta-initialization through an inner and an outer optimization loop. The inner loop performs fast adaptation in several gradient update steps on the support datapoints, while the outer loop generalizes the updated model to the query datapoints. Recently, it has been argued that instead of rapid learning and adaptation, the meta-initialization learned by MAML has already absorbed a high-quality feature prior, and that the task-specific head at training facilitates feature learning. In this work, we investigate the impact of the task-specific adaptation of MAML and discuss the general formula shared with other gradient-based and metric-based meta-learning approaches. From our analysis, we further devise the Random Decision Planes (RDP) algorithm, which finds a suitable linear classifier without any gradient descent step, and the Meta Contrastive Learning (MCL) algorithm, which exploits inter-sample relationships instead of the expensive inner-loop adaptation. We conduct extensive experiments on various datasets to evaluate our proposed algorithms.

1. INTRODUCTION

Few-shot learning, which aims to learn from few labelled examples, is a great challenge for modern machine learning systems. Meta learning, an effective way of tackling this challenge, enables a model to learn general knowledge across a distribution of tasks, and various meta learning ideas have been proposed to address few-shot problems. Gradient-based meta learning (Finn et al. (2017); Nichol et al. (2018)) learns meta-parameters that can be quickly adapted to new tasks in a few gradient descent steps. Metric-based meta learning (Koch et al. (2015); Vinyals et al. (2016); Snell et al. (2017)) learns a metric space by comparing different datapoints. Memory-based meta learning (Santoro et al. (2016)) can rapidly assimilate new data and leverage the stored information to make predictions. Model-Agnostic Meta-Learning (MAML) (Finn et al. (2017)) is one of the most well-known gradient-based meta learning algorithms; it learns the meta-initialization parameters through an inner and an outer optimization loop. For a given task, the inner loop performs fast adaptation in several gradient descent steps on the support datapoints, while the outer loop generalizes the updated model to the query datapoints. With the learned meta-initialization, the model can be quickly adapted to unseen tasks with few labelled samples. Following the MAML algorithm, many significant variants (Finn et al. (2018); Rusu et al. (2018); Oreshkin et al. (2018); Bertinetto et al. (2018); Lee et al. (2019b)) have been studied under the few-shot setting. To understand how MAML works, Raghu et al. (2019) conduct a series of experiments and claim that rather than rapid learning and adaptation, the learned meta-initialization has already absorbed a high-quality feature prior, so the representations after fine-tuning are almost the same for unseen tasks.
They also observe that the task-specific head of MAML at training facilitates the learning of better features. In this paper, we design more representative experiments and present a formal argument to explain the importance of the task-specific adaptation. In fact, the multi-step task-specific adaptation, which gives the body and head similar classification capabilities, provides a better gradient descent direction for the feature learning of the body. We also notice that for both gradient-based methods (e.g. MAML (Finn et al. (2017)), MetaOptNet (Lee et al. (2019b))) and metric-based methods (e.g. Prototypical Networks (Snell et al. (2017))) that learn a task-specific head from the support datapoints, this adaptation is a common mode for the feature learning of the body, though it varies across methods. Based on our analysis, we first propose a new training paradigm that finds a decision plane (linear classifier) for guidance with no gradient descent step during the inner loop, and obtain more supporting conclusions. Moreover, we devise another training paradigm that removes the inner loop and trains the model with only the query datapoints. Specifically, inspired by contrastive representation learning (Oord et al. (2018); Chen et al. (2020); He et al. (2020)), we exploit the inter-sample relationships of the query set to find a guidance for the body across different tasks. This meta contrastive learning algorithm even achieves results comparable to some state-of-the-art methods. In total, our contributions can be listed as follows:

1. We present extensive experiments and a formal argument to explore the impact of the task-specific adaptation on body feature learning, and discuss the general formula shared with other gradient-based and metric-based meta-learning approaches.

2. We devise a training algorithm, named Random Decision Planes (RDP), that obtains a decision plane with no gradient descent step during the inner loop, and obtain more supporting conclusions.

3. Unlike prior gradient-based methods, we propose the Meta Contrastive Learning (MCL) algorithm, which exploits inter-sample relations instead of training a task-specific head during the inner loop. Even without the task-specific adaptation for guidance, our algorithm still achieves better results at even lower computation cost.

4. We empirically show the effectiveness of the proposed algorithms with different backbones on four benchmark datasets: miniImageNet (Vinyals et al. (2016)), tieredImageNet (Ren et al. (2018)), CIFAR-FS (Bertinetto et al. (2018)) and FC100 (Oreshkin et al. (2018)).

2. RELATED WORKS

MAML (Finn et al. (2017)) is a highly influential gradient-based meta learning algorithm for few-shot learning, whose strong experimental results on several public few-shot datasets have proved its effectiveness. Following the core idea of MAML, numerous works handle the data-insufficiency problem in few-shot learning. Some works (Oreshkin et al. (2018); Vuorio et al. (2019)) introduce task-dependent representations by conditioning the feature extractor on the specific task to improve performance. Sun et al. (2019) employ meta-learned scaling and shifting parameters for transferring from another large-scale dataset. Others (Grant et al. (2018); Finn et al. (2018); Lee et al. (2019a)) study this problem from a Bayesian perspective. Unlike prior methods, we provide two training paradigms, one with no gradient descent step during the inner loop and another removing the inner loop and exploiting inter-sample relations for training. Recent works also explore the key factors that make meta-learned models perform better than others at few-shot tasks. Chen et al. (2019) discover that a deeper backbone has a large effect on the success of meta learning algorithms, while Goldblum et al. (2020) find that meta learning tends to cluster object classes more tightly in feature space for methods that fix the backbone during the inner loop (Bertinetto et al. (2018); Rusu et al. (2018)). A very recent work (Raghu et al. (2019)) argues that the meta-trained model can be applied to new tasks due to the high-quality feature prior learned by the meta-initialized parameters rather than rapid learning. In this paper, we further study the impact of the task-specific adaptation on feature learning. Based on the analysis, we devise two algorithms, Random Decision Planes (RDP) and Meta Contrastive Learning (MCL), requiring less computation cost but still achieving competitive performance.

3. MODEL-AGNOSTIC META LEARNING (MAML)

MAML aims to learn the meta-initialized parameters $\theta$ for unseen tasks through an inner optimization loop and an outer optimization loop. Under the N-way-K-shot setting, for a task $T_b$ sampled from the task distribution $P(T)$, we have a support set $T_b^s$ of $N \times K$ examples and a query set $T_b^q$, where $N$ is the number of sampled classes and $K$ is the number of instances per class. During the inner loop, with the support set $T_b^s$, we perform fast adaptation in several gradient descent steps and obtain the task-specific parameters $\theta_{T_b}^t$, where $t$ is the number of gradient descent steps:

$$\theta_{T_b}^{t} = \theta_{T_b}^{t-1} - \alpha \nabla_{\theta_{T_b}^{t-1}} \mathcal{L}_{T_b^s}\big(\theta_{T_b}^{t-1}\big) \quad (1)$$

where $\alpha$ is the inner-loop step size and $\mathcal{L}_{T_b^s}(\theta_{T_b}^{t-1})$ denotes the loss on the support set $T_b^s$ after $t-1$ steps. With the query set $T_b^q$, we compute the meta loss at the task-specific parameters $\theta_{T_b}^t$ and backpropagate to update the meta-initialized parameters $\theta$:

$$\theta = \theta - \beta \nabla_{\theta} \frac{1}{B} \sum_{b=1}^{B} \mathcal{L}_{T_b^q}\big(\theta_{T_b}^{t}\big) \quad (2)$$

where $\beta$ is the outer-loop learning rate and $B$ is the number of sampled tasks in a batch.
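The two-level update can be sketched on a toy task distribution. The snippet below is a minimal first-order (FOMAML-style) sketch under our own toy setup, not the paper's implementation: it drops the second-order term of Equation 2, and all names (`inner_adapt`, `sample_task`, the task distribution centred at `base`) are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
base = np.ones(5)  # mean of the toy task distribution (our assumption)

def mse_grad(w, X, y):
    # Gradient of the mean squared error for a linear model y_hat = X @ w.
    return X.T @ (X @ w - y) / len(y)

def sample_task(d=5, n=10):
    # A task: linear regression whose true weights vary around `base`.
    w_true = base + 0.1 * rng.normal(size=d)
    X_s, X_q = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    return (X_s, X_s @ w_true), (X_q, X_q @ w_true)

def inner_adapt(theta, X_s, y_s, alpha=0.01, steps=5):
    # Inner loop (Eq. 1): a few gradient steps on the support set.
    w = theta.copy()
    for _ in range(steps):
        w -= alpha * mse_grad(w, X_s, y_s)
    return w

# Outer loop (Eq. 2), first-order approximation: the query-set gradient is
# evaluated at the adapted parameters and applied to the meta-initialization.
theta, beta, B = np.zeros(5), 0.05, 4
for _ in range(200):
    meta_grad = np.zeros_like(theta)
    for _ in range(B):
        (X_s, y_s), (X_q, y_q) = sample_task()
        meta_grad += mse_grad(inner_adapt(theta, X_s, y_s), X_q, y_q) / B
    theta -= beta * meta_grad
```

After training, `theta` drifts from the origin toward the centre of the task distribution, so a few support-set steps suffice to fit any sampled task.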

4. IMPACT OF TASK-SPECIFIC ADAPTATION

4.1. THE MULTI-STEP TASK-SPECIFIC ADAPTATION IS IMPORTANT

To explore the effectiveness of MAML, Raghu et al. (2019) have conducted extensive experiments indicating that the network body (the representation layers) has already absorbed a high-quality feature prior: during meta-testing, instead of fine-tuning the network head (the classifier), simply building prototypes from the support set achieves performance comparable to MAML. Raghu et al. (2019) also show that the task specificity of the head at training can facilitate feature learning and ensure good representation learning in the network body. In our work, we show that besides the task specificity of the head, the multi-step adaptation is also essential, and we further study the roles of the network body and head during meta-training. We devise several methods using different training regimes: (1) Multi-Task, where all tasks share one common head and the model is trained in the traditional way without inner-loop adaptation. As the results in Table 1 show, ANIL training remains comparable to the standard MAML algorithm, indicating that the task-specific adaptation of the network body is unnecessary for learning good features. More interestingly, BOHI training, which keeps the meta-initialization of the head unchanged, even performs better than MAML, further demonstrating that good feature learning depends on the multi-step task-specific adaptation of the head during the inner loop more than on updating the meta-initialization of the head in the outer loop. Also, ANIL and BOHI have similar performance, indicating that compared with the learned prior knowledge in the head, the inner-loop adaptation, acting as a guidance, contributes more to feature learning. More experimental results can be found in Appendix C.2.
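The prototype-based meta-testing described above can be sketched in a few lines. This is an illustrative NumPy version under our own naming (`prototype_predict`), not the authors' code:

```python
import numpy as np

def prototype_predict(z_support, y_support, z_query):
    """Nearest-prototype classification: each class prototype is the mean of
    that class's support embeddings; a query takes the label of the closest
    prototype in Euclidean distance."""
    classes = np.unique(y_support)
    protos = np.stack([z_support[y_support == c].mean(axis=0) for c in classes])
    # Squared distances between every query embedding and every prototype.
    d2 = ((z_query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]
```

No head parameters are fine-tuned here; only the frozen body's embeddings are used, which is exactly why this evaluation isolates the quality of the learned features.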

4.2. WHY IS MULTI-STEP TASK-SPECIFIC ADAPTATION IMPORTANT?

Having observed that MAML outperforms Multi-Task training by a large margin and that the multi-step task-specific adaptation is important for feature learning, we extend our analysis to explore why the inner-loop adaptation is essential for MAML at different stages of meta-training. Specifically, we freeze the initialized MAML model and the model at 5,000 iterations, sample validation tasks from the task distribution, and record the test accuracy at different inner-loop steps. Both the body accuracy based on prototype construction and the head accuracy based on fine-tuning are given in Figure 1 and Figure 2, where "Task ID" stands for different tasks. As the results show, at different stages of meta-training, the head accuracy increases significantly in the first few adaptation steps, since the model has learnt the correspondence between samples and labels. However, at the beginning of training, there is only a small improvement in body accuracy after the first adaptation step. In Figure 2, as the model converges, the body accuracy even decreases in the first few adaptation steps. In the following steps, with the task-specific adaptation of the head, the network body then learns better representations, further demonstrating that the multi-step task-specific adaptation, which gives the body and head similar classification capabilities, can be regarded as a guidance that provides a better gradient descent direction for the feature learning of the body.
To understand this intuitive argument better, we consider a sample $(x, y)$ for few-shot classification where the cross-entropy loss is employed:

$$\mathcal{L}_c = -\log\Big(\frac{\exp(w_y^\top h)}{\sum_k \exp(w_k^\top h)}\Big) = -w_y^\top h + \log\Big(\sum_k \exp(w_k^\top h)\Big) \quad (3)$$

where $\{w_1, w_2, ..., w_K\}$ are the weights of the classifier head and $h$ is the body representation of $x$. The gradient of the loss $\mathcal{L}_c$ with respect to the body representation $h$ is

$$\frac{\partial \mathcal{L}_c}{\partial h} = -w_y + \sum_k w_k \frac{\exp(w_k^\top h)}{\sum_{k'} \exp(w_{k'}^\top h)} = -w_y + \bar{w} \quad (4)$$

where $\bar{w}$ is exactly the softmax-weighted average of the weights $\{w_1, ..., w_K\}$. As shown in Equation 4, a reasonable direction for the network body to minimize the target loss $\mathcal{L}_c$ is to move the representation $h$ closer to the corresponding class weight $w_y$: $h \leftarrow h + \lambda(w_y - \bar{w})$. As the model converges, in the first few adaptation steps there is a significant margin between the performance of head and body, and the classifier weights contain little knowledge about the correspondence between samples and labels or the differences between classes. With a low-performance head, this update rule for the body may degrade the quality of the features, which also explains why the simpler BOHI and ANIL even perform better than MAML in Table 1. After several adaptation steps during the inner loop, the body then receives useful guidance for feature learning from the task-specific head, since $w_y$ can better express its corresponding class. This formulation shows that the multi-step task-specific adaptation, which gives the body and head similar classification capabilities, provides a better gradient descent direction for the feature learning of the body.
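Equation 4 is easy to verify numerically: the gradient of the cross-entropy loss with respect to the representation equals $-w_y$ plus the softmax-weighted average of the classifier weights. The check below uses random weights of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 5, 8
W = rng.normal(size=(K, d))   # classifier weights w_1, ..., w_K
h = rng.normal(size=d)        # body representation of the sample x
y = 2                         # index of the true class

def loss(h):
    # Cross-entropy loss of Equation 3, written in its log-sum-exp form.
    logits = W @ h
    return -logits[y] + np.log(np.exp(logits).sum())

# Analytic gradient from Equation 4: -w_y + softmax-weighted average of rows.
p = np.exp(W @ h)
p /= p.sum()
analytic = -W[y] + p @ W

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([(loss(h + eps * e) - loss(h - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
```

The two gradients agree to numerical precision, confirming that the body's descent direction is entirely determined by the head weights.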

4.3. TASK-SPECIFIC ADAPTATION IN OTHER META-LEARNING ALGORITHMS

We have seen that the multi-step task-specific adaptation of MAML, which promotes the performance of the head, facilitates the feature learning of the body. A similar mechanism is at work in other gradient-based methods that use end-to-end fine-tuning, such as Reptile (Nichol et al. (2018)). For meta-learning methods that fix the network body and only update the head during the inner loop, such as MetaOptNet (Lee et al. (2019b)) and R2-D2 (Bertinetto et al. (2018)), the convex optimization of the head likewise aims to provide a classifier with better classification capabilities. For metric-based methods, such as Prototypical Networks (Snell et al. (2017)), the adaptation of the head is effectively conducted through the nearest-neighbor algorithm. In conclusion, the adaptation is a common mode but varies across methods. These meta-learning algorithms reveal a general formula: the inner loop builds a task-specific head that matches the classification capabilities of the body, and the outer loop performs task-independent feature learning.
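The general formula can be made concrete: on frozen support features, the inner loop just constructs a head, whether by class means (metric-based) or by a few softmax-regression steps (gradient-based). The sketch below is our own illustration of these two strategies, with a hypothetical helper name (`build_head`), not code from any of the cited methods:

```python
import numpy as np

def build_head(z_s, y_s, n_way, strategy="prototype", steps=5, alpha=0.1):
    """Inner loop as head construction on frozen support features z_s."""
    if strategy == "prototype":
        # Metric-based flavour: class means serve as classifier weights.
        return np.stack([z_s[y_s == c].mean(axis=0) for c in range(n_way)])
    # Gradient-based flavour: a few softmax-regression steps from zero init.
    W = np.zeros((n_way, z_s.shape[1]))
    onehot = np.eye(n_way)[y_s]
    for _ in range(steps):
        logits = z_s @ W.T
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= alpha * (p - onehot).T @ z_s / len(y_s)
    return W
```

Either head then scores query features with `z_q @ W.T`; the outer loop would backpropagate the query loss through `z_s` into the body only, which is the task-independent part of the formula.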

5. THE RANDOM DECISION PLANES ALGORITHM

As discussed above, the multi-step gradient-descent adaptation during the inner loop serves to provide guidance for the feature learning of the body. From this consideration, we suppose that if a suitable linear classifier is given, feature learning can be facilitated even without gradient descent during the inner loop, and we devise such an algorithm, named Random Decision Planes (RDP), where a classifier is chosen from a predefined set P according to the target loss on the support set. The predefined classifier set P consists of n_p different orthonormal matrices generated by the Gram-Schmidt method from random matrices. During the inner loop, without any gradient descent, we directly choose the most suitable classifier as the network head, i.e. the one that minimizes the cross-entropy loss on the support set. In the outer loop, we compute the loss based on the chosen head and backpropagate to update the network body. A formal description of RDP is presented in Algorithm 1, and the implementation details can be found in Appendix C.1. The overall evaluation results on three datasets are presented in Table 2. Note that we again remove the head and construct prototypes from the body network f_θ for predictions during meta-testing. The proposed RDP algorithm performs comparably to the standard MAML method on all three datasets, especially on FC100. Even without any task-specific gradient adaptation, the best-performing classifier chosen from a set of randomly generated subspaces can guide feature learning. This further suggests that a head with better classification capabilities is a key factor in learning good representations, even if the chosen approximate head performs worse than a gradient-based head, and that the main purpose of the task-specific adaptation is to adjust the low-performance head for the feature learning of the body.
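The head-selection step of RDP can be sketched as follows. This is our own minimal NumPy illustration of the idea (QR decomposition stands in for the Gram-Schmidt step; the helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthonormal_heads(n_p, n_way, dim):
    # Build n_p candidate heads, each with n_way orthonormal rows; QR on a
    # random Gaussian matrix is a numerically stable equivalent of Gram-Schmidt.
    heads = []
    for _ in range(n_p):
        Q, _ = np.linalg.qr(rng.normal(size=(dim, n_way)))
        heads.append(Q.T)
    return heads

def cross_entropy(W, Z, y):
    # Mean cross-entropy of head W on feature set Z with labels y.
    logits = Z @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

def choose_head(heads, Z_support, y_support):
    # RDP's "inner loop": a discrete argmin, with no gradient step at all.
    losses = [cross_entropy(W, Z_support, y_support) for W in heads]
    return heads[int(np.argmin(losses))]
```

The query loss computed with the chosen head is then backpropagated to the body only, exactly as in the outer loop of Algorithm 1.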
Algorithm 2: The Meta Contrastive Learning (MCL) Algorithm for N-way learning
Input: network body f_θ, projection layer g_φ, learning rate β, temperature τ, task distribution P(T)
while not done do
  Sample a batch of tasks {T_b}_{b=1}^B, where T_b ∼ P(T)
  for b ∈ {1, ..., B} do
    Sample the query set T^q_b = {(x^q_i, y^q_i)}_{i=1}^{2N} from task T_b, with y^q_{2k-1} = y^q_{2k} for k ∈ {1, ..., N}
    for i ∈ {1, ..., 2N} do z_i = g_φ(f_θ(x^q_i)) end for
    for i, j ∈ {1, ..., 2N} do s_{i,j} = z_i^⊤ z_j / (‖z_i‖ ‖z_j‖) end for
    Define l(i, j) = −log( exp(s_{i,j}/τ) / Σ_{k=1}^{2N} 1[k ≠ i] exp(s_{i,k}/τ) )
    L_b = (1/2N) Σ_{k=1}^N [ l(2k−1, 2k) + l(2k, 2k−1) ]
  end for
  θ = θ − β ∇_θ (1/B) Σ_{b=1}^B L_b
  φ = φ − β ∇_φ (1/B) Σ_{b=1}^B L_b
end while

We also conduct experiments to explore the impact of the number of decision planes; results on two datasets are shown in Figure 3. With a small set of decision planes, it is harder to find a suitable head to guide feature learning, while with enough decision planes the performance reaches its upper limit.

6. THE META CONTRASTIVE LEARNING ALGORITHM

We have already seen that the multi-step task-specific adaptation, by improving the classifier head, essentially facilitates the feature learning of the body. In general, prior gradient-based methods based on the cross-entropy loss learn the correspondence between samples and assigned labels for different tasks, and thus require the task-specific adaptation of the classifier head during the inner loop. Since the task-specific head ultimately serves the feature learning of the body, we ask whether we can remove the inner loop, or the adaptation, and make full use of the label information in another way to guide feature learning. From this consideration, and inspired by recent works on self-supervised contrastive learning (Chen et al. (2020); He et al. (2020)), we devise the Meta Contrastive Learning (MCL) algorithm, which removes the inner loop entirely and exploits inter-sample relationships using only the query set. Specifically, rather than using the cross-entropy loss for task-specific adaptation, we simply impose that normalized representations from the same class lie closer together than representations from different classes. For N-way few-shot learning, we sample two examples per class to build the query set. For a given anchor example, the meta contrastive loss pulls it closer to the point of the same class while pushing it farther away from the negative examples of other classes. Following Chen et al. (2020), we also employ a small neural-network projection layer that maps the body features to the space where the contrastive loss is applied. A formal description of MCL is presented in Algorithm 2, and the implementation details can be found in Appendix C.1. During meta-testing, we discard the projection layer g_φ and construct prototypes from the body network f_θ for predictions. The overall evaluation results on the miniImageNet, tieredImageNet and FC100 datasets are presented in Table 3. Note that TADAM (Oreshkin et al. (2018)) employs an extra task embedding network (TEN) block to predict element-wise scale and shift vectors, and MetaOptNet (Lee et al. (2019b)) learns a linear support vector machine (SVM) as the classifier head during the inner loop. Our MCL method is arguably simpler: by exploiting the relationship between different samples, we remove the inner loop with its complex adaptation process and devise a contrastive loss to train the network body directly. As the results show, our method outperforms almost all previous well-designed methods. We also study the impact of the projection layer g_φ. Figure 4 shows the evaluation results with different output dimensions, where "None" means that no projection layer is used for the loss computation. For the deeper ResNet12 backbone, the projection layer facilitates feature learning considerably (>6% for 5-shot, >5% for 1-shot). We conjecture that the projection layer is trained to extract task-specific information useful for the contrastive loss, while the body representation h learns more general information. More analysis can be found in Appendix C.4.
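The loss of Algorithm 2 is an NT-Xent-style objective over the 2N query embeddings. Below is a minimal NumPy sketch of that loss under our own naming (`meta_contrastive_loss`; rows 2k and 2k+1 form a positive pair in 0-indexed form), not the paper's implementation:

```python
import numpy as np

def meta_contrastive_loss(z, tau=0.5):
    """Meta contrastive loss over 2N embeddings z, where rows (2k, 2k+1)
    share a class. Mirrors the l(i, j) and L_b definitions of Algorithm 2."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    s = z @ z.T / tau
    np.fill_diagonal(s, -np.inf)                       # drop the k == i terms
    log_den = np.log(np.exp(s).sum(axis=1))            # log of the denominator
    n2 = len(z)
    total = 0.0
    for k in range(n2 // 2):
        i, j = 2 * k, 2 * k + 1
        total += (log_den[i] - s[i, j]) + (log_den[j] - s[j, i])
    return total / n2
```

During training this scalar would be backpropagated through both g_φ and f_θ; because same-class pairs drive the loss down, correctly paired batches score strictly lower than mispaired ones.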

7. CONCLUSION

In this paper, based on the hypothesis that feature reuse is the dominant factor in the success of the MAML algorithm, we further study the impact of the task-specific adaptation and devise several training regimes, including BOHI, Multi-Head and others. We also provide a more formal argument from the perspective of gradient descent optimization. Based on this analysis, we find that the multi-step task-specific adaptation, which gives the body and head similar classification capabilities, provides a better gradient descent direction for the feature learning of the body. We further connect our results to other meta-learning algorithms, showing that the adaptation is a common mode but varies across methods. From these considerations, we devise the RDP algorithm, where a suitable linear classifier is chosen without gradient descent, and obtain more supporting conclusions. We also build the MCL algorithm, which removes the inner loop and exploits inter-sample relations, achieving competitive results at lower computation cost.



Figure 2: The adaptation after 5,000 iterations for the sampled tasks in different steps.

Algorithm 1: The Random Decision Planes (RDP) Algorithm for N-way-K-shot learning
Input: network body f_θ, learning rate β, task distribution P(T)
Perform the Gram-Schmidt method on random matrices to get the classifier set P = {W_i}_{i=1}^{n_p}
while not done do
  Sample a batch of tasks {T_b}_{b=1}^B, where T_b ∼ P(T)
  for b ∈ {1, ..., B} do
    Sample the support set T^s_b = {(x^s_i, y^s_i)}_{i=1}^{N×K} and query set T^q_b = {(x^q_i, y^q_i)}_{i=1}^{N×K} from task T_b
    for each sample x in {T^s_b, T^q_b} do z = f_θ(x) end for
    Define CrossEntropyLoss(H, D) as the cross-entropy loss of head H on the feature-representation set D
    W* = argmin_{W ∈ P} CrossEntropyLoss(W, {(z^s_i, y^s_i)}_{i=1}^{N×K})
    L_b = CrossEntropyLoss(W*, {(z^q_i, y^q_i)}_{i=1}^{N×K})
  end for
  θ = θ − β ∇_θ (1/B) Σ_{b=1}^B L_b
end while

Figure 3: The effect of the number of decision planes on the miniImageNet and FC100 datasets.

Figure 4: The effect of output dimension of g φ on the MiniImageNet dataset.

For all training regimes, RDP and MCL, we use the Adam optimizer with weight decay of 5e-4 and the learning rate is set to 1e-3. For 4-layer convolution network with 64 filters, we flatten the output feature map of the network body, and obtain 1600-d features for miniImageNet and tieredImageNet, while 256-d features for CIFAR-FS and FC100. For ResNet12 network, we employ a global max pooling layer on the output feature map of the network body, and obtain 512-d features for four public datasets. During meta-training, we adopt horizontal flip, random crop and color (brightness,

Table 1: The evaluation results of 5-way-K-shot learning for methods with different training regimes on the miniImageNet and tieredImageNet datasets.

(2) Multi-Head, where different tasks are equipped with different heads for task specificity and the model is trained in the traditional way without inner-loop adaptation; (3) Almost No Inner Loop (ANIL), where the network body is fixed during the inner loop; (4) Body Outer Loop, Head Inner Loop (BOHI), where the network body is updated only in the outer loop while the head is adapted only in the inner loop.

Figure 1: The adaptation of the randomly initialized model for the sampled tasks in different steps.

Table 2: The evaluation results of 5-way-K-shot learning for the standard MAML and Random Decision Planes (RDP) with different backbones.

Table 3: The evaluation results of 5-way-K-shot learning for the Meta Contrastive Learning (MCL) and other baselines with different backbones.

The evaluation results of 5-way-K-shot learning on the MiniImageNet dataset.

A FEW-SHOT IMAGE CLASSIFICATION DATASETS

In this section, we introduce four benchmark datasets often used for few-shot image classification: miniImageNet (Vinyals et al. (2016)), tieredImageNet (Ren et al. (2018)), CIFAR-FS (Bertinetto et al. (2018)) and FC100 (Oreshkin et al. (2018)).

The miniImageNet (Vinyals et al. (2016)) dataset is a standard benchmark for few-shot image classification. It comprises 100 classes randomly chosen from the original ImageNet (Russakovsky et al. (2015)) dataset, where 64 classes are used for meta-training, 16 classes for meta-validation and 20 classes for meta-testing. Each class contains 600 images of size 84 × 84. Since the original class splits are unavailable, we use the commonly-used split proposed in Ravi & Larochelle (2016).

The tieredImageNet (Ren et al. (2018)) dataset is another, larger subset of ImageNet (Russakovsky et al. (2015)). It contains 608 classes grouped into 34 high-level categories, where 20 categories (351 classes) are used for meta-training, 6 categories (97 classes) for meta-validation and 8 categories (160 classes) for meta-testing. All images are also of size 84 × 84.

The CIFAR-FS (Bertinetto et al. (2018)) dataset is a few-shot image classification benchmark consisting of all 100 classes from CIFAR-100 (Krizhevsky et al. (2010)). These classes are randomly split into 64, 16, and 20 for meta-training, meta-validation and meta-testing respectively. Each class contains 600 images of size 32 × 32.

The FC100 (Oreshkin et al. (2018)) dataset is another benchmark derived from CIFAR-100 (Krizhevsky et al. (2010)). It comprises 100 classes grouped into 20 high-level categories, where 12 categories (60 classes) are used for meta-training, 4 categories (20 classes) for meta-validation and 4 categories (20 classes) for meta-testing. Each class contains 600 images of size 32 × 32. et al. (2018). We train all models for 100 epochs with 500 batches per epoch.
For MAML, BOHI and ANIL, models are trained using 5 gradient steps of size α = 0.01 for Conv4 and α = 0.1 for ResNet12. For the Random Decision Planes algorithm, the number of decision planes n_p is set to 64. For the Meta Contrastive Learning (MCL) algorithm, we apply a two-layer nonlinear projection layer with a hidden size of 512. Also, the query datapoints come from 10 different classes for each sampled task, which helps accelerate model convergence.

C.2 MORE RESULTS FOR BOHI, ANIL, MAML, MCL

In this section, we provide complete experimental results for BOHI, ANIL, MAML and MCL with different backbones on the four datasets, presented in Table 4, Table 5, Table 6 and Table 7 respectively. These results further verify the conclusions above: good feature learning depends on the multi-step task-specific adaptation of the head during the inner loop more than on updating the meta-initialization of the head in the outer loop, and with a low-performance head, the update of the body may even degrade the quality of the features. In addition, the results on the four datasets further demonstrate the effectiveness of our proposed MCL algorithm. Having found that with a deeper backbone the projection layer facilitates feature learning considerably, we further evaluate the quality of the features extracted by the network body and by the projection layer. The evaluation results are given in Table 9. Even though the contrastive loss is applied after the projection layer, the network body learns better and more general representations. We conjecture that during meta-training the projection layer absorbs more task-specific information while the backbone tends to learn task-independent representations.

