FIT: PARAMETER EFFICIENT FEW-SHOT TRANSFER LEARNING FOR PERSONALIZED AND FEDERATED IMAGE CLASSIFICATION

Abstract

Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication-efficient distributed training protocols. In this work, we develop FiLM Transfer (FiT), which fulfills these requirements in the image classification setting by combining ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter-efficient models with superior classification accuracy at low shot. The resulting parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication-efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low shot and achieves state-of-the-art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency and superior accuracy of FiT in distributed low-shot applications including model personalization and federated learning, where model update size is an important performance metric.

1. INTRODUCTION

With the success of the commercial application of deep learning in many fields such as computer vision (Schroff et al., 2015), natural language processing (Brown et al., 2020), speech recognition (Xiong et al., 2018), and language translation (Wu et al., 2016), an increasing number of models are being trained on central servers and then deployed on remote devices, often to personalize a model to a specific user's needs. Personalization requires models that can be updated inexpensively by minimizing the number of parameters that need to be stored and/or transmitted, and it frequently calls for few-shot learning methods, as the amount of training data from an individual user may be small (Massiceti et al., 2021). At the same time, for privacy, security, and performance reasons, it can be advantageous to use federated learning, where a model is trained on an array of remote devices, each with different data, which share gradient or parameter updates instead of training data with a central server (McMahan et al., 2017). In the federated learning setting, it is also beneficial to minimize the number of parameters that must be updated in each training round conducted by remote clients, in order to reduce communication cost with the server. The amount of training data available to the clients is often small, again necessitating few-shot learning approaches.

Figure 1: FiT is significantly more parameter efficient than BiT. Results summary for FiT and BiT in classification, personalization, and federated learning scenarios using the BiT-M-R50x1 backbone. The Parameters plots show the typical number of updateable parameters in each model, while the Cost plot shows the total client-server communication cost during federated training. In all settings, FiT achieves similar or better classification accuracy using orders of magnitude fewer updateable parameters and lower communication cost. Refer to Table 2, Table 4, and Table 5 for more detail.

In order to develop data-efficient and parameter-efficient learning systems, we draw on ideas developed by the few-shot learning community. Few-shot learning approaches can be characterized in terms of shared and updateable parameters. From a statistical perspective, shared parameters capture similarities between datasets, while updateable parameters capture the differences. Updateable parameters are those that are either recomputed or learned as the model is updated or retrained, whereas shared parameters are fixed. In personalized or federated settings, it is key to minimize the number of updateable parameters while still retaining the capacity to adapt.

Broadly, there are two different approaches to few-shot learning: meta-learning (Hospedales et al., 2020) and transfer learning (fine-tuning) (Yosinski et al., 2014). Meta-learning approaches provide methods that have a small number of updateable parameters (Requeima et al., 2019). However, while meta-learners can perform strongly on datasets that are similar to those they are meta-trained on, their accuracy suffers when they are tested on datasets that are significantly different (Dumoulin et al., 2021). Transfer learning algorithms often outperform meta-learners, especially on diverse datasets and even at low shot (Dumoulin et al., 2021; Tian et al., 2020). However, the leading Big Transfer (BiT) algorithm (Dumoulin et al., 2021; Kolesnikov et al., 2019) requires every parameter in a large network to be updated. In summary, performant transfer learners are parameter-inefficient, and parameter-efficient few-shot learners perform relatively poorly.
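The distinction between shared and updateable parameters can be sketched as a simple partition of a model's named parameters, where only the updateable group is fine-tuned and transmitted. The sketch below is purely illustrative: the parameter-name prefixes ("film", "head") are assumptions for this example, not identifiers from any released FiT code.

```python
def split_params(model_params, updateable_prefixes=("film", "head")):
    """Partition named parameters into shared (frozen) and updateable groups.

    model_params        : dict mapping parameter name -> parameter values
    updateable_prefixes : name prefixes marking the parameters that are
                          fine-tuned and transmitted; all others stay fixed
    """
    shared, updateable = {}, {}
    for name, value in model_params.items():
        group = updateable if name.startswith(updateable_prefixes) else shared
        group[name] = value
    return shared, updateable
```

Under such a split, only the (small) updateable group needs to be stored per user or communicated per federated round, while the shared backbone is distributed once.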
In this work we propose FiLM Transfer (FiT), a novel method that synthesizes ideas from both the transfer learning and meta-learning communities in order to achieve the best of both worlds: parameter efficiency without sacrificing accuracy, even when only a small number of training examples are available. From transfer learning, we take advantage of backbones pretrained on large image datasets and the use of fine-tuned, parameter-efficient adapters. From meta-learning, we take advantage of metric-learning-based final-layer classifiers trained with episodic protocols, which we show are more effective than the conventional linear-layer classifier. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy at low shot with two orders of magnitude fewer updateable parameters when compared to BiT (Kolesnikov et al., 2019), and competitive accuracy when more data are available. We also demonstrate the benefits of FiT in a low-shot, real-world model personalization application and in a demanding few-shot federated learning scenario. A summary of our results is shown in Fig. 1, where we see that FiT has superior parameter efficiency and classification accuracy compared to BiT in multiple settings.
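The FiLM adapters mentioned above are lightweight by construction: a FiLM layer (Perez et al., 2018) applies a learned per-channel scale and shift to a backbone's feature maps, so adapting a layer costs only two parameters per channel. A minimal sketch in plain Python, operating on a channels-first list of flattened feature maps purely for illustration:

```python
def film(x, gamma, beta):
    """Apply FiLM modulation: a per-channel affine transform of features.

    x     : list of channels, each a list of activations (a flattened map)
    gamma : per-channel scale parameters (one float per channel)
    beta  : per-channel shift parameters (one float per channel)
    """
    return [[g * v + b for v in channel]
            for channel, g, b in zip(x, gamma, beta)]
```

For a backbone layer with C channels, fine-tuning the FiLM parameters updates only 2C values, regardless of the size of the convolutional kernels they modulate.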
Our contributions:
• A parameter- and data-efficient network architecture for low-shot transfer learning that (i) utilizes frozen backbones pretrained on large image datasets; (ii) augments the backbone with parameter-efficient FiLM (Perez et al., 2018) layers in order to adapt to a new task; and (iii) makes novel use of an automatically configured Naive Bayes final-layer classifier instead of the usual linear layer, saving a large number of updateable parameters yet improving classification performance.
• A meta-learning inspired episodic training protocol for low-shot fine-tuning that requires no data augmentation, no regularization, and a minimal set of hyper-parameters.
• Superior classification accuracy at low shot on standard downstream datasets and state-of-the-art results on the challenging VTAB-1k benchmark (74.9% for backbones pretrained on ImageNet-21k) while using ≈ 1% of the updateable parameters of the leading method, BiT.
• Demonstration of superior parameter efficiency and classification accuracy in distributed low-shot personalization and federated learning applications where model update size is a key performance metric. We show that the communication cost of FiT is more than three orders of magnitude lower than that of BiT (7M versus 14B parameters transmitted) in our CIFAR100 federated learning experiment.
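The automatically configured Naive Bayes classifier in contribution (iii) illustrates why the final layer adds no gradient-trained weights: the classifier is configured directly from statistics of the embedded training examples. The sketch below shows a generic Gaussian Naive Bayes head with per-dimension variances and a uniform class prior; it is an illustration of the idea under those assumptions, not the exact configuration used by FiT.

```python
import math

def fit_nb_head(features, labels):
    """Configure a Gaussian Naive Bayes head from embedded examples.

    features : list of feature vectors (e.g. FiLM-adapted backbone outputs)
    labels   : class label for each feature vector
    Returns per-class (mean, variance) statistics; no weights are trained.
    """
    params = {}
    for c in sorted(set(labels)):
        xs = [f for f, y in zip(features, labels) if y == c]
        dim = len(xs[0])
        mean = [sum(x[j] for x in xs) / len(xs) for j in range(dim)]
        # Clamp variances away from zero for numerical stability.
        var = [max(sum((x[j] - mean[j]) ** 2 for x in xs) / len(xs), 1e-6)
               for j in range(dim)]
        params[c] = (mean, var)
    return params

def nb_predict(params, x):
    """Return the class with the highest Gaussian log-likelihood of x."""
    scores = {
        c: sum(-0.5 * (math.log(2 * math.pi * v) + (xj - m) ** 2 / v)
               for xj, m, v in zip(x, mean, var))
        for c, (mean, var) in params.items()
    }
    return max(scores, key=scores.get)
```

Because the head stores only per-class means and variances, updating it for a new task is a closed-form computation over the support features rather than an additional optimization over linear-layer weights.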



* Authors contributed equally. † Work performed while at Microsoft Research.

