FIT: PARAMETER EFFICIENT FEW-SHOT TRANSFER LEARNING FOR PERSONALIZED AND FEDERATED IMAGE CLASSIFICATION

Abstract

Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work, we develop FiLM Transfer (FIT) which fulfills these requirements in the image classification setting by combining ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter efficient models with superior classification accuracy at low-shot. The resulting parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FIT on a wide range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low-shot and achieves state-of-the-art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency and superior accuracy of FIT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.

1. INTRODUCTION

With the success of the commercial application of deep learning in many fields such as computer vision (Schroff et al., 2015), natural language processing (Brown et al., 2020), speech recognition (Xiong et al., 2018), and language translation (Wu et al., 2016), an increasing number of models are being trained on central servers and then deployed on remote devices, often to personalize a model to a specific user's needs. Personalization requires models that can be updated inexpensively by minimizing the number of parameters that need to be stored and/or transmitted, and it frequently calls for few-shot learning methods, as the amount of training data from an individual user may be small (Massiceti et al., 2021). At the same time, for privacy, security, and performance reasons, it can be advantageous to use federated learning, where a model is trained on an array of remote devices, each with different data, which share gradient or parameter updates instead of training data with a central server (McMahan et al., 2017). In the federated learning setting, in order to minimize communication cost with the server, it is also beneficial to have models with a small number of parameters that need to be updated in each training round conducted by remote clients. The amount of training data available to the clients is often small, again necessitating few-shot learning approaches.

Figure 1: FIT is significantly more parameter efficient than BiT. Results summary for FIT and BiT in classification, personalization, and federated learning scenarios using the BiT-M-R50x1 backbone. The Parameters plots refer to the typical number of updateable parameters in each model, while the Cost plot refers to the total client-server communication cost during federated training. In all settings, FIT achieves similar or better classification accuracy using orders of magnitude fewer updateable parameters and communication cost. Refer to Table 2, Table 4, and Table 5 for more detail.

In order to develop data-efficient and parameter-efficient learning systems, we draw on ideas developed by the few-shot learning community. Few-shot learning approaches can be characterized in terms of shared and updateable parameters. From a statistical perspective, shared parameters capture similarities between datasets, while updateable parameters capture the differences. Updateable parameters are those that are either recomputed or learned as the model is updated or retrained, whereas shared parameters are fixed. In personalized or federated settings, it is key to minimize the number of updateable parameters while still retaining the capacity to adapt. Broadly, there are two different approaches to few-shot learning: meta-learning (Hospedales et al., 2020) and transfer learning (fine-tuning) (Yosinski et al., 2014). Meta-learning approaches provide methods that have a small number of updateable parameters (Requeima et al., 2019). However, while meta-learners can perform strongly on datasets that are similar to those they are meta-trained on, their accuracy suffers when tested on datasets that are significantly different (Dumoulin et al., 2021). Transfer learning algorithms often outperform meta-learners, especially on diverse datasets and even at low-shot (Dumoulin et al., 2021; Tian et al., 2020). However, the leading Big Transfer (BiT) (Dumoulin et al., 2021; Kolesnikov et al., 2019) algorithm requires every parameter in a large network to be updated. In summary, performant transfer learners are parameter-inefficient, and parameter-efficient few-shot learners perform relatively poorly. In this work we propose FiLM Transfer (FIT), a novel method that synthesizes ideas from both the transfer learning and meta-learning communities in order to achieve the best of both worlds: parameter efficiency without sacrificing accuracy, even when only a small number of training examples are available.
From transfer learning, we take advantage of backbones pretrained on large image datasets and the use of fine-tuned, parameter efficient adapters. From meta-learning, we take advantage of metric-learning-based final layer classifiers trained with episodic protocols, which we show are more effective than the conventional linear layer classifier. We experiment with FIT on a wide range of downstream datasets and show that it achieves better classification accuracy at low-shot with two orders of magnitude fewer updateable parameters when compared to BiT (Kolesnikov et al., 2019), and competitive accuracy when more data are available. We also demonstrate the benefits of FIT on a low-shot real-world model personalization application and in a demanding few-shot federated learning scenario. A summary of our results is shown in Fig. 1, where we see that FIT has superior parameter efficiency and classification accuracy compared to BiT in multiple settings.

Our contributions:
• A parameter and data efficient network architecture for low-shot transfer learning that (i) utilizes frozen backbones pretrained on large image datasets; (ii) augments the backbone with parameter efficient FiLM (Perez et al., 2018) layers in order to adapt to a new task; and (iii) makes novel use of an automatically configured Naive Bayes final layer classifier instead of the usual linear layer, saving a large number of updateable parameters yet improving classification performance;
• A meta-learning inspired episodic training protocol for low-shot fine-tuning requiring no data augmentation, no regularization, and a minimal set of hyper-parameters;
• Superior classification accuracy at low-shot on standard downstream datasets and state-of-the-art results on the challenging VTAB-1k benchmark (74.9% for backbones pretrained on ImageNet-21k) while using ≈ 1% of the updateable parameters of the leading method BiT;
• Demonstration of superior parameter efficiency and classification accuracy in distributed low-shot personalization and federated learning applications where model update size is a key performance metric. We show that the FIT communication cost is more than 3 orders of magnitude lower than BiT's (7M versus 14B parameters transmitted) in our CIFAR100 federated learning experiment.

2. FILM TRANSFER (FIT)

In this section we detail the FIT algorithm, focusing on the few-shot image classification scenario.

Preliminaries. We denote input images x ∈ R^{ch×W×H}, where W is the width, H the height, and ch the number of channels, and image labels y ∈ {1, . . . , C}, where C is the number of image classes indexed by c. Assume that we have access to a model f(x) = h_ϕ(b_θ(x)) that outputs class probabilities p(y = c|x, θ, ϕ) for c = 1, . . . , C. The model is comprised of a feature extractor backbone b_θ(x) ∈ R^{d_b} with parameters θ that has been pretrained on a large upstream dataset such as ImageNet, where d_b is the output feature dimension, and a final layer classifier or head h_ϕ(·) ∈ R^C with weights ϕ. Let D = {(x_n, y_n)}_{n=1}^{N} be the downstream dataset to which we wish to fine-tune the model f.

FIT Backbone. For the network backbone, we freeze the parameters θ to the values learned during upstream pretraining. To enable parameter-efficient and flexible adaptation of the backbone, we add Feature-wise Linear Modulation (FiLM) (Perez et al., 2018) layers with parameters ψ at strategic points within b_θ. A FiLM layer scales and shifts the activations a_{ij} arising from the j-th channel of a convolutional layer in the i-th block of the backbone as FiLM(a_{ij}; γ_{ij}, β_{ij}) = γ_{ij} a_{ij} + β_{ij}, where γ_{ij} and β_{ij} are scalars. The set of FiLM parameters ψ = {γ_{ij}, β_{ij}} is learned during fine-tuning. We add a FiLM layer following the middle 3 × 3 convolutional layer in each ResNetV2 (He et al., 2016b) block, and also one at the end of the backbone prior to the head. Fig. A.1b illustrates how a FiLM layer can be added to a ResNetV2 network block; FiLM layers can be similarly added to other backbones. An advantage of FiLM layers is that they enable expressive feature adaptation while adding only a small number of parameters (Perez et al., 2018).
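The channel-wise scale-and-shift can be sketched in a few lines of NumPy (a minimal illustration of the FiLM operation; the shapes and the identity initialization shown are our assumptions, not code from the paper):

```python
import numpy as np

def film(activations, gamma, beta):
    """Feature-wise Linear Modulation (Perez et al., 2018): scale and
    shift each channel of a convolutional activation map,
    FiLM(a_ij; gamma_ij, beta_ij) = gamma_ij * a_ij + beta_ij.

    activations: (channels, height, width); gamma, beta: (channels,).
    """
    return gamma[:, None, None] * activations + beta[:, None, None]

# Identity initialisation (gamma = 1, beta = 0) leaves the pretrained
# backbone's features untouched before fine-tuning begins.
a = np.ones((64, 8, 8))
assert np.allclose(film(a, np.ones(64), np.zeros(64)), a)
```

Because each channel contributes only two scalars, a backbone gains just a handful of adaptable parameters per convolutional block.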
For example, in a ResNet50 with a FiLM layer in every block, the set of FiLM parameters ψ accounts for only 11,648 parameters, which is fewer than 0.05% of the parameters in b_θ. We show in Section 4 that FiLM layers allow the model to adapt to a broad class of datasets. Fig. A.2a and Fig. A.2b show the magnitude of the FiLM parameters as a function of layer for FiT-LDA on CIFAR100 and SVHN, respectively. Refer to Appendix A.7 for additional detail.

FIT Head. For the head of the network, we use a specially tailored Gaussian Naive Bayes classifier. Unlike a linear head, this head can be configured automatically, directly from data, and has only a small number of free parameters that must be learned, which is ideal for few-shot, personalization, and federated learning. We will also show that this head is often more accurate than a standard linear head. The class probability for a test point x* is:

p(y* = c | b_{θ,ψ}(x*), π, µ, Σ) = π_c N(b_{θ,ψ}(x*) | µ_c, Σ_c) / Σ_{c'=1}^{C} π_{c'} N(b_{θ,ψ}(x*) | µ_{c'}, Σ_{c'}),    (1)

where π_c = N_c / N, µ_c = (1/N_c) Σ_{i=1}^{N_c} b_{θ,ψ}(x_i), and Σ_c = (1/N_c) Σ_{i=1}^{N_c} (b_{θ,ψ}(x_i) − µ_c)(b_{θ,ψ}(x_i) − µ_c)^T are the maximum likelihood estimates computed over the examples of class c, N_c is the number of examples of class c in D, and N(z | µ, Σ) is a multivariate Gaussian over z with mean µ and covariance Σ. Estimating the mean µ_c for each class c is straightforward and incurs a total storage cost of C d_b. However, estimating the covariance Σ_c for each class c is challenging when the number of examples per class N_c is small and the embedding dimension of the backbone d_b is large. In addition, the storage cost for the covariance matrices may be prohibitively high if d_b is large. Here, we use three different approximations to the covariance in place of Σ_c in Eq. (1) (Fisher, 1936; Duda et al., 2012):
• Quadratic Discriminant Analysis (QDA): retains a covariance matrix for each class, regularized as a weighted combination Σ_c^QDA = e_1 Σ_c^class + e_2 Σ^task of the class-level and task-level covariance estimates, where the weights e are learned. QDA mainly serves as a baseline, since storing a covariance matrix for each class yields a very large set of updateable parameters.
• Linear Discriminant Analysis (LDA): shares a single covariance matrix across all classes and is far more parameter efficient than QDA. We show that LDA leads to very similar performance to QDA.
• ProtoNets: replaces the covariance with the identity matrix, so that only the class means are stored; this is the most parameter efficient variant.

The number of shared and updateable parameters for the three FIT variants, as well as for the BiT algorithm, is detailed in Table 1. Refer to Appendix A.2 for details on the parameter calculations. An empirical justification for the use of the FIT-LDA head is shown in Fig. 2a, where it outperforms a linear head both when FiLM layers are used and when all the backbone parameters are learned. Refer to Table A.5 for details. In Fig. 2b, we see that for both datasets FIT-LDA converges significantly faster than BiT, which uses a linear head. The primary limitation of the Naive Bayes head is its higher computational cost (versus a linear head), due to having to invert a d_b × d_b covariance matrix on each training iteration.
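A concrete sketch of how such a head is configured directly from data is given below (our own illustrative NumPy code for the shared-covariance LDA variant; the function names are hypothetical, and a simple ridge term stands in for the learned covariance weights e):

```python
import numpy as np

def fit_lda_head(features, labels, num_classes, ridge=1e-3):
    """Configure a Gaussian Naive Bayes head from support features.

    Returns maximum-likelihood class priors and means, plus a single
    covariance matrix pooled across all classes (the FiT-LDA choice).
    features: (N, d_b) array of backbone embeddings; labels: ints in
    [0, num_classes). The ridge term keeps the pooled covariance
    invertible in the few-shot regime (a stand-in for the learned
    covariance weights e in the paper).
    """
    n, d = features.shape
    priors = np.array([(labels == c).mean() for c in range(num_classes)])
    means = np.stack([features[labels == c].mean(axis=0)
                      for c in range(num_classes)])
    cov = np.zeros((d, d))
    for c in range(num_classes):
        diff = features[labels == c] - means[c]
        cov += diff.T @ diff
    cov = cov / n + ridge * np.eye(d)
    return priors, means, cov

def predict_probs(priors, means, cov, x):
    """Class probabilities for one embedding x under Eq. (1) with the
    shared covariance (the Gaussian normalizer cancels across classes)."""
    prec = np.linalg.inv(cov)
    logits = np.array([np.log(p) - 0.5 * (x - m) @ prec @ (x - m)
                       for p, m in zip(priors, means)])
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Note that configuring the head is a closed-form computation over the support features; no gradient steps on head weights are required.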

FIT Training

We learn the FiLM parameters ψ and the covariance weights e via fine-tuning (the parameters θ are fixed from pretraining). One approach would be to apply standard batch training on the downstream dataset; however, we found it hard to balance under- and over-fitting with this setup. Instead, an approach inspired by episodic training (Vinyals et al., 2016), as often used in meta-learning, yielded better performance. We refer to this approach as episodic fine-tuning, and it works as follows. Note that we require 'training' data to compute π, µ, Σ to configure the head, and a 'test' set to optimize ψ and e via gradient ascent. Thus, from the downstream dataset D we sample T tasks, each consisting of a support set D_S^τ used to configure the head and a query set D_Q^τ = {(x_q^{τ*}, y_q^{τ*})}_{q=1}^{Q_τ} used to train ψ and e with maximum likelihood. We optimize the following:

L(ψ, e) = Σ_{τ=1}^{T} Σ_{q=1}^{Q_τ} log p(y_q^{τ*} | h_e(b_{θ,ψ}(x_q^{τ*})), π(D_S^τ), µ(D_S^τ), Σ(D_S^τ)).

FIT training hyper-parameters include a learning rate, |D_S^τ|, and the number of training iterations. For the transfer learning experiments in Section 4, these are set to constant values across all datasets and do not need to be tuned on a validation set. We do not augment the training data. In the 1-shot case, we do not perform episodic fine-tuning; we leave the FiLM parameters at their initial values of γ = 1, β = 0 and e = (0.5, 0.5, 1.0) and predict as described next. In Section 4 we show this can yield better results than applying augmentation and training steps to 1-shot data.

FIT Prediction. Once the FiLM parameters ψ and covariance weights e have been learned, we use all of D as the support set to compute π_c, µ_c, and Σ_c for each class c; Eq. (1) can then be used to make a prediction for any unseen test input.
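The episodic fine-tuning loop above can be sketched as follows (a minimal NumPy illustration; the sampler and function names are our assumptions, the head is the ProtoNets-style identity-covariance variant for brevity, and a real implementation would backpropagate this query log-likelihood through the FiLM parameters ψ and the covariance weights e):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(features, labels, support_size, query_size):
    """Randomly split one fine-tuning episode into disjoint support and
    query sets: the support set configures the Naive Bayes head, and
    the query set supplies the maximum-likelihood training signal."""
    idx = rng.permutation(len(labels))
    s = idx[:support_size]
    q = idx[support_size:support_size + query_size]
    return (features[s], labels[s]), (features[q], labels[q])

def episode_log_likelihood(support, query):
    """Query-set log-likelihood under a ProtoNets-style head (identity
    covariance) configured from the support set. In FiT, this is the
    quantity maximised with respect to psi and e; here the embeddings
    are treated as fixed arrays."""
    (xs, ys), (xq, yq) = support, query
    classes = np.unique(ys)
    protos = np.stack([xs[ys == c].mean(axis=0) for c in classes])
    # Log-softmax over negative squared distances to the class means.
    scores = -((xq[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    m = scores.max(axis=1, keepdims=True)
    log_probs = scores - m - np.log(np.exp(scores - m).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(yq)), np.searchsorted(classes, yq)].sum()
```

One training iteration would sample an episode, configure the head from the support set, and take a gradient step on the query log-likelihood; at prediction time, all of D plays the role of the support set.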

3. RELATED WORK

We take inspiration from residual adapters (Rebuffi et al., 2017; 2018), where parameter efficient adapters are inserted into a ResNet with frozen pretrained weights. The adapter parameters and the final layer linear classifier are then learned via fine-tuning. More recently, a myriad of additional parameter efficient adapters have been proposed, including FiLM, Adapter (Houlsby et al., 2019), LoRA (Hu et al., 2021), VPT (Jia et al., 2022), AdaptFormer (Chen et al., 2022), NOAH (Zhang et al., 2022), Convpass (Jie & Deng, 2022), model patches (Mudrakarta et al., 2019), and CaSE (Patacchiola et al., 2022). For FIT we use FiLM, as it is the most parameter efficient adapter, yet it allows for expressive adaptation and can be used in various backbone architectures, including ConvNets and Transformers. To date, transfer learning systems that employ adapters use a linear head for the final classification layer. In meta-learning systems it is common to use metric learning heads (e.g. ProtoNets (Snell et al., 2017)), which have no or few learnable parameters. Meta-learning systems that employ a metric learning head are normally trained with an episodic training regime (Vinyals et al., 2016). Some of these approaches (e.g. TADAM (Oreshkin et al., 2018), FLUTE (Triantafillou et al., 2021), and Simple CNAPs (Bateni et al., 2020)) use both a metric head and FiLM layers to adapt the backbone. FIT differs from all of the preceding approaches by using a powerful Naive Bayes metric head trained with episodic fine-tuning in the context of transfer learning, as opposed to the usual meta-learning. We show in Fig. 2a and Section 4 that the episodically fine-tuned Naive Bayes head consistently outperforms a conventional batch trained linear head in the low-shot transfer learning setting.

4. EXPERIMENTS

In this section, we evaluate the classification accuracy and updateable parameter efficiency of FIT in a series of challenging benchmarks and application scenarios. In all experiments, we use Big Transfer (BiT) (Kolesnikov et al., 2019), a leading, scalable, general purpose transfer learning algorithm, as a point of comparison. First, we compare different variations of FIT to BiT on several standard downstream datasets in the few-shot regime. Second, we evaluate FIT against BiT on VTAB-1k (Zhai et al., 2019), which is arguably the most challenging transfer learning benchmark. Additionally, we compare FIT to the latest vision transformer based methods that have reported the highest accuracies on VTAB-1k to date. Third, we show how FIT can be used in a personalization scenario on the ORBIT (Massiceti et al., 2021) dataset, where a smaller updateable model is an important evaluation metric. Finally, we apply FIT to a few-shot federated learning scenario where minimizing the number of parameter updates and their size is a key requirement. Training and evaluation details are in Appendix A.11. Source code for the experiments can be found at: https://github.com/cambridge-mlg/fit.

4.1. FEW-SHOT RESULTS

Fig. 3 shows the classification accuracy as a function of updateable parameters for FIT-LDA and BiT on four downstream datasets (CIFAR10, CIFAR100 (Krizhevsky et al., 2009), Pets (Parkhi et al., 2012), and Flowers (Nilsback & Zisserman, 2008)) that were used to evaluate the performance of BiT (Kolesnikov et al., 2019). Table A.1 contains complete tabular results with additional variants of FIT and BiT. All methods use the BiT-M-R50x1 (Kolesnikov et al., 2019) backbone that is pretrained on the ImageNet-21K (Russakovsky et al., 2015) dataset. The key observations from Fig. 3 are:
• For ≤ 10 shots (except 1-shot on CIFAR100), FIT-LDA outperforms BiT, often by a large margin.
• On 3 out of 4 datasets, FIT-LDA outperforms BiT even when all of D is used for fine-tuning.
• FIT-LDA outperforms BiT despite BiT having more than 100 times as many updateable parameters.
• To avoid overfitting when D is small, Table A.2 indicates that it is better to split D into two disjoint partitions D_train and D_test, and that D_S^τ and D_Q^τ should be randomly sub-sampled rather than using all of the data in each training iteration.

The datasets used in this section were similar in content to the dataset used for pretraining, and the performance of FIT-QDA and FIT-LDA was similar, indicating that the per-class covariance was not that useful for these datasets. In the next section, we test on a wider variety of datasets, many of which differ greatly from the upstream data.

4.2. VTAB-1K RESULTS

The VTAB-1k benchmark (Zhai et al., 2019) is a low- to medium-shot transfer learning benchmark that consists of 19 datasets grouped into three distinct categories (natural, specialized, and structured). From each dataset, 1000 examples are drawn at random from the training split to use for the downstream dataset D. After fine-tuning, the entire test split is used to evaluate classification performance. Table 2 shows the classification accuracy and updateable parameter count for the three variants of FIT and BiT (see Table A.3 for error bars). The key observations from our results are:
• Both FIT-QDA and FIT-LDA outperform BiT on VTAB-1k.
• The FIT-QDA variant has the best overall performance, showing that the class covariance is important for achieving superior results on datasets that differ from those used in upstream pretraining (e.g. the structured category of datasets). However, its updateable parameter cost is high.
• FIT-LDA utilizes two orders of magnitude fewer updateable parameters than BiT, making it the preferred approach.
• Table A.4 shows that it is best to use all of D for each of D_train and D_test (i.e. no split) and that D_S^τ and D_Q^τ should be episodically sub-sampled rather than using all of the data in each iteration.

Table 3 shows that FIT-LDA achieves state-of-the-art classification accuracy when compared to leading transfer learning methods pretrained on ImageNet-21k, while requiring the smallest number of updateable parameters and using the smallest backbone. All competing methods use a linear head.

Table 3: FIT achieves SOTA on VTAB-1k. Classification accuracy (%) for the 3 VTAB-1k categories (Natural, Specialized, and Structured), mean accuracy over all 19 datasets (Overall Acc), and updateable parameter count (Params) for leading transfer learning methods using various backbones (backbone parameter count shown in parentheses) (Kolesnikov et al., 2019; Dosovitskiy et al., 2020; Tan & Le, 2021) pretrained on ImageNet-21k. ViT-Base-16 results from (Jie & Deng, 2022). BiT results from (Kolesnikov et al., 2020). Green indicates results summary columns.

Method | Backbone | Params (M) ↓ | Overall Acc ↑ | Natural ↑ | Specialized ↑ | Structured ↑
BiT (Kolesnikov et al., 2019) | BiT-M-R101x3 (382M) | 382 | 72.7 | 80.3 | 85.8 | 59.4
BiT (Kolesnikov et al., 2019) | BiT-M-R152x4 (928M) | 928 | 73.5 | 80.8 | 85.7 | 61.1
VPT (Jia et al., 2022) | ViT-Base-16 (85.8M) | 0.5 | 69.4 | 78.5 | 82.4 | 55.0
Adapter (Houlsby et al., 2019) | ViT | | | | |

4.3. PERSONALIZATION

In the personalization experiments, we use ORBIT (Massiceti et al., 2021), a real-world few-shot video dataset recorded by people who are blind or low-vision. A user collects a series of short videos of objects that they would like to recognize. The collected videos and associated labels are then uploaded to a central service to train a personalized classification model for that user. Once trained, the personalized model is downloaded to the user's smartphone. In this setting, models with a smaller number of updateable parameters are preferred, in order to save model storage space on the central server and to reduce the cost of transmitting updated models to a user. Personalization is performed by taking a large pretrained model and adapting it using the user's individual data. We follow the object recognition benchmark task proposed by the authors, which tests a personalized model on two different video types: clean, where only a single object is present, and clutter, where that object appears within a realistic, multi-object scene. In Table 4, we compare FIT-LDA to several competitive transfer learning and meta-learning methods that benchmark on the ORBIT dataset. We use the LDA variant of FIT, as it achieves higher accuracy than the ProtoNets variant, while using far fewer updateable parameters than QDA. As a transfer learning baseline, we include FineTuner (Yosinski et al., 2014), which freezes the weights in the backbone and fine-tunes only the linear classifier head on an individual's data. For meta-learning approaches, we include ProtoNets (Snell et al., 2017) and Simple CNAPs (Bateni et al., 2020), which are meta-trained on Meta-Dataset (Dumoulin et al., 2021). In the lower part of Table 4, we compare FIT-LDA and BiT; both models use the same BiT-M-R50x1 backbone pretrained on ImageNet-21K. Training and evaluation details are in Appendix A.11.2.
For this comparison, we show frame and video accuracy, averaged over all the videos from all tasks across all test users (17 test users, 85 tasks in total). We also report the number of shared and individual updateable parameters required to be stored or transmitted. The key observations from our results are:
• FIT-LDA outperforms the competitive meta-learning methods Simple CNAPs and ProtoNets.
• FIT-LDA also outperforms FineTuner in terms of video accuracy and performs within error bars of it in terms of frame accuracy.
• The number of individual parameters for FIT-LDA is far fewer than in Simple CNAPs and BiT, and is of the same order of magnitude as FineTuner and ProtoNets.
• FIT-LDA pretrained on ImageNet-21K performs on par with BiT, while having orders of magnitude fewer updateable parameters.

4.4. FEW-SHOT FEDERATED LEARNING

We now show how FIT can be used in the few-shot federated learning setting, where training data are split between client nodes, e.g. mobile phones or personal laptops, and each client has only a handful of samples. Model training is performed via numerous communication rounds between a server and clients. In each round, the server selects a fraction of clients to perform updates and sends the current model parameters to these clients. Clients update models locally using only their personal data and then send their parameter updates back to the server. Finally, the server aggregates information from all clients, updates the shared model parameters, and proceeds to the next round until training converges.

We evaluate FIT in two scenarios, global and personalized. In the global setting, the aim is to construct a global classifier, and we report accuracy on the CIFAR100 test set. We assume that the server knows which classes belong to each client, and it constructs a shared classifier by taking a mean over the prototypes produced by clients for a particular class. In the personalized scenario, we test a personalized model on the test classes present in the individual's training set and then report the mean accuracy over all clients. As opposed to the personalization experiments on ORBIT, where a personalized model is trained using only the client's local data, in this experiment we initialize a personalized model with the learned global FiLM parameters and then construct a ProtoNets classifier with the individual's data. Thus, the goal of the personalized setting is to estimate how advantageous distributed learning can be for training FiLM layers to build personalized few-shot models. To test the global scenario, we compare FIT to BiT. As the training protocol used in BiT cannot be directly applied in the federated learning context, we trained BiT with a constant learning rate.
We do not provide a comparison in the personalized scenario, as BiT uses a linear classification head trained globally, as opposed to using only the client's local data. As, to the best of our knowledge, there are no suitable federated learning systems to compare to, we define baselines which form upper and lower bounds on model performance. For the global scenario, we take a FIT model trained centrally on all available data as the upper bound baseline. To get the lower bound baseline, we train a FIT model for each client with their local data, then average the FiLM parameters of these individual models and construct a global ProtoNets classifier using the resulting FiLM parameters. The upper bound is therefore standard batch training, whose performance we hope federated learning can approach. The lower bound is a simplistic version of federated learning with a single communication round, which federated averaging should improve over. For the personalized setting, the upper bound baseline is as in the global scenario, from which we form a personalized classifier by taking the subset of classes belonging to a client from the global 100-way classifier. The lower bound baseline is a FIT model trained for each client individually. The upper bound is again standard batch training, and the lower bound is derived from locally trained models which do not share information and should therefore be improved upon by federated learning.
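The server-side aggregation described above can be sketched as follows (our own illustrative NumPy code, not the paper's implementation; the function names are hypothetical). Each round, the server averages the clients' FiLM parameter vectors, and the global classifier is built by averaging per-class prototypes across the clients that hold each class:

```python
import numpy as np

def fedavg_film(client_params, client_weights=None):
    """One server-side aggregation round: average the clients' FiLM
    parameter vectors, optionally weighted (e.g. by local dataset
    size). Only these few thousand scalars travel over the network;
    the frozen backbone never leaves the device."""
    stacked = np.stack(client_params)
    if client_weights is None:
        return stacked.mean(axis=0)
    w = np.asarray(client_weights, dtype=float)
    return (w[:, None] * stacked).sum(axis=0) / w.sum()

def global_prototypes(client_protos):
    """Build the shared classifier by averaging, per class, the
    prototypes reported by the clients that hold that class.
    client_protos: list of {class_id: mean_embedding} dicts."""
    pooled = {}
    for protos in client_protos:
        for c, mu in protos.items():
            pooled.setdefault(c, []).append(mu)
    return {c: np.mean(v, axis=0) for c, v in pooled.items()}
```

Because only FiLM parameters and class prototypes are exchanged, the per-round payload is orders of magnitude smaller than transmitting full backbone updates.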

Results

Table 5 shows the comparison of FIT to BiT. Fig. 4 shows global and personalized classification accuracy as a function of communication cost for different numbers of clients and shots per client. The key observations from our results are:
• FIT and BiT show comparable performance in terms of test accuracy; however, FIT is much more parameter efficient and requires transmission of only 7M parameters through training, in contrast to 14B parameters for BiT. This makes FIT highly suitable for federated learning applications.
• In the global setting, the federated learning model is only slightly worse (3-5%) than the upper bound baseline, while outperforming the lower bound model, often by a large margin. This shows that FIT can be used efficiently in federated learning settings with different configurations.
• In the personalized scenario, for a sufficient number of clients (≥ 50), the gap between the federated learning model and the upper bound model is significantly reduced as the number of shots increases. Federated training strongly outperforms the lower bound baseline, surprisingly even in the case of 10 clients with disjoint classes. This provides empirical evidence that collaborative distributed training can be helpful for improving personalized models in the few-shot data regime.

In Appendix A.8, we show that distributed training of a FIT model can be used efficiently to learn from more extreme, non-natural image datasets like Quickdraw (Jongejan et al., 2016).

5. DISCUSSION

In this work, we proposed FIT, a parameter and data efficient few-shot transfer learning system that allows image classification models to be updated with only a small subset of the total model parameters. We demonstrated that FIT can outperform BiT using fewer than 1% of the updateable parameters and achieve state-of-the-art accuracy on VTAB-1k. We also showed the parameter efficiency benefits of employing FIT in model personalization and federated learning applications.

A APPENDIX

A.1 FILM LAYER PLACEMENT

Fig. A.1 illustrates FiLM layer placement. GN is a Group Normalization layer, ReLU is a Rectified Linear Unit, and 1 × 1 and 3 × 3 are 2D convolutional layers with the stated kernel size.

A.2 MODEL PARAMETERS

In this section, we provide details on the updateable parameter calculations for the LDA and QDA variants of FIT; refer to Table 1. In the case of FIT-LDA, where the covariance matrix is shared across all classes, a more compact representation is possible, resulting in considerable savings in updateable parameters. Writing b* = b_{θ,ψ}(x*):

p(y* = c | b*, π, µ, Σ_LDA)
  = π_c N(b* | µ_c, Σ_LDA) / Σ_{c'=1}^{C} π_{c'} N(b* | µ_{c'}, Σ_LDA)
  = π_c det(2π Σ_LDA)^{-1/2} exp(−½ (b* − µ_c)^T Σ_LDA^{-1} (b* − µ_c)) / Σ_{c'=1}^{C} π_{c'} det(2π Σ_LDA)^{-1/2} exp(−½ (b* − µ_{c'})^T Σ_LDA^{-1} (b* − µ_{c'}))
  = π_c exp(−½ b*^T Σ_LDA^{-1} b* + µ_c^T Σ_LDA^{-1} b* − ½ µ_c^T Σ_LDA^{-1} µ_c) / Σ_{c'=1}^{C} π_{c'} exp(−½ b*^T Σ_LDA^{-1} b* + µ_{c'}^T Σ_LDA^{-1} b* − ½ µ_{c'}^T Σ_LDA^{-1} µ_{c'})
  = π_c exp(µ_c^T Σ_LDA^{-1} b* − ½ µ_c^T Σ_LDA^{-1} µ_c) / Σ_{c'=1}^{C} π_{c'} exp(µ_{c'}^T Σ_LDA^{-1} b* − ½ µ_{c'}^T Σ_LDA^{-1} µ_{c'})    (A.1)

From Eq. (A.1), it follows that to compute the probability of classifying a test point x*, we need to store, for each class c, the vector µ_c^T Σ_LDA^{-1} (dimensionality d_b) and the scalar µ_c^T Σ_LDA^{-1} µ_c (dimensionality 1), resulting in only C(d_b + 1) parameters for the FIT-LDA head. Since the covariance matrix is not shared in the case of FIT-QDA, no analogous savings are possible there.
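The compact parameterization can be sketched as follows (illustrative NumPy with hypothetical function names): each class stores only the d_b-vector µ_c^T Σ_LDA^{-1} and one scalar, and prediction reduces to a softmax over affine scores.

```python
import numpy as np

def compact_lda(priors, means, cov):
    """Precompute the per-class parameters appearing in Eq. (A.1):
    W[c] = mu_c^T Sigma^{-1}  (d_b values per class) and
    b[c] = log pi_c - 0.5 * mu_c^T Sigma^{-1} mu_c  (1 value per class),
    so the head needs only C * (d_b + 1) stored numbers."""
    prec = np.linalg.inv(cov)           # Sigma^{-1}, symmetric
    W = means @ prec                    # (C, d_b)
    b = np.log(priors) - 0.5 * np.einsum('cd,cd->c', W, means)
    return W, b

def lda_probs(W, b, x):
    """p(y = c | x) from the compact parameters: softmax(W x + b).
    The quadratic term in x cancels across classes, as in Eq. (A.1)."""
    logits = W @ x + b
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

This also makes explicit that, at test time, the FIT-LDA head is just a linear-softmax layer whose weights were computed in closed form rather than learned by gradient descent.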

A.3 VTAB-1K DATASETS

The VTAB-1k benchmark (Zhai et al., 2019) is a low- to medium-shot transfer learning benchmark that consists of 19 datasets grouped into three distinct categories (natural, specialized, and structured). The natural datasets are: Caltech101 (Fei-Fei et al., 2006), CIFAR100 (Krizhevsky et al., 2009), Flowers102 (Nilsback & Zisserman, 2008), Pets (Parkhi et al., 2012), Sun397 (Xiao et al., 2010), SVHN (Netzer et al., 2011), and DTD (Cimpoi et al., 2014). The specialized datasets are: EuroSAT (Helber et al., 2019), Resics45 (Cheng et al., 2017), Patch Camelyon (Veeling et al., 2018), and Retinopathy (Kaggle & EyePacs, 2015). The structured datasets are: CLEVR-count (Johnson et al., 2017), CLEVR-dist (Johnson et al., 2017), dSprites-loc (Matthey et al., 2017), dSprites-ori (Matthey et al., 2017), SmallNORB-azi (LeCun et al., 2004), SmallNORB-elev (LeCun et al., 2004), DMLab (Beattie et al., 2016), and KITTI-dist (Geiger et al., 2013).

A.4 ADDITIONAL RESULTS

This section contains additional results that would not fit into the main paper, including tabular versions of figures. Note that in some results we use a new metric, Relative Model Update Size (RMUS), which is the ratio of the number of updateable parameters in one model to that in another. In our experiments, we measure RMUS relative to the number of updateable parameters in the BiT model.

A.4.1 ADDITIONAL FEW-SHOT RESULTS

Table A.1 shows the tabular version of Fig. 3. In addition, Table A.1 includes results for an additional variant of BiT (BiT-FiLM) and two additional variants of FIT (FIT-QDA and FIT-ProtoNets). BiT-FiLM is a variant of BiT that uses the same training protocol as the standard version of BiT, but the backbone weights θ are frozen and FiLM layers are added in the same manner as FIT. The FiLM parameters ψ and the linear head weights ϕ are learned during training. The results are shown in Table A.1 and the key observations are:

• In general, at low-shot, the standard version of BiT outperforms BiT-FiLM. However, as the shot count increases, especially when training on all of D, BiT-FiLM is equal in classification accuracy.
• The above implies that FiLM layers have sufficient capacity to accurately fine-tune to downstream datasets, but the FIT head and training protocol are needed to achieve superior results.
• While the accuracy of FIT-QDA and FIT-LDA is similar, the storage requirement of a covariance matrix for each class makes QDA impractical if model update size is an important consideration.
• The accuracy of FIT-ProtoNets is slightly lower than FIT-LDA, but often betters BiT, despite BiT having more than 100 times as many updateable parameters.

Table A.2 shows that Use All is consistently the worst option. In general, in the few-shot case, Split either outperforms No Split (CIFAR10, Pets) or achieves the same level of performance (CIFAR100, Flowers102). As a result, we use the Split option when reporting the few-shot results. Table A.5 compares the classification accuracy of the LDA and linear heads as a function of the learnable parameters in the backbone. When all parameters in the backbone are learnable and when FiLM layers are employed, the Naive Bayes LDA head outperforms the linear head.
In the case when all of the parameters in the backbone are frozen, the linear head is superior since in that case the LDA head has only 2 learnable parameters (the covariance weights e). FiLM works well at low shots per class because it has a small number of parameters to adapt, which gets you a long way with large pretrained backbones, and it also prevents overfitting compared to adapting all the parameters in the backbone. Evidence: eventually, as dataset size increases, full-body adaptation is at least as good as FiLM. For larger models, this transition happens at a larger number of data points, and so we expect adapters to be especially useful as the field transitions to using large foundation models. Fig. 3 provides an empirical justification of this argument, showing that FIT-LDA outperforms BiT at low-shot. The linear head is more flexible and works best of all methods in the high-data regime, but in the low- to medium-shot setting the linear head appears to lead to increased overfitting and the meta-trained LDA head performs best in this region; see Fig. 3 for empirical justification. The FIT approach suffers when there is a fairly large number of data points and the dataset is very far from the pretraining data (ImageNet); e.g., see the dSprites results in Table 3, where BiT beats FIT-LDA by a large margin. In addition, the LDA and QDA variants of the Naïve Bayes head are more computationally expensive than a linear head, since LDA and QDA require inverting a d_b × d_b covariance matrix.

In general, FIT can operate well with other FL aggregation methods, as they are agnostic to the neural network architecture used to learn the parameters.
However, as we construct a Naïve Bayes head instead of training a linear head via SGD, some of the theoretical convergence results of the methods mentioned may not hold. Despite this, FIT has been shown to converge quite quickly in practice. In Table A.7 we provide the results of training FIT with Federated Averaging (FedAvg) and FedProx (Li et al., 2020). To test distributed training of a FIT model on a more extreme, non-natural image dataset, we also include results for federated training of FIT on the Quickdraw dataset. As there is no pre-defined train/test split for the Quickdraw dataset, we randomly choose 100 samples from each of the 345 classes and use them for testing. We train all federated models for 120 communication rounds, with 5 clients per round and 10 update steps per client. Since Quickdraw is a more difficult dataset than CIFAR100, it requires more communication rounds for training. Each client has 35 classes, which are sampled randomly at the start of training. In our experiments, we omit the 10-client case, as the overall amount of data in the system is not enough to train even a robust global upper bound baseline model.

A.9 CONNECTIONS TO PERSONALIZED FEDERATED LEARNING

Partial model based personalized federated learning (PFL) (Collins et al., 2021; Arivazhagan et al., 2019; Liang et al., 2020; Pillutla et al., 2022) is related to the personalized setting in our federated learning experiments, but it differs in important regards. With regard to the similarities, the local Naïve Bayes head parameters in our approach may be considered "personal" parameters (there are personalized parameters for each client), while the FiLM parameters could be viewed as "shared" parameters. However, there are a few major differences between the ideas:

• The personalized parameters in our setting, i.e. the ProtoNets head, do not require an optimization loop to be learned. This simplifies deployment significantly.
• The partial model personalization literature (Collins et al., 2021; Arivazhagan et al., 2019; Liang et al., 2020; Pillutla et al., 2022) is more concerned with proposing stable federated training algorithms that work in heterogeneous settings, where clients have diverse data and the standard FedAvg algorithm would fail, necessitating the introduction of 'personal' parameters. In contrast, personalized heads are required in our setup because each user has a different classification task to perform. Moreover, we propose a particular architecture that is highly suitable for federated learning applications, as the number of parameters that must be transmitted is small, while most methods in PFL propose architecture-agnostic optimization algorithms. Our model can then be trained with an arbitrary federated learning algorithm.
• Another distinctive difference between the personalized federated learning literature and our work is that in the former most methods train all of the model's parameters, while in our work we use a deep pretrained network and fine-tune only the FiLM layers.

A.11.1 FEW-SHOT AND VTAB-1K TRANSFER LEARNING EXPERIMENTS

FIT All of the FIT few-shot and VTAB-1k transfer learning experiments were carried out on a single NVIDIA A100 GPU with 80GB of memory. The Adam optimizer (Kingma & Ba, 2015) with a constant learning rate of 0.0035, 400 iterations, and |D τ S| = 100 was used throughout. No data augmentation was used, and images were scaled to 384×384 pixels unless the image size was 32×32 pixels or less, in which case the images were scaled to 224×224 pixels. In each iteration of episodic fine-tuning, we perform a single step of gradient ascent on D τ Q. These hyperparameters were derived empirically from a small number of runs. FIT-QDA, FIT-LDA, and FIT-ProtoNets take approximately 12, 10, and 9 hours, respectively, to fine-tune on all 19 VTAB datasets and 5, 3, and 3 hours, respectively, to fine-tune all shots on the 4 low-shot datasets. For the FIT-LDA results using the EfficientNetV2-M backbone in Table 3, we used 1000 iterations instead of 400 and ran the experiments on 4 NVIDIA A100 GPUs, each with 80GB of memory.

BiT For the BiT few-shot experiments, we used the code supplied by the authors (Kolesnikov et al., 2020) with minor augmentations to read additional datasets. The BiT few-shot experiments were run on a single NVIDIA V100 GPU with 16GB of memory. For the BiT VTAB-1k experiments, we used the three fine-tuned models for each of the datasets that were provided by the authors (Kolesnikov et al., 2020). We evaluated all of the models on the respective test splits for each dataset and averaged the results of the three models. The BiT-HyperRule (Kolesnikov et al., 2019) was respected in all runs. These experiments were executed on a single NVIDIA GeForce RTX 3090 with 24GB of memory.

A.11.2 PERSONALIZATION ON ORBIT EXPERIMENTS

The personalization experiments were carried out on a single NVIDIA GeForce RTX 3090 with 24GB of memory. It takes approximately 5 hours to train FIT-LDA personalization models for all the ORBIT (Massiceti et al., 2021) test tasks. We derived all hyperparameters empirically from a small number of runs. We used the ORBIT codebase (https://github.com/microsoft/ORBIT-Dataset) in our experiments, only adding the code for splitting test user tasks and slightly modifying the main training loop to make it suitable for FIT training. In the comparison of FIT to standard benchmarks on ORBIT (upper part of Table 4), all methods use an EfficientNet-B0 (d_b = 1280) as the feature extractor and an image size of 224 × 224. FIT-LDA, FineTuner (Yosinski et al., 2014), and Simple CNAPs (Bateni et al., 2020) use a backbone pretrained on ImageNet (Deng et al., 2009), while ProtoNets (Snell et al., 2017) meta-trains the weights of the feature extractor on Meta-Dataset (Dumoulin et al., 2021). The task encoder in Simple CNAPs (Bateni et al., 2020) is meta-trained on Meta-Dataset. The results for all baseline models (FineTuner (Yosinski et al., 2014), Simple CNAPs (Bateni et al., 2020), and ProtoNets (Snell et al., 2017)) are from (Bronskill et al., 2021). FiLM layers in FIT-LDA are added to the feature extractor as described in Section 2, resulting in |ψ| = 20544. For the comparison of FIT and BiT (lower part of Table 4), we use the BiT-M-R50x1 backbone pretrained on ImageNet-21K. BiT was trained using the BiT-HyperRule (Kolesnikov et al., 2019) on every user task. For experiments with BiT, we used the code supplied by the authors (Kolesnikov et al., 2019) with minor augmentations to read the ORBIT dataset. We follow the task sampling protocols described in (Massiceti et al., 2021), and train the FIT model for 50 optimization steps using the Adam optimizer with a learning rate of 0.007 for EfficientNet-B0 and 0.003 for BiT-M-R50x1.
The ORBIT test tasks have a slightly different structure in comparison to standard few-shot classification tasks, so in Algorithm A.3 we provide a modified version of the data splitting for the classifier head construction. In particular, each test user has a number of objects (classes) they want to recognize, with several videos recorded per object. Each video is split into clips, consecutive 8-frame parts of the video. A user test task is comprised of these clips, randomly sampled from different videos of the user's objects, and associated labels. Since clips sampled from the same video can be semantically similar, we split the test task so that clips from the same video can only be in either the support or query set, except for the cases when there is only one video of an object available.

A.11.3 FEDERATED LEARNING EXPERIMENTS

For each local update, a new Adam optimizer is initialized. In each communication round, 5 clients are randomly chosen to make model updates. All of the federated learning experiments were carried out on a single NVIDIA A100 GPU with 80GB of memory. In all experiments we use FIT with the BiT-M-R50x1 (Kolesnikov et al., 2019) backbone pretrained on the ImageNet-21K (Russakovsky et al., 2015) dataset and a ProtoNets head. We derive all hyperparameters empirically from a small number of runs.

BiT As the training protocol proposed in BiT cannot be directly applied to the federated learning setting, we simplify it and train the BiT model for 60 communication rounds with a constant learning rate of 0.003. In our experiments, we used 2, 5, or 10 shots per class. For the global setting, we used the CIFAR100 test split for the global model evaluation. For the personalized setting, we test each client using the CIFAR100 test split, but using only the classes that a particular client owns. This results in a 10-class classification task. We then report the average accuracy across all clients. We use a learning rate of 0.003 at the start of training, decaying it by 0.3 every 20 communication rounds. Upper and lower bound baselines for both the global and personalized scenarios were trained for 400 epochs using the Adam optimizer with a constant learning rate of 0.003. It takes around 20 minutes to train the federated learning models, with slightly more training time required for models with a larger number of shots.

Quickdraw Each client has 35 classes, which are sampled uniformly at the start of training. As there is no pre-defined train/test split for the Quickdraw dataset, we randomly choose 100 samples from each of the 345 classes and use them for testing. This test data is then used to evaluate both the personalized and global settings as described in the paragraph above.
We train all federated models for 120 communication rounds, with 5 clients per round and 10 update steps per client. We use a constant learning rate of 0.006 for training all federated learning models. Upper bound baseline models, which require training a global model using all available data, were trained for 3000 steps using the Adam optimizer with a constant learning rate of 0.006. Lower bound baseline models, which require training a personalized model for each individual, were trained for 400 steps using the Adam optimizer with a learning rate of 0.006. It takes around 2 hours to train the federated learning models, with slightly more training time required for models with a larger number of shots.
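The communication pattern described above, in which only the FiLM parameters are exchanged and averaged, can be sketched as follows. This is a minimal illustration with our own names and structure (not the released code), assuming the server holds the shared FiLM parameters as plain arrays:

```python
import numpy as np

def fedavg_film_round(server_film, client_films):
    """Average the FiLM parameters returned by the selected clients.

    server_film / client_films entries: dict name -> np.ndarray. Only these
    small tensors are transmitted; the frozen backbone stays on-device and
    the Naive Bayes head is rebuilt locally from each client's support data.
    """
    return {
        name: np.mean([cf[name] for cf in client_films], axis=0)
        for name in server_film
    }
```

In a real round, each selected client would first run its local update steps on the downloaded FiLM parameters before returning them for averaging.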







• Quadratic Discriminant Analysis (QDA): Σ_QDA = e_1 Σ_class + e_2 Σ_task + e_3 I
• Linear Discriminant Analysis (LDA): Σ_LDA = e_2 Σ_task + e_3 I
• ProtoNets (Snell et al., 2017): Σ_PN = I; i.e. there is no covariance, the class representation is parameterized only by µ_c, and the classifier logits are formed by computing the squared Euclidean distance between the feature representation of a test point b_θ,ψ(x*) and each of the class means.

In the above, Σ_class is the computed covariance of the examples in class c in D, Σ_task is the computed covariance of all the examples in D assuming they arise from a single Gaussian with a single mean, e = {e_1, e_2, e_3} are weights learned during training, and the identity matrix I is used as a regularizer.
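The three covariance choices above, and the way the ProtoNets choice collapses the logits to squared Euclidean distances, can be written compactly. This is a NumPy sketch with illustrative names (e1, e2, e3 are the learned weights defined above):

```python
import numpy as np

def head_covariance(variant, sigma_class, sigma_task, e):
    """Covariance used by each Naive Bayes head variant."""
    e1, e2, e3 = e
    d = sigma_task.shape[0]
    if variant == "qda":        # per-class covariance
        return e1 * sigma_class + e2 * sigma_task + e3 * np.eye(d)
    if variant == "lda":        # shared covariance
        return e2 * sigma_task + e3 * np.eye(d)
    if variant == "protonets":  # identity: no covariance is estimated
        return np.eye(d)
    raise ValueError(variant)

def protonets_logits(features, class_means):
    """With identity covariance, the logits reduce to negative squared
    Euclidean distances between features and class means."""
    diff = features[:, None, :] - class_means[None, :, :]  # (N, C, d)
    return -np.sum(diff ** 2, axis=-1)                     # (N, C)
```

Taking a softmax over `protonets_logits` recovers the usual ProtoNets classifier, so the nearest class mean receives the highest probability.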

FIT-LDA converges more quickly than BiT.

Figure 2: (a) Average accuracy on VTAB-1k for linear and LDA heads versus learnable parameters in the backbone. (b) Test accuracy versus training iteration for CIFAR100 and SVHN on VTAB-1k.

Figure 3: FIT-LDA outperforms BiT at low-shot. Classification accuracy as a function of the number of updateable parameters (log scale) and shots per class for FIT-LDA and BiT on four downstream datasets. Classification accuracy is on the vertical axis and is the average of 3 runs with different data sampling seeds. The dot size from smallest to largest indicates the number of shots per class: 1, 2, 5, 10, and All. A tabular version with results for additional variants is in Table A.1.

Figure 4: Global and personalized classification accuracy as a function of communication cost over 60 rounds for different numbers of clients and shots per client on CIFAR100. Classification accuracy is on the vertical axis and is the average of 3 runs with different data sampling seeds. The color of the line indicates the number of shots per class. The solid line shows the federated learning model, while dashed and dotted lines indicate the upper and lower bound baselines, respectively. By communication cost we mean the number of parameters transmitted during training.

Fig. A.1a illustrates a FiLM layer operating on a convolutional layer, and Fig. A.1b illustrates how a FiLM layer can be added to a ResNetV2 network block. FiLM layers can be similarly added to EfficientNet based backbones, amongst others.

Figure A.1: (Left) A FiLM layer operating on convolutional feature maps in layer i and channel j. (Right) How a FiLM layer is placed within a basic Residual network block (He et al., 2016a). GN is a Group Normalization layer, ReLU is a Rectified Linear Unit, and 1 × 1 and 3 × 3 are 2D convolutional layers with the stated kernel size.
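The FiLM operation in Fig. A.1 is simply a per-channel affine transformation of the convolutional feature maps. A minimal NumPy sketch (batch-first layout assumed; not the paper's implementation):

```python
import numpy as np

def film(feature_maps, gamma, beta):
    """Apply FiLM: scale and shift each channel of the feature maps.

    feature_maps: (N, C, H, W) activations; gamma, beta: (C,) per-channel
    scale and shift. These per-channel parameters are the only backbone
    parameters adapted to a downstream dataset.
    """
    return (gamma[None, :, None, None] * feature_maps
            + beta[None, :, None, None])
```

Because gamma and beta are one scalar each per channel, a layer with C channels contributes only 2C updateable parameters, which is where the small |ψ| counts in Table 1 come from.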

For FIT-QDA and FIT-LDA, the means and covariances contribute to the updateable parameter count. We use a mean for every class, which contributes Cd_b updateable parameters. A covariance matrix has d_b × d_b values; however, it can be represented in Cholesky factorized form, which results in a lower triangular matrix and thus can be represented with d_b(d_b + 1)/2 values, with the rest being zeros.
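Plugging in the BiT-M-R50x1 numbers (d_b = 2048, C = 10), the head sizes work out as follows. This is a small sketch of the counting argument above (head parameters only; the FiLM parameters ψ are counted separately):

```python
def head_param_counts(C, d_b):
    """Updateable head parameter counts for the FIT variants.

    FIT-LDA stores mu_c^T Sigma^-1 (d_b values) plus the scalar
    mu_c^T Sigma^-1 mu_c per class; FIT-QDA stores a mean and a per-class
    Cholesky factor; FIT-ProtoNets stores only the class means.
    """
    lda = C * (d_b + 1)
    qda = C * d_b + C * d_b * (d_b + 1) // 2
    protonets = C * d_b
    return lda, qda, protonets
```

The per-class covariance makes the QDA head roughly a thousand times larger than the LDA head, which is why QDA is impractical when model update size matters.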

Fig. A.2a and Fig. A.2b show the magnitude of the FiLM parameters as a function of layer for FIT-LDA on CIFAR100 and SVHN, respectively, in the VTAB-1k setting. We see that for a dataset that differs from the pretraining data (SVHN), the FiLM layers are required to learn a greater degree of adjustment compared to a dataset that is similar to the pretraining data (CIFAR100).

Figure A.2: Box plots of the FiLM parameter magnitudes as a function of layer in the feature extractor.

Fig. A.3 shows global and personalized classification accuracy as a function of communication cost for different numbers of clients and shots per client for Quickdraw, while Table A.9 shows the tabular version of this figure.

Figure A.3: Global and personalized classification accuracy as a function of communication cost over 120 rounds for different numbers of clients and shots per client on Quickdraw. Classification accuracy is on the vertical axis and is the average of 3 runs with different data sampling seeds. The color of the line indicates the number of shots per class. The solid line shows the federated learning model, while dashed and dotted lines indicate the upper and lower bound baselines, respectively.

A.10 FIT TRAINING ALGORITHMS

Algorithm A.1 and Algorithm A.2 detail how episodic fine-tuning tasks are split and sampled, respectively, for use in the FIT training protocol.

Algorithm A.1 Splitting the downstream dataset D
Require: D = {(x_n, y_n)}_{n=1}^N = {x, y}: downstream dataset
Require: unique() ≡ function that returns a list of unique classes and a list of counts of each class
Require: select by class() ≡ function that extracts samples of a specified class from a dataset
1: procedure SPLIT(D)
2:   D_train ← [ ]  ▷ Create an empty list to hold D_train
3:   D_test ← [ ]  ▷ Create an empty list to hold D_test
4:   classes, class counts ← unique(y)
5:   for all c ∈ classes do
6:     count ← class counts(c)
7:     train count ← ⌈count/2⌉  ▷ Roughly half the shots of class c go to D_train
8:     D_c ← select by class(c)  ▷ Select examples of class c from D
9:     D_train ← D_train + D_c[: train count]  ▷ Add train count examples to D_train
10:    D_test ← D_test + D_c[train count :]  ▷ Add remaining examples to D_test
11:  end for
12:  return D_train, D_test
13: end procedure

A.11 TRAINING AND EVALUATION DETAILS

In this section, we provide implementation details for all of the experiments in Section 4.
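The dataset splitting in Algorithm A.1 can be written as a short runnable Python sketch (illustrative, not the released implementation; we assume roughly half of each class goes to D_train, matching the "roughly equal shots per class" requirement):

```python
import random
from collections import defaultdict

def split_downstream_dataset(dataset, seed=0):
    """Split D into D_train and D_test with roughly equal shots per class and
    at least one example of each class in each partition (assuming every
    class has at least two examples).

    dataset: list of (x, y) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example in dataset:
        by_class[example[1]].append(example)
    d_train, d_test = [], []
    for c, examples in sorted(by_class.items()):
        rng.shuffle(examples)                      # random assignment to partitions
        train_count = max(1, len(examples) // 2)   # roughly half per class
        d_train.extend(examples[:train_count])
        d_test.extend(examples[train_count:])
    return d_train, d_test
```
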

Algorithm A.2 Sampling a task τ
Require: D_train = {(x_s, y_s)}_{s=1}^{S_τ} = {x_S, y_S}: train portion of downstream dataset
Require: D_test = {(x_q, y_q)}_{q=1}^{Q_τ} = {x_Q, y_Q}: test portion of downstream dataset
Require: support set size: size of the support set |D τ S|
Require: unique() ≡ function that returns a list of unique classes and a list of counts of each class
Require: randint(min, max) ≡ function that returns a random integer between min and max
Require: choice(range, count) ≡ function that returns a random list of count integers from range
1: procedure SAMPLE TASK(D_train, D_test, support set size)
2:   D τ S ← [ ]  ▷ Create an empty list to hold the support set
3:   D τ Q ← [ ]  ▷ Create an empty list to hold the query set
4:   train classes, train class counts ← unique(y_S)
5:   test classes, test class counts ← unique(y_Q)
6:   min way ← min(len(train classes), 5)
7:   max way ← min(len(train classes), support set size)
8:   way ← randint(min way, max way)  ▷ Classification way to use for this task
9:   selected classes ← choice(train classes, way)  ▷ List of classes to use in this task
10:  balanced shots ← floor(support set size/way)  ▷ Support shots per class
11:  max test shots ← max(1, floor(2000/way))
12:  for all c ∈ selected classes do
13:    class shots ← train class counts(c)
14:    shots to use ← min(class shots, balanced shots)
15:    selected shots ← choice(class shots, shots to use)  ▷ Support shot list
16:    D τ S ← D τ S + D_train[selected shots]  ▷ Add examples to D τ S
17:    class shots ← test class counts(c)
18:    shots to use ← min(class shots, max test shots)
19:    selected shots ← choice(class shots, shots to use)  ▷ Query shot list
20:    D τ Q ← D τ Q + D_test[selected shots]  ▷ Add examples to D τ Q
21:  end for
22:  return D τ S, D τ Q
23: end procedure
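A Python sketch of the task sampling in Algorithm A.2 (illustrative, not the released code; the per-class support budget is assumed to be support_set_size divided by the way):

```python
import random
from collections import defaultdict

def sample_task(d_train, d_test, support_set_size=100, seed=0):
    """Sample an episodic fine-tuning task: a support set drawn from
    D_train and a query set drawn from D_test (lists of (x, y) pairs)."""
    rng = random.Random(seed)
    train_by_class, test_by_class = defaultdict(list), defaultdict(list)
    for ex in d_train:
        train_by_class[ex[1]].append(ex)
    for ex in d_test:
        test_by_class[ex[1]].append(ex)
    classes = sorted(train_by_class)
    min_way = min(len(classes), 5)
    max_way = min(len(classes), support_set_size)
    way = rng.randint(min_way, max_way)               # classification way
    selected = rng.sample(classes, way)
    balanced_shots = max(1, support_set_size // way)  # support budget per class
    max_test_shots = max(1, 2000 // way)              # cap on query shots
    support, query = [], []
    for c in selected:
        support += rng.sample(train_by_class[c],
                              min(len(train_by_class[c]), balanced_shots))
        query += rng.sample(test_by_class[c],
                            min(len(test_by_class[c]), max_test_shots))
    return support, query
```
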

Algorithm A.3 Splitting a test task D for ORBIT personalization experiments
Require: D: downstream dataset; D = {D_c}_{c=1}^C, where C is the number of classes in the test task and D_c is the data of class c; D_c = {V_ci}_{i=1}^{n_c}, where n_c is the number of videos in class c and V_ci is the set of clips from the ith video of class c; V_ci = {(x_cij, c)}_{j=1}^{n_ci}, where n_ci is the number of clips in the ith video of class c and x_cij is the jth clip from video V_ci
Require: batch size: size of context split
Require: choose(n, m) ≡ function that randomly samples m different integers from a set {i}_{i=1}^n
Require: select by index(D, i) ≡ function that extracts samples of indices i from a dataset D
Require: diff(a, b) ≡ function that computes the set difference between sets a and b
Require: range(n) ≡ function that returns a set of values {i}_{i=1}^n
1: procedure SPLIT ORBIT TASK(D)
2:   D_train ← [ ]
3:   D_test ← [ ]
4:   num clips ← floor(batch size/C)
5:   for c ← 1 to C do
6:     num context videos ← ceil(n_c/2)
7:     context videos indices ← choose(n_c, num context videos)
8:     num clips per video ← floor(num clips/num context videos)
9:     if n_c = 1 then
10:      context clips indices ← choose(n_c1, num clips per video)
11:      target clips indices ← diff(range(n_c1), context clips indices)
12:      D_train ← D_train + select by index(V_c1, context clips indices)
13:      D_test ← D_test + select by index(V_c1, target clips indices)
14:    else
15:      for j ← context videos indices do
16:        context clips indices ← choose(n_cj, num clips per video)
17:        D_train ← D_train + select by index(V_cj, context clips indices)
18:      end for
19:      for j ← diff(range(n_c), context videos indices) do
20:        D_test ← D_test + V_cj
21:      end for
22:    end if
23:  end for
24:  return D_train, D_test
25: end procedure

CIFAR100 We train all federated learning models with different numbers of clients and shots per client for 60 communication rounds, with 5 clients per round and 10 update steps per client. Each client has 10 classes, which are sampled uniformly before the start of training.
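The ORBIT task split in Algorithm A.3 can be sketched in Python as follows (illustrative names, not the ORBIT codebase). The key invariant is that, for classes with more than one video, no video contributes clips to both the support and query sets:

```python
import math
import random

def split_orbit_task(task_videos, batch_size, seed=0):
    """Split an ORBIT test task into support (D_train) and query (D_test).

    task_videos: dict class_id -> list of videos; each video is a list of
    clips. Clips from a given video go to only one of the two sets, unless
    the class has a single video, whose clips are divided between both.
    """
    rng = random.Random(seed)
    num_classes = len(task_videos)
    num_clips = batch_size // num_classes              # support clips per class
    d_train, d_test = [], []
    for c, videos in sorted(task_videos.items()):
        n_videos = len(videos)
        n_context = math.ceil(n_videos / 2)
        context_ids = set(rng.sample(range(n_videos), n_context))
        clips_per_video = max(1, num_clips // n_context)
        if n_videos == 1:
            clips = videos[0]
            ctx = set(rng.sample(range(len(clips)),
                                 min(len(clips), clips_per_video)))
            d_train += [(clips[i], c) for i in ctx]
            d_test += [(clips[i], c) for i in range(len(clips)) if i not in ctx]
        else:
            for j in range(n_videos):
                if j in context_ids:                   # context video -> support
                    clips = videos[j]
                    ctx = rng.sample(range(len(clips)),
                                     min(len(clips), clips_per_video))
                    d_train += [(clips[i], c) for i in ctx]
                else:                                  # remaining videos -> query
                    d_test += [(clip, c) for clip in videos[j]]
    return d_train, d_test
```
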

Shared and updateable parameters for the transfer learning methods considered. The Example column contains the updateable parameters for all methods using a BiT-M-R50x1 backbone with |θ| = 23,500,352, |ψ| = 11,648, d_b = 2048, and C = 10.

From the downstream dataset D, we derive two sets, D_train and D_test. If D is sufficiently large (|D| ≈ 1000), such that overfitting is not an issue, we set D_train = D_test = D. Otherwise, in the few-shot scenario, we randomly split D into D_train and D_test such that the number of examples or shots in each class c is roughly equal in both partitions and there is at least one example of each class in both. Refer to Algorithm A.1 for details. For each training iteration, we sample a task τ consisting of a support set D τ S drawn from D_train with S τ examples and a query set D τ Q drawn from D_test with Q τ examples. If all of D_train and D_test are used for every iteration, overfitting occurs, limiting accuracy (see Table A.2 and Table A.4). The support set D τ S is then used to compute π, µ, and Σ.

FIT outperforms BiT on VTAB-1k. Classification accuracy and updateable parameter count (with 10 classes) for FIT variants and BiT on VTAB-1k with the BiT-M-R50x1 backbone. Accuracy figures are percentages. Bold type indicates the highest scores. Green indicates summary columns.

FIT outperforms competitive methods on ORBIT. Average accuracy (95% confidence interval) over 85 test tasks. b_θ(x) is the backbone used. EN-1K(-MD) is EfficientNet-B0 pretrained on ImageNet (Meta-Dataset). RN-21K is BiT-M-R50x1 pretrained on ImageNet-21K. Shared is the number of parameters shared among all users. Per User is the number of parameters stored for each user with C classes. Average is the mean number of individual user parameters over ORBIT.

In this setting, models with a smaller number of updateable parameters are preferred in order to reduce the client-server communication cost, which is typically bandwidth-limited. The configuration and training protocol of FIT help significantly reduce communication cost compared to standard fine-tuning protocols, as i) only FiLM layers are transferred at each communication round, and ii) in contrast to a standard linear head, the Naive Bayes head is not transferred and is constructed locally for each client. Table 5 shows the communication parameter savings of FIT compared to BiT.
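To make the savings concrete, the per-round communication can be estimated directly. The sketch below uses the BiT-M-R50x1 sizes quoted elsewhere in the appendix and assumes, as a simplification of the paper's accounting, that each selected client both downloads and uploads the updateable parameters:

```python
FILM_PARAMS = 11_648          # |psi| for BiT-M-R50x1
BACKBONE_PARAMS = 23_500_352  # |theta| for BiT-M-R50x1

def per_round_cost(updateable_params, clients_per_round=5):
    """Parameters crossing the network in one communication round, assuming
    each selected client downloads the current updateable parameters and
    uploads its update (an assumption of this sketch)."""
    return 2 * clients_per_round * updateable_params
```

With only the FiLM layers in flight, each round moves roughly a hundred thousand parameters instead of hundreds of millions, which is the source of the communication savings in Table 5.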

FIT is significantly more parameter efficient than BiT for federated learning. Comparison of BiT and FIT in the few-shot federated setting on CIFAR100 for different numbers of clients and shots per client. The global setting is used. Accuracy figures are percentages and the ± sign indicates the 95% confidence interval over 3 runs. Parameter cost indicates the number of parameters sent in server-client communication. Per round is the number of parameters transmitted in each communication round and Overall is the number of parameters transmitted during the whole training.



Table A.2 shows the few-shot results for all three variants of FIT with different ablations on how the downstream dataset D is allocated during training. No Split indicates that D_train is not split into two disjoint partitions and D_train = D_test = D. However, D_train and D_test are sampled to form episodic fine-tuning tasks as detailed in Algorithm A.2. Split indicates that D_train is split into two disjoint partitions as detailed in Algorithm A.1 and then sampled into tasks as described in Algorithm A.2. Use All indicates that D_train = D_test = D (i.e. D is not split), that D_train and D_test are not sampled, and that D τ S = D τ Q = D for all tasks τ.

Table A.2: Classification accuracy for all three variants of FIT as a function of shots per class and of how the downstream dataset D is allocated during training on standard datasets. The backbone is BiT-M-R50x1 with |θ| = 23,500,352, |ψ| = 11,648, and d_b = 2048. Accuracy figures are percentages and the ± sign indicates the 95% confidence interval over 3 runs.

Table A.4 shows the VTAB-1k results for all three variants of FIT with different ablations on how the downstream dataset D is allocated during training. Refer to Appendix A.4.2 for the meanings of No Split, Split, and Use All. With some minor exceptions, the Use All case performs the worst. The performance of the No Split and Split options is very close, with No Split being slightly better when averaged over all of the datasets. As a result, we use the No Split option when reporting the VTAB-1k results.

Table A.3: FIT outperforms BiT on VTAB-1k. Classification accuracy and Relative Model Update Size (RMUS) for all three variants of FIT and BiT on the VTAB-1k benchmark. The backbone is BiT-M-R50x1. Accuracy figures are percentages and the ± sign indicates the 95% confidence interval over 3 runs. Bold type indicates the highest scores (within the confidence interval).

Table A.4: Classification accuracy for all three variants of FIT on the VTAB-1k benchmark as a function of how the downstream dataset D is utilized during training. The backbone is BiT-M-R50x1. Accuracy figures are percentages and the ± sign indicates the 95% confidence interval over 3 runs. Bold type indicates the highest scores.

Table A.5: Comparison of the classification accuracy of LDA and linear heads on VTAB-1k as a function of the learnable parameters in the backbone. "None" indicates that all of the parameters in the pretrained backbone are frozen (i.e. no learnable parameters). "FiLM" indicates that the only learnable parameters in the backbone are the FiLM layers. "All" indicates that all the parameters in the backbone are learnable and that FiLM is not used. The backbone is BiT-M-R50x1. Accuracy figures are percentages and the ± sign indicates the 95% confidence interval over 3 runs. This is the full version of Fig. 2a.

Table A.6: Few-shot Federated Learning Results on CIFAR100 for different numbers of clients and shots per client. Accuracy figures are percentages and the ± sign indicates the 95% confidence interval over 3 runs. Global stands for the global setting, while Personalized stands for the personalized scenario. FL indicates Federated Learning training. Table A.6 shows the tabular version of Fig. 4. For the Federated Learning results, it includes only the resulting accuracy after training for 60 communication rounds. Refer to Section 4.4 for analysis.

Similarly to the CIFAR100 experiments, we compare FIT to BiT on the Quickdraw dataset. We train both models for 60 communication rounds. The results are presented in Table A.8. For 5 and 10 shots per class, FIT performs similarly to BiT for all numbers of clients tested. However, for 2-shot, BiT outperforms FIT for all numbers of clients tested, though the performance gap narrows as the number of clients increases. We attribute this behavior to the use of the ProtoNets classifier. During training, at each client's optimization step, we construct an N-way local ProtoNets classifier using only one image per class, which may result in a classifier that is not robust enough for reasonable optimization of the FiLM layers. This hypothesis is also supported by Fig. A.3, where we see a large gap between Federated Learning and centralized learning (the upper bound) for 2-shot. In contrast, the linear head in BiT is not local and is trained using all clients' data, thus avoiding this pitfall.

Table A.7: FIT trained with different FL algorithms on CIFAR100 for different numbers of clients and shots per client. Federated Averaging (FedAvg) and FedProx methods were used. For FedProx, µ = 0.01 is used. Accuracy figures are percentages and the ± sign indicates the 95% confidence interval over 3 runs. Global stands for the global setting, while Personalized stands for the personalized scenario.

Table A.8: Comparison of BiT and FIT in the few-shot federated setting on Quickdraw for different numbers of clients and shots per client. The global setting is used. Accuracy figures are percentages and the ± sign indicates the 95% confidence interval over 3 runs. Both models were trained for 60 communication rounds. Parameter cost indicates the number of parameters sent in server-client communication. Per round is the number of parameters transmitted in each communication round and Overall is the number of parameters transmitted during the whole training.
This leads us to the following observation: if there is enough local client data to construct a robust metric-based classifier, then the Naïve Bayes head helps to significantly reduce communication cost without sacrificing the final model quality. However, if there is not enough local client data, then the use of a linear classification head may be more appropriate.

Table A.9: Few-shot Federated Learning Results on Quickdraw for different numbers of clients and shots per client. Accuracy figures are percentages and the ± sign indicates the 95% confidence interval over 3 runs. Global stands for the global setting, while Personalized stands for the personalized scenario. FL indicates Federated Learning training.

ACKNOWLEDGMENTS

Aliaksandra Shysheya, John Bronskill, Massimiliano Patacchiola and Richard E. Turner are supported by an EPSRC Prosperity Partnership EP/T005386/1 between the EPSRC, Microsoft Research and the University of Cambridge. This work has been performed using resources provided by the Cambridge Tier-2 system operated by the University of Cambridge Research Computing Service (https://www.hpc.cam.ac.uk) funded by EPSRC Tier-2 capital grant EP/P020259/1. We thank the anonymous reviewers for key suggestions and insightful questions that significantly improved the quality of the paper. We also thank Aristeidis Panos and Siddharth Swaroop for providing helpful comments and suggestions.

Reproducibility Statement Source code for experiments can be found at: https://github.com/cambridge-mlg/fit. The README file details the data preparation steps and includes the command lines to configure and run the experiments. Appendix A.11 details the training and evaluation procedures and hyperparameter settings for all of the experiments including few-shot and VTAB-1k transfer learning experiments, personalization on ORBIT experiments, and the federated learning experiments.

