META-LEARNING VIA CLASSIFIER(-FREE) GUIDANCE

Abstract

State-of-the-art meta-learning techniques do not optimize for zero-shot adaptation to unseen tasks, a setting in which humans excel. On the contrary, meta-learning algorithms learn hyperparameters and weight initializations that explicitly optimize for few-shot learning performance. In this work, we take inspiration from recent advances in generative modeling and language-conditioned image synthesis to propose meta-learning techniques that use natural language guidance to achieve higher zero-shot performance compared to the state-of-the-art. We do so by recasting the meta-learning problem as a multi-modal generative modeling problem: given a task, we consider its adapted neural network weights and its natural language description as equivalent multi-modal task representations. We first train an unconditional generative hypernetwork model to produce neural network weights; then we train a second "guidance" model that, given a natural language task description, traverses the hypernetwork latent space to find high-performance task-adapted weights in a zero-shot manner. We explore two alternative approaches for latent space guidance: "HyperCLIP"-based classifier guidance and a conditional Hypernetwork Latent Diffusion Model ("HyperLDM"), which we show to benefit from the classifier-free guidance technique common in image generation. Finally, we demonstrate that our approaches outperform existing meta-learning methods with zero-shot learning experiments on our Meta-VQA dataset, which we specifically constructed to reflect the multi-modal meta-learning setting.

Table 1 (excerpt):
* HNet + HyperCLIP Guidance: 58.82 (± 0.27)
* HVAE + HyperCLIP Guidance: 58.75 (± 0.29)
* HNet + HyperLDM γ = 1: 58.70 (± 0.11)
* HNet + HyperLDM γ = 1.5: 58.60 (± 0.09)
* HVAE + HyperLDM γ = 1: 58.97 (± 0.09)
* HVAE + HyperLDM γ = 1.5: 58.89 (± 0.07)

1. INTRODUCTION

State-of-the-art machine learning algorithms often lack the ability to quickly generalize in a sample-efficient manner to new unseen tasks. In contrast, humans show remarkable capabilities in leveraging previous knowledge for learning a new task from just a few examples. Often, not even a single example is needed, as all relevant task information can be conveyed in the form of natural language instructions. Indeed, humans can solve novel tasks when prompted from a variety of different interaction modalities such as visual task observations or natural language prompts. In this work we aim to improve forward generalization of deep networks by developing new models that are capable of a similar multi-modal task inference. In particular, we present two different approaches that utilize text-based task descriptors for rapid zero-shot adaptation to new tasks.

The development of deep learning models that quickly adapt and generalize to unseen tasks is the focus of the field of meta-learning. A recent trend stemming from the success of Model-Agnostic Meta-Learning (Finn et al., 2017, MAML) describes meta-learning as a bi-level optimization problem: an outer loop meta-model is trained with the goal of improving the few-shot performance of a base model when fine-tuned on a variety of related tasks. MAML was specifically introduced as a gradient-based method to find an unconditional network initialization with good few-shot performance over an entire task universe. Recent progress in large scale deep learning is, however, challenging this explicit meta-learning framework. Large models trained on huge, rich, and diverse data sets have been shown to possess surprisingly good few-shot learning capabilities (Brown et al., 2020). In particular, large scale pre-training and fine-tuning often outperforms explicit meta-learning procedures (Mandi et al., 2022).
These developments prompted us to develop alternative methods for explicit meta-learning which natively benefit from rich and multi-modal data. Inspired by recent advances in conditional image generation (Ramesh et al., 2022; Rombach et al., 2022), we reframe meta-learning as a multi-modal generative modeling problem such that, given a task, its adapted neural network weights and its natural language description are considered equivalent multi-modal task representations. What we show is that popular techniques for the image domain, such as CLIP-based guidance (Gal et al., 2021; Patashnik et al., 2021), denoising diffusion models (Ho et al., 2020), and classifier-free guidance (Dhariwal & Nichol, 2021; Ho & Salimans, 2021; Nichol et al., 2022) can be repurposed for the meta-learning setting to generate adapted neural network weights instead of images. Specifically, we approach the generation of neural network weights in two separate phases. In the unconditional pre-training phase, we train a generative hypernetwork (Ha et al., 2016) to map from its latent space to the weight space of a base model (Figure 1.A). In the guidance phase, we learn language-conditioned models that can be used to traverse the hypernetwork latent space and find zero-shot adapted weights with high performance on our task (Figure 1.B and 1.C).

Figure 1: Schematic of the three main components of our proposed meta-learning approach. A. An unconditional variational autoencoder (VAE) models the latent space of adapted network weights $W$. Its generator hypernetwork $h$ (highlighted in gray) can be re-used in the conditional setting with our guidance techniques. B. Our HyperCLIP encoder $\mathrm{CLIP}_H$ is contrastively trained to map network weights $W$ to the space of CLIP embeddings $e_i$. Then, given a new task with descriptor $t_i$, we can use CLIP guidance to find a VAE latent vector $z_i$ with embedding $e^{(H)}_i$ that has a high cosine similarity to a given task embedding $e^{(T)}_i$. C. Alternatively, our Hypernetwork Latent Diffusion Model (HyperLDM) learns, conditional on the task embedding $e_i$, to iteratively denoise a VAE latent vector $z^T_i, \dots, z^0_i$ over $T$ iterations.

1.1. OUR CONTRIBUTIONS

We summarise our contributions as follows:
• We recast meta-learning as a multi-modal generative modeling problem using hypernetworks, providing a theoretical framework for the application of recent research on image synthesis to the field of meta-learning.
• We propose HyperCLIP, a contrastive learning method equivalent to Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021), producing CLIP embeddings of fine-tuned neural network weights. We can then use HyperCLIP to: 1) perform CLIP-guidance in the latent space of a hypernetwork model (Figure 1.B); and 2) perform task inference on fine-tuned models without given task descriptions.
• We propose Hypernetwork Latent Diffusion Models (HyperLDM) as an alternative to HyperCLIP guidance for task-conditional network generation within the latent space of a hypernetwork model (Figure 1.C). We show how combining this approach with classifier-free guidance (Ho & Salimans, 2021) improves the performance of generated networks.
• We demonstrate the soundness of our methods on Meta-VQA, our modification of the Visual-Question-Answering VQA v2.0 dataset (Goyal et al., 2017) that interprets questions as task descriptions. We show how our guidance methods outperform traditional multi-task and meta-learning techniques for zero-shot learning on this dataset.

2. META-LEARNING WITH MULTI-MODAL TASK EMBEDDINGS

The setting we investigate is similar to the classic meta-learning framework, where we operate within a distribution of tasks $\mathcal{T}_i \sim p(\mathcal{T})$, each associated with a loss function $\mathcal{L}_{\mathcal{T}_i}$. Using a set of training tasks drawn from this distribution, our goal is to train a model such that it generally performs well on new unseen tasks drawn from $p(\mathcal{T})$.

2.1. MODEL-AGNOSTIC META-LEARNING

We present here a slightly altered formulation of MAML (Finn et al., 2017) introduced by Zintgraf et al. (2019), whereby the parameters of the model $g$ are partitioned into two parts: context parameters $\phi$ that are adapted on individual tasks, and shared parameters $\theta$ that are meta-trained and shared across tasks. MAML and its variants focus on the few-shot setting, which aims to learn an initialization for these parameters such that the model $g(\cdot, \theta, \phi)$ generalizes well on new tasks after fine-tuning $\phi$ on a few data points from that task. To train such a model, the data from each task $\mathcal{T}_i$ is split during training into a support set $\mathcal{D}^s_i$ and a query set $\mathcal{D}^q_i$. The MAML objective aims to optimize the validation score evaluated on the query set when fine-tuning $\phi$ on the support set, e.g., consider the following optimization problem:

$$\min_{\theta, \phi} \; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \left[ \frac{1}{|\mathcal{D}^q_i|} \sum_{(x, y) \in \mathcal{D}^q_i} \mathcal{L}_{\mathcal{T}_i}\big(g(x, \theta, \mathcal{A}_{\mathcal{T}_i}(\mathcal{D}^s_i, \theta, \phi)), y\big) \right], \tag{1}$$

where $\mathcal{A}_{\mathcal{T}_i}$ is some differentiable algorithm, typically implementing a variant of few-step gradient descent on the loss computed on the support set, e.g., in the case of one-step gradient descent:

$$\mathcal{A}_{\mathcal{T}_i}(\mathcal{D}^s_i, \theta, \phi) = \phi - \eta \frac{1}{|\mathcal{D}^s_i|} \sum_{(x', y') \in \mathcal{D}^s_i} \nabla_\phi \mathcal{L}_{\mathcal{T}_i}\big(g(x', \theta, \phi), y'\big) \tag{2}$$

with some learning rate $\eta$. The objective from Eq. 1 is itself solved with gradient descent, by iteratively optimizing the parameters $\phi$ in the inner loop on the support set of a sampled task, and updating $\theta$ and the initialization of $\phi$ with their gradient with respect to the entire inner loop training process, averaged over batches of tasks. Note that the original formulation of MAML considers $\theta = \emptyset$.
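As a minimal sketch of the inner-loop algorithm above (one-step gradient descent on the context parameters), consider the following toy example. The linear model, the mean squared error loss, and all function names are assumptions made for illustration only; the outer loop that differentiates through this step to update θ and the initialization of ϕ is omitted.

```python
import numpy as np

def model(x, theta, phi):
    # Toy base model g(x, theta, phi): shared linear weights theta plus a
    # task-specific scalar context parameter phi (an illustrative assumption).
    return x @ theta + phi

def inner_step(support_x, support_y, theta, phi, lr=0.1):
    """One-step gradient descent on phi for the support-set mean squared error."""
    residual = model(support_x, theta, phi) - support_y
    grad_phi = 2.0 * residual.mean()  # gradient of mean((pred - y)^2) w.r.t. phi
    return phi - lr * grad_phi

def query_loss(query_x, query_y, theta, phi_adapted):
    """Query-set loss evaluated with the adapted context parameter."""
    return float(np.mean((model(query_x, theta, phi_adapted) - query_y) ** 2))

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])      # shared parameters, held fixed here
phi_init = 0.0                     # meta-learned initialization of phi
x_s = rng.normal(size=(8, 2))      # support inputs
y_s = x_s @ theta + 3.0            # this task has a true offset of 3
x_q = rng.normal(size=(8, 2))      # query inputs
y_q = x_q @ theta + 3.0

phi_adapted = inner_step(x_s, y_s, theta, phi_init, lr=0.25)
loss_before = query_loss(x_q, y_q, theta, phi_init)
loss_after = query_loss(x_q, y_q, theta, phi_adapted)
```

In a full implementation, the query loss would be averaged over a batch of tasks and back-propagated through `inner_step` with an automatic differentiation library.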

2.2. NATURAL LANGUAGE TASK EMBEDDINGS

In this work, we assume access to an additional high-level context embedding $e_i$ for each task $\mathcal{T}_i$. In practice, such embeddings can come from a natural language description $t_i$ of the task, which can be encoded into task embeddings using pre-trained language models. While MAML is a powerful algorithm for finding universal initializations, the model naturally requires some data at test time to adapt its unconditioned network parameters. In contrast, by incorporating task embeddings, we aim to improve performance on a new task in the no-data (zero-shot), or very low data regime. A simple way to incorporate task embeddings into our model during training is by augmenting the input of the network, concatenating such input with the task embedding during the forward pass. Instead, we consider the use of a hypernetwork (Ha et al., 2016), i.e., a network that generates the weights of another network given a conditioning input. Concretely, given an unconditioned neural network $f$ parametrized by a weight vector $W$, we reparametrize the model by introducing a hypernetwork $h$. The hypernetwork $h$ is parametrized by $\theta$ and generates a task-conditioned weight vector, $h(e_i, \theta) = W_i$. The model is then defined as $f(\cdot, h(e_i, \theta))$. We can thus rewrite the MAML objective with respect to the hypernetwork weights $\theta$ as

$$\min_{\theta} \; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \left[ \frac{1}{|\mathcal{D}^q_i|} \sum_{(x, y) \in \mathcal{D}^q_i} \mathcal{L}_{\mathcal{T}_i}\big(f(x, h(\mathcal{A}_{\mathcal{T}_i}(\mathcal{D}^s_i, e_i, \theta), \theta)), y\big) \right]. \tag{3}$$

When $\mathcal{A}_{\mathcal{T}_i}(\mathcal{D}^s_i, e_i, \theta) = e_i$, we recover the classic multi-task objective of a hypernetwork optimizing for zero-shot performance. When $\mathcal{A}_{\mathcal{T}_i}$ is instead the gradient descent algorithm on $e_i$, the objective aligns with the few-shot performance of $h$ when adapting the embedding initialized at $e_i$.
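To make the reparametrization concrete, the sketch below shows a task embedding being mapped to a flat weight vector that is then unpacked into a base network. The single linear layer used for both the hypernetwork and the base network, the initialization scale, and all names are illustrative assumptions, not the architectures used in the experiments.

```python
import numpy as np

def init_hypernetwork(embed_dim, in_dim, out_dim, rng):
    """Parameters theta of a toy linear hypernetwork (illustrative assumption)."""
    n_weights = in_dim * out_dim + out_dim  # base weight matrix plus bias
    return {
        "A": rng.normal(scale=0.01, size=(n_weights, embed_dim)),
        "b": np.zeros(n_weights),
    }

def hypernetwork(e, theta):
    """h(e, theta) = W: map a task embedding to a flat base weight vector."""
    return theta["A"] @ e + theta["b"]

def base_network(x, w_flat, in_dim, out_dim):
    """f(x, W): unpack the generated weights and apply a linear layer."""
    weight = w_flat[: in_dim * out_dim].reshape(in_dim, out_dim)
    bias = w_flat[in_dim * out_dim:]
    return x @ weight + bias

rng = np.random.default_rng(0)
theta = init_hypernetwork(embed_dim=5, in_dim=3, out_dim=2, rng=rng)
e_i = rng.normal(size=5)            # stands in for a language task embedding
w_i = hypernetwork(e_i, theta)      # task-conditioned weights W_i
logits = base_network(rng.normal(size=(4, 3)), w_i, in_dim=3, out_dim=2)
```

With the zero-shot choice of the adaptation algorithm (returning the embedding unchanged), training θ on the query loss of `base_network` corresponds to the multi-task objective described above.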

3. HYPERNETWORKS AS GENERATIVE MODELS

A rich literature exists on hypernetworks interpreted as generative models of base network weights (see Section 7). Our work builds upon this interpretation to adapt multi-modal generative modeling techniques to the meta-learning domain. In generative modeling, we aim to learn the distribution $p(x)$ over a high dimensional data domain $\mathcal{X}$, such as images, given samples from the distribution. Since classic maximum-likelihood learning is insufficient to do so, we must resort to techniques such as variational inference, adversarial learning, or diffusion models. In meta-learning, the modeling of a distribution of diverse high-dimensional base network weights $W$ may be considered in analogous terms. In the Bayesian setting, this distribution is made explicit as we seek to model the posterior $p(W \mid \mathcal{D})$ given data $\mathcal{D}$, but the framework is still useful even when no explicit posterior distribution is assumed, as is the case for deep ensembles. In the present work, we indeed avoid explicit Bayesian inference: given training tasks, we consider their respective fine-tuned network weights $W_i$ as training samples for our generative model of network weights. The fundamental building block of our unconditional generative model is the hypernetwork $h(z, \theta) = W$, which we can train in two ways:

1) We define a Hypernetwork VAE (HVAE) as in Figure 1.A, which, given samples of fine-tuned base network weights $W_i$, learns a low-dimensional normally distributed latent representation $z$. The encoder $d(W, \omega) = (\mu_z, \Sigma_z)$ with parameters $\omega$ maps base network weights to means and variances used to sample a latent vector $z$, while the decoder (or generator) is a classic hypernetwork $h(z, \theta) = W$ which reconstructs the network weights from the latent vector.

2) Using MAML, we learn both an embedding $z$ and hypernetwork weights $\theta$ such that, when fine-tuning only the embedding $z$ on each task $\mathcal{T}_i$, we obtain high-performing base networks with weights $W_i = h(z_i, \theta)$. Concretely, we optimize $\theta$ and the initialization of $z$ following the objective in Eq. 1, where $z$ takes the role of the task-specific parameter $\phi$.
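The first option (HVAE) can be sketched as a standard VAE over flat weight vectors. The following toy example is only an illustration under stated assumptions: the linear encoder and decoder, the diagonal covariance parametrized by a log-variance, and the loss (mean squared reconstruction error plus a KL term towards a standard normal prior) are choices made here for concreteness and are not taken from the text.

```python
import numpy as np

def encode(w, omega):
    """Toy linear encoder d(W, omega): mean and log-variance of the latent z."""
    return omega["mu"] @ w, omega["log_var"] @ w

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * noise with noise ~ N(0, I)."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def decode(z, theta):
    """Toy linear generator hypernetwork h(z, theta) = W."""
    return theta @ z

def hvae_loss(w, omega, theta, rng, kl_weight=1.0):
    """Reconstruction error plus KL(q(z|W) || N(0, I)) for one weight sample."""
    mu, log_var = encode(w, omega)
    z = reparameterize(mu, log_var, rng)
    recon = np.mean((decode(z, theta) - w) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return float(recon + kl_weight * kl), float(kl)

rng = np.random.default_rng(0)
n_weights, latent_dim = 20, 4
omega = {
    "mu": rng.normal(scale=0.1, size=(latent_dim, n_weights)),
    "log_var": rng.normal(scale=0.1, size=(latent_dim, n_weights)),
}
theta = rng.normal(scale=0.1, size=(n_weights, latent_dim))
w_i = rng.normal(size=n_weights)   # stands in for fine-tuned weights W_i
loss, kl = hvae_loss(w_i, omega, theta, rng)
```

After training, only `decode` (the generator hypernetwork) is needed to map latent vectors to base network weights.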

4. HYPERCLIP: TRAINING A CLIP ENCODER FOR THE "META-LEARNING MODALITY"

To define our new meta-learning approach, we first borrow from the field of multi-modal contrastive learning. More specifically, we build on top of Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021), a popular method for joint learning of language and image embeddings with applications to zero-shot and few-shot classification. In the original CLIP formulation, separate text and image encoders are trained such that, given a bi-modal sample $(x_i, t_i)$ of an image and its corresponding language caption, their representations ($\mathrm{CLIP}_I(x_i) = e^{(I)}_i$ and $\mathrm{CLIP}_T(t_i) = e^{(T)}_i$) are aligned across modalities. Specifically, the formulation maximizes the cosine similarity $e^{(I)\top}_i e^{(T)}_j / \big(\|e^{(I)}_i\| \, \|e^{(T)}_j\|\big)$ for pair-wise matches ($i = j$) and minimizes the cosine similarity for non-matches ($i \neq j$). Beyond the original language-image setting, the CLIP approach can be easily adapted to include additional modalities, aligning the representation of more than two encoders at a time. Existing works such as AudioCLIP (Guzhov et al., 2022) demonstrate the possibility of training an encoder for an additional modality such as audio alongside the pre-trained frozen CLIP language-image encoders.

4.1. CONTRASTIVE LEARNING ON NEURAL NETWORK WEIGHTS

In our work, we consider multi-modal representations of meta-learning tasks $\mathcal{T}_i$, which may be presented in the form of language as task descriptions $t_i$, but potentially also in the form of images, videos, and audio. We fine-tune a base machine learning model $f(x, W_i) = y$ for task $\mathcal{T}_i$ and consider the base model as part of an alternative meta-learning modality for task $\mathcal{T}_i$. Fine-tuned networks from the meta-learning modality can then be paired in contrastive learning with the other multi-modal descriptions of $\mathcal{T}_i$. We can thus define our new HyperCLIP encoder as a "reverse hypernetwork" $\mathrm{CLIP}_H(W_i) = e^{(H)}_i$, taking fine-tuned neural network weights $W_i$ as input, and outputting a CLIP embedding $e^{(H)}_i$ optimized for high cosine similarity with the CLIP embedding for the textual (and image, video, etc.) description of the task. We point to Figure 2 for an illustration of the approach, and to Algorithm 1 for the detailed training procedure.

[Figure 2 panels A, B, C; example task questions shown: "Who is wearing glasses?", "How is the weather?", "Where is the child sitting?"]

Figure 2: Our HyperCLIP encoder $\mathrm{CLIP}_H$ is contrastively trained to map neural network weights $W$ to the latent space of a pre-trained language encoder $\mathrm{CLIP}_T$, which we use to embed the natural language questions associated to the tasks (see A). To perform task inference given an already fine-tuned network, we encode all candidate task questions using the language CLIP encoder (see B), then encode the fine-tuned network weights with HyperCLIP (see C), and finally infer the correct task with a softmax operation over cosine similarities between HyperCLIP and language CLIP embeddings.

Algorithm 1 (HyperCLIP training procedure)
  Given: training tasks $\mathcal{T}_i$ with task descriptions $t_i$
  Define two $N$-sized arrays of $d$-dimensional embeddings $T \in \mathbb{R}^{N \times d}$ and $H \in \mathbb{R}^{N \times d}$
  for $i = 1, \dots, N$ do
    $T[i] = \mathrm{CLIP}_T(t_i) / \|\mathrm{CLIP}_T(t_i)\|$
    Fine-tune $W_i$ with objective: $\min_W \sum_{(x', y') \in \mathcal{D}^{\mathrm{train}}_i} \mathcal{L}_{\mathcal{T}_i}(f(x', W), y')$
    $H[i] = \mathrm{CLIP}_H(W_i) / \|\mathrm{CLIP}_H(W_i)\|$
  end for
  loss $= \big(\mathcal{L}_{\text{cross-entropy}}(T H^\top) + \mathcal{L}_{\text{cross-entropy}}(H T^\top)\big) / 2$
  Update weights of $\mathrm{CLIP}_H(\cdot)$ using $\nabla$loss
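The symmetric contrastive loss of Algorithm 1 and the softmax-based task inference described in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes the embeddings have already been computed; the function names are hypothetical, the cross-entropy targets are assumed to be the matching indices (row $i$ matches column $i$), and no temperature scaling is applied, following the algorithm as written.

```python
import numpy as np

def l2_normalize(a):
    """Scale vectors (along the last axis) to unit Euclidean norm."""
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

def cross_entropy_diag(logits):
    """Row-wise softmax cross-entropy where the target of row i is column i."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def hyperclip_loss(text_emb, weight_emb):
    """Symmetric contrastive loss between text and network-weight embeddings."""
    t = l2_normalize(text_emb)     # rows play the role of T[i]
    h = l2_normalize(weight_emb)   # rows play the role of H[i]
    return 0.5 * (cross_entropy_diag(t @ h.T) + cross_entropy_diag(h @ t.T))

def infer_task(weight_emb, candidate_text_emb):
    """Softmax over cosine similarities between one network and all candidates."""
    sims = l2_normalize(candidate_text_emb) @ l2_normalize(weight_emb)
    sims = sims - sims.max()
    probs = np.exp(sims) / np.exp(sims).sum()
    return int(np.argmax(probs)), probs

# Toy check: perfectly aligned embeddings give a lower loss than misaligned ones.
text = np.eye(4)
aligned_loss = hyperclip_loss(text, np.eye(4))
shuffled_loss = hyperclip_loss(text, np.roll(np.eye(4), 1, axis=0))
best_index, probabilities = infer_task(text[2], text)
```

In training, the gradient of `hyperclip_loss` would be taken with respect to the parameters of the HyperCLIP encoder only, with the language encoder kept frozen.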

4.2. CLASSIFIER-GUIDED META-LEARNING

On their own, CLIP encoders are at most useful for zero-shot or few-shot multi-modal classification as opposed to data generation. Recent popular image synthesis techniques, however, use CLIP encoders or other classifiers to guide generation from pre-trained unconditional generative models. Classifier guidance or CLIP guidance (Gal et al., 2021; Patashnik et al., 2021) use gradients with respect to a classifier or CLIP encoder to traverse a generative model's latent space. In this work, we introduce HyperCLIP guidance, the first algorithm for classifier guidance in the meta-learning setting (Figure 1.B). Given a previously unseen validation task $\mathcal{T}_i$ and an unconditional generative hypernetwork model $h(z, \theta) = W$, we can use a trained HyperCLIP encoder $\mathrm{CLIP}_H(W) = e^{(H)}$ to guide the exploration of the hypernetwork's latent space and find a set of base weights $W_i$ with high zero-shot performance for $\mathcal{T}_i$. Specifically, as long as we are given a starting hypernetwork latent vector $z_0$ and a textual description $t_i$ of the task, we can update $z_0$ with gradient descent over the guidance loss

$$\mathcal{L}_{\text{guidance}}(z) = - \frac{\mathrm{CLIP}_H(h(z, \theta))^\top \, \mathrm{CLIP}_T(t_i)}{\|\mathrm{CLIP}_H(h(z, \theta))\| \, \|\mathrm{CLIP}_T(t_i)\|} + \lambda \|z - z_0\|, \tag{4}$$

and then run the optimized latent vectors $\hat{z}_i$ through the generative hypernetwork to find adapted zero-shot base network weights $h(\hat{z}_i, \theta) = \hat{W}_i$ that perform well for the task.
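A minimal sketch of gradient descent on the guidance loss is given below. To stay self-contained it uses central finite differences instead of automatic differentiation, and identity functions stand in for the generator and the HyperCLIP encoder; these stand-ins, the step size, the number of steps, and all names are assumptions for illustration only.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def guidance_loss(z, z0, text_emb, generator, encoder, lam):
    """Negative cosine similarity plus a penalty on the distance from z0."""
    return -cosine(encoder(generator(z)), text_emb) + lam * float(np.linalg.norm(z - z0))

def numerical_grad(f, z, eps=1e-5):
    """Central finite-difference gradient (stand-in for automatic differentiation)."""
    grad = np.zeros_like(z)
    for k in range(z.size):
        step = np.zeros_like(z)
        step[k] = eps
        grad[k] = (f(z + step) - f(z - step)) / (2.0 * eps)
    return grad

def guide_latent(z0, text_emb, generator, encoder, lam=0.01, lr=0.1, steps=50):
    """Update the latent vector with plain gradient descent on the guidance loss."""
    z = z0.copy()
    for _ in range(steps):
        objective = lambda v: guidance_loss(v, z0, text_emb, generator, encoder, lam)
        z = z - lr * numerical_grad(objective, z)
    return z

# Toy setup: identity maps stand in for h(z, theta) and the HyperCLIP encoder.
identity = lambda v: v
z_start = np.array([1.0, 0.0])
task_embedding = np.array([0.0, 1.0])
z_hat = guide_latent(z_start, task_embedding, identity, identity)
loss_start = guidance_loss(z_start, z_start, task_embedding, identity, identity, 0.01)
loss_end = guidance_loss(z_hat, z_start, task_embedding, identity, identity, 0.01)
```

In practice the generator and encoder are neural networks, and the gradient with respect to the latent vector would be obtained by back-propagation through both.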

5. HYPERLDM: TASK-CONDITIONAL DIFFUSION OF HYPERNETWORK LATENTS

Using classifier guidance to explore the latent space of a traditional generative model with direct CLIP gradients is a powerful and elegant approach. However, rapid innovation in the image synthesis community has recently led to this method being largely superseded by classifier guidance and classifier-free guidance applied during the sampling process of a diffusion model (Dhariwal & Nichol, 2021; Ho & Salimans, 2021; Kim et al., 2022; Crowson, 2022; Nichol et al., 2022; Rombach et al., 2022). To paint a more complete picture, we also investigate this setting in the meta-learning domain, as an alternative to the previously introduced HyperCLIP guidance.

5.1. (LATENT) DIFFUSION MODELS

Denoising Diffusion Probabilistic Models (Sohl-Dickstein et al., 2015; Ho et al., 2020, DDPM) are a powerful class of generative models designed to learn a data distribution $p(x)$. They do so by learning the inverse of a forward diffusion process in which samples $x_0$ of our data distribution are slowly corrupted with additive Gaussian noise over $T$ steps with a variance schedule $\beta_1, \dots, \beta_T$, resulting in the Markov chain

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\big), \qquad q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}). \tag{5}$$

A property of such a process is that we can directly sample each intermediate step from $x_0$ as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ given $\epsilon \sim \mathcal{N}(0, I)$, $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then, to learn the reverse process $p_\psi(x_{t-1} \mid x_t)$, we parametrize the timestep-dependent noise function $\epsilon_\psi(x_t, t)$ with a neural network and learn it by optimizing a simplified version of the variational lower bound on $p(x)$

$$\mathcal{L}_{DM}(\psi) = \mathbb{E}_{x, \epsilon \sim \mathcal{N}(0, I), t} \left[ \|\epsilon - \epsilon_\psi(x_t, t)\|_2^2 \right]. \tag{6}$$

Sampling from the reverse process can then be done with

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\psi(x_t, t) \right) + \sigma_t \xi, \tag{7}$$

with $\xi \sim \mathcal{N}(0, I)$ and $\sigma_t$ chosen between $\beta_t$ and $\beta_t / \sqrt{1 - \bar{\alpha}_t}$. Sampling from the learned diffusion model can be seen as analogous to Langevin dynamics, a connection explicitly made in works exploring the diffusion technique from the "score matching" perspective (Song & Ermon, 2019; Song et al., 2020).

In our meta-learning setting, we aim to train a diffusion model which generates adapted zero-shot base network weights $\hat{W}_i$ that perform well for task $\mathcal{T}_i$. Thus, our diffusion model has to be conditional on a task embedding $e_i$. Moreover, in order to speed up training and leverage our previously trained generative hypernetwork $h(z, \theta)$, we define the diffusion process on latent vectors instead of doing so in weight space, emulating the Latent Diffusion technique from Rombach et al. (2022).
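The closed-form forward sample and a single reverse step can be sketched as below. The linear variance schedule and the choice $\sigma_t = \sqrt{\beta_t}$ follow common practice from Ho et al. (2020) and are assumptions here, as are the function names; timesteps are 0-indexed in the code, and the noise prediction is passed in directly rather than produced by a trained network.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumed choice) and its cumulative products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)   # alpha_bar_t = prod_{s <= t} alpha_s
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 using the closed form of the forward process."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def p_sample_step(x_t, t, eps_pred, betas, alphas, alpha_bars, rng):
    """One reverse step x_t -> x_{t-1} given a noise prediction eps_pred."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
    sigma_t = np.sqrt(betas[t])       # assumed choice: sigma_t^2 = beta_t
    # No noise is added on the final step (index 0), as is common practice.
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    return mean + sigma_t * noise

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule(num_steps=10)
x0 = np.array([1.0, -2.0, 0.5])
x_first, eps_true = q_sample(x0, 0, alpha_bars, rng)
# With the true noise as the prediction, the last reverse step recovers x0.
x_recovered = p_sample_step(x_first, 0, eps_true, betas, alphas, alpha_bars, rng)
```

A trained model would replace `eps_true` with its own prediction, and the reverse step would be iterated from the last timestep down to the first.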
Satisfying the above requirements, we propose Hypernetwork Latent Diffusion Models (HyperLDM), which learn to sample from the conditional distribution of fine-tuned latent vectors $p(z_0 \mid e_i)$ given a language CLIP embedding corresponding to the task. The HyperLDM neural network models the noise function $\epsilon_\psi(z_t, t, e_i)$, and is learned by optimizing the reweighted variational lower bound, which in this setting is

$$\mathcal{L}_{LDM}(\psi) = \mathbb{E}_{\mathcal{T}_i, h_{\mathrm{enc}}(W_i), \epsilon \sim \mathcal{N}(0, I), t} \left[ \|\epsilon - \epsilon_\psi(z_t, t, e_i)\|_2^2 \right]. \tag{8}$$

5.2. CLASSIFIER-FREE GUIDANCE FOR META-LEARNING

The classifier guidance technique presented in Section 4.2 can also be adopted together with diffusion models. Even in the case of conditional diffusion models, the gradient of an auxiliary classifier (or CLIP encoder) can be added during sampling to induce an effect similar to GAN truncation (Brock et al., 2018), producing samples that are less diverse but of higher quality. Here, we showcase how the same technique can be adopted when generating adapted base network weights conditioned on task representations, tuning the temperature of sampling to reduce variance and increase base network performance. The classifier-free guidance technique (Ho & Salimans, 2021; Nichol et al., 2022) allows us to leverage a conditional diffusion model to perform the same tempered sampling as above, without the auxiliary classifier. To do so, we train the conditional network $\epsilon_\psi(z_t, t, e_i)$ to also model the unconditional case $\epsilon_\psi(z_t, t)$. One way of doing this is with conditioning dropout, simply dropping the conditional input $e_i$ for a certain percentage of training samples, inputting zeros instead. We can then sample at each diffusion iteration with

$$\tilde{\epsilon}_\psi(z_t, t, e_i) = (1 - \gamma)\, \epsilon_\psi(z_t, t, 0) + \gamma\, \epsilon_\psi(z_t, t, e_i). \tag{9}$$

For $\gamma = 0$, this recovers the unconditional diffusion model, while for $\gamma = 1$ it recovers the standard task-conditional model.
For $\gamma > 1$, we instead obtain the classifier-free guidance effect, which we show results in the sampling of latent vectors $\hat{z}_i$ corresponding to higher-performing task-conditional network weights $h(\hat{z}_i, \theta) = \hat{W}_i$. We point to a more in-depth discussion on classifier-free guidance and its connection to score matching in Appendix A.1.
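The conditioning dropout and the guided noise combination from this section can be sketched in a few lines. The noise model below is a hypothetical stand-in with a simple closed form so that the effect of γ is easy to inspect; the drop probability and function names are likewise assumptions.

```python
import numpy as np

def drop_condition(e, drop_prob, rng):
    """Conditioning dropout: replace the task embedding with zeros at random."""
    return np.zeros_like(e) if rng.random() < drop_prob else e

def guided_noise(eps_model, z_t, t, e, gamma):
    """Classifier-free guidance: mix unconditional and conditional predictions."""
    eps_uncond = eps_model(z_t, t, np.zeros_like(e))  # zeros encode "no condition"
    eps_cond = eps_model(z_t, t, e)
    return (1.0 - gamma) * eps_uncond + gamma * eps_cond

# Hypothetical stand-in for a trained noise network eps_psi(z_t, t, e).
toy_eps_model = lambda z_t, t, e: z_t + e

z_t = np.array([0.5, -1.0])
e_i = np.array([2.0, 1.0])
eps_unconditional = guided_noise(toy_eps_model, z_t, 3, e_i, gamma=0.0)
eps_conditional = guided_noise(toy_eps_model, z_t, 3, e_i, gamma=1.0)
eps_guided = guided_noise(toy_eps_model, z_t, 3, e_i, gamma=1.5)

rng = np.random.default_rng(0)
e_train = drop_condition(e_i, drop_prob=0.1, rng=rng)  # used during training only
```

For γ above 1 the guided prediction extrapolates past the conditional one, away from the unconditional prediction, which is the guidance effect described above.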

6. EXPERIMENTAL SETUP AND RESULTS

In this section, we demonstrate the soundness of our two approaches with zero-shot image classification experiments against a series of traditional meta-learning baseline techniques. Throughout our experiments, we fix the choice of base network model to a CLIP-Adapter model (see Appendix A.2), only varying the meta-learning techniques employed to obtain adapted base model weights.

6.1. THE META-VQA DATASET

To evaluate the performance of our methods, we utilize a dataset that reflects the setting of meta-learning with multi-modal task descriptors. Existing meta-learning benchmarks such as MiniImagenet (Ravi & Larochelle, 2016) or CIFAR-FS (Bertinetto et al., 2018) are unsuitable, as they are built for the traditional few-shot learning setting, in which the task $\mathcal{T}_i$ is not associated with task descriptors but is meant to be inferred through exposure to the support set $\mathcal{D}^s_i$. We thus introduce our own Meta-VQA dataset, a modification of the VQA v2.0 dataset (Goyal et al., 2017) for Visual-Question-Answering. The dataset is composed of training and test tasks $\mathcal{T}_i$, each associated with a natural language question $t_i$ and a mini image classification dataset $(x^i_j, y^i_j) \in \mathcal{D}_i$. We refer to Appendix A.3 for a more in-depth discussion.

Figure 3: Example classification task from Meta-VQA, adapted from VQA v2 (Goyal et al., 2017). A single question $t_i$ is associated to multiple image-answer tuples $(x^i_j, y^i_j)$.
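The task structure described in this subsection (one question per task, paired with many image-answer tuples) can be illustrated with a small grouping routine. The record layout, field names, and example values below are hypothetical and chosen only for illustration; they do not reflect the actual file format of VQA v2.0 or the construction details given in Appendix A.3.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One task: a question t_i with its image-answer pairs."""
    question: str
    examples: list = field(default_factory=list)  # list of (image_id, answer)

def group_by_question(records):
    """Group (question, image_id, answer) triples into one task per question."""
    tasks = {}
    for question, image_id, answer in records:
        task = tasks.setdefault(question, Task(question))
        task.examples.append((image_id, answer))
    return list(tasks.values())

# Hypothetical records for illustration only.
records = [
    ("How is the weather?", "img_001", "sunny"),
    ("Who is wearing glasses?", "img_002", "man"),
    ("How is the weather?", "img_003", "rainy"),
]
tasks = group_by_question(records)
```

Each resulting task can then be split into support and query sets, with the question serving as the task descriptor.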

6.2. ZERO-SHOT TASK ADAPTATION WITH CLASSIFIER(-FREE) GUIDANCE

In Table 1 we show how our methods compare to a series of baselines when tested on the Meta-VQA dataset in the zero-shot setting. For each training task $\mathcal{T}_i$, the algorithms are given access to the full image/answer support and query sets $\mathcal{D}^s_i$, $\mathcal{D}^q_i$, together with the question (task descriptor) $t_i$. At test time, in the zero-shot setting, only the task descriptors $t_i$ for each test task $\mathcal{T}_i$ are given, and the algorithms are tasked with predicting the correct labels of images in the query set $\mathcal{D}^q_i$. To simulate a setting in which we possess a larger "unconditional" pre-training dataset, we also train our model while only keeping a fraction of task descriptors from the Meta-VQA dataset: tasks without language descriptors can still be used to learn the unconditional HNet/HVAE model. Classic zero-shot CLIP provides a 44.99% floor for performance on Meta-VQA, as our base model (CLIP-Adapter) consists of adapter layers over its output. We can also derive a ∼60.24% performance ceiling for zero-shot performance by considering the few-shot setting, in which models also have access to a data support set $\mathcal{D}^s_i$ for every test task (see Appendix A.6): it is apparent that our zero-shot techniques cannot surpass this ceiling while keeping the choice of base model fixed. We then benchmark several unconditional and conditional methods, with only conditional methods having access to language task descriptors. We apply MAML and its first-order variant FOMAML (Nichol et al., 2018) directly to the base network (MNet-MAML, MNet-FOMAML), and to both an unconditional hypernetwork (Uncond. HNet-MAML, as in Section 3) and a conditional one (Cond. HNet-MAML). We also benchmark against standard multitask learning (Uncond. Multitask, Cond. Multitask), and conditional multitask learning with the classic FiLM layer (Perez et al., 2017) (Cond. Multitask FiLM).
It is apparent that the multitask approach, at least in this setting, leads to better zero-shot models than MAML, which instead optimizes for few-shot performance. We refer to Appendix A.2 and A.4 for more details on each model. We then test HyperCLIP Guidance and HyperLDM when trained on top of either a hypernetwork or a VAE generator (see Appendix A.2 and A.5 for more detail). HyperCLIP Guidance allows for faster sampling than HyperLDM but is generally less performant; still, it performs on par with or slightly improves upon all other zero-shot baselines except for Cond. Multitask. The best performing model for the zero-shot setting is HVAE + HyperLDM, and specifically for classifier-free guidance with γ = 1.5. As illustrated in Figure 4.a, to further show the effectiveness of the classifier-free guidance technique, we sweep over several candidate γ parameters to find that the optimum occurs for γ > 1. As shown in Figure 4.b, when training our model while only keeping 50% or 10% of task descriptors, traditional Cond. Multitask learning is heavily impacted, while HyperLDM is almost unaffected due to its two-phased training regime based on an unconditional VAE. The gap between the multitask baseline and our HyperLDM technique is particularly striking in this setting.

7. RELATED WORK

Hypernetworks. By introducing multiplicative interactions within neural networks (Jayakumar et al., 2019), hypernetworks (Ha et al., 2016) have been shown to allow the modeling of diverse target network weights in, e.g., continual learning, even in the compressive regime (von Oswald et al., 2021a; 2020) without loss of performance. For a given supervised problem, hypernetworks have been used to model the complex Bayesian posterior of the weights in conjunction with variational inference (Henning et al., 2018; Krueger et al., 2018). Similar approaches have been used for continual learning (Henning et al., 2021). Another vein of work consists in using hypernetworks to distill ensembles of diverse networks (Wang et al., 2018; Ratzlaff & Fuxin, 2020; von Oswald et al., 2021a).

Meta-learning. In the context of meta-learning, hypernetworks have been successfully used in combination with popular gradient-based meta-learning methods (Finn et al., 2017; Zintgraf et al., 2019; Zhao et al., 2020; Flennerhag et al., 2020). More generally, related works have shown the usefulness of learning a low dimensional manifold in which to perform task-specific gradient-based adaptation at meta test time (Rusu et al., 2018; von Oswald et al., 2021b; Lee & Choi, 2018), instead of directly adapting in weight space. Recent works bypass the formal bi-level formulation of meta-learning by, e.g., using transformers to directly map the few-shot examples to the weights of the target network (Zhmoginov et al., 2022).

Generative modeling and classifier(-free) guidance. A plethora of techniques have been proposed for the generation of samples from high-dimensional domains such as images, including Generative Adversarial Networks (Goodfellow et al., 2014; Brock et al., 2018, GANs) and Variational Autoencoders (Kingma & Welling, 2014, VAEs). Denoising Diffusion Probabilistic Models (Sohl-Dickstein et al., 2015; Ho et al., 2020, DDPM) overcome common issues in generative modeling using a simple likelihood-based reconstruction loss for iterative denoising, and have been shown to achieve state-of-the-art results in high resolution image generation (Dhariwal & Nichol, 2021; Rombach et al., 2022). Several techniques have been proposed for effective conditional sampling in generative and diffusion models, such as classifier/CLIP guidance (Dhariwal & Nichol, 2021; Gal et al., 2021; Patashnik et al., 2021) and classifier-free guidance (Ho & Salimans, 2021; Crowson, 2022; Nichol et al., 2022). Diffusion models with classifier-free guidance have also been successfully applied in non-visual domains, such as audio generation (Kim et al., 2022) and robotic planning (Janner et al., 2022).

Zero-shot learning. There exists a large literature on zero-shot learning, including both established benchmarks and well known methods (Han et al., 2021; Su et al., 2022; Gupta et al., 2021). While these works consider zero-shot performance on unseen class labels within a single classification task, our setting considers zero-shot performance where the test tasks themselves are unseen, thus raising the zero-shot problem to the task level.

8. CONCLUSION

In this work we introduced a framework that re-interprets meta-learning as a multi-modal generative modeling problem. Our HyperCLIP guidance and HyperLDM methods leverage this insight to generate task-adapted neural network weights in a zero-shot manner given natural language instructions, and constitute the first application of the CLIP guidance and classifier-free guidance techniques from image generation to the meta-learning domain. Our experiments show that our methods successfully make use of external task descriptors to produce high-performance adapted networks in the zero-shot setting.

ETHICS STATEMENT

Our contribution, despite showcasing experimental results, remains within the domain of general methods and thus does not face many immediate ethics concerns common in applied machine learning. Regardless, potential fairness and ethics issues should still be proactively investigated. A potential issue is that some of the methods described in the present work rely on pre-trained large models such as CLIP encoders, which may inadvertently encode harmful stereotypes and biases, in a manner which may not be immediately apparent to researchers and users making use of our downstream method.

REPRODUCIBILITY STATEMENT

We point to the supplementary material file attached to the submission for the complete codebase and Meta-VQA dataset tools, which can be used to reproduce our results. This code will be released online in more polished form after the requirement for anonymity is dropped. The code repository contains a README.md file with instructions for installing the necessary software environment. For a complete discussion of the hyperparameters used during our experimental runs, we refer to Appendix A.5. As for the Meta-VQA dataset, one must first download the original VQA v2 dataset (https://visualqa.org/download.html) and place it in the data/VQA/ folder, and then run:
• scripts/precompute_image_features.py
• scripts/precompute_ques_features.py
• scripts/precompute_text_features.py
to re-generate the pre-computed CLIP embeddings.

A APPENDIX

A.1 CLASSIFIER-FREE GUIDANCE

We hereby provide a rationale for the use of classifier guidance and classifier-free guidance during diffusion model sampling. As per the "score matching" interpretation of diffusion models, we assume that our trained noise network approximates the score function of the true conditional latent distribution $p(z \mid e_i)$ as

$$\epsilon_\psi(z_t, t, e_i) \approx -\sigma_t \nabla_{z_t} \log p(z_t \mid e_i).$$

For classifier guidance, we can perturb our diffusion sampling by adding the gradient of the log likelihood of our CLIP encoder $p_\psi(e_i \mid z_t)$ to the diffusion score as follows:

$$\tilde{\epsilon}_\psi(z_t, t, e_i) = \epsilon_\psi(z_t, t, e_i) - \eta \sigma_t \nabla_{z_t} \log p_\psi(e_i \mid z_t) \approx -\sigma_t \nabla_{z_t} \left[ \log p(z_t \mid e_i) + \eta \log p_\psi(e_i \mid z_t) \right].$$

We can rewrite this as classifier guidance on the unconditional score $\nabla_{z_t} \log p(z_t)$,

$$-\sigma_t \nabla_{z_t} \left[ \log p(z_t) + \gamma \log p(e_i \mid z_t) \right], \tag{11}$$

with $\gamma = 1 + \eta$. This follows from Bayes' rule, as $\log p(z_t \mid e_i) = \log p(e_i \mid z_t) + \log p(z_t) - \log p(e_i)$, and thus $\nabla_{z_t} \log p(z_t \mid e_i) = \nabla_{z_t} \log p(e_i \mid z_t) + \nabla_{z_t} \log p(z_t)$.

For classifier-free guidance, we aim to perform the above sampling without access to a classifier, as long as we possess a conditional diffusion model $\epsilon_\psi(z_t, t, e_i)$ that doubles as an unconditional model $\epsilon_\psi(z_t, t, 0)$, as illustrated in Section 5.2. Using Bayes' rule again, we can see that $\nabla_{z_t} \log p(e_i \mid z_t) = \nabla_{z_t} \log p(z_t \mid e_i) - \nabla_{z_t} \log p(z_t)$. If we substitute this into Eq. 11 we obtain

$$-\sigma_t \nabla_{z_t} \left[ \log p(z_t) + \gamma \left( \log p(z_t \mid e_i) - \log p(z_t) \right) \right] = -\sigma_t \nabla_{z_t} \left[ (1 - \gamma) \log p(z_t) + \gamma \log p(z_t \mid e_i) \right],$$

which can be implemented with our conditional network as

$$\tilde{\epsilon}_\psi(z_t, t, e_i) = (1 - \gamma)\, \epsilon_\psi(z_t, t, 0) + \gamma\, \epsilon_\psi(z_t, t, e_i).$$

A.2 NETWORK ARCHITECTURES

Base Network ($f$): Our choice for a base model is a CLIP-Adapter (Gao et al., 2021), which consists of a frozen CLIP image encoder with added learned fully-connected layers refining the output embedding. Specifically, we use the ViT-L/14@336px CLIP encoder type with embedding size of 768.
The advantages of this model choice lie in its combination of high base performance (due to the pre-trained knowledge contained in the CLIP component) and relatively small parameter count, enabling agile medium-small scale experiments. This base CLIP-Adapter network serves purely as a base model and is not to be confused with HyperCLIP, which is employed at the meta-level. In Section 6.2, when benchmarking the base model alone in the zero-shot setting, we drop the Adapter and use pre-trained zero-shot CLIP (Radford et al., 2021).

Hypernetwork ($h$): For the hypernetworks used in our baseline as well as the generative model, we use an MLP with one hidden layer of 256 units, followed by a rectified linear activation. For the unconditioned hypernetwork, the input embedding to the hypernetwork is a vector of dimension 64, while for the conditioned counterpart, the task embedding is used. In order to ensure that the generated weights are properly normalized at initialization, we use the Kaiming initialization (He et al., 2015) for the hypernetwork weights, initialize the embedding as a sample from a multivariate standard Gaussian distribution (for unconditioned models), and use the NTK parametrization (Jacot et al., 2020) for the target network.

Variational Autoencoder: For the variational autoencoder used as our unconditioned generative model, we use an MLP with 2 hidden layers of size 512 and 256, each followed by the rectified linear non-linearity. We chose 32 as the latent code dimension. We use the same architecture for the decoder, except that the dimensions of the 2 hidden layers are swapped. We use the Kaiming initialization (He et al., 2015) to initialize the weights of both the encoder and decoder.

HyperCLIP: We parametrize our HyperCLIP model as a fully-connected MLP with a single hidden layer of dimension 256, taking as input the flattened weights of the base network and outputting the corresponding CLIP encoding.
We chose the hyperbolic tangent as the activation function in the hidden layer.

HyperLDM: While the original LDM makes use of a time-conditional UNet (Ronneberger et al., 2015) to parametrize the noise network, we are unable to make use of spatial information and convolutions due to the non-spatial nature of our latent space. We therefore parametrize our HyperLDM as a fully-connected network with residual connections and squeeze-and-excitation layers (Hu et al., 2018). The time index $t$ is embedded into a vector with a 150-dimensional sinusoidal positional embedding, and is concatenated together with the task-conditional embedding $e_i$ at the input layer and at intermediate activations. Hidden layer dimensions are 8192, 16384, 8192.
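To make the time conditioning of HyperLDM concrete, the following is a minimal sketch of a 150-dimensional sinusoidal embedding of the time index. The text only states the dimension and that the embedding is sinusoidal; the frequency base of 10000, the sines-then-cosines layout, and the function name `sinusoidal_embedding` are assumptions borrowed from the common Transformer-style convention.

```python
import math

def sinusoidal_embedding(t, dim=150, max_period=10000.0):
    """Embed a scalar diffusion timestep t into a vector of length `dim`.

    Assumed layout (not confirmed by the paper): the first dim/2 entries
    are sines and the last dim/2 are cosines, with geometrically spaced
    frequencies between 1 and 1/max_period.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * k / half) for k in range(half)]
    sines = [math.sin(t * f) for f in freqs]
    cosines = [math.cos(t * f) for f in freqs]
    return sines + cosines

# Embedding of timestep 10; in the described architecture this vector would
# be concatenated with the task embedding e_i before entering the network.
emb = sinusoidal_embedding(t=10)
```

Under these assumptions every entry lies in [-1, 1], so the time embedding has a scale comparable to a normalized task embedding when the two are concatenated.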

A.3 THE META-VQA DATASET

The original VQA problem is about choosing a suitable natural language answer $a_k$ when prompted with both a natural language question $q_k$ and an image $I_k$. Our observation is that the VQA problem can easily be reformulated as a meta-learning image classification problem with natural language task descriptions: given question-image-answer triples $(q_k, I_k, a_k) \in D$, we can group the data by unique questions $q_i$ (which serve as task descriptors $t_i$), each of which can then be associated with supervised image classification tuples $(I^i_j, a^i_j) \in D_i$. To make sure the designed tasks are meaningful, we filter out question-answer pairs whose question is posed as a choice (e.g., "A or B?") or whose answer is yes/no. From the remaining questions we keep the ones which appear at least 20 times throughout the dataset, such that each task contains enough samples. In the end our Meta-VQA dataset is composed of 1234 unique tasks (questions), split into 870 training tasks and 373 test tasks, for a total of 104112 image-answer pairs. There are on average 9.13 answer choices per question/task. The average size of the support set is 57.85 examples, while the average size of the query set is 25.9 examples.
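The grouping and filtering described above can be sketched as follows. This is a simplified illustration, not the released dataset tooling: the function name `build_tasks`, the tuple layout, and the string heuristics used to detect choice questions and yes/no answers are all hypothetical stand-ins for the actual filters.

```python
from collections import defaultdict

def build_tasks(triples, min_count=20):
    """Group (question, image_id, answer) triples into per-question tasks."""
    tasks = defaultdict(list)
    for question, image_id, answer in triples:
        q = question.strip().lower()
        # Hypothetical filters: drop "A or B?" questions and yes/no answers.
        if " or " in q or answer.strip().lower() in {"yes", "no"}:
            continue
        tasks[q].append((image_id, answer))
    # Keep only questions that occur often enough to form a task.
    return {q: pairs for q, pairs in tasks.items() if len(pairs) >= min_count}

# Toy data: only the first question survives filtering with min_count=20.
triples = (
    [("What color is the car?", f"img{i}", "red") for i in range(25)]
    + [("Is it raining?", f"img{i}", "yes") for i in range(30)]
    + [("Is this a cat or a dog?", f"img{i}", "cat") for i in range(30)]
    + [("What sport is this?", f"img{i}", "tennis") for i in range(5)]
)
tasks = build_tasks(triples)
```

Each surviving key plays the role of a task descriptor $t_i$, and its list of image-answer pairs plays the role of $D_i$.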

A.4 BASELINE METHODS

We give an overview of the baseline methods we benchmark in Table 2, together with algorithm tables detailing each baseline method.

Training: The number of epochs each model is trained for, the learning rate of the optimization, as well as the learning rate and number of steps of the adaptation algorithm used for each method can be found in Table 3. For all methods using an adaptation $A_{T_i}$, the dataset from the task is randomly split into a support set and a query set during training, every time a task is sampled. The support set is then used to perform the adaptation (see Section 2.1), while the query set is used to compute the loss on which the meta-parameters are updated. When no adaptation is used, all the data is used for this update. Unconditional methods do not have access to the task embedding $e_i$, while conditioned methods do. When the percentage of available task descriptors is reduced, conditioned methods are trained only on the tasks whose descriptor is available.

Evaluation: Evaluation is performed on the fixed, predefined query set of the held-out test tasks of the Meta-VQA dataset. Zero-shot performance is evaluated before applying the adaptation procedure $A_{T_i}$. For the few-shot performance, all adaptation is performed on the support set of the test tasks. For MAML baselines, we keep the same adaptation-time learning rate as during training, while we always adapt for 50 steps. For each multitask baseline, we use the same adaptation scheme (steps, learning rate, adapted parameters) as its MAML counterpart.

Guidance: We use 10 steps of guidance with $\lambda = 0.01$ and learning rate 0.1, both when performed on HNet and on HVAE.

Evaluation: Evaluation is performed on the fixed, predefined query set of the held-out test tasks of the Meta-VQA dataset. Zero-shot performance is evaluated on the output of the generative hypernetwork $h$ after applying latent space guidance. For the few-shot performance, all adaptation is performed on the support set of the test tasks, on the latent space initialized at the output of the guidance procedure. Similarly to our baselines, we use 50-step gradient descent adaptation with learning rate 0.1.

Algorithm 9 HNet + HyperCLIP Training

Learn an unconditional hypernetwork $h(z_0, \theta)$ with the Uncond. HNet-MAML procedure from Algorithm 6.
Learn the HyperCLIP network $\mathrm{CLIP}_H(W)$ using the HyperCLIP training procedure from Algorithm 1. For sampling fine-tuned $W_i$, fine-tune the base network on training tasks.

Algorithm 10 HVAE + HyperCLIP Training

Learn an unconditional hypernetwork $h(z, \theta)$, as the decoder of a HVAE (Algorithm 8).
Learn the HyperCLIP network $\mathrm{CLIP}_H(W)$ using the HyperCLIP training procedure from Algorithm 1. For sampling fine-tuned $W_i$, fine-tune the base network on training tasks.

Algorithm 11 HyperCLIP Guidance (Inference time)

Define a learned unconditional hypernetwork $h(z, \theta)$, as either a HNet $h(z_0, \theta)$ (Algorithm 6) or the decoder of a HVAE (Algorithm 8).
Define a learned HyperCLIP network $\mathrm{CLIP}_H(W)$.
Define an unseen task $T_i$ with natural language task descriptor $t_i$.
Randomly sample $z \sim N(0, I)$ if using the decoder of a HVAE, or set $z = z_0$, where $z_0$ is the meta-learned embedding initialization of the HNet.
Optimize $z$ with gradient descent over $L_{\text{guidance}}(z)$ (Eq. 4), obtaining guided $z_i$.
Obtain guided base weights $W_i = h(z_i, \theta)$.
Use the adapted base network $f$ with weights $W_i$ to classify examples from the unseen task $T_i$.
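The inference-time loop of Algorithm 11 can be illustrated with a toy sketch. Everything model-related here is a stand-in: `hypernet` and `hyperclip` are tiny hand-written linear maps rather than trained networks, gradients are taken by finite differences instead of backpropagation, and the loss is an assumed form (negative cosine similarity to the task embedding plus a $\lambda$-weighted squared distance to the initial latent), since Eq. 4 is not reproduced in this appendix. Only the step count, learning rate and $\lambda$ follow the values quoted in the text.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def hypernet(z):
    """Toy stand-in for h(z, theta): 2-d latent -> 3 base 'weights'."""
    return [z[0], z[1], 0.5 * (z[0] + z[1])]

def hyperclip(w):
    """Toy stand-in for CLIP_H(W): base weights -> 2-d 'CLIP' embedding."""
    return [w[0] + w[2], w[1] - w[2]]

def guidance_loss(z, z_init, e_i, lam):
    # Assumed loss form: -cosine similarity + lam * ||z - z_init||^2.
    sim = cosine(hyperclip(hypernet(z)), e_i)
    reg = sum((a - b) ** 2 for a, b in zip(z, z_init))
    return -sim + lam * reg

def numerical_grad(f, z, eps=1e-5):
    """Central finite differences; a real implementation would use autodiff."""
    grad = []
    for k in range(len(z)):
        zp, zm = list(z), list(z)
        zp[k] += eps
        zm[k] -= eps
        grad.append((f(zp) - f(zm)) / (2 * eps))
    return grad

def guide(z_init, e_i, steps=10, lr=0.1, lam=0.01):
    """Plain gradient descent on the latent z, recording the loss per step."""
    z = list(z_init)
    f = lambda v: guidance_loss(v, z_init, e_i, lam)
    losses = [f(z)]
    for _ in range(steps):
        g = numerical_grad(f, z)
        z = [zk - lr * gk for zk, gk in zip(z, g)]
        losses.append(f(z))
    return z, losses

z_init = [1.0, 0.0]   # plays the role of z_0 (HNet) or a prior sample (HVAE)
e_i = [0.0, 1.0]      # plays the role of the task's text embedding
z_guided, losses = guide(z_init, e_i)
w_guided = hypernet(z_guided)  # guided base weights W_i = h(z_i, theta)
```

With only 10 small steps the latent moves a short distance from its initialization, which matches the role of guidance as a light zero-shot adjustment rather than a full optimization.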

A.5.3 HYPERLDM

Training: Similarly to HyperCLIP, to train HyperLDM we need samples of fine-tuned network weights $W_i$, for which we use adaptations from Uncond. HNet-MAML, using 50-step adaptation $A_{T_i}$ with learning rate 0.1, on a support set stochastically sampled at every adaptation phase. We parametrize the diffusion process with a linear noise schedule, with $\beta$ starting at 0.0001 and ending at 0.06, and 350 diffusion timesteps. We train the HyperLDM for 1000 epochs with the Adam optimizer, a learning rate of 0.00025 and batch size 128, for all our experiments.

Evaluation: Evaluation is performed as for HyperCLIP guidance, except that adaptation is performed natively through sampling from the learned reverse diffusion process, with parameters derived from the chosen $\beta$ schedule. The guidance parameter $\gamma > 0$ can be tuned during inference to accentuate the effect of classifier-free guidance.

Algorithm 12 HNet + HyperLDM Training

Learn an unconditional hypernetwork $h(z_0, \theta)$ with the Uncond. HNet-MAML procedure from Algorithm 6.
Learn the HyperLDM network $\epsilon_\psi(z_t, t, e_i)$ using the HyperLDM training procedure, optimizing reconstruction of $z^0_i$ with the loss from Eq. 8. For sampling fine-tuned $z_i$, fine-tune the base network on training tasks, then encode the weights using the HNet.
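As a rough illustration of the HyperLDM settings in A.5.3, the sketch below builds the linear $\beta$ schedule with the quoted endpoints and number of timesteps, and combines conditional and unconditional noise predictions with the classifier-free guidance rule from Appendix A.1. The function names are hypothetical, the cumulative product $\bar{\alpha}_t$ is the standard DDPM quantity rather than something spelled out in this appendix, and the noise predictions are plain lists standing in for network outputs.

```python
def linear_beta_schedule(beta_start=1e-4, beta_end=0.06, num_steps=350):
    """Linearly spaced noise variances, endpoints included (assumption)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + k * step for k in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s), as in standard DDPM."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def cfg_noise(eps_uncond, eps_cond, gamma):
    """(1 - gamma) * eps(z_t, t, 0) + gamma * eps(z_t, t, e_i)."""
    return [(1.0 - gamma) * u + gamma * c
            for u, c in zip(eps_uncond, eps_cond)]

betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)

# gamma = 0 recovers the unconditional prediction, gamma = 1 the conditional
# one, and gamma > 1 extrapolates past the conditional prediction.
eps_guided = cfg_noise([1.0, 2.0], [3.0, 4.0], gamma=1.5)
```

With these schedule values $\bar{\alpha}_t$ decays close to zero by the final timestep, so the last latent is almost pure Gaussian noise, which is consistent with starting inference from $z \sim N(0, I)$.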

Algorithm 13 HVAE + HyperLDM Training

Learn an unconditional hypernetwork $h(z, \theta)$, as the decoder of a HVAE (Algorithm 8).
Learn the HyperLDM network $\epsilon_\psi(z_t, t, e_i)$ using the HyperLDM training procedure, optimizing reconstruction of $z^0_i$ with the loss from Eq. 8. For sampling fine-tuned $z_i$, fine-tune the base network on training tasks, then encode the weights using the HVAE.

Algorithm 14 HyperLDM Inference

Define a learned unconditional hypernetwork $h(z, \theta)$, as either a HNet $h(z_0, \theta)$ (Algorithm 6) or the decoder of a HVAE (Algorithm 8).
Define a learned HyperLDM network $\epsilon_\psi(z_t, t, e_i)$.
Define an unseen task $T_i$ with natural language task descriptor $t_i$, with CLIP embedding $e_i$.
Randomly sample $z \sim N(0, I)$.
Iteratively modify $z$ with diffusion sampling using the learned $\epsilon_\psi$ network, obtaining guided $z_i$.
Obtain guided base weights $W_i = h(z_i, \theta)$.
Use the adapted base network $f$ with weights $W_i$ to classify examples from the unseen task $T_i$.

A.6 FEW-SHOT LEARNING

For completeness, we include in Table 4 the results for few-shot learning on the test split of Meta-VQA. Our technique, unlike classic MAML, does not optimize specifically for the few-shot learning setting. Instead, the few-shot learning results are meant to contextualize performance gains in the zero-shot setting: zero-shot performance gains should be interpreted as relative to the few-shot performance ceiling of 60.24, the maximum attained with our fixed choice of base model.



HyperCLIP Training: sample a batch of tasks $T_{i=1,\dots,N}$ with loss functions $L_{T_i}$, training data $D^{\text{train}}_i$ and text

Figure 4: a) Performance of HyperLDM over different classifier-free guidance parameters $\gamma$. For $\gamma = 0$ we sample from an unconditional latent diffusion model. For $\gamma = 1$ we sample with classic conditioning. For $\gamma > 1$, we are in the classifier-free guidance regime. b) Performance of HyperLDM against baselines in the setting where only a fraction of natural language task labels are given.

Unconditional Multitask Training

Define the base network $f$ with parameters $W$.
for epoch = 1, ..., N do
    Sample a training batch of image-answer pairs $(x_k, y_k)$ from a mix of random training tasks $T_i$.
    Update $W$ with gradient descent computed with respect to the classification loss over the sampled batch.
end for

Algorithm 3 Unconditional MNet-MAML Training

Define the base network $f$ with parameters $W$.
for meta-epoch = 1, ..., N do
    Sample a training task $T_i$ and data $D_i$.
    Randomly split $D_i$ into support set $D^s_i$ and query set $D^q_i$.
    Run inner-loop adaptation $A_{T_i}$ using the support set $D^s_i$, fine-tuning $W$ into task-adapted $W_i = A_{T_i}(W)$.
    Use MAML gradient update to adapt $W$ given the inner-loop adaptation.
end for

Algorithm 4 Unconditional HNet-MAML Training

Define the base network $f$ with parameters $W$.
Define a hypernetwork $h$ with meta-parameters $\theta$, mapping a latent vector $z_0$ to base network weights $W$.
for meta-epoch = 1, ..., N do
    Sample a training task $T_i$ and data $D_i$.
    Randomly split $D_i$ into support set $D^s_i$ and query set $D^q_i$.
    Run inner-loop adaptation $A_{T_i}$ using the support set $D^s_i$, fine-tuning $z_0$ into task-adapted $z_i = A_{T_i}(z_0)$.
    Use MAML gradient update to adapt $z_0$ and $\theta$ given the inner-loop adaptation.
end for

Algorithm 8 HVAE Training, Variant 2

Define the base network $f$ with parameters $W$.
Define an encoder $z = d(W, \omega)$ with parameters $\omega$ and a hypernetwork decoder $W = h(z, \theta)$ with parameters $\theta$.
Obtain a previously learned base network initialization $W_0$ according to Uncond. MNet-MAML (Algorithm 6).
for epoch = 1, ..., N do
    Create an empty batch $B = \{\}$.
    for b = 1, ..., M do
        Sample a training task $T_i$ and data $D_i$.
        Randomly split $D_i$ into support set $D^s_i$ and query set $D^q_i$.
        Run inner-loop adaptation $A_{T_i}$ using the support set $D^s_i$, fine-tuning $W_i = A_{T_i}(W_0)$.
        Add the fine-tuned weights to the batch: $B = B \cup \{W_i\}$.
    end for
    Train the HVAE encoder and decoder using the VAE loss to reconstruct the weight batch $B$.
end for

A.5.2 HYPERCLIP

Training: In order to train the HyperCLIP model, we need samples of fine-tuned network weights $W_i$. Similarly to HVAE, we used adaptations from Uncond. HNet-MAML, using 50-step adaptation $A_{T_i}$ with learning rate 0.1, on a support set stochastically sampled at every adaptation phase, as this allows us to use the same HyperCLIP model for guidance on both HNet and HVAE. We trained our HyperCLIP model for 600 epochs, with the Adam (Kingma & Ba, 2017) optimizer, 0.0003 learning rate and batch size 64, for all our experiments.
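The HyperCLIP objective itself is not spelled out in this appendix (Algorithm 1 is not reproduced here). Assuming it follows the symmetric contrastive objective of CLIP (Radford et al., 2021), with encoded weights taking the place of encoded images, one training-batch loss could be sketched as below. The function names, the temperature value, and the use of raw lists for embeddings are all illustrative assumptions.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def clip_style_loss(weight_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss between weight and text embeddings.

    weight_embs[k] stands for CLIP_H(W_k) and text_embs[k] for the text
    embedding e_k of the same task (assumed pairing).
    """
    w = [normalize(v) for v in weight_embs]
    e = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(wi, ej)) / temperature for ej in e]
              for wi in w]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Perfectly aligned pairs give a loss near zero; mismatched pairs do not.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]
loss_aligned = clip_style_loss(aligned, aligned)
loss_shuffled = clip_style_loss(aligned, shuffled)
```

Under this assumed objective, a trained HyperCLIP encoder maps fine-tuned weights close to the text embedding of their own task, which is the property the guidance procedure relies on.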

Zero-Shot learning accuracy averaged over Meta-VQA test tasks, with % of available question descriptors. Results should be interpreted as relative to a performance ceiling of ∼60.24 obtainable with few-shot learning on our fixed choice of base model (see Appendix A.6). (* ours)

Few-Shot learning accuracy averaged over Meta-VQA test tasks. (* ours)


    Randomly split $D_i$ into support set $D^s_i$ and query set $D^q_i$.
    Run inner-loop adaptation $A_{T_i}$ using the support set $D^s_i$, fine-tuning $e_i$ into task-adapted $\tilde{e}_i = A_{T_i}(e_i)$.
    Use MAML gradient update to adapt $\theta$ given the inner-loop adaptation.
end for

Table 2: Overview of the different methods trained on Meta-VQA. The parameters are optimized via the task loss evaluated on the output of the function, averaged over minibatches of tasks. The adaptation $A_{T_i}$ implements a few-step gradient descent algorithm applied to the argument parameter, w.r.t. the task loss evaluated on the support set.

Table 3: Hyperparameters used for the baseline methods. All methods are trained with the Adam (Kingma & Ba, 2017) optimizer, with a meta-batch size of 32 tasks. We use gradient norm clipping for all optimization, with the maximum norm set to 10. Note that when the adaptation algorithm has a range of possible steps, the number of steps is sampled uniformly from the range for every adaptation.

To enable our guidance methods, we need to first train a generative hypernetwork $h$ as in Section 3, either in the form of an Unconditional Hypernetwork, or of a Hypernetwork VAE:


• For HNet + HyperCLIP guidance and HNet + HyperLDM, we meta-learned an unconditioned hypernetwork with the exact same hyperparameters as the baseline Uncond. HNet-MAML, and used it as the generative hypernetwork.
• For HVAE + HyperCLIP guidance and HVAE + HyperLDM, we trained an unconditioned VAE on samples of fine-tuned network weights $W_i$ using the architecture specified in A.2. In order to be able to quickly sample new adapted weights, and to reduce the complexity of the manifold from which such weights are sampled, we use adaptations from our unconditional MAML baselines as $W_i$. Specifically, Variant 1 (Algorithm 7) involves adaptations from Uncond. HNet-MAML, using 50-step adaptation $A_{T_i}$ with learning rate 0.1, on a support set stochastically sampled for every adaptation phase. Variant 2 (Algorithm 8) involves adaptations over the base network (initialized from a learned Uncond. MNet-MAML initialization), using 50-step adaptation $A_{T_i}$ with learning rate 0.01, on a support set stochastically sampled for every adaptation phase. We trained the VAE for 2000 epochs, where each epoch is a single pass through all the tasks, with the Adam (Kingma & Ba, 2017) optimizer, 0.0001 learning rate and batch size 32. We used gradient norm clipping independently for both the encoder and decoder, with the maximum norm capped at 1000. For the final results presented in our main results table (Table 1), we use Variant 2.

        Randomly split $D_i$ into support set $D^s_i$ and query set $D^q_i$.
        Run inner-loop adaptation $A_{T_i}$ using the support set $D^s_i$, fine-tuning $z_0$ from the HNet $h$ into task-adapted $z_i = A_{T_i}(z_0)$.
        Given fine-tuned $z_i$, produce task-adapted weights $W_i = h(z_i, \theta)$.
        Add the fine-tuned weights to the batch: $B = B \cup \{W_i\}$.
    end for
    Train the HVAE encoder and decoder using the VAE loss to reconstruct the weight batch $B$.
end for

