META-LEARNING VIA CLASSIFIER(-FREE) GUIDANCE

Abstract

State-of-the-art meta-learning techniques do not optimize for zero-shot adaptation to unseen tasks, a setting in which humans excel. Instead, meta-learning algorithms learn hyperparameters and weight initializations that explicitly optimize for few-shot learning performance. In this work, we take inspiration from recent advances in generative modeling and language-conditioned image synthesis to propose meta-learning techniques that use natural language guidance to achieve higher zero-shot performance than the state-of-the-art. We do so by recasting the meta-learning problem as a multi-modal generative modeling problem: given a task, we consider its adapted neural network weights and its natural language description as equivalent multi-modal task representations. We first train an unconditional generative hypernetwork model to produce neural network weights; then we train a second "guidance" model that, given a natural language task description, traverses the hypernetwork latent space to find high-performance task-adapted weights in a zero-shot manner. We explore two alternative approaches for latent space guidance: "HyperCLIP"-based classifier guidance and a conditional Hypernetwork Latent Diffusion Model ("HyperLDM"), which we show to benefit from the classifier-free guidance technique common in image generation. Finally, we demonstrate that our approaches outperform existing meta-learning methods in zero-shot learning experiments on our Meta-VQA dataset, which we specifically constructed to reflect the multi-modal meta-learning setting.

1. INTRODUCTION

State-of-the-art machine learning algorithms often lack the ability to generalize to new, unseen tasks in a sample-efficient manner. In contrast, humans show remarkable capabilities in leveraging previous knowledge to learn a new task from just a few examples. Often, not even a single example is needed, as all relevant task information can be conveyed in the form of natural language instructions. Indeed, humans can solve novel tasks when prompted through a variety of interaction modalities, such as visual task observations or natural language prompts. In this work, we aim to improve the forward generalization of deep networks by developing new models capable of similar multi-modal task inference. In particular, we present two approaches that utilize text-based task descriptors for rapid zero-shot adaptation to new tasks.

The development of deep learning models that quickly adapt and generalize to unseen tasks is the focus of the field of meta-learning. A recent trend stemming from the success of Model-Agnostic Meta-Learning (Finn et al., 2017, MAML) describes meta-learning as a bi-level optimization problem: an outer-loop meta-model is trained with the goal of improving the few-shot performance of a base model when fine-tuned on a variety of related tasks. MAML was specifically introduced as a gradient-based method to find an unconditional network initialization with good few-shot performance over an entire task universe. Recent progress in large-scale deep learning is, however, challenging this explicit meta-learning framework. Large models trained on huge, rich, and diverse datasets have been shown to possess surprisingly good few-shot learning capabilities (Brown et al., 2020). In particular, large-scale pre-training and fine-tuning often outperforms explicit meta-learning procedures (Mandi et al., 2022).
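To make the bi-level structure of MAML-style meta-learning concrete, the following is a minimal sketch of one outer-loop update on a toy linear base model. The function names (`maml_outer_step`, `task_loss`) and the single-weight-tensor parameterization are illustrative simplifications, not the formulation used in this paper:

```python
import torch

# Toy base model: linear regression parameterized by a single weight tensor.
def forward(w, x):
    return x @ w

def task_loss(w, x, y):
    return ((forward(w, x) - y) ** 2).mean()

def maml_outer_step(w, tasks, inner_lr=0.1, outer_lr=0.01, inner_steps=1):
    """One outer-loop update: adapt to each task with a few inner gradient
    steps, then move the shared initialization w to reduce the tasks'
    post-adaptation (query-set) loss."""
    w = w.detach().requires_grad_(True)
    meta_grad = torch.zeros_like(w)
    for x_support, y_support, x_query, y_query in tasks:
        w_task = w
        for _ in range(inner_steps):
            # create_graph=True keeps the inner update differentiable,
            # so the outer gradient can flow through the adaptation step.
            g = torch.autograd.grad(
                task_loss(w_task, x_support, y_support), w_task, create_graph=True
            )[0]
            w_task = w_task - inner_lr * g
        # Gradient of the post-adaptation query loss w.r.t. the initialization.
        meta_grad += torch.autograd.grad(task_loss(w_task, x_query, y_query), w)[0]
    return (w - outer_lr * meta_grad / len(tasks)).detach()
```

Iterating `maml_outer_step` over a distribution of tasks yields an initialization that is a few gradient steps away from good performance on each of them, which is exactly the objective the explicit meta-learning framework optimizes.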
These developments prompted us to develop alternative methods for explicit meta-learning that natively benefit from rich, multi-modal data. Inspired by recent advances in conditional image generation (Ramesh et al., 2022; Rombach et al., 2022), we reframe meta-learning as a multi-modal generative modeling problem such that, given a task, its adapted neural network weights and its natural language description are considered equivalent multi-modal task representations. We show that popular techniques from the image domain, such as CLIP-based guidance (Gal et al., 2021; Patashnik et al., 2021), denoising diffusion models (Ho et al., 2020), and classifier-free guidance (Dhariwal & Nichol, 2021; Ho & Salimans, 2021; Nichol et al., 2022), can be repurposed for the meta-learning setting to generate adapted neural network weights instead of images.

Specifically, we approach the generation of neural network weights in two separate phases. In the unconditional pre-training phase, we train a generative hypernetwork (Ha et al., 2016) to map from its latent space to the weight space of a base model (Figure 1.A). In the guidance phase, we learn language-conditioned models that can be used to traverse the hypernetwork latent space and find zero-shot adapted weights with high performance on our task (Figure 1.B and 1.C).

Figure 1: Schematic of the three main components of our proposed meta-learning approach. A. An unconditional variational autoencoder (VAE) models the latent space of adapted network weights W. Its generator hypernetwork h (highlighted in gray) can be re-used in the conditional setting with our guidance techniques. B. Our HyperCLIP encoder CLIP_H is contrastively trained to map network weights W to the space of CLIP embeddings e_i. Then, given a new task with descriptor t_i, we can use CLIP guidance to find a VAE latent vector z_i whose embedding matches e_i. C. Alternatively, our Hypernetwork Latent Diffusion Model (HyperLDM) learns, conditional on the task embedding e_i, to iteratively denoise a VAE latent vector z_i^T, ..., z_i^0 over T iterations.

1.1. OUR CONTRIBUTIONS

We summarise our contributions as follows:

• We recast meta-learning as a multi-modal generative modeling problem using hypernetworks, providing a theoretical framework for the application of recent research on image synthesis to the field of meta-learning.

• We propose HyperCLIP, a contrastive learning method analogous to Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) that produces CLIP embeddings of fine-tuned neural network weights. We can then use HyperCLIP to: 1) perform CLIP guidance in the latent space of a hypernetwork model (Figure 1.B); and 2) perform task inference on fine-tuned models without given task descriptions.

• We propose Hypernetwork Latent Diffusion Models (HyperLDM) as an alternative to HyperCLIP guidance for task-conditional network generation within the latent space of a hypernetwork model (Figure 1.C). We show that combining this approach with classifier-free guidance (Ho & Salimans, 2021) improves the performance of the generated networks.

• We demonstrate the soundness of our methods on Meta-VQA, our modification of the Visual-Question-Answering VQA v2.0 dataset (Goyal et al., 2017), which interprets questions as task descriptions. We show that our guidance methods outperform traditional multi-task and meta-learning techniques for zero-shot learning on this dataset.
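The CLIP-guidance idea described above, traversing the hypernetwork latent space by gradient ascent on the similarity between the HyperCLIP embedding of the generated weights and the task's text embedding, can be sketched as follows. The hypernetwork and HyperCLIP encoder here are random linear stand-ins for the pre-trained components, so the code only illustrates the guidance loop, not the paper's actual models:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
latent_dim, weight_dim, embed_dim = 8, 32, 16

# Stand-ins for the pre-trained components (random, for illustration only):
hypernetwork = torch.nn.Linear(latent_dim, weight_dim)  # VAE generator h: z -> weights W
hyperclip = torch.nn.Linear(weight_dim, embed_dim)      # HyperCLIP encoder: W -> embedding

def clip_guidance(text_embedding, steps=100, lr=0.1):
    """Traverse the hypernetwork latent space by gradient ascent on the
    cosine similarity between CLIP_H(h(z)) and the task's text embedding."""
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        weights = hypernetwork(z)                 # z -> candidate network weights
        sim = F.cosine_similarity(hyperclip(weights), text_embedding, dim=0)
        loss = -sim                               # maximize similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        final_sim = F.cosine_similarity(
            hyperclip(hypernetwork(z)), text_embedding, dim=0
        ).item()
    return z.detach(), final_sim
```

The returned latent vector is then decoded by the hypernetwork into zero-shot task-adapted weights; no labeled examples from the new task are touched at any point.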


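For the HyperLDM variant, classifier-free guidance amounts to running the latent diffusion sampler with a noise estimate that extrapolates from the unconditional prediction toward the task-conditional one. The sketch below uses one common convention for the guidance weight, eps = (1 + w) * eps(z_t, c) - w * eps(z_t, None), inside a standard DDPM ancestral sampling loop; the denoiser interface and schedule are illustrative assumptions, not this paper's exact implementation:

```python
import torch

def guided_eps(denoiser, z_t, t, cond, w):
    # Classifier-free guidance: combine the conditional and unconditional
    # noise predictions; w = 0 recovers plain conditional sampling.
    eps_cond = denoiser(z_t, t, cond)
    eps_uncond = denoiser(z_t, t, None)
    return (1 + w) * eps_cond - w * eps_uncond

def sample(denoiser, cond, dim, T, betas, w=2.0):
    """Ancestral DDPM sampling in the hypernetwork latent space,
    producing a latent z_0 that the hypernetwork decodes into weights."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn(dim)                      # z_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = guided_eps(denoiser, z, t, cond, w)
        # Posterior mean of z_{t-1} given the guided noise estimate.
        mean = (z - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        z = mean + torch.sqrt(betas[t]) * torch.randn(dim) if t > 0 else mean
    return z
```

Training such a model with the conditioning randomly dropped (replaced by None) is what makes both predictions available from a single network, so the guidance weight w can trade sample diversity for task fidelity at inference time.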