META-LEARNING VIA CLASSIFIER(-FREE) GUIDANCE

Abstract

State-of-the-art meta-learning techniques do not optimize for zero-shot adaptation to unseen tasks, a setting in which humans excel. Instead, meta-learning algorithms learn hyperparameters and weight initializations that explicitly optimize for few-shot learning performance. In this work, we take inspiration from recent advances in generative modeling and language-conditioned image synthesis to propose meta-learning techniques that use natural language guidance to achieve higher zero-shot performance than the state-of-the-art. We do so by recasting the meta-learning problem as a multi-modal generative modeling problem: given a task, we consider its adapted neural network weights and its natural language description as equivalent multi-modal task representations. We first train an unconditional generative hypernetwork model to produce neural network weights; we then train a second "guidance" model that, given a natural language task description, traverses the hypernetwork latent space to find high-performance task-adapted weights in a zero-shot manner. We explore two alternative approaches for latent space guidance: "HyperCLIP"-based classifier guidance and a conditional Hypernetwork Latent Diffusion Model ("HyperLDM"), which we show to benefit from the classifier-free guidance technique common in image generation. Finally, we demonstrate that our approaches outperform existing meta-learning methods in zero-shot learning experiments on our Meta-VQA dataset, which we specifically constructed to reflect the multi-modal meta-learning setting.

1. INTRODUCTION

State-of-the-art machine learning algorithms often lack the ability to generalize quickly, in a sample-efficient manner, to new unseen tasks. In contrast, humans show remarkable capabilities in leveraging previous knowledge to learn a new task from just a few examples. Often, not even a single example is needed, as all relevant task information can be conveyed in the form of natural language instructions. Indeed, humans can solve novel tasks when prompted through a variety of interaction modalities, such as visual task observations or natural language prompts. In this work, we aim to improve the forward generalization of deep networks by developing new models capable of similar multi-modal task inference. In particular, we present two different approaches that utilize text-based task descriptors for rapid zero-shot adaptation to new tasks.

The development of deep learning models that quickly adapt and generalize to unseen tasks is the focus of the field of meta-learning. A recent trend stemming from the success of Model-Agnostic Meta-Learning (Finn et al., 2017, MAML) describes meta-learning as a bi-level optimization problem: an outer-loop meta-model is trained with the goal of improving the few-shot performance of a base model when fine-tuned on a variety of related tasks. MAML was specifically introduced as a gradient-based method to find an unconditional network initialization with good few-shot performance over an entire task universe. Recent progress in large-scale deep learning is, however, challenging this explicit meta-learning framework. Large models trained on huge, rich, and diverse datasets have been shown to possess surprisingly good few-shot learning capabilities (Brown et al., 2020). In particular, large-scale pre-training and fine-tuning often outperform explicit meta-learning procedures (Mandi et al., 2022).
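The bi-level structure described above can be made concrete with a small sketch. The snippet below is a minimal, illustrative implementation of the first-order MAML approximation on a toy family of 1-D linear regression tasks (each task is defined by a slope a, and the base model is y = w * x with analytic MSE gradients); it is not the method proposed in this paper, and the task distribution, hyperparameters, and function names are all our own assumptions for illustration. The inner loop adapts the initialization w to a sampled task with one gradient step on a support set; the outer loop updates w using the adapted parameters' gradient on a query set.

```python
import numpy as np

rng = np.random.default_rng(0)

def task_loss_grad(w, a, x):
    # MSE loss and its gradient for the linear base model y = w * x
    # on a task whose ground truth is y = a * x.
    err = (w - a) * x
    loss = np.mean(err ** 2)
    grad = 2.0 * np.mean(x ** 2) * (w - a)
    return loss, grad

def maml_train(n_iters=500, tasks_per_batch=8, alpha=0.4, beta=0.1):
    w = rng.normal()  # the meta-learned initialization
    for _ in range(n_iters):
        meta_grad = 0.0
        for _ in range(tasks_per_batch):
            a = rng.uniform(-2.0, 2.0)       # sample a task (its slope)
            x_support = rng.normal(size=10)  # support set for inner adaptation
            x_query = rng.normal(size=10)    # query set for the outer loss
            # inner loop: one gradient step of task-specific fine-tuning
            _, g_inner = task_loss_grad(w, a, x_support)
            w_adapted = w - alpha * g_inner
            # outer loop (first-order approximation): take the meta-gradient
            # as the query-set gradient evaluated at the adapted weights
            _, g_outer = task_loss_grad(w_adapted, a, x_query)
            meta_grad += g_outer
        w -= beta * meta_grad / tasks_per_batch  # meta-update of the init
    return w

w0 = maml_train()

# Few-shot adaptation to an unseen task from the learned initialization.
a_new, x_new = 1.5, rng.normal(size=10)
loss_before, g = task_loss_grad(w0, a_new, x_new)
w_adapted = w0 - 0.4 * g
loss_after, _ = task_loss_grad(w_adapted, a_new, x_new)
```

Because the slopes are sampled symmetrically around zero, the learned initialization settles near the center of the task distribution, from which a single inner-loop step already reduces the loss substantially on a new task; full MAML additionally differentiates through the inner-loop update rather than using the first-order approximation.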
These developments prompted us to develop alternative methods for explicit meta-learning which natively benefit from rich and multi-modal data. Inspired by recent advances in conditional image generation (Ramesh et al., 2022; Rombach et al., 2022), we reframe meta-learning as a multi-modal generative modeling problem such that, given a task, its adapted neural network weights and its

