META LEARNING TO BRIDGE VISION AND LANGUAGE MODELS FOR MULTIMODAL FEW-SHOT LEARNING

Abstract

Multimodal few-shot learning is challenging due to the large domain gap between the vision and language modalities. Existing methods communicate visual concepts as prompts to frozen language models, but rely on hand-engineered task induction to reduce the hypothesis space. To make the whole process learnable, we introduce a multimodal meta-learning approach. Specifically, our approach decomposes the training of the model into a set of related multimodal few-shot tasks. We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models and leverage their already-learned capacity. By updating only the learnable parameters of the meta-mapper, the model learns to accrue shared meta-knowledge across these tasks and can thus rapidly adapt to newly presented samples with only a few gradient updates. Importantly, it induces the task in a completely data-driven manner, with no need for hand-engineered task induction. We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words and answer visual questions by observing only a limited set of labeled examples. The experimental results show that our meta-learning approach outperforms the baseline across multiple datasets and various training settings, while being computationally more efficient.

1. INTRODUCTION

Learning quickly by observing a few examples in a multimodal environment is an integral part of human intelligence (Schmidhuber, 1987; Bengio et al., 1991). Yet, current vision and language models still struggle with limited labeled data when performing multimodal few-shot learning (Tsimpoukelli et al., 2021; Alayrac et al., 2022). In contrast, language-only models have flourished over the past years, especially when transferred to settings with limited labels (Brown et al., 2020; Perez et al., 2021; Min et al., 2022), as a result of large-scale pre-training and huge model capacity. Such advances in natural language processing inspired similar efforts in the vision domain, yielding large vision models with impressive few-shot and zero-shot image classification capabilities (Radford et al., 2021; Zhai et al., 2021; Jia et al., 2021). However, due to the large domain gap between the vision and language modalities, it is non-trivial to directly embody few-shot capabilities in multimodal settings, which is the main motivation of this work. One of the main challenges in bridging this gap is finding a proper mechanism for communicating visual concepts to a language model by accruing shared knowledge from sequences of multimodal tasks. The Frozen model (Tsimpoukelli et al., 2021) is the first multimodal few-shot learner to tackle this challenge, taking inspiration from language models able to do in-context learning (Brown et al., 2020). It requires a task induction, provided as a sentence followed by context data samples, to reduce the hypothesis space for open-ended image interpretation. This might not be a shortcoming for simpler tasks, such as binary decisions; however, it becomes an obstacle for more complicated tasks (Li & Liang, 2021), since the task induction has to be engineered each time. We hypothesize that the task can instead be induced from the data itself in a completely learnable manner.
Meta-learning, or learning to learn (Schmidhuber, 1987; Thrun & Pratt, 2012; Andrychowicz et al., 2016), comes as a natural solution to any few-shot setting. While it has been extensively studied in unimodal settings, particularly for few-shot image classification (Vinyals et al., 2016; Ravi & Larochelle, 2017; Finn et al., 2017; Raghu et al., 2019; Ye et al., 2020), meta-learning remains almost unexplored for multimodal few-shot settings. As in unimodal settings, empowering a multimodal few-shot learner with the ability to accrue knowledge across tasks would help build internal multimodal representations broadly suitable for many tasks. These representations could serve as a bridge between the different modalities and support quickly learning new tasks from only a limited number of labeled examples. Motivated by this, we define a novel multimodal few-shot meta-learning approach to bridge the gap between vision and language modalities, illustrated in Figure 1.

Figure 1: Multimodal few-shot meta-learning task for an example 2-way 1-shot setting, with two categories (ways) present in the support-set images, each represented by one sample (shot). Given a batch of tasks T_i, the support set is first used to obtain task-specific model parameters θ′_i for each task via a few gradient-step updates; these are then used together with the query-set samples to perform a meta-update of the meta-parameters θ. After meta-training is finished, for a new given task, the meta-trained model is used for inference by further adapting it on the support set and measuring performance on unseen query samples.
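The support/query adaptation scheme of Figure 1 follows the standard gradient-based meta-learning recipe of MAML (Finn et al., 2017): an inner loop adapts the parameters on each task's support set, and an outer loop updates the shared meta-parameters from the query loss at the adapted parameters. The following minimal sketch illustrates this on toy 1-D linear-regression tasks, using a first-order approximation (FOMAML) and hypothetical helper names (`grad_mse`, `maml_step`) that are not from the paper:

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def maml_step(w, tasks, inner_lr=0.1, outer_lr=0.1, inner_steps=3):
    """One meta-update over a batch of (support, query) tasks.

    Inner loop: adapt w on each support set to get task-specific w_i'.
    Outer loop: update the shared meta-parameter w with the query-set
    gradient evaluated at w_i' (first-order approximation, i.e. the
    second-order terms of full MAML are dropped for simplicity).
    """
    meta_grad = 0.0
    for support, query in tasks:
        w_i = w
        for _ in range(inner_steps):
            w_i -= inner_lr * grad_mse(w_i, support)   # few gradient steps
        meta_grad += grad_mse(w_i, query)              # evaluate on query set
    return w - outer_lr * meta_grad / len(tasks)
```

After meta-training, inference on a new task repeats only the inner loop on its support set before predicting on the query samples, mirroring the meta-test procedure described in the caption.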
Specifically, our approach adopts publicly available pre-trained large vision encoders (Radford et al., 2021) and language models (Brown et al., 2020), which are kept frozen to take advantage of their already-learned reasoning capability (Tsimpoukelli et al., 2021; Mokady et al., 2021; Zhou et al., 2022; Jia et al., 2022; Tewel et al., 2021; Zhai et al., 2021). In doing so, our method avoids the huge computational resources these models require during training and their dependency on large datasets. Unlike prior multimodal few-shot learners (Tsimpoukelli et al., 2021; Alayrac et al., 2022), our approach decomposes the training of the model into sequentially observed multimodal few-shot tasks, in the spirit of meta-learning. During the meta-training stage, the model translates the visual representations into a visual prefix for the language model, using a lightweight meta-mapper network. This network serves as a multimodal bridge between the large vision and language models and is built entirely from self-attention layers (Lee et al., 2019). Essentially, the aim of the meta-mapper is to collect shared meta-knowledge from related tasks by mapping the visual representation into the latent space of the language model. Then, during inference, or meta-test in meta-learning parlance, the model induces the task in a fully data-driven manner by observing a few labeled examples, entirely removing the need for hand-engineered task inductions.

In summary, we contribute in three major aspects. Conceptually: we introduce meta-learning to multimodal few-shot learning, which enables fast adaptation and efficient learning of multimodal few-shot tasks; to that end, we design a new setting for multimodal few-shot learning by re-organizing existing datasets and following suitable benchmarks. Methodologically: we present a multimodal meta-learner built on a lightweight meta-mapper, which learns to bridge large frozen vision and language backbones; to the best of our knowledge, this is the first meta-learning-based model to solve multimodal few-shot tasks. Empirically: we demonstrate through systematic experiments on these benchmarks that our model, using only the small trainable meta-mapper with frozen backbones, yields strong multimodal few-shot performance while being computationally more efficient.
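As a rough illustration of the meta-mapper's role, the sketch below implements a single cross-attention step in pure Python: a small set of learnable prefix tokens attends over frozen visual features to produce visual-prefix tokens in the language model's embedding space. This is a toy sketch under our own assumptions (single layer, single head, no learned projections, illustrative names such as `meta_mapper`), not the paper's actual architecture:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention; each query attends over keys/values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in keys])
        out.append([sum(a * v[j] for a, v in zip(scores, values))
                    for j in range(len(values[0]))])
    return out

def meta_mapper(visual_feats, prefix_tokens):
    """Map frozen visual features to a visual prefix for the language model.

    Here the prefix tokens stand in for the only learnable parameters:
    they cross-attend to the (frozen) visual features, yielding one
    prefix embedding each, which would be prepended to the text tokens.
    """
    return attention(prefix_tokens, visual_feats, visual_feats)
```

In this view, meta-training updates only the prefix tokens (and, in the real model, the attention weights), so the gradient-based adaptation of Figure 1 touches a tiny parameter set while both backbones stay frozen.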

