TACKLING DIVERSE TASKS VIA CROSS-MODAL TRANSFER LEARNING

Abstract

Fine-tuning large-scale pretrained models has led to remarkable progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other tasks due to an assumed lack of relevant pretrained models for these diverse modalities. In this work, we revisit this assumption by studying the cross-modal transfer ability of large-scale pretrained models. We introduce ORCA, a general cross-modal fine-tuning workflow that enables fast and automatic exploitation of existing pretrained models for diverse tasks. ORCA achieves task-specific adaptation by performing data alignment before fine-tuning: it learns an embedding network that minimizes the optimal transport dataset distance between the end-task data and the pretraining data to close the modality gap. Through extensive experiments, we show that ORCA is the first viable approach that allows practitioners to use pretrained models to outperform hand-designed, AutoML-searched, and general-purpose architectures: ORCA obtains state-of-the-art results on 10 of 13 diverse tasks we evaluate and ranks among the top three on the others. We shed light on why cross-modal transfer works by quantifying the importance of data alignment and highlight ORCA's utility for data-limited domains.

1. INTRODUCTION

The success of machine learning (ML) in vision and natural language processing (NLP) has spurred its application beyond these traditional ML domains to diverse tasks such as solving partial differential equations (Li et al., 2021b), music modeling (Lewandowski et al., 2012), detecting cardiac disease (Hong et al., 2020), and many others. However, progress in these less-explored areas can be challenging due to (1) limited amounts of labeled data, (2) the high computational cost and human effort of developing models from scratch, and (3) a lack of relevant large-scale pretrained models, which have in many cases obviated the first two issues in vision and NLP (e.g., Devlin et al., 2019; Carion et al., 2020; Dosovitskiy et al., 2021; Liu et al., 2021b; Radford et al., 2021). There are two common approaches for practitioners to handle these issues: automated machine learning (AutoML) techniques (e.g., Roberts et al., 2021; Shen et al., 2022) that focus on designing task-specific networks in a data-efficient manner; and multimodal general-purpose methods that either propose flexible architectures applicable to various tasks (Jaegle et al., 2022a) or expand the set of modalities for which pretrained models exist (e.g., Reed et al., 2022; Lu et al., 2022a). However, both classes of approaches require training from scratch when applied to a new modality and proceed under the assumption that no relevant pretrained models exist for these diverse problems.

In this work, we re-examine this assumption by considering the general problem of cross-modal transfer. Our goal is to exploit existing large-scale pretrained models in data-rich modalities for solving diverse downstream tasks. A few recent works have demonstrated the potential promise of cross-modal transfer by applying language transformers to vision (Kiela et al., 2019; Dinh et al., 2022; Lu et al., 2022b), referential games (Li et al., 2020c), and reinforcement learning (Reid et al., 2022).
However, many of these approaches are ad hoc (e.g., they rely on manual prompt engineering or hand-craft new architecture components to solve specific tasks), and none of them yield models competitive with those trained from scratch. We tackle both shortcomings in our work.

We introduce a general-purpose, cross-modal transfer workflow called ORCA (Optimal tRansport Cross-modal Adaptation) that yields state-of-the-art results on a wide range of non-text and non-vision problems using pretrained transformers (Figure 1). Our key insight is to align the feature distribution of an unfamiliar, out-of-modality dataset with that of a familiar, in-modality dataset before fine-tuning. This data alignment process not only prevents distortion of pretrained weights but also enables cross-modal knowledge transfer, as we will show via extensive experiments in Section 4. Concretely, for any downstream task, we first generate an embedding network that maps the (potentially high-dimensional) inputs to sequence features. Then, we train it to minimize the optimal transport dataset distance (OTDD) (Alvarez-Melis & Fusi, 2020) between the feature-label distribution of the target data and data from the pretraining domain.¹ Finally, we fine-tune the pretrained model and the embedding network.
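The three-stage structure (construct an embedder and predictor around a pretrained body, align, then fine-tune) can be sketched as follows. All dimensions, the linear embedder/predictor, and the identity stand-in for the pretrained transformer body are illustrative assumptions, not ORCA's actual implementation.

```python
import numpy as np

# Hypothetical setup: a 1D target task with 1000-dim inputs and 10 classes,
# adapted to a transformer body expecting length-4 sequences of 128-dim tokens.
rng = np.random.default_rng(0)

class LinearEmbedder:
    """Stage 1: f_t maps raw target inputs into the token space of g_s."""
    def __init__(self, in_dim, seq_len, embed_dim):
        self.W = rng.normal(0, 0.02, (in_dim, seq_len * embed_dim))
        self.seq_len, self.embed_dim = seq_len, embed_dim
    def __call__(self, x):  # x: (batch, in_dim)
        return (x @ self.W).reshape(len(x), self.seq_len, self.embed_dim)

class FrozenBody:
    """Stand-in for the pretrained transformer body g_s (weights fixed)."""
    def __call__(self, tokens):  # (batch, seq, dim) -> (batch, seq, dim)
        return tokens  # identity placeholder for the pretrained layers

class LinearPredictor:
    """Stage 1: h_t maps pooled sequence features to target-task logits."""
    def __init__(self, embed_dim, n_classes):
        self.W = rng.normal(0, 0.02, (embed_dim, n_classes))
    def __call__(self, feats):  # (batch, seq, dim)
        return feats.mean(axis=1) @ self.W  # pool over tokens, then project

f_t = LinearEmbedder(in_dim=1000, seq_len=4, embed_dim=128)
g_s = FrozenBody()
h_t = LinearPredictor(embed_dim=128, n_classes=10)

# Stage 2 would train f_t to minimize OTDD against proxy source data;
# Stage 3 would fine-tune {f_t, g_s, h_t} end to end on the target task.
x_t = rng.normal(size=(8, 1000))  # a batch of target inputs
logits = h_t(g_s(f_t(x_t)))
print(logits.shape)  # prints (8, 10)
```

The point of the wiring is that only f_t and h_t depend on the target task's input and output shapes; the pretrained body is reused unchanged across modalities.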
Using OTDD allows us to relax many distributional assumptions required by traditional domain adaptation and to perform data alignment using both the feature and label information of the target data. However, we show in an ablation study in Section 4.2.1 that substituting OTDD with other distance metrics, such as maximum mean discrepancy (MMD) (Gretton et al., 2012), can also aid cross-modal transfer, albeit to a lesser extent. This implies that it is the general idea of first-align-then-fine-tune that enables ORCA to obtain significantly better results than previous cross-modal learning methods that rely on vanilla fine-tuning (Lu et al., 2022b).

We evaluate ORCA on a diverse set of 13 tasks with different input dimensions (1D and 2D), prediction types (point and dense), and modalities (vision, audio, electrocardiogram, physics, protein, genomics, cosmic-ray, and music). ORCA outperforms various competitors, including task-specific hand-designed architectures, leading AutoML methods, and general-purpose models, ranking first on 10 tasks and in the top three on all tasks. We compare ORCA with existing fine-tuning techniques and confirm that effective cross-modal transfer is only enabled by ORCA's feature alignment process. We further reveal an empirical correlation between alignment quality and downstream performance. Finally, we demonstrate ORCA's efficacy for limited-data tasks. Overall, our work not only explores the cross-modal transfer ability of pretrained models, but also establishes a practical workflow for solving diverse prediction problems efficiently and automatically.
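To make the role of such a distance concrete, the MMD alternative mentioned above can be estimated directly from two sets of embedded features. The RBF kernel, bandwidth, and synthetic "source" and "target" feature sets below are illustrative choices, not the exact estimator used in our experiments.

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=0.1):
    """Squared maximum mean discrepancy between samples X and Y
    under an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def kernel(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, (100, 8))       # stand-in for proxy source features
tgt_far = rng.normal(3.0, 1.0, (100, 8))   # embedder output before alignment
tgt_near = rng.normal(0.0, 1.0, (100, 8))  # embedder output after alignment

# Alignment training would update the embedder to shrink this distance.
assert rbf_mmd2(src, tgt_far) > rbf_mmd2(src, tgt_near)
```

Minimizing such a distance with respect to the embedder's parameters pulls the target feature distribution toward the source one; OTDD plays the same role but additionally accounts for label information.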

2. RELATED WORK

In this section, we review several groups of related work in the areas of AutoML, in-modal transfer learning (unimodal domain adaptation, unimodal/multimodal fine-tuning, and general-purpose methods), and cross-modal transfer learning (heterogeneous domain adaptation, task-specific fine-tuning, and FPT). Table 1 summarizes these groups along relevant axes and contrasts them with ORCA.

AutoML for diverse tasks is a growing research area, as evidenced by the NAS-Bench-360 benchmark (Tu et al., 2022), along with several recent neural architecture search (NAS) methods that target this problem, e.g., AutoML-Zero (Real et al., 2020), XD (Roberts et al., 2021), and DASH (Shen et al., 2022). In contrast to these NAS methods, ORCA takes a transfer learning approach in order to leverage existing pretrained models from data-rich modalities for more esoteric tasks, rather than repeatedly incurring the overhead of designing new architectures and training them from scratch. That said, given the shared underlying motivation, our experimental evaluation makes use of the diverse tasks in NAS-Bench-360.



¹ We do not assume access to the pretraining data due to practical concerns about data access and computational efficiency. We instead work with publicly available proxy data from the pretraining modality, e.g., CIFAR-10 for models pretrained on ImageNet and CoNLL-2003 for models pretrained on larger text corpora.



Figure 1: ORCA's three-stage cross-modal transfer workflow enables fast and automatic exploitation of large-scale pretrained models in data-rich modalities for solving diverse tasks. First, given target data (x_t, y_t) and a pretrained transformer body g_s, ORCA constructs an embedder architecture f_t to match the input dimensionality of g_s, and a predictor architecture h_t to convert the output of g_s back to the appropriate output space for the target task, e.g., classification logits or dense maps. Note that ORCA does not learn the weights for f_t or h_t during this stage. Next, ORCA learns the parameters of the embedder f_t by minimizing the OTDD between the target dataset and an in-modality source dataset. Finally, ORCA fine-tunes the entire architecture {f_t, g_s, h_t}.

