TACKLING DIVERSE TASKS VIA CROSS-MODAL TRANSFER LEARNING

Abstract

Fine-tuning large-scale pretrained models has led to remarkable progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other tasks due to an assumed lack of relevant pretrained models for these diverse modalities. In this work, we revisit this assumption by studying the cross-modal transfer ability of large-scale pretrained models. We introduce ORCA, a general cross-modal fine-tuning workflow that enables fast and automatic exploitation of existing pretrained models for diverse tasks. ORCA achieves task-specific adaptation by performing data alignment before fine-tuning: it learns an embedding network that minimizes the optimal transport dataset distance between the end-task data and the pretraining data to close the modality gap. Through extensive experiments, we show that ORCA is the first viable approach that allows practitioners to use pretrained models to outperform hand-designed, AutoML-searched, and general-purpose architectures: ORCA obtains state-of-the-art results on 10 of 13 diverse tasks we evaluate and ranks among the top three on the others. We shed light on why cross-modal transfer works by quantifying the importance of data alignment and highlight ORCA's utility for data-limited domains.

1. INTRODUCTION

The success of machine learning (ML) in vision and natural language processing (NLP) has spurred its application beyond these traditional ML domains to diverse tasks such as solving partial differential equations (Li et al., 2021b), music modeling (Lewandowski et al., 2012), detecting cardiac disease (Hong et al., 2020), and many others. However, progress in these less-explored areas can be challenging due to (1) limited amounts of labeled data, (2) the high computational cost and human effort required to develop models from scratch, and (3) a lack of relevant large-scale pretrained models, which have in many cases obviated the first two issues in vision and NLP (e.g., Devlin et al., 2019; Carion et al., 2020; Dosovitskiy et al., 2021; Liu et al., 2021b; Radford et al., 2021).

There are two common approaches for practitioners to handle these issues: automated machine learning (AutoML) techniques (e.g., Roberts et al., 2021; Shen et al., 2022) that focus on designing task-specific networks in a data-efficient manner; and multimodal general-purpose methods that either propose flexible architectures applicable to various tasks (Jaegle et al., 2022a) or expand the set of modalities for which pretrained models exist (e.g., Reed et al., 2022; Lu et al., 2022a). However, both classes of approaches require training from scratch when applied to a new modality and proceed under the assumption that no relevant pretrained models exist for these diverse problems.

In this work, we re-examine this assumption by considering the general problem of cross-modal transfer. Our goal is to exploit existing large-scale pretrained models in data-rich modalities for solving diverse downstream tasks. A few recent works have demonstrated the potential promise of cross-modal transfer by applying language transformers to vision (Kiela et al., 2019; Dinh et al., 2022; Lu et al., 2022b), referential games (Li et al., 2020c), and reinforcement learning (Reid et al., 2022). However, many of these approaches are ad hoc (e.g., they rely on manual prompt engineering or hand-craft new architecture components to solve specific tasks), and none of them yield models competitive with those trained from scratch. We tackle both shortcomings in our work.

We introduce a general-purpose, cross-modal transfer workflow called ORCA (Optimal tRansport Cross-modal Adaptation) that yields state-of-the-art results on a wide range of non-text and non-vision problems using pretrained transformers (Figure 1). Our key insight is to align the feature distribution of the end-task data with that of the pretraining modality before fine-tuning, closing the modality gap so that the pretrained weights can be exploited effectively.
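To make this workflow concrete, the sketch below illustrates the two-stage recipe on a generic classification task: an embedding network first maps target-modality inputs into the pretrained transformer's representation space and is trained to shrink the distributional gap to a reference set of pretraining-modality embeddings, after which the whole model is fine-tuned on the end task. This is a minimal illustration rather than the paper's implementation: the small TransformerEncoder standing in for the pretrained body, the random reference embeddings, and the moment-matching proxy used in place of the optimal transport dataset distance are all assumptions made to keep the example self-contained.

```python
# Minimal sketch of an ORCA-style cross-modal fine-tuning workflow (illustrative only).
# Assumptions not taken from the paper: a small TransformerEncoder stands in for the
# pretrained body, and the OT dataset distance is replaced by a moment-matching proxy
# so that the example needs only PyTorch.
import torch
import torch.nn as nn

HIDDEN, SEQ_LEN, NUM_CLASSES = 64, 8, 10

class Embedder(nn.Module):
    """Maps target-modality inputs (flat vectors here) into the body's token space."""
    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, SEQ_LEN * HIDDEN)

    def forward(self, x):                                    # x: (batch, in_dim)
        return self.proj(x).view(x.size(0), SEQ_LEN, HIDDEN)

def distribution_gap(target_feats, reference_feats):
    """Proxy for the OT dataset distance: match mean and covariance of pooled
    target embeddings to reference (pretraining-modality) embeddings."""
    mean_gap = (target_feats.mean(0) - reference_feats.mean(0)).pow(2).sum()
    cov_gap = (torch.cov(target_feats.T) - torch.cov(reference_feats.T)).pow(2).sum()
    return mean_gap + cov_gap

# Stand-ins for the pretrained transformer body and a task-specific head.
body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(HIDDEN, NUM_CLASSES)
embedder = Embedder(in_dim=32)

x = torch.randn(16, 32)               # a batch of target-modality inputs
reference = torch.randn(256, HIDDEN)  # cached embeddings from the pretraining modality

# Stage 1: train only the embedder to close the modality gap.
opt1 = torch.optim.Adam(embedder.parameters(), lr=1e-3)
for _ in range(5):
    pooled = embedder(x).mean(dim=1)  # (batch, HIDDEN)
    loss = distribution_gap(pooled, reference)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: fine-tune embedder, body, and head on the end task.
y = torch.randint(0, NUM_CLASSES, (16,))
params = list(embedder.parameters()) + list(body.parameters()) + list(head.parameters())
opt2 = torch.optim.Adam(params, lr=1e-4)
logits = head(body(embedder(x)).mean(dim=1))
task_loss = nn.functional.cross_entropy(logits, y)
opt2.zero_grad(); task_loss.backward(); opt2.step()
```

In practice the reference embeddings would be extracted from the pretrained model's own training modality (e.g., token embeddings of a text corpus), and the alignment objective would be the optimal transport dataset distance described in the abstract rather than the simple proxy used here.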

