TASK-CUSTOMIZED MASKED AUTOENCODER VIA MIXTURE OF CLUSTER-CONDITIONAL EXPERTS

Abstract

Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training. However, when the various downstream tasks have data distributions different from the pre-training data, the semantically irrelevant pre-training information might result in negative transfer, impeding MAE's scalability. To address this issue, we propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE), which can be trained once but provides customized pre-training models for diverse downstream tasks. Different from the mixture of experts (MoE), our MoCE trains each expert only with semantically relevant images by using cluster-conditional gates. Thus, each downstream task can be allocated to its customized model pre-trained with the data most similar to the downstream data. Experiments on a collection of 11 downstream tasks show that MoCE outperforms the vanilla MAE by 2.45% on average. It also obtains new state-of-the-art self-supervised learning results on detection and segmentation.

1. INTRODUCTION

Self-supervised learning (SSL), which learns effective transferable representations without human annotations, has become a prevailing model pre-training paradigm (He et al., 2020; Chen et al., 2021a; Bao et al., 2022). Currently, the most prevalent SSL method is the Masked Autoencoder (MAE) (He et al., 2022), which constructs supervision signals from raw image data by masking random input patches and then reconstructing the missing pixels. This simple strategy has proved efficient for training large-scale models. For example, MAE with a ViT backbone (Dosovitskiy et al., 2021) shows impressive performance on popular benchmarks such as ImageNet¹ (Deng et al., 2009).

However, does MAE really scale well to various downstream tasks (Deng et al., 2009; Lin et al., 2014; Zhou et al., 2019; Han et al., 2021; Li et al., 2022a)? Preliminary studies (Section 3.1) show that MAE indeed suffers from negative transfer (Liu et al., 2022) when transferred to downstream tasks with very different semantics. Figure 1(a) shows that on 9 of 11 downstream tasks, an MAE pre-trained on the full ImageNet data is outperformed by one pre-trained on only the semantically relevant data subsets. Hence, pre-training data that are semantically irrelevant to the downstream task can hurt transfer performance.

The above observation motivates the need for task-customized pre-training. A promising model for this is the Mixture of Experts (MoE) (Shazeer et al., 2017; Riquelme et al., 2021), which uses a multi-expert architecture to provide customized models for different input tokens. However, unlike supervised pre-training, self-supervised learning lacks semantic labels, so the experts differ more in low-level information than in semantics, as shown in Figure 1(b). Experiments in Section 4.2 show that a naive adoption of MoE in MAE yields inferior performance. Since various downstream tasks contain different semantics, semantically related experts may be preferred.
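To make the MAE objective concrete, the following is a minimal NumPy sketch of its core idea: mask a random subset of image patches, process only the visible ones, and compute the reconstruction loss on the masked patches alone. The stand-in "encoder" and "decoder" here are illustrative placeholders, not the paper's actual ViT architecture; the 75% mask ratio follows He et al. (2022).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_patches: int, mask_ratio: float = 0.75):
    """Split patch indices into masked and visible sets (MAE masks ~75%)."""
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    return perm[:num_masked], perm[num_masked:]  # (masked, visible)

def mae_step(patches: np.ndarray, mask_ratio: float = 0.75) -> float:
    """One toy MAE 'step': encode visible patches, predict masked ones,
    and compute MSE only on the masked patches.

    `patches`: (N, D) array of flattened image patches.
    The encoder/decoder are trivial stand-ins for the real ViT.
    """
    n, _ = patches.shape
    masked_idx, visible_idx = random_mask(n, mask_ratio)

    # Stand-in encoder: the real MAE runs a ViT on visible patches only,
    # which is what makes the high mask ratio computationally efficient.
    latent = patches[visible_idx]

    # Stand-in decoder: predict every masked patch from the mean latent.
    pred = np.tile(latent.mean(axis=0), (len(masked_idx), 1))

    # Reconstruction loss is computed on masked patches only, as in MAE.
    return float(np.mean((pred - patches[masked_idx]) ** 2))

# Example: 196 patches (a 14x14 grid of 16x16 patches from a 224x224 image).
patches = rng.normal(size=(196, 768))
loss = mae_step(patches)
```

Note that no labels appear anywhere: the supervision signal comes entirely from the held-out pixels, which is what allows pre-training on unlabeled data at scale.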



¹ We refer to ImageNet-1K as ImageNet if not specified in this paper.

