TASK-CUSTOMIZED MASKED AUTOENCODER VIA MIXTURE OF CLUSTER-CONDITIONAL EXPERTS

Abstract

Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training. However, when downstream tasks have data distributions different from the pre-training data, semantically irrelevant pre-training information can cause negative transfer, impeding MAE's scalability. To address this issue, we propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE), which is trained once but provides customized pre-trained models for diverse downstream tasks. Different from the Mixture of Experts (MoE), our MoCE trains each expert only with semantically relevant images by using cluster-conditional gates. Each downstream task can thus be allocated its customized model, pre-trained on the data most similar to the downstream data. Experiments on a collection of 11 downstream tasks show that MoCE outperforms the vanilla MAE by 2.45% on average. It also obtains new state-of-the-art self-supervised learning results on detection and segmentation.

1. INTRODUCTION

Self-supervised learning (SSL), which learns effective transferable representations without human annotations, has become a prevailing model pre-training paradigm (He et al., 2020; Chen et al., 2021a; Bao et al., 2022). Currently, the most prevalent SSL method is the Masked Autoencoder (MAE) (He et al., 2022), which constructs supervision signals from raw image data by masking random input patches and then reconstructing the missing pixels. This simple strategy has proved effective for training large-scale models: for example, an MAE pre-trained ViT (Dosovitskiy et al., 2021) shows impressive performance on popular benchmarks such as ImageNet (Deng et al., 2009).

However, does MAE really scale well to various downstream tasks (Deng et al., 2009; Lin et al., 2014; Zhou et al., 2019; Han et al., 2021; Li et al., 2022a)? Preliminary studies (in Section 3.1) show that MAE indeed suffers from negative transfer (Liu et al., 2022) when transferring to downstream tasks with very different semantics. Figure 1(a) shows that on 9 of 11 downstream tasks, an MAE pre-trained on the full ImageNet data is outperformed by one pre-trained on only a semantically relevant data subset. Hence, pre-training data that are semantically irrelevant to a task can hurt its transfer performance.

This observation motivates the need for task-customized pre-training. A promising model for this is the Mixture of Experts (MoE) (Shazeer et al., 2017; Riquelme et al., 2021), which uses a multi-expert architecture to provide customized models for different input tokens. However, unlike supervised pre-training, self-supervised learning lacks semantic labels, so the experts differ more in low-level information than in semantics, as shown in Figure 1(b). Experiments in Section 4.2 show that naively adopting MoE in MAE yields inferior performance. Since different downstream tasks involve different semantics, semantics-aware experts are preferred.
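For reference, a standard MoE gate routes each token independently according to its own embedding (Shazeer et al., 2017). The following is a minimal NumPy sketch of such top-k token-level routing; the function and argument names are illustrative, not the actual TokenMoE implementation:

```python
import numpy as np

def token_moe_gate(tokens, gate_weights, top_k=1):
    """Route each token to its top-k experts (standard token-level MoE sketch).

    tokens:       (n_tokens, d) token embeddings
    gate_weights: (d, n_experts) learned gating matrix
    Returns the top_k expert indices chosen for each token.
    """
    logits = tokens @ gate_weights                        # (n_tokens, n_experts)
    # numerically stable softmax over experts
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # indices of the top_k experts per token, highest probability first
    return np.argsort(-probs, axis=-1)[:, :top_k]
```

Because routing depends only on the individual token embedding, two tokens with similar pixel statistics are sent to the same expert regardless of their semantics, which is exactly the failure mode illustrated in Figure 1(b).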
In this paper, we propose the Mixture of Cluster-conditional Experts (MoCE), a novel paradigm that achieves task-customized self-supervised pre-training by clustering the data and explicitly training each expert with images of similar semantics. The MoCE procedure has three stages. First, we cluster the whole dataset using a pre-trained dense MAE model. Second, we construct the MoCE with a multi-expert structure, where each expert is trained on clusters selected by gates that route on cluster embeddings (instead of token embeddings); a regularization loss is proposed to stabilize training and enhance the confidence of the gating results. Finally, given a downstream task, a search procedure selects the cluster closest to the task, and the task is allocated the corresponding customized model. Empirically, the proposed MoCE shows superior performance over MAE on a collection of 11 downstream tasks. Moreover, only a single MoCE sub-model is needed at deployment, saving inference time and model capacity.

To summarize, our main contributions are:

1. We systematically analyze the negative transfer phenomenon of MAE, and show that naively adopting MoE in MAE cannot improve transfer performance on downstream tasks.
2. We propose MoCE, which trains each expert with semantics-aware clusters so that similar clusters are routed to the same expert.
3. We demonstrate the effectiveness of MoCE on a collection of 11 downstream tasks, achieving up to 2.45% improvement in Top-1 accuracy. State-of-the-art self-supervised results are also achieved on the detection and segmentation tasks.

To the best of our knowledge, this is the first work that achieves state-of-the-art transfer performance by training vision MoE models on ImageNet under the SSL setting.
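The cluster-conditional gating described above can be sketched as follows. This is a hedged NumPy toy, not the paper's implementation: `cluster_ids`, `cluster_embed`, and `gate_weights` are illustrative names, and in the actual method the cluster embeddings and gate are learned jointly with the experts.

```python
import numpy as np

def cluster_conditional_gate(cluster_ids, cluster_embed, gate_weights):
    """Route every image by its cluster embedding (MoCE-style sketch).

    cluster_ids:   (n_images,) cluster assignment of each image
    cluster_embed: (n_clusters, d) cluster embeddings
    gate_weights:  (d, n_experts) gating matrix
    Returns the top-1 expert index per image.
    """
    # look up the shared embedding of each image's cluster
    logits = cluster_embed[cluster_ids] @ gate_weights    # (n_images, n_experts)
    return logits.argmax(axis=-1)                         # one expert per image
```

Since all images of a cluster share one embedding, they are deterministically routed to the same expert; this is what allows each expert to be trained only on semantically similar data.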

2. RELATED WORK

Self-supervised Learning. Previous works mainly focus on the design of pretext tasks with image transformations (Doersch et al., 2015; Gidaris et al., 2018), inpainting (Pathak et al., 2016), colorization (Zhang et al., 2016), contrastive learning (Chen et al., 2020; He et al., 2020; Grill et al., 2020; Caron et al., 2020; Radford et al., 2021b; Yao et al., 2022b), and pretext tasks for specific downstream tasks (Wang et al., 2020; Xie et al., 2020; 2021a; Chen et al., 2021a; Yao et al., 2022a). Motivated by the design of BERT (Devlin et al., 2018), masked image modeling (MIM) has recently been proposed to learn by reconstructing masked images. BEiT (Bao et al., 2022) is the pioneering work, which predicts visual tokens generated by a pre-trained tokenizer (Radford et al., 2021a). SimMIM (Xie et al., 2021c) simplifies the framework by directly using raw pixel RGB values as reconstruction targets.



Footnote: we refer to ImageNet-1K as ImageNet unless otherwise specified in this paper.




Figure 1: (a) Transfer performance of MAEs pre-trained on Split-A (blue), Split-B (red), and the full ImageNet data (white). Only two of the eleven downstream tasks benefit from using the full ImageNet data for pre-training (more details in Section 3.1). (b) TokenMoE uses pixel RGB values as reconstruction targets, so tokens with similar pixel values tend to be routed to the same expert, leading to two types of mistakes: (i) tokens with the same semantics are routed to different experts; (ii) tokens with different semantics are routed to the same expert.

