MASKED FREQUENCY MODELING FOR SELF-SUPERVISED VISUAL PRE-TRAINING

Abstract

We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, in this paper, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of the frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that, due to heavy spatial redundancy, predicting masked components in the frequency domain is better suited to revealing underlying image patterns than predicting masked patches in the spatial domain. Our findings suggest that, with the right configuration of the mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among their low-frequency counterparts are useful for learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token. Experimental results on image classification and semantic segmentation, as well as on several robustness benchmarks, show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach.

1. INTRODUCTION

Following the success of Masked Language Modeling (MLM) such as BERT (Devlin et al., 2019) in natural language processing (NLP), Masked Image Modeling (MIM) (Bao et al., 2022; He et al., 2022; Wei et al., 2022; Xie et al., 2022) has shown promising performance in self-supervised pre-training of visual models. Both MLM and MIM follow a common corrupt-and-predict paradigm: randomly masking a portion of the input data and then learning to predict the missing parts. This simple recipe enables modern Transformer-based deep architectures (Vaswani et al., 2017; Dosovitskiy et al., 2020) to learn generalizable representations from ubiquitous unlabeled text or image data. By default, current MIM methods such as BEiT (Bao et al., 2022), MAE (He et al., 2022) and SimMIM (Xie et al., 2022) perform masking in the spatial domain by excluding image patches randomly, a strategy inspired by MLM that performs masking on words (Figure 1 (a-b)). However, unlike human-generated language, which is succinct and highly semantic, raw pixel values in the spatial domain are of low information density. To cope with the heavy spatial redundancy in images, MAE (He et al., 2022) shows that one needs to mask a very high proportion of patches (e.g., 75%) to encourage the learning of meaningful features. Beyond masking image patches, which is one particular form of corruption, in this paper we are interested in investigating the effectiveness of other corruption strategies for self-supervised representation learning. We first explore the corruption recipes commonly applied in low-level image processing tasks, including image super-resolution (SR), deblurring and denoising. As shown in Figure 1 (c), the downsampling, blur, and noise operations can degrade the exemplar image effectively in the spatial domain, thus potentially serving as useful corruption strategies.
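The frequency-domain view of these corruptions can be checked numerically. The sketch below (a hypothetical diagnostic, not the authors' code; `band_energy` and all parameter values are our own choices) measures the mean spectral magnitude outside a low-frequency disc: a blur suppresses it, while additive noise raises energy across the whole spectrum.

```python
import numpy as np

def band_energy(img, radius=8):
    """Mean spectral magnitude outside a low-frequency disc of the given
    radius (hypothetical diagnostic, not part of the paper's method)."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    far = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2) > radius
    return spec[far].mean()

rng = np.random.default_rng(0)
img = rng.random((64, 64))

# 3x3 box blur: a crude stand-in for the blur/downsampling corruption.
k = np.ones((3, 3)) / 9.0
pad = np.pad(img, 1, mode="edge")
blurred = sum(pad[i:i + 64, j:j + 64] * k[i, j]
              for i in range(3) for j in range(3))

# Additive Gaussian noise: corrupts the full spectrum globally.
noisy = img + rng.normal(0, 0.2, img.shape)

print(band_energy(img), band_energy(blurred), band_energy(noisy))
```

Averaged over thousands of high-frequency bins, the blurred image consistently shows lower high-band energy than the original, and the noisy image higher, matching the qualitative picture in Figure 1 (c).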
However, corruption induced in the spatial domain prevents us from analyzing what specific information is corrupted and needs to be reconstructed. To better understand these low-level corruptions, we shift our attention from the spatial image domain to the frequency domain. In the frequency domain, one can observe underlying patterns of an image that are not conveniently visible from raw pixel values. For example, the downsampling and blur operations dominantly remove high-frequency image details, while adding noise tends to corrupt the full frequency spectrum of an image globally (Figure 1 (c)). Driven by this observation, we present a simple and effective masking strategy in the frequency domain for self-supervised visual representation learning, dubbed Masked Frequency Modeling (MFM). Specifically, we first perform a Fast Fourier Transform (FFT) to convert each input image into its frequency representation, i.e., the frequency spectrum. We then mask a portion of frequencies on the frequency spectrum using a low-/high-pass filter. With an inverse FFT (iFFT), we finally take the corrupted image, with some of its frequencies attenuated, as input. Our encoder is quite flexible as no mask tokens are inserted. Thus, MFM can embrace both the vision Transformer (ViT) (Dosovitskiy et al., 2020) and convolutional neural network (CNN) (LeCun et al., 1989) families. Our decoder is a lightweight linear layer that reconstructs the masked frequency values on the frequency spectrum via a frequency loss. As shown in Figure 1 (d), an image with low or high frequencies attenuated reveals entirely different patterns: the low-frequency components usually contain smooth object structure such as colors and styles, while the high-frequency counterparts largely depict the object outline or silhouette. Such unique properties of the frequency domain make it appealing for reducing information redundancy, thus creating a nontrivial and meaningful self-supervisory task.
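The FFT → mask → iFFT corruption step described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (the helper name `mfm_corrupt`, the circular cutoff, and the radius are hypothetical choices, not the authors' implementation):

```python
import numpy as np

def mfm_corrupt(image, radius=16, low_pass=True):
    """Sketch of the MFM corruption step: FFT the image, keep only the
    frequencies passed by a circular low-/high-pass filter, then iFFT back
    to the spatial domain. `radius` is the cutoff in frequency bins
    (a hypothetical parameter choice)."""
    # Shift the zero-frequency component to the spectrum center.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Circular filter around the spectrum center; True = frequency kept.
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    keep = dist <= radius if low_pass else dist > radius

    # Attenuate the masked-out frequencies, then transform back.
    corrupted = np.fft.ifft2(np.fft.ifftshift(spectrum * keep)).real
    return corrupted, keep

img = np.random.default_rng(0).random((64, 64))
low, _ = mfm_corrupt(img, radius=8, low_pass=True)    # smooth structure
high, _ = mfm_corrupt(img, radius=8, low_pass=False)  # outlines/details
```

Because the FFT is linear and the two filters are complementary, the low-pass and high-pass corrupted images sum back to the original, which is one way to see that the two masking modes expose disjoint frequency information.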
Our contributions are summarized as follows:
1) We propose a new masked frequency modeling task to pre-train visual encoders in a self-supervised manner. Our MFM is agnostic to the architecture, and we demonstrate the flexibility of applying MFM to both ViT and CNN families.
2) We contribute the first study of low-level corruption tasks for self-supervised learning (SSL) in the frequency domain. We investigate the effectiveness of corruption strategies commonly adopted in low-level image processing tasks (i.e., SR, deblurring and denoising) for SSL from a unified frequency perspective, and reveal that the representation learning capability of these corruption tasks actually depends on the architecture: they can achieve comparable and even better results than their supervised counterpart on ViT, but no gains are observed on CNN.
3) Extensive experiments show that our MFM can achieve competitive performance among existing MIM approaches on downstream tasks, such as image classification and semantic segmentation, while using neither mask tokens nor other more complex designs. Further analysis on several robustness benchmarks also exhibits more appealing robustness of the studied corruption tasks compared with MIM.



Figure 1: Comparison of masking recipes in Masked Language Modeling (MLM), Masked Image Modeling (MIM), low-level image processing and Masked Frequency Modeling (MFM). Note the differences of masked information among MIM, low-level image processing and MFM.

