MASKED SIAMESE CONVNETS: TOWARDS AN EFFECTIVE MASKING STRATEGY FOR GENERAL-PURPOSE SIAMESE NETWORKS

Abstract

Siamese networks are a popular self-supervised learning framework that learns useful representations without human supervision by encouraging representations to be invariant to distortions. Existing methods rely heavily on hand-crafted augmentations, which are not easy to adapt to new domains. To explore a domain-agnostic siamese network, we investigate using masking as the augmentation in siamese networks. To date, masking for siamese networks has only been shown useful with transformer architectures, e.g., MSN (Assran et al., 2022) and data2vec (Baevski et al., 2022). In this work, we identify the underlying problems of masking for siamese networks with arbitrary backbones, including ConvNets. We propose an effective masking strategy and demonstrate its effectiveness on various siamese network frameworks. Our method generally improves siamese networks' performance on few-shot image classification and object detection tasks.

1. INTRODUCTION

Self-supervised learning aims to learn useful representations from scalable unlabeled data without relying on human annotation. It has been widely used in natural language processing (Devlin et al., 2019; Zhang et al., 2022; Brown et al., 2020), speech recognition (van den Oord et al., 2018; Hsu et al., 2021; Schneider et al., 2019; Baevski et al., 2020) and other domains (Rives et al., 2021; Rong et al., 2020). Recently, self-supervised visual representation learning has also become an active research area. First introduced by Bromley et al. (1993), the siamese network (Chen et al., 2020a;b; He et al., 2020; Chen et al., 2020c; 2021; Caron et al., 2020; Grill et al., 2020; Chen & He, 2020; Wang et al., 2022; Zbontar et al., 2021; Bardes et al., 2021) is one promising approach among many self-supervised learning approaches and outperforms supervised counterparts on many visual benchmarks. It encourages the encoder to be invariant to human-designed augmentations, capturing only the essential features. In practice, siamese network methods in the vision domain rely on domain-specific augmentations, such as cropping, color jittering and Gaussian blur, which do not transfer well to other domains. It is therefore desirable to have a general augmentation approach for siamese networks that requires minimal domain knowledge and can generalize.

Among various augmentations, masking the input is one of the simplest and most effective choices, and it has been demonstrated to be useful for language (Devlin et al., 2019) and speech (Hsu et al., 2021). However, only with the recent success of vision transformers (ViTs) (Dosovitskiy et al., 2021; Touvron et al., 2021) could vision models leverage masking as a general augmentation. Self-supervised learning with masked inputs has demonstrated more scalable properties when combined with ViTs (He et al., 2021; Bao et al., 2021; Zhou et al., 2021; Baevski et al., 2022). Unfortunately, siamese networks with naive masking do not work well with arbitrary architectures, e.g., ConvNets (He et al., 2016; Liu et al., 2022). We identify the underlying issues behind masked siamese networks with ConvNets: ConvNets lack a mechanism to encode null information, and masking introduces parasitic edges. We propose a preprocessing procedure to solve these problems and present several general-purpose designs that allow siamese networks to benefit from masked inputs. Experiments show that siamese networks with ConvNet backbones can benefit from masked inputs with our masking strategy. We summarize our contributions below:

• We discuss the role of augmentations in siamese networks and explore masking as a general-purpose augmentation. We identify the underlying problems of masking for siamese networks with ConvNet backbones.

• We propose a preprocessing step to overcome the problems behind masked siamese networks with ConvNets, and we present a series of masking designs that allow masking to benefit siamese networks with ConvNet backbones.

• We propose Masked Siamese ConvNets (MSCN), an effective masking strategy for general-purpose siamese networks. Our method can be applied to various siamese network frameworks; it achieves competitive performance on few-shot image classification benchmarks and outperforms previous methods on object detection benchmarks.
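To make the "parasitic edges" issue concrete, the following minimal NumPy sketch (our own illustration, not the paper's implementation; all names and parameter values are hypothetical) applies naive zero-filling patch masking to a uniform image and measures the spurious intensity edges this creates at patch borders.

```python
import numpy as np

def naive_patch_mask(img, patch=8, ratio=0.25, seed=0):
    """Zero out a random fraction of non-overlapping square patches.

    This is the naive masking the text argues is problematic for
    ConvNets: zeroed regions carry no explicit 'null' signal, and the
    hard zero borders introduce spurious (parasitic) edges.
    """
    rng = np.random.default_rng(seed)
    h, w = img.shape
    gh, gw = h // patch, w // patch
    out = img.copy()
    for i in rng.choice(gh * gw, int(ratio * gh * gw), replace=False):
        r, c = divmod(i, gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return out

img = np.full((32, 32), 0.5, dtype=np.float32)   # uniform gray: no edges
masked = naive_patch_mask(img)

# Maximum neighboring-pixel difference, a crude edge detector.
edge = max(np.abs(np.diff(masked, axis=0)).max(),
           np.abs(np.diff(masked, axis=1)).max())
print(edge)  # 0.5 -- sharp boundaries that the clean image (edge = 0) never had
```

The clean image contains no edges at all, yet the masked view contains maximally strong ones at every mask boundary; a convolutional encoder trained to match the two views must either ignore edges entirely or be confused by these artificial ones.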

2. RELATED WORKS

2.1. SIAMESE NETWORKS

Self-supervised visual representation learning has become an active research area since it has shown superior performance over supervised counterparts in recent years. One promising approach is to learn useful representations by encouraging them to be invariant to augmentations, known as siamese networks or joint-embedding methods (Misra & van der Maaten, 2020; Chen et al., 2020a;b; He et al., 2020; Chen et al., 2020c; 2021; Caron et al., 2020; Grill et al., 2020; Chen & He, 2020; Wang et al., 2022; Zbontar et al., 2021; Bardes et al., 2021). These methods use different mechanisms to prevent collapse, and they all rely on carefully designed augmentations such as random resized cropping, color jittering, grayscale and Gaussian blur. These augmentations prevent the encoder from relying only on trivial features. Empirically, siamese networks with these standard augmentation settings usually work well with arbitrary architectures, including ResNets (He et al., 2016) and ViTs (Dosovitskiy et al., 2021). Their representations are label-efficient (Assran et al., 2021; 2022), more robust (Hendrycks et al., 2019), and have improved fairness (Goyal et al., 2022). In addition, siamese networks have been demonstrated to benefit from scalable data (Goyal et al., 2021a).
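The invariance objective described above can be sketched schematically (a toy illustration with made-up components, not any specific framework's implementation): two randomly augmented views of one input pass through a shared encoder, and the loss pulls the two embeddings together. Here `augment`, `encode`, and the noise-based augmentation are hypothetical stand-ins for the hand-crafted pipelines listed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, rng):
    """Toy stand-in for a hand-crafted augmentation pipeline
    (crop / jitter / blur): a random brightness scale plus noise."""
    return x * rng.uniform(0.8, 1.2) + rng.normal(0.0, 0.05, x.shape)

def encode(x, W):
    """Shared 'encoder': one linear layer; both views use the same W."""
    return x.reshape(-1) @ W

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

x = rng.random((8, 8))            # one unlabeled input
W = rng.normal(size=(64, 16))     # shared encoder weights

z1 = encode(augment(x, rng), W)   # embedding of view 1
z2 = encode(augment(x, rng), W)   # embedding of view 2

# The siamese objective pushes cosine(z1, z2) toward 1 (invariance);
# each framework adds its own mechanism to prevent collapse.
loss = 1.0 - cosine(z1, z2)
```

The collapse-prevention mechanism (contrastive negatives, stop-gradient, redundancy reduction, etc.) is what distinguishes the frameworks cited above; the invariance term itself is common to all of them.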

2.2. REPRESENTATION LEARNING WITH MASKED INPUTS

Masking the input is one of the simplest ways to corrupt the input information and can be applied to a wide range of data types. It has mostly been used in two scenarios. The first is denoising autoencoder frameworks (Vincent et al., 2008; 2010). Motivated by its success in NLP (Devlin et al., 2019; Brown et al., 2020) with transformers (Vaswani et al., 2017), various visual representation learning methods using ViTs (He et al., 2021; Bao et al., 2021; Zhou et al., 2021) have also shown benefits from masked inputs. These methods have proven to be a promising general-purpose self-supervised learning approach. The second is siamese networks, which can benefit from masked inputs (Baevski et al., 2022; Assran et al., 2021), where masking serves as an extra augmentation. These methods learn more transferable representations and have the benefit of being label-efficient. However, these works are limited to ViT architectures; no previous work has shown that the masking approach can work equally well with arbitrary ConvNets.

3. AUGMENTATIONS FOR SIAMESE NETWORKS

In this section, we discuss the role of augmentations in siamese networks and outline several design principles which will be used to guide our masking strategy.

