MASKED SIAMESE CONVNETS: TOWARDS AN EFFECTIVE MASKING STRATEGY FOR GENERAL-PURPOSE SIAMESE NETWORKS

Abstract

Siamese networks are a popular self-supervised learning framework that learns useful representations without human supervision by encouraging representations to be invariant to distortions. Existing methods rely heavily on hand-crafted augmentations, which are not easy to adapt to new domains. To explore a domain-agnostic siamese network, we investigate using masking as the augmentation in siamese networks. To date, masking for siamese networks has only been shown useful with transformer architectures, e.g. MSN (Assran et al., 2022) and data2vec (Baevski et al., 2022). In this work, we identify the underlying problems of masking for siamese networks with arbitrary backbones, including ConvNets. We propose an effective masking strategy and demonstrate its effectiveness on various siamese network frameworks. Our method generally improves the performance of siamese networks on few-shot image classification and object detection tasks.

1. INTRODUCTION

Self-supervised learning aims to learn useful representations from scalable unlabeled data without relying on human annotation. It has been widely used in natural language processing (Devlin et al., 2019; Zhang et al., 2022; Brown et al., 2020), speech recognition (van den Oord et al., 2018; Hsu et al., 2021; Schneider et al., 2019; Baevski et al., 2020) and other domains (Rives et al., 2021; Rong et al., 2020). Recently, self-supervised visual representation learning has also become an active research area. First introduced by Bromley et al. (1993), the siamese network (Chen et al., 2020a;b; He et al., 2020; Chen et al., 2020; 2021; Caron et al., 2020; Grill et al., 2020; Chen & He, 2020; Wang et al., 2022; Zbontar et al., 2021; Bardes et al., 2021) is one promising approach among many self-supervised learning approaches and outperforms supervised counterparts on many visual benchmarks. It encourages the encoder to be invariant to human-designed augmentations, capturing only the essential features. In practice, siamese network methods in the vision domain rely on domain-specific augmentations, such as cropping, color jittering and Gaussian blur, which do not transfer well to other domains. It is therefore desirable to have a general augmentation approach for siamese networks that requires minimal domain knowledge and can generalize. Among various augmentations, masking the input is one of the simplest and most effective choices, and it has been demonstrated to be useful for language (Devlin et al., 2019) and speech (Hsu et al., 2021). However, not until the recent success of vision transformers (ViTs) (Dosovitskiy et al., 2021; Touvron et al., 2021) could vision models leverage masking as a general augmentation. Self-supervised learning with masked inputs has demonstrated more scalable properties when combined with ViTs (He et al., 2021; Bao et al., 2021; Zhou et al., 2021; Baevski et al., 2022).
Unfortunately, siamese networks with naive masking do not work well with arbitrary architectures, e.g., ConvNets (He et al., 2016; Liu et al., 2022). We identify the underlying issues behind masked siamese networks with ConvNets. We argue that ConvNets have no mechanism to encode null information; in addition, masking introduces parasitic edges. We propose a preprocessing procedure to solve these problems, and we present several general-purpose designs that allow siamese networks to benefit from masked inputs. Experiments
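To make the parasitic-edge issue concrete, the sketch below (a hypothetical illustration, not the paper's actual pipeline; `naive_patch_mask` and all parameter names are our own) zeroes out a random subset of image patches, the naive masking scheme used with ViTs. The abrupt jump from image intensities to zeros at every patch boundary is a sharp discontinuity that convolution kernels respond to as if it were a real edge in the scene.

```python
import numpy as np

def naive_patch_mask(image, patch_size=16, mask_ratio=0.5, seed=0):
    """Zero out a random subset of non-overlapping patches.

    Illustrative only: the zeroed regions create sharp intensity
    discontinuities ("parasitic edges") at patch boundaries, which
    convolution kernels pick up as spurious image structure.
    """
    c, h, w = image.shape
    gh, gw = h // patch_size, w // patch_size
    n_patches = gh * gw
    rng = np.random.default_rng(seed)
    masked = rng.choice(n_patches, size=int(mask_ratio * n_patches),
                        replace=False)
    out = image.copy()
    for idx in masked:
        i, j = divmod(int(idx), gw)
        out[:, i * patch_size:(i + 1) * patch_size,
               j * patch_size:(j + 1) * patch_size] = 0.0
    return out

# A constant image has no edges at all, yet after masking it contains
# 1.0 -> 0.0 jumps at patch borders: edges that exist only because of
# the mask, not the content.
img = np.ones((3, 64, 64), dtype=np.float32)
masked_img = naive_patch_mask(img)
```

A ViT can simply drop masked tokens (or replace them with a learned mask token), so it never sees these boundaries; a ConvNet slides its kernels over the full grid and has no analogous way to represent "no information here", which motivates the preprocessing step described above.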

