PATCH-LEVEL CONTRASTING WITHOUT PATCH CORRESPONDENCE FOR ACCURATE AND DENSE CONTRASTIVE REPRESENTATION LEARNING

Abstract

We propose ADCLR: Accurate and Dense Contrastive Representation Learning, a novel self-supervised learning framework for learning accurate and dense vision representations. To extract spatially sensitive information, ADCLR introduces query patches for contrasting in addition to global contrasting. Compared with previous dense contrasting methods, ADCLR enjoys three main merits: i) it achieves both global-discriminative and spatially sensitive representations; ii) it is model-efficient (no extra parameters beyond the global contrasting baseline); and iii) it is correspondence-free and thus simpler to implement. Our approach achieves new state-of-the-art performance among contrastive methods. On classification tasks with ViT-S, ADCLR achieves 77.5% top-1 accuracy on ImageNet with linear probing, outperforming our baseline (DINO, without our devised techniques as plug-ins) by 0.5%. With ViT-B, ADCLR achieves 79.8% and 84.0% accuracy on ImageNet under linear probing and fine-tuning, outperforming iBOT by 0.3% and 0.2%, respectively. For dense tasks on MS-COCO, ADCLR achieves significant improvements of 44.3% AP on object detection and 39.7% AP on instance segmentation, outperforming the previous SOTA method SelfPatch by 2.2% and 1.2%, respectively. On ADE20K, ADCLR outperforms SelfPatch by 1.0% mIoU and 1.2% mAcc on the segmentation task.

1. INTRODUCTION

Self-supervised representation learning (SSL) has been attracting increasing attention in deep learning, whereby a prediction problem is formulated as a pretext task for pre-training with unlabeled data. SSL methods can mainly be divided into three categories: 1) Generative approaches (Goodfellow et al., 2014) learn to generate samples in the input space; however, generation can be computationally expensive and may not be necessary for representation learning. 2) Contextual methods (Gidaris et al., 2018) design pretext tasks such as denoising auto-encoders (Vincent et al., 2008) and context auto-encoders (Zhang et al., 2016). 3) Contrastive methods (Jin et al., 2022; Zhang et al., 2022; Chen et al., 2021; Caron et al., 2021) take augmented views of the same image as positive pairs and other images as negative pairs. Contrastive methods have shown great promise, e.g., in image classification and detection, video classification (Caron et al., 2021), and other tasks (Chen et al., 2021).

It has recently been shown (Chen et al., 2020a; Wang et al., 2021) that existing contrastive learning generally aims to learn global-discriminative features, which may lack spatial sensitivity (Yi et al., 2022); this limits performance on downstream fine-tuning tasks, especially dense vision tasks such as detection and segmentation. Consequently, object-level (Wei et al., 2021; Hénaff et al., 2022) and pixel-level (Xie et al., 2021c; Wang et al., 2021) contrastive objectives and frameworks have been proposed. Meanwhile, with the success of recent ViT-based visual backbones (Dosovitskiy et al., 2020; Liu et al., 2021), patch-level contrastive approaches (Yun et al., 2022) have been devised, which
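The contrastive objective shared by these methods is typically an InfoNCE-style loss: pull an anchor's representation toward its positive (an augmented view of the same image) and push it away from negatives (other images). The following is a minimal pure-Python sketch of this loss for a single anchor; the `cosine` and `info_nce` helpers are our own illustrative names, not taken from any of the cited works.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor:
    -log( exp(s_pos/t) / (exp(s_pos/t) + sum_i exp(s_neg_i/t)) ).
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

# Toy example: the positive is close to the anchor, the negatives are not,
# so the loss should be near zero.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss = info_nce(anchor, positive, negatives)
```

In practice the loss is computed in batch over normalized embeddings; the same form applies at the image, object, pixel, or patch level, with only the granularity of the contrasted representations changing.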

Funding

This work was in part supported by NSFC (62222607), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and SenseTime Collaborative Research Grant.

