EXPLICIT HOMOGRAPHY ESTIMATION IMPROVES CONTRASTIVE SELF-SUPERVISED LEARNING

Abstract

The typical contrastive self-supervised algorithm uses a similarity measure in latent space as the supervision signal by contrasting positive and negative images, directly or indirectly. Although the utility of self-supervised algorithms has improved recently, bottlenecks such as the compute required still hinder their widespread use. In this paper, we propose a module that serves as an additional objective in the self-supervised contrastive learning paradigm. We show that including this module, which regresses the parameters of an affine transformation or homography, alongside the original contrastive objective improves both performance and learning speed. Importantly, we ensure that this module does not enforce invariance to the various components of the affine transform, as this is not always ideal. We demonstrate the effectiveness of the additional objective on two recent, popular self-supervised algorithms. We perform an extensive experimental analysis of the proposed method and show an improvement in performance for all considered datasets. Further, we find that although both the general homography and the affine transformation suffice to improve performance and convergence, the affine transformation performs better in all cases.
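The auxiliary objective described in the abstract can be sketched as follows. This is a minimal illustration only: it assumes a mean-squared-error regression term on the augmentation parameters and a simple weighted sum with the contrastive term; the function names and the `weight` coefficient are hypothetical, not values or names taken from the paper.

```python
import numpy as np

def affine_regression_loss(pred_params, true_params):
    # Mean-squared error between the parameters the network regresses and
    # the ground-truth parameters of the transform applied during
    # augmentation. The target is known without labels, since the
    # augmentation pipeline itself generated the transform.
    return float(np.mean((np.asarray(pred_params) - np.asarray(true_params)) ** 2))

def total_loss(contrastive_loss, pred_params, true_params, weight=1.0):
    # Hypothetical combined objective: the original contrastive term plus
    # the auxiliary parameter-regression term. `weight` is an illustrative
    # balancing coefficient, not a value from the paper.
    return contrastive_loss + weight * affine_regression_loss(pred_params, true_params)
```

Because the regression head must recover the transform parameters rather than ignore them, this auxiliary term avoids forcing full invariance to the affine components, consistent with the design goal stated above.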

1. INTRODUCTION

There is an ever-increasing pool of data, particularly unstructured data such as images, text, video, and audio. The vast majority of this data is unlabelled. The process of labelling is time-consuming, labour-intensive, and expensive. Such an environment makes algorithms that can leverage fully unlabelled data particularly useful and important. Such algorithms fall within the realm of unsupervised learning. A particular subset of unsupervised learning is known as Self-Supervised Learning (SSL). SSL is a paradigm in which the data itself provides a supervision signal to the algorithm. Somewhat related is another core area of research known as transfer learning (Wang et al., 2020). In the context of computer vision, this means being able to pre-train an encoder network offline on a large, varied dataset, followed by domain-specific fine-tuning on the bespoke task at hand. The state-of-the-art for many transfer learning applications remains dominated by supervised learning techniques (Tan et al., 2020; Martinez et al., 2019; Donahue et al., 2014; Girshick et al., 2014), in which models are pre-trained on a large labelled dataset. However, self-supervised learning techniques have more recently come to the fore as potential alternatives that perform similarly on downstream tasks while requiring no labelled data. Most self-supervised techniques create a supervision signal from the data itself in one of two ways. The first approach comprises techniques that define a pre-text task beforehand that a neural network is trained to solve, such as inpainting (Pathak et al., 2016) or a jigsaw puzzle (Noroozi & Favaro, 2016). In this way, the pre-text task is a kind of proxy that, if solved, should produce reasonable representations for downstream visual tasks such as image or video recognition, object detection, or semantic segmentation. The other approach is a class of techniques known as contrastive methods (Chen et al., 2020a; He et al., 2019; Chen et al., 2020b).
These methods minimise the distance (or maximise the similarity) between the latent representations of two augmented views of the same input image, while simultaneously maximising the distance between negative pairs. In this way, these methods enforce consistency regularisation (Sohn et al., 2020), a well-known approach to semi-supervised learning. These contrastive methods often outperform the pre-text task methods and are the current state-of-the-art in self-supervised learning. However, most of these contrastive methods have several
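As a concrete illustration of this positive/negative contrast, the objective can be sketched with the NT-Xent loss popularised by SimCLR (Chen et al., 2020a). The sketch below is a generic NumPy rendering, not code from this paper; the function name, batch shapes, and temperature value are illustrative assumptions.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Minimal NT-Xent sketch: z1 and z2 are (N, D) embeddings of two
    augmented views of the same N images. Each view's positive is the other
    view of the same image; all remaining 2N - 2 embeddings act as negatives."""
    z = np.concatenate([z1, z2], axis=0)                # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit-normalise -> cosine similarity
    sim = (z @ z.T) / temperature                       # (2N, 2N) similarity matrix
    np.fill_diagonal(sim, -np.inf)                      # exclude self-similarity
    n = len(z1)
    # Row i's positive is the other view of the same image.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Softmax cross-entropy against the positive index for each row.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

Minimising this loss pulls the two views of each image together while pushing apart all other pairs in the batch, which is the contrastive behaviour described above.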

