EXPLICIT HOMOGRAPHY ESTIMATION IMPROVES CONTRASTIVE SELF-SUPERVISED LEARNING

Abstract

The typical contrastive self-supervised algorithm uses a similarity measure in latent space as the supervision signal by contrasting positive and negative images, directly or indirectly. Although the utility of self-supervised algorithms has improved recently, bottlenecks such as the computational resources required still hinder their widespread use. In this paper, we propose a module that serves as an additional objective in the contrastive self-supervised learning paradigm. We show that including this module to regress the parameters of an affine transformation or homography, in addition to the original contrastive objective, improves both performance and learning speed. Importantly, we ensure that this module does not enforce invariance to the various components of the affine transformation, as this is not always desirable. We demonstrate the effectiveness of the additional objective on two recent, popular self-supervised algorithms. We perform an extensive experimental analysis of the proposed method and show an improvement in performance on all considered datasets. Further, we find that although both the general homography and the affine transformation suffice to improve performance and convergence, the affine transformation performs better in all cases.

1. INTRODUCTION

There is an ever-increasing pool of data, particularly unstructured data such as images, text, video, and audio. The vast majority of this data is unlabelled, and the process of labelling is time-consuming, labour-intensive, and expensive. Such an environment makes algorithms that can leverage fully unlabelled data particularly useful and important. Such algorithms fall within the realm of unsupervised learning. A particular subset of unsupervised learning is known as Self-Supervised Learning (SSL). SSL is a paradigm in which the data itself provides a supervision signal to the algorithm. Somewhat related is another core area of research known as transfer learning (Wang et al., 2020). In the context of computer vision, this means being able to pre-train an encoder network offline on a large, varied dataset, followed by domain-specific fine-tuning on the bespoke task at hand. The state-of-the-art for many transfer learning applications remains dominated by supervised learning techniques (Tan et al., 2020; Martinez et al., 2019; Donahue et al., 2014; Girshick et al., 2014), in which models are pre-trained on a large labelled dataset. However, self-supervised learning techniques have more recently come to the fore as potential alternatives that perform similarly on downstream tasks while requiring no labelled data. Most self-supervised techniques create a supervision signal from the data itself in one of two ways. One approach comprises techniques that define a pre-text task beforehand that a neural network is trained to solve, such as inpainting (Pathak et al., 2016) or a jigsaw puzzle (Noroozi & Favaro, 2016). In this way, the pre-text task is a kind of proxy that, if solved, should produce reasonable representations for downstream visual tasks such as image or video recognition, object detection, or semantic segmentation. The other approach is a class of techniques known as contrastive methods (Chen et al., 2020a; He et al., 2019; Chen et al., 2020b).
These methods minimise the distance (or maximise the similarity) between the latent representations of two augmented views of the same input image, while simultaneously maximising the distance between negative pairs. In this way, these methods enforce consistency regularisation (Sohn et al., 2020), a well-known approach to semi-supervised learning. Contrastive methods often outperform the pre-text task methods and are the current state-of-the-art in self-supervised learning. However, most contrastive methods have several drawbacks, such as requiring prohibitively large batch sizes or memory banks in order to retrieve the negative pairs of samples (Chen et al., 2020a; He et al., 2019).

The intuition behind our proposed module is that any system tasked with understanding images can benefit from understanding the geometry of the image and the objects within it. An affine transformation is a geometric transformation that preserves parallelism of lines. It can be composed of any sequence of rotation, translation, shearing, and scaling. A homography generalises this notion to include perspective warping: a homography need not preserve parallelism of lines, but it ensures lines remain straight. Mathematically, a homography takes the form shown in Equation 1. It has 8 degrees of freedom (the bottom-right entry is fixed to 1) and is applied to a vector in homogeneous coordinates. An affine transformation has the same form, but with the added constraint that $\varphi_{3,1} = \varphi_{3,2} = 0$.

$$H_\varphi = \begin{bmatrix} \varphi_{1,1} & \varphi_{1,2} & \varphi_{1,3} \\ \varphi_{2,1} & \varphi_{2,2} & \varphi_{2,3} \\ \varphi_{3,1} & \varphi_{3,2} & 1 \end{bmatrix} \tag{1}$$

The ability to recover how a source image was transformed into a target image implies that something has been learned about the geometry of that image. An affine transformation or, more generally, a homography is a natural way to encode this idea. Forcing the network to estimate the parameters of a random homography applied to the source images thereby forces it to learn semantics about the geometry.
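To make the distinction between the two transformations concrete, the following sketch applies a 3x3 homography to 2-D points in homogeneous coordinates and shows the affine special case. The specific matrix values are illustrative only and are not from the paper.

```python
import numpy as np

def apply_homography(H, points):
    """Apply a 3x3 homography to an (N, 2) array of 2-D points.

    Points are lifted to homogeneous coordinates, transformed, and
    projected back via the perspective divide by the third coordinate.
    """
    n = points.shape[0]
    homogeneous = np.hstack([points, np.ones((n, 1))])  # (N, 3)
    transformed = homogeneous @ H.T                      # (N, 3)
    return transformed[:, :2] / transformed[:, 2:3]      # perspective divide

# A homography with 8 free parameters; the bottom-right entry is fixed to 1.
H = np.array([[1.0,  0.1,  5.0],
              [0.2,  0.9, -3.0],
              [1e-3, 2e-3, 1.0]])

# The affine special case: phi_{3,1} = phi_{3,2} = 0, so parallelism is kept.
A = H.copy()
A[2, :2] = 0.0

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
warped_h = apply_homography(H, square)  # parallel edges may converge
warped_a = apply_homography(A, square)  # parallel edges stay parallel
```

Under the affine matrix the bottom and top edges of the warped square remain parallel (their cross product is zero), whereas the full homography generally breaks this property while keeping the edges straight.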
This geometric information can supplement the signal provided by a contrastive loss, or loss in the latent space.

In this paper, we propose an additional module that can be used in tandem with contrastive self-supervised learning techniques to augment the contrastive objective (the additional module is highlighted in Figure 1). The module is simple, model-agnostic, and can be used to supplement a contrastive algorithm to improve performance and accelerate convergence. The module is essentially an additional stream of the network whose objective is to regress the parameters of an affine transformation or homography. The network must therefore solve a multi-task objective: (1) minimising the original contrastive objective, and (2) learning the parameters of a homography applied to one of the input images from a vector difference of their latent representations. We force the latent space to encode the geometric transformation by regressing the parameters of the transformation with an MLP that takes as input the vector difference between the latent representations of an input, x, and its transformed analogue, x′. By including the information in this way, the network is not invariant to the components of the transformation but is still able to use them as a self-supervised signal for learning. Moreover, this approach serves as a novel hybrid of the pre-text tasks and contrastive learning by enforcing consistency regularisation (Sohn et al., 2020). Through extensive empirical studies, we show that the additional objective of regressing the transformation parameters serves as a useful supplementary task for self-supervised contrastive learning, improving linear evaluation accuracy and convergence speed on all considered datasets. The remainder of the paper is structured as follows.
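The regression stream described above can be sketched as follows. All names, dimensions, and weights here are illustrative placeholders (the paper does not specify them): an MLP head takes the difference of the two latent vectors and outputs the transformation parameters, and its mean-squared error is a plausible form for the auxiliary loss added to the contrastive objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: latent dimension d, and 8 transform parameters
# (a full homography; an affine transformation would need only 6).
d, n_params = 128, 8

# Toy stand-ins for the encoder outputs of an image x and its warped
# view x'; in practice these come from the shared backbone network.
z = rng.normal(size=(1, d))
z_prime = rng.normal(size=(1, d))

# A two-layer MLP regression head operating on the vector difference,
# mirroring the module described above. Weights are random placeholders.
W1, b1 = rng.normal(size=(d, 64)) * 0.01, np.zeros(64)
W2, b2 = rng.normal(size=(64, n_params)) * 0.01, np.zeros(n_params)

def regress_params(z, z_prime):
    diff = z - z_prime                   # encodes the relative geometry
    h = np.maximum(diff @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2                   # predicted transform parameters

phi_true = rng.normal(size=(1, n_params))  # parameters of the applied warp
phi_pred = regress_params(z, z_prime)

# Auxiliary objective: regression error on the parameters, which would be
# added to the usual contrastive loss (with some weighting) during training.
reg_loss = np.mean((phi_pred - phi_true) ** 2)
```

Because the head consumes the *difference* of the two representations rather than collapsing them together, the latent space must retain the transformation components instead of becoming invariant to them, which is the behaviour the method relies on.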
In Section 2, we cover the related work in the area of self-supervised learning, going into detail where necessary. In Section 3, we detail our



Figure 1: Proposed architecture. The highlighted box indicates the proposed additional module tasked with regressing the parameters of an affine transformation or homography.

