ON DATA-AUGMENTATION AND CONSISTENCY-BASED SEMI-SUPERVISED LEARNING

Abstract

Recently proposed consistency-based Semi-Supervised Learning (SSL) methods, such as the Π-model, temporal ensembling, the mean teacher, or virtual adversarial training, have advanced the state of the art in several SSL tasks. These methods can typically reach performance comparable to their fully supervised counterparts while using only a fraction of the labelled examples. Despite these methodological advances, the understanding of these methods is still relatively limited. In this text, we analyse (variations of) the Π-model in settings where analytically tractable results can be obtained. We establish links with Manifold Tangent Classifiers and demonstrate that the quality of the perturbations is key to obtaining reasonable SSL performance. Importantly, we propose a simple extension of the Hidden Manifold Model that naturally incorporates data-augmentation schemes and offers a framework for understanding and experimenting with SSL methods.

1. INTRODUCTION

Consider a dataset D = D_L ∪ D_U that is comprised of labelled samples D_L = {(x_i, y_i)}_{i∈I_L} as well as unlabelled samples D_U = {x_i}_{i∈I_U}. Semi-Supervised Learning (SSL) is concerned with the use of both the labelled and unlabelled data for training. In many scenarios, collecting labelled data is difficult, time-consuming, or expensive, so that the amount of labelled data can be relatively small when compared to the amount of unlabelled data. The main challenge of SSL is the design of methods that can exploit the information contained in the distribution of the unlabelled data (Zhu05; CSZ09). In modern high-dimensional settings that are common to computer vision, signal processing, Natural Language Processing (NLP) or genomics, standard graph/distance-based methods (BC01; ZG02; ZGL03; BNS06; DSST19) that are successful in low-dimensional scenarios are difficult to implement. Indeed, in high-dimensional spaces, it is often difficult to design sensible notions of distance that can be exploited within these methods. We refer the interested reader to the book-length treatments (Zhu05; CSZ09) for discussion of other approaches. The manifold assumption is the fundamental structural property that is exploited in most modern approaches to SSL: high-dimensional data samples lie in a small neighbourhood of a low-dimensional manifold (TP91; BJ03; Pey09; Cay05; RDV+11). In computer vision, the presence of this low-dimensional structure is instrumental to the success of (variational) autoencoders and generative adversarial networks: large datasets of images can often be parametrized by a relatively small number of degrees of freedom. Exploiting the unlabelled data to uncover this low-dimensional structure is crucial to the design of efficient SSL methods. A recent and independent evaluation of several modern methods for SSL can be found in (OOR+18).
It is found there that consistency-based methods (BAP14; SJT16; LA16; TV17; MMIK18; LZL+18; GSA+20), the topic of this paper, achieve state-of-the-art performance in many realistic scenarios.

Contributions: consistency-based semi-supervised learning methods have recently been shown to achieve state-of-the-art results. Despite these methodological advances, the understanding of these methods is still relatively limited when compared to the fully-supervised setting (SMG13; AS17; SBD+18; TZ15; SZT17). In this article, we do not propose a new SSL method. Instead, we analyse consistency-based methods in settings where analytically tractable results can be obtained, where the data samples lie in the neighbourhood of well-defined and tractable low-dimensional manifolds, and where simple and controlled experiments can be carried out. We establish links with Manifold Tangent Classifiers and demonstrate that consistency-based SSL methods are in general more powerful, since they can better exploit the local geometry of the data manifold if efficient data-augmentation/perturbation schemes are used. Furthermore, in section 4.1 we show that the popular Mean Teacher method and the conceptually simpler Π-model approach share the same solutions in the regime where the data-augmentations are small; this supports the often-reported claim that the data-augmentation schemes leveraged by recent SSL algorithms, as well as by fully unsupervised ones, are instrumental to their success. Finally, in section 4.3 we propose an extension of the Hidden Manifold Model (GMKZ19; GLK+20). This generative model allows us to investigate the properties of consistency-based SSL methods, taking into account the data-augmentation process and the underlying low-dimensionality of the data, in a simple and principled manner, and without relying on a specific dataset.
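The Mean Teacher method mentioned above maintains a second, "teacher" set of parameters as an exponential moving average (EMA) of the student's parameters, and penalizes disagreement between student and teacher predictions on augmented inputs. A minimal sketch of the EMA update (illustrative only, not the paper's code; the decay rate `alpha` is a typical but assumed choice):

```python
import numpy as np

def ema_update(theta_teacher, theta_student, alpha=0.99):
    """Mean Teacher parameter update:
    theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student."""
    return alpha * theta_teacher + (1.0 - alpha) * theta_student
```

Note that with alpha = 0 the teacher coincides with the student, in which case the consistency target reduces to that of the Π-model.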
To gain an understanding of SSL, as well as of self-supervised learning methods, we believe it important to develop a framework that (i) can take into account the geometry of the data, (ii) allows the study of the influence of the quality of the data-augmentation schemes, and (iii) does not rely on any particular dataset. While the understanding of fully-supervised methods has largely been driven by the analysis of simplified model architectures (e.g. linear and two-layer models, large-dimension asymptotics such as the Neural Tangent Kernel), these analytical tools alone are unlikely to be enough to explain the mechanisms responsible for the success of SSL and self-supervised learning methods (CKNH20; GSA+20), since they do not, and cannot easily be extended to, account for the geometry of the data and data-augmentation schemes. Our proposed framework offers a small step in that direction.

2. CONSISTENCY-BASED SEMI-SUPERVISED LEARNING

For concreteness and clarity of exposition, we focus the discussion on classification problems. The arguments described in the remainder of this article can be adapted without difficulty to other situations such as regression or image segmentation. Assume that the samples x_i ∈ X ⊂ R^D can be represented as D-dimensional vectors and that the labels belong to C ≥ 2 possible classes, y_i ∈ Y ≡ {1, . . . , C}. Consider a mapping F_θ : R^D → R^C parametrized by θ ∈ Θ ⊂ R^{|Θ|}. This can be a neural network, although that is not necessary. For x ∈ X, the quantity F_θ(x) can represent the probabilistic output of the classifier or, for example, the pre-softmax activations. Empirical risk minimization consists in minimizing the function L_L(θ) = (1/|D_L|) ∑_{i∈I_L} ℓ(F_θ(x_i), y_i) for a loss function ℓ : R^C × Y → R. Maximum likelihood estimation corresponds to choosing the loss function as the cross-entropy. The optimal parameter θ ∈ Θ is found by a variant of stochastic gradient descent (RM51) with estimated gradient ∇_θ (1/|B_L|) ∑_{i∈B_L} ℓ(F_θ(x_i), y_i) for a mini-batch B_L of labelled samples.

Consistency-based SSL algorithms regularize the learning by enforcing that the learned function x → F_θ(x) respects local derivative and invariance constraints. For simplicity, assume that the mapping x → F_θ(x) is deterministic, although the use of dropout (SHK+14) and other sources of stochasticity is popular in practice. The Π-model (LA16; SJT16) makes use of a stochastic mapping S : X × Ω → X that maps a sample x ∈ X and a source of randomness ω ∈ Ω ⊂ R^{d_Ω} to another sample S_ω(x) ∈ X. The mapping S describes a stochastic data-augmentation process. In computer vision, popular data-augmentation schemes include random translations, rotations, dilations, crops, flips, elastic deformations, colour jittering, the addition of speckle noise, and many other domain-specific variants.
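The objective described above can be sketched in a few lines. The following is an illustrative toy implementation, not the paper's code: the linear "network" and the Gaussian-noise augmentation are stand-ins for whatever architecture F_θ and augmentation scheme S_ω one actually uses, and the weight `lam` on the consistency term is an assumed hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(x, theta):
    """Toy classifier F_theta: linear map to C pre-softmax activations, then softmax."""
    logits = x @ theta                                  # (n, D) @ (D, C) -> (n, C)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def s_omega(x, sigma=0.1, rng=rng):
    """Stochastic augmentation S_omega: here simply additive Gaussian noise."""
    return x + sigma * rng.normal(size=x.shape)

def pi_model_loss(theta, x_lab, y_lab, x_all, lam=1.0):
    """Pi-model objective: supervised cross-entropy on labelled samples plus a
    consistency penalty that forces F_theta to agree on two independent
    augmentations of every (labelled and unlabelled) sample."""
    # Supervised term: cross-entropy over the labelled mini-batch.
    p = f_theta(x_lab, theta)
    ce = -np.mean(np.log(p[np.arange(len(y_lab)), y_lab] + 1e-12))
    # Consistency term: squared distance between predictions on two
    # independently augmented copies of all samples.
    p1 = f_theta(s_omega(x_all), theta)
    p2 = f_theta(s_omega(x_all), theta)
    consistency = np.mean((p1 - p2) ** 2)
    return ce + lam * consistency
```

In practice both terms are estimated on mini-batches, and the consistency term is the only place where the unlabelled samples enter the objective.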
In NLP, synonym replacement, insertion and deletion, and back-translation are often used, although it is often more difficult to implement these data-augmentation strategies. In a purely supervised setting, data-augmentation can be used as a

