ON DATA-AUGMENTATION AND CONSISTENCY-BASED SEMI-SUPERVISED LEARNING

Abstract

Recently proposed consistency-based Semi-Supervised Learning (SSL) methods, such as the Π-model, temporal ensembling, the mean teacher, and virtual adversarial training, have advanced the state of the art in several SSL tasks. These methods can typically reach performance comparable to that of their fully supervised counterparts while using only a fraction of the labelled examples. Despite these methodological advances, the understanding of these methods is still relatively limited. In this text, we analyse (variations of) the Π-model in settings where analytically tractable results can be obtained. We establish links with Manifold Tangent Classifiers and demonstrate that the quality of the perturbations is key to obtaining reasonable SSL performance. Importantly, we propose a simple extension of the Hidden Manifold Model that naturally incorporates data-augmentation schemes and offers a framework for understanding and experimenting with SSL methods.

1. INTRODUCTION

Consider a dataset $D = D_L \cup D_U$ that is comprised of labelled samples $D_L = \{x_i, y_i\}_{i \in I_L}$ as well as unlabelled samples $D_U = \{x_i\}_{i \in I_U}$. Semi-Supervised Learning (SSL) is concerned with the use of both the labelled and unlabelled data for training. In many scenarios, collecting labelled data is difficult, time-consuming, or expensive, so that the amount of labelled data can be relatively small when compared to the amount of unlabelled data. The main challenge of SSL is in the design of methods that can exploit the information contained in the distribution of the unlabelled data (Zhu05; CSZ09). In modern high-dimensional settings that are common to computer vision, signal processing, Natural Language Processing (NLP) or genomics, standard graph/distance-based methods (BC01; ZG02; ZGL03; BNS06; DSST19) that are successful in low-dimensional scenarios are difficult to implement. Indeed, in high-dimensional spaces, it is often difficult to design sensible notions of distance that can be exploited within these methods. We refer the interested reader to the book-length treatments (Zhu05; CSZ09) for discussion of other approaches.

The manifold assumption is the fundamental structural property that is exploited in most modern approaches to SSL: high-dimensional data samples lie in a small neighbourhood of a low-dimensional manifold (TP91; BJ03; Pey09; Cay05; RDV+11). In computer vision, the presence of this low-dimensional structure is instrumental to the success of (variational) autoencoders and generative adversarial networks: large datasets of images can often be parametrized by a relatively small number of degrees of freedom. Exploiting the unlabelled data to uncover this low-dimensional structure is crucial to the design of efficient SSL methods.

A recent and independent evaluation of several modern methods for SSL can be found in (OOR+18).
It is found there that consistency-based methods (BAP14; SJT16; LA16; TV17; MMIK18; LZL+18; GSA+20), the topic of this paper, achieve state-of-the-art performance in many realistic scenarios.

Contributions: consistency-based semi-supervised learning methods have recently been shown to achieve state-of-the-art results. Despite these methodological advances, the understanding of these methods is still relatively limited when compared to the fully supervised setting (SMG13;

