ON DATA-AUGMENTATION AND CONSISTENCY-BASED SEMI-SUPERVISED LEARNING

Abstract

Recently proposed consistency-based Semi-Supervised Learning (SSL) methods such as the Π-model, temporal ensembling, the mean teacher, or virtual adversarial training have advanced the state of the art in several SSL tasks. These methods can typically reach performances that are comparable to their fully supervised counterparts while using only a fraction of the labelled examples. Despite these methodological advances, the understanding of these methods is still relatively limited. In this text, we analyse (variations of) the Π-model in settings where analytically tractable results can be obtained. We establish links with Manifold Tangent Classifiers and demonstrate that the quality of the perturbations is key to obtaining reasonable SSL performances. Importantly, we propose a simple extension of the Hidden Manifold Model that naturally incorporates data-augmentation schemes and offers a framework for understanding and experimenting with SSL methods.

1. INTRODUCTION

Consider a dataset $\mathcal{D} = \mathcal{D}_L \cup \mathcal{D}_U$ that comprises labelled samples $\mathcal{D}_L = \{x_i, y_i\}_{i \in I_L}$ as well as unlabelled samples $\mathcal{D}_U = \{x_i\}_{i \in I_U}$. Semi-Supervised Learning (SSL) is concerned with the use of both the labelled and unlabelled data for training. In many scenarios, collecting labelled data is difficult, time-consuming or expensive, so that the amount of labelled data can be relatively small when compared to the amount of unlabelled data. The main challenge of SSL is in the design of methods that can exploit the information contained in the distribution of the unlabelled data (Zhu05; CSZ09).

In modern high-dimensional settings that are common to computer vision, signal processing, Natural Language Processing (NLP) or genomics, standard graph/distance based methods (BC01; ZG02; ZGL03; BNS06; DSST19) that are successful in low-dimensional scenarios are difficult to implement. Indeed, in high-dimensional spaces, it is often difficult to design sensible notions of distance that can be exploited within these methods. We refer the interested reader to the book-length treatments (Zhu05; CSZ09) for discussion of other approaches.

The manifold assumption is the fundamental structural property that is exploited in most modern approaches to SSL: high-dimensional data samples lie in a small neighbourhood of a low-dimensional manifold (TP91; BJ03; Pey09; Cay05; RDV+11). In computer vision, the presence of this low-dimensional structure is instrumental to the success of (variational) autoencoders and generative adversarial networks: large datasets of images can often be parametrized by a relatively small number of degrees of freedom. Exploiting the unlabelled data to uncover this low-dimensional structure is crucial to the design of efficient SSL methods. A recent and independent evaluation of several modern methods for SSL can be found in (OOR+18).
It is found there that consistency-based methods (BAP14; SJT16; LA16; TV17; MMIK18; LZL+18; GSA+20), the topic of this paper, achieve state-of-the-art performance in many realistic scenarios.

Contributions: consistency-based semi-supervised learning methods have recently been shown to achieve state-of-the-art results. Despite these methodological advances, the understanding of these methods is still relatively limited when compared to the fully-supervised setting (SMG13; AS17; SBD+18; TZ15; SZT17). In this article, we do not propose a new SSL method. Instead, we analyse consistency-based methods in settings where analytically tractable results can be obtained, where the data-samples lie in the neighbourhood of well-defined and tractable low-dimensional manifolds, and where simple and controlled experiments can be carried out. We establish links with Manifold Tangent Classifiers and demonstrate that consistency-based SSL methods are in general more powerful, since they can better exploit the local geometry of the data-manifold if efficient data-augmentation/perturbation schemes are used. Furthermore, in Section 4.1 we show that the popular Mean Teacher method and the conceptually simpler Π-model approach share the same solutions in the regime where the data-augmentations are small; this supports the often-reported claim that the data-augmentation schemes leveraged by recent SSL algorithms, as well as by fully unsupervised algorithms, are instrumental to their success. Finally, in Section 4.3 we propose an extension of the Hidden Manifold Model (GMKZ19; GLK+20). This generative model allows us to investigate the properties of consistency-based SSL methods, taking into account the data-augmentation process and the underlying low-dimensionality of the data, in a simple and principled manner, and without relying on a specific dataset.
To gain an understanding of SSL, as well as of self-supervised learning methods, we believe it is important to develop a framework that (i) can take into account the geometry of the data, (ii) allows the study of the influence of the quality of the data-augmentation schemes, and (iii) does not rely on any particular dataset. While the understanding of fully-supervised methods has largely been driven by the analysis of simplified model architectures (e.g. linear and two-layered models, large-dimension asymptotics such as the Neural Tangent Kernel), these analytical tools alone are unlikely to be enough to explain the mechanisms responsible for the success of SSL and self-supervised learning methods (CKNH20; GSA+20), since they do not account for the geometry of the data and the data-augmentation schemes, and cannot easily be extended to do so. Our proposed framework offers a small step in that direction.

2. CONSISTENCY-BASED SEMI-SUPERVISED LEARNING

For concreteness and clarity of exposition, we focus the discussion on classification problems. The arguments described in the remainder of this article can be adapted without any difficulty to other situations such as regression or image segmentation. Assume that the samples $x_i \in \mathcal{X} \subset \mathbb{R}^D$ can be represented as $D$-dimensional vectors and that the labels belong to $C \ge 2$ possible classes, $y_i \in \mathcal{Y} \equiv \{1, \ldots, C\}$. Consider a mapping $F_\theta : \mathbb{R}^D \to \mathbb{R}^C$ parametrized by $\theta \in \Theta \subset \mathbb{R}^{|\Theta|}$. This can be a neural network, although that is not necessary. For $x \in \mathcal{X}$, the quantity $F_\theta(x)$ can represent the probabilistic output of the classifier or, for example, the pre-softmax activations. Empirical risk minimization consists in minimizing the function

$$L_L(\theta) = \frac{1}{|\mathcal{D}_L|} \sum_{i \in I_L} \ell(F_\theta(x_i), y_i)$$

for a loss function $\ell : \mathbb{R}^C \times \mathcal{Y} \to \mathbb{R}$. Maximum likelihood estimation corresponds to choosing the loss function as the cross entropy. The optimal parameter $\theta \in \Theta$ is found by a variant of stochastic gradient descent (RM51) with estimated gradient $\nabla_\theta \{ \frac{1}{|B_L|} \sum_{i \in B_L} \ell(F_\theta(x_i), y_i) \}$ for a mini-batch $B_L$ of labelled samples.

Consistency-based SSL algorithms regularize the learning by enforcing that the learned function $x \mapsto F_\theta(x)$ respects local derivative and invariance constraints. For simplicity, assume that the mapping $x \mapsto F_\theta(x)$ is deterministic, although the use of dropout (SHK+14) and other sources of stochasticity are popular in practice. The Π-model (LA16; SJT16) makes use of a stochastic mapping $S : \mathcal{X} \times \Omega \to \mathcal{X}$ that maps a sample $x \in \mathcal{X}$ and a source of randomness $\omega \in \Omega \subset \mathbb{R}^{d_\Omega}$ to another sample $S_\omega(x) \in \mathcal{X}$. The mapping $S$ describes a stochastic data-augmentation process. In computer vision, popular data-augmentation schemes include random translations, rotations, dilations, crops, flips, elastic deformations, color jittering, addition of speckle noise, and many more domain-specific variants.
In NLP, synonym replacements, insertions and deletions, and back-translations are often used, although these data-augmentation strategies are often more difficult to implement. In a purely supervised setting, data-augmentation can be used as a regularizer. Instead of directly minimizing $L_L$, one can minimize $\theta \mapsto \frac{1}{|\mathcal{D}_L|} \sum_{i \in I_L} E_\omega[\ell(F_\theta[S_\omega(x_i)], y_i)]$. In practice, data-augmentation regularization, although a simple strategy, is often crucial to obtaining good generalization properties (PW17; CZM+18; LBC17; PCZ+19). The idea of regularizing by enforcing robustness to the injection of noise can be traced back at least to (Bis95).

In the Π-model, the data-augmentation mapping $S$ is used to define a consistency regularization term,

$$R(\theta) = \frac{1}{|\mathcal{D}|} \sum_{i \in I_L \cup I_U} E_\omega \big\{ \| F_\theta[S_\omega(x_i)] - F_{\theta_\star}(x_i) \|^2 \big\}. \tag{1}$$

The notation $\theta_\star$ designates a copy of the parameter $\theta$, i.e. $\theta_\star = \theta$, and emphasizes that when differentiating the consistency regularization term $\theta \mapsto R(\theta)$, one does not differentiate through $\theta_\star$. In practice, a stochastic estimate of $\nabla R(\theta)$ is obtained as follows. For a mini-batch $B$ of samples $\{x_i\}_{i \in B}$, the current value $\theta_\star \in \Theta$ of the parameter and the current predictions $f_i \equiv F_{\theta_\star}(x_i)$, the quantity $\nabla_\theta \{ \frac{1}{|B|} \sum_{i \in B} \| F_\theta[S_\omega(x_i)] - f_i \|^2 \}$ is an approximation of $\nabla R(\theta)$. There are many variants (e.g. use of different norms, different manners of injecting noise), but the general idea is to force the learned function $x \mapsto F_\theta(x)$ to be locally invariant to the data-augmentation scheme $S$. Several extensions such as the Mean Teacher (TV17) and the VAT (MMIK18) schemes have recently been proposed and have been shown to lead to good results in many SSL tasks. The recently proposed and state-of-the-art BYOL approach (GSA+20) relies on mechanisms that are very close to the consistency regularization methods discussed in this text.
If one recalls the manifold assumption, this approach is natural: since the samples corresponding to different classes lie on separate manifolds, the function $F_\theta : \mathcal{X} \to \mathbb{R}^C$ should be constant on each one of these manifolds. Since the correct value of $F_\theta$ is typically well approximated or known for labelled samples $(x_i, y_i) \in \mathcal{D}_L$, the consistency regularization term of equation 1 helps propagate these known values across these manifolds. This mechanism is similar to standard SSL graph-based approaches such as label propagation (ZG02). Graph-based methods are difficult to implement directly in computer vision or NLP when a meaningful notion of distance is not available. This interpretation reveals that it is crucial to include the labelled samples in the regularization term of equation 1 in order to help propagate the information contained in the labelled samples to the unlabelled samples.

Our numerical experiments suggest that, in the standard setting where the number of labelled samples is much lower than the number of unlabelled samples, i.e. $|\mathcal{D}_L| \ll |\mathcal{D}_U|$, the formulation of the consistency regularization in equation 1 leads to sub-optimal results and convergence issues: the information contained in the labelled data is swamped by the number of unlabelled samples. In all our experiments, we have adopted instead the following regularization term

$$R(\theta) = \frac{1}{|\mathcal{D}_L|} \sum_{i \in I_L} E_\omega \big\{ \| F_\theta[S_\omega(x_i)] - F_{\theta_\star}(x_i) \|^2 \big\} + \frac{1}{|\mathcal{D}_U|} \sum_{j \in I_U} E_\omega \big\{ \| F_\theta[S_\omega(x_j)] - F_{\theta_\star}(x_j) \|^2 \big\} \tag{2}$$

that balances the labelled and unlabelled data samples more efficiently. Furthermore, it is clear that the quality and variety of the data-augmentation scheme $S : \mathcal{X} \times \Omega \to \mathcal{X}$ is pivotal to the success of consistency-based SSL methods. We argue in this article that it is the dominant factor contributing to the success of this class of methods. Effort spent on building efficient local data-augmentation schemes will be rewarded in terms of generalization performance.
Designing good data-augmentation schemes is an efficient manner of injecting expert/prior knowledge into the learning process. It is done by leveraging the understanding of the local geometry of the data manifold. As usual and not surprisingly (NGP98; MHF + 12), in data-scarce settings, any type of domain-knowledge needs to be exploited and we argue that consistency regularization approaches to SSL are instances of this general principle.
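
The difference between the pooled regularizer of equation 1 and the balanced regularizer of equation 2 can be illustrated with a minimal NumPy sketch. The linear model, the additive Gaussian augmentation and all function names below are illustrative assumptions, not the setup used in the experiments; since NumPy has no automatic differentiation, the frozen copy of the parameter is only represented by computing the target once.

```python
import numpy as np

def model(theta, x):
    """Toy linear map F_theta: R^D -> R^C (an assumption; any model could be used)."""
    return x @ theta

def augment(x, rng, eps=0.1):
    """Hypothetical augmentation S_omega: additive isotropic Gaussian noise."""
    return x + eps * rng.standard_normal(x.shape)

def consistency(theta, x, rng, n_aug=8):
    """Monte-Carlo estimate of (1/|B|) sum_i E_omega ||F(S_omega(x_i)) - F(x_i)||^2."""
    target = model(theta, x)  # plays the role of the frozen prediction F_{theta_star}(x)
    total = 0.0
    for _ in range(n_aug):
        diff = model(theta, augment(x, rng)) - target
        total += np.mean(np.sum(diff ** 2, axis=1))
    return total / n_aug

def pooled_regularizer(theta, x_lab, x_unl, rng):
    """Equation 1: a single average over the pooled dataset."""
    return consistency(theta, np.vstack([x_lab, x_unl]), rng)

def balanced_regularizer(theta, x_lab, x_unl, rng):
    """Equation 2: labelled and unlabelled averages are added with equal weight."""
    return consistency(theta, x_lab, rng) + consistency(theta, x_unl, rng)

rng = np.random.default_rng(0)
theta = rng.standard_normal((5, 3))
x_lab = rng.standard_normal((10, 5))     # few labelled samples
x_unl = rng.standard_normal((1000, 5))   # many unlabelled samples

r_pool = pooled_regularizer(theta, x_lab, x_unl, rng)
r_bal = balanced_regularizer(theta, x_lab, x_unl, rng)

# Share of the regularizer carried by labelled samples in equation 1 (one half in equation 2).
lab_share_pooled = len(x_lab) / (len(x_lab) + len(x_unl))
```

With 10 labelled and 1000 unlabelled samples, the labelled points account for about 1% of the pooled average, whereas they carry half of the balanced term.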

3. APPROXIMATE MANIFOLD TANGENT CLASSIFIER

It has long been known (SLDV98) that exploiting the knowledge of derivatives, or more generally enforcing local invariance properties, can greatly enhance the performance of standard classifiers/regressors (HK02; CS02). In the context of deep-learning, the Manifold Tangent Classifier (RDV+11) is yet another illustration of this idea. Consider the data manifold $\mathcal{M} \subset \mathcal{X} \subset \mathbb{R}^D$ and assume that the data samples lie in a neighbourhood of it. For $x \in \mathcal{M}$, consider as well the tangent plane $T_x$ to $\mathcal{M}$ at $x$. Assuming that the manifold $\mathcal{M}$ is of dimension $1 \le d \le D$, the tangent plane $T_x$ is also of dimension $d$, with an orthonormal basis $e^x_1, \ldots, e^x_d \in \mathbb{R}^D$. This informally means that, for suitably small coefficients $\omega_1, \ldots, \omega_d \in \mathbb{R}$, the transformed sample $\tilde{x} \in \mathcal{X}$ defined as $\tilde{x} = x + \sum_{j=1}^d \omega_j e^x_j$ also lies on, or is very close to, the data manifold $\mathcal{M}$. A possible stochastic data-augmentation scheme can therefore be defined as $S_\omega(x) = x + V_\omega$ where $V_\omega = \sum_{j=1}^d \omega_j e^x_j$. If $\omega$ is a multivariate $d$-dimensional centred Gaussian random vector with suitably small covariance matrix, the perturbation vector $V_\omega$ is also centred and normally distributed. To enforce that the function $x \mapsto F_\theta(x)$ is locally approximately constant along the manifold $\mathcal{M}$, one can thus penalize the derivatives of $F_\theta$ at $x$ in the directions $V_\omega$. Denoting by $J_x \in \mathbb{R}^{C,D}$ the Jacobian with respect to $x \in \mathbb{R}^D$ of $F_\theta$ at $x \in \mathcal{M}$, this can be implemented by adding a penalization term of the type

$$E_\omega[\| J_x V_\omega \|^2] = \mathrm{Tr}(\Gamma \, J_x^T J_x),$$

where $\Gamma \in \mathbb{R}^{D,D}$ is the covariance matrix of the random vector $\omega \mapsto V_\omega$. This type of regularization of the Jacobian along the data-manifold is for example used in (BNS06).
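More generally, if one assumes that for any $(x, \omega) \in \mathcal{X} \times \Omega$ we have $S_{\varepsilon\omega}(x) = x + \varepsilon D(x, \omega) + O(\varepsilon^2)$ for some derivative mapping $D : \mathcal{X} \times \Omega \to \mathcal{X}$, it follows that

$$\lim_{\varepsilon \to 0} \frac{1}{\varepsilon^2} E_\omega \big\{ \| F_\theta[S_{\varepsilon\omega}(x)] - F_\theta(x) \|^2 \big\} = E_\omega \big\{ \| J_x D(x, \omega) \|^2 \big\} = \mathrm{Tr}(\Gamma_{x,S} \, J_x^T J_x),$$

where $\Gamma_{x,S}$ is the covariance matrix of the $\mathcal{X}$-valued random vector $\omega \mapsto D(x, \omega) \in \mathcal{X}$, assumed centred. This shows that consistency-based methods can be understood as approximate Jacobian regularization methods, as proposed in (SLDV98; RDV+11).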
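
The link between the rescaled consistency term and the Jacobian penalty Tr(Γ JᵀJ) can be checked numerically. The sketch below is only an illustration under stated assumptions: the classifier F(x) = tanh(Wx), the randomly drawn tangent basis and the sample sizes are all hypothetical choices, and the perturbation is the tangent-plane scheme S(x) = x + ε V with V Gaussian in the span of the basis.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, d = 6, 3, 2            # ambient dimension, number of outputs, tangent dimension

W = rng.standard_normal((C, D))

def F(x):
    """Toy smooth classifier output F(x) = tanh(W x) (an illustrative assumption)."""
    return np.tanh(W @ x)

x = rng.standard_normal(D)
# Exact Jacobian of F at x, shape (C, D): diag(1 - tanh^2(W x)) @ W.
J = (1.0 - np.tanh(W @ x) ** 2)[:, None] * W

# Orthonormal basis of a hypothetical tangent plane, stored as the columns of E.
E, _ = np.linalg.qr(rng.standard_normal((D, d)))
Gamma = E @ E.T              # covariance of V = E @ omega with omega ~ N(0, I_d)

# Jacobian penalty Tr(Gamma J^T J).
trace_term = np.trace(Gamma @ J.T @ J)

# Monte-Carlo estimate of (1/eps^2) E ||F(x + eps V) - F(x)||^2 for a small eps.
eps = 1e-3
omega = rng.standard_normal((200_000, d))
V = omega @ E.T              # each row is one perturbation vector V_omega
diffs = np.tanh((x + eps * V) @ W.T) - F(x)
mc_term = np.mean(np.sum(diffs ** 2, axis=1)) / eps ** 2
```

For small ε the two quantities `trace_term` and `mc_term` agree up to Monte-Carlo error and an O(ε) correction.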

3.1. LIMITATIONS

In practice, even though many local dimension reduction techniques have been proposed, it is still relatively difficult to obtain a good parametrization of the data manifold. The Manifold Tangent Classifier (MTC) (RDV+11) implements this idea by first extracting, in an unsupervised manner, a good representation of the dataset $\mathcal{D}$ by using a Contractive Auto-Encoder (CAE) (RVM+11). This CAE can subsequently be leveraged to obtain an approximate basis of each tangent plane $T_{x_i}$ for $x_i \in \mathcal{D}$, which can then be used for penalizing the Jacobian of the mapping $x \mapsto F_\theta(x)$ in the direction of the tangent plane to $\mathcal{M}$ at $x$.

The above discussion shows that the somewhat simplistic approach consisting in adding isotropic Gaussian noise to the data samples is unlikely to deliver satisfying results. For small noise, it is equivalent to penalizing the squared Frobenius norm $\| J_x \|_F^2$ of the Jacobian of the mapping $x \mapsto F_\theta(x)$; in a linear model, that is equivalent to the standard ridge regularization. This mechanism does not take the local geometry of the data-manifold into account at all. Nevertheless, in medical imaging applications where scans are often contaminated by speckle noise, this class of approaches, which can be thought of as adding artificial speckle noise, can help mitigate over-fitting (DRS+18).

There are many situations where, because of data scarcity or the sheer difficulty of unsupervised representation learning in general, domain-specific data-augmentation schemes lead to much better regularization than Jacobian penalization. Furthermore, as schematically illustrated in Figure 1, Jacobian penalization techniques are not efficient at learning highly non-linear manifolds that are common, for example, in computer vision. For example, in "pixel space", a simple image translation is a highly non-linear transformation that is well approximated by a first-order expansion only for very small translations.
In other words, if $x \in \mathcal{X}$ represents an image and $g(x, v)$ is its translated version by a vector $v$, the approximation $g(x, v) \approx x + \nabla_v g(x)$, with $\nabla_v g(x) \equiv \lim_{\varepsilon \to 0} (g(x, \varepsilon v) - g(x)) / \varepsilon$, becomes poor as soon as the translation vector $v$ is not extremely small. In computer vision, translations, rotations and dilations are often used as the sole data-augmentation schemes: this leads to a poor local exploration of the data-manifold, since this type of transformation only generates a very low-dimensional exploration manifold. More precisely, the exploration manifold emanating from a sample $x_0 \in \mathcal{X}$, i.e. $\{S(x_0, \omega) : \omega \in \Omega\}$, is very low-dimensional: its dimension is much lower than the dimension $d$ of the data-manifold $\mathcal{M}$. Enriching the set of data-augmentation degrees of freedom with transformations such as elastic deformations or non-linear pixel intensity shifts is crucial to obtaining a high-dimensional local exploration manifold that can help propagate the information on the data-manifold efficiently (CZM+19; PCZ+19).
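
The breakdown of the first-order approximation of a translation can be seen on a one-dimensional toy "image". The sketch below assumes a Gaussian bump of width 0.5 sampled on a regular grid (both hypothetical choices) and compares the exact translate with its tangent approximation for three shift sizes.

```python
import numpy as np

t = np.linspace(-10.0, 10.0, 2001)   # pixel coordinates
WIDTH = 0.5                          # width of the bump (illustrative assumption)

def image(s):
    """A 1-D 'image': a narrow Gaussian bump centred at zero."""
    return np.exp(-s ** 2 / (2 * WIDTH ** 2))

def translate(v):
    """Exact translation g(x, v): the bump moved by v."""
    return image(t - v)

def first_order(v):
    """Tangent approximation x + v * (d/dv) g(x, v) at v = 0, i.e. image(t) - v * image'(t)."""
    d_image = -t / WIDTH ** 2 * image(t)   # analytic derivative image'(t)
    return image(t) - v * d_image

def rel_error(v):
    """Relative L2 error between the exact translate and its first-order approximation."""
    exact = translate(v)
    return np.linalg.norm(exact - first_order(v)) / np.linalg.norm(exact)

errors = {v: rel_error(v) for v in (0.05, 0.5, 2.0)}
```

The relative error is below one percent for a shift of a tenth of the bump width, but the approximation is already poor for a shift equal to the width and meaningless for larger shifts.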

4.1. FLUID LIMIT

Consider the standard Π-model trained with standard Stochastic Gradient Descent (SGD). Denote by $\theta_k \in \Theta$ the current value of the parameter and by $\eta > 0$ the learning rate. We have

$$\theta_{k+1} = \theta_k - \eta \, \nabla_\theta \Big\{ \frac{1}{|B_L|} \sum_{i \in B_L} \ell(F_{\theta_k}(x_i), y_i) + \frac{\lambda}{|B_L|} \sum_{j \in B_L} \| F_{\theta_k}(S_\omega[x_j]) - f_j \|^2 + \frac{\lambda}{|B_U|} \sum_{m \in B_U} \| F_{\theta_k}(S_\omega[x_m]) - f_m \|^2 \Big\} \tag{3}$$

for a parameter $\lambda > 0$ that controls the trade-off between supervised and consistency losses, as well as subsets $B_L$ and $B_U$ of labelled and unlabelled data samples, and $f_j \equiv F_{\theta_\star}(x_j)$ for $\theta_\star \equiv \theta_k$ as discussed in Section 2. The update term on the right-hand side is an unbiased estimate of $\eta \, \nabla_\theta [ L_L(\theta_k) + \lambda R(\theta_k) ]$ with variance of order $O(\eta^2)$, where the regularization term $R(\theta_k)$ is described in equation 2. It follows from standard fluid limit approximations (EK09)[Section 4.8] for Markov processes that, under mild regularity and growth assumptions and as $\eta \to 0$, the appropriately time-rescaled trajectory $\{\theta_k\}_{k \ge 0}$ can be approximated by the trajectory of the Ordinary Differential Equation (ODE) $\dot\theta_t = -\nabla [ L_L(\theta_t) + \lambda R(\theta_t) ]$.

The article (TV17) proposes the mean teacher model, an averaging approach related to the standard Polyak-Ruppert averaging scheme (Pol90; PJ92), which modifies the consistency regularization term of equation 2 by replacing the parameter $\theta_\star$ by an exponential moving average (EMA). In practical terms, this simply means that, instead of defining $f_j = F_{\theta_\star}(x_j)$ with $\theta_\star = \theta_k$ in equation 3, one sets $f_j = F_{\theta_{\mathrm{avg},k}}(x_j)$, where the EMA process $\{\theta_{\mathrm{avg},k}\}_{k \ge 0}$ is defined through the recursion $\theta_{\mathrm{avg},k} = (1 - \alpha \eta) \, \theta_{\mathrm{avg},k-1} + \alpha \eta \, \theta_k$ and the coefficient $\alpha > 0$ controls the time-scale of the averaging process. The use of the EMA process $\{\theta_{\mathrm{avg},k}\}_{k \ge 0}$ helps smooth out the stochasticity of the process $\theta_k$.
Similarly to Proposition 4.1, the joint process $(\theta^\eta_t, \theta^\eta_{\mathrm{avg},t}) \equiv (\theta^\eta_{[t/\eta]}, \theta^\eta_{\mathrm{avg},[t/\eta]})$ converges as $\eta \to 0$ to the solution of the following ordinary differential equation

$$\begin{cases} \dot\theta_t = -\nabla_\theta \big[ L_L(\theta_t) + \lambda R(\theta_t, \theta_{\mathrm{avg},t}) \big] \\ \dot\theta_{\mathrm{avg},t} = -\alpha \, (\theta_{\mathrm{avg},t} - \theta_t) \end{cases}$$

where the notation $R(\theta_t, \theta_{\mathrm{avg},t})$ designates the same quantity as the one described in equation 2, but with an emphasis on the dependency on the EMA process. At convergence $(\theta_t, \theta_{\mathrm{avg},t}) \to (\theta_\infty, \theta_{\mathrm{avg},\infty})$, the second equation forces $\theta_\infty = \theta_{\mathrm{avg},\infty}$, confirming that, in the regime of small learning rate $\eta \to 0$, the Mean Teacher method converges, albeit often more rapidly, towards the same solution as the more standard Π-model.
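
The shared stationary point of the two schemes can be illustrated on a scalar toy problem. The sketch below makes several simplifying assumptions that are not part of the original experiments: a linear model F_θ(x) = θx, one labelled pair with squared loss, one unlabelled sample, an additive Gaussian augmentation, and expected (noise-free) gradients in closed form, so that the iteration is a deterministic discretisation of the limiting ODE.

```python
import numpy as np

# Toy scalar model F_theta(x) = theta * x with augmentation S_w(x) = x + sigma * w, w ~ N(0, 1).
# All constants below are illustrative assumptions.
x_lab, y_lab = 1.0, 2.0      # one labelled pair, loss 0.5 * (theta * x - y)^2
x_unl = 1.5                  # one unlabelled sample carrying the consistency term
sigma, lam = 0.5, 1.0        # augmentation scale and regularization weight
eta, alpha = 1e-3, 5.0       # learning rate and EMA time-scale

def grad_supervised(theta):
    return x_lab * (theta * x_lab - y_lab)

def grad_consistency(theta, target_theta):
    """d/dtheta of E_w (theta * (x + sigma * w) - target_theta * x)^2, target held fixed."""
    return 2.0 * (theta * (x_unl ** 2 + sigma ** 2) - target_theta * x_unl ** 2)

def run(mean_teacher, n_steps=50_000):
    theta, theta_avg = 0.0, 0.0
    for _ in range(n_steps):
        # Pi-model: the target uses a frozen copy of theta; Mean Teacher: the EMA parameter.
        target = theta_avg if mean_teacher else theta
        theta = theta - eta * (grad_supervised(theta) + lam * grad_consistency(theta, target))
        theta_avg = (1.0 - alpha * eta) * theta_avg + alpha * eta * theta
    return theta, theta_avg

theta_pi, _ = run(mean_teacher=False)
theta_mt, theta_mt_avg = run(mean_teacher=True)

# Stationary point common to both schemes in this toy problem.
theta_star = x_lab * y_lab / (x_lab ** 2 + 2.0 * lam * sigma ** 2)
```

In this toy setting both runs end at the same value `theta_star`, and the EMA parameter coincides with the student parameter at convergence.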

4.2. MINIMIZERS ARE HARMONIC FUNCTIONS

To better understand the properties of the solutions, we consider a simplified setting further exploited in Section 4.3. Assume that $F : \mathcal{X} \equiv \mathbb{R}^D \to \mathbb{R}$ and $\mathcal{Y} \equiv \mathbb{R}$ and that, for every $y_i \in \mathcal{Y} \equiv \mathbb{R}$, the loss function $f \mapsto \ell(f, y_i)$ is uniquely minimized at $f = y_i$. We further assume that the data-manifold $\mathcal{M} \subset \mathbb{R}^D$ can be globally parametrized by a smooth and bijective mapping $\Phi : \mathbb{R}^d \to \mathcal{M} \subset \mathbb{R}^D$. Similarly to Section 2, we consider a data-augmentation scheme that can be described as $S_{\varepsilon\omega}(x) = \Phi(z + \varepsilon\omega)$ for $z = \Phi^{-1}(x)$ and a sample $\omega$ from an $\mathbb{R}^d$-valued centred and isotropic Gaussian distribution. We consider a finite set of labelled samples $\{x_i, y_i\}_{i \in I_L}$, with $x_i = \Phi(z_i)$ and $z_i \in \mathbb{R}^d$ for $i \in I_L$. We choose to model the large number of unlabelled data samples as a continuum distributed on the data manifold $\mathcal{M}$ as the push-forward measure $\Phi_\sharp \mu(dz)$, through the mapping $\Phi$, of a probability distribution $\mu(dz)$ whose support is $\mathbb{R}^d$. This means that an empirical average of the type $(1/|\mathcal{D}_U|) \sum_{i \in I_U} \varphi(x_i)$ can be replaced by $\int \varphi[\Phi(z)] \, \mu(dz)$. We investigate the regime $\varepsilon \to 0$ and, similarly to Section 2, the minimization of the consistency-regularized objective

$$L_L(\theta) + \frac{\lambda}{\varepsilon^2} \int_{\mathbb{R}^d} E_\omega \big\{ \| F_\theta[S_{\varepsilon\omega}(\Phi(z))] - F_\theta(\Phi(z)) \|^2 \big\} \, \mu(dz). \tag{6}$$

For notational convenience, set $f_\theta \equiv F_\theta \circ \Phi$. Since $S_{\varepsilon\omega}[\Phi(z)] = \Phi(z + \varepsilon\omega)$, as $\varepsilon \to 0$ the quantity $\frac{1}{\varepsilon^2} E_\omega \{ \| F_\theta[S_{\varepsilon\omega}(\Phi(z))] - F_\theta(\Phi(z)) \|^2 \}$ converges to $\| \nabla_z f_\theta \|^2$ and the objective function of equation 6 approaches the quantity

$$G(f_\theta) \equiv \frac{1}{|\mathcal{D}_L|} \sum_{i \in I_L} \ell(f_\theta(z_i), y_i) + \lambda \int_{\mathbb{R}^d} \| \nabla_z f_\theta(z) \|^2 \, \mu(dz).$$

A minimizer $f : \mathbb{R}^d \to \mathbb{R}$ of the functional $G$ that is consistent with the labelled data, i.e. $f(z_i) = y_i$ for $i \in I_L$, is a minimizer of the energy functional $f \mapsto \int_{\mathbb{R}^d} \| \nabla_z f(z) \|^2 \, \mu(dz)$ subject to the constraints $f(z_i) = y_i$. It is the variational formulation of the Poisson equation

$$\begin{cases} \Delta f(z) = 0 & \text{for } z \in \mathbb{R}^d \setminus \{z_i\}_{i \in I_L} \\ f(z_i) = y_i & \text{for } i \in I_L. \end{cases} \tag{8}$$

Note that the solution does not depend on the regularization parameter $\lambda$ in the regime $\varepsilon \to 0$: this indicates, as will be discussed in detail in Section 4.3, that the generalization properties of consistency-based SSL methods will typically be insensitive to this parameter, in the regime of small data-augmentation at least. Furthermore, equation 8 shows that consistency-based SSL methods are based on the same principles as more standard graph-based approaches such as Label Propagation (ZG02): solutions are gradient/Laplacian penalized interpolating functions.

In Figure 2, we consider the case where $D = d = 2$ with trivial mapping $\Phi(x) = x$. We consider labelled data situated on the left (resp. right) boundary of the unit square and corresponding to the label $y = 0$ (resp. $y = 1$). For simplicity, we choose the loss function $\ell(f, y) = \frac{1}{2} (f - y)^2$ and parametrize $F_\theta \equiv f_\theta$ with a neural network with a single hidden layer with $N = 100$ neurons. As expected, the Π-model converges to the solution of the Poisson equation 8 in the unit square with boundary condition $f(u, v) = 0$ for $u = 0$ and $f(u, v) = 1$ for $u = 1$.
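
The harmonic solution on the unit square can be reproduced without any neural network by solving the discrete analogue of equation 8 on a grid graph, in the spirit of label propagation. This is only a sketch under stated assumptions: a regular 11 × 11 grid stands in for the uniformly distributed unlabelled samples, labelled values are clamped on the left and right edges, and every other node is required to equal the average of its grid neighbours.

```python
import numpy as np

n = 11                                   # grid points per side of the unit square
idx = np.arange(n * n).reshape(n, n)     # idx[row, col]; col / (n - 1) is the coordinate u

A = np.zeros((n * n, n * n))
b = np.zeros(n * n)
for r in range(n):
    for c in range(n):
        k = idx[r, c]
        if c == 0 or c == n - 1:
            # Labelled boundary nodes: f = 0 on the left edge (u = 0), f = 1 on the right (u = 1).
            A[k, k] = 1.0
            b[k] = 0.0 if c == 0 else 1.0
            continue
        # Unlabelled node: f equals the average of its existing grid neighbours
        # (discrete Laplace equation; top and bottom rows simply have fewer neighbours).
        nbrs = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < n]
        A[k, k] = len(nbrs)
        for rr, cc in nbrs:
            A[k, idx[rr, cc]] = -1.0

f = np.linalg.solve(A, b).reshape(n, n)

u = np.linspace(0.0, 1.0, n)
max_dev = np.max(np.abs(f - u[None, :]))  # deviation from the harmonic function f(u, v) = u
```

The discrete solution coincides with f(u, v) = u up to round-off, which is the function the Π-model is reported to recover in Figure 2.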

4.3. GENERATIVE MODEL FOR SEMI-SUPERVISED LEARNING

As has been made clear throughout this text, SSL methods crucially rely on the dependence structure of the data. The existence and exploitation of a much lower-dimensional manifold $\mathcal{M}$ supporting the data-samples is instrumental to this class of methods. Furthermore, the performance of consistency-based SSL approaches is intimately related to the data-augmentation schemes they are based upon. Consequently, in order to understand the mechanisms that are at play when consistency-based SSL methods are used to uncover the structures present in real datasets, it is important to build simplified and tractable generative models of data that (1) respect these low-dimensional structures and (2) allow the design of efficient data-augmentation schemes. Several articles have investigated the influence of the dependence structures that are present in the data on the learning algorithm (BM13; Mos16). Here, we follow the Hidden Manifold Model (HMM) framework proposed in (GMKZ19; GLK+20), where the authors describe a model of synthetic data concentrating near low-dimensional structures and analyze the learning curve associated with a class of two-layered neural networks.

Low-dimensional structure: Similarly to Section 4.2, assume that the $D$-dimensional data-samples $x_i \in \mathcal{X}$ can be expressed as $x_i = \Phi(z_i) \in \mathbb{R}^D$ for a fixed smooth mapping $\Phi : \mathbb{R}^d \to \mathbb{R}^D$. In other words, the data-manifold $\mathcal{M}$ is $d$-dimensional and the mapping $\Phi$ can be used to parametrize it. The mapping $\Phi$ is chosen to be a neural network with a single hidden layer with $H$ neurons, although other choices are possible. For $z = (z_1, \ldots, z_d) \in \mathbb{R}^d$, set $\Phi(z) = A^{1 \to 2} \, \phi(A^{0 \to 1} z + b_1)$ for matrices $A^{0 \to 1} \in \mathbb{R}^{H,d}$ and $A^{1 \to 2} \in \mathbb{R}^{D,H}$, bias vector $b_1 \in \mathbb{R}^H$ and non-linearity $\phi : \mathbb{R} \to \mathbb{R}$ applied element-wise. In all our experiments, we use the ELU non-linearity.
We adopt the standard normalization $A^{0 \to 1}_{i,j} = w^{(1)}_{i,j} / \sqrt{d}$ and $A^{1 \to 2}_{i,j} = w^{(2)}_{i,j} / \sqrt{H}$ for weights $w^{(k)}_{i,j}$ drawn i.i.d. from a centred Gaussian distribution with unit variance; this ensures that, if the coordinates of the input vector $z \in \mathbb{R}^d$ are all of order $O(1)$, so are the coordinates of $x = \Phi(z)$.

Data-augmentation: consider a data sample $x_i \in \mathcal{M}$ on the data-manifold. It can also be expressed as $x_i = \Phi(z_i)$. We consider the natural data-augmentation process which consists in setting $S_{\varepsilon\omega}(x_i) = \Phi(z_i + \varepsilon\omega)$ for a sample $\omega \in \mathbb{R}^d$ from an isotropic Gaussian distribution with unit covariance and $\varepsilon > 0$. Crucially, the data-augmentation scheme respects the low-dimensional structure of the data: for any value of $\varepsilon$ and any perturbation vector $\varepsilon\omega$, the perturbed sample $S_{\varepsilon\omega}(x_i)$ lies exactly on the data-manifold $\mathcal{M}$. The larger $\varepsilon$, the larger the amount of data-augmentation; this property is important since it allows one to study the influence of the amount of data-augmentation.

The negative log-likelihood is $L_L(\theta) \equiv (1/|\mathcal{D}_L|) \sum_{i \in I_L} \ell[F_\theta(x_i), y_i]$ where $\ell(f, y) = \log(1 + \exp[-y f])$. We assume that there are $|\mathcal{D}_L| = 10$ labelled data pairs $\{x_i, y_i\}_{i \in I_L}$, as well as $|\mathcal{D}_U| = 1000$ unlabelled data samples, that the ambient space has dimension $D = 100$ and that the data manifold $\mathcal{M}$ has dimension $d = 10$. The function $\Phi$ uses $H = 30$ neurons in its hidden layer. In all our experiments, we use a standard Stochastic Gradient Descent (SGD) method with constant learning rate and momentum $\beta = 0.9$.
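
A minimal NumPy sketch of this generative model, of the on-manifold augmentation and of the logistic loss is given below. Several ingredients are assumptions made for illustration only: the distribution of the bias vector, the class means in latent space, the random seeds, and the names `HiddenManifold`, `sample_latent` and `logistic_nll`.

```python
import numpy as np

def elu(a):
    """ELU non-linearity: a for a > 0 and exp(a) - 1 otherwise."""
    return np.where(a > 0, a, np.expm1(np.minimum(a, 0.0)))

class HiddenManifold:
    """Phi(z) = A12 @ elu(A01 @ z + b1) with Gaussian weights scaled by 1/sqrt(fan-in)."""

    def __init__(self, d, H, D, seed=0):
        rng = np.random.default_rng(seed)
        self.A01 = rng.standard_normal((H, d)) / np.sqrt(d)
        self.A12 = rng.standard_normal((D, H)) / np.sqrt(H)
        self.b1 = rng.standard_normal(H)   # bias distribution is an assumption

    def phi(self, z):
        """Map latent rows z of shape (n, d) to data rows of shape (n, D)."""
        return elu(z @ self.A01.T + self.b1) @ self.A12.T

    def augment(self, z, eps, rng):
        """On-manifold augmentation S(Phi(z)) = Phi(z + eps * omega), omega ~ N(0, I_d)."""
        return self.phi(z + eps * rng.standard_normal(z.shape))

def sample_latent(n_per_class, mu_pos, mu_neg, rng):
    """Balanced latent samples: z ~ N(mu_pos, I) with y = +1 and z ~ N(mu_neg, I) with y = -1."""
    d = mu_pos.shape[0]
    z = np.vstack([mu_pos + rng.standard_normal((n_per_class, d)),
                   mu_neg + rng.standard_normal((n_per_class, d))])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return z, y

def logistic_nll(f, y):
    """Mean of log(1 + exp(-y * f)), evaluated stably with logaddexp."""
    return np.mean(np.logaddexp(0.0, -y * f))

rng = np.random.default_rng(1)
d, H, D = 10, 30, 100
hmm = HiddenManifold(d, H, D)
mu_pos = np.full(d, 0.5)                  # class means are illustrative assumptions
mu_neg = -mu_pos

z_lab, y_lab = sample_latent(5, mu_pos, mu_neg, rng)     # 10 labelled pairs
z_unl, _ = sample_latent(500, mu_pos, mu_neg, rng)       # 1000 unlabelled samples
x_lab, x_unl = hmm.phi(z_lab), hmm.phi(z_unl)
x_aug = hmm.augment(z_unl, eps=0.3, rng=rng)             # augmented samples stay on the manifold
nll_at_zero = logistic_nll(np.zeros_like(y_lab), y_lab)  # equals log(2) for a constant-zero predictor
```

Because the perturbation is applied in latent space before the mapping Φ, every augmented sample is by construction a point of the data-manifold, whatever the value of ε.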
For minimizing the consistency-based SSL objective $L_L(\theta) + \lambda R(\theta)$, with regularization $R(\theta)$ given in equation 2, we use the standard strategy (TV17) consisting in first minimizing the un-regularized objective $L_L$ alone for a few epochs, so that the function $F_\theta$ is learned in the neighbourhood of the few labelled data-samples, before switching on the consistency-based regularization whose role is to propagate the information contained in the labelled samples along the data manifold $\mathcal{M}$.

Insensitivity to λ: Figure 3 (Left) shows that this method is relatively insensitive to the parameter $\lambda$, as long as it is within reasonable bounds. This phenomenon can be read from equation 8, which does not depend on $\lambda$. Much larger or smaller values of $\lambda$ (not shown in Figure 3) do lead, unsurprisingly, to convergence and stability issues.

Amount of Data-Augmentation: As is reported in many tasks (CZM+18; ZCG+19; KYF20), tuning the amount of data-augmentation in deep-learning applications is often a delicate exercise that can greatly influence the resulting performance. Figure 3 (Right) reports the generalization properties of the method for different amounts of data-augmentation. With too low an amount of data-augmentation (i.e. $\varepsilon = 0.03$), the final performance is equivalent to that of the un-regularized method. Too large an amount of data-augmentation (i.e. $\varepsilon = 1.0$) also leads to poor generalization properties. This is because the choice of $\varepsilon = 1.0$ corresponds to augmented samples that are very different from the distribution of the training dataset (i.e. distributional shift), although these samples are still supported by the data-manifold.

Quality of the Data-Augmentation: to study the influence of the quality of the data-augmentation scheme, we consider a perturbation process implemented as $S_{\varepsilon\omega[k]}(x_i) = \Phi(z_i + \varepsilon \, \omega[k])$ for $x_i = \Phi(z_i)$, where the noise term $\omega[k]$ is defined as follows. For a data-augmentation dimension parameter $1 \le k \le d$ we have $\omega[k] = (\xi_1, \ldots, \xi_k, 0, \ldots, 0)$ for i.i.d. standard Gaussian samples $\xi_1, \ldots, \xi_k \in \mathbb{R}$. This data-augmentation scheme only explores the first $k$ dimensions of the $d$-dimensional data-manifold: the lower $k$, the poorer the exploration of the data-manifold. As demonstrated in Figure 4, lower quality data-augmentation schemes (i.e. lower values of $k$) hurt the generalization performance of the Π-model.

Mean-Teacher versus Π-model: we implemented the Mean-Teacher (MT) approach with an exponential moving average (EMA) process $\theta_{\mathrm{avg},k} = \beta_{MT} \, \theta_{\mathrm{avg},k-1} + (1 - \beta_{MT}) \, \theta_k$ for the MT parameter $\theta_{\mathrm{avg},k}$ with different scales $\beta_{MT} \in \{0.9, 0.95, 0.99, 0.995\}$, as well as a Π-model approach, with $\lambda = 10$ and $\varepsilon = 0.3$. Figure 5 shows, in accordance with Section 4.1, that the different EMA schemes lead to generalization performances similar to a standard Π-model.

Figure 5: For $\beta_{MT} \in \{0.9, 0.95, 0.99, 0.995\}$, the final test NLL obtained through the MT approach is identical to the test NLL obtained through the Π-model. In all the experiments, we used $\lambda = 10$ and SGD with momentum $\beta = 0.9$.
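
The restricted noise vector ω[k] used in the data-augmentation quality experiment can be sketched as follows; the function name `sample_omega_k` and the batch interface are hypothetical conveniences, not part of the original experiments.

```python
import numpy as np

def sample_omega_k(n, d, k, rng):
    """Draw n noise vectors in R^d whose first k coordinates are i.i.d. N(0, 1)
    and whose last d - k coordinates are exactly zero."""
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= d")
    omega = np.zeros((n, d))
    omega[:, :k] = rng.standard_normal((n, k))
    return omega

rng = np.random.default_rng(0)
d = 10
omega_full = sample_omega_k(5, d, k=10, rng=rng)   # explores every latent direction
omega_half = sample_omega_k(5, d, k=5, rng=rng)    # explores only the first five directions

# Number of latent coordinates that are actually perturbed by omega_half.
explored_dims = int(np.count_nonzero(np.abs(omega_half).sum(axis=0)))
```

Adding ε times such a vector to the latent code before applying Φ perturbs the sample only along the first k latent directions.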

5. CONCLUSION

Consistency-based SSL methods rely on a subtle trade-off between the exploitation of the labelled samples and the discovery of the low-dimensional data-manifold. The results presented in this article highlight the connections with more standard methods such as Jacobian penalization and graph-based approaches and emphasize the crucial role of the data-augmentation scheme. The analysis of consistency-based SSL methods is still in its infancy and our numerical simulations suggest that the variant of the Hidden Manifold Model described in this text is a natural framework to make progress in this direction.



Figure 1: Left: Jacobian (i.e. first-order) penalization methods are short-sighted and do not fully exploit the data-manifold. Right: Data-augmentation respecting the geometry of the data-manifold.

Proposition 4.1. Let $D([0, T], \mathbb{R}^{|\Theta|})$ be the usual space of càdlàg $\mathbb{R}^{|\Theta|}$-valued functions on a bounded time interval $[0, T]$ endowed with the standard Skorohod topology. Consider the update equation 3 with learning rate $\eta > 0$ and define the continuous time process $\theta^\eta(t) = \theta_{[t/\eta]}$. The sequence of processes $\theta^\eta \in D([0, T], \mathbb{R}^{|\Theta|})$ converges weakly in $D([0, T], \mathbb{R}^{|\Theta|})$, as $\eta \to 0$, to the solution of the ordinary differential equation $\dot\theta_t = -\nabla \big[ L_L(\theta_t) + \lambda R(\theta_t) \big]$.

Figure 2: Labelled data samples with class $y = 0$ (green triangle) and $y = +1$ (red dot) are placed on the Left/Right boundary of the unit square. Unlabelled data samples (blue stars) are uniformly placed within the unit square. We consider a simple regression setting with loss function $\ell(f, y) = \frac{1}{2} (f - y)^2$. Left: Randomly initialized neural network. Middle: labelled/unlabelled data. Right: Solution $f$ obtained by training a standard Π-model. It is the harmonic function $f(u, v) = u$, as described by equation 8.

Figure 3: Left: For a fixed data-augmentation scheme, generalization properties for λ spanning two orders of magnitude. Right: Influence of the amount of data-augmentation on the generalization properties.

Classification: we consider a balanced binary classification problem with $|\mathcal{D}_L| \ge 2$ labelled training examples $\{x_i, y_i\}_{i \in I_L}$ where $x_i = \Phi(z_i)$ and $y_i \in \mathcal{Y} \equiv \{-1, +1\}$. The samples $z_i \in \mathbb{R}^d$ corresponding to the positive (resp. negative) class are assumed to have been drawn i.i.d. from a Gaussian distribution with identity covariance matrix and mean $\mu_+ \in \mathbb{R}^d$ (resp. mean $\mu_- \in \mathbb{R}^d$). The distance $\| \mu_+ - \mu_- \|$ quantifies the hardness of the classification task.

Neural architecture and optimization: Consider fitting a two-layered neural network $F_\theta : \mathbb{R}^D \to \mathbb{R}$ by minimising the negative log-likelihood $L_L(\theta)$.

Figure 4: Learning curves (test NLL) of the Π-model with $\lambda = 10$ for different "qualities" of data-augmentation. The data manifold is of dimension $d = 10$ in an ambient space of dimension $D = 100$. For $x_i = \Phi(z_i)$ and $1 \le k \le d$, the data-augmentation scheme is implemented as $S_{\varepsilon\omega[k]}(x_i) = \Phi(z_i + \varepsilon \, \omega[k])$ where $\omega[k]$ is a sample from a Gaussian distribution whose last $(d - k)$ coordinates are zero. In other words, the data-augmentation scheme only explores $k$ dimensions out of the $d$ dimensions of the data-manifold. We use $\varepsilon = 0.3$ in all the experiments. Left: Learning curves (test NLL) for data-augmentation dimension $k \in [5, 10]$. Right: Test NLL at epoch $N = 200$ (see left plot) for data-augmentation dimension $k \in [5, 10]$.

This indicates that the improved performances of the Mean Teacher approach sometimes reported in the literature are either not statistically meaningful, or due to poorly executed comparisons, or due to mechanisms not captured by the $\eta \to 0$ asymptotic. Indeed, several recently proposed consistency-based SSL algorithms (BCG+19; SBL+20; XDH+19) achieve state-of-the-art performance across diverse datasets without employing any exponential averaging processes. These results are achieved by leveraging more sophisticated data-augmentation schemes such as Rand-Augment (CZSL19), Back Translation (ALAC17) or Mixup (ZCDLP17).

