A PROBABILISTIC MODEL FOR DISCRIMINATIVE AND NEURO-SYMBOLIC SEMI-SUPERVISED LEARNING

Anonymous

Abstract

Strong progress has been achieved in semi-supervised learning (SSL) by combining several underlying methods, some that pertain to properties of the data distribution p(x), others to the model outputs p(y|x), e.g. minimising the entropy of unlabelled predictions. Focusing on the latter, we fill a gap in the standard text by introducing a probabilistic model for discriminative semi-supervised learning, mirroring the classical generative model. Several SSL methods are theoretically explained by our model as inducing (approximate) strong priors over parameters of p(y|x). Applying this same probabilistic model to tasks in which labels represent binary attributes, we also theoretically justify a family of neuro-symbolic SSL approaches, taking a step towards bridging the divide between statistical learning and logical reasoning.

1. INTRODUCTION

In semi-supervised learning (SSL), a mapping is learned that predicts labels y for data points x from a dataset of labelled pairs (x^l, y^l) and unlabelled x^u. SSL is of practical importance since unlabelled data are often cheaper to acquire and/or more abundant than labelled data. For unlabelled data to help predict labels, the distribution of x must contain information relevant to the prediction (Chapelle et al., 2006; Zhu & Goldberg, 2009). State-of-the-art SSL algorithms (e.g. Berthelot et al., 2019b;a) combine underlying methods, including some that leverage properties of the distribution p(x), and others that rely on the label distribution p(y|x). The latter include entropy minimisation (Grandvalet & Bengio, 2005), mutual exclusivity (Sajjadi et al., 2016a; Xu et al., 2018) and pseudo-labelling (Lee, 2013), which add functions of unlabelled data predictions to a typical discriminative supervised loss function. Whilst these methods each have their own rationale, we propose a formal probabilistic model that unifies them as a family of discriminative semi-supervised learning (DSSL) methods.

Neuro-symbolic learning (NSL) is a broad field that looks to combine logical reasoning and statistical machine learning, e.g. neural networks. Approaches often introduce neural networks into a logical framework (Manhaeve et al., 2018), or logic into statistical learning models (Rocktäschel et al., 2015). Several works combine NSL with semi-supervised learning (Xu et al., 2018; van Krieken et al., 2019) but lack rigorous justification. We show that our probabilistic model for discriminative SSL extends to the case where label components obey logical rules, theoretically justifying neuro-symbolic SSL approaches that augment a supervised loss function with a function based on logical constraints.

Central to this work are ground truth parameters {θ_x}_{x∈X} of the distributions p(y|x), as predicted by models such as neural networks.
For example, θ_x may be a multinomial parameter vector specifying the distribution over all labels associated with a given x. Since each data point x has a specific label distribution defined by θ_x, sampling from p(x) induces an implicit distribution over parameters, p(θ). If known, the distribution p(θ) serves as a prior over all model predictions θ_x: for labelled samples it may provide little additional information, but for unlabelled data it may allow predictions to be evaluated and the model improved. As such, p(θ) provides a potential basis for semi-supervised learning. We show that, in practice, p(θ) can avoid much of the complexity of p(x) and have a concise analytical form known a priori. In principle, p(θ) can also be estimated from the parameters learned for labelled data (fitting the intuition that predictions for unlabelled data should be consistent with those of labelled data).

We refer to SSL methods that rely on p(θ) as discriminative and formalise them with a hierarchical probabilistic model, analogous to that for generative approaches. Recent results (Berthelot et al., 2019b;a) demonstrate that discriminative SSL is orthogonal and complementary to methods that rely on p(x), such as data augmentation and consistency regularisation (Sajjadi et al., 2016b; Laine & Aila, 2017; Tarvainen & Valpola, 2017; Miyato et al., 2018).

We consider the explicit form of p(θ) in classification with mutually exclusive classes, i.e. where each x only ever pairs with a single y and y|x is deterministic. By comparison of their loss functions, the SSL methods mentioned (entropy minimisation, mutual exclusivity and pseudo-labelling) can be seen to impose continuous relaxations of the resulting prior p(θ) and are thus unified under our probabilistic model for discriminative SSL. We then consider classification with binary vector labels, e.g.
representing concurrent image features or allowed chess board configurations, where only certain label/attribute combinations may be valid, e.g. according to rules of the game or the laws of nature. Analysing the structure of p(θ) here, again assuming y|x is deterministic, we show that logical rules between attributes define its support. As such, SSL approaches that use fuzzy logic (or similar) to add logical rules into the loss function (e.g. Xu et al., 2018; van Krieken et al., 2019) can be seen as approximating a continuous relaxation of p(θ) and so also fall under our probabilistic model for discriminative SSL.

Our key contributions are:
• to provide a probabilistic model for discriminative semi-supervised learning, comparable to that for classical generative methods, contributing to current theoretical understanding of SSL;
• to consider the analytical form of the distribution over parameters p(θ), by which we explain several SSL methods, including entropy minimisation as used in state-of-the-art SSL models; and
• to show that our probabilistic model also unifies neuro-symbolic SSL in which logical rules over attributes are incorporated (by fuzzy logic or similar) to regularise the loss function, providing firm theoretical justification for this means of integrating 'connectionist' and 'symbolic' methods.
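To make the neuro-symbolic loss term concrete, here is a minimal sketch (our own illustration, not the paper's method) of how logical rules over attribute probabilities become differentiable penalties under the product t-norm. For a rule A → B, the relaxation penalises the probability mass assigned to A ∧ ¬B; for mutual exclusivity, it penalises pairwise co-activation:

```python
import numpy as np

def fuzzy_implication_loss(p_a, p_b):
    """Product t-norm relaxation of the rule A -> B.

    Penalises p(A and not B); zero whenever the rule is (softly) satisfied.
    """
    return float(np.mean(p_a * (1.0 - p_b)))

def mutual_exclusivity_loss(p):
    """Relaxation of 'at most one attribute holds' over the columns of p.

    Sums pairwise co-activation probabilities p_i * p_j for i < j.
    """
    k = p.shape[1]
    loss = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            loss += np.mean(p[:, i] * p[:, j])
    return float(loss)
```

Added to a supervised loss and evaluated on unlabelled predictions, such terms push attribute probabilities towards the valid label combinations, i.e. towards the support of p(θ) as characterised in this paper.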

2. BACKGROUND AND RELATED WORK

Notation: x^l_i ∈ X^l, y^l_i ∈ Y^l are labelled data pairs, i ∈ {1...N_l}; x^u_j ∈ X^u, y^u_j ∈ Y^u are unlabelled data samples and their (unknown) labels, j ∈ {1...N_u}; X, Y are the domains of x and y; x, y are random variables of which x, y are realisations. θ_x parameterises the distribution p(y|x), and is a realisation of a random variable θ. To clarify: for each x, an associated parameter θ_x defines a distribution over associated label(s) y|x; and p(θ) is a distribution over all such parameters.

2.1. SEMI-SUPERVISED LEARNING

Semi-supervised learning is a well established field, described by a number of surveys and taxonomies (Seeger, 2006; Zhu & Goldberg, 2009; Chapelle et al., 2006; van Engelen & Hoos, 2020). SSL methods have been categorised by how they adapt supervised learning algorithms (van Engelen & Hoos, 2020); or by their assumptions (Chapelle et al., 2006), e.g. that data of each class form a cluster/manifold, or that data of different classes are separated by low density regions. It has been proposed that all such assumptions are variations of clustering (van Engelen & Hoos, 2020). Whilst 'clustering' itself is not well defined (Estivill-Castro, 2002), from a probabilistic perspective this suggests that SSL methods assume p(x) to be a mixture of conditional distributions that are distinguishable by some property, e.g. connected dense regions. This satisfies the condition that for unlabelled x to help in learning to predict y from x, the distribution of x must contain information relevant to the prediction (Chapelle et al., 2006; Zhu & Goldberg, 2009).

In this work, we distinguish SSL methods by whether they rely on direct properties of p(x), or on properties that manifest in p(θ), the distribution over parameters of p(y|x; θ_x), for x ∼ p(x). State-of-the-art models (Berthelot et al., 2019b;a) combine methods of both types.

A canonical SSL method that relies on explicit assumptions of p(x) is the classical generative model:

$$p(X^l, Y^l, X^u) = \int_{\psi,\pi} p(\psi, \pi)\, p(X^l \mid Y^l, \psi)\, p(Y^l \mid \pi) \underbrace{\sum_{Y^u \in \mathcal{Y}^{N_u}} p(X^u \mid Y^u, \psi)\, p(Y^u \mid \pi)}_{p(X^u \mid \psi, \pi)}$$

Parameters ψ, π of p(x|y) and p(y) are learned from labelled and unlabelled data, e.g. by the EM algorithm, and predictions p(y|x) = p(x|y)p(y)/p(x) follow by Bayes' rule. Figure 1 (left) shows the corresponding graphical model.
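The generative approach can be made concrete with a toy semi-supervised EM for 1-D Gaussian class-conditionals (an illustrative sketch, not the paper's algorithm; the function names, the per-class parameterisation ψ = (μ, σ), and the broad initial variance are our own choices):

```python
import numpy as np

def semisup_em_gaussian(x_l, y_l, x_u, n_iter=20):
    """Semi-supervised EM for 1-D Gaussian class-conditionals p(x|y; psi)
    and class priors pi, using labelled data (hard counts) and unlabelled
    data (soft responsibilities)."""
    K = y_l.max() + 1
    # Initialise psi = (mu, sigma) and pi from labelled data only;
    # a broad initial sigma avoids degenerate responsibilities.
    mu = np.array([x_l[y_l == k].mean() for k in range(K)])
    sigma = np.array([x_l[y_l == k].std() for k in range(K)]) + 0.5
    pi = np.bincount(y_l, minlength=K) / len(y_l)
    for _ in range(n_iter):
        # E-step: responsibilities p(y|x_u) by Bayes' rule.
        lik = np.exp(-0.5 * ((x_u[:, None] - mu) / sigma) ** 2) / sigma
        r = lik * pi
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates from hard labels + soft responsibilities.
        x = np.concatenate([x_l, x_u])
        for k in range(K):
            w = np.concatenate([(y_l == k).astype(float), r[:, k]])
            mu[k] = (w * x).sum() / w.sum()
            sigma[k] = np.sqrt((w * (x - mu[k]) ** 2).sum() / w.sum()) + 1e-3
        pi = np.concatenate([np.eye(K)[y_l], r]).sum(axis=0)
        pi /= pi.sum()
    return mu, sigma, pi

def predict(x, mu, sigma, pi):
    """p(y|x) = p(x|y) p(y) / p(x) by Bayes' rule."""
    lik = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    post = lik * pi
    return post / post.sum(axis=1, keepdims=True)
```

The E-step marginalises over the unknown Y^u exactly as the sum over Y^{N_u} in the equation above; the M-step fits ψ and π to labelled and unlabelled data jointly.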
Whilst generative SSL has an appealing probabilistic rationale, it is rarely used in practice, similarly to its counterpart for fully supervised learning, in large part because p(x) is often complex yet must be accurately described (Grandvalet & Bengio, 2005; Zhu & Goldberg, 2009; Lawrence & Jordan, 2006) . However, properties of p(x) underpin data augmentation and consistency regularisation (Sajjadi et al., 2016b; Laine & Aila, 2017; Tarvainen & Valpola, 2017; Miyato et al., 2018) , in which true x samples are adjusted, using implicit domain knowledge of p(x|y), to generate artificial samples of the same class, whether or not that class is known. Other SSL methods consider p(x) in terms of components p(x|z), where z is a latent representation useful for

