A PROBABILISTIC MODEL FOR DISCRIMINATIVE AND NEURO-SYMBOLIC SEMI-SUPERVISED LEARNING

Anonymous

Abstract

Strong progress has been achieved in semi-supervised learning (SSL) by combining several underlying methods: some pertain to properties of the data distribution p(x), others to the model outputs p(y|x), e.g. minimising the entropy of unlabelled predictions. Focusing on the latter, we fill a gap in the standard theory by introducing a probabilistic model for discriminative semi-supervised learning, mirroring the classical generative model. Several SSL methods are theoretically explained by our model as inducing (approximate) strong priors over the parameters of p(y|x). Applying this same probabilistic model to tasks in which labels represent binary attributes, we also theoretically justify a family of neuro-symbolic SSL approaches, taking a step towards bridging the divide between statistical learning and logical reasoning.

1. INTRODUCTION

In semi-supervised learning (SSL), a mapping is learned that predicts labels $y$ for data points $x$ from a dataset of labelled pairs $(x^l, y^l)$ and unlabelled $x^u$. SSL is of practical importance since unlabelled data are often cheaper to acquire and/or more abundant than labelled data. For unlabelled data to help predict labels, the distribution of $x$ must contain information relevant to the prediction (Chapelle et al., 2006; Zhu & Goldberg, 2009). State-of-the-art SSL algorithms (e.g. Berthelot et al., 2019b; a) combine underlying methods, including some that leverage properties of the distribution $p(x)$, and others that rely on the label distribution $p(y|x)$. The latter include entropy minimisation (Grandvalet & Bengio, 2005), mutual exclusivity (Sajjadi et al., 2016a; Xu et al., 2018) and pseudo-labelling (Lee, 2013), which add functions of unlabelled data predictions to a typical discriminative supervised loss function. Whilst these methods each have their own rationale, we propose a formal probabilistic model that unifies them as a family of discriminative semi-supervised learning (DSSL) methods.

Neuro-symbolic learning (NSL) is a broad field that looks to combine logical reasoning and statistical machine learning, e.g. neural networks. Approaches often introduce neural networks into a logical framework (Manhaeve et al., 2018), or logic into statistical learning models (Rocktäschel et al., 2015). Several works combine NSL with semi-supervised learning (Xu et al., 2018; van Krieken et al., 2019) but lack rigorous justification. We show that our probabilistic model for discriminative SSL extends to the case where label components obey logical rules, theoretically justifying neuro-symbolic SSL approaches that augment a supervised loss function with a function based on logical constraints.

Central to this work are the ground truth parameters $\{\theta^x\}_{x\in\mathcal{X}}$ of the distributions $p(y|x)$, as predicted by models such as neural networks.
For example, $\theta^x$ may be a multinomial parameter vector specifying the distribution over all labels associated with a given $x$. Since each data point $x$ has a specific label distribution defined by $\theta^x$, sampling from $p(x)$ induces an implicit distribution over parameters, $p(\theta)$. If known, the distribution $p(\theta)$ serves as a prior over all model predictions $\hat\theta^x$: for labelled samples it may provide little additional information, but for unlabelled data it may allow predictions to be evaluated and the model improved. As such, $p(\theta)$ provides a potential basis for semi-supervised learning. We show that, in practice, $p(\theta)$ can avoid much of the complexity of $p(x)$ and have a concise analytical form known a priori. In principle, $p(\theta)$ can also be estimated from the parameters learned for labelled data (fitting the intuition that predictions for unlabelled data should be consistent with those of labelled data). We refer to SSL methods that rely on $p(\theta)$ as discriminative and formalise them with a hierarchical probabilistic model, analogous to that for generative approaches. Recent results (Berthelot et al., 2019b; a) demonstrate that discriminative SSL is orthogonal and complementary to methods that rely on $p(x)$, such as data augmentation and consistency regularisation (Sajjadi et al., 2016b; Laine & Aila, 2017; Tarvainen & Valpola, 2017; Miyato et al., 2018).

We consider the explicit form of $p(\theta)$ in classification with mutually exclusive classes, i.e. where each $x$ only ever pairs with a single $y$ and $y|x$ is deterministic. By comparison of their loss functions, the SSL methods mentioned (entropy minimisation, mutual exclusivity and pseudo-labelling) can be seen to impose continuous relaxations of the resulting prior $p(\theta)$ and are thus unified under our probabilistic model for discriminative SSL. We then consider classification with binary vector labels, e.g. representing concurrent image features or allowed chess board configurations, where only certain label/attribute combinations may be valid, e.g. according to the rules of the game or the laws of nature. Analysing the structure of $p(\theta)$ here, again assuming $y|x$ is deterministic, we show that logical rules between attributes define its support. As such, SSL approaches that use fuzzy logic (or similar) to add logical rules into the loss function (e.g. Xu et al., 2018; van Krieken et al., 2019) can be seen as approximating a continuous relaxation of $p(\theta)$ and so also fall under our probabilistic model for discriminative SSL.

Our key contributions are:
• to provide a probabilistic model for discriminative semi-supervised learning, comparable to that for classical generative methods, contributing to the current theoretical understanding of SSL;
• to consider the analytical form of the distribution over parameters $p(\theta)$, by which we explain several SSL methods, including entropy minimisation as used in state-of-the-art SSL models; and
• to show that our probabilistic model also unifies neuro-symbolic SSL in which logical rules over attributes are incorporated (by fuzzy logic or similar) to regularise the loss function, providing firm theoretical justification for this means of integrating 'connectionist' and 'symbolic' methods.

2. BACKGROUND AND RELATED WORK

Notation: $x^l_i \in X^l$, $y^l_i \in Y^l$ are labelled data pairs, $i \in \{1...N_l\}$; $x^u_j \in X^u$, $y^u_j \in Y^u$ are unlabelled data samples and their (unknown) labels, $j \in \{1...N_u\}$; $\mathcal{X}$, $\mathcal{Y}$ are domains of $x$ and $y$; $\mathrm{x}$, $\mathrm{y}$ are random variables of which $x$, $y$ are realisations. $\theta^x$ parameterises the distribution $p(y|x)$, and is a realisation of a random variable $\theta$. To clarify: for each $x$, an associated parameter $\theta^x$ defines a distribution over associated label(s) $y|x$; and $p(\theta)$ is a distribution over all such parameters.

2.1. SEMI-SUPERVISED LEARNING

Semi-supervised learning is a well established field, described by a number of surveys and taxonomies (Seeger, 2006; Zhu & Goldberg, 2009; Chapelle et al., 2006; van Engelen & Hoos, 2020). SSL methods have been categorised by how they adapt supervised learning algorithms (van Engelen & Hoos, 2020); or by their assumptions (Chapelle et al., 2006), e.g. that data of each class form a cluster/manifold, or that data of different classes are separated by low-density regions. It has been proposed that all such assumptions are variations of clustering (van Engelen & Hoos, 2020). Whilst 'clustering' itself is not well defined (Estivill-Castro, 2002), from a probabilistic perspective this suggests that SSL methods assume $p(x)$ to be a mixture of conditional distributions that are distinguishable by some property, e.g. connected dense regions. This satisfies the condition that, for unlabelled $x$ to help in learning to predict $y$ from $x$, the distribution of $x$ must contain information relevant to the prediction (Chapelle et al., 2006; Zhu & Goldberg, 2009). In this work, we distinguish SSL methods by whether they rely on direct properties of $p(x)$, or on properties that manifest in $p(\theta)$, the distribution over parameters of $p(y|x; \theta^x)$, for $x \sim p(x)$. State-of-the-art models (Berthelot et al., 2019b; a) combine methods of both types. A canonical SSL method that relies on explicit assumptions about $p(x)$ is the classical generative model:

$$p(X^l, Y^l, X^u) = \int_{\psi,\pi} p(\psi, \pi)\, p(X^l|Y^l, \psi)\, p(Y^l|\pi) \underbrace{\sum_{Y^u \in \mathcal{Y}^{N_u}} p(X^u|Y^u, \psi)\, p(Y^u|\pi)}_{p(X^u|\psi,\pi)} \qquad (1)$$

Parameters $\psi, \pi$ of $p(x|y)$ and $p(y)$ are learned from labelled and unlabelled data, e.g. by the EM algorithm, and predictions $p(y|x) = p(x|y)p(y)/p(x)$ follow by Bayes' rule. Figure 1 (left) shows the corresponding graphical model.
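To make the EM-based fitting concrete, the following is a minimal sketch for a two-class, one-dimensional Gaussian mixture with a known, shared variance; function and variable names are ours, not from the paper, and this simplified setting is an assumption for illustration only.

```python
import numpy as np

def ssl_em_gaussian(x_l, y_l, x_u, n_iter=50, sigma=1.0):
    """Semi-supervised EM for a 2-class, 1-D Gaussian mixture with shared,
    known variance. Labelled points enter the E-step with responsibilities
    fixed by their labels; unlabelled points get posterior responsibilities."""
    # fixed one-hot responsibilities for labelled data
    r_l = np.stack([1.0 - y_l, y_l], axis=1)                  # (N_l, 2)
    # initialise means from labelled data, mixing weights from labels
    mu = np.array([x_l[y_l == 0].mean(), x_l[y_l == 1].mean()])
    pi = r_l.mean(axis=0)
    for _ in range(n_iter):
        # E-step (unlabelled only): responsibilities under current parameters
        log_lik = -0.5 * (x_u[:, None] - mu[None, :])**2 / sigma**2
        r_u = pi * np.exp(log_lik)
        r_u /= r_u.sum(axis=1, keepdims=True)
        # M-step: update mu, pi from labelled + unlabelled responsibilities
        r = np.concatenate([r_l, r_u])
        x = np.concatenate([x_l, x_u])
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
        pi = r.mean(axis=0)
    return mu, pi
```

With a few labelled points and many unlabelled ones drawn from two well-separated components, the recovered means approach the true component means, illustrating how unlabelled data inform the generative parameters.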
Whilst generative SSL has an appealing probabilistic rationale, it is rarely used in practice, similarly to its counterpart for fully supervised learning, in large part because $p(x)$ is often complex yet must be accurately described (Grandvalet & Bengio, 2005; Zhu & Goldberg, 2009; Lawrence & Jordan, 2006). However, properties of $p(x)$ underpin data augmentation and consistency regularisation (Sajjadi et al., 2016b; Laine & Aila, 2017; Tarvainen & Valpola, 2017; Miyato et al., 2018), in which true $x$ samples are adjusted, using implicit domain knowledge of $p(x|y)$, to generate artificial samples of the same class, whether or not that class is known. Other SSL methods consider $p(x)$ in terms of components $p(x|z)$, where $z$ is a latent representation useful for predicting $y$ (Kingma et al., 2014; Rasmus et al., 2015). We focus on a family of SSL methods that add a function of the unlabelled data predictions to a discriminative supervised loss function, e.g.:

• Entropy minimisation (Grandvalet & Bengio, 2005) assumes classes are "well separated". As a proxy for class overlap, the entropy of unlabelled data predictions is added to a discriminative supervised loss function $\ell_{sup}$:

$$\ell_{MinEnt}(\hat\theta) = \underbrace{-\sum_i \sum_k y^l_{i,k} \log \hat\theta^{x^l_i}_k}_{\ell_{sup}} - \sum_j \sum_k \hat\theta^{x^u_j}_k \log \hat\theta^{x^u_j}_k \qquad (2)$$

• Mutual exclusivity (Sajjadi et al., 2016a; Xu et al., 2018) assumes no class overlap, i.e. correct predictions form 'one-hot' vectors. Viewed as vectors of logical variables $z$, such outputs exclusively satisfy the logical formula $\bigvee_k (z_k \wedge \bigwedge_{j\neq k} \neg z_j)$. A function based on the formula is applied to unlabelled predictions:

$$\ell_{MutExc}(\hat\theta) = \ell_{sup} - \sum_j \log \sum_k \hat\theta^{x^u_j}_k \prod_{k'\neq k} \big(1 - \hat\theta^{x^u_j}_{k'}\big) \qquad (3)$$

• Pseudo-labelling (Lee, 2013) assumes that the predicted classes $k_j(t) = \arg\max_k \hat\theta^{x^u_j}_k$ for unlabelled data $x^u_j$ at iteration $t$ are correct (at the time) and treats them as labels:

$$\ell_{Pseudo}(\hat\theta, t) = \ell_{sup} - \sum_j \log \sum_k \mathbb{1}_{k=k_j(t)}\, \hat\theta^{x^u_j}_k \qquad (4)$$

These methods, though intuitive, lack a probabilistic rationale comparable to that of generative models (Eq. 1).
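The unlabelled components of Eqs. 2-4 can be sketched directly in code. The following is a minimal numpy version, assuming each row of `theta_u` is a predicted class distribution (e.g. a softmax output); function names are ours.

```python
import numpy as np

def entropy_term(theta_u):
    """Entropy of unlabelled predictions (unlabelled part of Eq. 2)."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(theta_u * np.log(theta_u + eps), axis=-1)

def mutual_exclusivity_term(theta_u):
    """-log of the relaxed one-hot formula (unlabelled part of Eq. 3)."""
    K = theta_u.shape[-1]
    total = np.zeros(theta_u.shape[:-1])
    for k in range(K):
        others = np.delete(theta_u, k, axis=-1)
        # theta_k * prod_{k' != k} (1 - theta_k')
        total += theta_u[..., k] * np.prod(1.0 - others, axis=-1)
    return -np.log(total + 1e-12)

def pseudo_label_term(theta_u):
    """-log probability of the current arg-max class (unlabelled part of Eq. 4)."""
    return -np.log(np.max(theta_u, axis=-1) + 1e-12)
```

All three terms are (near) zero for a one-hot prediction and strictly positive for an uncertain one, which is exactly the behaviour each method uses to push unlabelled predictions towards confident class assignments.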
Summing over all labels for unlabelled samples is of little use (Lawrence & Jordan, 2006):

$$p(Y^l|X^l, X^u) = \int_\theta p(\theta)\, p(Y^l|X^l, \theta) \underbrace{\sum_{Y^u} p(Y^u|X^u, \theta)}_{=1} = \int_\theta p(\theta)\, p(Y^l|X^l, \theta). \qquad (5)$$

Indeed, under the associated graphical model (Fig. 1, centre), the parameters $\theta$ of $p(Y^l|X^l, \theta)$ are provably independent of $X^u$ (Seeger, 2006; Chapelle et al., 2006). Previous approaches to breaking this independence include introducing additional variables to Gaussian processes (Lawrence & Jordan, 2006), or assuming that parameters of $p(y|x)$ are dependent on those of $p(x)$ (Seeger, 2006). Taking further the (general) assumption of Seeger (2006), we provide a probabilistic model for discriminative SSL (DSSL), analogous and complementary to that for generative SSL (Eq. 1).

2.2. NEURO-SYMBOLIC LEARNING

Neuro-symbolic learning (NSL) aims to bring together statistical machine learning, e.g. neural networks, and logical reasoning (see Garcez et al. (2019) for a summary). Approaches often either introduce statistical methods into a logical framework (e.g. Rocktäschel & Riedel, 2017; Manhaeve et al., 2018); or combine logical rules into statistical learning methods (Rocktäschel et al., 2015; Ding et al., 2018; Marra et al., 2019; van Krieken et al., 2019; Wang et al., 2019). A conceptual framework for NSL (Valiant, 2000; Garcez et al., 2019) places statistical methods within a low-level perceptual component that processes raw data (e.g. performing pattern recognition), the output of which feeds a reasoning module, e.g. performing logical inference (Fig. 2). This structure surfaces in various works (e.g. Marra et al., 2019; Wang et al., 2019; van Krieken et al., 2019; Dai et al., 2019), in some cases taking explicit analytical form. Marra et al. (2019) propose a 2-step graphical model (their Fig. 1) comprising a neural network and a "semantic layer"; however, logical constraints are later introduced as a design choice (their Eq. 2), whereas in our work they are a natural way to parameterise a probability distribution. The graphical model for SSL of van Krieken et al. (2019) differs from ours in that rules directly influence labels ($y$) of unlabelled data (only), whereas in our model, rules govern the parameters ($\theta^x$) of all label distributions, i.e. $p(y|x; \theta^x)$, $\forall x$. Where van Krieken et al. (2019) view probabilities as "continuous relaxations" of logical rules, we show such rules can be used to define the support of the prior $p(\theta)$ in a hierarchical probabilistic model for DSSL, which therefore provides a theoretical basis for neuro-symbolic semi-supervised learning.
We note that many other works consider related latent variable models (e.g. Mei et al. (2014) implement logical rules as constraints in a quasi-variational Bayesian approach) or structured label spaces (see e.g. Zhu & Goldberg (2009) for a summary), however we restrict the scope of this review to SSL applications.

3. A PROBABILISTIC MODEL FOR DISCRIMINATIVE SSL

Label(s) $y \in \mathcal{Y}$ that occur with a given $x$ can be viewed as samples drawn from $p(y|x; \theta^x)$, a distribution over the label space with parameter $\theta^x$. For example, in $K$-class classification $p(y|x)$ is a multinomial distribution over classes, fully defined by a mean parameter $\theta^x \in \Delta^K$, on the simplex in $\mathbb{R}^K$. Every $x \in \mathcal{X}$ has an associated label distribution and so corresponds to a single ground truth parameter $\theta^x$ (in some domain $\Theta$). Thus, there exists a well defined (deterministic) function $f: \mathcal{X} \to \Theta$, $f(x) = \theta^x$. Predictive models, e.g. neural networks, typically learn to approximate $f$: given $x$, they output $\hat\theta^x$, an estimate of $\theta^x$. Note that the label $y$ itself is not predicted; e.g. in the $K$-class classification example, if $x$ occurs with multiple distinct labels across the dataset, their mean $\theta^x = \mathbb{E}[y|x]$ is learned. Since each $x$ corresponds to a parameter $\theta^x$, sampling $x \sim p(x)$ induces an implicit distribution over parameters $p(\theta)$. In the $K$-class classification example, $p(\theta)$ is a distribution over mean parameters defined on the simplex $\Delta^K$ (e.g. a Dirichlet distribution). Importantly, for any model learning to predict $\theta^x$, $p(\theta)$ serves as a prior distribution over its expected outputs. For a labelled data point, $p(\theta)$ may add little information further to the label; however, for unlabelled data, $p(\theta)$ provides a way to evaluate a prediction $\hat\theta^{x^u}$ and so train the model, i.e. by updating it to increase $p(\hat\theta^{x^u})$. We will show (Sec. 4) that, under a particular assumption, the analytical form of $p(\theta)$ is known a priori. In general, the empirical distribution of predictions for labelled data $p(\hat\theta^{x^l})$ might sufficiently approximate $p(\theta)$. Formalising, let $\theta_{X^l} = \{\theta^{x^l}\}_{x^l \in X^l}$ be the set of parameters of $p(y|x^l)$ for all $x^l \in X^l$; and let $\theta_{X^u}$ be defined similarly. Treating $\theta^x$ as a latent random variable with hierarchical prior distribution $p(\theta|\alpha)$, parameterised by $\alpha$, the conditional distribution of the data factorises (analogously to Eq.
5) as:

$$p(Y^l|X^l, X^u) = \int_{\alpha,\, \theta_X,\, \hat\theta_X} p(\alpha)\, p(Y^l|\hat\theta_{X^l})\, p(\hat\theta_{X^l}|\theta_{X^l})\, p(\theta_{X^l}|\alpha) \underbrace{\sum_{Y^u} p(Y^u|\hat\theta_{X^u})}_{=1}\, p(\hat\theta_{X^u}|\theta_{X^u})\, p(\theta_{X^u}|\alpha)$$
$$\approx \int_{\alpha,\, \hat\theta_X} p(\alpha)\, p(Y^l|\hat\theta_{X^l})\, p(\hat\theta_{X^l}|\alpha)\, p(\hat\theta_{X^u}|\alpha), \qquad (6)$$

where $\hat\theta^x \doteq f_\omega(x)$ represent estimates of the (ground truth) $\theta^x$, and $f_\omega: \mathcal{X} \to \Theta$ is a family of functions with weights $\omega$, e.g. a neural network. We replace $p(Y|X, \theta_X)$ by $p(Y|\theta_X)$ as $\theta^x$ fully defines $p(y|x)$. The distribution $p(\hat\theta^x|\theta^x)$, over predictions given ground truth parameters, reflects model accuracy (conceptually a noise or error model) and is expected to vary over $\mathcal{X}$. For labelled data, on which the model is trained, we assume (in row 2) that $\hat\theta^x$ closely approximates $\theta^x$, i.e. $p(\hat\theta^x|\theta^x) \approx \delta(\theta^x - \hat\theta^x)$. For unlabelled data, $p(\hat\theta^x|\theta^x)$ is unknown but assumed to increase as predictions approach the true parameter. $p(\hat\theta^x|\alpha) = \int_{\theta^x} p(\hat\theta^x|\theta^x)\, p(\theta^x|\alpha)$ can be interpreted as a relaxation of the prior applied to predictions (a perspective we take going forwards), with equality to the prior in the limiting case $p(\hat\theta^x|\theta^x) = \delta(\theta^x - \hat\theta^x)$. Fig. 1 (right) shows the corresponding graphical model. Taken together, the relationship $f_\omega(x) = \hat\theta^x$, the prior $p(\theta|\alpha)$ and the assumed closeness between $\theta^x$ and $\hat\theta^x$ break the independence noted previously (Sec. 2). Without $f_\omega$, a sample $x^u$ reveals nothing of $p(y|x^u; \theta^{x^u})$; without $p(\hat\theta^x|\alpha)$, predictions can be made but not evaluated or improved. Interpreting terms of Eq. 6:
• $p(Y^l|\hat\theta_{X^l})$ encourages labelled predictions $\hat\theta^{x^l}$ to approximate parameters of $p(y|x^l)$;
• $p(\hat\theta_{X^l}|\alpha)$ allows $\alpha$ to capture the distribution over $\hat\theta^{x^l}$, e.g. to approximate $p(\theta|\alpha)$; and
• $p(\hat\theta_{X^u}|\alpha)$ allows predictions $\hat\theta^{x^u}$ on unlabelled data to be evaluated under prior knowledge of $p(\theta|\alpha)$, or its approximation learned from labelled data (as above).
Maximum a posteriori estimates of $\theta^x$ are given by optimising Eq. 6, e.g. in $K$-class classification by minimising the following objective with respect to $\omega$ (classes indexed by $k$; recall $\hat\theta^x \doteq f_\omega(x)$):

$$\ell_{DSSL}(\hat\theta) = -\sum_i \sum_k y^l_{i,k} \log \hat\theta^{x^l_i}_k - \sum_i \log p(\hat\theta^{x^l_i}|\alpha) - \sum_j \log p(\hat\theta^{x^u_j}|\alpha) \qquad (7)$$

The $p(\theta|\alpha)$ terms can be interpreted as regularising a supervised learning model. However, unlike typical regularisation, e.g. $\ell_1$, $\ell_2$, here it applies to model outputs $\hat\theta^x$ not weights $\omega$. Fundamentally, $p(\theta)$ provides a relationship between data samples that enables SSL, as an alternative to $p(x)$. A natural question arises: given that SSL has fewer $y$ than $x$ by definition, why consider SSL methods that depend on $p(y|x)$ rather than $p(x)$? Fortunately, the two options are not mutually exclusive, but rather orthogonal and can be combined, as in recent approaches (Berthelot et al., 2019b; a). Furthermore, the structure of $p(\theta)$ is often far simpler than that of $p(x)$ and may be known a priori, thus applying DSSL can be straightforward. We analyse the form of $p(\theta)$ in several cases in Sec. 4. To relate discriminative and generative SSL, we highlight the inherent symmetry between them. Restating the joint distributions behind Eqs. 1 and 6 (see Appendix B for details) as:

$$p(Y^l, X^l, X^u) = \int_{\pi,\, \psi^Y} p(\pi)\, p(X^l|\psi^{Y^l})\, p(\psi^{Y^l}, Y^l|\pi) \sum_{Y^u} p(X^u|\psi^{Y^u})\, p(\psi^{Y^u}, Y^u|\pi) \quad \text{[Gen.]}$$
$$p(Y^l, X^l, X^u) = \int_{\alpha,\, \theta^X} p(\alpha)\, p(Y^l|\theta^{X^l})\, p(\theta^{X^l}, X^l|\alpha) \sum_{Y^u} p(Y^u|\theta^{X^u})\, p(\theta^{X^u}, X^u|\alpha) \quad \text{[Disc.]}$$

reflects a similar hierarchical structure in which one element of the data ($y$ in the former, $x$ in the latter) acts to 'index' a distribution over the other (see superscript to $\psi$ or $\theta$, resp.). To gain some understanding of $p(\theta)$, we note that $f(x) = \theta^x$ (assumed differentiable) gives the relationship $p(\theta) = |J|\,p(x)$, for $J$ the Jacobian matrix $J_{i,j} = \partial x_i / \partial \theta_j$. Thus, if $p(x) = \sum_k p(x|y=k)\,\pi_k$ is a mixture distribution with class probabilities $\pi_k = p(y=k)$, then $p(\theta)$ is also:

$$p(\theta) = |J|\,p(x) = |J| \sum_k p(x|y=k)\,\pi_k = \sum_k |J|\,p(x|y=k)\,\pi_k = \sum_k p(\theta|y=k)\,\pi_k.$$
Thus any cluster/mixture assumption of p(x) applies also to p(θ); and class conditional distributions p(θ|y) over ground truth parameters must differ sufficiently for classification to be possible.
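The change-of-variables identity $p(\theta) = |J|\,p(x)$ can be checked numerically for the two-Gaussian example analysed later (Sec. 4, Appendix A). The sketch below, with names of our own choosing and assuming equal known variance $\sigma^2$, inverts $\theta_1(x)$ and multiplies the mixture density of $x$ by $|dx/d\theta_1|$; the result should integrate to 1 over $(0, 1)$.

```python
import numpy as np

def posterior_theta1(x, mu, sigma, pi):
    """theta_1 = p(y=1|x) for a two-component 1-D Gaussian mixture."""
    log_r = (np.log(pi[1] / pi[0])
             + (mu[1] - mu[0]) * x / sigma**2
             + (mu[0]**2 - mu[1]**2) / (2 * sigma**2))
    return 1.0 / (1.0 + np.exp(-log_r))

def p_theta1(theta1, mu, sigma, pi):
    """Density of theta_1 via p(theta) = |J| p(x): invert theta_1(x),
    then multiply the mixture density of x by |dx/dtheta_1|."""
    L = np.log(theta1 / (1.0 - theta1))
    x = sigma**2 / (mu[1] - mu[0]) * (L - np.log(pi[1] / pi[0])) + (mu[0] + mu[1]) / 2
    px = sum(p / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - m)**2 / (2 * sigma**2))
             for p, m in zip(pi, mu))
    jac = sigma**2 / (abs(mu[1] - mu[0]) * theta1 * (1.0 - theta1))
    return jac * px
```

The total probability mass of `p_theta1` over $(0, 1)$ comes out at 1 (up to numerical integration error), confirming that the induced $p(\theta)$ is a valid density.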

4. APPLICATIONS OF DISCRIMINATIVE SEMI-SUPERVISED LEARNING

Implementing Eq. 6 requires a description of $p(\theta)$, ideally in analytic form. Such form depends heavily on two properties of the data: (i) the label domain $\mathcal{Y}$ being continuous or discrete; and (ii) $y|x$ being stochastic or deterministic. Further, discrete labels may represent (a) $K$ distinct classes as 'one-hot' vectors $y \in \{e_k\}_{k\in\{1..K\}}$, where each distribution $p(y|x)$ is parameterised by $\theta^x \in \Delta^K$ (the simplex), $\theta^x_k = p(y=k|x)$; or (b) $K$ binary (non-exclusive) features with $y \in \{0,1\}^K$ (combinations of which give $2^K$ distinct labels), where $\theta^x \in \Delta^{2^K}$, $\theta^x_k = p(y_k=1|x)$. Table 1 shows examples of combinations of these factors, which determine the form of $p(\theta)$. Italicised cases are discussed in detail.

Deterministic classification, distinct classes ($y \in \{e_k\}_k$, $\theta^x \in \{e_k\}_k$): If $y|x$ is deterministic, then

$$p(\theta) = \sum_k p(y=k)\, p(\theta|y=k) = \sum_k \pi_k\, \delta(\theta - e_k).$$

To clarify, for any $x \in \mathcal{X}$, the corresponding parameter $\theta^x$ is always one-hot (the '1' indicating the single corresponding label $y$), ruling out stochastic parameters that imply the same $x$ can have multiple labels. Note that, irrespective of the complexity of $p(x)$, $p(\theta)$ is defined concisely. However, this distribution is discontinuous and lacks support over almost all of $\Delta^K$, i.e. $p(\theta) = 0$ for any $\theta \neq e_k$, making it unsuitable for gradient-based learning methods. A continuous approximation to $p(\theta)$ is obtained by relaxing each delta component to a suitable function over $\Delta^K$. Such relaxation can be interpreted as estimating a noise or error model $p(\hat\theta^x|\theta^x)$ of predictions given true parameters (see Sec. 3). From Eqs. 2, 3 and 4 and Fig. 4, the unlabelled loss components of the SSL methods entropy minimisation (Grandvalet & Bengio, 2005), mutual exclusivity (Sajjadi et al., 2016a; Xu et al., 2018) and pseudo-labelling (Lee, 2013) can be seen to impose (un-normalised) continuous relaxations $\tilde p(\theta)$ of the discrete $p(\theta)$. (Note: such $\tilde p(\theta)$ need not be normalised in practice since a weighting term in the loss function renders any proportionality constant irrelevant.)
We thus theoretically unify these methods under the probabilistic model for discriminative SSL (Eqs. 6, 7).

Deterministic classification, non-exclusive features ($y \in \{0,1\}^K$, $\theta \in \{0,1\}^K$): In some classification tasks, label vectors $y \in \{0,1\}^K$ represent multiple ($K$) binary attributes of $x$, e.g. features present in an image, a solution to Sudoku, or the relations connecting subject and object entities in a knowledge base. As in those examples, $y|x$ can be deterministic. Where so, for a given $x^*$ and its (unique) label $y^*$, the conditional distribution $p(y|x^*)$ equates to the indicator function $\mathbb{1}_{y=y^*}$, as parameterised by $\theta^{x^*} = y^*$. Thus, all (true) parameters $\theta^x$ must be at vertices of the simplex, $\{0,1\}^K \subset \Delta^{2^K}$ (analogous to one-hot vectors previously). It follows that $p(\theta|y) = \delta(\theta - y)$ and so $p(\theta) = \sum_y \pi_y\, \delta(\theta - y)$ is a weighted sum of point probability masses at $\theta \in \{0,1\}^K$. A continuous relaxation of $p(\theta)$ is again required for gradient-based learning. The case becomes more interesting when considering logical relationships that can exist between attributes (Sec. 5). Note that any $\theta \in \{0,1\}^K \subset \Delta^{2^K}$ in the support of $p(\theta)$ ($2^K$ points in a continuous space) corresponds one-to-one with a label $y \in \{0,1\}^K$. As such, the distribution $p(\theta)$ could potentially be learned from unpaired labels $y \sim p(y)$, a variation of typical SSL (we leave this direction to future work).

Figure 4: Unsupervised loss components of entropy minimisation (Eq. 2), mutual exclusivity (Eq. 3) and pseudo-labelling (Eq. 4) (exponentiated for comparison to probabilities), seen as continuous relaxations $\tilde p(\theta)$ of the discrete distribution $p(\theta)$, for deterministic $y|x$ with distinct classes.
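As a small check of the relaxation view above (cf. Fig. 4), the exponentiated negative entropy term of Eq. 2 is maximal exactly at the simplex vertices $e_k$, i.e. on the support of the discrete $p(\theta)$, and smaller everywhere else. A minimal sketch (function name ours):

```python
import numpy as np

def relaxed_prior(theta):
    """Un-normalised continuous relaxation of the delta-mixture prior
    p(theta) = sum_k pi_k delta(theta - e_k): exp(-H(theta)), where H is
    the entropy term of Eq. 2. Equals 1 at each vertex e_k (H = 0) and is
    strictly below 1 in the interior of the simplex."""
    eps = 1e-12  # avoid log(0)
    H = -np.sum(theta * np.log(theta + eps), axis=-1)
    return np.exp(-H)
```

Because entropy is non-negative and vanishes only at one-hot vectors, this relaxation peaks precisely where the discrete $p(\theta)$ places its mass, which is the sense in which minimising entropy imposes (a relaxation of) the prior.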

5. NEURO-SYMBOLIC SEMI-SUPERVISED LEARNING

In classification with non-exclusive binary features, certain feature combinations may be impossible, e.g. an animal having both legs and fins, three kings on a chess board, or knowledge base entities being related by capital_city_of but not city_in. Where so, the support of $p(y|x)$ for any $x$ is confined to a data-specific set of valid labels $\mathcal{V}$, a subset of all plausible labels $\mathcal{P} = \{0,1\}^K$, i.e. $p(y|x) = 0$, $\forall y \in \mathcal{P}\setminus\mathcal{V}$. If $y|x$ is deterministic, there is a 1-1 correspondence between $y$ and $\theta \in \{0,1\}^K$ (Sec. 4), and we use $\mathcal{V}$, $\mathcal{P}$ to refer to both labels $y$ and parameters $\theta$ that are valid or plausible (resp.). Thus:

$$p(\theta|\alpha) = \sum_{y\in\mathcal{V}} p(y)\,p(\theta|y) = \sum_{y\in\mathcal{V}} \pi_y\, \delta(\theta - y), \qquad (9)$$

where $\alpha = \{\mathcal{V}, \Pi_\mathcal{V}\}$ and $\Pi_\mathcal{V} = \{\pi_y = p(y)\}_{y\in\mathcal{V}}$ are marginal label probabilities. (Note that Eq. 9 also holds for any 'larger' set $\mathcal{V}'$, where $\mathcal{V} \subseteq \mathcal{V}' \subseteq \mathcal{P}$.) As in the examples mentioned, the set of valid labels $\mathcal{V}$ may be constrained, even defined, by a set of rules, e.g. mutual exclusivity of certain attributes, rules of a game, or relationships between entity relations. Importantly, if a set of rules constrains $\mathcal{V}$, Eq. 9 shows that they constrain the support of $p(\theta)$, directly connecting them to the distribution used in discriminative semi-supervised learning (Eqs. 6, 7). This is appealing since logical rules possess certainty (cf. the uncertain generalisation of statistical models, e.g. neural networks) and their universality may allow a large set $\mathcal{V}$ to be defined relatively succinctly. To focus on the support of $p(\theta)$, we drop $\pi_y$ and consider probability mass (replacing $\delta(\theta_k - c)$ with the Kronecker delta $\delta_{\theta_k c}$), to define:

$$s(\theta) = \sum_{y\in\mathcal{V}} \delta_{\theta y} = \sum_{y\in\mathcal{V}}\, \prod_{k: y_k=1} \delta_{\theta_k 1} \prod_{k: y_k=0} \delta_{(1-\theta_k) 1}, \qquad (10)$$

where each term in the summation effectively evaluates whether $\theta$ matches a valid label $y \in \mathcal{V}$, i.e. $s(\theta) = 1$ if $\theta \in \mathcal{V}$, else $s(\theta) = 0$. Restricting to plausible $\theta \in \mathcal{P}$ and defining logical variables $z_k \iff (\delta_{\theta_k 1} = 1)$, it can be seen that Eq. 10 is equivalent to a logical formula in propositional logic:

$$\bigvee_{y\in\mathcal{V}} \Big( \bigwedge_{k: y_k=1} z_k \;\wedge \bigwedge_{k: y_k=0} \neg z_k \Big), \qquad (11)$$

which evaluates to True $\iff \theta \in \mathcal{V} \iff s(\theta) = 1$. Comparing Eqs. 10 and 11 shows a relationship between logical and mathematical operations common in the fuzzy logic and neuro-symbolic literature (e.g. Bergmann, 2008; Serafini & Garcez, 2016; van Krieken et al., 2019). Here, True maps to 1, False to 0, $\wedge$ to multiplication, $\vee$ to addition; and where $z_k$ corresponds to $\delta_{\theta_k 1} = 1$ (a function of $\theta_k$), $\neg z_k$ maps to $\delta_{(1-\theta_k) 1} = 1$ (the same function applied to $1-\theta_k$). In fact, for any ($m$-ary) propositional logic operator $\bullet(X_1 ... X_m)$, e.g. $X_1 \Rightarrow X_2$, several functional representations $\rho_\bullet: [0,1]^m \to [0,1]$ exist, taking binary inputs, corresponding to $X_i \in \{$True, False$\}$, and outputting $\rho_\bullet = 1$ if $\bullet$ evaluates to True, else 0 (Bergmann, 2008; Marra et al., 2019). The functional representation for a logical formula composed of several logical operators is constructed by compounding the functional representations of its components. Where two logical formulae are equivalent, their functional representations are equivalent in that each evaluates to 1 iff the logical formula is True, else 0. As such, any set of logical rules that defines $\mathcal{V}$ is equivalent to Eq. 11, and can be converted to a functional representation equivalent to $s(\theta)$ in Eq. 10, restricted to $\theta \in \mathcal{P}$. Thus, logical rules can be converted into a function $q_\delta(\theta)$ (defined for $\theta \in \mathcal{P} = \{0,1\}^K$) that evaluates whether a binary vector is in $\mathcal{V}$, the support of $p(\theta)$:

$$q_\delta(\theta) = \begin{cases} 1, & \theta \in \mathcal{V} \\ 0, & \theta \in \mathcal{P}\setminus\mathcal{V}, \end{cases}$$

i.e. $q_\delta(\theta) = \mathbb{1}_{\theta\in\mathcal{V}}$. Fig. 5 (left, centre) gives a simple illustration.
As in previous cases, gradient-based learning requires a relaxation of this function defined over the domain of model predictions $[0,1]^K$. This is achieved by replacing the use of $\delta_{\theta_k 1}$ with any function $g(\theta_k): [0,1] \to [0,1]$, $g(1) = 1$, $g(0) = 0$ (a relaxation of $\delta_{\theta_k 1}$). By choosing $g$ continuous, the resulting $q_g(\theta): [0,1]^K \to [0,1]$ is continuous and satisfies $q_g(\theta) = s(\theta) = 1$ for valid $\theta \in \mathcal{V}$, and $q_g(\theta) = s(\theta) = 0$ for invalid $\theta \in \mathcal{P}\setminus\mathcal{V}$, providing a continuous relaxation of $p(\theta)$, $\forall \theta \in [0,1]^K$ (Fig. 5, right), up to probability weights $\pi_y$ (see Appendix C). Thus the distribution $p(\theta)$ required for DSSL (Eqs. 6, 7) can be approximated by a functional representation of logical rules. In practice, the choice $g(\theta_k) = \theta_k$ from fuzzy logic (Bergmann, 2008) is often used (e.g. Serafini & Garcez, 2016; van Krieken et al., 2019; Marra et al., 2019). Applying this choice directly to Eq. 10 gives the semantic loss (Xu et al., 2018), which is thus probabilistically justified and unified under the model for discriminative SSL. Under the same DSSL model, $p(\theta)$ can also be learned from the labelled data, justifying the use of logical techniques, such as abduction (Wang et al., 2019; Dai et al., 2019), to extract rules consistent with observed labels, i.e. that entail $\mathcal{V}$. Many works combine functional representations of logical formulae with statistical machine learning. We have shown that such methods are theoretically justified and that logical rules fit naturally into a probabilistic framework, i.e. by defining the support of $p(\theta)$, the distribution necessary for discriminative semi-supervised learning.
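The semantic-loss-style relaxation can be sketched directly from Eq. 10. Below is a minimal version with $g(\theta_k) = \theta_k$; the valid set $\mathcal{V}$ is a hypothetical example of ours (encoding, say, $z_1 \Rightarrow z_2$ and $\neg z_3$ over $K = 3$ attributes), and function names are ours.

```python
import numpy as np

def q_delta(theta_bits, valid_set):
    """Exact support indicator q_delta over P = {0,1}^K (cf. Eq. 10)."""
    return 1.0 if tuple(theta_bits) in valid_set else 0.0

def q_g(theta, valid_set):
    """Fuzzy relaxation with g(t) = t: sum over valid labels y of
    prod_k theta_k^{y_k} (1 - theta_k)^{1 - y_k}. For theta in [0,1]^K
    this relaxes q_delta; -log q_g is the semantic loss of Xu et al. (2018)."""
    theta = np.asarray(theta, dtype=float)
    total = 0.0
    for y in valid_set:
        y = np.asarray(y)
        # pick theta_k where y_k = 1, (1 - theta_k) where y_k = 0
        total += np.prod(np.where(y == 1, theta, 1.0 - theta))
    return total

# hypothetical example: valid labels under rules z1 => z2 and not z3
V = {(0, 0, 0), (0, 1, 0), (1, 1, 0)}
```

At the vertices of $[0,1]^K$, `q_g` agrees exactly with `q_delta` (1 on valid labels, 0 on invalid ones), while in the interior it is smooth, so its gradient can guide unlabelled predictions towards valid $\theta$, as described above.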

6. CONCLUSION

In this work, we present a hierarchical probabilistic model for discriminative semi-supervised learning, complementing the analogous model for classical generative SSL methods. Central to this model are the parameters $\theta^x$ of distributions $p(y|x; \theta^x)$, as often predicted by neural networks. The distribution $p(\theta)$ over those parameters serves as a prior over the outputs of a predictive model for unlabelled data. Depending on properties of the data, in particular whether $y|x$ is deterministic, the analytical form of $p(\theta)$ may be known a priori. Whilst not explored in this paper, the model for DSSL shows that an empirical estimate of $p(\theta)$ might also be learned from labelled data predictions (or indeed unpaired labels). In cases where labels reflect multiple binary attributes, logical relationships may exist between attributes. We show how such rules fit within the same probabilistic model for DSSL, providing a principled means of combining logical reasoning and statistical machine learning. Logical rules can be known a priori and imposed, or potentially learned from the data. Our single model for discriminative semi-supervised learning probabilistically justifies and unifies families of methods from the SSL and neuro-symbolic literature, and accords with a general architecture proposed for neuro-symbolic computation (Valiant, 2000; Garcez et al., 2019), comprising low level perception and high level reasoning modules (Fig. 2). In future work, we plan to consider the more complicated case where $y|x$ is stochastic (i.e. incorporating aleatoric uncertainty); to make more rigorous the notion of a noise or error model (i.e. capturing epistemic uncertainty); and to extend the principled combination of statistical machine learning and logical reasoning to supervised learning scenarios. Implications: Considering only the support of $p(\theta)$ may be sufficient for the purposes of DSSL, since where $p(\theta)$ provides a prior over unlabelled predictions, it is not used alone.
If we had the full (discrete) $p(\theta)$ and it alone were used to predict labels for unlabelled data, a maximum likelihood approach would simply assign the most common label to all unlabelled data points. However, $p(\theta)$ is used in conjunction with a supervised model that learns to approximate $f(x) = \theta^x$ and gives predictions $\hat\theta^x$, taking class probabilities into account. To the extent the model generalises, its predictions on unlabelled data should correlate with (i.e. be close to) their true values $\theta^x$. Therefore a function $q_g(\theta)$ that is a continuous relaxation of the support of $p(\theta)$ helps by guiding predictions $\hat\theta^x$ to nearby valid values of $\theta$, which should to some extent reflect the correct labels. Intuitively, the prior class weights may be useful for data samples where the model is highly uncertain, where the best option may again be to choose the most popular class. For well balanced classes, ignoring $\pi_y$ should have little impact.



Figure 1: Graphical models for: generative SSL (left); discriminative SSL (previous (Chapelle et al., 2006)) (centre); discriminative SSL (ours) (right). Shaded variables are observed (others latent).

Figure 2: A general framework for neuro-symbolic learning combining statistical learning (perception) and logical rules (reasoning) (Valiant, 2000; Garcez et al., 2019). We draw an analogy to our probabilistic model for discriminative SSL (§3), in which p(θ) can be defined with logical rules (§5).

Figure 3: The distribution p(θ) for a mix of 2 univariate Gaussians (varying class separation).

Stochastic classification, distinct classes ($y \in \{e_k\}_k$, $\theta^x \in \Delta^K$): To see how $p(x)$ and $p(\theta)$ can relate, we consider a mixture of two 1-dimensional equivariant Gaussians: $p(x) = \sum_k \pi_k\, p(x|y=k)$, $k \in \{0,1\}$, where $x|y{=}k \sim \mathcal{N}(\mu_k, \sigma^2)$. Here, $p(\theta)$ can be derived in closed form (see Appendix A):

$$p(\theta) = \sum_{k=0}^{1} \pi_k\, \underbrace{\frac{\sigma}{\sqrt{2\pi}\,|\mu_1-\mu_0|}\, \exp\Big\{ a\big(\log\tfrac{\theta_1}{\theta_0}\big)^2 + (b_k - 1)\log\theta_1 + (-b_k - 1)\log\theta_0 + c_k \Big\}}_{p(\theta|y=k)}$$

for constants $a$, $b_k$, $c_k$ determined by $\pi_k$, $\mu_k$ and $\sigma$ (Appendix A).

Figure 5: An illustration of the correspondence between logical rules and the support of p(θ). (top left) All plausible values of θ if y|x is deterministic, i.e. θ restricted to the vertices (Sec. 4). (bottom left) An example set of logical rules over label attributes. (centre) All valid values of θ under the rules, as encoded in $q_\delta(\theta)$, a function over $\mathcal{P}$ that defines the support of p(θ). (right) $q_g(\theta)$, a relaxation of $q_\delta(\theta)$, defined over $[0,1]^K$, the gradient of which can "guide" unlabelled predictions towards valid θ.

Table 1: Task and data properties affecting the distribution p(θ) over parameters of $p(y|x; \theta^x)$.

APPENDIX A DERIVATION OF p(θ) FOR A MIXTURE OF GAUSSIANS

For a general mixture distribution, the posterior class probabilities satisfy:

$$\log\frac{\theta_1}{\theta_0} = \log\frac{\pi_1\, p(x|y=1)}{\pi_0\, p(x|y=0)},$$

which, in our particular case (equivariant Gaussians), becomes:

$$\log\frac{\theta_1}{\theta_0} = \log\frac{\pi_1}{\pi_0} + \frac{(\mu_1-\mu_0)}{\sigma^2}\,x + \frac{\mu_0^2-\mu_1^2}{2\sigma^2}.$$

Rearranging the latter gives $x$ in terms of $\theta$:

$$x = \frac{\sigma^2}{\mu_1-\mu_0}\Big(\log\frac{\theta_1}{\theta_0} - \log\frac{\pi_1}{\pi_0}\Big) + \frac{\mu_0+\mu_1}{2}.$$

Substituting into $p(\theta) = |J|\,p(x)$, with $|J| = \frac{\sigma^2}{|\mu_1-\mu_0|\,\theta_1\theta_0}$, gives the closed form of $p(\theta)$ stated in Section 4 (Figure 3).

APPENDIX B RESTATING THE JOINT DISTRIBUTIONS

Explanation: Each term $\psi^Y$ parameterises a distribution of the form $p(X|Y, \psi^Y)$. Those distributions are conditional on the labels $Y$, hence we attach that label to the respective parameter to identify the correspondence. Such parameters are referred to collectively as $\psi$ in line 1. Line 2 separates them and identifies where each occurs elsewhere. Since labels $Y$ and their associated parameters $\psi^Y$ go hand in hand, they are probabilistically interchangeable: we could think of drawing each label $y$ from a pool of $k$ labels (with the parameter $\psi^y$ coming with it), or of drawing a parameter $\psi^y$ from a pool of $k$ parameters. This explains the last 2 lines. For clarity, note that in $p(X^u|Y^u, \psi^{Y^u})$, $Y^u$ can be considered redundant: given the parameter of the distribution, the identity of the label of that distribution is irrelevant.

APPENDIX C JUSTIFICATION FOR CONSIDERING ONLY THE SUPPORT OF p(θ)

In Section 5, we focus on the support of p(θ) defined by the set of valid binary vectors $\mathcal{V}$, ignoring the corresponding class probabilities $p(y) = \pi_y \in \Pi_\mathcal{V}$. The discriminative SSL methods analysed (see Eqs. 2, 3, 4) ignore class weights also. Practical reasons for doing so are (i) that they may not be known; and (ii) that unless attributes are independent ($p(y) = \prod_k p(y_k)$), class probabilities do not factorise across dimensions equivalently to the support (Eq. 10). We briefly consider the validity and implications of considering only the support of p(θ).

Validity: Regardless of whether capturing all aspects of p(θ) is preferable, considering only its support is valid since it is equivalent to assuming a uniform distribution over the support. The resulting approximation to p(θ) might be considered a "partially-uninformative" prior.

