

Abstract

Classification tasks, ubiquitous across machine learning, are commonly tackled by a suitably designed neural network with a softmax output layer, mapping each data point to a categorical distribution over class labels. We extend this familiar model from a latent variable perspective to variational classification (VC), analogous to how the variational auto-encoder relates to its deterministic counterpart. We derive a training objective based on the ELBO, together with an adversarial approach for optimising it. Within this framework, we identify design choices made implicitly in off-the-shelf softmax functions and can instead include domain-specific assumptions, such as class-conditional latent priors. We demonstrate benefits of the VC model in image classification. We show, on several standard datasets, that treating inputs to the softmax layer as latent variables under a mixture-of-Gaussians prior improves several desirable aspects of a classifier, such as prediction accuracy, calibration, out-of-domain calibration and adversarial robustness.

1. INTRODUCTION

Classification is central to much of machine learning, not only in its own right, e.g. to categorise everyday objects (Klasson et al., 2019), make medical diagnoses (Adem et al., 2019; Mirbabaie et al., 2021) or detect potentially life-supporting planets (Tiensuu et al., 2019), but also as an important component in other learning paradigms, e.g. to select actions in reinforcement learning, to distinguish positive and negative samples in contrastive learning, or within the attention mechanism of large language models. Recently, it has become all but default to tackle classification tasks with domain-specific neural networks with a sigmoid or softmax output layer.1 The neural network deterministically maps each data point x (in a domain X) to a real vector f_ω(x), which the last layer maps to the parameters of a discrete distribution p_θ(y|x) over class labels y ∈ Y, defined by a point on the simplex Δ^|Y|, e.g.:

p_θ(y|x) = softmax(x; θ)_y = exp{g(x, y; θ)} / Σ_{y′∈Y} exp{g(x, y′; θ)} = exp{f_ω(x)^⊤ w_y + b_y} / Σ_{y′∈Y} exp{f_ω(x)^⊤ w_{y′} + b_{y′}}    (1)

Despite frequently outperforming alternatives and their widespread use, softmax classifiers are not without issue. The overall mapping from X to Δ^|Y| is learned numerically by minimising a loss function over a finite set of training samples. The result is poorly understood in general, remaining in many respects a "black box" with predictions that are hard to rationalise. A trained classifier may make accurate predictions on the training set, but predictions for other data points, e.g. test data, are determined by f_ω ∈ F, a class of functions chosen to be highly flexible in the hope of approximating the unknown true mapping f(x) = {p(y|x)}_{y∈Y}.
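Concretely, the last-layer computation in Eq. (1) is just an affine map of the features followed by exponentiation and normalisation. A minimal NumPy sketch (shapes and values are illustrative; `f_x` stands in for the network output f_ω(x)):

```python
import numpy as np

def softmax_head(f_x, W, b):
    """Map features f_omega(x) to a categorical distribution over labels.

    f_x : (d,)   feature vector f_omega(x) produced by the network body
    W   : (d, K) matrix whose columns are the class weight vectors w_y
    b   : (K,)   class biases b_y
    """
    logits = f_x @ W + b              # g(x, y; theta) for every class y
    logits = logits - logits.max()    # stabilise exp; normalisation cancels the shift
    p = np.exp(logits)
    return p / p.sum()                # Eq. (1): normalise over y' in Y

rng = np.random.default_rng(0)
f_x = rng.normal(size=8)                       # stand-in for f_omega(x)
W, b = rng.normal(size=(8, 3)), np.zeros(3)
p = softmax_head(f_x, W, b)
assert np.isclose(p.sum(), 1.0) and (p > 0).all()  # a valid point on the simplex
```

The shift by `logits.max()` is a standard numerical-stability trick and leaves the output of Eq. (1) unchanged.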
With sufficient flexibility, (i) F may contain many, possibly infinitely many, functions (for different ω) that give accurate training-set predictions but dissimilar, hence uncertain, predictions elsewhere; and (ii) predictions can change materially under imperceptible changes to the data (adversarial examples). Lastly, where p_θ(y|x) fails to reflect the true label distribution p(y|x), it fails to reflect the frequency with which classes are expected to occur, making the classifier miscalibrated.

A standard softmax classifier also lacks several desirable properties. For example, under certain conditions a softmax classifier is known to learn to approximate p(y|x) and so captures stochastic, or aleatoric, uncertainty in the data. In practice, however, predictions for some regions of X may be more reliable than for others, and it can be important to understand the confidence, or epistemic uncertainty, of predictions. Also, beyond making positive classifications from a set of labels, it can be useful to know when a sample is "none of the above", or out of distribution. In such cases, a softmax might output a uniform distribution over all classes, but that is equally appropriate for a fully in-distribution sample that occurs with all labels, and so is ambiguous without further assumptions.

Overall, classification accuracy, model calibration and adversarial robustness depend on how well a model p_θ(y|x) approximates the true distribution p(y|x), in particular how it interpolates/extrapolates from the training set to X. Meanwhile, prediction confidence and out-of-sample detection depend on how new data samples compare to those seen at training time. This suggests that improving a softmax classifier in general requires (i) modelling p(y|x) more accurately, e.g. by obtaining more data or constraining f_ω in a useful, perhaps domain-specific way; and (ii) developing a measure of familiarity between data samples, e.g. by learning a distribution over x, or similar.
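Miscalibration in the sense above is measurable: expected calibration error (ECE) compares a model's stated confidence with its empirical accuracy over confidence bins. A minimal sketch (the binning scheme and toy inputs are illustrative, not taken from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: |empirical accuracy - mean confidence| per equal-width confidence bin,
    averaged with weights equal to the fraction of samples in each bin."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # half-open bins (lo, hi]
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# A perfectly calibrated toy model: 70%-confident predictions correct 70% of the time.
conf = np.full(1000, 0.7)
correct = np.array([1] * 700 + [0] * 300)
assert expected_calibration_error(conf, correct) < 1e-8
```

A classifier whose p_θ(y|x) tracks the true p(y|x) would score near zero; overconfident models score higher.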
We approach this by generalising the mechanics of the softmax classifier, guided by two key observations: (i) the parallel between the softmax p_θ(y|x) = exp{g(x,y)} / Σ_{y′∈Y} exp{g(x,y′)} and Bayes' rule p(y|x) = p(x,y) / Σ_{y′∈Y} p(x,y′); and (ii) that each layer of a neural network classifier is a function of a random variable (the data x) and so can itself be treated as a latent random variable z. The first hints at a probabilistic interpretation of the softmax function. The second also underpins the relationship between the deterministic auto-encoder and the variational auto-encoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014), the latter of which both generalises and constrains the former. In a similar way, we propose the variational classifier (VC), which introduces a latent variable z into the (Markov) prediction model p_θ(y|x) = ∫_z p_θ(y|z) p_θ(z|x) dz, with marginal p(z). This offers a promising, principled way to address the issues outlined if p(z) both usefully constrains the model to improve accuracy and, when evaluated at latent variables associated with observed data, indicates their "familiarity". We develop an analogue of the ELBO to train a VC and propose an "adversarial trick" for its optimisation. We show that the standard softmax classifier falls within the VC framework under distributional assumptions that equate to implicit design choices. By identifying such choices, alternatives can be introduced where appropriate, such as domain-specific latent priors. This also shows that the VC framework does not "complicate matters" by requiring difficult distribution choices to be made; rather, it exposes that default assumptions are made in softmax classifiers that may not be optimal.
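Under this latent-variable view, prediction marginalises over z, which can be approximated by Monte Carlo sampling when p_θ(z|x) is a simple parametric distribution. A sketch assuming a Gaussian p_θ(z|x) and a linear softmax layer for p_θ(y|z) (all parameters here are hypothetical placeholders):

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def vc_predict(mu, sigma, W, b, n_samples=100, rng=None):
    """Monte Carlo estimate of p(y|x) = E_{z ~ p(z|x)}[ p(y|z) ].

    mu, sigma : parameters of an assumed Gaussian encoder p(z|x)
    W, b      : softmax layer defining p(y|z)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    z = mu + sigma * rng.normal(size=(n_samples, mu.shape[0]))  # z ~ p(z|x)
    return softmax(z @ W + b).mean(axis=0)                      # average p(y|z)

rng = np.random.default_rng(1)
mu, sigma = rng.normal(size=4), 0.5
W, b = rng.normal(size=(4, 3)), np.zeros(3)
p = vc_predict(mu, sigma, W, b)
assert np.isclose(p.sum(), 1.0)
```

Setting sigma to zero recovers the deterministic softmax classifier, mirroring how a VAE reduces to a deterministic auto-encoder.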
A regular softmax classifier is also seen to concentrate the class-conditional latent distributions q_ϕ(z|y) = ∫_x q_ϕ(z|x) p(x|y) dx to a single point, akin to a maximum likelihood point estimate, whereas a VC fits q_ϕ(z|y) to a class-conditional prior p_θ(z|y), mirroring a more Bayesian treatment (see Figure 1). On a series of image classification experiments, we demonstrate that a VC outperforms a regular softmax classifier in many of the ways outlined, such as calibration, including for out-of-domain samples, and adversarial robustness, and modestly improves accuracy, more notably when data is scarce. We believe the VC framework offers a deeper interpretation of softmax classification and takes a step towards more fully understanding these familiar models, potentially enabling their further improvement and/or integration with other latent variable models, e.g. VAEs or contrastive learning.
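A class-conditional Gaussian prior also yields a natural familiarity score: the marginal p(z) = Σ_y p(z|y) p(y) can be evaluated at a test point's latent code, with low density suggesting an out-of-distribution input. A sketch with hypothetical prior parameters (isotropic Gaussians, uniform p(y)):

```python
import numpy as np

def log_gaussian(z, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I) at z."""
    d = z.shape[-1]
    return -0.5 * (d * np.log(2 * np.pi * var) + ((z - mean) ** 2).sum(-1) / var)

def log_marginal_prior(z, class_means, var=1.0):
    """log p(z) = log sum_y p(z|y) p(y), uniform p(y), via log-sum-exp."""
    K = len(class_means)
    logs = np.array([log_gaussian(z, m, var) for m in class_means]) - np.log(K)
    m = logs.max()
    return m + np.log(np.exp(logs - m).sum())

# Two well-separated class priors: a latent code near one class is "familiar",
# while one far from every class has low marginal density.
means = [np.array([3.0, 0.0]), np.array([-3.0, 0.0])]
near = log_marginal_prior(np.array([3.0, 0.1]), means)
far = log_marginal_prior(np.array([0.0, 10.0]), means)
assert near > far
```

The log-sum-exp in `log_marginal_prior` is the standard stable way to sum densities in log space.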



1 Since the softmax function generalises the sigmoid function to more than two classes, we refer to softmax throughout, but all arguments can be applied to the sigmoid case.



Figure 1: Distributions of softmax inputs q_ϕ(z|y), z ∈ R², under the three VC training objectives (MNIST dataset): (left) standard softmax cross-entropy - "MLE" treatment; (centre) with class-conditional Gaussian priors p_θ(z|y) - "MAP" treatment; (right) additionally with an entropy term - "Bayesian" treatment. (Note: softmax inputs have been artificially restricted to 2 dimensions for visualisation purposes.)

