

Abstract

Classification tasks, ubiquitous across machine learning, are commonly tackled by a suitably designed neural network with a softmax output layer, mapping each data point to a categorical distribution over class labels. We extend this familiar model from a latent variable perspective to variational classification (VC), analogous to how the variational auto-encoder relates to its deterministic counterpart. We derive a training objective based on the ELBO, together with an adversarial approach for optimising it. Within this framework, we identify design choices made implicitly in off-the-shelf softmax functions and can instead include domain-specific assumptions, such as class-conditional latent priors. We demonstrate benefits of the VC model in image classification. We show on several standard datasets that treating inputs to the softmax layer as latent variables under a mixture-of-Gaussians prior improves several desirable aspects of a classifier, such as prediction accuracy, calibration, out-of-domain calibration and adversarial robustness.

1. INTRODUCTION

Classification is central to much of machine learning, not only in its own right, e.g. to categorise everyday objects (Klasson et al., 2019), make medical diagnoses (Adem et al., 2019; Mirbabaie et al., 2021) or detect potentially life-supporting planets (Tiensuu et al., 2019), but also as an important component in other learning paradigms, e.g. to select actions in reinforcement learning, distinguish positive and negative samples in contrastive learning or within the attention mechanism of large language models. Recently, it has become all but default to tackle classification tasks with domain-specific neural networks with a sigmoid or softmax output layer.¹ The neural network deterministically maps each data point $x$ (in a domain $\mathcal{X}$) to a real vector $f_\omega(x)$, which the last layer maps to the parameter of a discrete distribution $p_\theta(y|x)$ over class labels $y \in \mathcal{Y}$, defined by a point on the simplex $\Delta^{|\mathcal{Y}|}$, e.g.:
$$p_\theta(y|x) = \mathrm{softmax}(x;\theta)_y = \frac{\exp\{g(x,y;\theta)\}}{\sum_{y'\in\mathcal{Y}} \exp\{g(x,y';\theta)\}} = \frac{\exp\{f_\omega(x)^\top w_y + b_y\}}{\sum_{y'\in\mathcal{Y}} \exp\{f_\omega(x)^\top w_{y'} + b_{y'}\}} \quad (1)$$
Despite frequently outperforming alternatives and their widespread use, softmax classifiers are not without issue. The overall mapping from $\mathcal{X}$ to $\Delta^{|\mathcal{Y}|}$ is learned numerically by minimising a loss function over a finite set of training samples. The result is poorly understood in general, remaining in many respects a "black box" with predictions that are hard to rationalise. A trained classifier may make accurate predictions for the training set, but predictions for other data points, e.g. test data, are determined by $f_\omega \in \mathcal{F}$, from a class of functions chosen to be highly flexible in the hope of approximating the unknown true mapping $f(x) = \{p(y|x)\}_{y\in\mathcal{Y}}$.
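To make Eq. (1) concrete, the following sketch computes the softmax classifier's output distribution from a feature vector $f_\omega(x)$, class weight vectors $w_y$ and biases $b_y$. The variable names and toy dimensions are illustrative assumptions, not from the paper:

```python
import numpy as np

def softmax_classifier(z, W, b):
    """Map features z = f_omega(x) to a categorical distribution, as in Eq. (1).

    z: (d,) feature vector f_omega(x); W: (d, K) matrix whose columns are the
    class weights w_y; b: (K,) class biases b_y.
    """
    logits = z @ W + b          # g(x, y; theta) = f_omega(x)^T w_y + b_y
    logits -= logits.max()      # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()          # normalise: a point on the simplex Delta^|Y|

# toy check with d = 5 features and K = 3 classes
rng = np.random.default_rng(0)
p = softmax_classifier(rng.normal(size=5), rng.normal(size=(5, 3)), np.zeros(3))
```

Subtracting the maximum logit before exponentiating leaves the output unchanged (the shift cancels in the ratio) but avoids overflow for large logits.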
With sufficient flexibility, (i) $\mathcal{F}$ may contain many, possibly infinitely many, functions (for different $\omega$) that give accurate training set predictions but dissimilar, and hence uncertain, predictions elsewhere; and (ii) predictions can change materially for imperceptible changes in the data (adversarial examples). Lastly, where $p_\theta(y|x)$ fails to match the true label distribution $p(y|x)$, predicted probabilities do not reflect the frequency with which classes actually occur, making the classifier miscalibrated. A standard softmax classifier also lacks several desirable properties. For example, under certain conditions, it is known that a softmax classifier learns to approximate $p(y|x)$ and so captures stochastic, or aleatoric, uncertainty in the data. However, in practice, predictions for some regions of $\mathcal{X}$ may be more reliable than others and it can be important to understand the confidence or epistemic
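Miscalibration of the kind described above is commonly quantified by the expected calibration error (ECE): the average gap between a classifier's confidence and its accuracy, weighted by how many predictions fall in each confidence bin. A minimal sketch, where the equal-width binning scheme and function names are assumptions rather than anything specified in the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: bin predictions by confidence, then average the
    per-bin |accuracy - mean confidence| gap, weighted by bin occupancy.

    confidences: (n,) predicted probability of the chosen class;
    correct: (n,) boolean, whether each prediction was right.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap    # weight by fraction of samples in bin
    return ece

# a classifier that is always confident and always right is perfectly calibrated
ece = expected_calibration_error(np.ones(4), np.ones(4, dtype=bool))
```

A well-calibrated classifier has small ECE: among predictions made with, say, 70% confidence, roughly 70% should be correct.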



¹ Since the softmax function generalises the sigmoid function to more than two classes, we refer to softmax throughout, but all arguments can be applied to the sigmoid case.

