REVISITING INFORMATION-BASED CLUSTERING WITH PSEUDO-POSTERIOR MODELS

Anonymous authors
Paper under double-blind review

Abstract

Maximization of mutual information (MI) between the network's input and output motivates standard losses for unsupervised discriminative clustering enforcing "decisiveness" and "fairness". In the context of common softmax models, we clarify several general properties of such discriminative losses that were previously not well understood: the relation to K-means, or lack thereof, and margin maximization. In particular, we show that "decisiveness" without an extra regularization term can lead to poor classification margins. Also, the non-convexity of information-based losses motivates us to focus on self-supervised approaches introducing effective higher-order optimization algorithms with auxiliary variables. Addressing limitations of existing formulations, we propose a new self-supervised loss with soft auxiliary variables, or pseudo-confidence estimates. In particular, we introduce strong fairness and motivate the reverse cross-entropy as a robust loss for training the network from noisy pseudo-confidence estimates. The latter are efficiently computed using variational inference: we derive a new EM algorithm with closed-form solutions for the E and M steps. Empirically, our algorithm improves the performance of earlier methods for information-based clustering.

1. INTRODUCTION

We were inspired by the work of Bridle, Heading, and MacKay from 1991 Bridle et al. (1991) formulating a mutual information (MI) loss for unsupervised discriminative training of neural networks with probability-type outputs, e.g. softmax $\sigma : \mathbb{R}^K \to \Delta^K$ mapping $K$ logits $l_k \in \mathbb{R}$ to a point in the probability simplex $\Delta^K$. Such output $\sigma = (\sigma_1, \dots, \sigma_K)$ is often interpreted as a pseudo-posterior¹ over $K$ classes, where $\sigma_k = \frac{\exp l_k}{\sum_i \exp l_i}$ is a scalar prediction for each class $k$. The unsupervised loss proposed in Bridle et al. (1991) trains the model predictions to keep as much information about the input as possible. They derived an estimate of MI as the difference between the average entropy of the output and the entropy of the average output
$$L_{mi} \;:=\; -MI(c, X) \;\approx\; \overline{H(\sigma)} - H(\bar{\sigma}) \qquad (1)$$
where $c$ is a random variable representing the class prediction, $X$ represents the input, and the bar denotes averaging over all input samples $\{X_i\}_{i=1}^{M}$, i.e. over $M$ training examples. The derivation in Bridle et al. (1991) assumes that softmax represents the distribution $\Pr(c|X)$. However, since softmax is not a true posterior, the right-hand side in (1) can be seen only as a pseudo-MI loss. In any case, (1) has a clear discriminative interpretation that stands on its own: $H(\bar{\sigma})$ encourages "fair" predictions with a balanced support of all categories across the whole training data set, while $\overline{H(\sigma)}$ encourages confident or "decisive" predictions at each data point, implying that decision boundaries are away from the training examples Grandvalet & Bengio (2004). Generally, we call clustering losses for softmax models "information-based" if they use measures from information theory, e.g. entropy. Discriminative clustering loss (1) can be applied to deep or shallow models.
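As a concrete illustration, the pseudo-MI loss (1) is straightforward to evaluate on a batch of logits: the "decisiveness" term is the mean per-sample entropy, and the "fairness" term is the entropy of the mean prediction. The sketch below (in plain numpy; function names are ours, not from the paper) computes both terms.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    # Shannon entropy of each row of a probability matrix
    return -(p * np.log(p + eps)).sum(axis=-1)

def pseudo_mi_loss(logits):
    """Loss (1): average entropy of predictions ("decisiveness")
    minus entropy of the average prediction ("fairness")."""
    sigma = softmax(logits)                     # M x K pseudo-posteriors
    avg_entropy = entropy(sigma).mean()         # decisiveness term
    entropy_of_avg = entropy(sigma.mean(axis=0))  # fairness term
    return avg_entropy - entropy_of_avg
```

For confident, class-balanced predictions the loss approaches its minimum of $-\log K$; for uniform predictions both terms coincide and the loss is zero.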
For clarity, this paper distinguishes the parameters $w$ of the representation layers of the network computing features $f_w(X) \in \mathbb{R}^N$ for any input $X$, and the linear classifier parameters $v$ of the output layer computing the $K$-logit vector $v^\top f$ for any feature $f \in \mathbb{R}^N$. The overall network model is defined as
$$\sigma(v^\top f_w(X)). \qquad (2)$$
¹ "Pseudo" emphasizes that discriminative training does not lead to the true Bayesian posteriors, in general.
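The decomposition in (2) can be sketched as follows. Here the feature extractor $f_w$ is a hypothetical stand-in (a single random ReLU layer); the paper does not prescribe a specific architecture, only the split into representation parameters $w$ and a linear head $v$ followed by softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

W = rng.standard_normal((5, 4))   # w: representation parameters (input dim 5 -> N = 4)
V = rng.standard_normal((4, 3))   # v: linear classifier (N = 4 -> K = 3 logits)

def f_w(X):
    # stand-in representation layers: features f_w(X) in R^N
    return np.maximum(X @ W, 0.0)

def model(X):
    # overall model (2): sigma(v^T f_w(X))
    return softmax(f_w(X) @ V)
```

Each output row is a point in the simplex $\Delta^K$, i.e. a pseudo-posterior over the $K$ classes.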

