REVISITING INFORMATION-BASED CLUSTERING WITH PSEUDO-POSTERIOR MODELS

Anonymous authors
Paper under double-blind review

Abstract

Maximization of mutual information (MI) between the network's input and output motivates standard losses for unsupervised discriminative clustering enforcing "decisiveness" and "fairness". In the context of common softmax models, we clarify several general properties of such discriminative losses that were previously not well understood: the relation to K-means, or lack thereof, and margin maximization. In particular, we show that "decisiveness" without an extra regularization term can lead to poor classification margins. Also, the non-convexity of information-based losses motivates us to focus on self-supervised approaches, introducing effective higher-order optimization algorithms with auxiliary variables. Addressing limitations of existing formulations, we propose a new self-supervised loss with soft auxiliary variables, or pseudo-confidence estimates. In particular, we introduce strong fairness and motivate the reverse cross-entropy as a robust loss for network training from noisy pseudo-confidence estimates. The latter are efficiently computed using variational inference: we derive a new EM algorithm with closed-form solutions for the E and M steps. Empirically, our algorithm improves the performance of earlier methods for information-based clustering.

1. INTRODUCTION

We were inspired by the 1991 work of Bridle, Heading, and MacKay Bridle et al. (1991) formulating a mutual information (MI) loss for unsupervised discriminative training of neural networks using probability-type outputs, e.g. softmax σ : R^K → ∆^K mapping K logits l_k ∈ R to a point in the probability simplex ∆^K. Such output σ = (σ_1, ..., σ_K) is often interpreted as a pseudo-posterior¹ over K classes, where σ_k = exp(l_k) / Σ_i exp(l_i) is a scalar prediction for each class k. The unsupervised loss proposed in Bridle et al. (1991) trains the model predictions to keep as much information about the input as possible. They derived an estimate of MI as the difference between the average entropy of the output and the entropy of the average output

L_mi := −MI(c, X) ≈ H̄(σ) − H(σ̄)   (1)

where c is a random variable representing the class prediction, X represents the input, and the averaging (denoted by the bar) is done over all input samples {X_i}_{i=1}^M, i.e. over M training examples. The derivation in Bridle et al. (1991) assumes that softmax represents the distribution Pr(c|X). However, since softmax is not a true posterior, the right-hand side of (1) can be seen only as a pseudo-MI loss. In any case, (1) has a clear discriminative interpretation that stands on its own: H(σ̄) encourages "fair" predictions with balanced support of all categories across the whole training data set, while H̄(σ) encourages confident or "decisive" predictions at each data point, implying that decision boundaries are away from the training examples Grandvalet & Bengio (2004). Generally, we call clustering losses for softmax models "information-based" if they use measures from information theory, e.g. entropy. The discriminative clustering loss (1) can be applied to deep or shallow models. For clarity, this paper distinguishes the parameters w of the representation layers of the network computing features f_w(X) ∈ R^N for any input X and the linear classifier parameters v of the output layer computing the K-logit vector v⊤f for any feature f ∈ R^N.
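To make the estimate in (1) concrete, here is a minimal NumPy sketch; the names `probs` and `mi_loss`, and the toy inputs, are ours and used only for illustration:

```python
import numpy as np

def mi_loss(probs, eps=1e-12):
    """Pseudo-MI loss (1): average entropy of the outputs
    minus entropy of the average output (negated MI estimate)."""
    # probs: (M, K) array, each row a softmax output on the simplex
    avg_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))  # H̄(σ)
    mean_pred = probs.mean(axis=0)                                       # σ̄
    entropy_of_avg = -np.sum(mean_pred * np.log(mean_pred + eps))        # H(σ̄)
    return avg_entropy - entropy_of_avg

# Decisive AND fair predictions minimize the loss:
confident_balanced = np.array([[0.99, 0.01], [0.01, 0.99]])
uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])
assert mi_loss(confident_balanced) < mi_loss(uncertain)
```

Confident predictions that are all identical (decisive but unfair) would drive both entropy terms to zero, so they do not minimize the loss either.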
The overall network model is defined as

σ(v⊤f_w(X)).   (2)

A special "shallow" case of the model in (2) is a basic linear discriminator

σ(v⊤X)   (3)

directly operating on low-level input features f = X. Optimization of the loss (1) for the shallow model (3) is done only over the linear classifier parameters v, while the deeper network model (2) is optimized over all network parameters [v, w]. Typically, this is done via gradient descent or backpropagation Rumelhart et al. (1986); Bridle et al. (1991). The simple 2D example in Figure 1(b) motivates "decisiveness" and "fairness" as discriminative properties of the unsupervised clustering loss (1) in the context of a low-level linear classifier (3). MI clustering is compared with the standard K-means result in Figure 1(a). In this "shallow" 2D setting both clustering methods are linear and have similar parametric complexity, about K × N parameters. K-means (a) finds balanced compact clusters of the least squared deviations or variance. This can also be interpreted "generatively", see Kearns et al. (1997), as MLE-based fitting of two (isotropic) Gaussian densities, explaining the failure for non-isotropic clusters in (a). To fix (a) "generatively", one should use non-isotropic Gaussian densities, e.g. a 2-mode GMM would produce soft clusters similar to (b). However, this has costly parametric complexity: two extra covariance matrices to estimate and quadratic decision boundaries. In contrast, there is no estimation of complex data density models in (b). The MI loss (1) trains a simple linear classifier (3) to produce a balanced ("fair") decision boundary away from the data points ("decisiveness"). Later, we show that "decisiveness" may be deficient without an extra margin maximization term, see Fig. 2. In the context of deep models (2), the unsupervised MI loss finds a non-linear partitioning of the (unlabeled) training data points {X_i} as the network learns their high-dimensional embedding f_w(X).
In this case, the loss (1) is optimized with respect to both the representation parameters w and the classifier parameters v.
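As an illustration of this optimization in the shallow case, the sketch below runs plain gradient descent on (1) for the linear classifier (3) over a toy 2D data set. The synthetic data, learning rate, and the analytic gradient code are our own choices for illustration, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two elongated 2D clusters, similar in spirit to the Figure 1 setting
X = np.vstack([rng.normal([0.0, -2.0], [3.0, 0.5], size=(100, 2)),
               rng.normal([0.0,  2.0], [3.0, 0.5], size=(100, 2))])
X = np.hstack([X, np.ones((200, 1))])        # homogeneous coordinate for the bias
v = rng.normal(0.0, 0.1, size=(3, 2))        # linear classifier parameters, K = 2

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mi_loss(s, eps=1e-12):
    s_bar = s.mean(axis=0)
    return (-np.mean(np.sum(s * np.log(s + eps), axis=1))   # H̄(σ): decisiveness
            + np.sum(s_bar * np.log(s_bar + eps)))          # −H(σ̄): fairness

def grad_v(X, s, eps=1e-12):
    # Analytic gradient of mi_loss w.r.t. the logits, chained through v
    M = s.shape[0]
    log_s = np.log(s + eps)
    log_sbar = np.log(s.mean(axis=0) + eps)
    H_i = -np.sum(s * log_s, axis=1, keepdims=True)          # per-sample entropies
    inner = np.sum(s * log_sbar, axis=1, keepdims=True)
    g = s * (log_sbar - log_s - inner - H_i) / M             # (M, K) logit gradient
    return X.T @ g

init_loss = mi_loss(softmax(X @ v))
for _ in range(5000):
    v -= 0.05 * grad_v(X, softmax(X @ v))    # plain gradient descent
s = softmax(X @ v)
final_loss = mi_loss(s)                      # lower: more decisive and more fair
```

After training, the predictions are more confident while the average prediction stays near balanced, mirroring the two terms of (1).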



¹ "Pseudo" emphasizes that discriminative training does not lead to the true Bayesian posteriors, in general.




Figure 1: Generative vs discriminative clustering: a binary example (K = 2) for data points X ∈ R^N (N = 2) comparing linear methods of similar parametric complexity: (a) K-means with centers µ_k ∈ R^N and (b) MI clustering (1) using the linear classifier (3) with a K-column matrix v producing logits v_k⊤X. Red and green colors show the optimal partitioning of the data points, as well as the corresponding decision regions over the whole 2D space. The linear decision boundary in (a) is "hard" since K-means outputs the hard clustering argmin_k ∥X − µ_k∥. The "soft" margin in (b) is due to the softmax in the linear classifier (3), shown via the transparency channel α ∝ σ_k for each region's color.
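The hard/soft distinction in the caption can be made concrete with a small sketch; the centers µ_k and the classifier weights v below are made-up values, used only to illustrate the two assignment rules:

```python
import numpy as np

x = np.array([0.2, 0.1])                      # a point near the decision boundary
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])      # hypothetical K-means centers

# (a) K-means: hard assignment, argmin_k ||x - mu_k||
hard = np.argmin(np.linalg.norm(x - mu, axis=1))

# (b) MI clustering with linear classifier (3): soft softmax responsibilities
v = np.array([[-1.0, 1.0], [0.0, 0.0]])       # hypothetical 2-column weight matrix
logits = x @ v
soft = np.exp(logits) / np.exp(logits).sum()

print(hard)   # one winning cluster, no confidence information
print(soft)   # graded confidences, close to 0.5 near the boundary
```

The hard rule loses all information about how close x is to the boundary, which is exactly what the transparency channel in Figure 1(b) visualizes for the soft case.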

RELATED WORK ON DISCRIMINATIVE DEEP CLUSTERING

Unsupervised discriminative clustering via (pseudo) posterior models trained by the MI loss Bridle et al. (1991) has growing impact Krause et al. (2010); Ghasedi Dizaji et al. (2017); Hu et al. (2017); Ji et al. (2019); Asano et al. (2020); Jabi et al. (2021) due to the widespread use of neural networks in computer vision, where supervision is expensive or infeasible. Clear information-theoretic and discriminative interpretations of the MI clustering criterion (1) motivated many extensions. For example, Krause et al. (2010) proposed to reduce the complexity of the model by combining the MI loss (1) with regularization of all network parameters, interpreted as an isotropic Gaussian prior on these weights:

L_mi+decay = H̄(σ) − H(σ̄) + ∥[v, w]∥²  =c  H̄(σ) + KL(σ̄ ∥ u) + ∥[v, w]∥²   (4)

where u is the uniform distribution over the K classes and =c denotes equality up to an additive constant.
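The second form of the fairness term in (4) follows from the identity H(p) = log K − KL(p ∥ u) for the uniform distribution u, so −H(σ̄) and KL(σ̄ ∥ u) differ only by the constant log K. A quick numeric check (our own sketch, with an arbitrary stand-in for the average prediction):

```python
import numpy as np

K = 4
p = np.array([0.1, 0.2, 0.3, 0.4])     # stand-in for the average prediction σ̄
u = np.full(K, 1.0 / K)                # uniform distribution over K classes

H = -np.sum(p * np.log(p))             # entropy H(σ̄)
KL = np.sum(p * np.log(p / u))         # KL(σ̄ ∥ u)

# -H(σ̄) and KL(σ̄ ∥ u) differ only by the constant log K, so minimizing
# either one yields the same "fairness" effect in (4)
assert np.isclose(-H, KL - np.log(K))
```

This is why the two sides of (4) are equal only up to an additive constant: swapping −H(σ̄) for KL(σ̄ ∥ u) shifts the loss by log K without changing its minimizers.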

