MAXIMUM ENTROPY INFORMATION BOTTLENECK FOR CONFIDENCE-AWARE STOCHASTIC EMBEDDING

Anonymous

Abstract

Stochastic embedding has several advantages over deterministic embedding, such as the ability to associate uncertainty with the resulting embedding and robustness to noisy data. This is especially useful when the input data is ambiguous (e.g., blurry or corrupted), as often happens in in-the-wild settings. Many existing methods for stochastic embedding are limited by the assumption, made under the variational information bottleneck principle, that the embedding follows a standard normal distribution. We present a different variational approach to stochastic embedding in which maximum entropy acts as the bottleneck, which we call "Maximum Entropy Information Bottleneck" or MEIB. We show that models trained with the MEIB objective outperform existing methods in terms of regularization, perturbation robustness, probabilistic contrastive learning, and risk-controlled recognition performance.



Introduction

Stochastic embedding is a mapping of an input x to a random variable Z ∼ p(z|x) ∈ R^D in which the mapped regions of similar inputs are placed nearby. Unlike deterministic embedding, where z = f(x) is a point in R^D, stochastic embedding can represent input uncertainty, such as data corruption or ambiguity, by controlling the spread of probability density over a manifold Oh et al. (2019). Figure 1 depicts a typical stochastic embedding framework with the neural networks parameterized by θ. Input x is mapped to a Gaussian distribution N(z; µ, Σ) by a stochastic encoder that consists of a backbone feature extractor f^B_θ followed by two separate branches f^µ_θ and f^Σ_θ, which predict µ and Σ, respectively.¹ While the covariance matrix Σ is assumed to be diagonal, in prior work as well as in this paper, so that f^Σ_θ outputs a D-dimensional vector, it would be straightforward to extend it to a full covariance matrix, for instance, via a Cholesky decomposition Dorta et al. (2018). Embeddings sampled from this Gaussian are then consumed by a decoder f^C_θ for the downstream task, e.g., classification.

The majority of leading methods for stochastic embedding Oh et al. (2019); Chang et al. (2020); Sun et al. (2020); Chun et al. (2021); Li et al. (2021b) are built upon the variational information bottleneck (VIB) principle Alemi et al. (2017), in which the stochastic encoder p(z|x) is regularized by the Kullback-Leibler (KL) divergence KL(p(z|x) || r(z)), where p(z|x) = N(z; µ, Σ) and, in general, r(z) = N(z; 0, I). This effectively impels the embeddings to be close to a standard normal distribution, an explicit assumption that may not always hold true. Moreover, the variational autoencoder (VAE) Kingma & Welling (2014) fails to correlate the latent variance with the input uncertainty; the variance decreases with the distance to the latent means of the training data, which is contrary to expectation. Since the VAE is a special case of an unsupervised variant of VIB, this phenomenon also holds for VIB; our experiments show that VIB assigns smaller variance to more uncertain inputs (see the supplemental Section A).

¹We use the terms f^µ_θ(x) and f^Σ_θ(x) interchangeably with f^µ_θ(f^B_θ(x)) and f^Σ_θ(f^B_θ(x)), respectively.
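To make the framework concrete, the sketch below (our own illustration, not the authors' code; the toy transforms standing in for f^B_θ, f^µ_θ, and f^Σ_θ are hypothetical, as in a real model they would be learned networks) shows a stochastic encoder producing a mean µ and a diagonal standard deviation σ, sampling an embedding z via the reparameterization trick, and evaluating the VIB-style regularizer KL(N(µ, diag(σ²)) || N(0, I)) in its standard closed form.

```python
import numpy as np

def stochastic_encode(x, rng, D=4):
    """Toy stand-ins for the backbone and the two branches: x -> (z, mu, sigma).

    Hypothetical fixed transforms are used purely for illustration; a real
    encoder would learn f^B, f^mu, and f^Sigma from data.
    """
    h = np.tanh(x)                   # backbone feature, stand-in for f^B(x)
    mu = h[:D]                       # mean branch, stand-in for f^mu
    log_var = -np.abs(h[D:2 * D])    # variance branch predicts log sigma^2
    sigma = np.exp(0.5 * log_var)
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # which keeps the sample differentiable w.r.t. (mu, sigma).
    eps = rng.standard_normal(D)
    z = mu + sigma * eps
    return z, mu, sigma

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the VIB regularizer:
    0.5 * sum_d (sigma_d^2 + mu_d^2 - 1 - log sigma_d^2)."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
z, mu, sigma = stochastic_encode(x, rng)
print(kl_to_standard_normal(mu, sigma))                # always >= 0
print(kl_to_standard_normal(np.zeros(4), np.ones(4)))  # 0.0 at mu=0, sigma=1
```

Minimizing this KL term is what pulls every p(z|x) toward the fixed prior N(0, I), the behavior the next paragraph argues against.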
Motivated by this finding, we explicitly use the variance (entropy) as a confidence indicator rather than a measure of input uncertainty, and encourage the model to assign larger variance to more certain inputs. In this paper, we propose Maximum Entropy Information Bottleneck (MEIB) to lift the constraint of a fixed prior and instead use the conditional entropy of the embedding, H(Z|X), as the only regularization. Based on the maximum entropy principle Jaynes (1957), we postulate that stochastic uncertainty is best represented by the probability distribution with the largest entropy. By maximizing H(Z|X), the embedding distribution is promoted to be more random, pushing for broader coverage of the embedding space, with a trade-off against the expressiveness of Z about the target Y. The resulting distribution is also the one that makes the fewest assumptions about the true data distribution Shore & Johnson (1980). Figure 2 depicts our intuition: (a) deterministic encoders learn embeddings "just enough" to classify the training samples unless a regularization technique, such as a margin loss, is applied, leaving them vulnerable to small changes in test inputs; (b) the embedding distribution of typical stochastic encoders (e.g., VIB) trained with KL-divergence regularization tends to cover a fixed prior, yet it is generally difficult to pick a true prior distribution, and it is unnecessary to restrict the embedding distribution to a specific bound; (c) with MEIB, on the other hand, maximizing the conditional entropy of the stochastic embeddings yields a better regularization effect, as it makes the region covered by the embedding distribution for a given input as broad as possible.
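As a sketch of the resulting objective (our own illustration under the paper's diagonal-Gaussian assumption, not the authors' released code; the weight beta and the combined loss below are hypothetical): for p(z|x) = N(µ, diag(σ²)), the entropy term has the closed form H = ½ Σ_d log(2πe σ_d²), which depends only on the predicted variances. Maximizing H(Z|X) therefore rewards larger variances directly, with no reference prior involved.

```python
import numpy as np

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, diag(sigma^2)):
    H = 0.5 * sum_d log(2 * pi * e * sigma_d^2).
    Independent of mu; grows monotonically with the predicted variances."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * sigma**2))

def meib_style_loss(task_loss, sigma, beta=0.01):
    """Hypothetical MEIB-style objective sketch: task loss minus a weighted
    entropy bonus, so minimizing it maximizes H(Z|X) alongside the task.
    beta trades entropy against the expressiveness of Z about Y."""
    return task_loss - beta * gaussian_entropy(sigma)

D = 4
unit = np.ones(D)
print(gaussian_entropy(unit))  # D/2 * log(2*pi*e), about 5.68 for D = 4
# Larger predicted variance -> higher entropy -> lower MEIB-style loss:
print(meib_style_loss(1.0, 2.0 * unit) < meib_style_loss(1.0, unit))  # True
```

Note the contrast with the VIB KL term: that term penalizes any deviation from N(0, I), whereas the entropy bonus here only pushes each per-input Gaussian to be as broad as the task loss allows, matching the intuition of Figure 2(c).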
The key contributions of MEIB over previous stochastic embedding methods are summarized as follows:

• While providing comparable regularization in handwritten digit classification, MEIB outperforms existing approaches on the challenging person re-identification task across three popular datasets.
• MEIB shows significantly better perturbation robustness than VIB in handwritten digit classification.
• MEIB performs better than VIB when used in a probabilistic contrastive learning framework.
• By providing reliable confidence measurements, MEIB achieves outstanding risk-controlled recognition performance in digit classification and person re-identification tasks.



Figure 1: Stochastic embedding framework.


Figure 2: Embedding space characteristics. Each color represents a class of data. The color-filled shapes refer to the deterministic or the mean point of stochastic embeddings. The ellipses around the shapes depict the standard deviation of stochastic embeddings. The circles and the diamonds represent training and testing data, respectively. The solid lines are the decision boundaries learned.

