MAXIMUM ENTROPY INFORMATION BOTTLENECK FOR CONFIDENCE-AWARE STOCHASTIC EMBEDDING

Anonymous

Abstract

Stochastic embedding has several advantages over deterministic embedding, such as the ability to associate uncertainty with the resulting embedding and robustness to noisy data. This is especially useful when the input data is ambiguous (e.g., blurry or corrupted), as often happens in in-the-wild settings. Many existing methods for stochastic embedding are limited by the assumption, made under the variational information bottleneck principle, that the embedding follows a standard normal distribution. We present a different variational approach to stochastic embedding in which maximum entropy acts as the bottleneck, which we call the "Maximum Entropy Information Bottleneck" (MEIB). We show that models trained with the MEIB objective outperform existing methods in terms of regularization, perturbation robustness, probabilistic contrastive learning, and risk-controlled recognition performance.


Introduction
Stochastic embedding is a mapping of an input x to a random variable Z ∼ p(z|x) ∈ R^D in which similar inputs are mapped to nearby regions. Unlike deterministic embedding, where z = f(x) is a point in R^D, stochastic embedding can represent input uncertainty, such as data corruption or ambiguity, by controlling the spread of probability density over a manifold (Oh et al., 2019). Figure 1 depicts a typical stochastic embedding framework with neural networks parameterized by θ. An input x is mapped to a Gaussian distribution N(z; µ, Σ) by a stochastic encoder that consists of a backbone feature extractor f^B_θ followed by two separate branches f^µ_θ and f^Σ_θ, which predict µ and Σ, respectively. While the covariance matrix Σ is assumed to be diagonal, in prior work as well as in this paper, so that f^Σ_θ outputs a D-dimensional vector, it would be straightforward to extend it to a full covariance matrix, for instance via a Cholesky decomposition (Dorta et al., 2018). Embeddings sampled from this Gaussian are then consumed by a decoder f^C_θ for the downstream task, e.g., classification.

Figure 1: Stochastic embedding framework.

Many prior methods (Oh et al., 2019; Chang et al., 2020; Sun et al., 2020; Chun et al., 2021; Li et al., 2021b) are built upon the variational information bottleneck (VIB) principle (Alemi et al., 2017), where the stochastic encoder p(z|x) is regularized by the Kullback-Leibler (KL) divergence KL(p(z|x) || r(z)), with p(z|x) = N(z; µ, Σ) and, in general, r(z) = N(z; 0, I). This effectively impels the embeddings to stay close to a standard normal distribution, an explicit assumption that may not always hold.
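Under the diagonal-covariance assumption above, both the stochastic encoder and the VIB regularizer are simple to write down. The NumPy sketch below is a hypothetical minimal implementation, not the paper's code: the layer sizes, the tanh backbone, and the choice of a log-variance head (rather than predicting Σ directly) are assumptions made for illustration. It maps x to (µ, log σ²), samples z with the reparameterization trick, and evaluates the closed-form KL(N(µ, diag(σ²)) || N(0, I)) used by VIB-style methods.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_HID, D = 8, 16, 4  # input, hidden, and embedding dims (assumed sizes)

# Hypothetical weights for the backbone f_B and the two heads f_mu, f_Sigma.
W_b = rng.standard_normal((D_IN, D_HID)) * 0.1
W_mu = rng.standard_normal((D_HID, D)) * 0.1
W_lv = rng.standard_normal((D_HID, D)) * 0.1

def encode(x):
    """Map input x to the parameters (mu, log_var) of N(z; mu, diag(sigma^2))."""
    h = np.tanh(x @ W_b)   # backbone feature extractor f_B
    mu = h @ W_mu          # mean head f_mu
    log_var = h @ W_lv     # log-variance head: f_Sigma outputs a D-dim vector
    return mu, log_var

def sample_z(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), one value per example."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

x = rng.standard_normal((2, D_IN))  # a batch of two inputs
mu, log_var = encode(x)
z = sample_z(mu, log_var)           # stochastic embedding fed to the decoder f_C
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl.shape)            # (2, 4) (2,)
```

Note that this KL term vanishes exactly when µ = 0 and σ² = 1, i.e., when the encoder output matches the standard normal prior r(z); this is the assumption that the maximum-entropy bottleneck is intended to avoid.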

