VARIATIONAL IMBALANCED REGRESSION

Abstract

Existing regression models tend to fall short in both accuracy and uncertainty estimation when the label distribution is imbalanced. In this paper, we propose a probabilistic deep learning model, dubbed variational imbalanced regression (VIR), which not only performs well in imbalanced regression but naturally produces reasonable uncertainty estimation as a byproduct. Different from typical variational autoencoders assuming I.I.D. representations (a data point's representation is not directly affected by other data points), our VIR borrows data with similar regression labels to compute the latent representation's variational distribution; furthermore, different from deterministic regression models producing point estimates, VIR predicts the entire normal-inverse-gamma distribution and modulates the associated conjugate distributions to impose probabilistic reweighting on the imbalanced data, thereby providing better uncertainty estimation. Experiments on several real-world datasets show that our VIR can outperform state-of-the-art imbalanced regression models in terms of both accuracy and uncertainty estimation.

1. INTRODUCTION

Deep regression models are currently the state of the art in making predictions in a continuous label space and have a wide range of successful applications in computer vision (Yin et al., 2021), natural language processing (Jiang et al., 2020), etc. However, these models fail when the label distribution in the training data is imbalanced. For example, in visual age estimation (Moschoglou et al., 2017), where a model infers the age of a person given her visual appearance, models are typically trained on imbalanced datasets with overwhelmingly more images of younger adults, leading to poor regression accuracy for images of children or elderly people (Yang et al., 2021). Such unreliability in imbalanced regression settings motivates the need both for improving performance for the minority in the presence of imbalanced data and, more importantly, for providing reasonable uncertainty estimation to inform practitioners how reliable the predictions are (especially for the minority, where accuracy is lower). Existing methods for deep imbalanced regression (DIR) only focus on improving the accuracy of deep regression models by smoothing the label distribution and reweighting data with different labels (Yang et al., 2021). On the other hand, methods that provide uncertainty estimation for deep regression models operate under the balanced-data assumption and therefore do not work well in the imbalanced setting (Amini et al., 2020; Mi et al., 2022; Charpentier et al., 2022). To simultaneously cover these two desiderata, we propose a probabilistic deep imbalanced regression model, dubbed variational imbalanced regression (VIR). Different from typical variational autoencoders assuming I.I.D. representations (a data point's representation is not directly affected by other data points), our VIR assumes Neighboring and Identically Distributed (N.I.D.) representations and borrows data with similar regression labels to compute the latent representation's variational distribution.
Specifically, VIR first encodes a data point into a probabilistic representation and then mixes it with neighboring representations (i.e., representations from data with similar regression labels) to produce its final probabilistic representation; VIR is therefore particularly useful for minority data, as it can borrow probabilistic representations from data with similar labels (and naturally weigh them using our probabilistic model) to counteract data sparsity. Furthermore, different from deterministic regression models producing point estimates, VIR predicts the entire normal-inverse-gamma distribution and modulates the associated conjugate distributions by the importance weight computed from the smoothed label distribution to impose probabilistic reweighting on the imbalanced data. This allows the negative log likelihood to naturally put more focus on the minority data, thereby balancing the accuracy for data with different regression labels. Our VIR framework is compatible with any deep regression model and can be trained end to end. We summarize our contributions below:
1. While previous work has studied imbalanced regression and uncertainty estimation separately, none has considered uncertainty estimation in the imbalanced setting. We identify the problem of probabilistic deep imbalanced regression, as well as two desiderata, balanced accuracy and uncertainty estimation, for this problem.
2. We propose VIR to simultaneously cover these two desiderata and achieve state-of-the-art performance compared to existing methods.
3. As a byproduct, we also provide strong baselines for benchmarking high-quality uncertainty estimation and promising prediction performance on imbalanced datasets.

2. RELATED WORK

Variational autoencoder (VAE) (Kingma & Welling, 2014) is an unsupervised learning model that aims to infer probabilistic representations from data. However, as shown in Figure 1, VAE typically assumes I.I.D. representations, where a data point's representation is not directly affected by other data points. In contrast, our VIR borrows data with similar regression labels to compute the latent representation's variational distribution. Imbalanced Regression. Imbalanced regression is underexplored in the machine learning community. Most existing methods for imbalanced regression are direct extensions of the SMOTE algorithm (Chawla et al., 2002), a commonly used algorithm for imbalanced classification where data from the minority classes is over-sampled. These algorithms usually synthesize augmented data for the minority regression labels by either interpolating both inputs and labels (Torgo et al., 2013) or adding Gaussian noise (Branco et al., 2017; 2018). Such algorithms fail to account for distances in the continuous label space and fall short in handling high-dimensional data (e.g., images and text). Recently, DIR (Yang et al., 2021) addressed these issues by applying kernel density estimation to smooth and reweight data on the continuous label distribution, achieving state-of-the-art performance. However, DIR only focuses on improving accuracy, especially for data with minority labels, and therefore does not provide uncertainty estimation, which is crucial to assess the predictions' reliability. Ren et al. (2022) focus on re-balancing the mean squared error (MSE) loss for imbalanced regression, and Gong et al. (2022) introduce ranking similarity for improving deep imbalanced regression. In contrast, our VIR provides a principled probabilistic approach to simultaneously achieve these two desiderata, not only improving upon DIR in terms of performance but also producing reasonable uncertainty estimation as a much-needed byproduct to assess model reliability.
There is also related work on imbalanced classification (Deng et al., 2021); it is related to our work but focuses on classification rather than regression.

Uncertainty Estimation in Regression.

There has been renewed interest in uncertainty estimation in the context of deep regression models (Kendall & Gal, 2017; Kuleshov et al., 2018; Song et al., 2019; Zelikman et al., 2020; Amini et al., 2020; Mi et al., 2022; van Amersfoort et al., 2021; Liu et al., 2020; Gal & Ghahramani, 2016; Stadler et al., 2021; Snoek et al., 2019; Heiss et al., 2022). Most existing methods either directly predict the variance of the output distribution as the estimated uncertainty (Kendall & Gal, 2017; Zhang et al., 2019; Amini et al., 2020) or rely on post-hoc confidence interval calibration (Kuleshov et al., 2018; Song et al., 2019; Zelikman et al., 2020). Meanwhile, Posterior Network methods (Charpentier et al., 2020; 2022; Stadler et al., 2021) consider conjugate distributions, pseudo-count interpretations, posterior updates, and variational losses for fast and high-quality uncertainty estimation. Closest to our work is Deep Evidential Regression (DER) (Amini et al., 2020), which attempts to estimate both aleatoric and epistemic uncertainty (Kendall & Gal, 2017; Hüllermeier & Waegeman, 2019) on regression tasks by training the neural network to directly infer the parameters of the evidential distribution, thereby producing uncertainty measures. While Posterior Networks (Charpentier et al., 2020; 2022) are designed for general classification/regression tasks and achieve promising performance, they do not explicitly consider imbalance in regression tasks, which is the focus of this paper. DER (Amini et al., 2020) is designed for the data-rich regime and therefore fails to reasonably estimate uncertainty when the data is imbalanced; for data with minority labels, DER (Amini et al., 2020) tends to produce unstable distribution parameters, leading to poor uncertainty estimation (as shown in Sec. 4).
In contrast, our proposed VIR explicitly handles data imbalance in the continuous label space to avoid such instability; VIR does so by modulating both the representations and the output conjugate distribution parameters according to the imbalanced label distribution, allowing training/inference to proceed as if the data were balanced and leading to better performance as well as uncertainty estimation (as shown in Sec. 4).

3. METHOD

In this section we introduce the problem setting, provide an overview of our VIR, and then describe details on each of VIR's key components.

3.1. PROBLEM SETTINGS

Assume an imbalanced dataset in a continuous space, {(x_i, y_i)}_{i=1}^N, where N is the total number of data points, x_i ∈ R^d is the input, and y_i ∈ Y ⊂ R is the corresponding label from a continuous label space Y. In practice, Y is partitioned into B equal-interval bins [y^{(0)}, y^{(1)}), [y^{(1)}, y^{(2)}), ..., [y^{(B-1)}, y^{(B)}), with slight notation overload. To directly compare with baselines, we use the same grouping index for the target value, b ∈ B, as in (Yang et al., 2021). We denote representations as z_i, and use (z^µ_i, z^Σ_i) = q_ϕ(z|x_i) to denote the mean and covariance of the probabilistic representation for input x_i generated by a probabilistic encoder parameterized by ϕ. Similarly, we use (ŷ_i, s_i) to denote the mean and variance of the predictive distribution generated by a probabilistic predictor p_θ(y_i|z). Furthermore, we denote \bar{z} as the mean of the representations z_i in each bin (i.e., \bar{z} = (1/N_b) Σ_{i=1}^{N_b} z_i in a bin with N_b data points).
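For concreteness, the equal-interval binning of the label space can be sketched as follows; the bin edges and count below are illustrative (chosen to match AgeDB-DIR's age range), not the paper's exact configuration:

```python
def bin_index(y, y_min=0.0, y_max=101.0, num_bins=101):
    """Map a continuous label y to the index b of its equal-interval bin
    [y^(b), y^(b+1)). Edge values and bin count are illustrative choices."""
    width = (y_max - y_min) / num_bins
    b = int((y - y_min) // width)
    return min(max(b, 0), num_bins - 1)  # clamp labels on the boundary
```

With one-year bins, ages 0.0-0.99 fall in bin 0, and the maximum age maps to the last bin.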

3.2. METHOD OVERVIEW

In order to achieve both desiderata in probabilistic deep imbalanced regression (i.e., performance improvement and uncertainty estimation), our proposed variational imbalanced regression (VIR) operates on both the encoder q_ϕ(z_i|{x_i}_{i=1}^N) and the predictor p_θ(y_i|z_i). A typical VAE (Kingma & Welling, 2014) lower-bounds input x_i's marginal likelihood; in contrast, VIR lower-bounds the joint marginal likelihood of input x_i and label y_i:

log p_θ(x_i, y_i) = D_KL(q_ϕ(z_i|{x_i}_{i=1}^N) || p_θ(z_i|x_i, y_i)) + L(θ, ϕ; x_i, y_i).

Note that our variational distribution q_ϕ(z_i|{x_i}_{i=1}^N) (1) does not condition on the label y_i, since the task is to predict y_i, and (2) conditions on all (neighboring) inputs {x_i}_{i=1}^N rather than just x_i. The second term L(θ, ϕ; x_i, y_i) is VIR's evidence lower bound (ELBO), which is defined as:

L(θ, ϕ; x_i, y_i) = E_q[log p_θ(x_i|z_i)] + E_q[log p_θ(y_i|z_i)] - D_KL(q_ϕ(z_i|{x_i}_{i=1}^N) || p_θ(z_i)) ≜ L^D_i + L^P_i - L^KL_i, (1)

where p_θ(z_i) is the standard Gaussian prior N(0, I), following the typical VAE (Kingma & Welling, 2014), and the expectation is taken over q_ϕ(z_i|{x_i}_{i=1}^N), which infers z_i by borrowing data with similar regression labels to produce balanced probabilistic representations; this is beneficial especially for the minority (see Sec. 3.3 for details). Different from typical regression models, which produce only point estimates for y_i, our VIR's predictor p_θ(y_i|z_i) directly produces the parameters of an entire NIG distribution for y_i and further imposes probabilistic reweighting on the imbalanced data, thereby producing balanced predictive distributions (more details in Sec. 3.4).
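To make the per-point negative ELBO concrete, a minimal sketch for a diagonal-Gaussian posterior and a standard-normal prior is shown below. The function names are ours, and the reconstruction and prediction NLL terms are assumed to be computed elsewhere by the decoder and predictor:

```python
import math

def gaussian_kl(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), summed over dimensions:
    the L^KL_i term of the ELBO (closed form for diagonal Gaussians)."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))

def neg_elbo(recon_nll, pred_nll, mu, var):
    """Negative ELBO for one data point: reconstruction NLL (-L^D_i),
    predictive NLL (-L^P_i), plus the KL regularizer (L^KL_i)."""
    return recon_nll + pred_nll + gaussian_kl(mu, var)
```

A posterior matching the prior contributes zero KL, so only the two NLL terms remain.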

3.3. CONSTRUCTING q(z_i|{x_i}_{i=1}^N)

To cover both desiderata, one needs to (1) produce balanced representations to improve performance for data with minority labels and (2) produce probabilistic representations to naturally obtain reasonable uncertainty estimation for each model prediction. To learn such balanced probabilistic representations, we construct the encoder of our VIR (i.e., q_ϕ(z_i|{x_i}_{i=1}^N)) by (1) first encoding a data point into a probabilistic representation, (2) computing probabilistic statistics from neighboring representations (i.e., representations from data with similar regression labels), and (3) producing the final representations via probabilistic whitening and re-coloring using the obtained statistics. Probabilistic Representations. We first encode each data point into a probabilistic representation. Note that this is in contrast to existing work (Yang et al., 2021) that uses deterministic representations. We assume that each encoding z_i follows a Gaussian distribution with parameters (z^µ_i, z^Σ_i), which are generated from the last layer of the deep neural network. From I.I.D. to Neighboring and Identically Distributed (N.I.D.). A typical VAE (Kingma & Welling, 2014) is an unsupervised learning model that aims to learn a variational representation in latent space to reconstruct the original inputs under the I.I.D. assumption; that is, in a VAE, the latent value z_i is generated from its own input x_i. This I.I.D. assumption works well for data with majority labels but significantly harms performance for data with minority labels. To address this problem, we replace the I.I.D. assumption with the N.I.D. assumption; specifically, VIR's variational latent representations still follow Gaussian distributions (i.e., N(z^µ_i, z^Σ_i)), but these distributions are first calibrated using data with neighboring labels.
For a data point (x_i, y_i) where y_i is in the b'th bin, i.e., y_i ∈ [y^{(b-1)}, y^{(b)}), we compute q(z_i|{x_i}_{i=1}^N) ≜ N(z_i; \tilde{z}^µ_i, \tilde{z}^Σ_i) as:

Mean and Covariance of Initial z_i: z^µ_i, z^Σ_i = I(x_i), (2)
Statistics of Bin b's Statistics: µ^µ_b, µ^Σ_b, Σ^µ_b, Σ^Σ_b = A({z^µ_i, z^Σ_i}_{i=1}^N), (3)
Smoothed Statistics of Bin b's Statistics: \tilde{µ}^µ_b, \tilde{µ}^Σ_b, \tilde{Σ}^µ_b, \tilde{Σ}^Σ_b = S({µ^µ_b, µ^Σ_b, Σ^µ_b, Σ^Σ_b}_{b=1}^B), (4)
Mean and Covariance of Final z_i: \tilde{z}^µ_i, \tilde{z}^Σ_i = F(z^µ_i, z^Σ_i, µ^µ_b, µ^Σ_b, Σ^µ_b, Σ^Σ_b, \tilde{µ}^µ_b, \tilde{µ}^Σ_b, \tilde{Σ}^µ_b, \tilde{Σ}^Σ_b), (5)

where the details of the functions I(•), A(•), S(•), and F(•) are described below. Function I(•): From Deterministic to Probabilistic Statistics. Different from the deterministic statistics in (Yang et al., 2021), our VIR's encoder uses probabilistic statistics (i.e., statistics of statistics). Specifically, VIR treats z_i as a distribution with mean and covariance (z^µ_i, z^Σ_i) = I(x_i) rather than as a deterministic vector. As a result, all the deterministic statistics µ_b, Σ_b, \tilde{µ}_b, and \tilde{Σ}_b are replaced by distributions with means and covariances (µ^µ_b, µ^Σ_b), (Σ^µ_b, Σ^Σ_b), (\tilde{µ}^µ_b, \tilde{µ}^Σ_b), and (\tilde{Σ}^µ_b, \tilde{Σ}^Σ_b), respectively. Function A(•): Statistics of the Current Bin b's Statistics. As part of our probabilistic overall statistics, the probabilistic overall mean becomes a distribution with mean (letting µ_b = \bar{z}) and covariance (assuming diagonal covariance):

µ^µ_b = E[\bar{z}] = (1/N_b) Σ_{i=1}^{N_b} z^µ_i, µ^Σ_b = V[\bar{z}] = (1/N_b²) Σ_{i=1}^{N_b} z^Σ_i.

Similarly, our probabilistic overall covariance becomes a matrix-variate distribution (Gupta & Nagar, 2018) with mean:

Σ^µ_b = E[(1/N_b) Σ_{i=1}^{N_b} (z_i - \bar{z})²] = (1/N_b) Σ_{i=1}^{N_b} (z^Σ_i + (z^µ_i)²) - (µ^Σ_b + (µ^µ_b)²),

since E[\bar{z}] = µ^µ_b and V[\bar{z}] = µ^Σ_b. Note that the covariance of Σ_b, i.e., Σ^Σ_b, involves computing fourth-order moments, which is computationally prohibitive. Therefore, in practice we directly set Σ^Σ_b to zero for simplicity; empirically we observe that this simplified treatment already achieves promising performance improvement upon the state of the art. Function S(•): Neighboring Data and Smoothed Statistics.
Next, we borrow data with neighboring labels (from neighboring label bins) to compute the smoothed statistics of the current bin b by applying a symmetric kernel k(•, •) (e.g., Gaussian, Laplacian, or triangular kernels). Specifically, the probabilistic smoothed mean and covariance are (assuming diagonal covariance):

\tilde{µ}^µ_b = Σ_{b'∈B} k(y_b, y_{b'}) µ^µ_{b'}, \tilde{µ}^Σ_b = Σ_{b'∈B} k²(y_b, y_{b'}) µ^Σ_{b'}, \tilde{Σ}^µ_b = Σ_{b'∈B} k(y_b, y_{b'}) Σ^µ_{b'}.

Function F(•): Probabilistic Whitening and Re-coloring. We develop a probabilistic version of the whitening and re-coloring procedure (Sun et al., 2016) used in (Yang et al., 2021). Specifically, we produce the final probabilistic representation (\tilde{z}^µ_i, \tilde{z}^Σ_i) for each data point as:

\tilde{z}^µ_i = (z^µ_i - µ^µ_b) · (\tilde{Σ}^µ_b / Σ^µ_b) + \tilde{µ}^µ_b, \tilde{z}^Σ_i = (z^Σ_i + µ^Σ_b) · (\tilde{Σ}^µ_b / Σ^µ_b) + \tilde{µ}^Σ_b.

Inspired by (Yang et al., 2021), we keep updating the probabilistic overall statistics {µ^µ_b, µ^Σ_b, Σ^µ_b} and the probabilistic smoothed statistics {\tilde{µ}^µ_b, \tilde{µ}^Σ_b, \tilde{Σ}^µ_b} across different epochs. The probabilistic representation (\tilde{z}^µ_i, \tilde{z}^Σ_i) is then re-parameterized (Kingma & Welling, 2014) into the final representation z_i and passed into the final layer (discussed in Sec. 3.4) to generate the prediction and uncertainty estimation. Note that the computation of statistics from multiple x's is only needed during training. During testing, VIR directly uses these statistics and therefore does not need to re-compute them.
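As a deliberately simplified illustration of the statistics-of-statistics, kernel-smoothing, and re-coloring steps, the sketch below works with a one-dimensional latent; all function names, the Gaussian-kernel choice, and its bandwidth are our own assumptions rather than the paper's implementation:

```python
import math

def bin_statistics(mus, variances):
    """Function A (scalar sketch): for z_i ~ N(mu_i, var_i) in one bin,
    return (mu_mu, mu_var, sigma_mu): the mean and variance of the bin
    mean z-bar, and the mean of the bin's overall variance."""
    n = len(mus)
    mu_mu = sum(mus) / n                               # E[z-bar]
    mu_var = sum(variances) / (n * n)                  # V[z-bar]
    second = sum(v + m * m for m, v in zip(mus, variances)) / n
    sigma_mu = second - (mu_var + mu_mu * mu_mu)       # E[(1/N) sum (z_i - z-bar)^2]
    return mu_mu, mu_var, sigma_mu

def smooth(stats, centers, b, squared=False, bw=2.0):
    """Function S (scalar sketch): kernel-smooth a per-bin statistic around
    bin b with a Gaussian kernel (k^2 when squared=True, as for the smoothed
    variance-of-the-mean)."""
    k = [math.exp(-(centers[b] - c) ** 2 / (2.0 * bw * bw)) for c in centers]
    if squared:
        k = [ki * ki for ki in k]
    return sum(ki * s for ki, s in zip(k, stats))

def recolor(z_mu, z_var, mu_mu, mu_var, sig_mu, sm_mu_mu, sm_mu_var, sm_sig_mu):
    """Function F (scalar sketch): probabilistic whitening and re-coloring
    of one representation's mean and variance channels."""
    ratio = sm_sig_mu / sig_mu
    z_mu_final = (z_mu - mu_mu) * ratio + sm_mu_mu
    z_var_final = (z_var + mu_var) * ratio + sm_mu_var
    return z_mu_final, z_var_final
```

With deterministic encodings (all variances zero), `bin_statistics` reduces to the ordinary bin mean and population variance, recovering the deterministic statistics of (Yang et al., 2021) as a special case.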

3.4. CONSTRUCTING p(y i |z i )

Our VIR's predictor p(y_i|z_i) ≜ N(y_i; ŷ_i, s_i) predicts both the mean and variance for y_i by first predicting the NIG distribution and then marginalizing out the latent variables. It is motivated by the following observations on label distribution smoothing (LDS) in (Yang et al., 2021) and deep evidential regression (DER) in (Amini et al., 2020), as well as intuitions on effective counts in conjugate distributions. LDS's Limitations in Our Probabilistic Imbalanced Regression Setting. The motivation of LDS (Yang et al., 2021) is that the empirical label distribution cannot reflect the real label distribution in an imbalanced dataset with a continuous label space; consequently, reweighting methods for imbalanced regression fail due to these inaccurate label densities. By applying a smoothing kernel to the empirical label distribution, LDS tries to recover the effective label distribution, with which reweighting methods can obtain 'better' weights to improve imbalanced regression. In our probabilistic imbalanced regression setting, however, one needs to consider both (1) the performance for data with minority labels and (2) uncertainty estimation for each model prediction. LDS only focuses on improving accuracy, especially for data with minority labels, and therefore does not provide uncertainty estimation, which is crucial to assess the predictions' reliability. DER's Limitations in Our Probabilistic Imbalanced Regression Setting. In DER (Amini et al., 2020), the predicted labels with their corresponding uncertainties are produced from the posterior parameters of a normal-inverse-gamma (NIG) distribution NIG(γ, ν, α, β), while the model is trained by minimizing the negative log-likelihood (NLL) of a Student-t distribution:

L^DER_i = (1/2) log(π/ν) + (α + 1/2) log((y_i - γ)² ν + Ω) - α log(Ω) + log(Γ(α) / Γ(α + 1/2)), (6)

where Ω = 2β(1 + ν). It is therefore nontrivial to properly incorporate a reweighting mechanism into the NLL.
One straightforward approach is to directly reweight L DER i for different data points (x i , y i ). However, this contradicts the formulation of NIG and often leads to poor performance, as we verify in Sec. 4.
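For reference, the Student-t NLL above (Eqn. 6) can be transcribed directly, with all NIG parameters as scalars:

```python
import math

def der_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of the Student-t obtained by marginalizing
    a Normal-Inverse-Gamma over (mu, Sigma), as in DER (Amini et al., 2020).
    Uses lgamma for numerical stability instead of computing Gamma directly."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            + (alpha + 0.5) * math.log((y - gamma) ** 2 * nu + omega)
            - alpha * math.log(omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))
```

As expected for a likelihood centered at γ, the NLL grows as the residual |y - γ| grows.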

Intuition of Pseudo-Counts for VIR.

To properly incorporate different reweighting methods, our VIR relies on the intuition of pseudo-counts (pseudo-observations) in conjugate distributions (Bishop, 2006). Assuming a Gaussian likelihood, the conjugate prior is an NIG distribution (Bishop, 2006), i.e., (µ, Σ) ∼ NIG(γ, ν, α, β), which means:

µ ∼ N(γ, Σ/ν), Σ ∼ Γ⁻¹(α, β),

where Γ⁻¹(α, β) is an inverse-gamma distribution. With an NIG prior NIG(γ_0, ν_0, α_0, β_0), the posterior distribution after observing n real data points is NIG(γ_n, ν_n, α_n, β_n) with:

γ_n = (γ_0 ν_0 + n Ψ) / ν_n, ν_n = ν_0 + n, α_n = α_0 + n/2, β_n = β_0 + (1/2) γ_0² ν_0 + Φ, (7)

where Ψ = \bar{x} = (1/n) Σ_i x_i and Φ = (1/2)(Σ_i x_i² - γ_n² ν_n). Here ν_0 and α_0 can be interpreted as virtual observations, i.e., pseudo-counts or pseudo-observations that contribute to the posterior distribution. Overall, the mean of the posterior distribution above can be interpreted as an estimation from (ν_0 + n) observations, with ν_0 virtual observations and n real observations. Similarly, the variance can be interpreted as an estimation from (2α_0 + n) observations. This intuition is crucial in developing the predictor of our VIR. From Pseudo-Counts to Balanced Predictive Distributions. Based on the intuition above, we construct our predictor (i.e., p(y_i|z_i)) by (1) generating the parameters of the posterior NIG distribution, (2) computing re-weighted parameters by imposing the importance weights obtained from LDS, and (3) producing the final prediction with its corresponding uncertainty estimation. Based on Eqn. 7, we feed the final representation {z_i}_{i=1}^N generated in Sec. 3.3 (Eqn.
5) into a linear layer to output the intermediate parameters (n_i, Ψ_i, Φ_i) for data point (x_i, y_i):

n_i, Ψ_i, Φ_i = G(z_i), z_i ∼ q(z_i|{x_i}_{i=1}^N) = N(z_i; \tilde{z}^µ_i, \tilde{z}^Σ_i).

We then apply the importance weight w_b = (Σ_{b'∈B} k(y_b, y_{b'}))^{-1/2}, calculated from the smoothed label distribution, to the pseudo-count n_i to produce the re-weighted parameters of the NIG posterior. Along with the pre-defined prior parameters (γ_0, ν_0, α_0, β_0), we can then compute the parameters of the posterior distribution NIG(γ*_i, ν*_i, α*_i, β*_i) for (x_i, y_i):

γ*_i = (γ_0 ν_0 + w_b n_i Ψ_i) / ν*_i, ν*_i = ν_0 + w_b n_i, α*_i = α_0 + w_b n_i / 2, β*_i = β_0 + (1/2) γ_0² ν_0 + Φ_i.

Based on this NIG posterior, we compute the final prediction and uncertainty estimation as ŷ_i = γ*_i and s_i = β*_i / (ν*_i (α*_i - 1)). We use an objective function similar to Eqn. 6, but with different definitions of (γ, ν, α, β), to optimize our VIR model:

L^P_i = E_{q_ϕ(z_i|{x_i}_{i=1}^N)}[(1/2) log(π/ν*_i) + (α*_i + 1/2) log((y_i - γ*_i)² ν*_i + ω*_i) - α*_i log(ω*_i) + log(Γ(α*_i) / Γ(α*_i + 1/2))], (8)

where ω*_i = 2β*_i(1 + ν*_i). Note that L^P_i is part of the ELBO in Eqn. 1. Similar to (Amini et al., 2020), we use an additional regularization term to achieve better accuracy[1]: L^R_i = (ν*_i + 2α*_i) · |y_i - ŷ_i|. L^P_i and L^R_i together constitute the objective function for learning the predictor p(y_i|z_i).
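The standard NIG update of Eqn. 7 and the re-weighted VIR variant above can be sketched as follows; the prior values in `vir_predict` are illustrative defaults, not the paper's tuned hyperparameters:

```python
def nig_posterior(x, gamma0, nu0, alpha0, beta0):
    """Standard NIG posterior after n real observations x (Eqn. 7):
    Psi is the sample mean, Phi the re-centered sum of squares."""
    n = len(x)
    psi = sum(x) / n
    nu_n = nu0 + n
    gamma_n = (gamma0 * nu0 + n * psi) / nu_n
    alpha_n = alpha0 + n / 2.0
    phi = 0.5 * (sum(xi * xi for xi in x) - gamma_n * gamma_n * nu_n)
    beta_n = beta0 + 0.5 * gamma0 * gamma0 * nu0 + phi
    return gamma_n, nu_n, alpha_n, beta_n

def vir_predict(n_i, psi_i, phi_i, weight,
                gamma0=0.0, nu0=1.0, alpha0=2.0, beta0=1.0):
    """VIR-style prediction: the importance weight from the smoothed label
    distribution scales the pseudo-count n_i before the NIG update, so
    minority bins keep more prior mass and less (noisy) evidence.
    Returns (y-hat_i, s_i)."""
    wn = weight * n_i
    nu = nu0 + wn
    gamma = (gamma0 * nu0 + wn * psi_i) / nu
    alpha = alpha0 + wn / 2.0
    beta = beta0 + 0.5 * gamma0 * gamma0 * nu0 + phi_i
    return gamma, beta / (nu * (alpha - 1.0))
```

Shrinking the weight pulls the prediction toward the prior and inflates the predictive variance, which is exactly the behavior wanted for minority labels.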

3.5. FINAL OBJECTIVE FUNCTION

Putting together Sec. 3.3 and Sec. 3.4, our final objective function (to minimize) for VIR is:

L^VIR = Σ_{i=1}^N L^VIR_i, L^VIR_i = λ L^R_i - L(θ, ϕ; x_i, y_i) = λ L^R_i - L^P_i - L^D_i + L^KL_i,

where L(θ, ϕ; x_i, y_i) = L^P_i + L^D_i - L^KL_i is the ELBO in Eqn. 1, and λ adjusts the relative importance of the additional regularizer and the ELBO, leading to better results in both accuracy and uncertainty estimation.

3.6. DISCUSSION ON I.I.D. AND N.I.D. ASSUMPTIONS

Generalization Error, Bias, and Variance. We can analyze the generalization error of our VIR by bounding it with the sum of three terms: (1) the bias of our estimator, (2) the variance of our estimator, and (3) the model complexity. Essentially, VIR's N.I.D. assumption increases our estimator's bias but significantly reduces its variance in the imbalanced setting. Since the model complexity is kept the same (using the same backbone neural network) as the baselines, N.I.D. leads to a lower generalization error (see Sec. A of the Appendix for more discussion).
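Assembled per data point, the final objective is a simple combination of the terms already defined; the sketch below uses an illustrative value of λ:

```python
def vir_loss(l_reg, l_pred, l_recon, l_kl, lam=0.5):
    """Per-point VIR objective: lambda * L^R_i minus the ELBO
    (L^P_i + L^D_i - L^KL_i). The value of lambda is illustrative."""
    elbo = l_pred + l_recon - l_kl
    return lam * l_reg - elbo
```

Minimizing this trades off regression accuracy (via the regularizer) against the variational objective (via the ELBO).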

4. RESULTS

Datasets. In this work, we evaluate our methods in terms of prediction accuracy and uncertainty estimation on two imbalanced datasets[2]: AgeDB (Moschoglou et al., 2017) and IMDB-WIKI (Rothe et al., 2018). We follow the preprocessing procedures in DIR (Yang et al., 2021). Details on the label density distributions and levels of imbalance are discussed in DIR (Yang et al., 2021). AgeDB-DIR: We use AgeDB-DIR constructed in DIR (Yang et al., 2021), which contains 12.2K images for training and 2.1K images for validation and testing. The maximum age in this dataset is 101 and the minimum age is 0, and the number of images per bin varies between 1 and 353.

IMDB-WIKI-DIR:

We use IMDB-WIKI-DIR constructed in DIR (Yang et al., 2021), which contains 191.5K training images and 11.0K validation and testing images. The maximum age is 186 and the minimum age is 0; the maximum bin density is 7149, and the minimum bin density is 1.

STS-B-DIR:

We use STS-B-DIR constructed in DIR (Yang et al., 2021), which contains 5.2K sentence pairs for training and 1.0K pairs for validation and testing. This dataset is a collection of sentence pairs generated from news headlines, video captions, etc. Each pair is annotated by multiple annotators with a similarity score between 0 and 5. Baselines. We use ResNet-50 (He et al., 2016) as our backbone network, and we describe the baselines below. Vanilla: We use the term VANILLA to denote a plain model without adding any of these approaches. Synthetic-Sample-Based Methods: Various existing imbalanced regression methods are also included as baselines, including SMOTER (Torgo et al., 2013) and SMOGN (Branco et al., 2017). Furthermore, following DIR (Yang et al., 2021), on IMDB-WIKI-DIR we also include two additional methods: MIXUP (Zhang et al., 2018) and M-MIXUP (Verma et al., 2019). Cost-Sensitive Reweighting: As shown in DIR (Yang et al., 2021), the square-root inverse weighting variant (SQINV), i.e., weighting by (Σ_{b'∈B} k(y_b, y_{b'}))^{-1/2}, always outperforms VANILLA. Therefore, for simplicity and fair comparison, all our experiments (for both baselines and VIR) use SQINV weighting. To use SQINV in VIR, one simply needs to use the symmetric kernel k(•, •) described in Sec. 3.3. To use SQINV in DER, we replace the final layer in DIR (Yang et al., 2021) with the DER layer (Amini et al., 2020) to produce the predictive distributions. Evaluation Metrics - Accuracy. We follow the evaluation metrics in (Yang et al., 2021) to evaluate the accuracy of our proposed methods; these include the mean absolute error (MAE), mean squared error (MSE), and error geometric mean (GM):

MAE = (1/N) Σ_{i=1}^N |y_i - ŷ_i|, MSE = (1/N) Σ_{i=1}^N (y_i - ŷ_i)², GM = (Π_{i=1}^N |y_i - ŷ_i|)^{1/N}.

Evaluation Metrics - Uncertainty Estimation.
We use typical evaluation metrics for uncertainty estimation in regression problems to evaluate the produced uncertainty estimates; these include the negative log-likelihood (NLL) and the area under the sparsification error curve (AUSE). Eqn. 8 shows the formula for NLL; more details regarding AUSE can be found in (Ilg et al., 2018). Evaluation Process. Following (Liu et al., 2019; Yang et al., 2021), for a data sample x_i with label y_i falling into target bin b_i, we divide the label space into three disjoint subsets: the many-shot region {b_i ∈ B | y_i ∈ b_i and |b_i| > 100}, the medium-shot region {b_i ∈ B | y_i ∈ b_i and 20 ≤ |b_i| ≤ 100}, and the few-shot region {b_i ∈ B | y_i ∈ b_i and |b_i| < 20}, where |•| denotes the cardinality of (i.e., the number of training samples in) a bin. We report results on the overall test set and on these subsets with the accuracy metrics discussed above. Implementation Details. We use ResNet-50 (He et al., 2016) for all experiments on AgeDB-DIR and IMDB-WIKI-DIR. We use the Adam optimizer (Kingma & Ba, 2015) to train all models for 100 epochs, with the same learning rate, decayed by 0.1 at the 60th and 90th epochs, respectively. To determine the optimal batch size for training, we tried different batch sizes and reached the same conclusion as the DIR paper, i.e., the optimal batch size is 256 when other hyperparameters are fixed; we therefore use a batch size of 256 throughout the experiments in this paper. Meanwhile, we use the same hyperparameters as in DIR (Yang et al., 2021). We use PyTorch to implement our method. For fair comparison, we implemented a PyTorch version of the official TensorFlow implementation of DER (Amini et al., 2020). To obtain reasonable uncertainty estimates, we restrict the range of α to [1.5, ∞) instead of [1.0, ∞) as in DER. Besides, in the SoftPlus activation function, we set the hyperparameter beta to 0.1. As discussed in Sec. 3.4, we implement a layer which produces the parameters n, Ψ, Φ.
We assign 2 as the minimum value of n, and use the same activation-function settings as the DER layer. To search for a combination of prior-distribution hyperparameters {γ_0, ν_0, α_0, β_0} for the NIG, we combine grid search and random search (Bergstra & Bengio, 2012) to select the best hyperparameters. We first assign an initial value and a proper range with step sizes for each hyperparameter; then we apply grid search to find the best combination of hyperparameters for the prior distributions. After narrowing the range for each hyperparameter, we use random search to look for better combinations, if any exist. In the end, we obtain our best hyperparameter combination for the NIG prior distribution.
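The output layer described above can be sketched as follows; the mapping of raw network outputs to (n_i, Ψ_i, Φ_i) is our own assumption about how the constraints (SoftPlus with beta 0.1, minimum pseudo-count of 2, nonnegative Φ) might be enforced, not the paper's exact implementation:

```python
import math

def softplus(x, beta=0.1):
    """SoftPlus with beta = 0.1, keeping evidential outputs positive
    while growing slowly, as described in the implementation details."""
    return math.log1p(math.exp(beta * x)) / beta

def evidential_head(raw_n, raw_psi, raw_phi):
    """Map raw linear-layer outputs to intermediate NIG parameters
    (n_i, Psi_i, Phi_i): n_i is floored at 2, Psi_i (a location) is left
    unconstrained, and Phi_i is kept nonnegative. A hypothetical sketch."""
    n = 2.0 + softplus(raw_n)    # pseudo-count floor of 2
    psi = raw_psi                # location parameter, unconstrained
    phi = softplus(raw_phi)      # nonnegative sum-of-squares term
    return n, psi, phi
```

In a real model these raw values would come from the final linear layer applied to the re-parameterized representation z_i.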

4.1. RESULTS FOR IMBALANCED REGRESSION ACCURACY

We report the accuracy of different methods in Table 1 and Table 2 for AgeDB-DIR and IMDB-WIKI-DIR, respectively[3]. In both tables, we can see that our methods outperform the baselines in their respective categories. For ablation studies, see Table 5 and Table 6 of the Appendix. Note that to ensure a fair and solid comparison, we re-ran the DIR methods with our machine and software settings[4]. Overall Performance. As shown in the last category (i.e., the last four rows) of both tables, our proposed method's best variants compare favorably against the state of the art, including DIR variants (Yang et al., 2021) and DER (Amini et al., 2020), especially on the imbalanced data samples (i.e., in the few-shot columns). This verifies the effectiveness of our methods in terms of overall performance.

4.2. RESULTS FOR IMBALANCED REGRESSION UNCERTAINTY ESTIMATION

Different from DIR (Yang et al., 2021), which only focuses on accuracy, we create a new benchmark for uncertainty estimation in imbalanced regression. Results show that VIR outperforms the baselines in all few-shot metrics. In some categories, VIR may not perform better on the overall, many-shot, and medium-shot metrics, but the gap tends to be minimal. Note that our proposed methods mainly target the imbalanced setting; we therefore also focus on the few-shot metrics. Lastly, comparing our model variant with the best performance against the baseline (DER), we can conclude that our methods successfully improve uncertainty estimation in the probabilistic imbalanced regression setting. We also observe that the improvements in uncertainty estimation on IMDB-WIKI are larger than those on AgeDB. We suspect that this is because IMDB-WIKI contains much more training, validation, and testing data, therefore enjoying more stable uncertainty estimation improvements brought by VIR compared to AgeDB.

5. CONCLUSION

We identify the problem of probabilistic deep imbalanced regression, which aims to both improve accuracy and obtain reasonable uncertainty estimation in imbalanced regression. We propose VIR, which can use any deep regression model as its backbone network. VIR borrows data with similar regression labels to produce probabilistic representations and modulates the conjugate distributions to impose probabilistic reweighting on imbalanced data. Furthermore, we create new benchmarks for uncertainty estimation in imbalanced regression. Experiments show that our methods outperform state-of-the-art imbalanced regression models in terms of both accuracy and uncertainty estimation. Future work may include (1) improving VIR by better approximating the variance of the variances in probability distributions, (2) developing novel approaches that achieve stable performance even on imbalanced data with limited sample sizes, and (3) exploring techniques such as mixture density networks (Bishop, 1994) to enable multi-modality in the latent distribution, thereby further improving performance.



Note that in DER, the total evidence Φ has the value 2ν + α; however, we believe it would be more reasonable to use ν + 2α as the total evidence for an NIG distribution (Bishop, 2006).

Among the five datasets proposed in (Yang et al., 2021), only four are publicly available. In this paper we use the largest (IMDB-WIKI) and the smallest (AgeDB) of the four to evaluate our method. Results for STS-B-DIR are reported in Table 7, Table 8, and Table 9 of the Appendix.

We find that due to differences in PyTorch, GPU, and CUDA versions, as well as the number of GPUs used for parallel training, the results in DIR may vary. Furthermore, the randomness introduced by multiple workers in the DataLoader also affects performance.
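For context, the NIG quantities that the evidence discussion above refers to can be summarized as follows (following DER (Amini et al., 2020); the alternative evidence in the last line is our suggestion, not DER's):

```latex
% Predictive quantities of the posterior NIG(\gamma, \nu, \alpha, \beta):
\mathbb{E}[\mu] = \gamma \quad \text{(prediction)}, \qquad
\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1} \quad \text{(aleatoric)}, \qquad
\operatorname{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)} \quad \text{(epistemic)}.

% Total evidence: DER uses
\Phi_{\text{DER}} = 2\nu + \alpha,
% whereas we argue for
\Phi = \nu + 2\alpha .
```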



Figure 2: Overview of our VIR method. Left: The inference model infers the latent representations given input x's in the neighborhood. Right: The generative model reconstructs the input and predicts the label distribution (including the associated uncertainty) given the latent representation.

respectively (more details in the following three paragraphs on A(•), S(•), and F(•)). Function A(•): Statistics of the Current Bin b. As part of our probabilistic overall statistics, the probabilistic overall mean becomes a distribution with mean (letting µ_b = z) and covariance (assuming diagonal covariance):
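In code, the statistics of bin b can be sketched as follows. This is a minimal, hypothetical illustration of what A(•) computes; the function and variable names are ours, and the actual VIR implementation operates on learned latent distributions rather than raw arrays.

```python
import numpy as np

def bin_statistics(z, bin_ids, b):
    """Statistics of bin b: mean and diagonal covariance of the latent
    codes z whose (discretized) regression labels fall into bin b.

    z:       (N, D) array of latent representations
    bin_ids: (N,) array of bin indices for the regression labels
    b:       the bin to summarize
    """
    z_b = z[bin_ids == b]            # latent codes belonging to bin b
    mu_b = z_b.mean(axis=0)          # probabilistic overall mean
    var_b = z_b.var(axis=0)          # diagonal entries of the covariance
    return mu_b, var_b

# Toy usage: six 2-D latent codes split across two bins.
z = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 1.0],
              [10.0, 10.0], [12.0, 12.0], [11.0, 11.0]])
bin_ids = np.array([0, 0, 0, 1, 1, 1])
mu0, var0 = bin_statistics(z, bin_ids, 0)
```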

Table 1: Evaluation results of accuracy on AgeDB-DIR.

Table 3 and Table 4 show the results on uncertainty estimation for the two datasets AgeDB-DIR and IMDB-WIKI-DIR, respectively. Note that most baselines from Table 1 and Table 2 are deterministic methods (as opposed to probabilistic ones) and cannot provide uncertainty estimation; therefore they are not applicable here. To show the superiority of our VIR model, we create the strongest baseline by concatenating the DIR variants (LDS + FDS) with DER (Amini et al., 2020).

Table 3: Uncertainty estimation results on AgeDB-DIR.

| Method | All | Many | Median | Few | All | Many | Median | Few |
|---|---|---|---|---|---|---|---|---|
| DER (Amini et al., 2020) | 3.936 | 3.768 | 3.865 | 4.421 | 0.590 | 0.449 | 0.468 | 0.500 |
| LDS + FDS + DER (Yang et al., 2021; Amini et al., 2020) | 3.794 | 3.699 | 3.969 | 4.214 | 0.463 | 0.260 | 0.392 | 0.617 |
| VIR (Ours) | 3.703 | 3.598 | 3.805 | 4.196 | 0.437 | 0.474 | 0.319 | 0.413 |
| Ours vs. DER | +0.064 | +0.071 | +0.060 | +0.225 | +0.153 | +0.026 | +0.007 | +0.036 |

Table 4: Uncertainty estimation results on IMDB-WIKI-DIR.

| Method | All | Many | Median | Few | All | Many | Median | Few |
|---|---|---|---|---|---|---|---|---|
| LDS + FDS + DER (Yang et al., 2021; Amini et al., 2020) | 3.683 | 3.602 | 4.391 | 5.697 | 0.784 | 0.670 | 0.455 | 0.483 |
| VIR (Ours) | 3.652 | 3.568 | 4.419 | 5.560 | 0.622 | 0.645 | 0.511 | 0.374 |
| Ours vs. DER | +0.198 | +0.131 | +0.578 | +1.078 | +0.191 | +0.157 | +0.202 | +0.167 |

APPENDIX

We can analyze the generalization error of our VIR by bounding it with the sum of three terms: (1) the bias of our estimator, (2) the variance of our estimator, and (3) the model complexity. Essentially, VIR's N.I.D. assumption increases our estimator's bias but significantly reduces its variance in the imbalanced setting. Since the model complexity is kept the same as the baselines (using the same backbone neural network), N.I.D. leads to a lower generalization error.

Variance of Estimators in Imbalanced Settings. In the imbalanced setting, one typically uses inverse weighting to produce an unbiased estimator (i.e., making the first term of the aforementioned bound zero). However, for data with extremely low density, the inverse would be extremely large, leading to a very large variance for the estimator. Our VIR replaces I.I.D. with N.I.D. to "smooth out" such singularities, and therefore significantly lowers the variance of the estimator (i.e., making the second term of the aforementioned bound smaller), ultimately lowering the generalization error.
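The variance argument can be illustrated with a toy computation. This is purely illustrative: the smoothing below is a simple moving average over neighboring labels, standing in for the spirit of N.I.D., not VIR's actual mechanism. With raw inverse-density weights, a rare label receives a huge weight, which dominates the weighted estimator's variance; smoothing the density before inverting caps the largest weight.

```python
import numpy as np

# Imbalanced label density over 5 discrete labels (last label is rare).
p = np.array([0.4, 0.3, 0.2, 0.09, 0.01])

# Raw inverse-density weights: the rare label's weight explodes.
w_raw = 1.0 / p

# Smooth the density over neighboring labels before inverting
# (moving average of width 3), then renormalize.
p_smooth = np.convolve(p, np.ones(3) / 3, mode="same")
p_smooth /= p_smooth.sum()
w_smooth = 1.0 / p_smooth

# The largest raw weight dwarfs the largest smoothed weight, so the
# smoothed estimator's variance term is far smaller.
print(w_raw.max(), w_smooth.max())
```

The trade-off mirrors the bound above: smoothing introduces some bias (the second term of a weighted estimate no longer uses the exact density) in exchange for a much smaller variance.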

B.1 ABLATION STUDY ON VIR

In this section, we include ablation studies to verify that our VIR can outperform its counterparts in DIR (i.e., smoothing on the latent space) and DER (i.e., NIG distribution layers).

Ablation Study on q(z_i | {x_i}_{i=1}^N). To verify the effectiveness of VIR's encoder q(z_i | {x_i}_{i=1}^N), we replace VIR's predictor p(y_i | z_i) with a linear layer (as in DIR). Table 5 shows that compared to its counterpart, FDS (Yang et al., 2021), our encoder-only VIR still leads to considerable improvements even without generating the NIG distribution, verifying the effectiveness of VIR's q(z_i | {x_i}_{i=1}^N).

Ablation Study on p(y_i | z_i). To verify the effectiveness of VIR's predictor p(y_i | z_i), we replace VIR's encoder q(z_i | {x_i}_{i=1}^N) with a simple deterministic encoder as in DER (Amini et al., 2020). Table 5 and Table 6 show that, compared to DER (the counterpart of VIR's predictor), our VIR's predictor still outperforms DER, demonstrating its effectiveness; this verifies our claim (Sec. 3.4) that directly reweighting DER breaks the NIG distribution and leads to poor performance.

B.2 RESULT ON STS-B-DIR DATASET

In this section, we report the accuracy and uncertainty evaluation on STS-B-DIR (more details on the dataset can be found in DIR (Yang et al., 2021)). Results are reported in Table 7, Table 8, and Table 9 of the Appendix.

B.3 DIFFERENCE BETWEEN DIR'S AND OUR REPRODUCED RESULTS

To reproduce the results on AgeDB, we use exactly the same settings as in DIR's code (Yang et al., 2021) (i.e., we directly run their code on our machines without modifying hyperparameters). For each model in DIR that we report, we use five different random seeds to produce five results and report the average performance. Table 10 and Table 11 show the examples for SQINV and LDS+FDS on AgeDB-DIR. From the tables we can see that under our hardware and

