IMPROVING DEEP REGRESSION WITH ORDINAL ENTROPY

Abstract

In computer vision, it is often observed that formulating regression problems as classification tasks yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on this analysis, we propose an ordinal entropy regularizer that encourages higher-entropy feature spaces while maintaining ordinal relationships, thereby improving the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression. Code can be found at https://github.com/needylove/OrdinalEntropy.

1. INTRODUCTION

Classification and regression are two fundamental tasks of machine learning. The choice between the two usually depends on the categorical or continuous nature of the target output. Curiously, in computer vision, specifically with deep learning, it is often preferable to solve regression-type problems as classification tasks. A simple and common way is to discretize the continuous labels; each bin is then treated as a class. After converting regression into classification, the ordinal information of the target space is lost, and discretization errors are introduced into the targets. Yet for a diverse set of regression problems, including depth estimation (Cao et al., 2017), age estimation (Rothe et al., 2015), crowd counting (Liu et al., 2019a) and keypoint detection (Li et al., 2022), classification yields better performance.

The phenomenon of classification outperforming regression on inherently continuous estimation tasks naturally begs the question of why. Previous works have not investigated the cause, although they hint at task-specific reasons. For depth estimation, both Cao et al. (2017) and Fu et al. (2018) postulate that it is easier to estimate a quantized range of depth values than one precise depth value. For crowd counting, regression suffers from inaccurately generated target values (Xiong & Yao, 2022); discretization helps alleviate some of the imprecision. For pose estimation, classification allows for denser and more effective heatmap-based supervision (Zhang et al., 2020; Gu et al., 2021; 2022). Could the performance advantages of classification run deeper than task-specific nuances? In this work, we posit that regression lags in its ability to learn high-entropy feature representations. We arrive at this conclusion by analyzing the differences between classification and regression from a mutual information perspective.
According to Shwartz-Ziv & Tishby (2017), deep neural networks during learning aim to maximize the mutual information between the learned representation Z and the target Y. The mutual information between the two can be defined as I(Z; Y) = H(Z) - H(Z|Y). I(Z; Y) is large when the marginal entropy H(Z) is high, i.e., the features Z are as spread as possible, and the conditional entropy H(Z|Y) is low, i.e., features of common targets are as close as possible. Classification accomplishes both objectives (Boudiaf et al., 2020). This work, as a key contribution, shows through derivation that regression minimizes H(Z|Y) but ignores H(Z). Accordingly, the learned representations Z from regression have a lower marginal entropy (see Fig. 1(a)).

The difference in entropy between classification and regression stems from their different losses. We postulate that the lower-entropy features learned by L2 losses in regression explain the performance gap compared to classification. Despite its overall performance advantages, classification lags in the ability to capture ordinal relationships. As such, simply spreading the features for regression to emulate classification would break the inherent ordinality of the regression target output. To retain the benefits of both high entropy and ordinality for feature learning, we propose, as a second contribution, an ordinal entropy regularizer for regression. Specifically, we capture ordinal relationships as a weighting based on the distances between samples in both the representation and the target space. Our ordinal entropy regularizer increases the distances between representations, while weighting the distances to preserve the ordinal relationships. Experiments on various regression tasks demonstrate the effectiveness of our proposed method.

Our main contributions are three-fold:
• To the best of our knowledge, we are the first to analyze regression's reformulation as a classification problem from a representation learning perspective.
We find that regression lags in its ability to learn high-entropy features, which in turn lowers the mutual information between the learned representation and the target output.
• Based on our theoretical analysis, we design an ordinal entropy regularizer to learn high-entropy feature representations that preserve ordinality.
• Benefiting from our ordinal entropy loss, our methods achieve significant improvements on synthetic datasets for solving ODEs and stochastic PDEs, as well as on real-world regression tasks including depth estimation, crowd counting and age estimation.

2. RELATED WORK

Classification for Continuous Targets. Several works formulate regression problems as classification tasks to improve performance. They focus on different design aspects such as label discretization and uncertainty modeling. To discretize the labels, Cao et al. (2017), Fu et al. (2018) and Liu et al. (2019a) convert the continuous values into discrete intervals with a pre-defined interval width. To improve class flexibility, Bhat et al. (2021) followed up with an adaptive bin-width estimator. Due to inaccurate or imprecise regression targets, several works have explored modeling label uncertainty with classification. Liu et al. (2019a) proposed estimating, with high confidence, the interval within which a target falls. Tompson et al. (2014) and Newell et al. (2016) model uncertainty with a heatmap target in which each pixel represents the probability of that pixel being the target class. This work, instead of focusing on task-specific designs, explores the difference between classification and regression from a representation learning point of view. By analyzing mutual information, we reveal a previously underestimated impact of high-entropy feature spaces.

Ordinal Classification. Ordinal classification aims to predict ordinal target outputs. Many works exploit the distances between labels (Castagnos et al., 2022; Polat et al., 2022; Gong et al., 2022) to preserve ordinality. Our ordinal entropy regularizer also preserves ordinality by exploiting label distances, while primarily aiming to encourage a higher-entropy feature space.

Entropy. The entropy of a random variable reflects its uncertainty and can be used to analyze and regularize a feature space. With an entropy analysis, Boudiaf et al. (2020) showed the benefits of the cross-entropy loss, i.e., encouraging features to be dispersed while keeping intra-class features compact.
Moreover, existing works (Pereyra et al., 2017; Dubey et al., 2017) have shown that many regularization terms, such as confidence penalization (Pereyra et al., 2017) and label smoothing (Müller et al., 2019), are in fact regularizing the entropy of the output distribution. Inspired by these works, we explore the difference in entropy between classification and regression. Based on our entropy analysis, we design an entropy term (i.e., ordinal entropy) for regression that bypasses the explicit reformulation as classification with label discretization.

3. A MUTUAL-INFORMATION BASED COMPARISON ON FEATURE LEARNING

3.1. PRELIMINARIES

Suppose we have a dataset {X, Y} with N input data X = {x_i}_{i=1}^N and their corresponding labels Y = {y_i}_{i=1}^N. In a typical regression problem for computer vision, x_i is an image or video, while y_i ∈ Y takes a continuous value in the label space Y. The target of regression is to recover y_i by encoding the image to a feature z_i = φ(x_i) with encoder φ and then mapping z_i to a predicted target ŷ_i = f_θ(z_i) with a regression function f(·) parameterized by θ. The encoder φ and θ are learned by minimizing a regression loss such as the mean squared error

$$L_{mse} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2.$$

To formulate regression as a classification task with K classes, the continuous target Y can be converted to classes Y^C = {y_i^c}_{i=1}^N with some discretizing mapping function, where y_i^c ∈ [0, K-1] is a categorical target. The feature z_i is then mapped to the categorical target y_i^c = g_ω(z_i) with classifier g_ω(·) parameterized by ω. The encoder φ and ω are learned by minimizing the cross-entropy loss

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log \big(g_\omega(z_i)\big)_{y_i^c}, \quad \text{where } \big(g_\omega(z_i)\big)_k = \frac{\exp(\omega_k^T z_i)}{\sum_{k'} \exp(\omega_{k'}^T z_i)}.$$

The entropy of a random variable can be loosely defined as the amount of "information" associated with that random variable. One approach for estimating the entropy H(Z) of a random variable Z is the meanNN entropy estimator (Faivishevsky & Goldberger, 2008). It can accommodate higher-dimensional Z, and is commonly used in high-dimensional spaces (Faivishevsky & Goldberger, 2010). The meanNN estimator relies on the distances between samples to approximate p(Z). For a D-dimensional Z, it is defined as

$$\hat{H}(Z) = \frac{D}{N(N-1)} \sum_{i \neq j} \log \|z_i - z_j\|_2 + \text{const}. \quad (1)$$
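As a concrete illustration, the meanNN estimate of Eq. 1 can be computed directly from pairwise feature distances. Below is a minimal NumPy sketch (the function name and the small constant added for numerical stability are our own choices); it drops the additive constant of Eq. 1, which does not matter when comparing feature spaces:

```python
import numpy as np

def meannn_entropy(z):
    """meanNN entropy estimate (Eq. 1) for an (N, D) feature matrix z.

    Returns D / (N (N - 1)) * sum_{i != j} log ||z_i - z_j||, up to the
    additive constant of Eq. 1, which is omitted here.
    """
    n, d = z.shape
    dist = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1))  # (N, N) pairwise L2 distances
    mask = ~np.eye(n, dtype=bool)                                   # exclude the i == j terms
    return d / (n * (n - 1)) * np.log(dist[mask] + 1e-12).sum()

# Spread features receive a higher entropy estimate than collapsed ones.
rng = np.random.default_rng(0)
spread = rng.normal(size=(128, 16))
collapsed = 0.01 * spread
print(meannn_entropy(spread) > meannn_entropy(collapsed))  # True
```

Since only relative comparisons between feature spaces matter in our analysis, omitting the constant is harmless.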

3.2. FEATURE LEARNING WITH A CROSS-ENTROPY LOSS (CLASSIFICATION)

Given that the mutual information between target Y^C and feature Z is defined as I(Z; Y^C) = H(Z) - H(Z|Y^C), it follows that I(Z; Y^C) can be maximized by minimizing the second term H(Z|Y^C) and maximizing the first term H(Z). Boudiaf et al. (2020) showed that minimizing the cross-entropy loss accomplishes both, by approximating the standard cross-entropy loss L_CE with a pairwise cross-entropy loss L_PCE. L_PCE serves as a lower bound for L_CE, and can be defined as

$$L_{PCE} = \underbrace{-\frac{1}{2\lambda N^2} \sum_{i=1}^{N} \sum_{j:\, y_j^c = y_i^c} z_i^\top z_j}_{\text{Tightness}\;\propto\; H(Z|Y^C)} \;+\; \underbrace{\frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K} \exp\!\Big(\frac{1}{\lambda N} \sum_{j=1}^{N} p_{jk}\, z_i^\top z_j\Big) - \frac{1}{2\lambda} \sum_{k=1}^{K} \|c_k\|}_{\text{Diversity}\;\propto\; H(Z)}, \quad (2)$$

where λ ∈ R ensures that L_CE is convex with respect to ω. Intuitively, L_PCE can be understood as being composed of a pull and a push objective, as is familiar from contrastive learning. We interpret the pulling force as a tightness term. It encourages higher values of z_i^⊤ z_j and closely aligns the feature vectors within a given class. This results in features clustered according to their class, i.e., lower conditional entropy H(Z|Y^C). The pushing force from the second term encourages lower z_i^⊤ z_j while forcing the class centers c_k to be far from the origin. This results in diverse features that are spread apart, i.e., higher marginal entropy H(Z). Note that the tightness term corresponds to the numerator of the softmax function in L_CE, while the diversity term corresponds to the denominator.

3.3. FEATURE LEARNING WITH AN L MSE LOSS (REGRESSION)

In this work, we find that minimizing L_mse, as done in regression, is a proxy for minimizing H(Z|Y) without increasing H(Z). Minimizing L_mse does not increase the marginal entropy H(Z) and therefore limits feature diversity. The link between classification and regression is first established below in Lemma 1. We assume a linear regressor, as is commonly used in deep neural networks.

Lemma 1 We are given a dataset {x_i, y_i}_{i=1}^N, where x_i is the input and y_i ∈ Y is the label, and a linear regressor f_θ(·) parameterized by θ. Let z_i denote the corresponding feature. Assume that the label space Y is discretized into bins of maximum width η, and let c_i be the center of the bin to which y_i belongs. Then for any ϵ > 0, there exists η > 0 such that

$$\Big| L_{mse} - \frac{1}{N}\sum_{i=1}^{N} (\theta^T z_i - c_i)^2 \Big| \;\le\; \frac{\eta}{2N} \sum_{i=1}^{N} \big| 2\theta^T z_i - c_i - y_i \big| \;<\; \epsilon. \quad (3)$$

The detailed proof of Lemma 1 is provided in Appendix A. Lemma 1 says that the discretization error from replacing a regression target y_i with c_i can be made arbitrarily small if the bin width η is sufficiently fine. As such, L_mse can be directly approximated by the second term of Eq. 3, i.e., L_mse ≈ (1/N) Σ_{i=1}^N (θ^T z_i − c_i)^2. With this result, it can be proven that minimizing L_mse is a proxy for minimizing H(Z|Y).

Theorem 1 Let z_{c_i} denote the center of the features corresponding to bin center c_i, and let ϕ_i be the angle between θ and z_i − z_{c_i}. Assume that θ is normalized, that (Z_c|Y) ∼ N(z_{c_i}, I), where Z_c is the distribution of the z_{c_i}, and that cos ϕ_i is fixed. Then minimizing L_mse can be seen as a proxy for minimizing H(Z|Y) without increasing H(Z).

Proof Based on Lemma 1, we have

$$L_{mse} = \frac{1}{N}\sum_{i=1}^{N} \big(\theta^T (z_i - z_{c_i})\big)^2 = \frac{1}{N}\sum_{i=1}^{N} \big(\|\theta\|\,\|z_i - z_{c_i}\| \cos\phi_i\big)^2 = \frac{1}{N}\sum_{i=1}^{N} \|\theta\|^2 \,\|z_i - z_{c_i}\|^2 \cos^2\phi_i \;\propto\; \frac{1}{N}\sum_{i=1}^{N} \|z_i - z_{c_i}\|^2. \quad (4)$$

Note that z_{c_i} exists unless θ = 0 while c_i ≠ 0.
Since it is assumed that Z_c|Y ∼ N(z_{c_i}, I), the term (1/N) Σ_{i=1}^N ||z_i − z_{c_i}||^2 can be interpreted as a conditional cross entropy between Z and Z_c, as it satisfies

$$H(Z; Z_c|Y) = -\mathbb{E}_{z \sim Z|Y}\big[\log p_{Z_c|Y}(z)\big] \;\overset{mc}{\approx}\; -\frac{1}{N}\sum_{i=1}^{N} \log\Big(e^{-\frac{1}{2}\|z_i - z_{c_i}\|^2}\Big) + \text{const} \;\overset{c}{=}\; \frac{1}{N}\sum_{i=1}^{N} \|z_i - z_{c_i}\|^2, \quad (5)$$

where =^c denotes equality up to a multiplicative and an additive constant, and ≈^mc denotes Monte Carlo sampling from the Z|Y distribution, allowing us to replace the expectation by the mean over samples. Subsequently, we can show that

$$L_{mse} \;\propto\; \frac{1}{N}\sum_{i=1}^{N} \|z_i - z_{c_i}\|^2 \;\overset{c}{=}\; H(Z; Z_c|Y) = H(Z|Y) + D_{KL}(Z \,\|\, Z_c \,|\, Y). \quad (6)$$

The result in Eq. 6 shows that (1/N) Σ_{i=1}^N ||z_i − z_{c_i}||^2 is an upper bound of the tightness term in the mutual information, i.e., (1/N) Σ_{i=1}^N ||z_i − z_{c_i}||^2 ≥ H(Z|Y). If (Z|Y) ∼ N(z_{c_i}, I), then D_KL(Z||Z_c|Y) equals 0 and the bound is tight. Hence, minimizing L_mse is a proxy for minimizing H(Z|Y). Apart from H(Z|Y), the relation in Eq. 6 also contains the KL divergence between the two conditional distributions P(Z|Y) and P(Z_c|Y), where the z_{c_i} are the feature centers of Z. Minimizing this divergence will either force Z closer to the centers Z_c, or move the centers Z_c around. By definition, however, the cluster centers z_{c_i} cannot expand beyond Z's coverage, so the features Z must shrink to minimize the divergence. As such, this term does not increase the entropy H(Z). □

Based on Eq. 2 and Theorem 1, we conclude that regression with an MSE loss overlooks the marginal entropy H(Z) and results in a less diverse feature space than classification with a cross-entropy loss. It is worth mentioning that the Gaussian assumption, i.e., Z_c|Y ∼ N(z_{c_i}, I), is standard in the literature when analyzing features (Yang et al., 2021a; Salakhutdinov et al., 2012) and entropy (Misra et al., 2005), and that cos ϕ_i is a constant value at each iteration.
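To make Lemma 1 concrete, the discretization error can be checked numerically: replacing each target y_i with its bin center c_i changes the MSE by an amount that vanishes as the bin width η shrinks. This is an illustrative sketch under our own synthetic setup (uniform targets, Gaussian prediction noise), not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=10_000)      # continuous targets y_i
pred = y + 0.1 * rng.normal(size=10_000)    # stand-in for the predictions theta^T z_i

def mse(a, b):
    return np.mean((a - b) ** 2)

def binned_mse(pred, y, eta):
    """MSE against bin centers c_i, for uniform bins of width eta (Lemma 1)."""
    c = (np.floor(y / eta) + 0.5) * eta      # center of the bin containing y_i
    return mse(pred, c)

# |L_mse - (1/N) sum_i (theta^T z_i - c_i)^2| shrinks with the bin width eta.
gaps = [abs(mse(pred, y) - binned_mse(pred, y, eta)) for eta in (0.5, 0.1, 0.01)]
print(gaps[0] > gaps[1] > gaps[2])  # True
```

The gap decays roughly like η²/12 here, consistent with the bound in Eq. 3 becoming arbitrarily tight for fine bins.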

4. ORDINAL ENTROPY

Our theoretical analysis in Sec. 3 shows that learning with only the MSE loss does not increase the marginal entropy H(Z) and results in lower feature diversity. To remedy this, we propose a novel regularizer that encourages a higher-entropy feature space. Using the distance-based entropy estimate from Eq. 1, one can minimize the negative distances between feature centers z_{c_i} to maximize the entropy of the feature space. The z_{c_i} are calculated by taking the mean over all features z that map to the same y_i. Note that because feature spaces are unbounded, the features z must first be normalized; below, we assume all features z have already been L2-normalized:

$$L'_d = -\frac{1}{M(M-1)} \sum_{i=1}^{M} \sum_{j \neq i} \|z_{c_i} - z_{c_j}\|_2 \;\propto\; -H(Z), \quad (7)$$

where M is the number of feature centers in a batch of samples, or in a subset sampled from a batch. We consider each feature to be its own feature center when the continuous labels of the dataset are sufficiently precise. While the regularizer L'_d indeed spreads the features to a larger extent, it also breaks ordinality in the feature space (see Fig. 3(b)). As such, we opt to weight the feature distances in L'_d with w_ij, where the w_ij are the distances in the label space Y:

$$L_d = -\frac{1}{M(M-1)} \sum_{i=1}^{M} \sum_{j \neq i} w_{ij}\, \|z_{c_i} - z_{c_j}\|_2, \quad \text{where } w_{ij} = \|y_i - y_j\|_2. \quad (8)$$

As shown in Fig. 3(c), L_d spreads the features while also preserving ordinality. Note that L'_d is a special case of L_d in which all w_ij are equal. To further minimize the conditional entropy H(Z|Y), we introduce an additional tightness term that directly considers the distance between each feature z_i and its center z_{c_i} in the feature space:

$$L_t = \frac{1}{N_b} \sum_{i=1}^{N_b} \|z_i - z_{c_i}\|^2, \quad (9)$$

where N_b is the batch size. Adding this tightness term further encourages features to stay close to their centers (compare Fig. 3(c) with Fig. 3(d)). Compared with features from standard regression (Fig. 3(a)), the features in Fig.
3(d) are more spread, i.e., the lines formed by the features are longer. We define the ordinal entropy regularizer as L_oe = L_d + L_t, with a diversity term L_d and a tightness term L_t. L_oe achieves a similar effect as classification in that it spreads the z_{c_i} while tightening the features z_i around their corresponding z_{c_i}. Note that if the continuous labels of the dataset are precise enough and each feature is its own center, then our ordinal entropy regularizer contains only the diversity term, i.e., L_oe = L_d. We show regression with ordinal entropy (red dotted arrow) in Fig. 2(a). The final loss function L_total is defined as

$$L_{total} = L_m + \lambda_d L_d + \lambda_t L_t,$$

where L_m is the task-specific regression loss and λ_d and λ_t are trade-off parameters.
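The two regularizer terms can be sketched in a few lines of PyTorch. This is a plausible reading of Eqs. 7-9 rather than the official implementation (see the repository linked in the abstract for that); in particular, using plain L2 distances for L_d follows the ablation discussion in Sec. 5.4, while the squared distance in L_t follows the derivation in Sec. 3.3:

```python
import torch
import torch.nn.functional as F

def diversity_loss(zc, yc):
    """L_d (Eq. 8): push L2-normalized feature centers apart, with label-space
    distances as weights w_ij to preserve ordinality."""
    zc = F.normalize(zc, dim=1)                        # features must be normalized first
    m = zc.shape[0]
    fdist = torch.cdist(zc, zc)                        # ||z_ci - z_cj||_2
    w = torch.cdist(yc.view(-1, 1), yc.view(-1, 1))    # w_ij = ||y_i - y_j||_2
    off = ~torch.eye(m, dtype=torch.bool, device=zc.device)
    return -(w[off] * fdist[off]).sum() / (m * (m - 1))

def tightness_loss(z, z_centers):
    """L_t (Eq. 9): pull each feature z_i toward its center z_ci."""
    return (z - z_centers).pow(2).sum(dim=1).mean()

# L_total = L_m + lambda_d * diversity_loss(zc, yc) + lambda_t * tightness_loss(z, z_centers)
```

Spreading the centers makes L_d more negative, so minimizing it maximizes the label-weighted pairwise distances; both terms are differentiable and can simply be added to the task loss.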

5. EXPERIMENTS

5.1. DATASETS, METRICS & BASELINE ARCHITECTURES

We conduct experiments on four tasks: operator learning on a synthetic dataset, and the three real-world regression settings of depth estimation, crowd counting and age estimation.

For operator learning, we follow the task from DeepONet (Lu et al., 2021) and use a two-layer fully connected neural network with 100 hidden units. See Sec. 5.2 for details on data preparation. For depth estimation, NYU-Depth-v2 (Silberman et al., 2012) provides indoor images with corresponding depth maps at a pixel resolution of 640 × 480. We follow Lee et al. (2019) and use ResNet-50 (He et al., 2016) as our baseline architecture unless otherwise indicated. We use the train/test split used by previous works (Bhat et al., 2021; Yuan et al., 2022), and evaluate with the standard metrics of threshold accuracy δ1, average relative error (REL), root mean squared error (RMS) and average log10 error. For crowd counting, we evaluate on the SHA and SHB splits of the ShanghaiTech crowd counting dataset (SHTech) (Zhang et al., 2015). Like previous works, we adopt density maps as labels and evaluate with mean absolute error (MAE) and mean squared error (MSE). We follow Li et al. (2018) and use CSRNet with ResNet-50 as the regression baseline architecture. For age estimation, we use AgeDB-DIR (Yang et al., 2021b) and also implement their regression baseline model, which uses ResNet-50 as a backbone. Following Liu et al. (2019b), we report results on three disjoint subsets (i.e., Many, Med. and Few), as well as overall performance (i.e., ALL). We evaluate with MAE and geometric mean (GM).

Other Implementation Details: We follow the settings of previous works: DeepONet (Lu et al., 2021) for operator learning, AdaBins (Bhat et al., 2021) for depth estimation, CSRNet (Li et al., 2018) for crowd counting, and Yang et al. (2021b) for age estimation. See Appendix D for details. λ_d and λ_t are set empirically based on the scale of the task loss L_m.
We set the trade-off parameters λ_d and λ_t to the same value per task: 0.001, 1, 10 and 1 for operator learning, depth estimation, crowd counting and age estimation, respectively.

5.2. LEARNING LINEAR AND NONLINEAR OPERATORS

We first verify our method on the synthetic task of operator learning. In this task, an (unknown) operator maps input functions to output functions, and the objective is to regress the output value. We follow Lu et al. (2021) and generate data for both a linear and a nonlinear operator.

For the linear operator, we aim to learn the integral operation G:

$$G: u(x) \mapsto s(x) = \int_0^x u(\tau)\, d\tau, \quad x \in [0, 1],$$

where u is the input function and s is the target function. The data is generated from a mean-zero Gaussian random field function space: u ∼ GP(0, k_l(x_1, x_2)), where the covariance kernel k_l(x_1, x_2) = exp(−||x_1 − x_2||^2 / 2l^2) is the radial-basis function kernel with length-scale parameter l = 0.2. The function u is represented by its function values at m = 100 fixed locations {x_1, x_2, ..., x_m}. The data is generated as ([u, y], G(u)(y)), where y is sampled from the domain of G(u). We randomly sample 1k data points as the training set and test on a test set of 100k samples.

For the nonlinear operator, we aim to learn the following stochastic partial differential equation, which maps b(x; ω) of different correlation lengths l ∈ [1, 2] to a solution u(x; ω):

$$\mathrm{div}\big(e^{b(x;\omega)} \nabla u(x;\omega)\big) = f(x), \quad x \in (0, 1),$$

where ω is drawn from the random space, with Dirichlet boundary conditions u(0) = u(1) = 0 and f(x) = 10. The randomness comes from the diffusion coefficient e^{b(x;ω)}. The function b(x; ω) ∼ GP(b_0(x), cov(x_1, x_2)) is modelled as a Gaussian random process with mean b_0(x) = 0 and cov(x_1, x_2) = σ^2 exp(−||x_1 − x_2||^2 / 2l^2). We randomly sample 1k training samples and 10k test samples.

For operator learning, we set L_mse as the task-specific baseline loss for both the linear and the nonlinear operator. Table 1 shows that even without ordinal information, adding the diversity term L'_d to L_mse already improves performance.
The best gains, however, are achieved by incorporating the weighting with L_d, which decreases the MSE by 46.7% for the linear operator and by up to 80% for the more challenging nonlinear operator. The corresponding standard deviations are also reduced significantly. Note that we do not evaluate L_t on operator learning: due to the high precision of the synthetic targets, it is difficult to sample points belonging to the same z_{c_i}. Adding L_t, however, is beneficial for the three real-world tasks (see Sec. 5.3).
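Generating the linear-operator data described in this subsection can be sketched as follows. This is a NumPy sketch under our reading of the setup; the jitter term and the trapezoidal integration are our own implementation choices:

```python
import numpy as np

def sample_grf(m=100, l=0.2, seed=0):
    """Draw u ~ GP(0, k_l) at m fixed locations in [0, 1], where
    k_l(x1, x2) = exp(-||x1 - x2||^2 / (2 l^2)) is the RBF kernel."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, m)
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * l**2))
    u = rng.multivariate_normal(np.zeros(m), K + 1e-10 * np.eye(m))  # jitter for stability
    return x, u

# Targets for the linear operator G: s(x) = integral_0^x u(tau) d tau,
# approximated with a cumulative trapezoidal rule on the m grid points.
x, u = sample_grf()
s = np.concatenate([[0.0], np.cumsum(0.5 * (u[1:] + u[:-1]) * np.diff(x))])
```

A training pair then couples the sampled function values [u, y] with the target G(u)(y), i.e., the value of s at a query location y.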

5.3. REAL-WORLD TASKS: DEPTH ESTIMATION, CROWD COUNTING & AGE ESTIMATION

Depth Estimation: Table 2 shows that adding the ordinal entropy terms boosts the performance of the regression baseline and of the state-of-the-art regression method NeW-CRFs (Yuan et al., 2022). NeW-CRFs with ordinal entropy achieves the best values on all metrics, improving δ1 and REL by 12.8% and 6.3%, respectively. Moreover, even larger improvements can be observed when adding ordinal entropy to a simpler baseline, i.e., ResNet-50.

Crowd Counting: Table 3 shows that L_d and L_t each contribute to improving the baseline. Adding both terms has the largest impact; for SHB, the improvement is up to 14.2% on MAE and 9.4% on MSE.

Age Estimation: Table 4 shows that with L_d we achieve a significant 0.13 and 0.29 overall improvement (i.e., ALL) on MAE and GM, respectively. Applying L_t yields a further overall improvement over L_d alone: 0.14 on MAE and 0.04 on GM.

5.4. ABLATION STUDIES

Ablation results on both operator learning and depth estimation are shown in Table 1 and Figure 4.

Ordinal Relationships: Table 1 shows that using the unweighted diversity term 'Baseline+L'_d', which ignores ordinal relationships, is worse than the weighted version 'Baseline+L_d' for both operator learning and depth estimation.

Feature Normalization: As expected, normalization is important, as performance decreases without it (compare 'w/o normalization' to 'Baseline+L_d' in Table 1) for both operator learning and depth estimation. Most interestingly, normalization also helps to lower the variance for operator learning.

Feature Distance ||z_{c_i} − z_{c_j}||: We replace the original L2 feature distance with the cosine distance (see 'w/ cosine distance'); the cosine distance is slightly worse than L2 in all cases.

Weighting Function w_ij: The weight as defined in Eq. 8 is based on an L2 distance. Table 1 shows that L2 is best for linear operator learning and depth estimation, but slightly worse than w_ij = ||y_i − y_j||^2_2 for nonlinear operator learning.

Sample Size (M): In practice, we estimate the entropy from a limited number of regressed samples, determined by the batch size. For certain tasks, this number may be sufficiently large, e.g., depth estimation (number of pixels per image × batch size), or very small, e.g., age estimation (batch size). We investigate the influence of M from Eq. 8 on linear operator learning (see Fig. 4(a)). In the most extreme case, M = 2, the performance is already slightly better than the baseline model (2.7 × 10^-3 vs. 3.0 × 10^-3), suggesting that our ordinal regularizer terms are effective even with 2 samples. As M increases, the MSE and its variance steadily decrease, as the estimated entropy likely becomes more accurate. At a certain point, however, the MSE and variance start to increase again.
This behavior is not surprising; with too many samples under consideration, it likely becomes too difficult to increase the distance between one pair of points without decreasing the distances to other points, i.e., there is not sufficient room to maneuver. The results for nonlinear operator learning are given in Appendix C.

Hyperparameters λ_d and λ_t: Fig. 4(b) plots the MSE for linear operator learning versus the trade-off hyperparameter λ_d applied to the diversity term L_d. Performance remains relatively stable up to 10^-2, after which this term likely overtakes the original learning objective L_mse and causes the MSE to increase. The results for the nonlinear operator and the analysis of λ_t are given in Appendices C and E.

Marginal Entropy H(Z): We plot the marginal entropy of the test set for the different methods during training (see Fig. 4(c)). The marginal entropy of classification is always larger than that of regression, which shows a downward trend. Regression with only the diversity term achieves the largest marginal entropy, which verifies the effectiveness of our diversity term. With both the diversity and tightness terms, the marginal entropy continues to increase as training progresses, becoming larger than that of regression after the 13th epoch. More experimental results can be found in Appendix F.

6. CONCLUSION

In this paper, we dive deeper into the practice of solving regression-type problems as classification tasks by comparing regression and classification from a mutual information perspective. We conduct a theoretical analysis and show that regression with an MSE loss lags in its ability to learn high-entropy feature representations. Based on these findings, we propose an ordinal entropy regularizer for regression, which not only maintains an ordinal relationship in the feature space, as regression does, but also learns a high-entropy feature representation, as classification does. Experiments on different regression tasks demonstrate that our entropy regularizer can serve as a plug-in component for regression-based methods to further improve performance.

D EVALUATION METRICS

Here we introduce the definition of the evaluation metrics for depth estimation, crowd counting, and age estimation. 



A t-SNE visualization of the features (see Fig. 1(b) and 1(c)) confirms that features learned by classification have more spread than features learned by regression. More visualizations are shown in Appendix B.

Figure 1: Feature learning of regression versus classification for depth estimation. Regression keeps features close together and forms an ordinal relationship, while classification spreads the features (compare (b) vs. (c) ), leading to a higher entropy feature space. Features are colored based on their predicted depth. Detailed experimental settings are given in Appendix B.

Figure 2: Illustration of (a) regression and classification for continuous targets, and the use of our ordinal entropy for regression, (b) the pull and push objective of tightness and diversity on the feature space. The tightness part encourages features to be close to their feature centers while the diversity part encourages feature centers to be far away from each other.

Fig. 2(a)  visualizes the symmetry of the two formulations.


Figure 3: t-SNE visualization of features from the depth estimation task. (b) Simply spreading the features (L'_d) leads to a higher-entropy feature space, but the ordinal relationship is lost. (c) By further exploiting the ordinal relationship in the label space (L_d), the features are spread and the ordinal relationship is also preserved. (d) Adding the tightness term (L_t) further encourages features to stay close to their centers.

Figure 4: Based on the linear operator learning and depth estimation, we show (a) the effect of the number of samples on MSE, (b) the performance analysis with different λ d and (c) the entropy curves of different methods during testing. The results for the nonlinear operator learning are given in Appendix C.

Figure A: Visualization results with different entropy estimators on training and testing set.

Figure B: Based on the nonlinear operator learning problem, we show (a) the effect of the number of samples on MSE loss, (b) the performance analysis with different λ d .

Table 1: Ablation studies on linear and nonlinear operator learning with synthetic data, and depth estimation on NYU-Depth-v2. For operator learning, we report results as mean ± standard deviation over 10 runs. Bold numbers indicate the best performance.

Table 2: Quantitative comparison of depth estimation results on NYU-Depth-v2. Bold numbers indicate the best performance.

Table 3: Results on SHTech. Bold numbers indicate the best performance.

Table 4: Results on AgeDB-DIR. Bold numbers indicate the best performance.

Table 7: Quantitative comparison of the time and memory consumption for depth estimation on NYU-Depth-v2. The training time is the time for one training epoch.


Acknowledgement. This research / project is supported by the Ministry of Education, Singapore, under its MOE Academic Research Fund Tier 2 (STEM RIE2025 MOE-T2EP20220-0015).

Code Availability

https://github.com/needylove/OrdinalEntropy 

Appendix

A PROOF OF LEMMA 1

Since |c_i − y_i| ≤ η/2, we have:

B EXPERIMENTAL SETTINGS AND VISUALIZATIONS

Experimental setting. We train the regression and classification models on the NYU-Depth-v2 dataset for depth estimation. We modify the last layer of a ResNet-50 model into a convolution with kernel size 1 × 1, and train the modified model with L_mse as our regression model. For the classification models, we modify the last layer of two ResNet-50 models to output N_c channels, where N_c is the number of classes, and train the modified models with cross-entropy. The classes are defined by uniformly discretizing the ground-truth depths into N_c bins. The entropy of the feature space is estimated using Eq. 1 on pixel-wise features over the training and test sets of NYU-Depth-v2. After training, we visualize the pixel-wise features of an image from the test set using t-distributed stochastic neighbor embedding (t-SNE); features are colored based on their predicted depth.

The visualization results are shown in Figure A. We exploit three entropy estimators to estimate the entropy of the feature space H(Z). Entropy in the first row of Figure A is estimated with the meanNN entropy estimator (Eq. 1). Entropy in the second row is also estimated with the meanNN entropy estimator, where the input features are normalized with the L2 norm. Entropy in the third row is estimated with the diversity part of our ordinal entropy (Eq. 8). We make several interesting observations from the visualization results.

Depth Estimation. We denote the predicted depth at position p as y_p and the corresponding ground-truth depth as y'_p; the total number of pixels is n. The metrics are: 1) threshold accuracy δ1 ≜ % of y_p s.t. max(y_p/y'_p, y'_p/y_p) < 1.25; 2) average relative error (REL); 3) root mean squared error (RMS); and 4) average log10 error.

Crowd Counting. Given N images for testing, y_i and y'_i are the estimated count and the ground truth for the i-th image, respectively. We exploit two widely used metrics as measurements: 1) mean absolute error (MAE) and 2) mean squared error (MSE).

Age Estimation. Given N images for testing, y_i and y'_i are the i-th prediction and ground-truth, respectively.
The evaluation metrics include 1) MAE and 2) GM.

E ANALYSIS OF λ_t

We analyze the effect of λ_t with an ablation study on age estimation with AgeDB-DIR. Table 5 shows that the final performance is not sensitive to changes in λ_t, and that L_t is effective even with a small λ_t, i.e., 0.1. Efficiency-wise, the computational complexity of the regularizer is quadratic in M. The synthetic experiments on operator learning (Table 6) use a 2-layer MLP, so the regularizer adds significant computing time when M gets large. The real-world experiments on depth estimation (Table 7), however, use a ResNet-50 backbone, and the added time and memory are modest (27% and 0.3%, respectively), even with M = 3536. We use M = 3536 in our depth estimation experiments, where 3536 corresponds to a 16× subsampling of the total number of pixels in an image. Note that these increases occur only during training and add no computational cost at inference. In addition, the added time and memory for L_t are negligible (0.08% and 0%, respectively), even with M = 3536.
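The depth-estimation metrics defined in this appendix can be sketched as follows (a NumPy sketch; the 1.25 threshold for δ1 is the standard convention, and the helper name is our own):

```python
import numpy as np

def depth_metrics(pred, gt):
    """delta_1, REL, RMS and mean log10 error over n valid pixels (pred, gt > 0)."""
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                          # threshold accuracy
    rel = np.mean(np.abs(pred - gt) / gt)                   # average relative error
    rms = np.sqrt(np.mean((pred - gt) ** 2))                # root mean squared error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))  # average log10 error
    return delta1, rel, rms, log10
```

Crowd counting's MAE/MSE and age estimation's MAE follow the same pattern, with per-image counts or ages in place of per-pixel depths.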

