IMPROVING DEEP REGRESSION WITH ORDINAL ENTROPY

Abstract

In computer vision, it is often observed that formulating regression problems as classification tasks yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on this analysis, we propose an ordinal entropy regularizer that encourages higher-entropy feature spaces while maintaining ordinal relationships, thereby improving the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression. Code is available at https://github.com/needylove/OrdinalEntropy.

1. INTRODUCTION

Classification and regression are two fundamental tasks of machine learning. The choice between the two usually depends on the categorical or continuous nature of the target output. Curiously, in computer vision, specifically with deep learning, it is often preferable to solve regression-type problems as classification tasks. A simple and common way is to discretize the continuous labels; each bin is then treated as a class. Converting regression into classification discards the ordinal information of the target space and introduces discretization errors into the targets. Yet for a diverse set of regression problems, including depth estimation (Cao et al., 2017), age estimation (Rothe et al., 2015), crowd counting (Liu et al., 2019a) and keypoint detection (Li et al., 2022), classification yields better performance.

The phenomenon of classification outperforming regression on inherently continuous estimation tasks naturally begs the question of why. Previous works have not investigated the cause, although they hint at task-specific reasons. For depth estimation, both Cao et al. (2017) and Fu et al. (2018) postulate that it is easier to estimate a quantized range of depth values than one precise depth value. For crowd counting, regression suffers from inaccurately generated target values (Xiong & Yao, 2022); discretization helps alleviate some of the imprecision. For pose estimation, classification allows for denser and more effective heatmap-based supervision (Zhang et al., 2020; Gu et al., 2021; 2022).

Could the performance advantages of classification run deeper than task-specific nuances? In this work, we posit that regression lags in its ability to learn high-entropy feature representations. We arrive at this conclusion by analyzing the differences between classification and regression from a mutual information perspective.
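The discretize-then-classify recipe described above can be sketched in a few lines. This is a generic illustration only; the bin count, target range and helper names are our own choices, not taken from any of the cited methods:

```python
import numpy as np

# Generic sketch of label discretization for regression-as-classification.
# K, the target range and the helper names are illustrative assumptions.
K = 10
lo, hi = 0.0, 10.0                       # assumed continuous target range
edges = np.linspace(lo, hi, K + 1)       # K equal-width bins

def to_class(y):
    """Map continuous targets to bin indices (the 'classes')."""
    return np.clip(np.digitize(y, edges) - 1, 0, K - 1)

def to_value(c):
    """Map classes back to representative values (bin centres)."""
    centres = (edges[:-1] + edges[1:]) / 2
    return centres[c]

y = np.array([0.3, 4.9, 5.1, 9.99])
c = to_class(y)                          # bin indices [0, 4, 5, 9]
y_hat = to_value(c)                      # bin centres [0.5, 4.5, 5.5, 9.5]

# Discretization error is bounded by half a bin width...
assert np.all(np.abs(y_hat - y) <= (hi - lo) / (2 * K) + 1e-9)
# ...and ordinality is invisible to cross-entropy: predicting bin 5
# instead of bin 4 is penalized the same as predicting bin 9.
```

Note the two costs named in the text are visible here: recovered values are off by up to half a bin width, and the class indices carry no notion of being "close" to one another.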
According to Shwartz-Ziv & Tishby (2017), deep neural networks during learning aim to maximize the mutual information between the learned representation Z and the target Y. The mutual information between the two can be defined as I(Z; Y) = H(Z) - H(Z|Y). I(Z; Y) is large when the marginal entropy H(Z) is high, i.e., the features Z are as spread out as possible, and the conditional entropy H(Z|Y) is low, i.e., features of common targets are as close together as possible. Classification accomplishes both objectives (Boudiaf et al., 2020). This work, as a key contribution, shows through derivation that regression minimizes H(Z|Y) but ignores H(Z). Accordingly, the learned representations Z from regression have a lower marginal entropy (see Fig. 1(a)).

The difference in entropy between classification and regression stems from the different losses. We postulate that the lower-entropy features learned by L2 losses in regression explain the performance gap compared to classification. Despite its overall performance advantages, classification lags in the ability to capture ordinal relationships. As such, simply spreading the features for regression to emulate classification will break the inherent ordinality of the regression target output. To retain the benefits of both high entropy and ordinality for feature learning, we propose, as a second contribution, an ordinal entropy regularizer for regression. Specifically, we capture ordinal relationships as a weighting based on the distances between samples in both the representation and the target space. Our ordinal entropy regularizer increases the distances between representations, while weighting the distances to preserve the ordinal relationships. Experiments on various regression tasks demonstrate the effectiveness of our proposed method. Our main contributions are three-fold:

• To our best knowledge, we are the first to analyze regression's reformulation as a classification problem, especially from a representation learning point of view. We find that regression lags in its ability to learn high-entropy features, which in turn leads to lower mutual information between the learned representation and the target output.

• Based on our theoretical analysis, we design an ordinal entropy regularizer to learn high-entropy feature representations that preserve ordinality.

• Benefiting from our ordinal entropy loss, our method achieves significant improvements on synthetic datasets for solving ODEs and stochastic PDEs, as well as on real-world regression tasks including depth estimation, crowd counting and age estimation.

This work, instead of focusing on task-specific designs, explores the difference between classification and regression from a representation learning point of view. By analyzing mutual information, we reveal a previously underestimated impact of high-entropy feature spaces.

Ordinal Classification. Ordinal classification aims to predict ordinal target outputs. Many works exploit the distances between labels (Castagnos et al., 2022; Polat et al., 2022; Gong et al., 2022) to



A t-SNE visualization of the features (see Fig. 1(b) and 1(c)) confirms that features learned by classification are more spread out than features learned by regression. More visualizations are shown in Appendix B.
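The identity I(Z; Y) = H(Z) - H(Z|Y) underlying this observation can be checked numerically on toy discrete data. The following is our own illustration, not one of the paper's experiments: with H(Z|Y) held at zero, the "spread" feature assignment has a higher marginal entropy H(Z) and hence higher mutual information than the "collapsed" one.

```python
import numpy as np
from collections import Counter

def H(xs):
    """Empirical Shannon entropy (bits) of a discrete sample."""
    n = len(xs)
    return -sum(c / n * np.log2(c / n) for c in Counter(xs).values())

def I(zs, ys):
    """I(Z; Y) = H(Z) - H(Z|Y), with H(Z|Y) = sum_y p(y) H(Z | Y=y)."""
    h_cond = 0.0
    for y in set(ys):
        idx = [i for i, yy in enumerate(ys) if yy == y]
        h_cond += len(idx) / len(ys) * H([zs[i] for i in idx])
    return H(zs) - h_cond

ys = [0, 0, 1, 1, 2, 2]                # three targets, two samples each
z_collapsed = [0, 0, 0, 0, 1, 1]       # low-entropy features
z_spread    = [0, 0, 1, 1, 2, 2]       # one feature value per target

# Both assignments keep same-target features together (low H(Z|Y) for
# z_spread, zero in fact), but only the spread one maximizes H(Z).
assert I(z_spread, ys) > I(z_collapsed, ys)
```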

Figure 1: Feature learning of regression versus classification for depth estimation. Regression keeps features close together and forms an ordinal relationship, while classification spreads the features (compare (b) vs. (c)), leading to a higher-entropy feature space. Features are colored based on their predicted depth. Detailed experimental settings are given in Appendix B.
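A minimal sketch of the idea behind the ordinal entropy regularizer follows. The exact weighting scheme, constants and function name below are our simplified reading of the description in the introduction, not the paper's released implementation: pairwise feature distances are increased (raising the marginal entropy of Z), but each pair is weighted by its distance in target space, so samples with similar targets are pushed apart less and the ordering of the targets is respected.

```python
import numpy as np

def ordinal_entropy_reg(Z, y, eps=1e-8):
    """Hypothetical sketch of an ordinal entropy regularizer.
    Z: (N, D) feature matrix; y: (N,) continuous targets.
    Returns a scalar to *minimize* (negative weighted feature spread)."""
    dz = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)  # (N, N)
    dy = np.abs(y[:, None] - y[None, :])                          # (N, N)
    w = dy / (dy.max() + eps)     # larger target gap -> larger weight
    mask = ~np.eye(len(y), dtype=bool)
    # Minimizing the negative weighted mean pairwise distance spreads
    # the features, proportionally to how far apart their targets are.
    return -(w * dz)[mask].mean()

y = np.array([0.0, 1.0, 2.0])
Z_collapsed = np.zeros((3, 2))                                # degenerate
Z_ordinal = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])    # spread, ordered

# Spreading features in a target-consistent way lowers the loss:
assert ordinal_entropy_reg(Z_ordinal, y) < ordinal_entropy_reg(Z_collapsed, y)
```

In training, a term like this would be added to the usual regression loss with a balancing weight; the design choice is that the weighting, not a hard constraint, is what preserves ordinality while entropy is increased.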

Targets. Several works formulate regression problems as classification tasks to improve performance. They focus on different design aspects such as label discretization and uncertainty modeling. To discretize the labels, Cao et al. (2017); Fu et al. (2018) and Liu et al. (2019a) convert the continuous values into discrete intervals with a pre-defined interval width. To improve class flexibility, Bhat et al. (2021) followed up with an adaptive bin-width estimator. Due to inaccurate or imprecise regression targets, several works have explored modeling the uncertainty of labels with classification. Liu et al. (2019a) proposed estimating targets that fall within a certain interval with high confidence. Tompson et al. (2014) and Newell et al. (2016) propose modeling the uncertainty by using a heatmap target in which each pixel represents the probability of that pixel being the target class.

Code availability

https://github.com/needylove/OrdinalEntropy 

