IMPROVING DEEP REGRESSION WITH ORDINAL ENTROPY

Abstract

In computer vision, it is often observed that formulating regression problems as classification tasks yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on this analysis, we propose an ordinal entropy regularizer that encourages higher-entropy feature spaces while maintaining ordinal relationships, thereby improving the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression. Code is available at https://github.com/needylove/OrdinalEntropy.

1. INTRODUCTION

Classification and regression are two fundamental tasks of machine learning. The choice between the two usually depends on whether the target output is categorical or continuous. Curiously, in computer vision, specifically with deep learning, it is often preferable to solve regression-type problems as classification tasks. A simple and common approach is to discretize the continuous labels; each bin is then treated as a class. Converting regression into classification discards the ordinal information of the target space and introduces discretization errors into the targets. Yet for a diverse set of regression problems, including depth estimation (Cao et al., 2017), age estimation (Rothe et al., 2015), crowd counting (Liu et al., 2019a) and keypoint detection (Li et al., 2022), classification yields better performance.

The phenomenon of classification outperforming regression on inherently continuous estimation tasks naturally begs the question of why. Previous works have not investigated the cause, although they hint at task-specific reasons. For depth estimation, both Cao et al. (2017) and Fu et al. (2018) postulate that it is easier to estimate a quantized range of depth values than one precise depth value. For crowd counting, regression suffers from inaccurately generated target values (Xiong & Yao, 2022); discretization helps alleviate some of this imprecision. For pose estimation, classification allows for denser and more effective heatmap-based supervision (Zhang et al., 2020; Gu et al., 2021; 2022).

Could the performance advantages of classification run deeper than task-specific nuances? In this work, we posit that regression lags in its ability to learn high-entropy feature representations. We arrive at this conclusion by analyzing the differences between classification and regression from a mutual information perspective.
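As an illustration of the discretization step described above, the sketch below converts continuous targets into class indices via uniform binning; the helper `discretize_labels`, the bin count, and the uniform binning scheme are our illustrative assumptions, not the setup of any cited work. Note how mapping a prediction back through the bin centers incurs exactly the discretization error mentioned above.

```python
import numpy as np

def discretize_labels(y, y_min, y_max, num_bins):
    """Map continuous targets y in [y_min, y_max] to integer class indices.

    Returns the class index per sample and the bin centers, which serve as
    the continuous value recovered from a predicted class at test time.
    """
    edges = np.linspace(y_min, y_max, num_bins + 1)
    # np.digitize against the interior edges yields indices in 0..num_bins-1
    classes = np.clip(np.digitize(y, edges[1:-1]), 0, num_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return classes, centers

y = np.array([0.03, 0.51, 0.97])
classes, centers = discretize_labels(y, 0.0, 1.0, num_bins=4)
# Recovering values via bin centers introduces discretization error,
# e.g. 0.51 falls in the bin centered at 0.625.
recovered = centers[classes]
```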
According to Shwartz-Ziv & Tishby (2017), deep neural networks during learning aim to maximize the mutual information between the learned representation Z and the target Y. The mutual information between the two can be defined as I(Z; Y) = H(Z) - H(Z|Y). I(Z; Y) is large when the marginal entropy H(Z) is high, i.e., the features Z are as spread out as possible, and the conditional entropy H(Z|Y) is low, i.e., features of common targets are as close as possible. Classification accomplishes both objectives (Boudiaf et al., 2020). This work, as a key contribution, shows through derivation that regression minimizes H(Z|Y) but ignores H(Z). Accordingly, the learned representations Z from regression have a lower marginal entropy (see Fig. 1(a)). A t-SNE visualization of the features (see Fig. 1(b) and 1(c)) confirms that features learned by classification have more spread than features learned by regression. More visualizations are shown in Appendix B.

The difference in entropy between classification and regression stems from the different losses. We postulate that the lower-entropy features learned by L2 losses in regression explain the performance advantage of classification.
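To make the entropy argument concrete, the sketch below computes simple distance-based surrogates for the two entropy terms from a feature matrix: mean pairwise feature distance over all sample pairs as a proxy for the marginal entropy H(Z), and mean distance among samples sharing a target as a proxy for the conditional entropy H(Z|Y). The function `entropy_surrogates` and the pairwise-distance proxies are our illustrative assumptions, not the estimator or regularizer used in the paper.

```python
import numpy as np

def entropy_surrogates(Z, y):
    """Distance-based surrogates for marginal and conditional feature entropy.

    Z : (n, d) array of feature vectors; y : (n,) targets.
    Returns (marginal, conditional): mean pairwise feature distance over all
    pairs (higher = more spread, a proxy for H(Z)), and mean feature distance
    among pairs with (nearly) equal targets (a proxy for H(Z|Y)).
    """
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)  # (n, n)
    i, j = np.triu_indices(len(y), k=1)                         # unique pairs
    marginal = D[i, j].mean()
    same = np.isclose(y[i], y[j])                               # shared target
    conditional = D[i, j][same].mean() if same.any() else 0.0
    return marginal, conditional

# Spread features vs. the same features collapsed toward the origin:
y = np.array([0.0, 0.0, 1.0, 1.0])
Z_spread = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
m_spread, c_spread = entropy_surrogates(Z_spread, y)
m_collapsed, c_collapsed = entropy_surrogates(Z_spread * 0.1, y)
```

Under these proxies, a regularizer that pushes the marginal term up while keeping the conditional term low captures the spirit of encouraging high-entropy, well-separated feature spaces for regression.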

Code availability: https://github.com/needylove/OrdinalEntropy

