SUPERVISED CONTRASTIVE REGRESSION WITH SAMPLE RANKING

Abstract

Deep regression models typically learn in an end-to-end fashion and do not explicitly attempt to learn a regression-aware representation. Their representations tend to be fragmented and fail to capture the continuous nature of regression tasks. In this paper, we propose Supervised Contrastive Regression (SupCR), a framework that learns a regression-aware representation by contrasting samples against each other based on their target distance. SupCR is orthogonal to existing regression models, and can be used in combination with such models to improve performance. Extensive experiments using five real-world regression datasets that span computer vision, human-computer interaction, and healthcare show that SupCR achieves state-of-the-art performance and consistently improves prior regression baselines on all datasets, tasks, and input modalities. SupCR also improves robustness to data corruptions, resilience to reduced training data, performance on transfer learning, and generalization to unseen targets.

1. INTRODUCTION

Regression problems are ubiquitous and fundamental in the real world. They include estimating age from human appearance (Rothe et al., 2015), predicting health scores from physiological signals (Engemann et al., 2022), and detecting gaze directions from webcam images (Zhang et al., 2017b). Since regression targets are continuous, the most widely used approach to train a regression model is to have the model predict the target value, and use the distance (e.g., L1 or L2 distance) between the prediction and the ground-truth target as the loss function (Zhang et al., 2017a;b; Schrumpf et al., 2021; Engemann et al., 2022). There are also works that constrain the relationship between predictions and targets by converting the regression task into a classification task and training the model with the cross-entropy loss (Rothe et al., 2015; Niu et al., 2016; Shi et al., 2021). However, all previous methods focus on imposing constraints on the final predictions in an end-to-end fashion, and do not explicitly consider the representations learned by the model. Their representations tend to be fragmented and fail to capture the continuous relationships underlying regression tasks. For example, Figure 1(a) shows the representation learned by the L1 loss for the task of predicting the outdoor temperature from webcam images (Chu et al., 2018), where the images are captured by 44 outdoor webcams at different locations. The representation learned by the L1 model does not exhibit the continuous ground-truth temperatures; rather, it is grouped by the different webcams in a fragmented manner. Such an unordered and fragmented representation is sub-optimal for the regression task and can even hamper performance, since it contains distracting information (e.g., the capturing webcam). While there is a rich literature on representation learning, past methods focus on classification problems.
In particular, contrastive learning (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b) and supervised contrastive learning (SupCon) (Khosla et al., 2020) have proven highly effective for representation learning. However, as shown in Figure 1(b), which plots the representation learned by SupCon for the visual temperature prediction task mentioned above, such methods produce a sub-optimal representation for regression problems because they ignore the continuous order between the samples in a regression task. Besides, several recent works (Wang et al., 2022; Dufumier et al., 2021a;b; Schneider et al., 2022) adopt contrastive learning in the context of continuous labels, but they do not address regression learning tasks. In this paper, we introduce Supervised Contrastive Regression (SupCR), a new framework for deep regression learning, where we first learn a representation that ensures the distances in the embedding space are ordered according to the target values, and then use this representation to predict the targets. To learn such a regression-aware representation, we contrast samples against each other based on their label/target distances. Our method explicitly leverages the ordered relationships between the samples to optimize the representation for the downstream regression task. As shown in Figure 1(c), SupCR leads to a representation that captures the intrinsic ordered relationships between the samples.

Figure 1: Representations learned by the L1 loss, SupCon (Khosla et al., 2020), and the proposed SupCR for the task of predicting the temperature from webcam outdoor images (Chu et al., 2018). The representation of each image is visualized as a dot and the color indicates the ground-truth temperature. SupCR learns a representation that captures the intrinsic ordered relationships between the samples, while L1 and SupCon fail to do so.
Moreover, our framework is orthogonal to existing regression methods, as one can use any regression method to map our learned representation to the prediction values. Extensive experiments using five real-world regression datasets that span computer vision, human-computer interaction, and healthcare verify the superior performance of our framework. The results show that SupCR achieves state-of-the-art performance and consistently improves the performance of previous regression methods, by 8.7% on average, across all datasets, tasks, and input modalities. Furthermore, our experimental results also show that SupCR delivers multiple desirable properties: 1) improved robustness to data corruptions; 2) enhanced resilience to reduced training data; 3) higher performance on transfer learning; and 4) better generalization to unseen targets.

2. RELATED WORK

Regression Learning. Deep learning has achieved great success in addressing regression tasks (Rothe et al., 2015; Zhang et al., 2017a;b; Schrumpf et al., 2021; Engemann et al., 2022). The most straightforward and widely used way to train a regression model is to have the model predict the target value, and use the distance between the prediction and the ground-truth target as the loss function. Common distance metrics used as the loss function in regression tasks are the L1 loss, the MSE loss, and the Huber loss (Huber, 1992). Past work has also proposed variants of these basic methods. One branch of prior work (Rothe et al., 2015; Gao et al., 2017; 2018; Pan et al., 2018) divides the regression range into small bins to convert the problem into a classification problem, and uses a classification loss, e.g., the cross-entropy loss, to train the model. Another branch of prior work (Niu et al., 2016; Fu et al., 2018; Cao et al., 2020; Shi et al., 2021) casts regression as an ordinal classification problem. These methods design multiple ordered thresholds and use multiple binary classifiers, one for each threshold, to learn whether the target is larger than each threshold. Past work, however, focuses on forcing the final predictions of the model to be close to the target, and does so in an end-to-end manner. In contrast, this paper focuses on ordering the samples in the embedding space in accordance with the label order; it does so by first learning a representation that contrasts samples against each other, and then learning a predictor to map the representation to the prediction values. Therefore, our method is orthogonal to prior methods for regression learning, as those prior methods can be applied to training our predictor and learning the final predictions on top of our ordered representation.
Contrastive Learning.
Contrastive learning, which aligns positive pairs and repulses negative pairs in the representation space, has demonstrated improved performance for self-supervised learning (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Chuang et al., 2020). Its supervised version, supervised contrastive learning (SupCon) (Khosla et al., 2020), which defines samples from the same class as positive pairs and samples from different classes as negative pairs, was also shown to outperform the cross-entropy loss on image classification tasks (Khosla et al., 2020). It is also beneficial for learning in challenging settings such as noisy labels (Li et al., 2022a), long-tailed classification (Kang et al., 2020; Li et al., 2022b), and out-of-domain detection (Zeng et al., 2021). A couple of recent papers have used the concept of "contrastive" or contrastive learning in the context of continuous labels. In particular, Yu et al. (2021) learn a model for action quality assessment by regressing the relative scores between two input videos. This work differs from ours in that it does not use the standard contrastive learning framework. Wang et al. (2022) propose to improve domain adaptation for gaze estimation by adding a contrastive loss term to the L1 loss. They show that their approach is beneficial for adapting a gaze estimation model from one dataset (i.e., one domain) to another, but the approach produces no benefits and even reduces performance on the source dataset. In contrast, our approach improves performance on the source dataset, as opposed to being specific to domain adaptation. Besides, Dufumier et al. (2021a;b) use a contrastive loss reweighted by continuous meta-data for classification. Schneider et al. (2022) learn low-dimensional and interpretable embeddings that encode behavioral and neural data, using a generalized contrastive loss that samples positives and negatives according to continuous behavior or time labels.
However, the continuous meta-data in (Dufumier et al., 2021a;b) and the continuous labels in (Schneider et al., 2022) are not the prediction targets of their tasks, and these works do not address regression problems.

3. METHOD

For a regression task, we aim to train a neural network composed of a feature encoder f(·): X → R^{d_e} and a predictor p(·): R^{d_e} → R^{d_t} to predict the target y ∈ R^{d_t} based on the input data x ∈ X. Given an input batch of data, similar to contrastive learning methods (Chen et al., 2020a; Khosla et al., 2020), we first apply data augmentation twice to obtain two views of the batch. The two views are fed into the encoder f(·) to obtain a d_e-dimensional feature embedding for each augmented input. Our supervised contrastive regression loss, L_SupCR, is computed on the feature embeddings. To use the learned representation for regression, we freeze the encoder f(·) and train the predictor on top of it using a regression loss (e.g., the L1 loss).

3.1. SUPERVISED CONTRASTIVE REGRESSION LOSS

We would like our loss function to ensure distances in the embedding space are ordered according to distances in the label space. But how can we design a contrastive loss that delivers this property? Below, we first define our loss function, then explain it in the context of positive and negative contrastive samples, and finally provide a theoretical analysis to prove that as our loss approaches its minimum, the learned representation will be ordered according to label distances.

For a positive integer I, let [I] denote the set {1, 2, ..., I}. Given a batch of N samples, applying data augmentation twice yields 2N augmented inputs; let v_i denote the feature embedding and ỹ_i the label of the i-th augmented input. Our supervised contrastive regression loss is

L_SupCR = -\frac{1}{2N}\sum_{i=1}^{2N}\frac{1}{2N-1}\sum_{j=1,\, j\neq i}^{2N}\log\frac{\exp(\mathrm{sim}(v_i, v_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}[k\neq i,\ d(\tilde{y}_i, \tilde{y}_k)\geq d(\tilde{y}_i, \tilde{y}_j)]\exp(\mathrm{sim}(v_i, v_k)/\tau)},   (1)

where sim(·,·) measures the similarity between two feature embeddings, d(·,·) measures the distance between two labels, τ is a temperature parameter, and 1[·] is the indicator function.

In contrastive learning for classification, positive samples are those that come from the same class or the same input image, and all other samples are negative samples. In regression problems, there are no classes but continuous labels. Any two samples can be thought of as a positive pair or a negative pair, depending on context. To exploit the inherent continuity underlying the labels, we define positive and negative samples in a relative way, based on their label distance to the anchor. In particular, we rank samples with respect to their label distance to the anchor. We then contrast the anchor to the sample in the batch that is closest in the label space, which is treated as the positive sample; all other samples are farther from the anchor and act as negative samples. We also contrast the anchor to the second closest sample in the batch; in this case, only samples in the batch that are farther from the anchor than this positive sample (i.e., samples whose rank is three or higher) act as negative samples. We continue this process for higher-rank samples (the third closest, fourth closest, etc.), and for all anchors in a batch. Figure 2 shows two positive pairs and their corresponding negative pair(s).
Specifically, for any anchor sample i, any other sample j in the same batch can be used to create a positive pair, with the corresponding negative samples set to all samples in the batch whose labels differ from i's label by more than j's label does. In other words, L_SupCR aims to force the feature similarity between i and j to be greater than the feature similarity between i and any other sample k in the batch whenever the label distance between i and k is greater than that between i and j. Therefore, optimizing L_SupCR makes the feature embedding ordered according to the order in the label space. The design of L_SupCR fully leverages the ordered relationships between samples reflected in the continuous label space of regression problems, which is precisely the intrinsic difference between regression and classification problems.
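To make the pairing rule concrete, the following is a minimal NumPy sketch of the loss defined above (our own illustrative implementation, not the authors' reference code), using the negative L2 norm as sim(·,·) and the absolute difference as the label distance:

```python
import numpy as np

def supcr_loss(features, labels, temperature=2.0):
    """Illustrative SupCR loss over a batch of 2N augmented embeddings.

    features: (2N, d) array of feature embeddings v_i.
    labels:   (2N,) array of scalar targets y_i.
    Uses sim(v_i, v_j) = -||v_i - v_j||_2 and d(y_i, y_j) = |y_i - y_j|.
    """
    n = features.shape[0]
    # Pairwise feature similarities s_{i,j} = sim(v_i, v_j) / tau.
    diff = features[:, None, :] - features[None, :, :]
    sim = -np.linalg.norm(diff, axis=-1) / temperature
    # Pairwise label distances d_{i,j}.
    dist = np.abs(labels[:, None] - labels[None, :])

    loss = 0.0
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            # Negatives for the positive pair (i, j): samples k != i whose
            # label distance to the anchor i is at least d_{i,j} (j included).
            mask = (np.arange(n) != i) & (dist[i] >= dist[i, j])
            log_prob = sim[i, j] - np.log(np.exp(sim[i, mask]).sum())
            loss -= log_prob / (n - 1)
    return loss / n

# Toy batch of 4 embeddings whose order matches the label order.
feats = np.array([[0.0], [1.0], [2.0], [3.0]])
ys = np.array([10.0, 11.0, 12.0, 13.0])
print(supcr_loss(feats, ys))
```

On this toy batch, an embedding whose order matches the label order yields a lower loss than a scrambled assignment of the same embeddings, reflecting the ordering objective.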

3.2. THEORETICAL ANALYSIS

In this section, we analytically show that optimizing L_SupCR makes the feature embedding ordered according to the order in the label space. To show this, we first derive a lower bound of L_SupCR, and show that L_SupCR can be arbitrarily close to it. Next, we formalize the concept of δ-ordering, which captures the feature embeddings being ordered according to the order in the label space, and show that as L_SupCR gets sufficiently close to the lower bound, the feature embeddings will be δ-ordered. All proofs are in Appendix A.

Notations. Let s_{i,j} := sim(v_i, v_j)/τ and d_{i,j} := d(ỹ_i, ỹ_j) for all i, j ∈ [2N]. Let D_{i,1} < D_{i,2} < ··· < D_{i,M_i} be the sorted distinct label distances from the i-th sample, i.e., sort({d_{i,j} | j ∈ [2N]\{i}}), for all i ∈ [2N], and let n_{i,m} := |{j | d_{i,j} = D_{i,m}, j ∈ [2N]\{i}}| be the number of samples whose label distance from the i-th sample equals D_{i,m}, for all i ∈ [2N], m ∈ [M_i].

First, we show that L* := \frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i} n_{i,m}\log n_{i,m} is a lower bound of L_SupCR.

Theorem 1 (Lower bound of L_SupCR). L* is a lower bound of L_SupCR, i.e., L_SupCR > L*.

Next, we show that L_SupCR can be arbitrarily close to its lower bound L*.

Theorem 2 (Lower bound tightness). For any ϵ > 0, there exists a set of feature embeddings such that L_SupCR < L* + ϵ.

Then, we define a property called δ-ordered for the feature embeddings {v_l}_{l∈[2N]}.

Definition 1 (δ-ordered feature embeddings). For any 0 < δ < 1, the feature embeddings {v_l}_{l∈[2N]} are δ-ordered if for all i ∈ [2N] and j, k ∈ [2N]\{i},

s_{i,j} > s_{i,k} + \frac{1}{δ}   if d_{i,j} < d_{i,k},
|s_{i,j} - s_{i,k}| < δ          if d_{i,j} = d_{i,k},
s_{i,j} < s_{i,k} - \frac{1}{δ}   if d_{i,j} > d_{i,k}.

In other words, a δ-ordered set of feature embeddings satisfies the following properties: first, for all j and k such that d_{i,j} = d_{i,k}, the gap between s_{i,j} and s_{i,k} is smaller than δ; second, for all j and k such that d_{i,j} < d_{i,k}, s_{i,j} is larger than s_{i,k} by at least 1/δ.
Notice that 1/δ > δ since 0 < δ < 1, which means the feature similarity gap between a sample pair with different label distances to the anchor is always larger than the feature similarity gap between a sample pair with equal label distances to the anchor. Finally, we show that for any 0 < δ < 1, when L_SupCR is close enough to its lower bound L*, the feature embeddings will be δ-ordered.

Theorem 3 (Main theorem). For any 0 < δ < 1, there exists ϵ > 0 such that if L_SupCR < L* + ϵ, then the feature embeddings are δ-ordered.
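Theorems 1 and 2 can be checked numerically: scaling up an embedding whose order matches the label order (so similarity gaps grow while ties stay exact, as in the construction behind Theorem 2) drives L_SupCR toward L* from above. Below is a small self-contained NumPy check; the loss is our own illustrative re-implementation with sim(·,·) the negative L2 norm and d(·,·) the absolute difference:

```python
import numpy as np

def supcr_loss(features, labels, temperature=1.0):
    """L_SupCR with sim(v_i, v_j) = -||v_i - v_j||_2 and d(y_i, y_j) = |y_i - y_j|."""
    n = features.shape[0]
    sim = -np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1) / temperature
    dist = np.abs(labels[:, None] - labels[None, :])
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            mask = (np.arange(n) != i) & (dist[i] >= dist[i, j])
            loss -= (sim[i, j] - np.log(np.exp(sim[i, mask]).sum())) / (n - 1)
    return loss / n

def lower_bound(labels):
    """L* = (1 / (2N(2N-1))) * sum_i sum_m n_{i,m} log n_{i,m}."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        d = np.delete(np.abs(labels - labels[i]), i)  # distances from anchor i
        _, counts = np.unique(d, return_counts=True)  # n_{i,m} per distance level
        total += (counts * np.log(counts)).sum()
    return total / (n * (n - 1))

ys = np.array([0.0, 1.0, 2.0, 3.0])
v = np.array([[0.0], [1.0], [2.0], [3.0]])  # embeddings ordered like the labels
for scale in (1.0, 10.0, 100.0):
    gap = supcr_loss(scale * v, ys) - lower_bound(ys)
    print(f"scale={scale}: L_SupCR - L* = {gap:.8f}")  # positive, shrinking toward 0
```

The gap stays strictly positive (Theorem 1) and shrinks toward zero as the similarity gaps grow (Theorem 2).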

4. EXPERIMENTS

In this section, we evaluate our proposed method empirically. We first benchmark our method and compare it with the state-of-the-art regression baselines. Then, we evaluate the desirable properties of our learned representations, including the robustness to data corruptions, the resilience to reduced training data, the performance on transfer learning, and the generalization to unseen targets. Finally, we analyze and evaluate the design variants of our method. Refer to Appendix E for additional experiments and analysis.

Datasets. We benchmark our method on five regression datasets for common real-world tasks in computer vision, human-computer interaction, and healthcare. Ethics statements of the datasets are included in Appendix B. 1) AgeDB (Moschoglou et al., 2017) and 2) IMDB-WIKI (Rothe et al., 2015) are used for age estimation from face images. 3) MPIIFaceGaze (Zhang et al., 2017b) is used for gaze direction estimation from webcam images. We subsample and split it into a 33,000-image training set, a 6,000-image validation set, and a 6,000-image test set with no overlapping participants. The gaze direction is described as a 2-dimensional vector with the pitch angle in the first dimension and the yaw angle in the second dimension. The range of the pitch angle is -40° to 10° and the range of the yaw angle is -45° to 45°. 4) SkyFinder (Mihail et al., 2016; Chu et al., 2018) is used for temperature prediction from outdoor webcam images. It contains 35,417 images captured by 44 cameras around 11am on each day under a wide range of weather and illumination conditions. The temperature range is -20°C to 49°C. 5) TUAB is used for brain-age prediction from EEG signals.

Metrics. We report two kinds of metrics: the prediction error and the coefficient of determination (R²). Prediction errors have practical meaning and are convenient for interpretation, while R² quantifies how much the model outperforms a dummy regressor that always predicts the mean value of the training labels. For age, brain-age, and temperature, the mean absolute error (MAE) is reported as the prediction error. For gaze direction, the angular error is reported as the prediction error.

Baselines.
We implemented seven typical regression methods as baselines. L1, MSE and HUBER have the model directly predict the target value and train the model with an error-based loss function, where L1 uses the mean absolute error, MSE uses the mean squared error, and HUBER uses an MSE term when the error is below a threshold and an L1 term otherwise. DEX (Rothe et al., 2015) and DLDL-V2 (Gao et al., 2018) divide the regression range of each label dimension into several bins and learn the probability distribution over the bins. DEX optimizes a cross-entropy loss between the predicted distribution and the one-hot ground-truth labels, while DLDL-V2 jointly optimizes a KL loss between the predicted distribution and a normal distribution centered at the ground-truth value, as well as an L1 loss between the expectation of the predicted distribution and the ground-truth value. During inference, they output the expectation of the predicted distribution for each label dimension. OR (Niu et al., 2016) and CORN (Shi et al., 2021) design multiple ordered thresholds for each label dimension, and learn a binary classifier for each threshold. OR optimizes a binary cross-entropy loss for each binary classifier to learn whether the target value is larger than each threshold, while CORN learns whether the target value is larger than each threshold conditioned on its being larger than the previous threshold. During inference, they aggregate all binary classification results to produce the final predictions.

Experiment Settings. We use ResNet-18 (He et al., 2016) as the backbone model for the AgeDB, IMDB-WIKI, MPIIFaceGaze and SkyFinder datasets. For TUAB, we use a 24-layer 1D ResNet (He et al., 2016) as the backbone model to process the EEG signals. We use a linear regressor as the predictor. We use the SGD optimizer and cosine learning rate annealing (Loshchilov & Hutter, 2016) for training. The batch size is set to 256.
We pick the best learning rates and weight decays by grid search for both our method and the baselines. We train all the baselines and the encoder of our method for 400 epochs, and the linear regressor of our method for 100 epochs. The same data augmentations are adopted for all baselines and our method: random crop and resize (with random flip) and color distortions for AgeDB, IMDB-WIKI, and SkyFinder; random crop and resize (without random flip) and color distortions for MPIIFaceGaze; random crop for TUAB. Appendix C shows augmentation examples for each dataset. The negative L2 norm, i.e., sim(v_i, v_j) = -‖v_i - v_j‖_2, is used as the feature similarity measure in L_SupCR. The L1 distance is used as the label distance measure in L_SupCR for AgeDB, IMDB-WIKI, SkyFinder and TUAB, while the angular distance is used as the label distance measure for MPIIFaceGaze. The temperature τ is set to 2.0. Refer to Appendix D for more details.
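The two reported metrics follow their standard definitions; a minimal sketch (variable names are ours), where R² compares the model against a dummy regressor that predicts the label mean:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def r2_score(y_true, y_pred):
    """Coefficient of determination R^2, relative to a dummy regressor that
    always predicts the mean label (the test-set mean stands in here for the
    training-label mean used in the paper)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([20.0, 30.0, 40.0, 50.0])
y_pred = np.array([22.0, 29.0, 41.0, 48.0])
print(mae(y_true, y_pred))       # 1.5
print(r2_score(y_true, y_pred))  # 0.98
```

Note that the dummy mean predictor itself scores R² = 0, so any positive R² indicates an improvement over always predicting the mean.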

4.1. MAIN RESULTS

As explained earlier, our model learns a regression-suitable representation that can be used by any of the baselines. Thus, in our comparison, we first train the encoder with the proposed L_SupCR, and then freeze the encoder and train a predictor on top of it using each of the baseline methods. We then compare each original baseline without our representation to the same baseline with our representation. For example, L1 means training the encoder and predictor end-to-end with the L1 loss, and SUPCR(L1) means training the encoder with L_SupCR and then training the predictor with the L1 loss. Table 1, Table 2, Table 3 and Table 4 show the evaluation results on AgeDB, TUAB, MPIIFaceGaze and SkyFinder, respectively. Green numbers highlight the performance gains brought by using our representation, and the best numbers are shown in bold. As all tables indicate, SUPCR consistently achieves the best performance on both metrics across all datasets. Moreover, incorporating SUPCR to learn the representation consistently reduces the prediction error of all baselines, by 5.8%, 10.1%, 11.7% and 7.0% on average on AgeDB, TUAB, MPIIFaceGaze and SkyFinder, respectively.

We evaluate the transfer learning performance by first pre-training the feature encoder on a large dataset and then using either linear probing (fixed encoder) or fine-tuning to learn a predictor on a small dataset. We investigate two scenarios: transferring from AgeDB, which contains ∼12k samples, to a subsampled IMDB-WIKI of 2k samples, and transferring from another subsampled IMDB-WIKI of 32k samples to AgeDB. As shown in Table 5, SUPCR(L1) outperforms L1 in both the linear probing and fine-tuning settings in both scenarios.

In real-world regression tasks, it is common that some targets are unseen during training. As in Yang et al. (2021), we curate two subsets of IMDB-WIKI that contain unseen targets. The test set is uniformly distributed across the whole target range.
Table 6 shows the label distributions of the two training sets, where pink shading indicates regions of unseen targets and blue shading represents the distribution of seen targets. The first training set consists of 3,781 training samples with a bi-modal Gaussian distribution over the target space, and the second consists of 3,662 training samples with a tri-modal Gaussian distribution over the target space. We report the prediction error on the seen and unseen targets separately. The results show that SUPCR(L1) outperforms L1 by a larger margin on the unseen targets, without sacrificing performance on the seen targets.

4.6. DESIGN VARIANTS

Similarity Measure and Projection Head. We explore potential variants of our method to pick the best design. Table 7 compares the performance of our method (L_SupCR with the negative L2 norm as the similarity measure) with alternative designs that use a different similarity measure and potentially add a projection head. The results are on the AgeDB dataset using SUPCR(L1). Specifically, the table shows the performance of different feature similarity measures sim(·,·). We see that among cosine similarity, i.e., sim(v_i, v_j) = v_i · v_j / (‖v_i‖ ‖v_j‖), the negative L1 norm, i.e., sim(v_i, v_j) = -‖v_i - v_j‖_1, and the negative L2 norm, i.e., sim(v_i, v_j) = -‖v_i - v_j‖_2, the negative L2 norm delivers the best performance. The table also shows the performance with and without a projection head. To train the encoder network f(·): X → R^{d_e}, one can involve a projection head h(·): R^{d_e} → R^{d_p} during training to calculate the loss on {h(f(x_i))}_{i∈[2N]}, and then discard the projection head and use f(x_i) for inference. One can also directly calculate the loss on {f(x_i)}_{i∈[2N]} and then use f(x_i) for inference. SIMCLR (Chen et al., 2020a) and SUPCON (Khosla et al., 2020) both use the former approach. For L_SimCLR, the results show that adding the projection head benefits the regression performance. This is because L_SimCLR aims to extract features that are invariant to data augmentations and can thus remove information that may be useful for the downstream regression task. For L_SupCon and L_SupCR, however, it is better to train the encoder without the projection head, since both L_SupCon and L_SupCR leverage the label information to extract features that directly target the downstream task. Moreover, we see that using L_SupCon or L_SupCR as the loss function delivers better performance than using L_SimCLR, verifying that it is helpful to utilize label information during encoder training.
We also see that L_SupCR outperforms L_SupCon by a large margin, highlighting the superiority of L_SupCR, which explicitly leverages the ordered relationships between samples in regression problems.

Training Scheme. There are usually three schemes to train the feature encoder: (1) Linear probing: first train the feature encoder using the representation learning loss, then freeze the encoder and train a linear regressor on top of it using a regression loss. (2) Fine-tuning: first train the feature encoder using the representation learning loss, then fine-tune the whole model using a regression loss. (3) Regularization: train the whole model while jointly optimizing the representation learning loss and the regression loss. Table 8 shows the performance on AgeDB for the three schemes, using L_SupCR as the representation learning loss and L1 as the regression loss. We see that all three schemes improve performance over using the regression loss L1 alone to train the model. Further, unlike classification problems where fine-tuning often delivers the best performance, here freezing the feature encoder performs best. This is because, in the case of regression, back-propagating the L1 loss to the representation can destroy the order in the embedding space learned by L_SupCR, which leads to poorer performance.
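The three candidate similarity measures compared in Table 7 can be written as interchangeable functions; this is an illustrative sketch (function names are ours), with the negative L2 norm being the variant the paper adopts:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: u . v / (||u|| ||v||)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def neg_l1_sim(u, v):
    """Negative L1 norm: -||u - v||_1."""
    return -np.sum(np.abs(u - v))

def neg_l2_sim(u, v):
    """Negative L2 norm: -||u - v||_2 (the measure used in the paper)."""
    return -np.linalg.norm(u - v)

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(cosine_sim(u, v))  # 0.0
print(neg_l1_sim(u, v))  # -2.0
print(neg_l2_sim(u, v))  # approximately -1.414
```

Any of these can be dropped into L_SupCR as sim(·,·); note that cosine similarity is invariant to the embedding scale, whereas the norm-based measures are not, which is one reason their behavior can differ in Table 7.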

5. CONCLUSION

In this paper, we propose Supervised Contrastive Regression (SupCR), a framework that learns a regression-aware representation by contrasting samples against each other according to their target distance. The proposed framework is orthogonal to existing regression models, and can be used in combination with such models to improve performance. It achieves state-of-the-art performance and consistently improves prior regression baselines across different datasets, tasks, and input modalities. It also improves robustness to data corruptions, resilience to reduced training data, performance on transfer learning, and generalization to unseen targets.

A PROOFS

Theorem 1 (Lower bound of L_SupCR). L* is a lower bound of L_SupCR, i.e., L_SupCR > L*.

Proof. Grouping the inner sum by label distance levels, we have

L_SupCR = -\frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{j\in[2N]\setminus\{i\}}\log\frac{\exp(s_{i,j})}{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}\geq d_{i,j}}\exp(s_{i,k})}

= -\frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i}\sum_{j\in[2N]\setminus\{i\}:\, d_{i,j}=D_{i,m}}\log\frac{\exp(s_{i,j})}{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}\geq D_{i,m}}\exp(s_{i,k})}

= -\frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i}\sum_{j\in[2N]\setminus\{i\}:\, d_{i,j}=D_{i,m}}\log\frac{\exp(s_{i,j})}{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}=D_{i,m}}\exp(s_{i,k})}
  + \frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i}\sum_{j\in[2N]\setminus\{i\}:\, d_{i,j}=D_{i,m}}\log\left(1+\frac{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}>D_{i,m}}\exp(s_{i,k}-s_{i,j})}{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}=D_{i,m}}\exp(s_{i,k}-s_{i,j})}\right)   (2)

> -\frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i}\sum_{j\in[2N]\setminus\{i\}:\, d_{i,j}=D_{i,m}}\log\frac{\exp(s_{i,j})}{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}=D_{i,m}}\exp(s_{i,k})}.

For all i ∈ [2N] and m ∈ [M_i], from Jensen's inequality we have

-\sum_{j\in[2N]\setminus\{i\}:\, d_{i,j}=D_{i,m}}\log\frac{\exp(s_{i,j})}{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}=D_{i,m}}\exp(s_{i,k})}
  \geq -n_{i,m}\log\left(\frac{1}{n_{i,m}}\sum_{j\in[2N]\setminus\{i\}:\, d_{i,j}=D_{i,m}}\frac{\exp(s_{i,j})}{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}=D_{i,m}}\exp(s_{i,k})}\right) = n_{i,m}\log n_{i,m}.   (3)

Thus, by plugging Equation 3 into Equation 2, we have L_SupCR > \frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i} n_{i,m}\log n_{i,m} = L*.

Theorem 2 (Lower bound tightness). For any ϵ > 0, there exists a set of feature embeddings such that L_SupCR < L* + ϵ.

Proof. We will show that for any ϵ > 0, any set of feature embeddings satisfying, for all i ∈ [2N] and j, k ∈ [2N]\{i},

s_{i,j} > s_{i,k} + γ   if d_{i,j} < d_{i,k},
s_{i,j} = s_{i,k}       if d_{i,j} = d_{i,k},

with γ := \log\frac{2N}{\min_{i\in[2N],\, m\in[M_i]} n_{i,m}\,ϵ}, achieves L_SupCR < L* + ϵ. For such a set of feature embeddings, for all i ∈ [2N], m ∈ [M_i], and j ∈ {j ∈ [2N]\{i} | d_{i,j} = D_{i,m}},

-\log\frac{\exp(s_{i,j})}{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}=D_{i,m}}\exp(s_{i,k})} = \log n_{i,m}   (5)

since s_{i,k} = s_{i,j} for all k such that d_{i,k} = D_{i,m} = d_{i,j}, and

\log\left(1+\frac{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}>D_{i,m}}\exp(s_{i,k}-s_{i,j})}{\sum_{k\in[2N]\setminus\{i\}:\, d_{i,k}=D_{i,m}}\exp(s_{i,k}-s_{i,j})}\right) < \log\left(1+\frac{2N\exp(-γ)}{n_{i,m}}\right) < \frac{2N\exp(-γ)}{n_{i,m}} \leq ϵ   (6)

since s_{i,k} - s_{i,j} < -γ for all k such that d_{i,k} > D_{i,m} = d_{i,j}, and s_{i,k} - s_{i,j} = 0 for all k such that d_{i,k} = D_{i,m} = d_{i,j}.
By plugging Equations 5 and 6 into the decomposition in Equation 2, we have

L_SupCR < \frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i} n_{i,m}\log n_{i,m} + ϵ = L* + ϵ.

Theorem 3 (Main theorem). For any 0 < δ < 1, there exists ϵ > 0 such that if L_SupCR < L* + ϵ, then the feature embeddings are δ-ordered.

Proof. We will show that

ϵ = \frac{1}{2N(2N-1)}\min\left\{\min_{i\in[2N],\, m\in[M_i]}\log\left(1+\frac{1}{n_{i,m}\exp(δ+\frac{1}{δ})}\right),\ 2\log\frac{1+\exp(δ)}{2}-δ\right\} > 0

suffices, i.e., when L_SupCR < L* + ϵ, the feature embeddings are δ-ordered. We first show that |s_{i,j} - s_{i,k}| < δ whenever d_{i,j} = d_{i,k}, for all i ∈ [2N] and j, k ∈ [2N]\{i}. From Equation 2 we have

L_SupCR > -\frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i}\sum_{j\in[2N]\setminus\{i\}:\, d_{i,j}=D_{i,m}}\log\frac{\exp(s_{i,j})}{\sum_k\exp(s_{i,k})},   (9)

where, for brevity, Σ_k denotes the sum over k ∈ [2N]\{i} with d_{i,k} = D_{i,m}. Let p_{i,m} := \arg\min_{j\in[2N]\setminus\{i\}:\, d_{i,j}=D_{i,m}} s_{i,j}, q_{i,m} := \arg\max_{j\in[2N]\setminus\{i\}:\, d_{i,j}=D_{i,m}} s_{i,j}, ζ_{i,m} := s_{i,p_{i,m}}, and η_{i,m} := s_{i,q_{i,m}} - s_{i,p_{i,m}}, for all i ∈ [2N], m ∈ [M_i]. By splitting out the maximum term and the minimum term, we have

L_SupCR > -\frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i}\left[\log\frac{\exp(ζ_{i,m})}{\sum_k\exp(s_{i,k})} + \log\frac{\exp(ζ_{i,m}+η_{i,m})}{\sum_k\exp(s_{i,k})} + \sum_{j\neq p_{i,m},q_{i,m}:\, d_{i,j}=D_{i,m}}\log\frac{\exp(s_{i,j})}{\sum_k\exp(s_{i,k})}\right].   (10)

Let θ_{i,m} := \frac{1}{n_{i,m}-2}\sum_{j\in[2N]\setminus\{i,p_{i,m},q_{i,m}\}:\, d_{i,j}=D_{i,m}}\exp(s_{i,j}-ζ_{i,m}). We have

-\log\frac{\exp(ζ_{i,m})}{\sum_k\exp(s_{i,k})} = \log\left(1+\exp(η_{i,m})+(n_{i,m}-2)θ_{i,m}\right)   (11)

and

-\log\frac{\exp(ζ_{i,m}+η_{i,m})}{\sum_k\exp(s_{i,k})} = \log\left(1+\exp(η_{i,m})+(n_{i,m}-2)θ_{i,m}\right) - η_{i,m}.   (12)

Then, from Jensen's inequality, we know

\exp\left(\sum_{j\in[2N]\setminus\{i,p_{i,m},q_{i,m}\}:\, d_{i,j}=D_{i,m}} s_{i,j}\right) \leq \left(\frac{1}{n_{i,m}-2}\sum_{j\in[2N]\setminus\{i,p_{i,m},q_{i,m}\}:\, d_{i,j}=D_{i,m}}\exp(s_{i,j})\right)^{n_{i,m}-2},   (13)

and thus

-\sum_{j\neq p_{i,m},q_{i,m}:\, d_{i,j}=D_{i,m}}\log\frac{\exp(s_{i,j})}{\sum_k\exp(s_{i,k})} \geq (n_{i,m}-2)\log\left(1+\exp(η_{i,m})+(n_{i,m}-2)θ_{i,m}\right) - (n_{i,m}-2)\log θ_{i,m}.   (14)

By plugging Equations 11, 12 and 14 into Equation 10, we have

L_SupCR > \frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i}\left(n_{i,m}\log\left(1+\exp(η_{i,m})+(n_{i,m}-2)θ_{i,m}\right) - η_{i,m} - (n_{i,m}-2)\log θ_{i,m}\right).   (15)

For each (i, m), viewed as a function of θ > 0, h(θ) := n_{i,m}\log\left(1+\exp(η_{i,m})+(n_{i,m}-2)θ\right) - η_{i,m} - (n_{i,m}-2)\log θ attains its minimum at θ = \frac{1+\exp(η_{i,m})}{2}, so

h(θ_{i,m}) \geq h\left(\frac{1+\exp(η_{i,m})}{2}\right) = n_{i,m}\log n_{i,m} + 2\log\frac{1+\exp(η_{i,m})}{2} - η_{i,m}.   (16)

By plugging Equation 16 into Equation 15, we have

L_SupCR > \frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i}\left(n_{i,m}\log n_{i,m} + 2\log\frac{1+\exp(η_{i,m})}{2} - η_{i,m}\right) = L* + \frac{1}{2N(2N-1)}\sum_{i=1}^{2N}\sum_{m=1}^{M_i}\left(2\log\frac{1+\exp(η_{i,m})}{2} - η_{i,m}\right).   (17)

Since η_{i,m} ≥ 0, we have 2\log\frac{1+\exp(η_{i,m})}{2} - η_{i,m} ≥ 0. Thus, for all i ∈ [2N] and m ∈ [M_i],

L_SupCR > L* + \frac{1}{2N(2N-1)}\left(2\log\frac{1+\exp(η_{i,m})}{2} - η_{i,m}\right).   (18)

If L_SupCR < L* + ϵ ≤ L* + \frac{1}{2N(2N-1)}\left(2\log\frac{1+\exp(δ)}{2} - δ\right), then

2\log\frac{1+\exp(η_{i,m})}{2} - η_{i,m} < 2\log\frac{1+\exp(δ)}{2} - δ.

Since y(x) = 2\log\frac{1+\exp(x)}{2} - x increases monotonically for x > 0, we have η_{i,m} < δ. Hence, for all i ∈ [2N] and j, k ∈ [2N]\{i}, if d_{i,j} = d_{i,k} = D_{i,m}, then |s_{i,j} - s_{i,k}| ≤ η_{i,m} < δ.

C DATA AUGMENTATION

• For AgeDB and SkyFinder, random crop and resize (with random horizontal flip) and color distortions are used as data augmentation;
• For MPIIFaceGaze, random crop and resize (without random horizontal flip) and color distortions are used as data augmentation;
• For TUAB, random crop is used as data augmentation.

D DETAILED EXPERIMENT SETTINGS

For the encoder training of our method and the regression learning baselines, we selected the best learning rate and weight decay for each dataset by grid search, over learning rates in $\{0.01, 0.05, 0.1, 0.2, 0.5, 1.0\}$ and weight decays in $\{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}\}$. For the predictor training of our method, we adopted the same search grid, except that we added zero weight decay to the set of weight-decay choices. For the temperature parameter τ, we searched over $\{0.1, 0.2, 0.5, 1.0, 2.0, 5.0\}$ and selected the best value, which is 2.0. For the classification-based baselines, we divided the regression range into small bins. For AgeDB, the target range is 0 ∼ 101 and the bin size is set to 1; for TUAB, the target range is 0 ∼ 95 and the bin size is set to 1; for MPIIFaceGaze, the target range is -40 ∼ 10 for the pitch angle and -45 ∼ 45 for the yaw angle, and the bin size is set to 0.5 for the pitch angle and 1 for the yaw angle; for SkyFinder, the target range is -20 ∼ 49 and the bin size is set to 1. For the implementation of SupCon (Khosla et al., 2020) on regression datasets, the regression range is divided into small bins in the same way and each bin is regarded as a class. Samples from the same class, i.e., sharing the same target bin as the anchor, are considered positive samples.
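To make the binning scheme above concrete, here is a minimal sketch; the helper name `to_bin` and its clamping behavior are our own illustration, not code from the paper:

```python
def to_bin(y, lo, hi, bin_size):
    """Map a continuous target y in [lo, hi] to a class index by uniform
    binning, as used to turn the regression task into classification."""
    n_bins = round((hi - lo) / bin_size)
    idx = int((y - lo) // bin_size)
    return min(idx, n_bins - 1)  # clamp so y == hi falls in the last bin

# MPIIFaceGaze pitch: range -40 to 10 with bin size 0.5 gives 100 classes
assert to_bin(-40, -40, 10, 0.5) == 0
assert to_bin(10, -40, 10, 0.5) == 99

# For the SupCon baseline, two samples are positives iff they share a bin
assert to_bin(33.2, 0, 101, 1) == to_bin(33.9, 0, 101, 1)
```

The same mapping serves both the classification-based baselines (bin index as class label) and the SupCon positive-pair definition.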

E ADDITIONAL EXPERIMENTS AND ANALYSIS

E.1 IMPACT OF MODEL ARCHITECTURES

In the main paper, we use ResNet-18 as the encoder backbone for the three image datasets (AgeDB, MPIIFaceGaze and SkyFinder). In this section, we study the impact of the backbone architecture on the experiment results. As Table 9 reports, the results of using ResNet-50 as the encoder backbone are consistent with the ResNet-18 results in Table 1, Table 3 and Table 4, showing that our method is compatible with different model architectures.

E.2 COMPARISON TO EXISTING STATE-OF-THE-ART METHODS

The state-of-the-art methods on these datasets include (Chu et al., 2018) for SkyFinder. However, they either use different backbone architectures and optimizers, or use different dataset splits and numbers of training samples from ours. Therefore, the results reported in their papers are not directly comparable to ours. We implement their methods under our training and evaluation protocols and summarize the results in Table 10. Results show that our method surpasses the state-of-the-art methods on all datasets.

E.3 COMPARISON TO MORE SELF-SUPERVISED LEARNING METHODS

In Section 4.6, we compare our method with SimCLR (Chen et al., 2020a) and SupCon (Khosla et al., 2020) on the AgeDB dataset. In this section, we investigate more SSL methods under various training settings and compare them with our method on all datasets. Specifically, we evaluate encoders pretrained with SimCLR and DINO (Caron et al., 2021), both on ImageNet and directly on the regression datasets. In addition, we consider the setting where the SimCLR loss is used as a regularizer alongside the L1 loss when training on the regression datasets. The results are shown in Table 11. We use ResNet-50 as the encoder backbone for the image datasets. The ResNet-50 ImageNet-pretrained encoders are 800-epoch checkpoints, and their ImageNet linear evaluation accuracies are 69.1% and 75.3% respectively. For DINO-related methods, apart from linear evaluation, we also report k-NN evaluation results (k = 20). The results show that SupCR outperforms all these baselines on all datasets, and that pretraining with SimCLR or DINO, or using SimCLR as a regularizer, all perform worse than the vanilla L1 baseline. This further verifies that the performance gain of our method stems from our proposed SupCR loss rather than from the pretraining scheme.

E.4 STANDARD DEVIATIONS OF THE RESULTS

In this section, we study the standard deviations of the best results on each dataset over 5 different random seeds. Table 12 shows the average prediction errors and standard deviations. These results are aligned with the results reported in the main paper.

E.5 ABLATION ON THE NUMBER OF POSITIVES

In $\mathcal{L}_{\mathrm{SupCR}}$, all samples in the batch are treated as positives for each anchor. Here, we conduct an ablation study that considers only the K samples closest to the anchor (in label space) as positives. Table 13 shows the MAE on AgeDB for different values of K (using the L1 loss in linear probing), where K = 511 means considering all samples as positive: since we use a batch size of N = 256, the number of samples other than the anchor equals 2N - 1 = 511.
These experiments show that the larger K is, the better the performance. This phenomenon is aligned with the design intuition of the SupCR loss, since each contrastive term enforces a group of orderings relative to the positive sample: it pushes all samples whose label distance to the anchor is larger than the positive's to be farther from the anchor than the positive in the feature embedding space. Only when all samples are considered as positives, so that all of these groups of orderings are enforced, is the ordering in the feature space fully guaranteed.
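The K-closest-positives variant described above can be sketched as follows (a toy illustration; the function name and the list-based implementation are ours, not the paper's code):

```python
def positives_for_anchor(labels, i, K):
    """Return indices of the K samples closest to anchor i in label space;
    K = len(labels) - 1 recovers the full SupCR setting, where every other
    sample in the batch acts as a positive for the anchor."""
    others = [j for j in range(len(labels)) if j != i]
    others.sort(key=lambda j: abs(labels[i] - labels[j]))
    return others[:K]

# The example batch of Figure 2: ages 20 (anchor), 30, 0, 70
ages = [20, 30, 0, 70]
assert positives_for_anchor(ages, 0, 2) == [1, 2]     # the 30- and 0-year-olds
assert positives_for_anchor(ages, 0, 3) == [1, 2, 3]  # all others are positives
```

Restricting K drops the contrastive terms anchored on the farthest positives, which is why the ordering in the feature space is only fully enforced at the largest K.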

E.6 TRAINING EFFICIENCY

We compute the average wall-clock running time (in seconds) per training epoch on 8 NVIDIA TITAN RTX GPUs for SupCR and compare it with SupCon (Khosla et al., 2020) on all four datasets, as shown in Table 14. The results show that the training efficiency of SupCR is comparable to that of SupCon.



Figure 1: UMAP (McInnes et al., 2018) visualization of the representations learned by L1, SupCon (Khosla et al., 2020), and the proposed SupCR for the task of predicting the temperature from outdoor webcam images (Chu et al., 2018). The representation of each image is visualized as a dot, and the color indicates the ground-truth temperature. SupCR learns a representation that captures the intrinsic ordered relationships between the samples, while L1 and SupCon fail to do so.

Figure 2: Illustration of $\mathcal{L}_{\mathrm{SupCR}}$. (a) An example batch of input data and their age labels. (b) Two example positive pairs and the corresponding negative pair(s) when the anchor is the 20-year-old man (shown in gray shading). When the 30-year-old man creates a positive pair with the anchor, their label distance is 10, so the corresponding negative samples are the 0-year-old baby and the 70-year-old man, whose label distances to the anchor are 20 and 50 respectively. When the 0-year-old baby creates a positive pair with the anchor, their label distance is 20. Only one sample in the batch has a larger label distance to the anchor, namely the 70-year-old man, and it acts as a negative sample.

Figure 2(a) shows an example batch, and Figure 2(b) shows two positive pairs and their corresponding negative pair(s). Specifically, for any anchor sample i, any other sample j in the same batch can be used to create a positive pair, with the corresponding negative samples set to all samples in the batch whose label distance to i is greater than the label distance between i and j. In other words, $\mathcal{L}_{\mathrm{SupCR}}$ aims to force the feature similarity between i and j to be greater than the feature similarity between i and any other sample k in the batch whenever the label distance between i and k is greater than that between i and j. Therefore, optimizing $\mathcal{L}_{\mathrm{SupCR}}$ makes the feature embedding ordered according to the order in the label space. This design fully leverages the ordered relationships between samples reflected in the continuous label space of regression problems, which is precisely the intrinsic difference between regression and classification problems.
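To make these mechanics concrete, here is a minimal pure-Python sketch of the loss (our own toy re-implementation, not the paper's code; we assume negative L2 distance as the similarity measure, with the temperature τ dividing it as in the paper's setup):

```python
import math

def sim(u, v):
    # Similarity measure; negative L2 distance is an assumption of this
    # sketch (the paper's sim could equally be, e.g., cosine similarity).
    return -math.dist(u, v)

def supcr_loss(embeddings, labels, tau=2.0):
    """Toy version of the SupCR loss: for each anchor i and positive j,
    the negatives are all samples k whose label distance to i is at least
    the label distance between i and j."""
    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            d_ij = abs(labels[i] - labels[j])
            s_ij = sim(embeddings[i], embeddings[j]) / tau
            # Denominator ranges over k with d_{i,k} >= d_{i,j} (incl. j)
            denom = sum(
                math.exp(sim(embeddings[i], embeddings[k]) / tau)
                for k in range(n)
                if k != i and abs(labels[i] - labels[k]) >= d_ij
            )
            total -= s_ij - math.log(denom)
            count += 1
    return total / count

# The batch of Figure 2 (ages 20, 30, 0, 70): a 1-D embedding ordered
# consistently with the labels scores lower than a scrambled assignment.
ages = [20.0, 30.0, 0.0, 70.0]
ordered = [[2.0], [3.0], [0.0], [7.0]]
scrambled = [[7.0], [0.0], [3.0], [2.0]]
assert supcr_loss(ordered, ages) < supcr_loss(scrambled, ages)
```

Each summand is nonnegative because the denominator always contains the positive itself; the loss shrinks as feature-space distances become ordered like label distances.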

3) MPIIFaceGaze (Zhang et al., 2017a;b) is used for gaze direction estimation from face images. It contains 213,659 face images collected from 15 participants during their natural everyday laptop use.


Figure 5: Visualizations of original and augmented data samples in each dataset.

$\cdots, I\}$. Given a randomly sampled batch of $N$ input and label pairs $\{(x_n, y_n)\}_{n \in [N]}$, we apply data augmentation to the batch and obtain a two-view batch $\{(\tilde{x}_\ell, \tilde{y}_\ell)\}_{\ell \in [2N]}$, where $\tilde{x}_{2n} = t(x_n)$, $\tilde{x}_{2n-1} = t'(x_n)$ ($t$ and $t'$ are independently sampled data augmentation operations) and $\tilde{y}_{2n} = \tilde{y}_{2n-1} = y_n$, $\forall n \in [N]$. The augmented batch is fed into the encoder $f(\cdot)$ to obtain $v_\ell = f(\tilde{x}_\ell) \in \mathbb{R}^{d_e}$, $\forall \ell \in [2N]$. Our supervised contrastive regression loss is then computed over the feature embeddings $\{v_\ell\}_{\ell \in [2N]}$ as:
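The loss expression itself did not survive extraction here. Consistent with the decomposition used in the proofs (which repeatedly split Equation 2 over the sets $\{k : d_{i,k} \ge d_{i,j}\}$), and writing $d_{i,j} := |\tilde{y}_i - \tilde{y}_j|$ for the label distance and $s_{i,j}$ for the temperature-scaled similarity between $v_i$ and $v_j$, the loss takes the following form (a reconstruction, not a verbatim copy of Equation 2):

```latex
\mathcal{L}_{\mathrm{SupCR}}
  = -\frac{1}{2N(2N-1)} \sum_{i=1}^{2N} \sum_{j \in [2N]\setminus\{i\}}
    \log \frac{\exp(s_{i,j})}
              {\sum_{k \in [2N]\setminus\{i\},\ d_{i,k} \ge d_{i,j}} \exp(s_{i,k})},
  \qquad s_{i,j} := \mathrm{sim}(v_i, v_j) / \tau .
```

For each anchor–positive pair $(i, j)$, the denominator ranges over exactly the negative set described in Figure 2: all samples at least as far from $i$ in label space as $j$ is.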

1) AgeDB (Moschoglou et al., 2017) is used for predicting age from face images. It contains 16,488 in-the-wild images of celebrities and the corresponding age labels. The age range is between 0 and 101. It is split into a 12,208-image training set, a 2,140-image validation set and a 2,140-image test set. 2) TUAB (Obeid & Picone, 2016; Engemann et al., 2022) is used for brain-age estimation from resting-state EEG signals. The dataset comes from EEG exams at the Temple University Hospital in Philadelphia. Following Engemann et al. (2022), we use only the non-pathological subjects, so that we may consider their chronological age as their brain-age label. The dataset includes 1,385 21-channel EEG signals sampled at 200 Hz from individuals whose age ranges from 0 to 95. It is split into a 1,246-subject training set and a 139-subject test set.

4) SkyFinder (Mihail et al., 2016) is used for temperature prediction from outdoor webcam images. The temperature range is -20 °C to 49 °C. It is split into a 28,373-image training set, a 3,522-image validation set and a 3,522-image test set.

Evaluation results on AgeDB.

Evaluation results on TUAB.

Evaluation results on MPIIFaceGaze.

Evaluation results on SkyFinder.

IMDB-WIKI (Rothe et al., 2015) is a large dataset for predicting age from face images, containing 523,051 celebrity images and the corresponding age labels. The age range is between 0 and 186 (some images are mislabeled). We use this dataset to test our method's resilience to reduced training data, performance on transfer learning, and ability to generalize to unseen targets. We subsample the dataset to create variable-size training sets, and keep the validation set and test set unchanged, with 11,022 images in each.

Transfer learning results.

Generalization performance to unseen targets on two curated subsets of IMDB-WIKI. MAE ↓ is used as the metric.

Comparison to potential variants of $\mathcal{L}_{\mathrm{SupCR}}$ and their performance on AgeDB.

Performance of alternative training schemes for the feature encoder on AgeDB.

Evaluation results using ResNet-50 as the encoder backbone.

Comparison to existing SOTA methods on each dataset.

Comparison to SSL methods applied under different settings on each dataset. MAE ↓ is used as the metric for AgeDB, TUAB and SkyFinder, and Angular Error ↓ for MPIIFaceGaze.

Average prediction errors and standard deviations of the best results on each dataset.

Ablation on number of positives considered.

Average wall-clock running time (in seconds) per training epoch on 8 NVIDIA TITAN RTX GPUs for SupCR and SupCon (Khosla et al., 2020) on each dataset.


From Equation 2 we have (20), and combining it with Equation 3 we have the following. When $\mathcal{L}_{\mathrm{SupCR}} < \mathcal{L}^\star + \epsilon$, we already have $|s_{i,l} - s_{i,j}| < \delta$, $\forall d_{i,l} = d_{i,j}$, which gives $s_{i,l} - s_{i,j} < \delta$ and thus $\exp(s_{i,l} - s_{i,j}) < \exp(\delta)$. By putting this into Equation 20, we have, $\forall i \in [2N]$, …, where $r_{i,j} \in [M_i]$ is the index such that $D_{i,r_{i,j}} = d_{i,j}$. Further, given …

B ETHICS STATEMENTS

All of the datasets used in the paper are public datasets. The ethics statements for each dataset are listed below:
• TUAB: This dataset is collected from archival records at Temple University Hospital (TUH). All work was performed in accordance with the Declaration of Helsinki and with the full approval of the Temple University IRB. All personnel in contact with privileged patient information were fully trained on patient privacy and were certified by the Temple IRB (Obeid & Picone, 2016).
• MPIIFaceGaze: The authors of Zhang et al. (2017b) confirmed that informed consent was obtained from all of the participants.
• AgeDB, IMDB-WIKI and SkyFinder: These datasets were collected without any interaction or intervention with human subjects and do not contain any private information (Moschoglou et al., 2017; Rothe et al., 2015; Mihail et al., 2016).

