REGRESSION FROM UPPER ONE-SIDE LABELED DATA

Abstract

We address a regression problem from weakly labeled data that are correctly labeled only above a regression line, i.e., upper one-side labeled data. The label values of the data are the results of sensing the magnitude of some phenomenon. In this case, the labels often contain missing or incomplete observations whose values are lower than those of correct observations and are also usually lower than the regression line. It follows that data labeled with lower values than the estimations of a regression function (lower-side data) are mixed with data that should originally be labeled above the regression line (upper-side data). When such missing label observations occur in a non-negligible amount, we should therefore treat our lower-side data as unlabeled data that are a mix of original upper- and lower-side data. We formulate a regression problem from these upper-side labeled and lower-side unlabeled data. We then derive a learning algorithm that is unbiased to and consistent with ordinary regression learned from data labeled correctly in both the upper- and lower-side cases. Our key idea is that we can derive a gradient that requires only upper-side data and unlabeled data as an equivalent expression of that for ordinary regression. We additionally found that a specific class of losses enables us to learn unbiased solutions in practice. In numerical experiments on synthetic and real-world datasets, we demonstrate the advantages of our algorithm.

1. INTRODUCTION

This paper addresses a scenario in which a regression function is learned for label sensor values that are the results of sensing the magnitude of some phenomenon. A lower sensor value can mean not only a relatively lower magnitude than a higher value but also a missing or incomplete observation of the monitored phenomenon. Label sensor values for missing observations are lower than those obtained when observations are correct and are also usually lower than an optimal regression line that is learned from the correct observations. A naive regression algorithm using such labels produces low predictions and is thus biased and underfitted in comparison with the optimal regression line. In particular, when the data coverage of a label sensor is insufficient, the bias caused by missing observations is critical. One practical example is that, for comfort in healthcare, we mimic and replace an intrusive wrist sensor (label sensor) with non-intrusive bed sensors (explanatory sensors). We learn a regression function that predicts the values of the wrist sensor from the values of the bed sensors. The wrist sensor is wrapped around a wrist. It accurately represents the motion intensity of a person and is used for tasks such as sleep-wake discrimination Tryon (2013); Mullaney et al. (1980); Webster et al. (1982); Cole et al. (1992). However, it can sense motion only on the forearm, which causes data coverage to be insufficient and observations of movements on other body parts to be missing frequently. The bed sensors are installed under a bed; while their accuracy is limited because of their non-intrusiveness, they have much broader data coverage than that of the wrist sensor. In this case, the wrist sensor values for missing observations are improperly low and also inconsistent with the bed sensor values, as shown in Fig. 1-(1). This leads to severe bias and underfitting.
The specific problem causing the bias stems from the fact that our data labeled with lower values than the estimations of the regression function are mixed with data that should originally be labeled above the regression line. Here, we call data labeled above the regression line upper-side data, depicted as circles in Fig. 1-(2), and data labeled below the regression line lower-side data, depicted as squares in Fig. 1-(2). When there are missing observations, as in our scenario, the original data with missing observations have been moved to the lower side, depicted as triangles in Fig. 1-(3). We thus should treat our lower-side data as unlabeled data, that is, a mix of original upper- and lower-side data. We overcome the bias by handling this asymmetric label corruption, in which upper-side data are correctly labeled but lower-side data are always unlabeled. There is an established approach against such corrupted weak labels in regression, namely robust regression, which regards weak labels as containing outliers Huber et al. (1964); Narula & Wellington (1982); Draper & Smith (1998); Wilcox (1997). However, since robust regression assumes symmetric rather than asymmetric label corruption, it is still biased in our problem setting. In the classification problem setting, asymmetric label corruption is addressed with positive-unlabeled (PU) learning, where it is assumed that negative data cannot be obtained but unlabeled data are available as well as positive data Denis (1998); De Comité et al. (1999); Letouzey et al. (2000); Shi et al. (2018); Kato et al. (2019); Sakai & Shimizu (2019); Charoenphakdee & Sugiyama (2019); Li et al. (2019); Zhang et al. (2019); Xu et al. (2019); Zhang et al. (2020); Guo et al. (2020); Chen et al. (2020). The focus there is on classification tasks, and an unbiased risk estimator has been proposed Du Plessis et al. (2014; 2015). There is a gap between the classification problem setting and our regression problem setting, i.e., we have to estimate specific continuous values, not positive/negative classes. We fill the gap with a novel approach for deriving an unbiased solution for our regression setting.
In this paper, we formulate a regression problem from upper one-side labeled data, in which the upper-side data are correctly labeled, and we regard lower-side data as unlabeled data. We refer to this as one-side regression. Using these upper-side labeled and lower-side unlabeled data, we derive a learning algorithm that is unbiased to and consistent with ordinary regression that uses data labeled correctly in both the upper- and lower-side cases. This is achieved by deriving a gradient that requires only upper-side data and unlabeled data as an asymptotically equivalent expression of that for ordinary regression. This is a key difference from the derivation of unbiased PU classification, which works with the loss rather than its gradient. We additionally found that a specific class of losses makes it possible to learn an unbiased solution in practice. For implementing the algorithm, we propose a stochastic optimization method. In numerical experiments using synthetic and real-world datasets, we empirically evaluated the effectiveness of the proposed algorithm. We found that it improves performance against regression algorithms that assume that both upper- and lower-side data are correctly labeled.

2. ONE-SIDE REGRESSION

Our goal is to derive a learning algorithm with upper one-side labeled data in an unbiased and consistent manner to ordinary regression that uses both upper- and lower-side labeled data. We first consider the ordinary regression problem; after that, we formulate a one-side regression problem by transforming the objective function of the ordinary one.

2.1. ORDINARY REGRESSION PROBLEM

Let $x \in \mathbb{R}^D$ ($D \in \mathbb{N}$) be a $D$-dimensional explanatory variable and $y \in \mathbb{R}$ be a real-valued label. We learn a regression function $f(x)$ that computes an estimate $\hat{y}$ of the label for a newly observed $x$ as $\hat{y} = f(x)$. The optimal regression function $f^*$ is given by

$$f^* \equiv \operatorname*{argmin}_f L(f), \tag{1}$$

where $L(f)$ is the expected loss when the regression function $f(x)$ is applied to data, $x$ and $y$, distributed in accordance with an underlying probability distribution $p(x, y)$:

$$L(f) \equiv \mathbb{E}[L(f(x), y)], \tag{2}$$

where $\mathbb{E}$ denotes the expectation over $p(x, y)$, and $L(f(x), y)$ is the loss function between $f(x)$ and $y$, e.g., the squared loss, $L(f(x), y) = \|f(x) - y\|_2^2$. $L(f)$ can be written by using the decomposed expectations $\mathbb{E}_{\mathrm{up}}$, taken when labels are higher than the estimations of the regression function ($f(x) < y$, upper-side case), and $\mathbb{E}_{\mathrm{lo}}$, taken when labels are lower than the estimations of the regression function ($y < f(x)$, lower-side case), as

$$L(f) = \pi_{\mathrm{up}} \mathbb{E}_{\mathrm{up}}[L(f(x), y)] + \pi_{\mathrm{lo}} \mathbb{E}_{\mathrm{lo}}[L(f(x), y)], \tag{3}$$

where $\pi_{\mathrm{up}}$ and $\pi_{\mathrm{lo}}$ are the ratios for the upper- and lower-side cases, respectively. Note that the decomposition in Eq. (3) holds for any $f$ including $f^*$, and we omitted the decomposed expectation when $y = f(x)$ because it is always zero.
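The decomposition in Eq. (3) can be checked numerically. The following is a minimal sketch, assuming a linear $f$ and the squared loss; the helper name `decomposed_loss` and the toy data are illustrative assumptions and are not taken from the experiments in this paper.

```python
import numpy as np

def decomposed_loss(f_x, y, loss):
    """Return the sample estimate of L(f) and of its upper/lower decomposition."""
    losses = loss(f_x, y)
    up = y > f_x   # upper-side case: f(x) < y
    lo = y < f_x   # lower-side case: y < f(x)
    pi_up, pi_lo = up.mean(), lo.mean()
    # Guard against an empty side; its contribution is then zero.
    e_up = losses[up].mean() if up.any() else 0.0
    e_lo = losses[lo].mean() if lo.any() else 0.0
    return losses.mean(), pi_up * e_up + pi_lo * e_lo

# Toy data (assumption): linear ground truth with Gaussian noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=1000)

# Any f works; here an arbitrary, deliberately suboptimal linear model.
w_any = np.array([0.5, -1.0, 0.0])
full, decomposed = decomposed_loss(x @ w_any, y, lambda a, b: (a - b) ** 2)
```

Both estimates agree up to floating-point error because samples with $y = f(x)$ contribute zero loss.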

2.2. ONE-SIDE REGRESSION PROBLEM

We here consider a scenario in which we have training data, $D \equiv \{x_n, y_n\}_{n=1}^{N}$, that are correctly labeled only in the upper-side case because of the existence of missing label observations. The data in the lower-side case are a mix of original upper- and lower-side data and are considered to be unlabeled data. We can divide $D$ by the estimations of the regression function $f$ into upper-side data $\{X_{\mathrm{up}}, y_{\mathrm{up}}\} \equiv \{x, y \in D \mid f(x) < y\}$ and unlabeled data $X_{\mathrm{un}} \equiv \{x \in D \mid y < f(x)\}$. In ordinary regression, where both upper- and lower-side data are correctly labeled for training, the expectations $\mathbb{E}_{\mathrm{up}}$ and $\mathbb{E}_{\mathrm{lo}}$ in Eq. (3) can be estimated by using the corresponding sample averages. In our setting, however, correctly labeled data from the lower-side case are unavailable, and, therefore, $\mathbb{E}_{\mathrm{lo}}$ cannot be estimated directly. We can avoid this problem by expressing $L(f)$ as

$$L(f) \equiv \pi_{\mathrm{up}} \mathbb{E}_{\mathrm{up}}[L(f(x), y)] + \mathbb{E}[L(f(x), \tilde{y}_{\mathrm{lo}})] - \pi_{\mathrm{up}} \mathbb{E}_{\mathrm{up}}[L(f(x), \tilde{y}_{\mathrm{lo}})], \tag{4}$$

where the expectation $\mathbb{E}$ for $x$ can be estimated by computing a sample average for our unlabeled data $X_{\mathrm{un}}$, and $\tilde{y}_{\mathrm{lo}}$ is a virtual label that is always lower than the estimations of the regression function $f(x)$, whose details will be given in the next paragraph. With this expression, the expected loss $L(f)$ is represented by only the expectations over the upper-side data and unlabeled data, $\mathbb{E}_{\mathrm{up}}$ and $\mathbb{E}$. Thus, we can design a gradient-based learning algorithm by using our training data. This transformation comes from Eqs. (2) and (3) with $\tilde{y}_{\mathrm{lo}}$ as

$$\begin{aligned} \mathbb{E}[L(f(x), \tilde{y}_{\mathrm{lo}})] &= \pi_{\mathrm{up}} \mathbb{E}_{\mathrm{up}}[L(f(x), \tilde{y}_{\mathrm{lo}})] + \pi_{\mathrm{lo}} \mathbb{E}_{\mathrm{lo}}[L(f(x), \tilde{y}_{\mathrm{lo}})] \\ \pi_{\mathrm{lo}} \mathbb{E}_{\mathrm{lo}}[L(f(x), \tilde{y}_{\mathrm{lo}})] &= \mathbb{E}[L(f(x), \tilde{y}_{\mathrm{lo}})] - \pi_{\mathrm{up}} \mathbb{E}_{\mathrm{up}}[L(f(x), \tilde{y}_{\mathrm{lo}})]. \end{aligned} \tag{5}$$

That is, the second and third terms of Eq. (4) together stand in for the lower-side term of Eq. (3), with the unavailable lower-side labels replaced by $\tilde{y}_{\mathrm{lo}}$.

In practice, we cannot properly set the value of $\tilde{y}_{\mathrm{lo}}$ as being always lower than $f(x)$. However, for learning based on gradients, this is not needed when we choose loss functions whose gradients do not depend on the value of $\tilde{y}_{\mathrm{lo}}$ but only on the sign of $f(x) - \tilde{y}_{\mathrm{lo}}$, which is always positive, i.e., $\mathrm{sgn}(f(x) - \tilde{y}_{\mathrm{lo}}) = 1$ from the definition of $\tilde{y}_{\mathrm{lo}}$. Such loss functions satisfy

$$\frac{\partial L(f(x), y)}{\partial \theta} = g\big(\mathrm{sgn}(f(x) - y), f(x)\big), \tag{6}$$

where $\theta$ is the parameter vector of $f$, $g\big(\mathrm{sgn}(f(x) - y), f(x)\big)$ is a gradient function depending on $\mathrm{sgn}(f(x) - y)$ and $f(x)$, and $\mathrm{sgn}(\cdot)$ is the sign function. Common losses of this kind are the absolute loss and quantile losses. For example, the gradient of the absolute loss, $|f(x) - y|$, is

$$\frac{\partial |f(x) - y|}{\partial \theta} = \begin{cases} \dfrac{\partial f(x)}{\partial \theta} & (\mathrm{sgn}(f(x) - y) = 1) \\ -\dfrac{\partial f(x)}{\partial \theta} & (\mathrm{sgn}(f(x) - y) = -1) \\ \text{Undefined} & (\mathrm{sgn}(f(x) - y) = 0), \end{cases} \tag{7}$$

which does not depend on the value of $y$ but only on the sign of $f(x) - y$.
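The property in Eq. (6) can be illustrated with a small sketch that computes the derivative of the loss with respect to the prediction $f(x)$; multiplying by $\partial f(x)/\partial\theta$ then gives $g$. The function names and the quantile level `tau` are illustrative assumptions, and the pinball form of the quantile loss is the standard one rather than a definition taken from this paper.

```python
import numpy as np

def grad_abs_wrt_f(f_x, y):
    """d|f - y|/df: depends on y only through sgn(f - y)."""
    return np.sign(f_x - y)

def grad_quantile_wrt_f(f_x, y, tau=0.5):
    """Derivative of the pinball loss with respect to f.

    Loss: tau * (y - f) if y > f, else (1 - tau) * (f - y).
    The point f == y is non-differentiable; this sketch returns -tau there.
    """
    s = np.sign(f_x - y)
    return np.where(s > 0, 1.0 - tau, -tau)

f_x = np.array([0.3, 1.2, -0.7])
# Two different virtual labels, both strictly below f(x).
y_slightly_lower = f_x - 0.1
y_far_lower = f_x - 100.0
```

Any virtual label below $f(x)$ yields the same derivative, which is why the actual value of $\tilde{y}_{\mathrm{lo}}$ never has to be chosen.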

3. LEARNING WITH GRADIENT USING UPPER ONE-SIDE LABELED DATA

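In this section, we derive the learning algorithm based on Eqs. (1) and (4) and show that it is unbiased to and consistent with ordinary regression. We consider the gradient of Eq. (4) by using losses that satisfy Eq. (6) for its second and third terms as

$$\frac{\partial L(f)}{\partial \theta} = \pi_{\mathrm{up}} \mathbb{E}_{\mathrm{up}}\!\left[\frac{\partial L(f(x), y)}{\partial \theta}\right] + \mathbb{E}\big[g\big(\mathrm{sgn}(f(x) - \tilde{y}_{\mathrm{lo}}), f(x)\big)\big] - \pi_{\mathrm{up}} \mathbb{E}_{\mathrm{up}}\big[g\big(\mathrm{sgn}(f(x) - \tilde{y}_{\mathrm{lo}}), f(x)\big)\big]. \tag{8}$$

Using the upper-side and unlabeled sample sets, $\{X_{\mathrm{up}}, y_{\mathrm{up}}\}$ and $X_{\mathrm{un}}$, the gradient in Eq. (8) can be estimated as

$$\frac{\partial L(f)}{\partial \theta} = \frac{\pi_{\mathrm{up}}}{n_{\mathrm{up}}} \sum_{\{x, y\} \in \{X_{\mathrm{up}}, y_{\mathrm{up}}\}} \frac{\partial L(f(x), y)}{\partial \theta} + \frac{1}{n_{\mathrm{un}}} \sum_{x \in X_{\mathrm{un}}} g\big(\mathrm{sgn}(f(x) - \tilde{y}_{\mathrm{lo}}), f(x)\big) - \frac{\pi_{\mathrm{up}}}{n_{\mathrm{up}}} \sum_{x \in X_{\mathrm{up}}} g\big(\mathrm{sgn}(f(x) - \tilde{y}_{\mathrm{lo}}), f(x)\big), \tag{9}$$

where $\{x, y\} \in \{X_{\mathrm{up}}, y_{\mathrm{up}}\}$ represents coupled pairs of $x$ and $y$ in the upper-side sample set, and $n_{\mathrm{up}}$ and $n_{\mathrm{un}}$ are the numbers of samples in the upper-side and unlabeled sets, respectively. By using the gradient in Eq. (9), we can optimize Eq. (1) and learn the regression function. Its unbiasedness and consistency will be given in Section 3.1, and the specific implementation of the algorithm will be given in Section 3.2.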
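As an illustration of Eq. (9), the following sketch evaluates the empirical gradient for one concrete choice: a linear model $f(x) = \theta^\top x$, the squared loss for the first term, and the absolute loss for the second and third terms, so that $g(1, f(x)) = \partial f(x)/\partial\theta = x$. The function name `one_side_gradient` and the treatment of `pi_up` as a user-supplied value are assumptions made for this sketch.

```python
import numpy as np

def one_side_gradient(theta, X, y, pi_up):
    """Empirical gradient of Eq. (9) for f(x) = theta^T x.

    First term: squared loss on upper-side samples.
    Second/third terms: absolute loss with sgn(f(x) - y_lo) = 1, so g(1, f(x)) = x.
    pi_up is assumed to be given (hypothetical input of this sketch).
    """
    f = X @ theta
    up = f < y    # upper-side data: f(x) < y
    un = y < f    # lower-side data, treated as unlabeled
    grad = np.zeros_like(theta, dtype=float)
    if up.any():
        residual = f[up] - y[up]
        # (pi_up / n_up) * sum over upper-side of d(f - y)^2/dtheta
        grad += pi_up * (2.0 * residual[:, None] * X[up]).mean(axis=0)
        # -(pi_up / n_up) * sum over upper-side of g(1, f(x))
        grad -= pi_up * X[up].mean(axis=0)
    if un.any():
        # (1 / n_un) * sum over unlabeled of g(1, f(x))
        grad += X[un].mean(axis=0)
    return grad

X_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
y_demo = np.array([1.0, -1.0])
grad_demo = one_side_gradient(np.zeros(2), X_demo, y_demo, pi_up=0.5)
```

Note that the labels of the unlabeled samples enter only through the split into `up` and `un`, never through their values.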

3.1. UNBIASEDNESS AND CONSISTENCY OF GRADIENT

Our learning algorithm based on the gradient in Eq. (9), which uses only upper-side data and unlabeled data, is justified as follows.

Theorem 1. Suppose that the loss function $L$ for the second term in Eq. (3) satisfies Eq. (6). Then, for any $f$, the gradient in Eq. (8) and its empirical approximation in Eq. (9) are unbiased to and consistent with the gradient of $L(f)$ in Eq. (3).

In other words, learning based on the gradient of Eq. (9), which uses only upper-side data and unlabeled data (one-side regression), asymptotically produces the same result as learning based on the gradient of $L(f)$ in Eq. (3), which uses both upper- and lower-side data (ordinary regression).

Proof. First, substituting Eq. (5) into the second and third terms in Eq. (8) gives

$$\frac{\partial L(f)}{\partial \theta} = \pi_{\mathrm{up}} \mathbb{E}_{\mathrm{up}}\!\left[\frac{\partial L(f(x), y)}{\partial \theta}\right] + \pi_{\mathrm{lo}} \mathbb{E}_{\mathrm{lo}}\big[g\big(\mathrm{sgn}(f(x) - \tilde{y}_{\mathrm{lo}}), f(x)\big)\big]. \tag{10}$$

Then, from the definitions of $\tilde{y}_{\mathrm{lo}}$ and $\mathbb{E}_{\mathrm{lo}}$, in both of which the label is always below $f(x)$,

$$\mathbb{E}_{\mathrm{lo}}\big[g\big(\mathrm{sgn}(f(x) - \tilde{y}_{\mathrm{lo}}), f(x)\big)\big] = \mathbb{E}_{\mathrm{lo}}\big[g\big(1, f(x)\big)\big] = \mathbb{E}_{\mathrm{lo}}\big[g\big(\mathrm{sgn}(f(x) - y), f(x)\big)\big], \tag{11}$$

and, thus, the gradient in Eq. (10) is essentially the same as the following gradient of the loss $L(f)$ in Eq. (3) for ordinary regression when we choose, for the second term in Eq. (3), a loss that satisfies Eq. (6):

$$\frac{\partial L(f)}{\partial \theta} = \pi_{\mathrm{up}} \mathbb{E}_{\mathrm{up}}\!\left[\frac{\partial L(f(x), y)}{\partial \theta}\right] + \pi_{\mathrm{lo}} \mathbb{E}_{\mathrm{lo}}\big[g\big(\mathrm{sgn}(f(x) - y), f(x)\big)\big]. \tag{12}$$

The gradient in Eq. (9) is also unbiased to and consistent with the gradient in Eq. (12), and its convergence rate is of the order $O_p(1/\sqrt{n_{\mathrm{up}}} + 1/\sqrt{n_{\mathrm{un}}})$ in accordance with the central limit theorem Chung (1968), where $O_p$ denotes the order in probability.
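The equality between Eqs. (10) and (12) can be checked on a sample. The sketch below assumes a linear model and the absolute loss for every term, and it estimates the expectation $\mathbb{E}$ in Eq. (8) with all $x$ in a cleanly labeled sample (an assumption of this sketch, chosen so that the identity holds exactly). Function names and toy data are illustrative.

```python
import numpy as np

def ordinary_abs_gradient(theta, X, y):
    """Gradient of the mean absolute loss using all labels (ordinary regression)."""
    f = X @ theta
    return (np.sign(f - y)[:, None] * X).mean(axis=0)

def one_side_abs_gradient(theta, X, y):
    """Gradient in the form of Eq. (8): upper-side labels plus label-free terms.

    Assumes the upper-side set is non-empty.
    """
    f = X @ theta
    up = f < y
    pi_up = up.mean()
    first = -X[up].mean(axis=0)   # absolute-loss gradient on the upper side (sgn = -1)
    second = X.mean(axis=0)       # E[g(1, f(x))], estimated over all x
    third = X[up].mean(axis=0)    # E_up[g(1, f(x))]
    return pi_up * first + second - pi_up * third

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.standard_normal(500)
theta = np.array([0.5, 0.0, 0.0, 1.0])

# Push lower-side labels further down; upper-side labels are untouched.
lower = y < X @ theta
y_corrupted = np.where(lower, y - 5.0, y)
```

The one-side gradient matches the ordinary one on clean labels and is unchanged when lower-side labels are pushed further down, which is the behaviour Theorem 1 relies on.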

3.2. IMPLEMENTATION OF LEARNING ALGORITHM BASED ON STOCHASTIC OPTIMIZATION

We scale up our algorithm based on Eq. (9) by stochastic approximation with $M$ mini-batches and add a regularization term, $R(f)$:

$$\frac{\partial L(f)}{\partial \theta} = \sum_{m=1}^{M} \Bigg[ \sum_{\{x, y\} \in \{X_{\mathrm{up}}^{\{m\}}, y_{\mathrm{up}}^{\{m\}}\}} \frac{\partial L(f(x), y)}{\partial \theta} + \rho \sum_{x \in X_{\mathrm{un}}^{\{m\}}} g\big(1, f(x)\big) - \sum_{x \in X_{\mathrm{up}}^{\{m\}}} g\big(1, f(x)\big) \Bigg] + \lambda \frac{\partial R(f)}{\partial \theta}, \tag{13}$$

where $\{X_{\mathrm{up}}^{\{m\}}, y_{\mathrm{up}}^{\{m\}}\}$ and $X_{\mathrm{un}}^{\{m\}}$ are the upper-side and unlabeled sets in the $m$-th mini-batch, respectively, $\lambda$ is a regularization parameter, and the regularization term $R(f)$ is, for example, the L1 or L2 norm of the parameters $\theta$. We also replace $n_{\mathrm{up}}/(\pi_{\mathrm{up}} n_{\mathrm{un}})$ with $\rho$, ignoring constant coefficients, and apply $\mathrm{sgn}(f(x) - \tilde{y}_{\mathrm{lo}}) = 1$. The hyperparameters $\rho$ and $\lambda$ are optimized in training. We can learn the regression function with the gradient in Eq. (13) by using any stochastic gradient method, such as Adam Kingma & Ba (2015) and FOBOS Duchi & Singer (2009). The algorithm is described in Algorithm 1. In the following experiments, we used Adam with the hyperparameters recommended in Kingma & Ba (2015), and the number of samples in the mini-batches was set to 32. By using the learned $f(x)$, we can estimate $\hat{y} = f(x)$ for new data.

From a practical perspective, the first term in Eqs. (9) and (13) requires that the estimations for the upper-side samples be higher than their label values as much as possible because $\sum_{\{x, y\} \in \{X_{\mathrm{up}}, y_{\mathrm{up}}\}} L(f(x), y)$ becomes zero when every estimation of the regression function $f(x)$ satisfies $y \le f(x)$. In contrast, the second term requires that the estimations for unlabeled samples simply be as small as possible. The third term balances the effect of the second term.

Our algorithm and discussions are also applicable to the opposite scenario, where the data are correctly labeled only on the lower side. Since the derivation follows by analogy with the upper one-side case, we only show the learning algorithm for the lower one-side case in the supplementary material. One example of the lower one-side case is when the label sensor has ideal coverage, but the cost of observation is high, and we need to mimic sensor values with other cheaper sensors having smaller coverage.
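A mini-batch update in the spirit of Eq. (13) and Algorithm 1 can be sketched as follows. This is not the implementation used in the experiments: it assumes a linear model, the squared loss for the first term, the absolute loss for the second and third terms, L1 regularization, and plain SGD with a fixed step size in place of Adam. The names `minibatch_gradient` and `fit_one_side`, the learning rate, the number of epochs, and the division by the batch size are all assumptions of this sketch.

```python
import numpy as np

def minibatch_gradient(theta, Xb, yb, rho, lam):
    """Accumulate the per-sample gradients of one mini-batch (linear f, L1 regularizer)."""
    f = Xb @ theta
    up = (f - yb) < 0   # upper-side samples: f(x) < y
    # Upper side: d(f - y)^2/dtheta - g(1, f(x)) = (2 (f - y) - 1) * x
    g_up = (2.0 * (f[up] - yb[up]) - 1.0)[:, None] * Xb[up]
    # Remaining samples are treated as unlabeled: rho * g(1, f(x)) = rho * x
    g_un = rho * Xb[~up]
    # Regularizer subgradient, added once per sample as in Algorithm 1
    reg = lam * np.sign(theta) * len(yb)
    return g_up.sum(axis=0) + g_un.sum(axis=0) + reg

def fit_one_side(X, y, rho=0.1, lam=1e-3, lr=1e-3, epochs=50, batch_size=32, seed=0):
    """Plain SGD over shuffled mini-batches (a stand-in for Adam)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            # Averaging over the batch is a normalization choice of this sketch.
            theta -= lr * minibatch_gradient(theta, X[idx], y[idx], rho, lam) / len(idx)
    return theta

rng = np.random.default_rng(1)
X_toy = rng.standard_normal((200, 5))
y_toy = X_toy @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)
theta_hat = fit_one_side(X_toy, y_toy)
```

In practice, `rho` and `lam` would be selected on a validation split, as described in Section 4.1.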

4. EXPERIMENTAL RESULTS

We now empirically test the effectiveness of the proposed approach. Our goal is to investigate the impact of our unbiased gradient, which is derived from the objective function based on the assumption of upper one-side labeled data in Eq. (4). We thus show how the proposed method improves performance against regression methods whose objective functions assume that both upper- and lower-side data are correctly labeled. We use the same model and optimization method for all of the methods, and the only difference is their objective functions.

4.1. EXPERIMENTAL SETUP AND DATASETS

We report the mean absolute error (MAE) and its standard error between the estimation results $\hat{y} = \{\hat{y}_n\}_{n=1}^{N}$ and the corresponding true labels $y$ across 5-fold cross-validation, each with a different randomly sampled training-testing split. MAE is defined as $\mathrm{MAE}(y, \hat{y}) = \frac{1}{N} \sum_{n=1}^{N} |y_n - \hat{y}_n|$. For each fold of the cross-validation, we used a randomly sampled 20% of the training set as a validation set to choose the best hyperparameters for each algorithm, where the hyperparameters providing the lowest MAE in the validation set were chosen. All of the experiments were carried out with a Python implementation on workstations having 48-80 GB of memory and 2.3-4.0 GHz CPUs. With this environment, the computational time was a few hours for producing the results for each dataset.

Algorithm 1 One-side regression based on stochastic gradient method
Input: Training data $D = \{x_n, y_n\}_{n=1}^{N}$ and hyperparameters $\rho, \lambda \ge 0$
Output: Model parameters $\theta$ for $f$
 1: Let $A$ be an external stochastic gradient method and $G_m$ be a gradient for the $m$-th mini-batch
 2: while no stopping criterion has been met
 3:   Shuffle $D$ into $M$ mini-batches, and denote by $\{X^{\{m\}}, y^{\{m\}}\}$ the $m$-th mini-batch whose size is $N_m$
 4:   for $m = 1$ to $M$
 5:     $G_m \leftarrow 0$
 6:     for $n = 1$ to $N_m$
 7:       if $f(x_n^{\{m\}}) - y_n^{\{m\}} < 0$ then
 8:         $G_m \leftarrow G_m + \frac{\partial L(f(x_n^{\{m\}}), y_n^{\{m\}})}{\partial \theta} - g(1, f(x_n^{\{m\}})) + \lambda \frac{\partial R(f)}{\partial \theta}$
 9:       else
10:         $G_m \leftarrow G_m + \rho\, g(1, f(x_n^{\{m\}})) + \lambda \frac{\partial R(f)}{\partial \theta}$
11:     Update $\theta$ by $A$ with $G_m$

Data 1: Synthetic dataset. Using a synthetic dataset, we investigated whether our algorithm could indeed learn from upper one-side labeled data. We randomly generated $N$ training samples, $X = \{x_n\}_{n=1}^{N}$, from the standard Gaussian distribution $\mathcal{N}(x_n; 0, I)$, where the number of samples was $N = 1{,}000$, the number of features in $x$ was $D = 10$, and $I$ is the identity matrix. Then, using $X$, we generated the corresponding $N$ true labels $y = \{y_n\}_{n=1}^{N}$ from the distribution $\mathcal{N}(y_n; w^\top x_n, \beta)$, where $w$ are coefficients that were also randomly generated from the standard Gaussian distribution $\mathcal{N}(w; 0, I)$, $\beta$ is the noise precision, and $\top$ denotes the transpose. For simulating the situation in which a label sensor has missing observations, we created training labels $y = \{y_n\}_{n=1}^{N}$ by randomly selecting $K$ percent of the data in $y$ and replacing their values with the minimum value of $y$. We finally added white Gaussian noise whose precision was the same as that of $y$ to the replaced $K$ percent of the data. We repeatedly evaluated the proposed method for each of the following settings. The noise precision was $\beta \in \{10^{0}, 10^{-1}\}$, which corresponded to low- and high-noise settings, and the proportion of missing training samples was $K \in \{25, 50, 75\}\%$. In the case of $K = 75\%$, only 25 percent of the samples correctly corresponded to labels, and all of the other samples were assigned labels that were lower than the corresponding true values. In general, it is quite hard to learn regression functions using such data.

In the experiment on Data 1, we used a linear model, $\theta^\top x$, for $f(x)$ and an implementation of Eq. (13) with the squared loss for the first term, the absolute loss, which satisfies Eq. (6), for the second and third terms, and L1-regularization for the regularization term. Loss functions that combine different forms in this way are often used in the literature, e.g., Huber loss Huber et al. (1964), epsilon-insensitive loss Vapnik (1995), and quantile losses Koenker & Bassett Jr (1978). We set the candidates of the hyperparameters, $\rho$ and $\lambda$, to $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$. We standardized the data by subtracting their mean and dividing by their standard deviation in the training split.

Data 2: Kaggle dataset with synthetic corruption. We used a real-world sensor dataset collected from the Kaggle dataset Sen (2016) that contains breathing signals. For this dataset, we used signals from a chest belt as $X = \{x_n\}_{n=1}^{N}$ and signals obtained by the Douglas bag (DB) method, which is the gold standard for measuring ventilation, as true labels $y = \{y_n\}_{n=1}^{N}$. The dataset consisted of $N = 1{,}432$ samples, and $x$ in each sample had $D = 2$ features, i.e., the period and height of the expansion/contraction of the chest. For our problem setting, we created training labels $y = \{y_n\}_{n=1}^{N}$ by randomly selecting $K$ percent of the data in $y$ and replacing their values with the minimum value of $y$. We finally added white Gaussian noise whose standard deviation was $0.1 \times s$ to the replaced $K$ percent of the data, where $s$ is the standard deviation of the original $y$. The setting for $K$ was the same as that of Data 1, $K \in \{25, 50, 75\}\%$. In the experiment on Data 2, because of its non-linearity, we used $\theta^\top \phi(x, \sigma)$ for $f(x)$, where $\phi$ is a radial basis function, and $\sigma$ is a hyperparameter representing the kernel width that is also optimized in the training split. We set the candidates of the hyperparameters, $\rho$, $\lambda$, and $\sigma$, to $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$. The other implementation was the same as that for Data 1.

Table 1: Comparison of proposed method and methods based on various objective functions in terms of MAE (smaller is better). We show best methods in bold.
(1) Data 1: Synthetic dataset
        Low-noise setting (β = 10^0)        High-noise setting (β = 10^-1)
        K = 25%   K = 50%   K = 75%         K = 25%   K = 50%   K = 75%
MSE     0.77±0

Data 3: Real-world UCI dataset. We here applied the algorithm to a real sensor dataset, which was collected from the UCI Machine Learning Repository Velloso (2013); Velloso et al. (2013). It contains sensor outputs from dumbbells and from wearable devices attached to the arm, forearm, and waist during exercise. We used all of the features from the dumbbell sensor that took "None" values less than ten times as $X = \{x_n\}_{n=1}^{N}$, where each sample had $D = 13$ features. We used the magnitude of acceleration on the arm as training labels $y = \{y_n\}_{n=1}^{N}$, which had insufficient data coverage and missing observations for the movements of other body parts. For testing, we used the magnitude of acceleration for the entire body as true labels $y = \{y_n\}_{n=1}^{N}$. Because there were five classes for the exercise task with severe mode changes between classes, we divided the dataset into five datasets on the basis of class: A ($N = 11{,}159$), B ($N = 7{,}593$), C ($N = 6{,}844$), D ($N = 6{,}432$), and E ($N = 7{,}214$). In the experiment on Data 3, we used a 6-layer multilayer perceptron with ReLU Nair & Hinton (2010) (more specifically, D-100-100-100-100-1) as $f(x)$ in order to demonstrate the usefulness of the proposed method in training deep neural networks. We also used dropout Srivastava et al. (2014) with a rate of 50% after each fully connected layer. We used two implementations for the first term in Eq. (13), one with the absolute loss (Proposed-1) and one with the squared loss (Proposed-2). For both implementations, we used the absolute loss, which satisfies Eq. (6), for the second and third terms and used L1-regularization for the regularization term. We set the candidates of the hyperparameters, $\rho$ and $\lambda$, to $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$. The other implementation was the same as that for Data 1.
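The label corruption described for Data 1 and the MAE metric can be sketched as below. This is an illustration rather than the generation script used for the reported results: the function names are hypothetical, and the sketch assumes that the noise added to the replaced labels has the same precision $\beta$ as the label noise, which is one reading of the description above.

```python
import numpy as np

def make_synthetic(n=1000, d=10, beta=1.0, k_percent=50, seed=0):
    """Generate X, true labels, and upper one-side corrupted training labels."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    w = rng.standard_normal(d)
    noise_std = 1.0 / np.sqrt(beta)   # beta is a precision (inverse variance)
    y_true = X @ w + noise_std * rng.standard_normal(n)

    # Replace K percent of the labels with the minimum label plus Gaussian noise.
    y_train = y_true.copy()
    n_missing = int(round(n * k_percent / 100))
    missing = rng.choice(n, size=n_missing, replace=False)
    y_train[missing] = y_true.min() + noise_std * rng.standard_normal(n_missing)
    return X, y_true, y_train, missing

def mae(y, y_hat):
    """Mean absolute error between labels and estimations."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

X_syn, y_true_syn, y_train_syn, missing_syn = make_synthetic(beta=1.0, k_percent=50)
```

Evaluating a model against `y_true_syn` rather than `y_train_syn` mirrors the protocol above, in which the error is measured against the true labels.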

4.2. PERFORMANCE COMPARISON

Table 1-(1) and -(2) show the performance on Data 1 and Data 2 for the proposed method and an ordinary regression method that uses the mean squared error (MSE) as its objective function, assuming that both upper- and lower-side data are correctly labeled. This comparison shows whether our method could learn from upper one-side labeled data, from which the ordinary regression method could not learn. From Table 1-(1) and -(2), we can see that the overall performance of the proposed method was significantly better than that of MSE. We found that the performance of our method was not significantly affected by the increase in the proportion of missing training samples $K$, even for $K = 75\%$, unlike that of MSE. Table 1-(3) shows a more extensive comparison using the real-world UCI dataset (Data 3) between our methods, Proposed-1 and Proposed-2, and methods based on MSE, MAE, and Huber losses Huber et al. (1964); Narula & Wellington (1982); Wilcox (1997). The regression methods based on MAE and Huber losses are robust regression methods that assume symmetric label corruption. From Table 1-(3), we can see that the performance of Proposed-1 and Proposed-2 was better overall than that of the baselines. The robust regression methods did not improve in performance against MSE. In particular, Proposed-1 and Proposed-2 reduced the error by more than 20% and 30%, respectively, compared with the other methods on average.

Demonstration of unbiased learning and prediction. Figure 2-(1) compares the estimation results of the proposed method with true labels and those of MSE for the Kaggle dataset with synthetic corruption (Data 2). Since the ordinary regression method, MSE, regards both upper- and lower-side data as correctly labeled, we can see that it produced biased results due to the missing observations. The proposed method did not. Figure 2-(2) shows a comparison of the estimation results between the proposed method, Proposed-2, and MSE for the real-world UCI dataset (Data 3). For ease of viewing, we show the results for the first 1,000 samples of the class E data, where the errors of most of the methods were the lowest. Although MSE showed the lowest error among the baselines for the class E data, we can see that the predictions by MSE were somewhat biased and underfitted for the real data having our assumed nature. This was not the case for the proposed method.

4.3. REAL HEALTHCARE CASE STUDY

We show the results of a healthcare case study in the supplementary material, where we estimated the motion intensity of a participant that was measured accurately with an intrusive sensor wrapped around the wrist (ActiGraph) Tryon (2013); Mullaney et al. (1980); Webster et al. (1982); Cole et al. (1992) from non-intrusive bed sensors that were installed under a bed. The results showed that the intrusive sensor could be replaced with the non-intrusive ones, which would be quite useful for reducing the burden on users.

5. CONCLUSION

We formulated a one-side regression problem using upper-side labeled and lower-side unlabeled data and proposed a learning algorithm for it. We showed that our learning algorithm is unbiased to and consistent with ordinary regression that uses data labeled correctly in both upper-and lower-side cases. We developed a stochastic optimization method for implementing the algorithm. An experimental evaluation using synthetic and real-world datasets demonstrated that the proposed algorithm was significantly better than regression algorithms without the assumption of upper one-side labeled data.






Figure 1: One-side regression problem, where, due to missing observations, data are correctly labeled only above the regression line, i.e., on the upper one-side. The regression function must be learned in an unbiased and consistent manner to ordinary regression, where data are labeled correctly on both the upper and lower sides.



Figure 2: Comparison of estimation results of proposed method and those of MSE. Orange line represents estimation results of each method, and blue line represents true label values.

