SEMI-SUPERVISED REGRESSION WITH SKEWED DATA VIA ADVERSARIALLY FORCING THE DISTRIBUTION OF PREDICTED VALUES

Anonymous

Abstract

Advances in scientific fields such as drug discovery and material design are accompanied by extensive trial and error. However, generally only representative experimental results are reported. Because of this reporting bias, the distribution of labeled result data can deviate from the true distribution, and a regression model built on such skewed data can be erroneous. In this work, we propose a new approach to improve the accuracy of regression models trained on a skewed dataset. The method forces the regression outputs to follow the true distribution; the forcing algorithm regularizes the regression results while preserving the information in the training data. We assume that enough unlabeled data exist that follow the true distribution, and that the true distribution can be roughly estimated from domain knowledge or from a few samples. While training the neural networks that constitute the regression model, an adversarial network is used to force the distribution of predicted values to follow the estimated 'true' distribution. We evaluated the proposed approach on four real-world datasets (pLogP, Diamond, House, Elevators). On all four datasets, the proposed approach reduced the root mean squared error of the regression by approximately 55 to 75 percent compared to regression models without adjustment of the distribution.

1. INTRODUCTION

Advances in scientific fields such as drug discovery and material design are accompanied by extensive trial and error. However, generally only representative experimental results are chosen to be reported. As a consequence of this reporting bias, the distribution of the reported results can differ from the true distribution. For this reason, when data from the literature are used to train a regression model, its predictions may deviate from the true distribution because the model is derived from biased data (Lin et al., 2002; Galar et al., 2011).

Pharmaceutical development in particular is often affected by this problem. Quantitative structure-activity relationship (QSAR) modeling, including drug-target interaction (DTI) prediction, is consistently affected by bias in the reported experimental data, because the targeted range of a molecular property is usually clearly defined (Liu et al., 2015; Chen & Zhang, 2013). When regression is performed on such skewed data, the model often erroneously predicts that the target properties are satisfied; as a consequence, it is difficult to discover molecules that have the desired properties (Peng et al., 2017).

Active learning applications face a similar problem. Many active learning methods repeatedly select new data according to certain criteria and retrain a surrogate model (Lookman et al., 2019; Rouet-Leduc et al., 2016; Yuan et al., 2018). During this process, the data can become skewed by the selection criteria (de Mello, 2013; Prabhu et al., 2019). Despite this problem, however, few studies have tried to improve the accuracy of regression models trained on skewed data.

In this work, we propose a new approach to improve the accuracy of a regression model trained on skewed data. We assume the presence of enough unlabeled data that follow the true distribution, and that the true distribution can be roughly estimated using domain knowledge or a few examples.
We use a semi-supervised learning framework with an adversarial network to force the distribution of the regression output to resemble the assumed true distribution (Figure 1). At the same time, by sharing the front part of the regression model with the encoder of an adversarial autoencoder (AAE), the process of forcing the output distribution is regularized so that the information in the labeled data is represented stably. We created skewed datasets by selecting data that exceeded a certain threshold from four real-world datasets (pLogP, Diamond, House, Elevators), then evaluated the proposed approach on them. On each of the four datasets, the proposed approach reduced the root mean squared error (RMSE) of the regression model compared to a model trained using only the skewed data. We also verified that the proposed approach remains effective even when the estimate of the true distribution is imperfect.
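The training objective described above can be sketched in PyTorch. This is a minimal illustration under our own assumptions, not the exact implementation from this work: `Regressor`, `Discriminator`, and `training_losses` are hypothetical names, the AAE regularization branch is omitted for brevity, and the estimated true distribution is represented simply by a batch of values sampled from it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Regressor(nn.Module):
    """Hypothetical regressor mapping features to a scalar prediction."""
    def __init__(self, in_dim=8, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

class Discriminator(nn.Module):
    """Tries to distinguish predicted values from samples of the
    estimated 'true' label distribution (logit output)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, y):
        return self.net(y.unsqueeze(-1)).squeeze(-1)

def training_losses(reg, disc, x_lab, y_lab, x_unlab, y_true_samples):
    """Compute the two losses of one training step (sketch)."""
    bce = nn.BCEWithLogitsLoss()
    # Supervised regression loss on the (skewed) labeled data.
    loss_reg = F.mse_loss(reg(x_lab), y_lab)
    # Predicted values on unlabeled data, whose distribution we adjust.
    y_pred = reg(x_unlab)
    # Discriminator loss: real = samples from the estimated true
    # distribution, fake = the regressor's predicted values.
    loss_d = (bce(disc(y_true_samples), torch.ones_like(y_true_samples))
              + bce(disc(y_pred.detach()), torch.zeros(len(y_pred))))
    # Adversarial term: push the predicted-value distribution toward
    # the estimated true distribution by fooling the discriminator.
    loss_adv = bce(disc(y_pred), torch.ones(len(y_pred)))
    return loss_reg + loss_adv, loss_d
```

In a full training loop, `loss_d` would update only the discriminator and the combined regression/adversarial loss only the regressor, alternating the two updates as in standard GAN training.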

2. RELATED WORK

Semi-supervised learning (SSL) is a machine-learning strategy for learning from partially labeled datasets (Chapelle et al., 2009). In the field of SSL, various methods have been developed, including those using generative models (Kingma et al., 2014), graphs (Goldberg & Zhu, 2006), self-training (Rosenberg et al., 2005), and consistency regularization (Sohn et al., 2020). SSL can improve classification and regression models by exploiting the information in a large set of unlabeled data alongside a relatively small set of labeled data (Xie et al., 2019; Creswell et al., 2018; Dimokranitou, 2017; Rezagholiradeh & Haidar, 2018). These approaches generally assume that the labeled and unlabeled datasets are drawn from the same distribution, without distortion.



Figure 1: Architecture of a regression model with proposed approach

