SEMI-SUPERVISED REGRESSION WITH SKEWED DATA VIA ADVERSARIALLY FORCING THE DISTRIBUTION OF PREDICTED VALUES

Anonymous

Abstract

Advances in scientific fields such as drug discovery and material design are accompanied by numerous trials and errors. However, generally only representative experimental results are reported. Because of this reporting bias, the distribution of the labeled results can deviate from the true distribution, and a regression model built on such skewed data can be inaccurate. In this work, we propose a new approach to improve the accuracy of regression models trained on a skewed dataset. The method forces the regression outputs to follow the true distribution; this forcing regularizes the regression results while retaining the information in the training data. We assume that enough unlabeled data following the true distribution are available, and that the true distribution can be roughly estimated from domain knowledge or from a few samples. While training a neural network as the regression model, an adversarial network is used to force the distribution of predicted values to follow the estimated 'true' distribution. We evaluated the proposed approach on four real-world datasets (pLogP, Diamond, House, Elevators). On all four datasets, the proposed approach reduced the root mean squared error of the regression by around 55 to 75 percent compared with regression models trained without adjusting the distribution.
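The training objective summarized above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the logistic discriminator, the non-saturating adversarial term, the weight `lam`, and the synthetic data are all assumptions chosen for illustration. The sketch only evaluates the two losses, one for a discriminator that separates samples of the estimated true distribution from predicted values, and one for a regressor that combines the supervised error on skewed labels with the adversarial term on unlabeled inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(y, w, b):
    # Hypothetical logistic discriminator on scalar values:
    # estimates the probability that y was drawn from the target distribution.
    feats = np.stack([y, y ** 2], axis=1)
    return sigmoid(feats @ w + b)

def discriminator_loss(y_target, y_pred, w, b, eps=1e-12):
    # Binary cross-entropy: target samples are labeled 1, predictions 0.
    d_real = discriminator(y_target, w, b)
    d_fake = discriminator(y_pred, w, b)
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def regressor_loss(y_lab, pred_lab, pred_unlab, w, b, lam=1.0, eps=1e-12):
    # Supervised MSE on the (skewed) labeled data plus an assumed
    # non-saturating adversarial term on predictions for unlabeled inputs.
    mse = np.mean((pred_lab - y_lab) ** 2)
    adv = -np.mean(np.log(discriminator(pred_unlab, w, b) + eps))
    return mse + lam * adv

# Synthetic illustration (assumed data, not from the paper).
rng = np.random.default_rng(0)
x_all = rng.normal(size=1000)
y_all = 2.0 * x_all + 0.1 * rng.normal(size=1000)

reported = y_all > 1.0                 # reporting bias: only high values are labeled
x_lab, y_lab = x_all[reported], y_all[reported]
x_unlab = x_all                        # unlabeled inputs follow the true distribution
y_target = rng.normal(0.0, 2.0, 500)   # rough estimate of the true label distribution

a, c = 1.0, 1.0                        # arbitrary linear regressor y = a * x + c
w, b = np.array([0.0, -0.1]), 0.5      # arbitrary discriminator parameters

d_loss = discriminator_loss(y_target, a * x_unlab + c, w, b)
r_loss = regressor_loss(y_lab, a * x_lab + c, a * x_unlab + c, w, b)
print(f"discriminator loss: {d_loss:.3f}, regressor loss: {r_loss:.3f}")
```

In an actual training loop, the two losses would be minimized alternately with respect to the discriminator and regressor parameters; setting `lam` to zero recovers ordinary regression on the skewed labels.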

1. INTRODUCTION

Advances in scientific fields such as drug discovery and material design are accompanied by numerous trials and errors. However, generally only representative experimental results are chosen for reporting. As a consequence of this reporting bias, the distribution of the reported results can differ from the true distribution. Therefore, when data from the literature are used to train a regression model, the distribution of its predictions may differ from the true distribution because the model is derived from biased data (Lin et al., 2002; Galar et al., 2011). Pharmaceutical development in particular is often affected by this problem. Quantitative structure-activity relationship (QSAR) modeling, including drug-target interaction (DTI) prediction, is consistently affected by bias in the reported experimental data, because the targeted range of a molecular property is usually clearly defined (Liu et al., 2015; Chen & Zhang, 2013). When regression is performed on such skewed data, the model often erroneously predicts that the target properties are satisfied. As a consequence, it is difficult to discover molecules that have the desired properties (Peng et al., 2017). Active learning applications have a similar problem. Many active learning methods repeatedly select new data according to certain criteria and retrain the surrogate model (Lookman et al., 2019; Rouet-Leduc et al., 2016; Yuan et al., 2018). During this process, the data can be skewed according to the criteria

