DEEP POSITIVE UNLABELED LEARNING WITH A SEQUENTIAL BIAS

Abstract

For many domains, from video stream analytics to human activity recognition, only weakly labeled datasets are available. Worse yet, the given labels are often assigned sequentially, resulting in sequential bias. Current Positive Unlabeled (PU) classifiers, a state-of-the-art family of robust semi-supervised methods, are ineffective under sequential bias. In this work, we propose DeepSPU, the first method to address this sequential bias problem. DeepSPU tackles the two interdependent subproblems of learning both the latent labeling process and the true class likelihoods within one architecture. We achieve this by developing a novel iterative learning strategy aided by theoretically justified cost terms to avoid collapsing into a naive classifier. Our experimental studies demonstrate that DeepSPU outperforms state-of-the-art methods by over 10% on diverse real-world datasets.

1. INTRODUCTION

Motivation. State-of-the-art approaches for learning from data with only incomplete positive labels require an accurate estimate of the likelihood that any given positive instance receives a label, known as the propensity score. However, all existing approaches overlook the fact that the annotations given for sequential data are often clustered together, so the likelihood that a given instance is labeled depends on the labels of the surrounding instances. We refer to this as sequential bias. Overlooking this sequential bias results in an incorrect propensity score and significantly reduced classification performance. Ours is the first work to make this observation, and we propose the first solution to this open problem.

Human Activity Recognition (HAR) is a prime example of sequential bias in data. To collect HAR data, subjects are asked to report their activities while wearing mobile sensors. As study length increases (collection may take many days), participants leave many activities unlabeled. Additionally, wearable sensors record data rapidly, so large blocks of time get labeled consecutively, which also creates sequential bias. Many more applications, such as intrusion detection from video or illness prediction from medical records, have similar sequentially labeled data and are susceptible to sequential bias (Rodríguez-Moreno et al., 2019; Schaekermann et al., 2018). This is a crucial issue, as existing methods show drastically reduced accuracy when sequential bias is not accounted for (as demonstrated in our Experimental Results).

State-of-the-Art. Positive Unlabeled (PU) classifiers are a family of semi-supervised methods that learn from incompletely labeled data without requiring any labeled negative examples (Bekker & Davis, 2020; Elkan & Noto, 2008; Li & Liu, 2005; Hsieh et al., 2015; Du Plessis et al., 2015; Kiryo et al., 2017; Bekker & Davis, 2018a; Kato et al., 2019).
This is a key strength of PU methods because representative negative examples, typically required by semi-supervised methods, are often not feasible to acquire. For instance, in the HAR example, there are infinitely many activities that an individual is not performing at any given time. Consequently, participants are only expected to provide some positive labels for their activities (Vaizman et al., 2017). Unfortunately, existing PU methods make unrealistically restrictive simplifying assumptions about how the labels were applied. Specifically, they either assume that there is no bias in the labeling process (the probability of a sample being an unlabeled positive instance is uniform) (Elkan & Noto, 2008; Du Plessis et al., 2015; Kiryo et al., 2017) or assume that this probability depends only on the local attributes of each instance (Bekker & Davis, 2018a; Kato et al., 2019). This means that existing methods do not

