SEQUENTIAL DENSITY RATIO ESTIMATION FOR SIMULTANEOUS OPTIMIZATION OF SPEED AND ACCURACY

Abstract

Classifying sequential data as early and as accurately as possible is a challenging yet critical problem, especially when the sampling cost is high. One algorithm that achieves this goal is the sequential probability ratio test (SPRT), which is known to be Bayes-optimal: it keeps the expected number of data samples as small as possible, given a desired error upper bound. However, the original SPRT makes two critical assumptions that limit its application in real-world scenarios: (i) samples are independently and identically distributed, and (ii) the likelihood of the data being derived from each class can be calculated precisely. Here, we propose the SPRT-TANDEM, a deep neural network-based SPRT algorithm that overcomes the above two obstacles. The SPRT-TANDEM sequentially estimates the log-likelihood ratio of two alternative hypotheses by leveraging a novel Loss function for Log-Likelihood Ratio estimation (LLLR), while allowing correlations with up to N (∈ ℕ) preceding samples. In tests on one original and two public video databases (Nosaic MNIST, UCF101, and SiW), the SPRT-TANDEM achieves statistically significantly better classification accuracy than other baseline classifiers, with a smaller number of data samples. The code and Nosaic MNIST are publicly available at https://github.com/TaikiMiyagawa/SPRT-TANDEM.

1. INTRODUCTION

The sequential probability ratio test, or SPRT, was originally invented by Abraham Wald, and an equivalent approach was independently developed and used by Alan Turing in the 1940s (Good, 1979; Simpson, 2010; Wald, 1945). The SPRT calculates the log-likelihood ratio (LLR) of two competing hypotheses and updates the LLR every time a new sample is acquired, until the LLR reaches one of two thresholds for the alternative hypotheses (Figure 1). Wald and his colleagues proved that when sequential data are sampled from independently and identically distributed (i.i.d.) data, the SPRT minimizes the number of samples required to achieve the desired upper bounds on the false positive and false negative rates, comparably to the Neyman-Pearson test, known as the most powerful likelihood test (Wald & Wolfowitz, 1948) (see also Theorem (A.5) in Appendix A). Note that Wald used the i.i.d. assumption only to ensure a finite decision time (i.e., that the LLR reaches a threshold within finitely many steps) and to facilitate LLR calculation: the non-i.i.d. property does not affect other aspects of the SPRT, including the error upper bounds (Wald, 1947). More recently, Tartakovsky et al. verified that the non-i.i.d. SPRT is optimal, or at least asymptotically optimal as the sample size increases (Tartakovsky et al., 2014), opening up potential applications of the SPRT to non-i.i.d. data series. About 70 years after Wald's invention, neuroscientists found that neurons in a part of the primate brain called the lateral intraparietal cortex (LIP) show neural activities reminiscent of the SPRT (Kira et al., 2015); when a monkey sequentially collects random pieces of evidence to make a binary choice, LIP neurons show activities proportional to the LLR. Importantly, the time of the decision can be predicted from when the neural activity reaches a fixed threshold, the same as the SPRT's decision rule. Thus, the SPRT, the optimal sequential decision strategy, was re-discovered as an algorithm explaining primate brains' computing strategy. It remains an open question, however, what algorithm the brain uses when the sequential evidence is a correlated, non-i.i.d. series.

Figure 1: The SPRT updates the LLR every time a new sample (x^(t) at time t) is acquired, until the LLR reaches one of the two thresholds. For data that are easy to classify, the SPRT outputs an answer after taking only a few samples, whereas for difficult data, it takes in numerous samples in order to make a "careful" decision. For formal definitions and the optimality in early classification of time series, see Appendix A.

The SPRT is now used in several engineering applications (Cabri et al., 2018; Chen et al., 2017; Kulldorff et al., 2011). However, its i.i.d. assumption is too crude for many other real-world scenarios, including time-series classification, where data are highly correlated and key dynamic features for classification often extend across more than one data point, violating the i.i.d. assumption. Moreover, the LLR of the alternative hypotheses needs to be calculated as precisely as possible, which is infeasible in many practical applications.

In this paper, we overcome the above difficulties with an SPRT-based algorithm that Treats data series As an N-th orDEr Markov process (SPRT-TANDEM), aided by sequential probability density ratio estimation based on deep neural networks. A novel Loss function for Log-Likelihood Ratio estimation (LLLR) efficiently estimates the density ratio, letting the SPRT-TANDEM approach asymptotic Bayes-optimality (Appendix A.4). In other words, the LLLR optimizes classification speed and accuracy at the same time. The SPRT-TANDEM can classify non-i.i.d. data series with user-defined model complexity by changing N (∈ ℕ), the order of approximation, which defines the number of past samples on which the current sample depends.
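To make the stopping rule concrete, the following is a minimal, illustrative sketch (not the paper's implementation): Wald's thresholds are set from target error rates, and the per-step LLR is estimated from a classifier's posterior via Bayes' rule, log p(x|y=1)/p(x|y=0) = log p(y=1|x)/p(y=0|x) − log p(y=1)/p(y=0). This simple version treats samples as conditionally independent; the actual SPRT-TANDEM instead combines multi-frame posteriors under an N-th order Markov assumption. All function names and the toy posteriors below are hypothetical.

```python
import math

def wald_thresholds(alpha=0.01, beta=0.01):
    """Wald's approximate decision thresholds from target error rates
    alpha (false positive) and beta (false negative)."""
    lower = math.log(beta / (1 - alpha))   # accept H0 when LLR <= lower
    upper = math.log((1 - beta) / alpha)   # accept H1 when LLR >= upper
    return lower, upper

def posterior_to_llr(p1, prior1=0.5):
    """Estimate a per-step LLR from a classifier posterior p(y=1|x).
    Under a flat class prior the prior term vanishes."""
    return math.log(p1 / (1 - p1)) - math.log(prior1 / (1 - prior1))

def sequential_test(posteriors, alpha=0.01, beta=0.01):
    """Accumulate estimated LLRs until one threshold is crossed;
    force a decision by the sign of the LLR if the sequence ends."""
    lower, upper = wald_thresholds(alpha, beta)
    llr = 0.0
    for t, p1 in enumerate(posteriors, start=1):
        llr += posterior_to_llr(p1)
        if llr >= upper:
            return 1, t
        if llr <= lower:
            return 0, t
    return (1 if llr >= 0 else 0), len(posteriors)

# Toy posteriors drifting toward class 1 (e.g., from a trained network):
print(sequential_test([0.6, 0.7, 0.8, 0.9, 0.95, 0.97]))
```

Note how confident posteriors make the accumulated LLR cross a threshold after only a few steps, while ambiguous posteriors (near 0.5) delay the decision, which is exactly the "careful decision" behavior described above.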
By dynamically changing the number of samples used for classification, the SPRT-TANDEM maintains high classification accuracy while minimizing the sample size as much as possible. Moreover, the SPRT-TANDEM enables a user to flexibly control the speed-accuracy tradeoff without additional training, making it applicable to various practical applications.

We test the SPRT-TANDEM on our new database, Nosaic MNIST (NMNIST), in addition to the publicly available UCF101 action recognition database (Soomro et al., 2012) and the Spoofing in the Wild (SiW) database (Liu et al., 2018). Two-way analysis of variance (ANOVA; Fisher, 1925) followed by a Tukey-Kramer multi-comparison test (Tukey, 1949; Kramer, 1956) shows that our proposed SPRT-TANDEM provides statistically significantly higher accuracy than other fixed-length and variable-length classifiers with a smaller number of data samples, making Wald's SPRT applicable even to non-i.i.d. data series. Our contribution is fivefold:

1. We invented a deep neural network-based algorithm, SPRT-TANDEM, which enables Wald's SPRT on arbitrary sequential data without knowing the true LLR.
2. The SPRT-TANDEM extends the SPRT to non-i.i.d. data series without knowing the true LLR.
3. With a novel loss, the LLLR, the SPRT-TANDEM sequentially estimates the LLR to optimize speed and accuracy simultaneously.
4. The SPRT-TANDEM can control the speed-accuracy tradeoff without additional training.
5. We introduce Nosaic MNIST, a novel early-classification database.
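The training-free control of the speed-accuracy tradeoff follows from the fact that only the stopping thresholds change, not the LLR estimator. A minimal sketch, assuming precomputed cumulative LLR trajectories (e.g., produced by an already-trained estimator) and a symmetric threshold; all names and the toy numbers are hypothetical:

```python
def evaluate_thresholds(llr_trajectories, labels, thresholds):
    """Sweep a symmetric SPRT threshold over precomputed LLR trajectories.

    llr_trajectories[i][t] is the estimated cumulative LLR of sequence i
    after t+1 frames; labels[i] is in {0, 1}. No retraining is needed:
    only the stopping rule changes. If no threshold is crossed, a
    decision is forced by the sign of the final LLR.
    Returns {threshold: (accuracy, mean decision time)}.
    """
    results = {}
    for th in thresholds:
        correct, total_time = 0, 0
        for llrs, y in zip(llr_trajectories, labels):
            for t, llr in enumerate(llrs, start=1):
                if abs(llr) >= th or t == len(llrs):
                    pred = 1 if llr >= 0 else 0
                    correct += int(pred == y)
                    total_time += t
                    break
        n = len(labels)
        results[th] = (correct / n, total_time / n)
    return results

# Toy cumulative-LLR trajectories and ground-truth labels:
trajs = [[0.5, 1.2, 2.5], [-0.4, -1.1, -2.0], [0.2, -0.3, 1.5]]
labels = [1, 0, 1]
for th, (acc, mean_t) in evaluate_thresholds(trajs, labels, [1.0, 2.0]).items():
    print(th, acc, mean_t)
```

Raising the threshold trades later decisions for (in general) fewer errors; sweeping it over a validation set traces out the speed-accuracy curve without touching the network weights.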

2. RELATED WORK

The SPRT-TANDEM has multiple interdisciplinary intersections with other fields of research: Wald's classical SPRT, probability density estimation, neurophysiological decision making, and time-series

