SEQUENTIAL DENSITY RATIO ESTIMATION FOR SIMULTANEOUS OPTIMIZATION OF SPEED AND ACCURACY

Abstract

Classifying sequential data as early and as accurately as possible is a challenging yet critical problem, especially when a sampling cost is high. One algorithm that achieves this goal is the sequential probability ratio test (SPRT), which is known to be Bayes-optimal: it can keep the expected number of data samples as small as possible, given a desired error upper bound. However, the original SPRT makes two critical assumptions that limit its application in real-world scenarios: (i) samples are independently and identically distributed, and (ii) the likelihood of the data being derived from each class can be calculated precisely. Here, we propose the SPRT-TANDEM, a deep neural network-based SPRT algorithm that overcomes the above two obstacles. The SPRT-TANDEM sequentially estimates the log-likelihood ratio of two alternative hypotheses by leveraging a novel Loss function for Log-Likelihood Ratio estimation (LLLR), while allowing correlations across up to N (∈ N) preceding samples. In tests on one original and two public video databases, Nosaic MNIST, UCF101, and SiW, the SPRT-TANDEM achieves statistically significantly better classification accuracy than other baseline classifiers, with a smaller number of data samples. The code and Nosaic MNIST are publicly available at https://github.com/TaikiMiyagawa/SPRT-TANDEM.

We test the SPRT-TANDEM on our new database, Nosaic MNIST (NMNIST), in addition to the publicly available UCF101 action recognition database (Soomro et al., 2012) and the Spoofing in the Wild (SiW) database (Liu et al., 2018). Two-way analysis of variance (ANOVA; Fisher, 1925) followed by a Tukey-Kramer multi-comparison test (Tukey, 1949; Kramer, 1956) shows that our proposed SPRT-TANDEM provides statistically significantly higher accuracy than other fixed-length and variable-length classifiers at a smaller number of data samples, making Wald's SPRT applicable even to non-i.i.d. data series. Our contribution is fivefold:

1. We invented a deep neural network-based algorithm, SPRT-TANDEM, which enables Wald's SPRT on arbitrary sequential data without knowing the true LLR.
2. The SPRT-TANDEM extends the SPRT to non-i.i.d. data series without knowing the true LLR.
3. With a novel loss, the LLLR, the SPRT-TANDEM sequentially estimates the LLR to optimize speed and accuracy simultaneously.
4. The SPRT-TANDEM can control the speed-accuracy tradeoff without additional training.
5. We introduce Nosaic MNIST, a novel early-classification database.

1. INTRODUCTION

The sequential probability ratio test, or SPRT, was originally invented by Abraham Wald, and an equivalent approach was also independently developed and used by Alan Turing in the 1940s (Good, 1979; Simpson, 2010; Wald, 1945). The SPRT calculates the log-likelihood ratio (LLR) of two competing hypotheses and updates the LLR every time a new sample is acquired, until the LLR reaches one of the two thresholds for the alternative hypotheses (Figure 1). Wald and his colleagues proved that when sequential data are independently and identically distributed (i.i.d.), the SPRT can minimize the required number of samples to achieve the desired upper bounds of the false positive and false negative rates comparably to the Neyman-Pearson test, known as the most powerful likelihood test (Wald & Wolfowitz, 1948) (see also Theorem (A.5) in Appendix A). Note that Wald used the i.i.d. assumption only for ensuring a finite decision time (i.e., the LLR reaches a threshold within finitely many steps) and for facilitating LLR calculation: the non-i.i.d. property does not affect other aspects of the SPRT, including the error upper bounds (Wald, 1947). More recently, Tartakovsky et al. verified that the non-i.i.d. SPRT is optimal, or at least asymptotically optimal, as the sample size increases (Tartakovsky et al., 2014), opening the possibility of potential applications of the SPRT to non-i.i.d. data series. About 70 years after Wald's invention, neuroscientists found that neurons in the part of the primate brain called the lateral intraparietal cortex (LIP) showed neural activities reminiscent of the SPRT (Kira et al., 2015); when a monkey sequentially collects random pieces of evidence to make a binary choice, LIP neurons show activities proportional to the LLR. Importantly, the time of the decision can be predicted from when the neural activity reaches a fixed threshold, the same as the SPRT's decision rule.
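The mechanics just described, accumulating a per-sample LLR until it crosses one of two thresholds, fit in a few lines of code. Below is a minimal sketch of Wald's i.i.d. SPRT; the two unit-variance Gaussian hypotheses, the thresholds, and the helper names (`sprt`, `gauss_llr`) are illustrative assumptions, not the paper's setup:

```python
import random

def sprt(stream, log_lr, a0, a1):
    """Run Wald's i.i.d. SPRT on an iterable of samples.

    stream : iterable of scalar observations
    log_lr : function x -> log p(x|y=1) - log p(x|y=0)
    a0, a1 : absolute values of the lower / upper thresholds
    Returns (decision, stopping_time); decision is None if the
    stream is exhausted before either threshold is crossed.
    """
    llr, t = 0.0, 0
    for x in stream:
        t += 1
        llr += log_lr(x)       # accumulate evidence
        if llr >= a1:
            return 1, t        # accept H1
        if llr <= -a0:
            return 0, t        # accept H0
    return None, t

# Toy hypotheses: H1: x ~ N(0.5, 1), H0: x ~ N(-0.5, 1).  For unit-variance
# Gaussians with means m1, m0, the per-sample LLR is (m1-m0)*(x-(m1+m0)/2).
def gauss_llr(x, m1=0.5, m0=-0.5):
    return (m1 - m0) * (x - (m1 + m0) / 2.0)

random.seed(0)
decision, tau = sprt((random.gauss(0.5, 1.0) for _ in range(1000)),
                     gauss_llr, a0=3.0, a1=3.0)
```

With symmetric thresholds a_0 = a_1 = a, Wald's bounds put the error rates at roughly e^{-a}, so larger thresholds trade longer stopping times for fewer errors.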
[Figure 1: The SPRT updates the LLR every time a new sample (x^(t) at time t) is acquired, until the LLR reaches one of the two thresholds. For data that are easy to classify, the SPRT outputs an answer after taking only a few samples, whereas for difficult data, the SPRT takes in numerous samples in order to make a "careful" decision. For formal definitions and the optimality in early classification of time series, see Appendix A.]

Thus, the SPRT, the optimal sequential decision strategy, was re-discovered as an algorithm explaining the computing strategy of primate brains. It remains an open question, however, what algorithm the brain uses when the sequential evidence is a correlated, non-i.i.d. series. The SPRT is now used in several engineering applications (Cabri et al., 2018; Chen et al., 2017; Kulldorff et al., 2011). However, its i.i.d. assumption is too crude for it to be applied to other real-world scenarios, including time-series classification, where data are highly correlated and key dynamic features for classification often extend across more than one data point, violating the i.i.d. assumption. Moreover, the LLR of the alternative hypotheses needs to be calculated as precisely as possible, which is infeasible in many practical applications. In this paper, we overcome the above difficulties with an SPRT-based algorithm that Treats data series As an N-th orDEr Markov process (SPRT-TANDEM), aided by sequential probability density ratio estimation based on deep neural networks. A novel Loss function for Log-Likelihood Ratio estimation (LLLR) efficiently estimates the density ratio, letting the SPRT-TANDEM approach asymptotic Bayes-optimality (Appendix A.4). In other words, the LLLR optimizes classification speed and accuracy at the same time. The SPRT-TANDEM can classify non-i.i.d. data series with user-defined model complexity by changing N (∈ N), the order of approximation, which defines the number of past samples on which the given sample depends.
By dynamically changing the number of samples used for classification, the SPRT-TANDEM can maintain high classification accuracy while minimizing the sample size as much as possible. Moreover, the SPRT-TANDEM enables a user to flexibly control the speed-accuracy tradeoff without additional training, making it applicable to various practical applications.

2. RELATED WORK

A comprehensive review is left to Appendix B; in the following, we introduce the SPRT, probability density ratio estimation algorithms, and early classification of time series.

Sequential Probability Ratio Test (SPRT). The SPRT, denoted by δ*, is defined as the tuple of a decision rule and a stopping rule (Tartakovsky et al., 2014; Wald, 1947):

Definition 2.1. Sequential Probability Ratio Test (SPRT). Let λ_t be the LLR at time t, and let X^{(1,T)} := {x^{(t)}}_{t=1}^{T} be sequential data. Given the absolute values of the lower and upper decision thresholds, a_0 ≥ 0 and a_1 ≥ 0, the SPRT, δ* = (d*, τ*), is defined by the decision rule d* and stopping time τ*:

d*(X^{(1,T)}) = 1 if λ_{τ*} ≥ a_1; 0 if λ_{τ*} ≤ −a_0,    τ* = inf{T ≥ 0 | λ_T ∉ (−a_0, a_1)}.    (3)

We review the proof of optimality in Appendix A.4, while Figure 1 gives an intuitive explanation.

Probability density ratio estimation. Instead of estimating the numerator and denominator of a density ratio separately, probability density ratio estimation algorithms estimate the ratio as a whole, reducing the degrees of freedom for more precise estimation (Sugiyama et al., 2010; 2012). Two probability density ratio estimation algorithms closely related to our work are the probabilistic classification (Bickel et al., 2007; Cheng & Chu, 2004; Qin, 1998) and density fitting (Sugiyama et al., 2008; Tsuboi et al., 2009) approaches. As we show in Section 4 and Appendix E, the SPRT-TANDEM sequentially estimates the LLR by combining the two algorithms.

Early classification of time series.
To make the decision time as short as possible, algorithms for early classification of time series can handle data of variable length (Mori et al., 2018; 2016; Xing et al., 2009; 2012) in order to minimize high sampling costs (e.g., medical diagnostics (Evans et al., 2015; Griffin & Moorman, 2001) or stock crisis identification (Ghalwash et al., 2014)). Leveraging deep neural networks is no exception in the early classification of time series (Dennis et al., 2018; Suzuki et al., 2018). The long short-term memory (LSTM) variants LSTM-s/LSTM-m impose monotonicity on the classification score and an inter-class margin, respectively, to speed up action detection (Ma et al., 2016). Early and Adaptive Recurrent Label ESTimator (EARLIEST) combines reinforcement learning and a recurrent neural network to decide when to classify and assign a class label (Hartvigsen et al., 2019).

3. PROPOSED ALGORITHM: SPRT-TANDEM

In this section, we propose the TANDEM formula, which provides the N-th order approximation of the LLR with respect to posterior probabilities. The i.i.d. assumption of Wald's SPRT greatly simplifies the LLR calculation at the expense of the precise temporal relationship between data samples. On the other hand, incorporating a long correlation among multiple data points may improve the LLR estimation; however, calculating too long a correlation may be detrimental in the following cases. First, if a class signature is significantly shorter than the correlation length in consideration, uninformative data samples are included in calculating the LLR, resulting in a late or wrong decision (Campos et al., 2018). Second, long correlations require calculating a long range of backpropagation, which is prone to the vanishing gradient problem (Hochreiter et al., 2001). Thus, we relax the i.i.d. assumption by keeping only up to the N-th order correlation to calculate the LLR.

The TANDEM formula. Here, we introduce the TANDEM formula, which computes the approximated LLR, the decision value of the SPRT-TANDEM algorithm. The data series is approximated as an N-th order Markov process. For the complete derivation of the 0th (i.i.d.), 1st, and N-th order TANDEM formulas, see Appendix C. Given a maximum timestamp T ∈ N, let X^{(1,T)} := {x^{(t)}}_{t=1}^{T} and y ∈ {1, 0} be sequential data and a class label, respectively, where x^{(t)} ∈ R^{d_x} and d_x ∈ N. By using Bayes' rule with the N-th order Markov assumption, the joint LLR of the data at a timestamp t is written as follows:

log [ p(x^{(1)}, x^{(2)}, ..., x^{(t)} | y = 1) / p(x^{(1)}, x^{(2)}, ..., x^{(t)} | y = 0) ]
= Σ_{s=N+1}^{t} log [ p(y = 1 | x^{(s−N)}, ..., x^{(s)}) / p(y = 0 | x^{(s−N)}, ..., x^{(s)}) ]
− Σ_{s=N+2}^{t} log [ p(y = 1 | x^{(s−N)}, ..., x^{(s−1)}) / p(y = 0 | x^{(s−N)}, ..., x^{(s−1)}) ]
− log [ p(y = 1) / p(y = 0) ]    (4)

(see Equations (84) and (85) in Appendix C for the full formula).
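Since each term in the sums above is a log posterior ratio, and p(y = 0 | ·) = 1 − p(y = 1 | ·), Equation (4) can be evaluated directly from posterior outputs. A minimal sketch in plain Python; the dict-based layout of the posterior tables and the function name are our assumptions:

```python
import math

def tandem_llr(post_wide, post_narrow, prior1=0.5):
    """Evaluate the N-th order TANDEM formula (Equation (4)).

    post_wide   : dict {s: p(y=1 | x^(s-N), ..., x^(s))}   for s = N+1, ..., t
    post_narrow : dict {s: p(y=1 | x^(s-N), ..., x^(s-1))} for s = N+2, ..., t
    prior1      : p(y=1); 0.5 gives a flat prior (zero bias term)

    Returns the estimated log p(x^(1..t)|y=1) - log p(x^(1..t)|y=0).
    """
    def logit(p):                # log[p / (1 - p)]
        return math.log(p) - math.log(1.0 - p)

    llr = sum(logit(p) for p in post_wide.values())      # (N+1)-let terms
    llr -= sum(logit(p) for p in post_narrow.values())   # N-let terms
    llr -= logit(prior1)                                 # prior (bias) term
    return llr
```

For N = 0 with a flat prior, the narrow-window posteriors reduce to the prior itself, and the formula collapses to the familiar i.i.d. sum of per-frame posterior logits.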
Hereafter we use the terms k-let or multiplet to indicate the posterior probabilities p(y | x^{(1)}, ..., x^{(k)}) = p(y | X^{(1,k)}) that consider correlations across k data points. The first two terms of the TANDEM formula (Equation (4)), the (N+1)-let and N-let terms, have opposite signs, working in "tandem" to adjust each other in computing the LLR. The third term is a prior (bias) term. In the experiments, we assume a flat prior, i.e., a zero bias term, but a user may impose a non-flat prior to handle a biased class distribution in a dataset. The TANDEM formula can be interpreted as a realization of the probabilistic classification approach to probability density ratio estimation under an N-th order Markov assumption on the data series.

Neural network that calculates the TANDEM formula. The SPRT-TANDEM is designed to explicitly calculate the N-th order TANDEM formula to realize sequential density ratio estimation, which is the critical difference between our SPRT-TANDEM network and other architectures based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Figure 2 illustrates a conceptual diagram of the generalized network structure, shown with the 1st-order TANDEM formula for simplicity. The network consists of a feature extractor and a temporal integrator (highlighted by the red and blue boxes, respectively). They are arbitrary networks that a user can choose depending on the classification problem or available computational resources. The feature extractor and temporal integrator are trained separately because we find that this achieves better performance than an end-to-end approach (see also Appendix D). The feature extractor outputs single-frame features (e.g., outputs from a global average pooling layer), which are the input vectors of the temporal integrator.
The output vectors from the temporal integrator are transformed with a fully-connected layer into two-dimensional logits, which are then input to a softmax layer to obtain posterior probabilities. These are used to compute the LLR to run the SPRT (Equation (2)). Note that during the training phase of the feature extractor, the global average pooling layer is followed by a fully-connected layer for binary classification.

How to choose the hyperparameter N? By tuning the hyperparameter N, a user can efficiently boost model performance depending on the database; in Section 5, we change N to visualize model performance as a function of N. Here, we provide two ways to choose N. One is to choose N based on the specific time scale, a concept introduced in Appendix D, where we describe in detail how to guess the best N depending on the database. The other is to use a hyperparameter tuning algorithm, such as Optuna (Akiba et al., 2019), to choose N objectively. Optuna has multiple hyperparameter searching algorithms, the default of which is the Tree-structured Parzen Estimator (Bergstra et al., 2011). Note that tuning N is not computationally expensive, because N is related only to the temporal integrator, not the feature extractor. In fact, the temporal integrator trains much faster than the feature extractor: 9 min/epoch vs. 10 h/epoch (N = 49, NVIDIA RTX2080Ti, SiW database).
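The final logits-to-LLR step of the pipeline described above (fully-connected logits → softmax → posteriors → TANDEM formula) can be spelled out concretely. Below is a minimal sketch for the 1st-order case with a flat prior; the function names and the dict layout of the logits are our assumptions:

```python
import math

def softmax2(logits):
    """Two-dimensional logits -> (p(y=0|.), p(y=1|.))."""
    m = max(logits)
    e = [math.exp(z - m) for z in logits]   # subtract max for stability
    s = sum(e)
    return e[0] / s, e[1] / s

def llr_first_order(doublet_logits, singlet_logits):
    """1st-order TANDEM LLR from the temporal integrator's logits.

    doublet_logits[s] : logits for p(y | x^(s-1), x^(s)), s = 2, ..., t
    singlet_logits[s] : logits for p(y | x^(s-1)),        s = 3, ..., t
    A flat prior is assumed, so the bias term of Equation (4) vanishes.
    """
    def logit_ratio(logits):
        p0, p1 = softmax2(logits)
        return math.log(p1) - math.log(p0)

    return (sum(logit_ratio(z) for z in doublet_logits.values())
            - sum(logit_ratio(z) for z in singlet_logits.values()))
```

Note that for two-dimensional logits the softmax log-ratio equals the difference of the logits, so the LLR is numerically cheap once the posteriors are available.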

4. LLLR AND MULTIPLET CROSS-ENTROPY LOSS

Given a maximum timestamp T ∈ N and dataset size M ∈ N, let S := {(X_i^{(1,T)}, y_i)}_{i=1}^{M} be a sequential dataset. Training our network to calculate the TANDEM formula involves the following loss functions in combination: (i) the Loss for Log-Likelihood Ratio estimation (LLLR), L_LLR, and (ii) the multiplet cross-entropy loss, L_multiplet. The total loss is defined as

L_total = L_LLR + L_multiplet.    (5)


Figure 2: Conceptual diagram of the neural network for the SPRT-TANDEM, where the order of approximation is N = 1. The feature extractor (red) extracts the feature vector for classification and outputs it to the temporal integrator (blue). Note that the temporal integrator memorizes up to N preceding states in order to calculate the TANDEM formula (Equation (4)). The LLR is calculated using the estimated probability densities output from the temporal integrator. We use a hat (e.g., p̂) to highlight a quantity estimated by a neural network. Trainable weight parameters are shared across the boxes of the same color in the figure.

4.1. LOSS FOR LOG-LIKELIHOOD RATIO ESTIMATION (LLLR).

The SPRT is Bayes-optimal as long as the true LLR is available; however, the true LLR is often inaccessible under real-world scenarios. To empirically estimate the LLR with the TANDEM formula, we propose the LLLR:

L_LLR = (1 / MT) Σ_{i=1}^{M} Σ_{t=1}^{T} | y_i − σ( log [ p̂(x_i^{(1)}, x_i^{(2)}, ..., x_i^{(t)} | y = 1) / p̂(x_i^{(1)}, x_i^{(2)}, ..., x_i^{(t)} | y = 0) ] ) |,    (6)

where σ is the sigmoid function. We use p̂ to highlight a probability density estimated by a neural network. The LLLR minimizes the Kullback-Leibler divergence (Kullback & Leibler, 1951) between the estimated and the true densities, as we briefly discuss below. The full discussion is given in Appendix E due to the page limit.

Density fitting. First, we introduce KLIEP (Kullback-Leibler Importance Estimation Procedure; Sugiyama et al. (2008)), a density fitting approach to density ratio estimation (Sugiyama et al., 2010). KLIEP is an optimization problem for the Kullback-Leibler divergence between p(X | y = 1) and r̂(X) p(X | y = 0) with constraint conditions, where X and y are random variables corresponding to X_i^{(1,t)} and y_i, and r̂(X) := p̂(X | y = 1) / p̂(X | y = 0) is the estimated density ratio. Formally,

argmin_{r̂} [ KL( p(X | y = 1) || r̂(X) p(X | y = 0) ) ] = argmin_{r̂} [ −∫ dX p(X | y = 1) log r̂(X) ]    (7)

with the constraints 0 ≤ r̂(X) and ∫ dX r̂(X) p(X | y = 0) = 1. The first constraint ensures the positivity of the estimated density r̂(X) p(X | y = 0), while the second is the normalization condition. Applying the empirical approximation, we obtain the final optimization problem:

argmin_{r̂} [ (1 / M_1) Σ_{i ∈ I_1} −log r̂(X_i^{(1,t)}) ],  with  r̂(X_i^{(1,t)}) ≥ 0  and  (1 / M_0) Σ_{i ∈ I_0} r̂(X_i^{(1,t)}) = 1,    (8)

where I_1 := {i ∈ [M] | y_i = 1}, I_0 := {i ∈ [M] | y_i = 0}, M_1 := |I_1|, and M_0 := |I_0|.

Stabilization. The original KLIEP objective (8), however, is asymmetric with respect to p(X | y = 1) and p(X | y = 0). To recover the symmetry, we add (1 / M_0) Σ_{i ∈ I_0} −log( r̂(X_i^{(1,t)})^{−1} ) to the objective and impose an additional constraint (1 / M_1) Σ_{i ∈ I_1} r̂(X_i^{(1,t)})^{−1} = 1.
Besides, the symmetrized objective still has unbounded gradients, which cause instability in training. Therefore, we normalize the LLRs with the sigmoid function, obtaining the LLLR (6). We can also show that the constraints are effectively satisfied due to the sigmoid function. See Appendix E for the details. In summary, we have shown that the LLLR minimizes the Kullback-Leibler divergence between the true and the estimated densities and further stabilizes training by restricting the value of the LLR. Here we emphasize the contributions of the LLLR again. The LLLR enables stable LLR estimation and thus allows us to perform the SPRT, the algorithm optimizing two objectives: stopping time and accuracy. In previous works (Mori et al., 2018; Hartvigsen et al., 2020), on the other hand, these two objectives are achieved with separate loss sub-functions. Compared to KLIEP, the proposed LLLR statistically significantly boosts the performance of the SPRT-TANDEM (Appendix E.4). Moreover, an experiment on multivariate Gaussians with a simple toy model also shows that the LLLR minimizes the error between the estimated and the true density ratio (Appendix F).
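Given the estimated LLR trajectories, the LLLR of Equation (6) is a bounded, absolute-error style loss. A minimal reference implementation in plain Python; the nested-list input layout is our assumption:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lllr(llr_hat, labels):
    """Loss for Log-Likelihood Ratio estimation (Equation (6)).

    llr_hat : list of per-sequence lists; llr_hat[i][t-1] is the
              estimated LLR of sequence i after t frames
    labels  : list of binary labels y_i in {0, 1}
    """
    M = len(labels)
    T = len(llr_hat[0])
    total = 0.0
    for lam_i, y in zip(llr_hat, labels):
        for lam in lam_i:
            total += abs(y - sigmoid(lam))   # each summand lies in [0, 1]
    return total / (M * T)
```

Because the sigmoid squashes the LLR into (0, 1), each summand, and hence each gradient, stays bounded, which is exactly the stabilization discussed above.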

4.2. MULTIPLET CROSS-ENTROPY LOSS.

To further facilitate training the neural network, we add binary cross-entropy losses, though the LLLR alone suffices to estimate the LLR. We call them the multiplet cross-entropy loss, defined as:

L_multiplet := Σ_{k=1}^{N+1} L_{k-let},  where  L_{k-let} := (1 / M(T − N)) Σ_{i=1}^{M} Σ_{t=k}^{T−(N+1−k)} [ −log p̂(y_i | x_i^{(t−k+1)}, ..., x_i^{(t)}) ].

Minimizing the multiplet cross-entropy loss is equivalent to minimizing the Kullback-Leibler divergence between the estimated posterior k-let p̂(y_i | x_i^{(t−k+1)}, ..., x_i^{(t)}) and the true posterior p(y_i | x_i^{(t−k+1)}, ..., x_i^{(t)}) (shown in Appendix G), which is an objective consistent with the LLLR; thus the multiplet loss accelerates training. Note also that the multiplet loss optimizes all the logits output from the temporal integrator, unlike the LLLR.
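A reference implementation of the multiplet cross-entropy loss, assuming the k-let posteriors have already been computed by the temporal integrator (the dict layout and function name are our assumptions):

```python
import math

def multiplet_loss(posteriors, labels, N, T):
    """Multiplet cross-entropy loss.

    posteriors[k][i][t] : estimated p(y=1 | x^(t-k+1), ..., x^(t)) for
                          k = 1, ..., N+1 (nested dicts, an assumed layout)
    labels              : binary labels y_i in {0, 1}
    N, T                : Markov order and maximum timestamp
    """
    M = len(labels)
    total = 0.0
    for k in range(1, N + 2):                       # k-lets, k = 1..N+1
        for i, y in enumerate(labels):
            for t in range(k, T - (N + 1 - k) + 1):  # t = k..T-(N+1-k)
                p1 = posteriors[k][i][t]
                p = p1 if y == 1 else 1.0 - p1       # p-hat(y_i | ...)
                total += -math.log(p)                # cross-entropy term
    return total / (M * (T - N))
```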

5. EXPERIMENTS AND RESULTS

In the following experiments, we use two quantities as evaluation criteria: (i) balanced accuracy, the arithmetic mean of the true positive and true negative rates, and (ii) mean hitting time, the average number of data samples used for classification. Note that balanced accuracy is robust to class imbalance (Luque et al., 2019) and is equal to accuracy on balanced datasets. The evaluated databases are NMNIST, UCF101, and SiW. Training, validation, and test datasets are split and fixed throughout the experiments. We selected three early-classification models (LSTM-s (Ma et al., 2016), LSTM-m (Ma et al., 2016), and EARLIEST (Hartvigsen et al., 2019)) and one fixed-length classifier (3DResNet (Hara et al., 2017)) as baseline models. All the early-classification models share the same feature extractor as the SPRT-TANDEM for a fair comparison. Hyperparameters of all the models are optimized with Optuna unless otherwise noted, so that no model is disadvantaged by the choice of hyperparameters. See Appendix H for the search spaces and fixed final parameters. After fixing hyperparameters, experiments are repeated with different random seeds to obtain statistics. In each training run, we evaluate the validation set after each training epoch and save the weight parameters whenever the balanced accuracy on the validation set exceeds the largest value so far. The last saved weights are used as the model of that run. Model evaluation is performed on the test dataset. During the test stage of the SPRT-TANDEM, we use various values of the SPRT thresholds to obtain a range of balanced accuracy-mean hitting time combinations to plot a speed-accuracy tradeoff (SAT) curve. If all the samples in a video are used up, the thresholds are collapsed to a_1 = a_0 = 0 to force a decision. To objectively compare all the models with various trial numbers, we conducted a two-way ANOVA followed by a Tukey-Kramer multi-comparison test to compute statistical significance.
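The two evaluation criteria and the forced-decision rule can be made concrete with a short sketch; the following is a hypothetical evaluator for one point of the SAT curve under symmetric thresholds (a, a):

```python
def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of the true positive and true negative rates.
    Assumes both classes appear in y_true."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

def evaluate_threshold(llr_trajs, y_true, a):
    """One point of the SAT curve for symmetric thresholds (a, a).

    llr_trajs[i] is the estimated LLR of video i at frames 1..T.  If no
    threshold is crossed by the last frame, the thresholds collapse to
    a1 = a0 = 0 to force a decision, as described in the main text.
    Returns (balanced accuracy, mean hitting time).
    """
    preds, hits = [], []
    for traj in llr_trajs:
        for t, lam in enumerate(traj, start=1):
            if lam >= a or lam <= -a:
                preds.append(1 if lam >= a else 0)
                hits.append(t)
                break
        else:                         # forced decision at the last frame
            preds.append(1 if traj[-1] >= 0 else 0)
            hits.append(len(traj))
    return balanced_accuracy(y_true, preds), sum(hits) / len(hits)
```

Sweeping a from large to small traces the SAT curve: larger thresholds yield later but more accurate decisions.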
For the details of the statistical test, see Appendix I. We show our experimental results below. Due to space limitations, we show only representative results; for full details, see Appendix J. For our computing infrastructure, see Appendix K.

Nosaic MNIST (Noise + mosaic MNIST) database. We introduce a novel dataset, NMNIST, whose videos are buried in noise at the first frame and gradually denoised toward the last, 20th frame (see Appendix L for example data). The motivation for creating NMNIST instead of using a preexisting time-series database is as follows: for simple video databases such as Moving MNIST (MMNIST; Srivastava et al., 2015), each data sample contains so much information that well-trained classifiers can correctly classify a video with only one or two frames (see Appendix M for the results of the SPRT-TANDEM and LSTM-m on MMNIST). We design a parity classification task, classifying the digits 0-9 into an odd or even class. The training, validation, and test datasets contain 50,000, 10,000, and 10,000 videos with frames of size 28 × 28 × 1 (grayscale). Each pixel value is divided by 127.5 and then reduced by 1. The feature extractor of the SPRT-TANDEM is ResNet-110 (He et al., 2016a), with the final output reduced to 128 channels. The temporal integrator is a peephole-LSTM (Gers & Schmidhuber, 2000; Hochreiter & Schmidhuber, 1997) with hidden layers of 128 units. The total numbers of trainable parameters in the feature extractor and temporal integrator are 6.9M and 0.1M, respectively. We train 0th, 1st, 2nd, 3rd, 4th, 5th, 10th, and 19th order SPRT-TANDEM networks. LSTM-s / LSTM-m and EARLIEST use peephole-LSTM and LSTM, respectively, both with hidden layers of 128 units. 3DResNet has 101 layers with 128 final output channels so that the total number of trainable parameters (7.7M) is on the same order as that of the SPRT-TANDEM. Figure 3a and Table 1 show representative results of the experiment.
Figure 3d shows example LLR trajectories calculated with the 10th order SPRT-TANDEM. The SPRT-TANDEM outperforms the other baseline algorithms by large margins at all mean hitting times. The best performing model is the 10th order TANDEM, which achieves statistically significantly higher balanced accuracy than the other algorithms (p-value < 0.001). Is the proposed algorithm's superiority due to the SPRT-TANDEM successfully estimating the true LLR and thereby approaching asymptotic Bayes optimality? We discuss potential interpretations of the experimental results in Appendix D.

UCF101 action recognition database. The feature extractor is a ResNet (He et al., 2016b), with the final output reduced to 64 channels. The temporal integrator is a peephole-LSTM with hidden layers of 64 units. The total numbers of trainable parameters in the feature extractor and temporal integrator are 26K and 33K, respectively. We train 0th, 1st, 2nd, 3rd, 5th, 10th, 19th, 24th, and 49th-order SPRT-TANDEM networks. LSTM-s / LSTM-m and EARLIEST use peephole-LSTM and LSTM, respectively, both with hidden layers of 64 units. 3DResNet has 50 layers with 64 final output channels so that the total number of trainable parameters (52K) is on the same order as that of the SPRT-TANDEM. Figure 3b and Table 2 show representative results of the experiment. The best performing model is the 10th order TANDEM, which achieves statistically significantly higher balanced accuracy than the other models (p-value < 0.001). The superiority of the higher-order TANDEM indicates that a classifier needs to integrate longer temporal information in order to distinguish the two classes (also see Appendix D).

Spoofing in the Wild (SiW) database. We train 0th, 1st, 2nd, 3rd, 5th, 10th, 19th, 24th, and 49th-order SPRT-TANDEM networks. LSTM-s / LSTM-m and EARLIEST use peephole-LSTM and LSTM, respectively, both with hidden layers of 512 units. 3DResNet has 101 layers with 512 final output channels so that the total number of trainable parameters (5.3M) is on the same order as that of the SPRT-TANDEM.
Optuna is not applied due to the large database and network size. Figure 3c and Table 3 show representative results of the experiment. The best performing model achieves statistically significantly higher balanced accuracy than the other models (p-value < 0.001). The superiority of the lower-order TANDEM indicates that each video frame contains a large amount of the information necessary for classification, imposing less need to collect a large number of frames (also see Appendix D).

Ablation study. To understand the contributions of L_LLR and L_multiplet to the SAT curve, we conduct an ablation study. The 1st-order SPRT-TANDEM is trained on NMNIST with L_LLR only, L_multiplet only, and both L_LLR and L_multiplet. The hyperparameters of the three models are independently optimized using Optuna (see Appendix H). Figure 3e shows the three SAT curves. The result shows that L_LLR leads to higher classification accuracy, whereas L_multiplet enables faster classification. The best performance is obtained by using both L_LLR and L_multiplet. We confirmed the same tendency with the 19th order SPRT-TANDEM, as shown in Appendix N.

SPRT vs. Neyman-Pearson test. As we discuss in Appendix A, the Neyman-Pearson test is the optimal likelihood ratio test with a fixed number of samples. The SPRT, on the other hand, takes a flexible number of samples for earlier decisions. To experimentally test this prediction, we compare the SPRT-TANDEM and the corresponding Neyman-Pearson test. The Neyman-Pearson test classifies all the data into two classes at each fixed number of frames, using the estimated LLRs with threshold λ = 0. The results support the theoretical prediction, as shown in Figure 3f: the Neyman-Pearson test needs a larger number of samples than the SPRT-TANDEM.
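The SPRT-vs-Neyman-Pearson comparison can be mimicked with a toy simulation in which cumulative Gaussian random walks stand in for the estimated LLR trajectories (an illustrative assumption, not the paper's experiment; all function names are hypothetical):

```python
import random

def simulate_llr_trajs(n, T, drift, seed):
    """Toy cumulative LLR trajectories with i.i.d. Gaussian increments,
    standing in for the network's estimated LLRs."""
    rng = random.Random(seed)
    trajs = []
    for _ in range(n):
        s, traj = 0.0, []
        for _ in range(T):
            s += rng.gauss(drift, 1.0)
            traj.append(s)
        trajs.append(traj)
    return trajs

def np_test_accuracy(pos, neg, t):
    """Neyman-Pearson test: classify every sequence at the fixed frame t
    by thresholding its LLR at 0."""
    correct = sum(traj[t - 1] >= 0 for traj in pos)
    correct += sum(traj[t - 1] < 0 for traj in neg)
    return correct / (len(pos) + len(neg))

def sprt_accuracy(pos, neg, a):
    """Symmetric SPRT with thresholds (a, -a); forced decision by the
    sign of the final LLR if no threshold is crossed.
    Returns (accuracy, mean hitting time)."""
    correct, hits, n = 0, 0, 0
    for label, trajs in ((1, pos), (0, neg)):
        for traj in trajs:
            pred, tau = (1 if traj[-1] >= 0 else 0), len(traj)
            for t, lam in enumerate(traj, start=1):
                if lam >= a or lam <= -a:
                    pred, tau = (1 if lam >= a else 0), t
                    break
            correct += (pred == label)
            hits += tau
            n += 1
    return correct / n, hits / n
```

In such simulations the fixed-sample test typically needs more frames than the SPRT's mean hitting time to reach the same accuracy, in line with the trend in Figure 3f.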

6. CONCLUSION

We presented the SPRT-TANDEM, a novel algorithm that makes Wald's SPRT applicable to arbitrary sequential data without knowing the true LLR. Leveraging deep neural networks and the novel loss function, the LLLR, the SPRT-TANDEM minimizes the distance between the true LLR and the LLR sequentially estimated with the TANDEM formula, enabling simultaneous optimization of speed and accuracy. Tested on three publicly available databases, the SPRT-TANDEM achieves statistically significantly higher accuracy than other existing algorithms with a smaller number of data points.

APPENDIX A THEORETICAL ASPECTS OF THE SEQUENTIAL PROBABILITY RATIO TEST

In this section, we review the mathematical background of the SPRT, following the discussion in Tartakovsky et al. (2014). First, we define the SPRT based on measure theory and introduce Stein's lemma, which assures the termination of the SPRT. To define the optimality of the SPRT, we introduce two performance metrics that measure the false alarm rate and the expected stopping time, and discuss their tradeoff, which the SPRT solves. Throughout this analysis, we utilize two important approximations, the asymptotic approximation and the no-overshoot approximation, which play essential roles in simplifying our analysis. The asymptotic approximation assumes the upper and lower thresholds are infinitely far away from the origin, which is equivalent to making the most careful decision to reduce the error rate, at the expense of the stopping time. On the other hand, the no-overshoot approximation assumes that we can neglect the threshold overshoots of the likelihood ratio. Next, we show the superiority of the SPRT to the Neyman-Pearson test, using a simple Gaussian model. The Neyman-Pearson test is known to be optimal in the two-hypothesis testing problem with a fixed sample size and is often compared with the SPRT. Finally, we introduce several types of optimality conditions of the SPRT.

A.1 PRELIMINARIES

Notations. Let (Ω, F, P) be a probability space; Ω is a sample space, F ⊂ P(Ω) is a sigma-algebra of Ω, where P(A) denotes the power set of a set A, and P is a probability measure. Intuitively, Ω represents the set of all the elementary events under consideration, e.g., Ω = {all the possible elementary events such that "a human is walking through a gate."}.
F is defined as a set of subsets of Ω and stands for all the possible combinations of the elementary events; e.g., F contains events such as {"Akinori is walking through the gate at the speed of 80 m/min," "Taiki is walking through the gate at the speed of 77 m/min," or "Nothing happened."}. P : F → [0, 1] is a probability measure, a function that is normalized and countably additive; i.e., P(A) measures the probability that the event A ∈ F occurs. A random variable X is defined as a measurable function from Ω to a measurable space, practically R^d (d ∈ N); e.g., if ω (∈ Ω) is "Taiki is walking through the gate with a big smile," then X(ω) may be 100 frames of color images with 128 × 128 pixels (d = 128 × 128 × 3 × 100), i.e., a video recorded with a camera attached to the top of the gate. The probability that a random variable X takes a set of values S ⊂ R^d is defined as P(X ∈ S) := P(X^{−1}(S)), where X^{−1}(S) is the preimage of S under X. By definition of a measurable function, X^{−1}(S) ∈ F for all measurable S ⊂ R^d. Let {F_t}_{t≥0} be a filtration. By definition, {F_t}_{t≥0} is a non-decreasing sequence of sub-sigma-algebras of F; i.e., F_s ⊂ F_t ⊂ F for all s and t such that 0 < s < t. Each element of the filtration can be interpreted as the information available at a given time t. (Ω, F, {F_t}_{t≥0}, P) is called a filtered probability space. As in the main manuscript, let X^{(1,T)} := {x^{(t)}}_{t=1}^{T} be sequential data sampled from the density p, where T ∈ N ∪ {∞}. For each t ∈ [T], x^{(t)} ∈ R^{d_x}, where d_x ∈ N is the dimensionality of the input data. In the i.i.d. case, p(X^{(1,T)}) = Π_{t=1}^{T} f(x^{(t)}), where f is the density of x^{(1)}. For each time-series data point X^{(1,T)}, the associated label y takes the value 1 or 0; we focus on binary classification, or equivalently two-hypothesis testing, throughout this paper. When y is a class label, p(X^{(1,T)} | y) is the likelihood density function. Note that X^{(1,T)} with label y is sampled according to the density p(X^{(1,T)} | y).
Our goal is, given a sequence X^{(1,T)}, to identify which of the two densities p_1 or p_0 the sequence X^{(1,T)} is sampled from; formally, to test the two hypotheses H_1 : y = 1 and H_0 : y = 0 given X^{(1,T)}. The decision function or test of a stochastic process X^{(1,T)} is denoted by d(X^{(1,T)}) : Ω → {1, 0}. For each realization of X^{(1,T)}, we can identify this definition with d : R^{d_x × T} → {1, 0}, i.e., X^{(1,T)} ↦ y, where y ∈ {1, 0}. Thus we write d instead of d(X^{(1,T)}), for simplicity. The stopping time of X^{(1,T)} with respect to a filtration {F_t}_{t≥0} is defined as τ := τ(X^{(1,T)}) : Ω → R_{≥0} such that {ω ∈ Ω | τ(ω) ≤ t} ∈ F_t. Accordingly, for fixed T ∈ N ∪ {∞} and y ∈ {1, 0}, {d = y} denotes the set of time-series data for which the decision function accepts the hypothesis H_y with a finite stopping time; more specifically, {d = y} = {ω ∈ Ω | d(X^{(1,T)})(ω) = y, τ(X^{(1,T)})(ω) < ∞}. The decision rule δ is defined as the doublet (d, τ). Let Λ_T := Λ(X^{(1,T)}) := p(X^{(1,T)} | y = 1) / p(X^{(1,T)} | y = 0) and λ_T := log Λ_T be the likelihood ratio and the log-likelihood ratio of X^{(1,T)}, respectively. In the i.i.d. case, Λ_T = Π_{t=1}^{T} [ p(x^{(t)} | y = 1) / p(x^{(t)} | y = 0) ] = Π_{t=1}^{T} Z^{(t)}, where p(X^{(1,T)} | y) = Π_{t=1}^{T} p(x^{(t)} | y) (y ∈ {1, 0}) and Z^{(t)} := p(x^{(t)} | y = 1) / p(x^{(t)} | y = 0).

A.2 DEFINITION AND THE TRADEOFF OF FALSE ALARMS AND STOPPING TIME

Let us overview the theoretical structure of the SPRT. In the following, we assume that the time-series data points are i.i.d. unless otherwise stated.

Definition of the SPRT. The sequential probability ratio test (SPRT), denoted by $\delta^*$, is defined as a doublet of a decision function and a stopping time.

Definition A.1. Sequential probability ratio test (SPRT). Let $a_0 = -\log A_0 \ge 0$ and $a_1 = \log A_1 \ge 0$ be (the absolute values of) a lower and an upper threshold, respectively. Then $\delta^* = (d^*, \tau^*)$, where

$$d^*(X^{(1,T)}) = \begin{cases} 1 & \text{if } \lambda_{\tau^*} \ge a_1 \\ 0 & \text{if } \lambda_{\tau^*} \le -a_0 \end{cases} , \qquad \tau^* = \inf\{T \ge 0 \,|\, \lambda_T \notin (-a_0, a_1)\} .$$

Note that $d^*$ and $\tau^*$ implicitly depend on the stochastic process $X^{(1,T)}$. In general, a doublet $\delta$ of a terminal decision function and a stopping time is called a decision rule or a hypothesis test.

Termination. The i.i.d. SPRT terminates with probability one, and all the moments of the stopping time are finite, provided that the two hypotheses are distinguishable:

Lemma A.1. Stein's lemma. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{Y^{(t)}\}_{t \ge 1}$ be a sequence of i.i.d. random variables under $P$. Define $\tau := \inf\{T \ge 1 \,|\, \sum_{t=1}^T Y^{(t)} \notin (-a_0, a_1)\}$. If $P(Y^{(1)} = 0) < 1$, the stopping time $\tau$ is exponentially bounded; i.e., there exist constants $C > 0$ and $0 < \rho < 1$ such that $P(\tau > T) \le C \rho^T$ for all $T \ge 1$. Therefore, $P(\tau < \infty) = 1$ and $\mathbb{E}[\tau^k] < \infty$ for all $k > 0$.

Two performance metrics. Considering the two-hypothesis testing, we employ two kinds of performance metrics to evaluate the efficiency of decision rules from complementary points of view: the false alarm rate and the stopping time. The first kind is the operation characteristic, denoted by $\beta(\delta, y)$, and its related metrics. The operation characteristic is the probability of the decision being 0 as a function of the true label $y$; formally,

Definition A.2. Operation characteristic. The operation characteristic is the probability of accepting the hypothesis $H_0$ as a function of $y$: $\beta(\delta, y) := P(d = 0 \,|\, y)$.
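Definition A.1 translates directly into a few lines of code. The following is a minimal sketch of the SPRT decision loop (function and variable names are ours); it consumes a stream of per-step LLR increments and stops as soon as the cumulative LLR leaves the open interval $(-a_0, a_1)$:

```python
def sprt(llr_increments, a0, a1):
    """Wald's SPRT (Definition A.1): accumulate per-step LLR increments
    until the cumulative sum leaves (-a0, a1).
    Returns (decision, stopping_time); decision is None if the stream
    ends before either threshold is hit (truncated horizon)."""
    lam, t = 0.0, 0
    for t, z in enumerate(llr_increments, start=1):
        lam += z
        if lam >= a1:    # accept H1
            return 1, t
        if lam <= -a0:   # accept H0
            return 0, t
    return None, t
```

For example, `sprt([0.6, 0.6], a0=1.0, a1=1.0)` stops at the second sample and returns `(1, 2)`, while `sprt([0.1], a0=1.0, a1=1.0)` returns `(None, 1)` because the single increment never reaches a threshold.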
Using the operation characteristic, we can define four statistical measures based on the confusion matrix, namely the False Positive Rate (FPR), False Negative Rate (FNR), True Negative Rate (TNR), and True Positive Rate (TPR):

$$\text{FPR: } \alpha_0(\delta) := 1 - \beta(\delta, 0) = P(d = 1 \,|\, y = 0) \tag{15}$$

$$\text{FNR: } \alpha_1(\delta) := \beta(\delta, 1) = P(d = 0 \,|\, y = 1) \tag{16}$$

$$\text{TNR: } \beta(\delta, 0) = 1 - \alpha_0(\delta) = 1 - P(d = 1 \,|\, y = 0) \tag{17}$$

$$\text{TPR: } 1 - \beta(\delta, 1) = 1 - \alpha_1(\delta) = 1 - P(d = 0 \,|\, y = 1) \tag{18}$$

Note that the balanced accuracy is written as $(1 + \beta(\delta, 0) - \beta(\delta, 1))/2$ in this notation. The second kind of metric is the mean hitting time, defined as the expected stopping time of the decision rule:

Definition A.3. Mean hitting time. The mean hitting time is the expected number of time-series data points that are necessary for testing a hypothesis when the true parameter value is $y$: $\mathbb{E}_y[\tau] = \int_\Omega \tau \, dP(\cdot \,|\, y)$. The mean hitting time is also referred to as the expected sample size or the average sample number.

There is a tradeoff between the false alarm rate and the mean hitting time. For example, quickness may be sacrificed if we use a decision rule $\delta$ that makes careful decisions, i.e., with a false alarm rate $1 - \beta(\delta, 0)$ less than some small constant. On the other hand, if we use a $\delta$ that makes quick decisions, then $\delta$ may make careless decisions, i.e., raise many false alarms, because the amount of evidence is insufficient. At the end of this section, we show that the SPRT is optimal in the sense of this tradeoff.

The tradeoff of false alarms and stopping times for both i.i.d. and non-i.i.d. cases. We formulate the tradeoff between the false alarm rate and the stopping time. We can derive the fundamental relation of the thresholds to the operation characteristic in both i.i.d. and non-i.i.d. cases (Tartakovsky et al. (2014)):

$$\alpha^*_1 \le e^{-a_0}(1 - \alpha^*_0) , \qquad \alpha^*_0 \le e^{-a_1}(1 - \alpha^*_1) , \tag{19}$$

where we define $\alpha^*_y := \alpha_y(\delta^*)$ ($y \in \{1, 0\}$). These inequalities essentially represent the tradeoff between the false alarm rate and the stopping time.
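Inverting these relations (in the no-overshoot regime discussed next) gives the classic recipe for choosing thresholds from target error rates, $a_0 \approx \log\frac{1-\alpha_0}{\alpha_1}$ and $a_1 \approx \log\frac{1-\alpha_1}{\alpha_0}$. A minimal sketch (function name is ours):

```python
import math

def wald_thresholds(alpha0, alpha1):
    """Thresholds from target error rates under the no-overshoot
    approximation: a0 ~ log((1-alpha0)/alpha1), a1 ~ log((1-alpha1)/alpha0)."""
    a0 = math.log((1 - alpha0) / alpha1)
    a1 = math.log((1 - alpha1) / alpha0)
    return a0, a1
```

With both target error rates at 1%, this gives symmetric thresholds $a_0 = a_1 = \log 99 \approx 4.6$; substituting them back into the no-overshoot error-rate formulas recovers the targets.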
For example, as the thresholds $a_y$ ($y \in \{1, 0\}$) increase, the false alarm rate and the false rejection rate decrease, as (19) suggests, but the stopping time is likely to be larger, because more observations are needed for the accumulated log-likelihood ratios to hit the larger thresholds.

The asymptotic approximation and the no-overshoot approximation. Equation (19) is one example of the tradeoff between the false alarm rate and the stopping time; we can derive another example in terms of the mean hitting time. Before that, we introduce two types of approximations that simplify our analysis. The first is the no-overshoot approximation, which ignores the overshoot of the log-likelihood ratio over the threshold at the decision time. This approximation is valid when the log-likelihood ratio of a single frame is sufficiently small compared to the gap between the thresholds, at least around the decision time. The second is the asymptotic approximation, which assumes $a_0, a_1 \to \infty$; this is equivalent to sufficiently low false alarm and false rejection rates at the expense of the stopping time. These approximations drastically facilitate the theoretical analysis; in fact, the no-overshoot approximation turns (19) into (see Tartakovsky et al. (2014)):

$$\alpha^*_1 \approx e^{-a_0}(1 - \alpha^*_0) , \qquad \alpha^*_0 \approx e^{-a_1}(1 - \alpha^*_1) , \tag{20}$$

which is equivalent to

$$\alpha^*_0 \approx \frac{e^{a_0} - 1}{e^{a_0 + a_1} - 1} , \qquad \alpha^*_1 \approx \frac{e^{a_1} - 1}{e^{a_0 + a_1} - 1} \tag{21}$$

$$\iff \quad -a_0 \approx \log \frac{\alpha^*_1}{1 - \alpha^*_0} , \qquad a_1 \approx \log \frac{1 - \alpha^*_1}{\alpha^*_0} \tag{22}$$

$$\iff \quad \beta^*(0) \approx \frac{e^{a_1} - 1}{e^{a_1} - e^{-a_0}} , \qquad \beta^*(1) \approx \frac{e^{-a_1} - 1}{e^{-a_1} - e^{a_0}} , \tag{23}$$

where $\beta^*(y) := \beta(\delta^*, y)$ ($y \in \{1, 0\}$). Further assuming the asymptotic approximation, we obtain

$$\alpha^*_0 \approx e^{-a_1} , \qquad \alpha^*_1 \approx e^{-a_0} . \tag{24}$$

Therefore, as the threshold gap increases, the false alarm rate and the false rejection rate decrease exponentially, while the decision making becomes slower, as is shown in the following.

Mean hitting time without overshoots.
Let $I_1 := \mathbb{E}_1[\log Z^{(1)}]$ and $I_0 := \mathbb{E}_0[\log(1/Z^{(1)})]$ be the Kullback-Leibler divergences of $f_1$ from $f_0$ and of $f_0$ from $f_1$, respectively. $I_y$ is larger if the two densities are more distinguishable. Note that $I_y > 0$ since $P_y(Z^{(1)} = 1) < 1$, and thus the mean hitting times of the SPRT without overshoots are expressed as

$$\mathbb{E}_1[\tau^*] = \frac{1}{I_1} \left[ (1 - \alpha^*_1) \log \frac{1 - \alpha^*_1}{\alpha^*_0} - \alpha^*_1 \log \frac{1 - \alpha^*_0}{\alpha^*_1} \right] , \tag{25}$$

$$\mathbb{E}_0[\tau^*] = \frac{1}{I_0} \left[ (1 - \alpha^*_0) \log \frac{1 - \alpha^*_0}{\alpha^*_1} - \alpha^*_0 \log \frac{1 - \alpha^*_1}{\alpha^*_0} \right] , \tag{26}$$

as shown in Tartakovsky et al. (2014). Introducing the function

$$\gamma(x, y) := (1 - x) \log \frac{1 - x}{y} - x \log \frac{1 - y}{x} , \tag{27}$$

we can simplify (25)-(26):

$$\mathbb{E}_1[\tau^*] = \frac{1}{I_1} \gamma(\alpha^*_1, \alpha^*_0) , \tag{28}$$

$$\mathbb{E}_0[\tau^*] = \frac{1}{I_0} \gamma(\alpha^*_0, \alpha^*_1) . \tag{29}$$

Equations (25)-(26) show the tradeoff we mentioned above: the mean hitting time on positive (negative) data diverges if we are to set the false alarm (rejection) rate to zero.

The tradeoff with overshoots. Introducing the overshoots explicitly, we can obtain equalities, instead of inequalities such as (19), that connect the error rates and the thresholds. We first define the overshoots over the thresholds $a_0$ and $a_1$ at the stopping time as

$$\kappa_1(a_0, a_1) := \lambda_{\tau^*} - a_1 \ \text{on} \ \{\lambda_{\tau^*} \ge a_1\} , \tag{30}$$

$$\kappa_0(a_0, a_1) := -(\lambda_{\tau^*} + a_0) \ \text{on} \ \{\lambda_{\tau^*} \le -a_0\} . \tag{31}$$

We further define the expectations of the exponentiated overshoots as

$$e_1(a_0, a_1) := \mathbb{E}_1[e^{-\kappa_1(a_0, a_1)} \,|\, \lambda_{\tau^*} \ge a_1] , \qquad e_0(a_0, a_1) := \mathbb{E}_0[e^{-\kappa_0(a_0, a_1)} \,|\, \lambda_{\tau^*} \le -a_0] .$$

Then we can relate the thresholds to the error rates without the no-overshoot approximation (Tartakovsky (1991)):

$$\alpha^*_0 = \frac{e_1(a_0, a_1)\, e^{a_0} - e_1(a_0, a_1)\, e_0(a_0, a_1)}{e^{a_1 + a_0} - e_1(a_0, a_1)\, e_0(a_0, a_1)} , \qquad \alpha^*_1 = \frac{e_0(a_0, a_1)\, e^{a_1} - e_1(a_0, a_1)\, e_0(a_0, a_1)}{e^{a_1 + a_0} - e_1(a_0, a_1)\, e_0(a_0, a_1)} .$$

To obtain a more specific dependence on the thresholds $a_y$ ($y \in \{1, 0\}$), we adopt the asymptotic approximation. Let $T_0(a_0)$ and $T_1(a_1)$ be the one-sided stopping times, i.e., $T_0(a_0) := \inf\{T \ge 1 \,|\, \lambda_T \le -a_0\}$ and $T_1(a_1) := \inf\{T \ge 1 \,|\, \lambda_T \ge a_1\}$.
We then define the associated overshoots as

$$\tilde{\kappa}_1(a_1) := \lambda_{T_1} - a_1 \ \text{on} \ \{T_1 < \infty\} , \tag{35}$$

$$\tilde{\kappa}_0(a_0) := -(\lambda_{T_0} + a_0) \ \text{on} \ \{T_0 < \infty\} . \tag{36}$$

According to Lotov (1988), we can show that

$$\alpha^*_0 \approx \frac{\zeta_1 e^{a_0} - \zeta_1 \zeta_0}{e^{a_0 + a_1} - \zeta_1 \zeta_0} , \qquad \alpha^*_1 \approx \frac{\zeta_0 e^{a_1} - \zeta_1 \zeta_0}{e^{a_0 + a_1} - \zeta_1 \zeta_0} \tag{37}$$

under the asymptotic approximation. Note that

$$\zeta_y := \lim_{a_y \to \infty} \mathbb{E}_y[e^{-\tilde{\kappa}_y}] \quad (y \in \{1, 0\}) \tag{38}$$

have no dependence on the thresholds $a_y$ ($y \in \{1, 0\}$). Therefore we have obtained a more precise dependence of the error rates on the thresholds than (24):

Theorem A.1. The asymptotic tradeoff with overshoots. Assume that $0 < I_y < \infty$ ($y \in \{1, 0\}$). Let $\zeta_y$ be given in (38). Then

$$\alpha^*_0 = \zeta_1 e^{-a_1}(1 + o(1)) , \qquad \alpha^*_1 = \zeta_0 e^{-a_0}(1 + o(1)) \qquad (a_0, a_1 \to \infty) . \tag{39}$$

Mean hitting time with overshoots. A more general form of the mean hitting time is provided in Tartakovsky (1991). We can show that

$$\mathbb{E}_1[\tau^*] = \frac{1}{I_1} \left[ (1 - \alpha^*_1)\big(a_1 + \mathbb{E}_1[\kappa_1 \,|\, \tau^* = T_1]\big) - \alpha^*_1 \big(a_0 + \mathbb{E}_1[\kappa_0 \,|\, \tau^* = T_0]\big) \right] , \tag{40}$$

$$\mathbb{E}_0[\tau^*] = \frac{1}{I_0} \left[ (1 - \alpha^*_0)\big(a_0 + \mathbb{E}_0[\kappa_0 \,|\, \tau^* = T_0]\big) - \alpha^*_0 \big(a_1 + \mathbb{E}_0[\kappa_1 \,|\, \tau^* = T_1]\big) \right] . \tag{41}$$

The mean hitting times (40)-(41) explicitly depend on the overshoots, in contrast to (25)-(26). Let $\chi_y := \lim_{a_y \to \infty} \mathbb{E}_y[\tilde{\kappa}_y]$ ($y \in \{1, 0\}$) be the limiting average overshoots in the one-sided tests; note that $\chi_y$ have no dependence on $a_y$ ($y \in \{1, 0\}$). The asymptotic mean hitting times with overshoots are

$$\mathbb{E}_1[\tau^*] = \frac{1}{I_1}(a_1 + \chi_1) + o(1) , \qquad \mathbb{E}_0[\tau^*] = \frac{1}{I_0}(a_0 + \chi_0) + o(1) \qquad (a_0 e^{-a_1} \to 0,\ a_1 e^{-a_0} \to 0) , \tag{43}$$

as expressed in Tartakovsky et al. (2014). Therefore, the mean hitting times have an asymptotically linear dependence on the thresholds.
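The no-overshoot mean hitting times (28)-(29) are easy to evaluate numerically. The following sketch (function names are ours) implements the $\gamma$ function of (27) and the resulting hitting times:

```python
import math

def gamma_fn(x, y):
    """gamma(x, y) = (1 - x) log((1 - x)/y) - x log((1 - y)/x)   (Eq. 27)."""
    return (1 - x) * math.log((1 - x) / y) - x * math.log((1 - y) / x)

def mean_hitting_times(alpha0, alpha1, I1, I0):
    """E_1[tau*] and E_0[tau*] under the no-overshoot approximation
    (Eqs. 28-29), given error rates and KL divergences I1, I0."""
    return gamma_fn(alpha1, alpha0) / I1, gamma_fn(alpha0, alpha1) / I0
```

For instance, with symmetric 1% error rates and $I_1 = I_0 = 0.5$ (the KL divergence of two unit-variance Gaussians one standard deviation apart), both hitting times equal $0.98 \log 99 / 0.5 \approx 9$, and they grow only logarithmically as the error rates shrink.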

A.3 THE NEYMAN-PEARSON TEST AND THE SPRT

So far, we have discussed the tradeoff between the false alarm rate and the mean hitting time, and several properties of the operation characteristic and the mean hitting time. Next, we compare the SPRT with the Neyman-Pearson test, which is well known to be optimal in the classification of time-series with fixed sample lengths; in contrast, the SPRT is optimal in the early classification of time-series with indefinite sample lengths, as we show in the next section. We show that the Neyman-Pearson test is optimal in the two-hypothesis testing problem, or the binary classification of time-series. Nevertheless, we show that in the i.i.d. Gaussian model, the SPRT terminates earlier than the Neyman-Pearson test despite attaining the same error rates.

Preliminaries. Before defining the Neyman-Pearson test, we specify what the "best" test should be. There are three criteria, namely the most powerful test, the Bayes test, and the minimax test. To explain them in detail, we first define the size and the power of a test. The significance level, or simply the size, of a test $d$ is defined as

$$\alpha := P(d = 1 \,|\, y = 0) . \tag{44}$$

It is also known as the false positive rate, the false alarm rate, or the false acceptance rate of the test. On the other hand, the power of the test $d$ is given by

$$\gamma := 1 - \beta := P(d = 1 \,|\, y = 1) . \tag{45}$$

$\gamma$ is also called the true positive rate, the true acceptance rate, the recall, or the sensitivity. $\beta$ is known as the false negative rate or the false rejection rate. Now, we can define the three criteria mentioned above.

Definition A.4. Most powerful test

The most powerful test $d$ of significance level $\alpha (> 0)$ is defined as the test such that, for every other test $d'$ of significance level $\alpha$, the power of $d$ is greater than or equal to that of $d'$:

$$P(d = 1 \,|\, y = 1) \ge P(d' = 1 \,|\, y = 1) . \tag{46}$$

Definition A.5. Bayes test. Let $\pi_0 := P(y = 0)$ and $\pi_1 := P(y = 1) = 1 - \pi_0$ be the prior probabilities of the hypotheses $H_0$ and $H_1$, and let $\bar{\alpha}(d)$ be the average probability of error:

$$\bar{\alpha}(d) := \sum_{i = 1, 0} \pi_i \alpha_i(d) , \tag{47}$$

where $\alpha_i(d) := P(d \ne i \,|\, y = i)$ is the error rate for the class $i \in \{1, 0\}$. A Bayes test for the priors, denoted by $d^B$, is defined as a test that minimizes the average probability of error:

$$d^B := \operatorname{arginf}_d \{\bar{\alpha}(d)\} , \tag{48}$$

where the infimum is taken over all fixed-sample-size decision rules.

Definition A.6. Minimax test. Let $\alpha_{\max}(d)$ be the maximum error probability:

$$\alpha_{\max}(d) := \max_{i \in \{1, 0\}} \{\alpha_i(d)\} . \tag{49}$$

A minimax test, denoted by $d^M$, is defined as a test that minimizes the maximum error probability:

$$\alpha_{\max}(d^M) = \inf_d \{\alpha_{\max}(d)\} , \tag{50}$$

where the infimum is taken over all fixed-sample-size tests. Note that a fixed-sample-size, or non-sequential, decision rule is a decision rule with a fixed stopping time $T = N$ with probability one.

Definition and the optimality of the Neyman-Pearson test. Based on the above notions, we state the definition and the optimality of the Neyman-Pearson test. The most powerful test for the two-hypothesis testing problem is the Neyman-Pearson test; the theorem below also serves as the definition of the Neyman-Pearson test.

Theorem A.2. Neyman-Pearson lemma. Consider the two-hypothesis testing problem, i.e., the problem of testing two hypotheses $H_0: P = P_0$ and $H_1: P = P_1$, where $P_0$ and $P_1$ are two probability distributions with densities $p_0$ and $p_1$ with respect to some probability measure.
The most powerful test is given by

$$d^{NP}(X^{(1,T)}) := \begin{cases} 1 & \text{if } \Lambda(X^{(1,T)}) \ge h(\alpha) \\ 0 & \text{otherwise} \end{cases} , \tag{51}$$

where $\Lambda(X^{(1,T)}) = \frac{p_1(X^{(1,T)})}{p_0(X^{(1,T)})}$ is the likelihood ratio, and the threshold $h(\alpha)$ is defined by

$$\alpha_0(d^{NP}) \equiv P(d^{NP}(X^{(1,T)}) = 1 \,|\, H_0) = \mathbb{E}_0[d^{NP}(X^{(1,T)})] = \alpha \tag{52}$$

to ensure that the false positive rate equals the user-defined value $\alpha (> 0)$. $d^{NP}$ is referred to as the Neyman-Pearson test and is also optimal with respect to the Bayes and minimax criteria:

Theorem A.3. Neyman-Pearson test is Bayes optimal. Consider the two-hypothesis testing problem. The Bayes test $d^B$, which minimizes the average probability of error $\bar{\alpha}(d) = \pi_0 \alpha_0(d) + \pi_1 \alpha_1(d)$, is given by

$$d^B(X^{(1,T)}) = \begin{cases} 1 & \text{if } \Lambda(X^{(1,T)}) \ge \pi_0/\pi_1 \\ 0 & \text{otherwise} \end{cases} . \tag{53}$$

That is, the Bayes test is given by the Neyman-Pearson test with the threshold $\pi_0/\pi_1$.

Theorem A.4. Neyman-Pearson test is minimax optimal. Consider the two-hypothesis testing problem. The minimax test $d^M$, which minimizes the maximum error probability $\alpha_{\max}(d) = \max_{i \in \{1, 0\}}\{\alpha_i(d)\}$, is the Neyman-Pearson test with the threshold such that $\alpha_0(d^M) = \alpha_1(d^M)$. The proofs are given in Borovkov (1998) and Lehmann & Romano (2006).

The SPRT is more efficient. We have shown that the Neyman-Pearson test is optimal in the two-hypothesis testing problem, in the sense that it is the most powerful, Bayes, and minimax test; nevertheless, we can show that the SPRT terminates faster than the Neyman-Pearson test even when the two attain the same error rates. Consider the two-hypothesis testing problem for the i.i.d. Gaussian model:

$$H_i: y = y_i \ (i \in \{1, 0\}) , \qquad x^{(t)} = y + \xi^{(t)} \ (t \ge 1,\ y \in \mathbb{R}) , \qquad \xi^{(t)} \sim \mathcal{N}(0, \sigma^2) \ (\sigma \ge 0) , \tag{54}$$

where $\mathcal{N}(0, \sigma^2)$ denotes the Gaussian distribution with mean 0 and variance $\sigma^2$. The Neyman-Pearson test has the form

$$d^{NP}(X^{(1, n(\alpha_0, \alpha_1))}) = \begin{cases} 1 & \text{if } \lambda_{n(\alpha_0, \alpha_1)} \ge h(\alpha_0, \alpha_1) \\ 0 & \text{otherwise} \end{cases} . \tag{55}$$
The sequence length $n = n(\alpha_0, \alpha_1)$ and the threshold $h = h(\alpha_0, \alpha_1)$ are chosen so that the false positive rate and the false negative rate equal $\alpha_0$ and $\alpha_1$, respectively; i.e.,

$$P(\lambda_n \ge h \,|\, y = y_0) = \alpha_0 , \tag{56}$$

$$P(\lambda_n < h \,|\, y = y_1) = \alpha_1 . \tag{57}$$

We can solve these for the i.i.d. Gaussian model (Tartakovsky et al. (2014)). To see the efficiency of the SPRT relative to the Neyman-Pearson test, we define

$$\mathcal{E}_0(\alpha_0, \alpha_1) = \frac{\mathbb{E}[\tau^* \,|\, y = y_0]}{n(\alpha_0, \alpha_1)} , \tag{58}$$

$$\mathcal{E}_1(\alpha_0, \alpha_1) = \frac{\mathbb{E}[\tau^* \,|\, y = y_1]}{n(\alpha_0, \alpha_1)} . \tag{59}$$

Assuming the overshoots are negligible, we obtain the following asymptotic efficiency (Tartakovsky et al. (2014)):

$$\lim_{\max\{\alpha_0, \alpha_1\} \to 0} \mathcal{E}_y(\alpha_0, \alpha_1) = \frac{1}{4} \quad (y \in \{1, 0\}) . \tag{60}$$

In other words, under the no-overshoot and asymptotic assumptions, the SPRT terminates four times earlier than the Neyman-Pearson test in expectation, despite having the same false positive and false negative rates.

A.4 THE OPTIMALITY OF THE SPRT

Optimality in i.i.d. cases. The theorem below shows that the SPRT minimizes the expected hitting times within the class of decision rules that have bounded false positive and false negative rates. Consider the two-hypothesis testing problem. We define the class of decision rules

$$C(\alpha_0, \alpha_1) = \{\delta \ \text{s.t.} \ P(d = 1 \,|\, H_0) \le \alpha_0 ,\ P(d = 0 \,|\, H_1) \le \alpha_1 ,\ \mathbb{E}[\tau \,|\, H_0] < \infty ,\ \mathbb{E}[\tau \,|\, H_1] < \infty\} . \tag{61}$$

Then the optimality theorem states:

Theorem A.5. I.i.d. optimality (Tartakovsky et al. (2014)). Let the time-series data points $x^{(t)}$, $t = 1, 2, \ldots$ be i.i.d. with density $f_0$ under $H_0$ and with density $f_1$ under $H_1$, where $f_0 \not\equiv f_1$. Let $\alpha_0 > 0$ and $\alpha_1 > 0$ be fixed constants such that $\alpha_0 + \alpha_1 < 1$. If the thresholds $-a_0$ and $a_1$ satisfy $\alpha^*_0(a_0, a_1) = \alpha_0$ and $\alpha^*_1(a_0, a_1) = \alpha_1$, then the SPRT $\delta^* = (d^*, \tau^*)$ satisfies

$$\inf_{\delta = (d, \tau) \in C(\alpha_0, \alpha_1)} \mathbb{E}[\tau \,|\, H_0] = \mathbb{E}[\tau^* \,|\, H_0] \quad \text{and} \quad \inf_{\delta = (d, \tau) \in C(\alpha_0, \alpha_1)} \mathbb{E}[\tau \,|\, H_1] = \mathbb{E}[\tau^* \,|\, H_1] . \tag{62}$$

A similar optimality holds for continuous-time processes (Irle & Schmitz (1984)).
Therefore the SPRT terminates at the earliest expected stopping time among all decision rules achieving the same or lower error rates; that is, the SPRT is optimal. Theorem A.5 tells us that given user-defined thresholds, the SPRT attains the optimal mean hitting time. Also, remember that the thresholds determine the error rates (e.g., Equation (24)). Therefore, the SPRT can minimize the required number of samples while achieving the desired upper bounds on the false positive and false negative rates.

Asymptotic optimality in general non-i.i.d. cases. In most of the discussion above, we have assumed that the time-series samples are i.i.d. For general non-i.i.d. distributions, we have asymptotic optimality; i.e., the SPRT asymptotically minimizes the moments of the stopping time distribution (Tartakovsky et al. (2014)). Before stating the theorem, we first define a type of convergence of random variables.

Definition A.7. r-quick convergence. Let $\{x^{(t)}\}_{t \ge 1}$ be a stochastic process. Let $T_\epsilon(\{x^{(t)}\}_{t \ge 1})$ be the last entry time of the stochastic process $\{x^{(t)}\}_{t \ge 1}$ into the region $(\epsilon, \infty) \cup (-\infty, -\epsilon)$, i.e.,

$$T_\epsilon(\{x^{(t)}\}_{t \ge 1}) = \sup_{t \ge 1} \{t \ \text{s.t.} \ |x^{(t)}| > \epsilon\} , \qquad \sup\{\emptyset\} := 0 . \tag{63}$$

Then we say that the stochastic process $\{x^{(t)}\}_{t \ge 1}$ converges to zero r-quickly, or

$$x^{(t)} \xrightarrow[t \to \infty]{\text{r-quickly}} 0 , \tag{64}$$

for some $r > 0$, if

$$\mathbb{E}[(T_\epsilon(\{x^{(t)}\}_{t \ge 1}))^r] < \infty \ \text{for every} \ \epsilon > 0 . \tag{65}$$

r-quick convergence ensures that the last entry time into the large-deviation region, $T_\epsilon(\{x^{(t)}\}_{t \ge 1})$, is finite almost surely. The asymptotic optimality theorem is:

Theorem A.6. Non-i.i.d. asymptotic optimality. If there exist positive constants $I_0$ and $I_1$ and an increasing non-negative function $\psi(t)$ such that

$$\frac{\lambda_t}{\psi(t)} \xrightarrow[t \to \infty]{P_1\text{-}r\text{-quickly}} I_1 \quad \text{and} \quad \frac{\lambda_t}{\psi(t)} \xrightarrow[t \to \infty]{P_0\text{-}r\text{-quickly}} -I_0 , \tag{66}$$

where $\lambda_t$ is defined in Section A.1, then $\mathbb{E}[(\tau^*)^r \,|\, y = i] < \infty$ ($i \in \{1, 0\}$) for any finite $a_0$ and $a_1$.
Moreover, if the thresholds $a_0$ and $a_1$ are chosen to satisfy (19), $a_0 \to \log(1/\alpha^*_1)$, and $a_1 \to \log(1/\alpha^*_0)$ ($a_i \to \infty$), then for all $0 < m \le r$,

$$\inf_{\delta \in C(\alpha_0, \alpha_1)} \{\mathbb{E}[\tau^m \,|\, y = y_1]\} - \mathbb{E}[(\tau^*)^m \,|\, y = y_1] \longrightarrow 0 \tag{68}$$

$$\inf_{\delta \in C(\alpha_0, \alpha_1)} \{\mathbb{E}[\tau^m \,|\, y = y_0]\} - \mathbb{E}[(\tau^*)^m \,|\, y = y_0] \longrightarrow 0 \tag{69}$$

as $\max\{\alpha_0, \alpha_1\} \to 0$ with $|\log \alpha_0 / \log \alpha_1| \to c$, where $c \in (0, \infty)$.
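The fourfold efficiency gain in (60) can be checked numerically. The following rough Monte Carlo sketch for the i.i.d. Gaussian model (54) is ours: it runs the SPRT on data from $H_1$ with symmetric no-overshoot thresholds, and compares the mean stopping time against the fixed Neyman-Pearson sample size $n$ obtained by solving (56) for a symmetric threshold $h = 0$ (which gives $n = (2 z_{1-\alpha} \sigma / (\mu_1 - \mu_0))^2$):

```python
import math
import random
from statistics import NormalDist

random.seed(0)
mu0, mu1, sigma = 0.0, 1.0, 1.0
alpha = 0.01  # target for both error rates

# Symmetric SPRT thresholds under the no-overshoot approximation (Eq. 22)
a = math.log((1 - alpha) / alpha)

def sprt_stopping_time(max_t=10_000):
    """One SPRT run on data drawn from H1; returns the stopping time."""
    lam = 0.0
    for t in range(1, max_t + 1):
        x = random.gauss(mu1, sigma)
        # LLR increment of a single Gaussian sample
        lam += (mu1 - mu0) * (x - (mu0 + mu1) / 2) / sigma**2
        if abs(lam) >= a:
            return t
    return max_t

# Fixed Neyman-Pearson sample size with the same symmetric error rates:
# P(lambda_n >= 0 | H0) = alpha  =>  n = (2 z_{1-alpha} sigma / (mu1-mu0))^2
z = NormalDist().inv_cdf(1 - alpha)
n_np = math.ceil((2 * z * sigma / (mu1 - mu0)) ** 2)

mean_tau = sum(sprt_stopping_time() for _ in range(500)) / 500
print(f"SPRT mean stopping time: {mean_tau:.1f} vs. NP sample size: {n_np}")
```

At 1% error rates the ratio is already well below one (roughly 0.4 to 0.5 here) and approaches 1/4 only in the small-error limit of (60); the overshoot, ignored in the threshold choice, keeps the finite-error ratio above 1/4.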

B SUPPLEMENTARY REVIEW OF THE RELATED WORK

Primate's decision making and parietal cortical neurons. The process of decision making involves multiple steps, such as evidence accumulation, reward prediction, risk evaluation, and action selection. We give a brief overview mainly of the neural activities of the primate parietal lobe and their relationship to evidence accumulation, instead of providing a comprehensive review of the decision-making literature. Interested readers may refer to review articles such as Doya (2008); Gallivan et al. (2018); Gold & Shadlen (2007). In order to study the neural correlates of decision making, Roitman & Shadlen (2002) used a random dot motion (RDM) task on non-human primates. They found that neurons in the cortical area lateral intraparietal cortex, or LIP, gradually accumulated sensory evidence, represented as an increasing firing rate, toward one of two thresholds corresponding to the two alternative choices. Moreover, while a steeper increase of firing rate leads to an earlier decision by the animal, the final firing rate at the decision time is almost constant regardless of reaction time.

Application of the SPRT. Ever since Wald's formulation, sequential hypothesis testing has been applied to study decision making and its reaction time (Stone (1960); Edwards (1965); Ashby (1983)). Several extensions to more general problem settings have also been proposed. In order to test more than two hypotheses, the multi-hypothesis SPRT (MSPRT) was introduced (Armitage (1950); Baum & Veeravalli (1994)) and shown to be asymptotically optimal (Dragalin et al. (1999; 2000); Veeravalli & Baum (1995)). The SPRT was also generalized to non-i.i.d. data (Lai (1981); Tartakovsky (1999)) and theoretically shown to be asymptotically optimal, given the known LLR (Dragalin et al. (1999; 2000)).

Time-series classification. The hierarchical vote system HIVE-COTE showed high classification performance at the expense of high computational cost (Bagnall et al. (2015); Lines et al. (2016)).
Word Extraction for time series classification (WEASEL) and its variant WEASEL+MUSE take a bag-of-patterns approach that utilizes carefully designed feature vectors (Schäfer & Leser (2017)). The advent of deep learning allows researchers to classify not only univariate/multivariate data but also large video data using convolutional neural networks (Hara et al. (2017); Carreira & Zisserman (2017); Karim et al. (2018); Wang et al. (2017)). Thanks to the increasing computational power and memory of modern processing units, each video in a minibatch is designed to be sufficiently long in the time domain that a class signature can be contained. The video lengths of the training, validation, and test data are often assumed to be fixed; however, ensuring sufficient length for all data may compromise classification speed (i.e., the number of samples used for classification). We extensively test this issue in Section 5.

C DERIVATION OF THE TANDEM FORMULA

The derivations of the important formulas in Section 3 are provided below.

The 0th-order (i.i.d.) TANDEM formula. We use the following probability ratio to identify whether the input sequence $\{x^{(s)}\}_{s=1}^t$ is derived from hypothesis $H_1: y = 1$ or $H_0: y = 0$:

$$\frac{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 1)}{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 0)} . \tag{70}$$

We can rewrite it with the posterior. First, by repeatedly using the Bayes rule, we obtain

$$p(x^{(1)}, \ldots, x^{(t)} \,|\, y) = p(x^{(t)} \,|\, x^{(t-1)}, \ldots, x^{(1)}, y)\, p(x^{(t-1)}, \ldots, x^{(1)} \,|\, y) = \cdots = p(x^{(t)} \,|\, x^{(t-1)}, \ldots, x^{(1)}, y)\, p(x^{(t-1)} \,|\, x^{(t-2)}, \ldots, x^{(1)}, y) \cdots p(x^{(2)} \,|\, x^{(1)}, y)\, p(x^{(1)} \,|\, y) . \tag{71}$$

We use this formula hereafter. Let us assume that the process $\{x^{(s)}\}_{s=1}^t$ is conditionally independently and identically distributed (hereafter simply noted as i.i.d.); i.e., as a "0th-order Markov process,"

$$p(x^{(t)} \,|\, x^{(t-1)}, \ldots, x^{(1)}, y) = p(x^{(t)} \,|\, y) , \tag{72}$$

which yields

$$p(x^{(1)}, \ldots, x^{(t)} \,|\, y) = \prod_{s=1}^t p(x^{(s)} \,|\, y) = \prod_{s=1}^t \frac{p(y \,|\, x^{(s)})\, p(x^{(s)})}{p(y)} . \tag{73}$$

Hence

$$\frac{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 1)}{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 0)} = \left[ \prod_{s=1}^t \frac{p(y = 1 \,|\, x^{(s)})}{p(y = 0 \,|\, x^{(s)})} \right] \left( \frac{p(y = 0)}{p(y = 1)} \right)^t , \tag{74}$$

or

$$\log \frac{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 1)}{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 0)} = \sum_{s=1}^t \log \frac{p(y = 1 \,|\, x^{(s)})}{p(y = 0 \,|\, x^{(s)})} - t \log \frac{p(y = 1)}{p(y = 0)} . \tag{75}$$

The 1st-order TANDEM formula. So far, we have utilized the i.i.d. assumption (72)-(73). Now let us derive the probability ratio of the first-order Markov process, which assumes

$$p(x^{(t)} \,|\, x^{(t-1)}, \ldots, x^{(1)}, y) = p(x^{(t)} \,|\, x^{(t-1)}, y) . \tag{76}$$

Applying (76) to (71), we obtain, for $t \ge 2$,

$$p(x^{(1)}, \ldots, x^{(t)} \,|\, y) = \left[ \prod_{s=2}^t p(x^{(s)} \,|\, x^{(s-1)}, y) \right] p(x^{(1)} \,|\, y) = \left[ \prod_{s=2}^t \frac{p(y \,|\, x^{(s)}, x^{(s-1)})\, p(x^{(s)}, x^{(s-1)})}{p(y \,|\, x^{(s-1)})\, p(x^{(s-1)})} \right] \frac{p(y \,|\, x^{(1)})\, p(x^{(1)})}{p(y)} . \tag{77}$$

Hence

$$\frac{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 1)}{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 0)} = \prod_{s=2}^t \frac{p(y = 1 \,|\, x^{(s)}, x^{(s-1)})}{p(y = 0 \,|\, x^{(s)}, x^{(s-1)})} \prod_{s=3}^t \frac{p(y = 0 \,|\, x^{(s-1)})}{p(y = 1 \,|\, x^{(s-1)})} \cdot \frac{p(y = 0)}{p(y = 1)} , \tag{78}$$

or

$$\log \frac{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 1)}{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 0)} = \sum_{s=2}^t \log \frac{p(y = 1 \,|\, x^{(s)}, x^{(s-1)})}{p(y = 0 \,|\, x^{(s)}, x^{(s-1)})} - \sum_{s=3}^t \log \frac{p(y = 1 \,|\, x^{(s-1)})}{p(y = 0 \,|\, x^{(s-1)})} - \log \frac{p(y = 1)}{p(y = 0)} . \tag{79}$$

For $t = 1$ and $t = 2$, the natural extensions are

$$\log \frac{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 1)}{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 0)} = \log \frac{p(y = 1 \,|\, x^{(1)}, \ldots, x^{(t)})}{p(y = 0 \,|\, x^{(1)}, \ldots, x^{(t)})} - \log \frac{p(y = 1)}{p(y = 0)} \quad (t \in \{1, 2\}) . \tag{80}$$

The N-th order TANDEM formula. Finally, we extend the 1st-order TANDEM formula so that it can calculate the general N-th order log-likelihood ratio. The N-th order Markov process is defined by

$$p(x^{(t)} \,|\, x^{(t-1)}, \ldots, x^{(1)}, y) = p(x^{(t)} \,|\, x^{(t-1)}, \ldots, x^{(t-N)}, y) . \tag{81}$$

Therefore, for $t \ge N + 2$,

$$p(x^{(1)}, \ldots, x^{(t)} \,|\, y) = \left[ \prod_{s=N+1}^t p(x^{(s)} \,|\, x^{(s-1)}, \ldots, x^{(s-N)}, y) \right] p(x^{(N)}, \ldots, x^{(1)} \,|\, y) = \left[ \prod_{s=N+1}^t \frac{p(y \,|\, x^{(s)}, \ldots, x^{(s-N)})\, p(x^{(s)}, \ldots, x^{(s-N)})}{p(y \,|\, x^{(s-1)}, \ldots, x^{(s-N)})\, p(x^{(s-1)}, \ldots, x^{(s-N)})} \right] \frac{p(y \,|\, x^{(N)}, \ldots, x^{(1)})\, p(x^{(N)}, \ldots, x^{(1)})}{p(y)} . \tag{82}$$

Hence

$$\frac{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 1)}{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 0)} = \prod_{s=N+1}^t \frac{p(y = 1 \,|\, x^{(s)}, \ldots, x^{(s-N)})}{p(y = 0 \,|\, x^{(s)}, \ldots, x^{(s-N)})} \prod_{s=N+2}^t \frac{p(y = 0 \,|\, x^{(s-1)}, \ldots, x^{(s-N)})}{p(y = 1 \,|\, x^{(s-1)}, \ldots, x^{(s-N)})} \cdot \frac{p(y = 0)}{p(y = 1)} , \tag{83}$$

or

$$\log \frac{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 1)}{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 0)} = \sum_{s=N+1}^t \log \frac{p(y = 1 \,|\, x^{(s)}, \ldots, x^{(s-N)})}{p(y = 0 \,|\, x^{(s)}, \ldots, x^{(s-N)})} - \sum_{s=N+2}^t \log \frac{p(y = 1 \,|\, x^{(s-1)}, \ldots, x^{(s-N)})}{p(y = 0 \,|\, x^{(s-1)}, \ldots, x^{(s-N)})} - \log \frac{p(y = 1)}{p(y = 0)} . \tag{84}$$

For $t < N + 2$, we obtain

$$\log \frac{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 1)}{p(x^{(1)}, \ldots, x^{(t)} \,|\, y = 0)} = \log \frac{p(y = 1 \,|\, x^{(1)}, \ldots, x^{(t)})}{p(y = 0 \,|\, x^{(1)}, \ldots, x^{(t)})} - \log \frac{p(y = 1)}{p(y = 0)} . \tag{85}$$
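The TANDEM formula turns LLR estimation into arithmetic on posterior outputs of a classifier. As a concrete illustration, the following sketch (function name and input conventions are ours) evaluates the first-order formula (79) for the binary case, where $p(y = 0 \,|\, \cdot) = 1 - p(y = 1 \,|\, \cdot)$:

```python
import math

def tandem_llr_1st_order(post_pair, post_single, log_prior_ratio=0.0):
    """First-order TANDEM formula (Eq. 79).

    post_pair[k]   : p(y=1 | x^(k+2), x^(k+1)),  k = 0..t-2  (pair posteriors)
    post_single[k] : p(y=1 | x^(k+1)),           k = 0..t-1  (frame posteriors)
    log_prior_ratio: log p(y=1)/p(y=0).
    Returns the estimated log-likelihood ratio lambda_t.
    """
    logit = lambda q: math.log(q / (1 - q))
    # sum_{s=2}^{t} log [ p(1|x^s, x^{s-1}) / p(0|x^s, x^{s-1}) ]
    first = sum(logit(q) for q in post_pair)
    # sum_{s=3}^{t} log [ p(1|x^{s-1}) / p(0|x^{s-1}) ] -> frames x^2 .. x^{t-1}
    second = sum(logit(q) for q in post_single[1:-1])
    return first - second - log_prior_ratio
```

As a sanity check, uninformative posteriors of 0.5 everywhere give an LLR of exactly zero under a flat prior; posteriors above 0.5 push the LLR toward the upper threshold.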

D SUPPLEMENTARY DISCUSSION

Why is the SPRT-TANDEM superior to other baselines? A potential drawback common to the LSTM-s/m and EARLIEST is that they incorporate long temporal correlations: this may lead to (1) the class signature length problem and (2) the vanishing gradient problem, as we described in Section 3. (1) If a class signature is significantly shorter than the correlation length in consideration, uninformative data samples are included in calculating the log-likelihood ratio, resulting in a late or wrong decision. (2) Long correlations require calculating backpropagation over a long range, which is prone to the vanishing gradient problem. An LSTM-s/m-specific drawback is similar to that of the Neyman-Pearson test, in the sense that it fixes the number of samples before performance evaluation. On the other hand, the SPRT, and hence the SPRT-TANDEM, classifies samples of various lengths: thus, the SPRT-TANDEM can achieve a smaller average sampling number with high accuracy. Another potential drawback of the LSTM-s/m is that their loss function explicitly imposes monotonicity on the scores. While monotonicity is advantageous for quick decisions, it may sacrifice flexibility: the LSTM-s/m can hardly change its mind during classification. EARLIEST, the reinforcement-learning-based classifier, decides on samples of various lengths. A potential EARLIEST-specific drawback is that deep reinforcement learning is known to be unstable (Nikishin et al. (2018); Kumar et al.).

How optimal is the SPRT-TANDEM? In practice, it is difficult to strictly satisfy the necessary conditions for the SPRT's optimality (Theorems A.5 and A.6) because of experimental limitations. One of our primary interests is to apply the SPRT, the provably optimal algorithm, to real-world datasets. A major concern in extending Wald's SPRT is that we need to know the true likelihood ratio a priori to implement the SPRT.
Thus we propose the SPRT-TANDEM, with the help of machine learning and density ratio estimation, to remove this concern, if not completely. However, some technical limitations still remain. Let us introduce two properties of the SPRT that can prevent the SPRT-TANDEM from attaining exact optimality. First, the SPRT is assumed to terminate with probability one for all the LLR trajectories under consideration. The corresponding equations stating this assumption are Equations (61) and (66) under the i.i.d. and non-i.i.d. conditions, respectively. Given that this assumption (and the other minor technical conditions in Theorems A.5 and A.6) is satisfied, the more precisely we estimate the LLRs, the more closely we approach the genuine SPRT implementation and thus its asymptotic Bayes optimality. Second, the non-i.i.d. SPRT is asymptotically optimal when the maximum number of samples allowed is not fixed (infinite horizon). On the other hand, our experiment truncates the SPRT (finite horizon) at the maximum timestamp, which depends on the dataset. Under truncation, gradually collapsing thresholds are proven to give the optimal stopping (Tartakovsky et al. (2014)); however, the collapsing thresholds are obtained via backward induction (Bingham et al. (2006)), which is possible only after observing the full sequence. Thus, under truncation, finding the optimal solutions in a strict sense critically limits practical applicability. The truncation is just an experimental requirement and is not an essential assumption for the SPRT-TANDEM. Under the infinite-horizon setting, the LLRs are assumed to increase or decrease toward the thresholds (Theorem A.6) in order to ensure asymptotic optimality. However, we observed that the estimated LLRs tend to be asymptotically flat, especially when N is large (Figures 10, 11, and 12); the estimated LLRs can violate the assumption of Theorem A.6.
One potential reason for the flat LLRs is the TANDEM formula: the first and second terms of the formula have different signs. Thus, the resulting log-likelihood ratio is updated only when the difference between the two terms is non-zero. Because the first and second terms depend on N + 1 and N inputs, respectively, the contribution of any single input is expected to become relatively small as N is enlarged. We are aware of this issue and have already started working on it as future work. Nevertheless, the flat LLRs at least do not spoil the practical efficiency of the SPRT-TANDEM, as our experiment shows. In fact, because we cannot know the true LLR of real-world datasets, it is not easy to discuss whether the assumption of increasing LLRs is valid on the three databases (NMNIST, UCF, and SiW) we tested. Numerical simulation may be possible, but it is out of our scope because our primary interest is to implement a practically usable SPRT under real-world scenarios.

The best order N of the SPRT-TANDEM. The order N is a hyperparameter, as mentioned in Section 3, and thus needs to be tuned to attain the best performance. However, each dataset has its own temporal structure, and thus it is challenging to know the best order a priori. In the following, we provide a rough estimation of the best order, which may give dramatic benefit to users and may lead to exciting future work. Let us introduce a concept, the specific time scale, which is used in physics to analyze the qualitative behavior of a physical system. Here, we define the specific time scale of a physical system as a temporal interval in which the physical system develops dramatically. For example, suppose that the physical system under consideration is a small segment of spacetime in which an unstable particle, ortho-positronium (o-Ps), exists.
In this case, a specific time scale can be defined as the lifetime of o-Ps, 0.14 µs (Czarnecki (1999)), because the o-Ps is likely to vanish in 0.14 × O(1) µs, after which the physical system has changed completely. Note that the definition of the specific time scale is not unique for one physical system; it depends on the phenomena the researcher focuses on. Specific (time) scales are often found in the fundamental equations that describe physical systems. In the example above, the decay equation N(t) = A exp(-t/τ) contains the lifetime τ ∈ R itself, where N(t) ∈ R is the expected number of o-Ps at time t ∈ R and A ∈ R is a constant. Let us borrow the concept of the specific time scale to estimate the best order of the SPRT-TANDEM before training neural networks, though there is a gap in scale. In this case, we define the specific time scale of a dataset as the number of frames after which a typical video in the dataset shows a completely different scene. As discussed below, we claim that the specific time scale of a dataset is a good estimate of the best order of the SPRT-TANDEM, because correlations shorter than the specific time scale are insufficient to distinguish each class, while longer correlations may be contaminated with noise and keep redundant information. First, we consider Nosaic MNIST (NMNIST). The specific time scale of NMNIST can be defined as the half-life of the noise, i.e., the temporal interval necessary for half of the noise to disappear. It is 10 frames by the definition of NMNIST, and it approximately matches the best order of the SPRT-TANDEM: in Figure 3, our experiment shows that the 10th-order SPRT-TANDEM (with 11-frame correlations) outperforms the other orders in the later timestamps, though we did not perform experiments with all possible orders.
A potential underlying mechanism is the following: too long correlations keep noisy information from earlier timestamps, causing degradation, while too short correlations do not fully utilize the past information. Next, we discuss the two classes in the UCF101 action recognition database, handstand pushups and handstand walking, which are used in our experiment. A specific time scale is ∼10 frames for the following reasons. The first class, handstand pushups, has a specific time scale of one cycle of raising and lowering one's body, ∼50 frames (according to the shortest video in the class). The second class, handstand walking, has a specific time scale of one cycle of walking, i.e., two steps, ∼10 frames (according to the longest video in the class). Therefore, the specific time scale of UCF is ∼10 frames, the smaller of the two, since we can see whether there is a class signature in a video within at most ∼10 frames. This specific time scale matches the best order of the SPRT-TANDEM according to Figure 3. Finally, a specific time scale of SiW is ∼1 frame, because a single image suffices to distinguish a real person from a spoofing image, owing to the reflection of the display, the texture of the photo, or the movement specific to a live personfoot_2. The best order in Figure 3 is ∼1, matching the specific time scale. We make two comments on potential future works related to estimating the best order of the SPRT-TANDEM. First, as our experiments include only short videos, it is an interesting future work to estimate the best order of the SPRT-TANDEM in super-long video classification, where gradient vanishing becomes a problem and likelihood estimation becomes more challenging. Second, it is an exciting future work to analyze the relation of the specific time scale to the best order when there are multiple time scales. For example, recall the discussion of locality earlier in this Appendix D: applying the SPRT-TANDEM to a dataset with distributed class signatures is challenging.
Distributed class signatures may have two specific time scales: e.g., one is the mean length of the signatures, and the other is the mean interval between the signatures. The best threshold λτ* of the SPRT-TANDEM. In practice, a user can change the thresholds after deploying the SPRT-TANDEM algorithm once, and thereby control the speed-accuracy tradeoff. Computing the speed-accuracy tradeoff curve is not expensive and, importantly, is computable without re-training. According to the speed-accuracy tradeoff curve, a user can choose the desired accuracy and speed. Note that this flexibility is missing in most other deep neural networks: controlling speed usually means changing the network structure and training it all over again. End-to-end vs. separate training. The design of the SPRT-TANDEM does not hamper end-to-end training of the neural networks; the feature extractor and temporal integrator can be readily connected for thorough backpropagation calculation. However, in Section 5, we trained the feature extractor and temporal integrator separately: after training the feature extractor, its trainable parameters are fixed before training the temporal integrator. We decided to train the two networks separately because we found that it achieves a better balanced accuracy and mean hitting time. Originally, we trained the network on the NMNIST database in an end-to-end manner, but the accuracy was far lower than the result reported in Section 5. We observed the same phenomenon when we trained the SPRT-TANDEM on our private video database containing 1-channel infrared videos. These observations might indicate that while separate training may lose information necessary for classification compared to the end-to-end approach, it helps the training of the temporal integrator by fixing the features at each data point.
It will be interesting to study whether this is a common problem in early-classification algorithms and to find the right balance between end-to-end and separate training to benefit from both approaches. Feedback to the field of neuroscience. Kira et al. (2015) experimentally showed that the SPRT could explain neural activities in area LIP of the macaque parietal lobe. They randomly presented a sequence of visual objects with associated reward probabilities. A natural question arises from here: what if the presented sequence is not random, but a time-dependent visual sequence? Will the neural activity be explained by our SPRT-TANDEM, or will the neurons utilize a completely different algorithm? Our research provides one driving hypothesis to lead the neuroscience community to a deeper understanding of the brain's decision-making system. Usage of statistical tests. As of writing this manuscript, not all computer science papers use statistical tests to evaluate their experiments. However, in order to provide an objective comparison across proposed and existing models, running multiple validation trials with random seeds followed by a statistical test is helpful. Thus, the authors hope that our paper stimulates the field of computer science to utilize statistical tests more actively. Ethical concern. The proposed method, the SPRT-TANDEM, is a general algorithm applicable to a broad range of serial data, such as auditory signals or video frames. Thus, any ethical concerns entirely depend on the application and training database, not on our algorithm per se. For example, if the SPRT-TANDEM is applied to face spoofing detection, using faces of people of one particular racial or ethnic group as training data may lead to a bias toward or against people of other groups. However, this bias is a concern in machine learning in general, not specific to the SPRT-TANDEM. Is the SPRT-TANDEM "too local"?
In our experiments in Section 5, the SPRT-TANDEM with the maximum correlation allowed (i.e., 19th, 49th, and 49th order on the NMNIST, UCF, and SiW databases, respectively) does not necessarily reach the highest accuracy with a larger number of frames. Instead, depending on the database, a lower order of approximation, such as the 10th-order TANDEM, outperforms the other orders. In the SiW database, this observation is especially prominent: the model that records the highest balanced accuracy is the 2nd-order SPRT-TANDEM. While this may indicate that our TANDEM formula with the "dropping correlation" strategy works as we expected, a remaining concern is that the SPRT may integrate overly local information. What if class signatures are far separated in time? In such a case, the SPRT-TANDEM may fail to integrate the distributed class signatures for correct classification. On the other hand, the SPRT-TANDEM may be able to add the useful information of the class signatures to the LLR only when encountering the signatures (in other words, add no non-zero values to the LLR without seeing class signatures). The SPRT-TANDEM may be able to skip unnecessary data points without modification, or with a modification similar to SkipRNN (Campos et al. (2018)), which actively achieves this goal: by learning which data points are unnecessary, the SkipRNN skips updating the internal state of the RNN to attend only to informative data. Similarly, we can modify the SPRT-TANDEM so that it learns to skip updating the LLR upon encountering uninformative data. This will be exciting future work, and the authors look forward to testing the SPRT-TANDEM on a challenging database with distributed class signatures. A more challenging dataset: Nosaic MNIST-Hard (NMNIST-H). In the main text, we see that the accuracy of the SPRT-TANDEM saturates within a few timestamps. Therefore, it is worth testing the models on a dataset that requires more samples to reach good performance.
We create a more challenging dataset, Nosaic MNIST-Hard: the MNIST handwritten digits are buried in heavier noise than in the original NMNIST (only 10 pixels/frame are revealed, compared with 40 pixels/frame for the original NMNIST). The resulting speed-accuracy tradeoff curves below show that the SPRT-TANDEM outperforms LSTM-s/m by more than the error-bar range, even on this more challenging dataset, which requires more timestamps to reach accuracy saturation.
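The threshold-controlled speed-accuracy tradeoff discussed above can be traced numerically once LLR trajectories are computed, without re-training. The following sketch is ours, not part of the released code; the function and variable names are illustrative. It sweeps a symmetric threshold over precomputed trajectories, forces a decision at the final frame if no threshold is crossed, and reports the mean hitting time and balanced accuracy per threshold:

```python
import numpy as np

def sat_curve(llrs, labels, thresholds):
    """Sweep symmetric thresholds over precomputed LLR trajectories.

    llrs: (n_sequences, T) array of cumulative log-likelihood ratios.
    labels: (n_sequences,) array of 0/1 ground-truth classes.
    Returns a list of (threshold, mean_hitting_time, balanced_accuracy).
    """
    n, T = llrs.shape
    curve = []
    for lam in thresholds:
        hit = np.abs(llrs) >= lam                       # (n, T) boolean
        # first index where either threshold is crossed; forced decision at T
        t_stop = np.where(hit.any(axis=1), hit.argmax(axis=1), T - 1)
        pred = (llrs[np.arange(n), t_stop] > 0).astype(int)
        tpr = (pred[labels == 1] == 1).mean()           # true positive rate
        tnr = (pred[labels == 0] == 0).mean()           # true negative rate
        curve.append((lam, (t_stop + 1).mean(), (tpr + tnr) / 2))
    return curve
```

Plotting mean hitting time against balanced accuracy over a grid of thresholds yields the SAT curve used throughout the experiments.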

E LOSS FOR LOG-LIKELIHOOD RATIO ESTIMATION (LLLR)

In this section, we discuss a deep connection of the novel loss function, the Loss for Log-Likelihood Ratio estimation (LLLR),

$$L_{\rm LLR} = \frac{1}{M}\sum_{i\in I_1}\left|1 - \sigma(\log \hat{r}(X_i))\right| + \frac{1}{M}\sum_{i\in I_0}\sigma(\log \hat{r}(X_i)),$$

to density ratio estimation (Sugiyama et al. (2012; 2010)). Here, $X_i := \{x_i^{(t)} \in \mathbb{R}^{d_x}\}_{t=1}^{T}$ and $y_i \in \{1, 0\}$ ($i \in I := I_1 \cup I_0$, $T \in \mathbb{N}$, $d_x \in \mathbb{N}$) are a sequence of samples and a label, respectively, where $I$, $I_1$, and $I_0$ are the index sets of the whole dataset, class 1, and class 0, respectively. $r(X_i)$ ($i \in I$) is the likelihood ratio of $X_i$. The hatted notation ($\hat{\cdot}$) means that the quantity is an estimate obtained with, e.g., a neural network trained on the dataset $\{(X_i, y_i)\}_{i\in I}$. Note that we do not necessarily have to compute $\hat{p}(X_i|y=1)$ and $\hat{p}(X_i|y=0)$ separately to obtain the likelihood ratio $\hat{r}(X_i) = \frac{\hat{p}(X_i|y=1)}{\hat{p}(X_i|y=0)}$; we can estimate $\hat{r}$ directly, as is explained in the following subsections. In the following, we first introduce KLIEP (Kullback-Leibler Importance Estimation Procedure, Sugiyama et al. (2008)), which underlies the theoretical aspects of the LLLR. KLIEP was originally invented to estimate the density ratio without directly estimating the densities. The idea is to minimize the Kullback-Leibler divergence of the true density $p(X|y=1)$ and the estimated density $\hat{r}(X)p(X|y=0)$, where $(X, y)$ is a sequential data-label pair defined on the same space as the $(X_i, y_i)$'s. Next, we introduce the symmetrized KLIEP, which accounts not only for $p(X|y=1)$ and $\hat{r}(X)p(X|y=0)$, but also for $p(X|y=0)$ and $\hat{r}^{-1}(X)p(X|y=1)$, to remove the asymmetry inherent in the Kullback-Leibler divergence. Finally, we show the equivalence of the symmetrized KLIEP to the LLLR; specifically, we show that the LLLR minimizes the Kullback-Leibler divergence of the true and the estimated densities, and further stabilizes the training by restricting the value of the likelihood ratio.
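As a concrete reference for the definition above, a minimal NumPy sketch of the LLLR can be written as follows (illustrative only, not the released implementation; `lllr` and its arguments are our naming):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lllr(log_r, labels):
    """Loss for Log-Likelihood Ratio estimation (LLLR).

    log_r: (M,) estimated log-likelihood ratios log r_hat(X_i).
    labels: (M,) 0/1 class labels.
    Implements
      L_LLR = (1/M) * [ sum_{i in I_1} |1 - sigma(log r_i)|
                      + sum_{i in I_0} sigma(log r_i) ].
    """
    s = sigmoid(log_r)
    M = len(log_r)
    return (np.abs(1.0 - s[labels == 1]).sum() + s[labels == 0].sum()) / M
```

Because the sigmoid is bounded, the loss is bounded in [0, 1], in contrast to the KLIEP loss discussed below.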

E.1 DENSITY RATIO ESTIMATION AND KLIEP

In this section, we briefly review density ratio estimation and introduce KLIEP. Density estimation is the construction of underlying probability densities from observed datasets. Taking their ratio, we can naively estimate the density ratio; however, division by an estimated quantity is likely to enhance the estimation error (Sugiyama et al. (2012; 2010)). Density ratio estimation has been developed to circumvent this problem. The methods can be categorized into the following four: probabilistic classification, moment matching, density ratio fitting, and density fitting. Probabilistic classification. By the Bayes rule, the likelihood ratio can be estimated from the class-posterior ratio:
$$\frac{p(X|y=1)}{p(X|y=0)} = \frac{p(y=1|X)}{p(y=0|X)} \cdot \frac{p(y=0)}{p(y=1)} \approx \frac{\hat{p}(y=1|X)}{\hat{p}(y=0|X)} \cdot \frac{M_0}{M_1},$$
where $M_1$ and $M_0$ denote the number of training data points with label 1 and 0, respectively. Thus we can estimate the likelihood ratio from the estimated posterior ratio. The multiplet cross-entropy loss conducts density ratio estimation in this way. Moment matching. The moment matching approach aims to match the moments of $p(X|y=1)$ and $\hat{r}(X)p(X|y=0)$, according to the fact that two distributions are identical if and only if all of their moments agree with each other. Density ratio fitting. Without knowing the true densities, we can directly minimize the difference between the true and estimated ratios as follows:
$$\underset{\hat{r}}{\rm argmin} \int dX\, p(X|y=0)\,(r(X) - \hat{r}(X))^2 \quad (88)$$
$$= \underset{\hat{r}}{\rm argmin} \left[\int dX\, p(X|y=0)\,\hat{r}(X)^2 - 2\int dX\, p(X|y=1)\,\hat{r}(X)\right] \quad (89)$$
$$\approx \underset{\hat{r}}{\rm argmin} \left[\frac{1}{M_0}\sum_{i\in I_0}\hat{r}(X_i)^2 - \frac{2}{M_1}\sum_{i\in I_1}\hat{r}(X_i)\right]. \quad (90)$$
Here, we applied the empirical approximation. In addition, we restrict the value of $\hat{r}(X)$: $\hat{r}(X) \geq 0$. Since (90) is not bounded below, we must add other terms or put more constraints, as is done in the original paper (Kanamori et al. (2009)). This formulation of density ratio estimation is referred to as least-squares importance fitting (LSIF, Kanamori et al. (2009)). Density fitting.
Instead of the squared error, KLIEP minimizes the Kullback-Leibler divergence ${\rm KL}(p(X|y=1)\,\|\,\hat{r}(X)p(X|y=0))$. We need to restrict $\hat{r}$:
$$0 \leq \hat{r}(X), \quad (94)$$
$$\int dX\, \hat{r}(X)\,p(X|y=0) = 1. \quad (95)$$
The first inequality ensures the positivity of the estimated density, while the second equation is the normalization condition. Applying the empirical approximation, we obtain the final objective and constraints:
$$\underset{\hat{r}}{\rm argmin}\ \frac{1}{M_1}\sum_{i\in I_1}\left(-\log \hat{r}(X_i)\right), \quad (96)$$
$$\hat{r}(X) \geq 0, \quad (97)$$
$$\frac{1}{M_0}\sum_{i\in I_0}\hat{r}(X_i) = 1. \quad (98)$$
Several papers implement the algorithms mentioned above with deep neural networks. In Nam & Sugiyama (2015), LSIF is applied to outlier detection with a deep neural network implementation, whereas in Khan et al. (2019), KLIEP and its variant are applied to change point detection.
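A minimal sketch of the empirical KLIEP objective (96) and the normalization constraint (98) may clarify the procedure. Here the constraint is imposed by rescaling $\hat{r}$ with a constant (a shift in log space), which is one simple choice among several; the function names are ours:

```python
import numpy as np

def kliep_objective(log_r, labels):
    """Empirical KLIEP objective (96): mean of -log r_hat over class-1 samples."""
    return -(log_r[labels == 1]).mean()

def normalize_log_r(log_r, labels):
    """Enforce the empirical constraint (98),
    (1/M0) * sum_{i in I_0} r_hat(X_i) = 1,
    by a constant rescaling of r_hat (a shift in log space)."""
    r0 = np.exp(log_r[labels == 0])
    return log_r - np.log(r0.mean())
```

In an actual neural-network implementation the rescaling would be applied per mini-batch, which is one reason the loss can be unstable in practice.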

E.2 THE SYMMETRIZED KLIEP LOSS

As shown above, KLIEP minimizes the Kullback-Leibler divergence; however, its asymmetry can cause instability of the training, and thus we introduce the symmetrized KLIEP loss. A similar idea was proposed in Khan et al. (2019) independently of our analysis. First, notice that
$${\rm KL}(p(X|y=1)\,\|\,\hat{r}\,p(X|y=0)) = \int dX\, p(X|y=1)\log\frac{p(X|y=1)}{\hat{r}(X)p(X|y=0)} \quad (99)$$
$$= -\int dX\, p(X|y=1)\log \hat{r}(X) + {\rm const.} \quad (100)$$
The constant term is independent of the weight parameters of the network and thus negligible in the following discussion. Similarly,
$${\rm KL}(p(X|y=0)\,\|\,\hat{r}^{-1}p(X|y=1)) = -\int dX\, p(X|y=0)\log \hat{r}(X)^{-1} + {\rm const.} \quad (101)$$
We need to restrict the value of $\hat{r}$ in order for $\hat{r}(X)p(X|y=0)$ and $\hat{r}(X)^{-1}p(X|y=1)$ to be probability densities:
$$0 \leq \hat{r}(X)p(X|y=0), \qquad \int dX\, \hat{r}(X)p(X|y=0) = 1, \quad (102,\,103)$$
and
$$0 \leq \hat{r}(X)^{-1}p(X|y=1), \qquad \int dX\, \hat{r}(X)^{-1}p(X|y=1) = 1. \quad (104,\,105)$$
Therefore, we define the symmetrized KLIEP loss as
$$L_{\rm KLIEP} := \int dX\left(-p(X|y=1)\log \hat{r}(X)\right) + \int dX\left(-p(X|y=0)\log \hat{r}(X)^{-1}\right) \quad (106)$$
with the constraints (102)-(105). The estimated ratio function ${\rm argmin}_{\hat{r}}\, L_{\rm KLIEP}$ under the constraints minimizes ${\rm KL}(p(X|y=1)\,\|\,\hat{r}(X)p(X|y=0)) + {\rm KL}(p(X|y=0)\,\|\,\hat{r}^{-1}(X)p(X|y=1))$. The empirical approximation reduces these to
$$L_{\rm KLIEP}(\{X_i\}_{i\in I}) \approx \frac{1}{M_1}\sum_{i\in I_1}\left(-\log \hat{r}(X_i)\right) + \frac{1}{M_0}\sum_{i\in I_0}\left(-\log \hat{r}(X_i)^{-1}\right), \quad (107)$$
$$\hat{r}(X) \geq 0, \qquad \frac{1}{M_0}\sum_{i\in I_0}\hat{r}(X_i) = 1, \qquad \frac{1}{M_1}\sum_{i\in I_1}\hat{r}(X_i)^{-1} = 1. \quad (108\text{-}110)$$
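The empirical symmetrized KLIEP loss (107) can be written compactly; note that $-\log \hat{r}^{-1} = +\log \hat{r}$, so the class-0 term rewards small $\hat{r}$ and the loss is unbounded below, which is exactly the instability that the bounded LLLR avoids. A sketch with our naming:

```python
import numpy as np

def symmetrized_kliep(log_r, labels):
    """Empirical symmetrized KLIEP loss (107), constraints omitted:
      (1/M1) sum_{i in I_1} -log r_hat(X_i)
    + (1/M0) sum_{i in I_0} -log r_hat(X_i)^{-1}.
    Since -log r^{-1} = +log r, the second term penalizes large r_hat
    on class-0 samples.
    """
    term1 = -(log_r[labels == 1]).mean()  # pushes r_hat up on class 1
    term0 = (log_r[labels == 0]).mean()   # pushes r_hat down on class 0
    return term1 + term0
```

Driving `log_r` to +inf on class 1 and -inf on class 0 decreases this loss without bound, so the normalization constraints (109)-(110) are essential for KLIEP but unnecessary for the LLLR.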

E.3 THE LLLR AND DENSITY RATIO ESTIMATION

Let us investigate the LLLR in connection with the symmetrized KLIEP loss. Divergence terms. First, we focus on the divergence terms in (107):
$$\frac{1}{M_1}\sum_{i\in I_1}-\log \hat{r}(X_i), \quad (111)$$
$$\frac{1}{M_0}\sum_{i\in I_0}-\log \hat{r}(X_i)^{-1}. \quad (112)$$
As shown above, decreasing (111) and (112) leads to minimizing the Kullback-Leibler divergence of $p(X|y=1)$ and $\hat{r}p(X|y=0)$ and that of $p(X|y=0)$ and $\hat{r}^{-1}p(X|y=1)$, respectively. The counterparts in the LLLR are
$$\frac{1}{M}\sum_{i\in I_1}\left|1-\sigma(\log \hat{r}(X_i))\right| \;\leftrightarrow\; \frac{1}{M_1}\sum_{i\in I_1}-\log \hat{r}(X_i), \quad (113)$$
$$\frac{1}{M}\sum_{i\in I_0}\sigma(\log \hat{r}(X_i)) \;\leftrightarrow\; \frac{1}{M_0}\sum_{i\in I_0}-\log \hat{r}(X_i)^{-1}, \quad (114)$$
because, on the one hand, both terms in (113) ensure the likelihood ratio $\hat{r}$ to be large for class 1, and, on the other hand, both terms in (114) ensure $\hat{r}$ to be small for class 0. Therefore, minimizing $L_{\rm LLR}$ is equivalent to decreasing both (111) and (112), and therefore to minimizing the symmetrized KLIEP loss (107), i.e., the Kullback-Leibler divergences of the true and estimated densities. Again, we emphasize that the LLLR is more stable, since $L_{\rm LLR}$ is lower-bounded, unlike the KLIEP loss. Constraints. Next, we show that the LLLR implicitly keeps $\hat{r}$ neither too large nor too small; specifically, with increasing $R(X_i) := |\log \hat{r}(X_i)|$, the gradient converges to zero before $R(X_i)$ enters the region $R(X_i) \gg 1$. Therefore the gradient descent converges before $\hat{r}(X_i)$ becomes too large or too small. To show this, we first write the gradient explicitly:
$$\nabla_W\, \sigma(\log \hat{r}(X_i)) = \sigma'(\log \hat{r}(X_i)) \cdot \nabla_W \log \hat{r}(X_i), \quad (115)$$
where $W$ is the weight and $\sigma$ is the sigmoid function. We see that with increasing $R(X_i) = |\log \hat{r}(X_i)|$, the factor
$$\sigma'(\log \hat{r}(X_i)) \quad (116)$$
converges to zero, because (116) $\approx 0$ for too large or too small $\hat{r}(X_i)$, e.g., for $R(X_i) \gg 1$. Thus the gradient (115) vanishes before $\hat{r}(X_i)$ becomes too large or too small, i.e., keeping $\hat{r}(X_i)$ moderate.
In conclusion, the LLLR minimizes the difference between the true ($p(X|y=1)$ and $p(X|y=0)$) and the estimated ($\hat{r}^{-1}(X)p(X|y=1)$ and $\hat{r}(X)p(X|y=0)$) densities in the sense of the Kullback-Leibler divergence, with the effective constraints included. The trivial solution in Equation (115). We show that a vanishing
$$\nabla_W \log \hat{r}(X_i) \quad (117)$$
in (115) corresponds to a trivial solution; i.e., we show that (117) $= 0$ forces the bottleneck feature vectors to be zero. Let us follow the notation in Table 4 and Figure 5. We particularly focus on the last components of the gradient $\nabla_W \log \hat{r}(X_i)$, i.e., $\nabla_{W^{(L)}_{ab}} \log \hat{r}(X_i)$ ($a \in [d_{L-1}]$ and $b \in [d_L] = \{0, 1\}$). Specifically,
$$\nabla_{W^{(L)}_{ab}} \log \hat{r} = \nabla_{W^{(L)}_{ab}} \log \hat{p}(X|y=1) - \nabla_{W^{(L)}_{ab}} \log \hat{p}(X|y=0) = \sum_{y'=1,0}\left[\frac{\partial g^{(L)}_{y'}}{\partial W^{(L)}_{ab}}\frac{\partial \log \hat{p}(X|y=1)}{\partial g^{(L)}_{y'}} - \frac{\partial g^{(L)}_{y'}}{\partial W^{(L)}_{ab}}\frac{\partial \log \hat{p}(X|y=0)}{\partial g^{(L)}_{y'}}\right]. \quad (118)$$
Since $\partial \log \hat{p}_y / \partial g^{(L)}_{y'} = \delta_{yy'} - \hat{p}_{y'}$, where $y, y' \in \{1, 0\}$ and $\delta_{yy'}$ is the Kronecker delta, we see
$$\frac{\partial \log \hat{r}}{\partial g^{(L)}_{y'}} = (\delta_{y'1} - \hat{p}_{y'}) - (\delta_{y'0} - \hat{p}_{y'}) = \begin{cases} 1 & (y' = 1) \\ -1 & (y' = 0) \end{cases} \quad (119)$$
$$\therefore\ (118) = \frac{\partial g^{(L)}_1}{\partial W^{(L)}_{ab}} \cdot 1 + \frac{\partial g^{(L)}_0}{\partial W^{(L)}_{ab}} \cdot (-1) = \delta_{1b} f^{(L-1)}_a - \delta_{0b} f^{(L-1)}_a. \quad (120)$$
Thus (117) $= 0 \Longrightarrow$ (118) $= 0 \Longleftrightarrow f^{(L-1)}_a = 0$ ($\forall a \in [d_{L-1}]$), which is a trivial solution, because the bottleneck feature vector collapses to zero at convergence. Our experiments, however, show that our model does not tend to such a trivial solution; otherwise, the SPRT-TANDEM could not attain such high performance. Table 4 summarizes the notation:
- $f^{(0)}(x_i) = (f^{(0)}_1, ..., f^{(0)}_{d_x})^{\rm T} = x_i \in \mathbb{R}^{d_0 = d_x}$
- $f^{(l)}(x_i) = (f^{(l)}_1, ..., f^{(l)}_{d_l})^{\rm T} = \sigma(g^{(l)}(x_i)) \in \mathbb{R}^{d_l}$ $(l = 1, 2, ..., L-1)$
- $f^{(L)}(x_i) = (f^{(L)}_1, ..., f^{(L)}_{d_L})^{\rm T} = {\rm softmax}(g^{(L)}(x_i)) \in \mathbb{R}^{d_L = 2}$
- $g^{(l)}(x_i) = (g^{(l)}_1, ..., g^{(l)}_{d_l})^{\rm T} = W^{(l){\rm T}} f^{(l-1)}(x_i) \in \mathbb{R}^{d_l}$ $(l = 1, 2, ..., L)$
- $W^{(l)} \in \mathbb{R}^{d_{l-1} \times d_l}$ $(l = 1, 2, ..., L)$
- $S = \{(x_i, t_i)\}_{i=1}^{M}$: training dataset; $d_x$: input dimension; $x_i \in \mathbb{R}^{d_x}$: input vector; $L$: number of layers; $t_i \in \mathbb{R}^2$: one-hot label vector; $\sigma$: activation function.
We assume that the network has $L$ fully-connected layers with weights $W^{(l)}$ and activation function $\sigma$, followed by a final softmax with the cross-entropy loss.
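The saturation argument above rests on the derivative of the sigmoid vanishing for large $|\log \hat{r}|$. A quick numeric check (illustrative only) confirms that $\sigma'(z) = \sigma(z)(1-\sigma(z))$ peaks at 0.25 and decays rapidly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)); maximal value 0.25 at z = 0,
    # and effectively zero once |z| = |log r_hat| is of order 10.
    s = sigmoid(z)
    return s * (1.0 - s)
```

For example, at $z = 10$ the factor is already below $10^{-4}$, so the gradient (115) is strongly suppressed long before $\hat{r}$ becomes extreme.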

E.4 PREPARATORY EXPERIMENT TESTING THE EFFECTIVENESS OF THE LLLR

To test whether the proposed $L_{\rm LLR}$ can effectively train a neural network, we ran two preliminary experiments before the main manuscript. First, we compared training the proposed network architecture with $L_{\rm multiplet}$, with and without $L_{\rm LLR}$. Next, we compared training the network with $L_{\rm multiplet}$ and $L_{\rm LLR}$ against training with $L_{\rm multiplet}$ and the KLIEP loss, $L_{\rm KLIEP}$, whose numerator and denominator were carefully bounded so that $L_{\rm KLIEP}$ did not diverge. We tested the effectiveness of $L_{\rm LLR}$ on a 3D-ness detection task on a depth-from-motion (DfM) datasetfoot_3. The DfM dataset was a small dataset containing 2320 and 2609 3D- and 2D-facial videos, respectively, each of which consisted of 10 frames. In each video, a face was filmed from various angles (Figure 6a), so that the dynamics of the facial features could be used to determine whether the face appearing in the video was a flat 2D face or had a 3D structure. The recording device was an iPhone 7. Here, the feature extractor $f_{w_{\rm FE}}(x^{(t)})$ was the LBP-AdaBoost algorithm (Viola & Jones (2001)) combined with the supervised descent method (Xiong & De la Torre (2013)), which took the t-th frame of the given video as input $x^{(t)}$ and output facial feature points (Figure 6b). The feature points were output as a vector of length 152, consisting of the vertical and horizontal pixel positions of 76 facial feature points. The temporal integrator $g_{w_{\rm TI}}(x^{(t)})$ was an LSTM, whose number of hidden units was the same as the number of feature points. The 1st-order SPRT-TANDEM was evaluated on two hypotheses, y = 1: 2D face, and y = 0: 3D face. We assumed a flat prior, p(y = 1) = p(y = 0). The validation and test data were each 10% of the entire data, randomly selected at the beginning of training. We conducted a 10-fold cross-validation test to evaluate the effect of $L_{\rm LLR}$. We compared the classification performance of the SPRT-TANDEM network using both $L_{\rm LLR}$ and $L_{\rm multiplet}$ against using $L_{\rm multiplet}$ only.
We also compared $L_{\rm LLR} + L_{\rm multiplet}$ and $L_{\rm KLIEP} + L_{\rm multiplet}$. To use $L_{\rm KLIEP}$ without making the loss diverge, we set the upper and lower bounds of the numerator and denominator of $\hat{r}$ to $10^5$ and $10^{-5}$, respectively. Out of 100 training epochs, the results of the last 80 epochs were used to calculate test equal error rates (EERs). A two-way ANOVA with the factors "loss type" and "epoch" was conducted to see whether the difference in loss function caused statistically significantly different EERs. We included the epoch as a factor in order to see whether the EER reached a plateau in the last 80 epochs (i.e., statistically NOT significant). As we expected, EER values across training epochs were not significantly different (p = 0.17). On the other hand, the loss type caused statistically significant differences between the loss groups (i.e., $L_{\rm LLR} + L_{\rm multiplet}$, $L_{\rm KLIEP} + L_{\rm multiplet}$, and $L_{\rm multiplet}$; p < 0.001). The following Tukey-Kramer multi-comparison test showed that training with the $L_{\rm LLR}$ loss statistically significantly reduced the EER, compared to both $L_{\rm KLIEP}$ (p = 9.56 × 10^{-10}) and the $L_{\rm LLR}$-ablated loss (p = 9.56 × 10^{-10}). The result is plotted in Figure 7.
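For reference, the equal error rate used above can be approximated directly from raw detection scores; the following is a minimal sketch with our naming, not the evaluation code used in the experiment:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the equal error rate (EER): the operating point at which
    the false-positive rate equals the false-negative rate.

    scores: (N,) detection scores (higher = more likely class 1).
    labels: (N,) 0/1 ground truth.
    """
    best = (float("inf"), None)  # (|FPR - FNR|, EER candidate)
    for th in np.unique(scores):
        pred = scores >= th
        fpr = (pred & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        fnr = (~pred & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        gap = abs(fpr - fnr)
        if gap < best[0]:
            best = (gap, (fpr + fnr) / 2)
    return best[1]
```

With finite data the two error rates rarely cross exactly, so the EER is reported at the threshold where they are closest, averaging the two rates there.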

F PROBABILITY DENSITY RATIO ESTIMATION WITH THE LLLR

Below we test whether the proposed LLLR can help a neural network estimate the true probability density ratio. Providing the ground-truth probability density ratio was difficult for the three databases used in the main text, because it was prohibitive to find the true probability distribution underlying public databases containing real-world scenes. Thus, we create a toy model estimating the probability density ratio of two multivariate Gaussian distributions. Experimental results show that a multilayer perceptron (MLP) trained with the proposed LLLR achieves a smaller estimation error than an MLP trained with the cross-entropy (CE) loss.

F.1 EXPERIMENTAL SETTINGS

Following Sugiyama et al. (2008), let $p_0(x)$ be the d-dimensional Gaussian density with mean (2, 0, 0, ..., 0) and identity covariance, and $p_1(x)$ be the d-dimensional Gaussian density with mean (0, 2, 0, ..., 0) and identity covariance. The task for the neural network is to estimate the density ratio $r(x_i) = \frac{p_1(x_i)}{p_0(x_i)}$. Here, $x_i$ is sampled from one of the two Gaussian distributions, $p_0$ or $p_1$, and is associated with the class label y = 0 or y = 1, respectively. We compared the two loss functions, the CE loss and the LLLR:
$$L_{\rm LLR} := \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \sigma(\log \hat{r}_i)\right|,$$
where $\sigma$ is the sigmoid function. A simple neural network, a 3-layer fully-connected network with nonlinear activation (ReLU), is used for estimating $\hat{r}(x)$. The evaluation metric is the normalized mean squared error (NMSE, Sugiyama et al. (2008)):
$${\rm NMSE} := \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\hat{r}_i}{\sum_{j=1}^{N}\hat{r}_j} - \frac{r_i}{\sum_{j=1}^{N} r_j}\right)^2.$$
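This toy setup admits a closed-form ground truth: with identity covariances, the log-ratio reduces to a linear function of $x$. A sketch of the true log-ratio and the NMSE metric (our naming; the MLP itself is omitted):

```python
import numpy as np

def true_log_ratio(x):
    """log p1(x)/p0(x) for d-dim Gaussians with identity covariance,
    mean1 = (0, 2, 0, ..., 0) and mean0 = (2, 0, 0, ..., 0).
    log N(x; m1, I) - log N(x; m0, I) = -0.5*(|x - m1|^2 - |x - m0|^2),
    which simplifies to 2*(x[1] - x[0])."""
    d = x.shape[1]
    m1 = np.zeros(d); m1[1] = 2.0
    m0 = np.zeros(d); m0[0] = 2.0
    return -0.5 * (((x - m1) ** 2).sum(1) - ((x - m0) ** 2).sum(1))

def nmse(r_hat, r_true):
    """Normalized mean squared error (Sugiyama et al., 2008):
    compare the two ratios after normalizing each to sum to one."""
    a = r_hat / r_hat.sum()
    b = r_true / r_true.sum()
    return ((a - b) ** 2).mean()
```

Comparing an MLP's output against `true_log_ratio` via `nmse` reproduces the evaluation protocol described above.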

F.2 DENSITY ESTIMATION RESULTS

To calculate statistics, the MLP was trained either with the LLLR or the CE loss, repeated 40 times with different random initial variables. Figure 8 shows the mean NMSE, with the shading showing the standard error of the mean. Although training with the LLLR does not decrease the NMSE well in the first few thousand iterations, the NMSE reaches as low as $10^{-5}$ around 14000 iterations. In contrast, training with CE shows a steep decrease of the NMSE in the first 2000 iterations, but saturates after that. Thus, the proposed LLLR not only facilitates sequential binary hypothesis testing, but also facilitates the estimation of the true density ratio.

G MULTIPLET CROSS-ENTROPY LOSS

In this section, we show that estimation of the true posterior is realized by minimizing the multiplet cross-entropy loss defined in Section 4, on the basis of the principle of maximum likelihood estimation. First, let us consider the 1st order for simplicity. The multiplet cross-entropy loss ensures that the posterior $\hat{p}(y|x^{(t)})$ estimated by the network is close to the true posterior $p(y|x^{(t)})$. Consider the Kullback-Leibler divergence of $p(y|x^{(t)})$ and $\hat{p}(y|x^{(t)})$ for some $x^{(t)} \in \mathbb{R}^{d_x}$ ($t \in \mathbb{N}$), where $y \in \{0, 1\}$:
$$\underset{\hat{p}}{\rm argmin}\ \mathbb{E}_{x^{(t)} \sim p(x^{(t)})}\left[{\rm KL}(p(y|x^{(t)})\,\|\,\hat{p}(y|x^{(t)}))\right] = \underset{\hat{p}}{\rm argmin}\ \mathbb{E}_{(x^{(t)}, y) \sim p(x^{(t)}, y)}\left[-\log \hat{p}(y|x^{(t)})\right] \quad (125)$$
$$\approx \underset{\hat{p}}{\rm argmin}\ \frac{1}{M}\sum_{i=1}^{M}\left[-\log \hat{p}(y_i|x_i^{(t)})\right]. \quad (126)$$
Thus, the last line shows that a smaller singlet loss leads to a smaller Kullback-Leibler divergence; in other words, we can estimate the true posterior density by minimizing the multiplet loss, which is necessary to run the SPRT algorithm. Similarly, we adopt the doublet cross-entropy to estimate the true posterior $p(y|x^{(t)}, x^{(t+1)})$:
$$\underset{\hat{p}}{\rm argmin}\ \mathbb{E}_{(x^{(t)}, x^{(t+1)}) \sim p(x^{(t)}, x^{(t+1)})}\left[{\rm KL}(p(y|x^{(t)}, x^{(t+1)})\,\|\,\hat{p}(y|x^{(t)}, x^{(t+1)}))\right] \approx \underset{\hat{p}}{\rm argmin}\ \frac{1}{M}\sum_{i=1}^{M}\left[-\log \hat{p}(y_i|x_i^{(t)}, x_i^{(t+1)})\right].$$
The crucial difference from the singlet loss is that the doublet loss involves the temporal correlation between $x^{(t)}$ and $x^{(t+1)}$, which is necessary to implement the SPRT-TANDEM. Similar statements hold for the other orders.
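The empirical doublet objective above can be sketched as follows (illustrative naming; in the actual model the log-posteriors would come from the temporal integrator):

```python
import numpy as np

def doublet_cross_entropy(log_posteriors, labels):
    """Empirical doublet cross-entropy, cf. (126) generalized to pairs:
    (1/M) * sum_i -log p_hat(y_i | x_i^(t), x_i^(t+1)).

    log_posteriors: (M, 2) array of log p_hat(y | x^(t), x^(t+1)) for y = 0, 1.
    labels: (M,) 0/1 class labels.
    """
    M = len(labels)
    # pick the log-posterior of the correct class for each sample
    return -log_posteriors[np.arange(M), labels].mean()
```

The singlet version is identical except that the posteriors are conditioned on a single frame; higher-order multiplets condition on longer windows.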

H HYPERPARAMETER OPTIMIZATION

We used Optuna, a hyperparameter optimization framework, to determine the hyperparameters. Hyperparameter search trials are followed by performance evaluation trials with the fixed hyperparameter configuration. The evaluation criterion used by Optuna to find the best parameter combination is balanced accuracy. For models that produce multiple balanced accuracy / mean hitting time combinations, we use the average of balanced accuracy at every natural number of the mean hitting time (e.g., one frame, two frames).
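Optuna draws a categorical suggestion for each hyperparameter per trial and keeps the configuration maximizing the chosen criterion. As a self-contained stand-in for that loop (a pure-Python random search over the Appendix H.1 temporal-integrator space; this is not the actual tuning script, and the dictionary keys are our naming):

```python
import random

SEARCH_SPACE = {  # categorical space quoted from Appendix H.1 (temporal integrator)
    "learning_rate": [1e-2, 1e-3, 1e-4, 1e-5],
    "batch_size": [256, 512, 1024],
    "optimizer": ["Adam", "Momentum", "Adagrad", "RMSprop"],
    "weight_decay": [1e-3, 1e-4, 1e-5],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
}

def random_search(evaluate, n_trials, seed=0):
    """Return the configuration maximizing `evaluate` (e.g., balanced accuracy)."""
    rng = random.Random(seed)
    best_score, best_cfg = float("-inf"), None
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = evaluate(cfg)  # train + validate with this configuration
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

In the real pipeline, `evaluate` trains the network with `cfg` and returns the averaged balanced accuracy described above; Optuna additionally prunes unpromising trials, which plain random search does not.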

H.1 NOSAIC MNIST (NMNIST)

SPRT-TANDEM: feature extractor. ResNet version 1 with 110 layers and 128 final output channels (total trainable parameters: 6.9M) is used. Hyperparameters are searched within the following space: learning rate ∈ {10^{-2}, 10^{-3}}, optimizer ∈ {Adam, Momentum, RMSprop}, weight decay ∈ {10^{-3}, 10^{-4}, 10^{-5}}. Here, Adam, Momentum, and RMSprop are based on Kingma & Ba (2014), Rumelhart et al. (1986), and Graves (2013), respectively. Batch size and the number of training epochs are fixed to 64 and 50, respectively. The best hyperparameter combination is summarized in Table 5. One search trial takes approximately 5 hours on our computing infrastructure (see Appendix K). SPRT-TANDEM: temporal integrator. Peephole-LSTM with a hidden layer of size 128 (total trainable parameters: 0.1M) is used. Hyperparameters are searched within the following space: learning rate ∈ {10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}}, batch size ∈ {256, 512, 1024}, optimizer ∈ {Adam, Momentum, Adagrad, RMSprop}, weight decay ∈ {10^{-3}, 10^{-4}, 10^{-5}}, dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}. Here, Adagrad is based on Duchi et al. (2011). The number of training epochs is fixed to 50. The number of search trials and the resulting best hyperparameter combination are summarized in Table 6. LSTM-m / LSTM-s. Peephole-LSTM with a hidden layer of size 128 (total trainable parameters: 0.1M) is used. Hyperparameters are searched within the following space: learning rate ∈ {10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}}, optimizer ∈ {Adam, Momentum, Adagrad, RMSprop}, weight decay ∈ {10^{-3}, 10^{-4}, 10^{-5}}, dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}, lambda ∈ {0.01, 0.1, 1, 6, 10, 100}, where lambda is a parameter specific to LSTM-m / LSTM-s. Batch size and the number of training epochs are fixed to 1024 and 100, respectively. The number of search trials and the resulting best hyperparameter combination are summarized in Table 7. One search trial takes approximately 3 hours on our computing infrastructure (see Appendix K). EARLIEST.
LSTM with a hidden layer of size 128 (total trainable parameters: 0.1M) is used. Hyperparameters are searched within the following space: learning rate ∈ {10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}}, optimizer ∈ {Adam, Momentum, Adagrad, RMSprop}, weight decay ∈ {10^{-3}, 10^{-4}, 10^{-5}}, dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}. Batch size and the number of training epochs are fixed to 1foot_4 and 2, respectively. The number of search trials and the resulting best hyperparameter combination are summarized in Table 8.

H.2 UCF101

Hyperparameters are searched within the following space: learning rate ∈ {10^{-3}, 10^{-4}, 10^{-5}}, batch size ∈ {100, 200, 500}, weight decay ∈ {10^{-3}, 10^{-4}, 10^{-5}}. The optimizer and the number of training epochs are fixed to Adam and 50, respectively. The number of search trials and the resulting best hyperparameter combination are summarized in Table 9. LSTM-m / LSTM-s. Peephole-LSTM with a hidden layer of size 64 (total trainable parameters: 33K) is used. Hyperparameters are searched within the following space: learning rate ∈ {10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}}, batch size ∈ {57, 114, 171, 342}, optimizer ∈ {Adam, Momentum, Adagrad, RMSprop}, weight decay ∈ {10^{-3}, 10^{-4}, 10^{-5}}, dropout ∈ {0.1, 0.2, 0.3, 0.4}, lambda ∈ {0.01, 0.1, 1, 6, 10, 100}. The number of training epochs is fixed to 100. The number of search trials and the resulting best hyperparameter combination are summarized in Table 13. EARLIEST. LSTM with a hidden layer of size 64 (total trainable parameters: 33K) is used. Hyperparameters are searched within the following space: learning rate ∈ {10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}}, optimizer ∈ {Adam, Momentum, Adagrad, RMSprop}, weight decay ∈ {10^{-3}, 10^{-4}, 10^{-5}}, dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}. Batch size and the number of training epochs are fixed to 1 and 30, respectively. The number of search trials and the resulting best hyperparameter combination are summarized in Table 14. Hyperparameters are searched within the following space: learning rate ∈ {10^{-3}, 10^{-4}, 10^{-5}}, weight decay ∈ {10^{-3}, 10^{-4}, 10^{-5}}.
Batch size, optimizer, and the number of training epochs are fixed to 19, Adam, and 50, respectively. The number of search trials and the resulting best hyperparameter combination are summarized in Table 15.

H.3 SIW

The large database and network size prevent us from running multiple parameter search trials on the SiW database. Thus, we manually selected the hyperparameters as follows.

I STATISTICAL TEST DETAILS

The models we compared in the experiment have various numbers of trials due to the difference in training time; some models were prohibitively expensive to run multiple times (for example, 3DResNet takes 20 hrs/epoch on the SiW database with an NVIDIA RTX2080Ti). In order to compare these models objectively, we conducted statistical tests: two-way ANOVAfoot_5 followed by the Tukey-Kramer multi-comparison test. In the tests, small numbers of trials lead to reduced test statistics, making it difficult to claim significance, because the test statistic of the Tukey-Kramer method is proportional to $1/\sqrt{1/n + 1/m}$, where $n$ and $m$ are the trial numbers of the two models to be compared. Nevertheless, the SPRT-TANDEM is statistically significantly better than the other baselines. One intuitive interpretation of this result is that "the SPRT-TANDEM achieved accuracy high enough that only a few trials of the baselines were needed to claim significance." Such statistical tests are standard practice in some research fields, such as the biological sciences, in which variable trial numbers are inevitable in experiments. All the statistical tests are executed with a customized MATLAB (2017) script. Here, the two factors for the ANOVA are (1) the model factor, containing four members: the SPRT-TANDEM with the best-performing order on the given database, LSTM-m, EARLIEST, and 3DResNet, and (2) the phase factor, containing two or three members: early and late phase (NMNIST, UCF), or early, mid, and late phase (SiW). The early, mid, and late phases are defined based on the number of frames used for classification. The actual number of frames is chosen so that the compared models use data samples as similar as possible, and thus depends on the database. The SPRT-TANDEM, LSTM-m, and 3DResNet can be compared with the same number of samples.
However, EARLIEST cannot flexibly change the average number of samples (i.e., mean hitting time); thus, we include the results of EARLIEST in the groups with the closest number of data samples possible. For NMNIST, five frames and ten frames are used to calculate the statistics of the early and late phases, respectively, except that EARLIEST uses 4.37 and 19.66 frames on average in each phase. For UCF, 15 frames and 25 frames are used to calculate the statistics of the early and late phases, respectively, except that EARLIEST uses 2.01 and 2.09 frames on average in each phasefoot_6. For SiW, 5, 15, and 25 frames are used to calculate the early, mid, and late phases, respectively, except that EARLIEST uses 1.19, 8.21, and 32.06 frames. The p-values are summarized in Tables 21, 22, 23, and 24. P-values with asterisks are statistically significant: one, two, and three asterisks show p < 0.05, p < 0.01, and p < 0.001, respectively. Here we present the details of the experiments in Section 5. Figure 9 shows the SAT curves of all the models used in the experiment. Figures 10, 11, and 12 show example LLR trajectories calculated on the NMNIST, UCF, and SiW databases, respectively. Tables 25 to 40 show the average balanced accuracy and the standard error of the mean (SEM) at the corresponding number of frames used for classification.
Figure 10: Log-likelihood ratio (LLR) trajectories calculated on the NMNIST database. Panels (a-i) show results of the 0th, 1st, 2nd, 3rd, 5th, 10th, 14th, 24th, and 49th-order SPRT-TANDEM, respectively.
Figure 11: Log-likelihood ratio (LLR) trajectories calculated on the UCF database. Red and blue trajectories represent the handstand-pushups and handstand-walking classes, respectively. Panels (a-i) show results of the 0th, 1st, 2nd, 3rd, 5th, 10th, 14th, 24th, and 49th-order SPRT-TANDEM, respectively.
Figure 12: Log-likelihood ratio (LLR) trajectories calculated on the SiW database. Red and blue trajectories represent the live and spoof classes, respectively.
Panels (a-i) shows results of 0th, 1st, 2nd, 3rd, 5th, 10th, 14th, 24th, and 49th-order SPRT-TANDEM, respectively. As we described in Section 5, the Nosaic (Noise + mOSAIC) MNIST, NMNIST for short, contains videos with 20 frames of MNIST handwritten digits, buried with noise at the first frame, gradually denoised toward the last frame. The first frame has all 255-valued pixels (white) except only 40 masks of pixels that are randomly selected to reveal the original image. Another forty pixels are randomly selected at each of the next timestamps, finally revealing the original image at the last, 20th frame. An example video is shown in Figure 13 . 
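The frame-generation rule just described can be sketched as follows (a schematic numpy reimplementation using the parameters above; the released NMNIST code in the repository is authoritative, and the function and variable names here are ours):

```python
import numpy as np

def nosaic_frames(image, num_frames=20, pixels_per_frame=40, seed=0):
    """Sketch of NMNIST-style frame generation: start from an all-white
    (255-valued) frame and reveal `pixels_per_frame` randomly chosen pixels
    of the original image at each timestep."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    order = rng.permutation(h * w)  # random reveal order over all pixels
    frames = []
    for t in range(1, num_frames + 1):
        frame = np.full((h, w), 255, dtype=image.dtype)
        revealed = order[: min(t * pixels_per_frame, h * w)]
        frame.flat[revealed] = image.flat[revealed]
        frames.append(frame)
    return np.stack(frames)

digit = np.full((28, 28), 7, dtype=np.uint8)  # stand-in for an MNIST digit
video = nosaic_frames(digit)
```

Since 20 × 40 ≥ 28 × 28 = 784, the 20th frame reveals the full image.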



Footnotes:
- $P(d = 1 \mid y = 0)$ is short for $P(\{\omega \in \Omega \mid d(X^{(1,T)})(\omega) = 1\} \mid y = 0)$ and is equivalent to $P_{X^{(1,T)} \sim p(X^{(1,T)} \mid y=0)}[d(X^{(1,T)}) = 1]$ (i.e., the probability of the decision being 1, where $X^{(1,T)}$ is sampled from the density $p(X^{(1,T)} \mid y = 0)$).
- This choice of words is, strictly speaking, not correct, because the noise decay in NMNIST is linear; the definition of half-life in physics assumes the decay to be exponential.
- In fact, the feature extractor, which classified a single frame into two classes, showed fairly high accuracy in our experiment without temporal information.
- For the protection of personal information, this database cannot be made public.
- As of the writing of this manuscript, the original code of EARLIEST does not allow a batch size larger than 1.
- Note that we also conducted a three-way ANOVA with model, phase, and database factors, achieving qualitatively the same result and verifying the superiority of the SPRT-TANDEM over the other algorithms.
- On UCF, EARLIEST does not use a large number of frames even when the hyperparameter lambda is set to a small value.



Figure 1: Conceptual figure explaining the SPRT. The SPRT calculates the log-likelihood ratio (LLR) of two competing hypotheses and updates the LLR every time a new sample ($x^{(t)}$ at time t) is acquired, until the LLR reaches one of the two thresholds. For data that are easy to classify, the SPRT outputs an answer after taking a few samples, whereas for difficult data, the SPRT takes in numerous samples in order to make a "careful" decision. For formal definitions and the optimality in early classification of time series, see Appendix A.
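The stopping rule the figure describes can be sketched in a few lines (an illustrative sketch only; the thresholds and LLR streams below are hypothetical):

```python
def sprt(llr_stream, a=-5.0, b=5.0):
    """Minimal SPRT sketch: accumulate per-sample LLRs until one of the two
    thresholds a < 0 < b is crossed; returns (decision, stopping time)."""
    s = 0.0
    t = 0
    for t, llr in enumerate(llr_stream, start=1):
        s += llr  # running log-likelihood ratio
        if s >= b:
            return 1, t  # accept hypothesis y = 1 at time t
        if s <= a:
            return 0, t  # accept hypothesis y = 0 at time t
    return (1 if s >= 0 else 0), t  # forced decision at the horizon

# An "easy" stream (large per-sample evidence) stops early; a weaker stream
# takes more samples before the same threshold is crossed.
print(sprt([2.5] * 20))  # -> (1, 2)
print(sprt([0.5] * 20))  # -> (1, 10)
```

This illustrates the speed-accuracy behavior in the caption: strong evidence per sample yields early stopping, weak evidence yields a longer, more "careful" accumulation.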

Figure 3: Experimental results. (a-c) Speed-accuracy tradeoff (SAT) curves for three databases: NMNIST, UCF, and SiW. Note that only representative results are shown. Error bars show the standard error of the mean (SEM). (d) Example LLR trajectories calculated on the NMNIST database with the 10th-order SPRT-TANDEM. Red and blue trajectories represent odd and even digits, respectively. (e) SAT curves of the ablation test comparing the effect of the L multiplet and the LLLR. (f) SAT curves comparing the SPRT and Neyman-Pearson test (NPT) using the same 1st-order SPRT-TANDEM network trained on the NMNIST database.

$$\log \frac{p(x^{(1)}, x^{(2)} \mid y = 1)}{p(x^{(1)}, x^{(2)} \mid y = 0)} = \log \frac{p(y = 1 \mid x^{(1)}, x^{(2)})}{p(y = 0 \mid x^{(1)}, x^{(2)})} - \log \frac{p(y = 1)}{p(y = 0)} \,.$$
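The identity can be checked numerically with Bayes' rule on toy numbers (our own illustration; the probabilities below are hypothetical):

```python
import math

# Toy check of the identity: the data LLR of an observation equals the
# posterior log-ratio minus the prior log-ratio.
prior = {1: 0.3, 0: 0.7}  # p(y)
lik = {1: 0.8, 0: 0.2}    # p(x | y) for one fixed observation x
evidence = sum(prior[y] * lik[y] for y in (0, 1))
post = {y: prior[y] * lik[y] / evidence for y in (0, 1)}  # Bayes' rule

lhs = math.log(lik[1] / lik[0])                                   # LLR
rhs = math.log(post[1] / post[0]) - math.log(prior[1] / prior[0])
assert abs(lhs - rhs) < 1e-12
```

This is the key to the probabilistic-classification approach below: a posterior estimator plus the class priors suffices to recover the LLR.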

Figure 4: The speed-accuracy tradeoff curve. Compare this with Figure 3 in the main text. "10th TANDEM" means the 10th-order SPRT-TANDEM, and "19th TANDEM" means the 19th-order SPRT-TANDEM. The number of trials for hyperparameter tuning is 200 for all the models. The error bars are the standard error of the mean (SEM). The numbers of trials for statistics are 440, 240, 200, and 200 for the 10th TANDEM, 19th TANDEM, LSTM-s, and LSTM-m, respectively.

Probabilistic classification. The idea of probabilistic classification is that the posterior density p(Y |X) is easier to estimate than the likelihood p(X|Y ). Notice that r

$$\int dX\, p(X \mid y = 1) \log(r(X)) \,.$$
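The probabilistic-classification idea, recovering the density ratio r(X) = p(X|y=1)/p(X|y=0) from a posterior estimator, can be sketched on toy 1-D Gaussians (our own numpy illustration, not the paper's implementation):

```python
import numpy as np

# Estimate the density ratio r(x) = p(x|y=1)/p(x|y=0) via probabilistic
# classification: fit p(y|x) with logistic regression, then read off the
# logit. With equal priors, logit(p(y=1|x)) = log r(x).
rng = np.random.default_rng(0)
x1 = rng.normal(1.0, 1.0, 5000)   # samples from p(x | y=1) = N(1, 1)
x0 = rng.normal(-1.0, 1.0, 5000)  # samples from p(x | y=0) = N(-1, 1)
x = np.concatenate([x1, x0])
y = np.concatenate([np.ones(5000), np.zeros(5000)])

# Logistic regression on features (1, x), fit by full-batch gradient descent.
feats = np.stack([np.ones_like(x), x], axis=1)
w = np.zeros(2)
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-feats @ w))
    w -= 0.1 * feats.T @ (p - y) / len(y)

# For these Gaussians the true log-ratio is exactly 2x.
est_log_r = feats @ w
true_log_r = 2.0 * x
```

The model is well-specified here, so the fitted logit closely matches the analytic log-ratio; in the SPRT-TANDEM, a neural network plays the role of the logistic model.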

$f^{(0)}(x_i) = x_i$ is the input vector with dimension $d_x \in \mathbb{N}$. $f^{(l)}(x_i)$ is a feature vector after the activation function $\sigma$, with dimension $d_l \in \mathbb{N}$. $f^{(L)}(x_i)$ is the output of the softmax function and has $d_L = 2$, since we focus on the binary classification problem in this paper. $g^{(l)}(x_i)$ is a feature vector before the activation function, with dimension $d_l \in \mathbb{N}$. $W^{(l)}$ is a weight matrix in the neural network. Figure 5 visualizes the network structure.
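The notation can be made concrete with a minimal forward pass (our own sketch; σ is taken to be ReLU here purely for illustration, and the layer sizes are arbitrary):

```python
import numpy as np

def forward(x, weights):
    """Forward pass matching the notation above: f(0) = x, g(l) = W(l) f(l-1),
    f(l) = sigma(g(l)) for hidden layers, and a softmax output with d_L = 2."""
    f = x
    for W in weights[:-1]:
        f = np.maximum(0.0, W @ f)  # sigma taken to be ReLU (our assumption)
    g = weights[-1] @ f             # pre-activation of the final layer
    e = np.exp(g - g.max())         # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(2, 16))]  # d_x=8, d_1=16, d_L=2
out = forward(rng.normal(size=8), weights)
```

The output is a two-class probability vector, matching $d_L = 2$ for the binary problem.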

Figure 5: Visualization of the notation given in Table 4. We assume that the network has L fully-connected layers $W^{(l)}$ and the activation function σ, with a final softmax with cross-entropy loss.

Figure 6: Depth-from-motion dataset for the face 3D-ness detection task. Only three frames out of ten are shown. The top and bottom faces are 3D and 2D faces, respectively. (a) Video of faces taken from various angles. (b) Facial feature points extracted with the feature extractor, $f^{\mathrm{FE}}_w(x^{(t)})$.

Figure 7: Statistical test of equal error rates (EERs) in the 10-fold cross-validation test. A two-way ANOVA is conducted with a loss factor (LLLR + Lmultiplet, LKLIEP + Lmultiplet, and Lmultiplet) and an epoch factor (the 21st to 100th epochs). P-values with asterisks are statistically significant: one, two, and three asterisks indicate p < 0.05, p < 0.01, and p < 0.001, respectively. Error bars show the standard error of the mean (SEM).

Figure 8: Normalized mean squared error (NMSE) vs. iteration curve. A multi-layer perceptron is trained with either the cross-entropy loss (blue) or the LLLR (red). Shading shows the standard error of the mean (SEM).

Figure 9: Speed-accuracy tradeoff (SAT) curves of all the models. The right three panels show magnified views of the left three panels. The magnified region is the same as the region plotted in the insets of Figures 3a, 3b, and 3c. Error bars show the standard error of the mean (SEM). (a,b) NMNIST database. (c,d) UCF database. (e,f) SiW database.

Figure 13: The Nosaic MNIST (NMNIST) database consists of videos of 20 frames, each of which has 28 × 28 × 1 pixels. Each video is buried in noise at the first frame and gradually denoised toward the last frame. NMNIST provides a typical task in early classification of time series.

Representative mean balanced accuracy (%) calculated on NMNIST. For the complete list including standard errors, see Appendix J.

To create a more challenging task, we selected two classes, handstand-pushups and handstand-walking, from the 101 classes in the UCF database. At a glimpse of a single frame, the two classes are hard to distinguish; thus, to classify these classes correctly, temporal information must be used properly. We trim each video's duration to a multiple of 50 frames and sample 50-frame clips with a stride of 25 frames, each clip constituting one datum. The training, validation, and test datasets contain 1026, 106, and 105 videos with frames of size 224 × 224 × 3, randomly cropped to 200 × 200 × 3 at training time. The mean and variance of each frame are normalized to zero and one, respectively. The feature extractor of the SPRT-TANDEM is ResNet-50
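The clip-sampling step described above can be sketched as follows (the function name and dummy shapes are ours, for illustration only):

```python
import numpy as np

def extract_clips(video, clip_len=50, stride=25):
    """Sketch of the clip sampling described above: trim the video to a
    multiple of `clip_len` frames, then take `clip_len`-frame windows
    advancing by `stride` frames."""
    t = (len(video) // clip_len) * clip_len  # trim to a multiple of clip_len
    video = video[:t]
    return [video[s:s + clip_len] for s in range(0, t - clip_len + 1, stride)]

# A dummy 130-frame video is trimmed to 100 frames and yields three
# overlapping 50-frame clips starting at frames 0, 25, and 50.
video = np.zeros((130, 224, 224, 3), dtype=np.float32)
clips = extract_clips(video)
```

Resizing, random cropping, and per-frame normalization would be applied on top of this in the actual pipeline.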

Representative mean balanced accuracy (%) calculated on UCF. For the complete list including standard errors, see Appendix J.

Representative mean balanced accuracy (%) calculated on SiW. For the complete list including standard errors, see Appendix J.

Definition and the tradeoff of false alarms and stopping time
A.3 The Neyman-Pearson test and the SPRT
A.4 The optimality of the SPRT
SiW

Neyman-Pearson test is Bayes optimal. Consider the two-hypothesis testing problem. Given a prior distribution $\pi_i$ ($i \in \{1, 0\}$), the Bayes test $d_B$, which minimizes the average error probability $\bar{\alpha}$
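To make the statement above self-contained, recall the standard form of the two-hypothesis Bayes test under equal error costs (a textbook result, not a claim specific to this paper):

$$\bar{\alpha} = \pi_0\,\alpha_0 + \pi_1\,\alpha_1, \qquad d_B(X) = \begin{cases} 1 & \text{if } \dfrac{p(X \mid y = 1)}{p(X \mid y = 0)} \ge \dfrac{\pi_0}{\pi_1}, \\[4pt] 0 & \text{otherwise}, \end{cases}$$

where $\alpha_i$ is the error probability under hypothesis $i$. The Bayes test thus thresholds the same likelihood ratio as the Neyman-Pearson test, with the threshold set by the prior ratio.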

Thus, at least at the population level, LIP neurons represent information very similar to that of the LLR in the SPRT algorithm. (It is under active discussion whether the ramping activity is seen only in the averaged population firing rate or at both the population and single-neuron levels; see Latimer et al. (2015) and Shadlen et al. (2016).) See also Okazawa et al. (2021) for a recent finding that evidence accumulation is represented in a high-dimensional manifold of the neural population. In any case, LIP neurons seem to represent accumulated evidence in their activity patterns.

Tartakovsky et al. (2014) provided a comprehensive review of these theoretical analyses, part of whose reasoning we also follow to show optimality in Appendix A. The SPRT and the closely related generalized LLR test have been applied to several problems, including drug safety surveillance (Kulldorff et al. (2011)), exoplanet detection (Hu et al. (2019)), and building an LLR test out of weak classifiers (WaldBoost, Sochman & Matas (2005)), to name a few. For A/B testing, Johari et al. (2017) tackled the important problem of inflated error rates in sequential hypothesis testing. Ju et al. (2019) proposed an imputed Girshick test to determine the better variant.

Notation.

Hyperparameter tuning result of the SPRT-TANDEM feature extractor on NMNIST database.

Hyperparameter tuning result of the SPRT-TANDEM temporal integrator on NMNIST database.

Hyperparameter tuning result of the LSTM-m / LSTM-s on NMNIST database.

Hyperparameter tuning result of the EARLIEST on NMNIST database.

Hyperparameter tuning result of the 3DResNet on NMNIST database.

Hyperparameter tuning result of the LSTM-m / LSTM-s on UCF database.

Hyperparameter combination of the EARLIEST on UCF database.

Hyperparameter tuning result of the 3DResNet on UCF database.

p-values from the two-way ANOVA conducted on the three public databases.

p-values from the Tukey-Kramer multi-comparison test conducted on UCF.

Database: NMNIST, model: SPRT-TANDEM (1/2)

Database: NMNIST, model: EARLIEST

Database: UCF, model: SPRT-TANDEM (2/2)

Database: UCF, model: LSTM-m/s

Database: UCF, model: EARLIEST

Database: SiW, model: SPRT-TANDEM (1/2)

Database: SiW, model: SPRT-TANDEM (2/2)

Database: SiW, model: LSTM-m/s

Database: SiW, model: EARLIEST

Database: SiW, model: 3DResNet

AN EXAMPLE VIDEO OF THE NOSAIC MNIST DATABASE.

ACKNOWLEDGEMENTS

The authors thank anonymous reviewers for their careful reading to improve the manuscript. We would also like to thank Hirofumi Nakayama and Yuka Fujii for insightful discussions. Special thanks to Yuka for naming the proposed algorithm.

AUTHOR CONTRIBUTIONS

A.F.E. conceived the study. A.F.E. and T.M. constructed the theory, conducted the experiments, and wrote the paper. T.M. organized the Python code to be ready for release. K.S. and H.I. supervised the study.

Ablation experiment. A Peephole-LSTM with a hidden layer of size 128 (total trainable parameters: 0.1M) is used. Hyperparameters are searched within the following space:
learning rate ∈ {10^-1, 10^-2, 10^-3, 10^-4, 10^-5}
optimizer ∈ {Adam, Momentum, Adagrad, RMSprop}
weight decay ∈ {10^-3, 10^-4, 10^-5}
dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}
The batch size and the number of training epochs are fixed to 1024 and 100, respectively. The number of search trials and the resulting best hyperparameter combination are summarized in Table 10.

The following space is also searched:
learning rate ∈ {10^-3, 10^-4, 10^-5, 10^-6}
weight decay ∈ {10^-3, 10^-4, 10^-5}
The batch size, optimizer, and number of training epochs are fixed to 512, Adam, and 100, respectively. The best hyperparameter combination is summarized in Table 11.

SPRT-TANDEM: temporal integrator. A Peephole-LSTM with a hidden layer of size 64 (total trainable parameters: 33K) is used. Hyperparameters are searched, with the weight decay and the number of training epochs fixed to 10^-4 and 100, respectively. The number of search trials and the best hyperparameter combination are summarized in Table 12.

SPRT-TANDEM: temporal integrator. A Peephole-LSTM with a hidden layer of size 512 (total trainable parameters: 2.1M) is used. Table 17 shows the fixed parameter combination.

LSTM-m / LSTM-s. A Peephole-LSTM with a hidden layer of size 512 (total trainable parameters: 2.1M) is used. Table 18 shows the fixed parameter combination.

EARLIEST. An LSTM with a hidden layer of size 512 (total trainable parameters: 2.1M) is used. Table 19 shows the fixed parameter combination.

(2020)) and Scipy (Virtanen et al. (2020)) are used for mathematical computations.
We use Tensorflow 2.0.0 (Abadi et al. (2015) ) as a machine learning framework except when running baseline algorithms that are implemented with PyTorch (Paszke et al. (2019) ).
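A hyperparameter search over discrete spaces like those listed above can be sketched with stdlib random sampling (illustrative only; the actual tuning in this work used Optuna, and the dictionary layout below is ours):

```python
import random

# Sketch of random sampling from a discrete hyperparameter space such as the
# ones listed above (values copied from the ablation-experiment space).
SPACE = {
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
    "optimizer": ["Adam", "Momentum", "Adagrad", "RMSprop"],
    "weight_decay": [1e-3, 1e-4, 1e-5],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
}

def sample_trial(rng):
    """Draw one hyperparameter combination uniformly from each axis."""
    return {name: rng.choice(values) for name, values in SPACE.items()}

rng = random.Random(0)  # seeded for reproducible trials
trial = sample_trial(rng)
```

A tuner such as Optuna replaces this uniform sampling with adaptive strategies (e.g., TPE), but the trial interface is essentially the same.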

M SUPPLEMENTARY EXPERIMENT ON MOVING MNIST DATABASE

Prior to the experiment on Nosaic MNIST, we conducted a preliminary experiment on the Moving MNIST (MMNIST) database. The 1st, 2nd, 3rd, and 5th-order SPRT-TANDEM were compared to LSTM-m. The hyperparameters of each model were independently optimized with Optuna. The results plotted in Figure 14 show that the balanced accuracy of the SPRT-TANDEM peaked and reached a plateau after only two or three frames. This indicates that each frame in MMNIST contains so much information that a well-trained classifier can classify a video easily. Thus, although our SPRT-TANDEM outperformed LSTM-m by a large margin, we decided to design an original database, Nosaic MNIST (NMNIST), for the early-classification task. NMNIST contains videos with noise-buried handwritten digits, gradually denoised toward the end of the videos, increasing the mean hitting time compared to MMNIST.

N SUPPLEMENTARY ABLATION EXPERIMENT

In addition to the ablation experiment presented in Figure 3e, which is calculated with the 1st-order SPRT-TANDEM, we also conduct an experiment with the 19th-order SPRT-TANDEM. The result shown in Figure 15 is qualitatively in line with Figure 3e: the L multiplet has an advantage in the early phase with a few data samples, while the L LLR leads to higher final balanced accuracy in the late phase, and using both loss functions yields the best SAT curves.

