DBT: A DETECTION BOOSTER TRAINING METHOD FOR IMPROVING THE ACCURACY OF CLASSIFIERS
Anonymous
Under review as a conference paper at ICLR 2021

Abstract

Deep learning models owe their success, in large part, to the availability of large amounts of annotated data. They extract features from the data that carry the information needed to perform well on target applications. Most works focus on directly optimizing the target loss functions to improve accuracy, allowing the model to learn representations implicitly from the data. There has been little work on using background/noise data to estimate the statistics of in-domain data and thereby improve the feature representations of deep neural networks. In this paper, we probe this direction by deriving a relationship between the estimation of the unknown parameters of the probability density function (pdf) of the input data and classification accuracy. Using this relationship, we show that a better estimate of the unknown parameters, obtained using background and in-domain data, yields better features, which in turn leads to better accuracy. Based on this result, we introduce a simple but effective detection booster training (DBT) method that applies a detection loss function on the early layers of a neural network to discriminate in-domain data points from noise/background data, thereby improving classifier accuracy. The background/noise data comes from the same family of pdfs as the input data but with different parameter sets (e.g., mean, variance). In addition, we show that our proposed DBT method improves accuracy even with limited labeled in-domain training samples, compared to normal training. We conduct experiments on face recognition, image classification, and speaker classification problems and show that our method achieves superior performance over strong baselines across various datasets and model architectures.

1. INTRODUCTION

Modern pattern recognition systems achieve outstanding accuracies on a vast range of challenging computer vision, natural language, and speech recognition benchmarks (Russakovsky et al. (2015); Lin et al. (2014); Everingham et al. (2015); Panayotov et al. (2015)). The success of deep learning approaches relies on the availability of large amounts of annotated data and on extracting useful features from it for different applications. Learning rich feature representations from the available data is a challenging problem in deep learning. Related lines of work include learning deep latent-space embeddings through deep generative models (Kingma & Welling (2014); Goodfellow et al. (2014); Berthelot et al. (2019)), self-supervised learning methods (Noroozi & Favaro (2016); Gidaris et al. (2018); Zhang et al. (2016b)), and transfer learning approaches (Yosinski et al. (2014); Oquab et al. (2014); Razavian et al. (2014)). In this paper, we propose a different approach: we improve the feature representations of deep neural nets, and eventually their accuracy, by estimating the unknown parameters of the probability density function (pdf) of the input data. Parameter estimation, or point estimation, is well studied in the field of statistical inference (Lehmann & Casella (1998)). Insights from the theory of point estimation can help us develop better deep model architectures for improving a model's performance. We use this theory to derive a relationship between the estimation of the unknown parameters of the pdf and classifier outputs. However, directly estimating the unknown pdf parameters for practical problems such as image classification is not feasible, since they can number in the millions.
To overcome this bottleneck, we assume that the input data points are sampled from a family of pdfs instead of a single pdf and propose a detection-based training approach to better estimate the unknowns using in-domain and background/noise data. An alternative is to use generative models for this task; however, they mimic the general distribution of the training data conditioned on random latent vectors and hence cannot be directly applied to estimating the unknown parameters of a family of pdfs. Our proposed detection method involves a binary discriminator that separates the target data points from noise or background data. The noise or background data is assumed to come from the same family of distributions as the in-domain data but with different moments (please refer to the appendix for more details about the family of distributions and its extension to a general structure). In image classification, this typically means background patches from the input data that fall under the same distribution family. In the speech domain, it can be random noise or the silence intervals in speech data. Collecting such background data to improve the feature representations is much simpler than collecting labeled training data, which is time-consuming and expensive. Since the background patches in images or the noise in speech signals are used for binary classification in our method, we refer to such data as the noise of an auxiliary binary classification problem, denoted the auxiliary binary classification (ABC)-noise dataset. An advantage of using ABC-noise data during training is that it can implicitly add robustness to deep neural networks against background or noisy data. Since ABC-noise data can be collected in large quantities for free, and using it in our approach improves classification benchmarks, we investigate whether this data can act as a substitute for labeled data.
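To make the ABC-noise idea concrete in the image domain, the following sketch samples background patches from the border of an image. The function name and the border-band heuristic are illustrative assumptions of ours, not a prescribed part of the method; the point is only that such patches come from the same distribution family as the in-domain crops but with different parameters, and cost nothing to collect.

```python
import numpy as np

def sample_abc_noise_patches(image, patch=8, n_patches=4, rng=None):
    """Sample background patches from the border band of an image.

    Hypothetical illustration of collecting ABC-noise data: border
    regions are assumed to contain background rather than the object
    of interest.
    """
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    patches = []
    for _ in range(n_patches):
        # restrict sampling to a band of width `patch` along the border
        if rng.random() < 0.5:              # top or bottom band
            y = rng.choice([0, h - patch])
            x = rng.integers(0, w - patch + 1)
        else:                               # left or right band
            y = rng.integers(0, h - patch + 1)
            x = rng.choice([0, w - patch])
        patches.append(image[y:y + patch, x:x + patch])
    return np.stack(patches)

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
noise = sample_abc_noise_patches(img, patch=8, n_patches=4, rng=0)
print(noise.shape)  # (4, 8, 8)
```

In a real pipeline, such patches would form the null-hypothesis class of the auxiliary binary classifier, while full object crops form the alternative class.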
We conduct an empirical analysis and show that using only a fraction of the labeled training data together with ABC-noise data in our DBT method indeed improves accuracy compared to normal training. To summarize, our contributions are threefold. First, we present a detailed theoretical analysis of the relation between the estimation of the unknown parameters of the pdf of the data and classification outputs. Second, based on this analysis, we present a simple booster training method that improves classification accuracy and doubles as an augmented training method when only limited labeled data is available. Third, we consistently achieve improved performance over strong baselines on face recognition, image classification, and speaker recognition problems, showing that our method generalizes across different domains and model architectures.

2. RELATED WORK

Notations and Preliminaries: In this paper, vectors, matrices, functions, and sets are denoted by bold lowercase, bold uppercase, lowercase, and calligraphic characters, respectively. Consider a data point denoted by x. We assume that x belongs to a family of probability density functions (pdfs) defined as P = {p(x, θ) : θ ∈ Θ}, where Θ is the set of possible parameters of the pdf. In general, θ is a real vector in a high-dimensional space. For example, in a mixture of Gaussians, θ is a vector containing the component weights, the component means, and the component covariance matrices. In this paper, we assume that θ is an unknown deterministic vector (other approaches, such as the Bayesian one, consider θ as a random vector). Although the structure of the family of pdfs is itself unknown in general, defining a family of pdfs such as P lets us develop theorems and use their results to derive a new method. For the family of distributions P, we can define the following classification problem:

C_1 : θ ∈ Θ_1,  C_2 : θ ∈ Θ_2,  ...,  C_n : θ ∈ Θ_n,   (1)

where the Θ_i's form a partition of Θ. The notation of (1) means that class C_i deals with the set of data points whose pdf is p(x, θ_i) with θ_i ∈ Θ_i. A wide range of classification problems can be defined using (1); see, e.g., (Lehmann & Casella, 2006, Chapter 3) and (Duda et al., 2012, Chapter 4). The problem of estimating θ falls under parametric estimation, or point estimation (Lehmann & Casella (1998)), and has been extensively studied in that field (Lindgren (2017); Lee et al. (2018); Lehmann & Casella (2006)). An important estimator here is the minimum variance unbiased estimator, which is governed by the Cramér-Rao bound; this bound gives a lower bound on the variance of any unbiased estimator (Bobrovsky et al. (1987)).
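The Cramér-Rao bound can be checked numerically in the simplest case. The sketch below is our own illustrative example: for i.i.d. N(θ, σ²) data with known σ², the sample mean is unbiased and efficient, so its Monte Carlo variance should match the bound σ²/n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, theta = 25, 2.0, 3.0
trials = 20000

# Sample mean of n i.i.d. N(theta, sigma^2) draws: an unbiased estimator
# of theta. For this family it is also efficient, so its variance should
# attain the Cramer-Rao bound 1/(n * I(theta)) with I(theta) = 1/sigma^2.
est = rng.normal(theta, sigma, size=(trials, n)).mean(axis=1)

crb = sigma ** 2 / n
print(est.mean(), est.var(), crb)  # empirical mean/variance vs. the bound
```

Running this shows the empirical variance agreeing with the bound up to Monte Carlo error, which is the sense in which "efficient estimator" is used throughout the paper.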
Let the estimate of θ be denoted by θ̂, and assume that θ̂ is an unbiased estimator, i.e., E(θ̂) = θ. Its covariance matrix Σ_θ̂ satisfies Σ_θ̂ − I^{-1}(θ) ⪰ 0, where A ⪰ 0 means that A is a non-negative definite matrix ((Lehmann & Casella, 1998, Chapter 5)), and I(θ) := −E(∂² log(p(x, θ))/∂θ²) is called the Fisher information matrix. For an arbitrary differentiable function g(·), an efficient estimator of g(θ) is an unbiased estimator whose covariance matrix equals I_g^{-1}(θ), where I_g(θ) is the Fisher information matrix of g(θ); i.e., the efficient estimator achieves the lowest possible variance among all unbiased estimators. When it exists, the efficient estimator is characterized by the factorization (Rao (1992); Lehmann & Casella (1998))

∂ log(p(x, θ))/∂g(θ) = I_g(θ)(ĝ(x) − g(θ)).

Based on these results, we derive a relationship between the efficient estimation of the unknowns and the maximum likelihood classifier of (1), and use auxiliary binary classifiers to apply that result in our proposed DBT method. Parameter Estimation: Independent component analysis (ICA) (Hyvärinen (1999)) decomposes a multivariate signal into independent non-Gaussian signals and can extract non-Gaussian features from Gaussian noise. There is also a class of classifiers based on generalized likelihood ratio functions, which substitute estimates of the unknown parameters into the likelihood functions. This approach has produced large improvements among parametric classifiers, where the family of pdfs of the data is given (Zeitouni et al. (1992); Conte et al. (2001); Lehmann & Casella (2006)). Noise-contrastive estimation (NCE) (Gutmann & Hyvärinen (2010)) trains a generative model by teaching it to discriminate data from a fixed noise distribution; the trained model can then be used for training a sequence of models of increasing quality.
This can be seen as an informal competition mechanism similar in spirit to the formal competition used in the adversarial networks game. In Bachman et al. (2019), feature selection is performed by maximizing the mutual information between features extracted from multiple views of a shared context; that work shows that the best results are obtained with a mutual information bound based on NCE. The key difference between our method and NCE is that we do not construct a generative model for the noise. Instead of estimating the pdf of the noise as in NCE, we estimate the parameters of the pdf of the in-domain dataset using an auxiliary class whose pdf shares many parameters with it. Moreover, we show that the estimate of those parameters is a sufficient statistic for a classifier. We assume that the noise dataset is not pure and has some similarity with the in-domain dataset, which helps the feature selection layers to select relevant (in-domain) features; e.g., see Fig. 3. Further, in our approach, we do not construct the pdf of the noise or in-domain data; instead, we estimate its parameters directly, which is more efficient in terms of training, computation, and dimensionality reduction. Auxiliary classifiers were introduced in inception networks (Szegedy et al. (2015)) and used in (Lee et al. (2015); S. et al. (2016)) for training very deep networks to prevent vanishing gradients. Auxiliary classifiers have also been proposed for early-exit schemes (Teerapittayanon et al. (2016)) and self-distillation methods (Zhang et al. (2019a;b)). Such auxiliary classifiers tackle different problems by predicting the same target as the final classification layer. In contrast, our proposed DBT method uses auxiliary binary classifiers that separate noise, interference, and/or background data from in-domain data points in order to improve the target classification accuracy.

3. ESTIMATION OF PARAMETERS OF PDF AND CLASSIFICATION

For (1), we define a deterministic discriminative function of Θ_i, denoted by t_i(·), such that the following conditions are satisfied:
• t_i(·) maps Θ to the real numbers such that t_i(θ) > 0 if θ ∈ Θ_i, and t_i(θ) ≤ 0 if θ ∉ Θ_i.
• t_i(·) is differentiable almost everywhere and ∫_Θ |t_i(θ)| dµ_l(θ) < ∞, where µ_l denotes the Lebesgue measure.
The following theorem relates t_i(·) to the log-likelihood ratio of class C_i versus the other classes. The proofs of Theorems 1, 2, and 3 are provided in the appendix.
Theorem 1 Assume that the pdf p(x, θ) is differentiable with respect to θ almost everywhere. If the efficient minimum variance and unbiased estimator of a deterministic discriminative function of Θ_i exists, then the log-likelihood ratio of class i against the rest of the classes is an increasing function of that minimum variance and unbiased estimator.
It follows directly from this theorem that the optimal maximum likelihood classifier for (1) is given by d(x) = argmax_{i ∈ {1, ..., n}} k_i(t̂_i(x)), where the k_i's are increasing functions and the t_i(·)'s are deterministic discriminative functions of the Θ_i's for which efficient minimum variance and unbiased estimators exist. Based on this result, a set of minimum variance and unbiased estimators of deterministic discriminative functions of the Θ_i's yields the maximum likelihood classifier. One approach is to estimate the deterministic discriminative functions directly, instead of maximizing the likelihood function. However, finding deterministic discriminative functions that admit efficient minimum variance and unbiased estimators may not be feasible in practical problems.
Theorem 2 Consider the outputs of two classifiers for the ith class: r_j(x) = i if h_j(x) > τ, and r_j(x) = "other classes" if h_j(x) < τ, where j ∈ {1, 2},
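The classifier d(x) = argmax_i k_i(t̂_i(x)) can be illustrated on a toy instance of (1). In the sketch below, the unit-variance Gaussian family and the interval partition of Θ are our own illustrative assumptions; the supremum of the likelihood over each Θ_i is obtained by clipping the sample mean to the corresponding interval.

```python
import numpy as np

def loglik(x, theta):
    # log-likelihood (up to a constant) of i.i.d. N(theta, 1) samples
    return -0.5 * np.sum((x - theta) ** 2)

def ml_classify(x, partitions):
    """Maximum-likelihood classifier for the partition-based problem (1).

    `partitions` is a list of (low, high) intervals partitioning Theta.
    The Gaussian-with-unit-variance family is an illustrative assumption.
    """
    xbar = x.mean()
    # sup over theta in [lo, hi] is attained at the clipped sample mean
    scores = [loglik(x, np.clip(xbar, lo, hi)) for lo, hi in partitions]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=50)                  # true theta = 1.5: class 1
print(ml_classify(x, [(-3.0, 0.0), (0.0, 3.0)]))   # → 1
```

Here the clipped sample mean plays the role of the estimate t̂_i(x), and the classifier depends on the data only through it.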
where h_j(x) is an estimate of a deterministic discriminative function and τ is a classification threshold. Assume that the cumulative distribution functions of the h_j(x)'s have a bounded number of inflection points, and that the true positive probability of r_j(x) is an increasing function of d(θ), the deterministic discriminative function of class i, for all i. Further assume that for each τ the false positive probability of r_1(x) is less than that of r_2(x), and the true positive probability of r_1(x) is greater than that of r_2(x). Then there exists an h_min such that for all d(θ) > h_min and all θ we have Pr(|h_1(x) − d(θ)| < ε) > Pr(|h_2(x) − d(θ)| < ε).
Theorem 2 shows that a better classifier leads to a better estimate of d(θ). In the next theorem, we show the dual property of this result.
Theorem 3 Let Θ_m be a Borel set with positive Lebesgue measure in (1) for all m ∈ {1, ..., n}. Assume that r_1(·) and r_2(·) are given by r_1(x) = m if θ̂_1 ∈ Θ_m, and r_2(x) = m if θ̂_2 ∈ Θ_m, where θ̂_1 and θ̂_2 are two different estimators of θ ∈ Θ = ∪_{m=1}^{n} Θ_m. If Pr(‖θ̂_1 − θ‖ ≤ ε) ≥ Pr(‖θ̂_2 − θ‖ ≤ ε) for all θ ∈ Θ and all ε > 0, then the probability of classification error of r_1(·) is less than that of r_2(·).
Theorem 3 proves that a more accurate estimator leads to a classifier with a lower probability of classification error. From Theorem 1, we can infer that a sufficient statistic for the maximum likelihood classification is t̂_i(x), the efficient minimum variance and unbiased estimate of the deterministic discriminative function t_i(θ) of Θ_i. In other words, the maximum likelihood classifier depends on x only via the efficient minimum variance and unbiased estimates t̂_i(x). We can also approximate t_i(θ) by plugging an estimate θ̂ into t_i(·), i.e., t_i(θ) ≈ t_i(θ̂), where θ̂ is a function of x.
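Theorem 3 claims that a uniformly better estimator yields a lower classification error; this can be checked by simulation. The following Monte Carlo sketch is our own toy setup: two sample-mean estimators of different sample sizes classify a two-interval partition of the real line by the sign of the estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 5000
errors = {50: 0, 5: 0}   # sample size controls estimator quality

for _ in range(trials):
    theta = rng.uniform(-1.0, 1.0)        # unknown true parameter
    true_class = 0 if theta < 0 else 1    # Theta_1 = (-inf, 0), Theta_2 = [0, inf)
    for n in errors:
        # sample mean of n draws: larger n gives a tighter estimator of theta
        theta_hat = rng.normal(theta, 1.0, size=n).mean()
        errors[n] += ((0 if theta_hat < 0 else 1) != true_class)

# the tighter estimator misclassifies less, as Theorem 3 predicts
print(errors[50] < errors[5])   # True
```

The gap is large in practice (roughly a 6% versus 17% error rate in this setup), which is the effect DBT exploits by improving the parameter estimates in the early layers.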
From the above theorems, we conclude that improving the estimation of the unknown parameters of the pdf of the data improves the accuracy of the classifier; conversely, a good classifier implies a good estimator of the unknowns of the pdf of the input data. In many practical problems, the optimal maximum likelihood classifier may not be achievable, but its likelihood function provides an optimal bound on the probability of error. In such cases, we can still improve the accuracy of sub-optimal classifiers, and that is the main focus of this paper. In our DBT method, we use an auxiliary detection task to improve the estimation of the unknown parameters of the family of pdfs (based on Theorem 2). A better estimate of the unknown parameters corresponds to better feature representations in the early layers, and these features are input to the remaining layers, which construct the deterministic discriminative functions (DDF) used for in-domain data classification (based on Theorem 3). A general scheme for dividing a deep model into two sub-models, PEF (parameter estimator functions) and DDF, is depicted in Figure 2. The early layers of the model estimate the unknown parameters of the pdf of the data, while the later layers construct the discriminative functions essential for classification. Based on this scheme, we formally define the three main components of DBT as follows:
• Parameter estimator functions (PEF): the sub-network from the input layer to the kth layer, where k is a hyperparameter of the DBT approach.
• Auxiliary binary classification (ABC): additional layers attached to the end of the PEF, mapping the output of the kth layer to a one-dimensional output.
• Deterministic discriminative functions (DDF): the sub-network from the kth layer to the output of the model. The output of the model is a vector whose length equals the number of classes n.
Theorem 2 showed that the estimation of the unknown parameters can be improved using a detection approach.
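The PEF/ABC/DDF decomposition can be sketched with a tiny network. All layer sizes and the single-hidden-layer structure below are illustrative assumptions of ours; the point is only the wiring: the ABC head reads the PEF output during training, while inference uses PEF and DDF alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

class DBTModel:
    """Minimal sketch of the PEF / ABC / DDF split (names hypothetical).

    PEF: early layers shared by both heads. ABC: a scalar detection head
    used only during training. DDF: the n-class head used at inference.
    """
    def __init__(self, d_in, d_hid, n_classes):
        self.W_pef = rng.normal(0, 0.1, (d_in, d_hid))       # early layers
        self.W_abc = rng.normal(0, 0.1, (d_hid, 1))          # detection head
        self.W_ddf = rng.normal(0, 0.1, (d_hid, n_classes))  # classifier head

    def pef(self, x):
        return relu(x @ self.W_pef)

    def abc(self, x):                # training-time detector logit
        return self.pef(x) @ self.W_abc

    def forward(self, x):            # inference model: PEF + DDF only
        return self.pef(x) @ self.W_ddf

model = DBTModel(d_in=16, d_hid=32, n_classes=10)
x = rng.normal(size=(4, 16))
print(model.abc(x).shape, model.forward(x).shape)  # (4, 1) (4, 10)
```

Because `abc` is dropped at inference time, the deployed model has exactly the same cost as a normally trained one, matching the claim in the text.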
During training, we apply a binary classification loss on the early layers (PEF) of the model to improve the estimation of the unknown parameters of the pdf and thereby provide rich feature vectors for the DDF. We define the auxiliary binary classification (ABC) problem as follows:
• Class 1 (alternative hypothesis) of the ABC problem, denoted by H_1, is the set of all data points of classes C_1 to C_n, i.e., θ ∈ ∪_{i=1}^{n} Θ_i.
• Class 0 (null hypothesis) of the ABC problem, denoted by H_0, is a dataset of points from the same distribution p(x, θ) but with θ ∉ ∪_{i=1}^{n} Θ_i.
We call the dataset of Class 0 the ABC-noise dataset; i.e., the ABC problem is the hypothesis test H_1 : θ ∈ ∪_{i=1}^{n} Θ_i versus H_0 : θ ∉ ∪_{i=1}^{n} Θ_i. In many practical problems, the noise, background, or interference data related to the in-domain dataset has the same type of probability distribution but different pdf parameters. Hence, such data is a cheap and apt choice for the null hypothesis of the ABC problem. The auxiliary binary classification problem influences only the PEF and ABC units, while the main classification problem with n classes updates the parameters of both PEF and DDF using in-domain data. Since the auxiliary classifier is used only during training, the inference model (IM) consists of only the PEF and DDF, and hence there is no additional computation cost at inference. We formulate this method with the following notation and loss functions. Assume that x is a data point belonging to class C_i, i ∈ {1, ..., n}, or to class H_0 of the ABC problem. We define two types of labels, denoted l_ABC and l_MC, where the subscript "MC" stands for multi-class. If x belongs to class C_i, then l_ABC = 1 and l_MC = i − 1; if x is an ABC-noise data point, l_ABC = 0 and l_MC is None.
Therefore, the loss function is defined as

L_tot = L_ABC(Q_ABC(Q_PEF(x)), l_ABC) + λ l_ABC L_MC(Q_DDF(Q_PEF(x)), l_MC),

where Q_PEF, Q_ABC, and Q_DDF are the functions of the PEF, ABC, and DDF blocks, respectively. We set the hyperparameter λ = 1 to balance the two loss terms. Note that the second term of the total loss is zero if l_ABC = 0. L_ABC and L_MC are selected based on the problem definition and datasets; for classification, simple choices are binary cross-entropy and cross-entropy, respectively. For a given task and deep neural network, the choice of k and L_ABC influences the feature representation of the early layers and consequently the accuracy of the model; we provide empirical studies in the next section to verify this. Following ArcFace (Deng et al. (2019)), we constrain the face/non-face prototypes to diametrically opposite directions, i.e., cos(θ_{p_f p_nf}) = −1, and normalize the output feature vectors for faces and non-faces such that ‖p_{f_i}‖ = ‖p_{nf_i}‖ = 1. We then define L_ABC as

L_ABC = −(1/N) Σ_{i=1}^{N} log[ e^{s(cos(m_1 θ_{y_i} + m_2) − m_3)} / (e^{s(cos(m_1 θ_{y_i} + m_2) − m_3)} + e^{s cos θ_2}) ] + (1/N) Σ_{i=1}^{N} (−1 − p_{f_i} · p_{nf_i})²,   (3)

where θ_{y_i} and θ_2 correspond to the angles between the weights and the features for face and non-face labels, respectively; m_1, m_2, m_3 are the angular margins; and s denotes the radius of the hypersphere. As a second choice, we use simple binary cross-entropy for L_ABC. Table 1 shows that the verification accuracy on LFW (Huang et al. (2007)) using (3) is 0.16% higher than with simple cross-entropy loss. This also shows that choosing a task-specific L_ABC is essential for obtaining more accurate results. We use (3) as the default for L_ABC in all our face recognition experiments, unless otherwise stated. We evaluate on standard face verification benchmarks, including CFP-FP (Sengupta et al. (2016)) and AgeDb-30 (Moschoglou et al. (2017)). For the LFW test set, we follow the unrestricted-with-labeled-outside-data protocol to report performance.
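The total loss L_tot defined above can be sketched as follows. The logit conventions and helper names are our own, and plain (binary) cross-entropy is used for both terms, corresponding to the simple choice mentioned in the text; note how the multi-class term is switched off for ABC-noise points.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(logit, label):
    # binary cross-entropy on a single logit, label in {0, 1}
    p = sigmoid(logit)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def ce(logits, label):
    # cross-entropy on a logit vector, label an integer class index
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

def dbt_loss(abc_logit, ddf_logits, l_abc, l_mc, lam=1.0):
    """L_tot = L_ABC + lambda * l_ABC * L_MC.

    For ABC-noise points l_abc = 0, so the multi-class term vanishes
    and only the PEF/ABC path receives gradients, as in the text.
    """
    loss = bce(abc_logit, l_abc)
    if l_abc == 1:                       # in-domain point: add L_MC
        loss += lam * ce(ddf_logits, l_mc)
    return loss

# in-domain sample of class C_3 -> l_abc = 1, l_mc = 2
print(dbt_loss(2.0, np.array([0.1, 0.2, 1.5]), 1, 2) > 0)   # True
# ABC-noise sample -> only the detection term contributes
print(dbt_loss(-2.0, np.zeros(3), 0, None) > 0)             # True
```

In a framework with automatic differentiation, the same gating is usually implemented by multiplying the per-sample L_MC by the l_ABC mask rather than branching.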
We trained ResNet-50 and ResNet-100 using the ArcFace and DBT approaches on the CASIA (small) and MS1MV2 (large) datasets, respectively. The results show that the DBT method outperforms ArcFace on all datasets. Table 7 shows the angle statistics of the trained ArcFace and DBT models on the LFW dataset. Min. Inter and Inter refer to the mean of the minimum angles and the mean of all angles between the template embedding features of different classes (the mean of the embedding features of all images for each class), respectively. Intra refers to the mean of the angles between x_i and the template embedding feature of its class. From Table 7, we infer that DBT extracts better face features and hence reduces intra-class variations. From Tables 3 and 7, we draw three conclusions. First, DBT outperforms the baseline on all test sets. Second, learning better features in the early layers is crucial to obtaining rich face feature embeddings. Third, the gain from DBT is more pronounced for models trained on the smaller (CASIA) dataset, which has fewer identities and images; this shows that DBT can mitigate the lack of in-domain data using cheap ABC-noise data. We also provide results of training Inception-ResNet-V1 and ResNet-64 models with DBT on MS1MV2 to show the generalization capacity of the DBT method. For Inception-ResNet-V1 and ResNet-64, the PEF is the first six layers and the DDF is the rest of the model. We use the large margin cosine loss (LMCL) (Wang et al. (2018)) for L_MC and cross-entropy (CE) for L_ABC. Table 4 shows the verification accuracy on LFW for Inception-ResNet-V1 and ResNet-64 models trained on MS1MV2 with and without DBT. The results show that the DBT method is independent of model depth, architecture, and loss function, and consistently improves accuracy over the baseline. Table 4 also compares the DBT method with state-of-the-art methods on the LFW and YTF datasets: DBT notably improves over baselines that are comparable to ArcFace and superior to all the other methods.
We were not able to reproduce the results of the ArcFace paper using our TensorFlow implementation and dataset. We believe that using the original implementation and dataset from ArcFace would achieve superior results over the baselines on the benchmark datasets, as is evident from the results of our implementation. Finally, we compare the results of ArcFace and DBT on IJB-B and IJB-C in Table 5. DBT provides a notable boost on both IJB-B and IJB-C, improving the verification accuracy by as much as 1.94% on IJB-B and 2.57% on IJB-C at a 10^-4 false alarm rate (FAR). We plot the receptive fields of the top ten maximally activated neurons of an intermediate layer of the face recognition model to visualize the features learned using the DBT method. Fig. 3 shows that the receptive fields of layer 15 of the Inception-ResNet-V1 model trained using DBT attend to the regions of the eyes, nose, and mouth, as opposed to insignificant regions under normal training. This shows that DBT learns more discriminative features essential to face recognition, corroborating our theoretical claims. To show that current SOTA models are not robust to animal faces, we performed a 1:N identification experiment with approximately 3000 animal distractors on the IJB-B (Whitelam et al. (2017)) dataset. We trained the face recognition model with about 500K non-face data points, which contain 200 animal faces; this set is disjoint from the 3000 distractors used in the identification experiment. We collected the animal faces from web images using the MTCNN (Zhang et al. (2016a)) face detector; they are false positives of the face detector.

IMAGE CLASSIFICATION

We randomly selected a fraction of the training data to be our training set; e.g., k/5 of the dataset means that we only used k fifths of the total samples for training.
From the first row of Table 8, we find that models trained with DBT show 0.59% and 0.35% improvements on CIFAR-10, and 0.62% and 1.45% improvements on CIFAR-100, over baseline models for the ResNet-110 and ResNeXt-101 architectures, respectively. Furthermore, using partial training data with our DBT method achieves superior results (by as much as 5.49% for ResNeXt (1/5) on CIFAR-100) compared to normal training. Table 6 shows the results on ImageNet: DBT improves the Top-1 accuracy by 0.28%. This shows that the DBT method consistently improves results on both small and large datasets.

SPEAKER IDENTIFICATION

We consider the problem of speaker identification using the VGG-M (Chatfield et al. (2014)) model. We set the PEF as the first two CNN layers and the DDF as the remaining layers. L_ABC and L_MC are both defined to be the cross-entropy loss. The ABC-noise is generated from the silence intervals of VoxCeleb (Nagrani et al. (2017)), augmented with Gaussian noise of variance one. The input to the model is the short-time Fourier transform of the speech signal with a Hamming sliding window of width 25 ms and step 10 ms. Table 9 provides the accuracies of the VGG-M model trained with and without DBT on the VoxCeleb, Librispeech (Panayotov et al. (2015)), VCTK (Veaux et al. (2016)), and ELSDR (L. (2004)) datasets. The models trained with DBT improve the accuracy significantly (by as much as 5.62%) on all datasets. Implementation details are provided in the appendix.
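The speech front end described above can be sketched as follows. The STFT parameters (25 ms Hamming window, 10 ms hop) follow the text; the 16 kHz sampling rate and the use of a zero signal as the "silence" segment are our own illustrative assumptions.

```python
import numpy as np

def stft_mag(signal, sr=16000, win_ms=25, hop_ms=10):
    """Magnitude STFT with a Hamming window (25 ms window, 10 ms hop),
    matching the front end described in the text."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    w = np.hamming(win)
    frames = [signal[s:s + win] * w
              for s in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

# hypothetical ABC-noise: a "silence" segment plus unit-variance Gaussian noise
rng = np.random.default_rng(0)
silence = np.zeros(16000)                 # 1 s of silence at 16 kHz
abc_noise = silence + rng.normal(0.0, 1.0, size=silence.shape)

spec = stft_mag(abc_noise)
print(spec.shape)   # (frames, fft_bins) = (98, 201)
```

Spectrogram patches of such noise would be fed to the ABC head, while spectrograms of actual speech form the in-domain classes.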

MISCELLANEOUS EXPERIMENTS

In this section, we experiment with the naive way of using background data: treating non-faces as a separate class in the final classification layer. For face recognition, Table 11 shows the results of training with an additional background class on the MS1MV2 dataset, with and without DBT. ResNet+mod refers to a model trained with the ArcFace loss and n + 1 classes, where the additional class corresponds to non-faces. ResNet-DBT+mod refers to a model trained with both DBT and the additional non-face class. We find that adding the non-face class hurts the performance of the model, whereas ResNet-DBT+mod improves the results significantly relative to ResNet+mod. Since the non-face dataset is sampled from a wider range of the family of distributions than the faces, it has a larger range of unknown parameters, so its sufficient statistic is larger than the sufficient statistics of the face data. Thus, when we restrict faces and non-faces to the surface of a hypersphere, the non-face data spreads over more of the surface than any single face class. We demonstrate this effect with a toy example in Fig. 6 in the appendix. We also conduct this experiment on CIFAR-10/CIFAR-100 and report the results in Table 10. Naively incorporating the background class is inferior to DBT, showing that DBT is an effective technique for utilizing background data to boost the performance of classification models.

6. CONCLUSION

In this paper, we presented a detailed theoretical analysis of the dual relationship between estimating the unknown pdf parameters and classification accuracy. Based on this theoretical study, we presented a new method called DBT that uses ABC-noise data to improve in-distribution classification accuracy. We showed that using ABC-noise data helps to better estimate the unknown parameters of the pdf family of the in-domain distribution.

EXTENDED FAMILY OF DISTRIBUTIONS

Using background/noise data can provide more information about the in-domain distribution. Let the pdf of the in-domain data points be p_x(x, [θ_s, θ_n]) and the pdf of the noise/background be p_n(x, θ_n); the extended pdf can then be represented by h(p_n(x, θ_n), p_x(x, [θ_s, θ_n])), where h is a function that combines the two pdfs into a general structure. A general family of distributions can thus be denoted by P = {h(p_n(x, θ_n), p_x(x, [θ_s, θ_n])) | θ := [θ_s, θ_n] ∈ Θ_{s,n}}, where θ is a new set of parameters in a higher dimension and Θ_{s,n} is the set of all possible [θ_s, θ_n] belonging to p_n and p_x. The extended family of pdfs provides more information about the nuisance parameters of the pdf of the in-domain data points. Inspired by this observation, we developed our detection booster training method using background/noise data. Figure 5 shows an example of a background and an in-domain data point.

PROOF OF THEOREM 1

Let t_i(·) denote a deterministic discriminative function of Θ_i. Since the efficient minimum variance and unbiased estimator of t_i(θ) exists, we have

∂ ln(p(x, θ))/∂t_i(θ) = I_{t_i}(θ)(t̂_i(x) − t_i(θ)),   (4)

where t̂_i(x) is the minimum variance and unbiased estimate of t_i(θ) from the data point x, and I_{t_i}(θ) is the Fisher information of t_i(θ), given by

I_{t_i}(θ) = (∂t_i(θ)/∂θ)^T I(θ) (∂t_i(θ)/∂θ) ≥ 0,

where T denotes the transpose and I(θ) is the Fisher information matrix of θ. Now we show that the log-likelihood ratio is increasing in t̂_i(x). Note that I_{t_i}(θ) ≥ 0 (Lehmann & Casella (2006)). On the other hand, we have d ln(p(x, θ)) = Σ_j (∂ ln(p(x, θ))/∂θ_j) dθ_j, therefore

ln(p(x, θ)) + k(x) = ∫ Σ_j (∂ ln(p(x, θ))/∂θ_j) dθ_j = ∫ Σ_j (∂ ln(p(x, θ))/∂t_i(θ)) (∂t_i(θ)/∂θ_j) dθ_j = ∫ (∂ ln(p(x, θ))/∂t_i(θ)) Σ_j (∂t_i(θ)/∂θ_j) dθ_j = ∫ I_{t_i}(θ)(t̂_i(x) − t_i(θ)) Σ_j (∂t_i(θ)/∂θ_j) dθ_j = α(θ) t̂_i(x) − β(θ),   (5)

where the third equality follows from the properties of t_i(·) in its definition, the fourth equality is obtained by substituting (4), and k(x) is the constant of integration. The last equality follows by defining

α(θ) := ∫ I_{t_i}(θ) Σ_j (∂t_i(θ)/∂θ_j) dθ_j,   β(θ) := ∫ I_{t_i}(θ) t_i(θ) Σ_j (∂t_i(θ)/∂θ_j) dθ_j,

so that dα(θ)/dt_i(θ) = I_{t_i}(θ) ≥ 0, i.e., α(θ) is increasing in t_i(θ). Since t_i is a deterministic discriminative function of Θ_i, for each j ≠ i, θ_i ∈ Θ_i, and θ_j ∈ Θ_j we have t_i(θ_i) > t_i(θ_j), and therefore α(θ_i) ≥ α(θ_j). The latter inequality follows from the increasing property of α(θ) with respect to t_i(θ). Using (5), the log-likelihood ratio of class i against the rest of the classes is LLR := ln(p(x, θ_i)) − ln(p(x, θ_j)), so we have

LLR = (α(θ_i) − α(θ_j)) t̂_i(x) − (β(θ_i) − β(θ_j)).

LLR depends on x only via t̂_i(x), and since for each j ≠ i, θ_i ∈ Θ_i, and θ_j ∉ Θ_i we have α(θ_i) − α(θ_j) > 0, LLR is increasing in t̂_i(x).
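As a worked instance of the factorization above (our illustrative example, not from the paper), consider the scalar Gaussian family p(x, θ) = N(θ, σ²) with known σ² and t(θ) = θ:

```latex
\frac{\partial \ln p(x,\theta)}{\partial \theta}
  = \frac{x-\theta}{\sigma^{2}}
  = I(\theta)\bigl(\hat{t}(x) - t(\theta)\bigr),
\qquad I(\theta) = \frac{1}{\sigma^{2}},\quad \hat{t}(x) = x,
```

so x itself is the efficient estimator. Here α(θ) = ∫ I(θ) dt = θ/σ² and β(θ) = θ²/(2σ²), and the log-likelihood ratio between two parameter values is LLR = ((θ_1 − θ_2)/σ²) x − (θ_1² − θ_2²)/(2σ²), which is increasing in t̂(x) = x whenever θ_1 > θ_2, in agreement with the form (5).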

PROOF OF THEOREM 2

The probability of true positive of class $i$ under $r_j$ is given by $P_{tp,i,j} = \Pr_\theta(h_j(x) > \tau) = 1 - F_\theta^j(\tau)$, where $F_\theta^j(\cdot)$ denotes the cumulative distribution function (CDF) of $h_j$. Since the true-positive probability of class $i$ under $r_1$ is greater than that under $r_2$ for all $\tau$, we have $F_\theta^1(\tau) < F_\theta^2(\tau)$ for all $\tau$. Now define $u(\tau,\theta) := F_\theta^2(\tau) - F_\theta^1(\tau)$. Since the CDFs are increasing in $\tau$ and tend to 1, and the number of inflection points of these CDFs is bounded, there exists an $h_{\min}$ such that $u(\tau,\theta)$ is a monotonically decreasing function of $\tau$ for $\tau > h_{\min}$. Thus for any $\theta$ that satisfies $d(\theta) > h_{\min}$ we have $u(d(\theta)+\epsilon, \theta) < u(d(\theta)-\epsilon, \theta)$. Replacing $u(\tau,\theta) = F_\theta^2(\tau) - F_\theta^1(\tau)$ in the last inequality gives
$$F_\theta^2(d(\theta)+\epsilon) - F_\theta^1(d(\theta)+\epsilon) < F_\theta^2(d(\theta)-\epsilon) - F_\theta^1(d(\theta)-\epsilon) \;\Rightarrow\; F_\theta^2(d(\theta)+\epsilon) - F_\theta^2(d(\theta)-\epsilon) < F_\theta^1(d(\theta)+\epsilon) - F_\theta^1(d(\theta)-\epsilon). \qquad (7)$$
Based on the definition of the CDF, we have
$$\Pr_\theta\big(|h_2(x) - d(\theta)| < \epsilon\big) = \Pr_\theta\big(d(\theta)-\epsilon < h_2(x) < d(\theta)+\epsilon\big) < \Pr_\theta\big(d(\theta)-\epsilon < h_1(x) < d(\theta)+\epsilon\big) = \Pr_\theta\big(|h_1(x) - d(\theta)| < \epsilon\big).$$
The error probability of $r_j$ is given by $p_{er,j} = 1 - \sum_{i=1}^{n} P_i\, P_{tp,i,j}$, where $P_i$ is the prior probability of class $i$. Therefore, $p_{er,1} \le p_{er,2}$.

CONNECTING THE THEOREMS WITH THE PROPOSED METHOD

Fig. 6 shows the connection between the proposed theorems and the approach. In part 1, Theorem 2 connects the estimation of unknown parameters to the auxiliary classifier. In part 2, the learned features are passed to a decision-making network (a result of Theorem 2). In part 3, Theorem 3 guarantees that the multi-class classifier outperforms the other classifiers, because it uses features obtained from a better estimation of the unknown parameters of the pdf.
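The concentration step in the proof of Theorem 2 can be sanity-checked numerically. The score distributions below ($h_1 \sim \mathcal{N}(1,1)$ for the better estimator, $h_2 \sim \mathcal{N}(0,1)$ for the worse one, decision point $d(\theta)=1$, $\epsilon=0.3$) are illustrative assumptions, not quantities from the paper; they satisfy $F^1_\theta(\tau) < F^2_\theta(\tau)$ for every $\tau$, as the theorem requires.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Hypothetical score CDFs: h1 ~ N(1, 1), h2 ~ N(0, 1).
F1 = lambda tau: Phi(tau - 1.0)   # better rule r1
F2 = lambda tau: Phi(tau)         # worse rule r2

# F1(tau) < F2(tau) for all tau, i.e. r1 has a higher
# true-positive rate at every threshold.
assert all(F1(t) < F2(t) for t in [-3 + 0.1 * k for k in range(61)])

# Concentration around the decision point d = 1. Here d exceeds
# h_min = 0.5, the peak of u(tau) = F2(tau) - F1(tau).
eps = 0.3
p1 = F1(1 + eps) - F1(1 - eps)    # Pr(|h1 - d| < eps)
p2 = F2(1 + eps) - F2(1 - eps)    # Pr(|h2 - d| < eps)
print(p1 > p2)                    # h1 concentrates more tightly around d
```

Running the check confirms the inequality chain in (7): the better rule's score mass sits more tightly around the decision point.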

TOY EXAMPLE:

We demonstrate the effect of adding a background class to the original classifier with a toy example, visualized in Fig. 7. In this example, the input is a sequence of binary bits (+1 and -1) of length 3 in white Gaussian noise. The classifier is constructed from two fully connected layers with sigmoid activations, and the output of the last layer is normalized onto the unit circle. As seen in Fig. 7, adding an additional noise class visibly reduces the feature separation between all the other classes.
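The toy data described above can be generated as follows. The noise level `sigma`, the samples per class, and the seed are illustrative assumptions (the paper does not specify them); the classifier itself (two fully connected layers with a unit-circle-normalized output) is omitted here.

```python
import random
from itertools import product

def make_toy_dataset(n_per_class=100, sigma=0.3, with_noise_class=True, seed=0):
    """Toy data: length-3 sequences of +/-1 bits in white Gaussian noise
    (8 signal classes), plus an optional pure-noise background class.
    sigma, n_per_class and seed are illustrative choices."""
    rng = random.Random(seed)
    patterns = list(product([-1.0, 1.0], repeat=3))   # the 8 bit patterns
    data = []
    for label, bits in enumerate(patterns):
        for _ in range(n_per_class):
            x = [b + rng.gauss(0.0, sigma) for b in bits]
            data.append((x, label))
    if with_noise_class:
        for _ in range(n_per_class):                  # background/noise class
            x = [rng.gauss(0.0, sigma) for _ in range(3)]
            data.append((x, len(patterns)))
    return data

dataset = make_toy_dataset()
labels = {y for _, y in dataset}
print(len(labels))  # 9 classes: 8 signal patterns + 1 noise class
```

Training the two-layer classifier once with `with_noise_class=False` and once with `True` reproduces the comparison in Fig. 7.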

IMPLEMENTATION DETAILS FACE RECOGNITION

We use TensorFlow (Abadi et al. (2015)) to conduct all our experiments. We train with a batch size of 256 on two NVIDIA Tesla V100 (32GB) GPUs. We train our models following the small (fewer than 1M training images) and large (more than 1M training images) protocol conventions. We use the CASIA-Webface (Yi et al. (2014)) dataset for the small protocol and the MS1MV2 dataset for the large protocol. We use ResNet-50 (He et al. (2016)) and ResNet-100 models for the small and large protocols, respectively. The PEF is selected as the first three layers.



Figure 1: Visualizing Theorems 1, 2 and 3.

Figure 2: A general schema of our proposed DBT method with PEF, DDF and ABC blocks.

Fig. 1 illustrates the proposed theorems visually.

4 PROPOSED METHOD: DETECTION BOOSTER TRAINING (DBT)

Figure 3: Maximally activated receptive fields of layer 15 of Inception-ResNet-v1 with (top row) and without (bottom row) DBT.

Figure 4: Examples of mis-identified faces along with their corresponding animal distractors on the IJB-B for ArcFace.

input data and thereby improves the feature representations and, consequently, the accuracy on image classification, speaker classification, and face recognition benchmarks. It also improves accuracy when only limited labeled data is available, effectively augmenting the training data. We showed through extensive experiments with different model architectures and datasets that the concept of DBT is generic and generalizes well across domains. Our framework is complementary to existing training methods and hence can be easily integrated with current and possibly future classification methods to enhance accuracy. In summary, the proposed DBT method is a powerful technique that can augment limited training data and improve classification accuracy in deep neural networks.

Figure 5: In-domain data point versus background data point. The background is cropped from the in-domain image and provides complementary information to the main data, so that we can obtain a better estimate of the pdf parameters of the in-domain data.

We first prove the following claim.

Claim: For any open set, there exists a countable collection of disjoint open balls such that their union equals the original open set.

Proof of claim: Consider an open set $O$ and a point $x_0 \in O$ such that $B(x_0, r_0) \subseteq O$ and $r_0$ is the greatest possible radius among all open balls contained in $O$, where $B(x_0, r_0)$ is the open ball of radius $r_0$ centered at $x_0$. Next, define $x_1 \in O - \overline{B(x_0, r_0)}$, where $\overline{B(x_0, r_0)}$ is the closure of $B(x_0, r_0)$, as the center of the ball with the greatest radius in $O - \overline{B(x_0, r_0)}$, and similarly $x_i \in O - \cup_{k=0}^{i-1} \overline{B(x_k, r_k)}$ such that $B(x_i, r_i)$ has the greatest radius in $O - \cup_{k=0}^{i-1} \overline{B(x_k, r_k)}$. Then $O = \cup_{k=0}^{\infty} B(x_k, r_k)$: if this equality did not hold, there would exist an open ball in $O - \cup_{k=0}^{\infty} B(x_k, r_k)$, hence another open ball of greatest radius would be added to $\cup_{k=0}^{\infty} B(x_k, r_k)$, contradicting the definition of $\cup_{k=0}^{\infty} B(x_k, r_k)$. This proves the claim.

Now we show that the true-positive probability of $r_1$ is greater than that of $r_2$. Let $\Theta_m^{\circ}$ be the set of interior points of $\Theta_m$; then there exists a union of disjoint open balls such that $\Theta_m^{\circ} = \cup_{k=0}^{\infty} B(x_k, r_k)$. From the assumptions in the theorem, we have $\Pr(\|\hat{\theta}_1 - \theta\| \le \epsilon) \ge \Pr(\|\hat{\theta}_2 - \theta\| \le \epsilon)$, and thus $\Pr_\theta\big(\hat{\theta}_1 \in B(x_k, r_k)\big) \ge \Pr_\theta\big(\hat{\theta}_2 \in B(x_k, r_k)\big)$, where $\theta \in \Theta_m$. Based on the claim, we have
$$\Pr_\theta\big(\hat{\theta}_1 \in \Theta_m^{\circ}\big) \ge \Pr_\theta\big(\hat{\theta}_2 \in \Theta_m^{\circ}\big). \qquad (10)$$
Moreover, based on the definition of $r_i$, the true-positive probability of class $m$ is given by $p_{tp,i} = \Pr_\theta\big(\hat{\theta}_i \in \Theta_m\big) = \Pr_\theta\big(\hat{\theta}_i \in \Theta_m^{\circ}\big) + \Pr_\theta\big(\hat{\theta}_i \in \Theta_m - \Theta_m^{\circ}\big)$,
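The greedy ball construction in the claim can be illustrated in one dimension, where an open set is a disjoint union of open intervals and the largest open ball inside an interval $(a, b)$ is the interval itself, so the greedy procedure emits one ball per component. This 1-D sketch (with a hypothetical open set) only illustrates the largest-radius-first choice; the claim itself concerns open sets in $\mathbb{R}^n$.

```python
def greedy_ball_cover(intervals):
    """Greedily cover an open set, given as disjoint open intervals,
    with disjoint open balls chosen largest-radius-first. Returns a
    list of (center, radius) pairs."""
    balls = []
    # Greedy: process components by decreasing length (radius).
    for a, b in sorted(intervals, key=lambda ab: ab[1] - ab[0], reverse=True):
        balls.append(((a + b) / 2.0, (b - a) / 2.0))   # (center, radius)
    return balls

O = [(0.0, 1.0), (2.0, 5.0)]     # hypothetical open set
balls = greedy_ball_cover(O)
print(balls)  # [(3.5, 1.5), (0.5, 0.5)]
```

The union of the returned open balls is exactly $O$, matching the equality $O = \cup_k B(x_k, r_k)$ asserted by the claim.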

Figure 6: Relationship between the theorems in Section 3 and the proposed method in Section 4.

Figure 7: Feature distance between different classes, with and without an additional background class, for a toy example. Left: 8 classes, with visibly larger feature separation; Right: an additional noise class decreases the feature distance between all the other classes.



Table 2 shows the Rank-1 identification accuracy of ResNet-100 on the IJB-B dataset, trained on MS1MV2 using the ArcFace loss (ResNet-100-AF) versus our DBT approach (ResNet-100-DBT). The third column of Table 2 reports the accuracy on a hard subset of images (false positives from the ArcFace model) on the IJB-B dataset, denoted by H-set. The results in Table 2 show that current face recognition models are unable to discriminate out-of-distribution (non-face) images from face images. Our ResNet-100-DBT significantly reduces the misidentification rate (by as much as 21%) compared to the ArcFace model, which shows that the DBT method inherently overcomes this issue while also improving face recognition accuracy.

Comparison of DBT models with SOTA methods on LFW and YTF. ArcFace** refers to our ArcFace implementation.

Comparison of inter and intra angles (degrees) for different methods on LFW.

Comparison of Top-1 error rates (%) on the CIFAR-10 and CIFAR-100 datasets with and without DBT. * denotes our implementation. (x/5) denotes the fraction of training data used for training that model.

Comparison of top-1 error rates on CIFAR-10 and CIFAR-100 using an additional background class vs DBT.

Ablation study on the verification performance of adding background class to the model on MS1MV2 dataset.

APPENDIX

IN-DOMAIN FAMILY OF PDFS AND THE EXTENDED FAMILY OF DISTRIBUTIONS

In this section, we discuss background/noise and in-domain data points and their corresponding distributions, to clarify how those concepts are used in this paper. Consider a random vector denoted by $s$. Assume that its distribution is Gaussian with mean and variance given by $\alpha = 0$ and $\sigma = 1$, respectively. Now assume that we observe $x = s + n$, where the pdf of $n$ is assumed to be Gaussian with zero mean and variance $\sigma_n^2$; hence the pdf of $x$ is Gaussian with mean $\alpha$ and variance $1 + \sigma_n^2$. Here, $n$ is the background or noise data, and the vector of unknowns is $\theta = [\alpha, \sigma_n^2]^T$. The in-domain family of pdfs for $x$ is then given by $P_x$. If we include the family of pdfs of $n$ in $P_x$, we can extend it to a family $P$; so $P$ is the union of the family of pdfs of in-domain data points and that of noise/background data. From estimation theory, we know that the sufficient statistics and the unknown parameters of $P$ can also represent the sufficient statistics and the unknown parameters of $P_x$. In other words, an estimate of $\alpha$ can help us detect whether an observed data point comes from $s + n$ or from $n$ by comparing it with a threshold. Thus, estimating the unknown parameters of the family of pdfs using $P$ can provide more information about the observed data, which is useful for tasks such as classification. In general, we can assume that a generalized family of pdfs is given by the family of pdfs of noise or background together with the family of pdfs of in-domain data, and estimate from this extended family.

Following Deng et al. (2019), we apply BN (Ioffe & Szegedy (2015)) and dropout (Srivastava et al. (2014)) to the last feature map layer, followed by a fully connected layer and batch normalization, to obtain the 512-D embedding vector. We set the feature scale parameter $s$ to 64 following (Wang et al. (2018); Deng et al. (2019)) and set the margin parameters $(m_1, m_2, m_3)$ to $(1, 0.5, 0)$, respectively. For the small-scale protocol, we start the learning rate at 0.01 and divide it by 10 at 40K, 80K, and 100K iterations; we train for 120K iterations.
For the large-scale protocol, we start the learning rate at 0.01 and divide it by 10 at 80K, 100K, and 200K iterations. We train for 240K iterations. We use the Momentum optimizer with momentum 0.9 and weight decay 5e-4. We use the feature centre of all images from a template, or of all frames from a video, to report results on the IJB-B, IJB-C, and YTF datasets.
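The appendix's Gaussian example can be sanity-checked numerically: estimating the in-domain mean from data lets a simple threshold separate in-domain observations $s + n$ from pure background $n$. The values of $\alpha$, $\sigma_n$, the midpoint threshold, and the sample sizes below are illustrative choices, not quantities from the paper.

```python
import random
from statistics import mean

# Sketch of the appendix example: in-domain x = s + n with s ~ N(alpha, 1)
# and background noise n ~ N(0, sigma_n^2). alpha = 3 and sigma_n = 1 are
# illustrative assumptions (the appendix takes alpha = 0 for exposition;
# a nonzero alpha makes the threshold test visible).
rng = random.Random(0)
alpha, sigma_n = 3.0, 1.0

in_domain = [rng.gauss(alpha, 1.0) + rng.gauss(0.0, sigma_n) for _ in range(1000)]
background = [rng.gauss(0.0, sigma_n) for _ in range(1000)]

alpha_hat = mean(in_domain)       # estimate of the unknown mean alpha
threshold = alpha_hat / 2.0       # midpoint between the two class means

detect = lambda x: x > threshold  # True = in-domain, False = background
acc = (sum(detect(x) for x in in_domain)
       + sum(not detect(x) for x in background)) / 2000.0
print(round(acc, 2))              # well above chance for these parameters
```

This is the mechanism DBT exploits: a parameter estimate obtained from the extended family $P$ supports a detection decision between in-domain and background data.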

