DBT: A DETECTION BOOSTER TRAINING METHOD FOR IMPROVING THE ACCURACY OF CLASSIFIERS

Abstract

Deep learning models owe their success, in large part, to the availability of large amounts of annotated data. They try to extract features from the data that contain useful information needed to improve their performance on target applications. Most works focus on directly optimizing the target loss function to improve accuracy, allowing the model to implicitly learn representations from the data. There has not been much work on using background/noise data to estimate the statistics of in-domain data to improve the feature representations of deep neural networks. In this paper, we probe this direction by deriving a relationship between the estimation of the unknown parameters of the probability density function (pdf) of the input data and classification accuracy. Using this relationship, we show that a better estimate of the unknown parameters, obtained using background and in-domain data, provides better features, which in turn leads to better accuracy. Based on this result, we introduce a simple but effective detection booster training (DBT) method that applies a detection loss function to the early layers of a neural network to discriminate in-domain data points from noise/background data, thereby improving classifier accuracy. The background/noise data comes from the same family of pdfs as the input data but with different parameter sets (e.g., mean, variance). In addition, we show that our proposed DBT method improves accuracy even with limited labeled in-domain training samples, as compared to normal training. We conduct experiments on face recognition, image classification, and speaker classification problems, and show that our method achieves superior performance over strong baselines across various datasets and model architectures.

1 Introduction

Deep learning systems achieve outstanding accuracies on a vast range of challenging computer vision, natural language, and speech recognition benchmarks (Russakovsky et al. (2015); Lin et al. (2014); Everingham et al. (2015); Panayotov et al. (2015)). The success of deep learning approaches relies on the availability of a large amount of annotated data and on extracting useful features from it for different applications. Learning rich feature representations from the available data is a challenging problem in deep learning. Related lines of work include learning deep latent-space embeddings through deep generative models (Kingma & Welling (2014); Goodfellow et al. (2014); Berthelot et al. (2019)), self-supervised learning methods (Noroozi & Favaro (2016); Gidaris et al. (2018); Zhang et al. (2016b)), and transfer learning approaches (Yosinski et al. (2014); Oquab et al. (2014); Razavian et al. (2014)).

In this paper, we propose a different approach: improving the feature representations of deep neural nets, and eventually their accuracy, by estimating the unknown parameters of the probability density function (pdf) of the input data. Parameter estimation, or point estimation, methods are well studied in the field of statistical inference (Lehmann & Casella (1998)). Insights from the theory of point estimation can help us develop better deep model architectures for improving a model's performance. We make use of this theory to derive a relationship between the estimation of the unknown pdf parameters and classifier outputs. However, directly estimating the unknown pdf parameters for practical problems such as image classification is not feasible, since they can sum up to millions of parameters. To overcome this bottleneck, we assume that the input data points are sampled from a family of pdfs instead of a single pdf, and propose a detection-based training approach to better estimate the unknowns using in-domain and background/noise data. One alternative is to use generative models for this task; however, they mimic the general data distribution rather than estimating the unknown parameters of a family of pdfs. Our proposed detection method involves a binary discriminator that separates the target data points from noise or background data. The noise or background data is assumed to come from the same family of distributions as the in-domain data but with different moments (please refer to the appendix for more details about the family of distributions).
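As a concrete illustration of the DBT idea — a detection loss applied to early-layer features alongside the usual classification loss — the following is a minimal sketch on synthetic Gaussian data. The two-head network, the weighting factor lam, and all names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from one family of pdfs: two in-domain Gaussian classes,
# plus background/noise drawn from the same family with different moments.
n, d = 200, 2
x0 = rng.normal(-2.0, 1.0, size=(n, d))      # in-domain, class 0
x1 = rng.normal(+2.0, 1.0, size=(n, d))      # in-domain, class 1
xb = rng.normal(0.0, 3.0, size=(n, d))       # background (different mean/variance)

X = np.vstack([x0, x1, xb])
y_cls = np.concatenate([np.zeros(n), np.ones(n), np.full(n, -1.0)])  # -1: no class label
y_det = np.concatenate([np.ones(2 * n), np.zeros(n)])  # in-domain vs. background

# Tiny network: one shared "early" layer, a detection head, and a class head.
W1 = rng.normal(scale=0.5, size=(d, 8))
w_det = rng.normal(scale=0.5, size=8)
w_cls = rng.normal(scale=0.5, size=8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

lam, lr = 1.0, 0.1   # lam weights the booster (detection) loss
in_dom = y_cls >= 0
for _ in range(300):
    H = np.maximum(X @ W1, 0.0)              # early-layer features
    # Detection loss on ALL points; classification loss on labeled in-domain only.
    g_det = (sigmoid(H @ w_det) - y_det) / len(X)
    g_cls = np.where(in_dom, sigmoid(H @ w_cls) - np.maximum(y_cls, 0.0), 0.0) / in_dom.sum()
    # Backprop of L = BCE_cls + lam * BCE_det through the shared layer.
    back = (np.outer(g_cls, w_cls) + lam * np.outer(g_det, w_det)) * (H > 0)
    W1 -= lr * (X.T @ back)
    w_cls -= lr * (H.T @ g_cls)
    w_det -= lr * lam * (H.T @ g_det)

H = np.maximum(X @ W1, 0.0)
cls_acc = ((sigmoid(H @ w_cls) > 0.5) == (y_cls == 1))[in_dom].mean()
det_acc = ((sigmoid(H @ w_det) > 0.5) == (y_det == 1)).mean()
print(cls_acc, det_acc)
```

Here the detection head is attached directly to the shared early features; in DBT the detection loss would similarly be applied at an early layer of a deeper network, while the classification loss stays at the output.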


Since ABC-noise data can be collected in large quantities for free, and using it in our approach improves the classification benchmarks, we investigate whether this data can act as a substitute for labeled data. We conduct an empirical analysis and show that using only a fraction of the labeled training data together with ABC-noise data in our DBT method indeed improves the accuracy as compared to normal training.

We assume that x belongs to a family of probability density functions (pdfs) defined as P = {p(x, θ), θ ∈ Θ}, where Θ is the set of possible parameters of the pdf. In general, θ is a real vector in higher dimensions. For example, in a mixture of Gaussians, θ is a vector containing the component weights, the component means, and the component covariance matrices. In this paper, we assume that θ is an unknown deterministic vector (other approaches, such as Bayesian ones, consider θ as a random vector). In general, although the structure of the family of pdfs is itself unknown, defining a family of pdfs such as P helps us to develop theorems and use those results to derive a new method. For the family of distributions P, we can define the following classification problem:

C_i : x ∼ p(x, θ_i), θ_i ∈ Θ_i,   (1)

where the set of Θ_i's is a partition of Θ. The notation of (1) means that class C_i deals with the set of data points whose pdf is p(x, θ_i) with θ_i ∈ Θ_i. A wide range of classification problems can be defined using (1); see, e.g., (Lehmann & Casella, 2006, Chapter 3) and (Duda et al., 2012, Chapter 4).
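The classification problem in (1) can be made concrete with a toy instance of P — a hypothetical sketch, not taken from the paper: a family of unit-variance Gaussians N(μ, 1) whose parameter set is partitioned by the sign of μ, classified by point-estimating the unknown parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Family P = {N(mu, 1) : mu in Theta}, Theta = (-3, 3), partitioned into
# Theta_1 = (-3, 0) for class C_1 and Theta_2 = [0, 3) for class C_2.
def sample_from_class(i, m=20):
    """Draw m points whose pdf is p(x, theta_i) with theta_i in Theta_i."""
    mu = rng.uniform(-3.0, 0.0) if i == 1 else rng.uniform(0.0, 3.0)
    return rng.normal(mu, 1.0, size=m)

def classify(x):
    """Point-estimate the unknown theta (here the MLE, the sample mean)
    and report which cell of the partition it falls in."""
    return 1 if x.mean() < 0.0 else 2

trials = 200
correct = sum(classify(sample_from_class(int(i))) == i
              for i in rng.integers(1, 3, size=trials))
print(correct / trials)
```

A better estimator of θ (more samples, lower variance) directly lowers the misclassification rate here, which is the kind of relationship between parameter estimation and classification accuracy that the paper exploits.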


The problem of estimating θ comes under the category of parametric estimation, or point estimation (Lehmann & Casella (1998)). For an unbiased estimator θ̂ of θ, the covariance matrix satisfies Cov(θ̂) ⪰ I^{-1}(θ) (in the positive semidefinite sense), where I(θ) = E[(∂ log p(x, θ)/∂θ)(∂ log p(x, θ)/∂θ)^T] is called the Fisher information matrix. For an arbitrary differentiable function g(·), an efficient estimator of g(θ) is an unbiased estimator whose covariance matrix equals I_g^{-1}(θ), where I_g(θ) is the Fisher information matrix of g(θ); i.e., the efficient estimator achieves the lowest possible variance among unbiased estimators.
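The Fisher information and the efficiency notion above can be checked numerically for the simplest family, N(θ, σ²) with σ known, where I(θ) = 1/σ² and the sample mean is the efficient estimator of θ; the sketch below is an illustration under those textbook assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, n = 1.5, 2.0, 50
fisher = 1.0 / sigma**2               # I(theta) for x ~ N(theta, sigma^2)

# Check 1: the variance of the score equals I(theta).
x = rng.normal(theta, sigma, size=1_000_000)
score = (x - theta) / sigma**2        # d/dtheta log p(x, theta)
print(score.var(), fisher)            # both close to 0.25

# Check 2: the sample mean of n draws is unbiased and its variance
# attains the Cramer-Rao bound I(theta)^{-1} / n.
est = rng.normal(theta, sigma, size=(100_000, n)).mean(axis=1)
crb = 1.0 / (n * fisher)
print(est.var(), crb)                 # both close to 0.08
```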

