TRAINING A MULTI-STAGE DEEP CLASSIFIER WITH FEEDBACK SIGNALS

Abstract

The Multi-Stage Classifier (MSC), in which several classifiers work sequentially in an arranged order and the classification decision is partially made at each step, is widely used in industrial applications for various resource-limitation reasons. The classifiers of a multi-stage process are usually Neural Network (NN) models trained independently or in their inference order, without considering signals from the later stages. Targeting the two-stage binary classification process, the most common type of MSC, we propose a novel training framework named Feedback Training. The classifiers are trained in an order reverse to their actual working order, and the classifier at the later stage is used to guide the training of the initial-stage classifier via a sample weighting method. We experimentally show the efficacy of our proposed approach and its great superiority in the few-shot training scenario.

1. INTRODUCTION

State-of-the-art deep neural networks have equipped a wide variety of applications with much better quality, especially since the emergence of BERT Devlin et al. (2018), a Transformer Vaswani et al. (2017) based pre-trained language model, and a series of its derivatives Brown et al. (2020); Lan et al. (2019). Their great success is mainly due to their scalability to encode large-scale data and to maneuver billions of model parameters. However, it is rather difficult to deploy them in real-time products such as fraud detection Senator et al. (1995); Kirkland et al. (1999), search and recommendation systems Covington et al. (2016); Ren et al. (2021), and many mobile applications, not only because of their high computational complexity but also because of their large memory requirements. Several techniques have been developed to trade off performance against model scale. Knowledge Distillation (KD) Hinton et al. (2015); Sanh et al. (2019) is the most empirically successful approach, used to transfer the knowledge learnt by a heavy Teacher to a lighter and faster Student. Besides, Pruning Han et al. (2015); Frankle & Carbin (2018) and Quantization Han et al. (2015); Chen et al. (2020) further compress deep models. In many practical situations, however, we need super tiny models to meet demanding memory and latency requirements, and such models inevitably suffer serious performance degradation. From another perspective, multi-stage classification systems Trapeznikov et al. (2012) are widely used to reduce the frequency of calling deep and cumbersome models by filtering out some, or even most, of the input samples using simpler and faster models trained with limited data and easier features. In a multi-stage system, lightweight models such as SVMs, Logistic Regression, or k-Nearest Neighbors serve as earlier-stage classifiers, classifying samples (usually the relatively easy negative ones) based on simple or easily accessible features and leaving indeterminate ones for later. Models at later stages need to be heavier to deal with harder samples as well as more complex and costly features. A two-stage working mechanism is shown in Figure 1(a). In several practical multi-stage applications, as shown in Figure 1(b) and (c), classifiers in different stages are trained independently or sequentially without considering the relationships among them Isler et al. (2019); Kruthika et al. (2019). To build tighter connections between classifiers in a multi-stage system for better collaboration, most existing methods Mendes et al. (2020); Qi et al. (2019); Sabokrou et al. (2017); Zeng et al. (2013) jointly optimize the multi-stage classifiers in a cascade-like way, allowing contextual information to flow from earlier stages to later ones. However, most of them primarily consider classification accuracy rather than latency, and therefore do not make classification decisions until the final stage.

In this paper, we further explore how to forge closer connections between classifiers in a two-stage classification problem. We propose a novel training framework, Feedback Training, in which the whole decision-making pipeline consists of two classifiers: an extremely lightweight Pre-classifier followed by a relatively heavier Main-classifier. Unlike existing methods, these two models are trained in the reverse order of inference; that is, the first-stage model is trained under the guidance of the second-stage one through a sample weighting method. The capacity of the Pre-classifier is explored more effectively by considering the learning results of the Main-classifier. Our contributions are threefold: 1) we propose a novel training framework for two-stage classification applications; 2) we discuss a sample weighting method that helps the Pre-classifier learn according to its preference; 3) we verify on two data sets that our approach outperforms baseline models significantly and shows even greater superiority in few-shot scenarios.

2. PRELIMINARIES

We consider a binary classification problem with the training set D = {(x_1, y_1), . . . , (x_n, y_n)}, where x_i denotes the i-th observed training sample paired with its label y_i ∈ {0, 1}. In an m-stage process, a series of predictive functions F = {f_{θ_j}(·) | j = 1, 2, ..., m} work in the given order, where the j-th predictive function is parameterized by θ_j. The classification decision is partially made at each step. One popular design is to filter out as many negative samples as possible in the earlier stages, leaving the positive ones to the end for the final decision. When classifiers are trained independently, without considering the others, each one is trained by optimizing the basic cross-entropy loss:

L(f_{θ_j}(·)) = -(1/n) Σ_{i=1}^{n} L_CE(y_i, f_{θ_j}(x_i))    (1)

where L_CE(y, ŷ) = y · log ŷ + (1 - y) · log(1 - ŷ)    (2)
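The per-stage objective of Eq. 1 and Eq. 2 can be sketched in plain Python; this is a minimal illustration with made-up predictions, not the paper's training code.

```python
import math

def cross_entropy(y, y_hat, eps=1e-12):
    """Per-sample term L_CE(y, y_hat) from Eq. 2."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # clip to avoid log(0)
    return y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)

def stage_loss(labels, preds):
    """Independent-training objective for one stage (Eq. 1):
    L = -(1/n) * sum_i L_CE(y_i, f(x_i))."""
    n = len(labels)
    return -sum(cross_entropy(y, p) for y, p in zip(labels, preds)) / n

# Confident, mostly-correct predictions yield a small loss.
print(round(stage_loss([1, 0, 1], [0.9, 0.2, 0.8]), 4))  # 0.1839
```

With independent training, each stage minimizes this loss on its own data and features, which is exactly the baseline the proposed Feedback Training improves upon.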

Figure 1: Working process and different training strategies for two-stage classifiers. We use the terms Pre-classifier and Main-classifier to denote the classifiers working at the 1st and 2nd stages. (a) The working process of a two-stage classifier: only samples that pass the Pre-classifier are fed into the heavier Main-classifier for the final decision; the others are judged as negative without calling the Main-classifier. (b) Independent Training: all classifiers are trained independently, without considering each other's training results. (c) Sequential Training: classifiers are trained in their working order; only samples that pass the Pre-classifier are fed into the Main-classifier for training. (d) Feedback Training: classifiers are trained in the reverse order of inference; the Pre-classifier assigns different attention to different samples based on the training result of the Main-classifier and the proposed sample weighting approach.
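The working process of panel (a) can be sketched as follows; the toy classifier callables and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
def two_stage_predict(x, pre_classifier, main_classifier, threshold=0.5):
    """Figure 1(a): the Pre-classifier screens each sample; only samples
    it passes reach the heavier Main-classifier for the final decision."""
    pre_score = pre_classifier(x)   # cheap features, tiny model
    if pre_score < threshold:
        return 0                    # rejected early; Main-classifier never called
    return 1 if main_classifier(x) >= threshold else 0

# Toy stand-ins: a weak one-feature gate and a "heavier" model.
pre = lambda x: x[0]                # e.g. a single easily accessible feature
main = lambda x: (x[0] + x[1]) / 2  # pretend this is the expensive model

print(two_stage_predict([0.1, 0.9], pre, main))  # 0 (filtered at stage 1)
print(two_stage_predict([0.8, 0.9], pre, main))  # 1 (passed on to Main-classifier)
```

The latency savings come from the early return: easy negatives never invoke the expensive second-stage model.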

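The exact sample weighting rule of Feedback Training (panel (d)) is not specified in this excerpt; the sketch below is one plausible instantiation for illustration only, in which samples the already-trained Main-classifier handles confidently are up-weighted so the Pre-classifier focuses on the decisions it can safely make at stage 1. All function names and the weighting formula are assumptions.

```python
import math

def feedback_weights(main_scores):
    """Illustrative weighting: distance of the Main-classifier's score from
    0.5, rescaled to [0, 1]. NOT the paper's actual rule."""
    return [2 * abs(s - 0.5) for s in main_scores]

def weighted_pre_loss(labels, pre_preds, weights, eps=1e-12):
    """Sample-weighted cross-entropy used to train the Pre-classifier
    (a weighted variant of the standard per-stage objective, Eq. 1)."""
    total = 0.0
    for y, p, w in zip(labels, pre_preds, weights):
        p = min(max(p, eps), 1 - eps)
        total -= w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Step 1 (reverse order): train the Main-classifier, collect its scores.
main_scores = [0.05, 0.45, 0.90]         # pretend outputs on the training set
weights = feedback_weights(main_scores)  # confident samples get larger weight
# Step 2: train the Pre-classifier under the feedback weights.
loss = weighted_pre_loss([0, 0, 1], [0.2, 0.5, 0.7], weights)
```

The key structural point matches the paper: the second-stage model is trained first, and its outputs shape the first-stage objective through per-sample weights.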
