TRAINING A MULTI-STAGE DEEP CLASSIFIER WITH FEEDBACK SIGNALS

Abstract

The Multi-Stage Classifier (MSC), in which several classifiers work sequentially in a fixed order and a partial classification decision is made at each step, is widely used in industrial applications for a variety of resource-limitation reasons. The classifiers in a multi-stage process are usually Neural Network (NN) models trained independently, or in their inference order, without considering signals from the later stages. Focusing on the two-stage binary classification process, the most common type of MSC, we propose a novel training framework named Feedback Training. The classifiers are trained in the reverse of their actual working order, and the later-stage classifier guides the training of the initial-stage classifier via a sample-weighting method. We experimentally show the efficacy of the proposed approach and its clear superiority in the few-shot training scenario.
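The training-order reversal described above can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's exact formulation: both stages are plain logistic regressions, and the feedback signal weights each training sample by the later stage's confidence so that the initial-stage classifier is guided by the later stage rather than by the raw labels alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_logreg(X, y, w=None, lr=0.1, steps=500):
    """Logistic regression via gradient descent with optional
    per-sample weights (a stand-in for any trainable classifier)."""
    if w is None:
        w = np.ones(len(y))
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (w * (p - y)) / w.sum()
        theta -= lr * grad
    return theta

def predict_proba(X, theta):
    return 1.0 / (1.0 + np.exp(-X @ theta))

# Toy data: 2-D features with a linearly separable binary label.
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Feedback Training: fit the *later* stage first ...
theta2 = train_logreg(X, y)

# ... then derive per-sample weights from the later stage's output
# (hypothetical scheme: emphasize samples stage 2 is confident about).
p2 = predict_proba(X, theta2)
feedback_w = np.abs(p2 - 0.5) * 2.0   # confidence in [0, 1]

# Finally, train the initial-stage classifier under those weights.
theta1 = train_logreg(X, y, w=feedback_w)
acc = ((predict_proba(X, theta1) > 0.5) == (y > 0.5)).mean()
print(f"stage-1 training accuracy: {acc:.2f}")
```

The key point of the sketch is only the ordering: the later-stage model is trained first, and its predictions shape the loss of the earlier-stage model through `sample weights`.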

1. INTRODUCTION

State-of-the-art deep neural networks have equipped a wide variety of applications with much better quality, especially since the emergence of BERT Devlin et al. (2018), a Transformer Vaswani et al. (2017) based pre-trained language model, and a series of its derivatives Brown et al. (2020); Lan et al. (2019). Their great success is mainly due to their scalability to encode large-scale data and to maneuver billions of model parameters. However, it is rather difficult to deploy them in real-time products such as fraud detection Senator et al. (1995); Kirkland et al. (1999), search and recommendation systems Covington et al. (2016); Ren et al. (2021), and many mobile applications, not only because of the high computational complexity but also because of the large memory requirements. Several techniques have been developed to trade off performance against model scale. Knowledge Distillation (KD) Hinton et al. (2015); Sanh et al. (2019) is the most empirically successful approach, transferring the knowledge learnt by a heavy Teacher to a lighter and faster Student. Besides, Pruning Han et al. (2015); Frankle & Carbin (2018) and Quantization Han et al. (2015); Chen et al. (2020) compress deep models even further. In many practical situations, however, we need extremely tiny models to meet the demanding memory and latency requirements, and such models inevitably suffer serious performance degradation.

From another perspective, multi-stage classification systems Trapeznikov et al. (2012) are widely used to reduce how often the deep and cumbersome models are called, by filtering out some, or even most, of the input samples using simpler and faster models trained with limited data and easier features. In a multi-stage system, light-weight models such as SVMs, Logistic Regression, or k-Nearest Neighbors serve as earlier-stage classifiers, classifying samples (usually the relatively easy negative ones) based on simple or easily accessible features and leaving indeterminate ones for later stages. Models at later stages need to be heavier in order to deal with harder samples as well as more complex and costly features. A two-stage working mechanism is shown in Figure 1(a). In several practical multi-stage applications, as shown in Figure 1(b) and (c), classifiers at different stages are trained independently or sequentially, without considering the relationships among them Isler et al. (2019); Kruthika et al. (2019). To build tighter connections between the classifiers of a multi-stage system for better collaboration, most existing methods Mendes et al. (2020); Qi et al. (2019); Sabokrou et al. (2017); Zeng et al. (2013) jointly optimize the multi-stage classifiers in a cascade-like manner, allowing contextual information to flow from earlier stages to later ones. However, most of them primarily consider classification
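The two-stage working mechanism described above can be sketched in a few lines. The scorers, features, and rejection threshold below are all hypothetical stand-ins: a cheap stage-1 model rejects confident negatives early, and only the remaining samples ever reach the expensive stage-2 model.

```python
import numpy as np

rng = np.random.default_rng(1)

def stage1_score(x):
    """Cheap early-stage scorer (stand-in for e.g. logistic
    regression over simple, easily accessible features)."""
    return 1.0 / (1.0 + np.exp(-(x[0] + x[1])))

def stage2_score(x):
    """Expensive later-stage scorer (stand-in for a heavy deep
    model using more complex and costly features)."""
    return 1.0 / (1.0 + np.exp(-(2 * x[0] + 2 * x[1] + 0.5 * x[0] * x[1])))

REJECT_BELOW = 0.2  # hypothetical threshold for early rejection

def two_stage_classify(x):
    s1 = stage1_score(x)
    if s1 < REJECT_BELOW:
        return 0, "stage1"   # confident negative: decided at stage 1
    return int(stage2_score(x) > 0.5), "stage2"

X = rng.normal(size=(1000, 2))
results = [two_stage_classify(x) for x in X]
frac_early = sum(1 for _, s in results if s == "stage1") / len(results)
print(f"{frac_early:.0%} of samples never reach the heavy stage-2 model")
```

The compute saving comes entirely from the early exit: every sample decided at stage 1 skips the heavy model, which is why the earlier stages are typically tuned to reject easy negatives rather than to make final positive decisions.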

