DAMOFD: DIGGING INTO BACKBONE DESIGN ON FACE DETECTION

Abstract

Face detection (FD) has achieved remarkable success over the past few years, yet these leaps often come at the cost of enormous computation. Moreover, in a realistic situation, i.e., building a lightweight face detector under a computation-scarce scenario, such heavy computation cost limits the application of the face detector. To remedy this, several pioneering works design tiny face detectors through off-the-shelf neural architecture search (NAS) technologies, which are usually tailored to the classification task. The searched architectures are thus sub-optimal for the face detection task, since some design criteria differ between the detection and classification tasks. As a representative example, a face detection backbone must guarantee stage-level detection ability, which is not required of a classification backbone. Furthermore, the detection backbone consumes a large share of the inference budget of the whole detection framework. Considering this intrinsic design requirement and the vital role of the face detection backbone, we ask a critical question: how can NAS be employed to search an FD-friendly backbone architecture? To answer this question, we propose a distribution-dependent stage-aware ranking score (DDSAR-Score) that explicitly characterizes stage-level expressivity and identifies the individual importance of each stage, thus satisfying the aforementioned design criterion of the FD backbone. Based on the proposed DDSAR-Score, we conduct comprehensive experiments on the challenging WIDER FACE benchmark and achieve dominant performance across a wide range of compute regimes. In particular, compared to the tiniest face detector SCRFD-0.5GF, our method is +2.5% better in Average Precision (AP) when using the same amount of FLOPs.

1. INTRODUCTION

Face detection is a fundamental task in computer vision and plays an important role in various face-related downstream applications, e.g., facial expression recognition Zhao et al. (2021), face recognition Deng et al. (2018), and face alignment Ren et al. (2014). In the last decade, we have witnessed tremendous progress in the realm of face detection. However, these leaps arrive only at the cost of huge computation, such as the heavy detection frameworks in HAMBox Liu et al. (2019), TinaFace Zhu et al. (2020), and DSFD Li et al. (2019). Moreover, when building a tiny face detector under a computation-scarce scenario, such heavy computation cost limits the application of face detectors. Constructing tiny face detectors manually has thus attracted major research interest Zhang et al. (2017a); Bazarevsky et al. (2019); these methods employ SSD Liu et al. (2016) as a basic detection framework and construct a lightweight network by substituting a manually designed backbone for the SSD feature extractor. However, they can only cover a minor range of compute regimes, hindering their application in multiple computation-scarce scenarios. Therefore, follow-up efforts have turned to neural architecture search (NAS), a promising direction for developing lightweight face detectors across a wide range of compute regimes. At present, existing NAS-based FD methods prefer to follow standard NAS approaches to search the overall face detection framework, e.g., SPNAS Guo et al. (2019) in BFBox Liu & Tang (2020), RegNet Radosavovic et al. (2020) in SCRFD Guo et al. (2021), and DARTS Liu et al. (2018) in ASFD Zhang et al. (2020a). However, these methods rely heavily on the power of off-the-shelf NAS approaches while lacking FD-friendly design, making the searched architectures sub-optimal in the realm of face detection. In this work, we make three efforts to develop a novel NAS method for automatically searching lightweight face detectors under various inference budgets, such as inference latency, FLOPs (floating point operations), and model size.

Search backbone architecture via NAS. Instead of searching the overall detection framework (a unified searching strategy), in this paper we only use NAS to search the backbone architecture. The rationale is that the detection backbone consumes huge inference costs in most popular detection frameworks Li et al. (2019); Liu et al. (2019); Tang et al. (2018); Deng et al. (2019), indicating that the superiority of a detection framework is heavily determined by its backbone architecture. Moreover, the overall detection framework consists of a backbone, a feature pyramid network (FPN), and a detection head module. As a metric to rank architectures, an accuracy predictor fails to confidently reflect the quality of a sampled backbone architecture under the unified searching strategy, since the backbone is only a preceding component of the detection framework.

Identify the major challenge in NAS-based backbone design for face detection. Over the past few years, manually designed backbones He et al. (2016); Sandler et al. (2018) have achieved significant progress on the task of image classification. Recently, neural architecture search, first introduced by Baker et al. (2016), has progressively emerged as an alternative backbone design method. However, applying NAS technology to a detection backbone is much harder than to a classification backbone, as the former needs to guarantee stage-level detection ability while the latter does not. Such task-level discrepancy inevitably means that existing NAS-based FD methods Liu et al. (2018); Guo et al. (2019); Lin et al. (2021); Radosavovic et al. (2020) are not satisfactory for the detection task. Thus, we conclude that the major challenge in NAS-based detection backbone design is how to design an accuracy predictor that measures stage-level detection ability.

Distribution-dependent stage-aware ranking score. To solve the aforementioned challenge, we propose a distribution-dependent stage-aware ranking score, termed DDSAR-Score, which acts as an accuracy predictor to measure the quality of sampled backbone architectures from a stage-wise detection-ability perspective. The motivations behind the proposed DDSAR-Score are as follows. (i) In deep learning theory Pascanu et al. (2013); Hahnloser et al. (2000); Hahnloser & Seung (2001); Bianchini & Scarselli (2014); Telgarsky (2015), the superiority of a ReLU neural network (NN) is highly related to its expressivity, i.e., the number of linear regions into which it can separate its input space. Based on this property of characterizing NN representation, we propose a stage-aware ranking score (SAR-Score), which measures stage-level expressivity in a fair way. Concretely, our stage-aware ranking score has two advantages. On the one hand, the cost of computing the exact number of linear regions grows exponentially with the depth of a ReLU NN, making the expressivity of some backbone architectures hard to obtain; this hinders the direct use of linear regions to characterize a ReLU NN's expressivity. To deal with this issue, rather than computing the exact linear regions, we adopt the lower bound of the maximal number of linear regions to approximate network expressivity, which can be computed for any network architecture at extremely low cost. Benefiting from this, our method achieves dominant performance with a search cost far lower than other state-of-the-art methods, such as BFBox Liu & Tang (2020), SCRFD Guo et al. (2021), and ASFD Zhang et al. (2020a). On the other hand, due to the chained computation of linear regions, the expressivity of deeper layers is always larger than that of shallower ones. This means the raw count can no longer confidently reflect the expressivity of shallow stages compared to deeper ones. To cope with this problem, our stage-aware ranking score makes three novel modifications; a detailed explanation is given in Sections 3 and 4.
(ii) After obtaining the stage-level expressivity (SAR-Score), we further propose a distribution-dependent stage-aware ranking score to linearly combine each stage's expressivity according to
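The lower-bound expressivity proxy and its stage-wise linear combination can be sketched in a few lines. The sketch below uses the classical lower bound of Montúfar et al. (2014) on the maximal number of linear regions of a ReLU network, evaluated in log-space; the function names, the choice of this particular bound, and the user-supplied stage weights are illustrative assumptions, not the paper's actual DDSAR-Score formulation.

```python
import math

def log_region_lower_bound(widths, n_in):
    """Log of the Montufar et al. (2014) lower bound on the maximal number
    of linear regions of a ReLU net with input dimension `n_in` and
    layer widths `widths` (each width assumed >= n_in). Log-space keeps
    deep/wide stages from overflowing."""
    *hidden, last = widths
    # prod_{l < L} floor(n_l / n_in)^{n_in}  ->  sum of logs
    log_prod = sum(n_in * math.log(max(w // n_in, 1)) for w in hidden)
    # regions cut by the final layer alone: sum_{j=0}^{n_in} C(n_L, j)
    log_tail = math.log(sum(math.comb(last, j) for j in range(n_in + 1)))
    return log_prod + log_tail

def ddsar_like_score(stage_widths, n_in, stage_weights):
    """Hypothetical linear combination of per-stage expressivity proxies.
    The paper derives its stage weights differently; here they are plain
    user-supplied coefficients for illustration."""
    per_stage = [log_region_lower_bound(ws, n_in) for ws in stage_widths]
    return sum(w * s for w, s in zip(stage_weights, per_stage))

# A deeper stage always gets a strictly larger proxy than a shallower one,
# illustrating why raw counts over-reward deep stages and need re-weighting.
shallow = log_region_lower_bound([4, 4], 2)
deep = log_region_lower_bound([4, 4, 4, 4], 2)
assert deep > shallow
```

Because the bound depends only on layer widths, ranking thousands of candidate backbones with such a proxy costs almost nothing compared to training an accuracy predictor.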

The code is available at https://github.com/ly19965/EasyFace/tree/master/face_project

