DAMOFD: DIGGING INTO BACKBONE DESIGN ON FACE DETECTION

Abstract

Face detection (FD) has achieved remarkable success over the past few years, yet these leaps often come at the cost of enormous computation. Moreover, in a realistic situation, i.e., building a lightweight face detector for a computation-scarce scenario, such heavy computation cost limits the application of the face detector. To remedy this, several pioneering works design tiny face detectors through off-the-shelf neural architecture search (NAS) technologies, which are usually tailored to the classification task. The searched architectures are thus sub-optimal for face detection, since some design criteria differ between the detection and classification tasks. As a representative example, a face detection backbone needs to guarantee stage-level detection ability, which is not required of a classification backbone. Furthermore, the backbone consumes a large share of the inference budget of the whole detection framework. Considering this intrinsic design requirement and the vital role of the face detection backbone, we ask a critical question: how can NAS be employed to search for an FD-friendly backbone architecture? To answer it, we propose a distribution-dependent stage-aware ranking score (DDSAR-Score) that explicitly characterizes stage-level expressivity and identifies the individual importance of each stage, thereby satisfying the aforementioned design criterion of the FD backbone. Based on the proposed DDSAR-Score, we conduct comprehensive experiments on the challenging Wider Face benchmark and achieve dominant performance across a wide range of compute regimes. In particular, compared to the tiniest face detector SCRFD-0.5GF, our method is 2.5% better in Average Precision (AP) at the same FLOPs.

1. INTRODUCTION

Face detection is a fundamental task in computer vision and plays an important role in various face-related downstream applications, e.g., facial expression recognition Zhao et al. (2021), face recognition Deng et al. (2018), and face alignment Ren et al. (2014). In the last decade, we have witnessed tremendous progress in the realm of face detection. However, these leaps arrive only at the cost of huge computation, as in the heavy detection frameworks of HAMBox Liu et al. (2019), TinaFace Zhu et al. (2020), and DSFD Li et al. (2019). Moreover, when building a tiny face detector for a computation-scarce scenario, such heavy computation cost limits the application of face detectors. Constructing tiny face detectors manually has therefore attracted major research interest Zhang et al. (2017a); Bazarevsky et al. (2019); these works employ SSD Liu et al. (2016) as the basic detection framework and build a lightweight network by substituting a manually designed backbone for the SSD feature extractor. However, such methods cover only a narrow range of compute regimes, hindering their application across multiple computation-scarce scenarios. Follow-up efforts therefore turned to neural architecture search (NAS), a promising direction for developing lightweight face detectors across a wide range of compute regimes. At present, existing NAS-based FD methods prefer to follow the standard NAS approach and search the overall face detection framework, e.g., SPNAS Guo et al. (2019) in BFBox Liu & Tang (2020), RegNet Radosavovic et al. (2020) in SCRFD Guo et al. (2021), and DARTS Liu et al. (2018) in ASFD Zhang et al. (2020a). However, these methods rely heavily on the power of off-the-shelf NAS approaches and lack FD-friendly design, making the searched architectures sub-optimal in the realm of face detection.
In this work, we make three efforts to develop a novel NAS method for automatically searching lightweight face detectors under various inference budgets, such as inference latency, FLOPs (floating-point operations), and model size. Searching the backbone architecture via NAS. Instead of searching the overall detection framework (the unified searching strategy), we use NAS to search only the backbone architecture. The rationale is that the backbone consumes most of the inference cost in popular detection frameworks Li et al. (2019); Liu et al. (2019); Tang et al. (2018); Deng et al. (2019), so the superiority of a detection framework is heavily determined by its backbone architecture. Moreover, the overall detection framework consists of the backbone, the feature pyramid network (FPN), and the detection head. When the unified searching strategy is adopted, the accuracy predictor used to rank architectures cannot confidently reflect the quality of a sampled backbone, since the backbone is only a preceding component of the detection framework. Identifying the major challenge in NAS-based backbone design for face detection. Over the past few years, manually designed backbones He et al. (2016); Sandler et al. (2018) have achieved significant progress on image classification. Recently, neural architecture search, first introduced by Baker et al. (2016), has progressively emerged as an alternative backbone design method. However, applying NAS to a detection backbone is much harder than to a classification backbone, since the former must guarantee stage-level detection ability while the latter need not. This task-level discrepancy inevitably means that existing NAS-based FD methods Liu et al. (2018); Guo et al. (2019); Lin et al. (2021); Radosavovic et al. (2020) are not satisfactory for the detection task.
Thus, the major challenge in NAS-based detection backbone design is how to design an accuracy predictor that measures stage-level detection ability. Distribution-dependent stage-aware ranking score. To solve this challenge, we propose a distribution-dependent stage-aware ranking score, termed DDSAR-Score, which acts as an accuracy predictor that measures the quality of sampled backbone architectures from a stage-wise detection-ability perspective. The motivations behind the proposed DDSAR-Score are as follows. (i) In deep learning theory Hahnloser et al. (2000); Hahnloser & Seung (2001); Pascanu et al. (2013); Bianchini & Scarselli (2014); Telgarsky (2015), the superiority of a ReLU neural network (NN) is highly related to its expressivity, i.e., the number of linear regions into which it can separate its input space. Based on this way of characterizing NN representation, we propose a stage-aware ranking score (SAR-Score) that measures stage-level expressivity in a fair way. Concretely, our stage-aware ranking score addresses two obstacles. On the one hand, the cost of computing the number of linear regions grows exponentially with the depth of a ReLU NN, making the expressivity of some backbone architectures hard to obtain; this inevitably hinders using linear regions to characterize expressivity directly. To deal with this issue, rather than computing the exact number of linear regions, we approximate network expressivity by a lower bound on the maximal number of linear regions, which can be computed for any network architecture at extremely low cost. Benefiting from this, our method achieves dominant performance with a searching cost far below that of other state-of-the-art methods such as BFBox Liu & Tang (2020), SCRFD Guo et al. (2021), and ASFD Zhang et al. (2020a).
On the other hand, due to the chained computation of linear regions, the expressivity of a deeper layer is always larger than that of a shallow one, so the raw bound cannot confidently reflect the expressivity of shallow stages relative to deeper ones. To cope with this problem, our stage-aware ranking score makes several novel modifications, explained in detail in Sections 3 and 4. (ii) After obtaining the stage-level expressivity (SAR-Score), we further propose a distribution-dependent stage-aware ranking score that linearly combines the stage expressivities according to their individual importance. Inspired by the fact that the detection ability of each stage should be highly correlated with the ratio of ground-truths matched at that stage, as illustrated in Guo et al. (2021), we determine the importance of each stage expressivity (i.e., of each stage-aware ranking score) according to the prior ground-truths distribution. The proposed DDSAR-Score is thereby formed. The rest of the paper is organized as follows. In Section 3, we review how linear regions characterize the expressivity of a ReLU NN. In Section 4, we introduce our method and its advantages. Experimental results under different FLOPs budgets are provided in Section 5. Finally, we conclude the paper in Section 6.

2. RELATED WORK

Face Detection. Existing face detection methods focus on solving extreme scale variance through label assignment, scale-level data augmentation strategies, and feature enhancement modules. (i) S³FD Zhang et al. (2017b) introduces a scale compensation anchor matching strategy to help outer faces match enough anchors. Based on the observation that some negative anchors possess significant regression ability, HAMBox Liu et al. (2019) introduces an online high-quality anchor mining strategy to exploit such anchors. (ii) Tang et al. (2018) proposes a data-anchor-sampling strategy to increase the ratio of small faces in the training data, which is not a robust solution since it only excels at detecting small faces. To solve this, MogFace Liu et al. (2022) analyzes the relationship between the performance of each pyramid layer and the number of ground-truths it matches, and further introduces a selective-scale enhancement strategy to maximize the detection ability of particular pyramid layers. (iii) SSH Najibi et al. (2017) builds a detection module to increase the receptive field on multiple feature maps. PyramidBox Tang et al. (2018) designs a context-sensitive prediction module to enjoy the gain of a wider and deeper network. RetinaFace Deng et al. (2019) adopts a deformable convolution layer to dynamically increase context information, improving face detection performance significantly. Even though tremendous strides have been made by the aforementioned efforts, they typically adopt an off-the-shelf backbone architecture (e.g., ResNet-50 He et al. (2016)), leaving FD-friendly backbone design largely uninvestigated. In this work, we propose the DDSAR-Score to measure the effectiveness of a backbone architecture from a stage-level perspective, which satisfies the detection backbone design requirement.
Then, based on NAS technology, a series of FD-friendly backbones under different inference budgets can be searched efficiently. Neural Architecture Search. Early NAS methods Baker et al. (2016); Liu et al. (2018); Guo et al. (2019) focus on searching backbone architectures for image recognition. Typically, they first determine the search space and then search for the optimal architecture according to an accuracy predictor. Motivated by the success of NAS on image recognition, some researchers have begun to pay attention to face detection. BFBox Liu & Tang (2020) devises a joint searching strategy that searches the backbone and the connection mode of the feature pyramid network simultaneously. ASFD Zhang et al. (2020a) discovers the importance of the feature enhancement module and searches the head and the feature fusion mode. SCRFD Guo et al. (2021) directly searches the overall detection framework, including backbone, head, and feature fusion mode. All these methods emphasize searching the detection framework from a unified perspective, which damages the quality of the searched backbone architecture. In contrast, we propose an effective DDSAR-Score to measure the quality of a sampled backbone network from a stage-level perspective, achieving a satisfactory computation allocation across stages.

3. PRELIMINARIES

In this paper, we study how to use linear regions to represent CNN stage-level expressivity impartially; based on this, a robust accuracy predictor can be designed for the subsequent NAS procedure. We first formally introduce the concept of a linear region and the corresponding background. The ReLU CNN $\mathcal{N}$ we consider has $L$ hidden convolutional layers and $H$ neurons; each hidden layer contains one convolutional operator followed by a ReLU activation Glorot et al. (2011). We denote by $h_0 \times w_0 \times d_0$ the dimension of the input neurons $x_0$ and by $h_l \times w_l \times d_l$ the dimension of the output $x_l$ of the $l$-th hidden layer, for $1 \le l \le L$. Thus, we can view the ReLU CNN $\mathcal{N}$ as a piece-wise linear function
$$F_{\mathcal{N}}: \mathbb{R}^{h_0 \times w_0 \times d_0} \to \mathbb{R}^{h_L \times w_L \times d_L}, \qquad F_{\mathcal{N}}(x_0; \theta) = g_L \circ f_L \circ \cdots \circ g_1 \circ f_1(x_0),$$
where each $f_l$ is an affine pre-activation function, each $g_l$ is a ReLU activation ($1 \le l \le L$), and the parameter $\theta$ collects the weight matrices and bias vectors of $\mathcal{N}$. Following the definitions in Pascanu et al. (2013); Montufar et al. (2014); Serra et al. (2018); Brandfonbrener (2018); Hanin & Rolnick (2019a;b), a linear region of a ReLU CNN $\mathcal{N}$ corresponding to $\theta$ is
$$R^{\mathcal{N}}_{\theta} := \{x_0 \in \mathbb{R}^{h_0 \times w_0 \times d_0} : g(h(x_0; \theta)) > 0, \ \forall h \text{ a neuron in } \mathcal{N}\},$$
where $g(x) = \max\{0, x\}$ is the ReLU function and $h(x_0; \theta)$ is the pre-activation of neuron $h$. We denote by $NR^{\mathcal{N}}_{\theta} = \#\{R^{\mathcal{N}}_{\theta} : R^{\mathcal{N}}_{\theta} \neq \emptyset\}$ the number of linear regions of $\mathcal{N}$ at $\theta$; given a set $T$, $\#T$ denotes its number of elements. $NR^{\mathcal{N}}_{\theta}$ is widely recognized as an expressivity proxy Montufar et al. (2014); Hanin & Rolnick (2019a) for a network $\mathcal{N}$ with weights $\theta$. Furthermore, let $NR^{\mathcal{N}}_{\max} := \max_{\theta} NR^{\mathcal{N}}_{\theta}$ be the maximal number of linear regions of $\mathcal{N}$ as $\theta$ ranges over $\mathbb{R}^{\#\text{weights}+\#\text{bias}}$. In the remainder of this section, we recall two theorems from Xiong et al. (2020), which give the number of linear regions of a one-layer ReLU CNN and a lower bound on the maximal number of linear regions of a multi-layer ReLU CNN, respectively.

Theorem 1 (Theorem 2 from Xiong et al. (2020)) Assume that $\mathcal{N}$ is a one-layer ReLU CNN with input dimension $h_0 \times w_0 \times d_0$ and hidden-layer dimension $h_1 \times w_1 \times d_1$. The $d_1$ filters have dimension $f^{(1)}_1 \times f^{(2)}_1 \times d_0$ and stride $s_1$. Suppose that the parameters $\theta = \{W, B\}$ are drawn from a fixed distribution $\mu$ which has densities with respect to the Lebesgue measure in $\mathbb{R}^{\#\text{weights}+\#\text{bias}}$. Define $I_{\mathcal{N}} = \{(i, j) : 1 \le i \le h_1, 1 \le j \le w_1\}$ and $S_{\mathcal{N}} = (S_{i,j})_{h_1 \times w_1}$, where $S_{i,j} = \{(a + (i-1)s_1, b + (j-1)s_1, c) : 1 \le a \le f^{(1)}_1, 1 \le b \le f^{(2)}_1, 1 \le c \le d_0\}$ for each $(i, j) \in I_{\mathcal{N}}$. Let $K_{\mathcal{N}} := \{(t_{i,j})_{(i,j) \in I_{\mathcal{N}}} : t_{i,j} \in \mathbb{N}, \ \sum_{(i,j) \in J} t_{i,j} \le \# \cup_{(i,j) \in J} S_{i,j} \ \forall J \subseteq I_{\mathcal{N}}\}$. Then the expectation of the number $NR^{\mathcal{N}}_{\theta}$ of linear regions of $\mathcal{N}$ equals $NR^{\mathcal{N}}_{\max}$:
$$\mathbb{E}_{\theta \sim \mu}[NR^{\mathcal{N}}_{\theta}] = NR^{\mathcal{N}}_{\max} = \sum_{(t_{i,j})_{(i,j) \in I_{\mathcal{N}}} \in K_{\mathcal{N}}} \ \prod_{(i,j) \in I_{\mathcal{N}}} \binom{d_1}{t_{i,j}}. \quad (1)$$

Theorem 2 (Theorem 5 from Xiong et al. (2020)) Suppose that $\mathcal{N}$ is a ReLU CNN with $L$ hidden convolutional layers. The input dimension is $h_0 \times w_0 \times d_0$; the $l$-th hidden layer has dimension $h_l \times w_l \times d_l$ for $1 \le l \le L$; and there are $d_l$ filters with dimension $f^{(1)}_l \times f^{(2)}_l \times d_{l-1}$ and stride $s_l$ in the $l$-th layer. Assume that $d_l \ge d_0$ for each $1 \le l \le L$. Then the maximal number $NR^{\mathcal{N}}_{\max}$ of linear regions of $\mathcal{N}$ is at least (lower bound)
$$NR^{\mathcal{N}}_{\max} \ge NR^{\mathcal{N}'}_{\max} \prod_{l=1}^{L-1} \left\lfloor \frac{d_l}{d_0} \right\rfloor^{h_l \times w_l \times d_0}, \quad (2)$$
where $\mathcal{N}'$ is a one-layer ReLU CNN with input dimension $h_{L-1} \times w_{L-1} \times d_0$, hidden-layer dimension $h_L \times w_L \times d_L$, and $d_L$ filters with dimension $f^{(1)}_L \times f^{(2)}_L \times d_0$ and stride $s_L$. Note that the maximal number of linear regions $NR^{\mathcal{N}'}_{\max}$ of $\mathcal{N}'$ can be calculated by Eq. 1.

4. METHODOLOGY

In this section, we first propose a novel SAR-Score that measures the expressivity of different stages while making them comparable. Then, we determine the importance of each stage according to the prior ground-truths distribution, based on which a DDSAR-Score is proposed to measure the backbone with a unified proxy while preserving the ability to characterize stage-wise expressivity. Finally, combining the DDSAR-Score with available NAS technology (i.e., search-space design and evolutionary architecture search) completes the search for the FD backbone architecture.

4.1. STAGE-AWARE RANKING SCORE

We first revisit the standard neural architecture search framework, which consists of two key components: an architecture generator and an accuracy predictor. The former is responsible for generating potentially high-quality architectures and the latter for predicting their accuracy. Because existing accuracy predictors lack tailored consideration of stage-level representation, adopting off-the-shelf NAS technology to design an FD-friendly backbone is often not satisfactory. In this part, we look closely into the lower bound of network expressivity and analyze its issues in characterizing stage-level expressivity, namely unbalanced representation between different stages, a huge computation cost for obtaining the lower bound, and non-sensitivity to the filters' kernel sizes. To handle the first two weaknesses, we propose a stage-aware expressivity score (SAE-Score). We first clarify some notation. Suppose that $\mathcal{N}$ has a stem and four stages $\mathcal{N}_{ci}$ ($i = 2, 3, 4, 5$); for convenience, we term the stem $\mathcal{N}_{c1}$. The input dimension of $\mathcal{N}$ is $h_0 \times w_0 \times d_0$. For $1 \le i \le 5$, $\mathcal{N}_{ci}$ has $L_{ci}$ hidden layers and input dimension $h^{ci}_0 \times w^{ci}_0 \times d^{ci}_0$. For $1 \le l \le L_{ci}$, the $l$-th hidden layer $\mathcal{N}^l_{ci}$ of stage $ci$ has dimension $h^{ci}_l \times w^{ci}_l \times d^{ci}_l$, and the corresponding $d^{ci}_l$ filters have dimension $f^{ci}_l \times f^{ci}_l \times d^{ci}_{l-1}$ with stride $s^{ci}_l$. By Theorem 2, the maximal number of linear regions of $\mathcal{N}_{ci}$ is at least
$$NR^{\mathcal{N}_{ci}}_{\max} \ge R_{\mathcal{N}_{ci}} = NR^{\mathcal{N}'_{ci}}_{\max} \prod_{j=1}^{i-1} \prod_{l=1}^{L_{cj}} \left\lfloor \frac{d_l}{d_0} \right\rfloor^{h^{cj}_l \times w^{cj}_l \times d_0} \times \prod_{n=1}^{L_{ci}-1} \left\lfloor \frac{d_n}{d_0} \right\rfloor^{h^{ci}_n \times w^{ci}_n \times d_0} \quad (3)$$
As discussed in Xiong et al. (2020), the lower bound $R_{\mathcal{N}_{ci}}$ on the maximal number of linear regions of $\mathcal{N}_{ci}$ can represent the expressivity of $\mathcal{N}_{ci}$. However, two distinct issues arise when using $R_{\mathcal{N}_{ci}}$ to measure stage-level expressivity directly.
(i) As shown in Table 1, we give an illustrative case (Example 1) to unveil that directly using $R_{\mathcal{N}_{ci}}$ ($i = 1, 2, 3, 4, 5$) to represent the expressivity of different stages incurs an extremely unbalanced representation (i.e., the lower bounds of the maximal number of linear regions at different stages exhibit a huge gap). The reason is that the proof of Theorem 2 is based on a chain calculation, so the lower bound of a deeper stage grows exponentially relative to that of a shallower one. Consequently, when we take a linear combination of $R_{\mathcal{N}_{ci}}$ ($i = 1, \ldots, 5$) as a proxy to rank a set of candidate backbone architectures, it can no longer reflect the expressivity of the shallow stages. (ii) The computation cost of $NR^{\mathcal{N}'_{ci}}_{\max}$ is huge, since obtaining the elements of $K_{\mathcal{N}'_{ci}}$ requires a crude brute-force enumeration. Example 1. Let $\mathcal{N}$ be a five-layer ReLU CNN whose input dimension is $1 \times 1 \times 1$. For $1 \le l \le 5$, the $l$-th hidden layer has $2^l$ filters with dimension $1 \times 1 \times 2^{l-1}$ and stride 2, and these five layers correspond to $\mathcal{N}_{c1}, \mathcal{N}_{c2}, \mathcal{N}_{c3}, \mathcal{N}_{c4}, \mathcal{N}_{c5}$. According to Eq. 3, the values of $NR^{\mathcal{N}'_{ci}}_{\max}$ and $R_{\mathcal{N}_{ci}}$ for $1 \le i \le 5$ are reported in Table 1.
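The product lower bound of Eq. 3 can be evaluated in closed form, and working in log space avoids the overflow that the raw product causes for real backbones. A rough sketch with made-up layer dimensions (the layer encoding is our own, not the paper's):

```python
import math

def stage_lower_bound_log(layers, d0):
    """Log of the product term in Eq. 3 for one stage:
    sum over layers of (h_l * w_l * d0) * log(floor(d_l / d0)).
    `layers` is a list of (h_l, w_l, d_l) output dimensions; we work in
    log space because the raw product overflows quickly in practice."""
    log_val = 0.0
    for h, w, d in layers:
        ratio = d // d0          # floor(d_l / d0); Theorem 2 assumes d_l >= d0
        log_val += h * w * d0 * math.log(ratio)
    return log_val

# toy stage: two 4x4 feature maps with 8 and 16 channels, d0 = 2
print(stage_lower_bound_log([(4, 4, 8), (4, 4, 16)], d0=2))
```

Even for this tiny example the raw bound is astronomically large, which previews why the raw $R_{\mathcal{N}_{ci}}$ values are so unbalanced across stages.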

Table 1: the values of $NR^{\mathcal{N}'_{ci}}_{\max}$ and $R_{\mathcal{N}_{ci}}$ in Example 1 for $i = 1, \ldots, 5$.

To alleviate this unbalanced representation across stages and the crude computation of $NR^{\mathcal{N}'_{ci}}_{\max}$, we propose a novel stage-aware expressivity score to measure the stage-wise expressivity from $\mathcal{N}_{c1}$ to $\mathcal{N}_{c5}$:
$$S^{sae}_{\mathcal{N}_{ci}} = \log\Big(h^{ci}_{L_{ci}} \times w^{ci}_{L_{ci}} \times d_0 \prod_{n=1}^{L_{ci}} \left\lfloor \frac{d^{ci}_n}{d_0} \right\rfloor^{h^{ci}_n \times w^{ci}_n \times d_0}\Big) \quad (4)$$
$$= \log(h^{ci}_{L_{ci}} \times w^{ci}_{L_{ci}} \times d_0) + \sum_{n=1}^{L_{ci}} (h^{ci}_n \times w^{ci}_n \times d_0) \log \left\lfloor \frac{d^{ci}_n}{d_0} \right\rfloor \quad (5)$$
The design of our SAE-Score is based on $R_{\mathcal{N}_{ci}}$ with the two following novel modifications. (i) Removing the stage-irrelevant term from $R_{\mathcal{N}_{ci}}$. We first review the derivation of $\prod_{j=1}^{i-1} \prod_{l=1}^{L_{cj}} \lfloor d_l / d_0 \rfloor^{h^{cj}_l \times w^{cj}_l \times d_0}$, which expresses that the layers from $c1$ to $ck$ ($k = i - 1$) map that many distinct unit hypercubes in $[0, 1]^{h_0 \times w_0 \times d_0}$ into the same $[0, 1]^{h^{ck}_{L_{ck}} \times w^{ck}_{L_{ck}} \times d_0}$. Informally speaking, the value of this product depends only on the topology of the preceding stages, while the role of $\mathcal{N}_{ci}$ is reflected by the remaining term of $R_{\mathcal{N}_{ci}}$. Thus, we can regard the product as a stage-irrelevant term. (ii) Re-combination trick: adding a fully-connected layer with only one hidden neuron at the end of $\mathcal{N}_{ci}$ (this step only involves the calculation of $S_{\mathcal{N}_{ci}}$). Based on this, we derive the exact formula of the SAE-Score in Eq. 4. The term $\prod_{n=1}^{L_{ci}} \lfloor d^{ci}_n / d_0 \rfloor^{h^{ci}_n \times w^{ci}_n \times d_0}$ in Eq. 4 expresses that stage $ci$ maps that many distinct unit hypercubes in $[0, 1]^{h_0 \times w_0 \times d_0}$ into the same hypercube $[0, 1]^{h^{ci}_{L_{ci}} \times w^{ci}_{L_{ci}} \times d_0}$. Then, by Theorem 3, the newly added layer can divide the single hypercube $[0, 1]^{h^{ci}_{L_{ci}} \times w^{ci}_{L_{ci}} \times d_0}$ into $h^{ci}_{L_{ci}} \times w^{ci}_{L_{ci}} \times d_0$ linear regions. Finally, the SAE-Score is formed by multiplying $h^{ci}_{L_{ci}} \times w^{ci}_{L_{ci}} \times d_0$ and $\prod_{n=1}^{L_{ci}} \lfloor d^{ci}_n / d_0 \rfloor^{h^{ci}_n \times w^{ci}_n \times d_0}$. Thus, compared to $R_{\mathcal{N}_{ci}}$, our SAE-Score has two advantages.
(i) The representation across different stages is more balanced, because we remove the stage-irrelevant term, whose value increases exponentially with the stage index, as demonstrated in Example 1. (ii) It eliminates the crude computation of $NR^{\mathcal{N}'_{ci}}_{\max}$.

Theorem 3 (Proposition 2 from Pascanu et al. (2013)) Let $\mathcal{N}$ be a one-layer ReLU NN with $n_0$ input neurons and $n_1$ hidden neurons. Then the maximal number of linear regions of $\mathcal{N}$ is equal to $\sum_{i=0}^{n_0} \binom{n_1}{i}$.

By taking these advantages of the SAE-Score, we can efficiently measure the expressivity of the different stages. However, the SAE-Score is not sensitive to the structure of the filters (e.g., kernel size and stride), so some trivial architectures may be searched during the NAS period, i.e., the searched architecture may contain only $3 \times 3$ convolution layers. To solve this, we further propose a filter-sensitivity score to rescue the SAE-Score:
$$S^{fs}_{\mathcal{N}_{ci}} = \sum_{l=1}^{L_{ci}} NR^{\mathcal{N}'^{l}_{ci}}_{\max} = \sum_{l=1}^{L_{ci}} \ \sum_{(t_{i,j})_{(i,j) \in I_{\mathcal{N}'^{l}_{ci}}} \in K_{\mathcal{N}'^{l}_{ci}}} \ \prod_{(i,j) \in I_{\mathcal{N}'^{l}_{ci}}} \binom{d'}{t_{i,j}} \quad (6)$$
where $I_{\mathcal{N}'^{l}_{ci}} = \{(i, j) : 1 \le i \le h', 1 \le j \le w'\}$ and $S_{i,j} = \{(a + (i-1)s^{ci}_l, b + (j-1)s^{ci}_l, c) : 1 \le a \le f^{ci}_l, 1 \le b \le f^{ci}_l, 1 \le c \le d'\}$ for each $(i, j) \in I_{\mathcal{N}'^{l}_{ci}}$, with $K_{\mathcal{N}'^{l}_{ci}} := \{(t_{i,j})_{(i,j) \in I_{\mathcal{N}'^{l}_{ci}}} : t_{i,j} \in \mathbb{N}, \ \sum_{(i,j) \in J} t_{i,j} \le \# \cup_{(i,j) \in J} S_{i,j} \ \forall J \subseteq I_{\mathcal{N}'^{l}_{ci}}\}$. Here $\mathcal{N}'^{l}_{ci}$ is a one-layer ReLU CNN and $d'$ is its number of filters, which is set to 7. The height, width, and stride of the $d'$ filters of $\mathcal{N}'^{l}_{ci}$ are the same as those of $\mathcal{N}^{l}_{ci}$. For $1 \le l \le L_{ci}$, the input dimension of $\mathcal{N}'^{l}_{ci}$ is $h' \times w' \times d^{ci}_{l-1}$; $h'$ and $w'$ are both set to 1 in this paper. By Theorem 1, $NR^{\mathcal{N}'^{l}_{ci}}_{\max}$ equals the expectation of the number $NR^{\mathcal{N}'^{l}_{ci}}_{\theta}$ of linear regions of $\mathcal{N}'^{l}_{ci}$ as $\theta$ ranges over $\mathbb{R}^{\#\text{weights}+\#\text{bias}}$.
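Theorem 3 reduces to a simple binomial sum, which is part of what makes these scores cheap to evaluate; a minimal sketch:

```python
from math import comb

def max_regions_one_layer_nn(n0: int, n1: int) -> int:
    """Maximal number of linear regions of a one-layer ReLU NN with n0
    input neurons and n1 hidden neurons (Theorem 3): sum_{i=0}^{n0} C(n1, i)."""
    return sum(comb(n1, i) for i in range(n0 + 1))

# 3 hidden neurons on a 2-D input: C(3,0) + C(3,1) + C(3,2) = 1 + 3 + 3 = 7
print(max_regions_one_layer_nn(2, 3))  # -> 7
```

When n1 <= n0 the sum saturates at 2**n1, i.e., every ReLU activation pattern is realizable.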
Considering that the differences between $\mathcal{N}'^{l}_{ci}$ and $\mathcal{N}^{l}_{ci}$ lie only in the input dimension and in the depth and number of filters, we can be confident that $NR^{\mathcal{N}'^{l}_{ci}}_{\max}$ and $NR^{\mathcal{N}^{l}_{ci}}_{\max}$ play the same role in measuring how the expressivity depends on the height, width, and stride of the filters. Note that we use $NR^{\mathcal{N}'^{l}_{ci}}_{\max}$ instead of $NR^{\mathcal{N}^{l}_{ci}}_{\max}$ because the former is extremely quick and easy to calculate, making Theorem 1 applicable during the search phase. By integrating the expressivity sensitivity of each layer, $S^{fs}_{\mathcal{N}_{ci}}$ measures the sensitivity of the expressivity of $\mathcal{N}_{ci}$ to the filter structure. Finally, we propose a stage-aware ranking score that enjoys the advantages of the stage-aware expressivity score and the filter-sensitivity score simultaneously:
$$S^{sar}_{\mathcal{N}_{ci}} = S^{sae}_{\mathcal{N}_{ci}} + \alpha \, S^{fs}_{\mathcal{N}_{ci}} \quad (7)$$
where $\alpha$ controls the importance of $S^{fs}_{\mathcal{N}_{ci}}$. Based on our experimental results, we set $\alpha$ to 0.25 when searching DDSAR-500M models, while for the DDSAR-2.5G, DDSAR-10G, and DDSAR-34G models we need to find a suitable value of $\alpha$ and add some constraints to the search space; otherwise, trivial architectures may be searched.
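Eq. 5 makes the SAE-Score a cheap closed form, and the final SAR-Score is just its sum with the weighted filter-sensitivity term. A hedged sketch (the layer encoding is our own, and the Eq. 6 term is passed in as a precomputed value rather than implemented):

```python
import math

def sae_score(stage_layers, d0):
    """SAE-Score of one stage (Eq. 5): log(h_L * w_L * d0) plus
    sum over layers of (h_n * w_n * d0) * log(floor(d_n / d0)).
    stage_layers: ordered (h_n, w_n, d_n) output dims; requires d_n >= d0."""
    h_last, w_last, _ = stage_layers[-1]
    score = math.log(h_last * w_last * d0)
    for h, w, d in stage_layers:
        score += (h * w * d0) * math.log(d // d0)
    return score

def sar_score(stage_layers, d0, fs_score, alpha=0.25):
    """SAR-Score (Eq. 7) = SAE-Score + alpha * filter-sensitivity score."""
    return sae_score(stage_layers, d0) + alpha * fs_score

# a deeper stage scores higher, but only polynomially in the log domain
shallow = sae_score([(8, 8, 4)], d0=2)
deep = sae_score([(8, 8, 4), (4, 4, 8)], d0=2)
assert deep > shallow
```

The log-domain form is what keeps stage scores comparable: adding a layer adds a term instead of multiplying in an exponentially large factor.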

4.2. DISTRIBUTION-DEPENDENT STAGE-AWARE RANKING SCORE

We have presented how to measure stage-level expressivity with the stage-aware ranking score. For the sake of the subsequent NAS procedure, we further require a unified proxy that measures the effectiveness of the overall backbone architecture. To this end, we propose the distribution-dependent stage-aware ranking score, termed DDSAR-Score:
$$F_{\mathcal{N}} := \lambda_1 S^{sar}_{\mathcal{N}_{c1}} + \lambda_2 S^{sar}_{\mathcal{N}_{c2}} + \cdots + \lambda_5 S^{sar}_{\mathcal{N}_{c5}} \quad (8)$$
The weights $\lambda = (\lambda_2, \lambda_3, \lambda_4, \lambda_5)$ are calculated in two steps: 1) take a dataset with ground-truth annotations and a ResNet-50 backbone; 2) compute the ratios $(\lambda_2, \lambda_3, \lambda_4, \lambda_5)$ of all ground-truths matched on $\mathcal{N}_{c2}$, $\mathcal{N}_{c3}$, $\mathcal{N}_{c4}$, and $\mathcal{N}_{c5}$, respectively. Since anchors are not tiled on $\mathcal{N}_{c1}$, the weight $\lambda_1$ is set to 0.2 following the uniform-allocation opinion. The motivation behind our DDSAR-Score is that the positive-anchor distribution can guide the computation allocation from $\mathcal{N}_{c2}$ to $\mathcal{N}_{c5}$ Guo et al. (2021); Liu et al. (2022), indicating a positive correlation between the positive-anchor distribution and the stage-level architecture. We therefore adopt the above steps to determine the weights $\lambda$ under the guidance of the ground-truths distribution.

Algorithm 1 Evolutionary Architecture Search
Require: search space $A$, inference budget $B$, max iterations $T$, population size $P$.
Ensure: the architecture with the highest DDSAR-Score.
1: $\mathcal{P}$ := Initialize_population($P$, $B$);
2: for $t = 1, 2, \ldots, T$ do
3:   Randomly select a network architecture $\mathcal{N}$ from $\mathcal{P}$;
4:   $\mathcal{N}_m$ = Mutation($\mathcal{N}$, $A$);
5:   if $\mathcal{N}_m$ does not exceed the inference budget then
6:     Calculate the DDSAR-Score $F_{\mathcal{N}_m}$ by Eq. 8;
7:     $\mathcal{P} = \mathcal{P} \cup \{\mathcal{N}_m\}$;
8:   end if
9:   Remove the network with the smallest DDSAR-Score if the size of $\mathcal{P}$ exceeds the population size $P$.
10: end for
11: Return the architecture with the highest DDSAR-Score in $\mathcal{P}$.
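The two-step weight computation amounts to normalizing the matched ground-truth counts into ratios and taking a weighted sum of per-stage SAR-Scores. A minimal sketch (the counts and scores below are made-up illustration values, not measured ones):

```python
def ddsar_score(stage_scores, matched_gt_counts, lambda1=0.2):
    """DDSAR-Score (Eq. 8): lambda_1 * S_c1 + sum_i lambda_i * S_ci,
    where lambda_2..lambda_5 are the ratios of ground-truths matched on
    stages c2..c5 and lambda_1 is fixed (no anchors are tiled on c1).
    stage_scores: SAR-Scores [S_c1, ..., S_c5];
    matched_gt_counts: ground-truths matched on [c2, ..., c5]."""
    total = sum(matched_gt_counts)
    lambdas = [lambda1] + [c / total for c in matched_gt_counts]
    return sum(l * s for l, s in zip(lambdas, stage_scores))

# hypothetical: most faces match on the shallow stages c2/c3
# 0.2*3 + 0.5*5 + 0.3*4 + 0.15*2 + 0.05*1 = 4.65
print(ddsar_score([3.0, 5.0, 4.0, 2.0, 1.0], [500, 300, 150, 50]))
```

Because the lambdas come from the dataset's ground-truth distribution, the proxy rewards putting capacity in the stages that actually detect most faces.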

4.4. EVOLUTIONARY ARCHITECTURE SEARCH

In the previous subsections, we presented a novel proxy (DDSAR-Score) to measure the expressivity of the backbone. The subsequent NAS process can then be formulated as
$$a^{*} = \arg\max_{a \in A} F_a \quad (9)$$
where $A$ is the pre-defined search space. To solve the architecture search in Eq. 9, we directly adopt the evolutionary architecture search of Guo et al. (2019). Algorithm 1 describes the detailed searching process. We first construct the search space as illustrated in the previous subsection. Then, as described in line 1, we initialize the population $\mathcal{P}$ according to the inference budget $B$ and the population size. At each iteration $t$, we randomly select a network architecture $\mathcal{N}$ from $\mathcal{P}$ and mutate it to obtain a child architecture $\mathcal{N}_m$: a randomly sampled block of $\mathcal{N}$ is mutated to produce the new candidate. If the inference cost of $\mathcal{N}_m$ is below the inference budget, we add it to the population $\mathcal{P}$. After $T$ iterations, we return the architecture with the highest DDSAR-Score in the population $\mathcal{P}$.
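The search loop above can be sketched as a plain mutation-only evolutionary algorithm; everything below (the architecture encoding and the toy score/cost stand-ins) is our simplification, not the released implementation:

```python
import random

def evolutionary_search(score_fn, cost_fn, mutate_fn, init_pop,
                        budget, iters, pop_size, seed=0):
    """Mutation-only evolutionary search in the spirit of Algorithm 1:
    mutate a random member each step; keep the child if it fits the
    budget; evict the lowest-scoring member when the population overflows."""
    rng = random.Random(seed)
    pop = list(init_pop)
    for _ in range(iters):
        child = mutate_fn(rng.choice(pop), rng)
        if cost_fn(child) <= budget:
            pop.append(child)
            if len(pop) > pop_size:
                pop.remove(min(pop, key=score_fn))
    return max(pop, key=score_fn)

# toy problem: an architecture is a list of channel counts;
# score and cost are both the channel sum, so search fills the budget
best = evolutionary_search(
    score_fn=sum, cost_fn=sum,
    mutate_fn=lambda a, rng: [max(1, c + rng.choice([-1, 1])) for c in a],
    init_pop=[[4, 4, 4]], budget=24, iters=200, pop_size=8)
assert sum(best) <= 24
```

In the real setting, score_fn would be the DDSAR-Score and cost_fn a FLOPs counter; the loop itself needs no gradient computation, which is what keeps the search cost low.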

5. EXPERIMENTS

5.1. DATASET AND IMPLEMENTATION DETAILS

Training details of the NAS phase. To make a fair comparison with previous works, the training details of the NAS phase are consistent with Lin et al. (2021). Concretely, the population size and the number of iterations in Algorithm 1 are set to 256 and 96,000, respectively. The convolution kernel size is searched from the set {3, 5, 7}. The searched architecture contains five stages, ranging from $\mathcal{N}_{c1}$ to $\mathcal{N}_{c5}$. The inference budgets cover FLOPs under VGA resolution (640 × 480), inference time, and model parameters; in this paper, we only conduct experiments under FLOPs constraints. More optimization details, evaluation protocols, and the dataset introduction can be found in Section A.
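For a quick check of whether a candidate backbone fits a FLOPs budget at VGA resolution, the standard per-convolution estimate can be used (we count a multiply-add as 2 ops here; conventions vary, and this is our sketch rather than the paper's FLOPs counter):

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs of one k x k convolution producing an h_out x w_out x c_out
    map: each output element needs k*k*c_in multiply-adds (2 ops each)."""
    return 2 * h_out * w_out * c_out * k * k * c_in

# hypothetical stem: a 3x3, stride-2 conv on a VGA (640x480) RGB input
# producing a 320x240 map with 16 channels
print(conv_flops(240, 320, 3, 16, 3) / 1e6, "MFLOPs")
```

Summing such terms over all layers gives the cost_fn checked against the budget B in Algorithm 1.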

5.2. ABLATION STUDY

Based on the SCRFD-0.5GF detection framework, we conduct ablative experiments to evaluate the effectiveness of the backbone architecture searched with our proposed DDSAR-Score. For a fair comparison with SCRFD-0.5GF, we first compute the FLOPs (403 MFLOPs) of the SCRFD-0.5GF backbone under VGA resolution; the inference budget $B$ in Algorithm 1 is thus set to 403 MFLOPs, and the searched backbone is denoted DDSAR-0.4GF. Table 5.2 reports the results of SCRFD-0.5GF, SCRFD-MobileNet-0.5GF (MobileNet-0.25 + the SCRFD-0.5GF detection framework), and DDSAR-0.5GF (DDSAR-0.4GF + the SCRFD-0.5GF detection framework). By directly employing the searched backbone in the SCRFD-0.5GF detection framework, our method achieves a notable improvement of 2.49% AP on the challenging Wider Face hard subset. Besides, the searching cost of our method is far less than that of its counterparts (100+ GPU hours). Such a low searching cost together with higher detection performance consistently demonstrates the superiority and great potential of our proposed DDSAR-Score. Due to the page limit, the detailed structures of the searched backbones under different FLOPs budgets are given in Section A.

5.3. COMPARISON WITH STATE OF THE ART

Under different inference budgets, we search a family of FD backbones. Integrating them into the SCRFD detection framework yields the DDSAR family, including DDSAR-0.5GF, DDSAR-2.5GF, DDSAR-10GF, and DDSAR-34GF, which can be compared fairly with existing state-of-the-art methods, e.g., the SCRFD family, DSFD Li et al. (2019), RetinaFace Deng et al. (2019), TinaFace Zhu et al. (2020), and FaceBoxes Zhang et al. (2017a). As shown in Table 5.1, our DDSAR family achieves the best performance under different compute regimes, demonstrating that the representations of both manually designed and existing NAS-searched backbones are inferior to ours. In our opinion, considering that the manually designed backbones are derived from image recognition tasks, such significant improvements reveal the importance and effectiveness of designing backbones specifically for face detection. Compared with the SCRFD family, our DDSAR-Score consistently takes the stage-level expressivity and the prior ground-truths distribution into account, so FD-friendly backbones can be searched.

6. CONCLUSION

In this paper, we aim to employ NAS to search for FD-friendly backbones. First, we observe that off-the-shelf NAS technology fails to consider stage-wise detection ability, making the searched backbones sub-optimal for face detection. Second, we propose a stage-aware expressivity score to characterize stage-level detection ability explicitly. Third, we further propose the DDSAR-Score, which linearly combines the stage expressivities (SAR-Scores) according to the prior ground-truths distribution. Extensive experiments on the authoritative and challenging Wider Face dataset demonstrate the superiority of our approach.











4.3. NETWORK SEARCH SPACE

Following previous works Lin et al. (2021); He et al. (2016); Radosavovic et al. (2020); Sandler et al. (2018), the search space of the backbone architecture contains three different types of blocks: residual blocks, bottleneck blocks, and MobileNet blocks Sandler et al. (2018). The depth-wise expansion ratio is searched in the set {1, 2, 4, 6}. As mentioned above, the DDSAR-Score is defined for a ReLU CNN; we therefore remove some irrelevant layers (i.e., Batch Normalization and residual links) when computing the DDSAR-Score.
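The search space above (block type, kernel size in {3, 5, 7}, expansion ratio in {1, 2, 4, 6}) can be encoded as a list of per-block choices, with mutation resampling one block as in Algorithm 1. A hypothetical encoding (ours, not the released one):

```python
import random

BLOCK_TYPES = ["residual", "bottleneck", "mobilenet"]
KERNEL_SIZES = [3, 5, 7]
EXPANSION_RATIOS = [1, 2, 4, 6]

def random_block(rng):
    """Sample one block's choices uniformly from the search space."""
    return {"type": rng.choice(BLOCK_TYPES),
            "kernel": rng.choice(KERNEL_SIZES),
            "expansion": rng.choice(EXPANSION_RATIOS)}

def mutate(arch, rng):
    """Resample one randomly chosen block, leaving the rest untouched."""
    child = [dict(b) for b in arch]
    child[rng.randrange(len(child))] = random_block(rng)
    return child

rng = random.Random(0)
arch = [random_block(rng) for _ in range(5)]   # one block per stage c1..c5
child = mutate(arch, rng)
assert sum(a != b for a, b in zip(arch, child)) <= 1
```

A per-stage depth (number of blocks) would normally be part of the encoding too; it is omitted here for brevity.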

Table 5.1: Results of the state-of-the-art face detection methods on the Wider Face validation dataset. * denotes that the result is obtained from the SCRFD open-source code.

Table 5.2: Ablation studies for the DDSAR-Score on the Wider Face validation dataset.

AVAILABILITY

The code is available at https://github.com/ly19965/EasyFace/tree/master/face_project
