LEARNING UNIFIED REPRESENTATIONS FOR MULTI-RESOLUTION FACE RECOGNITION

Abstract

In this work, we propose the Branch-to-Trunk network (BTNet), a novel representation learning method for multi-resolution face recognition. It consists of a trunk network (TNet), namely a unified encoder, and multiple branch networks (BNets), namely resolution adapters. Depending on the input, a resolution-specific BNet is used, and its outputs are implanted as feature maps into the feature pyramid of TNet at the layer with the same resolution. The discriminability of tiny faces is significantly improved, as the interpolation error introduced by rescaling the inputs, especially by up-sampling, is mitigated. With branch distillation and backward-compatible training, BTNet transfers discriminative high-resolution information to multiple branches while guaranteeing representation compatibility. Our experiments demonstrate strong performance on face recognition benchmarks, for both multi-resolution face verification and face identification, with much lower computation and parameter storage. We establish a new state-of-the-art on the challenging QMUL-SurvFace 1:N face identification task.

1. INTRODUCTION

Machine learning has advanced tremendously, driven by deep learning methods, but is still severely challenged by various data specifications, such as data type, structure, scale, and size. For instance, face recognition (FR) is a well-established deep learning task, yet performance degrades dramatically when the testing domain differs from the training one, influenced by factors of variance such as resolution, illumination, and occlusion. Most face recognition methods use deep neural networks (DNNs) to map each image to a point embedding in a common metric space. The dissimilarity of images can then be calculated with various distance metrics (e.g., cosine similarity, Euclidean distance) for face recognition tasks. Recent advances in margin-based losses (e.g., ArcFace Deng et al. (2019a), MV-Arc-Softmax Wang et al. (2020c), CurricularFace Huang et al. (2020)) have enhanced the discriminability of the metric space, with small intra-identity distances and large inter-identity distances. However, a lack of variation in the training data still leads to poor generalizability.

Various methods are used to mitigate this issue. A model can adapt to factors of variance through dataset augmentation, but the resulting large discrepancy in data distribution can weaken the model's ability to extract discriminative features at the same data scale and model capacity (see Section 4.3). Fine-tuning is widely used to transfer large pretrained models to new domains with different data specifications; however, this strategy requires storing and deploying a separate copy of the backbone parameters for every single new domain, which is expensive and often infeasible.

The resolutions of face images in reality may be far beyond the scope covered by the model. Since small feature maps with a fixed spatial extent (e.g., 7 × 7) are mapped to an embedding with a predefined dimension (e.g., 128-d or 512-d) by a fully connected (fc) layer, input images need to be rescaled to a canonical spatial size (e.g., 112 × 112) before being fed into the network. However, up-sampling low-resolution (LR) images introduces interpolation error (see Section 3.1), deteriorating the recognizable images that contain enough clues to identify the subject. Even though super-resolution methods (Zhu et al. (2016); Grm et al. (2020); Wang et al. (2016); Cheng et al. (2018a); Yin et al. (2020); Singh et al. (2019); Rai et al. (2020)) are widely used to reconstruct faces with good visual quality, they inevitably introduce feature information of other identities when reconstructing high-resolution (HR) faces. This may lead to erroneous identity-specific features, which are detrimental to risk-controlled face recognition.

Empirically, we can partition inputs by resolution distribution and process them with multiple models to achieve high accuracy and efficiency. However, the multi-model fashion cannot be applied directly to cross-resolution recognition, as representation compatibility among models needs to be guaranteed (Shen et al. (2020); Budnik & Avrithis (2021); Wang et al. (2020a); Meng et al. (2021); Duggal et al. (2021)). To improve discriminability while ensuring the compatibility of the metric space for multi-resolution face representation, we learn a "unified" representation with a partially coupled Branch-to-Trunk Network (BTNet), composed of multiple independent branch networks (BNets) and a shared trunk network (TNet). A resolution-specific BNet is used for a given image, and its outputs are implanted as feature maps into the feature pyramid of TNet at the layer with the same resolution. Furthermore, we find that multi-resolution training helps build a strong and robust TNet, and that backward-compatible training (BCT) Shen et al. (2020) improves representation compatibility during the training of BTNet.
To ameliorate the discriminability of tiny faces, we propose branch distillation in intermediate layers, utilizing information extracted from HR images to help resolution-specific branches extract discriminative features. Our method is simple and efficient, and can serve as a general framework that is easily applied to existing networks to improve their robustness to image resolution. Multi-resolution face recognition has been dominated by super-resolution and projection methods; to the best of our knowledge, ours is the first attempt to decouple the information flow conditioned on the input resolution, breaking the convention of up-sampling the inputs. Meanwhile, BTNet reduces the number of FLOPs by operating on inputs without up-sampling, and reduces per-resolution storage cost by storing only the learned branches and resolution-aware BNs Zhu et al. (2021) while re-using a single copy of the trunk model. We demonstrate that our method performs comparably in various open-set face recognition tasks (1:1 face verification and 1:N face identification), while meaningfully reducing redundant computation and parameter storage. On the challenging QMUL-SurvFace 1:N face identification task Cheng et al. (2018b), we establish a new state-of-the-art by outperforming prior models.

In brief, our work can be summarized as follows: (1) What is our goal? Matching images with arbitrary resolutions (i.e., high-resolution, cross-resolution, and low-resolution) effectively and efficiently, which is quite different from the traditional face recognition task. (2) What is the core idea of our method? Building unified (i.e., compatible and discriminative) representations for multi-resolution images without introducing erroneous information. (3) How do we achieve this goal? Table 1 shows that we ensure compatibility and discriminability from three aspects: input preprocessing, network structure, and training strategy.

2. RELATED WORK

Compatible Representation Learning: The task of compatible representation learning aims at encoding features that are interoperable with features extracted by other models. Shen et al. (2020) first formulated the problem of backward-compatible learning (BCT) and proposed to utilize the old classifier for compatible feature learning. Since the multi-model fashion benefits representation learning with lower computation, our idea of cross-resolution representation learning can be modeled similarly to cross-model compatibility.

3. LEARNING SPECIFIC-SHARED FEATURE TRANSFER

Instead of rescaling the inputs to a canonical size, we build multiple resolution-specific branches (BNets), which map inputs to intermediate features at the same resolution, and a resolution-shared trunk (TNet), which maps feature maps of different resolutions to a high-dimensional embedding. We gain several important properties by doing so: (1) Processing inputs at their original resolution avoids the error inevitably introduced by up-sampling and the information loss caused by down-sampling, thus preserving the discriminability of visual information at different resolutions. (2) Information streams of different resolutions are encoded uniformly, enabling representation compatibility, which is particularly beneficial to open-set face recognition, considering that a compatible metric space is the prerequisite for computing similarity. (3) It also effectively reduces the computation for LR images by allocating computational resources conditioned on the input resolution.

3.1. UP-SAMPLING ERROR ANALYSIS

Figure 1 illustrates the experimental estimation of the interpolation error, whose upper bound increases as the image resolution declines (see the detailed theoretical derivation in Appendix A.1). Note that the error soars once the resolution drops below approximately 32, at which point images can be viewed as LR faces, consistent with the tiny-object criterion Torralba et al. (2008). The results show that: (1) inputs with a resolution higher than around 32 can be considered to lie in the same HR domain, since the error information introduced by up-sampling via interpolation is negligible to a certain extent; (2) inputs with a resolution lower than around 32 should be treated as lying in various LR domains, due to the high sensitivity of the error to resolution.
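The estimation behind Figure 1 can be reproduced in spirit with a few lines of NumPy. The sketch below is not the paper's measurement code: the minimal bilinear routine and the synthetic test image are our own illustrative stand-ins. It down-samples an image to r × r, up-samples it back to the canonical 112 × 112, and measures the mean absolute reconstruction error, which grows as r shrinks:

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Minimal bilinear resize for a 2-D array (align_corners=True style)."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    a = img[np.ix_(y0, x0)]; b = img[np.ix_(y0, x1)]
    c = img[np.ix_(y1, x0)]; d = img[np.ix_(y1, x1)]
    top = a * (1 - wx) + b * wx
    bot = c * (1 - wx) + d * wx
    return top * (1 - wy) + bot * wy

def upsampling_error(img, r, canonical=112):
    """Down-sample to r x r, up-sample back, and measure mean abs error."""
    low = bilinear_resize(img, r, r)
    rec = bilinear_resize(low, canonical, canonical)
    return np.abs(rec - img).mean()

# Smooth synthetic "face" signal on a 112 x 112 grid (illustrative only).
y, x = np.mgrid[0:112, 0:112] / 112.0
img = np.sin(6 * np.pi * x) * np.cos(6 * np.pi * y)

errs = {r: upsampling_error(img, r) for r in (112, 56, 28, 14, 7)}
# The error grows as the resolution r drops, soaring for tiny r.
```

Running this on real face crops, as the paper does with 100 images, would trace out the same qualitative curve as Figure 1.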

3.2. BRANCH-TO-TRUNK NETWORK

Let X ∈ R^{H×W×3} be an input RGB image, where H × W is its spatial dimension. For efficient batch training and inference, we predefine a canonical size S × S (e.g., 112 × 112 for typical face recognition models like ArcFace Deng et al. (2019a)). We build a trunk network T : R^{H×W×3} → R^{C_emb} capable of extracting discriminative information at different resolutions, where C_emb is the number of embedding channels. For every resolution r in the candidate set, we formulate a resolution-specific branch B_r : R^{r×r×3} → R^{r×r×C_r}, z_r = B_r(X_r), which maps the input image X_r to feature maps with the same resolution and expanded channels. The idea is to let our branches B focus on resolution-specific feature transfer independently. The feature maps are then coupled to the trunk network T at the feature-pyramid level with the same spatial resolution r × r, allowing a further mapping to the unified representation space by T_r : R^{r×r×C_r} → R^{C_emb}. Here, we follow the idea of "avoiding redundant up-sampling": our branches B are implemented as same-resolution mappings, i.e., each branch preserves the network architecture of T from the input to the layer with resolution r but abandons down-sampling operations (e.g., replacing stride-2 convolutions with stride 1, removing pooling layers) to keep a same-resolution flow. We name our specific-shared feature transfer network the Branch-to-Trunk Network, abbreviated "BTNet". Figure 2 visually summarizes the main ideas of BTNet.
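The branch-to-trunk wiring can be sketched structurally as follows. This is a toy NumPy stand-in, not the paper's ResNet50 instantiation: convolutions are replaced by per-pixel channel mixing, stride-2 stages by 2 × 2 average pooling, and the level widths are illustrative. What it shows is the essential mechanics: each branch B_r maps an r × r × 3 input to r × r × C_r feature maps without rescaling, which the shared trunk then consumes from the pyramid level of matching resolution, so every input resolution lands in the same C_emb-dimensional space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Trunk feature pyramid: spatial resolution -> channel width (toy numbers).
LEVELS = [(112, 8), (56, 16), (28, 32), (14, 64), (7, 128)]
C_EMB = 16

def linear(c_in, c_out):
    return rng.standard_normal((c_in, c_out)) / np.sqrt(c_in)

# Shared trunk weights, built once: one channel-mixing map per stage + a head.
TRUNK_W = {res: linear(c_prev, c)
           for (_, c_prev), (res, c) in zip(LEVELS, LEVELS[1:])}
HEAD_W = linear(LEVELS[-1][1], C_EMB)

# One independent branch per supported resolution: 3 -> C_r channels at the
# SAME spatial size (stand-in for the stride-1, same-resolution mapping).
BRANCH_W = {res: linear(3, c) for res, c in LEVELS}

def pool2x(z):                       # stand-in for a stride-2 trunk stage
    h, w, c = z.shape
    return z.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def btnet(x):
    r = x.shape[0]                   # operate at the native resolution
    z = np.maximum(x @ BRANCH_W[r], 0.0)   # B_r: r x r x 3 -> r x r x C_r
    for res, _ in LEVELS:
        if res < r:                  # T_r: remaining shared trunk stages
            z = np.maximum(pool2x(z) @ TRUNK_W[res], 0.0)
    return z.mean(axis=(0, 1)) @ HEAD_W    # pooled C_emb embedding

e_hr = btnet(rng.standard_normal((112, 112, 3)))
e_lr = btnet(rng.standard_normal((28, 28, 3)))
assert e_hr.shape == e_lr.shape == (C_EMB,)   # one unified embedding space
```

Note how the 28 × 28 input is never up-sampled: its branch output is implanted directly at the 28-resolution trunk level, and only the later (shared) stages run, which is also why LR inputs cost less computation.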

3.3. TRAINING OBJECTIVES

We now describe the training objectives. Training BTNet involves training the trunk network T such that it produces discriminative and compatible representations for multi-resolution information, and fine-tuning the branch networks B to encourage them to learn resolution-specific feature transfer, so as to improve accuracy without compromising compatibility.

Influence Loss. Following the backward-compatible training formulation of Shen et al. (2020), the classification loss can be refined as our influence loss:

L_influence = L_cls(φ_bt, κ*)    (1)

where φ_bt is the BTNet backbone (both B_r and T_r), and κ* is the classifier of the pretrained trunk T.

Branch Distillation Loss. Due to the continuity of the scale change of both the image pyramid and the feature pyramid Lindeberg (1994), we can get a qualitative sense of the similarity between images and feature maps of the same resolution (see Figure 3). Furthermore, features extracted from HR images carry richer and clearer information than those from LR images Lui et al. (2009). Motivated by these analyses, we utilize an MSE loss to encourage the branch output z_r to be similar to the corresponding feature maps z_s of the pretrained trunk network:

L_branch = (1/V) Σ_{v=1}^{V} (z_{r,v} - z_{s,v})²    (2)

where V denotes the batch size. The whole training objective is a combination of the above objectives:

L = L_influence + λ_branch · L_branch    (3)

where λ_branch is a hyper-parameter weighing the losses; we set λ_branch = 0.5 in all our experiments.

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

Baselines. •High-Resolution Trained φ_hr. Naive baseline trained with HR data. •Independently Trained φ_mm. Multi-model fashion: is it possible to achieve better results if we train a specific model for each resolution independently? Specifically, we train φ_r on data with resolution r and denote the multi-model collection as φ_mm. •Multi-Resolution Trained φ_mr. Trained with multi-resolution data, which adapts the model to resolution variance. For a comprehensive evaluation, we implemented three baselines, denoted φ_mr, φ_mr(v2), and φ_mr(v3), respectively.
Each image is down-sampled to a certain size and then up-sampled to 112 × 112. The differences are as follows: (i) φ_mr: down-sampled to a size in the candidate set {112/2^i × 112/2^i | i = 0,

4.2. EVALUATION METRICS

On the benchmarks for face verification, we use 1:1 verification accuracy as the basic metric. The rank-20 true positive identification rate (TPIR20) at varying false positive identification rates (FPIR), together with AUC, is used to report identification results on QMUL-SurvFace. For a better evaluation, we define two additional metrics to assess the relative performance gain, similar to Shen et al. (2020); Meng et al. (2021).

Cross-Resolution Gain. To assess cross-resolution representation compatibility, we define the performance gain as:

Gain_{r1&r2}(φ) = [M_{r1&r2}(φ) - M_{r1&r2}(φ_hr)] / |M_{r1&r2}(φ_mr) - M_{r1&r2}(φ_hr)|

where M_{r1&r2}(·) denotes the metric when the resolutions of the image/template pair are r1 × r1 and r2 × r2 (r1 ≠ r2), respectively. φ_mr shares the same architecture as φ_hr but is trained on multi-resolution images, and thus serves as the baseline of cross-resolution gain.

Same-Resolution Gain. In the scenario of multi-resolution face recognition, the performance of same-resolution verification/identification is also vital, besides the cross-resolution one. Therefore, we report the relative performance improvement over the base model φ_hr in the same-resolution scenario:

Gain_{r&r}(φ) = [M_{r&r}(φ) - M_{r&r}(φ_hr)] / |M_{r&r}(φ_r) - M_{r&r}(φ_hr)|

where M_{r&r}(·) denotes the metric when the resolutions of the image/template pair are both r × r. φ_r is the model from the set φ_mm = {φ_r | r = 7, 14, 28} trained on images with resolution r × r without considering cross-resolution representation compatibility, which serves as the baseline of same-resolution gain at resolution r. Note that for both metrics, we take the absolute value in the denominator, as the denominators can be negative in some test settings (detailed in Section 4.3).
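The two gain metrics share one template and differ only in the reference model used for normalisation, which a small helper makes explicit (the accuracy numbers below are made up for illustration, not taken from the paper):

```python
def gain(m_phi, m_hr, m_ref):
    """Relative gain of model phi over the HR-trained base phi_hr, normalised
    by the reference model's improvement: phi_mr for cross-resolution gain,
    phi_r for same-resolution gain (Section 4.2). The denominator takes the
    absolute value because the reference can underperform phi_hr."""
    return (m_phi - m_hr) / abs(m_ref - m_hr)

# Illustrative accuracies of phi_hr, phi_mr, phi_bt on a 112&28 pair.
acc_hr, acc_mr, acc_bt = 0.62, 0.80, 0.89
cross_gain_mr = gain(acc_mr, acc_hr, acc_mr)   # reference scores exactly 1.0
cross_gain_bt = gain(acc_bt, acc_hr, acc_mr)   # (0.27 / 0.18) = 1.5
```

A gain above 1 thus means the model improves over φ_hr by more than the corresponding reference baseline does.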

4.3.1. MULTI-RESOLUTION FACE VERIFICATION

We now conduct experiments on the proposed BTNet framework for multi-resolution identity matching. Two settings are included: (1) same-resolution matching and (2) cross-resolution matching. Table 2 compares the average performance on popular benchmarks for φ_hr, φ_mm, φ_mr, and φ_bt. When directly applied to test data with a resolution lower than the training data, φ_hr suffers a severe performance degradation. Up-sampling images via interpolation can increase the amount of data but not the amount of information; it only refines the detailed part of the image and increases the spatial resolution (size) Liu & Liu (2003), while also introducing various kinds of noise and artificial processing traces Siu & Hung (2012). Up-sampling via interpolation (typically bilinear or bicubic interpolation over 4 × 4 pixel neighborhoods), essentially a function approximation method, is bound to introduce error information (detailed in Appendix A.1), potentially confusing identity information, which is especially crucial for LR images with limited details. We observe an improvement of φ_mm in same-resolution matching, but its cross-resolution accuracy is close to chance (approximately 50%) and its cross-resolution gain is negative. Unsurprisingly, the independently trained φ_r is unaware of representation compatibility and thus not naturally suited to cross-resolution recognition. The results show that φ_mr improves both cross-resolution and same-resolution accuracy by a large margin, as it learns to adapt to resolution variance while maintaining the discriminability of multi-resolution inputs. Note that the model size and training data scale stay the same, while only the resolution distribution of the data changes for φ_mr; thus there is a marginal accuracy drop in the 112&112 matching setting. In comparison, φ_bt substantially outperforms all baselines, with a cross-resolution gain of 2.02~5.00 and a same-resolution gain of 2.45~9.13.
Importantly, thanks to the multi-resolution branches, our approach has the same cost as φ_mm, significantly lower than that of φ_hr and φ_mr (see Figure 5). Moreover, we investigate the deviation in accuracy change between different datasets and assess the robustness of the face recognition systems to image resolution. We find that our proposed approach is much more robust to image resolution than the baselines, and remains effective when more factors of variance (e.g., large pose variations or a large age gap) are included.

4.3.2. MULTI-RESOLUTION FACE IDENTIFICATION

In native scenarios, it is common to run inference on inputs whose resolutions are not strictly matched to a branch. Since a low-quality image may possess an underlying optical resolution significantly lower than its size, due to degradation caused by noise, blur, occlusion, etc. Wong et al. (2010), there exists a dislocation between the underlying optical resolution of native face images and that of a branch. To avoid introducing extra large-scale parameters for predicting image quality, we validate three heuristic selection strategies based on different resolution indicators (see Figure 8). Table 3 compares BTNet against state-of-the-art models on the QMUL-SurvFace 1:N identification benchmark. We observe that our proposed approach advances the state of the art while being more computationally efficient. We believe the performance of BTNet (max + ceil) is the highest reported so far, which is meaningful given the increased focus on unconstrained surveillance applications. We compare different training method combinations in Table 4 and find that both pretraining and BCT succeed in ensuring representation compatibility. Of the two, BCT performs better, since it imposes a stricter constraint during training. Furthermore, we observe that branch distillation is crucial for improving discriminative power by transferring high-resolution information to low-resolution branches. Loss Functions. Since the difficulty of samples varies with image resolution, we adopt CurricularFace Huang et al. (2020) as the classification loss in the original architecture, which distinguishes both the difficulty of different samples in each stage and the relative importance of easy and hard samples during different training stages.
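The heuristic branch selection can be sketched in a few lines (cf. Figure 8). The candidate branch set follows the paper's {112/2^i} resolutions, while the function itself is our illustrative reconstruction rather than the authors' code: the (W, H) pair is reduced to a single resolution indicator (max/min/average), which is then allocated to a branch by floor, nearest, or ceil:

```python
BRANCH_RESOLUTIONS = (7, 14, 28, 56, 112)   # supported branches (illustrative)

def select_branch(w, h, indicator="max", allocate="ceil"):
    """Heuristic branch selection for native images: reduce (W, H) to one
    resolution indicator, then allocate it to a supported branch."""
    ind = {"max": max(w, h), "min": min(w, h), "avg": (w + h) / 2}[indicator]
    rs = BRANCH_RESOLUTIONS
    if allocate == "floor":    # largest branch not exceeding the indicator
        return max([r for r in rs if r <= ind], default=rs[0])
    if allocate == "ceil":     # smallest branch covering the indicator
        return min([r for r in rs if r >= ind], default=rs[-1])
    return min(rs, key=lambda r: abs(r - ind))   # "near"

# e.g., a 20 x 25 surveillance crop:
select_branch(20, 25, "max", "ceil")    # -> 28
select_branch(20, 25, "min", "floor")   # -> 14
```

The "max + ceil" combination reported as best above corresponds to the most conservative choice: it routes an image to the smallest branch that still covers its larger side.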

5. ABLATION STUDY

To show that the main technical contribution of BTNet lies in its design (rather than in other components), we replace the CurricularFace loss with different loss functions as the influence loss in the original architecture. The comparison results (in Table 5) demonstrate that there is no significant difference among the different implementations of the influence loss, meaning the main performance gain is attributable to our novel design. Where should we have resolution-specific layers? We conducted an ablation to see the effects of different specific-shared layer allocation strategies. The experiment was done with different numbers of trunk layers (i.e., layers whose parameters are inherited from the pretrained trunk without updating). Figure 7 shows the results. We find that increasing the number of branch layers (i.e., resolution-specific layers) leads to better performance due to increased flexibility. The specific-shared layer allocation of BTNet achieves a better parameter/accuracy tradeoff. Since further increasing the number of trunk layers beyond BTNet's allocation does not lead to significantly better performance but increases parameter storage cost by a large margin, we use the resolution-specific layers shown in Figure 9.

6. DISCUSSION AND CONCLUSION

This paper addresses the problem of multi-resolution face recognition and provides a new scheme that operates on images conditioned on their input resolution, without large-span rescaling. The error introduced by up-sampling via interpolation is investigated and analyzed. Decoupled into branches for discriminative representation learning and coupled through the trunk for compatible representation learning, our Branch-to-Trunk Network (BTNet) achieves significant improvements on multi-resolution face verification and identification tasks. Besides, the superiority of BTNet in reducing computational cost and parameter storage is also demonstrated.
It is worth noting that our approach readily extends to recognition tasks for other classes of objects and has the potential to serve as a general network architecture for multi-resolution visual recognition. Limitations and Future Work. The dislocation between the underlying optical resolution of native face images and that of a certain branch may limit the power of the model; this may be improved by selecting the optimal processing branch in combination with image quality, rather than by image size alone. The optimal branch selection strategy is not fully investigated, though we have provided an intuitive way to select a branch for the inputs (see Figure 8). Importantly, based on the unified multi-resolution metric space, the underlying resolution of the inputs (spatial resolution integrated with quality assessment) can be utilized to quantify the reliability of the representation and contribute to risk-controlled face recognition. These will be our future research directions.

A APPENDIX

A.1 THEORETICAL DERIVATION OF UP-SAMPLING ERROR

Here, we take bilinear interpolation, a typical image interpolation method, as an example to analyze the relationship between the interpolation error and the resolution of a face image. Bilinear interpolation can be considered as a bivariate Lagrange interpolation problem with two interpolation nodes in each of the two dimensions. Let D be a unit-bounded closed region in a two-dimensional image space, and let Q1(x0, y0), Q2(x1, y0), Q3(x0, y1), Q4(x1, y1) ∈ D be four adjacent pixel points in this region. We use an interpolation polynomial P(x, y) to approximate the bivariate continuous function f(x, y) defined on D. The interpolation error can be expressed as

E(x, y) = f(x, y) - P(x, y)

which indicates the potential error information introduced into the recognition of different identities. According to Rolle's theorem, we obtain

E(x, y) = (1/4) · ∂^4 f(ξ, η) / ∂x^2 ∂y^2 · ω2(x) µ2(y)

where (ξ, η) is an interior point of D and

ω2(x) = (x - x0)(x - x1)

A.4 VISUALIZATION

To interpret the behavior of learning compatible and discriminative representations, we visualize the intermediate feature maps in Figure 10. We find that φ_hr introduces noise information, while φ_mm has more discriminative but resolution-variant feature maps. The feature maps of φ_mr tend to be smoother, diminishing the error information, but their discriminability may be limited, as high-frequency details benefit recognition Wang et al. (2020b). We also show that, through the resolution-specific feature transfer of multiple branches, φ_bt encourages the transferred features to be aligned before being fed into the trunk network at the corresponding layers. For instance, at stage 2, the feature maps of φ_bt with input resolutions 112 and 28 are more similar than those of φ_hr, φ_mm, and φ_mr.
Furthermore, more detailed information can be found in the feature maps of φ_bt with input resolution 28 compared to those of φ_mr. This inspiring phenomenon suggests that BTNet can learn compatible representations while improving discriminability in the low-resolution domain, through knowledge transferred from high-resolution visual signals.

A.5 MORE EXPERIMENTS

Multi-resolution feature aggregation is common in set-based recognition tasks, where the model needs to determine the similarity of sets (templates) instead of single images. Each set may contain images of the same identity at different resolutions. In our experiment, we rescale the original and flipped images in each set to different resolutions and aggregate their features into a template representation. Table 10(a) compares the cross-resolution results of TAR@FAR=10^-4 for 1:1 verification. The cross-resolution features of φ_hr and φ_mr are ensured to be mapped to the same vector space where the aggregation is conducted, yet we observe that φ_hr performs much better than φ_mr. One possible reason is that φ_hr has outstanding discriminability for extracting HR features, while LR features may not overly deteriorate the HR information. This phenomenon also suggests that φ_mr sacrifices discriminability in exchange for adaptability to resolution variance. We can see that φ_bt is comparable with φ_hr, demonstrating the discriminative power of BTNet for aggregating multi-resolution features. Table 10(b) compares the same-resolution results of TAR@FAR=10^-4 for 1:1 verification. When HR information is removed from the template representation (i.e., test settings 7&7, 14&14, 28&28), φ_hr suffers performance degradation as well, as the informative embedding cannot recover the lost details of the LR images Fang et al. (2020). Both φ_mm and φ_mr improve with a limited same-resolution gain, while φ_bt surpasses the baselines by a large margin while also reducing computation. In Table 11, we show the results of TPIR@FPIR=10^-1 for the 1:N identification protocol.
Similar to our results for 1:1 verification, we observe that φ_bt is comparable to or even better than φ_hr when HR information is involved, and preserves superior discriminability with limited LR information, while also being more computationally efficient. We report the detailed results on the IJB-C dataset, including TAR at different FARs (see Tables ?? and 6) and ROC curves (see Figures 11 and 12) for 1:1 verification, and TPIR at FPIR=0.01 and Top-1, Top-5, Top-10 accuracy (see Tables 8 and 9) for 1:N identification. We observe that φ_bt is comparable to, or serves as, the paradigm model (i.e., the model with the best performance) in each resolution setting, both for identity matching and feature aggregation.
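For concreteness, a common way to aggregate per-image embeddings into a template representation is L2-normalised average pooling, sketched below; the paper does not spell out its exact pooling, so treat this as an illustrative assumption rather than the authors' implementation:

```python
import numpy as np

def aggregate_template(features):
    """Average-pool L2-normalised per-image embeddings into one template
    representation, then re-normalise (a common set-based aggregation
    scheme; the paper's exact pooling may differ)."""
    f = np.stack([v / np.linalg.norm(v) for v in features])
    t = f.mean(axis=0)
    return t / np.linalg.norm(t)

def cosine(a, b):
    """Cosine similarity of two unit-norm templates."""
    return float(a @ b)

# Four multi-resolution embeddings per template (random stand-ins here; in
# practice these come from the unified BTNet metric space).
rng = np.random.default_rng(0)
temp_a = aggregate_template([rng.standard_normal(512) for _ in range(4)])
temp_b = aggregate_template([rng.standard_normal(512) for _ in range(4)])
score = cosine(temp_a, temp_b)   # template-to-template similarity
```

Such pooling only makes sense when all member embeddings live in one compatible space, which is exactly the property the unified BTNet representation is designed to guarantee.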



Figure 1: Estimated error upper bound (bilinear interpolation; averaged over 100 images) as the image resolution varies relative to resolution 112.

Figure 2: Basic ideas of the proposed BTNet. Images of a certain identity are first projected to the feature maps with the same resolution respectively (Adapt) and then projected to a unified feature representation (Encode). In this figure, feature maps with the same resolution are indicated by outlines in the same color.

Figure 3: Visual comparison of face image / feature-map pairs with different resolutions (resized to a common size here for illustration).

Figure 4: Comparison of # Params (M) between fully finetuning and φ bt .

Figure 5: Comparison of FLOPs (G) between baselines and φ bt .

1, 2, 3, 4}, each with equal probability of being chosen. (ii) φ_mr(v2): down-sampled to a size in the candidate set with unequal probabilities [0.3, 0.25, 0.2, 0.15, 0.1]. (iii) φ_mr(v3): down-sampled to a size in the candidate interval [4, 112]. Instantiation of Network Architecture. BTNet and the baselines are implemented with ResNet50 He et al. (2016), and they can easily be extended to other implementations. Dubbed φ_bt, the detailed instantiation of BTNet based on ResNet50 is illustrated in Appendix A.2. Training. The training details can be found in Appendix A.3.

Figure 6: Detailed cross-resolution face verification comparison of different methods on six benchmarks for different image resolutions. The clockwise sequence indicates 112&7,112&14,112&28 matching per-benchmark.

(2) Pretraining: initialize the backbone and classifier with the pretrained trunk network. (3) Backward-compatible training (BCT, Shen et al. (2020)): fix the parameters of the old classifier. (4) Fix-trunk: fix the parameters of the trunk subnet T_r. (5) Branch distillation: use the L2 distance to obtain the loss between the intermediate feature maps at the coupling layer of the pretrained trunk T and the branch B_r.

Figure 7: Comparison of verification accuracy and the amount of stored parameters for different specific-shared layer allocation strategies. Note that "Stage x+" indicates that layers deeper than "Stage x, Unit 1" are inherited from the pretrained trunk without updating.

Figure 8: Branch selection process. Max/min/average is used on (W, H) to obtain a resolution indicator for further allocation (floor/near/ceil) to a certain branch.

Figure 10: Visualization of intermediate feature maps for inputs with different resolutions. We show the feature maps located at output layers of BNets, denoted as stage1/2/3/4 respectively. We see our method can transfer multi-resolution visual inputs to intermediate feature maps at corresponding layers (indicated by bounding boxes of the same color) of TNet.

Figure 11: 1:1 verification ROC Curve on the IJB-C dataset for cross-resolution feature aggregation.

Figure 12: 1:1 verification ROC Curve on the IJB-C dataset for same-resolution feature aggregation.

Correspondence between our goals and methods

Comparison of different methods on six face verification benchmarks.

Comparison of different training methods for our BTNet. "Acc." denotes average 1:1 verification accuracy. "# Params." indicates the amount of parameter storage for the branch network B 14 .

Ablation study of different loss functions.



1:1 verification TAR at different FAR on the IJB-C dataset for cross-resolution feature aggregation.

TAR (%) at five FAR levels, grouped by test setting:
φ_hr: 0.69, 1.73, 12.58, 27.63, 56.81 | 9.82, 20.38, 52.57, 72.61, 90.30 | 75.67, 83.24, 94.21, 97.15, 98.74 | 89.58, 94.51, 97.57, 98.40, 99.06
φ_mm: 0.68, 1.73, 11.93, 27.48, 56.84 | 7.59, 15.61, 48.28, 71.13, 91.04 | 73.68, 85.14, 95.82, 97.65, 98.89 | 89.58, 94.51, 97.57, 98.40, 99.06
φ_mr: 0.74, 1.76, 11.11, 25.98, 54.26 | 14.21, 24.72, 60.39, 79.84, 94.35 | 78.91, 86.42, 96.04, 98.07, 99.09 | 88.48, 93.37, 97.50, 98.51, 99.23
φ_bt (Ours): 12.09, 20.70, 57.17, 79.02, 93.90 | 57.75, 70.63, 90.85, 96.06, 98.68 | 82.85, 90.32, 96.94, 98.31, 99.15 | 88.48, 93.37, 97.50, 98.51, 99.23



φ_bt (Ours): 15.55, 55.49, 67.98, 73.05 | 63.69, 86.35, 92.14, 94.01 | 86.87, 95.42, 97.06, 97.62 | 90.89, 96.44, 97.65, 98.00

Comparison of different methods on the IJB-C dataset 1:1 face verification task. "TAR" denotes TAR (%@FAR=1e-4).

Comparison of different methods on the IJB-C dataset 1: N face identification task. "TPIR" denotes TPIR (%@FPIR=0.1).




µ2(y) = (y - y0)(y - y1)    (9)

Since x1 - x0 = y1 - y0 = 1 for adjacent pixel points, the upper bounds of |ω2(x)| and |µ2(y)| are both 1/4. Thus, the error can be estimated as

|E(x, y)| ≤ (1/64) · |∂^4 f(ξ, η) / ∂x^2 ∂y^2|

where ∂^4 f(ξ, η) / ∂x^2 ∂y^2 can be approximated using the difference operator. Based on the above theoretical analysis, we can experimentally study the relationship between the estimated up-sampling error and the image resolution.
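The difference-operator approximation mentioned above can be sanity-checked numerically. The sketch below (our illustration, not the authors' code) composes a second central difference in x with one in y to estimate ∂^4 f / ∂x^2 ∂y^2, and compares it against the analytic mixed derivative of f(x, y) = sin(x)sin(y), which is again sin(x)sin(y):

```python
import numpy as np

def mixed_fourth_diff(f, x, y, h=1e-2):
    """Central-difference estimate of d^4 f / dx^2 dy^2 at (x, y): a second
    difference in x composed with a second difference in y, each O(h^2)."""
    def d2x(g, x, y):
        return (g(x + h, y) - 2 * g(x, y) + g(x - h, y)) / h**2
    return (d2x(f, x, y + h) - 2 * d2x(f, x, y) + d2x(f, x, y - h)) / h**2

f = lambda x, y: np.sin(x) * np.sin(y)
est = mixed_fourth_diff(f, 0.7, 0.4)
true = np.sin(0.7) * np.sin(0.4)   # analytic d^4/dx^2dy^2 of sin(x)sin(y)
# est agrees with the analytic value to within the O(h^2) truncation error.
```

On a pixel grid, the same operator applied to image intensities yields the factor that scales the error bound above.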

A.2 INSTANTIATION OF BTNET-RES50

We provide the detailed architecture of BTNet-res50 (φ_bt), an instantiation of the BTNet framework based on ResNet50 He et al. (2016). Our method can be easily implemented by refining any network with a top-down hierarchical representation structure.

