LEARNING UNIFIED REPRESENTATIONS FOR MULTI-RESOLUTION FACE RECOGNITION

Abstract

In this work, we propose the Branch-to-Trunk network (BTNet), a novel representation learning method for multi-resolution face recognition. It consists of a trunk network (TNet), namely a unified encoder, and multiple branch networks (BNets), namely resolution adapters. Depending on the input resolution, a resolution-specific BNet is used, and its outputs are implanted as feature maps in the feature pyramid of TNet, at a layer with the same resolution. The discriminability of tiny faces is significantly improved, as the interpolation error introduced by rescaling, especially up-sampling, is mitigated on the inputs. With branch distillation and backward-compatible training, BTNet transfers discriminative high-resolution information to multiple branches while guaranteeing representation compatibility. Our experiments demonstrate strong performance on face recognition benchmarks, both for multi-resolution face verification and face identification, with much lower computation cost and parameter storage. We establish a new state-of-the-art on the challenging QMUL-SurvFace 1:N face identification task.

1. INTRODUCTION

Machine learning has advanced tremendously, driven by deep learning methods, but is still severely challenged by various data specifications, such as data type, structure, scale, and size. For instance, face recognition (FR) is a well-established deep learning task, yet performance degrades dramatically when the testing domain differs from the training one, influenced by factors of variance such as resolution, illumination, and occlusion. Most face recognition methods map each image to a point embedding in a common metric space via deep neural networks (DNNs). The dissimilarity of images can then be computed with various distance metrics (e.g., cosine similarity, Euclidean distance) for face recognition tasks. Recent advances in margin-based losses (e.g., ArcFace Deng et al. (2019a), MV-Arc-Softmax Wang et al. (2020c), CurricularFace Huang et al. (2020), etc.) have enhanced the discriminability of the metric space, yielding small intra-identity distances and large inter-identity distances. However, the lack of variation in training data still leads to poor generalizability, and various methods have been proposed to mitigate this issue. The model can adapt to factors of variance through data augmentation, but the large discrepancy in data distribution can weaken the model's ability to extract discriminative features given the same data scale and model structure (see Section 4.3). Fine-tuning is widely used to transfer large pretrained models to new domains with different data specifications. However, this strategy requires storing and deploying a separate copy of the backbone parameters for every single new domain, which is expensive and often infeasible.
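The embedding-and-metric pipeline described above can be illustrated with a minimal NumPy sketch, in which a random linear projection stands in for a real DNN encoder; all names and the threshold value here are hypothetical, not from the paper:

```python
import numpy as np

def embed(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy stand-in for a DNN encoder: project to the metric space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings (higher = more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 128))            # hypothetical 128-d metric space
gallery = embed(rng.normal(size=512), W)   # enrolled face
probe = embed(rng.normal(size=512), W)     # query face

score = cosine_similarity(gallery, probe)
# 1:1 verification: accept if the score exceeds a tuned threshold
is_match = score > 0.3
```

A margin-based loss such as ArcFace shapes this space during training so that same-identity scores concentrate near 1 and different-identity scores stay low, which is what makes a single threshold workable.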

As is known, the resolutions of face images in the real world may lie far beyond the scope covered by the model. Since small feature maps with a fixed spatial extent (e.g., 7 × 7) are mapped to an embedding of a predefined dimension (e.g., 128-d, 512-d) by a fully connected (fc) layer, input images need to be rescaled to a canonical spatial size (e.g., 112 × 112) before being fed into the network. However, up-sampling low-resolution (LR) images introduces interpolation error (see Section 3.1), deteriorating even the recognizable ones which contain enough clues to identify the subject. Even though super-resolution methods (Zhu et al. (2016); Grm et al. (2020); Wang et al. (2016); Cheng et al. (2018a); Yin et al. (2020); Singh et al. (2019); Rai et al. (2020)) are widely used to reconstruct faces with good visual quality, they inevitably introduce feature information of other identities. To improve discriminability while ensuring the compatibility of the metric space for multi-resolution face representation, we learn the "unified" representation by a partially-coupled Branch-to-Trunk network (BTNet). It is composed of multiple independent branch networks (BNets) and a shared trunk network (TNet). A resolution-specific BNet is selected for a given image, and its outputs are implanted as feature maps in the feature pyramid of TNet, at a layer with the same resolution. Our method is simple and efficient, and can serve as a general framework that is easily applied to existing networks to improve their robustness against image resolution. Since multi-resolution face recognition is dominated by super-resolution and projection methods, to the best of our knowledge, our method is the first attempt to decouple the information flow conditioned on the input resolution, breaking the convention of up-sampling the inputs. Meanwhile, BTNet reduces the number of FLOPs by operating on inputs without up-sampling, and reduces per-resolution storage cost by storing only the learned branches and resolution-aware BNs Zhu et al. (2021), while re-using a single copy of the trunk model.
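The implant-at-matching-resolution routing can be illustrated with a deliberately simplified NumPy sketch, in which average-pooling stages stand in for the trunk's convolutional stages and the branch is an identity adapter; this is a toy illustration of the idea, not the paper's actual architecture:

```python
import numpy as np

def pool2x(x: np.ndarray) -> np.ndarray:
    """Toy trunk stage: 2x2 average pooling standing in for a conv stage."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# Hypothetical trunk: stage inputs of 112 -> 56 -> 28 -> 14, final maps of 7x7.
TRUNK_STAGES = [pool2x, pool2x, pool2x, pool2x]
STAGE_INPUT_SIZE = [112, 56, 28, 14]

def branch(x: np.ndarray) -> np.ndarray:
    """Toy resolution-specific BNet; a real BNet is a learned adapter."""
    return x

def btnet_forward(img: np.ndarray) -> np.ndarray:
    """Route the input through its branch, then implant the result into the
    trunk at the stage whose input resolution matches -- no up-sampling."""
    entry = STAGE_INPUT_SIZE.index(img.shape[0])  # matching-resolution layer
    feat = branch(img)
    for stage in TRUNK_STAGES[entry:]:            # LR inputs skip early stages
        feat = stage(feat)
    return feat.ravel()                           # 7*7 = 49-d toy "embedding"

hr = btnet_forward(np.ones((112, 112)))  # full trunk
lr = btnet_forward(np.ones((28, 28)))    # enters at the third stage
```

Both inputs land in the same 49-d space, which is why representation compatibility across branches must be enforced during training; an LR input also runs fewer trunk stages, which is where the FLOPs saving comes from.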
We demonstrate that our method performs comparably in various open-set face recognition tasks (1:1 face verification and 1:N face identification), while meaningfully reducing redundant computation cost and parameter storage. In the challenging QMUL-SurvFace 1:N face identification task Cheng et al. (2018b), we establish a new state-of-the-art by outperforming prior models. In brief, our work can be summarized as follows: (1) What is our goal? Matching images of arbitrary resolutions (i.e., high-resolution, cross-resolution, and low-resolution) effectively and efficiently, which differs substantially from the traditional face recognition task. (2) What is the core idea of our method? Building unified (i.e., compatible and discriminative) representations for multi-resolution images without introducing erroneous information. (3) How do we achieve our goal? Table 1 shows that we ensure compatibility and discriminability from three aspects: input preprocessing, network structure, and training strategy.

2. RELATED WORK

Compatible Representation Learning: The task of compatible representation learning aims at encoding features that are interoperable with features extracted by other models. Shen et al. (2020) first formulated the problem of backward-compatible training (BCT) and proposed utilizing the old classifier for compatible feature learning. Since the multi-model fashion benefits representation learning with lower computation, our idea of cross-resolution representation learning can be modeled similarly to cross-model compatibility (Shen et al. (2020); Budnik & Avrithis (2021); Wang et al. (2020a); Meng et al. (2021); Duggal et al. (2021)), i.e., as metric space alignment.

Table 1: Correspondence between our goals and methods.

Empirically, we could divide inputs by resolution distribution and learn to operate on them via multiple models to achieve high accuracy and efficiency. However, the multi-model fashion cannot be applied directly to cross-resolution recognition, as representation compatibility among the models needs to be guaranteed (Shen et al. (2020); Budnik & Avrithis (2021); Wang et al. (2020a); Meng et al. (2021); Duggal et al. (2021)).

Furthermore, we find that multi-resolution training can be beneficial to building a strong and robust TNet, and that backward-compatible training (BCT) Shen et al. (2020) can improve representation compatibility during the training of BTNet. To improve the discriminability of tiny faces, we propose branch distillation in intermediate layers, utilizing information extracted from HR images to guide the extraction of discriminative features by the resolution-specific branches.
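To make the training-strategy ingredients concrete, the following is a minimal NumPy sketch of how a combined objective with a backward-compatible term and an intermediate-layer distillation term might look; the helper names, weighting scheme, and plain cross-entropy/L2 forms are illustrative assumptions, not the paper's exact losses:

```python
import numpy as np

def softmax_ce(logits: np.ndarray, label: int) -> float:
    """Cross-entropy of a single sample against an integer class label."""
    logits = logits - logits.max()                 # numerical stability
    logp = logits - np.log(np.exp(logits).sum())
    return float(-logp[label])

def btnet_losses(f_branch, f_mid_hr, f_mid_lr, w_new, w_old, label,
                 lam_bct=1.0, lam_distill=1.0):
    """Toy combined objective:
    - classification through the new classifier,
    - a BCT-style term through the frozen old classifier (compatibility),
    - an L2 branch-distillation term pulling the LR branch's intermediate
      features toward those extracted from the HR input."""
    cls = softmax_ce(w_new @ f_branch, label)
    bct = softmax_ce(w_old @ f_branch, label)       # old metric space stays valid
    dist = np.mean((f_mid_lr - f_mid_hr) ** 2)      # intermediate-layer distillation
    return cls + lam_bct * bct + lam_distill * dist

rng = np.random.default_rng(0)
loss = btnet_losses(rng.normal(size=128),           # branch embedding
                    rng.normal(size=64),            # HR intermediate feature (teacher)
                    rng.normal(size=64),            # LR intermediate feature (student)
                    rng.normal(size=(10, 128)),     # new classifier weights
                    rng.normal(size=(10, 128)),     # frozen old classifier weights
                    label=3)
```

In this framing, the BCT term is what keeps embeddings from different branches (or model versions) comparable under one metric, while the distillation term transfers HR discriminability to the LR branches.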

