A BENCHMARK FOR VOICE-FACE CROSS-MODAL MATCHING AND RETRIEVAL

Abstract

Cross-modal associations between a person's voice and face can be learned algorithmically, which is useful in many audio and visual applications. The problem can be defined as two tasks: voice-face matching and retrieval. Recently, this topic has attracted much research attention, but it is still in its early stages of development: evaluation protocols and test schemes need to be more standardized, performance metrics for different subtasks are scarce, and a benchmark for this problem has yet to be established. In this paper, a baseline evaluation framework is proposed for voice-face matching and retrieval tasks. Test confidence is analyzed, and a confidence interval for the estimated accuracy is proposed. State-of-the-art performance with high test confidence is achieved on a series of subtasks using the baseline method (called TriNet) included in this framework. The source code will be published along with the paper. The results of this study can provide a basis for future research on voice-face cross-modal learning.

1. INTRODUCTION

Studies in biology and neuroscience have shown that a person's appearance is associated with his or her voice (Smith et al., 2016a;b; Mavica & Barenholtz, 2013). Both the facial features and voice-controlling organs of an individual are affected by hormones and genetic information (Hollien & Moore, 1960; Thornhill & Møller, 1997; Kamachi et al., 2003; Wells et al., 2013), and human beings have the ability to recognize this association. For example, when speaking on the phone, we can guess the gender and approximate age of the person on the other end of the line. When watching a TV show without sound, we can also imagine the approximate voice of the protagonist by observing his or her facial movements. With the recent advances in deep learning, face recognition models (Wen et al., 2016; Wu et al., 2018; Liu et al., 2017) and speaker recognition models (Wang et al., 2018; Li et al., 2017) have achieved extremely high precision. It is natural to wonder whether the associations between voices and faces could be discovered algorithmically by machines. Research on this problem could benefit many applications, such as synchronizing video faces with talking voices and generating faces from voices. In recent years, much research attention (Wen et al., 2018; Horiguchi et al., 2018; Nagrani et al., 2018a; Kim et al., 2018; Nagrani et al., 2018b) has been paid to voice-face cross-modal learning, and this line of work has shown the feasibility of recognizing voice-face associations. The problem is generally formulated as a voice-face matching task and a voice-face retrieval task. Research on this problem is still at an early stage, and a benchmark still needs to be established. In this paper, we address this issue with the following contributions: 1) Existing methods have all been evaluated on a single dataset of about 200 identities with limited tasks.
The estimated accuracy can therefore deviate greatly from the true accuracy, due to the high sampling risk inherent in cross-modal learning problems. A test confidence interval is proposed for qualifying the statistical significance of experimental results. 2) A solid baseline framework for voice-face matching and retrieval is proposed. State-of-the-art performance on various voice-face matching and retrieval tasks is achieved on large-scale datasets with high test confidence.

2. RELATED WORK

The aim of pair-wise loss based methods is to make the embeddings of positive pairs closer and the embeddings of negative pairs farther apart. In contrast, the aim of classification-based methods is to separate the embeddings of different classes. Of these two approaches, pair-wise loss based methods are better at distinguishing hard examples. No previous work has presented a benchmark for voice-face cross-modal learning tasks; this gap is addressed in detail as follows: 1) As for evaluation metrics, the reliability of experiments has not been addressed by previous research. Test confidence is proposed in this paper; with its guidance, reliable evaluations can be conducted. 2) As for tasks, the joint matching and joint retrieval tasks established in this paper have not been considered in previous research. Though they are direct extensions of traditional tasks, these simple extensions improve the performance of voice-face cross-modal learning dramatically. 3) As for models, the most similar work to the TriNet of this paper is Kim et al.'s method (Kim et al., 2018).
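The role of a test confidence interval can be illustrated with the standard normal-approximation (Wald) interval for a binomial proportion. This is only a minimal sketch assuming independent test trials; the function name and the choice of interval are illustrative, not necessarily the paper's exact formulation:

```python
import math

def accuracy_confidence_interval(acc, n_tests, z=1.96):
    """Normal-approximation (Wald) confidence interval for an accuracy
    `acc` estimated over `n_tests` independent trials.
    z = 1.96 corresponds to a 95% confidence level."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n_tests)
    return max(0.0, acc - half_width), min(1.0, acc + half_width)

# The interval narrows as the number of sampled test pairs grows,
# which is why small-scale evaluations carry high sampling risk.
lo_small, hi_small = accuracy_confidence_interval(0.84, 1_000)
lo_large, hi_large = accuracy_confidence_interval(0.84, 100_000)
```

With 100x more test samples, the interval width shrinks by a factor of 10, so two methods whose accuracies differ by a point or two can only be distinguished reliably at the larger test scale.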

3. TASKS AND EVALUATION

3.1 TASKS

1:2 Matching and 1:n Matching. Given an audio clip and two candidate faces (only one of which belongs to the speaker of the audio), the goal is to find the face that belongs to the speaker. The more difficult 1:n matching task is an extension of the 1:2 matching task that increases the number of candidate faces from 2 to n.
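The 1:n matching protocol can be sketched as a nearest-neighbor decision in a common embedding space. This assumes voices and faces have already been embedded; the function name, the Euclidean decision rule, and the toy vectors are illustrative:

```python
import numpy as np

def match_1_to_n(voice_emb, candidate_face_embs):
    """Return the index of the candidate face closest to the voice
    embedding in Euclidean distance; candidates are an (n, d) array."""
    dists = np.linalg.norm(candidate_face_embs - voice_emb, axis=1)
    return int(np.argmin(dists))

# Toy d=2 embeddings: the true speaker's face (index 0) lies closer
# to the voice embedding than the distractor candidates.
voice = np.array([0.9, 0.1])
faces = np.array([[1.0, 0.0],    # true speaker's face
                  [0.0, 1.0],    # distractor
                  [-1.0, 0.0]])  # distractor
```

1:2 matching is simply the n = 2 special case (e.g. `match_1_to_n(voice, faces[:2])`), with a random-guess baseline of 1/n in general.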



Table 1: Statistics of voice-face cross-modal datasets. Vox-VGG-n represents the combined dataset of VoxCeleb (Nagrani et al., 2017; Chung et al., 2018) and VGGFace (Cao et al., 2018; Parkhi et al., 2015), where n denotes the version. The number of images refers to the number of images remaining after MTCNN (Zhang et al., 2016) face detection.

Existing methods for voice-face cross-modal learning can be classified into classification-based methods and pair-wise loss based methods, as shown in Figure 1. CNN-based networks are normally used to embed the voices and faces into feature vectors. SVHF (Nagrani et al., 2018b) is a prior study on voice-face cross-modal learning that investigated the performance of a CNN-based deep network on this problem; it also presented a human baseline for the voice-face matching task. DIMNet (Wen et al., 2018) learns a common representation for faces and voices by leveraging their relationships to covariates such as gender and nationality. In pair-wise loss based methods, a pair or a triplet of samples is embedded by the voice and face networks, and contrastive loss (Hadsell et al., 2006) or triplet loss (Schroff et al., 2015) is used to supervise the learning of the embeddings. Horiguchi et al.'s method (Horiguchi et al., 2018), Pins (Nagrani et al., 2018a), and Kim et al.'s method (Kim et al., 2018) are all methods of this kind.
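As a concrete illustration of pair-wise supervision, here is a minimal NumPy sketch of the contrastive loss in the sense of Hadsell et al. (2006), assuming precomputed embeddings; the margin value is an arbitrary illustrative choice:

```python
import numpy as np

def contrastive_loss(v, f, same_identity, margin=1.0):
    """Contrastive loss on a voice embedding `v` and a face embedding `f`.
    Positive pairs (same identity) are pulled together; negative pairs
    are pushed apart until their distance exceeds `margin`."""
    d = np.linalg.norm(v - f)
    if same_identity:
        return 0.5 * d ** 2          # pull positive pair together
    return 0.5 * max(0.0, margin - d) ** 2  # push negative pair apart
```

Triplet loss replaces the independent positive/negative terms with a relative constraint over an (anchor, positive, negative) triple, which is the form used by the pair-wise methods listed above.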

Both TriNet and Kim et al.'s method (Kim et al., 2018) use the triplet loss function. The main difference is that TriNet uses L2 normalization and voice-anchored embedding learning to constrain the feature space, because it is difficult to obtain satisfactory results by training directly in a huge Euclidean space. Though L2 normalization is a common technique, it had not previously been applied to this problem. 4) As for datasets, the currently available voice-face datasets are generated from the common speakers of the VGGFace (Cao et al., 2018; Parkhi et al., 2015) face recognition dataset and the VoxCeleb (Nagrani et al., 2017; Chung et al., 2018) speaker recognition dataset. As shown in Table 1, the voice-face datasets have two versions, Vox-VGG-1 and Vox-VGG-2, which include 1,251 and 5,994 identities, respectively. To the best of our knowledge, only Vox-VGG-1 has been used in previous research. Both Vox-VGG-1 and Vox-VGG-2 are used to evaluate the proposed baseline method, TriNet.
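The combination of L2 normalization and voice-anchored triplet learning can be sketched as follows. This assumes precomputed embeddings; the margin value and the use of Euclidean distance on the unit hypersphere are illustrative assumptions, not TriNet's exact hyperparameters:

```python
import numpy as np

def l2_normalize(x):
    """Project an embedding onto the unit hypersphere, constraining
    the feature space instead of training in unbounded Euclidean space."""
    return x / np.linalg.norm(x)

def voice_anchored_triplet_loss(voice, face_pos, face_neg, margin=0.2):
    """Triplet loss with the voice embedding as the anchor: the true
    speaker's face should be closer to the voice than any other face,
    by at least `margin`.  All embeddings are L2-normalized first."""
    a = l2_normalize(voice)
    p = l2_normalize(face_pos)
    n = l2_normalize(face_neg)
    d_pos = np.linalg.norm(a - p)
    d_neg = np.linalg.norm(a - n)
    return max(0.0, d_pos - d_neg + margin)

# Example: the positive face points in the same direction as the voice
# anchor, the negative face is orthogonal, so the constraint is satisfied.
v = np.array([1.0, 0.0])
f_pos = np.array([2.0, 0.0])  # collapses to [1, 0] after normalization
f_neg = np.array([0.0, 3.0])  # collapses to [0, 1] after normalization
```

Anchoring on the voice means every triplet compares one voice against two faces, mirroring the 1:2 matching task that the embeddings are ultimately evaluated on.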

