A BENCHMARK FOR VOICE-FACE CROSS-MODAL MATCHING AND RETRIEVAL

Abstract

Cross-modal associations between a person's voice and face can be learned algorithmically, which is useful in many audio-visual applications. The problem can be formulated as two tasks: voice-face matching and voice-face retrieval. Recently, this topic has attracted much research attention, but it is still at an early stage of development: evaluation protocols and test schemes need to be standardized, and performance metrics for different subtasks are scarce, so a benchmark for this problem needs to be established. In this paper, a baseline evaluation framework is proposed for the voice-face matching and retrieval tasks. Test confidence is analyzed, and a confidence interval for the estimated accuracy is proposed. State-of-the-art performance with high test confidence is achieved on a series of subtasks using the baseline method (called TriNet) included in this framework. The source code will be published along with the paper. The results of this study can provide a basis for future research on voice-face cross-modal learning.

1. INTRODUCTION

Studies in biology and neuroscience have shown that a person's appearance is associated with his or her voice (Smith et al., 2016a;b; Mavica & Barenholtz, 2013). Both the facial features and the voice-controlling organs of an individual are affected by hormones and genetic information (Hollien & Moore, 1960; Thornhill & Møller, 1997; Kamachi et al., 2003; Wells et al., 2013), and human beings have the ability to recognize this association. For example, when speaking on the phone, we can guess the gender and approximate age of the person on the other end of the line. When watching a TV show without sound, we can also imagine the approximate voice of the protagonist by observing his or her facial movements. With the recent advances in deep learning, face recognition models (Wen et al., 2016; Wu et al., 2018; Liu et al., 2017) and speaker recognition models (Wang et al., 2018; Li et al., 2017) have achieved extremely high precision. It is then natural to wonder whether the associations between voices and faces could be discovered algorithmically by machines. Research on this problem could benefit many applications, such as synchronizing video faces with talking voices and generating faces from voices. In recent years, much research attention (Wen et al., 2018; Horiguchi et al., 2018; Nagrani et al., 2018a; Kim et al., 2018; Nagrani et al., 2018b) has been paid to voice-face cross-modal learning, and this work has shown the feasibility of recognizing voice-face associations. The problem is generally formulated as a voice-face matching task and a voice-face retrieval task. Research on this problem is still at an early stage, and a benchmark still needs to be established. In this paper, we address this issue with the following contributions: 1) Existing methods are all evaluated on a single dataset of about 200 identities with limited tasks.
The estimated accuracy therefore has a large deviation due to the high sampling risk inherent in cross-modal learning problems. A test confidence interval is proposed to quantify the statistical significance of experimental results. 2) A solid baseline framework for voice-face matching and retrieval is also proposed. State-of-the-art performance on various voice-face matching and retrieval tasks is achieved on large-scale datasets with high test confidence.
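To make the matching formulation concrete, the following is a minimal sketch of a 1:N voice-face matching trial: given one voice embedding and N candidate face embeddings produced by modality-specific encoders, the face with the highest cosine similarity is selected. The function name and the choice of cosine similarity are illustrative assumptions, not the exact protocol of any particular prior method.

```python
import numpy as np

def match_voice_to_faces(voice_emb, face_embs):
    """1:N voice-face matching trial (illustrative).

    voice_emb: 1-D embedding of one voice clip.
    face_embs: 2-D array, one row per candidate face embedding.
    Returns the index of the face most similar to the voice
    under cosine similarity.
    """
    v = voice_emb / np.linalg.norm(voice_emb)
    f = face_embs / np.linalg.norm(face_embs, axis=1, keepdims=True)
    return int(np.argmax(f @ v))
```

In the common 1:2 setting, N = 2 and one candidate shares the speaker's identity; matching accuracy is the fraction of trials in which that candidate is selected.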
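To illustrate why test confidence matters, a matching test over n independent trials is a binomial experiment, so the uncertainty of the estimated accuracy shrinks with n. The sketch below uses the standard normal-approximation (Wald) interval for a binomial proportion as one common way to express such a confidence interval; it is an illustrative assumption, not necessarily the exact interval proposed in this paper.

```python
import math

def accuracy_confidence_interval(accuracy, n_trials, z=1.96):
    """Normal-approximation confidence interval for an accuracy
    estimated from n_trials independent test trials.

    z = 1.96 corresponds to a 95% confidence level.
    Returns (lower, upper), clipped to the valid range [0, 1].
    """
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n_trials)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)
```

For example, an accuracy of 0.84 measured on only 200 trials yields a noticeably wider interval than the same accuracy measured on 10,000 trials, which is why results reported on a small dataset of about 200 identities carry high sampling risk.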

