MUSIC-TO-TEXT SYNAESTHESIA: GENERATING DESCRIPTIVE TEXT FROM MUSIC RECORDINGS

Abstract

In this paper, we consider a novel research problem: music-to-text synaesthesia. Different from the classical music tagging problem, which classifies a music recording into pre-defined categories, music-to-text synaesthesia aims to generate descriptive texts from music recordings for further understanding. Although this is a new and interesting application for the machine learning community, to the best of our knowledge, the existing music-related datasets do not contain semantic descriptions of music recordings and thus cannot serve the music-to-text synaesthesia task. In light of this, we collect a new dataset that contains 1,955 aligned pairs of classical music recordings and text descriptions. Based on this, we build a computational model to generate sentences that describe the content of a music recording. To tackle the highly non-discriminative nature of classical music, we design a group topology-preservation loss in our computational model, which takes multiple samples as a group reference and preserves the relative topology among different samples. Extensive experimental results qualitatively and quantitatively demonstrate the effectiveness of our proposed model over five heuristic or pretrained competitive methods and their variants on our collected dataset.
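The group topology-preservation loss is described above only at a high level. A minimal sketch of the underlying idea, assuming batch-wise cosine-distance matrices as the "relative topology" to be preserved (an illustrative choice on our part, not necessarily the paper's exact formulation), could look like:

```python
import numpy as np

def group_topology_preservation_loss(music_emb, text_emb):
    """Hypothetical sketch: encourage a group (batch) of text embeddings
    to preserve the pairwise-distance structure of the corresponding
    music embeddings, rather than matching each pair in isolation.

    music_emb, text_emb: arrays of shape (B, D), row i of each
    corresponding to the same music/text pair.
    """
    # Normalize rows to unit length so distances are cosine-based.
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise cosine-distance matrices within the group: (B, B).
    dist_m = 1.0 - m @ m.T
    dist_t = 1.0 - t @ t.T
    # Penalize discrepancy between the two relative topologies.
    return np.mean((dist_m - dist_t) ** 2)
```

The key design point, compared with a per-pair matching loss, is that every sample is constrained relative to all other samples in the group, which can still provide a learning signal when individual recordings are hard to discriminate.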

1. INTRODUCTION

Multi-modal learning has drawn great attention in recent years and has advanced rapidly in diverse applications, since our physical world is naturally composed of various modalities. Visual frames in videos are matched with text captions, and these pairs have been widely used for video-language pre-training (Sun et al., 2019; Luo et al., 2021; Li et al., 2020); the Kinect employs an RGB camera and a depth sensor for action recognition and human pose estimation (Shotton et al., 2011; Carreira & Zisserman, 2017); autonomous driving cars integrate visible and invisible light via cameras, radar, and lidar for a series of driving-related tasks (Buehler et al., 2007; Torres et al., 2019); cross-modal retrieval aims to understand and match text with an existing textual repository and other modalities to meet users' queries (Nagrani et al., 2018; Suris et al., 2018; Zeng et al., 2021); language grounding learns the meaning of natural language by leveraging sensory data such as video or images (Bisk et al., 2020; Thomason et al., 2021). Besides the above studies that employ multi-modal data to jointly achieve a learning task, translating information among different modalities, also known as synaesthesia, is another crucial task in the multi-modal community, where text, with its good compatibility and presentation ability, has become the intermediary of modality interaction. Various methods for synaesthesia between text and other modalities have been studied. Speech recognition can be directly regarded as a translation between the text and audio modalities (Shen et al., 2018). Image captioning extracts high-level visual cues and translates them into a descriptive sentence about the image content, while some studies consider the inverse process of image captioning by converting a semantic text into a visual image (Huang et al., 2021; Xu et al., 2015).
Different from the existing modality translation studies, in this paper we consider a novel problem, music-to-text synaesthesia, i.e., generating descriptive texts from music recordings. Recently, there have been some pioneering attempts to build connections between music recordings and tags at the initial stage. Cai et al. (2020) formulate music auto-tagging as a captioning task and



Our code and data resources are available at https://github.com/MusicTextSynaesthesia/ MusicTextSynaesthesia.

