MUSIC-TO-TEXT SYNAESTHESIA: GENERATING DESCRIPTIVE TEXT FROM MUSIC RECORDINGS

Abstract

In this paper, we consider a novel research problem: music-to-text synaesthesia. Different from the classical music tagging problem, which classifies a music recording into pre-defined categories, music-to-text synaesthesia aims to generate descriptive texts from music recordings for further understanding. Although this is a new and interesting application for the machine learning community, to the best of our knowledge, existing music-related datasets do not contain semantic descriptions of music recordings and cannot serve the music-to-text synaesthesia task. In light of this, we collect a new dataset that contains 1,955 aligned pairs of classical music recordings and text descriptions. Based on this, we build a computational model to generate sentences that describe the content of a music recording. To tackle highly non-discriminative classical music, we design a group topology-preservation loss in our computational model, which considers more samples as a group reference and preserves the relative topology among different samples. Extensive experimental results qualitatively and quantitatively demonstrate the effectiveness of our proposed model over five heuristic or pre-trained competitive methods and their variants on our collected dataset.

1. INTRODUCTION

Multi-modal learning has drawn great attention in recent years and has seen rapid development in diverse applications, since our physical world is naturally composed of various modalities. Visual frames in videos are matched with text captions, and these pairs have been widely used for video-language pre-training (Sun et al., 2019; Luo et al., 2021; Li et al., 2020); Kinect devices employ an RGB camera and a depth sensor for action recognition and human pose estimation (Shotton et al., 2011; Carreira & Zisserman, 2017); autonomous driving cars integrate visible and invisible light via camera, radar, and lidar for a series of driving-related tasks (Buehler et al., 2007; Torres et al., 2019); cross-modal retrieval aims to understand users' queries and match them against repositories of text and other modalities (Nagrani et al., 2018; Suris et al., 2018; Zeng et al., 2021); language grounding learns the meaning of natural language by leveraging sensory data such as videos or images (Bisk et al., 2020; Thomason et al., 2021). Besides the above studies that employ multi-modal data to jointly achieve a learning task, translating information among different modalities, also known as synaesthesia, is another crucial task in the multi-modal community, where text, with its good compatibility and presentation ability, has become the intermediary of modality interaction. Various methods for synaesthesia between text and other modalities have been studied. Speech recognition can be directly regarded as a translation between the text and audio modalities (Shen et al., 2018). Image captioning extracts high-level visual cues and translates them into a descriptive sentence about the image content, while some studies consider the inverse process by converting a semantic text into a visual image (Huang et al., 2021; Xu et al., 2015).
Different from existing modality-translation studies, in this paper we consider a novel problem, music-to-text synaesthesia, i.e., generating descriptive texts from music recordings. Recently, there have been some pioneering attempts to build connections between music recordings and tags. Cai et al. (2020) formulate music auto-tagging as a captioning task and automatically output a sequence of tags given a clip of music. Zhang et al. (2020) use keywords of music key, meter, and style to generate music descriptions, which can be used for caption generation. However, we argue that descriptive texts contain much richer information than tags and thus provide a better understanding of a music recording. Moreover, we notice that tags can have a biased interpretation. Figure 1 presents two music recordings with the same music tags but opposite sentiment orientations in their texts. The first expresses a positive sentiment by describing the music as "peaceful" and "beautiful," while the second uses tokens including "sadness" and "loss" to express a negative sentiment. It is clear that music tags are insufficient for describing the content of a music piece.

Contributions. In this paper, we propose a new task of generating descriptive text from music recordings. Specifically, given a music recording, we aim to build computational models that generate sentences describing the content of the music recording, as well as the music's inherent sentiment. Our major contributions are summarized as follows:

• From the research problem perspective, different from the music tagging problem, we propose music-to-text synaesthesia, a cross-modality translation task that aims at converting a given music piece to a text description. To the best of our knowledge, music-to-text synaesthesia is a novel research problem in the multi-modal learning community.
• From the dataset perspective, existing music-related datasets do not contain semantic descriptions of music recordings. To build computational models for this task, we collect a new dataset that contains 1,955 aligned pairs of classical music recordings and text descriptions.

• From the technical perspective, we design a group topology-preservation loss in our computational model to tackle non-discriminative music representations; it considers more data points as a group reference and preserves the relative topology among different samples, so the music representations can be better aligned with the structure of the text space.

• From the empirical evaluation perspective, extensive experimental results demonstrate the effectiveness of our proposed model over five heuristic or pre-trained competitive methods and their variants on our collected dataset. We also provide several case studies for comparison and elaborate on our group topology-preservation loss with parameter analyses.
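To make the group topology-preservation idea concrete, the following is a minimal NumPy sketch, under our own assumptions: it treats a batch of paired music and text embeddings as the "group," computes the pairwise-distance matrix within each modality, normalizes each matrix to be scale-invariant, and penalizes their discrepancy so that the relative topology among samples is preserved across the two spaces. The exact formulation in the paper may differ; the function name and normalization choice here are illustrative only.

```python
import numpy as np

def group_topology_loss(music_emb: np.ndarray, text_emb: np.ndarray, eps: float = 1e-8) -> float:
    """Illustrative sketch of a group topology-preservation loss.

    music_emb, text_emb: (n, d) arrays of paired embeddings for one group/batch.
    Returns the mean squared discrepancy between the two normalized
    pairwise-distance matrices (a hypothetical formulation, not the
    paper's exact loss).
    """
    def pairwise_dist(x: np.ndarray) -> np.ndarray:
        sq = np.sum(x ** 2, axis=1, keepdims=True)      # (n, 1) squared norms
        d2 = sq + sq.T - 2.0 * (x @ x.T)                # squared Euclidean distances
        return np.sqrt(np.maximum(d2, 0.0))             # clamp tiny negatives

    dm = pairwise_dist(music_emb)
    dt = pairwise_dist(text_emb)
    # Normalize each matrix so only the *relative* topology matters.
    dm = dm / (dm.max() + eps)
    dt = dt / (dt.max() + eps)
    n = dm.shape[0]
    # Average discrepancy over the n*(n-1) off-diagonal pairs
    # (diagonals are zero in both matrices and contribute nothing).
    return float(np.sum((dm - dt) ** 2) / (n * (n - 1) + eps))
```

Under this sketch, a text space that is a uniformly scaled copy of the music space incurs (near-)zero loss, since only relative distances among group members are compared; an unrelated text space incurs a larger penalty.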

2. RELATED WORK

Multi-modality Learning. The goal of multi-modal machine learning is to build computational models that are able to process and relate information from different modalities, such as audio, text, and image. Baltrusaitis et al. (2019) describe five challenges for multi-modal learning: representation, translation, alignment, fusion, and co-learning from/between different modalities. A large portion of prior work has focused on modality fusion, which aims at making predictions by joining information from two or more modalities. Applications include audio-visual speech recognition (Afouras et al., 2018), visual question answering (Goyal et al., 2017), emotion recognition, and media summarization. We briefly review the literature below and refer readers to Guo et al. (2019) and Baltrusaitis et al. (2019) for more complete surveys. … (2017) designed an embedding between video features and term vectors to learn the entire representation from freely available web videos and their descriptions. (3) In coordinated-representation-based models, representations exist in separate spaces but are coordinated through a similarity function (e.g., Euclidean distance) or a structure constraint. Such works include Wang et al. (2017), which learns a common subspace via adversarial learning for cross-modal retrieval, and Peng et al. (2018), which proposes a modality-specific cross-modal similarity measurement approach for tasks including cross-modal retrieval.

Our code and data resources are available at https://github.com/MusicTextSynaesthesia/MusicTextSynaesthesia.

