JOINTIST: SIMULTANEOUS IMPROVEMENT OF MULTI-INSTRUMENT TRANSCRIPTION AND MUSIC SOURCE SEPARATION VIA JOINT TRAINING

Abstract

In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument recognition module is optional and can be directly controlled by human users, which makes Jointist a flexible, user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world, given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We conducted experiments on several downstream tasks and found that utilizing the transcription results obtained from Jointist improved transcription by more than 1 percentage point (ppt.), source separation by 5 dB in SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt.

1. INTRODUCTION

Transcription, or automatic music transcription (AMT), is a music analysis task that aims to represent audio recordings using symbolic notations such as scores or MIDI (Musical Instrument Digital Interface) files (Benetos et al., 2013; 2018; Piszczalski & Galler, 1977). AMT can play an important role in music information retrieval (MIR) systems, since symbolic information (e.g., the pitch, duration, and velocity of notes) determines a large part of our musical perception. A successful AMT system should provide a denoised version of music in a musically meaningful, symbolic format, which could ease many MIR tasks such as melody extraction (Ozcan et al., 2005), chord recognition (Wu & Li, 2018), beat tracking (Vogl et al., 2017), composer classification (Kong et al., 2020a; Kim et al., 2020), and emotion classification (Chou et al., 2021). Finally, high-quality AMT systems can be used to build large-scale datasets, as done by Kong et al. (2020b). This can, in turn, accelerate the development of neural-network-based MIR systems, as these are often trained on otherwise scarcely available audio-aligned symbolic data (Brunner et al., 2018; Wu et al., 2020a; Hawthorne et al., 2018). Currently, the only available pop music dataset of this kind is the Slakh2100 dataset by Manilow et al. (2019). The lack of large-scale audio-aligned symbolic datasets for pop music impedes the development of other MIR systems that are trained using symbolic music representations.

In early research on AMT, the problem was often defined narrowly as transcription of a single target instrument, typically piano (Klapuri & Eronen, 1998) or drums (Paulus & Klapuri, 2003), where the input audio only includes that instrument. The limitation of this strong and then-unavoidable assumption is clear: such a model would not work for modern pop music, which makes up the majority of the music people listen to.
In other words, to handle realistic use cases of AMT, it is necessary to develop a multi-instrument transcription system. Recent examples are Omnizart (Wu et al., 2021; 2020b) and MT3 (Gardner et al., 2021), which we discuss in Section 2. The progress towards multi-instrument transcription, however, has just begun, and several challenges remain in the development and evaluation of such systems. In particular, the number of instruments in a multi-instrument audio recording is not fixed: a pop song may contain anywhere from a few instruments to over ten. Therefore, a model that transcribes a pre-defined, fixed set of musical instruments in every music piece is limiting. Rather, a model that can adapt to a varying number of target instruments would be more robust and useful. This indicates that we may need to consider instrument recognition and instrument-specific behavior during development as well as evaluation.

Motivated by this recent trend and the existing issues, we propose Jointist, a framework that includes instrument recognition, source separation, and transcription. We adopt a joint training scheme to maximize the performance of both the transcription and the source separation module. Our experimental results demonstrate the utility of transcription models as a pre-processing module for MIR systems. The results strengthen the view of transcription output as a symbolic representation, distinct from typical i) audio-based representations (spectrograms) or ii) high-level features (Choi et al., 2017; Castellon et al., 2021).

This paper is organized as follows. We first provide a brief overview of the background of automatic music transcription in Section 2. Then we introduce our framework, Jointist, in Section 3. We describe the experimental details in Section 4 and discuss the experimental results in Section 5.
We also explore the potential applications of the piano rolls generated by Jointist for other MIR tasks in Section 6. Finally, we conclude the paper in Section 7.
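The joint training scheme described above can be sketched as follows. This is a deliberately simplified, hypothetical PyTorch-style illustration, not the actual Jointist architecture: the module definitions, layer choices, and dimensions are placeholders. It shows only the key idea that an instrument recognition module (optional, replaceable by a user-supplied condition) conditions both the transcription and the separation module, and that the two task losses are summed so gradients from both tasks flow through the shared conditioning pathway.

```python
import torch
import torch.nn as nn

class JointistSketch(nn.Module):
    """Hypothetical sketch of instrument-conditioned joint training."""

    def __init__(self, n_instruments=39, n_pitches=88, n_bins=229):
        super().__init__()
        # Instrument recognition: predicts which instruments are present.
        self.recognizer = nn.Sequential(
            nn.Linear(n_bins, n_instruments), nn.Sigmoid())
        # Transcription: spectrogram frame + instrument condition -> piano-roll frame.
        self.transcriber = nn.Sequential(
            nn.Linear(n_bins + n_instruments, n_pitches), nn.Sigmoid())
        # Separation: spectrogram frame + instrument condition -> spectrogram mask.
        self.separator = nn.Sequential(
            nn.Linear(n_bins + n_instruments, n_bins), nn.Sigmoid())

    def forward(self, spec, condition=None):
        # The instrument module is optional: a user-supplied condition
        # vector can replace the predicted one.
        if condition is None:
            condition = self.recognizer(spec.mean(dim=1))   # (batch, n_instruments)
        cond = condition.unsqueeze(1).expand(-1, spec.size(1), -1)
        x = torch.cat([spec, cond], dim=-1)
        roll = self.transcriber(x)           # instrument-specific piano roll
        mask = self.separator(x)             # separation mask
        return roll, spec * mask, condition

model = JointistSketch()
spec = torch.rand(2, 100, 229)               # (batch, frames, freq bins)
roll, sep, cond = model(spec)

# Joint training: the transcription and separation losses are summed, so
# both tasks update the shared conditioning pathway in one backward pass.
target_roll, target_spec = torch.zeros_like(roll), torch.zeros_like(sep)
loss = nn.functional.binary_cross_entropy(roll, target_roll) \
     + nn.functional.l1_loss(sep, target_spec)
loss.backward()
```

In a real system the linear layers would be replaced by the paper's actual network architectures; the sketch only captures the conditioning and joint-loss structure.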

2. BACKGROUND

While automatic music transcription (AMT) models for piano music are well developed and able to achieve high accuracy (Benetos et al., 2018; Sigtia et al., 2015; Kim & Bello, 2019; Kelz et al., 2019; Hawthorne et al., 2017; 2018; Kong et al., 2021), multi-instrument automatic music transcription (MIAMT) is relatively unexplored. MusicNet (Thickstun et al., 2016; 2017) and ReconVAT (Cheuk et al., 2021a) are MIAMT systems that transcribe musical instruments other than piano, but their output is a flat piano roll that includes notes from all the instruments in a single channel. In other words, they are not instrument-aware. Omnizart (Wu et al., 2019; 2021) is instrument-aware, but it does not scale up well as the number of musical instruments increases, as discussed in Section 5.3. MT3 (Gardner et al., 2021) is currently the state-of-the-art MIAMT model. It formulates AMT as a sequence prediction task, where the sequence consists of tokens representing musical notes. By adopting the structure of a natural language processing (NLP) model called T5 (Raffel et al., 2020), MT3 shows that a transformer-based sequence-to-sequence architecture can perform successful transcription by learning from multiple datasets covering various instruments.

Although there have been attempts to jointly train a speech separation and recognition model (Shi et al., 2022), joint training of a transcription model together with a source separation model remains very limited in the MIR domain. For example, Jansson et al. (2019) use F0 estimation to guide singing voice separation. However, they only demonstrated their method on a monophonic singing track. In this paper, the Jointist framework extends this idea to polyphonic music. While Manilow et al.
(2020) extend joint transcription and source separation training to polyphonic music, their model is limited to five sources (Piano, Guitar, Bass, Drums, and Strings), which do not cover the diversity of real-world popular music. Our proposed method is trained and evaluated on 39 instruments, which is sufficient to handle real-world popular music.

Hung et al. (2021) and Chen et al. (2021a) also use joint transcription and source separation training for a small number of instruments. However, they only use transcription as an auxiliary task during training, and no transcription is performed during the inference phase. The model in Tanaka et al. (2020), on the other hand, is only capable of transcription; it applies a joint spectrogram and pitchgram clustering method to improve multi-instrument transcription accuracy. A zero-shot transcription and separation model was proposed in Lin et al. (2021), but it was only trained and evaluated on 13 classical instruments.

Our proposed Jointist framework alleviates most of the problems mentioned above. Unlike existing models, Jointist is instrument-aware, and it transcribes and separates only the instruments present in the input mixture audio.
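The distinction between a flat piano roll and an instrument-aware piano roll discussed above can be illustrated with a minimal numpy example. The pitch range, frame count, and the two instruments here are arbitrary placeholders chosen for illustration:

```python
import numpy as np

n_frames, n_pitches = 4, 128

# Instrument-aware representation: one piano-roll channel per instrument,
# as produced by an instrument-aware model (two hypothetical instruments).
aware = np.zeros((2, n_frames, n_pitches))   # (instrument, frame, pitch)
aware[0, 0:2, 60] = 1                        # instrument 0 plays C4 (MIDI 60)
aware[1, 1:3, 64] = 1                        # instrument 1 plays E4 (MIDI 64)

# Flat representation: notes from all instruments are merged into a single
# channel, so the information of which instrument played which note is lost.
flat = aware.max(axis=0)                     # (frame, pitch)

print(aware.shape, flat.shape)               # (2, 4, 128) (4, 128)
```

At frame 1 both notes are active in `flat`, but only the instrument-aware tensor records that they came from different instruments.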

