AUTOMATIC MUSIC PRODUCTION USING GENERATIVE ADVERSARIAL NETWORKS

Abstract

In computer-based music generation, there are two main threads of research: the construction of autonomous music-making systems, and the design of computer-based environments to assist musicians. However, even though creating accompaniments for melodies is an essential part of every producer's and songwriter's work, little effort has been devoted to automatic music arrangement in the audio domain. In this contribution, we propose a novel framework for automatic music accompaniment in the Mel-frequency domain. Using several songs converted into Mel-spectrograms (a two-dimensional time-frequency representation of audio signals), we were able to automatically generate original arrangements for both bass and voice lines. Treating music pieces as images (Mel-spectrograms) allowed us to reformulate our problem as an unpaired image-to-image translation problem, and to tackle it with CycleGAN, a well-established framework. Moreover, the choice to work with raw audio and Mel-spectrograms enabled us to model long-range dependencies more effectively, to better represent how humans perceive music, and to potentially draw sounds for new arrangements from the vast collection of music recordings accumulated over the last century. Our approach was tested on two downstream tasks: creating credible and on-time drums for a given bass line, and arranging a given a cappella song into a full song. In the absence of an objective way of evaluating the output of music generative systems, we also defined a possible metric for the proposed task, partially based on human (and expert) judgement.

1. INTRODUCTION

The development of home music production has brought significant innovations into the process of pop music composition. Software like Pro Tools, Cubase, and Logic, as well as MIDI-based technologies and digital instruments, provides a wide set of tools to manipulate recordings and simplify the composition process for artists and producers. After recording a melody, perhaps with the aid of a guitar or a piano, songwriters can now build up the arrangement one piece at a time, sometimes without needing professional musicians or formal music training. As a result, singers and songwriters, as well as producers, have started asking for tools that could facilitate, or to some extent even automate, the creation of full songs around their lyrics and melodies. To meet this new demand, the goal of designing computer-based environments to assist human musicians has become central in the field of automatic music generation (Briot et al., 2020). IRCAM OpenMusic (Assayag et al., 1999), Sony CSL-Paris FlowComposer (Papadopoulos et al., 2016), and Logic Pro X Easy Drummer are just some examples. In addition, more solutions based on deep learning techniques continue to be studied, such as RL-Duet (Jiang et al., 2020), a deep reinforcement learning algorithm for online accompaniment generation, or PopMAG (Ren et al., 2020), a transformer-based architecture that relies on a multi-track MIDI representation of music. A comprehensive review of the most relevant deep learning techniques applied to music is provided by Briot et al. (2020). Most of these strategies, however, suffer from the same critical issue, which makes them less appealing for commercial music production: they rely on a symbolic/MIDI representation of music.
The approach proposed in this paper, instead, is a first attempt at automatically generating a euphonic arrangement (two or more sound patterns that produce a pleasing and harmonious piece of music) in the audio domain, given a musical sample encoded in a two-dimensional time-frequency representation (in particular, we opted for the Mel-spectrogram). Although arrangement generation has been studied in the context of symbolic audio, switching to Mel-spectrograms allows us to preserve the sound heritage of other musical pieces (enabling operations such as sampling) and is more suitable for real-life cases, where voice, for instance, cannot be encoded in MIDI. We focused our attention on two tasks of increasing difficulty: (i) given a bass line, to create credible and on-time drums, and (ii) given a voice line, to output a new and euphonic musical arrangement. Incidentally, we found that, for training samples, our model was able to reconstruct the original arrangement quite well, even though no pairing between the Mel-spectrograms of the two domains was performed. By means of the Mel-spectrogram representation of music, we can treat the problem of automatically generating an arrangement or accompaniment for a specific musical sample as an image-to-image translation task. For instance, given the Mel-spectrogram of an a cappella song, we may want to produce the Mel-spectrogram of the same song including a suitable arrangement. To solve this task, we tested an unpaired image-to-image translation strategy known as CycleGAN (Zhu et al., 2017), which translates an image from a source domain X to a target domain Y in the absence of paired examples, by simultaneously training both the mapping from X to Y and the mapping from Y to X, with the goal of minimizing a cycle consistency loss.
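The cycle consistency idea can be illustrated with a minimal numpy sketch (not the paper's or CycleGAN's actual implementation, which uses convolutional generators over 256 × 256 images): two toy linear "generators" G: X → Y and F: Y → X act on flattened feature vectors, and the loss penalizes the round trips F(G(x)) and G(F(y)) for departing from the originals. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                             # toy feature dimension (a real model maps whole spectrograms)
G = rng.normal(size=(d, d)) * 0.1  # hypothetical generator X -> Y
F = np.linalg.pinv(G)              # hypothetical generator Y -> X, chosen as a near-inverse of G

def cycle_consistency_loss(x_batch, y_batch, G, F):
    # L_cyc = E[ ||F(G(x)) - x||_1 ] + E[ ||G(F(y)) - y||_1 ],
    # applied row-wise to batches of flattened samples.
    x_cyc = x_batch @ G.T @ F.T    # x -> G(x) -> F(G(x))
    y_cyc = y_batch @ F.T @ G.T    # y -> F(y) -> G(F(y))
    return np.mean(np.abs(x_cyc - x_batch)) + np.mean(np.abs(y_cyc - y_batch))

x = rng.normal(size=(8, d))        # batch from domain X (e.g. voice spectrograms)
y = rng.normal(size=(8, d))        # batch from domain Y (e.g. full-mix spectrograms)
loss = cycle_consistency_loss(x, y, G, F)
```

Because F is constructed as the pseudo-inverse of an (almost surely invertible) G, both round trips are near-exact and the loss is close to zero; during actual CycleGAN training, this term is minimized jointly with the adversarial losses of the two domain discriminators.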
The aforementioned system was trained on 5 s pop music samples (equivalent to 256 × 256 Mel-spectrograms) coming both from the Free Music Archive (FMA) dataset (Defferrard et al., 2017; 2018) and from the Demucs dataset (Défossez et al., 2019). The short sample duration does not affect the proposed methodology, at least with respect to the arrangement task we focus on, and inference can also be performed on full songs. Part of the dataset had to be pre-processed first, since the FMA songs lack source-separated channels (i.e. differentiated vocals, bass, drums, etc.); the required channels were extracted using Demucs (Défossez et al., 2019). The main innovations presented in this contribution are as follows: (i) treating music pieces as images, we developed a framework to automatically generate music arrangements in the Mel-frequency domain, different from any previous approach; (ii) our approach is able to generate arrangements with low computational resources and limited inference time, compared to other popular solutions for automatic music generation (Dhariwal et al., 2020); (iii) given the challenges of a quantitative assessment of music, we developed a metric, partially based on or correlated with human (and expert) judgement, to automatically evaluate the obtained results and the creativity of the proposed system. To the best of our knowledge, this is the first work to tackle the automatic arrangement production task in the audio domain by leveraging a two-dimensional time-frequency representation.
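To make the 5 s → 256 × 256 correspondence concrete, the following is a minimal, self-contained numpy sketch of a Mel-spectrogram computation (not the paper's pipeline; production code would typically use a library such as librosa). The parameter values, sr=22050, n_fft=1024, hop=431, n_mels=256, are illustrative assumptions chosen so that 5 s of audio yields roughly a 256 × 256 image.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers evenly spaced on the Mel scale.
    # With many Mel bins, some low-frequency triangles may collapse to zero width;
    # the max(..., 1) guards simply leave those filter rows at zero.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(audio, sr=22050, n_fft=1024, hop=431, n_mels=256):
    # Frame and window the signal, take the power spectrum of each frame,
    # then project onto the Mel filterbank and compress with a log.
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T
    return np.log1p(mel)           # shape: (n_mels, n_frames)

audio = np.random.randn(5 * 22050)  # stand-in for 5 s of audio at 22.05 kHz
S = mel_spectrogram(audio)          # 256 Mel bins x ~256 time frames
```

With these settings the result has 256 Mel bins and roughly 256 time frames, i.e. approximately the 256 × 256 images the generators operate on; exact squareness can be obtained by padding or trimming frames.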

2. RELATED WORKS

The interest surrounding automatic music generation, translation and arrangement has greatly increased in the last few years, as proven by the high number of proposed solutions; see (Briot et al., 2020) for a comprehensive and detailed survey. Here we present a brief overview of the key contributions in both the symbolic and the audio domain.

Music generation & arrangement in the symbolic domain.

There is a very large body of research that uses a symbolic representation of music to perform music generation and arrangement. The following contributions used MIDI, piano rolls, chord and note names to feed several deep learning architectures and tackle different aspects of the music generation problem. In (Yang et al., 2017), CNNs are used for generating melody as a series of MIDI notes, either from scratch, by following a chord sequence, or by conditioning on the melody of previous bars. In (Mangal et al., 2019; Jaques et al., 2016; Mogren, 2016; Makris et al., 2017), LSTM networks are used to generate musical notes, melodies, polyphonic music pieces, and long drum sequences, under constraints imposed by metrical rhythm information and a given bass sequence. The authors of (Yamshchikov & Tikhonov, 2017; Roberts et al., 2018), instead, use VAE networks to generate melodies. In (Boulanger-Lewandowski et al., 2012), symbolic sequences of polyphonic music are modeled in a completely general piano-roll representation, while the authors of (Hadjeres & Nielsen, 2017) propose a novel architecture to generate melodies satisfying positional constraints in the style of the soprano parts of the J.S. Bach chorale harmonisations encoded in MIDI. In (Johnson, 2017), RNNs are used for prediction and composition of polyphonic music; in (Hadjeres et al., 2017), highly convincing chorales in the style of Bach were automatically generated using note names; (Lattner et al., 2018) imposed higher-level structure on generated polyphonic music, whereas (Mao et al., 2018) designed an end-to-end generative model capable of composing music conditioned on a specific mixture of composer styles. The

