AUTOMATIC MUSIC PRODUCTION USING GENERATIVE ADVERSARIAL NETWORKS

Abstract

In computer-based music generation, research follows two main threads: the construction of autonomous music-making systems, and the design of computer-based environments to assist musicians. However, even though creating accompaniments for melodies is an essential part of every producer's and songwriter's work, little effort has been devoted to automatic music arrangement in the audio domain. In this contribution, we propose a novel framework for automatic music accompaniment in the Mel-frequency domain. Using several songs converted into Mel-spectrograms, a two-dimensional time-frequency representation of audio signals, we were able to automatically generate original arrangements for both bass and voice lines. Treating music pieces as images (Mel-spectrograms) allowed us to reformulate our problem as an unpaired image-to-image translation problem, and to tackle it with CycleGAN, a well-established framework. Moreover, the choice to work with raw audio and Mel-spectrograms enabled us to model long-range dependencies more effectively, to better represent how humans perceive music, and to potentially draw sounds for new arrangements from the vast collection of music recordings accumulated over the last century. Our approach was tested on two different downstream tasks: generating credible and rhythmically consistent drums from a given bass line, and arranging an a cappella song into a full song. In the absence of an objective way of evaluating the output of music generative systems, we also defined a possible metric for the proposed task, partially based on human (and expert) judgement.
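As a concrete illustration of the representation used throughout this work, the following minimal NumPy sketch converts a waveform into a Mel-spectrogram "image" (frequency bands over time). It is not our implementation: the frame size, hop length, and number of Mel bands below are illustrative choices, and production code would typically rely on a library such as librosa.

```python
import numpy as np

def hz_to_mel(f):
    # Standard Mel-scale formula: approximates human pitch perception.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters evenly spaced on the Mel scale, mapping
    # linear FFT bins to perceptually motivated frequency bands.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mel_spectrogram(y, sr, n_fft=1024, hop=256, n_mels=80):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # then project it onto the Mel filterbank and convert to decibels.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2    # (frames, bins)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T  # (frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel, 1e-10)).T   # dB, (n_mels, frames)

# A one-second 440 Hz tone becomes an 80-band image over 59 time frames.
sr = 16000
t = np.arange(sr) / sr
S = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr)
print(S.shape)  # (80, 59)
```

The resulting two-dimensional array can be treated exactly like a grayscale image, which is what makes image-to-image translation frameworks such as CycleGAN applicable to audio.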

1. INTRODUCTION

The development of home music production has brought significant innovations to the process of pop music composition. Software like Pro Tools, Cubase, and Logic, as well as MIDI-based technologies and digital instruments, provides a wide set of tools to manipulate recordings and simplify the composition process for artists and producers. After recording a melody, perhaps with the aid of a guitar or a piano, songwriters can now build up the arrangement one piece at a time, sometimes without needing professional musicians or formal music training. As a result, singers and songwriters, as well as producers, have started asking for tools that could facilitate, or to some extent even automate, the creation of full songs around their lyrics and melodies. To meet this new demand, the goal of designing computer-based environments to assist human musicians has become central in the field of automatic music generation (Briot et al., 2020). IRCAM OpenMusic (Assayag et al., 1999), Sony CSL-Paris FlowComposer (Papadopoulos et al., 2016), and Logic Pro X Easy Drummer are just some examples. In addition, more solutions based on deep learning techniques continue to be studied, such as RL-Duet (Jiang et al., 2020), a deep reinforcement learning algorithm for online accompaniment generation, or PopMAG, a transformer-based architecture that relies on a multi-track MIDI representation of music (Ren et al., 2020). A comprehensive review of the most relevant deep learning techniques applied to music is provided by Briot et al. (2020). Most of these strategies, however, suffer from the same critical issue, which makes them less appealing for commercial music production: they rely on a symbolic/MIDI representation of music.
The approach proposed in this paper, instead, is a first attempt at automatically generating a euphonic arrangement (two or more sound patterns that produce a pleasing and harmonious piece of music) in the audio domain, given a musical sample encoded in a two-dimensional time-frequency representation (in particular, we opted for the Mel-spectrogram). Al-

