FIGARO: CONTROLLABLE MUSIC GENERATION USING EXPERT AND LEARNED FEATURES

Abstract

Recent symbolic music generative models have achieved significant improvements in the quality of the generated samples. Nevertheless, it remains hard for users to control the output in such a way that it matches their expectations. To address this limitation, high-level, human-interpretable conditioning is essential. In this work, we propose FIGARO, a Transformer-based conditional model trained to generate symbolic music based on a sequence of high-level control codes. To this end, we propose description-to-sequence learning, which consists of automatically extracting fine-grained, human-interpretable features (the description) and training a sequence-to-sequence model to reconstruct the original sequence given only the description as input. FIGARO achieves state-of-the-art performance in multi-track symbolic music generation both in terms of style transfer and sample quality. We show that performance can be further improved by combining human-interpretable and learned features. Our extensive experimental evaluation shows that FIGARO is able to generate samples that closely adhere to the content of the input descriptions, even when they deviate significantly from the training distribution.

1. INTRODUCTION

Music is a fascinating subject that surrounds us constantly, serving as a source of inspiration and a canvas for imagination to many. To some, creating music is a pursuit worthy of dedicating one's life to, which is a testament to the artistry and mastery involved. While composition is an intricate form of art that requires a deep understanding of the human experience and domain knowledge, the idea of devising a systematic or algorithmic approach to music creation has been around for centuries (Nierhaus, 2009). With the advent of deep learning, automatic music generation has witnessed renewed interest (Hernandez-Olivan & Beltran, 2021). The Transformer architecture (Vaswani et al., 2017) in particular, which has seen applications across many machine learning domains (Brown et al., 2020; Dosovitskiy et al., 2021; Lample & Charton, 2019; Biggio et al., 2021), has proven to be a powerful tool for musical sequence modelling. Initial breakthroughs by Huang et al. (2018) and Payne (2019) applied language modelling techniques to symbolic music to achieve state-of-the-art music generation. Though these models were capable of some form of conditional generation (e.g. melody or genre conditioning), other conditioning mechanisms and different types of control have since been proposed (Ens & Pasquier, 2020; Choi et al., 2020; Wu & Yang, 2021). As deep generative models improve and produce ever more realistic samples, how humans can interact with these models and steer them towards a desired result remains an area of active research. Recent efforts in text-to-image generation (Ramesh et al., 2022; Saharia et al., 2022) have shown the potential of human-interpretable, controllable generative models for usability and artistic applications. Whereas text-based conditioning has yielded human-interpretable control for image generation, the same conditioning mechanisms are not easily applicable to music generation.
We aim to extend this kind of control to other domains, in this case to music generation. As scale has proven to be key for achieving capable models, we cannot rely on scarce annotated data and instead propose a self-supervised objective, which we call description-to-sequence learning. We take inspiration from recent text-to-image approaches, but instead of a natural language description of the target, we automatically extract a sequence of high-level features (the description). These can
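To make the description-to-sequence idea concrete, the following is a minimal sketch of the feature-extraction side of the objective. The specific feature set shown here (per-bar note density, mean pitch, mean velocity, and instrument list) is illustrative, not necessarily the exact set used by FIGARO; the key property is that each feature is cheap to compute automatically and human-interpretable.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Note:
    pitch: int       # MIDI pitch (0-127)
    velocity: int    # MIDI velocity (1-127)
    bar: int         # index of the bar this note falls in
    instrument: str

def describe_bar(notes):
    """Summarise one bar as a coarse, human-interpretable description.

    The feature set here is a hypothetical example; any automatically
    extractable, high-level feature could be included.
    """
    return {
        "note_density": len(notes),
        "mean_pitch": round(mean(n.pitch for n in notes)),
        "mean_velocity": round(mean(n.velocity for n in notes)),
        "instruments": sorted({n.instrument for n in notes}),
    }

def describe(sequence):
    """Map a note sequence to its per-bar description sequence."""
    bars = {}
    for note in sequence:
        bars.setdefault(note.bar, []).append(note)
    return [describe_bar(bars[b]) for b in sorted(bars)]
```

In the self-supervised setup, a sequence-to-sequence model would then be trained to reconstruct the original note sequence given only `describe(sequence)` as input, so no human annotation is required at any point.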

