FIGARO: CONTROLLABLE MUSIC GENERATION USING EXPERT AND LEARNED FEATURES

Abstract

Recent symbolic music generative models have achieved significant improvements in the quality of the generated samples. Nevertheless, it remains hard for users to control the output in such a way that it matches their expectations. To address this limitation, high-level, human-interpretable conditioning is essential. In this work, we release FIGARO, a Transformer-based conditional model trained to generate symbolic music based on a sequence of high-level control codes. To this end, we propose description-to-sequence learning, which consists of automatically extracting fine-grained, human-interpretable features (the description) and training a sequence-to-sequence model to reconstruct the original sequence given only the description as input. FIGARO achieves state-of-the-art performance in multi-track symbolic music generation both in terms of style transfer and sample quality. We show that performance can be further improved by combining human-interpretable features with learned ones. Our extensive experimental evaluation shows that FIGARO is able to generate samples that closely adhere to the content of the input descriptions, even when they deviate significantly from the training distribution.

1. INTRODUCTION

Music is a fascinating subject that surrounds us constantly, serving as a source of inspiration and a canvas for imagination to many. To some, creating music is a topic worthy of dedicating one's life to, which is a testament to the artistry and mastery involved. While composition is an intricate form of art that requires deep domain knowledge and an understanding of the human experience, the idea of devising a systematic or algorithmic approach to music creation has been around for centuries (Nierhaus, 2009). With the advent of deep learning, automatic music generation has witnessed renewed interest (Hernandez-Olivan & Beltran, 2021). The Transformer architecture (Vaswani et al., 2017) in particular, which has seen applications to many machine-learning domains (Brown et al., 2020; Dosovitskiy et al., 2021; Lample & Charton, 2019; Biggio et al., 2021), has proven to be a powerful tool for musical sequence modelling. Initial breakthroughs by Huang et al. (2018) and Payne (2019) applied language-modelling techniques to symbolic music to achieve state-of-the-art music generation. Though these models were capable of some forms of conditional generation (e.g., melody or genre conditioning), other conditioning mechanisms and different types of control have since been proposed (Ens & Pasquier, 2020; Choi et al., 2020; Wu & Yang, 2021). As deep generative models improve and produce increasingly realistic samples, how humans can interact with these models and steer them towards a desirable result remains an area of active research. Recent efforts in text-to-image generation (Ramesh et al., 2022; Saharia et al., 2022) have shown the potential of human-interpretable, controllable generative models for usability and artistic applications. Whereas text-based conditioning has yielded human-interpretable control for image generation, the same conditioning mechanisms are not easily applicable to music generation.
We aim to extend this kind of control to other domains, in this case to music generation. As scale has proven to be key for achieving capable models, we cannot rely on scarce annotated data and instead propose a self-supervised objective, which we call description-to-sequence learning. We take inspiration from recent text-to-image approaches, but instead of a natural language description of the target, we automatically extract a sequence of high-level features (the description). These can either be hand-crafted using domain knowledge or learned. The description then serves as the input to a conditional model to reconstruct the original sequence. To this end, we define a description function which extracts said features from a given piece of music. The choice of description function determines the characteristics of the resulting model and serves as an inductive bias, allowing us to emphasize desirable properties such as human-interpretability or fine-grained control over instruments and chord progression. Note that the general nature of the proposed framework allows for applications to other domains despite our focus on symbolic music.

[Figure 1: overview of the proposed approach, showing bars of music together with bar-level and sequence-level reconstruction.]

Our main contribution is FIGARO (FIne-grained music Generation via Attention-based, RObust control), a model trained on the proposed description-to-sequence objective by combining two separate description functions: 1) the hand-crafted expert description, which provides global context in the form of a high-level, human-interpretable sequence, and 2) the learned description, where we use representation learning to extract salient features from the source sequence. The learned description is intended to amend the expert description with high-fidelity information in places where the latter might be incomplete, albeit at the cost of human-interpretability. The model is trained on conditional generation, mapping descriptions to music. An illustrated overview of the model is given in Figure 1. At inference time, users may interact with the model in human-interpretable description space. We provide a simple interface in the form of an online demo of our model.¹ We also release the source code and model weights for anyone to download and use freely.¹ Our secondary contribution is REMI+, an extension to the REMI input representation (Huang & Yang, 2020) which opens the way to multi-track, multi-time-signature music.
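The description-to-sequence idea can be sketched in a few lines of code. The feature and token names below are purely illustrative assumptions, not the paper's actual expert-description features or REMI+ vocabulary; the point is only the training-pair construction: extract a coarse, human-interpretable description per bar, then ask a sequence-to-sequence model to reconstruct the full note sequence from it.

```python
# Illustrative sketch of description-to-sequence training pairs.
# Feature names and token formats are hypothetical, not the
# authors' implementation.

def expert_description(bar_notes):
    """Extract coarse, human-interpretable features from one bar.

    `bar_notes` is a list of (instrument, pitch, velocity, duration)
    tuples; the returned tokens form the bar's expert description.
    """
    instruments = sorted({note[0] for note in bar_notes})
    mean_pitch = sum(note[1] for note in bar_notes) / len(bar_notes)
    tokens = [f"Instrument={inst}" for inst in instruments]
    tokens += [f"MeanPitch={round(mean_pitch)}",
               f"NoteDensity={len(bar_notes)}"]
    return tokens

def make_training_pair(bars):
    """Build a (description, target) pair for seq2seq training.

    The model learns to reconstruct the fine-grained target tokens
    given only the coarse description as input.
    """
    description, target = [], []
    for bar in bars:
        description += ["<bar>"] + expert_description(bar)
        target += ["<bar>"] + [f"Note={inst}:{pitch}"
                               for (inst, pitch, *_rest) in bar]
    return description, target
```

Because the description is computed automatically from the music itself, arbitrary unlabelled MIDI corpora can be used for training; at inference time a user edits the description tokens directly to steer generation.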
We evaluate FIGARO on its ability to adhere to the prescribed condition by comparing it to state-of-the-art methods for controllable symbolic music generation (Choi et al., 2020; Wu & Yang, 2021). We demonstrate empirically that our technique outperforms the state of the art in both controllable generation and sample quality. To evaluate sample quality, we employ subjective evaluation in the form of a listening study. We further demonstrate that FIGARO is robust to distributional shifts in description space and performs well on constructed samples outside the training distribution, indicating that the proposed objective is effective at learning generalized concepts about the data.

2. CONTROLLABLE SYMBOLIC MUSIC GENERATION

In the context of generative modelling, controllability is an important issue, as such models only become useful if the user is able to steer the generation process in a desired direction. This has recently been observed for text-to-image models, and we intend to take a closer look at controllable music generation. We identify two different levels of controllability: global and fine-grained control. Global conditioning, where the generation is guided by a constant set of attributes that do not change during the generation process, is the most prevalent form of control. Examples of global control include prompt-based conditioning (Payne, 2019) or conditional decoding of latent representations
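The distinction between the two levels of control can be made concrete at the token level. In the hypothetical sketch below (token names are made up for illustration), global control prepends one constant attribute set to the whole sequence, whereas fine-grained control lets the conditioning change at every bar:

```python
# Hypothetical illustration of global vs. fine-grained control.
# Token names are invented for this example.

def globally_conditioned(control, tokens):
    """Global control: one constant attribute set, prepended once."""
    prefix = [f"<{key}={val}>" for key, val in sorted(control.items())]
    return prefix + tokens

def finely_conditioned(bar_controls, bars):
    """Fine-grained control: the conditioning may differ per bar."""
    out = []
    for ctrl, bar_tokens in zip(bar_controls, bars):
        out += [f"<{key}={val}>" for key, val in sorted(ctrl.items())]
        out += bar_tokens
    return out
```

Under global conditioning the model can only be steered once, before generation starts; fine-grained conditioning is what allows a description-based model to control, e.g., instrumentation and note density bar by bar.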



¹ An online demonstration of FIGARO is available on Google Colab (https://tinyurl.com/28etxz27); we recommend selecting a GPU environment for improved inference speed. Code and model weights are available through GitHub (https://github.com/dvruette/figaro).

