Neurosymbolic Deep Generative Models for Sequence Data with Relational Constraints

Abstract

Recently, there has been significant progress in designing deep generative models that generate realistic sequence data such as text or music. Nevertheless, it remains difficult to incorporate high-level structure to guide the generative process. We propose a novel approach for incorporating structure in the form of relational constraints between different subcomponents of an example (e.g., lines of a poem or measures of music). Our generative model has two parts: (i) one model to generate a realistic set of relational constraints, and (ii) a second model to generate realistic data satisfying these constraints. To train model (i), we propose a novel program synthesis algorithm that infers the relational constraints present in the training data; we then train both models on the resulting relational constraints. In our experiments, we show that our approach significantly improves over state-of-the-art approaches in terms of capturing high-level structure in the data, while performing comparably or better in terms of low-level structure.

1. Introduction

Over the past few years, there has been tremendous progress in designing deep generative models for generating sequence data such as natural language (Vaswani et al., 2017) or music (Huang et al., 2019). These approaches leverage the vast quantities of data available in conjunction with unsupervised and self-supervised learning to learn probabilistic models of the data; then, new examples can be generated by sampling from these models, with the possibility of conditioning on initial elements of the sequence. Despite this progress, a key challenge facing deep generative models is the difficulty of incorporating high-level structure into the generated examples, e.g., rhyming and meter across lines of a poem, or repetition across measures of a piece of music. The ability to capture high-level structure is important for improving the quality of the generated data, especially in low-data regimes where only small numbers of examples are available; intuitively, knowledge of the structure compresses the amount of information that the generative model has to learn. Furthermore, explicit representations of structure (i.e., in a symbolic way rather than implicitly in a vector embedding) have the added benefit that users can modify the structure to guide the generative process.

Recently, Young et al. (2019) proposed a technique called neurosymbolic generative models for incorporating high-level structure into image generation, focusing on simple 2D repeating patterns in images of building facades (e.g., repeating windows). The basic idea is to leverage program synthesis to extract structure from data: given an example image x, they devise an algorithm A that extracts a program $c = A(x)$ representing the set of 2D repeating patterns present in x. Then, using the pairs (x, c), they train two generative models: (i) a model $p_\phi(c)$ that generates a program, and (ii) a model $p_\theta(x \mid c)$ that generates an image containing the structure represented by c. However, their approach is heavily tailored to the image domain in several ways. First, their representation of structure is geared towards relatively simple patterns occurring in images of building facades. In addition, their algorithm A is specifically designed to extract this kind of program from an input image, as are their models $p_\phi(c)$ for generating programs and $p_\theta(x \mid c)$ for generating images conditioned on the program.

We represent the relational constraints $c_x$ present in an example x by relating each subcomponent w of x with a prototype $\tilde{w}$, which can intuitively be thought of as the "original" subcomponent from which w is constructed. In particular, the relationship between w and $\tilde{w}$ is labeled with a set of relations R, which encodes the constraint that w and $\tilde{w}$ should satisfy relation r for each $r \in R$. Importantly, while each subcomponent is associated with a single prototype, each prototype may be associated with multiple subcomponents. As a consequence, different subcomponents associated with the same prototype are related in some way. This representation is compact, requiring only linearly many constraints in the number of subcomponents of x (assuming the number of prototypes is constant). Intuitively, compactness ensures the representation both generalizes well and is easy to generate. Then, we design a program synthesis algorithm that can extract an optimal representation of the structure present in a training example x.
We show how to express the synthesis problem as a constrained combinatorial optimization problem, which we solve using the SMT solver Z3 (De Moura & Bjørner, 2008). Next, we represent c as a sequence, and design $p_\phi(c)$ to be inferred through a specialized sequence VAE. Finally, we propose three possible designs of $p_\theta(x \mid c)$ based on trying to identify an example x that is realistic (e.g., according to a pretrained generative model $p_\theta(x)$) while simultaneously satisfying the constraints c.

We evaluate our approach on two tasks: poetry generation, where the relational constraints include rhyming lines or lines with shared meter, and music generation, where the relational constraints include equality in terms of pitch or rhythm, one measure being a transposition of another (i.e., pitches shifted up or down by a constant amount), etc. We show that our approach outperforms or performs comparably to state-of-the-art models according to many metrics. Finally, we also perform a qualitative evaluation where we show how the user can modify the high-level structure to generate examples that satisfy additional desired constraints. This ability demonstrates an important feature of our approach: the user can guide the generative process by modifying the relational constraints as desired.

Example. Figure 1 illustrates how our approach is applied to generate poetry. During training, our approach uses program synthesis to infer the relational constraints $c_x$ present in the examples x, and uses both x and $c_x$ to train the generative models. Here, $c_x$ is a bipartite graph, where the left-hand vertices are prototypes, and the right-hand vertices correspond to lines of x. Each vertex on the right is connected to exactly one prototype, and is labeled with constraints on how it should relate to its prototype. To generate new examples, our approach first samples relational constraints c, and then samples an example x that satisfies c; i.e., we need to choose a line to fill each right-hand node in a way that satisfies the relations with its prototype. Furthermore, a user can modify the sampled constraints c to guide the generative process. Thus, our approach enables users to flexibly incorporate domain knowledge about the high-level structure of the data into the generative process, both in terms of the relational constraints included and by allowing them to modify the generated relational constraints.

Related work.

There has been recent interest in leveraging program synthesis to improve machine learning. For instance, it has been applied to unsupervised learning of latent structure in drawings (Ellis et al., 2015) and to reinforcement learning (Verma et al., 2018). These techniques have benefits such as improving interpretability (Verma et al., 2018; Ellis et al., 2020), enabling learning from fewer examples (Ellis et al., 2015), generalizing more robustly (Inala et al., 2019), and being easier to formally reason about (Bastani et al., 2018). More recently, there has been work leveraging program synthesis in conjunction with deep learning, where the DNN handles perception and program synthesis handles high-level structure (Ellis et al., 2017), including work in the lifelong learning setting (Valkov et al., 2018). In contrast to these approaches, our focus is on generative models. In particular, we extend recent work leveraging these ideas in the setting of image generation (Young et al., 2019) to incorporate high-level relational structure into sequence generation tasks.

Early music generation approaches were rule-based (Ovans & Davison, 1992) or used simple statistical models such as Markov models (Sandred et al., 2009; Cope, 1987) or probabilistic CFGs (Quick, 2016). Recent work has used deep learning to generate music (Huang et al., 2019; OpenAI, 2019) and poetry (Liao et al., 2019); our experiments show that these approaches have difficulty generating realistic high-level structure. Other approaches have incorporated structure into deep learning to generate music (Medeot et al., 2018) or poetry (Castro & Attarian, 2018), but they are domain specific; we find they do not perform at a human level on capturing global (and sometimes local) structure. Some approaches incorporate expert-provided constraints such as rhyme and meter to generate poetry (Lau et al., 2018); unlike our approach, they cannot automatically learn and generate these constraints from data.

2. Background on Neurosymbolic Generative Models

Consider the problem of learning a generative model given training data from the underlying distribution. Given training examples $x_1, \ldots, x_k \sim p^*$, our goal is to learn a generative model $p_\theta \approx p^*$ from which we can draw additional samples $x \sim p_\theta$. We consider sequence data, i.e., an example $x \in \mathcal{X}$ is a sequence $x = (w_1, \ldots, w_m) \in \mathcal{W}^m$.¹ For example, each subcomponent w may be a line of a poem or a measure of music, and x may be a poem or song. We are interested in domains where likely examples satisfy latent relational constraints $c \in \mathcal{C}$ over the subcomponents. For instance, c may say that two measures $w_i$ and $w_j$ of x start with the same series of pitches, or that two lines $w_i$ and $w_j$ of x rhyme. We assume given a set of relations $\mathcal{R}$ (e.g., $r \in \mathcal{R}$ might be "rhyme" or "equal"), and a function $f : \mathcal{W} \times \mathcal{W} \times \mathcal{R} \to \mathbb{B}$ (where $\mathbb{B} = \{0, 1\}$) such that $f(w, w', r)$ indicates whether w and w' satisfy relation r. Then, c is a compact representation of the relations present in an input x. We describe the structure of c in detail in Section 3.1; for now, the approach we describe works for any choice of c.

In particular, we build on neurosymbolic generative models (Young et al., 2019), where c is itself generated based on a latent value $z \in \mathcal{Z}$, i.e.,

$$p_{\theta,\phi}(x) = \int \sum_{c \in \mathcal{C}} p_\theta(x \mid c) \cdot p_\phi(c \mid z) \cdot p(z) \, dz.$$

Then, Young et al. (2019) consider the variational distribution $q_\phi(c, z \mid x) = q_\phi(z \mid c) \cdot q(c \mid x)$, where $q(c \mid x) = \delta(c - c_x)$. Here, $\delta$ is the Dirac delta function, and $c_x$ is a single representative associated with x. In particular, $c_x$ is generated from x using a program synthesis algorithm (David & Kroening, 2017), i.e., an algorithm A that takes as input an example x and outputs a program $c = A(x)$ encoding the relational constraints present in x. Next, Young et al. (2019) derive the evidence lower bound

$$\log p_{\theta,\phi}(x) \ge \log p_\theta(x \mid c_x) + \mathbb{E}_{q_\phi(z \mid c_x)}[\log p_\phi(c_x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid c_x) \,\|\, p(z)), \tag{1}$$

where $D_{\mathrm{KL}}$ is the KL divergence. The first term of (1) is the log-likelihood of a generative model predicting the probability of example x given relational structure $c_x$, and the second and third terms form the loss of a variational autoencoder (VAE) $p_\phi(c \mid z)$ and $q_\phi(z \mid c)$ (Kingma & Welling, 2019). In summary, this approach separately learns (i) a VAE to generate c given z, and (ii) a generative model to generate x given c; the latter can be a second VAE or a generative adversarial network (GAN) (Goodfellow et al., 2014). This approach is called synthesis-guided generative models (SGM) since it uses program synthesis to guide training.

To leverage this framework, we have to instantiate (i) the space of relational constraints $\mathcal{C}$, (ii) the synthesis algorithm $A : \mathcal{X} \to \mathcal{C}$ used to extract a program encoding the structure of x, and (iii) the architectures of $p_\phi(c \mid z)$, $q_\phi(z \mid c)$, and $p_\theta(x \mid c)$. In previous work, Young et al. (2019) used heuristics specific to the image domain to achieve these goals; in particular, they used (i) simple equality constraints on sub-regions of the image designed to capture 2D repeating patterns, (ii) a custom synthesis algorithm that greedily adds constraints in the data to the program, and (iii) a representation of $c_x$ as an image, in which case $p_\theta$ is a generative model over images, and $p_\phi$, $q_\phi$ are based on an encoding of c as a fixed-length vector. In terms of (ii), we design a synthesis algorithm that expresses the synthesis problem as a constrained combinatorial optimization problem, which it solves using Z3 (De Moura & Bjørner, 2008).
In terms of (iii), our programs encode declarative constraints rather than imperative renderings, so the previous architectures for $p_\phi$ and $q_\phi$ cannot be used. Instead, we use expert domain-specific heuristics, transformers (Vaswani et al., 2017), or graph neural networks (GNNs) (Kipf & Welling, 2017) for $p_\phi$ and $q_\phi$. For $p_\theta$, we propose several methods for imposing the constraints encoded by c when generating an example x.
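For completeness, here is a brief sketch (our own restatement, not from the original derivation) of how the bound (1) follows from the standard ELBO once the Dirac-delta posterior $q(c \mid x) = \delta(c - c_x)$ is substituted; since c is a discrete program, the delta contributes no entropy term:

```latex
\begin{align*}
\log p_{\theta,\phi}(x)
  &\ge \mathbb{E}_{q_\phi(c,z \mid x)}\!\left[
        \log \frac{p_\theta(x \mid c)\, p_\phi(c \mid z)\, p(z)}
                  {q_\phi(c, z \mid x)} \right]
      && \text{(standard ELBO)} \\
  &= \mathbb{E}_{q_\phi(z \mid c_x)}\!\left[
        \log p_\theta(x \mid c_x) + \log p_\phi(c_x \mid z)
        + \log \frac{p(z)}{q_\phi(z \mid c_x)} \right]
      && \text{(substitute } q(c \mid x) = \delta(c - c_x)\text{)} \\
  &= \log p_\theta(x \mid c_x)
     + \mathbb{E}_{q_\phi(z \mid c_x)}[\log p_\phi(c_x \mid z)]
     - D_{\mathrm{KL}}(q_\phi(z \mid c_x) \,\|\, p(z)).
\end{align*}
```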

3. Relational Constraints for Sequence Data

In this section, we describe how we represent relational constraints $c$, as well as our algorithm A for synthesizing the relational constraints $c_x = A(x)$ present in an example sequence x.

3.1. Graph Representation of Relational Constraints

Recall that our generative model operates by first generating a relational program c, and then generating an example x that satisfies c. Thus, we need to design relational programs c that encode constraints on the structure of an example x. Our programs c encode a set of relational constraints, each of which imposes a constraint that subcomponents of x should have certain kinds of relations. We begin by describing the structure of a single relational constraint, and then describe how c encodes a set of relational constraints.

A relational constraint $\phi \in \Phi = \mathcal{W} \times I \times \mathcal{R}$, where $I = \{1, \ldots, m\}$, is a tuple $\phi = (\tilde{w}, i, r)$; we call $\tilde{w} \in \mathcal{W}$ a prototype subcomponent. An example x satisfies $\phi$ (denoted $x \models \phi$) if $f(\tilde{w}, w_i, r) = 1$, where $w_i$ is the $i$th subcomponent of x. That is, $\phi$ says the $i$th subcomponent $w_i$ of x should have relation r with prototype subcomponent $\tilde{w}$. Thus, we can interpret $\phi$ as a function $\phi : \mathcal{X} \to \mathbb{B}$, where $\phi(x) = 1$ if x satisfies $\phi$ and $\phi(x) = 0$ otherwise.

Next, a relational program c encodes a collection of relational constraints on examples x. We represent c as an undirected labeled bipartite graph $c = (\tilde{V}, V, E)$ with vertices $\tilde{V}$ and $V$ and edges $E \subseteq \tilde{V} \times V \times L$, where $L$ are the labels. The vertices $\tilde{w} \in \tilde{V}$ are prototype subcomponents $\tilde{w} \in \mathcal{W}$; equivalently, they may be vector embeddings of prototype subcomponents. The vertices $i \in V = \{1, \ldots, m\}$ are the indices of subcomponents in x. The edges $e \in E$ are tuples $e = (\tilde{w}, i, R)$, where $R \subseteq \mathcal{R}$ is a set of relations. We impose the constraint that each $i \in V$ is part of exactly one edge $(\tilde{w}, i, R)$ (though each $\tilde{w} \in \tilde{V}$ may be part of multiple edges). Finally, c encodes the set of relational constraints $\Phi_c = \{(\tilde{w}, i, r) \mid (\tilde{w}, i, R) \in E \wedge r \in R\}$. In other words, c includes the relational constraint that each subcomponent $w_i$ of x should have all relations $r \in R$ with the prototype $\tilde{w}$ to which i is connected.

As an example, in Figure 1, the graph shown on the top right encodes relational constraints $c_x$, and the top left shows an example x that satisfies all the constraints $\phi \in \Phi_{c_x}$. The nodes on the left-hand side of $c_x$ are prototype subcomponents $\tilde{w} \in \mathcal{W}$, each of which is a line of poetry. The nodes on the right-hand side correspond to indices i (from $i = 1$ on top to $i = m = 10$ on the bottom); each one is labeled with a set of relations $R_i$. Then, $\Phi_{c_x}$ contains a constraint $(\tilde{w}, i, r)$ for each edge $\tilde{w} \to i$ in the graph and each $r \in R_i$, which says that line i of x should have relations $r \in R_i$ with $\tilde{w}$. For instance, the last (10th) node in $c_x$ has constraints $R_{10} = \{\text{rhyme}, \text{meter}\}$, and is connected to prototype line $\tilde{w} =$ "The harp that thou didst...". Thus, this edge encodes a constraint saying that the last line of x should rhyme with and have the same meter as $\tilde{w}$. Indeed, the last line of x is $w_{10} =$ "Who serve thee most...", which rhymes and has the same meter as "The harp that thou didst...".

Remark 3.1. We use prototypes rather than direct relationships between components to ensure the size of the graph is tractable. In particular, our approach ensures that the graph is linear in the size of the input (assuming the number of prototypes is constant). A compact graph is both easier to synthesize (for training) and easier to generate (for generation). Our approach can easily be generalized to more complex representations.

Remark 3.2. We refer to c as a program since it can be interpreted as a Datalog program (Ceri et al., 1989), i.e., a relational logic program. At a high level, $\Phi_c$ is a set of Datalog relations over examples $x \in \mathcal{X}$. Thus, c can be interpreted as a program $c : \mathcal{X} \to \mathbb{B}$ such that $c(x) = 1$ if $\phi(x) = 1$ for all $\phi \in \Phi_c$, and $c(x) = 0$ otherwise.
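To make the representation concrete, here is a minimal Python sketch of the bipartite constraint graph and its satisfaction semantics. The class and function names are our own illustrations, not the paper's implementation; `f` stands for the domain-specific relation oracle defined above.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

Relation = str          # e.g., "rhyme", "meter", "same_rhythm"
Subcomponent = str      # e.g., a line of poetry or an encoded measure

@dataclass
class Edge:
    prototype: Subcomponent   # \tilde{w}: a prototype subcomponent
    index: int                # i: which subcomponent of x it constrains
    relations: Set[Relation]  # R: relations that must hold

@dataclass
class RelationalProgram:
    prototypes: List[Subcomponent]  # \tilde{V}
    m: int                          # |V|: number of subcomponents in x
    edges: List[Edge]               # E; exactly one edge per index i

def satisfies(c: RelationalProgram,
              x: List[Subcomponent],
              f: Callable[[Subcomponent, Subcomponent, Relation], bool]) -> bool:
    """Check x |= Phi_c: every (prototype, i, r) constraint must hold."""
    return all(
        f(e.prototype, x[e.index], r)
        for e in c.edges
        for r in e.relations
    )
```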

3.2. Synthesizing Relational Constraints

Recall that when training our generative model, we need to design a program synthesis algorithm A that synthesizes a relational program $c_x = A(x)$ that best encodes the latent relational constraints present in each training example x. A key question is where the prototypes come from. We simply choose the prototypes $\tilde{w}$ to be actual subcomponents of x. Thus, $c_x$ encodes that the subcomponents of x are each related to one of a small number of distinguished subcomponents of x. We formulate the problem of synthesizing $c_x$ as a constrained optimization problem, which we describe below.

Optimization variables. The variables are a binary vector $H \in \mathbb{B}^m$ and a binary matrix $K \in \mathbb{B}^{m \times m}$. Intuitively, $H_i$ indicates whether subcomponent $w_i$ of x is a prototype subcomponent in c, and $K_{ij}$ indicates whether $w_i$ is the prototype for subcomponent $w_j$.

Constraints. Our optimization problem has the following three constraints:

$$\psi_1 \equiv k_{\min} \le \sum_{i=1}^m H_i \le k_{\max}, \qquad \psi_2 \equiv \bigwedge_{j=1}^m \left( \sum_{i=1}^m K_{ij} = 1 \right), \qquad \psi_3 \equiv \bigwedge_{i=1}^m \left( \sum_{j=1}^m K_{ij} \le m \cdot H_i \right).$$

First, $\psi_1$ says that the number of prototype subcomponents is between $k_{\min}$ and $k_{\max}$. Next, $\psi_2$ says that every subcomponent $w_j$ corresponds to exactly one prototype subcomponent $w_i$. Finally, $\psi_3$ says that for every i, if $w_i$ is the prototype subcomponent of some $w_j$ according to K, then it must be a prototype subcomponent according to H as well.

Objective. The objective of our optimization problem is expressed in terms of a precomputed distance matrix $D \in \mathbb{R}^{m \times m}$, where $D_{ij}$ measures the dissimilarity between subcomponents $w_i$ and $w_j$; smaller values indicate a greater degree of similarity. In particular, we define $D_{ij} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathbb{1}(f(w_i, w_j, r) = 0)$, i.e., $D_{ij}$ is the fraction of relations that are not satisfied by $w_i$ and $w_j$. Then, our objective (which is to be minimized) has the following three terms:

$$J_1 = \sum_{i,j=1}^m K_{ij} \cdot D_{ij}, \qquad J_2 = \sum_{i,j=1}^m \sum_{k=1}^m K_{ki} \cdot K_{kj} \cdot D_{ij}, \qquad J_3 = -\sum_{i,j=1}^m H_i \cdot H_j \cdot D_{ij}.$$

First, $J_1$ says that subcomponents should be similar to their prototypes. Second, $J_2$ says that subcomponents should also be similar to other subcomponents that share the same prototype. Third, $J_3$ says that different prototype subcomponents should be dissimilar.

Optimization problem. Our algorithm A uses Z3 to solve the optimization problem

$$(H^*, K^*) = \operatorname*{arg\,min}_{H, K} \; \lambda_1 J_1 + \lambda_2 J_2 + \lambda_3 J_3 \quad \text{subj. to} \quad \psi_1 \wedge \psi_2 \wedge \psi_3,$$

where $\lambda_1, \lambda_2, \lambda_3 \in \mathbb{R}_{\ge 0}$ are hyperparameters. Finally, to construct $c_x$, A chooses $\tilde{V} = \{w_i \mid H^*_i = 1\}$, $V = \{1, \ldots, m\}$, and $E = \{(w_i, j, R_{ij}) \mid K^*_{ij} = 1\}$, where $R_{ij} = \{r \in \mathcal{R} \mid f(w_i, w_j, r) = 1\}$; i.e., $\tilde{V}$ are the prototype subcomponents according to $H^*$, E are the edges according to $K^*$, and $R_{ij}$ are the relations satisfied by $w_i$ and $w_j$. Z3 is guaranteed to find the optimal solution; in the unlikely event that multiple such solutions exist, it chooses one nondeterministically. Intuitively, our approach should perform well when a handful of prototypes are sufficient to approximately capture the relational structure in the data. Furthermore, since the user has the ability to define relations, they can adjust their definitions as needed to capture the desired structures.
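The following is a minimal sketch of how this optimization problem could be encoded with z3's `Optimize` API. The function name, default hyperparameters, and the decision to return a prototype assignment dictionary are our own illustrative choices; `D` is the precomputed distance matrix described above.

```python
from z3 import And, Bool, If, Optimize, RealVal, Sum, is_true, sat

def synthesize(D, k_min=2, k_max=4, lam1=1.0, lam2=1.0, lam3=1.0):
    m = len(D)
    H = [Bool(f"H_{i}") for i in range(m)]
    K = [[Bool(f"K_{i}_{j}") for j in range(m)] for i in range(m)]
    h = [If(H[i], 1, 0) for i in range(m)]       # 0/1 view of H
    k = [[If(K[i][j], 1, 0) for j in range(m)] for i in range(m)]
    d = lambda i, j: RealVal(D[i][j])

    opt = Optimize()
    opt.add(Sum(h) >= k_min, Sum(h) <= k_max)    # psi_1
    for j in range(m):                           # psi_2
        opt.add(Sum([k[i][j] for i in range(m)]) == 1)
    for i in range(m):                           # psi_3
        opt.add(Sum(k[i]) <= m * h[i])

    # J_1: subcomponents close to their prototypes.
    J1 = Sum([If(K[i][j], d(i, j), RealVal(0))
              for i in range(m) for j in range(m)])
    # J_2: subcomponents sharing a prototype close to each other.
    J2 = Sum([If(And(K[p][i], K[p][j]), d(i, j), RealVal(0))
              for p in range(m) for i in range(m) for j in range(m)])
    # J_3: distinct prototypes far apart (hence the minus sign).
    J3 = -Sum([If(And(H[i], H[j]), d(i, j), RealVal(0))
               for i in range(m) for j in range(m)])
    opt.minimize(lam1 * J1 + lam2 * J2 + lam3 * J3)

    assert opt.check() == sat
    model = opt.model()
    protos = [i for i in range(m) if is_true(model.eval(H[i]))]
    assign = {j: i for j in range(m) for i in range(m)
              if is_true(model.eval(K[i][j]))}
    return protos, assign
```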

4. Neurosymbolic Generative Models with Relational Constraints

In this section, we describe our deep generative model for generating examples x. Recall that our approach proceeds in two steps: (i) generate c, and (ii) generate x given $\Phi_c$. We describe each of these steps in detail below.

4.1. Step 1: Generating Relational Constraints

The first step of our generative model is to generate relational constraints $\Phi_c$ using a VAE, i.e., $z \sim p(\cdot)$ and $c \sim p_\phi(\cdot \mid z)$, where $p_\phi(c \mid z)$ is a VAE decoder and $p(z) = N(z; 0, I)$ is a Gaussian distribution. The main choice is the architecture to use for the VAE. In particular, we consider a representation of c as a sequence $(s_1, \ldots, s_m)$, where $s_i \in \{0, 1, \ldots, m\}$ for each i; intuitively, $s_i$ encodes that subcomponent $w_i$ should have the same prototype subcomponent as $w_{i - s_i}$, or, if $s_i = 0$, that $w_i$ corresponds to a new prototype subcomponent. More precisely, we initialize $\Phi_c = \emptyset$. Then, we generate the sequence of pairs $(s_i, r_i)$, where $s_i \in \{0, 1, \ldots, m\}$ and $r_i$ is a binary vector of length $n = |\mathcal{R}|$ encoding the relation set $R_i$, using an LSTM-VAE. For each i, we construct $(\tilde{w}_i, R_i)$ based on $s_i$ and $r_i$. If $s_i = 0$, we generate a new prototype subcomponent $\tilde{w}_i$ using a domain-specific generative model (itself another LSTM-VAE), and add $\phi_i = (\tilde{w}_i, i, R_i)$ to $\Phi_c$. If $s_i > 0$, we let $\phi_i = (\tilde{w}_{i - s_i}, i, R_i)$, where $\tilde{w}_{i - s_i}$ is the prototype associated with $w_{i - s_i}$.
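The following is a small sketch (our own illustration) of how a constraint graph can be flattened into the $(s_i, r_i)$ sequence consumed by the LSTM-VAE; it reuses the hypothetical `RelationalProgram` class sketched in Section 3.1.

```python
from typing import Dict, List, Tuple

def encode_as_sequence(c: "RelationalProgram",
                       all_relations: List[str]) -> List[Tuple[int, List[int]]]:
    """Flatten c into (s_i, r_i) pairs: s_i is the backward offset to the
    most recent subcomponent sharing w_i's prototype (0 = new prototype),
    and r_i is a binary vector over the global relation vocabulary."""
    proto_of: Dict[int, object] = {}
    rels_of: Dict[int, set] = {}
    for e in c.edges:
        proto_of[e.index] = e.prototype
        rels_of[e.index] = e.relations

    seq = []
    last_seen: Dict[object, int] = {}  # prototype -> last index using it
    for i in range(c.m):
        p = proto_of[i]
        s_i = (i - last_seen[p]) if p in last_seen else 0
        last_seen[p] = i
        r_i = [1 if r in rels_of[i] else 0 for r in all_relations]
        seq.append((s_i, r_i))
    return seq
```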

4.2. Step 2: Generating Examples Given Relational Constraints

Next, we describe how we implement the second step $p_\theta(x \mid c)$ of our generative model. We propose three approaches for generating x given $\Phi_c$; we give details in Appendix A.

Approach 1: Constrained sampling. We sample values $x \sim p_\theta(\cdot)$ by sequentially sampling $w_i \sim p_\theta(\cdot)$ from a pretrained generative model $p_\theta(w)$. We do so using rejection sampling at each step, i.e., we sample $w_i \sim p_\theta(\cdot)$ until we find a $w_i$ satisfying $f(\tilde{w}, w_i, r) = 1$ for each $(\tilde{w}, i, r) \in \Phi_c$. In addition, to speed up sampling, at each step of sampling $w_i$ (e.g., a word in a line or a pitch in a measure), we eliminate choices that violate $\Phi_c$. A schematic sketch of this rejection-sampling loop appears after the three approaches.

Approach 2: Constraint-aware embeddings. We train a conditional generative model $p_\theta(w_1, \ldots, w_m \mid c)$ (in the form of a graph convolutional network) that simultaneously generates all m subcomponents in a way that satisfies c, and sample $x = (w_1, \ldots, w_m) \sim p_\theta(\cdot \mid c)$.

Approach 3: Combinatorial optimization. We sample $x \sim p_\theta(\cdot)$ by sequentially generating $w_i$ by solving an optimization problem whose objective is to maximize adherence to $\Phi_c$ plus additional terms encoding domain-specific heuristics encouraging $w_i$ to be realistic.
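Here is the promised schematic sketch of Approach 1 (rejection sampling over whole subcomponents; the per-token masking refinement is described in Appendix A.1). The function signature, `sample_sub`, and `max_tries` are illustrative stand-ins, not the paper's implementation.

```python
from typing import Callable, List, Tuple

def constrained_sample(
    phi_c: List[Tuple[str, int, str]],      # constraints (w_tilde, i, r)
    m: int,                                 # number of subcomponents
    sample_sub: Callable[[List[str]], str], # draws w_i ~ p_theta(. | prefix)
    f: Callable[[str, str, str], bool],     # relation oracle
    max_tries: int = 1000,
) -> List[str]:
    """Approach 1: sequentially sample each subcomponent, rejecting
    candidates that violate any constraint on position i."""
    x: List[str] = []
    for i in range(m):
        constraints_i = [(w, r) for (w, j, r) in phi_c if j == i]
        for _ in range(max_tries):
            cand = sample_sub(x)  # condition on the prefix sampled so far
            if all(f(w, cand, r) for (w, r) in constraints_i):
                x.append(cand)
                break
        else:
            raise RuntimeError(f"no satisfying subcomponent found for i={i}")
    return x
```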

5. Experiments

We evaluate our approach on music and poetry generation; see Appendix B.6 for details.

Music generation. We evaluated our approach on a music generation task. In this setting, x is a song, and the w are measures of music. We consider 20 relations, including equality, same rhythm, same pitch progression, etc.; a full list is given in Appendix B.2. We used songs from the Essen folk song corpus (Schaffrath, 1995), using 2000 for training and 500 for testing. For this dataset, we used each of the three approaches A1, A2, and A3 described in Section 4 to sample $x \sim p_\theta(\cdot \mid c)$. For A1, we use a pretrained transformer called MusicAutoBot (Shaw, 2020). For A2, we require a generative model that constructs vector embeddings of measures; we use the version of Magenta's MusicVAE which embeds single measures (Roberts et al., 2018). We finetune all models on our training examples. We compare to MusicAutoBot, an LSTM model with attention (AttentionRNN) (Waite, 2016), Magenta's 16-bar MusicVAE, and an implementation of StructureNet, an approach that integrates structure into an LSTM (Medeot et al., 2018). We also compare to a constraint generation approach called Motifate (Muhammad Faisal, 2017); see Appendix B.5.

We compare performance in terms of both high-level and low-level structure. For high-level structure, given a generated (or human held-out) example x, we use our program synthesis algorithm to synthesize its relational structure $c_x = A(x)$. Then, given a collection $C_{\mathrm{gen}} = \{c_x \mid x \in X_{\mathrm{gen}}\}$ of synthesized structures for generated examples, along with a collection $C_{\mathrm{human}} = \{c_x \mid x \in X_{\mathrm{human}}\}$ of synthesized structures for the held-out human examples, we train a graph convolutional network (GCN) to try to discriminate $C_{\mathrm{gen}}$ from $C_{\mathrm{human}}$. In addition, we also train a random forest (RF) over handcrafted features (described in Appendix B.4) to try to discriminate them. In both cases, we use a balanced dataset (i.e., 50% human held-out and 50% generated), so random predictions have accuracy 0.5. For low-level structure, we use the negative log likelihood (NLL) according to MusicAutoBot (pretrained and then fine-tuned on our dataset), MusicVAE, and our own GraphVAE (described in Appendix A.2). While these metrics are not perfect, they can be used to evaluate across all approaches. For models where NLL is available, we also compare their NLL on the held-out human data.

We show results in Table 1. For each approach (as well as the held-out human dataset), we show the negative log-likelihood (NLL) assigned to that approach by each of the three models MusicAutoBot, MusicVAE, and GraphVAE. According to almost all metrics, our algorithm (using A2 to sample from $p_\theta(x \mid c)$) outperforms or matches all others, and achieves performance very similar to human. The one exception is our measure of low-level structure according to MusicAutoBot; however, this model rates StructureNet and MusicVAE-16 as substantially better than human, indicating that it is not a good measure of quality. Finally, our GraphVAE produces a lower NLL for the held-out human music than either MusicAutoBot or MusicVAE-16, which indicates that it models human data better than the others (the other approaches cannot be used to compute NLL).

Poetry generation. We use Project Gutenberg's poetry collection (Parrish, 2018), filtered to focus on examples that contain rhymes and meter. We use 2700 10-line poems for training and 300 for testing.
In this case, we were unable to apply A3 due to the large size of the vocabulary, making constrained optimization infeasible. We were also unable to apply A2, since state-of-the-art generative models such as BERT and GPT2 were unable to capture rhyming and meter, because they operate at the word level where this information is unavailable. In A1, rather than sample words going forward, we sample them backwards, making it easier to sample lines that satisfy rhyming constraints; see Appendix A. Thus, we sample using BERT (Devlin et al., 2018), since it supports bidirectional sampling. We compare to BERT and GPT2 (Radford et al., 2019), both finetuned on our dataset. We also consider a variant GPT2-Opt of GPT2 where we use beam search to choose line breaks in a way that maximizes occurrences of rhyme and meter. We also tried a variant of GPT2 that used constrained sampling to try to find poems that fit a given rhyme and meter scheme, but the search space was too large and it was unable to generate a single poem even after several hours. We also compare to an implementation of RichLyrics (Castro & Attarian, 2018), where both the consecutive parts of speech for each line given the previous line and the ability to fill in the correct word for a given part of speech were learned separately from the corpus. Finally, in addition to using BERT as a sequential generator, we considered an ablation where we perform constrained sampling, but with a uniformly random $\Phi_c$ rather than sampling it from a learned distribution.

Table 2: Results for the poetry domain. To evaluate low-level structure, we use the negative log likelihood per token of fine-tuned versions of BERT and GPT2. For high-level structure, we use the cross-entropy loss of the GCN discriminator ("GCN Disc."). We also show the information entropy. The highest (non-human) score in each column is bolded.

As before, we compare both high-level structure and low-level structure. For high-level structure, we again use a GCN discriminator. For low-level structure, we use the negative log likelihoods per token according to each of BERT and GPT2 (finetuned on our training dataset). In this case, because our approach uses constrained sampling from a pretrained generative model, we could not evaluate negative log-likelihood according to our approach.

We show results in Table 2. Our approach significantly outperforms all baselines (including RichLyrics) in terms of high-level structure. Furthermore, our approach produces a lower BERT NLL than either unmodified BERT or the BERT-based implementation of RichLyrics. GPT2 and GPT2-Opt produce more likely output than our technique according to the transformer probability metrics, most likely because they are better at natural language generation than BERT. If we could instead build on GPT2 (i.e., perform backwards sampling with GPT2), then our approach would likely achieve better performance; we leave this direction to future work. Furthermore, our approach achieves comparable BERT scores to GPT2-Opt, while significantly outperforming it in terms of structure. Finally, we noticed that one way the baselines (except for the ablation and RichLyrics) tended to perform well was by being very repetitive. Thus, we additionally measured the information entropy of the different models. As can be seen, our approach (both with learned and random $\Phi_c$) is far closer to human in terms of entropy than GPT2-Opt and BERT, supporting our hypothesis that the baselines were overly repetitive.
RichLyrics avoided this lack of entropy, since consecutive words were constrained to belong to different part-of-speech groups. However, it did not receive a high likelihood score, as the transformer model used to produce those parts of speech often resulted in unlikely output.

In Figure 2, we show an example poem generated using our approach (left) along with one generated using GPT2-Opt (right). As can be seen, the GPT2-Opt poem does not capture structure in the way human poems do: adjacent lines are unrelated, lines have very unequal length, and the only rhymes are the word "the" in the brown lines and the words "to" and "too" in the green lines. There is even less structure in poems generated using vanilla GPT2. Thus, GPT2 is completely unable to capture the high-level structure present in the real poetry provided as training data. In contrast, our poem captures structure very similar to the human poem shown in Figure 1, such as rhyming adjacent lines.

User modifications.

A key benefit of our approach is that the user can modify the relational constraints c (or construct their own from scratch) for use in the second step $p_\theta(x \mid c)$, providing a way to guide the generative process. Figure 2 demonstrates this capability: we manually edited the part of the program corresponding to the last two lines so that they shared a prototype with the previous two lines. The example generated using the unmodified (sampled) constraints is shown on the left, and the example generated using the modified constraints is shown in the middle. We show in the supplement a similar process performed with music data.

6. Conclusion

We have presented a novel approach for representing and synthesizing relational constraints on sequence data, and for generating examples whose relational structure resembles that of the training data. Our experiments demonstrate that we outperform existing approaches in terms of achieving human-like structure, while performing comparably or better on widely-used metrics that do not explicitly account for structure. Equally importantly, our approach gives the user a way to guide the generative process by modifying the relational constraints. Directions for future work include automatically discovering relational primitives, integrating our structure generator with more powerful language models such as GPT-3, and improving the ability to sample from generative models subject to constraints.

A Generating Examples Given Relational Constraints

A.1 Approach 1: Constrained Sampling

In the music domain, we choose the pretrained generative model $p_\theta(w)$ to be a pretrained version of MusicAutoBot. To generate x, we sequentially sample each measure $w_i$ conditioned on all prior measures $w_1, \ldots, w_{i-1}$. Each measure is sampled by sequentially sampling a sequence of pitch-duration pairs until the total duration is 16 beats (i.e., the length of a measure). During sampling, we mask pitch-duration pairs that cannot satisfy $\Phi_c$ (i.e., we set their sampling probability to zero and rescale the remaining probabilities). For instance, if the "has similar interval" relation is supposed to hold between the prototype measure and measure i, and we are sampling the jth note of measure i, then we mask any pitch $\text{pitch}_j$ in measure i such that $|(\text{pitch}_j - \text{pitch}_{j-1}) - (\widetilde{\text{pitch}}_j - \widetilde{\text{pitch}}_{j-1})| \ge 3$, where $\widetilde{\text{pitch}}_j$ is the jth pitch in the prototype corresponding to $w_i$. In other words, we eliminate pitches that would cause sampling to violate this constraint.

In the poetry domain, we finetune a pretrained BERT model on our dataset by taking the pretrained model's weights and then training the model on our dataset with strong weight decay. BERT has the ability to complete masked words in a sentence. We leverage this ability to sample lines that rhyme and have the same meter, which is a challenging task since such lines are a tiny fraction of the search space. We describe how we simultaneously handle rhyming and equal meter; the cases where only one of these two constraints has to hold are similar. Given a prototype $\tilde{w}$, we work backwards: on each step j, we sample from BERT a word $\text{word}_j$ that has the same number of syllables as the corresponding word $\widetilde{\text{word}}_j$ in the prototype. More precisely, we feed BERT the sequence $\text{word}_1, \ldots, \text{word}_{j-1}, \text{MASK}, \text{word}_{j+1}, \ldots$ and ask it to fill in the masked word, setting the probability of any word with a different number of syllables than $\widetilde{\text{word}}_j$ to zero. In addition, we also set the probability of any word too similar to the original word in terms of cosine similarity to zero. For the last word (which we sample first), we additionally restrict to words that rhyme with the corresponding prototype word. To increase diversity, we sample the remaining words twice: (i) backwards-to-forwards from word $k - 1$ to word 1, where k is the number of words, and (ii) we resample each of the $k - 1$ words (i.e., all except the last word) in a random order. We discard any lines which, after being sampled, are determined to be too unlikely according to BERT.
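A condensed sketch of the masked, constrained sampling step using the Hugging Face transformers library follows. The set `allowed` stands in for the vocabulary filter described above (same syllable count, and rhyming for the final word); building it requires syllable and rhyme helpers that we do not show here, so this is an illustration of the masking mechanics rather than the paper's implementation.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def sample_masked_word(words, pos, allowed: set) -> str:
    """Sample a replacement for words[pos], restricted to `allowed` tokens."""
    masked = list(words)
    masked[pos] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx.item()]
    # Zero out (i.e., -inf in logit space) every token not in `allowed`.
    allowed_ids = tokenizer.convert_tokens_to_ids(sorted(allowed))
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    return tokenizer.convert_ids_to_tokens(torch.multinomial(probs, 1).item())
```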

A.2 Approach 2: Constraint-Aware Embeddings

In this approach, we start with a pretrained generative model $p_\theta(w)$ that ignores c; in particular, we assume that $p_\theta(w) = \int p_\theta(w \mid u) \cdot p(u) \, du$, where $p_\theta(w \mid u)$ is the decoder network of a VAE over w, and $p(u) = N(0, I)$. Now, rather than sample $w_i \sim p_\theta(\cdot)$, we train another generative model $p_\psi(u_1, \ldots, u_m \mid c) = N((u_1, \ldots, u_m) \mid \mu_\psi(c), \Sigma_\psi(c))$ to generate latent vectors $u_i \in \mathcal{U}$ such that $w_i \sim p_\theta(\cdot \mid u_i)$ are likely to satisfy $\Phi_c$. More precisely, $\mu_\psi$ and $\Sigma_\psi$ are the intermediate outputs of a graph convolutional network (GCN) (Kipf & Welling, 2017) that takes as input the graph c (where edge attributes between nodes encode $\Phi_c$) and ultimately outputs a sequence $(u_1, \ldots, u_m)$. Our approach can be considered a graph autoencoder in the sense that the objective function used during training rewards the reconstruction of the exact embeddings of the nodes and (implicitly, through the relationship consistency loss) their edge attributes. Our graph encoder/decoder produces one latent vector per node, and these vectors are rewarded for being close to i.i.d. Gaussian random variables with mean zero and variance one.

To train $p_\psi$, we construct a training example $(c_x, (u_1, \ldots, u_m))$ for each training example $x = (w_1, \ldots, w_m)$, where $u_1, \ldots, u_m$ are obtained by the encoder network $q_\theta(u \mid w)$, i.e., $u_i \sim q_\theta(\cdot \mid w_i)$. Then, we train $p_\psi$ using the objective

$$J(\psi) = \sum_{(c, u)} \left[ D_{\mathrm{KL}}(N(\mu_\psi(c), \Sigma_\psi(c)) \,\|\, N(0, I)) + \sum_{i=1}^m \|u_i - \mu_\psi(c)_i\|_2^2 + J_{\mathrm{rel}}(\mu_\psi(c); c) \right].$$

The first term enforces that the distribution of the latent vectors u should be Gaussian, and the second term enforces that each latent vector u should be close to its original value according to the VAE encoder $q_\theta(u \mid w)$. The third term is designed to enforce the satisfaction of the constraints $\Phi_c$. In particular, we train a kind of "semantic discriminator" $p_\alpha(u, u'; r)$ that predicts whether $w \sim p_\theta(\cdot \mid u)$ and $w' \sim p_\theta(\cdot \mid u')$ satisfy relation r, i.e., $f(w, w', r) = 1$. The network $p_\alpha$ is trained on data generated from the given training examples x. Then, given $p_\alpha$, we want $(u_1, \ldots, u_m) = \mu_\psi(c)$ to satisfy

$$p_\alpha(u_i, \tilde{u}; r) \approx \begin{cases} 1 & \text{if } (\tilde{w}, i, r) \in \Phi_c \\ 0 & \text{otherwise}, \end{cases}$$

where $\tilde{u} \sim q_\theta(\cdot \mid \tilde{w})$. In other words, we want to generate an example x that satisfies the relations in c according to $p_\alpha$. In particular, we use the loss

$$J_{\mathrm{rel}}(u; c) = \sum_{i=1}^m \sum_{r \in \mathcal{R}} \mathrm{CE}\big(p_\alpha(u_i, \tilde{u}; r), \mathbb{1}[(\tilde{w}, i, r) \in \Phi_c]\big),$$

where $\tilde{u} \sim q_\theta(\cdot \mid \tilde{w})$, and where CE denotes the cross-entropy loss. Once we have trained $p_\psi$, we generate sequences by sampling $u \sim p_\psi(\cdot \mid c)$ and $w_i \sim p_\theta(\cdot \mid u_i)$, and constructing $x = (w_1, \ldots, w_m)$.

For the music domain, we use embeddings from a pretrained Magenta MusicVAE; unlike the MusicVAE used for evaluation, we finetuned it to decode only 1-2 measures of music from a 256-dimensional vector. Then, we use the decoder portion of this model to convert the embeddings $u_1, \ldots, u_m \sim p_\psi(u_1, \ldots, u_m \mid c)$ sampled from the GCN-VAE $p_\psi$ into measures. The graphs in the training set vary in size depending on the number of prototype measures.
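Below is a compact PyTorch sketch of the training objective $J(\psi)$ under a diagonal-covariance assumption. The callables `gcn` and `p_alpha`, the tensor shapes, and the precomputed `targets` list are illustrative stand-ins for the models and data described above, not the paper's code.

```python
import torch
import torch.nn.functional as F

def loss_J(gcn, p_alpha, c_graph, u_true, targets):
    """One-example objective: KL + reconstruction + relational consistency.

    u_true:  (m, d) encoder embeddings u_i ~ q_theta(. | w_i)
    targets: list of (i, u_proto, r_index, label) tuples derived from Phi_c,
             where label = 1 iff relation r must hold between subcomponent i
             and its prototype (label = 0 for the other relations).
    """
    mu, logvar = gcn(c_graph)  # each (m, d); diagonal Sigma_psi
    # KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over nodes and dims.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Reconstruction: each latent close to its encoder embedding.
    recon = torch.sum((u_true - mu) ** 2)
    # Relational consistency via the semantic discriminator p_alpha.
    rel = sum(
        F.binary_cross_entropy(
            p_alpha(mu[i], u_proto, r_index),
            torch.tensor(float(label)),
        )
        for (i, u_proto, r_index, label) in targets
    )
    return kl + recon + rel
```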

A.3 Approach 3: Combinatorial Optimization

Given a sampled program c, this approach attempts to generate values $x = (w_1, \ldots, w_m)$ such that $x \models \Phi_c$ by solving a constraint-solving problem. However, the relational constraints $\phi \in \Phi_c$ are not always consistent with one another, so we relax the constraint $x \models \Phi_c$ into an objective, i.e.,

$$x^* = \operatorname*{arg\,max}_{x \in \mathcal{X}} \sum_{i=1}^m \sum_{r \in \mathcal{R}} \mathbb{1}\big(f(\tilde{w}, w_i, r) = 1 \Leftrightarrow (\tilde{w}, i, r) \in \Phi_c\big).$$

Encoding this optimization problem in a form Z3 can solve depends on the domain and relations. For this approach to work, we need to include additional, handcrafted terms in the objective that encourage the generated example x to be realistic. For the music domain, the optimization variables are the sequence of pitches and their durations. The objective function is a linear combination of the degree to which x satisfies c, along with domain-specific heuristics, e.g., minimizing large jumps in pitch values (i.e., $|\text{pitch}_{i+1} - \text{pitch}_i| \ge 4$), not having any intervals of length 6 (i.e., $|\text{pitch}_{i+1} - \text{pitch}_i| = 6$) due to the unpleasant harmonic nature of that interval, and not having two consecutive jumps in pitch (i.e., $(|\text{pitch}_{i+2} - \text{pitch}_{i+1}| \ge 5) \wedge (|\text{pitch}_{i+1} - \text{pitch}_i| \ge 5)$). These heuristics are based on standard concepts from music theory (Horton & Ritchey, 2000).
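The following is a sketch (with made-up penalty weights and pitch range) of how the heuristics above could be added as soft terms in a z3 objective; `pitches` are integer-valued MIDI pitch variables, and in the full encoding the constraint-satisfaction reward from $\Phi_c$ would be added to the same objective.

```python
from z3 import And, If, Int, Optimize, Sum

def zabs(e):
    """Absolute value for z3 arithmetic expressions."""
    return If(e >= 0, e, -e)

def heuristic_penalty(pitches):
    """Soft penalties for the music-theory heuristics described above."""
    terms = []
    for i in range(len(pitches) - 1):
        step = zabs(pitches[i + 1] - pitches[i])
        terms.append(If(step >= 4, 1, 0))  # large melodic jump
        terms.append(If(step == 6, 2, 0))  # tritone interval
    for i in range(len(pitches) - 2):
        big1 = zabs(pitches[i + 1] - pitches[i]) >= 5
        big2 = zabs(pitches[i + 2] - pitches[i + 1]) >= 5
        terms.append(If(And(big1, big2), 2, 0))  # consecutive big jumps
    return Sum(terms)

pitches = [Int(f"pitch_{i}") for i in range(8)]
opt = Optimize()
for p in pitches:
    opt.add(p >= 60, p <= 84)  # constrain to a plausible MIDI range
opt.minimize(heuristic_penalty(pitches))
opt.check()
```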

B Evaluation Details

B.1 Experimental Setup

Generating c. To generate c, we use an LSTM-VAE with 2 LSTM layers and a latent size of 200 in the music domain and 50 in the poetry domain. This model is trained to reproduce a given sequence of $(s_i, r_i)$ pairs given as input, with the additional requirement that the distribution of their encodings should be roughly equivalent to a standard Gaussian distribution. Each $(s_i, r_i)$ pair is represented as an $(S + |\mathcal{R}|)$-dimensional vector, where S is the maximum distance between subcomponents with the same prototype and $\mathcal{R}$ is the set of relations.

Evaluating low-level structure. We evaluate low-level structure using the negative log likelihood according to a deep generative model. In the music domain, we use both MusicAutoBot, a transformer based on the Transformer-XL architecture (Dai et al., 2019), and the MusicVAE from Magenta, which is a hierarchical VAE that learns embeddings for each measure and then learns an LSTM-VAE on top of these embeddings. In the poetry domain, we use BERT and GPT2, both finetuned on our dataset after being pretrained on non-poetry data. For the evaluation metrics that explicitly capture structure, we computed the optimal program A(x) for every example x in the held-out human validation dataset as well as for all of the generated examples.

Evaluating high-level structure. We evaluate high-level structure by using our algorithm to synthesize the relational constraints in every generated example, i.e., $C_{\mathrm{gen}} = \{A(x) \mid x \in X_{\mathrm{gen}}\}$, where $X_{\mathrm{gen}}$ is the set of examples generated using a model. Similarly, we construct $C_{\mathrm{human}} = \{A(x) \mid x \in X_{\mathrm{human}}\}$, where $X_{\mathrm{human}}$ is the set of human-created examples held out from the training dataset. Then, we evaluate high-level structure by training a model to try to discriminate $C_{\mathrm{gen}}$ from $C_{\mathrm{human}}$; if the model achieves lower performance, then the quality of the high-level structure is higher. A general approach is to train a graph neural network (e.g., a graph convolutional network) to do so; this model takes as input the graph structure of the relational constraints c, along with vector embeddings of the prototype subcomponents, and outputs whether $c \in C_{\mathrm{gen}}$ or $c \in C_{\mathrm{human}}$. We balance the data so it consists of 50% human data and 50% generated data. We report the cross-entropy (CE) loss; higher values correspond to better generative models. In the music domain, we additionally used a random forest (RF) trained on a manual featurization of c. We report the accuracy of the RF; lower values (i.e., closer to 50%) correspond to better generative models.
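As a short sketch of the RF side of this evaluation (our own illustration using scikit-learn), `featurize` stands in for the handcrafted featurization of Appendix B.4, and `C_gen` and `C_human` are lists of synthesized constraint graphs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rf_discriminator_accuracy(C_gen, C_human, featurize):
    """Accuracy of an RF telling generated constraint graphs from human
    ones; an accuracy near 0.5 means the generated high-level structure
    is indistinguishable from human structure."""
    n = min(len(C_gen), len(C_human))  # balance the two classes
    X = np.array([featurize(c) for c in C_gen[:n] + C_human[:n]])
    y = np.array([0] * n + [1] * n)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
```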

B.2 Musical Relations Used

The following are the relations $r \in \mathcal{R}$ used in the music domain (two of these are sketched in code below):

1. Measures i and j have the same pitch classes.
2. Measures i and j have the same pitch class prefix.
3. Measures i and j have the same pitch class suffix.
4. Measures i and j's pitches have an edit distance of 1.
5. Measures i and j have approximately the same interval structure.
6. Measures i and j have the same interval prefix.
7. Measures i and j have the same interval suffix.
8. Measures i and j have the same note (pitch + duration) prefix.
9. Measures i and j have the same note (pitch + duration) suffix.
10. Measures i and j have the same rhythm.
11. Measures i and j's rhythms have an edit distance of ≤ 2.
12. Either measure i's onsets are a subset of measure j's onsets, or measure j's onsets are a subset of measure i's onsets.
13. Measures i and j have the same rhythmic and melodic contour.
14. Measures i and j have the same rhythmic and melodic contour prefix.
15. Measures i and j have the same rhythmic and melodic contour suffix.
16. Either the first or second half of measures i and j are identical.
17. Either both or neither of measures i and j have leaps.
18. Measures i and j fit within the same diatonic scale.
19. Either both or neither of measures i and j have syncopation.
20. Either both or neither of measures i and j have consecutive notes shorter than an eighth note.
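Here are illustrative implementations of a few of the relations above, under a simple (pitch, duration) measure encoding of our own choosing; the paper's exact definitions (e.g., the prefix length in relation 6) may differ.

```python
from typing import List, Tuple

Note = Tuple[int, float]  # (MIDI pitch, duration in beats)

def same_rhythm(m1: List[Note], m2: List[Note]) -> bool:
    """Relation 10: identical duration sequences."""
    return [d for _, d in m1] == [d for _, d in m2]

def same_pitch_classes(m1: List[Note], m2: List[Note]) -> bool:
    """Relation 1: identical sets of pitch classes (pitch mod 12)."""
    return {p % 12 for p, _ in m1} == {p % 12 for p, _ in m2}

def same_interval_prefix(m1: List[Note], m2: List[Note], k: int = 3) -> bool:
    """Relation 6: the first k melodic intervals agree."""
    iv1 = [b[0] - a[0] for a, b in zip(m1, m1[1:])][:k]
    iv2 = [b[0] - a[0] for a, b in zip(m2, m2[1:])][:k]
    return len(iv1) >= k and iv1 == iv2
```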

B.3 Poetry Relations Used

The following are the relations $r \in \mathcal{R}$ used in the poetry domain:

1. Lines i and j have the same end rhyme.
2. Lines i and j have the same meter.
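One plausible implementation of these two relations, using the `pronouncing` package (a thin interface to the CMU pronouncing dictionary), is sketched below; the paper does not specify its exact rhyme and meter tests, so this is illustrative.

```python
import string
import pronouncing

def _last_word(line: str) -> str:
    return line.split()[-1].strip(string.punctuation).lower()

def same_end_rhyme(line1: str, line2: str) -> bool:
    """Relation 1: the final words share a rhyming part."""
    p1 = pronouncing.phones_for_word(_last_word(line1))
    p2 = pronouncing.phones_for_word(_last_word(line2))
    if not p1 or not p2:
        return False  # out-of-vocabulary word: treat as non-rhyming
    return pronouncing.rhyming_part(p1[0]) == pronouncing.rhyming_part(p2[0])

def same_meter(line1: str, line2: str) -> bool:
    """Relation 2: identical concatenated stress patterns."""
    def stress_pattern(line: str) -> str:
        pattern = []
        for w in line.split():
            phones = pronouncing.phones_for_word(
                w.strip(string.punctuation).lower())
            if not phones:
                return ""  # unknown word: give up on this line
            pattern.append(pronouncing.stresses(phones[0]))
        return "".join(pattern)
    s1, s2 = stress_pattern(line1), stress_pattern(line2)
    return bool(s1) and s1 == s2
```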

B.4 Random Forest Features

The following are the manually constructed features used in the random forest discriminator for the music domain:

1. Mean number of relations between prototype and sequence measures.
2. Variance of the number of relations between prototype and sequence measures.
3. Variance in the histogram of prototype measure mappings.
4. Length of the longest run $i, \ldots, j$ such that $w_i, \ldots, w_j$ all have the same prototype measure.
5. Number of pairs (i, j) such that $\tilde{w}_i = \tilde{w}_j$ and $\tilde{w}_{i+1} = \tilde{w}_{j+1}$.
6. Mean distance between two measures with the same prototype.
7. Variance in distance between two measures with the same prototype.
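A sketch of how a few of these features might be computed from a prototype assignment (our own illustration; the remaining features follow the same pattern):

```python
from collections import Counter
from statistics import mean, variance

def featurize(assign: dict) -> list:
    """assign maps measure index i -> prototype index (e.g., from K*)."""
    m = len(assign)
    # Feature 4: longest run of consecutive measures sharing a prototype.
    longest = run = 1
    for i in range(1, m):
        run = run + 1 if assign[i] == assign[i - 1] else 1
        longest = max(longest, run)
    # Feature 5: pairs (i, j) with matching consecutive prototype pairs.
    counts = Counter((assign[i], assign[i + 1]) for i in range(m - 1))
    repeats = sum(c * (c - 1) // 2 for c in counts.values())
    # Features 6-7: distances between measures sharing a prototype.
    dists = [j - i for i in range(m) for j in range(i + 1, m)
             if assign[i] == assign[j]]
    return [longest, repeats,
            mean(dists) if dists else 0.0,
            variance(dists) if len(dists) > 1 else 0.0]
```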

B.5 Comparison to Constraint Solving

We also considered a comparison to a constraint-based implementation called Motifate, which pays explicit attention to the development of musical material (Muhammad Faisal, 2017). This approach was designed with heuristics for 3-beat measures, while our evaluation models expected 4-beat measures, so we could not obtain NLL scores. Nevertheless, we found that even its structure was insufficient: our RF discriminator had accuracy 0.91, and our GCN discriminator had cross-entropy loss 0.43, both of which are significantly worse than the other approaches.

B.6 Qualitative Results

Music domain. In addition to quantitative measurements, we evaluated the strengths and weaknesses of our approach using A2 (which was the best according to quantitative metrics). According to our observations, the strengths of A2 include clearer phrases with obvious resolutions, likely and plausibly repetitive rhythms, intervals between notes which seemed plausible but not overly repetitive, and less variance in quality. However, the results were not very rhythmically diverse, and certain idiomatic patterns of resolution, in intervals between notes and at the ends of phrases, were not followed. Furthermore, AttentionRNN does better in terms of creating realistic chord progressions (we did not explicitly consider chord progressions in our model; doing so is a promising direction for future work). Finally, while global structure is much better than the baselines, examples still relatively infrequently had the full four-bar repetitions characteristic of much folk music.

Figure 3: A song generated using our approach A3 (top), and a nearly identical song (bottom) generated after part of the sampled relational constraints c was manually modified. These pieces were generated using A3 with the same reference measures $\tilde{w}$, but $\Phi_c$ was slightly perturbed (the similarity relations were changed).

Examples. Here we show how user modifications can occur in the music setting. By explicitly modifying c, we are able to generate two pieces with similar internal patterns but with different structural characteristics. We show an example of a song generated using our approach with each of A1, A2, and A3 in Figure 4, Figure 5, and Figure 6, respectively, and show an example generated using each of the baselines MusicVAE-16, AttentionRNN, MusicAutoBot, and StructureNet in Figures 7, 8, 9, and 10, respectively. Qualitatively, the generated music and poetry appear plausible, exhibiting realistic high-level structure without sacrificing low-level structure.

Figure 4: An example of a song generated using our approach (A1). Measures that have the same prototype are shown in the same color. Note the existence of repeating four-bar phrases, found commonly in folk songs.

We also give examples of poetry generated using our baselines: GPT2 finetuned and optimized for rhyme and meter in Figure 12, BERT finetuned as a language generation model in Figure 13, RichLyrics in Figure 14, and our ablation (i.e., BERT in conjunction with a uniformly randomly sampled $\Phi_c$) in Figure 15.

Figure 12: A poem generated using GPT2-Opt. It is more plausible than BERT in terms of global structure, which may be due to the fact that GPT2 is a better text generation tool than BERT, but it is still somewhat repetitive and its structure is not very human-like.

Figure 13: A poem generated using BERT. It is clearly overly repetitive and not very semantically coherent, and lacks high-level structure.



¹ We use a fixed m to simplify our exposition; our approach trivially extends to variable m.



Figure 1: Top: Process for training. For each training example x, our algorithm uses program synthesis to infer the relational constraints $c_x = A(x)$ present in x. Then, it (i) uses $c_x$ to train $p_\phi(c) = \mathbb{E}_{z \sim p(\cdot)}[p_\phi(c \mid z)]$, and (ii) uses $(c_x, x)$ to train $p_\theta(x \mid c)$. Bottom: Process for generating a sample x from the learned models $p_\phi(c \mid z)$ and $p_\theta(x \mid c)$. Lines with the same prototype are shown in the same color; metrical constraints are represented as purple edges and rhyme constraints as green edges.

Figure 2: Left: Poetry generated using relational constraints $c \sim p_\phi(\cdot)$. Middle: Poetry generated using a user-modified variant of c, where the last two lines share a prototype with the two lines before them. Right: A poem generated by GPT2 optimized to maximize rhyme and meter. The colors indicate the relations synthesized by our algorithm after the examples were generated.

Figure 5: An example of a song generated using our approach (A2). Measures that have the same prototype are shown in the same color. Note the existence of clear phrase endings marked by long notes or rests, particularly the recurring pattern of fast notes resolving into long notes.

Figure 6: An example of a song generated using our approach (A3). Measures that have the same prototype are shown in the same color. The existence of two-bar and three-bar phrases is apparent, but the close note and rhythm similarities among different prototypes weaken the overall clarity of the song's melody.

Figure 7: An example of a song generated using Magenta's hierarchical MusicVAE model finetuned on our dataset. While the local structure is extremely coherent, it does not seem to possess the expected internal repetition/development.

Figure 8: An example of a song generated using AttentionRNN trained on our dataset. Note the existence of erratic rhythms and unclear structure, which are common traits of custom-trained AttentionRNN models.

Figure 9: An example of a song generated using MusicAutoBot. Note the repetitive nature and stark contrast between the first half and second half of the song, which are common problems with transformer models.

Figure 10: An example of a song generated using StructureNet. While some degree of internal structure is apparent, and the local coherence is high, the pattern of internal repetition seems fairly arbitrary.

Table 1: Results for the music domain. To evaluate low-level structure, we use negative log likelihood according to MusicAutoBot, Magenta's version of MusicVAE designed for hierarchical 16-bar melodies, and GraphVAE. For high-level structure, we use the accuracy of the random forest ("RF Disc.") and the GCN cross-entropy loss ("GCN Disc."). The best (non-human) score in each column is bolded; the human score is italicized if best. We also bold the model that achieves the best NLL on the held-out human data.



Figure 14: A poem generated using RichLyrics. While it is less repetitive than non-conditioned BERT, it is still not very semantically coherent, and lacks high-level structure.

Figure 15: A poem generated using our ablation. While it is much more coherent, it lacks the idiomatic rhyme and meter structure of our approach.

