MARKUP-TO-IMAGE DIFFUSION MODELS WITH SCHEDULED SAMPLING

Abstract

Building on recent advances in image generation, we present a fully data-driven approach to rendering markup into images. The approach is based on diffusion models, which parameterize the distribution of data using a sequence of denoising operations on top of a Gaussian noise distribution. We view the diffusion denoising process as a sequential decision-making process, and show that it exhibits compounding errors similar to the exposure bias issue in imitation learning problems. To mitigate these issues, we adapt the scheduled sampling algorithm to diffusion training. We conduct experiments on four markup datasets: mathematical formulas (LaTeX), table layouts (HTML), sheet music (LilyPond), and molecular images (SMILES). These experiments verify the effectiveness of the diffusion process and of scheduled sampling for fixing generation issues. The results also show that the markup-to-image task provides a useful controlled compositional setting for diagnosing and analyzing generative image models.

1. INTRODUCTION

Recent years have witnessed rapid progress in text-to-image generation with the development and deployment of pretrained image/text encoders (Radford et al., 2021; Raffel et al., 2020) and powerful generative processes such as denoising diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020). Most existing image generation research focuses on generating realistic images conditioned on possibly ambiguous natural language (Nichol et al., 2021; Saharia et al., 2022; Ramesh et al., 2022). In this work, we instead study the task of markup-to-image generation, where the presentational markup specifies exactly, one-to-one, what the final image should look like.

While markup-to-image generation can be accomplished with standard renderers, we argue that the task has several properties that make it a good benchmark for evaluating and analyzing text-to-image generation models. First, the deterministic nature of the problem enables exposing and analyzing generation issues in a setting with known ground truth. Second, the compositional nature of markup languages is nontrivial for neural models to capture, making the task a challenging benchmark for relational properties. Finally, a model-based markup renderer enables interesting applications such as markup compilers that are resilient to typos, or that even allow mixing natural and structured commands (Glennie, 1960; Teitelman, 1972).

We build a collection of markup-to-image datasets, shown in Figure 1: mathematical formulas, table layouts, sheet music, and molecules (Nienhuys & Nieuwenhuizen, 2003; Weininger, 1988). These datasets can be used to assess the ability of generation models to produce coherent outputs in a structured environment. We then experiment with diffusion models, which represent the current state of the art in conditional generation of realistic images, on these tasks. The markup-to-image challenge exposes a new class of generation issues.
For example, when generating formulas, current models produce perfectly formed output, but often generate duplicate or misplaced symbols (see Figure 2). This type of error is similar to the widely studied exposure bias issue in autoregressive text generation (Ranzato et al., 2015). To help the model fix this class of errors during the generation process, we propose to adapt scheduled sampling (Bengio et al., 2015). Experiments on all four datasets show that the proposed scheduled sampling approach improves generation quality compared to baselines, and produces images of surprisingly good quality for these tasks. Models produce clearly recognizable images for all domains, and often do very well at representing the semantics of the task. Still, there is more to be done to ensure faithful and consistent generation in these difficult deterministic settings. All models, data, and code are publicly available at https://github.com/da03/markup2im.
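As a rough illustration of the adaptation sketched above, the following NumPy snippet shows one way scheduled sampling can be grafted onto diffusion training: with some probability, the training input y_t is obtained by rolling the model back from its own sample at step t+k, rather than by noising the ground-truth image. The linear schedule, the function names, and the toy zero-noise denoiser are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear noise schedule (illustrative hyperparameters).
T = 100
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(y0, t):
    """Forward process: sample y_t ~ q(y_t | y_0) in closed form."""
    eps = rng.standard_normal(y0.shape)
    return np.sqrt(alpha_bars[t]) * y0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def p_step(model, yt, t):
    """One reverse (denoising) step using the model's noise prediction."""
    eps_hat = model(yt, t)
    mean = (yt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:  # no noise is added at the final step
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(yt.shape)
    return mean

def scheduled_sampling_input(model, y0, t, p_model=0.5, k=1):
    """With probability p_model, produce y_t by denoising the model's own
    y_{t+k} (exposing it to its own predictions); otherwise sample y_t
    from the ground truth as in standard diffusion training."""
    if t + k < T and rng.uniform() < p_model:
        y = q_sample(y0, t + k)
        for s in range(t + k, t, -1):
            y = p_step(model, y, s)
        return y
    return q_sample(y0, t)

# Toy denoiser stub (assumption): always predicts zero noise.
toy_model = lambda yt, t: np.zeros_like(yt)
y0 = rng.uniform(-1.0, 1.0, size=(16,))
yt = scheduled_sampling_input(toy_model, y0, t=10)
```

The denoiser would then be trained to predict the noise in this y_t, so that at test time it has already seen inputs drawn from its own reverse process.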

2. MOTIVATION: DIFFUSION MODELS FOR MARKUP-TO-IMAGE GENERATION

Task. We define the task of markup-to-image generation as converting a source in a markup language describing an image into that target image. The input is a sequence of M tokens x = (x_1, …, x_M) ∈ X, and the target is an image y ∈ Y ⊆ R^{H×W} of height H and width W (for simplicity we consider only grayscale images here). Rendering is defined as a mapping f : X → Y. Our goal is to approximate the rendering function with a model f_θ : X → Y, parameterized by θ, trained on supervised examples {(x_i, y_i) : i ∈ {1, 2, …, N}}. To make the task tangible, we show several examples of (x, y) pairs in Figure 1.

Challenge. The markup-to-image task has several challenging properties that are not present in other image generation benchmarks. While the images are much simpler, they behave more discretely than typical natural images. Layout mistakes by the model can propagate errors throughout the image; for example, including an extra mathematical symbol can push everything one line further down. Some datasets also exhibit long-term symbolic dependencies, which may be difficult for non-sequential models to handle, analogous to some of the challenges observed in non-autoregressive machine translation (Gu et al., 2018).
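To make the diffusion setup behind f_θ concrete, the following is a minimal NumPy sketch of the closed-form forward (noising) process used in denoising diffusion models (Ho et al., 2020). The linear schedule, constants, and the toy flattened "image" are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear noise schedule over T steps (illustrative values).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product, monotonically decreasing

def q_sample(y0, t, eps):
    """Closed-form forward process: y_t ~ q(y_t | y_0)."""
    return np.sqrt(alpha_bars[t]) * y0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Toy grayscale 'image' y_0 in [-1, 1], flattened to H * W = 64 values.
y0 = rng.uniform(-1.0, 1.0, size=(64,))
eps = rng.standard_normal(64)
yt = q_sample(y0, t=T // 2, eps=eps)

# A denoiser eps_theta(y_t, t, x), conditioned on the markup x, would be
# trained with the simple objective of Ho et al. (2020):
#     L = E[ || eps - eps_theta(y_t, t, x) ||^2 ]
```

As t approaches T, alpha_bars[t] approaches zero and y_t approaches pure Gaussian noise; generation then runs this process in reverse, conditioned on the markup.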

[Figure 1 panel markup excerpts: Table — HTML span styling around the cell content "f j"; Sheet Music — LilyPond source beginning \relative c'' { \time 4/4 … }; Molecules — SMILES string COc1ccc(cc1N)C(=O)Nc2ccccc2]
Figure 1: Markup-to-Image suite with generated images. Tasks include mathematical formulas (LaTeX), table layouts (HTML), sheet music (LilyPond), and molecular images (SMILES). Each example is conditioned on a markup (bottom) and produces a rendered image (top). Evaluation directly compares the rendered image with the ground truth image.



