CHIRODIFF: MODELLING CHIROGRAPHIC DATA WITH DIFFUSION MODELS

Abstract

Generative modelling over continuous-time geometric constructs, a.k.a. chirographic data such as handwriting, sketches and drawings, has so far been accomplished through autoregressive distributions. Such strictly-ordered discrete factorization, however, falls short of capturing key properties of chirographic data: it fails to build a holistic understanding of the temporal concept due to one-way visibility (causality). Consequently, temporal data has been modelled as discrete token sequences of fixed sampling rate instead of capturing the true underlying concept. In this paper, we introduce a powerful model-class, namely Denoising Diffusion Probabilistic Models (DDPMs), for chirographic data that specifically addresses these flaws. Our model, named "CHIRODIFF", being non-autoregressive, learns to capture holistic concepts and therefore remains resilient to higher temporal sampling rates to a large extent. Moreover, we show that many important downstream utilities (e.g. conditional sampling, creative mixing) can be flexibly implemented with CHIRODIFF. We further show that some unique use-cases, such as stochastic vectorization, de-noising/healing and abstraction, are also possible with this model-class. We evaluate our framework quantitatively and qualitatively on relevant datasets and find it to be better than or on par with competing approaches.

1. INTRODUCTION

Chirographic data like handwriting, sketches and drawings are ubiquitous in modern-day digital content, thanks to the widespread adoption of touch screens and other interactive devices (e.g. AR/VR sets). While supervised downstream tasks on such data, like sketch-based image retrieval (SBIR) (Liu et al., 2020; Pang et al., 2019), semantic segmentation (Yang et al., 2021; Wang et al., 2020) and classification (Yu et al., 2015; 2017), continue to flourish due to high commercial demand, unsupervised generative modelling remains comparatively under-explored. Recently, however, with the advent of large-scale datasets, generative modelling of chirographic data has started to gain traction. Specifically, models have been trained on generic doodle/drawing data (Ha & Eck, 2018), or more "specialized" entities like fonts (Lopes et al., 2019), diagrams (Gervais et al., 2020; Aksan et al., 2020), SVG icons (Carlier et al., 2020) etc. Building unconditional neural generative models not only allows understanding the distribution of chirographic data but also enables further downstream tasks (e.g. segmentation, translation) by means of conditioning. So far, learning neural models over continuous-time chirographic structures has been facilitated broadly by two different representations: grid-based raster images and vector graphics. The raster format, the de-facto representation for natural images, has served as an obvious choice for chirographic structures (Yu et al., 2015; 2017). The static nature of this representation, however, does not provide the means for modelling the underlying creative process that is inherent in drawing.
"Creative models", powered by topology-specific vector formats (Carlier et al., 2020; Aksan et al., 2020; Ha & Eck, 2018; Lopes et al., 2019; Das et al., 2022), on the other hand, are specifically motivated to mimic this dynamic creation process. They build distributions p θ (X) of a chirographic entity X (e.g., a sketch) with a specific topology (drawing direction, stroke order etc.). The majority of creative models are designed with autoregressive distributions (Ha & Eck, 2018; Aksan et al., 2020; Ribeiro et al., 2020). This design choice is primarily due to vector formats having variable lengths, which is elegantly handled by autoregression. Doing so, however, restricts the model from gaining full visibility of the data and fails to build a holistic understanding of temporal concepts. A simple demonstration of its latent-space interpolation confirms this hypothesis (Figure 2). The other possibility is to drop the ordering/sequentiality of the points entirely, treat chirographic data as 2D point-sets and use prominent techniques from 3D point-cloud modelling (Luo & Hu, 2021a; b; Cai et al., 2020). However, the point-set representation does not fit chirographic data well due to its inherently unstructured nature. In this paper, with CHIRODIFF, we find a sweet spot and propose a framework that uses a non-autoregressive density while retaining the data's sequential nature. Another factor in traditional neural chirographic models that limits representation is the effective handling of temporal resolution. Chirographic structures are inherently continuous-time entities, as rightly noted by Das et al. (2022). Prior works like SketchRNN (Ha & Eck, 2018) modelled continuous-time chirographic data as a discrete token sequence or motor program.
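To make the contrast concrete, the two density designs can be written side by side. For a sketch with points x_1, ..., x_T, autoregressive creative models impose a strictly-ordered factorization, whereas a diffusion model defines the density through a learned reverse process that denoises the whole sequence jointly (the notation below is illustrative, not taken from the paper):

```latex
% Autoregressive factorization: each point only sees its predecessors.
p_\theta(X) = \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{<t}\right)

% Diffusion (DDPM) alternative: X^{(K)} is pure noise; every reverse step
% conditions on the ENTIRE noisy sequence, giving bidirectional visibility.
p_\theta(X) = \int p\!\left(X^{(K)}\right)
  \prod_{k=1}^{K} p_\theta\!\left(X^{(k-1)} \mid X^{(k)}\right) dX^{(1:K)}
```

The key structural difference is that in the diffusion case the conditioning variable is a full (noisy) sequence rather than a causal prefix, which is what allows holistic temporal modelling.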
Due to limited visibility, these models have no means to accommodate different sampling rates and are therefore specialized to one specific temporal resolution (the one seen during training), losing the spatial/temporal scalability essential for digital content. Even though there have been attempts (Aksan et al., 2020; Das et al., 2020) to directly represent continuous-time entities with their underlying geometric parameters, most of them still retain some form of autoregression. Recently, SketchODE (Das et al., 2022) approached this problem by using Neural ODEs (abbreviated as NODEs) (Chen et al., 2018) to represent the time-derivative of continuous-time functions. However, the computationally restrictive nature of NODE's training algorithm makes it extremely hard to train and adopt beyond simple temporal structures. CHIRODIFF, having visibility of the entire sequence, is capable of implicitly modelling the sampling rate from data and is consequently robust in learning the continuous-time temporal concept that underlies the discrete motor program. In that regard, CHIRODIFF outperforms Das et al. (2022) significantly by adopting a model-class superior in terms of computational cost and representational power while training on similar data. We chose Denoising Diffusion Probabilistic Models (abbr. DDPMs) as the model-class due to their spectacular ability to capture both diversity and fidelity (Ramesh et al., 2021; Nichol et al., 2022). Furthermore, Diffusion Models are gaining significant popularity and are nearly replacing GANs in a wide range of visual synthesis tasks due to their stable training dynamics and generation quality. A surprising majority of existing work on Diffusion Models is based on, or specialized to, grid-based raster images, leaving important modalities like sequences behind.
Even though there are some isolated works on modelling sequential data, they have mostly treated it as fixed-length entities (Tashiro et al., 2021). Our proposed model, in that regard, is one of the first to exhibit the potential of applying Diffusion Models to continuous-time entities. To this end, our generative model generates X by transforming a discretized Brownian motion with unit step size. We consider learning a stochastic generative model for continuous-time chirographic data in both an unconditional (samples shown in Figure 1) and conditional manner. Unlike autoregressive models, CHIRODIFF offers a way to draw conditional samples from the model without an explicit encoder.
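A minimal sketch of the forward diffusion mechanics on a point sequence may help fix ideas. The function names, the linear variance schedule and the toy polyline below are our illustrative assumptions, not the paper's implementation; the point is that every coordinate of the stroke is noised jointly, so after enough steps the sequence is close to the Gaussian noise from which the learned reverse process starts:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule; alpha_bar[t] = prod_{s<=t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, noise=None):
    """Forward diffusion: noise an entire point sequence x0 of shape (N, 3)
    at step t in one shot -- no autoregressive, point-by-point factorization."""
    if noise is None:
        noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# A toy 5-point stroke: (x, y, pen-state) triplets.
x0 = np.stack([np.linspace(0, 1, 5), np.linspace(0, 1, 5) ** 2, np.ones(5)], axis=1)
betas, alpha_bar = make_schedule()
xT = q_sample(x0, t=999, alpha_bar=alpha_bar)  # nearly pure Gaussian noise
```

A trained reverse model would invert this corruption step by step, conditioning each denoising step on the whole noisy sequence (and optionally on a latent code for conditional sampling).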



Figure 1: Unconditional samples from CHIRODIFF trained on VMNIST, KanjiVG and Quick, Draw!.

Figure 2: Latent space interpolation (Top) with CHIRODIFF using DDIM sampler and (Bottom) with auto-regressive model. CHIRODIFF's latent space is much more effective with compositional structures for complex data.

