SKETCHKNITTER: VECTORIZED SKETCH GENERATION WITH DIFFUSION MODELS

Abstract

We show that vectorized sketch generation can be identified as the reversal of a stroke deformation process. This relationship is established by a diffusion model that learns data distributions over the stroke-point locations and pen states of real human sketches. Given randomly scattered stroke points, sketch generation becomes a process of deformation-based denoising, in which the generator rectifies the positions of stroke points at each timestep so as to converge at a recognizable sketch. A key innovation is to embed recognizability into the reverse-time diffusion process. We observe that the noise estimated during reversal is strongly correlated with sketch classification accuracy, and we consequently use an auxiliary recurrent neural network (RNN) to quantify recognizability during data sampling. Based on the recognizability scores, we further devise a sampling shortcut function that renders better-quality sketches with fewer sampling steps. Finally, we show that the model can easily be extended to a conditional generation framework: given an incomplete or unfaithful sketch, it yields one that is more visually appealing and more recognizable.

1. INTRODUCTION

Free-hand human sketches are abstract depictions that can efficiently express ideas. Generative models for sketches have received increasing attention in recent years. Compared with producing pixelated sketches (Ge et al., 2020; Chen et al., 2001; Liu et al., 2020), modeling sketches as point trajectories is more reasonable and appealing, as it more closely resembles the human drawing process. Sketch-RNN (Ha & Eck, 2018) utilizes a set of discrete stroke points and binary pen states as an approximation of the continuous drawing trajectory. BézierSketch (Das et al., 2020) makes use of a parametric representation, fitting the stroke trajectory with Bézier curves. Very recently, SketchODE (Das et al., 2021a) applies neural ordinary differential equations to represent the stroke trajectory through continuous-time functions. All of these approaches, however, suffer from an inability to model complex vectorized sketches. This is largely attributed to the de facto RNN backbone, which falls short in accommodating large numbers of stroke points: the rule of thumb is that anything beyond 200 points will fail (Pascanu et al., 2013; Das et al., 2021b). In this paper, we attempt to change the status quo in how stroke-point trajectories are modeled. Instead of seeing sketch generation as a process of determining where the next stroke point lies at each recurrent step (as per RNN), we estimate the distributions of all stroke points holistically at each time instance. As every knitting enthusiast will tell you, it is all about having a global plan, never just about the next thread!1 Our key novelty lies in the realization that sketch generation can be conceptualized as the reversal of a stroke deformation process. By modeling a forward deformation process (i.e., sketch to noise), our diffusion model learns the stroke-point distributions of real human sketches, and is thus able to reverse the process to generate novel sketches from noisy input.
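The forward deformation process is the standard DDPM forward process applied to stroke-point coordinates rather than pixels. A minimal sketch in NumPy, assuming a cosine noise schedule and a sequence of (dx, dy) offsets; the paper's actual schedule and its handling of pen states are not specified here, so both are illustrative assumptions:

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    # Cumulative noise schedule (cosine variant, a common DDPM choice;
    # the schedule used in the paper is an assumption here).
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def deform(points, t, T, rng):
    # Forward "stroke deformation": the stroke-point offsets are diffused
    # toward isotropic Gaussian noise, i.e., a sample from q(x_t | x_0).
    a_bar = cosine_alpha_bar(t, T)
    noise = rng.standard_normal(points.shape)
    return np.sqrt(a_bar) * points + np.sqrt(1 - a_bar) * noise, noise

rng = np.random.default_rng(0)
sketch = rng.standard_normal((200, 2))  # toy sequence of 200 (dx, dy) offsets
x_t, eps = deform(sketch, t=500, T=1000, rng=rng)
```

At t = 0 the sketch is untouched (alpha-bar is 1), while at t = T it is pure noise; reversing this trajectory is what the paper dubs sketch knitting.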
Given this diffusion setup, the sequential information in sketches can be preserved by simply maintaining the temporal ordering of stroke points during reverse-time diffusion. We further emphasize the overall quality, i.e., recognizability, of the generated sketches. We show that the noise estimated at each timestep of the sampling stage naturally reflects the recognizability of the generated sketch. A learnable RNN is therefore devised to explicitly model the relation between estimated noise and recognizability, using a pretrained image classifier as the supervision signal. Embedding recognizability into the sampling process also yields the added benefit of enabling skip steps, allowing for more efficient and effective data generation. This matters because the early stage of generating sequential data is very inefficient under vanilla DDPM sampling (Ho et al., 2020), as witnessed in Figure 1(b): recognizability improves only marginally over a long period of sampling, as unveiled in Figure 1(c). Last but not least, we demonstrate that the model can readily be used, without retraining, to remedy defects in sketches caused by unfaithful or incomplete drawing, by incorporating instance-aware guidance into data sampling. Motivated by recent work on guided diffusion models (Dhariwal & Nichol, 2021; Ho & Salimans, 2022), gradients of the perceptual similarity (Zhang et al., 2018) between the generated data and the conditional sketch are incorporated during sampling to guide the noise prediction, thereby influencing the sample obtained at each timestep. The goal is to enforce visual similarity to the conditional, flawed sketch, while producing a result that is more appealing and recognizable after reverse-time diffusion. Our contributions can be summarized as follows: (i) Denoising diffusion models are exploited for sketch generation in vector format.
The generative model learns a distribution over stroke-point locations via a deformation-based denoising process that starts from noise. (ii) The quality, i.e., recognizability, of the generated sketches is quantified by leveraging knowledge of the noise estimated during sampling. This is achieved by devising an auxiliary RNN, trained under the supervision of a pretrained image classifier, to predict the recognizability of a generated sketch at timestep t from the corresponding estimated noise. (iii) A shortcut sampling path can be discovered through a simple skip strategy based on the learned quality-measurement network, allowing faster and more effective generation with little trade-off in data quality. (iv) Instance-aware guidance built on a perceptual metric is embedded into the reverse-time diffusion, enabling our model to recover distorted or corrupted sketches without retraining.
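The instance-aware guidance of contribution (iv) follows the classifier-guidance pattern: the predicted noise is shifted by the gradient of a score with respect to the current sample. The sketch below substitutes a simple differentiable similarity for the perceptual (LPIPS-style) metric used in the paper; `grad_sim`, `scale`, and the sign convention are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def guided_eps(eps_hat, x_t, y_ref, grad_sim, a_bar, scale=1.0):
    # Guidance in the style of Dhariwal & Nichol (2021): shift the predicted
    # noise against the gradient of a similarity score between the current
    # sample x_t and a reference sketch y_ref, so denoising drifts toward
    # the reference. grad_sim(x, y) returns d(similarity)/dx; any
    # differentiable proxy can stand in for the perceptual metric.
    return eps_hat - np.sqrt(1.0 - a_bar) * scale * grad_sim(x_t, y_ref)

# Toy proxy: similarity = -||x - y||^2, so its gradient w.r.t. x is -2(x - y).
grad_sim = lambda x, y: -2.0 * (x - y)

rng = np.random.default_rng(0)
y_ref = rng.standard_normal((200, 2))  # flawed conditional sketch (toy)
x_t = y_ref + 0.5                      # current sample, offset from reference
eps = guided_eps(np.zeros_like(x_t), x_t, y_ref, grad_sim, a_bar=0.5)
# eps is positive wherever x_t exceeds y_ref, so subtracting it in the
# DDPM update moves the next sample toward the reference sketch.
```

Because the guidance term only touches the sampling step, no retraining of the diffusion model is required, matching the claim in contribution (iv).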

2. RELATED WORKS

Sketch Generation There is a rich literature on sketch generation. Early works (Guo et al., 2007; Li et al., 2019) leverage edge maps as a substitute for sketches. Coupled



Figure 1: (a) Built on diffusion models, sketch generation is formulated as a stroke deformation rectification problem. In essence, the goal is to reorganize points with fixed adjacency, moving them from a total mess onto meaningful locations, which we dub sketch knitting. Note that the order of stroke points is pre-determined and unchanged. (b) Sketches generated by our model at different deformation levels, associated with timesteps. The early sampling stage is inefficient: the obtained data change little and remain noisy. (c) Early sampling is inefficient because essentially the same estimated noise is applied to every stroke point, so the sketch remains noisy and unrecognizable. Different noise patterns can be observed at different timesteps, which motivates our recognizability-based skipping (blue △) based on the estimated noise to find a shortcut sampling path. The red solid curve denotes the mean and the shaded region the variance; 1000 generated sketches are used for plotting.
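The recognizability-based skipping marked in Figure 1(c) can be sketched as a greedy schedule: take large steps while the predicted recognizability gain per step is flat (the inefficient early stage), and fall back to fine steps otherwise. The threshold, step sizes, and toy recognizability curve below are illustrative assumptions; in the paper the score comes from the auxiliary RNN applied to the estimated noise:

```python
import numpy as np

def skip_schedule(recog_fn, T=1000, gain_thresh=1e-3, base_skip=1, fast_skip=20):
    # Walk t from T down to 0, skipping ahead whenever the predicted
    # recognizability gain of the next step is negligible.
    # recog_fn(t) stands in for the paper's auxiliary RNN predictor.
    steps, t = [], T
    while t > 0:
        steps.append(t)
        gain = recog_fn(t - 1) - recog_fn(t)  # recognizability rises as t falls
        t -= fast_skip if gain < gain_thresh else base_skip
    steps.append(0)
    return steps

# Toy recognizability curve: flat at large t (early sampling), rising later,
# qualitatively matching the mean curve in Figure 1(c).
recog = lambda t: 1.0 / (1.0 + np.exp((t - 300) / 50.0))
path = skip_schedule(recog, T=1000)
```

The resulting path visits far fewer timesteps than vanilla sampling while concentrating fine-grained steps where recognizability actually changes, which is the intended trade-off of contribution (iii).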

