SKETCHKNITTER: VECTORIZED SKETCH GENERATION WITH DIFFUSION MODELS

Abstract

We show that vectorized sketch generation can be identified as a reversal of the stroke deformation process. This relationship is established by a diffusion model that learns data distributions over the stroke-point locations and pen states of real human sketches. Given randomly scattered stroke points, sketch generation becomes a process of deformation-based denoising, where the generator rectifies the positions of stroke points at each timestep so as to converge at a recognizable sketch. A key innovation is to embed recognizability into the reverse-time diffusion process. We observe that the noise estimated during the reversal process is strongly correlated with sketch classification accuracy, and we consequently use an auxiliary recurrent neural network (RNN) to quantify recognizability during data sampling. Based on the recognizability scores, we further devise a sampling shortcut function that renders better-quality sketches with fewer sampling steps. Finally, we show that the model can be easily extended to a conditional generation framework, where, given an incomplete or unfaithful sketch, it yields one that is more visually appealing and more recognizable.

1. INTRODUCTION

Free-hand human sketches are abstract depictions that can efficiently express ideas. Generative models for sketches have received increasing attention in recent years. Compared with producing pixelated sketches (Ge et al., 2020; Chen et al., 2001; Liu et al., 2020), modeling sketches as point trajectories is more reasonable and appealing, as it more closely resembles the human drawing process. Sketch-RNN (Ha & Eck, 2018) utilizes a set of discrete stroke points and binary pen states as an approximation of the continuous drawing trajectory. BézierSketch (Das et al., 2020) makes use of a parametric representation, fitting the stroke trajectory with Bézier curves. Very recently, SketchODE (Das et al., 2021a) applied neural ordinary differential equations to represent stroke trajectories through continuous-time functions. All of these approaches, however, suffer from the inability to model complex vectorized sketches. This is largely attributed to the de facto RNN backbone, which falls short in accommodating large numbers of stroke points: as a rule of thumb, anything beyond 200 points will fail (Pascanu et al., 2013; Das et al., 2021b). In this paper, we attempt to change the status quo in how stroke-point trajectories are modeled. Instead of seeing sketch generation as a process of determining where the next stroke point lies at each recurrent step (as per RNNs), we estimate the distributions of all stroke points holistically at each time instance; as every knitting enthusiast will tell you, it is all about having a global plan, never just about the next thread! Our key novelty lies in the realization that sketch generation can be conceptualized as the reversal of a stroke deformation process. By modeling a forward deformation process (i.e., sketch to noise), our diffusion model learns the stroke-point distributions of real human sketches, and is thus able to reverse the process to generate novel sketches from noisy input.
Given this diffusion setup, the sequential information in sketches can be preserved by simply maintaining the temporal ordering of stroke points during reverse-time diffusion. We further emphasize the overall quality (recognizability) of the generated sketches. We show that the noise estimated in the sampling stage naturally reflects the recognizability of the generated sketch at each timestep. A learnable RNN is therefore devised to explicitly model the relation between the estimated noise and recognizability, using a pretrained image classifier as a supervision signal. Embedding recognizability into the sampling process also yields the added benefit of introducing skip steps that allow more efficient and effective data generation. This matters because the early stages of generating sequential data are very inefficient under vanilla DDPM sampling (Ho et al., 2020), as witnessed in Figure 1(b), resulting in only minor improvements in recognizability over a long period of sampling, as unveiled in Figure 1(c). Last but not least, we demonstrate that the model (without retraining) can readily be used to remedy defects in sketches caused by unfaithful or incomplete drawing, by incorporating instance-aware guidance into data sampling. Motivated by recent works on guided diffusion models (Dhariwal & Nichol, 2021; Ho & Salimans, 2022), gradients of the perceptual similarity (Zhang et al., 2018) between the generated data and the conditional sketch are incorporated during sampling to guide the noise prediction, thereby influencing the sample obtained at each timestep. The goal is to enforce visual similarity to the conditional, flawed sketch, while producing a result that is more appealing and recognizable after reverse-time diffusion. Our contributions can be summarized as follows: (i) Denoising diffusion models are exploited for sketch generation in vector format.
The generative model learns the distribution over stroke-point locations via a deformation-based denoising process that starts from noise. (ii) The quality, i.e., recognizability, of the generated sketches is quantifiable by leveraging the estimated noise during sampling. This is achieved by devising an auxiliary RNN, trained under the supervision of a pre-trained image classifier, to predict the recognizability of a generated sketch at timestep t from the corresponding estimated noise. (iii) A shortcut sampling path can be discovered through a simple skip strategy based on the learned quality measurement network. This allows faster and more effective generation with little trade-off in data quality. (iv) Instance-aware guidance built on a perceptual metric is embedded into the reverse-time diffusion. It enables our model to recover distorted or corrupted sketches without retraining.

2. RELATED WORK

Sketch Generation With deep learning, much progress has been made recently. In particular, generative adversarial networks (GANs) have motivated extensive work on sketch synthesis, including doodle-sketch generation (Ge et al., 2020), image-to-sketch translation (Liu et al., 2020), colored sketch rendering (Rathod et al., 2021), pencil-shading sketch generation (Li et al., 2020b), and face sketch synthesis (Wang et al., 2020). However, these models all perform pixel-based generation, which is fundamentally different from how humans sketch objects with pens or brushes. Towards modeling sketches like humans, sketches are instead treated as sequential pen actions, and RNN-based variational autoencoders (VAEs) (Ha & Eck, 2018; Zhang et al., 2017; Graves, 2013), reinforcement learning (RL) (Xie et al., 2013; Zheng et al., 2018; Ganin et al., 2018), Transformer-based sketch representations (Ribeiro et al., 2020; Lin et al., 2020), parametric Bézier curves (Das et al., 2020; 2021b), and neural ODEs (Das et al., 2021a) have been explored for sketch generation.
Aksan et al. (2020) propose a relational model built on an auto-encoder, which can decompose a sketch formed as a single temporal sequence into a group of unordered strokes; promising generative results on complex structures have been witnessed. Other notable works include generating stylized line drawings from 3D shapes (Liu et al., 2021) and intent communication through sketching via a referential communication game (Mihai & Hare, 2021). Diffusion Models Recently, diffusion models have delivered impressive results on several generative tasks, including image generation (Ho et al., 2020; Dhariwal & Nichol, 2021), shape generation (Cai et al., 2020), 3D shape modelling (Luo & Hu, 2021), audio synthesis (Kong et al., 2020), and cross-domain generation (Popov et al., 2021; Nichol et al., 2021). Different from likelihood-based models (variational auto-encoders (VAEs) (Kingma & Welling, 2013), normalizing flow models (Dinh et al., 2014; Papamakarios et al., 2021), energy-based models (EBMs) (LeCun et al., 2006)) and implicit generative models (GANs (Goodfellow et al., 2014)), diffusion models can be categorized as score-based generative modeling (SGM) (Song et al., 2020), which aims to model the gradient of the log probability density function via score matching (Hyvärinen & Dayan, 2005). There are two popular sub-classes of SGMs, i.e., score matching with Langevin dynamics (SMLD) (Song & Ermon, 2019) and denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020). Despite their appeal, little work with diffusion models targets sketches. The most relevant work to ours is Diff-HW (Luhman & Luhman, 2020), which also applies diffusion models to sequential data generation, i.e., handwriting. However, Diff-HW adopts vanilla DDPMs and focuses on text-to-sketch translation.
In contrast, we offer diffusion models with built-in quality quantification, an improved sampling strategy, and gradient-guided conditional sampling based on DDIM.

3. DIFFUSION MODELS FOR VECTORIZED SKETCH GENERATION

Our objective is to generate sketch stroke sequences from noise via a novel method built on denoising diffusion implicit models (DDIMs). DDIMs generalize DDPMs by introducing a non-Markovian diffusion process that nevertheless has the same forward marginals as DDPMs. Uniquely, we propose a method to find a shortcut sampling trajectory based on the recognizability of a generated sketch. Moreover, we present how to rectify a flawed sketch on-the-fly during the generative process with our trained unconditional sketch generation model.

3.1. PROBLEM SETUP

We construct a sketch as a sequence $s_0 = \{s_1, s_2, \ldots, s_N\}$ using the representation from Ha & Eck (2018), i.e., stroke-3. Each point $s_i$ is represented as a 3-D vector $(\Delta x_i, \Delta y_i, g_i)$, where $(\Delta x_i, \Delta y_i)$ gives the offset of point $i$ along the pen's moving trajectory, and $g_i$ is a binary pen state denoting whether the pen is touching the paper or not. Our goal is to learn the probability distribution of the offsets $\{(\Delta x_i, \Delta y_i)\}$ from the training data with diffusion models. A sketch can then be drawn given the estimated $(\Delta x_i, \Delta y_i)$ for each point and the corresponding pen state $g_i$ inferred by a pen-state network. We describe more details in the following sections.
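As an illustration, a stroke-3 sequence can be decoded into absolute pen positions by accumulating the offsets. The helper below is our own sketch (not the paper's code), assuming an (N, 3) array layout of (Δx, Δy, g) rows:

```python
import numpy as np

def strokes_to_absolute(s0):
    """Convert a stroke-3 sequence (dx, dy, pen_state) into absolute (x, y) points.

    `s0` is an (N, 3) array; pen state g_i is 1 while the pen touches the paper.
    """
    s0 = np.asarray(s0, dtype=float)
    xy = np.cumsum(s0[:, :2], axis=0)   # accumulate offsets -> absolute coordinates
    pen = s0[:, 2] > 0.5                # binary pen state per point
    return xy, pen

# A tiny 3-point "sketch": move right, move down, then jump with the pen lifted.
sketch = [(1.0, 0.0, 1), (0.0, 1.0, 1), (2.0, 2.0, 0)]
xy, pen = strokes_to_absolute(sketch)
```

Rendering a sketch then amounts to drawing line segments between consecutive points whose pen state is on.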

3.2. SKETCH DIFFUSION IN FORWARD PROCESS

During the forward process, noise is gradually added to the point offsets of an original sketch, resulting in increased stroke distortion over time. Given $N$ ordered stroke points of a sketch $s_0 = \{s_1, s_2, \ldots, s_N\}$, the offset $(\Delta x_i, \Delta y_i)$ of each point $s_i$ is assumed to be sampled independently from a distribution $q(s_0)$. To diffuse $s_0$ into $s_1, \ldots, s_T$, the Markov diffusion process introduced in DDPMs is applied. Formally, the forward process makes each point offset $(\Delta x_i, \Delta y_i)$ in $s_0$ drift along both the x and y coordinates by gradually adding noise sampled from Gaussian distributions with a pre-defined schedule $\alpha_1, \ldots, \alpha_T$. The Markov chain of the forward process is thus defined by:

$q(s_{1:T}|s_0) := \prod_{t=1}^{T} q(s_t|s_{t-1}), \quad q(s_t|s_{t-1}) := \mathcal{N}\!\left(s_t;\ \sqrt{\tfrac{\alpha_t}{\alpha_{t-1}}}\, s_{t-1},\ \left(1-\tfrac{\alpha_t}{\alpha_{t-1}}\right) I\right).$ (1)
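Because the Gaussian kernels compose, $s_t$ can also be sampled directly from $s_0$ in closed form, $q(s_t|s_0) = \mathcal{N}(\sqrt{\alpha_t}\, s_0, (1-\alpha_t) I)$ in the cumulative-α notation used here. A minimal NumPy sketch of this diffusion step (our own illustrative code, not the authors' implementation):

```python
import numpy as np

def diffuse_offsets(s0, alpha_t, rng=None):
    """Sample s_t ~ q(s_t | s_0) = N(sqrt(alpha_t) * s_0, (1 - alpha_t) * I).

    `s0` holds the (N, 2) point offsets; `alpha_t` is the cumulative schedule
    value at step t (decreasing from ~1 toward 0, DDIM-style notation).
    """
    rng = np.random.default_rng(rng)
    s0 = np.asarray(s0, dtype=float)
    eps = rng.standard_normal(s0.shape)          # the noise the network must predict
    s_t = np.sqrt(alpha_t) * s0 + np.sqrt(1.0 - alpha_t) * eps
    return s_t, eps

offsets = np.array([[1.0, 0.0], [0.0, 1.0]])     # toy 2-point sketch offsets
s_t, eps = diffuse_offsets(offsets, alpha_t=0.9, rng=0)
```

At alpha_t near 1 the sketch is barely perturbed; as alpha_t approaches 0 the offsets become pure Gaussian scatter.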

3.3. DDIM BASED GENERATIVE PROCESS

Following Song et al. (2021), DDIM-based sampling is adopted to generate sketches from noise, i.e., the reverse process, as it normally achieves high data quality with significantly fewer sampling steps. More importantly, the non-Markovian sampling process supports discovering a novel shortcut sampling function that is more efficient and effective for our sketch generation. Formally, the generative process is defined as follows:

$q_\sigma(s_{t-1}|s_t, s_0) = \mathcal{N}\!\left(\sqrt{\alpha_{t-1}}\, s_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot \frac{s_t - \sqrt{\alpha_t}\, s_0}{\sqrt{1-\alpha_t}},\ \sigma_t^2 I\right).$ (2)

With DDIM we can predict $s_0$ through the generative process

$p_\theta^{(t)}(s_{t-1}|s_t) = \begin{cases} \mathcal{N}(f_\theta^{(1)}(s_1), \sigma_1^2 I) & \text{if } t = 1,\\ q_\sigma(s_{t-1}|s_t, f_\theta^{(t)}(s_t)) & \text{otherwise,} \end{cases}$ (3)

where $f_\theta^{(t)}(s_t) = (s_t - \sqrt{1-\alpha_t}\cdot \epsilon_\theta^{(t)}(s_t))/\sqrt{\alpha_t}$ is a prediction of $s_0$ based on our noise approximator $\epsilon_\theta^{(t)}$. We can then draw a data sample $s_0$ from random noise $s_T \sim \mathcal{N}(0, I)$ by iteratively applying

$s_{t-1} = \sqrt{\alpha_{t-1}}\, \frac{s_t - \sqrt{1-\alpha_t}\, \epsilon_\theta^{(t)}(s_t)}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot \epsilon_\theta^{(t)}(s_t).$ (4)

More detailed derivations of DDPMs, and of how DDIMs generalize DDPMs by defining a non-Markovian forward process and obtaining the corresponding generative process in Eq. 3, can be found in Appendix A.1 and Appendix A.2.
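A single deterministic update of Eq. 4 (σ_t = 0) can be sketched as follows; `eps_pred` stands in for the learned noise approximator, and the function is an illustrative rendering of the equations rather than the authors' code:

```python
import numpy as np

def ddim_step(s_t, eps_pred, alpha_t, alpha_prev, sigma_t=0.0, rng=None):
    """One reverse DDIM update (Eq. 4): move from s_t to s_{t-1}.

    With sigma_t = 0 the update is deterministic; `eps_pred` plays the role of
    the noise approximator evaluated at (s_t, t).
    """
    rng = np.random.default_rng(rng)
    # f_theta: predict s_0 from the current sample and the estimated noise.
    s0_pred = (s_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    # "Direction pointing to s_t" term, plus optional stochastic noise.
    direction = np.sqrt(max(1.0 - alpha_prev - sigma_t**2, 0.0)) * eps_pred
    noise = sigma_t * rng.standard_normal(np.shape(s_t))
    return np.sqrt(alpha_prev) * s0_pred + direction + noise

# Sanity demo: if eps_pred is the exact noise that produced s_t from s_0,
# a single jump to alpha_prev = 1 recovers s_0 exactly.
s0 = np.array([[1.0, 2.0], [0.5, -1.0]])
eps_true = np.array([[0.5, -0.5], [0.2, 0.1]])
a_t = 0.5
s_t = np.sqrt(a_t) * s0 + np.sqrt(1.0 - a_t) * eps_true
recovered = ddim_step(s_t, eps_true, alpha_t=a_t, alpha_prev=1.0)
```

In actual sampling, eps_pred would come from the trained network and alpha_prev from the previous schedule entry, so many small steps are taken instead of one exact jump.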

3.4. NOISE APPROXIMATOR ϵ (t) θ

In our case, the noise approximator $\epsilon_\theta^{(t)}(s_t)$ in Eq. (4) is a trainable network that estimates the noise $\epsilon^{(t)} \in \mathbb{R}^{N\times 2}$ over coordinate offsets for $s_t \in \mathbb{R}^{N\times 2}$ at timestep $t$. An improved U-Net is developed to handle the point sequence of a vectorized sketch. Namely, we add a trainable embedding layer and a decoding layer to the U-Net, transforming the input $s_t$ into an embedding $e_t \in \mathbb{R}^{N\times 128}$ and converting the penultimate feature embedding back to coordinates. The rest of the design follows the conventional U-Net for 2D images. Please refer to Appendix A.3 for details.
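The added embedding and decoding layers amount to per-point linear maps wrapped around the U-Net body. A shape-level sketch (the U-Net body itself is omitted, and all weight shapes and the point count are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 96, 128                                   # assumed point count and embedding width
W_embed = rng.standard_normal((2, d)) * 0.02     # trainable embedding layer (assumed shape)
W_decode = rng.standard_normal((d, 2)) * 0.02    # decoding layer back to 2-D offsets

s_t = rng.standard_normal((N, 2))                # noisy point offsets at some timestep t
e_t = s_t @ W_embed                              # (N, 128) embedding fed to the U-Net body
# ... U-Net body omitted: it processes e_t as a length-N feature sequence ...
eps_hat = e_t @ W_decode                         # penultimate features -> estimated noise
```

The key point is purely dimensional: the network consumes and produces (N, 2) arrays, so the same 2D-image U-Net machinery applies once the points are lifted into the 128-dimensional embedding space.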

3.5. RECOGNIZABILITY BASED SHORTCUT SAMPLING

As shown in Figure 1(c), we observe that at the early sampling steps the recognizability of the generated $s_t$ is consistently low, even though the amount of estimated noise is large, if we take the full reverse trajectory of length $T$ used in DDPMs. In contrast, denoising becomes much more efficient and effective at later steps, leading to a noticeable leap in recognizability. We therefore hypothesize that the recognizability of a generated sample can be inferred from the pattern of the denoising sequence over all stroke points, i.e., $\epsilon_\theta^{(t)} \in \mathbb{R}^{N\times 2}$. We thus incorporate an additional trainable network to predict the recognizability $\hat{r}_t$ given the estimated noise:

$\hat{r}_t = h_\phi(\epsilon_\theta^{(t)}, t).$ (5)

The prediction $\hat{r}_t$ can then be used as a signal to find a shortcut sampling path: we skip $m$ sampling steps if $\hat{r}_t < \zeta$, which indicates a period of ineffective sampling. Here $m$ is a constant giving the interval of steps to skip, and $\zeta$ is a predefined threshold. The timestep sampling function can therefore be formulated as

$t_n = \begin{cases} t_c - m, & \text{if } h_\phi(\epsilon_\theta^{(t_c)}, t_c) < \zeta,\\ t_c - 1, & \text{otherwise,} \end{cases}$ (6)

where $t_c$ is the current timestep and $t_n$ the next sampled timestep. Different from the linear or quadratic sub-sequence selection proposed in DDIM (Song et al., 2021), our sub-sequence is chosen adaptively, which can make it more effective and shorter. In practice, $h_\phi(\cdot)$ is implemented as a bi-directional RNN, as in Ha & Eck (2018), with the estimated noise sequence as input. The output latent vector $z \in \mathbb{R}^d$ is fed into a trainable linear layer to predict $\hat{r}_t$. The ground truth $r_t$ is obtained from an extra pre-trained sketch classifier, as done in Song et al. (2018); namely, we use the maximum probability of the softmax prediction as $r_t$. Once trained, $h_\phi(\cdot)$ can conveniently be used to inspect how recognizable a sketch is during sampling, without rendering it into an image. Please refer to Appendix A.4 for more analysis and insights about the shortcut sampling.
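The skip rule of Eq. 6 can be simulated end-to-end with a stand-in for h_ϕ. The schedule builder below is an illustrative sketch, with `fake_h` playing the role of the recognizability net (its behavior, a low score early and a high score late, mimics Figure 1(c)):

```python
def shortcut_schedule(T, h_phi, m=10, zeta=0.5):
    """Build the adaptive timestep sub-sequence of Eq. 6.

    `h_phi(t)` stands in for the recognizability net evaluated on the estimated
    noise at step t: while its score stays below the threshold `zeta`, m steps
    are skipped at a time; otherwise sampling proceeds one step at a time.
    """
    steps, t = [], T
    while t > 0:
        steps.append(t)
        t = t - m if h_phi(t) < zeta else t - 1
    return steps

# Toy recognizability proxy: sketches only become recognizable late in sampling.
fake_h = lambda t: 0.0 if t > 100 else 0.9
path = shortcut_schedule(T=1000, h_phi=fake_h, m=10)
```

Under this proxy the schedule races through the unrecognizable region in strides of m and slows to single steps once recognizability crosses ζ, which is exactly the intended behavior.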

3.6. PEN STATE ESTIMATION

The diffusion model learns the distribution of the coordinate offsets of sketch points. However, a binary pen state must also be predicted for each stroke point. Following Luhman & Luhman (2020), the feature vector $v \in \mathbb{R}^{N\times 128}$ from the penultimate layer of the above U-Net for noise approximation is utilized. We feed this feature sequence into another trainable linear layer followed by a sigmoid function to predict each point's pen state:

$\hat{g} = \mathrm{sigmoid}(f_\psi(v)),$ (7)

where $\psi$ are the parameters of the trainable linear layer. When $\hat{g}_i > 0.5$, the pen is touching the canvas at point $i$. We do this at each timestep of the generative process.
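Eq. 7 reduces to a linear map plus a sigmoid and a 0.5 threshold. A toy NumPy version (the weights, bias, and feature values are made up for illustration; the real f_ψ operates on 128-dimensional U-Net features):

```python
import numpy as np

def predict_pen_states(v, W, b):
    """Map penultimate features v of shape (N, d) to binary pen states (Eq. 7).

    W of shape (d,) and scalar b play the role of the trainable linear layer
    f_psi; a sigmoid plus a 0.5 threshold yields the touching/lifted decision.
    """
    logits = v @ W + b
    g_hat = 1.0 / (1.0 + np.exp(-logits))   # sigmoid probabilities per point
    return g_hat, g_hat > 0.5

v = np.array([[2.0, 0.0], [-3.0, 1.0]])     # toy 2-point feature sequence, d = 2
g_hat, touching = predict_pen_states(v, W=np.array([1.0, 1.0]), b=0.0)
```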

3.7. TRAINING OBJECTIVE OF ϵ (t) θ

To train our noise approximator $\epsilon_\theta^{(t)}$, we follow Luhman & Luhman (2020) and jointly minimize the denoising loss and the pen state loss. Specifically, we train $\epsilon_\theta^{(t)}$ following DDIMs (Song et al., 2021) to minimize the L2 discrepancy between the estimated noise $\epsilon_\theta^{(t)}(s_t)$ and the noise $\epsilon^{(t)}$ generated in the forward process:

$L_d(\theta) = \mathbb{E}\,\|\epsilon^{(t)} - \epsilon_\theta^{(t)}(s_t)\|_2^2.$ (8)

Additionally, at each timestep $t$, $\epsilon_\theta$ is also optimized by minimizing the pen state loss

$L_p(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[-g_i \log(\hat{g}_i) - (1-g_i)\log(1-\hat{g}_i)\right],$ (9)

where $\hat{g}$ is predicted using Eq. 7. In summary, the total training loss is

$L(\theta) = L_d(\theta) + \gamma L_p(\theta),$ (10)

where $\gamma$ is a weight balancing the two losses.
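The denoising loss of Eq. 8 and the pen-state cross-entropy combine into a single scalar objective. The snippet below is a minimal NumPy rendering, with γ set to an arbitrary illustrative value:

```python
import numpy as np

def total_loss(eps_true, eps_pred, g_true, g_hat, gamma=0.5):
    """Joint objective L = L_d + gamma * L_p.

    L_d is the squared L2 error between true and estimated noise (Eq. 8);
    L_p is the per-point binary cross-entropy on pen states. The value of
    gamma here is illustrative, not the paper's setting.
    """
    l_d = np.mean(np.sum((eps_true - eps_pred) ** 2, axis=-1))
    g_hat = np.clip(g_hat, 1e-7, 1.0 - 1e-7)     # numerical safety for the log
    l_p = -np.mean(g_true * np.log(g_hat) + (1 - g_true) * np.log(1 - g_hat))
    return l_d + gamma * l_p

# Perfect noise prediction, slightly imperfect pen states -> small positive loss.
eps = np.zeros((4, 2))
loss = total_loss(eps, eps, g_true=np.ones(4), g_hat=np.full(4, 0.99))
```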

3.8. CONDITIONAL GENERATION TO RECTIFY BAD SKETCHES

The above describes unconditional sketch generation. We now present how to rectify a flawed sketch using our model trained with unconditional generation only. Inspired by the classifier guidance widely used in diffusion models (Dhariwal & Nichol, 2021; Ho & Salimans, 2022), a sketch generation guidance is introduced into the generative process. Different from existing works that use the gradients of an image classifier as guidance, a perceptual metric is employed to generate guidance: the gradient of the log probability of the perceptual distance between a generated $s_t$ and a condition $s_c$, denoted $\nabla_{s_t} \log p(s_t, s_c)$. The estimated noise is then refined as

$\hat{\epsilon}^{(t)}(s_t) = \epsilon_\theta^{(t)}(s_t) + \eta \nabla_{s_t} \log p(s_t, s_c),$ (11)

where $\eta$ controls the strength of the guidance. The sample $\hat{s}_{t-1}$ is then generated using Eq. 4 with the new noise $\hat{\epsilon}^{(t)}(s_t)$. Intuitively, we guide the generation of $s_0$ by repeatedly sampling $s_t$ that has similar content to $s_c$. We follow prior work (Zhang et al., 2018) based on the L2 distance between image features to measure perceptual similarity, as it is effective in transferring image content. Neural line rendering (NLR) (Li et al., 2020a) is used to convert a vector sketch into an image sketch so that the gradients can be propagated back to the vectorized sketch. After rasterizing a sketch into its image version, the perceptual metric is applied as

$p(s_t, s_c) = \sum_{l=1}^{L} \frac{1}{H_l W_l} \sum_{h,w} \left\|w_l \odot \left(F^l_{hw}(s_t) - F^l_{hw}(s_c)\right)\right\|_2^2,$ (12)

where $F^l(\cdot) \in \mathbb{R}^{H_l \times W_l \times C_l}$ denotes the feature maps of the $l$-th layer of an ImageNet pre-trained VGG (Simonyan & Zisserman, 2014), and $w_l \in \mathbb{R}^{C_l}$ scales the feature activations channel-wise as per Zhang et al. (2018). Details about NLR are provided in Appendix A.5.
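The guidance step can be sketched with a plain squared-L2 distance standing in for the VGG-based perceptual metric (a deliberate simplification; the helpers, η value, and sign convention of the gradient term are ours, not the paper's):

```python
import numpy as np

def perceptual_grad_l2(s_img, c_img):
    """Gradient of a plain squared-L2 distance between rasterized sketches.

    A stand-in for the VGG-feature perceptual distance: the gradient of
    ||s - c||^2 with respect to s is simply 2 * (s - c).
    """
    return 2.0 * (np.asarray(s_img) - np.asarray(c_img))

def guided_noise(eps_pred, s_img, c_img, eta=0.1):
    """Refine the estimated noise with instance-aware guidance.

    The guided noise is what gets plugged into the DDIM update in place of
    the raw network prediction, steering samples toward the condition.
    """
    return eps_pred + eta * perceptual_grad_l2(s_img, c_img)

eps = np.zeros((2, 2))
refined = guided_noise(eps, s_img=np.ones((2, 2)), c_img=np.zeros((2, 2)), eta=0.5)
```

In the paper's pipeline the rasterization is differentiable (via NLR), so the same chain rule carries the image-space gradient back to the stroke-point coordinates.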

4. EXPERIMENTS

In this section, we evaluate our model in two modes, i.e., unconditional and conditional generation, to verify the quality of the generated data and our model's ability to mend inferior sketches. Please refer to Appendix A.6 for implementation details.

4.1. UNCONDITIONAL GENERATION

Dataset. We evaluate our proposed method on QuickDraw (Ha & Eck, 2018), which contains over 50M sketches in vector format across 345 common categories. A subset of QuickDraw is collected for the experiments. Specifically, 10 classes are chosen according to the following principles: (i) Complexity: simple, moderate, and complex sketch drawings are all included, e.g., fish, umbrella, and lion; (ii) Diversity: objects with diverse sub-category variations are involved, e.g., bus and spider; (iii) Ambiguity: sketches belonging to different classes share highly similar appearance, e.g., apple and moon. The original data split is adopted, i.e., 70,000 training and 2,500 testing sketches per class. Competitors. Three current RNN-based state-of-the-art methods, i.e., SketchRNN (Ha & Eck, 2018), SketchPix2seq (Chen et al., 2017), and SketchHealer (Su et al., 2020), are included. Diff-HW (Luhman & Luhman, 2020), which is built on DDPMs, is enabled for comparison by tweaking its cross-modal attention layers into self-attention, since only sketches are available in our problem. Additionally, the unconditional mode of SketchODE (Das et al., 2021a) is also used for comparison. Evaluation metrics. We gauge the quality of the generated data using evaluation metrics for image generation, once the vector sketches are rasterized into images. Specifically, the Fréchet Inception Distance (FID) (Heusel et al., 2017) measures the distance between the generated (image) data and the real data by comparing the means and variances of image features, which are obtained from an Inception-V3 (Szegedy et al., 2016) trained on ImageNet (Krizhevsky et al., 2012) for image classification. The Geometry Score (GS) (Khrulkov & Oseledets, 2018) compares the geometrical properties of the data manifold between the generated and real data.
In addition, the improved precision and recall (Kynkäänniemi et al., 2019) are used as complementary evaluation metrics, following other image generation works (Nichol & Dhariwal, 2021).

Table 1 (excerpt): FID / GS / Prec / Rec on Simple, Moderate, and Complex; the rightmost column is speed. The row label of the first entry was lost in extraction.

(unlabeled row)                 7.4 / 3.9 / 0.47 / 0.87    8.9 / 5.2 / 0.44 / 0.85    10.5 / 6.1 / 0.39 / 0.81    0.08
Ours (Linear-DDIMs, S=30)       11.9 / 6.4 / 0.38 / 0.81   13.3 / 8.8 / 0.36 / 0.78   15.1 / 9.6 / 0.33 / 0.72    0.08
Ours (Quadratic-DDIMs, S=30)    12.3 / 6.6 / 0.41 / 0.79   13.8 / 8.7 / 0.35 / 0.76   15.4 / 9.9 / 0.34 / (truncated)

Quantitative results. As shown in Table 1, our sketch generator clearly outperforms the other competitors, suggesting better quality of the generated data. In particular, our model maintains relatively stable FID, GS, Precision, and Recall scores regardless of the complexity of the generated sketches. On the contrary, an obvious performance decline is witnessed for the other baselines when constructing sketches with more complicated structures.

Qualitative results.

Sampling trajectory. We compare different approaches for choosing the sampling trajectory, including full reverse steps, the linear and quadratic rules in DDIMs, and our recognizability-based skip function. Results in Table 1 show that our recognizability-based skipping (r-Shortcut) achieves the best results under the same total number of sampling steps, i.e., S = 30, suggesting the superiority of our sampling strategy. In addition, compressing the sampling steps from 1000 to 30 with our method deteriorates data quality only marginally, while a 16x speed-up is achieved. Effectiveness of h_ϕ(·). We further conduct experiments to test whether the learned network h_ϕ(·) in Eq. 5 faithfully reflects the recognizability of sketches generated during sampling. Specifically, we compare how well the predicted recognizability matches the probability of the correct class label given by the pre-trained classifier. These experiments are conducted using 10 single-class models, so we know which class probability in the classifier to compare against. Results in Table 2 show that low and stable error is achieved along the sampling steps. Impact of N. To study the impact of the point number N, we train our model under different settings. We can see from Table 3 that using too few points (N = 24) is inferior for modeling relatively complex structures, while using too many points (N = 384) is also sub-optimal. A single optimal N is hard to reach, as the complexity of sketch structure varies case by case. Do absolute coordinates work? To gain more insight into the importance of modeling relative coordinates, i.e., (∆x, ∆y), we train a model to learn from sketches represented by stroke points in absolute coordinates, i.e., (x, y), with all other settings/components unchanged, denoted "Ours (Abs)" in Table 1. A significant decrease in performance is observed, verifying the crucial role of training with relative coordinates. What is learned?
We suspect that capturing the implicit drawing structure is the key to success. A simple way to verify this speculation is to destroy the drawing structure. Specifically, we reorganize the original sketch data into a disordered version by randomly shuffling sketch segments (each sketch segment is formed by connecting two neighboring stroke points). Intuitively, the newly constructed sequential data discards the implicit interdependent relations among stroke points, as the corresponding segments are geometrically far from each other. We then obtain a variant model trained with such structure-broken data, denoted "Ours (Point-Shuffle)" in Table 1. We can see that performance is clearly harmed compared to the model trained on normal data. Furthermore, we also experiment with a model variant trained with a stroke-level shuffle tactic: the structural cues at stroke level are preserved, as each stroke remains complete, but the stroke order is randomly shuffled. Smaller declines are observed ("Ours (Stroke-Shuffle)" in Table 1), revealing that mid-level structural cues (i.e., complete strokes) are of great importance to our model. Note that full steps are taken for these two model variants.

Table 3: Impact of the point number N (full sampling, T = 1000); FID / GS / Prec / Rec per complexity level, plus speed.

          Simple                     Moderate                    Complex                     Speed
N=24      7.4 / 4.4 / 0.32 / 0.74    14.7 / 13.5 / 0.30 / 0.76   16.3 / 18.2 / 0.26 / 0.69   0.16
N=192     6.9 / 3.4 / 0.33 / 0.76    9.8 / 5.3 / 0.31 / 0.74     10.3 / 6.3 / 0.27 / 0.71    0.18
N=384     8.5 / 5.0 / 0.30 / 0.72    10.6 / 8.1 / 0.28 / 0.69    11.9 / 12.6 / 0.23 / 0.67   0.22

4.2. CONDITIONAL GENERATION

Experiments on sketch refinement and sketch healing are conducted to verify the effectiveness of our model's conditional sampling. We show that, given a conditional sketch with defects due to either stroke distortion or corruption, the generated sketch can be a correspondingly refined version. Dataset. We use the same data as in Section 4.1 for the sketch refinement task. To synthesize sketches with different degrees of distortion, random Gaussian noise e ∼ N(0, I) is added to each stroke-point location. The deformation degree can easily be controlled by adding noise over different numbers of timesteps t: because a real sketch is transformed into random noise after T diffusion steps, a sample obtained at intermediate step t is considered x% deformed if t = x%·T. The sketch healing task aims to create new sketches that resemble the given partial sketches. For fair comparison, we follow the same experimental setup as SketchHealer (Su et al., 2020): 17 categories from QuickDraw and the same data splits (70,000 for training and another 2,500 novel ones for testing per class) are adopted. Sketches are damaged using two mask ratios, p_mask = 10% or 30%, meaning that 10% or 30% of key stroke points are randomly removed to form the corrupted version of a complete sketch. Sketch classifier. A multi-category classifier built on AlexNet is pre-trained on the training set of all 345 QuickDraw categories. Intuitively, better recognizability of sketches is manifested by higher recognition accuracy from this pre-trained classifier. Following the common practice in Song et al. (2018); Su et al. (2020), acc@1 and acc@10 are adopted as evaluation metrics; acc@K denotes the accuracy with which the true positive is ranked within the top K predictions. SBIR model.
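The x%-deformation rule (t = x%·T) can be sketched with the closed-form forward diffusion; the schedule and helper below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def deform_sketch(s0, percent, alphas_cumprod, rng=None):
    """Create an x%-deformed sketch by diffusing to step t = percent * T.

    `alphas_cumprod` is the length-T cumulative schedule; indexing it at
    t = round(percent * T) - 1 gives the closed-form corruption level for
    that fraction of the full diffusion trajectory.
    """
    rng = np.random.default_rng(rng)
    T = len(alphas_cumprod)
    t = max(int(round(percent * T)) - 1, 0)
    a = alphas_cumprod[t]
    eps = rng.standard_normal(np.shape(s0))
    return np.sqrt(a) * np.asarray(s0, dtype=float) + np.sqrt(1.0 - a) * eps

# Toy cumulative schedule: per-step alphas decay from 0.999 to 0.98 over T=1000.
schedule = np.cumprod(np.linspace(0.999, 0.98, 1000))
noisy = deform_sketch(np.zeros((5, 2)), percent=0.3, alphas_cumprod=schedule, rng=0)
```

At percent = 100% the cumulative alpha is close to zero, so the "sketch" is essentially pure noise, matching the statement that a real sketch becomes random noise after T steps.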
Similarly, a sketch-based image retrieval (SBIR) model is pre-trained to verify whether the rectified sketches generated by our model enable better retrieval performance. Specifically, Triplet-SAN (Yu et al., 2016), constructed with Sketch-A-Net (Yu et al., 2017) as the backbone network, is trained on the QuickDraw Extended dataset (Dey et al., 2019) under the supervision of a triplet loss. Built on QuickDraw, QuickDraw Extended is the largest SBIR dataset, containing 330,000 sketches paired with 204,000 photos over 110 categories. In our case, the SBIR model is trained on the training set of the same 10 categories selected in Section 4.1; the remaining test sets of the 10 classes serve as the samples to be deformed. The evaluation metrics acc@1 and acc@10 measure whether a target image is ranked within the top 1/10 retrieval results; the mean average precision (mAP) is adopted as well. Results. As shown in Table 4, we observe that (i) for all cases involving noisy sketches (i.e., deformation levels from 10% to 50%), improvements in recognition and retrieval are achieved by using sketches after rectification, revealing the clear enhancement benefits of our model; (ii) the recognition and retrieval results are stable regardless of the deformation level tackled, e.g., recognition accuracy acc@1 stays around 47% to 50% for sketches with deformation 10% to 50%; (iii) interestingly, recognition and SBIR results can be further significantly increased even on the original human-drawn sketches after refinement. Exemplar refined sketches are demonstrated in Figure 3(a), revealing that badly drawn sketches (i.e., with missing parts, random strokes, and line distortions) can be largely rectified by our model, while other competitors are either unable to faithfully resemble the conditions or unable to model complex structures. Quantitative healing results are shown in Table 5.
We can see that our generator outperforms all the other competitors on sketch recognition in most cases (except the top-10 result when p_mask = 10%), indicating the superiority of our model in recovering incomplete sketches. The advantage of our method is further enlarged when the corruption level increases, i.e., p_mask = 30%. A similar situation is observed in the human study: the healed sketches given by our model are mostly preferred over those of the other baseline methods. Qualitative comparisons against Sketch-RNN, SketchODE, and BézierSketch are also provided in Figure 3(b).

5. CONCLUSION

We have shown for the first time that sketch generation can be formulated as a process of deformation-based denoising. The key finding is that monotonically increasing degrees of sketch deformation can be synthesized by diffusing stroke points with Gaussian noise, and the desired probability distribution over the stroke points of sketched objects can thus be effectively learned by diffusion inversion. Importantly, the ability to quantify the recognizability of a generated sketch is injected into sampling: an RNN is developed to predict the recognizability of a sampled sketch from the estimated noise at each timestep, enabling more efficient sampling via a recognizability-based skip function. Additionally, our model trained for unconditional generation can readily be extended to conditional generation by incorporating perceptual-similarity-based gradients into sampling. Extensive experiments validate the effectiveness of our model. Managing the abstraction level of the generated sketch is a promising direction for future work.

A APPENDIX

A.1 DENOISING DIFFUSION PROBABILISTIC MODELS (DDPM)

To learn the probability distribution over data $x$, diffusion models corrupt training data by slowly injecting noise and learn to reverse the corruption, so that the obtained models can gradually transform random noise into samples for data generation.

Forward process. Formally, for each training sample $x_0 \sim q_{\text{data}}(x_0)$, a discrete Markov chain $x_0, x_1, \ldots, x_T$ is formed by the forward process (also known as the diffusion process). This process is defined as a Markov chain that slowly adds Gaussian noise to the data according to a variance schedule $\beta_1, \ldots, \beta_T$:

$q(x_{1:T}|x_0) := \prod_{t=1}^{T} q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I).$ (14)

If we knew $q(x_{t-1}|x_t)$, we could sample from the data distribution $q(x_0)$ by first sampling $x_T$ from $q(x_T)$ (an isotropic Gaussian) and then sampling from $q(x_{t-1}|x_t)$ until we reach $x_0$. However, $q(x_{t-1}|x_t)$ is difficult to estimate, since doing so requires the entire dataset. Therefore, $p_\theta$ is introduced to approximate the conditional probabilities $q(x_{t-1}|x_t)$ in the backward process.

Backward process. In the backward/reverse process, diffusion models must denoise the perturbed data (starting from random noise $p(x_T) = \mathcal{N}(x_T; 0, I)$) back to the original data $x_0$.
Mathematically, a diffusion model is defined as p_θ(x_0) := ∫ p_θ(x_{0:T}) dx_{1:T}, in which the joint distribution p_θ(x_{0:T}) defines the reverse process:

p_θ(x_{0:T}) := p(x_T) ∏_{t=1}^{T} p_θ(x_{t-1} | x_t)    (16)

p_θ(x_{t-1} | x_t) := N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))    (17)

The training objective is to optimize a variational bound on the negative log-likelihood:

E[-log p_θ(x_0)] ≤ E_q[-log (p_θ(x_{0:T}) / q(x_{1:T} | x_0))] = E_q[-log p(x_T) - Σ_{t≥1} log (p_θ(x_{t-1} | x_t) / q(x_t | x_{t-1}))]

which is equivalent to optimizing the following variational lower bound L_vlb:

L_vlb := L_0 + L_1 + ... + L_{T-1} + L_T    (19)

L_0 := -log p_θ(x_0 | x_1)    (20)

L_{t-1} := D_KL(q(x_{t-1} | x_t, x_0) || p_θ(x_{t-1} | x_t))    (21)

L_T := D_KL(q(x_T | x_0) || p(x_T))    (22)

Essentially, the above KL terms compare two Gaussian distributions and can therefore be computed in closed form (Ho et al., 2020). The training objective for Eq (17) is to obtain μ_θ(x_t, t); Σ_θ(x_t, t) is not learned, as it is set to time-dependent constants σ_t² I. Furthermore, instead of predicting μ_θ(x_t, t) (the forward-process posterior mean) with a neural network, Ho et al. (2020) proposed to use an approximator ϵ_θ(x_t, t) to predict the noise ϵ from x_t, which proves more effective than optimizing μ_θ(x_t, t). The simplified training objective is:

L_simple(θ) := E_{t∼[1,T], x_0∼q(x_0), ϵ∼N(0,I)} [ ||ϵ - ϵ_θ(x_t, t)||² ]

Data sampling. Once trained, the neural network estimates the noise ϵ from a sample x_t at timestep t, i.e., ϵ_θ(x_t, t). Then μ_θ(x_t, t) can be derived from ϵ_θ(x_t, t) as:

μ_θ(x_t, t) = (1/√α_t) (x_t - ((1-α_t)/√(1-ᾱ_t)) ϵ_θ(x_t, t))

where α_t = 1 - β_t and ᾱ_t := ∏_{s=1}^{t} α_s. We can thus sample from p_θ(x_{t-1} | x_t) according to Eq (17) repeatedly until we reach x_0.
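The simplified loss L_simple and the ancestral sampling step built on μ_θ can be illustrated as follows. This is a minimal NumPy sketch under assumed settings (linear schedule, a placeholder zero-noise "network"); it is not the paper's training code.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def l_simple(eps_model, x0, t):
    """L_simple: || eps - eps_theta(x_t, t) ||^2, drawing x_t from the
    closed-form marginal x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

def ddpm_step(eps_model, xt, t):
    """One reverse step: mean mu_theta from the equation above,
    with variance sigma_t^2 = beta_t (and 1 - alpha_t = beta_t)."""
    eps_hat = eps_model(xt, t)
    mu = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mu
    return mu + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

# A placeholder "network" that predicts zero noise, just to exercise the code:
zero_model = lambda x, t: np.zeros_like(x)
x0 = rng.standard_normal((96, 2))
loss = l_simple(zero_model, x0, t=500)
x_prev = ddpm_step(zero_model, rng.standard_normal((96, 2)), t=500)
print(x_prev.shape)
```

With a zero-noise predictor the loss simply estimates E[ϵ²] ≈ 1, which makes the toy easy to sanity-check.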

A.2 DENOISING DIFFUSION IMPLICIT MODELS (DDIMS)

To improve sampling efficiency, the authors of DDIM (Song et al., 2021) proposed a novel non-Markovian process to reduce the number of forward and reverse steps of DDPMs. They observed a special property of the DDPM forward process,

q(x_t | x_0) = N(x_t; √α_t x_0, (1-α_t) I)

(in DDIM's notation, α_t corresponds to the cumulative product ᾱ_t of DDPM), such that the DDPM training objective need not depend directly on the joint q(x_{1:T} | x_0). They then derive

q(x_{t-1} | x_t, x_0) = N( √α_{t-1} x_0 + √(1-α_{t-1}-σ_t²) · (x_t - √α_t x_0)/√(1-α_t),  σ_t² I ),

where t > 1 and q(x_T | x_0) = N(√α_T x_0, (1-α_T) I). Based on Bayes' rule, the forward process can be written as

q_σ(x_t | x_{t-1}, x_0) = q_σ(x_{t-1} | x_t, x_0) q_σ(x_t | x_0) / q_σ(x_{t-1} | x_0),

i.e., the forward process is no longer Markovian, as each x_t depends on both x_{t-1} and x_0. For the generative process, a noise approximator ϵ_θ^{(t)} is trained, where ϵ_t ∼ N(0, I). Once the noise approximator is trained, one can generate a sample x_{t-1} from a sample x_t as

x_{t-1} = √α_{t-1} · (x_t - √(1-α_t) ϵ_θ^{(t)}(x_t)) / √α_t + √(1-α_{t-1}-σ_t²) · ϵ_θ^{(t)}(x_t) + σ_t ϵ_t

Accelerated generation process. The DDIM authors use a non-Markovian chain for the forward procedure, since q_σ(x_t | x_0) can be estimated directly. They therefore sample over a subsequence τ of [1, ..., T] to speed up the generative process.
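The DDIM update and its accelerated variant over a subsequence τ can be sketched as below. This is an illustrative NumPy toy under assumptions (linear schedule, a placeholder zero-noise network, a fixed-stride τ); the real model would supply a trained ϵ_θ.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)  # "alpha_t" in DDIM's notation

def ddim_step(eps_model, xt, t, t_prev, eta=0.0):
    """One DDIM update from timestep t to t_prev (deterministic when eta=0)."""
    a_t = alpha_bars[t]
    a_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    eps_hat = eps_model(xt, t)
    x0_pred = (xt - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)  # predicted x_0
    sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
    x_prev = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev - sigma**2) * eps_hat
    if eta > 0:
        x_prev += sigma * rng.standard_normal(xt.shape)
    return x_prev

# Accelerated generation over a subsequence tau of [0, ..., T-1]:
tau = list(range(T - 1, -1, -50))             # 20 updates instead of 1000
zero_model = lambda xt, t: np.zeros_like(xt)  # placeholder network
x = rng.standard_normal((96, 2))
for i, t in enumerate(tau):
    t_prev = tau[i + 1] if i + 1 < len(tau) else -1
    x = ddim_step(zero_model, x, t, t_prev)
print(x.shape)
```

The key design point is that each update jumps directly between non-adjacent timesteps, which is exactly what makes DDIM-style subsequence sampling fast.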

A.3 U-NET ARCHITECTURE

The architecture of our U-Net is shown in Figure 4. It consists of a stack of convolution blocks and average pooling for downsampling, followed by a stack of upsampling convolutions and convolution blocks, with skip connections concatenating feature maps at the same resolution. Additionally, a single-head attention layer is used to inject the timestep-t embedding into the convolution blocks during downsampling, following DDIM (Song et al., 2021).

Bi-directional RNN as noise approximator. Instead of the conventional convolutional U-Net, a different network architecture, i.e., a bi-directional RNN, is explored to estimate the noise. Specifically, given a sketch represented by a sequence of points s_t = {s_t^1, s_t^2, ..., s_t^N} where s_t ∈ R^{N×2}, the bi-directional RNN first maps each point s_t^i ∈ R^2 to a hidden state h_t^i ∈ R^d. All hidden states are then concatenated to construct the overall feature h_t ∈ R^{N×d} for the sketch. Finally, h_t is fed into a fully connected layer to predict the noise, i.e., ϵ_t ∈ R^{N×2}. Similarly, the pen states can be obtained from h_t by another FC layer. The results are shown in Table 6; the obtained performance does not surpass that of the U-Net.
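The bi-directional RNN baseline above can be sketched as follows. This is an untrained NumPy toy with a vanilla RNN cell and random weights; the actual hidden size, cell type, and timestep conditioning used in the paper are not specified here, and the timestep embedding is omitted for brevity.

```python
import numpy as np

class BiRNNNoiseApproximator:
    """Toy bi-directional RNN noise approximator: maps N stroke points in R^2
    to per-point hidden states in R^d for each direction, concatenates the two
    directions, and predicts per-point noise in R^2 via an FC layer."""
    def __init__(self, d=64, rng=None):
        rng = rng or np.random.default_rng(0)
        s = 0.1
        self.Wx = rng.standard_normal((2, d)) * s      # input -> hidden
        self.Wh = rng.standard_normal((d, d)) * s      # hidden -> hidden
        self.Wo = rng.standard_normal((2 * d, 2)) * s  # [fwd; bwd] -> noise
        self.d = d

    def _run(self, points):
        h = np.zeros(self.d)
        hs = []
        for p in points:                 # one recurrent pass over the sequence
            h = np.tanh(p @ self.Wx + h @ self.Wh)
            hs.append(h)
        return np.stack(hs)

    def __call__(self, xt, t):
        # t (timestep embedding) is ignored in this simplified toy.
        fwd = self._run(xt)              # left-to-right hidden states, N x d
        bwd = self._run(xt[::-1])[::-1]  # right-to-left hidden states, N x d
        h = np.concatenate([fwd, bwd], axis=1)  # overall feature, N x 2d
        return h @ self.Wo               # predicted noise, N x 2

model = BiRNNNoiseApproximator()
eps_hat = model(np.random.default_rng(1).standard_normal((96, 2)), t=10)
print(eps_hat.shape)
```

Note how the per-point output depends on the whole sequence through the two passes; this is the property that lets a bi-directional RNN serve as a drop-in noise approximator, even though, per Table 6, it underperforms the U-Net.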


Figure 5: Examples of the data sampling process (one example per row). Early steps (from T=1000 to about T=300) are often ineffective: the samples in the red box change slowly. In contrast, later steps rectify the sketches much more efficiently, as in the blue box.



Class list: moon, airplane, fish, umbrella, train, spider, shoe, apple, lion, bus. Diff-HW was originally proposed for stylized text-to-handwriting generation, requiring text and an image as conditions to control the content and style of the generated handwriting, respectively.



Figure 1: (a) Built on diffusion models, sketch generation is formulated as a stroke-deformation rectification problem. Essentially, our model reorganizes points with fixed adjacency, placing them at meaningful locations out of a total mess, dubbed sketch knitting. Note that the order of stroke points is pre-determined and unchanged. (b) Sketches generated by our model at different deformation levels associated with timesteps. The early sampling stage is inefficient: the obtained data change little and remain noisy. (c) Early sampling is inefficient because essentially the same estimated noise is applied to each stroke point, so the sketch remains noisy and unrecognizable. Different noise patterns can be observed at different timesteps, which motivates us to devise recognizability-based skipping (blue △) based on the estimated noise to find a sampling shortcut. The red solid curve denotes the mean and the shade denotes the variance; 1000 generated sketches are used for plotting.

Figure 2(a) shows examples of the reverse-time diffusion process: from random noise until the data sample is reached, the generated sketch at each step exhibits a different (reduced) level of distortion. More results of unconditional sketch generation are shown in Figure 2(b).

Figure 2: (a) Sketch generated from random noise. (b) More examples of unconditional generation.

Figure 3: Qualitative comparison on sketch refinement (a)(b) and healing (c)(d). (a) Normal degree of complexity. (b) Complex classes. (c) Corruption p_mask = 10%. (d) Corruption p_mask = 30%.

Table 4: Recognition and retrieval results before and after (separated by "|") sketch rectification by our model at different deformation levels (DL). Performance gain in red. "-": without injecting noise, i.e., real data.

DL    | Recognition acc@1 (%)  | Recognition acc@10 (%) | Retrieval mAP           | Retrieval acc@1 (%)   | Retrieval acc@10 (%)
-     | 51.9 | 52.4 (+0.90)    | 87.7 | 90.2 (+2.50)    | 0.704 | 0.789 (+0.045)  | 67.4 | 73.3 (+5.90)   | 91.3 | 96.2 (+4.90)
10%   | 45.7 | 48.9 (+3.20)    | 82.3 | 82.4 (+0.10)    | 0.724 | 0.788 (+0.064)  | 66.9 | 73.1 (+6.20)   | 92.1 | 96.8 (+4.70)
20%   | 33.0 | 47.3 (+14.3)    | 68.2 | 81.9 (+13.7)    | 0.607 | 0.772 (+0.165)  | 55.8 | 72.8 (+17.0)   | 81.8 | 94.7 (+12.9)
30%   | 20.6 | 48.2 (+27.6)    | 51.5 | 81.9 (+30.4)    | 0.496 | 0.787 (+0.291)  | 46.9 | 72.8 (+25.9)   | 68.9 | 95.0 (+26.1)
50%   | 7.29 | 50.1 (+42.8)    | 27.1 | 84.3 (+57.2)    | 0.328 | 0.786 (+0.458)  | 28.6 | 74.9 (+46.3)   | 47.8 | 96.3 (+48.5)



Figure 4: Architecture of the U-Net for estimating noise ϵ (t) θ and pen state ĝ.

Quantitative comparison results. Testing categories are organized into three groups according to complexity, i.e., the average number of stroke points (ASP). Simple: < 40 ASP; Moderate: 40 ∼ 100 ASP; Complex: > 100 ASP. Speed: seconds per sketch sampled.

Averaged error (10k samples for each class) of the predicted recognizability using h ϕ (•).

Impact of the number of points n (full reverse steps are performed).

Comparison results on sketch healing. Recognition results are obtained by classifying the generated healed sketches with a pre-trained multi-category sketch classifier. "Human" denotes human preference among the synthetic outputs of the different competitors.

p_θ(x_{t-1} | x_t) is defined based on the probability q_σ(x_{t-1} | x_t, x_0), where the prediction of x_0 is obtained using the noise approximator ϵ_θ^{(t)}(x_t).

Comparison results of using U-Net and Bi-directional RNN as noise approximator.


Published as a conference paper at ICLR 2023

Figure 6: The rasterized sketches obtained using neural line rendering (NLR).

A.4 FURTHER INSIGHTS INTO SHORTCUT SAMPLING

Shortcut sampling is a cute discovery we made that is specific to sketch data: coordinates in a sketch sequence are far less robust to added noise than the pixel values of an image. This can be seen in Figure 1(c) and Figure 5, where the early sampling steps are often long and inefficient. Exploiting this simply yields a speed-up in generation while retaining quality (16x faster, see Table 1).
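The control flow of such shortcut sampling can be sketched schematically. This is a reconstruction under assumptions, not the paper's exact algorithm: the skip rule here (jump ahead by a fixed stride whenever the recognizability predicted from the estimated noise is below a threshold ζ) and the dummy models are illustrative placeholders.

```python
import numpy as np

def sample_with_skip(eps_model, recog_model, x_T, T=1000, stride=50, zeta=0.2,
                     step_fn=None):
    """Schematic recognizability-based shortcut sampling.

    While the recognizability predicted from the estimated noise stays below
    zeta, jump ahead by `stride` timesteps instead of paying for one reverse
    step at a time; otherwise perform an ordinary reverse step."""
    x, t = x_T, T - 1
    while t >= 0:
        eps_hat = eps_model(x, t)
        if recog_model(eps_hat) < zeta and t - stride >= 0:
            t -= stride  # shortcut through the inefficient early phase
            continue
        x = step_fn(x, eps_hat, t)  # a single DDPM/DDIM reverse update
        t -= 1
    return x

# Exercise the control flow with dummy models: recognizability is always 0,
# so the loop skips from t=999 down to t=49 and only then denoises step by step.
visited = []
x_out = sample_with_skip(
    eps_model=lambda x, t: x,
    recog_model=lambda eps: 0.0,
    x_T=np.zeros((4, 2)),
    step_fn=lambda x, eps, t: (visited.append(t), x)[1],
)
print(len(visited))
```

With the default stride m = 50, the dummy run above replaces 950 early reverse steps with 19 skips, which is the source of the claimed speed-up.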

A.5 NEURAL LINE RENDERING

Following Li et al. (2020a), neural line rendering (NLR) is performed to convert a vectorized sketch into a pixelated image. Specifically, given a sketch represented by a sequence of points s_t = {s_t^1, s_t^2, ..., s_t^N} at timestep t, a bi-directional Long Short-Term Memory (LSTM) network is first used to extract a per-point feature f_t^i ∈ R^d for each point s_t^i:

(h_t^i, c_t^i) = BiLSTM(s_t^i; h, c),    f_t^i = w h_t^i + b,

where h and c are the hidden states and the optional cell states, and w and b are the weights and biases of a fully connected layer. Then the point sequence and the features, i.e., {(s_t^i, f_t^i)}, are fed into the NLR module to produce a d-channel feature map of size H × W × d. The c-th channel of the feature map, I^c (the timestep t and point index i are omitted for clarity), is obtained as

I_k^c = (1 - α_k) f_i^c + α_k f_{i+1}^c,    if D(I_k, p_i p_{i+1}) < γ,

which means the pixel value I_k^c is computed by linear interpolation of f_i^c and f_{i+1}^c (the c-th feature values of the two nearby stroke points p_i and p_{i+1}) if I_k is a stroke pixel, i.e., the distance from I_k to the line segment p_i p_{i+1} is smaller than a threshold: D(I_k, p_i p_{i+1}) < γ. Note that p_i and p_{i+1} are absolute coordinates corresponding to points s^i and s^{i+1}, and α_k = ∥p_k - p_i∥_2 / ∥p_{i+1} - p_i∥_2, where p_k is the projection of I_k onto the line segment p_i p_{i+1}. By rendering the stroke points' features into pixel values I_k, a vectorized sketch is thus transformed into a sketch image. NLR is differentiable thanks to the linear-interpolation-based rendering, so the gradient w.r.t. the perceptual similarity given by a 2D-CNN in Eq 12 can be backpropagated to the vectorized sketch. Examples of rendered sketch images are shown in Figure 6, where d = 3, γ = 1, and H and W are both set to 256.

A.6 IMPLEMENTATION DETAILS

A single Nvidia 3090 GPU is used for model training. The batch size is set to 512.
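The interpolation rule at the heart of NLR can be illustrated for a single stroke segment. This is an independent NumPy sketch under assumed conventions (pixel centers at integer (x, y) coordinates, projection clamped to the segment, a hard distance mask), not the Li et al. (2020a) implementation.

```python
import numpy as np

def render_segment(canvas, p_i, p_j, f_i, f_j, gamma=1.0):
    """Rasterize one stroke segment p_i -> p_j onto an H x W x d feature map.
    A pixel I_k within distance gamma of the segment receives
    (1 - alpha_k) * f_i + alpha_k * f_j, where alpha_k is the normalized
    position of I_k's projection onto the segment (linear interpolation)."""
    H, W, _ = canvas.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys], axis=-1).astype(float)   # pixel centers (x, y)
    p_i = np.asarray(p_i, float)
    seg = np.asarray(p_j, float) - p_i
    seg_len2 = float(seg @ seg) + 1e-12
    # alpha_k = ||p_k - p_i|| / ||p_j - p_i||, with the projection clamped
    alpha = np.clip(((pix - p_i) @ seg) / seg_len2, 0.0, 1.0)
    proj = p_i + alpha[..., None] * seg
    dist = np.linalg.norm(pix - proj, axis=-1)        # D(I_k, p_i p_j)
    mask = dist < gamma
    canvas[mask] = (1.0 - alpha[mask, None]) * f_i + alpha[mask, None] * f_j
    return canvas

canvas = np.zeros((64, 64, 3))
render_segment(canvas, p_i=(10.0, 10.0), p_j=(50.0, 40.0),
               f_i=np.ones(3), f_j=np.full(3, 0.5))
print(int((canvas.sum(axis=-1) > 0).sum()))
```

Because the written value is a linear function of f_i and f_j, gradients flow from pixel values back to the stroke-point features, which is exactly the differentiability property the main text relies on.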
The point number is set to N = 96, the average number of stroke points in the QuickDraw dataset. Consequently, sketches with more than 96 stroke points are excluded during model training, and point sequences with fewer than 96 stroke points are zero-padded to reach N = 96. N = 96 works well for most cases, since about 91% of sketches (statistics over all 345 categories) have fewer than 96 stroke points. The default skipping stride is m = 50 and the recognizability threshold is ζ = 0.2. Instead of directly using α_1, ..., α_T, we adopt β_1, ..., β_T to define the mean and variance of the Gaussian noise. A linear noise schedule is used, with β_t defined as

β_t = β_1 + ((t-1)/(T-1)) (β_T - β_1),

where β_1 = 10^-4 and β_T = 0.02 in our case. The mean scale is then √(1-β_t) and the variance is β_t in Eq 1. The Adam optimizer (β_1 = 0.9 and β_2 = 0.98) is used for optimization.
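The linear schedule above is one line of code; the interpolation formula was lost in extraction, so this assumes the standard DDPM linear schedule between the stated endpoints.

```python
import numpy as np

def linear_beta_schedule(T, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule: beta_t interpolates from beta_1 to beta_T.
    Per-step Gaussian in Eq 1: mean scale sqrt(1 - beta_t), variance beta_t."""
    return np.linspace(beta_1, beta_T, T)

betas = linear_beta_schedule(1000)
print(betas[0], betas[-1])
```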

