LMSER-PIX2SEQ: LEARNING STABLE SKETCH REPRESENTATIONS FOR SKETCH HEALING

Abstract

Sketch healing aims to recreate a complete sketch from a corrupted one. The sparse and abstract nature of sketches makes this challenging: the features extracted from the corrupted sketch may be inconsistent with those extracted from the corresponding full sketch. In this paper, we present Lmser-pix2seq, which learns sketch representations that are stable against missing information by employing a Least mean square error reconstruction (Lmser) block that follows the encoder-decoder paradigm. Taking a corrupted sketch as input, the Lmser encoder computes embeddings of the structural patterns of the input, while the decoder reconstructs the complete sketch from the embeddings. We build bi-directional skip connections between the encoder and the decoder in our Lmser block. The feedback connections enable recurrent paths to receive more information about the reconstructed sketch produced by the decoder, which helps the encoder extract stable sketch features. The features captured by the Lmser block are eventually fed into a recurrent neural network decoder to recreate the sketches. Experimental results show that our Lmser-pix2seq outperforms the state-of-the-art methods in sketch healing, especially when the sketches are heavily masked or corrupted.

1. INTRODUCTION

Humans can use their imagination to complete what is missing, for example by filling in blanks, writing sequels to novels, or repairing images. Sketch healing (Su et al., 2020) is a related task: it aims to synthesise a complete sketch that best resembles a partial input (Su et al., 2020; Qi et al., 2022). Unlike image inpainting (Pathak et al., 2016), where photos carry rich texture information, freehand sketches are highly abstract and sparse, which makes sketch healing quite challenging. Su et al. (2020) obtain a corrupted sketch by cropping several local visual patches from a raster sketch image and dropping some of them. This yields a corrupted raster sketch image and a set of remaining visual patches. Conventional sketch generation models (Chen et al., 2017; Zang et al., 2021) that take images as input can be used for sketch healing. However, these models, designed for sketch synthesis, do not match the performance of SketchHealer-1.0 (Su et al., 2020), which was designed specifically for sketch healing. SketchHealer-1.0 constructs a graph representation of the sketch by treating patches as nodes and connecting edges based on the nodes' temporal proximity, i.e., the drawing order. The graph representation lets different patches of the same sketch exchange information, which improves healing. Building on SketchHealer-1.0, SketchHealer-2.0 (Qi et al., 2022) considers the relationship between local reconstruction and global semantic preservation. SketchHealer-2.0 requires a pre-trained model to calculate the semantic similarity between the recreated sketch and the full sketch. Both SketchHealer-1.0 (Su et al., 2020) and SketchHealer-2.0 (Qi et al., 2022) build graphs that depend on the drawing order, but this information is not always available.
To overcome this difficulty, SketchLattice (Qi et al., 2021) proposes a novel lattice representation and takes an image as input. However, during data processing, the lattice approach loses some of the information in the raster sketch image, which limits SketchLattice's performance. In contrast to the state-of-the-art graph-structured models, which pass information between nodes to fill in the gaps, we want the network to take full advantage of the information in the raster sketch images and to learn stable sketch representations in the absence of temporal information. By stable representations we mean that the features the model extracts from a corrupted sketch are as consistent as possible with those of the full sketch. In theory, this consistency allows different corrupted sketches obtained by masking the same full sketch to be recreated similarly. Conversely, when the extracted features are unstable (i.e., lack consistency), the healed sketch fails to maintain its semantics and, worse, its category may change. To learn stable sketch representations, we want the feature maps from different layers of the network to be fully fused, which helps to extract significant and stable features from the corrupted sketches. Least mean square error reconstruction (Lmser) (Xu, 1991; 1993) serves this purpose. Lmser is a development of the autoencoder (AE) (Bourlard & Kamp, 1988) obtained by folding and merging the symmetrical encoder and decoder together. Such folding is equivalent to adding bi-directional skip connections between the encoder and the decoder (Xu, 2019). The effectiveness of Lmser has been demonstrated in image inpainting (Huang et al., 2020b), super-resolution (Li et al., 2019), and semantic segmentation (Guo et al., 2019; Cao et al., 2021). However, these studies focus on image-related applications with rich texture information rather than on sparse and abstract sketches.
We present Lmser-pix2seq, which learns sketch representations that are stable against missing information by employing an Lmser block that follows the encoder-decoder paradigm. Taking a corrupted sketch as input, the Lmser encoder computes embeddings of the structural patterns of the input, while the decoder reconstructs the complete sketch from the embeddings. We build bi-directional skip connections between the encoder and the decoder in our Lmser block. The feedback connections enable recurrent paths to receive more information about the reconstructed sketch produced by the decoder, which helps the encoder extract stable sketch features. The features captured by the Lmser block are eventually fed into a Recurrent Neural Network (RNN) decoder to recreate the sketches. In summary, our contribution is Lmser-pix2seq, a model that learns stable sketch representations for sketch healing. The bi-directional skip connections in our Lmser block allow the feature maps from the encoder and decoder to be sufficiently fused, which facilitates the extraction of sketch features. Experimental results show that our Lmser-pix2seq outperforms the state-of-the-art methods, especially when the sketches are heavily masked or corrupted.

2. RELATED WORK

Sketch Generation. Research on sketch generation with deep learning methods (Ha & Eck, 2018; Zhou et al., 2018a; Das et al., 2021; Ge et al., 2021) has developed rapidly in recent years. An interesting line of work has a neural network imitate humans by drawing a vector sketch stroke by stroke. sketch-rnn (Ha & Eck, 2018) is an RNN-RNN generative model based on the Variational Autoencoder (VAE) (Kingma & Welling, 2013), which supports conditional and unconditional single-category sketch generation. Later, sketch-pix2seq (Chen et al., 2017), with a convolutional neural network (CNN) encoder, addressed multi-category generation and found that a latent code without the normal distribution constraint gives better reconstruction results. Inspired by these two models, Song et al. (2018) fuse photo texture information with temporal information through shortcut cycle consistency. To further improve the controllability of generation, RPCL-pix2seq (Zang et al., 2021) assumes that the latent space follows a Gaussian mixture model (GMM), with the number of Gaussians determined by automatic model selection. sketch-rnn and RPCL-pix2seq are more closely related to our model. There are also some large pre-trained models, e.g. Sketch-Bert (Lin et al., 2020) and Sketchformer (Ribeiro et al., 2020), which serve not only sketch generation but also other downstream tasks. Sketch healing resembles a combination of vector sketch generation and image inpainting.

Sketch Healing. SketchHealer-1.0 (Su et al., 2020) clarified the definition of sketch healing and proposed a novel graph representation method. SketchHealer-2.0 (Qi et al., 2022) rasterizes the generated sequence and calculates its semantic perceptual loss with respect to the corresponding full sketch. The other model that represents a sketch as a graph is SketchLattice (Qi et al., 2021). SketchLattice is a lightweight network that can construct graphs without relying on the drawing order. It overlays a lattice on the image, treats the intersections of the lattice with the pixels of the sketch strokes as nodes, and constructs the edges between nodes by Euclidean distance. In contrast to sketch generation, sketch healing requires the model to extract accurate and effective features when the sketch is masked.

Skip Connection. Forward connections transmit information directly from the shallow layers to the deep layers via a short-circuit path. Studies have shown that forward connections can alleviate the vanishing gradient problem (He et al., 2016) and promote multi-scale feature fusion (Ronneberger et al., 2015). This technique is widely applied in computer vision, e.g., in image recognition (He et al., 2016; Huang et al., 2017; Dosovitskiy et al., 2020) and image semantic segmentation (Ronneberger et al., 2015; Milletari et al., 2016; Zhou et al., 2018b; Huang et al., 2020a). Feedback connections, as opposed to forward connections, have received attention recently. Implementing a feedback connection usually requires a recurrent mechanism (Xiang et al., 2020; Huang et al., 2020b). Studies on semantic segmentation of medical images have shown that feedback connections can capture finer details and thereby improve segmentation accuracy (Xiang et al., 2020; Cao et al., 2021). In addition, block reuse reduces the number of parameters in the network.

3.1. OVERVIEW OF LMSER-PIX2SEQ

The proposed Lmser-pix2seq consists of an Lmser block encoder and an RNN decoder, where the Lmser block itself has an encoder-decoder architecture. The bi-directional skip connections in our Lmser block allow the feature maps to be fully fused so that stable features can be extracted from the sketch. Fig. 1(b) shows the pipeline of our Lmser-pix2seq. A corrupted sketch X̃ is fed into the Lmser block to obtain the reconstructed raster image X̂ and the latent vector y. We project y into two vectors µ and σ and then obtain the final latent code m by the reparametrisation trick (Kingma & Welling, 2013). The code m is fed into a Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) decoder, which outputs the recreated sequence Ŝ. Specifically, the LSTM predicts the pen state and produces the parameters of a GMM that estimates the offset from the previous point. Our LSTM decoder has 512 nodes, consistent with SketchHealer-1.0 (Su et al., 2020). The LSTM modeling follows sketch-rnn (Ha & Eck, 2018), to which readers are referred for more details. In our method, a sketch has two representations of different modalities: the sequence S = (s_1, s_2, ..., s_n) and the raster image X. The sequential representation of a sketch is composed of a series of points, each of which is a vector of five elements: (∆x, ∆y, p_1, p_2, p_3). ∆x and ∆y are the offsets from the previous point in the x and y directions. (p_1, p_2, p_3) indicates the current stroke state, where (1, 0, 0), (0, 1, 0), and (0, 0, 1) represent the three stroke states of touch, lift, and end of sketch, respectively. X is obtained by rasterizing the sequence S. The corrupted sketch X̃ is obtained by masking part of the information in the full sketch X.
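The five-element sequence format described above can be illustrated with a minimal sketch. The function name `to_stroke5`, the choice of the origin as the starting pen position, and the convention that the lift state is attached to the last point of each stroke are assumptions made for illustration (the latter follows common sketch-rnn practice); they are not taken from the paper.

```python
import numpy as np

def to_stroke5(strokes):
    """Convert strokes given as lists of absolute (x, y) points into rows
    of (dx, dy, p1, p2, p3). Hypothetical helper, for illustration only."""
    rows = []
    prev = np.zeros(2)  # assumption: the pen starts at the origin
    for stroke in strokes:
        pts = np.asarray(stroke, dtype=float)
        for i, pt in enumerate(pts):
            dx, dy = pt - prev  # offset from the previous point
            last_in_stroke = i == len(pts) - 1
            if last_in_stroke:
                rows.append([dx, dy, 0.0, 1.0, 0.0])  # (0, 1, 0): lift
            else:
                rows.append([dx, dy, 1.0, 0.0, 0.0])  # (1, 0, 0): touch
            prev = pt
    rows.append([0.0, 0.0, 0.0, 0.0, 1.0])  # (0, 0, 1): end of sketch
    return np.array(rows)

# Two strokes: a horizontal segment, then a single point above its end.
s5 = to_stroke5([[(0, 0), (1, 0)], [(1, 1)]])
print(s5)
```

Each row carries exactly one active pen state, and the absolute coordinates can be recovered by a cumulative sum over the first two columns.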

3.2. LMSER BLOCK ARCHITECTURE

As mentioned before, we expect a network to recover incomplete sketch raster images while extracting features. We take a convolutional autoencoder (CAE) as the backbone network to implement our Lmser block. The encoder and decoder of the CAE consist of several convolution and deconvolution layers, respectively, as shown in Fig. 1(a). The encoder and decoder layers of the CAE are fully symmetrical. We add bi-directional skip connections between the corresponding layers of the encoder and the decoder. We connect the decoder and the encoder to form a loop, as illustrated in Fig. 1(a), which allows the feedback skip connections to work (Huang et al., 2020b). With the recurrent mechanism, we calculate the feature fusion as follows:

$$
\begin{aligned}
z_l^{te} &= f_l\left(z_{l-1}^{te}\right), & t = 0,\ 0 < l < L, \\
z_l^{te} &= \alpha f_l\left(z_{l-1}^{te}\right) + (1-\alpha)\, z_l^{(t-1)d}, & t > 0,\ 0 < l < L, \\
z_l^{te} &= f_l\left(z_{l-1}^{te}\right), & t \ge 0,\ l = L, \\
z_l^{td} &= g_{(l+1)}\left(z_{l+1}^{td}\right), & t \ge 0,\ l = 0, \\
z_l^{td} &= \alpha g_{(l+1)}\left(z_{l+1}^{td}\right) + (1-\alpha)\, z_l^{te}, & t \ge 0,\ 0 < l < L,
\end{aligned}
$$

where $z_l^{te}$ and $z_l^{td}$ denote the t-th update of the l-th layer's feature maps in the encoder and the decoder, respectively, and $f_l(\cdot)$ and $g_l(\cdot)$ denote the l-th layer computations of the encoder and the decoder, respectively. α is a hyper-parameter, which we set to 0.5 in the experiments. We use the bottom-most output $z_0^{td}$ of the decoder as the input to the encoder to form the loop, i.e., $z_0^{(t+1)e} = z_0^{td}$.

The specific workflow of the Lmser block is as follows. When a sketch X is fed to the Lmser block, the neurons of the encoder are activated layer by layer. Then, the embedding $z_L^{0e}$ obtained from the top-most layer of the encoder is sent to the decoder. The neurons in each layer of the decoder are initialized by receiving both the information transmitted by the forward connection and the signals from the previous layer of the decoder, e.g. $z_1^{0d} = \alpha g_{(2)}(z_2^{0d}) + (1-\alpha) z_1^{0e}$. The decoder output, $z_0^{0d}$, is used as the input $z_0^{1e}$ to the encoder in the next loop. Thereafter, the updates of the neurons in the encoder are also influenced by the information delivered by the feedback connections, e.g. $z_2^{1e} = \alpha f_2(z_1^{1e}) + (1-\alpha) z_2^{0d}$. The neurons in the Lmser block are dynamically updated layer by layer by passing through this large loop repeatedly; see Fig. 2 of CLmser (Huang et al., 2020b) for more details. After T loops, the top-most and bottom-most layers of the Lmser block output the feature y and the reconstructed sketch X̂, respectively. Theoretically, the dynamic process above eventually reaches an equilibrium state (Xu, 1993). In practice, the choice of T is a trade-off between performance and resource consumption. More iterations allow the feature maps to be fused more fully and facilitate the output of stable features, but also require more computational resources. In this paper, we set T to 2 by default. It is worth noting that when T = 0, Lmser degenerates into a U-Net-like network architecture.
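To make the recurrent update concrete, the following is a minimal NumPy sketch of the fusion rules above. It is not the authors' implementation: the convolution and deconvolution layers are replaced by hypothetical dense layers with a tanh activation, the decoder's top-most state is assumed to equal the encoder's top-most output, and the loop is assumed to run for t = 0, ..., T so that T = 0 corresponds to a single U-Net-like pass.

```python
import numpy as np

def lmser_block(x, enc_w, dec_w, T=2, alpha=0.5):
    """Toy Lmser loop. enc_w[l-1] plays the role of f_l (layer l-1 -> l),
    dec_w[l-1] plays the role of g_(l) (layer l -> l-1)."""
    L = len(enc_w)  # layers are indexed 0..L, layer 0 is the image level

    def f(l, z):
        return np.tanh(enc_w[l - 1] @ z)

    def g(l, z):
        return np.tanh(dec_w[l - 1] @ z)

    z_e0, z_d_prev = x, None
    for t in range(T + 1):  # assumption: passes t = 0..T
        # Encoder pass, bottom to top.
        z_e = [z_e0]
        for l in range(1, L + 1):
            h = f(l, z_e[l - 1])
            if t > 0 and l < L:  # feedback connection from the previous loop
                h = alpha * h + (1 - alpha) * z_d_prev[l]
            z_e.append(h)
        # Decoder pass, top to bottom.
        z_d = [None] * (L + 1)
        z_d[L] = z_e[L]  # assumption: the decoder starts from the embedding
        for l in range(L - 1, -1, -1):
            h = g(l + 1, z_d[l + 1])
            if l > 0:  # forward connection from the encoder
                h = alpha * h + (1 - alpha) * z_e[l]
            z_d[l] = h
        z_e0, z_d_prev = z_d[0], z_d  # close the loop
    return z_e[L], z_d[0]  # feature y and reconstruction

rng = np.random.default_rng(0)
dims = [8, 6, 4]  # hypothetical layer widths
enc_w = [rng.normal(size=(dims[l], dims[l - 1])) for l in range(1, len(dims))]
dec_w = [rng.normal(size=(dims[l - 1], dims[l])) for l in range(1, len(dims))]
y, x_hat = lmser_block(rng.normal(size=dims[0]), enc_w, dec_w, T=2)
print(y.shape, x_hat.shape)
```

Setting `alpha=1.0` switches off both kinds of skip connection, in which case a single pass reduces to a plain autoencoder; this gives a simple sanity check for the update order.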

3.3. TRAINING LMSER-PIX2SEQ

Lmser-pix2seq produces outputs in two modalities. Its main purpose is to make the recreated vector sketch Ŝ as high-quality as possible while preserving the global semantics of the original full sketch S. To obtain better reconstruction on multi-class sketches, we remove the Kullback-Leibler (KL) divergence loss term of the VAE, following existing work (Chen et al., 2017; Su et al., 2020; Qi et al., 2021; 2022). The loss function consists of two parts: the raster image reconstruction loss L_CNN and the sequence reconstruction loss L_RNN. As is common in image generation tasks, L_CNN adopts the l2 loss. Following sketch-rnn (Ha & Eck, 2018), the sequence reconstruction loss L_RNN minimizes the negative log-likelihood of the generated probability distribution (Qi et al., 2021). To sum up, our objective is to minimize

$$
L_{total} = w_c L_{CNN} + w_r L_{RNN} = \frac{1}{2} w_c\, \mathbb{E}\left(\lVert \hat{X} - X \rVert_2^2\right) - w_r\, \mathbb{E}_{q_\phi(m \mid X)}\left[\log p_\theta(\hat{S} \mid m)\right],
$$

where X and X̂ denote the full sketch and the complete sketch reconstructed by the Lmser block, respectively. L_CNN and L_RNN are weighted by the hyper-parameters w_c and w_r, respectively.
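As a small worked example of the objective above, the sketch below combines the two terms for one batch. The function `total_loss` and its argument names are hypothetical; the per-sequence negative log-likelihood is assumed to be computed elsewhere (for instance by a sketch-rnn style GMM decoder) and is passed in as a plain array, and the default weights are the values reported in the implementation details.

```python
import numpy as np

def total_loss(x_hat, x_full, seq_nll, w_c=0.5, w_r=1.0):
    """Weighted sum of the raster l2 loss and the sequence NLL.
    x_hat, x_full: arrays of shape (batch, H, W); seq_nll: shape (batch,)."""
    diff = (x_hat - x_full).reshape(len(x_full), -1)
    # L_CNN: half the squared l2 distance, averaged over the batch.
    l_cnn = 0.5 * np.mean(np.sum(diff ** 2, axis=1))
    # L_RNN: average negative log-likelihood of the target sequences.
    l_rnn = np.mean(seq_nll)
    return w_c * l_cnn + w_r * l_rnn

x_hat = np.zeros((2, 2, 2))
x_full = np.ones((2, 2, 2))
print(total_loss(x_hat, x_full, seq_nll=np.array([1.0, 3.0])))
```

In this toy batch each image differs from its target in four pixels by one unit, so L_CNN is 0.5 · 4 = 2, L_RNN is 2, and the total is 0.5 · 2 + 1 · 2 = 3.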

4.1. PREPARATION

Dataset. Our experiments are conducted on two datasets: a 17-category dataset (DS1) from SketchHealer-1.0 (Su et al., 2020) and a 5-category dataset (DS2) from RPCL-pix2seq (Zang et al., 2021). DS1 and DS2 are both drawn from the large-scale Quickdraw dataset (Ha & Eck, 2018). Each category contains 70000 sketches for training and 2500 sketches for testing. Our corruption method is largely consistent with SketchHealer-1.0 (Su et al., 2020). The sketch is drawn on a 640 × 640 canvas, M patches of 128 × 128 are taken from the canvas following the order of the stroke points, and some of the patches are then removed with probability p_mask. Finally, we resize the corrupted canvas to a 128 × 128 image, which serves as the corrupted sketch. The masking method of SketchHealer-1.0 (Su et al., 2020) has an information leakage problem. Consider two overlapping patches A and B, where A is cropped before B and B is selected for removal: A then still contains some information from the removed B. Our solution is to first mask all the information on the canvas at the locations of the patches to be removed, and only then crop the individual patches.

Evaluation Metrics. Rec and Ret (Zang et al., 2021) are used to evaluate the recreated sketches. Rec indicates whether the recreated sketch Ŝ and its corresponding full sketch S belong to the same category, i.e., top-1 recognition accuracy. We pre-train two Sketch-a-Net classifiers (Yu et al., 2015), one per dataset, for computing Rec. Ret measures the semantic similarity between the recreated sketch Ŝ and its corresponding full sketch S. Specifically, for each model, the full sketch S and the recreated sketch Ŝ are rasterized and fed to the model's own encoder to obtain the vectors µ and µ̂, respectively. We use µ̂ to retrieve its corresponding vector µ, and Ret is the success rate of retrieval.
We use Ret@k to denote top-k retrieval accuracy. Our Ret differs fundamentally from the retrieval metric of SketchHealer-1.0 (Su et al., 2020), SketchHealer-2.0 (Qi et al., 2022) and SketchLattice (Qi et al., 2021), which is a category-level rather than an instance-level retrieval. Our Ret is therefore more sensitive to the semantic consistency between the recreated sketch and the full sketch.

Baseline. We compare our Lmser-pix2seq with five baseline models on the sketch healing task. These include sketch-pix2seq (Chen et al., 2017) and RPCL-pix2seq (Zang et al., 2021), which were proposed for sketch generation, as well as SketchHealer-1.0 (Su et al., 2020), SketchHealer-2.0 (Qi et al., 2022), and SketchLattice (Qi et al., 2021), which were designed for sketch healing. Among these models, only RPCL-pix2seq imposes constraints on the latent space. We train SketchLattice (Qi et al., 2021) and SketchHealer-1.0 (Su et al., 2020) using the official open-source code, and we re-implement RPCL-pix2seq in PyTorch (Paszke et al., 2017). The code of SketchHealer-2.0 (Qi et al., 2022) is not yet open source, and the original paper does not specify how gradients are back-propagated through rasterization. We therefore reproduce SketchHealer-2.0 as faithfully as we can. We first pre-train a multi-category sketch-rnn (Ha & Eck, 2018) network for each dataset and freeze its parameters. When training SketchHealer-2.0, we choose the Gaussian component and the pen state with the highest probability in the LSTM output to generate a five-element vector sequence Ŝ, which ensures that the gradients can be back-propagated. Finally, Ŝ and S are both fed into the encoder of the pre-trained sketch-rnn to obtain the corresponding latent codes, which are used to calculate the perceptual loss. Since the masking process is performed on the canvas, we do not include networks that take sequences as input, such as sketch-rnn, among the baselines.

Implementation Details. The Adam optimizer (Kingma & Ba, 2014) is applied to our Lmser-pix2seq with parameters β_1 = 0.9, β_2 = 0.999, ϵ = 10^-8. The learning rate starts from 10^-3 and decays at a rate of 0.999 per iteration. We randomly mask p_mask = 10% of the patches to obtain the corrupted sketches for model training, and use different proportions for model testing. The weights of the loss function are w_c = 0.5 and w_r = 1. The dimension of the latent code m is N_z = 128. All models are trained on a single NVIDIA RTX 2080Ti GPU for 150000 iterations.
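The leakage-free corruption procedure from the Dataset paragraph can be sketched as follows. This is only an illustration under several assumptions that the paper does not state: patches are assumed to be centred on the given stroke points, stroke pixels are assumed to be 1 on a background of 0, and the resize to 128 × 128 is done by block max-pooling. The function name `corrupt_canvas` and its parameters are hypothetical.

```python
import numpy as np

def corrupt_canvas(canvas, centers, p_mask, rng, patch=128, out=128):
    """Blank every removed patch on the full canvas first, so that
    overlapping kept patches cannot leak the removed content."""
    h, w = canvas.shape
    corrupted = canvas.copy()
    removed = rng.random(len(centers)) < p_mask  # one draw per patch
    half = patch // 2
    for (cx, cy), drop in zip(centers, removed):
        if drop:
            y0, x0 = max(cy - half, 0), max(cx - half, 0)
            corrupted[y0:min(cy + half, h), x0:min(cx + half, w)] = 0
    # Downsample to out x out (assumption: h and w are multiples of out).
    k = h // out
    small = corrupted.reshape(out, k, out, k).max(axis=(1, 3))
    return small, removed

rng = np.random.default_rng(0)
canvas = np.zeros((640, 640))
canvas[320, 100:540] = 1.0  # a single horizontal stroke
centers = [(100, 320), (210, 320), (320, 320), (430, 320), (540, 320)]
small, removed = corrupt_canvas(canvas, centers, p_mask=0.1, rng=rng)
print(small.shape, removed)
```

Because masking is applied to the shared canvas before any cropping, a patch cropped afterwards from `corrupted` never contains pixels that belong to a removed neighbour.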

4.2. RESULTS

Quantitative Results. Table 1 reports the retrieval performance on DS1. The proposed Lmser-pix2seq learns stable representations, which lead to outstanding performance on Ret. Even when the sketch is unmasked, the retrieval success rate of Lmser-pix2seq is significantly higher than that of the other methods: the top-1 Ret of our method is 22.91% higher than that of SketchHealer-1.0 (Su et al., 2020). The advantage of Lmser-pix2seq becomes more apparent as the proportion of masked sketch content increases. When p_mask = 50%, the top-1 Ret of Lmser-pix2seq is 33.42% higher than that of SketchHealer-1.0. On the multi-style DS2, our method still performs well, see Table 2. The Rec metrics of Lmser-pix2seq are overall on the same level as those of the state-of-the-art methods SketchHealer-1.0 (Su et al., 2020) and SketchHealer-2.0 (Qi et al., 2022). These results show that our model meets the need for both high quality and semantic preservation. We now turn to the other models. RPCL-pix2seq (Zang et al., 2021), as a model with a constrained latent space, focuses on the controllability of sketch generation rather than on sketch healing, which makes it more sensitive to the masking rate than sketch-pix2seq (with the KL term removed) (Chen et al., 2017). The lightweight SketchLattice (Qi et al., 2021) takes coordinates as input, which makes it difficult to feed sufficient information to the model. Besides, when there are noisy strokes in the sketch, inappropriate coordinates may be treated as nodes by SketchLattice and degrade the quality of sketch generation. Another reason is that we perform the lattice operation on the corrupted sketch instead of discarding several nodes with some probability as in Qi et al. (2021), which removes more information.
Naturally, SketchLattice*, which uses the original masking method, improves on both Rec and Ret. SketchHealer-1.0 (Su et al., 2020) and SketchHealer-2.0 (Qi et al., 2022) generate higher-quality sketches from corrupted inputs, as reflected in the Rec metric. SketchHealer-2.0 improves on Rec compared with SketchHealer-1.0, benefiting from the perceptual loss, which promotes global semantic preservation. However, the Ret of SketchHealer-2.0 is not consistent between DS1 and DS2. This may be because we reproduce the perceptual loss calculation with sketch-rnn (Ha & Eck, 2018), which is not designed for multi-category sketch generation.

Qualitative Results. In Fig. 2, we present the sketch healing results for p_mask = 10%, p_mask = 30% and p_mask = 50%. With little loss of sketch information (p_mask = 10%), the healing results of all models are relatively reasonable. However, as more strokes are covered, not all methods remain effective. When p_mask = 50%, some key details in the sketches are lost, e.g., the head of the sheep is almost completely missing. Lmser-pix2seq successfully recreates the sheep from the limited stroke information of the head, while other methods can only produce a rough body. Consider also the airplane in Fig. 2. When p_mask = 30% and p_mask = 50%, the shape of the fuselage in most of the generated sketches does not match the shape in the original sketch. However, the sketches recreated by Lmser-pix2seq and SketchHealer-2.0 (Qi et al., 2022) maintain the structure of the fuselage better, especially the tail section. Fig. 3 shows the sketch healing results for different masking areas at p_mask = 30%. Compared with the state-of-the-art methods, the healing results of the proposed Lmser-pix2seq do not differ significantly as the masking location changes.

Table fragment (column headers not recovered):
(Qi et al., 2021) 77.54 77.81 77.10 72.98
SketchHealer-1.0 (Su et al., 2020) 90.93 90.50 88.80 83.15
SketchHealer-2.0 (Qi et al., 2022) 92.10 91.74 90.01 84.23
Lmser-pix2seq 92.42 91.78 90.49 88.37

Figure 2: The healing results with masking probabilities of 10%, 30%, and 50%.

Figure 3: Qualitative comparisons between the proposed Lmser-pix2seq and other state-of-the-art methods. p_mask = 30% throughout.

4.3. ABLATION STUDY

In this subsection, we examine the effectiveness of the skip connections, the number of iterations T, and the image reconstruction. Our ablation experiments are conducted on the larger DS1. We design five models, as shown in Table 4, of which Lmser-pix2seq (T=2) is the Lmser-pix2seq model used above. Table 5 and Table 6 show the Ret and Rec results of the ablation experiments, respectively.

CNN Decoder. CAE-pix2seq, which has a CNN decoder, generates more recognizable sketches (higher Rec) than CE-pix2seq. This is because the CNN decoder captures the overall structure of a class of sketches by reconstructing the images and guides the CNN encoder to extract more accurate features (Zang et al., 2021). However, the depth of the network makes it difficult to transfer details of the sketches efficiently from the shallow encoder layers to the deep decoder layers, which results in the poor Ret performance of CAE-pix2seq.

Skip Connection. Introducing skip connections significantly improves the performance of Lmser-pix2seq (T=1) over CAE-pix2seq, especially on Ret. The main reason is that while the forward connections assist the decoder in reconstructing the sketch, the feedback connections transmit the reconstruction information to the encoder for feature map fusion, which captures the exact features of the instance.

Number Of Iterations. The experimental results show that Lmser-pix2seq with T = 2 clearly outperforms Lmser-pix2seq with T = 1. This indicates that one cycle does not allow the feature maps in the Lmser block to be fully fused, so the features extracted from the sketch are not yet stable. When T = 4, the Rec metric of Lmser-pix2seq improves overall compared with Lmser-pix2seq (T=2). However, the benefit of more iterations has to be weighed against the additional resource consumption.

5. CONCLUSION

In this paper, we present Lmser-pix2seq to learn stable sketch representations for sketch healing. The bi-directional skip connections in our Lmser block facilitate the fusion of the feature maps between the encoder and decoder, thus capturing the stable features of the corrupted sketch. Benefiting from the stable features, Lmser-pix2seq can recreate sketches of high quality and preserve their global semantics. Experiments show that our approach outperforms the state-of-the-art models in the sketch healing task.

C RETRIEVAL METRIC

Fig. 5 shows two retrieval methods; the metric used in this paper corresponds to (a) in the figure. Compared with the retrieval metric in SketchHealer-1.0 (Su et al., 2020), Ret is more challenging. For each query, Ret allows only one correct result, while hundreds of correct answers exist for the other method.

Figure 5: Two types of retrieval metrics. The retrieval is performed in the latent space. Blue circles represent queries and red circles represent samples with correct retrieval results. (a) The retrieval method from RPCL-pix2seq (Zang et al., 2021), i.e. Ret, is an instance-level approach. (b) The retrieval method from SketchHealer-1.0 (Su et al., 2020) is a category-level approach.
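The instance-level metric in panel (a) can be sketched in a few lines. The function `ret_at_k` is a hypothetical helper, and the use of Euclidean distance in the latent space is an assumption, since the distance measure is not specified here. Row i of `queries` is taken to be the code of the i-th recreated sketch and row i of `gallery` the code of its full sketch, so each query has exactly one correct match.

```python
import numpy as np

def ret_at_k(queries, gallery, k=1):
    """Fraction of queries whose own gallery entry is among the k nearest
    gallery codes (instance-level retrieval)."""
    # Pairwise Euclidean distances, shape (num_queries, num_gallery).
    dist = np.linalg.norm(queries[:, None, :] - gallery[None, :, :], axis=-1)
    topk = np.argsort(dist, axis=1)[:, :k]
    hits = (topk == np.arange(len(queries))[:, None]).any(axis=1)
    return hits.mean()

gallery = np.eye(3)                # codes of three full sketches
good_queries = gallery + 0.1       # recreated codes close to their targets
swapped = gallery[[1, 0, 2]]       # first two queries point at the wrong item
print(ret_at_k(good_queries, gallery, k=1))
print(ret_at_k(swapped, gallery, k=1))
```

A category-level metric, as in panel (b), would instead count a hit whenever any retrieved item shares the query's class label, which is why it admits many correct answers per query.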

D LMSER-PIX2SEQ ONLY TRAINED WITH CNNS

When Lmser-pix2seq is trained with the CNNs only, sketch healing turns into image inpainting. Table 7 and Table 8 report the relevant metrics. Fig. 6 shows some of the resulting sketch reconstructions. Although these images maintain the overall structure of the sketch well, some of the results are blurry and incomplete. This is why an RNN decoder is needed to generate the sequences.



DS1 categories: airplane, angel, alarm clock, apple, butterfly, belt, bus, cake, cat, clock, eye, fish, pig, sheep, spider, umbrella, The Great Wall of China. These categories are common in life and the instances in the categories are globally similar in appearance.

DS2 categories: bee, bus, flower, giraffe, pig. These classes have multi-style characteristics and are more challenging for sketch healing.



Figure 1: (a) Architecture of the Lmser block. The convolution and deconvolution configurations are shown as h × w@d/s, where h, w, d and s represent height, width, depth and stride, respectively. All the layers are followed by an instance normalization layer and a ReLU activation function, except for the bottom-most deconvolution layer, which employs a tanh activation function. (b) Overview of Lmser-pix2seq. The latent vector y output at the top of the Lmser block is eventually fed to an RNN decoder to recreate the vector sketch Ŝ.

Figure 4: Sketch healing results for irregular hole masking. The left three columns in the figure are ground truth sketches, masks and masked sketches. The right six columns in the figure are the sketches recreated by Lmser-pix2seq.

Table 1: Sketch healing retrieval performance Ret (%) ↑ on DS1.

Table 2: Sketch healing retrieval performance Ret (%) ↑ on DS2.

Sketch healing recognition performance Rec (%) ↑.
Chen et al., 2017) 66.99 66.02 60.68 53.40
RPCL-pix2seq (Zang et al., 2021) 69.86 66.15 55.19 44.33
SketchLattice (Qi et al., 2021) 48.88 46.57 37.87 27.91
SketchLattice* (Qi et al., 2021) 48.88 49.15 45.85 31.16
SketchHealer-1.0 (Su et al., 2020) 76.76 75.05 70.47 62.86
SketchHealer-2.0 (Qi et al., 2022) 77.48 75.76 71.93 64.56

Table 4: Models for ablation study.

Table 5: Sketch healing retrieval performance Ret (%) ↑ on DS1 for ablation study.

Table 6: Sketch healing recognition performance Rec (%) ↑ on DS1 for ablation study.

Sketch image inpainting recognition performance Rec (%) ↑ .

