SKETCHEMBEDNET: LEARNING NOVEL CONCEPTS BY IMITATING DRAWINGS

Abstract

Sketch drawings are an intuitive visual domain that appeals to human instinct. Previous work has shown that recurrent neural networks are capable of producing sketch drawings of one or a few classes at a time. In this work we investigate the representations developed by training a generative model to produce sketches from pixel images across many classes in a sketch domain. We find that the embeddings learned by this sketching model are highly informative for visual tasks and capture a distinctive visual understanding. We then use them to exceed state-of-the-art performance in unsupervised few-shot classification on the Omniglot and mini-ImageNet benchmarks. We also leverage the generative capacity of our model to produce high-quality sketches of novel classes based on just a single example.

1. INTRODUCTION

Upon encountering a novel concept, such as a six-legged turtle, humans can quickly generalize this concept by composing a mental picture. The ability to generate drawings greatly facilitates communicating new ideas. This dates back to the advent of writing, as many ancient written languages are based on logograms, such as Chinese hanzi and Egyptian hieroglyphs, where each character is essentially a sketch of the object it represents. We often see complex visual concepts summarized by a few simple strokes. Inspired by the human ability to draw, recent research has explored the potential to generate sketches using a wide variety of machine learning models, ranging from hierarchical Bayesian models (Lake et al., 2015) to more recent deep autoregressive models (Gregor et al., 2015; Ha & Eck, 2018; Chen et al., 2017) and generative adversarial nets (GANs) (Li et al., 2019).

It is a natural question to ask whether we can obtain useful intermediate representations from models that produce sketches in the output space, as has been shown for other generative models (Ranzato et al., 2006; Kingma & Welling, 2014; Goodfellow et al., 2014; Donahue et al., 2017; Doersch et al., 2015). Unfortunately, hierarchical Bayesian models suffer from prolonged inference time, while other current sketch models mostly focus on producing drawings in a closed-set setting with a few classes (Ha & Eck, 2018; Chen et al., 2017), or on improving log-likelihood at the pixel level (Rezende et al., 2016). Leveraging the learned representations of these drawing models remains a rather unexplored topic.

In this paper, we pose the following question: can we learn a generalized embedding function that captures salient and compositional features by directly imitating human sketches? The answer is affirmative. In our experiments we develop SketchEmbedNet, an RNN-based sketch model trained to map grayscale and natural image pixels to the sketch domain.
It is trained on hundreds of classes without the use of class labels to learn a robust drawing model that can sketch diverse and unseen inputs. We demonstrate salience by achieving state-of-the-art performance on the Omniglot few-shot classification benchmark and visual recognizability in one-shot generations. Then we explore how the embeddings capture image components and their spatial relationships to examine image-space compositionality, and also show a surprising property of conceptual composition. We then push the boundary further by applying our sketch model to natural images; to our knowledge, we are the first to extend stroke-based autoregressive models to produce drawings of open-domain natural images. We train our model with adapted SVG images from the Sketchy dataset (Sangkloy et al., 2016) and then evaluate the embedding quality directly on unseen classes in the mini-ImageNet task for few-shot classification (Vinyals et al., 2016). Our approach is competitive with existing unsupervised few-shot learning methods (Hsu et al., 2019; Khodadadeh et al., 2019; Antoniou & Storkey, 2019) on this natural image benchmark. In both the sketch and natural image domain, we show that by learning to draw, our methods generalize well even across different datasets and classes.

Figure 1: A: A natural or sketch pixel image is passed into the CNN encoder to obtain the Gaussian SketchEmbedding z, which is concatenated with the previous stroke y_{t-1} as the decoder input at each timestep to generate y_t. B+C: Downstream tasks performed after training is complete.

2. RELATED WORK

In this section we review relevant literature, including generating sketch-like images, unsupervised representation learning, unsupervised few-shot classification, and sketch-based image retrieval (SBIR).

Autoregressive drawing models: Graves (2013) uses an LSTM to directly output pen coordinates to imitate handwriting sequences. SketchRNN (Ha & Eck, 2018) builds on this by applying it to general sketches beyond characters. Song et al. (2018), Cao et al. (2019) and Ribeiro et al. (2020) all extend SketchRNN through architectural improvements. Chen et al. (2017) change the inputs to pixel images. This and the previous three works consider multi-class sketching, but none handle more than 20 classes. Autoregressive models can also generate images directly in the pixel domain: DRAW (Gregor et al., 2015) uses recurrent attention to plot pixels; Rezende et al. (2016) extend this to one-shot generation; and PixelCNN (van den Oord et al., 2016) generates image pixels sequentially.

Image processing methods & GANs: Other works produce sketch-like images based on style transfer or low-level image processing techniques. Classic methods are based on edge detection and image segmentation (Arbelaez et al., 2011; Xie & Tu, 2017). Zhang et al. (2015) use a CNN to directly produce sketch-like pixels for face images. Photo-sketch and pix2pix (Li et al., 2019; Isola et al., 2017) propose conditional GANs to generate images across different style domains. Image-processing-based methods do not acquire high-level image understanding, as all of their operations are low-level filtering; none of the GAN sketching methods are designed to mimic human drawings of open-domain natural images.

Unsupervised representation learning:

In the sketch image domain, our method is similar to the large category of generative models that learn unsupervised representations by the principle of analysis-by-synthesis. Work by Hinton & Nair (2005) operates in a sketch domain and learns to draw by synthesizing an interpretable motor program. Bayesian Program Learning (Lake et al., 2015) draws through exact inference of common strokes, but learning and inference are computationally challenging. As such, a variety of deep generative models aim to perform approximate Bayesian inference by using an encoder structure that directly predicts the embedding, e.g., deep autoencoders (Vincent et al., 2010), the Helmholtz Machine (Dayan et al., 1995), the variational autoencoder (VAE) (Kingma & Welling, 2014), BiGAN (Donahue et al., 2017), etc. Our method is also related to the literature on self-supervised representation learning (Doersch et al., 2015; Noroozi & Favaro, 2016; Gidaris et al., 2018; Zhang et al., 2016), as sketch strokes are part of the input data itself. In few-shot learning (Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017), recent work has explored unsupervised meta-training: CACTUs, AAL and UMTRA (Hsu et al., 2019; Antoniou & Storkey, 2019; Khodadadeh et al., 2019) all operate by generating pseudo-labels for training.

Sketch-based image retrieval (SBIR):

In SBIR, a model is provided a sketch drawing and retrieves a photo of the same class. The area is split into fine-grained (FG-SBIR) (Yu et al., 2016; Sangkloy et al., 2016; Bhunia et al., 2020) and zero-shot (ZS-SBIR) (Dutta & Akata, 2019; Pandey et al., 2020; Dey et al., 2019) settings. FG-SBIR considers minute details, while ZS-SBIR learns high-level cross-domain semantics and a joint latent space to perform retrieval.

3. LEARNING TO IMITATE DRAWINGS

Here we describe learning to draw through sketch imitation. Our architecture is a generative encoder-decoder model with a CNN encoder for pixel images and an RNN decoder that outputs vector sketches, as shown in Figure 1. Unlike other drawing models that only train on a single or a few classes (Ha & Eck, 2018; Chen et al., 2017), SketchEmbedNet is not limited by class inputs and can sketch a wide variety of images. We also introduce a differentiable rasterization function for computing an additional pixel-based training loss.

Input & output representation: Unlike SketchRNN, which encodes drawing sequences, we learn an image embedding by mapping pixels to sketches, similar to Chen et al. (2017). Training data for this task (adopted from Ha & Eck (2018)) consists of a tuple (x, y), where x ∈ R^{H×W×C} is the input image and y ∈ R^{T×5} is the stroke target. T is the maximum sequence length of the stroke data y, and each stroke y_t consists of 5 elements, (Δx, Δy, s1, s2, s3). The first two elements are horizontal and vertical displacements on the drawing canvas from the endpoint of the previous stroke. The latter three elements are mutually exclusive pen states: s1 indicates the pen is on paper for the next stroke, s2 indicates the pen is lifted, and s3 indicates the sketch sequence has ended. y_0 is initialized as (0, 0, 1, 0, 0) to start the generative process. Note that no class information is available during training.

SketchEmbedding as a compositional encoding of images: We use a CNN to encode the input image x and obtain the latent representation z, as shown in Figure 1. To model intra-class variance, z is a Gaussian random variable parameterized by the CNN outputs, similar to a VAE (Kingma & Welling, 2014). Throughout this paper, we refer to z as the SketchEmbedding. In typical image representations the embedding is trained to classify object classes, or to reconstruct the input pixels.
Here, since the SketchEmbedding is fed into an RNN decoder to produce a sequence of drawing actions, z is additionally encouraged to encode a compositional understanding of the object structure, instead of just an unstructured set of pixel features. For example, when drawing the legs of a turtle, the model must explicitly generate each leg instance. Pixel-based models, in contrast, suffer from blurriness and, because they generate the image all at once, do not distinguish between individual components such as the legs, body and head. This loss of component information in pixel models has been observed in the GAN literature (Goodfellow, 2017), and we propose that it is avoided by our sketching task.

To accommodate the increased training data complexity from including hundreds of classes, we also scale up our model in comparison to the work of Chen et al. (2017); Ha & Eck (2018); Song et al. (2018). The backbone is either a 4-layer CNN (Conv4) (Vinyals et al., 2016), for consistent comparisons in the few-shot setting, or a ResNet12 (Oreshkin et al., 2018), which produces better drawing results. In comparison, Chen et al. (2017) only use 2D convolutions with a maximum of 8 filters.
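To make the five-element stroke format concrete, here is a minimal NumPy sketch (function and variable names are our own, not from the paper's code) that converts absolute polyline points into (Δx, Δy, s1, s2, s3) rows:

```python
import numpy as np

def strokes_to_stroke5(strokes):
    """Convert a list of polylines (absolute (x, y) points) into the
    (dx, dy, s1, s2, s3) format described above. s1 = pen down for the
    next segment, s2 = pen lifted, s3 = end of sketch."""
    rows = []
    prev = np.zeros(2)
    for k, line in enumerate(strokes):
        pts = np.asarray(line, dtype=float)
        for i, p in enumerate(pts):
            dx, dy = p - prev
            last_pt = i == len(pts) - 1
            last_line = k == len(strokes) - 1
            if last_pt and last_line:
                state = (0, 0, 1)          # sketch ends
            elif last_pt:
                state = (0, 1, 0)          # pen lifts before the next polyline
            else:
                state = (1, 0, 0)          # pen stays down
            rows.append((dx, dy) + state)
            prev = p
    return np.array(rows)

# A square drawn as a single polyline of 5 points -> 5 stroke rows.
square = [[(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
y = strokes_to_stroke5(square)
```

In this convention, the initialization row y_0 = (0, 0, 1, 0, 0) from the text would be prepended before feeding the sequence to the decoder.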

RNN decoder

The RNN decoder used in SketchEmbedNet is the same as in SketchRNN (Ha & Eck, 2018). The decoder outputs a mixture density representing the stroke distribution at each timestep. Specifically, the stroke distribution is a mixture of M bivariate Gaussians (M is a hyperparameter) denoting spatial offsets, together with the probabilities of the three pen states s1, s2, s3. The spatial offsets Δ = (Δx, Δy) are sampled from the mixture of Gaussians, described by: (1) the normalized mixture weights π_j; (2) the mixture means μ_j = (μ_x, μ_y)_j; and (3) the covariance matrices Σ_j. We further reparameterize each Σ_j with its standard deviations σ_j = (σ_x, σ_y)_j and correlation coefficient ρ_{xy,j}. Thus, the stroke offset distribution is p(Δ) = Σ_{j=1}^{M} π_j N(Δ | μ_j, Σ_j). The RNN is implemented as a HyperLSTM (Ha et al., 2017): LSTM weights are generated at each timestep by a smaller recurrent "hypernetwork" to improve training stability. Generation is autoregressive, using z ∈ R^D, concatenated with the stroke from the previous timestep y_{t-1}, to form the input to the LSTM. Stroke y_{t-1} is the ground truth at training time (teacher forcing), or a sample from the mixture distribution output by the model at timestep t-1 during generation.
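As an illustration of the decoder's output distribution, the following NumPy sketch (parameter names are hypothetical stand-ins for the network outputs) samples a single offset from an M-component bivariate Gaussian mixture assembled from π, μ, σ and ρ:

```python
import numpy as np

def sample_offset(pi, mu, sigma, rho, rng):
    """Sample (dx, dy) from a mixture of M bivariate Gaussians.
    pi: (M,) mixture weights; mu, sigma: (M, 2); rho: (M,) correlations."""
    j = rng.choice(len(pi), p=pi)          # pick a mixture component
    # Covariance from standard deviations and correlation coefficient.
    cov = np.array([
        [sigma[j, 0] ** 2,                    rho[j] * sigma[j, 0] * sigma[j, 1]],
        [rho[j] * sigma[j, 0] * sigma[j, 1],  sigma[j, 1] ** 2],
    ])
    return rng.multivariate_normal(mu[j], cov)

rng = np.random.default_rng(0)
M = 3
pi = np.array([0.2, 0.5, 0.3])
mu = np.zeros((M, 2))
sigma = np.full((M, 2), 0.1)
rho = np.zeros(M)
delta = sample_offset(pi, mu, sigma, rho, rng)
```

At each timestep the sampled offset would be concatenated with the sampled pen state to form y_t for the next decoder input.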

3.1. LEARNING

We train the drawing model end-to-end by jointly optimizing three losses: a pen loss L_pen for learning pen states, a stroke loss L_stroke for learning pen offsets, and our proposed pixel loss L_pixel for matching the visual similarity of the predicted and target sketches:

L = L_pen + (1 - α)L_stroke + αL_pixel,    (1)

where α is a loss-weighting hyperparameter. Both L_pen and L_stroke were used in SketchRNN, while L_pixel is our novel contribution to stroke-based generative models. Unlike SketchRNN, we do not impose a prior using KL divergence, as we are not interested in unconditional sampling and it worsens quantitative results in later sections.

Pen loss: The pen-state predictions (ŝ1, ŝ2, ŝ3) are optimized as a simple 3-way classification with the softmax cross-entropy loss,

L_pen = -(1/T) Σ_{t=1}^{T} Σ_{m=1}^{3} s_{m,t} log(ŝ_{m,t}).

Stroke loss: The stroke loss maximizes the log-likelihood of the spatial offsets of each ground-truth stroke Δ_t under the mixture density distribution p_t at each timestep:

L_stroke = -(1/T) Σ_{t=1}^{T} log p_t(Δ_t).

Pixel loss: While pixel-level reconstruction objectives are common in generative models (Kingma & Welling, 2014; Vincent et al., 2010; Gregor et al., 2015), we introduce a pixel-based objective for vector sketch generation. After decoding, a differentiable rasterization function f_raster is used to map the sketch into a pixel image. f_raster transforms a stroke sequence y into a set of 2D line segments (l_0, l_1), (l_1, l_2), ..., (l_{T-1}, l_T), where l_t = Σ_{τ=0}^{t} Δ_τ. It renders by fixing the canvas dimensions, scaling and centering the strokes, then determining each pixel's intensity based on its L2 distance to the lines of the drawing. Further details on f_raster can be found in Appendix A. f_raster is applied to both the prediction ŷ and the ground truth y, producing two pixel images.
Gaussian blur g_blur(·) is applied to both images to reduce strictness before computing the binary cross-entropy loss:

I = g_blur(f_raster(y)),  Î = g_blur(f_raster(ŷ)),
L_pixel = -(1/(HW)) Σ_{i=1}^{H} Σ_{j=1}^{W} I_ij log(Î_ij).

Curriculum training schedule: We find that α (in Equation 1) is an important hyperparameter that impacts both the learned embedding space and the generation quality of SketchEmbedNet. A curriculum training schedule is used, increasing α to prioritize L_pixel relative to L_stroke as training progresses; this makes intuitive sense, as a single drawing can be produced by many different stroke sequences but learning to draw in a fixed manner is easier. While L_stroke promotes reproducing a specific drawing sequence, L_pixel only requires that the generated drawing visually match the input. Like a human, the model should learn to follow one drawing style (a la paint-by-numbers) before learning to draw freely.
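The three losses combine as in Equation 1. A minimal NumPy sketch of the arithmetic (illustrative only; the precomputed per-timestep quantities stand in for actual network outputs) might look like:

```python
import numpy as np

def pen_loss(s_true, s_pred):
    """Softmax cross-entropy over the three pen states, averaged over T.
    s_true, s_pred: (T, 3) arrays; rows of s_pred are valid probabilities."""
    return float(-np.mean(np.sum(s_true * np.log(s_pred), axis=1)))

def stroke_loss(logp):
    """Negative mean log-likelihood of the ground-truth offsets; logp holds
    log p_t(delta_t) per timestep, precomputed from the mixture density."""
    return float(-np.mean(logp))

def pixel_loss(I, I_hat, eps=1e-8):
    """Cross-entropy between the blurred rasterized target I and prediction I_hat."""
    return float(-np.mean(I * np.log(I_hat + eps)))

def total_loss(L_pen, L_stroke, L_pixel, alpha):
    """Equation 1: L = L_pen + (1 - alpha) * L_stroke + alpha * L_pixel."""
    return L_pen + (1.0 - alpha) * L_stroke + alpha * L_pixel

L = total_loss(0.1, 0.2, 0.4, alpha=0.5)
```

The curriculum schedule of Section 3.1 then amounts to increasing `alpha` over training steps, shifting weight from the stroke term to the pixel term.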

4. DRAWING IMITATION EXPERIMENTS

In this section, we introduce our experiments training SketchEmbedNet on two sketching datasets. The first consists of pure stroke-based drawings, and the second of natural image and drawing pairs. For the latter we use the Sketchy dataset (Sangkloy et al., 2016), which consists of ImageNet images paired with vector sketches, for a total of 56k examples after processing. Sketches are stored as SVGs with timestamps preserving their original drawing sequence, which we adapt by sampling paths in this order. Images are also centered, padded and resized to a resolution of 84 × 84 (see Figure 2a). We fix the maximum sequence length to T = 100 and use all 125 categories, but remove classes that have overlapping child synsets with the test classes of mini-ImageNet (Vinyals et al., 2016). This enables testing on mini-ImageNet without any alterations to the benchmark. Once again, this is an unsupervised learning formulation.
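Fixing the maximum sequence length to T = 100 implies padding or truncating each stroke sequence. One plausible way to do this, assuming end-of-sketch rows as padding (our convention for illustration, not necessarily the authors'):

```python
import numpy as np

def pad_stroke_sequence(y, T=100):
    """Pad a (t, 5) stroke array to fixed length T with end-of-sketch rows,
    or truncate it to T (forcing the final row to terminate the sketch)."""
    y = np.asarray(y, dtype=float)[:T]
    out = np.zeros((T, 5))
    out[:, 4] = 1.0                      # default padding: s3 = 1 (ended)
    out[: len(y)] = y
    if len(y) == T:
        out[-1, 2:] = (0, 0, 1)          # truncated: mark last row as the end
    return out

padded = pad_stroke_sequence([[1, 0, 1, 0, 0], [0, 1, 0, 1, 0]], T=4)
```

Fixed-length batches like this make the per-timestep losses straightforward to average over T.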

4.1. RESULTS AND VISUALIZATIONS

Figure 3 shows drawings conditioned on sketch image inputs. There is little noticeable drop in quality when we sample sketches from unseen classes compared to those the model has seen before. Figure 4 shows examples of sketches generated from natural images. While they are not fine-grained renditions, these sketches clearly demonstrate SketchEmbedNet's ability to capture key components of seen and unseen classes. The model effectively isolates the subject in each natural image and captures the circular and square shapes in the cakes and storefronts respectively. Even with the challenging lion images, it identifies the silhouette of the lying lion despite low contrast, and appropriately localizes the one on the mountainside. Unlike pixel-based auto-encoder models, our sketches do not follow the exact pose of the original strokes, but rather capture a general notion of component groupings. In the basketball example of Figure 3, the lines are not a good pixel-wise match to those in the original image, yet they are placed in sensible relative positions. Weaker examples are presented in the last rows of Figures 3 and 4; regardless, even these poorer examples still capture some structural aspects of the original image. Implementation details can be found in Appendix B. In later sections we explore the uses of SketchEmbeddings; the models are fixed for all downstream tasks.

5. FEW-SHOT CLASSIFICATION WITH SKETCHEMBEDDINGS

We would like to assess the benefits of learning to draw by performing few-shot classification with our learned embedding space. Performance on discriminative tasks reveals whether learning to imitate sketches allows the embeddings to capture salient information about novel object classes. Below we describe our few-shot classification procedure and summarize results on the Omniglot (Lake et al., 2015) and mini-ImageNet (Vinyals et al., 2016) benchmarks.
Comparison to unsupervised few-shot classification: In unsupervised few-shot classification, a model is not provided with any class labels during meta-training; only at meta-test time is it given a few labeled examples ("shots") of the novel classes. While our model is provided a "target" (a sequence of strokes) during training, it is not given any class information. Therefore, we propose that the presented sketch-imitation training, though it uses sketch sequences, is comparable to other class-label-free representation learning approaches (Berthelot et al., 2019; Donahue et al., 2017; Caron et al., 2018), and that the learned SketchEmbeddings can be applied to unsupervised few-shot classification. In our experiments, we compare to previous unsupervised few-shot learning approaches: CACTUs (Hsu et al., 2019), AAL (Antoniou & Storkey, 2019), and UMTRA (Khodadadeh et al., 2019). These methods create pseudo-labels during meta-training using either clustering or data augmentation. As additional baselines, a Conv-VAE (Kingma & Welling, 2014) and a random CNN are also included, both using the same Conv4 backbone.

Few-shot experimental setup: The CNN encoder of SketchEmbedNet is used as an embedding function, combined with a linear classification head, to perform few-shot classification. The embedding is made deterministic by taking the mean of the Gaussian latent variable z and discarding the variance parameter from the encoder. Otherwise, the conventional episodic setup for few-shot classification is used: each episode consists of a labeled "support" set of N × K (N-way K-shot) embeddings and an unlabeled "query" set. The linear classification head is trained on the labeled support set and evaluated on the query set.
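The episodic evaluation described above can be sketched as follows: an illustrative NumPy implementation of fitting a linear softmax head on frozen support embeddings, where the `embed` callable stands in for the frozen encoder (all names are ours):

```python
import numpy as np

def few_shot_episode(embed, support_x, support_y, query_x, n_way,
                     steps=200, lr=0.5):
    """Fit a linear softmax head on frozen embeddings of the support set,
    then classify the query set."""
    S, Q = embed(support_x), embed(query_x)
    W = np.zeros((S.shape[1], n_way))
    b = np.zeros(n_way)
    Y = np.eye(n_way)[support_y]
    for _ in range(steps):                      # full-batch gradient descent
        logits = S @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        g = (p - Y) / len(S)                    # softmax cross-entropy gradient
        W -= lr * (S.T @ g)
        b -= lr * g.sum(axis=0)
    return np.argmax(Q @ W + b, axis=1)

# Toy 2-way 1-shot episode with an identity "encoder".
embed = lambda x: x
support = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 1])
query = np.array([[0.2, -0.1], [4.9, 5.2]])
pred = few_shot_episode(embed, support, labels, query, n_way=2)
```

In the paper's setting, `embed` would be the SketchEmbedNet CNN encoder with the mean of z as its output.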

5.1. FEW-SHOT CLASSIFICATION ON OMNIGLOT

The Omniglot dataset (Lake et al., 2015) contains 50 alphabets and 1623 unique character types, each with 20 examples, and each character is available as both a greyscale image and a stroke drawing. We use the same train-test split as Vinyals et al. (2016), along with randomly sampled episodes. Experiments using the more challenging Lake split, where episodes are sampled within an alphabet, as proposed by Lake et al. (2015), are in Appendix E; random-seed experiments are in Appendix F. To ensure a fair comparison with other few-shot classification models, we use the same convolutional encoder (Conv4) as Vinyals et al. (2016). Results from training only on Omniglot (Lake et al., 2015) are also presented to demonstrate effectiveness without the larger Quickdraw (Jongejan et al., 2016) dataset. No significant improvements were observed using the deeper ResNet12 (Oreshkin et al., 2018) architecture; additional results are in Appendix I.

All of our methods outperform the previous state of the art on the unsupervised Omniglot benchmark (Table 1). The Quickdraw-trained model surpasses supervised MAML (Finn et al., 2017) and is on par with a supervised ProtoNet (Snell et al., 2017) model, especially in the 5-shot settings. Both baselines, a Conv-VAE and a random CNN, perform well compared to other unsupervised methods.

5.2. FEW-SHOT CLASSIFICATION ON MINI-IMAGENET

We extend our investigation and assess SketchEmbeddings for the classification of natural images on the mini-ImageNet benchmark (Vinyals et al., 2016). The same CNN encoder from the natural image sketching task is used, to match the visual domain of the examples we hope to classify. While the natural image sketching task is challenging and does not always produce high-fidelity results, it still learns useful visual information. By training on the Sketchy dataset, we learn how to draw other data distributions for which no sequential stroke data exists. Then, by knowing how to sketch the mini-ImageNet data, we are able to produce distinguishable embeddings that enable competitive few-shot classification performance.

Varying the weighting of the pixel loss: For both settings we sweep the pixel loss coefficient α_max to ablate its impact on model performance on the Omniglot task (Table 3). There is a substantial improvement in few-shot classification when α_max is non-zero. α_max = 0.50 achieves the best results, and the trend goes downward as α_max approaches 1.0, i.e. as the weighting of L_stroke goes to 0.0. This is reasonable, as the training of SketchEmbedNet is more stable under the guidance of ground-truth strokes.

6. PROPERTIES OF SKETCHEMBEDDINGS

We hypothesize that reproducing a sketch drawing, rather than reconstructing pixels, requires preserving more structural information, owing to the sequential RNN generation. By learning in this manner, SketchEmbeddings become aware of spatial properties and the composition of elements in image space. We examine this compositionality through several comparisons of SketchEmbeddings with embeddings produced by a Conv-VAE.

Component arrangements

We construct examples that contain the same set of objects but in different arrangements to test sensitivity to component arrangement and composition in image space. We then embed these examples with both generative models and project into 2D space using UMAP (McInnes et al., 2018) to visualize their organization. In the first two panels of Figure 5-A, we see that the SketchEmbeddings are easily separated by unsupervised clustering. The rightmost panel of Figure 5-A shows non-synthetic classes with duplicated shapes: snowmen with circles and televisions with squares. These results demonstrate the greater component-level awareness of SketchEmbeddings. The four rearranged shapes and the nested circles and squares have similar silhouettes that are difficult for a conventional pixel loss to differentiate. To SketchEmbeddings, the canvas offset and the different drawing sequence of each shape make them substantially different in embedding space.

Spatial relationships: Drawing also builds awareness of relevant underlying variables, such as spatial relationships between components of the image. We examine the degree to which the underlying variables of angle, distance and size are captured by the embedding, by constructing images that vary along each dimension respectively. The embeddings are again projected into 2D in Figure 5-B using the UMAP (McInnes et al., 2018) algorithm. The 2D projection of SketchEmbeddings arranges the examples along an axis corresponding to the latent variable, whereas the Conv-VAE embeddings form visibly non-linear clusters. This clear axis alignment suggests a greater degree of latent variable disentanglement in the SketchEmbeddings.

Conceptual composition: Finally, we explore concept-space composition using SketchEmbeddings (Figure 5-C) by embedding different Quickdraw examples and then performing arithmetic with the latent vectors.
By subtracting a circle embedding from a snowman composed of stacked circles and adding a square embedding, we produce stacked boxes. This property of vector arithmetic is reminiscent of language representations, as evidenced in analogies like King - Man + Woman = Queen (Ethayarajh et al., 2019). Our results indicate that this property is captured to a greater degree by SketchEmbeddings than by the pixel-based VAE embeddings. Composing SketchEmbeddings produces decoded examples that appeal to our intuitive conceptual understanding, while the VAE degrades to blurry, fragmented images. We provide more examples of this setup in Figure 5-C, as well as additional classes, in Appendix K.
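The latent arithmetic can be illustrated with a toy example. The orthogonal "concept" vectors below are stand-ins for real SketchEmbeddings and exist purely to show the nearest-neighbour arithmetic; they are not outputs of the trained model:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Toy orthogonal concept directions (hypothetical, for illustration only).
d = 256
circle, square, stacking = np.eye(d)[0], np.eye(d)[1], np.eye(d)[2]
snowman = unit(circle + stacking)     # "stacked circles"
boxes = unit(square + stacking)       # "stacked boxes"

# Analogy: snowman - circle + square should land nearest "stacked boxes".
composed = unit(snowman - circle + square)
candidates = {"circle": circle, "square": square,
              "snowman": snowman, "stacked boxes": boxes}
nearest = max(candidates, key=lambda k: float(composed @ candidates[k]))
```

In the paper's setting, the composed vector would instead be decoded by the RNN to produce the stacked-boxes sketch directly.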

7. ONE-SHOT GENERATION

To evaluate the sketches generated by our model, we make qualitative comparisons to other one-shot generative models and quantitatively assess our generations through visual classification by a ResNet101 (He et al., 2016). In this section, all models use the ResNet12 (Oreshkin et al., 2018) backbone.

Qualitative comparisons

We compare SketchEmbedNet one-shot generations of Omniglot characters with examples from other few-shot (Reed et al., 2017) and one-shot (Rezende et al., 2016) generative models.

8. CONCLUSION

Learning to draw is not only an artistic pursuit; it drives a distillation of real-world visual concepts. We present a generalized drawing model capable of producing accurate sketches and visual summaries of open-domain natural images. While sketch data may be challenging to source, we show that training to draw either sketch or natural images generalizes to downstream tasks, not only within each domain but also well beyond the training data. More generally, research in this direction may lead to more lifelike image understanding inspired by how humans communicate visual concepts.

A RASTERIZATION

The key enabler of our novel pixel loss for sketch drawings is our differentiable rasterization function f_raster. Sequence-based loss functions such as L_stroke are sensitive to the order of points, while in reality drawings are sequence invariant: visually, a square is a square whether it is drawn clockwise or counterclockwise. The purpose of a sketch representation is to lower the complexity of the data space and decode in a more visually intuitive manner. While it is a necessary departure point, the sequential generation of drawings is not key to our visual representation, and we would like SketchEmbedNet to be agnostic to the specific sequence used to draw the sketch representing the image input. To facilitate this, we develop a rasterization function f_raster that renders an input sequence of strokes as a pixel image. During training, the RNN outputs a mixture of Gaussians at each timestep; to convert this to a stroke sequence, we sample from these Gaussians (this can be repeated to reduce the variance of the pixel loss). We then scale the predicted and ground-truth sequences by properties of the latter before rasterization.

Stroke sampling: At the end of sequence generation we have N_s × (6M + 3) parameters: 6M Gaussian mixture parameters and 3 pen states for each of the N_s strokes. To obtain the actual drawing, we sample from the mixture of Gaussians:

(Δx_t, Δy_t)^T = (μ_{x,t}, μ_{y,t})^T + [σ_{x,t}, 0; ρ_{xy,t} σ_{y,t}, σ_{y,t} sqrt(1 - ρ_{xy,t}^2)] ε,  ε ∼ N(0, I_2).

After sampling, we compute the cumulative sum of the strokes over the timesteps to obtain the absolute displacement from the initial position:

(x_t, y_t) = Σ_{τ=0}^{t} (Δx_τ, Δy_τ),  y_{t,abs} = (x_t, y_t, s1, s2, s3).

Scaling: Each sketch generated by our model begins at (0, 0), and the variance of all strokes in the training set is normalized to 1. On a fixed canvas the raw drawing is therefore both very small and localized to the top-left corner.
We remedy this by computing a scale λ and shifts x_shift, y_shift from the labels y and applying them to both the prediction ŷ and the ground truth y. These parameters are computed as:

λ = min( W / (x_max - x_min), H / (y_max - y_min) ),
x_shift = λ (x_max + x_min) / 2,  y_shift = λ (y_max + y_min) / 2,

where x_max, x_min, y_max, y_min are the maximum and minimum values of x_t, y_t from the supervised stroke labels (not the generated strokes), and W and H are the width and height in pixels of the output canvas.

Calculating pixel intensity: Finally, we calculate the intensity p_ij of every pixel in the H × W canvas:

p_ij = σ( 2 - 5 × min_{t=1...N_s} [ dist((i, j), ((x_{t-1}, y_{t-1}), (x_t, y_t))) + (1 - s_{1,t-1}) × 10^6 ] ),

where the distance function is the distance of point (i, j) from the line segment defined by the absolute points (x_{t-1}, y_{t-1}) and (x_t, y_t). We also blow up any distances where s_{1,t-1} < 0.5, so as not to render strokes where the pen is not touching the paper.
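Putting the pieces together, a simplified (non-batched) NumPy reading of f_raster might render pen-down segments by pixel-to-segment distance and the σ(2 - 5·dist) squashing described above; all names here are ours, and the scaling step is omitted for brevity:

```python
import numpy as np

def raster(points, pen_down, H=28, W=28, sharpness=5.0):
    """Render absolute stroke endpoints to an H x W intensity image using
    the distance from each pixel to the nearest pen-down line segment,
    squashed through a sigmoid (an illustrative reading of f_raster)."""
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1).astype(float)      # grid[i, j] = (x, y)
    dmin = np.full((H, W), np.inf)
    for t in range(1, len(points)):
        if not pen_down[t - 1]:
            continue                                       # pen lifted: skip
        a, b = np.asarray(points[t - 1], float), np.asarray(points[t], float)
        ab = b - a
        denom = ab @ ab + 1e-12
        u = np.clip(((grid - a) @ ab) / denom, 0.0, 1.0)   # project onto segment
        closest = a + u[..., None] * ab
        d = np.linalg.norm(grid - closest, axis=-1)
        dmin = np.minimum(dmin, d)
    z = np.clip(2.0 - sharpness * dmin, -60.0, 60.0)       # sigma(2 - 5*dist)
    return 1.0 / (1.0 + np.exp(-z))

pts = [(4, 4), (20, 4), (20, 20)]                          # an "L" shape
img = raster(pts, pen_down=[True, True])
```

The large additive penalty for pen-up segments in the paper's formula plays the role of the explicit `continue` here: such segments never win the minimum and so are never rendered.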

B IMPLEMENTATION DETAILS

We train our model for 300k iterations with a batch size of 256 for the Quickdraw dataset and 64 for Sketchy, due to memory constraints. The initial learning rate is 1e-3, decayed by a factor of 0.85 every 15k steps. We use the Adam (Kingma & Ba, 2015) optimizer and clip gradient values at 1.0. σ = 2.0 is used for the Gaussian blur in L_pixel. For the curriculum learning schedule, α is set to 0 initially and increases by 0.05 every 10k training steps, with an empirically obtained cap at α_max = 0.50 for Quickdraw and α_max = 0.75 for Sketchy. The ResNet12 (Oreshkin et al., 2018) encoder uses 4 ResNet blocks with 64, 128, 256 and 512 filters respectively, with ReLU activations. The Conv4 backbone has 4 blocks of convolution, batch norm (Ioffe & Szegedy, 2015), ReLU and max pooling, identical to Vinyals et al. (2016). We set the latent space to 256 dimensions, the RNN output size to 1024, and the hypernetwork embedding size to 64. We use a mixture of M = 30 bivariate Gaussians for the mixture density output of the stroke offset distribution.
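The curriculum and learning-rate schedules described above reduce to two small step functions; this is a direct reading of the stated hyperparameters (function names are ours):

```python
def alpha_at(step, alpha_max=0.50):
    """Curriculum weight: alpha starts at 0 and rises by 0.05 every 10k steps,
    capped at alpha_max (0.50 for Quickdraw, 0.75 for Sketchy)."""
    return min(alpha_max, 0.05 * (step // 10_000))

def lr_at(step, base=1e-3, decay=0.85, every=15_000):
    """Step-wise exponential learning-rate decay: x0.85 every 15k steps."""
    return base * decay ** (step // every)
```

Under this schedule, the pixel loss carries no weight for the first 10k steps and reaches its full weight by 100k steps on Quickdraw.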

C LATENT SPACE INTERPOLATION

As in many encoder-decoder models, we evaluate interpolation in our latent space. We select 4 embeddings at random and use bilinear interpolation to produce new embeddings; results are in Figures 7a and 7b. We observe that compositionality is also present in these interpolations. In the top row of Figure 7a, the model first plots a third small circle when interpolating from the 2-circle power outlet to the 3-circle snowman. This small circle is treated as a single component that grows as it transitions between classes, until it reaches its final size in the far-right snowman drawing. Some other RNN-based sketching models (Ha & Eck, 2018; Chen et al., 2017) exhibit unrelated classes materializing in interpolations between two classes. Our model does not exhibit this behaviour, as our embedding space is learned from many more classes and thus does not contain local groupings of classes.
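Bilinear interpolation between four corner embeddings can be written as follows (a generic sketch; the decoder that would render each latent vector into a drawing is omitted):

```python
import numpy as np

def bilinear_grid(z00, z01, z10, z11, n=5):
    """Bilinearly interpolate between four corner embeddings, returning an
    n x n grid of latent vectors (corners included)."""
    ts = np.linspace(0.0, 1.0, n)
    grid = np.empty((n, n, len(z00)))
    for i, u in enumerate(ts):           # vertical blend
        for j, v in enumerate(ts):       # horizontal blend
            top = (1 - v) * z00 + v * z01
            bottom = (1 - v) * z10 + v * z11
            grid[i, j] = (1 - u) * top + u * bottom
    return grid

corners = [np.eye(4)[k] for k in range(4)]
g = bilinear_grid(*corners, n=3)
```

Each row of the resulting grid, once decoded, corresponds to one row of drawings in Figure 7a/7b.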

D EFFECT OF α ON FEW-SHOT CLASSIFICATION

We performed additional experiments exploring the impact of our curriculum training schedule for α. The encoding component of our drawing model was evaluated on the few-shot classification task for different values of α_max every 25k iterations during training. A graph is shown in Figure 8, and the full table for all values of α_max is in Table 5.

E FEW-SHOT CLASSIFICATION ON THE OMNIGLOT LAKE SPLIT

The creators of the Omniglot dataset and one-shot classification benchmark originally proposed an intra-alphabet classification task. This task is more challenging than the common Vinyals split, as characters from the same alphabet may share similar stylistic sub-components that make visual differentiation more difficult. This benchmark has been less explored by researchers; however, we present the performance of our SketchEmbedding model against evaluations of other few-shot classification models on it. Results are shown in Table 6, which includes supervised baselines: a supervised ProtoNet (Snell et al., 2017; Lake et al., 2019) with a Conv4 backbone trained on Omniglot at 86.30, RCN (George et al., 2017; Lake et al., 2019) at 92.70, and VHE (Hewitt et al., 2018; Lake et al., 2019) at 81.30. Unsurprisingly, our model is outperformed by the supervised models, and falls behind by a more substantial margin than on the Vinyals split. However, our SketchEmbedding approach still achieves respectable classification accuracy overall and greatly outperforms the Conv-VAE baseline.

F EFFECT OF RANDOM SEEDING ON FEW-SHOT CLASSIFICATION

The training objective of SketchEmbedNet is to reproduce sketch drawings of the input. This task is unrelated to few-shot classification, so performance may vary with initialization. We quantify this variance by training our model with 15 unique random seeds and evaluating the performance of the latent space on the few-shot classification tasks. We disregard the per-episode variance of our model at each test stage and only report the mean accuracy.
We then compute a new confidence interval over the random seeds. Results are presented in Tables 7, 8 and 9.

G DATA PROCESSING

G.1 QUICKDRAW

key, keyboard, knee, knife, ladder, lantern, leaf, leg, light bulb, lighter, lighthouse, lightning, line, lipstick, lobster, mailbox, map, marker, matches, megaphone, mermaid, microphone, microwave, monkey, mosquito, motorbike, mountain, mouse, moustache, mouth, mushroom, nail, necklace, nose, octopus, onion, oven, owl, paint can, paintbrush, palm tree, parachute, passport, peanut, pear, pencil, penguin, piano, pickup truck, pig, pineapple, pliers, police car, pool, popsicle, postcard, purse, rabbit, raccoon, radio, rain, rainbow, rake, remote control, rhinoceros, river, rollerskates, sailboat, sandwich, saxophone, scissors, see saw, shark, sheep, shoe, shorts, shovel, sink, skull, sleeping bag, smiley face, snail, snake, snowflake, soccer ball, speedboat, square, star, steak, stereo, stitches, stop sign, strawberry, streetlight, string bean, submarine, sun, swing set, syringe, t-shirt, table, teapot, teddy-bear, tennis racquet, tent, tiger, toe, tooth, toothpaste, tractor, traffic light, train, triangle, trombone, truck, trumpet, umbrella, underwear, van, vase, watermelon, wheel, windmill, wine bottle, wine glass, wristwatch, zigzag, blackberry, power outlet, peas, hot tub, toothbrush, skateboard, cloud, elbow, bat, pond, compass, elephant, hurricane, jail, school bus, skyscraper, tornado, picture frame, lollipop, spoon, saw, cup, roller coaster, pants, jacket, rifle, yoga, toilet, waterslide, axe, snowman, bracelet, basket, anvil, octagon, washing machine, tree, television, bowtie, sweater, backpack, zebra, suitcase, stairs, The Great Wall of China

G.2 OMNIGLOT

We derive our Omniglot tasks from the stroke dataset originally provided by Lake et al. (2015) rather than the image analogues. We translate the Omniglot stroke-by-stroke format to the same one used in Quickdraw.
Then we apply the Ramer-Douglas-Peucker (Douglas & Peucker, 1973) algorithm with an epsilon value of 2 and normalize the stroke variance to 1 to produce y. We also rasterize our images in the same manner as above to produce our input x.
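The simplification step above uses the standard recursive Ramer-Douglas-Peucker procedure; a generic NumPy sketch (not the paper's implementation) is:

```python
import numpy as np

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker polyline simplification.

    points: (N, 2) array of xy coordinates for one stroke.
    Returns the simplified stroke as an (M, 2) array with M <= N.
    """
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    chord_len = np.hypot(chord[0], chord[1])
    d = points - start
    if chord_len == 0.0:
        # Degenerate chord: fall back to distance from the start point.
        dists = np.hypot(d[:, 0], d[:, 1])
    else:
        # Perpendicular distance to the start-end chord (2D cross product).
        dists = np.abs(chord[0] * d[:, 1] - chord[1] * d[:, 0]) / chord_len
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        # Farthest point exceeds tolerance: keep it and recurse on both halves.
        left = rdp(points[: idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return np.vstack([left[:-1], right])  # drop the duplicated split point
    # All points lie within epsilon of the chord: keep only the endpoints.
    return np.vstack([start, end])
```

Applied with epsilon = 2 to each Omniglot stroke, this prunes near-collinear points while preserving the overall stroke shape.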

G.3 SKETCHY

Sketchy data is provided as SVG images composed of line paths that are either straight lines or Bézier curves. To generate stroke data, we sample sequences of points from the Bézier curves at a high resolution, then simplify them with RDP (ε = 5). We also eliminate continuous strokes with a short path length or small displacement.

To analyze stroke reuse, we collect the hypernetwork activations from many examples, cluster them in 512-dimensional space, and visualize the strokes belonging to each cluster for each example. A full decoding is also rendered, where each cluster within an example is assigned a color.

Single class: snowman. First we explore this clustering using only the snowman class from Quickdraw (Jongejan et al., 2016). We expect substantial reuse of a "circle" both within and across many examples. Clustering of the strokes is done with DBSCAN (Ester et al., 1996) with parameter ε = 3.9. Results are in Figure 20. Each row is a separate input; the far left column is the color-coded, composed image, the second is the noise cluster, and every subsequent column is a unique cluster.
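Sampling point sequences from the SVG Bézier paths described above can be sketched as follows. Cubic curves are assumed, and the sampling density `n` is illustrative rather than taken from the paper.

```python
import numpy as np

def sample_cubic_bezier(p0, p1, p2, p3, n=100):
    """Sample n points along a cubic Bezier curve from 4 control points.

    Control points are (2,) arrays; returns an (n, 2) array of xy points.
    Uses the Bernstein-polynomial form of the cubic Bezier curve.
    """
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)
```

Each sampled polyline would then be passed through RDP simplification to obtain a compact stroke representation.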



Figure 2: Examples from Sketchy (Sangkloy et al., 2016) and Quickdraw (Jongejan et al., 2016) datasets. Sketchy examples are reshaped and padded to increase image-sketch spatial agreement.

Figure 3: Sampled generated drawings of Quickdraw examples. Weaker drawings boxed in red.

Figure 4: Sampled drawings of mini-ImageNet examples. Weaker drawings boxed in red.

Figure 5: Experiments exploring properties of SketchEmbeddings. Examples are colored only for readability.

Figure 6: One-shot Omniglot generation compared to Rezende et al. (2016); Reed et al. (2017).

(a) Interpolation of classes: power outlet, snowman, jacket, elbow (b) Interpolation of classes: cloud, power outlet, basket, compass

Figure 7: Latent space interpolations of randomly selected examples

Figure 8: Few-shot classification accuracy of α max values 0.0 and 0.5 over training

Figure 15: Latent space visualization of squares and circles arranged differently in a 2x2 array

Figure 20: Snowman class stroke clustering

Figure 21: Fully composed images with coloured cluster assignments

Figure 22: Snowman class stroke clustering

We still observe that the model continues to isolate circles in the first column, and note that it does so even for the cup and clock classes, which are not exclusively circular.

Figure 23: Fully composed images with coloured cluster assignments

Many random classes: Finally, we repeat the above clustering with the 45 randomly selected holdout classes from the Quickdraw training process of SketchEmbedding. Results are once again presented in Figure 24, with selected examples in Figure 25.

Figure 24: Stroke clustering of randomly selected holdout classes

Few-shot classification results on Omniglot

Few-shot classification results on mini-ImageNet

We present a new paradigm: using sketching as an auxiliary task to learn visual content. Only by training a drawing model that can sketch general image inputs are we able to transfer the learned understanding to new data distributions. By considering the stroke distribution of the Quickdraw dataset, we are able to interpret image inputs from the separate Omniglot dataset and tackle the few-shot classification task with surprising accuracy.

Effect of pixel loss coefficient α on Omniglot few-shot classification

ResNet-101 45-way classification score on 1-shot generated sketches of seen and unseen classes.

In the settings shown, none of the models have seen any examples from the character class or the parent alphabet. Furthermore, the drawer has seen no written characters during training and is trained only on the Quickdraw dataset. Visually, our generated images better resemble the support examples, and the variations introduced by stretching and shifting strokes better preserve the semantics of each character. Generations in pixel space may disrupt strokes and alter the character to human perception. This is especially true for written characters, as they are frequently defined by a specific set of strokes rather than blurry clusters of pixels.

Quantitative evaluation of generation quality. Evaluating generative models is often challenging. Per-pixel metrics like those in Reed et al. (2017); Rezende et al. (2016) may penalize generative variance that still preserves meaning. We propose an Inception Score (Salimans et al., 2016) inspired metric to quantify the class-recognizability and generalization of generated examples. We train two separate ResNet classifiers (He et al., 2016), each on a different set of 45 Quickdraw classes. One set was part of the training set of SketchEmbedNet (referred to as "seen") and the other was held out during training (referred to as "unseen"). We then have SketchEmbedNet generate one-shot sketches from each set and have the corresponding classifier predict a class. The accuracy of the classifier on generated examples is compared with its training accuracy in Table 4. For a ResNet classifier, SketchEmbedNet generations are more recognizable for both seen and unseen classes.
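The recognizability metric reduces to classifier accuracy on generated examples. A minimal sketch is below; `predict_fn` is a hypothetical stand-in for the separately trained ResNet classifier, not an interface from the paper.

```python
import numpy as np

def recognizability(predict_fn, generated_images, true_classes):
    """Class-recognizability of generated sketches: the accuracy of a fixed,
    separately trained classifier when labelling one-shot generations.

    predict_fn: callable mapping a batch of images to predicted class ids.
    true_classes: the class each generation was conditioned on.
    """
    preds = np.asarray(predict_fn(generated_images))
    return float(np.mean(preds == np.asarray(true_classes)))
```

The score on generations is then compared against the same classifier's accuracy on its real training data, for both the "seen" and "unseen" class splits.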

Few-shot classification accuracy of all α_max values (columns are successive 25k-iteration checkpoints)

0.35: 87.94 88.73 88.46 88.01 88.04 88.23 87.73 88.03 87.86 87.65 87.17
0.25: 89.21 90.39 90.20 89.75 87.78 88.37 88.64 88.05 87.98 88.41 88.15 87.82
0.50: 90.48 89.58 89.81 89.02 90.68 91.24 90.26 90.94 91.12 91.30 91.12 91.39
0.75: 91.39 89.95 89.56 89.81 89.95 90.79 91.02 91.09 91.82 90.76 91.42 90.59
0.95: 90.23 90.15 90.10 89.55 90.27 92.37 92.27 90.29 91.58 91.02 89.73 89.77

Few-shot classification results on Omniglot (Lake split)

Random Seeding on Few-Shot Classification results on Omniglot (Conv4)

Random Seeding on Few-Shot Classification results on Omniglot (ResNet12)

Random Seeding on Few-Shot Classification results on mini-ImageNet

We apply the same data processing methods as in Ha & Eck (2018) with no additional changes to produce our stroke labels y. When rasterizing our input x, we scale and center the strokes, then pad the image with 10% of the resolution in that dimension, rounded to the nearest integer.
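The scale-center-pad rasterization step can be sketched as follows. This is a simplified illustration: the square 100-pixel resolution is arbitrary, and interpreting the 10% padding as a margin on each side is an assumption rather than a detail stated in the paper.

```python
import numpy as np

def center_and_pad(strokes, resolution=100, frac=0.10):
    """Scale and center stroke coordinates on a square canvas, leaving a
    margin of `frac` of the resolution (rounded) on each side.

    strokes: (N, 2) array of absolute xy coordinates.
    Returns (N, 2) pixel coordinates ready for rasterization.
    """
    pts = np.asarray(strokes, dtype=float)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    span = float((maxs - mins).max()) or 1.0  # guard single-point strokes
    pad = round(frac * resolution)
    usable = resolution - 2 * pad
    # Scale so the longest extent fills the usable region.
    scaled = (pts - mins) / span * usable
    # Center the drawing inside the usable region, then shift by the margin.
    offset = (usable - (maxs - mins) / span * usable) / 2
    return scaled + offset + pad
```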


Strokes with a short path length or small displacement, relative to the scale of the entire sketch, are removed to reduce stroke count and eliminate small, noisy strokes. Once again, we normalize stroke variance and rasterize the input image in the same manner as above. The following classes were used for training after removing classes that overlap with mini-ImageNet: hot-air_balloon, violin, tiger, eyeglasses, mouse, jack-o-lantern, lobster, teddy_bear, teapot, helicopter, duck, wading_bird, rabbit, penguin, sheep, windmill, piano, jellyfish, table, fan, beetle, cabin, scorpion, scissors, banana, tank, umbrella, crocodilian, volcano, knife, cup, saxophone, pistol, swan, chicken, sword, seal, alarm_clock, rocket, bicycle, owl, squirrel, hermit_crab, horse, spoon, cow, hotdog, camel, turtle, pizza, spider, songbird, rifle, chair, starfish, tree, airplane, bread, bench, harp, seagull, blimp, apple, geyser, trumpet, frog, lizard, axe, sea_turtle, pretzel, snail, butterfly, bear, ray, wine_bottle, elephant, raccoon, rhinoceros, door, hat, deer, snake, ape, flower, car_(sedan), kangaroo, dolphin, hamburger, castle, pineapple, saw, zebra, candle, cannon, racket, church, fish, mushroom, strawberry, window, sailboat, hourglass, cat, shoe, hedgehog, couch, giraffe, hammer, motorcycle, shark
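The stroke-pruning step for Sketchy described above can be sketched as follows. The threshold fractions are illustrative placeholders, not the paper's values.

```python
import numpy as np

def filter_short_strokes(strokes, sketch_scale,
                         min_len_frac=0.05, min_disp_frac=0.05):
    """Drop strokes whose total path length OR endpoint displacement is
    small relative to the overall sketch scale.

    strokes: list of (N_i, 2) arrays of xy points, one per stroke.
    """
    kept = []
    for s in strokes:
        s = np.asarray(s, dtype=float)
        seg = np.diff(s, axis=0)
        path_len = np.hypot(seg[:, 0], seg[:, 1]).sum()
        disp = np.hypot(*(s[-1] - s[0]))
        # A stroke survives only if BOTH quantities exceed their thresholds.
        if (path_len >= min_len_frac * sketch_scale
                and disp >= min_disp_frac * sketch_scale):
            kept.append(s)
    return kept
```

Note that a long closed loop is also discarded under this rule, since its endpoint displacement is near zero even though its path length is large.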

H AUTOREGRESSIVE DRAWING MODEL COMPARISONS

We summarize the key components of SketchEmbedNet in comparison to other autoregressive drawing models in Table 10 . 

I FEW-SHOT CLASSIFICATION ON OMNIGLOT -FULL RESULTS

The full table of few-shot classification results on the Omniglot (Lake et al., 2015) dataset, including the ResNet12 (Oreshkin et al., 2018) model, is presented below.

J FEW-SHOT CLASSIFICATION ON MINI-IMAGENET -FULL RESULTS

The full table of few-shot classification results on the mini-ImageNet dataset, including the ResNet12 (Oreshkin et al., 2018) and Conv4 models, is presented below.

L EMBEDDING PROPERTIES OF OTHER BASELINE MODELS

Here we substantiate the uniqueness of the properties observed in SketchEmbeddings by applying the same experiments to a β-VAE (Higgins et al., 2017) as well as a vanilla autoencoder trained on the same dataset. To show that the performance of SketchEmbedding embeddings in the experiments of Section 6, which focus on organization in latent space, is not correlated with the KL term, we present both a vanilla autoencoder without the KL objective and a SketchEmbedNet trained with a KL objective. For the KL-trained SketchEmbedNet, we observe a drop in overall generation quality in the conceptual composition decoding, as expected with an additional constraint, but maintained performance on the other tasks. Meanwhile, the autoencoder does not demonstrate any marked improvement over the Conv-VAE in the main paper or any other baseline comparison.
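The KL objective being compared here is the standard closed-form Gaussian KL used by VAEs; a minimal sketch follows, with the β value illustrative (β = 1 recovers the vanilla VAE objective).

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) per example, in closed form."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def beta_vae_loss(recon_loss, mu, logvar, beta=4.0):
    """beta-VAE objective (Higgins et al., 2017): reconstruction + beta * KL."""
    return recon_loss + beta * gaussian_kl(mu, logvar).mean()
```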

M ADDITIONAL COMPOSITIONALITY MODES

We provide additional clustering methods, t-SNE (Maaten & Hinton, 2008) and PCA, as well as two new experiments that explore the compositionality of our latent SketchEmbedding.

Additional experiments. Here we provide further investigations into the compositionality of our learned embedding space that were not present in the main paper. These results are presented in Figures 18 and 19.

Additional clustering methods

Figure 18: 2D embedding visualization of different spatial orientations of circles and squares

In Figure 18 we place a square in the center of the example and place a circle above, below, or to either side of it. Once again, we find that our SketchEmbedding embeddings cluster better than the VAE approach.
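The PCA projection used for such 2D visualizations can be sketched as follows (a generic SVD-based implementation, not the paper's code):

```python
import numpy as np

def pca_2d(embeddings):
    """Project embeddings onto their top-2 principal components via SVD."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)  # center before computing principal directions
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T  # coordinates along the two leading components
```

Each 2D point is then scatter-plotted and colored by its spatial-arrangement class to inspect clustering.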

N HYPERNETWORK ACTIVATIONS

To further explore how our network understands drawings, we examine the relationships between the activations of the hypernetwork of our HyperLSTM (Ha et al., 2017). The hypernetwork generates the weights of the decoder LSTM at each decoding timestep. These activations are 512-dimensional vectors. We collect the activations from many examples and cluster them in 512-dimensional space.
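The hypernetwork idea can be illustrated minimally as a small network that emits the weights of the main decoder at each timestep. The sizes and the plain linear map below are illustrative assumptions, not the actual HyperLSTM architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 512-d hypernetwork activation, 8x8 main-network weight.
ACT_DIM, OUT_ROWS, OUT_COLS = 512, 8, 8

# The hypernetwork itself: a linear map from activations to flattened weights.
W_hyper = rng.normal(0.0, 0.02, size=(ACT_DIM, OUT_ROWS * OUT_COLS))
b_hyper = np.zeros(OUT_ROWS * OUT_COLS)

def generate_main_weights(activation):
    """Map one hypernetwork activation vector to a weight matrix for the
    main decoder network at the current timestep."""
    return (activation @ W_hyper + b_hyper).reshape(OUT_ROWS, OUT_COLS)

# A different activation at each decoding step yields different main weights.
z_t = rng.normal(size=ACT_DIM)
W_main = generate_main_weights(z_t)
```

It is these per-timestep activation vectors (here `z_t`) that are collected across examples and clustered.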

