SKETCHEMBEDNET: LEARNING NOVEL CONCEPTS BY IMITATING DRAWINGS

Abstract

Sketch drawings are an intuitive visual domain that appeals to human instinct. Previous work has shown that recurrent neural networks are capable of producing sketch drawings of a single class, or of a few classes, at a time. In this work we investigate the representations developed by training a generative model to produce sketches from pixel images across many classes in a sketch domain. We find that the embeddings learned by this sketching model are highly informative for visual tasks and encode a unique visual understanding. We then use them to exceed state-of-the-art performance in unsupervised few-shot classification on the Omniglot and mini-ImageNet benchmarks. We also leverage the generative capacity of our model to produce high-quality sketches of novel classes based on just a single example.

1. INTRODUCTION

Upon encountering a novel concept, such as a six-legged turtle, humans can quickly generalize this concept by composing a mental picture. The ability to generate drawings greatly facilitates communicating new ideas. This dates back to the advent of writing, as many ancient written languages are based on logograms, such as Chinese hanzi and Egyptian hieroglyphs, where each character is essentially a sketch of the object it represents. We often see complex visual concepts summarized by a few simple strokes. Inspired by the human ability to draw, recent research has explored the potential to generate sketches using a wide variety of machine learning models, ranging from hierarchical Bayesian models (Lake et al., 2015) to more recent deep autoregressive models (Gregor et al., 2015; Ha & Eck, 2018; Chen et al., 2017) and generative adversarial nets (GANs) (Li et al., 2019).

It is natural to ask whether we can obtain useful intermediate representations from models that produce sketches in the output space, as has been shown for other generative models (Ranzato et al., 2006; Kingma & Welling, 2014; Goodfellow et al., 2014; Donahue et al., 2017; Doersch et al., 2015). Unfortunately, hierarchical Bayesian models suffer from prolonged inference time, while other current sketch models mostly focus on producing drawings in a closed-set setting with a few classes (Ha & Eck, 2018; Chen et al., 2017), or on improving log likelihood at the pixel level (Rezende et al., 2016). Leveraging the learned representations of these drawing models remains a largely unexplored topic. In this paper, we pose the following question: can we learn a generalized embedding function that captures salient and compositional features by directly imitating human sketches? The answer is affirmative. In our experiments we develop SketchEmbedNet, an RNN-based sketch model trained to map grayscale and natural image pixels to the sketch domain.
It is trained on hundreds of classes without the use of class labels to learn a robust drawing model that can sketch diverse and unseen inputs. We demonstrate salience by achieving state-of-the-art performance on the Omniglot few-shot classification benchmark and by the visual recognizability of our one-shot generations. We then examine how the embeddings capture image components and their spatial relationships, demonstrating compositionality in image space, and we show a surprising property of conceptual composition. We push the boundary further by applying our sketch model to natural images; to our knowledge, we are the first to extend stroke-based autoregressive models to produce drawings of open-domain natural images. We train our model with adapted SVG images from the Sketchy dataset (Sangkloy et al., 2016) and then evaluate the embedding quality directly on unseen classes in the mini-ImageNet few-shot classification task (Vinyals et al., 2016). Our approach is competitive with existing unsupervised few-shot learning methods (Hsu et al., 2019; Khodadadeh et al., 2019; Antoniou & Storkey, 2019) on this natural image benchmark. In both the sketch and the natural image domain, we show that by learning to draw, our method generalizes well even across different datasets and classes.
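The paper does not prescribe a particular classifier for the few-shot evaluation described above, but frozen embeddings are commonly evaluated with a simple nearest-centroid rule over the support set. The sketch below illustrates that protocol on toy 2-D "embeddings"; the function and variable names are illustrative placeholders, not the paper's API:

```python
import numpy as np

def nearest_centroid_episode(support_emb, support_labels, query_emb):
    """Classify query embeddings by distance to per-class support centroids.

    support_emb: (N, D) embeddings of the N-way x K-shot support set
    support_labels: (N,) integer class ids
    query_emb: (Q, D) embeddings of the query images
    Returns predicted class ids of shape (Q,).
    """
    classes = np.unique(support_labels)
    # Mean embedding (centroid) of each support class.
    centroids = np.stack([support_emb[support_labels == c].mean(axis=0)
                          for c in classes])
    # Squared Euclidean distance from each query to each centroid.
    d = ((query_emb[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[d.argmin(axis=1)]

# Toy 2-way, 1-shot episode with 2-D "embeddings".
support = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 1])
queries = np.array([[1.0, 0.5], [9.0, 9.5]])
print(nearest_centroid_episode(support, labels, queries))  # -> [0 1]
```

Because the classifier has no learned parameters, any accuracy difference between embedding functions directly reflects the quality of the representations themselves.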

2. RELATED WORK

In this section we review relevant literature on generating sketch-like images, unsupervised representation learning, unsupervised few-shot classification, and sketch-based image retrieval (SBIR).

Autoregressive drawing models: Graves (2013) uses an LSTM to directly output pen coordinates to imitate handwriting sequences. SketchRNN (Ha & Eck, 2018) builds on this by applying it to general sketches beyond characters. Song et al. (2018); Cao et al. (2019); Ribeiro et al. (2020) all extend SketchRNN through architectural improvements, and Chen et al. (2017) change the inputs to be pixel images. These works consider multi-class sketching, but none handle more than 20 classes. Autoregressive models can also generate images directly in the pixel domain: DRAW (Gregor et al., 2015) uses recurrent attention to plot pixels, Rezende et al. (2016) extend this to one-shot generation, and PixelCNN (van den Oord et al., 2016) generates image pixels sequentially.

Image processing methods & GANs: Other works produce sketch-like images based on style transfer or low-level image processing techniques. Classic methods are based on edge detection and image segmentation (Arbelaez et al., 2011; Xie & Tu, 2017). Zhang et al. (2015) use a CNN to directly produce sketch-like pixels for face images. Photo-sketch and pix2pix (Li et al., 2019; Isola et al., 2017) propose conditional GANs to generate images across different style domains. Image-processing-based methods do not acquire a high-level understanding of the image, as all of their operations amount to low-level filtering; and none of the GAN sketching methods are designed to mimic human drawings on open-domain natural images.

Unsupervised representation learning: In the sketch image domain, our method belongs to the large category of generative models that learn unsupervised representations by the principle of analysis-by-synthesis. Hinton & Nair (2005) operate in a sketch domain and learn to draw by synthesizing an interpretable motor program. Bayesian Program Learning (Lake et al., 2015) draws through exact inference of common strokes, but its learning and inference are computationally challenging. As such, a variety of deep generative models perform approximate Bayesian inference by using an encoder that directly predicts the embedding, e.g., deep autoencoders (Vincent et al., 2010), the Helmholtz Machine (Dayan et al., 1995), the variational autoencoder (VAE) (Kingma & Welling, 2014), and BiGAN (Donahue et al., 2017). Our method is also related to the literature on self-supervised representation learning (Doersch et al., 2015; Noroozi & Favaro, 2016; Gidaris et al., 2018; Zhang et al., 2016), as sketch strokes are part of the input data itself.

Unsupervised few-shot classification: In few-shot learning (Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017), recent work has explored unsupervised meta-training. CACTUs, AAL and UMTRA (Hsu et al., 2019; Antoniou & Storkey, 2019; Khodadadeh et al., 2019) all operate by generating pseudo-labels for training.

Sketch-based image retrieval (SBIR): In SBIR, a model is provided with a sketch drawing and must retrieve a photo of the same class. The area is split into a fine-grained setting (FG-SBIR) (Yu et al., 2016; Sangkloy et al., 2016; Bhunia et al., 2020) and a zero-shot setting (ZS-SBIR) (Dutta & Akata, 2019; Pandey et al., 2020; Dey et al., 2019). FG-SBIR considers minute details, while ZS-SBIR learns high-level cross-domain semantics and a joint latent space to perform retrieval.

Figure 1: A: A natural or sketch pixel image is passed into the CNN encoder to obtain the Gaussian SketchEmbedding z, which is concatenated with the previous stroke y_{t-1} as the decoder input at each timestep to generate y_t. B+C: Downstream tasks performed after training is complete.

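The encoder-decoder loop of Figure 1 can be sketched at the shape level as follows. This is a minimal illustration only: it assumes a 128-dimensional latent, the five-component stroke format (dx, dy, and three pen states) used by SketchRNN-style models, and a stand-in tanh recurrence with random weights in place of the model's actual learned RNN decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, STROKE, HIDDEN, STEPS = 128, 5, 64, 10  # stroke-5: (dx, dy, p1, p2, p3)

# Stand-in parameters; a real model would learn these (and use an LSTM cell).
W_in = rng.normal(scale=0.1, size=(HIDDEN, LATENT + STROKE + HIDDEN))
W_out = rng.normal(scale=0.1, size=(STROKE, HIDDEN))

def decode(z, steps=STEPS):
    """Autoregressively emit strokes: each step consumes [z, y_{t-1}]."""
    h = np.zeros(HIDDEN)
    y_prev = np.zeros(STROKE)              # initial "start of sketch" token
    strokes = []
    for _ in range(steps):
        x = np.concatenate([z, y_prev, h])  # condition on embedding + last stroke
        h = np.tanh(W_in @ x)               # toy recurrent state update
        y_prev = W_out @ h                  # parameters of the next stroke
        strokes.append(y_prev)
    return np.stack(strokes)

z = rng.normal(size=LATENT)                 # the "SketchEmbedding" from the encoder
print(decode(z).shape)                      # -> (10, 5)
```

The key structural point is that z is re-injected at every timestep alongside the previous stroke, so the entire drawing is conditioned on the single fixed-length embedding that the downstream tasks (B+C in Figure 1) later reuse.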
