GENERATING FURRY CARS: DISENTANGLING OBJECT SHAPE & APPEARANCE ACROSS MULTIPLE DOMAINS

Abstract

Figure 1: Each block above follows [Appearance, Shape → Output]. We propose a generative model that disentangles and combines shape and appearance factors across multiple domains, creating hybrid images that do not exist in any single domain.

1. INTRODUCTION

Humans possess a remarkable ability to combine properties from multiple image distributions to create entirely new visual concepts. For example, Lake et al. (2015) discussed how humans can parse different object parts (e.g., the wheels of a car, the handle of a lawn mower) and combine them to conceptualize novel object categories (a scooter). Fig. 2 illustrates another example from a different angle: it is easy for us humans to imagine how the brown car would look if its appearance were borrowed from the blue and red bird. To model a similar ability in machines, we need a precise disentanglement of shape and appearance features, together with the ability to combine them across different domains. In this work, we seek to develop a framework that does just that, where we define domains to correspond to "basic-level categories" (Rosch, 1978). Disentangling the factors of variation in visual data has received significant attention (Chen et al., 2016; Higgins et al., 2017; Denton & Birodkar, 2017; Singh et al., 2019), in particular with advances in generative models (Goodfellow et al., 2014; Radford et al., 2016; Zhang et al., 2018; Karras et al., 2019; Brock et al., 2019). The premise behind learning disentangled representations is that an image can be thought of as a function of, say, two independent latent factors, each of which controls only one human-interpretable property (e.g., shape vs. appearance). The existence of such representations enables combining latent factors from two different source images to create a new one that has properties of both. Prior generative modeling work (Hu et al., 2018; Singh et al., 2019; Li et al., 2020) explores a part of this idea, where the space of latent factors being combined is limited to one domain (e.g., combining a sparrow's appearance with a duck's shape within the domain of birds; I_AA in Fig. 2), a scenario which we refer to as intra-domain disentanglement of latent factors.
This work, focusing on shape and appearance as factors, generalizes this idea to inter-domain disentanglement: combining latent factors from different domains (e.g., appearance from birds, shape from cars) to create a new breed of images which does not exist in either domain (I_AB in Fig. 2). The key challenge is that there is no ground-truth distribution for the hybrid visual concept that spans the two domains. As a result, directly applying a single-domain disentangled image generation approach to the multi-domain setting does not work, as the hybrid concept would be considered out of distribution (we provide more analysis in Sec. 3). Despite the lack of ground truth, as humans, we would deem certain combinations of factors to be better than others. For example, if two domains share object parts (e.g., dog and leopard), we would prefer a transfer of appearance in which local part appearances are preserved. For domains that do not share object parts (e.g., bird and car), we may prefer a transfer of appearance in which the overall color/texture frequency is preserved (e.g., Fig. 2, I_2 and I_AB), a cue found to be useful for coarse-level object categorization in a neuroimaging study (Rice et al., 2014). Our work formulates this idea as a training process, in which any two images sharing the same latent appearance are constrained to have a similar frequency of low-level features. These features are in turn learned (as opposed to being hand-crafted), using contrastive learning (Hadsell et al., 2006; Chen et al., 2020), to better capture the low-level statistics of the dataset. The net effect is an accurate transfer of appearance, where important details remain consistent across domains in spite of large shape changes. Importantly, we achieve this by only requiring bounding box annotations to help disentangle object from background, without any other labels, including which domain an image comes from.
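To make the appearance constraint concrete, the following is a minimal, simplified sketch (not the authors' exact formulation). It assumes a hypothetical `appearance_histogram` helper that spatially pools a low-level feature map into a frequency histogram, discarding layout/shape information, and an InfoNCE-style loss that pulls together images sharing a latent appearance code while pushing apart others:

```python
import numpy as np

def appearance_histogram(feat_map, n_bins=8):
    """Pool a low-level feature map (values assumed in [0, 1]) into a
    normalized frequency histogram. Spatial pooling discards shape and
    layout, keeping only how often each feature value occurs -- a
    hypothetical stand-in for the paper's learned low-level features."""
    hist, _ = np.histogram(feat_map, bins=n_bins, range=(0.0, 1.0))
    return hist / (hist.sum() + 1e-8)

def contrastive_appearance_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style objective: the anchor image and a positive image
    generated with the SAME latent appearance should have similar
    feature-frequency histograms, regardless of which domain supplied
    the shape; negatives come from other appearance codes."""
    def sim(a, b):
        # Negative squared L2 distance as a similarity score.
        return -np.sum((a - b) ** 2)

    logits = np.array(
        [sim(anchor, positive)] + [sim(anchor, n) for n in negatives]
    ) / tau
    logits -= logits.max()  # numerical stability before exponentiating
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

Under this sketch, a generator producing two renderings from the same appearance code (even with very different shapes) incurs a low loss only if their pooled low-level statistics match, which is the frequency-preservation idea described above.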
To our knowledge, our work is the first to attempt combining factors from different data distributions to generate abstract visual concepts (e.g., a car with a dog's texture). We perform experiments on a variety of multi-modal datasets and demonstrate our method's effectiveness qualitatively, quantitatively, and through user studies. We believe our work can open up new avenues for art/design; e.g., a customer could visualize how sofas would look with an animal print, or a fashion/car designer could create a new space of designs using the appearance of arbitrary objects. Finally, we believe that the task introduced in this work offers better scrutiny of the quality of disentanglement learned by a method: if a method succeeds at disentangling factors within a single domain but not in the presence of multiple ones, that in essence indicates that the factors remain entangled with the domain's properties.



Figure 2: Each domain can be represented with, e.g., a set of object shapes (X_A/B) and appearances (Y_A/B). The ability to generate images of the form I_AA/BB requires the system to learn intra-domain disentanglement (Singh et al., 2019) of latent factors, whereas the ability to generate images of the form I_AB (appearance/shape from domain A/B, respectively) requires inter-domain disentanglement of factors, which is the goal of this work.

