PANDA: UNSUPERVISED LEARNING OF PARTS AND APPEARANCES IN THE FEATURE MAPS OF GANS

Abstract

Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures, and either discover only global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization to the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control.

1. INTRODUCTION

Figure 1: We propose an unsupervised method for learning a set of factors that correspond to interpretable parts and appearances in a dataset of images. These can be used for multiple tasks: (a) local image editing, (b) context-aware object removal, and (c) producing saliency maps for learnt concepts of interest.

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) constitute the state of the art (SOTA) for the task of image synthesis. However, despite the remarkable progress in this domain through improvements to the image generator's architecture (Radford et al., 2016; Karras et al., 2018; 2019; 2020b; 2021; Brock et al., 2019), their inner workings remain to a large extent unexplored. Developing a better understanding of the way in which high-level concepts are represented and composed to form synthetic images is important for a number of downstream tasks such as generative model interpretability (Shen et al., 2020a; Bau et al., 2019; Yang et al., 2021) and image editing (Härkönen et al., 2020; Shen & Zhou, 2021; Shen et al., 2020c; Voynov & Babenko, 2020; Tzelepis et al., 2021; Bau et al., 2020). In modern generators, however, the synthetic images are produced through an increasingly complex interaction of a set of per-layer latent codes in tandem with the feature maps themselves (Karras et al., 2020b; 2019; 2021) and/or with skip connections (Brock et al., 2019). Furthermore, given the rapid pace at which new architectures are being developed, demystifying the process by which these vastly different networks model the constituent parts of an image is an ever-present challenge. Thus, many recent advances are architecture-specific (Wu et al., 2021; Collins et al., 2020; Ling et al., 2021), and a general-purpose method for analyzing and manipulating convolutional generators remains elusive.
A popular line of GAN-based image editing research concerns itself with learning so-called "interpretable directions" in the generator's latent space (Härkönen et al., 2020; Shen & Zhou, 2021; Shen et al., 2020c; Voynov & Babenko, 2020; Tzelepis et al., 2021; Yang et al., 2021; He et al., 2021; Haas et al., 2021; 2022). Once discovered, such representations of high-level concepts can be manipulated to bring about predictable changes to the images. One important question in this line of research is how latent representations are combined to form the appearance at a particular local region of the image. Whilst some recent methods attempt to tackle this problem (Wang et al., 2021; Wu et al., 2021; Broad et al., 2022; Zhu et al., 2021a; Zhang et al., 2021; Ling et al., 2021; Kafri et al., 2021), the current state-of-the-art methods come with a number of important drawbacks and limitations. In particular, existing techniques require prohibitively long training times (Wu et al., 2021; Zhu et al., 2021a), costly Jacobian-based optimization (Zhu et al., 2021a; 2022), and some form of supervision through semantic masks (Wu et al., 2021) or manually specified regions of interest (Zhu et al., 2021a; 2022). Furthermore, whilst these methods successfully find directions effecting local changes, optimization must be performed on a per-region basis, and the resulting directions do not provide pixel-level control, a term introduced by Zhu et al. (2021a) referring to the ability to precisely target specific pixels in the image. In this light, we present a fast unsupervised method for jointly learning factors for interpretable parts and their appearances (we thus refer to our method as PandA) in pre-trained convolutional generators. Our method allows one to both interpret and edit an image's style at discovered local semantic regions of interest, using the learnt appearance representations.
We achieve this by formulating a constrained optimization problem with a semi-nonnegative tensor decomposition of the dataset of deep feature maps Z ∈ R^{M×H×W×C} in a convolutional generator. This allows one to accomplish a number of useful tasks, prominent examples of which are shown in Fig. 1. Firstly, our learnt representations of appearance across samples can be used for the popular task of local image editing (Zhu et al., 2021a; Wu et al., 2021), for example, to change the colour or texture of a cat's ears as shown in Fig. 1(a). Whilst the state-of-the-art methods (Zhu et al., 2021a; Wu et al., 2021; Zhu et al., 2022) provide fine-grained control over a target region, they adopt an "annotation-first" approach, requiring an end-user to first manually specify a region of interest (ROI). By contrast, our method fully exploits the unsupervised learning paradigm, wherein such concepts are discovered automatically and without any manual annotation. These discovered semantic regions can then be chosen, combined, or even modified by an end-user as desired for local image editing. More interestingly still, through a generic decomposition of the feature maps, our method identifies representations of common concepts (such as "background") in all generator architectures considered (all three StyleGAN variants (Karras et al., 2019; 2020b; 2021), ProgressiveGAN (Karras et al., 2018), and BigGAN (Brock et al., 2019)). This is a surprising finding, given that these generators are radically different in architecture. By then editing the feature maps using these appearance factors, we can thus, for example, remove specific objects in the foreground (Fig. 1(b)) in all generators, seamlessly replacing the pixels at the target region with the background appropriate to each image. However, our method is useful not only for local image editing, but also provides a straightforward way to localize the learnt appearance concepts in the images.
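To make the factorization concrete, the sketch below shows a minimal semi-nonnegative factorization of flattened feature maps via projected alternating least squares. This is an illustrative solver under our own naming (`semi_nmf`, `S`, `A`), not the paper's exact algorithm or constraints: we only assume the general semi-NMF setup, in which the spatial "parts" loadings are kept nonnegative while the appearance vectors remain unconstrained.

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, seed=0):
    """Factorize X (n x c) as S @ A, with nonnegative spatial loadings
    S (n x k) and unconstrained appearance vectors A (k x c).
    Simple projected alternating least squares, for illustration only."""
    rng = np.random.default_rng(seed)
    S = rng.random((X.shape[0], k))
    for _ in range(n_iter):
        # A-step: unconstrained least squares given the current S.
        A, *_ = np.linalg.lstsq(S, X, rcond=None)
        # S-step: least squares, then project onto the nonnegative orthant.
        St, *_ = np.linalg.lstsq(A.T, X.T, rcond=None)
        S = np.maximum(St.T, 0.0)
    return S, A

# Usage on a batch of feature maps Z of shape (M, H, W, C): flatten the
# sample and spatial dimensions, factorize, and reshape each nonnegative
# loading column back into a spatial "part" map per image.
Z = np.random.default_rng(1).standard_normal((4, 8, 8, 16))
S, A = semi_nmf(Z.reshape(-1, 16), k=5)
parts = S.reshape(4, 8, 8, 5)  # per-image spatial part maps
```

Because only one factor is sign-constrained, the data itself may be mixed-sign, which is what makes semi-NMF applicable to arbitrary deep feature maps.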
By expressing activations in terms of our learnt appearance basis, we obtain a visualization of how much of each appearance concept is present at each spatial location (i.e., saliency maps for concepts of interest). By then thresholding the values in these saliency maps (as shown in Fig. 1(c)), we can localize the learnt appearance concepts (such as sky, floor, or background) in the images, without the need for supervision at any stage. We show exhaustive experiments on 5 different architectures (Karras et al., 2020b; 2018; 2021; 2019; Brock et al., 2019) and 5 datasets (Deng et al., 2009; Choi et al., 2020; Karras et al., 2019; Yu et al.,
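The saliency-map construction above can be sketched as follows: each pixel's activation vector is expressed in the appearance basis by least squares, and the coefficient map for one basis vector is thresholded to produce a localization mask. The function names and the min-max normalization before thresholding are our own illustrative choices, not necessarily the paper's procedure.

```python
import numpy as np

def concept_saliency(feat, A):
    """Express each pixel's activation in the appearance basis A (k x c).
    feat: (H, W, C) feature map. Returns (k, H, W) saliency maps, one
    per appearance concept, via per-pixel least squares."""
    H, W, C = feat.shape
    coeffs, *_ = np.linalg.lstsq(A.T, feat.reshape(-1, C).T, rcond=None)
    return coeffs.reshape(-1, H, W)

def concept_mask(saliency, j, tau=0.5):
    """Binary localization mask for concept j, obtained by thresholding
    its min-max normalized saliency map at tau."""
    s = saliency[j]
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)
    return s > tau
```

For a feature map that lies in the span of the appearance vectors, the least-squares coefficients recover the per-pixel mixing weights exactly; in general they give the closest such explanation of the activations.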

