GENERATING UNSEEN COMPLEX SCENES: ARE WE THERE YET?

Abstract

Although recent complex scene conditional generation models generate increasingly appealing scenes, it is very hard to assess which models perform better and why. This is often due to models being trained to fit different data splits and defining their own experimental setups. In this paper, we propose a methodology to compare complex scene conditional generation models, and provide an in-depth analysis that assesses the ability of each model to (1) fit the training distribution and hence perform well on seen conditionings, (2) generalize to unseen conditionings composed of seen object combinations, and (3) generalize to unseen conditionings composed of unseen object combinations. As a result, we observe that recent methods are able to generate recognizable scenes given seen conditionings, and exploit compositionality to generalize to unseen conditionings with seen object combinations. However, all methods suffer from noticeable image quality degradation when asked to generate images from conditionings composed of unseen object combinations. Moreover, through our analysis, we identify the advantages of different pipeline components, and find that (1) encouraging compositionality through instance-wise spatial conditioning normalizations increases robustness to both types of unseen conditionings, (2) using semantically aware losses such as the scene-graph perceptual similarity helps improve some dimensions of the generation process, and (3) enhancing the quality of generated masks and the quality of the individual objects are crucial steps to improve robustness to both types of unseen conditionings.

1. INTRODUCTION

Recent years have witnessed significant advances in generative models (Goodfellow et al., 2014; Kingma & Welling, 2014; van den Oord et al., 2016a; Miyato & Koyama, 2018; Miyato et al., 2018; Brock et al., 2019), enabling their increasingly widespread use in many application domains (van den Oord et al., 2016b; Vondrick et al., 2016; Zhang et al., 2018a; Hong et al., 2018; Sun & Wu, 2020). Among the most promising approaches, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have achieved remarkable results, generating high quality, high resolution samples in the context of single class conditional image generation (Brock et al., 2019). This outstanding progress has paved the road towards tackling more challenging tasks such as that of complex scene conditional generation, where the goal is to generate high quality images depicting multiple objects and their interactions from a given conditioning (e.g. bounding box layout, segmentation mask, or scene graph). Given the exploding number of possible object combinations and their layouts, the requested conditionings oftentimes require zero-shot generalization. Therefore, successfully generating high quality, high resolution, diverse samples from complex scene datasets such as COCO-Stuff (Caesar et al., 2018) remains a stretch goal. Despite recent efforts producing increasingly appealing complex scene samples (Hong et al., 2018; Hinz et al., 2019; Park et al., 2019; Ashual & Wolf, 2019; Sun & Wu, 2020; Sylvain et al., 2020), and as previously noted in the unconditional GAN literature (Lucic et al., 2018; Kurach et al., 2018), it is unfortunately very hard to assess which models perform better, and perhaps more importantly why.
In the case of conditional complex scene generation, this is often due to models being trained to fit different data splits, using different conditioning modalities and levels of supervision (bounding box layouts, segmentation masks, scene graphs), and reporting inconsistent quantitative metrics (e.g. repeatedly computing previous methods' results using different reference distributions, and/or using different image compression algorithms to store generated images), among other uncontrolled sources of variation. Moreover, these methods disregard the challenges that emerge from their expected generalization to unseen conditionings. This lack of rigour leads to conclusion replication failure, and hinders the identification of the most promising directions to advance the field. Therefore, in this paper we aim to provide an in-depth analysis and propose a methodology to compare current conditional complex scene generation methods. We argue that such an analysis has the potential to deepen the understanding of these approaches and contribute to their progress. The proposed methodology addresses the following questions: (1) How well does each method perform on seen conditionings (training conditionings)?; (2) How well does each method generalize to unseen conditionings composed of seen object combinations?; and (3) How well does each method generalize to unseen conditionings composed of unseen object combinations? We investigate the answers to these questions from both a scene-wise and an object-wise standpoint.
As a result, we observe that: (1) recent methods are capable of generating identifiable scenes from seen conditionings; (2) they exhibit some generalization capabilities when using unseen conditionings composed of seen object combinations, exploiting compositionality to generate scenes with different object arrangements; (3) they suffer from poorer generated image quality when asked for unseen object combinations, especially the method of Sylvain et al. (2020). However, in all cases, the quality of the generated objects generally suffers from missing high frequency details, especially for those classes in the long tail of the dataset distribution. Finally, through an extensive ablation, we are able to identify the strengths and weaknesses of different pipeline components. In particular, we note that endowing the generator with an instance-wise normalization module (Sun & Wu, 2019) results in increased individual object quality and better robustness to both types of unseen conditionings, whereas exploiting the normalization module of Sylvain et al. (2020) results in improved scene quality, suggesting that exploiting scene compositionality in the generator helps improve generalization. Moreover, we find that including a scene-graph similarity loss (Sylvain et al., 2020) when training complex scene conditional generation models leads to better conditional consistency, especially for unseen object combinations, emphasizing the promise of moving towards semantically aware training losses to improve generalization. We also identify the improvement of generated segmentation masks as one promising avenue to promote generalization to unseen conditionings. Finally, by leveraging these findings, we are able to compose a pipeline which obtains state-of-the-art results in metrics such as FID, while maintaining competitive results in almost all other studied metrics.

2. RELATED WORK

Evaluation metrics for GANs. The most widely used metrics are the Inception Score (IS) (Salimans et al., 2016) and the Fréchet Inception Distance (FID) (Heusel et al., 2017), which aim to capture visual sample quality and sample diversity. On the one hand, IS was designed to evaluate single-object image generation on problems where the expected marginal distribution over classes is uniform, an unrealistic expectation in many real-world datasets with multi-object images. On the other hand, FID was introduced for unconditional image generation and yields a single distribution-specific score, which hinders the analysis of individual failure cases. To overcome the former, researchers have attempted to extend FID to the conditional generation case (DeVries et al., 2019; Benny et al., 2020). To overcome the latter, researchers have proposed to disentangle visual sample quality and diversity into two different metrics (Shmelkov et al., 2018; Sajjadi et al., 2018; Kynkäänniemi et al., 2019; Ravuri & Vinyals, 2019) by modifying the definition of precision-recall in different ways. Finally, the diversity of generated samples has also been quantified by measuring the perceptual similarity between generated image patches, as proposed by Zhang et al. (2018b).

Complex scene conditional generation. The conditional generation literature encompasses a variety of input conditioning modalities. Several existing works focus on generating photo-realistic scenes from detailed semantic segmentation masks (Chen & Koltun, 2017; Qi et al., 2018; Park et al., 2019; Wang et al., 2018; Tang et al., 2020). Although semantic segmentation masks are information rich, they may be difficult to obtain with enough fine-grained details. This is especially relevant in applications where a user is expected to specify the input conditioning.
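As a concrete reference for the discussion above, the FID computes the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. A minimal sketch follows; it assumes the (N, D) feature arrays have already been extracted with a pretrained Inception network, which is not shown here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Both inputs are (N, D) arrays; in the FID setting these would be
    pool3 activations from a pretrained Inception network.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical error can
    # introduce a negligible imaginary component, which we discard.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Because the score summarizes the full reference distribution in a single pair of moments (mean and covariance), it cannot localize which conditionings or object classes fail, which motivates the per-case extensions discussed above.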
As such, other commonly used input conditioning modalities include text descriptions (Reed et al., 2016; Hong et al., 2018; Tan et al., 2019), scene graphs (Johnson et al., 2018; Ashual & Wolf, 2019), and bounding box layouts specifying the position and scale of the objects in the scene (Zhao et al., 2019; Sun &

