GENERATIVE SCENE GRAPH NETWORKS

Abstract

Human perception excels at building compositional hierarchies of parts and objects from unlabeled scenes, which supports systematic generalization. Yet most work on generative scene modeling either ignores the part-whole relationship or assumes access to predefined part labels. In this paper, we propose Generative Scene Graph Networks (GSGNs), the first deep generative model that jointly learns to discover primitive parts and infer the part-whole relationship from multi-object scenes, without supervision and in an end-to-end trainable way. We formulate GSGN as a variational autoencoder in which the latent representation is a tree-structured probabilistic scene graph. The leaf nodes in the latent tree correspond to primitive parts, and the edges represent the symbolic pose variables required for recursively composing the parts into whole objects and then the full scene. This allows novel objects and scenes to be generated both by sampling from the prior and by manual configuration of the pose variables, as we do with graphics engines. We evaluate GSGN on datasets of scenes containing multiple compositional objects, including a challenging Compositional CLEVR dataset that we have developed. We show that GSGN is able to infer the latent scene graph, generalize outside the training regime, and improve data efficiency in downstream tasks.

1. INTRODUCTION

Learning to discover and represent objects purely from observations is at the core of human cognition (Spelke & Kinzler, 2007). Recent advances in unsupervised object-centric representation learning have enabled decomposition of scenes into objects (Greff et al., 2019; Lin et al., 2020b; Locatello et al., 2020), inference and rendering of 3D object models (Chen et al., 2020), and object tracking and future generation (Crawford & Pineau, 2019a; Jiang et al., 2020; Lin et al., 2020a). These neuro-symbolic approaches, where the discreteness of discovered objects provides the symbolic representation, facilitate desirable abilities such as out-of-distribution generalization, relational reasoning, and causal inference.

In this paper, we seek to further discover and represent the structure within objects without supervision. Our motivation is that natural scenes frequently contain compositional objects: objects that are composed of primitive parts. We humans can easily identify the primitive parts and recognize the part-whole relationship. Representing objects as an explicit composition of parts is expected to be more efficient, since a vast number of complex objects can often be compositionally explained by a small set of simple primitives. It also allows us to imagine and create meaningful new objects.

A well-established representation for part-whole relationships in computer graphics is the scene graph (Foley et al., 1996). It is a tree whose leaf nodes store models of primitive parts, and whose edges specify affine transformations that compose parts into objects. While in computer graphics the scene graph is manually constructed for rendering, in this paper we are interested in inferring the scene graph from unlabeled images. To this end, we propose Generative Scene Graph Networks (GSGNs). We formulate this model as a variational autoencoder (Kingma & Welling, 2013; Rezende et al., 2014) whose latent representation is a probabilistic scene graph.
In the latent tree, each node is associated with an appearance variable that summarizes the composition up to the current level, and each edge with a pose variable that parameterizes the affine transformation from the current level to the level above. The design of the GSGN decoder follows the rendering process of graphics engines, but with differentiable operations that help the encoder learn inverse graphics (Tieleman, 2014; Wu et al., 2017; Romaszko et al., 2017; Yao et al., 2018; Deng et al., 2019). As a result, the pose variables inferred by GSGN are interpretable, and the probabilistic scene graph supports symbolic manipulation by configuring the pose variables.

One major challenge is to infer the structure of the scene graph. This involves identifying the parts and grouping them into objects. Notice that unlike objects, parts are often stitched together and can thus be severely occluded, making them hard to separate. Existing methods for learning hierarchical scene representations circumvent this challenge by working on single-object scenes (Kosiorek et al., 2019) and by relying on predefined or pre-segmented parts (Li et al., 2017; Huang et al., 2020). In contrast, GSGN addresses this challenge directly, and learns to infer the scene graph structure from multi-object scenes without knowledge of individual parts. Our key observation is that the scene graph has a recursive structure: inferring the structure of the tree should be similar to inferring the structure of its subtrees. Hence, we develop a top-down inference process that first decomposes the scene into objects and then further decomposes each object into its parts. This allows us to reuse existing scene decomposition methods such as SPACE (Lin et al., 2020b) as an inference module shared at each level of the scene graph. However, we find that SPACE has difficulty separating parts that have severe occlusion, possibly due to its complete reliance on bottom-up image features.
Therefore, simply applying SPACE for decomposition at each level leads to suboptimal scene graphs. To alleviate this, GSGN learns a prior over plausible scene graphs that captures typical compositions. During inference, this prior provides top-down information which is combined with bottom-up image features to help reduce the ambiguity caused by occlusion.

For evaluation, we develop two datasets of scenes containing multiple compositional objects, in 2D and 3D respectively. These can be regarded as compositional versions of Multi-dSprites (Greff et al., 2019) and CLEVR (Johnson et al., 2017), two commonly used datasets for evaluating unsupervised object-level scene decomposition. For example, the compositional 3D objects in our dataset are made up of shapes similar to those in the CLEVR dataset, with variable sizes, colors, and materials. Hence, we name our 3D dataset the Compositional CLEVR dataset.

The contributions of this paper are: (i) we propose the probabilistic scene graph representation that enables unsupervised and end-to-end scene graph inference and compositional scene generation, (ii) we develop and release the Compositional CLEVR dataset to facilitate future research on object compositionality, and (iii) we demonstrate that our model is able to infer the latent scene graph, shows decent generation quality and generalization ability, and improves data efficiency in downstream tasks.

2. RELATED WORK

Object-centric representations. Our model builds upon a line of recent work on unsupervised object-centric representation learning, which aims to eliminate the need for supervision in structured scene understanding. These methods learn a holistic model capable of decomposing scenes into objects, learning appearance representations for each object, and generating novel scenes, all without supervision and in an end-to-end trainable way. We believe such unsupervised and holistic models are more desirable, albeit more challenging to learn. These models can be categorized into scene-mixture models (Greff et al., 2017; 2019; Burgess et al., 2019; Engelcke et al., 2020; Locatello et al., 2020) and spatial-attention models (Eslami et al., 2016; Crawford & Pineau, 2019b; Lin et al., 2020b; Jiang & Ahn, 2020). Compared to these models, we go a step further by also decomposing objects into parts. We use spatial-attention models as the inference module at each level of the scene graph, because they explicitly provide object positions, unlike scene-mixture models. We combine the inference module with a learned prior to help improve robustness to occlusion. This also allows sampling novel scenes from the prior, which is not possible with spatial-attention models.

Hierarchical scene representations. Modeling the part-whole relationship in scenes has attracted growing interest, and has been utilized for improving image classification (Sabour et al., 2017; Hinton et al., 2018; Kosiorek et al., 2019) and parsing and segmentation (Zhu et al., 2008). However, these models have been applied only to scenes with one dominant object, and cannot perform scene generation. Recent work on assembly-based 3D shape modeling also learns the part-whole relationship (Tulsiani et al., 2017; Li et al., 2017; Zhu et al., 2018; Huang et al., 2020; Kania et al., 2020), but these methods require predefined or pre-segmented parts as input, and can only model single shapes with no background.
By contrast, our model learns the part-whole relationship from multi-object scenes without knowledge of individual parts. There has also been work on 3D part decomposition (Chen et al., 2019; Deng et al., 2020), but these methods require voxels or point clouds as input, and typically focus on geometry (e.g., part occupancy) without learning to represent appearance (e.g., color, material). Part hierarchies have also been used for shape generation (Mo et al., 2019), where the hierarchy is provided as input rather than inferred from the input. Our approach infers compositional structures from static scenes, and is orthogonal to methods that use motion cues for decomposing dynamic scenes (Xu et al., 2019) and methods that infer physical interactions from dynamic scenes (Li et al., 2020; Stanić et al., 2020). Hinton (2021) recently proposed an iterative procedure that is expected to form the part hierarchy through multiple rounds of message passing among adjacent levels. While our model works without iterative message passing, we believe such iteration will be important for parsing more complex scenes.

Hierarchical latent variable models. Our model can be regarded as a hierarchical latent variable model, and is inspired by several recent advances (Bachman, 2016; Sønderby et al., 2016; Zhao et al., 2017; Maaløe et al., 2019) that have achieved impressive generation quality. While these methods focus on designing hierarchical structures and training methods that harness the full expressive power of generative models, our goal is to learn, from unlabeled images, a hierarchical structure that captures the compositional relationship among symbolic entities like objects and parts.

3. GENERATIVE SCENE GRAPH NETWORKS

3.1. GENERATIVE PROCESS

We assume that the image x is generated by a set of foreground variables, collectively denoted z^{fg}, and a background variable z^{bg}, as follows:

p(x) = \int p(x \mid z^{fg}, z^{bg}) \, p(z^{bg} \mid z^{fg}) \, p(z^{fg}) \, dz^{fg} \, dz^{bg} .

To represent the compositional structures within foreground objects, GSGN models z^{fg} as a tree-structured probabilistic scene graph, as shown in Figure 1A. Each leaf node represents a primitive entity that is not further decomposed. Each internal node represents an abstract entity that is composed from its children. Similar to graphics engines, the composition is modeled as affine transformations, and is specified by the relative pose (including rotation, scaling, and translation) of each child node v with respect to its parent pa(v). We use a pose variable z^{pose}_v to represent the relative pose, and associate it with the corresponding edge. We also associate an appearance variable z^{appr}_v with each node v. It is expected to represent the appearance of entity v in its canonical pose, summarizing all lower-level composition in the subtree rooted at v. In particular, the appearance variable z^{appr}_r at the root node r summarizes the full scene. Due to this summarization assumption, given z^{appr}_v, we can generate the pose and appearance variables for all children of v in a conditionally independent way. Hence, for a given tree structure with V being the set of all nodes, the prior over foreground variables can be factorized according to the tree structure:

p(z^{fg}) = p(z^{appr}_r) \prod_{v \in V \setminus \{r\}} p(z^{pose}_v \mid z^{appr}_{pa(v)}) \, p(z^{appr}_v \mid z^{appr}_{pa(v)}) .

Here we further assume conditional independence between z^{pose}_v and z^{appr}_v, since in graphics engines one should be able to specify the pose and appearance separately.

Representing tree structures. The above factorization only works for a given tree structure. To deal with variable tree structures, we need to include them in the latent representation as well.
We start by setting a maximum out-degree for each node, so that the total number of possible structures is bounded. To determine the structure, it then suffices to specify the presence of each possible edge. Hence, for an arbitrary edge between node v and its parent, we introduce a Bernoulli variable z^{pres}_v to indicate its presence. If z^{pres}_v = 0, meaning the edge is not present, then the pose variable associated with the edge, along with all variables in the subtree rooted at v, are excluded from the probabilistic scene graph. More precisely, let us define \tilde{z}^{pres}_v to be the product of all the presence variables along the path from the root r to node v:

\tilde{z}^{pres}_r = 1 , \quad \tilde{z}^{pres}_v = z^{pres}_v \cdot \tilde{z}^{pres}_{pa(v)} \quad \text{for } v \in V \setminus \{r\} .

The foreground variables now become z^{fg} = \{z^{appr}_r\} \cup \{z^{pres}_v, z^{pose}_v, z^{appr}_v\}_{v \in V \setminus \{r\}}, and the prior factorizes as follows:

p(z^{fg}) = p(z^{appr}_r) \prod_{v \in V \setminus \{r\}} [p(z^{pres}_v \mid z^{appr}_{pa(v)})]^{\tilde{z}^{pres}_{pa(v)}} \, [p(z^{pose}_v \mid z^{appr}_{pa(v)}) \, p(z^{appr}_v \mid z^{appr}_{pa(v)})]^{\tilde{z}^{pres}_v} .

We implement the pose and appearance variables as Gaussian variables, with p(z^{appr}_r) = N(0, 1), and the parameters of each conditional distribution are output by an MLP.

Differentiable decoder. We design the decoder to follow the recursive compositing process in graphics engines, helping the encoder to learn inverse graphics. First, for each leaf node v, we use a neural network g(·) to decode its appearance variable into a small image patch \hat{x}_v and a (close to) binary mask \hat{m}_v: (\hat{x}_v, \hat{m}_v) = g(z^{appr}_v). Here, g(·) is implemented as a spatial broadcast decoder (Watters et al., 2019), optionally followed by sub-pixel convolutions (Shi et al., 2016). We then recursively compose these primitive patches into whole objects and the full scene by applying affine transformations parameterized by the pose variables. Specifically, let u be an internal node, and ch(u) the set of its children.
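To make the factorization concrete, the following is a minimal sketch of ancestral sampling from such a tree-structured prior. The distributions here (standalone Gaussians and a fixed Bernoulli rate) are placeholders for the MLP-parameterized conditionals of the actual model, and the function and field names are ours; the point is how the path product \tilde{z}^{pres}_v prunes whole subtrees.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subtree(tilde_pres, depth, max_children=4, max_depth=2):
    """Ancestral sampling from a tree-structured prior (illustrative only).

    tilde_pres is the product of presence variables on the path from the
    root, so an absent edge excludes its whole subtree, exactly as the
    exponents in the prior factorization dictate. Placeholder
    distributions stand in for the MLP-parameterized conditionals.
    """
    node = {"appr": rng.normal(size=8), "children": []}
    if depth == max_depth:                      # leaf: no further decomposition
        return node
    for _ in range(max_children):               # bounded out-degree
        z_pres = rng.random() < 0.5             # Bernoulli edge presence
        child_tilde = z_pres * tilde_pres       # tilde z^pres of the child
        if not child_tilde:
            continue                            # excluded from the scene graph
        child = sample_subtree(child_tilde, depth + 1, max_children, max_depth)
        child["pose"] = rng.normal(size=4)      # rotation, scale, translation, depth
        node["children"].append(child)
    return node

tree = sample_subtree(tilde_pres=1, depth=0)    # tilde z^pres_r = 1
```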
We compose the higher-level image patch \hat{x}_u and mask \hat{m}_u as follows:

\hat{x}_u = \sum_{v \in ch(u)} \tilde{z}^{pres}_v \cdot \alpha_v \odot ST^{-1}(\hat{x}_v, z^{pose}_v) , \quad \hat{m}_u = \sum_{v \in ch(u)} \tilde{z}^{pres}_v \cdot \alpha_v \odot ST^{-1}(\hat{m}_v, z^{pose}_v) .

Here, \odot denotes pixel-wise multiplication, and ST^{-1} is an inverse spatial transformer (Jaderberg et al., 2015) that differentiably places \hat{x}_v and \hat{m}_v into the coordinate frame of the parent node u. To deal with occlusion, we include a relative depth in z^{pose}_v, and compute a transparency map \alpha_v by a softmax over negative depth values. This ensures that entities with smaller depth appear in front of entities with larger depth. See Figure 1C for an illustration. When we reach the root node, we obtain an image \hat{x}_r and a mask \hat{m}_r of all foreground objects. We then use a spatial broadcast decoder (Watters et al., 2019) to decode z^{bg} into a background image \hat{x}_{bg}. The full scene x can now be modeled as a pixel-wise mixture of foreground and background, where \hat{m}_r serves as the mixing weight:

p(x \mid z^{fg}, z^{bg}) = \hat{m}_r \, N(x \mid \hat{x}_r, \sigma^2_{fg} 1) + (1 - \hat{m}_r) \, N(x \mid \hat{x}_{bg}, \sigma^2_{bg} 1) .

Here, \sigma_{fg} and \sigma_{bg} are hyperparameters.
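The depth-based compositing step can be sketched as follows. This is a simplified numpy version under our own naming: we assume the child patches have already been warped into the parent's frame (the real model applies ST^{-1} with the pose variables first), and show only the softmax transparency and the weighted sum.

```python
import numpy as np

def softmax(a, axis=0):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compose(patches, masks, depths, pres):
    """Composite K child patches into the parent's canvas.

    patches: (K, H, W, 3) child appearances, assumed already warped into
             the parent frame (the real model applies ST^{-1} first).
    masks:   (K, H, W, 1) near-binary masks.
    depths:  (K,) relative depths taken from the pose variables.
    pres:    (K,) effective presence variables (0 or 1).
    """
    alpha = softmax(-depths)[:, None, None, None]  # smaller depth -> more opaque
    w = pres[:, None, None, None] * alpha
    x_u = (w * masks * patches).sum(axis=0)        # composed patch
    m_u = (w * masks).sum(axis=0)                  # composed mask
    return x_u, m_u
```

With two fully present children at depths 0 and 10, the nearer child almost entirely occludes the farther one wherever both masks are on, reproducing the front-to-back ordering described above.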

3.2. INFERENCE AND LEARNING

Since computing p(x) in Equation 1 involves an intractable integral, we train GSGN with variational inference. The approximate posterior factorizes similarly to the generative process:

p(z^{fg}, z^{bg} \mid x) \approx q(z^{fg}, z^{bg} \mid x) = q(z^{fg} \mid x) \, q(z^{bg} \mid z^{fg}, x) .

To infer the foreground variables, we make the key observation that the probabilistic scene graph has a recursive structure. This suggests a recursive inference process, in which the children of the root node are first inferred, and inference is then performed recursively within each subtree rooted at those children. Hence, we design a top-down factorization:

q(z^{fg} \mid x) = q(z^{appr}_r \mid x) \prod_{v \in V \setminus \{r\}} [q(z^{pres}_v \mid z^{appr}_{pa(v)}, x_{pa(v)})]^{\tilde{z}^{pres}_{pa(v)}} \, [q(z^{pose}_v \mid z^{appr}_{pa(v)}, x_{pa(v)})]^{\tilde{z}^{pres}_v} \times \prod_{v \in V \setminus \{r\}} [q(z^{appr}_v \mid z^{appr}_{pa(v)}, x_v)]^{\tilde{z}^{pres}_v} ,

where the inference of child node v is conditioned on the inferred appearance variable z^{appr}_{pa(v)} of its parent. This crucially provides top-down information for separating lower-level entities, since z^{appr}_{pa(v)} summarizes lower-level composition. We combine this top-down information with bottom-up image features. Specifically, we use spatial attention to crop a local image region x_v that is expected to capture the entity v. This provides more relevant information for inferring the appearance and structure of entity v. The region x_v is specified by all the predicted pose variables along the path from the root r to node v. More precisely, we define x_r = x, and recursively extract x_v = ST(x_{pa(v)}, z^{pose}_v) using a spatial transformer ST (Jaderberg et al., 2015), as shown in Figure 1B.

Parameter sharing between prior and posterior. Inspired by Ladder VAEs (Sønderby et al., 2016), we find it beneficial to share parameters between the prior and the posterior. Specifically, we implement each factorized posterior as a product of the corresponding prior and a posterior obtained solely from image features.
This also allows us to reuse existing scene decomposition methods such as SPACE (Lin et al., 2020b) as an inference module of GSGN. For example,

q(z^{appr}_v \mid z^{appr}_{pa(v)}, x_v) \propto p(z^{appr}_v \mid z^{appr}_{pa(v)}) \, q(z^{appr}_v \mid x_v) = p(z^{appr}_v \mid z^{appr}_{pa(v)}) \, q_{SPACE}(z^{appr}_v \mid x_v) .

Training. We use the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014) to sample the continuous pose and appearance variables, and the Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016) to sample the discrete presence variables. The entire model can be trained end-to-end via backpropagation to maximize the evidence lower bound (ELBO):

L = E_q[\log p(x \mid z^{fg}, z^{bg})] - D_{KL}[q(z^{fg} \mid x) \,\|\, p(z^{fg})] - D_{KL}[q(z^{bg} \mid z^{fg}, x) \,\|\, p(z^{bg} \mid z^{fg})] .

Auxiliary KL loss. As pointed out in recent work (Chen et al., 2020), learnable priors have difficulty reflecting our prior knowledge. Hence, similar to Chen et al. (2020), we introduce auxiliary KL terms between the posterior and fixed priors that reflect our preferences. In particular, we find it important to introduce a KL term between q(z^{pres}_v \mid z^{appr}_{pa(v)}, x_{pa(v)}) and a Bernoulli prior with low success probability, to encourage sparse tree structures.
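When both factors in the product above are Gaussian, the fused posterior is again Gaussian with precision-weighted parameters, which is the standard way Ladder-VAE-style prior/posterior products are computed in practice. A minimal sketch (the function name is ours):

```python
import numpy as np

def gaussian_product(mu_p, var_p, mu_q, var_q):
    """Parameters of the (renormalized) product N(mu_p, var_p) * N(mu_q, var_q).

    Fuses a top-down conditional prior with a bottom-up image-feature
    posterior: precisions add, and the mean is the precision-weighted
    average of the two means.
    """
    prec = 1.0 / var_p + 1.0 / var_q   # precisions add
    var = 1.0 / prec
    mu = var * (mu_p / var_p + mu_q / var_q)
    return mu, var
```

With equal variances the fused mean is the simple average of the prior and bottom-up means, and the fused variance is halved; a confident prior (small var_p) pulls the posterior toward the top-down prediction, which is exactly the behavior that helps under occlusion.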

4. EXPERIMENTS

Datasets. In order to evaluate GSGN's ability to discover scene graph structures, we develop compositional versions of Multi-dSprites (Greff et al., 2019) and CLEVR (Johnson et al., 2017), two commonly used datasets for evaluating unsupervised object-centric representation learning models. We refer to our new datasets as the 2D Shapes and Compositional CLEVR datasets, respectively. For each dataset, we define three types of primitive parts, and ten types of objects composed from these parts. Three of the object types contain a single part, another three contain two parts, and the remaining four contain three parts. To construct a scene, we randomly sample the number of objects (between one and four) and their types, sizes, positions, and orientations. The color and material of each part are also randomly sampled, thus covering a broad range of compositional structures. Each dataset consists of 128×128 color images, split into 64000 for training, 12800 for validation, and 12800 for testing.

GSGN implementation. While our formulation of GSGN allows representing and inferring scene graphs of arbitrary depth, in our experiments we investigate a three-level GSGN as a proof-of-concept, containing part-, object-, and scene-level representations. We set the maximum out-degree of each node to 4. We also implement another version, denoted GSGN-9, where the maximum out-degree is 9. This evaluates the robustness of our model when the structural hyperparameters are mismatched to the data, and shows some potential for scalability.

Baselines. Previous work on hierarchical scene representations assumes single-object scenes (Kosiorek et al., 2019) or requires predefined or pre-segmented parts (Li et al., 2017; Huang et al., 2020), and thus cannot work on our datasets. Hence, we compare with SPACE (Lin et al., 2020b), a state-of-the-art non-hierarchical scene decomposition model.
Although SPACE cannot infer the scene graph, its hyperparameters can be tuned to decompose the scene into either objects (SPACE-O) or parts (SPACE-P). This allows us to compare the inference quality of individual nodes.

Ablations. We consider two ablated versions of GSGN. GSGN-No-Share does not share parameters between the prior and the posterior; that is, it does not factorize the posterior as in Equation 10. Instead, the posterior mean and variance are computed by a separate network that takes both z^{appr}_{pa(v)} and x_v as input. GSGN-No-Aux directly optimizes the ELBO without auxiliary KL terms.

Scene graph inference. We visualize the inferred scene graphs in Figure 2 and Figure 3. Here, the bounding boxes visualize the inferred pose variables (including rotation, scaling, and translation), and the reconstructions are obtained by feeding the inferred appearance variables through the decoder. Empty slots indicate that the corresponding presence variables are zero. As can be seen, GSGN is able to correctly separate the parts and group them into objects, even when objects are close (Figure 2, Rows 1-4; Figure 3, Rows 1-2), parts have similar colors (Figure 2, Rows 4-5; Figure 3, Row 3), and there is occlusion between objects and parts (Figure 3, Rows 4-5). We report quantitative results in Table 1 and Table 2. Here, counting accuracy measures the correctness of the number of nodes in the inferred scene graph. F1 score is the harmonic mean of precision and recall, where we define a node to be a true positive if it is sufficiently close to a ground-truth entity. Hence, the F1 score reflects whether the inferred nodes indeed capture entities. We measure closeness by the distance between center positions; the distance thresholds for objects and parts are 10 and 5 pixels, respectively. As can be seen, all models obtain similar reconstruction quality. GSGN-No-Aux fails to infer the scene graph structure: we find that it tends to set all presence variables to one, resulting in redundant nodes.
With auxiliary KL terms, GSGN and GSGN-No-Share are able to infer sparse structures, achieving object-level decomposition performance comparable to SPACE-O. On the Compositional CLEVR dataset, GSGN achieves better part-level decomposition than SPACE-P. We observe that this is because SPACE-P has difficulty separating parts under severe occlusion. We investigate this further by splitting the test set into three occlusion levels, measuring the amount of occlusion in an image by the minimum number of visible pixels per part. As shown in Table 3, when an image contains a part with fewer than 100 visible pixels, SPACE-P performs significantly worse than GSGN. Its low recall suggests that SPACE-P tends to miss occluded parts. We also find that GSGN-No-Share performs worse than GSGN in this scenario, indicating that parameter sharing helps combine top-down information from the prior to reduce the ambiguity caused by occlusion. For GSGN-9, although its number of leaf nodes far exceeds the maximum number of parts per scene in our dataset, it shows only a slight drop in performance compared to GSGN, and remains much better than SPACE-P at identifying severely occluded parts. The counting accuracy and F1 score evaluate the object and part levels separately, without taking the overall tree structure into account. To directly measure the quality of the inferred tree structure, we perform a manual inspection of 100 test images from each dataset. GSGN achieves 100% and 99% structural accuracy on the 2D Shapes and Compositional CLEVR datasets, respectively. We also consider a heuristic baseline, which starts from the parts predicted by SPACE-P and groups parts whose bounding boxes overlap. This baseline achieves 95% and 80% structural accuracy on the two datasets, respectively.

Scene graph manipulation. The scene graphs inferred by GSGN have interpretable tree structures and pose variables.
Therefore, new objects and scenes can be composited by directly manipulating the latent variables in an inferred scene graph. In Figure 4A, we show object-level manipulations. We manipulate the inferred scene graph of the fourth scene of Figure 3, changing the scale (first row) and x-coordinate (second row) of an object in the scene. As can be seen, object occlusion is handled correctly. We can also make these modifications after replacing the object with some other object from a different scene (third and fourth rows). This is achieved by replacing the subtree corresponding to that object. Because the subtree stores the relative pose (rather than the absolute pose) of the parts, the part structure of the newly added object is automatically preserved. In Figure 4B, we show part-level manipulations on the first scene of Figure 3. In the first row, we modify the size of the bronze ball. This requires changing both the relative position and the scale of the ball, so that it remains in contact with the green cube below it. In the second row, we replace the bronze ball with other parts found in the five scenes of Figure 3. This is achieved by replacing the appearance and scale variables with those of the new part, and recalculating the position so that the new part appears right above the green cube. We note that this allows compositing novel objects never seen during training (see the last two columns).

Object and scene generation. GSGN is able to generate objects and scenes by sampling the latent variables from the learned prior and feeding them through the decoder. We show generation results in Figure 5. We find that GSGN has captured many of the predefined object types in the two datasets, and has also managed to come up with novel compositions. The generated scenes are also reasonable, with moderate distance and occlusion between objects.
The objects generated by SPACE-O are much less interpretable, since SPACE does not explicitly model the part-whole relationship. Also, SPACE cannot generate meaningful scenes, because the presence variables sampled from its prior are almost always zero, leading to no objects being placed in the scene.

Generalization performance. We evaluate GSGN's capacity to generalize to scenes with a number of objects not seen during training. We report the metrics in Table 4 and Table 5 (Section A). As can be seen, GSGN demonstrates quite decent generalization performance.

Data efficiency in downstream tasks. The compositional structure learned by GSGN can be useful in downstream tasks that require reasoning about part-object relationships. Here we consider a classification task. The input images are generated from the same distribution but with different random seeds. The classifier built on GSGN representations aggregates features within the subtree of each object, providing more inductive bias for extracting interactions. We train the two classifiers using 64, 128, 256, 512, 1024, 2048, 4096, 8192, and 16000 training samples. We choose the best learning rate for both classifiers using a fixed validation set of size 12800, and report classification accuracy on a fixed test set, also of size 12800. As shown in Figure 6 (Section B), GSGN approximately doubles the data efficiency on this downstream task compared to SPACE-O.

5. CONCLUSION

We have proposed GSGN, the first deep generative model for unsupervised scene graph discovery from multi-object scenes without knowledge of individual parts. GSGN infers the probabilistic scene graph by combining a top-down prior with bottom-up image features. This utilizes higher-level appearance information to guide lower-level decomposition when severe occlusion causes compositional ambiguity. GSGN is also able to generate novel out-of-distribution objects and scenes through scene graph manipulation. While we have demonstrated the effectiveness of a three-level GSGN as a proof-of-concept, it remains an open question whether GSGN can scale to more realistic data, with deeper hierarchies and more complex appearance. Interesting future directions include developing a recurrent module to dynamically control the depth of the scene graph, using iterative inference as suggested by Hinton (2021) for better coordination across levels, and improving the single-level inference module.

C LEARNING PRIMITIVE TEMPLATES

It is often reasonable to assume that a vast number of complex entities can be composed from only a modest number of primitive entities. Identifying such primitive entities through discrete latent variables would bring additional interpretability. Hence, we consider an extension of GSGN (denoted GSGN-Mem) in which the leaf nodes become pointers to a jointly learned memory M of primitive templates. Each slot in M stores the low-dimensional shape embedding of a learned primitive part. These embeddings are model parameters, trained jointly via backpropagation. To link the probabilistic scene graph with the memory of primitives, we decompose the appearance variable z^{appr}_v of each leaf node v into z^{appr}_v = (z^{addr}_v, z^{what}_v). Here, z^{addr}_v is a one-hot addressing variable that points to one of the slots in M, and z^{what}_v is a continuous variable that explains the remaining variability in the appearance of primitives, such as color, material, and lighting effects. Similar to Bornschein et al. (2017), we factorize the prior as:

p(z^{appr}_v \mid z^{appr}_{pa(v)}) = p(z^{addr}_v \mid z^{appr}_{pa(v)}, M) \, p(z^{what}_v \mid M[z^{addr}_v]) ,

where M[z^{addr}_v] is the retrieved memory content. To decode from the templates, we first retrieve the embedding M[z^{addr}_v] indexed by z^{addr}_v, and then decode it, together with z^{what}_v, into the patch \hat{x}_v and mask \hat{m}_v via pixel-wise multiplication of the outputs of g(·) and h(·). Here, g(·) is implemented as a spatial broadcast decoder (Watters et al., 2019), and h(·) as a spatial broadcast decoder followed by sub-pixel convolutions (Shi et al., 2016). We trained GSGN-Mem on the Compositional CLEVR dataset. We find that it is able to infer the scene graph, as we show qualitatively in Figure 7. The primitives have learned the appearance of all the parts in the dataset, while reserving some memory capacity for potential future use (see Figure 8).
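The addressing mechanism can be sketched as below. This is our own simplified illustration: the real model samples z^{addr} with the Gumbel-Softmax trick and a straight-through estimator so that gradients flow to the memory, whereas here we show only the (hard or soft) forward pass of slot retrieval.

```python
import numpy as np

def retrieve(memory, logits, tau=1.0, hard=True):
    """Retrieve a primitive template from a memory of slot embeddings.

    memory: (S, D) array of S learned shape embeddings.
    logits: (S,) unnormalized scores for the addressing variable z^addr.
    With hard=True this mimics the straight-through estimator's forward
    pass: a one-hot address selects a single slot M[z^addr].
    """
    p = np.exp(logits / tau)
    p /= p.sum()                                # softmax over slots
    if hard:
        addr = np.eye(len(logits))[p.argmax()]  # one-hot z^addr
    else:
        addr = p                                # soft (relaxed) address
    return addr @ memory                        # retrieved content M[z^addr]
```

The retrieved embedding would then be fed, together with z^{what}, to the leaf decoder to produce the patch and mask.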

D IMPLEMENTATION DETAILS D.1 FOREGROUND INFERENCE

We provide the implementation outline for computing q(z^{fg} | x) in Algorithm 1 and Algorithm 2. The inference is carried out in a top-down fashion, starting at level 1, which corresponds to the root node. The core inference module, GSGN^{(l+1)}, takes as input the appearance variable z^{appr}_v and the local image region x_v for a node v at level l, and outputs the latent variables z_{ch(v)} and local image regions x_{ch(v)} for all children of v. Here, ch(v) denotes the set of all children of v, and z_{ch(v)} and x_{ch(v)} are defined as:

z_{ch(v)} = \{(z^{pres}_u, z^{pose}_u, z^{appr}_u)\}_{u \in ch(v)} , \quad x_{ch(v)} = \{x_u\}_{u \in ch(v)} .

GSGN^{(l+1)} is shared across all nodes v at level l, and serves as a generic module for decomposing an entity at level l into its parts at level (l+1). There are three submodules in GSGN^{(l+1)}: PresPoseApprPri^{(l+1)}, which computes the conditional prior, and PresPoseEnc^{(l+1)} and ApprEnc^{(l+1)}, which are adapted from the SPACE encoder. All three submodules can be implemented as CNNs. For PresPoseApprPri^{(l+1)} and PresPoseEnc^{(l+1)}, each cell in the output feature map corresponds to one child node, and the number of cells determines the maximum out-degree at level l. Hence, the latent variables of the children can be inferred in parallel through a feedforward pass of each submodule.

Algorithm 1 Foreground Inference

Input: image x, number of scene graph levels L
1: Infer the appearance variable for the root node:
   q(z^appr_r | x) = ApprEnc^(1)(x);  z^appr_r ∼ q(z^appr_r | x);  x_r = x
2: for each scene graph level l = 1, 2, ..., L − 1 do
3:   for each node v in level l, in parallel, do
4:     Infer the latent variables and crop local image regions for all children of v:
       z_ch(v), x_ch(v) = GSGN^(l+1)(z^appr_v, x_v)
5:   end for
6: end for
7: return all foreground variables z^fg

Algorithm 2 GSGN^(l+1): Foreground Inference at Level (l + 1)

Input: appearance variable z^appr_v and local image region x_v for a node v in level l
1: Compute the conditional prior: p(z_ch(v) | z^appr_v) = PresPoseApprPri^(l+1)(z^appr_v)
2: Use SPACE to predict the presence and pose variables from x_v:
   q_SPACE(z^pres,pose_ch(v) | x_v) = PresPoseEnc^(l+1)(x_v)
3: Sample the presence and pose variables:
   z^pres,pose_ch(v) ∼ q(z^pres,pose_ch(v) | z^appr_v, x_v) ∝ p(z^pres,pose_ch(v) | z^appr_v) q_SPACE(z^pres,pose_ch(v) | x_v)
4: for each child node u ∈ ch(v), in parallel, do
5:   Use a spatial transformer to crop the local image region for node u: x_u = ST(x_v, z^pose_u)
6:   Use SPACE to predict the appearance variable from x_u: q_SPACE(z^appr_u | x_u) = ApprEnc^(l+1)(x_u)
7:   Sample the appearance variable: z^appr_u ∼ q(z^appr_u | z^appr_v, x_u) ∝ p(z^appr_u | z^appr_v) q_SPACE(z^appr_u | x_u)
8: end for
9: return latent variables z_ch(v) and local image regions x_ch(v) for all children nodes of v
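The control flow of Algorithms 1 and 2 can be sketched as follows. All submodules (priors, encoders, spatial-transformer crops) are hypothetical stubs; only the top-down, level-parallel traversal mirrors the algorithms.

```python
# Skeleton of the top-down inference of Algorithms 1-2 with stub submodules.
MAX_CHILDREN = 4   # maximum out-degree (a 2 x 2 grid of cells)

def gsgn_step(z_appr_v, x_v):
    """Stub for GSGN^{(l+1)}: propose latents and crops for all children of v."""
    children = []
    for _ in range(MAX_CHILDREN):
        children.append({
            "pres": 1,            # z_pres, sampled from prior x SPACE posterior
            "pose": (0.0, 0.0),   # z_pose, squashed before the crop
            "appr": z_appr_v,     # z_appr, encoded from the cropped region
            "x": x_v,             # local region, cropped by a spatial transformer
        })
    return children

def infer_foreground(x, num_levels=3):
    """Algorithm 1: infer the root appearance, then expand level by level."""
    root = {"pres": 1, "pose": None, "appr": "z_appr_r", "x": x}
    levels = [[root]]
    for _ in range(num_levels - 1):
        frontier = []
        for v in levels[-1]:      # nodes in a level are processed in parallel
            frontier.extend(gsgn_step(v["appr"], v["x"]))
        levels.append(frontier)
    return levels

levels = infer_foreground("image", num_levels=3)
```

With three levels and out-degree 4, the tree has 1 root, 4 object nodes, and 16 part slots.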

Posterior: q(z^scale_v | z^appr_pa(v), x_pa(v))
Unconditioned prior: 0.1 N(1.2, 0.1) + 0.1 N(0, 0.05) + 0.8 N(−0.6, 0.1)

D.4 BACKGROUND INFERENCE

The prior and posterior for the background variable are implemented as p(z^bg | z^fg) = p(z^bg | z^appr_r) and q(z^bg | z^fg, x) = q(z^bg | z^appr_r, x). We provide the implementation outline for computing q(z^bg | z^appr_r, x) in Algorithm 3. The two submodules, BgPri and BgEnc, are implemented as CNNs, and their architectures can be found in Figure 10. Both submodules output the mean and standard deviation of the background variable, and we apply a Softplus nonlinearity to ensure that the standard deviations are positive.

Algorithm 3 Background Inference

Input: image x, appearance variable of the root node z^appr_r
1: Compute the conditional prior: p(z^bg | z^appr_r) = BgPri(z^appr_r)
2: Predict the background variable from x: q_SPACE(z^bg | x) = BgEnc(x)
3: Sample the background variable: z^bg ∼ q(z^bg | z^appr_r, x) ∝ p(z^bg | z^appr_r) q_SPACE(z^bg | x)
4: return background variable z^bg

(Layer specifications shown in Figure 10: BgPri maps a 1 × 1 × 128 input to a 1 × 1 × 16 output through 1 × 1 ConvBlock, ConvReLU, and Conv layers; BgEnc maps the 128 × 128 × 3 image to a 1 × 1 × 16 output through a stack of strided ConvBlocks followed by ConvReLU and Conv layers.)
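Step 3 of Algorithm 3 samples from a product of two factors. Assuming both factors are diagonal Gaussians, as elsewhere in the model, the normalized product is again Gaussian with precision-weighted parameters:

```python
def gaussian_product(mu_p, var_p, mu_q, var_q):
    """Normalized product N(mu_p, var_p) * N(mu_q, var_q) of two Gaussians.
    The result is Gaussian with summed precision and precision-weighted mean."""
    precision = 1.0 / var_p + 1.0 / var_q
    var = 1.0 / precision
    mu = var * (mu_p / var_p + mu_q / var_q)
    return mu, var

# Two equally confident factors land exactly halfway between the two means.
mu, var = gaussian_product(0.0, 1.0, 2.0, 1.0)
```

This applies per dimension when the factors are diagonal; each dimension combines independently.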

D.5 TRAINING DETAILS

We train GSGN on a single GPU, using the Adam optimizer (Kingma & Ba, 2014) with a batch size of 64 and a learning rate of 3 × 10⁻⁴, for up to 500 epochs. We use gradient clipping to ensure that the infinity norm of the gradient does not exceed 1.0. The temperature for Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2016) is exponentially annealed from 2.5 to 0.5 during the first 20 epochs. Similar to Slot Attention (Locatello et al., 2020), the learning rate is linearly increased from 0 to 3 × 10⁻⁴ during the first 10 epochs, and then exponentially decayed to half of its value every 100 epochs. We set σ_fg = 0.3 and σ_bg = 0.1. On the Compositional CLEVR dataset, σ²_fg has an initial value of 0.15², and is linearly increased from 0.15² to 0.3² during epochs 20-40. Similar to SPACE (Lin et al., 2020b), the mixing weight m_r is fixed at the start of training: on the 2D Shapes dataset, we fix m_r = 1 × 10⁻⁵ for 1 epoch, while on the Compositional CLEVR dataset, we fix m_r = 0.1 for 2 epochs. GSGN-9 is trained on two GPUs, each taking a batch size of 32, using the same schedule.
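The warm-up, decay, and temperature schedules can be written as simple functions of the epoch index. Whether the decay clock starts at epoch 0 or at the end of warm-up is our assumption.

```python
BASE_LR, WARMUP_EPOCHS, HALF_LIFE = 3e-4, 10, 100

def learning_rate(epoch):
    """Linear warm-up to BASE_LR, then halved every HALF_LIFE epochs."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * epoch / WARMUP_EPOCHS
    return BASE_LR * 0.5 ** ((epoch - WARMUP_EPOCHS) / HALF_LIFE)

def gumbel_temperature(epoch, tau_start=2.5, tau_end=0.5, anneal_epochs=20):
    """Exponential annealing from tau_start to tau_end over the first epochs."""
    frac = min(epoch, anneal_epochs) / anneal_epochs
    return tau_start * (tau_end / tau_start) ** frac
```

The exponential interpolation keeps the temperature ratio per epoch constant, which is the usual reading of "exponentially annealed".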

E COMPARISON WITH SUPERPIXEL HIERARCHY

In this section, we qualitatively compare GSGN with a simple baseline called Superpixel Hierarchy (Wei et al., 2018) . This algorithm hierarchically merges pixels into superpixels based on connectivity and color histograms, until the full image is merged into one superpixel. Using this hierarchy, the algorithm is able to produce segmentation results when given the desired number of superpixels (can be any number between one and the total number of pixels). It has been shown that Superpixel Hierarchy outperforms the widely used FH (Felzenszwalb & Huttenlocher, 2004) and SLIC (Achanta et al., 2012) methods and also some more recent methods. In principle, Superpixel Hierarchy (SH) should be able to perform object and part grouping, by assigning a superpixel to an object or a part. We show qualitative segmentation results on our datasets in Figure 11 , using the publicly available code 2 . Here, we compute the mean color for each superpixel, and each pixel is drawn in the mean color of the superpixel to which it belongs. For object and part grouping, we provide SH with the groundtruth number of components (the number of objects/parts plus one for background). As can be seen, when doing object grouping, SH tends to merge some parts into the background. This seems to be caused by some characteristics of our datasets. Specifically, in the 2D Shapes dataset, the parts within an object do not touch each other, making it hard to merge them into one superpixel because superpixels are supposed to be 4-connected. In the Compositional CLEVR dataset, the light and dark regions of the background tend to be split into different superpixels due to dissimilarity in color histograms. For part grouping, SH sometimes merges tiny parts into the background on the 2D Shapes dataset, and splits metal parts that have specular highlights on the Compositional CLEVR dataset. 
When more superpixels are allowed, SH is generally able to distinguish foreground from background, but this kind of oversegmentation cannot serve our purpose of building the scene graph.

F RESULTS ON SEVEN-PART COMPOSITIONAL CLEVR DATASET

In this section, we show results on a slightly more complex version of the Compositional CLEVR dataset, in which we introduce four new parts and replace some of the objects. The new dataset contains ten types of objects composed from a total of seven parts. Three of the object types contain a single part, another three contain two parts, and the remaining four contain three parts. We use the same network architecture and auxiliary KL terms as for the original Compositional CLEVR dataset. We find that adjusting the weight of the KL terms, as in β-VAEs (Higgins et al., 2017), helps improve training stability. Specifically, we multiply the original and auxiliary KL terms of the part-level presence and pose variables by a hyperparameter β. We keep β = 1 for the first 50 epochs, linearly increase β from 1 to 5 during the next 30 epochs, and keep β = 5 for the remaining epochs. We report qualitative results in Figure 12, and quantitative results in Table 10 and Table 11.



1 We use v to refer to both node v in the probabilistic scene graph and the entity that node v represents.
2 https://github.com/semiquark1/boruvka-superpixel



Figure 1: (A) Probabilistic scene graph representation. Each node represents an entity in the scene, and is associated with an appearance variable. Each edge is associated with a pose variable that specifies the coordinate transformation between the child node and the parent node. (B) Spatial attention during inference. We first decompose the scene into high-level objects, and then attend to each object to figure out the constituent parts. (C) Recursive decoding process (a single recursive step shown). The image patch xu 1 and mask mu 1 of an internal node u1 are decoded from the image patches and masks of all its children nodes.

Figure 2: Visualization of inferred scene graphs on 2D Shapes dataset.

Figure 3: Visualization of inferred scene graphs on Compositional CLEVR dataset.

Figure 4: Image manipulation results. (A) Object-level manipulation. (B) Part-level manipulation.

Figure 5: (A) Objects sampled from GSGN learned prior. (B) Objects sampled from SPACE-O prior. (C) Scenes sampled from GSGN learned prior.

Figure 6: Comparison of data efficiency in a downstream classification task that requires reasoning of the part-whole relationship.

Figure 7: Visualization of scene graphs inferred by GSGN-Mem on Compositional CLEVR dataset.

Figure 8: Learned memory of primitive parts. Each column corresponds to one slot. The first row shows the decoded mask that captures the shape of the parts. The remaining rows are samples from the learned prior that fill in plausible color, material, and lighting effects.


z^appr_r and z^bg have dimensions 128 and 8, respectively.

Figure 10: CNN architectures of (A) BgPri and (B) BgEnc used in our three-level GSGN.

Figure 11: Segmentation results produced by Superpixel Hierarchy.

Figure 12: Visualization of inferred scene graphs on seven-part Compositional CLEVR dataset.

Quantitative results on 2D Shapes dataset.

Quantitative results on Compositional CLEVR dataset.

Robustness to occlusion on Compositional CLEVR dataset.

Generalization performance on 2D Shapes dataset.

Generalization performance on Compositional CLEVR dataset.

Auxiliary KL term used only at the part level.

Quantitative results on seven-part Compositional CLEVR dataset.

Robustness to occlusion on seven-part Compositional CLEVR dataset.

ACKNOWLEDGMENTS

This work was supported by Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government [21ZR1100, A Study of Hyper-Connected Thinking Internet Technology by autonomous connecting, controlling and evolving ways]. The authors would like to thank Skand Vishwanath Peri, Chang Chen, and Jindong Jiang for helpful discussion, and anonymous reviewers for constructive comments.


Figure 9: CNN architectures of (A) PresPoseApprPri^(3), (B) PresPoseEnc^(3), and (C) ApprEnc^(3) used in our three-level GSGN. To implement GSGN-9, in which the maximum out-degree is 9, we only need to change the layers in green so that the output feature maps have spatial size 3 × 3.

For concreteness, in Figure 9, we provide the CNN architectures of the submodules of GSGN^(3) used in our experiments. They are responsible for decomposing objects into parts; the submodules at the other levels are similar. In Figure 9, each ConvBlock consists of a convolution layer followed by group normalization (Wu & He, 2018) and a ReLU nonlinearity (Nair & Hinton, 2010). ConvReLU denotes a convolution layer directly followed by a ReLU nonlinearity. The PixelShuffle(r) layer (Shi et al., 2016) converts an input feature map of size [H, W, C × r × r] into an output feature map of size [H × r, W × r, C]. PresPoseApprPri^(3) (Figure 9A) takes in the 64-dimensional appearance variable of an object, and outputs the conditional prior for each of its four possible children. Here, output 1 includes two (unnormalized) logits for the presence variable, and the mean and standard deviation of the 5-dimensional pose variable, comprising 2-dimensional position and 1-dimensional rotation, scaling, and depth values. Output 2 includes the mean and standard deviation of the 32-dimensional appearance variable of a part. Notice that due to the PixelShuffle layer, not all parameters are shared across the four children. This allows the network to predict a different distribution over the latent variables of each child. PresPoseEnc^(3) (Figure 9B) takes in the cropped image region of an object (resized to 64 × 64), and infers the presence and pose of its children. ApprEnc^(3) (Figure 9C) takes in the image region of a part (cropped from the input to PresPoseEnc^(3) and resized to 32 × 32), and infers the appearance of that part. Notice that the cropping and appearance inference can be done in parallel for all parts. We apply a Softplus nonlinearity to ensure that the standard deviations are positive.
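The shape bookkeeping of PixelShuffle, which is what gives each child its own slice of output parameters, can be illustrated with a channels-last NumPy sketch (the actual layer operates on channels-first tensors inside the CNN):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a [H, W, C*r*r] feature map into [H*r, W*r, C].
    Each of the r*r sub-pixel positions is filled from its own channel
    group, so the r*r children need not share all output parameters."""
    H, W, Crr = x.shape
    C = Crr // (r * r)
    x = x.reshape(H, W, r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)   # interleave the r x r sub-grids
    return x.reshape(H * r, W * r, C)

# A single 1 x 1 cell with 4 channel groups becomes a 2 x 2 grid of children.
y = pixel_shuffle(np.arange(16, dtype=float).reshape(1, 1, 16), r=2)
```

Here child (0, 0) receives channels 0-3, child (0, 1) channels 4-7, and so on, which is why each child can get a distinct prior.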

D.2 PARAMETERIZATION OF POSE VARIABLES

As mentioned in Section D.1, the pose variable includes position, scaling, rotation, and depth:

z^pose_v = (z^where_v, z^scale_v, z^theta_v, z^depth_v). (14)

These are all Gaussian variables. Before feeding them to the spatial transformer, we first apply nonlinear squashing functions to convert their values into the desired range. In the following, we describe each variable in detail.

Position. z^where_v ∈ R² represents the center position of v in the image patch of its parent, x_pa(v). Similar to SPAIR (Crawford & Pineau, 2019b) and SPACE (Lin et al., 2020b), we divide x_pa(v) into r × r grid cells, with r² being the maximum out-degree. Each v of the same parent is associated with a cell index (i, j), and is expected to identify an entity in x_pa(v) that is close to cell (i, j). Specifically, z^where_v is used to compute the deviation of entity v from the center of cell (i, j). Let c_{i,j} ∈ R² denote the center of cell (i, j), and d ∈ R denote the side length of each cell. The center position of entity v can then be computed as

ẑ^where_v = c_{i,j} + a · d · tanh(z^where_v). (15)

Here, a is a hyperparameter. If a = 0.5, then the center of entity v must be within cell (i, j). Typically, a is set to be greater than 0.5 so that the network can learn to distribute entities that are close together across multiple cells. In our experiments, we set a = 0.55 for GSGN (r = 2), and a = 0.75 for GSGN-9 (r = 3).

Scaling. z^scale_v represents the scale of v relative to its parent. The part scale and the object scale are obtained by squashing z^scale_v into (0, 1) (Equations 16 and 17). Here, ẑ^scale_v = 1 would mean that entity v has the same size as its parent. We find that limiting the range of the object scale is necessary for GSGN-No-Aux to learn object-level decomposition (otherwise, GSGN-No-Aux tends to explain the entire scene as one object). With the auxiliary KL terms introducing a preferred prior (described in Section D.3), the object scale in GSGN could probably be computed in the same way as the part scale, but we have not investigated this.

Rotation. z^theta_v ∈ R represents the rotation of v in x_pa(v). Since there is no ground truth, the network is free to decide what each entity should look like when there is no rotation, and to predict the rotation angles accordingly. On the 2D Shapes dataset, the rotation angle is obtained by squashing z^theta_v (Equation 18). On the Compositional CLEVR dataset, since all objects are placed upright on a plane, the rotation is computed with a different squashing function (Equation 19).

Depth. z^depth_v ∈ R represents the relative depth of v with respect to its siblings (the other children of pa(v)). It is transformed to a positive value

ẑ^depth_v = Softplus(z^depth_v), (20)

which is then used to construct a transparency map α_v of the same size as x_pa(v). Ideally, for a pixel (i, j), α_v(i, j) = 1 if entity v contains this pixel and at the same time has a smaller ẑ^depth_v than all of its siblings that also contain the pixel, and α_v(i, j) = 0 otherwise. To determine which pixels are contained in entity v, we can simply use the decoded mask m̂_v and the presence variable z^pres_v. Specifically, we first place the mask into the parent coordinate frame with an inverse spatial transformer:

m̄_v = ST⁻¹(z^pres_v m̂_v, ẑ^pose_v), (21)

where m̄_v is the same size as x_pa(v), and ideally m̄_v(i, j) = 1 if entity v is present and contains pixel (i, j), and m̄_v(i, j) = 0 otherwise (in practice, α_v and m̄_v are not strictly binary). We can then compute α_v(i, j) by a masked Softmax over the negative depth values:

α_v(i, j) = m̄_v(i, j) exp(−ẑ^depth_v) / Σ_u m̄_u(i, j) exp(−ẑ^depth_u), (22)

where the sum runs over v and all of its siblings.
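The masked Softmax can be sketched in NumPy as follows, assuming the masks have already been placed in the parent frame and the depths have already passed through the Softplus:

```python
import numpy as np

def transparency_maps(masks, depths, eps=1e-8):
    """Occlusion-aware transparency via a masked softmax over negative depths.
    masks:  [N, H, W] soft masks in the parent frame (the m-bar maps)
    depths: [N] positive depth values (smaller = nearer to the camera)"""
    weights = masks * np.exp(-depths)[:, None, None]
    return weights / (weights.sum(axis=0, keepdims=True) + eps)

# Two fully overlapping siblings: the nearer one dominates every pixel.
masks = np.ones((2, 4, 4))
alpha = transparency_maps(masks, np.array([1.0, 3.0]))
```

Where only one sibling's mask is nonzero, its transparency approaches 1 regardless of depth; where masks overlap, depth decides, which matches the ideal binary behavior described above.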

D.3 AUXILIARY KL TERMS

In this section, we list the auxiliary KL terms used in our experiments. They have the same weight as the original KL terms that appear in the ELBO.

Table 6 lists the auxiliary KL terms that are the same across all levels. We use a Bernoulli prior with a low success probability for z^pres_v to encourage sparse tree structures. The prior for z^depth_v has its mean value set to 4. This helps avoid the negative half of the Softplus nonlinearity, which tends to make the transformed ẑ^depth_v values (Equation 20) less distinguishable from each other.

Table 7 lists the auxiliary KL terms that are only used at the object level for the Compositional CLEVR dataset. The prior for z^scale_v has its mean value set to −0.2. According to Equation 16, this corresponds to an approximate mean value of 0.24 for the object size ẑ^scale_v. We note that this value does not match the average object size in the dataset; it is simply set to be small to encourage decomposition and avoid explaining the entire scene as one object. The prior for z^appr_v is used only in GSGN-9. We find that it helps prevent posterior collapse, which we observed in GSGN-9 but not in GSGN.

On the 2D Shapes dataset, we use a Gaussian mixture prior for z^scale_v, as shown in Table 8. This reflects the prior knowledge that these 2D objects can have significantly different scales. Neither the mean values nor the mixing weights are chosen to match the actual distribution of object scales in the dataset. We estimate this auxiliary KL term using 100 samples from the posterior, which is not expensive since z^scale_v has only two dimensions.

Finally, as shown in Table 9, we use a Gaussian mixture prior for the part scales. The component mean values, after being transformed according to Equation 17, are approximately 0.91, 0.5, and 0.24. Again, neither the mean values nor the mixing weights match the actual distribution of part scales in our datasets. These values are chosen to encourage decomposition of objects into parts while reflecting the prior knowledge that parts can have significantly different scales. We estimate this auxiliary KL term using 100 samples from the posterior.
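The sample-based estimate of the KL to a Gaussian-mixture prior, which has no closed form in general, can be sketched as follows; the one-dimensional case and the example mixtures are illustrative.

```python
import numpy as np

def mc_kl_to_mixture(mu_q, sigma_q, weights, means, sigmas, n=100, seed=0):
    """Estimate KL(q || p) with n samples from the Gaussian posterior q,
    where p is a 1-D Gaussian mixture with the given components."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu_q, sigma_q, size=n)
    # log q(z) for a Gaussian posterior, evaluated at the samples
    log_q = -0.5 * ((z - mu_q) / sigma_q) ** 2 - np.log(sigma_q * np.sqrt(2 * np.pi))
    # mixture density p(z), summed over components
    p = sum(w * np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
            for w, m, s in zip(weights, means, sigmas))
    return float(np.mean(log_q - np.log(p + 1e-12)))

# Sanity checks: zero when q equals a single-component prior, large and
# positive when the posterior sits far from all mixture components.
kl_same = mc_kl_to_mixture(0.0, 1.0, [1.0], [0.0], [1.0])
kl_far = mc_kl_to_mixture(5.0, 0.1, [0.5, 0.5], [0.0, 1.0], [0.1, 0.1])
```

With 100 samples per term and a low-dimensional z, this estimator adds negligible cost per training step.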


G ANALYSIS OF POSSIBLE SCENE GRAPHS

In this section, we provide a brief analysis of how varied the scene graphs can be on the Compositional CLEVR dataset.

Let us first consider the possible scene graph structures (i.e., ignoring the differences caused by pose and appearance). There are 1-4 objects in a scene, each consisting of 1-3 parts.

Finally, we account for the appearance variations. There are 7 colors and 2 materials, leading to 14 choices of appearance for each part. Combining with pose variations, we obtain 3 × 14 + 3 × 14² + 4 × 14³ = 11606 possible subtrees for each object. Hence, counting scenes as unordered collections of 1-4 such subtrees, the total number of possible scene graphs is ∑_{k=1}^{4} C(11606 + k − 1, k) ≈ 7.57 × 10¹⁴. We note that the above calculation still does not account for the pose variations of objects. In fact, object poses are sampled from continuous distributions as opposed to discrete ones, leading to infinite variations.
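The counting above can be checked mechanically; treating a scene as an unordered multiset of one to four object subtrees is our reading of the setup.

```python
from math import comb

# Subtrees per object: 3 one-part, 3 two-part, and 4 three-part object
# types, each part with 14 appearance choices (7 colors x 2 materials).
per_object = 3 * 14 + 3 * 14 ** 2 + 4 * 14 ** 3

# Scenes with k objects are size-k multisets over the per_object subtrees,
# counted by the stars-and-bars formula C(n + k - 1, k).
total = sum(comb(per_object + k - 1, k) for k in range(1, 5))
```

The four-object term dominates the sum, so the total is essentially C(11609, 4).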

