WEAKLY-SUPERVISED NEURO-SYMBOLIC IMAGE MANIPULATION VIA MULTI-HOP COMPLEX INSTRUCTIONS

Abstract

We are interested in image manipulation via natural language text, a task that is extremely useful for multiple AI applications but requires complex reasoning over multi-modal spaces. Recent work on neuro-symbolic approaches, e.g., the Neuro-Symbolic Concept Learner (NSCL) (Mao et al., 2019), has been quite effective for solving Visual Question Answering (VQA), as such approaches offer better modularity, interpretability, and generalizability. We extend NSCL to the image manipulation task and propose a solution referred to as NEUROSIM. Previous work either requires supervised training data in the form of manipulated images or can only deal with very simple reasoning instructions over single-object scenes. In contrast, NEUROSIM can perform complex multi-hop reasoning over multi-object scenes and only requires weak supervision in the form of annotated data for VQA. NEUROSIM parses an instruction into a symbolic program, based on a Domain Specific Language (DSL) comprising object attributes and manipulation operations, that guides the manipulation. We design neural modules for manipulation, as well as novel loss functions that are capable of testing the correctness of manipulated object and scene graph representations via query networks trained merely on VQA data. An image decoder is trained to render the final image from the manipulated scene graph. Extensive experiments demonstrate that NEUROSIM, without using target images as supervision, is highly competitive with SOTA baselines that make use of supervised data for manipulation.

[Figure: Neuro-symbolic Image Manipulator (NEUROSIM). Instruction: "Change the size of the thing behind the large ball to big". Program: O ← scene(); O ← filter(O, large); O ← filter(O, sphere); O ← relate(O, behind); O ← change_size(O, large)]

1. INTRODUCTION

The last decade has seen significant growth in the application of neural models to a variety of tasks, including those in computer vision (Chen et al., 2017; Krizhevsky et al., 2012), NLP (Wu et al., 2016), robotics, and speech (Yu & Deng, 2016). It has been observed that these models often lack interpretability (Fan et al., 2021) and may not always be well suited to handle complex reasoning tasks (Dai et al., 2019). On the other hand, classical AI systems can seamlessly perform complex reasoning in an interpretable manner due to their symbolic representation (Pham et al., 2007; Cai & Su, 2012). But these systems are often found lacking in their ability to handle low-level representations and to be robust to noise. A natural question then arises: can we design models that capture the best of both paradigms? The answer lies in the recent development of neuro-symbolic models (Dong et al., 2019; Mao et al., 2019; Han et al., 2019), which combine the power of (purely) neural with (purely) symbolic representations. An interesting sub-class of these models works with a finite-sized domain-specific language (DSL) and makes use of deep networks to learn neural representations of the concepts specified in the DSL. The learned representations are then used for performing downstream reasoning via learning of symbolic programs. This line of work was first popularized by Andreas et al. (2016), Hu et al. (2017), and Johnson et al. (2017a), followed by Mao et al. (2019), who look at the task of Visual Question Answering (VQA), and other follow-up works such as learning meta-concepts (Han et al., 2019). Studies (Andreas et al., 2016; Hu et al., 2017; Mao et al., 2019) have shown that these models have several desirable properties such as modularity, interpretability, and improved generalizability.
Motivated by the above, our aim is to build neuro-symbolic models for the task of weakly supervised manipulation of images comprising multiple objects, via complex multi-hop natural language instructions. Existing work includes weakly supervised approaches (Nam et al., 2018; Li et al., 2020) that require textual descriptions of images during training and are limited to very simple scenes (or instructions). Supervised approaches (Zhang et al., 2021; El-Nouby et al., 2019), though capable of handling multiple objects and complex multi-hop instructions, require explicit annotations in the form of target manipulated images; see Section 2 for a survey. We are interested in a weakly supervised solution that only makes use of data annotated for VQA, avoiding the high cost of obtaining supervised annotations in the form of target manipulated images. Our key intuition is that this task can be solved by simply querying the manipulated representation, without ever explicitly looking at the target image.
Figure 1: The problem setup.
Our solution builds on the Neuro-Symbolic Concept Learner (NSCL) proposed by Mao et al. (2019) for solving VQA. We extend this work to incorporate the notion of manipulation operations, such as changing, adding, and removing objects in a given image. As one of our main contributions, we design novel neural modules and a training strategy that uses only VQA annotations as weakly supervised data for the task of image manipulation. The neural modules are trained with the help of novel loss functions that measure the faithfulness of the manipulated scene and object representations by accessing a separate set of query networks, interchangeably referred to as quantization networks, trained using only VQA data. The manipulation takes place through interpretable programs created using primitive neural and symbolic operations from a Domain Specific Language (DSL).
Separately, a network is trained to render the image from a scene graph representation using a combination of L1 and adversarial losses, as done by Johnson et al. (2018). The entire pipeline is trained without any intermediate supervision. We refer to our system as Neuro-Symbolic Image Manipulator (NEUROSIM). Figure 1 shows an example of an I/O pair for our approach. For our experiments, we extend CLEVR (Johnson et al., 2017b), a benchmark dataset for VQA, to incorporate manipulation instructions and create a dataset referred to as Complex Image Manipulation via Natural Language Instructions (CIM-NLI). We will release this dataset publicly post acceptance. Our evaluation on the CIM-NLI dataset shows that, despite being weakly supervised, we are highly competitive with or improve upon state-of-the-art supervised approaches (Zhang et al., 2021; El-Nouby et al., 2019) for this task, generalize well to scenes with more objects, and perform especially well on instructions that involve multi-hop reasoning.

2. RELATED WORK

Table 1 categorizes the related work across three broad dimensions: problem setting, task complexity, and approach. The problem setting comprises two sub-dimensions: i) supervision type (self, direct, or weak), and ii) instruction format (text or UI-based). The task complexity comprises the following sub-dimensions: i) scene complexity (single or multiple objects), ii) instruction complexity (zero or multi-hop instructions), and iii) kinds of manipulations allowed (add, remove, or change). Finally, approach consists of the following sub-dimensions: i) model (neural or neuro-symbolic), and ii) whether a symbolic program is generated on the way or not. Dong et al. (2017), TAGAN (Nam et al., 2018), and ManiGAN (Li et al., 2020) are close to us in terms of the problem setting. These manipulate the source image using a GAN-based encoder-decoder architecture. Their weak supervision differs from ours: we need VQA annotations, whereas they need captions or textual descriptions. The complexity of their natural language instructions is restricted to 0-hop. Most of their experimentation is limited to single (salient) object scenes, and it is unclear how these strategies would perform in multi-object situations with intricate relationships. Lastly, while our approach requires only an explicit manipulation (delta) command during inference, existing approaches require a partial target image description, and it is unclear how their methods can be extended to the task where only the delta is given. In terms of task complexity, the closest to us are approaches such as TIM-GAN (Zhang et al., 2021) and GeNeVA (El-Nouby et al., 2019), which build an encoder-decoder architecture and work with a latent representation of the image as well as the manipulation instruction. They require explicit annotations in terms of manipulated images during training.
We argue that this can require significantly more annotation effort compared to our weak supervision setting, where we only need visual question answer annotations. Unlike us, these approaches work with purely neural models, and as shown in our experiments, their performance is heavily dependent on the amount of data available for training. In terms of technique, the closest to our work are neuro-symbolic approaches for VQA such as NSVQA (Yi et al., 2018), NSCL (Mao et al., 2019), Neural Module Networks (Andreas et al., 2016), and its extensions (Hu et al., 2017; Johnson et al., 2017a). While the modeling approach is similar and consists of constructing latent programs, the desired tasks differ in the two cases. Our work extends the NSCL approach to the task of automated image manipulation. A related task is text-guided image retrieval, where the goal is to retrieve (not manipulate) an image from a database that complies with the changes asked for in the input instruction (Vo et al., 2019; Chen et al., 2020). Another line of research (Jiang et al., 2021; Shi et al., 2021) deals with editing global features, such as brightness and contrast, instead of object-level manipulations as in our case. Recent works (Ramesh et al., 2022; Saharia et al., 2022) on text-to-image generation using diffusion models trained on massive (image, caption) data are capable of generating photorealistic images given text. These also have the capability of editing images, e.g., using text-diffs (Ramesh et al., 2022), but require captions for input images. Further, it is unclear how to extend this line of work to language-guided complex image manipulation settings where multi-hop reasoning may be required; preliminary studies (Marcus et al., 2022) have highlighted their shortcomings in terms of compositional reasoning and dealing with relations.
3. NEUROSIM: NEURO-SYMBOLIC IMAGE MANIPULATOR

The learner can query repeatedly until it learns to perform the manipulation task correctly. Note that the learner does not have access to supervised data for the manipulation task in the form of triplets $(I_s, C, I_f)$, where $I_s$ is the starting image, $C$ is the manipulation command, and $I_f$ is the resulting final image. Inspired by this, we set out to test this hypothesis by building a model capable of manipulating images without target images as supervision. Figure 3 captures a high-level architecture of the proposed NEUROSIM pipeline. NEUROSIM allows manipulating images containing multiple objects via complex natural language instructions. Similar to Mao et al. (2019), NEUROSIM assumes the availability of a domain-specific language (DSL) for parsing the instruction text $T$ into an executable program $P$. NEUROSIM is capable of handling addition, removal, and change operations over image objects. It reasons over the image to locate where the manipulation needs to take place and then carries out the manipulation operation. The first three modules, namely i) the visual representation network, ii) the semantic parser, and iii) the concept quantization network, are suitably customized from NSCL and trained as required for our purpose. In what follows, we describe the design and training mechanism of NEUROSIM.
2] Semantic Parsing Module: The input to this module is a manipulation instruction text $T$ in natural language. The output is a symbolic program $P$ generated by parsing the input text. The symbolic programs are made of operators that are part of our DSL (specified in Appendix Section A).
3] Concept Quantization Network: Any object in an image is defined by a set of visual attributes ($A$) and a set of symbolic values ($S_a$) for each attribute $a \in A$. E.g., attributes can be shape, size, etc. The different symbolic values allowed for an attribute are also known as concepts.
E.g., $S_{\text{color}} = \{\text{red}, \text{blue}, \text{green}, \ldots\}$. Each visual attribute $a \in A$ is implemented via a separate neural network $f_a(\cdot)$, which takes the object embedding as input and outputs the attribute value for the object in a continuous (not symbolic) space. Let $f_{\text{color}} : \mathbb{R}^{d_{\text{obj}}} \to \mathbb{R}^{d_{\text{attr}}}$ represent a neural network for the color attribute and consider $o \in \mathbb{R}^{d_{\text{obj}}}$ as the object embedding. Then, $v_{\text{color}} = f_{\text{color}}(o) \in \mathbb{R}^{d_{\text{attr}}}$ is the embedding for the object $o$ pertaining to the color attribute. Each symbolic concept $s \in S_a$ for a particular attribute $a$ (e.g., different colors) is also assigned an embedding in the same continuous space $\mathbb{R}^{d_{\text{attr}}}$. Such an embedding is denoted by $c_s$. These concept embeddings are initialized at random and later fine-tuned during training. An attribute embedding (e.g., $v_{\text{color}}$) can be compared with the embeddings of all the concepts (e.g., $c_{\text{red}}$, $c_{\text{blue}}$, etc.) using cosine similarity, for the purpose of concept quantization of objects.
Training for VQA: As a first step, we train all of the above three modules via a curriculum learning process (Mao et al., 2019). The semantic parser is trained jointly with the concept quantization networks to generate programs corresponding to the question texts coming from the VQA dataset. The corresponding output programs are composed of primitive operations from the DSL (e.g., filter, count, etc.) and do not include constructs related to manipulation operations. This trains the first three modules to high accuracy on the VQA task.
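The concept quantization step described above can be illustrated with a minimal sketch. It assumes tiny hand-written 3-dimensional embeddings in place of the learned $c_s$ and of the output of $f_{\text{color}}$; the function names `cosine_similarity` and `quantize` and all numeric values are hypothetical and are not taken from the NEUROSIM implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def quantize(attr_embedding, concept_embeddings):
    """Return the concept whose embedding is most similar to attr_embedding."""
    return max(
        concept_embeddings,
        key=lambda s: cosine_similarity(attr_embedding, concept_embeddings[s]),
    )

# Hypothetical concept embeddings c_s for the color attribute (assumed values).
concepts = {
    "red": [1.0, 0.0, 0.0],
    "blue": [0.0, 1.0, 0.0],
    "green": [0.0, 0.0, 1.0],
}

# Hypothetical attribute embedding v_color = f_color(o) for some object o.
v_color = [0.9, 0.2, 0.1]
predicted = quantize(v_color, concepts)
```

In this toy setting `predicted` is the concept closest in angle to `v_color`; in the actual system both sides of the comparison are learned during VQA training.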

3.3. NOVEL MODULES AND TRAINING PROCEDURE FOR NEUROSIM

NEUROSIM training starts with three sub-modules trained on the VQA task as described in Section 3.2. Next, we extend the original DSL to include three additional functional sub-modules within the semantic parsing module, namely add, remove, and change. Refer to Appendix Section A for details on the DSL. We now reset the semantic parsing module and train it again from scratch to generate programs corresponding to the image manipulation instruction text $T$. Such a program is subsequently used by the downstream pipeline to reason over the scene graph $G_I$ and manipulate the image. In this step, the semantic parser is trained using an off-policy, program-search-based REINFORCE (Williams, 1992) algorithm. Unlike the training of the semantic parser for the VQA task, in this step we do not have any reward supervision analogous to a final answer. Hence, we resort to a weaker form of supervision. In particular, consider an input instruction text $T$ and the set of all possible manipulation program templates $P_t$ from which one can create any actual program $P$ that is executable over the scene graph of the input image. For a program $P \in P_t$, our reward is positive if this program $P$ selects any object (or part of the scene graph) to be sent to the manipulation networks (change/add/remove). For example, consider the program change(filter(scene())): if, after executing filter(scene()), not even a single object is selected, then we give a negative reward, signifying that this program cannot be correct; otherwise we give a positive reward. Incorrect programs can also lead to one or more objects being selected for manipulation, which is why this is a weak supervision. See Appendix C for more details. Once the semantic parser is retrained, we freeze the first three modules and continue using them for parsing instructions and converting images into their scene graph representations. Scene graphs are manipulated using our novel module, called the manipulation network, which is described next.
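The weak reward used for retraining the semantic parser can be sketched as follows. This is an illustration under strong simplifying assumptions: the scene is a list of symbolic attribute dictionaries rather than a neural scene graph, only the `filter` operator is modelled, and the reward magnitudes of +1 and -1 are placeholders (the text only states that the reward is positive or negative). The names `run_selection` and `weak_reward` are hypothetical.

```python
def run_selection(scene, steps):
    """Execute the object-selection part of a program on a symbolic scene."""
    selected = list(scene)  # scene(): start from all objects
    for op, concept in steps:
        if op == "filter":
            # Keep objects that carry the requested concept in any attribute.
            selected = [obj for obj in selected if concept in obj.values()]
        else:
            raise ValueError(f"unsupported operator: {op}")
    return selected

def weak_reward(scene, steps):
    """Positive if at least one object reaches the manipulation operator."""
    return 1.0 if run_selection(scene, steps) else -1.0

# Hypothetical two-object scene.
scene = [
    {"size": "large", "shape": "sphere", "color": "red"},
    {"size": "small", "shape": "cube", "color": "blue"},
]

# Selects the large sphere, so the program is rewarded.
reward_ok = weak_reward(scene, [("filter", "large"), ("filter", "sphere")])
# Selects nothing, so the program cannot be correct.
reward_empty = weak_reward(scene, [("filter", "cylinder")])
# A program that picks the wrong object is still rewarded: the signal is weak.
reward_wrong = weak_reward(scene, [("filter", "small")])
```

The third call shows why this supervision is weak: any program that selects some object receives the same positive reward, whether or not it selects the intended one.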
4] Manipulation Network: This is our key module, responsible for carrying out the manipulation operations. We allow three kinds of manipulation operations: add, remove, and change. Each of these operations is a composition of a quasi-symbolic and a symbolic operation. A symbolic operation corresponds to a function that performs the required structural changes (i.e., addition/deletion of a node or an edge) in the scene graph $G_I$ for a given instruction. A quasi-symbolic operation is a dedicated neural network that takes the relevant part of $G_I$ as input and computes new representations of nodes and edges that are compatible with the changes described in the parsed instruction.
(a) Change Network: For each visual attribute $a \in A$ (e.g., shape, size, ...), we have a separate change neural network that takes the pair (object embedding, embedding of the changed concept) as input and outputs the embedding of the changed object. This is the quasi-symbolic part of the change function, while the symbolic part is an identity mapping. For example, let $g_{\text{color}} : \mathbb{R}^{d_{\text{obj}} + d_{\text{attr}}} \to \mathbb{R}^{d_{\text{obj}}}$ represent the neural network that changes the color of an object. Consider $o \in \mathbb{R}^{d_{\text{obj}}}$ as the object embedding and $c_{\text{red}} \in \mathbb{R}^{d_{\text{attr}}}$ as the concept embedding for the color red; then $\tilde{o} = g_{\text{color}}(o; c_{\text{red}}) \in \mathbb{R}^{d_{\text{obj}}}$ represents the changed object embedding, whose color would be red. After applying the change neural network, we obtain the changed representation of the object $\tilde{o} = g_a(o; c_{s^*_a})$, where $s^*_a$ is the desired changed value for the attribute $a$. This network is trained using the following losses.
$$\ell_a = -\sum_{s \in S_a} \mathbb{I}[s = s^*_a] \, \log p(h_a(\tilde{o}) = s) \qquad (1)$$
$$\ell_{\bar{a}} = -\sum_{a' \in A,\, a' \neq a} \; \sum_{s \in S_{a'}} p(h_{a'}(o) = s) \, \log p(h_{a'}(\tilde{o}) = s) \qquad (2)$$
where $h_a(x)$ gives the concept value of the attribute $a$ (in symbolic form $s \in S_a$) for the object $x$.
The quantity $p(h_a(x) = s)$ denotes the probability that the concept value of the attribute $a$ for the object $x$ is equal to $s$ and is given as follows:
$$p(h_a(x) = s) = \frac{\exp\big(\mathrm{dist}(f_a(x), c_s)\big)}{\sum_{s' \in S_a} \exp\big(\mathrm{dist}(f_a(x), c_{s'})\big)}$$
where $\mathrm{dist}(a, b) = (a^\top b - t_2)/t_1$ is the shifted and scaled cosine similarity, $t_1, t_2$ being constants. The first loss term $\ell_a$ penalizes the model if the (symbolic) value of the attribute $a$ for the manipulated object is different from the desired value $s^*_a$ in terms of probabilities. The second term $\ell_{\bar{a}}$, on the other hand, penalizes the model if the values of any of the other attributes $a'$ deviate from their original values. Apart from these losses, we also include the following additional losses.
$$\ell_{\text{cycle}} = \| o - g_a(\tilde{o}; c_{\text{old}}) \|_2 ; \qquad \ell_{\text{consistency}} = \| o - g_a(o; c_{\text{old}}) \|_2 \qquad (3)$$
$$\ell_{\text{objGAN}} = -\sum_{o' \in O} \big[ \log D(o') + \log\big(1 - D(g_a(o'; c))\big) \big] \qquad (4)$$
where $c_{\text{old}}$ is the original value of the attribute $a$ of object $o$, before undergoing the change. Intuitively, the first loss term $\ell_{\text{cycle}}$ says that changing an object and then changing it back should result in the same object. The second loss term $\ell_{\text{consistency}}$ intuitively means that changing an object $o$ that has value $c_{\text{old}}$ for attribute $a$ into a new object with the same value $c_{\text{old}}$ should not result in any change. These additional losses prevent the change network from changing attributes that are not explicitly taken care of by the earlier losses (1) and (2), for example the rotation or location attributes of the objects, which are not part of our DSL. We also impose an adversarial loss $\ell_{\text{objGAN}}$ to ensure that the new object embedding $\tilde{o}$ is from the same distribution as real object embeddings. See Appendix C for more details.
(b) Remove Network: The remove network takes the scene graph $G_I$ of the input image, removes from $G_I$ the subgraph containing the nodes (and incident edges) that correspond to the object(s) to be removed, and returns a new scene graph $\tilde{G}_I$ that is reduced in size.
The quasi-symbolic function for the remove network is the identity.
(c) Add Network: For adding a new object into the scene, the add network requires the symbolic values of the different attributes, say $\{s_{a_1}, s_{a_2}, \ldots, s_{a_k}\}$, for the new object, e.g., {red, cylinder, ...}. It also requires the spatial relation $r$ (e.g., RightOf) of the new object with respect to an existing object in the scene. The add function works by first predicting the object (node) embedding $\tilde{o}_{\text{new}}$ for the object to be added, followed by predicting edge embeddings for the new edges incident on the new node. The new object embedding is obtained as follows:
$$\tilde{o}_{\text{new}} = g_{\text{addObj}}\big(\{c_{s_{a_1}}, c_{s_{a_2}}, \ldots, c_{s_{a_k}}\}, o_{\text{rel}}, c_r\big)$$
where $o_{\text{rel}}$ is the object embedding of an existing object, relative to which the new object's position $r$ is specified. After this, for each existing object $o_i$ in the scene, an edge $e_{\text{new},i}$ is predicted between the newly added object $\tilde{o}_{\text{new}}$ and the existing object $o_i$ in the following manner: $e_{\text{new},i} = g_{\text{addEdge}}(\tilde{o}_{\text{new}}, o_i)$. The functions $g_{\text{addObj}}(\cdot)$ and $g_{\text{addEdge}}(\cdot)$ are quasi-symbolic operations. The symbolic operations in the add network comprise adding the above node and the incident edges into the scene graph. The add network is trained in a self-supervised manner. For this, we pick a training image and create its scene graph. Next, we randomly select an object $o$ from this image and quantize its concepts, along with a relation with any other object $o_i$ in the same image. We then use our remove network to remove this object $o$ from the scene. Finally, we use the quantized concepts and the relation gathered above to add this object $o$ back into the scene graph using $g_{\text{addObj}}(\cdot)$ and $g_{\text{addEdge}}(\cdot)$. Let the embedding of the object after adding it back be $\tilde{o}_{\text{new}}$.
Training losses for this network are as follows:
$$\ell_{\text{concepts}} = -\sum_{j=1}^{k} \log p\big(h_{a_j}(\tilde{o}_{\text{new}}) = s_{a_j}\big) ; \qquad \ell_{\text{relation}} = -\log p\big(h_r(\tilde{o}_{\text{new}}, o_i) = r\big) \qquad (5)$$
$$\ell_{\text{objSup}} = \| o - \tilde{o}_{\text{new}} \|_2 ; \qquad \ell_{\text{edgeSup}} = \sum_{i \in O} \| e_{\text{old},i} - e_{\text{new},i} \|_2 \qquad (6)$$
$$\ell_{\text{edgeGAN}} = -\sum_{i \in O} \big[ \log D(\{o; e_{\text{old},i}; o_i\}) + \log\big(1 - D(\{\tilde{o}_{\text{new}}; e_{\text{new},i}; o_i\})\big) \big] \qquad (7)$$
where $s_{a_j}$ is the required (symbolic) value of the attribute $a_j$ for the original object $o$, and $r$ is the required relational concept. $O$ is the set of objects in the image, and $e_{\text{old},i}$ is the edge embedding for the edge between the original object $o$ and its neighboring object $o_i$. Similarly, $e_{\text{new},i}$ is the corresponding embedding of the same edge after we have removed and added back the original object. The loss terms $\ell_{\text{concepts}}$ and $\ell_{\text{relation}}$ ensure that the added object has the desired values of attributes and relation, respectively. Since we first removed and then added the object back, we already have the original edge and object representations, and hence we use them in the loss terms given in equation 6. We use the adversarial loss in equation 7 for generating real (object, edge, object) triples, and also a loss similar to equation 4 for generating real objects. For optimizing the generator in equations 4 and 7, the modified GAN loss (Goodfellow et al., 2014) is used.
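The probability $p(h_a(x) = s)$ underlies both the change-network losses in equations 1 and 2 and the term $\ell_{\text{concepts}}$ in equation 5. The sketch below computes it as a softmax over the shifted and scaled cosine similarity and evaluates the two change losses on toy data. The values chosen for $t_1$ and $t_2$, the 2-dimensional embeddings, and the function names are all assumptions made for illustration; the second loss is shown for a single untouched attribute rather than summed over all of them.

```python
import math

T1, T2 = 0.2, 0.15  # hypothetical values for the constants t_1 and t_2

def dist(a, b):
    """Shifted and scaled cosine similarity: (cos(a, b) - t_2) / t_1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return (dot / (norm_a * norm_b) - T2) / T1

def concept_probs(attr_embedding, concepts):
    """p(h_a(x) = s) for every concept s: softmax over dist(f_a(x), c_s)."""
    scores = {s: dist(attr_embedding, c) for s, c in concepts.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {s: math.exp(v - top) for s, v in scores.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}

def target_loss(probs_after, target):
    """Equation 1: negative log-probability of the desired concept."""
    return -math.log(probs_after[target])

def preserve_loss(probs_before, probs_after):
    """Equation 2 for one untouched attribute: cross-entropy before vs. after."""
    return -sum(probs_before[s] * math.log(probs_after[s]) for s in probs_before)

# Hypothetical concept embeddings for two attributes.
colors = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}
shapes = {"cube": [1.0, 0.0], "sphere": [0.0, 1.0]}

# Color attribute of the changed object when the edit worked vs. when it failed.
p_good = concept_probs([0.9, 0.1], colors)
p_bad = concept_probs([0.1, 0.9], colors)

# Shape attribute before the edit, and after an edit that blurred the shape.
shape_before = concept_probs([0.9, 0.1], shapes)
shape_drift = concept_probs([0.5, 0.5], shapes)

loss_good = target_loss(p_good, "red")
loss_bad = target_loss(p_bad, "red")
loss_kept = preserve_loss(shape_before, shape_before)
loss_drift = preserve_loss(shape_before, shape_drift)
```

Here `loss_good` is smaller than `loss_bad`, and `loss_kept` is smaller than `loss_drift`, which mirrors the intent of the two terms: reward reaching the target concept while leaving the other attributes where they were.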

3.4. IMAGE RENDERING FROM SCENE GRAPH

5] Rendering Network: The design and training methodology for this module closely follow Johnson et al. (2018). We take multiple images $\{I_1, I_2, \ldots, I_n\}$ and generate their scene graphs $\{G_{I_1}, G_{I_2}, \ldots, G_{I_n}\}$ using the visual representation network described earlier. Each of these scene graphs is then processed by a graph convolutional network and passed through a mask regression network followed by a box regression network to generate a coarse 2-dimensional structure (scene layout). A Cascaded Refinement Network (CRN) (Chen & Koltun, 2017) is then employed to generate an image from the scene layout. A min-max adversarial training procedure is used to generate realistic images, as formulated in GAN (Goodfellow et al., 2014), using two discriminators: i) a patch-based image discriminator that ensures the quality of the overall image, and ii) an object discriminator that ensures the quality of object appearance.
Datasets: Among the existing datasets, CSS (Vo et al., 2019) contains simple 0-hop instructions and is primarily designed for the text-guided image retrieval task. CRIR (Chen et al., 2020) extends CSS to include multi-hop instructions but is not open source. Other datasets such as i-CLEVR (El-Nouby et al., 2019) and CoDraw are designed for iterative image editing (El-Nouby et al., 2019). i-CLEVR contains only "add" instructions, and CoDraw does not contain multi-hop instructions. Hence, we created our own multi-object, multi-hop, instruction-based image manipulation dataset, referred to as CIM-NLI. This dataset was generated with the help of the CLEVR toolkit (Johnson et al., 2017b); details of the generation process are described in Appendix B. CIM-NLI consists of (source image $I$, instruction text $T$, target image $I^*$) triplets. The dataset contains a total of 18K, 5K, 5K unique images and 54K, 14K, 14K instructions in the train, validation, and test splits, respectively. Refer to Appendix B for more details about the dataset splits.
Baselines: Weakly supervised baselines (Li et al., 2020; Nam et al., 2018) for this task are designed for a problem setting different from ours: single salient object scenes and simple 0-hop instructions (refer to Section 2 for details). Further, they require paired images and their textual descriptions as annotations. We therefore do not compare with them in our experiments. Instead, we compare our model with purely supervised approaches such as TIM-GAN (Zhang et al., 2021) and GeNeVA (El-Nouby et al., 2019). In order to make a fair and meaningful comparison between the two kinds of approaches (supervised and weakly supervised), we carve out the following set-up. Let the cost required to create one annotated example for the image manipulation task be $\alpha_m$, and let the corresponding cost for the VQA task be $\alpha_v$. Let $\alpha = \alpha_m / \alpha_v$. Let $\beta_m$ be the number of annotated examples required by a supervised baseline to reach a performance level of $\eta_m$ on the image manipulation task. Similarly, let $\beta_v$ be the number of annotated VQA examples required to train NEUROSIM to reach the performance level of $\eta_v$. Let $\beta = \beta_m / \beta_v$. We are interested in the range of $\beta$ for which the performance of our system ($\eta_v$) is at least as good as that of the baseline ($\eta_m$). Correspondingly, we can compute the ratio of the labelling effort required, i.e., $\alpha \cdot \beta$, to reach these performance levels. If $\alpha \cdot \beta > 1$, our system achieves the same or better performance with lower annotation cost. See Appendices F and G for computational resources and hyperparameters, respectively.
Evaluation Metrics: For evaluation on the image manipulation task, we use two metrics: i) FID and ii) Recall@k. FID (Heusel et al., 2017) measures the realism of the generated images. Recall@k measures the semantic similarity of the gold manipulated image $I^*$ and the system-produced manipulated image $\tilde{I}$.
For computing Recall@k, we use $\tilde{I}$ as a query and retrieve images from a corpus comprising the entire test set (gold manipulated images) and the source image $I$ corresponding to $\tilde{I}$. Similar to Zhang et al. (2021), the query image and the corpus images are embedded into a latent space through an autoencoder trained on the CLEVR dataset. Cosine similarity is used for ranking the retrieved images.

[Table 2. Columns: Method, then FID, R1, R3 for each of β = 0.054, β = 0.07, β = 0.1, β = 0.2, β = 0.54. R1, R3 correspond to Recall@1,3 respectively. FID: lower is better; Recall: higher is better. Surviving row: GeNeVA at β = 0.054 with FID 22.0, R1 6.6, R3 58.7; its remaining entries are "-".]
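The Recall@k protocol described above can be sketched in a few lines. The embeddings below are hand-written 2-dimensional stand-ins for the autoencoder latents, and the names `cosine` and `recall_at_k` are hypothetical; the sketch only illustrates the ranking logic, in which each query is compared against all gold images plus its own source image as a distractor.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def recall_at_k(queries, sources, gold_ids, gold_corpus, k):
    """Fraction of queries whose gold image is ranked within the top k."""
    hits = 0
    for query, source, gold in zip(queries, sources, gold_ids):
        # Corpus = all gold manipulated images + this query's source image.
        candidates = dict(gold_corpus)
        candidates["source"] = source
        ranked = sorted(
            candidates,
            key=lambda name: cosine(query, candidates[name]),
            reverse=True,
        )
        hits += gold in ranked[:k]
    return hits / len(queries)

# Hypothetical latent embeddings of the gold manipulated images.
gold_corpus = {"g0": [1.0, 0.0], "g1": [0.0, 1.0]}
# Hypothetical embeddings of two system outputs and their source images.
queries = [[0.9, 0.1], [0.6, 0.5]]
sources = [[0.5, 0.5], [0.7, 0.3]]
gold_ids = ["g0", "g1"]

r1 = recall_at_k(queries, sources, gold_ids, gold_corpus, k=1)
r3 = recall_at_k(queries, sources, gold_ids, gold_corpus, k=3)
```

In this toy example the second query stays closer to its source image than to its gold target, so it misses at k = 1 and is only recovered at k = 3. Including the source image in the corpus is what penalizes a system that leaves the input essentially unchanged.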

4.1. PERFORMANCE WITH VARYING DATASET SIZE

Table 2 compares the performance of the NEUROSIM system with TIM-GAN and GeNeVA for varying levels of β on CIM-NLI. Despite being weakly supervised, NEUROSIM performs significantly better than both baselines for β ≤ 0.1 (alternatively, α ≥ 10) and very close to its closest competitor for β = 0.2 (alternatively, α = 5), using the R@1 performance metric. This clearly demonstrates the strength of our approach in learning to manipulate while only making use of VQA annotations. We hypothesize that, in most cases, NEUROSIM will be preferable, since we expect the cost of annotating an output image for manipulation to be significantly higher than the cost of annotating a VQA example. FID scores for NEUROSIM could potentially be improved by jointly training the VQA module with the image decoder loss; this is a direction for future work.
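The correspondence between β and α quoted above follows from the break-even condition α · β = 1. A small sketch of the arithmetic, using the β values from the Table 2 header; the function names are hypothetical:

```python
def break_even_alpha(beta):
    """Cost ratio alpha = alpha_m / alpha_v at which alpha * beta equals 1."""
    return 1.0 / beta

def weak_supervision_is_cheaper(alpha, beta):
    """True when the labelling-effort ratio alpha * beta exceeds 1."""
    return alpha * beta > 1.0

# Beta values listed in the Table 2 header.
betas = (0.054, 0.07, 0.1, 0.2, 0.54)
thresholds = {beta: break_even_alpha(beta) for beta in betas}

for beta, alpha in thresholds.items():
    print(f"beta = {beta}: weak supervision pays off once alpha exceeds {alpha:.2f}")
```

For β = 0.1 the threshold is α = 10 and for β = 0.2 it is α = 5, matching the values given in the text: if one manipulated-image annotation costs more than ten VQA annotations, matching the baseline at β = 0.1 is already cheaper with weak supervision.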

4.2. PERFORMANCE WITH INCREASING REASONING HOPS

Table 3 compares the baselines with NEUROSIM on instructions requiring 0-hop versus multi-hop (1-3 hops) reasoning. When dealing with multi-hop instructions, we see a large drop of 14.8 and 7.8 points in the performance of TIM-GAN trained on 10% (5.4K data points) and full (54K data points) CIM-NLI data, respectively. NEUROSIM trained on 10% of the data, without output image supervision, sees a performance drop of only 1.5 points, implying that it is much better at handling the complex reasoning involved.

4.3. ZERO-SHOT GENERALIZATION TO LARGER SCENES

We developed another dataset called CIM-NLI-LARGE, consisting of scenes with 10-13 objects (see Appendix B for details). We study the combinatorial generalization ability of NEUROSIM and the baselines when the models are trained on CIM-NLI, which contains scenes with 3-8 objects only, and evaluated on CIM-NLI-LARGE. Table 5 captures this comparison. NEUROSIM does significantly better than TIM-GAN and GeNeVA trained on 10% (5.4K data points) of the CIM-NLI data; for example, it improves over TIM-GAN's R1 score by 33.5 points. NEUROSIM nearly matches the performance of TIM-GAN trained on the full CIM-NLI data. This demonstrates the superior generalization capabilities of our weakly supervised model compared to supervised baselines.

[Figure: qualitative comparison. Columns: Input Image, Instruction, NEUROSIM, TIM-GAN, GeNeVA. Instructions shown: (1) "Remove the large object in front of the gray rubber cylinder." (2) "Remove the matte cylinder left of the small object that is behind the tiny cube." (3) "There is a rubber thing in front of the red matte ball; change the shape of it to cylinder." (4) "There is a large sphere, add a small purple matte ball behind it." (5) "There is a large cyan shiny thing that is on the right side of the small purple matte ball; change the shape of it to sphere."]

NEUROSIM produces interpretable output programs, showing the steps taken by the model to edit the images. Some examples and errors are shown in Appendix J. This highlights the ease of detecting failures of NEUROSIM, which is not possible with neural baselines.

4.5. HUMAN EVALUATION

For the human evaluation study, in each instance, we provided evaluators with four images: (1) the input image, (2) the ground-truth image, (3) the manipulated image generated by NEUROSIM (5.4K), and (4) the manipulated image generated by TIM-GAN (54K). Images generated by the two systems are randomly shuffled to avoid any annotation bias. Evaluators were asked two simple binary (yes: 1 / no: 0) questions about each system. The questions evaluated: (Q1) does the system perform the desired change mentioned in the input instruction, and (Q2) does the system introduce any undesired changes other than the required ones. See Appendix Table 20 for the exact text of the questions. There were a total of 7 evaluators, and each was given the same set of 30 random image quadruples.
We believe that the image rendering module of the NEUROSIM pipeline and the encoder modules used for computing Recall@k introduce some inefficiencies, resulting in lower R1 and R3 scores for us. Therefore, we decided to assess the quality of the manipulated scene graph $\tilde{G}_I$ that is generated in our pipeline. For this, we consider the task of text-guided image retrieval as proposed by Vo et al. (2019). In this task, an image has to be retrieved from the database that is the closest match to the desired manipulated image, but no manipulated image needs to be generated. Therefore, we use our manipulated scene graph $\tilde{G}_I$ as the latent representation of the input instruction and image for image retrieval. We retrieve images from the database based on a novel graph edit distance between the NEUROSIM-generated $\tilde{G}_I$ of the desired manipulated images and the scene graphs of the images in the database. This distance is defined using the Hungarian algorithm (Kuhn, 1955) with a simple cost defined between any two nodes of the graphs. See Appendix D for a detailed explanation. Table 6 captures the performance of NEUROSIM and other popular baselines on the image retrieval task.
From this table, we observe that NEUROSIM significantly outperforms supervised learning baselines by a margin of about 50% without ever using output image supervision. This result demonstrates that NEUROSIM edits the scene graph in a meaningful way. Refer to Appendix D, K, for additional results and ablations respectively.
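As a small illustration of the Recall@k metric used for R1 and R3, the following sketch counts how often the gold image appears among the top-k retrieved results. The function name `recall_at_k` and the toy image identifiers are hypothetical and are not taken from the NEUROSIM code.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold image id is among the top-k retrieved ids."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Toy example (hypothetical ids): three queries, each with a ranked list of three images.
rankings = [
    ["img3", "img1", "img7"],  # gold at rank 1
    ["img2", "img9", "img4"],  # gold at rank 3
    ["img5", "img6", "img8"],  # gold not retrieved
]
gold = ["img3", "img4", "img0"]

r1 = recall_at_k(rankings, gold, 1)
r3 = recall_at_k(rankings, gold, 3)
```

Here one of the three queries is a hit at rank 1 and two of the three are hits within the top 3.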

5. CONCLUSION

We present NEUROSIM, a neuro-symbolic, interpretable approach to the image manipulation task that uses weak supervision in the form of VQA annotations, building on existing work on neuro-symbolic VQA (Mao et al., 2019). Unlike previous approaches, ours is the first work that can handle multi-object scenes with complex instructions requiring multi-hop reasoning, and solve the task without any output image supervision. Our experiments on a newly created image manipulation dataset demonstrate the potential of our approach compared to supervised baselines. Directions for future work include carefully understanding the nature of the errors made by our symbolic programs, and having a human in the loop to provide feedback to the system for correction. Another direction would be experimenting with more complex and real image datasets; recent work on neuro-symbolic VQA for real images (Li et al., 2019) can be a good starting point.

B DATASET DETAILS

B.1 CIM-NLI DATASET

This dataset was generated with the help of the CLEVR toolkit (Johnson et al., 2017b) using the following recipe.
1. First, we create a source image I and the corresponding scene data using the Blender (Community, 2018) software.
2. For each source image I created above, we generate multiple instruction texts T using its scene data. These are generated using templates, similar to the question templates proposed by Johnson et al. (2017b).
3. For each such (I, T) pair, we attach a corresponding symbolic program P (not used by NEUROSIM though) as well as scene data for the corresponding changed image.
4. Finally, for each (I, T) pair, we generate the target gold image I* using the Blender software and its scene data from the previous step.

Below are some of the important characteristics of the CIM-NLI dataset.
• Each source image I comprises several objects, and each object has four visual attributes: color, shape, size, and material.
• Each instruction text T comprises one of the following three kinds of manipulation operations: add, remove, and change.
• An add instruction specifies the color, shape, size, and material of the object that needs to be added. It also specifies a direct (or indirect) relation with one or more existing objects (called reference object(s)). The number of relations that must be traversed to identify the target object is referred to as the number of reasoning hops, and we allow instructions with up to 3-hop reasoning. We do not generate any 0-hop instruction for add because it would be ambiguous where to place the object inside the scene.
• A change instruction first specifies zero or more attributes to uniquely identify the object that needs to be changed. It may also specify a direct (or indirect) relation with one or more existing reference objects. Lastly, it specifies the target value of the attribute that needs to be changed for the identified object.
• A remove instruction specifies zero or more attributes of the object(s) to be removed. Additionally, it may specify a direct (or indirect) relation with one or more existing reference objects.

Table 9 captures the fine-grained statistics of the CIM-NLI dataset. Specifically, it further splits each of the train, validation, and test sets across the instruction types: add, remove, and change.

B.2 CIM-NLI-LARGE DATASET

We created another dataset called CIM-NLI-LARGE to test the generalization ability of NEUROSIM on images containing more objects than the training images. CIM-NLI-LARGE tests the zero-shot transfer ability of both NEUROSIM and the baselines on such scenes. Each image in the CIM-NLI-LARGE dataset comprises 10-13 objects, as opposed to 3-8 objects in the CIM-NLI dataset that was used to train NEUROSIM. The CIM-NLI-LARGE dataset consists of 1K unique input images. We created 3 instructions for each image, resulting in a total of 3K instructions. The number of add instructions is significantly smaller since there is very little free space available in the scene to add new objects. To create scenes with 12 and 13 objects, we made all objects small and reduced the minimum distance between objects so that all objects could fit in the scene. Table 10 captures the statistics of this dataset.

B.3 MULTI-HOP INSTRUCTIONS

In what follows, we give examples of instructions that require multi-hop reasoning to identify the location/object to be manipulated in the image.
• Remove the tiny green rubber ball. (0-hop)
• There is a block right of the tiny green rubber ball, remove it. (1-hop)
• Remove the shiny cube left of the block in front of the gray thing. (2-hop)
• Remove the small thing that is left of the brown matte object behind the tiny cylinder that is behind the big yellow metal block. (3-hop)

C.2 MANIPULATION NETWORK

In what follows, we provide finer details of the manipulation network components.

Change Network: As described in Section 3.3 of the main paper, we have a change neural network for each attribute. For changing the current attribute value of a given object $o$, we use the following neural network: $\tilde{o} = g_a(o; c_{s^*_a})$, where $s^*_a$ is the desired changed value for the attribute $a$ and $\tilde{o}$ is the new representation of the object. We model $g_a(\cdot)$ by a single-layer neural network without any non-linearity. The input dimension of this neural network is (256 + 64) because we concatenate the object representation $o \in \mathbb{R}^{256}$ with the desired concept representation $d \in \mathbb{R}^{64}$. We pass this concatenated vector through $g_a(\cdot)$ to get the revised representation of the object, $\tilde{o} \in \mathbb{R}^{256}$. The loss used to train the weights of the change network is a weighted sum of the losses in equations 1 to 4 of the main paper. This leads to the overall loss function given below.

$$L_{\text{overall\_change}} = \lambda_1 \ell_a + \lambda_2 \ell_{\bar{a}} + \lambda_3 \ell_{\text{cycle}} + \lambda_4 \ell_{\text{consistency}} + \lambda_5 \ell_{\text{objGAN}} \tag{8}$$

where $\ell_{\text{objGAN}}$ is the modified GAN loss (Goodfellow et al., 2014). Here $\lambda_1 = 1$, $\lambda_2 = 1/((\text{num\_attrs} - 1) \cdot \text{num\_concepts})$, $\lambda_3 = \lambda_4 = 10^3$, and $\lambda_5 = 1/\text{num\_objects}$, where num_objects is the number of objects in the input image, num_attrs is the total number of attributes for each object, and num_concepts is the total number of concepts in the NSCL (Mao et al., 2019) framework. The object discriminator is a neural network with input dimension 256 and a single 300-dimensional hidden layer with ReLU activation. This discriminator is trained using the standard GAN objective $\ell_{\text{objGAN}}$.

Add Network: The neural operation in the add operator consists of predicting the object representation for the newly added object using a function $g_{\text{addObj}}(\cdot)$. This function is modeled as a single-layer neural network without any activation. The input to this network is a concatenated vector $[[c_{s_{a_1}}, c_{s_{a_2}}, \cdots, c_{s_{a_k}}], o_{rel}, c_r]$, where $[c_{s_{a_1}}, c_{s_{a_2}}, \cdots, c_{s_{a_k}}]$ represents the concatenation of all the concept vectors of the desired new object. The vector $o_{rel}$ is the representation of the object with respect to which the relation (i.e., position) of the new object has been specified, and $c_r$ is the concept vector for the specified relation. The loss used to train the add network weights is a weighted sum of the losses in equations 5 to 7 along with an object discriminator loss. The overall loss is given by the following expression.

$$L_{\text{overall\_add}} = \lambda_1 \ell_{\text{concepts}} + \lambda_2 \ell_{\text{relation}} + \lambda_3 \ell_{\text{objSup}} + \lambda_4 \ell_{\text{edgeSup}} + \lambda_5 \ell_{\text{edgeGAN}} + \lambda_6 \ell_{\text{objGAN}} \tag{9}$$

where $\ell_{\text{objGAN}}$ and $\ell_{\text{edgeGAN}}$ denote the modified GAN loss (Goodfellow et al., 2014). Here $\lambda_1 = \lambda_2 = 1/\text{num\_attrs}$, $\lambda_3 = \lambda_4 = 10^3$, and $\lambda_5 = 1/\text{num\_objects}$. The object discriminator is a neural network with input dimension 256 and a single 300-dimensional hidden layer with ReLU activation. This discriminator is trained using the standard GAN objective $\ell_{\text{objGAN}}$. Note that $\ell_{\text{objGAN}}$ has two parts: i) the loss for the generated (fake) object embedding produced by the add network, and ii) the loss for the real objects (all the unchanged object embeddings of the image). The former is unscaled, while the latter is scaled by a factor of $1/\text{num\_objects}$. The edge discriminator is a neural network with input dimension (256 * 3) and a single 300-dimensional hidden layer with ReLU activation. As input to this discriminator network, we pass the concatenation of the two objects and the edge connecting them. This discriminator is trained using the standard GAN objective $\ell_{\text{edgeGAN}}$.

Along with each method, we have also written the number of data points from the CIM-NLI dataset that were used for training. R1 and R3 correspond to Recall@1 and Recall@3, respectively.
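To make the weighting of the change-network loss concrete, here is a minimal sketch that computes the five λ values stated for the change network and combines five loss values in the order in which they appear in the overall change loss. The function names, the dictionary keys, and the example values (4 attributes, 15 concepts, 5 objects, unit losses) are assumptions made for illustration; the actual loss terms come from the query networks and discriminators and are not reproduced here.

```python
def change_loss_weights(num_attrs, num_concepts, num_objects):
    """Lambda values for the overall change loss, as stated in the text."""
    return {
        "lambda_1": 1.0,
        "lambda_2": 1.0 / ((num_attrs - 1) * num_concepts),
        "lambda_3": 1e3,
        "lambda_4": 1e3,
        "lambda_5": 1.0 / num_objects,
    }


def overall_change_loss(loss_terms, weights):
    """Weighted sum of the five loss terms, given in the same order as the lambdas."""
    ordered = [weights[f"lambda_{i}"] for i in range(1, 6)]
    return sum(w * l for w, l in zip(ordered, loss_terms))


# Hypothetical setting: 4 attributes per object, 15 concepts, 5 objects in the image.
weights = change_loss_weights(num_attrs=4, num_concepts=15, num_objects=5)

# Placeholder loss values (all 1.0) only to show how the terms are combined.
total = overall_change_loss([1.0, 1.0, 1.0, 1.0, 1.0], weights)
```

With these example numbers, the second weight is 1/45 and the last is 0.2, so the cycle and consistency terms (weight 1000 each) dominate the sum unless their raw values are small.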

D.2 IMAGE RETRIEVAL TASK

A task that is closely related to the image manipulation task is Text Guided Image Retrieval, proposed by Vo et al. (2019). Through this experiment, our aim is to demonstrate that NEUROSIM is highly effective in solving this task as well. In what follows, we provide details about this task, the baselines, the evaluation metric, how we adapted NEUROSIM for this task, and finally the performance results in Table 12. This table is a detailed version of Table 6 in the main paper.

Under review as a conference paper at 2023

Task Definition: Given an image I, a text instruction T, and a database of images D, the task is to retrieve an image from the database that is semantically as close to the ground-truth manipulated image as possible. Note that for each such (I, T) pair, some image in D is assumed to be the ideal image that should be retrieved at rank 1. This so-called desired gold retrieval image might be the ideal manipulated version of the original image I, satisfying the instruction T perfectly. Alternatively, it may not be such an ideal manipulated image, but it may still be the image in the whole corpus D that comes closest to the ideal manipulated image. In practice, while measuring the performance of any system on this task, the gold manipulated image for the (I, T) pair is typically inserted into the database D, and this image then serves as the desired gold retrieval image.

Baselines: Our baselines include popular supervised learning systems designed for this task. The first baseline is TIRG, proposed by Vo et al. (2019), which combines image and text to get a joint embedding and is trained in a supervised manner using the embedding of the desired retrieved image as supervision. For completeness, we also include a comparison with other baselines introduced by Vo et al. (2019): Concat, Image-Only, and Text-Only. A recent model proposed by Chen et al. (2020) uses symbolic scene graphs (instead of embeddings) to retrieve images from the database. Motivated by this, we also retrieve images via the scene graph that is generated by the manipulation module of NEUROSIM. However, unlike Chen et al. (2020), the nodes and edges in our scene graph have associated vectors, and we make novel use of them while retrieving. We do not compare our performance with Chen et al. (2020) since its code is unavailable and we have not been able to reproduce their numbers on the datasets used in their paper. Moreover, Chen et al. (2020) use full supervision of the desired output image (which is converted to a symbolic scene graph), while we do not.

Evaluation Metric: We use Recall@k (and report results for k = 1, 3) to evaluate the performance of text guided image retrieval algorithms, which is standard in the literature.

Retrieval using Scene Graphs: We use the scene graph generated by NEUROSIM as the latent representation to retrieve images from the database. We introduce a novel yet simple method to retrieve images using the scene graph representation. For converting an image into a scene graph, we use the visual representation network of NEUROSIM. Given the scene graph G for the input image I and the manipulation instruction text T, NEUROSIM converts the scene graph into the changed scene graph $G_{\tilde{I}}$, as described in Section C of this appendix. We then use this graph $G_{\tilde{I}}$ as a query to retrieve images from the database D. For retrieval, we use a novel graph edit distance (GED) between $G_{\tilde{I}}$ and the scene graph representations of the database images. The scene graph for each database image is also obtained using the visual representation network of NEUROSIM. The graph edit distance is given below.

$$\mathrm{GED}(G_{\tilde{I}}, G_D) = \begin{cases} \infty & \text{if } |N_{\tilde{I}}| \neq |N_D| \\ \min_{\pi \in \Pi} \sum_{i \in \{1, 2, \cdots, |N_{\tilde{I}}|\}} c(n_i, y_{\pi(i)}) & \text{otherwise,} \end{cases}$$

where $G_{\tilde{I}} = (N_{\tilde{I}}, V_{\tilde{I}})$ and $G_D = (N_D, V_D)$.
$n_i$ and $y_i$ are the node embeddings of the query graph $G_{\tilde{I}}$ and of the scene graph $G_D$ of an image from the database, respectively. $c(a, b)$ is a cost based on the cosine similarity between embeddings $a$ and $b$. This GED is much simpler than the one defined in Chen et al. (2020), since it does not need any hand-designed cost for the change, removal, or addition of nodes, or for different attribute values. It simply relies on the cosine similarities between node embeddings. We use the Hungarian algorithm (Kuhn, 1955) to calculate the optimal matching π of the nodes among all possible matchings Π. We use the negative of the cosine similarity scores between nodes to create the cost matrix for the Hungarian algorithm. This simple yet highly effective approach (see Table 6 in the main paper and Table 12 in the appendix) can be improved by more sophisticated techniques that include the distance between edge embeddings and a notion of subgraphs in the GED. We leave this as future work. This result shows that our manipulation network edits the scene graph in a desirable manner, as per the input instruction. Along with each method, the number of data points from CIM-NLI used for training is written.

[Table 13 fragment. Columns: β = 0.054, 0.07, 0.1, 0.2, 0.54, each with R1 and R3. Row "remove" (method label not recoverable): 69.6 82.5 | 69.5 82.5 | 69.5 82.6 | 69.5 82.5 | 69.6 82.5. Row "GeNeVA, add": 0.0 57.3, remaining entries "-".]

Table 13: Detailed performance comparison of NEUROSIM with TIM-GAN (Zhang et al., 2021) and GeNeVA (El-Nouby et al., 2019) with varying β levels, split across add, remove, and change instructions. The '-' entries for GeNeVA were not computed due to excessive training time; its performance is very poor even when using the full data. We always use 100K VQA examples (5K images, 20 questions per image) for our weakly supervised training. R1 and R3 correspond to Recall@1 and Recall@3, respectively. For Recall, a higher score is better.
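The node-matching distance described in this section can be sketched as follows. This is a minimal, stdlib-only illustration and not the authors' implementation: it finds the best matching by brute force over all permutations, which is only practical for small graphs, whereas the paper uses the Hungarian algorithm to obtain the same optimum efficiently. The function names, the toy two-dimensional embeddings, and the database keys are hypothetical, and node embeddings are assumed to be non-zero vectors.

```python
import itertools
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def graph_edit_distance(query_nodes, db_nodes):
    """Infinity if the node counts differ; otherwise the minimum total
    negative-cosine cost over all one-to-one node matchings."""
    if len(query_nodes) != len(db_nodes):
        return math.inf
    n = len(query_nodes)
    cost = [[-cosine(q, d) for d in db_nodes] for q in query_nodes]
    # Brute force over matchings; the Hungarian algorithm gives the same minimum.
    return min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )


def retrieve(query_nodes, database):
    """Rank database image names by increasing distance to the query graph."""
    return sorted(database, key=lambda name: graph_edit_distance(query_nodes, database[name]))


# Toy example with hypothetical 2-d node embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
database = {
    "a": [[0.0, 1.0], [1.0, 0.0]],  # same nodes in a different order
    "b": [[1.0, 1.0], [1.0, 1.0]],  # partially similar nodes
    "c": [[1.0, 0.0]],              # different number of nodes
}
ranking = retrieve(query, database)
```

In this toy example, image "a" is ranked first because its nodes match the query exactly after reordering, and "c" is ranked last because its node count differs.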

D.3 DETAILED MULTI-HOP REASONING PERFORMANCE

Table 14 below provides a detailed split of the performance numbers reported in Table 3 of the main paper across i) number of hops (0 -3 hops) and ii) type of instructions (add/remove/change). We observe that for change and remove instructions, NEUROSIM improves over TIM-GAN and GeNeVA trained on 5.4K CIM-NLI data points by a significant margin (∼ 20% on 3-hop change/remove instructions). However, NEUROSIM lags behind TIM-GAN when the entire CIM-NLI labelled data is used to train TIM-GAN. We also observe that all the models perform poorly on the add instructions, as compared to change and remove instructions.

D.4 DETAILED PERFORMANCE FOR DIFFERENT COST RATIOS β

Table 2 in Section 4 of the main paper showed the performance of NEUROSIM compared with TIM-GAN and GeNeVA for various values of the cost ratio β (see Section 4 of the main paper for its definition). We find that for the change operator, NEUROSIM performs better than TIM-GAN by a margin of ∼ 8% (considering Recall@1) for β ≤ 0.1. For the remove operator, NEUROSIM performs better than TIM-GAN by a margin of ∼ 4% (considering Recall@1) for β ≤ 0.2. Overall, NEUROSIM performs similarly to TIM-GAN for β = 0.2 on the remove and change operators. All models perform poorly on the add operator compared to the change and remove operators. We find that having full output image supervision allows TIM-GAN to reconstruct (copy) the unchanged objects from the input to the output for all the operators. This results in a higher recall in general, but its effect is most pronounced in Recall@3. NEUROSIM, on the other hand, suffers from rendering errors, which lower the overall recall score (especially Recall@3). We believe that improving image rendering quality would significantly improve the performance of NEUROSIM, and we leave this as future work.

Experiment: We use the Minecraft dataset (Section D.5.1) for testing the addition and removal of objects using NEUROSIM (see Fig 6). We train NEUROSIM's decoder to generate images from scene graphs of the Minecraft dataset. We assume access to a parser that gives us programs for an instruction. For removal, we use the same remove network as described above, while for addition, we assume access to the features of the object to be added, which are added to the scene graph of the image before the decoder decodes the final image. See Figure 6 for a set of successful examples on the Minecraft dataset. We see that, using our method, one can add and remove objects from the scene successfully without using any output image as supervision during training.
Though we have assumed the availability of parser in the above set-up, training it jointly with other modules should be straightforward, and can be achieved using our general approach described in Section 3 of the main paper. 

1] Zero-Shot Domain Transfer

Our first experiment was to check the performance of NEUROSIM (trained only on synthetic images from the CLEVR dataset) on real-world images without any retraining or fine-tuning. For this, we handpicked a few real images from the GQA dataset and performed the following steps:
1. We generated a scene graph for the image using our existing visual representation network (without retraining it).
2. Next, we queried for the color of an object in the image using our query network.
3. Next, we changed the color of the given object through our existing manipulation network.
4. Finally, we rendered the image using the representation of the changed object.

Figure 7 shows the probabilities obtained when we query the representation of real-world objects using our pretrained query networks before and after applying the pretrained change network to these object representations. This zero-shot experiment shows that our query and manipulation networks are able to disentangle attributes such as an object's color and also change them. However, because the rendering module is never trained on real images, it struggles to generate real images. It seems to map each object to the shapes it learned during CLEVR training. Training the image rendering module using graph-based representations on real images (e.g., object representations obtained using Faster RCNN) is likely to eliminate this problem and is part of our ongoing and future work.

(2016a). This scene graph is fed as input to the rendering module. We want to emphasize that the rendering module is not trained fully to convergence due to a lack of computational resources and time during the discussion phase.
• Next, we performed inference on the above trained image rendering module using unseen sample images from the dataset. Following ideas from Mao et al. (2019) and Li et al. (2021), we used a pre-trained ResNet classifier as our program parser, for selecting objects in a given scene as well as tagging their class labels (e.g., horse, elephant, etc.) and corresponding probabilities. This is similar to the concept quantization step described in the main paper.

Fig 8 shows that after applying the remove operation to an object in the given image, NEUROSIM is able to reconstruct the image without that object while keeping the rest of the scene intact (when compared with the baseline rendered image). We believe that training the image rendering module more comprehensively, to convergence, will result in better-quality visuals. We have demonstrated the remove operation on real-world images through the above experiment. We believe this result is still significant, since ours is the first work in this direction to achieve complex image manipulation through text. Performing the add and change operations on real-world images is future work.

The hyperparameters for the VQA task are kept the same as the default values from prior work (Mao et al., 2019). We refer the reader to Mao et al. (2019) for more details. We obtained a question answering accuracy of 99.3% after training on the VQA task.

G.2 TRAINING SEMANTIC PARSER

The semantic parser is trained to parse instructions. Learning of this module happens using the REINFORCE algorithm as described in Section C of this appendix. During REINFORCE algorithm, we search for positive rewards from the set {7, 8, 10}, and negative rewards from the set {0, 2, 3}. We finally choose a positive reward of 8 and negative reward of 2. For making this decision, we first train the semantic parser for 20 epochs and then calculate its accuracy by running it on the quantized scenes from the validation set. For a particular output program, we say it is correct if it leads to an object being selected (see Section C of the appendix for more information) and this is how the accuracy of the semantic parser is calculated. This accuracy is a proxy for the real accuracy. An alternative is to use annotated ground truth programs for calculating accuracy and then selecting hyperparameters. However, we do not use ground truth programs. All other hyperparameters are kept the same as used by Mao et al. (2019) to train the parser on VQA task. We obtain a validation accuracy of 95.64% after training the semantic parser for manipulation instructions.

G.3 TRAINING MANIPULATION NETWORKS

The architecture details of the manipulation network are given in Section C of this appendix. We use a batch size of 32 and a learning rate of $10^{-3}$, and optimize using AdamW (Loshchilov & Hutter, 2017) with a weight decay of $10^{-4}$. The rest of the hyperparameters are kept the same as in Mao et al. (2019). During training, every 5th epoch, we calculate the manipulation accuracy by using the query networks that were trained while training NEUROSIM on VQA data. This serves as a proxy for the validation accuracy.
• For the change network training, we use the query accuracy of whether the attribute that was supposed to change for a particular object has changed correctly, and whether any other attribute has changed.
• For the add network training, we use the query accuracy of whether the attributes of the added object are correct, and whether the added object is in the correct relation with the reference object.

We obtained a validation accuracy (based on querying) of 95.9% for the add network and 99.1% for the change network.
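The query-based proxy accuracy for the change network can be sketched as below. This is a simplified illustration under stated assumptions: the attribute values before and after manipulation are represented as plain dictionaries (standing in for the answers that the query networks would return), and the function names are hypothetical.

```python
def change_is_correct(before, after, attribute, target_value):
    """True if the requested attribute now has the target value and
    every other attribute is unchanged."""
    if after.get(attribute) != target_value:
        return False
    return all(
        after.get(name) == value
        for name, value in before.items()
        if name != attribute
    )


def proxy_accuracy(examples):
    """Fraction of (before, after, attribute, target_value) tuples judged correct."""
    correct = sum(change_is_correct(*example) for example in examples)
    return correct / len(examples)


# Toy example: answers that query networks might give before and after a change.
original = {"color": "red", "shape": "cube", "size": "small", "material": "rubber"}
good = {"color": "red", "shape": "cube", "size": "large", "material": "rubber"}
bad = {"color": "blue", "shape": "cube", "size": "large", "material": "rubber"}

examples = [
    (original, good, "size", "large"),  # only the size changed: correct
    (original, bad, "size", "large"),   # the color also changed: incorrect
]
accuracy = proxy_accuracy(examples)
```

The second example fails the check because an attribute that was meant to stay fixed (color) changed alongside the requested one.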

G.4 IMAGE DECODER TRAINING

The architecture of the image decoder is similar to that of Johnson et al. (2018), but our input scene graph (having embeddings for nodes and edges) is directly processed by the graph neural network. We use a batch size of 16 and a learning rate of $10^{-5}$, and optimize using the Adam (Kingma & Ba, 2014) optimizer. The rest of the hyperparameters are the same as in Johnson et al. (2018). We train the image decoder for a fixed 1000K iterations.

H QUALITATIVE ANALYSIS

Figures 9, 10, and 11 compare the images generated by NEUROSIM, TIM-GAN, and GeNeVA on add, change, and remove instructions, respectively. NEUROSIM's advantage lies in the semantic correctness of the manipulated images. For example, see Figure 9 rows #3 and #4, Figure 10 row #2, and all images in Figure 11. In these images, NEUROSIM achieves semantically correct changes, while TIM-GAN and GeNeVA produce blurry or smudged objects when adding them to the scene, remove incorrect objects from the scene, or leave the target object unchanged or only partially changed. Images generated by TIM-GAN are better in quality than those generated by NEUROSIM. We believe the reason for this is that TIM-GAN, being fully supervised, only changes a small portion of the image and has learnt to copy a significant portion of the input image directly to the output. However, this does not ensure the semantic correctness of TIM-GAN's manipulation, as described above with examples where it makes errors. The images generated by NEUROSIM look slightly worse since the entire image is generated from object-based embeddings in the scene graph. Improving neural image rendering from scene graphs is a promising step towards improving NEUROSIM.

[Figure 9: Examples for add instructions. Panels: Input Image, Instruction, NEUROSIM, TIM-GAN, GeNeVA, Ground Truth.]

[Figure 10: Examples for change instructions. Panels: Input Image, Instruction, NEUROSIM, TIM-GAN, GeNeVA, Ground Truth.]

[Figure 11: Examples for remove instructions. Panels: Input Image, Instruction, NEUROSIM, TIM-GAN, GeNeVA, Ground Truth.]

• [Logical Errors] This set includes images generated by our model that have logical errors. That is, the manipulation instruction has been interpreted incorrectly and a different manipulation has been performed. This happens mainly due to an incorrect parse of the input instruction into the program, or a manipulation network that is not trained to perfection, for example, a change network changing attributes that were supposed to remain unchanged.
• [VQA Errors] The query networks are not ideal and have errors after they are trained on the VQA task. This in turn causes errors in the supervision (obtained from the query networks) while training the manipulation networks and leads to a less than optimally trained manipulation network. Also, during inference, object embeddings may not be perfect due to imperfections in the visual representation network, which leads to incorrect rendering.

J INTERPRETABILITY OF NEUROSIM

NEUROSIM allows for interpretable image manipulation through programs that are generated as an intermediate representation of the input instruction. This is one of the major strengths of NEUROSIM, since it allows humans to detect where NEUROSIM failed. This is not possible with purely neural models, which behave as a black box. Knowing the failure cases of NEUROSIM also means that it can be selectively trained to improve certain parts of the network (e.g., training individually on change instructions to improve the change command if the model is performing poorly on change instructions). We now assess the correctness of the intermediate programs using the randomly selected qualitative examples in Figure 13. Since no wrong program was obtained in the randomly selected set, we manually find 2 more data points to show some wrong examples.

K ABLATIONS

Table 19 shows the performance of NEUROSIM when certain loss terms are removed during the training of the networks. This demonstrates the importance of the loss terms that we have considered. In particular, we test the performance of the network after removing the edge adversarial loss used by the add network (row 2), the object adversarial losses for both the add and change networks (rows 3 and 5), the self-supervision losses used by the add network (row 4), and the cyclic (row 6) and consistency (row 7) losses used by the change network.

Table 19: Ablations conducted by removing some loss terms. ℓ is the total loss before any ablation. For each loss term being removed, the superscript denotes which network it belongs to (add or change). Ablations are conducted for the setting where β = 0.054 (see Section 4 of the main paper for the definition of β).

L HUMAN EVALUATION

See Table 20 for the questions asked to human evaluators for the human evaluation study. See Section 4.5 of the main paper for more details.

Question 1:
• [Change] Are all the attributes (color, shape, size, material, and relative position) of the changed object mentioned in the instructions identical between the ground truth image and the system-generated image?
• [Add] Are all the attributes (color, shape, size, material, and relative position) of the added object mentioned in the instructions identical between the ground truth image and the system-generated image?
• [Remove] Are the same objects removed in the ground truth image and the system-generated image?

Question 2:
• [Change] Are all the attributes (color, shape, size, material, and relative position) of the remaining objects identical between the ground truth image and the system-generated image?
• [Add] Are all the attributes (color, shape, size, material, and relative position) of the remaining objects identical between the ground truth image and the system-generated image?
• [Remove] Are all the attributes (color, shape, size, material, and relative position) of the remaining objects identical between the ground truth image and the system-generated image?

Table 20: Questions asked to human evaluators for evaluating NEUROSIM and TIM-GAN. Note that there are some variations in the questions for Change, Add, and Remove instructions due to the different semantic nature of the instructions.



ST:= Supervision Type, IF:= Instruction Format, SC:= Scene Complexity, IC:=Instruction Complexity. Abbreviations (column values) SS:=Self Supervision, DS:=Direct Supervision, WS:=Weak Supervision, #: Human Written, MO:= Multiple Objects, MO * := Multiple Objects with Regular Patterns, SO:= Single Object, †: Natural Images, N/A:= Not applicable, MH:=Multi-Hop, ZH:=Zero-Hop, N:= Neural, NS:= Neuro-Symbolic, ✓:= Yes, ✗:= No. See Section 2 for more details.

Figure 2: Motivating example for our approach.

The key motivation behind our approach comes from the following hypothesis: consider a learner L (e.g., a neural network or the student in Fig 2) with sufficient capacity trying to achieve the task of manipulation over images I. Further, let each image be represented in terms of its properties, or the properties of its constituent parts (e.g., objects like apple, leaf, tree as shown in Fig 2), where each property comes from a finite set S, e.g., attributes of objects in an image. Let the learner be provided with prior knowledge (e.g., through Question Answering as in Fig 2) about properties (e.g., color) and their possible values (e.g., red). Then, in order to learn the task of manipulation, it suffices to provide the learner with a query network, which given a manipulated image Ĩ constructed by the learner via command C, can correctly answer questions (i.e., query) about the desired state of various properties of the constituents of the image

Figure 3: High level architecture of NEUROSIM.

3.2 MODULES INHERITED FROM NSCL

1] Visual Representation Network: Given input image I, this network converts it into a scene graph $G_I = (N, E)$. The nodes N of this scene graph are object embeddings, and the edges E are embeddings capturing the relationship between pairs of objects (nodes). Node embeddings are obtained by passing the bounding box of each object (along with the full image) through a ResNet-34 (He et al., 2016b). Edge embeddings are obtained by concatenating the corresponding object embeddings.
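As a rough illustration of the scene graph structure described above, the sketch below builds edge embeddings by concatenating the embeddings of each ordered pair of objects. The function name, the use of ordered (directed) pairs, and the tiny two-dimensional embeddings are assumptions made for illustration; in NEUROSIM the node embeddings come from a ResNet-34 and are not computed here.

```python
def build_scene_graph(object_embeddings):
    """Return (nodes, edges): nodes map object index to embedding, and each edge
    (i, j) with i != j holds the concatenation of the two node embeddings."""
    nodes = dict(enumerate(object_embeddings))
    edges = {
        (i, j): nodes[i] + nodes[j]  # list concatenation
        for i in nodes
        for j in nodes
        if i != j
    }
    return nodes, edges


# Toy example with three objects and hypothetical 2-d embeddings.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
nodes, edges = build_scene_graph(embeddings)
```

For three objects this yields six directed edges, each twice as long as a node embedding.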

.6 (-14.8) NEUROSIM (5.4K) 64.5 63.0(-1.5)

Figure 4: Visual comparison of NEUROSIM results with TIM-GAN and GeNeVA.

4.4 QUALITATIVE ANALYSIS AND INTERPRETABILITY

Figure 4 shows anecdotal examples for visually comparing NEUROSIM with baselines. Note that GeNeVA either performs the wrong operation on the image (rows #1, 2, 4, and 5) or simply copies the input image to the output without any modifications (row #3). TIM-GAN often makes semantic errors that show its lack of reasoning, for example, removing the wrong objects in row #3. Compared to the baselines, NEUROSIM produces semantically more meaningful image manipulation. NEUROSIM can easily recover occluded objects (row #4). All models make rendering errors such as partial removal of objects and shape distortion (rows #2, 4, and 5). More results are in Section H of the appendix.

extending the type system of Mao et al. (2019) and add ConceptSet because our add operation takes as input a set of concepts depicting attribute values of the new object being added (refer

Fig 5a for an overview of the change operator

Remove Network: The remove network is a symbolic operation as described in Section 3.3 of the main paper. That is, given an input set of objects, the remove operation deletes the subgraph of the scene graph that contains the nodes corresponding to removed objects and the edges incident on those nodes. See Fig 5c for an overview of the remove operator.
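The symbolic remove operation can be illustrated with a small sketch. The dictionary-based scene graph and the name `remove_objects` are assumptions made for illustration only; the paper's scene graph stores neural embeddings, which are replaced here by placeholder strings.

```python
def remove_objects(nodes, edges, removed_ids):
    """Delete the given nodes and every edge incident on them.

    nodes: dict mapping object id -> node embedding
    edges: dict mapping (source id, target id) -> edge embedding
    removed_ids: iterable of object ids selected for removal
    """
    removed = set(removed_ids)
    kept_nodes = {i: emb for i, emb in nodes.items() if i not in removed}
    kept_edges = {
        (i, j): emb
        for (i, j), emb in edges.items()
        if i not in removed and j not in removed
    }
    return kept_nodes, kept_edges


# Toy graph with three objects and placeholder embeddings.
toy_nodes = {0: "n0", 1: "n1", 2: "n2"}
toy_edges = {(0, 1): "e01", (1, 0): "e10", (0, 2): "e02", (2, 0): "e20",
             (1, 2): "e12", (2, 1): "e21"}
new_nodes, new_edges = remove_objects(toy_nodes, toy_edges, [1])
```

Because the operation is purely symbolic, it has no learnable parameters; the inputs are left unmodified and a new graph is returned.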

Figure 5: Overview of new operators (change, add and remove) added to the DSL.

Fig 5b for an overview of the add operator

D ADDITIONAL RESULTS

D.1 DETAILED PERFORMANCE FOR ZERO-SHOT GENERALIZATION ON LARGER SCENES

RESULTS ON DATASETS FROM DIFFERENT DOMAINS

D.5.1 MINECRAFT DATASET

Dataset Creation: We create a new dataset of (image, instruction) pairs by building on the Minecraft dataset used in Yi et al. (2018). Specifically, we create zero-hop and one-hop remove instructions and one-hop add instructions, similar to the creation of CIM-NLI. This dataset contains scenes and objects from the Minecraft video game and is used in prior works for testing Neuro-Symbolic VQA systems like NSCL (Mao et al., 2019) and NS-VQA (Yi et al., 2018). The setting of the Minecraft worlds dataset is significantly different from CLEVR in terms of concepts and attributes of objects and visual appearance.

Figure 6: Results for addition and removal from images of the Minecraft dataset.

Figure 7: Examples of querying an object representation before and after the change operation by NEUROSIM trained only on synthetic images from the CLEVR dataset. Example images are taken from the GQA dataset (Hudson & Manning, 2019).

To reinforce the interpretability benefits of our model, we show the output of the remove operation on these examples in Fig 8, as well as the steps taken by the model to achieve this. For each example in this figure:
- The leftmost image is the source image that needs to be manipulated.
- The rightmost (bottom) image is obtained after rendering the scene graph of the source image. The purpose of this image is to show the baseline quality of the rendering module.
- The rightmost (top) image is obtained after the manipulation operation (removal of a single object in this case) is performed by NEUROSIM.

Figure 8: Examples of the application of operation on real images. Please note the programs are applied on the scene graphs. For better visualization, program steps have been shown on the full image.

Figure 9: Visual comparison of NEUROSIM with TIM-GAN and GeNeVA for the add operator. The red bounding boxes in the ground truth output image indicate the objects required to add to the input image.

Figure 10: Visual comparison of NEUROSIM with TIM-GAN and GeNeVA for the change operator. The red bounding boxes in the input and ground truth output image indicate the objects required to be changed.

Figure 11: Visual comparison of NEUROSIM with TIM-GAN and GeNeVA for the remove operator. The red bounding boxes in the input image indicate objects required to be removed.

Figure 12 captures images generated by our model where it has made errors. The kinds of errors that NEUROSIM makes can be broadly classified into three categories.
• [Rendering Errors] This set includes images generated by our model which are semantically correct but suffer from rendering errors. The common rendering errors include malformed cubes, partial cubes, change in position of objects, and different lighting.

Comparison of Prior Work. Abbreviations (column titles)

Performance comparison of NEUROSIM with TIM-GAN and GeNeVA with varying β levels. The '-' entries for GeNeVA were not computed due to excessive training time; its performance is low even when using full data. We always use 100K VQA examples (5K images, 20 questions per image) for our weakly supervised training.

R1 results for 0-hop (ZH) vs multi-hop (MH) instruction guided image manipulation.

Table 4 shows the average scores of evaluators across different questions. NEUROSIM performs much better on Q1 despite TIM-GAN using full annotation data, implying better semantic manipulation by NEUROSIM. TIM-GAN does significantly better on Q2, demonstrating its ability to generate better images. The average Fleiss' kappa score (Fleiss et al., 2013) is 0.796, implying high inter-evaluator agreement.

Performance on generalization to Larger Scenes

Quality assessment of G I via image retrieval task.

4.6 QUANTITATIVE ASSESSMENT OF MANIPULATED SCENE GRAPH G I

Statistics of CIM-NLI dataset introduced in this paper.

Statistics of CIM-NLI-LARGE dataset.

As explained in Section 3.3 of the main paper, for training with a weaker form of supervision, we use an off-policy program search based REINFORCE (Williams, 1992) algorithm for calculating the exact gradient. For this, we define a set of all possible program templates P t . For a given input instruction text T, we create a set of all possible programs P T from P t . For example, given the template remove(relate(•, filter(•, scene()))), the slots are filled in all possible ways with concepts, conceptSet, attributes and relational concepts extracted from the input sentence to get the programs for this particular template. All such programs created using all templates form the set P T . All programs in P T are executed over the scene graph of the input image. A typical program structure in our work is of the form manip_op(reasoning()), where manip_op represents the manipulation operator (for example change, add, or remove), and reasoning() either selects objects for change or remove, or selects a reference object for adding another object in relation to it. After a hyperparameter search for the reward (refer Section G of the appendix), we assign a reward of +8 if the reasoning() part of the program leads to an object being selected for a change/remove instruction or a related object being selected for an add instruction. If no such object is selected, we give a reward of +2. Reward values were decided on the basis of validation set accuracy. With this training strategy, we achieve a validation set accuracy of 95.64%, where this accuracy is calculated based on whether a program leads to an object being selected or not. Note that this is a proxy for the actual accuracy. Finding the actual accuracy would require a validation set of (instruction, ground truth output program) pairs, but we do not use such supervised data for training or validation.
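The program search and reward assignment described above can be sketched on a toy example. Everything below is an illustrative assumption rather than the authors' code: the scene is a list of symbolic objects instead of a neural scene graph, the helper names are hypothetical, and the policy is taken to be a softmax over per-program scores. Only the template shape and the reward values (+8 and +2) come from the text. Under the softmax assumption, enumerating all candidate programs lets the gradient of the expected reward be computed exactly, as p_k (R_k - J) for each score s_k.

```python
import math
from itertools import product

REWARD_SELECTED = 8.0  # reasoning() selected at least one object (value from the text)
REWARD_EMPTY = 2.0     # reasoning() selected nothing (value from the text)

# Toy symbolic scene; x is a horizontal position used by the relations.
SCENE = [
    {"id": 0, "color": "red", "shape": "sphere", "x": 0},
    {"id": 1, "color": "blue", "shape": "cube", "x": 1},
    {"id": 2, "color": "red", "shape": "cube", "x": 2},
]


def filter_objects(objects, concept):
    """Keep objects whose color or shape equals the concept."""
    return [o for o in objects if concept in (o["color"], o["shape"])]


def relate(references, relation, scene):
    """Return scene objects standing in `relation` to any reference object."""
    def holds(cand, ref):
        if cand["id"] == ref["id"]:
            return False
        return cand["x"] < ref["x"] if relation == "left" else cand["x"] > ref["x"]
    return [c for c in scene if any(holds(c, r) for r in references)]


def enumerate_programs(concepts, relations):
    """Fill the template remove(relate(., filter(., scene()))) in all ways."""
    return [("remove", rel, con) for rel, con in product(relations, concepts)]


def reward(program, scene):
    """+8 if the reasoning part selects some object, +2 otherwise."""
    _, rel, con = program
    selected = relate(filter_objects(scene, con), rel, scene)
    return REWARD_SELECTED if selected else REWARD_EMPTY


def exact_policy_gradient(scores, rewards):
    """Expected reward J and dJ/ds_k under a softmax policy over programs."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    expected = sum(p * r for p, r in zip(probs, rewards))
    grads = [p * (r - expected) for p, r in zip(probs, rewards)]
    return expected, grads


# Concepts and relations as they might be extracted from an instruction such as
# "remove the object left of the blue sphere" (hypothetical example).
programs = enumerate_programs(concepts=["blue", "sphere"], relations=["left"])
rewards = [reward(p, SCENE) for p in programs]
expected, grads = exact_policy_gradient([0.0, 0.0], rewards)
```

In this toy scene, filtering by "blue" leaves a reference object with something to its left, while filtering by "sphere" does not, so the two candidate programs receive different rewards and the gradient favors the first.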



Detailed performance scores for NEUROSIM, TIM-GAN, and GeNeVA for zero-shot generalization to larger scenes (with ≥ 10 objects) from CIM-NLI-LARGE dataset, while models are trained on images with 3 -8 objects. Table has separate performance numbers for add, remove, and change instructions.

Performance scores (Recall@1) on the Image Retrieval task, comparing NEUROSIM with TIM-GAN and GeNeVA with increase in reasoning hops, for add, remove, and change instructions.

Performance scores (Recall@1) for NEUROSIM with TIM-GAN and GeNeVA with increase in reasoning hops, for add, remove, and change instructions. Along with each method, the number of data points from CIM-NLI used for training is written.

(with output image supervision) image manipulation examples required by the supervised baselines, to the number of annotated VQA examples required to train NEUROSIM. In Table 13, we show a detailed split of the performance, for the add, change, and remove operators, across the same values of β as taken before.

6. ETHICS STATEMENT

All the datasets used in this paper were synthetically generated and do not contain any personally identifiable information or offensive content. The ideas and techniques proposed in this paper are useful in designing interpretable natural language-guided tools for image editing, computer-aided design, and video games. One of the possible adverse impacts of AI-based image manipulation is the creation of deepfakes (Vaccari & Chadwick, 2020), i.e., the use of deep learning to create fake images. To counter deepfakes, several researchers (Dolhansky et al., 2020; Mirsky & Lee, 2021) have also looked into the problem of detecting real vs. fake images.

7. REPRODUCIBILITY STATEMENT

Code for the baselines in all our experiments is publicly available, as stated in Section 4. All the training details (e.g., data splits, data processing steps, hyperparameters) are provided in Section 4, Appendix B, and Appendix G. We use the CLEVR dataset (Johnson et al., 2017b) and the CLEVR toolkit (code to generate the dataset) for creating the new datasets introduced in this work. These are publicly available to use. The data creation methodology is explained in Appendix B. Code for NEUROSIM will be open-sourced post acceptance.

APPENDIX A DOMAIN SPECIFIC LANGUAGE (DSL)

Table 7 captures the DSL used by our NEUROSIM pipeline. The first 5 constructs in this table are common with the DSL used in Mao et al. (2019). The last 3 operations (Change, Add, and Remove) were added by us to allow for the manipulation operations. 2019) while the last one is an extension of the type system for handling the inputs to the Add operator.

Type          Remarks
ObjConcept    Concepts for any given object, such as blue, cylinder, etc.
Attribute     Attributes for any given object, such as color, shape, etc.
RelConcept    Relational concepts for any given object pair, such as RightOf, LeftOf, etc.
ObjectSet     Depicts multiple objects
ConceptSet    A set of elements of ObjConcept type
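As a rough illustration of how these types might look in code, the sketch below maps them onto Python type aliases. The concrete representations (strings for concepts, a tuple of object indices for ObjectSet) and the `AddArguments` container with its field names are assumptions for illustration; the source only states that the add operation takes a set of concepts describing the new object.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

ObjConcept = str                    # e.g. "blue", "cylinder"
Attribute = str                     # e.g. "color", "shape"
RelConcept = str                    # e.g. "RightOf", "LeftOf"
ObjectSet = Tuple[int, ...]         # assumed: indices of the selected objects
ConceptSet = FrozenSet[ObjConcept]  # a set of ObjConcept elements


def make_concept_set(concepts: Iterable[ObjConcept]) -> ConceptSet:
    """Collect attribute values of a new object into a ConceptSet."""
    return frozenset(concepts)


@dataclass(frozen=True)
class AddArguments:
    """Hypothetical bundle of inputs for an add operation."""
    reference: ObjectSet     # reference object(s) chosen by the reasoning part
    relation: RelConcept     # where the new object goes relative to the reference
    new_object: ConceptSet   # attribute values of the object being added


args = AddArguments(reference=(2,), relation="RightOf",
                    new_object=make_concept_set(["red", "cube", "red"]))
```

Using a set rather than a list reflects that the order of attribute values in the instruction should not matter.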

E END-TO-END TRAINING

The main objective of this work is to make use of weakly supervised VQA data for the image manipulation task without using output image supervision. A natural extension of our work is to use output image supervision as well, to improve the performance of NEUROSIM. We devised an experiment to measure how much of a performance boost can be obtained by using ground truth output (manipulated) images as supervision for different modules of NEUROSIM. This experiment demonstrates the value of end-to-end training for NEUROSIM and how it can exploit supervised data. We refer to this variant as NEUROSIM(e2e). We begin with a pre-trained NEUROSIM model trained with VQA annotations and then fine-tune it using supervised manipulation data. The detailed results are given in Table 15. This experiment demonstrates that with a small amount of supervised data, the performance of NEUROSIM can be significantly improved (e.g., more than 9 points increase for the change instruction with only 5.4K supervision examples).

Given the significant increase in performance of NEUROSIM when using supervised data, we also test its generalization capability (analogous to Sections 4.2 and 4.3) and the quality of scene graph retrieval (analogous to Section 4.5).

From Table 16, we see that NEUROSIM(e2e) shows improved zero-shot generalization to larger scenes. Even when trained on just 5.4k CIM-NLI data, NEUROSIM(e2e) improves over TIM-GAN-54k by 3.9 R@1 points. A 5.3 point improvement over TIM-GAN is observed when full CIM-NLI data is used.

Next, we measure the drop in performance with increasing reasoning hops. From Table 17, we see that NEUROSIM(e2e) achieves the lowest drop when compared to TIM-GAN. NEUROSIM(e2e) improves over the weakly supervised NEUROSIM baseline by 6.6 R@1 points.

Finally, we measure the quality of the scene graph via retrieval.

These findings suggest that NEUROSIM(e2e) significantly outperforms other supervised approaches in almost all settings. One could also fine-tune the image decoder and the visual representation network, which should further improve these results.

F COMPUTATIONAL RESOURCES

We trained all our models and baselines on 1 Nvidia Volta V100 GPU with 32GB memory and 512GB system RAM. 

