WEAKLY-SUPERVISED NEURO-SYMBOLIC IMAGE MANIPULATION VIA MULTI-HOP COMPLEX INSTRUCTIONS

Abstract

We are interested in image manipulation via natural language text, a task that is extremely useful for multiple AI applications but requires complex reasoning over multi-modal spaces. Recent work on neuro-symbolic approaches, e.g., the Neuro-Symbolic Concept Learner (NSCL) (Mao et al., 2019), has been quite effective for solving Visual Question Answering (VQA), as these approaches offer better modularity, interpretability, and generalizability. We extend NSCL to the image manipulation task and propose a solution referred to as NEUROSIM. Previous work either requires supervised training data in the form of manipulated images or can only deal with very simple reasoning instructions over single-object scenes. In contrast, NEUROSIM can perform complex multi-hop reasoning over multi-object scenes and only requires weak supervision in the form of annotated data for VQA. NEUROSIM parses an instruction into a symbolic program, based on a Domain Specific Language (DSL) comprising object attributes and manipulation operations, that guides the manipulation. We design neural modules for manipulation, as well as novel loss functions that are capable of testing the correctness of the manipulated object and scene graph representations via query networks trained merely on VQA data. An image decoder is trained to render the final image from the manipulated scene graph. Extensive experiments demonstrate that NEUROSIM, without using target images as supervision, is highly competitive with SOTA baselines that make use of supervised data for manipulation.

1. INTRODUCTION

The last decade has seen a significant growth in the application of neural models to a variety of tasks, including those in computer vision (Chen et al., 2017; Krizhevsky et al., 2012), NLP (Wu et al., 2016), robotics, and speech (Yu & Deng, 2016). It has been observed that these models often lack interpretability (Fan et al., 2021) and may not always be well suited to handle complex reasoning tasks (Dai et al., 2019). On the other hand, classical AI systems can seamlessly perform complex reasoning in an interpretable manner due to their symbolic representation (Pham et al., 2007; Cai & Su, 2012). But these models are often found lacking in their ability to handle low-level representations and to be robust to noise. A natural question then arises: Can we design models which capture the best of both these paradigms? The answer lies in the recent development of neuro-symbolic models (Dong et al., 2019; Mao et al., 2019; Han et al., 2019), which combine the power of (purely) neural with (purely) symbolic representations. An interesting sub-class of these models works with a finite-sized Domain Specific Language (DSL) and makes use of deep networks to learn neural representations of the concepts specified in the DSL. The learned representations are then used for performing downstream reasoning via learning of symbolic programs. This line of work was first popularized by Andreas et al. (2016); Hu et al. (2017); Johnson et al. (2017a), followed by Mao et al. (2019), who look at the task of Visual Question Answering (VQA), and other follow-up works such as learning meta-concepts (Han et al., 2019). Studies (Andreas et al., 2016; Hu et al., 2017; Mao et al., 2019) have shown that these models have several desirable properties such as modularity, interpretability, and improved generalizability.
Motivated by the above, our aim is to build neuro-symbolic models for the task of weakly supervised manipulation of images comprising multiple objects, via complex multi-hop natural language instructions. Existing work includes weakly supervised approaches (Nam et al., 2018; Li et al., 2020) that require textual descriptions of images during training and are limited to very simple scenes (or instructions). Supervised approaches (Zhang et al., 2021; El-Nouby et al., 2019), though capable of handling multiple objects and complex multi-hop instructions, require explicit annotations in the form of target manipulated images; see Section 2 for a survey. We are interested in a weakly supervised solution that only makes use of data annotated for VQA, avoiding the high cost of obtaining supervised annotations in the form of target manipulated images. Our key intuition is that this task can be solved by simply querying the manipulated representation without ever explicitly looking at the target image. Our solution builds on the Neuro-Symbolic Concept Learner (NSCL) proposed by Mao et al. (2019) for solving VQA. We extend this work to incorporate the notion of manipulation operations such as change, add, and remove objects in a given image. As one of our main contributions, we design novel neural modules and a training strategy that just uses VQA annotations as weakly supervised data for the task of image manipulation. The neural modules are trained with the help of novel loss functions that measure the faithfulness of the manipulated scene and object representations by accessing a separate set of query networks, interchangeably referred to as quantization networks, trained just using VQA data. The manipulation takes place through interpretable programs created using primitive neural and symbolic operations from a Domain Specific Language (DSL). Separately, a network is trained to render the image from a scene graph representation using a combination of L1 and adversarial losses as done by Johnson et al. (2018).
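To make the program-guided manipulation concrete, the following is a toy sketch of how an instruction such as "change the small cube's color to red" could be parsed into a symbolic program of filter and change operations and executed over a scene graph. The operation names, attribute vocabulary, and scene representation here are illustrative stand-ins, not NEUROSIM's actual DSL or learned (quasi-symbolic) implementation.

```python
# Toy symbolic execution of a manipulation program over a scene graph.
# In NEUROSIM, filtering is quasi-symbolic (over concept embeddings) and
# the change operation is a learned neural module; here both are purely
# symbolic for illustration.
from copy import deepcopy

def scene(objs):
    """Return indices of all objects in the scene."""
    return list(range(len(objs)))

def filter_attr(objs, idxs, attr, value):
    """Keep only the selected objects whose attribute matches the value."""
    return [i for i in idxs if objs[i][attr] == value]

def change(objs, idxs, attr, value):
    """Manipulation op: overwrite an attribute of the selected objects."""
    objs = deepcopy(objs)
    for i in idxs:
        objs[i][attr] = value
    return objs

# Toy multi-object scene graph.
objects = [
    {"shape": "cube",     "size": "small", "color": "blue"},
    {"shape": "sphere",   "size": "large", "color": "red"},
    {"shape": "cylinder", "size": "small", "color": "green"},
]

# Program for "change the small cube's color to red":
#   Change(Filter(size=small, Filter(shape=cube, Scene())), color=red)
selected = filter_attr(objects,
                       filter_attr(objects, scene(objects), "shape", "cube"),
                       "size", "small")
manipulated = change(objects, selected, "color", "red")
print(manipulated[0]["color"])  # -> red
```

The source image's scene graph is left untouched; the program produces a new, manipulated scene graph that the image decoder can then render.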
The entire pipeline is trained without any intermediate supervision. We refer to our system as Neuro-Symbolic Image Manipulator (NEUROSIM). Figure 1 shows an example I/O pair for our approach. For our experiments, we extend CLEVR (Johnson et al., 2017b), a benchmark dataset for VQA, to incorporate manipulation instructions, and create a dataset referred to as Complex Image Manipulation via Natural Language Instructions (CIM-NLI). We will release this dataset publicly post acceptance. Our evaluation on the CIM-NLI dataset shows that, despite being weakly supervised, NEUROSIM is highly competitive with or improves upon state-of-the-art supervised approaches (Zhang et al., 2021; El-Nouby et al., 2019) for this task, generalizes well to scenes with more objects, and performs especially well on instructions which involve multi-hop reasoning.
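The weak-supervision signal described above can be sketched as follows: rather than comparing against a target image, a frozen query (quantization) network trained on VQA data is asked whether the manipulated object representation now answers attribute queries as the instruction demands. The 2-d concept embeddings, the softmax query network, and the cross-entropy loss below are a minimal illustrative stand-in for the paper's learned networks and loss functions, not their actual form.

```python
# Minimal sketch: a VQA-trained query network scores a manipulated object
# embedding against concept embeddings; the faithfulness loss is the
# cross-entropy w.r.t. the attribute value the instruction requested.
import math

# Frozen "concept embeddings" (illustrative 2-d vectors).
CONCEPTS = {"red": [1.0, 0.0], "green": [0.0, 1.0], "blue": [-1.0, 0.0]}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def query_color(obj_emb):
    """Frozen query network: softmax over similarities to color concepts."""
    logits = {c: dot(obj_emb, e) for c, e in CONCEPTS.items()}
    z = sum(math.exp(l) for l in logits.values())
    return {c: math.exp(l) / z for c, l in logits.items()}

def faithfulness_loss(manipulated_emb, target_attr):
    """Cross-entropy between the query network's answer on the manipulated
    embedding and the attribute value the instruction asked for."""
    return -math.log(query_color(manipulated_emb)[target_attr])

# An embedding the change module pushed toward "red" incurs a low loss;
# one left near "blue" incurs a high loss -- no target image needed.
good = faithfulness_loss([2.0, 0.0], "red")
bad = faithfulness_loss([-2.0, 0.0], "red")
print(good < bad)  # -> True
```

Because the query network is held fixed, gradients of this loss flow only into the manipulation modules, which is what allows training them from VQA annotations alone.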

2. RELATED WORK

Table 1 categorizes the related work across three broad dimensions: problem setting, task complexity, and approach. The problem setting comprises two sub-dimensions: i) supervision type -self, direct, or weak; ii) instruction format -text or UI-based. The task complexity comprises the following sub-dimensions: i) scene complexity -single or multiple objects; ii) instruction complexity -zero- or multi-hop instructions; iii) kinds of manipulations allowed -add, remove, or change. Finally, the approach consists of the following sub-dimensions: i) model -neural or neuro-symbolic, and ii) whether a symbolic program is generated along the way or not.
Dong et al. (2017), TAGAN (Nam et al., 2018), and ManiGAN (Li et al., 2020) are close to us in terms of the problem setting. These manipulate the source image using a GAN-based encoder-decoder architecture. Their weak supervision differs from ours: we need VQA annotations, while they need captions or textual descriptions. The complexity of their natural language instructions is restricted to 0-hop. Most of their experimentation is limited to single (salient) object scenes, and it is unclear how these strategies would perform in multi-object scenes with intricate relationships. Lastly, while our approach requires only an explicit manipulation (delta) command during inference, existing approaches require a partial target image description, and it is unclear how their methods can be extended to the task where only the delta is given. In terms of task complexity, the closest to us are approaches such as TIM-GAN (Zhang et al., 2021) and GeNeVA (El-Nouby et al., 2019), which build an encoder-decoder architecture and work with a latent representation of the image as well as the manipulation instruction. They require explicit annotations in the form of manipulated images during training. We argue that this can require significantly more annotation effort.



Figure 1: The problem setup.

