WEAKLY-SUPERVISED NEURO-SYMBOLIC IMAGE MANIPULATION VIA MULTI-HOP COMPLEX INSTRUCTIONS

Abstract

We are interested in image manipulation via natural language text, a task that is extremely useful for multiple AI applications but requires complex reasoning over multi-modal spaces. Recent neuro-symbolic approaches, e.g., the Neuro-Symbolic Concept Learner (NSCL) (Mao et al., 2019), have been quite effective for solving Visual Question Answering (VQA), as they offer better modularity, interpretability, and generalizability. We extend NSCL to the image manipulation task and propose a solution referred to as NEUROSIM. Previous work either requires supervised training data in the form of manipulated images or can only deal with very simple reasoning instructions over single-object scenes. In contrast, NEUROSIM can perform complex multi-hop reasoning over multi-object scenes and only requires weak supervision in the form of annotated data for VQA. NEUROSIM parses an instruction into a symbolic program, based on a Domain Specific Language (DSL) comprising object attributes and manipulation operations, that guides the manipulation. We design neural modules for manipulation, as well as novel loss functions that are capable of testing the correctness of manipulated object and scene graph representations via query networks trained merely on VQA data. An image decoder is trained to render the final image from the manipulated scene graph. Extensive experiments demonstrate that NEUROSIM, without using target images as supervision, is highly competitive with SOTA baselines that make use of supervised data for manipulation.
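To make the pipeline concrete, the following is a minimal, purely illustrative sketch of how an instruction might be parsed into a symbolic program over a DSL of attribute filters and manipulation operations, and then executed against a symbolic scene representation. All names (`filter_objs`, `change_attr`, the attribute vocabulary) are hypothetical stand-ins, not the paper's actual DSL or its neural operator implementations.

```python
# Illustrative sketch: a symbolic program over a toy DSL, executed on a
# symbolic scene (a list of objects with attributes). In NEUROSIM these
# steps are realized by neural modules over learned representations.

def filter_objs(objs, **attrs):
    """DSL filter operator: select objects matching all given attributes."""
    return [o for o in objs if all(o.get(k) == v for k, v in attrs.items())]

def change_attr(objs, attr, value):
    """DSL manipulation operator: set `attr` to `value` on selected objects."""
    for o in objs:
        o[attr] = value

scene = [
    {"id": 0, "shape": "cube",   "color": "red",  "size": "small"},
    {"id": 1, "shape": "sphere", "color": "blue", "size": "large"},
]

# Instruction: "Change the color of the small cube to green."
# Parsed program: filter(shape=cube, size=small) -> change(color, green)
selected = filter_objs(scene, shape="cube", size="small")
change_attr(selected, "color", "green")

print(scene[0]["color"])  # green
```

A multi-hop instruction would chain additional filter and relational operators (e.g., selecting an object *behind* another) before the final manipulation; the manipulated scene graph is then passed to the image decoder for rendering.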

1. INTRODUCTION

The last decade has seen significant growth in the application of neural models to a variety of tasks, including those in computer vision (Chen et al., 2017; Krizhevsky et al., 2012), NLP (Wu et al., 2016), robotics, and speech (Yu & Deng, 2016). It has been observed that these models often lack interpretability (Fan et al., 2021) and may not always be well suited to handling complex reasoning tasks (Dai et al., 2019). On the other hand, classical AI systems can seamlessly perform complex reasoning in an interpretable manner due to their symbolic representation (Pham et al., 2007; Cai & Su, 2012). But these models are often found lacking in their ability to handle low-level representations and be robust to noise. A natural question then arises: can we design models which capture the best of both these paradigms? The answer lies in the recent development of neuro-symbolic models (Dong et al., 2019; Mao et al., 2019; Han et al., 2019), which combine the power of (purely) neural with (purely) symbolic representations. An interesting sub-class of these models works with a finite-sized domain-specific language (DSL) and makes use of deep networks to learn neural representations of the concepts specified in the DSL. The learned representations are then used for performing downstream reasoning via the learning of symbolic programs. This line of work was first popularized by Andreas et al. (2016); Hu et al. (2017); Johnson et al. (2017a), followed by Mao et al. (2019), who look at the task of Visual Question Answering (VQA), and other follow-up works such as learning meta-concepts (Han et al., 2019). Studies (Andreas et al., 2016; Hu et al., 2017; Mao et al., 2019) have shown that these models have several desirable properties such as modularity, interpretability, and improved generalizability.
Motivated by the above, our aim is to build neuro-symbolic models for the task of weakly supervised manipulation of images comprising multiple objects, via complex multi-hop natural language instructions. Existing work includes weakly supervised approaches (Nam et al., 2018; Li et al., 2020) that require textual descriptions of images during training and are limited to very simple scenes (or

