HYBRID NEURO-SYMBOLIC REASONING BASED ON MULTIMODAL FUSION

Abstract

Deep neural models and symbolic Artificial Intelligence (AI) systems have contrasting advantages and disadvantages. Neural models can be trained from raw, incomplete and noisy data to obtain abstractions of features at various levels, but their lack of interpretability is well known. On the other hand, traditional rule-based symbolic reasoning encodes domain knowledge, but its failure is often attributed to the acquisition bottleneck. We propose a hybrid learning and reasoning system, based on a multimodal fusion approach, that brings together advantageous features of both paradigms. Specifically, we enhance convolutional neural networks (CNNs) with the structured information of 'if-then' symbolic logic rules, obtained via word embeddings corresponding to propositional symbols and terms. With many dozens of intuitive rules relating the type of a scene to its typical constituent objects, we achieve a significant improvement over the base CNN-based classification. Our approach is extensible to first-order logical syntax for rules and to other deep learning models.

1. INTRODUCTION

Deep learning technology has been employed with increasing frequency in recent years LeCun et al. (2015), Schmidhuber (2014). Various deep learning models have achieved remarkable results in computer vision Krizhevsky et al. (2017), remote sensing Zhu et al. (2017), target classification in SAR images Chen et al. (2016), and speech recognition Graves et al. (2013), Hinton et al. (2012). In the domain of natural language processing (NLP), deep learning methods are used to learn word vector representations through neural language models Mikolov et al. (2013) and to perform composition over the learned word vectors for classification Collobert et al. (2011). Convolutional neural networks (CNNs), for example, utilize layers with convolving filters that are applied to local features. CNNs are widely used for image tasks and are currently state-of-the-art for object recognition and detection. Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing Yih et al. (2015). However, the success of deep learning comes at a cost. First and foremost is its reliance on large amounts of labeled data, which are often difficult to collect and entail a slow learning process. Second, deep models are brittle in the sense that a trained network that performs well on one task often performs very poorly on a new task, even if the new task is very similar to the one it was originally trained on. Third, they are strictly reactive, meaning that they do not use high-level processes such as planning, causal reasoning, or analogical reasoning. Fourth, they cannot make use of human expertise, which could often reduce the burden of acquiring expensive training data. Purely data-driven learning can lead to uninterpretable and sometimes counter-intuitive results Nguyen et al. (2014), Szegedy et al. (2013).
The sub-symbolic neural approaches allow us to mimic human cognitive thought processes by extracting features at various levels of abstraction from direct observation, and thereby facilitate learning. But humans also learn from general high-level knowledge expressed declaratively in logical syntax. A representation language allows recursive structures to be easily represented and manipulated, which is usually difficult in a neural learning environment. However, a symbolic reasoning system adapts poorly to new environments through learning, and reasoning based on traditional theorem proving can be computationally expensive. Moreover, a purely symbolic system based on traditional AI requires enormous human effort, as knowledge is manually programmed rather than learned. Central to classical AI is the use of language-like propositional representations to encode knowledge. The symbolic elements of a representation in classical AI (the constants, functions, and predicates) are typically hand-crafted. Inductive Logic Programming Muggleton (1990) methods learn hypothesis rules given background knowledge and a set of positive and negative examples. The systems discussed so far do not model uncertainty, which is essential in practical applications. Various probabilistic logics Halpern (2005) and Markov Logic Networks (MLNs) Richardson & Domingos (2006) handle uncertainty using a weight attached to every rule. Practical applications of these networks have been limited because inference does not scale to a large number of rules. It is, therefore, desirable to develop a hybrid approach that embeds a declarative representation of knowledge, such as domain and commonsense knowledge, within a neural system. In this paper, the hybrid approach is applied to indoor scene classification, which has been extensively studied in the field of computer vision Chen et al. (2018).
However, compared with outdoor scene classification, this is an arduous task owing to the large variation in the density of objects within a typical scene. In addition, high-accuracy models already exist for outdoor scene classification, whereas they do not yet exist for indoor scene classification. With respect to our objective, the acquisition, representation, and utilization of visual commonsense knowledge represent a set of critical opportunities for advancing computer vision past the stage where it simply classifies or identifies which objects occur in imagery Davis et al. (2015). The contributions of this paper are summarized as follows:
• A joint representation multimodal fusion framework is applied to exploit the early fusion of vectorized logical knowledge and images for the task of indoor scene classification. Experiments show that higher classification accuracy is obtained compared to traditional image classification methods.
• An 'if-then' logical knowledge system is built from reviews of each indoor scene class, which are scraped from Google open source and embedded through Word2Vec and BERT. This helps to obtain a better contextual representation of the words produced by object detection.
• A unique rule embedding approach is proposed, which allows probabilistic 'if-then' logic to be combined with the image representation. The embedding approach uses different representations during training and inference.
The rest of the paper is organized as follows. Section 2 surveys the related work. The hybrid framework is explained in section 3. Section 4 details the implementation and evaluation of the experiments. Finally, we conclude with some future directions in section 5.
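As a rough illustration of the early-fusion idea named in the first contribution, the following minimal sketch concatenates a vector of image features with a vector derived from the 'if-then' rules that fire for the detected objects. It is a sketch under stated assumptions, not the implementation evaluated in this paper: the toy embeddings, the rule list, the averaging step, and the names `EMBEDDINGS`, `RULES`, `rule_vector` and `early_fusion` are all hypothetical, and the distinction between training-time and inference-time rule representations is not modelled.

```python
# Minimal early-fusion sketch. All names and values are hypothetical.

# Toy 3-d "word embeddings" standing in for Word2Vec/BERT vectors.
EMBEDDINGS = {
    "bed": [0.9, 0.1, 0.0],
    "pillow": [0.8, 0.2, 0.1],
    "stove": [0.0, 0.9, 0.2],
}
EMBED_DIM = 3

# 'If-then' rules: if <object> is detected, then the scene is <scene>.
RULES = [("bed", "bedroom"), ("pillow", "bedroom"), ("stove", "kitchen")]


def rule_vector(detected_objects):
    """Average the embeddings of the antecedent symbols of the rules that fire.

    Averaging is an assumption made for this sketch only.
    """
    fired = [obj for obj, _scene in RULES
             if obj in detected_objects and obj in EMBEDDINGS]
    if not fired:
        return [0.0] * EMBED_DIM  # no rule fires: neutral knowledge vector
    return [sum(EMBEDDINGS[o][i] for o in fired) / len(fired)
            for i in range(EMBED_DIM)]


def early_fusion(image_features, detected_objects):
    """Joint representation: image features concatenated with the rule vector."""
    return list(image_features) + rule_vector(detected_objects)


cnn_features = [0.5, -1.2, 0.3, 0.7]  # placeholder for CNN activations
fused = early_fusion(cnn_features, {"bed", "pillow"})
print(len(fused))  # 7 = 4 image dimensions + 3 knowledge dimensions
```

In such a setup the fused vector, rather than the image features alone, would be passed to the classification layers, so that the knowledge vector can influence the predicted scene class.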



CNN models have also achieved excellent results in search query retrieval Shen et al. (2014), sentence modelling Kalchbrenner et al. (2014), and other traditional NLP tasks Collobert et al. (2011).

Hybrid neural-symbolic systems Chen et al. (2016), Garcez et al. (2009), Hammer & Hitzler (2007), Rosenbloom et al. (2017), Sun (1994), Wermter & Sun (2001) concern the use of problem-specific symbolic knowledge within the neurocomputing paradigm; in our case, symbolic domain and commonsense knowledge within the deep learning paradigm. They are useful for enhancing various tasks, including logical inferencing, extracting relational knowledge Guillame-Bert et al. (2010), Gust et al. (2007), image classification, and action selection. The combination of logic rules and neural networks has been considered as a way to construct network architectures from given rules in order to perform reasoning and knowledge acquisition. Neural-symbolic systems such as EBL-ANN Shavlik & Towell (1989), KBANN Szegedy et al. (2013), C-ILP Garcez et al. (2009), and LENSR Xie et al. (2019), like our proposal, deal with propositional formulae. KBANN, for example, maps problem-specific domain theories, represented in propositional logic, into neural networks and then refines this reformulated knowledge using back-propagation. There, propositional symbols are directly represented as nodes, whereas we vectorize each propositional symbol as its semantic representation and append it to the abstraction of low-level observations. Other neural-symbolic systems explore the knowledge graph Chen et al. (2020), Kampffmeyer et al. (2019), Li et al. (2019), Zablocki et al. (2019), which is a natural symbolic representation. It is not only a se-

