HYBRID NEURO-SYMBOLIC REASONING BASED ON MULTIMODAL FUSION

Abstract

Deep neural models and symbolic Artificial Intelligence (AI) systems have contrasting advantages and disadvantages. Neural models can be trained from raw, incomplete, and noisy data to obtain feature abstractions at various levels, but their lack of interpretability is well known. Traditional rule-based symbolic reasoning, on the other hand, encodes domain knowledge directly, but its failure is often attributed to the knowledge-acquisition bottleneck. We propose a hybrid learning and reasoning system, based on a multimodal fusion approach, that brings together advantageous features from both paradigms. Specifically, we enhance convolutional neural networks (CNNs) with the structured information of 'if-then' symbolic logic rules, obtained via word embeddings corresponding to propositional symbols and terms. With dozens of intuitive rules relating the type of a scene to its typical constituent objects, we achieve a significant improvement over the base CNN classification. Our approach is extensible to first-order logical syntax for rules and to other deep learning models.
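The fusion described above can be sketched in miniature as a late combination of CNN scene scores with soft scores from 'if-then' rules over detected objects. The scene and object names, the rule set, and the mixing weight `alpha` below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch: fuse base-CNN scene probabilities with soft scores
# derived from 'if scene-type then typical-objects' rules. All names, rules,
# and the weight `alpha` are made-up toy values for illustration.

def fuse_with_rules(cnn_probs, detected_objects, rules, alpha=0.3):
    """Boost scene classes whose rules fire, then renormalize.

    cnn_probs        : dict scene -> probability from the base CNN
    detected_objects : set of object labels found in the image
    rules            : dict scene -> set of typical constituent objects
    """
    fused = {}
    for scene, p in cnn_probs.items():
        typical = rules.get(scene, set())
        # Soft rule score: fraction of the rule's objects actually detected.
        rule_score = len(typical & detected_objects) / len(typical) if typical else 0.0
        fused[scene] = (1 - alpha) * p + alpha * rule_score
    total = sum(fused.values())
    return {s: v / total for s, v in fused.items()}

cnn_probs = {"bedroom": 0.40, "kitchen": 0.35, "office": 0.25}
rules = {"bedroom": {"bed", "pillow"}, "kitchen": {"stove", "sink"}}
# Detecting a bed and a pillow fires the 'bedroom' rule and boosts that class.
posterior = fuse_with_rules(cnn_probs, {"bed", "pillow", "desk"}, rules)
```

In the paper's setting the rule side is grounded through word embeddings of propositional symbols rather than exact string matching; the sketch above replaces that step with set overlap to stay self-contained.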



However, the success of deep learning comes at a cost. The first and foremost is its reliance on large amounts of labeled data, which are often difficult to collect and entail a slow learning process. Second, deep models are brittle in the sense that a network that performs well on one task often performs very poorly on a new task, even if the new task is very similar to the one it was originally trained on. Third, they are strictly reactive, meaning that they do not use high-level processes such as planning, causal reasoning, or analogical reasoning. Fourth, they cannot exploit human expertise, which could often reduce the burden of acquiring training data that is expensive to collect. Purely data-driven learning can lead to uninterpretable and sometimes counter-intuitive results Nguyen et al. (2014); Szegedy et al. (2013). Sub-symbolic neural approaches allow us to mimic human cognitive thought processes by extracting features at various levels of abstraction from direct observation, and thereby facilitate learning. But humans also learn from general high-level knowledge expressed declaratively in logical syntax. A representation language allows recursive structures to be easily represented and manipulated, which is usually difficult in a neural learning environment. A symbolic reasoning system based on traditional theorem-proving, however, is ill-suited to adapting to new environments through learning.



Deep learning is being employed with increasing frequency in recent years LeCun et al. (2015); Schmidhuber (2014). Various deep learning models have achieved remarkable results in computer vision Krizhevsky et al. (2017), remote sensing Zhu et al. (2017), target classification in SAR images Chen et al. (2016), and speech recognition Graves et al. (2013); Hinton et al. (2012). In the domain of natural language processing (NLP), deep learning methods are used to learn word vector representations through neural language models Mikolov et al. (2013) and to perform composition over the learned word vectors for classification Collobert et al. (2011). Convolutional neural networks (CNNs), for example, utilize layers with convolving filters that are applied to local features. CNNs are widely used for image tasks and are currently state of the art for object recognition and detection. Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing Yih et al. (2015), search query retrieval Shen et al. (2014), sentence modelling Kalchbrenner et al. (2014), and other traditional NLP tasks Collobert et al. (2011).
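The core CNN operation, a filter convolved over local patches of its input, can be illustrated without any framework. The 4x4 "image" and the 2x2 vertical-edge filter below are toy values chosen only to show how a filter responds to a local feature:

```python
# Minimal illustration of a convolving filter applied to local features,
# as in a CNN layer. Pure Python; the image and kernel are toy values.

def conv2d_valid(image, kernel):
    """2D cross-correlation (the operation CNN layers use), 'valid' padding."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # Dot product of the kernel with the local patch at (i, j).
            acc = sum(image[i + di][j + dj] * kernel[di][dj]
                      for di in range(kh) for dj in range(kw))
            row.append(acc)
        out.append(row)
    return out

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
vertical_edge = [[1, -1],
                 [1, -1]]
feature_map = conv2d_valid(image, vertical_edge)
# The filter responds only at the column where intensity jumps from 0 to 1:
# every row of the feature map is [0, -2, 0].
```

Real CNN layers stack many such learned filters, interleave them with nonlinearities and pooling, and learn the kernel weights by backpropagation rather than fixing them by hand.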

