SELF-SUPERVISED CONTRASTIVE LEARNING

Abstract

Neuro-symbolic models of artificial intelligence (AI) have recently been developed to perform tasks involving abstract visual reasoning, which is a hallmark of human intelligence but remains challenging for deep neural network methods. However, most current neuro-symbolic models still rely on supervised learning and auxiliary annotations. This differs from human cognitive processes, which depend largely on the general cognitive abilities of entity and rule recognition rather than on learning how to solve specific tasks from examples. In this work, we propose a neuro-symbolic model based on self-supervised contrastive learning (NS-SSCL), with unique and invariant representations of entities and rules in the perception and reasoning modules, respectively, to solve Raven's Progressive Matrices (RPMs) and their variant, a typical type of visual reasoning task used to test human intelligence. The perception module parses each object into invariant representations of attributes. The reasoning module grounds the representations of object attributes to form latent rule representations, also through SSCL. Further, the relationships between the neural representations of object attributes and the symbols used for rule reasoning are coherently mapped. Finally, the scene generation engine aggregates all attribute and rule representation distributions to produce a probabilistic representation of the target. NS-SSCL obtains state-of-the-art performance among unsupervised models on the RAVEN and V-PROM benchmarks, and even outperforms most supervised models. The success of the proposed model suggests that constructing human-like general cognitive abilities may enable AI algorithms to solve complex tasks involving higher-level cognition, such as abstract reasoning, in a human-like manner.

1. INTRODUCTION

Abstract reasoning is essential for human intelligence. The capability of abstract reasoning in humans is domain-general and can be effectively estimated by a simple visual reasoning test, such as Raven's Progressive Matrices (RPMs) (Raven et al., 1938). The premise of RPMs is that the test does not rely on domain-specific knowledge or verbal ability, yet test performance is diagnostic of verbal, spatial and mathematical reasoning abilities (Carpenter et al., 1990; Snow et al., 1984). Hence, it is believed that solving RPMs by artificial intelligence (AI) might be a cornerstone on the way toward artificial general intelligence (AGI) (Bilker et al., 2012; Zhang et al., 2019a). Although deep neural networks trained by supervised learning have achieved great success in visual categorization, such a neural network architecture is not versatile enough for visual reasoning (Hoshen & Werman, 2017; Barrett et al., 2018). The core feature of visual reasoning tasks (e.g., RPMs) is that the rules governing the organization of entities are semantically defined by the spatiotemporal relations between entities, rather than being intrinsic to the entities per se (Lovett et al., 2007). The semantics of entities and the underlying rules are therefore only weakly connected. As a result, end-to-end deep-learning (DL) algorithms are inefficient at learning both properties concurrently. Although a number of DL model variants have been developed that achieve performance superior to that of humans (Zhang et al., 2019b; Zhuo & Kankanhalli, 2020; Mańdziuk & Zychowski, 2019; Zhuo & Kankanhalli, 2021), these models are monolithic, lacking the clear distinction between the processes of perception and reasoning found in humans (Marcus & Davis, 2020; Fodor & Pylyshyn, 1988). To mimic the human-like processes involved in solving abstract reasoning tasks, some neuro-symbolic (NS) methods have been recently proposed to combine the deep neural network for the perception module with a reasoning module

