SELF-SUPERVISED CONTRASTIVE LEARNING

Abstract

Neuro-symbolic models of artificial intelligence (AI) have recently been developed to perform tasks involving abstract visual reasoning, a hallmark of human intelligence that remains challenging for deep neural network methods. However, most current neuro-symbolic models rely on supervised learning and auxiliary annotations, unlike human cognitive processes, which depend largely on the general cognitive abilities of entity and rule recognition rather than on learning to solve specific tasks from examples. In this work, we propose a neuro-symbolic model trained by self-supervised contrastive learning (NS-SSCL), with unique and invariant representations of entities and rules in the perception and reasoning modules, respectively, to solve Raven's Progressive Matrices (RPMs) and their variants, a typical type of visual reasoning task used to test human intelligence. The perception module parses each object into invariant representations of its attributes. The reasoning module grounds these attribute representations to form latent rule representations, also through SSCL. Further, the relationships between the neural representations of object attributes and the symbols used for rule reasoning are coherently mapped. Finally, the scene generation engine aggregates all attribute and rule representation distributions to produce a probabilistic representation of the target. NS-SSCL obtains state-of-the-art performance among unsupervised models on the RAVEN and V-PROM benchmarks, surpassing even most supervised models. The success of the proposed model suggests that constructing general cognitive abilities like those of humans may enable AI algorithms to solve complex tasks involving higher-level cognition, such as abstract reasoning, in a human-like manner.

1. INTRODUCTION

Abstract reasoning is essential for human intelligence. The capability for abstract reasoning in humans is domain-general and can be effectively estimated by simple visual reasoning tests such as Raven's Progressive Matrices (RPMs) (Raven et al., 1938). The premise of RPMs is that they do not rely on domain-specific knowledge or verbal ability, yet test performance is diagnostic of verbal, spatial, and mathematical reasoning abilities (Carpenter et al., 1990; Snow et al., 1984). Hence, it is believed that solving RPMs by artificial intelligence (AI) might be a cornerstone toward artificial general intelligence (AGI) (Bilker et al., 2012; Zhang et al., 2019a). Although deep neural networks trained by supervised learning have achieved great success in visual categorization, such architectures are not well suited to visual reasoning (Hoshen & Werman, 2017; Barrett et al., 2018). The core feature of visual reasoning tasks (e.g., RPMs) is that the rules governing the organization of entities are semantically defined by the spatiotemporal relations between entities rather than being intrinsic to the entities per se (Lovett et al., 2007). Thereby, the semantics of entities and the underlying rules are only weakly connected, which makes end-to-end deep-learning (DL) algorithms inefficient at learning both properties concurrently. Although a number of DL variants have achieved performance superior to humans (Zhang et al., 2019b; Zhuo & Kankanhalli, 2020; Mańdziuk & Zychowski, 2019; Zhuo & Kankanhalli, 2021), these models are monolithic and lack the clear distinction between perception and reasoning found in humans (Marcus & Davis, 2020; Fodor & Pylyshyn, 1988).
To mimic the human-like processes involved in solving abstract reasoning tasks, several neuro-symbolic (NS) methods have recently been proposed that combine a deep neural network for the perception module with a reasoning module for symbolic logic execution (Yi et al., 2019; Mao et al., 2019; Zhang et al., 2021). However, these models must still learn the connections between the contexts of instances and the supervised answers from scratch, just as DL models do. In striking contrast, humans who have never previously encountered such problems can quickly solve abstract reasoning tasks. Humans do not rely on task-specific experience in solving the tasks, but on their prior general cognitive abilities, namely object and rule recognition. Humans, even at a very early stage of life, can recognize a vast number of objects and their attributes (Spelke, 1990), and soon afterwards recognize the rules governing the world and apply these rules to new situations (Gopnik et al., 2004). For these reasons, RPM tests are capable of evaluating humans' general reasoning abilities (Marcus & Davis, 2020; Fodor & Pylyshyn, 1988). Although it remains an open question how humans learn and form object and rule representations, a critical feature of these general cognitive abilities underlying abstract reasoning is that the attribute and rule representations are unique and invariant in the brain (Li & DiCarlo, 2008; Mansouri et al., 2020). For instance, we recognize the same color 'green' on different objects and recognize the latent rule governing the color relationship across a set of objects. Thus far, it remains challenging to design an AI algorithm that behaves like humans in these tasks. To better investigate abstract visual reasoning, RAVEN (Appendix A) and other RPM-like datasets have been proposed. In the RAVEN dataset, each RPM problem consists of 9 panels in the form of a 3×3 matrix, with 8 context panels and a missing panel at the 9th entry (Matzen et al., 2010; Zhang et al., 2019a).
The goal of the task is to select, from 8 candidate panels, the correct answer that completes the matrix while satisfying the latent rules governing the organization of object attributes across the three consecutive panels within each row; these rules are consistent across the three rows of the matrix (Figure 1). Overall, the task requires two independent cognitive abilities: visual perception and rule reasoning. If the visual perception module can perfectly recognize all object attributes, then identifying the latent rules becomes straightforward and reduces to an exhaustive search over the rule space (Matzen et al., 2010; Zhang et al., 2019a). Unlike most visual perception tasks, both the object attributes and the rules must be identified. This is difficult for deep neural networks and, so far, remains challenging for NS models as well. Additionally, the V-PROM task (Appendix F) is also an RPM-like task but uses natural images (Teney et al., 2020), which increases the difficulty of visual perception. Inspired by the above-mentioned human cognitive processes in abstract reasoning, we here demonstrate an NS model that solves abstract visual reasoning in a human-like manner. To build a non-verbal visual reasoning ability for an AI model on the basis of the general cognitive abilities of object and rule recognition, we move a step further toward an NS model without supervision from answers or annotations, instead using a self-supervised contrastive learning (SSCL) method to establish both the object and rule recognition abilities; we denote this model NS-SSCL. The motivation of this unsupervised approach is to make the representations of the same attributes and the same rules as close as possible across different objects and problems, respectively.
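The contrastive objective behind SSCL can be illustrated with a minimal InfoNCE-style loss: an anchor embedding is pulled toward a "positive" embedding of the same attribute (or rule) observed in a different object (or problem), and pushed away from "negative" embeddings of other attributes. The sketch below, in plain NumPy, is illustrative only; the function name `info_nce` and the scalar temperature are our assumptions, not components of the proposed model.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor embedding.

    anchor, positive: 1-D embedding vectors; negatives: 2-D array, one
    embedding per row. Lower loss means the anchor is closer (in cosine
    similarity) to the positive than to any negative.
    NOTE: names and temperature value are illustrative assumptions.
    """
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similarity logits: positive at index 0, negatives after it.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature

    # Cross-entropy against index 0, via a numerically stable log-sum-exp.
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    return log_z - logits[0]
```

A loss near zero indicates the anchor already matches its positive far better than the negatives; a well-separated embedding of the same attribute value across different objects therefore minimizes this objective.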
Critically, the mapping between the neural embeddings of object attributes and the symbols used for rule logic execution can be established by virtue of their stable representations, even though the representations learned by SSCL are not necessarily well aligned with the ground truths of the tasks, owing to the unsupervised nature of the training. Notably, the current model relies on probability codes over the discrete symbols of object attribute values (Figure 1), similar to those used in the probabilistic abduction and execution (PrAE) model (Zhang et al., 2021). However, the PrAE model uses supervision from the correct and incorrect answers among the candidate panels and also relies on the ground-truth rules of each instance as auxiliary annotations. In this work, without such ample supervision, NS-SSCL obtains state-of-the-art accuracy among unsupervised models, using only the context of each instance without the candidates, in solving RPM-like tasks, and even outperforms most previous supervised models.
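To illustrate how probability codes over discrete attribute values support exhaustive rule search and probabilistic execution, the following sketch scores a tiny hand-written rule space (only "constant" and "progression" on a single hypothetical 3-valued attribute) against per-panel attribute distributions, then derives a probabilistic prediction for the missing panel. All names and the two-rule space are illustrative assumptions; the actual rule space and attributes follow the RAVEN/PrAE setup.

```python
import numpy as np

N = 3  # hypothetical discrete attribute with 3 values

def rule_likelihood(row, rule):
    """Probability that a row of per-panel attribute distributions obeys `rule`.

    `row` is three probability vectors (one per panel); `rule` maps a triple
    of attribute indices to True/False. Marginalizes over all value triples.
    """
    total = 0.0
    for a in range(N):
        for b in range(N):
            for c in range(N):
                if rule(a, b, c):
                    total += row[0][a] * row[1][b] * row[2][c]
    return total

# A tiny hand-written rule space (illustrative; the real search is larger).
RULES = {
    "constant":    lambda a, b, c: a == b == c,
    "progression": lambda a, b, c: b - a == 1 and c - b == 1,
}

def best_rule(rows):
    """Exhaustively score each rule jointly over the complete rows."""
    scores = {name: float(np.prod([rule_likelihood(r, f) for r in rows]))
              for name, f in RULES.items()}
    return max(scores, key=scores.get), scores

def predict_target(partial_row, rule):
    """Distribution over the missing third panel's value under `rule`."""
    p = np.zeros(N)
    for a in range(N):
        for b in range(N):
            for c in range(N):
                if rule(a, b, c):
                    p[c] += partial_row[0][a] * partial_row[1][b]
    s = p.sum()
    return p / s if s > 0 else p
```

Applied per attribute, the predicted target distributions can then be aggregated into a probabilistic representation of the answer panel, mirroring the role of the scene generation engine described above.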

2. RELATED WORK

Object representations It is critical to correctly recognize the exact object attributes in visual reasoning tasks, as the latent rules governing the context are defined by these features. Although deep neural networks are flexible enough to fit any desired function constrained by the loss function, the neural embeddings of latent object representations are too flexible to align well with the semantics of

