EFFECTIVE ABSTRACT REASONING WITH DUAL-CONTRAST NETWORK

Abstract

As a step towards improving the abstract reasoning capability of machines, we aim to solve Raven's Progressive Matrices (RPM) with neural networks, since solving RPM puzzles is highly correlated with human intelligence. Unlike previous methods that use auxiliary annotations or assume hidden rules to produce appropriate feature representation, we only use the ground truth answer of each question for model learning, aiming for an intelligent agent to have a strong learning capability with a small amount of supervision. Based on the RPM problem formulation, the correct answer filled into the missing entry of the third row/column has to best satisfy the same rules shared between the first two rows/columns. Thus we design a simple yet effective Dual-Contrast Network (DCNet) to exploit the inherent structure of RPM puzzles. Specifically, a rule contrast module is designed to compare the latent rules between the filled row/column and the first two rows/columns; a choice contrast module is designed to increase the relative differences between candidate choices. Experimental results on the RAVEN and PGM datasets show that DCNet outperforms the state-of-the-art methods by a large margin of 5.77%. Further experiments on few training samples and model generalization also show the effectiveness of DCNet. Code is available at https://github.com/visiontao/dcnet.

1. INTRODUCTION

Abstract reasoning capability is a critical component of human intelligence; it relates to the ability to understand and interpret patterns and, in turn, to solve problems. Recently, as a step towards improving the abstract reasoning ability of machines, many methods (Santoro et al., 2018; Zhang et al., 2019a;b; Zheng et al., 2019; Zhuo & Kankanhalli, 2020) have been developed to solve Raven's Progressive Matrices (RPM) (Domino & Domino, 2006; Raven & Court, 1938), since it is widely believed that RPM lies at the heart of human intelligence. As the example in Figure 1 shows, given a 3 × 3 problem matrix with a final missing piece, the test taker has to find the logical rules shared between the first two rows or columns, and then pick the correct answer from 8 candidate choices to best complete the matrix. Since the logical rules hidden in RPM questions are complex and unknown, solving RPM with machines remains a challenging task. As described in (Carpenter et al., 1990), the logical rules applied in an RPM question are manifested as visual structures. For a single image in the question, the logical rules could consist of several basic attributes, e.g., shape, color, size, number, and position. For the images in a row or column, the logical rules could be applied row-wise or column-wise and formulated with an unknown relationship, e.g., AND, OR, XOR, and so on (Santoro et al., 2018; Zhang et al., 2019a). If we can extract the explicit rules of each question, the problem can be easily solved by using a heuristics-based search method (Zhang et al., 2019a). However, given an arbitrary RPM question, the logical rules are unknown. Worse, even the number of rules is unknown. As a result, an intelligent machine needs to simultaneously learn the representation of these hidden rules and find the correct answer that satisfies all of the applied rules. With the success of deep learning in computer vision, solving RPM puzzles with neural networks has become popular.
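To make the problem layout concrete, the following minimal sketch builds, for each candidate choice, the completed 3 × 3 matrix as rows and as columns. It is an illustration only and not part of the method: the name `fill_candidates` is hypothetical, panels are represented by placeholder strings rather than images, and the 8 context panels are assumed to be listed in row-major order.

```python
from typing import List, Sequence, Tuple

Panel = str  # placeholder; a real system would store image arrays here
Triple = Tuple[Panel, Panel, Panel]


def fill_candidates(
    context: Sequence[Panel], choices: Sequence[Panel]
) -> List[Tuple[List[Triple], List[Triple]]]:
    """For every candidate, return the completed grid as (rows, columns).

    `context` holds the 8 given panels in row-major order (an assumption of
    this sketch); the candidate is placed in the missing bottom-right entry.
    """
    if len(context) != 8:
        raise ValueError("an RPM question has 8 context panels")
    completed = []
    for choice in choices:
        grid = list(context) + [choice]
        # Rows are consecutive triples; columns take every third panel.
        rows = [tuple(grid[3 * r:3 * r + 3]) for r in range(3)]
        cols = [tuple(grid[c::3]) for c in range(3)]
        completed.append((rows, cols))
    return completed
```

Only the third row and the third column change from one candidate to the next, which is why the rules of the first two rows or columns can serve as a fixed reference.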
Because the learned features might be inconsistent with the logical rules, many supervised learning methods, e.g., DRT (Zhang et al., 2019a), WReN (Santoro et al., 2018), LEN (Zheng et al., 2019), MXGNet (Wang et al., 2020), and ACL (Kim et al., 2020), use not only the ground truth answer of each RPM question but also auxiliary annotations (such as logical rules with shape, size, color, number, AND, OR, XOR) to learn an appropriate feature representation. Although auxiliary annotations provide many priors about how the logical rules are applied, a noticeable performance improvement is not always obtained across different RPM problems, as the results reported in Table 1 show. Moreover, such a learning strategy requires additional supervision, and it fails to boost performance when auxiliary annotations are not available. For example, DRT (Zhang et al., 2019a) cannot be applied to the PGM dataset (Santoro et al., 2018) because it lacks structure annotations. To overcome the constraint of using auxiliary annotations, a recent method, CoPINet (Zhang et al., 2019b), uses only the ground truth answer of each question. To produce the feature representation of hidden rules, however, CoPINet assumes that there are at most N attributes in each problem, each of which is governed by M rules. Because N and M are unknown for arbitrary RPM problems, such an assumption is still too strong. In this work, we aim to learn the abstract reasoning model by using only the ground truth answer of each question, without making any assumption about the latent rules.
According to the RPM problem formulation (Carpenter et al., 1990), finding the correct answer to an RPM puzzle mainly depends on two contrasts of rules: (1) comparing the hidden rules of the filled row/column with those of the first two rows/columns to check whether they are the same; (2) comparing all candidate choices to check which one best satisfies the hidden rules. Unlike specific assumptions about latent rules, which are only valid for particular cases, these two contrasts are general properties of all RPM problems. Based on these two contrasts, we propose a simple yet effective Dual-Contrast Network (DCNet) to solve RPM problems. Specifically, a rule contrast module computes the difference between the filled row/column and the first two rows/columns, which reflects the difference between their latent rules. In addition, a choice contrast module increases the relative differences between all candidate choices, which helps find the correct answer when confusingly similar choices exist. Experiments on two major benchmark datasets demonstrate the effectiveness of our method. In summary, our main contributions are as follows:
• We propose a new abstract reasoning model on RPM that uses only ground truth answers, i.e., it relies on no assumptions or auxiliary annotations about the latent rules. Compared to previous methods, the problem setting of our method is more challenging, as we aim for an intelligent agent to learn a strong model with a small amount of supervision.
• We propose a simple yet effective Dual-Contrast Network (DCNet) that consists of a rule contrast module and a choice contrast module. By exploiting the inherent structure of each RPM through its basic problem formulation, a robust feature representation can be learned.
• Experimental results on the RAVEN (Zhang et al., 2019a) and PGM (Santoro et al., 2018) datasets show that DCNet improves the average accuracy by a large margin of 5.77%.
Moreover, from the perspective of few-shot learning, DCNet outperforms the state-of-the-art method CoPINet (Zhang et al., 2019b) by a noticeable margin when few training samples are provided, see Tables 3 and 4. Further experiments on model generalization also show the effectiveness of our method, see 
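The two contrasts named in the introduction can be illustrated with a toy sketch. This is not the DCNet architecture: it assumes each row is already summarized by a small "rule feature" vector, takes the mean of the first two rows' features as the shared rule, and scores a candidate by the negative squared size of the rule difference. All function names are hypothetical and chosen for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def rule_contrast(row1: Vector, row2: Vector, filled_row: Vector) -> List[float]:
    """Difference between the filled row's rule feature and the rule feature
    shared by the first two rows (assumed here to be their mean)."""
    shared = mean_vector([row1, row2])
    return [f - s for f, s in zip(filled_row, shared)]


def choice_contrast(scores: Sequence[float]) -> List[float]:
    """Express each candidate's score relative to the average candidate."""
    average = sum(scores) / len(scores)
    return [s - average for s in scores]


def dual_contrast_pick(
    row1: Vector, row2: Vector, filled_rows: Sequence[Vector]
) -> Tuple[int, List[float]]:
    """Return the index of the best candidate and the relative scores."""
    raw_scores = []
    for filled in filled_rows:
        diff = rule_contrast(row1, row2, filled)
        # Toy scoring rule: a smaller rule difference gives a higher score.
        raw_scores.append(-sum(d * d for d in diff))
    relative = choice_contrast(raw_scores)
    best = max(range(len(relative)), key=relative.__getitem__)
    return best, relative
```

In this toy version, subtracting the average score does not change which candidate ranks first; it only re-expresses every score relative to the other candidates, which is the intuition behind contrasting choices against each other.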



Fan et al., 2020). To aid in the diagnostic evaluation of VQA systems, Johnson et al. (2017) designed a CLEVR dataset by minimizing bias and providing rich ground-truth representations for both images and questions. It is expected that rich diagnostics could help better understand the visual reasoning capabilities of VQA systems. Recently, to understand the human actions in videos, Zhou et al. (2018) proposed a temporal relational reasoning network to learn and reason about temporal dependencies between video frames at multiple time scales. Besides, for explainable video action reasoning, Zhuo et al. (2019) proposed to explain performed actions by recognizing the semantic-level state changes from a spatio-temporal video graph with pre-defined rules. Different from these visual tasks, solving RPM puzzles depends on sophisticated logical rules

