GRAPHLOG: A BENCHMARK FOR MEASURING LOGICAL GENERALIZATION IN GRAPH NEURAL NETWORKS

Abstract

Relational inductive biases play a key role in building learning agents that can generalize and reason in a compositional manner. While relational learning algorithms such as graph neural networks (GNNs) show promise, we do not yet understand how effectively these models adapt to new tasks. In this work, we study the logical generalization capabilities of GNNs by designing a benchmark suite grounded in first-order logic. Our benchmark suite, GraphLog, requires that learning algorithms perform rule induction in different synthetic logics, represented as knowledge graphs. GraphLog consists of relation prediction tasks on 57 distinct, procedurally generated logical worlds. We use GraphLog to evaluate GNNs in three different setups: single-task supervised learning, multi-task learning (with pre-training), and continual learning. Unlike previous benchmarks, GraphLog lets us precisely control the logical relationship between different worlds by controlling the underlying first-order logic rules. We find that a model's ability to generalize and adapt correlates strongly with the diversity of the logical rules available during multi-task training. We also observe severe catastrophic forgetting in the continual learning scenario, where GraphLog provides a precise mechanism to control the distribution shift. Overall, our results highlight new challenges for the design of GNN models, opening up an exciting area of research on generalization over graph-structured data.

1. INTRODUCTION

Relational reasoning, or the ability to reason about the relationships between objects and entities, is considered one of the fundamental aspects of intelligence (Krawczyk et al., 2011; Halford et al., 2010), and is known to play a critical role in the cognitive growth of children (Son et al., 2011; Farrington-Flint et al., 2007; Richland et al., 2010). This ability to infer relations between objects, entities, and situations, and to compose relations into higher-order relations, is one of the reasons why humans quickly learn how to solve new tasks (Holyoak and Morrison, 2012; Alexander, 2016). The perceived importance of relational reasoning for generalization has fueled the development of several neural network architectures that incorporate relational inductive biases (Battaglia et al., 2016; Santoro et al., 2017; Battaglia et al., 2018). Graph neural networks (GNNs), in particular, have emerged as a dominant computational paradigm within this growing area (Scarselli et al., 2008; Hamilton et al., 2017a; Gilmer et al., 2017; Schlichtkrull et al., 2018; Du et al., 2019). However, despite the growing interest in GNNs and their promise of relational generalization, we currently lack an understanding of how effectively these models can adapt and generalize across distinct tasks. In this work, we study the task of logical generalization in the context of relational reasoning using GNNs. One example of such a reasoning task from everyday life involves a family graph, where the nodes are family members and the edges represent relationships (brother, father, etc.). The objective is to learn logical rules, which are compositions of relations, such as "the son of my son is my grandson". As new compositions of existing relations are discovered during the lifetime of a learner (e.g., "the son of my daughter is my grandson"), the learner should retain the old compositions even as it learns new ones, just as humans do.
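The family-graph example can be made concrete with a few lines of code. The sketch below uses illustrative names and a Horn-style rule encoding (not GraphLog's actual data format) and derives composed relations from base facts by one round of forward chaining:

```python
# A knowledge graph as a set of (head, relation, tail) facts.
facts = {
    ("Ann", "son", "Bob"),      # Bob is Ann's son
    ("Bob", "son", "Carl"),     # Carl is Bob's son
    ("Bob", "daughter", "Dia"), # Dia is Bob's daughter
}

# Composition rules of the form r1 ∧ r2 ⟹ r3, e.g.
# "the son of my son is my grandson".
rules = [
    (("son", "son"), "grandson"),
    (("son", "daughter"), "granddaughter"),
]

def apply_rules(facts, rules):
    """One round of forward chaining: derive new facts from rule compositions."""
    derived = set(facts)
    for (r1, r2), head_rel in rules:
        for (a, ra, b) in facts:
            for (b2, rb, c) in facts:
                if ra == r1 and rb == r2 and b == b2:
                    derived.add((a, head_rel, c))
    return derived

closed = apply_rules(facts, rules)
# ("Ann", "grandson", "Carl") is now derivable via son ∘ son,
# and ("Ann", "granddaughter", "Dia") via son ∘ daughter.
```

The relation prediction task in GraphLog is essentially the inverse problem: given the facts, the model must induce such composition rules and use them to predict held-out edges.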
This simplistic example extends to more complex and practical scenarios, such as identifying novel chemical compositions or building recommender systems, where agents must learn and retain compositions of existing relations. We study this form of generalization by analyzing the ability of GNNs to learn new relation compositions, leveraging first-order logic. In particular, we study how GNNs can induce logical rules and generalize by combining such rules in novel ways after training. We propose a benchmark suite, GraphLog, that is grounded in first-order logic. Figure 1 shows the setup of the benchmark. Given a set of logical rules, we create a diverse set of logical worlds, each defined by a different subset of the rules. For each world W_k, we sample multiple knowledge graphs g_k^i. The learning agent must learn to induce the logical rules needed to predict the missing facts in these knowledge graphs. Using our benchmark, we evaluate the generalization capabilities of GNNs in the supervised setting by inductively predicting unseen combinations of known rules within a specific logical world. We further analyze how various GNN architectures perform in the multi-task and continual learning scenarios, where they must learn over a set of logical worlds with different underlying logics. Our setup allows us to control the distribution shift by controlling the similarity between worlds, measured as the overlap in logical rules between them. Our analysis provides the following insights about the logical generalization capabilities of GNNs:
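The world-generation process described above can be sketched as follows. This is an illustrative reconstruction, not GraphLog's actual generator: `RELATIONS`, `sample_rule_pool`, `make_world`, and `sample_graph` are hypothetical names, and real GraphLog graphs chain rules to much greater depths. Because every world draws its rules from a shared pool, the overlap between two worlds' rule sets provides a direct handle on their similarity:

```python
import random

RELATIONS = [f"r{i}" for i in range(8)]  # shared relation vocabulary

def sample_rule_pool(n_rules, seed=0):
    """Rules of the form (r_a, r_b) -> r_c over the shared vocabulary."""
    rng = random.Random(seed)
    pool = set()
    while len(pool) < n_rules:
        a, b, c = rng.sample(RELATIONS, 3)
        pool.add(((a, b), c))
    return sorted(pool)

def make_world(pool, k, rules_per_world, seed=0):
    """A world W_k is an (overlapping) subset of the shared rule pool."""
    rng = random.Random(seed + k)
    return rng.sample(pool, rules_per_world)

def sample_graph(world, seed=0):
    """Sample a tiny graph whose target edge is derivable by one world rule."""
    rng = random.Random(seed)
    (r_a, r_b), r_head = rng.choice(world)
    # The path 0 --r_a--> 1 --r_b--> 2 entails the target fact (0, r_head, 2).
    edges = [(0, r_a, 1), (1, r_b, 2)]
    return edges, (0, r_head, 2)

pool = sample_rule_pool(20)
world = make_world(pool, k=0, rules_per_world=6)
graph, target = sample_graph(world)
```

Similarity between two worlds can then be measured as, e.g., the Jaccard overlap of their rule subsets, which is the knob this setup uses to control distribution shift.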


• Two architectural choices for GNNs strongly (and positively) impact generalization: (1) incorporating multi-relational edge features using attention, and (2) modularizing the GNN architecture to include a parametric representation function that learns representations for the relations from the knowledge graph structure.

• In the multi-task setting, training a model on a more diverse set of logical worlds improves generalization and adaptation performance.

• All evaluated models exhibit catastrophic forgetting. This indicates that the models are not learning transferable representations and compositions but are instead overfitting to the current task, highlighting the challenge of lifelong learning in the context of logical generalization with GNNs.
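The first finding can be illustrated with a minimal numpy sketch of relation-aware attention: messages are scored by a query-key product that conditions on a learned relation embedding, which itself would come from a separate representation function (stubbed here as an embedding table `R`). All parameter names and the additive node-relation combination are illustrative assumptions, not a specific model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # hidden dimension
n_nodes, n_rels = 5, 3

H = rng.normal(size=(n_nodes, d))        # node states
R = rng.normal(size=(n_rels, d))         # relation representations (learnable)
W_q = rng.normal(size=(d, d))            # query projection
W_k = rng.normal(size=(d, d))            # key projection

edges = [(0, 1, 1), (2, 1, 0), (3, 1, 2)]  # (src, dst, relation id)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attend(dst):
    """Aggregate messages into `dst`, scoring each edge by node + relation."""
    incoming = [(s, r) for (s, t, r) in edges if t == dst]
    if not incoming:
        return H[dst]                    # no messages: keep the old state
    q = H[dst] @ W_q
    keys = np.stack([(H[s] + R[r]) @ W_k for (s, r) in incoming])
    alpha = softmax(keys @ q / np.sqrt(d))   # attention over incoming edges
    msgs = np.stack([H[s] + R[r] for (s, r) in incoming])
    return alpha @ msgs

h1_new = attend(1)                       # updated state for node 1
```

The point of the sketch is that the attention weights depend on the relation embedding `R[r]`, so edges of different types contribute differently to the aggregated message, rather than being pooled uniformly.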

2. BACKGROUND AND RELATED WORK

GNNs. Several GNN architectures have been proposed to learn representations of graph inputs (Scarselli et al., 2008; Duvenaud et al., 2015; Defferrard et al., 2016; Kipf and Welling, 2016; Gilmer et al., 2017; Hamilton et al., 2017b; Schlichtkrull et al., 2018). Previous works have focused on evaluating GNNs in terms of their expressive power (Barceló et al., 2019; Morris et al., 2019; Xu et al., 2018), the usefulness of their features (Chen et al., 2019), and explaining their predictions (Ying et al., 2019). Complementing these works, we evaluate GNN models on the task of logical generalization.



Figure 1: GraphLog setup: A set of rules (grounded in first-order logic) is partitioned into overlapping subsets, each of which defines a unique world. Within each world W_k, several knowledge graphs g_k^i (governed by the rule set of W_k) are generated.

Table 1: Aggregate statistics of the worlds in GraphLog. Statistics for each individual world are given in the Appendix.

