LOGICDP: CREATING LABELS FOR GRAPH DATA VIA INDUCTIVE LOGIC PROGRAMMING

Abstract

Graph data, such as scene graphs and knowledge graphs, see wide use in AI systems. In real-world, large-scale applications, graph data are usually incomplete, motivating graph reasoning models that infer missing facts or relationships. While these models can achieve state-of-the-art performance, they require a large amount of training data. Recent years have witnessed rising interest in label creation with data programming (DP) methods, which aim to generate training labels from heuristic labeling functions. However, existing methods typically focus on unstructured data and are not optimized for graphs. In this work, we propose LOGICDP, a data programming framework for graph data. Unlike existing DP methods, (1) LOGICDP utilizes the inductive logic programming (ILP) technique and automatically discovers the labeling functions from the graph data; (2) LOGICDP employs a budget-aware framework to iteratively refine the functions by querying an oracle, which significantly reduces the human effort required to create labeling functions. Experiments show that LOGICDP achieves better data efficiency in both scene graph and knowledge graph reasoning tasks.

1. INTRODUCTION

Graph data are widely used in many applications as structured representations of complex information. In the visual domain, a scene graph can represent the semantic information of an image. As shown in Figure 1a, each node in the graph corresponds to a localized entity (e.g., x1, x2, and x3) and each edge represents a semantic relation between a pair of entities (e.g., ⟨x3, Has, x1⟩). Similarly, knowledge graphs (KGs) represent real-world facts with entities and the relations that connect them. Standard KGs such as Freebase (Toutanova & Chen, 2015) and WordNet (Bordes et al., 2013) consist of facts that describe commonsense knowledge and have played important roles in many applications (Yang et al., 2017; Sun et al., 2019; Mitchell et al., 2018; Yang et al., 2022). Graph datasets are usually incomplete, and new facts can often be inferred from existing ones. For example, in Figure 1a, the class label of x3 is missing, and one can infer that it is a Car because x3 Has a Wheel and a Window. This process is referred to as graph reasoning, and a large body of research addresses it. For scene graph reasoning, methods such as iterative message passing (Xu et al., 2017), LSTM-based models (Zellers et al., 2018), and GNNs (Yang et al., 2018) have been proposed; for KG reasoning, a variety of graph embedding methods (Bordes et al., 2013; Sun et al., 2019) have been developed. These graph reasoning methods rely on standard supervised training, where the model is fed a fixed set of data curated beforehand. Such training achieves state-of-the-art performance when a large amount of data is available. However, this approach has been shown to be suboptimal with respect to data efficiency (Misra et al., 2018; Shen et al., 2019), and manually creating a sufficiently large training dataset can be expensive.
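To make the inference step concrete, the Car example above can be sketched as a hand-written rule applied over head-predicate-tail triples. The triple representation and function names below are purely illustrative, not the paper's implementation:

```python
# Facts from the scene graph of Figure 1a as (head, predicate, tail) triples.
facts = {
    ("x3", "Has", "x1"),
    ("x3", "Has", "x2"),
    ("x1", "IsA", "Wheel"),
    ("x2", "IsA", "Window"),
}

def has_part(graph, entity, part_class):
    """True if `entity` Has some part whose class is `part_class`."""
    tails = {t for (_, _, t) in graph}
    return any(
        (entity, "Has", p) in graph and (p, "IsA", part_class) in graph
        for p in tails
    )

def infer_car(graph, entity):
    """Rule: Has(x, Wheel) AND Has(x, Window) => IsA(x, Car)."""
    if has_part(graph, entity, "Wheel") and has_part(graph, entity, "Window"):
        return (entity, "IsA", "Car")
    return None

print(infer_car(facts, "x3"))  # ('x3', 'IsA', 'Car')
```

ILP methods automate exactly this step: instead of a human writing the body of `infer_car`, the rule is induced from a small set of labeled triples.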
In this work, we approach this problem by adopting the data programming (DP) paradigm (Ratner et al., 2016), which aims at creating a large, high-quality labeled dataset from a set of labeling functions that generate noisy, incomplete, and conflicting labels. To this end, we propose LOGICDP, a DP framework that creates training data for graph reasoning models. Compared to existing domain-general DP frameworks (Ratner et al., 2016; 2017), LOGICDP utilizes the inductive logic programming technique and can automatically generate labeling functions in the form of first-order logic rules from a small labeled set; LOGICDP also employs a budget-aware framework that refines the functions through weak supervision from an oracle. The LOGICDP framework is flexible and agnostic to the choice of the graph reasoning model and the ILP method. In experiments, we evaluate LOGICDP on scene graph and KG datasets alongside several strong generic weakly-supervised methods; the results suggest that LOGICDP scales well to large graph datasets and generalizes better than the baselines. We also showcase training with human oracles and discuss its potential as a novel human-in-the-loop learning paradigm.
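The budget-aware refinement idea can be sketched at a high level as follows. Every name and the control flow here are hypothetical simplifications: the actual framework plugs in an ILP learner and a graph reasoning model where this sketch uses plain Python callables, and a real oracle would be a human annotator rather than a function:

```python
def refine_with_budget(rules, unlabeled_triples, oracle, budget):
    """Verify rule-generated labels with an oracle while budget remains,
    discarding rules whose labels the oracle rejects. Illustrative only."""
    rules = list(rules)  # work on a copy; survivors are returned
    labels = []
    for rule in list(rules):
        for triple in unlabeled_triples:
            label = rule(triple)
            if label is None:  # rule abstains on this triple
                continue
            if budget > 0:
                budget -= 1
                if oracle(label):  # weak supervision: accept or reject
                    labels.append(label)
                else:
                    rules.remove(rule)  # prune a rule the oracle refutes
                    break
            else:
                labels.append(label)  # budget exhausted: trust the rule
    return labels, rules
```

The point of the budget is that each oracle query is expensive, so queries are spent on deciding which labeling functions to keep rather than on labeling individual triples directly.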

2. RELATED WORK

Data Programming. LOGICDP is related to recent advances in data-centric AI and data programming (Ratner et al., 2016; 2017; Varma & Ré, 2018), which aim at a new paradigm for training and evaluating ML models with weak supervision. Existing frameworks such as Snorkel (Ratner et al., 2017) show great potential in this direction but are limited in applications where expert labeling functions are difficult or expensive to construct, and they do not utilize the rich semantic information in graphs to automate the process. In LOGICDP, we incorporate the ILP technique and automatically generate labeling functions from a small set of graph data.

Graph reasoning and inductive logic programming. Graph reasoning is the fundamental task performed on graph data and can be addressed by many approaches, for example graph embedding methods (Bordes et al., 2013; Sun et al., 2019), GNN-based methods (Yang et al., 2018; Zhang et al., 2020), and various deep models (Xu et al., 2017; Zellers et al., 2018). While these data-driven models achieve state-of-the-art performance, they require many training samples. On the other hand, graph reasoning can also be solved by finding a multi-hop path in the graph that predicts the missing facts (Guu et al., 2015; Lao & Cohen, 2010; Lin et al., 2015; Gardner & Mitchell, 2015; Das et al., 2016). Specifically, inductive logic programming (ILP) methods (Galárraga et al., 2015; Evans & Grefenstette, 2018; Payani & Fekri, 2019; Campero et al., 2018; Yang & Song, 2020; Yang et al., 2017) learn to predict missing facts by searching for such paths and representing them as logic rules. Compared to the data-driven methods, ILP methods are more data-efficient and offer better interpretability. In this work, we investigate using logic rules learned by ILP methods as labeling functions that generate triples for training a data-driven graph reasoning model.
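As background on the DP setting, the sketch below shows how conflicting votes from several heuristic labeling functions over a triple might be resolved by simple majority vote. The labeling functions and names are illustrative; real DP frameworks such as Snorkel fit a generative label model that weights each function by its estimated accuracy rather than counting votes equally:

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

def majority_vote(labeling_functions, example):
    """Resolve conflicting labels by majority vote, ignoring abstentions."""
    votes = [lf(example) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

# Three toy heuristics over a (head, predicate, tail) triple.
lfs = [
    lambda t: "Car" if t[1] == "Has" and t[2] == "Wheel" else ABSTAIN,
    lambda t: "Car" if t[1] == "Has" and t[2] == "Window" else ABSTAIN,
    lambda t: "Building" if t[2] == "Window" else ABSTAIN,
]

print(majority_vote(lfs, ("x3", "Has", "Wheel")))  # Car
```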

3. PRELIMINARIES AND PROBLEM STATEMENT

We consider graphs, such as scene graphs and KGs, that consist of a set of facts in the form of head-predicate-tail triples. Here, we use the scene graph in Figure 1a as a running example. Formally, we

