COMPLEX QUERY ANSWERING WITH NEURAL LINK PREDICTORS

Abstract

Neural link predictors are immensely useful for identifying missing edges in large-scale Knowledge Graphs. However, it is still not clear how to use these models for answering more complex queries that arise in a number of domains, such as queries using logical conjunctions (∧), disjunctions (∨), and existential quantifiers (∃), while accounting for missing edges. In this work, we propose a framework for efficiently answering complex queries on incomplete Knowledge Graphs. We translate each query into an end-to-end differentiable objective, where the truth value of each atom is computed by a pre-trained neural link predictor. We then analyse two solutions to the optimisation problem, including gradient-based and combinatorial search. In our experiments, the proposed approach produces more accurate results than state-of-the-art methods (black-box neural models trained on millions of generated queries), without the need to train on a large and diverse set of complex queries. Using orders of magnitude less training data, we obtain relative improvements ranging from 8% up to 40% in Hits@3 across different knowledge graphs containing factual information. Finally, we demonstrate that it is possible to explain the outcome of our model in terms of the intermediate solutions identified for each of the complex query atoms. All our source code and datasets are available online.¹

1. INTRODUCTION

Knowledge Graphs (KGs) are graph-structured knowledge bases, where knowledge about the world is stored in the form of relationships between entities. KGs are an extremely flexible and versatile knowledge representation formalism: examples include general-purpose knowledge bases such as DBpedia (Auer et al., 2007) and YAGO (Suchanek et al., 2007), domain-specific ones such as Bio2RDF (Dumontier et al., 2014) and Hetionet (Himmelstein et al., 2017) for life sciences and WordNet (Miller, 1992) for linguistics, and application-driven graphs such as the Google Knowledge Graph, Microsoft's Bing Knowledge Graph, and Facebook's Social Graph (Noy et al., 2019). Neural link predictors (Nickel et al., 2016) tackle the problem of identifying missing edges in large KGs. However, in many complex domains, an open challenge is developing techniques for answering complex queries involving multiple and potentially unobserved edges, entities, and variables, rather than just single edges. We focus on First-Order Logical Queries that use conjunctions (∧), disjunctions (∨), and existential quantifiers (∃). A multitude of queries can be expressed using such operators. For instance, the query "Which drugs D interact with proteins associated with diseases t₁ or t₂?" can be rewritten as ?D : ∃P . interacts(D, P) ∧ [assoc(P, t₁) ∨ assoc(P, t₂)], which can be answered via sub-graph matching. However, plain sub-graph matching cannot capture semantic similarities between entities and relations, and cannot deal with missing facts in the KG. One possible solution consists of computing all missing entries via KG completion methods (Getoor & Taskar, 2007; De Raedt, 2008; Nickel et al., 2016), but that would materialise a significantly denser KG and would have intractable space and time complexity requirements (Krompaß et al., 2014).
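The limitation of plain sub-graph matching can be illustrated on a toy, hypothetical KG: when a true edge is simply absent from the graph, symbolic matching silently drops the corresponding answer. The entity and relation names below are illustrative, not taken from any real dataset.

```python
# Toy illustration (hypothetical data): answering the query
#   ?D : ∃P . interacts(D, P) ∧ [assoc(P, t1) ∨ assoc(P, t2)]
# by plain sub-graph matching over an explicit set of triples.

kg = {
    ("d1", "interacts", "p1"),
    ("d2", "interacts", "p2"),
    ("p1", "assoc", "t1"),
    # ("p2", "assoc", "t2") is missing from the KG, even if true in
    # the real world: sub-graph matching cannot recover d2 as an answer.
}

def answers(kg):
    # Proteins associated with either target disease.
    proteins = {s for (s, r, o) in kg if r == "assoc" and o in {"t1", "t2"}}
    # Drugs interacting with any such protein.
    return {s for (s, r, o) in kg if r == "interacts" and o in proteins}

print(answers(kg))  # → {'d1'}; d2 is lost because of the missing edge
```

This is precisely the failure mode that motivates scoring atoms with a neural link predictor instead of checking edge membership exactly.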
In this work, we propose a framework for answering First-Order Logic Queries, where the query is compiled into an end-to-end differentiable function modelling the interactions between its atoms. The truth value of each atom is computed by a neural link predictor (Nickel et al., 2016), a differentiable model that, given an atomic query, returns the likelihood that the fact it represents holds true. We then propose two approaches for identifying the most likely values for the variable nodes in a query, either by continuous or by combinatorial optimisation. Recent work on embedding logical queries on KGs (Hamilton et al., 2018; Daza & Cochez, 2020; Ren et al., 2020) has suggested that, in order to go beyond link prediction, more elaborate architectures and a large, diverse dataset with millions of queries are required. In this work, we show that this is not the case, and demonstrate that it is possible to use an efficient neural link predictor trained for 1-hop query answering to generalise to up to 8 complex query structures. By doing so, we produce more accurate results than state-of-the-art models, while using orders of magnitude less training data. Summarising, in comparison with other approaches in the literature such as Query2Box (Ren et al., 2020), we find that the proposed framework i) achieves significantly better or equivalent predictive accuracy on a wide range of complex queries, ii) is capable of out-of-distribution generalisation, since it is trained on simple queries only and evaluated on complex queries, and iii) is more explainable, since the intermediate results for its sub-queries and variable assignments can be used to explain any given answer.
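The overall recipe can be sketched in a few lines. The following is a minimal sketch, not the authors' implementation: a ComplEx-style scorer with random embeddings stands in for a pre-trained link predictor, conjunction is aggregated with the product t-norm and disjunction with its dual t-conorm, and the combinatorial variant is realised as exhaustive search over candidate assignments. All entity, relation, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
entities = ["d1", "d2", "p1", "p2", "t1", "t2"]
relations = ["interacts", "assoc"]
dim = 8
# Each embedding concatenates real and imaginary parts (stand-in for a
# pre-trained model; here they are random, so scores are arbitrary).
E = {e: rng.normal(size=2 * dim) for e in entities}
R = {r: rng.normal(size=2 * dim) for r in relations}

def score(s, p, o):
    """ComplEx score Re(<e_s, w_p, conj(e_o)>), squashed to [0, 1]."""
    sr, si = E[s][:dim], E[s][dim:]
    pr, pi = R[p][:dim], R[p][dim:]
    o_r, o_i = E[o][:dim], E[o][dim:]
    raw = np.sum(sr*pr*o_r + si*pr*o_i + sr*pi*o_i - si*pi*o_r)
    return 1.0 / (1.0 + np.exp(-raw))

def t_norm(a, b):   return a * b             # ∧ (Goedel alternative: min)
def t_conorm(a, b): return a + b - a * b     # ∨

# Combinatorial optimisation: exhaustive search over assignments to D, P in
#   ?D : ∃P . interacts(D, P) ∧ [assoc(P, t1) ∨ assoc(P, t2)]
best = {}
for d in ["d1", "d2"]:
    best[d] = max(
        t_norm(score(d, "interacts", p),
               t_conorm(score(p, "assoc", "t1"), score(p, "assoc", "t2")))
        for p in ["p1", "p2"])
ranking = sorted(best, key=best.get, reverse=True)
```

The per-atom scores kept for the maximising assignment of P are exactly the intermediate results that make the final ranking explainable; the continuous variant instead relaxes variables to embeddings and optimises the same objective by gradient descent.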

2. EXISTENTIAL POSITIVE FIRST-ORDER LOGICAL QUERIES

A Knowledge Graph G ⊆ E × R × E can be defined as a set of subject-predicate-object ⟨s, p, o⟩ triples, where each triple encodes a relationship of type p ∈ R between the subject s ∈ E and the object o ∈ E, and where E and R denote the sets of all entities and relation types, respectively. One can think of a Knowledge Graph as a labelled multi-graph, where the entities E represent nodes, and edges are labelled with relation types from R. Without loss of generality, a Knowledge Graph can be represented as a First-Order Logic Knowledge Base, where each triple ⟨s, p, o⟩ denotes an atomic formula p(s, o), with p ∈ R a binary predicate and s, o ∈ E its arguments. Conjunctive queries are a sub-class of First-Order Logical queries that use only the existential quantification (∃) and conjunction (∧) operations. We consider conjunctive queries Q of the following form:

Q[A] ≜ ?A : ∃V₁, …, Vₘ . e₁ ∧ … ∧ eₙ,
    where eᵢ = p(c, V), with V ∈ {A, V₁, …, Vₘ}, c ∈ E, p ∈ R,
    or eᵢ = p(V, V′), with V, V′ ∈ {A, V₁, …, Vₘ}, V ≠ V′, p ∈ R.    (1)

In Eq. (1), the variable A is the target of the query, V₁, …, Vₘ denote the bound variable nodes, and c ∈ E represents an input anchor node. Each eᵢ denotes a logical atom with either one variable (p(c, V)) or two variables (p(V, V′)), and e₁ ∧ … ∧ eₙ denotes a conjunction of n atoms.
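The query class of Eq. (1) admits a very small concrete encoding. The sketch below is illustrative, not the authors' code; an atom p(V, c) with a constant in object position can always be rewritten into the form p⁻(c, V) of Eq. (1) using the inverse relation p⁻, so representing anchors in subject position loses no generality.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    """One atom e_i of Eq. (1): p(c, V) with anchor c, or p(V, V')."""
    predicate: str
    subject: str   # an anchor entity c, or a variable name
    obj: str       # always a variable in {A, V1, ..., Vm}

@dataclass(frozen=True)
class ConjunctiveQuery:
    target: str        # the answer variable A
    variables: tuple   # existentially quantified variables V1, ..., Vm
    atoms: tuple       # the conjunction e1 ∧ … ∧ en

# "Which drugs D interact with a protein P associated with disease t1?"
# assoc(P, t1) is encoded via the inverse relation as assoc_inv(t1, P).
q = ConjunctiveQuery(
    target="D",
    variables=("P",),
    atoms=(Atom("interacts", "D", "P"), Atom("assoc_inv", "t1", "P")),
)
```

A query in this form is exactly the input the differentiable objective of the previous section is compiled from: one link-predictor call per `Atom`, combined by the t-norm.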



¹At https://github.com/uclnlp/cqd
†Equal contribution, alphabetical order.



Figure 1: Examples of First-Order Logical Queries using existential quantification (∃), conjunction (∧), and disjunction (∨) operators; their dependency graphs are D ← P ← {t₁, t₂} and D ← A ← {Oscar, Emmy}, respectively.

