DIFFERENTIABLE LOGIC PROGRAMMING FOR PROBABILISTIC REASONING

Abstract

This paper studies inductive logic programming for probabilistic reasoning. The key problems, i.e., learning rule structures and learning rule weights, have been extensively studied with traditional discrete search methods as well as recent neural-based approaches. In this paper, we present a new approach called Differentiable Logic Programming (DLP), which provides a flexible framework for learning first-order logical rules for reasoning. We propose a continuous relaxation of the rule-learning optimization problem as a proxy for finding high-quality rules, and generalize the rule learning and forward chaining algorithms in a differentiable manner, which enables us to efficiently learn rule structures and weights via gradient-based methods. Theoretical analysis and empirical results demonstrate the effectiveness of our approach.

1. INTRODUCTION

Learning to reason and predict is a fundamental problem in the field of machine learning. Representative efforts on this task include neural networks (NNs) and inductive logic programming (ILP). NN and ILP methods represent learning strategies at two extremes: NNs use fully differentiable real-valued parameters to perceive patterns in data, while ILP methods search for determinate, discrete structures that match those patterns. Over the years, the former, i.e., neural-based methods, have achieved state-of-the-art performance on tasks from many different fields, while the latter have fallen behind due to their inherent inferiority in fitting noisy and probabilistic data. However, connectionist models have been debated over problems of systematicity and explainability, as they are black-box models that are hard to interpret. To tackle this problem, numerous methods have been proposed to combine the advantages of both connectionist and symbolic systems. Most existing efforts follow one of two directions: using logic to enhance neural networks, or using neural networks to help logical reasoning. The former approaches (Rocktäschel & Riedel (2017), Minervini et al. (2020c), Vedantam et al. (2019), Dong et al. (2019)) modify the structures of NNs to capture features of logic. Some of them, known as neural theorem provers (Rocktäschel & Riedel (2017), Minervini et al. (2020a)), represent entities with embedding vectors that capture their semantics. Furthermore, they absorb symbolic logical structures into the neural reasoning framework to enhance the expressiveness of the models.
For example, to prove the existence of (grandfather, Q, Bart), where Q is the target entity we wish to find, these systems use logic rules such as grandfather ← father_of, parent_of to translate the original goal (grandfather, Q, Bart) into subgoals that can subsequently be proved by operating on entity embeddings. Thus, the expressiveness and interpretability of these systems is improved with the help of logic. The latter approaches (Yang et al. (2017), Xiong et al. (2017), Sadeghian et al. (2019), Qu et al. (2021)) enhance traditional inductive logic programming with the help of neural networks. Generally, they use different techniques to solve the key problem of ILP, which is to learn the structures of logical rules from an exponentially large space. Some of them (Yang et al. (2017), Sadeghian et al. (2019), Yang & Song (2020)) approximate the evaluation of all possible chain-like logic rules in a single model, making the learning of the model differentiable. However, as mentioned in Sadeghian et al. (2019), these models inevitably assign high confidence values to incorrect rules due to the low-rank approximation of evaluating exponentially many logic rules at the same time, which also makes it hard to identify high-quality logic rules and to explain the predictions made by these models. The other line of research (Yang & Song (2020), Qu et al. (2021)) proposes different methods, such as reinforcement learning and EM algorithms, to generate high-value rules. However, since structure learning of logical rules is a very hard problem, these methods are limited to searching chain-like Horn clauses, which are less expressive and general. In this paper, we propose a novel differentiable programming framework, called Differentiable Logic Programming (DLP), to build a bridge between the ideas of differentiable programming and symbolic reasoning.
Our approach enjoys the merits of connectionist systems, i.e., high expressiveness and ease of learning, as well as the merits of ILP systems, i.e., explainability and clear structures for decision making. We study the construction of a probabilistic reasoning model and discuss the properties of valuable logic rules. Based on that, we propose a novel rule learning framework that approximates the combinatorial search problem with a continuous relaxation, which enables us to learn the structures of logic rules via a differentiable program. Once valuable rules are learnt, we can further fine-tune the rule weights and perform probabilistic forward chaining to predict the existence of unobserved terms.

2. RELATED WORK

Our work is related to previous efforts in the field of Inductive Logic Programming (ILP) and its extensions. Representative ILP methods include FOIL (Quinlan (2004)), MDIE (Muggleton (2009)), AMIE (Galárraga et al. (2015)), Inspire (Schüller & Kazmi (2018)), RLvLR (Omran et al. (2018)), and so on. Generally, these methods search an exponentially large space to obtain valuable logic rules and make predictions based on them. However, despite well-designed search algorithms and pruning techniques, these methods suffer from the inherent limitations of relying on discrete counting and predefined confidence measures. More recently, different learning algorithms have been proposed to overcome the drawbacks of ordinary ILP methods. Many of them consider a special kind of ILP task, namely knowledge graph completion, where most of the proposed methods (Yang et al. (2017), Rocktäschel & Riedel (2017), Sadeghian et al. (2019), Minervini et al. (2020b), Yang & Song (2020), Qu et al. (2021)) focus on learning chain-like rules, using different learning strategies to learn valuable rules. Some of them are based on reinforcement learning (Xiong et al. (2017), Chen et al. (2018), Das et al. (2018), Lin et al. (2018), Shen et al. (2018)), training agents to find the right reasoning paths to answer queries in knowledge graphs. Qu et al. (2021) use recurrent neural networks as rule generators and train them with an EM algorithm. Yang et al. (2017), Sadeghian et al. (2019) and Yang & Song (2020) propose end-to-end differentiable methods, which can be trained efficiently with gradient-based optimizers. These methods are similar in spirit to our approach, as they claim to learn rule structures in a differentiable manner.
However, what they actually do is find a low-rank tensor approximation for the simultaneous execution of all possible rules in an exponential space with different confidence scores, and by doing so they run the risk of assigning high scores to wrong rules (Sadeghian et al. (2019)). Also, although Yang & Song (2020) claim that their attentions usually become highly concentrated after convergence, there is no theoretical guarantee, so extracting logic rules from these models can be problematic, because exponentially many potential rules may have confidence scores higher than zero. The parameters learnt by these models are dense vectors, so they suffer from the problem of explainability. Compared with them, our method is able to generate sparse solutions that explicitly learn logic rules for reasoning, with a more flexible rule search space, while keeping the rule learning procedure differentiable. There are other methods that focus on different types of ILP problems. Lu et al. (2022) treat the relation prediction task as a decision-making process and use reinforcement learning agents to select the right paths between heads and tails. Our approach is more general and is able to deal with different tasks. Rocktäschel & Riedel (2017) and Minervini et al. (2020b) propose a generalized version of backward chaining with the help of neural embedding methods, and show great performance on both relation prediction and knowledge graph completion tasks. Compared to them, our approach doesn't require the help of embeddings, so our predictions are more explainable. There are also interesting methods based on embeddings and neural networks (Bordes et al. (2013), Wang et al. (2014), Yang et al. (2015), Nickel et al. (2016), Trouillon et al. (2016), Cai & Wang (2018), Dettmers et al. (2018), Balazevic et al. (2019), Sun et al. (2019), Teru et al. (2020), Zhu et al. (2021)).
Since they are less relevant to logic reasoning, we do not cover them in detail here.

3.1. FIRST-ORDER LOGIC

This paper focuses on learning first-order logic (FOL) rules for reasoning. Throughout this paper, we assume predicates come from a countable universe P, where we use uppercase P, Q, ... ∈ P to represent predicates. We use a, b, c, ... ∈ V to represent constants and x, y, z to represent variables. An example of a FOL grammar, applied in the experiments of this paper, is:

φ(x) := P(x) | φ(x, x) | ∃y : φ(y) ∧ φ(x, y),
φ(x, y) := P(x, y) | φ(x) ∧ φ(y) ∧ φ(x, y) | ∃z : φ(x, z) ∧ φ(z) ∧ φ(z, y). (1)

Grammars are critical for ILP systems, because they not only define the syntax of FOL formulas, but also determine the expressive power and search space of the formulas. However, in this paper we will not restrict the specific formulation of the grammar. Instead, we use a common formulation to represent grammars, and our approach is equally applicable to any reasonable grammar:

φ(x) := P(x) | F_1(x) | F_2(x) | F_3(x) | ...

Also, Eq. 1 is frequently used to demonstrate the ideas in the following sections. We use x to represent a tuple of variables and v for a tuple of constants when it is clear from the context. F_i represents a possible form that φ could take. For example, in the definition of φ(x) in Eq. 1, we have x = x, F_1(x) := φ(x, x) and F_2(x) := ∃y : φ(y) ∧ φ(x, y).

Logic classifiers and logic rules. Logic formulas φ can be regarded as classifiers (Barceló et al. (2020)). For example, consider φ(x) := ∃y : Red(y) ∧ Edge(x, y) and Blue(x) ← φ(x). φ can be regarded as a classifier, where φ(x) = 1 for nodes x with red neighbors and 0 otherwise. Generally, logic classifiers take (sets of) entities (e.g., a node x) as input and compute the output (e.g., φ(x)) by grounding the logic formula on the background statements (e.g., {Edge(x, y), Red(y)}). Blue(x) ← φ(x) is a logic rule stating that the rule head Blue(x) can be concluded if the rule body φ(x) is satisfied.
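To make the grounding of a logic classifier concrete, the following is a minimal sketch (not the paper's implementation) of evaluating φ(x) := ∃y : Red(y) ∧ Edge(x, y) on a toy set of background atoms, represented as (predicate, arguments) tuples:

```python
# Toy background statements for the example classifier; the atoms are
# illustrative only.
background = {
    ("Edge", ("a", "b")), ("Edge", ("a", "c")), ("Edge", ("d", "c")),
    ("Red", ("b",)),
}
entities = {"a", "b", "c", "d"}

def holds(pred, *args):
    return (pred, args) in background

def phi(x):
    # Grounds EXISTS y: Red(y) AND Edge(x, y) by enumerating candidates y.
    return int(any(holds("Red", y) and holds("Edge", x, y) for y in entities))

# phi acts as a binary classifier over entities: it fires exactly for
# nodes with a red neighbor, so the rule Blue(x) <- phi(x) would
# conclude Blue(a) here, since Edge(a, b) and Red(b) both hold.
print({x: phi(x) for x in sorted(entities)})
```

In larger settings the existential quantifier is the expensive part, since it ranges over all constants; this is exactly the enumeration that the differentiable machinery later replaces with a sum over groundings.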
In the field of probabilistic reasoning, each logic rule can be assigned a weight indicating the degree of certainty of the rule.

Forward chaining and backward chaining. Forward chaining methods are critical in automated deduction, as they enable us to repeatedly deduce new lemmas from known theorems. Forward chaining starts from known conditions and logic rules, and moves forward towards a conclusion by applying the logic rules. The deduced conclusions are absorbed into the known conditions and the rules are applied again, until no further new conclusions can be deduced. Backward chaining methods are the opposite of forward chaining: they move backwards from the conclusions to the potential conditions implied by the rules.
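The deduce-and-absorb loop of classical forward chaining can be sketched as a fixpoint computation. The following toy (with made-up predicate names, restricted to chain-like Horn clauses Head(x, z) ← P(x, y) ∧ Q(y, z)) is only an illustration of the idea:

```python
# Classical forward chaining: apply rules to known facts until no new
# conclusion is derived (a fixpoint). Facts are (predicate, arg1, arg2).
facts = {("father_of", "abe", "homer"), ("father_of", "homer", "bart")}
rules = [("grandfather_of", "father_of", "father_of")]  # head <- body1, body2

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:                      # iterate until fixpoint
        changed = False
        for head, b1, b2 in rules:
            for (p, x, y) in list(facts):
                if p != b1:
                    continue
                for (q, y2, z) in list(facts):
                    if q == b2 and y2 == y and (head, x, z) not in facts:
                        facts.add((head, x, z))
                        changed = True
    return facts

derived = forward_chain(facts, rules)
# ("grandfather_of", "abe", "bart") is deduced from the two father_of facts.
```

Backward chaining would instead start from the goal grandfather_of(abe, ?) and decompose it into father_of subgoals.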

3.2. PROBLEM STATEMENT

This paper studies probabilistic inductive logic programming. The input data is a tuple (S_B, S_P, S_N), where S_B, S_P, S_N are sets of ground atoms of the form {P_1(v_1), P_2(v_2), ...}: S_B is a set of background assumptions, S_P is a set of positive instances, and S_N is a set of negative instances. The target is to construct a model such that, when applied to S_B, it produces the positive conclusions in S_P while rejecting the negative instances in S_N. This naturally leads to the following problems:

• Rule Mining. Our model is based on logic rules, and one key problem is finding useful rules that produce the results in S_P when grounded on S_B.

• Probabilistic Reasoning. It is often infeasible to directly perform forward chaining with logic rules when the input data is noisy. A rule-based inference model p(Q(x) ∈ S_P | S_B) is needed to take the uncertainty of logic rules into account.
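One possible concrete encoding of the input tuple of Sec. 3.2, with ground atoms as (predicate, arguments) tuples; all atoms here are made up for illustration:

```python
# The tuple (S_B, S_P, S_N): background assumptions, positive instances,
# negative instances. A learnt model p(Q(x) in S_P | S_B) should score
# atoms in S_P high and atoms in S_N low when grounded on S_B.
S_B = {("parent_of", ("alice", "bob")), ("parent_of", ("bob", "carol"))}
S_P = {("ancestor_of", ("alice", "carol"))}   # conclusions to produce
S_N = {("ancestor_of", ("carol", "alice"))}   # instances to reject

assert S_P.isdisjoint(S_N)
```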

4. MODEL

In this section, we introduce our proposed Differentiable Logic Programming (DLP) framework. The general idea is to use differentiable programs to solve the problems of rule mining and learning the prediction model. As mentioned in Sec. 3, the learning problems require us to identify important logic classifiers in a discrete space and assign feasible weights to them. In this paper, we introduce a differentiable module called the Logic Perceptron (LP) to deal with this problem. Given a grammar of logic classifiers φ, we provide a method to construct corresponding LPs ψ that are able to capture any φ of limited size. The LPs are stacked as a network to capture more complex logic classifiers while remaining end-to-end differentiable. We further propose a new optimization problem whose solutions are sparse, so that each of its local optima reveals the symbolic structure of a logic rule. Moreover, the learnt LPs are organized into a prediction model that generalizes forward chaining and can be learnt by maximizing the likelihood. Both of these optimization problems are continuous and differentiable w.r.t. the parameters, which allows them to be solved via gradient-based methods. Figure 1 presents a brief illustration of the general ideas behind our model. Next, we introduce the details of our approach.

4.1. OVERVIEW

In this section we first introduce the general ideas behind our approach, as well as highlighting the key challenges in our learning framework. The details of the model implementation are discussed in the following sections. We start with an example that illustrates the general ideas of probabilistic reasoning and provides the intuitions and motivations behind our approach.

Example 4.1 (Human Reasoning). Consider the query "Are a and b friends?". In this case, we have Q(x) = Friend(a, b). To answer the query, one may first ask "Do a and b know each other?", which corresponds to a logic classifier φ_1(a, b) = Know(a, b). If φ_1(a, b) = 1, our confidence in a and b being friends is increased; we say that φ_1 proves Q, and that Q is the target predicate of φ_1, denoted Q ← φ_1. With the answer of φ_1, one may continue to ask φ_2: "Do a and b live in the same town?", ..., where each evaluation of φ_i serves as evidence for the existence of Q. In the example, we use different logic classifiers to prove the query from different aspects, and all the classifiers used are highly relevant to the target predicate Friend. This makes sense because our belief in a and b being friends is higher when we realize they know each other, which corresponds to a relatively high value of p(Friend(x, y) | Know(x, y) = 1), but irrelevant facts such as "they both drink water" won't help. In fact, this simple example illustrates the overall ideas of our model, formally described as follows.

Proposition 4.2 (Properties of Φ). Given the input data (S_B, S_P, S_N), let Q be a d-ary target predicate, Φ_Q = {φ_0, φ_1, ..., φ_{L-1}} be a set of logic classifiers and p_0 ∈ (0, 1] a fixed threshold. Suppose the following statements are satisfied:

(1) We start from l = 0 and let S_P^(0) = S_P;
(2) For φ_l, its precision satisfies

p(Q(x) ∈ S_P^(l) | φ_l(x) = 1) = Σ_{x∈V^d} φ_l(x) 1[Q(x) ∈ S_P^(l)] / Σ_{x∈V^d} φ_l(x) 1[Q(x) ∈ S_P^(l) ∪ S_N] ≥ p_0;

(3) We let S_P^(l+1) = S_P^(l) \ {Q(x) | x ∈ V^d, φ_l(x) = 1}, increase l by 1 and go to (2) again, until l = L;
(4) S_P^(L) = ∅.

Then there exists a prediction model p(Q(x) ∈ S_P | S_B) = f(φ_0(x), φ_1(x), ..., φ_{L-1}(x)) whose error rate satisfies

Err[f; (S_B, S_P, S_N)] ≤ (1 − p_0) N_Q / (p_0 N) ≤ 1 − p_0,

where N_Q = Σ_{x∈V^d} 1[Q(x) ∈ S_P] and N = Σ_{x∈V^d} 1[Q(x) ∈ S_P ∪ S_N].

Prop. 4.2 suggests how we can learn such logic classifiers for proving Q. Generally, our rule learning procedure can be seen as a variant of boosting, where each rule is regarded as a weak classifier, but we focus on rule precision rather than misclassification rate. This is because (1) in most situations the input data is so sparse that one can obtain a small misclassification rate by simply always predicting false, and (2) a single logic rule is often only able to prove a relatively small portion of the positive instances, and a large number of logic rules are typically needed to make complete predictions of the target predicates. Thus, our learning procedure is formally described as follows.

Rule Learning Framework. Given the input data (S_B, S_P, S_N), suppose we are to learn L rules for each target predicate Q ∈ P. We first start with an empty rule set Φ_Q = ∅ for each Q, and assign each instance Q(v) in S_P and S_N a weight w_Q(v) = 1 (this corresponds to statement (1) in Prop. 4.2). Then, we perform the following steps recursively. We first find a logic classifier φ with high precision on the weighted data, i.e.,

max_φ p(Q(x) ∈ S_P | φ(x) = 1) = Σ_{x∈V^d} 1[Q(x) ∈ S_P] w_Q(x) φ(x) / Σ_{x∈V^d} 1[Q(x) ∈ S_P ∪ S_N] w_Q(x) φ(x). (4)

This generalizes statement (2) in Prop. 4.2. We add φ to Φ_Q. Then, we evaluate φ on S_B, and for each instance v where φ(v) = 1, if Q(v) ∈ S_P we let w_Q(v) ← τ_1 w_Q(v), and if Q(v) ∈ S_N we let w_Q(v) ← τ_2 w_Q(v), where τ_1, τ_2 ∈ [0, +∞) are fixed values. Note that this step generalizes statement (3) in Prop. 4.2: statement (3) is the special case where τ_1 = 0 and τ_2 = 1. This procedure is then repeated L times.

We have now introduced the overall learning framework of our approach, except for two problems: how we identify high-precision logic classifiers (Eq. 4), and how we learn the prediction model p(Q(x) ∈ S_P | S_B). In this paper, we propose a differentiable model to solve these problems. In the next sections we discuss the implementation of our model in detail. In Sec. 4.2, we introduce logic perceptrons (LPs) as the building blocks of our model, as well as how we stack them together to express more complex logic classifiers. In Sec. 4.3 we introduce how we use LPs to perform probabilistic reasoning. Then, we present our methods for learning the model in Sec. 4.4.
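The boosting-style loop of the rule learning framework can be sketched as follows. This is a simplification under stated assumptions: the differentiable search of Eq. 4 is replaced here by an exhaustive scan over a small candidate pool, and the data and candidate classifiers are toy constructions, not the paper's:

```python
# Sketch of the rule-learning loop of Sec. 4.1: pick the candidate with the
# highest weighted precision, then re-weight covered instances by tau1/tau2.
def weighted_precision(phi, pos, neg, w):
    tp = sum(w[a] for a in pos if phi(a))
    fp = sum(w[a] for a in neg if phi(a))
    return tp / (tp + fp) if tp + fp > 0 else 0.0

def learn_rules(candidates, pos, neg, L=2, tau1=0.1, tau2=1.0):
    w = {a: 1.0 for a in list(pos) + list(neg)}   # statement (1): unit weights
    chosen = []
    for _ in range(L):
        best = max(candidates,
                   key=lambda phi: weighted_precision(phi, pos, neg, w))
        chosen.append(best)
        for a in pos:                 # down-weight covered positives
            if best(a):
                w[a] *= tau1
        for a in neg:                 # optionally re-weight covered negatives
            if best(a):
                w[a] *= tau2
    return chosen

# Toy task: positives are even numbers; "rules" are predicates on integers.
pos, neg = [0, 2, 4, 6], [1, 3, 5]
cands = [lambda a: a % 2 == 0, lambda a: a > 3, lambda a: a < 2]
rules = learn_rules(cands, pos, neg)
```

With τ_1 = 0 and τ_2 = 1 this reduces exactly to statement (3) of Prop. 4.2 (covered positives are removed from the objective).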

4.2. LOGIC PERCEPTRONS

A Logic Perceptron (LP) is a differentiable model that generalizes ordinary logic classifiers into a continuous space. Consider the grammar of logic classifiers:

φ(x) := A(x) | F_1(x) | F_2(x) | F_3(x) | ... | F_K(x). (5)

We provide two methods for constructing LPs corresponding to grammar 5: tree-structured LPs (LP-tree) and layer-structured LPs (LP-layer). The main difference is that LP-tree strictly satisfies the constraints discussed in Sec. 4.4, while LP-layer is more compressed. As will be shown in Sec. 5, they produce similar results in the experiments.

LP-tree

We first define the correspondence between LPs ψ and logic classifiers φ as follows:

ψ(x; α) = [F_1(x; α), F_2(x; α), F_3(x; α), ..., F_K(x; α)] w^α, s.t. Σ_{i=1}^K w_i = 1, w_i ≥ 0 for i = 1, 2, ..., K, (6)

where w ∈ R^K is an attention vector and α ∈ R is a hyperparameter that helps to keep the model sparse. The functionalities of α are discussed in Sec. 4.4; in the experiments we often set α = 1. w^α is an element-wise exponent applied to w. The evaluation of each F_i(x; α) in Eq. 6 corresponds to each F_i(x) in Eq. 5, where we define:

F_i(x) := F_j(x) ∧ F_k(x) ⟺ F_i(x; α) = F_j(x; α) F_k(x; α),
F_i(x) := F_j(x) ∨ F_k(x) ⟺ F_i(x; α) = [F_j(x; α), F_k(x; α)] (w^(i))^α,
F_i(x) := ¬F_j(x) ⟺ F_i(x; α) = 1 − F_j(x; α),
F_i(x) := ∃y : F_j(x, y) ⟺ F_i(x; α) = Σ_y F_j(x, y; α),
F_i(x, y) := F_j(x) ⟺ F_i(x, y; α) = F_j(x; α),
F_i(x) := φ(x) ⟺ F_i(x; α) = [P_1(x), P_2(x), ..., P_|P|(x), ψ(x)] (w^(i))^α, (7)

where for each w^(i) we have Σ_j w^(i)_j = 1 and w^(i)_j ≥ 0, and ψ is a pointer to another LP. We exclude the universal quantifier ∀ because it can be equivalently expressed by ¬∃¬.

To construct more complex and expressive LPs, we can generate arbitrary numbers of LPs within a tree structure. Initially, we create ψ_0(x) as a root node. As shown in Eq. 7, the evaluation of ψ_0 requires its pointers ψ to be explicitly assigned, so we create a new LP for each pointer ψ of ψ_0 as a child of ψ_0. The same procedure is performed on the leaf nodes of the tree an arbitrary number of times, expanding the depth of the tree. Once we reach the desired depth, we simply assign empty nodes to the pointers ψ of the leaf nodes to terminate the construction procedure. These LPs compose an LP-tree, where ψ_0 serves as the output of the tree. Figure 1 illustrates this procedure.

LP-layer. Generally, layer-structured LPs are similar to tree-structured LPs, except that we now stack multiple LPs linearly as layers.
Suppose the LP-layer has a total of L LPs {ψ^(1), ψ^(2), ..., ψ^(L)}. The evaluation of each LP is exactly the same as in Eq. 6 and 7, except that for each LP ψ^(l) in the LP-layer we have

F_i(x) := φ(x) ⟺ F_i(x; α) = [P_1(x), P_2(x), ..., ψ^(1)(x; α), ..., ψ^(l-1)(x; α)] (w^(i))^α,

so that ψ^(l) is able to access the layers before it, i.e., P_1, P_2, ..., ψ^(1), ..., ψ^(l-1). The LPs are organized as layers as illustrated in Fig. 1, where ψ^(L) serves as the output of the layers. An advantage of LP-layer is that it is very simple to extend the model: we only need to stack more layers. The following proposition states the expressiveness of these two construction approaches.

Proposition 4.3 (Expressiveness of LP-tree and LP-layer). Given a grammar of logic classifiers of the form in Eq. 5, suppose a logic classifier φ is constructed by recursively applying the grammar N > 0 times. Then: (1) in the worst case, an LP-tree with O(K^N) LPs can express φ; (2) an LP-layer with N LPs can express φ.
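A numpy sketch of a single LP node may help fix ideas. It shows only the attention mixing of Eq. 6 and the soft connectives of Eq. 7 at one point x (the recursive tree/layer wiring and the evaluation of the F_i themselves are omitted); the numbers are illustrative:

```python
import numpy as np

def lp(F, w, alpha=1.0):
    """Eq. 6 at one x: F is (K,) sub-formula values, w is attention on the
    simplex, raised elementwise to the power alpha before mixing."""
    assert np.all(w >= 0) and abs(w.sum() - 1.0) < 1e-9
    return float(F @ (w ** alpha))

# Soft counterparts of the connectives used to build F_i values (Eq. 7):
conj = lambda a, b: a * b                  # F_j AND F_k  -> product
neg = lambda a: 1.0 - a                    # NOT F_j      -> complement
exists = lambda vals: float(np.sum(vals))  # EXISTS y     -> sum over groundings

# With a one-hot attention the LP collapses to exactly one sub-formula,
# which is the sense in which a trained LP "captures" a discrete classifier.
F = np.array([0.0, 1.0, 0.3])
w_onehot = np.array([0.0, 1.0, 0.0])
assert lp(F, w_onehot) == 1.0
```

Note that because ∃ becomes a sum, LP outputs live in [0, +∞) rather than {0, 1}; this is exactly the range mismatch discussed below Corollary 4.5.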

4.3. INFERENCE

We now discuss how we infer p(Q(x) ∈ S_P | S_B) for every x by generalizing forward chaining. In this section, we assume we have learnt a set Ψ_Q = {ψ_1, ψ_2, ...} for every Q ∈ P, and the goal is to infer the (unknown) positive / negative instances in S_P, S_N. For each predicate Q ∈ P and every x, we let Q^(0)(x) = 1 if Q(x) ∈ S_B and Q^(0)(x) = 0 otherwise. Then, at iteration t, we:

(1) Evaluate LPs: We evaluate every ψ ∈ Ψ_Q on Q^(t-1) at every x ∈ V^d. This procedure costs O(|V|^d R_Ψ (|P| + N)N), where N is the size of the network, d is the maximum arity of predicates and LPs, and R_Ψ is the number of rules, i.e., the total size of Ψ_Q over all Q.

(2) Update inferences: We update our inferences about Q(x) for each Q ∈ P and x. This procedure costs O(|V|^d R_Ψ):

Q̂(x) = sigmoid(Update({ψ_Q(x) | ψ_Q ∈ Ψ_Q})), Q^(t)(x) = max{Q̂(x), Q^(0)(x)},

where Update is the function that specifies how we update the predictions based on the groundings of ψ_Q ∈ Ψ_Q. The common implementation of the update function used in this paper is Update({ψ_Q(x) | ψ_Q ∈ Ψ_Q}) = Σ_{ψ_Q ∈ Ψ_Q} w_{ψ_Q} ψ_Q(x), but any differentiable update function (an MLP, etc.) is also applicable.

After T rounds of iteration, we directly take the values of Q^(T)(x) as an approximation of p(Q(x) ∈ S_P | S_B), while p(Q(x) ∈ S_N | S_B) = 1 − p(Q(x) ∈ S_P | S_B). Also, the time complexities given here are rather loose. In practice, the input data is often very sparse, and the |V|^d term in the time complexities can be significantly reduced. See the appendix for more discussion.
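The generalized forward chaining of this section can be sketched as follows, with the weighted-sum Update and the max against the observed facts Q^(0). The LP here is stubbed as a plain score function and all numbers are illustrative, so this is a shape of the loop, not the paper's implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def infer(q0, lps_with_weights, T=10):
    """q0: dict x -> 0/1 observed facts; lps_with_weights: list of (w, psi).
    Runs T rounds of evaluate-then-update, never dropping below q0."""
    q = dict(q0)
    for _ in range(T):
        for x in q:
            update = sum(w * psi(x, q) for w, psi in lps_with_weights)
            q[x] = max(sigmoid(update), q0[x])   # observed facts stay at 1
    return q

# One stub "LP": x is likely Q if its predecessor already scores high.
psi = lambda x, q: q.get(x - 1, 0.0)
scores = infer({0: 1, 1: 0, 2: 0}, [(4.0, psi)], T=3)
# Belief propagates along the chain 0 -> 1 -> 2 across the iterations.
```

The loop is differentiable w.r.t. the rule weights (4.0 here), which is what allows the likelihood training described in Sec. 4.4.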

4.4. LEARNING

In this section, we discuss how we solve the two problems stated in Sec. 3, i.e., mining logic rules and learning the probabilistic prediction model.

Learning Symbolic Structures via Continuous Optimization

We now discuss how we learn the structures of logic classifiers, i.e., how we solve the problem max_φ p(Q(x) ∈ S_P | φ(x) = 1).

Theorem 4.4 (Sparse attentions). Consider the optimization problem

min_ψ L(ψ) = −log E_{x∼p_1}[ψ(x; α)] + log E_{x∼p_2}[ψ(x; β)],

where ψ(x; α) and ψ(x; β) are obtained by one iteration of the inference procedure, and we assume E_{x∼p_1}[ψ(x; α)] > 0 and E_{x∼p_2}[ψ(x; β)] > 0. Suppose the following constraints are satisfied: (1) α > β > 1; (2) ψ is the root of an LP-tree; (3) negations are applied only on leaf nodes. Then, at each local minimum of L(ψ), the attention vectors w, w^(i), ... used for evaluating ψ are one-hot vectors and ψ explicitly captures a logic classifier.

We say ψ captures a logic classifier φ when, evaluated on any background assumption S_B, ψ(x) > 0 ⟺ φ(x) = 1 for any x. With the above theorem, we can directly derive the following corollary.

Corollary 4.5 (Proxy problem). With the constraints of Theorem 4.4 satisfied, minimizing

min_ψ L(ψ) = −log E[ψ(x; α) | Q(x) ∈ S_P] + log E[ψ(x; β) | Q(x) ∈ S_N]

yields a near-optimal solution to problem 11.

We say the solution is near-optimal because (1) the LPs are only capable of capturing the logic classifiers within their expressive power, and (2) the ranges of φ and ψ differ: φ(x) ∈ {0, 1} while ψ(x; α) ∈ [0, +∞). For example, if φ(x) := ∃y : Neighbor(x, y), then the corresponding ψ(x) equals the number of neighbors that x has. Also, the three constraints are sufficient conditions for the optimization problem to be guaranteed to converge to points where ψ captures a logic classifier, but they are not always necessary. In most cases we can relax these constraints and set α = β = 1 while keeping the solutions sparse, as shown in the experiments. The LP-tree constructed from grammar 1 naturally satisfies constraint (2), while for LP-layer constraint (2) is relaxed.
See Appendix C for more discussion of how these constraints work.

Learning the Inference Model. To learn the inference model p(Q(x) ∈ S_P | S_B), we fix the parameters of each ψ ∈ Ψ. Then, we run the inference procedure for a fixed number of iterations T. Since the Update function is differentiable w.r.t. its parameters, the obtained p(Q(x) ∈ S_P | S_B) for each Q ∈ P and x is also differentiable, and we can optimize the inference model by simply maximizing the likelihood of p(Q(x) ∈ S_P | S_B) on the data. With the proposed approaches, the whole learning procedure of our model described in Sec. 4.1 is realized.
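A toy numerical check of the proxy objective of Corollary 4.5 can be run with a single LP ψ(x; γ) = Σ_i F_i(x) w_i^γ and w = softmax(θ). The synthetic features, finite-difference gradients, and constants below are all assumptions for illustration; the point is only that minimizing the loss drives attention toward the feature that fires on positives:

```python
import numpy as np

# Feature 0 fires on positive instances, feature 1 mostly on negatives.
F_pos = np.array([[1.0, 0.2], [1.0, 0.1], [0.9, 0.3]])
F_neg = np.array([[0.1, 0.8], [0.2, 0.9]])
alpha, beta = 2.0, 1.5          # alpha > beta > 1, as in Theorem 4.4

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def loss(theta):
    # L = -log E_pos[psi(x; alpha)] + log E_neg[psi(x; beta)]
    w = softmax(theta)
    pos = np.mean(F_pos @ (w ** alpha))
    neg = np.mean(F_neg @ (w ** beta))
    return -np.log(pos) + np.log(neg)

theta = np.zeros(2)             # start from uniform attention
for _ in range(200):            # finite-difference gradient descent
    g = np.array([(loss(theta + 1e-5 * e) - loss(theta - 1e-5 * e)) / 2e-5
                  for e in np.eye(2)])
    theta -= 0.1 * g

w = softmax(theta)
# After training, nearly all attention mass sits on the predictive feature.
```

This mirrors the sparsity claim only on a toy instance; the theorem's guarantee concerns local minima of the full LP-tree objective, not this two-feature simplification.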

5.1. DATASETS

We consider datasets from different fields to test the model's ability in solving ILP tasks, systematic reasoning, and knowledge graph completion. These datasets include:

ILP tasks. We test the model's expressiveness by applying it to the 20 ILP tasks proposed in Evans & Grefenstette (2018). These tasks test the expressive power of an ILP model, including learning the concepts of even numbers, family relations, graph coloring, etc.

Systematicity. We test the model's systematicity (Lu et al. (2022)) on the CLUTRR (Sinha et al. (2019)) datasets. These tasks test a model's ability to generalize to unseen data from different distributions: models are trained on small-scale data and tested on larger data requiring longer resolution steps.

Knowledge graph completion

We test the model's capability of performing probabilistic reasoning on knowledge graphs, including UMLS, Kinship (Kok & Domingos (2007)), WN18RR (Dettmers et al. (2018)) and FB15k-237 (Toutanova & Chen (2015)). These tasks test a model's ability to deal with probabilistic and noisy data. For Kinship and UMLS, since there are no standard data splits, we take the splits from Qu et al. (2021).

5.2. MODEL CONFIGURATION

In all experiments, we use the grammar presented in Eq. 1, and if the input data does not contain unary predicates, we simply use an invented one, P_inv(x) ≡ 1. We stack LPs as described in Sec. 4.2, where the depth of the LP-tree and the number of layers of the LP-layer are 5 for the ILP tasks and 3 for the systematicity tasks and KG completion. For the inference model, the number of iterations is 10 for the ILP and systematicity tasks and 1 for KG completion. Due to space constraints, we leave the detailed model configuration and a sketch of the constructed network to Appendix D.

5.3. COMPARED ALGORITHMS

We observe that few models are capable of solving all the tasks, so we pick different algorithms for comparison on different tasks. For ILP tasks, we choose Evans & Grefenstette (2018) for comparison. Compared with this principled ILP method, which uses different program templates to solve the problems, our approach uses the same grammar for all problems. With the same model architecture, our model is able to achieve high accuracy on the CLUTRR datasets. Our model is also able to learn valuable logic rules and make fairly accurate predictions on much noisier knowledge graphs.

2. Performance w.r.t. rule complexity. We conduct experiments to study the model performance under different rule complexities on the UMLS dataset in Tab. 4. We can see that if we force the rules to be too simple, it is hard to capture informative patterns in the data; on the other hand, if we force the rules to be too complex, performance also decreases due to the loss of rule generality.

Figure 2: Distribution of rules.

3. Performance w.r.t. reweighting techniques. We study the effects of reweighting training data in Tab. 5, training models on the UMLS dataset with different reweighting methods. We can see that even without reweighting, our model is able to capture various logical patterns merely by randomly initializing model parameters. Besides, replacing, i.e., removing data instances that are correctly predicted, achieves the worst results. This is because for noisy data it is better to learn more diverse logical patterns to prove the targets, and removing those instances prevents the model from learning more information about them.

4. Effects of fine-tuning rule weights. We find that in most situations the model is able to make fairly precise predictions without training rule weights, as shown in Tab. 6. We set the weight of each rule to its precision on the training data, and conduct comparison experiments based on that.

5. Effects of hyperparameters α and β.
To show that in most situations we can safely set α and β to relatively small values, we conduct experiments on the UMLS dataset with different settings of α and β, summarized in Tab. 7. Surprisingly, the performance under different values of α and β greater than or equal to 1.0 is quite similar. This implies that in most situations we can safely set α and β to 1.0 to both simplify computation and stabilize the learning of the model.

6. Distribution of rule accuracy. We summarize the distributions of learnt rule accuracy and average rule contributions for UMLS in Fig. 2. The average contribution of a rule equals the average decrease in the scores of queries in the test data when the rule is removed, and hence measures how important a rule is, on average, for predicting the correct answers. We can see that most learnt rules are rather inaccurate, as there are barely any rules with precision higher than 0.5, but taken together, rules with accuracy 0 - 0.4 contribute the most to proving the target queries.

6. CONCLUSION

This paper studies inductive logic programming, and we propose the Differentiable Logic Programming framework to solve the problems of structure learning and weight learning. We generalize the discrete rule search problem and the forward chaining algorithm in a continuous and probabilistic manner, and use a differentiable program with a proxy problem to solve the learning problem. Both theoretical and empirical evidence is presented to demonstrate the effectiveness of our algorithm. In the future, we plan to explore the possibility of incorporating neural network architectures to help the model discover more complex and accurate logical patterns.

A PROOF OF PROPOSITION 4.2

We now prove Prop. 4.2.

Proposition A.1 (Properties of Φ). Given the input data (S_B, S_P, S_N), let Q be a d-ary target predicate, Φ_Q = {φ_0, φ_1, ..., φ_{L-1}} be a set of logic classifiers and p_0 ∈ (0, 1] a fixed threshold. Suppose the following statements are satisfied:

(1) We start from l = 0 and let S_P^(0) = S_P;
(2) For φ_l, its precision satisfies

p(Q(x) ∈ S_P^(l) | φ_l(x) = 1) = Σ_{x∈V^d} φ_l(x) 1[Q(x) ∈ S_P^(l)] / Σ_{x∈V^d} φ_l(x) 1[Q(x) ∈ S_P^(l) ∪ S_N] ≥ p_0;

(3) We let S_P^(l+1) = S_P^(l) \ {Q(x) | x ∈ V^d, φ_l(x) = 1}, increase l by 1 and go to (2) again, until l = L;
(4) S_P^(L) = ∅.

Then there exists a prediction model p(Q(x) ∈ S_P | S_B) = f(φ_0(x), φ_1(x), ..., φ_{L-1}(x)) whose error rate satisfies

Err[f; (S_B, S_P, S_N)] ≤ (1 − p_0) N_Q / (p_0 N) ≤ 1 − p_0,

where N_Q = Σ_{x∈V^d} 1[Q(x) ∈ S_P] and N = Σ_{x∈V^d} 1[Q(x) ∈ S_P ∪ S_N].

Proof: The proof is constructive. We let

f(φ_0(x), ..., φ_{L-1}(x)) = min{1, Σ_{φ∈Φ} φ(x)},

and let N_{φ_i} be the number of Q(x)s that are true and also predicted true by φ_i, with those Q(x)s related to φ_1, φ_2, ..., φ_{i-1} removed first. It is easy to observe that the number of Q(x)s predicted true by φ_i but actually false is W_{φ_i} ≤ ((1 − p_0)/p_0) N_{φ_i}. For simplicity, here we let Q(x) = 1 if Q(x) ∈ S_P and 0 otherwise.
Thus, we have:

Err[f; (S_B, S_P, S_N)] = (1/N) Σ_x 1_{Q(x) ≠ f(x)} = (1/N) Σ_x (1 - Q(x)) min{1, Σ_{φ∈Φ} φ(x)} ≤ (1/N) Σ_i W_{φ_i} ≤ (1/N) Σ_i ((1-p)/p) N_{φ_i} = (1-p) N_Q / (p N) ≤ 1 - p.

B PROOF OF THEOREM 4.4 AND COROLLARY 4.5

We now prove Theorem 4.4 and Corollary 4.5.

Theorem B.1. Consider the optimization problem

min_ψ L(ψ) = -log E_{x∼p1}[ψ(x; α)] + log E_{x∼p2}[ψ(x; β)],

where ψ(x; α) and ψ(x; β) are obtained by one iteration of the inference procedure. Here, we assume E_{x∼p1}[ψ(x; α)] > 0 and E_{x∼p2}[ψ(x; β)] > 0. Suppose the following constraints are satisfied: (1) α > β > 1; (2) ψ is the root of an LP-tree; (3) negations are applied only on leaf nodes. Then, at each local minimum of L(ψ), the attention vectors w used for evaluating ψ are one-hot vectors and ψ explicitly captures a logic classifier.

Proof: Before proving the theorem, we first study in what situations an LP ψ captures a logic classifier φ. Throughout this section we assume we are given an inference network G composed of {ψ_1, ψ_2, ..., ψ_L, ψ}, where ψ is the output node of interest, built upon {ψ_1, ψ_2, ..., ψ_L}. The following lemma gives the properties of ψ when it captures a logic classifier.

Lemma B.2. Given G and {ψ_1, ψ_2, ...} as stated above, suppose we perform a restricted breadth-first search across the inference network where:
• We start with the output node ψ;
• For each node we go through, the parameter vector w of the node, defined in Eq. 6-7, is a one-hot vector, i.e. one of its dimensions equals 1;
• From the current node, the next paths we follow are a subset of the inverse edges of ψ_i, where we only consider the ones corresponding to the nonzero dimensions of the w discussed above.

Then ψ captures a logic classifier φ, i.e., for all x, ψ(x) > 0 ⟺ φ(x) > 0. The lemma is quite straightforward: restricted by the above constraints, the evaluation procedure of an LP naturally simulates the grounding of a logic classifier.
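The intuition behind Lemma B.2 can be checked on a toy example. The following is a minimal sketch (not the paper's implementation; `lp_eval`, `F1`, `F2` are illustrative names): with a one-hot attention vector and any exponent α, evaluating an LP reproduces the selected sub-formula exactly, so ψ(x) > 0 holds precisely when the captured classifier holds.

```python
# Toy check: with one-hot attention, psi(x; alpha) = sum_i w_i^alpha * F_i(x)
# equals the selected sub-formula, so the LP behaves as a discrete classifier.

def lp_eval(w, subformulas, alpha, x):
    # psi(x; alpha) = sum_i w_i^alpha * F_i(x)
    return sum((wi ** alpha) * F(x) for wi, F in zip(w, subformulas))

F1 = lambda x: 1.0 if x % 2 == 0 else 0.0   # plays the role of "Even(x)"
F2 = lambda x: 1.0 if x > 2 else 0.0        # plays the role of "GreaterThan2(x)"

w_onehot = [0.0, 1.0]                        # attention selects F2
alpha = 3.0
agree = all((lp_eval(w_onehot, [F1, F2], alpha, x) > 0) == (F2(x) > 0)
            for x in range(6))
```

With a non-one-hot w, by contrast, ψ mixes both sub-formulas and no longer corresponds to a single classifier, which is exactly what the theorem rules out at local minima.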
We can construct a logic classifier φ captured by ψ as follows: for each node visited in the procedure of Lemma B.2, we replace the LP ψ with an actual logic classifier φ constructed with the correspondence defined in Eq. 7. Moreover, for each one-hot vector w, we pick the predicates or sub-formulas corresponding to its nonzero dimension. By doing so, we observe that the evaluation results of φ for any x are the same as those of ψ, because every computational step of φ and ψ stays the same. Thus, to prove that ψ captures a φ, we need to show that the parameters of ψ satisfy the constraints in Lemma B.2.

We now prove Theorem 4.4 recursively: we first prove that the parameter vector w of ψ must satisfy the proposition, i.e. it is one-hot, regardless of whether the vectors w^{(1)}, w^{(2)}, ... of F_1, F_2, ... of ψ are; then we prove that the parameters w^{(i1)}, w^{(i2)}, ... w.r.t. the nonzero dimension of w must also be one-hot; after that, we show that if ψ satisfies the proposition, the next nodes of ψ on the restricted BFS path must also satisfy the proposition. Hence the theorem is proved.

To study w, we fix all other parameters in the inference network G except the w of ψ, and the evaluation of ψ(x; α) becomes a simple polynomial function:

ψ(x; α) = Σ_i F_i(x) w_i^α,

and the evaluation of L′ becomes:

L′(ψ) = -log E_{x∼p}[ψ(x; α)] + log E_{x∼q}[ψ(x; β)]
= -log Σ_x ψ(x; α) p(x) + log Σ_x ψ(x; β) q(x)
= -log Σ_i (Σ_x p(x) F_i(x; α)) w_i^α + log Σ_i (Σ_x q(x) F_i(x; β)) w_i^β.

Hence, L′(ψ) is a polynomial function w.r.t. w, and we need to show that at each local minimum of L′(ψ) there cannot exist two dimensions w_i and w_j that are both larger than 0, which will be discussed later in this section. Now, assume we already know that the w of ψ satisfies the proposition, i.e., w_i = 1 for some i. To proceed, we need to prove that all parameters w^{(i1)}, w^{(i2)}, ... w.r.t. F_i are also one-hot.
We do this similarly as before: we fix all other parameters in the inference network except some w^{(i_k)} appearing in a neighbor of ψ in the inference network. By carefully checking the construction steps of F_i in Eq. 7, it is easy to show that L′(ψ) is also a polynomial function w.r.t. w^{(i_k)} under the constraint that w is one-hot. The following theorem proves that these functions indeed have the required property.

Theorem B.3 (Sparse Attentions). The function

h(w, α, β) = (Σ_i A_i(α) w_i^α) / (Σ_i B_i(β) w_i^β)

has no local maxima with any w_i ∈ (0, 1), provided the following constraints are satisfied: (1) attention vector: w_i ≥ 0 for each dimension i and Σ_i w_i = 1; (2) positive coefficients: A_i(α) ≥ 0 and B_i(β) ≥ 0 for every i and all α, β > 0; (3) non-empty results: Σ_i A_i(α) w_i^α > 0 and Σ_i B_i(β) w_i^β > 0; (4) α > β > 1.

Proof of Theorem B.3: We prove by contradiction that h(w, α, β) has no local maxima where some w_i ∈ (0, 1). Assume we are given some w_0 with at least two nonzero dimensions; the target is to prove that w_0 is not a local maximum of h. To do so, we create a new vector v_0 composed of the nonzero dimensions of w_0, together with a function h′(v, α, β) such that

h(w, α, β) = h′(v, α, β) = (Σ_i C_i(α) v_i^α) / (Σ_i D_i(β) v_i^β) = f(v, α) / g(v, β).

To prove that w_0 is not a local maximum of h, we can instead show that v_0 is not a local maximum of h′(v, α, β). We first consider the situation where the partial derivatives of h′ w.r.t. the v_i are not all the same; suppose, for example, ∂h′/∂v_1 |_{v=v_0} > ∂h′/∂v_2 |_{v=v_0}. Directly studying all dimensions of v is intractable under the constraint Σ_i v_i = 1, so we instead fix the parameters v_3, v_4, ... at the corresponding values of v_0 and study only the two parameters v_1 and v_2, letting v_1 = v_{01} + x and v_2 = v_{02} - x so that the constraint is naturally satisfied.
Thus, we have h′(v, α, β) = h″(x) and

dh″(x)/dx |_{x=0} = ∂h′(v, α, β)/∂v_1 |_{v=v_0} - ∂h′(v, α, β)/∂v_2 |_{v=v_0} > 0.

Since the derivative w.r.t. x is larger than 0, v_0 is not a local maximum. Next, we discuss the situation where the partial derivatives ∂h′/∂v_i |_{v=v_0} = λ are all the same. We again let v_1 = v_{01} + x, v_2 = v_{02} - x, and fix all other parameters of v. We have

d²h″(x)/dx² |_{x=0} = (1/g³) [ g² α(α-1)(A_1 v_1^{α-2} + A_2 v_2^{α-2}) - 2gαβ(A_1 v_1^{α-1} + A_2 v_2^{α-1})(B_1 v_1^{β-1} + B_2 v_2^{β-1}) + 2fβ²(B_1 v_1^{β-1} + B_2 v_2^{β-1})² - gfβ(β-1)(B_1 v_1^{β-2} + B_2 v_2^{β-2}) ].   (22)

Since we assume ∂h′/∂v_i |_{v=v_0} = λ, we have

∂h′/∂v_i |_{v=v_0} = (1/g²)(αg A_i v_i^{α-1} - βf B_i v_i^{β-1}) = λ
⟹ (1/g²)(αg A_i v_i^α - βf B_i v_i^β) = v_i λ
⟹ Σ_i (1/g²)(αg A_i v_i^α - βf B_i v_i^β) = Σ_i v_i λ
⟹ (f/g)(α - β) = λ.   (23)

Substituting Eq. 23 into Eq. 22, we have

d²h″(x)/dx² |_{x=0} = (f/g)[(1/f) A_1 α(α-1) v_1^{α-2} - (1/g) B_1 β(β-1) v_1^{β-2}] + (f/g)[(1/f) A_2 α(α-1) v_2^{α-2} - (1/g) B_2 β(β-1) v_2^{β-2}]
≥ (f/(g v_1))[(1/f) A_1 α v_1^{α-1} - (1/g) B_1 β v_1^{β-1}] + (f/(g v_2))[(1/f) A_2 α v_2^{α-1} - (1/g) B_2 β v_2^{β-1}]
= λ/v_1 + λ/v_2 = (f/g)(α - β)(1/v_1 + 1/v_2) > 0.

So v_0 is not a local maximum of h′. Thus, we have proved that h(w, α, β) has no local maxima with any w_i ∈ (0, 1). End of proof of Theorem B.3.

Since f and log f share the same minimum points for any f > 0, we have shown that the output node ψ indeed satisfies the conditions in the theorem. It is straightforward to show that the nodes along the paths that ψ is built on also satisfy the conditions. Suppose we are studying another ψ′ that is used for computing ψ. By writing out the detailed computation steps for evaluating ψ and fixing all irrelevant parameters, we can show that ψ(x; α) = Σ_{x′} A(x′, α) ψ′(x′; α) and thus E_{x∼p}[ψ(x; α)] = E_{x′∼p′}[ψ′(x′; α)], where p′ is an unnormalized distribution. Thus, the same conclusion holds for all ψ on the restricted BFS path.
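Theorem B.3 can be illustrated with a quick grid evaluation. This is a numerical sketch with made-up coefficients A, B (not taken from any experiment): with α > β > 1, the ratio h has no interior local maxima on the 2-dimensional simplex, so its maximum over the segment w = (t, 1-t) lies at an endpoint, i.e. a one-hot attention.

```python
# Grid check of Theorem B.3 with illustrative coefficients:
# h(w) = sum_i A_i w_i^alpha / sum_i B_i w_i^beta, alpha > beta > 1.

def h(t, A, B, alpha, beta):
    w = (t, 1.0 - t)
    num = sum(a * wi ** alpha for a, wi in zip(A, w))
    den = sum(b * wi ** beta for b, wi in zip(B, w))
    return num / den

A, B = (2.0, 1.0), (1.0, 3.0)   # arbitrary positive coefficients
alpha, beta = 3.0, 2.0
grid = [i / 1000 for i in range(1001)]
values = [h(t, A, B, alpha, beta) for t in grid]
best = max(range(1001), key=lambda i: values[i])
# The maximizer sits at an endpoint of the segment (a one-hot w).
```

Repeating the scan with α = β = 1 instead makes h piecewise flat in degenerate cases, which is exactly the failure mode discussed in Appendix C.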
We have now proved that every local minimum of the proxy problem corresponds to a logic classifier, and it is much easier to prove the rest of the conclusions as corollaries of the first one. For conclusion (2), we notice that when ψ converges,

p(Q(x) | φ(x) = 1) = p(φ(x) = 1 | Q(x) = 1) p(Q(x) = 1) / p(φ(x) = 1),

When the conjunction shares parameters, the resulting form of h(w, α, β) no longer keeps the good properties of Eq. 19, as its second-order derivatives w.r.t. some dimensions of w are no longer guaranteed to be nonnegative. We provide an example to illustrate this. Consider learning the logic rules Q(x) ← φ_1(x) := P_1(x) ∧ P_1(x) and Q(x) ← φ_2(x) := P_2(x) ∧ P_2(x); the expressions are redundant because the two conjuncts are the same. Suppose the model we use here is ψ_1(x; α) = ψ_2(x; α) ψ_2(x; α), where ψ_2(x; α) = Σ_i w_i^α P_i(x) is a soft selection over all possible predicates. Unluckily, in the training data the logic rules Q(x) ← φ_1(x) and Q(x) ← φ_2(x) are never satisfied, i.e., p(Q(x)|φ_1(x)) = p(Q(x)|φ_2(x)) = 0, but for the rule Q(x) ← φ_3(x) := P_1(x) ∧ P_2(x) we have p(Q(x)|φ_3(x)) > 0. Then it is easy to observe that the global minimum of the proxy problem occurs at w_1 = w_2 = 0.5. Although this situation contradicts the conclusions of the proxy problem, the training data, the target logic rules and the construction of the LPs are all ill-conditioned here: we placed the same ψ_2 on both sides of the conjunction, and in the training data we are unable to find any high-precision logic rules. We argue that in practice this can often be avoided by increasing the expressiveness of the model and setting α and β to relatively large values: even if the above situation happens, w_1^α = w_2^α are rather small values, so this configuration provides little stimulation to the model, and the model is encouraged to choose other rule structures that better explain the data.
Also, during training, when the model has not yet converged, the dimensions of w take rather small values, and the higher-order terms Π_j w_j^{n_j α} with large Σ_j n_j are much smaller than the ordinary terms, which means they have little influence on the overall derivatives w.r.t. w, so this problem is less serious.
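The ill-conditioned shared-conjunction example above can be checked numerically. This is an illustrative sketch (names are ours): when only the cross term P_1·P_2 is rewarded on positive data, the positive-side objective of ψ_1 = ψ_2² reduces, up to constants, to w_1^α w_2^α, which is maximized at the non-one-hot point w_1 = w_2 = 0.5.

```python
# Sketch of the shared-conjunction pathology: the only rewarded term of
# psi1 = (w1^a P1 + w2^a P2)^2 on positive data is the cross term, whose
# coefficient is w1^a * w2^a.

def cross_term_score(w1, alpha):
    w2 = 1.0 - w1
    return (w1 ** alpha) * (w2 ** alpha)

alpha = 2.0
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=lambda w1: cross_term_score(w1, alpha))
# best lands at 0.5: the proxy objective prefers the spurious mixed rule here.
# Note cross_term_score(0.5, alpha) = 0.5 ** (2 * alpha) shrinks rapidly as
# alpha grows, which is why a larger alpha weakens this spurious attractor.
```

This matches the remedy suggested in the text: the spurious minimum persists, but its reward decays exponentially in α, so larger exponents push the model toward genuine one-hot rule structures.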

Negations.

When training completes and the model converges, negations do not influence the validity of the model. However, during training, negations can be tricky to deal with, as adding negations to the trained LPs may make the coefficients of the corresponding w_i in h(w, α, β) negative. The following example illustrates how negations might influence the training process. Consider learning the logic rules Q(x) ← φ_1(x) := ¬P_1(x) and Q(x) ← φ_2(x) := ¬P_2(x), and a model ψ_1(x; α) = 1 - ψ_2(x; α), where ψ_2(x; α) = Σ_i w_i^α P_i(x). Suppose

E[φ_1(x)|Q(x)=1] = E[φ_2(x)|Q(x)=1] = E_1, E[φ_1(x)|Q(x)≠1] = E[φ_2(x)|Q(x)≠1] = E_2.

Then, letting ψ_2(x; α) = 0.5^α P_1(x) + 0.5^α P_2(x), we observe that

L(ψ) = -log E[ψ(x; α)|Q(x)=1] + log E[ψ(x; β)|Q(x)≠1]
= -log(0.5^α (E[φ_1(x)|Q(x)=1] + E[φ_2(x)|Q(x)=1])) + log(0.5^β (E[φ_1(x)|Q(x)≠1] + E[φ_2(x)|Q(x)≠1]))
= -α + β + constant.

Thus, L(ψ) is monotonically decreasing w.r.t. α - β, and such a ψ_2 can reach a smaller loss than φ_1 and φ_2 once α - β is sufficiently large. To avoid this problem, we can set α = β to a relatively small value and carefully select the target LPs when assigning negations to them. So far we have discussed the situations in which the model might fail to converge if we discard the constraints of the proxy problem. As we can see, most of these invalid situations can be avoided by setting α = β = 1 or 2 and by providing a reasonable grammar of target logic rules. We argue that even if the model occasionally fails to converge to a logic classifier, we can reinitialize the parameters of the relevant LPs randomly and train the model again so that it converges at other minimum points.

D EXPERIMENT DETAILS

In this section we explain the detailed model configuration for each experiment.

D.1 ILP TASKS

The 20 ILP tasks introduced by Evans & Grefenstette (2018) cover problems from integer recognition, family tree reasoning, general graph algorithms and so on; we briefly summarize them here. Generally, knowledge graphs are much noisier than ILP tasks: as shown in Fig. 2, most learnt rules have rather low accuracy compared to the ILP tasks (where correct rule accuracy is 1) and systematicity tests (where learnt rule accuracy is usually above 0.8). To handle such uncertainty, we set the number of inference iterations to 1 and learn more rules for each predicate. On all tasks our grammar is the same as in the ILP tasks. Because there are no unary predicates in the KGs, the resulting grammar is essentially φ(x, y) := ∃z : φ(x, z) ∧ φ(z, y), which corresponds to chain-like rules. On all KG completion tasks we learn rules of length 3. We create negative statements via negative sampling. Instead of removing the proved queries, we reduce the weights of the corresponding queries by a fixed ratio of 0.8. We train for 400 iterations for each relation and remove duplicated rules. When testing, we choose among different update functions for inference, including the original ones in Eq. 9; a modification of the original ones where we set the restriction ψ(x) = min{ψ(x), 1}, so that the evaluation value provided by ψ is exactly the same as that of the corresponding logic classifier; and a multi-layer perceptron chosen based on the validation data. For evaluation, we use the standard filtered ranking metrics (Bordes et al. (2013)), including Mean Rank (MR), Mean Reciprocal Rank (MRR) and Hit@k (H@k). When multiple tail nodes are assigned the same score, we compute the expectation of each evaluation metric over all random shuffles of entities (Qu et al. (2021)). In all experiments, we use the Adam optimizer (Kingma & Ba (2015)) with lr ∈ {0.01, 0.1}, and let α = β ∈ {1, 2}.
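The tie-handling described above can be sketched in code. This is a minimal illustration of the expectation-over-shuffles idea (the function name `expected_metrics` is ours): instead of an arbitrary tie-break, each metric is averaged over all positions the correct answer could occupy among equal-scored candidates.

```python
# Filtered ranking with expected metrics over tied scores (a sketch).

def expected_metrics(scores, answer_idx, k=10):
    """scores: list of (already filtered) candidate scores; answer_idx: index
    of the correct answer. Returns (E[rank], E[reciprocal rank], E[Hit@k]),
    averaging over all orderings of equal-scored candidates."""
    s = scores[answer_idx]
    better = sum(1 for v in scores if v > s)
    ties = sum(1 for v in scores if v == s)   # includes the answer itself
    # The answer is equally likely to land at ranks better+1 .. better+ties.
    ranks = list(range(better + 1, better + ties + 1))
    mr = sum(ranks) / ties
    mrr = sum(1.0 / r for r in ranks) / ties
    hit = sum(1 for r in ranks if r <= k) / ties
    return mr, mrr, hit
```

Note that the expectation is taken of each metric (e.g. E[1/rank]), not of the rank itself, so MRR under ties is not simply the reciprocal of the expected rank.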
We generate the structural parameters w of the LPs using a softmax function: w = softmax(w′), where w′ ∈ R^{n_w} is a real-valued vector with no constraints. We preprocess the data to add permutations of ordinary predicates; for example, for every 2-ary statement P(a, b) in the data we create an invented one P′(b, a). Randomly initialized parameters are drawn independently from the Gaussian distribution N(0, 1), but this is not necessary, and other distributions (uniform, Xavier, ...) also work well. Illustrations of inference network architectures constructed with the procedure described in Sec. 4.3 are shown in Fig. 3.
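The parameterization above can be written out as a short sketch. This is an illustrative stand-alone version (stdlib only, not the training code): unconstrained parameters are mapped onto the simplex by a softmax, and the resulting attention is applied with exponent α as in Eq. 6.

```python
# Structural parameters: w = softmax(w'), w' unconstrained, Gaussian init,
# followed by the power-attention evaluation sum_i w_i^alpha * F_i.
import math
import random

def softmax(v):
    m = max(v)                      # subtract max for numerical stability
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]

def power_attention(w, F, alpha):
    """Soft selection over candidate sub-formula values F with exponent alpha."""
    return sum((wi ** alpha) * fi for wi, fi in zip(w, F))

random.seed(0)
w_raw = [random.gauss(0.0, 1.0) for _ in range(4)]   # unconstrained init
w = softmax(w_raw)                                   # attention on the simplex
```

Because softmax outputs are strictly positive, "one-hot" at convergence means one entry approaching 1 (as in the 0.99 threshold used in Appendix G.1), never exactly reaching it.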

E DISCUSSION OF TIME COMPLEXITY

In this section we discuss the time complexity of the model. Suppose the inference network is composed of N LPs, and that the arities of predicates and LPs are at most n. We now discuss the time complexity of each part of the inference model.

General time complexity

We first take a look at a single LP's behaviour, assuming that all other LPs in the network are already evaluated. From the definitions of LPs in Eq. 6 and Eq. 7, we can see that evaluating one LP involves:

ψ(x; α) = (w^T)^α [F_1(x; α), F_2(x; α), F_3(x; α), ...]^T.

Since the total number of F_i is limited and does not scale as the network or input data grows, this step costs T(ψ) ≤ Σ_i T(F_i) + O(|V|^n), where T(ψ) denotes the time complexity of evaluating ψ(x; α) for all x, T(F_i) that of evaluating F_i(x; α) for all x, etc. For each F_i, we have:

F_i(x; α) = [P_1(x), P_2(x), ..., P_|P|(x), ψ′_i(x)] (w^{(i)})^α ⟹ T(F_i) ≤ O(|P||V|^n),
F_i(x; α) = F_j(x; α) F_k(x; α) ⟹ T(F_i) ≤ T(F_j) + T(F_k) + O(|V|^n),
F_i(x; α) = [F_j(x; α), F_k(x; α)] (w^{(i)})^α ⟹ T(F_i) ≤ T(F_j) + T(F_k) + O(|V|^n),
F_i(x; α) = 1 - F_j(x; α) ⟹ T(F_i) ≤ T(F_j) + O(|V|^n),
F_i(x; α) = Σ_y F_j(x, y; α) ⟹ T(F_i) ≤ T(F_j) + O(|V|^n),
F_i(x, y; α) = F_j(x; α) ⟹ T(F_i) ≤ T(F_j) + O(|V|^n).   (42)

Thus we can see that for one ψ, we have T(ψ) ≤ Σ_{F_i} O(|P||V|^n) = O(|P||V|^n). To evaluate all nodes in the network, we simply evaluate the LPs one by one in topological order, which directly gives a total time complexity of T(Ψ) = O(|P||V|^n N). The update procedure for one x is a function with R_Ψ (the number of learnt rules) inputs, and all implementations introduced here make the evaluation of this function O(R_Ψ), so the evaluation over all x takes O(|V|^n R_Ψ) time.
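The F_i operations above can be made concrete with a toy evaluator over unary valuations. This is an illustrative sketch, not the paper's implementation (names `attend`, `conj`, `neg` are ours); each operation touches every grounding in V once, matching the per-operation O(|V|^n) costs with n = 1.

```python
# Toy evaluator for the F_i operations on unary valuations over a domain V.
V = [0, 1, 2, 3]

def attend(options, w, alpha):
    # F_i(x) = [F_j(x), F_k(x), ...] (w^alpha): weighted choice of sub-formulas
    return {x: sum((wi ** alpha) * f[x] for wi, f in zip(w, options)) for x in V}

def conj(Fj, Fk):
    # F_i(x) = F_j(x) * F_k(x)
    return {x: Fj[x] * Fk[x] for x in V}

def neg(Fj):
    # F_i(x) = 1 - F_j(x)
    return {x: 1.0 - Fj[x] for x in V}

P1 = {0: 1.0, 1: 0.0, 2: 1.0, 3: 0.0}
P2 = {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0}
psi = conj(attend([P1, P2], [1.0, 0.0], 2.0), neg(P2))
# With one-hot attention on P1, psi(x) = P1(x) * (1 - P2(x)).
```

Each combinator is a single pass over V, so chaining k of them costs O(k|V|) here; with n-ary predicates the same passes run over |V|^n groundings, giving the bounds in Eq. 42.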

Time complexity on sparse graphs

In reality the input data is often sparse, and the |V|^n term in the time complexities can be reduced significantly. Here we analyse the time complexity of the model in the knowledge graph completion experiments. Suppose the input graph has |V| nodes, |P| predicates and M edges, and we are learning chain-like rules of length L. Each output unit, corresponding to a rule, is composed of L LPs and can be written as:

ψ(x, y; α) = Σ_{z_1, z_2, ..., z_{L-1}} P_1(x, z_1; α) P_2(z_1, z_2; α) ... P_L(z_{L-1}, y; α),

where P_i(x, y; α) = [P_1(x, y), P_2(x, y), ...] w_i^α. We can efficiently calculate ψ(x, y; α) for all nodes y in the knowledge graph sharing the same source x. The algorithm (Algorithm 1) is an extension of L-step breadth-first search, where S_L maps each node y with nonzero value ψ(x, y; α) to ψ(x, y; α):

2: S_0 ← MAP(∅)
3: S_0[x] ← 1.0
4: for l ← 1 to L do
5:   S_l ← MAP(∅)
6:   for P ∈ P, s ∈ S_{l-1}, t ∈ N_P(s) do   (48)

Thus, evaluating a total of R_Ψ rules once for all node pairs in the graph takes

T ≤ R_Ψ |V| T_single = O(R_Ψ L N^{(L)} N^{(1)} |V|).   (49)

Since in sparse graphs we often have N^{(1)} ≪ N^{(L)} ≪ |V|, this estimate (Eq. 49) is much smaller than the original one (Eq. 44), which in this case is O(R_Ψ L |P| |V|^3).
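The sparse chain evaluation can be written as a short runnable sketch. This is an illustration under assumed inputs (the tiny graph and weights below are made up, and the function name `evaluate` mirrors Algorithm 1): starting from S_0 = {x: 1.0}, each step propagates scores only along existing edges, so the cost depends on reachable neighborhoods rather than |V|.

```python
# Runnable sketch of Algorithm 1: chain-rule evaluation by L-step relational
# BFS, keeping a map from reachable nodes to partial scores.
from collections import defaultdict

def evaluate(x, edges, L, alpha, ws):
    """edges: dict predicate -> list of (s, t) pairs; ws[l]: dict mapping each
    predicate to its attention weight at step l. Returns S_L mapping each
    reachable y to psi(x, y; alpha)."""
    # Index edges by source node once for sparse traversal.
    by_src = {P: defaultdict(list) for P in edges}
    for P, pairs in edges.items():
        for s, t in pairs:
            by_src[P][s].append(t)
    S = {x: 1.0}
    for l in range(L):
        S_next = defaultdict(float)
        for P, w in ws[l].items():
            for s, v in S.items():
                for t in by_src[P][s]:
                    # S_l[t] += (w_l[P])^alpha * S_{l-1}[s]
                    S_next[t] += (w ** alpha) * v
        S = dict(S_next)
    return S

edges = {"r1": [(0, 1), (0, 2)], "r2": [(1, 3), (2, 3)]}
ws = [{"r1": 1.0, "r2": 0.0}, {"r1": 0.0, "r2": 1.0}]
S = evaluate(0, edges, L=2, alpha=1.0, ws=ws)
# Two length-2 paths (0-1-3 and 0-2-3) each contribute 1.0.
```

Each step only touches predecessors' edge lists, which is the source of the T_single ≤ O(L N^{(L)} N^{(1)}) bound.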

F CASE STUDIES

In this section we illustrate part of the logic rules we learned on ILP tasks, systematicity tests and knowledge graph completion.

Grandmother(x, y) ← ψ_1(x, y) = ∃z : Brother(x, z) ∧ Grandmother(z, y)
                 ← ψ_2(x, y) = ∃z : Father(x, z) ∧ Mother(z, y)

Causes(x, y) ← ψ_1(x, y) = ∃z_1, z_2 : Contains(z_1, x) ∧ LocationOf(z_1, z_2) ∧ OccursIn(z_2, y)
             ← ψ_2(x, y) = ∃z_1, z_2 : Contains(z_1, x) ∧ LocationOf(z_1, z_2) ∧ Complicates(y, z_2)
             ← ψ_3(x, y) = ∃z_1, z_2 : IngredientOf(z_1, x) ∧ IsA(z_1, z_2) ∧ Causes(z_2, y)
             ← ψ_4(x, y) = ∃z_1, z_2 : IngredientOf(z_1, x) ∧ InteractsWith(z_2, z_1) ∧ Causes(z_2, y)

G ADDITIONAL ABLATION STUDY

In this section we provide empirical results for additional ablation study.

G.1 SPARSITY OF LP-LAYER

We run LP-layer on the experiments used in this paper and measure the sparsity of the learnt parameters. For each w used for prediction, if after training max{w_1, w_2, ...} ≥ 0.99, we regard w as converged. The results are reported in Tab. 9.
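The convergence criterion above is simple enough to state directly as code (a one-line sketch; `is_converged` and `sparsity` are illustrative names, not the experiment scripts):

```python
# Sparsity criterion used above: an attention vector counts as converged
# (effectively one-hot) once its largest entry reaches the 0.99 threshold.

def is_converged(w, threshold=0.99):
    return max(w) >= threshold

def sparsity(attentions, threshold=0.99):
    """Fraction of attention vectors that have effectively become one-hot."""
    return sum(is_converged(w, threshold) for w in attentions) / len(attentions)
```

Since softmax outputs never reach exactly 1, a strict max(w) == 1 test would report zero sparsity; the 0.99 threshold is what makes the statistic meaningful.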

G.2 DEPTH OF NETWORKS

We run LP-layer and LP-tree with different depths {3, 5, 10} on the ILP and systematicity tests. Depth 3 cannot complete some of the ILP tasks, which require more reasoning steps; depths 5 and 10 are both able to complete all ILP tasks. Results on the systematicity tests are very similar across depths 3, 5 and 10, as shown in Tab. 10.

G.3 CLIPPING ON LP

We now study the effect of restricting the range of ψ to [0, 1] by simply setting ψ(x; α) ← min{ψ(x; α), 1}. The results are shown in Tab. 11.



Figure 1: Illustration of logic classifiers (left) and DLP framework (middle, right). The target rule is Blue(x) ← ∃y : Red(y) ∧ Edge(x, y). In the left figure, φ 1 (x) directly captures the target rule. In the middle figure, the model is constructed by extending the unary node ψ 1 (x) with tree structure. In the right figure, the model is constructed by stacking multiple layers. (see Sec. 4.2 for detailed description). Both these structures are constructed with the grammar in Eq. 1. ψ 1 learns to identify the target logic rule via gradient descent and converges at the correct (colored as blue) reasoning paths.

For systematic reasoning, we choose Graph Attention Networks (GAT)(Velickovic et al. (2018)), Graph Convolutional Networks (GCN)(Kipf & Welling (2017)), Recurrent Neural Networks (RNN)(Schuster & Paliwal (1997)), Long Short-Term Memory Networks (LSTM)(Hochreiter & Schmidhuber (1997)), Gated Recurrent Units (GRU)(Cho et al. (2014)), Convolutional Neural Networks (CNN)(Kim (2014)), CNN with Highway Encoders (CNNH)(Kim et al. (2016)), Greedy Neural Theorem Provers (GNTP)(Minervini et al. (2020a)), Multi-Headed Attention Networks (MHA)(Vaswani et al. (2017)), Conditional Theorem Provers (CTP)(Minervini et al. (2020b)), R5(Lu et al. (2022)). For knowledge graph completion, we choose rule-based methods including Markov Logic Networks (MLN)(Richardson & Domingos (2006)), PathRank(Lao & Cohen (2010)), NeuralLP(Yang et al. (2017)), DRUM(Sadeghian et al. (2019)), CTP(Minervini et al. (2020b)), M-Walk(Shen et al. (2018)), MINERVA(Das et al. (2018)), NLIL(Yang & Song (2020)), RNNLogic(Qu et al. (2021)).

Figure 3: Illustration of two inference networks generated by procedure in Sec. 4.3. We remove F i nodes for notation clarity. Output nodes are colored in blue. The left figure corresponds to φ(x, y) := A(x, y) | ∃z : φ(x, z) ∧ A(z, y), which is applicable to the systematicity tests and knowledge graph completion tasks. The right figure corresponds to φ(x) := A(x) | ∃y : φ(x, y) ∧ φ(y) and φ(x, y) := A(x, y) | ∃z : φ(x, z) ∧ φ(z) ∧ φ(z, y), which is used in the ILP tasks.

S_l[t] ← S_l[t] + (w_l[P])^α S_{l-1}[s]. Here, we denote N(s) as the neighbors of s in the KG, and N_P(s) as the neighbors of s connected by edge type P. Thus, one run of the function takes T_single = Σ_l Σ_{s∈S_{l-1}} Σ_{P∈P} Σ_{t∈N_P(s)} O(1). Denoting N^{(k)} as the maximum number of nodes in a node's k-hop subgraph, we have T_single ≤ Σ_{l=1}^{L} N^{(l-1)} O(N^{(1)}) ≤ O(L N^{(L)} N^{(1)}).

DivisibleBy3(x) ← ψ_1(x) = Zero(x)
                ← ψ_2(x) = ∃y : ψ_3(y) ∧ Succ(y, x)
   ψ_3(x) = ∃y : ψ_4(y) ∧ Succ(y, x)
   ψ_4(x) = ∃y : DivisibleBy3(y) ∧ Succ(y, x)

AdjacentToRed(x) ← ψ_1(x) = ∃y : Red(y) ∧ Edge(x, y)

Connect(x, y) ← ψ_1(x, y) = Edge(x, y)
              ← ψ_2(x, y) = ∃z : Connect(x, z) ∧ Edge(z, y)

Results on CLUTRR.
DLP-tree: .990±.006, .994±.001, 1.0±.000, .995±.002, .997±.001, .996±.000, 1.0±.000, .992±.001, .990±.000, .994±.002, 1.0±.000, .992±.002, .996±.001
DLP-layer: .991±.003, .993±.001, 1.0±.000, .995±.002, .997±.000, .996±.000, 1.0±.000, .992±.002, .990±.000, .997±.001, 1.0±.000, .992±.002, .996±.001

Results on Kinship and UMLS.

Results on FB15k-237 and WN18RR.

Study of rule complexity.

Algorithm 1 Inference in sparse graphs
Input: graph G, predicates P, source node x, rule length L, model parameters α, w_1, w_2, ..., w_L.
Output: S_L.
1: function EVALUATE(x, G, L, α, w_1, w_2, ..., w_L)

Depth of networks.

Setting ψ(x; α) ← min{ψ(x; α), 1}.


which is monotonically decreasing w.r.t. L(ψ), so conclusion (2) holds. Here, we assume that when ψ converges, E[ψ(x; α)|ψ(x; α) > 0] ≈ E_{x∈Pos}[ψ(x; α)|ψ(x; α) > 0].

C DISCUSSION OF CONSTRAINTS IN PROXY PROBLEM

In this section we discuss the properties and functionalities of the three constraints in the proxy problem, and construct examples to illustrate how they work.

Proxy Problem. Minimization of the proxy optimization problem yields a near-optimal solution for solving problem 11, with the constraints of Theorem 4.4 being satisfied.

Properties of α and β. We first discuss the properties of the hyperparameters α and β. As shown in Appendix B, α > β is necessary to keep the second-order derivative positive. If we remove this constraint by simply setting α = β = 1, then Eq. 23 gives λ = 0, and the second-order derivative in Eq. 22 is no longer guaranteed to be positive. This means that while training the model, the derivatives w.r.t. multiple nonzero dimensions w_i of some w may vanish, and the model might fall into a local minimum where it does not capture any logic classifier. We now discuss the situations where these risks actually exist.

The first and simplest case is when p = q. In this case, training the model achieves nothing, because L(ψ) is a fixed scalar irrelevant to ψ. We argue that this is not a big problem, since it requires p and q, corresponding to the distributions of positive and negative data instances, to be the same.

Extending the above case, we can construct a more general situation. Recall the function h(w, 1, 1) in Theorem B.3, and let m be the dimension corresponding to the global maximum. If there is some dimension k of w such that A_k(1) = B_k(1) = 0, then any w′ with w′_m > 0 attains h(w′, 1, 1) = A_m(1)/B_m(1) = sup{h(w, 1, 1)}, regardless of the mass placed on dimension k, so the attention need not converge to one-hot. To solve this problem, we can add a normalization term to the original proxy problem; even when A_k(1) = B_k(1) = 0, the normalization term still encourages the model to assign a larger value to w_m.

Another problematic situation happens when there exist i ≠ j such that A_i(1) = A_j(1) > 0 and B_i(1) = B_j(1) > 0.
This is unusual, because it requires two different logic classifiers φ_1 and φ_2 to have identical coefficients under both the positive and negative distributions. Even when it happens, the problem is easy to deal with: during training we always reweight or sample different batches of training data, yielding different data distributions p′ and q′. We can also set α = β > 1, for example α = β = 2, to avoid such situations from happening.

Tree-structured path constraints. We now discuss the second constraint in the problem. If we remove this constraint and there are cycles through some unfixed nodes, we observe that L(ψ) w.r.t. the corresponding w may no longer have the form of Eq. 19. Note that in Eq. 7 we define the conjunction as F_i(x) = F_j(x) F_k(x). If the evaluations of F_j and F_k both contain some w, i.e. F_j = a_j^T w^α and F_k = a_k^T w^α where a_j and a_k are quantities irrelevant to w, then F_i = a_j^T (w w^T)^α a_k, which no longer has the form of Eq. 19.

Task 1-6. In tasks 1 to 6 we are provided with the natural numbers from 0 to 9, defined as follows: S_B = {zero(0), succ(0, 1), ..., succ(8, 9)}. (35) The target is to learn to recognize the predecessor relation, even/odd numbers, the less-than relation, and divisibility by 3 or 5.

Task 7-8. Tasks 7-8 require us to learn the relations member and length of a list. Nodes in a list are encoded as follows: cons(x, y) if the node after x is y, and value(x, y) if the value of node x is y. Two background statements are given, corresponding to the lists [4, 3, 2, 1] and [2, 3, 2, 4].

Task 9-14. In tasks 9-14 we are provided with different facts about family relations, and we need to learn relations including son, grandparent, husband, uncle, relatedness and father. An example rule would be son(x, y) ← father(x, y) ∧ φ(x), φ(x) := brother(x, y) ∨ father(x, y), where φ implies the male property.

Task 15-20. In tasks 15-20 we are provided with labeled directed graphs, and we are asked to learn general concepts from graph algorithms.
These tasks include learning whether a node is adjacent to a red node; whether a node has at least two children; whether a graph is well-colored, i.e., whether there are two adjacent nodes of the same color; whether two nodes are connected; and recognizing graph cycles.

As a principled approach, ∂ILP (Evans & Grefenstette (2018)) is able to solve all these tasks. However, it needs different language templates and program templates for each task (for example, task-specific templates for the even-numbers problem). In contrast, we use a unified model with a single grammar of logic classifiers shared across all tasks. For tasks with no observed predicates of arity 1 or 2, we simply create an invented predicate that takes the value 1 for every variable x. On all tasks we set α = β = 1. The construction of the inference network and the training of the model are the same as described in Sec. 4.1, Sec. 4.2 and Sec. 4.4. The correctness of the learnt model is confirmed by checking whether all positive queries are predicted as true and all negative queries as false by the model.

D.2 SYSTEMATICITY TESTS

All model configurations are the same as in the ILP tasks.

D.3 KNOWLEDGE GRAPH COMPLETION

The statistics of datasets are summarised as follows. 

