GOING BEYOND APPROXIMATION: ENCODING CON-STRAINTS FOR EXPLAINABLE MULTI-HOP INFERENCE VIA DIFFERENTIABLE COMBINATORIAL SOLVERS

Abstract

Integer Linear Programming (ILP) provides a viable mechanism to encode explicit and controllable assumptions about explainable multi-hop inference with natural language. However, an ILP formulation is non-differentiable and cannot be integrated into broader deep learning architectures. Recently, Thayaparan et al. (2022) proposed a novel methodology to integrate ILP with Transformers to achieve end-to-end differentiability for complex multi-hop inference. While this hybrid framework demonstrates to deliver better answer and explanation selection than transformer-based and ILP solvers, the neuro-symbolic integration still relies on a convex relaxation of the ILP formulation, which can produce suboptimal solutions. To improve these limitations, we propose Diff-Comb Explainer, a novel neuro-symbolic architecture based on Differentiable BlackBox Combinatorial solvers (DBCS) (Pogančić et al., 2019) . Unlike existing differentiable solvers, the presented model does not require the transformation and relaxation of the explicit semantic constraints, allowing for a direct and a more efficient integration of ILP formulations. Diff-Comb Explainer demonstrates improved accuracy in answer and explanation selection over non-differentiable solvers, Transformers and constraint-based differentiable multi-hop inference frameworks.

1. INTRODUCTION

Given a question expressed in natural language, ILP-based Multi-hop Question Answering (QA) aims to construct an explanation graph of interconnected facts (i.e., natural language sentences) to support the answer (see Figure 1 ). This framework provides a viable mechanism to encode explicit and controllable assumptions about the structure of the inference (Khashabi et al., 2018; Khot et al., 2017; Khashabi et al., 2016) . For this reason, inference based on constrained optimization is generally regarded as interpretable and transparent, providing structured explanations in support of the underlying reasoning process (Thayaparan et al., 2020) . However, ILP solvers are non-differentiable and cannot be integrated as part of a broader deep learning architecture (Paulus et al., 2021; Pogančić et al., 2019) . Moreover, these approaches are often limited by the exclusive adoption of hard-coded heuristics for the inference and cannot be optimised end-to-end on annotated corpora to achieve performance comparable to deep learning counterparts (Thayaparan et al., 2022; Khashabi et al., 2018) . In an attempt to combine the best of both worlds, Thayaparan et al. (2022) proposed a novel neurosymbolic framework (Diff-Explainer) that integrates explicit constraints with neural representations via Differentiable Convex Optimization Layers (Agrawal et al., 2019) . Diff-Explainer combines constraint optimization solvers with Transformers-based representations, enabling end-to-end training for explainable multi-hop inference. The non-differenitability of ILP solvers is alleviated by approximating the constraints using semi-definite programming (Helmberg, 2000) . This approximation usually requires non-trivial transformations of ILP formulations into convex optimization problems. Since constraint-based multi-hop inference is typically framed as optimal subgraph selection via binary optimization (0, 1), The semi-definite relaxation employed in Diff-Explainer necessitates a continuous relaxation of the discrete variables (from {0, 1} to [0, 1]). While this process can provide (Xie et al., 2020; Jansen et al., 2018) tight approximations for ILP problems, this relaxation can still lead to sub-optimal solutions in practice (Yoshida, 2011; Thapper & Živnỳ, 2018) leading to errorneous answer and explanation prediction. To improve on these limitations, we propose Diff-Comb Explainer, a novel neuro-symbolic architecture based on Differentiable BlackBox Combinatorial solvers (DBCS) (Pogančić et al., 2019) . The proposed algorithm transforms a combinatorial optimization solver into a composable building block of a neural network. DBCS achieves this by leveraging the minimisation structure of the combinatorial optimization problem, computing a gradient of continuous interpolation to address the non-differenitability of ILP solvers. In contrast to Diff-Explainer (Thayaparan et al., 2022) , DBCS makes it possible to compute exact solutions for the original ILP problem under consideration, approximating the gradient. Our experiments on multi-hop question answering with constraints adopted from Explana-tionLP (Thayaparan et al., 2021) yielded an improvement of 11% over non-differentiable solvers and 2.08% over Diff-Explainer. Moreover, we demonstrate that the proposed approach produces more accurate and faithful explanation-based inference, outperforming non-differentiable ILP-based solvers, Diff-Explainer and Transformer-based approaches.

2. RELATED WORK

Constraint-based multi-hop inference ILP has been applied for structured representation (Khashabi et al., 2016) and over semi-structured representation extracted from text (Khot et al., 2017; Khashabi et al., 2018) . Early approaches were unsupervised. However, recently Thayaparan et al. (2021) proposed the ExplanationLP model optimised towards answer selection via Bayesian optimisation. ExplanationLP was limited to fine-tuning only nine parameters and used pre-trained neural embedding. Diff-Explainer (Thayaparan et al., 2022) was the first approach to integrate constraints into a deep-learning network via Differentiable Convex Optimisation Layer (Agrawal et al., 2019) by approximating ILP constraints using Semi-definite programming. (Lovász & Schrijver, 1991) . Clark et al. (2021) proposed "soft theorem provers" operating over explicit theories in language. This hybrid reasoning solver integrates natural language rules with transformers to perform deductive reasoning. Saha et al. (2020) improved on top of it, enabling the answering of binary questions along with the proofs supporting the prediction. The multiProver (Saha et al., 2021) evolves on top of these conceptions to produce an approach that is capable of producing multiple proofs supporting the answer. While these hybrid reasoning approaches produce explainable and controllable inference, they assume the existence of natural language rules and have only been applied to synthetic datasets. On the other hand, our approach does not require extensive rules set and can tackle complex scientific and commonsense QA. For each candidate answer The integration of Differentiable Blackbox Combinatorial solvers will result in better explanation and answer prediction.

Hybrid reasoning with Transformers

Differentiable Blackbox Combinatorial Optimisation Solver Given the following bounded integer problem: min x∈X c • x subject to Ax ≤ b, where X ∈ Z n , c ∈ R n , x are the variables, A = [a 1 , . . . , a m ] ∈ R m×n is the matrix of constraint coefficients and b ∈ R m is the bias term. The output of the solver g(c) returns the arg min x∈X of the integer problem. Differentiable Combinatorial Optimisation Solver (Pogančić et al., 2019) (DBCS) assumes that A, b are constant and the task is to find the dL/dc given global loss function L with respect to solver output x at a given point x = g(ĉ). However, a small change in c is typically not going to change the optimal ILP solution resulting in the true gradient being zero. In order to solve this problem, the approach simplifies by considering the linearisation f of L at the point x. f (x) = L(x) + dL dx (x) • (x -x) to derive: df (g(c)) dc = dL dc By introducing the linearisation, the focus is now to differentiating the piecewise constant function f (g(c)). The approach constructs a continuous interpolation of f (g(c)) by function f λ (w). Here the hyper-parameter λ > 0 controls the trade-off between informativeness of the gradient and faithfulness to the original function. 3 Diff-COMB EXPLAINER: DIFFERENTIABLE BLACKBOX COMBINATORIAL SOLVER FOR EXPLAINABLE MULTI-HOP INFERENCE ILP-based QA is typically applied to multiple-choice question answering (Khashabi et al., 2018; Khot et al., 2017; Khashabi et al., 2016; Thayaparan et al., 2021) . Given Question (Q) and the set of candidate answers C = {c 1 , c 2 , c 3 , . . . , c n } the aim is to select the correct answer c ans . In order to achieve this, ILP-based approaches convert question answer pairs into a list of hypothesis H = {h 1 , h 2 , h 3 , . . . , h n } (where h i is the concatenation of Q with c i ) and typically adopt a retrieval model (e.g: BM25, FAISS (Johnson et al., 2017) ), to select a list of candidate explanatory facts F = {f 1 , f 2 , f 3 , . . . , f k }. Then construct a weighted graph G = (V, E, W ) with edge weights W : E → R where V = {{h i } ∪ F }, edge weight W ik of each edge E ik denote how relevant a fact f k is with respect to the hypothesis h i . Given this premise, ILP-based Multi-hop QA can be defined as follows (Thayaparan et al., 2022) : Definition 3.1 (ILP-Based Multi-Hop QA). Find a subset V * ⊆ V , h ∈ V * and E * ⊆ E such that the induced subgraph G * = (V * , E * ) is connected, weight W [G * = (V * , E * )] := e∈E * W (e) is maximal and adheres to set of constraints M c designed to emulate multi-hop inference. The hypothesis h i with the highest subgraph weight W [G * = (V * , E * )] is selected to be the correct answer c ans . As illustrated in Figure 2 , Diff-Comb Explainer has 3 major parts: Graph Construction, Subgraph Selection and Answer/Explanation Selection. In Graph Construction, for each candidate answer c i we construct graph G i = (V i , E i , W i ) where the V i = {h i } ∪ {F } and weights W i ik of each edge E i ik denote how relevant a fact f k is with respect to the hypothesis h i . These edge weights (W i ik ) are calculated using a weighted (θ) sum of scores calculated using transformer-based (ST rans) embeddings and lexical overlap. In the Subgraph Selection step, for each G i Differentiable Blackbox Combinatorial Solver (DBCS) with constraints are applied to extract subgraph G * . In this paper, we adopt the constraints proposed for ExplanationLP (Thayaparan et al., 2021) . ExplanationLP explicit abstraction by grouping facts into abstract and grounding facts. Abstract facts are core scientific facts that a question is attempting to test, and grounding facts link concepts in the abstract facts to specific terms in the question/answer. For example, in Figure 1 the core scientific fact is about the nature of convex lens and how they refract light (F 1 ). Facts F 2 , F 3 , F 4 help to connect the abstract fact to the question/answer. Finally, in Answer/Explanation Selection the model is to predict the correct answer c ans and relevant explanations F exp . During training time, the loss is calculated based on gold answer/explanations to fine-tune the transformers (ST rans) and weights (θ). The rest of the section explains each of the components in detail.

3.1. GRAPH CONSTRUCTION

In order to facilitate grounding abstract chains, the retrieved facts F are classified into grounding facts F G = {f g 1 , f g 2 , f g 3 , ..., f g l } and abstract facts F A = {f a 1 , f a 2 , f a 3 , ..., f a m } such that F = F A ∪ F G and l + m = k. Similarly to Diff-Explainer (Thayaparan et al., 2022) , we use two relevance scores: semantic and lexical scores, to calculate the edge weights. We use a Sentence-Transformer (STrans) (Reimers et al., 2019) bi-encoder architecture to calculate the semantic relevance. The semantic relevance score from STrans is complemented with the lexical relevance score. The semantic and lexical relevance scores are calculated as follows: Semantic Relevance (s): Given a hypothesis h i and fact f j we compute sentence vectors of ⃗ h i = ST rans(h i ) and ⃗ f j = ST rans(f j ) and calculate the semantic relevance score using cosinesimilarity as follows: s ij = S( ⃗ h i , ⃗ f j ) = ⃗ h i • ⃗ f j ∥ ⃗ h i ∥∥ ⃗ f j ∥ (4) Lexical Relevance (l): The lexical relevance score of hypothesis h i and f j is given by the percentage of overlaps between unique terms (here, the function trm extracts the lemmatized set of unique terms from the given text): l ij = L(h i , f j ) = |trm(h i ) ∩ trm(f j )| max(|trm(h i )|, |trm(f j )|) Given the above scoring function, we construct the edge weights matrix (W ) as follows: W i jk =              -θggl jk (j, k) ∈ FG -θaal jk (j, k) ∈ FA θgal jk j ∈ FG, k ∈ FA θ qgl l jk + θqgss jk j ∈ FG, k = hi θ qal l jk + θ qal s jk j ∈ FA, k = hi (6) Here relevance scores are weighted by θ parameters which are clamped to [0, 1].

3.2. SUBGRAPH SELECTION VIA DIFFERENTIABLE BLACKBOX COMBINATORIAL SOLVERS

Given the above premises, the objective function is defined as: min -1(W i • Y i ) (7) We adopt the edge variable Y i ∈ {0, 1} (n+1)×(n+1) where Y i j,k (j ̸ = k) takes the value of 1 iff edge E i jk belongs to the subgraph and Y i jj takes the value of 1 iff V i j belongs to the subgraph. Given the above variable, the constraints are defined as follows: Answer selection constraint The candidate hypothesis should be part of the induced subgraph: j ∈ {hi} Y i jj = 1 (8) Edge and Node selection constraint If node V i j and V i k are selected then edges E i jk and E i kj will be selected. If node V i j is selected, then edge E jj will also be selected: Y i jk ≤ Y i jj ∀ (j, k) ∈ E (9) Y i jk ≤ Y kk ∀ (j, k) ∈ E (10) Y i jk ≥ Y jj + Y kk -1 ∀ (j, k) ∈ E (11) Abstract fact selection constraint Limit the number of abstract facts selected to M : i Y i jj ≤ M ∀j ∈ F A

3.3. ANSWER AND EXPLANATION SELECTION

The solved adjacency variable Ŷ i represents the selected edges for each candidate answer choice c i . Not all datasets provide gold explanations. Moreover, even when the gold explanations are available, they are only available for the correct answer with no explanations for the wrong answer. In order to tackle these shortcomings and ensure end-to-end differentiability, we use the softmax (σ) of the objective score (W i • Ŷ i ) as the probability score for each choice. We multiply each objective score W i • Ŷ i value by the temperature hyperparameter (T ) to obtain soft probability distributions γ i (where γ i = (W i • Ŷ i ) • T ). The aim is for the correct answer c ans to have the highest probability. In order to achieve this aim, we use the cross entropy loss l c as follows to calculate the answer selection loss L ans as follows: L ans = l c (σ(γ 1 , γ 2 , • • • γ n ), c ans ) If gold explanations are available, we complement L ans with explanation loss L exp . We employ binary cross entropy loss l b between the selected explanatory facts and gold explanatory facts F exp for the explanatory loss as follows: L exp = l b ( Ŷ ans [f 1 , f 2 , . . . , f k ], F exp ) We calculate the total loss (L) as weighted by hyperparameters λ ans , λ exp as follows: L = λ ans L ans + λ exp L exp (15) 4 EMPIRICAL EVALUATION

4.1. ANSWER AND EXPLANATION SELECTION

We use the WorldTree corpus (Xie et al., 2020) for training the evaluation of explanation and answer selection. The 4,400 question and explanations in the WorldTree corpus are split into three different subsets: train-set, dev-set and test-set. We use the dev-set to assess the explainability performance since the explanations for test-set are not publicly available. The background knowledge is consists of 5000 abstract facts from the WorldTree table store (WTree) (Xie et al., 2020) and over 100,000 is-a grounding facts from ConceptNet (Speer et al., 2017) . Baselines: We use the following baselines to compare against our approach for the WorldTree corpus: 1. BERT Base and BERT Large (Devlin et al., 2019) : To use BERT for this task, we concatenate every hypothesis with k retrieved facts, using the separator token [SEP] . We use the Hugging-Face (Wolf et al., 2019) implementation of BertForSequenceClassification, taking the prediction with the highest probability for the positive class as the correct answer. 2. ExplanationLP: Non-differentiable version of ExplanationLP. Using the constraints stated in Section 3, we fine-tune the θ parameters using Bayesian optimization and frozen STrans representations. This baseline aims to evaluate the impact of end-to-end fine-tuning over the nondifferentiable solver. 3. Diff-Explainer: Diff-Explainer has already exhibited better performance over other explainable multi-hop inference approaches, including ILP-based approaches including TableILP (Khashabi et al., 2016) , TupleILP (Khot et al., 2017) and graph-based neural approach PathNet (Kundu et al., 2019) . Similar to our approach, we use ExplanationLP constraints with Diff-Explainer. We use similar hyperparameters and knowledge base used in Thayaparan et al. (2022) . Metrics The answer selection is evaluated using accuracy. For evaluation of explanation selection, we use Precision@K. In addition to Precision@K, we introduce two new metrics to evaluate the truthfulness of the answer selection to the underlying inference. The metrics are as follows: Explanatory Consistency@K: Question/answer pair with similar explanations indicates similar underlying inference (Atanasova et al., 2022) . The expectation is that similar underlying inference would produce similar explanations (Valentino et al., 2021; 2022) . Given a test question Q t and retrieved explanations E t we find set of Questions Q s t = {Q 1 t , Q 2 t , . . . } with at least K overlap gold explanations along with the retrieved explanations E s t = {e 1 t , e 2 t , . . . }. Given this premise, Explanatory Consistency@K is defined as follows: e i t ∈E s t [e i t ∈ E t ] e i t ∈E s t |e i t | Explanatory Consistency measures out of questions/answer pairs with at least K similar gold explanations and how many of them share a common retrieved explanation. Faithfulness: The aim is to measure how much percentage of the correct prediction is derived from correct inference and wrong prediction is derived from wrong inference over the entire set. Let's say that the set of questions correctly answered as A Qc , wrongly answered questions A Qw , set of questions with at least one correctly retrieved explanation as A Q1 and set of questions where no correctly retrieved explanations A Q0 . Given this premise, Faithfulness is defined as follows: |A Qw ∩ A Q0 | + |A Qc ∩ A Q1 | |A Qc ∪ A Qw | A higher faithfulness implies that the underlying inference process is reflected in the final answer prediction. Table 1 illustrates the explanation and answer selection performance of Diff-Comb Explainer and the baselines. We report scores for Diff-Comb Explainer trained for only the answer and optimised jointly for answer and explanation selection. Since BERT does not provide explanations, we use facts retrieved from the fact retrieval for the best k configuration (k = 3) as explanations. We also report the scores for BERT without explanations. We draw the following conclusions from the results obtained in Table 1 (The performance increase here are expressed in absolute terms): (1) Diff-Comb Explainer improves answer selection performance over the non-differentiable solver by 9.47% with optimising only on answer selection and 10.89% with optimising on answer and explanation selection. This observation underlines the impact of the end-to-end fine-tuning framework. We can also observe that strong supervision with optimising explanation selection yields better performance than weak supervision with answer selection. (2) Diff-Comb Explainer outperforms the best transformer-based model by 14.14% for answer selection. This increase in performance demonstrates that integrating constraints with transformer-based architectures leads to better performance. (3) Diff-Comb Explainer outperform the best Diff-Explainer configuration (answer and explanation selection) by 0.56% even in the weak supervision setting (answer only optimization). We also outperform Diff-Explainer by 1.98% in the best setting. (4) Diff-Comb Explainer is better for selecting relevant explanations over the other constraint-based solvers. Diff-Comb Explainer outperforms the non-differentiable solver at Precision@K by 8.41% (k =1) and 6.05% (k =2). We also outperform Diff-Explainer by 3.63% (k =1) and 4.55% (k =2). The improvement of Precision@K over the Fact Retrieval only (demonstrated with BERT + FR) by 16.98% (k =1) and 24.74% (k =2) underlines the robustness of our approach to noise propagated by the upstream fact retrieval. (5) Our models also exhibit higher Explanatory Consistency over the other solvers. This performance shows that the optimization model is learning and applying consistent inference across different instances. We also outperform the fact retrieval model which was also a transformer-based model trained on gold explanations. (6) Answer prediction by black-box models like BERT do not reflect the explanation provided. This fact is indicated by the low Faithfulness score obtained by both BERT Base /BERT Large . In contrast, the high constraint-based solver's Faithfulness scores emphasise how the underlying inference reflects on the final prediction. In particular, our approach performs better than the non-differentiable models and Diff-Explainer. In summary, despite the fact that Diff-Explainer and Diff-Comb Explainer approaches use the same set of constraints, our model yields better performance, indicating that accurate predictions generated by ILP solvers are better than approximated sub-optimal results.

4.2. QUALITATIVE ANALYSIS

Table 2 illustrates some of the explanations extracted for ExplanationLP, Diff-Explainer and Diff-Comb Explainer. Both explanations and answer predictions in Question (1) are entirely correct for our model. In this example, both ExplanationLP and Diff-Explainer have failed to retrieve any correct explanations or predict the correct answer. Both the approaches are distracted by the strong lexical overlaps with the wrong answer. Question (1): Which measurement is best expressed in light-years?: Correct Answer: the distance between stars in the Milky Way. ExplanationLP Answer: the time it takes for planets to complete their orbits. Explanations: (i) a complete revolution; orbit of a planet around its star takes 1; one planetary year, (ii) a light-year is used for describing long distances Diff-Explainer Answer: the time it takes for planets to complete their orbits. Explanations: (i) a light-year is used for describing long distances, (ii) light year is a measure of the distance light travels in one year Diff-Comb Explainer Answer: the distance between stars in the Milky Way. Explanations: (i) light years are a astronomy unit used for measuring length, (ii) stars are located light years apart from each other Question (2) at least one explanation is correct and a correct answer prediction for our model. In the example provided, Diff-Explainer provides the correct answer prediction with both the retrieved facts not being explanatory. Diff-Explainer arrives at the correct answer prediction with no explanation addressing the correct answer. In Question (3) both our model and Diff-Explainer provide the correct answer but with both facts not being explanations. The aforementioned qualitative (Question 1 and 2) and quantitative measures (Explanatory Consistency@K, Faithfulness) indicate how the underlying explanatory inference results in the correct prediction; there are cases where false inference still leads to the correct answer with our model as well. In this case, the inference is distracted by the strong lexical overlaps irrelevant to the question. However, from the qualitative analysis, we can conclude that the explainable inference that happens with our model is more robust and coherent when compared to the Diff-Explainer and nondifferentiable models.

4.3. KNOWLEDGE AGGREGATION WITH INCREASING DISTRACTORS

One of the key characteristics identified by Thayaparan et al. (2022) is the robustness of Diff-Explainer to distracting noise. In order to evaluate if our model also exhibits the same characteristics, we ran our model for the increasing number of retrieved facts k and plotted the answer selection accuracy for WorldTree in Figure 3 . As illustrated in the Figure, similar to Diff-Explainer, our approach performance remains stable with increasing distractors. We also continue to outperform Diff-Explainer across all sets of k. BERT performance drops drastically with increasing distractors. This phenomenon is in line with existing work (Thayaparan et al., 2022; Yadav et al., 2019a) . We hypothesise that with increasing distractors, BERT overfits quickly with spurious inference correlation. On the other hand, our ap- proach circumvents this problem with the inductive bias provided by the constraint optimization layer.

4.4. COMPARING ANSWER SELECTION WITH ARC BASELINES

Table 3 presents a comparison of publicly reported baselines on the ARC Challenge-Corpus (Clark et al., 2018) and our approach. These questions have proven to be challenging to answer for other LP-based question answering and neural approaches. While models such as UnifiedQA (Khashabi et al., 2020) and AristoBERT (Xu et al., 2021) have demonstrated performance of 81.14 and 68.95, they have been trained on other question-answering datasets, including RACE (Lai et al., 2017) . Moreover, despite its performance, UnifiedQA does not provide explanations supporting its inference. In Table 3 , to provide a rigorous comparison, we only list models that have been trained only on the ARC corpus and provides explanations supporting its inference to ensure fair comparison. Here the explainability column indicates if the model delivers an explanation for the predicted answer. A subset of the models produces evidence for the answer but remains intrinsically black-box. These models have been marked as Partial. As illustrated in the Table 3 , Diff-Comb Explainer outperforms the best non-differentiable constraint-solver model (ExplanationLP) by 2.8%. We also outperform a transformer-only model AutoRocc by 1.97%. While our improvement over Diff-Explainer is small, we still demonstrate performance improvements for answer selection. On top of performances obtained for explanation and answer selection with WorldTree corpus, we have also established better performances than leaderboard approaches.

5. CONCLUSION

This paper proposed a novel framework for encoding explicit and controllable assumptions as part of an end-to-end learning framework for explainable multi-hop inference using Differentiable Blackbox Combinatorial Solvers (Pogančić et al., 2019) . We empirically demonstrated improved answer and explanation selection performance compared with the existing differentiable constraint-based solver for multi-hop inference (Thayaparan et al., 2022) . We also demonstrated performance gain and increased robustness to noise when combining constraints with transformer-based architectures. In this paper, we adopted the constraints of ExplanationLP, but it is possible to encode more complex inference constraints within the model. Diff-Comb Explainer builds on previous work by Thayaparan et al. (2022) and investigates the combination of symbolic knowledge (expressed via constraints) with neural representations. We hope this work will encourage researchers to encode different domain-specific priors, leading to more robust, transparent and controllable neuro-symbolic inference models for NLP.

6. APPENDIX

The pseudo-code to train Diff-Comb Explainer end-to-end is summarized in Algorithm 1. Experimental Setup: We employ the following experimental setup: • Sentence Transformer Model: ALL-MPNET-BASE-V2 (Song et al., 2020) . • Fact retrieval representation: ALL-MPNET-BASE-V2 trained with gold explanations of WorldTree Corpus to achieve a Mean Average Precision of 40.11 in the dev-set. • Fact retrieval: FAISS retrieval (Johnson et al., 2017) using pre-cached representations. • Background knowledge: 5000 abstract facts from the WorldTree table store (WTree) and over 100,000 is-a grounding facts from ConceptNet (CNet) (Speer et al., 2017) . • The experiments were carried out for k = {1, 2, 3, 5, 10, 20, 30, 40, 50} and the best configuration for each model is selected. • The hyperparameters λ, λ ans , λ exp , T were fine-tuned for 50 epochs using the Adpative Experimentation Platform. • M =2 for ExplanationLP, Diff-Explainer and Diff-Comb Explainer.

6.1. EXTERNAL CODE-BASES

• Differentiable Blackbox Combinatorial Solvers Examples: https://github.com/marti us-lab/blackbox-differentiation-combinatorial-solvers • Sentence Transformer code-base: https://huggingface.co/sentence-transformer s/all-mpnet-base-v2 



is used in camera to focus means to concentrate a convex lens causes light to refract and concentrate to magnify distant objects

Light rays are focused by the lens of a camera through the process of refraction is a kind of process

Figure1: Example of question, answer and explanations as graph(Xie et al., 2020;Jansen et al., 2018)

Figure 2: End-to-end architectural diagram of Diff-Comb Explainer. The integration of Differentiable Blackbox Combinatorial solvers will result in better explanation and answer prediction.

Figure 3: Accuracy for different number of retrieved facts.

Comparison of explanation and answer selection of Diff-Comb Explainer against other baselines. Explanation Selection was carried out on the dev set as the test explanation was not public available.

Question (2): Which type of precipitation consists of frozen rain drops?: Correct Answer: sleet. ExplanationLP Answer: snow. Explanations: (i) precipitation is when snow fall from clouds to the Earth, (ii) snow falls Diff-Explainer Answer: sleet. Explanations: (i) snow falls, (ii) precipitation is when water falls from the sky Diff-Comb Explainer Answer: sleet. Explanations: (i) sleet is when raindrops freeze as they fall, (ii) sleet is made of ice Question (3): Most of the mass of the atom consists of?: Correct Answer: protons and neutrons. ExplanationLP Answer: neutrons and electrons. Explanations: (i) neutrons have more mass than an electron, (ii) neutrons have more mass than an electron Diff-Explainer Answer: protons and neutrons. Explanations: (i) the atomic mass is made of the number of protons and neutrons, (ii) precipitation is when water falls from the sky Diff-Comb Explainer Answer: protons and neutrons. Explanations: (i) the atomic mass is made of the number of protons and neutrons, (ii) precipitation is when water falls from the sky

Example of predicted answers and explanations (Only CENTRAL explanations) obtained from our model with different levels of fine-tuning.

ARC challenge scores compared with other Fully or Partially explainable approaches trained only on the ARC dataset.

6.2. INTEGER LINEAR PROGRAMMING OPTIMIZATION

The components of the linear programming system is as follows:• Solver: Gurobi Optimization https://www.gurobi.com/products/gurobi-optimizer / The hyperparatemers used in the ILP constraints:• Maximum number of abstract facts (M ): 2Infrastructures used:• CPU Cores: 32 • CPU Model: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz • Memory: 128GB

6.3. HYPERPARAMETERS

For Diff-Comb Explainer we had to fine-tune hyperparameters λ, λ ans , λ exp , T . We fine-tune for 50 epochs using the Adpative Experimentation Platform with seed of 42.The bounds of the hyperparameters are as follows:• λ: [100, 300]The hyperparameters adopted for our approach are as follows:• λ: 152 The hyperparameters adopted for BERT are as follows:• gradient accumulation steps: 1 • learning rate: 1e-5• weight decay: 0.0 • adam epsilon: 1e-8• warmup steps: 0 • max grad norm: 1.0 • seed: 42We fine-tuned using 4 Tesla V100 GPUs for 10 epochs in total with batch size 32 for Base and 16 for Large.

6.4. DATA

WorldTree Dataset: Data can be obtained from: http://cognitiveai.org/explanationbank/ ARC-Challenge Dataset: https://allenai.org/data/arc. Only used the Challenge split.

