LEARNING TO DECEIVE KNOWLEDGE GRAPH AUGMENTED MODELS VIA TARGETED PERTURBATION

Abstract

Knowledge graphs (KGs) have helped neural models improve performance on various knowledge-intensive tasks, like question answering and item recommendation. By using attention over the KG, such KG-augmented models can also "explain" which KG information was most relevant for making a given prediction. In this paper, we question whether these models are really behaving as we expect. We show that, through a reinforcement learning policy (or even simple heuristics), one can produce deceptively perturbed KGs, which maintain the downstream performance of the original KG while significantly deviating from the original KG's semantics and structure. Our findings raise doubts about KG-augmented models' ability to reason about KG information and give sensible explanations.

1. INTRODUCTION

Recently, neural reasoning over knowledge graphs (KGs) has emerged as a popular paradigm in machine learning and natural language processing (NLP). KG-augmented models have improved performance on a number of knowledge-intensive downstream tasks: for question answering (QA), the KG provides context about how a given answer choice is related to the question (Lin et al., 2019; Feng et al., 2020; Lv et al., 2020; Talmor et al., 2018); for item recommendation, the KG mitigates data sparsity and cold start issues (Wang et al., 2018a;b; 2019a;b). Furthermore, by using attention over the KG, such models aim to explain which KG information was most relevant for making a given prediction (Lin et al., 2019; Feng et al., 2020; Wang et al., 2018b; 2019b; Cao et al., 2019; Gao et al., 2019).

Nonetheless, the process by which KG-augmented models reason about KG information is still not well understood. It is assumed that, like humans, KG-augmented models base their predictions on meaningful KG paths, and that this process is responsible for their performance gains (Lin et al., 2019; Feng et al., 2020; Gao et al., 2019; Song et al., 2019). In this paper, we question whether existing KG-augmented models actually use KGs in this human-like manner. We study this question primarily by measuring model performance when the KG's semantics and structure have been perturbed to hinder human comprehension. To perturb the KG, we propose four perturbation heuristics and a reinforcement learning (RL) based perturbation algorithm. Surprisingly, for KG-augmented models on both commonsense QA and item recommendation, we find that the KG can be extensively perturbed with little to no effect on performance. This raises doubts about KG-augmented models' use of KGs and the plausibility of their explanations.

2. PROBLEM SETTING

Our goal is to investigate whether KG-augmented models and humans use KGs similarly. Since KGs are human-labeled, we assume that they are generally accurate and meaningful to humans. Thus, across different perturbation methods, we measure model performance when every edge in the KG has been perturbed to make less sense to humans. To quantify the extent to which the KG has been perturbed, we also measure both semantic and structural similarity between the original KG and perturbed KG. If original-perturbed KG similarity is low, then a human-like KG-augmented model should achieve worse performance with the perturbed KG than with the original KG. Furthermore, we evaluate the plausibility of KG-augmented models' explanations when using original and perturbed KGs, by asking humans to rate these explanations' readability and usability. 
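For intuition, one simple instantiation of a structural similarity measure between the original and perturbed KGs is the Jaccard overlap of their edge sets. This is purely illustrative and is not the paper's actual metric; the function name and representation of facts as (head, relation, tail) tuples are assumptions of this sketch.

```python
def jaccard_edge_similarity(facts_orig, facts_pert):
    """Jaccard overlap between two collections of (head, relation, tail) facts.

    A score near 0 means the perturbed KG shares almost no edges with the
    original, i.e. it has drifted far from the original KG structurally.
    """
    T_orig, T_pert = set(facts_orig), set(facts_pert)
    union = T_orig | T_pert
    return len(T_orig & T_pert) / len(union) if union else 1.0
```

Under this toy measure, a perturbed KG that keeps downstream performance while scoring near 0 would be exactly the deceptive case the paper investigates.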

Notation

Let F_θ be a KG-augmented model, and let (X_train, X_dev, X_test) be a dataset for some downstream task. We denote a KG as G = (E, R, T), where E is the set of entities (nodes), R is the set of relation types, and T = {(e1, r, e2) | e1, e2 ∈ E, r ∈ R} is the set of facts (edges) composed from existing entities and relations (Zheng et al., 2018). Let G′ = (E, R′, T′) be the KG obtained after perturbing G, where R′ ⊆ R and T′ ≠ T. Let f(G, G′) be a function that measures similarity between G and G′. Let g(G) be the downstream performance when evaluating F_θ on X_test and G. Also, let ⊕ denote the concatenation operation, and let N_L(e) denote the set of L-hop neighbors of entity e ∈ E.

High-Level Procedure

First, we train F_θ on X_train and G, then evaluate F_θ on X_test and G to get the original performance g(G). Second, we freeze F_θ, then perturb G to obtain G′. Third, we evaluate F_θ on X_test and G′ to get the perturbed performance g(G′). Finally, we measure g(G) − g(G′) and f(G, G′) to assess how human-like F_θ's reasoning process is. This procedure is illustrated in Fig. 1. In this paper, we consider two downstream tasks: commonsense QA and item recommendation.

Commonsense QA

Given a question x and a set of k possible answers A = {y1, ..., yk}, the task is to predict a compatibility score for each (x, y) pair, such that the highest score is predicted for the correct answer. In commonsense QA, the questions are designed to require commonsense knowledge which is typically unstated in natural language but more likely to be found in KGs (Talmor et al., 2018). Let F^text_φ be a text encoder (Devlin et al., 2018), F^graph_ψ be a graph encoder, and F^cls_ξ be an MLP classifier, where φ, ψ, ξ ⊂ θ. Let G_(x,y) denote the subgraph of G consisting of entities mentioned in the text sequence x ⊕ y, plus their corresponding edges. We start by computing a text embedding h_text = F^text_φ(x ⊕ y) and a graph embedding h_graph = F^graph_ψ(G_(x,y)). After that, we compute the score for (x, y) as S_(x,y) = F^cls_ξ(h_text ⊕ h_graph). Finally, we select the highest-scoring answer: y_pred = arg max_{y ∈ A} S_(x,y). KG-augmented commonsense QA models vary primarily in their design of F^graph_ψ. In particular, path-based models compute the graph embedding by using attention to selectively aggregate paths in the subgraph. The attention scores can help explain which paths the model focused on most for a given prediction (Lin et al., 2019; Feng et al., 2020; Santoro et al., 2017).

Item Recommendation

We consider a set of users U = {u1, u2, ..., um}, a set of items V = {v1, v2, ..., vn}, and a user-item interaction matrix Y ∈ R^{m×n} with entries y_uv. If user u has been observed to engage with item v, then y_uv = 1; otherwise, y_uv = 0. Additionally, we consider a KG G, in which R is the set of relation types. In G, nodes are items v ∈ V, and edges are facts of the form (v, r, v′), where r ∈ R is a relation. For the zero entries in Y (i.e., y_uv = 0), the task is to predict a compatibility score for user-item pair (u, v), indicating how likely user u is to want to engage with item v. We represent each user u, item v, and relation r as embeddings u, v, and r, respectively. Given a user-item pair (u, v), its compatibility score is computed as ⟨u, v⟩, the inner product between u and v. KG-augmented recommender systems differ mainly in how they use G to compute u and v. Generally, these models do so by using attention to selectively aggregate items/relations in G. The attention scores can help explain which items/relations the model found most relevant for a given prediction (Wang et al., 2018b; 2019b).
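The attention-based scoring just described can be sketched minimally as follows. This is not any of the cited recommender models; the embedding tables, dimensions, and single-hop softmax attention over an item's KG neighbors are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                 # embedding dimension (illustrative)
user_emb = rng.normal(size=(5, d))     # one row per user u
item_emb = rng.normal(size=(8, d))     # one row per item v

def kg_item_embedding(v, neighbors):
    """Aggregate item v's KG neighbors into v's representation via attention.

    `neighbors` lists item indices linked to v in the KG. The softmax
    attention weights indicate which neighbors the model deems relevant,
    which is what such models surface as "explanations".
    """
    if not neighbors:
        return item_emb[v]
    logits = item_emb[neighbors] @ item_emb[v]   # relevance of each neighbor
    att = np.exp(logits - logits.max())
    att /= att.sum()                             # attention weights sum to 1
    return item_emb[v] + att @ item_emb[neighbors]

def compatibility(u, v, neighbors=()):
    """Score <u, v>: how likely user u is to engage with item v."""
    return float(user_emb[u] @ kg_item_embedding(v, list(neighbors)))
```

Perturbing the KG changes only `neighbors`; if scores like these barely move under heavy perturbation, the attention weights lose their explanatory value.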



Figure 1: Proposed KG Perturbation Framework. Our procedure consists of three main steps: (1) train the KG-augmented model on the original KG, then freeze the model; (2) obtain the perturbed KG by applying N = |T| perturbations to the full original KG; and (3) compare the perturbed KG's downstream performance to that of the original KG.
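The three steps in the caption can be summarized as a short driver routine. This is schematic: `train`, `evaluate`, and `perturb` stand in for the task-specific training loop, test-set evaluation g(·), and perturbation method, and are assumptions of this sketch.

```python
def perturbation_study(model, G, X_train, X_test, train, evaluate, perturb):
    """Schematic of the Figure 1 procedure.

    (1) Train the KG-augmented model F_theta on (X_train, G), then freeze it.
    (2) Apply N = |T| perturbations to G to obtain G'.
    (3) Evaluate the frozen model on X_test with G' and compare to G.
    Returns the performance drop g(G) - g(G').
    """
    train(model, X_train, G)                  # step 1 (model frozen afterwards)
    g_orig = evaluate(model, X_test, G)       # original performance g(G)
    G_pert = perturb(G)                       # step 2: perturb every edge
    g_pert = evaluate(model, X_test, G_pert)  # step 3: perturbed performance g(G')
    return g_orig - g_pert
```

A near-zero return value despite low original-perturbed KG similarity is precisely the deceptive outcome the paper reports.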

