HIERARCHICAL RELATIONAL LEARNING FOR FEW-SHOT KNOWLEDGE GRAPH COMPLETION

Abstract

Knowledge graphs (KGs) are powerful in terms of their inference abilities, but are also notorious for their incompleteness and long-tail distribution of relations. To address these challenges and expand the coverage of KGs, few-shot KG completion aims to make predictions for triplets involving novel relations when only a few training triplets are provided as reference. Previous methods have focused on designing local neighbor aggregators to learn entity-level information and/or imposing a potentially invalid sequential dependency assumption at the triplet level to learn meta relation information. However, pairwise triplet-level interactions and context-level relational information have been largely overlooked for learning meta representations of few-shot relations. In this paper, we propose a hierarchical relational learning method (HiRe) for few-shot KG completion. By jointly capturing three levels of relational information (entity-level, triplet-level and context-level), HiRe can effectively learn and refine meta representations of few-shot relations, and thus generalize well to new unseen relations. Extensive experiments on benchmark datasets validate the superiority of HiRe over state-of-the-art methods. The code is available at https://github.com/alexhw15/HiRe.git.

1. INTRODUCTION

Knowledge graphs (KGs) comprise a collection of factual triplets (h, r, t), where each triplet expresses the relationship r between a head entity h and a tail entity t. Large-scale KGs (Vrandečić & Krötzsch, 2014; Mitchell et al., 2018; Suchanek et al., 2007; Bollacker et al., 2008) can provide powerful inference capabilities for many intelligent applications, including question answering (Yao & Van Durme, 2014), web search (Eder, 2012) and recommendation systems (Wang et al., 2019). As KGs are often built semi-automatically from unstructured data, real-world KGs are far from complete and suffer from the notorious long-tail problem: a considerable proportion of relations are associated with only very few triplets. As a result, the performance of current KG completion methods significantly degrades when predicting relations with a limited number (few-shot) of training triplets.

To tackle this challenge, few-shot KG completion methods have been proposed, including GMatching (Xiong et al., 2018), MetaR (Chen et al., 2019), FSRL (Zhang et al., 2020), FAAN (Sheng et al., 2020) and GANA (Niu et al., 2021). Given a target relation r and K reference triplets, K-shot KG completion aims to correctly predict the tail entity t for each query triplet (h, r, ?) using the generalizable knowledge learned from the reference triplets. The crucial aspect of few-shot KG completion is thus to learn, from a limited number of reference triplets, a meta representation of each few-shot relation that can generalize to novel relations.

To facilitate the learning of meta relation representations, we identify three levels of relational information (see Figure 1). (1) At the context level, each reference triplet is closely related to its wider contexts, providing crucial evidence for enriching entity and relation embeddings. (2) At the triplet level, pairwise interactions among reference triplets reveal the commonality shared by triplets of the target relation. (3) At the entity level, the learned meta representation of the target relation should hold between the head and tail entities of unseen query triplets.

Methods        Entity-level   Triplet-level      Context-level
                              Seq.     Pair.
GMatching      ✓              ✗        ✗         ✗
MetaR          ✓              ✗        ✗         ✗
FSRL           ✓              ✓        ✗         ✗
FAAN           ✓              ✓        ✗         ✗
GANA           ✓              ✓        ✗         ✗
HiRe (ours)    ✓              ✗        ✓         ✓

Table 1: Summary of few-shot KG completion methods based on the levels of relational information used. "Seq." and "Pair." denote sequential and pairwise modeling at the triplet level, respectively.

In this paper, we propose a Hierarchical Relational learning framework (HiRe) for few-shot KG completion. HiRe jointly models three levels of relational information (entity-level, triplet-level, and context-level) within each few-shot task as mutually reinforcing sources of information to generalize to few-shot relations. Here, "hierarchical" refers to relational learning performed at three different levels of granularity. Specifically, we make the following contributions:

• We propose a contrastive learning based context-level relational learning method to learn expressive entity/relation embeddings by modeling correlations between the target triplet and its true/false contexts. We argue that a triplet itself has a close relationship with its true context. Thus, we take a contrastive approach, pulling a given triplet close to its true context while pushing it apart from its false contexts, to learn better entity embeddings.

• We propose a transformer based meta relation learner (MRL) to learn generalizable meta relation representations. The proposed MRL captures pairwise interactions among reference triplets while preserving the permutation-invariance property and remaining insensitive to the size of the reference set.

• We devise a meta representation based embedding learner, named MTransD, that constrains the learned meta relation representations to hold between unseen query triplets, enabling better generalization to novel relations.

• Lastly, we adopt a model-agnostic meta learning (MAML) based training strategy (Finn et al., 2017) to optimize HiRe on each meta task within a unified framework.
By performing relational learning at three levels of granularity, HiRe offers significant advantages for extracting expressive meta relation representations and improving model generalizability for few-shot KG completion. Extensive experiments on two benchmark datasets validate the superiority of HiRe over state-of-the-art methods.

2. RELATED WORK

2.1. RELATIONAL LEARNING IN KNOWLEDGE GRAPHS

KG completion methods utilize the relational information available in KGs to learn a unified low-dimensional embedding space for the input triplets. TransE (Bordes et al., 2013) is the first to use relation r as a translation, i.e., h + r ≈ t for a triplet (h, r, t); a scoring function then measures the quality of the translation and drives the learning of the embedding space. TransH (Wang et al., 2014) and TransR (Lin et al., 2015) further model relation-specific information for learning an embedding space. ComplEx (Trouillon et al., 2016), RotatE (Sun et al., 2019b) and ComplEx-N3 (Lacroix et al., 2018) improve the modeling of relation patterns in a vector/complex space. ConvE (Dettmers et al., 2018) and ConvKB (Nguyen et al., 2018) employ convolution operators to enhance entity/relation embedding learning. However, these methods require a large number of triplets for each relation to learn a unified embedding space. Their performance significantly degrades in few-shot settings, where only very few triplets are available for each relation.

2.2. FEW-SHOT KG COMPLETION

Existing few-shot KG completion methods can be grouped into two main categories. (1) Metric learning based methods: GMatching (Xiong et al., 2018) is the first work to formulate few-shot (one-shot) KG completion. GMatching consists of two parts: a neighbor encoder that aggregates one-hop neighbors of any given entity, and a matching processor that computes the similarity between query and reference entity pairs. FSRL (Zhang et al., 2020) relaxes the setting to more shots and explores how to integrate the information learned from multiple reference triplets. FAAN (Sheng et al., 2020) proposes a dynamic attention mechanism for designing one-hop neighbor aggregators. (2) Meta learning based methods: MetaR (Chen et al., 2019) learns to transfer relation-specific meta information, but it simply generates meta relation representations by averaging the representations of all reference triplets. GANA (Niu et al., 2021) puts more emphasis on neighboring information and accordingly proposes a gated and attentive neighbor aggregator. Despite their strong empirical performance, the aforementioned methods suffer from two major limitations. First, they focus on designing local neighbor aggregators to learn entity-level information. Second, they impose a potentially invalid sequential dependency assumption and rely on recurrent processors (i.e., LSTMs (Hochreiter & Schmidhuber, 1997)) to learn meta relation representations. Thus, current methods fail to capture pairwise triplet-level interactions and context-level relational information. Our work is proposed to fill this important research gap.

2.3. CONTRASTIVE LEARNING ON GRAPHS

As a self-supervised learning scheme, contrastive learning follows the instance discrimination principle, pairing instances according to whether they are derived from the same instance (positive pairs) or not (negative pairs) (Hadsell et al., 2006; Dosovitskiy et al., 2014). Contrastive methods have recently been proposed to learn expressive node embeddings on graphs. In general, these methods train a graph encoder that produces node embeddings together with a discriminator that distinguishes similar node embedding pairs from dissimilar ones. DGI (Velickovic et al., 2019) trains a node encoder to maximize mutual information between patch representations and high-level graph summaries. InfoGraph (Sun et al., 2019a) contrasts a global graph representation with substructure representations. Hassani & Khasahmadi (2020) propose to contrast encodings from one-hop neighbors and a graph diffusion. GCC (Qiu et al., 2020) is a pre-training framework that leverages contrastive learning to capture structural properties across multiple networks. We note that positive and negative pairs naturally exist in the few-shot KG completion problem, yet the potential of contrastive learning in this task remains under-explored. In our work, we adopt the idea of contrastive learning at the context level to capture correlations between a target triplet and its wider context, enriching the expressiveness of entity embeddings and improving model generalization for few-shot KG completion. To the best of our knowledge, we are the first to integrate contrastive learning with KG embedding learning for few-shot KG completion.

3. PROBLEM FORMULATION

In this section, we formally define the few-shot KG completion task and problem setting. The notations used in the paper can be found in Appendix A.

Definition 1 (Knowledge Graph G). A knowledge graph (KG) can be denoted as G = {E, R, TP}, where E and R are the entity set and the relation set, respectively, and TP = {(h, r, t) ∈ E × R × E} denotes the set of all triplets in the knowledge graph.

Definition 2 (Few-shot KG Completion). Given (i) a KG G = {E, R, TP}, (ii) a reference set S_r = {(h_i, t_i) ∈ E × E | (h_i, r, t_i) ∈ TP} that corresponds to a given relation r ∈ R, where |S_r| = K, and (iii) a query set Q_r = {(h_j, r, ?)} that also corresponds to relation r, the K-shot KG completion task aims to predict the true tail entity for each triplet from Q_r based on the knowledge learned from G and S_r. For each query triplet (h_j, r, ?) ∈ Q_r, given a set of candidates C_{h_j,r} for the missing tail entity, the goal is to rank the true tail entity highest among C_{h_j,r}.

As the definition states, few-shot KG completion is a relation-specific task: the goal is to correctly make predictions for new triplets involving relation r when only a few triplets associated with r are available. The training process is therefore organized into tasks, where each task T_r is to predict new triplets associated with a given relation r. Each meta training task T_r corresponds to a given relation r and is composed of a reference set S_r and a query set Q_r, i.e., T_r = {S_r, Q_r}:

S_r = {(h_1, t_1), (h_2, t_2), ..., (h_K, t_K)},
Q_r = {(h_1, C_{h_1,r}), (h_2, C_{h_2,r}), ..., (h_M, C_{h_M,r})},

where M is the size of the query set Q_r. The training set can be denoted as T_train = {T_i}_{i=1}^{I}; the test set T_test = {T_j}_{j=1}^{J} is denoted similarly. Note that all triplets corresponding to the relations in the test set are unseen during training, i.e., T_train ∩ T_test = ∅.
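As a concrete illustration, the task structure T_r = {S_r, Q_r} above can be sketched in code. This is a minimal sketch; the class and attribute names (FewShotTask, reference, queries) are our own, not part of the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class FewShotTask:
    """Hypothetical container for one meta task T_r = {S_r, Q_r}."""
    relation: str
    reference: list   # S_r: K (head, tail) entity pairs for relation r
    queries: list     # Q_r: (head, candidate-tail set) pairs to rank

    @property
    def k(self) -> int:
        # the "shot" count K is simply the reference-set size
        return len(self.reference)

task = FewShotTask(
    relation="playFor",
    reference=[("Lionel Messi", "Argentina's National Team")],
    queries=[("Sergio Agüero", {"Argentina's National Team", "FC Barcelona"})],
)
```

Under this view, 1-shot completion means `task.k == 1`, and evaluation ranks each query's candidate set.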

4. THE PROPOSED METHOD

In this section, we present our proposed learning framework in detail. As discussed earlier, existing K-shot KG completion methods learn only entity-level relational information and capture sequential dependencies between reference triplets, which prevents them from learning a more comprehensive and generalizable representation for the target relation. To fill this gap, we propose to perform relational learning at three hierarchical levels within each meta task (context-level, triplet-level and entity-level) for few-shot KG completion. An overview of our proposed hierarchical relational learning (HiRe) framework can be found in Appendix C.

4.1. CONTRASTIVE LEARNING BASED CONTEXT-LEVEL RELATIONAL LEARNING

Given a reference set S_r with K training triplets, existing works (e.g., Xiong et al. (2018); Niu et al. (2021)) seek to learn better head/tail entity embeddings by aggregating information from each entity's local neighbors, and then concatenate the two embeddings as the triplet representation. Although treating the neighborhoods of the head and tail entity separately is common practice in homogeneous graphs, we argue that this approach is sub-optimal for few-shot KG completion because it loses critical information. Taking Figure 2 as an example, jointly considering the wider context shared by the head and tail entity reveals crucial information, namely that both "Lionel Messi" and "Sergio Agüero" playFor "Argentina's National Team", for determining whether the relation workWith holds between the given entity pair ("Lionel Messi", "Sergio Agüero"). Notably, our statistical analysis further affirms that triplets in KGs indeed share a significant amount of context information (see Appendix B). Motivated by this observation, we propose to jointly consider the neighborhoods of the head and tail entity as the context of a given triplet, exploiting finer-grained relational information. To imbue the embedding of the target triplet with such contextual information, we employ a contrastive loss that contrasts the triplet with its true context against false ones.

Formally, given a target triplet (h, r, t), we denote its wider context as C_(h,r,t) = N_h ∪ N_t, where N_h = {(r_i, t_i) | (h, r_i, t_i) ∈ TP} and N_t = {(r_j, t_j) | (t, r_j, t_j) ∈ TP}. Our goal is to capture context-level relational information, i.e., the correlation between the given triplet (h, r, t) and its true context C_(h,r,t). We further propose a multi-head self-attention (MSA) based context encoder, which models the interactions within the given context C_(h,r,t) and assigns larger weights to more important relation-entity tuples within the context shared by the head entity h and tail entity t.
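The context definition above can be sketched as a small helper that collects the relation-entity tuples around both entities; the function name `triplet_context` and the toy triplets are our own illustration.

```python
def triplet_context(h, t, triplets):
    """Return C_(h,r,t) = N_h ∪ N_t: the (relation, entity) tuples
    adjacent to the head h and to the tail t."""
    n_h = {(r_i, t_i) for (h_i, r_i, t_i) in triplets if h_i == h}
    n_t = {(r_j, t_j) for (h_j, r_j, t_j) in triplets if h_j == t}
    return n_h | n_t

# toy KG echoing the Figure 2 example
kg = [
    ("Lionel Messi", "playFor", "Argentina's National Team"),
    ("Sergio Agüero", "playFor", "Argentina's National Team"),
    ("Sergio Agüero", "bornIn", "Buenos Aires"),
]
ctx = triplet_context("Lionel Messi", "Sergio Agüero", kg)
```

Because the two neighborhoods are taken as a union, the shared tuple ("playFor", "Argentina's National Team") appears once in the context, exactly the kind of shared evidence the section argues for.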
Specifically, given a target triplet (h, r, t) and its context C_(h,r,t), each relation-entity tuple (r_i, t_i) ∈ C_(h,r,t) is first encoded as re_i = r_i ⊕ t_i, where r_i ∈ R^d and t_i ∈ R^d are the relation and entity embeddings, respectively, and ⊕ denotes vector concatenation. An MSA block is then employed to uncover the underlying relationships within the context and generate the context embedding c:

c_0 = [re_1; re_2; ...; re_K], K = |C_(h,r,t)|,   (2)
c = Σ_{i=1}^{K} α_i · re_i,   (3)
α = MSA(c_0),   (4)

where c_0 is the concatenation of the embeddings of all relation-entity tuples and |x| denotes the size of set x. The self-attention scores among all relation-entity tuples from C_(h,r,t) are computed by Eq. 4; tuples with higher correlations are given larger weights and contribute more to the embedding of C_(h,r,t). The detailed implementation of MSA is given in Appendix D.1.

Additionally, we synthesize a group of false contexts {C̃_(h,r,t)} by randomly corrupting the relation or entity of each relation-entity tuple (r_i, t_i) ∈ C_(h,r,t). The embedding c̃_i of each false context is obtained via the same context encoder. We then use a contrastive loss to pull the embedding of the target triplet close to that of its true context and push it away from its false contexts:

L_c = -log [ exp(sim(h ⊕ t, c) / τ) / Σ_{i=0}^{N} exp(sim(h ⊕ t, c̃_i) / τ) ],   (5)

where N is the number of false contexts for (h, r, t), τ denotes the temperature parameter, h ⊕ t is the triplet embedding represented as the concatenation of its head and tail entity embeddings, and sim(x, y) measures the cosine similarity between x and y. In this way, we inject context-level knowledge into entity embeddings while attending to key elements within the context of the given triplet.
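The contrastive loss above is an InfoNCE-style objective and can be sketched as follows in PyTorch. This is a minimal sketch under our own assumptions about shapes and names (`context_contrastive_loss`, a default temperature of 0.1), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def context_contrastive_loss(triplet_emb, true_ctx, false_ctxs, tau=0.1):
    """Pull the triplet embedding h⊕t toward its true context embedding c
    and push it away from false context embeddings c̃_i.
    Shapes: triplet_emb (2d,), true_ctx (2d,), false_ctxs (N, 2d)."""
    ctxs = torch.cat([true_ctx.unsqueeze(0), false_ctxs], dim=0)       # (N+1, 2d)
    sims = F.cosine_similarity(triplet_emb.unsqueeze(0), ctxs, dim=-1) / tau
    # the true context sits at index 0, so cross-entropy against class 0
    # is exactly -log softmax(sims)[0]
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```

When the triplet embedding aligns with its true context and opposes the false ones, the loss approaches zero; flipping the roles makes it large.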

4.2. TRANSFORMER BASED TRIPLET-LEVEL RELATIONAL LEARNING

After obtaining the embeddings of all reference triplets, the next step is to learn the meta representation of the target relation r. For all reference triplets associated with the same relation r, our proposed MRL aims to capture the commonality among these reference triplets and obtain the meta representation of relation r. To comprehensively incorporate triplet-level relational information in the reference set, we apply an SAB to the embeddings of all reference triplets from S_r (see the details of SAB in Appendix D.2):

X = [x_1; x_2; ...; x_K], x_i ∈ R^{2d},   (6)
X′ = SAB(X) ∈ R^{K×2d},   (7)

where x_i denotes the embedding of the i-th reference triplet. The output of the SAB has the same size as the input X, but encodes pairwise triplet-triplet interactions within X. The transformed embeddings of the reference triplets, X′, are then fed into a two-layer MLP, and the meta representation R_Tr is obtained by averaging:

R_Tr = (1/K) Σ_{i=1}^{K} MLP(X′_i),   (8)

where the meta representation R_Tr is generated by averaging the transformed embeddings of all reference triplets. This ensures that R_Tr contains the fused pairwise triplet-triplet interactions among the reference set in a permutation-invariant manner.
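The MRL above can be sketched in PyTorch as a self-attention block followed by an MLP and mean pooling. This is an illustrative sketch: the layer sizes, the single attention head, and the class name `MetaRelationLearner` are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MetaRelationLearner(nn.Module):
    """Self-attention over K reference-triplet embeddings, an MLP,
    then mean pooling to produce the meta representation R_Tr."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, X):                      # X: (K, 2d) reference triplets
        Xb = X.unsqueeze(0)                    # add batch dim -> (1, K, 2d)
        H, _ = self.attn(Xb, Xb, Xb)           # pairwise triplet interactions
        H = self.norm(Xb + H)
        return self.mlp(H).mean(dim=1).squeeze(0)  # mean -> permutation-invariant
```

Because self-attention is permutation-equivariant and the final pooling is a mean, shuffling the reference triplets leaves R_Tr unchanged, which is the permutation-invariance property the section claims.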

4.3. META REPRESENTATION BASED ENTITY-LEVEL RELATIONAL LEARNING

A crucial aspect of few-shot KG completion is to warrant the generalizability of the learned meta representation: R_Tr should hold between (h_i, t_i) whenever h_i and t_i are associated with r. This motivates us to refine R_Tr under the constraints of true/false entity pairs. Translational models provide an intuitive solution by treating the relation as a translation, enabling us to explicitly model and constrain the learning of generalizable meta knowledge at the entity level. Following KG translational models (Bordes et al., 2013; Ji et al., 2015), we design a score function that accounts for the diversity of entities and relations to satisfy such constraints. We refer to our method as MTransD, as it captures meta translational relationships at the entity level. Given a target relation r, its corresponding reference set S_r and its meta representation R_Tr, we calculate a score for each entity pair (h_i, t_i) ∈ S_r by projecting the head/tail entity embeddings into a latent space determined simultaneously by the entities and the relation. The projection process and the score function are formulated as:

h_⊥i = r_p h_p^⊺ h_i + I^{m×n} h_i,   (9)
t_⊥i = r_p t_p^⊺ t_i + I^{m×n} t_i,   (10)
score(h_i, t_i) = ||h_⊥i + R_Tr − t_⊥i||_2,   (11)

where ||x||_2 is the ℓ2 norm of vector x, h_i/t_i are the head/tail entity embeddings, h_p/t_p are their corresponding projection vectors, r_p is the projection vector of R_Tr, and I^{m×n} is an identity matrix (Ji et al., 2015). In this way, the projection matrix of each head/tail entity is determined by both the entity itself and its associated relation. As a result, the projected tail entity embedding should be closest to the projected head entity embedding after being translated by R_Tr.
That is to say, for triplets associated with relation r, the corresponding meta representation R_Tr should hold between the projected head and tail entity embeddings at the entity level in the projection space. Considering the entire reference set, we further define a loss function as follows:

L(S_r) = Σ_{(h_i, t_i) ∈ S_r} max{0, score(h_i, t_i) + γ − score(h_i, t′_i)},   (12)

where γ is a hyper-parameter that determines the margin separating positive pairs from negative pairs, and score(h_i, t′_i) is the score of a negative pair (h_i, t′_i) obtained by negative sampling of the positive pair (h_i, t_i) ∈ S_r, i.e., (h_i, r, t′_i) ∉ G. We have now obtained the meta relation representation R_Tr for each few-shot relation r, along with a loss function on the reference set.
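The TransD-style projection, the score, and the margin loss above can be sketched compactly. This is a sketch under the simplifying assumption m = n (so the identity term keeps the entity embedding itself); the helper names are our own.

```python
import torch

def project(e, e_p, r_p):
    # TransD-style projection: (r_p e_p^T + I) e = (e_p · e) r_p + e
    return (e_p @ e) * r_p + e

def mtransd_score(h, t, r_meta, h_p, t_p, r_p):
    # || h_proj + R_Tr - t_proj ||_2 : small when R_Tr translates h to t
    return torch.norm(project(h, h_p, r_p) + r_meta - project(t, t_p, r_p), p=2)

def margin_loss(pos, neg, gamma=1.0):
    # hinge loss over (positive, negative) score pairs, as in L(S_r)
    return sum(torch.clamp(p + gamma - n, min=0.0) for p, n in zip(pos, neg))
```

With zero projection vectors the projections reduce to the identity, and the score recovers the plain translational criterion h + R_Tr ≈ t.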

4.4. MAML BASED TRAINING STRATEGY

Noting that L(S_r) in Eq. 12 is task-specific and should be minimized on the target task T_r, we adopt a MAML based training strategy (Finn et al., 2017) to optimize the parameters on each task T_r. The loss L(S_r) on the reference set is not used to train the whole model but to update intermediate parameters; please refer to Appendix E for the detailed training scheme of MAML. Specifically, the learned meta representation R_Tr can be further refined based on the gradient of L(S_r):

R′_Tr = R_Tr − l_r ∇_{R_Tr} L(S_r),   (13)

where l_r indicates the learning rate. Furthermore, for each target relation, the projection vectors h_p, r_p and t_p can also be optimized in the same manner so that the model can generalize and adapt to a new target relation. Following MAML, the projection vectors are updated as follows:

h′_p = h_p − l_r ∇_{h_p} L(S_r),   (14)
r′_p = r_p − l_r ∇_{r_p} L(S_r),   (15)
t′_p = t_p − l_r ∇_{t_p} L(S_r).   (16)

With the updated parameters, we project and score each entity pair (h_j, t_j) from the query set Q_r following the same scheme as for the reference set, and obtain the entity-level loss L(Q_r):

h_⊥j = r′_p h′_p^⊺ h_j + I^{m×n} h_j,   (17)
t_⊥j = r′_p t′_p^⊺ t_j + I^{m×n} t_j,   (18)
score(h_j, t_j) = ||h_⊥j + R′_Tr − t_⊥j||_2,   (19)
L(Q_r) = Σ_{(h_j, t_j) ∈ Q_r} max{0, score(h_j, t_j) + γ − score(h_j, t′_j)},   (20)

where (h_j, t′_j) is a negative pair generated in the same way as (h_i, t′_i). The optimization objective for training the whole model is to minimize L(Q_r) and L_c together:

L = L(Q_r) + λ L_c,   (21)

where λ is a trade-off hyper-parameter that balances the contributions of L(Q_r) and L_c.
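The MAML-style inner step above, which refines a parameter by one gradient step on the reference-set loss, can be sketched as follows; the function name `inner_update` and the callable-loss interface are our own illustration.

```python
import torch

def inner_update(param, support_loss_fn, lr=0.01):
    """One inner-loop step: param' = param - lr * grad L(S_r).
    The meta representation R_Tr and the projection vectors are all
    updated this way; support_loss_fn maps a parameter to a scalar loss."""
    loss = support_loss_fn(param)
    (grad,) = torch.autograd.grad(loss, param)
    return param - lr * grad
```

The refined parameters are then used to score the query set, and only the query-set loss (plus the contrastive term) updates the model itself, matching the outer/inner split described above.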

5. EXPERIMENTS

5.1. DATASETS AND EVALUATION METRICS

We conduct experiments on two widely used few-shot KG completion datasets, Nell-One and Wiki-One, which are constructed by (Xiong et al., 2018) . For fair comparison, we follow the experimental setup of GMatching (Xiong et al., 2018) , where relations associated with more than 50 but less than 500 triplets are chosen for few-shot completion tasks. For each target relation, the candidate entity set provided by GMatching is used. The statistics of both datasets are provided in Table 2 . We use 51/5/11 and 133/16/34 tasks for training/validation/testing on Nell-One and Wiki-One, respectively, following the common setting in the literature. We report both MRR (mean reciprocal rank) and Hits@n (n = 1, 5, 10) on both datasets for the evaluation of performance. MRR is the mean reciprocal rank of the correct entities, and Hits@n is the ratio of correct entities that rank in top n. We compare the proposed method against other baseline methods in 1-shot and 5-shot settings, which are the most common settings in the literature.
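The two evaluation metrics above can be computed directly from the 1-based rank of the true tail entity within each candidate list; a minimal sketch (function name is ours):

```python
def mrr_and_hits(ranks, ns=(1, 5, 10)):
    """MRR is the mean reciprocal rank of the correct entities;
    Hits@n is the fraction of correct entities ranked in the top n."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {n: sum(r <= n for r in ranks) / len(ranks) for n in ns}
    return mrr, hits

mrr, hits = mrr_and_hits([1, 2, 10, 50])
# mrr = (1 + 0.5 + 0.1 + 0.02) / 4 = 0.405; hits@10 = 3/4
```

Both metrics reward ranking the true tail entity high among the candidates provided by GMatching.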

5.2. BASELINES

For evaluation, we compare our proposed method against two groups of state-of-the-art baselines: conventional KG completion methods and few-shot KG completion methods.

5.3. IMPLEMENTATION DETAILS

Pretrained entity and relation embeddings are used for the initialization of our proposed HiRe. Following the literature, the embedding dimension is set to 100 and 50 for Nell-One and Wiki-One, respectively. On both datasets, we set the number of SABs to 1, and each SAB contains one self-attention head. We apply drop path with a drop rate of 0.2 to avoid overfitting. The maximum number of neighbors for a given entity is set to 50, the same as in prior works. For all experiments except the sensitivity test on the trade-off parameter λ in Eq. 21, λ is set to 0.05 and the number of false contexts for each reference triplet is set to 1. The margin γ in Eq. 12 is set to 1. We apply mini-batch gradient descent to train the model with a batch size of 1,024 on both datasets. The Adam optimizer is used with a learning rate of 0.001. We evaluate HiRe on the validation set every 1,000 steps and choose the best model within 30,000 steps based on MRR. All models are implemented in PyTorch and trained on a single Tesla P100 GPU.

5.4. COMPARISON WITH STATE-OF-THE-ART METHODS

Table 3 compares HiRe against baselines on Nell-One and Wiki-One under 1-shot and 5-shot settings. In general, conventional KG completion methods are inferior to few-shot KG completion methods, especially under the 1-shot setting. This is expected because conventional KG completion methods are designed for scenarios with sufficient training data. Overall, our HiRe method outperforms all baseline methods under both settings on both datasets, which validates its efficacy for few-shot KG completion, and it maintains this advantage as the number of reference triplets increases. In terms of MRR, Hits@10, Hits@5, and Hits@1, HiRe surpasses the second best performer by +3.8%, +7.1%, +6.7%, and +1.4% in the 1-shot setting, and by +2.7%, +8.3%, +7.5%, and +0.7% in the 5-shot setting on Nell-One. On Wiki-One, HiRe outperforms the second best method by +0.8%, +2.9%, +0.8%, and +0.5% in the 1-shot setting, and by +3.0%, +3.3%, +2.4%, and +3.8% in the 5-shot setting. HiRe achieves large performance improvements in terms of all metrics, proving that leveraging hierarchical relational information enhances the model's generalizability and leads to an overall improvement in performance.

Table 4: Ablation study of our proposed HiRe under 3-shot and 5-shot settings on Wiki-One.

5.5. ABLATION STUDY

Our proposed HiRe framework is composed of three key components. To investigate the contribution of each component to the overall performance, we conduct a thorough ablation study on both datasets under 3-shot and 5-shot settings. The detailed results on Wiki-One are reported in Table 4.

w/o MRL: We replace our transformer based MRL with an LSTM to generate meta relation representations. The resultant performance drop is significant; the MRR drops by 4.3% and 5.1% under the 3-shot and 5-shot settings, respectively. Although the use of an LSTM brings some improvement over simplistic averaging by capturing triplet-level relational information, the imposed unrealistic sequential dependency assumption results in limited performance gains. This demonstrates the necessity and superiority of our proposed transformer based MRL in capturing pairwise triplet-level relational information to learn meta representations of few-shot relations.

w/o Context: By ablating L_c from Eq. 21, we remove contrastive learning based context-level relational learning but retain triplet-level and entity-level relational information. The resultant performance drop, compared to jointly considering the context of the target triplet, verifies our assumption that semantic contextual information plays a crucial role in few-shot KG completion.

Similar conclusions can be drawn from the ablation results on Nell-One (see Appendix F).

5.6. HYPER-PARAMETER SENSITIVITY

We conduct two sensitivity tests on both datasets under 1/3/5-shot settings: one for the number of false contexts N and one for the trade-off parameter λ in Eq. 21 (see Appendix G for detailed results on Nell-One). For N, we test values of 1, 2, 4, and 6. As Figure 3 shows, HiRe performs best when N = 1 under all settings, and its performance slightly drops when N = 2. As N continues to increase, the performance of HiRe drops accordingly. One main reason is that too many false contexts dominate model training, causing the model to quickly converge to a sub-optimal state. For λ, since the value of the contrastive loss is significantly larger than that of the margin loss, λ should be small to ensure effective supervision from the margin loss; we therefore study values of λ between 0 and 0.5. As Figure 4 shows, HiRe achieves the best performance when λ = 0.05. With the contrastive loss enabled (i.e., λ > 0), HiRe consistently yields better performance, proving the efficacy of our contrastive learning based context-level relational learning.

6. CONCLUSION

This paper presents a hierarchical relational learning framework (HiRe) for few-shot KG completion. We investigate the limitations of current few-shot KG completion methods and identify that jointly capturing three levels of relational information is crucial for enriching entity and relation embeddings, which ultimately leads to better meta representation learning for the target relation and model generalizability. Experimental results on two commonly used benchmark datasets show that HiRe consistently outperforms current state-of-the-art methods, demonstrating its superiority and efficacy for few-shot KG completion. The ablation analysis and hyper-parameter sensitivity study verify the significance of the key components of HiRe.

APPENDIX A NOTATIONS

The notations and symbols used in this paper are summarized in Table 5.

B MOTIVATION: SHARED CONTEXT STATISTICS

One of our key motivations is that jointly considering the wider context shared by the head/tail entity reveals crucial information for learning expressive entity embeddings. To justify this motivation, we perform a statistical analysis on the Nell-One and Wiki-One datasets; the results are summarized in Table 6. Out of the 189,635 triplets in the Nell-One dataset, up to 39,234 triplets share entities in their contexts. That is, there exists at least one entity that is connected to both the head entity and the tail entity of the given triplet. These 39,234 triplets share 117,386 entities in total, i.e., almost three shared entities per triplet on average. Triplets that share entities in their contexts constitute more than 20.68% of the total triplets on Nell-One and 9.2% on Wiki-One. More strictly, triplets that share relation-entity tuples in their contexts (meaning that the head and tail entity are connected to the same entity by the same relation) constitute 13.77% on Nell-One and 4.6% on Wiki-One. Our analysis affirms that triplets in KGs indeed share a significant amount of context information; our method is thus designed to leverage such crucial information for learning more expressive entity embeddings.

D.1 DETAILS OF MULTI-HEAD SELF-ATTENTION

For context-level relational learning, we employ a Multi-Head Self-Attention (MSA) block (Vaswani et al., 2017) to uncover the underlying relationships within the context C_(h,r,t) of a given triplet (h, r, t) and generate the context embedding c. Taking a context that contains k relation-entity tuples as an example, the context embedding is initialized as c_0 ∈ R^{k×d_c} (d_c = 100 and d_c = 200 for Wiki-One and Nell-One, respectively).
This input is first transformed into three matrices: the query matrix Q ∈ R^{k×d_q}, the key matrix K ∈ R^{k×d_k} and the value matrix V ∈ R^{k×d_v}, with d_q = d_k = d_v = d_c. The attention function is then calculated in three steps, which can be unified into a single function:

Attention(Q, K, V) = softmax(Q K^⊤ / √d_k) · V.   (22)

The logic behind Eq. 22 is straightforward. Step 1 computes a score between each pair of relation-entity tuples from the context (Q K^⊤). Step 2 normalizes the scores by √d_k to enhance gradient stability for improved training, and Step 3 translates the scores into probabilities via softmax. Finally, each relation-entity tuple is updated by the weighted sum of values based on these probabilities. Overall, Eq. 3 and Eq. 4 in the main paper can be re-formulated as:

c = Attention(c_0) = Attention(Q, K, V) = A · V = softmax(Q K^⊤ / √d_k) · V,

where A = [α_1; α_2; ...; α_k]. Instead of performing a single attention function with d_c-dimensional queries, keys and values, it is beneficial to linearly project the queries, keys and values h times with different learned projections ("multi-head") (Vaswani et al., 2017). On each of these projected versions of queries, keys and values, the attention function is executed in parallel:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where the projections are parameter matrices W_i^Q ∈ R^{d_c×d_k}, W_i^K ∈ R^{d_c×d_k}, W_i^V ∈ R^{d_c×d_v} and W^O ∈ R^{h d_v×d_c}. In each parallel attention layer, d_k = d_v = d_c/h.
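The three steps above can be expressed directly in code; this is a generic sketch of scaled dot-product attention, not the paper's full MSA block (the function name is ours).

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Step 1: score every pair of tuples (Q K^T).
    Step 2: scale by sqrt(d_k) for gradient stability.
    Step 3: softmax into probabilities, then weighted sum of values."""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ V
```

Multi-head attention simply runs this function h times on linearly projected inputs and concatenates the results.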

D.2 DETAILS OF SET ATTENTION BLOCK

To comprehensively incorporate triplet-level relational information in the reference set S_r, we design a transformer based MRL using a set attention block (SAB) (Lee et al., 2019) to model interactions among all reference triplets in S_r. SAB is built upon multi-head attention and is defined as:

SAB(X) := LayerNorm(H + rFF(H)), where H = LayerNorm(X + MultiHead(X, X, X)),

where rFF denotes any row-wise feedforward layer and LayerNorm denotes layer normalization (Ba et al., 2016).
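A minimal NumPy sketch of such a set attention block follows. For brevity it substitutes a single projection-free attention head for MultiHead and a two-layer ReLU network for rFF; these simplifications (and all weight shapes) are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each row to zero mean and unit variance
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(X):
    # single-head, projection-free self-attention (stand-in for MultiHead(X, X, X))
    d = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d)) @ X

def rff(X, W1, b1, W2, b2):
    # row-wise feedforward layer: two linear maps with a ReLU in between
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def sab(X, W1, b1, W2, b2):
    H = layer_norm(X + self_attention(X))   # H = LayerNorm(X + MultiHead(X, X, X))
    return layer_norm(H + rff(H, W1, b1, W2, b2))
```

Because every operation acts on the whole set of rows at once, the output for n reference triplets has the same shape as the input, and no ordering of the rows is assumed.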

E MAML BASED TRAINING STRATEGY

The detailed MAML based training framework can be described as follows.

Input: training tasks T_train; background graph G_b
1: while not converged do
2:   Sample a task T_r = {S_r, Q_r} from T_train;
3:   Construct context C_(h,r,t) for each reference triplet (h, r, t) in S_r based on G_b;
4:   Encode C_(h,r,t) and produce context embedding c by Eq. 3-Eq. 4;
5:   Learn context-level relational information based on contrastive learning by Eq. 5;
6:   Learn triplet-level meta representation for relation r via transformer based MRL by Eq. 6-Eq. 8;
7:   Learn entity-level relational information via MTransD by Eq. 9-Eq. 11;
8:   Calculate the loss on the reference set L(S_r) by Eq. 12;
9:   Update the parameters based on L(S_r) by Eq. 13-Eq. 16;
10:  Calculate the loss on the query set L(Q_r) by Eq. 17-Eq. 20;
11:  Update the model parameters based on the overall loss function Eq. 21.
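The inner/outer structure of such a MAML based strategy (fast adaptation on the reference set, meta-update from the query set) can be sketched on a toy problem. The snippet below uses a first-order approximation and a 1-D linear regression task as a hypothetical stand-in; the model, losses and task distribution are placeholders, not HiRe's actual modules:

```python
import numpy as np

def loss_grad(w, X, y):
    # squared-error loss of the linear model y_hat = X @ w, and its gradient
    err = X @ w - y
    return (err ** 2).mean(), 2.0 * X.T @ err / len(y)

def maml_step(w, tasks, inner_lr=0.01, outer_lr=0.001):
    # tasks: list of (X_support, y_support, X_query, y_query)
    meta_grad = np.zeros_like(w)
    for Xs, ys, Xq, yq in tasks:
        _, g = loss_grad(w, Xs, ys)
        w_fast = w - inner_lr * g          # inner update on the reference (support) set
        _, gq = loss_grad(w_fast, Xq, yq)  # first-order meta-gradient from the query set
        meta_grad += gq
    return w - outer_lr * meta_grad / len(tasks)
```

One meta-training iteration thus adapts a copy of the parameters per task before the shared parameters are updated, mirroring steps 9-11 of the listing above.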

F ABLATION STUDY ON NELL-ONE

As discussed in Section 5.5, each component in our proposed HiRe framework plays an important role in few-shot KG completion. Here, we provide further ablation results on Nell-One in Table 7. These results support our findings reported in the main paper and confirm that removing any component leads to performance drops across all evaluation metrics.

Table 7: Ablation study of HiRe under 3-shot and 5-shot settings on Nell-One (rows ablate the MTransD, MRL and Context components; columns report MRR, Hits@10, Hits@5 and Hits@1 under each setting).

G HYPER-PARAMETER SENSITIVITY STUDY ON NELL-ONE

Figure 6 and Figure 7 report further sensitivity results on Nell-One for the number of false contexts N and the trade-off parameter λ. As shown in Figure 6, the performance of HiRe drops slightly as the number of false contexts increases, because model training converges to a sub-optimal state. Moreover, the best λ value on Nell-One is also 0.05, as shown in Figure 7. These findings are consistent with those drawn from the results on Wiki-One in Section 5.6.

As can be seen from the complexity analysis in Table 8, our proposed HiRe scales quadratically with the number of reference triplets in each task. Nevertheless, since the number of reference triplets is typically very small (i.e., 1, 3 or 5), HiRe scales reasonably well in practice.



(2) At the triplet level, capturing the commonality among the limited reference triplets is essential for learning meta relation representations. (3) At the entity level, the learned meta relation representations should generalize well to unseen query triplets.

Figure 1: Three levels of relational information: (a) Context-level, (b) Triplet-level, (c) Entity-level.

Figure 2: Entity neighborhoods vs. triplet context. Our method jointly considers the context of the target triplet to enable the identification of crucial information, as highlighted in the right figure.

Figure 3: Hyper-parameter sensitivity study with respect to the number of false contexts N on Wiki-One.

Figure 5: An overview of the HiRe framework, composed of three key components: (1) contrastive learning based context-level relational learning; (2) transformer based triplet-level relational learning; and (3) meta representation based entity-level relational learning. Given a target relation r and its corresponding reference set S_r and query set Q_r, we employ a contrastive loss L_c between the true/false contexts and the anchor triplet (taking (h_1, r, t_1) as an example) via our contrastive learning based context-level relational learning method. The meta representation R_{T_r} of the target relation is learned by our transformer based meta relation learner (MRL), capturing pairwise triplet-level relational information. Lastly, MTransD refines the learned meta relation representation at the entity level, constrained by L(S_r). The whole learning framework is optimized by a MAML based training strategy.

Step 1: Compute the scores between the input matrices as S = Q · K^⊤; Step 2: Normalize the scores for gradient stability as S_n = S / √d_k; Step 3: Translate the normalized scores into probabilities with the softmax function, A = softmax(S_n); Step 4: Obtain the weighted value matrix C = A · V.


Figure 6: Hyper-parameter sensitivity study with respect to the number of false contexts on Nell-One.

Figure 7: The impact of different λ values in Eq. 21 on Nell-One. λ = 0 means that we remove the contrastive learning based context-level relational learning.

State-of-the-art models (e.g., FSRL (Zhang et al., 2020) and GANA (Niu et al., 2021)) utilize LSTMs to aggregate reference triplets, which inevitably imposes an unrealistic sequential dependency assumption, since LSTMs are designed to model sequential data. However, reference triplets associated with the same relation are not sequentially dependent on each other: the occurrence of one reference triplet does not necessarily lead to the other triplets in the reference set. Consequently, these LSTM based models fail to satisfy two important properties. First, the model should be insensitive to the size of the reference set (i.e., the few-shot size K). Second, the model should be invariant to permutations of the triplets in the reference set. To address these issues, we instead model complex pairwise interactions among reference triplets to learn generalizable meta relational knowledge.
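The permutation-invariance property can be checked numerically. In the sketch below (single-head, projection-free attention, purely for illustration), an attention based aggregator yields the same pooled representation for any ordering of the reference triplets, whereas a position-weighted "sequential" encoder, mimicking the order sensitivity of an LSTM, does not:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn_pool(X):
    # self-attention over the set of triplet embeddings, then mean-pooled;
    # single-head and projection-free for brevity
    A = softmax(X @ X.T / np.sqrt(X.shape[-1]))
    return (A @ X).mean(axis=0)

def seq_pool(X):
    # position-weighted sum: the result depends on the input order,
    # standing in for an order-sensitive (LSTM-like) aggregator
    w = np.arange(1, len(X) + 1)[:, None]
    return (w * X).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))      # 5 reference triplet embeddings
perm = np.array([4, 3, 2, 1, 0])     # reverse the reference order

print(np.allclose(attn_pool(X), attn_pool(X[perm])))  # True: permutation-invariant
print(np.allclose(seq_pool(X), seq_pool(X[perm])))    # False: order-dependent
```

Self-attention is permutation-equivariant (permuting the rows of X permutes the output rows identically), so any symmetric pooling of its output is permutation-invariant.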

Statistics of datasets.

Comparison against state-of-the-art methods on Nell-One and Wiki-One. MetaR-I and MetaR-P indicate the In-train and Pre-train of MetaR (Chen et al., 2019), respectively. OOM indicates out of memory.

Notations and Symbols.

Statistical results on Nell-One and Wiki-One. We show the number of triplets that share entities or relation-entity tuples in their contexts. "Shared tuples" means that the head entity and tail entity are connected to the same entity by the same relation in the context, and "shared entities" means that the head entity and tail entity are connected to the same entity by any relation, as illustrated in the left figure. All numbers are calculated on the training set.

C OVERVIEW OF THE PROPOSED HIRE FRAMEWORK

Figure 5 shows an overview of our proposed hierarchical relational learning (HiRe) framework.

Complexity analysis of three hierarchical relational learning modules with respect to the number of parameters and the number of multiplication operations in each epoch.

Table 8 lists the complexity of all the hierarchical relational learning modules, where d denotes the dimension of entity embeddings, k denotes the number of relation-entity tuples in the context, and n denotes the number of reference triplets in each task.

