BEYOND LINK PREDICTION: ON PRE-TRAINING KNOWLEDGE GRAPH EMBEDDINGS

Abstract

Knowledge graph embedding (KGE) models provide low-dimensional representations of entities and relations in a knowledge graph (KG). Most prior work focuses on training and evaluating KGE models for the task of link prediction; the question of whether KGE models provide useful representations more generally remains largely open. In this work, we explore the suitability of KGE models (i) for more general graph-structure prediction tasks and (ii) for downstream tasks such as entity classification. For (i), we found that commonly trained KGE models often perform poorly at structural tasks other than link prediction. Based on this observation, we propose a more general multi-task training approach, which includes additional self-supervised tasks such as neighborhood prediction or domain prediction. In our experiments, these multi-task KGE models showed significantly better overall performance for structural prediction tasks. For (ii), we investigate whether KGE models provide useful features for a variety of downstream tasks. Here we view KGE models as a form of self-supervised pre-training and study the impact of both model training and model selection on downstream task performance. We found that multi-task pre-training can (but does not always) significantly improve performance and that KGE models can (but do not always) compete with or even outperform task-specific GNNs trained in a supervised fashion. Our work suggests that more research is needed on the relation between pre-training KGE models and their suitability for downstream applications.

1. INTRODUCTION

Knowledge graph embeddings (KGE) provide low-dimensional representations of the entities and relations of a knowledge graph (KG). Although a large number of KGE models have been proposed in the literature (see, for example, the surveys of Nickel et al. (2015), Wang et al. (2017), and Ji et al. (2021)), most prior work focuses on the task of link prediction, i.e., answering questions such as (Austin, capitalOf, ?) by reasoning over an incomplete KG. Beyond link prediction, it is often argued that KGEs can provide representations that capture semantic properties of the entities and, indeed, pre-trained KGE models have been used to inject structured knowledge into language models (He et al., 2020; Zhang et al., 2019), visual models (Baier et al., 2017), recommender systems (El-Kishky et al., 2022; Wang et al., 2018), question answering systems (Ilyas et al., 2022), and other types of downstream models (Wang et al., 2017). The question of whether pre-trained KGE models provide generally useful representations remains largely open. Likewise, it is not well understood how choices made during model training and model selection affect these representations. In this work, we shed light on these questions from multiple directions. First, we study the suitability of out-of-the-box KGE models for basic graph-structure prediction tasks beyond link prediction. In particular, we consider the tasks of predicting the relation of a triple, as suggested by Chang et al. (2020) (e.g., the relationship between Austin and Texas), the domain and range of a relation (e.g., whether Austin is a capital), as well as the entity and relation neighborhood of each entity (e.g., which other entities are related to Austin). Perhaps surprisingly, we found that commonly trained KGE models often performed poorly on such tasks, challenging the intuition that KGE models capture graph structure well.
Second, we investigate whether KGE models are suitable pre-trained representations for node-level downstream tasks such as entity classification (e.g., predicting the profession of a person) or regression (e.g., predicting the average rating of a movie). To do so, we conducted an empirical study using 27 downstream tasks on two different KGs. We found that out-of-the-box KGE models often perform decently on these tasks and, in fact, the best KGE models can (but do not always) exceed the performance of recent graph neural networks such as KE-GCN (Yu et al., 2021). However, the KGE models with the best downstream task performance were often not the best-performing models for link prediction. For example, we found that the basic TransE model (Bordes et al., 2013) may be superior to KGE models more suited to link prediction such as ComplEx (Trouillon et al., 2016) or RotatE (Sun et al., 2019). This suggests that link prediction performance is not necessarily indicative of downstream task performance. Both of these findings suggest that the focus on link prediction tasks is too narrow for pre-training KGE models, i.e., for providing generally useful features. We thus explore whether the performance of KGE models for both graph-structure prediction and downstream tasks can be improved by better pre-training and model selection. Inspired by multi-task approaches in other areas, such as natural language processing (Aribandi et al., 2022; Sanh et al., 2022) or computer vision (Doersch & Zisserman, 2017), we included the graph-structure prediction tasks discussed above as additional training objectives and as evaluation measures during model selection. In particular, we propose a multi-task training (MTT) and a multi-task ranking (MTR) approach, both of which can be used with an arbitrary KGE model class and without a substantial increase in computational cost.
In our experimental study, the resulting multi-task KGE models had significantly better overall performance for graph-structure prediction tasks and often (but not always) also led to better downstream task performance. We also found that downstream task performance could be further improved by using a smaller set of pre-training tasks. The results suggest that the optimal choice of tasks depends on the dataset, KGE model class, and downstream task, and may be difficult to determine in practice. In summary, the contributions of this paper are as follows: (i) We show empirically that commonly trained KGE models fail at basic graph-structure prediction tasks beyond link prediction. (ii) We propose novel multi-task training and ranking approaches that address this shortcoming. (iii) We explore the impact of standard and multi-task training as well as different approaches for model selection on downstream task performance. (iv) We contextualize KGE model performance with results obtained from recent graph neural networks, which, in contrast to KGE models, are trained directly on each downstream task. Although our work takes a step toward improved pre-training of KGE models, it also suggests that more research is needed on the relation between pre-training KGE models and their general suitability for downstream applications.

2. PRELIMINARIES AND RELATED WORK

We briefly describe KGE models, training and evaluation methods for link prediction, as well as prior work on other tasks. A more comprehensive discussion can be found in surveys such as (Nickel et al., 2015; Wang et al., 2017; Ji et al., 2021) .

Link prediction.

A knowledge graph G ⊆ E × R × E is a collection of (subject, predicate, object)-triples over a set E of entities and a set R of relations. Triples represent known facts such as (Austin, capitalOf, Texas). In the KGE literature, the link prediction task is the task of inferring the subject or object of questions of form (?, capitalOf, Texas) and (Austin, capitalOf, ?), respectively.

KGE models. KGE models (Sun et al., 2019; Trouillon et al., 2016; Bordes et al., 2013) represent each entity and each relation of a KG with a low-dimensional embedding, commonly a real or complex vector. KGE models have an associated scoring function s : E × R × E → R that associates each triple with a real-valued score. Intuitively, high scores indicate plausible triples, low scores implausible triples. Commonly, the scoring function depends on the input triple only through the embeddings of its arguments. For example, TransE (Bordes et al., 2013) is a translation-based model with s(i, k, j) = −‖e_i + r_k − e_j‖, where e_i ∈ R^d and r_k ∈ R^d denote entity and relation embeddings of dimensionality d > 0, respectively. Scoring functions can be more involved, e.g., based on convolutional neural networks (Dettmers et al., 2018) or transformers (Chen et al., 2021a).

Standard training. KGE models are commonly trained on the link prediction task. We only give a high-level description here. For each triple (s, p, o) in the training data G_train, KGE models are trained such that the score s(s, p, o) is high (a positive) but, for certain choices of o′ ∈ E with (s, p, o′) ∉ G_train, the score s(s, p, o′) is low (a negative); similarly for subjects s′ ∈ E with (s′, p, o) ∉ G_train. The actual cost function varies across training types (e.g., sampled negatives or all negatives), loss functions (e.g., cross entropy), and more generally the choice of hyperparameters; see (Ali et al., 2021; Ruffinelli et al., 2020) for a more detailed discussion and experimental comparison.
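As an illustration, the TransE score of a single triple can be computed directly from its embeddings. The following is a minimal NumPy sketch; real implementations operate on batches of triples and may use the L1 norm or squared L2 norm instead.

```python
import numpy as np

def transe_score(e_s, r_p, e_o):
    # TransE models plausible triples as translations: e_s + r_p ≈ e_o.
    # The score is the negated distance, so plausible triples score higher.
    return -np.linalg.norm(e_s + r_p - e_o)
```

A triple whose embeddings satisfy the translation exactly receives the maximum possible score of 0; all other triples score strictly lower.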
Standard evaluation. The most commonly used evaluation protocol for KGE models is entity ranking (ER), which is also based on link prediction. Given a test triple (s, p, o) ∉ G_train, the model is used to answer the link prediction queries (s, p, ?) and (?, p, o). In particular, the scores of all possible answers that do not already occur in the training data are computed. The model is evaluated based on the ranks of the test answers o and s, respectively. Common metrics are the mean reciprocal rank (MRR) and Hits@K. The reliability of entity ranking for assessing model performance has been studied and questioned, e.g., in (Safavi & Koutra, 2020; Tiwari et al., 2021; Zhou et al., 2022; Wang et al., 2019). In contrast, our focus is mostly on other evaluation tasks.

Other training approaches. RESCAL (Nickel et al., 2011), one of the earliest KGE models, trained on the reconstruction task. Such tasks aim to reconstruct the entire training data using cost functions such as Σ_{(s,p,o)} (I[(s, p, o) ∈ G_train] − s(s, p, o))², where I[•] is a 0/1 indicator. A similar approach was explored by Li et al. (2021). We do not consider such methods further because training costs are excessive (at least unless squared error is used) and the empirical performance reported by Li et al. (2021) is generally far behind that of KGE models trained with link prediction. Chen et al. (2021b) proposed to augment the link prediction task with relation prediction during training (but not evaluation). We expand upon this work by considering additional pre-training tasks and by focusing on graph-structure prediction and downstream task performance instead.

Other evaluation approaches. In early (and rarely in recent) work, KGE models were evaluated using triple classification (Socher et al., 2013; Wang et al., 2014; Lin et al., 2015; Wang et al., 2022).
We do not consider this task in this work because performance estimates are typically overly optimistic and misleading unless hard negatives are used (Safavi & Koutra, 2020); such hard negatives are generally not available. Chang et al. (2020) evaluated KGE models on the relation prediction task, which we also consider as one of the evaluation tasks in this work. There is also work on explaining or interpreting KGE models (Meilicke et al., 2018; Allen et al., 2021; Rim et al., 2021), whereas our focus is on studying whether such models provide useful representations in the first place. As mentioned in the introduction, pre-trained KGE models have been used as components in language models (He et al., 2020; Zhang et al., 2019), visual models (Baier et al., 2017), recommender systems (El-Kishky et al., 2022; Wang et al., 2018), and question answering systems (Ilyas et al., 2022). Likewise, Pezeshkpour et al. (2018) and Jain et al. (2021) evaluated pre-trained KGE models on entity classification or regression tasks, as we do. We expand on this line of work by using a larger set of tasks (graph-structure prediction and more downstream tasks), by proposing improved pre-training methods, and by studying the impact of pre-training on downstream task performance.
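To make the entity ranking protocol described above concrete, the following sketch computes a filtered rank and the resulting MRR from raw answer scores. This is a simplified illustration; library implementations such as LibKGE additionally handle score ties and batched evaluation.

```python
def filtered_rank(scores, true_answer, known_answers):
    # Rank of the true answer among all candidates, after removing
    # ("filtering") answers that already occur in the training data.
    candidates = {a: s for a, s in scores.items()
                  if a == true_answer or a not in known_answers}
    target = candidates[true_answer]
    return 1 + sum(1 for s in candidates.values() if s > target)

def mrr(ranks):
    # Mean reciprocal rank: 1.0 means the true answer is always ranked first.
    return sum(1.0 / r for r in ranks) / len(ranks)
```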

3. GRAPH-STRUCTURE PREDICTION

In addition to link prediction, we explore the suitability of KGE models for other basic graph-structure prediction tasks. An example and summary is given in Table 1. We describe the form of the queries for each task as a triple such as (s, ?, *), where s and o denote input entities, p denotes an input relation, ? denotes the prediction target, and * acts as a wildcard. Using this notation, we consider the following tasks and queries:

• Link prediction (LP): Given a relation and a subject, predict the object (denoted (s, p, ?)). Likewise, given a relation and an object, predict the subject (denoted (?, p, o)).
• Relation prediction (REL; Chang et al. (2020); Chen et al. (2021b)): Given two entities s and o, predict the relation between them (denoted (s, ?, o)).
• Domain prediction (DOM): Given a relation, predict its domain (denoted (?, p, *)) or its range (denoted (*, p, ?)).
• Entity neighborhood prediction (NBE): Given a subject entity, predict related objects (denoted (s, *, ?)). Likewise, given an object, predict related subjects (denoted (?, *, o)).
• Relation neighborhood prediction (NBR): Given an entity, predict the relations in which it occurs as subject (denoted (s, ?, *)) and in which it occurs as object (denoted (*, ?, o)).

Table 1: Graph-structure prediction tasks used for self-supervised pre-training and evaluation along with example queries. Here ? denotes the prediction target and * acts as a wildcard.


Note that we use the wildcard to denote existential quantification. For example, given a ground-truth KG G and domain prediction query (?, p, *), an entity s ∈ E is a correct answer if there exists an entity o ∈ E such that (s, p, o) ∈ G. We chose this particular set of tasks because they are simple, they capture basic information about the graph structure beyond link prediction, and they only have one prediction target (an entity or a relation). The latter property allows efficient pre-training and evaluation, as discussed below. For this reason, we exclude tasks such as entity-pair prediction (Wang et al., 2019) (denoted (?, p, ?) in our notation) or reconstruction (Nickel et al., 2011) (denoted (?, ?, ?)). In our experimental study, we also found that the exclusion of some of the above pre-training tasks (e.g., LP) can further improve downstream task performance. The optimal choice of tasks depends on the dataset, KGE model, and downstream task, however. We leave the exploration of task selection, as well as of additional pre-training tasks, to future work.
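The existential semantics of wildcard queries can be made explicit on a toy KG. The following is an illustrative sketch only; the entity and relation names are hypothetical.

```python
# Toy KG as a set of (subject, predicate, object) triples.
G = {
    ("Austin", "capitalOf", "Texas"),
    ("Paris", "capitalOf", "France"),
    ("Austin", "locatedIn", "USA"),
}

def domain_answers(G, p):
    # DOM query (?, p, *): s is a correct answer iff (s, p, o) ∈ G for some o.
    return {s for (s, r, o) in G if r == p}

def range_answers(G, p):
    # Range query (*, p, ?): o is a correct answer iff (s, p, o) ∈ G for some s.
    return {o for (s, r, o) in G if r == p}
```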

Multi-task ranking (MTR).

To evaluate the performance of KGE models on the graph-structure prediction tasks, we generalize the entity ranking (ER) protocol for link prediction. Intuitively, for each of the nine tasks (REL as well as LP/DOM/NBE/NBR for both subject targets and object targets), we construct a query from each test triple, obtain a ranking of the prediction targets (entity or relation) that do not already occur in the training data, and then use metrics such as MRR or Hits@K. The final MTR metric is given by the micro-average over all nine tasks. We now describe how to obtain task-specific rankings. First, for a REL query of form (s, ?, o), we proceed as in (Chang et al., 2020) and rank all r′ ∈ R such that (s, r′, o) ∉ G_train in descending order of their scores s(s, r′, o). For the other tasks, which involve wildcards, it is not immediately clear how to perform prediction using a KGE model. We first discuss scoring and ranking, then filtering of training data. Consider for example the NBR query (s, ?, *), where our goal is to rank relations. Perhaps the simplest approach to obtain a relation ranking is to first rank all triples of form (s, r′, o′), where r′ ∈ R and o′ ∈ E, and then rank relations by their first appearance (e.g., the relation of the highest-scoring triple is ranked at the top). More generally, we make use of an extended score function that accepts wildcards. The approach just described corresponds to using s(s, r′, *) = max_{o′ ∈ E} s(s, r′, o′), i.e., the score of a relation r′ is the score of its most plausible triple. Although other aggregation functions are feasible, we only consider max-aggregation because it does not make any additional assumptions on the scoring function. To filter training data during model evaluation, we remove all relations r′ such that (s, r′, o′) ∈ G_train for some o′ ∈ E; i.e., we remove all prediction targets that are already implied by the training data. We proceed similarly for all other tasks involving wildcards.
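The max-aggregated scoring and filtering just described can be sketched as follows. This is a naive illustration: `score_fn` stands in for an arbitrary KGE scoring function, and an efficient implementation would batch the score computations instead of looping.

```python
def nbr_ranking(score_fn, s, relations, entities, g_train):
    # Rank relations for the NBR query (s, ?, *) using the extended score
    # s(s, r, *) = max over o of s(s, r, o).
    extended = {}
    for r in relations:
        # Filtering: skip relations already implied by the training data.
        if any((s, r, o) in g_train for o in entities):
            continue
        extended[r] = max(score_fn(s, r, o) for o in entities)
    # Relations in descending order of the score of their most plausible triple.
    return sorted(extended, key=extended.get, reverse=True)
```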
Note that the number of score computations needed to predict entity targets for queries without wildcards is O(|E|), whereas the one for queries with wildcards is O(|E||R|). We discuss below how the latter cost can be reduced to O(|E|).

Multi-task training (MTT).

We now generalize standard KGE model training to all of the graph-structure prediction tasks. Our goal is to improve KGE model performance on these tasks, while at the same time keeping training and prediction costs low. We do this by first constructing a task-specific cost function for each individual task; the final cost function is then given as a weighted linear combination of the task-specific costs (and additional regularization terms), where the weights are hyperparameters. The task-specific cost functions for link prediction and relation prediction are obtained as in standard training (Sec. 2): For each positive triple (s, p, o) ∈ G, we construct a set of negatives according to the query (i.e., by perturbing the position of the prediction target) and then apply the loss function (e.g., cross entropy). For the other tasks, which involve wildcards, we proceed differently. Instead of performing some form of (costly) score aggregation during training, we "convert" tasks with wildcards into tasks without wildcards. To do so, we make use of three virtual wildcard entities, one for subjects (any_S), one for relations (any_R), and one for objects (any_O), and learn embeddings for these entities. During training, we conceptually replace wildcards by their corresponding wildcard entity and proceed as before. For example, for training triple (s, p, o) and NBR query (s, ?, *), we consider the virtual triple (s, p, any_O) along with query (s, ?, any_O). By doing so, we converted the NBR task into a REL task. We use the so-obtained wildcard embeddings in the same fashion at prediction time; e.g., we set s(s, r′, *) = s(s, r′, any_O). Instead of performing score aggregation, the model thus directly learns extended scores.
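The conversion of wildcard tasks into wildcard-free ones via virtual entities can be sketched as follows. This is an illustrative sketch: the constant names are our own, "?" marks the prediction target, and in the actual training the virtual symbols receive learned embeddings like any other entity or relation.

```python
ANY_S, ANY_R, ANY_O = "any_S", "any_R", "any_O"  # virtual wildcard entities

def training_queries(s, p, o):
    # One (query, answer) pair per pre-training task for a training triple
    # (s, p, o). Each wildcard (*) is replaced by its virtual entity, so all
    # nine tasks become wildcard-free and trainable like LP or REL.
    return {
        "LP_o":  ((s, p, "?"), o),
        "LP_s":  (("?", p, o), s),
        "REL":   ((s, "?", o), p),
        "DOM":   (("?", p, ANY_O), s),   # (?, p, *) -> (?, p, any_O)
        "RANGE": ((ANY_S, p, "?"), o),   # (*, p, ?) -> (any_S, p, ?)
        "NBE_o": ((s, ANY_R, "?"), o),   # (s, *, ?) -> (s, any_R, ?)
        "NBE_s": (("?", ANY_R, o), s),   # (?, *, o) -> (?, any_R, o)
        "NBR_s": ((s, "?", ANY_O), p),   # (s, ?, *) -> (s, ?, any_O)
        "NBR_o": ((ANY_S, "?", o), p),   # (*, ?, o) -> (any_S, ?, o)
    }
```

Each pair is then scored and trained exactly like an ordinary link or relation prediction example.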
The advantages of the MTT approach are that (i) the prediction costs remain stable, i.e., the cost of graph-structure prediction or downstream task prediction is unaffected by the number or choice of pre-training tasks, and (ii) the pre-training costs increase only linearly in the number of tasks. Note that the wildcard embeddings are not used for entity-level downstream tasks. Nevertheless, using wildcard entities during training affects all other entities as well. This is because the embedding of each entity occurs in all graph-structure prediction tasks. The entity embeddings of a good KGE model thus need to be suitable for all these tasks, not just for link prediction.

4. EXPERIMENTAL STUDY

We conducted a large experimental study. Our goals were (i) to assess KGE model performance for graph-structure prediction, (ii) to assess performance of pre-trained KGE models on downstream tasks, (iii) to assess the effect of multi-task training on both graph-structure prediction and downstream task performance, and (iv) to contextualize these results by comparing them to results obtained by recent graph neural networks (GNNs).

4.1. EXPERIMENTAL SETUP

Datasets, code, and scripts to reproduce all experimental results are available at <link-provided-in-final-version>.

Knowledge graphs. We used three commonly used benchmark datasets for evaluating KGE models: FB15K-237 (Toutanova & Chen, 2015), WNRR (Dettmers et al., 2018), and YAGO3-10 (Mahdisoltani et al., 2014). Each dataset is associated with a training, a validation, and a test split. FB15K-237 and WNRR are designed to be harder benchmarks for link prediction. YAGO3-10 is not, but it is considerably larger. Dataset statistics are summarized in Table 5 in the appendix.

KGE models. We considered four popular, representative KGE models: TransE (Bordes et al., 2013) and DistMult (Yang et al., 2015) (basic translational and factorization models, resp.) as well as RotatE (Sun et al., 2019) and ComplEx (Trouillon et al., 2016) (SOTA translational and factorization models). RotatE and ComplEx are the methods of choice for low-cost embeddings with good prediction performance (Ruffinelli et al., 2020; Sun et al., 2019) and, with an increase in model size and/or training cost (Lacroix et al., 2018; Chen et al., 2021b), can perform as well as SOTA models of other KGE model types such as the transformer-based HittER model (Chen et al., 2021a).

KGE training. We used LibKGE (Broscheit et al., 2020) for STD training (LP only) as a baseline and added MTT/MTR model training/evaluation. All KGE models were trained for a maximum of 200 epochs with early stopping on validation MRR checked every 10 epochs. We used cross-entropy as the loss function, as it systematically outperformed other losses in most prior studies. We used 1vsAll training with FB15K-237 and WNRR (to achieve good results) and NegSamp with YAGO3-10 (to scale to this larger dataset). Since we are interested in pre-trained KGE models, no information from downstream tasks is used for KGE model training and selection; e.g., the same KGE model is used for all downstream tasks in each experiment.
In particular, models were selected w.r.t. performance (MRR) on the validation data. Unless stated otherwise, models trained with STD training use the LP task for model selection, and models trained with MTT use MTR. Further improvements may be possible by using downstream tasks during training (Aribandi et al., 2022); we leave such exploration to future work.

KGE evaluation. We evaluate KGE models with respect to each of the five graph-structure prediction tasks of Sec. 3 (LP, REL, DOM, NBE, NBR) using filtered MRR on test data. We also aggregate these metrics into the multi-task ranking MRR (MTR).

KGE hyperparameters. We closely follow the approach of the experimental study of Ruffinelli et al. (2020) for hyperparameter selection. We performed 30 random trials using SOBOL sampling (Bergstra & Bengio, 2012) over a large search space to tune several hyperparameters, e.g., regularization, embedding size, batch size, dropout, initialization, and task weights (each in [0.1, 10.0], log scale). To keep our study feasible, we reduced the maximum batch and embedding size for larger datasets and expensive models. The full search space can be found in Table 8.

Downstream tasks. We collected or created data for 27 downstream tasks on FB15K-237 or YAGO3-10. This includes the datasets of Jain et al. (2021) for entity classification on FB15K-237 and YAGO3-10, which aim to predict the types of entities at different granularities. For regression, we use the datasets of Pezeshkpour et al. (2018) for YAGO3-10, which consist of temporal prediction tasks (e.g., the year an event took place), and the dataset of Huang et al. (2021) for node importance prediction. We also created several regression tasks for FB15K-237 from the multimodal data of García-Durán et al. (2018) by predicting literals associated with entities (e.g., a date, a person's height, the rating of a movie). Dataset statistics are given in Tables 6 and 7 in the appendix.

Downstream models.
We use scikit-learn (Pedregosa et al., 2011) models with only the entity embedding of the pre-trained KGE model as input. For classification, we use multilayer perceptrons (MLP), logistic regression, KNN, and random forests. For regression, we use MLP and linear regression.

Downstream training. Each model was trained using 5-fold cross-validation and selected based on mean validation performance across folds (see below). We then retrained the selected model on the union of the training and validation split (if present). To tune hyperparameters, we use 10 trials of random search with SOBOL sampling for each downstream model. The search space is given in Table 9. Note that we treat the choice of downstream model as a hyperparameter as well.

Downstream evaluation. For entity classification, we report weighted F1, as in Jain et al. (2021), aggregated across all classification tasks (denoted EC). For regression, we chose relative squared error (RSE) because it is interpretable and allows meaningful averaging across the different regression tasks (denoted REG). An RSE value of 1 is equivalent to the performance of a model that predicts the average of the dependent variable in the evaluation data; lower values are better. For each metric, we report the mean and standard deviation over 3 training runs of the downstream model.

Downstream baselines. We consider multiple baseline models to contextualize the results of pre-trained KGE models. In contrast to KGEs, the baselines are trained directly on the downstream task (i.e., no pre-training) and need to access the KG to perform predictions. We include KE-GCN (Yu et al., 2021), a recent GNN with state-of-the-art results for graph alignment and entity classification. For regression tasks, we use a linear layer after the final convolutional layer of KE-GCN. We tune hyperparameters using 30 SOBOL trials (as for KGE models); the search space is shown in Table 9.
For training, evaluation, and model selection, we follow the approach for our downstream models (e.g., 5-fold CV). We also consider selected SOTA results of other downstream models; see Sec. 4.5.
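The RSE metric used for the regression tasks can be computed as follows (a minimal sketch of the standard definition; the exact per-task averaging in our study may differ):

```python
import numpy as np

def relative_squared_error(y_true, y_pred):
    # RSE: squared error of the model divided by the squared error of a
    # baseline that always predicts the mean of the evaluation targets.
    # A value of 1.0 thus matches the mean predictor; lower is better.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    sse_model = ((y_true - y_pred) ** 2).sum()
    sse_mean = ((y_true - y_true.mean()) ** 2).sum()
    return sse_model / sse_mean
```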

4.2. GRAPH-STRUCTURE PREDICTION

In Table 2, we report test MRR on all graph-structure prediction tasks from Table 1 for KGE models using standard training with link prediction for model selection (STD) and our proposed multi-task training and model selection (MTT). Bold entries show best performance per metric and evaluation method. For easier comparison between STD and MTT, underlined entries highlight the better result between the two training methods for the same KGE model and dataset. The columns labeled Downstream Tasks are discussed in Sec. 4.3.

Table 2: Performance on test data of graph-structure prediction and downstream tasks with STD and MTT training, as well as KE-GCN. For graph-structure prediction, we report MRR (higher is better); for entity classification (EC), we report weighted F1 (higher is better); and for regression (REG), we show relative squared error (lower is better). Bold entries show best performance per task. Underlined entries show best performance between STD and MTT.


The results show that across all datasets and KGE models, STD training performed poorly on all graph-structure tasks except LP and (often) REL. The performance on these tasks improved significantly with MTT training in almost all cases; these are precisely the tasks introduced as auxiliary training objectives. This suggests that models trained solely with link prediction fail to capture graph structure more generally. Also note that MTT models had slightly lower performance on LP, but the decrease was often small and outweighed by significantly improved performance on the other tasks (often 2x-4x, up to 10x, depending on model, task, and dataset). A notable exception was NBE, which is the only task that uses wildcard embeddings for relations. Here STD occasionally outperformed MTT (on YAGO3-10 using DistMult and often on WNRR; see Tab. 11 in the appendix). Generally, however, MTT improved significantly over STD for graph-structure prediction and can thus be used to improve a KGE model's ability to learn multiple graph tasks simultaneously.

4.3. DOWNSTREAM TASKS

Table 2 also shows mean performance across all downstream tasks for each benchmark dataset. As before, bold entries show best performance per metric and evaluation method, and underlined entries facilitate performance comparisons across the different training approaches. We report performance for each individual downstream task in Tables 13 to 16 in the appendix. The best overall downstream task performance across all KGE and KE-GCN models was achieved by MTT in all cases. The margin compared to STD was sometimes small (e.g., EC on FB15K-237) and sometimes large (e.g., REG on YAGO3-10). The margin compared to KE-GCN, which trains directly on each task, was large. Nevertheless, STD training occasionally performed better than MTT (e.g., on EC tasks for FB15K-237). This suggests that capturing a wider variety of graph structures does not necessarily translate to better downstream task performance. We explore this further in Section 4.4, where we consider subsets of the MTT tasks. Ultimately, we conclude that the choice of pre-training objective clearly has an impact on downstream performance, although it is currently unclear how to make this choice. Our results also suggest that, perhaps surprisingly, models with weaker performance during pre-training with both STD and MTT often performed competitively on downstream tasks and sometimes even outperformed models with stronger pre-training performance. For example, ComplEx considerably outperformed RotatE and TransE on FB15K-237 on LP and MTR, but both models outperformed ComplEx on the EC tasks for that dataset. Similar observations can be made about both EC and REG tasks on YAGO3-10. The REG tasks on FB15K-237 were an exception, though; here higher performance during pre-training translated to better performance on downstream tasks. Generally, these results are problematic, as they suggest that LP and MTR are often inadequate to guide the choice of the KGE model class, a problem that needs further exploration in future work.

4.4. IMPACT OF TASK SELECTION AND MODEL SELECTION

Next, we explored the impact of task selection and, in particular, whether all proposed MTT tasks are beneficial. To keep computational costs feasible, we focused on FB15K-237 with ComplEx and TransE. We explored performance using STD, MTT, and MTT without either the LP, DOM, NBE, or NBR pre-training task. Our results are summarized in Tab. 3. We found that for graph-structure prediction, excluding a task generally led to lower performance on that task, as expected. It may, however, also boost performance on other tasks. For example, the best NBE performance for ComplEx is obtained when LP is excluded. For downstream tasks, we observe that the choice of training tasks can have a significant impact and that good choices differ between KGE models and downstream tasks. For example, compared to full MTT training, using a subset of tasks led to large improvements for ComplEx on EC and for TransE on REG. In both cases, as well as with TransE on EC, the best performance is obtained by removing one of the tasks during training. This reinforces our previous observation that including more tasks during pre-training does not necessarily lead to higher downstream performance, but it also provides more evidence that STD training is not enough for good downstream task performance. In fact, good models can be obtained without including the link prediction tasks: e.g., the best performance for TransE on REG was obtained when LP was excluded. We also explored the impact of model selection methods.

4.5. COMPARISON TO TASK-SPECIFIC MODELS

We compared the performance of a pre-trained ComplEx model (using MTT) to the best results for additional downstream tasks from the literature. These prior results were obtained by task-specific models and were not reproduced by us; see Sec. A.3 for a description of tasks and detailed results. We found that in most cases, this pre-trained ComplEx model did not reach the performance of SOTA task-specific models (which in some cases leveraged additional information). More exploration is needed into whether and when pre-trained KGE models are preferable (e.g., as in the tasks of Tab. 2) and into the effectiveness-cost trade-off of alternative approaches.

5. CONCLUSION

In this work, we explored methods to pre-train KGE models for tasks beyond link prediction. First, we showed empirically that commonly trained KGE models fail at basic graph-structure prediction tasks and proposed novel multi-task training and ranking approaches. These multi-task KGE models led to substantially better performance, i.e., their embeddings captured more information about graph structure. Second, we explored downstream task performance for a number of entity classification and regression tasks. Here multi-task training generally led to the best overall performance, but the margin was sometimes small. Our ablation studies suggest that pre-training can be further improved by a data- and model-specific selection of both the pre-training tasks and the model selection metric. Generally, more research is needed on how to make these choices and, more generally, on the relation between pre-training KGE models and their suitability for downstream applications.

A.3 COMPARISON TO SOTA RESULTS

In Table 10, we compare the downstream task performance of a pre-trained ComplEx model using the MTT approach with state-of-the-art task-specific models from the literature. Note that we did not train these models ourselves and that the experimental setups used in prior work may differ substantially from ours. In these experiments, we did not reach state-of-the-art performance with KGE models. Aside from prior models training directly for the task, this may be because (i) some of the datasets (AIFB, MUTAG) are very small, so that KGE models do not learn much before overfitting, and (ii) some knowledge graphs are multi-modal (MDGENRE, DMGFULL), but we do not leverage this information. On MDGENRE, however, the additional modalities do not play a significant role: not only does our pre-trained KGE model achieve comparable performance, but so do the non-multi-modal baselines of Bloem et al. (2021). This is not the case for DMGFULL, however, which contains a significantly higher number of triples from different modalities.



Footnotes:
- The nine queries for test triple (s, p, o) are precisely the ones given in the task descriptions.
- In our experiments, this led to better performance than using a single-dimensional output in the final convolution layer as done by Huang et al. (2021).
- Due to space constraints, we report results on WNRR in Table 11 in the appendix.
- There are multi-modal KGE models, however, e.g., Pezeshkpour et al. (2018).



Table 3: Performance on FB15K-237 on graph-structure prediction and downstream tasks for STD and various forms of multi-task training of KGE models (test data). Metrics and format follow those of Table 2. The objective w/o LP is the MTT objective with all tasks of Table 1 except LP.

Table 4 reports performance on FB15K-237 of some KGE models using both training approaches across different model selection methods: selecting on LP (the standard approach), selecting on MTR, and selecting directly on the metric used to evaluate the downstream task. We found that STD training performed best in combination with LP model selection. MTT performance on downstream tasks also improved consistently when using LP instead of MTR for model selection, however. Model selection with the downstream task metric provides only marginal benefits for both STD and MTT and can in fact be detrimental, likely due to overfitting on validation data. This indicates that model selection without information about downstream tasks (i.e., using LP or MTR) is suitable. The combination that performed best in our study was MTT training with LP model selection. Overall, we found that full MTT training and MTR for model selection (as used in our main results of Tab. 2) was a suitable choice, but further improvements are possible by dataset-, model-, and task-specific choices of pre-training task and validation objective.

Table 4: Performance on FB15K-237 downstream tasks for different KGE model training (STD/MTT) and selection approaches (LP/MTR/weighted F1/RSE). Weighted F1 and RSE use downstream task data for model selection.
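All of the selection methods compared above can be viewed as picking, from a set of validated checkpoints, the one with the best value of a chosen validation metric. A minimal sketch of this procedure follows; the checkpoint ids and metric values are hypothetical:

```python
def select_checkpoint(checkpoints, metric, higher_is_better=True):
    """Return the id of the checkpoint with the best validation value
    for `metric` (e.g. link-prediction MRR for LP selection, a
    multi-task ranking metric for MTR, or a downstream metric such as
    weighted F1)."""
    best = max if higher_is_better else min
    return best(checkpoints, key=lambda c: checkpoints[c][metric])

# Hypothetical validation results for two checkpoints: different
# selection metrics can pick different models.
ckpts = {
    "epoch_100": {"lp_mrr": 0.31, "mtr": 0.42},
    "epoch_200": {"lp_mrr": 0.33, "mtr": 0.40},
}
by_lp = select_checkpoint(ckpts, "lp_mrr")   # picks "epoch_200"
by_mtr = select_checkpoint(ckpts, "mtr")     # picks "epoch_100"
```

This makes explicit why the choice of validation metric matters: with the same set of trained checkpoints, LP-based and MTR-based selection can return different models.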

Comparison on entity classification (EC) and node importance estimation (NIE) between the best previously published models (directly trained on the task) and pre-trained KGE models (MTT, selection with MTR or the downstream metric).

Weighted F1 on test data of downstream classifiers (MLP, logistic regression, KNN, and random forest) that use pre-trained KGE embeddings as input to solve entity classification tasks for entities in FB15K-237, and of KE-GCN (Yu et al., 2021), a GCN trained directly on the downstream data. Bold entries show the best performance per task. Underlined entries show the best performance per task for the same KGE model and downstream model. Datasets are sorted by decreasing training-set size from left to right.
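The setup described by this caption, i.e., fitting a simple downstream classifier on frozen pre-trained embeddings, can be sketched as follows. To keep the sketch self-contained we use a minimal nearest-neighbour classifier on toy 2-d vectors; in the actual experiments the features are the pre-trained KGE entity embeddings and the classifiers are an MLP, logistic regression, KNN, and random forest.

```python
import numpy as np

def nn_classify(train_emb, train_labels, query_emb):
    """Assign each query entity the label of its nearest training
    entity in embedding space (Euclidean distance). The embeddings
    are frozen: only the classifier sees the labels."""
    # Pairwise distances: rows = queries, columns = training entities.
    dists = np.linalg.norm(query_emb[:, None, :] - train_emb[None, :, :],
                           axis=-1)
    return train_labels[np.argmin(dists, axis=1)]

# Toy stand-ins for pre-trained entity embeddings (real ones would be
# looked up from the trained KGE model).
train_emb = np.array([[0.0, 0.0], [1.0, 1.0]])
train_labels = np.array([0, 1])
queries = np.array([[0.1, -0.1], [0.9, 1.2]])
preds = nn_classify(train_emb, train_labels, queries)  # -> [0, 1]
```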

