HAS IT REALLY IMPROVED? KNOWLEDGE GRAPH BASED SEPARATION AND FUSION FOR RECOMMENDATION

Abstract

In this paper we study knowledge graph (KG) based recommendation systems. We first design a metric to study the relationship between different SOTA models, and find that current KG based recommendation systems have poor ability to retain collaborative filtering signals, and that higher-order connectivity introduces noise. In addition, we explore collaborative filtering recommendation methods using GNNs, and design an experiment showing that GNN models stacked with different numbers of layers learn different information, which explains the unstable performance of GNNs of different depths from a new perspective. Based on these findings, we first design a model-agnostic Cross-Layer Fusion Mechanism, free of any parameters, to improve the performance of GNNs. Experimental results on three collaborative filtering datasets show that the Cross-Layer Fusion Mechanism effectively improves GNN performance. We then design three independent signal extractors to mine the data from three different perspectives and train them separately. Finally, we use a signal fusion mechanism to fuse the different signals. Experimental results on three datasets that introduce KGs show that our KGSF achieves significant improvements over current SOTA KG based recommendation methods and that its results are interpretable.



1. INTRODUCTION

The recommendation system is an important technique for information filtering, helping users find what they want in a large amount of data. Collaborative filtering is a classical recommendation method whose main idea is to make recommendations by mining the collaborative signals between users and items. As a deep learning method, graph neural networks (GNNs) have been used to effectively mine users' collaborative signals, as in NGCF (Wang et al., 2019d) and LightGCN (He et al., 2020). Recent works (Wu et al., 2021; Yu et al., 2022) use LightGCN as the backbone, introduce contrastive learning, and achieve better performance. From LightGCN (He et al., 2020) to SimGCL (Yu et al., 2022), performance has constantly improved. For convenience of description, assume there are two models, M1 and M2, and the overall performance of M1 is better than that of M2. In practical applications, for some specific users, the recommendations of M1 may still be inferior to those of M2. A question that may be overlooked is: is M1 really an improvement over M2? That is, what is the relationship between M1 and M2? Figure 1 shows two possible relationships between M1 and M2 (i_1 ~ i_6 are items; the circle "Test" represents the range of the test set, the circle "M1" represents the top-K items given by model M1, and the circle "M2" represents the top-K items given by model M2). In Figure 1(a), M1 learns new things while retaining the information of M2, while in Figure 1(b) M1 improves over M2 by losing part of the information of M2 and learning more new information. To objectively measure these two cases, we design a new metric, Intersection@N, to measure the differences between two models.
Based on this metric, we conduct experiments between different collaborative filtering models (Wu et al., 2021; Yu et al., 2022), and between different stacked layers of the same collaborative filtering model (described in Section 2.1 and Appendix C), and obtain two findings: (1) The relationship between different collaborative filtering methods using GNNs is as shown in Figure 1(b). (2) The relationship between collaborative filtering models that use the same GNN but stack different numbers of layers is also as shown in Figure 1(b), i.e., models that stack more layers cannot fully "include" models that stack fewer layers. Many studies (Wang et al., 2019d; He et al., 2020) pointed out that, within a certain range, the more GNN layers are stacked, the higher the model performance; beyond this range, performance decreases. Numerous studies (He et al., 2020; Zhao & Akoglu, 2019) attribute the poor performance to the over-smoothing of nodes caused by multi-layer stacking, and many methods have been designed to alleviate over-smoothing on this basis. A common feature of these works is that they all choose a model with a fixed number of stacked layers L as the final model. The default assumption in doing so is that the well-performing model (stacking L layers) is an improvement over the poorly-performing model (stacking T layers), with the improvement understood as in Figure 1(a). However, our experiments show that this assumption is invalid, and instead their relationship is as shown in Figure 1(b) (i.e., the second finding). The first observation of this paper is therefore that collaborative filtering methods using GNNs do not fully exploit the performance of GNNs. Alongside GNNs, knowledge graphs (KGs) have been introduced into recommendation systems to improve their performance with auxiliary information.
The popular KG based recommendation methods are KGAT (Wang et al., 2019c) and KGIN (Wang et al., 2021), which connect the KG and the user-item bipartite graph through items, thus unifying the two into one graph structure. Following KGAT and KGIN, we consider that a KG based recommendation system includes item-based collaborative signals, content signals and attribute-based collaborative signals. All three signals are mined in KGAT and KGIN: the first two are mined in the User-Item bipartite graph and the KG respectively, and the third is mined in the unified graph via higher-order connectivity (Wang et al., 2019c). However, we make two observations about this unified graph structure: (1) Poor preservation of collaborative filtering signals. Using the Intersection@N metric to compare KG-based recommendation methods with collaborative filtering methods, we find that the relationship between the two is as shown in Figure 1(b). Existing methods that introduce a KG discard part of the collaborative filtering information while learning more information introduced by the KG, which is why the former outperform the latter. (2) Unnecessary information is introduced by higher-order connectivity, which makes the propagation path too long. Taking user u_2 being recommended i_4 in Figure 5 as an example, a possible path is $u_2 \xrightarrow{\text{like}} a_3 \xrightarrow{r_2} i_4$, whose semantics are that u_2 likes items with attribute a_3. However, the path given by KGAT is $u_2 \xrightarrow{\text{like}} i_2 \xrightarrow{r_1} a_3 \xrightarrow{r_2} i_4$. In this path, the information of node i_2 includes the content signal and the user-based collaborative signal, which does not help the original semantics and introduces unnecessary information. In addition, longer propagation paths also introduce noise. Based on these three observations, in this paper we propose a general knowledge graph based separation and fusion model.
It consists of three core parts that meet the three challenges mentioned above. Cross-Layer Fusion Mechanism. Since models that stack different numbers of layers learn different information, we cannot simply select a model that stacks N layers. We design a model-agnostic, general-purpose Cross-Layer Fusion Mechanism without any trainable parameters, which fuses models stacked with different numbers of layers and preserves the information of each. Signal Extractor. We design three independent, separately trained signal extractors to extract the three kinds of signals mentioned above, avoiding mutual interference between signals. We use an existing collaborative filtering method for item-based collaborative signal extraction, with the Cross-Layer Fusion Mechanism applied to further improve performance. For attribute-based collaborative signals, we process the original User-Item-Attribute graph into a User-Attribute-Item graph and apply LightGCN to the User-Attribute graph, again with the Cross-Layer Fusion Mechanism. The extraction of the content signal draws on the idea of the Transformer (Vaswani et al., 2017), using the user as the query vector and the attributes as key vectors to obtain the user's interest in each attribute, yielding a fine-grained explanation. Signal Fusion Mechanism. The three signal extractors produce different scores. Drawing on the idea of ensemble learning, we design a Signal Fusion Mechanism, which is essentially a weighted summation. We conduct extensive experiments on three real datasets. Experimental results show that our Cross-Layer Fusion Mechanism improves collaborative filtering more than state-of-the-art methods such as SGL (Wu et al., 2021) and SimGCL (Yu et al., 2022).
Our KGSF (Knowledge Graph based Separation and Fusion model) also outperforms state-of-the-art methods in knowledge graph-based recommendation, such as KGAT (Wang et al., 2019c), KGIN (Wang et al., 2021), KGCL (Yuhao et al., 2022) and HAKG (Du et al., 2022). The contributions of this paper are summarized as follows: (1) The Intersection metric is designed to measure the relationship between different models. Using this metric we give a new explanation for the unstable performance of GNNs stacking different numbers of layers, and find that the improvement relationship between most models belongs to Figure 1(b). (2) The Cross-Layer Fusion Mechanism effectively improves GNN performance, and is model-independent, general, and free of any training parameters. (3) A highly interpretable and extensible KGSF framework is proposed. (4) Experimental studies on three datasets demonstrate the superiority and effectiveness of the Cross-Layer Fusion Mechanism and KGSF.

2. PRELIMINARIES

NGCF (Wang et al., 2019d) triggered a research boom by introducing GNNs into collaborative filtering. LightGCN (He et al., 2020) removes the activation and projection in NGCF and achieves remarkable results. Afterwards, contrastive learning was introduced and the SGL (Wu et al., 2021) framework was proposed. Then came SimGCL (Yu et al., 2022), which removes graph augmentations and instead adds perturbations on top of contrastive learning. The performance of these methods increases progressively. However, two problems remain to be studied. First, from NGCF and LightGCN to SGL and SimGCL, is their relationship that of Figure 1(a) or Figure 1(b)? Second, increasing the number of GNN layers degrades performance, which many studies attribute to the learned embeddings becoming too smooth; is the relationship between these different layers that of Figure 1(a) or Figure 1(b)? Here, we choose two SOTA methods, SGL (Wu et al., 2021) and SimGCL (Yu et al., 2022), for comparative experiments, and give conclusions in Appendix C and Section 2.1, respectively. A solution is given in Section 2.2. In addition, we suggest readers read the concepts "Intersection", "UA Graph", "IA Graph", etc. in Appendix B.

2.1. THE RELATIONSHIP BETWEEN DIFFERENT LAYERS IN THE SIMGCL AND SGL

A large number of GNN-based collaborative filtering methods (Wang et al., 2019d; He et al., 2020) hold that performance increases with the number of layers within a small range, but decreases once stacking exceeds a certain level; Table 2 also confirms this. Many works (He et al., 2020; Zhao & Akoglu, 2019; Rong et al., 2019) attribute the poor performance to the fact that, as the number of layers increases, the node representations learned by the GNN become smoother and thus lack discrimination.

Table 1: Intersection@20 between different layers.

                 Yelp2018                    Last-FM                   Amazon-Book
           L-1     L-2     L-3        L-1     L-2     L-3        L-1     L-2     L-3
SGL
L-1     1.0000  0.8284  0.7528     1.0000  0.7198  0.6398     1.0000  0.8101  0.7330
L-2     0.7791  1.0000  0.7838     0.7296  1.0000  0.6235     0.7027  1.0000  0.7371
L-3     0.6873  0.7601  1.0000     0.6000  0.5785  1.0000     0.6172  0.7154  1.0000
SimGCL
L-1     1.0000  0.7945  0.7758     1.0000  0.7296  0.6661     1.0000  0.8179  0.8190
L-2     0.7583  1.0000  0.8494     0.7251  1.0000  0.7459     0.7922  1.0000  0.8751
L-3     0.7381  0.8462  1.0000     0.7126  0.8040  1.0000     0.7657  0.8443  1.0000

The default premise of this view is that a model with better performance is an extension of a model with poorer performance, in which the former perfectly preserves the latter's information. To verify this, we again use the Intersection@20 metric between different layers of the same method; the results are shown in Table 1. We summarize our observations and conclusions as follows: (1) On the whole, the similarity between different layers of the same method on the same dataset is between 57% and 85%. This shows that the information learned by different layers is partly different, and the relationship between them is intertwined as in Figure 1(b). In other words, the layer with better performance is not a strict improvement on the layer with worse performance. One possible reason for the poorer performance of models with more layers is that the total amount of high-level information in the dataset is smaller than that of low-level information. (2) The minimum similarity between layers occurs between layers 1 and 3, except for SGL on the Last-FM dataset, where it occurs between layers 2 and 3. This shows that, as the number of layers increases, the upper layer's ability to retain the lower layer's information gradually decreases, but the new information it learns outweighs what is lost; therefore the upper layer performs better than the lower layer.
Heretofore, many methods (Zhao & Akoglu, 2019; Rong et al., 2019; Liu et al., 2020; Feng et al., 2020; Chen et al., 2020) have been proposed to mitigate the performance degradation caused by increasing the number of layers. However, they rest on the premise that the better-performing model retains the information of the poorer-performing model. Our experiments show that this premise is wrong.

2.2. CROSS-LAYER FUSION MECHANISM

According to the experiments in Section 2.1, the information learned by different layers of the same model is partly independent. A natural idea is therefore to improve performance by simply retaining the information of different layers, without modifying the model structure. We design a method that is model-agnostic and can be applied to any graph-based model that produces user and/or item embeddings; more importantly, it introduces no trainable parameters. Its structure is shown in Figure 4(a) and works as follows. First, we separately train models with different numbers of stacked layers (three models are trained using the green box shown in Figure 4, with one, two and three stacked layers respectively). Second, each model yields an embedding for a user and an item, and we multiply them to get a score, so the three models give three different scores. Since the score ranges of different models may differ, we apply max-min normalization to limit the scores to [0, 1]. Finally, we fuse the different scores with a weighted summation whose weights are positively related to the performance of each single model. To verify the effectiveness of this method, we use the models trained in Section 2.1 and set the weights of the three models to 1. The experimental results are shown in Table 2, where %Imp. denotes the relative improvement of the best-performing method (bold) over the strongest model (underlined) excluding our method. Our method achieves the best performance. From the point of view of datasets, the improvement is obvious on Last-FM and Amazon-Book; from the point of view of models, SGL improves markedly. This illustrates the feasibility and effectiveness of our method.
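As an illustration, the mechanism described above can be sketched in a few lines. This is a toy example with made-up score matrices and unit weights (as in the Table 2 setting); the function names are ours, not the paper's code.

```python
import numpy as np

def min_max_normalize(scores):
    """Scale a score matrix into [0, 1] (the max-min normalization step)."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

def cross_layer_fusion(layer_scores, weights=None):
    """Fuse user-item score matrices from separately trained models that
    stack different numbers of layers.  No trainable parameters: each
    matrix is min-max normalized, then combined by a weighted sum.
    `weights` defaults to all ones, as in the Table 2 experiment."""
    if weights is None:
        weights = [1.0] * len(layer_scores)
    fused = np.zeros_like(layer_scores[0], dtype=float)
    for w, s in zip(weights, layer_scores):
        fused += w * min_max_normalize(s)
    return fused

# Toy scores from hypothetical 1-, 2- and 3-layer models (note s2's range)
s1 = np.array([[0.9, 0.1], [0.2, 0.8]])
s2 = np.array([[8.0, 2.0], [1.0, 9.0]])
s3 = np.array([[0.5, 0.4], [0.3, 0.6]])
fused = cross_layer_fusion([s1, s2, s3])
```

In practice the weights would be set in proportion to each single model's validation performance, per the description above.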
We conduct two experiments to further explore how our method fuses information from different layers and whether new information is generated. From the results, we draw the following conclusions: (1) The information shared by the three models accounts for the highest proportion after fusion, about 45% to 69%, while new information accounts for the least, about 0.5% to 3.9%. This shows that the Cross-Layer Fusion Mechanism tends to preserve the information shared by the three models and generates a small amount of new information. (2) In terms of the proportion of retained information, the information shared by all three models is larger than that shared by two models (about 4.4% to 14.2%), which in turn is larger than the unique information of a single model (about 2.2% to 10.0%). This mirrors ensemble learning. (3) The retention rate of the method for a single model's information is about 78% to 93%, meaning it cannot retain all the information of the three models. However, the amount of information lost is less than the sum of the unique information retained and the new information generated, so a better overall effect is achieved.

3. METHODOLOGY

In this section, we introduce our proposed Knowledge Graph based Separation and Fusion model (KGSF) in detail. First, we decouple the UIA graph used by (Wang et al., 2019c; 2021) into three graphs: the UI Graph, the UA Graph and the IA Graph. A different signal can be found in each graph: the item-based collaborative filtering signal (i.e., item-user-item co-occurrence) in the UI Graph, the attribute-based collaborative filtering signal (i.e., attribute-user-attribute co-occurrence) in the UA Graph, and the content signal (i.e., items with similar attributes are similar) in the IA Graph. We design three different signal extractors to mine this information, and finally a fuser to fuse the different signals. It is important to note that this is not an end-to-end framework: the signal extractors are trained separately.

3.1. ITEM-BASED CF-SIGNAL EXTRACTOR

Extracting collaborative signals from the UI Graph for recommendation is a hot field. Much work (Wang et al., 2019d; He et al., 2020; Wu et al., 2021; Yu et al., 2022; Wang et al., 2020b) has shown that GNNs can effectively extract item-based collaborative signals. Here we adopt LightGCN (He et al., 2020), a typical and effective model. Its graph convolution operation is defined as:
$$e^{(k+1)}_{G_{UI}:u} = \sum_{i \in \mathcal{N}^{G_{UI}}_u} \frac{1}{\sqrt{|\mathcal{N}^{G_{UI}}_u|\,|\mathcal{N}^{G_{UI}}_i|}}\, e^{(k)}_{G_{UI}:i}; \qquad e^{(k+1)}_{G_{UI}:i} = \sum_{u \in \mathcal{N}^{G_{UI}}_i} \frac{1}{\sqrt{|\mathcal{N}^{G_{UI}}_i|\,|\mathcal{N}^{G_{UI}}_u|}}\, e^{(k)}_{G_{UI}:u} \quad (1)$$
where $e^{(k)}_{G_{UI}:i}, e^{(k)}_{G_{UI}:u} \in \mathbb{R}^{d_{G_{UI}}}$ are the embeddings of item i and user u at layer k, $d_{G_{UI}}$ is the embedding dimension, and in particular $e^{(0)}_{G_{UI}:i}$ and $e^{(0)}_{G_{UI}:u}$ are ID embeddings (i.e., trainable parameters). We define $\mathcal{N}^{G_{UI}}_i = \{u \mid (u, i) \in G_{UI}\}$ as the set of users who have interacted with item i, and $\mathcal{N}^{G_{UI}}_u = \{i \mid (u, i) \in G_{UI}\}$ as the set of items that user u has interacted with. Each layer yields embeddings for items and users. After stacking K layers, LightGCN combines the embeddings obtained at each layer to form the final representation of a user (an item): $e_{G_{UI}:i} = \frac{1}{K+1}\sum_{k=0}^{K} e^{(k)}_{G_{UI}:i}$, $e_{G_{UI}:u} = \frac{1}{K+1}\sum_{k=0}^{K} e^{(k)}_{G_{UI}:u}$. According to the conclusions in Section 2.2, models that stack different numbers of layers learn different information. Therefore, we first train models that stack different numbers of layers separately, and then use the Cross-Layer Fusion Mechanism for fusion.
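Equation 1 and the layer averaging can be sketched with dense matrices as follows. This is a minimal toy sketch assuming a binary interaction matrix R; real implementations use sparse operations, and the variable names are ours.

```python
import numpy as np

def lightgcn_layer(R, e_user, e_item):
    """One LightGCN propagation step on the UI bipartite graph (Eq. 1).
    R is the |U|x|I| binary interaction matrix; neighbours are combined
    with the symmetric 1/sqrt(|N_u||N_i|) normalization, with no feature
    transforms or nonlinearities."""
    du = R.sum(axis=1, keepdims=True)   # |N_u| per user
    di = R.sum(axis=0, keepdims=True)   # |N_i| per item
    norm = R / np.sqrt(np.clip(du, 1, None) * np.clip(di, 1, None))
    return norm @ e_item, norm.T @ e_user  # new user / item embeddings

def lightgcn(R, e_user0, e_item0, K=3):
    """Stack K layers and average per-layer embeddings (the 1/(K+1) sum)."""
    us, its = [e_user0], [e_item0]
    for _ in range(K):
        u, i = lightgcn_layer(R, us[-1], its[-1])
        us.append(u)
        its.append(i)
    return sum(us) / (K + 1), sum(its) / (K + 1)

# Toy UI graph: 2 users, 3 items, 4-dimensional ID embeddings
R = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
rng = np.random.default_rng(0)
eu, ei = lightgcn(R, rng.normal(size=(2, 4)), rng.normal(size=(3, 4)))
```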

3.2. ATTRIBUTE-BASED CF-SIGNAL EXTRACTOR

Figure 5: High-Order Connectivity.

One purpose of introducing a KG into the recommendation system is to extract attribute-based collaborative filtering signals. Experiments (Wang et al., 2019c; 2021) show that such collaborative signals exist in the UIA Graph. As shown in Figure 5, a possible connection path from user u_2 to item i_4 can be expressed as $u_2 \xrightarrow{\text{like}} i_2 \xrightarrow{r_1} a_3 \xrightarrow{r_2} i_4$. Each node in this path contains the information of other nodes (e.g., the information of node i_2 may contain the information of nodes a_2, a_3 and a_4). But the purpose of this path is to express that user u_2 likes items with attribute a_3, so the path introduces noise that is hard to estimate. To extract this collaborative signal more efficiently, we move its extraction onto the UA Graph, as shown in Figure 5. The path can then be simplified to $u_2 \xrightarrow{\text{like}} a_3 \xrightarrow{r_2} i_4$, which removes i_2 from the original path, avoids introducing unnecessary information, and yields a purer embedding. Here, as shown in Figure 4(d), we also apply LightGCN to extract the collaborative filtering signal from the UA Graph. Users and attributes are represented by $e^{(0)}_{G_{UA}:u}$ and $e^{(0)}_{G_{UA}:a}$ at layer 0; they are ID embeddings (i.e., trainable parameters). Following Formula 1, the graph convolution operation on the UA Graph is
$$e^{(k+1)}_{G_{UA}:u} = \sum_{a \in \mathcal{N}^{G_{UA}}_u} \frac{e^{(k)}_{G_{UA}:a}}{\sqrt{|\mathcal{N}^{G_{UA}}_u|\,|\mathcal{N}^{G_{UA}}_a|}}; \qquad e^{(k+1)}_{G_{UA}:a} = \sum_{u \in \mathcal{N}^{G_{UA}}_a} \frac{e^{(k)}_{G_{UA}:u}}{\sqrt{|\mathcal{N}^{G_{UA}}_a|\,|\mathcal{N}^{G_{UA}}_u|}},$$
where $e^{(k)}_{G_{UA}:a}, e^{(k)}_{G_{UA}:u} \in \mathbb{R}^{d_{G_{UA}}}$ and $d_{G_{UA}}$ is the embedding dimension. We define $\mathcal{N}^{G_{UA}}_a = \{u \mid (u, a) \in G_{UA}\}$ as the set of users who have interacted with attribute a, and $\mathcal{N}^{G_{UA}}_u = \{a \mid (u, a) \in G_{UA}\}$ as the set of attributes that user u has interacted with. Each layer yields embeddings for attributes and users.
After stacking K layers, LightGCN combines the embeddings obtained at each layer to form the final representation of a user or an attribute: $e_{G_{UA}:a} = \frac{1}{K+1}\sum_{k=0}^{K} e^{(k)}_{G_{UA}:a}$; $e_{G_{UA}:u} = \frac{1}{K+1}\sum_{k=0}^{K} e^{(k)}_{G_{UA}:u}$. We now have the embeddings of user u and attribute a, which contain collaborative signals between users and attributes. Since the task description requires the user's rating for each item, we also define the item embedding: $e_{G_{UA}:i} = \sum_{a \in \mathcal{N}^{G_{UA}}_i} e_{G_{UA}:a}$. Likewise, we first train models that stack different numbers of layers separately, and then use the Cross-Layer Fusion Mechanism for fusion.
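The construction of the UA Graph from the UI and IA graphs (Appendix B), and the item embedding as the sum of its attributes' embeddings, can be sketched as follows. The IDs u1, i1, a1, etc. are hypothetical toy values.

```python
import numpy as np

def build_ua_graph(ui_edges, ia_triples):
    """Derive the User-Attribute graph: (u, a) is an edge whenever user u
    interacted with some item i and item i has attribute a."""
    item2attrs = {}
    for i, _r, a in ia_triples:
        item2attrs.setdefault(i, set()).add(a)
    ua = set()
    for u, i in ui_edges:
        for a in item2attrs.get(i, ()):
            ua.add((u, a))
    return ua

def item_embedding_from_attrs(i, ia_triples, attr_emb):
    """e_i in the UA extractor: the sum of the embeddings of i's attributes."""
    attrs = [a for (h, _r, a) in ia_triples if h == i]
    return sum(attr_emb[a] for a in attrs)

# Toy graphs
ui = [("u1", "i1"), ("u2", "i2")]
ia = [("i1", "r1", "a1"), ("i1", "r2", "a2"), ("i2", "r1", "a1")]
ua = build_ua_graph(ui, ia)
attr_emb = {"a1": np.array([1.0, 0.0]), "a2": np.array([0.0, 1.0])}
e_i1 = item_embedding_from_attrs("i1", ia, attr_emb)
```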

3.3. CONTENT SIGNAL EXTRACTOR

The second reason for introducing the knowledge graph is to provide richer information for item embeddings. The content signal extractor we design consists of two components. The first is the knowledge graph embedding layer, shown in Figure 4(e), which uses a knowledge graph embedding method such as RotatE (Sun et al., 2018a) to obtain attribute embeddings. The other is the user interest mining layer, shown in Figure 4(f): its input is the attribute embeddings obtained by the first layer; using the user as the query vector and the attributes as key vectors, it obtains the user's interest in each attribute, and finally produces the user and item embeddings, denoted $e_{G_{IA}:u}$ and $e_{G_{IA}:i,u}$ respectively. For a detailed introduction to these two layers, please refer to Appendix D.
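The user-as-query scoring idea of the interest mining layer can be sketched as follows. This is a minimal sketch with toy one-hot attribute embeddings; the actual layer (including its output embeddings and training details) is described in Appendix D, and all names here are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def user_interest(user_q, attr_keys):
    """Transformer-style attention: the user embedding acts as the query,
    the attribute embeddings as the keys.  Returns one weight per
    attribute, which doubles as a fine-grained explanation of the user's
    interest in each attribute."""
    d = attr_keys.shape[1]
    logits = attr_keys @ user_q / np.sqrt(d)   # scaled dot product
    return softmax(logits)

attrs = np.eye(4)                       # 4 toy one-hot attribute embeddings
user = np.array([0.1, 2.0, 0.1, 0.1])   # hypothetical user leaning to attr 1
weights = user_interest(user, attrs)
```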

3.4. SIGNAL FUSION MECHANISM

In the previous three subsections, we proposed three different signal extractors. Instead of concatenating the three item embeddings and user embeddings, we train the three signal extractors separately because they are independent of each other. Here, $S'_{UI}$, $S'_{UA}$ and $S'_{IA}$ denote the scores of the item-based CF-signal extractor, the attribute-based CF-signal extractor and the content signal extractor, respectively, where $S'_{UI}, S'_{UA}, S'_{IA} \in \mathbb{R}^{|U| \times |I|}$. As shown in Figure 4(b), since the three scores have different ranges, we use max-min normalization to constrain them to [0, 1], denoted $S_{UI}$, $S_{UA}$ and $S_{IA}$ respectively. The final score is defined as $S = \tau_0 S_{UI} + \tau_1 S_{UA} + \tau_2 S_{IA}$, where $\tau_0, \tau_1, \tau_2 \in (0, 1]$. The values of $\tau_0, \tau_1, \tau_2$ depend on the performance of the three signal extractors: the stronger a signal extractor's performance, the greater its weight.
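A minimal sketch of the Signal Fusion Mechanism, from normalization to per-user top-K recommendation (the score matrices and tau values below are hypothetical placeholders, not tuned values from the paper):

```python
import numpy as np

def min_max(s):
    """Max-min normalization into [0, 1]."""
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def fuse_signals(s_ui, s_ua, s_ia, taus=(1.0, 0.5, 0.5)):
    """S = tau0*S_UI + tau1*S_UA + tau2*S_IA over normalized scores.
    In the paper each tau in (0, 1] tracks its extractor's performance."""
    t0, t1, t2 = taus
    return t0 * min_max(s_ui) + t1 * min_max(s_ua) + t2 * min_max(s_ia)

def top_k(S, k=2):
    """Per-user top-k item indices from the fused |U|x|I| score matrix."""
    return np.argsort(-S, axis=1)[:, :k]

# Toy scores for one user over three items, one matrix per extractor
s_ui = np.array([[5.0, 1.0, 3.0]])
s_ua = np.array([[0.2, 0.9, 0.1]])
s_ia = np.array([[0.3, 0.2, 0.8]])
S = fuse_signals(s_ui, s_ua, s_ia)
```

Note how normalization matters: the raw UI scores live on a different scale from the other two, so without it the UI extractor would dominate the sum regardless of the taus.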

3.5. MODEL PREDICTION AND MODEL OPTIMIZATION

In Section 3.4, we mentioned that each signal extractor is trained separately, so within each signal extractor, user u's rating for item i is defined as the dot product of the corresponding embeddings. The score definitions of the three signal extractors are: $\hat{y}^{G_{UI}}_{u,i} = e^{\top}_{G_{UI}:u}\, e_{G_{UI}:i}$, $\hat{y}^{G_{UA}}_{u,i} = e^{\top}_{G_{UA}:u}\, e_{G_{UA}:i}$, $\hat{y}^{G_{IA}}_{u,i} = e^{\top}_{G_{IA}:u}\, e_{G_{IA}:i,u}$. Note that the scores for the item-based and attribute-based CF-signal extractors here refer to a single model, as shown in the green box in Figure 4(a). We adopt the BPR (Rendle et al., 2012) loss to optimize our model. Since the three signal extractors are trained separately, we give their loss functions in Appendix E. Note also that the content signal extractor is trained in two stages: the knowledge graph embedding layer is trained first, and the user interest mining layer is trained alone after it completes.

4. EXPERIMENTS

We present empirical results to demonstrate the effectiveness of our proposed KGSF framework. The experiments are designed to answer the following research questions: RQ1: How does KGSF perform compared to state-of-the-art knowledge-aware recommendation models? RQ2: How do different components of KGSF (e.g., the attention guiding mechanism, the cross-layer fusion mechanism, the independence and completeness of the three signal extractors, the effectiveness of the signal fusion mechanism) affect its performance? RQ3: Can KGSF provide meaningful insights into user intents and give an intuitive impression of explainability? Only some experiments and results are shown here; please refer to Appendix F for more details.
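As a minimal sketch of the Section 3.5 objective: the BPR loss over dot-product scores for a single (user, positive item, negative item) triple. This is our toy example; regularization terms and mini-batching are omitted.

```python
import numpy as np

def bpr_loss(e_u, e_pos, e_neg):
    """BPR pairwise loss -log sigmoid(y_ui - y_uj), with scores defined as
    dot products of the corresponding embeddings (Section 3.5)."""
    y_pos = e_u @ e_pos
    y_neg = e_u @ e_neg
    return -np.log(1.0 / (1.0 + np.exp(-(y_pos - y_neg))))

# Toy embeddings: the positive item aligns with the user, the negative doesn't
u = np.array([1.0, 0.0])
i_pos = np.array([2.0, 0.0])   # score 2.0
i_neg = np.array([0.0, 2.0])   # score 0.0
loss_good = bpr_loss(u, i_pos, i_neg)   # correct ranking -> small loss
loss_bad = bpr_loss(u, i_neg, i_pos)    # swapped ranking -> large loss
```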

4.1. SETUP

Datasets, Baselines and Evaluation Metrics. We use three benchmark datasets: Amazon-Book, Last-FM and Yelp2018, which are extensively evaluated by the SOTA methods (Wang et al., 2019c; 2021; Du et al., 2022; Yuhao et al., 2022) and vary in terms of domain, size and sparsity. Table 7 presents the overall statistics of the experimented datasets. We adopt two widely used evaluation protocols (Krichene & Rendle, 2022): recall@K and ndcg@K, with K=20 by default. To demonstrate the effectiveness of KGSF, we compare it with state-of-the-art methods, covering a KG-free method (Rendle et al., 2012), an embedding-based method (Zhang et al., 2016), propagation-based methods (Wang et al., 2019a; c; 2021; 2020c) and multiview-based methods (Du et al., 2022; Yuhao et al., 2022). See Appendix F.2 for more details. (1) KGSF consistently yields the best performance on the datasets, except for the ndcg metric on the Last-FM dataset. In particular, it achieves significant improvements even over the strongest baselines w.r.t. recall@20: by 10.41%, 6.75% and 2.31% on Yelp2018, Last-FM and Amazon-Book, respectively. This demonstrates the rationality and effectiveness of KGSF.

4.2. PERFORMANCE COMPARISON(RQ1)

We attribute these improvements to the fusion mechanism and the decoupling of graphs. (2) Jointly analyzing the performance of KGSF across the three datasets, we find that the improvement on Yelp2018 is more significant than on the other datasets. One possible reason is that the collaborative signal matters more to this dataset than the content signal. This confirms that KGSF adapts well to different types of signals. (3) Among the single components of KGSF, the UI component works best: its performance on Yelp2018 is much higher than the existing SOTA methods that introduce a KG, and on Amazon-Book it is higher than all SOTA methods except KGIN. Therefore, the proportion of item-based collaborative filtering signals in these three datasets is higher than that of the other two signals. If the KG cannot be modeled correctly, introducing it may weaken the collaborative signal and introduce noise, for example by learning on a unified graph structure as in KGAT and KGIN, or by using a gating mechanism to fuse the decoupled embeddings as in HAKG. (4) When fusing any two components of KGSF, there is no performance degradation, and the performance is better than that of any single component. The fusion of the UI and UA components performs best in most cases. This proves that introducing the UA graph is effective: it extracts information that the UI graph cannot. It also illustrates the effectiveness of the fusion mechanism.

4.3. STUDY OF KGSF(RQ2)

In this section, we present part of our results to explore KGSF; we refer readers to Appendix F for more experiments. We observe: (1) Overall, the value of Intersection@20 between any two components is less than 0.54, indicating that at least 46% of the information extracted by any two signal extractors is different. This shows that the three signal extractors are independent of one another. (2) Comparing the three datasets, the three extractors have the highest independence on Yelp2018 and the lowest on Last-FM. Comparing the performance of each extractor in Table 4, the UA and IA components perform worst on Yelp2018 and best on Last-FM. Therefore, the independence between individual signal extractors is negatively correlated with their performance. (3) Within the same dataset, comparing each row, the independence among the three signal extractors is, from strong to weak, "IA&UA", "UI&IA" and "UI&UA". Combined with the results in Table 4, the effect of fusion also follows this order from low to high in most cases. This shows that the stronger the independence, the less the effect improves after fusion.

Table 6: Intersection@20 between KGSF and three components.

         Yelp2018               Last-FM               Amazon-Book
       KGSF    KGSF*         KGSF    KGSF*         KGSF    KGSF*
UI    0.9076  0.8884        0.7874  0.6827        0.8547  0.8147
UA    0.4155  0.1544        0.7773  0.6293        0.6949  0.3606
IA    0.3920  0.1264        0.6777  0.2866        0.6443  0.1757

Effectiveness of Signal Fusion Mechanism. We explore the preservation effect of the signal fusion mechanism and whether new information is generated. The experimental procedure is to select a benchmark method and analyze its component proportions relative to the three independent signal extractors; the benchmark selected here is KGSF. The results are shown in Table 6 and Figure 6(c). From them we summarize the following observations: (1) In terms of retention ability, the UI component is retained most strongly on all datasets (about 79% ~ 91%), followed by the UA component (about 42% ~ 78%), with the IA component the weakest (about 39% ~ 68%). One possible reason is the setting of τ_i in the signal fusion mechanism: within a certain range, the higher the value, the better the retention of that component. (2) According to Figure 6, about 6% ~ 9% new information is generated on the three datasets, showing that the signal fusion mechanism generates new information.

5. RELATED WORK

Existing recommendation methods that introduce knowledge graph can be mainly classified into four categories, namely, embedding-based methods, path-based methods, propagation-based methods and multiview-based methods. We give a brief introduction in the Appendix G.

6. CONCLUSION AND FUTURE WORK

In this paper, we first propose the Intersection metric to measure the relationship between different models. Experiments verify that the relationship between different collaborative filtering (CF) methods using GNNs is as shown in Figure 1(b); moreover, the relationship between models using the same CF method but stacking different numbers of layers is the same. We then design a model-agnostic cross-layer fusion mechanism that introduces no training parameters, and conduct extensive experiments on three real datasets to demonstrate its effectiveness. We further analyze the current challenges of KG-based recommendation methods and design an extensible KGSF framework that improves recommendation performance through independent signal extractors and fusion mechanisms, demonstrating its effectiveness on three datasets. Experiments show that directly fusing different signals causes them to interfere with each other, and the current difficulty is how to fuse different features effectively. In the future, we will explore whether the above problems exist in other tasks using GNNs, such as object detection in computer vision, and how to design an efficient, end-to-end feature fusion mechanism.

B PROBLEM FORMULATION

We first introduce the data structures related to our studied problem, and then formulate our task.

Table 9: Different relations measured by Intersection@N.

  Intersection@N(M1)   Intersection@N(M2)   Relation(M1, M2)
  < 1                  = 1                  Figure 1(a)
  < 1                  < 1                  Figure 1(b)
  = 1                  = 1                  Similar

User-Item Bipartite Graph (UI). In this paper, we focus on learning users' preferences from implicit feedback such as clicks and purchases. We define the user set as U = {u} and the item set as I = {i}. The user-item bipartite graph is defined as G_UI = {(u, i) | u ∈ U, i ∈ I}. If user u has interacted with item i, the pair (u, i) is in G_UI.

Knowledge Graph (KG) / Item-Attribute Graph (IA). A KG stores real-world facts that express relationships between entities, usually in the form of triplets. We define T as a set of triplets, E as a set of entities, and R as a set of relations. Let G_IA = {(h, r, t) | h, t ∈ E, r ∈ R} be a collection of triplets, where each (h, r, t) ∈ T means that there is a relation r between head entity h and tail entity t. For example, the triplet (Wolf Warriors, director, Jason Wu) indicates that the movie Wolf Warriors is directed by Jason Wu. Here we assume that all items appear in the KG as entities (i.e., I ⊆ E), which is the common assumption of existing knowledge-aware recommendation systems. We can connect the items in the user-item graph with entities in the KG to offer auxiliary semantics for interactions.

User-Attribute Bipartite Graph (UA). The user-attribute bipartite graph is a combination of the user-item bipartite graph and the item-attribute graph. Let A = {a | a ∈ E, a ∉ I} be the collection of attributes, which supplement the items. Let G_UA = {(u, a) | u ∈ U, a ∈ A, (u, i) ∈ G_UI, (i, r, a) ∈ G_IA}. If user u has interacted with item i and some relation r of item i has attribute a, the pair (u, a) is in G_UA. For example, if user u has watched the movie Wolf Warriors, whose director is Jason Wu, then user u is said to have interacted with Jason Wu, which is expressed as (u, Jason Wu).

User-Item-Attribute (UIA). This is the same as the Collaborative Knowledge Graph defined in KGAT (Wang et al., 2019c).

Intersection@N. This is a new metric we propose to measure the difference between two models. Consider a user u and two models, each producing an ordered list of items sorted by rating for u. We remove the items that appear in the training set from both ordered lists and take the top N of each as sets, denoted M1 and M2 respectively. We denote the test set as T. Then Intersection@N(M_i) = |M1 ∩ M2 ∩ T| / |M_i ∩ T|, where |·| is the number of elements in a set. Its interpretation is shown in Table 9.

Task Description. Given a user-item graph G_UI, a user-attribute graph G_UA and an item-attribute graph G_IA, our task is to predict how likely a user is to adopt an item that she or he has never engaged with.

First, we briefly introduce SGL (Wu et al., 2021) and SimGCL (Yu et al., 2022). SGL proposes three graph augmentations, ND (Node-Drop), ED (Edge-Drop) and RW (Random-Walk), and then applies the InfoNCE (Gutmann & Hyvärinen, 2010) loss to maximize the similarity of representations of the same node and minimize the similarity of different nodes. SimGCL argues that the graph augmentation of SGL is unnecessary; it adopts a simpler method, adding noise to nodes to obtain different representations of the same node, and also applies InfoNCE.
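As a concrete illustration, Intersection@N can be computed per user from the two top-N sets and the test set. The following is a minimal sketch (the function name and toy data are ours, not from the paper's code):

```python
def intersection_at_n(top_m1, top_m2, test_items, base="M1"):
    """Intersection@N for a single user.

    top_m1 / top_m2: top-N item sets of models M1 and M2 (training
    items already removed); test_items: the user's test set.
    `base` picks the model whose test-set hits form the denominator.
    """
    m1, m2, t = set(top_m1), set(top_m2), set(test_items)
    denom = (m1 if base == "M1" else m2) & t
    if not denom:
        return None  # undefined when the base model has no hits in the test set
    return len(m1 & m2 & t) / len(denom)

# Figure 1(a)-style case: M1 keeps all of M2's hits and adds new ones,
# so Intersection@N(M2) = 1 while Intersection@N(M1) < 1.
m1, m2, test = {1, 2, 3, 4}, {1, 2}, {1, 2, 3, 9}
```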

C THE RELATIONSHIP BETWEEN SIMGCL AND SGL IN THE SAME LAYER

Then, we conduct two different experiments under the same conditions (e.g., dataset, batch size). Note that the graph augmentation adopted by SGL in the experiments is ED, due to its best performance. The first experiment uses the recall@20 and ndcg@20 metrics to test the performance of SGL and SimGCL; the results are shown in Table 2. The second experiment uses the Intersection@20 metric to measure the similarity of the two methods at the same layer. The results are shown in Table 10 (take 0.6786 on the Yelp2018 dataset as an example: at the first layer, with SGL as the base, the similarity between them is |SGL ∩ SimGCL ∩ T| / |SGL ∩ T| = 0.6786). From the results, we find that SimGCL performs better on the Yelp2018 and Amazon-Book datasets, while SGL is better on the Last-FM dataset. In addition, we observe that on Yelp2018 and Amazon-Book the similarity between the two methods is around 70% at every layer, while on Last-FM it is around 50%. This means that each method learned a part of the information that the other did not. From these observations, we draw the following conclusions: (1) SimGCL is not a strict improvement over SGL; the relationship between them is intertwined as shown in Figure 1(b), rather than inclusive or juxtaposed. (2) Both graph augmentation and node perturbation are necessary, which contradicts the conclusion of SimGCL (Yu et al., 2022).

D CONTENT SIGNAL EXTRACTOR

Knowledge Graph Embedding Layer. In the UI graph, the movies Wolf Warrior, Wolf Warrior 2 and Mermaid are independent of each other as three ID embeddings. The introduction of the knowledge graph tells us that Wolf Warrior and Wolf Warrior 2 are more similar than Wolf Warrior and Mermaid, because the former pair shares the same director, genres and actors, while the latter pair does not. Many methods, such as TransR (Lin et al., 2015) and RotatE (Sun et al., 2018a), capture this similarity well. Here we apply RotatE, which projects entities and relations onto the complex plane (i.e., h, t ∈ C^{d_GIA}, where d_GIA is the embedding dimension). Specifically, given a triplet (h, r, t) ∈ G_IA, the scoring function d(h, r, t) should be as high as possible; if the triplet (h', r, t') ∉ G_IA, the scoring function d(h', r, t') should be as low as possible. The scoring function is defined as d(h, r, t) = -∥h ∘ r - t∥, where ∘ is element-wise multiplication in the complex field and ∥·∥ is the L1 norm. RotatE is trained so that the scores of valid triplets are higher than those of broken triplets. Its loss function is defined as follows:

L_RotatE = -log σ(γ + d(h, r, t)) - Σ_{i=1}^{n} p(h'_i, r, t'_i) log σ(-d(h'_i, r, t'_i) - γ),   (4)

p(h'_j, r, t'_j | {(h_i, r_i, t_i)}) = exp(α d(h'_j, r, t'_j)) / Σ_i exp(α d(h'_i, r, t'_i)),

where γ and α are hyper-parameters, σ(·) is the sigmoid function, (h, r, t) ∈ G_IA and (h'_i, r, t'_i) ∉ G_IA. The purpose of introducing p is to distinguish easy negative samples from hard negative samples, thereby improving performance. After training, we obtain the embedding representation of each entity e'_i, where e'_i ∈ E. Since A ⊂ E, we take out the entities satisfying e'_i ∈ A, denoted as e. Because e ∈ C^{d_GIA}, e contains two parts, a real part and an imaginary part, which we denote as Re(e) and Im(e) respectively.
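The RotatE score above can be sketched in a few lines of NumPy; this is an illustrative toy version (not the paper's implementation), in which the relation is parameterized by a real phase vector so that it acts as a unit-modulus rotation:

```python
import numpy as np

def rotate_score(h, r_phase, t):
    """RotatE score d(h, r, t) = -||h o r - t||_1 on the complex plane.

    h, t: complex embedding vectors; r_phase: real phase vector,
    so the relation r = exp(i * r_phase) has modulus 1 and rotates h.
    """
    r = np.exp(1j * r_phase)
    return -np.abs(h * r - t).sum()

# If t is exactly h rotated by r, the score reaches its maximum of 0;
# any other tail entity scores strictly lower.
h = np.array([1 + 0j, 0 + 1j])
phase = np.array([np.pi / 2, np.pi / 2])
t = h * np.exp(1j * phase)
```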
For convenience of representation, we denote the embedding of entity e as e^{G_IA}_a = Re(e) ⊕ Im(e), where ⊕ is vector concatenation and e^{G_IA}_a ∈ R^{d_GIA×2}.

User Interest Mining Layer. If user u likes item i, there must be an intent behind it. KGIN (Wang et al., 2021) defines intent as a set of relations, which is coarse-grained. To explore more fine-grained intent, we want to know not only which relations users are more interested in, but also which attributes. The attention mechanism is the key to realizing this idea. Our attention mechanism is based on the Transformer: we take the user as the query vector and the attributes as key vectors to obtain the user's weight for each attribute, and finally aggregate the attributes to obtain the embedding of the item. The network architecture of the user interest mining layer is shown in Figure 4. One reason for introducing the other attribute is that not all information can be provided by the IA graph (e.g., the IA graph only provides the director and actor of a certain movie, but the user likes the movie because of its genre; introducing the other attribute handles this situation). In addition, the same attribute may have different relationships with the same item (e.g., the director and lead actor of Wolf Warrior are both Jason Wu). To model this, we define R' = {id(r) | (h, r, t) ∈ G_IA, h ∈ I, t ∈ A} ∪ {id(-r) | (h, r, t) ∈ G_IA, h ∈ A, t ∈ I}, where id(r) generates a unique, 0-based number for relation r. We also define r(a, i) ∈ R', the relation id between item i and attribute a, where a ∈ A ∪ {other}, i ∈ I. In particular, r(other, i) = |R'|. Unless otherwise specified, the symbol e^{G_IA}_a includes e^{G_IA}_other. Having introduced the notation, we now describe the user interest mining layer in detail.
Following Transformer (Vaswani et al., 2017), TransR (Lin et al., 2015) and RGCN (Schlichtkrull et al., 2018), we first define the key and value vectors of an attribute:

Key_a = W^{r(a,i)}_{key} e^{G_IA}_a,   Value_a = W^{r(a,i)}_{value} e^{G_IA}_a,

where a ∈ A ∪ {other}, i ∈ I, W^{r(a,i)}_{key} ∈ R^{k×(d_GIA×2)}, W^{r(a,i)}_{value} ∈ R^{v×(d_GIA×2)}, Key_a ∈ R^k, Value_a ∈ R^v, and k and v are the dimensions of the attribute key and value vectors, respectively. Then we define the query and value vectors of a user:

Query_u = W_{query} e^{G_IA}_u,   Value_u = W_{value} e^{G_IA}_u,

where u ∈ U, e^{G_IA}_u ∈ R^{d_u}, W_{query} ∈ R^{k×d_u}, W_{value} ∈ R^{v×d_u}, Query_u ∈ R^k, Value_u ∈ R^v, and e^{G_IA}_u is a trainable parameter. We define N_{u,i} = {a | (u, i) ∈ G_UI, (i, r, a) ∈ G_IA, u ∈ U, i ∈ I, a ∈ A, r ∈ R} to represent the set of attributes of item i that user u has interacted with. User u's attention to each attribute of item i is then defined as

Attention_{u,i} = Softmax({Query_u^⊤ Key_a | a ∈ N_{u,i}}).

We multiply the attention by the value vector of the corresponding attribute to model the user's attention to that attribute. Different from the previous two extractors, the item embedding here is user-specific; that is, the embedding of the same item differs across users (e.g., given item i, users u_1 and u_2 pay different degrees of attention to the attributes of item i, so their attention differs and the embedding of item i differs as well). We define the embedding of item i as

e^{G_IA}_{i,u} = Σ_{a ∈ N_{u,i}} ff(Attention_{u,i}(a) · Value_a),   ff(e) = W^1_{ff} LeakyReLU(W^2_{ff} e),

where W^2_{ff} ∈ R^{p×v}, W^1_{ff} ∈ R^{v×p}, p is the dimension of the hidden layer, and LeakyReLU(·) is the activation function. Finally, we define the embedding of user u as e^{G_IA}_u = Value_u.
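Ignoring the relation-specific projections and the feed-forward transform ff(·), the core attention step can be sketched as follows (a toy NumPy version under our own variable names, not the actual implementation):

```python
import numpy as np

def user_item_embedding(query_u, keys, values):
    """Sketch of the user interest mining layer: the user's query vector
    attends over the attribute key vectors of one item (softmax over
    Query_u^T Key_a), and the attention-weighted value vectors are summed
    into a user-specific item embedding. The ff(.) transform is omitted."""
    scores = keys @ query_u                  # one score per attribute in N_{u,i}
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax -> Attention_{u,i}
    return weights, (weights[:, None] * values).sum(axis=0)

rng = np.random.default_rng(0)
keys = rng.normal(size=(3, 8))    # |N_{u,i}| = 3 attributes, key dim k = 8
values = rng.normal(size=(3, 4))  # value dim v = 4
query = rng.normal(size=8)
attn, e_iu = user_item_embedding(query, keys, values)
```

Because the softmax weights depend on the user's query, the same item produces a different embedding for each user, as described above.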
Since we add the other attribute, we hope that its attention and the attention of the remaining attributes are mutually exclusive (i.e., if the attention on other is high, the attention on the remaining attributes should be low, rather than equally high). To enforce this relationship, we introduce the following loss term (i.e., the guiding attention mechanism):

L_attention = exp( -(Attention_{u,i}(other) - μ)² / (2σ²) ),

where μ and σ are hyper-parameters. This formula has the shape of a normal probability density: when the attention on other lies in the middle, the penalty is large; when it tends toward either end, the penalty is small.
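The guiding loss is a one-liner; the sketch below uses the hyper-parameter values reported in Appendix F.2 (μ = 0.5, σ = 0.15) as defaults:

```python
import math

def attention_guiding_loss(attn_other, mu=0.5, sigma=0.15):
    """Gaussian-shaped penalty on the attention of the `other` attribute:
    largest when the attention sits near mu (the middle), vanishing toward
    0 or 1, which pushes the attention toward one of the extremes."""
    return math.exp(-((attn_other - mu) ** 2) / (2 * sigma ** 2))
```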

E OPTIMIZATION

This section gives the loss functions of the three signal extractors. The first two consist of a BPR term and a regularization term; the last one additionally includes the attention constraint.

L_UI = Σ_{(u,i) ∈ G_UI, (u,j) ∉ G_UI} -ln σ(y^{G_UI}_{u,i} - y^{G_UI}_{u,j}) + λ_{G_UI} ∥Θ_{G_UI}∥²₂

L_UA = Σ_{(u,i) ∈ G_UI, (u,j) ∉ G_UI} -ln σ(y^{G_UA}_{u,i} - y^{G_UA}_{u,j}) + λ_{G_UA} ∥Θ_{G_UA}∥²₂

L_IA = Σ_{(u,i) ∈ G_UI, (u,j) ∉ G_UI} -ln σ(y^{G_IA}_{u,i} - y^{G_IA}_{u,j}) + λ^{G_IA}_{attention} L_attention + λ_{G_IA} ∥Θ_{G_IA}∥²₂   (11)

where Θ_{G_UI} = {e^{(0)}_{G_UI:i}, e_{G_UI:u} | i ∈ I, u ∈ U}, Θ_{G_UA} = {e^{(0)}_{G_UA:a}, e_{G_UA:u} | a ∈ A, u ∈ U} and Θ_{G_IA} = {e^{G_IA}_{other}, e^{G_IA}_a, e^{G_IA}_u, W^{r(a,i)}_{key}, W^{r(a,i)}_{value}, Query_u, Value_u, W^1_{ff}, W^2_{ff} | a ∈ A, u ∈ U, r(a, i) ∈ R'} are the sets of model parameters. λ_{G_UI}, λ_{G_UA}, λ_{G_IA} and λ^{G_IA}_{attention} are hyper-parameters controlling the L2 regularization terms and the attention loss, respectively.
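The common BPR-plus-regularization shape of these losses can be sketched as follows (an illustrative NumPy version; the L_IA variant would add the attention term on top):

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores, params, lam):
    """BPR ranking loss with L2 regularization, the shape shared by
    L_UI and L_UA above: -ln sigma(y_pos - y_neg) summed over paired
    observed/unobserved interactions, plus lam * ||Theta||_2^2."""
    sig = 1.0 / (1.0 + np.exp(-(pos_scores - neg_scores)))
    rank = -np.log(sig).sum()
    reg = lam * sum(float((p ** 2).sum()) for p in params)
    return rank + reg
```

As expected from the formula, widening the margin between positive and negative scores lowers the loss, and the regularizer adds a penalty proportional to the squared parameter norms.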

F ADDITIONAL EXPERIMENTAL SETUPS AND RESULTS

F.1 BASELINES

MF (Rendle et al., 2012) is a typical matrix factorization method that does not use KG information. In our implementation it uses the ID embeddings of users and items to make predictions.

CKE (Zhang et al., 2016) is a representative embedding-based method. It utilizes TransR (Lin et al., 2015) to encode entities in the KG, which are then used as input to an MF framework.

KGNN-LS (Wang et al., 2019a) is a propagation-based model, which converts the KG into user-specific graphs and then considers user preferences on KG relations and label smoothness in the information aggregation phase, so as to generate different representations of the same item for different users.

KGAT (Wang et al., 2019c) is a propagation-based recommendation model. It applies a unified relation-aware attentive aggregation mechanism on the UIA graph to generate user and item representations.

CKAN (Wang et al., 2020c) is based on KGNN-LS and utilizes different aggregation schemes on the user-item graph and the KG, respectively, to mine knowledge associations and collaborative signals.

KGIN (Wang et al., 2021) is a state-of-the-art propagation-based method, which models user interaction behaviors with latent intents and proposes a relation-aware information aggregation scheme to capture long-range connectivity in the KG.

HAKG (Du et al., 2022) is a state-of-the-art multiview-based method which embeds users, relations and items in hyperbolic space and uses a hyperbolic aggregation scheme. It learns from the UI and IA graphs to generate collaborative signals and knowledge associations, and applies a gating mechanism to fuse them.

KGCL (Yuhao et al., 2022) is a general contrastive learning framework using a knowledge graph augmentation schema. Besides, it leverages additional supervision signals to guide a cross-view contrastive learning paradigm.

F.2 PARAMETER SETTINGS

For a fair comparison, we fix the size of ID embeddings to 64 (except that the embedding of the UA component on the Amazon-Book dataset is 128), the optimizer to Adam (Kingma & Ba, 2014), and the batch size to 4096 for all methods. The Xavier (Glorot & Bengio, 2010) initializer is used to initialize the model parameters. We search the learning rate lr ∈ {10^-4, 10^-3, 10^-2}, τ_0, τ_1, τ_2 ∈ {0.1, 0.2, ..., 1.0} and λ_{G_UI}, λ_{G_UA}, λ_{G_IA} ∈ {10^-3, 10^-4, 10^-5}, and set μ = 0.5, σ = 0.15, λ^{G_IA}_{attention} = 10^-5 and p = 128. The parameters of all baseline methods are carefully tuned to achieve optimal performance. Specifically, for KGAT (Wang et al., 2019c) we set the depth to three with hidden sizes {64, 32, 16} and use the pre-trained ID embeddings of MF (Rendle et al., 2012) as initialization; for CKAN (Wang et al., 2020c) and KGNN-LS (Wang et al., 2019a) we set the neighborhood size to 16; for KGIN (Wang et al., 2021) we fix the number of intents to 4. Moreover, an early stopping strategy is applied to all methods, i.e., training stops prematurely if recall@20 on the test set does not increase for 10 successive epochs.
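The early-stopping rule above can be expressed as a small helper; this is a minimal sketch of the stated criterion (our own function, not the authors' code):

```python
def should_stop(recall_history, patience=10):
    """Stop training once recall@20 has failed to improve on its previous
    best for `patience` successive epochs. recall_history holds one
    recall@20 value per epoch, in order."""
    if len(recall_history) <= patience:
        return False  # not enough epochs observed yet
    best_before = max(recall_history[:-patience])
    return max(recall_history[-patience:]) <= best_before
```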

F.3 ADDITIONAL STUDY OF KGSF(RQ2)

Since graph decoupling, independent training and the signal fusion mechanism are the core of KGSF, we conduct extensive experiments to explore their effectiveness. Specifically, we first analyze individual components, including the attention guiding mechanism of the IA component and the cross-layer fusion mechanism of the UA component. We then delve into the independence and completeness of each signal extractor.

Impact of Attention Guiding Mechanism in IA Graph. Here we verify the effectiveness of the attention guiding mechanism by designing two variants of the IA component. One variant discards the attention guiding mechanism, denoted "IA w/o G", where λ^{G_IA}_{attention} = 0. The other variant retains it, denoted "IA", where λ^{G_IA}_{attention} = 1. The experimental results are shown in Table 11.

Impact of model depth and Cross-Layer Fusion Mechanism in UA component. Here, we search for L in the range {1, 2, 3} and then use the cross-layer fusion mechanism to fuse the layers. We use "UA-i" to denote the UA component stacked with i layers and "Fusion" to denote the model with the cross-layer fusion mechanism. The experimental results are shown in Table 12. Our observations are as follows:

• On all three datasets, the performance of "UA-2" exceeds that of "UA-1", which in turn exceeds that of "UA-3". One possible reason is that the UA component can simplify the high-order connectivity proposed by KGAT: KGAT needs to stack 3 to 4 layers to achieve good results, while the UA component achieves good results with 2 layers.

• According to the conclusion in Section 2.1, the information extracted by the UA component at different layers differs. Applying the cross-layer fusion mechanism, we find that its performance is better than that of any single-depth model, with obvious improvements on the Last-FM and Amazon-Book datasets. This further verifies the existence of partially independent information across layers and the effectiveness of the cross-layer fusion mechanism.

Table 13: Intersection@20 between KGIN and three components.

        Yelp2018            Last-FM             Amazon-Book
        KGIN     KGIN*      KGIN     KGIN*      KGIN     KGIN*
  UI    0.4002   0.5588     0.5725   0.5900     0.5501   0.6144
  UA    0.3852   0.2039     0.5212   0.4953     0.5318   0.2725
  IA    0.3680   0.1699     0.5558   0.2734     0.5042   0.1617

Completeness of Three Signal Extractors. We judge the completeness of the signal extractors, that is, whether the three signal extractors can extract all the information after introducing the KG. Our experimental setup is to select a benchmark method and then analyze its component proportions with respect to the three independent signal extractors. The benchmark method selected here is the SOTA method KGIN (Wang et al., 2021). The experimental results are shown in Figure 6(b), Figure 7 and Table 13.

• According to Figure 7, about 30% of the information is not extracted by the three signal extractors. The UI and UA components apply the cross-layer fusion mechanism, and according to the conclusion in Section 2.2, the cross-layer fusion mechanism loses some information. We therefore drop the cross-layer fusion mechanism, changing the original signal extractors from three ("UI", "UA", "IA") to seven ("UI-1", "UI-2", "UI-3", "UA-1", "UA-2", "UA-3", "IA"). The experimental results show that about 25% of the information still does not exist in any signal extractor; that is, the signal extractors cover only about 75% of the information and are therefore not complete. A possible reason is that relations are not preserved in the three signal extractors.

• According to Table 13, the components of KGIN (Wang et al., 2021) are, in descending order, UI, UA and IA, with the UI component accounting for about 60% on the three datasets. From the perspective of the UI component, its retention rate lies between 40% and 50%. This shows that the end-to-end method of KGIN cannot completely retain the user-item collaborative filtering signal. The reason why KGIN still outperforms collaborative filtering is that the information it discards is less than the new information it learns.

F.4 EXPLAINABILITY OF KGSF (RQ3)

Benefiting from the separate modeling and fusion mechanism of KGSF, we can obtain more fine-grained explanations than KGIN. We randomly select user u_335 and a related item i_4079 (from the test set, unseen during training), a book called Never Go Back. The interpretation obtained after visualization is shown in Figure 8. We have the following findings:

• From Figure 8(a), we find that user u_335 likes this book not because of the item-based collaborative signal, but because of the attribute-based collaborative signal or the content signal.

• Figure 8(b) and Figure 8(c) show the user's overall preference and the preference for attributes of item i_4079 generated by the UA component, respectively. For Figure 8(b), after taking the dot product between the user and all attributes, we select the 10 attributes with the largest values as the user's 10 favorite attributes. All 10 attributes are writers, and these writers' styles tend toward thriller, crime, mystery, etc., which we can regard as the user's preference. Figure 8(c) shows the top three attributes obtained by normalizing the dot products between the user and the attributes of item i_4079. We find that 40% of the reason why user u_335 likes item i_4079 is the author Lee Child, 27% is the protagonist "Jack Reacher", and 10% is that the book is a novel.

• As shown in Figure 8, the two separately trained models reach the same conclusion, which shows that this interpretation is credible. Figure 8(e) gives the detailed attention values. For example, the user prefers that the character in the book is "Jack Reacher" (the attention for the path Never Go Back -characters-> Jack Reacher is 0.24) rather than that "Jack Reacher" appears in the book (the attention for the path Jack Reacher -appears-> Never Go Back is only 0.01). This is very intuitive.

G RELATED WORK

Embedding-based methods (Zhang et al., 2016; Cao et al., 2019a; Ai et al., 2018; Cao et al., 2019b; Huang et al., 2018; Wang et al., 2020a; 2018b) first use knowledge graph embedding techniques (e.g., TransR (Lin et al., 2015), RotatE (Sun et al., 2018a)) to obtain entity and relation embeddings, and then feed these embeddings into subsequent recommendation networks. For example, CKE (Zhang et al., 2016) uses TransR (Lin et al., 2015) to learn the structural information of entities from the knowledge graph and then inputs the learned embeddings into matrix factorization (MF (Rendle et al., 2012)). KTUP (Cao et al., 2019a) applies TransH (Wang et al., 2016) on both the user-item bipartite graph and the knowledge graph to jointly learn user preferences and complete the recommendation. Although these methods capture the similarity between entities brought by the KG, they ignore the information brought by higher-order connectivity.

Path-based methods (Catherine & Cohen, 2016; Hu et al., 2018; Jin et al., 2020; Ma et al., 2019; Wang et al., 2019e; Sun et al., 2018b) exploit higher-order connectivity for recommendation by finding semantic paths in the KG that connect items and users. These paths can be fed into an RNN (Wang et al., 2019e; Sun et al., 2018b) or an attention mechanism (Hu et al., 2018) to extract user preferences. For example, KPRN (Wang et al., 2019e) infers the potential high-order connectivity of a user-item interaction by mining the sequential dependence within a knowledge-aware path. However, defining correct meta-paths requires domain knowledge, which is labor-intensive and time-consuming for KGs with many relations and various types of entities. Meanwhile, recommendation systems designed for different fields cannot be transferred to each other, so their generalization ability is poor.
Propagation-based methods (Wang et al., 2021; 2019c; Tu et al., 2021; Wang et al., 2018a; 2019a; b; 2020c) rely on the information aggregation ability of GNNs and layer stacking to capture high-order connectivity automatically in an end-to-end manner. For example, KGAT (Wang et al., 2019c) introduces an attention mechanism for learning on a unified UIA graph. Building on KGAT (Wang et al., 2019c), KGIN (Wang et al., 2021) transfers relational information and introduces intent nodes between users and items to achieve better interpretability and performance. Although this unified graph structure is intended to capture collaborative filtering signals, high-order connectivity and knowledge information, our experiments show that only part of this information is actually captured. The reason is that these kinds of information are mixed together when propagating on the graph, which introduces incalculable noise.

Multiview-based methods (Zou et al., 2022; Yuhao et al., 2022; Du et al., 2022) construct multiple views to learn information from different perspectives and then learn item and user embeddings by designing a fusion mechanism. For example, HAKG (Du et al., 2022) learns from the user-item graph and the knowledge graph to generate collaborative signals and knowledge associations, and applies a gating mechanism to fuse them. Although this approach achieves better performance than propagation-based methods, it still cannot learn the full information of these views, because different kinds of information are still represented by a single embedding.



Figure 1: Two categories of improvement

Figure 2: Components of the fused model (SGL).

Figure 3: Components of the fused model (SimGCL).

Figure 4: Illustration of the proposed KGSF framework.

Figure 6: Components of the KGSF.

Figure 4(f): the inputs to this layer are e^{G_IA}_a and e^{G_IA}_other, where e^{G_IA}_a, e^{G_IA}_other ∈ R^{d_GIA×2}.

Figure 7: Components of the KGIN.

Figure 8: Explanations of user intents and real cases in Amazon-Book.

Figure 8(d) and Figure 8(e) show the user's attention to the relations of this book and the attribute attention of this book generated by the IA component, respectively. It can be seen that user u_335 likes the book because of its type, its character and its author, which is consistent with the conclusion from Figure 8(c).

Performance comparison between SGL and SimGCL.

Overall performance comparison.

Performance comparison of KGSF components.

Intersection@20 between different components. The information learned by each signal extractor is different. We adopt Intersection@20 to measure independence. Specifically, we calculate the Intersection@20 between the different trained signal extractors. The results are shown in Table 5 (the value in each row takes that extractor as the base; e.g., 0.2008 on Yelp2018 indicates that the UA component is the base, i.e., |UA ∩ IA ∩ T| / |UA ∩ T| = 0.2008).

Statistics of the datasets.

Performance comparison of SGL, SimGCL and their Fusion

Impact of Attention Guiding Mechanism.

Degraded performance is observed on all datasets, indicating the necessity of the attention guiding mechanism. Specifically, without the attention guiding mechanism, the model preferentially optimizes the other embedding, making the attention on other close to that of the remaining attributes, which does not help the interpretability of the model. Introducing the attention guiding mechanism makes the model pay proper attention to the information carried by the other attribute.

Impact of the number of layers L and Fusion.

The Inverse in Table 11 means KGIN (Wang et al., 2021).



