NODE IMPORTANCE SPECIFIC META LEARNING IN GRAPH NEURAL NETWORKS

Abstract

While current node classification methods for graphs have enabled significant progress in many applications, they rely on abundant labeled nodes for training. In many real-world datasets, labeled nodes for some classes are always scarce, so current algorithms are ill-equipped to handle these few-shot node classes. Some meta learning approaches for graphs have demonstrated advantages in tackling such few-shot problems, but they disregard the impact of node importance on a task. Exclusive to graph data, the dependencies between nodes convey vital information for determining node importance beyond node features alone, which poses unique challenges. In this paper, we investigate the effect of node importance in node classification meta learning tasks. We first theoretically analyze the influence of distinguishing node importance on the lower bound of the model accuracy. Based on this theoretical conclusion, we then propose a novel Node Importance Meta Learning architecture (NIML) that learns and applies an importance score for each node in meta learning. Specifically, after constructing an attention vector based on the interaction between a node and its neighbors, we train an importance predictor in a supervised manner to capture the distance between a node's embedding and the expectation of its same-class embedding. Extensive experiments on public datasets demonstrate the state-of-the-art performance of NIML on few-shot node classification problems.

1. INTRODUCTION

Graph structures can model various complicated relationships and systems, such as molecular structures (Subramanian et al., 2005), citation networks (Tang et al., 2008b) and social media relationships (Ding et al., 2019). The use of various deep learning methods (Hamilton et al., 2017; Kipf & Welling, 2016) to analyze graph-structured data has sparked broad research interest recently, with node classification as one of the essential problems. Several types of graph neural networks (GNNs) (Veličković et al., 2017; Wu et al., 2020) have been proposed to address this problem by learning high-level feature representations of nodes and solving the classification task end-to-end. Despite their success in various domains, the performance of GNNs drops dramatically in the few-shot scenario (Mandal et al., 2022), where extremely few labeled nodes are available for novel classes. For example, annotating nodes in graph-structured data is challenging when the samples originate from specialist disciplines (Guo et al., 2021) such as biology and medicine. Many meta learning works, including optimization-based methods (Finn et al., 2017) and metric-based methods (Snell et al., 2017; Vinyals et al., 2016), have demonstrated their power to address few-shot problems in diverse applications, such as computer vision and natural language processing (Lee et al., 2022). In meta learning, a meta learner is trained on various tasks with limited labeled data so that it becomes capable of fast generalization and adaptation to a new task that has never been encountered before. However, it is considerably challenging to generalize meta learning algorithms designed for independent and identically distributed (i.i.d.) Euclidean data to graph data. To address the few-shot node classification problem, several graph meta learning approaches have been proposed (Liu et al., 2021; Ding et al., 2020; Yao et al., 2020). They structure the node classification problem as a collection of tasks.
The key idea is to learn the class of nodes in the query set by transferring knowledge from the limited support nodes in each task. However, most existing approaches simply assume that all labeled nodes are equally important in representing the class they belong to. Differences and interdependencies between nodes are not considered in the learning process of these few-shot models. Since only limited data points are sampled to generate tasks in meta learning, each sampled task has high variance; treating all data points equally might therefore lose the crucial information supplied by central data points and render the model vulnerable to noise and outliers. In particular, the relationship between a node and its neighbors in a graph is an important factor that carries node information beyond node features, and it can serve as a starting point for investigating node importance. Although some work (Ding et al., 2020) considers node importance, a theoretical analysis of its effect is lacking. To address these challenges, we first explore theoretically how distinguishing nodes of different degrees of importance affects the lower bound of model accuracy. We analyze ProtoNet (Snell et al., 2017) and conclude that when important nodes are given more weight in computing prototype representations for a task, the prototype moves closer to its own expectation, and the lower bound of the accuracy thus increases. Based on this theoretical result, we propose a Node Importance Meta Learning framework (NIML) for learning and using node importance in a task. Specifically, an attention vector is constructed for each node to describe the relationship distribution between that node and its neighbors.
We then train a supervised model with this attention vector as input to learn the distance between the node embedding and the same-class prototype expectation, effectively capturing the importance of that node to its class. The obtained distance is used to calculate a weighted prototype in meta learning. We conduct experiments on three benchmarks, and the results validate the superiority of the proposed NIML framework. To summarize, the main contributions of this paper are as follows: 1) We theoretically explore the influence of node importance on the lower bound of model accuracy and show the benefit of distinguishing between nodes of different importance in a meta learning task. The theoretical conclusion applies to any domain, not only graph data. 2) We design a category-irrelevant predictor that estimates the distance between a node embedding and the approximated prototype expectation, and we follow the theoretical conclusion to compute a weighted prototype; as input, we construct an attention vector that describes the distribution of neighbor relationships for a given node. 3) We perform extensive experiments on various real-world datasets and show the effectiveness of our approach.

2. RELATED WORK

2.1. GRAPH NEURAL NETWORKS

Recent efforts to develop deep neural networks for graph-structured data have been largely driven by the phenomenal success of deep learning (Cao et al., 2016; Chang et al., 2015). A large number of graph convolutional networks (GCNs) have been proposed based on graph spectral theory. Spectral CNN (Bruna et al., 2013) mimics the properties of CNNs by defining graph convolution kernels at each layer to form a GCN. Building on this work, research on GCNs has achieved increasing success (Defferrard et al., 2016; Henaff et al., 2015; Kipf & Welling, 2016). Graph Attention Networks (GATs) (Veličković et al., 2017) learn the weights of node neighbors in the aggregation process with an attention mechanism. GraphSAGE (Hamilton et al., 2017) utilizes aggregation schemes to aggregate feature information from local neighborhoods. However, modern GNN models are primarily concerned with semi-supervised node classification; the few-shot setting remains one of their largest obstacles. We therefore develop a GNN framework to address few-shot problems on graph data.

2.2. META LEARNING

Existing meta learning algorithms mainly fall into two categories (Hospedales et al., 2020): optimization-based meta learning and metric-based meta learning. Optimization-based meta learning (Finn et al., 2017; Li et al., 2017; Mishra et al., 2017; Ravi & Larochelle, 2016) aims to learn an initialization of parameters for a gradient-based network. MAML (Finn et al., 2017) discovers a parameter initialization that is suitable for various few-shot tasks and can be used in any gradient descent model. MetaSGD (Li et al., 2017) advances MAML by learning the weight initialization, gradient update direction, and learning rate in a single step. Metric-based meta learning (Liu et al., 2019; Ren et al., 2018; Snell et al., 2017; Sung et al., 2018; Vinyals et al., 2016) focuses on learning a generalized metric and matching function from training tasks. In particular, Prototypical Networks (ProtoNet) (Snell et al., 2017) embed each input into a continuous latent space and carry out classification using the similarity of an example to the representations of latent classes. Matching Networks (Vinyals et al., 2016) learn a weighted nearest-neighbor classifier with attention networks. Ren et al. (2018) propose a novel extension of ProtoNet that is augmented with the ability to use unlabeled examples when producing prototypes. Relation Network (Sung et al., 2018) classifies new classes by computing a relation score between the query set and a few samples of each new class. Most existing meta learning methods cannot be directly applied to graph data because they lack the ability to handle node dependencies.

2.3. FEW SHOT LEARNING ON GRAPHS

Current node representation learning cannot handle unseen classes with few-shot data. Few-shot research on graphs targets node, link, or graph classification (Mandal et al., 2022); we review the node classification works here. Meta-GNN (Zhou et al., 2019) extends MAML (Finn et al., 2017) to graph data. RALE (Liu et al., 2021) considers the dependencies between nodes within a task and the alignment between tasks, learning hub-based relative and absolute location embeddings. G-Meta (Huang & Zitnik, 2020) uses a local subgraph to represent each node given local structural information. MetaHG (Qian et al., 2021) presents a heterogeneous graph few-shot learning model for automatically detecting illicit drug traffickers on Instagram. MetaTNE (Lan et al., 2020) combines the skip-gram mechanism with meta learning to capture structural information for classes with known labels, without node attributes. GFL (Yao et al., 2020) implements few-shot classification on unseen graphs for the same set of node classes. GPN (Ding et al., 2020) aggregates node importance scores and learns node embeddings on a few-shot attributed network based on ProtoNet. However, a theoretical analysis of the effect of node importance on meta learning is still missing.

3. PRELIMINARIES

3.1. META LEARNING PROBLEM SETUP

We first introduce notation for few-shot classification problems. Let $\mathcal{C}$ be the space of classes with a probability distribution $\tau$, and $\mathcal{X}$ be the space of input data. We sample $N$ classes $c_1, \dots, c_N$ i.i.d. from $\tau$ to form an $N$-way classification problem. For each class $c_i$, $k$ data points are sampled as $S_i = \{({}^{s}x_j, {}^{s}y_j) \in \mathcal{X} \times \mathcal{C} \mid {}^{s}y_j = c_i\}$ to constitute the support set, where ${}^{s}x_j \in \mathbb{R}^D$, $D$ is the dimension of the input data, and ${}^{s}y_j$ is the class of ${}^{s}x_j$. The support set is the union $S = \cup_{i=1}^{N} S_i$. Likewise, for each class $c_i$, we sample $m$ data points in the same way to form part of the query set $Q$. A table of notation and definitions can be found in the appendix. The core idea of meta learning algorithms is to train on various tasks sampled from the distribution $\tau$ and thereby equip the model with the ability to quickly generalize and adapt to unseen tasks with limited labeled data. Each $N$-way $k$-shot task is sampled by the above method. In the meta-train phase, the ground truth of both $S$ and $Q$ is known, and $Q$ is used to evaluate the performance of the model updated on $S$. During the meta-test phase, the performance of the model is evaluated on unseen classes. We assume each unseen class follows the same distribution $\tau$.
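As a concrete illustration (ours, not code from the paper), the episode construction described above can be sketched as follows; the class pools and all names are toy assumptions:

```python
import random

def sample_episode(class_pools, n_way, k_shot, m_query, rng):
    """Sample one N-way k-shot task: a support set S and a query set Q."""
    classes = rng.sample(sorted(class_pools), n_way)      # draw N classes from tau
    support, query = [], []
    for c in classes:
        points = rng.sample(class_pools[c], k_shot + m_query)
        support += [(x, c) for x in points[:k_shot]]      # k labeled points per class
        query += [(x, c) for x in points[k_shot:]]        # m evaluation points per class
    return support, query

# Toy "data": three classes with ten points each.
pools = {"c1": list(range(10)), "c2": list(range(10, 20)), "c3": list(range(20, 30))}
S, Q = sample_episode(pools, n_way=2, k_shot=2, m_query=3, rng=random.Random(0))
```

In the meta-train phase such episodes are drawn repeatedly; at meta-test time the same sampler is applied to the unseen classes.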

3.2. PROTOTYPICAL NETWORKS

ProtoNet (Snell et al., 2017) is a metric-based meta learning algorithm. It learns an embedding function $f_\phi: \mathbb{R}^D \rightarrow \mathbb{R}^M$, which maps input data from $\mathcal{X}$ to the embedding space. The $M$-dimensional prototype representation $\mathbf{c}_i$ of each class $c_i$ is computed by averaging the embeddings of all support data points belonging to $c_i$:

$$ \mathbf{c}_i = \frac{1}{|S_i|} \sum_{j=1}^{k} f_\phi({}^{s}x_j). \quad (1) $$

Given a distance function $d(x, x')$, the probability that a data point $x$ belongs to class $n$ is calculated by a Softmax function over the squared distances between the embedding of $x$ and the prototype representations:

$$ p_\phi(y = n \mid x) = \frac{\exp(-d(f_\phi(x), \mathbf{c}_n))}{\sum_{j=1}^{N} \exp(-d(f_\phi(x), \mathbf{c}_j))}. \quad (2) $$

The prediction for an input $x$ is computed by taking the argmax over the probability function: $\hat{y} = \arg\max_j p_\phi(y = j \mid x)$. The loss function for an input belonging to class $n$ is the negative log-likelihood $J(\phi) = -\log p_\phi(y = n \mid x)$, and the parameters of the embedding function $f_\phi$ are updated by minimizing the sum of the losses on the query sets. After meta learning, $f_\phi$ has the ability to embed data points belonging to the same class into the same group in the embedding space $\mathbb{R}^M$.
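A minimal NumPy sketch of Equations (1)-(2) (ours, for illustration; the embedding function is stubbed out, so the "embeddings" below are given directly):

```python
import numpy as np

def prototypes(support_emb, support_y, classes):
    # Eq. (1): each prototype is the mean embedding of its support points
    return np.stack([support_emb[support_y == c].mean(axis=0) for c in classes])

def predict(query_emb, protos):
    # Eq. (2): softmax over negative squared Euclidean distances to prototypes
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p.argmax(axis=1), p

emb = np.array([[0., 0.], [0., 2.], [4., 0.], [4., 2.]])  # toy support embeddings
y = np.array([0, 0, 1, 1])
protos = prototypes(emb, y, [0, 1])                   # prototypes [0, 1] and [4, 1]
pred, prob = predict(np.array([[0.5, 1.0]]), protos)  # nearest prototype: class 0
```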

4. THEORETICAL ANALYSIS

In this section, we use ProtoNet (Snell et al., 2017), a classic metric-based meta learning algorithm, as an example to theoretically explore the effect of node importance on the lower bound of model accuracy in the embedding space. The theoretical conclusion is that assigning a higher weight to the data point that is closer to the prototype expectation increases the lower bound of accuracy. This conclusion motivates us to use abundant data to learn the distance between node representations and prototype expectations in the NIML framework. We derive our theorem based on a previous work (Cao et al., 2019); the detailed proof is included in Appendix A.1. We first define the expected accuracy $R$ of $\phi$ as:

$$ R(\phi) = \mathbb{E}_{c}\,\mathbb{E}_{S,x,y}\!\left[\mathbb{I}\!\left(\arg\max_j \{p_\phi(\hat{y}=j \mid x, S)\} = y\right)\right], \quad (3) $$

where $\mathbb{I}$ denotes the indicator function. To simplify the theorem, we present the analysis for a special case, the 2-way 2-shot problem, i.e., binary classification with 2 nodes per class. Note that the theorem can also be extended to an $N$-way $k$-shot problem. We adopt the assumption that for any input $x$ in each class $c$, the embedding vector $f_\phi(x)$ follows a Gaussian distribution, $p(f_\phi(x) \mid y = c) = \mathcal{N}(\mu_c, \Sigma_c)$, where $\mu_c$ is the expectation of $f_\phi(x)$ when $x$ belongs to class $c$, and $\Sigma_c$ is the expected intra-class variance of class $c$. We denote by $\Sigma$ the variance between classes.

Define importance based on prototype deviation: We want to explore the influence of differentiating data of different degrees of importance on the accuracy $R$. Since only a few data points are sampled per class in a task, when we compute $\mathbf{c}_i$ following Equation (1), there is a deviation between $\mathbf{c}_i$ and $\mu_i$. As we simplify the problem to a 2-shot setting, the embedding vectors of the two nodes belonging to class $c_i$ can be denoted by $\mu_i - \epsilon_1$ and $\mu_i + \epsilon_2$, respectively.
We would like to emphasize that the signs of $\epsilon_1, \epsilon_2$ can be permuted freely and have no effect on the theorem. We then naturally treat the node whose embedding vector is closer to the expectation $\mu_i$ as the more important node. Based on this consideration, we redefine the prototype calculation as below.

Definition 1 We change the definition of $\mathbf{c}_i$ to a weighted form. Let $x_1$ and $x_2$ be the feature vectors of two nodes belonging to class $c_i$, with embeddings $f_\phi(x_1) = \mu_i - \epsilon_1$ and $f_\phi(x_2) = \mu_i + \epsilon_2$. Let $w_1$ and $w_2$ be the weights associated with $f_\phi(x_1)$ and $f_\phi(x_2)$, which can be either trainable or pre-defined. Then,

$$ \mathbf{c}_i = \frac{w_1}{w_1 + w_2} f_\phi(x_1) + \frac{w_2}{w_1 + w_2} f_\phi(x_2). \quad (4) $$

When $w_1 = w_2$, Equation (4) is equivalent to Equation (1). We would like to prove our key idea: in Definition 1, when $w_1, w_2$ and $\epsilon_1, \epsilon_2$ have opposite relative value relationships (i.e., if $w_1 > w_2$, then $\epsilon_1 < \epsilon_2$), which means greater weight is assigned to the more important node, the lower bound of the model accuracy is raised. Some theoretical results are provided below; the whole proof is included in the Appendix.

Let $a$ and $b$ denote the two classes sampled from $\tau$ for a task. Since all classes follow the same distribution, we need only select one class, investigate the model accuracy for the nodes inside this class, and extend the result to the remaining class. Let $x$ be the feature of a node drawn from class $a$; then Equation (3) can be written as:

$$ R(\phi) = \mathbb{E}_{a,b \sim \tau}\, \mathbb{E}_{x \sim a, S}\, \mathbb{I}[\hat{y} = a]. \quad (5) $$

Proposition 1 We can express Equation (5) as a probability function:

$$ R(\phi) = \Pr_{a,b,x,S}(\hat{y} = a) = \Pr_{a,b,x,S}(\alpha > 0), \quad (6) $$

where $\alpha \triangleq \|f_\phi(x) - \mathbf{c}_b\|^2 - \|f_\phi(x) - \mathbf{c}_a\|^2$. From the one-sided Chebyshev's inequality, it can be derived that:

$$ R(\phi) = \Pr(\alpha > 0) \geq \frac{\mathbb{E}[\alpha]^2}{\mathrm{Var}(\alpha) + \mathbb{E}[\alpha]^2}. \quad (7) $$

Lemma 1 Consider the space of classes $\mathcal{C}$ with sampling distribution $\tau$, and $a, b \overset{iid}{\sim} \tau$. Let $S = \{S_a, S_b\}$, $S_a = \{{}^{a}x_1, \dots, {}^{a}x_k\}$, $S_b = \{{}^{b}x_1, \dots, {}^{b}x_k\}$, $k \in \mathbb{N}$. Then:

$$ \mathbb{E}_{x,S|a,b}[\alpha] = (\mu_a - \mu_b)^T(\mu_a - \mu_b) + (2\mu_b + \sigma_b - 2\mu_a)^T \sigma_b + \sigma_b^T \sigma_b - \sigma_a^T \sigma_a, $$
$$ \mathbb{E}_{a,b,x,S}[\alpha] = 2\,\mathrm{Tr}(\Sigma) + \sigma_b^T \sigma_b - \sigma_a^T \sigma_a, $$
$$ \mathbb{E}_{a,b}[\mathrm{Var}(\alpha \mid a, b)] \leq 8\left(1 + \frac{1}{k}\right) \mathrm{Tr}\left\{\Sigma_c\left(\left(1 + \frac{1}{k}\right)\Sigma_c + 2\Sigma + \sigma_b^T \sigma_b + \sigma_a^T \sigma_a\right)\right\}, $$

where $\sigma_a = \frac{{}^{a}w_2\,{}^{a}\epsilon_2 - {}^{a}w_1\,{}^{a}\epsilon_1}{{}^{a}w_1 + {}^{a}w_2}$ and $\sigma_b = \frac{{}^{b}w_2\,{}^{b}\epsilon_2 - {}^{b}w_1\,{}^{b}\epsilon_1}{{}^{b}w_1 + {}^{b}w_2}$.

Lemma 1 provides several key components for Theorem 1. Two new variables are introduced, $\sigma_a$ and $\sigma_b$, defined by $\sigma_a = \mathbf{c}_a - \mu_a$ and $\sigma_b = \mathbf{c}_b - \mu_b$.

Theorem 1 Under the conditions where Lemma 1 holds, we have:

$$ R(\phi) \geq \frac{\left(2\,\mathrm{Tr}(\Sigma) + \sigma_b^T \sigma_b - \sigma_a^T \sigma_a\right)^2}{f_1(\sigma_a, \sigma_b) + f_2(\sigma_a, \sigma_b)}, $$

where

$$ f_1(\sigma_a, \sigma_b) = 12\,\mathrm{Tr}\left\{\Sigma_c\left(\tfrac{3}{2}\Sigma_c + 2\Sigma + \sigma_b^T \sigma_b + \sigma_a^T \sigma_a\right)\right\}, $$
$$ f_2(\sigma_a, \sigma_b) = \mathbb{E}_{a,b}\left[\left((\mu_a - \mu_b)^T(\mu_a - \mu_b) + (2\mu_b + \sigma_b)^T \sigma_b\right)^2\right]. $$

The lower bound of the model accuracy $R(\phi)$ is a fraction, whose denominator we write as the sum of two functions $f_1(\sigma_a, \sigma_b)$ and $f_2(\sigma_a, \sigma_b)$. We would like to investigate the effect of a change in $\sigma_a, \sigma_b$ on $R(\phi)$, where $\sigma_a, \sigma_b$ are the biases between $\mu_a, \mu_b$ and $\mathbf{c}_a, \mathbf{c}_b$. From the definition in Lemma 1, we can divide $\sigma_c$ for a class $c$ into three cases. If $w$ and $\epsilon$ are negatively correlated, the value of $\sigma_c$ is closest to 0 among the three cases. If the same $w$ is given to each $\epsilon$, this corresponds to computing the prototype directly with the average embedding. If $w$ and $\epsilon$ are positively correlated, the opposite of the first case, the value of $\sigma_c$ is farthest from 0. We emphasize that all classes in one episode share the same assignment strategy, so $\sigma_a$ and $\sigma_b$ are positively correlated. According to Theorem 1, $\sigma_a$ and $\sigma_b$ always appear in the form of a squared norm; thus their signs have little effect on the result. In the numerator, $\sigma_b^T \sigma_b$ and $\sigma_a^T \sigma_a$ are subtracted, whereas in the denominator they are added.
After analyzing their degrees and coefficients, we reach the following conclusion: when we use the first strategy to assign $w$ given $\epsilon$, the lower bound of the accuracy $R(\phi)$ is improved. In detail, when $w$ and $\epsilon$ are negatively correlated, $\sigma_a$ and $\sigma_b$ are both closest to 0, resulting in an increase in the value of the lower bound. This theoretical result is in line with our intuition: when $\sigma_a$ and $\sigma_b$ are close to 0, the prototype embedding computed from the weighted node embeddings is very close to its expectation $\mu_a$ or $\mu_b$, which is what we anticipate a prototype should achieve. Besides, from $f_2(\sigma_a, \sigma_b)$, we can conclude that bringing $\sigma_b$ close to 0 helps reduce the sensitivity of the lower bound to $\mu_b$. Thus, if the distance $\epsilon$ between a given data point and the prototype expectation can be predicted, weights can be assigned by the first strategy to enhance the model accuracy.
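This conclusion can be checked with a small Monte Carlo simulation (ours, not from the paper): under the Gaussian embedding assumption, weights negatively correlated with ϵ (here w = exp(−ϵ), an illustrative choice) pull the prototype toward μ. Note that μ is assumed known here purely for the check; the framework below must predict the distances instead.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)                                   # class expectation (oracle here)
dev_mean, dev_weighted = [], []
for _ in range(2000):
    x1, x2 = rng.normal(mu, 1.0, size=(2, 2))      # two sampled embeddings of the class
    e1, e2 = np.linalg.norm(x1 - mu), np.linalg.norm(x2 - mu)
    w1, w2 = np.exp(-e1), np.exp(-e2)              # weights negatively correlated with eps
    c_avg = (x1 + x2) / 2                          # Eq. (1): plain average prototype
    c_wgt = (w1 * x1 + w2 * x2) / (w1 + w2)        # Definition 1: weighted prototype
    dev_mean.append(np.linalg.norm(c_avg - mu))
    dev_weighted.append(np.linalg.norm(c_wgt - mu))

avg_dev_mean = float(np.mean(dev_mean))
avg_dev_weighted = float(np.mean(dev_weighted))    # smaller: sigma is pulled toward 0
```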

5. FRAMEWORK

Inspired by the theoretical results, we propose to prioritize node importance in graph meta learning problems by introducing an importance score predictor. In detail, by constructing an attention vector to describe the relationship distribution of a given node, we predict end-to-end the distance between the node embedding and the prototype expectation, which is further used to compute a weighted average of node embeddings as a more accurate prototype representation.

5.1. FEW-SHOT NODE CLASSIFICATION TASK

We denote an undirected graph as $G = (V, E, A, X)$, where $V = \{v_1, \dots, v_n\}$ is the node set and $E = \{e_1, \dots, e_m\}$ is the edge set. The adjacency matrix $A \in \{0, 1\}^{n \times n}$ represents the graph structure, where $a_{ij}$ denotes the weight between nodes $v_i$ and $v_j$. $X \in \mathbb{R}^{n \times d}$ is the feature matrix, where $x_i \in \mathbb{R}^d$ represents the feature of node $v_i$. We focus on solving few-shot node classification problems. Episode training is adopted in the meta-train phase as in previous works (Snell et al., 2017), which samples several tasks and updates parameters based on the sum of the loss functions of the query sets. In our problem, nodes in the graphs correspond to data points in Euclidean space, and an $N$-way $k$-shot problem implies that each of the $N$ classes has $k$ nodes. The query set and support set are illustrated in Figure 1.

5.2. NODE REPRESENTATION LEARNING

Our graph prototypical network has a node representation learning component. Following the idea of ProtoNet (Snell et al., 2017) introduced in Section 3, we aim to train an embedding function $f_\theta(v_i, x_i)$ that learns the node representation of $v_i$, so that prototypes representing each class of the task can be computed. Node classification can then be implemented by calculating the distance between the current node and each prototype. On graph data, the embedding function is implemented with an inductive Graph Neural Network (GNN) (Hamilton et al., 2017) that learns a low-dimensional latent representation of each node. It follows a neighborhood aggregation and combination scheme, where each node recursively fetches information from its neighbors layer by layer. Let $h_v^l$ denote node $v$'s representation at the $l$-th step:

$$ h_{N(v)}^{l} = \mathrm{AGGREGATE}_l\left(\{h_u^{l-1}, \forall u \in N(v)\}\right), $$
$$ h_v^{l} = \sigma\left(W^l \cdot \mathrm{CONCAT}\left(h_v^{l-1}, h_{N(v)}^{l}\right)\right), $$

where $N(v)$ represents node $v$'s (sampled) neighbors. The first step aggregates the representations of neighboring nodes at layer $l-1$ into a new vector $h_{N(v)}^{l}$. The node representation at layer $l-1$ and the aggregated neighborhood representation are concatenated, which is then fed to a fully connected layer with nonlinear activation function $\sigma$. We denote this $L$-layer GNN by $f_\theta(\cdot)$.
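A schematic NumPy version of one such layer with a mean aggregator (a sketch under our own toy graph and weights, not the paper's trained model):

```python
import numpy as np

def sage_layer(h, neighbors, W):
    """One GraphSAGE-style step: h is [n, d]; neighbors[v] lists v's sampled neighbors."""
    out = []
    for v, nbrs in enumerate(neighbors):
        h_nbr = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])  # AGGREGATE
        z = np.concatenate([h[v], h_nbr]) @ W                           # CONCAT + linear
        out.append(np.maximum(z, 0.0))                                  # ReLU as sigma
    return np.stack(out)

h0 = np.eye(3)                     # toy one-hot features for 3 nodes
neighbors = [[1], [0, 2], [1]]     # a path graph 0 - 1 - 2
W = np.ones((6, 2))                # illustrative weights
h1 = sage_layer(h0, neighbors, W)  # stacking L such layers gives f_theta
```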

5.3. NIML: NODE IMPORTANCE SPECIFIC PROTOTYPICAL NETWORK

The prototype is typically calculated by averaging the node embeddings in the support set, as Equation (1) shows. However, based on our theoretical findings, distinguishing nodes of different importance within a class can increase model accuracy. When the number of nodes in a task is relatively small, the deviation introduced by randomly sampling nodes for the prototype computation can be reduced by assigning higher weights to more important nodes (i.e., those with smaller distance to the prototype expectation). We therefore develop a model that learns an importance score for each node, which contributes to a weighted prototype computation. Although the theory motivates us to assign weights according to the distance between the node representation and the prototype expectation, it is based on the assumption that the distance $\epsilon$ is known. To overcome this obstacle, we design a model that predicts the distance end-to-end. Since numerous tasks are sampled during the meta-train phase, we have access to relatively abundant nodes for each class. When the number of nodes in a class is large enough, the prototype expectation $\mu_c$ can be approximated by the mean embedding of same-class nodes over the whole graph, $\mu_c \simeq \mathrm{mean}(f_\phi(x_u))$ for all nodes $u$ belonging to class $c$. The ground-truth distance between a node $v$ and its same-class prototype expectation can then be computed as $d_{vp} = d(f_\phi(x_v), \mu_c)$. Thus, theoretically speaking, we expect the distance function to be learned through iterative meta-training. The next step is to decide which node information should be used to predict the distance. Directly using the node embedding generated by Proto-GCN as input does not meet our expectations for the distance predictor.
Proto-GCN maps same-class nodes to nearby locations in the embedding space, whereas the distance predictor should map nodes of comparable importance to similar distance values, so nodes of different categories may be mapped to the same location (as shown in Figure 6 in Appendix A.3). Hence, it is necessary to design an input that contains as little label information as possible.
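The supervised targets for this predictor, $d_{vp} = d(f_\phi(x_v), \mu_c)$ with $\mu_c$ approximated by the graph-wide same-class mean, can be sketched as follows (toy embeddings and labels; ours, for illustration):

```python
import numpy as np

def distance_targets(emb, y):
    """Ground-truth distances between each node embedding and its class mean mu_c."""
    targets = np.empty(len(y))
    for c in np.unique(y):
        mask = y == c
        mu_c = emb[mask].mean(axis=0)                        # approximates E[f(x) | c]
        targets[mask] = np.linalg.norm(emb[mask] - mu_c, axis=1)
    return targets

emb = np.array([[0., 0.], [2., 0.], [10., 0.], [10., 4.]])   # toy node embeddings
y = np.array([0, 0, 1, 1])
d_vp = distance_targets(emb, y)    # class-0 mean [1, 0]; class-1 mean [10, 2]
```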

Figure 2: Illustration of distance model

Due to the feature smoothing mechanism of GNNs, an $L$-layer GNN applies the same smoothing intensity to every node. Assuming a homophilous graph, neighboring nodes have similar features. Under equal smoothing intensity, the similarity between a central node and its neighbors is higher than that between a marginal node and its neighbors, so the relationship between a central node and its neighbors is more uniformly distributed. We therefore construct an attention vector $\alpha_v$ for each node $v$ to represent this relationship distribution, where a more uniform distribution indicates a higher node importance and a closer distance to the prototype expectation. As shown below and in Figure 2, each component of $\alpha_v$ is an attention score between node $v$ and $u \in N(v)$. Note that a fixed number of neighbors is sampled for each node.

$$ \alpha_v = [\alpha_{v1}, \dots, \alpha_{v|N(v)|}], \qquad \alpha_{vu} = \frac{\exp\left(\mathrm{LeakyReLU}(a^T [W h_v \,\|\, W h_u])\right)}{\sum_{q \in N(v)} \exp\left(\mathrm{LeakyReLU}(a^T [W h_v \,\|\, W h_q])\right)}, $$

where $W$ is a linear transformation and $\|$ is the concatenation operation. The attention coefficient is calculated by a single-layer feed-forward neural network parameterized by a vector $a$ with a LeakyReLU nonlinear activation, followed by a Softmax function for normalization. Thus, $\alpha_v$ is a category-irrelevant node representation that describes the relation distribution between the given node $v$ and its neighbors. We use the sorted $\alpha_v$ as input to the supervised distance predictor to avoid the effect of the neighbor sampling order. For a node $v$ in class $c$, the distance between the node representation and the prototype expectation is predicted by a multi-layer supervised model:

$$ d(f_\phi(x_v), \mu_c) = \mathrm{MLP}(\mathrm{SORTED}(\alpha_v)), $$

where $x_v$ is the node feature and $\mu_c = \mathrm{mean}(f_\phi(x_u))$ over all nodes $u$ belonging to class $c$. Then, given the support set $S_c$ of class $c$, the importance score $s_v$ is computed by

$$ s_v = \frac{\exp(-d(f_\phi(x_v), \mu_c))}{\sum_{u \in S_c} \exp(-d(f_\phi(x_u), \mu_c))}. $$
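The attention-vector construction can be sketched as follows (random illustrative parameters $W$ and $a$ stand in for learned weights):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_vector(h, v, nbrs, W, a):
    """Sorted attention scores between node v and its sampled neighbors."""
    hv = h[v] @ W
    scores = np.array([leaky_relu(a @ np.concatenate([hv, h[u] @ W])) for u in nbrs])
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()              # softmax normalization over N(v)
    return np.sort(alpha)[::-1]      # sorting removes neighbor-order effects

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))          # toy node states from the GNN
W = rng.normal(size=(4, 3))          # linear transformation
a = rng.normal(size=6)               # attention parameter vector
alpha_v = attention_vector(h, 0, [1, 2, 3], W, a)
```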
The prototype representation $\mathbf{c}$ of class $c$ is obtained by a weighted combination of embeddings:

$$ \mathbf{c} = \sum_{v \in S_c} s_v f_\theta(x_v). $$

Then the probability $p(c \mid v)$ that a node $v$ with feature $x_v$ belongs to class $c$ is computed with the Softmax function in Equation (2). Thus, the loss function $L$ is defined as the sum over the query set $Q$ of the negative log-probability of each node $v$'s true label $c$:

$$ L = \frac{1}{N|Q|} \sum_{c=1}^{N} \sum_{v \in Q_c} -\log p(c \mid v), $$

where $N$ is the number of classes and $Q_c$ is the set of nodes belonging to class $c$ in the query set $Q$. The parameters of the representation network $f_\theta(\cdot)$ and the importance score network are then updated by SGD.
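Putting the last steps together, a hedged sketch of the importance scores and the weighted prototype (toy predicted distances stand in for the MLP output):

```python
import numpy as np

def weighted_prototype(emb, pred_dist):
    """Softmax importance scores from predicted distances, then a weighted prototype."""
    s = np.exp(-pred_dist)
    s = s / s.sum()                  # importance scores over the support set S_c
    return s @ emb                   # c = sum_v s_v * f_theta(x_v)

emb = np.array([[0., 0.], [1., 0.], [5., 0.]])   # support embeddings of one class
pred_dist = np.array([0.1, 0.2, 3.0])            # the outlier gets a large distance
proto = weighted_prototype(emb, pred_dist)
# proto is pulled toward the two low-distance nodes and away from the outlier
```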

6. EXPERIMENT

To verify the effectiveness of NIML on few-shot node classification problem, in this section, we first introduce the experimental settings and then present the detailed experiment results with ablation study and parameter analysis on three public datasets.

6.1. EXPERIMENT SETTINGS

We implement the experiments on three public datasets: Reddit (Hamilton et al., 2017), Amazon-Electronic (McAuley et al., 2015), and DBLP (Tang et al., 2008a). Details of the datasets are provided in Appendix A.2. In the meta-train phase, N classes are sampled episode by episode from the training classes, and N novel classes from the testing classes are used for evaluation. A fixed number of neighbors are sampled to construct the attention vector, with zero padding for nodes without enough neighbors. We compare with several baselines, which can be grouped into three categories.
• GNNs: We test four graph algorithms: DeepWalk, node2vec, GCN and GAT. DeepWalk (Perozzi et al., 2014) learns node embeddings from sequences generated by random walks. node2vec (Grover & Leskovec, 2016) extends DeepWalk by combining DFS-like and BFS-like random walks. GCN (Kipf & Welling, 2016) is a first-order approximation of spectral graph convolutions. GAT (Veličković et al., 2017) leverages self-attention to assign different weights to different nodes in a neighborhood.
• Meta Learning: We test two typical meta learning algorithms that do not use a GNN backbone. ProtoNet (Snell et al., 2017) is a metric-based meta learning method that learns an embedding function and uses prototypes for classification. MAML (Finn et al., 2017) is an optimization-based meta learning method that learns a good parameter initialization of the network.
• Meta Learning GNN: We consider six methods that implement GNNs in a meta learning framework. Proto-GCN is a baseline we design for ablation purposes, which learns a GCN as the embedding function and uses the average embedding as the prototype. Meta-GCN (Zhou et al., 2019) is a previous work that extends MAML to graph data using a GCN base model. Proto-GAT and Meta-GAT are two baselines whose embedding function is GAT.
We also include two related works: RALE (Liu et al., 2021) introduces hub nodes and learns both relative and absolute location node embedding; GPN (Ding et al., 2020) learns node importance by aggregating the importance score.

6.2. EXPERIMENT RESULTS

Table 1 shows the performance comparison on 5-way 3-shot and 5-way 5-shot problems on each dataset. We report the average accuracy and F1 score over ten repetitions. Among the GNNs, the typical methods DeepWalk and node2vec are far inferior to the others, since they rely on a large amount of labeled data to learn good node representations. GCN and GAT are better than the previous two methods, but they still cannot achieve satisfying performance on this few-shot problem. As for ProtoNet and MAML, although they have shown the ability to deal with few-shot problems on Euclidean data, they struggle with graph data because they do not consider the graph structure, i.e., node dependencies. Owing to the incorporation of both meta learning and graph structure, the meta learning GNN models outperform the previous two types of models, which demonstrates that meta learning methods can effectively deal with the scarcity of labeled samples in graph data under a GNN configuration. The four basic meta learning GNN models, Meta-GCN, Proto-GCN, Meta-GAT and Proto-GAT, all achieve similar performance. Our model NIML outperforms all baselines in every case. The advantage of NIML is slightly larger in the 5-shot case than in the 3-shot case, thanks to a better refinement of the prototype calculation using the importance scores when additional nodes are available.

Methods of computing importance scores. We conduct an ablation study of different methods of computing the importance score and report the results of four models in Figure 3. Proto-GCN computes the prototype directly with a mean function; GPN trains a score aggregation model; Proto-GCN+GAT uses GAT to learn an importance score for each node. The results indicate that distinguishing the importance of different nodes has a significant impact on model performance, and NIML, which closely follows the theoretical conclusion, shows the most significant advantage.


Effect of N-way/k-shot/m-query. We analyze the effect of the number of classes N, the support set size k, and the query set size m on the accuracy over the three datasets. The results for each dataset are depicted in Figure 4. 1) As N grows, the difficulty of prediction increases, resulting in a decline in performance. 2) The accuracy increases as k increases, and the curves tend to flatten in some instances. 3) The query set size m has the least impact on model accuracy of all the variables. A larger m may even result in a decrease in performance, possibly because larger query sets make the parameter updates more difficult.

7. CONCLUSION

This work begins with a theoretical analysis of the effect of node importance on the model, and concludes that assigning a greater weight to a data point whose embedding is closer to the expectation of the same-class prototype raises the lower bound of model accuracy. This theory also applies to domains other than graphs. Based on this conclusion, we propose node importance meta learning (NIML). We construct an attention vector to represent the distribution of relationships between a node and its neighbors, and train a distance predictor to learn the distance between the node embedding and an approximation of the prototype expectation. Experiments demonstrate the superior capability of our model on few-shot node classification. NIML can be utilized in any Proto-based few-shot node classification framework to compute prototypes.
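A minimal sketch of the two ingredients named above, the attention vector over neighbors and the supervised distance target. Dot-product similarity and the mean-based approximation of the prototype expectation are illustrative simplifications, not the exact NIML architecture:

```python
import numpy as np

def attention_vector(node_emb: np.ndarray, neighbor_embs: np.ndarray) -> np.ndarray:
    """Softmax over node-neighbor similarities: a simple stand-in for the
    attention vector describing the relationship distribution between a node
    and its neighbors (dot-product similarity is an assumption here)."""
    sims = neighbor_embs @ node_emb
    e = np.exp(sims - sims.max())
    return e / e.sum()

def distance_target(node_emb: np.ndarray, same_class_embs: np.ndarray) -> float:
    """Supervised regression target for the distance predictor: distance
    between the node embedding and an approximation of the prototype
    expectation (here, the mean of same-class embeddings)."""
    return float(np.linalg.norm(node_emb - same_class_embs.mean(axis=0)))
```

A predicted distance d can then be converted into an importance score, e.g. s = exp(-d), so that nodes closer to the prototype expectation receive larger weights.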

A APPENDIX

A.1 THEORY PROOF

A.1.1 PROOF OF LEMMA 1

Lemma 1. Consider a space of classes $\mathcal{C}$ with sampling distribution $\tau$, and $a, b \sim \tau$. Let $S = \{S_a, S_b\}$, $S_a = \{{}^a x_1, \ldots, {}^a x_k\}$, $S_b = \{{}^b x_1, \ldots, {}^b x_k\}$, where $k \in \mathbb{N}$ is the shot number, and $y(x) = a$. Define $c_a$ and $c_b$ as shown in Equation (4). Then:

$$E_{x,S|a,b}[\alpha] = (\mu_a - \mu_b)^T(\mu_a - \mu_b) + (2\mu_b + \sigma_b - 2\mu_a)^T \sigma_b - \sigma_a^T \sigma_a \tag{19}$$

$$E_{a,b,x,S}[\alpha] = 2\,\mathrm{Tr}(\Sigma) + \sigma_b^T \sigma_b - \sigma_a^T \sigma_a \tag{20}$$

$$E_{a,b}[\mathrm{Var}(\alpha \mid a, b)] \le 8\left(1 + \tfrac{1}{k}\right) \mathrm{Tr}\left\{\Sigma_c\left[\left(1 + \tfrac{1}{k}\right)\Sigma_c + 2\Sigma + \sigma_b^T\sigma_b + \sigma_a^T\sigma_a\right]\right\} \tag{21}$$

where $\sigma_a = \frac{{}^a w_2\,{}^a\epsilon_2 - {}^a w_1\,{}^a\epsilon_1}{{}^a w_2 + {}^a w_1}$ and $\sigma_b = \frac{{}^b w_2\,{}^b\epsilon_2 - {}^b w_1\,{}^b\epsilon_1}{{}^b w_2 + {}^b w_1}$.

Proof: From the definition of the prototype, we have:

$$c_a = \frac{{}^a w_1}{{}^a w_1 + {}^a w_2}\,\phi({}^a x_1) + \frac{{}^a w_2}{{}^a w_1 + {}^a w_2}\,\phi({}^a x_2) = \frac{{}^a w_1}{{}^a w_1 + {}^a w_2}(\mu_a - \epsilon_1) + \frac{{}^a w_2}{{}^a w_1 + {}^a w_2}(\mu_a + \epsilon_2) = \mu_a + \frac{\epsilon_2\,{}^a w_2 - \epsilon_1\,{}^a w_1}{{}^a w_1 + {}^a w_2}$$

We denote the second term by $\sigma_a$, so $c_a = \mu_a + \sigma_a$ and, analogously, $c_b = \mu_b + \sigma_b$. Since $\alpha = \|\phi(x) - c_b\|^2 - \|\phi(x) - c_a\|^2$,

$$E_{x,S|a,b}[\alpha] = E_{x,S|a,b}[\|\phi(x) - c_b\|^2] - E_{x,S|a,b}[\|\phi(x) - c_a\|^2]$$

We denote $E_{x,S|a,b}[\|\phi(x) - c_b\|^2]$ and $E_{x,S|a,b}[\|\phi(x) - c_a\|^2]$ by (i) and (ii), respectively.
For a random vector $X$, the expectation of the quadratic form is $E[\|X\|^2] = \mathrm{Tr}(\mathrm{Var}(X)) + E[X]^T E[X]$. Thus,

$$\text{(i)} = E_{x,S|a,b}[\|\phi(x) - c_b\|^2] = \mathrm{Tr}(\mathrm{Var}(\phi(x) - c_b)) + E[\phi(x) - c_b]^T E[\phi(x) - c_b]$$

Since $\mathrm{Var}(X) = E[XX^T] - E[X]E[X]^T$,

$$\begin{aligned}
\mathrm{Var}(\phi(x) - c_b) &= E[(\phi(x) - c_b)(\phi(x) - c_b)^T] - (\mu_a - c_b)(\mu_a - c_b)^T \\
&= \Sigma_c + \mu_a\mu_a^T + \tfrac{1}{k}\Sigma_c + c_b c_b^T - \mu_a c_b^T - c_b\mu_a^T - \left[\mu_a\mu_a^T - \mu_a c_b^T - c_b\mu_a^T + c_b c_b^T\right] \\
&= \left(1 + \tfrac{1}{k}\right)\Sigma_c
\end{aligned}$$

Since $E[\phi(x) - c_b] = \mu_a - c_b$,

$$\text{(i)} = \mathrm{Tr}\left(\left(1 + \tfrac{1}{k}\right)\Sigma_c\right) + (\mu_a - c_b)^T(\mu_a - c_b)$$
$$\text{(ii)} = \mathrm{Tr}\left(\left(1 + \tfrac{1}{k}\right)\Sigma_c\right) + (\mu_a - c_a)^T(\mu_a - c_a) = \mathrm{Tr}\left(\left(1 + \tfrac{1}{k}\right)\Sigma_c\right) + \sigma_a^T\sigma_a$$

Thus,

$$\begin{aligned}
\text{(i)} - \text{(ii)} &= (\mu_a - c_b)^T(\mu_a - c_b) - \sigma_a^T\sigma_a \\
&= \mu_a^T\mu_a - \mu_a^T(\mu_b + \sigma_b) - (\mu_b + \sigma_b)^T\mu_a + (\mu_b + \sigma_b)^T(\mu_b + \sigma_b) - \sigma_a^T\sigma_a \\
&= \mu_a^T\mu_a - 2\mu_a^T\mu_b - 2\mu_a^T\sigma_b + \mu_b^T\mu_b + 2\mu_b^T\sigma_b + \sigma_b^T\sigma_b - \sigma_a^T\sigma_a
\end{aligned}$$

and therefore

$$E_{x,S|a,b}[\alpha] = (\mu_a - \mu_b)^T(\mu_a - \mu_b) + (2\mu_b + \sigma_b - 2\mu_a)^T\sigma_b - \sigma_a^T\sigma_a$$

Since $E_{a,b,x,S}[\alpha] = E_{a,b}[E_{x,S|a,b}[\alpha]]$, we have:

$$\begin{aligned}
E_{a,b,x,S}[\alpha] &= E_{a,b}[\text{(i)} - \text{(ii)}] \\
&= E_{a,b}[\mu_a^T\mu_a - 2\mu_a^T\mu_b + \mu_b^T\mu_b + 2\mu_b^T\sigma_b - 2\mu_a^T\sigma_b + \sigma_b^T\sigma_b - \sigma_a^T\sigma_a] \\
&= \mathrm{Tr}(\Sigma) + \mu^T\mu - 2\mu^T\mu + \mathrm{Tr}(\Sigma) + \mu^T\mu + 2\mu^T\sigma_b - 2\mu^T\sigma_b + \sigma_b^T\sigma_b - \sigma_a^T\sigma_a \\
&= 2\,\mathrm{Tr}(\Sigma) + \sigma_b^T\sigma_b - \sigma_a^T\sigma_a
\end{aligned}$$

Thus $E_{a,b,x,S}[\alpha] = 2\,\mathrm{Tr}(\Sigma) + \sigma_b^T\sigma_b - \sigma_a^T\sigma_a$. Next, we perform an inequality scaling on the variance of $\alpha$.
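The key identity in this step, (i) − (ii) = (μ_a − μ_b)^T(μ_a − μ_b) + (2μ_b + σ_b − 2μ_a)^T σ_b − σ_a^T σ_a, can be sanity-checked numerically with random vectors (a quick verification sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 8
mu_a, mu_b, sig_a, sig_b = (rng.standard_normal(d) for _ in range(4))
c_a, c_b = mu_a + sig_a, mu_b + sig_b  # prototypes as defined in the proof

# Left side: (i) - (ii) = ||mu_a - c_b||^2 - sigma_a^T sigma_a
lhs = (mu_a - c_b) @ (mu_a - c_b) - sig_a @ sig_a

# Right side: the expanded closed form of E_{x,S|a,b}[alpha]
rhs = ((mu_a - mu_b) @ (mu_a - mu_b)
       + (2 * mu_b + sig_b - 2 * mu_a) @ sig_b
       - sig_a @ sig_a)
```

Both sides agree to floating-point precision for any choice of the four vectors, since the identity is a pure algebraic expansion.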
For a random vector $X$ with mean $\mu_X$ and covariance $\Sigma_X$, the variance of the quadratic form satisfies $\mathrm{Var}(\|X\|^2) = 2\,\mathrm{Tr}(\Sigma_X^2) + 4\mu_X^T\Sigma_X\mu_X$. Hence,

$$\mathrm{Var}(\|\phi(x) - c_b\|^2) = 2\left(1 + \tfrac{1}{k}\right)^2 \mathrm{Tr}(\Sigma_c^2) + 4\left(1 + \tfrac{1}{k}\right)(\mu_a - c_b)^T\Sigma_c(\mu_a - c_b)$$
$$\mathrm{Var}(\|\phi(x) - c_a\|^2) = 2\left(1 + \tfrac{1}{k}\right)^2 \mathrm{Tr}(\Sigma_c^2) + 4\left(1 + \tfrac{1}{k}\right)\sigma_a^T\Sigma_c\sigma_a$$

Thus,

$$\begin{aligned}
E_{a,b}[\mathrm{Var}(\alpha \mid a, b)] &\le E_{a,b}\left[2\,\mathrm{Var}(\|\phi(x) - c_b\|^2) + 2\,\mathrm{Var}(\|\phi(x) - c_a\|^2)\right] \\
&= E_{a,b}\left[8\left(1 + \tfrac{1}{k}\right)^2\mathrm{Tr}(\Sigma_c^2) + 8\left(1 + \tfrac{1}{k}\right)\left[(\mu_a - c_b)^T\Sigma_c(\mu_a - c_b) + \sigma_a^T\Sigma_c\sigma_a\right]\right] \\
&= 8\left(1 + \tfrac{1}{k}\right) E_{a,b}\left[\mathrm{Tr}\left\{\left(1 + \tfrac{1}{k}\right)\Sigma_c^2 + \Sigma_c\left((\mu_a - c_b)^T(\mu_a - c_b) + \sigma_a^T\sigma_a\right)\right\}\right] \\
&= 8\left(1 + \tfrac{1}{k}\right)\mathrm{Tr}\left\{\Sigma_c\left[\left(1 + \tfrac{1}{k}\right)\Sigma_c + 2\Sigma + \sigma_b^T\sigma_b + \sigma_a^T\sigma_a\right]\right\}
\end{aligned}$$

A.1.2 PROOF OF THEOREM 1

Under the conditions where Lemma 1 holds, we have:

$$R(\phi) \ge \frac{\left(2\,\mathrm{Tr}(\Sigma) + \sigma_b^T\sigma_b - \sigma_a^T\sigma_a\right)^2}{f_1(\sigma_a, \sigma_b) + f_2(\sigma_a, \sigma_b)} \tag{22}$$

where

$$f_1(\sigma_a, \sigma_b) = 12\,\mathrm{Tr}\left\{\Sigma_c\left(\tfrac{3}{2}\Sigma_c + 2\Sigma + \sigma_b^T\sigma_b + \sigma_a^T\sigma_a\right)\right\}$$
$$f_2(\sigma_a, \sigma_b) = E_{a,b}\left[\left((\mu_a - \mu_b)^T(\mu_a - \mu_b) + (2\mu_b + \sigma_b - 2\mu_a)^T\sigma_b - \sigma_a^T\sigma_a\right)^2\right]$$

Proof: We plug the three equations of Lemma 1 into Equation (7) and perform an inequality scaling as shown below. Since

$$\mathrm{Var}(\alpha) = E[\alpha^2] - E[\alpha]^2 = E_{a,b}\left[\mathrm{Var}(\alpha \mid a, b) + E_{x,S}[\alpha \mid a, b]^2\right] - E_{a,b,x,S}[\alpha]^2$$

we obtain

$$R(\phi) \ge \frac{\left(2\,\mathrm{Tr}(\Sigma) + \sigma_b^T\sigma_b - \sigma_a^T\sigma_a\right)^2}{f_1(\sigma_a, \sigma_b) + E_{a,b}\left[\left((\mu_a - \mu_b)^T(\mu_a - \mu_b) + (2\mu_b + \sigma_b - 2\mu_a)^T\sigma_b - \sigma_a^T\sigma_a\right)^2\right]} = \frac{\left(2\,\mathrm{Tr}(\Sigma) + \sigma_b^T\sigma_b - \sigma_a^T\sigma_a\right)^2}{f_1(\sigma_a, \sigma_b) + f_2(\sigma_a, \sigma_b)}$$

where $f_1(\sigma_a, \sigma_b) = 8\left(1 + \tfrac{1}{k}\right)\mathrm{Tr}\left\{\Sigma_c\left[\left(1 + \tfrac{1}{k}\right)\Sigma_c + 2\Sigma + \sigma_b^T\sigma_b + \sigma_a^T\sigma_a\right]\right\}$. In the 2-way 2-shot case discussed above, $k = 2$, which yields the constants in Equation (22).

A.1.3 EXTEND THE ALGORITHM TO N CLASSES

Let $x$ and $y$ denote a query pair. Let $\alpha_i = \|\phi(x) - c_i\|^2 - \|\phi(x) - c_y\|^2$; the model classifies $x$ correctly exactly when every $\alpha_i$ is positive, hence $R(\phi) = \Pr_{c,x,S}\left(\bigcap_{i=1, i \ne y}^{N} \{\alpha_i > 0\}\right)$. By Fréchet's inequality:

$$R(\phi) \ge \sum_{i=1, i \ne y}^{N} \Pr(\alpha_i > 0) - (N - 2)$$

Plugging the lower bound of Theorem 1 into each term $\Pr(\alpha_i > 0)$, the lower bound of accuracy for the $N$-class problem is obtained.

A.2 IMPLEMENTATION DETAILS

We implement the proposed framework in PyTorch. We set the number of episodes to 500 with an early stopping strategy. The representation network $f_\theta(\cdot)$, i.e., a GCN, consists of two layers with dimension sizes 32 and 16, respectively.
Both layers are activated with the ReLU function. We train the model using the Adam optimizer, whose learning rate is set to 0.005 initially with a weight decay of 0.0005. The query set size is set to 15 for all datasets. The Proto-GCN and the distance predictor are both learned during the meta-train phase. We also provide an anonymous GitHub link in the supplementary file.

A.4 DIFFERENCE BETWEEN NIML AND GPN

Even though both NIML and GPN make an effort to compute weighted prototypes, the two methods are designed with different intentions. NIML starts from a theoretical analysis, quantifies node importance as the distance from a node to the expectation of its same-class prototype, and concludes that assigning higher weights to nodes at a closer distance raises the lower bound of model accuracy. NIML then adopts the idea that the distribution of relationships between a given node and its neighbors reflects the node's importance, constructs an attention vector that depicts this relationship distribution, and uses it as input to predict the distance in a supervised manner, thereby learning the node importance. GPN, in contrast, assumes that the importance of a node is highly correlated with its neighbors' importance and derives a score aggregation mechanism with GAT as the backbone, which has characteristics similar to message passing and thus relies on graph homophily. We believe this is the main reason why NIML outperforms GPN, as shown in Table 1.
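The encoder configuration reported in the implementation details above (two GCN layers of sizes 32 and 16, both ReLU-activated) can be sketched as a minimal forward pass; the numpy implementation and the toy graph below are illustrative stand-ins for the actual PyTorch code:

```python
import numpy as np

def normalized_adj(A: np.ndarray) -> np.ndarray:
    """Symmetrically normalized adjacency with self-loops:
    A_hat = D^{-1/2} (A + I) D^{-1/2}, as in Kipf & Welling (2016)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encoder(A, X, W1, W2):
    """Two-layer GCN f_theta: input dim -> 32 -> 16, both layers
    ReLU-activated, matching the reported configuration."""
    A_hat = normalized_adj(A)
    H1 = np.maximum(A_hat @ X @ W1, 0.0)      # first layer, width 32
    return np.maximum(A_hat @ H1 @ W2, 0.0)   # second layer, width 16

# Toy undirected graph: 6 nodes, 10 input features, random weights.
rng = np.random.default_rng(0)
n, f = 6, 10
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                # symmetric, no self-loops yet
X = rng.standard_normal((n, f))
W1 = rng.standard_normal((f, 32)) * 0.1
W2 = rng.standard_normal((32, 16)) * 0.1
Z = gcn_encoder(A, X, W1, W2)                 # node embeddings, shape (6, 16)
```

In the actual framework these embeddings would feed the prototype computation and the distance predictor; training with Adam (lr 0.005, weight decay 0.0005) is omitted here for brevity.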

A.5 VISUALIZATION OF RELATIONSHIP BETWEEN SCORE AND DISTANCE

In order to verify whether NIML follows the theory, we visualize the relationship between score and distance in Figure 7. For a selected category, we calculate the embeddings of five same-label nodes belonging to the support set and visualize them together with the prototype expectation (the mean of all same-class embeddings) of that category. The shade of the color represents the score: the darker the color, the higher the score, with the darkest point being the prototype. The distances between points in the figure are consistent with the distances between node embeddings. We present three groups of visualizations. From the results, we find that our algorithm always assigns higher weights to closer nodes, although very strict distinctions may not be made in certain cases where the distances are relatively close. Although some details are inconsistent in a few cases, the overall trend is consistent with the theory.
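The qualitative check above, that a higher score should mean a smaller distance to the prototype expectation, can be expressed as a small sketch (the exp(-distance) score is a hypothetical choice for illustration, not the trained predictor):

```python
import numpy as np

rng = np.random.default_rng(1)
proto = rng.standard_normal(2)                        # prototype expectation
support = proto + 0.5 * rng.standard_normal((5, 2))   # five same-class embeddings

dist = np.linalg.norm(support - proto, axis=1)        # distance to expectation
score = np.exp(-dist)                                 # closer node -> higher score

order_by_score = np.argsort(-score)                   # nodes, best score first
order_by_dist = np.argsort(dist)                      # nodes, closest first
# A score consistent with the theory ranks nodes identically both ways.
```

Since exp(-d) is strictly decreasing in d, the two orderings coincide whenever the distances are distinct, which is exactly the monotone trend Figure 7 checks visually.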




Figure 1: An episode for a 3-way k-shot problem. c_i represents a node class; k and m denote the number of nodes per class in the support set S and query set Q, respectively.

Figure 3: Different methods of computing importance score

Figure 4: Effect of support set size k on three datasets


Figure 5: Histogram of Reddit dataset

Figure 6 provides an illustration of the difference between the Proto-based GCN and the distance predictor. The bottom-right figure depicts the embedding space of a prototypical network, and the upper-right figure shows the distance in the embedding space between a given node and its same-class prototype. This distance is equivalent to the length of the gray arrow in the bottom-right figure.

Figure 6: Difference between the Proto-based GCN model and the distance predictor: the distance predictor maps nodes of similar importance to similar distance values regardless of node category, while the Proto-based GCN maps same-class nodes to close locations in the embedding space.

Figure 7: Visualization of the relationship between score and distance

Table 1: Experiment results on Reddit, Amazon-Electronic and DBLP w.r.t. ACC and F1 (%)




