GRAPH DOMAIN ADAPTATION VIA THEORY-GROUNDED SPECTRAL REGULARIZATION

Abstract

Transfer learning on graphs drawn from varied distributions (domains) is in great demand across many applications. Emerging methods attempt to learn domain-invariant representations using graph neural networks (GNNs), yet the empirical performance varies and the theoretical foundation is limited. This paper aims at designing theory-grounded algorithms for graph domain adaptation (GDA). (i) As a first attempt, we derive a model-based GDA bound closely related to two GNN spectral properties: spectral smoothness (SS) and maximum frequency response (MFR). This is achieved by cross-pollinating between the OT-based (optimal transport) DA and graph filter theories. (ii) Inspired by the theoretical results, we propose algorithms that regularize the spectral properties of SS and MFR to improve GNN transferability. We further extend the GDA theory to the more challenging scenario of conditional shift, where spectral regularization still applies. (iii) More importantly, our analyses of the theory reveal which regularization improves performance in which transfer learning scenario, (iv) in numerical agreement with extensive real-world experiments: SS and MFR regularizations bring more benefits to the scenarios of node transfer and link transfer, respectively. In a nutshell, our study paves the way toward explicitly constructing and training GNNs that capture more transferable representations across graph domains. Code is released at https://github.com/Shen-Lab/GDA-SpecReg.

1. INTRODUCTION

Many applications call for "transferring" graph representations learned from one distribution (domain) to another, which we refer to as graph domain adaptation (GDA). Examples include temporally-evolved social networks (Wang et al., 2021), molecules of different scaffolds (Hu et al., 2019), and protein-protein interaction networks in various species (Cho et al., 2016). In general, this setting of transfer learning is challenging due to the data-distribution shift between the training (source) and test (target) domains (i.e., P_S(G, Y) ≠ P_T(G, Y)). In particular, such a challenge escalates for graph-structured data, which are abstractions of diverse nature (You et al., 2021; 2022).

Despite the tremendous need arising from real-world applications, current methods for GDA (as reviewed in Section 2) mostly fall short of delivering competitive target performance with theoretical guarantees. Inevitably, approaches assuming distribution invariance (or adopting heuristic principles) are restricted in theory (Garg et al., 2020; Verma & Zhang, 2019). Emerging approaches (Zhang et al., 2019; Wu et al., 2020) straightforwardly apply adversarial training between source and target representations, intentionally founded on the DA theory to bound the target risk (Redko et al., 2020). However, the generic DA bound is agnostic to graph data and models, and could be more precisely tailored for graphs. We therefore set out to explore the following question: how to design algorithms that boost transfer performance across different graph domains, with a grounded theoretical foundation? Our step-by-step answers are as follows.

(i) Derivation of a model-based GDA bound. Building upon the rigorous assurance established in the DA theory (Section 3), we start by directly rewriting the OT-based (optimal transport) DA bound (Redko et al., 2017; Shen et al., 2018) for graphs (Corollary 1), which is closely coupled with the Lipschitz constant of graph encoders.
The nontrivial challenge here is how to formulate the GNN Lipschitz constant w.r.t. the distance metric of non-Euclidean data. Leveraging the graph filter theory (Gama et al., 2020; Arghal et al., 2021), we first state that GNNs can be constructed stably w.r.t. the misalignment of edges and that of node features, multiplied respectively by two spectral properties: spectral smoothness (SS) and maximum frequency response (MFR) (Lemma 1). Subsequently, we utilize SS and MFR to formulate the GNN Lipschitz constant w.r.t. graph distances in a general form, and instantiate it as (informally) max{O(SS), O(MFR)} w.r.t. the commonly-used matching distance (Gama et al., 2020; Arghal et al., 2021) (Lemma 2). This leads to the first model-based GDA bound.

(ii) Theory-grounded spectral regularization. One potential way to tighten the DA bound is to modulate the Lipschitz constant (Section 3). Guided by the theoretical results above, we are well-motivated to propose spectral regularization (i.e., SSReg and MFRReg) to restrict the target risk bound (Section 4.2). We also extend the GDA theory to the more challenging conditional-shift scenario (Li et al., 2021a; Zhao et al., 2019) (Lemma 3), where spectral regularization still applies.

(iii) Interpretation of how theory drives practice. Our further analyses of the theory reveal which regularization improves performance in which graph transfer scenario: specifically, SSReg and MFRReg are respectively beneficial to the scenarios of node transfer and link transfer (Section 4.2), (iv) with extensive numerical evidence from un/semi-supervised (cross-species protein-protein interaction) link prediction and (temporally-shifted paper topic) node classification (Section 5).

2. RELATED WORKS

Self-supervision on graphs. Graph self-supervised learning, surging recently, learns empirically more generalizable representations by exploiting vast unlabelled graph data (please refer to (Xie et al., 2021) for a comprehensive review). The success of self-supervision largely hinges on big data and, more importantly, heuristically-designed pretext tasks. The tasks can be predictive (Velickovic et al., 2019; Hu et al., 2019; Jin et al., 2020; You & Shen, 2022; You et al., 2020b; Chien et al., 2021; Talukder et al., 2022) or contrastive (You et al., 2020a; Zhu et al., 2020b; Qiu et al., 2020; Wei et al., 2022), which does not provide a theoretical guarantee of the target performance and, as a result, occasionally leads to "negative transfer" in practice (Hu et al., 2019; You et al., 2020a).

Transferring GNNs with explicit covariate shifts. To promote target performance, one line of work utilizes more data and makes specific assumptions. One such example is to assume access to source labels and an explicit covariate shift, i.e., P_S(Y|G) = P_T(Y|G) and P_S(G) ≠ P_T(G) in a specific way, which enables theoretical tools for certain guarantees. (Ruiz et al., 2020; Yehudai et al., 2021) study the specific setting of size generalization and use the graphon theory (Lovász, 2012) to develop size-invariant representations. (Bevilacqua et al., 2021) studies transfer learning under shifting d-patterns of subgraphs and adopts the theory of GNN expressiveness (Xu et al., 2018; Morris et al., 2019) to demonstrate the existence of negative-transferring GNNs despite their universal approximation capability. Accordingly, the study proposes d-pattern classification pre-training to help escape from negative-transferring GNNs. These methods are restricted to the designated transfer learning scenarios.
Besides, some other works (Fan et al., 2021; Sui et al., 2021; Li et al., 2021b; Kenlay et al., 2021; Chen et al., 2022; Li et al., 2022; Zhang et al., 2022; Jin et al., 2022) adopt the implicit covariate shift assumption with source labels while lacking theoretical assurance; e.g., (Wu et al., 2022a;b) assume that the shift can be implicitly modeled with an environment learner (please refer to (Gui et al., 2022) for a comprehensive review).

Graph domain adaptation. To deliver a generally applicable guarantee, several methods (Dai et al., 2019; Cai et al., 2021; Zhang et al., 2019; Wu et al., 2020; Xu et al., 2022) additionally utilize target graphs to learn domain-invariant representations. According to the DA theory (Ben-David et al., 2007; 2010; Redko et al., 2020; Zhang et al., 2020; Yan et al., 2017), the target risk is guaranteed to be bounded (please refer to (Redko et al., 2020) for a comprehensive review). The generic DA bound is, however, not designated for graph data or encoders, where further improvement could be achieved.

3. DOMAIN ADAPTATION WITH OPTIMAL TRANSPORT

Studies on transfer learning across distinctly distributed data have proliferated in the past few years, known as domain adaptation (DA) (Redko et al., 2020). Based on the aforementioned problem setup, we decompose the trained GNN h = g ∘ f into the feature extractor f : G → R^{D'} (Z = f(G)) and the discriminator g : R^{D'} → Y (Y = g(Z)). Without loss of generality, we consider a binary classification task where Y = [0, 1] and Y ∈ Y is the probability of belonging to class 1. We denote the labeling function given representations as ĝ : R^{D'} → Y, and the empirical source risk and the target risk as ε̂_S(g, ĝ) = (1/N_S) Σ_{n=1}^{N_S} |g(Z_n) − ĝ(Z_n)| and ε_T(g, ĝ) = E_{P_T(Z)}[|g(Z) − ĝ(Z)|], respectively. Applying DA with optimal transport (OT), if the covariate shift holds on representations, i.e., P_S(Y|Z) = P_T(Y|Z), the target risk ε_T(g, ĝ) is bounded as in the following theorem.
Theorem 1 (Redko et al., 2017; Shen et al., 2018; Li et al., 2021a). Suppose that the learned discriminator g is C_g-Lipschitz, where the Lipschitz norm ∥g∥_Lip = max_{Z_1,Z_2} |g(Z_1) − g(Z_2)| / ρ(Z_1, Z_2) = C_g holds for some distance function ρ (Euclidean distance here). Let H := {g : Z → Y} be the set of bounded real-valued functions with pseudo-dimension Pdim(H) = d such that g ∈ H. Then with probability at least 1 − δ the following inequality holds:

ε_T(g, ĝ) ≤ ε̂_S(g, ĝ) + √[(4d/N_S) log(eN_S/d) + (1/N_S) log(1/δ)] + 2C_g W_1(P_S(Z), P_T(Z)) + ω,

where ω = min_{∥g∥_Lip ≤ C_g} [ε_S(g, ĝ) + ε_T(g, ĝ)] denotes the model discriminative ability (to capture source and target data), and the first Wasserstein distance is defined as (Villani, 2009): W_1(P, Q) = sup_{∥g∥_Lip ≤ 1} E_{P(Z)}[g(Z)] − E_{Q(Z)}[g(Z)].

The thorough tightness justification of the OT-based DA bound can be found in (Redko et al., 2020), Sections 5.3-5.5. The theorem indicates that the generalization gap depends on both the domain divergence 2C_g W_1(P_S(Z), P_T(Z)) and the model discriminability ω.

Adversarial training. Motivated by Theorem 1 (or its variants differing mainly in the distribution divergence metric (Ben-David et al., 2007; 2010; Mansour et al., 2009)), a well-developed practice is to learn domain-invariant representations by jointly optimizing the source risk and the distribution divergence term (conceptually, W_1(P_S(Z), P_T(Z))) as in (Redko et al., 2017; Shen et al., 2018):

min_{f, ∥g∥_Lip ≤ C_g} (1/N_S) Σ_{n=1}^{N_S} ℓ(g ∘ f(G_n), Y_n) + γ Ŵ_1(P_S(f(G)), P_T(f(G))), (3)

where ℓ is the training loss function, Ŵ_1 is the empirically estimated first Wasserstein distance (implementation details follow (Shen et al., 2018) and are presented in Appendix E), and γ is the trade-off factor. Besides co-optimizing Ŵ_1, implementing domain classifiers is another effective way to alleviate the domain discrepancy (Zhang et al., 2019; Wu et al., 2020).
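To make the W_1 term concrete: for one-dimensional empirical distributions with equally many samples, the optimal transport plan simply matches sorted samples, so the empirical W_1 has a closed form. Below is a minimal numpy sketch of this special case (the function name `w1_empirical_1d` is ours for illustration); the actual method estimates Ŵ_1 on learned multi-dimensional representations with a Lipschitz-constrained neural critic, following (Shen et al., 2018).

```python
import numpy as np

def w1_empirical_1d(xs, ys):
    """First Wasserstein distance between two 1-D empirical distributions
    with equally many samples: the optimal coupling matches sorted samples,
    so W_1 is the mean absolute difference of the order statistics."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    assert xs.shape == ys.shape, "sketch assumes equal sample sizes"
    return float(np.mean(np.abs(xs - ys)))
```

For instance, shifting a distribution by a constant c moves W_1 by exactly |c|, matching the sup-over-1-Lipschitz-functions definition above.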

4. METHODS

4.1. MODEL-BASED DA BOUND FOR GRAPH-STRUCTURED DATA

Noticing that the generic DA theory (Theorem 1) is agnostic to data structures and encoders, our first step is to directly rewrite it for graph-structured data (G) together with graph feature extractors (f), as follows. The covariate shift assumption is now reframed as P_S(Y|G) = P_T(Y|G).

Corollary 1. Assume that the learned discriminator is C_g-Lipschitz continuous as described in Theorem 1, and the graph feature extractor f (also referred to as the GNN) is C_f-Lipschitz, i.e., ∥f∥_Lip = max_{G_1,G_2} ∥f(G_1) − f(G_2)∥_2 / η(G_1, G_2) = C_f for some graph distance measure η. Let H := {h : G → Y} be the set of bounded real-valued functions with pseudo-dimension Pdim(H) = d such that h = g ∘ f ∈ H. Then with probability at least 1 − δ the following inequality holds:

ε_T(h, ĥ) ≤ ε̂_S(h, ĥ) + √[(4d/N_S) log(eN_S/d) + (1/N_S) log(1/δ)] + 2C_f C_g W_1(P_S(G), P_T(G)) + ω, (4)

where the empirical source risk and the target risk are ε̂_S(h, ĥ) = (1/N_S) Σ_{n=1}^{N_S} |h(G_n) − ĥ(G_n)| and ε_T(h, ĥ) = E_{P_T(G)}[|h(G) − ĥ(G)|], respectively, ĥ : G → Y is the labeling function for graphs, and ω = min_{∥g∥_Lip ≤ C_g, ∥f∥_Lip ≤ C_f} [ε_S(h, ĥ) + ε_T(h, ĥ)].

The property in Corollary 1 designated for GNNs, which we focus on, is the Lipschitz constant C_f = max_{G_1,G_2} ∥f(G_1) − f(G_2)∥_2 / η(G_1, G_2). Success in instantiating this conceptual model-related property for graphs provides important insights into how to rationally construct or train GNNs for better transferability. Other data-relevant properties (e.g., W_1(P_S(G), P_T(G))) are left to future work. Instantiating the GNN Lipschitz constant is, however, nontrivial: the distance metric η(G_1, G_2) is formulated w.r.t. non-Euclidean structures, and its computation hinges on solving the graph matching problem, whose time complexity is exponential in the number of nodes (Riesen & Bunke, 2009; Riesen et al., 2010).
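To illustrate why computing η directly is intractable, the matching-style distance η(G_1, G_2) = min_P ∥X_1 − PX_2∥_F + ∥A_1 − PA_2P^T∥_F can be evaluated exactly only by enumerating all N! node permutations. A brute-force numpy sketch (the helper name `matching_distance` is ours; it is feasible only for toy graphs):

```python
import numpy as np
from itertools import permutations

def matching_distance(A1, X1, A2, X2):
    """Brute-force matching distance
    eta(G1, G2) = min_P ||X1 - P X2||_F + ||A1 - P A2 P^T||_F
    over all N! node permutations -- exponential in the number of nodes,
    as noted in the text, hence only usable on tiny graphs."""
    n = A1.shape[0]
    best = np.inf
    for perm in permutations(range(n)):
        P = np.eye(n)[list(perm)]  # permutation matrix: row i is e_{perm[i]}
        cost = (np.linalg.norm(X1 - P @ X2) +
                np.linalg.norm(A1 - P @ A2 @ P.T))
        best = min(best, cost)
    return best
```

For two isomorphic graphs with consistently relabeled features, the optimal permutation aligns them exactly and the distance is zero, which is why permutation invariance of GNNs (used in Lemma 1) lets us sidestep graph matching.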
Rather than working on the denominator of the Lipschitz norm, we take the alternative perspective of the numerator, i.e., ∥f(G_1) − f(G_2)∥_2, which is related to GNN stability (potentially multiplied by the distance term to eliminate the denominator). This essentially motivates us to draw a connection between transferability and stability. Thanks to the permutation invariance property of GNNs, graph matching does not need to be solved here. Building upon the graph filter theory (Gama et al., 2020; Arghal et al., 2021), we first state that GNNs can be constructed stably:

Lemma 1. Suppose that G is the set of graphs of size N_G after padding with isolated nodes, similar to (Zhu et al., 2021). Given any G_1, G_2 ∈ G, let A_1 = U_1 Λ_1 U_1^T and A_2 = U_2 Λ_2 U_2^T be the eigendecompositions of the adjacency matrices A_1 and A_2, with Λ_1 = diag([λ_{1,1}, ..., λ_{1,N_G}]) and Λ_2 = diag([λ_{2,1}, ..., λ_{2,N_G}]) (eigenvalues sorted in descending order). A GNN is constructed by composing a graph filter and a nonlinear mapping, i.e., f(G_1) = r(σ(S(A_1) X_1 W)) = r(σ(U_1 S(Λ_1) U_1^T X_1 W)), where S is the polynomial filter S(A_1) = Σ_{k=0}^{∞} s_k A_1^k, W ∈ R^{D×D'} is the learnable weight matrix, r is the mean/sum/max readout function that pools node representations, and the pointwise nonlinearity satisfies |σ(b) − σ(a)| ≤ |b − a| for all a, b ∈ R.
Assuming ∥X∥_op ≤ 1 and ∥W∥_op ≤ 1 (∥·∥_op stands for the operator norm), the following inequality holds:

∥f(G_1) − f(G_2)∥_2 ≤ C_λ (1 + τ√N_G) ∥A_1 − P*A_2P*^T∥_F + O(∥A_1 − P*A_2P*^T∥_F^2) + max|S(Λ_2)| ∥X_1 − P*X_2∥_F,

where τ = (∥U_1 − U_2∥_F + 1)^2 − 1 stands for the eigenvector misalignment (which can be bounded), P* = argmin_{P∈Π} ∥X_1 − PX_2∥_F + ∥A_1 − PA_2P^T∥_F, Π is the set of permutation matrices, O(∥A_1 − P*A_2P*^T∥_F^2) is the remainder term with bounded multipliers defined in (Gama et al., 2020), and C_λ is the spectral Lipschitz constant such that |S(λ_i) − S(λ_j)| ≤ C_λ |λ_i − λ_j| for all λ_i, λ_j.

Proof. See Appendix A.
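As a numeric sanity check of the node-feature part of Lemma 1, consider the simplest filter S(A) = A with ReLU nonlinearity and mean readout, and perturb only the node features so the edge terms vanish; the output difference is then bounded by max|S(Λ)| (here, the spectral radius of A) times ∥X_1 − X_2∥_F. A numpy sketch under these simplifying assumptions:

```python
import numpy as np

def gnn(A, X, W):
    """One-layer GNN from Lemma 1 with linear filter S(A) = A,
    ReLU nonlinearity (pointwise 1-Lipschitz) and mean readout."""
    return np.maximum(A @ X @ W, 0.0).mean(axis=0)

rng = np.random.default_rng(0)
n, d, d_out = 6, 4, 3
A = rng.integers(0, 2, (n, n)).astype(float)
A = np.triu(A, 1); A = A + A.T                      # random undirected graph
X1 = rng.normal(size=(n, d))
X2 = X1 + 0.01 * rng.normal(size=(n, d))            # perturb node features only
W = rng.normal(size=(d, d_out))
W /= np.linalg.norm(W, 2)                           # enforce ||W||_op <= 1

# For S(lambda) = lambda, max|S(Lambda)| is the spectral radius of A.
mfr = np.max(np.abs(np.linalg.eigvalsh(A)))
lhs = np.linalg.norm(gnn(A, X1, W) - gnn(A, X2, W))
rhs = mfr * np.linalg.norm(X1 - X2)                 # Lemma 1 bound, edges fixed
assert lhs <= rhs + 1e-9
```

We use the mean readout in the sketch since it is nonexpansive w.r.t. the Frobenius norm, so the inequality holds without the padding discussion below.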

The lemma integrates the stability properties w.r.t. node feature and edge perturbations from (Gama et al., 2020) and (Arghal et al., 2021), respectively, where C_λ, tied to edge perturbations, characterizes the spectral smoothness (SS) of the underlying graph filter, and max(|S(Λ_2)|), tangled with node feature perturbations, defines the maximum frequency response (MFR). We note that in practice, for Lemma 1, no padding is actually needed if summation pooling is adopted as in our experiments (Xu et al., 2018), since the padded nodes do not contribute to the graph embedding (Zhu et al., 2021). Analysis of more sophisticated architectures is left to future work. We next instantiate the GNN Lipschitz constant for the commonly-used matching distance (Gama et al., 2020; Arghal et al., 2021) in the following lemma, achieved by eliminating the distance term in the numerator and denominator of the GNN Lipschitz constant.

Lemma 2. Suppose that G is the set of graphs of size N_G after padding with isolated nodes, similar to (Zhu et al., 2021). We define the matching distance between G_1, G_2 ∈ G as η(G_1, G_2) = min_{P∈Π} ∥X_1 − PX_2∥_F + ∥A_1 − PA_2P^T∥_F. Suppose that the edge perturbation is bounded, i.e., ∥A_1 − P*A_2P*^T∥_F ≤ ε for all G_1, G_2 ∈ G with the optimal permutation P*, and that there exists an eigenvalue λ* ∈ R achieving the maximum |S(λ*)| < ∞. We can then calculate the Lipschitz constant of the GNN as:

C_f = max{C_λ K_1 + εK_2, |S(λ*)|}, (6)

where K_1 and K_2 are the suprema of (1 + τ√N_G) and of the remainder multiplier in Lemma 1, respectively, following similar philosophies to (Gama et al., 2020), Theorem 1.

Proof. See Appendix B.

In implementation, we regularize the difference between the numerators and the denominators multiplied by a certain threshold, considering the numerical instability of division. To recap, C_λ and |S(λ*)| characterize SS and MFR, respectively.
Thereafter, the direct combination of Corollary 1 and Lemma 2 yields the first model-based DA bound for graph data. It builds the foundation for understanding and further developing GDA algorithms. Although in our theory the GNN is assumed to be composed of a graph filter and a nonlinear activation layer, in practice such an architecture is ubiquitous, well known for its simplicity and effectiveness (Wu et al., 2019; Li et al., 2019). We also provide theory for two-layer GNNs in Appendix I, which is naturally extendable to multi-layer architectures via induction.
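For a concrete polynomial filter, the two quantities in Lemma 2 can be approximated on a grid over the spectral range: C_λ as the maximum of |S'(λ)| and MFR as the maximum of |S(λ)|, after which C_f = max{C_λK_1 + εK_2, |S(λ*)|}. A numpy sketch, where the grid approximation and the treatment of K_1, K_2 as given constants are our simplifications:

```python
import numpy as np

def spectral_constants(coeffs, lam_min=-1.0, lam_max=1.0, grid=10001):
    """Grid approximation of the two spectral properties of a polynomial
    filter S(lambda) = sum_k s_k lambda^k over [lam_min, lam_max]:
    C_lambda (spectral smoothness, a Lipschitz constant of S, approximated
    by max |S'(lambda)|) and MFR = max |S(lambda)|."""
    lam = np.linspace(lam_min, lam_max, grid)
    S = np.polynomial.polynomial.polyval(lam, coeffs)
    dS = np.polynomial.polynomial.polyval(
        lam, np.polynomial.polynomial.polyder(coeffs))
    return np.max(np.abs(dS)), np.max(np.abs(S))

def lipschitz_bound(coeffs, K1, K2, eps):
    """C_f = max(C_lambda * K1 + eps * K2, MFR) as in Lemma 2;
    K1, K2 (the suprema constants) are taken as given inputs here."""
    c_lam, mfr = spectral_constants(coeffs)
    return max(c_lam * K1 + eps * K2, mfr)
```

For the identity filter S(λ) = λ on [-1, 1], both quantities equal 1, so C_f reduces to max{K_1 + εK_2, 1}.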

4.2. THEORY-GROUNDED SPECTRAL REGULARIZATION

Recall that, in the DA inequality (4) (see also Section 3), the gap between the target and source risks is bounded by (i) the distribution divergence between source and target (the W_1 term) multiplied by Lipschitz constants, and (ii) the discriminative capability of NNs to capture invariant knowledge (the ω term), restricted by Lipschitz constants. This indicates that varying the Lipschitz constant induces an intrinsic trade-off between domain divergence and discriminability, which varies the DA bound. Therefore, in search of a sweet spot in this trade-off, one way to tighten the bound is to regularize the Lipschitz constant of NNs. Equipped with the graph model-based DA bound derived in Section 4.1, we are now ready to propose regularizing the GNN spectral properties of SS or MFR to balance domain divergence and discriminability. We implement the regularization as follows.

Implementations. Driven by the analysis, we incorporate spectral regularization (SpecReg ∈ {SSReg, MFRReg}; SSReg for C_λ and MFRReg for |S(λ*)| in equation (6)) into the traditional domain-invariant representation learning framework (3) as follows:

min_{f, ∥g∥_Lip ≤ C_g} (1/N_S) Σ_{n=1}^{N_S} ℓ(g ∘ f(G_n), Y_n) + γ Ŵ_1(P_S(f(G)), P_T(f(G))) + γ' SpecReg(f, {G_n}_{n=1}^{N_S}, {G_n}_{n=1}^{N_T}), (7)

where γ' is the trade-off factor, tuned over {1, 1e-1, 1e-2, 1e-3} through validation. Denoting Z = f(G) and the spectral signals Ẑ = U^T Z and X̂ = U^T X, the SS and MFR regularizations (SSReg, MFRReg) are specifically implemented by regularizing the spectral outputs w.r.t.
spectral inputs as:

SSReg(f, {G_n}_{n=1}^{N_S}, {G_n}_{n=1}^{N_T}) = Σ_{D∈{S,T}} (1/N_D) Σ_{n=1}^{N_D} sum(ReLU(abs(Ẑ_n[2:N_G, :] − Ẑ_n[1:N_G−1, :]) − υ abs(Λ_n X̂_n[2:N_G, :] − Λ_n X̂_n[1:N_G−1, :]))),

MFRReg(f, {G_n}_{n=1}^{N_S}, {G_n}_{n=1}^{N_T}) = Σ_{D∈{S,T}} (1/N_D) Σ_{n=1}^{N_D} sum(ReLU(abs(Ẑ_n) − υ abs(X̂_n))),

where X[m, n] denotes matrix indexing, sum is the matrix summation, ReLU and abs are the pointwise rectifier and absolute-value functions, respectively, and υ is the threshold value tuned over {0.1, 1, 10, 100, 1000} through validation. An overview of the implementation is depicted in Figure 1. An analysis of the complexity of spectral regularization (with eigendecomposition) is presented in Appendix H.

How could theory further drive practice? We also notice that in special scenarios, either of the two regularizations (SS and MFR) can be favored over the other. (i) Node transfer: when the node features (mostly) preserve the invariant label-related information, while edges are less label-relevant (noisy) and distributed highly distinctly across source and target domains, we would have large K_1 and ε in Lemma 2 such that C_λ K_1 + εK_2 > |S(λ*)|. Thus regularizing SS decreases the domain-divergence term (C_f decreases) with a smaller increase in the discriminability term w.r.t. node features (the GNN remains sensitive enough to label-preserving node features, as shown in Lemma 1). (ii) Link transfer: oppositely, when edges are invariantly label-preserving across source and target domains whereas node features are relatively noisy, regularizing MFR is more desirable. This provides guidance for selecting the regularization in practical applications with domain knowledge.
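The two regularizers above can be sketched for a single graph as follows (numpy, assuming a symmetric adjacency matrix, spectral signals Ẑ = U^T Z and X̂ = U^T X, and matching feature dimensions of Z and X; the released implementation operates on batches with autograd, so this is only an illustrative sketch):

```python
import numpy as np

def spec_reg(A, X, Z, upsilon=1.0, mode="ss"):
    """Sketch of SSReg / MFRReg for one graph; Z = f(G) are the node
    representations (assumed to share the feature dimension of X here).
    mode="ss":  penalize spectral-output differences between adjacent
                frequencies exceeding upsilon times the eigenvalue-weighted
                spectral-input differences (targets spectral smoothness).
    mode="mfr": penalize spectral-output magnitudes exceeding upsilon
                times the spectral-input magnitudes (targets MFR)."""
    lam, U = np.linalg.eigh(A)                  # A = U diag(lam) U^T
    order = np.argsort(lam)[::-1]               # descending eigenvalues
    lam, U = lam[order], U[:, order]
    Z_hat, X_hat = U.T @ Z, U.T @ X             # spectral signals
    if mode == "mfr":
        return float(np.sum(np.maximum(np.abs(Z_hat)
                                       - upsilon * np.abs(X_hat), 0.0)))
    dZ = np.abs(Z_hat[1:] - Z_hat[:-1])
    dX = np.abs(lam[1:, None] * X_hat[1:] - lam[:-1, None] * X_hat[:-1])
    return float(np.sum(np.maximum(dZ - upsilon * dX, 0.0)))
```

For an identity encoder (Z = X) and υ = 1, MFRReg is zero by construction, while SSReg penalizes only the portion of output smoothness not explained by the eigenvalue-weighted inputs.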

4.3. EXTENSION TO THE SEMI-SUPERVISED SETTING

Results so far do not assume access to target labels, i.e., the unsupervised setting. In this section, we show that our analysis also applies in the semi-supervised setting, where the more challenging conditional shift, P_S(Y|G) ≠ P_T(Y|G), is considered in addition to access to a small amount of target labelled data (Li et al., 2021a; Zhao et al., 2019). We provide a finite-sample semi-supervised OT-based GDA bound in the following lemma.

Lemma 3. Under the assumptions of Corollary 1, we further assume that there exists a small amount of i.i.d. labelled samples {(G_n, Y_n)}_{n=1}^{N'_T} from the target distribution P_T(G, Y) (N'_T ≪ N_S), and bring in the conditional shift assumption that the domains have different labeling functions ĥ_S ≠ ĥ_T with max_{G_1,G_2} |ĥ_D(G_1) − ĥ_D(G_2)| / η(G_1, G_2) = C_h ≤ C_f C_g (D ∈ {S, T}) for some constant C_h and distance measure η. Let H := {h : G → Y} be the set of bounded real-valued functions with pseudo-dimension Pdim(H) = d. Then with probability at least 1 − δ the following inequality holds:

ε_T(h, ĥ_T) ≤ (N'_T/(N_S+N'_T)) ε̂_T(h, ĥ_T) + (N_S/(N_S+N'_T)) ε̂_S(h, ĥ_S) + (N_S/(N_S+N'_T)) ([(8d/N'_T) log(eN'_T/d) + (2/N'_T) log(1/δ) + (8d/N_S) log(eN_S/d) + (2/N_S) log(1/δ)]^{1/2} + 2C_f C_g W_1(P_S(G), P_T(G)) + ω), (10)

where ω = min{|ε_S(h, ĥ_S) − ε_S(h, ĥ_T)|, |ε_T(h, ĥ_S) − ε_T(h, ĥ_T)|}.

Proof. See Appendix C.

Since the main form of inequality (10) is consistent with inequality (4), a similar discussion can be made following Section 4.2, demonstrating that we can properly regularize the spectral properties of GNNs (SS and MFR) to tighten the target risk bound and seek a sweet spot between domain divergence and discriminability.
Following optimization (7), the semi-supervised training procedure is implemented as:

min_{f, ∥g∥_Lip ≤ C_g} (1/N_S) Σ_{n=1}^{N_S} ℓ(g ∘ f(G_n), Y_n) + (1/N'_T) Σ_{n=1}^{N'_T} ℓ(g ∘ f(G_n), Y_n) + γ Ŵ_1(P_S(f(G)), P_T(f(G))) + γ' SpecReg(f, {G_n}_{n=1}^{N_S}, {G_n}_{n=1}^{N_T}).

5. EXPERIMENTS

We evaluate our proposed algorithms, SSReg and MFRReg, in two real-world applications of graph transfer learning: (i) link prediction of protein-protein interactions (PPIs) across different species (Szklarczyk et al., 2021), and (ii) node classification of paper topics across different time periods (Wu et al., 2020; Hu et al., 2020).

5.1. PREDICTING PROTEIN-PROTEIN INTERACTIONS IN VARIOUS SPECIES

Datasets. PPI networks have proven important for understanding functional genomics and analyzing biological pathways (Sharan et al., 2007; Navlakha & Kingsford, 2010), but in most species the coverage of experimental PPI data remains low (Sledzieski et al., 2021). We utilize protein sequences together with freely accessible computational PPIs from whole-genome comparisons (Szklarczyk et al., 2021) to predict experimental PPIs; that is, the graph is built with nodes represented as protein sequences and edges as computational PPIs. We collect PPIs of species from the STRING database (Szklarczyk et al., 2021), where PPIs in the neighborhood, fusion and co-occurrence channels are defined as computational, and those in the co-expression and experiments channels (to be predicted; we refer to the latter as physical to prevent confusion) are experimental, requiring expensive functional genomics experiments (Parkinson et al., 2009) or direct lab assays (Brückner et al., 2009). More details of the PPI data are shown in Appendix D, and ablations in Appendix G.

Hypothesis. For co-expression interactions, we hypothesize that they rely more on edges (computational PPIs; link transfer), since two proteins are potentially similar in expression profiles if they have similar promoter regions (which might indicate locating in the neighborhood) (Park et al., 2002), are involved in fusion events (Fernebro et al., 2006), or share co-occurrence patterns (Larmuseau et al., 2019). For physical interactions, we hypothesize that they rely more on node features (sequences; node transfer), considering the recent breakthrough that protein structure information (and ultimately physical interactions and functions) can be recovered from sequences with high accuracy (Jumper et al., 2021). We calculate the homophily ratios (Zhu et al., 2020a; Pei et al., 2020) of 3 out of 4 species in Figure 2 as numerical evidence to support the hypotheses (see Appendix F for the computing procedure; zebrafish is excluded due to its overly sparse (<200) physical interactions, as shown in Appendix D).

Results. The results of unsupervised transfer are shown in Tables 1 and 2.
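The homophily ratio cited as evidence (Zhu et al., 2020a) can be sketched as the fraction of edges whose two endpoints share a label; the exact computing procedure used in the paper is in Appendix F, so the helper below is only an illustrative approximation:

```python
import numpy as np

def edge_homophily(edges, labels):
    """Edge homophily ratio (in the spirit of Zhu et al., 2020a): the
    fraction of edges whose endpoints share the same label -- a quick
    signal of whether a prediction target is edge- or feature-driven."""
    edges = np.asarray(edges)
    same = labels[edges[:, 0]] == labels[edges[:, 1]]
    return float(np.mean(same))
```

A ratio near 1 suggests labels align with the graph topology (favoring link transfer and MFRReg), whereas a low ratio suggests the features carry the label signal (favoring node transfer and SSReg).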
We put the semi-supervised results in Appendix G, with similar findings. Through comparisons between (i) methods w/o and w/ DA techniques, (ii) methods w/o and w/ spectral regularization, and (iii) our proposed methods and SOTA competitors, we have the following findings.

(i) Vanilla DA in general provides benefits, though it occasionally degrades performance. With the assistance of domain-invariant representation learning, GNNs generally achieve better performance, with exceptions: with Wasserstein distance guided DA (DA-W, as in optimization (3) (Redko et al., 2017; Shen et al., 2018)), 11 out of 16 metrics are higher than w/o DA in the unsupervised setting, and 10 out of 16 in the semi-supervised setting. The occasional performance degradation fits our analysis in Section 4.2: optimizing the source risk and the distribution divergence of representations can lead to either improved or deteriorated performance without guarantee. A similar phenomenon happens to DA with domain classifiers (DA-C, (Ganin et al., 2016)). Comparison between DA-W and DA-C shows that DA-W performs better than DA-C when transferring to graphs with a larger domain gap (i.e., from human to fruit fly and yeast, justified by their biological taxonomic ranks and phylogenetic distances (Alberts et al., 2002); see Appendix D for details).

(ii) In a principled way, spectral regularization further boosts GNN performance with DA-W and consistently alleviates performance degradation. Under the regularization of GNN spectral properties, domain-invariant representation learning provides further benefits in a principled way. Specifically, MFRReg improves 13 out of 16 co-expression interaction prediction metrics versus DA-W, since it assists the GNN in mining from the computational PPIs with which co-expression interactions are more correlated, as hypothesized in Section 5.1.
We show in Figure 3 that this is a consequence of a better sweet spot when regularizing MFR, which echoes our analysis in Section 4.2 that MFRReg benefits more in the link transfer setting. Moreover, SSReg improves 13 out of 16 physical interaction prediction metrics, reinforcing the GNN to dig more from the protein sequence embeddings to which physical interactions are more relevant. This echoes our analysis that SSReg benefits more in the node transfer setting, conforming to our theory-grounded regularization design. Besides, we observe that the improvements are more significant in the unsupervised setting than in the semi-supervised one. A possible reason is that, under the guidance of some target labelled data during transfer, DA-W is less prone to capturing superfluous information even without regularization.

(iii) Integrating protein sequences together with computational PPIs leads to better transfer performance. Comparing with SOTA competitors, we find that utilizing computational PPIs alone (such as Mashup) or protein sequences alone (such as D-SCRIPT, which heavily relies on a sequence encoder pre-trained on a tremendous and diverse population of protein sequences or even structures (Bepler & Berger, 2019)) leads to less competitive results than integrating them together when transferring across species. For self-supervised pre-training methods (such as GraphCL), the existing domain gap prompts "negative transfer" in the unsupervised setting, which is alleviated under the guidance of some target labels in the semi-supervised setting.

5.2. CLASSIFYING PAPER TOPICS OF DIFFERENT TIME PERIODS

Datasets. The node classification datasets follow (Wu et al., 2020); see Appendix E for more details. We also assay our methods on the large-scale OGB benchmark (Hu et al., 2020).

Hypothesis. Paper topics are verified to have strong homophily with citations (Zhu et al., 2020a; Pei et al., 2020) (link transfer), which is adopted as the hypothesis in our study.

Results. The results of unsupervised transfer are shown in Table 3, demonstrating the applicability of the proposed spectral regularization across applications. Comparing graph representation learning w/o and w/ DA techniques, domain-invariant representations generally improve performance. Further, built upon the SOTA UDAGCN, applying DA-W to minimize the domain divergence of representations benefits the transfer performance from ACM to DBLP while hurting it from DBLP to ACM, which is consistent with observation (i). Similar results on the large-scale ogbn-arxiv dataset are also shown in Table 3 (see also Appendix G, Table 8 for the semi-supervised setting). Via spectral regularization, specifically MFRReg to exploit the invariant topology information across domains that is more related to node labels as hypothesized, our methods achieve better performance than all competitors, which is consistent with observation (ii) in Section 5.1.

6. CONCLUSIONS

To fulfill the practical demands of transfer learning on graph data, we develop theory-grounded spectral regularization for GNNs to learn transferable graph representations. We first leverage the theory of domain adaptation with optimal transport to derive a guaranteed bound on the transfer performance. This bound reveals that varying the Lipschitz constant of GNNs can tighten the bound by balancing domain divergence and GNN discriminability. Building on the graph filter theory, we then show that one can regularize GNN spectral properties to regularize the Lipschitz constant, which motivates our proposed spectral regularizations. Numerical results conform to our theoretical analysis that regularizing SS and MFR brings benefits to the scenarios of node transfer and link transfer, respectively, in both the unsupervised and semi-supervised settings.

APPENDIX

A PROOF FOR LEMMA 1

Lemma 1. Suppose that G is the set of graphs of size N_G after padding with isolated nodes, similar to (Zhu et al., 2021). Given any G_1, G_2 ∈ G, let A_1 = U_1 Λ_1 U_1^T and A_2 = U_2 Λ_2 U_2^T be the eigendecompositions of the adjacency matrices A_1 and A_2, with Λ_1 = diag([λ_{1,1}, ..., λ_{1,N_G}]) and Λ_2 = diag([λ_{2,1}, ..., λ_{2,N_G}]) (eigenvalues sorted in descending order). A GNN is constructed by composing a graph filter and a nonlinear mapping, i.e., f(G_1) = r(σ(S(A_1) X_1 W)) = r(σ(U_1 S(Λ_1) U_1^T X_1 W)), where S is the polynomial filter S(A_1) = Σ_{k=0}^{∞} s_k A_1^k, W ∈ R^{D×D'} is the learnable weight matrix, r is the mean/sum/max readout function that pools node representations, and the pointwise nonlinearity satisfies |σ(b) − σ(a)| ≤ |b − a| for all a, b ∈ R. Assuming ∥X∥_op ≤ 1 and ∥W∥_op ≤ 1, the following inequality holds:

∥f(G_1) − f(G_2)∥_2 ≤ C_λ (1 + τ√N_G) ∥A_1 − P*A_2P*^T∥_F + O(∥A_1 − P*A_2P*^T∥_F^2) + max|S(Λ_2)| ∥X_1 − P*X_2∥_F,

where τ = (∥U_1 − U_2∥_F + 1)^2 − 1 stands for the eigenvector misalignment (which can be bounded), P* = argmin_{P∈Π} ∥X_1 − PX_2∥_F + ∥A_1 − PA_2P^T∥_F, Π is the set of permutation matrices, O(∥A_1 − P*A_2P*^T∥_F^2) is the remainder term with bounded multipliers defined in (Gama et al., 2020), and C_λ is the spectral Lipschitz constant such that |S(λ_i) − S(λ_j)| ≤ C_λ |λ_i − λ_j| for all λ_i, λ_j.

Proof. Denote the optimal permutation matrix for G_1, G_2 as P*. We compute the difference of the GNN outputs:

∥f(G_1) − f(G_2)∥_2 = ∥r(σ(S(A_1)X_1W)) − r(σ(S(A_2)X_2W))∥_2
(a) = ∥r(σ(S(A_1)X_1W)) − r(σ(S(P*A_2P*^T)P*X_2W))∥_2
(b) ≤ ∥S(A_1)X_1W − S(P*A_2P*^T)P*X_2W∥_F
(c) ≤ ∥W∥_op ∥S(A_1)X_1 − S(P*A_2P*^T)X_1 + S(P*A_2P*^T)X_1 − S(P*A_2P*^T)P*X_2∥_F
(d) ≤ ∥W∥_op ∥X_1∥_op ∥S(A_1) − S(P*A_2P*^T)∥_F + ∥W∥_op ∥S(P*A_2P*^T)∥_op ∥X_1 − P*X_2∥_F
(e) ≤ ∥S(A_1) − S(P*A_2P*^T)∥_F + max(|S(Λ_2)|) ∥X_1 − P*X_2∥_F
(f) ≤ C_λ (1 + τ√N_G) ∥A_1 − P*A_2P*^T∥_F + O(∥A_1 − P*A_2P*^T∥_F^2) + max(|S(Λ_2)|) ∥X_1 − P*X_2∥_F,

where (a) uses the permutation invariance of the GNN; (b) uses the nonexpansiveness of the pointwise nonlinearity and the readout; (c) inserts the cross term S(P*A_2P*^T)X_1 and, with (d), further applies the triangle inequality together with operator-norm submultiplicativity; (e) adopts the assumption ∥X∥_op ≤ 1, ∥W∥_op ≤ 1, which in practice can be guaranteed with normalization and easily extended to the case ∥X∥_op ≤ K, ∥W∥_op ≤ K for any K > 0, and, because S(P*A_2P*^T) = (P*U_2)S(Λ_2)(P*U_2)^T can be diagonalized, its operator norm equals its spectral radius; (f) is the direct outcome borrowed from (Gama et al., 2020), Theorem 1. We complete the proof.

B PROOF FOR LEMMA 2

Lemma 2. Suppose that G is the set of graphs of size N_G after padding with isolated nodes, similar to (Zhu et al., 2021). Define the matching distance between G_1, G_2 ∈ G as η(G_1, G_2) = min_{P∈Π} ∥X_1 − PX_2∥_F + ∥A_1 − PA_2P^T∥_F. Suppose that the edge perturbation is bounded, i.e., ∥A_1 − P*A_2P*^T∥_F ≤ ε for all G_1, G_2 ∈ G with the optimal permutation P*, and that there exists an eigenvalue λ* ∈ R achieving the maximum |S(λ*)| < ∞. We can then calculate the Lipschitz constant of the GNN as:

C_f = max{C_λ K_1 + εK_2, |S(λ*)|},

where K_1 and K_2 are the suprema of (1 + τ√N_G) and of the remainder multiplier in Lemma 1, respectively, following similar philosophies to (Gama et al., 2020), Theorem 1.

Proof.
To calculate the Lipschitz constant $C_f$ w.r.t. the matching distance, based upon Lemma 1, we assure the following inequality:
$$\|f(G_1) - f(G_2)\|_2 \le C_\lambda (1 + \tau \sqrt{N_G}) \|A_1 - P^* A_2 P^{*T}\|_F + O(\|A_1 - P^* A_2 P^{*T}\|_F^2) + |S(\lambda^*)| \|X_1 - P^* X_2\|_F \le C_f \, \eta(G_1, G_2),$$
the latter inequality of which can be rewritten as:
$$\left( C_\lambda (1 + \tau \sqrt{N_G}) - C_f \right) \|A_1 - P^* A_2 P^{*T}\|_F + O(\|A_1 - P^* A_2 P^{*T}\|_F^2) + \left( |S(\lambda^*)| - C_f \right) \|X_1 - P^* X_2\|_F \le 0,$$
for which it is sufficient that:
$$\left( C_\lambda (1 + \tau \sqrt{N_G}) - C_f \right) \|A_1 - P^* A_2 P^{*T}\|_F + O(\|A_1 - P^* A_2 P^{*T}\|_F^2) \le 0, \qquad \left( |S(\lambda^*)| - C_f \right) \|X_1 - P^* X_2\|_F \le 0,$$
which is equivalent to:
$$C_f \ge C_\lambda K_1 + \varepsilon K_2, \qquad C_f \ge |S(\lambda^*)|.$$
The bounding of $K_1, K_2$ follows (Gama et al., 2020), Theorem 1, and the first minimum solution can be calculated from the quadratic function w.r.t. the edge matching distance $\|A_1 - P^* A_2 P^{*T}\|_F$. Letting $C_f$ take the larger of the two values, we complete the proof. □

C PROOF FOR LEMMA 3

Lemma 3. Under the assumption of Corollary 1, we further assume that there exists a small amount of i.i.d. labeled samples $\{(G_n, Y_n)\}_{n=1}^{N'_T}$ from the target distribution $P_T(G, Y)$ ($N'_T \ll N_S$), and bring in the conditional-shift assumption that the domains have different labeling functions $\hat{h}_S \ne \hat{h}_T$ with $\max_{G_1, G_2} \frac{|\hat{h}_D(G_1) - \hat{h}_D(G_2)|}{\eta(G_1, G_2)} = C_h \le C_f C_g$ ($D \in \{\mathrm{S}, \mathrm{T}\}$) for some constant $C_h$ and distance measure $\eta$. Let $\mathcal{H} := \{h : \mathcal{G} \to \mathcal{Y}\}$ be the set of bounded real-valued functions with pseudo-dimension $Pdim(\mathcal{H}) = d$. Then with probability at least $1 - \delta$ the following inequality holds:
$$\epsilon_T(h, \hat{h}_T) \le \frac{N'_T}{N_S + N'_T} \hat{\epsilon}_T(h, \hat{h}_T) + \frac{N_S}{N_S + N'_T} \hat{\epsilon}_S(h, \hat{h}_S) + \frac{N_S}{N_S + N'_T} \left( 2 C_f C_g W_1(P_S(G), P_T(G)) + \omega \right) + \left[ \frac{8d}{N'_T} \log\!\left(\frac{e N'_T}{d}\right) + \frac{2}{N'_T} \log\!\left(\frac{1}{\delta}\right) + \frac{8d}{N_S} \log\!\left(\frac{e N_S}{d}\right) + \frac{2}{N_S} \log\!\left(\frac{1}{\delta}\right) \right]^{\frac{1}{2}},$$
where $\omega = \min\{ |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T)|, \; |\epsilon_T(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| \}$.

Proof.
Before showing the designated lemma, we first introduce the following inequality to be used:
$$|\epsilon_S(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| = |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T) + \epsilon_S(h, \hat{h}_T) - \epsilon_T(h, \hat{h}_T)| \le |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T)| + |\epsilon_S(h, \hat{h}_T) - \epsilon_T(h, \hat{h}_T)| \overset{(a)}{\le} |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T)| + 2 C_f C_g W_1\!\left(P_S(G), P_T(G)\right),$$
where (a) results from (Shen et al., 2018), Lemma 1, with the assumption $\max\!\left(\|h\|_{Lip}, \max_{G_1, G_2} \frac{|\hat{h}_D(G_1) - \hat{h}_D(G_2)|}{\eta(G_1, G_2)}\right) \le C_f C_g$, $D \in \{\mathrm{S}, \mathrm{T}\}$. Similarly, we obtain:
$$|\epsilon_S(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| \le |\epsilon_T(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| + 2 C_f C_g W_1\!\left(P_S(G), P_T(G)\right).$$
We therefore combine them into:
$$|\epsilon_S(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| \le 2 C_f C_g W_1\!\left(P_S(G), P_T(G)\right) + \min\{ |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T)|, |\epsilon_T(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| \},$$
i.e. the following holds to bound the target risk $\epsilon_T(h, \hat{h}_T)$:
$$\epsilon_T(h, \hat{h}_T) \le \epsilon_S(h, \hat{h}_S) + 2 C_f C_g W_1\!\left(P_S(G), P_T(G)\right) + \min\{ |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T)|, |\epsilon_T(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| \}.$$
We next link the bound with the empirical risk and the labeled sample size by showing, with probability at least $1 - \delta$, that:
$$\epsilon_T(h, \hat{h}_T) \le \hat{\epsilon}_S(h, \hat{h}_S) + 2 C_f C_g W_1\!\left(P_S(G), P_T(G)\right) + \min\{ |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T)|, |\epsilon_T(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| \} + \sqrt{\frac{2d}{N_S} \log\!\left(\frac{e N_S}{d}\right)} + \sqrt{\frac{1}{2 N_S} \log\!\left(\frac{1}{\delta}\right)},$$
and:
$$\epsilon_T(h, \hat{h}_T) \le \hat{\epsilon}_T(h, \hat{h}_T) + \sqrt{\frac{2d}{N'_T} \log\!\left(\frac{e N'_T}{d}\right)} + \sqrt{\frac{1}{2 N'_T} \log\!\left(\frac{1}{\delta}\right)},$$
which results from (Mohri et al., 2018), Theorem 11.8.
Lastly, we combine the above two inequalities; with probability at least $1 - \delta$:
$$\begin{aligned}
\epsilon_T(h, \hat{h}_T) &\overset{(a)}{\le} \frac{N'_T}{N_S + N'_T} \left[ \hat{\epsilon}_T(h, \hat{h}_T) + \sqrt{\frac{2d}{N'_T} \log\!\left(\frac{e N'_T}{d}\right)} + \sqrt{\frac{1}{2 N'_T} \log\!\left(\frac{1}{\delta}\right)} \right] + \frac{N_S}{N_S + N'_T} \left[ \hat{\epsilon}_S(h, \hat{h}_S) + \sqrt{\frac{2d}{N_S} \log\!\left(\frac{e N_S}{d}\right)} + \sqrt{\frac{1}{2 N_S} \log\!\left(\frac{1}{\delta}\right)} \right] \\
&\quad + \frac{N_S}{N_S + N'_T} \left[ 2 C_f C_g W_1(P_S(G), P_T(G)) + \min( |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T)|, |\epsilon_T(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| ) \right] \\
&\overset{(b)}{\le} \frac{N'_T}{N_S + N'_T} \left[ \hat{\epsilon}_T(h, \hat{h}_T) + \sqrt{\frac{4d}{N'_T} \log\!\left(\frac{e N'_T}{d}\right) + \frac{1}{N'_T} \log\!\left(\frac{1}{\delta}\right)} \right] + \frac{N_S}{N_S + N'_T} \left[ \hat{\epsilon}_S(h, \hat{h}_S) + \sqrt{\frac{4d}{N_S} \log\!\left(\frac{e N_S}{d}\right) + \frac{1}{N_S} \log\!\left(\frac{1}{\delta}\right)} \right] \\
&\quad + \frac{N_S}{N_S + N'_T} \left[ 2 C_f C_g W_1(P_S(G), P_T(G)) + \min( |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T)|, |\epsilon_T(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| ) \right] \\
&\overset{(c)}{\le} \frac{N'_T}{N_S + N'_T} \hat{\epsilon}_T(h, \hat{h}_T) + \frac{N_S}{N_S + N'_T} \hat{\epsilon}_S(h, \hat{h}_S) + \frac{N_S}{N_S + N'_T} \left[ 2 C_f C_g W_1(P_S(G), P_T(G)) + \min( |\epsilon_S(h, \hat{h}_S) - \epsilon_S(h, \hat{h}_T)|, |\epsilon_T(h, \hat{h}_S) - \epsilon_T(h, \hat{h}_T)| ) \right] \\
&\quad + \left[ \frac{8d}{N'_T} \log\!\left(\frac{e N'_T}{d}\right) + \frac{2}{N'_T} \log\!\left(\frac{1}{\delta}\right) + \frac{8d}{N_S} \log\!\left(\frac{e N_S}{d}\right) + \frac{2}{N_S} \log\!\left(\frac{1}{\delta}\right) \right]^{\frac{1}{2}},
\end{aligned}$$
where (a) is the outcome of applying the union bound with coefficients $\frac{N'_T}{N_S + N'_T}$ and $\frac{N_S}{N_S + N'_T}$; (b) and (c) result from the Cauchy-Schwarz inequality, and (c) additionally adopts the assumption $N'_T \ll N_S$, following the sleight-of-hand in (Li et al., 2021a), Theorem 3.2. □
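As a numerical sanity check of the key intermediate inequality in Lemma 1's proof (step (e), before invoking (Gama et al., 2020)), the bound can be verified for a one-layer GNN with mean readout. The graphs, filter coefficients, and dimensions below are illustrative choices, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sym_adj(n):
    """Random symmetric 0/1 adjacency matrix without self-loops."""
    A = (rng.random((n, n)) < 0.4)
    A = np.triu(A, 1)
    return (A + A.T).astype(float)

def S(A, coeffs=(0.0, 1.0, 0.1)):
    """Polynomial graph filter S(A) = s0*I + s1*A + s2*A^2 (hypothetical coefficients)."""
    out, P = np.zeros_like(A), np.eye(A.shape[0])
    for c in coeffs:
        out += c * P
        P = P @ A
    return out

def cap_op_norm(M):
    """Scale M so its operator (spectral) norm is at most 1, as Lemma 1 assumes."""
    return M / max(np.linalg.norm(M, 2), 1.0)

n, d, dp = 8, 5, 3
A1, A2 = sym_adj(n), sym_adj(n)
X1, X2 = cap_op_norm(rng.normal(size=(n, d))), cap_op_norm(rng.normal(size=(n, d)))
W = cap_op_norm(rng.normal(size=(d, dp)))

relu = lambda M: np.maximum(M, 0.0)          # 1-Lipschitz pointwise nonlinearity
f = lambda A, X: relu(S(A) @ X @ W).mean(axis=0)  # mean readout

# Step (e) with the identity permutation:
# ||f(G1)-f(G2)||_2 <= ||S(A1)-S(A2)||_F + max|S(Lambda_2)| * ||X1-X2||_F
lhs = np.linalg.norm(f(A1, X1) - f(A2, X2))
lam2 = np.linalg.eigvalsh(A2)
mfr2 = np.abs(np.polyval([0.1, 1.0, 0.0], lam2)).max()  # S(lambda) over spectrum of A2
rhs = np.linalg.norm(S(A1) - S(A2), "fro") + mfr2 * np.linalg.norm(X1 - X2, "fro")
assert lhs <= rhs
```

Since the assumptions (operator norms capped at 1, 1-Lipschitz nonlinearity, contractive mean readout) hold exactly, the assertion holds for any random draw, not just this seed.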

D DATASET STATISTICS

For PPI data, we construct the graph with nodes as protein sequences (using protein language modeling (Tay et al., 2020; Karimi et al., 2019) for node features) and edges as computational interactions from the neighborhood, fusion, and co-occurrence channels (Szklarczyk et al., 2021) (i.e. the edge feature dimension is 3). Dataset statistics for PPI and citation networks are shown in Figures 5 and 6, respectively. For the species involved in the PPI networks, we depict their relationships in biological taxonomic ranks in Figure 4, showing the conceptual domain gap between species.

E DETAILED CONFIGURATIONS

The assayed data are released under the MIT license and, to the best of our knowledge, contain no privacy-infringing contents. Experiments are distributed on computer clusters with Tesla K80 GPUs (11 GB memory) and NVIDIA A100 GPUs (40 GB memory).

E.1 WASSERSTEIN-1 DISTANCE ESTIMATION

We follow the routine procedure of (Shen et al., 2018) to estimate the Wasserstein-1 distance between distributions of graph representations in adversarial training (3). Specifically, given the encoder $f$ and the source and target data distributions $P_S(G), P_T(G)$, we estimate the distance as:
$$\hat{W}_1\!\left(P_S(G), P_T(G)\right) = \max_{\|\tilde{f}\|_{Lip} \le 1} \mathbb{E}_{P_S(G)} \tilde{f}(f(G)) - \mathbb{E}_{P_T(G)} \tilde{f}(f(G)),$$
where the critic function $\tilde{f}$ is instantiated with a multilayer perceptron and satisfies $\|\tilde{f}\|_{Lip} = \sup \frac{\|\tilde{f}(x) - \tilde{f}(y)\|_2}{\|x - y\|_2} \le 1$, which is achieved by enforcing the gradient penalty $(\|\nabla_x \tilde{f}(x)\|_2 - 1)^2$. The final estimate is thus implemented as:
$$\hat{W}_1\!\left(P_S(G), P_T(G)\right) = \max_{\tilde{f}} \mathbb{E}_{P_S(G)} \tilde{f}(f(G)) - \mathbb{E}_{P_T(G)} \tilde{f}(f(G)) - \gamma \left(\|\nabla_x \tilde{f}(x)\|_2 - 1\right)^2,$$
where we follow (Shen et al., 2018; Gulrajani et al., 2017) in setting $\gamma = 10$ by default.
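The dual-form estimator above requires adversarially training the critic. As a quick sanity check of what it approximates: in one dimension, the Wasserstein-1 distance between two equal-size empirical samples has a closed form (the mean absolute difference of sorted samples). The snippet below illustrates that closed form; it is not the paper's adversarial estimator:

```python
import numpy as np

def w1_1d(x, y):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.
    In 1-D, W1 equals the mean absolute difference of the sorted samples."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)
# Shifting a distribution by a constant c yields W1 = c, as transport theory predicts.
print(w1_1d(x, x + 1.0))  # ~1.0
```

This closed form is handy for unit-testing an adversarial W1 estimator: a well-trained critic on 1-D embeddings should approach the sorted-sample value from below.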

E.2 PREDICTING PROTEIN-PROTEIN INTERACTIONS ACROSS VARIOUS SPECIES

When collecting PPI data from the STRING database (Szklarczyk et al., 2021), we discard interactions in the channels of neighborhood transferred, co-expression transferred, and experiments transferred, to prevent information leakage across species (especially from the supervision of co-expression and physical-interaction prediction). For co-expression and physical interactions, we use the high-quality threshold of 0.7 (Szklarczyk et al., 2021) to convert them into binary labels. We use the Sinkhorn Transformer (Tay et al., 2020) (a Transformer variant with sparse attention) with depth 4, 4 attention heads, and bucket size 32 to embed protein sequences, and further adopt GIN (Xu et al., 2018) with depth 3 and MLP depth 2 to perform message passing across proteins. We also try HRNN (Karimi et al., 2019; 2020a) with k-mer 75 to embed protein sequences and GAT (Veličković et al., 2017) with depth 3 and 8 attention heads to perform message passing; the comparison on human-to-yeast transfer in Table 7 shows that, in our case, the (Sinkhorn) Transformer and GIN outperform HRNN and GAT, respectively. In training, we hold out 20% of human PPIs for validation. For the semi-supervised setting, 0.1% of experimental PPIs are additionally available during training. We train for 500 epochs, with convergence assured, using learning rate 0.0001, hidden dimension 256, and batch size 128 (batches sampled by random walk), optimized with the Adam optimizer. For domain-invariant representation learning guided by the Wasserstein distance (as described in optimization (3)), please refer to (Shen et al., 2018) for the implementation, and see (Ganin et al., 2016) for the variant with domain classifiers. We select the trade-off factor γ in optimization (3) from {1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6} through validation.
In evaluation, we follow (Cho et al., 2016) to perform negative sampling for experimental PPIs, ensuring that known interactions compose only 5% of the evaluation set, under the assumption that true interactions constitute only a small fraction of all pairs. On baseline implementations: for Mashup (Cho et al., 2016), we apply the official MATLAB code to extract protein representations w.r.t. the computational PPIs in the neighborhood, fusion, and co-occurrence channels, on top of which we train a two-layer MLP with labels; for D-SCRIPT (Sledzieski et al., 2021), we apply the official PyTorch APIs to perform sequence-embedding projection, inter-protein residue contact, and interaction modeling, where we use the same Sinkhorn Transformer as in our methods to generate sequence embeddings rather than the large model pretrained on additional protein sequence and structure data, for fair comparison and complexity considerations; for GraphCL (You et al., 2020a), we follow the main idea and implement it with the node-masking augmentation, randomly replacing 20% of amino acids in each protein sequence with mask tokens.
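The 5% negative-sampling protocol can be sketched as follows; the function and identifiers are illustrative, while the real pipeline operates on STRING protein identifiers:

```python
import random

def build_eval_set(pos_pairs, all_proteins, pos_ratio=0.05, seed=0):
    """Sample random non-interacting pairs so that known interactions
    make up only `pos_ratio` of the evaluation set."""
    rng = random.Random(seed)
    pos = set(map(frozenset, pos_pairs))
    # number of negatives needed so positives are exactly pos_ratio of the total
    n_neg = round(len(pos) * (1.0 - pos_ratio) / pos_ratio)
    neg = set()
    while len(neg) < n_neg:
        u, v = rng.sample(all_proteins, 2)
        if frozenset((u, v)) not in pos:   # never sample a known interaction as negative
            neg.add(frozenset((u, v)))
    return [(tuple(p), 1) for p in pos] + [(tuple(p), 0) for p in neg]

pairs = [("p0", "p1"), ("p2", "p3")]              # toy known interactions
prots = [f"p{i}" for i in range(40)]
ev = build_eval_set(pairs, prots, pos_ratio=0.05)  # 2 positives + 38 negatives
```

With 2 positives and a 5% ratio, the sketch draws 38 negatives, so AUROC/AUPRC are computed on a 40-pair set where positives are the rare class, matching the evaluation assumption above.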

E.3 CLASSIFYING PAPER TOPICS OF DIFFERENT TIME PERIODS

We follow the same experimental setting as UDAGCN (Wu et al., 2020), on which we implement DA-W and spectral regularization. The compared SOTAs without DA techniques are DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), GraphSAGE (Hamilton et al., 2017), DNN (an MLP on node features alone), and GCN (Kipf & Welling, 2016), and those with DA techniques are DGRL (Ganin et al., 2016), AdaGCN (Dai et al., 2019), and UDAGCN (Wu et al., 2020). We note that the results reported in (Wu et al., 2020) are not directly comparable with (Mao et al., 2021) due to different experimental configurations (e.g. validation partition). In applying DA-W, we replace the domain classifier with the Wasserstein-distance critic (Shen et al., 2018) to learn domain-invariant representations, and further implement our proposed spectral regularization on top.

F HOMOPHILY RATIO COMPUTATION PROCEDURE

Given a graph $G = \{V, E\}$ with node set $V$ and edge set $E$ as described in Section 2, denote the labeled edge set as $E'$. To quantify the association between $E$ and $E'$, we borrow the idea from (Zhu et al., 2020a; Pei et al., 2020) and calculate the homophily ratio as:
$$\mathrm{Hom} = \frac{1}{|V|} \sum_{v \in V} \frac{|\{(v, w) : w \in N(v) \text{ and } (v, w) \in E'\}|}{|N(v)|},$$
where $N(v)$ is the set of neighbors of $v$ determined by $E$. In cross-species PPI prediction, $E'$ is the set of experimental (co-expression or physical) PPIs. To measure the association between computational and experimental PPIs, we define $E$ as the set of edges for which any computational PPI (neighborhood, fusion, or co-occurrence) value is greater than the medium-confidence threshold of 0.4 (Szklarczyk et al., 2021). To measure the association between protein sequences and experimental PPIs, we first calculate the sequence identity (Karimi et al., 2020b) between all protein pairs, and define $E$ as the set of edges for which the sequence identity is greater than 0.3 (Pearson, 2013).
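A minimal sketch of the homophily-ratio computation on explicit edge lists follows; the handling of isolated nodes (contributing zero) is our assumption, since the formula leaves it implicit:

```python
def homophily_ratio(nodes, E, E_prime):
    """Hom = (1/|V|) * sum_v |{w in N(v): (v,w) in E'}| / |N(v)|,
    with N(v) determined by E; isolated nodes contribute 0 (assumed convention)."""
    undirected = lambda edges: {frozenset(e) for e in edges}
    E, E_prime = undirected(E), undirected(E_prime)
    total = 0.0
    for v in nodes:
        nbrs = {next(iter(e - {v})) for e in E if v in e}  # neighbors of v under E
        if nbrs:
            labeled = sum(1 for w in nbrs if frozenset((v, w)) in E_prime)
            total += labeled / len(nbrs)
    return total / len(nodes)

# toy example: a path graph where 2 of 3 edges are also labeled edges
nodes = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3)]
Ep = [(0, 1), (2, 3)]
hom = homophily_ratio(nodes, E, Ep)
print(hom)  # 0.75
```

On the toy path graph, nodes 0 and 3 contribute 1 each and nodes 1 and 2 contribute 1/2 each, giving (1 + 0.5 + 0.5 + 1) / 4 = 0.75.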

G MORE EXPERIMENTAL RESULTS

Results of semi-supervised cross-species protein-protein interaction prediction are shown in Tables 8 and 9; semi-supervised paper topic classification in Table 10; and ablation studies of model architectures and adversarial training strategies in Tables 11 and 12. Table 8: Semi-supervised transfer of cross-species protein-protein co-expression interaction prediction. Numbers in red are the top-2 AUROC (AUPRC) (mean±std%). DA-C and DA-W denote domain-invariant representation learning with domain classifiers (Ganin et al., 2016) and guided by the Wasserstein distance as in optimization (3) (Redko et al., 2017; Shen et al., 2018), respectively.

H COMPUTATIONAL COST

Time. The added computation consists of eigenvalue decomposition (EVD) and neural network propagation (complexity O(N) (You et al., 2020c)). Thus, in theory, the former step dominates the time complexity. In practice, when training on large networks, mini-batch training is commonly adopted (Hamilton et al., 2017; Zeng et al., 2019) to avoid the computational burden and, empirically, for better generalizability (Hamilton et al., 2017), where the number of nodes in each batch is restricted to be much smaller than the whole network. In our experiments, the batch size is set to 128, incurring a cost of ≤ 3 seconds per epoch (see Table 13), well below the cost brought by adversarial training (20 seconds, see Table 13). We also provide the time consumption for full training in Table 15, including the best-performing SOTA GraphCL, whose time consumption was around 2 times higher in pretraining and 2 times lower in finetuning compared with training from scratch; in any case, training from scratch did not take longer than 10 hours. Table 14 further demonstrates that the running time of EVD is acceptable for batch sizes up to 2048. Nevertheless, if full-batch training is believed to bring indispensable benefits, one can always perform the full-batch EVD only once before training and reuse the results thereafter.
Moreover, we would like to clarify that the spectral properties of graph filters in theory do not have to depend on the inputs, sampled graphs or not (an analogy in signal processing: a low-pass system, where "low-pass" is the spectral property, suppresses high frequencies for any input signal). In detail, suppose that the graph filter function S(•) with a certain domain D (a polynomial function residing in a GNN, see Lemma 1) is constructed (or regularized) to satisfy (i) spectral smoothness (SS), i.e. |S(λ_i) - S(λ_j)| ≤ C_λ |λ_i - λ_j| for all λ_i, λ_j, and (ii) maximum frequency response (MFR) B, i.e. max |S(λ)| ≤ B. We can see from this statement that the spectral properties of SS and MFR are formulated on eigenvalues, not eigenvectors. In practice, spectral regularization is performed on the spectra of sampled networks (with eigenvalues Λ_B ranging over D_B), and readers might worry whether the regularized properties on D_B still hold on the spectra of nodes' ego-graphs (with eigenvalues Λ_E ranging over D_E), since whether D_E ⊆ D_B is unclear. According to the eigenvalue interlacing theorem (Haemers, 1995), since the adjacency matrices of ego-graphs are principal submatrices of those of sampled networks, we have min(Λ_B) ≤ min(Λ_E) ≤ max(Λ_E) ≤ max(Λ_B), which essentially leads to D_E ⊆ D_B. Therefore, spectral properties regularized on the spectra D_B (in practice) are preserved on the spectra D_E (in theory). Memory. Beyond time complexity, memory consumption results from data processing (all compared methods use the same amount of data) and model propagation. Compared to the best-performing SOTA GraphCL, memory consumption was similar between GraphCL and Transformer+GIN(+DA-W+SpecReg), since GraphCL also uses the same Transformer+GIN backbone architecture to extract protein representations (a Transformer to embed protein sequences as vertex features, followed by GIN to conduct message passing along the topology; please refer to Appendix E for details). The GPU memory taken by Transformer+GIN was around 24 GB, within the capacity of a conventional NVIDIA A100 GPU (40 GB memory).
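The interlacing argument above can be verified numerically: the spectrum of any principal submatrix (an ego-graph) of a symmetric adjacency matrix lies within the extremal eigenvalues of the full (batch) matrix. A small numpy check under illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 5

# symmetric adjacency matrix of a sampled (batch) network
A = (rng.random((n, n)) < 0.3)
A = np.triu(A, 1)
A = (A + A.T).astype(float)

idx = rng.choice(n, size=k, replace=False)   # nodes of an ego-graph
A_sub = A[np.ix_(idx, idx)]                  # principal submatrix

lam_B = np.linalg.eigvalsh(A)      # spectrum of the batch graph (Lambda_B)
lam_E = np.linalg.eigvalsh(A_sub)  # spectrum of the ego-graph (Lambda_E)

# Cauchy interlacing: the submatrix spectrum lies inside [min, max] of the full one
assert lam_B.min() <= lam_E.min() + 1e-9
assert lam_E.max() <= lam_B.max() + 1e-9
```

Because interlacing holds for every principal submatrix of a symmetric matrix, the check passes for any node subset, so SS/MFR constraints enforced on D_B automatically cover ego-graph spectra.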
The improved AUPRC of PPI prediction is significant with regard to such affordable computational resources. Computational resources are usually not the bottleneck for in-silico methods (in our case, an NVIDIA A100 GPU and less than 10 hours), whereas detecting PPIs (usually highly imbalanced (Rao et al., 2014)) via wet laboratories is very costly, requiring expensive reagents for weeks or months. For instance, the routine yeast two-hybrid system requires steps including building vectors, transforming plasmids, cell culture, and luciferase assays (Brückner et al., 2009), where accurate predictions could play a critical role in accelerating the process.

I LEMMA 4: LIPSCHITZ CONSTANT OF A TWO-LAYER GNN

Lemma 4. Suppose that $\mathcal{G}$ is the set of graphs of size $N_G$ after padding with isolated nodes, similar to (Zhu et al., 2021). Following the setting of Lemmas 1 and 2, a GNN layer is constructed as $f^{(l)}(G) = \sigma(S^{(l)}(A) X W^{(l)})$, and a two-layer GNN as $f(G) = r \circ f^{(2)} \circ f^{(1)}(G)$, where $r$ is the mean/sum/max readout function that pools node representations. Denote the SS and MFR terms of $f^{(l)}$ as $t^{(l)}_{SS} = C^{(l)}_\lambda K_1 + \varepsilon K_2$ and $t^{(l)}_{MFR} = |S^{(l)}(\lambda^*)|$, so that $C^{(l)}_f = \max\{t^{(l)}_{SS}, t^{(l)}_{MFR}\}$, $l \in \{1, 2\}$, per Lemma 2. We can then calculate the Lipschitz constant of the two-layer GNN as:
$$C_f = \max\{\, t^{(2)}_{SS} + t^{(2)}_{MFR} \, t^{(1)}_{SS}, \;\; t^{(2)}_{MFR} \, t^{(1)}_{MFR} \,\}.$$

Proof. The key step is to recognize that the input of the 2nd GNN layer is a graph with the same adjacency matrix $A$ but a different node feature matrix $X^{(1)} = f^{(1)}(G)$. Denote the optimal permutation matrix for $G_1, G_2$ as $P^*$; we compute the difference of the GNN outputs:
$$\begin{aligned}
\|f(G_1) - f(G_2)\|_2 &= \|(r \circ f^{(2)}) \circ f^{(1)}(G_1) - (r \circ f^{(2)}) \circ f^{(1)}(G_2)\|_2 \\
&\overset{(a)}{\le} C^{(2)}_\lambda (1 + \tau \sqrt{N_G}) \|A_1 - P^* A_2 P^{*T}\|_F + O(\|A_1 - P^* A_2 P^{*T}\|_F^2) + \max(|S^{(2)}(\Lambda_2)|) \|X^{(1)}_1 - P^* X^{(1)}_2\|_F \\
&\overset{(b)}{\le} C^{(2)}_\lambda (1 + \tau \sqrt{N_G}) \|A_1 - P^* A_2 P^{*T}\|_F + O(\|A_1 - P^* A_2 P^{*T}\|_F^2) \\
&\quad + \max(|S^{(2)}(\Lambda_2)|) \left[ C^{(1)}_\lambda (1 + \tau \sqrt{N_G}) \|A_1 - P^* A_2 P^{*T}\|_F + O(\|A_1 - P^* A_2 P^{*T}\|_F^2) + \max(|S^{(1)}(\Lambda_2)|) \|X_1 - P^* X_2\|_F \right] \\
&= \left[ C^{(2)}_\lambda (1 + \tau \sqrt{N_G}) + \max(|S^{(2)}(\Lambda_2)|) C^{(1)}_\lambda (1 + \tau \sqrt{N_G}) \right] \|A_1 - P^* A_2 P^{*T}\|_F \\
&\quad + \left[ 1 + \max(|S^{(2)}(\Lambda_2)|) \right] O(\|A_1 - P^* A_2 P^{*T}\|_F^2) + \max(|S^{(2)}(\Lambda_2)|) \max(|S^{(1)}(\Lambda_2)|) \|X_1 - P^* X_2\|_F,
\end{aligned}$$
where (a) and (b) reuse the inequalities (a)-(f) in Lemma 1. Next, following the same spirit as Lemma 2, to calculate the Lipschitz constant of $f$, we assure the inequality:
$$\left[ C^{(2)}_\lambda (1 + \tau \sqrt{N_G}) + |S^{(2)}(\lambda^*)| C^{(1)}_\lambda (1 + \tau \sqrt{N_G}) \right] \|A_1 - P^* A_2 P^{*T}\|_F + \left[ 1 + |S^{(2)}(\lambda^*)| \right] O(\|A_1 - P^* A_2 P^{*T}\|_F^2) + |S^{(2)}(\lambda^*) S^{(1)}(\lambda^*)| \|X_1 - P^* X_2\|_F \le C_f \, \eta(G_1, G_2),$$
that is:
$$\left[ C^{(2)}_\lambda (1 + \tau \sqrt{N_G}) + |S^{(2)}(\lambda^*)| C^{(1)}_\lambda (1 + \tau \sqrt{N_G}) - C_f \right] \|A_1 - P^* A_2 P^{*T}\|_F + \left[ 1 + |S^{(2)}(\lambda^*)| \right] O(\|A_1 - P^* A_2 P^{*T}\|_F^2) \le 0, \qquad \left[ |S^{(2)}(\lambda^*) S^{(1)}(\lambda^*)| - C_f \right] \|X_1 - P^* X_2\|_F \le 0,$$
which is equivalent to:
$$C_f \ge (C^{(2)}_\lambda K_1 + \varepsilon K_2) + |S^{(2)}(\lambda^*)| (C^{(1)}_\lambda K_1 + \varepsilon K_2), \qquad C_f \ge |S^{(2)}(\lambda^*) S^{(1)}(\lambda^*)|.$$
Letting $C_f$ take the larger of the two values, we complete the proof. □
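Given the per-layer SS and MFR terms, the composed Lipschitz constant of Lemma 4 is a simple arithmetic combination. A sketch with hypothetical layer constants (the numbers are illustrative, not measured from any trained model):

```python
# Hypothetical per-layer terms from Lemma 2: t_SS = C_lambda*K1 + eps*K2, t_MFR = |S(lambda*)|
t_ss = {1: 0.8, 2: 0.5}    # spectral-smoothness terms of layers 1 and 2
t_mfr = {1: 1.2, 2: 0.9}   # maximum-frequency-response terms of layers 1 and 2

# Lemma 4: Lipschitz constant of the two-layer GNN, taking the max over the
# edge-perturbation pathway and the feature-perturbation pathway.
C_f = max(t_ss[2] + t_mfr[2] * t_ss[1],   # layer-2 SS plus layer-1 SS amplified by layer-2 MFR
          t_mfr[2] * t_mfr[1])            # MFRs multiply through the composition

print(C_f)  # 0.5 + 0.9*0.8 = 1.22 > 0.9*1.2 = 1.08
```

The first pathway shows why regularizing SS of either layer tightens the edge-perturbation term, while MFR enters both pathways multiplicatively.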



Figure 1: An overview of spectral regularization. In implementation, we regularize the difference between the numerators and the denominators multiplied by a certain threshold, to avoid the numerical instability of division.
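Following the caption, a hedged numpy sketch of such a "subtraction-form" spectral-smoothness penalty (the threshold υ and the use of all eigenvalue pairs are illustrative assumptions, not the exact released implementation):

```python
import numpy as np

def ss_penalty(S_vals, lams, upsilon):
    """Hinge-style spectral-smoothness penalty in the 'numerator minus thresholded
    denominator' form: instead of constraining |S(li)-S(lj)| / |li-lj| <= upsilon
    (unstable when li ~ lj), penalize relu(|S(li)-S(lj)| - upsilon*|li-lj|)
    averaged over eigenvalue pairs."""
    num = np.abs(S_vals[:, None] - S_vals[None, :])   # |S(li) - S(lj)|
    den = np.abs(lams[:, None] - lams[None, :])       # |li - lj|
    return np.maximum(num - upsilon * den, 0.0).mean()

lams = np.array([-1.0, 0.0, 1.0])                     # toy spectrum
pen_smooth = ss_penalty(0.5 * lams, lams, upsilon=1.0)  # slope 0.5 <= 1: zero penalty
pen_steep = ss_penalty(2.0 * lams, lams, upsilon=1.0)   # slope 2 violates the threshold
```

The same subtraction trick applies to the MFR side, e.g. penalizing relu(|S(λ)| - B), so no division appears anywhere in the regularizer.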

Figure 2: Homophily ratio difference between adjacency matrices built with sequence identity (SI) and computational PPI (CPPI). Higher values indicate that protein sequences are more label-preserving.

Figure 3: Spectral regularization performance of different threshold values υ in co-expression and physical interaction prediction (with trade-off factors γ ′ selected via validation).




Figure 4: The relationship among species in biological taxonomic ranks.




Unsupervised transfer of cross-species protein-protein co-expression interaction prediction. Numbers in red are the top-2 AUROC (AUPRC) (mean±std%). DA-C and DA-W denote domain-invariant representation learning with domain classifiers (Ganin et al., 2016) and guided by the Wasserstein distance as in optimization (3) (Redko et al., 2017; Shen et al., 2018), respectively.

Unsupervised transfer of cross-species protein-protein physical interaction prediction.

Unsupervised transfer of paper topic classification in temporally evolved citation networks. Numbers in red are the best accuracies (mean±std%); those without standard deviation are from (Wu et al., 2020).

Paper topic classification on ogbn-arxiv under different label rates. Reported numbers are accuracy (%).

Dataset statistics of PPI networks of different species. # denotes "number of".

Dataset statistics of citation networks from different sources (in different time periods). # denotes "number of".

Comparisons between different protein sequence encoders and GNNs in the human to yeast transfer setting.

Semi-supervised transfer of cross-species protein-protein physical interaction prediction.

Paper topic semi-supervised classification on ogbn-arxiv under different label rates. Reported numbers are accuracy (%).

Unsupervised transfer of cross-species (human to yeast) protein-protein co-expression interaction prediction. Numbers in red are the best performance among the sub-row.

Unsupervised transfer of cross-species (human to fruit fly) protein-protein co-expression interaction prediction.


Running time for one epoch training in unsupervised cross-species (human to yeast) protein-protein interaction prediction.

Running time for eigenvalue decomposition under different batch sizes in unsupervised cross-species (human to yeast) protein-protein interaction prediction.

Running time for full training in unsupervised cross-species (human to yeast) protein-protein interaction prediction.

ACKNOWLEDGEMENT

This project was in part supported by the National Institute of General Medical Sciences (R35GM124952 to Y.S.), the National Science Foundation (CCF-1943008 to Y.S.), and the US Army Research Office Young Investigator Award (W911NF2010240 to Z.W.). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.


Configurations. We train models on PPIs of Homo sapiens (human), which provide abundant PPI evidence (Ewing et al., 2007), and evaluate the models on four other species: Mus musculus (mouse), Danio rerio (zebrafish), Drosophila melanogaster (fruit fly), and Saccharomyces cerevisiae (yeast); each experiment is repeated three times for statistical significance. In evaluation, we follow (Cho et al., 2016) to perform negative sampling for experimental PPIs, ensuring that known interactions compose only 5%. We adopt the Transformer (Tay et al., 2020) (vs. HRNN (Karimi et al., 2019; 2020a)) to embed protein sequences and then GIN (Xu et al., 2018) (vs. GAT (Veličković et al., 2017)) to perform message passing, with comparisons in Appendix E. The compared state-of-the-art (SOTA) approaches are Mashup (Cho et al., 2016) and D-SCRIPT (Sledzieski et al., 2021), designed for PPI prediction, as well as GraphCL (You et al., 2020a) for general graph (self-supervised) representation learning. See Appendix E for details. Hypotheses. We hypothesize that co-expression interactions are more associated with links (than physical interactions are; link transfer), based upon the existing findings that two genes are poten-

