DEEP GRAPH-LEVEL ORTHOGONAL HYPERSPHERE COMPRESSION FOR ANOMALY DETECTION Anonymous

Abstract

Graph-level anomaly detection aims to identify abnormal samples within a set of graphs in an unsupervised manner. It is non-trivial to find a reasonable decision boundary between normal and anomalous data without using any anomalous data during training, especially for graph data. This paper first proposes a novel deep graph-level anomaly detection model, which learns the graph representation by maximizing the mutual information between substructure features and global structure features while exploring a hypersphere anomaly decision boundary. We implement an orthogonal projection layer to keep the training data distribution consistent with the decision hypersphere, thus avoiding erroneous evaluations. More importantly, we further propose projecting the normal data into the interval region between two co-centered hyperspheres, which makes the normal data distribution more compact and effectively overcomes the issue of outliers falling close to the center of the hypersphere. The numerical and visualization results on several graph datasets demonstrate the effectiveness and superiority of our methods in comparison to many baselines and state-of-the-art methods.

1. INTRODUCTION

Anomaly detection is an essential task with various applications, such as detecting abnormal patterns or actions in credit-card fraud, medical diagnosis, and sudden natural disasters (Aggarwal, 2017). Usually, in anomaly detection, the training data contain only normal data and are used to train a model that can distinguish normal patterns from abnormal ones. Anomaly detection on tabular data and images has been extensively studied recently (Ruff et al., 2018; Goyal et al., 2020; Chen et al., 2022; Liznerski et al., 2021; Sohn et al., 2021). In contrast, there is little work on graph data, despite the fact that graph anomaly detection is very useful in various problems, such as identifying abnormal communities in social networks or detecting unusual protein structures in biology experiments. Compared with other types of data, graph data is inherently complicated and rich in structural and relational information. The complexity of graph structure allows us to learn graph-level representations with discriminative patterns in many supervised tasks (e.g., graph classification). For graph-level anomaly detection, however, the intricate graph structure brings many obstacles to this unsupervised learning problem. Graph anomaly detection usually comprises four families: anomalous edge (Ouyang et al., 2020; Xu et al., 2020), node (Zhu & Zhu, 2020; Bojchevski & Günnemann, 2018), sub-graph (Wang et al., 2018; Zheng et al., 2018), and graph-level detection (Zheng et al., 2019; Chalapathy et al., 2018). Herein, the target of graph-level algorithms is to learn a regular group pattern and distinguish abnormal manifestations of the group. Group abnormal behaviors usually foreshadow unusual events and thus play an important role in practical applications. In the past five years, few approaches have focused on graph-level anomaly detection because of the difficulty of representing graphs as feature vectors without using any label information.
Graph kernels can measure the similarity between graphs, and the result can be regarded, non-strictly or implicitly, as a representation. Based on this, graph anomaly detection is usually performed in two stages. In our experiments (see Section 4), we also find that a one-class SVM with graph kernels sometimes yields unsatisfactory performance, since graph kernels may not be effective enough to quantify the similarity between graphs. Hence, to the best of our knowledge, there is still large room for improvement in graph anomaly detection. Concerning end-to-end models, Ma et al. (2022) proposed a global and local knowledge distillation method for graph-level anomaly detection, which learns rich global and local normal-pattern information by random joint distillation of graph and node representations. The method needs to train two GCNs jointly at a high time cost. Zhao & Akoglu (2021) combined the Deep SVDD objective function and the graph isomorphism network to learn a hypersphere of normal samples. Qiu et al. (2022) also sought a hypersphere decision boundary and optimized the representations learned by k GNNs to be close to that of a reference GNN while maximizing the differences among the k GNNs, but did not consider the relationship between the graph-level representation and node features. Surveying all approaches based on the hypersphere assumption in graph anomaly detection, we find that the practical decision region may be an ellipsoid instead of a standard hypersphere, which causes errors when the standard hypersphere evaluation is employed. Moreover, our experiments also confirm that anomalous data may appear in decision regions that are not filled with normal data, especially near the center of the hypersphere. To effectively learn a better representation without label information and obtain a more suitable decision boundary with high efficiency, in this paper we propose a one-class deep graph-level anomaly detection method and its improved version.
The first proposed model, Deep Orthogonal Hypersphere Contraction (DOHSC), uses the mutual information between local feature maps and the global representation to learn a high-quality representation and simultaneously optimizes it to be distributed within a hypersphere area. An orthogonal projection layer then renders the decision region more hyperspherical and compact, decreasing evaluation errors. To address the phenomenon of anomalous data falling close to the hyperspherical center, an improved architecture, graph-level Deep Orthogonal Bi-Hypersphere Compression (DO2HSC), is proposed for anomaly detection. From a cross-sectional point of view, DO2HSC limits the decision area (of normal data) to the interval enclosed by two co-centered hyperspheres and learns the orthogonally projected representation similarly. The frameworks of the methods mentioned above are shown in Figure 1. Furthermore, we define a new evaluation criterion for DO2HSC, and comprehensive experimental results verify the effectiveness of all proposed methods. In summary, the main contributions of our work are listed as follows. • First, we present a new graph-level hypersphere contraction algorithm for anomaly detection tasks, which is jointly trained via a mutual information loss between local and global representations and a hypersphere decision loss. • Second, we impose an orthogonal projection layer on the proposed model to promote a training data distribution close to the standard hypersphere, thus avoiding errors arising from inconsistencies between the assessment criterion and actual conditions. • Finally, we propose an improved graph-level deep orthogonal bi-hypersphere compression model to further explore a decision region enclosed by two co-centered hyperspheres, which effectively prevents anomalous data from falling close to the hyperspherical center and surpasses baselines significantly in the experiments.

2. PROPOSED APPROACH

In this section, we first introduce a joint learning architecture in detail, named Graph-Level Deep Orthogonal Hypersphere Contraction. Then an improved algorithm is proposed to compensate for the underlying assumption's deficiency.

2.1. GRAPH-LEVEL DEEP ORTHOGONAL HYPERSPHERE CONTRACTION

2.1.1. VANILLA MODEL

Given a set of graphs $\mathcal{G} = \{G_1, \ldots, G_N\}$ with $N$ samples, the proposed model aims to learn a $k$-dimensional representation and then set a soft boundary according to it. In this paper, the Graph Isomorphism Network (GIN) (Xu et al., 2019) is employed to obtain the graph representation in three stages: first, input the graph data and aggregate the neighbors of the current node (AGGREGATE); second, combine neighbor and current node features (CONCAT); finally, integrate all node information (READOUT) into one global representation. Mathematically, the $i$-th node's features over $L$ layers and the global features of its affiliated $j$-th graph are denoted as
$$z_i^{\Phi} = \mathrm{CONCAT}\big(\{z_i^{(l)}\}_{l=1}^{L}\big), \qquad Z_{\Phi}(G_j) = \mathrm{READOUT}\big(\{z_i^{\Phi}\}_{i=1}^{|G_j|}\big),$$
where $z_i^{\Phi} \in \mathbb{R}^{1\times k}$ and $Z_{\Phi}(G_j) \in \mathbb{R}^{1\times k}$. To integrate the contained information and enhance the differentiation between node-level and global-level representations, we append additional fully connected layers denoted as $M_{\Upsilon}(\cdot)$ and $T_{\Psi}(\cdot)$, respectively, where $\Upsilon$ and $\Psi$ are the parameters of the added layers. The integrated node-level and graph-level representations are then obtained via
$$h_i^{\Phi,\Upsilon} := M_{\Upsilon}(z_i^{\Phi}), \qquad H_{\Phi,\Psi}(G_j) := T_{\Psi}\big(Z_{\Phi}(G_j)\big).$$
To better capture the local information, we utilize the batch optimization property of neural networks to maximize the mutual information (MI) between local and global representations in each batch $\mathbf{G} \subseteq \mathcal{G}$, which is defined by Sun et al. (2020) as the following term:
$$\Phi, \Psi, \Upsilon = \arg\max_{\Phi,\Psi,\Upsilon} I_{\Phi,\Psi,\Upsilon}\big(h_{\Phi,\Upsilon}, H_{\Phi,\Psi}(\mathbf{G})\big).$$
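As a concrete illustration of the three stages above, here is a minimal NumPy sketch with random placeholder weights. The function `gin_like_embed`, the toy path graph, and the sum READOUT are illustrative assumptions for exposition, not the paper's trained GIN.

```python
import numpy as np

def gin_like_embed(A, X, n_layers=2, seed=0):
    """Toy sketch of the three-stage pipeline: AGGREGATE neighbor features,
    CONCAT the per-layer node features, and READOUT (sum) a graph vector.
    Weights are random placeholders, not a trained GIN."""
    rng = np.random.default_rng(seed)
    n, f = X.shape
    layers = []
    H = X
    for _ in range(n_layers):
        W = rng.standard_normal((H.shape[1], f)) / np.sqrt(H.shape[1])
        # AGGREGATE: self features plus neighbor sum, then a linear map + ReLU
        H = np.maximum((A @ H + H) @ W, 0.0)
        layers.append(H)
    Z_nodes = np.concatenate(layers, axis=1)  # CONCAT over layers -> z_i^Phi
    Z_graph = Z_nodes.sum(axis=0)             # READOUT -> Z_Phi(G)
    return Z_nodes, Z_graph

# a 4-node toy path graph with 3-dimensional one-hot-ish node features
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
X = np.eye(4, 3)
z_nodes, z_graph = gin_like_embed(A, X)
```

Each node ends up with a concatenation of its per-layer features, and the graph vector is their sum, mirroring the CONCAT/READOUT split in the equations above.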
Specifically, the mutual information estimator $I_{\Phi,\Psi,\Upsilon}$ follows the Jensen-Shannon MI estimator (Nowozin et al., 2016) with a positive-negative sampling method:
$$I_{\Phi,\Psi,\Upsilon}\big(h_{\Phi,\Upsilon}, H_{\Phi,\Psi}(\mathbf{G})\big) := \sum_{G_j \in \mathbf{G}} \frac{1}{|G_j|} \sum_{u \in G_j} I_{\Phi,\Psi,\Upsilon}\big(h^u_{\Phi,\Upsilon}(G_j), H_{\Phi,\Psi}(\mathbf{G})\big) = \sum_{G_j \in \mathbf{G}} \frac{1}{|G_j|} \sum_{u \in G_j} \Big[ \mathbb{E}\big[-\sigma\big(-h^u_{\Phi,\Upsilon}(x^+) \times H_{\Phi,\Psi}(x)\big)\big] - \mathbb{E}\big[\sigma\big(h^u_{\Phi,\Upsilon}(x^-) \times H_{\Phi,\Psi}(x)\big)\big] \Big],$$
where the softplus function $\sigma(z) = \log(1 + e^z)$ is applied after the vector multiplication between node and graph representations. For an input sample graph $x$, we calculate the expected mutual information with its positive samples $x^+$ and negative samples $x^-$, which are generated from the distribution across all graphs in a subset. Given each $G = (V_G, E_G)$ with node set $V_G = \{v_i\}_{i=1}^{|G|}$, the positive and negative samples are divided in this way:
$$x^+ = \begin{cases} x_{ij}, & \text{if } v_i \in G_j, \\ 0, & \text{otherwise}, \end{cases}$$
and $x^-$ takes the opposite result in each of the conditions above. In the next step, a data-enclosing decision boundary is required for our anomaly detection task. Under the assumption that most normal data lie within a hypersphere, the center of this decision boundary is initialized as $c = \frac{1}{N}\sum_{i=1}^{N} H_{\Phi,\Psi}(G_i)$. With this center, we optimize the learned representations of normal data to be distributed as close to it as possible, so that unexpected anomalous data falling outside this hypersphere can be detected. Besides, a regularization term is adopted to avoid over-fitting. Collectively denoting the weight parameters of $\Phi$, $\Psi$ and $\Upsilon$ as $\mathcal{Q} := \Phi \cup \Psi \cup \Upsilon$, we formulate the training loss with two joint objectives, Hypersphere Contraction and MI:
$$\min_{\Phi,\Psi,\Upsilon} \frac{1}{|\mathbf{G}|} \sum_{i=1}^{|\mathbf{G}|} \|H_{\Phi,\Psi}(G_i) - c\|^2 + \lambda I_{\Phi,\Psi,\Upsilon}\big(h_{\Phi,\Upsilon}, H_{\Phi,\Psi}(\mathbf{G})\big) + \frac{\mu}{2} \sum_{Q \in \mathcal{Q}} \|Q\|_F^2, \tag{7}$$
where $|\mathbf{G}|$ denotes the number of graphs in batch $\mathbf{G}$, $\lambda$ is a trade-off factor, and the third term is a network weight-decay regularizer with hyperparameter $\mu$.
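The joint objective can be sketched numerically as follows. `dohsc_loss` is a hypothetical helper: the pairing of node embeddings against a single graph summary is a simplification of the batch sampling, and the sign convention (subtracting the MI estimate so that minimizing the total loss maximizes MI) is our assumption about how the trade-off term is wired.

```python
import numpy as np

def softplus(z):
    # numerically stable softplus: log(1 + e^z)
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def dohsc_loss(H, h_pos, h_neg, c, lam=0.1, mu=1e-3, weights=()):
    """Sketch of the joint objective: hypersphere contraction + weight decay,
    minus a Jensen-Shannon-style MI estimate. H: (B, k) graph embeddings;
    h_pos/h_neg: node embeddings scored against the first graph's summary
    vector (a simplification of the paper's positive/negative sampling)."""
    contraction = np.mean(np.sum((H - c) ** 2, axis=1))
    # JSD-style scores: -sp(-T(x+, H)) for positives, sp(T(x-, H)) for negatives
    pos = -softplus(-(h_pos @ H[0])).mean()
    neg = softplus(h_neg @ H[0]).mean()
    mi_estimate = pos - neg
    decay = 0.5 * mu * sum(np.sum(W ** 2) for W in weights)
    return contraction - lam * mi_estimate + decay

# toy usage: two graph embeddings, their own nodes as positives
H = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = dohsc_loss(H, H, -H, H.mean(axis=0))
```

In a real model the contraction term pulls embeddings toward `c` via backpropagation; here the function only evaluates the scalar objective.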
Figure 2: After orthogonal projection, the ellipsoid is expected to be transformed into a standard hypersphere, which avoids the evaluation error. Note that the data here are simulated only for illustration.

2.1.2. ORTHOGONAL PROJECTION LAYER

However, an empirical study shows that a hyperellipsoid is commonly observed during deep representation learning. This phenomenon leads to inaccuracies in the final test because the evaluation is based on a hypersphere decision region. Problem (7) obviously cannot guarantee that the soft boundary of the learned representation is a standard hypersphere like Figure 2. Accordingly, we append an orthogonal projection layer after obtaining the global representation. Note that we pursue orthogonal features of the latent representation rather than computing the projection onto the column or row space of $H_{\Phi,\Psi}$. This method is equivalent to performing PCA and using the standardized principal components. Our experiments also justify the necessity of this projection step and standardization process, which will be discussed further in Section 4.4 and Appendix G. Specifically, the projection layer can be formulated as
$$\hat{H}_{\Phi,\Psi,\Theta}(G) = \mathrm{Proj}_{\Theta}\big(H_{\Phi,\Psi}(G)\big) = H_{\Phi,\Psi} W, \quad \text{subject to } \hat{H}_{\Phi,\Psi,\Theta}^{\top} \hat{H}_{\Phi,\Psi,\Theta} = I_{k'}, \tag{8}$$
where $\Theta := \{W \in \mathbb{R}^{k \times k'}\}$ are the projection parameters, $I_{k'}$ denotes an identity matrix, and $k'$ is the projected dimension. Note that to achieve (8), one may consider adding a regularization term $\frac{\alpha}{2}\|\hat{H}_{\Phi,\Psi,\Theta}^{\top}\hat{H}_{\Phi,\Psi,\Theta} - I_{k'}\|_F^2$ with large enough $\alpha$ to the objective, but this is not very effective and introduces one more hyperparameter to tune. Instead, we propose to achieve (8) via singular value decomposition:
$$U \Lambda V^{\top} = H_{\Phi,\Psi}, \qquad W := V_{k'} \Lambda_{k'}^{-1}, \tag{9}$$
where $\Lambda = \mathrm{diag}(\rho_1, \rho_2, \ldots, \rho_{|\mathbf{G}|})$ and $V$ are the diagonal matrix of singular values and the right-singular matrix of $H_{\Phi,\Psi}$, respectively, $V_{k'} := [v_1, \ldots, v_{k'}]$ denotes the first $k'$ right-singular vectors, and $\Lambda_{k'} := \mathrm{diag}(\rho_1, \ldots, \rho_{k'})$. In each forward-propagation epoch, the original weight parameter is replaced by the new matrix $W$ in subsequent loss computations.
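The SVD-based construction of $W$ can be verified in a few lines of NumPy; `orthogonal_project` is an illustrative helper name, not the paper's implementation.

```python
import numpy as np

def orthogonal_project(H, k_prime):
    """Sketch of the orthogonal projection layer: with the SVD
    H = U Lambda V^T, setting W = V_{k'} Lambda_{k'}^{-1} makes the
    projected features satisfy (H W)^T (H W) = I_{k'}."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    W = Vt[:k_prime].T / s[:k_prime]  # V_{k'} Lambda_{k'}^{-1}
    return H @ W, W

# batch of 100 graph embeddings of dimension 16, projected to 8 dims
rng = np.random.default_rng(0)
H = rng.standard_normal((100, 16))
H_proj, W = orthogonal_project(H, 8)
```

Because $HW = U\Lambda V^{\top} V_{k'}\Lambda_{k'}^{-1}$ equals the first $k'$ left-singular vectors, the projected columns are exactly orthonormal, which is the constraint in (8).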

2.1.3. ANOMALY DETECTION

With the orthogonal projection layer attached, the initialization of the center is rewritten as $c = \frac{1}{N}\sum_{i=1}^{N} \hat{H}_{\Phi,\Psi,\Theta}(G_i)$, and the final objective function for anomaly detection in a mini-batch becomes
$$\min_{\Theta,\Phi,\Psi,\Upsilon} \frac{1}{|\mathbf{G}|} \sum_{i=1}^{|\mathbf{G}|} \|\hat{H}_{\Phi,\Psi,\Theta}(G_i) - c\|^2 + \lambda I_{\Phi,\Psi,\Upsilon}\big(h_{\Phi,\Upsilon}, \hat{H}_{\Phi,\Psi,\Theta}(\mathbf{G})\big) + \frac{\mu}{2} \sum_{Q \in \mathcal{Q}} \|Q\|_F^2.$$
After the training stage, a decision boundary $r$ is fixed, calculated as the $(1-\nu)$-percentile of the training-data distance distribution:
$$r = \arg\min_{r} \; \mathbb{P}(D \le r) \ge 1 - \nu,$$
where $D := \{d_i\}_{i=1}^{N}$ follows a sampled distribution $\mathbb{P}$ and $d_i = \|\hat{H}_{\Phi,\Psi,\Theta}(G_i) - c\|$. Accordingly, the anomaly score of the $i$-th instance is defined as
$$s_i = d_i^2 - r^2, \tag{13}$$
where $s = (s_1, s_2, \ldots, s_N)$. Evidently, when the score is positive, the instance is identified as abnormal; otherwise it is considered normal. The detailed procedure is summarized in Algorithm 1 (see Appendix A). It starts with graph representation learning and promotes the training data to approach the center of a hypersphere while adding an orthogonal projection layer. Unfortunately, it can be observed from Figure 3 that anomalous data may appear in regions of the learned decision hypersphere that are not filled by the training data, especially the region close to the center. To handle this situation, an improved graph-level anomaly detection approach, termed Graph-Level Deep Orthogonal Bi-Hypersphere Compression, is designed in the next section.
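The radius selection and scoring rule can be sketched as follows, using the empirical $(1-\nu)$-quantile as the percentile estimator; `fit_radius` and `anomaly_scores` are hypothetical helper names.

```python
import numpy as np

def fit_radius(distances, nu=0.05):
    """Radius as the (1 - nu)-percentile of training distances: the smallest
    r with P(D <= r) >= 1 - nu, estimated by the empirical quantile."""
    return np.quantile(distances, 1.0 - nu)

def anomaly_scores(distances, r):
    """s_i = d_i^2 - r^2: positive => flagged anomalous."""
    return distances ** 2 - r ** 2

# toy training distances and two test points (one inside, one far outside)
d_train = np.abs(np.random.default_rng(1).standard_normal(1000))
r = fit_radius(d_train, nu=0.05)
s = anomaly_scores(np.array([0.0, 3 * r]), r)
```

With $\nu = 0.05$, roughly 95% of the training distances fall inside the fitted radius, and only test points beyond it receive positive scores.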

2.2. GRAPH-LEVEL DEEP ORTHOGONAL BI-HYPERSPHERE COMPRESSION

As Figure 3 suggests, we observed in our empirical results that the learned distribution of training data sometimes does not satisfy the hypersphere assumption: anomalous data may appear within the decision region, leading to suboptimal detection performance. To explore the reason behind this, we examine the counter-intuitive behavior of high-dimensional Gaussian distributions, and the simulation results exhibit the soap-bubble phenomenon, whereby anomalous samples can exist near the center of the learned hypersphere (see Appendix B for more details). Since DOHSC cannot detect anomalies close to the center, we propose an improved approach, which sets the decision boundary as the interval region between two co-centered hyperspheres. This narrows the scope of the decision area and induces normal data to fill the entire interval area as much as possible. After the same graph representation learning stage, we first run the DOHSC model for a few epochs and initialize the outer radius $r_{\max}$ and the inner radius $r_{\min}$ of the interval area according to the $(1-\nu)$- and $\nu$-percentiles of the sample distance distribution, respectively:
$$r_{\max} = \arg\min_{r} \; \mathbb{P}(D \le r) \ge 1 - \nu, \qquad r_{\min} = \arg\min_{r} \; \mathbb{P}(D \le r) \ge \nu.$$
After fixing the decision boundaries $r_{\max}$ and $r_{\min}$, the improved training loss is also set with a trade-off factor $\lambda$, which implicitly emphasizes the importance of the max-min term:
$$\min_{\Theta,\Phi,\Psi,\Upsilon} \frac{1}{|\mathbf{G}|} \sum_{i=1}^{|\mathbf{G}|} \big(\max\{d_i, r_{\max}\} - \min\{d_i, r_{\min}\}\big) + \lambda I_{\Phi,\Psi,\Upsilon}\big(h_{\Phi,\Upsilon}, \hat{H}_{\Phi,\Psi,\Theta}(\mathbf{G})\big) + \frac{\mu}{2} \sum_{Q \in \mathcal{Q}} \|Q\|_F^2. \tag{15}$$
This decision loss has the lower bound $r_{\max} - r_{\min}$ and can be jointly minimized with the mutual information term effectively. Besides, the evaluation criterion for test data also needs to change based on this interval structure.
More specifically, all instances located inside the inner hypersphere or outside the outer hypersphere should be identified as anomalous graphs; only those located in the interval area should be regarded as normal data. Compared with equation 13, we reset the score function to assign positive scores to samples falling outside $[r_{\min}, r_{\max}]$ and negative scores to samples within this range. Accordingly, the scores are calculated by
$$s_i = (d_i - r_{\max}) \cdot (d_i - r_{\min}), \quad i \in \{1, \ldots, N\}. \tag{16}$$
In this way, we can again effectively identify a sample's abnormality by its score. In general, the improved deep graph-level anomaly detection algorithm changes the decision boundary and effectively makes the normal area more compact. Correspondingly, a new practical evaluation is introduced to suit the improved detection scheme. Finally, we summarize the detailed procedure of the optimization in Algorithm 2 (see Appendix A).
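A minimal sketch of the DO2HSC decision loss and score under the definitions above (helper names are illustrative, not from the paper's code):

```python
import numpy as np

def do2hsc_decision_loss(d, r_max, r_min):
    """Per-sample term max(d, r_max) - min(d, r_min), averaged over the batch.
    It equals its lower bound r_max - r_min exactly when r_min <= d <= r_max."""
    return np.mean(np.maximum(d, r_max) - np.minimum(d, r_min))

def do2hsc_scores(d, r_max, r_min):
    """s_i = (d_i - r_max)(d_i - r_min): positive outside [r_min, r_max],
    negative inside the interval."""
    return (d - r_max) * (d - r_min)

# one sample inside the inner sphere, one in the interval, one outside
d = np.array([0.2, 1.0, 3.0])
scores = do2hsc_scores(d, r_max=2.0, r_min=0.5)
```

Note that the score is positive both near the center and far outside, which is exactly how the bi-hypersphere criterion flags the "soap-bubble" anomalies that a single hypersphere misses.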

3. CONNECTION WITH PREVIOUS WORK

Actually, few studies have been undertaken on graph-level anomaly detection (GAD). Existing solutions to GAD tasks can be categorized into two types: two-stage approaches and end-to-end ones. Two-stage GAD methods first transform graphs into graph embeddings by graph neural networks, or into similarities between graphs by graph kernels, and then apply off-the-shelf anomaly detectors such as the local outlier factor (LOF) (Breunig et al., 2000) and the one-class support vector machine (OCSVM) (Schölkopf et al., 1999). The drawback of two-stage methods is that the graph feature extractor and the outlier detector are independent, and some graph kernels produce "hand-crafted" features that are deterministic, leaving little room for improvement. End-to-end approaches overcome this problem by utilizing deep graph learning techniques, such as the graph convolutional network (GCN) (Welling & Kipf, 2016) and the graph isomorphism network (GIN) (Xu et al., 2019). With an anomaly measure as the objective, end-to-end approaches jointly learn an effective graph representation for the GAD task (Zhao & Akoglu, 2021; Qiu et al., 2022; Ma et al., 2022). Zhao & Akoglu (2021) optimized a one-class model based on deep support vector data description (Deep SVDD) as the anomaly measure. Here we clarify the differences between Zhao & Akoglu (2021) and our work. First, our model employs a mutual information loss in the graph learning stage to obtain a graph representation incorporating local and global information, whereas Zhao & Akoglu (2021) directly utilized the readout result of GIN. Second, we impose an orthogonal projection on the learned representation to maintain consistency between the learned decision boundary and the normal data distribution. More importantly, we present a new approach to constructing the decision boundary: it learns two hyperspheres, between which the region accommodates the normal data, hence leaving more space for abnormal data.
As for graph kernel, we summarized the previous work in Appendix C.

4.1. DATASET

In this work, we test our method on six real-world graph datasets, which contain three social network datasets (COLLAB, COX2, and IMDB-Binary) and three bioinformatics datasets (DD, ER_MD, and MUTAG). The details of the datasets are shown in Table 1. We compare our method with the following unsupervised graph-level anomaly detection methods: Random Walk kernel (RW) (Gärtner et al., 2003; Kashima et al., 2003), Shortest Path kernel (SP) (Borgwardt & Kriegel, 2005), Weisfeiler-Lehman Subtree kernel (WL) (Shervashidze et al., 2011), and Neighborhood Hash kernel (NH) (Hido & Kashima, 2009). Besides graph kernels, we also compare four graph-level representation learning methods: Deep One-Class Model with GIN network (OCGIN) (Zhao & Akoglu, 2021), Graph-level embedding Learning via Mutual Information Maximization + Deep SVDD (infoGraph+Deep SVDD) (Sun et al., 2020; Ruff et al., 2018), Global and Local Knowledge Distillation for Graph-level Anomaly Detection (GLocalKD) (Ma et al., 2022), and One-Class Graph Transformation Learning (OCGTL) (Qiu et al., 2022).

4.3. RESULTS

In this section, extensive experimental results are displayed to validate the effectiveness of the proposed models. The averages and standard deviations of the Area Under the Receiver Operating Characteristic Curve (AUC) over ten repetitions of each algorithm are used in the comparisons; a higher AUC represents better performance. Tables 2-4 report the AUC metric and its standard deviations. It can be seen that the proposed methods achieve the best AUC values on almost all datasets compared to the other algorithms. Both approaches outperform other state-of-the-art baselines, and DO2HSC obtains superior performance, achieving AUC values more than 5% higher than other algorithms on many datasets, such as MUTAG, COLLAB Class 1, ER_MD Class 0, and IMDB-Binary Class 1. It is worth mentioning that we outperform infoGraph+Deep SVDD, a degraded version of the proposed models, by a large margin, showing that steering representation learning towards the anomaly detection goal is meaningful and well targeted. The anomaly detection visualization results of DO2HSC are displayed in Figure 4, and those of DOHSC are shown in Appendix F. We draw them by setting the projection dimension to three directly. Results from different perspectives are given to avoid blind spots in the field of vision, demonstrating excellent performance. Hence, it can be concluded that the effect of the improved model is in line with our motivation and shows much potential.
[Excerpt of Table 2 (AUC ± std); the dataset column headers and the first method's name were lost in extraction.]
(method lost)  0.5910 ± 0.0000  0.8397 ± 0.0000  0.7902 ± 0.0000  0.5408 ± 0.0000  0.5760 ± 0.0000
WL 2+OCSVM   0.5051 ± 0.0000  0.7989 ± 0.0000  0.6977 ± 0.0000  0.5736 ± 0.0000  0.4286 ± 0.0000
WL 5+OCSVM   0.5079 ± 0.0000  0.8021 ± 0.0000  0.7884 ± 0.0000  0.5990 ± 0.0000  0.4376 ± 0.0000
WL 8+OCSVM   0.5106 ± 0.0000  0.8035 ± 0.0000  0.7953 ± 0.0000  0.5979 ± 0.0000  0.5057 ± 0.0000
WL 10+OCSVM  0.5122 ± 0.0000  0.8031 ± 0.0000  0.7996 ± 0.0000  0.5937 ± 0.0000  0.5034 ± 0.0000
NH+OCSVM     0.5976 ± 0.0000  0.8054 ± 0.0000  0.6414 ± 0.0000  0.4841 ± 0.0000  0.4717 ± 0. (truncated in source)

4.4. ABLATION STUDY

In this section, we display the ablation study of the orthogonal projection layer on three datasets. From the quantitative comparisons, we conclude that orthogonality positively influences performance on all datasets to some extent. This also well supports the assumption discussed in Section 2.1.2. Beyond the aforementioned content, please see Appendix G for the detailed experiment configurations, the parameter sensitivity and robustness of the proposed models, supplementary visualizations of distance distributions for anomaly detection, and a visualization comparison between the proposed models with and without the orthogonal projection layer, all of which further support our theory and validate its effectiveness.

5. CONCLUSION

This paper has proposed two novel end-to-end graph-level AD methods, DOHSC and DO2HSC, which combine mutual information between node-level and global features for learning graph representations with the power of hypersphere compression. DOHSC and DO2HSC mitigate the possible shortcomings of hypersphere boundary learning by applying an orthogonal projection to the global representation. Furthermore, DO2HSC confines normal data to the interval area between two co-centered hyperspheres, significantly alleviating the soap-bubble issue. Our comprehensive experimental results strongly demonstrate the superiority of DOHSC and DO2HSC on diverse datasets. In future work, we will explore more efficient and expressive graph representations and refine the learning process for decision regions.

A SUPPLEMENTED ALGORITHM

Algorithm 1 summarizes the procedure of DOHSC in detail. It starts with graph representation learning and promotes the training data to approach the center of a hypersphere while adding an orthogonal projection layer. DO2HSC is recapped in Algorithm 2 and also starts with the same graph representation learning. Differently, DOHSC is first run for a few epochs to initialize the decision boundaries, and after that the improved optimization is applied.

B RELATED PROOF OF BI-HYPERSPHERE LEARNING MOTIVATION

The traditional idea of detecting outliers is to inspect the distribution's tails under the ideal assumption that the normal data are Gaussian. Following this assumption, one may argue that an anomalous sample can be distinguished by its large Euclidean distance to the data center ($\ell_2$ norm $\|z - c\|$, where $c$ denotes the centroid), and accordingly the abnormal set is $\{z : \|z - c\| > r\}$ for some decision boundary $r$. However, in high-dimensional space, Gaussian distributions look like soap bubbles, which means the normal data are more likely to lie in a bi-hypersphere region (Vershynin, 2018), such as $\{z : r_{\min} < \|z - c\| < r_{\max}\}$. To better understand this counter-intuitive behavior, let us generate some normal samples $X \sim \mathcal{N}(0, I_d)$, where the data dimension $d$ is in $\{1, 10, 50, 100, 200, 500\}$. As Figure 6 indicates, only the univariate Gaussian has a near-zero mode, whereas the high-dimensional Gaussian distributions leave plenty of off-center space blank. The soap-bubble problem in high-dimensional distributions is well demonstrated in Table 6: the higher the dimension, the greater the quantity of data far away from the center, especially for the 0.01-quantile distance. Thus, we cannot make the sanguine assumption that all of the normal data locate within some radius of a hypersphere (i.e., $\{z : \|z - c\| < r\}$). Using Lemma 1 of Laurent & Massart (2000), we can prove Proposition 1, which matches the values in Table 6: when the dimension is larger, normal data are more likely to lie away from the center.

Proposition 1. Suppose $z_1, z_2, \cdots, z_n$ are sampled from $\mathcal{N}(0, I_d)$ independently. Then for any $z_i$ and all $t \ge 0$, the following inequality holds:
$$\mathbb{P}\left(\|z_i\|^2 \ge d - 2\sqrt{dt}\right) \ge 1 - e^{-t}.$$

Figure 5: Histogram of distances (Euclidean norm) from the center of normal samples under 16-dimensional Gaussian distributions $\mathcal{N}(0, I)$. Three groups of anomalous data are also 16-dimensional and respectively sampled from $\mathcal{N}(\mu_1, \frac{1}{10}I)$, $\mathcal{N}(\mu_2, I)$, and $\mathcal{N}(\mu_3, 5I)$, where the population means $\mu_1, \mu_2, \mu_3$ are randomized within $[0, 1]$ for each dimension.

We also simulate a possible case of outlier detection, in which all data are sampled from 16-dimensional Gaussians with orthogonal covariance: 10,000 normal samples follow $\mathcal{N}(0, I)$, the first group of 1,000 outliers are from $\mathcal{N}(\mu_1, \frac{1}{10}I)$, the second group of 500 outliers are from $\mathcal{N}(\mu_2, I)$, and the last group of 2,000 outliers are from $\mathcal{N}(\mu_3, 5I)$. Figure 5 well exemplifies that abnormal data from another distribution (group-1 outliers) can fall at a small distance from the center of the normal samples.
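The soap-bubble effect and Proposition 1 can be checked empirically in a few lines; the sample sizes and the choice $d = 100$, $t = 1$ here are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
z = rng.standard_normal((10000, d))      # normal samples from N(0, I_d)
norms = np.linalg.norm(z, axis=1)

# norms concentrate around sqrt(d): essentially no mass near the origin
print("min norm:", norms.min(), "median norm:", np.median(norms))

# empirical check of Proposition 1 with t = 1: the fraction of samples with
# ||z||^2 >= d - 2*sqrt(d*t) should be at least 1 - e^{-1}
t = 1.0
frac = np.mean(norms ** 2 >= d - 2 * np.sqrt(d * t))
print("fraction above bound:", frac)
```

For $d = 100$ the median norm sits near $\sqrt{d} = 10$ and no sample comes close to the origin, illustrating why a single hypersphere centered at $c$ leaves the near-center region empty of normal data.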

C RELATED WORK ON GRAPH KERNEL

To learn with graph-structured data, graph kernels that measure the similarity between graphs have become an established and widely used approach (Kriege et al., 2020). A large body of work has emerged in the past years, including kernels based on neighborhood aggregation techniques and on walks and paths. Shervashidze et al. (2011) introduced the Weisfeiler-Lehman (WL) algorithm, a well-known heuristic for graph isomorphism. In Hido & Kashima (2009), the Neighborhood Hash kernel was introduced, where the neighborhood aggregation function is binary arithmetic. The most influential path-based graph kernel is the shortest-path (SP) kernel by Borgwardt & Kriegel (2005). For walk-based kernels, Gärtner et al. (2003) and Kashima et al. (2003) simultaneously proposed graph kernels based on random walks, which count the number of label sequences along walks that two graphs have in common. These graph kernel methods have the desirable property that they do not rely on an explicit vector representation of the data but access data only via the Gram matrix.
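To make the neighborhood-aggregation idea concrete, here is a minimal, illustrative WL-style sketch. Real WL kernels compress labels with an injective relabeling function rather than Python's built-in `hash`, and `wl_features`/`wl_kernel` are hypothetical helpers, not any library's API.

```python
def wl_features(adj_lists, labels, iterations=2):
    """Minimal sketch of Weisfeiler-Lehman relabeling: each round, a node's
    new label is derived from its own label plus the sorted multiset of its
    neighbors' labels; the feature map counts all labels seen."""
    counts = {}
    cur = list(labels)
    for l in cur:
        counts[l] = counts.get(l, 0) + 1
    for _ in range(iterations):
        cur = [hash((cur[v],) + tuple(sorted(cur[u] for u in adj_lists[v])))
               for v in range(len(cur))]
        for l in cur:
            counts[l] = counts.get(l, 0) + 1
    return counts

def wl_kernel(c1, c2):
    """Linear kernel on label-count feature maps (a dot product)."""
    return sum(v * c2.get(k, 0) for k, v in c1.items())

# triangle vs. path graph, identical initial labels
c_tri = wl_features({0: [1, 2], 1: [0, 2], 2: [0, 1]}, [1, 1, 1])
c_path = wl_features({0: [1], 1: [0, 2], 2: [1]}, [1, 1, 1])
```

The kernel value is highest for structurally identical graphs and drops as the refined labels diverge, which is how such similarities feed a two-stage detector's Gram matrix.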

D EXPERIMENT CONFIGURATION

In this part, the experiment settings are listed for reproducibility. First, each dataset is divided into two parts: training and testing sets. We randomly sample eighty percent of the normal data as the training set; the remaining normal data, together with randomly sampled abnormal data in a one-to-one ratio, form the testing set. With regard to the experiment settings of the compared graph-kernel baselines, we adopted the classical AD method One-Class SVM (OC-SVM) (Schölkopf et al., 2001) and used 10-fold cross-validation to make a fair comparison. All graph kernels are implemented via GraKeL (Siglidis et al., 2020) to extract a kernel matrix, and the OC-SVM from scikit-learn (Pedregosa et al., 2011) is then applied. Specifically, we selected Floyd-Warshall as the SP kernel's algorithm and set lambda to 0.01 for the RW kernel. The WL kernel is sensitive to the number of iterations, so we test four settings with iterations in {2, 5, 8, 10}, denoted WL 2, WL 5, WL 8, and WL 10. For all graph kernels, the outputs are normalized. For infoGraph+Deep SVDD, the first stage runs for 20 epochs, and the second stage pretrains for 50 epochs and trains for 100 epochs. For OCGIN, GLocalKD, and OCGTL, their default or reported parameter settings are adopted to reproduce the experimental results. However, there are some special situations: due to limited devices, relatively large-scale datasets such as DD are tested with a small batch size, which may lead to worse performance for the compared algorithms. Regarding our model DOHSC, we first run 1 epoch in the pretraining stage to initialize the center of the decision boundary and then train the model for 500 epochs. The convergence curves are given in Figure 7 to indicate that the final optimized results are adopted. The improved method DO2HSC also uses a 1-epoch pretraining stage and trains DOHSC for 5 epochs to initialize a suitable center and the decision boundaries r_max and r_min, where the percentile ν of r_max is fixed to 0.05.
After initialization, the model is trained for 500 epochs. For both proposed approaches, the trade-off factor λ is set to 10 to keep the decision loss as the main optimization objective. The dimensions of the GIN hidden layer and the orthogonal projection layer are fixed to 16 and 8, respectively. For the backbone network, a 4-layer GIN and a 3-layer fully connected neural network are adopted. Besides, the averages and standard deviations of the Area Under the Receiver Operating Characteristic Curve (AUC) over ten repetitions of each algorithm are reported; a higher AUC represents better performance. When calculating the AUC of graph-kernel baselines, we estimated the radius of the hypersphere as the 99th percentile of all squared distances to the separating hyperplane and then computed the score as the difference between the squared distances and the squared radius.
When calculating the AUC of graph-kernel baselines, we estimated the radius of the hypersphere as 99 percentile of all squared distances to the separating hyperplane and then determined the score as the difference between squared distances and its square radius.
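The graph-kernel baseline pipeline described above can be sketched as follows. This is a minimal illustration using scikit-learn's OneClassSVM on a precomputed kernel matrix; the GraKeL kernel computation is stubbed with a random positive semi-definite Gram matrix, and the radius/score construction follows our reading of the text (radius as the 99th percentile of squared distances on training data):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-in for a GraKeL kernel matrix: any symmetric PSD Gram matrix works.
feats = rng.normal(size=(100, 16))
K = feats @ feats.T                      # (n, n) precomputed kernel

n_train = 80                             # 80% of the normal data for training
K_train = K[:n_train, :n_train]
K_test = K[n_train:, :n_train]           # test rows vs. training columns

ocsvm = OneClassSVM(kernel="precomputed", nu=0.1).fit(K_train)

# Signed distance to the separating hyperplane (positive = inside).
dist_train = ocsvm.decision_function(K_train)
dist_test = ocsvm.decision_function(K_test)

# Radius estimated as the 99th percentile of squared training distances;
# anomaly score = squared distance minus squared radius (higher = more anomalous).
r2 = np.percentile(dist_train ** 2, 99)
scores = dist_test ** 2 - r2
```

In practice, `K` would be replaced by the normalized output of a GraKeL kernel (SP, RW, or WL) evaluated on the graph set.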

E DOHSC AND DO2HSC ON NON-GRAPH DATA

Since our DOHSC and DO2HSC can also be applied to non-graph data such as images, we compare them here with several state-of-the-art anomaly detection methods (Ruff et al., 2018; Goyal et al., 2020; Liznerski et al., 2021).
Figure 9 shows the distance distributions of the two-stage method, the proposed model DOHSC, and the improved DO2HSC. Here, the distance is defined as the distance between a sample and the center of the decision hypersphere, and the distance distribution denotes the proportion of samples falling in each distance interval. For the two-stage method, it can be intuitively observed that most instances lie close to the decision boundary because the learned representation is fixed. As mentioned earlier, the jointly trained algorithm mitigates this situation, and the obtained representation brings many instances closer to the center of the sphere. However, as discussed in Section 2, anomalous data may occur in regions with little training data, especially near the center, which is confirmed by (a) and (b) of Figure 9. In contrast, DO2HSC effectively shrinks the decision area, and we find that the number of outliers is clearly reduced thanks to the more compact distribution of the training data.
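The distance distributions discussed above can be computed as follows. This is a generic sketch: `z` stands in for learned embeddings and `c` for the hypersphere center, both of which are illustrative placeholders rather than the model's actual outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 8))            # stand-in for learned embeddings
c = z.mean(axis=0)                       # center of the decision hypersphere

# Distance of each sample to the center.
dist = np.linalg.norm(z - c, axis=1)

# Distance distribution: proportion of samples falling in each distance interval.
counts, edges = np.histogram(dist, bins=20)
proportions = counts / counts.sum()
```

Plotting `proportions` against the bin edges reproduces the kind of histogram shown in Figure 9.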

F.1 PARAMETER SENSITIVITY AND ROBUSTNESS

To demonstrate the stability of our models, we analyze the parameter sensitivity and robustness of DOHSC and DO2HSC, respectively. The projection dimension varies in {4, 8, 16, 32, 64, 128} while the hidden layer dimension of the GIN module ranges from 4 to 128. In Figure 11, DO2HSC exhibits less volatile performance than DOHSC, especially when the training dataset is sampled from COX2 class 0, as Subfigure (d) shows. Noticeably, a higher GIN hidden-layer dimension usually yields a better AUC, since the quality of the learned graph representations improves when the embedding space is large enough. In addition, we assess different aspects of model robustness. More specifically, we report AUC results for two "ratios": 1) different sampling ratios for the training set; 2) different ratios of noise disturbance for the learned representation. In Subfigures (c)-(f), the purple bars regard class 0 as normal data, while the green bars treat class 1 as normal data. Notice that most AUC results rise with a higher ratio of authentic data in the training stage, demonstrating our models' potential in the unsupervised setting. On the other hand, when more noise is blended into the training dataset, the AUC performance of the yellow and blue lines stays stable at a high level, which verifies our models' robustness to contaminated data. The percentile parameter sensitivity is also given in this part. It is worth mentioning that we test DOHSC with the percentile varying in {0.01, 0.1, ..., 0.8} but test DO2HSC only in {0.01, 0.05, 0.1}, because the two radii of DO2HSC are obtained from the percentiles ν and 1 − ν: the two radii coincide when ν = 0.5, and the interval between the two co-centered hyperspheres disappears. From the table, the performance decreases noticeably when a larger percentile is set.
For example, on the MUTAG dataset, setting the percentile to 0.01 is more beneficial for DOHSC than setting it to 0.8, and setting it to 0.01 is better than 0.1 for DO2HSC due to the change of the interval area.
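The role of the percentile ν can be sketched as follows. We assume here that the two radii are taken as the (1 − ν) and ν quantiles of the training distances to the center; this is an illustrative reading of the construction, not the exact initialization of Equation 14:

```python
import numpy as np

def bi_hypersphere_radii(dist, nu=0.05):
    """Assumed construction: r_max / r_min as the (1 - nu) and nu
    quantiles of the training distances to the center."""
    r_max = np.quantile(dist, 1.0 - nu)
    r_min = np.quantile(dist, nu)
    return r_max, r_min

rng = np.random.default_rng(0)
dist = rng.uniform(0.5, 2.0, size=1000)  # stand-in training distances

r_max, r_min = bi_hypersphere_radii(dist, nu=0.05)

# At nu = 0.5 the two radii coincide and the interval region vanishes.
r_eq_max, r_eq_min = bi_hypersphere_radii(dist, nu=0.5)
```

This makes explicit why ν must stay well below 0.5: larger values shrink the interval region between the two co-centered hyperspheres.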

G SUPPLEMENTED RESULTS OF ABLATION STUDY

First, we conduct an ablation study on whether the orthogonal projection needs standardization. To be precise, we pursue orthogonal features, i.e., a projection matrix yielding an orthogonal latent representation (with standardization), rather than computing the projection onto the column or row space of the projection matrix (non-standardization), though the two are closely related. This is equivalent to performing PCA and using the standardized principal components. We therefore compare DOHSC with and without standardization. From Table 10, the performance of DOHSC without standardization is acceptable, and most of its results exceed the two-stage baseline infoGraph+Deep SVDD, verifying the superiority of the end-to-end method over the two-stage baselines. However, the model with standardization outperforms the non-standardized one in almost all cases. Besides, the ablation study on the mutual information maximization loss is shown in Table 11. It can be concluded that the mutual information loss does not always have a positive impact on all data; this also indicates that the anomaly detection optimization and the orthogonal projection we designed are themselves effective, rather than the gains being entirely due to the mutual information loss. To demonstrate the effectiveness of the orthogonal projection layer (OPL), we conduct ablation studies and visualize 3-dimensional results produced with and without the OPL. For each model trained on a particular dataset class, we show the result without OPL on the left side, while the result with OPL is displayed on the right. As Figure 12 illustrates, the OPL drastically improves the distribution of the embeddings, making it spherical rather than elliptical. Similarly, with the help of the OPL, the other embeddings show a more compact and rounded layout.
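The PCA analogy drawn above can be made concrete. The sketch below contrasts a plain projection onto the principal axes (non-standardized) with standardized principal components; it is an illustrative NumPy implementation of the analogy, not the model's actual projection layer:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))  # correlated features

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Non-standardized: projection onto the principal axes (an ellipsoidal cloud).
proj = Xc @ Vt.T

# With standardization: each component rescaled to unit variance,
# yielding orthogonal features with identity covariance (a spherical cloud).
std_pc = proj / proj.std(axis=0, ddof=1)
cov = np.cov(std_pc, rowvar=False)
```

The identity covariance of `std_pc` mirrors the effect shown in Figure 12: standardization turns an elliptical embedding distribution into a spherical one.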



Datasets: https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets
On high-dimensional Gaussian concentration: https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/



Figure 1: Architecture of the proposed models (right top: DOHSC; right bottom: DO2HSC).

Figure 2: Variations in the decision boundary with and without the orthogonal projection layer. In the left subfigure, the real decision region where the training data are distributed may be an ellipsoid (dark grey). This contradicts the hypersphere decision boundary (light grey) set by Optimization 7. After orthogonal projection, the ellipsoid is expected to be transformed into a standard hypersphere, which avoids the evaluation error. Note: the data here are simulated only for illustration.

Figure 3: Illustration of inevitable flaws in DOHSC on both the training and testing data of COX2. Left: the ℓ 2 -norm distribution of 4-dimensional distances learned from the real dataset; Right: the pseudo-layout in two-dimensional space sketched by reference to the empirical distribution.

Figure 4: Visualization results of the DO2HSC on MUTAG Class 0 in different perspectives.

Figure 6: Histogram of distances (Euclidean norm) from the center of 10,000 random samples under (univariate or) high-dimensional Gaussian distributions N (0, I).
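The concentration effect behind Figure 6 (norms of high-dimensional standard Gaussian samples cluster sharply around the square root of the dimension) can be reproduced in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (1, 16, 256):
    x = rng.standard_normal((10_000, d))
    norms = np.linalg.norm(x, axis=1)
    # As d grows, the mean norm approaches sqrt(d) while the relative
    # spread std/mean shrinks: samples concentrate on a thin shell.
    print(d, norms.mean(), norms.std() / norms.mean())
```

This "soap bubble" behavior is why, in high dimensions, the region near the hypersphere center contains almost no normal training data.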

Figure 7: Convergence curves of the proposed models on the MUTAG dataset.

Figure 9: Distance distributions obtained by infoGraph+Deep SVDD, the proposed model, and the improved proposed model on COX2. The first row shows the distance distribution of the training samples with respect to the decision boundary; the second row shows that of the test data.

Figure 10: Visualization results of the DOHSC with MUTAG in different perspectives.

Figure 12: Visualizations on the MUTAG dataset Class 0 (left: with OPL; right: without OPL).

Figure 13: Visualizations on the MUTAG dataset Class 1 (left: with OPL; right: without OPL).

Figure 14: Visualizations on the COX2 dataset Class 0 (left: with OPL; right: without OPL).

Figure 15: Visualizations on the COX2 dataset Class 1 (left: with OPL; right: without OPL).

Description of the six datasets.

Average AUCs with standard deviation (10 trials) of different graph-level anomaly detection algorithms. We assess models by regarding every data class as normal data, respectively. The best results are marked in bold and '-' means out of memory.

Average AUCs with standard deviation (10 trials) of different graph-level anomaly detection algorithms. We assess models by regarding every data class as normal data, respectively. The best results are marked in bold.



Ablation study of the orthogonal projection layer. We test models by regarding every data class as normal data, respectively. The best performance is highlighted in bold.

Algorithm 1 Graph-Level Deep Orthogonal Hypersphere Contraction (DOHSC)
Input: The input graph set G, dimensions of GIN hidden layers k and orthogonal projection layer k′, a trade-off parameter λ and the coefficient of regularization term µ, pretraining epoch T, learning rate η.
Output: The anomaly detection scores s.
1: Initialize the network parameters Φ, Ψ, Υ and the orthogonal projection layer parameter Θ;
2: for t → T do

Algorithm 2 Graph-Level Deep Orthogonal Bi-Hypersphere Compression (DO2HSC)
Input: The input graph set G, dimensions of GIN hidden layers k and orthogonal projection layer k′, a trade-off parameter λ and the coefficient of regularization term µ, pretraining epoch T1, iterations of initializing decision boundaries T2, learning rate η.
Output: The anomaly detection scores s.
1: Initialize the network parameters Φ, Ψ, Υ and the orthogonal projection layer parameter Θ;
2: for t → T1 do
    for each batch G do
Initialize decision boundaries r_max and r_min via Equation 14;
repeat
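The bi-hypersphere objective at the heart of Algorithm 2 can be sketched as follows. This is an assumption-laden illustration: we take the loss to be a hinge-style penalty on embeddings whose distance to the center falls outside the interval [r_min, r_max]; the exact formulation is given by the optimization in the main text, and the function below is not the authors' implementation:

```python
import torch

def bi_hypersphere_loss(z, center, r_min, r_max):
    """Assumed form of the DO2HSC interval objective: penalize embeddings
    whose distance to the center lies outside [r_min, r_max]."""
    d = torch.norm(z - center, dim=1)
    # Zero inside the interval; grows linearly outside either boundary.
    outside = torch.relu(d - r_max) + torch.relu(r_min - d)
    return outside.mean()

z = torch.randn(32, 8)                   # stand-in batch of embeddings
center = torch.zeros(8)
loss = bi_hypersphere_loss(z, center, r_min=0.5, r_max=2.0)
```

Minimizing such a loss compresses the normal data into the interval region between the two co-centered hyperspheres, matching the intuition described for DO2HSC.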

Off-center distance under a multivariate Gaussian at different dimensions and quantiles.

Parameter sensitivity of the proposed methods with different percentiles (all normal data is set to Class 0).

Comparison of the orthogonal projection layer with and without standardization.

Comparison of the loss supervision with and without the mutual information loss (MIL).


The comparison is conducted on Fashion-MNIST, with results reported in Table 7. First, DOHSC and DO2HSC obtain the best AUC on seven of the ten classes in total, and on the remaining three classes the proposed models still achieve comparable performance, with gaps of less than 2%. Second, Deep SVDD is the most relevant baseline for DOHSC, and DOHSC defeats it by a large margin on all classes, which further verifies that the proposed orthogonal projection is meaningful and helpful. In general, bi-hypersphere learning also performs well on common datasets and is very competitive with these state-of-the-art anomaly detection algorithms (Deep SVDD, DROCC, and FCDD). In terms of the average AUC over all classes of the dataset in Table 8, our algorithm outperforms all the compared baselines reproduced above; it is worth mentioning, though, that the reported performance of the IGD (Scratch) algorithm (Chen et al., 2022) is superior to ours by a gap of no more than 2%.

F SUPPLEMENTED VISUALIZATION

This part shows supplementary visualization results for the anomaly detection task. From Section 4, we can see that some DOHSC results are improved considerably by DO2HSC. On most datasets, DO2HSC improves the results by less than 2% compared with DOHSC, but on class 1 of ER_MD it achieves an improvement of more than 20%. The distance distributions of DOHSC and DO2HSC on ER_MD are given in Figure 8 to explain this improvement. In Subfigure (a), anomalous data appear in the distance interval [0, 1], especially in the region close to the center, where little or even none of the normal data is distributed. In contrast, DO2HSC separates normal and anomalous data more distinctly, and anomalous data appear on both sides of the interval, as we assumed before.

