DO NOT TRAIN IT: A LINEAR NEURAL ARCHITECTURE SEARCH OF GRAPH NEURAL NETWORKS

Abstract

Neural architecture search (NAS) for graph neural networks (GNNs), called NAS-GNNs, has achieved superior performance over manually designed GNN architectures. However, these methods inherit issues from conventional NAS methods, such as high computational cost and optimization difficulty. More importantly, previous NAS methods have ignored a unique property of GNNs: non-linearity has only a limited effect. Based on this, we are the first to theoretically prove that a GNN with fixed random weights can obtain optimal outputs under mild conditions. With the randomly initialized weights, we can then seek the optimal architecture parameters via a sparse coding objective and derive a novel NAS-GNN method, namely neural architecture coding (NAC). Consequently, NAC holds a no-update scheme on GNN weights and runs in linear time. Empirical evaluations on multiple GNN benchmark datasets demonstrate that our approach achieves state-of-the-art performance, being up to 200× faster and 18.8% more accurate than strong baselines.

1. INTRODUCTION

Remarkable progress in graph neural networks (GNNs) has boosted research in various domains, such as traffic prediction and recommender systems, as summarized in (Wu et al., 2021). The central paradigm of GNNs is to generate node embeddings through the message-passing mechanism (Hamilton, 2020), which passes, transforms, and aggregates node features across the input graph. Despite its effectiveness, designing GNNs requires laborious effort to choose and tune neural architectures for different tasks and datasets (You et al., 2020), which limits the usability of GNNs. To automate this process, researchers have leveraged neural architecture search (NAS) (Liu et al., 2019a; Zhang et al., 2021b) for GNNs, including GraphNAS (Gao et al., 2020), Auto-GNN (Zhou et al., 2019), PDNAS (Zhao et al., 2020), and SANE (Zhao et al., 2021b). In this work, we refer to the problem of NAS for GNNs as NAS-GNNs.

While NAS-GNNs have shown promising results, they inherit issues from general NAS methods and fail to account for the unique properties of GNN operators. It is therefore important to understand the difficulties in general NAS training (e.g., architecture search and weight evaluation). Based on the search strategy, NAS methods can be categorized into three types: reinforcement learning-based methods (Zoph & Le, 2017), evolutionary algorithm-based methods (Jozefowicz et al., 2015), and differentiable methods (Liu et al., 2019a; Wu et al., 2019a). Both reinforcement learning-based and evolutionary algorithm-based methods suffer from high computational costs because sampled architectures must be re-trained from scratch. In contrast, the weight-sharing differentiable paradigm reuses neural weights to reduce the search effort and produces the optimal sub-architecture directly, without extra steps such as sampling; this significantly reduces computational cost and has made the paradigm the new frontier of NAS.
However, the weight-sharing paradigm requires the neural weights to reach optimality in order to obtain the optimal sub-architecture under its bi-level optimization (BLO) strategy (Liu et al., 2019a), which alternately optimizes the network weights (the outputs of operators) and the architecture parameters (the importance of operators). First, optimal neural weights are hard to achieve in general due to the curse of dimensionality in deep learning, leading to unstable search results, a phenomenon also called the optimization gap (Xie et al., 2022). Second, this paradigm often yields sloppy gradient estimates (Bi et al., 2020a;b; Guo et al., 2020b) owing to the alternating optimization, softmax-based estimation, and unfairness in architecture updating. Methods of this type converge slowly during training and are sensitive to initialization because of the wide use of early termination. Worse still, it is unclear why inheriting weights for a specific architecture remains effective: the weight updating and sharing lack interpretability.

Is updating the GNN weights necessary? In other words, does updating weights contribute to finding the optimal GNN architecture? Existing NAS-GNN methods rely on updating the weights, and in fact all of the issues raised above stem from the need to update weights to optimality. Unlike other deep learning structures, graph neural networks behave almost linearly, so they can be simplified to linear networks while maintaining superior performance (Wu et al., 2019b). Inspired by this, we find that an untrained GNN model is nearly optimal in theory. Note that the first paper on modern GNNs, i.e., GCN (Kipf & Welling, 2017a), already spotted this striking phenomenon in its experiments, but it gained little attention. To the best of our knowledge, our work is the first to unveil this and provide a theoretical proof. The issues mentioned above are then far less of a concern, since no weight update is needed, making NAS-GNN much simpler.
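This near-linearity can be made concrete. If each layer applies a symmetrically normalized adjacency followed by a linear map, as in GCN without the non-linearity (Wu et al., 2019b), then a K-layer stack with fixed random weights collapses into a single linear map, so training the per-layer weights adds nothing to the layer's expressiveness. A minimal NumPy sketch on a toy graph (all variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 4 nodes with 3-dimensional features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))

# Symmetric normalization with self-loops, as in GCN:
# A_norm = D^{-1/2} (A + I) D^{-1/2}.
A_hat = A + np.eye(4)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

# A 2-layer linear GNN with *fixed random* weights. Because every layer
# is linear, the stack collapses to A_norm^2 @ X @ (W1 @ W2).
W1 = rng.normal(size=(3, 3))
W2 = rng.normal(size=(3, 3))

H = X
for W in (W1, W2):
    H = A_norm @ H @ W                      # one linear message-passing layer

H_collapsed = np.linalg.matrix_power(A_norm, 2) @ X @ (W1 @ W2)
assert np.allclose(H, H_collapsed)          # untrained stack == single linear map
```

Under this view, the random weight matrices only re-mix feature dimensions; the graph structure (captured by powers of A_norm) does the representational work, which is consistent with the claim that an untrained GNN can already be near-optimal.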
In this paper, we formulate the NAS-GNN problem as a sparse coding problem by leveraging untrained GNNs, and call the resulting method neural architecture coding (NAC). We prove that untrained GNNs have built-in orthogonality, making the output depend only on the linear output layer. With the no-update scheme, we only need to optimize the architecture parameters, resulting in a single-level optimization strategy as opposed to the bi-level optimization of the weight-sharing paradigm; this reduces the computational cost significantly and improves optimization stability. Much as in the sparse coding problem (Zhang et al., 2015), our goal is to learn a set of sparse coefficients for selecting operators, treating the operators' weights collectively as a dictionary, which makes weight sharing straightforward and interpretable. Through extensive experiments on multiple challenging benchmarks, we demonstrate that our approach is competitive with state-of-the-art baselines while decreasing the computational cost significantly, as shown in Fig. 1. In summary, our main contributions are:

- Problem Formulation: We present (to our best knowledge) the first linear-complexity NAS algorithm for GNNs, namely NAC, which is solved via sparse coding.
- Theoretical Analysis: NAC holds a no-update scheme, which is theoretically justified by the built-in model linearity of GNNs and the orthogonality of the model weights.
- Effectiveness and Efficiency: We compare NAC with state-of-the-art baselines and show superior performance in both accuracy and speed; in particular, NAC brings up to an 18.8% improvement in accuracy and is up to 200× faster than baselines.
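To illustrate the kind of sparse coding objective involved, the sketch below treats the fixed outputs of m candidate operators as dictionary atoms and learns sparse selection coefficients with the standard ISTA (proximal gradient) algorithm. This is a generic sparse coding illustration under assumed names (`D`, `y`, `alpha`, `lam`), not the exact NAC objective:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: column j of D is the (flattened) output of untrained
# operator j on the graph; y is the supervision signal. Here both are
# synthetic, with a sparse ground-truth mixture over operators 0 and 3.
n, m = 50, 6
D = rng.normal(size=(n, m))
y = D @ np.array([1.0, 0, 0, 0.7, 0, 0]) + 0.01 * rng.normal(size=n)

# ISTA on:  min_alpha  0.5 * ||y - D alpha||^2  +  lam * ||alpha||_1
lam = 0.1
step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1/L, L = Lipschitz const. of gradient
alpha = np.zeros(m)
for _ in range(500):
    grad = D.T @ (D @ alpha - y)                              # gradient step
    z = alpha - step * grad
    alpha = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

print(np.round(alpha, 2))  # sparse coefficients over the candidate operators
```

The l1 penalty drives most coefficients toward zero, so the surviving entries of `alpha` indicate which operators to keep; because the dictionary (the operator outputs) is fixed, the whole search reduces to this single-level convex problem.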

2. RELATED WORK AND PRELIMINARIES

Graph Neural Networks (GNNs) are powerful representation learning techniques (Xu et al., 2019) with many key applications (Hamilton et al., 2017). Early GNNs were motivated from the spectral perspective, such as Spectral GNN (Bruna et al., 2014), which applies Laplacian operators directly. ChebNet (Defferrard et al., 2016) approximates these operators with a summation to avoid the high computational cost. GCN (Kipf & Welling, 2017b) further simplifies ChebNet by using its first-order approximation and strikes a balance between efficiency and effectiveness, revealing the message-passing mechanism of modern GNNs. Concretely, recent GNNs aggregate node features from neighbors and stack multiple layers to capture long-range dependencies. For instance, GraphSAGE (Hamilton et al., 2017) concatenates node features with mean/max/LSTM-pooled neighboring information. GAT (Velickovic et al., 2018) aggregates neighbor information using learnable attention weights. GIN (Xu et al., 2019) turns the aggregation into a learnable function based on the Weisfeiler-Lehman test, instead of the prefixed ones used by other GNNs, aiming to maximize the expressive power of GNNs. GNNs consist of two major components, where the aggregation step aggregates node features of target nodes' neighbors and the combination step passes previously aggregated features to networks to



Figure 1: Accuracy vs. running time on Cora. NAC (ours) outperforms the leading methods significantly in both accuracy and speed (in minutes).

