EFFICIENT AUTOMATIC GRAPH LEARNING VIA DESIGN RELATIONS

Anonymous authors
Paper under double-blind review

Abstract

Despite the success of automated machine learning (AutoML), which aims to find the best design, including the architecture of neural networks and hyper-parameters, conventional AutoML methods are computationally expensive and provide little insight into the relations of different model design choices. This work focuses on AutoML for graph tasks. To tackle these challenges, we propose FALCON, an efficient sample-based method that searches for the optimal model design on graph tasks. Our key insight is to model the space of possible model designs as a design graph, where nodes represent design choices and edges denote design similarities. FALCON features 1) a task-agnostic module, which performs message passing on the design graph via a Graph Neural Network (GNN), and 2) a task-specific module, which conducts label propagation of the known model performance information on the design graph. Both modules are combined to predict design performances across the design space, navigating the search direction. We conduct extensive experiments on 27 node and graph classification tasks from various application domains. We show empirically that FALCON can efficiently obtain well-performing designs for each task using only 30 explored nodes. Specifically, FALCON has a time cost comparable to one-shot approaches while achieving an average improvement of 3.3% over the best baselines.



1. INTRODUCTION

An efficient search strategy should thus rapidly rule out large subsets of the design space with potentially bad performance by leveraging such learned inductive bias.

Proposed approach. To overcome these limitations, we propose FALCON, an AutoML framework for graph tasks that achieves state-of-the-art sample efficiency and performance by leveraging model design insights. Our key insight is to build a design graph over the design space of architecture and hyper-parameter choices. FALCON extracts model design insights by learning a meta-model that captures the relation between the design graph and model performance, and uses it to inform a sample-efficient search strategy. FALCON consists of the following two novel components.

Design space as a graph. Previous works view the model design space as a high-dimensional space of isolated design choices (You et al., 2020b), which offers few insights into the relations between different design choices. For example, if trial runs show that models with more than 3 layers do not work well without batch normalization, this knowledge lets us shrink the search space by excluding all model designs with more than 3 layers and batch normalization set to false. While such insights are hardly obtained with existing AutoML algorithms (Liu et al., 2019; Pham et al., 2018; Gao et al., 2019; Zoph & Le, 2017; Cai et al., 2019), FALCON achieves this by constructing a graph representation, the design graph, over all design choices. Figure 1(a) visualizes a design graph, where each node represents a candidate design and edges denote similarity between designs. See Section 3.1 for details on the similarity measure and graph construction.

Search by navigating the design graph. Given the design graph, FALCON deploys a Graph Neural Network predictor, meta-GNN for short, which is supervised by the performances of explored nodes and learns to predict the performance of a design from its corresponding node in the design graph. The meta-GNN comprises 1) a task-agnostic module, which performs message passing on the design graph, and 2) a task-specific module, which conducts label propagation of the known model performance information on the design graph. Furthermore, we propose a search strategy that uses meta-GNN predictions to navigate the search in the design graph efficiently.

Experiments. We conduct extensive experiments on 27 graph datasets, covering node- and graph-level tasks with distinct distributions. Moreover, we demonstrate FALCON's potential applicability to image data through experiments on the CIFAR-10 dataset. Our code is available at https://anonymous.4open.science/r/Falcon.

2. RELATED WORK

Automatic Machine Learning (AutoML) is the cornerstone of discovering state-of-the-art model designs without massive human effort. We introduce four types of related work below.

Sample-based AutoML methods. Existing sample-based approaches explore the search space by sampling candidate designs; they include heuristic search algorithms such as Simulated Annealing, Bayesian Optimization approaches (Bergstra et al., 2011; White et al., 2021; Ma et al., 2019), and evolutionary (Xie & Yuille, 2017; Real et al., 2017) and reinforcement-learning-based methods (Zoph & Le, 2017; Zhou et al., 2019; Gao et al., 2019). However, they tend to train thousands of models from scratch, which results in low sample efficiency. For example, the methods of Zoph & Le (2017) and Gao et al. (2019) can occupy hundreds of GPUs for several days, hindering the adoption of AutoML in real-world applications (Bender et al., 2018). Some hyper-parameter search methods aim to reduce this computational cost. Successive Halving (Karnin et al., 2013) allocates training resources to the potentially most valuable models based on early-stage training information. Li et al. (2017) extend it by using different budgets to find the best configurations, avoiding the trade-off between the number of configurations and the budget allocated to each. Jaderberg et al. (2017) combine parallel search and sequential optimisation for fast search. However, the selective mechanisms of these methods are based only on model performance; they draw little insight into the relations of design variables, which limits their sample efficiency.

One-shot AutoML methods. One-shot approaches (Liu et al., 2019; Pham et al., 2018; Xie et al., 2019; Bender et al., 2018; Qin et al., 2021) are popular for their high search efficiency. Specifically, they train a super-net representing the design space, i.e., containing every candidate design, and share the weights of the same computational cell. Nevertheless, weight sharing degrades the reliability of design ranking, as it fails to reflect the true performance of candidate designs (Yu et al., 2020).

Graph-based AutoML methods. The key insight of our work is to construct the design space as a design graph, where nodes are candidate designs and edges denote design similarities, and to deploy a Graph Neural Network, the meta-GNN, to predict design performance. Graph HyperNetwork (Zhang et al., 2019a) directly generates weights for each node in a computation graph representation. You et al. (2020a) study network generators that output relational graphs and analyze the link between predictive performance and graph structure. Recently, Zhao et al. (2020) consider both the micro- (i.e., a single block) and macro-architecture (i.e., block connections) of each design in the graph domain. AutoGML (Park et al., 2022) designs a meta-graph to capture the relations among models and graphs and takes a meta-learning approach to estimate the relevance of models to different graphs. Notably, none of these works model the search space as a design graph.

Design performance predictors. Previous works predict the performance of a design using learning curves (Baker et al., 2018), layer-wise features (Deng et al., 2017), computational graph structure (Zhang et al., 2019a; White et al., 2021; Shi et al., 2019; Ma et al., 2019; Zhang et al., 2019b; Lee et al., 2021a), or by additionally encoding dataset information (Lee et al., 2021a).
To highlight, FALCON explicitly models the relations among model designs. Moreover, it leverages performance information on training instances to provide task-specific signal beyond the design features; this is differently motivated from Lee et al. (2021b), who employ meta-learning techniques and incorporate hardware features to rapidly adapt to unseen devices. Besides, the meta-GNN is applicable to both images and graphs, unlike Lee et al. (2021a).

3. PROPOSED METHOD

This section introduces our proposed approach, FALCON, for sample-based AutoML. In Section 3.1, we introduce the construction of the design graph and formulate the AutoML goal as a search on the design graph for the node with the best task performance. In Section 3.2, we introduce our novel neural predictor, consisting of a task-agnostic module and a task-specific module, which predicts the performance of unknown designs. Finally, we detail our search strategy in Section 3.3. We refer the reader to Figure 1(b) for a high-level overview of FALCON.

3.1. DESIGN SPACE AS A GRAPH

Motivation. Previous works generally consider each design choice in isolation. However, designs that share the same design features, e.g., graph neural networks (GNNs) with more than 3 layers and batch normalization, often have similar performances. Moreover, the inductive bias over the relations between design choices provides valuable information for navigating the design space toward the best design. For example, suppose we find that setting batch normalization to false degrades the performance of both a 3-layer GCN (Kipf & Welling, 2017) and a 4-layer GIN (Xu et al., 2019). Then we can reasonably infer that a 3-layer GraphSAGE (Hamilton et al., 2017) with batch normalization outperforms the one without. We can leverage such knowledge to search only the designs that are more likely to improve task performance. To the best of our knowledge, FALCON is the first method to explicitly exploit such relational information among model designs.

Design graph. We denote the design graph as G(N, E), where the nodes N are the candidate designs and the edges E encode similarities between candidate designs. Specifically, we use the notion of design distance to decide the graph connectivity, elaborated below.

Design distance. For each numerical design dimension, two design choices have distance 1 if they are adjacent in the ordered list of design choices. For example, if the hidden dimension size can take values [16, 32, 64, 128], then the distance between 16 and 32 is 1, and the distance between 32 and 128 is 2. For each categorical design dimension, any two distinct design choices have distance 1. We then define the connectivity of the design graph in terms of the design distance (Definition 1): (d_i, d_j) ∈ E iff the design distance between d_i and d_j is 1. During the search, FALCON operates on a design subgraph G_s(N_s, E_s), where N_s ⊆ N are the designs considered so far and E_s = {(u, v) | u ∈ N_s, v ∈ N_s, (u, v) ∈ E} are the edges. Given the design subgraph, we formulate the AutoML problem as searching for the node, i.e., design choice, with the best task performance.
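To make the construction concrete, the following is a minimal sketch of the design distance and the resulting edge set, assuming a small hypothetical design space; the dimension names and value lists are illustrative, not the paper's exact design space.

```python
from itertools import combinations, product

# Illustrative design dimensions (not the paper's exact space).
NUMERICAL = {"hidden_dim": [16, 32, 64, 128]}
CATEGORICAL = {"aggregation": ["SUM", "MAX", "MEAN"]}

def design_distance(d1: dict, d2: dict) -> int:
    """Sum of per-dimension distances as defined above."""
    dist = 0
    for dim, choices in NUMERICAL.items():
        # Numerical: gap between positions in the ordered choice list.
        dist += abs(choices.index(d1[dim]) - choices.index(d2[dim]))
    for dim in CATEGORICAL:
        # Categorical: any two distinct choices have distance 1.
        dist += int(d1[dim] != d2[dim])
    return dist

# Enumerate all candidate designs and connect pairs at design distance 1.
dims = {**NUMERICAL, **CATEGORICAL}
designs = [dict(zip(dims, vals)) for vals in product(*dims.values())]
edges = [(i, j) for (i, d1), (j, d2) in combinations(enumerate(designs), 2)
         if design_distance(d1, d2) == 1]
```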

3.2. META-GNN FOR PERFORMANCE PREDICTION

Here we introduce a meta-model, named meta-GNN, to predict the performance of model designs, i.e., nodes of the design subgraph. The goal of the meta-GNN is to learn the inductive bias of design relations, which is used to navigate the search path on the design graph. As illustrated in Figure 2, the meta-GNN comprises a task-agnostic module and a task-specific module, which capture knowledge of the model designs and of the task performance, respectively.

Task-agnostic module. The task-agnostic module uses a design encoder to encode the design features on nodes of the design subgraph, and a relation encoder to capture the design similarities and differences on edges of the design subgraph. It then performs message passing on the design subgraph. We introduce each component below:
• Design encoder: it computes the node features of the design subgraph as the concatenation of the feature encodings of each design dimension. For numerical design dimensions, we apply min-max normalization to their values as node features. For categorical design dimensions, such as the aggregation operator taking one of (SUM, MAX, MEAN), we use a one-hot encoding.
• Relation encoder: it captures the similarity relationships between connected designs. For each (d_i, d_j) ∈ E_s, we encode the design dimension in which d_i and d_j differ as a one-hot encoding.
• Message passing module: a GNN model takes the design subgraph and the processed features, performs message passing, and outputs node representations. This information is combined with the task-specific module to predict a design's performance.

Task-specific module. The task-specific module takes into account the design performance on selected training instances and is thus specific to one dataset. The challenge of including such task-specific performance is that it is only available on a very limited set of explored nodes. To overcome this, we use label propagation to spread the performance information from explored to unexplored nodes, based on our observation that models trained with similar designs typically make similar predictions on instances. Figure 2 illustrates the task-specific module.
• Identifying critical instances: the first step is to identify training instances whose outcomes differ across designs. We use a set of explored designs (anchors) to provide instance-wise performances. Specifically, the (i, j) element of the top-left matrix in Figure 2 indicates whether the i-th design correctly predicts the label of the j-th instance. We compute the entropy of each training instance's performance over the anchors, obtain an instance-wise probability via a Softmax over the entropy vector, and sample instances with high variation across designs. High variation implies that an instance is informative for distinguishing good designs from bad ones in the design subgraph.
• Label propagation and linear projection: based on the inductive bias of smoothness, we perform label propagation to make the task-specific information available to all candidate designs. Concretely, label propagation can be written as

    Y^{(k+1)} = α · D^{−1/2} A D^{−1/2} Y^{(k)} + (1 − α) Y^{(k)}    (1)

where row i of Y is the performance vector of design i (if explored) or a zero vector (if unexplored), D ∈ R^{|N_s|×|N_s|} is the diagonal degree matrix, A ∈ R^{|N_s|×|N_s|} is the adjacency matrix of the design subgraph, and α is a hyper-parameter. After label propagation, a linear layer projects the performance information into a high-dimensional space. Finally, as shown in Figure 2, we concatenate the output embeddings of the task-specific and task-agnostic modules and use an MLP to produce the performance predictions.

Algorithm 1: FALCON search strategy (Ω: explored designs; Γ: candidate designs; h_θ: the meta-GNN; K: exploration size; V, V′: numbers of training epochs)

    Ŷ_Ω = GET-VALIDATION-PERFORMANCE(Ω, V)    // explore the initial nodes (for V epochs)
    while t = |Ω| < K do
        G_s^{(t)} ← (N_v = Ω ∪ Γ, E = SIMILARITY(N_v))    // (1) update the design subgraph
        while not converged do
            θ ← θ − η · ∂L(Ŷ_Ω, h_θ(G_s^{(t)})_Ω)/∂θ    // (2) compute Eq. 2 and optimize
        end while
        // (3) sample a candidate node with probability proportional to the meta-GNN's prediction
        d^{(t)} = SAMPLE-WITH-PROBABILITY(Γ, Softmax(h_θ(G_s^{(t)})_Γ))
        Y_t = GET-VALIDATION-PERFORMANCE(d^{(t)}, V)    // explore the selected node
        Ω ← Ω ∪ {d^{(t)}}, Γ ← Γ ∪ MULTI-HOP-NEIGHBORS({d^{(t)}})
    end while
    D = SELECT-TOPK({Ω_i : Y_i}_{i=1}^{K}, size = MIN(⌈10% · K⌉, 5))    // models to be fully trained
    Y′ = GET-VALIDATION-PERFORMANCE(D, V′)    // obtain the final performance
    I = ARGMAX(Y′)    // obtain the best model
    return D_I, Y′_I

Objective for the meta-GNN. Training a neural performance predictor is commonly formulated as regression with a mean squared error (MSE) loss, so as to predict how good the candidate designs are for the current task. However, the number of explored designs is usually small in sample-efficient AutoML, especially during the early stage of the search, so a deep predictor tends to overfit, degrading the reliability of performance prediction. We therefore augment the MSE objective with a pairwise ranking loss, which supplies a quadratic number of training pairs, thus reducing overfitting. Furthermore, predicting relative performance is more robust across datasets than predicting absolute performance. Overall, the objective is formulated as follows:

    L(Ŷ, Y) = Σ_{i=1}^{N} (Ŷ_i − Y_i)^2 + λ L_rank(Ŷ, Y),
    where L_rank(Ŷ, Y) = Σ_{i=1}^{N} Σ_{j=i}^{N} (−1)^{I(Y_i > Y_j)} · σ((Ŷ_i − Ŷ_j)/τ)    (2)

where λ is a trade-off hyper-parameter, τ is a temperature controlling the minimal performance gap that is heavily penalized, and σ is the Sigmoid function. The meta-GNN is thus trained to predict node performance on the design subgraph, supervised by the performances of the explored nodes.
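For concreteness, below is a minimal PyTorch sketch of the label propagation step of Eq. (1) and the training objective of Eq. (2); tensor names and default hyper-parameter values are our own assumptions, not the paper's settings.

```python
import torch

def label_propagate(Y, A, alpha=0.9, num_iters=3):
    """Eq. (1): Y has one performance row per design (zeros if unexplored);
    A is the [n, n] adjacency matrix of the design subgraph."""
    deg = A.sum(dim=1).clamp(min=1.0)
    d_inv_sqrt = deg.pow(-0.5)
    # D^{-1/2} A D^{-1/2} via row/column scaling.
    A_norm = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)
    for _ in range(num_iters):
        Y = alpha * (A_norm @ Y) + (1.0 - alpha) * Y
    return Y

def falcon_loss(y_pred, y_true, lam=1.0, tau=0.1):
    """Eq. (2): MSE plus a pairwise ranking term over design pairs."""
    mse = ((y_pred - y_true) ** 2).sum()
    # sigma((Yhat_i - Yhat_j) / tau) for all pairs (i, j).
    gap = torch.sigmoid((y_pred.unsqueeze(1) - y_pred.unsqueeze(0)) / tau)
    # (-1)^{I(Y_i > Y_j)}: -1 where design i truly outperforms design j.
    sign = 1.0 - 2.0 * (y_true.unsqueeze(1) > y_true.unsqueeze(0)).float()
    i, j = torch.triu_indices(len(y_pred), len(y_pred))  # pairs with j >= i
    rank = (sign[i, j] * gap[i, j]).sum()
    return mse + lam * rank
```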

3.3. SEARCH STRATEGY

Equipped with the meta-GNN, we propose a sequential strategy to search for the best design in the design graph. The core idea is to use the meta-GNN for fast inference on the dynamic design subgraph and to decide which node to explore next, thus navigating the search. We summarize our search strategy in Algorithm 1. Concretely, it consists of the following three steps:
• Initialization: as shown in Figure 1(b), FALCON randomly samples multiple nodes on the design graph. Sampling multiple initial nodes enlarges the receptive field on the design graph to avoid local optima and bad performance, which is empirically verified in Appendix D. FALCON then explores the initial nodes by training the corresponding designs on the task and constructs the instance mask for the task-specific module.
• Meta-GNN training: following Figure 2, the meta-GNN predicts the performance of the explored nodes. The loss is computed via Equation 2 and back-propagated to optimize the meta-GNN.
• Exploration via inference: the meta-GNN then predicts the performances of all candidate nodes. We apply a Softmax over the predictions to obtain a probability distribution over candidate designs, from which FALCON samples a new node and updates the design subgraph (see the sketch after this list).
At every iteration, FALCON extends the design subgraph through the last two steps. After several iterations, it selects and retrains the few designs in the search trajectory with top performance. Overall, FALCON approaches the optimal design navigated by the relational inductive bias learned by the meta-GNN, as shown in Figure 1(b).
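As a minimal illustration of the exploration step, the following sketch (hypothetical function and variable names) samples the next design from a Softmax over the meta-GNN's predicted performances of the candidate nodes.

```python
import torch

def sample_next_design(meta_gnn, design_subgraph, candidate_idx):
    """candidate_idx: indices of unexplored candidate nodes in the subgraph."""
    with torch.no_grad():
        scores = meta_gnn(design_subgraph)[candidate_idx]  # predicted performances
    probs = torch.softmax(scores, dim=0)                   # sampling distribution
    pick = torch.multinomial(probs, num_samples=1).item()
    return candidate_idx[pick]
```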

4. EXPERIMENTS

We conduct extensive experiments on 27 graph datasets and an image dataset. The goal is twofold: (1) to show FALCON's sample efficiency over existing AutoML methods (cf. Section 4.2) and (2) to provide insights into how the inductive bias of design relations navigates the search on the design graph (cf. Section 4.3).

4.1. EXPERIMENTAL SETTINGS

We consider the following tasks in our evaluation and leave details including dataset splits, evaluation metrics, and hyper-parameters to Appendix A.

Node classification. We use 6 benchmarks ranging from citation networks to product or social networks: Cora, CiteSeer, PubMed (Sen et al., 2008), ogbn-arxiv (Hu et al., 2020), AmazonComputers (Shchur et al., 2018), and Reddit (Zeng et al., 2020).

Graph classification. We use 21 binary classification benchmarks from TUDataset (Morris et al., 2020), which predict certain properties of molecule datasets with various distributions.

Image classification. We use CIFAR-10 (Krizhevsky, 2009). See details in Appendix C.

Baselines. We compare FALCON with three types of baselines:
• Simple search strategies: Random, Simulated Annealing (SA), Bayesian Optimization (BO) (Bergstra et al., 2011).
• AutoML approaches: DARTS (Liu et al., 2019), ENAS (Pham et al., 2018), GraphNAS (Gao et al., 2019), AutoAttend (Guan et al., 2021), GASSO (Qin et al., 2021), where the last three methods are specifically designed for graph tasks.
• Ablation models: FALCON-G and FALCON-LP, where FALCON-G discards the design graph and predicts design performance using an MLP, and FALCON-LP removes the task-specific module and predicts design performance using only the task-agnostic module.
We also include a naive method, BRUTEFORCE, which trains 5% of the designs from scratch and returns the best design among them. The result of BRUTEFORCE is regarded as the approximate ground-truth performance. We compare FALCON and the simple search baselines under sample-size-controlled search, where we limit the number of explored designs; the exploration size is 30 by default.

Design space. We use different design spaces for node- and graph-level tasks (an illustrative sketch follows below). The design variables include common hyper-parameters, e.g., dropout ratio, and architecture choices, e.g., layer connectivity and batch normalization. Moreover, we consider node pooling choices for the graph classification datasets, which are less studied in previous works (Cai et al., 2021; Gao et al., 2019; Zhou et al., 2019). Besides, we follow You et al. (2020b) and control the number of parameters of all candidate designs to ensure a fair comparison. See Appendix A.2 for details.
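As an illustration of what such a design space looks like, the dictionary below sketches a hypothetical graph-level design space in the spirit of Tables 4 and 5; the dimension names and values are examples, not the paper's exact space.

```python
# Hypothetical design space; each key is a design dimension, each list its choices.
DESIGN_SPACE_GRAPH_LEVEL = {
    "num_mp_layers": [2, 4, 6, 8],             # message passing depth
    "dropout":       [0.0, 0.3, 0.6],
    "batch_norm":    [True, False],
    "connectivity":  ["STACK", "SKIP-SUM", "SKIP-CAT"],
    "aggregation":   ["SUM", "MAX", "MEAN"],
    "node_pooling":  [True, False],            # graph-level tasks only
    "pooling_loop":  [2, 4, 6],                # MP layers between pooling ops
}
```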

4.2. MAIN RESULTS

Node classification tasks. Table 1 summarizes the performance of FALCON and the baselines. Notably, FALCON has a search cost comparable to the one-shot methods and is 15x less expensive than GraphNAS. Moreover, FALCON achieves the best performance over the baselines by sufficient margins on most datasets, using only 30 explored designs. For example, FALCON outperforms ENAS by 1.8% on CiteSeer and GASSO by 1.6% on AmazonComputers. Also, removing the design graph or the task-specific module consistently decreases performance, which validates their effectiveness. It is worth mentioning that FALCON is competitive with BRUTEFORCE, demonstrating its strength in finding globally best-performing designs. We further investigate the speed-performance trade-off of FALCON and other sample-based approaches on ogbn-arxiv by running search trials under different sample sizes. As shown in Figure 3, FALCON reaches the approximate ground-truth result with very few explored nodes. In contrast, SA and Random require more samples to converge, while BO performs poorly even with a large number of explored nodes, potentially due to its inability to handle high-dimensional design features.

Graph classification tasks. The graph classification datasets cover a wide range of graph distributions. In Table 2, we report selected performance results for the graph classification tasks and leave the remaining results, including search costs, to Appendix B. We highlight three observations:
• On average, the state-of-the-art AutoML baselines perform close to the simple search methods, indicating potentially unreliable search, as similarly concluded by Yu et al. (2020).
• FALCON surpasses the best AutoML baselines with an average improvement of 3.3%. This consistent improvement validates our sample efficiency under a controlled sample size: FALCON explores designs that are more likely to perform well through relational inference over previously explored designs and their performances.
• In the second block of Table 2, we attribute the high sample efficiency of FALCON to its modeling of design relations and its use of performance information from training instances. Specifically, FALCON outperforms FALCON-LP by 4.87% on average, indicating that the task-specific module provides task information that aids the representation learning of model designs, enabling fast adaptation to a given task. Moreover, FALCON gains an average improvement of 6.43% over FALCON-G, which supports our motivation that design relations promote the learning of relational inductive bias and guide the search on the design graph.
We also conduct experiments similar to Figure 3 to investigate how FALCON converges with increasing sample size (cf. Appendix B.1) and report the best designs found by FALCON for each dataset (cf. Appendix B.2). Besides, we provide a sensitivity analysis of FALCON's hyper-parameters, e.g., the number of random start nodes C (cf. Appendix D).

Image classification task. We demonstrate the potential of FALCON in the image domain. Due to space limitations, we defer the CIFAR-10 results to Appendix C. We found that FALCON searches for best-performing designs compared with the baselines. Specifically, it gains average improvements of 1.4% over the simple search baselines and 0.3% over the one-shot baselines on the architecture design space, with search cost comparable to the one-shot baselines.

4.3. CASE STUDIES OF FALCON

We study FALCON along two dimensions: (1) the search process, where we probe FALCON's inference through explanations of the meta-GNN on a design graph, and (2) design representations, where we visualize the node representations output by the meta-GNN to examine the effect of design choices.

Search process. We use GNNExplainer (Ying et al., 2019) to explain the node predictions of the meta-GNN and shed light on the working mechanism of FALCON. Here we consider the importance of each design dimension for each node's prediction, demonstrated on a real case of searching on CIFAR-10 (cf. Table 12 for the design space). For conciseness, we focus on two design dimensions, (Weight Decay, Batch Size). Given a node of interest n′ = (0.9, 128), we observe how its prediction and the dimension importances change during the search. The search trace lists the explored nodes n_t, e.g., ..., (0.99, 64), (0.9, 64), ..., (0.99, 128), ..., and their performances, where + and − indicate the relative performance of an explored node and t is the current search step. Interestingly, the prediction for n′ and the dimension importances evolve with the explored designs and their relations. For example, when the weight decay changes from 0.99 to 0.9, there is a large drop in node performance, which affects the prediction for n′ and increases the importance of Weight Decay, since design performance appears sensitive to this dimension.

Design representations. In Figure 4, we visualize the high-dimensional design representations via T-SNE (van der Maaten & Hinton, 2008) after training the meta-GNN on the Cora dataset. In the left panel, the better the design performance, the darker the color. Generally, points at small distances have similar colors, i.e., performances, indicating that the meta-GNN can distinguish "good" nodes from "bad" ones. In the right panel, different colors represent different dropout ratios. The high discrimination indicates that the dropout ratio is an influential variable for learning the design representations, which in turn affect design performance. This evidence validates the meta-GNN's expressiveness and its capacity to learn the relational inductive bias among design choices.
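As a rough sketch of this procedure, older PyTorch Geometric releases expose a GNNExplainer module directly; the toy example below is a hedged illustration (the stand-in model, data, and the return_type argument are our assumptions, and newer PyG versions wrap this functionality in the torch_geometric.explain.Explainer interface).

```python
import torch
from torch_geometric.nn import GCNConv, GNNExplainer

class TinyGNN(torch.nn.Module):
    """Stand-in for the meta-GNN: one GCN layer + linear scorer."""
    def __init__(self, in_dim=8, hid=16):
        super().__init__()
        self.conv = GCNConv(in_dim, hid)
        self.out = torch.nn.Linear(hid, 1)
    def forward(self, x, edge_index):
        return self.out(self.conv(x, edge_index).relu())

x = torch.randn(10, 8)                      # toy design features
edge_index = torch.randint(0, 10, (2, 30))  # toy design-graph edges
model = TinyGNN()

explainer = GNNExplainer(model, epochs=100, return_type='regression')
feat_mask, edge_mask = explainer.explain_node(3, x, edge_index)
# feat_mask scores how much each design-feature dimension drives node 3's
# prediction (e.g., Weight Decay vs. Batch Size in the case study above).
```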

5. CONCLUSION, LIMITATION, AND FUTURE WORK

This work introduces FALCON, an efficient sample-based AutoML framework. We propose the concept of a design graph, which explicitly models and encodes relational information among model designs. On top of the design graph, we develop a sample-efficient strategy that navigates the search with a novel meta-model. One future direction is to better handle the high average node degree of the design graph, which can cause over-smoothing, especially when the design variables include many categorical dimensions; a simple remedy is edge dropout, which randomly removes a portion of edges at each training epoch. Another future direction is to better adapt FALCON to continuous design variables by developing a dynamic design graph that enables a more fine-grained search between discretized values.

REPRODUCIBILITY STATEMENT

All of the datasets used in this work are public. We state the detailed experimental settings in Appendix A and Appendix C, including graph pre-processing, dataset splits, and hyper-parameters. Moreover, we include our code in an anonymous link for public access. For the results, we report the best models found by our algorithm as well as their corresponding performances. Overall, we believe we have made every effort to ensure the reproducibility of this paper.

ETHICS STATEMENT

In this work, we propose a novel algorithm to search for the best model designs; no human subjects are involved. This work could promote the discovery of more powerful and expressive models and provide insights into design relations. However, while best-performing models may be "experts" at fulfilling given tasks, they are not necessarily fair towards different user or entity groups. We believe this is a general issue in the AutoML area that should be addressed to ensure the ethics of models in real-world applications.

Node pooling type choices: TopkPool (Gao & Ji, 2019), SAGPool (Lee et al., 2019), PANPool (Ma et al., 2020), EdgePool (Diehl, 2019); node pooling loop choices: [2, 4, 6].

Specifically, the STACK design choice means directly stacking multiple GNN layers, i.e., without skip-connections. We also support node pooling operations for graph classification tasks, where the pooling loop is the number of message passing layers between consecutive pooling operations. If the number of message passing layers is m and the node pooling loop is l, there is a node pooling layer after the i-th message passing layer (hierarchical pooling), where i ∈ {1 + k · l | k = 0, ..., ⌈(m − 1)/l⌉ − 1}; a small example follows below. Moreover, to avoid duplicated and invalid designs, some design variables must satisfy dependency rules, which we illustrate with two examples.
• If the node pooling flag of a design is False, the design takes no value for node pooling type or node pooling loop, and vice versa. Denoting the node pooling flag as f, the node pooling type as t, the pooling loop as l, and any design choice as *, both (f=False, t=*) and (f=False, l=*) are invalid.
• The node pooling loop should not exceed the number of message passing layers. For example, design A (m=4, l=4) and design B (m=4, l=6) that take the same values on all other design variables are duplicates.
Thus, the design graph constructed under dependency rules is more complex. Without loss of generality, we define the distance between (f=False) and (f=True, l=MIN({i ∈ L})) as 1, where L represents the design choices for the node pooling loop. The design graph is therefore connected, enabling the exploration of any node from a random initialization. It is also worth mentioning that the search strategy of FALCON is modularized given the design graph. In contrast, the dependency rules constrain the action space of reinforcement learning methods, e.g., the action (f=True → False) is inapplicable, which requires special handling inside the controller. We further summarize the statistics and construction time of the design graphs in Table 6, where DG-1 and DG-2 denote the design graphs for node-level and graph-level tasks, respectively. We use multi-processing on 50 CPUs (Intel Xeon Gold 5118 CPU @ 2.30GHz) for the graph construction. Note that we need not construct the entire design graph in the pre-processing step, since we only extend a small portion of the design graph, i.e., the design subgraph, during the search. Thus, the total cost of constructing the design subgraph is O(E′), where E′ is the number of edges in the design subgraph, which greatly lowers the time cost.
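A tiny sketch of the pooling-placement rule above (the function name is ours):

```python
import math

def pooling_positions(m: int, l: int) -> list:
    """Pooling follows the i-th MP layer for i in {1 + k*l | k = 0..ceil((m-1)/l)-1}."""
    return [1 + k * l for k in range(math.ceil((m - 1) / l))]

print(pooling_positions(4, 2))  # [1, 3]
print(pooling_positions(4, 4))  # [1] -- same as (m=4, l=6), hence the duplicate rule
```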

A.2.2 DESIGN SPACES FOR THE ONE-SHOT BASELINES

The one-shot models (Liu et al., 2019; Pham et al., 2018) are built upon a super-model that must contain all of the architecture choices. We build a macro search space over entire models for both node classification and graph classification datasets, with two constraints. First, we do not consider CAT (skip-concatenate) as a layer connectivity choice, and we also remove the node pooling design variables, because CAT and node pooling operations change the input shape and make the output embeddings incompatible with the subsequent weight-sharing modules in our setting. Second, the layer connectivity is customized for each layer following previous works (Liu et al., 2019; Pham et al., 2018), instead of being set globally for every layer. Overall, we summarize the design space in Table 7. To enable a fair comparison, we fine-tune the hyper-parameters and report the best results of the architectures found by the one-shot methods according to their performance on the validation sets.

B MORE EXPERIMENTAL RESULTS ON GRAPH TASKS B.1 GRAPH CLASSIFICATION TASKS

Task performance. Here we provide further task performance results on the graph classification datasets. We repeat each experiment at least 3 times and report the average performances with standard errors. The results, summarized in Table 8, demonstrate the strength of FALCON in finding good designs under different data distributions.

Search cost. As shown in Figure 5, we report the search cost of Random, DARTS, ENAS, GraphNAS, and FALCON on selected datasets. The time measurements were conducted on a single NVIDIA GeForce 3070 GPU (24G). FALCON has a time cost comparable to Random and DARTS, which empirically demonstrates its efficiency. However, as FALCON still samples designs and trains them from scratch (i.e., its search cost is bounded below by that of Random), the computational cost is relatively high on large datasets, e.g., OVCAR-8 and MCF-7. This limitation could be alleviated by integrating dataset sampling to reduce time costs.

ROC-AUC v.s. exploration size. Here we report how task performance on the graph classification datasets changes with the number of explored nodes. In Figure 6, we visualize the results on two graph classification datasets. FALCON approaches the best-performing designs quickly as the exploration size grows.

B.2 BEST DESIGNS

In Table 9 and Table 10, we summarize the best designs found by FALCON and BRUTEFORCE on each dataset, where the average number of parameters is 137.5k across all graph classification datasets. Note that we select the best designs according to their performance on the validation sets; thus, there are cases where FALCON surpasses BRUTEFORCE. We highlight the design variables on which FALCON and BRUTEFORCE differ for comparison. To estimate the uncertainty of BRUTEFORCE, we compute its 95% confidence interval using bootstrapping. Moreover, we compare BRUTEFORCE with a variant: we train all designs in the design space for 30 epochs, select the top 10% of designs, resume their training until 50 epochs, then choose the top 50% of those to be fully trained, and return the best fully trained design based on validation performance. We run Bruteforce-bootstrap on four datasets as a demonstration and summarize the results in Table 11.



Figure 1: Overview of FALCON. (a) Design graph example: a small design graph on the TU-COX2 graph classification dataset. The design choices are shown in the table; #pre, #mp, and #post denote the numbers of pre-processing, message passing, and post-processing layers, respectively. The better the design performance, the darker the node color. (b) FALCON search strategy. Red: explored nodes. Green: candidate nodes to be sampled from. Blue: the best node. Gray: other nodes. Locally, FALCON extends the design subgraph via the search strategy detailed in Section 3.3. Globally, FALCON approaches the optimal design navigated by the inductive bias of the design relations.

Figure 2: Meta-GNN Framework: Task-agnostic module generates the embedding given the design variables and their graphical structures. Task-specific module leverages performance information and conducts label propagation to generate the task-specific embeddings. The two embeddings are concatenated and input into an MLP for predicting the design performance.

Figure 3: Accuracy v.s. the number of explored nodes on ogbn-arxiv.


Figure 4: T-SNE visualization for the design representations on Cora dataset.

Figure 5: Search cost on the selected datasets.

Figure 6: Accuracy v.s. the number of explored nodes on two graph classification datasets.

Definition 1 (Design Graph Connectivity). The design graph can be expressed as G(N, E), where the nodes N = {d_1, ..., d_n} are model designs, and (d_i, d_j) ∈ E iff the design distance between d_i and d_j is 1.

Table 1: Search results on five node classification tasks, where Time stands for the search cost (GPU·hours). We conduct a t-test to compute the p-value of our method against the best AutoML baselines.

Table 2: Selected results for the graph classification tasks. The average task performance (ROC-AUC) of the architectures searched by FALCON is 3.3% higher than the best AutoML baselines.

Table 4: Design space for the node-level tasks (except Reddit); 5,832 candidates in total.

Table 5: Design space for the graph-level tasks; 58,320 candidates in total.

Table 7: Design space for the one-shot baselines on node and graph classification tasks.

Table 6: Statistics and construction time of the design graphs.

Table 9: Average parameters and best designs on the node classification datasets.

Table 11: Test performances of Bruteforce and its variant. Surprisingly, we found that the performances of Bruteforce and Bruteforce-bootstrap are very close. This indicates that Bruteforce (fully training 5% of the designs) is a good surrogate for Bruteforce-bootstrap (which also fully trains 5% of the designs, but with bootstrapping selection) and can also well approximate the ground-truth performance of the best design.

Table 10: Best designs on the graph classification datasets.

A EXPERIMENT DETAILS

A.1 SETTINGS

Graph classification datasets. The graph classification datasets used in this work are summarized in Table 3; detailed dataset statistics can be found at https://chrsmrrs.github.io/datasets/docs/datasets/.

Table 3: List of the graph classification datasets used in this work.
Small scale: AIDS, BZR-MD, COX2-MD, DHFR-MD, Mutagenicity, NCI1, NCI109, PTC-MM, PTC-MR.
Medium/large scale: Tox21-AhR, MCF-7, MOLT-4, UACC257, Yeast, NCI-H23, OVCAR-8, P388, PC-3, SF-295, SN12C, SW-620.

All datasets are binary classification tasks that predict certain properties of small molecules. For example, the labels in Tox21-AhR represent toxicity/non-toxicity, while the graphs in Mutagenicity are classified into two classes based on their mutagenic effect on a bacterium (Morris et al., 2020). Consequently, we use atom types as node features and bond types as edge features.

Evaluation metrics. For Reddit, we use the micro F1 score as the evaluation metric, following previous work (Zeng et al., 2020). For the other node classification tasks and the image dataset, we use classification accuracy. For the graph classification tasks, we use ROC-AUC.

Dataset splits. For ogbn-arxiv and Reddit, we use the standardized dataset splits. For the other node classification datasets, we split the nodes of each graph into 70%/10%/20% training/validation/test sets. For the graph classification tasks, we split the graphs into 80%/10%/10% training/validation/test sets.

Hyper-parameters. We tuned the hyper-parameters of the baselines starting from the default settings in their public code. For FALCON, we construct the candidate set as the 3-hop neighbors of the explored nodes (see the sketch below) and set the number of start nodes to min(⌈10% · K⌉, 10), where K denotes the exploration size. The meta-GNN consists of 3 message passing layers and 3 label propagation layers. All experiments are repeated at least 3 times.
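A minimal sketch (our own helper, with adjacency given as a dict of neighbor lists) of collecting the multi-hop candidate set:

```python
from collections import deque

def multi_hop_neighbors(adj, explored, hops=3):
    """Return unexplored designs within `hops` edges of any explored design."""
    frontier = deque((n, 0) for n in explored)
    seen = set(explored)
    candidates = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nbr in adj.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                candidates.add(nbr)
                frontier.append((nbr, depth + 1))
    return candidates

# Example: a path graph 0-1-2-3-4 with node 0 explored.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(multi_hop_neighbors(adj, {0}))  # {1, 2, 3}
```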

A.2.1 DESIGN SPACES FOR THE SAMPLE-BASED METHODS

In this work, we use different design spaces for the datasets depending on the task type, i.e., node or graph level. We summarize the design variables and choices in Table 4 and Table 5. For the design space of Reddit, we replace "Aggregation" in Table 4 with "Convolutional layer type", which takes values in {GCNConv (Kipf & Welling, 2017), SAGEConv (Hamilton et al., 2017), GraphConv (Morris et al., 2019), GINConv (Xu et al., 2019), ARMAConv (Bianchi et al., 2019), TAGConv (Du et al.)}.

C EXPERIMENTAL RESULTS ON THE IMAGE TASK C.1 DATASET PRE-PROCESSING

We use the CIFAR-10 (Krizhevsky, 2009) image dataset to show that FALCON can work well in other machine learning domains. The dataset consists of 50,000 training images and 10,000 test images. We randomly crop images to size 32 × 32 and apply random flipping and normalization.
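A minimal torchvision sketch of this preprocessing (the crop padding and the normalization statistics are our assumptions; commonly used CIFAR-10 per-channel values are shown):

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),            # random crop back to 32x32
    T.RandomHorizontalFlip(),               # random flipping
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465),   # commonly used CIFAR-10 means
                (0.2470, 0.2435, 0.2616)),  # and standard deviations
])
```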

C.2 DESIGN SPACES

Here we use two different design spaces to demonstrate FALCON's ability to search for both hyper-parameters and architectures on an image dataset.

Hyper-parameter design space. We consider a broad hyper-parameter search space, including common hyper-parameters such as batch size. We train each design with an SGD optimizer, which takes weight decay and momentum as hyper-parameters. We also use a learning rate (LR) scheduler that reduces the learning rate when the validation performance has stopped improving; specifically, the scheduler reduces the learning rate by a factor if no improvement is seen for a 'patience' number of epochs. It is also worth mentioning that FALCON is flexible with respect to other sets of hyper-parameter choices determined by the user.

Architecture design space. We construct a micro design space over computational cells; a sketch follows below. Each cell consists of two branches, with five operation choices: separable convolution with kernel size 3 × 3 or 5 × 5, average pooling or max pooling with kernel size 3 × 3, and identity. Each branch contains one dropout layer, whose ratio is one of {0.0, 0.3, 0.6}, and one batch normalization layer. We also use either identity or skip-sum as the skip-connection within each branch. After the input is processed separately by each branch, the outputs are added to form the cell output, which differs from the original ENAS paper that searches a computational DAG over defined nodes.
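Below is a hypothetical PyTorch sketch of one such cell: two branches, each with one of the five operations, a dropout layer, batch normalization, and an optional skip-sum; the branch outputs are added. The separable convolutions are simplified to depthwise convolutions, and all names are illustrative rather than the paper's implementation.

```python
import torch.nn as nn

def make_op(name, ch):
    """Five operation choices; depthwise conv stands in for separable conv."""
    return {
        "sep_conv_3x3": nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
        "sep_conv_5x5": nn.Conv2d(ch, ch, 5, padding=2, groups=ch),
        "avg_pool_3x3": nn.AvgPool2d(3, stride=1, padding=1),
        "max_pool_3x3": nn.MaxPool2d(3, stride=1, padding=1),
        "identity":     nn.Identity(),
    }[name]

class Branch(nn.Module):
    def __init__(self, ch, op="sep_conv_3x3", dropout=0.3, skip_sum=True):
        super().__init__()
        self.op = make_op(op, ch)
        self.dropout = nn.Dropout(dropout)   # dropout ratio in {0.0, 0.3, 0.6}
        self.bn = nn.BatchNorm2d(ch)
        self.skip_sum = skip_sum             # identity vs. skip-sum connection

    def forward(self, x):
        out = self.bn(self.dropout(self.op(x)))
        return out + x if self.skip_sum else out

class Cell(nn.Module):
    def __init__(self, ch, cfg_a, cfg_b):
        super().__init__()
        self.a = Branch(ch, **cfg_a)
        self.b = Branch(ch, **cfg_b)

    def forward(self, x):
        return self.a(x) + self.b(x)         # branch outputs are added
```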

C.3 EXPERIMENTAL RESULTS

Here we set the exploration size to 20 for all sample-based methods. For the hyper-parameter design space, we compare FALCON with the baselines that support hyper-parameter tuning. Moreover, to accelerate the search, FALCON explores an unknown design by fine-tuning a pretrained model for several epochs with the selected hyper-parameters, instead of training each candidate design from scratch. The results are summarized in Table 13. For the architecture design space, we compare against ENAS and DARTS, with a learning rate of 0.01 and a maximum of 300 training epochs. We repeat each experiment three times and summarize the results in Table 14.

D SENSITIVITY ANALYSIS

We analyze the sensitivity of the hyper-parameters of FALCON's search strategy on a node classification task, CiteSeer, and a graph classification task, Tox21-AhR. Specifically, we study the influence of the number of random start nodes and of the candidate scale induced by an explored node (i.e., how many hops of neighbors of an explored node are added to the candidate set).

As shown in Figure 7, FALCON outperforms the best AutoML baselines over a wide range of hyper-parameters: 73% and 97% of FALCON's hyper-parameter combinations rank best among the baselines on CiteSeer and Tox21-AhR, respectively.

Moreover, we discover an interesting insight about the size of the receptive field, i.e., the number of design candidates, during FALCON's search. By the construction of the design subgraph, the receptive field size scales as O(r · d^h), where r is the number of random start nodes, h is the number of neighbor hops, and d is the average node degree. We find that the performance of the designs searched by FALCON increases with the receptive field size until it reaches a certain scale. Such patterns are widely observed across datasets. While the receptive field on the design subgraph should contain enough candidates for sampling good ones, it should also prune the inferior design space, which provides no guidance toward the best-performing designs. Thus, the size of the receptive field may be a crucial factor influencing the search quality of FALCON.

