FASTER HYPERPARAMETER SEARCH FOR GNNS VIA CALIBRATED DATASET CONDENSATION

Abstract

Dataset condensation aims to reduce the computational cost of training multiple models on a large dataset by condensing the training dataset into a small synthetic set. State-of-the-art approaches rely on matching the model gradients for the real and synthetic data and have recently been applied to condense large-scale graphs for node classification tasks. Although dataset condensation may be efficient when training multiple models for hyperparameter optimization, there is no theoretical guarantee on the generalizability of the condensed data: in practice, condensed data often generalizes poorly across hyperparameters/architectures, and we find and prove that this overfitting is much more severe on graphs. In this paper, we consider a different condensation objective specifically geared towards hyperparameter search. We aim to generate the synthetic dataset so that the validation-performance rankings of models with different hyperparameters on the condensed and original datasets are comparable. We propose a novel hyperparameter-calibrated dataset condensation (HCDC) algorithm, which obtains the synthetic validation data by matching the hyperparameter gradients computed via implicit differentiation and an efficient inverse-Hessian approximation. HCDC employs a supernet with differentiable hyperparameters, making it suitable for modeling GNNs with widely different convolution filters. Experiments demonstrate that the proposed framework effectively maintains the validation-performance rankings of GNNs and speeds up hyperparameter/architecture search on graphs.

1. INTRODUCTION

Graph neural networks (GNNs) have found remarkable success in tackling a variety of graph-related tasks (Hamilton, 2020). However, the prevalence of large-scale graphs in real-world contexts, such as social, information, and biological networks (Hu et al., 2020), which frequently scale up to millions or billions of nodes and edges, poses significant computational issues for training GNNs. While training a single model can be expensive, designing deep learning models for new tasks requires substantially more computation, as this involves training multiple models on the same dataset many times to verify design choices (e.g., the architecture and hyperparameter choice (Elsken et al., 2019)). To this end, we consider the following question: how can we reduce the computational cost of training multiple models on the same dataset for hyperparameter search/optimization? Natural approaches to reduce the training set size include graph coreset selection (Baker et al., 2020), graph sparsification (Batson et al., 2013), graph coarsening (Loukas, 2019), and graph sampling (Zeng et al., 2019). However, all of these methods select samples from the given training set, which limits their performance. A more effective alternative is to synthesize informative samples rather than select from the given ones. Dataset condensation (Zhao et al., 2020) has emerged as a promising mechanism for such data synthesis. It aims to produce a small synthetic training set such that a model trained on the synthetic set attains testing accuracy comparable to one trained on the original training set.
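The gradient-matching idea behind state-of-the-art condensation methods can be sketched as follows. This is a minimal, illustrative PyTorch snippet (all function names are ours, not from any of the cited works): the synthetic features are a learnable tensor updated so that the parameter gradients they induce mimic those induced by the real data.

```python
import torch

def gradient_matching_loss(model, loss_fn, real_batch, syn_data):
    """Distance between the model gradients on real and synthetic data.

    Minimizing this w.r.t. syn_data (a learnable tensor) drives the
    synthetic set to induce the same parameter updates as the real set.
    Illustrative sketch only, not the cited papers' implementation.
    """
    x_real, y_real = real_batch
    x_syn, y_syn = syn_data
    # Gradients on real data are targets; no higher-order graph needed.
    g_real = torch.autograd.grad(loss_fn(model(x_real), y_real),
                                 model.parameters())
    # Gradients on synthetic data keep the graph so the matching loss
    # can be backpropagated into x_syn.
    g_syn = torch.autograd.grad(loss_fn(model(x_syn), y_syn),
                                model.parameters(), create_graph=True)
    # Sum of per-layer squared distances (cosine distance is also common).
    return sum(((gr.detach() - gs) ** 2).sum()
               for gr, gs in zip(g_real, g_syn))
```

In a full condensation loop, this loss is minimized over the synthetic data while the model parameters are periodically retrained, alternating the two updates.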
Although dataset condensation achieves state-of-the-art performance for neural networks trained on condensed samples, this technique is inadequate for accelerating hyperparameter search/optimization: (1) theoretically, dataset condensation obtains synthetic samples that minimize the performance drop of a specific model, but there is no performance guarantee when using this condensed data to train other models; and (2) in practice, it is unclear how condensation methods compare with strong baselines such as various coreset methods in terms of their ability to preserve the outcome of architecture/hyperparameter optimization. In this paper, we identify the poor generalizability of existing graph condensation approaches (Jin et al., 2021) across architectures/hyperparameters, a topic overlooked in the existing literature, which focuses mostly on image condensation. We prove that graph condensation fails to preserve the validation-performance rankings of GNN architectures and identify two dominant reasons for this failure: (1) most GNNs differ from each other in their convolution filter design; thus, when performing condensation with a single GNN, the condensed data overfits to the corresponding GNN filter, a single biased point in the set of GNN filters; and (2) the learned adjacency matrix of the synthetic graph considerably overfits the condensed data and thus fails to maintain the characteristics of the original adjacency matrix. To solve this poor-generalizability issue, we develop a new dataset condensation framework that preserves the outcome of hyperparameter search/optimization on the condensed data. We propose to learn the synthetic data, as well as its validation split, such that the validation-performance rankings of architectures on the synthetic and original datasets are comparable.
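To make point (1) concrete: different GNNs amount to applying different fixed convolution filters to the same features, so condensed data tuned to one filter need not suit another. The filter definitions below are standard (GCN-style symmetric normalization and mean aggregation); the interpolation weight `alpha` is our own illustrative device for viewing the filter choice as a continuous, differentiable hyperparameter.

```python
import torch

def gcn_filter(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as in GCN."""
    A_hat = A + torch.eye(A.shape[0])
    d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def mean_filter(A):
    """Row-normalized mean aggregation D^{-1} (A + I)."""
    A_hat = A + torch.eye(A.shape[0])
    return A_hat / A_hat.sum(dim=1, keepdim=True)

def interpolated_filter(A, alpha):
    """A continuous family of filters: alpha acts as a differentiable
    hyperparameter interpolating between two filter designs."""
    return alpha * gcn_filter(A) + (1 - alpha) * mean_filter(A)
```

Condensing against a single fixed filter corresponds to fixing one point of this family; the continuous parameterization is what allows hyperparameter gradients to be defined at all.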
Under the assumption of a continuous hyperparameter space or a generic supernet which interpolates all architectures, we find and prove that the goal of preserving validation-performance rankings can be realized by matching the hyperparameter gradients on the synthetic and original validation data. The hyperparameter gradients (or hypergradients for short) can be efficiently computed with constant memory overhead via the implicit function theorem (IFT) and a Neumann-series approximation of the inverse Hessian (Lorraine et al., 2020). Consequently, we propose a hyperparameter-calibrated dataset condensation (HCDC) framework assuming continuous hyperparameters, which is suitable for modeling GNNs with different convolution matrices. Experiments demonstrate the effectiveness of the proposed framework in preserving the performance rankings of GNNs. Although beyond the scope of this paper, HCDC also has the potential to be combined with the supernets in differentiable neural architecture search (differentiable NAS) methods (Liu et al., 2018) to tackle the general neural architecture space for image and text data. Our contributions can be summarized as follows: (1) We formulate a new dataset condensation objective for hyperparameter optimization and propose the hyperparameter-calibrated dataset condensation (HCDC) framework, which learns synthetic validation data by matching the hypergradients. (2) We prove the hardness of generalizing the condensed graph across GNN architectures and the validity of HCDC in preserving the validation-performance rankings of GNNs. (3) Experiments demonstrate the effectiveness of HCDC in further reducing the search time of off-the-shelf graph NAS algorithms, from several hours to minutes on graphs with millions of nodes.
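The IFT-based hypergradient with a Neumann-series inverse-Hessian approximation, in the spirit of Lorraine et al. (2020), can be sketched as follows. This is an illustrative PyTorch implementation with hypothetical names, not the paper's code; it relies only on Hessian-vector products, so memory stays constant in the number of series terms.

```python
import torch

def neumann_inverse_hvp(v, train_loss, params, alpha=0.1, k=20):
    """Approximate H^{-1} v by the truncated Neumann series
    H^{-1} v ≈ alpha * sum_{j=0}^{k} (I - alpha * H)^j v,
    where H is the Hessian of train_loss w.r.t. params.
    Only Hessian-vector products are computed, never H itself."""
    p = [vi.clone() for vi in v]    # current term (I - alpha*H)^j v
    acc = [vi.clone() for vi in v]  # running sum, starts at the j=0 term
    grads = torch.autograd.grad(train_loss, params, create_graph=True)
    for _ in range(k):
        hvp = torch.autograd.grad(grads, params, grad_outputs=p,
                                  retain_graph=True)
        p = [pi - alpha * hi for pi, hi in zip(p, hvp)]
        acc = [ai + pi for ai, pi in zip(acc, p)]
    return [alpha * ai for ai in acc]

def ift_hypergradient(val_loss, train_loss, params, hparams,
                      alpha=0.1, k=20):
    """IFT hypergradient at a training optimum:
    dL_val/dλ = -(∂²L_train/∂λ∂θ) · H^{-1} · ∂L_val/∂θ
    (assuming L_val depends on λ only through the trained θ)."""
    v = torch.autograd.grad(val_loss, params, retain_graph=True)
    ihvp = neumann_inverse_hvp(v, train_loss, params, alpha, k)
    g_train = torch.autograd.grad(train_loss, params, create_graph=True)
    # Mixed second derivative contracted with the inverse-Hessian product.
    mixed = torch.autograd.grad(g_train, hparams, grad_outputs=ihvp)
    return [-m for m in mixed]
```

On a toy quadratic, e.g. L_train = ½(θ − λ)² and L_val = ½(θ − 1)², the trained parameter is θ*(λ) = λ and the exact hypergradient is λ − 1, which the approximation recovers to high accuracy for moderate k.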

2.1. SETTINGS: NODE CLASSIFICATION AND GNNS

This paper adopts graph learning notations, but HCDC is generally applicable to other data, tasks, and models; see Appendix B for discussions. Node classification considers a graph T = (A, X, y) with adjacency matrix A ∈ {0, 1}^{n×n}, node features X ∈ R^{n×d}, node class labels y, and mutually disjoint node splits V_train ∪ V_val ∪ V_test = [n]. Using a graph neural network (GNN) f_{θ,λ} : R_{≥0}^{n×n} × R^{n×d} → R^{n×K}, where θ ∈ Θ denotes the parameters and λ ∈ Λ denotes the hyperparameters (if they exist), we aim to find θ_T = argmin_θ L^train_T(θ, λ), where L^train_T(θ, λ) := Σ_{i ∈ V_train} ℓ([f_{θ,λ}(A, X)]_i, y_i) and ℓ(ŷ, y) is the cross-entropy loss. The node classification loss L^train_T(θ, λ) is stated under the transductive setting; it can be easily generalized to the inductive setting by assuming only {A_ij | i, j ∈ V_train} and {X_i | i ∈ V_train} are used during training.

2.2. BACKGROUND: STANDARD DATASET CONDENSATION METHODS

Now, we review the standard dataset condensation (SDC) and its natural bilevel optimization (BL) formulation (Wang et al., 2018):

S* = argmin_S L^train_T(θ^S, λ)   subject to   θ^S = argmin_θ L^train_S(θ, λ),

where L^train_S denotes the training loss on the synthetic dataset S, defined analogously to L^train_T.
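As a concrete illustration of the transductive setting from Section 2.1, the following minimal sketch (a one-layer GCN-style model; names are illustrative, not the paper's implementation) computes L^train_T by restricting the cross-entropy to V_train while the forward pass still uses the full graph:

```python
import torch
import torch.nn.functional as F

def train_loss_transductive(model, A, X, y, train_idx):
    """L^train_T(θ, λ): cross-entropy on V_train only, while the
    forward pass sees the whole graph (transductive setting)."""
    logits = model(A, X)  # shape [n, K]
    return F.cross_entropy(logits[train_idx], y[train_idx])

class OneLayerGCN(torch.nn.Module):
    """Illustrative f_{θ,λ}(A, X): one row-normalized graph convolution."""
    def __init__(self, d, K):
        super().__init__()
        self.lin = torch.nn.Linear(d, K)

    def forward(self, A, X):
        A_hat = A + torch.eye(A.shape[0])     # add self-loops
        deg = A_hat.sum(dim=1, keepdim=True)  # node degrees (>= 1)
        return self.lin((A_hat / deg) @ X)    # propagate, then transform
```

The inductive variant would instead build `A_hat` and `X` only from the rows/columns indexed by V_train before the forward pass.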

