FASTER HYPERPARAMETER SEARCH FOR GNNS VIA CALIBRATED DATASET CONDENSATION

Abstract

Dataset condensation aims to reduce the computational cost of training multiple models on a large dataset by condensing the training dataset into a small synthetic set. State-of-the-art approaches rely on matching the model gradients for the real and synthetic data and have recently been applied to condense large-scale graphs for node classification tasks. Although dataset condensation may be efficient when training multiple models for hyperparameter optimization, there is no theoretical guarantee on the generalizability of the condensed data: data condensation often generalizes poorly across hyperparameters/architectures in practice, while we find and prove this overfitting is much more severe on graphs. In this paper, we consider a different condensation objective specifically geared towards hyperparameter search. We aim to generate the synthetic dataset so that the validation-performance rankings of the models, with different hyperparameters, on the condensed and original datasets are comparable. We propose a novel hyperparameter-calibrated dataset condensation (HCDC) algorithm, which obtains the synthetic validation data by matching the hyperparameter gradients computed via implicit differentiation and efficient inverse Hessian approximation. HCDC employs a supernet with differentiable hyperparameters, making it suitable for modeling GNNs with widely different convolution filters. Experiments demonstrate that the proposed framework effectively maintains the validation-performance rankings of GNNs and speeds up hyperparameter/architecture search on graphs.

1. INTRODUCTION

Graph neural networks (GNNs) have found remarkable success in tackling a variety of graph-related tasks (Hamilton, 2020) . However, the prevalence of large-scale graphs in real-world contexts, such as social, information, and biological networks (Hu et al., 2020) , which frequently scale up to millions/billions of nodes and edges, poses significant computational issues for training GNNs. While training a single model can be expensive, designing deep learning models for new tasks requires substantially more computation, as this involves training multiple models on the same dataset many times to verify the design choice (e.g., the architecture and hyperparameter choice (Elsken et al., 2019) ). Towards this end, we consider the following question: how can we reduce the computational cost for training multiple models on the same dataset, for hyperparameter search/optimization? Natural approaches to reduce the training set size include methods such as graph coreset selection (Baker et al., 2020) , graph sparsification (Batson et al., 2013) , graph coarsening (Loukas, 2019) and graph sampling (Zeng et al., 2019) . However, all of these methods involve selecting samples from the given training set, which limits the performance. A more effective alternative is to synthesize informative samples rather than select from the given samples. Dataset condensation (Zhao et al., 2020) has emerged as a competent data mechanism to synthesize data, with promising results. It aims to produce a small synthetic training set such that a model trained on the synthetic set obtains testing accuracy comparable to that trained on the original training set. 
Although dataset condensation achieves the state-of-the-art performance for neural networks trained on condensed samples, this technique is inadequate for accelerating hyperparameter search/optimization, as: (1) theoretically, dataset condensation obtains synthetic samples that minimize the performance drop of a specific model; however, there is no performance guarantee when using this condensed data to train other models, and (2) in practice, it is unclear how condensation methods compare with strong baselines such as various coreset methods, in terms of their ability to preserve the outcome of architecture/hyperparameter optimization. In this paper, we identify the poor generalizability of existing condensed data approaches on graphs (Jin et al., 2021) across architectures/hyperparameters, as this topic has been overlooked in the existing literature, which focuses more on image condensation. We prove that graph condensation fails to preserve validation performance rankings of GNN architectures, and identify two dominant reasons for this failure: (1) most GNNs differ from each other in terms of their convolution filter design. Thus, when performing condensation with a single GNN, the condensed data is overfitted to the corresponding GNN filter, a single biased point in the set of GNN filters; and (2) the learned adjacency matrix of the synthetic graph considerably overfits the condensed data, and thus fails to maintain the characteristics of the original adjacency matrix. To solve the poor generalizability issue, we develop a new dataset condensation framework that preserves the outcome of hyperparameter search/optimization on the condensed data. We propose to learn synthetic data as well as its validation split such that the validation performance ranking of architectures on the synthetic and original datasets are comparable. 
Under the assumption of a continuous hyperparameter space or a generic supernet which interpolates all architectures, we find and prove that the goal of preserving validation performance rankings can be realized by matching the hyperparameter gradients on the synthetic and original validation data. The hyperparameter gradients (or hypergradients for short) can be efficiently computed with constant memory overhead by the implicit function theorem (IFT) and the Neumann series approximation of an inverse Hessian (Lorraine et al., 2020). Consequently, we propose a hyperparameter calibrated dataset condensation (HCDC) framework assuming continuous hyperparameters, which is suitable for modeling GNNs with different convolution matrices. Experiments demonstrate the effectiveness of the proposed framework in preserving the performance rankings of GNNs. Although beyond the scope of this paper, HCDC also has the potential to be combined with the supernets in differentiable neural architecture search (differentiable NAS) methods (Liu et al., 2018) to tackle the general neural architecture space for image and text data. Our contributions can be summarized as follows: (1) We formulate a new dataset condensation objective for hyperparameter optimization and propose the hyperparameter calibrated dataset condensation (HCDC) framework that learns synthetic validation data by matching the hypergradients. (2) We prove the hardness of generalizing the condensed graph across GNN architectures, and the validity of HCDC in preserving the validation performance rankings of GNNs. (3) Experiments demonstrate the effectiveness of HCDC in further reducing the search time of off-the-shelf graph NAS algorithms, from several hours to minutes on graphs with millions of nodes.

2.1. SETTINGS: NODE CLASSIFICATION AND GNNS

This paper adopts graph learning notations, but HCDC is generally applicable to other data, tasks, and models; see Appendix B for discussions. Node classification on a graph considers a graph $T = (A, X, y)$ with adjacency matrix $A \in \{0, 1\}^{n \times n}$, node features $X \in \mathbb{R}^{n \times d}$, node class labels $y$, and mutually disjoint node splits $V_{train} \cup V_{val} \cup V_{test} = [n]$. Using a graph neural network (GNN) $f_{\theta,\lambda} : \mathbb{R}^{n \times n}_{\geq 0} \times \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times K}$, where $\theta \in \Theta$ denotes the parameters and $\lambda \in \Lambda$ denotes the hyperparameters (if they exist), we aim to find $\theta^T = \arg\min_\theta L^{train}_T(\theta, \lambda)$, where $L^{train}_T(\theta, \lambda) := \sum_{i \in V_{train}} \ell([f_{\theta,\lambda}(A, X)]_i, y_i)$ and $\ell(\hat{y}, y)$ is the cross-entropy loss. The node classification loss $L^{train}_T(\theta, \lambda)$ is under the transductive setting, which can be easily generalized to the inductive setting by assuming only $\{A_{ij} \mid i, j \in V_{train}\}$ and $\{X_i \mid i \in V_{train}\}$ are used during training.
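The transductive training loss above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `linear_gnn` is a hypothetical one-layer model with a row-normalized `A + I` convolution (not a model from the paper), and the graph, features, and labels are random placeholders.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def linear_gnn(A, X, W):
    # Hypothetical one-layer model: row-normalized (A + I) as the convolution matrix.
    C = A + np.eye(A.shape[0])
    C = C / C.sum(axis=1, keepdims=True)
    return C @ X @ W  # logits of shape (n, K)

def train_loss(A, X, y, W, train_idx):
    # Transductive setting: the forward pass uses all of A and X,
    # but the cross-entropy is summed over training nodes only.
    probs = softmax(linear_gnn(A, X, W))
    return -np.log(probs[train_idx, y[train_idx]]).sum()

rng = np.random.default_rng(0)
n, d, K = 6, 4, 3
A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = (A + A.T).astype(float)          # symmetric 0/1 adjacency, no self-loops
X = rng.normal(size=(n, d))
y = rng.integers(0, K, size=n)
W = np.zeros((d, K))                 # zero weights give uniform predictions
train_idx = np.array([0, 1, 2])
loss = train_loss(A, X, y, W, train_idx)
```

With zero weights every class has probability 1/K, so the loss equals the number of training nodes times log K, which is a convenient sanity check.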

2.2. BACKGROUND: STANDARD DATASET CONDENSATION METHODS

Now, we review the standard dataset condensation (SDC) and its natural bilevel optimization (BL) formulation (Wang et al., 2018). SDC's objective. Standard dataset condensation aims to find a synthetic graph $S = (A', X', y')$ of size $c \ll n$, with (weighted) adjacency matrix $A' \in \mathbb{R}^{c \times c}_{\geq 0}$, node features $X' \in \mathbb{R}^{c \times d}$, and node labels $y' \in [K]^c$, such that a model trained on $S$ obtains test performance comparable to that of a model trained on $T$: $S^* = \arg\min_S L^{test}_T(\theta^S(S), \lambda)$ s.t. $\theta^S(S) := \arg\min_\theta L^{train}_S(\theta, \lambda)$. (1) However, the above problem involves a nested-loop optimization, and solving the inner loop for $\theta^S(S)$ at each iteration requires a computationally expensive procedure: unrolling the neural network's computational graph for $S$ over multiple optimization steps of $\theta$. SDC in a gradient matching formulation. Zhao et al. (2020) alleviate this computational issue by introducing a gradient matching (GM) formulation. Parameter-matching formulation. To start, we assume the neural network $f_{\theta,\lambda}$ is a locally smooth function, and thus similar weights $\theta^S \approx \theta^T$ imply similar mappings. Then one can formulate the condensation objective as matching the optimized parameters (which depend on the initialization $\theta_0$), i.e., finding $S^* = \arg\min_S \mathbb{E}_{\theta_0 \sim P_{\theta_0}}[D(\theta^S(S, \theta_0), \theta^T(\theta_0))]$ s.t. $\theta^S(S, \theta_0) := \arg\min_\theta L^{train}_S(\theta(\theta_0), \lambda)$, where $\theta^T(\theta_0) := \arg\min_\theta L^{train}_T(\theta(\theta_0), \lambda)$ and $D(\cdot, \cdot)$ is a distance function. Reduction to gradient matching via approximations. The parameter-matching problem is still a bilevel optimization but can be simplified via several approximations. (1) Firstly, $\theta^S(S, \theta_0)$ is approximated by the output of an incomplete gradient-descent optimization, $\theta^S(S, \theta_0) \approx \theta^S_{t+1} \leftarrow \theta^S_t - \eta \nabla_\theta L^{train}_S(\theta^S_t, \lambda)$. However, the target parameter $\theta^T(\theta_0)$ may be far away from $\theta^S_{t+1}$. Zhao et al. (2020) propose to match $\theta^S_{t+1}$ with the incompletely optimized $\theta^T_{t+1} \leftarrow \theta^T_t - \eta \nabla_\theta L^{train}_T(\theta^T_t, \lambda)$ at each iteration $t$. Consequently, the SDC's objective is now $S^* = \arg\min_S \mathbb{E}_{\theta_0 \sim P_{\theta_0}}[\sum_{t=0}^{T-1} D(\theta^S_t, \theta^T_t)]$. (2) Secondly, if $\theta^T_t$ is tracked by $\theta^S_t$ (i.e., $\theta^S_t \approx \theta^T_t$) up to iteration $t$, the two update rules differ only in their gradient terms, so $D(\theta^S_{t+1}, \theta^T_{t+1})$ can be replaced by $D(\nabla_\theta L^{train}_S(\theta^S_t, \lambda), \nabla_\theta L^{train}_T(\theta^S_t, \lambda))$.
Repeating this inductive argument, the standard condensation objective is finally approximated by matching the gradients at each iteration $t$, $S^* = \arg\min_S \mathbb{E}_{\theta_0 \sim P_{\theta_0}}\left[\sum_{t=0}^{T-1} D\left(\nabla_\theta L^{train}_S(\theta^S_t, \lambda), \nabla_\theta L^{train}_T(\theta^S_t, \lambda)\right)\right]$. (2) With this gradient matching objective, we obtain a single deep network with parameters $\theta$ trained on the condensed graph $S$. The condensed graph $S$ is optimized such that the distance between the gradient vectors of $L^{train}_T$ and of $L^{train}_S$ w.r.t. the parameters $\theta$ is minimized. The cosine distance $D(\cdot, \cdot) = 1 - \cos(\cdot, \cdot)$ works well in practice (Zhao et al., 2020).
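As a rough illustration of the gradient matching objective, the sketch below evaluates the summed cosine distance between training gradients on two datasets along a fixed parameter trajectory. It assumes a least-squares loss on pre-aggregated features `Z` (standing in for C(A)X); the helper names and the random data are hypothetical and not taken from the paper.

```python
import numpy as np

def lsq_grad(Z, Y, W):
    # Gradient of 0.5 * ||Z W - Y||_F^2 with respect to W.
    return Z.T @ (Z @ W - Y)

def cosine_distance(g1, g2, eps=1e-12):
    # D(g1, g2) = 1 - cos(g1, g2), computed on flattened gradients.
    g1, g2 = g1.ravel(), g2.ravel()
    return 1.0 - (g1 @ g2) / (np.linalg.norm(g1) * np.linalg.norm(g2) + eps)

def gradient_matching_loss(real, synth, W_traj):
    # Sum of distances between synthetic and real gradients,
    # both evaluated at the same parameters W along the trajectory.
    return sum(
        cosine_distance(lsq_grad(*synth, W), lsq_grad(*real, W))
        for W in W_traj
    )

rng = np.random.default_rng(0)
Z_real, Y_real = rng.normal(size=(20, 4)), rng.normal(size=(20, 2))
Z_syn, Y_syn = rng.normal(size=(5, 4)), rng.normal(size=(5, 2))
W_traj = [rng.normal(size=(4, 2)) for _ in range(3)]

loss_random = gradient_matching_loss((Z_real, Y_real), (Z_syn, Y_syn), W_traj)
# Rescaling a dataset changes gradient magnitudes but not directions,
# so the cosine distance to the original data stays at zero.
loss_scaled = gradient_matching_loss((Z_real, Y_real), (2 * Z_real, 2 * Y_real), W_traj)
```

In an actual condensation loop the synthetic data would be updated by differentiating this loss with respect to `Z_syn` and `Y_syn`; that step is omitted here.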

2.3. CHALLENGES: STANDARD DATASET CONDENSATION IS PROBLEMATIC ACROSS GNNS

For ease of theoretical discussions, in this subsection, we consider single-layer message passing GNNs. Message passing GNNs can be interpreted as iterative convolution over nodes (i.e., message passing) (Ding et al., 2021), where $X^{(0)} = X$, $X^{(l+1)} = \sigma(C_{\alpha^{(l)}}(A) X^{(l)} W^{(l)})$ for $l \in [L]$, and $f(A, X) = X^{(L)}$, where $C_{\alpha^{(l)}}(A)$ is the convolution matrix parametrized by $\alpha^{(l)}$, $W^{(l)}$ is the learnable linear weights, and $\sigma(\cdot)$ denotes the non-linearity. One-dimensional convolution neural networks (1D-CNNs) can be expressed by a similar formula, $f(X) = (\sum_{k=-K}^{K} \alpha^{(k)} P^k) X W$, parameterized by $\theta = [\alpha, W]$ where $\alpha = [\alpha^{(-K)}, \ldots, \alpha^{(K)}]$. $P$ is the cyclic permutation matrix (of a unit shift). The kernel size is $(2K + 1)$, $K \geq 0$; see Appendix B.2 for details. Despite the success of the gradient matching algorithm in preserving the model performance when trained on the condensed dataset (Wang et al., 2022), it naturally overfits the model $f_{\theta,\lambda}$ used during condensation and generalizes poorly to others. There is no guarantee that the condensed synthetic data $S^*$ which minimizes the objective (Eq. (2)) for a specific model $f_{\theta,\lambda}$ (marked by its hyperparameter $\lambda$) can generalize well to other models $f_{\theta,\lambda'}$ where $\lambda' \neq \lambda$. We aim to demonstrate that this overfitting issue can be much more severe on graphs than on images, where our main theoretical results can be informally summarized as follows. Proposition. Standard dataset condensation using the gradient matching algorithm (Eq. (2)) is problematic across GNNs. The condensed graph using a single-layer message passing GNN may fail to generalize to other GNNs with a different convolution matrix. We first show the successful generalization of SDC across one-dimensional convolution neural networks (1D-CNNs). Then, we show a contrary result on GNNs: failed generalization of SDC across GNNs. These theoretical analyses demonstrate the hardness of data condensation on graphs.
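The linear 1D-CNN formula can be made concrete with a small NumPy sketch. It uses the fact that multiplying by the cyclic shift matrix P^k is the same as rolling the rows of X by k positions. The function names and the random inputs are illustrative assumptions; the final check shows why a larger kernel acts as a "supernet" of a smaller one: zero-padding the kernel weights leaves the output unchanged.

```python
import numpy as np

def cyclic_shift_matrix(n):
    # P @ x moves every entry of x down by one position, cyclically.
    return np.roll(np.eye(n), 1, axis=0)

def conv1d_linear(X, alpha, W):
    # f(X) = (sum_{k=-K}^{K} alpha^(k) P^k) X W, with alpha of length 2K + 1
    # holding alpha^(-K), ..., alpha^(K). P^k X equals np.roll(X, k, axis=0).
    K = (len(alpha) - 1) // 2
    mixed = sum(a * np.roll(X, k, axis=0) for a, k in zip(alpha, range(-K, K + 1)))
    return mixed @ W

rng = np.random.default_rng(0)
n, d, out = 8, 3, 2
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, out))

# Rolling by one row agrees with multiplying by P.
P = cyclic_shift_matrix(n)
shift_matches = np.allclose(P @ X, np.roll(X, 1, axis=0))

# A kernel with K' = 1 embedded into a kernel with K = 2 by zero-padding.
alpha_small = np.array([0.2, 0.5, 0.3])
alpha_padded = np.concatenate([[0.0], alpha_small, [0.0]])
same_output = np.allclose(
    conv1d_linear(X, alpha_small, W), conv1d_linear(X, alpha_padded, W)
)
```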
Our analysis is based on the achievability condition of a gradient matching objective; see Assumption 1 in Appendix A. In Lemma 1 of Appendix C.1, we show that, under least-squares regression with a linear GNN/CNN (see Appendix B.4 for formal definitions), if the standard dataset condensation GM objective is achievable, then the optimizer on the condensed dataset S is also optimal on the original dataset T. Now, we study the generalizability of the condensed dataset across different models. We first show a successful generalization of SDC across different 1D-CNN networks; see Proposition 1 in Appendix A. As long as we use a 1D-CNN with a sufficiently large kernel size K during condensation, we can generalize the condensed dataset to a wide range of models, i.e., 1D-CNNs with a kernel size K′ ≤ K. However, we obtain a contrary result for GNNs in terms of the generalizability of condensed datasets across models. We identify two dominant effects that cause the condensed graph to fail to generalize across GNNs. Firstly, the learned adjacency A′ of the synthetic graph S can easily overfit the condensation objective, and thus can fail to maintain the characteristics of the original structure and distinguish between different architectures; see Proposition 2 in Appendix A for the theoretical result and Table 1 for relevant experiments. Secondly, GNNs differ from each other mostly in the design of the convolution filter C(A), i.e., how the convolution weights C depend on the adjacency information A. The convolution filter C(A) used during condensation is a single biased point in "the space of convolutions" (see Fig. 1 for a visualization); thus there is a mismatch of inductive bias when transferring to a different GNN.

Table 1: Test accuracy of GNNs trained on the condensed Ogbn-arxiv (Hu et al., 2020) graph, verifying the two effects (Propositions 2 and 3) that hinder the generalization of the condensed graph across GNNs.
These two effects create the obstacle to transferring the condensed graph across GNNs, which is formally characterized by Proposition 3 in Appendix A. Proposition 3 provides an effective lower bound on the relative estimation error of optimal model parameters when a different convolution filter $C'(\cdot) \neq C(\cdot)$ is used. According to the spectral characterization of convolution filters of GNNs (Table 1 of Balcilar et al., 2021), we can approximately compute the maximum eigenvalue of Q for some GNNs. For example, if we condense with $f^C$, a graph isomorphism network (GIN-0) (Xu et al., 2018), but train $f^{C'}$, a GCN, on the condensed graph, we have $\|W^S_{C'} - W^T_{C'}\| / \|W^T_{C'}\| \gtrsim \overline{deg} + 1$, where $\overline{deg}$ is the average node degree of the original graph. This large lower bound hints at the catastrophic failure when transferring across GIN and GCN; see Table 1.
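To get a feel for the scale mismatch between the two filters, the sketch below compares the largest eigenvalues of a GIN-0 style filter (A + I) and a GCN style filter (symmetrically normalized A + I) on a ring graph. This is an illustration only, not a derivation of Proposition 3: the ring graph, the function names, and the choice to look at top eigenvalues are assumptions made here. On a regular graph the ratio of the two top eigenvalues is exactly deg + 1, the same scale that appears in the bound.

```python
import numpy as np

def gcn_conv(A):
    # Symmetrically normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}.
    A_hat = A + np.eye(len(A))
    d = A_hat.sum(axis=1)
    return A_hat / np.sqrt(np.outer(d, d))

def gin0_conv(A):
    # GIN with epsilon = 0 aggregates the node itself plus its neighbors: A + I.
    return A + np.eye(len(A))

# Ring graph on n nodes: every node has degree 2.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

avg_deg = A.sum(axis=1).mean()
top_gcn = np.linalg.eigvalsh(gcn_conv(A)).max()
top_gin = np.linalg.eigvalsh(gin0_conv(A)).max()
ratio = top_gin / top_gcn
```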

3. HYPERPARAMETER CALIBRATED DATASET CONDENSATION

Our goal is to develop an optimal and reliable condensation method for architecture/hyperparameter search. The standard dataset condensation objective (Eq. (1)/Eq. (2)) does not accomplish this goal since it does not generalize across GNNs, as proven in Section 2.3. In this section, we propose a new condensation objective specifically for preserving the outcome of hyperparameter optimization (HPO) on the condensed dataset. HPO's objective. HPO finds the optimal hyperparameter $\lambda^T$ such that the corresponding model $f_{\theta,\lambda^T}$ minimizes the validation loss after training, i.e., $\lambda^T = \arg\min_{\lambda \in \Lambda} L^*_T(\lambda)$, where $L^*_T(\lambda) := L^{val}_T(\theta^T(\lambda), \lambda)$ and $\theta^T(\lambda) := \arg\min_\theta L^{train}_T(\theta, \lambda)$. (HPO) We see HPO itself is a bilevel optimization, where the optimal parameter $\theta^T(\lambda)$ is posed as a function of the hyperparameter $\lambda$, and so is the optimized validation loss $L^*_T(\lambda)$; see Fig. 3 for illustration. Dataset condensation for HPO. If both the train and validation sets are defined on the condensed dataset $S$, the optimal hyperparameter $\lambda^S$ is well-defined. Our goal is to find the synthetic dataset $S$ such that we can obtain comparable validation performance if the hyperparameters are optimized on the condensed dataset, i.e., $L^*_T(\lambda^T) \approx L^*_T(\lambda^S)$. This goal closely resembles the goal of standard dataset condensation, preserving generalization performance $L^{test}_T(\theta^T) \approx L^{test}_T(\theta^S)$, which inspires us to formulate the new objective as a bilevel optimization problem too, $S^* = \arg\min_S L^*_T(\lambda^S(S))$ s.t. $\lambda^S(S) = \arg\min_{\lambda \in \Lambda} L^*_S(\lambda)$, (3) where the optimized validation losses $L^*_T(\cdot)$ and $L^*_S(\cdot)$ are defined following Eq. (HPO). However, two challenges exist: (1) This formulation (Eq. (3)) is a nested optimization (for dataset condensation) over another nested optimization (for HPO), which is challenging to solve as high-order gradients are required. (2) In addition, another challenge lies in the search space/feasible set of the hyperparameters $\Lambda$.
In contrast to parameter optimization, where the search space is usually assumed to be the continuous and unbounded Euclidean space, the search space of hyperparameters $\Lambda$ can be either a discrete set or a continuous one. Examples of discrete hyperparameters include neural network type, width, depth, batch size, etc. Often we face compositions of these discrete- and continuous-natured hyperparameters, and we can either model them all as discrete ones and search by grid search, Bayesian optimization, or reinforcement learning; or relax the discrete search space to a continuous one. Hyperparameter calibration: a sufficient alternative to HPO's objective. To solve the aforementioned two challenges, we propose a sufficient alternative to Eq. (HPO). Specifically, we propose to identify the condensed dataset that preserves the outcome of HPO on $\Lambda$ without solving the HPO objective. We call this hyperparameter calibration, which is formally defined in Definition 1. Definition 1 (Hyperparameter Calibration). Given the original dataset $T$, a generic model $f_{\theta,\lambda}$, and a hyperparameter search space $\Lambda$, we say a condensed dataset $S$ is hyperparameter calibrated if, for any $\lambda_1 \neq \lambda_2 \in \Lambda$, it holds that $\left(L^*_T(\lambda_1) - L^*_T(\lambda_2)\right)\left(L^*_S(\lambda_1) - L^*_S(\lambda_2)\right) > 0$. (HC) In other words, changes of the optimized validation loss on $T$ and $S$ always have the same sign between hyperparameters $\lambda_1 \neq \lambda_2$. It is clear that if hyperparameter calibration (HC) is satisfied, HPO on the original and condensed datasets yields the same result. Therefore, our mission changes to ensuring hyperparameter calibration for every pair of hyperparameters $(\lambda_1, \lambda_2)$ in $\Lambda$. HCDC: hypergradient alignment objective for dataset condensation.
To proceed, we assume the existence of a continuous extension of the search space: the (possibly discrete) search space $\Lambda$ can be extended to a compact and connected set $\tilde{\Lambda} \supset \Lambda$, where we can define a continuation of the generic model $f_{\theta,\lambda}$ on $\tilde{\Lambda}$ so that $f_{\theta,\lambda}$ is differentiable anywhere in $\tilde{\Lambda}$. In Section 4, we will elaborate on how to construct such an extended search space $\tilde{\Lambda}$. Now, with the existence of such a continuous extension of the search space, if we limit our step size to be small, we only need to ensure hyperparameter calibration in Eq. (HC) under the special case that $\lambda_1$ is within the neighborhood of $\lambda_2$, i.e., $\lambda_1 \in B_r(\lambda_2)$ for some $r > 0$. The change in validation loss is approximated up to first order by the hypergradients, i.e., $L^*_T(\lambda_1) - L^*_T(\lambda_2) \approx \langle \nabla_\lambda L^*_T(\lambda), \Delta\lambda \rangle$, where $\lambda_1 = \lambda + \Delta\lambda$, $\lambda_2 = \lambda$ with $r \geq \|\Delta\lambda\|_2 \to 0^+$. The hyperparameter calibration condition within this tiny neighborhood $B_r(\lambda)$ is then simplified to $\nabla_\lambda L^*_T(\lambda) \parallel \nabla_\lambda L^*_S(\lambda)$, i.e., the two hypergradient vectors are aligned and point in the same direction. Assuming the extended search space $\tilde{\Lambda}$ can be covered by the union of many small neighborhoods, we derive the following notion and equivalence relation of hypergradient alignment. Definition 2 (Hypergradient Alignment). We say hypergradients are aligned in an extended search space $\tilde{\Lambda}$ if, for any $\lambda \in \tilde{\Lambda}$, it holds that $\nabla_\lambda L^*_T(\lambda) \parallel \nabla_\lambda L^*_S(\lambda)$, i.e., $\cos(\nabla_\lambda L^*_T(\lambda), \nabla_\lambda L^*_S(\lambda)) = 1$. Theorem 1 (Equivalence between Hypergradient Alignment and Hyperparameter Calibration). Hypergradient alignment (Definition 2) is equivalent to hyperparameter calibration (Definition 1) on a connected and compact set, e.g., the extended search space $\tilde{\Lambda}$. We summarize the relations between the two notions (Definitions 1 and 2) as follows: Hypergrad. Alignment in $\tilde{\Lambda}$ $\iff$ Hyperpara. Calibration in $\tilde{\Lambda}$ $\implies$ Hyperpara. Calibration in $\Lambda$.
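The step from the local sign condition to parallel hypergradients can be spelled out as a short first-order calculation. This is a sketch rather than the proof of Theorem 1; the shorthand $g_T$, $g_S$, $u$, $v$ and the test direction $\epsilon(u - v)$ are introduced here for illustration, and both hypergradients are assumed to be nonzero at $\lambda$.

```latex
\begin{align*}
& g_T := \nabla_\lambda L^*_T(\lambda), \quad
  g_S := \nabla_\lambda L^*_S(\lambda), \quad
  u := g_T / \|g_T\|, \quad v := g_S / \|g_S\|. \\
& \text{Eq. (HC) to first order:} \quad
  \langle g_T, \Delta\lambda \rangle \, \langle g_S, \Delta\lambda \rangle \ge 0
  \quad \text{for all small } \Delta\lambda. \\
& \text{If } g_S = c\, g_T \text{ with } c > 0: \quad
  \langle g_T, \Delta\lambda \rangle \, \langle g_S, \Delta\lambda \rangle
  = c \, \langle g_T, \Delta\lambda \rangle^2 \ge 0. \\
& \text{Otherwise, take } \Delta\lambda = \epsilon\,(u - v): \quad
  \langle g_T, \Delta\lambda \rangle \, \langle g_S, \Delta\lambda \rangle
  = -\epsilon^2 \, \|g_T\| \, \|g_S\| \, \bigl(1 - \cos(g_T, g_S)\bigr)^2 < 0.
\end{align*}
```

To first order, then, the sign condition can hold for every small step only when the cosine between the two hypergradients equals 1, which is the alignment condition of Definition 2.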
Therefore, hypergradient alignment on $\tilde{\Lambda}$ is sufficient to ensure hyperparameter calibration on $\Lambda$, and hence the outcome of HPO over $\Lambda$ is preserved. Consequently, as the core of our hyperparameter calibrated dataset condensation (HCDC), we propose the hypergradient alignment objective below, $S^* = \arg\min_S \sum_{\lambda \in \tilde{\Lambda}} D\left(\nabla_\lambda L^{val}_T(\theta^T(\lambda), \lambda), \nabla_\lambda L^{val}_S(\theta^S(\lambda), \lambda)\right)$, (HCDC) where the cosine distance $D(\cdot, \cdot) = 1 - \cos(\cdot, \cdot)$ is used.
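Definition 1 can be checked directly once the optimized validation losses are available for each candidate hyperparameter. The sketch below is a minimal illustration of that pairwise sign test; the function name and the loss values are made up for the example and are not results from the paper.

```python
import numpy as np
from itertools import combinations

def is_hyperparameter_calibrated(loss_T, loss_S):
    # Definition 1: for every pair of hyperparameters, the loss differences
    # on the original data T and on the condensed data S share the same sign.
    for i, j in combinations(range(len(loss_T)), 2):
        if (loss_T[i] - loss_T[j]) * (loss_S[i] - loss_S[j]) <= 0:
            return False
    return True

# Hypothetical optimized validation losses for three hyperparameters.
loss_T = np.array([0.90, 0.40, 0.65])      # on the original dataset
loss_S_good = np.array([1.70, 1.10, 1.30])  # same ordering as loss_T
loss_S_bad = np.array([1.70, 1.30, 1.10])   # last two swapped

calibrated = is_hyperparameter_calibrated(loss_T, loss_S_good)
not_calibrated = is_hyperparameter_calibrated(loss_T, loss_S_bad)
# When calibration holds, HPO picks the same hyperparameter on both datasets.
same_choice = int(np.argmin(loss_T)) == int(np.argmin(loss_S_good))
```

The absolute loss values on the condensed data can differ a lot from those on the original data; only the ordering matters for calibration.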

4. IMPLEMENTATIONS OF HCDC AND APPLICATIONS TO GNNS

Finally, we work on implementing and simplifying the hyperparameter calibrated dataset condensation (HCDC) objective and apply it to the graph architecture/hyperparameter search problem. Constructing the extended search space $\tilde{\Lambda}$. The HCDC objective requires hypergradient alignment over all $\lambda$'s in an extended space $\tilde{\Lambda}$ that is a compact and connected superset of $\Lambda$. Under the discrete search space $\Lambda$, which consists of $p$ candidate hyperparameters, one can naively construct $\tilde{\Lambda}$ as $O(p^2)$ continuous paths connecting pairs of candidate hyperparameters (shown as blue lines in Fig. 2a). This is undesirable due to its quadratic complexity in $p$. We propose a construction of $\tilde{\Lambda}$ with a linear complexity in $p$, which works as follows. For any $i \in [p]$, we construct a "representative" path, named the $i$-th HPO trajectory, which starts from $\lambda^S_{i,0} = \lambda_i \in \Lambda$ and updates through $\lambda^S_{i,t+1} \leftarrow \lambda^S_{i,t} - \eta \nabla_\lambda L^*_S(\lambda^S_{i,t})$, shown as the orange dashed lines in Fig. 2a. All of the $p$ trajectories will approach the optima $\lambda^S$, forming "connected" paths between any pair of hyperparameters $\lambda_i \neq \lambda_j \in \Lambda$. This construction is also used in a continuous search space to save computation (except that we randomly select the starting points $\lambda_i \sim P_\Lambda$). To extend a general discrete neural architecture space $\Lambda$ into a continuously differentiable $\tilde{\Lambda}$, differentiable NAS approaches surveyed in Appendix D.3 can be used, and we leave exploration in this direction to future work. Computing hypergradients and optimizing the hypergradient alignment loss in Eq. (HCDC). The hypergradients are the gradients of the optimized validation loss $L^*_T(\lambda) = L^{val}_T(\theta^T(\lambda), \lambda)$ w.r.t. the hyperparameters $\lambda$; see Fig. 3 for the illustration.
The efficient computation of hypergradients $\nabla_\lambda L^*_T(\lambda)$ and $\nabla_\lambda L^*_S(\lambda)$ uses the implicit function theorem (IFT), $\nabla_\lambda L^*_T(\lambda) = -\left[\frac{\partial^2 L^{train}_T(\theta,\lambda)}{\partial\lambda\,\partial\theta^\top}\right] \left[\frac{\partial^2 L^{train}_T(\theta,\lambda)}{\partial\theta\,\partial\theta^\top}\right]^{-1} \nabla_\theta L^{val}_T(\theta, \lambda) + \nabla_\lambda L^{val}_T(\theta, \lambda)$, (IFT) where $\nabla_\lambda L^{val}_T(\theta, \lambda)$ is the direct gradient, which is 0 when $\lambda$ only affects the loss through the model $f_{\theta,\lambda}$. The first term is the product of the training mixed partials $\frac{\partial^2 L^{train}_T(\theta,\lambda)}{\partial\lambda\,\partial\theta^\top}$, the inverse training Hessian $\left[\frac{\partial^2 L^{train}_T(\theta,\lambda)}{\partial\theta\,\partial\theta^\top}\right]^{-1}$, and the validation gradients $\nabla_\theta L^{val}_T(\theta, \lambda)$. While the other parts can be computed by back-propagation, the inverse Hessian needs to be approximated. Instead of using the conjugate gradient method, Lorraine et al. (2020) propose a stable, tractable, and efficient Neumann series approximation, $\left[\frac{\partial^2 L^{train}_T(\theta,\lambda)}{\partial\theta\,\partial\theta^\top}\right]^{-1} = \lim_{i \to \infty} \sum_{j=0}^{i} \left[I - \frac{\partial^2 L^{train}_T(\theta,\lambda)}{\partial\theta\,\partial\theta^\top}\right]^j$, under a constant memory constraint. To optimize the synthetic validation set $S_{val}$ w.r.t. the cosine hypergradient matching loss in Eq. (HCDC), we only need to take gradients of $\nabla_\theta L^{val}_S(\theta, \lambda)$ and $\nabla_\lambda L^{val}_S(\theta, \lambda)$ w.r.t. $S_{val}$, which can be handled by the same back-propagation technique as in SDC, where we take gradients of $\nabla_\theta L^{train}_S(\theta, \lambda)$ w.r.t. $S_{train}$. Connecting HCDC to SDC (Eq. (2)). Theoretically speaking, the objective of HCDC, preserving the outcome of hyperparameter optimization (HPO), is orthogonal to the objective of SDC, which preserves the generalization performance. Therefore, we use SDC to learn the synthetic training dataset $S_{train}$ in Eq. (2) and HCDC to learn the synthetic validation dataset $S_{val}$ in Eq. (HCDC). Learning the synthetic training and validation datasets separately may result in disconnected training and validation sets, which is allowed in graph learning.
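A toy version of the IFT hypergradient with a Neumann-series inverse Hessian can be written for ridge regression, where the regularization strength plays the role of the hyperparameter. Everything here is an assumption made for illustration: the ridge setup, the function names, and the step size `alpha`, which rescales the series so that it converges for any positive-definite Hessian (`alpha = 1` recovers the series as stated above, which requires the Hessian's eigenvalues to lie in (0, 2)). The result is compared against a finite-difference estimate.

```python
import numpy as np

def neumann_inverse_hvp(H, v, alpha, num_terms):
    # Approximates H^{-1} v by alpha * sum_{j=0}^{num_terms} (I - alpha H)^j v,
    # keeping only two vectors in memory.
    p, acc = v.copy(), v.copy()
    for _ in range(num_terms):
        p = p - alpha * (H @ p)
        acc = acc + p
    return alpha * acc

def ridge_solution(X, y, lam):
    # Minimizer of 0.5/n * ||X theta - y||^2 + 0.5 * lam * ||theta||^2.
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def val_loss(theta, Xv, yv):
    r = Xv @ theta - yv
    return 0.5 * (r @ r) / len(yv)

def ift_hypergradient(X, y, Xv, yv, lam, num_terms=500):
    n, d = X.shape
    theta = ridge_solution(X, y, lam)
    H = X.T @ X / n + lam * np.eye(d)              # training Hessian
    g_val = Xv.T @ (Xv @ theta - yv) / len(yv)     # validation gradient
    alpha = 1.0 / np.linalg.eigvalsh(H).max()      # assumed step size
    Hinv_g = neumann_inverse_hvp(H, g_val, alpha, num_terms)
    mixed = theta  # d/dlam of grad_theta L_train is theta for the ridge penalty
    return -mixed @ Hinv_g                         # direct gradient is zero here

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 5)), rng.normal(size=50)
Xv, yv = rng.normal(size=(20, 5)), rng.normal(size=20)
lam, h = 0.5, 1e-4

hg = ift_hypergradient(X, y, Xv, yv, lam)
fd = (val_loss(ridge_solution(X, y, lam + h), Xv, yv)
      - val_loss(ridge_solution(X, y, lam - h), Xv, yv)) / (2 * h)
```

For neural networks the Hessian is never formed explicitly; each `H @ p` would be replaced by a Hessian-vector product obtained through automatic differentiation.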

Which graph architecture/hyperparameter search problems can HCDC solve?

We illustrate how to tackle the two types of search spaces, (1) discrete and finite $\Lambda$ and (2) continuous and bounded $\Lambda$, with two typical examples originating from the problem of searching for the best convolution matrix $C(A)$ on a large graph $T = (A, X, y)$. (1) Discrete and finite search space $\Lambda$: often the most important question of architecture search on large graphs is which design of convolution filter performs best on the given graph. One may simply train the set of $p$ pre-defined GNNs $\{f^{C^{(i)}}_{[\alpha^{(i)}, W]} \mid i = 1, \ldots, p\}$ whose convolution matrices are $\mathcal{C} = \{C^{(1)}_{\alpha^{(1)}}(A), \ldots, C^{(p)}_{\alpha^{(p)}}(A)\}$ and compare their validation performance. We can formulate this problem as HPO by defining an "interpolated" model $f^{C}_{[\alpha, W], \lambda}$ whose convolution matrix is $C_{\alpha,\lambda}(A) = \lambda^{(1)} C^{(1)}_{\alpha^{(1)}}(A) + \cdots + \lambda^{(p)} C^{(p)}_{\alpha^{(p)}}(A)$, where hyperparameters $\lambda = [\lambda^{(1)}, \ldots, \lambda^{(p)}] \in \Lambda$ and parameters $\alpha = [\alpha^{(1)}, \ldots, \alpha^{(p)}]$. The search space $\Lambda = \{\lambda_1 = e^p_1, \ldots, \lambda_p = e^p_p\}$ is the set of unit vectors in $\mathbb{R}^p$. (2) Continuous and bounded search space $\Lambda$: one may also use a continuous generic formula, e.g., a truncated series, to model a wide range of convolution filters, i.e., $C_\lambda(A) = \sum_{i=1}^{p} \lambda^{(i)} C^{(i)}(A)$, for example in ChebNet (Defferrard et al., 2016) or SIGN (Frasca et al., 2020) (see Appendix B.2). The formula of $C_\lambda(A)$ is a special case of the $C_{\alpha,\lambda}(A)$ in (1), although the search space $\Lambda$ is now continuous. The complete pseudo-code of HCDC. We conclude this section by summarizing the implementation of HCDC in Algorithm 1. We assume a discrete and finite search space $\Lambda$. In Line 8, to compute $\nabla_{S_{val}} D\left(\nabla_\lambda L^*_T(\lambda), \nabla_\lambda L^*_S(\lambda)\right)$, we note that only $\nabla_\lambda L^*_S(\lambda)$ depends on $S_{val}$. By Eq. (IFT), $\nabla_\lambda L^*_S(\lambda) = -\left[\frac{\partial^2 L^{train}_S(\theta,\lambda)}{\partial\lambda\,\partial\theta^\top}\right] \left[\frac{\partial^2 L^{train}_S(\theta,\lambda)}{\partial\theta\,\partial\theta^\top}\right]^{-1} \nabla_\theta L^{val}_S(\theta, \lambda)$ (there is no direct gradient since $\lambda$ only affects the loss through the model $f_{\theta,\lambda}$).
Since only the validation loss term ∇θ L^val_S(θ, λ) depends on S_val, we only need to compute ∇_{S_val} ∇θ L^val_S(θ, λ) by back-propagation.

Algorithm 1 HCDC: hyperparameter calibrated dataset condensation, which aims to preserve the validation performance ranking of architectures/hyperparameters.
Require: Original dataset T. A set of NN architectures f_{θ,λ} where λ ∈ Λ = {λ_1, . . . , λ_p}.
Require: Condensed training data S_train learned by the standard gradient-matching algorithm (Eq. (2)). Randomly initialized synthetic graph S_val for C classes.
1 for repeat k = 0, . . . , K − 1 do
2     for λ = λ_1, . . . , λ_p do
3         Initialize θ ← θ_0 ∼ P_{θ_0}
4         for epoch t = 0, . . . , T_θ − 1 do
5             Update θ ← θ − η_θ ∇θ L^train_S(θ, λ).
6             if t mod T_λ = 0 then
7                 Update λ ← λ − η_λ ∇λ L*_S(λ).    ▷ Hypergradients calculated using Eq. (IFT).
8                 Update S_val ← S_val − η_S ∇_{S_val} D(∇λ L*_T(λ), ∇λ L*_S(λ))
9 return Condensed validation data S_val.
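The interpolated convolution matrix used in this section can be sketched with a handful of fixed candidate filters. This is a simplified illustration: the three candidates (identity, GCN-style, GIN-0-style) and the function names are assumptions made here, and the per-filter parameters α are omitted, which corresponds to the form C_λ(A) in case (2). A one-hot λ recovers a single candidate, while other values of λ interpolate between candidates.

```python
import numpy as np

def candidate_filters(A):
    # Three hypothetical candidates: identity (no message passing),
    # GCN-style normalized adjacency, and GIN-0-style A + I.
    I = np.eye(len(A))
    A_hat = A + I
    d = A_hat.sum(axis=1)
    gcn = A_hat / np.sqrt(np.outer(d, d))
    return [I, gcn, A_hat]

def interpolated_conv(A, lam):
    # C_lambda(A) = sum_i lambda^(i) C^(i)(A)
    return sum(l * C for l, C in zip(lam, candidate_filters(A)))

# Path graph on three nodes.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

filters = candidate_filters(A)
one_hot = interpolated_conv(A, [0.0, 1.0, 0.0])   # picks the GCN-style filter
blend = interpolated_conv(A, [0.5, 0.5, 0.0])     # halfway between identity and GCN
```

Because the output is linear in λ, gradients with respect to λ are well defined everywhere between the candidates, which is what the hypergradient computation relies on.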

5. RELATED WORK

Graph condensation (Jin et al., 2021) achieved the state of the art in preserving GNNs' performance on the simplified graph. Jin et al. (2021) adapted the gradient matching algorithm (Zhao et al., 2020) (Eq. (2)) to graph data, together with an MLP-based graph generative model (Anand & Huang, 2018), leaving open several major issues regarding its efficiency, performance, and generalizability. While the efficiency was improved by reducing the number of gradient matching steps (Jin et al., 2022), the performance degradation on medium- and large-sized graphs still renders graph condensation impractical. Our HCDC is designed for hyperparameter/architecture search, where we train multiple models on the same dataset and the efficiency gain is much more significant. Implicit differentiation methods apply the implicit function theorem (IFT) (Eq. (IFT)) to nested-optimization problems (Wang et al., 2019). Lorraine et al. (2020) approximated the inverse Hessian by a Neumann series, which is a stable alternative to conjugate gradients (Shaban et al., 2019) and scales IFT to large networks with constant memory. Lorraine et al. (2020) also showed that unrolling differentiation around locally optimal parameters for i steps is equivalent to approximating the inverse Hessian by the Neumann series up to the first i terms. In addition, we summarize graph reduction methods (including graph coreset selection, graph sampling, graph sparsification, and graph coarsening), as well as more dataset condensation and coreset selection methods beyond graphs and differentiable NAS methods in Appendix D.

6. EXPERIMENTS

In this section we validate the effectiveness of hyperparameter calibrated dataset condensation (HCDC) when applied to speed up graph architecture/hyperparameter search. Spearman's rank correlation coefficient r s between two rankings of the ordered list of hyperparameters on the original and condensed datasets, which is concisely referred to as correlation, is used as an important evaluation metric, in addition to the percentage accuracy metric (referred to as performance). Synthetic experiments on CIFAR-10. We first consider a synthetically created set of hyperparameters on an image dataset, CIFAR-10. Consider the M -fold cross validation, where a fraction of 1/M samples are used as the validation dataset each time. The M -fold cross-validation process can be modeled by a set of M hyperparameters {φ i ∈ {0, 1} | i = 1, . . . , M }, where φ i = 1 if and only if the i-th fold is used for validation. The problem of finding the best validation performance among the M results can be modeled as a hyperparameter optimization problem with a discrete search space |Λ| = M . We compare HCDC with the gradient matching (Zhao et al., 2020) and distribution matching (Zhao & Bilen, 2021b) baselines. We also consider a uniform random sampling baseline and an early-stopping baseline where we train the same number of iterations (with the same batchsize) as the other methods but on the original dataset. The results of M = 20 and c/n = 2% and 4% are reported in Table 2 , where we see HCDC achieves the highest rank correlation. This experiment shows that HCDC can be applied to general types of data and tasks as long as the extended search space can be effectively and efficiently constructed. Finding the best convolution filter on (large) graphs. One application of HCDC we analyzed in Section 4 is to speed up the selection of the best-suited convolution filter design on large graphs. 
Following the method discussed in Section 4, we test HCDC against (1) Random: sampling nodes uniformly at random and taking their induced subgraph, (2) GCond-X: graph condensation (Jin et al., 2021) with the synthetic adjacency fixed to the identity, (3) GCond: the graph condensation algorithm in (Jin et al., 2021), and (4) Whole Graph: model selection performed on the original dataset. We use random uniform sampling to find the training synthetic subgraph before we apply HCDC. For the other coreset/condensation methods, which do not define the validation split, we randomly split the train and validation nodes according to the original split ratio. We report not only the Spearman's rank correlation, but also the test performance (on the original dataset) of the model selected by the condensed dataset. Speeding up off-the-shelf graph architecture search algorithms. Finally, we test how much speed-up HCDC can provide to off-the-shelf graph architecture search methods. We use graph NAS (Gao et al., 2019) on Ogbn-arxiv with a condensation ratio of c/n = 0.5%. The search space of architectures is the same as the set used in Table 3, with a focus on GNNs with different convolution filters. We plot the best test performance of the searched architecture (so far) versus the time spent during searching (in seconds) in Fig. 4. We see that HCDC, as a dataset condensation approach, can further speed up the search process of graph NAS and is orthogonal to the efficient search algorithms, like Bayesian optimization or reinforcement learning, used by NAS methods.
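The correlation metric used throughout this section can be computed as follows. This is a minimal sketch that assumes no ties among the scores (in which case the closed-form expression below is exact); the accuracy values are hypothetical placeholders, not numbers from the paper's tables.

```python
import numpy as np

def ranks(x):
    # Ranks 1..n in ascending order of x, assuming no ties.
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman(a, b):
    # r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid when there are no ties.
    ra, rb = ranks(np.asarray(a)), ranks(np.asarray(b))
    n = len(ra)
    return 1.0 - 6.0 * np.sum((ra - rb) ** 2) / (n * (n ** 2 - 1))

# Hypothetical validation accuracies of four architectures.
acc_original = [71.2, 69.8, 70.5, 68.1]    # evaluated on the original dataset
acc_condensed = [60.3, 58.0, 57.1, 55.9]   # evaluated on a condensed dataset

r_s = spearman(acc_original, acc_condensed)
r_same = spearman(acc_original, acc_original)
r_reversed = spearman(acc_original, [-v for v in acc_original])
```

Here the second and third architectures swap places on the condensed data, so the correlation drops from 1 to 0.8 even though the best and worst architectures are still identified correctly.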

7. CONCLUSION

This paper considers a novel objective for dataset condensation: preserving the outcome of hyperparameter search/optimization. We propose the hyperparameter calibration formulation for this goal, which is then realized by aligning the hyperparameter gradients. We demonstrate both theoretically and experimentally that HCDC can effectively preserve the validation-performance rankings of GNNs and accelerate hyperparameter/architecture search on graphs. However, the overall performance of HCDC can be affected by (1) how well the supernet generalizes to unseen architectures; (2) where we align hypergradients in the search space; (3) how we learn the synthetic training set; and (4) how we parameterize the synthetic graph/dataset. We leave the exploration of possible techniques for these design choices to future work. Beyond graph datasets, HCDC has the potential to be integrated with differentiable neural architecture search (NAS) methods (Liu et al., 2018; Wang et al., 2020) to address general neural architecture spaces. We hope our work opens up a promising new avenue for speeding up hyperparameter/architecture search by compressing the underlying dataset.

A STANDARD DATASET CONDENSATION IS PROBLEMATIC ACROSS GNNS

In this section, we complete the theoretical details behind Section 2.3, which shows that standard dataset condensation (SDC) is problematic across GNNs.

Assumption 1 (Achievability of a gradient-matching objective). A gradient-matching objective is defined to be achievable if there exists a non-degenerate trajectory $(\theta^S_t)_{t=0}^{T-1}$ (i.e., a trajectory that spans the entire parameter space $\Theta$, $\mathrm{span}(\theta^S_0, \ldots, \theta^S_{T-1}) \supseteq \Theta$) such that the gradient-matching loss (the objective of Eq. (2) without expectation) on this trajectory is 0.

Proposition 1 (Successful Generalization of SDC across 1D-CNNs). Consider least-squares regression with the one-dimensional linear convolution $f^{2K+1}_{\theta}(X) = \big(\sum_{k=-K}^{K} \alpha^{(k)} P^k\big) X W$ parameterized by $\theta = [\alpha, W]$, where $\alpha = [\alpha^{(-K)}, \ldots, \alpha^{(K)}]$ and $P$ is the cyclic permutation matrix (of a unit shift). The kernel size is $(2K + 1)$, $K \ge 0$. If the gradient-matching objective of $f^{2K+1}$ is achievable, then the condensed dataset $S^*$ achieves the gradient-matching objective on any trajectory $\{\theta'^S_t\}_{t=0}^{T-1}$ for any linear convolution $f^{2K'+1}_{\theta'}$ with kernel size $(2K' + 1)$, $K \ge K' \ge 0$.

The intuition behind Proposition 1 is that the 1D-CNN of kernel size $(2K + 1)$ is a "supernet" of the 1D-CNN of kernel size $(2K' + 1)$ if $K' \le K$, and a dataset condensed via a bigger model can generalize well to smaller ones. This result suggests using a sufficiently large model during condensation, so that the condensed dataset generalizes to a wider range of models.

Proposition 2 (Condensed Adjacency Overfits SDC Objective). Consider least-squares regression with a linear GNN $f(A, X) = C(A) X W$, parameterized by $W$, where $C(A)$ depends on the graph adjacency $A$. For any (full-rank) synthetic node features $X' \in \mathbb{R}^{c \times d}$, there exists a synthetic adjacency matrix $A' \in \mathbb{R}^{c \times c}_{\ge 0}$ such that the gradient-matching objective is achievable.

Proposition 3 (Failed Generalization of SDC across GNNs).
Consider least-squares regression with a linear GNN $f^{C}_{W}(A, X) = C(A) X W$ parameterized by $W$. There always exists a condensed graph $S^*$ such that the gradient-matching objective for $f^{C}$ is achievable. However, if we train a new linear GNN $f^{C'}_{W}(A, X)$ with convolution matrix $C'(A')$ on $S^*$, the relative error between the optimized model parameters of $f^{C'}_{W}$ on the real and condensed graphs satisfies $\|W^S_{C'} - W^T_{C'}\| / \|W^T_{C'}\| \ge \max\{\sigma_{\max}(Q) - 1,\ 1 - \sigma_{\min}(Q)\}$, where $W^T_{C'} = \arg\min_W \|y - f^{C'}_W(A, X)\|_2^2$, $W^S_{C'} = \arg\min_W \|y' - f^{C'}_W(A', X')\|_2^2$, and $Q = \big(X^\top [C(A)]^\top [C(A)] X\big)\big(X^\top [C'(A)]^\top [C'(A)] X\big)^{-1}$.

B MORE PRELIMINARIES

In this section, we describe in greater detail the types of data, downstream tasks, and neural network models that our hyperparameter-calibrated dataset condensation (HCDC) applies to. Moreover, we rigorously define the simplified linear convolution regression problem with least-squares loss and linear convolution models, which is the assumed setup for Lemma 1 and Propositions 1 to 3.

B.1 DOWNSTREAM TASKS

Under review as a conference paper at ICLR 2023

In Section 2 we defined the downstream task that this paper mainly focuses on, node classification on graphs, where we are given a graph $\mathcal{T} = (A, X, y)$ with adjacency matrix $A \in \{0, 1\}^{n \times n}$, node features $X \in \mathbb{R}^{n \times d}$, node class labels $y \in [K]^n$, and mutually disjoint node splits $V_{train} \cup V_{val} \cup V_{test} = [n]$; the goal is to predict the node labels. Here, to give background on the convolutional neural network (CNN) applications discussed in Section 2.3, we show that the setting above can also describe per-pixel classification on images (e.g., for semantic segmentation), where CNNs are usually used.

For per-pixel classification, we are given a set of $\bar{n}$ images of size $w \times h$, so the pixel values of the j-th image can be formatted as a tensor $X_j \in \mathbb{R}^{w \times h \times c}$ if there are $c$ channels. We are also given the pixel labels $Y_j \in [K]^{w \times h}$ for each image $j \in [\bar{n}]$ and the mutually disjoint image splits $I_{train} \cup I_{val} \cup I_{test} = [\bar{n}]$. Clearly, we can reshape the pixel values and pixel labels of the j-th image to $wh \times c$ and $wh$, respectively, and concatenate those matrices from all images. Following this, denoting $n = \bar{n} w h$, we obtain the concatenated pixel value matrix $X \in \mathbb{R}^{n \times c}$ and the concatenated pixel label vector $y \in [K]^n$. The image splits are translated into pixel-level splits, where $V_{train} = \{i \mid (j - 1) w h < i \le j w h,\ j \in I_{train}\}$ (similarly for $V_{val}$ and $V_{test}$) and $V_{train} \cup V_{val} \cup V_{test} = [n]$. We can also define the auxiliary adjacency matrix $A \in \{0, 1\}^{n \times n}$ on the $n = \bar{n} w h$ pixels, where $A$ is block diagonal, $A = \mathrm{diag}(A_1, \ldots, A_{\bar{n}})$, and $A_j \in \{0, 1\}^{wh \times wh}$ is the assumed adjacency (e.g., a two-dimensional grid) of the j-th image.
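The translation from image-level data to a single "graph" of pixels can be sketched as follows. This is only an illustration under stated assumptions: images and pixels are indexed from 1, image j owns the pixel indices (j-1)wh+1 through jwh, and the per-image adjacency is taken to be a 4-neighbour grid; the helper names are hypothetical.

```python
def pixel_splits(image_split, w, h):
    """Map a set of 1-based image indices to the 1-based pixel indices they own."""
    size = w * h
    pixels = []
    for j in sorted(image_split):
        pixels.extend(range((j - 1) * size + 1, j * size + 1))
    return pixels


def grid_adjacency(w, h):
    """Dense 0/1 adjacency of a w x h grid with 4-neighbour connectivity (an assumption)."""
    n = w * h
    A = [[0] * n for _ in range(n)]
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:  # right neighbour
                A[i][i + 1] = A[i + 1][i] = 1
            if r + 1 < h:  # bottom neighbour
                A[i][i + w] = A[i + w][i] = 1
    return A


def block_diag(blocks):
    """Assemble A = diag(A_1, ..., A_m) from square blocks."""
    n = sum(len(b) for b in blocks)
    A = [[0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, value in enumerate(row):
                A[offset + i][offset + j] = value
        offset += len(b)
    return A


# Three 2x2 images; image 2 is used for validation.
val_pixels = pixel_splits({2}, w=2, h=2)            # [5, 6, 7, 8]
A = block_diag([grid_adjacency(2, 2) for _ in range(3)])
```

Because the adjacency is block diagonal, no message passing happens across images, which matches the independence of images in the original task.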

B.2 NEURAL NETWORK MODELS

This paper mainly focuses on graph neural networks (GNNs) $f_{\theta,\lambda}: \mathbb{R}^{n \times n}_{\ge 0} \times \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times K}$, where $\theta \in \Theta$ denotes the parameters and $\lambda \in \Lambda$ denotes the hyperparameters. In Section 2 we have seen that most GNNs can be interpreted as iterative convolution / message passing over nodes (Ding et al., 2021; Balcilar et al., 2021), where $X^{(0)} = X$ and $f(A, X) = X^{(L)}$, and for $l \in [L]$ the update rule is

$X^{(l+1)} = \sigma\big(C_{\alpha^{(l)}}(A)\, X^{(l)} W^{(l)}\big)$, (4)

where $C_{\alpha^{(l)}}(A)$ is the convolution matrix parameterized by $\alpha^{(l)}$, $W^{(l)}$ is the learnable linear weight matrix, and $\sigma(\cdot)$ denotes the non-linearity. Thus the parameters $\theta$ consist of all $\alpha$'s (if they exist) and $W$'s, i.e., $\theta = [\alpha^{(0)}, \ldots, \alpha^{(L-1)}, W^{(0)}, \ldots, W^{(L-1)}]$. More specifically, it is possible for GNNs to have more than one convolution filter per layer (Ding et al., 2021; Balcilar et al., 2021), and we may generalize Eq. (4) to

$X^{(l+1)} = \sigma\big(\sum_{i=1}^{p} C^{(i)}_{\alpha^{(l,i)}}(A)\, X^{(l)} W^{(l,i)}\big)$. (5)

Within this common framework, GNNs differ from each other by the choice of convolution filters $\{C^{(i)}\}$, which can be either fixed or learnable. If $C^{(i)}$ is fixed, there are no parameters $\alpha^{(l,i)}$ for any $l \in [L]$. If $C^{(i)}$ is learnable, the convolution matrix relies on the learnable parameters $\alpha^{(l,i)}$ and can be different in each layer (and thus should be denoted as $C^{(l,i)}$). Usually for GNNs, the convolution matrix depends on the parameters in one of two ways: (1) the convolution matrix $C^{(l,i)}$ is scaled by the scalar parameter $\alpha^{(l,i)} \in \mathbb{R}$, i.e., $C^{(l,i)} = \alpha^{(l,i)} C^{(i)}$ (e.g., GIN (Xu et al., 2018), ChebNet (Defferrard et al., 2016), and SIGN (Frasca et al., 2020)); or (2) the convolution matrix is constructed by node-level self-attention, $[C^{(l,i)}]_{ij} = h_{\alpha^{(l,i)}}\big(X^{(l)}_{i,:}, X^{(l)}_{j,:}\big)\,[C^{(i)}]_{ij}$ (e.g., GAT (Veličković et al., 2018) and Graph Transformers (Rong et al., 2020; Puny et al., 2020; Zhang et al., 2020)). Based on (Ding et al., 2021; Balcilar et al., 2021), we summarize the popular GNNs reformulated into the convolution over nodes / message-passing formula (Eq. (5)) in Table 4.

Convolutional neural networks can also be reformulated into the form of Eq. (5). For simplicity, we only consider the one-dimensional convolutional neural network (1D-CNN); the generalization to 2D/3D-CNNs is straightforward. If we denote the constant cyclic permutation matrix (which corresponds to a unit shift) as $P \in \mathbb{R}^{n \times n}$, the update rule of a 1D-CNN with kernel size $(2K + 1)$, $K \ge 0$, can be written as

$X^{(l+1)} = \sigma\big(\sum_{k=-K}^{K} \alpha^{(k)} P^k X^{(l)} W^{(l,k)}\big)$. (6)

We will use this common convolution formula of GNNs (Eq. (5)) and 1D-CNNs (Eq. (6)) in Appendix B.4 and Proposition 1.
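The claim that a 1D convolution is a weighted sum of powers of a cyclic shift matrix can be checked numerically. The sketch below builds the convolution matrix as the sum over k of alpha[k] times P^k and compares it with a direct circular convolution. The shift direction, defined here by (P x)[i] = x[(i - 1) mod n], is an assumed convention, and all helper names are hypothetical; the non-linearity and the weights W are omitted.

```python
def cyclic_shift_matrix(n):
    """Unit cyclic shift P, defined so that (P x)[i] = x[(i - 1) % n]."""
    return [[1.0 if j == (i - 1) % n else 0.0 for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def shift_power(P, k):
    """P^k for any integer k; P is a permutation matrix, so P^{-1} = P^T."""
    n = len(P)
    base = P if k >= 0 else [list(col) for col in zip(*P)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(abs(k)):
        result = matmul(result, base)
    return result


def conv_matrix(alpha, n):
    """C_alpha = sum_k alpha[k] * P^k, with alpha given as {k: weight} for k = -K..K."""
    P = cyclic_shift_matrix(n)
    C = [[0.0] * n for _ in range(n)]
    for k, weight in alpha.items():
        Pk = shift_power(P, k)
        for i in range(n):
            for j in range(n):
                C[i][j] += weight * Pk[i][j]
    return C


def circular_conv(alpha, x):
    """Direct form of the same operation: y[i] = sum_k alpha[k] * x[(i - k) % n]."""
    n = len(x)
    return [sum(w * x[(i - k) % n] for k, w in alpha.items()) for i in range(n)]


alpha = {-1: 0.5, 0: 1.0, 1: 2.0}  # kernel size 2K + 1 = 3
x = [1.0, 2.0, 3.0, 4.0, 5.0]
C = conv_matrix(alpha, len(x))
y_matrix = [sum(c * v for c, v in zip(row, x)) for row in C]
y_direct = circular_conv(alpha, x)
```

Setting some of the outer weights to zero recovers a smaller kernel, which is the sense in which a larger 1D-CNN contains the smaller ones.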

B.3 OTHER TYPES OF DATA, TASKS, AND MODELS

In Appendices B.1 and B.2 we have discussed the formal definitions of two possible tasks, (1) node classification on graphs and (2) per-pixel classification on images, and reformulated many popular GNNs and CNNs into a general convolution form (Eqs. (5) and (6)). However, we want to note that the application of dataset condensation methods (including standard dataset condensation (Wang et al., 2018; Zhao et al., 2020; Zhao & Bilen, 2021b) and our HCDC) is not limited to these specific types of data, tasks, and models. For HCDC, we can follow the conventions in (Zhao et al., 2020) to define the train/validation losses on iid samples and define dataset condensation as learning a smaller synthetic dataset with fewer samples. We refer the readers to (Zhao et al., 2020) for formal definitions of condensation on datasets with iid samples. More generally, HCDC can be applied as long as (1) the train and validation losses, i.e., $L^{train}_T(\theta, \lambda)$ and $L^{val}_T(\theta, \lambda)$, can be defined (as functions of the parameters and hyperparameters); and (2) we have a well-defined notion of the learnable synthetic dataset $S$ (e.g., one that includes prior knowledge such as the format of the synthetic data in $S$ and how the same model $f_{\theta,\lambda}$ is applied).

Table 4: Popular GNNs reformulated as convolution over nodes / message passing (Eq. (5)).

Model | Design idea | Conv. matrix type | # of conv. | Convolution matrix
GCN^1 (Kipf & Welling, 2016) | | Fixed | 1 | $C = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$
SAGE-Mean^2 (Hamilton et al., 2017) | Message Passing | Fixed | 2 | $C^{(1)} = I_n$, $C^{(2)} = D^{-1} A$
GAT^3 (Veličković et al., 2018) | Self-Attention | Learnable | # of heads | $C^{(s)} = A + I_n$ and $h^{(s)}_{a^{(l,s)}}(X^{(l)}_{i,:}, X^{(l)}_{j,:}) = \exp\big(\mathrm{LeakyReLU}\big((X^{(l)}_{i,:} W^{(l,s)} \,\Vert\, X^{(l)}_{j,:} W^{(l,s)}) \cdot a^{(l,s)}\big)\big)$
GIN^4 (Xu et al., 2018) | WL-Test | Fixed + Learnable | 2 | $C^{(1)} = A$, $C^{(2)} = I_n$ and $h^{(2)}_{\epsilon^{(l)}} = 1 + \epsilon^{(l)}$
SGC^5 (Defferrard et al., 2016) | Spectral Conv. | Learnable | order of poly. | $C^{(1)} = I_n$, $C^{(2)} = 2L/\lambda_{\max} - I_n$, $C^{(s)} = 2 C^{(2)} C^{(s-1)} - C^{(s-2)}$ and $h^{(s)}_{\theta^{(s)}} = \theta^{(s)}$
ChebNet^5 (Defferrard et al., 2016) | Spectral Conv. | Learnable | order of poly. | $C^{(1)} = I_n$, $C^{(2)} = 2L/\lambda_{\max} - I_n$, $C^{(s)} = 2 C^{(2)} C^{(s-1)} - C^{(s-2)}$ and $h^{(s)}_{\theta^{(s)}} = \theta^{(s)}$
GDC^6 (Klicpera et al., 2019) | Diffusion | Fixed | 1 | $C = S$
Graph Transformers^7 (Rong et al., 2020) | Self-Attention | Learnable | # of heads | $C^{(s)}_{i,j} = 1$ and $h^{(s)}_{(W^{(l,s)}_Q, W^{(l,s)}_K)}(X^{(l)}_{i,:}, X^{(l)}_{j,:}) = \exp\big(\tfrac{1}{\sqrt{d_{k,l}}} (X^{(l)}_{i,:} W^{(l,s)}_Q)(X^{(l)}_{j,:} W^{(l,s)}_K)^\top\big)$

1 Where $\tilde{A} = A + I_n$ and $\tilde{D} = D + I_n$.
2 $C^{(2)}$ represents the mean aggregator. The weight matrix in (Hamilton et al., 2017) is $W^{(l)} = W^{(l,1)} \,\Vert\, W^{(l,2)}$.
3 Needs row-wise normalization. $C^{(l,s)}_{i,j}$ is non-zero if and only if $A_{i,j} = 1$; thus GAT follows direct-neighbor aggregation.
4 The weight matrices of the two convolution supports are the same, $W^{(l,1)} = W^{(l,2)}$.
5 Where the normalized Laplacian is $L = I_n - D^{-1/2} A D^{-1/2}$ and $\lambda_{\max}$ is its largest eigenvalue, which can be approximated as 2 for a large graph.
6 Where $S$ is the diffusion matrix $S = \sum_{k=0}^{\infty} \theta_k T^k$, for example, with decaying weights $\theta_k = e^{-t} \frac{t^k}{k!}$ and transition matrix $T = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$.
7 Needs row-wise normalization. Only describes the global self-attention layer, where $W^{(l,s)}_Q, W^{(l,s)}_K \in \mathbb{R}^{f_l \times d_{k,l}}$ are weight matrices that compute the query and key vectors. In contrast to GAT, all entries of $C^{(l,s)}_{i,j}$ are non-zero. Different designs of Graph Transformers (Puny et al., 2020; Rong et al., 2020; Zhang et al., 2020) use graph adjacency information in different ways, which is not characterized here; see the original papers for details.

B.4 THE LINEAR CONVOLUTION REGRESSION PROBLEM

For ease of theoretical analysis, in Lemma 1 and Propositions 1 to 3 we consider a simplified linear convolution regression problem,

$\theta^T = \arg\min_{\theta = [\alpha, W]} \|C_\alpha(A)\, X W - y\|_2^2$, (7)

where we are given continuous labels $y$ and use the sum-of-squares loss $\ell(\hat{y}, y) = \|\hat{y} - y\|_2^2$ instead of the cross-entropy loss used for node/pixel classification. We also assume that a linear GNN/CNN $f_{\theta = [\alpha, W]}(A, X) = C_\alpha(A) X W$ is used, where $C_\alpha(A)$ is the convolution matrix, which depends on the adjacency matrix $A$ and the parameters $\alpha \in \mathbb{R}^p$, and $W$ is the learnable linear weight with $d$ elements (hence the complete parameters consist of two parts, i.e., $\theta = [\alpha, W]$). As explained in Appendix B.2, this linear convolution model $f_{\theta = [\alpha, W]}(A, X) = C_\alpha(A) X W$ already generalizes a wide variety of GNNs and CNNs. For example, it can represent the (single-layer) graph convolution network (GCN) (Kipf & Welling, 2016), whose convolution matrix is defined as $C(A) = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A}$ and $\tilde{D}$ are the "self-loop-added" adjacency and degree matrices (for GCN there are no learnable parameters in $C(A)$, so we omit $\alpha$). It also generalizes the one-dimensional convolutional neural network (1D-CNN), where the convolution matrix is $C_\alpha(A) = \sum_{k=-K}^{K} \alpha^{(k)} P^k$ and $P$ is the cyclic permutation matrix corresponding to a unit shift. It is important to note that, although the simplified linear convolution regression problem considered in some of our theoretical results is both convex and linear, we argue that most of the theoretical phenomena reflected by Lemma 1 and Propositions 1 to 3 generalize to non-convex losses and non-linear models; see Appendix C.4 for the corresponding discussions.
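As a concrete instance of the linear convolution regression problem, the sketch below builds the GCN convolution matrix for a small path graph and solves the least-squares problem in closed form for a single feature (d = 1), where the weight W reduces to a scalar. The graph, features, and helper names are hypothetical and chosen only for illustration.

```python
def gcn_conv_matrix(A):
    """C(A) = D~^{-1/2} A~ D~^{-1/2}, with A~ = A + I and D~ the degree matrix of A~."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]


def apply_conv(C, x):
    """z = C x for a single feature column x."""
    return [sum(c * v for c, v in zip(row, x)) for row in C]


def fit_single_feature(C, x, y):
    """Closed-form minimizer of ||C x w - y||^2 over a scalar weight w."""
    z = apply_conv(C, x)
    return sum(a * b for a, b in zip(z, y)) / sum(a * a for a in z)


# Path graph on three nodes: 0 - 1 - 2.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
C = gcn_conv_matrix(A)
x = [1.0, 2.0, 3.0]
y = [3.0 * z for z in apply_conv(C, x)]  # labels generated with true weight 3
w = fit_single_feature(C, x, y)          # recovers approximately 3
```

For d > 1 the same formula becomes the usual normal-equations solution; the scalar case is used here only to keep the example free of matrix inversion.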

C PROOFS AND EXTENDED THEORETICAL RESULTS

In this section, we provide the proofs of Lemma 1, Propositions 1 to 3, and Theorem 1, together with some extended theoretical discussions, including generalizing the linear convolution regression problem to non-convex losses and non-linear models (see Appendix C.4). To proceed, recall the linear convolution regression problem defined in Appendix B.4 and the achievability of the gradient-matching objective (Eq. (2)) defined in Assumption 1 in Section 2.3.

C.1 VALIDITY OF STANDARD DATASET CONDENSATION

As the first step, we verify the validity of standard dataset condensation (SDC) using the gradient-matching objective Eq. (2) for the linear convolution regression problem.

Lemma 1 (Validity of SDC). Consider least-squares regression with the linear convolution model $f_W(A, X) = C(A) X W$ parameterized by $W$. If the gradient-matching objective of $f_W$ is achievable, then the optimizer on the condensed dataset $S$, i.e., $W^S = \arg\min_W L_S(W)$, is also optimal for the original dataset, i.e., $L_T(W^S) = \min_W L_T(W)$.

Proof. In the linear convolution regression problem, the sum-of-squares loss is used, so $L_T(W) = \|C X W - y\|_2^2$ (similarly $L_S(W) = \|C' X' W - y'\|_2^2$, where $C' = C(A')$). We assume $X^\top C^\top C X \in \mathbb{R}^{d \times d}$ is invertible, so we can apply the closed-form solution of ordinary least squares (OLS) regression to find the optimizer $W^T$ of $L_T(W)$ as $W^T = (X^\top C^\top C X)^{-1} X^\top C^\top y$. We can also compute the gradient of $L_T(W)$ w.r.t. $W$ as $\nabla_W L_T(W) = 2 X^\top C^\top (C X W - y)$, and similarly for $\nabla_W L_S(W)$. Given the achievability of the gradient-matching objective of $f_W$, we know there exists a non-degenerate trajectory $(W^S_t)_{t=0}^{T-1}$ that spans the entire parameter space, i.e., $\mathrm{span}(W^S_0, \ldots, W^S_{T-1}) = \mathbb{R}^d$, such that the gradient-matching loss (the objective of Eq. (2) without expectation) on this trajectory is 0. Assuming $D(\cdot, \cdot)$ is the $L_2$ norm (Zhao et al., 2020), this means $\nabla_W L_T(W^S_t) = \nabla_W L_S(W^S_t)$ for $t \in [T]$. Substituting the formulas for the gradients $\nabla_W L_T(W)$ and $\nabla_W L_S(W)$, we then have $X^\top C^\top (C X W^S_t - y) = X'^\top C'^\top (C' X' W^S_t - y')$ for $t \in [T]$. Since the set $\{W^S_t\}_{t=0}^{T-1}$ spans the complete parameter space $\mathbb{R}^d$, we can transform the set of vectors $\{\omega_t \cdot W^S_t\}_{t=0}^{T-1}$ to the set of unit vectors $\{e^d_i\}_{i=0}^{d-1} \subset \mathbb{R}^d$ by a linear transformation. Meanwhile, the set of $T$ equations above can be transformed to $X^\top C^\top (C X e^d_i - y) = X'^\top C'^\top (C' X' e^d_i - y')$ for $i \in [d]$.
This directly leads to $X^\top C^\top C X = X'^\top C'^\top C' X'$ and $X^\top C^\top y = X'^\top C'^\top y'$. Using the formulas for the optimizers $W^T$ and $W^S$ above, we readily get $W^T = (X^\top C^\top C X)^{-1} X^\top C^\top y = (X'^\top C'^\top C' X')^{-1} X'^\top C'^\top y' = W^S$. Hence $L_T(W^S) = L_T(W^T) = \min_W L_T(W)$, which concludes the proof.

Again with a similar procedure for the $X^\top C^\top C X$ part, we finally can show that on the new trajectory $(\theta'^S_t)_{t=0}^{T-1}$, $\nabla_\alpha L_T(\alpha, W) = \nabla_\alpha L_S(\alpha, W)$. This concludes the proof. □

Next, we focus on linear GNNs. We want to verify the insight that the learned adjacency $A'$ of the condensed graph has "too many degrees of freedom", so that it can easily overfit the gradient-matching objective, no matter what the learned synthetic features $X'$ are. Again, the proof of Proposition 2 uses some results from the proof of Lemma 1.

Proof of Proposition 2: We consider a linear GNN defined as $f(A, X) = C(A) X W$. From the proof of Lemma 1, we know that the gradient-matching objective of $f$ being achievable is equivalent to requiring that $X^\top C^\top C X = X'^\top C'^\top C' X'$ and $X^\top C^\top y = X'^\top C'^\top y'$, where $C$ and $C'$ refer to $C(A)$ and $C(A')$, respectively. Firstly, we note that once we find $C'$ and $X'$ that satisfy the first condition $X^\top C^\top C X = X'^\top C'^\top C' X'$, we can always find $y' \in \mathbb{R}^c$ such that $X^\top C^\top y = X'^\top C'^\top y'$, since $X^\top C^\top y \in \mathbb{R}^d$ lies in the column space of $X^\top C^\top C X = X'^\top C'^\top C' X'$, which coincides with the column space of $X'^\top C'^\top$. Now, we focus on finding the convolution matrix $C'$ and the node feature matrix $X'$ of the condensed synthetic graph that satisfy $X^\top C^\top C X = X'^\top C'^\top C' X'$. We assume $n \gg c \gg d$ and consider the diagonalization of $X^\top C^\top C X \in \mathbb{R}^{d \times d}$. Since $X^\top C^\top C X$ is positive semi-definite, it can be diagonalized as $X^\top C^\top C X = V S^2 V^\top$, where $V \in \mathbb{R}^{d \times d}$ is an orthogonal matrix and $S \in \mathbb{R}^{d \times d}$ is a diagonal matrix, $S = \mathrm{diag}(s_1, \ldots, s_d)$.
For any (real) semi-unitary matrix $U \in \mathbb{R}^{c \times d}$ such that $U^\top U = I_d$, we can construct $C' X' = U S V^\top \in \mathbb{R}^{c \times d}$, and we can easily verify that it satisfies the condition $X'^\top C'^\top C' X' = V S U^\top U S V^\top = V S^2 V^\top = X^\top C^\top C X$. Then, since $X'$ is full-rank, for any $X'$, by considering the singular-value decomposition of $X'$, we see that we can always find a convolution matrix $C'$ such that $C' X' = U S V^\top$, and this concludes the proof. □

Finally, we use some results of Proposition 2 to prove Proposition 3, the failure of SDC when generalizing across GNNs.

Proof of Proposition 3: We prove it in two steps. For the first step, we aim to show that there always exists a condensed synthetic dataset $S$ that achieves the gradient-matching objective while the learned adjacency matrix $A' = I_c$ is the identity matrix. This follows directly from the proof of Proposition 2, where we only require $C(A') X' = U S V^\top$ (see the proof of Proposition 2 for details). If the learned adjacency matrix is $A' = I_c$, then for any GNN the corresponding convolution matrix on the condensed graph is also (proportional to) the identity, so we only need to set the learned node feature matrix $X' = U S V^\top$ to satisfy the condition. The first step is proved.

For the second step, we evaluate the relative estimation error of the optimal parameter when transferring to a new GNN $f^{C'}_W$ with convolution filter $C'(\cdot)$, i.e., $\|W^S_{C'} - W^T_{C'}\| / \|W^T_{C'}\|$. As in the statement of Proposition 3, $C'(\cdot)$ here denotes the filter of the new GNN, and we write the graph argument explicitly. Using the formula for the optimal parameter in the proof of Lemma 1 again, we have $W^T_{C'} = (X^\top [C'(A)]^\top [C'(A)] X)^{-1} X^\top [C'(A)]^\top y$ and $W^S_{C'} = (X'^\top [C'(A')]^\top [C'(A')] X')^{-1} X'^\top [C'(A')]^\top y'$, where $C'(A') = C'(I_c) = C(I_c)$ (the last equality uses the fact that the convolution matrices of GNNs are the same if the underlying graph is the identity). Moreover, by the validity of SDC on $f^C_W$ (see the proof of Lemma 1 for details), we know that $X'^\top [C(A')]^\top [C(A')] X' = X^\top [C(A)]^\top [C(A)] X$ and $X'^\top [C(A')]^\top y' = X^\top [C(A)]^\top y$. Thus, altogether we derive that $X'^\top [C'(A')]^\top [C'(A')] X' = X^\top [C(A)]^\top [C(A)] X$ and $X'^\top [C'(A')]^\top y' = X^\top [C(A)]^\top y$, and therefore $W^S_{C'} = (X^\top [C(A)]^\top [C(A)] X)^{-1} X^\top [C(A)]^\top y$. Now, note that $\|W^S_{C'} - W^T_{C'}\| / \|W^T_{C'}\| = \big\|\big((X^\top [C(A)]^\top [C(A)] X)(X^\top [C'(A)]^\top [C'(A)] X)^{-1} - I_d\big) X^\top C^\top y\big\| / \|X^\top C^\top y\| \ge \max\{\sigma_{\max}(Q) - 1,\ 1 - \sigma_{\min}(Q)\}$, where $Q = (X^\top [C(A)]^\top [C(A)] X)(X^\top [C'(A)]^\top [C'(A)] X)^{-1}$. This concludes the proof. □
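The construction used in the proof of Proposition 2 can be checked numerically in the simplest case of a single feature (d = 1), where $X^\top C^\top C X$ is the scalar $s^2$, $V = 1$, and $U$ is any unit vector in $\mathbb{R}^c$. The sketch below is an illustration under these assumptions only; the choice of $U$ as the normalized all-ones vector, the data values, and the function names are arbitrary and hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ols_weight(z, y):
    """Scalar least-squares solution of min_w ||z w - y||^2."""
    return dot(z, y) / dot(z, z)


def condense_single_feature(z, y, c):
    """Return (z', y') in R^c matching both statistics z.z and z.y of the real data.

    z plays the role of C X on the real graph and z' the role of C' X' = U S V^T
    on the condensed graph (here S = ||z||, V = 1, U = unit vector u).
    """
    s = math.sqrt(dot(z, z))
    u = [1.0 / math.sqrt(c)] * c               # arbitrary unit vector
    z_prime = [s * ui for ui in u]             # z'.z' = s^2 = z.z
    y_prime = [dot(z, y) / s * ui for ui in u]  # z'.y' = z.y
    return z_prime, y_prime


z = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]   # hypothetical C X on a graph with n = 6 nodes
y = [0.9, 2.2, 0.4, -1.1, 2.8, 0.1]   # hypothetical continuous labels
z_prime, y_prime = condense_single_feature(z, y, c=2)

w_real = ols_weight(z, y)
w_condensed = ols_weight(z_prime, y_prime)  # agrees with w_real up to rounding
```

The two matched statistics are exactly the quantities that appear in the closed-form optimizer, which is why a two-node condensed set already reproduces the weight learned on all six nodes for this particular filter.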

C.3 VALIDITY OF HCDC

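Finally, we complete the proof of Theorem 1 with more details.

Proof of Theorem 1: Firstly, we prove the necessity by contradiction. If there exists $\lambda_0 \in \Lambda$ such that the two hypergradient vectors are not aligned at $\lambda_0$, then there exists a small perturbation $\Delta\lambda_0$ such that $L^*_T(\lambda_0 + \Delta\lambda_0) - L^*_T(\lambda_0)$ and $L^*_S(\lambda_0 + \Delta\lambda_0) - L^*_S(\lambda_0)$ have different signs. Secondly, we prove the sufficiency by path integration. For any pair $\lambda_1 \ne \lambda_2 \in \Lambda$, we have a path $\gamma(\lambda_1, \lambda_2) \subset \Lambda$ from $\lambda_2$ to $\lambda_1$, and integrating the hypergradients $\nabla_\lambda L^*_T(\lambda)$ along the path recovers the hyperparameter-calibration condition. More specifically, along the path we have $L^*_T(\lambda_1) - L^*_T(\lambda_2) = \int_{\gamma(\lambda_1, \lambda_2)} \nabla_\lambda L^*_T(\lambda)\, d\lambda$ (and similarly for $L^*_S$). Thus we have

$\big(L^*_T(\lambda_1) - L^*_T(\lambda_2)\big)\big(L^*_S(\lambda_1) - L^*_S(\lambda_2)\big) = \Big(\int_{\gamma(\lambda_1, \lambda_2)} \nabla_\lambda L^*_T(\lambda)\, d\lambda\Big)\Big(\int_{\gamma(\lambda_1, \lambda_2)} \nabla_\lambda L^*_S(\lambda)\, d\lambda\Big) \ge \int_{\gamma(\lambda_1, \lambda_2)} \langle \nabla_\lambda L^*_T(\lambda), \nabla_\lambda L^*_S(\lambda) \rangle\, d\lambda \ge 0$,

where the second-to-last inequality follows from the Cauchy-Schwarz inequality and the last inequality from the alignment of the hypergradients, $\cos(\nabla_\lambda L^*_T(\lambda), \nabla_\lambda L^*_S(\lambda)) = 1$ for any $\lambda \in \gamma(\lambda_1, \lambda_2) \subset \Lambda$. This concludes the proof. □

C.4 GENERALIZE TO NON-CONVEX AND NON-LINEAR CASE

Although the results above are obtained for the least-squares loss and linear convolution model, they still reflect the nature of general non-convex losses and non-linear models. Since dataset condensation effectively matches the local minima $\{\theta^T\}$ of the original loss $L^{train}_T(\theta, \psi)$ with the local minima $\{\theta^S\}$ of the condensed loss $L^{train}_S(\theta, \psi)$, within the small neighborhoods surrounding the pair of local minima $(\theta^T, \theta^S)$ we can approximate the non-convex loss and non-linear model with a convex loss and a linear model, respectively. Hence, the generalizability issues found with convex losses and linear models may still hold.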
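Returning to the proof of Theorem 1, the two quantities it relates can be written down in a few lines. The sketch below shows (i) a cosine-based misalignment measure between two hypergradient vectors, assumed here as one natural way to quantify alignment, and (ii) a pairwise check that two lists of validation losses order every pair of hyperparameters consistently, i.e., that the product appearing in the proof is non-negative. All names and numbers are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)


def alignment_loss(grad_T, grad_S):
    """1 - cos(grad_T, grad_S): zero exactly when the hypergradients point the same way."""
    return 1.0 - cosine(grad_T, grad_S)


def rankings_agree(loss_T, loss_S):
    """True if (L_T(l1) - L_T(l2)) * (L_S(l1) - L_S(l2)) >= 0 for every pair (l1, l2)."""
    pairs = combinations(range(len(loss_T)), 2)
    return all((loss_T[i] - loss_T[j]) * (loss_S[i] - loss_S[j]) >= 0 for i, j in pairs)


# Hypothetical validation losses of three hyperparameter settings.
loss_original = [0.3, 0.5, 0.4]
loss_condensed = [1.0, 2.0, 1.5]
calibrated = rankings_agree(loss_original, loss_condensed)   # True: same ordering
misaligned = alignment_loss([1.0, 0.0], [0.0, 1.0])          # 1.0: orthogonal gradients
```

Note that the condensed losses may differ greatly in scale from the original ones; only the ordering matters for hyperparameter selection.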

D EXTENDED RELATED WORK

This section contains extended discussions of related work and areas that could not fit into the main paper due to the page limit.

D.1 DATASET CONDENSATION AND CORESET SELECTION

Firstly, we review the two main approaches to reducing the training set size while preserving model performance.

Dataset condensation (or distillation) was first proposed in (Wang et al., 2018) as a learning-to-learn problem, formulating the network parameters as a function of the synthetic data and learning the synthetic data through the network parameters to minimize the training loss over the original data. However, the nested-loop optimization prevents it from scaling up to large-scale in-the-wild datasets. Zhao et al. (2020) alleviate this issue by enforcing the gradients of the synthetic samples w.r.t. the network weights to approach those of the original data, which avoids the expensive unrolling of the computational graph. Based on the meta-learning formulation in (Wang et al., 2018), Bohdal et al. (2020) and Nguyen et al. (2020; 2021) propose to simplify the inner-loop optimization of a classification model by training with ridge regression, which has a closed-form solution, while Such et al. (2020) model the synthetic data using a generative network. To improve the data efficiency of synthetic samples in the gradient-matching algorithm, Zhao & Bilen (2021a) apply differentiable Siamese augmentation, and Kim et al. (2022) introduce an efficient synthetic-data parametrization. Recently, a new distribution-matching framework (Zhao & Bilen, 2021b) proposes to match the hidden features rather than the gradients for fast optimization, but it may suffer from performance degradation compared to gradient matching (Zhao & Bilen, 2021b), for which Kim et al. (2022) provide some interpretation.

Graph condensation (Jin et al., 2021) achieves state-of-the-art performance for preserving GNNs' performance on the simplified graph. However, Jin et al. (2021) only adapt the gradient-matching algorithm of dataset condensation (Zhao et al., 2020) to graph data, together with an MLP-based generative model for edges (Anand & Huang, 2018; Simonovsky & Komodakis, 2018), leaving out several major issues on efficiency, performance, and generalizability. Subsequent work aims to apply the more efficient distribution-matching algorithm (Zhao & Bilen, 2021b; Wang et al., 2022) of dataset condensation to graphs (Liu et al., 2022) or to speed up gradient-matching graph condensation by reducing the number of gradient-matching steps (Jin et al., 2022). While the efficiency issue of graph condensation is mitigated (Jin et al., 2022), the performance degradation on medium- and large-sized graphs still renders graph condensation practically meaningless. Our HCDC is specifically designed for repeated training in architecture search, which is, in contrast, well-motivated.

Coreset selection methods choose samples that are important for training based on heuristic criteria, for example, minimizing the distance between coreset and whole-dataset centers (Chen et al., 2010; Rebuffi et al., 2017), maximizing the diversity of selected samples in the gradient space (Aljundi et al., 2019), discovering cluster centers (Sener & Savarese, 2018), and choosing samples with the largest negative implicit gradient (Borsos et al., 2020). Forgetting (Toneva et al., 2018) measures the forgetfulness of trained samples and drops those that are not easy to forget. GraNd (Paul et al., 2021) selects the training samples that contribute most to the training loss in the first few epochs. Prism (Kothawade et al., 2022) selects samples to maximize submodular set functions, which are combinatorial generalizations of entropy measures (Iyer et al., 2021). A recent benchmark (Guo et al., 2022) of a variety of coreset selection methods for image classification indicates that Forgetting, GraNd, and Prism are among the best-performing coreset methods but still evidently underperform the dataset condensation baselines.
Although coreset selection can be very efficient, most of the methods above suffer from three major limitations: (1) their performance is upper-bounded by the information in the selected samples; (2) most of them do not directly optimize the selected samples to preserve the model performance; and (3) most of them select samples incrementally and greedily, which is short-sighted.
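To illustrate what "incremental and greedy" selection looks like in practice, the following is a simplified, hypothetical sketch in the spirit of center-matching criteria: at each step it adds the sample that brings the mean of the selected set closest to the mean of the whole dataset. It is not the exact algorithm of any of the cited methods.

```python
def herding_select(points, k):
    """Greedily pick k indices so that the running coreset mean tracks the dataset mean."""
    dim = len(points[0])
    n = len(points)
    center = [sum(p[d] for p in points) / n for d in range(dim)]
    selected = []
    running_sum = [0.0] * dim
    for step in range(1, k + 1):
        best, best_dist = None, float("inf")
        for i, p in enumerate(points):
            if i in selected:
                continue
            # Mean of the coreset if sample i were added at this step.
            mean = [(running_sum[d] + p[d]) / step for d in range(dim)]
            dist = sum((m - c) ** 2 for m, c in zip(mean, center))
            if dist < best_dist:
                best, best_dist = i, dist
        selected.append(best)
        running_sum = [s + x for s, x in zip(running_sum, points[best])]
    return selected


# One-dimensional toy data with an outlier; the dataset mean is 3.25.
points = [[0.0], [1.0], [2.0], [10.0]]
coreset = herding_select(points, k=2)  # [2, 1]: the outlier is not picked early
```

Each choice is locally optimal given the earlier ones and is never revisited, which is the short-sightedness noted in limitation (3); the result is also restricted to existing samples, as in limitation (1).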

D.2 GRAPH REDUCTION

Secondly, we summarize the traditional graph reduction methods for graph neural network training.

Graph coreset selection is a non-trivial generalization of the above coreset methods, given the non-iid nature of graph nodes and the non-linear nature of GNNs. The very few off-the-shelf graph coreset algorithms are designed for graph clustering (Baker et al., 2020; Braverman et al., 2021) and are not optimal for the training of GNNs.

Graph sampling methods (Chiang et al., 2019; Zeng et al., 2019) can be as simple as uniformly sampling a set of nodes and finding their induced subgraph, which is understood as the graph counterpart of uniform sampling of iid samples. However, most of the present graph sampling algorithms (e.g., ClusterGCN (Chiang et al., 2019) and GraphSAINT (Zeng et al., 2019)) are designed for sampling multiple subgraphs (mini-batches), which form a cover of the original graph for training GNNs under memory constraints. Therefore, those graph mini-batch sampling algorithms are effectively graph partitioning algorithms and are not optimized to find just one representative subgraph.

Graph sparsification (Batson et al., 2013; Satuluri et al., 2011) and graph coarsening (Loukas & Vandergheynst, 2018; Loukas, 2019; Huang et al., 2021; Cai et al., 2020) algorithms are usually designed to preserve specific graph properties such as the graph spectrum and graph clustering. Such objectives are often not aligned with the optimization of downstream GNNs and are shown to be sub-optimal in preserving the information needed to train GNNs well (Jin et al., 2021).
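The simplest reduction mentioned above, uniformly sampling nodes and taking their induced subgraph, can be sketched with the standard library alone. The edge-list representation, the relabeling of kept nodes to 0..c-1, and the function name are assumptions made for this illustration.

```python
import random


def random_induced_subgraph(edges, num_nodes, c, seed=0):
    """Sample c nodes uniformly without replacement and keep the edges among them.

    Returns the sorted list of kept original node ids and the induced edges,
    relabeled so that kept[i] becomes node i of the subgraph.
    """
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(num_nodes), c))
    relabel = {v: i for i, v in enumerate(kept)}
    sub_edges = [(relabel[u], relabel[v]) for u, v in edges
                 if u in relabel and v in relabel]
    return kept, sub_edges


# A cycle on 10 nodes, reduced to 4 nodes (c/n = 40%).
edges = [(i, (i + 1) % 10) for i in range(10)]
kept, sub_edges = random_induced_subgraph(edges, num_nodes=10, c=4, seed=0)
```

On sparse graphs a small random node sample often induces very few edges, which is one reason such a subgraph can be a weak stand-in for the original graph.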

D.3 OTHER RELATED AREAS

Lastly, we list two important areas relevant to this work: implicit differentiation methods based on the implicit function theorem (IFT), and differentiable neural architecture search (NAS) algorithms.

Implicit differentiation methods apply the implicit function theorem (IFT) to nested-optimization problems (Ochs et al., 2015; Wang et al., 2019). The IFT requires inverting the training Hessian with respect to the network weights, where early work either computes the inverse explicitly (Bengio, 2000; Larsen et al., 1996) or approximates it as the identity matrix (Luketina et al., 2016). Conjugate gradient (CG) is applied to invert the Hessian approximately (Pedregosa, 2016), but is difficult to scale to deep networks. Several methods have been proposed to efficiently approximate the Hessian inverse, for example, 1-step unrolled differentiation (Luketina et al., 2016), the Fisher information matrix (Larsen et al., 1996), and NN-structure-aided Kronecker-factored inversion (Martens & Grosse, 2015). Lorraine et al. (2020) use the Neumann inverse approximation, which is a stable alternative to CG (Shaban et al., 2019), and successfully scale gradient-based bilevel optimization to large networks with a constant memory constraint. It is shown that unrolling differentiation around locally optimal weights for i steps is equivalent to truncating the Neumann series inverse approximation to its first i terms.

Differentiable NAS methods, e.g., DARTS (Liu et al., 2018), explore the possibility of transforming the discrete neural architecture space into a continuously differentiable form and further use gradient optimization to search the neural architecture. DARTS follows a cell-based search space (Zoph et al., 2018) and continuously relaxes the original discrete search strategy. Despite its simplicity, several works cast doubt on the effectiveness of DARTS (Li & Talwalkar, 2020; Zela et al., 2019).
SNAS (Xie et al., 2018) points out that DARTS suffers from the unbounded bias issue towards its objective, and it remodels the NAS and leverages the Gumbel-softmax trick (Jang et al., 2017; Maddison et al., 2017) to learn the exact architecture parameter. Differentiable NAS techniques have also been applied to graphs to automatically design data-specific GNN architectures (Wang et al., 2021; Huan et al., 2021) .
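The Neumann inverse approximation mentioned above rests on the series $H^{-1} = \alpha \sum_{j \ge 0} (I - \alpha H)^j$, which converges when the spectral norm of $I - \alpha H$ is below 1. The sketch below applies the truncated series to a small explicit matrix; in a real bilevel setting $H$ is never formed and the matrix-vector product would be a Hessian-vector product. The matrix, step size, and function names are hypothetical.

```python
def matvec(H, v):
    """Dense matrix-vector product; stands in for a Hessian-vector product."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def neumann_inverse_hvp(H, v, alpha, num_terms):
    """Approximate H^{-1} v by alpha * sum_{j=0}^{num_terms} (I - alpha H)^j v."""
    term = list(v)    # (I - alpha H)^0 v
    total = list(v)
    for _ in range(num_terms):
        Hv = matvec(H, term)
        term = [t - alpha * h for t, h in zip(term, Hv)]  # next power applied to v
        total = [s + t for s, t in zip(total, term)]
    return [alpha * s for s in total]


H = [[2.0, 1.0], [1.0, 3.0]]   # small symmetric positive-definite stand-in for a Hessian
v = [1.0, 1.0]
approx = neumann_inverse_hvp(H, v, alpha=0.2, num_terms=200)
# The exact solution of H w = v is w = [0.4, 0.2].
```

With `num_terms=0` the function returns `alpha * v`, which corresponds to approximating the inverse Hessian by a scaled identity; adding terms trades computation for accuracy at constant memory.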

E IMPLEMENTATION DETAILS

In this section, we list more implementation details on the experiments in Section 6.

For the synthetic experiments on CIFAR-10, we randomly split the CIFAR-10 images into M = 20 splits and perform cross-validation. For the baseline methods (Random, SDC-GM, SDC-DM), the dataset condensation is performed independently for each split. For HCDC, we first condense the training set of the synthetic dataset by SDC-GM. Then, we learn a separate validation set that is 1/M the size of the training set and train with the HCDC objective on the M HPO trajectories as described in Section 4. We report the correlation between the rankings of splits (in terms of their validation performance on each split). For the Early-Stopping method, we only train for the same number of iterations as the other methods (with the same batch size), which means there are only c/n * 500 epochs.

For the experiments on finding the best convolution filter on (large) graphs, we create the set of ten candidate convolution filters as follows (see Table 4 for definitions and references): GCN, SAGE-Mean, SAGE-Max, GAT, GIN-ϵ, GIN-0, SGC(K=2), SGC(K=3), ChebNet(K=2), ChebNet(K=3). The implementations are provided by PyTorch Geometric https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html. We also select the GNN width from {128, 256} and the GNN depth from {2, 4}, so there are 10 × 2 × 2 = 40 models in total.
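The size of the search space and the early-stopping budget described above follow from simple arithmetic, sketched below. The dictionary layout and function name are hypothetical, and the filter names are plain labels rather than library class names.

```python
from itertools import product

FILTERS = ["GCN", "SAGE-Mean", "SAGE-Max", "GAT", "GIN-eps", "GIN-0",
           "SGC(K=2)", "SGC(K=3)", "ChebNet(K=2)", "ChebNet(K=3)"]
WIDTHS = [128, 256]
DEPTHS = [2, 4]

# Cartesian product of the three hyperparameter axes: 10 x 2 x 2 = 40 models.
search_space = [{"filter": f, "width": w, "depth": d}
                for f, w, d in product(FILTERS, WIDTHS, DEPTHS)]


def early_stopping_epochs(c, n, full_epochs=500):
    """Epochs on the original data that cost as many iterations as full training on c/n of it."""
    return c / n * full_epochs


budget = early_stopping_epochs(c=2, n=100)  # c/n = 2% gives 10 epochs
```

Each of the 40 configurations would be trained once on the condensed graph and once on the original graph to obtain the two rankings that are compared by rank correlation.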



The train/validation split of synthetic data is only required by HCDC; see Eq. (1) vs. Eq. (3).
If $C'(\cdot) = C(\cdot)$, Lemma 1 guarantees $W^S_{C'} = W^T_{C'}$, and the lower bound in Proposition 3 is 0.



and (possibly) train/validation splits $V'_{train} \cup V'_{val} = [c]$. The goal of SDC is to obtain comparable generalization performance on the real graph by training on the condensed graph, i.e., $L^{test}_T(\theta^T, \lambda) \approx L^{test}_T(\theta^S, \lambda)$, where $\theta^S = \arg\min_\theta L^{train}_S(\theta, \lambda)$ denotes the model parameters (of the same GNN $f_{\theta,\lambda}$) optimized on the synthetic graph. By posing $\theta^S$ as a function of the condensed graph $S$, SDC can be formulated as a bilevel optimization problem, $S^* = \arg\min_S L^{train}_T(\theta^S(S), \lambda)$ s.t. $\theta^S(S) := \arg\min_\theta L^{train}_S(\theta, \lambda)$.

Generalization accuracy of graphs condensed with different GNNs (row) across GNNs (column) under c/n = 0.25%.

Figure 1: The manifold of GNNs with convolution filters $C_\lambda = I + \lambda^{(1)} L + \lambda^{(2)} \left(\frac{2}{\lambda_{\max}} L - I\right)$, a linear combination of the first two orders of ChebNet (Defferrard et al., 2016) where the $\lambda$'s are hyperparameters (see Appendices B.2 and E), projected to the plane of validation accuracy on condensed (x-axis) and original (y-axis) graphs under two ratios c/n on Cora (Yang et al., 2016). The GNN with $C = \left(\frac{2}{\lambda_{\max}} - 1\right) L \propto L$ (red dot) is a biased point in this model space.
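As a rough illustration of the filter family in Figure 1, the sketch below evaluates C_λ = I + λ^(1) L + λ^(2) ((2/λ_max) L − I) on a toy graph and checks that the choice λ^(1) = −1, λ^(2) = 1 collapses to (2/λ_max − 1) L, the point marked as biased. The helper name `cheb_filter`, the 3-node path graph, and the use of the combinatorial Laplacian are assumptions made for illustration only.

```python
def cheb_filter(L, lam1, lam2, lam_max):
    """Return C = I + lam1 * L + lam2 * (2 / lam_max * L - I) for a square matrix L."""
    n = len(L)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            C[i][j] = (
                identity
                + lam1 * L[i][j]
                + lam2 * (2.0 / lam_max * L[i][j] - identity)
            )
    return C


# Combinatorial Laplacian L = D - A of a 3-node path graph (toy example).
L = [
    [1.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 1.0],
]
lam_max = 3.0  # largest eigenvalue of this particular Laplacian

# With lam1 = -1 and lam2 = 1 the identity terms cancel and C is proportional to L.
C = cheb_filter(L, lam1=-1.0, lam2=1.0, lam_max=lam_max)
scale = 2.0 / lam_max - 1.0
```

Sweeping `lam1` and `lam2` over a grid traces out the two-dimensional family of filters whose validation accuracies the figure compares.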

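$$\lambda^{\mathcal{T}} = \arg\min_{\lambda \in \Lambda} \mathcal{L}^*_{\mathcal{T}}(\lambda) \quad \text{where} \quad \mathcal{L}^*_{\mathcal{T}}(\lambda) := \mathcal{L}^{\text{val}}_{\mathcal{T}}(\theta^{\mathcal{T}}(\lambda), \lambda) \quad \text{and} \quad \theta^{\mathcal{T}}(\lambda) := \arg\min_{\theta} \mathcal{L}^{\text{train}}_{\mathcal{T}}(\theta, \lambda). \tag{HPO}$$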

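Figure 2: The constructed extended search space $\Lambda$, illustrated as the orange trajectory, for both (a) discrete $\Lambda$ and (b) continuous $\Lambda$. The trajectory starts from $\lambda^{\mathcal{S}}_{i,0} = \lambda_i \in \Lambda$ for discrete $\Lambda$ (or from random points for continuous $\Lambda$) and is updated through $\lambda^{\mathcal{S}}_{i,t+1} \leftarrow \lambda^{\mathcal{S}}_{i,t} - \eta \nabla_\lambda \mathcal{L}^*_{\mathcal{S}}(\lambda^{\mathcal{S}}_{i,t})$.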
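The trajectory update in Figure 2 is plain gradient descent on the hyperparameter. The sketch below runs it on a toy one-dimensional objective L*(λ) = (λ − 2)^2, which stands in for the validation loss on the condensed data; the objective, the step size, the starting points, and the function name `hpo_trajectory` are all assumptions made for illustration.

```python
def hpo_trajectory(grad_fn, lam0, eta, steps):
    """Return [lam_0, ..., lam_steps] with lam_{t+1} = lam_t - eta * grad_fn(lam_t)."""
    trajectory = [lam0]
    lam = lam0
    for _ in range(steps):
        lam = lam - eta * grad_fn(lam)
        trajectory.append(lam)
    return trajectory


def toy_hypergradient(lam):
    """Gradient of the toy objective L*(lam) = (lam - 2)^2."""
    return 2.0 * (lam - 2.0)


# Discrete case: one trajectory per candidate hyperparameter lam_i.
starts = [0.0, 1.0, 4.0]
trajectories = [
    hpo_trajectory(toy_hypergradient, s, eta=0.1, steps=50) for s in starts
]

# The extended search space is the union of all visited points.
extended_space = [lam for traj in trajectories for lam in traj]
```

In the continuous case the same loop would simply be started from randomly sampled points instead of the fixed candidates.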

Figure 3: Loss landscape w.r.t. $\theta$ and $\lambda$. In (a), each hyperparameter $\lambda$ has an optimal parameter $\theta^{\mathcal{T}}(\lambda)$ that minimizes the train loss, shown as the blue curve in the $(\theta, \lambda)$-plane. In (b), injecting the optimal parameters $\theta^{\mathcal{T}}(\lambda)$ into the validation loss, we obtain the validation loss as a function of $\lambda$, denoted $\mathcal{L}^*_{\mathcal{T}}(\lambda)$ and shown as the orange curve in the $(\mathcal{L}, \lambda)$-plane. The purple dashed line illustrates the hypergradient, i.e., the gradient of $\mathcal{L}^*_{\mathcal{T}}(\lambda)$ w.r.t. $\lambda$.
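To make the hypergradient in Figure 3 concrete, the sketch below computes it by implicit differentiation on a scalar ridge-regression toy problem and compares it with a finite-difference estimate. The train loss (aθ − b)^2 + λθ^2, the validation loss (cθ − d)^2, the constants, and all function names are assumptions chosen so that θ(λ) has a closed form; this is not the paper's GNN setting, and the exact Hessian is used rather than an approximate inverse.

```python
def theta_star(lam, a, b):
    """Minimizer of the train loss (a*theta - b)^2 + lam * theta^2."""
    return a * b / (a * a + lam)


def val_loss(theta, c, d):
    """Validation loss (c*theta - d)^2; it has no direct dependence on lam."""
    return (c * theta - d) ** 2


def hypergradient(lam, a, b, c, d):
    """d L_val(theta*(lam)) / d lam via the implicit function theorem."""
    theta = theta_star(lam, a, b)
    dval_dtheta = 2.0 * c * (c * theta - d)
    hessian = 2.0 * (a * a + lam)   # d^2 L_train / d theta^2
    mixed = 2.0 * theta             # d^2 L_train / (d theta d lam)
    dtheta_dlam = -mixed / hessian  # implicit differentiation
    return dval_dtheta * dtheta_dlam


def finite_difference(lam, a, b, c, d, eps=1e-6):
    """Central-difference estimate of the same derivative."""
    up = val_loss(theta_star(lam + eps, a, b), c, d)
    down = val_loss(theta_star(lam - eps, a, b), c, d)
    return (up - down) / (2.0 * eps)


a, b, c, d = 2.0, 3.0, 1.5, 1.0
g_implicit = hypergradient(0.5, a, b, c, d)
g_numeric = finite_difference(0.5, a, b, c, d)
```

The two estimates should agree closely, which is the property that makes implicit differentiation a usable substitute for differentiating through the whole training run.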

The rank correlation and validation performance on the original dataset of the M -fold cross validation ranked/selected on the condensed dataset on CIFAR-10.

Figure 4: Speed-up of the search process of graph NAS when combined with HCDC on Ogbn-arxiv, best test performance so far vs. time spent.

λ) dλ ≥ 0, where the second-to-last inequality holds by the Cauchy-Schwarz inequality and the last inequality by $\cos(\nabla_\lambda \mathcal{L}^*_{\mathcal{T}}(\lambda), \nabla_\lambda \mathcal{L}^*_{\mathcal{T}}(\lambda)) = 0$ for any $\lambda \in \gamma(\lambda_1, \lambda_2) \in \Lambda$. This concludes the proof. □

C.4 GENERALIZE TO NON-CONVEX AND NON-LINEAR CASE

Although the results above are obtained for the least squares loss and the linear convolution model, they still reflect the nature of general non-convex losses and non-linear models. Since dataset condensation is effectively matching the local minima $\{\theta^{\mathcal{T}}\}$ of the original loss $\mathcal{L}^{\text{train}}_{\mathcal{T}}(\theta, \psi)$ with the local minima $\{\theta^{\mathcal{S}}\}$ of the condensed loss $\mathcal{L}^{\text{train}}_{\mathcal{S}}$

we see HCDC consistently outperforms the other approaches, and the test performance of selected architecture is close to the ground-truth best performance.

Spearman's rank correlation and test performance of convolution filter selected on the condensed graph.
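For reference, Spearman's rank correlation between the validation performance on the condensed and original data can be computed as the Pearson correlation of the two rank vectors. The sketch below uses only the standard library; the accuracy values are hypothetical placeholders rather than results from the paper, and the function names are assumptions made for illustration.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical validation accuracies of four models (placeholders only).
acc_condensed = [0.70, 0.65, 0.72, 0.60]
acc_original = [0.80, 0.78, 0.79, 0.70]
rho = spearman(acc_condensed, acc_original)
```

A value close to 1 means that ranking the candidate models on the condensed data would select nearly the same models as ranking them on the original data.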

Summary of GNNs formulated as generalized graph convolution.


Despite its simplicity, Lemma 1 directly verifies the validity of the gradient-matching formulation of standard dataset condensation on some specific learning problems. Although the gradient-matching formulation (Eq. (2)) is more efficient but weaker than the bilevel formulation of SDC (Eq. (1)), we see it is strong enough for the linear convolution regression problem.

C.2 GENERALIZATION ISSUES OF SDC

Now, we move forward and focus on the generalization issues of (the gradient-matching formulation of) the standard dataset condensation (SDC) across GNNs. To start with, we prove the successful generalization of SDC across 1D-CNNs as follows; the proof is very similar to that of Lemma 1.

Proof of Proposition 1:

In Proposition 1, we consider one-dimensional linear convolution models

then, from the proof of Lemma 1, we know the gradient of $\mathcal{L}_{\mathcal{T}}(W)$ w.r.t. $W$ is again,

We know the achievability of the gradient-matching objective means there exists a non-degenerate trajectory $(\theta^{\mathcal{S}}_t)_{t=0}^{T-1}$ which spans the entire parameter space, i.e., $\operatorname{span}(\theta^{\mathcal{S}}_0, \ldots,$

we know that there exists $(\alpha^{\mathcal{S}}_t)_{t=0}^{T-1}$ which spans $\mathbb{R}^p$ and there exists $(W^{\mathcal{S}}_t)_{t=0}^{T-1}$ which spans $\mathbb{R}^d$. Since the gradient-matching objective is minimized to 0 on $(W^{\mathcal{S}}_t)_{t=0}^{T-1}$, which spans $\mathbb{R}^d$, following the same procedure as the proof of Lemma 1, we again obtain,

Meanwhile, since the same gradient-matching objective is also minimized to 0 on $(\alpha^{\mathcal{S}}_t)_{t=0}^{T-1}$, which spans $\mathbb{R}^p$, we have,

Again, by linearly combining the above $T$ equations, and because $(\alpha^{\mathcal{S}}_t)_{t=0}^{T-1}$ can be transformed to the unit vectors in $\mathbb{R}^p$, we have,

Hence, for any new trajectory $(\alpha'^{\mathcal{S}}_t)_{t=0}^{T-1}$ which spans $\mathbb{R}^{p'}$ where $p' = 2K' + 1$, by linearly combining the above equations, we have,

With a similar procedure for the $X^\top C^\top C X$ part, we conclude that on the new trajectory

It remains to prove that on any new trajectory $\nabla_\alpha \mathcal{L}_{\mathcal{T}}(\alpha, W) = \nabla_\alpha \mathcal{L}_{\mathcal{S}}(\alpha, W)$. We only need to note that $\nabla_{\alpha^{(k)}} \mathcal{L}_{\mathcal{T}}(\alpha, W) = 2 W^\top X^\top P^k (CXW - y)$. Hence, by the $p$ equations above we can readily show $X^\top P^k y = X'^\top P'^k y'$ for $k = -K, \ldots, K$.

