GENERALIZING AND TENSORIZING SUBGRAPH SEARCH IN THE SUPERNET

Abstract

Recently, a special kind of graph, i.e., the supernet, in which two nodes can be connected by an edge with multiple choices, has exhibited its power in neural architecture search (NAS) by finding better architectures for computer vision (CV) and natural language processing (NLP) tasks. In this paper, we discover that the design of such discrete architectures also appears in many other important learning tasks, e.g., logical chain inference in knowledge graphs (KGs) and meta-path discovery in heterogeneous information networks (HINs). Thus, we are motivated to generalize the supernet search problem to a much broader range of tasks. However, none of the existing works are effective in this general setting, since the supernet's topology is highly task-dependent and diverse. To address this issue, we propose to tensorize the supernet, i.e., to unify the subgraph search problems with a tensor formulation and to encode the topology inside the supernet by a tensor network. We further propose an efficient algorithm that admits both stochastic and deterministic objectives to solve the search problem. Finally, we perform extensive experiments on diverse learning tasks, i.e., architecture design for CV, logic inference for KGs, and meta-path discovery for HINs. Empirical results demonstrate that our method leads to better performance and better architectures.

1. INTRODUCTION

Deep learning (Goodfellow et al., 2017) has been successfully applied in many applications, such as image classification for computer vision (CV) (LeCun et al., 1998; Krizhevsky et al., 2012; He et al., 2016; Huang et al., 2017) and language modeling for natural language processing (NLP) (Mikolov et al., 2013; Devlin et al., 2018). While architecture design is of great importance to deep learning, manually designing a proper architecture for a given task is hard, requires a great deal of human effort, and is sometimes even impossible (Zoph & Le, 2017; Baker et al., 2016). Recently, neural architecture search (NAS) techniques (Elsken et al., 2019) have been developed to alleviate this issue, mainly focusing on CV and NLP tasks. Behind existing NAS methods, a multi-graph (Skiena, 1992) structure, i.e., the supernet (Zoph et al., 2017; Pham et al., 2018; Liu et al., 2018), where nodes are connected by edges with multiple choices, has played a central role. In this context, the choices on each edge are different operations, and the subgraphs correspond to different neural architectures. The objective is to find a suitable subgraph in this supernet, i.e., a better neural architecture for the given task. However, the supernet does not only arise in the CV/NLP fields; we find that it also emerges in many other deep learning areas (see Table 1). One example is logical chain inference on knowledge graphs (Yang et al., 2017; Sadeghian et al., 2019; Qu & Tang, 2019), where the construction of logical rules can be modeled by a supernet. Another example is meta-path discovery in heterogeneous information networks (Yun et al., 2019; Wan et al., 2020), where the discovery of meta-paths can also be modeled by a supernet. Therefore, we propose to broaden the horizon of NAS, i.e., to generalize it to many deep learning fields and to solve the new NAS problem under a unified framework.
Since subgraphs are discrete objects (the choices on each edge are discrete), it has been a common approach (Liu et al., 2018; Sadeghian et al., 2019; Yun et al., 2019) to transform the search into a continuous optimization problem. Previous methods often introduce continuous parameters separately for each edge. However, this formulation cannot generalize to different supernets, as the topological structures of supernets are highly task-dependent and diverse; it therefore fails to capture the supernet's topology and hence is ineffective. In this paper, we propose a novel method, TRACE, which introduces a continuous parameter for each subgraph (all these parameters form a tensor). We then construct a tensor network (TN) (Cichocki et al., 2016; 2017) based on the topological structure of the supernet, and introduce an efficient algorithm for optimization on supernets with different tensor networks. Extensive experiments are conducted on diverse deep learning tasks, and empirical results demonstrate that TRACE performs better than the state-of-the-art methods in each domain. In summary, our contributions are as follows:
• We broaden the horizon of existing supernet-based NAS methods. Specifically, we generalize the concept of subgraph search in a supernet from NAS to other deep learning tasks that have graph-like structures and propose to solve them in a unified framework by tensorizing the supernet.
• While existing supernet-based NAS methods ignore the topological structure of the supernet, we encode the supernet in a topology-aware manner based on a tensor network and propose an efficient algorithm to solve the search problem.
• We conduct extensive experiments on various learning tasks, i.e., architecture design for CV, logical inference for KGs, and meta-path discovery for HINs. Empirical results demonstrate that our method can find better architectures, which lead to state-of-the-art performance on various applications.

2. RELATED WORK

2.1. SUPERNET IN NEURAL ARCHITECTURE SEARCH (NAS)

There have been numerous algorithms proposed to solve the NAS problem. The first NAS work, NASRL (Zoph & Le, 2017), models NAS as a multiple-decision-making problem and proposes to use reinforcement learning (RL) (Sutton & Barto, 2018) to solve it. However, this formulation does not consider the repetitively stacked nature of neural architectures and is very inefficient, as it has to train many different networks to convergence. To alleviate this issue, NASNet (Zoph et al., 2017) first models NAS as an optimization problem on a supernet. The supernet formulation enables searching for transferable architectures across different datasets and improves search efficiency. Later, based on the supernet formulation, ENAS (Pham et al., 2018) proposes a weight-sharing technique, which shares the weights across subgraphs of the supernet. This technique further improves search efficiency, and many methods have been proposed under this framework (see Table 1), including DARTS (Liu et al., 2018), SNAS (Xie et al., 2018), and NASP (Yao et al., 2020). DARTS is the first to introduce a deterministic formulation to the NAS field, and SNAS uses a similar parametrized method under a stochastic formulation. NASP improves upon DARTS by using a proximal operator (Parikh & Boyd, 2014) and activates only one subgraph in each iteration to avoid co-adaptation between subgraphs.

2.2. TENSOR METHODS IN MACHINE LEARNING

A tensor (Kolda & Bader, 2009) is a multi-dimensional array that extends vectors and matrices. Tensor methods have found wide applications in machine learning, including network compression (Novikov et al., 2015; Wang et al., 2018) and knowledge graph completion (Liu et al., 2020). Recently, Cichocki et al. (2016; 2017) proposed a unified framework called tensor network (TN), which uses an undirected graph to represent tensor decomposition methods. By using different graphs, TN covers many tensor decomposition methods as special cases, e.g., CP (Bro, 1997), Tucker (Tucker, 1966), tensor train (Oseledets, 2011), and tensor ring decomposition (Zhao et al., 2016). However, constructing a tensor network for a given problem is not easy, as the topological structure of the tensor network is hard to design (Li & Sun, 2020).
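As a small illustration (ours, not from the paper), the decompositions named above are all index contractions over small core tensors, which numpy's `einsum` expresses directly; the sizes and ranks below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, R = 4, 5, 6, 3

# CP: T[i,j,k] = sum_r a[i,r] * b[j,r] * c[k,r]
a = rng.normal(size=(I1, R))
b = rng.normal(size=(I2, R))
c = rng.normal(size=(I3, R))
T_cp = np.einsum('ir,jr,kr->ijk', a, b, c)

# Tucker: T[i,j,k] = sum_{p,q,r} G[p,q,r] * A[i,p] * B[j,q] * C[k,r]
G = rng.normal(size=(R, R, R))
A = rng.normal(size=(I1, R))
B = rng.normal(size=(I2, R))
Cm = rng.normal(size=(I3, R))
T_tucker = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, Cm)

# Tensor train: T[i,j,k] = sum_{p,q} g1[i,p] * g2[p,j,q] * g3[q,k]
g1 = rng.normal(size=(I1, R))
g2 = rng.normal(size=(R, I2, R))
g3 = rng.normal(size=(R, I3))
T_tt = np.einsum('ip,pjq,qk->ijk', g1, g2, g3)

print(T_cp.shape, T_tucker.shape, T_tt.shape)  # (4, 5, 6) (4, 5, 6) (4, 5, 6)
```

The graph of a TN determines which core tensors share a summed index, which is exactly the structural freedom the paper later exploits to match a supernet's topology.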

3. PROPOSED METHOD

Here, we describe our method for the supernet search problem. Sections 3.1-3.2 introduce how we tensorize supernets. Section 3.3 proposes an optimization algorithm for the search problem that can handle both stochastic and deterministic objectives. Finally, Section 3.4 presents how supernets appear beyond existing NAS works and how the search can be generalized to these new tasks.
Notations. In this paper, we use S to denote a supernet and P to denote a subgraph of the supernet. For a supernet S with T edges, we index the edges by e_1, ..., e_T and let C_t be the number of choices on edge e_t, with t ∈ {1, ..., T}. Writing the subscript i_1, ..., i_T as i- for short, we use S_{i-} to denote the subgraph with choices i_1 ∈ {1, ..., C_1}, ..., i_T ∈ {1, ..., C_T}. Finally, softmax(o_i) = exp(o_i) / Σ_{j=1}^n exp(o_j) denotes the softmax operation over a vector o ∈ R^n.

3.1. A TENSOR FORMULATION FOR SUPERNET

While existing works (Liu et al., 2018; Zoph et al., 2017; Pham et al., 2018) introduce parameters separately for each edge e_t, edges may correlate with each other, so a more general and natural approach is to introduce a continuous parameter directly for each subgraph P ∈ S. Since a subgraph P can be distinguished by its choices on each edge, we propose to encode all possible choices into a tensor T ∈ R^{C_1 × ... × C_T} and take these choices as indices, i.e., i-, into the tensor T. As a consequence, the subgraph P is indexed as S_{i-}, and T_{i-} ∈ [0, 1] represents how "good" P can be. The supernet search problem then becomes the constrained bi-level optimization problem

max_T M(f(w*(P), P); D_val), s.t. w*(P) = argmin_w L(f(w, P); D_tr), Σ_{i-} T_{i-} = 1. (1)

As in existing supernet search works, the subgraph P is searched in the upper level, while the network weight w is trained in the lower level; D_tr is the training set and D_val is the validation set. The extra constraint Σ_{i-} T_{i-} = 1 ensures that the probabilities of all subgraphs sum to one. Next, we show how P and T can be parameterized with topological information from the supernet in Section 3.2. Then, a gradient-based algorithm that can effectively handle the constrained bi-level optimization problem (1) is proposed in Section 3.3.
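A toy sketch of this formulation (the supernet sizes are ours, for illustration only): with T = 3 edges and C_t choices per edge, every subgraph is one entry of T, and the constraint makes the entries a distribution over all subgraphs:

```python
import numpy as np

C = (3, 4, 2)  # number of choices C_1, C_2, C_3 on the three edges
logits = np.random.default_rng(1).normal(size=C)

# Normalize over ALL subgraphs so that the sum over i- of T[i-] equals 1
T = np.exp(logits) / np.exp(logits).sum()
assert np.isclose(T.sum(), 1.0)

# The most promising subgraph is the index i- of the largest entry
best = np.unravel_index(T.argmax(), C)
print('best subgraph choices (i1, i2, i3):', best)
```

Note that storing T explicitly needs Π_t C_t entries, which is exactly why the tensor-network factorization of Section 3.2 is needed for larger supernets.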

3.2. ENCODING SUPERNET TOPOLOGY BY TENSOR NETWORK (TN)

Existing methods consider each edge separately and can be seen as a rank-1 factorization of the full tensor T (see Table 1) under (1), i.e., T_{i-} = θ^1_{i_1} ... θ^T_{i_T}, where θ^t_{i_t} ∈ R is the continuous parameter for choice i_t on edge t. However, this formulation ignores the topological structure of different supernets, as it uses the same decomposition method for all supernets. Motivated by this limitation, we propose to introduce a tensor network (TN) to better encode the topological structure of the supernet. Our encoding process is described in Algorithm 1, where N(S) denotes the set of nodes in the supernet and N̄(S) ⊆ N(S) denotes the set of nodes that are connected to more than one edge. Specifically, we introduce a third-order tensor α^t for each edge; this is based on previous methods (e.g., DARTS and SNAS) but uses tensors instead of vectors to allow more flexibility. We also introduce hyper-parameters R_{N1(t)} and R_{N2(t)}, which correspond to the ranks of the tensor network. Then, we use index summation to reflect the topological structure (common nodes) between different edges, i.e.,

T_{i-} = Σ_{r_n = 1..R_n, n ∈ N̄(S)} Π_{t=1}^T α^t_{r_{N1(t)}, i_t, r_{N2(t)}}, (2)

We also give two examples in Figure 2 to illustrate our tensorizing process for two specific supernets.
Algorithm 1 Supernet encoding process (a step-by-step and graphical illustration is in Appendix G).
Input: Supernet S;
1: Introduce α^t ∈ R^{R_{N1(t)} × C_t × R_{N2(t)}} for each edge e_t, which connects nodes N1(t) and N2(t);
2: Remove isolated nodes and obtain N̄(S);
3: Compute T_{i-} by (2);
4: return encoded supernet T;

The reason for using N̄(S) instead of N(S) is Proposition 1,² which shows that using N̄(S) does not restrict the expressive power but allows the usage of fewer parameters.
Proposition 1. Any tensor that can be expressed by T_{i-} = Σ_{r_n, n∈N(S)} Π_{t=1}^T α^t_{r_{N1(t)}, i_t, r_{N2(t)}} can also be expressed by T_{i-} = Σ_{r_n, n∈N̄(S)} Π_{t=1}^T α^t_{r_{N1(t)}, i_t, r_{N2(t)}}.
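The contraction in (2) can be checked numerically for the two supernets of Figure 2; below is a numpy sketch with rank R_n = 2 everywhere (the shapes are ours, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
C1, C2, C3, R = 3, 4, 5, 2  # choices per edge, and TN rank

# Chain supernet (Figure 2(b)):
# T[i1,i2,i3] = sum_{r1,r2} a1[i1,r1] * a2[r1,i2,r2] * a3[i3,r2]
a1 = rng.normal(size=(C1, R))
a2 = rng.normal(size=(R, C2, R))
a3 = rng.normal(size=(C3, R))
T_chain = np.einsum('ia,ajb,kb->ijk', a1, a2, a3)

# Cyclic supernet (Figure 2(d)): an extra summed index r0 closes the loop
b1 = rng.normal(size=(R, C1, R))
b2 = rng.normal(size=(R, C2, R))
b3 = rng.normal(size=(R, C3, R))
T_ring = np.einsum('pia,ajb,bkp->ijk', b1, b2, b3)

# Sanity check: one entry of the chain contraction, written out explicitly
manual = sum(a1[0, r1] * a2[r1, 0, r2] * a3[0, r2]
             for r1 in range(R) for r2 in range(R))
assert np.isclose(T_chain[0, 0, 0], manual)
print(T_chain.shape, T_ring.shape)  # (3, 4, 5) (3, 4, 5)
```

Each summed einsum index corresponds to one shared node of the supernet, so the contraction pattern (not just the ranks) carries the topology.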

3.3. SEARCH ALGORITHM

To handle subgraph search problems for various applications (Table 1), we need an algorithm that can solve the generalized supernet search problem in (1), parameterized by (2). However, the resulting problem is hard to solve, as we need to handle the constraint on T, i.e., Σ_{i-} T_{i-} = 1. To address this issue, we propose to re-parameterize α in (2) with the softmax trick, i.e.,

T̃_{i-} = 1 / (Π_{n∈N̄(S)} R_n) Σ_{r_n = 1..R_n, n∈N̄(S)} Π_{t=1}^T softmax(β^t)_{r_{N1(t)}, i_t, r_{N2(t)}}, (3)

where the softmax is taken over the choice index i_t. The optimization objective J in (1) then becomes

max_β J_β(f(w*(β), P(β)); D_val), s.t. w*(β) = argmin_w L(f(w*(β), P(β)); D_tr), (4)

where we have substituted the discrete subgraphs P and the normalization constraint on T with continuous parameters β. As shown in Proposition 2, we can now solve an unconstrained problem on β while keeping the constraint on T satisfied.
Proposition 2. Σ_{i-} T̃_{i-} = 1, where T̃ is given in (3).
Algorithm 2, remaining steps (the full procedure is in Appendix A):
6: Update supernet parameters β by gradient ascent on ∇_β J_β;
7: end while
8: Obtain P* = S_{i-} from the final T̃ by setting i- = argmax_{i-} T̃_{i-};
9: Obtain w*(P*) = argmin_w L(f(w, P*); D_tr) by retraining f(w, P*);
10: return P* (searched architecture) and w* (fine-tuned parameters);
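A numerical check of Proposition 2 on a chain supernet (sizes and ranks below are ours): softmax over the choice index, combined with division by the product of ranks, makes T̃ a valid distribution with no explicit constraint:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C1, C2, C3, R = 4, 5, 6, 2

# Unconstrained parameters beta, softmax-ed over the choice axis i_t
s1 = softmax(rng.normal(size=(C1, R)), axis=0)     # sums to 1 over i1 for each r1
s2 = softmax(rng.normal(size=(R, C2, R)), axis=1)  # sums to 1 over i2 for each (r1, r2)
s3 = softmax(rng.normal(size=(C3, R)), axis=0)     # sums to 1 over i3 for each r2

# Chain contraction as in (3), normalized by the product of ranks (R * R here)
T = np.einsum('ia,ajb,kb->ijk', s1, s2, s3) / (R * R)

assert np.all(T >= 0)
print(round(T.sum(), 6))  # 1.0, i.e. the entries sum to one automatically
```

Because each softmax factor sums to one over its choice index, summing T̃ over all of i- leaves Σ_{r_1, r_2} 1 = R·R, which the normalization factor cancels.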

3.4. SUBGRAPH SEARCH BEYOND EXISTING NAS

Beyond NAS, many important problems in machine learning have a graph-like structure. Examples include meta-path discovery (Yang et al., 2018; Yun et al., 2019; Wan et al., 2020), logical chain inference (Yang et al., 2017; Sadeghian et al., 2019), and structure learning between data points (Franceschi et al., 2019). Inspired by recent works that exploit graph-like structures in NAS (Li et al., 2020; You et al., 2020), we propose to model them also as subgraph search problems on supernets.
Meta-path discovery. Heterogeneous information networks (HINs) (Sun & Han, 2012; Shi et al., 2017) are networks whose nodes and edges have multiple types. HINs have been widely used in many real-world network mining scenarios, e.g., node classification (Wang et al., 2019) and recommendation (Zhao et al., 2017). For a heterogeneous network, a meta-path (Sun et al., 2011) is a path defined on it with multiple edge types. Intuitively, different meta-paths capture different semantic information from a heterogeneous network, and it is important to find suitable meta-paths for different applications on HINs. However, designing a meta-path on a HIN is not a trivial task; it requires much human effort and domain knowledge (Zhao et al., 2017; Yang et al., 2018). Thus, we propose to automatically discover informative meta-paths instead of designing them manually. To solve the meta-path discovery problem under the supernet framework, we first construct a supernet S (see Figure 2(a)), so that a subgraph P ∈ S is a meta-path on the HIN. While GTN (Yun et al., 2019) introduces weights separately for each edge, our model f(w, P) uses a tensor T to model P as a whole. The performance metrics L(·) and M(·) depend on the downstream task; in our experiments on node classification, we use the cross-entropy loss for L(·) and the macro F1 score for M(·).

Logical chain inference.

A knowledge graph (KG) (Singhal, 2012; Wang et al., 2017) is a multi-relational graph composed of entities (nodes) and relations (different types of edges). KGs have found wide applications in many different areas, including question answering (Lukovnikov et al., 2017) and recommendation (Zhang et al., 2016). An important method to understand the semantics in a KG is logical chain inference, which aims to find underlying logic rules in the KG. Specifically, a logical chain is a path on the knowledge graph of the form x →^{B_1} z_1 →^{B_2} ... →^{B_T} y, where x, y, z_1, ... are entities and B_1, B_2, ..., B_T are different relations in the knowledge graph. Logical chain inference is to use such a chain to approximate a target relation in the KG. Obviously, different logical chains can have a critical influence on the KG, as incorrect logic rules will lead to wrong facts. However, directly solving the inference problem has an exponential complexity, as we would have to enumerate all relations (Hamilton et al., 2018). Thus, we propose to model it as a supernet search problem to reduce the complexity. Since a logical chain has a chain structure, we construct a supernet as in Figure 2(a). Denote the target relation as B_r, the adjacency matrix of relation B_r as A_{B_r}, and the one-hot vector corresponding to entity x as v_x. Our learning model f(w, P) now has no model parameter w, and the original bi-level problem reduces to a single-level one with the following performance measure:

M(f(w, P); D) = Σ_{B_r(x,y)=1 in D} v_x^T (Π_{i=1}^T A_{B_i}) v_y,
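A toy numpy sketch of this performance measure (the entities, relations, and adjacency matrices below are made up for illustration):

```python
import numpy as np

n = 4  # number of entities
# Toy adjacency matrices of two relations B1, B2: A[i, j] = 1 iff B(i, j) holds
A_B1 = np.array([[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]])
A_B2 = np.array([[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]])

def chain_score(x, y, chain):
    """v_x^T (prod_i A_Bi) v_y: number of paths from x to y following the chain."""
    M = chain[0]
    for A in chain[1:]:
        M = M @ A
    return float(np.eye(n)[x] @ M @ np.eye(n)[y])

# Entity 0 --B1--> 1 --B2--> 2, so the chain (B1, B2) connects 0 to 2 ...
print(chain_score(0, 2, [A_B1, A_B2]))  # 1.0
# ... but no path following this chain reaches 2 from entity 1
print(chain_score(1, 2, [A_B1, A_B2]))  # 0.0
```

Summing this score over all pairs (x, y) with B_r(x, y) = 1 gives the measure above; the supernet search then picks which relation B_i sits on each edge of the chain.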

4. EXPERIMENTS

All experiments are implemented in PyTorch (Paszke et al., 2017), except for logical chain inference, which is implemented in TensorFlow (Abadi et al., 2016) following DRUM (Sadeghian et al., 2019). All experiments were run on a single NVIDIA RTX 2080 Ti GPU.

4.1. BENCHMARK PERFORMANCE COMPARISON

Here, we compare our proposed method with the state-of-the-art methods on three different applications that can be seen as subgraph search problems, i.e., neural architecture design for image classification, logical chain inference from KG, and meta-path discovery in HIN.

4.1.1. DESIGNING CONVOLUTIONAL NEURAL NETWORK (CNN) ARCHITECTURES

We first apply TRACE to the architecture design problem on CNNs for image classification, which is currently the most famous application of supernet-based methods. We consider the following two settings for our NAS experiments: (i) Stand-alone (Zoph & Le, 2017; Zoph et al., 2017): train each architecture to convergence to obtain a separate w*(P); (ii) Weight-sharing (Liu et al., 2018; Xie et al., 2018; Yao et al., 2020): share the same parameter w across different architectures during searching. For both settings, we repeat our method five times and report the mean±std of the test accuracy of the searched architectures.
Stand-alone setting. To enable comparison under the stand-alone setting, we use the NAS-Bench-201 dataset (Dong & Yang, 2020), whose authors exhaustively trained all subgraphs in a supernet and obtained a complete record of each subgraph's accuracy on three datasets: CIFAR-10, CIFAR-100, and ImageNet-16-120 (details in Appendix E.1). We use the stochastic formulation and compare our method with (i) Random Search (Yu et al., 2020); (ii) REINFORCE (policy gradient) (Zoph & Le, 2017); (iii) BOHB (Falkner et al., 2018); and (iv) REA (regularized evolution) (Real et al., 2018). Results are in Table 2; our method achieves better results than all existing stand-alone NAS methods and even finds the optimal architecture on the CIFAR-10 and CIFAR-100 datasets.
Weight-sharing setting. We use the deterministic formulation, construct the supernet following (Liu et al., 2018), and evaluate all methods on the CIFAR-10 dataset (details are in Appendix E.1).
These are the most popular setups for weight-sharing NAS. Results are in Tables 3 and 4; we can see that TRACE achieves comparable performance with existing weight-sharing NAS methods.

Table 3 (partial; test error (%) / params (M) / search cost (GPU days) / search method, on CIFAR-10):
Random (Yu et al., 2020): 2.85±0.08 / 4.3 / - / random
DARTS (1st) (Liu et al., 2018): 3.00±0.14 / 3.3 / 1.5 / gradient
DARTS (2nd) (Liu et al., 2018): 2.76±0.09 / 3.3 / 4 / gradient
SNAS (Xie et al., 2018): 2.85±0.02 / 2.8 / 1.5 / gradient
GDAS (Dong & Yang, 2019): 2.93 / 3.4 / 0.21 / gradient
BayesNAS (Zhou et al., 2019): 2.81±0.04 / 3.4 / 0.2 / gradient
ASNG-NAS (Akimoto et al., 2019): 2.83±0.14 / 2.9 / 0.11 / natural gradient
NASP (Yao et al., 2020): 2.83±0.09 / 3.3 / 0.1 / proximal algorithm
R-DARTS (Zela et al., 2020): 2.95±0.21 / - / 1.6 / gradient
PDARTS (Chen et al., 2019): 2.50 / 3.4 / 0.3 / gradient
PC-DARTS (Xu et al., 2020): 2…

Table 4 (partial; on ImageNet):
GDAS (Dong & Yang, 2019): 26.0 / 8.5 / 5.3 / 581
BayesNAS (Zhou et al., 2019): 26.5 / 8.9 / 3.9 / -
PDARTS (Chen et al., 2019): 24.4 / 7.4 / 4.9 / 557
PC-DARTS (Xu et al., 2020): 25…

4.1.2. LOGIC CHAIN INFERENCE FROM KNOWLEDGE GRAPH (KG)

For logical chain inference, we use the deterministic formulation and compare our method with the following methods: Neural LP (Yang et al., 2017), DRUM (Sadeghian et al., 2019), and GraIL (Teru et al., 2020). Neural LP and DRUM are restricted to logical chain inference, while GraIL considers more complex graph structures. We also compare our method with randomly generated rules to better demonstrate its effectiveness. We do not compare with embedding-based methods, e.g., RotatE (Sun et al., 2019), as those methods all need embeddings for entities and cannot generalize found rules to unseen entities. Following the setting of DRUM, we conduct experiments on three KG datasets: Family, UMLS, and Kinship (details are in Appendix E.2), and report the best mean reciprocal rank (MRR) and Hits@1, Hits@3, and Hits@10 across 5 different runs. Results are in Table 5, which demonstrates that our proposed method achieves better results than all existing methods. Besides, the case studies in Section 4.2 further demonstrate that TRACE can find more accurate rules than the others.

4.1.3. META-PATH DISCOVERY IN HETEROGENEOUS INFORMATION NETWORK (HIN)

Finally, we apply TRACE to the meta-path discovery problem on HINs. Following existing works (Wang et al., 2019; Yun et al., 2019), we use the deterministic formulation, conduct experiments on three benchmark datasets: DBLP, ACM, and IMDB (details are in Appendix E.3), and compare our method with 1) the baselines in GTN (Yun et al., 2019), i.e., DeepWalk (Bryan et al., 2014), metapath2vec (Dong et al., 2017), GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2018), and HAN (Wang et al., 2019), and 2) randomly generated meta-paths. Results on the different datasets are in Table 6, which demonstrates that TRACE performs better than the other methods on different HINs.

4.2. CASE STUDY

To further investigate the performance of TRACE, we list the top rules found by TRACE and other methods in Table 7. The results demonstrate that our method can find more accurate logic rules than the other baselines, which contributes to its superior performance. We also give the architectures and meta-paths found by TRACE in Appendix F. Table 7: An example of the top 3 rules obtained by each method on the Family dataset.

4.3.2. IMPACT OF TENSOR NETWORK RANKS

We also investigate the impact of the R_n's, which are the ranks of the tensor network (TN) on the supernet. For simplicity, we restrict R_n to be equal for all nodes n ∈ N̄(S) and compare the performance of different ranks with the previous state-of-the-art REA (see Table 2) in Figure 3(b). The results demonstrate that while the rank can influence the final performance, it is easy to set the rank properly for TRACE to beat the other methods. We adopt R_n = 2 for all other experiments.

4.3.3. OPTIMIZATION ALGORITHMS

Finally, we compare TRACE with the proximal algorithm (Parikh & Boyd, 2014), a popular and general algorithm for constrained optimization. Specifically, the proximal algorithm is used to solve (1) with the constraint Σ_{i-} T_{i-} = 1 directly, without Proposition 2. We solve the proximal step iteratively and numerically, since there is no closed-form solution. The comparison is in Figure 3(c); TRACE beats the proximal algorithm by a large margin, which demonstrates that the re-parameterization in Proposition 2 is useful for optimization.

5. CONCLUSION

In this paper, we generalize the supernet from neural architecture search (NAS) to other machine learning tasks. To expressively model the supernet, we introduce a tensor formulation of the supernet and represent it by a tensor network (tensorizing supernets). We further propose an efficient gradient-based algorithm to solve the new supernet search problem. Empirical results across various applications demonstrate that our approach achieves superior performance on these machine learning tasks.

A COMPLETE ALGORITHMS OF OPTIMIZATION ON TENSORIZED SUPERNET

Here, we give a more detailed description of our algorithm under the deterministic and stochastic formulations in Algorithms 3 and 4, respectively. In our experiments, we use the stochastic formulation for NAS under the stand-alone setting and the deterministic formulation for the others.
Algorithm 4, remaining steps:
6: Update supernet parameters β by gradient ascent on ∇_β J_β;
7: Save the best w*(P), P so far;
8: end while
9: return the best w*(P), P from Step 7;

B COMPARISON OF PARAMETERS FOR DIFFERENT METHODS

Here, we compare the number of parameters and the computational cost (FLOPs) of different methods for logic rule inference. We use n as the length of logical chains, d as the dimension of embeddings, r as the rank of the tensor network, and e as the number of relations. Results are in Table 8. Since n and r are often significantly smaller than e and d (typical values are n = 3, r = 2, d = 128, and e ≈ 20-30), TRACE has a comparable number of parameters and computational cost to existing methods. For the comparison on HINs, we use n as the length of meta-paths, d as the dimension of embeddings, and r as the rank of the tensor network. Results are in Table 9. Note that r is often relatively small compared with d (typical values are r = 2 and d = 128); thus, TRACE has a similar number of parameters and computational cost to existing methods.
… where a similar process is done for all edges. Thus, only nodes whose degree is greater than one are actually needed for index contraction, and for n ∈ N(S) \ N̄(S), we can simply set R_n = 1 without loss of expressive power.

F MORE EXPERIMENT RESULTS

F.1 CORRELATION ANALYSIS

Indeed, correlation is a good criterion to show the rationality of one-shot architecture search methods (Bender et al., 2018; Liu et al., 2018; Yu et al., 2020; Guo et al., 2020). However, it is only a sufficient, not a necessary, condition. Specifically, the goal of the tensor T here is to capture good subgraphs in the whole supernet; thus, we expect the probabilities to concentrate on a few top subgraphs, as shown in Figure 4.



Footnotes:
1. Figure 2(b) and (d) also follow the standard diagrammatic notation of tensor networks; readers may refer to (Cichocki et al., 2017) for more details.
2. All proofs are in Appendix D.
3. If N1(t) is only connected by edge t (i.e., its degree is 1), N2(t) cannot also be only connected by edge t, unless the supernet has a single edge t connecting the two nodes N1(t) and N2(t), which is a trivial case.



Figure 1: An example of a supernet and two subgraphs with index i-.

Figure 2: Some examples of supernets and the corresponding tensor networks (TNs). Information flows in supernets (a) and (c) follow the order of the numbers in the blocks. Circles in (b) and (d) represent core tensors, and edges represent indices. An edge connecting two circles indicates index summation over r_{N1(t)} or r_{N2(t)}, where N1(t) (or N2(t)) is the index of the node in the supernet (1, 2 in (a) and 0, 1, 2 in (c)). The formula for (b) is T_{i1,i2,i3} = Σ_{r1,r2} α^1_{i1,r1} α^2_{r1,i2,r2} α^3_{i3,r2}, and the formula for (d) is T_{i1,i2,i3} = Σ_{r0,r1,r2} α^1_{r0,i1,r1} α^2_{r1,i2,r2} α^3_{r2,i3,r0}.

which counts the number of pairs (x, y) that have relation B_r and are predicted by the logical chain x →^{B_1} z_1 →^{B_2} ... →^{B_T} y in the KG D.

Neural LP: son(C, A) ← brother(C, B), son(B, A); son(B, A) ← brother(B, A); son(C, A) ← son(C, B), mother(B, A).
DRUM: son(C, A) ← nephew(C, B), brother(B, A); son(C, A) ← brother(C, B), son(B, A); son(C, A) ← brother(C, B), daughter(B, A).
TRACE: son(C, A) ← son(C, B), wife(B, A); son(C, A) ← brother(C, B), son(B, A); son(C, A) ← nephew(C, B), brother(B, A).

4.3 ABLATION STUDY

4.3.1 IMPACT OF ENCODING APPROACH

We compare TRACE with the following encoding methods on the supernet: (i) DARTS, which introduces continuous parameters for each edge separately; (ii) RNN, which uses an RNN to compute the weights for each edge; (iii) CP decomposition, which generalizes DARTS to higher rank; (iv) TRACE (Full), which does not adopt Proposition 1 in Algorithm 1. Results on NAS-Bench-201 using CIFAR-100 are shown in Figure 3(a). We can see that DARTS performs worse than the other methods (CP and TRACE) due to insufficient expressive power, and TRACE achieves better results than CP by being topology-aware. This also shows that our simplified encoding scheme does not harm the final performance, as verified in Proposition 1. (a) Encoding methods. (b) TN ranks. (c) Re-parametrization.

Figure 3: Ablation studies on NAS-Bench-201, CIFAR-100 is used

Algorithm 3 TRACE (deterministic formulation with weight-sharing)
Input: Training set D_tr, validation set D_val;
1: Tensorize the supernet T with Algorithm 1;
2: Re-parameterize T to T̃ using (3);
3: while not converged do
4: Update model parameters w(β) by gradient descent on ∇_w L(f(w, β); D_tr);
5: Update supernet parameters β by gradient ascent on ∇_β J_β;
6: end while
7: Obtain P* = S_{i-} from the final T̃ by setting i- = argmax_{i-} T̃_{i-};
8: Obtain w*(P*) = argmin_w L(f(w, P*); D_tr) by retraining f(w, P*);
9: return w*, P*;

Algorithm 4 TRACE (stochastic formulation)
Input: Training set D_tr, validation set D_val;
1: Tensorize the supernet T with Algorithm 1;
2: Re-parameterize T to T̃ using (3);
3: while not converged do
4: Sample a subgraph P from the probability distribution given by T̃;
5: Solve w*(P) = argmin_w L(f(w, P); D_tr);
6:
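Step 4 of Algorithm 4, sampling a subgraph from the distribution given by T̃, can be sketched for a supernet small enough to materialize the full tensor (the sizes below are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
C = (3, 4, 2)  # number of choices per edge

T = rng.random(C)
T = T / T.sum()  # stand-in for the re-parameterized tensor; a valid distribution

# Draw a flat index with probability proportional to T, then recover per-edge choices
flat = rng.choice(T.size, p=T.ravel())
choices = np.unravel_index(flat, C)
print('sampled subgraph choices (i1, i2, i3):', choices)
```

For large supernets the full tensor cannot be materialized, so sampling would instead exploit the tensor-network factorization; the sketch above only illustrates the interface of the sampling step.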

Figure 4: Correlation of T and ground-truth accuracy on NAS-Bench-201 for different datasets.

Figure 6: Architectures found by TRACE on weight-sharing setting.



Figure 7: Validation MRR during training on three KG datasets.

Figure 8: Validation macro F1 score during training on three HIN datasets.

Figure 9: Ablation studies on NAS-Bench-201, CIFAR-10 is used

A comparison of existing NAS/non-NAS works for designing discrete architectures, based on our tensorized formulation of the supernet. "Topology" indicates whether the topological structure of the supernet is utilized.

Thus, gradient-based training strategies can be reused, which makes the optimization very efficient. The complete steps for optimizing (4) are presented in Algorithm 2. Note that our algorithm can handle both the deterministic and stochastic formulations (see Appendix A). After the optimization of β has converged, we obtain P* from the tensor T̃ and retrain the model to obtain w*. Input (Algorithm 2): a subgraph search problem with training set D_tr, validation set D_val, and supernet S;

Comparison with NAS methods in stand-alone setting on NAS-Bench-201.

Comparison with NAS methods in weight-sharing setting on CIFAR-10.

Comparison with NAS methods in weight-sharing setting on ImageNet.

Experiment results on logical chain inference.

Evaluation results on the node classification task (F1 score).

Comparison of number of parameters for different methods on KG.

Comparison of number of parameters for different methods on HIN.

Comparison of the number of parameters for different encoding methods.

Meta-paths found by GTN and TRACE on different HIN.

D.2 PROPOSITION 2

Proof. Following (3), for any fixed r_{N1(t)} and r_{N2(t)}, we have Σ_{i_t} exp(β^t_{r_{N1(t)}, i_t, r_{N2(t)}}) / Σ_j exp(β^t_{r_{N1(t)}, j, r_{N2(t)}}) = 1. Summing T̃_{i-} over all indices i- therefore gives a factor R_n for each rank index r_n, i.e., Π_{n∈N̄(S)} R_n in total, which is cancelled by the normalization factor 1 / Π_{n∈N̄(S)} R_n in (3), so Σ_{i-} T̃_{i-} = 1.

E.1 NAS EXPERIMENTS

Stand-alone setting. The supernet used in NAS-Bench-201 has 4 nodes, and each pair of nodes is connected by a directed edge, which gives 6 edges in total. For each edge, there are 5 different operations ("choices"): zero, skip connect, 1×1 convolution, 3×3 convolution, and 3×3 average pooling. The details of the datasets used in NAS-Bench-201 are in Table 11.
Weight-sharing setting. Our construction of the supernet follows (Liu et al., 2018). The supernet has 7 nodes, where the first two nodes are the outputs of the previous two cells, and the last node performs depthwise concatenation of the outputs of the remaining four nodes. Thus, the supernet has 8 edges with multiple choices (operations), and for each edge we consider 8 different operations: zero, skip connect, 3×3 and 5×5 separable convolution, 3×3 and 5×5 dilated separable convolution, 3×3 max pooling, and 3×3 average pooling. We evaluate all weight-sharing methods on the CIFAR-10 dataset, and the dataset division is the same as in Table 11.

E.2 LOGIC CHAIN INFERENCE

In our experiments, we follow the setting in DRUM (Sadeghian et al., 2019) and set the maximum rule length T to 3 for all datasets. We set the rank L in DRUM to 4, based on the best validation performance. The details of the KG datasets used in the experiments are in Table 12.

