TOWARDS DISCOVERING NEURAL ARCHITECTURES FROM SCRATCH

Abstract

The discovery of neural architectures from scratch is the long-standing goal of Neural Architecture Search (NAS). Searching over a wide spectrum of neural architectures can facilitate the discovery of previously unconsidered but well-performing architectures. In this work, we take a large step towards discovering neural architectures from scratch by expressing architectures algebraically. This algebraic view leads to a more general method for designing search spaces, which allows us to compactly represent search spaces that are 100s of orders of magnitude larger than common spaces from the literature. Further, we propose a Bayesian Optimization strategy to efficiently search over such huge spaces, and demonstrate empirically that both our search space design and our search strategy can be superior to existing baselines. We open source our algebraic NAS approach and provide APIs for PyTorch and TensorFlow.

1. INTRODUCTION

Neural Architecture Search (NAS), a field with over 1 000 papers in the last two years (Deng & Lindauer, 2022), is widely touted to automatically discover novel, well-performing architectural patterns. However, while state-of-the-art performance has already been demonstrated in hundreds of NAS papers (prominently, e.g., Tan & Le, 2019; 2021; Liu et al., 2019a), success in automatically finding truly novel architectural patterns has been very scarce (Ramachandran et al., 2017; Liu et al., 2020). For example, novel architectures such as transformers (Vaswani et al., 2017; Dosovitskiy et al., 2021) have been crafted manually and were not found by NAS.

There is an accumulating amount of evidence that over-engineered, restrictive search spaces (e.g., cell-based ones) are major impediments for NAS to discover truly novel architectures. Yang et al. (2020b) showed that in the DARTS search space (Liu et al., 2019b) the manually defined macro architecture is more important than the searched cells, while Xie et al. (2019) and Ru et al. (2020) achieved competitive performance with randomly wired neural architectures that do not adhere to common search space limitations. As a result, there are increasing efforts to break these impediments, and the discovery of novel neural architectures has been referred to as the holy grail of NAS.

Hierarchical search spaces are a promising step towards this holy grail. In an initial work, Liu et al. (2018) proposed a hierarchical cell, which is shared across a fixed macro architecture, imitating the compositional neural architecture design pattern widely used by human experts. However, subsequent works showed the importance of both layer diversity (Tan & Le, 2019) and the macro architecture (Xie et al., 2019; Ru et al., 2020). In this work, we introduce a general formalism for the representation of hierarchical search spaces, allowing both for layer diversity and a flexible macro architecture.
The key observation is that any neural architecture can be represented algebraically; e.g., two residual blocks followed by a fully-connected layer in a linear macro topology can be represented as the algebraic term

ω = Linear(Residual(conv, id, conv), Residual(conv, id, conv), fc) . (1)

We build upon this observation and employ Context-Free Grammars (CFGs) to construct large spaces of such algebraic architecture terms. Although a particular search space is of course limited in its overall expressiveness, with this approach we could effectively represent any neural architecture, facilitating the discovery of truly novel ones. Due to the hierarchical structure of algebraic terms, the number of candidate neural architectures scales exponentially with the number of hierarchical levels, leading to search spaces 100s of orders of magnitude larger than commonly used ones. To search in these huge spaces, we propose an efficient search strategy, Bayesian Optimization for Algebraic Neural Architecture Terms (BANAT), which leverages hierarchical information, capturing the topological patterns across the hierarchical levels, in its tailored kernel design.

Our contributions are as follows:

• We present a novel technique to construct hierarchical NAS spaces based on an algebraic notion that views neural architectures as algebraic architecture terms and uses CFGs to create algebraic search spaces (Section 2).
• We propose BANAT, a Bayesian Optimization (BO) strategy that uses a tailored modeling strategy to efficiently and effectively search over our huge search spaces (Section 3).
• After surveying related work (Section 4), we empirically show that search spaces of algebraic architecture terms perform on par or better than common cell-based spaces on different datasets, show the superiority of BANAT over common baselines, demonstrate the importance of incorporating hierarchical information in the modeling, and show that we can find novel architectural parts from basic mathematical operations (Section 5).

We open source our code and provide APIs for PyTorch (Paszke et al., 2019) and TensorFlow (Abadi et al., 2015) at https://anonymous.4open.science/r/iclr23_tdnafs.
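To make the algebraic view concrete, the following minimal sketch (our own illustration, not the released API) represents an algebraic architecture term like ω from Equation 1 as a nested Python structure and evaluates a stand-in for the mapping Φ from terms to architectures. The operator names and the toy numeric "primitives" are hypothetical placeholders for real layers:

```python
from functools import reduce

def assemble(term, primitives, topologies):
    """Map an algebraic architecture term to a callable (a toy stand-in for Phi)."""
    if isinstance(term, str):                      # nullary operator (primitive)
        return primitives[term]
    name, *subterms = term                         # k-ary topological operator
    children = [assemble(t, primitives, topologies) for t in subterms]
    return topologies[name](children)

# Toy primitives operating on plain numbers instead of tensors (illustrative only).
primitives = {
    "conv": lambda x: 2 * x,   # stand-in for a convolution
    "id":   lambda x: x,       # identity / skip connection
    "fc":   lambda x: x + 1,   # stand-in for a fully-connected layer
}

topologies = {
    # Linear: sequential composition of its children.
    "Linear": lambda fs: lambda x: reduce(lambda acc, f: f(acc), fs, x),
    # Residual: third child applied to the sum of the first two, as in Residual(conv, id, conv).
    "Residual": lambda fs: lambda x: fs[2](fs[0](x) + fs[1](x)),
}

# ω = Linear(Residual(conv, id, conv), Residual(conv, id, conv), fc)
omega = ("Linear",
         ("Residual", "conv", "id", "conv"),
         ("Residual", "conv", "id", "conv"),
         "fc")

net = assemble(omega, primitives, topologies)
print(net(1.0))  # prints 37.0
```

In practice the primitives would construct PyTorch or TensorFlow modules rather than arithmetic functions, but the recursive assembly over the term structure is the same.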

2. ALGEBRAIC NEURAL ARCHITECTURE SEARCH SPACE CONSTRUCTION

In this section, we present an algebraic view on Neural Architecture Search (NAS) (Section 2.1) and propose a construction mechanism based on Context-Free Grammars (CFGs) (Sections 2.2 and 2.3).

2.1. ALGEBRAIC ARCHITECTURE TERMS FOR NEURAL ARCHITECTURE SEARCH

We introduce algebraic architecture terms as a string representation for neural architectures from a (term) algebra. Formally, an algebra (A, F) consists of a non-empty set A (universe) and a set of operators f : A^n → A ∈ F of different arities n ≥ 0 (Birkhoff, 1935). In our case, A corresponds to the set of all (sub-)architectures and we distinguish between two types of operators: (i) nullary operators representing primitive computations (e.g., conv() or fc()) and (ii) k-ary operators with k > 0 representing topological operators (e.g., Linear(·, ·, ·) or Residual(·, ·, ·)). For the sake of notational simplicity, we omit parentheses for nullary operators (i.e., we write conv). Term algebras (Baader & Nipkow, 1999) are a special type of algebra mapping an algebraic expression to its string representation; e.g., we can represent a neural architecture as the algebraic architecture term ω shown in Equation 1. Term algebras also allow for variables x_i that are set to terms themselves and can be re-used across a term. In our case, the intermediate variables x_i can therefore share patterns across the architecture, e.g., a shared cell. For example, we could define the intermediate variable x_1 to map to the residual block in ω from Equation 1 as follows:

ω′ = Linear(x_1, x_1, fc), x_1 = Residual(conv, id, conv) .

Algebraic NAS We formulate our algebraic view on NAS, where we search over algebraic architecture terms ω ∈ Ω representing their associated architectures Φ(ω), as follows:

arg min_{ω∈Ω} f(Φ(ω)) ,

where f(·) is an error measure that we seek to minimize, e.g., the final validation error of a fixed training protocol. For example, we can represent the popular cell-based NAS-Bench-201 search space (Dong & Yang, 2020) as an algebraic search space Ω. The algebraic search space Ω is characterized by a fixed macro architecture Macro(. . .) that stacks 15 instances of a shared cell Cell(p_i, p_i, p_i, p_i, p_i, p_i), where the cell has six edges, on each of which one of five primitive computations can be placed (i.e., p_i for i ∈ {1, 2, 3, 4, 5} corresponding to zero, id, conv1x1, conv3x3, or avg_pool, respectively). By leveraging the intermediate variable x 1
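The construction of spaces of algebraic terms from a CFG can be sketched with a short recursive derivation. The toy grammar below is our own illustration (its nonterminals and productions are hypothetical, not the paper's actual search-space grammar); it derives terms such as Linear(Residual(conv, id, conv), conv, fc):

```python
import random

# Illustrative context-free grammar over algebraic architecture terms.
# Nonterminals map to lists of productions; each production is a list of
# symbols, where a symbol is either a nonterminal or a terminal string.
GRAMMAR = {
    "ARCH": [["Linear(", "BLOCK", ", ", "BLOCK", ", ", "fc", ")"]],
    "BLOCK": [
        ["Residual(", "OP", ", ", "id", ", ", "OP", ")"],
        ["OP"],
    ],
    "OP": [["conv"], ["avg_pool"], ["zero"]],
}

def sample_term(symbol="ARCH", rng=random):
    """Derive a random algebraic architecture term by expanding the grammar."""
    if symbol not in GRAMMAR:          # terminal symbol: emit as-is
        return symbol
    production = rng.choice(GRAMMAR[symbol])
    return "".join(sample_term(s, rng) for s in production)

random.seed(0)
print(sample_term())
```

Because nonterminals such as BLOCK can expand recursively through several hierarchical levels, the number of derivable terms grows exponentially with the depth of the grammar, which is what makes the resulting search spaces so large.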

