SEARCHING FOR CONVOLUTIONS AND A MORE AMBITIOUS NAS

Abstract

An important goal of neural architecture search (NAS) is to automate away the design of neural networks on new tasks in under-explored domains, thus helping to democratize machine learning. However, current NAS research largely focuses on search spaces consisting of existing operations, such as different types of convolution, that are already known to work well on well-studied problems, often in computer vision. Our work is motivated by the following question: can we enable users to build their own search spaces and discover the right neural operations given data from their specific domain? We make progress towards this broader vision for NAS by introducing a space of operations generalizing the convolution that enables search over a large family of parameterizable linear-time matrix-vector functions. Our flexible construction allows users to design their own search spaces adapted to the nature and shape of their data, to warm-start search methods using convolutions when they are known to perform well, or to discover new operations from scratch when they do not. We evaluate our approach on several novel search spaces over vision and text data, on all of which simple NAS search algorithms can find operations that perform better than baseline layers.

1. INTRODUCTION

Neural architecture search is often motivated by the AutoML vision of democratizing ML by reducing the need for expert deep net design, both on existing problems and in new domains. However, while NAS research has seen rapid growth with developments such as weight-sharing (Pham et al., 2018) and "NAS-benches" (Ying et al., 2019; Zela et al., 2020), most efforts focus on search spaces that glue together established primitives for well-studied tasks like vision and text (Liu et al., 2019; Li & Talwalkar, 2019; Xu et al., 2020; Li et al., 2020) or on deployment-time issues such as latency (Cai et al., 2020). Application studies have followed suit (Nekrasov et al., 2019; Wang et al., 2020). In this work, we revisit a broader vision for NAS, proposing to move towards much more general search spaces while still exploiting successful components of leading network topologies and efficient NAS methods. We introduce search spaces built using the Chrysalis,¹ a rich family of parameterizable operations that we develop using a characterization of efficient matrix transforms by Dao et al. (2020) and which contains convolutions and many other simple linear operations. When combined with a backbone architecture, the Chrysalis induces general NAS search spaces for discovering the right operation for a given type of data. For example, when inducing a novel search space from the LeNet architecture (LeCun et al., 1999), we show that randomly initialized gradient-based NAS methods applied to CIFAR-10 discover operations in the Chrysalis that outperform convolutions, the "right" operation for vision, by 1% on both CIFAR-10 and CIFAR-100.
Our contributions, summarized below, take critical steps towards a broader NAS that enables the discovery of good design patterns with limited human specification from data in under-explored domains:

• We define the broad NAS problem and discuss how it interacts with modern techniques such as continuous relaxation, weight-sharing, and bilevel optimization. This discussion sets up our new approach for search space design and our associated evaluations of whether leading NAS methods, applied to our proposed search spaces, can find good parameterizable operations.

• We introduce Kaleidoscope-operations (K-operations), the parameterizable operations comprising the Chrysalis, which generalize the convolution while preserving key desirable properties: short description length, linearity, and fast computation. Notably, K-operations can be combined with fixed architectures to induce rich search spaces in which architectural parameters are decoupled from model weights, the former to be searched via NAS methods.

• We evaluate the Chrysalis on text and image settings where convolutions are known to work well. For images, we construct the ButterfLeNet search space by combining K-operations with the well-known LeNet (LeCun et al., 1999). For text classification, we generalize the simple multi-width model of Kim (2014). On both we evaluate several applicable NAS methods and find that single-level supernet SGD is able to find operations that come close to or match the performance of convolutions when searched from scratch, while also improving upon them when warm-started.

• We conclude by examining the generality of our approach on domains where convolutions are the "wrong" operation. We first consider permuted image data, where the "right" operation is a permutation followed by convolution, and observe that NAS methods applied to the ButterfLeNet search space yield an architecture that outperforms all fixed-operation baselines by 8%. Next we consider the spherical MNIST data of Cohen et al. (2018), where the "right" operation is the spherical convolution from the same paper. We consider the K-operation search space that generalizes their network and again find that it outperforms convolutions by more than 20%. Our results highlight the capacity of K-operation-based search spaces, coupled with standard NAS methods, to broaden the scope of NAS to discovering neural primitives in new data domains.
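To give a concrete sense of the butterfly factorizations of Dao et al. (2020) that underlie K-operations, the following NumPy sketch (ours, for illustration only, not the paper's implementation) builds an n × n linear map as a product of log₂(n) sparse butterfly factors. Each factor has only two nonzeros per row, so applying the map factor-by-factor costs O(n log n) rather than O(n²); particular settings of the 2×2 blocks recover transforms such as the FFT, and hence circular convolution.

```python
import numpy as np

def butterfly_factor(n, stride, params):
    """Dense n x n butterfly factor: each index pair (j, j + stride)
    is mixed by its own 2x2 block, giving two nonzeros per row.

    params: array of shape (n // 2, 2, 2), one 2x2 block per pair.
    """
    B = np.zeros((n, n))
    pair = 0
    for start in range(0, n, 2 * stride):
        for j in range(start, start + stride):
            (a, b), (c, d) = params[pair]
            B[j, j] = a
            B[j, j + stride] = b
            B[j + stride, j] = c
            B[j + stride, j + stride] = d
            pair += 1
    return B

def butterfly_matrix(n, all_params):
    """Product of log2(n) butterfly factors with strides n/2, n/4, ..., 1.

    Materialized densely here for clarity; in practice the factors are
    applied one at a time, so a matrix-vector product costs O(n log n).
    """
    M = np.eye(n)
    stride = n // 2
    for params in all_params:
        M = butterfly_factor(n, stride, params) @ M
        stride //= 2
    return M
```

The parameters of the 2×2 blocks play the role of the architecture/weight parameters being searched and trained; setting every block to the identity yields the identity map, and learned blocks can express convolutions and many other structured linear operations.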

1.1. RELATED WORK

AutoML is a well-studied area, with most work focusing on the fairly small search spaces of hyperparameter optimization (Bergstra & Bengio, 2012; Li et al., 2018) or on NAS (Elsken et al., 2019). In the latter case, it has been observed that the search spaces are still "easy" in the sense that random architectures can do reasonably well (Elsken et al., 2019; Li & Talwalkar, 2019). More recently, Real et al. (2020) demonstrated the possibility of evolving all aspects of ML, not just the model but also the training algorithm, from scratch. We seek to establish a middle ground in which search spaces are large and domain-agnostic but still allow the encoding of desirable constraints and the application of well-tested learning algorithms such as stochastic gradient descent (SGD). Our main contribution is a family of search spaces built upon K-operations, which generalize parameterized convolutions (LeCun et al., 1999). Most NAS search spaces only allow a categorical choice between a few kinds of convolutions (Liu et al., 2019; Zela et al., 2020; Dong & Yang, 2020); even when drastically expanded to include many filter sizes and other hyperparameters (Mei et al., 2020), the operation itself is not generalized, so these search spaces may not be useful outside of domains where convolutions are applicable. Beyond NAS, recent work by Zhou et al. (2020) uses a meta-learning framing (Thrun & Pratt, 1998) to study how to learn more general types of symmetries, beyond simply translational ones, from multi-task data. This transfer-based setup allows a clear formalization of learning such equivariances, though unlike NAS, it is not applicable to single-task settings. In addition, their technical approach does not generalize two-dimensional convolutions due to computational intractability, while our K-operations are able to do so.
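For readers unfamiliar with how a categorical choice among operations is made searchable by gradient methods, the following minimal, hypothetical sketch shows the standard continuous relaxation used by gradient-based NAS (e.g. DARTS): a softmax over architecture parameters weights each candidate operation, so the choice can be optimized by SGD separately from the operations' own model weights.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # shift for numerical stability
    return e / e.sum()

def mixed_op(x, alpha, ops):
    """Continuously relaxed operation choice: architecture parameters
    alpha weight each candidate op via a softmax, making the categorical
    decision differentiable and hence trainable by gradient descent."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, ops))
```

At the end of search, the operation with the largest architecture weight is typically discretized and the resulting network retrained. K-operations go further: rather than choosing among a few fixed candidates, the operation itself is parameterized and searched.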
The above works search over spaces of parameterizable operations by delineating a set of architectural or meta-parameters that define the space of operations, separate from the model weights that parameterize the operations found. In contrast, other efforts seek simply to outperform convolutions by directly training more expressive models. These include several that use linear maps based on butterfly factors as drop-in replacements for linear or convolutional layers (Dao et al., 2019; 2020; Alizadeh vahid et al., 2020; Ailon et al., 2020). Very recently, Neyshabur (2020) showed that a sparsity-inducing optimization routine can train fully connected nets that match the performance of convolutional networks; in the process, the weights learn local connectivity patterns. However, none of these papers return parameterizable operations from a formally defined search space.

2. STATISTICAL AND OPTIMIZATION OBJECTIVES OF NAS

In this section we set up the statistical and algorithmic objectives of neural architecture search. This is critical since we seek a definition of NAS that encompasses not only categorical decisions but also learning primitives such as convolutions. We ignore latency and transfer considerations and instead focus on the statistical problem of learning to parameterize a function f_{w,a} : Z → ℝ so as to minimize its expected value E_z f_{w,a}(z) w.r.t. some unknown distribution over the data domain Z. Here a ∈ A are architectures in some search space A and w ∈ W are model-weights of sufficient



¹ Following Dao et al. (2020), butterfly-based naming will be used throughout.

