IMPROVING DIFFERENTIABLE NEURAL ARCHITECTURE SEARCH BY ENCOURAGING TRANSFERABILITY

Abstract

Differentiable neural architecture search methods are increasingly popular due to their computational efficiency. However, these methods have unsatisfactory generalizability and stability: their searched architectures are often degenerate, with a dominant number of skip connections, and perform unsatisfactorily on test data. Existing methods for addressing this problem have a variety of limitations, such as being unable to prevent architecture degeneration from occurring or being excessively restrictive in setting the number of skip connections. To address these limitations, we propose a new approach for improving the generalizability and stability of differentiable NAS: a transferability-encouraging tri-level optimization framework that improves the architecture of a main model by encouraging good transferability to an auxiliary model. Our framework involves three stages performed end-to-end: 1) train the network weights of a main model; 2) transfer knowledge from the main model to an auxiliary model; 3) optimize the architecture of the main model by maximizing its transferability to the auxiliary model. We propose a new knowledge transfer approach based on matching quadruple relative similarities. Experiments on several datasets demonstrate the effectiveness of our method.

1. INTRODUCTION

Neural architecture search (NAS) (Zoph & Le, 2017; Liu et al., 2018b; Cai et al., 2019; Liu et al., 2019a; Pham et al., 2018; Real et al., 2019), which aims to search for highly-performant neural architectures automatically, finds broad applications. Among various NAS methods, differentiable search methods (Liu et al., 2018b; Cai et al., 2019; Chen et al., 2019; Xu et al., 2020) gain increasing popularity due to their computational efficiency. In differentiable NAS, architectures are represented as differentiable variables and are learned using gradient descent. While differentiable NAS is computationally efficient, its generalizability and stability have been challenged in many works (Zela et al., 2019; Chu et al., 2020a; 2019; Zhou et al., 2020a; Chen & Hsieh, 2020): the searched architecture is degenerate, with a dominant number of skip connections, and while having good performance on validation data, it performs unsatisfactorily on test data. For example, Zela et al. (2019) identified 12 NAS benchmarks based on four search spaces where architectures searched by standard DARTS (Liu et al., 2019a) (a differentiable NAS method) perform poorly on the test data of CIFAR-10, CIFAR-100, and SVHN. A variety of approaches (Zela et al., 2019; Chu et al., 2020a; 2019; Zhou et al., 2020a; Chen & Hsieh, 2020; Chen et al., 2019; Liang et al., 2020; Wang et al., 2021) have been proposed to improve the generalizability and stability of differentiable NAS methods. These methods have various limitations: they cannot improve search algorithms to prevent degenerate architectures from occurring (Zela et al., 2019), cannot explicitly maximize the generalization performance of architectures (Chu et al., 2020a; 2019), cannot broadly explore search spaces (Zhou et al., 2020a; Chen & Hsieh, 2020), or require extensive tuning of the number of skip connections (Chen et al., 2019; Liang et al., 2020).
As a result, their effectiveness in improving differentiable NAS is less satisfactory. To address these limitations, we propose a new approach for improving the generalizability and stability of differentiable NAS methods. Specifically, we develop a transferability-encouraging tri-level optimization (TETLO) framework, which improves the architecture of a main model by encouraging effective knowledge transfer to an auxiliary model. Intuitively, to train a better auxiliary model, the main model needs to generate accurate knowledge (which is used to train the auxiliary model); to generate accurate knowledge, the main model needs to improve its architecture (which is used to generate the knowledge). Combining these two steps, we conjecture that improving the auxiliary model drives the main model to improve its architecture (empirical evidence is provided in Figure 2). Our method is also motivated by the theoretical analysis in (Liu et al., 2019b) showing that good transferability improves generalization performance. In our framework, a main model and an auxiliary model each have an architecture and network weights to learn. Learning consists of three stages. In the first stage, we train the network weights of the main model while temporarily fixing its architecture. In the second stage, we leverage the main model trained in the first stage to help train an auxiliary model via transfer learning. To capture high-order data relationships, we propose a new knowledge transfer approach based on matching quadruple relative similarities (QRS), where the main model generates QRS relationships, e.g., the similarity between data examples x and y is larger than that between w and z. The auxiliary model is then trained by fitting these QRS relationships. In the third stage, we use the validation performance of the auxiliary model as a measure of transferability from the main model to the auxiliary model, and we update the architecture of the main model by maximizing this transferability.
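The QRS matching in the second stage can be illustrated with a minimal sketch. The exact loss is not given here, so the cosine similarity, the hinge formulation, the margin value, and the function names below are all illustrative assumptions rather than the paper's actual formulation:

```python
import numpy as np

def pairwise_cosine(feats):
    """Row-normalize a feature matrix and return its cosine-similarity matrix."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def qrs_loss(main_feats, aux_feats, quadruples, margin=0.1):
    """Hypothetical hinge loss making the auxiliary model reproduce the
    main model's similarity orderings.  `quadruples` has shape (Q, 4),
    holding example indices (x, y, w, z): the main model asserts an
    ordering between sim(x, y) and sim(w, z), and the auxiliary is
    penalized when it violates that ordering by more than `margin`."""
    sim_main = pairwise_cosine(main_feats)   # main model is frozen here
    sim_aux = pairwise_cosine(aux_feats)
    q = np.asarray(quadruples)
    # +1 if the main model says sim(x, y) > sim(w, z), -1 otherwise
    sign = np.sign(sim_main[q[:, 0], q[:, 1]] - sim_main[q[:, 2], q[:, 3]])
    diff = sim_aux[q[:, 0], q[:, 1]] - sim_aux[q[:, 2], q[:, 3]]
    return np.maximum(0.0, margin - sign * diff).mean()
```

When the auxiliary's similarities already agree with the main model's ordering by at least the margin, the loss is zero; otherwise the gradient pushes the auxiliary's quadruple similarity gap toward the main model's.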
The three stages are performed end-to-end in a three-level optimization framework.
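The end-to-end procedure can be sketched on toy linear models. Everything here (the scalar models, the one-step unrolled hypergradient in stage 3, the name `tetlo_step`) is an illustrative simplification under assumed forms, not the paper's actual implementation:

```python
def tetlo_step(w_main, w_aux, arch, x_tr, y_tr, x_val, y_val, lr=0.05):
    """One schematic TETLO iteration on toy linear models
    (main: y = arch * w_main * x; auxiliary: y = w_aux * x)."""
    # Stage 1: update the main model's weights on the training loss,
    # with the architecture parameter held fixed.
    err_m = arch * w_main * x_tr - y_tr
    w_main = w_main - lr * 2 * err_m * arch * x_tr
    # Stage 2: knowledge transfer -- the auxiliary regresses the
    # main model's output (a stand-in for QRS matching).
    target = arch * w_main * x_tr
    err_a = w_aux * x_tr - target
    w_aux = w_aux - lr * 2 * err_a * x_tr
    # Stage 3: update the architecture to minimize the auxiliary's
    # validation loss; the auxiliary depends on arch through the
    # stage-2 target, giving a one-step unrolled hypergradient.
    dwaux_darch = lr * 2 * w_main * x_tr * x_tr   # d w_aux / d arch
    val_err = w_aux * x_val - y_val
    arch = arch - lr * 2 * val_err * x_val * dwaux_darch
    return w_main, w_aux, arch
```

Iterating this step drives the main model to fit the training data while the architecture parameter is steered by how well the transferred knowledge serves the auxiliary on validation data, mirroring the three stages above.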

The major contributions of this work include:

• We propose a transferability-encouraging tri-level optimization (TETLO) framework to improve the generalizability and stability of differentiable NAS methods.
• We propose a new knowledge transfer approach based on matching quadruple relative similarities.
• We perform various experiments that demonstrate the effectiveness of our method.

2. RELATED WORK

2.1. NEURAL ARCHITECTURE SEARCH

The goal of neural architecture search (NAS) is to automatically identify highly-performing neural architectures that can potentially surpass human-designed ones. Research on NAS has made considerable progress in the past few years. Early NAS approaches (Zoph & Le, 2017; Pham et al., 2018; Zoph et al., 2018) are based on reinforcement learning (RL), where a policy network learns to generate high-quality architectures by maximizing validation accuracy (used as the reward). These approaches are conceptually simple and can flexibly perform search in any search space. In differentiable search methods (Cai et al., 2019; Liu et al., 2019a; Xie et al., 2019), each candidate architecture is a combination of many building blocks, and combination coefficients represent the importance of the building blocks. Architecture search amounts to learning these differentiable coefficients, which can be done using differentiable optimization algorithms such as gradient descent. Another paradigm of NAS methods (Liu et al., 2018b; Real et al., 2019) is based on evolutionary algorithms (EA). In these approaches, architectures are treated as individuals in a population. Each architecture is associated with a fitness score representing how good the architecture is. Architectures with higher fitness scores have higher odds of generating offspring (new architectures), which replace old architectures with low fitness scores. Our proposed method can be applied to any differentiable NAS method. Recently, meta-NAS methods (Elsken et al., 2020; Lian et al., 2019) have been proposed for fast adaptation of neural architectures. Our method differs from meta-NAS approaches in the following aspects. First, our method aims at improving an architecture by encouraging it to have good transferability to an auxiliary model, while meta-NAS focuses on adapting a meta architecture to different task-specific architectures. Second, our method transfers knowledge from a main model to an auxiliary model using our newly proposed quadruple relative similarity matching mechanism, while meta-NAS adapts a meta architecture to task-specific architectures via gradient descent updates. Third, our method searches for architectures on a single dataset, while meta-NAS operates on a collection of meta-training tasks (each with a training and a validation set).
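As background on the continuous relaxation used by differentiable search methods such as DARTS, a minimal sketch follows; the candidate operation set and the helper name `mixed_op` are illustrative:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, alphas, ops):
    """Continuous relaxation of one edge: the output is a softmax-weighted
    sum over all candidate operations, so the architecture parameters
    `alphas` become differentiable and learnable by gradient descent."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

# Candidate operations on an edge (illustrative choices)
ops = [
    lambda x: x,                   # skip connection
    lambda x: np.zeros_like(x),    # zero (no connection)
    lambda x: np.maximum(x, 0),    # ReLU as a stand-in for a conv block
]
alphas = np.zeros(len(ops))        # uniform importance initially
```

After search, the operation with the largest coefficient on each edge is typically retained; the degenerate architectures discussed above arise when the skip connection's coefficient comes to dominate on most edges.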

2.2. IMPROVE THE GENERALIZABILITY AND STABILITY OF DIFFERENTIABLE NAS

Various methods have been proposed for improving the generalizability and stability of differentiable NAS methods. Zela et al. (2019) proposed an early stopping approach based on the eigenvalues of the Hessian of the validation loss. This method stops the search algorithm early when degenerate architectures occur, but cannot improve search algorithms to prevent degenerate architectures from occurring.



* The work was done while visiting UCSD. † Corresponding author

