NAS-BENCH-ASR: REPRODUCIBLE NEURAL ARCHITECTURE SEARCH FOR SPEECH RECOGNITION

Abstract

Powered by innovations in novel architecture design, noise-tolerance techniques and increasing model capacity, Automatic Speech Recognition (ASR) has made giant strides in reducing word error rate over the past decade. ASR models are often trained on tens of thousands of hours of high-quality speech data to produce state-of-the-art (SOTA) results. Industry-scale ASR model training thus remains computationally heavy and time-consuming, and consequently there has been little work on applying automatic architecture design techniques to it. On the other hand, Neural Architecture Search (NAS) has gained a lot of interest in recent years thanks to its successes in discovering efficient architectures that often outperform handcrafted alternatives. However, by changing the standard training process into a bi-level optimisation problem, NAS approaches often require significantly more time and computational power than single-model training, and at the same time increase the complexity of the overall process. As a result, NAS has been predominantly applied to problems that do not require training as extensive as ASR, and even then the reproducibility of NAS algorithms is often problematic. Lately, a number of benchmark datasets have been introduced to address reproducibility issues by providing NAS researchers with the performance of different models obtained through exhaustive evaluation. However, these datasets focus mainly on computer vision and NLP tasks and thus suffer from limited coverage of application domains. In order to increase diversity in the existing NAS benchmarks, and at the same time provide a systematic study of the effects of architectural choices for ASR, we release NAS-Bench-ASR, the first NAS benchmark for ASR models. The dataset consists of 8,242 unique models trained on the TIMIT audio dataset for three different target epochs, each starting from three different initializations.
The dataset also includes runtime measurements of all the models on a diverse set of hardware platforms. Lastly, we show that cell structures identified as good in our search space for TIMIT transfer well to the much larger LibriSpeech dataset.

1. INTRODUCTION

Innovations in Deep Neural Network (DNN) architecture design, data augmentation techniques and a continuous increase in the amount of available high-quality training data have resulted in a massive reduction in ASR word error rate over the past decade [Amodei et al., 2016; Kim et al., 2019; Park et al., 2019; Synnaeve et al., 2020]. However, training ASR models to state-of-the-art performance remains challenging, as it requires a computationally heavy training process: thousands of GPU-hours are often needed for good convergence [Amodei et al., 2016; Kahn et al., 2020]. Hyper-parameter optimization further adds to the computational load of ASR training. Despite the system-level complexities of the training procedure, novel architecture design has proven extremely important in a variety of application domains, including ASR [Chiu & Raffel, 2018; Pratap et al., 2020], computer vision [He et al., 2016; Krizhevsky et al., 2012] and natural-language processing (NLP) [Devlin et al., 2019; Vaswani et al., 2017]. However, architecture design is a non-trivial task: it often depends on years of experience and the domain knowledge of researchers, and is driven by empirical successes. Over the past few years, the deep learning community has been witnessing a trend towards adopting automatic techniques for finding neural network architectures over more traditional hand-designed alternatives. NAS algorithms have been highly successful in discovering state-of-the-art architectures for various computer vision tasks [Cai et al., 2020; Howard et al., 2019; Lee et al., 2020; Real et al., 2018; Tan et al., 2019; Tan & Le, 2019]. However, many of them suffer from high computational demands, requiring a large number of architecture variations to be trained [Zoph & Le, 2017].
Furthermore, NAS algorithms are often difficult for other researchers to reproduce, mainly due to non-standard training settings, e.g., hyperparameters, and subtle variations in the architecture search spaces [Li & Talwalkar, 2019; Sciuto et al., 2020]. Recently, a number of attempts have been made to mitigate these problems by releasing benchmark datasets for the NAS research community [Dong & Yang, 2020; Klyuchnikov et al., 2020; Siems et al., 2020; Ying et al., 2019]. These datasets provide a direct mapping between an architecture variant and its post-training performance, which a NAS algorithm can query efficiently, speeding up the search process and, at the same time, providing a common, fully reproducible environment for assessing and comparing different algorithms. Initial attempts at creating benchmark datasets focus predominantly on image classification tasks (with only one existing work targeting NLP at the time of this writing) and thus suffer from poor application coverage. We address this lack of coverage by introducing a new NAS benchmark dataset in the domain of ASR, to the best of our knowledge the very first of its kind. To build the dataset, we trained 8,242 unique convolutional neural network architectures on the TIMIT dataset [Garofolo et al., 1993]. We consider convolutional architectures due to their recent successes in ASR [Pratap et al., 2020; Hannun et al., 2019]. Moreover, convolution-based architectures are computationally efficient to run on mobile devices, thus favouring real-time on-device deployment. Our dataset contains multiple runs of the entire training procedure for each architecture, spanning three initializations of the network parameters and three target epochs, amounting to a total of 74,178 = 8,242 × 3 × 3 training runs.
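The table-lookup idea behind such tabular benchmarks can be sketched as follows. This is a toy illustration, not the actual nb-asr API: the architecture encodings and PER values below are made up, and a real benchmark would index thousands of pre-trained models.

```python
import random

# Hypothetical toy "tabular benchmark": a lookup table mapping an
# architecture encoding to its pre-computed validation metric, so a
# search algorithm never has to train anything itself.
BENCHMARK = {
    ("conv5", "skip"): 29.4,   # validation PER (%), lower is better
    ("conv5", "none"): 31.0,
    ("conv7", "skip"): 28.1,
    ("conv7", "none"): 30.2,
}

def random_search(benchmark, n_queries, seed=0):
    """Return the best architecture found over n random benchmark queries."""
    rng = random.Random(seed)
    archs = list(benchmark)
    best_arch, best_per = None, float("inf")
    for _ in range(n_queries):
        arch = rng.choice(archs)
        per = benchmark[arch]           # table lookup instead of training
        if per < best_per:
            best_arch, best_per = arch, per
    return best_arch, best_per

arch, per = random_search(BENCHMARK, n_queries=10)
print(arch, per)
```

Because each "evaluation" is a dictionary lookup, thousands of search runs can be replayed in seconds under identical conditions, which is what makes fair, reproducible comparison of NAS algorithms feasible.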
In addition to the per-epoch validation and final test metrics, such as Phoneme Error Rate (PER) and CTC loss, we also provide run-times of the architectures on desktop and embedded GPUs for varying batch sizes. Furthermore, we compare a number of NAS algorithms [Zoph & Le, 2017; Real et al., 2018; Dudziak et al., 2020; Li et al., 2017; Li & Talwalkar, 2019] on our search space, highlighting potential challenges and differences compared to their performance on existing NAS benchmark datasets. Lastly, we show the transferability of the top architecture cells found on TIMIT to the much larger LibriSpeech dataset [Panayotov et al., 2015]. In summary, the contributions of this paper are:

• Design of an ASR NAS Search Space. NAS-Bench-ASR is a first-of-its-kind search space for convolutional speech models. It facilitates the reproducible study of ASR through NAS methods and thus fills an important gap in the literature. The associated dataset consists of 8,242 unique cells and contains validation and test metrics along with model parameters, FLOPs and on-device run-times¹.

• Enabling NAS for Large-scale ASR. Prohibitive training times on non-toy ASR datasets have prevented NAS from strongly influencing the evolution of ASR architecture design. We show that NAS-Bench-ASR is able to support the discovery of cell structures that generalize even to large-scale datasets like LibriSpeech, a key breakthrough. We believe the methodological decisions in this paper will act as a blueprint for future work, where NAS plays a prominent role in ASR design.



¹ The NAS-Bench-ASR dataset and the code can be downloaded from https://github.com/AbhinavMehrotra/nb-asr.



• Validating Existing NAS Algorithm Design. The existing understanding of NAS is heavily influenced by image-based tasks. By systematically benchmarking popular NAS algorithms under a rich ASR search space, our findings provide scientific support for prior results that has so far been lacking.
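For reference, PER, the primary metric reported in the dataset, is the length-normalized edit distance between the reference and hypothesis phoneme sequences. A minimal sketch (the phoneme labels below are illustrative TIMIT-style symbols, not tied to any specific model output):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                    # delete all of ref[:i]
    for j in range(n + 1):
        dp[0][j] = j                    # insert all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = ["sil", "dh", "ah", "k", "ae", "t", "sil"]
hyp = ["sil", "dh", "ax", "k", "ae", "sil"]
print(phoneme_error_rate(ref, hyp))  # 2 edits / 7 reference phonemes ≈ 0.2857
```

The same computation with words in place of phonemes yields the word error rate quoted for large-vocabulary ASR systems.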

