BALTO: FAST TENSOR PROGRAM OPTIMIZATION WITH BIASED-DIVERSITY-BASED ACTIVE LEARNING

Abstract

Tensor program optimization (TPO) based on pre-trained models can effectively reduce the computing time of deep neural networks. However, training such models is prohibitively expensive: they depend on a large-scale dataset whose collection requires tremendous, time-consuming performance measurements (more than 1 million) on target platforms. In this paper, we propose BALTO, a fast TPO approach with biased-diversity-based active learning, aiming to significantly reduce training costs while preserving program optimization ability. The key insight is that the random sampling used by existing approaches suffers from heavy redundancy among low-performance programs, which incurs tremendous time-consuming measurements. Motivated by this, BALTO removes such redundancy by introducing active learning (AL) to TPO for a much lower training cost. However, applying AL to TPO in a brute-force way can lead to an overestimation problem. To address this, we further propose a biased-diversity-based selection scheme specially designed for BALTO. We compare BALTO against TenSet on 6 typical hardware platforms over 2 learning models. Experimental results show that, on average, BALTO requires only 5% of the total measurements of TenSet to achieve the same or higher model accuracy. Moreover, the optimized tensor programs even outperform those of TenSet by 1.07% due to higher model accuracy.

1. INTRODUCTION

Tensor program optimization (TPO) can effectively reduce the computing time of neural networks by searching for high-performance programs in a designed search space (Chen et al., 2018; Zheng et al., 2020a;b). In TPO, neural networks are first represented as tensor programs that describe the computation of multi-dimensional data arrays. The performance of these tensor programs is then measured on a target hardware platform. Such measurements are time-consuming, so optimizing a given network can cost several days or even weeks, which greatly hinders the wide application of TPO (Zhu et al., 2022).

To accelerate TPO, pre-trained machine learning models (Adams et al., 2019; Haj-Ali et al., 2020; Anderson et al., 2020; Zheng et al., 2021) are introduced to replace a substantial part of the hardware measurements with the performance predictions of a pre-trained model. For example, the state-of-the-art approach TenSet (Zheng et al., 2021) can reduce the optimization time by up to 10× through training on a large-scale, well-established dataset. However, training such models is prohibitively expensive, mainly because they depend on a large-scale training dataset. Unfortunately, collecting such a dataset involves massive performance measurements on hardware platforms, which take excessively long to execute. For example, around 8 million different tensor programs must be measured for each hardware platform (Adams et al., 2019; Zheng et al., 2021). Even on a high-end GPU like the V100, such measurements can still consume about 4000 GPU hours. The burden is much worse on capability-limited systems, e.g., mobile devices such as the NVIDIA Jetson TX2, which often require more than 5000 GPU hours to conduct the measurements. More importantly, a much larger dataset is required in real-world optimization tasks, so as to achieve better model generalization
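To make the role of the pre-trained cost model concrete, the sketch below shows the general pattern of model-guided search described above: a cheap learned predictor ranks a large pool of candidate tensor programs, and only the top-ranked few are measured on real hardware. All names (`search_with_cost_model`, `measure_on_hardware`, the toy feature vectors and stand-in predictor) are hypothetical illustrations, not the actual BALTO or TenSet implementation.

```python
import random

def measure_on_hardware(program):
    # Placeholder for a real hardware measurement (the expensive step).
    # Here latency is simulated as a deterministic function of the program.
    return sum(program) % 97 + 1

def search_with_cost_model(candidates, cost_model, top_k=8):
    """Rank all candidates with a cheap learned cost model, then measure
    only the top_k predicted programs on real hardware."""
    ranked = sorted(candidates, key=cost_model)  # cheap: model predictions only
    best, best_latency = None, float("inf")
    for prog in ranked[:top_k]:
        latency = measure_on_hardware(prog)  # expensive: done only top_k times
        if latency < best_latency:
            best, best_latency = prog, latency
    return best, best_latency

random.seed(0)
# Each "program" is a toy feature vector standing in for a concrete schedule.
pool = [[random.randrange(100) for _ in range(4)] for _ in range(1000)]
cost_model = lambda p: sum(p)  # stand-in predictor correlated with toy latency
best, latency = search_with_cost_model(pool, cost_model)
print(best, latency)
```

The search quality therefore hinges on the model's accuracy, which is exactly why the dataset used to train it, and hence the number of hardware measurements behind that dataset, matters so much.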

