BALTO: FAST TENSOR PROGRAM OPTIMIZATION WITH BIASED-DIVERSITY-BASED ACTIVE LEARNING

Abstract

Tensor program optimization (TPO) based on pre-trained models can effectively reduce the computing time of deep neural networks. However, training such models is prohibitively expensive: it depends on a large-scale dataset and thus requires tremendous, time-consuming performance measurements (more than 1 million) on target platforms. In this paper, we propose BALTO, a fast TPO approach with biased-diversity-based active learning, aiming to significantly reduce training costs while preserving program optimization ability. The key insight is that the random sampling used by existing approaches suffers from heavy redundancy among low-performance programs, which incurs tremendous time-consuming measurements. Inspired by this, BALTO removes such redundancy by introducing active learning (AL) to TPO, yielding a much lower training cost. However, applying AL in a brute-force way in BALTO can lead to an overestimation problem. To address this, we further propose a biased-diversity-based sampling scheme specially designed for BALTO. We compare BALTO against TenSet on 6 typical hardware platforms over 2 learning models. Experimental results show that, on average, BALTO requires only 5% of the total measurements of TenSet to achieve the same or higher model accuracy. Moreover, the optimized tensor programs even outperform those of TenSet by 1.07% due to higher model accuracy.

1. INTRODUCTION

Tensor program optimization (TPO) can effectively reduce the computing time of neural networks by searching for high-performance programs in a designed search space (Chen et al., 2018; Zheng et al., 2020a;b). In TPO, neural networks are first represented as tensor programs that describe the computation over multi-dimensional data arrays. The performance of these tensor programs is then measured on a target hardware platform. Such measurements are time-consuming, so optimizing a given network can take several days or even weeks, which greatly hinders the wide application of TPO (Zhu et al., 2022). To accelerate TPO, pre-trained machine learning models (Adams et al., 2019; Haj-Ali et al., 2020; Anderson et al., 2020; Zheng et al., 2021) have been introduced to replace a substantial part of the hardware measurements with performance predictions from a pre-trained model. For example, the state-of-the-art approach TenSet (Zheng et al., 2021) reduces optimization time by up to 10× through training on a large-scale, well-established dataset. However, training such models is prohibitively expensive. The main reason is that they highly depend on a large-scale training dataset, and collecting such a dataset involves massive performance measurements on hardware platforms with excessively long execution times. For example, around 8 million different tensor programs must be measured for each hardware platform (Adams et al., 2019; Zheng et al., 2021). Even on a high-end GPU such as the V100, these measurements can consume about 4000 GPU hours. The burden is even worse on capability-limited systems, e.g., mobile devices such as the NVIDIA Jetson TX2, which often require more than 5000 GPU hours to conduct the measurements. More importantly, real-world optimization tasks require an even larger dataset to achieve better model generalization across different tensor programs.
Consequently, as the dataset grows, the number of measurements increases correspondingly. This consumes considerable time and energy, and thus hinders the wide deployment of ML-based TPO in industrial practice.

To look more closely inside the large-scale datasets sampled randomly by existing approaches, we conduct a deep exploration of their distributions. We observe that the random sampling adopted by existing approaches results in imbalanced training datasets, in which high-performance programs are far scarcer than low-performance ones. We randomly sample 90,000 tensor programs from TenSet (as shown in Figure 1) and find that high-performance programs account for only 19% and 8% of the total dataset on the Platinum 8272CL CPU and T4 GPU, respectively. In contrast, low-performance programs make up the bulk of the dataset (81% and 92%, respectively). Such imbalance further leads to a heavy redundancy of low-performance programs that incurs tremendous time-consuming measurements. The main reason behind the redundancy is that the importance of generated tensor programs differs greatly: program optimization cares most about high-performance programs, and excessively many low-performance programs offer no additional benefit for predicting high-performance ones, making them heavily redundant. To this end, we propose BALTO, a fast TPO approach with biased-diversity-based active learning, aiming to significantly reduce training costs while preserving optimization ability. The key insight is that the random sampling of existing approaches suffers from a heavy redundancy of low-performance programs, which incurs tremendous time-consuming measurements. Inspired by this, BALTO removes such redundancy by introducing active learning (AL) to TPO for a much lower training cost.
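The imbalance observation above can be reproduced in miniature. The sketch below simulates a randomly sampled set of tensor programs with a long-tailed throughput distribution and measures what fraction exceeds a performance cutoff. The distribution, the 0.5 cutoff, and the sample size are illustrative assumptions, not TenSet's actual data or thresholds:

```python
import numpy as np

# Hypothetical illustration of the imbalance that random sampling produces.
# Throughputs are normalized to [0, 1] (1.0 = best program in a task); a
# long-tailed Beta distribution stands in for the real measured skew.
rng = np.random.default_rng(0)
throughputs = rng.beta(a=1.2, b=5.0, size=90_000)

threshold = 0.5  # assumed cutoff separating high- from low-performance
high_ratio = float(np.mean(throughputs >= threshold))
low_ratio = 1.0 - high_ratio
print(f"high-performance: {high_ratio:.1%}, low-performance: {low_ratio:.1%}")
```

Under these assumptions the high-performance fraction is small, mirroring the 19%/8% shares observed in Figure 1; most of the measurement budget is spent on programs the optimizer will never select.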
In this way, the number of measurements can be significantly lowered, greatly reducing the cost of training the pre-trained model. However, applying AL in a brute-force way in BALTO can lead to an overestimation problem, where the relative performance estimated by the model is much better than the ground truth. To address this problem, we further propose a biased-diversity-based scheme specifically designed for BALTO, which efficiently reduces the distribution imbalance of the sampled programs caused by overestimation. Finally, we integrate BALTO into TenSet and compare it with the state-of-the-art baseline TenSet on 6 typical hardware platforms (i.e., two GPU platforms and four CPU platforms) over two learning models (i.e., XGBoost and MLP). The experimental results show that BALTO achieves the same or higher model accuracy while requiring only 5% of TenSet's hardware measurements. Moreover, the optimized tensor programs even outperform those of TenSet by 1.07% due to higher model accuracy. To the best of our knowledge, BALTO is the first work to reduce the training cost of pre-trained-model-based TPO by introducing AL. In summary, our key contributions are three-fold:

• We conduct a deep exploration of the distribution of large-scale datasets sampled randomly by existing approaches, and observe that random sampling results in an imbalanced dataset that suffers from heavy redundancy.

• We propose BALTO, a fast TPO approach with active learning, aiming to significantly reduce training cost while preserving optimization ability. To address the overestimation problem, we further propose a biased-diversity-based scheme specially designed for BALTO.

• We conduct a comprehensive performance evaluation on six typical hardware platforms, showing that BALTO achieves the same or higher model accuracy with a 20× reduction in performance measurements. Moreover, the optimized programs even outperform those of TenSet by 1.07% on an NVIDIA T4.
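To make the overall idea concrete, the sketch below shows one generic way to bias a diversity-based active-learning selection toward programs the surrogate model predicts to be high-performance. This is not BALTO's actual algorithm: the k-center-style farthest-point selection, the linear bias weight, and all names are illustrative assumptions standing in for the paper's biased-diversity-based scheme:

```python
import numpy as np

def select_batch(features, pred_scores, selected_idx, batch_size, bias=0.5):
    """Greedily pick a batch of unmeasured programs to label next.

    features    : (n, d) feature vectors of candidate tensor programs
    pred_scores : (n,) surrogate-model performance predictions (higher = better)
    selected_idx: indices already chosen/measured in earlier rounds
    bias        : 0.0 = pure diversity, 1.0 = pure predicted performance
    """
    n = features.shape[0]
    chosen = list(selected_idx)
    for _ in range(batch_size):
        if chosen:
            # Diversity term: distance from each candidate to its nearest
            # already-chosen point (farthest-point / k-center heuristic).
            d = np.min(
                np.linalg.norm(
                    features[:, None, :] - features[chosen][None, :, :], axis=-1
                ),
                axis=1,
            )
        else:
            d = np.ones(n)
        # Bias the diversity score toward predicted high performers.
        utility = (1.0 - bias) * d + bias * pred_scores
        utility[chosen] = -np.inf  # never re-pick a selected program
        chosen.append(int(np.argmax(utility)))
    return chosen[len(selected_idx):]

# Toy usage: 200 candidate programs with 8-dim features.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 8))
scores = rng.random(200)
batch = select_batch(feats, scores, selected_idx=[], batch_size=10)
print(batch)
```

After each such batch is measured on hardware, the surrogate model would be retrained and the loop repeated; the bias term is what keeps the selection from collapsing onto the low-performance bulk of the distribution.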

2. BACKGROUND

2.1 TENSOR PROGRAM OPTIMIZATION



Figure 1: The ratio of high-performance programs is much smaller than low-performance programs. (a) NVIDIA T4. (b) Platinum 8272CL.

