THE BATCH SIZE CAN AFFECT INFERENCE RESULTS

Abstract

When performing matrix multiplication on GPUs, the cuBLAS library is commonly used for computational efficiency. Because of cuBLAS heuristics, a large deep neural network model running on GPUs may produce different test results depending on the batch sizes used in the training and inference stages. In this paper, we show that the batch size affects the inference results of deep neural network models. Our test models were the well-known bidirectional encoder representations from transformers (BERT) and generative pre-trained transformer (GPT) natural language processing (NLP) models, and the super-resolution generative adversarial network (SRGAN) image generation model, evaluated in FP32 and TF32. In the TF32 setting, the evaluation loss of BERT on the general language understanding evaluation (GLUE) data sometimes varied across batch sizes. The GPT model generated different sentences depending on the batch size, and we show that the mean squared error of the logits grows as the token length increases. The SRGAN model produced different images depending on the batch size. However, these phenomena were not observed under the FP32 setting. Therefore, the batch size must be carefully managed in large deep neural networks under the TF32 setting.

1. INTRODUCTION

Several numerical models include matrix multiplication as a fundamental component. For computational efficiency, tiling methods are employed for matrix multiplication on computers, but this leads to the accumulation of rounding errors. Considerable research has been undertaken to develop accurate and efficient matrix multiplication; algorithm-based fault tolerance (ABFT) Kuang-Hua Huang & Abraham (1984) and autonomous-ABFT Braun et al. (2014) are two representative studies, and several algorithms have been compared in Fèvre et al. (2021). The most widely used specification for basic linear algebra operations, such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication, is the basic linear algebra subprograms (BLAS). BLAS defines a set of standard low-level routines for linear algebra libraries, with both C ("CBLAS interface") and Fortran ("BLAS interface") bindings. Although the BLAS standard is broad, implementations are frequently optimized for speed on a particular platform; therefore, employing them can yield significant performance gains. Most libraries that provide linear algebra functions adhere to the BLAS interface, allowing users to write programs that are independent of the particular BLAS library in use. With the development of GPGPU, GPU-based BLAS implementations, such as cuBLAS and rocBLAS, are increasingly used. General matrix multiplication (GEMM) is the matrix multiplication routine specified in BLAS, and numerous studies have sought to improve it. For instance, efficient non-square and sparse matrix computation was studied in Qin et al. (2020), methods that fully utilize hardware resources were developed in Fatahalian et al. (2004), and Kelefouras et al. (2016) enhanced the optimization and effectiveness of GEMM on GPUs. Deep learning models, which are built on matrix operations, are used in many diverse tasks. To train and run inference with a deep neural network model, it is important to compute matrices of different sizes efficiently; hence, acceleration research is being conducted to include them in various accelerators extending
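The mechanism behind these tiling-induced discrepancies can be illustrated with a minimal sketch (not taken from the paper, and independent of any BLAS library): floating-point addition is not associative, so two tilings of the same reduction, which sum identical values in different orders, can round differently and yield different results.

```python
# Minimal illustration (assumed example, not the paper's code): the same
# three float32 values summed in two groupings give different answers,
# because IEEE 754 addition is not associative.
import numpy as np

a = np.float32(1e8)
b = np.float32(1.0)

# Left-to-right accumulation, as one tiling might do:
# 1e8 + 1 rounds back to 1e8 in float32 (the spacing near 1e8 is 8),
# so the small term is lost and the total is 0.
left_to_right = (a + b) - a

# A different grouping, as another tiling might do:
# the large terms cancel first, so the small term survives.
regrouped = (a - a) + b

print(left_to_right)  # 0.0
print(regrouped)      # 1.0
```

GEMM tilings chosen by cuBLAS heuristics differ in exactly this way: the shape of the tiles fixes the order in which partial products are accumulated, so changing the batch size (and thus the matrix shapes and the selected tiling) can change the rounded result even though the mathematical product is identical.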

