THE BATCH SIZE CAN AFFECT INFERENCE RESULTS

Abstract

When performing matrix multiplication on GPUs, the cuBLAS library is commonly used for computational efficiency. Because of cuBLAS heuristics, a large deep neural network model running on GPUs may produce different results depending on the batch sizes used in the training and inference stages. In this paper, we show that the batch size affects the inference results of deep neural network models. Our test models were the well-known bidirectional encoder representations from transformers (BERT) and generative pre-trained transformer (GPT) natural language processing (NLP) models, and the super-resolution generative adversarial network (SRGAN) image generation model, evaluated in FP32 and TF32. In the TF32 setting, the evaluation loss of BERT on the general language understanding evaluation (GLUE) data sometimes varied with the batch size. GPT generated different sentences depending on the batch size, and we report the mean square error of the logits as the token length increases. The SRGAN model produced different images from batch to batch. None of these phenomena were observed under the FP32 setting. Therefore, the batch size must be carefully managed in large deep neural networks under the TF32 setting.

1. INTRODUCTION

Several numerical models include matrix multiplication as a fundamental component. For computational efficiency, tiling methods are employed for matrix multiplication on computers, but this leads to the accumulation of rounding errors. Considerable research has been undertaken to develop accurate and efficient matrix multiplication; algorithm-based fault tolerance (ABFT) Kuang-Hua Huang & Abraham (1984) and autonomous-ABFT Braun et al. (2014) are two representative studies, and several algorithms have been compared in Fèvre et al. (2021). The most widely used specification for basic linear algebra operations, such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication, is the basic linear algebra subprograms (BLAS) standard, a set of low-level routines with C ("CBLAS interface") and Fortran ("BLAS interface") bindings. Although the BLAS standard is broad, BLAS implementations are frequently optimized for speed on a single platform; therefore, employing them can result in significant performance gains. Most libraries that provide linear algebra functions adhere to the BLAS interface, allowing users to write programs that are not tied to the particular BLAS implementation being used. With the development of GPGPU, GPU-based BLAS implementations such as cuBLAS and rocBLAS are increasingly used. General matrix multiplication (GEMM) is the matrix multiplication routine contained in BLAS, and numerous studies have sought to improve it: efficient non-square and sparse matrix computations were studied in Qin et al. (2020), methods that fully utilize hardware resources were developed in Fatahalian et al. (2004), and Kelefouras et al. (2016) enhanced the optimization and effectiveness of GEMM on GPUs.

Deep learning models based on matrix operations are used in many diverse tasks. To train and run inference with a deep neural network model, matrices of varying sizes must be computed efficiently; hence, acceleration research spans accelerators ranging from tensor cores to TPUs. Hyperscale models require enormous computing resources, and their numerous matrix computations may reduce accuracy. Batch processing is therefore required to overcome memory limitations, and adaptive batch sizes for effective batch learning have been studied Devarakonda et al. (2017); McCandlish et al. (2018). However, because the batch size changes the size of the matrices to be computed, the tiling scheme chosen by the GEMM routine also varies with the batch size. Large batches are often used for rapid training, whereas relatively small batches are used in serving. This mismatch between the training and inference stages introduces numerical errors that are not consistent between the two stages, which can degrade the model's performance.

Floating-point arithmetic is the standard method for representing and computing with fractional numbers in computer science. Values are expressed as approximations at the expense of precision, following standards such as IEEE 754. Scientific computing generally uses double precision (64 bits or more). Because deep learning workloads have different computational costs and precision requirements, tensor operation accelerators that trade some precision for greatly increased throughput are used instead.
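The order-dependence of floating-point accumulation, which is why different GEMM tilings can yield different results, comes from the non-associativity of rounded addition. A minimal sketch (emulating IEEE 754 single precision with Python's standard `struct` module; this is an illustration, not cuBLAS code):

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (double) to the nearest IEEE 754 float32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

big = f32(1e8)    # exactly representable in float32
small = f32(1.0)

# Two summation orders of the same three terms: small + big - big.
left_to_right = f32(f32(small + big) - big)   # small is absorbed by big
reassociated  = f32(small + f32(big - big))   # small survives

print(left_to_right)  # 0.0
print(reassociated)   # 1.0
```

A tiled GEMM partitions each dot product into blocks and accumulates the partial sums in a block-dependent order, so changing the tile size (e.g., via a batch-size-dependent heuristic) can change the rounded result in exactly this way.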
One representative example is NVIDIA's Tensor Core. Even when low precision is used in deep learning, the bit configuration sometimes differs from the general floating-point standards; in general, formats are composed by varying the combination of sign, exponent, and mantissa bits. Examples include BF16 Kalamkar et al. (2019), which retains the dynamic range of single precision while reducing significant figures, and TF32, which enables the dynamic range of 32-bit floats using a 19-bit representation supported since NVIDIA's Ampere architecture. Deep learning algorithms that give up dynamic range may encounter gradient explosion, so increasing the dynamic range by abandoning significant digits at the same bit width is effective. In this study, we used the bidirectional encoder representations from transformers (BERT) Devlin et al. (2018) and the generative pre-trained transformer (GPT) Brown et al. (2020) as natural language processing (NLP) models, and the super-resolution generative adversarial network (SRGAN) Ledig et al. (2016) as an image generation model, to investigate whether the batch size can affect inference results under two floating-point arithmetic settings, TF32 and FP32. In BERT, the evaluation loss was compared on the benchmark GLUE dataset Wang et al. (2018) while varying the training and inference batch sizes. In GPT, generated sentences were compared across batch sizes, together with the mean square error of the logits as the token length increased. In SRGAN, images generated batch by batch from 256 data elements were compared, using a model trained on 200,000 images from the CelebFaces Attributes Dataset (CelebA) Liu et al. (2015).
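To make the precision gap concrete, TF32's 10-bit mantissa (versus float32's 23 bits) can be emulated by discarding the 13 low-order mantissa bits of a float32 value. This is a simplified sketch using truncation; the actual tensor-core hardware applies rounding rather than truncation:

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 by zeroing the 13 low-order mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~0x1FFF   # keep sign, 8 exponent bits, and the top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# 1 + 2^-10 fits in TF32's 10 mantissa bits; 1 + 2^-11 does not.
print(to_tf32(1.0 + 2**-10))   # 1.0009765625 (preserved)
print(to_tf32(1.0 + 2**-11))   # 1.0 (low-order bits discarded)
```

Because TF32 keeps the 8 exponent bits of FP32, the dynamic range is unchanged; only the significand, and hence the rounding granularity of each multiply-accumulate, is coarsened.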

2. EXPERIMENTS

We experimentally tested three models: BERT, GPT, and SRGAN. The GPU specifications for this experiment were an NVIDIA RTX 3090 (Ampere) for TF32 and an NVIDIA RTX 2080 Ti (Turing) for FP32.
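On Ampere GPUs, recent PyTorch versions expose backend flags that control whether float32 matrix multiplications may use TF32 tensor cores. The following configuration sketch (assuming PyTorch; default values have changed across PyTorch releases) shows how such an experiment can be switched between the two settings:

```python
import torch

# Allow TF32 in cuBLAS matmuls and cuDNN convolutions (Ampere or newer GPUs).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# To force strict FP32 matmuls instead, disable both flags:
# torch.backends.cuda.matmul.allow_tf32 = False
# torch.backends.cudnn.allow_tf32 = False
```

Note that on a Turing GPU such as the RTX 2080 Ti, TF32 tensor cores are unavailable, so float32 matmuls execute in FP32 regardless of these flags.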

2.1. BERT

Developed by Google in 2018, BERT is a transformer-based language model. We performed fine-tuning according to the batch size using the GLUE dataset, starting from BERT's pre-trained weights, and performed text classification. A batch size of 32 was used for training, and the evaluation loss was measured using batch sizes of 1, 2, 4, 8, 16, 32, 64, and 128 for inference. Significant differences were found in three (CoLA, RTE, SST-2) of the nine GLUE datasets (Figure 1). This phenomenon did not occur in FP32 (Figure 1).
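The measurement procedure can be sketched as follows. This is a toy harness with a random linear classifier in NumPy, not the actual BERT/GLUE pipeline: it sweeps the inference batch size while holding the model and data fixed, and averages the per-example loss so that the metric itself is batch-size invariant. On CPU the losses typically agree to within rounding; the paper's observation is that under TF32 on GPU they can differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 16)).astype(np.float32)   # toy "dataset"
y = rng.integers(0, 2, size=128)                        # binary labels
W = rng.standard_normal((16, 2)).astype(np.float32)     # toy "model" weights

def eval_loss(batch_size: int) -> float:
    """Mean cross-entropy over the dataset, processed in batches."""
    per_example = []
    for i in range(0, len(X), batch_size):
        xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        logits = xb @ W                                  # shape depends on batch size
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        per_example.extend(-log_probs[np.arange(len(yb)), yb])
    return float(np.mean(per_example))

losses = {b: eval_loss(b) for b in (1, 2, 4, 8, 16, 32, 64, 128)}
# On CPU these agree to within rounding; under TF32 on GPU they may not.
assert np.allclose(list(losses.values()), losses[1])
```

Averaging per-example losses (rather than per-batch means) keeps the metric independent of how the dataset is partitioned, isolating any remaining variation to the numerics of the batched matrix multiplications themselves.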

2.2. GPT

GPT is an unsupervised transformer language model created by OpenAI, which translates text, answers questions, summarizes passages, and generates text output. In GPT, text-generation inference was conducted with the officially released trained model. The input data were identical across all batch sizes. We chose batch sizes of 1, 2, 4, and 8, and the experiments were run under NVIDIA's FP32 and TF32 settings, the latter enabling tensor-core computation acceleration. Figure 2 consists of the same
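Why tiny logit perturbations can change generated text follows from greedy decoding: the argmax token is selected at each step, so when the top two logits are nearly tied, a perturbation far smaller than typical TF32 rounding noise flips the choice, and the continuation diverges from that token onward. A toy illustration (not GPT code):

```python
# Greedy decoding picks the argmax token at each step, so a near-tie
# between the top two logits makes the choice sensitive to tiny noise.
logits = [3.1415925, 3.1415926, 1.0]      # tokens 0 and 1 nearly tied

def greedy_pick(scores):
    """Index of the largest score (the greedily decoded token)."""
    return max(range(len(scores)), key=lambda i: scores[i])

baseline = greedy_pick(logits)                                      # token 1
perturbed = greedy_pick([s + d for s, d in zip(logits, [2e-7, 0.0, 0.0])])
print(baseline, perturbed)  # 1 0 -- one tiny perturbation flips the choice
```

Because each generated token is fed back as input, a single flipped token changes every subsequent logit, which is consistent with batch-size-dependent sentences and a logit mean square error that grows with token length.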

