FEW-BIT BACKWARD: QUANTIZED GRADIENTS OF ACTIVATION FUNCTIONS FOR MEMORY FOOTPRINT REDUCTION

Abstract

Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and each such operation induces additional memory costs which, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. Drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and unchanged convergence on several open benchmarks.

1. INTRODUCTION

Modern neural network models are getting larger and larger. One of the main bottlenecks in the training loop is the required device memory Ojika et al. (2020); Gao et al. (2020). In this paper, we propose a universal approach that helps to reduce the model memory footprint during backpropagation. Note that this approach is complementary to other memory-reduction techniques such as checkpointing Chen et al. (2016) or offloading Beaumont et al. (2021), and it can be applied to any neural network without any additional preprocessing.

Memory consumed by the model during training (excluding intermediate tensors) can be split into two groups: 1) the model weights (including additional memory for the optimizer state); 2) activations saved for the backward pass, i.e., tensors that are not being computed on at the moment but will be required later to compute the gradients.

Every operation in the computational graph generates a memory footprint. It is typically overlooked that applying a pointwise nonlinearity (such as GELU or sigmoid) results in storing the input for the backward pass. We show that instead of keeping the full input tensor, it is possible to store a low-bit representation that still allows accurate gradient approximation.

In this work, we propose to approximate the derivative of the activation function in a piecewise-constant form. This approximation problem has to be solved once for each activation function, and we propose a simple technique to do that. The proposed approximation divides all values into several bins and saves only the corresponding bin indices instead of the values themselves. This is a lossy compression, but the additional noise it introduces is negligible, as we show on several benchmarks in Section 4.

The main contributions of our paper are:

• We propose new approximate backward computation schemes that significantly reduce the memory consumption of neural network training.
• We benchmark our approach on several tasks. We show that it provides up to 40% memory reduction on various tasks while maintaining accuracy on par with the model trained via the standard approach.

2. QUANTIZED GRADIENTS OF ACTIVATIONS

Figure 2: Computation graph of both the forward and backward passes. Orange and purple parts of the graph correspond to the standard and the proposed ways of saving tensors for backward, respectively. The vector x_bit stands for the tensor saved using 2-bit quantization, while x denotes its uncompressed version.

Gradients of activations using automatic differentiation. Modern deep learning frameworks use reverse-mode automatic differentiation to calculate the gradients of the loss with respect to the model parameters. The forward computation can be associated with a directed acyclic graph, depicted in Fig. 2. Each operation f computes the output X_{l+1} given the input X_l and has to save some information S_l that will be used on the backward pass in order to calculate the derivative ∂L/∂X_l from ∂L/∂X_{l+1} and S_l. Thus, in a typical training loop, the intermediates S_l of all operations in the graph are stored in memory during the whole forward pass until they are no longer needed, i.e., after the completion of the corresponding backward operation. This requires additional memory, which can be quite significant and can exceed the total amount of parameters of the model.
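The bookkeeping above can be illustrated with a minimal sketch of reverse-mode autodiff for a two-operation graph (a linear layer followed by a sigmoid). The function names and the convention of returning (output, saved state S_l) are illustrative, not an actual framework API; the point is that the saved tensors live in memory for the entire gap between the forward and backward passes.

```python
import numpy as np

# Minimal sketch of reverse-mode autodiff bookkeeping: each op's forward
# returns its output plus the saved state S_l needed later for backward.

def linear_fwd(x, w):
    return x @ w, (x, w)                      # saves input and weight

def linear_bwd(grad_out, saved):
    x, w = saved
    return grad_out @ w.T, x.T @ grad_out     # dL/dx, dL/dw

def sigmoid_fwd(x):
    y = 1.0 / (1.0 + np.exp(-x))
    return y, y                               # standard impls save x (or y) in full

def sigmoid_bwd(grad_out, y):
    return grad_out * y * (1.0 - y)           # sigmoid'(x) = y * (1 - y)

x = np.array([[1.0, -2.0]])
w = np.array([[0.5], [0.25]])
h, s1 = linear_fwd(x, w)
y, s2 = sigmoid_fwd(h)
# the saved states s1, s2 occupy memory for the whole forward-backward gap
gy = np.ones_like(y)
gh = sigmoid_bwd(gy, s2)
gx, gw = linear_bwd(gh, s1)
```

Every operation contributes its own saved state; our method shrinks the state S_l of pointwise nonlinearities specifically.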

Pointwise activations.

In this paper, we focus on pointwise activation functions, which are ubiquitous in modern neural network architectures. Given an input tensor X_l, we apply a scalar function f: R → R to each element of the tensor: f(X_l) = [f((X_l)_{j_1,...,j_k})]_{j_1,...,j_k}. This operation is very cheap compared to other operations in a deep neural network model and does not attract much attention when analysing computational complexity. However, the standard implementation in frameworks such as PyTorch induces a non-negligible memory footprint: the whole input X_l is saved for the backward pass. The backward pass for such a function consists of an element-wise multiplication of the propagated gradient tensor by the derivative of the nonlinearity evaluated at the input tensor: if X_{l+1} = f(X_l), then the gradient of the loss L with respect to X_l is computed as

∂L/∂X_l = ∂L/∂X_{l+1} ⊙ f'(X_l),    (1)

where f'(X_l) is the tensor whose elements are the derivative of f evaluated at each element of X_l. From Eq. (1) it follows that for the backward pass we only have to store f'(X_l); X_l itself is not needed.

ReLU activation function. To illustrate our idea, consider one of the most popular nonlinearities, f(x) = ReLU(x) = max(0, x). Its derivative f' takes only two values, 0 and 1, so it requires only 1 bit per element to store. If single precision is used, the compression factor is 32, which is quite noticeable.
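The ReLU case can be sketched directly: the forward pass keeps only the packed sign bits of the input (1 bit per element, 8 elements per byte), and the backward pass unpacks them as the derivative. This is a minimal NumPy sketch of the idea, not the paper's CUDA implementation.

```python
import numpy as np

def relu_forward(x):
    """Forward: compute ReLU(x) and keep only a 1-bit mask (the sign
    pattern) instead of the full 32-bit input tensor."""
    mask = x > 0                        # f'(x): boolean, 1 bit of information
    saved = np.packbits(mask)           # 8 elements per byte instead of 4 bytes each
    return np.maximum(x, 0.0), (saved, mask.size)

def relu_backward(grad_out, saved):
    """Backward: dL/dx = dL/dy * f'(x), where f'(x) is the unpacked mask."""
    bits, n = saved
    mask = np.unpackbits(bits, count=n).astype(grad_out.dtype)
    return grad_out * mask

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0], dtype=np.float32)
y, saved = relu_forward(x)
grad_in = relu_backward(np.ones_like(x), saved)
```

For this 5-element float32 input, the saved state shrinks from 20 bytes to a single byte of packed bits.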

GELU activation function.

In modern transformer architectures Vaswani et al. (2017), the GELU Hendrycks & Gimpel (2016) nonlinearity is typically used. Its derivative no longer takes only two values. Instead, we propose to approximate f' by a piecewise-constant function. For example, if we allow 8 different values, we need only 3 bits per element (Fig. 1).

Figure 3: GELU derivative and its approximation q(x|s, y) with five piecewise-constant intervals.

Quantized gradients of activations. In stochastic optimization, if the gradient for a given batch is computed approximately, the optimization may still converge. The GELU derivative (see Fig. 1) is quite "similar" to a piecewise-constant function: for large values of |x| it is almost exactly equal to 0 or 1, and for small values of x a rather interesting transition from 0 to 1 occurs. Instead of calculating the derivative exactly on the backward pass, we approximate it by a piecewise-constant function

q(x|s, y) = Σ_{i=1}^{k} y_i 1[x ∈ [s_i; s_{i+1}]],    (2)

where s = (s_1, ..., s_{k+1}) is a sorted vector of the interval boundaries on which the approximation is constant, y = (y_1, ..., y_k) is the vector of the corresponding approximation values, and 1 denotes the indicator function, which equals 1 whenever its argument is true and 0 otherwise. That is, q(x|s, y) equals y_i when x ∈ [s_i; s_{i+1}]; see Fig. 3 for an illustration. As noted above, if the approximation has k constant intervals, then instead of storing the full input tensor X it is possible to save only log2(k) bits of information per element of the input tensor, which reduces the memory consumption by a factor of 32/log2(k) for single precision. Once the quantization scheme Eq. (2) is given, the drop-in replacement for the activation function f is very straightforward.
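The forward/backward rule induced by Eq. (2) can be sketched in a few lines. The interior boundaries and values below are illustrative placeholders for a 2-bit (k = 4) scheme, not the optimized parameters from the paper; the mechanism (save interval indices, look up y on backward) is the same for any k.

```python
import numpy as np

s_inner = np.array([-2.0, 0.0, 2.0])   # interior boundaries s_2, s_3, s_4 (illustrative)
y = np.array([0.0, 0.1, 0.9, 1.0])     # values y_1..y_4 (illustrative)

def quantize_forward(x):
    """Save only the interval index of each element: 2 bits for k = 4.
    With side='right', idx = i means x lies in [s_i, s_{i+1})."""
    idx = np.searchsorted(s_inner, x, side='right')
    return idx.astype(np.uint8)

def quantized_backward(grad_out, idx):
    """Multiply the incoming gradient by the stored constant y_i
    instead of the exact derivative f'(x)."""
    return grad_out * y[idx]

x = np.array([-3.0, -1.0, 1.0, 3.0])
idx = quantize_forward(x)
g = quantized_backward(np.ones_like(x), idx)
```

The saved index tensor replaces the full-precision input, which is where the memory saving comes from.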
On the forward pass, instead of the full tensor X, we save only the indices of the intervals to which the elements of X belong; on the backward pass, we multiply the incoming gradient not by the actual derivative of f, but by the values from y corresponding to the stored indices. Since the lookup replaces the (often transcendental) derivative computation, the Few-bit layer is not slower than the standard one. However, this result may depend on the specific framework implementation and the GPU used, so in our experiments in Section 4 we do not consider the time gain, assuming that both layers are roughly equally fast, and focus specifically on memory savings.

3. OPTIMAL PIECEWISE-CONSTANT APPROXIMATION

Fig. 1 shows examples of an optimized 3-bit piecewise-constant approximation for several nonlinearity functions. Finding the optimal approximation parameters (the interval boundaries and the values on them) is a challenging task. We propose to find them by minimizing the (weighted) L2 norm of the error. Consider a function f: R → R and its derivative f'. We measure the quality of a piecewise-constant approximation Eq. (2) with a weighted L2 norm:

min_{y,s} L(s, y),   L(s, y) = ∫_R (f'(x) - q(x|s, y))^2 w(x) dx = Σ_{i=1}^{k} ∫_{s_i}^{s_{i+1}} (f'(x) - y_i)^2 w(x) dx,    (3)

where w is a weight function reflecting our prior knowledge of the distribution of the activation function argument. Practical choices of w are the indicator 1[x ∈ [A; B]] (with some reasonable A and B, which should be large enough), which makes the integral in Eq. (3) tractable, or, e.g., the standard normal density. L(s, y) is differentiable with respect to s and y, so optimal piecewise-constant approximations can be found using standard gradient-based optimization techniques. However, the minimization problem Eq. (3) has many local minima that are far from optimal. We suggest using dynamic programming to obtain a good initial approximation that can be further fine-tuned using gradient-based methods (but can also be used as is, because it is very accurate on its own).

Dynamic programming. We assume that the weighting function w is chosen such that w(x) = 0 for x ∉ [A; B]. Consider the following auxiliary value:

DP(t, k) = min_{y_{1:k}, s_{1:k+1}, s.t. s_1 = A, s_{k+1} = t} ∫_A^t (f'(x) - q(x|s, y))^2 w(x) dx,   t ∈ R, k ∈ N.    (5)

Essentially, DP(t, k) is the error of the optimal piecewise-constant approximation of size k of the given function f' on the interval [A; t].
The recurrence for this value is

DP(t, k+1) = min_{t'} [ DP(t', k) + ∫_{t'}^{t} (f'(x) - y(t', t))^2 w(x) dx ],   y(t', t) = ∫_{t'}^{t} w(x) f'(x) dx / ∫_{t'}^{t} w(x) dx,    (6)

since a piecewise-constant approximation of size k+1 consists of a corresponding approximation of size k (first term) plus one constant interval (second term). Here t' is the right boundary of the approximation of size k, and y(t', t) is the optimal constant value on the interval [t'; t], cf. Eq. (8). The minimal value of L(s, y) for size k then equals DP(B, k). To solve the minimization problem Eq. (5), we suggest discretizing t as A = t_0 < t_1 < ... < t_n = B and computing DP(t, k) only at the discretization points:

DP(i, k) = min_j { DP(j, k-1) + T(j, i) },   T(j, i) = ∫_{t_j}^{t_i} (f'(x) - y(j, i))^2 w(x) dx,   y(j, i) = ∫_{t_j}^{t_i} w(x) f'(x) dx / ∫_{t_j}^{t_i} w(x) dx.    (7)

Eq. (7) can be computed in O(n^2 K) time and O(nK) space, as described in detail in Appendix G. Note that this routine has to be run only once, possibly by the framework developers, and then used indefinitely. This means that the number of discretization points n can be taken quite large, easily tens of thousands, which makes the global solution of the discrete problem Eq. (7) very close to the global solution of the original problem Eq. (3). We provide precomputed Few-bit approximations for many different pointwise nonlinearity functions in our implementation at https://github.com/anonymous/repository.
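The discrete dynamic program Eq. (7) can be sketched as follows. This is an illustrative NumPy implementation for f = sigmoid with uniform weight w on [A, B], using the prefix-integral trick of Appendix G so that each T(j, i) is O(1); the grid size n is kept small here, whereas the paper suggests a much larger n since the routine runs only once per activation function.

```python
import numpy as np

A, B, n, K = -8.0, 8.0, 400, 4
t = np.linspace(A, B, n + 1)
mid = 0.5 * (t[:-1] + t[1:])
dx = t[1] - t[0]
fp = np.exp(-mid) / (1.0 + np.exp(-mid)) ** 2   # f'(x) at cell midpoints, f = sigmoid

# Prefix integrals W, FW, F2 (midpoint rule), so T(j, i) costs O(1).
W  = np.concatenate([[0.0], np.cumsum(np.full(n, dx))])
FW = np.concatenate([[0.0], np.cumsum(fp * dx)])
F2 = np.concatenate([[0.0], np.cumsum(fp ** 2 * dx)])

def T(j, i):
    """Error of approximating f' by its optimal constant on [t_j, t_i]."""
    y = (FW[i] - FW[j]) / (W[i] - W[j])
    return F2[i] - F2[j] - y ** 2 * (W[i] - W[j])

dp = np.full((n + 1, K + 1), np.inf)
dp[0, 0] = 0.0
for k in range(1, K + 1):
    for i in range(k, n + 1):      # need at least k grid cells for k intervals
        dp[i, k] = min(dp[j, k - 1] + T(j, i) for j in range(k - 1, i))
# dp[n, K] is the optimal K-interval approximation error on [A, B]
```

Backtracking the argmin at each step (omitted for brevity) recovers the boundaries s and values y themselves.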

4. EXPERIMENTS

The goal of our experiments is not only to show that the Few-bit nonlinearity approach provides memory savings during neural network training without loss of the final model quality. In addition, we want to demonstrate experimentally that this approach does not change the learning dynamics themselves, because in this case its application in practice is almost completely safe: there is a memory gain without loss of speed or quality, and without risk of interference with other training factors under study (hence, no additional search or tuning of other hyperparameters is needed). To achieve this goal, in addition to the main metrics of the trained model (which depend on the specific tasks and benchmarks), we also compare the training-loss and validation-loss curves during the whole training process. As we show below, 1-bit and 2-bit Few-bit approximations are already almost the same as the original nonlinearity layers, and the 3- and 4-bit Few-bit approximations match the original quality of the model. The main analog of our Few-bit approach is the ActNN method; in Section 4.4 we make a detailed comparison with it. The code to reproduce all experiments is available at https://github.com/anonymous/repository, and all hyperparameters for training are presented in Appendix F.

4.1. GLUE benchmark.

In Table 1 we report results for the RoBERTa-base model Liu et al. (2019) on the GLUE benchmark Wang et al. (2019) for standard GELU and 1-, 2-, 3- and 4-bit Few-bit GELU. The 1- and 2-bit versions show minor performance degradation, while 3- and 4-bit GELU show no visible difference and closely match vanilla GELU performance. This can be seen more clearly in the dependence of the metric, averaged across all GLUE tasks, on the number of bits in the Few-bit approximation, shown in Fig. 6. The behaviour of the loss during training is depicted in Fig. 5: the 3- and 4-bit versions are hardly distinguishable from standard GELU.

Table 1: RoBERTa-base on the GLUE benchmark with different quantization budgets. Metric: mean accuracy/correlation (task-specific), averaged across five runs.

           1-bit GELU       2-bit GELU       3-bit GELU       4-bit GELU       Vanilla GELU
stsb       0.906 (±0.002)   0.907 (±0.002)   0.910 (±0.002)   0.909 (±0.002)   0.909 (±0.001)
mnli-mm    0.870 (±0.001)   0.870 (±0.002)   0.871 (±0.002)   0.870 (±0.001)   0.871 (±0.002)
mrpc       0.880 (±0.009)   0.884 (±0.008)   0.884 (±0.007)   0.885 (±0.008)   0.882 (±0.005)
cola       0.595 (±0.016)   0.580 (±0.014)   0.596 (±0.015)   0.607 (±0.014)   0.604 (±0.013)
mnli       0.873 (±0.001)   0.872 (±0.002)   0.874 (±0.001)   0.874 (±0.002)   0.874 (±0.001)
sst2       0.939 (±0.003)   0.938 (±0.003)   0.941 (±0.004)   0.941 (±0.003)   0.943 (±0.002)
rte        0.752 (±0.021)   0.756 (±0.023)   0.780 (±0.014)   0.771 (±0.025)   0.771 (±0.017)
qqp        0.914 (±0.001)   0.915 (±0.000)   0.916 (±0.001)   0.916 (±0.001)   0.916 (±0.001)
qnli       0.925 (±0.002)   0.925 (±0.002)   0.926 (±0.002)   0.927 (±0.002)   0.927 (±0.002)

4.2. RuDALL-E. In Fig. 4 we present the training dynamics of ruDALL-E Malevich fine-tuned with Few-bit GELU; the model operates on image tokens obtained by encoding the input image using Sber-VQGAN. Few-bit backward for ruDALL-E Malevich shows the same behaviour as for the RoBERTa-base architecture: the 1- and 2-bit versions, although coping with training perfectly well, demonstrate minor performance degradation, while the 3- and 4-bit versions are indistinguishable from the original GELU.
4.3. ResNet architecture. We trained a ResNet18 model on ImageNet using ffcv Leclerc et al. (2022), with ReLU replaced by the GELU, Swish and SiLU nonlinearity functions. Graphs for the Swish nonlinearity can be seen in Fig. 8, and graphs for the other nonlinearities in Fig. 13 in Appendix F: 1- and 2-bit versions show a minor performance drop, while 3- and 4-bit versions are on par with the standard nonlinearity.

4.4. ActNN. As a baseline, we use another quantization scheme, ActNN Chen et al. (2021). It works in a much wider range of situations, as it can quantize not only pointwise nonlinearity layers but also all kinds of linear layers (convolutional and dense), normalization layers, and pooling layers. Without going deep into details, ActNN divides the saved tensor H into chunks h_i of equal size G. Then, given a quantization budget of b bits, each chunk h_i is normalized,

u_i = 2^b (h_i - min{h_i}) / (max{h_i} - min{h_i}),

and its randomly quantized version ū_i is saved, where ū_i = ⌈u_i⌉ with probability u_i - ⌊u_i⌋ and ū_i = ⌊u_i⌋ otherwise. Random rounding is performed in order to guarantee that the quantization is unbiased. For each group, the two additional values min{h_i} and max{h_i} are saved as well, but for a group size of G = 256 this is only 0.125 additional bits per element, which we ignore in the following tests. ActNN by construction does not take into account the global behaviour of the nonlinearity derivative. We argue that for nonlinearity layers this is crucial, and thus our preoptimized quantization scheme is preferable. To confirm this, we study ActNN's behaviour on the QQP task from the GLUE benchmark with respect to different quantization budgets and compare it with our method (Fig. 9 and Table 2). In general, our method with a 1 bit smaller budget works the same as or better than ActNN, which is very important in the low-bit setting. In Fig. 10 we compare ActNN and Few-bit for the ResNet18 architecture on the ImageNet dataset for the SELU nonlinearity; results for the GELU and Swish nonlinearities can be found in Fig. 14 in Appendix F.
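The per-group min-max quantization with stochastic rounding used by ActNN can be sketched as follows. Note one assumption in this sketch: we scale by 2^b - 1 levels (so both the minimum and maximum are exactly representable), a common implementation choice, whereas the formula above writes the scale as 2^b.

```python
import numpy as np

rng = np.random.default_rng(0)

def actnn_quantize(h, b=2):
    """Min-max normalize a group h to b bits and round stochastically:
    ceil with probability equal to the fractional part, floor otherwise,
    so the quantization is unbiased in expectation."""
    lo, hi = float(h.min()), float(h.max())
    u = (2 ** b - 1) * (h - lo) / (hi - lo)       # scale into [0, 2^b - 1]
    frac = u - np.floor(u)
    q = np.floor(u) + (rng.random(h.shape) < frac)
    return q.astype(np.uint8), lo, hi             # lo/hi stored per group

def actnn_dequantize(q, lo, hi, b=2):
    return lo + q.astype(np.float64) * (hi - lo) / (2 ** b - 1)

h = rng.normal(size=256)                          # one group of G = 256 elements
q, lo, hi = actnn_quantize(h)
h_hat = actnn_dequantize(q, lo, hi)
```

Because the grid adapts to min/max of each group, the scheme is data-dependent, in contrast to our fixed, derivative-aware quantization.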
Aggregated top-1 accuracy for all activation functions is presented in Fig. 7. Our method steadily outperforms ActNN, which is especially noticeable in the 1-bit regime: ActNN experiences a strong drop in accuracy, while Few-bit backward stays much closer to the standard nonlinearities. This means that 1-bit Few-bit backward can be used in cases where it is very important to reduce the memory consumed by a neural network.

Table 2: Accuracy on the QQP task from the GLUE benchmark for ActNN and Few-bit (Our), averaged across 5 runs. The Few-bit approach is better for each memory budget.

         ActNN              Our
1-bit    0.8880 (±0.0008)   0.9080 (±0.0006)
2-bit    0.9072 (±0.0005)   0.9097 (±0.0006)
3-bit    0.9106 (±0.0003)   0.9114 (±0.0007)
4-bit    0.9113 (±0.0006)   0.9114 (±0.0005)

5. RELATED WORK

One line of work ( 2015) assumes some internal structure of the model weights and saves memory by explicitly exploiting this structure with low-rank methods from linear algebra. Low-precision learning and low-precision optimizers focus on using lower-precision floats to store weights, optimizer parameters, and model gradients. All of these approaches are complementary to the proposed one and can be used together. Checkpointing methods Beaumont et al. (2019; 2021); Chen et al. (2016) save memory at the cost of more computation: they store fewer activations and recompute the rest from the saved checkpoints. Offloading methods Beaumont et al. (2020) send the saved activations to the host RAM and load them back into GPU memory on the backward pass, which also saves GPU memory at the cost of host-device communication time. ActNN Chen et al. (2021) is a framework for quantizing stored activations adaptively on the fly. In contrast to our work, it allows quantizing not only pointwise activation layers but also many others, including convolutional, normalization, and linear layers.
However, this method depends on the distribution of the elements of the quantized tensors and, because of that, its performance may degrade. Our approach, on the other hand, uses a data-agnostic optimal quantization, which in practice turns out to be sufficient and easier to use.

6. CONCLUSION

We have proposed a method to reduce memory consumption during the training of deep neural network models by storing less information for the backward pass in the element-wise activation functions. For effective training, there is no need to calculate the derivative of the activation functions precisely: a piecewise-constant approximation is sufficient. This makes it possible to save, at each application of the activation function, not the entire input tensor but only the interval indices of the piecewise-constant approximation. Experiments show that for a wide class of models and problems, storing only 3 bits of information per tensor element does not degrade the learning quality and saves about 20 percent of memory. We have proposed an efficient algorithm for constructing an optimal piecewise-constant approximation. The proposed drop-in replacements for popular activation functions (ReLU, GELU, Swish, Sigmoid and others) do not depend on the neural network model, the problem to be solved, or the peculiarities of the data distribution. The replacement of the original activation functions can be performed at any training stage (both for models trained from scratch and for pre-trained models that are subsequently fine-tuned) and does not require any changes in the training pipelines. An efficient CUDA implementation of the proposed method, together with pre-computed piecewise-constant approximations for many popular activation functions, is available for PyTorch in our GitHub repository.

A

B DETAILED MEMORY MEASUREMENTS FOR DIFFERENT MODELS

We provide memory measurements for different model architectures in the table below. The training run referenced here used weight decay 0.2, betas (0.9, 0.98), eps 1e-6, and gradient checkpointing (24 segments), and was trained for 6 h using 1xA100.

E COMBINATION OF ACTNN AND FEWBIT

The ActNN method is more general and can be applied to a broader class of layers, while our method focuses on one class only: pointwise nonlinearities. When this is not enough and more memory saving is required, it is possible to combine the two methods: use Fewbit for the pointwise nonlinearities and ActNN for everything else. Such a combination should work better than pure ActNN, since Fewbit works better than ActNN for pointwise nonlinearity layers. To check this hypothesis, we train ResNet18 on the CIFAR10 dataset. We replace the standard ReLU nonlinearity with GELU, compress all layers except GELU with 4-bit ActNN (2-bit ActNN compresses too aggressively and the model diverges), and compress the GELU layers with either 2-bit ActNN or 2-bit Fewbit. The results are shown in Fig. 12.

G DYNAMIC PROGRAMMING

It is easy to see that for given s, the optimal value of y for L(s, y) in Eq. (3) is

y_i(s) = ∫_{s_i}^{s_{i+1}} w(x) f'(x) dx / ∫_{s_i}^{s_{i+1}} w(x) dx.    (8)

Consider Eq. (7): both y(j, i) and T(j, i) can be computed in advance, analytically (when possible) or numerically, from the corresponding one-dimensional integrals. After that, the full array DP(i, k) can be computed in O(n^2 K) time and O(n^2) space, where K is the required number of constant intervals in the approximation Eq. (2). Note that this optimization has to be performed only once, so n can be chosen quite large, and the result will be very close to the global minimum. The space complexity can be reduced to O(n) by introducing three auxiliary prefix arrays,

F2(i) = ∫_A^{t_i} f'(x)^2 w(x) dx,   W(i) = ∫_A^{t_i} w(x) dx,   FW(i) = ∫_A^{t_i} f'(x) w(x) dx,    (9)

and rewriting Eq. (7) as

y(j, i) = (FW(i) - FW(j)) / (W(i) - W(j)),   T(j, i) = F2(i) - F2(j) - y(j, i)^2 (W(i) - W(j)).

Ultimately, only O(n) one-dimensional integrals have to be stored, and everything else can be evaluated in O(1) time on the spot. The one-dimensional integrals themselves can be computed numerically in O(n) time and space as well:

F2(i + 1) = F2(i) + ∫_{t_i}^{t_{i+1}} f'(x)^2 w(x) dx.    (10)

Numerical results. In Fig. 1 we provide 3-bit examples for popular activation functions obtained with the described method; more Few-bit approximations can be seen in Fig. 11. In Table 3 we provide numerical values of the error Eq. (3).
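The prefix-array identity of Eq. (9) can be checked numerically against the direct integral definition of T(j, i); this sketch uses f = sigmoid and uniform weight w on a midpoint-rule grid (an assumed discretization, matching the one used for the DP itself). It also verifies that the weighted mean of Eq. (8) is indeed the minimizer of the per-interval cost.

```python
import numpy as np

n = 1000
t = np.linspace(-6.0, 6.0, n + 1)
mid = 0.5 * (t[:-1] + t[1:])
dx = t[1] - t[0]
fp = np.exp(-mid) / (1.0 + np.exp(-mid)) ** 2   # f' for f = sigmoid

# Prefix integrals of Eq. (9), midpoint rule, uniform weight w = 1.
W  = np.concatenate([[0.0], np.cumsum(np.full(n, dx))])
FW = np.concatenate([[0.0], np.cumsum(fp * dx)])
F2 = np.concatenate([[0.0], np.cumsum(fp ** 2 * dx)])

def T_prefix(j, i):
    """T(j, i) via the expanded-square identity: O(1) per query."""
    y = (FW[i] - FW[j]) / (W[i] - W[j])
    return F2[i] - F2[j] - y ** 2 * (W[i] - W[j])

def T_direct(j, i):
    """T(j, i) by direct summation of (f' - y)^2 over the interval."""
    y = np.sum(fp[j:i] * dx) / ((i - j) * dx)
    return np.sum((fp[j:i] - y) ** 2 * dx)

def cost(j, i, y):
    """Per-interval cost as a function of the constant value y."""
    return np.sum((fp[j:i] - y) ** 2 * dx)
```

Both routines agree up to floating-point rounding, while the prefix version turns the O(n) integral into an O(1) lookup, which is what makes the O(n^2 K) dynamic program practical.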



Footnotes:
- Implementation is taken from https://github.com/sberbank-ai/ru-dalle
- Implementation is taken from https://github.com/VKCOM/YouTokenToMe
- Implementation is taken from https://github.com/sberbank-ai/sber-vq-gan
- Source code repository can be found at https://github.com/anonymous/repository
- huggingface.co
- https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py
- https://github.com/libffcv/ffcv-imagenet



Figure 1: Examples of 3-bit approximations for derivatives of popular nonlinearities: GELU, SELU, and Sigmoid.

Figure 4: Dynamics of the loss values when fine-tuning ruDALL-E Malevich with Few-bit GELU activations.

Figure 5: RoBERTa-base on QQP task from GLUE benchmark, averaged across 10 runs. (a): Training loss. (b): Training loss zoomed into the last third of the training. (c): Validation loss.

ResNet architecture. We trained a ResNet18 model He et al. (2016) on the ImageNet benchmark Russakovsky et al. (2015) using ffcv Leclerc et al. (2022).

Figure 8: ResNet18 with ReLU replaced with Swish nonlinearity trained on Imagenet. (a): Training loss. (b): Training loss zoomed into the last third of the training. (c): Final validation top-1 accuracy. All graphs are averaged across three runs with different seeds. Error bars denote minimum and maximum values.

Figure 9: Comparison of RoBERTa-base on QQP task from GLUE benchmark with ActNN quantization and Few-bit approximation. Averaged across ten runs. (a): Training loss. (b): Training loss zoomed into the last third of the training. (c): Validation loss.

Figure 11: 1- to 4-bit approximations of popular nonlinearity layers.

Figure 12: ResNet18 on the CIFAR10 dataset. All ReLUs are replaced with GELU. All layers except the pointwise nonlinearities compress their activations saved for backward with 4-bit ActNN. The GELUs compress their activations saved for backward with either 2-bit ActNN (orange) or 2-bit Fewbit (blue). ResNet18 without any compression is depicted in green. (a): Training loss and accuracy for the whole training course. (b): Training loss and accuracy zoomed to the last half of the training course. ActNN + Fewbit for pointwise nonlinearities works slightly better than pure ActNN.


On the backward pass, the gradient is multiplied not by the actual derivative of f, but by the values from y corresponding to the stored indices. Pseudocode is presented in Alg. 1.

Speed of the Few-bit approximation. The memory gain of a Few-bit layer does not come at the cost of speed. The standard nonlinearity layer calculates the activation function on the forward pass and the activation-function gradient on the backward pass. The gradient usually involves transcendental functions such as the exponent, erf, and others. The Few-bit version of the layer also calculates the activation function on the forward pass, but the gradient calculation on the backward pass is replaced by one binary search and one lookup in the value table (see Alg. 1). Our efficient implementation of this procedure using CUDA kernels runs several percent faster than the standard nonlinearity layer.

We have tested two of the most important and commonly used neural network architecture families: convolutional neural networks and transformer-based networks. We use standard, popular open-source benchmarks with open hyperparameters for training in order to demonstrate the behavior of the Few-bit approach under drop-in replacement of the standard nonlinearities, without any hyperparameter optimization or specially selected training conditions. In Section 4.1, we test the RoBERTa transformer-based neural network on the GLUE Wang et al. (2019) benchmark, which includes 9 different NLP tasks. In Section 4.2, we test the training of the generative ruDALL-E model in the task of modeling the joint distribution of text and image tokens on the Russian Emoji dataset. We use the GELU nonlinearity for both transformer architectures, as it is the main nonlinearity function used in such models. In Section 4.3, we test the classical ResNet18 architecture on the ImageNet dataset using the open benchmark ffcv Leclerc et al. (2022). In the classical ResNet architecture, we replace all ReLU nonlinearities with one of GELU, SELU, or Swish to demonstrate that the Few-bit approach works with a wide range of different popular activation functions.

We present the training dynamics of the ruDALL-E Malevich Ramesh et al. (2021) model on the Russian Emoji dataset in Fig. 4. The dataset Shonenkov et al. (2021) contains 2749 unique emoji icons and 1611 unique texts that were collected by web scraping (the difference in counts is due to sets within which emojis differ only in color; moreover, some elements are homonyms in Russian). ruDALL-E Malevich is a large multimodal pretrained transformer that learns the conditional distribution of images given a string of text (more precisely, it autoregressively models the text and image tokens as a single stream of data). The ruDALL-E Malevich encoder is a 24-layer Transformer Vaswani et al. (2017) with 16 attention heads, a hidden dimension of 2048, and the standard GELU nonlinearity; in total it has 1.3B parameters. It works with 128 text tokens, which are prepared from the text input using the YTTM tokenizer, and 1024 image tokens, which are obtained by encoding the input image using Sber-VQGAN.

The table in Appendix B reports the following quantities. "Model size" is the total memory used for storing model parameters (without model gradients and optimizer statistics). "All activations size" is the total memory used by tensors saved for the backward pass. "Nonlinearity activations size" is the part of all activations used only by the nonlinearity layers. "Percentage saving" is the memory saved on all activations by our method compared to full-precision nonlinearities, and the percentage value in the "Maximum batch size" row is the increase in batch size achievable by using our method compared to full-precision nonlinearities, under ideal circumstances. The maximum batch size is calculated under the assumption that four copies of the model are stored on the device (model parameters, model gradients, and optimizer statistics such as the two moments stored by the Adam optimizer) for a GPU with 32 GB of memory.

