RE-PARAMETERIZING YOUR OPTIMIZERS RATHER THAN ARCHITECTURES

Abstract

The well-designed structures in neural networks reflect the prior knowledge incorporated into the models. However, though different models have various priors, we usually train them with model-agnostic optimizers such as SGD. In this paper, we propose to incorporate model-specific prior knowledge into the optimizer by modifying the gradients according to a set of model-specific hyper-parameters. We refer to this methodology as Gradient Re-parameterization and to the resulting optimizers as RepOptimizers. Pursuing extreme simplicity of model structure, we focus on a VGG-style plain model and show that such a simple model, trained with a RepOptimizer and referred to as RepOpt-VGG, performs on par with or better than recent well-designed models. From a practical perspective, RepOpt-VGG is a favorable base model because of its simple structure, high inference speed and training efficiency. Compared to Structural Re-parameterization, which adds priors into models by constructing extra training-time structures, RepOptimizers require no extra forward/backward computations and solve the problem of quantization. We hope this work sparks further research beyond the realm of model structure design. Code and models are available at https://github.com/DingXiaoH/RepOptimizers.

1. INTRODUCTION

The structural designs of neural networks embody the prior knowledge incorporated into the model. For example, modeling the feature transformation as a residual addition (y = x + f(x)) outperforms the plain form (y = f(x)) (He et al., 2016), and ResNet incorporated such prior knowledge into models via shortcut structures. Recent advancements in structures have demonstrated that high-quality structural priors are vital to neural networks; e.g., EfficientNet (Tan & Le, 2019) obtained a set of structural hyper-parameters via architecture search and compound scaling, which served as the prior knowledge for constructing the model. Naturally, better structural priors result in higher performance.

Besides structural designs, optimization methods are also important. They include 1) first-order methods such as SGD (Robbins & Monro, 1951) and its variants (Kingma & Ba, 2014; Duchi et al., 2011; Loshchilov & Hutter, 2017), heavily used with ConvNets, Transformers (Dosovitskiy et al., 2020) and MLPs (Tolstikhin et al., 2021; Ding et al., 2022); 2) high-order methods (Shanno, 1970; Hu et al., 2019; Pajarinen et al., 2019), which calculate or approximate the Hessian matrix (Dennis & Moré, 1977; Roosta-Khorasani & Mahoney, 2019); and 3) derivative-free methods (Rios & Sahinidis, 2013; Berahas et al., 2019) for cases where the derivatives may not exist (Sun et al., 2019). We note that 1) though the advanced optimizers improve the training process in different ways, they carry no prior knowledge specific to the model being optimized; and 2) though we keep incorporating our up-to-date understandings into models by designing advanced structures, we still train them with model-agnostic optimizers like SGD (Robbins & Monro, 1951) and AdamW (Loshchilov & Hutter, 2017). To explore another approach, we make the following two contributions. 1) A methodology for incorporating prior knowledge into a model-specific optimizer.
We focus on non-convex models like deep neural networks, so we only consider first-order gradient-based optimizers such as SGD and AdamW. We propose to incorporate the prior knowledge by modifying the gradients according to a set of model-specific hyper-parameters before updating the parameters. We refer to this methodology as Gradient Re-parameterization (GR) and to the optimizers as RepOptimizers. This methodology differs from other methods that introduce extra parameters (e.g., adaptive learning rates (Kingma & Ba, 2014; Loshchilov & Hutter, 2017)) into the training process in that we re-parameterize the training dynamics according to hyper-parameters derived from the model structure, not statistics obtained during training (e.g., the moving averages recorded by Momentum SGD and AdamW). 2) A favorable base model. To demonstrate the effectiveness of incorporating prior knowledge into the optimizer, we naturally use a model without careful structural designs. We choose a VGG-style plain architecture with only a stack of 3×3 conv layers. It is even simpler than the original VGGNet (Simonyan & Zisserman, 2014) (which has max-pooling layers), and has long been considered inferior to well-designed models like EfficientNets, since the latter have richer structural priors. Impressively, such a simple model trained with a RepOptimizer, which we refer to as RepOpt-VGG, performs on par with or better than the well-designed models (Table 3). We highlight the novelty of our work through a comparison with RepVGG (Ding et al., 2021). We adopt RepVGG as a baseline because it also produces powerful VGG-style models, but with a different methodology. Specifically, targeting a plain inference-time architecture, referred to as the target structure, RepVGG constructs extra training-time structures and converts them afterwards into the target structure for deployment. The differences are summarized as follows (Fig. 1).
1) Similar to regular models like ResNet, RepVGG also adds priors into the model with well-designed structures and uses a generic optimizer, whereas RepOpt-VGG adds priors into the optimizer. 2) Though the converted RepVGG has the same inference-time structure as RepOpt-VGG, the training-time RepVGG is much more complicated and consumes more time and memory to train. In other words, RepOpt-VGG is a truly plain model during training, but RepVGG is not. 3) We extend and deepen Structural Re-parameterization (Ding et al., 2021), which improves the performance of a model by changing the training dynamics via extra structures. We show that changing the training dynamics with an optimizer has a similar effect but is more efficient. Note that we design the behavior of the RepOptimizer following RepVGG simply for a fair comparison; other designs may work as well or better. From a broader perspective, we present a VGG-style model and an SGD-based RepOptimizer as an example, but the idea may generalize to other optimization methods or models, e.g., RepGhostNet (Chen et al., 2022) (Appendix D). From a practical standpoint, RepOpt-VGG is also a favorable base model, featuring both efficient inference and efficient training. 1) As an extremely simple architecture, it has low memory consumption and a high degree of parallelism (one big operator is more efficient than several small operators with the same FLOPs (Ma et al., 2018)), and it greatly benefits from highly optimized 3×3 conv implementations (e.g., the Winograd algorithm (Lavin & Gray, 2016)). Better still, as the model comprises only one type of operator, we may integrate many 3×3 conv units onto a customized chip for even higher efficiency (Ding et al., 2021).
2) Efficient training is of vital importance in application scenarios where computing resources are limited or where we desire fast delivery or rapid iteration of models; e.g., we may need to re-train the models every few days with recently collected data. Table 2 shows that the training speed of RepOpt-VGG is around 1.8× that of RepVGG. As with inference, such simple models may be trained more efficiently with customized high-throughput training chips than a complicated model trained on general-purpose devices like GPUs. 3) Beyond training efficiency, RepOptimizers overcome a major weakness of Structural Re-parameterization: the problem of quantization. The inference-time RepVGG is difficult to quantize via Post-Training Quantization (PTQ): with simple INT8 PTQ, the accuracy of RepVGG on ImageNet (Deng et al., 2009) drops to 54.55%. We will show that RepOpt-VGG is quantization-friendly and reveal that the difficulty of quantizing RepVGG results from the structural transformation of the trained model. RepOptimizers naturally solve this problem, as RepOpt-VGG undergoes no structural transformations at all.
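To make the PTQ failure mode concrete, the following is a minimal, hypothetical sketch of symmetric per-tensor INT8 post-training quantization; it is not the paper's actual quantization pipeline. It illustrates why a heavy-tailed weight distribution, of the kind the paper attributes to the structural transformation of trained RepVGG models, inflates quantization error: a single outlier stretches the quantization scale, so all other weights are represented more coarsely. All names and values here are illustrative.

```python
import random

def int8_ptq(ws):
    """Symmetric per-tensor INT8 PTQ: pick the scale from max |w|,
    round to integers in [-127, 127], then dequantize."""
    scale = max(abs(w) for w in ws) / 127.0
    return [max(-127, min(127, round(w / scale))) * scale for w in ws]

def mean_abs_err(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
w = [random.gauss(0.0, 0.01) for _ in range(4096)]  # well-behaved weights
w_outlier = w + [1.0]                               # one large outlier value

err_plain = mean_abs_err(int8_ptq(w), w)
err_outlier = mean_abs_err(int8_ptq(w_outlier), w_outlier)
# The outlier inflates the per-tensor scale, so every other weight falls
# into far fewer quantization bins and the mean error grows sharply.
```

In this toy setting, `err_outlier` is far larger than `err_plain` even though only one value changed, which mirrors how a merged, transformed model can become hard to quantize while a plainly trained one is not.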
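Stepping back to the central mechanism, the Gradient Re-parameterization idea described above can be sketched in its simplest conceivable form: multiply each parameter's gradient by a constant, model-specific scale before the SGD update. This is only an illustrative sketch under that assumption; the toy quadratic loss, the `rep_sgd_step` helper, and the scale values are hypothetical and do not reproduce the paper's actual derivation of the scales from the model structure.

```python
def rep_sgd_step(params, grads, scales, lr):
    """One SGD step where each gradient is re-parameterized by a constant
    scale derived from the model structure (a hyper-parameter), rather
    than from statistics accumulated during training."""
    return [p - lr * s * g for p, g, s in zip(params, grads, scales)]

# Toy loss L(w) = sum(w_i^2), so grad_i = 2 * w_i (minimum at w = 0).
params = [1.0, -2.0]
scales = [1.0, 0.25]  # hypothetical model-specific constants
lr = 0.1
for _ in range(100):
    grads = [2.0 * p for p in params]
    params = rep_sgd_step(params, grads, scales, lr)
# Both parameters still head toward the minimum; the constant scales only
# change each parameter's effective step size, i.e., the training dynamics.
```

The key design point, as stated in the introduction, is that the `scales` are fixed functions of the model structure known before training begins, unlike the adaptive, statistics-driven factors maintained by Momentum SGD or AdamW.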

Funding: This work was partly done during the authors' internships at MEGVII Technology.

