RE-PARAMETERIZING YOUR OPTIMIZERS RATHER THAN ARCHITECTURES

Abstract

The well-designed structures in neural networks reflect the prior knowledge incorporated into the models. However, though different models have various priors, we are used to training them with model-agnostic optimizers such as SGD. In this paper, we propose to incorporate model-specific prior knowledge into optimizers by modifying the gradients according to a set of model-specific hyper-parameters. Such a methodology is referred to as Gradient Re-parameterization, and the optimizers are named RepOptimizers. For extreme simplicity of model structure, we focus on a VGG-style plain model and showcase that such a simple model trained with a RepOptimizer, which is referred to as RepOpt-VGG, performs on par with or better than recent well-designed models. From a practical perspective, RepOpt-VGG is a favorable base model because of its simple structure, high inference speed and training efficiency. Compared to Structural Re-parameterization, which adds priors into models via constructing extra training-time structures, RepOptimizers require no extra forward/backward computations and solve the problem of quantization. We hope to spark further research beyond the realm of model structure design. Code and models are available at https://github.com/DingXiaoH/RepOptimizers.

1. INTRODUCTION

The structural designs of neural networks are prior knowledge¹ incorporated into the model. For example, modeling the feature transformation as a residual addition (y = x + f(x)) outperforms the plain form (y = f(x)) (He et al., 2016), and ResNet incorporated such prior knowledge into models via shortcut structures. Recent advancements in structures have demonstrated that high-quality structural priors are vital to neural networks; e.g., EfficientNet (Tan & Le, 2019) obtained a set of structural hyper-parameters via architecture search and compound scaling, which served as the prior knowledge for constructing the model. Naturally, better structural priors result in higher performance. Besides structural designs, the optimization methods are also important. They include 1) first-order methods such as SGD (Robbins & Monro, 1951) and its variants (Kingma & Ba, 2014; Duchi et al., 2011; Loshchilov & Hutter, 2017), heavily used with ConvNets, Transformers (Dosovitskiy et al., 2020) and MLPs (Tolstikhin et al., 2021; Ding et al., 2022); 2) high-order methods (Shanno, 1970; Hu et al., 2019; Pajarinen et al., 2019), which calculate or approximate the Hessian matrix (Dennis & Moré, 1977; Roosta-Khorasani & Mahoney, 2019); and 3) derivative-free methods (Rios & Sahinidis, 2013; Berahas et al., 2019) for cases where the derivatives may not exist (Sun et al., 2019). We note that 1) though the advanced optimizers improve the training process in different ways, they have no prior knowledge specific to the model being optimized; 2) though we keep incorporating



¹ Prior knowledge refers to all information about the problem and the training data (Krupka & Tishby, 2007). Since we have not encountered any data sample while designing the model, the structural designs can be regarded as inductive biases (Mitchell, 1980), which reflect our prior knowledge.
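To make the contrast between a model-agnostic optimizer and a model-specific one concrete, the following is a minimal sketch of the gradient-modification idea described above: a plain SGD step versus an update whose gradients are rescaled element-wise by constant, parameter-specific multipliers. The function names, the multiplier values, and the toy quadratic objective are hypothetical illustrations, not the paper's exact RepOptimizer recipe.

```python
import numpy as np

def sgd_step(params, grads, lr):
    # Standard model-agnostic SGD update: no knowledge of the model.
    return [p - lr * g for p, g in zip(params, grads)]

def repopt_step(params, grads, grad_scales, lr):
    # Gradient Re-parameterization sketch: each gradient is modified by
    # a constant, model-specific scale before the ordinary SGD update.
    return [p - lr * (s * g) for p, g, s in zip(params, grads, grad_scales)]

# Toy objective f(w) = 0.5 * ||w||^2, so grad = w.
w = [np.array([1.0, -2.0])]
g = [w[0].copy()]
scales = [np.array([2.0, 0.5])]  # hypothetical prior: treat dimensions differently

w_plain = sgd_step(w, g, lr=0.1)
w_rep = repopt_step(w, g, scales, lr=0.1)
print(w_plain[0])  # [ 0.9 -1.8]
print(w_rep[0])    # [ 0.8 -1.9]
```

Both updates traverse the same loss surface; only the per-parameter gradient scales differ, which is where the structural prior would enter.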

